Building a Private AI Tutor for a Microsoft Learn Course

In this article, we will look into one of my recent projects, which is the creation of a specialized private AI tutor for a Microsoft Learn course for the Power BI certification. Given the results of the job market analysis discussed in my most recent article and the fact that I already wanted to refresh my Power BI knowledge, this is the perfect opportunity for this project.  The overall aim was to provide an interface where users could ask questions and receive accurate and relevant responses based on the content of the course. We will now look into the details of the implementation.

Data Collection

To build the knowledge base for our AI tutor, we will utilize Python along with libraries like Selenium and BeautifulSoup to scrape all necessary information from the web. By converting each section of the course into individual text files, we can create a comprehensive database to serve as our tutor’s knowledge repository. Thereby, we have to utilize Selenium, which allows us to retrieve HTML code that is dynamically generated by JavaScript. We basically retrieve the raw HTML of each page and extract the actual content with the help of BeautifulSoup. Following, we clean the data by removing links to quizzes and interactive exercises in the course and splitting up the raw text into individual chapters by using the heading structure of the HTML.

This follows the same logic as the PDF conversion pipeline that was used for the ChatGPT flashcard generator and document summarizer, where the headings are used to split up a larger file into chapters. This semantically correct split of the text is important for the following steps and improves our results. Running our scraping pipeline results in the retrieval of 196 pages of course content, which are split into 528 text files that ultimately serve as our knowledge base.

Implementing the AI tutor

The core of our AI tutor is built with the LangChain library, which is a framework that facilitates building applications with large language models (LLMs) like those from the GPT model family. To enable our AI tutor to effectively process and retrieve information, we need to load and prepare the data before we can make an inference that returns an answer to our questions. This pipeline follows a step-by-step process, which is as follows:

  1. Loading the documents
  2. Creating word embeddings
  3. Storing the embeddings in a vector database
  4. Retrieving documents with a similarity search
  5. Passing the retrieved documents to a LLM

Steps 1 to 3 are only done once at the beginning, and steps 4 and 5 are repeated for every question that we ask.

Loading Text and Storing Vectors

Loading the documents is straightforward and done with LangChain’s DirectoryLoader class. Thereby, the functionality of the class basically iterates over the files in a directory and stores them in a list. The more interesting part happens when we convert the raw text documents into embeddings in the next step. Embeddings allow for the creation of a vectorized representation of text, enabling us to analyze and compare pieces of text in the vector space. This is particularly valuable for conducting semantic searches, as it allows us to identify texts that are most similar based on their vector representations. For our project, we use the OpenAIEmbeddings class also from LangChain, which serves as our access point to OpenAI’s embedding service. The last data preparation step consists of storing the embeddings in a vector database, which allows for efficient retrieval. For our project, we use Faiss, a vector database developed by Meta.

Querying Process and Answer Generation

Once our database is ready, we can input questions or queries into our AI tutoring system. Utilizing similarity search algorithms, the vector database retrieves relevant documents based on query similarities. This retrieval process enables quick access to specific sections of course material that may hold answers or explanations related to our inquiries. Upon retrieving relevant documents from our knowledge base, we can then employ a GPT model to formulate accurate answers in response to user queries. The LLM ensures that we receive high-quality and contextually relevant responses tailored specifically to our questions. For better comprehension, we will now walk through the process step by step.

First, we set up the question, which will be “How can I share reports with Power BI?” or in German “Wie kann ich mit Power BI Berichte teilen?”. We first use this question for a similarity search in the vector database, which retrieves the following four documents:

You can see that three of the four documents are very relevant to our question about sharing reports, which means that the similarity search worked well. Here is also the proof that our initial semantically correct splitting of the documents paid off, because otherwise we would retrieve a lot of noise from the database. For example, a common strategy for this type of application if we do not have a structure to build from is to use a fuzzy approach where a large document is split at defined, overlapping word intervals. The LLM that interprets the retrieved documents can certainly handle this, but it is always better to clean the input data as much as possible to improve the accuracy and stability of the output.

As our last step, these four documents are now passed to the GPT model with the instructions to formulate an answer to our question based on the context of the four documents. LangChain handles the prompt creation and the API call for inference, and we get the following answer to our question:

The answer is pretty solid and very relevant to our question. Another very useful feature is that the hallucinations of the LLM are kept to a minimum with this approach. Hallucinations are incorrect answers that the model comes up with. For example, if we ask the model, “What is a squirrel-cage rotor?” or in German, “Was ist ein Kurzschlussläufer?” we get the following answer:

Conclusion

The development of the specialized private AI tutor for the Microsoft Azure Power BI course introduces an innovative way of engaging with educational content. By leveraging the power of Python, web scraping techniques, and advanced libraries like LangChain and GPT models, we successfully built an intelligent tutoring system capable of providing personalized and accurate responses. The biggest advantage of this type of system is definitely the increased speed of information retrieval. We basically built a specialized search engine that dynamically compiles the search results in the most targeted and accurate fashion possible, providing a direct answer to a question.

Overall, LangChain and similar frameworks provide a novel approach to utilizing state-of-the-art LLMs and have spawned a new wave of sophisticated AI applications. The system that we built here with a few lines of code can be leveraged in a lot of different scenarios. For example, the most recent generation of customer support chatbots are probably all running on the same or at least a similar pipeline. A knowledge base is set up with very relevant information about a company’s products or services, and customer questions are used the same way as we used for our AI tutor. The same system can also be used for company internal knowledge, interacting with user manuals, online documentation, or research papers. Although the responses are not always perfect and one should always check the source, which can be printed with the answer, it feels like we are moving into an era of real AI integration into a vast number of applications. The code for this project can be found on GitHub.