NLP – BigDataTime

Topic Modelling with LLMs

Marco Schweiss — Fri, 01 Dec 2023 18:15:36 +0000

The field of natural language processing has witnessed revolutionary advancements with the introduction of large language models (LLMs) such as ChatGPT and GPT-4. This has prompted an exploration of how these LLMs can be utilized in topic modeling, an essential technique for comprehending extensive textual corpora. The aim of this project was to examine the trade-offs between traditional statistical algorithms and LLMs for topic modeling. Additionally, the goal was to develop a new approach for topic modeling utilizing the GPT model family, while providing recommendations on when to employ each method. A thorough literature review was conducted to establish the theoretical foundation.

Subsequently, an experiment was designed to compare the two approaches across three diverse datasets containing different types of documents, namely news, abstracts and tweets. Topic coherence and diversity metrics were employed for quantitative assessment, complemented by qualitative evaluation through GPT-4. Traditional models were found to outperform the LLM-based approach on the news and abstract datasets. Conversely, both qualitative and quantitative analysis demonstrated that the novel LLM approach yielded superior results on the tweet dataset.

It was determined that traditional topic models are better suited for larger datasets, providing a higher-level overview of topics. In contrast, LLM-based approaches excel at generating granular and interpretable topics from smaller datasets. These findings offer practical recommendations regarding strategy selection in topic modeling applications.

To read the full article please refer to this PDF file.

Discovering In-Demand Skills in Austria’s Data Science Job Market

Marco Schweiss — Sun, 29 Oct 2023 10:35:34 +0000

In today’s rapidly evolving world of data science, staying ahead of the curve and understanding market demand is crucial. Aspiring data scientists often wonder which skills to focus on developing and what kind of compensation they can expect. In a recent project, I scraped public job postings in Austria to uncover the most sought-after skills and explore salary ranges for data science-related positions. Join me as we unveil intriguing insights into Austria’s data science job market.

Scraping Data

The first step of the project was to collect a dataset of job postings from an online resource, in this case the online job board Karriere.at. The goal was to capture valuable information like required skills and associated salaries, allowing us to gain actionable insights. To start the data collection, I first searched for the term “data science” on the job board, which yielded 310 results at the time I started the project.

The next step was to simply copy the HTML source code of the website, which contained all the links to the individual job postings. By leveraging the Requests and BeautifoulSoup Python libraries, I first extracted the links, then scraped the HTML and extracted the text for each listing. This resulted in 310 text files containing the job description in an unstructured format.

Extracting Skills and Salaries

To extract the skills and salaries from the scraped job postings, I employed OpenAI’s Chat Completions API with the GPT-3.5-Turbo model, or for short, ChatGPT. The large-language model allows us to effectively extract the information of interest from the unstructured data via a well-crafted prompt. The skill and salary extraction is thereby split into two tasks, so the first processing of the postings only extracts the skills, and the second run extracts the salaries. The image below shows the unstructured text data and the extracted skills and salary plus salary period, which will be used for further analysis.

Data Cleaning and DataFrame creation

Once the necessary information was collected, I transformed it into a structured format suitable for comprehensive analysis. This involved organizing the extracted skills and their associated salaries into a clean DataFrame, ready for deeper exploration. To clean up the skills, the first step was to tokenize the comma-separated list returned by the Chat Completions API. Following was lowercasing, removal of spaces, hyphens and special characters, and also the removal of company names like Microsoft and Google. All these steps help to basically reduce the noise in the data that results from different wordings. For example, by looking into the data, I saw a lot of different wordings for Azure, namely MS Azure, Microsoft Azure and microsoft azure which all denote the same skill. Finally, employing all these cleaning steps resulted in a list of 785 different skills, which were consequently used to create a one-hot encoded DataFrame for further analysis.

Analyzing Sought-After Skills

With our properly structured dataset at hand, we can now delve into analyzing the most sought-after skills in Austria’s data science job market. This is done by simply counting the rows that contain the value “true” for the respective skill and sorting in descending order. If we look at the bar plot below, we can identify key skills that repeatedly cropped up across numerous job postings.

Revealing Salary Ranges

In addition to uncovering the essential skills sought by employers, we also examine the salary ranges offered for data-related roles in Austria. By creating a box plot of the provided salaries, we can see the distribution of salaries over the different data-related roles. Here we get a mean and median of around 49,000 euros.

Insights and Limitations

Throughout the analysis, several noteworthy findings emerged. The first finding is that Python and SQL are by far the most sought-after skills, with around 50 percent of the jobs demanding these two skills. Looking further at place three, we can see that there is a demand for Power BI, which is a business intelligence product offered by Microsoft. Following, we have a mixture of several programming languages and technologies. If we look at the salaries, we can see that we have several outliers, especially towards the lower end, which are probably a result of an error from the salary extraction pipeline or postings that give hourly wages that are not accounted for. The same is true for the skill extraction, which is comprehensive but not perfect. This could be accounted for by fine-tuning a model for this task. Nonetheless, given the zero-shot prompt approach, I am very happy with the results, which coincide with common knowledge about the field of data science and related job roles.

Conclusion

Our data analysis project sheds light on the current landscape of Austria’s data science job market by uncovering sought-after skills and highlighting salary ranges. According to our analysis, a well-rounded approach to acquiring skills would first require a solid understanding of Python and its usage in data-related scenarios. The following is a study of SQL and the concept of relational databases. A business intelligence solution like Power BI further adds to the skill set, which is now more focused on data analytics than data science. In general, the results of our analysis state a higher demand for data analytics than data science.

Alternatively, or to further expand one’s skill set, the usage of the Microsoft Azure platform for data-related workloads would be a great addition. Programming languages like Java, JavaScript, C#, or C++ and version control with GIT are more in the realm of software engineering. Armed with this knowledge, aspiring data scientists can effectively prioritize their skill development efforts. As technologies evolve and new demands emerge, it’s crucial to stay informed about market dynamics. The code for the analysis can be found in this repository.

Building a Private AI Tutor for a Microsoft Learn Course

Marco Schweiss — Wed, 25 Oct 2023 09:56:44 +0000

In this article, we will look into one of my recent projects, which is the creation of a specialized private AI tutor for a Microsoft Learn course for the Power BI certification. Given the results of the job market analysis discussed in my most recent article and the fact that I already wanted to refresh my Power BI knowledge, this is the perfect opportunity for this project. The overall aim was to provide an interface where users could ask questions and receive accurate and relevant responses based on the content of the course. We will now look into the details of the implementation.

Data Collection

To build the knowledge base for our AI tutor, we will utilize Python along with libraries like Selenium and BeautifulSoup to scrape all necessary information from the web. By converting each section of the course into individual text files, we can create a comprehensive database to serve as our tutor’s knowledge repository. Thereby, we have to utilize Selenium, which allows us to retrieve HTML code that is dynamically generated by JavaScript. We basically retrieve the raw HTML of each page and extract the actual content with the help of BeautifulSoup. Following, we clean the data by removing links to quizzes and interactive exercises in the course and splitting up the raw text into individual chapters by using the heading structure of the HTML.

This follows the same logic as the PDF conversion pipeline that was used for the ChatGPT flashcard generator and document summarizer, where the headings are used to split up a larger file into chapters. This semantically correct split of the text is important for the following steps and improves our results. Running our scraping pipeline results in the retrieval of 196 pages of course content, which are split into 528 text files that ultimately serve as our knowledge base.

Implementing the AI tutor

The core of our AI tutor is built with the LangChain library, which is a framework that facilitates building applications with large language models (LLMs) like those from the GPT model family. To enable our AI tutor to effectively process and retrieve information, we need to load and prepare the data before we can make an inference that returns an answer to our questions. This pipeline follows a step-by-step process, which is as follows:

Loading the documents
Creating word embeddings
Storing the embeddings in a vector database
Retrieving documents with a similarity search
Passing the retrieved documents to a LLM

Steps 1 to 3 are only done once at the beginning, and steps 4 and 5 are repeated for every question that we ask.

Loading Text and Storing Vectors

Loading the documents is straightforward and done with LangChain’s DirectoryLoader class. Thereby, the functionality of the class basically iterates over the files in a directory and stores them in a list. The more interesting part happens when we convert the raw text documents into embeddings in the next step. Embeddings allow for the creation of a vectorized representation of text, enabling us to analyze and compare pieces of text in the vector space. This is particularly valuable for conducting semantic searches, as it allows us to identify texts that are most similar based on their vector representations. For our project, we use the OpenAIEmbeddings class also from LangChain, which serves as our access point to OpenAI’s embedding service. The last data preparation step consists of storing the embeddings in a vector database, which allows for efficient retrieval. For our project, we use Faiss, a vector database developed by Meta.

Querying Process and Answer Generation

Once our database is ready, we can input questions or queries into our AI tutoring system. Utilizing similarity search algorithms, the vector database retrieves relevant documents based on query similarities. This retrieval process enables quick access to specific sections of course material that may hold answers or explanations related to our inquiries. Upon retrieving relevant documents from our knowledge base, we can then employ a GPT model to formulate accurate answers in response to user queries. The LLM ensures that we receive high-quality and contextually relevant responses tailored specifically to our questions. For better comprehension, we will now walk through the process step by step.

First, we set up the question, which will be “How can I share reports with Power BI?” or in German “Wie kann ich mit Power BI Berichte teilen?”. We first use this question for a similarity search in the vector database, which retrieves the following four documents:

You can see that three of the four documents are very relevant to our question about sharing reports, which means that the similarity search worked well. Here is also the proof that our initial semantically correct splitting of the documents paid off, because otherwise we would retrieve a lot of noise from the database. For example, a common strategy for this type of application if we do not have a structure to build from is to use a fuzzy approach where a large document is split at defined, overlapping word intervals. The LLM that interprets the retrieved documents can certainly handle this, but it is always better to clean the input data as much as possible to improve the accuracy and stability of the output.

As our last step, these four documents are now passed to the GPT model with the instructions to formulate an answer to our question based on the context of the four documents. LangChain handles the prompt creation and the API call for inference, and we get the following answer to our question:

The answer is pretty solid and very relevant to our question. Another very useful feature is that the hallucinations of the LLM are kept to a minimum with this approach. Hallucinations are incorrect answers that the model comes up with. For example, if we ask the model, “What is a squirrel-cage rotor?” or in German, “Was ist ein Kurzschlussläufer?” we get the following answer:

Conclusion

The development of the specialized private AI tutor for the Microsoft Azure Power BI course introduces an innovative way of engaging with educational content. By leveraging the power of Python, web scraping techniques, and advanced libraries like LangChain and GPT models, we successfully built an intelligent tutoring system capable of providing personalized and accurate responses. The biggest advantage of this type of system is definitely the increased speed of information retrieval. We basically built a specialized search engine that dynamically compiles the search results in the most targeted and accurate fashion possible, providing a direct answer to a question.

Overall, LangChain and similar frameworks provide a novel approach to utilizing state-of-the-art LLMs and have spawned a new wave of sophisticated AI applications. The system that we built here with a few lines of code can be leveraged in a lot of different scenarios. For example, the most recent generation of customer support chatbots are probably all running on the same or at least a similar pipeline. A knowledge base is set up with very relevant information about a company’s products or services, and customer questions are used the same way as we used for our AI tutor. The same system can also be used for company internal knowledge, interacting with user manuals, online documentation, or research papers. Although the responses are not always perfect and one should always check the source, which can be printed with the answer, it feels like we are moving into an era of real AI integration into a vast number of applications. The code for this project can be found on GitHub.

PDF Summarization and Flashcard Generation with ChatGPT

Marco Schweiss — Mon, 17 Jul 2023 13:42:35 +0000

In an era where information overload is prevalent, finding efficient ways to extract key insights from vast amounts of data has become crucial. As an aspiring student navigating the sea of academic papers during my time at university, I often found myself spending countless hours reading and analyzing research articles. Motivated by the desire to streamline this process, I embarked on a journey to develop a PDF summarizer using ChatGPT. Additionally, I wanted to leverage ChatGPTs language capabilities to automate the process of generating flashcards from the learning material.

By leveraging the power of OpenAI’s state-of-the-art language model, I aimed to create an automated tool that would help me and others extract essential knowledge from lengthy scholarly texts. The primary objective was to facilitate efficient research and enhance the learning experience for students like myself.

In this article, I will delve into the intricacies of building a PDF summarizer and flashcard generator with ChatGPT. I will discuss the underlying methods and the challenges encountered along the way.

The Portable Document Format

PDF, short for Portable Document Format, has become one of the most widely used file formats for document sharing and distribution. Its popularity is attributed to its ability to preserve the original formatting of a document across various platforms. However, this format presents unique challenges when it comes to extracting information and transforming it into a structured format.

The main hurdle lies in the fact that PDFs store data as a series of graphical elements rather than plain text. While these graphical elements maintain the visual appearance of the document, they pose difficulties for programs aiming to extract meaningful content automatically. Directly accessing and parsing text from a PDF can be complex due to variations in layout, fonts, and non-standard encoding schemes.

Another challenge is that PDFs often contain complex structures such as tables, headings, footnotes, citations, and images. These elements require specialized algorithms and techniques for accurate understanding and extraction. Moreover, despite efforts towards standardized file formats like tagged PDFs (PDF/A), many documents are still produced without such metadata, making it even more challenging to extract relevant information systematically.

To overcome these obstacles and bring the data within PDFs into a structured format, one must employ smart techniques such as optical character recognition (OCR) to convert graphical elements back into text data. Additionally, algorithms need to analyze the layout structure to identify headings, sections, paragraphs, and other textual components accurately, which brings us to the PDF Extract API from Adobe.

PDF Extract API

The PDF Extract API is a cloud-based service to automatically extract content and structural information from PDF documents. It can extract text, tables, and figures from both native and scanned PDFs. When extracting text, the API breaks it down into contextual blocks like paragraphs, headings, lists, and footnotes. It includes font, styling, and formatting information, providing comprehensive output. Tables within the PDF are extracted and parsed, including contents and table formatting for each cell. The API can also extract figures or images from the PDF, saving them as individual PNG files for further use.

We will leverage this API to create a JSON file containing the PDF content and in turn, use this JSON file to create text files that we will feed into ChatGPT for further processing.

The Problem with Summarization

Text summarization is a natural language processing technique aimed at condensing large bodies of text into shorter summaries while still capturing the essential information. With the rise of transformer models like ChatGPT, extractive and abstractive summarization methods have seen significant advancements.

However, when dealing with long-form content such as PDF documents, transformers face challenges due to their limited input length. Transformers typically have a maximum token limit, necessitating the truncation or splitting of longer texts. This approach can lead to the loss of important context and affect the quality of the summary.

To address this limitation and preserve information integrity, we will leverage the headings present in the PDF documents. Headings provide structural organization to the text and can serve as natural breakpoints for splitting the document into smaller sections. By splitting the document based on headings, we can ensure that each section is more coherent and self-contained, resulting in better summarization outcomes.

PDF to JSON

As already explained the first step of our pipeline is the conversion of a raw PDF into a JSON file and the extraction of tables and figures. We will use the paper Attention is All You Need by Vaswani and colleagues which is an important paper about transformer model architecture with over 80 000 citations for testing our approach. If you open the PDF you can see that we have a perfect candidate with a variety of elements such as headings, sub-headings, lists, math formulas, images, footnotes, and tables.

For a detailed explanation of the JSON files structure please refer to the documentation of the API provider. Sending the PDF to the Extract API returns the following files:

From here on out we are confronted with three distinct problems:

How do we get the information in the JSON file into a format suitable for our use case?
How do we integrate the tables into the text files?
How do we handle the figures?

JSON to Chapters

We will start by solving problem one which is the most straightforward to solve. By using the documentation we can iterate through the JSON file and transform the data to our target format, which is a folder of text files each corresponding to a section. A section is thereby everything that is contained between two headings.

The second problem of integrating the tables back into the text files is also relatively easy to solve. When the JSON file indicates a table in its path attribute we read the table into a DataFrame, convert it into a CSV, and instead of saving the data to a file we save it inside our text file. By using | as column separators, ChatGPT is capable of understanding that this should be a table.

The third part on our list is where it gets a bit more interesting. Let’s look at the following two images from the figure folder. Here is the first image

and here is the second image:

Given that this application should also be used for generating flashcards, we need a way to tell apart diagrams and formulas and also to incorporate mathematical formulas into our text. We will omit the diagrams for the time being.

Training a Formula Detection Model

To solve the issue we will train a CNN on a dataset that I compiled from various different sources. We basically need a model that can distinguish between math formulas and everything else that could be included in a PDF. This can range from figures and diagrams to illustrations and images of everything else. The approach to training this model is the same as the one that I have already used for the facial emotion classifier. Given the relatively simple task for the modified VGG19, because the two image categories are relatively distinct, we get an accuracy of around 99% which will have to suffice.

Now we integrate this model into our initial script and every time the JSON file path attribute indicates a Figure, the model is called to predict whether we have a formula or not. Depending on the decision the figure is either skipped or sent to the Mathpix API which returns us a Latex representation of the formula. The Latex code is then included in the text where it can be interpreted by ChatGPT. An example of a text file can be seen below:

Chapters to Content

With the preprocessing of the PDFs done, we can now move on to the second big part of the pipeline which is the summarization and the creation of flashcards. As I have already mentioned we will use ChatGPT’s capabilities to do this. By using ChatGPT’s API we can improve our results by providing examples of how we want the task done by sending context in the form of a preceding conversation. To convert the information from the raw text files into high-quality summaries and flashcards we will incorporate an intermediate step that chunks the raw text data into blocks of related information. The image below shows the text of chapter eleven in a chunked format.

After chunking we then let ChatGPT create a summary

and a list of flashcards:

The image does not show all the flashcards that were created. The complete text file can be downloaded here. As you may have noticed, the summary of the section is not really that much shorter than the original section. We will discuss how different lengths of text effects the summaries and the quality of the outputs later on. We also instruct ChatGPT to create flashcards from the summaries which gives us the following list:

Content to Documents

The last step of the pipeline is to convert the text files into documents that can be further used for studying. We will first look into the CSV files that can be used in flashcard software like Anki. Here we will create two CSV files one for the flashcards from the chunked files and one for the summary flashcards.

In addition to the question and answer that makes up a flashcard, we also include the source (document name), the section heading, and the two combined inside a tags column, which is a sorting mechanism used by Anki. Given the correct settings inside Anki, we can import the 238 flashcards from the chunked text and the 113 flashcards from the summaries. An example of how this looks inside Anki in its final form can be seen below:

The second type of document which we will create is two Word documents that contain the chunked sections and the summarized sections.

Discussion of the Results

Does our pipeline here perform flawlessly? No. Is it enough so that I could save a lot of time on tedious work by hand? Yes, absolutely.

Summaries

First of all, let’s talk about the rate of shrinkage that a document undergoes. The size of the resulting summary is dependent on how large the individual sections are and I think the GPT model handles this very well because larger sections get condensed more, and smaller sections that contain less information to begin with, do not get summarized too much. As you can see from our example, we have around 30% percent shrinkage of the original text, if we only consider the actual text as graphics, references and the abstract have been removed.

On the other hand, I tested this pipeline on my written university lectures that have several hundred pages and here I often got a reduction of around 50% or more, of course depending on the structure of the document. So a rule of thumb is that a lot of text without much headings and graphics gets condensed more and a document with sparse text gets condensed less.

The summaries that this pipeline generates can also be used as input for more summaries and solves the problem of too large input contexts, of course only if the original document is not too large to begin with. Here is an example of a one-page summary based on our five-page summary, that could be used for example to speed up a literature review :

Another idea would be to build a recursive summarizer where we first summarize sub-sections and then summarize these summaries into section-level summaries. Similar like OpenAI did in their case study. Regarding the quality of the summary I can of course only give you my subjective opinion as there is no standardized process to measure the quality of a summary. First I would like to explain how I use summaries in my overall approach to studying for a subject. The summary is used to get a first understanding of how the topics of a subject relate to each other and effectively reduces the time to do so by omitting detailed information.

So in essence you can focus on building a higher-level overview of how topics relate and care for the details of a specific topic later on. Given that I studied successfully for several exams with this approach I am personally satisfied with the quality of the summaries and the time that the automated generation saved me.

Flashcards

Next, we will look into the flashcards. As you may have noticed, the flashcards are very granular which means that practically every bit of information is transformed into a question-and-answer format. The idea here is to curate the flashcards during the revision process, so you have built an understanding of what the learning material is about with the summaries, then studied the material in detail with the chunked and original text and finally revise it with the flashcards.

During this revision process, you can then remove or rephrase sub-optimal cards and also check gaps in your existing knowledge. Additionally, you can evaluate the cards during studying which also can be considered a form of interleaved retrieval practice and results in better learning outcomes at least in my case. And last but not least the automated flashcard generation reduced the time it took to study for an exam, as I found curating the cards is faster than writing them from scratch. Of course, you can also just study the summary cards and call it a day.

Conclusion

In conclusion, the use of the PDF summarization pipeline built with ChatGPT has proven to be a game-changer in reducing study time and eliminating tedious handwriting work for me. The possibility for models like ChatGPT to act as a multitask-learner with minimal input is really fascinating and allows for a vast amount of use cases like this one. It is important to note that I use the GPT-3.5 model although GPT-4 is already available because the price per token demands it. A larger document with 100-200 pages costs around 2-3 $ at this time with the GPT-3.5 model and would cost 10 times as much with GPT-4.

I already did tests with GPT-4 but the capabilities in regards to the transformation of text are not that much different between the two models to justify the costs. This also coincides with what OpenAI reported about the difference between the two models. Given the rapid advancements of this technology and the large number of tasks that can be solved with large-language models, I am looking forward to years of interesting application development. This time there will be no code on GitHub, if my coding skills interest you please refer to my other works.

Training of a Spam Classification Model

Marco Schweiss — Fri, 14 Jul 2023 09:14:55 +0000

The rise of short message communication in recent years has provided individuals and businesses with a fast and interactive way to communicate with their peers and customers. However, this has also led to an increase in the number of spam messages, particularly on publicly available channels. In response to this, we will build a classification model that can identify if a message is spam and allows for an adjustable risk level, ensuring that low-risk messages are immediately displayed while others are moved to a folder for further analysis. We will base our model on this dataset.

To achieve these objectives, this project will follow the CRISP-DM framework, a widely used methodology that provides a structured approach to planning a data science project. We will cover all the necessary steps, including business understanding, data understanding, data preparation, modeling, evaluation, and a suggestion for deployment.

Organizing the Project with CRISP-DM

The first step in the CRISP-DM methodology is the Business Understanding step, which involves understanding the business objectives and requirements of the given task. In this step, we need to define the goals and objectives of the project and identify the stakeholders. This will help us to define the scope of the project and ensure that we deliver a solution that meets the needs of the stakeholders. We will base this project on a fictitious company that has tasked us with building a spam filter for their open channel to enable effective communication with their customers while maintaining message quality.

Business Understanding

For this project, the business objective is to build a spam filter for a new open communication channel that the company wants to install for its products. The aim is to enable fast and interactive feedback for their customers or possible future customers. The spam filter should be able to identify spam messages from short messages and allow for adjustments with respect to the risk of allowing spam to pass. This means that the spam filter should be adjustable via different spam-risk levels, e.g., low-risk (very restrictive) and high-risk (not restrictive). The messages passing the low-risk level are immediately displayed, while the other ones are moved to a folder for further analysis.

The stakeholders in this project are the company, the service team who will be responsible for using the spam filter to filter out spam messages, and the customers, who will be using the open channel to provide feedback or ask questions on the company’s products. It is important to keep in mind the expectations of these stakeholders while developing the spam filter so that we can deliver a solution that is effective and meets the needs of all groups. For the company the emphasis lies on not losing customers from misclassified messages, the service team wants a system that simplifies their daily work and the customers do not want to read too many spam messages, although a lost customer will probably hurt the company more than a spam message entering the channel.

In addition to the primary objectives of the project, it is also important to consider any constraints and assumptions that may impact the project. For example, we could have limitations in terms of the resources available to us, such as the amount of data that we have access to or the computing power required to train and test our models. We may also have to make assumptions about the types of messages that are likely to be classified as spam and adjust our approach accordingly. By considering these constraints and assumptions, we can ensure that we deliver a solution that is realistic and feasible given the resources available to us.

Constraints and Assumptions

Based on the dataset and the resources available, we can identify the following constraints:

The dataset only includes English messages, so the model will only be able to classify English messages accurately.
The dataset has a limited number of messages, which may not be representative of all types of messages that the company may receive.
The analysis and model training will be done on a home desktop PC, which may have limitations in terms of computational power and memory.

and assumptions:

The messages in the dataset are correctly labeled as ham or spam.
The dataset is a representative sample of the types of messages that the company may receive.
The spam filter will be used in a similar context as the dataset, so the model will generalize well to new messages.

Data Understanding

To start the data understanding step, we will first load the dataset and check for any missing values and duplicates. We will use the Pandas library to load the CSV file containing the messages and perform basic exploratory data analysis (EDA).

From the above output, we can see that there are no missing values in the dataset, but there are 403 duplicate records. We will remove these duplicate records from the dataset and look at the number of resulting unique records.

Next, we will explore the distribution of the target variable label with the help of Matplotlib and Seaborn. Additionally, we look at the distribution of the character count of the messages.

From the first visualization, we can see that the dataset is imbalanced as there are more ham messages than spam messages. This must be accounted for before we train our classifier.

Plotting the distribution of the message lengths, we can observe that spam messages tend to have longer message lengths compared to ham messages. Further looking into the number of digits and special characters reveals that spam messages tend to have more digits.

Another interesting feature that we will explore is the number of unique words per message type

and lastly, we plot word clouds to spot possible differences in vocabulary.

With this, we conclude the EDA and move further on to the data preparation step where we will balance the dataset and choose the most interesting features for training the model. The engineered features for each message will be message lengths, the number of unique words, and a TF-IDF representation of the vocabulary. We will omit the special character distribution as ham and spam messages are very similar in this regard.

Data Preparation

For the next step, we perform data preprocessing and feature engineering on the SMS message dataset. We load the data from the source CSV file, remove duplicates, and create new columns such as message length, number of digits, number of unique words, and a lemmatized version of the messages. The text data is then transformed into a matrix of TF-IDF values using the TfidfVectorizer class of the scikit-learn library. The resulting features are saved along with the trained vectorizer. Additionally, the dataset is balanced by sampling an equal number of ham and spam messages. The target variable is encoded as 0 for ham and 1 for spam and finally, the prepared dataset is saved as a new CSV file.

Model Training

For training, we first load our prepared dataset of messages and split it into training and testing sets. Then we perform a grid search with cross-validation to find the best hyperparameters for our classifier which will be a decision tree. Additionally, the feature importances are retrieved from the trained classifier, and a barplot is created to visualize the top 10 most important features.

Examining the plot, we can see that the model clearly favors the number of digits in a message as a decision factor. Following we have specific words and our other engineered features.

Error Analysis

After training our model, we will now further analyze the performance. To do this we compute a confusion matrix on our training and test datasets.

Additionally, we also retrieve accuracy, precision, and recall for each dataset respectively.

From the metrics, we can conclude that the model performs sufficiently well on our initial dataset. Additionally, we will look into the distribution of misclassified messages.

With this, we can assume that the model is not particularly biased toward one of the message types. Finally, we will test the model on 10 new messages that were generated with a language model and consist of 5 ham and 5 spam messages. The test results can be examined in the analysis Jupyter Notebook and revealed a reasonable out-of-sample performance and generalization. Although the emphasis of the model on digit counts makes spam messages with no digits and ham messages with more digits hard to classify correctly.

Deployment

To integrate the spam classification model into the service team’s workflow, we could develop a web application using Django. The steps involved in creating and deploying this system would be as follows:

Set up the Django project structure and configure the necessary settings.
Design the database schema to store messages and their classifications.
Set up an API endpoint that receives incoming messages, assuming we have the possibility to control message flow from our communication channels’ backend.
Implement the web interface using Django’s template system and HTML/CSS.
Integrate the spam classification model into the project and develop the function to classify messages.
Implement an adjustable spam level control, allowing the service team to set the level of stringency for classifying messages as spam. The level would be a percentage threshold that can be used for comparison with the probabilities of class membership that the model returns in percentages.
Deploy the Django application on a cloud platform.
Conduct thorough testing and quality assurance to ensure functionality and performance.
Provide documentation and training to guide the service team on using the web application effectively.
Maintain and improve the system based on feedback and ongoing monitoring.

Conclusion

The objective of this project was to build a classification model capable of identifying spam in short messages. The CRISP-DM methodology was followed, covering business understanding, data understanding, data preparation, modeling, evaluation, and a suggestion for deployment. The dataset’s quality was assessed, and findings were visualized for better comprehension. A decision tree classifier was created, trained, and tested using the provided dataset and additional out-of-sample data. An error analysis was conducted to identify weaknesses in the approach, which would be the reliance on digits for classifying spam. Finally, a proposal was made to integrate the model into the daily work of the service team. The solution would be in the form of a graphical user interface (GUI) based on a Django web application.

In summary, this project successfully developed a classification model for spam identification in short messages. The model returns class membership certainties in percentages which can be further utilized to build logic for adjustable risk levels. Further improvements to the model could be made by gathering additional data or data that is more representative of the use case, i.e. short messages of product feedback or questions. Additionally, there is the possibility of engineering alternative features to reduce reliance on the digit count. The project demonstrates the efficacy of data science methodologies in addressing real-world challenges, offering a valuable solution for the company. The code for this project can be viewed in this GitHub repo.

Analysis of a Customer Complaint Dataset

Marco Schweiss — Thu, 13 Jul 2023 10:35:00 +0000

In today’s digital age, businesses rely heavily on customer feedback and satisfaction to maintain a competitive edge. Understanding customer complaints and addressing them promptly is vital for ensuring customer loyalty and improving overall product and service quality. With the rise of online platforms and social media, companies have a vast amount of data at their disposal in the form of customer reviews, tweets, and other sources of feedback.

In this project report, we focus on analyzing a dataset containing customer complaints from Comcast, one of the largest cable television and internet service providers in the United States. By leveraging Natural Language Processing (NLP) techniques, our objective is to gain insights into the key issues faced by Comcast customers and identify patterns or trends within the complaints that can inform business decisions.

Conception

The first step is to create a concept to describe everything that belongs to the data analysis workflow. Given that anything that is overlooked or forgotten in this phase has a negative effect on the implementation later and will lead, in the worst case, to useless results, this step is perhaps the most important of the entire process.

Dataset

The dataset consists of four columns containing the author, post date, rating, and the actual complaint. Additionally, we have around 5500 records which provide enough information to generate relevant insights. The dataset can be found on Kaggle with this link.

Cleaning the Complaints

To get cleaned texts, we will look into three different approaches. First of all, for all three approaches, stopwords are removed with the NLTK library. Then for the first option, the words in the complaints will be lemmatized. For the second option, n-grams are formed, and for the last option, noun phrases will be extracted with the TextBlob library, which can be considered an API to the NLTK library. To get a look at the result of the individual preprocessing options, the WordCloud library is used to create word clouds.

Vectorization

For the vectorization step, the Bag-of-Words and TF-IDF algorithms will be used. Here the three preprocessing options can be combined with each of the two vectorization methods. In this analysis, the two algorithms implementations of the Sklearn library will be used.

Topic Modeling

In the last step of the analysis, either LSA or LDA will be used for topic modeling. Here the idea is to get the most prominent words for a number of generated topics which can then be interpreted. As for vectorization, the Sklearn library is used for this step. Given the aforementioned options, the analysis will consist of 3 * 2 * 2 = 12 variations that can be compared.

Development

In this phase, we will conduct the data analysis based on the developed concept. This PDF shows the implementation step-by-step.

Results and Analysis

For the discussion of the results, we will first look into the quality of the customer complaint dataset. The author claims that the complaints contained in the dataset are obtained with web scraping from a public source. Given that the source is not disclosed, there is no way to check where the data was obtained. Nonetheless, the dataset with around 5500 entries was deemed suitable for this project, as the source of the data does not have an impact on the methodology presented in this project. As already mentioned before, the analysis resulted in around 64 topics that could be interpreted. All in all, there were five interpreted topics that could be derived from the analysis: Problems with customer service, problems with internet speed, problems with the internet cable (possible downtimes), complaints about the billing, and problems with the TV receiver.

For possible further improvements of the analysis, there would be three scenarios. First of all, there is the possibility to tweak the parameters of the respective machine learning algorithms. For example, the number of topics to be extracted could be altered, or additional stop words could be removed. The second possibility would be to perform the analysis with different techniques and methodologies. There are a variety of possibilities that could be explored here. For example, to get more detailed insights about the topics of complaint, Word2Vec or GloVe could be employed. Additionally, there is the possibility of using transformer-style models like BERT or the GPT family of models to extract topics from the complaints

Solving this task definitely gave me a deeper understanding of natural language processing techniques and the topic modeling workflow in general. The explored process of preprocessing – vectorization – topic modeling with the employed algorithms can now serve as a foundation from which other methodologies can be tested and compared in future NLP projects.

For possible similar projects in the future, I would definitely try to implement a workflow based on transformer models, especially the relatively newly released GPT-3 model from OpenAI. As I have already experienced the astonishing capabilities of this model regarding its comprehension of language in other projects, it would be interesting to implement a topic modeling workflow with this model. This could possibly lead to more detailed findings of an analysis. The code for this project can be found on GitHub.

Discovering Trends in Scientific Publications

Marco Schweiss — Mon, 10 Jul 2023 13:51:50 +0000

The trend is your friend is an old saying that urges investors to jump on existing trends to maximize profit. Leveraging trends to get the most optimal outcome cannot only be used in finance but also in a variety of other scenarios. In this article, I will demonstrate how to detect trending topics in scientific publications. This serves the purpose of facilitating the conversation between researchers and stakeholders interested in research cooperation.

To extract trending topics, we will make use of exploratory data analysis, followed by unsupervised machine learning on a large dataset of scientific publications. A trending topic can be casually defined as an area that gets more interest over time. The detection of trends in scientific papers can be very valuable for a variety of different reasons, for example, to allocate funding for new emerging ideas and to remove funding from areas that are already losing interest.

To effectively tackle the detection of trends in scientific literature, algorithmic methods for evaluating large corpora of text, have become a necessity. Given the vast amount of published papers and the increasing number of daily publications, manual review is not feasible. Consequently, machine learning methods are employed to handle the large amounts of produced text data. For this article, the emphasis lies on unsupervised machine learning methods and the preprocessing of text data.

Methods and Techniques

For the article to be self-contained and to give the reader a short introduction, this chapter is concerned with a quick overview of the employed methods. We will talk about techniques from the domains of natural language processing, text vectorization, and topic modeling.

Part-Of-Speech Tagging

For processing text and uncovering underlying meaning, it is necessary to distinguish between the different features which make up a sentence and ultimately a language. These different features called part-of-speech, are represented by the type of words that are defined by their grammatical functionality, for example, a noun, verb, or adjective.

To efficiently analyze datasets with large amounts of text, an automatic tagging system for part-of-speech is necessary. Thereby tagging can be referred to as the appointment of tags or descriptions to the individual words.

The tagging system that we will use is part of the NLTK software package for Python, which is a toolkit for facilitating natural language processing tasks. The part-of-speech tagger implemented in NLTK works by processing a tokenized text and returning the respective tags for each word. Underlying the NLTK algorithm is the Universal Part-of-Speech Tagset, which was developed by Petrov et al. (2011) and serves as the framework for the classification of text features.

Text Vectorization

To operate machine learning methods with text data, it is necessary to first convert the text corpus into a suitable format. Given that machine learning algorithms are designed to work with numerical input data, the text data is usually represented as a vector. There are a variety of different methods for transferring text into a vector. We will focus on the term frequency-inverse document frequency or short TF-IDF method. Thereby the TF-IDF method provides the importance of specific words in a single text relative to the rest of the corpus. Underlying this technique is the assumption that there is a higher probability of uncovering the meaning of a text by emphasizing the more unique and rare terms of a single document.

The value for the TF-IDF is calculated for a single term and can be summarized as the scaled frequency of a term occurring in a unique document which is then normalized by the inverse of the scaled frequency of the term in all text data combined. We will use the TfidfVectorizer which is included in the Scikit-Learn library for Python. In this implementation, the algorithm takes a corpus of text, vectorizes the text, calculates the TF-IDF values, and returns a sparse matrix for further processing.

Topic Modeling

For gaining insight into the topics which are contained in a corpus of text documents, we will use a topic modeling method called Latent-Dirichlet-Allocation or short LDA. In the case of LDA, a topic is defined as a probability distribution over words. Additionally, a single text can be seen as a mixture of different topics. Thereby the LDA technique allows a fuzzy approach for topic modeling, due to the possibility of a word occurring in more than one topic. Due to the nature of the LDA technique, the resulting topics tend to be easier to interpret, because words that often appear in the same text are grouped to the same topic. This way of topic modeling is more oriented toward the way a human would expect it.

To operate the LDA method there are two necessary inputs, with the first input being a vectorized text corpus. The second input is the number of topics that should be extracted from the data. Choosing the correct number of topics is thereby a process of trial and error and reflects on the interpretability of the resulting topics. As for the TF-IDF method, we will use the LDA algorithm which is implemented in the Scikit-Learn package.

Dataset and Process Overview

As fuel for our trend analysis, we will use a dataset from arXiv containing a collection of scientific papers ranging in the millions. The dataset is maintained by Cornell University, where we will use the 37th version of the dataset which was published on the 8th of August 2021. Following is a short overview of the steps that we will take to process the arXiv text corpus.

Data Exploration

The first part of the topic extraction process is concerned with the exploration of the structure and content of the dataset. This includes a first view of the raw data, checks on missing values, and the visualization of publications on a time axis to reveal underlying trends.

Dataset Structure

We start by importing Pandas and loading a single entry of the dataset into a DataFrame. Printing the content for visual inspection reveals that the raw data consists of fourteen distinct columns. The next step is then to filter the dataset to only retain data necessary for trend detection and topic modeling. Therefore we only include the title, abstract, category, and publication date of the papers.

After settling on the required information, all records of the aforementioned four columns are imported and converted into a DataFrame. A quick check for possible missing values in the data reveals a well-maintained dataset without missing entries. The shape of the DataFrame is set at around 1.9 million entries with four columns.

Visualizing Publication Trends

From here we leverage the categories of the papers to visualize the number of publications per scientific field over time. This should determine the direction for further analysis and reduce the ambiguity of the relatively large dataset. We create a first view of the dataset by counting the publications per category:

As can be seen in the bar graph, the vast majority of publications are contained in the physics group with math in second and computer science in third place. Given that a scientific paper can have more than one category, the bar graph counts more publications than contained in the initial dataset. Nonetheless, for a general overview, this is sufficient.

To detect trends, the next step is to display the publications as a time series. We split the publications by category and group them by their publication date. For better visibility and to smooth the trendline, we also compute a 3-month moving average of the publications. Additionally, we restrict the graph to publications after the year 1999.

Looking at the trendlines, we can see that scientific publications, in general, are on an upwards trend. Nonetheless, if the timeframe is restricted to the most recent years, computer science and physics can be considered the top trending fields if we lay our focus solely on the number of published papers. The publication frequency at the moment is thereby around an average of two hundred publications per day, with an upwards tendency. From here on we will restrict the dataset to these two categories and to publications made after the year 2019, to get the most current topics.

Data Preparation

After settling on our two major scientific fields, we will now prepare the text data for topic modeling. This step includes first of all splitting the data into two separated DataFrames (one for each field), the generation of new features from the existing data, and following the vectorization of the text corpus to make the data suitable for downstream algorithms.

Feature Generation and Vectorization

To facilitate the interpretability of the extracted topics, we extract the nouns of the respective abstracts. Nouns are more suitable for general topic modeling because nouns carry the most meaning inside a text. This step also reduces the amount of text data to be analyzed and speeds up the topic modeling process. To further clean the data, extracted nouns containing special characters are removed and all text is set to lowercase.

Following we use the TF-IDF method to vectorize the extracted nouns. This step gets us two sparse matrices which we can now use for topic modeling. Additionally, we also have a list of feature names that we will use for topic visualization.

Topic Modeling

After completion of the data preparation step, we can now feed the TF-IDF matrix into our topic modeling algorithm (LDA). Given that the number of topics must be set before generating a topic, this step usually involves a lot of trial and error. After testing different parameters, the generation of twenty topics from each of the two categories subjectively leads to the best interpretability of the resulting topics.

Interpretation

To facilitate the interpretation of the generated topics, we will leverage the sub-categories of the publications that were part of the dataset. Given that the sub-categories with the most publications can also be considered to be the most trending categories, a word cloud is a quick and easy method to extract the top categories from the two DataFrames. The word cloud on the left represents computer science and the word cloud on the right physics:

With the help of the category taxonomy published on the arXiv website, the category tags can be interpreted. Thereby the most frequently occurring categories for computer science are machine learning (cs.LG, stat.ML), artificial intelligence (cs.AI) and computer vision (cs.CV). For physics, the categories astrophysics (astro-ph), condensed matter (cond-mat), and quantum
physics (quant-ph) are the top trending categories. A full list of categories can be viewed on the arXiv website.

To finally determine the trending topics in detail, we look at what topics were generated by the LDA algorithm. To make interpretation more feasible, we display the five most frequent words of a specific topic as a vertical bar graph. Here is an example from computer science and physics respectively:

A possible interpretation would be that topic 1 from computer science can be interpreted to be centered around neural networks and deep learning and topic 19 from physics concerning itself with quantum computing. The full list of generated topics can be viewed here for computer science and here for physics.

Detecting Trends

Given the results of our preceding analysis of the arXiv dataset, we can detect a lot of possible trends in current research. Focusing on the two major categories computer science and physics and further with the most common sub-categories in mind, the following table shows topics that I could derive from the results:

Conclusion

Identifying trends in science is valuable for a lot of reasons. Funding can be strategically allocated to efficiently support current trends and research cooperations can be established in fields that have a lot of traction. This offers a mutual benefit to both researchers and the supporting stakeholders. The code for the analysis can be found in this GitHub repo.

References

Bengfort, B., Bilbro, R., & Ojeda, T. (2018). Applied text analysis with python: Enabling language aware data products with machine learning. Sebastopol, CA: O’Reilly Media.

Berry, M. W. (Ed.). (2004). Survey of text mining: Clustering, classification, and retrieval. New York, NY: Springer.

Bird, S., Klein, E., & Loper, E. (2009). Natural language processing with python. Sebastopol: O’Reilly Media Inc.

Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent dirichlet allocation. the Journal of machine Learning research(3), 993–1022.

Cornell University. (2021). Category taxonomy. Retrieved from https://arxiv.org/category_taxonomy

Cutting, D., Kupiec, J., Pedersen, J., & Sibun, P. (1992). A practical part-of-speech tagger. Proceesings of the third conference on applied natural language processing, 133.

Fiona Martin, & Mark Johnson. (2015). More efficient topic modelling through a noun only approach. In Alta.

Lane, H., Howard, C., & Hapke, H. (2019). Natural language processing in action: Understanding, analyzing, and generating text with python. Shelter Island: Manning.

Petrov, S., Das, D., & McDonald, R. (2011). A universal part-of-speech tagset. Computing Research Repository.

Prabhakaran, V., Hamilton, W. L., McFarland, D., & Jurafsky, D. (2016). Predicting the rise and fall of scientific topics from trends in their rhetorical framing. Proceedings of the 54th annual meeting of the association for computational linguistics.

Roy, S., Gevry, D., & Pottenger, W. (2002). Methodologies for trend detection in textual data mining.

Schachter, P., & Shopen, T. (2007). Parts-of-speech systems. Language typology and syntactic description, 1–60.

Voutilainen, A. (2012). Part-of-speech tagging. The Oxford Handbook of Computational Linguistics.

Witten, I. H., Pal, C. J., Frank, E., & Hall, M. A. (2017). Data mining: Practical machine learning tools and techniques (4th ed.). Cambridge, MA: Morgan Kaufmann.