<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Machine Learning &#8211; BigDataTime</title>
	<atom:link href="https://marco507.github.io/category/machine-learning/feed/" rel="self" type="application/rss+xml" />
	<link>https://marco507.github.io/</link>
	<description>Programming, AI and Data Science</description>
	<lastBuildDate>Sat, 16 Dec 2023 18:31:35 +0000</lastBuildDate>
	<language>en-US</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>https://wordpress.org/?v=6.4.2</generator>

<image>
	<url>https://marco507.github.io/wp-content/uploads/2023/07/cropped-bigdatatime-favicon-3-32x32.png</url>
	<title>Machine Learning &#8211; BigDataTime</title>
	<link>https://marco507.github.io/</link>
	<width>32</width>
	<height>32</height>
</image> 
	<item>
		<title>Topic Modelling with LLMs</title>
		<link>https://marco507.github.io/topic-modelling-with-llms/</link>
		
		<dc:creator><![CDATA[Marco Schweiss]]></dc:creator>
		<pubDate>Fri, 01 Dec 2023 18:15:36 +0000</pubDate>
				<category><![CDATA[Machine Learning]]></category>
		<category><![CDATA[NLP]]></category>
		<category><![CDATA[Python]]></category>
		<guid isPermaLink="false">https://marco507.github.io/?p=712</guid>

					<description><![CDATA[The field of natural language processing has witnessed revolutionary advancements with the introduction of large language models (LLMs) such as ChatGPT and GPT-4. This has [...]]]></description>
										<content:encoded><![CDATA[<p>The field of natural language processing has witnessed revolutionary advancements with the introduction of large language models (LLMs) such as ChatGPT and GPT-4. This has prompted an exploration of how these LLMs can be utilized in topic modeling, an essential technique for comprehending extensive textual corpora. The aim of this project was to examine the trade-offs between traditional statistical algorithms and LLMs for topic modeling. Additionally, the goal was to develop a new approach for topic modeling utilizing the GPT model family, while providing recommendations on when to employ each method. A thorough literature review was conducted to establish the theoretical foundation.</p>
<p>Subsequently, an experiment was designed to compare the two approaches across three diverse datasets containing different types of documents, namely news, abstracts and tweets. Topic coherence and diversity metrics were employed for quantitative assessment, complemented by qualitative evaluation through GPT-4. Traditional models were found to outperform the LLM-based approach on the news and abstract datasets. Conversely, both qualitative and quantitative analysis demonstrated that the novel LLM approach yielded superior results on the tweet dataset.</p>
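<p>One common definition of topic diversity (whether it matches the project&#8217;s exact metric is an assumption) is the share of unique words among the top-k words of all topics. A minimal, dependency-free sketch with illustrative word lists:</p>

```python
def topic_diversity(topics, top_k=10):
    """Share of unique words among the top-k words of all topics.

    1.0 means no word is shared between topics; values near 0
    indicate highly redundant topics.
    """
    top_words = [word for topic in topics for word in topic[:top_k]]
    return len(set(top_words)) / len(top_words)

# Two toy topics sharing one word ("model")
topics = [
    ["network", "layer", "training", "model"],
    ["quantum", "qubit", "state", "model"],
]
print(topic_diversity(topics, top_k=4))  # 7 unique words / 8 total = 0.875
```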
<p>It was determined that traditional topic models are better suited for larger datasets, providing a higher-level overview of topics. In contrast, LLM-based approaches excel at generating granular and interpretable topics from smaller datasets. These findings offer practical recommendations regarding strategy selection in topic modeling applications.</p>
<p>To read the full article, please refer to this <a href="https://marco507.github.io/wp-content/uploads/2023/12/topic-modelling-llms.pdf" target="_blank" rel="noopener">PDF</a> file.</p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Emotion Detection in Images</title>
		<link>https://marco507.github.io/emotion-detection-in-images/</link>
					<comments>https://marco507.github.io/emotion-detection-in-images/#comments</comments>
		
		<dc:creator><![CDATA[Marco Schweiss]]></dc:creator>
		<pubDate>Fri, 14 Jul 2023 08:14:04 +0000</pubDate>
				<category><![CDATA[Machine Learning]]></category>
		<category><![CDATA[Python]]></category>
		<guid isPermaLink="false">https://marco507.github.io/?p=395</guid>

					<description><![CDATA[In an increasingly digital and interconnected world, understanding human emotions has become a critical aspect of various industries. Emotions significantly impact our decision-making processes, behavior, [...]]]></description>
										<content:encoded><![CDATA[<p>In an increasingly digital and interconnected world, understanding human emotions has become a critical aspect of various industries. Emotions significantly impact our decision-making processes, behavior, and overall experience. Marketing departments are particularly interested in leveraging emotion detection technology to gain insights into consumer preferences, optimize advertising strategies, and enhance user experiences. This project report focuses on the development and implementation of an emotion detection system capable of analyzing images to identify the underlying emotions expressed by individuals.</p>
<h2>Conception</h2>
<p>To solve this task, I decided to train a convolutional neural network (CNN) on a public dataset of face images categorized into seven different emotions. To arrive at a viable solution, the concept of transfer learning will be employed, which should yield state-of-the-art accuracy in classifying emotions.</p>
<h3>Basic Frameworks</h3>
<p>The primary frameworks for implementing the emotion recognition solution will be TensorFlow and OpenCV. TensorFlow is an open-source library that makes it easy to train, test, and develop neural networks. The name TensorFlow stems from the use of multidimensional data (tensors) as input that flows through several intermediate layers. These layers can be individually selected to build up a neural network architecture. Additionally, the library contains several pre-trained models which can be used for transfer learning. The second library, OpenCV, is a computer vision library featuring many useful algorithms that will be used to extract faces and manipulate images for the inference step.</p>
<h3>Transfer Learning and Model Choice</h3>
<p>With transfer learning, an existing model can be reused to solve problems that are different from, but related to, the purpose it was originally conceived for. The model&#8217;s pre-trained weights are reused, and the architecture of the original model is slightly altered. Transfer learning saves time, generally results in better performance, and reduces the amount of training data required. The emotion classifier for this project is based on the VGG19 model, a deep CNN consisting of multiple layers, including a dense output layer with 1000 neurons, because VGG19 was trained to classify 1000 different categories. To fit the model to the use case of detecting emotions, the original output layers, including the 1000-neuron dense layer, are swapped for new output layers matching the number of emotions we want to identify. Additionally, we freeze the original weights of the VGG19 model and train only our modified layers on the emotion dataset.</p>
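<p>A minimal Keras sketch of this modification (a sketch, not the project&#8217;s actual training code: the head sizes are assumptions, and <code>weights=None</code> stands in for <code>weights="imagenet"</code>, which would download the pre-trained parameters that transfer learning actually reuses):</p>

```python
from tensorflow.keras import layers, models
from tensorflow.keras.applications import VGG19

# Load VGG19 without its original 1000-class output layers.
# weights=None is a placeholder; in practice weights="imagenet"
# loads the pre-trained parameters.
base = VGG19(weights=None, include_top=False, input_shape=(48, 48, 3))
base.trainable = False  # lock the original weights

# New output layers matching the seven emotion categories
model = models.Sequential([
    base,
    layers.GlobalAveragePooling2D(),
    layers.Dense(256, activation="relu"),  # head size is an assumption
    layers.Dense(7, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="categorical_crossentropy",
              metrics=["accuracy"])
```

Only the two new dense layers receive gradient updates during training; the frozen convolutional base acts as a fixed feature extractor.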
<h3>Dataset Selection and Issues</h3>
<p>To train the model for emotion recognition, the FER2013 dataset, which consists of around 30,000 images of facial expressions, will be used. The images are classified into seven different emotion categories. The dataset also features around 3,500 test images for validating the trained model. Although the dataset offers a reasonable number of images for training, some categories are over- or under-represented, which introduces bias: the Happy category contains around 7,000 images, while the Disgust category features only around 400. To combat this issue, the data will be augmented, for example by flipping, rotating, or cropping images. Here is the <a href="https://www.kaggle.com/datasets/msambare/fer2013">link</a> to the dataset.</p>
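<p>The augmentation step can be sketched with plain NumPy. This is a sketch under assumptions: real pipelines typically use small-angle rotations rather than the coarse 90-degree steps of <code>np.rot90</code>, which is used here only to keep the example dependency-free:</p>

```python
import numpy as np

def augment(image, rng):
    """Create simple variants of a face image: flip, rotate, crop."""
    flipped = image[:, ::-1]                          # horizontal mirror
    rotated = np.rot90(image, k=rng.integers(1, 4))   # coarse rotation
    h, w = image.shape[:2]
    top = rng.integers(0, h // 8)
    left = rng.integers(0, w // 8)
    cropped = image[top:top + 7 * h // 8, left:left + 7 * w // 8]
    return flipped, rotated, cropped

rng = np.random.default_rng(0)
img = rng.random((48, 48))   # stand-in for a 48x48 FER2013 face
flipped, rotated, cropped = augment(img, rng)
print(flipped.shape, rotated.shape, cropped.shape)
```

Applying such transforms more aggressively to the under-represented classes (like Disgust) helps balance the category sizes.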
<h3>Model Training and Inference</h3>
<p>The overall process of implementing the solution for emotion recognition can be split into two distinct parts: model creation and inference. Model creation consists of augmenting the dataset, setting up the training environment and architecture with TensorFlow, and then starting the training process.</p>
<p><img fetchpriority="high" decoding="async" class="aligncenter size-full wp-image-396" src="https://marco507.github.io/wp-content/uploads/2023/07/emotion-detection_training.jpg" alt="" width="853" height="155" srcset="https://marco507.github.io/wp-content/uploads/2023/07/emotion-detection_training.jpg 853w, https://marco507.github.io/wp-content/uploads/2023/07/emotion-detection_training-300x55.jpg 300w, https://marco507.github.io/wp-content/uploads/2023/07/emotion-detection_training-768x140.jpg 768w" sizes="(max-width: 853px) 100vw, 853px" /></p>
<p>The second part, which constitutes the actual solution for making predictions, uses the trained model loaded into a Jupyter Notebook. Inference consists of loading an image, extracting the face with OpenCV, preprocessing it, and passing it to the trained model for prediction. The prediction is visualized as the image with the classified emotion written on it.</p>
<p><img decoding="async" class="aligncenter size-full wp-image-397" src="https://marco507.github.io/wp-content/uploads/2023/07/emotion-detection_inference.jpg" alt="" width="842" height="140" srcset="https://marco507.github.io/wp-content/uploads/2023/07/emotion-detection_inference.jpg 842w, https://marco507.github.io/wp-content/uploads/2023/07/emotion-detection_inference-300x50.jpg 300w, https://marco507.github.io/wp-content/uploads/2023/07/emotion-detection_inference-768x128.jpg 768w" sizes="(max-width: 842px) 100vw, 842px" /></p>
<h2>Development</h2>
<p>After creating a concept, the next step was to start the implementation. This <a href="https://marco507.github.io/wp-content/uploads/2023/07/emotion-detection_development.pdf" target="_blank" rel="noopener">PDF</a> explains the development process in detail.</p>
<h2>Conclusion</h2>
<p>Looking back on the creation of the emotion classifier, I can say that this project was a very pleasant and rewarding learning experience regarding computer vision and the development workflow of deep learning models. Given the issues with data quality, bias-variance trade-offs during training, hardware limitations, and the trial and error of hyperparameter tuning, the task also proved more difficult than initially thought. As the saying goes, the devil is in the details, and I can now understand why training deep learning models is considered by many machine learning practitioners to be a mixture of art and science.</p>
<p>Personally, I am satisfied with the final result, although I had hoped for somewhat higher accuracy and support for more emotions. To improve upon this work, a larger dataset such as AffectNet, which has around one million images, could be used to train a model on better hardware such as TPUs. A grid search over parameters and model structure could also be used to obtain better overall performance. The code for this project can be found in this <a href="https://github.com/marco507/Emotion-Classifier">repository</a>.</p>
]]></content:encoded>
					
					<wfw:commentRss>https://marco507.github.io/emotion-detection-in-images/feed/</wfw:commentRss>
			<slash:comments>1</slash:comments>
		
		
			</item>
		<item>
		<title>Discovering Trends in Scientific Publications</title>
		<link>https://marco507.github.io/trend-detection-in-scientific-publications/</link>
		
		<dc:creator><![CDATA[Marco Schweiss]]></dc:creator>
		<pubDate>Mon, 10 Jul 2023 13:51:50 +0000</pubDate>
				<category><![CDATA[Data Science]]></category>
		<category><![CDATA[Machine Learning]]></category>
		<category><![CDATA[NLP]]></category>
		<category><![CDATA[Python]]></category>
		<guid isPermaLink="false">https://marco507.github.io/?p=279</guid>

					<description><![CDATA[The trend is your friend is an old saying that urges investors to jump on existing trends to maximize profit. Leveraging trends to get the [...]]]></description>
										<content:encoded><![CDATA[<p>&#8220;The trend is your friend&#8221; is an old saying that urges investors to jump on existing trends to maximize profit. Leveraging trends for the best possible outcome is useful not only in finance but also in a variety of other scenarios. In this article, I will demonstrate how to detect trending topics in scientific publications. This serves the purpose of facilitating the conversation between researchers and stakeholders interested in research cooperation.</p>
<p>To extract trending topics, we will make use of exploratory data analysis, followed by unsupervised machine learning on a large dataset of scientific publications. A trending topic can be casually defined as an area that gets more interest over time. The detection of trends in scientific papers can be very valuable for a variety of different reasons, for example, to allocate funding for new emerging ideas and to remove funding from areas that are already losing interest.</p>
<p>To effectively tackle the detection of trends in scientific literature, algorithmic methods for evaluating large corpora of text have become a necessity. Given the vast number of published papers and the increasing number of daily publications, manual review is not feasible. Consequently, machine learning methods are employed to handle the large amounts of text data produced. For this article, the emphasis lies on unsupervised machine learning methods and the preprocessing of text data.</p>
<h2>Methods and Techniques</h2>
<p>For the article to be self-contained and to give the reader a short introduction, this chapter is concerned with a quick overview of the employed methods. We will talk about techniques from the domains of natural language processing, text vectorization, and topic modeling.</p>
<h3>Part-Of-Speech Tagging</h3>
<p>To process text and uncover its underlying meaning, it is necessary to distinguish between the different features that make up a sentence and ultimately a language. These features, called parts of speech, are word classes defined by their grammatical function, for example nouns, verbs, or adjectives.</p>
<p>To efficiently analyze datasets with large amounts of text, an automatic part-of-speech tagging system is necessary. Tagging refers to the assignment of tags or labels to the individual words.</p>
<p>The tagging system that we will use is part of the NLTK software package for Python, which is a toolkit for facilitating natural language processing tasks. The part-of-speech tagger implemented in NLTK works by processing a tokenized text and returning the respective tags for each word. Underlying the NLTK algorithm is the Universal Part-of-Speech Tagset, which was developed by Petrov et al. (2011) and serves as the framework for the classification of text features.</p>
<h3>Text Vectorization</h3>
<p>To apply machine learning methods to text data, it is necessary to first convert the text corpus into a suitable format. Given that machine learning algorithms are designed to work with numerical input, the text data is usually represented as a vector. There are a variety of methods for transforming text into a vector. We will focus on the term frequency-inverse document frequency (TF-IDF) method. TF-IDF measures the importance of specific words in a single text relative to the rest of the corpus. Underlying this technique is the assumption that the meaning of a text is more likely to be uncovered by emphasizing the more unique and rare terms of a single document.</p>
<p>The TF-IDF value is calculated for a single term and can be summarized as the scaled frequency of the term in one document, normalized by the inverse of the term&#8217;s scaled frequency across the whole corpus. We will use the TfidfVectorizer, which is included in the Scikit-Learn library for Python. In this implementation, the algorithm takes a corpus of text, vectorizes it, calculates the TF-IDF values, and returns a sparse matrix for further processing.</p>
<h3>Topic Modeling</h3>
<p>To gain insight into the topics contained in a corpus of text documents, we will use a topic modeling method called Latent Dirichlet Allocation (LDA). In LDA, a topic is defined as a probability distribution over words, and a single text can be seen as a mixture of different topics. LDA therefore allows a fuzzy approach to topic modeling, since a word can occur in more than one topic. The resulting topics tend to be easier to interpret, because words that often appear in the same texts are grouped into the same topic. This way of topic modeling is closer to what a human would expect.</p>
<p>The LDA method requires two inputs: a vectorized text corpus and the number of topics to extract from the data. Choosing the right number of topics is a process of trial and error and directly affects the interpretability of the resulting topics. As with the TF-IDF method, we will use the LDA implementation from the Scikit-Learn package.</p>
<h2>Dataset and Process Overview</h2>
<p>As fuel for our trend analysis, we will use a dataset from arXiv containing millions of scientific papers. The dataset is maintained by Cornell University; we will use version 37 of the dataset, published on 8 August 2021. Following is a short overview of the steps we will take to process the arXiv text corpus.</p>
<p><img decoding="async" class="aligncenter size-full wp-image-281" src="https://marco507.github.io/wp-content/uploads/2023/07/science-trends_process-overview.jpg" alt="" width="725" height="151" srcset="https://marco507.github.io/wp-content/uploads/2023/07/science-trends_process-overview.jpg 725w, https://marco507.github.io/wp-content/uploads/2023/07/science-trends_process-overview-300x62.jpg 300w" sizes="(max-width: 725px) 100vw, 725px" /></p>
<h2>Data Exploration</h2>
<p>The first part of the topic extraction process is concerned with the exploration of the structure and content of the dataset. This includes a first view of the raw data, checks on missing values, and the visualization of publications on a time axis to reveal underlying trends.</p>
<h3>Dataset Structure</h3>
<p>We start by importing Pandas and loading a single entry of the dataset into a DataFrame. Printing the content for visual inspection reveals that the raw data consists of fourteen distinct columns. The next step is to filter the dataset so that only the data necessary for trend detection and topic modeling is retained. Therefore, we include only the title, abstract, category, and publication date of the papers.</p>
<p>After settling on the required information, all records of the aforementioned four columns are imported and converted into a DataFrame. A quick check for possible missing values reveals a well-maintained dataset without missing entries. The resulting DataFrame has around 1.9 million rows and four columns.</p>
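<p>As a sketch of this loading step: the arXiv metadata ships as one JSON object per line, and the two inline records below stand in for the real file (for the full ~1.9 million records, one would pass a file path and a <code>chunksize</code> to keep memory usage manageable; the column names match the arXiv metadata format):</p>

```python
import io
import pandas as pd

COLUMNS = ["title", "abstract", "categories", "update_date"]

# Two inline records standing in for the arXiv metadata file,
# which stores one JSON object per line.
sample = io.StringIO(
    '{"id": "1", "title": "A", "abstract": "text", "categories": "cs.LG",'
    ' "update_date": "2021-01-01", "authors": "X"}\n'
    '{"id": "2", "title": "B", "abstract": "more", "categories": "astro-ph",'
    ' "update_date": "2021-02-01", "authors": "Y"}\n'
)

# For the real file: pd.read_json(path, lines=True, chunksize=100_000)
df = pd.read_json(sample, lines=True)[COLUMNS]

print(df.shape)          # (2, 4)
print(df.isna().sum())   # missing-value check per column
```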
<h3>Visualizing Publication Trends</h3>
<p>From here we leverage the categories of the papers to visualize the number of publications per scientific field over time. This should determine the direction for further analysis and reduce the ambiguity of the relatively large dataset. We create a first view of the dataset by counting the publications per category:</p>
<p><img loading="lazy" decoding="async" class="aligncenter wp-image-282 size-large" src="https://marco507.github.io/wp-content/uploads/2023/07/science-trends_pub-per-cat-1024x512.jpg" alt="" width="800" height="400" srcset="https://marco507.github.io/wp-content/uploads/2023/07/science-trends_pub-per-cat-1024x512.jpg 1024w, https://marco507.github.io/wp-content/uploads/2023/07/science-trends_pub-per-cat-300x150.jpg 300w, https://marco507.github.io/wp-content/uploads/2023/07/science-trends_pub-per-cat-768x384.jpg 768w, https://marco507.github.io/wp-content/uploads/2023/07/science-trends_pub-per-cat-1536x768.jpg 1536w, https://marco507.github.io/wp-content/uploads/2023/07/science-trends_pub-per-cat-2048x1024.jpg 2048w" sizes="(max-width: 800px) 100vw, 800px" /></p>
<p>As can be seen in the bar graph, the vast majority of publications are contained in the physics group, with math in second and computer science in third place. Given that a scientific paper can have more than one category, the bar graph counts more publications than the initial dataset contains. Nonetheless, for a general overview this is sufficient.</p>
<p>To detect trends, the next step is to display the publications as a time series. We split the publications by category and group them by their publication date. For better visibility and to smooth the trendline, we also compute a 3-month moving average of the publications. Additionally, we restrict the graph to publications after the year 1999.</p>
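<p>The grouping and smoothing step can be sketched as follows; the dates below are toy values standing in for the real publication dates:</p>

```python
import pandas as pd

# Toy stand-in for the filtered dataset: one row per publication
pubs = pd.DataFrame({
    "update_date": pd.to_datetime([
        "2020-01-05", "2020-01-20", "2020-02-11", "2020-02-15",
        "2020-03-01", "2020-03-09", "2020-03-30",
    ]),
})

# Group publications by month and count them
monthly = pubs.groupby(pubs["update_date"].dt.to_period("M")).size()

# 3-month moving average to smooth the trendline
smoothed = monthly.rolling(window=3).mean()
print(smoothed)
```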
<p><img loading="lazy" decoding="async" class="aligncenter size-full wp-image-285" src="https://marco507.github.io/wp-content/uploads/2023/07/science-trends_trend-1-scaled.jpg" alt="" width="2560" height="623" srcset="https://marco507.github.io/wp-content/uploads/2023/07/science-trends_trend-1-scaled.jpg 2560w, https://marco507.github.io/wp-content/uploads/2023/07/science-trends_trend-1-300x73.jpg 300w, https://marco507.github.io/wp-content/uploads/2023/07/science-trends_trend-1-1024x249.jpg 1024w, https://marco507.github.io/wp-content/uploads/2023/07/science-trends_trend-1-768x187.jpg 768w, https://marco507.github.io/wp-content/uploads/2023/07/science-trends_trend-1-1536x374.jpg 1536w, https://marco507.github.io/wp-content/uploads/2023/07/science-trends_trend-1-2048x499.jpg 2048w" sizes="(max-width: 2560px) 100vw, 2560px" /></p>
<p><img loading="lazy" decoding="async" class="aligncenter size-full wp-image-286" src="https://marco507.github.io/wp-content/uploads/2023/07/science-trends_trend-2.jpg" alt="" width="1705" height="1238" srcset="https://marco507.github.io/wp-content/uploads/2023/07/science-trends_trend-2.jpg 1705w, https://marco507.github.io/wp-content/uploads/2023/07/science-trends_trend-2-300x218.jpg 300w, https://marco507.github.io/wp-content/uploads/2023/07/science-trends_trend-2-1024x744.jpg 1024w, https://marco507.github.io/wp-content/uploads/2023/07/science-trends_trend-2-768x558.jpg 768w, https://marco507.github.io/wp-content/uploads/2023/07/science-trends_trend-2-1536x1115.jpg 1536w" sizes="(max-width: 1705px) 100vw, 1705px" /></p>
<p>Looking at the trendlines, we can see that scientific publications in general are on an upward trend. If the timeframe is restricted to the most recent years, computer science and physics can be considered the top trending fields, judging solely by the number of published papers. These categories currently average around two hundred publications per day, with an upward tendency. From here on we will restrict the dataset to these two categories and to publications made after the year 2019 to get the most current topics.</p>
<h2>Data Preparation</h2>
<p>After settling on our two major scientific fields, we will now prepare the text data for topic modeling. This step includes splitting the data into two separate DataFrames (one per field), generating new features from the existing data, and vectorizing the text corpus to make it suitable for the downstream algorithms.</p>
<h3>Feature Generation and Vectorization</h3>
<p>To facilitate the interpretability of the extracted topics, we extract the nouns from the respective abstracts. Nouns are well suited for general topic modeling because they carry most of the meaning in a text. This step also reduces the amount of text to be analyzed and speeds up the topic modeling process. To further clean the data, extracted nouns containing special characters are removed and all text is lowercased.</p>
<p>Next, we use the TF-IDF method to vectorize the extracted nouns. This yields two sparse matrices that we can now use for topic modeling, along with a list of feature names that we will use for topic visualization.</p>
<h2>Topic Modeling</h2>
<p>After completing the data preparation step, we can feed the TF-IDF matrices into our topic modeling algorithm (LDA). Given that the number of topics must be set in advance, this step usually involves a lot of trial and error. After testing different parameters, generating twenty topics for each of the two categories subjectively led to the most interpretable results.</p>
<h3>Interpretation</h3>
<p>To facilitate the interpretation of the generated topics, we will leverage the sub-categories of the publications that were part of the dataset. Given that the sub-categories with the most publications can also be considered the most trending, a word cloud is a quick and easy way to extract the top categories from the two DataFrames. The word cloud on the left represents computer science and the one on the right physics:</p>
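<p>The same top-category information behind the word clouds can be obtained with a plain frequency count; the category strings below mimic arXiv&#8217;s space-separated tag format and are illustrative:</p>

```python
from collections import Counter

# Toy stand-in for the "categories" column of one DataFrame;
# arXiv stores multiple space-separated tags per paper
categories = [
    "cs.LG stat.ML",
    "cs.CV cs.LG",
    "cs.AI",
    "cs.LG cs.AI",
]

tag_counts = Counter(tag for entry in categories for tag in entry.split())
print(tag_counts.most_common(3))  # cs.LG is the most frequent tag
```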
<p><img loading="lazy" decoding="async" class="aligncenter wp-image-288 size-large" src="https://marco507.github.io/wp-content/uploads/2023/07/science-trends_wordcloud-1024x246.jpg" alt="" width="800" height="192" srcset="https://marco507.github.io/wp-content/uploads/2023/07/science-trends_wordcloud-1024x246.jpg 1024w, https://marco507.github.io/wp-content/uploads/2023/07/science-trends_wordcloud-300x72.jpg 300w, https://marco507.github.io/wp-content/uploads/2023/07/science-trends_wordcloud-768x184.jpg 768w, https://marco507.github.io/wp-content/uploads/2023/07/science-trends_wordcloud.jpg 1293w" sizes="(max-width: 800px) 100vw, 800px" /></p>
<p>With the help of the category taxonomy published on the arXiv website, the category tags can be interpreted. The most frequently occurring categories for computer science are machine learning (cs.LG, stat.ML), artificial intelligence (cs.AI), and computer vision (cs.CV). For physics, the top trending categories are astrophysics (astro-ph), condensed matter (cond-mat), and quantum physics (quant-ph). A full list of categories can be viewed on the arXiv website.</p>
<p>To finally determine the trending topics in detail, we look at what topics were generated by the LDA algorithm. To make interpretation more feasible, we display the five most frequent words of a specific topic as a vertical bar graph. Here is an example from computer science and physics respectively:</p>
<p><img loading="lazy" decoding="async" class="aligncenter wp-image-289" src="https://marco507.github.io/wp-content/uploads/2023/07/science-trends_topic-example-1024x468.jpg" alt="" width="650" height="297" srcset="https://marco507.github.io/wp-content/uploads/2023/07/science-trends_topic-example-1024x468.jpg 1024w, https://marco507.github.io/wp-content/uploads/2023/07/science-trends_topic-example-300x137.jpg 300w, https://marco507.github.io/wp-content/uploads/2023/07/science-trends_topic-example-768x351.jpg 768w, https://marco507.github.io/wp-content/uploads/2023/07/science-trends_topic-example-1536x702.jpg 1536w, https://marco507.github.io/wp-content/uploads/2023/07/science-trends_topic-example.jpg 1762w" sizes="(max-width: 650px) 100vw, 650px" /></p>
<p>A possible interpretation is that topic 1 from computer science centers on neural networks and deep learning, while topic 19 from physics concerns quantum computing. The full list of generated topics can be viewed <a href="https://marco507.github.io/wp-content/uploads/2023/07/science-trends_cs_topics.png" target="_blank" rel="noopener">here for computer science</a> and <a href="https://marco507.github.io/wp-content/uploads/2023/07/science-trends_physics_topics.png" target="_blank" rel="noopener">here for physics</a>.</p>
<h2>Detecting Trends</h2>
<p>Given the results of the preceding analysis of the arXiv dataset, we can detect a number of possible trends in current research. Focusing on the two major categories, computer science and physics, and on their most common sub-categories, the following table shows the topics I could derive from the results:</p>
<p><img loading="lazy" decoding="async" class="aligncenter size-full wp-image-296" src="https://marco507.github.io/wp-content/uploads/2023/07/science-trends_interpreted-topics.jpg" alt="" width="718" height="353" srcset="https://marco507.github.io/wp-content/uploads/2023/07/science-trends_interpreted-topics.jpg 718w, https://marco507.github.io/wp-content/uploads/2023/07/science-trends_interpreted-topics-300x147.jpg 300w" sizes="(max-width: 718px) 100vw, 718px" /></p>
<h2>Conclusion</h2>
<p>Identifying trends in science is valuable for many reasons. Funding can be strategically allocated to efficiently support current trends, and research collaborations can be established in fields that have a lot of traction. This offers a mutual benefit to both researchers and the supporting stakeholders. The code for the analysis can be found in this <a href="https://github.com/marco507/Trend-Detection-in-Scientific-Publications">GitHub repo</a>.</p>
<h2>References</h2>
<p>Bengfort, B., Bilbro, R., &amp; Ojeda, T. (2018). Applied text analysis with Python: Enabling language-aware data products with machine learning. Sebastopol, CA: O&#8217;Reilly Media.</p>
<p>Berry, M. W. (Ed.). (2004). Survey of text mining: Clustering, classification, and retrieval. New York, NY: Springer.</p>
<p>Bird, S., Klein, E., &amp; Loper, E. (2009). Natural language processing with Python. Sebastopol, CA: O&#8217;Reilly Media.</p>
<p>Blei, D. M., Ng, A. Y., &amp; Jordan, M. I. (2003). Latent Dirichlet allocation. Journal of Machine Learning Research, 3, 993&#8211;1022.</p>
<p>Cornell University. (2021). Category taxonomy. Retrieved from https://arxiv.org/category_taxonomy</p>
<p>Cutting, D., Kupiec, J., Pedersen, J., &amp; Sibun, P. (1992). A practical part-of-speech tagger. Proceedings of the Third Conference on Applied Natural Language Processing, 133.</p>
<p>Lane, H., Howard, C., &amp; Hapke, H. (2019). Natural language processing in action: Understanding, analyzing, and generating text with Python. Shelter Island, NY: Manning.</p>
<p>Martin, F., &amp; Johnson, M. (2015). More efficient topic modelling through a noun only approach. Proceedings of the Australasian Language Technology Association Workshop (ALTA).</p>
<p>Petrov, S., Das, D., &amp; McDonald, R. (2011). A universal part-of-speech tagset. Computing Research Repository.</p>
<p>Prabhakaran, V., Hamilton, W. L., McFarland, D., &amp; Jurafsky, D. (2016). Predicting the rise and fall of scientific topics from trends in their rhetorical framing. Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics.</p>
<p>Roy, S., Gevry, D., &amp; Pottenger, W. (2002). Methodologies for trend detection in textual data mining.</p>
<p>Schachter, P., &amp; Shopen, T. (2007). Parts-of-speech systems. Language Typology and Syntactic Description, 1&#8211;60.</p>
<p>Voutilainen, A. (2012). Part-of-speech tagging. The Oxford Handbook of Computational Linguistics.</p>
<p>Witten, I. H., Pal, C. J., Frank, E., &amp; Hall, M. A. (2017). Data mining: Practical machine learning tools and techniques (4th ed.). Cambridge, MA: Morgan Kaufmann.</p>
]]></content:encoded>
					
		
		
			</item>
		<item>
		<title>Anomaly Detection in a Factory Setting</title>
		<link>https://marco507.github.io/anomaly-detection-in-a-factory-setting/</link>
		
		<dc:creator><![CDATA[Marco Schweiss]]></dc:creator>
		<pubDate>Mon, 10 Jul 2023 11:46:36 +0000</pubDate>
				<category><![CDATA[Machine Learning]]></category>
		<category><![CDATA[Python]]></category>
		<guid isPermaLink="false">https://marco507.github.io/?p=238</guid>

					<description><![CDATA[While working at my current workplace, the opportunity arose to design a proof-of-concept for a system that effectively identifies potential anomalies within the production cycle. [...]]]></description>
										<content:encoded><![CDATA[<p>While working at my current workplace, the opportunity arose to design a proof-of-concept for a system that effectively identifies potential anomalies within the production cycle. By leveraging the expertise of the shop floor employees and utilizing machine learning techniques, the objective was to create a system that seamlessly integrates with the existing factory infrastructure.</p>
<p>The factory specializes in the manufacturing of staircases and there was a desire to implement a decision support system that proactively alerts our CNC operators regarding issues with their current workpiece. In my role as an aspiring data scientist, I was more than happy to address this need by developing an anomaly detection system that enhances the factory&#8217;s operational efficiency.</p>
<p>In collaboration with the experienced shop floor employees, who possess domain knowledge accumulated over their extensive tenure at the production site, we identified key indicators of anomalies: temperature, humidity, and sound volume. These factors were determined to be reliable indicators of potential anomalies within the production cycle, including the presence of faulty items.</p>
<p>By proactively identifying potential issues within the production cycle, the system would enable timely interventions, leading to enhanced productivity, reduced downtime, and improved product quality. The subsequent sections of this project report delve into the details of the approach, methodology, and specific techniques employed to develop and implement the anomaly detection system.</p>
<h2>System Overview</h2>
<p>Let us first discuss the general implementation procedure of the project and how the components of the whole system fit together. The anomaly detection package consists of two working parts. The first component is a Python script containing a class that simulates the generation of three distinct sensor readings, as no sensors were mounted during the development of this proof-of-concept. The virtual sensor readings (temperature, humidity, and volume) are generated at a frequency of two hertz and sent via HTTP requests.</p>
<p><img loading="lazy" decoding="async" class="aligncenter wp-image-244" src="https://marco507.github.io/wp-content/uploads/2023/07/anomaly-detection_system-overview.jpg" alt="" width="499" height="204" srcset="https://marco507.github.io/wp-content/uploads/2023/07/anomaly-detection_system-overview.jpg 937w, https://marco507.github.io/wp-content/uploads/2023/07/anomaly-detection_system-overview-300x123.jpg 300w, https://marco507.github.io/wp-content/uploads/2023/07/anomaly-detection_system-overview-768x314.jpg 768w" sizes="(max-width: 499px) 100vw, 499px" /></p>
<p>On the receiving end, the second component is a web application based on Django. The web application features a REST API endpoint for receiving the virtual sensor data, a machine learning model for predicting anomalies, an administration interface for authenticating simulated machines, and a SQL database for storing logged data.</p>
<h2>Implementation Procedure</h2>
<p>In this project, the first step centered on the core of the whole system: the machine learning model. To obtain a working model, it was necessary to research suitable techniques for the anomaly detection use case. The model was first trained with self-generated data and then serialized for later use in the web application.</p>
<p>After the creation of a usable model, the first version of the web application was developed. This first version included the API endpoint for receiving data and the administration interface for registering machines. To complement the web application, the next step was to write the script for generating sensor data.</p>
<p>In the final step, the web application was completed by adding the data logging functionality and migrating it to the cloud.</p>
<p><img loading="lazy" decoding="async" class="aligncenter wp-image-248" src="https://marco507.github.io/wp-content/uploads/2023/07/anomaly-detection_implementation_procedure.jpg" alt="" width="650" height="262" srcset="https://marco507.github.io/wp-content/uploads/2023/07/anomaly-detection_implementation_procedure.jpg 1130w, https://marco507.github.io/wp-content/uploads/2023/07/anomaly-detection_implementation_procedure-300x121.jpg 300w, https://marco507.github.io/wp-content/uploads/2023/07/anomaly-detection_implementation_procedure-1024x412.jpg 1024w, https://marco507.github.io/wp-content/uploads/2023/07/anomaly-detection_implementation_procedure-768x309.jpg 768w" sizes="(max-width: 650px) 100vw, 650px" /></p>
<h2>Training Data Characteristics</h2>
<p>An anomaly can be described as a data point whose characteristics diverge from the data considered normal. For the use case of this project, we can assume that the sensor readings follow a Gaussian distribution, as sensor readings typically do. To train the model, a dataset with a thousand sensor readings each for temperature, humidity, and volume was generated: a thousand temperature readings with a mean of 25° Celsius and a standard deviation of 5° Celsius, a thousand humidity readings with a mean of 60% and a standard deviation of 5%, and a thousand volume readings with a mean of 100 decibels and a standard deviation of 10 decibels.</p>
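<p>Training data with these properties can be generated along the following lines. This is a minimal sketch, not the project&#8217;s actual script; the seed and variable names are illustrative.</p>

```python
import numpy as np

rng = np.random.default_rng(seed=42)
n = 1000

# Simulated normal operating conditions, as described above:
# temperature in °C, humidity in %, volume in dB
temperature = rng.normal(loc=25, scale=5, size=n)
humidity = rng.normal(loc=60, scale=5, size=n)
volume = rng.normal(loc=100, scale=10, size=n)

# One row per reading, one column per sensor
X_train = np.column_stack([temperature, humidity, volume])
```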
<h2>Model Selection</h2>
<p>For the machine learning model itself, I decided on an isolation forest. The isolation forest algorithm is an unsupervised method for anomaly detection, based on the principle that anomalies are rare and have characteristics that distinguish them from normal data points. The algorithm builds an ensemble of binary trees, with anomalies resulting in short path lengths in the trees.</p>
<p><img loading="lazy" decoding="async" class="aligncenter wp-image-256" src="https://marco507.github.io/wp-content/uploads/2023/07/anomaly-detection_model_selection-1.jpg" alt="" width="300" height="214" srcset="https://marco507.github.io/wp-content/uploads/2023/07/anomaly-detection_model_selection-1.jpg 488w, https://marco507.github.io/wp-content/uploads/2023/07/anomaly-detection_model_selection-1-300x214.jpg 300w" sizes="(max-width: 300px) 100vw, 300px" /></p>
<p>For every new data point that we want to predict, the model returns an anomaly score and a category: negative anomaly scores indicate an anomaly, while positive scores indicate a normal value. Given that the isolation forest is an eager learner with a very fast classification speed, it is ideal for the streaming data prediction of this project.</p>
<h3>Training the Model</h3>
<p>For this project, I used the implementation provided by the scikit-learn library. It is important to note that the isolation forest needs a predefined share of expected anomalies. All hyperparameters were therefore left at their default values, except the expected anomaly percentage, which was set to 5%. This value corresponds to the actual proportion of flawed workpieces in the real production environment. Training the model and predicting the labels for the training dataset yielded the following graph, with the blue points representing predicted anomalies.</p>
<p><img loading="lazy" decoding="async" class="aligncenter size-full wp-image-258" src="https://marco507.github.io/wp-content/uploads/2023/07/anomaly-detection_training.png" alt="" width="329" height="306" srcset="https://marco507.github.io/wp-content/uploads/2023/07/anomaly-detection_training.png 329w, https://marco507.github.io/wp-content/uploads/2023/07/anomaly-detection_training-300x279.png 300w" sizes="(max-width: 329px) 100vw, 329px" /></p>
<p>The model was then serialized so that it could be used later on.</p>
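<p>Putting training and serialization together, a minimal sketch with scikit-learn and joblib could look as follows. The expected anomaly share is scikit-learn&#8217;s <code>contamination</code> parameter; the seed and file name are illustrative assumptions, not the project&#8217;s actual code.</p>

```python
import joblib
import numpy as np
from sklearn.ensemble import IsolationForest

# Self-generated training data as described in the previous section
rng = np.random.default_rng(seed=42)
X_train = np.column_stack([
    rng.normal(25, 5, 1000),    # temperature (°C)
    rng.normal(60, 5, 1000),    # humidity (%)
    rng.normal(100, 10, 1000),  # volume (dB)
])

# All hyperparameters at their defaults except the expected anomaly share (5%)
model = IsolationForest(contamination=0.05, random_state=42)
model.fit(X_train)

labels = model.predict(X_train)            # -1 = anomaly, 1 = normal
scores = model.decision_function(X_train)  # negative scores indicate anomalies

# Serialize the fitted model for later use in the web application
joblib.dump(model, "isolation_forest.joblib")
```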
<h2>Web App Overview</h2>
<p>The web application consists of two distinct components, which are packaged into two separate apps inside a Django project. First, there is the API, which handles incoming data and returns predictions. The second component is the administration interface, which allows for registering machines and examining logged parts. Let’s start with a detailed view of the API.</p>
<h3>Prediction API</h3>
<p>The API endpoint that handles incoming sensor readings is based on a Django-Rest-Framework API view. The endpoint’s functionality consists of four consecutive steps. First, the endpoint checks whether the request comes from an authorized machine. If the received token and the machine&#8217;s description match a registered machine, the next step is to create a new part.</p>
<p>After the creation, the sensor values and the predictions are logged, and the prediction is returned in JSON format. The last step is to finish the part, making it available in the administration interface. All these individual steps are triggered by an additional value contained in the request from the machining script.</p>
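<p>The four steps above can be sketched framework-free as a single request handler. In the actual project this logic lives in a Django-Rest-Framework API view backed by the SQL database; the in-memory dictionaries, field names, and payload layout here are illustrative assumptions only.</p>

```python
# Hypothetical in-memory stand-ins for the Django models
registered_machines = {"milling_01": "secret-token"}  # description -> token
parts = {}  # part name -> list of (sensor values, anomaly score)

def handle_request(payload, predict):
    # 1. Authenticate: token and machine description must match a registered machine
    token = registered_machines.get(payload["machine"])
    if token is None or token != payload["token"]:
        return {"error": "unauthorized"}

    step = payload["step"]  # control value sent by the machining script
    if step == "start":
        # 2. Create a new part for the incoming readings
        parts[payload["part"]] = []
        return {"created": payload["part"]}
    if step == "reading":
        # 3. Log the sensor values together with the model's prediction
        score = predict(payload["values"])
        parts[payload["part"]].append((payload["values"], score))
        return {"anomaly_score": score}
    # 4. Finish the part so it becomes visible in the administration interface
    return {"finished": payload["part"], "readings": len(parts[payload["part"]])}
```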
<p><img loading="lazy" decoding="async" class="aligncenter size-full wp-image-260" src="https://marco507.github.io/wp-content/uploads/2023/07/anomaly-detection_prediction_api.png" alt="" width="441" height="371" srcset="https://marco507.github.io/wp-content/uploads/2023/07/anomaly-detection_prediction_api.png 441w, https://marco507.github.io/wp-content/uploads/2023/07/anomaly-detection_prediction_api-300x252.png 300w" sizes="(max-width: 441px) 100vw, 441px" /></p>
<h3>Administration Interface</h3>
<p>To complement the API, the application features an administration interface. The interface allows for registering machines and viewing the logged parts. To use the interface, logging in to the admin account is required.</p>
<h4><strong>Machine Registration</strong></h4>
<p>To register a machine, we first navigate to the Machines page, where we can enter a machine description into the registration form. The authentication token is then generated automatically by the application. After registration, the machine description and the token must be copied into the machining script. It is, of course, also possible to register more than one machine.</p>
<h4>Part Inspection</h4>
<p>The second part of the administration interface is the part overview, which can be reached via the Parts page. Early in the development of the application, the decision was made to split the continuous sensor measurement into segments that correspond to the machining of a single part. In other words, each machine has three sensors, with the sensor readings linked to a specific part; a part thus consists of a random number of sensor readings.</p>
<p>The anomaly score of a part is the cumulative score of all individual negative anomaly scores of its sensor readings. Each part also has a start time and an end time, and the logged sensor values can be downloaded in CSV format.</p>
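<p>The cumulative part score described above boils down to summing only the negative per-reading scores. A small sketch (the function name is illustrative):</p>

```python
def part_anomaly_score(scores):
    """Sum all negative per-reading anomaly scores of a part;
    positive (normal) scores do not contribute."""
    return sum(s for s in scores if s < 0)

# Example: only the two negative readings contribute
part_anomaly_score([-0.5, 0.1, -0.25, 0.2])  # -> -0.75
```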
<h3>Machine Simulation</h3>
<p>The machine simulation is a class that generates a random number of sensor readings and transmits them to the API via consecutive POST requests. To use the machining script, an instance of the machine class must be created, with the registered machine name and token passed to the constructor.</p>
<p>In the second step, the part machining is simulated by calling the process method of the new machine instance and passing the new part&#8217;s name as a parameter. After executing the script, a random number of sensor readings is generated. I decided to sort the sensor readings before sending them, so that the anomalies are centered around a specific section of a part instead of being randomly placed inside the data. I did this by labeling the anomalies, sorting them, and then removing the labels before sending. In the real environment, this would happen automatically.</p>
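<p>A simplified sketch of such a simulator class is shown below. The real script sends HTTP POST requests (for instance via the requests library); here the transport is injected as a callable so the sketch stays self-contained, and all names, payload fields, and outlier parameters are assumptions rather than the project&#8217;s actual code.</p>

```python
import random
import time

class Machine:
    """Illustrative sketch of the machine simulator class."""

    def __init__(self, name, token, send, rate_hz=2):
        self.name = name
        self.token = token
        self.send = send          # callable that POSTs one payload to the API
        self.period = 1 / rate_hz # two readings per second by default

    def readings(self, n):
        # Mostly normal readings plus a few outliers; appending the outliers
        # after the normal readings mimics the sorting described above, so the
        # anomalies cluster in one section of the part
        normal = [(random.gauss(25, 5), random.gauss(60, 5), random.gauss(100, 10))
                  for _ in range(n - n // 20)]
        outliers = [(random.gauss(45, 5), random.gauss(90, 5), random.gauss(140, 10))
                    for _ in range(n // 20)]
        return normal + outliers

    def process(self, part, n=None):
        n = n or random.randint(120, 600)  # one to five minutes at 2 Hz
        self.send({"machine": self.name, "token": self.token,
                   "step": "start", "part": part})
        for temperature, humidity, volume in self.readings(n):
            self.send({"machine": self.name, "token": self.token,
                       "step": "reading", "part": part,
                       "values": [temperature, humidity, volume]})
            time.sleep(self.period)
        self.send({"machine": self.name, "token": self.token,
                   "step": "finish", "part": part})
```

In the demonstration below, this corresponds to instantiating the class with the registered name and token and then calling the process method with a part name.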
<h2>Demonstration</h2>
<p>Now let&#8217;s look at the final part of the report: a small demonstration of how to use the whole system.</p>
<h3>Login</h3>
<p>The first step is to log in and register a new machine. For this, we boot up the local Django server, open the application&#8217;s address in a browser, and enter the credentials.</p>
<p><img loading="lazy" decoding="async" class="aligncenter wp-image-264 size-full" src="https://marco507.github.io/wp-content/uploads/2023/07/anomaly-detection_demo-login.png" alt="" width="1044" height="398" srcset="https://marco507.github.io/wp-content/uploads/2023/07/anomaly-detection_demo-login.png 1044w, https://marco507.github.io/wp-content/uploads/2023/07/anomaly-detection_demo-login-300x114.png 300w, https://marco507.github.io/wp-content/uploads/2023/07/anomaly-detection_demo-login-1024x390.png 1024w, https://marco507.github.io/wp-content/uploads/2023/07/anomaly-detection_demo-login-768x293.png 768w" sizes="(max-width: 1044px) 100vw, 1044px" /></p>
<h3>Machine Registration</h3>
<p>After a successful login, we land on the homepage, which shows brief instructions on how to use the application. We start by registering a new machine: clicking the Machines button leads to the machines page, where we type milling_01 into the registration form and click Create Machine. The new machine and the generated authentication token are then displayed under the form.</p>
<p><img loading="lazy" decoding="async" class="aligncenter size-full wp-image-267" src="https://marco507.github.io/wp-content/uploads/2023/07/anomaly-detection_demo-registration.png" alt="" width="1041" height="404" srcset="https://marco507.github.io/wp-content/uploads/2023/07/anomaly-detection_demo-registration.png 1041w, https://marco507.github.io/wp-content/uploads/2023/07/anomaly-detection_demo-registration-300x116.png 300w, https://marco507.github.io/wp-content/uploads/2023/07/anomaly-detection_demo-registration-1024x397.png 1024w, https://marco507.github.io/wp-content/uploads/2023/07/anomaly-detection_demo-registration-768x298.png 768w" sizes="(max-width: 1041px) 100vw, 1041px" /></p>
<h3>Set Up the Simulation Script</h3>
<p>Before we can start simulating a part, the machining script must be modified: we initialize a new machine with the registered name, enter the authentication token, and call the process method of the machine with the part&#8217;s name as an argument.</p>
<p><img loading="lazy" decoding="async" class="aligncenter size-full wp-image-268" src="https://marco507.github.io/wp-content/uploads/2023/07/anomaly-detection_demo-init-machine.png" alt="" width="648" height="121" srcset="https://marco507.github.io/wp-content/uploads/2023/07/anomaly-detection_demo-init-machine.png 648w, https://marco507.github.io/wp-content/uploads/2023/07/anomaly-detection_demo-init-machine-300x56.png 300w" sizes="(max-width: 648px) 100vw, 648px" /></p>
<h3>Simulating a Part</h3>
<p>The data generation is then started by executing the Python script. Simulating a part can take from one to five minutes. If everything works correctly, the first returned value is a confirmation that a new part has been created, followed by the real-time predictions of the sensor measurements and, lastly, a message that the part has been finished. It is worth noting that the application supports multiple machines running simultaneously.</p>
<p><img loading="lazy" decoding="async" class="aligncenter wp-image-271" src="https://marco507.github.io/wp-content/uploads/2023/07/anomaly-detection_demo-simulation.jpg" alt="" width="800" height="255" srcset="https://marco507.github.io/wp-content/uploads/2023/07/anomaly-detection_demo-simulation.jpg 1172w, https://marco507.github.io/wp-content/uploads/2023/07/anomaly-detection_demo-simulation-300x95.jpg 300w, https://marco507.github.io/wp-content/uploads/2023/07/anomaly-detection_demo-simulation-1024x326.jpg 1024w, https://marco507.github.io/wp-content/uploads/2023/07/anomaly-detection_demo-simulation-768x244.jpg 768w" sizes="(max-width: 800px) 100vw, 800px" /></p>
<h3>Part Inspection</h3>
<p>A finished part can be inspected by clicking the Parts button on the homepage. Here we have an overview of the most important parameters, such as the anomaly score, and we can also download the sensor values or delete the part. In the production environment, the anomaly score would now serve as an educated guess about the quality of a single part.</p>
<p><img loading="lazy" decoding="async" class="aligncenter size-full wp-image-273" src="https://marco507.github.io/wp-content/uploads/2023/07/anomaly-detection_demo-inspection.png" alt="" width="1041" height="249" srcset="https://marco507.github.io/wp-content/uploads/2023/07/anomaly-detection_demo-inspection.png 1041w, https://marco507.github.io/wp-content/uploads/2023/07/anomaly-detection_demo-inspection-300x72.png 300w, https://marco507.github.io/wp-content/uploads/2023/07/anomaly-detection_demo-inspection-1024x245.png 1024w, https://marco507.github.io/wp-content/uploads/2023/07/anomaly-detection_demo-inspection-768x184.png 768w" sizes="(max-width: 1041px) 100vw, 1041px" /></p>
<h2>Conclusion</h2>
<p>Moving forward, the system will be further refined and optimized based on the feedback and insights gained during actual production with real sensor readings. Continuous monitoring, evaluation, and improvement will be essential to maximize the system&#8217;s performance and adaptability. As always, the code and instructions on how to use the project are available on <a href="https://github.com/marco507/Factory-Anomaly-Detection">GitHub</a>.</p>
<div id="sp_easy_accordion-1688997745"><div id="sp-ea-303" class="sp-ea-one sp-easy-accordion" data-ex-icon="fa-minus" data-col-icon="fa-plus"  data-ea-active="ea-click"  data-ea-mode="vertical" data-preloader="" data-scroll-active-item="" data-offset-to-scroll="0"><div class="ea-card  sp-ea-single"><h3 class="ea-header"><a class="collapsed" data-sptoggle="spcollapse" data-sptarget=#collapse3030 href="javascript:void(0)"  aria-expanded="false"><i class="ea-expand-icon fa fa-plus"></i> References</a></h3><div class="sp-collapse spcollapse spcollapse" id="collapse3030" data-parent=#sp-ea-303><div class="ea-body"><p>Liu, F. T., Ting, K. M., &amp; Zhou, Z.-H. (2008). Isolation forest. In Eighth IEEE International Conference on Data Mining (pp. 413–422).</p>
<p>Ul Islam, R., Hossain, M. S., &amp; Andersson, K. (2018). A novel anomaly detection algorithm for sensor data under uncertainty. Soft Computing, 22(5), 1623–1639.</p>
</div></div></div></div></div>
]]></content:encoded>
					
		
		
			</item>
	</channel>
</rss>
