Python – BigDataTime

Analysing Website Technologies – A Complete Data Science Project

Marco Schweiss — Fri, 15 Dec 2023 18:18:30 +0000

As the digital landscape continues to evolve at a rapid pace, understanding the technologies that power websites has become increasingly crucial. From front-end frameworks to server-side languages, the array of tools and platforms available can be overwhelming for businesses and developers alike. In this data science project, we will do an analysis of website technologies to uncover insights that can inform strategic decision-making in the ever-changing world of web development. The benefits of such an analysis could be for example a database of possible business leads or having a basis for a data-driven hiring strategy.

The scope of this project ranges across the whole data science workflow, from raw data acquisition to reporting of the results. The list below highlights the individual milestones of the project, which we will discuss in detail later on:

Compiling a reasonably sized dataset of website links with the corresponding industry branch and the technologies used to run the web presence
Wrangling the data into an accessible format that can be used for further analysis
Creation of an interactive web-based report featuring self-service analytics that can be used by decision makers

Gathering the dataset

If you ask data professionals about their favorite task, the answer will most certainly be scraping of unstructured web data, which can be considered the modern equivalent of working in a 19th century coal mine. Of course, web scraping was also the starting point for this project, with our first contact point being a well-known public database of Austrian companies. The idea here was to leverage their already existing categorization of websites into industry branches, which we can use for example to spot possible differences in technology choice. Maybe a web agency is specifically targeting retail businesses and want to exactly know their prevalent technology choices. By using the categories, we can add another layer of granularity to our dataset and answer such questions.

Gathering the data is done with a Python script that utilizes several libraries, including requests, BeautifulSoup, Wappalyzer, and fire. It uses the requests library to make an HTTP GET request and checks the status code of the response. If the status code is 200, indicating a successful request, the HTML content is returned. If the status code is 404, the script prints a message and skips the site. For status codes 400 and 500, the script continues the loop to retry the request. This method ensures that the script handles different scenarios when fetching web pages.

Overall, the script is designed to scrape multiple pages of each industry branch from the company database until a threshold of 1000 websites including technology stacks are gathered for each branch. The method iterates over the pagination, gets the URL for each company profile, and fetches the HTML content like discussed above. It then uses BeautifulSoup to parse the HTML and extract the actual website link for a company from the profile page. If a website link is found, the script uses the Wappalyzer library to analyze the technology stack of the website. The results are then written to a CSV file.

We always check if the actual website link returns a 200 status code so that we do not scrape broken links. Additionally, the fire library is employed to create a command-line interface for the script to allow running multiple instances of the script in tandem. Theoretically, we could use multiple threads, but having a dedicated terminal and status bar makes the whole process more manageable. To facilitate comprehension, I included a diagram of the scraping process, which can be seen here.

Wrangling and database creation

At the end of the data acquisition, we now have ten different CSV files (one for each branch), with each one containing 1000 websites and their technologies. The next steps of the project are concerned with bringing the raw data into a suitable format for the final analysis and visualization. First we load each of the ten CSV files and convert them into a single Pandas DataFrame which looks as following:

If we examine the “Techstack” column, we can see dictionaries containing the technologies. These dictionaries are thereby stored as strings. Now we have multiple possibilities to format our data, for example we could split our dictionaries and store each link, branch and technology combination as a distinct record in the DataFrame. Considering that we will also enrich our data with additional categories for the technologies and that a technology can have multiple categories, we quickly get a lot of redundant data to store. If we flatten out all possible combination, we blow up our DataFrame from 10000 to around 60000 records.

This would not be an issue for this small dataset, but if we want to incrementally add new data in the future, this would lead to issues at some point in time. To alleviate this problem, we will create a dedicated SQL database in third normal form to minimize the storage of redundant data. This database will then serve as the basis for our analysis and the web-based report. The ER-diagram below describes the schema that we will implement:

We create the database with the Sqlite3 library starting by creating a schema and following we fill the “Website”, “Technology” and “Category” table. The next step is also relatively straightforward, where we use loops and checks to fill up the so-called join tables named “Website_Technology” and “Technology_Category”. If you ask why the schema is set up in this form, consider the following example: A website has jQuery, WordPress and PHP in their technology stack. Regarding the categories of each technology, we have “JavaScript” library for jQuery, “Programming Language” for PHP and for WordPress we have two categories, namely “Blog” and “CMS” (Content Management System).

Now instead of creating an individual record for each possible combination, we just fill the “Website”, “Technology” and “Category” table with the respective entities and then create an entry for each relation in the join tables. So going back to our example, we would create three entries in the “Website_Technology” table, with all entries pointing at the website and each entry pointing at one of the technologies. Now we do the same for the “Technology_Category” table, and we can effectively reduce redundant data storage to a minimum. Additionally, we get the convenience of using SQL for our analysis in the next step

Deploying the Report

To follow suit with current state-of-the-art solutions for data visualization like Power BI or Tableau, we will create an interactive report with some basic self-service analytics. To do this, we use the Streamlit framework, which is optimized for fast development of data-apps with Python. The architecture of the web application is thereby relatively straightforward, because we basically connect standard front-end components like sliders, drop down menus and forms with SQL commands that interface with the database. With the additional integration of plotting capabilities in Streamlit, we can then build very neat data dashboards in a reasonable time. A quick overview of the architecture can be seen below:

Now there is not much left to say about the web app except that it can be viewed with the following link, so please take a look for yourself. -> Techstack Analysis Report

Conclusion

In this article, we looked at a project that encompassed the whole data science workflow, from gathering raw data to the creation of an interactive report that could be used by decision makers to guide their strategies. Such a project features a lot of different technologies and techniques that must be orchestrated to deliver a usable final result. Nonetheless, this usage of different tools for creative problem-solving, is probably the most exciting part of data science. As always, the code for this project can be found on GitHub in the following repository.

Topic Modelling with LLMs

Marco Schweiss — Fri, 01 Dec 2023 18:15:36 +0000

The field of natural language processing has witnessed revolutionary advancements with the introduction of large language models (LLMs) such as ChatGPT and GPT-4. This has prompted an exploration of how these LLMs can be utilized in topic modeling, an essential technique for comprehending extensive textual corpora. The aim of this project was to examine the trade-offs between traditional statistical algorithms and LLMs for topic modeling. Additionally, the goal was to develop a new approach for topic modeling utilizing the GPT model family, while providing recommendations on when to employ each method. A thorough literature review was conducted to establish the theoretical foundation.

Subsequently, an experiment was designed to compare the two approaches across three diverse datasets containing different types of documents, namely news, abstracts and tweets. Topic coherence and diversity metrics were employed for quantitative assessment, complemented by qualitative evaluation through GPT-4. Traditional models were found to outperform the LLM-based approach on the news and abstract datasets. Conversely, both qualitative and quantitative analysis demonstrated that the novel LLM approach yielded superior results on the tweet dataset.

It was determined that traditional topic models are better suited for larger datasets, providing a higher-level overview of topics. In contrast, LLM-based approaches excel at generating granular and interpretable topics from smaller datasets. These findings offer practical recommendations regarding strategy selection in topic modeling applications.

To read the full article please refer to this PDF file.

Discovering In-Demand Skills in Austria’s Data Science Job Market

Marco Schweiss — Sun, 29 Oct 2023 10:35:34 +0000

In today’s rapidly evolving world of data science, staying ahead of the curve and understanding market demand is crucial. Aspiring data scientists often wonder which skills to focus on developing and what kind of compensation they can expect. In a recent project, I scraped public job postings in Austria to uncover the most sought-after skills and explore salary ranges for data science-related positions. Join me as we unveil intriguing insights into Austria’s data science job market.

Scraping Data

The first step of the project was to collect a dataset of job postings from an online resource, in this case the online job board Karriere.at. The goal was to capture valuable information like required skills and associated salaries, allowing us to gain actionable insights. To start the data collection, I first searched for the term “data science” on the job board, which yielded 310 results at the time I started the project.

The next step was to simply copy the HTML source code of the website, which contained all the links to the individual job postings. By leveraging the Requests and BeautifoulSoup Python libraries, I first extracted the links, then scraped the HTML and extracted the text for each listing. This resulted in 310 text files containing the job description in an unstructured format.

Extracting Skills and Salaries

To extract the skills and salaries from the scraped job postings, I employed OpenAI’s Chat Completions API with the GPT-3.5-Turbo model, or for short, ChatGPT. The large-language model allows us to effectively extract the information of interest from the unstructured data via a well-crafted prompt. The skill and salary extraction is thereby split into two tasks, so the first processing of the postings only extracts the skills, and the second run extracts the salaries. The image below shows the unstructured text data and the extracted skills and salary plus salary period, which will be used for further analysis.

Data Cleaning and DataFrame creation

Once the necessary information was collected, I transformed it into a structured format suitable for comprehensive analysis. This involved organizing the extracted skills and their associated salaries into a clean DataFrame, ready for deeper exploration. To clean up the skills, the first step was to tokenize the comma-separated list returned by the Chat Completions API. Following was lowercasing, removal of spaces, hyphens and special characters, and also the removal of company names like Microsoft and Google. All these steps help to basically reduce the noise in the data that results from different wordings. For example, by looking into the data, I saw a lot of different wordings for Azure, namely MS Azure, Microsoft Azure and microsoft azure which all denote the same skill. Finally, employing all these cleaning steps resulted in a list of 785 different skills, which were consequently used to create a one-hot encoded DataFrame for further analysis.

Analyzing Sought-After Skills

With our properly structured dataset at hand, we can now delve into analyzing the most sought-after skills in Austria’s data science job market. This is done by simply counting the rows that contain the value “true” for the respective skill and sorting in descending order. If we look at the bar plot below, we can identify key skills that repeatedly cropped up across numerous job postings.

Revealing Salary Ranges

In addition to uncovering the essential skills sought by employers, we also examine the salary ranges offered for data-related roles in Austria. By creating a box plot of the provided salaries, we can see the distribution of salaries over the different data-related roles. Here we get a mean and median of around 49,000 euros.

Insights and Limitations

Throughout the analysis, several noteworthy findings emerged. The first finding is that Python and SQL are by far the most sought-after skills, with around 50 percent of the jobs demanding these two skills. Looking further at place three, we can see that there is a demand for Power BI, which is a business intelligence product offered by Microsoft. Following, we have a mixture of several programming languages and technologies. If we look at the salaries, we can see that we have several outliers, especially towards the lower end, which are probably a result of an error from the salary extraction pipeline or postings that give hourly wages that are not accounted for. The same is true for the skill extraction, which is comprehensive but not perfect. This could be accounted for by fine-tuning a model for this task. Nonetheless, given the zero-shot prompt approach, I am very happy with the results, which coincide with common knowledge about the field of data science and related job roles.

Conclusion

Our data analysis project sheds light on the current landscape of Austria’s data science job market by uncovering sought-after skills and highlighting salary ranges. According to our analysis, a well-rounded approach to acquiring skills would first require a solid understanding of Python and its usage in data-related scenarios. The following is a study of SQL and the concept of relational databases. A business intelligence solution like Power BI further adds to the skill set, which is now more focused on data analytics than data science. In general, the results of our analysis state a higher demand for data analytics than data science.

Alternatively, or to further expand one’s skill set, the usage of the Microsoft Azure platform for data-related workloads would be a great addition. Programming languages like Java, JavaScript, C#, or C++ and version control with GIT are more in the realm of software engineering. Armed with this knowledge, aspiring data scientists can effectively prioritize their skill development efforts. As technologies evolve and new demands emerge, it’s crucial to stay informed about market dynamics. The code for the analysis can be found in this repository.

Building a Private AI Tutor for a Microsoft Learn Course

Marco Schweiss — Wed, 25 Oct 2023 09:56:44 +0000

In this article, we will look into one of my recent projects, which is the creation of a specialized private AI tutor for a Microsoft Learn course for the Power BI certification. Given the results of the job market analysis discussed in my most recent article and the fact that I already wanted to refresh my Power BI knowledge, this is the perfect opportunity for this project. The overall aim was to provide an interface where users could ask questions and receive accurate and relevant responses based on the content of the course. We will now look into the details of the implementation.

Data Collection

To build the knowledge base for our AI tutor, we will utilize Python along with libraries like Selenium and BeautifulSoup to scrape all necessary information from the web. By converting each section of the course into individual text files, we can create a comprehensive database to serve as our tutor’s knowledge repository. Thereby, we have to utilize Selenium, which allows us to retrieve HTML code that is dynamically generated by JavaScript. We basically retrieve the raw HTML of each page and extract the actual content with the help of BeautifulSoup. Following, we clean the data by removing links to quizzes and interactive exercises in the course and splitting up the raw text into individual chapters by using the heading structure of the HTML.

This follows the same logic as the PDF conversion pipeline that was used for the ChatGPT flashcard generator and document summarizer, where the headings are used to split up a larger file into chapters. This semantically correct split of the text is important for the following steps and improves our results. Running our scraping pipeline results in the retrieval of 196 pages of course content, which are split into 528 text files that ultimately serve as our knowledge base.

Implementing the AI tutor

The core of our AI tutor is built with the LangChain library, which is a framework that facilitates building applications with large language models (LLMs) like those from the GPT model family. To enable our AI tutor to effectively process and retrieve information, we need to load and prepare the data before we can make an inference that returns an answer to our questions. This pipeline follows a step-by-step process, which is as follows:

Loading the documents
Creating word embeddings
Storing the embeddings in a vector database
Retrieving documents with a similarity search
Passing the retrieved documents to a LLM

Steps 1 to 3 are only done once at the beginning, and steps 4 and 5 are repeated for every question that we ask.

Loading Text and Storing Vectors

Loading the documents is straightforward and done with LangChain’s DirectoryLoader class. Thereby, the functionality of the class basically iterates over the files in a directory and stores them in a list. The more interesting part happens when we convert the raw text documents into embeddings in the next step. Embeddings allow for the creation of a vectorized representation of text, enabling us to analyze and compare pieces of text in the vector space. This is particularly valuable for conducting semantic searches, as it allows us to identify texts that are most similar based on their vector representations. For our project, we use the OpenAIEmbeddings class also from LangChain, which serves as our access point to OpenAI’s embedding service. The last data preparation step consists of storing the embeddings in a vector database, which allows for efficient retrieval. For our project, we use Faiss, a vector database developed by Meta.

Querying Process and Answer Generation

Once our database is ready, we can input questions or queries into our AI tutoring system. Utilizing similarity search algorithms, the vector database retrieves relevant documents based on query similarities. This retrieval process enables quick access to specific sections of course material that may hold answers or explanations related to our inquiries. Upon retrieving relevant documents from our knowledge base, we can then employ a GPT model to formulate accurate answers in response to user queries. The LLM ensures that we receive high-quality and contextually relevant responses tailored specifically to our questions. For better comprehension, we will now walk through the process step by step.

First, we set up the question, which will be “How can I share reports with Power BI?” or in German “Wie kann ich mit Power BI Berichte teilen?”. We first use this question for a similarity search in the vector database, which retrieves the following four documents:

You can see that three of the four documents are very relevant to our question about sharing reports, which means that the similarity search worked well. Here is also the proof that our initial semantically correct splitting of the documents paid off, because otherwise we would retrieve a lot of noise from the database. For example, a common strategy for this type of application if we do not have a structure to build from is to use a fuzzy approach where a large document is split at defined, overlapping word intervals. The LLM that interprets the retrieved documents can certainly handle this, but it is always better to clean the input data as much as possible to improve the accuracy and stability of the output.

As our last step, these four documents are now passed to the GPT model with the instructions to formulate an answer to our question based on the context of the four documents. LangChain handles the prompt creation and the API call for inference, and we get the following answer to our question:

The answer is pretty solid and very relevant to our question. Another very useful feature is that the hallucinations of the LLM are kept to a minimum with this approach. Hallucinations are incorrect answers that the model comes up with. For example, if we ask the model, “What is a squirrel-cage rotor?” or in German, “Was ist ein Kurzschlussläufer?” we get the following answer:

Conclusion

The development of the specialized private AI tutor for the Microsoft Azure Power BI course introduces an innovative way of engaging with educational content. By leveraging the power of Python, web scraping techniques, and advanced libraries like LangChain and GPT models, we successfully built an intelligent tutoring system capable of providing personalized and accurate responses. The biggest advantage of this type of system is definitely the increased speed of information retrieval. We basically built a specialized search engine that dynamically compiles the search results in the most targeted and accurate fashion possible, providing a direct answer to a question.

Overall, LangChain and similar frameworks provide a novel approach to utilizing state-of-the-art LLMs and have spawned a new wave of sophisticated AI applications. The system that we built here with a few lines of code can be leveraged in a lot of different scenarios. For example, the most recent generation of customer support chatbots are probably all running on the same or at least a similar pipeline. A knowledge base is set up with very relevant information about a company’s products or services, and customer questions are used the same way as we used for our AI tutor. The same system can also be used for company internal knowledge, interacting with user manuals, online documentation, or research papers. Although the responses are not always perfect and one should always check the source, which can be printed with the answer, it feels like we are moving into an era of real AI integration into a vast number of applications. The code for this project can be found on GitHub.

PDF Summarization and Flashcard Generation with ChatGPT

Marco Schweiss — Mon, 17 Jul 2023 13:42:35 +0000

In an era where information overload is prevalent, finding efficient ways to extract key insights from vast amounts of data has become crucial. As an aspiring student navigating the sea of academic papers during my time at university, I often found myself spending countless hours reading and analyzing research articles. Motivated by the desire to streamline this process, I embarked on a journey to develop a PDF summarizer using ChatGPT. Additionally, I wanted to leverage ChatGPTs language capabilities to automate the process of generating flashcards from the learning material.

By leveraging the power of OpenAI’s state-of-the-art language model, I aimed to create an automated tool that would help me and others extract essential knowledge from lengthy scholarly texts. The primary objective was to facilitate efficient research and enhance the learning experience for students like myself.

In this article, I will delve into the intricacies of building a PDF summarizer and flashcard generator with ChatGPT. I will discuss the underlying methods and the challenges encountered along the way.

The Portable Document Format

PDF, short for Portable Document Format, has become one of the most widely used file formats for document sharing and distribution. Its popularity is attributed to its ability to preserve the original formatting of a document across various platforms. However, this format presents unique challenges when it comes to extracting information and transforming it into a structured format.

The main hurdle lies in the fact that PDFs store data as a series of graphical elements rather than plain text. While these graphical elements maintain the visual appearance of the document, they pose difficulties for programs aiming to extract meaningful content automatically. Directly accessing and parsing text from a PDF can be complex due to variations in layout, fonts, and non-standard encoding schemes.

Another challenge is that PDFs often contain complex structures such as tables, headings, footnotes, citations, and images. These elements require specialized algorithms and techniques for accurate understanding and extraction. Moreover, despite efforts towards standardized file formats like tagged PDFs (PDF/A), many documents are still produced without such metadata, making it even more challenging to extract relevant information systematically.

To overcome these obstacles and bring the data within PDFs into a structured format, one must employ smart techniques such as optical character recognition (OCR) to convert graphical elements back into text data. Additionally, algorithms need to analyze the layout structure to identify headings, sections, paragraphs, and other textual components accurately, which brings us to the PDF Extract API from Adobe.

PDF Extract API

The PDF Extract API is a cloud-based service to automatically extract content and structural information from PDF documents. It can extract text, tables, and figures from both native and scanned PDFs. When extracting text, the API breaks it down into contextual blocks like paragraphs, headings, lists, and footnotes. It includes font, styling, and formatting information, providing comprehensive output. Tables within the PDF are extracted and parsed, including contents and table formatting for each cell. The API can also extract figures or images from the PDF, saving them as individual PNG files for further use.

We will leverage this API to create a JSON file containing the PDF content and in turn, use this JSON file to create text files that we will feed into ChatGPT for further processing.

The Problem with Summarization

Text summarization is a natural language processing technique aimed at condensing large bodies of text into shorter summaries while still capturing the essential information. With the rise of transformer models like ChatGPT, extractive and abstractive summarization methods have seen significant advancements.

However, when dealing with long-form content such as PDF documents, transformers face challenges due to their limited input length. Transformers typically have a maximum token limit, necessitating the truncation or splitting of longer texts. This approach can lead to the loss of important context and affect the quality of the summary.

To address this limitation and preserve information integrity, we will leverage the headings present in the PDF documents. Headings provide structural organization to the text and can serve as natural breakpoints for splitting the document into smaller sections. By splitting the document based on headings, we can ensure that each section is more coherent and self-contained, resulting in better summarization outcomes.

PDF to JSON

As already explained the first step of our pipeline is the conversion of a raw PDF into a JSON file and the extraction of tables and figures. We will use the paper Attention is All You Need by Vaswani and colleagues which is an important paper about transformer model architecture with over 80 000 citations for testing our approach. If you open the PDF you can see that we have a perfect candidate with a variety of elements such as headings, sub-headings, lists, math formulas, images, footnotes, and tables.

For a detailed explanation of the JSON files structure please refer to the documentation of the API provider. Sending the PDF to the Extract API returns the following files:

From here on out we are confronted with three distinct problems:

How do we get the information in the JSON file into a format suitable for our use case?
How do we integrate the tables into the text files?
How do we handle the figures?

JSON to Chapters

We will start by solving problem one which is the most straightforward to solve. By using the documentation we can iterate through the JSON file and transform the data to our target format, which is a folder of text files each corresponding to a section. A section is thereby everything that is contained between two headings.

The second problem of integrating the tables back into the text files is also relatively easy to solve. When the JSON file indicates a table in its path attribute we read the table into a DataFrame, convert it into a CSV, and instead of saving the data to a file we save it inside our text file. By using | as column separators, ChatGPT is capable of understanding that this should be a table.

The third part on our list is where it gets a bit more interesting. Let’s look at the following two images from the figure folder. Here is the first image

and here is the second image:

Given that this application should also be used for generating flashcards, we need a way to tell apart diagrams and formulas and also to incorporate mathematical formulas into our text. We will omit the diagrams for the time being.

Training a Formula Detection Model

To solve the issue we will train a CNN on a dataset that I compiled from various different sources. We basically need a model that can distinguish between math formulas and everything else that could be included in a PDF. This can range from figures and diagrams to illustrations and images of everything else. The approach to training this model is the same as the one that I have already used for the facial emotion classifier. Given the relatively simple task for the modified VGG19, because the two image categories are relatively distinct, we get an accuracy of around 99% which will have to suffice.

Now we integrate this model into our initial script and every time the JSON file path attribute indicates a Figure, the model is called to predict whether we have a formula or not. Depending on the decision the figure is either skipped or sent to the Mathpix API which returns us a Latex representation of the formula. The Latex code is then included in the text where it can be interpreted by ChatGPT. An example of a text file can be seen below:

Chapters to Content

With the preprocessing of the PDFs done, we can now move on to the second big part of the pipeline which is the summarization and the creation of flashcards. As I have already mentioned we will use ChatGPT’s capabilities to do this. By using ChatGPT’s API we can improve our results by providing examples of how we want the task done by sending context in the form of a preceding conversation. To convert the information from the raw text files into high-quality summaries and flashcards we will incorporate an intermediate step that chunks the raw text data into blocks of related information. The image below shows the text of chapter eleven in a chunked format.

After chunking we then let ChatGPT create a summary

and a list of flashcards:

The image does not show all the flashcards that were created. The complete text file can be downloaded here. As you may have noticed, the summary of the section is not really that much shorter than the original section. We will discuss how different lengths of text effects the summaries and the quality of the outputs later on. We also instruct ChatGPT to create flashcards from the summaries which gives us the following list:

Content to Documents

The last step of the pipeline is to convert the text files into documents that can be further used for studying. We will first look into the CSV files that can be used in flashcard software like Anki. Here we will create two CSV files one for the flashcards from the chunked files and one for the summary flashcards.

In addition to the question and answer that makes up a flashcard, we also include the source (document name), the section heading, and the two combined inside a tags column, which is a sorting mechanism used by Anki. Given the correct settings inside Anki, we can import the 238 flashcards from the chunked text and the 113 flashcards from the summaries. An example of how this looks inside Anki in its final form can be seen below:

The second type of document which we will create is two Word documents that contain the chunked sections and the summarized sections.

Discussion of the Results

Does our pipeline here perform flawlessly? No. Is it enough so that I could save a lot of time on tedious work by hand? Yes, absolutely.

Summaries

First of all, let’s talk about the rate of shrinkage that a document undergoes. The size of the resulting summary is dependent on how large the individual sections are and I think the GPT model handles this very well because larger sections get condensed more, and smaller sections that contain less information to begin with, do not get summarized too much. As you can see from our example, we have around 30% percent shrinkage of the original text, if we only consider the actual text as graphics, references and the abstract have been removed.

On the other hand, I tested this pipeline on my written university lectures that have several hundred pages and here I often got a reduction of around 50% or more, of course depending on the structure of the document. So a rule of thumb is that a lot of text without much headings and graphics gets condensed more and a document with sparse text gets condensed less.

The summaries that this pipeline generates can also be used as input for more summaries and solves the problem of too large input contexts, of course only if the original document is not too large to begin with. Here is an example of a one-page summary based on our five-page summary, that could be used for example to speed up a literature review :

Another idea would be to build a recursive summarizer where we first summarize sub-sections and then summarize these summaries into section-level summaries. Similar like OpenAI did in their case study. Regarding the quality of the summary I can of course only give you my subjective opinion as there is no standardized process to measure the quality of a summary. First I would like to explain how I use summaries in my overall approach to studying for a subject. The summary is used to get a first understanding of how the topics of a subject relate to each other and effectively reduces the time to do so by omitting detailed information.

So in essence you can focus on building a higher-level overview of how topics relate and care for the details of a specific topic later on. Given that I studied successfully for several exams with this approach I am personally satisfied with the quality of the summaries and the time that the automated generation saved me.

Flashcards

Next, we will look into the flashcards. As you may have noticed, the flashcards are very granular which means that practically every bit of information is transformed into a question-and-answer format. The idea here is to curate the flashcards during the revision process, so you have built an understanding of what the learning material is about with the summaries, then studied the material in detail with the chunked and original text and finally revise it with the flashcards.

During this revision process, you can then remove or rephrase sub-optimal cards and also check gaps in your existing knowledge. Additionally, you can evaluate the cards during studying which also can be considered a form of interleaved retrieval practice and results in better learning outcomes at least in my case. And last but not least the automated flashcard generation reduced the time it took to study for an exam, as I found curating the cards is faster than writing them from scratch. Of course, you can also just study the summary cards and call it a day.

Conclusion

In conclusion, the use of the PDF summarization pipeline built with ChatGPT has proven to be a game-changer in reducing study time and eliminating tedious handwriting work for me. The possibility for models like ChatGPT to act as a multitask-learner with minimal input is really fascinating and allows for a vast amount of use cases like this one. It is important to note that I use the GPT-3.5 model although GPT-4 is already available because the price per token demands it. A larger document with 100-200 pages costs around 2-3 $ at this time with the GPT-3.5 model and would cost 10 times as much with GPT-4.

I already did tests with GPT-4 but the capabilities in regards to the transformation of text are not that much different between the two models to justify the costs. This also coincides with what OpenAI reported about the difference between the two models. Given the rapid advancements of this technology and the large number of tasks that can be solved with large-language models, I am looking forward to years of interesting application development. This time there will be no code on GitHub, if my coding skills interest you please refer to my other works.

Training of a Spam Classification Model

Marco Schweiss — Fri, 14 Jul 2023 09:14:55 +0000

The rise of short message communication in recent years has provided individuals and businesses with a fast and interactive way to communicate with their peers and customers. However, this has also led to an increase in the number of spam messages, particularly on publicly available channels. In response to this, we will build a classification model that can identify if a message is spam and allows for an adjustable risk level, ensuring that low-risk messages are immediately displayed while others are moved to a folder for further analysis. We will base our model on this dataset.

To achieve these objectives, this project will follow the CRISP-DM framework, a widely used methodology that provides a structured approach to planning a data science project. We will cover all the necessary steps, including business understanding, data understanding, data preparation, modeling, evaluation, and a suggestion for deployment.

Organizing the Project with CRISP-DM

The first step in the CRISP-DM methodology is the Business Understanding step, which involves understanding the business objectives and requirements of the given task. In this step, we need to define the goals and objectives of the project and identify the stakeholders. This will help us to define the scope of the project and ensure that we deliver a solution that meets the needs of the stakeholders. We will base this project on a fictitious company that has tasked us with building a spam filter for their open channel to enable effective communication with their customers while maintaining message quality.

Business Understanding

For this project, the business objective is to build a spam filter for a new open communication channel that the company wants to install for its products. The aim is to enable fast and interactive feedback for their customers or possible future customers. The spam filter should be able to identify spam messages from short messages and allow for adjustments with respect to the risk of allowing spam to pass. This means that the spam filter should be adjustable via different spam-risk levels, e.g., low-risk (very restrictive) and high-risk (not restrictive). The messages passing the low-risk level are immediately displayed, while the other ones are moved to a folder for further analysis.

The stakeholders in this project are the company, the service team who will be responsible for using the spam filter to filter out spam messages, and the customers, who will be using the open channel to provide feedback or ask questions on the company’s products. It is important to keep in mind the expectations of these stakeholders while developing the spam filter so that we can deliver a solution that is effective and meets the needs of all groups. For the company the emphasis lies on not losing customers from misclassified messages, the service team wants a system that simplifies their daily work and the customers do not want to read too many spam messages, although a lost customer will probably hurt the company more than a spam message entering the channel.

In addition to the primary objectives of the project, it is also important to consider any constraints and assumptions that may impact the project. For example, we could have limitations in terms of the resources available to us, such as the amount of data that we have access to or the computing power required to train and test our models. We may also have to make assumptions about the types of messages that are likely to be classified as spam and adjust our approach accordingly. By considering these constraints and assumptions, we can ensure that we deliver a solution that is realistic and feasible given the resources available to us.

Constraints and Assumptions

Based on the dataset and the resources available, we can identify the following constraints:

The dataset only includes English messages, so the model will only be able to classify English messages accurately.
The dataset has a limited number of messages, which may not be representative of all types of messages that the company may receive.
The analysis and model training will be done on a home desktop PC, which may have limitations in terms of computational power and memory.

and assumptions:

The messages in the dataset are correctly labeled as ham or spam.
The dataset is a representative sample of the types of messages that the company may receive.
The spam filter will be used in a similar context as the dataset, so the model will generalize well to new messages.

Data Understanding

To start the data understanding step, we will first load the dataset and check for any missing values and duplicates. We will use the Pandas library to load the CSV file containing the messages and perform basic exploratory data analysis (EDA).

From the above output, we can see that there are no missing values in the dataset, but there are 403 duplicate records. We will remove these duplicate records from the dataset and look at the number of resulting unique records.

Next, we will explore the distribution of the target variable label with the help of Matplotlib and Seaborn. Additionally, we look at the distribution of the character count of the messages.

From the first visualization, we can see that the dataset is imbalanced as there are more ham messages than spam messages. This must be accounted for before we train our classifier.

Plotting the distribution of the message lengths, we can observe that spam messages tend to have longer message lengths compared to ham messages. Further looking into the number of digits and special characters reveals that spam messages tend to have more digits.

Another interesting feature that we will explore is the number of unique words per message type

and lastly, we plot word clouds to spot possible differences in vocabulary.

With this, we conclude the EDA and move further on to the data preparation step where we will balance the dataset and choose the most interesting features for training the model. The engineered features for each message will be message lengths, the number of unique words, and a TF-IDF representation of the vocabulary. We will omit the special character distribution as ham and spam messages are very similar in this regard.

Data Preparation

For the next step, we perform data preprocessing and feature engineering on the SMS message dataset. We load the data from the source CSV file, remove duplicates, and create new columns such as message length, number of digits, number of unique words, and a lemmatized version of the messages. The text data is then transformed into a matrix of TF-IDF values using the TfidfVectorizer class of the scikit-learn library. The resulting features are saved along with the trained vectorizer. Additionally, the dataset is balanced by sampling an equal number of ham and spam messages. The target variable is encoded as 0 for ham and 1 for spam and finally, the prepared dataset is saved as a new CSV file.

Model Training

For training, we first load our prepared dataset of messages and split it into training and testing sets. Then we perform a grid search with cross-validation to find the best hyperparameters for our classifier which will be a decision tree. Additionally, the feature importances are retrieved from the trained classifier, and a barplot is created to visualize the top 10 most important features.

Examining the plot, we can see that the model clearly favors the number of digits in a message as a decision factor. Following we have specific words and our other engineered features.

Error Analysis

After training our model, we will now further analyze the performance. To do this we compute a confusion matrix on our training and test datasets.

Additionally, we also retrieve accuracy, precision, and recall for each dataset respectively.

From the metrics, we can conclude that the model performs sufficiently well on our initial dataset. Additionally, we will look into the distribution of misclassified messages.

With this, we can assume that the model is not particularly biased toward one of the message types. Finally, we will test the model on 10 new messages that were generated with a language model and consist of 5 ham and 5 spam messages. The test results can be examined in the analysis Jupyter Notebook and revealed a reasonable out-of-sample performance and generalization. Although the emphasis of the model on digit counts makes spam messages with no digits and ham messages with more digits hard to classify correctly.

Deployment

To integrate the spam classification model into the service team’s workflow, we could develop a web application using Django. The steps involved in creating and deploying this system would be as follows:

Set up the Django project structure and configure the necessary settings.
Design the database schema to store messages and their classifications.
Set up an API endpoint that receives incoming messages, assuming we have the possibility to control message flow from our communication channels’ backend.
Implement the web interface using Django’s template system and HTML/CSS.
Integrate the spam classification model into the project and develop the function to classify messages.
Implement an adjustable spam level control, allowing the service team to set the level of stringency for classifying messages as spam. The level would be a percentage threshold that can be used for comparison with the probabilities of class membership that the model returns in percentages.
Deploy the Django application on a cloud platform.
Conduct thorough testing and quality assurance to ensure functionality and performance.
Provide documentation and training to guide the service team on using the web application effectively.
Maintain and improve the system based on feedback and ongoing monitoring.

Conclusion

The objective of this project was to build a classification model capable of identifying spam in short messages. The CRISP-DM methodology was followed, covering business understanding, data understanding, data preparation, modeling, evaluation, and a suggestion for deployment. The dataset’s quality was assessed, and findings were visualized for better comprehension. A decision tree classifier was created, trained, and tested using the provided dataset and additional out-of-sample data. An error analysis was conducted to identify weaknesses in the approach, which would be the reliance on digits for classifying spam. Finally, a proposal was made to integrate the model into the daily work of the service team. The solution would be in the form of a graphical user interface (GUI) based on a Django web application.

In summary, this project successfully developed a classification model for spam identification in short messages. The model returns class membership certainties in percentages which can be further utilized to build logic for adjustable risk levels. Further improvements to the model could be made by gathering additional data or data that is more representative of the use case, i.e. short messages of product feedback or questions. Additionally, there is the possibility of engineering alternative features to reduce reliance on the digit count. The project demonstrates the efficacy of data science methodologies in addressing real-world challenges, offering a valuable solution for the company. The code for this project can be viewed in this GitHub repo.

Emotion Detection in Images

Marco Schweiss — Fri, 14 Jul 2023 08:14:04 +0000

In an increasingly digital and interconnected world, understanding human emotions has become a critical aspect of various industries. Emotions significantly impact our decision-making processes, behavior, and overall experience. Marketing departments are particularly interested in leveraging emotion detection technology to gain insights into consumer preferences, optimize advertising strategies, and enhance user experiences. This project report focuses on the development and implementation of an emotion detection system capable of analyzing images to identify the underlying emotions expressed by individuals.

Conception

To solve this task, I decided to train a convolutional neural network (CNN) with a public dataset consisting of images of faces categorized into seven different emotions. To get a viable solution, the concept of transfer learning will be employed. This should result in a solution with state-of-the-art accuracy regarding the classification of emotions.

Basic Frameworks

The primary frameworks for implementing the emotion recognition solution will be TensorFlow and OpenCV. TensorFlow is an open-source library that makes it easy to train, test, and develop neural networks. The name TensorFlow stems from the use of multidimensional data (tensors) as input data sent through several intermediate layers. These layers can be individually selected to build up a neuronal network architecture. Additionally, the library also contains several pre-trained models which can be used for transfer learning. The second library, OpenCV, which is a computer vision library, features a lot of useful algorithms that will be used to extract faces and manipulate images for the inference step.

Transfer Learning and Model Choice

With transfer learning, it is possible to use an existing model to solve different but related problems to the original purpose the model was conceived for. The model’s pre-trained weights are reused, and the architecture of the original model is slightly altered. Transfer learning conserves time, generally results in better performance, and shrinks the scale of necessary training data sets. The model that the emotion classifier for this project is based on is the VGG19 model. The VGG19 is thereby a deep CNN model, which means that the model consists of multiple layers, including a 1000 neurons dense output layer. This is due to the fact that VGG19 was trained to classify 1000 different categories. To fit the model for the use case of detecting emotions, the original output layers, including the 1000-neuron dense layer, will be swapped with new output layers corresponding to the number of emotions that we want to identify. Additionally, we lock the original weights of the VGG19 model and train only our modified layers on the emotion dataset.

Dataset Selection and Issues

To train the model for emotion recognition, the FER2013 dataset, which consists of around 30000 images of facial expressions, will be used. The images are thereby classified into seven different categories of emotion. The dataset also features around 3500 test images for validation of the trained model. Although the dataset features a reasonable amount of images for training, there is the problem of some categories being under and over-represented. This introduces the problem of bias, as, for example, the Happy category contains around 7000 images and the Disgust category features only around 400 images. To combat this issue, the data will be augmented, for example, by flipping, rotating, or cropping images. Here is the link to the dataset.

Model Training and Inference

The overall process of implementing the solution for emotion recognition can be split up into two distinct parts, the model creation, and the inference part. The model creation procedure consists thereby of augmenting the dataset, setting up the training environment and architecture with TensorFlow, and then starting the training process.

The second step, which presents the actual solution for making a prediction, consists of the trained model, which is loaded into a Jupyter Notebook for inference. The process of inference consists thereby of loading an image into a Jupyter Notebook, extracting the face with OpenCV, preprocessing the image, and passing the image to the trained model for prediction. The prediction is visualized as the image with the classified emotion written on it.

Development

After creating a concept, the next step was to start the Implementation. This PDF explains the development process in detail.

Conclusion

Looking back on the creation of the emotion classifier, I can say that this project was a very pleasant and rewarding learning experience, regarding computer vision and the development workflow of deep learning models. Given the issues regarding data quality, bias-variance problems during training, hardware limitations, and the trial and error of hyperparameter tuning, the task also proved to be more difficult than initially thought. As the saying goes, the devil lies in the detail, and I can now understand why training deep learning models is by a lot of machine learning practitioners considered a mixture of art and science.

Personally, I am satisfied with the final result, although I wished for a bit higher accuracy and more emotions. To improve upon this work, a better dataset like AffectNet which has around one million images could be used to train a model on better hardware like TPUs. Grid-search for finding the best parameters and model structure could also be used to get an overall better performance. The code for this project can be found in this repository.

Writing a CNC Workpiece Positioning Program

Marco Schweiss — Thu, 13 Jul 2023 14:39:45 +0000

The starting point for this next project was a defect in one of our CNC portal milling machines at my workplace, which is used for manufacturing staircase components. The defect did not directly involve the production system itself, but rather the projection laser used for setting up the vacuum clamps and for positioning the workpiece on the milling table.

A conversation with the manufacturer revealed that the projection laser had already been discontinued and there were no longer any spare parts available for it. The only solution would have been a custom-made replacement, as the machine control only had an outdated RS232 interface, which was no longer installed by the manufacturer as a standard feature.

Considering the relatively high cost of several tens of thousands of euros for a new projection laser and the challenging circumstances involved in assembling and disassembling the laser system at around 6m from the ground, I opted for a software solution.

In addition to the aforementioned points, an application for calculating the positioning of the workpiece and vacuum clamps provided a speed advantage, as it could also calculate the necessary zero-point shifts. Normally, the workpiece would be placed at a location determined by the CAM software and then manually adjusted due to the complex shape of the machine table. This trial-and-error process typically consumed a significant amount of time.

Importing the G-Code

As a first step, it was necessary to read the G-code files into the development environment and locate the lines that defined the contours of the workpieces. To do this I leveraged the start and end strings that the CAM software provided. (WANGENKONTOUR) or (AUFG AUSSENK) as the start flag and M00 as the end flag.

Next, I had to extract the actual numerical values from the G-code strings, which would serve as the basis for further computations. To do this I wrote a function that has the task to find any coordinates labeled with an X or Y in the G-code lines and gather them into two separate lists. The plot below shows the contour of the stringer if we plot the extracted coordinates:

Placing the Vacuum Clamps

Let us now look at the real challenge of this project, which was to find a way to position the vacuum clamps on the machine table and how the clamps had to be shifted within the workpiece. My idea was to re-render the contour of the workpiece to fill in the missing values because as you can see in the G-code example above there are big gaps between the points of the original contour. The fine contour is then used to find the position of the clamps inside the workpiece.

My algorithm operates based on the following principle. First, a fictitious line is calculated using the centroid coordinates of the staircase component.

Then, points on the centroid line are sought for the centroids of the two outermost clamps until the distance between the clamps and the contour of the staircase component reaches 4cm.

If we now plot the clamps, this first step looks like in the image below:

The remaining clamps are then placed on the centroid line, offset by the width of one clamp. This results in three clamps on the left and three clamps on the right, as a workpiece is always secured with exactly six clamps.

As you may have noticed, the clamps on the right have a different spacing than the ones on the left. If we plot the machine table beside the stringer it should become clear why this is the case.

It would have been great if the project was finished by now, however, due to the complex machine table, a total of four different algorithms had to be developed since the staircase components can have lengths ranging from 50cm to 6m and sometimes require significantly different positioning of the vacuum clamps within the workpiece and on the machine table. We will look at the different configurations at the end of the article.

Positioning on the Machine Table

After calculating the clamping positions, a suitable spot on the machine table had to be chosen. This calculation also depends on the length of the staircase components. In the next step, the zero offsets and the measurements for clamp and raw workpiece placement are calculated. Finally, a plot is made with the positioning of the workpiece relative to the vacuum clamps and the positioning of the vacuum clamps relative to the machine table.

In addition, the plot, which is output in the form of a PDF in A4 format, shows the necessary zero offsets that are fed into the CNC control panel. A finished computation gives a plot like the one below:

As already mentioned we will also look at the other possibilities of workpiece placement which are showcased in this PDF file.

Conclusion

This project will always have a special place in my heart as my first real attempt at developing an actual useful application. Ironically the project probably features the ugliest code that I have ever written but on the other hand, has also created the most value. The application could prevent a thirty thousand euro investment and has effectively reduced the time it takes to process one workpiece by around 25 percent. Additionally, working on the machine is now less error-prone, and training new employees has become faster.

In the years since I developed the positioning script I always played with the thought of refactoring the code, but given the time it would take to entangle the code and the fact that it really works flawlessly so far, I couldn’t really bother. Of course, you can have a look for yourself, as the code is hosted on GitHub.

Visually Exploring an Ecommerce Dataset

Marco Schweiss — Thu, 13 Jul 2023 12:05:52 +0000

When working with data, it is often helpful to explore it visually first. This allows us to get a feel for the data and identify patterns. There are a few different ways to do this, with the exact approach depending on the type of data to be analyzed. Visual representation of data and information is known as data visualization. This can occur by utilizing visual elements like graphs, maps, and charts. Data visualization tools make it simple for people to research, identify, and visualize patterns, outliers, and trends. Thereby data visualization methods have improved by a substantial degree over the past several years.

However, researchers are often not receiving sufficient training in how they can design their visualizations to best communicate the findings of their analyses. What provides significance to the gathered raw data is data interpretation. Data interpretation is the process of assigning meaning, using various methods to analyze the gathered information, and reviewing the data prior to reaching relevant conclusions.

Regardless of the field, data visualization can be beneficial by providing data in the most comprehensible means possible. Therefore data should be correctly represented so that it can be well interpreted. In the scope of this paper, we will discuss the visual exploration of e-commerce data to illustrate the usage of visual data exploration techniques for gaining insights into an unknown dataset.

Data Storytelling

Data storytelling is a new way to explain data that is both informative and engaging. By using stories, data visualization, and other techniques, data storytelling can help people understand complex data in a more meaningful way. For example, a study employed an empirical analysis of data collected from business analytics practitioners and found that data storytelling competency is linked positively to business performance.

Accurately and quickly communicating and presenting ideas and findings to others has become essential for success in today’s economy. It is now vital to master the art and science of data storytelling using frameworks and techniques to help execute compelling stories with data. It is insufficient to transform raw data only into sophisticated visuals. You also need to be able to communicate numbers effectively in a narrative form. Learning effective data storytelling can help researchers to acquire the aptitudes required to articulate ideas using persuasive and unforgettable data stories.

Additionally, data storytelling continues to be experiencing a growth in popularity in digital journalism with media companies, but work has yet to be made to properly utilize these resources for a more in-depth analysis of what makes a good data story and how they are created.

Exploratory Data Analysis

Processing data, observing abnormalities, and discovering connections are essential not only for science but also for everyday life. Exploratory Data Analysis (EDA) is an approach that employs graphical techniques to examine and interpret a given dataset. The method is used to discover trends and patterns, as well as to verify assumptions with the aid of statistical summaries and graphs.

EDA reduces the weight of traditional views of statistics that are based on randomness, stochastic models, and population parameters. These views are more concerned with questions related to the precision of estimates and the importance of a finding. In contrast, EDA acknowledges the limitations of this conventional paradigm and puts more emphasis on data exploration using whatever methods the analyst deems appropriate.

As already explained, visualizing complex data is beneficial for understanding it throughout the conducted data analysis. However, as the datasets increase in size and complexity, static visualizations are starting to become impractical or impossible. Although there are other interactive techniques that can provide useful discoveries, modern-day data analysis tools normally support only opportunistic exploration that is prone to be inefficient and insufficient.

Possible approaches to mitigate this issue have already been proposed, for example, the usage of systematic yet flexible design goals, which can be considered a reference architecture that seeks to present an overview of the analysis process, suggest ambiguous zones, allows for annotating important states, support teamwork, and facilitate sustainable reuse of established methodologies.

As EDA is used to see what data can tell us before the modeling task, it is a crucial step before getting into modeling your data to solve business problems. Thereby EDA can be employed in a variety of different fields. Sociologists might find it appropriate to use EDA techniques before attempting to specify complex models of social behavior. The psychologist might find applications for these techniques in studies of attitude formation and change. Another example would be the economist who might employ them fruitfully before testing econometric models of either micro- or macroeconomic behavior, and the historian might find EDA techniques useful to develop indicators of historical change.

Almost all data analysts can find the EDA perspective helpful in resolving data analysis problems. Although EDA can be a powerful tool, researchers should be thoroughly knowledgeable about the techniques described in exploratory data analysis.

EDA in E-Commerce

Technology evolves rapidly, and shopping trends shift on a daily basis, putting e-commerce in a constant state of adaptation. Thereby e-commerce analysis is the process of discovering, interpreting, and communicating data patterns from areas related to online businesses. By using online customer data, one can discover the exact alterations in online customer habits and purchasing trends. More intelligent decisions will be made based on data which should result in more online sales being made. E-commerce analytics can also include a wide range of benchmarks relating to the full customer journey, such as discovery, acquisition, conversion, retention, and advocacy.

The aim of most companies is to pursue revenue opportunities from all possible sources, thereby growing their net income by increasing revenues. However, most companies focus on their growth areas where they have built the distinctive capabilities that set them apart from others. Data and analytics are crucial elements for successful performance measurement and management for growth. The company and customers’ relationships offer a plethora of data and data sources that are usually plentiful with information.

Additionally, there are various types of analytical techniques for companies to improve performance, for example, predictive and prescriptive analytics. Predictive analytics employs a variety of statistical techniques, from data mining, predictive modeling, and machine learning, that analyze current and historical facts to make predictions about the future or otherwise unknown events or categories of data that are not yet available. Prescription analytics uses the data to determine and make recommendations about the best next steps to be taken. An example would be how to schedule sales representatives based on the predicted demand for goods and services.

Exploratory data analysis is very important in e-commerce because it guarantees that businesses ask the right questions using the insights from exploring their data and validating their business premise after a thorough investigation. When it comes to business-related research, EDA gives the best merit by helping a data scientist expound on the correct results which equal the required business conditions. Moreover, it allows stakeholders to check whether they have asked the correct questions. Additionally, there are many possibilities available that can help industries incorporate EDA into Business Intelligence software.

Due to the need for exploratory data analysis in various fields such as business, finance, and particularly in e-commerce, it is necessary to study and understand the importance of exploratory data analysis in performing various research in e-commerce. The objective of this paper is to demonstrate the process of visually exploring a dataset by analyzing e-commerce data through exploratory analysis. The dataset used is the Online Retail Dataset which can be found in the UCI Machine Learning Repository. The exploratory analysis is conducted with the Python programming language and popular libraries like Pandas, Matplotlib, and Seaborn, which are employed for handling and visualization of the data.

Data Exploration

We will start by exploring the structure and content of the e-commerce dataset. Our goal is to gain an overview of the dataset and identify opportunities for intriguing data visualization.

Dataset Structure

To get an initial glimpse of the data, we will load the CSV file into a Pandas DataFrame. We will display the first five rows of the dataset and examine its shape, revealing 8 columns and approximately 550,000 rows.

Now we can perform various aggregations and visualizations to explore intriguing characteristics of the data. We will discuss these aggregations and visualizations in detail in the following sections.

Transactions per Country

After examining the overall shape of the e-commerce dataset and establishing that we are dealing with customer invoices, our next step is to analyze the Country column. We are going to perform a quick analysis to determine the number of transactions in each of the 38 different countries present in the dataset. We will then group the transactions by country and print the aggregated data for a rapid overview. This will allow us to identify any notable patterns or trends. This aggregation reveals that there is a heavy emphasis on the United Kingdom in the dataset with around 500000 transactions, whereas other countries count at most 10000 transactions. The following pie chart illustrates this finding:

Exploring the Unit Prices

After examining the transactions per county, our next step will be to investigate the UnitPrice column, which represents the price in sterling for a single quantity of each item in an invoice. Our initial plan is to identify the highest and lowest unit prices in order to determine the range of values.

By analyzing the extreme values in the dataset, we can see that it contains not only regular product invoices but also special types of invoices. These special items have exceptionally large and negative unit prices, which will certainly impact our subsequent analysis. Therefore, we will focus our further analysis more on the actual products themselves.

After detecting these non-regular invoices, we are going to remove all invoices with special descriptions and prices. To obtain a preliminary list of items to remove, we will extract all items with a unit price above 1000 pound sterling.

This step shows that only items with a non-numerical stock code have such high unit prices, while regular items generally have lower unit prices. We will then proceed to remove these items from the dataset. Furthermore, we will decide to exclude all invoices with a negative and zero unit price as they are also considered special and will be excluded from further analysis.

In the upcoming step, we will sort the dataset by unit price and display the items with the highest

and lowest unit prices:

It seems that most of the special invoices are gone, but also that there are still non-regular items in the dataset as can be seen by the item with the stock code S. Given the already excluded items, we can note that the special invoices tend to have a stock code that only consists of letters compared to the numerical-only values or mixed stock codes of the other items. We will now further filter out invoices with stock codes that contain only letters which leaves us with the following list of removed item types:

With this, we conclude cleaning the dataset and also add an additional column containing the revenue that the invoice generated. We do this by multiplying the unit price with the quantity of the respective invoice. With the now cleaned and enriched dataset, we can take a closer look at the individual invoices.

Examining the Invoices Revenue

We are going to take the first step in obtaining a better understanding of the invoice’s revenue by computing the essential statistics:

There is a common consensus in the literature that it always holds true that when the mean is greater than the median, we have a positive skew, and vice versa, we have a negative skew when the mean is smaller than the median. Computing the density plot of the revenue distribution gives us the following graph:

Furthermore, we are creating two additional density plots to obtain a comprehensive understanding of the distribution. The first plot reveals the presence of extremely large outliers in the dataset, resulting in a high variance. Due to the impact of these outliers, we must exercise caution when interpreting the computed statistics, as they are influenced by these extreme values.

An additional intriguing aspect of the dataset is the correlation between the unit price, quantity, and revenue of the invoices. This correlation produces the following heatmap:

The heatmap shows that the unit price is not really correlated with either the revenue or the quantity but that the quantity shows a high correlation with the revenue. With this finding, it can be argued that the revenue is generated by a higher quantity of items per sale. Next, we will look at a scatterplot that shows the count of the individual unit prices:

The plot shows that the majority of the items on the invoices have lower unit prices but that there are also some items with a larger unit price in the hundreds.

Inspecting the Items

After examining the individual invoices, we are now examining the statistics that we can derive by grouping the invoices based on their stock codes, or, in other words, aggregating data for each item. Furthermore, we are enriching the data with various new data points for each item. For example, we are calculating the percentage of returns, cash in sales, and net results.

Through this analysis, we are uncovering interesting insights. We have discovered that the item generating the highest net results is a three-story cake stand, which has generated over 160,000 pounds sterling in revenue. On the other hand, we have also identified the item with the lowest net results, which is a wooden advent calendar generating -45 pounds sterling. These examples highlight a significant bias towards positive net results, which is definitely advantageous for business operations.

To validate our findings, we are creating distribution plots that confirm the aforementioned bias toward positive net results.

Similar to the revenue plots of the invoices, we are also dealing with extreme outliers here, but only in the positive direction, as shown by the extremely long right tail. After getting an overview of the individual items and observing the extreme tails, the next interesting point is to examine how much of the net results are generated by a specific percentage of items. The following graph displays this relationship:

We are attempting to find the percentage of the items, resulting in a value of 6.34 percent. This indicates that the 248 items with the highest revenue, which account for 6.34 percent of the inventory, are generating approximately 50 percent of the total revenue.

Examining the Monthly Aggregations

After we have examined the specific items, we now focus on monthly aggregations in the analysis. In this step, we create a new DataFrame similar to the previous steps. To gain an initial understanding of the monthly aggregations, we create two boxplots that illustrate the distribution of sales and returns per month.

The boxplot of the sales shows that the sales range from just under 300000 to around 750000 with the majority of months having sales numbers from under 400000 to just under 600000. The boxplot of the returns shows relatively constant returns with two massive outliers. Interestingly these two massive outliers are December and January, which can be attributed to the fact that the winter holidays demand a more lenient return policy from the retailers. To further investigate the common saying that the winter season is the most important for e-commerce, we create a line plot of the sales, returns, and net results and a piechart of the net results. The line plot is thereby normalized so that all three datasets share the same y-axis.

We can see that starting with August, the sales and net results start to ramp up and peak in November, with a smaller pullback in December. The returns are relatively constant throughout the year, except in December and January, which are extreme outliers. These findings coincide with other research, which shows that the winter months are the strongest for retail.

Conclusion

Visually exploring a dataset is vital to gaining important insights into a dataset. Like in the examined e-commerce dataset, data exploration can help to understand the data, to find trends, and to make decisions. For example, we saw that most of the transactions happened in the United Kingdom, so it is fair to assume that the retailer operates a UK-based business. Additionally, looking into the unit prices of the individual transactions revealed that the retailer deals with items that mostly are on the cheaper side but nonetheless generates revenue by selling larger quantities.

Also, by examining the net results, there was an interesting finding that around 6 percent of the items generated 50 percent of the net results. Although one year of data is perhaps not enough to make decisions about which items do perform well, if there were more data, it would be possible to reduce the catalog of items to the best-performing ones. This could eventually reduce storage costs and further increase the total results of the retailer. Lastly, by looking at the monthly aggregations of the transactions, it could be established that the winter months generate the most revenue and that the retailer should focus the business on these months.

Although a lot of this information can be retrieved by looking it up, for example, in the description of the official dataset, the main point of this paper was to show that all this information can be obtained solely by exploring and visualizing the dataset. The Jupyter Notebook containing the data exploration can be viewed in this GitHub repository.

Analysis of a Customer Complaint Dataset

Marco Schweiss — Thu, 13 Jul 2023 10:35:00 +0000

In today’s digital age, businesses rely heavily on customer feedback and satisfaction to maintain a competitive edge. Understanding customer complaints and addressing them promptly is vital for ensuring customer loyalty and improving overall product and service quality. With the rise of online platforms and social media, companies have a vast amount of data at their disposal in the form of customer reviews, tweets, and other sources of feedback.

In this project report, we focus on analyzing a dataset containing customer complaints from Comcast, one of the largest cable television and internet service providers in the United States. By leveraging Natural Language Processing (NLP) techniques, our objective is to gain insights into the key issues faced by Comcast customers and identify patterns or trends within the complaints that can inform business decisions.

Conception

The first step is to create a concept to describe everything that belongs to the data analysis workflow. Given that anything that is overlooked or forgotten in this phase has a negative effect on the implementation later and will lead, in the worst case, to useless results, this step is perhaps the most important of the entire process.

Dataset

The dataset consists of four columns containing the author, post date, rating, and the actual complaint. Additionally, we have around 5500 records which provide enough information to generate relevant insights. The dataset can be found on Kaggle with this link.

Cleaning the Complaints

To get cleaned texts, we will look into three different approaches. First of all, for all three approaches, stopwords are removed with the NLTK library. Then for the first option, the words in the complaints will be lemmatized. For the second option, n-grams are formed, and for the last option, noun phrases will be extracted with the TextBlob library, which can be considered an API to the NLTK library. To get a look at the result of the individual preprocessing options, the WordCloud library is used to create word clouds.

Vectorization

For the vectorization step, the Bag-of-Words and TF-IDF algorithms will be used. Here the three preprocessing options can be combined with each of the two vectorization methods. In this analysis, the two algorithms implementations of the Sklearn library will be used.

Topic Modeling

In the last step of the analysis, either LSA or LDA will be used for topic modeling. Here the idea is to get the most prominent words for a number of generated topics which can then be interpreted. As for vectorization, the Sklearn library is used for this step. Given the aforementioned options, the analysis will consist of 3 * 2 * 2 = 12 variations that can be compared.

Development

In this phase, we will conduct the data analysis based on the developed concept. This PDF shows the implementation step-by-step.

Results and Analysis

For the discussion of the results, we will first look into the quality of the customer complaint dataset. The author claims that the complaints contained in the dataset are obtained with web scraping from a public source. Given that the source is not disclosed, there is no way to check where the data was obtained. Nonetheless, the dataset with around 5500 entries was deemed suitable for this project, as the source of the data does not have an impact on the methodology presented in this project. As already mentioned before, the analysis resulted in around 64 topics that could be interpreted. All in all, there were five interpreted topics that could be derived from the analysis: Problems with customer service, problems with internet speed, problems with the internet cable (possible downtimes), complaints about the billing, and problems with the TV receiver.

For possible further improvements of the analysis, there would be three scenarios. First of all, there is the possibility to tweak the parameters of the respective machine learning algorithms. For example, the number of topics to be extracted could be altered, or additional stop words could be removed. The second possibility would be to perform the analysis with different techniques and methodologies. There are a variety of possibilities that could be explored here. For example, to get more detailed insights about the topics of complaint, Word2Vec or GloVe could be employed. Additionally, there is the possibility of using transformer-style models like BERT or the GPT family of models to extract topics from the complaints

Solving this task definitely gave me a deeper understanding of natural language processing techniques and the topic modeling workflow in general. The explored process of preprocessing – vectorization – topic modeling with the employed algorithms can now serve as a foundation from which other methodologies can be tested and compared in future NLP projects.

For possible similar projects in the future, I would definitely try to implement a workflow based on transformer models, especially the relatively newly released GPT-3 model from OpenAI. As I have already experienced the astonishing capabilities of this model regarding its comprehension of language in other projects, it would be interesting to implement a topic modeling workflow with this model. This could possibly lead to more detailed findings of an analysis. The code for this project can be found on GitHub.