Marco Schweiss – BigDataTime

Analysing Website Technologies – A Complete Data Science Project

Marco Schweiss — Fri, 15 Dec 2023 18:18:30 +0000

As the digital landscape continues to evolve at a rapid pace, understanding the technologies that power websites has become increasingly crucial. From front-end frameworks to server-side languages, the array of tools and platforms available can be overwhelming for businesses and developers alike. In this data science project, we will do an analysis of website technologies to uncover insights that can inform strategic decision-making in the ever-changing world of web development. The benefits of such an analysis could be for example a database of possible business leads or having a basis for a data-driven hiring strategy.

The scope of this project ranges across the whole data science workflow, from raw data acquisition to reporting of the results. The list below highlights the individual milestones of the project, which we will discuss in detail later on:

Compiling a reasonably sized dataset of website links with the corresponding industry branch and the technologies used to run the web presence
Wrangling the data into an accessible format that can be used for further analysis
Creation of an interactive web-based report featuring self-service analytics that can be used by decision makers

Gathering the dataset

If you ask data professionals about their favorite task, the answer will most certainly be scraping of unstructured web data, which can be considered the modern equivalent of working in a 19th century coal mine. Of course, web scraping was also the starting point for this project, with our first contact point being a well-known public database of Austrian companies. The idea here was to leverage their already existing categorization of websites into industry branches, which we can use for example to spot possible differences in technology choice. Maybe a web agency is specifically targeting retail businesses and want to exactly know their prevalent technology choices. By using the categories, we can add another layer of granularity to our dataset and answer such questions.

Gathering the data is done with a Python script that utilizes several libraries, including requests, BeautifulSoup, Wappalyzer, and fire. It uses the requests library to make an HTTP GET request and checks the status code of the response. If the status code is 200, indicating a successful request, the HTML content is returned. If the status code is 404, the script prints a message and skips the site. For status codes 400 and 500, the script continues the loop to retry the request. This method ensures that the script handles different scenarios when fetching web pages.

Overall, the script is designed to scrape multiple pages of each industry branch from the company database until a threshold of 1000 websites including technology stacks are gathered for each branch. The method iterates over the pagination, gets the URL for each company profile, and fetches the HTML content like discussed above. It then uses BeautifulSoup to parse the HTML and extract the actual website link for a company from the profile page. If a website link is found, the script uses the Wappalyzer library to analyze the technology stack of the website. The results are then written to a CSV file.

We always check if the actual website link returns a 200 status code so that we do not scrape broken links. Additionally, the fire library is employed to create a command-line interface for the script to allow running multiple instances of the script in tandem. Theoretically, we could use multiple threads, but having a dedicated terminal and status bar makes the whole process more manageable. To facilitate comprehension, I included a diagram of the scraping process, which can be seen here.

Wrangling and database creation

At the end of the data acquisition, we now have ten different CSV files (one for each branch), with each one containing 1000 websites and their technologies. The next steps of the project are concerned with bringing the raw data into a suitable format for the final analysis and visualization. First we load each of the ten CSV files and convert them into a single Pandas DataFrame which looks as following:

If we examine the “Techstack” column, we can see dictionaries containing the technologies. These dictionaries are thereby stored as strings. Now we have multiple possibilities to format our data, for example we could split our dictionaries and store each link, branch and technology combination as a distinct record in the DataFrame. Considering that we will also enrich our data with additional categories for the technologies and that a technology can have multiple categories, we quickly get a lot of redundant data to store. If we flatten out all possible combination, we blow up our DataFrame from 10000 to around 60000 records.

This would not be an issue for this small dataset, but if we want to incrementally add new data in the future, this would lead to issues at some point in time. To alleviate this problem, we will create a dedicated SQL database in third normal form to minimize the storage of redundant data. This database will then serve as the basis for our analysis and the web-based report. The ER-diagram below describes the schema that we will implement:

We create the database with the Sqlite3 library starting by creating a schema and following we fill the “Website”, “Technology” and “Category” table. The next step is also relatively straightforward, where we use loops and checks to fill up the so-called join tables named “Website_Technology” and “Technology_Category”. If you ask why the schema is set up in this form, consider the following example: A website has jQuery, WordPress and PHP in their technology stack. Regarding the categories of each technology, we have “JavaScript” library for jQuery, “Programming Language” for PHP and for WordPress we have two categories, namely “Blog” and “CMS” (Content Management System).

Now instead of creating an individual record for each possible combination, we just fill the “Website”, “Technology” and “Category” table with the respective entities and then create an entry for each relation in the join tables. So going back to our example, we would create three entries in the “Website_Technology” table, with all entries pointing at the website and each entry pointing at one of the technologies. Now we do the same for the “Technology_Category” table, and we can effectively reduce redundant data storage to a minimum. Additionally, we get the convenience of using SQL for our analysis in the next step

Deploying the Report

To follow suit with current state-of-the-art solutions for data visualization like Power BI or Tableau, we will create an interactive report with some basic self-service analytics. To do this, we use the Streamlit framework, which is optimized for fast development of data-apps with Python. The architecture of the web application is thereby relatively straightforward, because we basically connect standard front-end components like sliders, drop down menus and forms with SQL commands that interface with the database. With the additional integration of plotting capabilities in Streamlit, we can then build very neat data dashboards in a reasonable time. A quick overview of the architecture can be seen below:

Now there is not much left to say about the web app except that it can be viewed with the following link, so please take a look for yourself. -> Techstack Analysis Report

Conclusion

In this article, we looked at a project that encompassed the whole data science workflow, from gathering raw data to the creation of an interactive report that could be used by decision makers to guide their strategies. Such a project features a lot of different technologies and techniques that must be orchestrated to deliver a usable final result. Nonetheless, this usage of different tools for creative problem-solving, is probably the most exciting part of data science. As always, the code for this project can be found on GitHub in the following repository.

Topic Modelling with LLMs

Marco Schweiss — Fri, 01 Dec 2023 18:15:36 +0000

The field of natural language processing has witnessed revolutionary advancements with the introduction of large language models (LLMs) such as ChatGPT and GPT-4. This has prompted an exploration of how these LLMs can be utilized in topic modeling, an essential technique for comprehending extensive textual corpora. The aim of this project was to examine the trade-offs between traditional statistical algorithms and LLMs for topic modeling. Additionally, the goal was to develop a new approach for topic modeling utilizing the GPT model family, while providing recommendations on when to employ each method. A thorough literature review was conducted to establish the theoretical foundation.

Subsequently, an experiment was designed to compare the two approaches across three diverse datasets containing different types of documents, namely news, abstracts and tweets. Topic coherence and diversity metrics were employed for quantitative assessment, complemented by qualitative evaluation through GPT-4. Traditional models were found to outperform the LLM-based approach on the news and abstract datasets. Conversely, both qualitative and quantitative analysis demonstrated that the novel LLM approach yielded superior results on the tweet dataset.

It was determined that traditional topic models are better suited for larger datasets, providing a higher-level overview of topics. In contrast, LLM-based approaches excel at generating granular and interpretable topics from smaller datasets. These findings offer practical recommendations regarding strategy selection in topic modeling applications.

To read the full article please refer to this PDF file.

Why Are Financial Time Series Called Random Walks?

Marco Schweiss — Wed, 15 Nov 2023 18:34:49 +0000

Financial time series are notoriously hard to model and are often referred to as random walks. But why? In this article, we will take a closer look at what a random walk is and the reasons why financial time series are described as such. As a prerequisite for a full comprehension of the article, a basic understanding of general time series analysis and ARMA models is necessary.

What is a financial time series?

A financial time series is a type of data that chronicles the evolution of prices or other relevant economic indicators over certain periods of time. A series typically contains information on currency exchange rates, stock prices, or market indices. Financial time series analysis is the process of studying such time series to identify patterns and trends that could be useful for forecasting. As an example of a financial time series, the graph below shows daily Bitcoin closing prices for the year 2021.

One of the main difficulties when modeling financial time series is that the data tends to be non-stationary, which means that the mean and variance of the data change over time. This makes it difficult to make accurate predictions, as different models may apply at different points in time. Furthermore, financial markets are highly volatile and subject to rapid change due to external factors such as news reports and political events. This makes it hard to accurately predict future movements in prices using historical information alone, since current trends may not remain consistent over a long period of time.

Modeling techniques

There are various methods for modeling general time series, for example, linear regression, moving averages, Holt-Winters, and ARMA models, though not all of these models work well for financial time series. The focus of this article will be on ARMA models, and for now, let us examine the basics and tackle the problem of the non-stationarity of financial time series. The basic ARMA models need at least the assumption of weak stationarity to work properly, which means we have a constant mean, finite variance, and the covariance between two observations depends only on the distance between them. To recall, the classic ARMA model consists of the autoregressive (AR) and the moving average (MA) model. A method for achieving stationarity is differencing where we subtract a lagged version of the time series from itself. So let us look at the following plot.

The plot shows again the Bitcoin closing prices but this time the values at the first lag t-1 were subtracted from the value at time t. We can see that the differences in price look like they have no visible trend, which means we should have achieved stationarity. But how can we be (a bit more) certain?

Testing for stationarity

To aid the visual inspection, there are statistical tests for addressing the stationarity concern, with one of these tests being the Dickey-Fuller (DF) test and its expanded version, the Augmented-Dickey-Fuller (ADF) test. The DF test works by examining if we have a unit-root process. So given the AR(1) model denoted by $X_{t} =\phi X_{t-1} + \epsilon_{t}$ we have a unit root process if $\phi=1$. Therefore, the DF test has the following hypotheses:
$$ \begin{aligned} & H_0: \phi=1\\ & H_1:|\phi|<1 & \end{aligned} $$ The problem is that the DF test is only meaningful if we can model our time series with an AR(1) model. To account for this shortcoming, the ADF test was created which can be used for all ARMA(p,q) models. Now let us examine the following ACF and PACF plots of the BTC closing prices:

Although the PACF plot cuts off after the first lag, which would be suitable for an AR(1) model, the ACF is only slowly tailing off instead of the exponential decrease that we would like to see. This means two things, first the AR(1) is not a suitable model, and second, we have our first bit of evidence for a unit root in our time series because this ACF pattern is common for unit root processes. Applying the ADF test yields around -2.34 for the ADF test statistic and a p-value of 0.158 which means we cannot reject our null hypothesis of having a unit root. In contrast, if we apply the ADF test to the differenced time series we get an ADF test statistic of around -20.1 and an extremely small p-value which lets us reject the null hypothesis and accept that the differenced series is stationary. So now we can start fitting an ARMA model, or?

Random walks

A random walk is a process in probability theory where the probable location of a point is determined based on random motions, with probabilities of moving some distance in some direction being the same at each step. To illustrate this we plot the correlogram of the differenced closing prices.

This plot of the ACF and PACF of the differenced series sums up why financial time series are described as random walks. As we can see there is no autocorrelation to be found and the only model we can fit would be an ARMA(0,0) which, let’s be real, would not be of particular help for making money in the stock market.

Conclusion

In this article, we looked into the basics of financial time series and modeling techniques, focusing on ARMA models and their prerequisites. Additionally, we spoke about the concept of stationarity and how to objectively assess it. Finally, we discussed random walks and showed why financial time series are called random walks by analyzing the correlogram of a stationary series of differenced daily BTC closing prices. We conclude with the finding that future prices can not be predicted by past prices, at least not with classical ARMA models.

References

Adrian, T. (2022). Interest Rate Increases, Volatile Markets, Signal Rising Financial Stability Risks. Retrieved from https://www.imf.org/en/Blogs/Articles/2022/10/11/interest-rate-increases-volatile-markets-signal-rising-financial-stability-risks

Encyclopedia Britannica (2023). Random walk. Retrieved from https://www.britannica.com/science/random-walk

Iordanova, T. (2022). An Introduction to Non-Stationary Processes. Investopedia. Retrieved from https://www.investopedia.com/articles/trading/07/stationary.asp

Montgomery, D. C., Jennings, C. L., & Kulahci, M. (2008). Introduction to time series analysis and forecasting. Wiley series in probability and statistics. Hoboken, NJ: Wiley

Shumway, R. H., & Stoffer, D. S. (2006). Time Series Analysis and Its Applications: With R Examples (2nd ed.). SpringerLink Bücher. Cham: Springer

Tsay, R. S. (2005). Analysis of financial time series (2nd ed.). Wiley series in probability and statistics. Hoboken, NJ: Wiley

Zivot, E., & Wang, J. (2006). Modeling Financial Time Series with S-PLUS® (2nd ed.). New York, NY: Springer New York

Discovering In-Demand Skills in Austria’s Data Science Job Market

Marco Schweiss — Sun, 29 Oct 2023 10:35:34 +0000

In today’s rapidly evolving world of data science, staying ahead of the curve and understanding market demand is crucial. Aspiring data scientists often wonder which skills to focus on developing and what kind of compensation they can expect. In a recent project, I scraped public job postings in Austria to uncover the most sought-after skills and explore salary ranges for data science-related positions. Join me as we unveil intriguing insights into Austria’s data science job market.

Scraping Data

The first step of the project was to collect a dataset of job postings from an online resource, in this case the online job board Karriere.at. The goal was to capture valuable information like required skills and associated salaries, allowing us to gain actionable insights. To start the data collection, I first searched for the term “data science” on the job board, which yielded 310 results at the time I started the project.

The next step was to simply copy the HTML source code of the website, which contained all the links to the individual job postings. By leveraging the Requests and BeautifoulSoup Python libraries, I first extracted the links, then scraped the HTML and extracted the text for each listing. This resulted in 310 text files containing the job description in an unstructured format.

Extracting Skills and Salaries

To extract the skills and salaries from the scraped job postings, I employed OpenAI’s Chat Completions API with the GPT-3.5-Turbo model, or for short, ChatGPT. The large-language model allows us to effectively extract the information of interest from the unstructured data via a well-crafted prompt. The skill and salary extraction is thereby split into two tasks, so the first processing of the postings only extracts the skills, and the second run extracts the salaries. The image below shows the unstructured text data and the extracted skills and salary plus salary period, which will be used for further analysis.

Data Cleaning and DataFrame creation

Once the necessary information was collected, I transformed it into a structured format suitable for comprehensive analysis. This involved organizing the extracted skills and their associated salaries into a clean DataFrame, ready for deeper exploration. To clean up the skills, the first step was to tokenize the comma-separated list returned by the Chat Completions API. Following was lowercasing, removal of spaces, hyphens and special characters, and also the removal of company names like Microsoft and Google. All these steps help to basically reduce the noise in the data that results from different wordings. For example, by looking into the data, I saw a lot of different wordings for Azure, namely MS Azure, Microsoft Azure and microsoft azure which all denote the same skill. Finally, employing all these cleaning steps resulted in a list of 785 different skills, which were consequently used to create a one-hot encoded DataFrame for further analysis.

Analyzing Sought-After Skills

With our properly structured dataset at hand, we can now delve into analyzing the most sought-after skills in Austria’s data science job market. This is done by simply counting the rows that contain the value “true” for the respective skill and sorting in descending order. If we look at the bar plot below, we can identify key skills that repeatedly cropped up across numerous job postings.

Revealing Salary Ranges

In addition to uncovering the essential skills sought by employers, we also examine the salary ranges offered for data-related roles in Austria. By creating a box plot of the provided salaries, we can see the distribution of salaries over the different data-related roles. Here we get a mean and median of around 49,000 euros.

Insights and Limitations

Throughout the analysis, several noteworthy findings emerged. The first finding is that Python and SQL are by far the most sought-after skills, with around 50 percent of the jobs demanding these two skills. Looking further at place three, we can see that there is a demand for Power BI, which is a business intelligence product offered by Microsoft. Following, we have a mixture of several programming languages and technologies. If we look at the salaries, we can see that we have several outliers, especially towards the lower end, which are probably a result of an error from the salary extraction pipeline or postings that give hourly wages that are not accounted for. The same is true for the skill extraction, which is comprehensive but not perfect. This could be accounted for by fine-tuning a model for this task. Nonetheless, given the zero-shot prompt approach, I am very happy with the results, which coincide with common knowledge about the field of data science and related job roles.

Conclusion

Our data analysis project sheds light on the current landscape of Austria’s data science job market by uncovering sought-after skills and highlighting salary ranges. According to our analysis, a well-rounded approach to acquiring skills would first require a solid understanding of Python and its usage in data-related scenarios. The following is a study of SQL and the concept of relational databases. A business intelligence solution like Power BI further adds to the skill set, which is now more focused on data analytics than data science. In general, the results of our analysis state a higher demand for data analytics than data science.

Alternatively, or to further expand one’s skill set, the usage of the Microsoft Azure platform for data-related workloads would be a great addition. Programming languages like Java, JavaScript, C#, or C++ and version control with GIT are more in the realm of software engineering. Armed with this knowledge, aspiring data scientists can effectively prioritize their skill development efforts. As technologies evolve and new demands emerge, it’s crucial to stay informed about market dynamics. The code for the analysis can be found in this repository.

Building a Private AI Tutor for a Microsoft Learn Course

Marco Schweiss — Wed, 25 Oct 2023 09:56:44 +0000

In this article, we will look into one of my recent projects, which is the creation of a specialized private AI tutor for a Microsoft Learn course for the Power BI certification. Given the results of the job market analysis discussed in my most recent article and the fact that I already wanted to refresh my Power BI knowledge, this is the perfect opportunity for this project. The overall aim was to provide an interface where users could ask questions and receive accurate and relevant responses based on the content of the course. We will now look into the details of the implementation.

Data Collection

To build the knowledge base for our AI tutor, we will utilize Python along with libraries like Selenium and BeautifulSoup to scrape all necessary information from the web. By converting each section of the course into individual text files, we can create a comprehensive database to serve as our tutor’s knowledge repository. Thereby, we have to utilize Selenium, which allows us to retrieve HTML code that is dynamically generated by JavaScript. We basically retrieve the raw HTML of each page and extract the actual content with the help of BeautifulSoup. Following, we clean the data by removing links to quizzes and interactive exercises in the course and splitting up the raw text into individual chapters by using the heading structure of the HTML.

This follows the same logic as the PDF conversion pipeline that was used for the ChatGPT flashcard generator and document summarizer, where the headings are used to split up a larger file into chapters. This semantically correct split of the text is important for the following steps and improves our results. Running our scraping pipeline results in the retrieval of 196 pages of course content, which are split into 528 text files that ultimately serve as our knowledge base.

Implementing the AI tutor

The core of our AI tutor is built with the LangChain library, which is a framework that facilitates building applications with large language models (LLMs) like those from the GPT model family. To enable our AI tutor to effectively process and retrieve information, we need to load and prepare the data before we can make an inference that returns an answer to our questions. This pipeline follows a step-by-step process, which is as follows:

Loading the documents
Creating word embeddings
Storing the embeddings in a vector database
Retrieving documents with a similarity search
Passing the retrieved documents to a LLM

Steps 1 to 3 are only done once at the beginning, and steps 4 and 5 are repeated for every question that we ask.

Loading Text and Storing Vectors

Loading the documents is straightforward and done with LangChain’s DirectoryLoader class. Thereby, the functionality of the class basically iterates over the files in a directory and stores them in a list. The more interesting part happens when we convert the raw text documents into embeddings in the next step. Embeddings allow for the creation of a vectorized representation of text, enabling us to analyze and compare pieces of text in the vector space. This is particularly valuable for conducting semantic searches, as it allows us to identify texts that are most similar based on their vector representations. For our project, we use the OpenAIEmbeddings class also from LangChain, which serves as our access point to OpenAI’s embedding service. The last data preparation step consists of storing the embeddings in a vector database, which allows for efficient retrieval. For our project, we use Faiss, a vector database developed by Meta.

Querying Process and Answer Generation

Once our database is ready, we can input questions or queries into our AI tutoring system. Utilizing similarity search algorithms, the vector database retrieves relevant documents based on query similarities. This retrieval process enables quick access to specific sections of course material that may hold answers or explanations related to our inquiries. Upon retrieving relevant documents from our knowledge base, we can then employ a GPT model to formulate accurate answers in response to user queries. The LLM ensures that we receive high-quality and contextually relevant responses tailored specifically to our questions. For better comprehension, we will now walk through the process step by step.

First, we set up the question, which will be “How can I share reports with Power BI?” or in German “Wie kann ich mit Power BI Berichte teilen?”. We first use this question for a similarity search in the vector database, which retrieves the following four documents:

You can see that three of the four documents are very relevant to our question about sharing reports, which means that the similarity search worked well. Here is also the proof that our initial semantically correct splitting of the documents paid off, because otherwise we would retrieve a lot of noise from the database. For example, a common strategy for this type of application if we do not have a structure to build from is to use a fuzzy approach where a large document is split at defined, overlapping word intervals. The LLM that interprets the retrieved documents can certainly handle this, but it is always better to clean the input data as much as possible to improve the accuracy and stability of the output.

As our last step, these four documents are now passed to the GPT model with the instructions to formulate an answer to our question based on the context of the four documents. LangChain handles the prompt creation and the API call for inference, and we get the following answer to our question:

The answer is pretty solid and very relevant to our question. Another very useful feature is that the hallucinations of the LLM are kept to a minimum with this approach. Hallucinations are incorrect answers that the model comes up with. For example, if we ask the model, “What is a squirrel-cage rotor?” or in German, “Was ist ein Kurzschlussläufer?” we get the following answer:

Conclusion

The development of the specialized private AI tutor for the Microsoft Azure Power BI course introduces an innovative way of engaging with educational content. By leveraging the power of Python, web scraping techniques, and advanced libraries like LangChain and GPT models, we successfully built an intelligent tutoring system capable of providing personalized and accurate responses. The biggest advantage of this type of system is definitely the increased speed of information retrieval. We basically built a specialized search engine that dynamically compiles the search results in the most targeted and accurate fashion possible, providing a direct answer to a question.

Overall, LangChain and similar frameworks provide a novel approach to utilizing state-of-the-art LLMs and have spawned a new wave of sophisticated AI applications. The system that we built here with a few lines of code can be leveraged in a lot of different scenarios. For example, the most recent generation of customer support chatbots are probably all running on the same or at least a similar pipeline. A knowledge base is set up with very relevant information about a company’s products or services, and customer questions are used the same way as we used for our AI tutor. The same system can also be used for company internal knowledge, interacting with user manuals, online documentation, or research papers. Although the responses are not always perfect and one should always check the source, which can be printed with the answer, it feels like we are moving into an era of real AI integration into a vast number of applications. The code for this project can be found on GitHub.

Statistical Hypothesis Testing with Parametric Tests

Marco Schweiss — Wed, 18 Oct 2023 12:09:27 +0000

Many scientific fields require methods for making statistical inferences with the data that was generated during reserach. A cornerstone of statistical inference is hypothesis testing, which is a method used in many research scenarios. In this article, we will look into the key elements and the general procedure to conduct a hypothesis test according to modern standards. Additionally, we will examine two selected hypothesis testing methods in detail and.

The Hypothesis Testing Framework

The field of statistics can be separated into two general paradigms regarding their applications. First, there are descriptive statistics, which are applied to summarize or describe features of obtained data. On the contrary, there are inferential statistics for drawing conclusions about a population with the help of data obtained from samples. Many issues occurring in inferential statistics can be classified into one of three categories, estimation, confidence sets, or hypothesis testing. The latter of the three categories, hypothesis testing, has a broad range of applications and is used in practically all scientific fields.

Before we look at the hypothesis testing procedure, the meaning of hypothesis testing and the general approach must be defined. As already explained, hypothesis testing is a method of statistical inference, with the approach to hypothesis testing varying if we look at different paradigms of statistical inference. The article is focused on the null hypothesis significance testing framework, which resulted from combining the work of Fisher and Pearson- Neyman into a single approach. Going forward, when the term hypothesis testing is mentioned, it is in the context of the null hypothesis significance testing framework.

Null Hypothesis and Alternative Hypothesis

The process of hypothesis testing is, at its core, a decision problem and very similar to the scientific method in general. The decision is between two hypotheses that make assumptions about the parameters of a population. First, there is the null hypothesis, which is denoted by $H_{0}$. The null hypothesis is assumed to be true until it is proven false. On the other hand, there is the alternative hypothesis, denoted by $H_{1}$. The alternative hypothesis is declared true if the null hypothesis is disproven. Given that researchers often use hypothesis testing to prove an effect of some sort, most of the time, the null hypothesis claims that there is no effect. In contrast, the alternative hypothesis claims that an effect exists.

So in summary, first the status quo is observed, followed by the setting up of a theory, which is then tested and compared against the test results. After the evaluation of the test results, the hypothesis is then rejected or further examined.

Parametric and Non-Parametric Tests

After defining the null and alternative hypotheses, we can further specify the hypothesis testing framework. It is crucial to make a distinction between the two types of hypothesis testing. The first type of test is the parametric test, which is based on the assumption that the population from which the sample data for the test is drawn follows a particular distribution with corresponding parameters. Most of the time, there is an assumption of a normal distribution. Contrary to parametric tests, non-parametric tests are conducted without making any assumptions about the population’s distribution. The type of test to conduct must be chosen according to the tested hypothesis and the available sample data. Going forward, we will focus on parametric hypothesis tests.

Test Statistics

To conduct a hypothesis test, a test statistic is necessary. The test statistic acts as the foundation on which the decision to reject the null hypothesis is made. A test statistic is thus a function used to evaluate the sample data. The resulting value of the test statistic is a single number representing all the sample data. There are various hypothesis tests, with each test employing its own specific test statistic, for example, the t-test uses the t-statistic, while the chi-squared test uses the chi-squared statistic. Like the choice of test type in general, the selection of a specific test statistic depends on the nature of the research question and the type of data being analyzed.

Significance Level

The next necessity for a successful hypothesis test are the rules for interpreting the values derived from the test statistic. To identify when a null hypothesis has to be rejected, a significance level must be determined. The significance level denoted by $\alpha$ is thereby the limit of how unlikely a computed value of a test statistic must be before the null hypothesis is rejected. The significance level also defines a range of resulting test statistic values, which leads to a rejection of the null hypothesis. This range of values is called the rejection region. The value for the significance level must be defined before conducting a hypothesis test and can be chosen arbitrarily, although the value should not be greater than 10 percent.

Population and Sample Distribution

Like already mentioned, most parametric tests assume a normally distributed population. If the population distribution is normal, then the sample data’s distribution is also normal. Additionally, suppose the population is assumed to be non-normally distributed. According to the central limit theorem, the sampling distribution can nonetheless be assumed to be normal for a sufficiently large sample size. In this case, the sampling distribution approaches a normal distribution when the sample size increases. If not stated otherwise for a specific testing method, a sample size of 30 is large enough to assume the sampling distribution is approximately normal.

Two-Sided and One-Sided Hypothesis Tests

After looking into the sample distribution’s necessary criteria, we can now further define the types of hypothesis tests. First, we denote $\theta$ as a population parameter and $\theta_{0}$ as a known value. Following, we denote the null hypothesis as $H_{0}:\theta = \theta_{0}$. The first type of test is called a two-tailed test, with $H_{1}:\theta \neq \theta_{0}$ as the alternative hypothesis. For the second type of hypothesis test, we have the one-tailed test with either $H_{1}:\theta > \theta_{0}$ or $H_{1}:\theta < \theta_{0}$ as alternative hypotheses.

The hypothesis tests are thereby called tailed tests because the rejection region is located in the tail regions of the sampling distribution curve. Resulting from their respective hypothesis definitions, the two-tailed test has a rejection region in both tails each, and the one-tailed test has a rejection region in either the left tail or the right tail. Consequently, we speak of a left-tailed test or a right-tailed test.

Testing Considerations

As already explained, the type of hypothesis test is dependent on the hypothesis, and the proposed hypothesis depends on the research scenario. In a scientific setting, it is often sufficient to determine if there is any effect at all. Consequently, the two-tailed test is the most prominently used test because positive and negative effects can be tested. The consensus is to use a two-tailed test if there is no reason to use a one-tailed test. Although the one-tailed test can outperform a two-tailed test, it should only be chosen for proving an effect that can exclusively be positive or negative.

Additionally, there is the possibility that the research only cares about either a positive or negative effect. From an ethical standpoint, like for the significance value, it is essential to choose which kind of test to take before the sample data is inspected because the rejection region’s location can influence if a null hypothesis is rejected.

Interpretation of Testing Results

Considering all the previous points, we now have all the essential elements of a statistical hypothesis test. Given that the population can be assumed to be normally distributed or the central limit theorem can be employed, a hypothesis test’s overall process is as follows: The first step is to define the two hypotheses H0 and H1 according to the research question, which results in a two-tailed or one-tailed test. Second, a significance level is chosen, and the rejection region’s location concerning the test tails is determined. Following, the result is computed with the sample data and the test statistic. The next step after obtaining our computed value is the correct interpretation of the results. From this point on, there are two general possibilities for interpretation. First, the critical-value approach and second, test interpretation with a p-value.

Critical-Value Approach

For the critical-value approach, we interpret the computed value with the help of a table. The computed value is thereby compared to the table’s values, resulting in a decision to reject or fail to reject the null hypothesis. Which table to choose depends on the distribution of the respective hypothesis test, for example, the normal distribution table, the t-distribution table, or the F-distribution table. How to correctly interpret the test results with the respective table depends on the employed hypothesis testing method and the structure of the table itself.

The p-Value

The second method for interpreting test results is the p-value approach. The p-value, also called the attained significance level, is used to express exactly how strong or weak the rejection of the null hypothesis is. It is the probability of observing the test statistic’s calculated value under the assumption that the null hypothesis is true. A smaller p-value indicates strong evidence, and a bigger p-value indicates weak evidence against the null hypothesis.

The p-value is thereby obtained from the computed value of the test statistic. It can either be calculated with statistical software or chosen from a table. With the acquired p-value, decisions regarding the rejection of the null hypothesis can be made by comparing it to the desired significance level. If the obtained p-value is smaller than the predefined significance level, the null hypothesis is rejected.

Criticism of the p-Value Approach

I already mentioned that the null hypothesis significance testing framework combines two different paradigms into a single framework. In particular, the p-value approach combines the usage of a null and alternative hypothesis, which stems from Pearson and Neyman and the concept of the p-value from Fisher, into a single procedure. The controversy around this combination stems from the fact that the original concept of Fisher’s p-value was only a probability of observing an outcome as weird as the hypothesis test’s outcome.

There was also no prerequisite to interpret the p-value under the assumption that the null hypothesis is true because there was no second hypothesis in Fisher’s testing procedure. According to various sources, although the combination of the two paradigms is the most practiced form of hypothesis testing in current scientific research, the p-value is also one of the most misinterpreted and controversial concepts.

t-Test for Two Dependent Samples

The first of the two hypothesis tests examined in detail is the t-test for two dependent samples, which is also called the paired t-test. The characteristic of dependent samples is that every sample has a dependent second sample, and vice versa. The test is based on the t-distribution and examines the hypothesis that two dependent samplings represent two populations with different means.

An example of a paired t-test would be to test random samples of people, administer a treatment, and then perform another test. For this example, the t-test evaluates if the difference between the two sampling means is statistically significant, or, in other words, if the treatment had an effect.

Prerequisites and Assumptions

First of all, a paired t-test is conducted with interval/ratio data. Interval/ratio data has the important characteristic that it can be processed mathematically. Examples of interval or ratio data are population size, age, median income, or average times. Another prerequisite is that the samples are selected randomly and that the population from which the sampling is drawn follows a normal distribution. If the population distribution is unknown, the sampling should have a size of at least 30 to assume a normal distribution according to the central limit theorem.

Another critical assumption for the paired t-test is that both samplings’ population variances are equal. This characteristic, called homogeneity of variance, is essential with regard to the accuracy of the paired t-test. The homogeneity of variance is also a critical prerequisite to guaranteeing the reliability of other parametric tests and can be evaluated with various tests, for example, Cochran’s G test.

Testing Procedure

To start the paired t-test, like in every other hypothesis test, the null hypothesis $H_{0}$ and the alternative hypothesis $H_{1}$ must be defined correctly. As already explained, a hypothesis test can be one-tailed or two-tailed, depending on the established hypotheses. Given that the paired t-test is concerned with means, the null hypothesis is denoted as $H_{0}: \mu_{1} = \mu_{2}$. For a two-tailed test, we have $H_{1}: \mu_{1} \neq \mu_{2}$ and for a one-tailed test, we have $H_{1}: \mu_{1} > \mu_{2}$ or $H_{1}: \mu_{1} < \mu_{2}$ as alternative hypotheses. After defining the hypotheses, the significance level must be chosen. As already explained, for ethical reasons, this must happen before the sample data is inspected.

For computing the actual values of the test, the direct-difference method is employed. For our test data, given that we have $n$ different test subjects, we get $2*n$ different data entries in our sample data. The $2*n$ data entries of the sample data are split into two dependent datasets, $X_{1}$ and $X_{2}$, representing, for example, a pre-test score and a post-test score of the $n$ subjects. Additionally, each of the two dependent data points in $X_{1}$ and $X_{2}$ yields a difference $D$ by subtracting $X_{2}$ from $X_{1}$.

Following, we sum all the differences $D$ to $\Sigma D$ and additionally the sum of the squares of all differences to $\Sigma D^{2}$. The next step is to calculate the mean of the differences $\bar{D}$ and the estimated population standard deviation of the differences $\tilde{s}_{D}$ with

\[\bar{D}=\frac{\Sigma D}{n} \quad \quad \tilde{s}_{D}=\sqrt{\frac{\Sigma D^{2}-\frac{(\Sigma D)^{2}}{n}}{n-1}}\]

With the help of $\tilde{s}_{D}$, the standard error of the mean difference $s_{\bar{D}}$ can be calculated, which in turn is used to compute the t-value with the test statistic for the paired t-test and $\bar{D}$

\[s_{\bar{D}}=\frac{\tilde{s}_{D}}{\sqrt{n}} \quad \quad t=\frac{\bar{D}}{s_{\bar{D}}}\]

Test Intepretation

From this point on, there are two possible methods for interpreting the results of the test. The first method is the critical value approach, where the computed value $t$ is compared to the table of Student’s t distribution, which can be found here. To correctly use the table, we need an additional parameter, the degrees of freedom, which is calculated with the formula $df = n-1$, with $n$ being the number of test subjects.

To evaluate the computed value with the help of the table, we first search for the row corresponding to the degrees of freedom. The column is then chosen according to the type of test (one- or two-tailed) and the significance level. The resulting cross-section is the value against which our computed value is compared. Depending on our hypothesis, there are three possible conclusions, for $H_{0}: \mu_{1} = \mu_{2}$ the null hypothesis is rejected if the computed absolute value is equal or greater than the resulting table value; for $H_{1}: \mu_{1} > \mu_{2}$ the null hypothesis is rejected when the computed value is positive and equal or greater than the table value; and for $H_{1}: \mu_{1} < \mu_{2}$ the null hypothesis is rejected if the computed value is negative and the computed absolute value is equal or greater than the table value.

For the p-value approach, we choose the p-value from the t-table. First, we search for the row corresponding to the degrees of freedom for our test. Then there are two possibilities for choosing the p-value. First, if the test’s computed value is equal to the table value, depending on whether the test is one-tailed or two-tailed, the corresponding significance level is chosen from one of the first two rows on the table. Alternatively, suppose we cannot find the exact t-value in the table. In that case, the p-value can be chosen by locating the nearest bigger and smaller table values for the computed critical value and their significance levels and approximately locating the p-value between the two obtained significance levels. The resulting p-value is then compared against the significance level $\alpha$ chosen at the beginning of the test, and if the p-value is smaller than $\alpha$, the null hypothesis is rejected.

Single-Factor Within Subjects ANOVA

The second of the presented hypothesis testing methods is called the single-factor within-subjects analysis of variance, which is, going forward, abbreviated to SFWS-ANOVA. The SFWS-ANOVA, like the paired t-test, works on dependent samples and population means, but in contrast to the t-test, the SFWS-ANOVA can evaluate more than only two dependent samplings. In essence, the SFWS-ANOVA is used to test the hypothesis that in a set of $k$ dependent samples with $k \geq 2$ at least two samplings represent populations with different means.

If the test result is statistically significant, it can be concluded that at least two of the $k$ samples belong to populations with different means. The computation of the SFWS-ANOVA’s test statistic is based on the F-distribution. An example of a SFWS-ANOVA test scenario would be a study about different treatments where the same $n$ subjects get $k$ different treatments, which are then compared.

Prerequisites and Assumptions

Like the paired t-test, the SFWS-ANOVA is conducted with interval/ratio data. Additionally, it is assumed that the population follows a normal distribution and that the samples are chosen randomly. The last prerequisite is called the sphericity assumption and is the SFWS-ANOVA’s equivalent to the paired t-test’s homogeneity of variance assumption. The sphericity assumption can be evaluated with Mauchly’s sphericity test. It is important to note that if the number of samples $k$ equals two, the paired t-test and the SFWS-ANOVA yield the same results, therefore going forward, we assume that the SFWS-ANOVA is only employed for $k \geq 3$.

Testing Procedure

As always, the test starts with the definition of $H_{0}$ and $H_{1}$. The null hypothesis is thereby defined as $H_{0}: \mu_{1} = \mu_{2} = \mu_{3}$, under the prerequisite that $k = 3$. For different $k$ the null hypothesis must be altered accordingly. The alternative hypothesis is defined as $H_{1}$ Not $H_{0}$, which means that at least two of the $k$ population means differ from each other. This notation is used because $H_{1}: \mu_{1} \neq \mu_{2} \neq \mu_{3}$ would imply that all three means must differ from each other. Now, before inspecting the sample data, like for every hypothesis test, a significance level must be determined.

For the actual test data, given that we have $n$ subjects and $k$ samplings, we get $n*k$ dependent samples, which are split into $X_{1},…,X_{k}$ different datasets. Thereby, each $X_{k}$ represents the sample data of one of the $k$ samplings. To further process the data, for each of the subjects $k$ samplings, the sum of the scores $\Sigma S$ is calculated. In addition $\Sigma X_{T}$, which is the sum of all the $n*k$ sample scores and $\Sigma X^{2}_{T}$, which is the sum of all squared scores is computed. Going forward, the total sum of squares $S S_{T}$ and the between-conditions sum of squares $S S_{B C}$ are calculated with

\[S S_{T}=\Sigma X_{T}^{2}-\frac{\left(\Sigma X_{T}\right)^{2}}{n*k} \quad S S_{B C}=\sum_{j=1}^{k}\left[\frac{\left(\Sigma X_{j}\right)^{2}}{n}\right]-\frac{\left(\Sigma X_{T}\right)^{2}}{n*k}\]

with $\Sigma X_{j}$ denoting the sum of the $n$ samples in the $j^{th}$ sampling. With the next equation the between-subjects sum of squares $S S_{B S}$ is calculated

\[S S_{B S}=\sum_{i=1}^{n}\left[\frac{\left(\Sigma S_{i}\right)^{2}}{k}\right]-\frac{\left(\Sigma X_{T}\right)^{2}}{n*k}\]

The residual sum of squares $S S_{\mathrm{res}}$ is calculated with the help of the previous three parameters, therefore $S S_{\mathrm{res}}=S S_{T}-S S_{B C}-S S_{B S}$. Additionally the between-conditions degrees of freedom $d f_{B C}$ is calculated with $d f_{B C}=k-1$ and the residual degrees of freedom $d f_{\mathrm{res}}$ with $d f_{\mathrm{res}}=(n-1)(k-1)$. With these parameters we compute the mean square between-conditions $M S_{B C}$ and the mean square residual $M S_{\mathrm{res}}$

\[M S_{B C}=\frac{S S_{B C}}{d f_{B C}} \quad M S_{\mathrm{res}}=\frac{S S_{\mathrm{res}}}{d f_{\mathrm{res}}}\]

The last step is to calculate the F-ratio, which is the test statistic for the SFWS-ANOVA.

\[F=\frac{M S_{B C}}{M S_{\mathrm{res}}}\]

Test Interpretation

Like for the paired t-test, there is the possibility to interpret the computed value with the critical-value approach or to obtain the p-value either from a table or with computation and compare it against the chosen significance level. For the critical-value approach, the computed value is compared to one of the F-distribution tables, which can be found here. The tables are used by selecting the right table corresponding to the chosen significance level and then searching for the value in the cross-section of the calculated degrees of freedom.

For the SFWS-ANOVA, the column in the F-table, which is called the degree of freedom for the numerator, corresponds to $df_{BC}$ and the row, called the degree of freedom for the denominator, corresponds to $d f_{\mathrm{res}}$. If the computed value is equal to or greater than the table critical value, the null hypothesis is rejected, and there is the conclusion that there are at least two populations with significantly different means. It is important to note that the linked F-tables are only supposed to be used for evaluating a non-directional alternative hypothesis.

To obtain the p-value, we search for the computed value independent of the table’s specified significance level. Like for the critical value, the correct row and column correspond to the numerator and denominator. If the computed value is identical to one of the table’s values, the p-value is equal to this specific table’s significance level. If the computed value is not in the table, like for the paired t-test, we choose the nearest smaller and bigger values across the tables. The actual p-value then lies between the two significance levels of the table with the next smaller value and the table with the next bigger value. Alternatively, the exact p-value can be calculated with the help of statistical software. If the obtained p-value is smaller than the predefined significance level, the null hypothesis is rejected.

Experiment – Blood Pressure Treatment

For better comprehension, we will now look at a practical application in the form of a fictive testing scenario for a blood pressure medication. To test if the medication has any effect at all, a paired t-test on 10 subjects is employed.

Hypotheses and Significance Level

Like for every hypothesis test, first the hypotheses and the significance level are defined. For the null hypothesis, I choose $H_{0} : \mu_{1} = \mu_{2}$ and for the alternative hypothesis, $H_{1} : \mu_{1} \neq \mu_{2}$, which in turn results in a two-tailed test. The significance level is chosen as $\alpha = 0.05$.

Sampling and Sample Preparation

To generate the sample data, we use the Numpy library for Python, with the sample data consisting of 2 x 10 samples randomly drawn from two different normal populations. The first sample set was generated with a mean of $\mu = 140$ and a standard deviation $\sigma=5$ and the second sampling with $\mu=135$ and $\sigma=5$ to simulate a reduction of the mean systolic blood pressure of the subjects.

With this, the conditions for a normally distributed population and randomly selected samples are fulfilled. Additionally, given that $\sigma_{1} = \sigma_{2}$ and in turn $\sigma^{2}_{1} = \sigma^{2}_{2}$, the condition for homogeneity of variance is also fulfilled. Additionally the values for the differences $D$ and $D^{2}$ and the sums of the differences $\Sigma D$ and $\Sigma D^{2}$ were computed from the generated sample data. The following table features the sample data and the respective parameters.

Computation of the Test Statistic

Going forward the mean of the differences $\bar{D}$ is calculated with

\[\bar{D}=\frac{\Sigma D}{n} = \frac{45}{10} = 4.5\]

and the estimated population standard deviation of the differences $\tilde{s}_{D}$ with

\[\tilde{s}_{D}=\sqrt{\frac{\Sigma D^{2}-\frac{(\Sigma D)^{2}}{n}}{n-1}} = \sqrt{\frac{487 -\frac{(45)^{2}}{10}}{10-1}} \approx 5.62\]

Following the standard error of the mean difference $s_{\bar{D}}$ is computed with

\[s_{\bar{D}}=\frac{\tilde{s}_{D}}{\sqrt{n}} = \frac{5.62}{\sqrt{10}} \approx 1.78\]

and finally with the test statistic of the paired t-test

\[t=\frac{\bar{D}}{s_{\bar{D}}} = \frac{4.5}{1.78} \approx 2.53\]

Interpretation of the Result

With the computed t-value and the degrees of freedom $df = n-1 = 9$, the critical value in the t-table can be selected. The corresponding table value for the computed t-value and the $9$ degrees of freedom would be nearly in the middle of the table values $2.262$ and $2.821$, which correspond to the significance levels $0.02$ to $0.05$ for the two-tailed test. Given that the computed critical value is smaller than the table value for $\alpha = 0.05$, the test yielded enough evidence to reject the null hypothesis. Alternatively, if the test is interpreted with the p-value approach, given that the computed t-value is nearly between $0.05$ and $0.02$, the estimated p-value would be $0.035$.

With $0.035 < 0.05$, the null hypothesis can also be rejected, and according to the alternative hypothesis, the two samplings represent two populations with different means, or in the context of the experiment scenario, the medication had an effect on the systolic blood pressure.

Conclusion

The establishment of a scientific framework for hypothesis testing has been an ongoing concern throughout the history of the statistical evaluation of sample data. Starting with the historical controversies around the creators of the different testing frameworks to the critique against the seemingly random combination of the hypothesis testing paradigm of Pearson-Neymann with Fisher’s p-value and the resulting misinterpretations of test results. Nonetheless, hypothesis testing is an invaluable statistical tool in many scientific disciplines, especially the life sciences, and continues to be a cornerstone of modern research.

References

Boos, D. D., & Stefanski, L. A. (2011). P-value precision and reproducibility. The American Statistician, 65(4), 213–221.

Christensen, R. (2005). Testing fisher, neyman, pearson, and bayes. The American Statistician, 59(2), 121–126.

Frost, J. (2020). Hypothesis testing: An intuitive guide for making data driven decisions. John Wiley & Sons.

Goodman, S. (2008). A dirty dozen: twelve p-value misconceptions. In Seminars in hematology (Vol. 45, pp. 135–140).

Halsey, L. G., Curran-Everett, D., Vowler, S. L., & Drummond, G. B. (2015). The fickle p value generates irreproducible results. Nature methods, 12(3), 179–185.

Kraemer, H. C. (2019). Is it time to ban the p value? Jama Psychiatry, 76(12), 1219–1220.

Lane, D. M. (2020). Within-subjects anova. Retrieved from http://onlinestatbook.com/2/ analysis of variance/within-subjects.html

Lillestøl, J. (2014). Statistical inference : Paradigms and controversies in historic perspective. Retrieved from https://www.nhh.no/globalassets/departments/business-and -management-science/research/lillestol/statistical inference.pdf

Mann, P. S., & Lacke, C. J. (2010). Introductory statistics (7th ed.). Hoboken, NJ: John Wiley & Sons.

Mauchly, J. W. (1940). Significance test for sphericity of a normal n-variate distribution. The Annals of Mathematical Statistics, 11(2), 204–209.

Murphy, K. R., & Myors, B. (2004). Statistical power analysis: A simple and general model for traditional and modern hypothesis tests (2nd ed.). Mahwah, NJ: Erlbaum.

Rice, J. A. (2007). Mathematical statistics and data analysis (3rd ed.). Belmont, Calif.: Thomson/ Brooks/Cole.

Roussas, G. G. (2003). Introduction to probablility and statistical inference. Amsterdam and Boston: Academic Press.

Sheskin, D. J. (2003). Handbook of parametric and nonparametric statistical procedures. Chapman and Hall/CRC.

Spickard, J. V. (2016). Research basics: Design to data analysis in six steps. Sage Publications.

Wackerly, D. D., Mendenhall, W., & Scheaffer, R. L. (2008). Mathematical statistics with applications (7th ed.). Belmont, CA: Thomson Higher Education.

Wasserman, L. (2004). All of statistics: A concise course in statistical inference. New York, NY: Springer.

Witte, R. S., & Witte, J. S. (2017). Statistics (11th ed.). Hoboken, NJ: John Wiley & Sons Inc.

Young, G. A., & Smith, R. L. (2005). Essentials of statistical inference (Vol. 16). Cambridge: Cam- 22 bridge University Press. Zhang, S. (1998). Fourteen homogeneity of variance tests: When and how to use them. Retrieved from https://files.eric.ed.gov/fulltext/ED422392.pdf 23

PDF Summarization and Flashcard Generation with ChatGPT

Marco Schweiss — Mon, 17 Jul 2023 13:42:35 +0000

In an era where information overload is prevalent, finding efficient ways to extract key insights from vast amounts of data has become crucial. As an aspiring student navigating the sea of academic papers during my time at university, I often found myself spending countless hours reading and analyzing research articles. Motivated by the desire to streamline this process, I embarked on a journey to develop a PDF summarizer using ChatGPT. Additionally, I wanted to leverage ChatGPTs language capabilities to automate the process of generating flashcards from the learning material.

By leveraging the power of OpenAI’s state-of-the-art language model, I aimed to create an automated tool that would help me and others extract essential knowledge from lengthy scholarly texts. The primary objective was to facilitate efficient research and enhance the learning experience for students like myself.

In this article, I will delve into the intricacies of building a PDF summarizer and flashcard generator with ChatGPT. I will discuss the underlying methods and the challenges encountered along the way.

The Portable Document Format

PDF, short for Portable Document Format, has become one of the most widely used file formats for document sharing and distribution. Its popularity is attributed to its ability to preserve the original formatting of a document across various platforms. However, this format presents unique challenges when it comes to extracting information and transforming it into a structured format.

The main hurdle lies in the fact that PDFs store data as a series of graphical elements rather than plain text. While these graphical elements maintain the visual appearance of the document, they pose difficulties for programs aiming to extract meaningful content automatically. Directly accessing and parsing text from a PDF can be complex due to variations in layout, fonts, and non-standard encoding schemes.

Another challenge is that PDFs often contain complex structures such as tables, headings, footnotes, citations, and images. These elements require specialized algorithms and techniques for accurate understanding and extraction. Moreover, despite efforts towards standardized file formats like tagged PDFs (PDF/A), many documents are still produced without such metadata, making it even more challenging to extract relevant information systematically.

To overcome these obstacles and bring the data within PDFs into a structured format, one must employ smart techniques such as optical character recognition (OCR) to convert graphical elements back into text data. Additionally, algorithms need to analyze the layout structure to identify headings, sections, paragraphs, and other textual components accurately, which brings us to the PDF Extract API from Adobe.

PDF Extract API

The PDF Extract API is a cloud-based service to automatically extract content and structural information from PDF documents. It can extract text, tables, and figures from both native and scanned PDFs. When extracting text, the API breaks it down into contextual blocks like paragraphs, headings, lists, and footnotes. It includes font, styling, and formatting information, providing comprehensive output. Tables within the PDF are extracted and parsed, including contents and table formatting for each cell. The API can also extract figures or images from the PDF, saving them as individual PNG files for further use.

We will leverage this API to create a JSON file containing the PDF content and in turn, use this JSON file to create text files that we will feed into ChatGPT for further processing.

The Problem with Summarization

Text summarization is a natural language processing technique aimed at condensing large bodies of text into shorter summaries while still capturing the essential information. With the rise of transformer models like ChatGPT, extractive and abstractive summarization methods have seen significant advancements.

However, when dealing with long-form content such as PDF documents, transformers face challenges due to their limited input length. Transformers typically have a maximum token limit, necessitating the truncation or splitting of longer texts. This approach can lead to the loss of important context and affect the quality of the summary.

To address this limitation and preserve information integrity, we will leverage the headings present in the PDF documents. Headings provide structural organization to the text and can serve as natural breakpoints for splitting the document into smaller sections. By splitting the document based on headings, we can ensure that each section is more coherent and self-contained, resulting in better summarization outcomes.

PDF to JSON

As already explained the first step of our pipeline is the conversion of a raw PDF into a JSON file and the extraction of tables and figures. We will use the paper Attention is All You Need by Vaswani and colleagues which is an important paper about transformer model architecture with over 80 000 citations for testing our approach. If you open the PDF you can see that we have a perfect candidate with a variety of elements such as headings, sub-headings, lists, math formulas, images, footnotes, and tables.

For a detailed explanation of the JSON files structure please refer to the documentation of the API provider. Sending the PDF to the Extract API returns the following files:

From here on out we are confronted with three distinct problems:

How do we get the information in the JSON file into a format suitable for our use case?
How do we integrate the tables into the text files?
How do we handle the figures?

JSON to Chapters

We will start by solving problem one which is the most straightforward to solve. By using the documentation we can iterate through the JSON file and transform the data to our target format, which is a folder of text files each corresponding to a section. A section is thereby everything that is contained between two headings.

The second problem of integrating the tables back into the text files is also relatively easy to solve. When the JSON file indicates a table in its path attribute we read the table into a DataFrame, convert it into a CSV, and instead of saving the data to a file we save it inside our text file. By using | as column separators, ChatGPT is capable of understanding that this should be a table.

The third part on our list is where it gets a bit more interesting. Let’s look at the following two images from the figure folder. Here is the first image

and here is the second image:

Given that this application should also be used for generating flashcards, we need a way to tell apart diagrams and formulas and also to incorporate mathematical formulas into our text. We will omit the diagrams for the time being.

Training a Formula Detection Model

To solve the issue we will train a CNN on a dataset that I compiled from various different sources. We basically need a model that can distinguish between math formulas and everything else that could be included in a PDF. This can range from figures and diagrams to illustrations and images of everything else. The approach to training this model is the same as the one that I have already used for the facial emotion classifier. Given the relatively simple task for the modified VGG19, because the two image categories are relatively distinct, we get an accuracy of around 99% which will have to suffice.

Now we integrate this model into our initial script and every time the JSON file path attribute indicates a Figure, the model is called to predict whether we have a formula or not. Depending on the decision the figure is either skipped or sent to the Mathpix API which returns us a Latex representation of the formula. The Latex code is then included in the text where it can be interpreted by ChatGPT. An example of a text file can be seen below:

Chapters to Content

With the preprocessing of the PDFs done, we can now move on to the second big part of the pipeline which is the summarization and the creation of flashcards. As I have already mentioned we will use ChatGPT’s capabilities to do this. By using ChatGPT’s API we can improve our results by providing examples of how we want the task done by sending context in the form of a preceding conversation. To convert the information from the raw text files into high-quality summaries and flashcards we will incorporate an intermediate step that chunks the raw text data into blocks of related information. The image below shows the text of chapter eleven in a chunked format.

After chunking we then let ChatGPT create a summary

and a list of flashcards:

The image does not show all the flashcards that were created. The complete text file can be downloaded here. As you may have noticed, the summary of the section is not really that much shorter than the original section. We will discuss how different lengths of text effects the summaries and the quality of the outputs later on. We also instruct ChatGPT to create flashcards from the summaries which gives us the following list:

Content to Documents

The last step of the pipeline is to convert the text files into documents that can be further used for studying. We will first look into the CSV files that can be used in flashcard software like Anki. Here we will create two CSV files one for the flashcards from the chunked files and one for the summary flashcards.

In addition to the question and answer that makes up a flashcard, we also include the source (document name), the section heading, and the two combined inside a tags column, which is a sorting mechanism used by Anki. Given the correct settings inside Anki, we can import the 238 flashcards from the chunked text and the 113 flashcards from the summaries. An example of how this looks inside Anki in its final form can be seen below:

The second type of document which we will create is two Word documents that contain the chunked sections and the summarized sections.

Discussion of the Results

Does our pipeline here perform flawlessly? No. Is it enough so that I could save a lot of time on tedious work by hand? Yes, absolutely.

Summaries

First of all, let’s talk about the rate of shrinkage that a document undergoes. The size of the resulting summary is dependent on how large the individual sections are and I think the GPT model handles this very well because larger sections get condensed more, and smaller sections that contain less information to begin with, do not get summarized too much. As you can see from our example, we have around 30% percent shrinkage of the original text, if we only consider the actual text as graphics, references and the abstract have been removed.

On the other hand, I tested this pipeline on my written university lectures that have several hundred pages and here I often got a reduction of around 50% or more, of course depending on the structure of the document. So a rule of thumb is that a lot of text without much headings and graphics gets condensed more and a document with sparse text gets condensed less.

The summaries that this pipeline generates can also be used as input for more summaries and solves the problem of too large input contexts, of course only if the original document is not too large to begin with. Here is an example of a one-page summary based on our five-page summary, that could be used for example to speed up a literature review :

Another idea would be to build a recursive summarizer where we first summarize sub-sections and then summarize these summaries into section-level summaries. Similar like OpenAI did in their case study. Regarding the quality of the summary I can of course only give you my subjective opinion as there is no standardized process to measure the quality of a summary. First I would like to explain how I use summaries in my overall approach to studying for a subject. The summary is used to get a first understanding of how the topics of a subject relate to each other and effectively reduces the time to do so by omitting detailed information.

So in essence you can focus on building a higher-level overview of how topics relate and care for the details of a specific topic later on. Given that I studied successfully for several exams with this approach I am personally satisfied with the quality of the summaries and the time that the automated generation saved me.

Flashcards

Next, we will look into the flashcards. As you may have noticed, the flashcards are very granular which means that practically every bit of information is transformed into a question-and-answer format. The idea here is to curate the flashcards during the revision process, so you have built an understanding of what the learning material is about with the summaries, then studied the material in detail with the chunked and original text and finally revise it with the flashcards.

During this revision process, you can then remove or rephrase sub-optimal cards and also check gaps in your existing knowledge. Additionally, you can evaluate the cards during studying which also can be considered a form of interleaved retrieval practice and results in better learning outcomes at least in my case. And last but not least the automated flashcard generation reduced the time it took to study for an exam, as I found curating the cards is faster than writing them from scratch. Of course, you can also just study the summary cards and call it a day.

Conclusion

In conclusion, the use of the PDF summarization pipeline built with ChatGPT has proven to be a game-changer in reducing study time and eliminating tedious handwriting work for me. The possibility for models like ChatGPT to act as a multitask-learner with minimal input is really fascinating and allows for a vast amount of use cases like this one. It is important to note that I use the GPT-3.5 model although GPT-4 is already available because the price per token demands it. A larger document with 100-200 pages costs around 2-3 $ at this time with the GPT-3.5 model and would cost 10 times as much with GPT-4.

I already did tests with GPT-4 but the capabilities in regards to the transformation of text are not that much different between the two models to justify the costs. This also coincides with what OpenAI reported about the difference between the two models. Given the rapid advancements of this technology and the large number of tasks that can be solved with large-language models, I am looking forward to years of interesting application development. This time there will be no code on GitHub, if my coding skills interest you please refer to my other works.

Training of a Spam Classification Model

Marco Schweiss — Fri, 14 Jul 2023 09:14:55 +0000

The rise of short message communication in recent years has provided individuals and businesses with a fast and interactive way to communicate with their peers and customers. However, this has also led to an increase in the number of spam messages, particularly on publicly available channels. In response to this, we will build a classification model that can identify if a message is spam and allows for an adjustable risk level, ensuring that low-risk messages are immediately displayed while others are moved to a folder for further analysis. We will base our model on this dataset.

To achieve these objectives, this project will follow the CRISP-DM framework, a widely used methodology that provides a structured approach to planning a data science project. We will cover all the necessary steps, including business understanding, data understanding, data preparation, modeling, evaluation, and a suggestion for deployment.

Organizing the Project with CRISP-DM

The first step in the CRISP-DM methodology is the Business Understanding step, which involves understanding the business objectives and requirements of the given task. In this step, we need to define the goals and objectives of the project and identify the stakeholders. This will help us to define the scope of the project and ensure that we deliver a solution that meets the needs of the stakeholders. We will base this project on a fictitious company that has tasked us with building a spam filter for their open channel to enable effective communication with their customers while maintaining message quality.

Business Understanding

For this project, the business objective is to build a spam filter for a new open communication channel that the company wants to install for its products. The aim is to enable fast and interactive feedback for their customers or possible future customers. The spam filter should be able to identify spam messages from short messages and allow for adjustments with respect to the risk of allowing spam to pass. This means that the spam filter should be adjustable via different spam-risk levels, e.g., low-risk (very restrictive) and high-risk (not restrictive). The messages passing the low-risk level are immediately displayed, while the other ones are moved to a folder for further analysis.

The stakeholders in this project are the company, the service team who will be responsible for using the spam filter to filter out spam messages, and the customers, who will be using the open channel to provide feedback or ask questions on the company’s products. It is important to keep in mind the expectations of these stakeholders while developing the spam filter so that we can deliver a solution that is effective and meets the needs of all groups. For the company the emphasis lies on not losing customers from misclassified messages, the service team wants a system that simplifies their daily work and the customers do not want to read too many spam messages, although a lost customer will probably hurt the company more than a spam message entering the channel.

In addition to the primary objectives of the project, it is also important to consider any constraints and assumptions that may impact the project. For example, we could have limitations in terms of the resources available to us, such as the amount of data that we have access to or the computing power required to train and test our models. We may also have to make assumptions about the types of messages that are likely to be classified as spam and adjust our approach accordingly. By considering these constraints and assumptions, we can ensure that we deliver a solution that is realistic and feasible given the resources available to us.

Constraints and Assumptions

Based on the dataset and the resources available, we can identify the following constraints:

The dataset only includes English messages, so the model will only be able to classify English messages accurately.
The dataset has a limited number of messages, which may not be representative of all types of messages that the company may receive.
The analysis and model training will be done on a home desktop PC, which may have limitations in terms of computational power and memory.

and assumptions:

The messages in the dataset are correctly labeled as ham or spam.
The dataset is a representative sample of the types of messages that the company may receive.
The spam filter will be used in a similar context as the dataset, so the model will generalize well to new messages.

Data Understanding

To start the data understanding step, we will first load the dataset and check for any missing values and duplicates. We will use the Pandas library to load the CSV file containing the messages and perform basic exploratory data analysis (EDA).

From the above output, we can see that there are no missing values in the dataset, but there are 403 duplicate records. We will remove these duplicate records from the dataset and look at the number of resulting unique records.

Next, we will explore the distribution of the target variable label with the help of Matplotlib and Seaborn. Additionally, we look at the distribution of the character count of the messages.

From the first visualization, we can see that the dataset is imbalanced as there are more ham messages than spam messages. This must be accounted for before we train our classifier.

Plotting the distribution of the message lengths, we can observe that spam messages tend to have longer message lengths compared to ham messages. Further looking into the number of digits and special characters reveals that spam messages tend to have more digits.

Another interesting feature that we will explore is the number of unique words per message type

and lastly, we plot word clouds to spot possible differences in vocabulary.

With this, we conclude the EDA and move further on to the data preparation step where we will balance the dataset and choose the most interesting features for training the model. The engineered features for each message will be message lengths, the number of unique words, and a TF-IDF representation of the vocabulary. We will omit the special character distribution as ham and spam messages are very similar in this regard.

Data Preparation

For the next step, we perform data preprocessing and feature engineering on the SMS message dataset. We load the data from the source CSV file, remove duplicates, and create new columns such as message length, number of digits, number of unique words, and a lemmatized version of the messages. The text data is then transformed into a matrix of TF-IDF values using the TfidfVectorizer class of the scikit-learn library. The resulting features are saved along with the trained vectorizer. Additionally, the dataset is balanced by sampling an equal number of ham and spam messages. The target variable is encoded as 0 for ham and 1 for spam and finally, the prepared dataset is saved as a new CSV file.

Model Training

For training, we first load our prepared dataset of messages and split it into training and testing sets. Then we perform a grid search with cross-validation to find the best hyperparameters for our classifier which will be a decision tree. Additionally, the feature importances are retrieved from the trained classifier, and a barplot is created to visualize the top 10 most important features.

Examining the plot, we can see that the model clearly favors the number of digits in a message as a decision factor. Following we have specific words and our other engineered features.

Error Analysis

After training our model, we will now further analyze the performance. To do this we compute a confusion matrix on our training and test datasets.

Additionally, we also retrieve accuracy, precision, and recall for each dataset respectively.

From the metrics, we can conclude that the model performs sufficiently well on our initial dataset. Additionally, we will look into the distribution of misclassified messages.

With this, we can assume that the model is not particularly biased toward one of the message types. Finally, we will test the model on 10 new messages that were generated with a language model and consist of 5 ham and 5 spam messages. The test results can be examined in the analysis Jupyter Notebook and revealed a reasonable out-of-sample performance and generalization. Although the emphasis of the model on digit counts makes spam messages with no digits and ham messages with more digits hard to classify correctly.

Deployment

To integrate the spam classification model into the service team’s workflow, we could develop a web application using Django. The steps involved in creating and deploying this system would be as follows:

Set up the Django project structure and configure the necessary settings.
Design the database schema to store messages and their classifications.
Set up an API endpoint that receives incoming messages, assuming we have the possibility to control message flow from our communication channels’ backend.
Implement the web interface using Django’s template system and HTML/CSS.
Integrate the spam classification model into the project and develop the function to classify messages.
Implement an adjustable spam level control, allowing the service team to set the level of stringency for classifying messages as spam. The level would be a percentage threshold that can be used for comparison with the probabilities of class membership that the model returns in percentages.
Deploy the Django application on a cloud platform.
Conduct thorough testing and quality assurance to ensure functionality and performance.
Provide documentation and training to guide the service team on using the web application effectively.
Maintain and improve the system based on feedback and ongoing monitoring.

Conclusion

The objective of this project was to build a classification model capable of identifying spam in short messages. The CRISP-DM methodology was followed, covering business understanding, data understanding, data preparation, modeling, evaluation, and a suggestion for deployment. The dataset’s quality was assessed, and findings were visualized for better comprehension. A decision tree classifier was created, trained, and tested using the provided dataset and additional out-of-sample data. An error analysis was conducted to identify weaknesses in the approach, which would be the reliance on digits for classifying spam. Finally, a proposal was made to integrate the model into the daily work of the service team. The solution would be in the form of a graphical user interface (GUI) based on a Django web application.

In summary, this project successfully developed a classification model for spam identification in short messages. The model returns class membership certainties in percentages which can be further utilized to build logic for adjustable risk levels. Further improvements to the model could be made by gathering additional data or data that is more representative of the use case, i.e. short messages of product feedback or questions. Additionally, there is the possibility of engineering alternative features to reduce reliance on the digit count. The project demonstrates the efficacy of data science methodologies in addressing real-world challenges, offering a valuable solution for the company. The code for this project can be viewed in this GitHub repo.

Emotion Detection in Images

Marco Schweiss — Fri, 14 Jul 2023 08:14:04 +0000

In an increasingly digital and interconnected world, understanding human emotions has become a critical aspect of various industries. Emotions significantly impact our decision-making processes, behavior, and overall experience. Marketing departments are particularly interested in leveraging emotion detection technology to gain insights into consumer preferences, optimize advertising strategies, and enhance user experiences. This project report focuses on the development and implementation of an emotion detection system capable of analyzing images to identify the underlying emotions expressed by individuals.

Conception

To solve this task, I decided to train a convolutional neural network (CNN) with a public dataset consisting of images of faces categorized into seven different emotions. To get a viable solution, the concept of transfer learning will be employed. This should result in a solution with state-of-the-art accuracy regarding the classification of emotions.

Basic Frameworks

The primary frameworks for implementing the emotion recognition solution will be TensorFlow and OpenCV. TensorFlow is an open-source library that makes it easy to train, test, and develop neural networks. The name TensorFlow stems from the use of multidimensional data (tensors) as input data sent through several intermediate layers. These layers can be individually selected to build up a neuronal network architecture. Additionally, the library also contains several pre-trained models which can be used for transfer learning. The second library, OpenCV, which is a computer vision library, features a lot of useful algorithms that will be used to extract faces and manipulate images for the inference step.

Transfer Learning and Model Choice

With transfer learning, it is possible to use an existing model to solve different but related problems to the original purpose the model was conceived for. The model’s pre-trained weights are reused, and the architecture of the original model is slightly altered. Transfer learning conserves time, generally results in better performance, and shrinks the scale of necessary training data sets. The model that the emotion classifier for this project is based on is the VGG19 model. The VGG19 is thereby a deep CNN model, which means that the model consists of multiple layers, including a 1000 neurons dense output layer. This is due to the fact that VGG19 was trained to classify 1000 different categories. To fit the model for the use case of detecting emotions, the original output layers, including the 1000-neuron dense layer, will be swapped with new output layers corresponding to the number of emotions that we want to identify. Additionally, we lock the original weights of the VGG19 model and train only our modified layers on the emotion dataset.

Dataset Selection and Issues

To train the model for emotion recognition, the FER2013 dataset, which consists of around 30000 images of facial expressions, will be used. The images are thereby classified into seven different categories of emotion. The dataset also features around 3500 test images for validation of the trained model. Although the dataset features a reasonable amount of images for training, there is the problem of some categories being under and over-represented. This introduces the problem of bias, as, for example, the Happy category contains around 7000 images and the Disgust category features only around 400 images. To combat this issue, the data will be augmented, for example, by flipping, rotating, or cropping images. Here is the link to the dataset.

Model Training and Inference

The overall process of implementing the solution for emotion recognition can be split up into two distinct parts, the model creation, and the inference part. The model creation procedure consists thereby of augmenting the dataset, setting up the training environment and architecture with TensorFlow, and then starting the training process.

The second step, which presents the actual solution for making a prediction, consists of the trained model, which is loaded into a Jupyter Notebook for inference. The process of inference consists thereby of loading an image into a Jupyter Notebook, extracting the face with OpenCV, preprocessing the image, and passing the image to the trained model for prediction. The prediction is visualized as the image with the classified emotion written on it.

Development

After creating a concept, the next step was to start the Implementation. This PDF explains the development process in detail.

Conclusion

Looking back on the creation of the emotion classifier, I can say that this project was a very pleasant and rewarding learning experience, regarding computer vision and the development workflow of deep learning models. Given the issues regarding data quality, bias-variance problems during training, hardware limitations, and the trial and error of hyperparameter tuning, the task also proved to be more difficult than initially thought. As the saying goes, the devil lies in the detail, and I can now understand why training deep learning models is by a lot of machine learning practitioners considered a mixture of art and science.

Personally, I am satisfied with the final result, although I wished for a bit higher accuracy and more emotions. To improve upon this work, a better dataset like AffectNet which has around one million images could be used to train a model on better hardware like TPUs. Grid-search for finding the best parameters and model structure could also be used to get an overall better performance. The code for this project can be found in this repository.

Writing a CNC Workpiece Positioning Program

Marco Schweiss — Thu, 13 Jul 2023 14:39:45 +0000

The starting point for this next project was a defect in one of our CNC portal milling machines at my workplace, which is used for manufacturing staircase components. The defect did not directly involve the production system itself, but rather the projection laser used for setting up the vacuum clamps and for positioning the workpiece on the milling table.

A conversation with the manufacturer revealed that the projection laser had already been discontinued and there were no longer any spare parts available for it. The only solution would have been a custom-made replacement, as the machine control only had an outdated RS232 interface, which was no longer installed by the manufacturer as a standard feature.

Considering the relatively high cost of several tens of thousands of euros for a new projection laser and the challenging circumstances involved in assembling and disassembling the laser system at around 6m from the ground, I opted for a software solution.

In addition to the aforementioned points, an application for calculating the positioning of the workpiece and vacuum clamps provided a speed advantage, as it could also calculate the necessary zero-point shifts. Normally, the workpiece would be placed at a location determined by the CAM software and then manually adjusted due to the complex shape of the machine table. This trial-and-error process typically consumed a significant amount of time.

Importing the G-Code

As a first step, it was necessary to read the G-code files into the development environment and locate the lines that defined the contours of the workpieces. To do this I leveraged the start and end strings that the CAM software provided. (WANGENKONTOUR) or (AUFG AUSSENK) as the start flag and M00 as the end flag.

Next, I had to extract the actual numerical values from the G-code strings, which would serve as the basis for further computations. To do this I wrote a function that has the task to find any coordinates labeled with an X or Y in the G-code lines and gather them into two separate lists. The plot below shows the contour of the stringer if we plot the extracted coordinates:

Placing the Vacuum Clamps

Let us now look at the real challenge of this project, which was to find a way to position the vacuum clamps on the machine table and how the clamps had to be shifted within the workpiece. My idea was to re-render the contour of the workpiece to fill in the missing values because as you can see in the G-code example above there are big gaps between the points of the original contour. The fine contour is then used to find the position of the clamps inside the workpiece.

My algorithm operates based on the following principle. First, a fictitious line is calculated using the centroid coordinates of the staircase component.

Then, points on the centroid line are sought for the centroids of the two outermost clamps until the distance between the clamps and the contour of the staircase component reaches 4cm.

If we now plot the clamps, this first step looks like in the image below:

The remaining clamps are then placed on the centroid line, offset by the width of one clamp. This results in three clamps on the left and three clamps on the right, as a workpiece is always secured with exactly six clamps.

As you may have noticed, the clamps on the right have a different spacing than the ones on the left. If we plot the machine table beside the stringer it should become clear why this is the case.

It would have been great if the project was finished by now, however, due to the complex machine table, a total of four different algorithms had to be developed since the staircase components can have lengths ranging from 50cm to 6m and sometimes require significantly different positioning of the vacuum clamps within the workpiece and on the machine table. We will look at the different configurations at the end of the article.

Positioning on the Machine Table

After calculating the clamping positions, a suitable spot on the machine table had to be chosen. This calculation also depends on the length of the staircase components. In the next step, the zero offsets and the measurements for clamp and raw workpiece placement are calculated. Finally, a plot is made with the positioning of the workpiece relative to the vacuum clamps and the positioning of the vacuum clamps relative to the machine table.

In addition, the plot, which is output in the form of a PDF in A4 format, shows the necessary zero offsets that are fed into the CNC control panel. A finished computation gives a plot like the one below:

As already mentioned we will also look at the other possibilities of workpiece placement which are showcased in this PDF file.

Conclusion

This project will always have a special place in my heart as my first real attempt at developing an actual useful application. Ironically the project probably features the ugliest code that I have ever written but on the other hand, has also created the most value. The application could prevent a thirty thousand euro investment and has effectively reduced the time it takes to process one workpiece by around 25 percent. Additionally, working on the machine is now less error-prone, and training new employees has become faster.

In the years since I developed the positioning script I always played with the thought of refactoring the code, but given the time it would take to entangle the code and the fact that it really works flawlessly so far, I couldn’t really bother. Of course, you can have a look for yourself, as the code is hosted on GitHub.