App Development – BigDataTime

Building a Private AI Tutor for a Microsoft Learn Course

Marco Schweiss — Wed, 25 Oct 2023 09:56:44 +0000

In this article, we will look into one of my recent projects, which is the creation of a specialized private AI tutor for a Microsoft Learn course for the Power BI certification. Given the results of the job market analysis discussed in my most recent article and the fact that I already wanted to refresh my Power BI knowledge, this is the perfect opportunity for this project. The overall aim was to provide an interface where users could ask questions and receive accurate and relevant responses based on the content of the course. We will now look into the details of the implementation.

Data Collection

To build the knowledge base for our AI tutor, we will utilize Python along with libraries like Selenium and BeautifulSoup to scrape all necessary information from the web. By converting each section of the course into individual text files, we can create a comprehensive database to serve as our tutor’s knowledge repository. Thereby, we have to utilize Selenium, which allows us to retrieve HTML code that is dynamically generated by JavaScript. We basically retrieve the raw HTML of each page and extract the actual content with the help of BeautifulSoup. Following, we clean the data by removing links to quizzes and interactive exercises in the course and splitting up the raw text into individual chapters by using the heading structure of the HTML.

This follows the same logic as the PDF conversion pipeline that was used for the ChatGPT flashcard generator and document summarizer, where the headings are used to split up a larger file into chapters. This semantically correct split of the text is important for the following steps and improves our results. Running our scraping pipeline results in the retrieval of 196 pages of course content, which are split into 528 text files that ultimately serve as our knowledge base.

Implementing the AI tutor

The core of our AI tutor is built with the LangChain library, which is a framework that facilitates building applications with large language models (LLMs) like those from the GPT model family. To enable our AI tutor to effectively process and retrieve information, we need to load and prepare the data before we can make an inference that returns an answer to our questions. This pipeline follows a step-by-step process, which is as follows:

Loading the documents
Creating word embeddings
Storing the embeddings in a vector database
Retrieving documents with a similarity search
Passing the retrieved documents to a LLM

Steps 1 to 3 are only done once at the beginning, and steps 4 and 5 are repeated for every question that we ask.

Loading Text and Storing Vectors

Loading the documents is straightforward and done with LangChain’s DirectoryLoader class. Thereby, the functionality of the class basically iterates over the files in a directory and stores them in a list. The more interesting part happens when we convert the raw text documents into embeddings in the next step. Embeddings allow for the creation of a vectorized representation of text, enabling us to analyze and compare pieces of text in the vector space. This is particularly valuable for conducting semantic searches, as it allows us to identify texts that are most similar based on their vector representations. For our project, we use the OpenAIEmbeddings class also from LangChain, which serves as our access point to OpenAI’s embedding service. The last data preparation step consists of storing the embeddings in a vector database, which allows for efficient retrieval. For our project, we use Faiss, a vector database developed by Meta.

Querying Process and Answer Generation

Once our database is ready, we can input questions or queries into our AI tutoring system. Utilizing similarity search algorithms, the vector database retrieves relevant documents based on query similarities. This retrieval process enables quick access to specific sections of course material that may hold answers or explanations related to our inquiries. Upon retrieving relevant documents from our knowledge base, we can then employ a GPT model to formulate accurate answers in response to user queries. The LLM ensures that we receive high-quality and contextually relevant responses tailored specifically to our questions. For better comprehension, we will now walk through the process step by step.

First, we set up the question, which will be “How can I share reports with Power BI?” or in German “Wie kann ich mit Power BI Berichte teilen?”. We first use this question for a similarity search in the vector database, which retrieves the following four documents:

You can see that three of the four documents are very relevant to our question about sharing reports, which means that the similarity search worked well. Here is also the proof that our initial semantically correct splitting of the documents paid off, because otherwise we would retrieve a lot of noise from the database. For example, a common strategy for this type of application if we do not have a structure to build from is to use a fuzzy approach where a large document is split at defined, overlapping word intervals. The LLM that interprets the retrieved documents can certainly handle this, but it is always better to clean the input data as much as possible to improve the accuracy and stability of the output.

As our last step, these four documents are now passed to the GPT model with the instructions to formulate an answer to our question based on the context of the four documents. LangChain handles the prompt creation and the API call for inference, and we get the following answer to our question:

The answer is pretty solid and very relevant to our question. Another very useful feature is that the hallucinations of the LLM are kept to a minimum with this approach. Hallucinations are incorrect answers that the model comes up with. For example, if we ask the model, “What is a squirrel-cage rotor?” or in German, “Was ist ein Kurzschlussläufer?” we get the following answer:

Conclusion

The development of the specialized private AI tutor for the Microsoft Azure Power BI course introduces an innovative way of engaging with educational content. By leveraging the power of Python, web scraping techniques, and advanced libraries like LangChain and GPT models, we successfully built an intelligent tutoring system capable of providing personalized and accurate responses. The biggest advantage of this type of system is definitely the increased speed of information retrieval. We basically built a specialized search engine that dynamically compiles the search results in the most targeted and accurate fashion possible, providing a direct answer to a question.

Overall, LangChain and similar frameworks provide a novel approach to utilizing state-of-the-art LLMs and have spawned a new wave of sophisticated AI applications. The system that we built here with a few lines of code can be leveraged in a lot of different scenarios. For example, the most recent generation of customer support chatbots are probably all running on the same or at least a similar pipeline. A knowledge base is set up with very relevant information about a company’s products or services, and customer questions are used the same way as we used for our AI tutor. The same system can also be used for company internal knowledge, interacting with user manuals, online documentation, or research papers. Although the responses are not always perfect and one should always check the source, which can be printed with the answer, it feels like we are moving into an era of real AI integration into a vast number of applications. The code for this project can be found on GitHub.

PDF Summarization and Flashcard Generation with ChatGPT

Marco Schweiss — Mon, 17 Jul 2023 13:42:35 +0000

In an era where information overload is prevalent, finding efficient ways to extract key insights from vast amounts of data has become crucial. As an aspiring student navigating the sea of academic papers during my time at university, I often found myself spending countless hours reading and analyzing research articles. Motivated by the desire to streamline this process, I embarked on a journey to develop a PDF summarizer using ChatGPT. Additionally, I wanted to leverage ChatGPTs language capabilities to automate the process of generating flashcards from the learning material.

By leveraging the power of OpenAI’s state-of-the-art language model, I aimed to create an automated tool that would help me and others extract essential knowledge from lengthy scholarly texts. The primary objective was to facilitate efficient research and enhance the learning experience for students like myself.

In this article, I will delve into the intricacies of building a PDF summarizer and flashcard generator with ChatGPT. I will discuss the underlying methods and the challenges encountered along the way.

The Portable Document Format

PDF, short for Portable Document Format, has become one of the most widely used file formats for document sharing and distribution. Its popularity is attributed to its ability to preserve the original formatting of a document across various platforms. However, this format presents unique challenges when it comes to extracting information and transforming it into a structured format.

The main hurdle lies in the fact that PDFs store data as a series of graphical elements rather than plain text. While these graphical elements maintain the visual appearance of the document, they pose difficulties for programs aiming to extract meaningful content automatically. Directly accessing and parsing text from a PDF can be complex due to variations in layout, fonts, and non-standard encoding schemes.

Another challenge is that PDFs often contain complex structures such as tables, headings, footnotes, citations, and images. These elements require specialized algorithms and techniques for accurate understanding and extraction. Moreover, despite efforts towards standardized file formats like tagged PDFs (PDF/A), many documents are still produced without such metadata, making it even more challenging to extract relevant information systematically.

To overcome these obstacles and bring the data within PDFs into a structured format, one must employ smart techniques such as optical character recognition (OCR) to convert graphical elements back into text data. Additionally, algorithms need to analyze the layout structure to identify headings, sections, paragraphs, and other textual components accurately, which brings us to the PDF Extract API from Adobe.

PDF Extract API

The PDF Extract API is a cloud-based service to automatically extract content and structural information from PDF documents. It can extract text, tables, and figures from both native and scanned PDFs. When extracting text, the API breaks it down into contextual blocks like paragraphs, headings, lists, and footnotes. It includes font, styling, and formatting information, providing comprehensive output. Tables within the PDF are extracted and parsed, including contents and table formatting for each cell. The API can also extract figures or images from the PDF, saving them as individual PNG files for further use.

We will leverage this API to create a JSON file containing the PDF content and in turn, use this JSON file to create text files that we will feed into ChatGPT for further processing.

The Problem with Summarization

Text summarization is a natural language processing technique aimed at condensing large bodies of text into shorter summaries while still capturing the essential information. With the rise of transformer models like ChatGPT, extractive and abstractive summarization methods have seen significant advancements.

However, when dealing with long-form content such as PDF documents, transformers face challenges due to their limited input length. Transformers typically have a maximum token limit, necessitating the truncation or splitting of longer texts. This approach can lead to the loss of important context and affect the quality of the summary.

To address this limitation and preserve information integrity, we will leverage the headings present in the PDF documents. Headings provide structural organization to the text and can serve as natural breakpoints for splitting the document into smaller sections. By splitting the document based on headings, we can ensure that each section is more coherent and self-contained, resulting in better summarization outcomes.

PDF to JSON

As already explained the first step of our pipeline is the conversion of a raw PDF into a JSON file and the extraction of tables and figures. We will use the paper Attention is All You Need by Vaswani and colleagues which is an important paper about transformer model architecture with over 80 000 citations for testing our approach. If you open the PDF you can see that we have a perfect candidate with a variety of elements such as headings, sub-headings, lists, math formulas, images, footnotes, and tables.

For a detailed explanation of the JSON files structure please refer to the documentation of the API provider. Sending the PDF to the Extract API returns the following files:

From here on out we are confronted with three distinct problems:

How do we get the information in the JSON file into a format suitable for our use case?
How do we integrate the tables into the text files?
How do we handle the figures?

JSON to Chapters

We will start by solving problem one which is the most straightforward to solve. By using the documentation we can iterate through the JSON file and transform the data to our target format, which is a folder of text files each corresponding to a section. A section is thereby everything that is contained between two headings.

The second problem of integrating the tables back into the text files is also relatively easy to solve. When the JSON file indicates a table in its path attribute we read the table into a DataFrame, convert it into a CSV, and instead of saving the data to a file we save it inside our text file. By using | as column separators, ChatGPT is capable of understanding that this should be a table.

The third part on our list is where it gets a bit more interesting. Let’s look at the following two images from the figure folder. Here is the first image

and here is the second image:

Given that this application should also be used for generating flashcards, we need a way to tell apart diagrams and formulas and also to incorporate mathematical formulas into our text. We will omit the diagrams for the time being.

Training a Formula Detection Model

To solve the issue we will train a CNN on a dataset that I compiled from various different sources. We basically need a model that can distinguish between math formulas and everything else that could be included in a PDF. This can range from figures and diagrams to illustrations and images of everything else. The approach to training this model is the same as the one that I have already used for the facial emotion classifier. Given the relatively simple task for the modified VGG19, because the two image categories are relatively distinct, we get an accuracy of around 99% which will have to suffice.

Now we integrate this model into our initial script and every time the JSON file path attribute indicates a Figure, the model is called to predict whether we have a formula or not. Depending on the decision the figure is either skipped or sent to the Mathpix API which returns us a Latex representation of the formula. The Latex code is then included in the text where it can be interpreted by ChatGPT. An example of a text file can be seen below:

Chapters to Content

With the preprocessing of the PDFs done, we can now move on to the second big part of the pipeline which is the summarization and the creation of flashcards. As I have already mentioned we will use ChatGPT’s capabilities to do this. By using ChatGPT’s API we can improve our results by providing examples of how we want the task done by sending context in the form of a preceding conversation. To convert the information from the raw text files into high-quality summaries and flashcards we will incorporate an intermediate step that chunks the raw text data into blocks of related information. The image below shows the text of chapter eleven in a chunked format.

After chunking we then let ChatGPT create a summary

and a list of flashcards:

The image does not show all the flashcards that were created. The complete text file can be downloaded here. As you may have noticed, the summary of the section is not really that much shorter than the original section. We will discuss how different lengths of text effects the summaries and the quality of the outputs later on. We also instruct ChatGPT to create flashcards from the summaries which gives us the following list:

Content to Documents

The last step of the pipeline is to convert the text files into documents that can be further used for studying. We will first look into the CSV files that can be used in flashcard software like Anki. Here we will create two CSV files one for the flashcards from the chunked files and one for the summary flashcards.

In addition to the question and answer that makes up a flashcard, we also include the source (document name), the section heading, and the two combined inside a tags column, which is a sorting mechanism used by Anki. Given the correct settings inside Anki, we can import the 238 flashcards from the chunked text and the 113 flashcards from the summaries. An example of how this looks inside Anki in its final form can be seen below:

The second type of document which we will create is two Word documents that contain the chunked sections and the summarized sections.

Discussion of the Results

Does our pipeline here perform flawlessly? No. Is it enough so that I could save a lot of time on tedious work by hand? Yes, absolutely.

Summaries

First of all, let’s talk about the rate of shrinkage that a document undergoes. The size of the resulting summary is dependent on how large the individual sections are and I think the GPT model handles this very well because larger sections get condensed more, and smaller sections that contain less information to begin with, do not get summarized too much. As you can see from our example, we have around 30% percent shrinkage of the original text, if we only consider the actual text as graphics, references and the abstract have been removed.

On the other hand, I tested this pipeline on my written university lectures that have several hundred pages and here I often got a reduction of around 50% or more, of course depending on the structure of the document. So a rule of thumb is that a lot of text without much headings and graphics gets condensed more and a document with sparse text gets condensed less.

The summaries that this pipeline generates can also be used as input for more summaries and solves the problem of too large input contexts, of course only if the original document is not too large to begin with. Here is an example of a one-page summary based on our five-page summary, that could be used for example to speed up a literature review :

Another idea would be to build a recursive summarizer where we first summarize sub-sections and then summarize these summaries into section-level summaries. Similar like OpenAI did in their case study. Regarding the quality of the summary I can of course only give you my subjective opinion as there is no standardized process to measure the quality of a summary. First I would like to explain how I use summaries in my overall approach to studying for a subject. The summary is used to get a first understanding of how the topics of a subject relate to each other and effectively reduces the time to do so by omitting detailed information.

So in essence you can focus on building a higher-level overview of how topics relate and care for the details of a specific topic later on. Given that I studied successfully for several exams with this approach I am personally satisfied with the quality of the summaries and the time that the automated generation saved me.

Flashcards

Next, we will look into the flashcards. As you may have noticed, the flashcards are very granular which means that practically every bit of information is transformed into a question-and-answer format. The idea here is to curate the flashcards during the revision process, so you have built an understanding of what the learning material is about with the summaries, then studied the material in detail with the chunked and original text and finally revise it with the flashcards.

During this revision process, you can then remove or rephrase sub-optimal cards and also check gaps in your existing knowledge. Additionally, you can evaluate the cards during studying which also can be considered a form of interleaved retrieval practice and results in better learning outcomes at least in my case. And last but not least the automated flashcard generation reduced the time it took to study for an exam, as I found curating the cards is faster than writing them from scratch. Of course, you can also just study the summary cards and call it a day.

Conclusion

In conclusion, the use of the PDF summarization pipeline built with ChatGPT has proven to be a game-changer in reducing study time and eliminating tedious handwriting work for me. The possibility for models like ChatGPT to act as a multitask-learner with minimal input is really fascinating and allows for a vast amount of use cases like this one. It is important to note that I use the GPT-3.5 model although GPT-4 is already available because the price per token demands it. A larger document with 100-200 pages costs around 2-3 $ at this time with the GPT-3.5 model and would cost 10 times as much with GPT-4.

I already did tests with GPT-4 but the capabilities in regards to the transformation of text are not that much different between the two models to justify the costs. This also coincides with what OpenAI reported about the difference between the two models. Given the rapid advancements of this technology and the large number of tasks that can be solved with large-language models, I am looking forward to years of interesting application development. This time there will be no code on GitHub, if my coding skills interest you please refer to my other works.

Writing a CNC Workpiece Positioning Program

Marco Schweiss — Thu, 13 Jul 2023 14:39:45 +0000

The starting point for this next project was a defect in one of our CNC portal milling machines at my workplace, which is used for manufacturing staircase components. The defect did not directly involve the production system itself, but rather the projection laser used for setting up the vacuum clamps and for positioning the workpiece on the milling table.

A conversation with the manufacturer revealed that the projection laser had already been discontinued and there were no longer any spare parts available for it. The only solution would have been a custom-made replacement, as the machine control only had an outdated RS232 interface, which was no longer installed by the manufacturer as a standard feature.

Considering the relatively high cost of several tens of thousands of euros for a new projection laser and the challenging circumstances involved in assembling and disassembling the laser system at around 6m from the ground, I opted for a software solution.

In addition to the aforementioned points, an application for calculating the positioning of the workpiece and vacuum clamps provided a speed advantage, as it could also calculate the necessary zero-point shifts. Normally, the workpiece would be placed at a location determined by the CAM software and then manually adjusted due to the complex shape of the machine table. This trial-and-error process typically consumed a significant amount of time.

Importing the G-Code

As a first step, it was necessary to read the G-code files into the development environment and locate the lines that defined the contours of the workpieces. To do this I leveraged the start and end strings that the CAM software provided. (WANGENKONTOUR) or (AUFG AUSSENK) as the start flag and M00 as the end flag.

Next, I had to extract the actual numerical values from the G-code strings, which would serve as the basis for further computations. To do this I wrote a function that has the task to find any coordinates labeled with an X or Y in the G-code lines and gather them into two separate lists. The plot below shows the contour of the stringer if we plot the extracted coordinates:

Placing the Vacuum Clamps

Let us now look at the real challenge of this project, which was to find a way to position the vacuum clamps on the machine table and how the clamps had to be shifted within the workpiece. My idea was to re-render the contour of the workpiece to fill in the missing values because as you can see in the G-code example above there are big gaps between the points of the original contour. The fine contour is then used to find the position of the clamps inside the workpiece.

My algorithm operates based on the following principle. First, a fictitious line is calculated using the centroid coordinates of the staircase component.

Then, points on the centroid line are sought for the centroids of the two outermost clamps until the distance between the clamps and the contour of the staircase component reaches 4cm.

If we now plot the clamps, this first step looks like in the image below:

The remaining clamps are then placed on the centroid line, offset by the width of one clamp. This results in three clamps on the left and three clamps on the right, as a workpiece is always secured with exactly six clamps.

As you may have noticed, the clamps on the right have a different spacing than the ones on the left. If we plot the machine table beside the stringer it should become clear why this is the case.

It would have been great if the project was finished by now, however, due to the complex machine table, a total of four different algorithms had to be developed since the staircase components can have lengths ranging from 50cm to 6m and sometimes require significantly different positioning of the vacuum clamps within the workpiece and on the machine table. We will look at the different configurations at the end of the article.

Positioning on the Machine Table

After calculating the clamping positions, a suitable spot on the machine table had to be chosen. This calculation also depends on the length of the staircase components. In the next step, the zero offsets and the measurements for clamp and raw workpiece placement are calculated. Finally, a plot is made with the positioning of the workpiece relative to the vacuum clamps and the positioning of the vacuum clamps relative to the machine table.

In addition, the plot, which is output in the form of a PDF in A4 format, shows the necessary zero offsets that are fed into the CNC control panel. A finished computation gives a plot like the one below:

As already mentioned we will also look at the other possibilities of workpiece placement which are showcased in this PDF file.

Conclusion

This project will always have a special place in my heart as my first real attempt at developing an actual useful application. Ironically the project probably features the ugliest code that I have ever written but on the other hand, has also created the most value. The application could prevent a thirty thousand euro investment and has effectively reduced the time it takes to process one workpiece by around 25 percent. Additionally, working on the machine is now less error-prone, and training new employees has become faster.

In the years since I developed the positioning script I always played with the thought of refactoring the code, but given the time it would take to entangle the code and the fact that it really works flawlessly so far, I couldn’t really bother. Of course, you can have a look for yourself, as the code is hosted on GitHub.

Building a Web Scraping Pipeline for Covid-19 Incidences

Marco Schweiss — Thu, 13 Jul 2023 08:39:49 +0000

The internet is a nearly limitless source of information. This content is available to anyone with an internet connection and can be found in the form of text, audio, images, and videos. Because not every piece of information or type of format is important to each user, data retrieval can be a tedious task. Often, rather than being a convenience, this abundance of information might be a problem. To combat this issue and to satisfy the increasing need for data, processes that facilitate the retrieval of data need to be implemented. An example of such a process is web scraping which is used to create platforms where interesting data from different sources is aggregated to reduce information overload.

A good use case to employ web scraping technology is thereby the challenges of COVID-19-related data retrieval and aggregation. Thereby web scraping can be used to address the problem that different countries publish their data in various formats, channels, and intervals. This can make it difficult for researchers to fetch vital datasets for COVID-19-related research in an effective manner. Research concerning COVID-19 is an essential and ongoing effort starting just a few months since the virus began spreading around the globe. Thereby researchers are making great efforts to understand the virus’s characteristics, the causes of its effects, and possible precautionary care.

Not only can academic research benefit from facilitated access to data, but also for the general public, this aggregated data can be helpful. But in contrast to academics, normal people are used to seeing more easily interpretable figures and graphs that illustrate the development of data. Thereby different types of data visualizations like how COVID-19 affects specific geographic locations or age groups and the visualization of key metrics over time can aid in the understanding of the virus characteristics.

As has been explained, individuals often spend a lot of time and effort trying to find the most recent information from multiple sources. This is inconvenient, and as a result, we need a system capable of providing targeted, aggregated, and up-to-date information for important COVID-19-related metrics, which are normally scattered around various websites. The developed web scraping application, which is specialized in aggregating data from the DACH region, allows users to access a visualization of three key metrics, namely the total infections, total deaths, and the seven-day incidence.

This concise overview of three of the most important metrics regarding COVID-19 aims to reduce information overload while still providing a good picture of the current situation in the three countries of the DACH region. Thereby the data is scraped from three selected sources at a daily interval. To facilitate COVID-19-related research, the data which has been scraped since deploying the application can be downloaded in HDF5 format and further analyzed.

A Primer on Web Scraping

Web scraping is a data mining technique used to extract unstructured information from various online resources, format the data so it can be saved and analyzed in a database, and then convert the obtained data into a structure that is more valuable for the user. A well-organized web scraper routinely scans targeted data sources and gathers valuable information into an actionable database. Web scraping has many forms, including copying and pasting, HTML parsing, or text grabbing.

The retrieval of the data can roughly be divided into two steps which are executed sequentially. First, a connection to the requested website is established, and following the content of the website is extracted. In essence, the web scraper copies the classical behavior of a user when surfing the web, searching for and downloading information, only that the actions are automated. To date, a variety of tools and techniques for web scraping are in existence, with the two most popular being parsing the HTML of a website or simulating a complete browser. The latter of the methods can be used if dynamically generated websites that rely on JavaScript being executed by the browser prohibit the first method.

The internet’s popularity was the catalyst of the need to scrape online resources. The first famous scrapers were created to search for information by search engine developers. These types of scrapers go through most of the internet, analyze every web page, extract information from it, and build a searchable index. The use of web scraping has grown in a variety of diverse fields, including e-commerce, content creation for the web, real estate, tourism, finance, and law.

Literature Review

Before creating the web application, a literature review was conducted to get an overview of possible use cases and research applications of web scraping technology. For example, public health and epidemiology investigations frequently involve web scraping. Researchers can possibly identify risks and food dangers and also anticipate possible future epidemics by scraping and analyzing textual data stored on the world wide web.

An example would be that Pollett and co used a web scraping approach to gather unorganized online news data to enable them to identify outbreaks and epidemics of vector-borne diseases. Another example would be Walid and associates, who gathered Twitter data over two years to develop a cancer detection and prediction algorithm. Thereby the team combined the Twitter data and sentiment analysis with natural language processing. In 2015, Majumder et al. used web-scraped health information from HealthMap in conjunction with Google Trend time series information to determine the projected size of a Zika virus pandemic.

As already mentioned, web scraping is also being used to identify food-related dangers. Ihm and coworkers developed a method to defuse and manage food emergencies in Korea by scraping social media and news for events involving food hazards. Another example would be the usage of web scraping for gathering Yelp reviews to discover cases of food-related illnesses which were not reported. This was done by the New York City Department of Health and Mental Hygiene between 2013 and 2014.

Similar studies have also been conducted in regard to COVID-19-related issues. Lan et al. sought to assess the discourse on policies to deal with the coronavirus pandemic through official press articles, social networks, and media outlets. Xu and colleagues scraped Weibo postings from Wuhan to assess public sentiment, understanding, and attitudes in the lead-up to the early stages of the COVID-19 pandemic. Future policy decisions and future epidemic responses may be guided by their findings.

Project Methodology

One significant barrier is the lack of a standardized, regularly updated, and high-quality data source for information on COVID-19 cases in the DACH region. This is due to the fact that the different nations distribute information on unique platforms, institutions, and time cycles that negatively affect researchers’ access to crucial coronavirus datasets for medical analyses at satisfactory resolutions.

To overcome this problem, the developed web application is set up to look for statistical information on COVID-19 using web scraping procedures. Three websites, namely www.rki.de, coron-avirus.datenfakten.at, and www.corona-in-zahlen.de, form the basis for the collection and aggregation of COVID-19 key metrics. The information about the current development is thereby scraped daily and saved inside a relational database. Following, we will discuss the inner workings of the proposed application.

The application can really be split into two parts according to their functionality, which is the front-end for display and download of the gathered data and the actual web scraping. The whole application is thereby set up as a web application based on the Django library, a well-known Python library for web applications.

Data Scraping Logic

The task of gathering the COVID-19 data from the three aforementioned websites is done by a single Python script which is implemented as a Django command and executed by a cronjob which is registered on the server hosting the web application. For scraping, the application leverages the Requests and BeautifulSoup4 library, which will be reviewed in detail later on.

As already mentioned, the web scraping process is triggered by the cronjob at the scheduled time of eleven PM. The script works by sending an HTTP request to retrieve the raw HTML, which is then parsed with BeautifulSoup4. The parsing process yields a Soup object which is then examined to extract the needed information, which in our case are total infections, total deaths, and the 7-day incidence for each website, respectively. The resulting 9 data points, three for each county, are then saved in the relational database, which is implemented with the SQLite library. The following image illustrates the process.

After discussing the general process of the scraping logic, we will take a look into the two Python libraries used for web scraping, namely Requests and BeautifulSoup4.

Requests

Requests is a simple, tried-and-true HTTP library for the Python programming language that allows sending HTTP requests very easily. The Requests library is built upon Python’s urllib3, but it lets you manage the vast majority of HTTP use cases with code that is short, pretty, and easy to use. To get a web page, we have to call the get method of the library and issue the website URL as a parameter. The response of the server is then contained in the response object. When working on an application, it’s often required to deal with raw, JSON, or binary response. Therefore it is needed to decode the response of a web server, which is also handled by the Requests library.

BeautifulSoup4

Beautiful Soup is a Python library for extracting data from HTML and XML files. The library formats and organizes the data from the web by fixing bad HTML and giving us easily traversable Python objects corresponding to XML structures. The most widely used object in the BeautifulSoup library is the BeautifulSoup object which takes the HTML content of a webpage and a parser as arguments. In our case, the HTML content is the response object and specifically the raw string format of the response object and the Python built-in html.parser, which are both passed as arguments.

Data Visualization and Provision

After looking into the web scraping part of the application, we will now discuss the second part of the application, which is concerned with the visualization of the gathered data and the provision of the data as a HDF5 file. The front end of the web application is implemented as a single page which is served by a Django view function. Thereby the function works by gathering all data from the database inside a dictionary which is then used to generate a figure. The figure is then sent to the browser for rendering. Like the data visualization, the provision of a HDF5 file works on a similar basis. The process starts by clicking the download button on the index page of the web application. This triggers the registered view, which first queries all entries in the database, but instead of creating a figure for visualization, the web application then creates a HDF5 file and serves it as a file to the user.

HDF5

To make the gathered dataset accessible for researchers and to facilitate analysis, the format chosen for data provision is the HDF5 file format. HDF5 is used for storing and exchanging data. It supports all types of data, is designed for flexible data input and output, and can store large volumes of complex information. The HDF5 data model is composed of three elements which are datasets, groups, and attributes. A user can use these elements to construct particular application formats that organize data in a way that suits the problem domain. HDF5 also handles all issues related to cross-platform storage, so sharing data with others is a straightforward matter. The actual HDF5 library is implemented in C, and both of the most popular Python interfaces, h5py, and PyTables, are designed as a wrapper around the C library.

Pandas HDF5 Capabilities

Pandas is a Python library for data analysis with several data structures and functions designed to make working with structured or tabular data more intuitive. Additionally and more relevant for this project, the Pandas library contains functionality for reading and writing HDF5 files, which is implemented with the PyTables library. Thereby a Pandas DataFrame, which is the library table-like data structure, can easily be converted into the datasets, groups, and attributes that make up a HDF5 file.

Discussion of the Results

Users can access the most recent COVID-19-related information from three websites covering the DACH region with the help of the developed web application. Since the data is scattered around different websites and often buried under a lot of other information, this application provides added value to the original data. People typically do not have the time to research many websites to gather data or information since they are too busy with everyday activities, which makes a central overview of key metrics very convenient.

Given that the application is processing data that is not originally our own, we must also examine the legal concerns which are tied to a web scraping application. Scraping data for personal use poses practically no problem. However, if the acquired data is published again, then it is crucial to consider which type of data is processed. Internationally recognized court trials have helped define what’s acceptable for scraping websites. These trials indicate that if copied data are facts, the data can be repurposed. However, data that is original, for example, ideas, news, and opinions, cannot be used because many times the information is copyrighted. Given that the application scrapes factual information, we can establish a case for the scraping, aggregation, and republishing of this data.

Comparison with the Literature

To compare the proposed methodology of this project, a review of similar studies was conducted. For example, in 2021, Lan and colleagues created an accessible toolkit that can automatically scan, scrape, collect, unify, and save the data from COVID-19 data sources from different countries across the world. They also used Python as a programming language and well-known packages like Beau-tifulSoup, Selenium, and Pandas for the collection and processing of the data. In a single run, the application will gather information from all over the world. The information is then consolidated in a single database for sharing after post-processing and data cleansing. The COVID-Scraper also provides a visualization feature to make the whole data available for viewing as a web service. Additionally, the data can be downloaded as a PDF. The tool developed in this study also features the ability to process PDFs, images, and dynamically generated websites.

Another interesting study used Python with libraries like Scrapy, which is focused on the development of web crawlers, to scrape big scholarly databases like PubMed and arXiv. The process involved first searching for COVID-19-related papers in a variety of databases and then collecting the metadata of the search results with web scraping techniques. The acquired metadata was then processed with the Pandas library, which included cleaning and normalizing the data, and then merged into a single dataset in the form of a CSV file. The resulting dataset features over 40,000 records of metadata related to COVID-19 research. The intended usage of the aggregated dataset is to detect possible trends, gaps, or other interesting facts about COVID-19 research in general. The study also used version control systems for the Python code.

By comparing the developed web application with other approaches from the literature, we can establish a case for using the proposed techniques and technology but also derive ideas for improving the application further. For example, a limitation of the application in its current form is that dynamically generated websites cannot be scraped because a library like Selenium is needed. Selenium thereby simulates a browser that can execute JavaScript that gathers information from the server by asynchronous requests. This reveals data that stays hidden when using libraries like Requests. Additionally, the application can be further improved by allowing for file-based scraping. Given that a lot of data sources publish their information as PDFs or CSV files, this can further broaden the range of data that can be processed by the application.

Conclusion

The ever-increasing amount of information available on the internet and the scale of that information’s development over time leads to search queries that produce vague and unspecific results. However, by making use of information extraction methods, we can diminish the reach of this issue and focus useful information on a single platform. Billions of people worldwide have been affected by the outbreak of the coronavirus. Organizations, governments, and researchers are working diligently to research the issue of COVID-19 and achieve the goal of returning civilians to their natural way of living as soon as possible.

Researchers around the world have used comprehensive COVID-19 records to provide information that contributed to the progress of COVID-19 research. In order to efficiently obtain, compile, preserve, and exchange data that is open to the public, we have implemented an online application that scrapes selected COVID-19 metrics from the aforementioned three websites. It is useful for residents of the DACH area to witness the progression of COVID-19 over time. Under the proposed parameters, the CoronaScraper is a highly adaptable and automated tool that facilitates research and provides targeted information to the public. As always the code is available on GitHub.

Reference

Al Walid, M. H. (2019). Data analysis and visualization of continental cancer situation by twitter scraping. International Journal of Modern Education and Computer Science, 11(7), 23–31.

Chandra, R. V. (2015). Python requests essentials. Birmingham: Packt Publishing. Collete, A. (2013). Python and hdf5: Unlocking scientific data. Sebastopol: O’Reilly.

Comba, J. L. D. (2020). Data visualization for the understanding of covid-19. Computing in Science & Engineering, 22(6), 81–86.

Deepak, K. M., & Lisha, S. (2016). A dive into web scraper world. 2016 3rd International Conference on Computing for Sustainable Global Development , 689–693.

Dongthi, T., Nguyen, Kieuthi, Chung, T., & Dong Thi Thao, N. (2020). New trends in technology application in education and capacities of universities lecturers during the covid-19 pandemic. nter-national Journal of Mechanical and Production Engineering Research and Development (10), 1709–1714.

Hajba, G. L. (2018). Website scraping with python: Using beautifulsoup and scrapy. Berkeley: Apress L. P.

Harris, J. K., Mansour, R., Choucair, B., Olson, J., Nissen, C., & Bhatt, J. (2014). Health department use of social media to identify foodborne illness - chicago, illinois, 2013-2014. Morbidity and Mortality Weekly Report , 63(32), 681–685.

Ihm, H., Jang, K., Lee, K., Jang, G., Seo, M.-G., Han, K., & Myaeng, S.-H. (2017). Multi-source food hazard event extraction for public health. In 2017 ieee international conference on big data and smart computing (pp. 414–417).

Kenneth, R. (2022). Requests: Http for humans. Retrieved from https://requests.readthedocs.io/en/latest/

La, V.-P., Pham, T.-H., Ho, M.-T., Nguyen, M.-H., P. Nguyen, K.-L., Vuong, T.-T., . . . Vuong, Q.-H. (2020). Policy response, social media and science journalism for the sustainability of the public health system amid the covid-19 outbreak: The vietnam lessons. Sustainability , 12(7), 2931.

Lan, H., Sha, D., Malarvizhi, A. S., Liu, Y., Li, Y., Meister, N., . . . Yang, C. P. (2021). Covid-scraper: An open-source toolset for automatically scraping and processing global multi scale spatiotemporal covid-19 records. Ieee Access, 9, 84783–84798.

Lawson, R. (2015). Web scraping with python: Scrape data from any website with the power of python. Birmingham: Packt Publishing.

Majumder, M. S., Santillana, M., Mekaru, S. R., McGinnis, D. P., Khan, K., & Brownstein, J. S. (2016). Utilizing nontraditional data sources for near real-time estimation of transmission dynamics during the 2015-2016 colombian zika virus disease outbreak. JMIR public health and surveillance, 2(1), e30.

McKinney, W. (2018). Python for data analysis: Data wrangling with pandas, numpy, and ipython (2nd ed.). Sebastopol: O’Reilly.

Mitchell, R. (2015). Web scraping with python: Collecting data from the modern web (1st ed.). Sebastopol: O’Reilly.

Pandas. (2022). Input/output — pandas documentation. Retrieved from https://pandas.pydata.org/pandas-docs/dev/reference/io.html#hdfstore-pytables-hdf5

Pollett, S., Althouse, B. M., Forshey, B., Rutherford, G. W., & Jarman, R. G. (2017). Internet-based biosurveillance methods for vector-borne diseases: Are they novel public health tools or just novelties? PLoS Neglected Tropical Diseases, 11(11), e0005871.

Richardson, L. (2021). Beautiful soup documentation. Retrieved from https://beautiful-soup-4.readthedocs.io/en/latest/S. D. S, S. (2015). A comparative study on web scraping. In Proceedings of 8th international research conference.

Santos, B. S., Silva, I., Da Ribeiro-Dantas, M. C., Alves, G., Endo, P. T., & Lima, L. (2020). Covid-19: A scholarly production dataset report for research analysis. Data in brief , 32, 106178.

Shakra, M., Rabia, Z., Sharaz, A., & Sohail Masood, B. (2019). Exploiting filtering approach with web scrapping for smart online shopping : Penny wise: A wise tool for online shopping. 2019 2nd International Conference on Computing, Mathematics and Engineering Technologies, 1–5.

The HDF Group. (2022). Hdf5. Retrieved from https://portal.hdfgroup.org/display/HDF5/ HDF5 The Octoparse Team. (2022). 9 industries benefiting from web scraper octoparse. Retrieved from https://www.octoparse.com/blog/9-industries-benefit-from-web-scraper-octoparse

Upadhyay, S., Pant, V., Bhasin, S., & Pattanshetti, M. K. (2017). Articulating the construction of a web scraper for massive data extraction. In Proceedings of the 2017 second ieee international conference on electrical, computer and communication technologies (pp. 1–4).

vanden Broucke, S. (2018). Practical web scraping for data science: Best practices and examples with python. New York: Apress.

Xu, Q., Shen, Z., Shah, N., Cuomo, R., Cai, M., Brown, M., . . . Mackey, T. (2020). Characterizing weibo social media posts from wuhan, china during the early stages of the covid-19 pandemic: Qualitative content analysis. JMIR public health and surveillance, 6(4), e24125.

Scrum for a Software Development Project

Marco Schweiss — Wed, 12 Jul 2023 12:57:45 +0000

Lots of new development approaches were introduced to fit the needs of software companies. Most software development companies nowadays aim to produce high-quality software at a low cost and within a short time frame. Software development has become more complicated in recent years, with customer requirements demanding frequent changes and traditional software development life cycle approaches unable to keep up. Given this problem, agile methods were introduced to meet the new requirements of software development companies.

There are many Agile methods being used today, for example, Scrum. Thereby Scrum is probably the most popular framework that people and organizations use for software development. In the last decade, Scrum has transformed from a project management methodology to a way of understanding how dysfunctional teams, departments, and entire organizations are managed. Numerous studies have shown that when following the Scrum framework, companies have increased employee satisfaction, higher quality of code, and improved transparency regarding the current status of the project.

In order to maintain success in today’s business world, the need for fast development and high-quality content is imperative. Organizations these days are being asked to react quickly to technological developments in their industries, and many of them are feeling a lot of pressure related to adapting the business better. In a rapidly changing world, it’s more important than ever to be able to deliver high-quality products at lower costs. To meet this demand, many companies have already realized the benefits of agile methods and adopted them.

To demonstrate the advantages and the overall process of Scrum, we will look into how Scrum would be implemented for the development of an Enterprise-Resource Planning (ERP) system. ERP systems are the backbone of most modern companies and should ideally be in place to support many processes and workflows. The endless functional possibilities an ERP system offers ensure that employees are able to do their jobs in the most efficient way possible. However, they are also among the most expensive and laborious architectures in a company’s IT landscape. Agile management frameworks, like Scrum, were specifically designed to enable large and complex projects such as ERP systems to be successfully implemented.

It is possible to spend more than a year on ERP development, or it could become a permanent work site. One big reason that traditional development processes are not sufficient is that the integration and acceptance tests take place at the end of development. This means that at the end of the project, it can take a lot of time and money to fix the problems and change requests that occurred. Agile methods like Scrum can, by design, circumvent some of these problems.

Primer on Scrum

Before we start discussing the specifics of how Scrum can be utilized for developing an ERP system, we must get a basic understanding of what Scrum actually is and how it works. Scrum is a framework for team-based product development. It comes with Scrum roles, events, artifacts, and rules. The iterative approach allows you to continually create workable chunks of the final product. Thereby Scrum runs on a timebox of no more than one month with a consistent duration called sprint. Sprints produce the possibility of releasing an increment of the product at the end of the cycle.

Scrum Roles

Considering only the roles, Scrum is a straightforward methodology. It only recognizes three roles -the Scrum Master, the Product Owner, and the Development Team.

Product Owner: The Product Owner is the most important role in Scrum. Product owners make decisions that help shape the future of the product, and without a competent Product Owner, projects will likely fail. The optimal Product Owner is usually the one who understands the requirements they’re given from their customers’ perspective best, so they should have excellent communication skills and be able to motivate others.
Scrum Master: While the Product Owner is responsible for managing the product, the Scrum Master is responsible for managing the Scrum process, The Scrum Master’s role is limited to the Scrum process, and they do not have any technical or domain-related work to do. As the Product Owner, the Scrum Master is a leader.
Development Team: The Development Team is responsible for taking the product requirements and turning them into a finished product. The team is self-organizing and typically takes care of impediments themselves. They are empowered and do not shift responsibility to others. If they fail, the Scrum Master helps them.

Scrum Artifacts

As for the roles, Scrum has only a few artifacts: the product backlog, sprint backlog, and product increment.

Product Backlog: The product backlog is the product owner’s list of desired features and improvements for the product. This list can be ever-growing as there are always new ideas about how to extend and improve a product. The product owner is responsible for maintaining the product backlog, although other stakeholders, including the team, should be able to see the list and suggest new items for it.
Sprint Backlog: The sprint backlog, which is owned by the team, contains the product backlog items that the team committed to during sprint planning, as well as any subsequent tasks and reminders. They may also add, remove or change tasks as the sprint progresses.
Product Increment: During each sprint, the team completes a set of features or other deliverables known as the product increment. This product increment should be potentially shippable, meaning it is of high enough quality to give to users. The product owner is in charge of accepting the product increment and making sure it meets the acceptance criteria. Without a product increment, there is no possibility to inspect or make changes to the product.

Scrum Process

After summarizing the roles and artifacts, let us look into the actual Scrum process. Scrum starts with a vision. The vision might not be well-defined at first, but it will become clearer as the project moves forward. Based on this vision, the Product Owner is writing the Product Backlog, which lists out all of the tasks that are required to complete your project. At the start of each iteration, the Product Owner presents priorities from the Product Backlog to the team.

The team then decides on what it thinks it will be able to accomplish within an increment and creates a Sprint Backlog for that iteration. The team is given the responsibility to develop the features they have chosen. They take a closer look at the requirements, think about the technologies that are available, and consider their skills and how capable they are of delivering. When developing a new feature, the team brainstorms and collectively determines the approach, modifying its strategy every day to find out where the difficulties are. The team works on a shared goal and discusses how to meet it. This dialogue is the way that Scrum ensures its effectiveness.

At the end of the sprint, the team presents the new functionality to the Product Owner for review and to note down any changes that are needed. This is called the Sprint Review. Additionally, there is also the sprint retrospective, a recurring meeting at the end of every sprint in order to discuss what went well and what can be improved for the next cycle. It is important to note that the sprint review and the sprint retrospective have quite a few differences. The first one is about the product, while the second is about the team. The Sprint Review serves to maintain customer expectations, while retrospectives in the Scrum process allow teams to become faster and more compatible.

Project Overview

Enterprise Resource Planning systems are some of the world’s largest and most complex software systems. They focus on intra-company processes and they integrate functional and cross-functional business processes to make sure that similar activities and, consequently, data are managed in a consistent manner. The scope of an ERP typically includes operations, human resources, finance and accounting, sales and distribution, and procurement.

As already mentioned, we will look into the implementation of an ERP system to demonstrate how Scrum can be used to tackle complex tasks in an agile way. Specifically, our ERP implementation consists of a data warehouse, which serves as the backbone of the system, a supply chain- (SCM), and customer-relationship-management (CRM) solution and reporting functionality for the finance and accounting department. To demonstrate the Scrum process with this small ERP system, we will first look into how the planning of the project as a whole could take place when considering agile practices.

Agile ERP Development

To avoid the common pitfalls of traditional ERP development, it is advisable to employ agile practices from the get-go. This means splitting up the ERP system into individual components and the formation of cross-functional agile teams, including dedicated product owners, that implement these components. The teams can then tackle their individual tasks according to the Scrum process already described. In the case of this project, we split the development of the ERP system into four parts. Therefore we have four small teams, each working on either the data warehouse, SCM, CRM, or reporting solutions.

Scrum in Detail

After defining the overall approach to adapting an agile process to ERP development, we will look at the Scrum process in detail. Specifically, we examine how Scrum can be employed for the development of the reporting component for the accounting and finance departments.

Gathering the Team

As already explained, we can find three distinct roles inside a Scrum team. We will start with choosing a suitable product owner for our reporting software. Given that the product is used internally inside the company, the ideal product owner is the person responsible for the business process that is targeted by the developed product. In our case, the ideal product owner would be the head of the accounting and finance department. For the other components of the ERP system, we would choose the product owner according to the same criteria.

Our second team member takes on the role of the Scrum master. In the case of this project, the Scrum master will be someone from the software company developing the ERP system. Ideally, it should be a Scrum master who has done the proper training, has a Scrum certificate, and knows how to facilitate Scrum. Additionally, it is important that the Scrum master has at least a bit of domain knowledge because this ensures that the Scrum master can resolve team issues that are based on subtle domain-specific problems.

Lastly, we choose the developers for the project. Here it is not only important who but also how much. A Scrum team, excluding the Scrum master and product owner, should neither be too large nor too small. For example, smaller teams with less than three members can run into skill constraints or not enough interaction inside the team. On the other hand, a team with more than nine members can become too complex to manage. For our project, we choose seven developers so that we end up with a cross-functional team. Thereby cross-functional means that all necessary skills to develop a product are present in the team.

Vision and Product Backlog

To start the development of our reporting software, we first define the vision behind the product. The vision is an important first step because it provides a path to meet the project’s goals. The vision of the reporting software is inspired by an example of scrum.org and could look like the following table.

With the vision in mind, the product owner now writes a product backlog that contains the features that should be implemented. At this stage not all requirements are known, so this first version of the backlog is not set in stone but changes continuously throughout the development. The features in the backlog are expressed as what are called user stories. Thereby a story is a non-technical description of the feature with emphasis on the perspective of a user. The backlog is also sorted by the relevancy of the features, which means that features that are more important at the moment are placed at the top of the backlog. Following we will look at an exemplary product backlog for the reporting software:

As already explained, the features with higher priority are ranked at the top. In our case, the team has agreed to place the login functionality as a top priority. Following, we have data analysis and visualization, and lastly, the generation of reports and data export. In addition to the priority ranking, all stories inside the backlog need to have a workload estimation. For this project, we use story points, which are a scale describing the workload of the individual items relative to each other. This means a story with four-story points has an estimated workload that is four times greater than a story with one story point.

Sprint Planning and Sprint Backlog

With the product backlog prepared, the team is ready to start with the first sprint. On the first day of the sprint, the team meets for what is called sprint planning. Now the decision is made on which of the product backlog items are added to the sprint backlog and is ultimately implemented during the current sprint. Per definition, a sprint should add a valuable increment or, in other terms, a coherent feature to the project.

Our team decides to add the first three items of the product backlog to the sprint backlog, focusing on implementing the basic login functionality and infrastructure for the web-based application. The developers now examine the selected user stories and work out the implementation on a more technical level. An exemplary sprint backlog could look as follows:

In the course of the sprint, this sprint backlog is updated by the team members every day during the daily Scrum. The team also computes a time estimate for the work that remains for each task, which results in what is called a burndown chart.

Sprint Review and Sprint Retrospective

After completing the sprint, our team ships a working increment of the reporting software, namely a basic version supporting login and registration functionality. During the sprint review, a demonstration of the software is given, where feedback and insights are collected. Thereby, the collected feedback influences the decision of what changes should be made to the product backlog. In addition to the sprint review, the team also conducts a sprint retrospective. Here the team discusses what worked out and what was a failure regarding team performance. This event is facilitated by the Scrum master and serves to improve the team’s output further and further. For example, our team could have an outstanding performance at the daily Scrum when the remaining time for work is estimated. In contrast, there could be a lack of communication during actual work on the project. Points like this are discussed at the sprint retrospective and serve as the foundation for the team to improve.

A Comparison of Scrum and Plan-Driven

After wrapping up our exemplary scrum demonstration, we will now look at how Scrum, which is an agile approach to project management, can be compared to a plan-driven approach. The plan-driven or Waterfall approach is a traditional way to develop software and follows a sequential process. However, this methodology is plagued by a lot of drawbacks, as most of the time, a rigid plan cannot be met in software development. In contrast to the rigid process of the Waterfall approach, agile approaches like Scrum are designed to be flexible and adaptable regarding the change of requirements. Following, we will look at a detailed comparison between the two methodologies:

Conclusion

Traditional project management approaches for software development are faced with a variety of problems, given their plan-driven and inflexible nature. To combat these issues, agile project management methods like Scrum were introduced. Although agile project management approaches demand a higher degree of customer involvement, the benefits over traditional plan-driven approaches are clear, especially when faced with high-uncertainty projects.

References

Adobe. (2022). Sprint retrospective. Retrieved from https://www.workfront.com/project-management/methodologies/scrum/sprint-retrospective

Agile Academy. (2022). Was sind story points? Retrieved from https://www.agile-academy.com// de/product-owner/was-sind-story-points/

agile42. (2022). Scrum in a nutshell. Retrieved from https://www.agile42.com/en/agile-community/agile-info-center/scrum-nutshell

Backlog. (2020). How to run a successful sprint review meeting - backlog. Retrieved from https:// backlog.com/blog/successful-sprint-review-meeting/

Blueprintsys. (2022). Scaling agile step 3: Organize in cross-functional teams - blueprint. Retrieved from https://www.blueprintsys.com/scaling-agile/step-3-organize-in-cross-functional-teams

Casanova, D., Lohiya, S., Loufrani, J., Pacca, M., & Peters, P. (2019). Agile in enterprise resource planning: A myth no more. Retrieved from https://www.mckinsey.com/business-functions/mckinsey-digital/our-insights/agile-in-enterprise-resource-planning-a-myth-no-more

Cocco, L., Mannaro, K., Concas, G., & Marchesi, M. (n.d.). Simulating kanban and scrum vs. waterfall with system dynamics. In Lecture notes in business information processing (pp. 117–131).

Cohn, M. (2022). Sprint backlog and the scrum sprint. Retrieved from https://www/.mountaingoatsoftware.com/agile/scrum/scrum-tools/sprint-backlog

Geekbot. (2020). Sprint review vs sprint retrospective: The critical difference. Retrieved from https://geekbot.com/blog/sprint-review-vs-sprint-retrospective-the-critical-difference

Gloger, B. (2010). Scrum. Informatik-Spektrum, 33(2), 195–200.

Gloger, B. (2022). Scrum und erp - passt das zusammen? Retrieved from https://www.borisgloger.com/wp-content/uploads/Publikationen/Whitepapers/ Whitepaper Scrum und ERP de.pdf

MacKay, J. (2020). How to run an effective sprint retrospective. Retrieved from https://plan.io/ blog/sprint-retrospective/#what-is-a-sprint-retrospective

Magal, S. (2012). Integrated business processes with erp systems. Hoboken, NJ: Wiley.

Mahalakshmi, M., & Sundararajan, M. (2013). Traditional sdlc vs scrum methodology–a comparative study. International Journal of Emerging Technology and Advanced Engineering, 3(6), 192–196.

Maximi, D. (2019). The scrum culture: Introducing agile methods in organizations. Berlin: Springer. Mcdonald, K. (2022). How do you select a product owner? Retrieved from https://www/.agilealliance.org/how-do-you-select-a-product-owner/

McKenna, D. (2016). The art of scrum: How scrum masters bind dev teams and unleash agility. Berkeley, CA: Apress.

Mitchell, I. (2017). A typical sprint, play-by-play. Retrieved from https://www.scrum.org/resources/ blog/typical-sprint-play-play

Nguyen, T.-A. (2013). Choosing the right scrum master: 4 must-have qualities. Retrieved from https://openviewpartners.com/blog/choosing-the-right-scrum-master/

Obergfell, Y. (2011). The scrum product backlog. Retrieved from https://www.scrum-institute/.org/The Scrum Product Backlog.php

Ohanian, C. (2022). Adopting agile practices to achieve a competitive advantage. Retrieved from https://www.circusstreet.com/blog/using-agile-practices-to-achieve-a competitive-advantage

Project Management Institute. (2017). Agile practice guide. Newton Square, Pennsylvania: Author. Rubin, K. S. (2014). Essential scrum: Umfassendes scrum-wissen aus der praxis (1st ed.). Frechen:mitp.

Schuurman, R. (2017). 10 tips for product owners on the product vision. Retrieved from https:// www.scrum.org/resources/blog/10-tips-product-owners-product-vision

Schwaber, K., & Sutherland, J. (2017). The scrum guide. Retrieved from https://scrumguides.org/ docs/scrumguide/v2017/2017-Scrum-Guide-US.pdf

Viscardi, S. (2013). The professional scrummaster’s handbook. Birmingham, U.K: Packt Publishing.

Visual Paradigm. (2022). How to write a product vision for scrum project? Retrieved from https:// www.visual-paradigm.com/scrum/how-to-write-scrum-product-vision/

Creating an Airbnb-like Data Mart

Marco Schweiss — Mon, 10 Jul 2023 08:17:40 +0000

Airbnb was founded in October 2007, stemming from personal experiences of encountering overpriced shared apartments and fully booked hotels during a crowded conference in San Francisco. The original name, Airbedandbreakfast, was later shortened to Airbnb in 2009. Serving as an online platform, Airbnb facilitates the connection between hosts and guests and assumes responsibility for managing the booking process.

Transactions are conducted through the platform, with guests making payments for their bookings using credit cards. To ensure that the accommodation matches its description on the platform, hosts receive payment 24 hours after the guest’s arrival. In 2013, Airbnb generated revenue by charging a commission of 6-12% from guests, and 3% from hosts, and amassed a total of $150 million from approximately 10 million overnight stays.

Every user, whether a host or a guest, creates a profile page on Airbnb. Hosts are required to upload at least one photo and provide a phone number, while guests need to provide more comprehensive information. Hosts have the option to describe their accommodations using text and photos. Both guests and hosts can rate each other based on their experiences. The platform also features a calculator function that allows users to estimate their expected income from their own accommodations.

To improve my SQL skills, I created a proof-of-concept for the Airbnb use case in the form of a data mart with the following list of requirements:

Entity Relationship Model (ERM): Development of an ERM that encompasses the definition of data tables, attributes, and relationships among entities pertaining to the Airbnb use case.
Database Management System (DBMS): Selection of a suitable Database Management System (DBMS), such as MySQL or any other SQL-based system, to implement the database for handling Airbnb-related information.
Database Structure: Establishment of an appropriate database structure that facilitates efficient storage and processing of data relevant to the Airbnb use case.
Dummy Data: Generation of realistic dummy data to populate the database, ensuring that it accurately reflects the various scenarios and characteristics encountered in real-world Airbnb situations.
Query Functionality: Creation of well-crafted queries that enable seamless retrieval and presentation of data, effectively showcasing the functionalities of the database.
Documentation: Thorough documentation of all implementation steps, including the development of the ERM, database structure, and SQL statements employed, ensuring comprehensive records of the entire process.
Data Storage Optimization: Implementation of data storage optimization techniques, including proper normalization of the database structure, to minimize redundancy and optimize data storage for essential information.

In the following sections, we will take a look at how the project was conceptualized and developed.

Conception

The first step of the project was to derive an appropriate database design. The database design process includes a crucial stage called database modeling, which holds the utmost significance in the overall database design. Any oversights or omissions during this step can have a detrimental impact on the implementation stage, potentially rendering the database useless.

To initiate the process, the first step involves crafting a comprehensive requirements specification for the Airbnb project. This specification document should encompass a thorough analysis of the requirements, delving into the following aspects in greater detail:

Identification of various roles or user groups involved in the system.
Specification of the actions or activities performed by each of these roles.
Determination of the necessary data and functions required to fulfill the system’s requirements.

The results are compiled in this PDF file.

Development

The next step was to implement the data mart. Throughout this phase, I ensured that all SQL statements were meticulously documented in my database file.

Specifically, I delivered the tables and relations for the database in a SQL data file, aligning them with the concepts outlined in the ER-Diagram. Additionally, I diligently documented every SQL statement involved in the creation process.

To ensure a comprehensive test of the system, I made certain that each table contained a minimum of 20 entries. Furthermore, I created at least one test case for the database, specifically designed to validate its functionality based on the ER-Model.

By adhering to the provided guidelines and ensuring the documentation and testing of the system, a first working version of the data mart was derived. A detailed explanation of the database design and implementation procedure can be seen in this PDF.

Finalization

In this final phase, the goal was to polish and refine the database management system, after having received feedback. Given that only a minor change to the schema of the database was necessary and the first iteration of the data mart was up to standard, this last phase was rather short. This document sums up the project and shows the changes to the database schema.

Conclusion

As for my last application, this project was a great opportunity to gain development experience and also to make a deep dive into pure SQL. I really like the final version of the data mart and especially the location search functionality. The code for the data mart and instructions on how to use it can be found in this GitHub repository.

Developing a Habit Tracker

Marco Schweiss — Sun, 09 Jul 2023 14:17:32 +0000

Developing positive habits and eliminating negative ones can be a challenging endeavor. We all strive to maintain consistency in our actions and achieve personal goals, but it’s not always easy to stay on track. Thankfully, the growing trend of habit trackers has emerged as a valuable tool to support individuals in their pursuit of self-improvement. With a quick visit to any popular app store, you’ll discover a vast array of habit-tracking applications available, catering to various needs and preferences. Nonetheless, in my quest to enhance my programming skills, I have decided to create my own habit-tracking application in Python as one of my first serious attempts at development.

Formally, a habit refers to a clearly defined task that requires periodic completion, such as daily teeth brushing or annual dentist visits. The fundamental components of a habit-tracking app can be outlined as follows:

The application allows users to define multiple habits, each with its own task specification and periodicity.
Users have the ability to mark tasks as completed, indicating that they have been checked off at any given time.
Each habit necessitates at least one task completion within the user-defined period. Failure to complete a habit during the specified period is considered breaking the habit.
If a user successfully completes a habit for a consecutive number of periods without any breaks, it is referred to as establishing a streak. For example, if a user works out every day for two full weeks, they would have a 14-day streak of exercising.
The app not only stores the habits entered by users but also provides analysis features. Users can obtain insights into various aspects, such as determining their longest habit streak or accessing a list of their current daily habits.

Conception

The most important part of any project is the conception phase because anything that is overlooked or forgotten in this phase has a negative effect on the implementation later and might lead to a failed project. So the first step of this project was to describe everything that was necessary to build the app and put it into a written concept. Following we will talk about the general idea for the program flow, the database schema, an overview of the different components, and how they interact with each other.

Program Flow

I decided to design my application similar to classic UNIX command line tools (e.g. ls, grep). This means, that the functionality of the application can be utilized directly from the command line by issuing parameters with the application call. So the application is started up, executes the given command, and terminates after every user interaction.

Data Storage

For persistent data storage, we will use Python’s built-in sqlite3 library which allows us to setup a lightweight SQL database stored in a single file. This database file acts as the backbone of the application, with all functionality reading and writing from and to the database. The idea here is to embed SQL code, which allows precise manipulation and querying of the habit data, into the methods and functions written in Python. With this approach, we have full access to powerful SQL queries inside our application.

The data around a habit is stored in two different tables. The first table holds the data concerning a habit in general and the second table stores the tracking data of a habit, which means that every time a habit is checked-off, the current date is stored as an entry. Additionally, the database sports a periods table containing the supported periods (Daily, Weekly) and a user table for handling login/logout functionality.

Habit Management, Analytics, and CLI

We will store all the logic necessary for creating, deleting, and checking habits, in a class called Habit. The class consists of three methods, create, delete, and check that contain the respective SQL and Python logic. Upon success or failure of a command, a message is printed back to the command line.

For analyzing the habits we will create a separate module containing all necessary functions which will adhere to a functional programming paradigm to avoid side effects and changing the stored data. For implementing the control of the application with command line parameters, the Fire library is used. Fire utilizes a function that takes a module, object, class, or another function as an argument and exposes it to the command line.

To finally bring all the components together, the class Habit and the module containing the analytics functions are assigned to the class Pipeline. The Pipeline class effectively wraps all functionality, is then passed to the aforementioned function of the Fire module and an instance of the class is called in the main Python script.

Development

After discussing the concept that was used as a basis for development, we will now look at the actual implementation in greater detail. We will start off with an overview of the application structure and look into the individual components later on.

Application components

The application is split into 8 different files:

habits.py
- Contains the class for creating, deleting, and checking habits and for exposing functionality to the command line.
analytics.py
- Contains all functions for analyzing habits
pipeline.py
- Contains a class for wrapping all functionality for the CLI
admin.py
- Functions for initializing or deleting the database and generating random data entries
user.py
- Functions for login, logout, and echoing the current user
tracker.py
- The main file that exposes all functionality to the command line
decorators.py
- Contains all decorators
test_tracker.py
- Contains the unit test suite

tracker.py

The tracker.py file contains the main entry for regular program usage concerning management and analyzing habits. This is done by importing the Pipeline class which is then passed to the Fire library in the main loop. The main loop essentially consists of two checks, first if the database is initialized and second, if a user is logged in. After the successful completion of the two checks, Fire’s functionality is called with the user’s argument, effectively executing whatever action the user wants.

habits.py

The habits.py file contains the Habit class, which supports the three methods create, delete and check. The method names are as concise as possible to avoid complicated commands on the user side. When a new instance of the Habit class is created through Fire, the constructor establishes a database connection and assigns it as an attribute to the new instance. With this, the database connection is then available to all three methods. Additionally, the username of the actual user is also assigned as a class attribute.

Creating and Deleting a Habit

The create method takes the habits name and period as input from the user and a third argument entry_date which is at default today’s date. The date always defaults to the actual date with the __filter_date function if a user is logged in, so this argument is purely for the test data generator and the unit test to allow custom dates. The methods flow is first a call to the __filter_date function, second, a check if the period value is correct (Daily, Weekly), third a check if the habit does not already exist, and then an INSERT statement with the username and today’s date.

The delete method only takes the habit’s name as an argument. First, the method checks if the habit exists, followed by a DELETE statement.

Checking a Habit

The check method takes the habit’s name and date as an argument. As for the create method, the date is defaulted to today if a real user is logged in. The check method first checks if the given habit exists in the database, then evaluates if the habit is already checked for today. Then the method queries the given habits ID, followed by the period, and the last date on which the habit has been checked.

Following, the method evaluates if the habit is checked for the first time. If this is the first check, the streak is increased, else it must be evaluated if the streak is broken. If the period is daily, the last check date must be yesterday, else the streak is broken. If the period is weekly, the last check date must be at a maximum, seven days from today, or else the streak is broken. If the streak is broken the breaks must be incremented.

Then it must be checked if the current streak is also the longest streak. To evaluate this the method queries the LongestStreak entry and if the CurrentStreak is greater than the LongestStreak the value is increased. Lastly, a new entry in the tracking data table is inserted.

analytics.py

The analytics.py file contains the functions for analyzing habits and helper functions for establishing a database connection. The functions for normal program usage are combined in the class Analytics which is done solely for combining all functions in a single namespace. All functions are decorated with @staticmethod which allows calling the functions without initializing a class instance, keeping the function as pure as possible.

The functions were implemented with the help of a lot of list and dictionary comprehensions to avoid value assignments. The more complex functions are implemented by nesting functions and the usage of higher-order functions. Lastly, the functions are decorated with the @user_message decorator, which is a self-coded decorator that returns a message to the user if the function returns no data. With this decorator, we can avoid print statements in the functions.

pipeline.py

This file basically imports the Habit and Analytics classes and wraps the functionality of the two with the class Pipeline.

admin.py

The admin.py file contains the functionality to initialize or reset the database, both for regular usage and by the unit test suite to set up a test environment. Additionally, we have a function for generating test data.

Creating and Deleting a Database

The initialize function establishes a database connection and then executes two CREATE TABLE statements. This creates the database.db file with the two tables habits and trackingdata. If a database already existed, the sqlite3 library returns an error, which is captured in an except block, and an error message is returned.

The second function, delete, first checks if a database file exists and then deletes it.

Generating Testdata

The third function testdata uses the Habit classes check method with randomly generated entry dates for a predefined dictionary filled with habits.

user.py

The user.py is used for all functions related to user management. The file contains a class User for wrapping the functions login, logout, and whoami, all decorated with @staticmethod. The first function, login takes a username as an argument and then writes it into the user table of the database.

The second function whoami is used to return the current user from the database. This function is also used by various other functions to get the actual username.

The third function logout, first checks if a user is logged in at the moment and then sets the user’s status to logged out in the user table.

decorators.py

The decoratos.py contains all decorators. The habit tracker contains two scenarios, where decorators are good alternatives to helper functions. First for the functions of the analytics.py and second for the unit test suite. The first decorator @user_message is used to return a message to the user when an analytics function returns an empty dataset. Although not necessary, the user gets a simple message “No data” instead of an empty prompt if a function returns nothing. The core of the user_message decorator is thereby a conditional expression inside a try/except block. The called analytics function is checked with the any function. If the function returns an empty list the conditional expression evaluates to False and the message No Data is returned, else the functions return value is printed.

The second decorator @capture_print captures print statements for the unit test to evaluate. The function first checks if the username is testuser and then redirects the standard output to a variable. After capturing the value of the print statement in the variable, the standard output is then reset to normal.

test_tracker.py

The last component of the habit tracker is the unit test suite contained in test_tracker.py. The unit test suite was implemented with the unittest library included in Python. The test_tracker.py consists of two classes, TestHabit and TestAnalytics, which are then combined in one test suite. Both classes first construct a test environment in the form of a default database. After the set-up, the test for the various methods and functions is executed. After testing the functionality, the test database is deleted and the previous database, if it existed, is restored. Both classes make use of the various assert functions of the unittest library, to evaluate the results of queries or the return values of functions.

For testing the methods of Habit, data is first inserted, deleted, or checked via the respective methods and then the database entries are queried by a separate function. This query result is then compared against predefined data to ensure the methods work as expected. For testing the analytics functions, the test database is filled with predefined data, which is then analyzed. The return values are then compared against the expected values.

Conclusion

Looking back on the creation of the habit tracker, I can say that this project was a very pleasant and rewarding learning experience. The project definitely improved my programming skills by a
wide margin in regard to object-oriented and functional programming. Personally, I am satisfied with the final product’s functionality and the UNIX-tool-like control. The code for the habit tracker can be found in the following GitHub repo.