Analysis of a Customer Complaint Dataset

In today’s digital age, businesses rely heavily on customer feedback and satisfaction to maintain a competitive edge. Understanding customer complaints and addressing them promptly is vital for ensuring customer loyalty and improving overall product and service quality. With the rise of online platforms and social media, companies have a vast amount of data at their disposal in the form of customer reviews, tweets, and other sources of feedback.

In this project report, we focus on analyzing a dataset containing customer complaints from Comcast, one of the largest cable television and internet service providers in the United States. By leveraging Natural Language Processing (NLP) techniques, our objective is to gain insights into the key issues faced by Comcast customers and identify patterns or trends within the complaints that can inform business decisions.

Conception

The first step is to develop a concept that describes the entire data analysis workflow. This step is perhaps the most important of the whole process: anything overlooked or forgotten in this phase negatively affects the later implementation and, in the worst case, leads to useless results.

Dataset

The dataset consists of four columns containing the author, the post date, the rating, and the actual complaint text. With around 5,500 records, it provides enough information to generate relevant insights. The dataset is available on Kaggle.
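For reference, a minimal loading sketch with pandas, assuming the Kaggle download is a single CSV file (the file name and column names here are placeholders, not taken from the project):

```python
import pandas as pd

# File and column names are assumptions about the Kaggle download
df = pd.read_csv("comcast_complaints.csv")
df.columns = ["author", "post_date", "rating", "complaint"]

print(df.shape)   # roughly (5500, 4)
print(df.head())
```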

Cleaning the Complaints

To obtain cleaned texts, we will look into three different approaches. In all three, stopwords are first removed with the NLTK library. For the first option, the words in the complaints are then lemmatized. For the second option, n-grams are formed, and for the last option, noun phrases are extracted with the TextBlob library, which builds on top of NLTK. To inspect the result of each preprocessing option, the WordCloud library is used to create word clouds.
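A minimal sketch of the three options, assuming the complaint text lives in the `df["complaint"]` column from the loading sketch above; the choice of bigrams for the n-gram option is illustrative:

```python
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.util import ngrams
from textblob import TextBlob
from wordcloud import WordCloud

# One-time downloads; TextBlob additionally needs its corpora
# (python -m textblob.download_corpora)
nltk.download("stopwords")
nltk.download("punkt")
nltk.download("wordnet")

stop_words = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()

def remove_stopwords(text):
    # Shared first step: lowercase, tokenize, keep alphabetic non-stopwords
    tokens = nltk.word_tokenize(text.lower())
    return [t for t in tokens if t.isalpha() and t not in stop_words]

def option_lemmas(text):
    # Option 1: lemmatize the remaining words
    return [lemmatizer.lemmatize(t) for t in remove_stopwords(text)]

def option_ngrams(text, n=2):
    # Option 2: form n-grams (bigrams by default) from the remaining words
    return ["_".join(g) for g in ngrams(remove_stopwords(text), n)]

def option_noun_phrases(text):
    # Option 3: extract noun phrases with TextBlob
    return list(TextBlob(text.lower()).noun_phrases)

# Word cloud for one preprocessing option, e.g. the lemmatized complaints
all_tokens = (t for c in df["complaint"] for t in option_lemmas(c))
cloud = WordCloud(width=800, height=400).generate(" ".join(all_tokens))
cloud.to_file("lemmas_wordcloud.png")
```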

Vectorization

For the vectorization step, the Bag-of-Words and TF-IDF algorithms will be used. Each of the three preprocessing options can be combined with either of the two vectorization methods. In this analysis, the implementations of both algorithms from the Sklearn library will be used.
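As a sketch of how the two vectorizers plug in, reusing `option_lemmas` from the cleaning sketch above (the `max_features` cap is illustrative, not a value from the project):

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# One cleaned-text variant, joined back into strings for the vectorizers
docs = [" ".join(option_lemmas(c)) for c in df["complaint"]]

# Bag-of-Words: raw term counts per complaint
bow = CountVectorizer(max_features=5000)
X_bow = bow.fit_transform(docs)

# TF-IDF: term counts reweighted by inverse document frequency
tfidf = TfidfVectorizer(max_features=5000)
X_tfidf = tfidf.fit_transform(docs)

print(X_bow.shape, X_tfidf.shape)  # (number of complaints, vocabulary size)
```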

Topic Modeling

In the last step of the analysis, LSA or LDA will be used for topic modeling. The idea is to obtain the most prominent words for a number of generated topics, which can then be interpreted. As with vectorization, the Sklearn library is used for this step. Given the aforementioned options, the analysis consists of 3 × 2 × 2 = 12 variations that can be compared.
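A sketch of both topic models on the matrices from the previous step; `n_topics = 5` is illustrative, and fitting LSA on TF-IDF while fitting LDA on raw counts is a common pairing rather than a requirement of the project:

```python
from sklearn.decomposition import TruncatedSVD, LatentDirichletAllocation

n_topics = 5  # tunable; the number of extracted topics is a free parameter

# LSA: truncated SVD on the TF-IDF matrix
lsa = TruncatedSVD(n_components=n_topics, random_state=42)
lsa.fit(X_tfidf)

# LDA: probabilistic topic model, fit here on the raw counts
lda = LatentDirichletAllocation(n_components=n_topics, random_state=42)
lda.fit(X_bow)

def top_words(model, feature_names, n_words=10):
    # Print the most prominent words per topic for interpretation
    for i, comp in enumerate(model.components_):
        top = [feature_names[j] for j in comp.argsort()[::-1][:n_words]]
        print(f"Topic {i}: {', '.join(top)}")

top_words(lsa, tfidf.get_feature_names_out())
top_words(lda, bow.get_feature_names_out())
```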

Development

In this phase, we will conduct the data analysis based on the developed concept. This PDF shows the implementation step-by-step.

Results and Analysis

For the discussion of the results, we will first look at the quality of the customer complaint dataset. The author of the dataset states that the complaints were obtained via web scraping from a public source. Since that source is not disclosed, there is no way to verify where the data came from. Nonetheless, the dataset of around 5,500 entries was deemed suitable for this project, as the origin of the data has no impact on the methodology presented here. The analysis resulted in around 64 topics that could be interpreted. All in all, five interpreted topics could be derived from the analysis: problems with customer service, problems with internet speed, problems with the internet cable (possible downtimes), complaints about billing, and problems with the TV receiver.

For possible further improvements of the analysis, there are three scenarios. First, the parameters of the respective machine learning algorithms could be tweaked; for example, the number of topics to be extracted could be altered, or additional stop words could be removed. Second, the analysis could be performed with different techniques and methodologies, and there are a variety of possibilities to explore here: to get more detailed insights into the complaint topics, Word2Vec or GloVe could be employed. Third, transformer-style models like BERT or the GPT family of models could be used to extract topics from the complaints.
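As one hypothetical starting point for the Word2Vec direction, using gensim on the tokenized complaints from the cleaning step (all hyperparameters and the probe word are placeholders, not part of the project):

```python
from gensim.models import Word2Vec

# Train word embeddings on the tokenized complaints (parameters illustrative)
sentences = [option_lemmas(c) for c in df["complaint"]]
w2v = Word2Vec(sentences, vector_size=100, window=5, min_count=5, workers=4)

# Inspect neighbours of a complaint-related term (must be in the vocabulary)
# to refine or relabel the topics found above
print(w2v.wv.most_similar("internet", topn=10))
```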

Solving this task definitely gave me a deeper understanding of natural language processing techniques and the topic modeling workflow in general. The explored process of preprocessing – vectorization – topic modeling with the employed algorithms can now serve as a foundation from which other methodologies can be tested and compared in future NLP projects.

For similar projects in the future, I would definitely try to implement a workflow based on transformer models, especially the recently released GPT-3 model from OpenAI. As I have already experienced the astonishing language-comprehension capabilities of this model in other projects, it would be interesting to build a topic modeling workflow around it, which could possibly lead to more detailed findings. The code for this project can be found on GitHub.