Topic Analysis of App Store Reviews: A Heatmap Approach with the App Store Reviews Scraper API

This blog post explains what topic analysis is and shows, with an example script and a short tutorial, how SerpApi's Apple App Store Reviews Scraper API can be used to perform it.

What is Topic Analysis?

Topic analysis of App Store reviews uses machine learning and natural language processing (NLP) to discover the recurring themes in customer reviews of a product or service. In this post, Python is used to import and process a dataset of reviews gathered with SerpApi's Apple App Store Reviews Scraper API. The purpose of the analysis is to learn what customers are talking about. This is different from sentiment analysis, which measures the overall attitude of the reviews towards the product, although the two are often used together. Both can be useful for companies looking to improve their products or for individuals looking to make informed decisions about purchasing a product.

To perform topic analysis, the first step is to preprocess the text data. The reviews are tokenized, or split into individual words, to make them easier to analyze, and stopwords are removed. Stopwords are common words in the English language that add little meaning to the review (e.g. "the", "a", "an"). Libraries such as the Natural Language Toolkit (nltk) can do this; the script in this post uses gensim's built-in tokenizer and stopword list instead.
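To make the effect of preprocessing concrete, here is a dependency-free sketch. The regex tokenizer and the tiny stopword set are stand-ins chosen for illustration, not the implementation of any particular library, so real results will differ in detail.

```python
import re

# Hypothetical, deliberately tiny stopword set for illustration only;
# real stopword lists contain a few hundred entries.
STOPWORDS_DEMO = {"the", "a", "an", "is", "and", "but", "it", "to", "of", "i"}


def preprocess(review, min_len=3):
    """Lowercase, split into alphabetic tokens, drop short tokens and stopwords."""
    tokens = re.findall(r"[a-z]+", review.lower())
    return [t for t in tokens if len(t) >= min_len and t not in STOPWORDS_DEMO]


print(preprocess("The app is great, but it crashes a lot!"))
# ['app', 'great', 'crashes', 'lot']
```

Only the words that carry meaning survive, which is what the topic model should see.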

Once the reviews are preprocessed, they are fed into an algorithm, such as Latent Dirichlet Allocation (LDA), which is used to identify common themes or topics within the reviews. LDA is unsupervised, so it needs no labeled examples: it models each review as a mixture of topics and each topic as a distribution over words, and learns both from the text alone. Classifying reviews as positive or negative is a separate, supervised task known as sentiment analysis, which is covered at the end of this post.

The results of the topic analysis can be visualized using data science and data analysis tools like matplotlib or plotly, which allow users to see how the topics are distributed across the reviews. This can be useful for identifying areas where a product or service can be improved, or for understanding what matters most to customers.

The Code

This code reads reviews from an external file called reviews.json, which is created from the response of SerpApi's Apple App Store Reviews Scraper API.

from gensim.parsing.preprocessing import STOPWORDS
from gensim.utils import simple_preprocess
import plotly.graph_objects as go
import pandas as pd
import gensim
import json

# Load the JSON data
with open('reviews.json', 'r') as f:
        reviews_data = json.load(f)

# Extract the review text from the JSON data
reviews = [review['text'] for review in reviews_data['reviews']]

# Preprocess the review text to remove stopwords and create a list of tokens
processed_reviews = []
for review in reviews:
        tokens = simple_preprocess(review, deacc=True, min_len=3)
        filtered_tokens = [token for token in tokens if token not in STOPWORDS]
        processed_reviews.append(filtered_tokens)

# Create a dictionary from the processed review text
dictionary = gensim.corpora.Dictionary(processed_reviews)

# Create a bag-of-words representation of the review text
corpus = [dictionary.doc2bow(review) for review in processed_reviews]

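# Set the number of topics and the number of passes.
# One topic per review defeats the purpose of grouping reviews into themes,
# and zero passes leaves the model untrained, so use small positive values.
num_topics = min(10, len(reviews))
num_passes = 10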

# Fit the LDA model to the corpus
lda_model = gensim.models.LdaModel(corpus, num_topics=num_topics, id2word=dictionary, passes=num_passes)

# Save the lda_model object to a file
lda_model.save('lda_reviews.gensim')

# Get the list of top words for each topic
# (num_topics=-1 returns every topic instead of the default of 20)
top_words = lda_model.print_topics(num_topics=-1, num_words=10)

# Create a dictionary that maps the topic id to the top words
topic_words = {}
for topic_id, words in top_words:
        topic_words[topic_id] = words.split('+')

# Initialize the dataframe
df = pd.DataFrame()

# Iterate over the reviews in the corpus
for i, review in enumerate(corpus):
        # Get the topic distribution for the review
        topic_distribution = lda_model.get_document_topics(review)

        # Set the values in the dataframe
        for topic, weight in topic_distribution:
                if topic in topic_words:
                        # Get the top word for the topic
                        top_word = topic_words[topic][0].split('*')[1].replace('"', '').strip()

                        # Use the top word as the column name
                        df.loc[reviews[i], top_word] = weight

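# Create the heatmap trace.
# z needs one row per y value (topic) and one column per x value (review),
# so the dataframe is transposed; missing weights are shown as 0.
trace = go.Heatmap(
        x=df.index.tolist(),
        y=df.columns.tolist(),
        z=df.fillna(0).T.values,
        colorscale='Greens'
)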

# Create the figure
fig = go.Figure(data=[trace])

# Show the figure
fig.show()

What is the function of the script?

This code is used to perform topic analysis on a dataset of product reviews. The reviews are stored in a file called 'reviews.json', which is opened with the 'open' function and parsed with 'json.load'. The review text is then extracted from the JSON data and stored in a list called 'reviews'.
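The script assumes that reviews.json already exists. One possible way to produce it is sketched below using only the standard library. The endpoint, the engine name 'apple_reviews', and the parameter names are assumptions based on SerpApi's public documentation and should be checked against the current API reference; build_reviews_url and save_reviews are hypothetical helper names.

```python
import json
import urllib.parse
import urllib.request

# Assumed endpoint; verify against SerpApi's documentation.
SERPAPI_ENDPOINT = "https://serpapi.com/search.json"


def build_reviews_url(product_id, api_key):
    """Build the request URL for the Apple App Store Reviews engine."""
    params = {
        "engine": "apple_reviews",  # assumed engine name
        "product_id": product_id,   # numeric App Store id of the app
        "api_key": api_key,
    }
    return SERPAPI_ENDPOINT + "?" + urllib.parse.urlencode(params)


def save_reviews(product_id, api_key, path="reviews.json"):
    """Fetch one page of reviews and write the raw JSON response to disk."""
    with urllib.request.urlopen(build_reviews_url(product_id, api_key)) as resp:
        data = json.load(resp)
    with open(path, "w", encoding="utf-8") as f:
        json.dump(data, f, ensure_ascii=False, indent=2)
    # The topic analysis script expects a top-level "reviews" list
    return len(data.get("reviews", []))


# Example call (needs a valid key and network access):
# save_reviews("YOUR_PRODUCT_ID", "YOUR_API_KEY")
```

Saving the raw response keeps the file compatible with the script above, which reads reviews_data['reviews'] and the 'text' field of each entry.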

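The next step is to preprocess the review text to make it easier to analyze. Each review is first turned into a list of tokens, or individual words, using the 'simple_preprocess' function from the gensim library, which also lowercases the text and, with the settings used here, strips accents and drops tokens shorter than three characters. Stopwords, which are common words in the English language that add little meaning to the review, are then removed by filtering the tokens against gensim's 'STOPWORDS' set.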

Once the review text is preprocessed, a dictionary is created from it using the 'Dictionary' class from the gensim library, which maps each unique token to an integer id. This dictionary is then used to create a bag-of-words representation of the reviews: each review becomes a list of (token id, count) tuples recording how many times each word appears in that review.
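The shape of a bag-of-words corpus is easier to see on a toy example. The sketch below mimics the idea in plain Python; build_vocabulary and to_bow are hypothetical helpers, and the ids gensim assigns to tokens may differ from the ones shown here.

```python
from collections import Counter


def build_vocabulary(tokenized_reviews):
    """Assign an integer id to each distinct token, in order of first appearance."""
    token2id = {}
    for tokens in tokenized_reviews:
        for token in tokens:
            token2id.setdefault(token, len(token2id))
    return token2id


def to_bow(tokens, token2id):
    """Return sorted (token_id, count) pairs for one review."""
    counts = Counter(token2id[t] for t in tokens if t in token2id)
    return sorted(counts.items())


docs = [["app", "crashes", "crashes"], ["great", "app"]]
vocab = build_vocabulary(docs)
print(vocab)
# {'app': 0, 'crashes': 1, 'great': 2}
print([to_bow(d, vocab) for d in docs])
# [[(0, 1), (1, 2)], [(0, 1), (2, 1)]]
```

Word order is discarded; only which words occur and how often is kept, which is all LDA needs.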

The bag-of-words representation is then used to fit an LDA model, which is an algorithm that is used to identify common themes or topics within the reviews. For each review, the script asks the model for its topic distribution and records the weights in a pandas DataFrame with one row per review and one column per topic, labeled with the topic's top word. The results are then visualized using a heatmap, which shows how strongly each review is associated with each topic. The heatmap is created using the 'Heatmap' class from plotly's graph_objects module (imported as 'go') and is displayed using the figure's 'show' method.
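A heatmap trace in plotly expects its z values as a grid with one row per y value and one column per x value. The sketch below shows, in plain Python, how per-review topic distributions can be arranged into such a grid with topics on the rows and reviews on the columns. The function name topic_review_matrix and the sample weights are hypothetical.

```python
def topic_review_matrix(distributions, num_topics):
    """Rows are topics, columns are reviews; missing topics get weight 0.0."""
    matrix = [[0.0] * len(distributions) for _ in range(num_topics)]
    for col, dist in enumerate(distributions):
        for topic_id, weight in dist:
            matrix[topic_id][col] = weight
    return matrix


# Hypothetical distributions, as lists of (topic_id, weight) pairs,
# which is the shape a call like get_document_topics returns
dists = [[(0, 0.82), (2, 0.11)], [(1, 0.64), (2, 0.30)]]
z = topic_review_matrix(dists, num_topics=3)
print(z)
# [[0.82, 0.0], [0.0, 0.64], [0.11, 0.3]]
```

Filling the gaps with 0.0 is a design choice: topics the model did not report for a review then appear as the lightest color rather than as holes in the plot.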

Here is the visualization of the end result:

What else could be done with Apple Reviews Data?

In addition to topic analysis, there are many other things that could be done with the reviews data scraped by SerpApi's Apple App Store Reviews Scraper API. One possibility is to use machine learning and natural language processing techniques to perform sentiment analysis on the reviews. This involves training a machine learning model to classify reviews as positive or negative based on their text content. This can be done using techniques such as logistic regression or naive Bayes, which are popular algorithms for classification tasks.

To train a machine learning model for reviews sentiment analysis, a dataset of positive reviews and negative reviews is needed. This dataset can be created by manually labeling the reviews as positive sentiment or negative sentiment. Once the training data is prepared, it can be split into a training set (x_train, y_train) and a test set (x_test, y_test) using the 'train_test_split' function from scikit-learn. The training set is used to train the model, while the test set is used to evaluate the model's performance.

Once the model is trained, it can be used to classify new reviews as positive or negative. This can be done using the model's 'predict' method, which takes review text as input and returns a prediction of whether each review is positive or negative. The model's performance can be evaluated using metrics like accuracy, which measures the share of reviews the model classified correctly.
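The workflow described above can be sketched with scikit-learn as follows. The twelve labeled reviews are invented for illustration, and the choice of tf-idf features with logistic regression is one reasonable option, not the only one. With so little data the accuracy figure is not meaningful; a real project would need far more labeled reviews.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Hypothetical hand-labeled reviews: 1 = positive, 0 = negative
texts = [
    "love this app, works great",
    "great design and very easy to use",
    "fantastic update, much faster now",
    "best app for tracking my budget",
    "really helpful and reliable",
    "smooth, simple and beautiful",
    "crashes every time I open it",
    "terrible update, lost all my data",
    "too many ads and very slow",
    "keeps logging me out, useless",
    "does not sync, waste of money",
    "buggy and support never answers",
]
labels = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]

# Hold out a quarter of the data, keeping both classes in each split
x_train, x_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, stratify=labels, random_state=0
)

# Convert text to tf-idf features, then fit a logistic regression classifier
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(x_train, y_train)

# Evaluate on the held-out reviews
predictions = model.predict(x_test)
accuracy = accuracy_score(y_test, predictions)
print("accuracy:", accuracy)

# Classify a new, unseen review
print(model.predict(["keeps crashing after the update"]))
```

Wrapping the vectorizer and classifier in one pipeline means raw review text can be passed straight to 'fit' and 'predict'.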

These classifications could then be visualized as a distribution of positive and negative reviews using tools like matplotlib or plotly. The language and tone of the reviews could also be analyzed using techniques like word embeddings or tf-idf, and the reviews as a whole could be used to understand the overall satisfaction of customers with a product or service.

Alternatively, you can enrich the data you have gathered from your app's reviews with reviews from other app stores, social media posts like tweets about your app (there are plenty of datasets on Kaggle), or reviews from other websites. This way you can compare how your app is received across different sources.

I am grateful to the reader for their time and attention. I hope this post clarifies how SerpApi's Apple App Store Reviews Scraper API can help you understand the mindset of your userbase.