Amazon-Reviews-Analysis

Lexical Diversity and Sentiment Intensity in Amazon Fine Food Reviews

LING-460: Textual Analysis with R at UNC-Chapel Hill, Spring 2025.

Table of Contents

  1. Motivation
  2. Research Question
  3. Research Hypothesis
  4. Prediction
  5. Procedure
  6. Analysis Results
  7. Data Analysis
  8. Visualizations
  9. Conclusions
  10. References

Motivation

Consumer reviews provide rich insights into the evaluative language that people use to express their satisfaction or dissatisfaction. The way consumers articulate their experiences can reveal underlying linguistic patterns, where factors such as lexical diversity and emotional tone contribute to the expression of opinions. Understanding these patterns has implications for marketing, consumer research, and sentiment analysis in natural language processing. Our project addresses the broader question of whether the language used in negative reviews differs fundamentally from that used in positive reviews.

Dataset: https://www.kaggle.com/datasets/snap/amazon-fine-food-reviews?resource=download

Research Question

Do negative reviews exhibit lower lexical diversity and higher negative sentiment intensity than positive reviews?

Research Hypothesis

We hypothesize that:

  • Negative reviews (1-2 stars) exhibit lower lexical diversity, measured by Type-Token Ratio (TTR), than positive reviews (4-5 stars).
  • Negative reviews exhibit higher negative sentiment intensity (a greater proportion of negative words) than positive reviews.

Prediction

Based on the hypotheses:

  • The mean TTR of negative reviews will be lower than that of positive reviews.
  • The mean proportion of negative words will be higher in negative reviews than in positive reviews.
  • Review sentiment will predict TTR even after accounting for negative sentiment intensity.

Procedure

Our approach involved the following steps:

  1. Data Acquisition:
    We used the publicly available Amazon Fine Food Reviews dataset from Kaggle, containing over 500,000 reviews with ratings spanning from 1 to 5 stars.

  2. Data Preprocessing:
    • Converted Unix timestamps in the Time column to human-readable dates.
    • Ensured that the Score field was numeric.
    • Classified reviews into three sentiment groups based on the score: Negative (1-2 stars), Positive (4-5 stars), and Neutral (3 stars). Only Negative and Positive reviews were retained for the analysis.
    • Computed the Type-Token Ratio (TTR) for each review as an indicator of lexical diversity.
  3. Sentiment Analysis:
    • Using the “bing” sentiment lexicon from the tidytext package, we tokenized the review texts and counted the frequency of negative words.
    • Calculated negative sentiment intensity as the proportion of negative words to the total number of words in each review.
  4. Statistical Testing:
    • We performed two independent-sample t-tests to compare the TTR and negative sentiment intensity between negative and positive review groups.
    • A linear regression model was constructed to assess the combined effect of review sentiment and negative sentiment intensity on lexical diversity.
  5. Visualization:
    • Boxplots were generated to visually compare the distributions of TTR and negative sentiment intensity between the two sentiment groups.
    • Regression visualizations (scatter plots with fitted linear models and faceted views) were created to illustrate the relationship between TTR and negative sentiment intensity by sentiment group.
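The preprocessing and scoring steps above can be sketched in R roughly as follows. Column names `Time`, `Score`, and `Text` come from the Kaggle dataset; derived names such as `sentiment_group`, `ttr`, and `neg_intensity` are our own labels, and this is a minimal sketch rather than the exact project code:

```r
library(dplyr)
library(tidytext)

# Assumed: `reviews` is the raw Kaggle data frame with Time, Score, Text columns.
reviews <- reviews %>%
  mutate(
    # Unix timestamp -> human-readable date
    Date  = as.Date(as.POSIXct(Time, origin = "1970-01-01", tz = "UTC")),
    Score = as.numeric(Score),
    sentiment_group = case_when(
      Score <= 2 ~ "Negative",
      Score >= 4 ~ "Positive",
      TRUE       ~ "Neutral"
    )
  ) %>%
  filter(sentiment_group != "Neutral")   # keep only Negative and Positive

# Type-Token Ratio: unique word types / total word tokens per review.
ttr_of <- function(text) {
  tokens <- unlist(strsplit(tolower(text), "[^a-z']+"))
  tokens <- tokens[tokens != ""]
  length(unique(tokens)) / length(tokens)
}
reviews$ttr <- vapply(reviews$Text, ttr_of, numeric(1))

# Negative sentiment intensity: share of "bing"-negative words per review.
bing_neg <- get_sentiments("bing") %>% filter(sentiment == "negative")

reviews <- reviews %>%
  mutate(review_id = row_number())

neg_scores <- reviews %>%
  unnest_tokens(word, Text) %>%
  group_by(review_id) %>%
  summarise(
    n_words = n(),
    n_neg   = sum(word %in% bing_neg$word)
  ) %>%
  mutate(neg_intensity = n_neg / n_words)

reviews <- left_join(reviews, neg_scores, by = "review_id")
```

Note that `strsplit`-based tokenization and `unnest_tokens` may segment words slightly differently; the sketch uses the simpler split for TTR and the tidytext tokenizer for the lexicon match, mirroring the two-tool workflow described above.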

Analysis Results

Descriptive Statistics

Sentiment   Mean TTR   SD TTR   Mean Negative Intensity   SD Negative Intensity   Count
Negative    0.807      0.102    0.0368                    0.0296                  82,037
Positive    0.827      0.100    0.0179                    0.0206                  443,777

Interpretation:
Negative reviews have a lower average TTR and higher negative sentiment intensity compared to positive reviews.

Statistical Tests
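The two independent-sample t-tests described in the Procedure can be sketched as below (R's `t.test` defaults to Welch's unequal-variance test; `ttr`, `neg_intensity`, and `sentiment_group` are the assumed column names from our preprocessing):

```r
# Compare lexical diversity between negative and positive reviews.
t.test(ttr ~ sentiment_group, data = reviews)

# Compare negative sentiment intensity between the two groups.
t.test(neg_intensity ~ sentiment_group, data = reviews)
```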

Regression Analysis

The linear regression model predicting TTR using a binary sentiment indicator and negative sentiment intensity produced the following results:
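The model specification can be sketched in R as follows (R dummy-codes the `sentiment_group` factor as the binary indicator; variable names are our assumptions):

```r
# TTR predicted by review sentiment and negative sentiment intensity.
model <- lm(ttr ~ sentiment_group + neg_intensity, data = reviews)
summary(model)
```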

Data Analysis

Our analysis included:

  • Descriptive statistics (mean, standard deviation, and count) of TTR and negative sentiment intensity by sentiment group.
  • Two independent-sample t-tests comparing TTR and negative sentiment intensity between negative and positive reviews.
  • A linear regression model predicting TTR from review sentiment and negative sentiment intensity.

Visualizations

The report includes the following key visualizations:

Boxplot for Type-Token Ratio (TTR) by review sentiment

Boxplot for Negative Sentiment Intensity by review sentiment
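The two boxplots can be reproduced with ggplot2 along these lines (a sketch assuming the `reviews` data frame with `sentiment_group`, `ttr`, and `neg_intensity` columns):

```r
library(ggplot2)

# TTR by sentiment group
ggplot(reviews, aes(x = sentiment_group, y = ttr, fill = sentiment_group)) +
  geom_boxplot() +
  labs(x = "Review sentiment", y = "Type-Token Ratio (TTR)")

# Negative sentiment intensity by sentiment group
ggplot(reviews, aes(x = sentiment_group, y = neg_intensity, fill = sentiment_group)) +
  geom_boxplot() +
  labs(x = "Review sentiment", y = "Negative sentiment intensity")
```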

Conclusions

Based on our analyses, we conclude the following:

  • Negative reviews show lower lexical diversity than positive reviews (mean TTR 0.807 vs. 0.827).
  • Negative reviews show higher negative sentiment intensity than positive reviews (mean 0.0368 vs. 0.0179).

Implications:
These conclusions suggest that evaluative language in negative reviews is characterized by less lexical variety and a greater focus on negative sentiment. This may have implications for improving sentiment analysis tools, refining review summarization techniques, and understanding consumer behavior.

Future Directions:
Further research could explore additional linguistic features (e.g., syntactic complexity, use of modifiers) or investigate how these patterns might evolve over time. Additionally, comparing these findings with other types of online reviews could provide broader insights.

References

Amazon Fine Food Reviews dataset. Kaggle. https://www.kaggle.com/datasets/snap/amazon-fine-food-reviews