Lab 4: Dictionary-Based Analysis

Due: Friday, February 28

Introduction

This lab accompanies the course material on dictionary-based analysis covered in class on Thursday, February 20. Submit your completed assignment to Dropbox as a knit R Markdown PDF or HTML file (or a printed PDF of a Jupyter Notebook) by the end of the day on Friday, February 28.

Resources

The data for this lab comes from a Kaggle dataset created by Datafiniti, available at https://www.kaggle.com/datafiniti/consumer-reviews-of-amazon-products and containing over 34,000 reviews of Amazon products such as the Kindle and Fire TV. Click the “Download” link, which should download a zip file named “consumer-reviews-of-amazon-products” containing 3 csv files. We will use the file called “1429_1.csv” as the dataset for this lab.
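If you are working in a Jupyter Notebook, the following is a minimal Python sketch of reading a CSV and keeping only selected columns, using just the standard library. The helper name `load_columns`, the column names `product` and `text`, and the in-memory sample are placeholders for illustration; they are not the actual headers in “1429_1.csv”, which you should inspect yourself.

```python
import csv
import io


def load_columns(handle, columns):
    """Read a CSV from an open file handle and keep only the named columns."""
    reader = csv.DictReader(handle)
    return [{col: row[col] for col in columns} for row in reader]


# Placeholder data standing in for the real file; these headers are hypothetical.
sample = io.StringIO(
    "id,product,text\n"
    "1,Tablet A,Great screen\n"
    "2,Tablet B,Battery died fast\n"
)

rows = load_columns(sample, ["product", "text"])

# Print the first few rows.
for row in rows[:5]:
    print(row)
```

For the real file, you would pass something like `open("1429_1.csv", newline="", encoding="utf-8")` in place of the sample, with the column names you identify in the data.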

Lab

(1 point) Using the link above, load the review dataset into your workspace. We will need two columns: the one identifying which product the review is for, and the one containing the review text. Indicate which columns these are and create a new, cleaned dataframe with only these two columns. Print the first few rows of data.
(1 point) Summarize the data by outputting the products reviewed in this dataset as well as the number of reviews per product.
(1 point) You will see there are 49 products with various review counts. Imagine you are a consumer deciding between two similar products and are using reviews to help inform the purchase. From our data, select two products that are reasonable to compare and create a new dataframe for each product containing only the reviews for that product. Make sure each product has at least 50 reviews. Identify which products you chose and print the first few rows of each product’s dataframe. Hint: sort products alphabetically, then use the split function.
(1 point) Create a tidy text object for each product’s review data.
(2 points) Clean the text of each dataset’s review column as you see fit and find the 20 most frequent words in each product’s reviews. Which words are unique to each product’s top 20 list? Which words appear in both?
(2 points) Create a dictionary of relevant terms that might aid in our comparison of these two products. Subset each tidy text dataframe to the rows that contain these words. What percentage of reviews did you capture with your dictionary for each product?
(2 points) Using the bing sentiment dictionary, perform a sentiment analysis on each individual review, then average the results across all reviews for each product. Based on your results, which product would you choose to purchase?
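
The word-frequency, dictionary-coverage, and sentiment steps above can be sketched in plain Python for those using a Jupyter Notebook. This is an illustrative sketch under stated assumptions, not a model solution: the function names, the short stop-word list, the sample reviews, and the tiny `positive` and `negative` word sets are all made up for the example. In particular, the word sets only stand in for the bing lexicon and are not taken from it.

```python
import re
from collections import Counter

# Hypothetical, deliberately short stop-word list; a real analysis needs a fuller one.
STOP_WORDS = {"the", "a", "an", "and", "is", "it", "i", "to", "of", "this", "for", "but", "was"}


def tokenize(text):
    """Lowercase the text, split it into word tokens, and drop stop words."""
    words = re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())
    return [w for w in words if w not in STOP_WORDS]


def top_words(reviews, k=20):
    """Return the k most frequent tokens across all reviews as (word, count) pairs."""
    counts = Counter()
    for review in reviews:
        counts.update(tokenize(review))
    return counts.most_common(k)


def coverage(reviews, dictionary):
    """Percentage of reviews that contain at least one dictionary term."""
    if not reviews:
        return 0.0
    hits = sum(1 for review in reviews if set(tokenize(review)) & dictionary)
    return 100.0 * hits / len(reviews)


def mean_sentiment(reviews, positive, negative):
    """Score each review as (positive words - negative words), then average."""
    scores = []
    for review in reviews:
        tokens = tokenize(review)
        pos = sum(1 for t in tokens if t in positive)
        neg = sum(1 for t in tokens if t in negative)
        scores.append(pos - neg)
    return sum(scores) / len(scores) if scores else 0.0


# Made-up reviews for one product, for illustration only.
reviews_a = [
    "Great screen and great battery",
    "The battery is bad",
    "I love it",
]
feature_terms = {"battery", "screen"}   # hypothetical comparison dictionary
positive = {"great", "love"}            # placeholder for a positive word list
negative = {"bad"}                      # placeholder for a negative word list

print(top_words(reviews_a, k=2))
print(round(coverage(reviews_a, feature_terms), 1))
print(round(mean_sentiment(reviews_a, positive, negative), 2))
```

Scoring each review as positive minus negative word counts is only one possible choice; normalizing by review length or labeling each review by the sign of its score would be equally reasonable, and you should justify whichever you use.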