Chapter 7: Processing Human Language
Activity 19: Predicting Sentiments of Movie Reviews
Solution:
- Read the IMDB movie review dataset using pandas in Python:
import pandas as pd
data = pd.read_csv('../../chapter 7/data/movie_reviews.csv', encoding='latin-1')
- Convert the reviews to lowercase to reduce the number of unique words:
data.text = data.text.str.lower()
Note
Keep in mind that "Hello" and "hello" are not the same to a computer.
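As a rough illustration of why lowercasing helps, the following sketch counts unique whitespace-separated words before and after lowercasing. The sample_reviews list and the unique_words helper are hypothetical and are not part of the activity's dataset or solution. Plain Python string methods stand in here for the pandas str.lower() call used above.

```python
# Hypothetical sample reviews, used only for illustration (not from the dataset).
sample_reviews = [
    "Hello world, this movie was Great",
    "hello World, this Movie was great",
]


def unique_words(reviews):
    """Return the set of whitespace-separated tokens across all reviews."""
    return {word for review in reviews for word in review.split()}


# Case-sensitive vocabulary: "Hello" and "hello" count as two different words.
before = unique_words(sample_reviews)

# Lowercase every review first, mirroring data.text.str.lower().
after = unique_words([review.lower() for review in sample_reviews])

print(len(before), len(after))
```

On this toy input the vocabulary shrinks after lowercasing, because variants that differ only in capitalization collapse into a single token.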
- Clean the reviews using RegEx with the clean_str function:
import re
def clean_str(string):
    string = re.sub(r"https?\://\S+", '', string)
    string = re.sub(r'\<a href', ' ', string)
    string = re.sub(r'&amp;', '', string)
    string = re.sub(r'<br />', ' ', string)
    string = re.sub(r'[_"...