Chapter 7: Topic Modeling
Activity 15: Loading and Cleaning Twitter Data
Solution:
Import the necessary libraries:
import langdetect import matplotlib.pyplot import nltk import numpy import pandas import pyLDAvis import pyLDAvis.sklearn import regex import sklearn
Load the LA Times health Twitter data (latimeshealth.txt) from https://github.com/TrainingByPackt/Applied-Unsupervised-Learning-with-Python/tree/master/Lesson07/Activity15-Activity17:
Note
Pay close attention to the delimiter (it is neither a comma nor a tab) and double-check the header status.
path = '<Path>/latimeshealth.txt' df = pandas.read_csv(path, sep="|", header=None) df.columns = ["id", "datetime", "tweettext"]
Run a quick exploratory analysis to ascertain the data size and structure:
def dataframe_quick_look(df, nrows): print("SHAPE:\n{shape}\n".format(shape=df.shape)) print("COLUMN NAMES:\n{names}\n".format(names=df.columns)) print("HEAD:\n{head}\n".format(head=df.head(nrows))) dataframe_quick_look(df, nrows=2)The output...