In this exercise, we will use the DBWorld e-mails dataset from the UCI Machine Learning repository to compare the relative performance of Naïve Bayes and BayesLogit methods. The dataset contains 64 e-mails from the DBWorld newsletter and the task is to classify the e-mails into either announcements of conferences or everything else. The reference for this dataset is a course by Prof. Michele Filannino (reference 5 in the References section of this chapter). The dataset can be downloaded from the UCI website at https://archive.ics.uci.edu/ml/datasets/DBWorld+e-mails#.
Some preprocessing of the dataset is required before it can be used with both methods. The dataset is in the ARFF format. You need to download the foreign R package (http://cran.r-project.org/web/packages/foreign/index.html) and use its read.arff() function to read the file into an R data frame.
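The reading step can be sketched as follows. This is a minimal illustration, not the exact preprocessing used for the exercise: the tiny ARFF file written here, its attribute names (conference, deadline, CLASS), and the variable names are all hypothetical stand-ins so that the snippet runs without the downloaded data. With the real dataset, you would pass the path of the downloaded ARFF file to read.arff() instead.

```r
# read.arff() lives in the foreign package
library(foreign)

# Hypothetical miniature ARFF file standing in for the downloaded dataset.
# The attribute names below are illustrative assumptions, not the real columns.
arff_lines <- c(
  "@relation toy_emails",
  "@attribute conference numeric",
  "@attribute deadline numeric",
  "@attribute CLASS {0,1}",
  "@data",
  "1,1,1",
  "0,0,0",
  "1,0,1"
)
toy_path <- tempfile(fileext = ".arff")
writeLines(arff_lines, toy_path)

# Read the ARFF file into an R data frame
emails <- read.arff(toy_path)

# Inspect the column types and dimensions before further preprocessing
str(emails)
dim(emails)
```

Checking the result with str() before going further is useful because the column types produced by read.arff() determine what conversion each classifier will need.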