Using cross-validation

(For more resources related to this topic, see here.)

To start from, cross-validation is a common validation technique that can be used to evaluate machine learning models. Cross-validation essentially measures how well the estimated model will generalize some given data. This data is different from the training data supplied to our model, and is called the cross-validation set, or simply validation set, of our model. Cross-validation of a given model is also called rotation estimation.

If an estimated model performs well during cross-validation, we can assume that the model can understand the relationship between its various independent and dependent variables. The goal of cross-validation is to provide a test to determine if a formulated model is overfit on the training data. In the perspective of implementation, cross-validation is a kind of unit test for a machine learning system.

A single round of cross-validation generally involves partitioning all the available sample data into two subsets and then performing training on one subset and validation and/or testing on the other subset. Several such rounds, or folds, of cross-validation must be performed using different sets of data to reduce the variance of the overall cross-validation error of the given model. Any particular measure of the cross-validation error should be calculated as the average of this error over the different folds in cross-validation.

There are several types of cross-validation we can implement as a diagnostic for a given machine learning model or system. Let's briefly explore a few of them as follows:

  • A common type is k-fold cross-validation, in which we partition the cross-validation data into k equal subsets. The training of the model is then performed on subsets of the data and the cross-validation is performed on a single subset.
  • A simple variation of k-fold cross-validation is 2-fold cross-validation, which is also called the holdout method. In 2-fold cross-validation, the training and cross-validation subsets of data will be almost equal in proportion.
  • Repeated random subsampling is another simple variant of cross-validation in which the sample data is first randomized or shuffled and then used as training and cross-validation data. This method is notably not dependent on the number of folds used for cross-validation.
  • Another form of k-fold cross-validation is leave-one-out cross-validation, in which only a single record from the available sample data is used for cross-validation. Leave-one-out cross-validation is essentially k-fold cross-validation in which k is equal to the number of samples or observations in the sample data.

Cross-validation basically treats the estimated model as a black box, that is, it makes no assumptions about the implementation of the model. We can also use cross-validation to select features in a given model by using cross-validation to determine the feature set that produces the best fit model over the given sample data. Of course, there are a couple of limitations of classification, which can be summarized as follows:

  • If a given model is needed to perform feature selection internally, we must perform cross-validation for each selected feature set in the given model. This can be computationally expensive depending on the amount of available sample data.
  • Cross-validation is not very useful if the sample data comprises exactly or nearly equal samples.

In summary, it's a good practice to implement cross-validation for any machine learning system that we build. Also, we can choose an appropriate cross-validation technique depending on the problem we are trying to model as well as the nature of the collected sample data.

For the example that will follow, the namespace declaration should look similar to the following declaration:

(ns my-namespace (:use [clj-ml classifiers data]))

We can use the clj-ml library to cross-validate the classifier we built for the fish packaging plant. Essentially, we built a classifier to determine whether a fish is a salmon or a sea bass using the clj-ml library. To recap, a fish is represented as a vector containing the category of the fish and values for the various features of the fish. The attributes of a fish are its length, width, and lightness of skin. We also described a template for a sample fish, which is defined as follows:

(def fish-template [{:category [:salmon :sea-bass]} :length :width :lightness])

The fish-template vector defined in the preceding code can be used to train a classifier with some sample data. For now, we will not bother about which classification algorithm we have used to model the given training data. We can only assume that the classifier was created using the make-classifier function from the clj-ml library. This classifier is stored in the *classifier* variable as follows:

(def *classifier* (make-classifier ...))

Suppose the classifier was trained with some sample data. We must now evaluate this trained classification model. To do this, we must first create some sample data to cross-validate. For the sake of simplicity, we will use randomly generated data in this example. We can generate this data using the make-sample-fish function. This function simply creates a new vector of some random values representing a fish. Of course, we must not forget the fact that the make-sample-fish function has an in-built partiality, so we create a meaningful pattern in a number of samples created using this function as follows:

(def fish-cv-data (for [i (range 3000)] (make-sample-fish)))

We will need to use a dataset from the clj-ml library, and we can create one using the make-dataset function, as shown in the following code:

(def fish-cv-dataset (make-dataset "fish-cv" fish-template fish-cv-data))

To cross-validate the classifier, we must use the classifier-evaluate function from the clj-ml.classifiers namespace. This function essentially performs k-fold cross-validation on the given data. Other than the classifier and the cross-validation dataset, this function requires the number of folds that we must perform on the data to be specified as the last parameter. Also, we will first need to set the class field of the records in fish-cv-dataset using the dataset-set-class function. We can define a single function to perform these operations as follows:

(defn cv-classifier [folds] (dataset-set-class fish-cv-dataset 0) (classifier-evaluate *classifier* :cross-validation fish-cv-dataset folds))

We will use 10 folds of cross-validation on the classifier. Since the classifier-evaluate function returns a map, we bind this return value to a variable for further use, as follows:

user> (def cv (cv-classifier 10)) #'user/cv

We can fetch and print the summary of the preceding cross-validation using the :summary keyword as follows:

user> (print (:summary cv)) Correctly Classified Instances 2986 99.5333 % Incorrectly Classified Instances 14 0.4667 % Kappa statistic 0.9888 Mean absolute error 0.0093 Root mean squared error 0.0681 Relative absolute error 2.2248 % Root relative squared error 14.9238 % Total Number of Instances 3000 nil

As shown in the preceding code, we can view several statistical measures of performance for our trained classifier. Apart from the correctly and incorrectly classified records, this summary also describes the Root Mean Squared Error (RMSE) and several other measures of error in our classifier. For a more detailed view of the correctly and incorrectly classified instances in the classifier, we can print the confusion matrix of the cross-validation using the :confusion-matrix keyword, as shown in the following code:

user> (print (:confusion-matrix cv)) === Confusion Matrix === a b <-- classified as 2129 0 | a = salmon 9 862 | b = sea-bass nil

As shown in the preceding example, we can use the clj-ml library's classifier-evaluate function to perform a k-fold cross-validation on any given classifier. Although we are restricted to using classifiers from the clj-ml library when using the classifier-evaluate function, we must strive to implement similar diagnostics in any machine learning system we build.

Building a spam classifier

Now that we are familiar with cross-validation, we will build a working machine learning system that incorporates cross-validation. The problem at hand will be that of spam classification, in which we will have to determine the likelihood of a given e-mail being a spam e-mail. Essentially, the problem boils down to binary classification with a few tweaks to make the machine learning system more sensitive to spam. Note that we will not be implementing a classification engine that is integrated with an e-mail server, but rather we will be concentrating on the aspects of training the engine with some data and classifying a given e-mail.

The way this would be used in practice can be briefly explained as follows. A user will receive and read a new e-mail, and will decide whether to mark the e-mail as spam or not. Depending on the user's decision, we must train the e-mail service's spam engine using the new e-mail as data.

In order to train our spam classifier in a more automated manner, we'll have to simply gather data to feed into the classifier. We will need a large amount of data to effectively train a classifier with the English language. Luckily for us, sample data for spam classification can be found easily on the Web. For this implementation, we will use data from the Apache SpamAssassin project.

The Apache SpamAssassin project is an open source implementation of a spam classification engine in Perl. For our implementation, we will use the sample data from this project. You can download this data from For our example, we have used the spam_2 and easy_ham_2 datasets. A Clojure Leiningen project housing our spam classifier implementation will require that these datasets be extracted and placed in the ham/ and spam/ subdirectories of the corpus/ folder. The corpus/ folder should be placed in the root directory of the Leiningen project that is the same folder of the project.clj file.

The features of our spam classifier will be the number of occurrences of all previously encountered words in spam and ham e-mails. By the term ham, we mean "not spam". Thus, there are effectively two independent variables in our model. Also, each word has an associated probability of occurrence in e-mails, which can be calculated from the number of times it's found in spam and ham e-mails and the total number of e-mails processed by the classifier. A new e-mail would be classified by finding all known words in the e-mail's header and body, and then somehow combining the probabilities of occurrences of these words in spam and ham e-mails.

For a given word feature in our classifier, we must calculate the total probability of occurrence of the word by taking into account the total number of e-mails analyzed by the classifier. Also, an unseen term is neutral in the sense that it is neither spam nor ham. Thus, the initial probability of occurrence of any word in the untrained classifier is 0.5. Hence, we use a Bayesian probability function to model the occurrence of a particular word.

In order to classify a new e-mail, we also need to combine the probabilities of occurrences of all the known words found in it. For this implementation, we will use Fisher's method, or Fisher's combined probability test, to combine the calculated probabilities. Although the mathematical proof of this test is beyond the scope of this article, it's important to know that this method essentially estimates the probabilities of several independent probabilities in a given model as a (pronounced as chi-squared) distribution. Such a distribution has an associated number of degrees of freedom. It can be shown that an distribution with degrees of freedom equal to twice the number of combined probabilities k can be formally expressed as follows:

This means that using an distribution with degrees of freedom, the Cumulative Distribution Function (CDF), of the probabilities of the e-mail being a spam or a ham can be combined to reflect a total probability that is high when there are a large number of probabilities with values close to 1.0. Thus, an e-mail is classified as spam only when most of the words in the e-mail have been previously found in spam e-mails. Similarly, a large number of ham keywords would indicate the e-mail is in fact a ham e-mail. On the other hand, a low number of occurrences of spam keywords in an e-mail would have a probability closer to 0.5, in which case the classifier will be unsure of whether the e-mail is spam or ham.

For the example that will follow, we will require the file and cdf-chisq functions from the and Incanter libraries, respectively. The namespace declaration of the example should look similar to the following declaration:

(ns my-namespace (:use [ :only [file]] [incanter.stats :only [cdf-chisq]])

A classifier trained using Fisher's method, as described earlier, will be very sensitive to new spam e-mails. We represent the dependent variable of our model by the probability of a given e-mail being spam. This probability is also termed as the spam score of the e-mail. A low score indicates that an e-mail is ham, while a high score indicates that the e-mail is spam. Of course, we must also include a third class to represent an unknown value in our model. We can define some reasonable limits for the scores of these categories as follows:

(def min-spam-score 0.7) (def max-ham-score 0.4) (defn classify-score [score] [(cond (<= score max-ham-score) :ham (>= score min-spam-score) :spam :else :unsure) score])

As defined earlier, if an e-mail has a score of 0.7 or more, it's a spam e-mail. And a score of 0.5 or less indicates that the e-mail is ham. Also, if the score lies between these two values, we can't effectively decide whether the e-mail is spam or not. We represent these three categories using the keywords :ham, :spam, and :unsure.

The spam classifier must read several e-mails, determine all the words, or tokens, in the e-mails' text and header, and store this information as empirical knowledge to use later. We need to store the number of occurrences a particular word is found in spam and ham e-mails. Thus, every word that the classifier has encountered represents a feature. To represent this information for a single word, we will use a record with three fields as shown in the following code:

(defrecord TokenFeature [token spam ham]) (defn new-token [token] (TokenFeature. token 0 0)) (defn inc-count [token-feature type] (update-in token-feature [type] inc))

The record TokenFeature defined in the preceding code can be used to store the needed information for our spam classifier. The new-token function simply creates a new record for a given token by invoking the record's constructor. Obviously, a word is initially seen zero times in both spam and ham e-mails. We will also need to update these values, and we define the inc-count function to perform an update on the record using the update-in function. Note that the update-in function expects a function to apply to a particular field in the record as the last parameter. We are already dealing with a small amount of a mutable state in our implementation, so let's delegate access to this state through an agent. We would also like to keep track of the total number of ham and spam e-mails; so, we'll wrap these values with agents as well, as shown in the following code:

(def feature-db (agent {} :error-handler #(println "Error: " %2))) (def total-ham (agent 0)) (def total-spam (agent 0))

The feature-db agent defined in the preceding code will be used to store all word features. We define a simple error handler for this agent using the :error-handler keyword parameter. The agent's total-ham and total-spam functions will keep track of the total number of ham and spam e-mails, respectively. We will now define a couple of functions to access these agents as follows:

(defn clear-db [] (send feature-db (constantly {})) (send total-ham (constantly 0)) (send total-spam (constantly 0))) (defn update-feature! "Looks up a TokenFeature record in the database and creates it if it doesn't exist, or updates it." [token f & args] (send feature-db update-in [token] #(apply f (if %1 %1 (new-token token)) args)))

In case you are not familiar with agents in Clojure, we can use the send function to alter the value contained in an agent. This function expects a single argument, that is, the function to apply to its encapsulated value. The agent applies this function on its contained value and updates it if there are no errors. The clear-db function simply initializes all the agents we've defined with an initial value. This is done by using the constantly function that wraps a value in a function that returns the same value. The update-feature! function modifies the value of a given token in the feature-db map and creates a new token if the supplied token is not present in the map of feature-db. Since we will only be incrementing the number of occurrences of a given token, we will pass the inc-count function as a parameter to the update-feature! function.

Now, let's define how the classifier will extract words from a given e-mail. We'll use regular expressions to do this. If we want to extract all the words from a given string, we can use the regular expression [a-zA-Z]{3,}. We can define this regular expression using a literal syntax in Clojure, as shown in the following code. Note that we could also use the re-pattern function to create a regular expression. We will also define all the MIME header fields from which we should also extract tokens. We will do all this with the help of the following code:

(def token-regex #"[a-zA-Z]{3,}") (def header-fields ["To:" "From:" "Subject:" "Return-Path:"])

To match tokens with the regular expression defined by token-regex, we will use the re-seq function, which returns all matching tokens in a given string as a sequence of strings. For the MIME headers of an e-mail, we need to use a different regular expression to extract tokens. For example, we can extract tokens from the "From" MIME header as follows:

user> (re-seq #"From:(.*)\n" "From:\n") (["From:\n" ""])

Note the use of the newline character at the end of the regular expression, which is used to indicate the end of a MIME header in an e-mail.

We can then proceed to extract words from the values returned by matching the regular expression defined in the preceding code. Let's define the following few functions to extract tokens from a given e-mail's headers and body using this logic:

(defn header-token-regex [f] (re-pattern (str f "(.*)\n"))) (defn extract-tokens-from-headers [text] (for [field header-fields] (map #(str field %1) ; prepends field to each word from line (mapcat (fn [x] (->> x second (re-seq token-regex))) (re-seq (header-token-regex field) text))))) (defn extract-tokens [text] (apply concat (re-seq token-regex text) (extract-tokens-from-headers text)))

The header-token-regex function defined in the preceding code returns a regular expression for a given header, such as From:(.*)\n for the "From" header. The extract-tokens-from-headers function uses this regular expression to determine all words in the various header fields of an e-mail and appends the header name to all the tokens found in the header text. The extract-tokens function applies the regular expression over the text and headers of an e-mail and then flattens the resulting lists into a single list using the apply and concat functions. Note that the extract-tokens-from-headers function returns empty lists for the headers defined in header-fields, which are not present in the supplied e-mail header. Let's try this function out in the REPL with the help of the following code:

User≫ (Def Sample-Text "From: 12A1Mailbot1@Web.De Return-Path: ≪12A1Mailbot1@Web.De≫ Mime-Version: 1.0") User≫ (Extract-Tokens-From-Headers Sample-Text) (() ("From:Mailbot" "From:Web") () ("Return-Path:Mailbot" "Return-Path:Web"))

Using the extract-tokens-from-headers function and the regular expression defined by token-regex, we can extract all words comprising of three or more characters from an e-mail's header and text. Now, let's define a function to apply the extract-tokens function on a given e-mail and update the feature map using the update-feature! function with all the words found in the e-mail. We will do all this with the help of the following code:

(defn update-features! "Updates or creates a TokenFeature in database for each token in text." [text f & args] (doseq [token (extract-tokens text)] (apply update-feature! token f args)))

Using the update-features! function in the preceding code, we can train our spam classifier with a given e-mail. In order to keep track of the total number of spam and ham e-mails, we will have to send the inc function to the total-spam or total-ham agents depending on whether a given e-mail is spam or ham. We will do this with the help of the following code:

(defn inc-total-count! [type] (send (case type :spam total-spam :ham total-ham) inc)) (defn train! [text type] (update-features! text inc-count type) (inc-total-count! type))

The inc-total-count! function defined in the preceding code updates the total number of spam and ham e-mails in our feature database. The train! function simply calls the update-features! and inc-total-count! functions to train our spam classifier with a given e-mail and its type. Note that we pass the inc-count function to the update-features! function. Now, in order to classify a new e-mail as spam or ham, we must first define how to extract the known features from a given e-mail using our trained feature database. We will do this with the help of the following code:

(defn extract-features "Extracts all known tokens from text" [text] (keep identity (map #(@feature-db %1) (extract-tokens text))))

The extract-features function defined in the preceding code looks up all known features in a given e-mail by dereferencing the map stored in feature-db and applying it as a function to all the values returned by the extract-tokens function. As mapping the closure #(@feature-db %1) can return () or nil for all tokens that are not present in a feature-db agent, we will need to remove all empty values from the list of extracted features. To do this, we will use the keep function, which expects a function to apply to the non-nil values in a collection and the collection from which all nil values must be filtered out. Since we do not intend to transform the known features from the e-mail, we will pass the identity function, which returns its argument itself as the first parameter to the keep function.

Now that we have extracted all known features from a given e-mail, we must calculate all the probabilities of these features occurring in a spam e-mail. We must then combine these probabilities using Fisher's method we described earlier to determine the spam score of a new e-mail. Let's define the following functions to implement the Bayesian probability and Fisher's method:

(defn spam-probability [feature] (let [s (/ (:spam feature) (max 1 @total-spam)) h (/ (:ham feature) (max 1 @total-ham))] (/ s (+ s h)))) (defn bayesian-spam-probability "Calculates probability a feature is spam on a prior probability assumed-probability for each feature, and weight is the weight to be given to the prior assumed (i.e. the number of data points)." [feature & {:keys [assumed-probability weight] :or {assumed-probability 1/2 weight 1}}] (let [basic-prob (spam-probability feature) total-count (+ (:spam feature) (:ham feature))] (/ (+ (* weight assumed-probability) (* total-count basic-prob)) (+ weight total-count))))

The spam-probability function defined in the preceding code calculates the probability of occurrence of a given word feature in a spam e-mail using the number of occurrences of the word in spam and ham e-mails and the total number of spam and ham e-mails processed by the classifier. To avoid division-by-zero errors, we ensure that the value of the number of spam and ham e-mails is at least 1 before performing division. The bayesian-spam-probability function uses this probability returned by the spam-probability function to calculate a weighted average with the initial probability of 0.5 or 1/2.

We will now implement Fisher's method of combining the probabilities returned by the bayesian-spam-probability function for all the known features found in an e-mail. We will do this with the help of the following code:

(defn fisher "Combines several probabilities with Fisher's method." [probs] (- 1 (cdf-chisq (* -2 (reduce + (map #(Math/log %1) probs))) :df (* 2 (count probs)))))

The fisher function defined in the preceding code uses the cdf-chisq function from the Incanter library to calculate the CDF of the several probabilities transformed by the expression . We specify the number of degrees of freedom to this function using the :df optional parameter. We now need to apply the fisher function to the combined Bayesian probabilities of an e-mail being spam or ham, and combine these values into a final spam score. These two probabilities must be combined such that only a high number of occurrences of high probabilities indicate a strong probability of spam or ham. It has been shown that the simplest way to do this is to average the probability of a spam e-mail and the negative probability of a ham e-mail (or 1 minus the probability of a ham e-mail). We will do this with the help of the following code:

(defn score [features] (let [spam-probs (map bayesian-spam-probability features) ham-probs (map #(- 1 %1) spam-probs) h (- 1 (fisher spam-probs)) s (- 1 (fisher ham-probs))] (/ (+ (- 1 h) s) 2)))

Hence, the score function will return the final spam score of a given e-mail. Let's define a function to extract the known word features from a given e-mail, combine the probabilities of occurrences of these features to produce the e-mail's spam score, and finally classify this spam score as a ham or spam e-mail, represented by the keywords :ham and :spam respectively, as shown in the following code:

(defn classify "Returns a vector of the form [classification score]" [text] (-> text extract-features score classify-score))

So far, we have implemented how we train our spam classifier and use it to classify a new e-mail. Now, let's define some functions to load the sample data from the project's corpus/ folder and use this data to train and cross-validate our classifier, as follows:

(defn populate-emails "Returns a sequence of vectors of the form [filename type]" [] (letfn [(get-email-files [type] (map (fn [f] [(.toString f) (keyword type)]) (rest (file-seq (file (str "corpus/" type))))))] (mapcat get-email-files ["ham" "spam"])))

The populate-emails function defined in the preceding code returns a sequence of vectors to represent all the ham e-mails from the ham/ folder and the spam e-mails from the spam/ folder in our sample data. Each vector in this returned sequence has two elements. The first element in this vector is a given e-mail's relative file path and the second element is either :spam or :ham depending on whether the e-mail is spam or ham. Note the use of the file-seq function to read the files in a directory as a sequence.

We will now use the train! function to feed the content of all e-mails into our spam classifier. To do this, we can use the slurp function to read the content of a file as a string. For cross-validation, we will classify each e-mail in the supplied cross-validation data using the classify function and return a list of maps representing the test result of the cross-validation. We will do this with the help of the following code:

(defn train-from-corpus! [corpus] (doseq [v corpus] (let [[filename type] v] (train! (slurp filename) type)))) (defn cv-from-corpus [corpus] (for [v corpus] (let [[filename type] v [classification score] (classify (slurp filename))] {:filename filename :type type :classification classification :score score})))

The train-from-corpus! function defined in the preceding code will train our spam classifier with all e-mails found in the corpus/ folder. The cv-from-corpus function classifies the supplied e-mails as spam or ham using the trained classifier and returns a sequence of maps indicating the results of the cross-validation process. Each map in the sequence returned by the cv-from-corpus function contains the file of the e-mail, the actual type (spam or ham) of the e-mail, the predicted type of the e-mail, and the spam score of the e-mail. Now, we need to call these two functions on two appropriately partitioned subsets of the sample data as follows:

(defn test-classifier! [corpus cv-fraction] "Trains and cross-validates the classifier with the sample data in corpus, using cv-fraction for cross-validation. Returns a sequence of maps representing the results of the cross-validation." (clear-db) (let [shuffled (shuffle corpus) size (count corpus) training-num (* size (- 1 cv-fraction)) training-set (take training-num shuffled) cv-set (nthrest shuffled training-num)] (train-from-corpus! training-set) (await feature-db) (cv-from-corpus cv-set)))

The test-classifier! function defined in the preceding code will randomly shuffle the sample data and select a specified fraction of this randomized data as the cross-validation set for our classifier. The test-classifier! function then calls the train-from-corpus! and cv-from-corpus functions to train and cross-validate the data. Note that the use of the await function is to wait until the feature-db agent has finished applying all functions that have been sent to it via the send function.

Now we need to analyze the results of cross-validation. We must first determine the number of incorrectly classified and missed e-mails from the actual and expected class of a given e-mail as returned by the cv-from-corpus function. We will do this with the help of the following code:

(defn result-type [{:keys [filename type classification score]}] (case type :ham (case classification :ham :correct :spam :false-positive :unsure :missed-ham) :spam (case classification :spam :correct :ham :false-negative :unsure :missed-spam)))

The result-type function will determine the number of incorrectly classified and missed e-mails in the cross-validation process. We can now apply the result-type function to all the maps in the results returned by the cv-from-corpus function and print a summary of the cross-validation results with the help of the following code:

(defn analyze-results [results] (reduce (fn [map result] (let [type (result-type result)] (update-in map [type] inc))) {:total (count results) :correct 0 :false-positive 0 :false-negative 0 :missed-ham 0 :missed-spam 0} results)) (defn print-result [result] (let [total (:total result)] (doseq [[key num] result] (printf "%15s : %-6d%6.2f %%%n" (name key) num (float (* 100 (/ num total)))))))

The analyze-results function defined in the preceding code simply applies the result-type function to all the map values in the sequence returned by the cv-from-corpus function, while maintaining the total number of incorrectly classified and missed e-mails. The print-result function simply prints the analyzed result as a string. Finally, let's define a function to load all the e-mails using the populate-emails function and then use this data to train and cross-validate our spam classifier. Since the populate-emails function will return an empty list, or nil when there are no e-mails, we will check this condition to avoid failing at a later stage in our program:

(defn train-and-cv-classifier [cv-frac] (if-let [emails (seq (populate-emails))] (-> emails (test-classifier! cv-frac) analyze-results print-result) (throw (Error. "No mails found!"))))

In the train-and-cv-classifier function shown in the preceding code, we first call the populate-emails function and convert the result to a sequence using the seq function. If the sequence has any elements, we train and cross-validate the classifier. If there are no e-mails found, we simply throw an error. Note that the if-let function is used to check whether the sequence returned by the seq function has any elements.

We have all the parts needed to create and train a spam classifier. Initially, as the classifier hasn't seen any e-mails, the probability of any e-mail or text being spam is 0.5. This can be verified by using the classify function, as shown in the following code, which initially classifies any text as the :unsure type:

user> (classify "Make money fast") [:unsure 0.5] user> (classify "Job interview today! Programmer job position for GNU
project") [:unsure 0.5]

We now train the classifier and cross-validate it using the train-and-cv-classifier function. We will use one-fifth of all the available sample data as our cross-validation set. This is shown in the following code:

user> (train-and-cv-classifier 1/5) total : 600 100.00 % correct : 585 97.50 % false-positive : 1 0.17 % false-negative : 1 0.17 % missed-ham : 9 1.50 % missed-spam : 4 0.67 % nil

Cross-validating our spam classifier asserts that it's appropriately classifying e-mails. Of course, there is still a small amount of error, which can be corrected by using more training data. Now, let's try to classify some text using our trained spam classifier, as follows:

user> (classify "Make money fast") [:spam 0.9720416490829515] user> (classify "Job interview today! Programmer job position for GNU
project") [:ham 0.19095646757667556]

Interestingly, the text "Make money fast" is classified as spam and the text "Job interview … GNU project" is classified as ham, as shown in the preceding code. Let's have a look at how the trained classifier extracts features from some text using the extract-features function. Since the classifier will initially have read no tokens, this function will obviously return an empty list or nil when the classifier is untrained, as follows:

user> (extract-features "some text to extract") (#clj_ml5.spam.TokenFeature{:token "some", :spam 91, :ham 837} #clj_ml5.spam.TokenFeature{:token "text", :spam 907, :ham 1975} #clj_ml5.spam.TokenFeature{:token "extract", :spam 3, :ham 5})

As shown in the preceding code, each TokenFeature record will contain the number of times a given word is seen in spam and ham e-mails. Also, the word "to" is not recognized as a feature since we only consider words comprising of three or more characters.

Now, let's check how sensitive to spam e-mail our spam classifier actually is. We'll first have to select some text or a particular term that is classified as neither spam nor ham. For the training data selected for this example, the word "Job" fits this requirement, as shown in the following code. Let's train the classifier with the word "Job" while specifying the type of the text as ham. We can do this using the train! function, as follows:

user> (classify "Job") [:unsure 0.6871002132196162] user> (train! "Job" :ham) #<Agent@1f7817e: 1993> user> (classify "Job") [:unsure 0.6592140921409213]

After training the classifier with the given text as ham, the probability of the term being spam is observed to decrease by a small amount. If the term "Job" occurred in several more e-mails that were ham, the classifier would eventually classify this word as ham. Thus, the classifier doesn't show much of a reaction to a new ham e-mail. On the contrary, the classifier is observed to be very sensitive to spam e-mails, as shown in the following code:

user> (train! "Job" :spam) #<Agent@1f7817e: 1994> user> (classify "Job") [:spam 0.7445135045480734]

An occurrence of a particular word in a single spam e-mail is observed to greatly increase a classifier's predicted probability of the given term belonging to a spam e-mail. The term "Job" will subsequently be classified as spam by our classifier, at least until it's seen to appear in a sufficiently large number of ham e-mails. This is due to the nature of the chi-squared distribution that we are modeling.

We can also improve the overall error of our spam classifier by supplying it with more training data. To demonstrate this, let's cross-validate the classifier with only one-tenth of the sample data. Thus, the classifier would be effectively trained with nine-tenths of the available data, as follows:

User≫ (Train-And-Cv-Classifier 1/10) Total : 300 100.00 % Correct : 294 98.00 % False-Positive : 0 0.00 % False-Negative : 1 0.33 % Missed-Ham : 3 1.00 % Missed-Spam : 2 0.67 % Nil

As shown in the preceding code, the number of misses and wrongly classified e-mails is observed to reduce when we use more training data. Of course, this is only shown as an example, and we should instead collect more e-mails to feed into the classifier as training data. Using a significant amount of the sample data for cross-validation is a good practice.

In summary, we have effectively built a spam classifier that is trained using Fisher's method. We have also implemented a cross-validation diagnostic, which serves as a kind of unit test for our classifier.

Note that the exact values produced by the train-and-cv-classifier function will vary depending on the spam and ham emails used as training data.


In this article, we've explored the tools provided by the clj-ml library to cross-validate a given classifier. Later, we've built an operational spam classifier that incorporates cross-validation to determine whether the classifier is appropriately classifying e-mails as spam.

Resources for Article:

Further resources on this subject:

You've been reading an excerpt of:

Clojure for Machine Learning

Explore Title