Creating term frequency vectors
To calculate the Euclidean distance, let's first use our dictionary to turn each document into a vector. This makes it easy to compare term frequencies between documents, because a given term always occupies the same index of the vector.
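To see why a shared index matters, here is a minimal sketch using hand-written vectors. The names toy-term-ids, doc-a, doc-b and euclidean-distance are hypothetical and exist only for this illustration; the three-term dictionary is an assumption, not data from the text.

```clojure
;; Hypothetical toy dictionary: each term has a fixed integer ID,
;; which doubles as that term's index in every document vector.
(def toy-term-ids {"cat" 0, "dog" 1, "fish" 2})

;; Hand-written term frequency vectors for two short documents.
(def doc-a [2 1 0]) ; "cat cat dog"
(def doc-b [0 1 3]) ; "dog fish fish fish"

(defn euclidean-distance
  "Square root of the summed squared differences between two
  equal-length numeric vectors. Illustrative helper only."
  [a b]
  (Math/sqrt (reduce + (map (fn [x y]
                              (let [d (- x y)]
                                (* d d)))
                            a b))))

;; The count for a term sits at the same index in every vector.
(nth doc-b (toy-term-ids "fish")) ; => 3

;; Differences per index are 2, 0 and -3, so the distance is sqrt(13).
(euclidean-distance doc-a doc-b)
```

Because both vectors are indexed by the same term IDs, the distance can be computed by pairing elements position by position, with no lookup by term.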
;; Look up the integer ID assigned to a term, or nil if it is unknown.
(defn term-id [dict term]
  (get-in @dict [:terms term]))

;; Map of term ID to occurrence count; unknown terms are dropped.
(defn term-frequencies [dict terms]
  (->> (map #(term-id dict %) terms)
       (remove nil?)
       (frequencies)))

;; Write each count into a zero-filled vector at the index of its term ID.
(defn map->vector [dictionary id-counts]
  (let [zeros (vec (repeat (:count @dictionary) 0))]
    (-> (reduce #(apply assoc! %1 %2) (transient zeros) id-counts)
        (persistent!))))

(defn tf-vector [dict document]
  (map->vector dict (term-frequencies dict document)))

The term-frequencies function creates a map of term ID to frequency count for each term in the document, skipping any term that is not in the dictionary. The map->vector function simply takes this map and places each frequency count at the index of the vector given by its term ID. Since there may be many terms...