MovieLens is a research project that provides datasets of various sizes and attributes, containing movie ratings. These datasets are free to download and use for non-commercial purposes. They have done an awesome job putting this data together and a big thanks goes to them for making it available.
I wanted to exercise my Clojure skills (more like add to my tiny set of Clojure skills 🙂 ) and it just so happens that I recently came across the MovieLens project, so how about analyzing that data using Clojure ?
One of the datasets they make available is the One Million Dataset, this set consists of 3 files
- “movies.dat” containing 3883 movie listings, contains title, genre…
- “users.dat” containing 6040 unique users, contains age, occupation, gender …
- “ratings.dat” containing 1000209 movie ratings, that references movie id and user id from the above 2 files.
I could analyze this data to answer questions such as, What age group gave the most ratings ? or What was the highest rated movie for a given time period ?
But before I could do this I wanted to denormalize the ratings file so that it also contains the user and movie information, why ? cause I don’t want to look it up when I am analyzing the data, each record should be self contained.
The outline of the program is quite simple
- Read the users file into memory
- Read the movies files into memory
- For each line in the ratings
- Find the corresponding movie and user
- Print it out to a file.
Take a minute to think how would you do this in java and then look at the below code. I ran it on a Dell laptop dual 2.2Ghz laptop with 4 gig of ram and care to guess how long it takes ?? scroll down for answer.
–
–
(ns com.dev.file-reader
(:use [clojure.contrib.duck-streams])
(:import [java.io BufferedReader FileReader BufferedWriter FileWriter]))
(defstruct user :id :gender :age :ccupation :zip-code)
(defstruct movie :id :title :genres)
(defn format-user [user] (str (:id user) "::" (:gender user) "::" (:age user) "::" (:ccupation user) "::" (:zip-code user)))
(defn format-movie [movie] (str (:id movie) "::" (:title movie) "::" (:genres movie)))
(defn read-user-file [fileName]
(loop [users {} fileSeq (read-lines fileName)]
(let [line (first fileSeq)]
(if (nil? line)
users
(let [tokens (.split line "::")
id (aget tokens 0)
user (struct user id (aget tokens 1) (aget tokens 2) (aget tokens 3) (aget tokens 4))]
(recur (merge users {id user}) (rest fileS)))))))
(defn read-movies-file [fileName]
(loop [movies {} fileSeq (read-lines fileName)]
(let [line (first fileSeq)]
(if (nil? line)
movies
(let [tokens (.split line "::")
id (aget tokens 0)
movie (struct movie (Integer/parseInt (aget tokens 0)) (aget tokens 1) (aget tokens 2))]
(recur (merge movies {id movie}) (rest fileS)))))))
(defn convert-ratings-file
"read the ratings file and denormalize it"
[moviesF usersF ratingsF outputF]
(let [movies (read-movies-file moviesF) users (read-user-file usersF)]
(with-open [#^BufferedReader rdr (BufferedReader. (FileReader. ratingsF) 1048576)
#^BufferedWriter wtr (BufferedWriter. (FileWriter. outputF) 1048576)]
(doseq [line (line-seq rdr)]
(let [tokens (.split line "::")
user-id (aget tokens 0)
movie-id (aget tokens 1)
user (get users user-id)
movie (get movies movie-id)
rating (aget tokens 2)
timestamp (aget tokens 3)]
(.write wtr (str (format-user user) "::" (format-movie movie) "::" rating "::" timestamp "\n")))))))
(defn doIt []
(time (convert-ratings-file
"movielens-1m/movies.dat"
"movielens-1m/users.dat"
"movielens-1m/ratings.dat"
"movielens-1m/output.dat"
)))
So ready with you guess ??
I ran the program 5 times and here is the output
"Elapsed time: 12130.035819 msecs"
"Elapsed time: 13113.92823 msecs"
"Elapsed time: 13364.234216 msecs"
"Elapsed time: 12553.478168 msecs"
"Elapsed time: 14488.706176 msecs"
On average 13.130076521799994 Seconds to read in 1 million records, for each record look up the movie and user and write it back to the disk.
Clojure puts the FUNctional back in programming.