Denormalizing One Million Records with Clojure

MovieLens is a research project that provides datasets of various sizes and attributes, containing movie ratings. These datasets are free to download and use for non-commercial purposes. They have done an awesome job putting this data together and a big thanks goes to them for making it available.

I wanted to exercise my Clojure skills (more like add to my tiny set of Clojure skills 🙂) and it just so happens that I recently came across the MovieLens project, so how about analyzing that data using Clojure?

One of the datasets they make available is the One Million Dataset, which consists of 3 files:

  1. “movies.dat”, containing 3883 movie listings with title, genre, …
  2. “users.dat”, containing 6040 unique users with age, occupation, gender, …
  3. “ratings.dat”, containing 1000209 movie ratings, each referencing a movie id and a user id from the above 2 files.

I could analyze this data to answer questions such as: What age group gave the most ratings? What was the highest rated movie for a given time period?

But before I could do this, I wanted to denormalize the ratings file so that each rating also contains the user and movie information. Why? Because I don’t want to look these up while I am analyzing the data; each record should be self-contained.
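To make the payoff concrete, here is a minimal sketch of the first question answered against self-contained records. It assumes Clojure 1.2 or later (for `clojure.string` and `frequencies`) and assumes the denormalized line starts with the user fields, with age as the third field; the sample lines and both function names are made up for illustration.

```clojure
(require '[clojure.string :as str])

;; Hypothetical denormalized lines: user fields first, age in the third field.
(def sample-lines
  ["1::F::25::10::12345::10::Some Movie (1999)::Drama::5::978300000"
   "2::M::35::7::54321::10::Some Movie (1999)::Drama::4::978300100"
   "3::M::25::4::11111::20::Another Movie (2000)::Comedy::3::978300200"])

(defn ratings-per-age-group
  "Counts how many rating lines fall into each age group."
  [lines]
  (frequencies (map #(nth (str/split % #"::") 2) lines)))

(defn busiest-age-group
  "Returns the [age-group count] pair with the highest count."
  [lines]
  (apply max-key val (ratings-per-age-group lines)))
```

No lookup into the users file is needed: one pass over the lines is enough because each line already carries the age.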

The outline of the program is quite simple:

  • Read the users file into memory
  • Read the movies file into memory
  • For each line in the ratings file
    • Find the corresponding movie and user
    • Write the combined record out to a file.
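Before the full program, the outline above can be tried on a few in-memory lines. This is only a sketch under stated assumptions: it needs Clojure 1.2 or later for `clojure.string`, the sample lines are invented, and `index-by-id` and `denormalize` are hypothetical helper names, not part of the program in this post.

```clojure
(require '[clojure.string :as str])

;; Hypothetical sample lines in the "::"-delimited layout of the three files.
(def sample-users   ["1::F::25::10::12345"])
(def sample-movies  ["10::Some Movie (1999)::Drama"])
(def sample-ratings ["1::10::5::978300000"])

(defn index-by-id
  "Builds a map from the first field (the id) to the whole original line."
  [lines]
  (into {} (map (fn [line] [(first (str/split line #"::")) line]) lines)))

(defn denormalize
  "Replaces the user id and movie id of one rating line with the full records."
  [users movies rating-line]
  (let [[user-id movie-id rating timestamp] (str/split rating-line #"::")]
    (str/join "::" [(get users user-id) (get movies movie-id) rating timestamp])))

;; Steps 1 and 2 build the lookup maps; step 3 walks the ratings.
(def denormalized
  (map (partial denormalize (index-by-id sample-users) (index-by-id sample-movies))
       sample-ratings))
```

Keeping both lookup tables as maps keyed by the string id means each rating costs two hash lookups, which is why the two small files are loaded into memory first.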

Take a minute to think about how you would do this in Java, and then look at the code below. I ran it on a Dell laptop with a dual-core 2.2 GHz processor and 4 GB of RAM. Care to guess how long it takes? Scroll down for the answer.

(ns com.dev.file-reader
 (:use [clojure.contrib.duck-streams :only [read-lines]])
 (:import [java.io BufferedReader FileReader BufferedWriter FileWriter]))

(defstruct user :id :gender :age :occupation :zip-code)
(defstruct movie :id :title :genres)

(defn format-user [user] (str (:id user) "::" (:gender user) "::" (:age user) "::" (:occupation user) "::" (:zip-code user)))

(defn format-movie [movie] (str (:id movie) "::" (:title movie) "::" (:genres movie)))

(defn read-user-file
 "read the users file into a map of user id (a string) -> user struct"
 [fileName]
 (loop [users {} fileSeq (read-lines fileName)]
   (let [line (first fileSeq)]
     (if (nil? line)
       users
       (let [tokens (.split line "::")
             id (aget tokens 0)
             user (struct user id (aget tokens 1) (aget tokens 2) (aget tokens 3) (aget tokens 4))]
         (recur (assoc users id user) (rest fileSeq)))))))

(defn read-movies-file
 "read the movies file into a map of movie id (a string) -> movie struct"
 [fileName]
 (loop [movies {} fileSeq (read-lines fileName)]
   (let [line (first fileSeq)]
     (if (nil? line)
       movies
       (let [tokens (.split line "::")
             ;; the map key stays a string so it matches the ids read from the ratings file
             id (aget tokens 0)
             movie (struct movie (Integer/parseInt id) (aget tokens 1) (aget tokens 2))]
         (recur (assoc movies id movie) (rest fileSeq)))))))

(defn convert-ratings-file
 "read the ratings file and denormalize it"
 [moviesF usersF ratingsF outputF]
   (let [movies (read-movies-file moviesF) users (read-user-file usersF)]
     (with-open [#^BufferedReader rdr (BufferedReader. (FileReader. ratingsF) 1048576)
                 #^BufferedWriter wtr (BufferedWriter. (FileWriter. outputF) 1048576)]
       (doseq [line (line-seq rdr)]
         (let [tokens (.split line "::")
               user-id (aget tokens 0)
               movie-id (aget tokens 1)
               user (get users user-id)
               movie (get movies movie-id)
               rating (aget tokens 2)
               timestamp (aget tokens 3)]
           (.write wtr (str (format-user user) "::" (format-movie movie) "::" rating "::" timestamp "\n")))))))

(defn doIt []
 (time (convert-ratings-file
        "movielens-1m/movies.dat"
        "movielens-1m/users.dat"
        "movielens-1m/ratings.dat"
        "movielens-1m/output.dat")))

So, ready with your guess?
I ran the program 5 times and here is the output:

"Elapsed time: 12130.035819 msecs"
"Elapsed time: 13113.92823 msecs"
"Elapsed time: 13364.234216 msecs"
"Elapsed time: 12553.478168 msecs"
"Elapsed time: 14488.706176 msecs"

On average it takes about 13.13 seconds to read in 1 million records, look up the movie and user for each record, and write the result back to disk.
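The average can be checked directly from the five timings above. A small sketch; `mean`, `mean-seconds` and `records-per-second` are names introduced here for illustration, and the record count is the 1000209 ratings mentioned earlier.

```clojure
;; The five "Elapsed time" values reported above, in milliseconds.
(def elapsed-msecs
  [12130.035819 13113.92823 13364.234216 12553.478168 14488.706176])

(defn mean
  "Arithmetic mean of a non-empty collection of numbers."
  [xs]
  (/ (reduce + xs) (count xs)))

;; Average run time in seconds.
(def mean-seconds (/ (mean elapsed-msecs) 1000.0))

;; Throughput: ratings processed per second of wall-clock time.
(def records-per-second (/ 1000209 mean-seconds))
```

That works out to roughly 76,000 ratings read, joined and written per second on that laptop.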

Clojure puts the FUNctional back in programming.

Redis and Clojure

Check out my previous post about Redis.

In this post I build a very simple example of using Redis with Clojure. I will be using redis-clojure, a Redis client library written in Clojure. You could also use the Java library; for the complete list of supported languages, go to this link.

So here we go..

  1. Create a simple Clojure project (I personally use Leiningen). To create a new project, execute ‘lein new com.dev/try-redis’; this will create an entire project structure.
  2. Edit the project.clj file under the newly created project directory and add a new dependency for redis-clojure. The file should look close to this after you are done.
    (defproject com.dev/try-redis "1.0.0-SNAPSHOT"
      :description "simple example of using redis"
      :dependencies [[org.clojure/clojure "1.1.0"]
                     [org.clojure/clojure-contrib "1.1.0"]
                     [redis-clojure "1.0.3-SNAPSHOT"]]
      :dev-dependencies [[swank-clojure "1.2.1"]])
    
  3. Run ‘lein deps’ so that all the dependencies are downloaded.
  4. Edit the file core.clj under the directory try-redis/src/com/dev/try_redis, and add the following.
    (ns com.dev.try-redis.core
      (:require redis))
    
    (defn test-redis []
         (redis/with-server {:host "127.0.0.1" :port 6379 :db 0}
           (do
             (redis/set "foo" "bar")
             (println (redis/get "foo")))))
    
    

    Lines 7 and 8 set a key/value pair and then retrieve the value.

  5. Start the Redis server: ‘./redis-server redis.conf’
  6. Now we are ready to execute the script. There are 2 ways to do this.
    1. The easiest way is to go to your project root directory and run ‘lein repl’ (see the output below), which opens a read-eval-print loop. Once you have that, run ‘(load-file "src/com/dev/try_redis/core.clj")’ to load the file, and then run ‘(com.dev.try-redis.core/test-redis)’ to run the example.
    2. I personally use emacs/slime, but for this option you need to have emacs and slime-clojure installed (see my emacs page). Run ‘lein swank’ in the project directory and then connect to it from emacs using ‘M-x slime-connect’. This will open up a repl; do a C-c C-k to compile the file, and in the repl you can execute the example using ‘(com.dev.try-redis.core/test-redis)’.

If everything has gone well, you should see this output.

Clojure 1.1.0
user=> (load-file "src/com/dev/try_redis/core.clj")
#'com.dev.try-redis.core/test-redis
user=> (com.dev.try-redis.core/test-redis)         
bar
nil
user=>

Redis

Recently I have heard a lot about Redis, so I decided to try it out. But first, a little intro: “Redis is a database. To be specific, Redis is a database implementing a dictionary, where every key is associated with a value.”

Think of Redis as memcache on steroids: it extends the basic key/value store paradigm by letting values have a type, and it defines operations that are specific to each type.

For example, Redis lets you associate a key with a list and then perform list-specific operations on it:

lpush mylist 1 --pushes 1 onto the head of mylist
lpush mylist 2
llen mylist --returns the length of mylist
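For readers coming from Clojure, the same three steps can be mimicked with a plain in-memory list. This is only an analogy for what the commands do, not Redis client code: nothing here talks to a server, and the var name `mylist` simply mirrors the key used above.

```clojure
;; A plain Clojure list standing in for the Redis list stored under "mylist".
;; conj on a list adds at the head, which is also where lpush adds.
(def mylist
  (-> '()
      (conj "1")    ; like: lpush mylist 1
      (conj "2")))  ; like: lpush mylist 2

;; like: llen mylist
(def mylist-length (count mylist))

;; The most recently pushed element sits at the head of the list.
(def head (first mylist))
```

The difference is that Redis keeps the list on the server, so every client of that server sees the same "mylist".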

By the way, if you just want to try out Redis without having to download and install it, check out this link.

Supported types are Lists, Sets, Sorted Sets and Strings. Below are some interesting operations (to see the entire list go to this link).

  • Adding elements to either the head or tail of a list.
  • Pop the first element atomically (very lispy)
  • Union two sets.
  • Sorted Sets are sorted by score that you provide. It also uses the score when inserting new elements.
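The set operations in the list above have close cousins in Clojure's own collections, which may help build intuition. The sketch below is an in-memory analogy only, with made-up data; it does not call Redis, and a map of member to score is just one way to picture a sorted set.

```clojure
(require '[clojure.set :as set])

;; Union of two sets: each member appears once in the result.
(def all-tags
  (set/union #{"clojure" "redis"} #{"redis" "emacs"}))

;; A sorted set pictured as member -> score (hypothetical members and scores).
(def scores {"alice" 30 "bob" 10 "carol" 20})

;; Members ordered by the score supplied for each of them, lowest first.
(def by-score (map key (sort-by val scores)))
```

In Redis the ordering is maintained as elements are inserted, so reading members in score order does not require a sort at query time.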

Redis also persists data by writing to disk asynchronously, so your entire application can use just Redis without needing a separate database.

Here is a simple example using Redis:

  • Download Redis: ‘wget http://redis.googlecode.com/files/redis-1.2.6.tar.gz’.
  • Extract Redis and run ‘make’ in the directory.
  • Run the Redis server: ‘./redis-server redis.conf’
  • Now you can start playing with the server using redis-cli. Try the following:

    ./redis-cli set name devender
    ./redis-cli get name