Stop Words and commonly concatenated words

Here is a list of about 800 stop words built from 4 million documents (I started with this set). This list has helped us reduce model size and increase accuracy. Please note that the same list may not suit your application, so review it before using it.

More interestingly, here is a list of commonly concatenated words that I found, along with their corrections.


Caching method calls or Memoization

Just as we use the Hibernate 2nd level cache to store data, we can also save the results of a method call. This is a pretty old technique, well known in functional programming languages like Haskell, where it goes by the fancy name memoization.

Here is how it is done in Spring.

You can read the details in the link, but at a high level, when you call a method the result is stored in the cache, and the next time around the result from the cache is used. As usual, you can declare how long the cache should stay active and so on in a simple ehcache.xml config file.

There is also an open source project that now lets you just decorate methods with the @Cacheable annotation, and it takes care of the rest.
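
To make the idea concrete outside of Spring, here is a minimal sketch in Clojure using the built-in memoize function. The function slow-square and the call-count atom are made up for illustration. Unlike an ehcache-backed cache, memoize never expires entries, so treat this as a picture of the technique, not a replacement for the Spring setup.

```clojure
;; Hypothetical slow function; the atom counts how often the body really runs.
(def call-count (atom 0))

(defn slow-square [n]
  (swap! call-count inc)
  (Thread/sleep 100) ; stand-in for an expensive lookup or computation
  (* n n))

;; clojure.core/memoize wraps a function with a cache keyed on its arguments.
(def fast-square (memoize slow-square))

(fast-square 12) ; first call with 12: the body runs
(fast-square 12) ; same argument: answered from the cache, the body does not run
(fast-square 7)  ; new argument: the body runs again
```

After these three calls the body has run only twice, which you can check with @call-count.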


# of lines of code in your project

I wrote up this post while waiting for the build. It has absolutely no point, it is just an observation. At present the code base that I work with every day has:

1030467 lines of Java
 641411 lines of XML
 224530 lines of JSP
  58950 lines of plain text
 102751 lines in property files
   2246 lines of Groovy
1353693 lines of SQL (schema files, DML, DDL, ...)

90    Projects
7186 Java files
2547 SQL files

How many does yours have?
It's easy, run this:

    find . -name '*.java' | xargs wc -l | grep total | sed 's/total//g'
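
If you would rather stay inside a REPL, the same count can be sketched in Clojure. This is only an illustration: count-lines is a name I made up, and it assumes Clojure 1.2 or later for clojure.java.io. It counts lines as line-seq sees them, so a file without a trailing newline may come out one higher than wc -l reports.

```clojure
(require '[clojure.java.io :as io])

(defn count-lines
  "Total number of lines in all files under dir whose name ends with ext."
  [dir ext]
  (->> (file-seq (io/file dir))                 ; walk the directory tree
       (filter (fn [^java.io.File f]
                 (and (.isFile f) (.endsWith (.getName f) ext))))
       (map (fn [f]
              (with-open [rdr (io/reader f)]    ; close each file after counting
                (count (line-seq rdr)))))
       (reduce + 0)))

;; Example call from the project root:
;; (count-lines "." ".java")
```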


Favorite languages, why so great? and why not so much?

I actually have two favorite languages:

  • Ruby: for all scripting and making quick apps.
  • Clojure: for development.

Why Ruby is great

  1. The language was designed for programmers, and you can see that in the API, which is totally intuitive.
  2. Lots of libraries. My favorite is Sinatra, which lets you build quick and dirty web apps, and the other is Sequel.
  3. I wrote a blog post on how to delete RFC-822 incompatible emails (if you are a developer using Linux and your company uses Outlook, you know what I am talking about). This is a simple example of how I have used Ruby to make quick and dirty scripts.

I have used Ruby numerous times to write scripts that fix production data, correct files, and generate complex reports. I have used Sinatra with Google Charts to make web apps that show load times, server status, and so on.

Why Ruby is not so great

  1. Not really built for performance. In recent years there has been a push to develop a faster virtual machine for Ruby, but it is still nowhere close to C or Java performance.
  2. Rails is a pain to deploy. Heroku takes away the pain, but what do you do if you have to deploy internally? I personally have 2 apps on Heroku one of which is

Why Clojure is great

I have been on a quest to learn a functional programming language for the past 3 years. I have read the Erlang book (please see the various posts I wrote about Erlang here). Erlang is a fine language, but I lost interest in it after I could not find a single good library that connects Erlang to Oracle. The problem is that there are too few 3rd party libraries. The next language I looked at was Haskell. It has lots of libraries and seems to perform well on the surface, but the problem I see is acceptance by the business, where most of the code is in Java. Then I found Clojure and fell in love with it.

  1. It is just another DSL for the JVM. If you provide type hints, the generated code will be much the same as what Java would produce (so you can easily sneak it in).
  2. Totally embraces the JVM unlike JRuby.
  3. The author Rich Hickey has done a lot to reduce the pain points of lisp.
  4. Finally a language that frees your mind of OOP. (Have you ever noticed how much time you spend trying to achieve the best object model when a simple one would do? And for what? The customers don't care as long as it works, and the computers sure don't care as long as it is 0s and 1s.)
  5. Code is so concise and elegant.

Why Clojure is not great.

  1. It has been called the language with the steepest learning curve on the JVM, and I tend to agree.
  2. Unlike Scala you have no wiggle room, it is either functional code or nothing (I actually like this feature).
  3. Debugging is a major pain point (though there has been improvement with the latest clojure-swank).

I have written many posts on Clojure on my blog, you can see them here. In the most recent post I show how you can parse a one million record file in less than 15 seconds with Clojure.


Denormalizing One million records with Clojure.

MovieLens is a research project that provides datasets of various sizes and attributes, containing movie ratings. These datasets are free to download and use for non-commercial purposes. They have done an awesome job putting this data together and a big thanks goes to them for making it available.

I wanted to exercise my Clojure skills (more like add to my tiny set of Clojure skills 🙂 ) and it just so happens that I recently came across the MovieLens project, so how about analyzing that data using Clojure ?

One of the datasets they make available is the One Million Dataset. This set consists of 3 files:

  1. "movies.dat", containing 3883 movie listings with title, genre, ...
  2. "users.dat", containing 6040 unique users with age, occupation, gender, ...
  3. "ratings.dat", containing 1000209 movie ratings that reference the movie id and user id from the above 2 files.

I could analyze this data to answer questions such as "What age group gave the most ratings?" or "What was the highest rated movie for a given time period?"

But before I could do this I wanted to denormalize the ratings file so that it also contains the user and movie information. Why? Because I don't want to look it up while I am analyzing the data, each record should be self-contained.

The outline of the program is quite simple

  • Read the users file into memory
  • Read the movies files into memory
  • For each line in the ratings
    • Find the corresponding movie and user
    • Print it out to a file.

Take a minute to think about how you would do this in Java, and then look at the code below. I ran it on a dual 2.2 GHz Dell laptop with 4 GB of RAM. Care to guess how long it takes? Scroll down for the answer.

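(ns movielens.denormalize ; hypothetical namespace name, the original was lost
 (:use [clojure.contrib.duck-streams :only (read-lines)]) ; assumed source of read-lines in clojure-contrib 1.1
 (:import [java.io BufferedReader FileReader BufferedWriter FileWriter]))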

(defstruct user :id :gender :age :occupation :zip-code)
(defstruct movie :id :title :genres)

(defn format-user [user] (str (:id user) "::" (:gender user) "::" (:age user) "::" (:occupation user) "::" (:zip-code user)))

(defn format-movie [movie] (str (:id movie) "::" (:title movie) "::" (:genres movie)))

(defn read-user-file [fileName]
 ;; build a map of user id -> user struct, one entry per line
 (loop [users {} fileSeq (read-lines fileName)]
   (let [line (first fileSeq)]
     (if (nil? line)
       users ; no more lines, return the accumulated map
       (let [tokens (.split line "::")
             id (aget tokens 0)
             user (struct user id (aget tokens 1) (aget tokens 2) (aget tokens 3) (aget tokens 4))]
         (recur (merge users {id user}) (rest fileSeq)))))))

(defn read-movies-file [fileName]
 ;; build a map of movie id (as a string) -> movie struct
 (loop [movies {} fileSeq (read-lines fileName)]
   (let [line (first fileSeq)]
     (if (nil? line)
       movies ; no more lines, return the accumulated map
       (let [tokens (.split line "::")
             id (aget tokens 0)
             movie (struct movie (Integer/parseInt (aget tokens 0)) (aget tokens 1) (aget tokens 2))]
         (recur (merge movies {id movie}) (rest fileSeq)))))))

(defn convert-ratings-file
 "read the ratings file and denormalize it"
 [moviesF usersF ratingsF outputF]
   (let [movies (read-movies-file moviesF) users (read-user-file usersF)]
     (with-open [#^BufferedReader rdr (BufferedReader. (FileReader. ratingsF) 1048576)
                 #^BufferedWriter wtr (BufferedWriter. (FileWriter. outputF) 1048576)]
       (doseq [line (line-seq rdr)]
         (let [tokens (.split line "::")
               user-id (aget tokens 0)
               movie-id (aget tokens 1)
               user (get users user-id)
               movie (get movies movie-id)
               rating (aget tokens 2)
               timestamp (aget tokens 3)]
           (.write wtr (str (format-user user) "::" (format-movie movie) "::" rating "::" timestamp "\n")))))))

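(defn doIt []
 ;; The original arguments were lost. The input names are the MovieLens files
 ;; listed above, and the output file name is a placeholder.
 (time (convert-ratings-file "movies.dat" "users.dat" "ratings.dat" "denormalized-ratings.dat")))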

So, ready with your guess?
I ran the program 5 times and here is the output:

"Elapsed time: 12130.035819 msecs"
"Elapsed time: 13113.92823 msecs"
"Elapsed time: 13364.234216 msecs"
"Elapsed time: 12553.478168 msecs"
"Elapsed time: 14488.706176 msecs"

That is about 13.13 seconds on average to read in 1 million records, look up the movie and user for each record, and write the result back to disk.
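
For anyone who wants to check the arithmetic, here is a small sketch that averages the five timings above. The names elapsed-ms, mean and mean-seconds are made up for this illustration.

```clojure
;; Elapsed times in milliseconds, copied from the five runs above.
(def elapsed-ms [12130.035819 13113.92823 13364.234216 12553.478168 14488.706176])

(defn mean [xs]
  (/ (reduce + xs) (count xs)))

;; Convert the mean from milliseconds to seconds.
(def mean-seconds (/ (mean elapsed-ms) 1000.0))

(println mean-seconds)
```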

Clojure puts the FUNctional back in programming.


Redis and Clojure

Check out my previous post about Redis.

In this post I build a very simple example of using Redis with Clojure. I will be using redis-clojure, a client library for Redis written in Clojure. You could also use the Java library. To see the complete list of supported languages, go to this link.

So here we go..

  1. Create a simple Clojure project (I personally use Leiningen). To create a new project, execute 'lein new'. This will create an entire project structure.
  2. Edit the project.clj file under the newly created project directory and add a new dependency for redis-clojure. The file should look close to this after you are done.
    (defproject "1.0.0-SNAPSHOT"
      :description "simple example of using redis"
      :dependencies [[org.clojure/clojure "1.1.0"]
                     [org.clojure/clojure-contrib "1.1.0"]
                     [redis-clojure "1.0.3-SNAPSHOT"]]
      :dev-dependencies [[swank-clojure "1.2.1"]])
  3. Run 'lein deps' so that all the dependencies are downloaded.
  4. Edit the file core.clj under the directory try-redis/src/com/dev/try_redis, and add the following.
    (ns com.dev.try-redis.core
      (:require redis))
    (defn test-redis []
         (redis/with-server {:host "" :port 6379 :db 0}
             (redis/set "foo" "bar")
             (println (redis/get "foo"))))

    The redis/set and redis/get calls store a key/value pair and then retrieve the value.

  5. Start the Redis server with './redis-server redis.conf'.
  6. Now we are ready to execute the script. There are 2 ways to do this.
    1. The easiest way is to go to your project root directory and run 'lein repl' (see the output below), which opens a read-eval-print loop. Once you have that, run '(load-file "src/com/dev/try_redis/core.clj")' to load the file, and then call the test-redis function defined above to run the example.
    2. I personally use emacs/slime, but for this option you need to have emacs and slime-clojure installed (see my emacs page). Run 'lein swank' in the project directory and then connect to it from emacs using 'M-x slime-connect'. This will open up a REPL. Do a C-c C-k to compile the file, and then call the test-redis function in the REPL.

If everything has gone well you should see this output.

Clojure 1.1.0
user=> (load-file "src/com/dev/try_redis/core.clj")
user=> (         


Either you get it or you don’t, I checked twitter to see if I was the only one, I guess not!

Dan Leslie dleslie Wow, #Perforce manage to take down my entire session. Have I mentioned before how much I /love/ this software?

Dan Leslie dleslie I hate it how that little red-X button in #Perforce /DOES NOTHING/. What’s the point of providing an interrupt function that doesn’t work?

Jerry Chen jcsalterego @mojombo @luckiestmonkey perforce strikes again!!

Shaun Newman SDN_Photography Gaaaaaaaaa! Perforce I hate you! #BloodyUselessCodeRepositoryTool!

Joshua Gosse jlgosse I don’t think Perforce could be any more complicated

CFV gadorcha Fuck perforce. Kill it with fire

aidy lewis aidy_lewis Using #Perforce is like sticking your head in cow pooh.

Artem Skvira art_s Wasting my life on manually merging stuff from Perforce repository with local copy. Creators of Perforce must burn in hell. Lots of hate.

靜毅 Jing_Yi c’mon perforce! you fucking pig.

Zachary TimeDoctor Have I mentioned lately how much perforce

geek: For the love of all that’s decent and good, how do I remove the Perforce context menu item from the Windows shell. I hate when VCS’ do that.

Jing_Yi : Perforce makes really shitty use of the network.
polarapfel #Intellij #Idea + #Perforce = #FAIL. Perforce integration in Idea killed my Idea 5 times so far today.
Robert MacCloy My loathing for Perforce grows daily.
polarapfel Giving up on #Perforce in #Intellij #Idea. Disabled Perforce in my project to get rid of Perforce background process hogging 100% CPU
Ryan Evans Ah Perforce, every day you find a new way to kick me in the nuts
Can someone tell me why Perforce exists? Ok Ok, maybe it was pimp back in the day but what’s your excuse now? too lazy to switch?
export P4PORT=host:port – hey, that makes perfect sense. Very intuitive. Good job, @perforce.
Ok feels better now, back to hacking.


Recently I have heard a lot about Redis, so I decided to try it out. But first a little intro about Redis: “Redis is a database. To be specific, Redis is a database implementing a dictionary, where every key is associated with a value.”

Think of Redis as memcache on steroids. Redis extends the basic key/value store paradigm by letting you have values of a certain type, and it defines operations that are unique to each type.

For example Redis lets you associate a key with a list and then lets you do list specific operations.

lpush mylist 1 --adds 1 to mylist
lpush mylist 2
llen mylist --returns the length of mylist

By the way if you just want to try out Redis without having to download and install, then check out this link.

Supported types are Lists, Sets, Sorted Sets and Strings. Below are some interesting operations (to see the entire list go to this link).

  • Adding elements to either the head or tail of a list.
  • Pop the first element atomically (very lispy).
  • Union two sets.
  • Sorted Sets are sorted by a score that you provide. Redis also uses the score when inserting new elements.

Redis also persists data by writing to disk asynchronously. This way your entire application can just use Redis, without the need for a separate database.

Here is a simple example using Redis

  • Download Redis with 'wget'.
  • Extract Redis and run 'make' in the directory.
  • Run the Redis server with './redis-server redis.conf'.
  • Now you can start playing with the server using redis-cli. Try the following:

    ./redis-cli set name devender
    ./redis-cli get name