Stop Words and commonly concatenated words

Posted on March 16, 2011 by devender

Here is a list of about 800 stop words, built from a corpus of 4 million documents (I started with this set). The list has helped us reduce model size and increase accuracy. Please note that the same list may not suit your application, so review it before using it.

More interestingly, here is a list of commonly concatenated words that I found, along with their corrections.
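
As a rough illustration of how lists like these might be applied in a preprocessing step, here is a minimal Java sketch. The class name TextCleaner, the four sample stop words, and the two sample corrections are all placeholders invented for this example; they are not taken from the actual lists.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: the word lists below are tiny placeholders,
// not the ~800-word stop list or the corrections from the post.
class TextCleaner {
    // Assumed sample stop words.
    static final Set<String> STOP_WORDS = Set.of("the", "a", "of", "and");

    // Assumed sample corrections for concatenated words.
    static final Map<String, String> SPLITS = Map.of(
            "alot", "a lot",
            "thankyou", "thank you");

    // Lower-cases the text, splits on whitespace, expands known
    // concatenations, then drops stop words.
    static List<String> clean(String text) {
        List<String> kept = new ArrayList<>();
        for (String raw : text.toLowerCase().trim().split("\\s+")) {
            String expanded = SPLITS.getOrDefault(raw, raw);
            for (String word : expanded.split(" ")) {
                if (!word.isEmpty() && !STOP_WORDS.contains(word)) {
                    kept.add(word);
                }
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        System.out.println(clean("Thankyou for alot of the feedback"));
    }
}
```

Expanding concatenations before filtering matters here: "alot" becomes "a lot", and only then is "a" recognized as a stop word and dropped.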

General

December is a great time to work

Posted on December 28, 2010 by devender

Co-workers are playing with remote-controlled helicopters and exchanging recipes for goodies; there is cake in the kitchen, low traffic on the way to work, and lots of laughs in the office.

How I wish it stayed like this for the rest of the year.

General

Caching method calls or Memoization

Posted on August 5, 2010 by devender

Just as we use the Hibernate second-level cache to store data, we can also save the results of a method call. This is a pretty old technique that goes by the fancy name memoization, and it is a natural fit for functional programming languages like Haskell, where pure functions make cached results safe to reuse: http://en.wikipedia.org/wiki/Memoization

Here is how it is done in Spring: http://springtips.blogspot.com/2007/06/caching-methods-result-using-spring-and_23.html

You can read the details in the link, but at a high level, when you call a method the result is stored in the cache, and the next time around the result from the cache is used. As usual, you can declare how long the cache entries should stay active and so on in a simple ehcache.xml config file.

There is also an open source project that lets you simply decorate methods with the @Cacheable annotation, and it takes care of the rest: http://code.google.com/p/ehcache-spring-annotations/wiki/UsingCacheable
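
To make the store-then-reuse idea concrete, here is a minimal sketch in plain Java. It is not the Spring or Ehcache API: TimedMemoizer, MemoDemo, and slowSquare are hypothetical names, and the time-to-live argument only loosely mirrors the kind of expiry you would configure in ehcache.xml.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Hypothetical sketch of caching a method's results; not the Spring or
// Ehcache API. Entries expire after ttlMillis.
class TimedMemoizer<K, V> {
    private static final class Entry<T> {
        final T value;
        final long storedAt;

        Entry(T value, long storedAt) {
            this.value = value;
            this.storedAt = storedAt;
        }
    }

    private final Map<K, Entry<V>> cache = new ConcurrentHashMap<>();
    private final Function<K, V> target;
    private final long ttlMillis;

    TimedMemoizer(Function<K, V> target, long ttlMillis) {
        this.target = target;
        this.ttlMillis = ttlMillis;
    }

    // Returns the cached result while it is still fresh; otherwise calls
    // the wrapped function and stores the new result.
    V call(K key) {
        long now = System.currentTimeMillis();
        Entry<V> hit = cache.get(key);
        if (hit != null && now - hit.storedAt < ttlMillis) {
            return hit.value;
        }
        V value = target.apply(key);
        cache.put(key, new Entry<>(value, now));
        return value;
    }
}

class MemoDemo {
    static int calls = 0;

    // Stand-in for an expensive method; counts how often it really runs.
    static int slowSquare(int n) {
        calls++;
        return n * n;
    }

    public static void main(String[] args) {
        TimedMemoizer<Integer, Integer> memo =
                new TimedMemoizer<>(MemoDemo::slowSquare, 60_000);
        System.out.println(memo.call(12));
        System.out.println(memo.call(12));
        System.out.println("underlying calls: " + calls);
    }
}
```

The second call with the same argument is served from the map, so the underlying method runs only once. Two threads that miss at the same moment may both compute the value; a real cache library handles that, along with eviction and size limits.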

General

Office Setup

Posted on July 13, 2010 by devender

Oh, how I wish I had this at home (hint, hint).

General

# of lines of code in your project

Posted on July 12, 2010 by devender

As I wait for the build I wrote up this post. It has absolutely no point; it is just an observation. At present, the code base that I work with every day has:

1030467 lines of Java
641411 lines of XML
224530 lines of JSP
58950 lines of plain text
102751 lines in property files
2246 lines of Groovy
1353693 lines of SQL (schema files, DML, DDL, ...)

90 Projects
7186 Java files
2547 SQL files

How many does yours have?
It's easy, run this:
find . -name '*.java' | xargs wc -l | grep total | sed 's/total//g'

General

Favorite languages, why so great? and why not so much?

Posted on June 17, 2010 by devender

I actually have 2 favorite languages:

Ruby: for all scripting and making quick apps.
Clojure: for development.

Why Ruby is great

The language was designed for programmers to use; you can see that in the API, which is totally intuitive.
Lots of libraries. My favorite is Sinatra, which lets you build quick and dirty web apps, and the other is Sequel.
I wrote a blog post on how to delete RFC-822-incompatible emails (if you are a developer using Linux and your company uses Outlook, you know what I am talking about); this is a simple example of how I have used Ruby to make quick and dirty scripts.

I have used Ruby numerous times to write scripts that fix production data, correct files, and generate complex reports. I have used Sinatra with Google Charts to make web apps that show load times, server status, and so on.

Why Ruby is not so great

Not really meant for performance. In recent years there has been a push to develop a virtual machine for Ruby, but it is still nowhere close to C/Java performance.
Rails is a pain to deploy. Heroku takes away the pain, but what do you do if you have to deploy internally? I personally have 2 apps on Heroku, one of which is http://first3links.com/

Why Clojure is great

I have been on a quest to learn a functional programming language for the past 3 years. I have read the Erlang book (please see the various posts I wrote about Erlang here). Erlang is a fine language, but I lost interest in it after I could not find a single good library that can connect Erlang to Oracle. The problem: there are too few third-party libraries. The next language I looked at was Haskell, which has lots of libraries and seems to perform well on the surface; the problem I see is acceptance by businesses where most of the code is in Java. Then I found Clojure and fell in love with it.

It is just another DSL for the JVM; if you provide type hints, the generated code will be the same as what Java would produce (so you can easily sneak it in).
Totally embraces the JVM, unlike JRuby.
The author, Rich Hickey, has done a lot to reduce the pain points of Lisp.
Finally, a language that frees your mind of OOP. (Have you ever noticed how much time you spend trying to achieve the best object model when a simple one would do? And for what? The customers don't care as long as it works, and the computers sure don't care as long as it is 0s and 1s.)
Code is so concise and elegant.

Why Clojure is not so great

It has been called the language with the steepest learning curve on the JVM, and I tend to agree.
Unlike Scala, you have no wiggle room: it is either functional code or nothing (I actually like this feature).
Debugging is a major pain point (though there has been improvement with the latest swank-clojure).

I have written many posts on Clojure on my blog; you can see them here. In the most recent post I show how one can parse a one-million-record file in less than 15 seconds with Clojure.

General

Denormalizing One million records with Clojure.

Posted on June 16, 2010 by devender

MovieLens is a research project that provides datasets of various sizes and attributes, containing movie ratings. These datasets are free to download and use for non-commercial purposes. They have done an awesome job putting this data together and a big thanks goes to them for making it available.

I wanted to exercise my Clojure skills (more like add to my tiny set of Clojure skills 🙂 ), and it just so happens that I recently came across the MovieLens project, so how about analyzing that data using Clojure?

One of the datasets they make available is the One Million Dataset, which consists of 3 files:

“movies.dat”, containing 3883 movie listings with title, genre, …
“users.dat”, containing 6040 unique users with age, occupation, gender, …
“ratings.dat”, containing 1000209 movie ratings that reference the movie id and user id from the above 2 files.

I could analyze this data to answer questions such as: What age group gave the most ratings? What was the highest rated movie for a given time period?

But before I could do this, I wanted to denormalize the ratings file so that it also contains the user and movie information. Why? Because I don't want to look them up when I am analyzing the data; each record should be self-contained.

The outline of the program is quite simple:

Read the users file into memory
Read the movies file into memory
For each line in the ratings
- Find the corresponding movie and user
- Print it out to a file.

Take a minute to think about how you would do this in Java, and then look at the code below. I ran it on a Dell laptop (dual 2.2 GHz, 4 GB of RAM). Care to guess how long it takes? Scroll down for the answer.

(ns com.dev.file-reader
 (:use [clojure.contrib.duck-streams])
 (:import [java.io BufferedReader FileReader BufferedWriter FileWriter]))

;; Field layouts follow the "::"-separated columns of users.dat and movies.dat.
(defstruct user :id :gender :age :occupation :zip-code)
(defstruct movie :id :title :genres)

(defn format-user [user] (str (:id user) "::" (:gender user) "::" (:age user) "::" (:occupation user) "::" (:zip-code user)))

(defn format-movie [movie] (str (:id movie) "::" (:title movie) "::" (:genres movie)))

;; Build a map of user id (string) -> user struct.
(defn read-user-file [fileName]
 (loop [users {} fileSeq (read-lines fileName)]
   (let [line (first fileSeq)]
     (if (nil? line)
       users
       (let [tokens (.split line "::")
             id (aget tokens 0)
             user (struct user id (aget tokens 1) (aget tokens 2) (aget tokens 3) (aget tokens 4))]
         (recur (assoc users id user) (rest fileSeq)))))))

;; Build a map of movie id (string) -> movie struct.
(defn read-movies-file [fileName]
 (loop [movies {} fileSeq (read-lines fileName)]
   (let [line (first fileSeq)]
     (if (nil? line)
       movies
       (let [tokens (.split line "::")
             id (aget tokens 0)
             movie (struct movie (Integer/parseInt (aget tokens 0)) (aget tokens 1) (aget tokens 2))]
         (recur (assoc movies id movie) (rest fileSeq)))))))

(defn convert-ratings-file
 "read the ratings file and denormalize it"
 [moviesF usersF ratingsF outputF]
   (let [movies (read-movies-file moviesF) users (read-user-file usersF)]
     (with-open [#^BufferedReader rdr (BufferedReader. (FileReader. ratingsF) 1048576)
                 #^BufferedWriter wtr (BufferedWriter. (FileWriter. outputF) 1048576)]
       (doseq [line (line-seq rdr)]
         (let [tokens (.split line "::")
               user-id (aget tokens 0)
               movie-id (aget tokens 1)
               user (get users user-id)
               movie (get movies movie-id)
               rating (aget tokens 2)
               timestamp (aget tokens 3)]
           (.write wtr (str (format-user user) "::" (format-movie movie) "::" rating "::" timestamp "\n")))))))

(defn doIt []
 (time (convert-ratings-file
 "movielens-1m/movies.dat"
 "movielens-1m/users.dat"
 "movielens-1m/ratings.dat"
 "movielens-1m/output.dat"
 )))

So, ready with your guess?
I ran the program 5 times and here is the output:
"Elapsed time: 12130.035819 msecs"
"Elapsed time: 13113.92823 msecs"
"Elapsed time: 13364.234216 msecs"
"Elapsed time: 12553.478168 msecs"
"Elapsed time: 14488.706176 msecs"

On average, about 13.13 seconds to read in 1 million records, look up the movie and user for each record, and write the result back to disk.

Clojure puts the FUNctional back in programming.
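
For anyone who took up the invitation to think about this in Java, the same outline (load users, load movies, join each rating against both) might be sketched roughly as follows. This is an illustration, not a benchmark counterpart: the class name Denormalizer is hypothetical, it works on in-memory lines rather than the MovieLens files, and the sample records are made up in the "::"-separated format.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Rough Java counterpart of the outline above. It works on in-memory
// lines so it runs without the MovieLens files; sample data is made up.
class Denormalizer {
    // Index "id::rest..." lines by their leading id field.
    static Map<String, String> indexById(List<String> lines) {
        Map<String, String> byId = new HashMap<>();
        for (String line : lines) {
            byId.put(line.split("::", 2)[0], line);
        }
        return byId;
    }

    // Replace the user id and movie id in each rating with the full records.
    static List<String> denormalize(List<String> users,
                                    List<String> movies,
                                    List<String> ratings) {
        Map<String, String> userById = indexById(users);
        Map<String, String> movieById = indexById(movies);
        return ratings.stream().map(line -> {
            String[] t = line.split("::");
            return userById.get(t[0]) + "::" + movieById.get(t[1])
                    + "::" + t[2] + "::" + t[3];
        }).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Made-up sample records, one per file.
        List<String> users = List.of("1::F::25::7::00000");
        List<String> movies = List.of("10::Example Movie (2000)::Drama");
        List<String> ratings = List.of("1::10::4::1000000000");
        denormalize(users, movies, ratings).forEach(System.out::println);
    }
}
```

Each output line has the same shape as the Clojure version's: the user fields, then the movie fields, then the rating and timestamp.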

General

Redis and Clojure

Posted on June 13, 2010 by devender

Check out my previous post about Redis.

In this post I build a very simple example of using Redis with Clojure. I will be using a client library for Redis written in Clojure called redis-clojure. You could also use the Java library; to see the complete list of supported languages, go to this link.

So here we go.

Create a simple Clojure project (I personally use Leiningen). To create a new project, execute 'lein new com.dev/try-redis'; this will create an entire project structure.

Edit the project.clj file under the newly created project directory and add a new dependency for redis-clojure. The file should look close to this after you are done.

(defproject com.dev/try-redis "1.0.0-SNAPSHOT"
  :description "simple example of using redis"
  :dependencies [[org.clojure/clojure "1.1.0"]
                 [org.clojure/clojure-contrib "1.1.0"]
                 [redis-clojure "1.0.3-SNAPSHOT"]]
  :dev-dependencies [[swank-clojure "1.2.1"]])

Run 'lein deps' so that all the dependencies are downloaded.

Edit the file core.clj under the directory try-redis/src/com/dev/try_redis, and add the following.

(ns com.dev.try-redis.core
  (:require redis))

(defn test-redis []
     (redis/with-server {:host "127.0.0.1" :port 6379 :db 0}
       (do
         (redis/set "foo" "bar")
         (println (redis/get "foo")))))

On lines 7 and 8 we set a key/value pair and retrieve the value.

Start the Redis server: './redis-server redis.conf'
Now we are ready to execute the script. There are 2 ways to do this.
1. The easiest way is to go to your project root directory and run 'lein repl' (see the output below), which opens a read-eval-print loop. Once you have that, run '(load-file "src/com/dev/try_redis/core.clj")' to load the file, and then run '(com.dev.try-redis.core/test-redis)' to run the example.
2. I personally use emacs/slime, but for this option you need to have emacs and slime-clojure installed (see my emacs page). Run 'lein swank' in the project directory, and then connect to it from emacs using 'M-x slime-connect'. This will open up a REPL; do a C-c C-k to compile the file, and in the REPL you can execute the example using '(com.dev.try-redis.core/test-redis)'.

If everything has gone well, you should see this output.

Clojure 1.1.0
user=> (load-file "src/com/dev/try_redis/core.clj")
#'com.dev.try-redis.core/test-redis
user=> (com.dev.try-redis.core/test-redis)         
bar
nil
user=>

General

#IhatePerforce

Posted on June 11, 2010 (updated June 16, 2010) by devender

Either you get it or you don't. I checked Twitter to see if I was the only one; I guess not!

dleslie Wow, #Perforce manage to take down my entire session. Have I mentioned before how much I /love/ this software?

dleslie I hate it how that little red-X button in #Perforce /DOES NOTHING/. What’s the point of providing an interrupt function that doesn’t work?

jcsalterego @mojombo @luckiestmonkey perforce strikes again!!

SDN_Photography Gaaaaaaaaa! Perforce I hate you! #BloodyUselessCodeRepositoryTool!

jlgosse I don’t think Perforce could be any more complicated

gadorcha Fuck perforce. Kill it with fire

aidy_lewis Using #Perforce is like sticking your head in cow pooh.

art_s Wasting my life on manually merging stuff from Perforce repository with local copy. Creators of Perforce must burn in hell. Lots of hate.

Jing_Yi c’mon perforce! you fucking pig.

TimeDoctor Have I mentioned lately how much perforce

geek: For the love of all that’s decent and good, how do I remove the Perforce context menu item from the Windows shell. I hate when VCS’ do that.

Jing_Yi Perforce makes really shitty use of the network.

polarapfel #Intellij #Idea + #Perforce = #FAIL. Perforce integration in Idea killed my Idea 5 times so far today.

rfurlan Perforce kills my inner child. Polished like a Google product. Not a compliment.

rbm (Robert MacCloy) My loathing for Perforce grows daily.

polarapfel Giving up on #Perforce in #Intellij #Idea. Disabled Perforce in my project to get rid of Perforce background process hogging 100% CPU

rje (Ryan Evans) Ah Perforce, every day you find a new way to kick me in the nuts

feorlen I hate perforce, and I think the feeling is mutual.

UPDATE

unhappycoder Can someone tell me why Perforce exists? Ok Ok, maybe it was pimp back in the day but what’s your excuse now? too lazy to switch?

omgtbh export P4PORT=host:port – hey, that makes perfect sense. Very intuitive. Good job, @perforce.

OK, I feel better now. Back to hacking.

General

Redis

Posted on June 10, 2010 (updated June 12, 2010) by devender

Recently I have heard a lot about Redis, so I decided to try it out. But first, a little intro: “Redis is a database. To be specific, Redis is a database implementing a dictionary, where every key is associated with a value.”

Think of Redis as memcached on steroids: Redis extends the basic key/value store paradigm by letting values have a certain type and by defining operations that are specific to each type.

For example, Redis lets you associate a key with a list and then perform list-specific operations:
lpush mylist 1 -- adds 1 to mylist
lpush mylist 2
llen mylist -- returns the length of mylist

By the way if you just want to try out Redis without having to download and install, then check out this link.

Supported types are Lists, Sets, Sorted Sets and Strings. Below are some interesting operations (to see the entire list go to this link).

Adding elements to either the head or tail of a list.
Pop the first element atomically (very lispy).
Union two sets.
Sorted Sets are ordered by a score that you provide; Redis also uses the score to place new elements when they are inserted.
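
To get a feel for what these typed operations mean, here is an in-memory analogy using plain Java collections. It does not talk to a Redis server and is not a Redis client; the class RedisTypesAnalogy and its methods are hypothetical, and the comments only point at the Redis commands each step loosely resembles.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;
import java.util.stream.Collectors;

// In-memory analogy only: this does not talk to a Redis server.
class RedisTypesAnalogy {
    // List: push at the head or tail, then pop from the head.
    static List<String> listDemo() {
        Deque<String> mylist = new ArrayDeque<>();
        mylist.addFirst("1");   // like LPUSH mylist 1
        mylist.addFirst("2");   // like LPUSH mylist 2
        mylist.addLast("3");    // like RPUSH mylist 3
        mylist.pollFirst();     // like LPOP mylist, removes "2"
        return new ArrayList<>(mylist);
    }

    // Set: union of two sets (like SUNION).
    static Set<String> union(Set<String> a, Set<String> b) {
        Set<String> out = new TreeSet<>(a);
        out.addAll(b);
        return out;
    }

    // Sorted set: members come back ordered by a caller-supplied score
    // (like ZADD followed by ZRANGE).
    static List<String> byScore(Map<String, Double> scores) {
        return scores.entrySet().stream()
                .sorted(Map.Entry.comparingByValue())
                .map(Map.Entry::getKey)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(listDemo());
        System.out.println(union(Set.of("a", "b"), Set.of("b", "c")));
        System.out.println(byScore(Map.of("low", 1.0, "mid", 2.0, "high", 3.0)));
    }
}
```

The difference with Redis is that these operations run inside the server, so several clients can share the same list or set and each operation is applied atomically.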

Redis also persists data by writing to disk asynchronously; this way your entire application can use just Redis, without the need for a separate database.

Here is a simple example using Redis:

Download Redis: 'wget http://redis.googlecode.com/files/redis-1.2.6.tar.gz'
Extract Redis and run 'make' in the directory.
Run the Redis server: './redis-server redis.conf'
Now you can start playing with the server using redis-cli. Try the following:
./redis-cli set name devender
./redis-cli get name

Devender's Musings
