Packt Author Award 2009 – Vote for my book!

It’s a year to the month since I published my first book – Ruby on Rails Web Mashup Projects. I didn’t even realise it — time really does fly. Just got a mail from my publisher announcing that they’re having an award for the best author of the year. If you’re one of my readers, or if you enjoyed reading this blog please consider voting for my book.

Here’s the link – http://authors.packtpub.com/content/packt-author-award-2009. Just click on Vote Now! and select Ruby on Rails Web Mashup Projects in the list of books to vote for. Thanks!

Why it is difficult to accept the realities of a downturn

I bought a 17″ Sony Flatron monitor for $535 in 1999, one of my prized possessions that accompanied me during the few years which I hopped from rented room to rented room. It weighed almost 20kg and was a tremendous monster to lug around, maybe 20% of the total weight of my other stuff put together. I retired it (still working) when I got my Samsung 19″ LCD monitor ($220) about 2 years ago and my wife persuaded me to sell it off. However, I baulked when I was offered $50 by the karung guni man, after all, it was still working. So I refused to sell it.

The next time I mentioned it to a karung guni man, he didn’t even want to take it, and instead wanted to charge me $5 to take it off my hands. Naturally I didn’t pay.

All this brings me back to a couple of conversations I had, one with Winston and one with Simon. Both somehow drifted into talking about our lifestyles in the current state of the economy, even though we started on completely different topics, so perhaps this was already on my mind during our lunchtime chatter. Winston talked about how he struggled through his startup days to get to where he is today. Simon and I talked about how people upgraded their lifestyles over the few good years before the current crisis, and having done so, find it extremely hard to accept that they need to lower their expectations now. Stories of people who lost their jobs and cannot accept that their new jobs pay 20% to 25% less than their previous ones. People who had just bought their 42″ LCD TVs, high-end electronic gadgets, cars and even apartments, depended entirely on their salaries to pay the installments, and then lost their jobs. People with shiny, promising lives ahead of them during the prosperous times, now facing bleak prospects and burgeoning loans, unable to come to terms with the new realities of a recession.

Which led me to think — what makes it so difficult for us to accept the realities of an economic downturn?

My guess is a combination of these three hypotheses:

Endowment effect
This simply means that people place a higher value on objects they own than on objects they do not. An experiment conducted at Duke University by Ziv Carmon and Dan Ariely showed how surprisingly irrational, yet humanly plausible, this is.

Duke University has a small basketball stadium without enough seats for all its fans, so it designed an elaborate lottery system to allocate tickets instead. The experiment involved fans who had won tickets to an NCAA Final Four basketball tournament and fans who had not, 93 respondents in all. One day before the match, each respondent was asked to indicate a selling or buying price for a ticket. The buying-price question asked for the highest price he or she would pay for the ticket; the selling-price question asked for the lowest price he or she would accept to sell it. In addition, the would-be buyers were asked to think of other items or experiences that would be equivalent in value to the tickets. The test was to find out how much the buyers valued the experience compared to an equivalent experience, and how much the sellers valued losing that experience. In other words, it was an experiment to determine the existence of the endowment effect.

When the experiment ended, the 10% trimmed mean selling price was about $2,400 while the 10% trimmed mean buying price was about $170, a difference of a factor of 14! Rationally speaking, the would-be buyers and would-be sellers should value the experience the same way, since they were randomly picked fans; the only difference was that some of the respondents won the ticket lottery and others didn’t. The gap between the selling price and the buying price is hard to explain rationally, but at the same time it feels strangely familiar and human.
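A quick aside on the statistic: a 10% trimmed mean drops the lowest and highest 10% of the values before averaging, so a few extreme answers don’t skew the result. Here’s a small Ruby sketch with made-up numbers (not the study’s data):

```ruby
# 10% trimmed mean: sort, drop the lowest and highest 10% of values,
# then average what is left.
def trimmed_mean(values, trim = 0.10)
  sorted = values.sort
  k = (values.size * trim).floor # number of values to drop from each end
  kept = sorted[k...(sorted.size - k)]
  kept.sum.to_f / kept.size
end

prices = [100, 150, 160, 170, 175, 180, 190, 200, 210, 5000] # one wild outlier
puts trimmed_mean(prices) # 179.375; the 5000 outlier is dropped
```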

This is why this downturn is particularly bitter for us: we already own our new lifestyle, and ownership has given it a particularly high value. If we had never owned that brand new Honda Civic it wouldn’t have mattered so much, but now that we have it, losing it is much more painful, even though we lived happily without it in the past.

Loss aversion
This again should be familiar to most of us. It refers to the tendency for people to prefer avoiding losses over acquiring gains of similar value. For example, it is more painful to lose $1,000 than it is pleasurable to gain $1,000, so given a choice, people prefer the chance not to lose $1,000 over the chance to gain $1,000. In a 1981 paper, Amos Tversky and Daniel Kahneman described an experiment conducted at Stanford University and the University of British Columbia where 2 groups of students (152 and 155 respectively) were asked to answer this brief questionnaire in a classroom setting:

Imagine that the US is preparing for the outbreak of an unusual Asian disease, which is expected to kill 600 people. Two alternative programs to combat the disease have been proposed. Assume that the exact scientific estimates of the consequences of the programs are as follows:

If Program A is adopted, 200 people will be saved.

If Program B is adopted, there is a 1/3 probability that 600 people will be saved and a 2/3 probability that no one will be saved.

Which of the two programs would you favor?

As it turns out, 72% of the respondents chose Program A and 28% chose Program B. However, when the options were phrased like this:

If Program C is adopted 400 people will die.

If Program D is adopted, there is a 1/3 probability that nobody will die and 2/3 probability that 600 people will die.

Which of the two programs would you favor?

the reverse happened: 78% of the respondents chose Program D and 22% chose Program C! As you can see, Programs A and C are essentially the same, as are Programs B and D. The only difference was how the options were worded: Program A describes people being saved (a gain) while Program C describes people dying (a loss). Intuitively it seems right, but it is still pretty startling that mere wording can affect us this way.

Cognitive dissonance
Cognitive dissonance is one of the most influential and extensively studied theories in social psychology. It is the uncomfortable feeling of holding two contradictory ideas at the same time. The theory is that in such situations we are driven to change or justify our attitude or behavior. For example, if you believe that you are a careful driver and an accident happens, you will normally blame it on the other driver or on some external factor that doesn’t contradict your self-belief that you’re a careful driver.

In this economic downturn, the self-belief that we are worth this much (in terms of salary or other compensation) contradicts the reality that we have been offered a job that pays less. The cognitive dissonance then drives us to angrily reject the offer as unreasonable, to accuse the employer of taking advantage of the downturn, or to find some other justification that preserves that self-belief.

So?
So what can we learn from this? Loss aversion tells us that looking at things from a different perspective makes them less painful. It’s not losing a job; it’s being galvanized to try out the other stuff we’ve always been too busy for. What we learn from the endowment effect is that losing our stuff is painful, and that will not go away. However, we can make things less painful not only by spending only what we can afford, but by spending a few levels below what we can afford, to buffer against this pain. I can afford that Honda Civic with my current job and salary, but shouldn’t I buy a cheaper car that I can still afford even if I lose 50% of my salary? Finally, cognitive dissonance is part of human nature, and what we can’t change, we should adapt to. Recognizing that our self-worth is not measured by our salaries but by other things, such as our families and friends, and that salaries and jobs, like most things in the world, go through cycles, should give us better sanity in dealing with a rapidly changing economic crisis.

Knowing what makes us hurt so much in this downturn doesn’t really help us get back our jobs or salaries. But knowing that it is human to feel this way, and that everyone is feeling the same way, probably helps us face it better.

I still have that 17″ Sony Flatron monitor lying somewhere. One of these days I will need to overcome the endowment effect and dump it.

Write a Sinatra-based Twitter clone in 200 lines of Ruby code

After trying out Sinatra as the interface to my search engine, I got hooked on it. I liked the no-frills approach to developing a web application. I liked it so much that I decided to write a fuller web app using Sinatra. Of course I had no idea what to write, so I decided to clone a random popular application. That was how I ended up writing a Twitter clone.

It was inevitable that halfway through writing Chirp-Chirp, my Sinatra-based Twitter clone, I found out that there are already quite a few Twitter clones out there, 262 to be specific, at least as of August 2008.

The design goals for Chirp-Chirp are very simple. I want to build a reasonably feature complete Twitter clone in the simplest way possible. The code should be minimal and easily understood. The user interface should be usable without additional extensive instructions. Also, the deployment should be straightforward.

To begin with, let’s look at what a Twitter clone is. Fundamentally it’s all about telling people what you’re doing, with minimal text entry, in real time. The people are friends who ‘follow’ you (and are, naturally, called ‘followers’). The text that you type is short, which is why such apps are also known as micro-blogs.

Let’s look at the various components of Chirp-Chirp:

  • User interface
  • Login and user management
  • Data modeling
  • Home page
  • Friends and followers
  • Public and direct messages

User interface
Simply put, I just copied Twitter‘s user interface and slapped on a self-drawn logo.

Login and user management
I didn’t really want to write a login module, nor manage users who log in to Chirp-Chirp, but I certainly needed one. So my solution was to ‘borrow’ someone else’s login module and keep user management minimal. The first thing I looked into was OpenID. Using OpenID is relatively simple, but it still involves some code integration using an OpenID library.

Finally I settled on using RPX to handle the login, doing minimal integration to get single sign-on with Yahoo!, Google and Windows Live ID, 3 of the biggest players on the Internet scene. I figured almost everyone would have at least 1 of these 3 accounts. The advantage of using RPX is that very little integration is needed, and if the user is already logged into one of the 3 accounts, he doesn’t need to log in again. There is no user registration, password choosing, security handling and all that stuff. Just a simple click on the account the user wants to use, and he’s in.

User management is inevitable, since I need to keep track of who writes which messages. As the unique user ID, I use the email address returned by the provider. OpenID (which all 3 providers support) returns a unique identifier, but that is comparatively difficult to type and remember. Email is a good compromise as it is easily remembered and unique.

Data modeling
I used DataMapper again for data modeling. The data model consists of 3 classes — User, Relationship and Chirp. Simply put — a User has Relationships with other Users, and a User has many Chirps (corny but, hey :)). A Chirp with a specific recipient User is a direct message.
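models.rb defines these with DataMapper, but the shape of the model is easy to see in plain Ruby. The Struct-based wiring below is just my illustration, not the actual models:

```ruby
# Plain-Ruby sketch of the three models: a User has many Chirps,
# a Relationship links a follower to the user he follows, and
# a Chirp with a recipient is a direct message.
User         = Struct.new(:email, :nickname)
Relationship = Struct.new(:user, :follower) # follower follows user
Chirp        = Struct.new(:text, :user, :recipient) do
  def direct_message?
    !recipient.nil?
  end
end

alice = User.new('alice@example.com', 'alice')
bob   = User.new('bob@example.com', 'bob')
Relationship.new(alice, bob)                 # bob follows alice
chirp = Chirp.new('hello world', alice, nil)
dm    = Chirp.new('just for you', alice, bob)
puts chirp.direct_message? # false
puts dm.direct_message?    # true
```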

Home page
All chirps sent out by the user and his friends are listed in his home page in a reverse timeline order (i.e. the latest chirp is listed at the top). Direct messages received are only shown in the direct message inbox and the direct messages sent are only shown in the sent box.

Friends and followers
The concept of friends and followers is a one way relationship. A friend is someone you follow, while a follower is someone following you. A user can be a friend or a follower or both. Following does not require the permission of the user. For example, you can follow anyone in the system without the explicit approval of that user.

Public messages (chirps) and direct messages
A chirp is a public message sent by a user, viewable by the user and all his followers. Chirps belong to a user. User IDs (emails) prefixed with ‘@’ in a chirp are hyperlinked to that user’s home page. 2 other features involving the message are direct messaging and following.

You can send direct messages to specific users, who will receive them in their direct message inbox. To send a direct message, start the chirp with ‘dm’, followed by the user ID. To follow a user, start the chirp with ‘follow’, followed by the user ID.

With this in mind let’s jump into the code. Here’s the stuff that I used:

Without looking at the view templates, the total number of lines of code is around 200 or so. I have only 2 non-view-template files: chirp.rb is the web application, while models.rb contains the 3 DataMapper data models. As we go along I’ll be extracting code fragments from these 2 files to explain in detail.

Let’s talk about chirp.rb first. The first line after the requires tells Sinatra that I want to use sessions. This is followed by telling Sinatra to use the Rack::Flash library to use session-based flash. What you see next are a bunch of get and post blocks. Let’s look at the first one.

['/', '/home'].each do |path|
  get path do
    if session[:userid].nil?
      erb :login
    else
      redirect "/#{User.get(session[:userid]).email}"
    end
  end
end

You will notice that I wrap the get block within an iteration over an array of 2 strings. These 2 strings are the routes associated with the get block; in the code above, we associate the ‘/’ and ‘/home’ routes with it. The get block checks if the current session contains a user ID. If it doesn’t, it renders the login view template. Otherwise it gets the user with this ID and redirects to the route containing the user’s email. For example, if the user’s email is user@yahoo.com, then the route is ‘/user@yahoo.com’.

We’re going to look at the login module next. As mentioned, I used RPX (specifically RPX Basic, which is free!) to do single sign-on, so to begin, go to RPXNow and register a web site. Once you do that, you will get an API key, and that is really the only thing you need. For Windows Live ID, you’ll need a few more steps to register a Windows Live project, but you can just follow the steps provided.

Now let’s look at the login template to understand more about the RPX single sign-on mechanism. You will notice that it is entirely HTML, and it contains 3 forms. The URLs for the single sign-on for Google and Yahoo! are OpenID based (not surprising, as RPX is run by JanRain, one of the main supporters of OpenID). Each form has only 1 input, which is an image button. The user clicks on the account he wants to log in with and, following the given instructions, is authenticated by the provider and redirected back to our web app. If the login is correct, our web app receives some data on the user (the amount of data we get differs between providers, and is also configurable from RPX). I won’t go into the actual mechanism; the RPX documentation is pretty detailed.

Once RPX authenticates the user, it will redirect him to our web app along with a token. For Chirp-Chirp this goes back to ‘/login’. The ‘/login’ block then retrieves this token and uses it to retrieve the user information from RPX.

get '/login' do
  openid_user = get_user(params[:token])
  user = User.find(openid_user[:identifier])
  user.update_attributes({:nickname => openid_user[:nickname],
                          :email => openid_user[:email],
                          :photo_url => "http://www.gravatar.com/avatar/#{Digest::MD5.hexdigest(openid_user[:email])}"}) if user.new_record?
  session[:userid] = user.id # keep what is stored small
  redirect "/#{user.email}"
end

For Chirp-Chirp, I store the unique identifier from RPX, the user’s nickname and his email. For his avatar picture, I use Gravatar, which neatly also uses the email address as a unique identifier, although we need to hash it with MD5 first. If this is the first time the user is logging in, I save the user information in the database. Finally, I store the user’s serial ID (not the unique identifier) in the session and redirect the user to his home page.
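The Gravatar URL is nothing more than an MD5 hex digest of the email address appended to a fixed prefix. Here it is in isolation (the email address is made up; note that Gravatar expects the address to be trimmed and lowercased before hashing):

```ruby
require 'digest/md5'

# Gravatar identifies an avatar by the MD5 hash of the trimmed,
# lowercased email address.
def gravatar_url(email)
  "http://www.gravatar.com/avatar/#{Digest::MD5.hexdigest(email.strip.downcase)}"
end

puts gravatar_url('User@Yahoo.com ') # same URL as for 'user@yahoo.com'
```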

This is the code fragment that retrieves the user information from RPX.

def get_user(token)
  u = URI.parse('https://rpxnow.com/api/v2/auth_info')
  req = Net::HTTP::Post.new(u.path)
  req.set_form_data({'token' => token, 'apiKey' => '<insert RPX API KEY HERE>',
                     'format' => 'json', 'extended' => 'true'})
  http = Net::HTTP.new(u.host, u.port)
  http.use_ssl = true if u.scheme == 'https'
  json = JSON.parse(http.request(req).body)

  if json['stat'] == 'ok'
    identifier = json['profile']['identifier']
    nickname = json['profile']['preferredUsername']
    nickname = json['profile']['displayName'] if nickname.nil?
    email = json['profile']['email']
    {:identifier => identifier, :nickname => nickname, :email => email}
  else
    raise LoginFailedError, 'Cannot log in. Try another account!'
  end
end

Let’s look at the home page next. I take the example of a user with the email user@yahoo.com, so the home page for him would be ‘/user@yahoo.com’. This is the code fragment with the get block for the home page:

get '/:email' do
  @myself = User.get(session[:userid])
  @user = @myself.email == params[:email] ? @myself : User.first(:email => params[:email])
  @dm_count = dm_count
  erb :home
end

The code is pretty self-explanatory. The home page retrieves the currently logged-in user (@myself), and if the requested user (@user) is not the currently logged-in user, it retrieves that user from the database as well. It also retrieves a count of the user’s direct messages, for display purposes.

def dm_count
  Chirp.count(:recipient_id => session[:userid]) +
    Chirp.count(:user_id => session[:userid], :recipient_id.not => nil)
end

The number of direct messages is the sum of the direct messages sent and received.

Let’s look at the friending and following features.

get '/follow/:email' do
  Relationship.create(:user => User.first(:email => params[:email]),
                      :follower => User.get(session[:userid]))
  redirect '/home'
end

This get block is called when you want to follow a particular user. I just create a relationship between you and the person you want to follow. Deleting that relationship is also very simple.

delete '/follows/:user_id/:follows_id' do
  Relationship.first(:follower_id => params[:user_id],
                     :user_id => params[:follows_id]).destroy
  redirect '/follows'
end

Note that this is a delete block, not a get block. Finally, to display the friends and followers, I have 2 separate routes, but I reuse the same code and view template:

['/follows', '/followers'].each do |path|
  get path do
    @myself = User.get(session[:userid])
    @dm_count = dm_count
    erb :follows
  end
end

Next let’s look at the chirps and direct messages. To chirp, I use a post block:

post '/chirp' do
  user = User.get(session[:userid])
  Chirp.create(:text => params[:chirp], :user => user)
  redirect "/#{user.email}"
end

However, to implement features like URL shortening and the command-level following and direct messaging, I placed the processing logic in the Chirp class itself.

before :save do
  case
  when starts_with?('dm ')
    process_dm
  when starts_with?('follow ')
    process_follow
  else
    process
  end
end

Before the chirp is saved, I check if the text starts with ‘dm’ or ‘follow’ and process the text accordingly. Let’s look at these processing blocks of code.

def process_follow
  Relationship.create(:user => User.first(:email => self.text.split[1]),
                      :follower => self.user)
  throw :halt # don't save the chirp
end

Implementing the ‘follow’ command in the chirp box is probably the easiest. I just create the relationship and then throw :halt to stop the chirp from being saved.

def process_dm
  self.recipient = User.first(:email => self.text.split[1])
  self.text = self.text.split[2..-1].join(' ') # remove the first 2 words
  process
end

Implementing direct messages is also relatively simple. I just set the recipient of the message to the person indicated in the chirp box, and when saving the chirp, I remove the command words. Finally I pass it on to the common processing block.
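The word-splitting in process_dm is easier to see on a plain string; this is just the same split/join manipulation outside the class:

```ruby
# 'dm <user ID> <message>': the first word is the command, the second
# is the recipient, and everything after that is the message body.
text  = 'dm bob@example.com see you at 8'
words = text.split
recipient_id = words[1]               # "bob@example.com"
body         = words[2..-1].join(' ') # "see you at 8"
puts recipient_id
puts body
```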

def process
  # process URLs
  urls = self.text.scan(URL_REGEXP)
  urls.each { |url|
    tiny_url = open("http://tinyurl.com/api-create.php?url=#{url[0]}") { |s| s.read }
    self.text.sub!(url[0], "<a href='#{tiny_url}'>#{tiny_url}</a>")
  }
  # process @s
  ats = self.text.scan(AT_REGEXP)
  ats.each { |at| self.text.sub!(at, "<a href='/#{at[2,at.length]}'>#{at}</a>") }
end

URL_REGEXP = Regexp.new('\b ((https?|telnet|gopher|file|wais|ftp) : [\w/#~:.?+=&%@!\-] +?) (?=[.:?\-] * (?: [^\w/#~:.?+=&%@!\-]| $ ))', Regexp::EXTENDED)
AT_REGEXP = Regexp.new('\s@[\w.@_-]+', Regexp::EXTENDED)

The common processing block scans the chirp text and does a few things:

  • It finds the URLs in the text and replaces each with a shortened URL from TinyURL.
  • It finds all user IDs prefixed with ‘@’ and replaces each with a link to that user’s home page.
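The @-linking can be tried on its own. The one-liner below uses a simplified pattern of my own rather than the AT_REGEXP above, but the substitution idea is the same:

```ruby
# Replace each @-prefixed user ID with a link to that user's home page.
# (Simplified pattern for illustration, not the AT_REGEXP from the post.)
def link_mentions(text)
  text.gsub(/@([\w.@_-]+)/) { "<a href='/#{$1}'>@#{$1}</a>" }
end

puts link_mentions('lunch with @bob@example.com at noon')
```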

As for viewing direct messages, I again reuse the same view template, but for the sake of variety, instead of having 2 routes, I used 1 route with a parameter:

get '/direct_messages/:dir' do
@myself = User.get(session[:userid])
case params[:dir]
when 'received' then @chirps = Chirp.all(:recipient_id => @myself.id)
when 'sent'     then @chirps = Chirp.all(:user_id => @myself.id, :recipient_id.not => nil)
end
@dm_count = dm_count
erb :direct_messages
end

In the case above, when I’m looking at the received direct messages (i.e. the inbox), I just find all chirps whose recipient is myself. For the direct messages I sent, I find all chirps that belong to me whose recipient is not null.

That’s it! Of course there are the view templates, but those are mainly HTML with some ERB snippets embedded. You can check out the entire repository at git://github.com/sausheong/chirp.git.

To run the app, get the code. Then go to the models.rb file and change the database URL accordingly. After that, go to irb, require the models file, and run this:

> DataMapper.auto_migrate!

This will create the necessary database tables. Then, in the app’s directory, run this:

> ruby chirp.rb

Go to http://localhost:4567 and you’re up and running locally! Of course, you need to register your own RPX account and then change login.erb accordingly.

To view this in a live site, go to http://chirp-chirp.heroku.com. This is deployed to Heroku, which is another amazing service, but that’s another story for another post.

Naive Bayesian Classifiers and Ruby

I first learnt about probability when I was in secondary school. As with all the other topics in Maths, it was just another bunch of formulas to memorize and regurgitate for exam questions. Although I was curious whether it had any use beyond calculating gambling odds, I didn’t manage to find one. As with many things in my life, it popped up in an unexpected place: I stumbled on it again when I started on machine learning and naive Bayesian classifiers.

A classifier is exactly that: something that classifies other things. A classifier is a function that takes in a set of data and tells us which category, or classification, the data belongs to. A naive Bayesian classifier is a type of learning classifier, meaning that you can continually train it with more data and it will get better at its job. It’s called Bayesian because it uses Bayes’ Law, a mathematical theorem about the conditional probabilities of events, to determine how to classify the data. The classifier is called ‘naive’ because it assumes each event (in this case, each word in the data) to be totally unrelated to the others. That’s a very simplistic view, but in practice it has proven to be surprisingly accurate. Also, because it’s relatively simple to implement, it’s quite popular. Its better-known uses include email spam filters.

So what’s Bayes’ Law and how can it be used to categorize data? As mentioned, Bayes’ Law describes conditional probabilities. An example of a conditional probability is the probability of an event A happening given that another event B has happened. This is usually written as Pr(A | B), read as ‘the probability of A, given B’. To classify a document, we ask: given a particular text document, what’s the probability that it belongs to this category? When we have found the probabilities of the given document for all categories, the classifier picks the category with the highest probability and announces it as the winner; that is, the document most probably belongs to that category.

The question then follows: how do we get the probability of a document belonging to a category? This is where we turn to Bayes’ Law, which states that:

Pr(A | B) = Pr(B | A) × Pr(A) / Pr(B)

Given our usage, what we want is:

Pr(category | document) = Pr(document | category) × Pr(category) / Pr(document)

What we need, then, is Pr(document | category) and Pr(category). Keep in mind that we’re comparing relative probabilities here, so we can drop Pr(document), because it is the same for every category.

What is Pr(document | category) and how do we find it? It is the probability that this document exists, given a particular category. As a document is made up of a bunch of words, what we need is the probability of the words of the document occurring within the category. Here is where the ‘naive’ part comes in. We know that the words in a document are not random: a word like ‘Ruby’ is more likely to be found in an article on the Ruby programming language than in, say, an article on dental practices in Uganda. However, for the sake of simplicity, the naive Bayesian classifier treats each word as independent of the others.

Remember your probability lessons: if the probabilities of the words are independent of each other, the probability of a whole bunch of words appearing together is the product of the probability of each word in the bunch. A quick aside to illustrate this.

Take a pair of dice and roll them one after another. The probability of the first die landing on any one of its 6 sides is 1 out of 6, that is, 1/6. The probability of the second die landing on any one of its 6 sides is also 1/6. So what is the probability that both dice land on 6? Out of the 6 x 6 = 36 possible ways a pair of dice can land, there is only 1 way that both dice land on 6, so the probability is 1/36, which is 1/6 x 1/6. This is true only if the dice rolls are independent of each other. In the same way, we ‘naively’ assume that the words in the document occur independently of each other, as if it were written by the proverbial monkey at a typewriter.
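You can verify the 1/36 by brute force, counting the outcomes directly:

```ruby
# Enumerate all 36 equally likely outcomes of rolling two dice and
# count how many are a pair of sixes.
outcomes   = (1..6).to_a.product((1..6).to_a)
double_six = outcomes.count { |a, b| a == 6 && b == 6 }
puts "#{double_six} out of #{outcomes.size}" # 1 out of 36
```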

In other words, the probability that a document exists, given a category, is the product of the probabilities of each word in that document. Now that we’ve established this, how do we get the probability of a single word? Basically, it’s the number of times the word appeared in the category during training, compared to the total word count in that category. Another quick illustration: say we train the classifier with 2 categories (spam and not-spam), and there are 100 word counts in the spam category. There may be only 14 unique words in this category, but some of these words have been trained more than once. Out of these 100 word counts, 5 are for the word ‘money’. The probability for the word ‘money’ is then the number of times it is mentioned in the spam category (5) divided by the number of word counts in the category (100).

Now that we know Pr(document | category), let’s look at Pr(category). This is simply the probability of any document being in this category (rather than in another category). It is the number of documents used to train this category over the total number of documents used to train all categories.
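Both counts translate directly into one-line calculations. Using the made-up numbers from the spam example above (5 occurrences of ‘money’ out of 100 word counts, and, say, 4 spam documents out of 9 trained in total):

```ruby
# Pr(word|category): the word's trained count in the category divided
# by the total word counts in that category.
def pr_word_given_category(word_counts, word, total_words_in_category)
  word_counts.fetch(word, 0).to_f / total_words_in_category
end

# Pr(category): documents trained for the category divided by the
# total number of documents trained for all categories.
def pr_category(category_documents, total_documents)
  category_documents.to_f / total_documents
end

puts pr_word_given_category({ 'money' => 5 }, 'money', 100) # 0.05
puts pr_category(4, 9) # roughly 0.444
```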

So that’s the basic idea behind naive Bayesian classifiers. With that, I’m going to show you how to write a simple classifier in Ruby. There is already a rather popular Ruby implementation by Lucas Carlson called the Classifier gem (http://classifier.rubyforge.org) which you can use readily, but let’s write our own classifier instead. We’ll be creating a class named NaiveBayes, in a file called naive_bayes.rb. This classifier will be used to classify text into different categories. Let’s recap how this classifier will be used:

  1. First, tell the classifier how many categories there will be
  2. Next, train the classifier with a number of documents, indicating which categories those documents belong to
  3. Finally, pass the classifier a document and it should tell us which category it thinks the document belongs to

Now let’s run through the public methods of the NaiveBayes class, which map to the 3 actions above:

  1. Provide the categories you want to classify the data into
  2. Train the classifier by feeding it data
  3. Doing the real work, that is to classify given data

The first method we’ll roll into the constructor of the class, so that the categories are set when we create the object. The second method, train, takes in a category and a document (a text string) to train the classifier. The last method, classify, takes in just a document (a text string) and returns its category.


class NaiveBayes
  def initialize(*categories)
    @words = Hash.new
    @total_words = 0
    @categories_documents = Hash.new
    @total_documents = 0
    @categories_words = Hash.new
    @threshold = 1.5

    categories.each { |category|
      @words[category] = Hash.new
      @categories_documents[category] = 0
      @categories_words[category] = 0
    }
  end

  def train(category, document)
    word_count(document).each do |word, count|
      @words[category][word] ||= 0
      @words[category][word] += count
      @total_words += count
      @categories_words[category] += count
    end
    @categories_documents[category] += 1
    @total_documents += 1
  end

  def classify(document, default = 'unknown')
    sorted = probabilities(document).sort { |a, b| a[1] <=> b[1] }
    best, second_best = sorted.pop, sorted.pop
    return best[0] if (best[1] / second_best[1] > @threshold)
    return default
  end
end

Let’s look at the initializer first. We’ll need the following instance variables:
1. @words is a hash containing the words trained for the classifier and their counts per category. It looks something like this:

{
  "spam"     => { "money" => 10, "quick" => 12, "rich" => 15 },
  "not_spam" => { "report" => 9, "database" => 13, "salaries" => 12 }
}

“Spam” and “not_spam” are the categories, while “money”, “quick” etc. are words in the “spam” category, with the numbers indicating the number of times each has been trained as that particular category.

2. @total_words contains the number of words trained

3. @categories_documents is a hash containing the number of documents trained for each category:

{ "spam" => 4, "not_spam" => 5}

4. @total_documents is the total number of documents trained

5. @categories_words is a hash containing the number of words trained for each category:

{ "spam" => 37, "not_spam" => 34}

6. @threshold is something I will come back to in the last section of the code walkthrough (it won’t make much sense just yet).

Next is the train method, which takes in a category and a document. We break the document down into its words and slot the counts into the instance variables we created earlier. Here we use a private helper method called word_count to do the grunt work.
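Before looking at word_count itself, here is a small standalone sketch of the bookkeeping a single train call performs. This is not part of the classifier — it uses plain local variables in place of the instance variables, and a simplified word count with no stemming or stop-word removal:

```ruby
# Standalone sketch of train's bookkeeping (no stemming, no stop words).
words = Hash.new { |h, k| h[k] = Hash.new(0) }
categories_words = Hash.new(0)
categories_documents = Hash.new(0)
total_words = 0
total_documents = 0

category = "spam"
document = "make money quick money"

# A simplified word_count: just count each word.
counts = Hash.new(0)
document.split.each { |w| counts[w.downcase] += 1 }

counts.each do |word, count|
  words[category][word] += count
  categories_words[category] += count
  total_words += count
end
categories_documents[category] += 1
total_documents += 1

puts words.inspect            # {"spam"=>{"make"=>1, "money"=>2, "quick"=>1}}
puts categories_words.inspect # {"spam"=>4}
```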

def word_count(document)
  words = document.gsub(/[^\w\s]/, "").split
  d = Hash.new
  words.each do |word|
    word.downcase!
    key = word.stem
    unless COMMON_WORDS.include?(word)
      d[key] ||= 0
      d[key] += 1
    end
  end
  return d
end

COMMON_WORDS = ['a','able','about','above','abroad' ...] # this is truncated

The code is quite straightforward: we’re just breaking down a text string into its constituent words. We want to focus on words that characterize the document, so we’re not interested in words such as pronouns, conjunctions, articles, and so on. Dropping those common words brings up the nouns, characteristic adjectives and some verbs. Also, to reduce the number of words, we use a technique called ‘stemming’, which reduces any word to its ‘stem’ or root word. For example, the words ‘fishing’, ‘fisher’, ‘fished’ and ‘fishy’ all reduce to the root word ‘fish’. In our method here we use a popular stemming algorithm called the Porter stemming algorithm. To use it, install the following gem:

gem install stemmer
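If you’d like to see the idea without installing the gem, here is a toy suffix-stripping sketch. It is only an illustration of what stemming does: the real Porter algorithm applies a much more careful set of rules.

```ruby
# Toy suffix stripping: a crude stand-in for the Porter stemmer.
def toy_stem(word)
  word.sub(/(ing|er|ed|y)$/, '')
end

%w(fishing fisher fished fishy).each { |w| puts "#{w} => #{toy_stem(w)}" }
# all four reduce to "fish"
```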

Now let’s look at the classify method. This is the method that uses Bayes’ Law to classify documents. We will break it down into a few helper methods to illustrate how Bayes’ Law is used. Remember that ultimately we’re looking at the probability of a given document being in any of the categories, so we need a method that returns a hash of categories with their respective probabilities, like this:

{ "spam" => 0.123, "not_spam" => 0.327 }

def probabilities(document)
  probabilities = Hash.new
  @words.each_key { |category|
    probabilities[category] = probability(category, document)
  }
  return probabilities
end

In the probabilities method, we calculate the probability of the document being in each category. As mentioned above, that probability is Pr(document|category) * Pr(category). We create a helper method called probability that simply multiplies the document probability Pr(document|category) by the category probability Pr(category).

def probability(category, document)
  doc_probability(category, document) * category_probability(category)
end

First let’s tackle Pr(document|category). To do that we need to get all the words in the given document, get the probability of each word in that category, and multiply them all together.

def doc_probability(category, document)
  doc_prob = 1
  word_count(document).each { |word| doc_prob *= word_probability(category, word[0]) }
  return doc_prob
end

Next, we want to get the probability of a word. The probability of a word in a category is the number of times it occurred in that category, divided by the total number of words in that category. However, if the word never occurred during training (which happens pretty frequently if you don’t have much training data), its probability is a big fat 0. If we propagate this upwards, the whole document probability becomes 0, and therefore the probability of that document being in that category becomes 0 as well. This, of course, is not the desired result. To correct it, we need to tweak the formula a bit.

To make sure that there is at least some probability for the word even if it isn’t in the trained list, we pretend the word was seen at least once in the training data so that the result is never 0. So this means that instead of:

Pr(word|category) = count(word in category) / count(all words in category)

we take:

Pr(word|category) = (count(word in category) + 1) / count(all words in category)

So the code is something like this:

def word_probability(category, word)
  (@words[category][word.stem].to_f + 1) / @categories_words[category].to_f
end
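A quick numeric sketch of why the +1 matters, using made-up counts in the shape of the hashes shown earlier:

```ruby
# Made-up counts in the shape of @words and @categories_words.
words = { "spam" => { "money" => 10, "quick" => 12, "rich" => 15 } }
categories_words = { "spam" => 37 }

# Without the tweak, an unseen word like "report" gets probability 0,
# which would zero out the whole document probability.
unsmoothed = words["spam"]["report"].to_f / categories_words["spam"]

# With the +1, it gets a small non-zero probability instead.
smoothed = (words["spam"]["report"].to_f + 1) / categories_words["spam"]

puts unsmoothed  # 0.0
puts smoothed    # ~0.027 (1/37)
```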

Finally we want to get Pr(category), which is pretty straightforward. It’s just the probability of any random document being in this category, so we take the number of documents trained in the category and divide it by the total number of documents trained in the classifier.

def category_probability(category)
  @categories_documents[category].to_f / @total_documents.to_f
end
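Plugging in the example @categories_documents hash from earlier (made-up numbers, just to see the arithmetic):

```ruby
# 4 spam documents and 5 non-spam documents were trained.
categories_documents = { "spam" => 4, "not_spam" => 5 }
total_documents = categories_documents.values.inject(:+)  # 9

pr_spam = categories_documents["spam"].to_f / total_documents
puts pr_spam  # ~0.444
```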

Now that we have the probabilities, let’s go back to the classify method and take a look at it again:

def classify(document, default = 'unknown')
  sorted = probabilities(document).sort { |a, b| a[1] <=> b[1] }
  best, second_best = sorted.pop, sorted.pop
  return best[0] if best[1] / second_best[1] > @threshold
  return default
end

We sort the probabilities to bubble up the category with the largest probability. However, if we used this directly, the winner would always be whichever category has the largest probability, even if the runner-up is only slightly smaller. For example, take the spam and non-spam categories and say the probabilities work out to 53% for spam and 47% for non-spam. Should the document be classified as spam? Logically, no! This is the reason for the threshold variable, which sets a minimum ratio between the best and the second-best probability. In the code above the value is 1.5, meaning the best probability needs to be 1.5 times the second best, i.e. a split of at least 60% to 40%. If that’s not the case, the classifier will just shrug and say it doesn’t know (returning ‘default’ as the category). You can tweak this number depending on the type of categories you are using.
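Here is that check in isolation, using the 53%/47% example with made-up probability values:

```ruby
# The threshold check on its own, with made-up probabilities.
threshold = 1.5
probs = { "spam" => 0.53, "not_spam" => 0.47 }

sorted = probs.sort { |a, b| a[1] <=> b[1] }
best, second_best = sorted.pop, sorted.pop

ratio = best[1] / second_best[1]             # ~1.13, below the 1.5 threshold
result = ratio > threshold ? best[0] : "unknown"
puts result                                  # "unknown"
```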

Now that we have the classifier, let’s take it out for a test run. I’m going to use a set of Yahoo news RSS feeds to train the classifier according to the various categories, then use some random text I get from some other sites and ask the classifier to classify them.

require 'rubygems'
require 'rss/1.0'
require 'rss/2.0'
require 'open-uri'
require 'hpricot'
require 'naive_bayes'
require 'pp'

categories = %w(tech sports business entertainment)
classifier = NaiveBayes.new(*categories)

content = ''
categories.each { |category|
  feed = "http://rss.news.yahoo.com/rss/#{category}"
  open(feed) { |s| content = s.read }
  rss = RSS::Parser.parse(content, false)
  rss.items.each { |item|
    text = Hpricot(item.description).inner_text
    classifier.train(category, text)
  }
}

# classify this
documents = [
  "Google said on Monday it was releasing a beta version of Google Sync for the iPhone and Windows Mobile phones",
  "Rangers waste 5 power plays in 3-0 loss to Devils",
  "Going well beyond its current Windows Mobile software, Microsoft will try to extend its desktop dominance with a Windows phone.",
  "UBS cuts jobs after Q4 loss",
  "A fight in Hancock Park after a pre-Grammy Awards party left the R&B singer with bruises and a scratched face, police say."]

documents.each { |text|
  puts text
  # show each category's share of the total probability
  probs = classifier.probabilities(text)
  total = probs.values.inject(:+)
  pp probs.inject({}) { |h, (cat, p)| h.update(cat => "#{p / total * 100}%") }
  puts "category => #{classifier.classify(text)}"
  puts
}

This is the output:

Google said on Monday it was releasing a beta version of Google Sync for the iPhone and Windows Mobile phones
{"tech"=>"62.4965904628081%",
 "business"=>"6.51256988628851%",
 "entertainment"=>"16.8319433691552%",
 "sports"=>"14.1588962817482%"}
category => tech

Rangers waste 5 power plays in 3-0 loss to Devils
{"tech"=>"14.7517595939031%",
 "business"=>"3.51842781998617%",
 "entertainment"=>"8.09457974962582%",
 "sports"=>"73.6352328364849%"}
category => sports

Going well beyond its current Windows Mobile software, Microsoft will try to extend its desktop dominance with a Windows phone.
{"tech"=>"91.678065974899%",
 "business"=>"0.851666657161468%",
 "entertainment"=>"4.15349223570253%",
 "sports"=>"3.31677513223704%"}
category => tech

UBS cuts jobs after Q4 loss
{"business"=>"33.1048977545484%",
 "tech"=>"14.1719403525702%",
 "entertainment"=>"31.9818519810561%",
 "sports"=>"20.7413099118253%"}
category => unknown

A fight in Hancock Park after a pre-Grammy Awards party left the R&B singer with bruises and a scratched face, police say.
{"tech"=>"4.10704270254326%",
 "business"=>"1.4959136651331%",
 "entertainment"=>"78.1802587499558%",
 "sports"=>"16.2167848823678%"}
category => entertainment

You can download the code described above from GitHub at http://github.com/sausheong/naive-bayes, including the naive_bayes.rb and bayes_test.rb files. For more information, you should pick up the excellent book Programming Collective Intelligence by Toby Segaran.

Leaving Welcome

Then there’s the matter of leaving Welcome. Well, it’s more or less public now. Francois announced it at the corporate meeting two days ago, together with the new organizational changes, and the news spread fast. People who had left Welcome got to know about it and started to ping me through instant messaging, both to ask why and to congratulate me on my career move. I even have recruiters asking me to keep in touch in case I need to recruit in my new job. More than one person has expressed total shock that I’m leaving, so much so that I suspect they think I’m part of the furniture at Welcome!

It’s a strange feeling to be in transition, in limbo between a current job and a new one. What makes it even stranger is that my current team of 50+ people has been sliced and diced and scattered to the four winds, forming new departments with new heads or pushed into existing teams with new structures. People are jostling over the new organization, wanting to take charge of this or that team, having meetings, forming ‘alliances’, or simply adjusting to life without me around. It’s like being operated on while supposedly under anesthetic, except you’re wide awake. Nominally I’m still ‘in charge’, but the feeling is that I’ve been put on a pedestal. It is not only sad to leave the team that I’ve built up over the years, but also pretty devastating to see it dissected and served up to different people.

Still, life goes on, projects to be delivered, customers to be serviced, products to be launched. Welcome is Welcome. Work still to be done while I’m here, whatever I can contribute before my final farewell.