How to hire the best engineers – part 2

In my previous blog post, I came to the conclusion that

If you’re a hiring manager with 100 candidates, just interview a sample pool of 7 candidates, choose the best, and then interview the rest one at a time until you reach one that is better than the best in the sample pool. You will have a 90% chance of choosing someone in the top 25% of those 100 candidates.

I mentioned that there are two worst-case scenarios for this hiring strategy. Firstly, if we coincidentally selected the 7 worst candidates out of 100, then the next candidate we interview will be hired regardless of how good or bad he or she is. Secondly, if the best candidate is already in the sample pool, then we’ll go through the rest of the population without ever finding anyone better. Let’s talk about these scenarios now.

The first is easily resolved. Its probability is already accounted for in our simulation, so we don’t need to worry about it any more. The second is a bit trickier. To find the failure probability, we tweak the simulation again. This time round, instead of going through every sample pool size, we focus on the best result, i.e. a sample pool size of 7.

require 'rubygems'
require 'faster_csv'

population_size = 100
iteration_size = 100000
size = 7
frequencies = []

iteration_size.times do
  # create the population and randomize it
  population = (0..population_size-1).to_a.sort_by { rand }
  # the sample pool and the remaining candidates
  sample = population.first(size)
  rest_of_population = population[size..-1]
  best_sample = sample.max
  # record how many interviews it takes to find someone better
  rest_of_population.each_with_index do |p, i|
    if p > best_sample
      frequencies << i
      break
    end
  end
end

puts "Success probability : #{frequencies.size.to_f*100.0/iteration_size.to_f}%"

Notice that we’re essentially counting the number of times we succeed over all the iterations. The result is a success probability of 93%!

Let’s look at some basic laws of probability now. Let A be the event that the strategy works (the sample pool does not contain the best candidate), so the probability of A is P(A), and this value is 0.93. Let B be the event that the candidate we find with this strategy is within the best quartile (regardless of whether A happens), so the probability of B is P(B). We don’t know the value of P(B), and we can’t easily find it since A and B are not independent. What we really want is P(A ∩ B), the probability that both events happen. What we do know, however, is P(B|A), the probability of B given that A happens. This value is 0.9063 (from the previous post).

Using the rule of multiplication:

P(A ∩ B) = P(A) × P(B|A)

We get the answer:

P(A ∩ B) = 0.93 × 0.9063 ≈ 0.843

which is an 84.3% probability that the sample pool does not contain the best candidate and that we end up with someone in the top quartile of the total candidate population.
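As a quick sanity check, the multiplication can be run in irb (the two inputs are the simulated values from above):

```ruby
p_a = 0.93            # P(A): the sample pool does not contain the best candidate
p_b_given_a = 0.9063  # P(B|A): top-quartile hit, given the strategy works
puts p_a * p_b_given_a  # ≈ 0.843, i.e. 84.3%
```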

Next, we want to find out: given that we have a high probability of getting the right candidate, how many candidates do we have to interview before we find one who is better than the best in the sample pool? Naturally, if we have to interview a lot of candidates before finding one, the strategy isn’t practical. Let’s tweak the simulation again, this time counting the successful runs and building a histogram of the number of interviews before we hit the jackpot.

What kind of histogram do you think we will get? My initial thought was a normal distribution, which would not be great news for this strategy, because a normal distribution peaks around the mean, meaning the number of candidates we would typically need to interview would be 40-something.

require 'rubygems'
require 'faster_csv'

population_size = 100
iteration_size = 100000
size = 7
frequencies = []

iteration_size.times do
  population = (0..population_size-1).to_a.sort_by { rand }
  sample = population.first(size)
  rest_of_population = population[size..-1]
  best_sample = sample.max
  rest_of_population.each_with_index do |p, i|
    if p > best_sample
      frequencies << i
      break
    end
  end
end'optimal_duration.csv', 'w') do |csv|
  rest_of_population_size = population_size - size
  rest_of_population_size.times do |i|
    count_array = frequencies.find_all { |f| f == i }
    csv << [i+1, count_array.size.to_f/frequencies.size.to_f]
  end
end
The output is a CSV file called optimal_duration.csv. Charting the data as a histogram, we surprisingly get a power-law-like graph:

This is good news for our strategy! Looking at our dataset, the probability that the very first candidate after the sample pool is the one is 13.58%, while the probability of getting our man (or woman) within the first 10 candidates we interview is a whopping 63.25%!
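The power-law shape can also be checked analytically. As a sketch (this derivation is my addition, not part of the original simulation): the candidate at position s+i is picked exactly when the best of the first s+i-1 candidates sits inside the sample pool of size s, and candidate s+i beats that best, which happens with probability s/((s+i-1)(s+i)):

```ruby
# probability that the i-th candidate after a sample pool of size s
# is the one the strategy picks (unconditional on the strategy succeeding)
def pick_at(s, i)
  s.to_f / ((s + i - 1) * (s + i))
end

s = 7
p_success = 0.93  # simulated success probability from above
first_ten = (1..10).inject(0.0) { |sum, i| sum + pick_at(s, i) }
# conditioned on the strategy succeeding at all
puts first_ten / p_success  # ≈ 0.6325, matching the simulated 63.25%
```

The first term alone, pick_at(7, 1)/0.93 ≈ 0.134, is also close to the simulated 13.58%.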

So have we settled the question of whether the strategy is a good one? No, there is still one last question unanswered, and it is the deal breaker. Stay tuned for part 3! (An exercise for those reading this — what do you think the breaker question is?)

How to hire the best engineers without killing yourself

Garena, where I work now, is in expansion mode and I have been hiring engineers, sysadmins and so on to feed the development frenzy for platform revamps and product roadmaps. A problem I face when hiring engineers is that we’re not the only company doing so. This is especially true now that many companies have issued their annual bonuses (or lack thereof) and the ranks of the dissatisfied have joined the swelling exodus out of many tech companies. In other words, it’s mass musical chairs between tech companies and engineers.

Needless to say, this makes hiring a challenge. The good news is that there are plenty of candidates. The bad news is that it is hard to secure a correctly skilled engineer with the right mindset for a growing startup. At the same time, identification and confirmation need to be swift, because if you are slow, you can lose even a loosely-fitting candidate within a day or two.

This caused me to wonder — what is the best way to go through a large list of candidates and successfully pick the best engineer, or at least someone in the top percentile of the list?

In Why Flip a Coin: The Art and Science of Good Decisions, H.W. Lewis wrote about a similar (though stricter) problem involving dating. Instead of choosing candidates, the book talked about choosing a wife, and instead of conducting interviews, the problem in the book involved dating. The difference is that in the book you can only date one person at a time, while in my situation I can obviously interview more than one candidate. Nonetheless the problems are pretty much the same, since if I interview too many candidates and take too long to decide, they will be snapped up by other companies. Not to mention that I will probably foam at the mouth and die from interview overdose before that.

In the book, Lewis suggested this strategy — say we are choosing from a pool of 20 candidates. Instead of interviewing each and every one of them, we randomly choose and interview 4 candidates and pick the best out of that sample pool of 4. Now, armed with the best candidate from the sample pool, we go through the rest of the candidates one by one until we hit one who is better than him, then hire that candidate.

As you would have guessed, this strategy is probabilistic and doesn’t guarantee the best candidate. In fact, there are two worst-case scenarios. First, if we happen to choose the worst 4 candidates of the lot as the sample pool and the first candidate we interview outside the sample pool is the 5th worst, then we would have hired the 5th worst candidate. Not good. Conversely, if the best candidate is in the sample pool, then we run the risk of doing all 20 interviews and then losing the best candidate because it took too long. Bad again.

So is this a good strategy? Also, what is the best population pool (total number of candidates) and sample pool we want in order to maximize this strategy? Let’s be a good engineer and do another Monte Carlo simulation to find out.

Let’s start with a population pool of 20 candidates, then iterate through sample pool sizes of 0 to 19. For each sample pool size, we find the probability that the candidate we choose is the best candidate in the population. We actually already know the probability when the sample pool is 0 or 19. When the sample pool is 0, we choose the first candidate we interview (since there is no one to compare against!), so the probability is 1/20, or 5%. Similarly, with a sample pool of 19, we have to choose the last candidate, and that probability is also 1/20, or 5%.

Here’s the Ruby code to simulate this. We run it through 100,000 iterations to make the probability estimates as accurate as practical, then save the results into a CSV file called optimal.csv.

require 'rubygems'
require 'faster_csv'

population_size = 20
sample_size = 0..population_size-1
iteration_size = 100000'optimal.csv', 'w') do |csv|
  sample_size.each do |size|
    is_best_choice_count = 0
    iteration_size.times do
      # create the population and randomize it
      population = (0..population_size-1).to_a.sort_by { rand }
      # get the sample pool
      sample = population.first(size)
      rest_of_population = population[size..-1]
      # this is the best of the sample pool (nil if the pool is empty)
      best_sample = sample.max
      # find the candidate chosen by this strategy
      best_next = rest_of_population.find { |i| best_sample.nil? || i > best_sample }
      best_population = population.max
      # is this the best choice? count how many times it is the best
      is_best_choice_count += 1 if best_next == best_population
    end
    best_probability = is_best_choice_count.to_f/iteration_size.to_f
    csv << [size, best_probability]
  end
end
The code is quite self-explanatory (especially with the in-code comments) so I won’t go into details. The results are in the line chart below, after opening the CSV in Excel and charting it. As you can see, if you choose 4 candidates as the sample pool, you have roughly a 1 in 3 chance of choosing the best candidate. The best odds come from a sample pool of 7 candidates, which gives around a 38.5% probability of choosing the best candidate. Doesn’t look good.
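This agrees with the classic ‘secretary problem’ result, where the best cutoff is about n/e candidates and the success probability about 1/e ≈ 37%. As a cross-check on the simulation, here is a sketch of the standard exact formula (my addition, not part of the original simulation):

```ruby
# exact probability of picking the overall best with a sample pool of s
# out of n candidates: (s/n) * sum over k = s+1..n of 1/(k-1)
def best_probability(n, s)
  return 1.0 / n if s == 0
  (s.to_f / n) * ((s + 1)..n).inject(0.0) { |sum, k| sum + 1.0 / (k - 1) }
end

puts best_probability(20, 7)  # ≈ 0.384, close to the simulated 38.5%
```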

But to be honest, for some positions I don’t really need the ‘best’ candidate (such evaluations are subjective anyway). Let’s say I want a candidate in the top quartile (top 25%). What are my odds then?

Here’s the revised code that does this simulation.

require 'rubygems'
require 'faster_csv'

population_size = 20
sample_size = 0..population_size-1
iteration_size = 100000
top = (population_size-5)..(population_size-1)'optimal.csv', 'w') do |csv|
  sample_size.each do |size|
    is_best_choice_count = 0
    is_top_choice_count = 0
    iteration_size.times do
      population = (0..population_size-1).to_a.sort_by { rand }
      sample = population.first(size)
      rest_of_population = population[size..-1]
      best_sample = sample.max
      best_next = rest_of_population.find { |i| best_sample.nil? || i > best_sample }
      best_population = population.max
      top_population = population.sort[top]
      is_best_choice_count += 1 if best_next == best_population
      is_top_choice_count += 1 if top_population.include? best_next
    end
    best_probability = is_best_choice_count.to_f/iteration_size.to_f
    top_probability = is_top_choice_count.to_f/iteration_size.to_f
    csv << [size, best_probability, top_probability]
  end
end
The optimal.csv file now has a new column, showing the probability of choosing a top-quartile (top 5) candidate. The new line chart is shown below, with the results of the previous simulation as a comparison.

Things look brighter now: the optimal sample pool size is 4 (though for practical purposes 3 is good enough, since the difference between 3 and 4 is small) and the probability of choosing a top-quartile candidate shoots up to 72.7%. Pretty good!

Now this is with 20 candidates. How about a large candidate pool? How will this strategy stand up in say a population pool of 100 candidates?

As you can see, this strategy doesn’t work for getting the very best out of a large pool (the required sample pool is too large and the probability of success too low), and it does worse than in a smaller population pool. However, if we want the top quartile or so (that is, if we are less picky), we only need a sample pool of 7 candidates for a 90.63% probability of getting what we want. These are amazing odds!

This means if you’re a hiring manager with 100 candidates, you don’t need to kill yourself trying to interview everyone. Just interview a sample pool of 7 candidates, choose the best, and then interview the rest one at a time until you reach one who is better than the best in the sample pool. You will have a 90% chance of choosing someone in the top 25% of those 100 candidates (which is probably what you want anyway)!

Umbrellas to work, or why engineers should get married

When I was working at Yahoo at Suntec City, I took the MRT to work every day, and my route was completely sheltered right to the doorstep of the office, so the weather didn’t really have much effect on me. I still take the MRT now that I am at Garena, but there is a short distance from the MRT station to the office, about 5 minutes of walking, that is subject to the elements.

Now, I don’t carry an umbrella with me, so with the recent change in weather (it has been raining cats and dogs and all sorts of mammals in between) I have been caught a few times in some unpleasant wetness. It was during one of these wet commutes from home to the office that I started thinking of ways to make sure it would not happen again.

The idea was this: I would place some umbrellas at the office and some at home. Whenever I leave the house or the office and it is raining, I take an umbrella to the office or back home respectively. Naturally, if it’s not raining, I wouldn’t want to look like a dork and carry one.

Therein lies the problem. Say I keep 1 umbrella at the office and 1 at home. If it rains when I’m leaving the office, I carry that one home, leaving 2 at home and none at the office. If it rains the next day when I leave home, that’s fine; I just take an umbrella to the office, making it 1–1 again. But if it doesn’t rain, and I therefore don’t take an umbrella to the office, I’m left at 2–0, and if it then happens to rain when I’m leaving the office, I’m stuck with getting wet again.

What if I have 2 umbrellas at the office and 2 at home? In this case I get caught wet only if the same occurrence (rain when leaving the office followed by no rain when leaving the house) happens 2 days in a row, which is less probable than it happening just once. Still, it can happen, and I want assurance that the risk of getting wet is really minimal. The question then becomes: how many umbrellas should I keep at the office, and similarly at home, such that the probability of getting wet is negligible? Note that it doesn’t really matter if one place has more umbrellas than the other, since in the long run it amounts to the lower of the two numbers.

To find out, I used a popular (and really pragmatic, and therefore very ‘engineering’) method of finding probabilities — the Monte Carlo simulation. The name sounds fancy but it’s really just an algorithm that uses repeated random sampling to find the answer. In our case, we want to find, over a large sample, the average number of trips I can make before I run the risk of getting wet.

To do this, I need the probability that it will rain on any given day. For ease of calculation, I use probabilities between 1% and 99%. Why not include 0% or 100%? Because at 0% it never rains, so I never need an umbrella, and at 100% it always rains, so I always bring an umbrella back and forth.

On top of that, I run the simulation for the cases of 2 umbrellas at home and at the office, up to 50 umbrellas at home and at the office. To get a large enough representation, I run this over 1,000 iterations (more would be better but would take too long to run). This is the Ruby code for the simulation.

require 'rubygems'
require 'faster_csv'

umbrellas_range = 2..50
probability_range = 1..99
@location = :home

total_trips = {}
iterations = 1000

def walks_to(loc)
  @location = loc
  @num_of_trips += 1
end

def office?() @location == :office; end
def home?() @location == :home; end
def raining?(probability) rand(100) < probability; end'umbrellas.csv', 'w') do |csv|
  csv << [''] + probability_range.to_a
  umbrellas_range.each do |umbrellas|
    probability_range.each do |probability|
      total_trips[probability] = 0
      iterations.times do
        @num_of_trips = 0
        @location = :home  # start each run from home
        wet = false
        home = umbrellas
        office = umbrellas

        while not wet
          if home?
            walks_to :office
            if raining?(probability)
              if home > 0
                home -= 1
                office += 1
              else
                wet = true
              end
            end
          elsif office?
            walks_to :home
            if raining?(probability)
              if office > 0
                office -= 1
                home += 1
              else
                wet = true
              end
            end
          end
        end
        total_trips[probability] += @num_of_trips
      end
    end
    row = [umbrellas]
    total_trips.sort.each { |pair| row << pair[1]/iterations }
    csv << row
  end
end
The Ruby program writes a simple CSV file called umbrellas.csv, with the first row being the probabilities and the first column the number of umbrellas at each location. If you chart each row with the y-axis being the number of trips and the x-axis the probability of rain from 1% to 99%, you get a chart like this:

Now that I have the average number of trips I can make before I get wet, I want to see what that means for Singapore. To do this, I went to the National Environment Agency’s website and dug out the weather statistics for Singapore. There is a statistic giving the average number of rainy days in each month over the past 118 years. I take that for each month and divide it by the number of days in the month to derive the monthly probability of rain, as in the following table:

Month Probability of rain
January 48%
February 39%
March 45%
April 50%
May 45%
June 43%
July 42%
August 45%
September 47%
October 52%
November 63%
December 61%
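The conversion just described can be sketched in a couple of lines of Ruby (the 19 rainy days below is an illustrative figure back-derived from the 63% in the table, not the actual NEA number):

```ruby
# monthly probability of rain = average number of rainy days / days in month
def rain_probability(rainy_days, days_in_month)
  (rainy_days.to_f / days_in_month * 100).round
end

# e.g. an average of 19 rainy days in a 30-day November works out to 63%
puts rain_probability(19, 30)  # => 63
```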

Finally, I match that with the Monte Carlo simulation from earlier. Each month has between 56 and 62 trips (each trip is one way, so every day counts as 2 trips). For example, the statistics show that the wettest months are November and December, with 63% and 61% probability of rain respectively. This means I need 38 umbrellas at each location for each of these 2 months (check your umbrellas.csv file; open it with Excel). In February, where the probability of rain is the lowest (39%), I only need 23 umbrellas at each location to almost guarantee that I will not get wet (all of this assumes that at the end of each month I replenish each location with the necessary number of umbrellas, of course).

That evening when I proudly told my wife my findings, she stared at me for a while (frostily, if I might add), then gave me a small folding umbrella, which I now carry in my laptop bag.

And that is the reason why engineers should get married.

(I have been inspired by this book — Digital Dice : Computational Solutions to Practical Probability Problems, by Paul J. Nahin)

Clone TinyURL with 40 lines of Ruby code on Google AppEngine for Java

Amidst the wretched events at work recently, I forgot about an interesting development in running apps in the cloud: Google AppEngine finally released Java support on the AppEngine platform. For the uninitiated, AppEngine is Google’s cloud computing platform that allows developers to serve up applications on Google’s infrastructure. When it was first released in April 2008, the only language supported was Python. Python is a great language but doesn’t appeal to my inner Rubyist, so it didn’t catch my attention. Until now, that is.

While Java is no longer my language of choice nowadays, Ruby actually runs pretty well under JRuby with Java. And with the addition of the Java support for AppEngine, it became a lot more interesting. A few weeks back I wrote Snip, a TinyURL clone, in about 40 lines of Ruby code, and deployed it on Heroku. It seems like a good idea to take Snip out for a spin on the Google AppEngine for Java (GAE/J).

The first thing you need to do is create an application on GAE/J. Start by logging in to the AppEngine admin console with a Google account. After logging in, create a new application following the instructions on screen. When you’re done you should have an application id; in this case, it is ‘saush-snip’. We will need this application id in our configuration later. You will also need to enable Java for your account. At this point in time GAE/J is still in beta, and Google is limiting Java access to the first 10,000 developers who sign up. Unfortunately, if you don’t get Java enabled for your account, you won’t be able to try this out until it is fully released and available to all.

Once you have signed up and gotten an email approval for your GAE/J account, the first thing to do is install JRuby, if you haven’t already. Even if you have installed it previously, you should install at least 1.3.0RC1, since some fixes were added in this release to make it run better under GAE/J. Run this:

$ git clone git://

This will clone a copy of JRuby into your computer. Then go into the jruby folder that was just created and run this:

$ sudo ant && sudo ant jar-complete

This will install JRuby and create the jruby-complete.jar library that you will need in a while. Take note of the path to JRuby; you’ll need it in the subsequent commands. Assuming you installed JRuby in ~/jruby, do a quick check to see if the install is ok:

$ ~/jruby/bin/jruby -v

If you installed version 1.3.0RC1 you should see something like this:

jruby 1.3.0RC1 (ruby 1.8.6p287) (2009-05-11 6586) (Java HotSpot(TM) 64-Bit Server VM 1.6.0_07) [x86_64-java]

After installing JRuby, you’ll need to install all the gems needed for this application. Remember that even if you have installed gems for your normal Ruby installation, you’ll need to install them all over again for JRuby. For Snip, you need Sinatra and HAML, but you’ll also need Rake and Warbler, the JRuby war file packager.

$ ~/jruby/bin/jruby -S gem install rake sinatra haml warbler

Now that the basic JRuby and related gems are in place, let’s look at the Snip code itself. One thing that is pretty obvious upfront when dealing with AppEngine is that it doesn’t have a relational database for persistence. Instead of a familiar RDBMS, we get a JDO interface or a DataStore API. How do we use it? As it turns out, we don’t need to do anything major. Ola Bini wrote a small wrapper around the DataStore, called Bumble, that lets us write data models much as we would with DataMapper. Well, almost.

Using Bumble is very similar to using DataMapper or ActiveRecord, so I didn’t have to change my code much. This is the DataMapper version of the Url model:

DataMapper.setup(:default, ENV['DATABASE_URL'] || 'mysql://root:root@localhost/snip')
class Url
  include DataMapper::Resource
  property  :id,          Serial
  property  :original,    String, :length => 255
  property  :created_at,  DateTime
  def snipped() self.id.to_s(36) end
end

And this is the Bumble version of the Url model:

class Url
  include Bumble
  ds :original
  def snipped() self.key.to_s end
end

I didn’t add the timestamp to the Bumble version because I don’t really use it, but as you can see there are quite a few similarities. I didn’t need to put in my own running serial id because it’s managed by AppEngine. Also, instead of using the object id, I used the object’s key, which again is managed by AppEngine. A key is a unique identifier of an entity across all apps belonging to the user; it is created automatically by Bumble through the low-level DataStore Java APIs. Besides this, using the Url class is slightly different too. Instead of:

@url = Url.first(:original => uri.to_s)

We use:

@url = Url.find(:original => uri.to_s)

Finally, because we use the key instead of the id, we don’t need the base 36 conversion any more; we let AppEngine handle everything. Instead of:

get '/:snipped' do redirect Url[params[:snipped].to_i(36)].original end

We use:

get '/:snipped' do redirect Url.get(params[:snipped]).original end

This is the full source code:

%w(rubygems sinatra bumble uri).each { |lib| require lib }

get '/' do haml :index end

post '/' do
  uri = URI::parse(params[:original])
  raise "Invalid URL" unless uri.kind_of? URI::HTTP or uri.kind_of? URI::HTTPS
  @url = Url.find(:original => uri.to_s)
  @url = Url.create(:original => uri.to_s) if @url.nil?
  haml :index
end

get '/:snipped' do redirect Url.get(params[:snipped]).original end

error do haml :index end

class Url
  include Bumble
  ds :original
  def snipped() self.key.to_s end
end

@@ layout
!!! 1.1
%html
  %head
    %title Snip! on Google AppEngine
    %link{:rel => 'stylesheet', :href => '', :type => 'text/css'}
  %body
    = yield

@@ index
%h1.title Snip! on Google AppEngine
- unless @url.nil?
  %code= @url.original
  snipped to
  %a{:href => env['HTTP_REFERER'] + @url.snipped}
    = env['HTTP_REFERER'] + @url.snipped
#err.warning= env['sinatra.error']
%form{:method => 'post', :action => '/'}
  Snip this:
  %input{:type => 'text', :name => 'original', :size => '50'}
  %input{:type => 'submit', :value => 'snip!'}
%small copyright &copy;
%a{:href => ''}
  Chang Sau Sheong
%a{:href => ''}
  Full source code

The code is ready, but now comes the packaging. GAE/J is a Java servlet environment, which means our app needs to be packaged into a war file (Java Web ARchive). Fortunately, instead of building the war by hand, we can use Warbler, the JRuby war packager. Before running Warbler we need a couple of things. Firstly, we need to generate the Warbler configuration file:

$ mkdir config
$ ~/jruby/bin/jruby -S warble config

We create a directory called config and get Warbler to copy a default configuration file into it. Replace the contents with this minimal setup. If you want to explore more, read warble.rb itself. do |config|
  config.dirs = %w(lib public views)
  config.includes = FileList["appengine-web.xml", "snip.rb", "", "bumble.rb"]
  config.gems = ['sinatra', 'haml']
  config.gem_dependencies = true
  config.war_name = "saush-snip"
  config.webxml.booter = :rack
  config.webxml.jruby.init.serial = true
  config.java_libs.reject! { |lib| lib =~ /jruby-complete/ }
end
Note that we don’t really need the public and views directories for Snip, because everything is in a single file. The 2 other configuration files we need are appengine-web.xml and We also include snip.rb and bumble.rb in the war file for deployment. To get bumble.rb, go to Ola Bini’s Bumble GitHub repository and get the file in the sub-folder (not the main one). The last line tells Warbler not to include the jruby-complete.jar library in the lib folder; I’ll explain why in a minute. Also note that the war file name is the application id of the application we created in the GAE admin console earlier (saush-snip).

Next, create a lib folder. Go to the GAE/J download site and download the GAE/J Java library; it should be called something like appengine-api-1.0-sdk-1.2.0.jar. Copy it into the lib folder you just created. We also need the JRuby Java libraries in the lib folder. Normally for a JRuby deployment Warbler would package jruby-complete.jar in, but Google has a 1,000-file limit, which Ola Bini kindly pointed out. He also provided a script to split the JRuby library into 2 files. You can find the script here, and when you run it, it should split jruby-complete.jar into 2 files named jruby-core-1.3.0RC1.jar and jruby-stdlib-1.3.0RC1.jar. You will also need JRuby-Rack, but it’s included with Warbler and, as you will see later, Warbler copies it into the war when you run it. JRuby-Rack is an adapter for the Java servlet environment that allows Sinatra (or any Rack-based application) to run.

The next piece is appengine-web.xml. I used Ola Bini’s version as the base:

<appengine-web-app xmlns="">
    <application>saush-snip</application>
    <static-files />
    <resource-files />
    <system-properties>
      <property name="" value="false" />
      <property name="os.arch" value="" />
      <property name="jruby.compile.mode" value="JIT"/> <!-- JIT|FORCE|OFF -->
      <property name="jruby.compile.fastest" value="true"/>
      <property name="jruby.compile.frameless" value="true"/>
      <property name="jruby.compile.positionless" value="true"/>
      <property name="jruby.compile.threadless" value="false"/>
      <property name="jruby.compile.fastops" value="false"/>
      <property name="jruby.compile.fastcase" value="false"/>
      <property name="jruby.compile.chainsize" value="500"/>
      <property name="jruby.compile.lazyHandles" value="false"/>
      <property name="jruby.compile.peephole" value="true"/>
      <property name="jruby.rack.logging" value="stdout"/>
    </system-properties>
</appengine-web-app>
The last property line sets logging to STDOUT, which is very useful for debugging; if you don’t set this, you might not see any console output. Again, note that we set the application id we got earlier (saush-snip).

Finally we need a Rackup file to start the whole thing:

%w(rubygems sinatra snip).each  { |lib| require lib}
root_dir = File.dirname(__FILE__)
set :environment, :production
set :root, root_dir
set :app_file, File.join(root_dir, 'snip.rb')
disable :run
run Sinatra::Application

You can of course also find all of this in the Snip-AppEngine repository at git:// However, it is so much more fun to do it step by step, right?

Now that we have all the pieces let’s package our files for deployment. First we need to generate the exploded war file:

$ ~/jruby/bin/jruby -S warble

You should see output like this:

mkdir -p tmp/war/WEB-INF/gems/specifications
cp /Users/saush/jruby/lib/ruby/gems/1.8/specifications/sinatra- tmp/war/WEB-INF/gems/specifications/sinatra-
mkdir -p tmp/war/WEB-INF/gems/gems
JRuby limited openssl loaded. gem install jruby-openssl for full support.
cp /Users/saush/jruby/lib/ruby/gems/1.8/specifications/rack-0.9.1.gemspec tmp/war/WEB-INF/gems/specifications/rack-0.9.1.gemspec
cp /Users/saush/jruby/lib/ruby/gems/1.8/specifications/haml-2.0.9.gemspec tmp/war/WEB-INF/gems/specifications/haml-2.0.9.gemspec
mkdir -p tmp/war/WEB-INF/lib
mkdir -p tmp/war/WEB-INF/public
mkdir -p tmp/war/WEB-INF/views
cp lib/appengine-api-1.0-sdk-1.2.0.jar tmp/war/WEB-INF/lib/appengine-api-1.0-sdk-1.2.0.jar
cp lib/jruby-core-1.3.0RC1.jar tmp/war/WEB-INF/lib/jruby-core-1.3.0RC1.jar
cp lib/jruby-stdlib-1.3.0RC1.jar tmp/war/WEB-INF/lib/jruby-stdlib-1.3.0RC1.jar
cp appengine-web.xml tmp/war/WEB-INF/appengine-web.xml
cp snip.rb tmp/war/WEB-INF/snip.rb
cp tmp/war/WEB-INF/
cp bumble.rb tmp/war/WEB-INF/bumble.rb
cp /Users/saush/.gem/jruby/1.8/gems/warbler-0.9.13/lib/jruby-rack-0.9.4.jar tmp/war/WEB-INF/lib/jruby-rack-0.9.4.jar
cp /Users/saush/.gem/jruby/1.8/gems/warbler-0.9.13/lib/jruby-rack-0.9.4.jar tmp/war/WEB-INF/lib/jruby-rack-0.9.4.jar
mkdir -p tmp/war/WEB-INF
jar cf saush-snip.war  -C tmp/war .

You will also get a saush-snip.war file and a bunch of files under the tmp folder, which is really just the war file exploded. We won’t need the war file (saush-snip.war) itself for deployment, only the tmp directory. Before deploying, we need to make a minor adjustment to the tmp/war/WEB-INF/gems/gems/sinatra- file. Somehow the line ‘use_in_file_templates!’ causes an error when deploying to GAE/J, so comment it out.

That’s it! We are ready for the deployment. To deploy run this command:

$ ~/appengine-java-sdk-1.2.0/bin/ --email= --passin update tmp/war/

You should see output like this:

Reading application configuration data...
2009-05-15 19:51:38.916::INFO:  Logging to STDERR via org.mortbay.log.StdErrLog
Beginning server interaction for saush-snip...
Password for :
0% Creating staging directory
5% Scanning for jsp files.
20% Scanning files on local disk.
25% Scanned 250 files.
28% Initiating update.
31% Cloning 340 application files.
33% Cloned 100 files.
34% Cloned 200 files.
35% Cloned 300 files.
40% Uploading 0 files.
90% Deploying new version.
95% Will check again in 1 seconds
98% Will check again in 2 seconds
99% Closing update: new version is ready to start serving.
99% Uploading index definitions.
Update complete.
Cleaning up temporary files...

Now go to and you should be able to see the new deployment of Snip on the Google AppEngine for Java.

A few other tutorials on the Internet also describe how to deploy Sinatra or Rails-based apps on GAE/J, amongst which Samuel Goebert’s tutorial and Ola Bini’s stand out the most.

A few thoughts on comparing Heroku and GAE/J, since I’ve now deployed on both. Heroku is definitely the easier platform to deploy to, with just a few simple steps compared to the hoops I had to jump through for GAE/J. Heroku also has the arguably more familiar persistence mechanism, as it uses a conventional RDBMS (PostgreSQL), compared to Google’s DataStore, which today has only Ola Bini’s Bumble implementation versus the established ActiveRecord, DataMapper and Sequel ORMs. In addition, Google’s implementation has many limitations, some harder to understand than others, which force applications to suit the platform rather than having the platform really serve the application.

On the other hand, Google’s set of tools in its console is really impressive, with usage graphs, data viewers and log viewers. Google AppEngine also has memcached, a URL fetcher, integration with Google accounts, mail sending and a host of other helper services. Google’s brand name also adds some weight when you throw in more mission-critical projects, though Heroku, backed by Amazon’s EC2 services, is no push-over. As for costs, until large deployments emerge it is hard to say which is the better deal; it also depends highly on usage patterns.

So are they evenly balanced? Not for me, at least. For now I would still favor Heroku over GAE/J because Heroku is friendlier to Ruby developers. But who knows? These are exciting times for Rubyists.

Packt Author Award 2009 – Vote for my book!

It’s a year to the month since I published my first book – Ruby on Rails Web Mashup Projects. I didn’t even realise it — time really does fly. I just got a mail from my publisher announcing that they’re having an award for the best author of the year. If you’re one of my readers, or if you enjoyed reading this blog, please consider voting for my book.

Here’s the link – Just click on Vote Now! and select Ruby on Rails Web Mashup Projects in the list of books to vote for. Thanks!

Third party user authentication with Ruby in just a few lines of code

Some time ago (8 years to be exact), when I was still chest-deep in Java and working at elipva, I wrote Tapestry, a centralized user management system, and also an article in JavaWorld describing it. That particular project is long gone (it got merged into elipva’s Zephyr product and disappeared as a standalone system) but the problems I wrote about are really quite universal.

To summarize, the main problem that Tapestry tried to solve was that software applications serving more than 1 user at a time usually require their users to sign in. To do that, most of the time, we create a user management module. Within this user management module we need to provide these 2 major services:

  • Authentication – verifying that users are who they claim to be
  • Access control – determining what an authenticated user is allowed to do

As you can imagine, authentication is usually simpler than access control. Tapestry’s approach was to externalize these services into a configurable standalone system that provides both services. While I never really completed Tapestry, there are many third-party providers of these services in the market now. In this blog post I will talk about authentication and explain some ways to easily integrate your Ruby application with these third-party authentication providers.

But before that, some background. In the open era, 2 different open technologies have surfaced to solve the issues of authentication and access control separately. OpenID is a popular digital identity used for authentication, originally developed for LiveJournal in 2005. It’s supported by many large companies including Yahoo, Google, PayPal, IBM, Microsoft and so on. As of Nov 2008, there were over 500 million OpenIDs on the Internet and about 27,000 sites had integrated it. OAuth is a newer protocol used for access control, first drafted in 2007. It allows a user to grant access to information on one site (the provider) to another site (the consumer) without sharing his identity. OAuth is strongly supported by Google and Yahoo, and is also used by Twitter.

At the same time, many larger organizations started to tout their own proprietary authentication and access control mechanisms. Microsoft started its Passport service in 1999 and received much criticism along the way (it later changed its name to Windows Live ID). Soon after, in September 2001, the Liberty Alliance was started by Sun Microsystems, Oracle, Novell and others (which many considered a counter to Passport) to do federated identity management. As mentioned, Google is one of the initial and strongest supporters of OAuth, but it also offers the older AuthSub web authentication and ClientLogin client authentication APIs while being an OpenID provider (it calls its OpenID service Federated Login). Yahoo started supporting OAuth officially in 2008, in addition to its older BBAuth APIs, and is also an OpenID provider. Facebook joined the race in 2008 with Facebook Connect, which does authentication as well as sharing Facebook information. MySpace also came in with its own platform in 2008, originally called Data Availability, which quickly became MySpaceID. In April 2009, Twitter started offering ‘Sign In with Twitter’, which, true to its name, allows users to sign in with their Twitter accounts. In addition, Twitter uses HTTP basic authentication to protect its resources, and also recently announced support for OAuth.

As you can see, there are really a lot of players in the market. RPXNow, which I blogged about previously, took advantage of these offerings and came up with a unified and simplified set of APIs that allows authentication with Yahoo, Google, Windows Live ID, Facebook, AOL, MySpace and various OpenID providers.

Looking from another perspective, there are 2 types of third party authentication APIs:

  • Web authentication for web applications
  • Client authentication for desktop or mobile client applications

The main difference between these 2 types of authentication is that web authentication can redirect the user to the provider’s site for him to enter his username and password, then return to the calling web application, usually with a success or failure token. Client authentication (usually) cannot do this, and requires the application to capture the username and password, which the client application then sends to the provider. Obviously from a security point of view, web authentication is more secure and reassuring to users, since the user never needs to pass the username and password to the application at all.

Let’s jump into some code now. I will give two examples each of web authentication and client authentication. You can find all the code at git://

Let’s start with RPX because that’s the easiest. Using RPX is probably the easiest way to do third party web authentication with a large number of different providers. You can either use its provided interfaces (either embedded or pop-up Javascript) or you can call its simple API to authenticate your web application. Here’s a simple Sinatra example.

There are just 2 files. rpx_auth.html is in the public folder:

<form method="POST" action="">
	<input type="hidden" name="openid_identifier" value="" />
	<input type="image" src='' />
</form>

This is the input form. Notice that it is entirely in HTML (it’s just a simple form submit with one post parameter). The actual work is done by auth.rb in the root folder:

%w(rubygems sinatra rest_client json).each { |lib| require lib }

get '/login' do
  'You are logged in!' if authenticate(params[:token])
end

def authenticate(token)
  response = JSON.parse(RestClient.post('', :token => token, 'apiKey' => '<rpx API KEY>', :format => 'json', :extended => 'true'))
  return true if response['stat'] == 'ok'
  return false
end

You will need to register an application with RPXNow. In this example I’m reusing Chirp-Chirp’s application in RPXNow, which is why the form posts to Chirp-Chirp’s RPX server. The token is the URL that RPX will redirect to after it authenticates the user. The pleasant surprise here is that it allows you to redirect to localhost, which eases the development effort. In this example I’m using Yahoo as the OpenID provider, though RPXNow provides a host of other popular providers; in Chirp-Chirp I actually used Yahoo, Google and Windows Live ID. I used an image button to submit, but that’s just to prettify the page.

Notice that RPX redirects to the localhost with the URL ‘/login’. I created a get method for this, which in turn passes the returned token to the authenticate method. Inside this method, I used Adam Wiggin’s (of Heroku fame) really convenient rest-client library to send a post request with the token and the API key provided when you register for an application. I indicated that I want to receive the response in JSON. The response is parsed for the status and can also be parsed for data. For simplicity I didn’t return the Yahoo user information (but it’s a really small subset of data only) though I could potentially parse the JSON data for it.
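Extracting the profile would take only a couple more lines. Here’s a sketch against a canned response: the structure follows RPX’s auth_info format, but the identifier, username and email values are made up for illustration.

```ruby
require 'json'

# A canned (hypothetical) RPX auth_info response:
raw = <<JSON
{"stat": "ok",
 "profile": {
   "identifier": "https://me.yahoo.com/saush",
   "preferredUsername": "saush",
   "email": "saush@example.com"}}
JSON

response = JSON.parse(raw)
if response['stat'] == 'ok'
  # The profile hash carries whatever the provider returned
  profile = response['profile']
  puts "Logged in as #{profile['preferredUsername']} (#{profile['email']})"
end
```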

And that’s it! You now have web application authentication by a third party provider, courtesy of RPX.

What if you don’t want a middleman like RPX to do web authentication? In this example I was using RPX Basic, which is free because I’m using their domain. If I want my own domain for authentication, I need to pay for a commercial account with RPX (which is really their business model). An alternative is to use OpenID directly, which is very popular and doesn’t add too much complexity.

This is the input form for OpenID authentication, openid_auth.html:

<form method="get" action='/login'>
  OpenID identifier:
  <input type="text" class="openid" name="openid_identifier" size='50'/>
  <input type="submit" value="Verify" /><br />
</form>

The input form requires the user to enter an OpenID identifier. If you’re not sure what an OpenID identifier looks like, you can check it out here; most likely you already have one. Just as in the RPX case, the user is redirected to the OpenID provider, where he is asked to log in. After logging in, he is returned to the calling application.

This is the Sinatra app that does this work, in a file called openid_auth.rb:

%w(rubygems sinatra openid openid/store/filesystem).each { |lib| require lib }

REALM = 'http://localhost:4567'
RETURN_TO = "#{REALM}/complete"

get '/login' do
  checkid_request = openid_consumer.begin(params[:openid_identifier])
  redirect checkid_request.redirect_url(REALM, RETURN_TO) if checkid_request.send_redirect?(REALM, RETURN_TO)
  checkid_request.html_markup(REALM, RETURN_TO)
end

get '/complete' do
  response = openid_consumer.complete(params, RETURN_TO)
  return 'You are logged in!' if response.status == OpenID::Consumer::SUCCESS
  'Could not log on with your OpenID'
end

def openid_consumer
  @consumer = OpenID::Consumer.new(session, OpenID::Store::Filesystem.new('auth/store')) if @consumer.nil?
  return @consumer
end

Note that it is not much longer than the RPX code. I used the ruby-openid gem from JanRain, which is probably the de facto OpenID Ruby library. The steps are only a bit more complicated:

  1. The input form asks the user to enter an OpenID identifier and sends it to the ‘/login’ get block (it could just as easily be a post block).
  2. The ‘/login’ block first gets a consumer object. To create the consumer object, you need to pass in the session where the request information will be stored (this is usually provided by the web framework) as well as a store. The store is where we keep associations (the shared secrets between the relying party, i.e. your application, and an OpenID provider) as well as nonces (cryptographic one-time alphanumeric strings used to prevent replay attacks). In the example above, I created the store in the file system.
  3. Once we have the consumer, we initiate the authentication process by calling its begin method with the identifier, which returns a CheckIDRequest object. This object contains the state necessary to generate the actual OpenID request.
  4. Next we ask the CheckIDRequest object whether we should send the HTTP request as a redirect or produce an HTML POST form for the user.
  5. Then we either redirect the user to the OpenID provider or create an HTML form that does the same thing. Here we provide a return_to path for the OpenID provider to return the user to our application, along with a bunch of other data.
  6. Once the user authenticates himself, he is sent back to our application, and I used another get block, ‘/complete’, to process the returned data.
  7. We don’t need to go through the returned parameters one by one; we can re-use the same consumer object we created earlier and call its complete method to process them.
  8. The complete method returns a response object, which we can check for status.

The process looks more complicated than it really is; the ruby-openid library simplifies the processing a great deal. In the example above I only wanted to check whether the user is valid. If I wanted more data on the user, there are a bunch of other things we could do.

We’ve just covered web authentication; let’s move on to client authentication, which doesn’t involve a browser. Google provides ClientLogin, an authentication API specifically for desktop or mobile clients. Using it is actually simpler than web authentication. For this example and the next, I will use Shoes, the excellent Ruby GUI toolkit by why the lucky stiff. To run the code, you need to download and install Shoes, then follow the instructions on how to execute Shoes applications. I won’t go through how to create a Shoes application here though; that’s for another day. Here’s the complete code for using Google’s ClientLogin API, in a file called goog-auth-client.rb:

Shoes.setup do
  gem 'rest-client'
  gem 'net-ssh'
end

%w(rest_client openssl/bn openssl/cipher openssl/digest openssl/ssl openssl/x509 net/ssh).each { |lib| require lib }

class GoogleAuthClient < Shoes
  url '/', :login
  url '/list', :list

  def login
    stack do
      title "Please login", :margin => 4
      para "Log in using your Google credentials!"
      flow :margin_left => 4 do para "Email"; @me = edit_line; end
      flow :margin_left => 4 do para "Password"; @password = edit_line :secret => true; end
      button "login" do $me = @me.text; authenticate(@me.text, @password.text) ? visit('/list') : alert("Wrong email or password"); end
    end
  end

  def list
    title "You are logged in!"
  end

  def authenticate(email, password)
    response = RestClient.post '', 'accountType' => 'HOSTED_OR_GOOGLE', 'Email' => email, 'Passwd' => password, :service => 'xapi', :source => 'Goog-Auth-1.0'
    return true if response.code == 200
    return false
  end
end

Shoes.app :width => 400, :height => 200, :title => 'Google Authenticator'

You might notice that I included a large number of libraries. This is because of a problem in the current Shoes build, which doesn’t include the OpenSSL libraries, so we need to add them in manually. The method to focus on is really the authenticate method. I used the rest_client library again to send a post request to the ClientLogin URL. The parameter ‘service’ indicates which Google service you want to access; in our case we just want to authenticate a Google user, so we can use the generic service name ‘xapi’. The ‘source’ parameter is a short string that identifies our application; this is for logging purposes only.
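If you needed more than a yes/no answer, for example to call other Google APIs afterwards, you would pull the tokens out of ClientLogin’s plain-text response body, which consists of KEY=value lines. A sketch with made-up token values:

```ruby
# ClientLogin replies with a plain-text body of KEY=value lines.
# The token values below are made up for illustration:
body = "SID=DQAAAGgA...\nLSID=DQAAAGsA...\nAuth=DQAAAGgB...\n"

# Split each line on the first '=' into a token hash
tokens = {}
body.each_line do |line|
  key, value = line.chomp.split('=', 2)
  tokens[key] = value
end

# tokens['Auth'] is what you would send in an
# "Authorization: GoogleLogin auth=..." header on subsequent requests.
```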

An alternative to ClientLogin, and just as easy, is Twitter’s basic authentication service. The code is almost exactly the same. Here it is, in a file named twit-auth_client.rb:

require 'open-uri'

class AuthClient < Shoes
  url '/', :login
  url '/list', :list

  def login
    stack do
      title "Please login", :margin => 4
      para "Log in using your Twitter credentials!"
      flow :margin_left => 4 do para "Username"; @me = edit_line; end
      flow :margin_left => 4 do para "Password"; @password = edit_line :secret => true; end
      button "login" do authenticate(@me.text, @password.text) ? visit('/list') : alert("Wrong user or password"); end
    end
  end

  def list
    title "You are logged in!"
  end

  def authenticate(user, password)
    open("", :http_basic_authentication => [user, password]) rescue return false
    return true
  end
end

Shoes.app :width => 400, :height => 200, :title => 'Blab!'

Let’s jump into the authenticate method again. I used OpenURI to send a get request to the verify credentials API. Twitter uses basic authentication to protect its resources, so we need to pass in the username and password as well (notice that OpenURI allows you to use basic authentication). I could parse the returned results, but if the username and password are invalid the call throws an error anyway, so I just rescue it and return false. If all is well, I return true.

You might notice that the code that actually talks to Google and Twitter as the third-party authentication provider is just a single line each! Don’t you just love Ruby?

Clone TinyURL in 40 lines of Ruby code

I’m officially hooked. After writing 2 blog posts on cloning popular web applications on the Internet, I was raring to take on another one. TinyURL looked like the easiest, so that’s the one I did. In fact it’s so easy there are at least 100+ such applications on the market already. I called my TinyURL clone Snip.

I wrote Snip with Sinatra then deployed it to Heroku, so this is also a good excuse to describe Heroku, a truly amazing service for the Ruby programming community. The total number of lines in Snip is actually 43, in a single file named snip.rb, including the view template and layout. To check it out, go to git://

A quick word about TinyURL and its army of clones. TinyURL was the first of its kind (started in January 2002), providing a very simple but useful service: replacing a full URL with a much shorter one. Going to the shortened URL redirects the user to the actual full URL. Its usage really exploded with the rising popularity of Twitter, which requires its users to send messages of 140 characters or less, making URL shortening a necessity. In March 2009, the service with the second largest market share (13%) raised $2 million in funding, and TechCrunch had a field day, even estimating that TinyURL, with 75% market share, could be worth up to $46 million! Unfortunately, to date, TinyURL’s actual business model and the question of whether it and its ilk can make money remain unanswered.

Source: TechCrunch

Enough about the business and money; let’s look at the code, starting with the first line. Most of the time we write require statements on multiple lines, but we can instead put the library names in an array and iteratively require each of them. I only used the Sinatra and DataMapper libraries here, though Sinatra itself includes HAML, the template markup language I used in Snip.
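The idiom itself looks like this. I’ve substituted standard-library names here so the snippet runs anywhere; Snip’s actual line pulls in Sinatra and DataMapper instead.

```ruby
# Instead of the usual multi-line form:
#   require 'json'
#   require 'uri'
#   require 'digest'
# collapse the requires into a single line:
%w(json uri digest).each { |lib| require lib }
```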

I will not go in depth on using Sinatra. If you’re interested, please go to the Sinatra documentation or my previous post. I will assume you roughly know how Sinatra works and jump right in (anyway there is so little code you’d probably understand it easily).

I have only 2 get blocks. The ‘/’ get block primarily just shows the main page, while the ‘/:snipped’ get block takes the snipped code, fetches the original URL, then redirects the user to it. I also have a post block. The ‘/’ post block (notice that a get request will not reach the post block, which neatly demonstrates the pure simplicity of Sinatra) first makes sure that the URL is valid (by running it through the URI parse method). Then it either creates a new URL record or fetches the existing one, if the URL was shortened beforehand.
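The URL check in the post block can be sketched like this (the helper name valid_url? is mine, not Snip’s; Snip inlines a similar URI.parse call):

```ruby
require 'uri'

# Returns true only for parseable http/https URLs;
# URI.parse raises URI::InvalidURIError on malformed input.
def valid_url?(url)
  uri = URI.parse(url)
  uri.is_a?(URI::HTTP)   # URI::HTTPS subclasses URI::HTTP
rescue URI::InvalidURIError
  false
end

valid_url?("http://www.saush.com")  # => true
valid_url?("not a url")             # => false
```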

A note on the alphanumeric code used to represent a URL. I store each URL in the database as a row, and the code is really just the row ID in the database table. However, I cannot use the row ID directly because as the number of URLs grows, the number of characters representing the code grows quickly as well. For example, at 1 million URLs I would have 7 characters (all numbers) in the code. This is not very efficient. To reduce the number of characters used to represent the row ID, I use a base-36 numbering system (a-z, 0-9). This means 1 million records in the database would only require 4 characters (1,000,000 in base 10 is ‘lfls’ in base 36). And the 200 million records that TinyURL claims to store would use only 6 characters instead of 9. I have no idea how TinyURL actually generates its codes, but looking at the current number of characters in its codes (6), I would say this is not a far-off guess. (Reverse-engineering their 6-character codes this way, though, suggests that TinyURL has more than 700 million records.)
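The row-ID-to-code mapping described above is just a pair of radix conversions. A minimal sketch (the method names are mine for illustration):

```ruby
# Encode a database row ID as a short base-36 code, and back.
def id_to_code(id)
  id.to_s(36)
end

def code_to_id(code)
  code.to_i(36)
end

id_to_code(1_000_000)          # => "lfls" (4 characters instead of 7)
code_to_id("lfls")             # => 1000000
id_to_code(200_000_000).length # => 6 (instead of 9 decimal digits)
```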

Doing base-36 conversion seems daunting, but in reality most programming languages have some sort of support for non-decimal numbering system conversion. Ruby’s implementation is particularly simple. A Ruby Fixnum (i.e. all whole numbers) has a method called to_s. This method is probably familiar to all Ruby programmers, as it converts any object to its String representation. What is likely less well-known is that Fixnum’s implementation takes an optional parameter: the radix of the base numbering system to use. For example:

>> 1234.to_s(2)
=> "10011010010"
>> 1234.to_s(36)
=> "ya"

Conversely, String’s to_i implementation does the reverse: it takes a String and converts it into the Fixnum that String represents, given the radix (note that to_i stops at the first character that is invalid for the radix, so in the example below only ‘hello’ is actually converted):

>> "hello world".to_i(36)
=> 29234652

For templating, I used HAML, the delightful albeit more programmer-centric templating system. Most templating systems strike a compromise between HTML and a programming language. This is evident in popular systems like JSP, ASP, PHP and even ERB (ERB is the templating system used by default in Rails). However, HAML abandons HTML altogether and goes for a purely programming approach, very much like Seaside (Seaside is a Smalltalk web application framework). Skipping comparisons of the different templating approaches, one advantage HAML has is that it allows programmers to code the interface in a more natural way, which suits me just fine in this case.

I used a Sinatra trick that allows me to embed the templates within the source code itself. While normally I would need to create 2 additional template files (1 for the layout and another for the index), I added the code for the templates to the same source file, after the __END__ keyword.
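Sinatra splits everything after __END__ into sections that each begin with ‘@@ name’. Here’s a simplified illustration of that parsing, with made-up template content (Snip’s real templates differ):

```ruby
# A stand-in for what comes after __END__ in a Sinatra file:
source = <<'SRC'
@@ layout
%html
  %body= yield

@@ index
%h1 Snip!
SRC

# Split the text into named template sections, roughly the way
# Sinatra's inline-template support does.
templates = {}
current = nil
source.each_line do |line|
  if line =~ /^@@\s*(\w+)/
    current = $1.to_sym      # start a new section named by "@@ name"
    templates[current] = ''
  elsif current
    templates[current] << line
  end
end
```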

To prettify the interface, I used one of the ready-made W3C core stylesheets, which provide me with a standard and well defined set of styles without going through the headache of creating one myself.

And we’re done! A full URL shortening service in a single file of about 40 lines. To start it up, just do this:

$ ruby snip.rb

Then go to http://localhost:4567.

The next step is to deploy it to Heroku. I can’t say enough about this service, which is heaven-sent for the Ruby web application programming community. Just register an account at In fact the main page more or less explains the steps you need to deploy the app! However, there are a couple more steps for Sinatra. Here is the complete list of steps:

1. Create a file

This is the Rack configuration file, which is actually just another Ruby script. All you need to have in this file is this:

require 'sinatra'
require 'snip'
run Sinatra.application

This tells Rack to include the Sinatra and Snip libraries, then run the Sinatra application.

2. Install the Heroku gem

$ sudo gem install heroku

Heroku provides us with a set of useful tools packaged in a gem, very much like Capistrano.

3. Initialize an empty Git repository in the snip folder

$ cd snip
snip $ git init
Initialized empty Git repository in .git/
snip $ git add .
snip $ git commit -m 'initial import'
Created initial commit 5581d23: initial import
2 files changed, 52 insertions(+), 0 deletions(-)
create mode 100644
create mode 100644 snip.rb

This just creates and initializes an empty git repository on your computer.

4. Create the Heroku application

snip $ heroku create snip
Created |
Git remote heroku added

You will be prompted for your username and password the first time you run a heroku command. Subsequently this will be saved in ~/.heroku/credentials and you won’t be prompted. It will also upload your public key to allow you to push and pull code.

5. Push your code to Heroku

snip $ git push heroku master
Counting objects: 4, done.
Compressing objects: 100% (4/4), done.
Writing objects: 100% (4/4), 999 bytes, done.
Total 4 (delta 0), reused 0 (delta 0)
-----> Heroku receiving push
-----> Rack app detected
Compiled slug size is 004K
-----> Launching....... done
App deployed to Heroku
* [new branch]      master -> master

Notice that this pushes your code and deploys your application in one step.

6. Log in to the Heroku console and create the database

snip $ heroku console
Ruby console for
>> DataMapper.auto_migrate!
=> [Url]

Heroku gives you access to a console similar to irb but with the environment of your deployment loaded, like script/console in Ruby on Rails. To create the database, I just ran DataMapper.auto_migrate! and it created the database accordingly.

This is it! Now go to your application on Heroku and you should be able to see this:

One of the main sticking points in Ruby web application development is finding a place to host the application. Most people use either shared web hosting or slice hosting, but these are either underpowered or costly (or both) and require mucking around with servers (which many application programmers like myself do only reluctantly). Heroku is an amazing service that heralds a new way of deploying Ruby web applications, and it saves the day for many Ruby programmers.

A final word on the URL shortening service. For a more full-fledged service, I would probably register a much shorter domain name (something like what has) and also add more features, such as statistics for each redirected URL and so on.