saush

My new book is out!

Posted in Ruby, Sinatra, Uncategorized by sausheong on September 2, 2010

It’s been a while from the day I started writing Cloning Internet Applications with Ruby but it’s finally out! You can get it from bookstores, Amazon or its main site at Packt. It’s available in both a paper and a digital version (PDF), so get it now!

The main idea behind this book is actually quite simple and it started out in this blog. The first ‘clone’ I wrote was the Internet search engine in 200 lines of code, which was really very much scratching an itch that I had while I was in Yahoo, about a year and a half ago. I was interested in search engines athen and I wanted to write a very simple search engine to illustrate the principles behind an Internet search engine. That gave me a chance to try out Sinatra, the minimalist web application framework, which worked out really well for me eventually. In turn, that kickstarted me into on a whimsy challenge to do the same with Twitter in the same number of lines of code, using Sinatra and later, TinyURL in 40 lines of code. After that it was only a short leap to writing a whole book about it.

While the original idea revolved around writing clones with the smallest codebase possible, eventually the book evolved to be about writing minimal-feature clones written using the Ruby libraries that I now love to use i.e. Sinatra, DataMapper and Haml. The fundamental premise of the book still remained though, that is to illustrate how clones of popular Internet applications can be written with Ruby.

While this is a highly technical book with lots of code, I added in plenty of elements of the reasons and rationale (according to me, that is) why and how certain features of those applications work. For example, Twitter’s and Facebook’s features for connecting their users (‘friending’ features) in a social network are different, because they target users differently. Twitter’s friending features are primarily one-way and do not need explicit approval while Facebook’s friending features are two-ways and need explicit approvals from both parties. This means design and implementation differences, which are explained in detail in the book.

The experience in writing this book was good, and I have learnt tremendously in the process though it was a struggle. I can say this now that it’s published, but there were certain times I wanted to throw in the towel because of the messy my career was in then. I was still in Yahoo when I started, and I continued while doing my consulting work which eventually led to Garena, then wrapping up before I left Garena and finally being published now as I’m in HP Labs. It took a longer time to finish this than my first book, because of the upheaval in my career in the past year or so and also because overall I wanted to come up with a better book. This resulted in a book that has been revised repeated as companies, statistics and technologies changed. When I started, TinyURL was the king of the hill of URL shorteners while bit.ly was up and coming having just taken over as the default URL shortener in Twitter. TinyURL is now one of the players, with bit.ly probably the largest but Twitter has come out with its own shortener. Facebook Connect was the way to go when I wrote the chapter on Facebook, but the Open Graph APIs has taken over since then. Twitter used both HTTP Basic Authentication and OAuth when I was writing, but has switched over completely to OAuth now. And I expect the list to go on and on.

Still, it has been a good journey and a good fight. Finally publishing it is a grand feeling second to none (except when I had my first child). Hope you enjoy the book!

Toilets and engineers

Posted in Ruby by sausheong on June 7, 2010

It started with a lunch conversation that slowly turned into a coffee time mini-debate. You see, I moved into a new job recently at the Applied Cloud Computing Lab in HP Labs Singapore. My first task was to hire engineers to staff the lab and while the final numbers were not absolute, the number 70 was bandied around for the population of the facilities. We have half a floor at Fusionopolis, plenty of space really but the bone of contention was more delicately focused on more sanitary facilities i.e. the toilets.

The problem was that the whole floor shares a single pair of toilets (one for male and another for female of course). I was totally convinced that with 70 people in the office, we will reach bladder or bowel apocalypse, a disaster in the waiting. My other colleagues however were less worried — the other floors seem to be doing fine. It wasn’t as if we could do anything about it, no one’s going to magically add in new toilets for us no matter how long the toilet queue was; but I had to know. I was also curious of course — what is the algorithm used to determine the number of toilets per square meter of occupancy?

Looking up building regulations found nothing (at least not online) in Singapore but UK has some regulations covered by the Health and Safety Executive.

Looking at facilities used by men only:

Number of men at work Number of toilets Number of urinals
1-15 1 1
16-30 2 1
31-45 2 2
46-60 3 2
61-75 3 3
76-90 4 3
91-100 4 4

Well, the toilets seem to almost fit into the UK regulations, but these are just regulations. How did these numbers come about and how realistic are they? To better model the usage scenarios, I did a Monte Carlo simulation on the usage pattern of the toilets, based on a simple set of assertions.

Taking an educated guess of 1 in 7 of engineers to be female, that leaves us with 60 male engineers sharing a single toilet facility. The toilet facility has 3 urinals and 2 stalls, one of which is a flush toilet, the other is a squat toilet (which in general no one uses any more).

An important piece of information is to find out how many times an average person need to urinate a day. It turns out that our bladder will signal us to make a trip to the toilet when it fills up to about 200ml. Combine with the information that an average urine output of an adult is 1.5 liters a day, this means that we roughly need to go about 8 times a day. This in turn means 8 times in 24 hours or 1 time in 3 hours on an average. The normal working hours in HP are 8:30am – 5:30pm, which is 9 hours, so this means on an average a person makes a beeline to the toilet 3 times during the course of the working day in the office.

I modeled the scenario discretely, per minute. This means a person in the office would make a decision to go to the loo or not every minute within the 9 hours in the office, or 540 minutes. Coupled with an average of 3 times in 540 minutes, this means the probability that a person would go to the toilet every count of the minute is 3 out of 540. In the case of urinating, my estimate is that a person will be able to complete the necessary job within 2 minute on an average and vacate the facility.

Armed with this model, let’s look at the simulation code:

require 'rubygems'
require 'faster_csv'

# population of 60 people on the same floor
population = 60
# period of time from 8:30am -> 5:30pm is 9 hours
period = 9
# 8 times a day -> 8 times in 24 hours -> 1 time every 3 hours
# within a period of 9 hours -> 3 times
# probability in every min -> 3 out of 9 * 60
times_per_period = 3
# rate of servicing
mu = 2
# number of people in the queue
queue = 0
# number of stalls available
stalls = 3
# number of occupied stalls
occupied = 0

# total of all queue sizes for calculating average queue size
queue_total = 0
# max queue size for the day
max_queue_size = 0

FasterCSV.open('queue.csv', 'w') do |csv|
  csv << %w(t queue occupied)
  (period * 60).times do |t|
   # at every rate of service interval, vacate any and all people from the toilet and replace them with people in the queue
    if t % mu == 0
      occupied = 0
      while occupied < stalls and queue > 0
        occupied += 1
        queue -= 1
      end
    end
    population.times do
      # does the engineer want to relieve himself?
      if (rand(period * 60) + 1) <= times_per_period
        if occupied < stalls
          occupied += 1
        else
          queue += 1
        end
      end
    end

    csv << [t, queue, occupied]
    max_queue_size = queue if queue > max_queue_size
    queue_total = queue_total + queue
  end
end

puts "** Max queue size : #{max_queue_size}"
puts "** Total mins in queue by all : #{queue_total} mins"

Most of the code are quite self explanatory. However notice that mu is 3 (minutes) instead of 2 as we have discussed earlier. The reason for this is that in keeping the simulation code simple, I’ve not factored in that if we clear the toilets every 2 minutes, people who entered during the 2nd minute (according to the simulation) will be cleared in just 1 minute. To compensate for this, I use a mu variable that is 1.5 times the service rate.

The simulation writes to queue.csv, which contains the queue size and toilet occupancy at every interval. This are the results:

** Max queue size : 3
** Total mins in queue by all : 17 mins

I opened the CSV up and used the data to create a simple histogram:

Queue size
Number of minutes
Percentage
0 526 97.6%
1 9 1.8%
2 3 0.4%
3 2 0.2%

From the results it’s pretty obvious that the architect and engineers who built Fusionopolis knew their stuff. These were the quantitative conclusions:

  1. Queues were formed less than 3% of the time of the day
  2. The maximum number of people in the queue were 3 people and that happened for only less than 2 minutes in a day
  3. The total time spent by everyone in the office, in a queue to the toilet was 17 mins

This is for the men’s urinals, let’s look at the stalls now. The same simulation script can be used, just changing the parameters a bit. Instead of mu = 3, we use mu = 8. Instead of 3 urinals, we have 2 stalls and instead of 3 times over the period, we use 1 time over the working period.

These are the results:

** Max queue size : 3
** Total mins in queue by all : 81 mins
Queue size
Number of minutes
Percentage
0 485 89.8%
1 33 6.1%
2 21 3.9%
3 1 0.2%

The problem seems a bit more serious now though not significantly so, though the chances of being in a a queue are higher now (10%).

In conclusion, I was wrong but I’m glad to be wrong in this case. Waiting in a toilet queue is seriously not a fun thing to do.

How to hire the best engineers – part 2

Posted in Ruby by sausheong on April 5, 2010

In my previous blog post, I came to the conclusion that

If you’re a hiring manager with a 100 candidates, just interview a sample pool of 7 candidates, choose the best and then interview the rest one at a time until you reach one that better than the best in the sample pool. You will have 90% chance of choosing someone in the top 25% of those 100 candidates.

I mentioned that there are 2 worst case scenarios for this hiring strategy. Firstly, if coincidentally we selected the 7 worst candidates out of 100, then the next one we choose will be taken regardless how good or bad it is. Secondly, if the best candidate is already in the sample pool, then we’ll be going through the rest of the population without finding the best. Let’s talk about these scenarios now.

The first is easily resolved. The probabilities are already considered and calculated in our simulation so we don’t need to worry about that any more. The second is a bit tricky. To find out the failure probability, we tweak the simulation a bit again. This time round instead of getting every sample pool size, we focus on the best results i.e. sample pool size of 7.

require 'rubygems'
require 'faster_csv'

population_size = 100
sample_size = 0..population_size-1
iteration_size = 100000
top = (population_size-25)..(population_size-1)
size = 7
population = []
frequencies = []

iteration_size.times do |iteration|
 population = (0..population_size-1).to_a.sort_by {rand}
 sample = population.slice(0..size-1)
 rest_of_population = population[size..population_size-1]
 best_sample = sample.sort.last
 rest_of_population.each_with_index do |p, i|
 if p > best_sample
 frequencies << i
 break
 end
 end
end

puts "Success probability : #{frequencies.size.to_f*100.0/iteration_size.to_f}%"

Notice we’re essentially counting the times we succeed over the iterations we made. The result for this is a success probability of 93%!

Let’s look at some basic laws of probability now. Let’s say A is the event that the strategy works (the sample pool does not contain the best candidate), so the probability of A is P(A) and this value is 0.93. Let’s say B is the event that the the candidate we find out if this strategy is within the best quartile(regardless if A succeeds or not), so the probability of B is P(B). We don’t know the value of P(B) and we can’t really find P(B) since A and B are not independent. What we really want is  . What we do know however is P(B|A) which is the probability of B given A happens. This value is 0.9063 (from the previous post),

Using the rule of multiplication:

We get the answer:

which is 84.3% probability that the sample pool does not contain the best candidate and we get the top quartile of the total candidate population.

Next, we want to find out, given that we have a high probability of getting the correct candidates in, how many candidates will we have to interview before we find one that is better than the best in the sample pool? Naturally if we have to interview a lot of candidates before finding one, the strategy isn’t practical. Let’s tweak the simulation again, this time to count the successful ones and get a histogram of the number of interviews before we hit jackpot.

What kind of histogram do you think we will get? An initial thought was that it would be a normal distribution, which is not great news for this strategy, because a histogram peaks around the mean, which means the number of candidates we need to interview is around 40+.

require 'rubygems'
require 'faster_csv'

population_size = 100
sample_size = 0..population_size-1
iteration_size = 100000
top = (population_size-25)..(population_size-1)
size = 7
population = []
frequencies = []

iteration_size.times do |iteration|
 population = (0..population_size-1).to_a.sort_by {rand}
 sample = population.slice(0..size-1)
 rest_of_population = population[size..population_size-1]
 best_sample = sample.sort.last
 rest_of_population.each_with_index do |p, i|
 if p > best_sample
 frequencies << i
 break
 end
 end
end

FasterCSV.open('optimal_duration.csv', 'w') do |csv|
 rest_of_population = population[size..population_size-1]
 rest_of_population.size.times do |i|
 count_array = frequencies.find_all{|f| f == i}
 csv << [i+1, count_array.size.to_f/frequencies.size.to_f]
 end
end

The output is a CSV file called optimal_duration.csv. Charting the data as a histogram, surprisingly we get power law graph like this:


This is good news for or strategy! Looking at our dataset, the probability of the first candidate in the rest of the candidate population to be the one is 13.58% while the probability of getting our man (or woman) in the first 10 candidates we interview is a whopping 63.25%!

So have we nailed the question if the strategy is good one? No, there is still one last question unanswered and this is the breaker. Stay tuned for part 3! (For those who are reading this, an exercise — tell me what you think is the breaker question?)

Umbrellas to work, or why engineers should get married

Posted in Ruby by sausheong on March 30, 2010

When I was working in Yahoo at Suntec City, I used to take the MRT to work every day and my route is completely sheltered right to the doorsteps of the office so the weather didn’t really have much effect on me. I still take the MRT to the office now that I am at Garena, but there is a short distance from the MRT station to the office, about 5 minutes of walking, where it is subject to the elements.

Now I don’t take any umbrellas along with me, so with the recent change in weather (it has been raining cats and dogs and all sorts of mammals in between) I have been caught a few times in some unpleasant wetness. It is during one of these wet commutes from home to office I was thinking of ways of making sure that will not happen again.

The idea was like this. I will place some umbrellas at the office and also at home. Whenever I leave the house or the office and if it is raining, I will take the umbrella to the office or back home respectively. Naturally if it’s not raining then I wouldn’t want to look like a dork and carry one.

Therein lies the problem. Let’s say I keep 1 umbrella at the office and another at home. If it rains when I’m leaving the office, I will carry that one home, leaving 2 at home and none at the office. If it rains the next day when I leave home it’s ok, I will just take an umbrella to the office, making it 1 – 1 again. If it doesn’t rain and therefore I don’t take any umbrellas to the office, I will be left with 2 – 0. If it happens to rain when I’m leaving the office I’m stuck with getting wet again.

What if I have 2 umbrellas at the office and 2 at home? In this case I will be caught wet only if the same occurrence (rain when leaving the office followed by no rain when leaving the house) happens 2 days in a row, which probability is lower than if it happens just once. Still, that can happen and I want to get assurance that the risk of getting wet is really minimal. In that case, the question becomes, how many umbrellas should I leave at the office and similarly at home such that the probability of getting wet is negligible? Note that it doesn’t really matter if one place has more umbrellas than the other since it will be the same as the lower number of umbrellas in the long run.

To find out, I used a popular (and really pragmatic and therefore very ‘engineering’) method of finding probabilities — the Monte Carlo simulation method. The name sounds fancy but it’s really just an algorithm that uses repeated random samples to find the answer. In our case, what we want to do is to find out, over large sample, the average number of trips I will make before I run the risk of getting wet.

To find this, I will need to inspect the probability that it will rain at any given day. For easy of calculations, I use a probability between 1% chance that it will rain and 99% chance that it will rain. Why don’t I include 0% or 100%? This is because if it is 0% it means it will never rain and so I will never need to use an umbrella, and if it 100% it will always rain and therefore I will always bring an umbrella back or forth.

On top of that, I will run the simulation for the case where I have 2 umbrellas at home and at the office, up to 50 umbrellas at home and at the office. To give a large enough representation, I run this over 1000 iterations (more will be better but it will take too long to run). This is the Ruby code for the simulation.

require 'rubygems'
require 'faster_csv'

umbrellas_range = 2..50
probability_range = 1..99
@location = :home

total_trips = {}
iterations = 1000

def walks_to(loc)
 location = loc
 @num_of_trips += 1
end

def office?()  @location == :office; end
def home?() @location == :home; end
def raining?(probability) rand(99) < probability; end

FasterCSV.open('umbrellas.csv', 'w') do |csv|
 csv << [''] + probability_range.to_a
 umbrellas_range.each do |umbrellas|
 probability_range.each do |probability|
 total_trips[probability] = 0
 iterations.times do
 @num_of_trips = 0
 wet = false
 home = umbrellas
 office = umbrellas

 while not wet
 if home?
   walks_to :office
   if raining?(probability)
     if home > 0
       home -= 1
       office += 1
     else
       wet = true
     end
   end

 elsif office?
   walks_to :home
   if raining?(probability)
     if office > 0
       office -= 1
       home += 1
     else
       wet = true
     end
   end
 end
 end

 total_trips[probability] += @num_of_trips
 end
 end

 row = [umbrellas]
 total_trips.sort.each { |pair| row << pair[1]/iterations }
 csv << row
 end
end

The Ruby program writes to a simple CSV file called umbrellas.csv with the first row being the probability, and the first column the number of umbrellas at each location. If you chart each row with the y-axis being the number of trips and the x-axis being the probability of rain from 1% to 99%, you will find a chart like this:

Now that I have the average number of trips that I will make before I get wet, I want to actually see what that means for Singapore. To do this, I went to the National Environment Agency’s website and dug out the weather statistics for Singapore.  There is a statistic that gives the average number of rainy days in any given month for the past 118 years. I take that for each month and divide that with the number of days in that month to derive the monthly probability of rain as in the following table:


Month Probability of rain
January 48%
February 39%
March 45%
April 50%
May 45%
June 43%
July 42%
August 45%
September 47%
October 52%
November 63%
December 61%

Finally I match that with the Monte Carlo simulation I made earlier. The average number of trips for each month, must be between 56 to 62 days (each trip is one way only, so every day is 2 trips). For example, the statistics shows that the wettest months are in November and December which is 63% and 61% probability of rain respectively. This means I will need to have 38 umbrellas at each location for each of these 2 months (please check your umbrellas.csv file; open it up with Excel). In February where the probability of rain is the lowest (39%), I will only need 23 umbrellas at each location to almost guarantee that I will not get wet (the whole deal assumes at the end of each month, I replenish each location with the necessary number of umbrellas, of course).

That evening when I proudly told my wife my findings at home, she stared at me for while (frostily if I might add) then gave me a small folding umbrella, which I now carry in my laptop bag.

And that is the reason why engineers should get married.

(I have been inspired by this book — Digital Dice : Computational Solutions to Practical Probability Problems, by Paul J. Nahin)

Clone TinyURL with 40 lines of Ruby code on Google AppEngine for Java

Posted in JRuby, Ruby, Sinatra by sausheong on May 15, 2009

Amidst the wretched events that happened at work recently, I forgot about an interesting development in running apps on a cloud. Google AppEngine finally released Java support on the AppEngine platform. For those uninitiated, AppEngine is Google’s cloud computing platform that allows developers to serve up applications on Google’s infrastructure. When it was first released in April 2008, the only language supported was Python. Python is a great language but doesn’t appeal to my inner Rubyist so it didn’t catch my attention. Until now that is.

While Java is no longer my language of choice nowadays, Ruby actually runs pretty well under JRuby with Java. And with the addition of the Java support for AppEngine, it became a lot more interesting. A few weeks back I wrote Snip, a TinyURL clone, in about 40 lines of Ruby code, and deployed it on Heroku. It seems like a good idea to take Snip out for a spin on the Google AppEngine for Java (GAE/J).

The first thing you need to do is to create an application on the GAE/J. Start by going to this URL – http://appengine.google.com/start and log in using a Google account. After logging in, create a new application following the instructions given on the screen. When you’re done you should have an application id. In this case, it is ‘saush-snip’. We will be needing this application id in our configuration later. You will also need to enable Java for your GAE/J. At this point in time, GAE/J is still in beta and Google is only limiting the first 10,000 developers from enabling Java for GAE/J. Unfortunately if you don’t get Java enabled for your account, you won’t be able to try this out until it is fully released and available to all.

Once you have signed up and gotten an email approval for your GAE/J account, the first thing to do is to install JRuby, if you haven’t done so yet. Even if you have installed it previously you might want to install at least 1.3.0RC1 since some fixes were added into this release to make it run better under GAE/J. Run this:

$ git clone git://github.com/jruby/jruby.git

This will clone a copy of JRuby into your computer. Then go into the jruby folder that was just created and run this:

$ sudo ant && sudo ant jar-complete

This will install JRuby and create the jruby-complete.jar library that you will need in a while. Take note of this path to JRuby, you’ll need it in the subsequent commands. Assume that you just installed JRuby in ~/jruby, do a quick check to see if the install is ok:

$ ~/jruby/bin/jruby -v

If you installed version 1.3.0RC1 you should see something like this:

jruby 1.3.0RC1 (ruby 1.8.6p287) (2009-05-11 6586) (Java HotSpot(TM) 64-Bit Server VM 1.6.0_07) [x86_64-java]

After installing JRuby, you’ll need to install all gems needed for this application. Remember that even if you have installed gems for your normal Ruby installation you’ll need to install it all over again for JRuby. For Snip, you need Sinatra and HAML, but you’ll also need Rake and Warbler, the JRuby war file packager.

$ ~/jruby/bin/jruby -S gem install rake sinatra haml warbler

Now that the basic JRuby and related gems are done, let’s look at the Snip code itself. One thing that is pretty obvious upfront when dealing with AppEngine is that it doesn’t have a relational database for persistence. Instead of a familiar RDBMS, we get a JDO interface or a DataStore API. How do we use it? As it turns out, we don’t need to do anything major. Ola Bini wrote a small wrapper around DataStore, called Bumble, to allow us to write data models just like we did with DataMapper. Well, almost.

Using Bumble is very much similar to DataMapper and ActiveRecord, so I didn’t have to change my code much. This is the DataMapper version of the Url model:

DataMapper.setup(:default, ENV['DATABASE_URL'] || 'mysql://root:root@localhost/snip')
class Url
  include DataMapper::Resource
  property  :id,          Serial
  property  :original,    String, :length => 255
  property  :created_at,  DateTime
  def snipped() self.id.to_s(36) end
end

And this is the Bumble version of the Url model:

class Url
  include Bumble
  ds :original
  def snipped() self.key.to_s end
end

I didn’t add in the time stamp for the Bumble version because I don’t really use it but as you can see there are quite a bit of similarities. I didn’t need to put in my own running serial id because it’s managed by the AppEngine. Also, instead of using the object id, I used the object’s key, which again is managed by the AppEngine. A key is a unique identifier of an entity across all apps belonging to the user. The key is created automatically by Bumble through the low-level DataStore Java APIs. Besides this, using the Url class is slightly different also. Instead of

@url = Url.first(: original => uri.to_s)

We use:

@url = Url.find(: original => uri.to_s)

Finally because we don’t use the id anymore and use the key instead, we don’t need to do the base 36 conversion so we let the AppEngine handle everything. Instead of:

get '/:snipped' do redirect Url[params[:snipped].to_i(36)].original end

We use:

get '/:snipped' do redirect Url.get(params[:snipped]).original end

This is the full source code:

%w(rubygems sinatra bumble uri).each  { |lib| require lib}

get '/' do haml :index end

post '/' do
  uri = URI::parse(params[:original])
  raise "Invalid URL" unless uri.kind_of? URI::HTTP or uri.kind_of? URI::HTTPS
  @url = Url.find(:original => uri.to_s)
  @url = Url.create(:original => uri.to_s) if @url.nil?
  haml :index
end

get '/:snipped' do redirect Url.get(params[:snipped]).original end

error do haml :index end

use_in_file_templates!

class Url
  include Bumble
  ds : original
  def snipped() self.key.to_s end
end

__END__

@@ layout
!!! 1.1
%html
  %head
    %title Snip! on Google AppEngine
    %link{:rel => 'stylesheet', :href => 'http://www.w3.org/StyleSheets/Core/Modernist', :type => 'text/css'}
  = yield

@@ index
%h1.title Snip! on Google AppEngine
- unless @url.nil?
  %code= @url.original
  snipped to
  %a{:href => env['HTTP_REFERER'] + @url.snipped}
    = env['HTTP_REFERER'] + @url.snipped
#err.warning= env['sinatra.error']
%form{:method => 'post', :action => '/'}
  Snip this:
  %input{:type => 'text', :name => 'original', :size => '50'}
  %input{:type => 'submit', :value => 'snip!'}
%small copyright &copy;
%a{:href => 'http://blog.saush.com'}
  Chang Sau Sheong
%br
  %a{:href => 'http://github.com/sausheong/snip-appengine'}
    Full source code

The code is ready but here comes the packaging. GAE/J is a Java servlet environment, which means our app needs to be packaged into a war (Java Web ARchive). Fortunately instead of building up the war by hand we can use Warbler, the JRuby war packager. Before running Warbler, we need to have a couple of things. Firstly we need to build the warble configuration file:

$ mkdir config
$ ~/jruby/bin/jruby -S warble config

We create a directory called config and get Warbler to copy a default configuration file to it. Replace the contents with this minimal setup. If you want to explore more, read warble.rb itself.

Warbler::Config.new do |config|
	config.dirs = %w(lib public views)
	config.includes = FileList["appengine-web.xml", "snip.rb", "config.ru", "bumble.rb"]
	config.gems = ['sinatra', 'haml']
	config.gem_dependencies = true
	config.war_name = "saush-snip"
	config.webxml.booter = :rack
	config.webxml.jruby.init.serial = true
	config.java_libs.reject! { |lib| lib =~ /jruby-complete/ }
end

Note that we don’t really need the public and view directories in Snip because everything is in a single file. The 2 other configuration files we will need are appengine-web.xml and config.ru. We also need to include the snip.rb and bumble.rb into the war file for deployment. To get bumble.rb, go to Ola Bini’s Bumble GitHub repository and get the file that is in the sub-folder (not the main one). The last line tells us not to include the jruby-complete.jar library in the lib folder when we run Warbler. I’ll explain this in a minute. Also note the war file is the application id of the application we created in the GAE admin console earlier on (saush-snip).

Next, create a lib folder. Go to the GAE/J download site and download the GAE/J Java library. It should be called something like appengine-api-1.0-sdk-1.2.0.jar. Copy that into the lib folder you’ve just created. We will also need the Java libraries in the lib folder. Normally for a JRuby deployment, Warbler will package it in, but Google has a 1,000 file limit which Ola Bini kindly pointed out. He also provided a script to split the JRuby library into 2 files. You can find the script here and when you run it, it should split jruby-complete.jar into 2 files named jruby-core-1.3.0RC1.jar and jruby-stdlib-1.3.0RC1.jar. You will also need JRuby-Rack but it’s included in Warbler and as you will see later, Warbler will copy it into the war when you run it. JRuby-Rack is an adapter for the Java servlet environment that allows Sinatra (or any Rack-based application) to run.

The next piece is appengine-web.xml. I used Ola Bini’s version as the base:

<appengine -web-app xmlns="http://appengine.google.com/ns/1.0">
    <application>saush-snip</application>
    <version>1</version>
    <static -files />
    <resource -files />
    <sessions -enabled>false</sessions>
    <system -properties>
      <property name="jruby.management.enabled" value="false" />
      <property name="os.arch" value="" />
      <property name="jruby.compile.mode" value="JIT"/> <!-- JIT|FORCE|OFF -->
      <property name="jruby.compile.fastest" value="true"/>
      <property name="jruby.compile.frameless" value="true"/>
      <property name="jruby.compile.positionless" value="true"/>
      <property name="jruby.compile.threadless" value="false"/>
      <property name="jruby.compile.fastops" value="false"/>
      <property name="jruby.compile.fastcase" value="false"/>
      <property name="jruby.compile.chainsize" value="500"/>
      <property name="jruby.compile.lazyHandles" value="false"/>
      <property name="jruby.compile.peephole" value="true"/>
			<property name="jruby.rack.logging" value="stdout"/>
   </system>
</appengine>

The list row in the property line sets logging to STDOUT, which is very useful for debugging. If you don’t set this, you might not be able to see any console output. Again, we need to set the application id that we got earlier on (saush-snip).

Finally we need a config.ru Rackup file to start the whole thing:

%w(rubygems sinatra snip).each  { |lib| require lib}
root_dir = File.dirname(__FILE__)
set :environment, :production
set :root, root_dir
set :app_file, File.join(root_dir, 'snip.rb')
disable :run
run Sinatra::Application

You can of course also find all these things in the Snip-AppEngine repository at git://github.com/sausheong/snip-appengine.git. However it is so much more fun to do it step by step right?

Now that we have all the pieces let’s package our files for deployment. First we need to generate the exploded war file:

$ ~/jruby/bin/jruby -S warble

You should see output like this:

mkdir -p tmp/war/WEB-INF/gems/specifications
cp /Users/saush/jruby/lib/ruby/gems/1.8/specifications/sinatra-0.9.1.1.gemspec tmp/war/WEB-INF/gems/specifications/sinatra-0.9.1.1.gemspec
mkdir -p tmp/war/WEB-INF/gems/gems
JRuby limited openssl loaded. gem install jruby-openssl for full support.

http://wiki.jruby.org/wiki/JRuby_Builtin_OpenSSL

cp /Users/saush/jruby/lib/ruby/gems/1.8/specifications/rack-0.9.1.gemspec tmp/war/WEB-INF/gems/specifications/rack-0.9.1.gemspec
cp /Users/saush/jruby/lib/ruby/gems/1.8/specifications/haml-2.0.9.gemspec tmp/war/WEB-INF/gems/specifications/haml-2.0.9.gemspec
mkdir -p tmp/war/WEB-INF/lib
mkdir -p tmp/war/WEB-INF/public
mkdir -p tmp/war/WEB-INF/views
cp lib/appengine-api-1.0-sdk-1.2.0.jar tmp/war/WEB-INF/lib/appengine-api-1.0-sdk-1.2.0.jar
cp lib/jruby-core-1.3.0RC1.jar tmp/war/WEB-INF/lib/jruby-core-1.3.0RC1.jar
cp lib/jruby-stdlib-1.3.0RC1.jar tmp/war/WEB-INF/lib/jruby-stdlib-1.3.0RC1.jar
cp appengine-web.xml tmp/war/WEB-INF/appengine-web.xml
cp snip.rb tmp/war/WEB-INF/snip.rb
cp config.ru tmp/war/WEB-INF/config.ru
cp bumble.rb tmp/war/WEB-INF/bumble.rb
cp /Users/saush/.gem/jruby/1.8/gems/warbler-0.9.13/lib/jruby-rack-0.9.4.jar tmp/war/WEB-INF/lib/jruby-rack-0.9.4.jar
cp /Users/saush/.gem/jruby/1.8/gems/warbler-0.9.13/lib/jruby-rack-0.9.4.jar tmp/war/WEB-INF/lib/jruby-rack-0.9.4.jar
mkdir -p tmp/war/WEB-INF
jar cf saush-snip.war  -C tmp/war .

You will also get a saush-snip.war file and a bunch of files under the tmp folder, which is really just the war file exploded. We won’t need the war file (saush-snip.war) itself for deployment, only the tmp directory. Before doing the deployment, we need to make a minor adjustment to the tmp/war/WEB-INF/gems/gems/sinatra-0.9.1.1/lib/sinatra.rb file. Somehow the line ‘use_in_file_templates!’ gives an error when deploying to the GAE/J so comment it out.

That’s it! We are ready for the deployment. To deploy run this command:

$ ~/appengine-java-sdk-1.2.0/bin/appcfg.sh --email= --passin update tmp/war/

You should see output like this:

Reading application configuration data...
2009-05-15 19:51:38.916::INFO:  Logging to STDERR via org.mortbay.log.StdErrLog
Beginning server interaction for saush-snip...
Password for :
0% Creating staging directory
5% Scanning for jsp files.
20% Scanning files on local disk.
25% Scanned 250 files.
28% Initiating update.
31% Cloning 340 application files.
33% Cloned 100 files.
34% Cloned 200 files.
35% Cloned 300 files.
40% Uploading 0 files.
90% Deploying new version.
95% Will check again in 1 seconds
98% Will check again in 2 seconds
99% Closing update: new version is ready to start serving.
99% Uploading index definitions.
Update complete.
Success.
Cleaning up temporary files...

Now go to http://saush-snip.appspot.com and you should be able to see the new deployment of Snip on the Google AppEngine for Java.

A few other tutorials on the Internet also describe how to deploy Sinatra or Rails-based apps on GAE/J, amongst which Samuel Goebert’s tutorial and Ola Bini’s stand out the most.

A few thoughts on comparing Heroku and GAE/J since I’ve deployed on both of them now. Heroku is definitely the easier platform to deploy, with a just few simple steps compared to the hoops I had to jump for GAE/J. Heroku also has the arguably more familiar persistence mechanism as it uses the familiar RDBMS (postgresql) compared to Google’s DataStore implementation, which today only has Ola Bini’s Bumble implementation compared to the established ActiveRecord, DataMpper and Sequel ORMs. In addition, Google’s implementation has many limitations, some harder to understand than others, which forces applications to suit this platform rather than having the platform really service the application.

On the other hand, Google’s set of tools in its console is really impressive, with graphs of usage, data and log viewers. Google AppEngine also has memcached, url fetcher, integration with Google accounts, sending out mails and hosts of other helper services. Also, Google’s brand name does add some weight when you throw in more mission critical projects though Heroku, backed by Amazon’s EC2 services is no push-over.  As for costs, until large deployments emerge it is hard to say which is the better deal, also it really depends highly on usage patterns.

So are they evenly balanced out? Not for me at least. For now I would still favor Heroku over GAE/J because Heroku is more friendly to Ruby developers. But who knows? It’s exciting times for Rubyists.