## Talk at BigData.SG February 2012

I gave this talk at the Big Data SG meetup in Feb 2012. It was a good session.

## Promotional interview with O’Reilly

I did this interview with my O’Reilly editor, Andy Oram, as a promotional video for my book ‘Exploring Everyday Things with R and Ruby’.

## Sex, Money and Evolution talk in RubyConf India 2012

I gave this talk in Pune, India in Mar 2012.

## My new book is out!

It’s been a while since the day I started writing Cloning Internet Applications with Ruby but it’s finally out! You can get it from bookstores, Amazon or its main site at Packt. It’s available in both a paper and a digital version (PDF), so get it now!

The main idea behind this book is actually quite simple and it started out in this blog. The first ‘clone’ I wrote was the Internet search engine in 200 lines of code, which was really very much scratching an itch that I had while I was in Yahoo, about a year and a half ago. I was interested in search engines then and I wanted to write a very simple search engine to illustrate the principles behind an Internet search engine. That gave me a chance to try out Sinatra, the minimalist web application framework, which eventually worked out really well for me. In turn, that kickstarted a whimsical challenge to do the same with Twitter in the same number of lines of code using Sinatra, and later with TinyURL in 40 lines of code. After that it was only a short leap to writing a whole book about it.

While the original idea revolved around writing clones with the smallest codebase possible, the book eventually evolved to be about writing minimal-feature clones using the Ruby libraries that I now love to use, i.e. Sinatra, DataMapper and Haml. The fundamental premise of the book still remained though, that is to illustrate how clones of popular Internet applications can be written with Ruby.

While this is a highly technical book with lots of code, I added in plenty of the reasons and rationale (according to me, that is) for why and how certain features of those applications work. For example, Twitter’s and Facebook’s features for connecting their users (‘friending’ features) in a social network are different, because they target users differently. Twitter’s friending features are primarily one-way and do not need explicit approval while Facebook’s friending features are two-way and need explicit approval from both parties. This means design and implementation differences, which are explained in detail in the book.

The experience of writing this book was good, and I learnt tremendously in the process, though it was a struggle. I can say this now that it’s published, but there were times I wanted to throw in the towel because of the mess my career was in then. I was still in Yahoo when I started, and I continued while doing my consulting work, which eventually led to Garena, then wrapped up before I left Garena, and it is finally being published now that I’m in HP Labs. It took longer to finish than my first book, because of the upheaval in my career in the past year or so and also because I wanted to come up with a better book overall. This resulted in a book that has been revised repeatedly as companies, statistics and technologies changed. When I started, TinyURL was the king of the hill of URL shorteners while bit.ly was up and coming, having just taken over as the default URL shortener in Twitter. TinyURL is now one of the players, with bit.ly probably the largest, but Twitter has come out with its own shortener. Facebook Connect was the way to go when I wrote the chapter on Facebook, but the Open Graph APIs have taken over since then. Twitter used both HTTP Basic Authentication and OAuth when I was writing, but has switched over completely to OAuth now. And I expect the list to go on and on.

Still, it has been a good journey and a good fight. Finally publishing it is a grand feeling second to none (except when I had my first child). Hope you enjoy the book!

## How to hire the best engineers without killing yourself

Garena, where I work now, is in expansion mode and I have been hiring engineers, sysadmins and so on to feed the development frenzy for platform revamps and product roadmaps. A problem I face when hiring engineers is that we’re **not** the only company doing so. This is especially true now as many companies have issued their annual bonuses (or lack thereof) and the ranks of the dissatisfied have joined the swelling exodus out of many tech companies. In other words, mass musical chairs with tech companies and engineers.

Needless to say this makes hiring a challenge. The good news is that there are plenty of candidates. The bad news is that it is hard to secure the correctly skilled engineer with the right mindset for a growing startup. At the same time, the identification and confirmation need to be swift, because if you are slow, even with a loosely fitting candidate, you can lose him/her within a day or two.

This causes me to wonder: what is the best way to go through a large list of candidates and successfully pick the best engineer, or at least someone who is in the top percentile of the list of candidates?

In Why Flip a Coin: The Art and Science of Good Decisions, H.W. Lewis wrote about a similar (though stricter) problem involving dating. Instead of choosing candidates the book talked about choosing a wife, and instead of conducting interviews, the problem in the book involved dating. The difference is that in the book you can only date one person at a time, while in my situation I can obviously interview more than one candidate. Nonetheless the problems are pretty much the same, since if I interview too many candidates and take too long to decide, they will be snapped up by other companies. Not to mention that I will probably foam at the mouth and die from interview overdose before that.

In the book, Lewis suggested this strategy — say we are choosing from a pool of 20 candidates. Instead of interviewing each and every one of those candidates we randomly choose and interview 4 candidates and choose the best out of the sample pool of 4. Now armed with the best candidate from the sample pool, we go through the rest of the candidates one by one until we hit one that is better than him, then hire that candidate.

As you would have guessed, this strategy is probabilistic and doesn’t guarantee the best candidate. In fact, there are 2 worst case scenarios. First, if we happen to choose the worst 4 candidates of the lot as the sample pool and the first candidate we choose outside of the sample pool is the 5th worst, then we would have gotten the 5th worst candidate. Not good. Conversely, if we have the best candidate in the sample pool, then we run the risk of doing 20 interviews and then losing the best candidate because it took too long to do the interviews. Bad again.

So is this a good strategy? Also, what is the best population pool (total number of candidates) and sample pool we want in order to maximize this strategy? Let’s be a good engineer and do another Monte Carlo simulation to find out.

Let’s start with a population pool of 20 candidates, then iterate through sample pool sizes from 0 to 19. For each sample pool size, we find the probability that the candidate we choose is the best candidate in the population. Actually we already know the probability when the sample pool is 0 or 19. When the sample pool is 0, it means we’re going to choose the first candidate we interview (since there is no comparison!), therefore the probability is 1/20 which is 5%. Similarly with a sample pool of 19, we will have to choose the last candidate, and the probability that he or she is the best is also 1/20 which is 5%.

Here’s the Ruby code to simulate this. We run it through 100,000 simulations per sample pool size to make the probability as accurate as possible, then save the results into a csv file called *optimal.csv*.

```ruby
require 'csv' # standard library; takes the place of the faster_csv gem on Ruby 1.9+

population_size = 20
sample_size = 0..population_size-1
iteration_size = 100000

CSV.open('optimal.csv', 'w') do |csv|
  sample_size.each do |size|
    is_best_choice_count = 0
    iteration_size.times do
      # create the population and randomize it
      population = (0..population_size-1).to_a.shuffle
      # get the sample pool (empty when size is 0)
      sample = population.first(size)
      rest_of_population = population.drop(size)
      # this is the best of the sample pool (nil when the pool is empty)
      best_sample = sample.max
      # find the best chosen by this strategy; with no sample pool
      # we simply take the first candidate we interview
      best_next = if best_sample.nil?
                    rest_of_population.first
                  else
                    rest_of_population.find { |i| i > best_sample }
                  end
      best_population = population.max
      # is this the best choice? count how many times it is the best
      is_best_choice_count += 1 if best_next == best_population
    end
    best_probability = is_best_choice_count.to_f / iteration_size
    csv << [size, best_probability]
  end
end
```

The code is quite self-explanatory (especially with all the in-code comments) so I won’t go into details. The results are shown in the line chart below, after you open the file in Excel and chart it accordingly. As you can see, if you choose 4 candidates as the sample pool, you will have roughly a 1 in 3 chance of choosing the best candidate. The best odds are when you choose 7 candidates as the sample pool, which gives you around a 38.5% probability of choosing the best candidate. Doesn’t look good.
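As a cross-check on the simulated numbers, the chance of landing the single best candidate can also be worked out exactly. This is a minimal sketch and not part of the original simulation: the helper name `best_pick_probability` is my own, and it assumes the strategy exactly as described above (skip a sample pool of `r` out of `n` candidates, then take the first one who beats the best of the sample). Under that assumption the probability is (r/n) × (1/r + 1/(r+1) + … + 1/(n-1)), with the special case 1/n when there is no sample pool.

```ruby
# Exact probability that the strategy picks the single best of n candidates
# when the first r candidates are used only as the sample pool.
# (Hypothetical helper for illustration; the name is not from the original post.)
def best_pick_probability(n, r)
  # with no sample pool we just take the first candidate
  return Rational(1, n) if r.zero?
  # the best candidate sits at position i (chance 1/n) and is picked only if
  # the best of the i-1 candidates before it fell inside the sample pool
  Rational(r, n) * (r...n).sum { |k| Rational(1, k) }
end

n = 20
(0...n).each do |r|
  puts format('%2d  %.4f', r, best_pick_probability(n, r).to_f)
end
```

The two end points come out at exactly 1/20, and the peak for 20 candidates sits at a sample pool of 7, which agrees with the chart.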

But to be honest for some candidates I don’t really need the candidate to be the ‘best’ (anyway such evaluations are subjective). Let’s say I want to get the candidate to be in the top quartile (top 25%). What are my odds then?

Here’s the revised code that does this simulation.

```ruby
require 'csv' # standard library; takes the place of the faster_csv gem on Ruby 1.9+

population_size = 20
sample_size = 0..population_size-1
iteration_size = 100000
# index range of the top 5 candidates once the population is sorted
top = (population_size-5)..(population_size-1)

CSV.open('optimal.csv', 'w') do |csv|
  sample_size.each do |size|
    is_best_choice_count = 0
    is_top_choice_count = 0
    iteration_size.times do
      population = (0..population_size-1).to_a.shuffle
      # sample pool is empty when size is 0
      sample = population.first(size)
      rest_of_population = population.drop(size)
      best_sample = sample.max
      # with no sample pool we simply take the first candidate we interview
      best_next = if best_sample.nil?
                    rest_of_population.first
                  else
                    rest_of_population.find { |i| i > best_sample }
                  end
      best_population = population.max
      top_population = population.sort[top]
      is_best_choice_count += 1 if best_next == best_population
      # count how many times the choice lands in the top quartile
      is_top_choice_count += 1 if top_population.include? best_next
    end
    best_probability = is_best_choice_count.to_f / iteration_size
    top_probability = is_top_choice_count.to_f / iteration_size
    csv << [size, best_probability, top_probability]
  end
end
```

The *optimal.csv* file has a new column, which shows the probability of choosing a top quartile (top 5) candidate. The new line chart is shown below, with the results of the previous simulation as a comparison.

Things look brighter now. The optimal sample pool size is 4 (though for practical purposes, 3 is good enough since the difference between 3 and 4 is small) and the probability of choosing a top quartile candidate shoots up to 72.7%. Pretty good!

Now this is with 20 candidates. How about a large candidate pool? How will this strategy stand up in say a population pool of 100 candidates?

As you can see, this strategy doesn’t work for getting the best out of a large pool (the sample pool is too large, the probability of success is too low) and it is worse than in a smaller population pool. However, if we want the top quartile or so (meaning being less picky), we only need a sample pool of 7 candidates and we have a probability of 90.63% of getting what we want. These are amazing odds!

This means if you’re a hiring manager with 100 candidates, you don’t need to kill yourself trying to interview everyone. Just interview a sample pool of 7 candidates, choose the best and then interview the rest one at a time until you reach one that is better than the best in the sample pool. You will have a 90% chance of choosing someone in the top 25% of those 100 candidates (which is probably what you want anyway)!
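The code for the 100-candidate run isn’t shown here, so this is only a rough sketch of how it could be repeated, assuming it is the same simulation with a population of 100 and the top 25 counted as the top quartile. The method name `top_pick_probability` and its parameters are my own, and it uses fewer iterations than the runs above to keep it quick, so expect the figure to wobble a little from run to run.

```ruby
# Estimate how often the strategy lands a candidate ranked in the top
# `top_count` of the population, for a given sample pool size.
# (Hypothetical helper for illustration; names and defaults are assumptions.)
def top_pick_probability(population_size, sample_size, top_count,
                         iterations: 20_000, rng: Random.new)
  hits = 0
  # candidates are ranked 0..population_size-1, higher is better
  threshold = population_size - top_count
  iterations.times do
    population = (0...population_size).to_a.shuffle(random: rng)
    best_sample = population.first(sample_size).max
    rest = population.drop(sample_size)
    # no sample pool means we take the first candidate we interview;
    # otherwise take the first one who beats the best of the sample
    pick = best_sample.nil? ? rest.first : rest.find { |c| c > best_sample }
    hits += 1 if pick && pick >= threshold
  end
  hits.to_f / iterations
end

# 100 candidates, sample pool of 7, top quartile = top 25
puts top_pick_probability(100, 7, 25)
```

Passing `top_count` as 1 turns this back into the "best candidate only" question, so the same method covers both charts.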
