# Create 3D anaglyph images with 3 lines of Ruby code

3D has always fascinated me. When I was young my brother and I had a ViewMaster and a Pan-Pet Panorama Stereo Viewer, both of which totally bowled us over when we first saw it. My brother as usual totally took it apart and fixed it up again, multiple times while I simply spent hours goggling at the thing. I have no idea where they are now but thinking back they were my first recollection of understanding stereoscopy is.

# A bit of history

It’s probably surprising to most people (at least it was to me) that the modern techniques of 3D imaging and stereoscopy dated way back before even photography. In fact the first few stereoscopic images were drawings. The picture below shows one of the earliest stereoscopic drawings by Jacopo Chimenti, a painter from Florence, Italy.

In 1838, Charles Wheatstone, a British inventor, published a paper that provided the scientific basis for stereography. He showed that the brain unifies the slightly different two-dimensional images from each eye into a single object of three dimensions. Wheatstone’s early stereographs were also drawings rather than photographs.

Photographic stereographs were first produced in 1849 by the English physicist David Brewster who improved the stereoscope and in 1849 the first true stereo camera with two lenses.

# 3D/stereographic imaging techniques

The principles of stereoscopy are quite simple. We see things in 3 dimensions (i.e. being able to see 3-dimensional depth) because each of our 2 eyes actually see a slightly different image. This is because our eyes positioned are apart from each other which generates what is called binocular disparity. Recreating this effect with a 2-dimensional image then allows us to ‘see’ the image in 3D.

There are a number of ways to do this but generally the principle revolves around creating a set of 2 images, one for each eye and ‘forcing’ the left eye to view the left image and the right eye to view the right image.

## Freeviewing

This method places the left image on the right side and the right image on the left side. To view the image in stereo, force your eyes to go cross-eyed, which will produce 3 images. Then slowly ease the eyes to view the middle image in 3D. This is not as silly as it sounds, and actually works though it can be a strain on the eyes (obviously).

## Wiggle method

The wiggle method surprises a lot of people (including me when I first read about it) but it can sometimes be pretty effective. Basically you use the 2 images and create a GIF that alternates between each other.

## Viewers

This method uses various kinds of viewers, from the 18th century Wheatstone stereoscope to the popular Holmes American stereoscope and the transparency viewers like the ViewMaster and the Pan-Pet that I grew up with. It also includes high tech head-mounted displays.

## Parallax barrier and lenticular printing

These 2 methods are similar though parallax barrier is pretty high-tech while lenticular prints is as low-tech as it can be. Parallax barrier essentially places a barrier in front of an image source, usually a LCD display, with a series of precision slits, allowing each eye to see a different set of pixels. This is famously used in the Nintendo 3DS.

Lenticular printing uses a similar technique but with lenticular lenses. Lenticular prints are popular as novelty items and you’d probably encountered them in many places without knowing what it was called.

These 2 methods are often also classified as ‘autostereoscopy’ or glasses-free 3D.

## 3D glasses

This is probably the method you’re most likely to encounter nowadays in movies and in 3D TVs. I classify both passive and active glasses in this category though the actual technologies can be vastly different such as alternating different frames with special projectors and using polarized light.

Which brings us to the type of 3D image we’ll be trying out today — anaglyphs.

The idea of anaglyphs is simple. We start with the 2 left and right images again. This time they are superimposed on each other, but the left would be corrected to show only red color while the right would be corrected to show cyan color. Actually we can use other colors besides red and cyan but these 2 colors are the most popular (and patent-free).

The image is then viewed with a pair of glasses that filter soff red on the left lens and cyan on the right lens. The results is that the left eye would only see the left image and the right eye the right image, therefore generating the perception of depth.

The main problem with this technique (besides the necessity of wearing glasses) is that the colors are a bit wonky. Also if some color from the left image gets into the right eye (and vice versa) a faintly colored “ghost” will be seen. And if the filter from each lens filters off different amount of light resulting in luminance imbalance, it can easily cause headaches (happened to me lots of times during the experiments I did below).

However there are plenty of advantages of anaglyphs. Firstly there isn’t a need for fancy high-tech equipment. Anaglyph red-cyan glasses can be easily created at home or bought cheaply and as you will see below, creating anaglyphs is child’s play.

# Creating anaglyphs with Ruby

Creating anaglyphs is ridiculously easy with RMagick. This is the whole script I used.

#!/usr/bin/env ruby

require ‘rubygems’
require ‘rmagick’
include Magick

left = ImageList.new(ARGV[0]).gamma_correct(1,0,0)
right = ImageList.new(ARGV[1]).gamma_correct(0,1,1)
anaglyph = left.composite right, CenterGravity, ScreenCompositeOp

anaglyph.write(‘anagylph.jpg’)


As you can see, the real work is done in only 3 lines of code. Firstly I create an ImageList object (assuming the first parameter is the file name of the first image). Then I use #gamma_correct and filter off the greens and blues of the left image while keeping the reds. The for the right image, I do the same thing, except this time I filter off the reds while keeping the greens and blues. Finally I use #composite and blend the 2 images together using the screen blending mode (which lightens the image after blending). I used CenterGravity to place the right image at the center of the left image here but it really doesn’t matter since both images are supposed to be the same size anyway. And what remains is just to write the anaglyph back into a file.

Of course, all of these means nothing if we can’t capture the left and right images. For this there are the stereo cameras, ranging from the amazing to the weird and the totally slap-together.

Alternatively you can do the same thing with a single camera, although I wouldn’t recommend it for anything else except still shots. To do this, some recommend to use what is known as the ‘cha-cha’ technique. This requires the photographer to snap a picture then shifting weight slightly to the left or right therefore moving a few centimeters to take a reasonably good second image.

Me? I didn’t want to buy a fancy 3D camera and wasn’t inclined do the cha-cha so I applied a bit of MacGyverism on my primary camera.

It’s not perfect but it does take a reasonably good picture.

# Edge detection with the Sobel operator in Ruby

I was never much into image processing. Sure, like most programmers I dabbled into it for cropping images or doing some fancy-schmancy filtering effects stuff. I even wrote a Flickr clone for my last book which has a rather impressive photo editor (mashed up from Pixlr, not mine). But I never thought much on how those effects were done or who came up with them in the first place. That is until I met Irwin Sobel.

For those who know their image processing, this should ring bells immediately. Yes, it’s that Sobel. But a minute to give some background — Irwin is a colleague of mine working in the Mobile and Immersive Experience Lab in HP Labs. I was visiting about two weeks ago and was introduced to him and his current projects. Inevitably someone talked about the Sobel operator, a commonly used algorithm used for edge detection. I was, unfortunately, totally clueless about what it was. Not good. So not surprisingly I ended up Googling for ‘Sobel operator’ at the first possible chance and found out what it was.

The Sobel operator is an algorithm for edge detection in images. Edge detection for those who are not familiar with the term, is an image processing technique to discover the boundaries between regions in an image. It’s an important part of detecting features and objects in an image. Simply put, edge detection algorithms help us to determine and separate objects from background, in an image.

The Sobel operator does this in a rather clever way. An image gradient is a change in intensity (or color) of an image (I’m over simplifying but bear with me). An edge in an image occurs when the gradient is greatest  and the Sobel operator makes use of this fact to find the edges in an image. The Sobel operator calculates the approximate image gradient of each pixel by convolving the image with a pair of 3×3 filters. These filters estimate the gradients in the horizontal (x) and vertical (y) directions and the magnitude of the gradient is simply the sum of these 2 gradients.

The magnitude of the gradient, which is what we use, is calculated using:

That’s the simplified, 2-paragraph theory behind the algorithm. If this fascinates you, you should grab a couple of books on image processing and computer vision and go through them.

Let’s look at how to implement the Sobel operator. This is simply by creating the 2 filters and running them through each pixel in the image, starting from the left and going right. Note that because the filter is a 3×3 matrix, the pixels in the first and last rows as well as the first and last columns cannot be estimated so the output image will be a 1 pixel-depth smaller than the original image.

To calculate the pixel in the right side of the equation (the one with coordinates 1,1) the following equation is used:

output pixel [1,1] = ([0,0] x -1) + ([0,1] x 0) + ([0,2] x 1) + ([1,0] x -2) + ([1,1] x 0) + ([1,2] x 2) + ([2,0] x -1) + ([2,1] x 0) + ([2,2] x 1)

To simplify matters even more, the grayscale version of the original image is usually used.

Now let’s look at the Ruby implementation


require 'chunky_png'

class ChunkyPNG::Image
def at(x,y)
ChunkyPNG::Color.to_grayscale_bytes(self[x,y]).first
end
end

img = ChunkyPNG::Image.from_file('engine.png')

sobel_x = [[-1,0,1],
[-2,0,2],
[-1,0,1]]

sobel_y = [[-1,-2,-1],
[0,0,0],
[1,2,1]]

edge = ChunkyPNG::Image.new(img.width, img.height, ChunkyPNG::Color::TRANSPARENT)

for x in 1..img.width-2
for y in 1..img.height-2
pixel_x = (sobel_x[0][0] * img.at(x-1,y-1)) + (sobel_x[0][1] * img.at(x,y-1)) + (sobel_x[0][2] * img.at(x+1,y-1)) +
(sobel_x[1][0] * img.at(x-1,y))   + (sobel_x[1][1] * img.at(x,y))   + (sobel_x[1][2] * img.at(x+1,y)) +
(sobel_x[2][0] * img.at(x-1,y+1)) + (sobel_x[2][1] * img.at(x,y+1)) + (sobel_x[2][2] * img.at(x+1,y+1))

pixel_y = (sobel_y[0][0] * img.at(x-1,y-1)) + (sobel_y[0][1] * img.at(x,y-1)) + (sobel_y[0][2] * img.at(x+1,y-1)) +
(sobel_y[1][0] * img.at(x-1,y))   + (sobel_y[1][1] * img.at(x,y))   + (sobel_y[1][2] * img.at(x+1,y)) +
(sobel_y[2][0] * img.at(x-1,y+1)) + (sobel_y[2][1] * img.at(x,y+1)) + (sobel_y[2][2] * img.at(x+1,y+1))

val = Math.sqrt((pixel_x * pixel_x) + (pixel_y * pixel_y)).ceil
edge[x,y] = ChunkyPNG::Color.grayscale(val)
end
end

edge.save('engine_edge.png')



First thing you’d notice is that I used a library called ChunkyPNG, which is PNG manipulation library that is implemented in pure Ruby. While wrappers over ImageMagick (like RMagick) is probably the defacto image processing and manipulation library in Ruby, I thought it’s kind of pointless to do a Sobel operator with ImageMagick since it already has its own edge detection implementation.

To simplify the implementation, I opened up the Image class in ChunkyPNG and added a new method that will return a grayscale pixel at a specific location. Then I created the 2 Sobel filters with arrays of arrays. I created 2 nested loops to iterate through each pixel column by column, then row by row and at each pixel I used the equation above to calculate the gradient by applying the x filter then the y filter. Finally I used the gradient and set a grayscale pixel based on the gradient value, on a new image.

Here you can see the original image, which I reused from the Wikipedia entry on Sobel operator.

And the edge detected image with the x filter applied only.

This is the edge detected image with the y filter only.

Finally this is the edge detected image with both x and y filters applied.

This short exercise might not be technically challenging but it made me appreciate the pioneers who invented things that we now take for granted. Here’s a final picture, one with myself and Irwin (he is the guy who’s sitting opposite me), and a bunch of other colleagues at HP Labs Palo Alto over lunch. Thanks Irwin, for the Sobel operator!

# Generate MP3 waveforms with Ruby and R

I blame Rully for this. If it wasn’t for him I wouldn’t have been obsessed with this and spent a good few hours at night figuring it out last week. It all started when Rully mentioned that he knew how many beeps there are in the Singapore MRT (subway system) ‘doors closing’ warning. There are 13 beeps, he explained and he said he found out by running a WAV recording of it through a MatLab package which in turn generated a waveform that allowed him to count the number of beeps accurately (it is normally too fast for the ear to determine the number of beeps). Naturally such talk triggered the inner competitive geek in me. I happened to be doing Ruby and R integration at the moment (watch out for my talk at RedDotRubyConf at the end of April) so I had no choice but to retrace his steps using my new toys. Really.

MP3 is a compressed and lossy audio encoding format and to generate a waveform I decided to convert it to WAV first. Doing this is relatively simple — there is a library called icanhasaudio that wraps around LAME to encode and decode audio. Naturally you need to have LAME installed first in your machine before you can do this but once you have done that, decoding the MP3 is a breeze:

reader = Audio::MPEG::Decoder.new
File.open('mrt_closing.mp3', 'rb') do |input|
File.open('out.wav', 'wb')  do |output|
end
end


That was easy enough. The next step was a bit trickier though. To understand how to create a waveform from a WAV file let’s digress a bit into what a WAV file is. WAV is an audio file format, originally from IBM and Microsoft, used to store audio bitstreams.WAV is an extended RIFF format, which is a little-endian version of the AIFF format (which is big-endian). In RIFF, data are stored in ‘chunks’ and for WAV, there are basically 2 types of chunks — a format chunk and a sound data chunk. The format chunk contains the parameters describing the waveform for example its sample rate, and the data chunk contains the actual waveform data. There are other chunks but because I’m really only interested in the waveform I’ll conveniently ignore them. This is how a minimal WAV file looks like:

The data chunk has a chunk ID which is always ‘data’, and a chunk size that is a long integer. Data in the data chunk is stored in sample points. A sample point is a value that represents a sample of a sound at a given moment in time. Each sample point is stored as a linear 2’s-complement value from 9 – 32 bits wide, specificed in the BitsPerSample field in the format chunk. Sounds in a WAV file can also come in multiple channels (for e.g. a stereo sound will come in 2 channels, like our file.) For such multi-channel sounds, the sample points are interleaved, one from each channel. A grouping of sample points for a single point in time for all the channels is called a sample frame. This graphic explains it best.

If you open the WAV up with a hex editor it will look something like this:

I wouldn’t go through the format chunks, in fact there is an easier way to find out the format, and that is for me to open up the WAV file using QuickTime and inspect it.

For more information you can get the WAV specs here. This is the information we found of the WAV file that we will use in a while:

• Format : Linear PCM
• Number of channels : 2
• Number of bits per sample : 16

In order to create the waveform, I opened up the WAV file, and collected each sample point from each channel and convert that sample point into an integer. This will be the data file I will use later in R to generate the waveform. Let’s take a look at the code now that we use to generate the data:

FasterCSV.open('wavdata.csv', 'w') do |csv|
csv << %w(ch1 ch2 combined)
File.open('out.wav') do |file|
while !file.eof?
while !wavedata.eof?
csv << [ch1, ch2,ch1+ch2]
end
end
end
end
end


Note that I didn’t read the number of channels from the WAV file but instead assumed it has 2 channels (stereo), to simplify the code. Firstly I open up the WAV file. I ignored the format chunk completely and looked for the data chunk only. Then I read the length of the data chunk by reading the next 4 bytes and unpacking it as a long integer (hence ‘l’ format in the String#unpack method). This gives me the length of the data chunk that I will need to read next.

Next, for ease of reading I wrap the returned data string in a StringIO object. As we found out earlier, each sample has 2 channels and each sample point has 16 bits, so we need to retrieve 32 bits or 4 bytes. Since each sample point has 16 bits, this means a short integer, so we unpack the 4 bytes that are read into 2 short integers, and this will give us the 2 sample points of 2 channels of that sample frame.

After that it’s a simple matter of stuffing the sample points into a CSV file.

Finally, to generate the waveform from the data file, I run it through a simple R script, which I integrated with Ruby using the Ruby Rserve client.

script=<<-EOF
png(file='/Users/sausheong/projects/wavform/mrtplot.png', height=800, width=600, res=72)
par(mfrow=c(3,1),cex=1.1)
plot(wav_data$combined, type='n', main='Channel 1', xlab='Time', ylab='Frequency') lines(wav_data$ch1)
plot(wav_data$combined, type='n', main='Channel 2', xlab='Time', ylab='Frequency') lines(wav_data$ch2)
plot(wav_data$combined, type='n', main='Channel 1 + Channel 2', xlab='Time', ylab='Frequency') lines(wav_data$combined)
dev.off()
EOF
Rserve::Connection.new.eval(script)


The script generates the following PNG file:

As you can see from the waveform (ignoring the first 2 bulges, which are ‘doors’ and ‘closing’ respectively) there are 13 sharp pointy shapes, which represent a beep each.

You can get the code and images here from my GitHub repository. If you’re interested to hear more about Ruby and R integration, do come down to the RedDotRubyConf on 22 and 23 April 2011!

# My new book is out!

It’s been a while from the day I started writing Cloning Internet Applications with Ruby but it’s finally out! You can get it from bookstores, Amazon or its main site at Packt. It’s available in both a paper and a digital version (PDF), so get it now!

The main idea behind this book is actually quite simple and it started out in this blog. The first ‘clone’ I wrote was the Internet search engine in 200 lines of code, which was really very much scratching an itch that I had while I was in Yahoo, about a year and a half ago. I was interested in search engines athen and I wanted to write a very simple search engine to illustrate the principles behind an Internet search engine. That gave me a chance to try out Sinatra, the minimalist web application framework, which worked out really well for me eventually. In turn, that kickstarted me into on a whimsy challenge to do the same with Twitter in the same number of lines of code, using Sinatra and later, TinyURL in 40 lines of code. After that it was only a short leap to writing a whole book about it.

While the original idea revolved around writing clones with the smallest codebase possible, eventually the book evolved to be about writing minimal-feature clones written using the Ruby libraries that I now love to use i.e. Sinatra, DataMapper and Haml. The fundamental premise of the book still remained though, that is to illustrate how clones of popular Internet applications can be written with Ruby.

While this is a highly technical book with lots of code, I added in plenty of elements of the reasons and rationale (according to me, that is) why and how certain features of those applications work. For example, Twitter’s and Facebook’s features for connecting their users (‘friending’ features) in a social network are different, because they target users differently. Twitter’s friending features are primarily one-way and do not need explicit approval while Facebook’s friending features are two-ways and need explicit approvals from both parties. This means design and implementation differences, which are explained in detail in the book.

The experience in writing this book was good, and I have learnt tremendously in the process though it was a struggle. I can say this now that it’s published, but there were certain times I wanted to throw in the towel because of the messy my career was in then. I was still in Yahoo when I started, and I continued while doing my consulting work which eventually led to Garena, then wrapping up before I left Garena and finally being published now as I’m in HP Labs. It took a longer time to finish this than my first book, because of the upheaval in my career in the past year or so and also because overall I wanted to come up with a better book. This resulted in a book that has been revised repeated as companies, statistics and technologies changed. When I started, TinyURL was the king of the hill of URL shorteners while bit.ly was up and coming having just taken over as the default URL shortener in Twitter. TinyURL is now one of the players, with bit.ly probably the largest but Twitter has come out with its own shortener. Facebook Connect was the way to go when I wrote the chapter on Facebook, but the Open Graph APIs has taken over since then. Twitter used both HTTP Basic Authentication and OAuth when I was writing, but has switched over completely to OAuth now. And I expect the list to go on and on.

Still, it has been a good journey and a good fight. Finally publishing it is a grand feeling second to none (except when I had my first child). Hope you enjoy the book!

# Toilets and engineers

It started with a lunch conversation that slowly turned into a coffee time mini-debate. You see, I moved into a new job recently at the Applied Cloud Computing Lab in HP Labs Singapore. My first task was to hire engineers to staff the lab and while the final numbers were not absolute, the number 70 was bandied around for the population of the facilities. We have half a floor at Fusionopolis, plenty of space really but the bone of contention was more delicately focused on more sanitary facilities i.e. the toilets.

The problem was that the whole floor shares a single pair of toilets (one for male and another for female of course). I was totally convinced that with 70 people in the office, we will reach bladder or bowel apocalypse, a disaster in the waiting. My other colleagues however were less worried — the other floors seem to be doing fine. It wasn’t as if we could do anything about it, no one’s going to magically add in new toilets for us no matter how long the toilet queue was; but I had to know. I was also curious of course — what is the algorithm used to determine the number of toilets per square meter of occupancy?

Looking up building regulations found nothing (at least not online) in Singapore but UK has some regulations covered by the Health and Safety Executive.

Looking at facilities used by men only:

 Number of men at work Number of toilets Number of urinals 1-15 1 1 16-30 2 1 31-45 2 2 46-60 3 2 61-75 3 3 76-90 4 3 91-100 4 4

Well, the toilets seem to almost fit into the UK regulations, but these are just regulations. How did these numbers come about and how realistic are they? To better model the usage scenarios, I did a Monte Carlo simulation on the usage pattern of the toilets, based on a simple set of assertions.

Taking an educated guess of 1 in 7 of engineers to be female, that leaves us with 60 male engineers sharing a single toilet facility. The toilet facility has 3 urinals and 2 stalls, one of which is a flush toilet, the other is a squat toilet (which in general no one uses any more).

An important piece of information is to find out how many times an average person need to urinate a day. It turns out that our bladder will signal us to make a trip to the toilet when it fills up to about 200ml. Combine with the information that an average urine output of an adult is 1.5 liters a day, this means that we roughly need to go about 8 times a day. This in turn means 8 times in 24 hours or 1 time in 3 hours on an average. The normal working hours in HP are 8:30am – 5:30pm, which is 9 hours, so this means on an average a person makes a beeline to the toilet 3 times during the course of the working day in the office.

I modeled the scenario discretely, per minute. This means a person in the office would make a decision to go to the loo or not every minute within the 9 hours in the office, or 540 minutes. Coupled with an average of 3 times in 540 minutes, this means the probability that a person would go to the toilet every count of the minute is 3 out of 540. In the case of urinating, my estimate is that a person will be able to complete the necessary job within 2 minute on an average and vacate the facility.

Armed with this model, let’s look at the simulation code:

require 'rubygems'
require 'faster_csv'

# population of 60 people on the same floor
population = 60
# period of time from 8:30am -> 5:30pm is 9 hours
period = 9
# 8 times a day -> 8 times in 24 hours -> 1 time every 3 hours
# within a period of 9 hours -> 3 times
# probability in every min -> 3 out of 9 * 60
times_per_period = 3
# rate of servicing
mu = 2
# number of people in the queue
queue = 0
# number of stalls available
stalls = 3
# number of occupied stalls
occupied = 0

# total of all queue sizes for calculating average queue size
queue_total = 0
# max queue size for the day
max_queue_size = 0

FasterCSV.open('queue.csv', 'w') do |csv|
csv << %w(t queue occupied)
(period * 60).times do |t|
# at every rate of service interval, vacate any and all people from the toilet and replace them with people in the queue
if t % mu == 0
occupied = 0
while occupied < stalls and queue > 0
occupied += 1
queue -= 1
end
end
population.times do
# does the engineer want to relieve himself?
if (rand(period * 60) + 1) <= times_per_period
if occupied < stalls
occupied += 1
else
queue += 1
end
end
end

csv << [t, queue, occupied]
max_queue_size = queue if queue > max_queue_size
queue_total = queue_total + queue
end
end

puts "** Max queue size : #{max_queue_size}"
puts "** Total mins in queue by all : #{queue_total} mins"


Most of the code are quite self explanatory. However notice that mu is 3 (minutes) instead of 2 as we have discussed earlier. The reason for this is that in keeping the simulation code simple, I’ve not factored in that if we clear the toilets every 2 minutes, people who entered during the 2nd minute (according to the simulation) will be cleared in just 1 minute. To compensate for this, I use a mu variable that is 1.5 times the service rate.

The simulation writes to queue.csv, which contains the queue size and toilet occupancy at every interval. This are the results:

** Max queue size : 3
** Total mins in queue by all : 17 mins


I opened the CSV up and used the data to create a simple histogram:

 Queue size Number of minutes Percentage 0 526 97.6% 1 9 1.8% 2 3 0.4% 3 2 0.2%

From the results it’s pretty obvious that the architect and engineers who built Fusionopolis knew their stuff. These were the quantitative conclusions:

1. Queues were formed less than 3% of the time of the day
2. The maximum number of people in the queue were 3 people and that happened for only less than 2 minutes in a day
3. The total time spent by everyone in the office, in a queue to the toilet was 17 mins

This is for the men’s urinals, let’s look at the stalls now. The same simulation script can be used, just changing the parameters a bit. Instead of mu = 3, we use mu = 8. Instead of 3 urinals, we have 2 stalls and instead of 3 times over the period, we use 1 time over the working period.

These are the results:

** Max queue size : 3
** Total mins in queue by all : 81 mins

 Queue size Number of minutes Percentage 0 485 89.8% 1 33 6.1% 2 21 3.9% 3 1 0.2%

The problem seems a bit more serious now though not significantly so, though the chances of being in a a queue are higher now (10%).

In conclusion, I was wrong but I’m glad to be wrong in this case. Waiting in a toilet queue is seriously not a fun thing to do.

# How to hire the best engineers – part 2

In my previous blog post, I came to the conclusion that

If you’re a hiring manager with a 100 candidates, just interview a sample pool of 7 candidates, choose the best and then interview the rest one at a time until you reach one that better than the best in the sample pool. You will have 90% chance of choosing someone in the top 25% of those 100 candidates.

I mentioned that there are 2 worst case scenarios for this hiring strategy. Firstly, if coincidentally we selected the 7 worst candidates out of 100, then the next one we choose will be taken regardless how good or bad it is. Secondly, if the best candidate is already in the sample pool, then we’ll be going through the rest of the population without finding the best. Let’s talk about these scenarios now.

The first is easily resolved. The probabilities are already considered and calculated in our simulation so we don’t need to worry about that any more. The second is a bit tricky. To find out the failure probability, we tweak the simulation a bit again. This time round instead of getting every sample pool size, we focus on the best results i.e. sample pool size of 7.

require 'rubygems'
require 'faster_csv'

population_size = 100
sample_size = 0..population_size-1
iteration_size = 100000
top = (population_size-25)..(population_size-1)
size = 7
population = []
frequencies = []

iteration_size.times do |iteration|
population = (0..population_size-1).to_a.sort_by {rand}
sample = population.slice(0..size-1)
rest_of_population = population[size..population_size-1]
best_sample = sample.sort.last
rest_of_population.each_with_index do |p, i|
if p > best_sample
frequencies << i
break
end
end
end

puts "Success probability : #{frequencies.size.to_f*100.0/iteration_size.to_f}%"



Notice we’re essentially counting the times we succeed over the iterations we made. The result for this is a success probability of 93%!

Let’s look at some basic laws of probability now. Let’s say A is the event that the strategy works (the sample pool does not contain the best candidate), so the probability of A is P(A) and this value is 0.93. Let’s say B is the event that the the candidate we find out if this strategy is within the best quartile(regardless if A succeeds or not), so the probability of B is P(B). We don’t know the value of P(B) and we can’t really find P(B) since A and B are not independent. What we really want is  . What we do know however is P(B|A) which is the probability of B given A happens. This value is 0.9063 (from the previous post),

Using the rule of multiplication:

which is 84.3% probability that the sample pool does not contain the best candidate and we get the top quartile of the total candidate population.

Next, we want to find out, given that we have a high probability of getting the correct candidates in, how many candidates will we have to interview before we find one that is better than the best in the sample pool? Naturally if we have to interview a lot of candidates before finding one, the strategy isn’t practical. Let’s tweak the simulation again, this time to count the successful ones and get a histogram of the number of interviews before we hit jackpot.

What kind of histogram do you think we will get? An initial thought was that it would be a normal distribution, which is not great news for this strategy, because a histogram peaks around the mean, which means the number of candidates we need to interview is around 40+.

require 'rubygems'
require 'faster_csv'

population_size = 100
sample_size = 0..population_size-1
iteration_size = 100000
top = (population_size-25)..(population_size-1)
size = 7
population = []
frequencies = []

iteration_size.times do |iteration|
population = (0..population_size-1).to_a.sort_by {rand}
sample = population.slice(0..size-1)
rest_of_population = population[size..population_size-1]
best_sample = sample.sort.last
rest_of_population.each_with_index do |p, i|
if p > best_sample
frequencies << i
break
end
end
end

FasterCSV.open('optimal_duration.csv', 'w') do |csv|
rest_of_population = population[size..population_size-1]
rest_of_population.size.times do |i|
count_array = frequencies.find_all{|f| f == i}
csv << [i+1, count_array.size.to_f/frequencies.size.to_f]
end
end



The output is a CSV file called optimal_duration.csv. Charting the data as a histogram, surprisingly we get power law graph like this:

This is good news for or strategy! Looking at our dataset, the probability of the first candidate in the rest of the candidate population to be the one is 13.58% while the probability of getting our man (or woman) in the first 10 candidates we interview is a whopping 63.25%!

So have we nailed the question if the strategy is good one? No, there is still one last question unanswered and this is the breaker. Stay tuned for part 3! (For those who are reading this, an exercise — tell me what you think is the breaker question?)

# Umbrellas to work, or why engineers should get married

When I was working in Yahoo at Suntec City, I used to take the MRT to work every day and my route is completely sheltered right to the doorsteps of the office so the weather didn’t really have much effect on me. I still take the MRT to the office now that I am at Garena, but there is a short distance from the MRT station to the office, about 5 minutes of walking, where it is subject to the elements.

Now I don’t take any umbrellas along with me, so with the recent change in weather (it has been raining cats and dogs and all sorts of mammals in between) I have been caught a few times in some unpleasant wetness. It is during one of these wet commutes from home to office I was thinking of ways of making sure that will not happen again.

The idea was like this. I will place some umbrellas at the office and also at home. Whenever I leave the house or the office and if it is raining, I will take the umbrella to the office or back home respectively. Naturally if it’s not raining then I wouldn’t want to look like a dork and carry one.

Therein lies the problem. Let’s say I keep 1 umbrella at the office and another at home. If it rains when I’m leaving the office, I will carry that one home, leaving 2 at home and none at the office. If it rains the next day when I leave home it’s ok, I will just take an umbrella to the office, making it 1 – 1 again. If it doesn’t rain and therefore I don’t take any umbrellas to the office, I will be left with 2 – 0. If it happens to rain when I’m leaving the office I’m stuck with getting wet again.

What if I have 2 umbrellas at the office and 2 at home? In this case I will be caught wet only if the same occurrence (rain when leaving the office followed by no rain when leaving the house) happens 2 days in a row, which probability is lower than if it happens just once. Still, that can happen and I want to get assurance that the risk of getting wet is really minimal. In that case, the question becomes, how many umbrellas should I leave at the office and similarly at home such that the probability of getting wet is negligible? Note that it doesn’t really matter if one place has more umbrellas than the other since it will be the same as the lower number of umbrellas in the long run.

To find out, I used a popular (and really pragmatic and therefore very ‘engineering’) method of finding probabilities — the Monte Carlo simulation method. The name sounds fancy but it’s really just an algorithm that uses repeated random samples to find the answer. In our case, what we want to do is to find out, over large sample, the average number of trips I will make before I run the risk of getting wet.

To find this, I will need to inspect the probability that it will rain at any given day. For easy of calculations, I use a probability between 1% chance that it will rain and 99% chance that it will rain. Why don’t I include 0% or 100%? This is because if it is 0% it means it will never rain and so I will never need to use an umbrella, and if it 100% it will always rain and therefore I will always bring an umbrella back or forth.

On top of that, I will run the simulation for the case where I have 2 umbrellas at home and at the office, up to 50 umbrellas at home and at the office. To give a large enough representation, I run this over 1000 iterations (more will be better but it will take too long to run). This is the Ruby code for the simulation.

require 'rubygems'
require 'faster_csv'

umbrellas_range = 2..50
probability_range = 1..99
@location = :home

total_trips = {}
iterations = 1000

def walks_to(loc)
location = loc
@num_of_trips += 1
end

def office?()  @location == :office; end
def home?() @location == :home; end
def raining?(probability) rand(99) < probability; end

FasterCSV.open('umbrellas.csv', 'w') do |csv|
csv << [''] + probability_range.to_a
umbrellas_range.each do |umbrellas|
probability_range.each do |probability|
total_trips[probability] = 0
iterations.times do
@num_of_trips = 0
wet = false
home = umbrellas
office = umbrellas

while not wet
if home?
walks_to :office
if raining?(probability)
if home > 0
home -= 1
office += 1
else
wet = true
end
end

elsif office?
walks_to :home
if raining?(probability)
if office > 0
office -= 1
home += 1
else
wet = true
end
end
end
end

total_trips[probability] += @num_of_trips
end
end

row = [umbrellas]
total_trips.sort.each { |pair| row << pair[1]/iterations }
csv << row
end
end


The Ruby program writes to a simple CSV file called umbrellas.csv with the first row being the probability, and the first column the number of umbrellas at each location. If you chart each row with the y-axis being the number of trips and the x-axis being the probability of rain from 1% to 99%, you will find a chart like this:

Now that I have the average number of trips that I will make before I get wet, I want to actually see what that means for Singapore. To do this, I went to the National Environment Agency’s website and dug out the weather statistics for Singapore.  There is a statistic that gives the average number of rainy days in any given month for the past 118 years. I take that for each month and divide that with the number of days in that month to derive the monthly probability of rain as in the following table:

 Month Probability of rain January 48% February 39% March 45% April 50% May 45% June 43% July 42% August 45% September 47% October 52% November 63% December 61%

Finally I match that with the Monte Carlo simulation I made earlier. The average number of trips for each month, must be between 56 to 62 days (each trip is one way only, so every day is 2 trips). For example, the statistics shows that the wettest months are in November and December which is 63% and 61% probability of rain respectively. This means I will need to have 38 umbrellas at each location for each of these 2 months (please check your umbrellas.csv file; open it up with Excel). In February where the probability of rain is the lowest (39%), I will only need 23 umbrellas at each location to almost guarantee that I will not get wet (the whole deal assumes at the end of each month, I replenish each location with the necessary number of umbrellas, of course).

That evening when I proudly told my wife my findings at home, she stared at me for while (frostily if I might add) then gave me a small folding umbrella, which I now carry in my laptop bag.

And that is the reason why engineers should get married.

(I have been inspired by this book — Digital Dice : Computational Solutions to Practical Probability Problems, by Paul J. Nahin)