## Generate MP3 waveforms with Ruby and R

I blame Rully for this. If it wasn’t for him I wouldn’t have been obsessed with this and spent a good few hours at night figuring it out last week. It all started when Rully mentioned that he knew how many beeps there are in the Singapore MRT (subway system) ‘doors closing’ warning. There are 13 beeps, he explained and he said he found out by running a WAV recording of it through a MatLab package which in turn generated a waveform that allowed him to count the number of beeps accurately (it is normally too fast for the ear to determine the number of beeps). Naturally such talk triggered the inner competitive geek in me. I happened to be doing Ruby and R integration at the moment (watch out for my talk at RedDotRubyConf at the end of April) so I had no choice but to retrace his steps using my new toys. Really.

MP3 is a compressed and lossy audio encoding format and to generate a waveform I decided to convert it to WAV first. Doing this is relatively simple — there is a library called icanhasaudio that wraps around LAME to encode and decode audio. Naturally you need to have LAME installed first in your machine before you can do this but once you have done that, decoding the MP3 is a breeze:

reader = Audio::MPEG::Decoder.new File.open('mrt_closing.mp3', 'rb') do |input| File.open('out.wav', 'wb') do |output| reader.decode(input, output) end end

That was easy enough. The next step was a bit trickier though. To understand how to create a waveform from a WAV file let’s digress a bit into what a WAV file is. WAV is an audio file format, originally from IBM and Microsoft, used to store audio bitstreams.WAV is an extended RIFF format, which is a little-endian version of the AIFF format (which is big-endian). In RIFF, data are stored in ‘chunks’ and for WAV, there are basically 2 types of chunks — a format chunk and a sound data chunk. The format chunk contains the parameters describing the waveform for example its sample rate, and the data chunk contains the actual waveform data. There are other chunks but because I’m really only interested in the waveform I’ll conveniently ignore them. This is how a minimal WAV file looks like:

The data chunk has a chunk ID which is always ‘data’, and a chunk size that is a long integer. Data in the data chunk is stored in sample points. A **sample point** is a value that represents a sample of a sound at a given moment in time. Each sample point is stored as a linear 2’s-complement value from 9 – 32 bits wide, specificed in the BitsPerSample field in the format chunk. Sounds in a WAV file can also come in multiple channels (for e.g. a stereo sound will come in 2 channels, like our file.) For such multi-channel sounds, the sample points are interleaved, one from each channel. A grouping of sample points for a single point in time for all the channels is called a **sample frame**. This graphic explains it best.

If you open the WAV up with a hex editor it will look something like this:

I wouldn’t go through the format chunks, in fact there is an easier way to find out the format, and that is for me to open up the WAV file using QuickTime and inspect it.

For more information you can get the WAV specs here. This is the information we found of the WAV file that we will use in a while:

- Format : Linear PCM
- Number of channels : 2
- Number of bits per sample : 16

In order to create the waveform, I opened up the WAV file, and collected each sample point from each channel and convert that sample point into an integer. This will be the data file I will use later in R to generate the waveform. Let’s take a look at the code now that we use to generate the data:

FasterCSV.open('wavdata.csv', 'w') do |csv| csv << %w(ch1 ch2 combined) File.open('out.wav') do |file| while !file.eof? if file.read(4) == 'data' length = file.read(4).unpack('l').first wavedata = StringIO.new file.read(length) while !wavedata.eof? ch1, ch2 = wavedata.read(4).unpack('ss') csv << [ch1, ch2,ch1+ch2] end end end end end

Note that I didn’t read the number of channels from the WAV file but instead assumed it has 2 channels (stereo), to simplify the code. Firstly I open up the WAV file. I ignored the format chunk completely and looked for the data chunk only. Then I read the length of the data chunk by reading the next 4 bytes and unpacking it as a long integer (hence ‘l’ format in the String#unpack method). This gives me the length of the data chunk that I will need to read next.

Next, for ease of reading I wrap the returned data string in a StringIO object. As we found out earlier, each sample has 2 channels and each sample point has 16 bits, so we need to retrieve 32 bits or 4 bytes. Since each sample point has 16 bits, this means a short integer, so we unpack the 4 bytes that are read into 2 short integers, and this will give us the 2 sample points of 2 channels of that sample frame.

After that it’s a simple matter of stuffing the sample points into a CSV file.

Finally, to generate the waveform from the data file, I run it through a simple R script, which I integrated with Ruby using the Ruby Rserve client.

script=<<-EOF png(file='/Users/sausheong/projects/wavform/mrtplot.png', height=800, width=600, res=72) par(mfrow=c(3,1),cex=1.1) wav_data <- read.csv(file='/Users/sausheong/projects/wavform/wavdata.csv', header=TRUE) plot(wav_data$combined, type='n', main='Channel 1', xlab='Time', ylab='Frequency') lines(wav_data$ch1) plot(wav_data$combined, type='n', main='Channel 2', xlab='Time', ylab='Frequency') lines(wav_data$ch2) plot(wav_data$combined, type='n', main='Channel 1 + Channel 2', xlab='Time', ylab='Frequency') lines(wav_data$combined) dev.off() EOF Rserve::Connection.new.eval(script)

The script generates the following PNG file:

As you can see from the waveform (ignoring the first 2 bulges, which are ‘doors’ and ‘closing’ respectively) there are 13 sharp pointy shapes, which represent a beep each.

You can get the code and images here from my GitHub repository. If you’re interested to hear more about Ruby and R integration, do come down to the RedDotRubyConf on 22 and 23 April 2011!

## My new book is out!

It’s been a while from the day I started writing Cloning Internet Applications with Ruby but it’s finally out! You can get it from bookstores, Amazon or its main site at Packt. It’s available in both a paper and a digital version (PDF), so get it now!

The main idea behind this book is actually quite simple and it started out in this blog. The first ‘clone’ I wrote was the Internet search engine in 200 lines of code, which was really very much scratching an itch that I had while I was in Yahoo, about a year and a half ago. I was interested in search engines athen and I wanted to write a very simple search engine to illustrate the principles behind an Internet search engine. That gave me a chance to try out Sinatra, the minimalist web application framework, which worked out really well for me eventually. In turn, that kickstarted me into on a whimsy challenge to do the same with Twitter in the same number of lines of code, using Sinatra and later, TinyURL in 40 lines of code. After that it was only a short leap to writing a whole book about it.

While the original idea revolved around writing clones with the smallest codebase possible, eventually the book evolved to be about writing minimal-feature clones written using the Ruby libraries that I now love to use i.e. Sinatra, DataMapper and Haml. The fundamental premise of the book still remained though, that is to illustrate how clones of popular Internet applications can be written with Ruby.

While this is a highly technical book with lots of code, I added in plenty of elements of the reasons and rationale (according to me, that is) why and how certain features of those applications work. For example, Twitter’s and Facebook’s features for connecting their users (‘friending’ features) in a social network are different, because they target users differently. Twitter’s friending features are primarily one-way and do not need explicit approval while Facebook’s friending features are two-ways and need explicit approvals from both parties. This means design and implementation differences, which are explained in detail in the book.

The experience in writing this book was good, and I have learnt tremendously in the process though it was a struggle. I can say this now that it’s published, but there were certain times I wanted to throw in the towel because of the messy my career was in then. I was still in Yahoo when I started, and I continued while doing my consulting work which eventually led to Garena, then wrapping up before I left Garena and finally being published now as I’m in HP Labs. It took a longer time to finish this than my first book, because of the upheaval in my career in the past year or so and also because overall I wanted to come up with a better book. This resulted in a book that has been revised repeated as companies, statistics and technologies changed. When I started, TinyURL was the king of the hill of URL shorteners while bit.ly was up and coming having just taken over as the default URL shortener in Twitter. TinyURL is now one of the players, with bit.ly probably the largest but Twitter has come out with its own shortener. Facebook Connect was the way to go when I wrote the chapter on Facebook, but the Open Graph APIs has taken over since then. Twitter used both HTTP Basic Authentication and OAuth when I was writing, but has switched over completely to OAuth now. And I expect the list to go on and on.

Still, it has been a good journey and a good fight. Finally publishing it is a grand feeling second to none (except when I had my first child). Hope you enjoy the book!

## Toilets and engineers

It started with a lunch conversation that slowly turned into a coffee time mini-debate. You see, I moved into a new job recently at the Applied Cloud Computing Lab in HP Labs Singapore. My first task was to hire engineers to staff the lab and while the final numbers were not absolute, the number 70 was bandied around for the population of the facilities. We have half a floor at Fusionopolis, plenty of space really but the bone of contention was more delicately focused on more sanitary facilities i.e. the toilets.

The problem was that the whole floor shares a single pair of toilets (one for male and another for female of course). I was totally convinced that with 70 people in the office, we will reach bladder or bowel apocalypse, a disaster in the waiting. My other colleagues however were less worried — the other floors seem to be doing fine. It wasn’t as if we could do anything about it, no one’s going to magically add in new toilets for us no matter how long the toilet queue was; but I had to know. I was also curious of course — what is the algorithm used to determine the number of toilets per square meter of occupancy?

Looking up building regulations found nothing (at least not online) in Singapore but UK has some regulations covered by the Health and Safety Executive.

Looking at facilities used by men only:

Number of men at work | Number of toilets | Number of urinals |

1-15 | 1 | 1 |

16-30 | 2 | 1 |

31-45 | 2 | 2 |

46-60 | 3 | 2 |

61-75 | 3 | 3 |

76-90 | 4 | 3 |

91-100 | 4 | 4 |

Well, the toilets seem to almost fit into the UK regulations, but these are just regulations. How did these numbers come about and how realistic are they? To better model the usage scenarios, I did a Monte Carlo simulation on the usage pattern of the toilets, based on a simple set of assertions.

Taking an educated guess of 1 in 7 of engineers to be female, that leaves us with 60 male engineers sharing a single toilet facility. The toilet facility has 3 urinals and 2 stalls, one of which is a flush toilet, the other is a squat toilet (which in general no one uses any more).

An important piece of information is to find out how many times an average person need to urinate a day. It turns out that our bladder will signal us to make a trip to the toilet when it fills up to about 200ml. Combine with the information that an average urine output of an adult is 1.5 liters a day, this means that we roughly need to go about 8 times a day. This in turn means 8 times in 24 hours or 1 time in 3 hours on an average. The normal working hours in HP are 8:30am – 5:30pm, which is 9 hours, so this means on an average a person makes a beeline to the toilet 3 times during the course of the working day in the office.

I modeled the scenario discretely, per minute. This means a person in the office would make a decision to go to the loo or not every minute within the 9 hours in the office, or 540 minutes. Coupled with an average of 3 times in 540 minutes, this means the probability that a person would go to the toilet every count of the minute is 3 out of 540. In the case of urinating, my estimate is that a person will be able to complete the necessary job within 2 minute on an average and vacate the facility.

Armed with this model, let’s look at the simulation code:

require 'rubygems' require 'faster_csv' # population of 60 people on the same floor population = 60 # period of time from 8:30am -> 5:30pm is 9 hours period = 9 # 8 times a day -> 8 times in 24 hours -> 1 time every 3 hours # within a period of 9 hours -> 3 times # probability in every min -> 3 out of 9 * 60 times_per_period = 3 # rate of servicing mu = 2 # number of people in the queue queue = 0 # number of stalls available stalls = 3 # number of occupied stalls occupied = 0 # total of all queue sizes for calculating average queue size queue_total = 0 # max queue size for the day max_queue_size = 0 FasterCSV.open('queue.csv', 'w') do |csv| csv << %w(t queue occupied) (period * 60).times do |t| # at every rate of service interval, vacate any and all people from the toilet and replace them with people in the queue if t % mu == 0 occupied = 0 while occupied < stalls and queue > 0 occupied += 1 queue -= 1 end end population.times do # does the engineer want to relieve himself? if (rand(period * 60) + 1) <= times_per_period if occupied < stalls occupied += 1 else queue += 1 end end end csv << [t, queue, occupied] max_queue_size = queue if queue > max_queue_size queue_total = queue_total + queue end end puts "** Max queue size : #{max_queue_size}" puts "** Total mins in queue by all : #{queue_total} mins"

Most of the code are quite self explanatory. However notice that mu is 3 (minutes) instead of 2 as we have discussed earlier. The reason for this is that in keeping the simulation code simple, I’ve not factored in that if we clear the toilets every 2 minutes, people who entered during the 2nd minute (according to the simulation) will be cleared in just 1 minute. To compensate for this, I use a mu variable that is 1.5 times the service rate.

The simulation writes to *queue.csv*, which contains the queue size and toilet occupancy at every interval. This are the results:

** Max queue size : 3 ** Total mins in queue by all : 17 mins

I opened the CSV up and used the data to create a simple histogram:

Queue size |
Number of minutes |
Percentage |

0 | 526 | 97.6% |

1 | 9 | 1.8% |

2 | 3 | 0.4% |

3 | 2 | 0.2% |

From the results it’s pretty obvious that the architect and engineers who built Fusionopolis knew their stuff. These were the quantitative conclusions:

- Queues were formed less than 3% of the time of the day
- The maximum number of people in the queue were 3 people and that happened for only less than 2 minutes in a day
- The total time spent by everyone in the office, in a queue to the toilet was 17 mins

This is for the men’s urinals, let’s look at the stalls now. The same simulation script can be used, just changing the parameters a bit. Instead of mu = 3, we use mu = 8. Instead of 3 urinals, we have 2 stalls and instead of 3 times over the period, we use 1 time over the working period.

These are the results:

** Max queue size : 3 ** Total mins in queue by all : 81 mins

Queue size |
Number of minutes |
Percentage |

0 | 485 | 89.8% |

1 | 33 | 6.1% |

2 | 21 | 3.9% |

3 | 1 | 0.2% |

The problem seems a bit more serious now though not significantly so, though the chances of being in a a queue are higher now (10%).

In conclusion, I was wrong but I’m glad to be wrong in this case. Waiting in a toilet queue is seriously not a fun thing to do.

## How to hire the best engineers – part 2

In my previous blog post, I came to the conclusion that

If you’re a hiring manager with a 100 candidates, just interview a sample pool of 7 candidates, choose the best and then interview the rest one at a time until you reach one that better than the best in the sample pool. You will have 90% chance of choosing someone in the top 25% of those 100 candidates.

I mentioned that there are 2 worst case scenarios for this hiring strategy. Firstly, if coincidentally we selected the 7 worst candidates out of 100, then the next one we choose will be taken regardless how good or bad it is. Secondly, if the best candidate is already in the sample pool, then we’ll be going through the rest of the population without finding the best. Let’s talk about these scenarios now.

The first is easily resolved. The probabilities are already considered and calculated in our simulation so we don’t need to worry about that any more. The second is a bit tricky. To find out the failure probability, we tweak the simulation a bit again. This time round instead of getting every sample pool size, we focus on the best results i.e. sample pool size of 7.

require 'rubygems' require 'faster_csv' population_size = 100 sample_size = 0..population_size-1 iteration_size = 100000 top = (population_size-25)..(population_size-1) size = 7 population = [] frequencies = [] iteration_size.times do |iteration| population = (0..population_size-1).to_a.sort_by {rand} sample = population.slice(0..size-1) rest_of_population = population[size..population_size-1] best_sample = sample.sort.last rest_of_population.each_with_index do |p, i| if p > best_sample frequencies << i break end end end puts "Success probability : #{frequencies.size.to_f*100.0/iteration_size.to_f}%"

Notice we’re essentially counting the times we succeed over the iterations we made. The result for this is a success probability of 93%!

Let’s look at some basic laws of probability now. Let’s say A is the event that the strategy works (the sample pool does not contain the best candidate), so the probability of A is P(A) and this value is 0.93. Let’s say B is the event that the the candidate we find out if this strategy is within the best quartile(regardless if A succeeds or not), so the probability of B is P(B). We don’t know the value of P(B) and we can’t really find P(B) since A and B are not independent. What we really want is . What we do know however is P(B|A) which is the probability of B given A happens. This value is 0.9063 (from the previous post),

Using the rule of multiplication:

We get the answer:

which is 84.3% probability that the sample pool does not contain the best candidate and we get the top quartile of the total candidate population.

Next, we want to find out, given that we have a high probability of getting the correct candidates in, how many candidates will we have to interview before we find one that is better than the best in the sample pool? Naturally if we have to interview a lot of candidates before finding one, the strategy isn’t practical. Let’s tweak the simulation again, this time to count the successful ones and get a histogram of the number of interviews before we hit jackpot.

What kind of histogram do you think we will get? An initial thought was that it would be a normal distribution, which is not great news for this strategy, because a histogram peaks around the mean, which means the number of candidates we need to interview is around 40+.

require 'rubygems' require 'faster_csv' population_size = 100 sample_size = 0..population_size-1 iteration_size = 100000 top = (population_size-25)..(population_size-1) size = 7 population = [] frequencies = [] iteration_size.times do |iteration| population = (0..population_size-1).to_a.sort_by {rand} sample = population.slice(0..size-1) rest_of_population = population[size..population_size-1] best_sample = sample.sort.last rest_of_population.each_with_index do |p, i| if p > best_sample frequencies << i break end end end FasterCSV.open('optimal_duration.csv', 'w') do |csv| rest_of_population = population[size..population_size-1] rest_of_population.size.times do |i| count_array = frequencies.find_all{|f| f == i} csv << [i+1, count_array.size.to_f/frequencies.size.to_f] end end

The output is a CSV file called optimal_duration.csv. Charting the data as a histogram, surprisingly we get power law graph like this:

This is good news for or strategy! Looking at our dataset, the probability of the first candidate in the rest of the candidate population to be the one is 13.58% while the probability of getting our man (or woman) in the first 10 candidates we interview is a whopping 63.25%!

So have we nailed the question if the strategy is good one? No, there is still one last question unanswered and this is the breaker. Stay tuned for part 3! (For those who are reading this, an exercise — tell me what you think is the breaker question?)

## How to hire the best engineers without killing yourself

Garena, where I work now, is in an expansion mode and I have been hiring engineers, sysadmins and so on to feed the development frenzy for platform revamps and product roadmaps. A problem I face when hiring engineers is that we’re **not** the only companies that are doing so. This is especially true now as many companies have issued their annual bonuses (or lack of) and the ranks of the dissatisfied joined the swelling exodus out of many tech companies. In other words, mass musical chairs with tech companies and engineers.

Needless to say this causes a challenge with hiring. The good news is that there plenty of candidates. The bad news however is to secure the correctly skilled engineer with the right mindset for a growing startup. At the same time, the identification and confirmation needs to be swift because once you are slow even with a loosely-fit candidate you can potentially lose him/her within a day or two.

This causes me to wonder — what is the best way to go through a large list of candidates and successfully pick up the best engineer or at least someone who is the top percentile of the list of candidates?

In Why Flip a Coin: The Art and Science of Good Decisions, H.W. Lewis wrote about a similar (though stricter) problem involving dating. Instead of choosing candidates the book talked about choosing a wife and instead of conducting interviews, the problem in the book involved dating. However the difference is in the book you can only date one person at a time while in my situation I can obviously interview more than one candidate. Nonetheless the problems are pretty much the same since if I interview too many candidates and take too long to decide, they will be snapped up by other companies. Not to mention that I will probably foam in the mouth and die from interview overdose before that.

In the book, Lewis suggested this strategy — say we are choosing from a pool of 20 candidates. Instead of interviewing each and every one of those candidates we randomly choose and interview 4 candidates and choose the best out of the sample pool of 4. Now armed with the best candidate from the sample pool, we go through the rest of the candidates one by one until we hit one that is better than him, then hire that candidate.

As you would have guess, this strategy is probabilistic and doesn’t guarantee the best candidate. In fact, there are 2 worst case scenarios. First, if we happened to choose the worst 4 candidates of the lot as the sample pool and the first candidate we choose outside of the sample pool is the 5th worst, then we would have gotten the 5th worst candidate. Not good. Conversely if we have the best candidate in the sample pool, then we run the risk of doing 20 interviews and then lose the best candidate because it took too long to do the interviews. Bad again.

So is this a good strategy? Also, what is the best population pool (total number of candidates) and sample pool we want in order to maximize this strategy? Let’s be a good engineer and do another Monte Carlo simulation to find out.

Let’s start with the population pool of 20 candidates, then we iterate through the sample pool of 0 to 19. For each sample pool size, we find the probability that the candidate we choose is the best candidate in the population. Actually we already know the probability when the sample pool is 0 or 19. When the sample pool is 0, it means we’re going to choose the first candidate we interview (since there is no comparison!) therefore the probability is 1/20 which is 5%. Similarly with a sample pool of 19, we will have to choose the last candidate and the probability of it is also 1/20 which is 5%.

Here’s the Ruby code to simulate this. We run it through 100,000 simulations to make the probability as accurate as possible, then save it into a csv file. called *optimal.csv*.

require 'rubygems' require 'faster_csv' population_size = 20 sample_size = 0..population_size-1 iteration_size = 100000 FasterCSV.open('optimal.csv', 'w') do |csv| sample_size.each do |size| is_best_choice_count = 0 iteration_size.times do # create the population and randomize it population = (0..population_size-1).to_a.sort_by {rand} # get the sample pool sample = population.slice(0..size-1) rest_of_population = population[size..population_size-1] # this is the best of the sample pool best_sample = sample.sort.last # find the best chosen by this strategy best_next = rest_of_population.find {|i| i > best_sample} best_population = population.sort.last # is this the best choice? count how many times it is the best is_best_choice_count += 1 if best_next == best_population end best_probability = is_best_choice_count.to_f/iteration_size.to_f csv << [size, best_probability] end end

The code is quite self explanatory (especially with all the in-code comments) so I won’t go into details. The results are as below in the line chart, after you open it up in Excel and chart it accordingly. As you can see if you choose 4 candidates as the sample pool, you will have roughly 1 out of 3 chance that you choose the best candidate. The best odds are when you choose 7 candidates as the sample pool, in which you get around 38.5% probability that you will choose the best candidate. Doesn’t look good.

But to be honest for some candidates I don’t really need the candidate to be the ‘best’ (anyway such evaluations are subjective). Let’s say I want to get the candidate to be in the top quartile (top 25%). What are my odds then?

Here’s the revised code that does this simulation.

require 'rubygems' require 'faster_csv' population_size = 20 sample_size = 0..population_size-1 iteration_size = 100000 top = (population_size-5)..(population_size-1) FasterCSV.open('optimal.csv', 'w') do |csv| sample_size.each do |size| is_best_choice_count = 0 is_top_choice_count = 0 iteration_size.times do population = (0..population_size-1).to_a.sort_by {rand} sample = population.slice(0..size-1) rest_of_population = population[size..population_size-1] best_sample = sample.sort.last best_next = rest_of_population.find {|i| i > best_sample} best_population = population.sort.last top_population = population.sort[top] is_best_choice_count += 1 if best_next == best_population is_top_choice_count += 1 if top_population.include? best_next end best_probability = is_best_choice_count.to_f/iteration_size.to_f top_probability = is_top_choice_count.to_f/iteration_size.to_f csv << [size, best_probability, top_probability] end end

The *optimal.csv* file has a new column, which shows the top quartile (top 5) candidates. The new line chart is shown below, with the results of the previous simulation as a comparison.

Things look brighter now, the most optimal sample pool size is 4 (though for practical purposes, 3 is good enough since the difference between 3 and 4 is small) and the probability of choosing a top quartile candidate shoots up to 72.7%. Pretty good!

Now this is with 20 candidates. How about a large candidate pool? How will this strategy stand up in say a population pool of 100 candidates?

As you can see, this strategy doesn’t work in getting the best out of a large pool (sample pool is too large, probability of success is too low) and it is worse than in a smaller population pool. However, if we want the top quartile or so (meaning being less picky), we only need a sample pool of 7 candidates and we can have a probability of 90.63% of getting what we want. This is amazing odds!

This means if you’re a hiring manager with a 100 candidates, you don’t need to kill yourself trying to interview everyone. Just interview a sample pool of 7 candidates, choose the best and then interview the rest one at a time until you reach one that better than the best in the sample pool. You will have 90% of choosing someone in the top 25% of those 100 candidates (which is probably what you want anyway)!

5comments