## How to hire the best engineers – part 2

In my previous blog post, I came to this conclusion:

If you’re a hiring manager with 100 candidates, just interview a sample pool of 7 candidates, choose the best, and then interview the rest one at a time until you reach one that is better than the best in the sample pool. You will have a 90% chance of choosing someone in the top 25% of those 100 candidates.

I mentioned that there are 2 worst-case scenarios for this hiring strategy. Firstly, if by coincidence we selected the 7 worst candidates out of 100, then the next one we interview will be taken regardless of how good or bad they are. Secondly, if the best candidate is already in the sample pool, then we’ll go through the rest of the population without finding anyone better. Let’s talk about these scenarios now.

The first is easily resolved. Its probability is already accounted for in our simulation, so we don’t need to worry about it any more. The second is a bit trickier. To find the failure probability, we tweak the simulation again. This time round, instead of trying every sample pool size, we focus on the best result, i.e. a sample pool size of 7.

    # Estimate how often the strategy finds anyone at all, i.e. how often some
    # later candidate beats the best of the 7-candidate sample pool.
    population_size = 100
    iteration_size = 100000
    size = 7          # sample pool size
    frequencies = []  # where in the rest of the population each successful run stopped

    iteration_size.times do
      # Candidates are scored 0..99; a higher score is a better candidate.
      population = (0..population_size-1).to_a.sort_by { rand }
      sample = population.slice(0..size-1)
      rest_of_population = population[size..population_size-1]
      best_sample = sample.sort.last

      rest_of_population.each_with_index do |p, i|
        if p > best_sample
          frequencies << i
          break
        end
      end
    end

    puts "Success probability : #{frequencies.size.to_f*100.0/iteration_size.to_f}%"

Notice we’re essentially counting the times we succeed over the iterations we made. The result for this is a success probability of 93%!
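As a sanity check on the 93%, the same figure can be reached without simulation: the strategy fails only when the best candidate is one of the 7 in the sample pool, and if every ordering is equally likely that happens 7 times out of 100. The sketch below compares that closed form with a smaller seeded run. The helper name `success_rate`, the seed and the iteration count are my own choices, not part of the original code.

```ruby
# Closed form: the best candidate is equally likely to sit in any of the
# 100 positions, so it lands in the 7-candidate sample pool 7% of the time.
POPULATION_SIZE = 100
SAMPLE_SIZE = 7
exact = 1.0 - SAMPLE_SIZE.to_f / POPULATION_SIZE

# Hypothetical helper: fraction of shuffles in which some later candidate
# beats the best of the sample pool.
def success_rate(iterations, rng)
  successes = 0
  iterations.times do
    population = (0...POPULATION_SIZE).to_a.shuffle(random: rng)
    best_sample = population.first(SAMPLE_SIZE).max
    successes += 1 if population.drop(SAMPLE_SIZE).any? { |p| p > best_sample }
  end
  successes.to_f / iterations
end

simulated = success_rate(20_000, Random.new(42))
puts "Exact: #{exact}, simulated: #{simulated}"
```

The two numbers should agree to within ordinary sampling noise.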

Let’s look at some basic laws of probability now. Let A be the event that the strategy works (the sample pool does not contain the best candidate), so the probability of A is P(A), and this value is 0.93. Let B be the event that the candidate we find with this strategy is within the best quartile (regardless of whether A succeeds or not), so the probability of B is P(B). We don’t know the value of P(B), and we can’t really find P(B) on its own since A and B are not independent. What we really want is P(A ∩ B), the probability that both happen. What we do know, however, is P(B|A), which is the probability of B given that A happens. This value is 0.9063 (from the previous post).

Using the rule of multiplication:

$$P(A \cap B) = P(A)\,P(B \mid A)$$

We get the answer:

$$P(A \cap B) = 0.93 \times 0.9063 \approx 0.843$$

which is an 84.3% probability that the sample pool does not contain the best candidate and that we get someone in the top quartile of the total candidate population.
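The rule of multiplication can also be checked on simulated counts. The following is only a sketch under my own assumptions: it treats B as impossible when A fails (nobody is chosen in that case), uses `Array#shuffle` with a fixed seed, and all constant names are mine. Its printed figures need not match the ones quoted above, since the 0.9063 comes from the previous post's simulation, which may have counted B differently.

```ruby
POPULATION = 100
SAMPLE = 7
TOP_CUTOFF = POPULATION - 25   # scores 75..99 form the top quartile
ITERATIONS = 20_000
rng = Random.new(1)            # fixed seed, an arbitrary choice

count_a = 0        # A: someone beats the best of the sample pool
count_a_and_b = 0  # A and B: that someone is also in the top quartile

ITERATIONS.times do
  population = (0...POPULATION).to_a.shuffle(random: rng)
  best_sample = population.first(SAMPLE).max
  chosen = population.drop(SAMPLE).find { |p| p > best_sample }
  next if chosen.nil?          # A failed: the best candidate was in the sample
  count_a += 1
  count_a_and_b += 1 if chosen >= TOP_CUTOFF
end

p_a = count_a.to_f / ITERATIONS
p_b_given_a = count_a_and_b.to_f / count_a
p_a_and_b = count_a_and_b.to_f / ITERATIONS

puts "P(A)          = #{p_a}"
puts "P(B|A)        = #{p_b_given_a}"
puts "P(A) * P(B|A) = #{p_a * p_b_given_a}"
puts "P(A and B)    = #{p_a_and_b}"
```

The last two lines agree by construction, which is the rule of multiplication at work: the product of P(A) and P(B|A) is the fraction of all iterations in which both events happen.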

Next, we want to find out, given that we have a high probability of getting the right candidate, how many candidates we will have to interview before we find one that is better than the best in the sample pool. Naturally, if we have to interview a lot of candidates before finding one, the strategy isn’t practical. Let’s tweak the simulation again, this time to count the successful runs and get a histogram of the number of interviews before we hit the jackpot.

What kind of histogram do you think we will get? My initial thought was that it would be a normal distribution, which would not be great news for this strategy, because a normal distribution peaks around the mean, which would put the number of candidates we need to interview at around 40+.

    require 'csv' # FasterCSV became Ruby's standard csv library in Ruby 1.9

    population_size = 100
    iteration_size = 100000
    size = 7          # sample pool size
    frequencies = []  # where in the rest of the population each successful run stopped

    iteration_size.times do
      population = (0..population_size-1).to_a.sort_by { rand }
      sample = population.slice(0..size-1)
      rest_of_population = population[size..population_size-1]
      best_sample = sample.sort.last

      rest_of_population.each_with_index do |p, i|
        if p > best_sample
          frequencies << i
          break
        end
      end
    end

    # For each position in the rest of the population, write the share of
    # successful runs that stopped there.
    CSV.open('optimal_duration.csv', 'w') do |csv|
      (population_size - size).times do |i|
        csv << [i+1, frequencies.count(i).to_f/frequencies.size.to_f]
      end
    end

The output is a CSV file called optimal_duration.csv. Charting the data as a histogram, surprisingly we get a power-law curve rather than a bell curve: the bars are tallest at the very first interview and fall away quickly after that.

This is good news for our strategy! Looking at our dataset, the probability that the first candidate in the rest of the population is the one is 13.58%, while the probability of getting our man (or woman) within the first 10 candidates we interview is a whopping 63.25%!
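The shape of the histogram can also be worked out directly, assuming every ordering of the candidates is equally likely. This derivation is mine rather than something from the simulation: we stop at the k-th interview after the sample pool exactly when candidate 7 + k is the best of the first 7 + k (probability 1/(7 + k)) and the best of the earlier 6 + k sits in the sample pool (probability 7/(6 + k)), giving 7/((6 + k)(7 + k)). The sketch below evaluates that with exact rationals; the method name `stop_probability` is hypothetical.

```ruby
SAMPLE = 7
POPULATION = 100

# Probability that the k-th interview after the sample pool is the first to
# beat the best of the sample pool, before conditioning on success.
def stop_probability(k)
  Rational(SAMPLE, (SAMPLE + k - 1) * (SAMPLE + k))
end

ks = 1..(POPULATION - SAMPLE)
p_success = ks.sum { |k| stop_probability(k) }

# Condition on success, as the CSV does by dividing by frequencies.size.
first = stop_probability(1) / p_success
first_ten = (1..10).sum { |k| stop_probability(k) } / p_success

puts "P(success)           = #{p_success.to_f}"
puts "First interview      = #{(first.to_f * 100).round(2)}%"
puts "Within 10 interviews = #{(first_ten.to_f * 100).round(2)}%"
```

Under these assumptions the within-10 figure comes out at 63.25%, matching the dataset, and the first-interview figure at about 13.4%, close to the simulated 13.58%. The terms shrink roughly like 1/k², which is consistent with the power-law look of the chart.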

So have we settled the question of whether the strategy is a good one? No, there is still one last question unanswered, and this one is the breaker. Stay tuned for part 3! (For those who are reading this, an exercise: tell me what you think the breaker question is.)

Christopher Tay said, on September 7, 2010 at 4:58 am:

Hi. It is definitely heartening to see someone applying/reflecting on the applications of applied probability to real-life situations. Regarding the last question, are you thinking of what happens if the number of candidates is not fixed? As in, the CVs are “streaming” in?

I would like to comment on 2 things:

– Probably the book “Why flip a coin” might have mentioned it. For the optimal stopping problem, it can be shown analytically that to maximize your chances of getting the best candidate out of N, you have to interview N/e candidates first, especially when N grows large. So your example of N=20 yields approximately 20/e = 7.36. But of course the advantage of simulation is being able to easily change parameters or add different constraints.

– Concerning the simulation, line 13 of your code (the `sort_by {rand}` call) shuffles the array by making use of the sorting algorithm and a randomly returned value. Chances are that the shuffled array is not uniformly distributed, and that its resulting probability distribution is highly dependent on Ruby’s implementation of sorting. I am not sure, but probably it will not have a big impact on your experimental results. However, I still recommend using an algorithm like the Fisher-Yates algorithm to see if there is any significant impact. Whatever the case, at least you can then be sure that your shuffled array is uniformly distributed, which I guess is what you are looking for?
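For anyone who wants to try the Fisher-Yates suggestion, here is a minimal sketch of that shuffle in Ruby. The method name `fisher_yates_shuffle` and the seed are illustrative choices, not part of the original post; Ruby also ships a built-in `Array#shuffle` that could be dropped into the simulation instead.

```ruby
# Fisher-Yates shuffle: walk from the last index down to 1, swapping each
# element with a uniformly chosen element at or before it.
def fisher_yates_shuffle(array, rng = Random.new)
  shuffled = array.dup
  (shuffled.size - 1).downto(1) do |i|
    j = rng.rand(i + 1)   # integer in 0..i
    shuffled[i], shuffled[j] = shuffled[j], shuffled[i]
  end
  shuffled
end

population = fisher_yates_shuffle((0..99).to_a, Random.new(2010))
puts population.first(7).inspect   # the sample pool for one run
```

Replacing the `sort_by { rand }` line with a call like this and re-running the simulation would show whether the choice of shuffle makes any measurable difference to the results.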