Naive Bayesian Classifiers and Ruby

Posted in general, Machine learning, Ruby by sausheong on February 11, 2009

I first learnt about probability when I was in secondary school. As with all the other topics in Maths, it was just another bunch of formulas to memorize and regurgitate in exam questions. Although I was curious whether it had any use beyond calculating the odds for gambling, I never found out. As with many things in my life, it popped up again in an unexpected place: I stumbled on it when I started on machine learning and naive Bayesian classifiers.

A classifier is exactly that: something that classifies other things. A classifier is a function that takes in a set of data and tells us which category or classification the data belongs to. A naive Bayesian classifier is a type of learning classifier, meaning that you can continually train it with more data and it will get better at its job. It is called Bayesian because it uses Bayes’ Law, a mathematical theorem about the conditional probabilities of events, to determine how to classify the data. The classifier is called ‘naive’ because it assumes each event (in this case, each word in the data) to be totally unrelated to every other. That’s a very simplistic view, but in practice it has proven to be surprisingly accurate. Also, because it’s relatively simple to implement, it’s quite popular. Its best-known uses include email spam filters.

So what’s Bayes’ Law and how can it be used to categorize data? As mentioned, Bayes’ Law describes conditional probabilities. An example of conditional probability is the probability of an event A happening given that another event B has happened. This is usually written as Pr(A | B), read as the probability of A, given B. To classify a document, we ask: given a particular text document, what’s the probability that it belongs to this category? When we find the probabilities of the given document in all categories, the classifier picks the category with the highest probability and announces it as the winner, that is, the category the document most probably belongs to.

The question then follows: how do we get the probability of a document belonging to a category? This is where we turn to Bayes’ Law, which states that:

Pr(A|B) = Pr(B|A) * Pr(A) / Pr(B)

Given our usage, what we want is:

Pr(category|document) = Pr(document|category) * Pr(category) / Pr(document)

What we need is Pr(document|category) and Pr(category). Keep in mind that we’re comparing relative probabilities here, so we can drop Pr(document) because it is the same for every category.

What is Pr(document|category) and how do we find it? It is the probability that this document exists, given a particular category. As a document is made up of a bunch of words, what we need to calculate is the probability of the document’s words occurring within the category. Here is where the ‘naive’ part comes in. We know that the words in a document are not random: a word like ‘Ruby’ is more likely to be found in an article on the Ruby programming language than in, say, an article on dental practices in Uganda. However, for the sake of simplicity, the naive Bayesian classifier treats every word as independent of every other.

Remember your probability lessons — if the probability of each word is independent of each other, the probability of a whole bunch of words together is the product of the probability of each word in the bunch. A quick aside to illustrate this.

Take a pair of dice and roll them one after another. The probability of the first die falling on any one of its 6 sides is 1 out of 6, that is 1/6. The probability of the second die falling on any one of its 6 sides is also 1/6. So what is the probability that both dice land on 6? Out of the 6 x 6 = 36 possible ways a pair of dice can land, there is only 1 way that both land on 6, so the probability is 1/36, which is 1/6 x 1/6. This is true only if the dice rolls are independent of each other. In the same way, we ‘naively’ assume that the words in the document occur independently of each other, as if the document were written by the proverbial monkey with a typewriter.
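As a quick sketch, the dice example can be checked by brute force in Ruby (nothing here is from the classifier itself, it's just the arithmetic above):

```ruby
# Probability that two independent dice both land on six:
p_six  = 1.0 / 6
p_both = p_six * p_six   # 1/6 * 1/6 = 1/36

# Brute-force check: enumerate all 36 ordered outcomes of a pair of dice
# and count how many are a double six.
outcomes   = (1..6).to_a.product((1..6).to_a)
both_sixes = outcomes.count { |pair| pair == [6, 6] }   # exactly 1 of the 36
```

The product rule and the enumeration agree, which is exactly what independence buys us.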

In other words, the probability that a document exists, given a category, is the product of the probabilities of each word in that document. Now that we’ve established this, how do we get the probability of a single word? Basically, it’s the number of times the word appeared in the category during training, compared with the total word count in that category. Another quick illustration. Say we train the classifier with 2 categories (spam and not-spam), and there are 100 word counts in the spam category. There may be only 14 unique words in this category, but some of those words have been trained more than once. Out of these 100 word counts, 5 are for the word ‘money’. The probability for the word ‘money’ is then the number of times it was trained in the spam category (5) divided by the total word count in the category (100).
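Here is a minimal Ruby sketch of that word-probability calculation, using the numbers from the example above (the hash contents are made up for illustration):

```ruby
# Hypothetical word counts trained for the "spam" category
# (the example has 14 unique words; only a few are shown here).
spam_word_counts = { "money" => 5, "quick" => 12, "rich" => 15 }
total_word_count = 100  # total word counts trained in "spam"

# Pr("money"|"spam") = times "money" was trained / total counts in the category
pr_money = spam_word_counts["money"].to_f / total_word_count  # 5/100 = 0.05
```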

Now that we know Pr(document|category), let’s look at Pr(category). This is simply the probability of any document being in this category (instead of another category): the number of documents used to train this category over the total number of documents used to train all categories.
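A quick sketch of that calculation, assuming a hypothetical training run with 4 spam and 5 not-spam documents:

```ruby
# Hypothetical document counts per category after training.
categories_documents = { "spam" => 4, "not_spam" => 5 }
total_documents = categories_documents.values.inject(0) { |sum, n| sum + n }  # 9

# Pr("spam") = documents trained as spam / all documents trained
pr_spam = categories_documents["spam"].to_f / total_documents  # 4/9, about 0.444
```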

So that’s the basic idea behind naive Bayesian classifiers. With that, I’m going to show you how to write a simple classifier in Ruby. There is already a rather popular Ruby implementation by Lucas Carlson called the Classifier gem, which you can use readily, but let’s write our own classifier instead. We’ll be creating a class named NaiveBayes, in a file called naive_bayes.rb. This classifier will be used to classify text into different categories. Let’s recap how this classifier will be used:

  1. First, tell the classifier how many categories there will be
  2. Next, train the classifier with a number of documents, indicating which category each document belongs to
  3. Finally, pass the classifier a document and it should tell us which category it thinks the document should be in

Now let’s run through the public methods of the NaiveBayes class, which map to the 3 actions above:

  1. Provide the categories you want to classify the data into
  2. Train the classifier by feeding it data
  3. Do the real work, that is, classify given data

The first method we’ll roll into the constructor of the class, so that when we create the object, the categories will be set. The second method, train, takes in a category and a document (a text string) to train the classifier. The last method, classify, takes in just a document (a text string) and returns its category.

require 'rubygems'
require 'stemmer' # provides String#stem, used by word_count below

class NaiveBayes
  def initialize(*categories)
    @words = Hash.new
    @total_words = 0
    @categories_documents = Hash.new
    @total_documents = 0
    @categories_words = Hash.new
    @threshold = 1.5

    categories.each do |category|
      @words[category] = Hash.new
      @categories_documents[category] = 0
      @categories_words[category] = 0
    end
  end

  def train(category, document)
    word_count(document).each do |word, count|
      @words[category][word] ||= 0
      @words[category][word] += count
      @total_words += count
      @categories_words[category] += count
    end
    @categories_documents[category] += 1
    @total_documents += 1
  end

  def classify(document, default = 'unknown')
    sorted = probabilities(document).sort { |a, b| a[1] <=> b[1] }
    best, second_best = sorted.pop, sorted.pop
    return best[0] if best[1] / second_best[1] > @threshold
    default
  end

  # ... the helper methods shown below also go inside this class
end

Let’s look at the initializer first. We’ll need the following instance variables:
1. @words is a hash containing a list of words trained for the classifier. It looks something like this:

{
  "spam"     => { "money" => 10, "quick" => 12, "rich" => 15 },
  "not_spam" => { "report" => 9, "database" => 13, "salaries" => 12 }
}

“spam” and “not_spam” are the categories, while “money”, “quick” etc. are the words in the “spam” category, with the numbers indicating how many times each word has been trained in that category.

2. @total_words contains the number of words trained

3. @categories_documents is a hash containing the number of documents trained for each category:

{ "spam" => 4, "not_spam" => 5}

4. @total_documents is the total number of documents trained

5. @categories_words is a hash containing the number of words trained for each category:

{ "spam" => 37, "not_spam" => 34}

6. @threshold is something I will come back to in the last section of the code walkthrough (it doesn’t make much sense yet).

Next is the train method, which takes in a category and a document. We break the document down into a number of words and slot the counts into the instance variables we created earlier. Here we use a private helper method called word_count to do the grunt work.

def word_count(document)
  words = document.gsub(/[^\w\s]/, "").split
  d = Hash.new
  words.each do |word|
    key = word.stem
    unless COMMON_WORDS.include?(word)
      d[key] ||= 0
      d[key] += 1
    end
  end
  return d
end

COMMON_WORDS = ['a','able','about','above','abroad' ...] # this is truncated

The code is quite straightforward: we’re just breaking a text string down into its constituent words. We want to focus on words that characterize the document, so we’re not interested in words such as pronouns, conjunctions, articles and so on. Dropping those common words brings up the nouns, characteristic adjectives and some verbs. Also, to reduce the number of distinct words, we use a technique called ‘stemming’, which reduces any word to its ‘stem’ or root word. For example, the words ‘fishing’, ‘fisher’, ‘fished’ and ‘fishy’ are all reduced to the root word ‘fish’. In our method we use a popular algorithm called the Porter stemming algorithm. To use it, install the following gem:

gem install stemmer

Now let’s look at the classify method. This is the method that uses Bayes’ Law to classify documents. We will be breaking it down into a few helper methods to illustrate how Bayes’ Law is used. Remember that finally we’re looking at the probability of a given document being in any of the categories, so we need to have a method that returns a hash of categories with their respective probabilities like this:

{ "spam" => 0.123, "not_spam" => 0.327 }

def probabilities(document)
  probabilities = Hash.new
  @words.each_key do |category|
    probabilities[category] = probability(category, document)
  end
  return probabilities
end

In the probabilities method, we need to calculate the probability of the document being in each category. As mentioned above, that probability is Pr(document|category) * Pr(category). We create a helper method called probability that simply multiplies the document probability Pr(document|category) and the category probability Pr(category).

def probability(category, document)
  doc_probability(category, document) * category_probability(category)
end
First let’s tackle Pr(document|category). To do that we need to get all the words in the given document, get the word probability of that document and multiply them all together.

def doc_probability(category, document)
  doc_prob = 1
  word_count(document).each { |word| doc_prob *= word_probability(category, word[0]) }
  return doc_prob
end

Next, we want to get the probability of a word. Basically, the probability of a word in a category is the number of times it occurred in that category, divided by the total word count in that category. However, if the word never occurred during training (and this happens pretty frequently if you don’t have much training data), what you’ll get is a big fat 0 for its probability. If we propagate this upwards, the document probability becomes 0 and therefore the probability of that document being in that category becomes 0 as well. This, of course, is not the desired result. To correct it, we need to tweak the formula a bit. To make sure that there is at least some probability for a word even if it isn’t in the trained list, we pretend the word occurred at least once in the training data so that the result is not 0. So this means that instead of:

Pr(word|category) = count of word in category / total word count in category

we take:

Pr(word|category) = (count of word in category + 1) / total word count in category

So the code is something like this:

def word_probability(category, word)
  (@words[category][word.stem].to_f + 1) / @categories_words[category].to_f
end
Finally we want to get Pr(category), which is pretty straightforward. It’s just the probability of any random document being in this category, so we take the number of documents used to train the category and divide it by the total number of documents used to train the classifier.

def category_probability(category)
  @categories_documents[category].to_f / @total_documents.to_f
end

Now that we have the probabilities, let’s go back to the classify method and take a look at it again:

def classify(document, default = 'unknown')
  sorted = probabilities(document).sort { |a, b| a[1] <=> b[1] }
  best, second_best = sorted.pop, sorted.pop
  return best[0] if best[1] / second_best[1] > @threshold
  default
end

We sort the probabilities to bubble up the category with the largest probability. However, if we used the largest probability directly, the winner would only need to beat the runner-up by the tiniest margin. For example, take the spam and non-spam categories and say the probabilities split like this: spam is 53% and non-spam is 47%. Should the document be classified as spam? Probably not! This is the reason for the threshold variable, which sets a minimum ratio between the best and the second best probabilities. In the code above the value is 1.5, meaning the best probability needs to be more than 1.5 times the second best, i.e. better than a 60/40 split. If this is not the case, the classifier will just shrug and say it doesn’t know (returning ‘unknown’, the default category). You can tweak this number depending on the type of categories you are using.
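A quick sketch of the threshold check with made-up probabilities:

```ruby
threshold = 1.5

# 53% vs 47% is too close to call; the classifier would answer 'unknown'.
close_call   = 0.53 / 0.47 > threshold   # false

# Roughly 61% vs 39% is a clear enough winner.
clear_winner = 0.61 / 0.39 > threshold   # true
```

Note that an exact 60/40 split sits right at the boundary, since the check uses a strict greater-than.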

Now that we have the classifier, let’s take it out for a test run. I’m going to use a set of Yahoo news RSS feeds to train the classifier according to the various categories, then use some random text I get from some other sites and ask the classifier to classify them.

require 'rubygems'
require 'rss/1.0'
require 'rss/2.0'
require 'open-uri'
require 'hpricot'
require 'naive_bayes'
require 'pp'

categories = %w(tech sports business entertainment)
classifier = NaiveBayes.new(*categories)

content = ''
categories.each do |category|
  feed = "#{category}" # feed URL lost in the original; substitute the Yahoo news RSS feed for each category
  open(feed) { |s| content = s.read }
  rss = RSS::Parser.parse(content, false)
  rss.items.each do |item|
    text = Hpricot(item.description).inner_text
    classifier.train(category, text)
  end
end

# classify this
documents = [
  "Google said on Monday it was releasing a beta version of Google Sync for the iPhone and Windows Mobile phones",
  "Rangers waste 5 power plays in 3-0 loss to Devils",
  "Going well beyond its current Windows Mobile software, Microsoft will try to extend its desktop dominance with a Windows phone.",
  "UBS cuts jobs after Q4 loss",
  "A fight in Hancock Park after a pre-Grammy Awards party left the singer with bruises and a scratched face, police say."
]

documents.each do |text|
  puts text
  puts "category => #{classifier.classify(text)}"
end
This is the output:

Google said on Monday it was releasing a beta version of Google Sync for the iPhone and Windows Mobile phones
category => tech

Rangers waste 5 power plays in 3-0 loss to Devils
category => sports

Going well beyond its current Windows Mobile software, Microsoft will try to extend its desktop dominance with a Windows phone.
category => tech

UBS cuts jobs after Q4 loss
category => unknown

A fight in Hancock Park after a pre-Grammy Awards party left the R&B singer with bruises and a scratched face, police say.
category => entertainment

You can download the code described above, including the naive_bayes.rb and bayes_test.rb files, from GitHub. For more information, pick up the excellent book Programming Collective Intelligence by Toby Segaran.


46 Responses


  1. Hugo Baraúna said, on February 22, 2009 at 2:01 am

    Very cool and very well explained! =)

  2. [...] like Mephisto, Radiant, Insoshi and many others. Recently I was discussing Bayesian classifiers and, coincidentally, a post about them was published the same day, which I found interesting. ELC has a very [...] blog

  3. Ritesh said, on June 8, 2009 at 9:59 am

    Great..for two days I had been looking for such a tutorial. Thanks for putting this. Just to make sure that I can refer it in future, I took a snapshot of your page :P

  4. Karl Baum said, on June 18, 2009 at 8:05 pm

    Great tutorial.

    It seems like the number of occurrences of one word is not taken into account when classifying. Is this correct?

    For example, if I write an email with a message body “Viagra, Viagra, Viagra”, the word “Viagra” will only be counted once toward the spamicity when this doc is classified.


  5. sausheong said, on June 18, 2009 at 9:53 pm

    Thanks Karl! You’re right it only takes in the word once. This is just a simple example to explain how naive bayesian classifiers work, in a real spam filter there would definitely be other logic.

  6. szeryf said, on December 11, 2009 at 5:26 pm

    Can’t download naive_bayes.rb and bayes_test.rb files — the links are broken :(

  7. Boban said, on December 14, 2009 at 8:37 pm

    Pls fix the broken links. Thank You in advance

  8. REUBEN said, on October 29, 2010 at 1:00 am

    i want to learn this bt this time using weka to classifie RSS fields
    any sugestions?

  9. Agile Aspect said, on February 26, 2011 at 11:10 am

    So why is the signature of the method initialize() in class NaiveBayes define as

    def initialize(*categories)

    here but in the zip file which we download fom GIT it’s defined as

    def initialize(categories)

    which results in broken code?

  10. Agile Aspect said, on February 26, 2011 at 12:09 pm

    There are multiple byte characters in bayes_test.rb which are invalid for US ASCII.

    The first 3 are hypens:

    bayes_test.rb:20: syntax error, unexpected $end, expecting ‘)’
    …ng”,”COLOMBO, Sri Lanka (AP) — Sri Lanka’s president urged…

    bayes_test.rb:20: syntax error, unexpected $end, expecting ‘)’
    …es and he accused the rebels — known formally as the Liber…

    bayes_test.rb:20: syntax error, unexpected $end, expecting ‘)’
    …ration Tigers of Tamil Eelam — of putting their heavy arti…

    The next 2 are apostropes:

    bayes_test.rb:42: invalid multibyte char (US-ASCII)

  11. Chris said, on April 15, 2011 at 4:16 am

    @Agile Aspect, did you correct and commit?


  12. awaage said, on November 15, 2011 at 5:26 pm

    Awesome explanation!! Thanks


