Word stemming in Ruby

Posted in Machine learning, Ruby by sausheong on February 8, 2009

The concept of stemming in natural language processing, or NLP (the processing of human languages such as English), is quite simple. In most languages, words relate to each other in certain ways. For example, ‘fish’, ‘fishes’, ‘fishing’ and ‘fisher’ are different word forms that are related to each other. In NLP, it is sometimes convenient to group such words together and treat them as the same, to reduce the number of distinct words to process. To do that, we reduce the different variants of a word to a root word or ‘stem’; this process is called ‘stemming’.

There are numerous strategies and algorithms for stemming. A widely used algorithm for English is the Porter stemming algorithm, published by Martin Porter in 1980. The Porter stemmer follows a strategy of suffix stripping: it applies a set of rules, in a fixed sequence of steps, to strip suffixes from a word. For example, a word that ends with ‘-ed’ might have the ‘-ed’ removed.
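To make the idea of rule-based suffix stripping concrete, here is a minimal sketch in Ruby. It is not the Porter algorithm: the rule list, the `toy_stem` name and the "keep at least three letters and a vowel" check are all assumptions made up for illustration, and the real algorithm uses several ordered steps with more careful conditions.

```ruby
# A toy suffix stripper (NOT the Porter algorithm).
# Each rule is [suffix, replacement]; rules are tried in order,
# so longer suffixes are listed before shorter ones.
TOY_RULES = [
  ['sses', 'ss'],
  ['ies',  'i'],
  ['ing',  ''],
  ['ed',   ''],
  ['es',   ''],
  ['ss',   'ss'], # leave a double 's' alone
  ['s',    '']
].freeze

def toy_stem(word)
  word = word.downcase
  TOY_RULES.each do |suffix, replacement|
    next unless word.end_with?(suffix)

    candidate = word[0...-suffix.length] + replacement
    # Hypothetical guard: only accept the result if it still looks
    # like a word (at least 3 letters and at least one vowel).
    return candidate if candidate.length >= 3 && candidate.match?(/[aeiou]/)
  end
  word
end

# Group related words under the stem produced by the toy rules.
groups = %w[fish fishes fishing fished].group_by { |w| toy_stem(w) }
puts groups.inspect
```

With these toy rules, all four words in the usage example fall into a single group, which is the kind of grouping stemming is meant to provide.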

Stemming is closely related to lemmatisation, which is the process of grouping the different inflected forms of a word under the lemma for that word. A lemma is the base form of the word, and its spelling may change when the word is inflected, while a stem does not change. For example, for the inflected word ‘produced’, the lemma is ‘produce’ while the stem is ‘produc’, because there are related forms such as ‘production’. As a result, stems are not necessarily complete words.
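One way to picture the difference is that a lemma has to be looked up, while a stem can be computed by trimming. The sketch below uses a tiny hand-written table; the `LEMMAS` hash and the `lemma_for` helper are hypothetical names for illustration only, and a real lemmatiser would rely on a much larger vocabulary and on grammatical information.

```ruby
# A hand-written lemma table (illustrative only, not a real lexicon).
LEMMAS = {
  'produced'  => 'produce',
  'producing' => 'produce',
  'produces'  => 'produce',
  'went'      => 'go',
  'mice'      => 'mouse'
}.freeze

# Return the lemma if the word is in the table, else the word itself.
def lemma_for(word)
  key = word.downcase
  LEMMAS.fetch(key, key)
end

%w[produced producing went mice].each do |word|
  puts "#{word} -> #{lemma_for(word)}"
end
```

Irregular forms such as ‘went’ show why a lookup is needed: no amount of suffix stripping turns ‘went’ into ‘go’.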

The complete Porter stemming algorithm can be found on this page, maintained by the creator of the algorithm. The Ruby implementation, by Ray Pereda, is also available as a Ruby gem. To use the Porter stemmer in Ruby, you can just install the gem:

gem install stemmer
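Once the gem is installed, usage might look like the sketch below. It assumes the gem exposes a `Stemmable` module with a `stem` method that can be mixed into `String`; check the gem's own documentation if your version differs. The `STEMMER_LOADED` constant is just a helper for this example so that the script still runs when the gem is missing.

```ruby
# Try to load the gem; remember whether it worked.
begin
  require 'stemmer'
  STEMMER_LOADED = true
rescue LoadError
  STEMMER_LOADED = false
end

if STEMMER_LOADED && defined?(Stemmable)
  # Assumption: Stemmable provides `stem`. Mix it into String
  # only if the gem has not already done so.
  String.include(Stemmable) unless String.include?(Stemmable)

  %w[fish fishes fishing].each do |word|
    puts "#{word} -> #{word.stem}"
  end
else
  warn 'stemmer gem not available; run `gem install stemmer` first'
end
```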

If you’re interested in the original Porter stemming algorithm paper, you can read it here.

Responses

  1. Roman said, on February 10, 2009 at 2:10 am

    You might be interested in my version of stemmer. It’s based on the C version and is thus an order of magnitude faster (and uses MUCH less memory).

  2. […] to a root word ‘fish’. In our method here we used a popular stemming algorithm called Porter stemming algorithm. To use this stemming algorithm, install the following […]

  3. […] I do cheat a bit and use various libraries extensively including Hpricot, DataMapper and the Porter Stemmer. SaushEngine is a web search engine which means it goes out to  Internet and harvests data on […]
