Word stemming in Ruby

The concept of stemming in natural language processing, or NLP (the processing of human languages such as English), is quite a simple one. In most languages, words relate to each other in certain ways. For example, ‘fish’, ‘fishes’, ‘fishing’ and ‘fisher’ are different forms of words that are related to each other. In NLP, it is sometimes convenient to group such words together and treat them as the same, which reduces the number of distinct words to process. To do that, we reduce the different variants of a word to a root word, or ‘stem’, and this process is called ‘stemming’.

There are numerous strategies and algorithms for stemming. A widely used algorithm for English stemming is the Porter stemming algorithm, published by Martin Porter in 1980. The Porter stemmer follows a strategy of suffix stripping: it applies a set of rules, in a fixed sequence of steps, to strip suffixes from a word. For example, a word that ends with ‘-ed’ might have the ‘-ed’ removed.
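To make the idea of rule-based suffix stripping concrete, here is a minimal sketch in Ruby. It is not the Porter algorithm: the rule list, its order, the minimum stem length and the name `naive_stem` are all assumptions made up for illustration, and the real algorithm applies far more careful conditions before removing a suffix.

```ruby
# Toy suffix stripper. NOT the Porter algorithm, only the general idea.
# The rules and their order are illustrative assumptions.
# Each rule is [suffix, replacement]; the first rule that applies wins.
SUFFIX_RULES = [
  ['sses', 'ss'], # 'caresses' -> 'caress'
  ['ies',  'i'],  # 'ponies'   -> 'poni'
  ['ss',   'ss'], # 'glass'    -> 'glass' (protects a double s)
  ['es',   ''],   # 'fishes'   -> 'fish'
  ['ing',  ''],   # 'fishing'  -> 'fish'
  ['ed',   ''],   # 'jumped'   -> 'jump'
  ['s',    '']    # 'cats'     -> 'cat'
].freeze

# Assumed lower bound so that short words such as 'red' are left alone.
MIN_STEM_LENGTH = 3

def naive_stem(word)
  word = word.downcase
  SUFFIX_RULES.each do |suffix, replacement|
    next unless word.end_with?(suffix)

    stem = word[0, word.length - suffix.length] + replacement
    return stem if stem.length >= MIN_STEM_LENGTH
  end
  word # no rule applied, so return the word unchanged
end

# Related words collapse to a single stem:
%w[fish fishes fishing].map { |w| naive_stem(w) }.uniq # => ["fish"]
```

Even this toy version shows why stems are not always complete words (‘ponies’ becomes ‘poni’) and why the order of the rules matters.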

Stemming is closely related to lemmatisation, which is the process of grouping the different inflected forms of a word under the lemma for that word. A lemma is the base (dictionary) form of a word, while a stem is the part of the word that stays the same across its related forms. For example, for the inflected word ‘produced’, the lemma is ‘produce’ while the stem is ‘produc’, because there are related forms such as ‘production’. As a result, stems are not necessarily complete words.

The complete Porter stemming algorithm can be found on this page, maintained by the creator of the algorithm. The Ruby implementation, by Ray Pereda, is also available as a Ruby gem. To use the Porter stemmer in Ruby, you can just install the gem:

[source:ruby]
gem install stemmer
[/source]
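Once the gem is installed, using it might look like the sketch below. It assumes that the gem provides a `Stemmable` module and a `String#stem` method, so check the gem's own documentation if this does not match your version. The helper name `group_by_stem` is made up for this example.

```ruby
# A sketch, assuming the 'stemmer' gem exposes Stemmable and String#stem.
begin
  require 'stemmer'
  # Some versions may expect you to mix the module into String yourself.
  String.include(Stemmable) unless String.method_defined?(:stem)
rescue LoadError
  warn 'stemmer gem not found; run `gem install stemmer` first'
end

# Hypothetical helper: groups words that share the same stem.
# Returns an empty hash if the gem could not be loaded.
def group_by_stem(words)
  return {} unless String.method_defined?(:stem)

  words.group_by { |word| word.downcase.stem }
end

p group_by_stem(%w[fish fishes fishing])
```

If the gem loads, the three words should end up grouped under a single stem.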

If you’re interested in the original Porter stemming algorithm paper, you can read it here.
