Stemming algorithm
A stemming algorithm is a method of reducing words to their stem, base, or root form. Stemming has been a long-standing problem in computer science; the first paper on the subject was published in 1968. The process of stemming, often called conflation, is useful in search engines, natural language processing, and other word-processing problems.
For example, a stemming algorithm reduces the words "fishing", "fished", "fish", and "fisher" to the root word, "fish".
Methods
There are several types of stemming algorithms. Two common techniques are suffix stripping, which removes known endings from a word, and lookup table replacement, which maps each inflected form directly to its root. In lemmatization, the part of speech is detected before attempting to find the root, since in some languages the stemming rules change depending on a word's part of speech.
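The techniques above can be sketched in a few lines of Python. This is only a minimal illustration, not the Porter or Lovins algorithm: the suffix list, the lookup table, the part-of-speech table, and the function names are all hypothetical, and the part of speech is supplied by the caller rather than detected.

```python
# Illustrative sketch only. SUFFIXES, LOOKUP and LEMMAS are hypothetical
# examples, not taken from any published stemmer.
SUFFIXES = ("ing", "ed", "er", "s")          # tried in this order
LOOKUP = {"ran": "run", "geese": "goose"}    # irregular forms -> root
LEMMAS = {                                   # (word, part of speech) -> root
    ("saw", "verb"): "see",
    ("saw", "noun"): "saw",
}


def strip_suffix(word, min_stem=3):
    """Suffix stripping: remove the first listed suffix that matches,
    provided at least min_stem characters remain."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word


def stem(word):
    """Lookup table replacement first, then fall back to suffix stripping."""
    word = word.lower()
    if word in LOOKUP:
        return LOOKUP[word]
    return strip_suffix(word)


def lemmatize(word, pos):
    """Use the caller-supplied part of speech to choose the root when the
    (word, pos) pair is known; otherwise fall back to stem()."""
    return LEMMAS.get((word.lower(), pos), stem(word))


if __name__ == "__main__":
    for w in ("fishing", "fished", "fish", "fisher"):
        print(w, "->", stem(w))              # each prints "... -> fish"
    print(lemmatize("saw", "verb"))          # see
    print(lemmatize("saw", "noun"))          # saw
```

A real stemmer needs many more rules and conditions on when a suffix may be removed; with the short list above, a word such as "sing" is left alone only because of the minimum stem length.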
While much of the work in this area has focused on the English language (with significant use of the Porter Stemmer algorithm), stemming has also been investigated for other languages, including French, Italian, Spanish, Portuguese, German, Dutch, Swedish, Norwegian, Danish, Russian, Finnish, Hebrew, and Arabic. Hebrew and Arabic are still considered difficult languages for stemming research.
Further reading
- Frakes, W. B. "Stemming Algorithms." Information Retrieval: Data Structures and Algorithms, Prentice-Hall, Upper Saddle River, NJ, 1992.
- Lovins, J. B. "Development of a Stemming Algorithm." Mechanical Translation and Computational Linguistics 11, 1968, 22--31.
- Porter, M. F. "An Algorithm for Suffix Stripping." Program 14, 1980, 130--137.