What are the major differences and benefits of Porter and Lancaster Stemming algorithms?
I'm working on document classification tasks in Java.
Both algorithms came highly recommended. What are the benefits and disadvantages of each, and which is more commonly used in the literature for natural language processing tasks?
Solution 1:
At the most basic level, the major difference between the Porter and Lancaster stemming algorithms is that the Lancaster stemmer is significantly more aggressive than the Porter stemmer. The three major stemming algorithms in use today are Porter, Snowball (Porter2), and Lancaster (Paice-Husk), and they get more aggressive roughly in that order, with Porter the least aggressive and Lancaster the most. The specifics of each algorithm are fairly lengthy and technical, so here is a breakdown:
Porter: Without a doubt the most commonly used stemmer, and also one of the gentlest. It is one of the few stemmers that actually has Java support, which is a plus, though it is also the most computationally intensive of the three (granted, not by a very significant margin). It is also the oldest of the three by a large margin, dating from 1980.
Porter2: Nearly universally regarded as an improvement over Porter, and for good reason. Porter himself admits that it is better than his original algorithm. It has a slightly faster computation time than Porter and a fairly large community around it.
Lancaster: A very aggressive stemming algorithm, sometimes to a fault. With Porter and Snowball, the stemmed representations are usually fairly intuitive to a reader; not so with Lancaster, where many shorter words become totally obfuscated. It is the fastest algorithm here and will hugely reduce your working set of words, but if you want more distinction between words, it is not the tool you want.
Honestly, I feel that Snowball is usually the way to go. There are certain circumstances in which Lancaster will hugely trim down your working set, which can be very useful; however, in my opinion the marginal speed increase over Snowball is not worth the loss of precision. Porter has the most implementations and so is usually the default go-to algorithm, but if you can, use Snowball.
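If you want to see the trade-off between a smaller working set and lost distinction on your own data, one option is to group your tokens by stem and compare the results per stemmer. The following is only a minimal sketch of that idea. The class name StemmerComparison and the two methods gentleToy and aggressiveToy are hypothetical stand-ins I made up for illustration: they are simple suffix strippers, NOT implementations of Porter, Snowball, or Lancaster. In practice you would pass in a real stemmer from whatever library you use.

```java
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;
import java.util.function.UnaryOperator;

class StemmerComparison {

    // Hypothetical "gentle" stand-in: strips only a trailing plural "s".
    // This is NOT the Porter algorithm; swap in a real stemmer here.
    static String gentleToy(String word) {
        if (word.length() > 3 && word.endsWith("s") && !word.endsWith("ss")) {
            return word.substring(0, word.length() - 1);
        }
        return word;
    }

    // Hypothetical "aggressive" stand-in: strips the first matching suffix
    // from a longer list, as long as at least three characters remain.
    // This is NOT the Lancaster algorithm; swap in a real stemmer here.
    static String aggressiveToy(String word) {
        String[] suffixes = {"ions", "ion", "ings", "ing", "ed", "es", "s"};
        for (String suffix : suffixes) {
            if (word.endsWith(suffix) && word.length() - suffix.length() >= 3) {
                return word.substring(0, word.length() - suffix.length());
            }
        }
        return word;
    }

    // Groups lower-cased tokens by their stem. The number of keys is the
    // vocabulary size after stemming; each value lists the surface forms
    // that were merged into that stem.
    static Map<String, Set<String>> groupByStem(List<String> tokens,
                                                UnaryOperator<String> stemmer) {
        Map<String, Set<String>> groups = new TreeMap<>();
        for (String token : tokens) {
            String word = token.toLowerCase(Locale.ROOT);
            groups.computeIfAbsent(stemmer.apply(word), k -> new TreeSet<>()).add(word);
        }
        return groups;
    }

    public static void main(String[] args) {
        List<String> tokens = List.of("connect", "connects", "connected",
                "connecting", "connection", "connections");

        System.out.println("no stemming: " + groupByStem(tokens, w -> w).size());
        System.out.println("gentle toy:  "
                + groupByStem(tokens, StemmerComparison::gentleToy));
        System.out.println("aggressive:  "
                + groupByStem(tokens, StemmerComparison::aggressiveToy));
    }
}
```

On this small word list the toy rules give 6 distinct forms with no stemming, 4 with the gentle stand-in, and 1 with the aggressive one. The exact numbers mean nothing for the real algorithms; the point is that printing the groups lets you check whether the words a stemmer merges are ones you actually want treated as the same feature in your classifier.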
Snowball - Additional info
Snowball is a small string processing language designed for creating stemming algorithms for use in Information Retrieval.
The Snowball compiler translates a Snowball script into another language - currently ISO C, C#, Go, Java, Javascript, Object Pascal, Python and Rust are supported.
History of the name
Since it effectively provides a ‘suffix STRIPPER GRAMmar’, I had toyed with the idea of calling it ‘strippergram’, but good sense has prevailed, and so it is ‘Snowball’ named as a tribute to SNOBOL, the excellent string handling language of Messrs Farber, Griswold, Poage and Polonsky from the 1960s.
---Martin Porter
Stemmers implemented in the Snowball language are sometimes simply referred to as Snowball stemmers. For example, see the Natural Language Toolkit: nltk.stem.snowball.