Stemming for SEO

What are Word Stems?
A word stem can be thought of as the root shared by a set of words with very similar meanings in different forms.  For example, “bats”, “batting”, “batter”, and “batted” all share the stem “bat”, which can be obtained by stripping the suffix off each word (i.e. by stripping the “s” off “bats”, and so on).  A stem need not be a valid word, however.  For instance, “bicycle”, “bicyclist”, and “bicycling” all share the stem “bicycl”, which is clearly not a word.  The great thing about stems is that if you can strip two words down to their stems and the stems are the same, then the two words very likely have almost the same meaning and are probably just different forms (plural, adverb, past participle, and so on).  If that’s the case, then the words are about as close as you can get from a relevance standpoint; intuitively, “bicycle” and “bicyclist” are more related than “bicycle” and “inner tube”.
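To make the idea of suffix stripping concrete, here is a minimal sketch in Python.  It is deliberately naive and is *not* the Porter algorithm: the suffix list, the three-character minimum, and the function name naive_stem are all my own assumptions, chosen only so that the examples above come out as described.

```python
# Naive suffix stripping for illustration only; this is NOT the Porter algorithm.
SUFFIXES = ("ing", "ist", "ed", "er", "s", "e")  # hypothetical, hand-picked list
VOWELS = set("aeiou")


def naive_stem(word: str) -> str:
    """Strip one known suffix, then collapse a doubled final consonant."""
    word = word.lower()
    for suffix in SUFFIXES:
        # Only strip if at least three characters would remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            word = word[: -len(suffix)]
            break
    # Collapse a doubled final consonant, e.g. "batt" -> "bat".
    if len(word) >= 2 and word[-1] == word[-2] and word[-1] not in VOWELS:
        word = word[:-1]
    return word


for group in (["bats", "batting", "batter", "batted"],
              ["bicycle", "bicyclist", "bicycling"]):
    print({w: naive_stem(w) for w in group})
```

A rule set this crude will mangle plenty of ordinary words, which is exactly why purpose-built algorithms exist.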

The Porter Stemming Algorithm
Various algorithms have been developed for determining the stem of a word (including a surprisingly little-used form of cheating: looking the word up in a dictionary).  The most popular stemming algorithm is the Porter stemming algorithm, which is about 85% accurate.  In other words, two words that ought to share the same stem are identified by the algorithm to have the same stem about 85% of the time.

Try the Porter Stemming Algorithm for Yourself
The algorithm usually does pretty well, but one pair of words it fails on is “squeaking”, which stems to “squeak”, and “squeaky”, which, weirdly and frustratingly, stems to “squeaki” (if anything, you would have thought the other one would have done that!).  There are a few other stemming algorithms around, but they’re only a few percent more accurate at best.
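The “squeaki” result is less mysterious once you look at two of the algorithm's early rules.  The sketch below is only a partial, simplified rendering of Porter's step 1b (the “-ing” case, without its follow-up clean-up rules) and step 1c; the function names are mine, and the rest of the algorithm is omitted, so treat it as an illustration rather than a working stemmer.

```python
VOWELS = set("aeiou")


def has_vowel(stem: str) -> bool:
    """True if the stem contains a vowel; 'y' counts as one after a consonant."""
    for i, ch in enumerate(stem):
        if ch in VOWELS:
            return True
        if ch == "y" and i > 0 and stem[i - 1] not in VOWELS:
            return True
    return False


def step_1b_ing(word: str) -> str:
    """Simplified step 1b: drop 'ing' if the remaining stem has a vowel."""
    if word.endswith("ing") and has_vowel(word[:-3]):
        return word[:-3]
    return word


def step_1c(word: str) -> str:
    """Step 1c: a final 'y' becomes 'i' if the remaining stem has a vowel."""
    if word.endswith("y") and has_vowel(word[:-1]):
        return word[:-1] + "i"
    return word


for w in ("squeaking", "squeaky"):
    print(w, "->", step_1c(step_1b_ing(w)))
```

The “-ing” rule trims “squeaking” to “squeak”, while the “y” rule turns “squeaky” into “squeaki”, and none of the later steps brings the two back together.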

Searching and Word Stems
It is *extremely* common to type one word and see results come back that include, or are focused on, a variation of that word.  You may type “bicycling” and receive documents about “bicyclist”.  So understanding how Google behaves with regard to word stems, and which variations will help you the most, is critical for content optimization purposes.

How Google Handles Stemming
The researchers in the study discussed here (“Google Stemming Mechanisms”, cited with Table 10 further down) would first take a page, analyze it to figure out which forms of a term were on it, and then run queries against Google to see how it was indexed.  For instance, to see if a document containing “cyclist” on “www.foo.com/page1.html” is indexed for the term “cycling” (which let’s assume it does not contain), it could be queried simply with [cycling  www.foo.com/page1.html].  They also did some other fancy queries to include and exclude various word forms when investigating singulars and plurals, and multi-word phrases, but you get the general idea.
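If you wanted to try this kind of probing yourself, building the query strings could look something like the sketch below.  The function name and parameters are hypothetical, and it only covers the simple “term plus URL” query from the example above, not the fancier include/exclude queries the study used.

```python
from typing import Iterable, List


def probe_queries(page_url: str,
                  forms_on_page: Iterable[str],
                  candidate_forms: Iterable[str]) -> List[str]:
    """Build one '<form> <url>' query for each word form NOT found on the page."""
    on_page = {form.lower() for form in forms_on_page}
    return [f"{form} {page_url}"
            for form in candidate_forms
            if form.lower() not in on_page]


# The page contains "cyclist"; which other forms is it indexed for?
print(probe_queries("www.foo.com/page1.html",
                    ["cyclist"],
                    ["cyclist", "cycling", "cyclists"]))
```

Each resulting string is then pasted into the search box by hand, and you note whether the page comes back.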

Is Google Intentionally Handling Stems Differently?
The study speculates about alternate stem-oriented indexes that Google may be maintaining.  It’s not clear to me from the paper whether Google is really explicitly targeting word stems with special algorithms, or whether the results are simply a byproduct of the fact that different word forms for the same term are highly related, by definition.

Google has disclosed in papers on their Paid Search technology that they have access to a proprietary algorithm similar to Latent Semantic Analysis; these sorts of algorithms can identify related words based on how frequently words appear together in a corpus (i.e. a set of documents).  I’ve seen material put out occasionally by SEOMoz speculating or implying that Latent Dirichlet Allocation may be what Google uses; I think that for the machine learning types in the academic community, LDA has been largely superseded in the last couple of years by Principal Components Analysis.
Regardless of the mechanism, it’s clear that Google looks at how related words are to each other when determining results of a search.  And whether or not Google is *intentionally* handling stems differently, the study found that stems consistently behave differently than other terms.
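To give a feel for what “how frequently words appear together in a corpus” means, here is a toy co-occurrence counter.  It is far simpler than Latent Semantic Analysis (it only counts how many documents each pair of words shares), and the three sample documents are made up for illustration.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(documents):
    """Count, for each pair of words, how many documents contain both."""
    counts = Counter()
    for doc in documents:
        words = sorted(set(doc.lower().split()))
        for a, b in combinations(words, 2):
            counts[(a, b)] += 1  # pairs are stored in alphabetical order
    return counts


# Made-up corpus, purely for illustration.
docs = [
    "bicycle repair and bicycle tires",
    "bicycle tires and inner tubes",
    "batting practice for the batter",
]
counts = cooccurrence_counts(docs)
print(counts[("bicycle", "tires")])   # appear together in two documents
print(counts[("batter", "bicycle")])  # never appear together
```

Real systems work from these kinds of counts at enormous scale and then apply heavier math on top, but the raw signal is the same: words that keep showing up together are probably related.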

#1: Singulars can Help Rank for Plurals and Vice-Versa
The study found that documents with singular forms of keywords came up for plural-form queries more often (about 85% of the time) than documents with plural forms of keywords came up for singular-form queries (about 59% of the time).  For instance, a document with “coconut” would be returned for the query “coconuts” a higher percentage of the time than a document with “coconuts” would be returned for the query “coconut”.  In other words, singular phrases help you rank for plurals more than plurals help you rank for singular phrases.  So if you are trying to rank for a plural phrase, including the singular term a few times probably helps.  The opposite is also true, but less so according to the percentages.  Either way, including the other form some number of times is probably wise.

#2: Combined Words can Help Rank for Sub-words
The study also examined combined words.  In other words, if your content contains [batgirl], will that help it rank for “bat”, “girl”, “bats”, “girls”, “batsgirl”, “batgirls”, or “batsgirls” as well?  What they found (in our interpretation here) was that content in the form [batgirl] should help you rank for its direct break-up [bat girl], but [batgirl] will *not* help for inexact break-ups or other plural variations (for instance [bat girls], [bats girl], [bats girls], or [batgirls]).

#3: Subwords can Help Rank for Combined Words
Is the converse true though, i.e. should content with [bat girl] help you rank for [batgirl]?  Based on the study results, *yes*, and it will also help you rank for [batgirls], but surprisingly, not [batsgirl].  Individual sub-words aid in ranking for their exact combination, and also for the plural version of that combination, but only if the second word in the combined version is the plural one (i.e. [rat nest] will likely not help you rank for [ratsnest] but could help you with [ratnests]).
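Findings #2 and #3 can be boiled down to a couple of tiny rules.  The sketch below encodes my interpretation of them, not anything Google has published; the function names are made up, and forming the plural by just adding “s” is a simplifying assumption.

```python
from typing import List


def helped_by_subwords(first: str, second: str) -> List[str]:
    """Queries that content containing '<first> <second>' should also help with."""
    combined = first + second
    # The exact combination, plus its plural (only the second word pluralized).
    return [combined, combined + "s"]


def helped_by_combined(first: str, second: str) -> List[str]:
    """Queries that content containing '<first><second>' should also help with."""
    # Only the direct break-up; no plural or inexact variations.
    return [f"{first} {second}"]


print(helped_by_subwords("bat", "girl"))  # content has [bat girl]
print(helped_by_combined("bat", "girl"))  # content has [batgirl]
```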
So, by way of corresponding examples we have Table 1, based on the study’s findings and our interpretation of those here.  Of course, a term will not just help you rank for another term.
Table 1 - Effect of Plural/Singular Word Combinations

Table 2 – Best Practice for Singulars, Plurals, and Combination Terms

Use Additional Terms In Descending Order of Frequency

For the first additional version, use it 1/4 as many times as the term you want to rank for, then use ratios of 1/8, 1/16, and 1/32 for the others (my recommendations, based on experience).
Why is the first one X/4, where X is the number of times you use the main term?  Well, X itself is too big: you’d then be smearing the relevance of the page out amongst *two* terms, and Google might think your document is not about the main term you’re targeting.  So clearly a number smaller than X is the correct one to use.  I like X/4 because presumably a natural-appearing distribution should be some sort of long-tail geometric distribution, and X/4 is a reasonable guess in that case.  Any better suggestions would be greatly appreciated.

For example, if you want to rank for [bat girl]…
…and keyword frequency analysis of the top ranking pages for that term tells you that you need the term [bat girl] 64 times…
…then also include [bat] 16 times…
…[girl] 16 times…
…and [batgirl] 8 times.

Don’t get hung up on hitting exact numbers though; these are all “ballpark” recommendations.

A *Major* Unanswered Question
However, for those combined-word situations, the study only examined *valid* combined words; it left unexplored the question of nonsense combined words.  In other words, if you want to rank for [squeaky floor], should you include [squeakyfloor] in the document?  This is a *great* question for our industry to explore.  I’ve not seen anything on this, but surely someone must have tried it!  Please comment below if you have seen any evidence on this front.

Different Verb Forms
Table 10 of the paper, below, shows the study’s results for twelve different verb forms.  Column 1 (on the left) represents documents with the particular verb form; Row 1 (at the top) shows the queries that those documents tended to rank for, and the numbers in the table show the % of the time that they ranked.  So, for instance, documents containing “ing” terms (like “boxing”) were returned 38.5% of the time when the query ended in “ed” (like “boxed”):

Google's Behavior on Verb Stems

Stemming test results in percentages for 10 different verbs with 12 different postfixes*

*Reprinted here by permission of SAGE and Ahmet Uyar.
“Google Stemming Mechanisms”,
Journal of Information Science 35 (5) 2009, pp. 499–514 © Ahmet Uyar
When you look at Table 10, certain combinations really stand out.  The top performers (if you look at the rightmost “Average” column) were the Plain form, the “-ed” form, the “-tion” form, and the “-tive” form.  Surprisingly, the “-s” form didn’t perform that well (although it performed well in the individual cases “Plain”, “-ed”, and “-ing”, its performance for all the others was abysmal).  Note that “-tive” should help you rank for “-tively”, but the converse is oddly not true.
So, the simple takeaway from this table is: pepper the forms (Plain, -ed, -tion, and -tive) into your content.  Below is a table if you want to be more systematic about it.  I used a value of around 20% in Table 10 as a filter to come up with the table of best practices for verbs below:

Table 3 - Best Practice for Verb Forms

Use the Same Descending Frequency Percentages
For these alternate verb forms I recommend you use the same descending frequency ratios we presented for Table 2 above.

For example, if you want to rank for [creating]…
…and keyword frequency analysis of the top ranking pages for that term tells you that you need the term [creating] 64 times…
…then also include [create] 16 times…
…[creates] 8 times…
…[creation] 4 times…
…and [created] 2 times.
Again, don’t get hung up on exact numbers; these are rough guidelines.
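If you'd rather not do the halving in your head, the descending ratios are easy to script.  This is just a sketch of my rule of thumb; the function name is made up, and rounding to a whole number with a floor of one use is my own assumption for small counts.

```python
def variant_frequencies(target_count, variants):
    """Assign each variant 1/4, 1/8, 1/16, ... of the target term's count, in order."""
    plan = {}
    divisor = 4
    for variant in variants:
        # Round to a whole number of uses, but never go below one (assumption).
        plan[variant] = max(1, round(target_count / divisor))
        divisor *= 2
    return plan


# The [creating] example: the main term appears 64 times.
print(variant_frequencies(64, ["create", "creates", "creation", "created"]))
```

List the variants in the order you want them weighted, most frequent first, and the function hands back ballpark counts for each.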

Why Descending Order and Not Ascending?
An astute reader might ask why I recommend frequencies in descending order and not ascending order (i.e. interpreting from Table 10, since the “-ing” version probably doesn’t help the “Plain” version as much as the “-ed” version does, why not have “-ing” appear more frequently in your document, so it has the opportunity to help as much as the “-ed” forms you’re including?).  The reason is that it looks to me like the researchers organized the columns in descending order of frequency in documents (i.e. you probably see the “Plain” version of a verb more often than the “-tively” version), and I believe that peppering in these other forms in descending order is the proper thing to do from the standpoint of making the content appear as *natural* as possible.  The same logic applies to our Table 2 as well.

Another Stemming Use: Meta-Tags
Don’t forget to take advantage of word stems in meta-tags.  For instance, if you have a page targeting keywords like “Bicycle”, you might use a title like “Bicycle – information on Bicycling”.  This way you’re not overloading the title with the same keyword multiple times, but you’re getting a highly related keyword in there.  This should hold for all meta-tags including the meta-description.  Also, note that Google often highlights different stems or word combinations in the title and meta-description in the SERP (see figure 1):

Compound Version of Search Term Bolded in Meta-Description

Don’t Neglect Other Related Keywords!

Because Google is using this sort of technology, don’t forget to pepper related keywords in addition to stem variations; there are a number of free tools available you can use to analyze SERPs; one 
