What are Word Stems?
A word stem can be thought of as the root shared by a set of words that have
very similar meanings but appear in different forms. For example, “bats”,
“batting”, “batter”, and “batted” all share the stem “bat”, which can be
obtained by stripping the suffix characters off of each word (e.g. by
stripping the “s” off of “bats”).
However, a stem need not be a valid word. For instance, the words “bicycle”,
“bicyclist”, and “bicycling” all share the stem “bicycl”, which is clearly
not a word. The great thing about stems, though, is that if you can strip two
words down to their stems and the stems are the same, then the two words most
likely have almost the same meaning and are probably just different forms
(plural, adverb, past participle, and so on). If that’s the case, then the
words are about as close as you can get from a relevance standpoint;
intuitively, the terms “bicycle” and “bicyclist” are more related than
“bicycle” and “inner tube”.
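To make the suffix-stripping idea above concrete, here is a minimal sketch in Python. It is a toy, not the Porter algorithm or any real stemmer: the function name `naive_stem`, the suffix list, and the "undouble the final consonant" rule are all assumptions chosen just to handle the “bat” examples.

```python
# Toy suffix stripper for illustration only; NOT the Porter algorithm.
# The suffix list and the minimum stem length are arbitrary assumptions.
SUFFIXES = ("ing", "ed", "er", "s")


def naive_stem(word: str) -> str:
    """Strip one common suffix, then collapse a doubled final consonant."""
    word = word.lower()
    for suffix in SUFFIXES:
        # Only strip if at least three characters would remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            word = word[: -len(suffix)]
            break
    # "batt" (from "batting") becomes "bat".
    if len(word) >= 2 and word[-1] == word[-2] and word[-1] not in "aeiou":
        word = word[:-1]
    return word


for w in ("bats", "batting", "batter", "batted"):
    print(w, "->", naive_stem(w))
```

All four example words reduce to “bat” here, but a handful of rules like this breaks down quickly on other vocabulary, which is why real stemmers apply many more rules.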
The Porter Stemming Algorithm
Various algorithms have been developed
for determining the stem of a word (including a surprisingly little-used form
of cheating: looking the word up in a dictionary). The most popular
stemming algorithm is the Porter stemming algorithm, which is about 85%
accurate. In other words, two words that ought to share the same stem are
identified by the algorithm as having the same stem about 85% of the time.
Try the Porter Stemming Algorithm for Yourself
The algorithm usually does pretty well, but one pair of words it fails on is
“squeaking”, which stems to “squeak”, and “squeaky”, which, weirdly and
frustratingly, stems to “squeaki” (if anything, you would have thought the
other one would have done that!). There are a few other stemming algorithms
around, but they’re only a few percent more accurate at best.
Searching and Word Stems
It is *extremely* common to type one word and see results come back that
include, or are focused on, a variation of that word. You may type
“bicycling” and receive documents about “bicyclist”. So understanding how
Google behaves with regard to word stems, and which variations will help you
the most, is critical for content optimization purposes.
How Google Handles Stemming
The researchers would first take a page, analyze it, and figure out which
forms of a term were on it, then run queries against Google to see how it was
indexed. For instance, to see if a document containing “cyclist” at
“www.foo.com/page1.html” is indexed for the term “cycling” (which let’s
assume it does not contain), Google could be queried simply with
[cycling www.foo.com/page1.html]. They also ran some fancier queries to
include and exclude various word forms when investigating singulars and
plurals, and multi-word phrases, but you get the general idea.
Is Google Intentionally Handling Stems Differently?
The study speculates about alternate
stem-oriented indexes that Google may be maintaining. It’s not clear to
me from the paper whether Google is really explicitly targeting word stems with
special algorithms, or whether the results are simply a byproduct of the fact
that different word forms for the same term are highly related, by definition.
Google has disclosed in papers on
their Paid Search technology that they have access to a proprietary algorithm
similar to Latent Semantic Analysis; these sorts of algorithms can identify
related words based on how frequently words appear together in a corpus (i.e. a
set of documents). I’ve seen material put out occasionally by SEOMoz speculating
or implying that Latent Dirichlet Allocation may be what Google uses; I think
that for the machine learning types in the academic community, LDA has been
largely superseded in the last couple of years by Principal Components
Analysis.
Regardless of the mechanism, it’s clear that Google looks at how related
words are to each other when determining the results of a search. And whether
or not Google is *intentionally* handling stems differently, the study found
that stems seem to consistently behave differently from other terms.
#1: Singulars can Help Rank for Plurals and Vice-Versa
The study found that documents with singular forms of keywords came up for
plural-form queries more often (about 85% of the time) than documents with
plural forms of keywords came up for singular-form queries (about 59% of the
time). For instance, a document with “coconut” would be returned for the
query “coconuts” a higher percentage of the time than a document about
“coconuts” would be returned for the query “coconut”. In other words,
singular phrases help you rank for plurals more than plurals help you rank
for singular phrases. So if you are trying to rank for a plural phrase,
including the singular term a few times probably helps. The opposite is also
true, but less so according to the percentages. Either way, including the
other form some number of times is probably wise.
#2: Combined Words can Help Rank for Sub-words
The study also examined combined words. In other words, if your content
contains [batgirl], will that help it to rank for “bat”, “girl”, “bats”,
“girls”, “batsgirl”, “batgirls”, or “batsgirls” as well? What they found (in
our interpretation here) was that content in the form [batgirl] should help
you to rank for its direct break-up [bat girl], but [batgirl] will *not* help
for inexact break-ups or other plural variations (for instance [bat girls],
[bats girl], [bats girls], or [batgirls]).
#3: Subwords can Help Rank for Combined Words
Is the converse true, though? That is, should content with [bat girl] help
you rank for [batgirl]? Based on the study results, *yes*, and it will also
help you rank for [batgirls], but surprisingly, not [batsgirl]. Individual
sub-words aid in ranking for their exact combination, and also for the plural
version of that combination, but only if the second word in the combined
version is the plural one (e.g. [rat nest] will likely not help you rank for
[ratsnest] but could help you with [ratnests]).
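The rules in #2 and #3 can be summarized as a small lookup. The Python sketch below encodes only this article's interpretation of the study, not Google's actual behavior; the function name `likely_helped_queries`, the `content_form` labels, and the naive "add an s" pluralization are assumptions made for illustration.

```python
# Encodes this article's reading of findings #2 and #3; not an official rule.
# Pluralization is naively "add s", which is an assumption.
def likely_helped_queries(first: str, second: str, content_form: str) -> list[str]:
    """Return the query forms that content in the given form may help rank for.

    content_form is "combined" (e.g. [batgirl]) or "separate" (e.g. [bat girl]).
    """
    if content_form == "combined":
        # [batgirl] helps only its direct break-up [bat girl].
        return [f"{first} {second}"]
    if content_form == "separate":
        # [bat girl] helps [batgirl] and [batgirls], but not [batsgirl].
        return [f"{first}{second}", f"{first}{second}s"]
    raise ValueError("content_form must be 'combined' or 'separate'")


print(likely_helped_queries("rat", "nest", "separate"))
print(likely_helped_queries("bat", "girl", "combined"))
```

Anything not in the returned list (such as [ratsnest] for content containing [rat nest]) is a form the study, as read here, suggests you would need to target separately.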
So, by way of corresponding examples
we have Table 1, based on the study’s findings and our interpretation of those
here. Of course, a term will not just help you rank for another term.
Table 1 – Effect of Plural/Singular Word Combinations
Table 2 – Best Practice for Singulars, Plurals, and Combination Terms
Use Additional Terms In Descending Order of Frequency
For the first additional version, use it 1/4 of the number of times you are
using the term you want to rank for, then use ratios of 1/8, 1/16, and 1/32
for the others (my recommendations, based on experience).
Why is the first one X/4, where X is the number of times you use the main
term? Well, X itself is too big: you’d then be smearing the relevance of the
page out amongst *two* terms, and Google might think your document is not
about the main term you’re targeting. So clearly a number smaller than X is
the correct one to use. I like X/4 because presumably a natural-appearing
distribution should be some sort of long-tail geometric distribution, and X/4
is a reasonable guess in that case. Any better suggestions would be
gratefully received.
For example, if you want to rank for [bat girl]…
…and keyword frequency analysis of the top ranking pages for that term tells you that you need the term [bat girl] 64 times…
…then also include [bat] 16 times…
…[girl] 16 times…
…and [batgirl] 8 times.
Don’t get hung up on hitting exact numbers, though; these are all
“ballpark” recommendations.
A *Major* Unanswered Question
However, for those combined word situations, the study only examined *valid* combined words; it left unexplored the question of nonsense combined words. In other words, if you want to rank for [squeaky floor], should you include [squeakyfloor] in the document? This is a *great* question for our industry to explore. I’ve not seen anything on this, but surely someone must have tried it! Please comment below if you have seen any evidence on this front.
Different Verb Forms
Table 10 of the paper, below, shows
the study’s results for twelve different verb forms. Column 1 (on the
left) represents documents with the particular verb form; Row 1 (at the top)
shows the queries that those documents tended to rank for, and the numbers in
the table show the % of the time that they ranked. So, for instance,
documents containing “ing” terms (like “boxing”) were returned 38.5% of the
time when the query ended in “ed” (like “boxed”):
Stemming test results in percentages for 10 different verbs with 12
different postfixes*
*Reprinted Here by Permission of SAGE and Ahmet Uyar.
“Google Stemming Mechanisms”,
Journal of Information Science 35 (5) 2009, pp. 499–514 © Ahmet Uyar
When you look at Table 10, certain combinations really stand out. The top
performers (if you look at the rightmost “Average” column) were the Plain
form, the “-ed” form, the “-tion” form, and the “-tive” form. Surprisingly,
the “-s” form didn’t perform that well (although it performed well in the
individual cases “Plain”, “-ed”, and “-ing”, its performance for all the
others was abysmal). Note that “-tive” should help you rank for “-tively”,
but the converse is oddly not true.
So, the simple takeaway from this table is: pepper the Plain, -ed, -tion,
and -tive forms into your content. Below is a table if you want to be more
systematic about it. I used a value of around 20% in Table 10 as a filter to
come up with the table of best practices for verbs below:
Table 3 – Best Practice for Verb Forms
Use the Same Descending Frequency Percentages
For these alternate verb forms I
recommend you use the same descending frequency ratios we presented for Table 2
above.
For example, if you want to rank for [creating]…
…and keyword frequency analysis of the top ranking pages for that term
tells you that you need the term [creating] 64 times…
…then also include [create] 16 times…
…[creates] 8 times…
…[creation] 4 times…
…and [created] 2 times.
Again, don’t get hung up on exact numbers; these are rough guidelines.
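The arithmetic behind the example above is simple enough to sketch in Python. The ratios 1/4, 1/8, 1/16, and 1/32 come from the recommendation in this article; the function name `variant_targets` and the choice to round to the nearest whole number are assumptions, and the results are ballpark targets rather than exact requirements.

```python
# Ratios are the article's rule-of-thumb values; rounding is an assumption.
RATIOS = (1 / 4, 1 / 8, 1 / 16, 1 / 32)


def variant_targets(main_count: int, variants: list[str]) -> dict[str, int]:
    """Assign each alternate form a rough target count, in descending order.

    Only the first four variants get a target, since only four ratios
    are recommended.
    """
    return {
        form: round(main_count * ratio)
        for form, ratio in zip(variants, RATIOS)
    }


targets = variant_targets(64, ["create", "creates", "creation", "created"])
print(targets)
```

With a main-term count of 64 for [creating], this reproduces the 16, 8, 4, and 2 figures used in the example.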
Why Descending Order and Not Ascending?
An astute reader might ask why I recommend frequencies in descending order
and not ascending order. Interpreting from Table 10, the “-ing” version
probably doesn’t help the “Plain” version as much as the “-ed” version does,
so why not have “-ing” appear more frequently in your document, so that it
has the opportunity to help as much as the “-ed” forms you’re including? The
reason is that it looks to me as though the researchers organized the columns
in descending order of frequency in documents (i.e. you probably see the
“Plain” version of a verb more often than the “-tively” version), and I
believe that peppering in these other forms in descending order is the proper
thing to do from the standpoint of making the content appear as *natural* as
possible. The same logic applies to our Table 2 as well.
Another Stemming Use: Meta-Tags
Don’t forget to take advantage of word
stems in meta-tags. For instance, if you have a page targeting keywords
like “Bicycle”, you might use a title like “Bicycle – information on
Bicycling”. This way you’re not overloading the title with the same
keyword multiple times, but you’re getting a highly related keyword in
there. This should hold for all meta-tags including the
meta-description. Also, note that Google often highlights different stems
or word combinations in the title and meta-description in the SERP (see figure
1):
Don’t Neglect Other Related Keywords!
Because Google is using this sort of
technology, don’t forget to pepper related keywords in addition to stem
variations; there are a number of free tools available you can use to analyze
SERPs; one