Digital State Consulting

Correlation does not imply… do I really need to finish this sentence? Yes, correlation does not imply causation, but yelling this phrase every time a new correlation study appears is just as silly as treating everything that comes out of one as gospel.

Correlation studies should be used as guidance towards areas of further study, not as a finished product for justifying your practices, and the questions surrounding them are more subtle than a simple verdict of useless or wonderful.

At the risk of beating a dead horse, I hope to cover some points that I have not seen mentioned before and give a taster of another method to test for association: logistic regression.

Observations on correlation studies

1. The correlations are low, so they may as well be ignored?

This is a common argument and one that, on the face of it, makes sense: because the correlations are of a size that would generally be considered weak in most textbooks or school classes, they must be of little or no value. I shall illustrate how this can be wrong with the following example:

Word Count

Take a fictional search engine – let’s call it 10100.

Say 10100 awards one point in its ranking algorithm for every word on a page, and that the number of words on a page, X, is equally likely to be any value between 0 and a maximum of 10,000 (i.e. X is uniformly distributed on 0-10,000).

Of course, 10100 won't judge a page only on its number of words; in a correlation study, however, all the other potential factors are considered separately. Let us model the additional score in 10100's ranking algorithm due to these other factors as normally distributed with mean 0 [the mean does not actually matter here, as Spearman's rank correlation coefficient depends only on relative rank] and standard deviation 10,000. So the total of all other factors, Y, is modelled as Y ~ N(0, 10,000^2), and the overall score determined by our favourite fictional search engine is

Ranking Score = X + Y

Now we simulate the method practised by these correlation studies in R, a free piece of statistical software I would fully recommend taking a look at if you have never heard of it before.

rhoEstimates = rep(0, 14000)  # The Moz Correlation Study (TMCS) in 2013 used approximately 14,000 search terms

for (i in 1:14000) {
  x = floor(runif(50, 0, 10001))  # TMCS took the top 50 results for each search term
  y = rnorm(50, 0, 10000)         # the combined contribution of all other factors
  rhoEstimates[i] = cor(x + y, x, method = "spearman")  # store the correlation for each of the 14,000 search terms
}

mean(rhoEstimates)

[1] 0.2666066

And so this correlation study of our fictional search engine’s ranking system against word count would result in a correlation of around 0.27 (even though we know it contributes to the ranking).

Obviously these numbers and distributions are made up and have zero validity in the real world. The point is that because there are so many factors to consider (the consensus so far is that Google uses over 200 ranking factors), even when a correlation study does hit upon a genuine factor, the correlations are not likely to be “textbook large”. What counts as a large correlation is relative to each field, and it just happens that, in the SEO industry, “large” can be about ±0.3.

2. More samples don't always make a study more useful

The two major correlation studies that I have seen (from Moz and Searchmetrics in 2013) used a large number of search terms. But why is collecting the mean correlation of the top 50 results for 14,000 search terms better than 10,000?

What if a new study adopts the same methodology but uses 100,000 search terms – do we ignore the other studies in favour of this?

What about if a study with a sample size of 200 search terms comes back with some controversial results? Do we ignore it because there is not enough data?

This can be a difficult question in statistical analysis and, strictly speaking, there is no single correct sample size. I propose that we have reached the point of “enough” and that ever-larger correlation studies are at best showing off, and at worst misleading people into thinking they are more useful than they actually are.

Delving a little deeper into methodology and entering the fun realm of maths, it can be shown that the accuracy in estimating the true mean correlation (the one we would obtain if we were to use an infinite sample size) improves in proportion to √N. That is, if we want to halve the confidence interval in which we are x% sure the true mean correlation lies, we need four times as many samples. I have struggled to find any work on the standard deviations of correlation coefficients from unspecified distributions, but based on the full results from the Moz study in 2013 and some of our in-house work, it is assumed here that the standard deviation of the Spearman's correlation coefficient for a sample size of 50 is no more than 0.25. On that assumption, for a sample of 14,000 search terms we obtain a [conservative] 98% confidence interval of ±0.005 for each of the mean correlations.
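To make that concrete, here is the back-of-the-envelope calculation in R (the 0.25 standard deviation is the assumption stated above, not a measured value):

sd_rho = 0.25            # assumed upper bound on the sd of a single Spearman coefficient
n = 14000                # number of search terms
se = sd_rho / sqrt(n)    # standard error of the mean correlation, roughly 0.0021
qnorm(0.99) * se         # half-width of a 98% confidence interval, roughly 0.0049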

This means a sample size of 14,000 gives a mean Spearman's that we are at least 98% confident is the same (to 2 decimal places) as the one we would get by applying the same methodology to an infinite sample. Most of these studies don't go beyond 2 decimal places when presenting their findings, so any accuracy gained from larger sample sizes is lost to this rounding anyway.

Now there are other considerations that come into play here, and if you already have the data on hand for other purposes, it makes no sense not to use it; if anything, visibly reducing your original dataset can raise suspicions. But this level of accuracy, for a statistic whose role is to guide further research, is in essence a little pointless. Knowing the population mean Spearman's for a factor to 3, 4, 5 or more decimal places tells us little beyond what 2 decimal places already do, and we should bear that in mind when we use these studies. I am not saying that using a large sample is inherently wrong. What I am saying is that we need to be careful not to adopt a “bigger sample is automatically better” philosophy.

3. The assumption of independence

Here, the devil is in the detail.

Correlation studies following the more traditional method make the assumption that all the ranking factors are independent – and, in fairness, this assumption is reasonable for a lot of them: knowing the number of Facebook likes a page has received isn't likely to tell you the total number of characters in the HTML code.

A greater stretch may be needed to justify that Facebook likes and Facebook shares are independent but then again, one could in theory get more shares on Facebook whilst keeping the likes relatively constant. But what if one factor cannot possibly increase or be present without another? What if one factor is nested within the other? Does it really matter that much?

Word count part 2: the letter e

Let us consider the same fictional search engine from before, 10100.

This time, though, we have a (false) suspicion that our search engine awards additional ranking score for every word that contains the letter e. For example, the words “belly” and “event” would each earn a page a slightly higher ranking score than “fat” and “show”, and by a set amount.

Keep all the assumptions from the first example and add that each word has a 50/50 chance of containing the letter e, independently of the words around it. This is an example of a nested situation: we can't have more words containing the letter e than words in total. We will run the simulation again, this time also recording the mean Spearman's for the factor “number of words containing the letter e”.

rhoEstimates = rep(0, 14000)   # running the same simulation as the first example
rhoEstimates2 = rep(0, 14000)

for (i in 1:14000) {
  x = floor(runif(50, 0, 10001))
  z = rep(0, 50)
  for (j in 1:50) {
    z[j] = sum(floor(runif(x[j], 0, 2)))  # each word has a 50/50 chance of containing an e; count how many do
  }
  y = rnorm(50, 0, 10000)
  rhoEstimates[i] = cor(x + y, x, method = "spearman")   # correlations for word count, as in the first example
  rhoEstimates2[i] = cor(x + y, z, method = "spearman")  # correlations between ranking score (word count + white noise) and e count
}

mean(rhoEstimates)   # mean Spearman's for word count (will not be exactly the same as in the first example!)

[1] 0.2670338

mean(rhoEstimates2)  # mean Spearman's for e count

[1] 0.2688919

It can be seen here that we get a mean correlation of 0.27 for the number of words containing the letter e, something we know has no effect on rankings. Again, this is just a (slightly silly) example to highlight what can go on and why one should be wary when a claim is made with only a correlation study to back it up. Often the only difference between an obviously stupid example and the next big thing is a convincing back story and nothing more!

This situation occurred within our in-house study (which we will be publishing in the next few days). When analysing the impact on ranking of referencing an XML sitemap in a robots.txt file, we initially measured the mean Spearman's of this variable on the whole sample. This gave us a mean correlation of about 0.15.

Left at that, one might (falsely) infer that referencing an XML sitemap in your robots.txt file is moderately useful. However, when we filtered our sample to include only pages that actually had both an XML sitemap and a robots.txt file, the mean Spearman's dropped to 0.06 and the inference became quite a bit less convincing.
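For illustration only, here is a minimal sketch of that filtering step in R. It assumes a hypothetical data frame serp with one row per URL and columns term (the search term), position (ranking position, 1-50), sitemap_in_robots (1 if the robots.txt references an XML sitemap) and has_both_files (1 if the domain has both files); the real study's data structure may differ.

per_term_rho = function(df) {
  rhos = sapply(split(df, df$term), function(d)
    cor(-d$position, d$sitemap_in_robots, method = "spearman"))  # higher rank is better, hence the minus sign
  mean(rhos, na.rm = TRUE)  # terms where the factor never varies return NA and are dropped
}

per_term_rho(serp)                               # whole sample
per_term_rho(serp[serp$has_both_files == 1, ])   # only domains that have both files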

Maybe this blanket assumption of independence does make things easier, but it can lead to results open to misinterpretation.

4. An alternative route to analysis

A correlation study is not designed to tell us which of the tested factors a search engine's algorithm actually uses, but to give us an idea of which traits are common among highly ranking URLs and which of those are likely to help a page rank well.

With this philosophy in mind, I’d like to discuss the use of logistic regression and odds ratios.

Multiple logistic regression & odds ratios

These techniques extend much further and, unfortunately, the topic is too large to cover fully here [for more information to get started, and some help applying it, I would recommend this R tutorial], but for our study we will focus on multiple logistic regression for a binary outcome. I will simply try to weigh up some of the advantages of its use.

Firstly, we choose a cut-off point, a line between success and failure. Why would we do this? As anyone with even vague SEO knowledge will confirm, the vast majority of clicks obtained through SERPs go to the top few results; furthermore, climbing the rankings from 40 to 30, from 20 to 10 and from 10 to 1 are clearly not equivalent feats. Choosing a cut-off point lets us reflect this.

For our recent study we chose the threshold to be the top 5. Your own regressions may use a different value, and it would be interesting to discuss where different agencies draw the line between success and failure in this regard.

Next we perform the logistic regression analysis. For the purposes of this study I won't go into the complexities here; instead we will focus on interpreting the results.
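To give a flavour of what such a fit looks like in practice, here is a minimal sketch in R. It is not our exact modelling code: it assumes a hypothetical data frame serp with a position column (1-50) and raw factor columns TLTP and LDEP, and it applies the same fourth-root transform that appears in the results discussed below.

serp$top5 = as.integer(serp$position <= 5)      # 1 = success (ranked in the top 5), 0 = failure

fit = glm(top5 ~ I(TLTP^0.25) + I(LDEP^0.25),   # transformed link counts; a real model would include more factors
          data = serp, family = binomial)       # the binomial family gives logistic regression

summary(fit)        # coefficient estimates, standard errors and p-values
exp(coef(fit))      # coefficients converted to odds ratios
exp(confint(fit))   # confidence intervals on the odds-ratio scale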

Firstly, though, a quick word on odds, probability and odds ratios…

Odds, probability and odds ratios

If the probability of an event occurring is p, then we define the odds of the event occurring as p/(1 − p). Essentially, this stretches our traditional probability measure, which runs from 0 to 1, onto any non-negative number. For the purposes of modelling, sure events (those with a probability of 1) are omitted, since their odds would be infinite.

This may seem like an over-complication but it turns out to be useful to explain changes in terms of odds of an event occurring. Keeping things simple – if the odds increase, so will the probability of the event occurring.

An odds ratio is simply the odds of one event divided by the odds of another. It tells us how much greater the odds of the first event are compared with the second.

This is similar to, but not the same as, relative risk, which is the ratio of the probabilities.
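A quick numerical illustration in R, using two made-up probabilities purely for demonstration:

odds = function(p) p / (1 - p)

p_a = 0.8               # probability of event A (illustrative only)
p_b = 0.6               # probability of event B (illustrative only)

odds(p_a)               # 4
odds(p_b)               # 1.5
odds(p_a) / odds(p_b)   # odds ratio: about 2.67
p_a / p_b               # relative risk: about 1.33, a different quantity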

So what do odds ratios have to do with logistic regression?

Well, let's take a look at some of the results from our in-house study (more details will be available in our full report, to be published soon).

[Table: coefficient estimates and 95% confidence intervals from our multiple logistic regression, where TLTP = total links to page and LDEP = links to domain excluding page]

So what should one do with these results? We take a factor, say TLTP^0.25, and, assuming all the other factors are held fixed, the model says that an increase of one unit in TLTP^0.25 gives an odds ratio of exp(0.123) = 1.131.

The model suggests, therefore, that for every unit increase in TLTP^0.25, the odds of a page ranking in the top 5 increase by approximately 13%. For example, holding all the other factors in the regression the same for two pages, page A with 200 links and page B with 100 links, the predicted odds ratio is exp(0.123 × (200^0.25 − 100^0.25)) = 1.076, so the model predicts that the odds of page A ranking in the top 5 are about 7.6% higher than page B's.
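The arithmetic is easy to check in R (0.123 is the coefficient quoted above; the values in the comments are rounded):

b = 0.123                         # estimated coefficient for TLTP^0.25 quoted above

exp(b)                            # about 1.131: odds ratio for a one-unit increase in TLTP^0.25
exp(b * (200^0.25 - 100^0.25))    # about 1.076: page with 200 links vs page with 100 links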

Obviously, the numbers for this model should not be taken as gospel – it is in its early stages and clearly there are far more factors that need to be considered. So how can this all still be of any use? The main use of logistic regression in this context is to determine which factors are significant, not to provide an accurate prediction; that is, which of the coefficients are likely to not be 0, given our estimates.

Interpreting the coefficients

If a coefficient is zero then we see an odds ratio of exp(0) = 1. This translates to: holding everything else considered fixed and changing our tested factor, we expect to see no change in the odds of ranking highly. In the table above we provide confidence intervals, so for our TLTP^0.25 factor we are 95% confident that the true coefficient, the value our estimate would converge to if we applied this regression to an infinite sample, lies somewhere between 0.066 and 0.180.
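That interval can also be read on the odds-ratio scale; since it does not contain exp(0) = 1, the model treats the factor as significant at the 5% level:

exp(c(0.066, 0.180))    # roughly 1.068 to 1.197; the interval excludes 1, so the coefficient is unlikely to be 0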

Multiple logistic regression takes other potential factors into account, whilst the correlation studies as they stand do not. Even if only a few factors are included, surely that is better?

As we have already discussed, we felt that ranking in the top 5 is what matters most, and a binary variable reflects the way click-through rate is concentrated at the top of the rankings more appropriately than a raw position does.

Odds ratios have an interpretable meaning, whereas it is still unclear what exactly should be inferred from a mean Spearman's correlation. Multiple logistic regression also has the potential to offer predictive power; no matter how many factors are considered or how large our sample size, a correlation on its own will never achieve this.

To be clear, this is not an exhaustive tutorial on the use of logistic regression. It is, at best, a brief taster to encourage discussion of its use in the SEO industry. It was never our intention to lay out a rigorous explanation of logistic regression, and by no means is this the limit of what can be done with generalised linear models.

Nor am I saying we should replace correlation studies with them. They could, however, act as the next stage of evidence-building after correlation studies and should be seen as another tool in the box for uncovering factors in search engine algorithms.