On a recent hike around the ruins of the St. Francis Dam disaster site, about 40 miles from Downtown Los Angeles, my archaeologist friend John and I discussed the tarnished life of its builder and the age of the “Gentleman Scientist.”
The St. Francis Dam was built between 1924 and 1926 to create a large storage reservoir for the city of Los Angeles, California, by the Bureau of Water Works and Supply, now the Department of Water and Power. The department was under the direction of its general manager and chief engineer, William Mulholland. If you’ve ever seen the classic movie “Chinatown,” you’ve seen how significant Mulholland was to Los Angeles history: the filmmakers had to break him into two characters.
While he was a legend in his own time, Mulholland wasn’t a civil engineer by today’s standards. He was self-taught during his early days as a “ditch tender” for the Water Department. After a hard day’s work, Mulholland would study textbooks on mathematics, engineering, hydraulics, and geology. This origin story is the foundation of the “Gentleman Scientist” persona – devouring all the material available on a subject and then claiming an understanding that would allow them to oversee a massive undertaking, despite lacking any form of testing or certification.
If I showed up at NASA and said I was qualified to send humans to Mars because I read a lot of books on space travel and used to build model rockets as a kid, they would throw me off the property. In Mulholland’s day, it meant a promotion to the head of the department.
Mulholland is an integral part of Los Angeles history. While many of his early efforts literally changed the landscape of Los Angeles (he supervised the design and construction of the Los Angeles Aqueduct, which brought water to much of the county), his lack of modern civil engineering training caused “one of the worst American civil engineering disasters of the 20th century,” according to Catherine Mulholland in her biography of William Mulholland, her grandfather.
Just minutes before midnight on March 12, 1928, the dam catastrophically failed, and the resulting flood killed at least 431 people, but some reports claim up to one thousand. Even with the smaller number, the collapse of the St. Francis Dam remains the second-greatest loss of life in California’s history. Only the 1906 San Francisco earthquake and fire killed more people.
The discussion with my friend that day made me think about the search engine optimization business and its collection of “Gentleman Scientists.”
Instead of building dams, our colleagues are trying to reverse engineer the complex algorithms of search engines like Google using faulty statistical practices to devise SEO strategies backed by shoddy science.
A Long History of Bad Science
For decades now, legions of SEO professionals have claimed to have “tested” different theories about Google’s algorithms via some very questionable practices. In the beginning, these tests usually involved a self-proclaimed SEO “mad scientist” changing one aspect of a single webpage, then waiting for the next Google Dance to see if their website advanced in a search engine’s index. If it worked, they published a post about the results in a forum or on their websites. If the poster was popular enough, the SEO community would replicate their new “hack” until Yahoo, Google, or one of the other early search engines told them to stop or figured out how to block it from happening in their algorithms.
Early SEO legends were born from this sort of activity.
Eventually, companies like Moz, Ahrefs, and SEMrush figured out ways to replicate Google’s index, and the “testing” or “studies” they did got a lot more legitimate-looking because of the access to much larger data sets. Google would occasionally shut these theories down with the classic and appropriate “Correlation does not equal causation” reply; however, most of these faulty proclamations lived on under the flag of “Trust but verify.”
My long-held stance on the matter comes from the fact that Google’s multiple algorithms consider hundreds of data points to create an index of the World Wide Web composed of billions of webpages. With something so sophisticated, are most SEO professionals qualified to “test” Google using our limited understanding of statistics?
With rare exceptions, which I’m sure will be highlighted once this article is published, most of the people who work in SEO are novice statisticians who, at best, attended the typical classes and retained more than most. A few colleagues possess a slightly more in-depth understanding of statistics, but they still aren’t statisticians or mathematicians; they acquired their mathematical abilities in the study of other sciences accustomed to less complex data. In most cases, the statistical systems they use are for analyzing surveys or media buying forecasts. They aren’t for the large, complex systems found in search engine algorithms and the information they organize.
Our Basic Understanding of Statistics May Not Be Enough
I’ll be the first to admit I am not a mathematician or a statistician. I struggled with math in school just enough to finish my undergraduate degrees and didn’t feel comfortable with it at all until grad school. Even then, that was in the standard business statistics class most people suffer through while seeking their MBA.
Just as when I worked with actual intellectual property lawyers for my article on the legality of Google’s Featured Snippets, I sought out an actual statistician. Most importantly, I needed someone who doesn’t work in the SEO space to avoid any observer bias, that is, someone who would subconsciously project their expectations onto the research.
My search led me to the statistician Jen Hood. Jen studied mathematics and economics at Virginia’s Bridgewater College, and for most of the 15 years she has been working as a statistician, she was a data analyst for Volvo. Since 2019, she has been working as an analytics consultant at her company, Avant Analytics, mostly helping small businesses that wouldn’t usually have an in-house analyst.
During our first discussions, we spoke about how most of the studies around SEO rely on the concept of statistical correlation. Statistical correlation shows whether – and how strongly – pairs of variables, such as certain aspects of a webpage and that page’s position in Google’s search engine result pages, are related.
“The vast majority of statistical work, even forecasting the future, revolves around measuring correlation,” Jen says cautiously. “However, causation is incredibly difficult to prove.” Causation is the action of causing something to happen; that is, the real reason things work the way they do.
“Without knowing the details of how any of these companies create their metrics, I’m suspicious there’s a significant amount of confirmation bias occurring,” Jen continued. Confirmation bias happens when the person performing an analysis wants to prove a predetermined assumption. Rather than doing the actual work needed to confirm the hypothesis, they make the data fit until this assumption is proven.
To give Jen a better idea of how these companies were producing their data, I shared some of the more popular SEO studies over the past few years. Some of the proclamations made in these studies have been disproven by Google multiple times over the years; others still linger on Twitter, Reddit, and Quora and get discussed on what feels like a daily basis.
“The confirmation bias error shows up a lot in these SEO articles,” Jen states right away. “This is common with any topic where someone’s telling you how to get an advantage.”
First, Jen reviewed a study presented by Rob Ousbey at MozCon 2019, back when he was working for Distilled (he currently works for Moz) on the SEO testing platform then called Distilled ODN, now the spin-off SearchPilot. Of the various theories presented that day, one claimed that the results on Page 1 of search engine result pages are driven more by engagement with those pages than by links. Jen gets suspicious immediately.
“With the information available, it’s hard to say if Rob’s theory about the first page of results is driven by engagement and subsequent results are driven by links is accurate,” Jen wrote after reviewing the presentation. “This idea that it’s mainly links [driving the search results for Page 2 onward] seems a bit strange given that there are so many factors that go into the ranking.”
“The easy test would be: if you can rank on Page 1, especially the top of the page, without previously having any engagement, then the engagement is most likely driven by placement, not the other way around.”
I reached out to Will Critchlow, founder and CEO of Distilled. He offered another study by a former colleague of Rob Ousbey, Tom Capper, that provided a deeper dive into the material that Rob presented back in 2019. “Tom approached this from a few different angles – but the short answer is no – this is not just because top results get more interaction because they are top results.”
“[Tom’s study provided] various different kinds of evidence,” Will continued, “One is that links have a higher correlation with relative rankings lower down the SERPs than they do on the first page (and especially for high-volume keywords).”
“Other evidence includes the way rankings change when a query goes from being a relatively low volume search phrase to a head term (e.g., very spiky volume),” Will states, referring to an analysis of the search term, “Mother’s Day flowers.”
“This continues to get more interesting,” Jen writes after reviewing the new information. “This new [data] gets into actual correlation values though on a completely different and much smaller sample focused on data from the UK – only 4,900 queries over two months.”
Before we continue, it’s crucial to understand how correlation studies are supposed to work.
There are multiple ways to measure the relationship, or correlation, between two factors. Regardless of the method, the numbers returned from these calculations fall between -1 and 1. A correlation of -1 means as one factor goes up, the other factor goes down every time. A correlation of 1 means as one factor goes up, the other factor goes up every time. A correlation of zero means there is no relationship – no predictable linear pattern, up/down, up/up, down/up, or otherwise.
“Most correlation coefficients (results) aren’t close to 1 or -1,” Jen clarifies. “Anything at +/-1 means that 100% of the variation is explained by the factor you’re comparing. That is, you can always use the first factor to predict what the second factor will do.”
While there’s no rule for saying a correlation is strong, weak, or somewhere in between, there are some generally accepted thresholds, which Jen describes. “Keeping in mind that we can have values that are +/-, for factors that are easily countable, such as the number of links a webpage has and that webpage’s ranking on Google, the high correlation would be 0.7-1.0, moderate would be 0.3-0.7, and weak would be 0-0.3.”
“Someone could challenge these exact groupings,” Jen acknowledges, “though I’ve erred on the side of generosity for correlation strength.”
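To make the scale concrete, here is a minimal sketch of one common measure, the Pearson coefficient, together with a helper that applies the thresholds Jen quotes. The link counts and positions are invented for illustration, and the choice to place a boundary value such as 0.3 in the higher bucket is an assumption, since the quoted ranges share their endpoints.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when a variable is constant")
    return cov / sqrt(var_x * var_y)

def strength_label(r):
    """Label a coefficient using the thresholds quoted in the article.

    Assumption: boundary values (0.3, 0.7) fall into the higher bucket.
    """
    size = abs(r)  # the sign gives direction, not strength
    if size >= 0.7:
        return "high"
    if size >= 0.3:
        return "moderate"
    return "weak"

# Hypothetical data: inbound links per page and that page's search position
# (1 is the top result, so more links with a better rank gives a negative r).
links = [5, 12, 20, 33, 41, 58]
positions = [9, 10, 6, 7, 3, 1]

r = pearson_r(links, positions)
print(f"r = {r:.2f} ({strength_label(r)})")
```

Even a “high” label here only describes how tightly the two made-up columns move together; it says nothing about which one, if either, drives the other.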
We go back to the study. “Tom’s slides mainly reference back to a February 2017 presentation he did on whether Google still needs links. There’s a Moz study also referenced which, at this point, is five years old.” (Jen pauses here to state, “On a side note, I find it interesting that everyone seems to acknowledge the algorithms have experienced significant changes and yet they’re referencing studies two, three, or more years old.”)
“In this, [Tom] looks at how Domain Authority and rankings relate,” Jen says, referring to the Moz metric that is the cornerstone of the tool’s inbound link reporting. “He gives the correlation of Domain Authority to a webpage’s Google ranking as 0.001 for positions 1 through 5 and 0.011 for positions 6 through 10.”
“This means that Domain Authority is more highly correlated with search engine ranking for positions 6 through 10, but both results are very weak correlations,” Jen says, pausing to make sure I understand.
“To put this in plainer terms, for positions 1 through 5 in Google’s results, Domain Authority can be used to explain 0.1% of the variance in SERP ranking. For positions 6 through 10, it explains 1.1% of the variance in SERP ranking,” she adds, clarifying her point.
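One caveat on the arithmetic: in a simple linear relationship, the share of variance explained is usually taken to be the square of the Pearson coefficient, not the coefficient itself. The slides do not say which statistic the quoted figures are, so the sketch below assumes they are Pearson coefficients and simply squares them; `variance_explained` is a hypothetical helper name.

```python
def variance_explained(r):
    """Share of variance explained by a linear fit, assuming r is a Pearson coefficient."""
    return r ** 2

# Coefficients quoted from the slides, plus 0.7 (the "high" threshold) for comparison.
for r in (0.001, 0.011, 0.7):
    print(f"r = {r}: {variance_explained(r):.4%} of the variance")
```

Under that assumption, 0.001 and 0.011 would correspond to roughly 0.0001% and 0.012% of the variance, which only strengthens the point that both relationships are close to nonexistent.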
“This is held up as proof that Domain Authority doesn’t matter as much for top positions. Yet the correlations for both are so extremely low as to be nearly meaningless,” Jen says, excited by the discovery. At the same time, I consider how many domains and links are bought and sold using this metric. “Elsewhere, he mentions 0.023 and 0.07 as correlation coefficients for Domain Authority and ranking in top 10 positions, which doesn’t make sense with his earlier values both being lower.”
Jen brings the explanation full circle: “Since this is the backup detail, more technically focused, provided by the company, it seems like a reasonable leap to think that the correlations in the original study you sent me are of a similar level.” That is to say, while we don’t have the numbers for Rob Ousbey’s original presentation, the correlations behind it are most likely just as weak.
“The Mother’s Day study is highly anecdotal,” Jen continues, “The results are interesting and raise questions about what implication this might have for other search terms. However, this is one search term studied for one month. There’s nowhere close to enough content to this study to make universal implications from it.”
“Good for a sales pitch; bad for a statistical study,” Jen proclaims. “Meanwhile, I still haven’t seen anything that shows how they’ve proven that the top results don’t get more interaction because they are the top result.”
“There are many examples presented on other slides to support the claims, but no broad studies.” Jen refers to some of the other studies provided in Rob’s original presentation by Larry Kim, Brian Dean, and Searchmetrics.
“[Larry Kim’s study on the influence of click-through rate on rankings] suggests that lower click-through rate drives a lower ranking. Yet it could be the lower ranking driving the lower click-through rate,” Jen says, illustrating a common paradox with this sort of data. “I would fully expect a high correlation between page rank and click-through rate simply because more people are presented the opportunity to engage.”
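Jen’s point about direction can be illustrated with a toy model. In the sketch below, click-through rate is defined purely as a function of placement, using an invented curve (`0.30 / position`), so by construction clicks have no influence on ranking. The correlation between the two is still strong.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Positions 1 through 10 on a results page.
positions = list(range(1, 11))

# Hypothetical curve: CTR depends ONLY on where the result is shown.
# The 0.30 / position shape is invented for illustration, not measured data.
ctr = [0.30 / position for position in positions]

r = pearson_r(positions, ctr)
print(f"correlation between position and CTR: {r:.2f}")
```

A correlation study run on this data could not distinguish “clicks improve ranking” from “ranking produces clicks,” even though only the second is true in the model.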
“Does Bounce Rate affect search position or vice-versa?” Jen asks, moving on to another slide that references a study by Brian Dean of Backlinko that claimed that the bounce rate metric influences the search result position. “I find it interesting that the narrative seems different if you actually go to the source data.”
Jen refers to the original Backlinko study where the graph used in Rob’s presentation was sourced, which stated, “Please keep in mind that we aren’t suggesting that low bounce rates cause higher rankings. Google may use bounce rate as a ranking signal (although they have previously denied it). Or it may be the fact that high-quality content keeps people more engaged. Therefore, lower bounce rate is a byproduct of high-quality content, which Google does measure.”
The statement concludes, “As this is a correlation study, it’s impossible to determine from our data alone,” thus proving Jen’s point of how inappropriate it is to publish these studies at all.
Jen strongly concludes, “The use of this graph is intentionally misleading.”
“[These studies are] just looking at one factor. With multiple algorithms in place, there must be many factors all working together. Each must have individual ratings that are weighted into a total for the specific algorithm and likely weighted again within the aggregating algorithm they use,” Jen states, mirroring something that Google’s Gary Illyes and John Mueller have said more than once at various conferences and on Twitter, and something this publication’s own Dave Davies has recently discussed.
Because of this acknowledged complexity, some SEO studies have abandoned correlation methods entirely in favor of machine learning-based algorithms, such as Random Forest, a technique that a 2017 investigation by SEMrush used to propose top-ranking factors on Google, such as page traffic and content length. “This is a good approach to predict what’s likely to happen,” Jen writes after reviewing the SEMrush study and its explanation of its methodology, “but it still doesn’t show causation. It just says which factors are better predictors of ranking.”
The Research Presented Is Limited & Unverified
Most of the research around search engines that is issued comes not from independent sources or educational institutions, but from companies selling tools to help you with SEO.
This kind of activity by a company is the ethical equivalent of Gatorade proving its claims of being a superior form of hydration for athletes by referencing a study conducted by The Gatorade Sports Science Institute, a research lab owned by Gatorade.
When I mentioned to Jen Hood how many of the studies she reviewed have spawned new guiding metrics or entirely new products, she was surprised anyone takes those metrics or products seriously.
“Anyone claiming that they have a metric which mimics Google is asserting that they’ve established many cause-effect relationships that lead to a specific ranking on Google,” Jen wrote, referring to Moz’s Domain Authority. “What this should mean is that their metric consistently matches with the actual results. If I started a brand-new site or a brand-new page today and did everything that they say is an important factor, I should get a top ranking. Not probably rank high. If there’s a true match to the algorithms, the results should always follow.”
Jen provides a hypothetical example:
“Let’s say I offer a service where I’ll tell you exactly where your webpage will rank for a given search term based on a metric I include in that service. I have a formula for calculating that metric so I can do it for many different sites. If I could accurately tell you where you’d rank based on my formula 0.1% of the time, would it seem like my formula has the Google algorithms figured out? If I upped that to 1.1% of the time, would you now feel confident?”
“That’s all these studies [and products] seem to be doing,” Jen explains. “Cloaking themselves in just enough statistical terms and details to make it seem like it’s much more meaningful.”
* * *
As Jen alluded to earlier, most studies of Google’s results are using a limited amount of data, but claiming statistical significance; however, their understanding of that concept is flawed given the nature of the very thing they are studying.
“Rand says he estimates that Jumpshot’s data contains ‘somewhere between 2-6% of the total number of mobile and desktop internet-browsing devices in the U.S., a.k.a., a statistically significant sample size,’” Jen says, referring to a 2019 study by SparkToro’s Rand Fishkin that claims that less than half of all Google searches result in a click. “Rand would be right about statistical significance if the Jumpshot data were a truly random and representative sampling of all Google searches.”
“From what I could find, [Jumpshot] harvested all their data from users who used Avast antivirus,” Jen says, referring to the now-shuttered service’s parent company. “This set of users and their data likely differs from all Google users. This means that the sample Jumpshot provides isn’t random and likely not representative enough – a classic sampling error usually referred to as Availability Bias.”
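A small sketch shows why sample size alone does not rescue a non-random sample. Every number below is invented for illustration (these are not Jumpshot’s figures): a hypothetical group covering 4% of searchers behaves differently from everyone else, and the analyst can only observe that group.

```python
# Hypothetical groups: share of all searchers and each group's no-click rate.
# These numbers are invented for illustration; they are not Jumpshot's data.
groups = {
    "antivirus_users": {"share": 0.04, "no_click_rate": 0.60},
    "everyone_else": {"share": 0.96, "no_click_rate": 0.45},
}

def overall_rate(groups):
    """No-click rate across the whole population (share-weighted average)."""
    return sum(g["share"] * g["no_click_rate"] for g in groups.values())

def sampled_rate(groups, sampled_names):
    """No-click rate measured when only the named groups can be observed."""
    observed = [groups[name] for name in sampled_names]
    total_share = sum(g["share"] for g in observed)
    return sum(g["share"] * g["no_click_rate"] for g in observed) / total_share

print(f"true rate:     {overall_rate(groups):.1%}")
print(f"measured rate: {sampled_rate(groups, ['antivirus_users']):.1%}")
```

However many devices the observed group contains, the measurement describes that group, not the population. Whether real antivirus users actually differ in this way is exactly the open question Jen raises.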
“Statistics without context should always be taken with a grain of salt. This is why there are analytics experts to raise questions and give context. What types of questions are people asking, and how have these maybe changed?” Jen said, digging into the premise of the study.
“For instance, people searching for topics where there’s no added value of going to another website are unlikely to be considered lost opportunities for those who lose the clicks. Are people immediately refining their search term because the algorithm didn’t capture the context of what they were asking?” Jen suggested. Rand later clarified that both scenarios are part of his explanation for why more than half of searches end without a click. “Now we’re getting more into the nuance, but if Rand is claiming that the no-click searches are bad, then there needs to be a context of why this might happen even in the absence of a [Featured Snippet].”
* * *
If the concept of using data too thin to be accurate isn’t damning enough, there’s the problem that there’s no concept of peer review within the SEO industry. Most of these studies are conducted once and then published without ever being replicated and verified by outside sources. Even if the studies are replicated, they are done by the same people or companies as a celebrated annual tradition.
Of all the historical studies of the St. Francis Dam Disaster, one by J. David Rogers, Ph.D., Chair in Geological Engineering, Department of Geological Sciences & Engineering and professor at Missouri University of Science and Technology, stands out to me. He stated one of the critical reasons for the failure: “The design and construction being overseen by only one person.”
“Unless the results are life and death or highly regulated, we don’t normally see people doing the actual work required to show causation,” Jen Hood adds. “The only way to really show causation is to have a robust study that randomizes and controls for other factors on the proper scale. Outside of clinical drug testing, which normally takes years, it’s very uncommon to see this taking place.”
How the SEO industry conducts and presents its research is not how scientific studies have been administered since the 1600s. You don’t have to believe me. I’m not a scientist, but Neil deGrasse Tyson is.
“There is no truth that does not exist without experimental verification of that truth,” said Tyson in an interview with Chuck Klosterman for his book, “But What If We’re Wrong”. “And not only one person’s experiment, but an ensemble of experiments testing the same idea. And only when an ensemble of experiments statistically agrees, do we then talk about an emerging truth within science.”
The standard counter to this argument is just to state, “I never said this study was scientific.” If that’s so, why does this information get shared and believed with such conviction? This is the heart of the problem of confirmation bias, not just with the researchers but also with the users of that research.
“[I]f you really think about what you really actually know, it’s only a few things, like seven things, maybe everybody knows,” says comedian Marc Maron, talking about the concept of knowledge in his stand-up special, “End Times Fun.” “If you actually made a column of things, you’re pretty sure you know for sure, and then made another column of how you know those things, most of that column is like, ‘Some guy told me.’”
“You know, it’s not sourced material, it’s just – it’s clickbait and hearsay, that’s all,” Maron continues. “Goes into the head, locks onto a feeling, you’re like, ‘That sounds good. I’m gonna tell other people that.’ And that’s how brand marketing works, and also fascism, we’re finding.”
Most of the Time, We’re Wrong
Science has been about figuring out how the physical world works since the time of Aristotle, who, most people now agree, was wrong about many things. Scientists must make these efforts because there’s no user’s manual for our planet or anything else in the universe. We can’t visit a random deity during office hours and ask why they made gravity work the way it does.
But with Google and the other search engines, we do have such access.
I hate to fall back on the “Because Google said so!” type argument for these things, but unlike most sciences, we can get notes from The Creator during announced office hours and occasionally, Twitter.
The following tweet by John Mueller from earlier this year was in response to yet another correlative study published by yet another SEO tool company without any outside corroborations, claiming to have unlocked Google’s secrets with a limited amount of data.
You’ve built complicated algorithms at scale too — you know that it’s never a single calculation with static multipliers. These things are complex, and change over time. I find these reports fascinating – who would have thought X? – but I worry that folks assume they’re useful.
— John (@JohnMu) April 28, 2020
John Mueller and I share a very similar view of the presentation of this type of data: “I worry that folks assume they’re useful.” That is, this data isn’t useful at all and is even potentially misleading.
The above statement came about after the author of the study, Brian Dean, stated that this report was “more to shed a bit of light onto how some of Google’s ranking factors *might* work.”
Statements like this are a popular variation of a typical mea culpa when an SEO research study is called out as incorrect: “I never said this was a Google ranking factor, but that there’s a high correlation,” implying that even if Google says it is not valid, it still might be a good proxy for Google’s algorithm. After that, the conversation breaks down as SEO professionals claim to have caught Google in some sort of disinformation campaign to protect their intellectual property. Even the slightest crack in their answer is treated as if someone discovered they were using the souls of conquered SEO pros to power their servers.
“I’m perplexed how this hasn’t become an issue before,” Jen says during our final conversation. I tell her it’s always been an issue and that there have always been people like me who try and point out the problem.
“There’s no solid science behind it with people knowing just enough to be dangerous at best or downright deceptive,” she says, amazed by the concept. “A coin flip can do a better job than any of the studies I’ve seen so far when it comes to predicting whether one site is going to rank higher than another website.”
“The only way to statistically prove that any individual metric claiming to recreate Google’s search algorithms is accurate is to do massive randomized testing over time, controlling for variation, and randomly assigning changes to be made to improve or decline in ranking,” Jen says, providing a solution that seems impossibly distant for our industry. “This needs to be on a large scale across many different topics, styles of searches, etc.”
“Even then, I suspect that Google has frequent algorithm updates of different magnitudes,” Jen supposes, which I confirm. “Undoubtedly, they have dozens or hundreds of engineers, programmers, analysts, and so on working on these algorithms daily, which means if we take a snapshot in time now of what we suspect the algorithm is, by the time we’ve fully tested it, it’s changed.”
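The core of the randomized test Jen describes can be sketched in a few lines. This is only an outline under stated assumptions: `assign_groups` and `mean_difference` are hypothetical helper names, the page IDs are made up, and a real study would also need the scale, time span, and controls for variation she lists.

```python
import random
from statistics import mean

def assign_groups(page_ids, seed=0):
    """Randomly split pages into treatment and control groups.

    Random assignment (rather than hand-picking pages) is what lets any later
    difference be attributed to the change instead of to pre-existing
    differences between the pages.
    """
    rng = random.Random(seed)  # seeded so the split is reproducible
    shuffled = list(page_ids)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return {"treatment": shuffled[:half], "control": shuffled[half:]}

def mean_difference(outcome_by_page, groups):
    """Average outcome in the treatment group minus the control group."""
    treated = [outcome_by_page[p] for p in groups["treatment"]]
    control = [outcome_by_page[p] for p in groups["control"]]
    return mean(treated) - mean(control)

# Hypothetical page IDs; the change under test would be applied only to the
# pages that land in the treatment group.
pages = [f"page-{i}" for i in range(10)]
groups = assign_groups(pages, seed=1)
print(groups["treatment"])
print(groups["control"])
```

After the test period, `mean_difference` would be fed the observed change in position for every page. As Jen notes, the estimate is only as current as the algorithm it was measured against.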
In the end, Jen agrees that it appears our industry doesn’t have the tools we need to make these studies useful. “The mathematics of analyzing how Google’s index functions are closer to astrophysics than predicting election results, but the methods used today are closer to the latter.”
* * *
I don’t want to make the people who publish these studies out to be total charlatans. Their efforts clearly come from an honest quest for discovery.
I get it. It’s fun to play with all the data they have at their disposal and try and figure out how something so complicated works.
However, known methodologies exist that could properly test what these studies present as theories, and they are just not being applied… at all.
When it comes down to it, these “Gentlemen Scientists” of SEO are trying to build a dam without a full understanding of engineering, and that’s just dangerous.
Sure, publishing yet another report claiming something is a ranking factor because of a high correlation won’t accidentally kill 400 people. But it is undoubtedly wasting clients’ time and money by sending them on a wild goose chase.