That’s a pretty strong title for my first Ninja blog post, but this is a teaser for my upcoming PubCon session on Tuesday, March 18, about Algo Chaos, where I’ll be discussing what webmasters need to do to be proactive and stop penalties and algo problems before they happen. Don’t waste time trying to fix Google penalties by chasing unauthorized content usage with Google Alerts, Copyscape, Google’s Webmaster Tools, link disavowals, and other services. Better yet, avoid the hassle of DMCAs by making them your last resort, not your first.
This will be a two-part blog post, starting with a list of some of the chaos that scrapers can cause in search engines and ending with actions you can take to proactively prevent scraping in the first place.
Proactive Content Distribution aka Bot Blocking
People used to think that bot blocking was something only the bit-twiddling, scraper-paranoid webmaster nerds did, and often poked fun at them for being such control freaks. Turns out the nerds were right, and to some degree this vindication of their efforts is the Revenge of the SEO Nerds.
By proactively blocking unauthorized access to their content, those webmasters ensured their copy and links weren’t being randomly used in ways that could harm the scraped site. While not all data scrapers and aggregators are bad, and some do serve a useful purpose, the majority tend toward the bad side and can cause harm that easily cripples a site and defeats its own SEO efforts.
I can still remember when some link builders used to tell me scrapers were good for business, that those free links were just gravy helping your site. I warned them of the dangers, having already encountered the dark side of this myself (it cost me some business), but it wasn’t as clear-cut back then, so they allowed themselves to be scraped and even welcomed it. Now some of those very same link builders don’t want scrapers anywhere near their sites.
My how the times have changed.
What happens is that all sorts of bad sites that scrape content not only make the original site a victim of plagiarism but can also trigger various linking penalties in Google. Don’t believe it when they say there’s nothing a third-party site can do to damage you: you’re guilty by association with these linking schemes whether you participated or not. The only way to avoid it is to employ every anti-scraping method possible, including prayer.
If you spend all day doing disavows while allowing scrapers to generate more links, it quickly escalates to more than you can process. It becomes a big game of whack-a-mole that you’ll never win unless you stop the problem at the root: block the scraper.
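To illustrate what blocking at the root looks like, here’s a minimal sketch of a server-side user-agent check. The user-agent fragments and the `is_blocked` helper are hypothetical examples for illustration; a real blocklist would be built and maintained from your own server logs, and determined scrapers forge user agents, so this is only the first layer.

```python
import re

# Hypothetical examples of scraper user-agent fragments; a real
# blocklist would be curated from your own server logs.
BLOCKED_AGENTS = [
    r"libwww-perl",
    r"python-requests",
    r"(?:^|\W)curl(?:\W|$)",
    r"(?:^|\W)scrapy(?:\W|$)",
]

BLOCKED_RE = re.compile("|".join(BLOCKED_AGENTS), re.IGNORECASE)

def is_blocked(user_agent: str) -> bool:
    """Return True if the request should be rejected before serving content."""
    return bool(BLOCKED_RE.search(user_agent or ""))

# A web framework would call this in a before-request hook and
# return HTTP 403 when it's True.
print(is_blocked("python-requests/2.31.0"))          # True
print(is_blocked("Mozilla/5.0 (Windows NT 10.0)"))   # False
```

The same idea can of course be expressed as web server rewrite rules instead of application code; the point is that the request dies at your server, before the scraper ever gets the content.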
Getting outranked by your own content is the worst of all. It’s insanity at its finest, wrapped up in the pretty, colorful Google logo letters. The traditional ways to fight it waste money and time: using Google Alerts and Copyscape to track the copies, then racking up legal fees sending Cease & Desist letters, DMCA requests, and so on. Often these sites are outside the site owner’s jurisdiction, making it nearly impossible, i.e. impractical, to do anything except maybe get the content removed from one or two search engines, with no recourse anywhere else.
When it comes to RSS feeds, the site operator is often largely culpable for the damage by insisting on publishing the full-text feed instead of a summary. Making it easy for your readers also makes it easy for scrapers, and unfortunately it’s almost impossible to block scrapers without blocking RSS feed readers, since readers are scrapers themselves. The choice is simple: publish full-text RSS feeds and get lower rankings, or publish summaries and outrank the RSS aggregators.
Stop the madness and summarize those feeds!
Brand Dilution and Reputation Damage
When other sites carry your content and branding without your permission, the confusion it creates in the marketplace can be devastating. The scrapers using your content don’t really want it for anything other than long-tail keywords to bring them traffic, so they aren’t concerned with what other content yours is associated with on their site; mixed and often controversial topics can appear on the same page. These issues have outraged people unfamiliar with what scrapers do, who assumed companies had placed their brand there or authorized its usage in such ways, which is the furthest thing from the truth.
The easiest way to avoid such issues is to prevent them from happening in the first place.
Unintended Search Results Consequences
Some advertisers had their ads scraped and spun onto pages designed to attract visitors to seedy websites, and those advertisers pulled their ads, costing the publisher many thousands of dollars in damages. They blamed the association of their ads with those types of websites on the site where the ads were hosted, when that site had nothing to do with the situation except that it didn’t stop the scraper in the first place. Normally an advertiser running a massive ad campaign wouldn’t be able to trace his ads to the source of the scraping, but this advertiser ran a custom ad just for that site, so the source was painfully obvious.
While the association between the advertiser and the seedy side of the web was created by the scraper and the search engine working together, it cost the publisher money in the end, something that should obviously be avoided.
Again, stopping scrapers in advance is the only way to prevent this problem.
The simplest form of damage is tricking Googlebot or another search engine into hijacking your entire site via something called a proxy hijack, which isn’t as common as it used to be but still occurs despite the search engines’ best efforts to stop it. The best-case scenario is that your website merely ranks against its own content and loses some position. The worst-case scenario is that the proxy copy completely relegates your site to the supplemental results, if it’s indexed at all. Validating the source of crawlers, and rejecting spiders claiming to be Googlebot when they crawl from IPs outside Google’s valid range, is all it takes to protect your site from this issue.
Considering that a fix for proxy hijacking has been available since 2006, it’s a real shame that all sites aren’t protected today.
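That fix is the reverse-plus-forward DNS check Google itself recommends: look up the PTR record of the crawling IP, confirm the hostname ends in googlebot.com (or google.com), then resolve that hostname forward and confirm it maps back to the same IP. A minimal sketch follows; the resolver arguments are injectable so the logic can be tested without network access, and in production you’d cache the results rather than do DNS lookups on every request:

```python
import socket

def is_real_googlebot(ip: str,
                      reverse_lookup=lambda ip: socket.gethostbyaddr(ip)[0],
                      forward_lookup=socket.gethostbyname) -> bool:
    """Reverse-DNS the IP, check the hostname, then forward-confirm it."""
    try:
        host = reverse_lookup(ip)
    except OSError:
        return False
    if not host.endswith((".googlebot.com", ".google.com")):
        return False  # PTR isn't in Google's domains: spoofed user agent.
    try:
        # Forward-confirm, since anyone can publish a fake PTR record.
        return forward_lookup(host) == ip
    except OSError:
        return False
```

A request claiming a Googlebot user agent that fails this check is exactly the proxy-hijack traffic you want to refuse; serve it a 403 and your content never reaches the proxy copy.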
Having a few RSS feed stories outrank your site is one thing, but losing your entire website to a proxy hijack is totally unacceptable, and completely avoidable!