How to foil scrapers on your blog

by Patrick Altoft / 32 responses

Scrapers can cause a lot of problems for bloggers, mainly because many of them strip out the links back to your blog, which makes it hard for search engines to decide which site is the original and which is the copycat.

Here is what Matt Cutts recently said about how best to protect yourself against duplicate content:

If you are syndicating articles on third party sites make sure they link back to the original article on your site, rather than your homepage.

So internal links within the post, plus perhaps a link to your homepage in the feed footer, are not the best solution. What you really need is a link to the blog post itself from within the feed content. Your feed already carries a link to each post, of course, but most scrapers strip those links and keep only the title and the content.

Find the feed-rss2.php file in your wp-includes folder and add the following code at line 39 (in WP 2.3.1), immediately after the <?php the_content() ?> call:

<p><a href="<?php the_guid(); ?>">Permalink + Comments</a></p>

This helps search engines identify the source of the post and gives your readers an extra place to click through to your site.
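If you would rather not edit a core file, the same link can in principle be added from a small plugin. The following is a minimal sketch rather than the method described above: the function names and the plugin header are hypothetical, and it assumes the feed template sends post content through WordPress's the_content filter. It also uses get_permalink() instead of the_guid(), which is a substitution on my part.

```php
<?php
/*
Plugin Name: Feed Source Link (hypothetical example)
*/

// Pure helper: append a paragraph that links back to the original post.
// The name and the markup are illustrative, not part of WordPress.
function example_append_source_link($content, $url, $anchor_text)
{
    $link = sprintf(
        '<p><a href="%s">%s</a></p>',
        htmlspecialchars($url, ENT_QUOTES, 'UTF-8'),
        htmlspecialchars($anchor_text, ENT_QUOTES, 'UTF-8')
    );
    return $content . "\n" . $link;
}

// WordPress glue: only registered when the file is loaded inside WordPress.
if (function_exists('add_filter')) {
    function example_feed_source_link_filter($content)
    {
        // Leave normal page views untouched; only change feed output.
        if (!is_feed()) {
            return $content;
        }
        return example_append_source_link(
            $content,
            get_permalink(),
            'Permalink + Comments'
        );
    }
    add_filter('the_content', 'example_feed_source_link_filter');
}
```

Passing get_the_title() as the third argument instead of the fixed string would use the post title as the anchor text.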

Patrick Altoft is Director of Search at Branded3, a Leeds SEO & Digital Agency specialising in SEO, Web Design, Development & Social Media.

Comments

Read the 22 comments below, or add your own!

January 10, 2008 at 12:23pm

Hmmm. That seems to be a very useful tip. There are a lot of sites that scrape my content, so this should at least do a little something about it.

SC
January 10, 2008 at 5:05pm

I am not a coder. Will this affect a CMS like Drupal?

January 10, 2008 at 9:48pm

This code is for WordPress only. The same principle of adding a link to the blog post will work in other systems but the code would be different.

January 10, 2008 at 6:46pm

Really good advice there, think I might have to add this to my feed – really hate scrapers and splogs!

January 10, 2008 at 8:14pm

Hmm.. this will add some work when building custom feeds (outside of the standard blogs).

January 10, 2008 at 9:00pm

I get a few sites scraping me. I emailed one last week and he felt he was completely innocent. However, he still complied and removed my feed.

I am guessing that this file will probably be replaced at each WP upgrade and will need to be re-edited. I will add it to the list of checks I run each time I upgrade.

January 10, 2008 at 9:49pm

Ideally I would have done this using a plugin but I couldn’t figure it out. :)

January 10, 2008 at 9:09pm

Sorry for the second post here. I wonder why Google cannot detect scrapers. The detection logic would be something like this: sites A and B write original content, but C copies both. Surely it could spot that C is a copy of A and B, or C = A + B. I guess it could pose a problem if the scrapers are selective, though: I have seen some scrapers copy every single post I write, and others copy only posts with certain keywords like iPod.

I wonder if the trust rank of a domain could help also by maybe knowing that my site has been around for 2 – 3 years, and this new site suddenly has the same content as mine.
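The "C = A + B" idea in this comment can be illustrated with word shingles: break each text into overlapping three-word sequences and measure how much of one text's set turns up in another. This is only a rough sketch with hypothetical function names. It is not a description of how any search engine works, and it detects overlap only; it says nothing about which site published first.

```php
<?php
// Split text into overlapping k-word "shingles" (lowercased, punctuation dropped).
function example_shingles($text, $k = 3)
{
    $words = preg_split('/[^a-z0-9]+/', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
    $set = array();
    for ($i = 0; $i + $k <= count($words); $i++) {
        $set[implode(' ', array_slice($words, $i, $k))] = true;
    }
    return $set;
}

// Containment: the fraction of the source's shingles that also appear
// in the candidate. A value near 1.0 suggests the candidate includes the source.
function example_containment($source, $candidate, $k = 3)
{
    $s = example_shingles($source, $k);
    $c = example_shingles($candidate, $k);
    if (count($s) === 0) {
        return 0.0;
    }
    return count(array_intersect_key($s, $c)) / (float) count($s);
}

$a = 'The quick brown fox jumps over the lazy dog';
$b = 'Scrapers copy posts about the iPod every single day';
$c = $a . ' ' . $b; // site C republishes both posts

printf("A in C: %.2f\n", example_containment($a, $c));
printf("B in C: %.2f\n", example_containment($b, $c));
printf("A in B: %.2f\n", example_containment($a, $b));
```

A selective scraper, as the comment notes, would only score highly against the particular posts it copied, so the comparison would have to be made post by post rather than site by site.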

January 11, 2008 at 11:13pm

Your logic is sound. However, I think most SEOs can say from experience that Google, advanced as it may be, still has trouble identifying a post’s true owner.

I can’t answer your question though. I had the exact same thought about it: detecting duplicated content shouldn’t be difficult.

January 11, 2008 at 4:20am

Been testing this with one of my blogs that uses Feedburner. Is it definitely that PHP file one should use? The code goes immediately after <?php the_content() ?>, right?

db

January 11, 2008 at 5:29am

Yes, that’s right.

January 11, 2008 at 7:25am

I’ve updated my RSS Footer WordPress plugin to be able to do this, with the title of the post as anchor text!

January 11, 2008 at 4:01pm

Thanks for the heads up.

My blog posts are also being copied and I was worried about it.

Can’t scrapers find some way to remove this link code?

January 11, 2008 at 7:59pm

Okay! Got it working now!

Many thanks. Will report back if I see any interesting trends re the dozens of scrapers that pull Sciencebase.

db

January 11, 2008 at 8:57pm

Another thought…presumably, this doesn’t work in retrospect, so any archived duplicated posts on scraper sites will not have embedded the permalink…

db

ZebZiggle
January 12, 2008 at 12:00pm

Blah … I just use jQuery to add a rel=’nofollow’ to all scraped links.

Nice try though.

January 13, 2008 at 12:09am

@ZebZiggle: jquery == javascript, search engines DON’T DO javascript, so those nofollows are nonsense.

ZebZiggle
January 15, 2008 at 12:30pm

@Joost: Good point. Better add it to the back-end as well. That should do it.

Caroline Witte
February 11, 2008 at 11:16pm

Since I am slightly backward, technically speaking, I don’t know if this would work or not. But suppose someone started a file storage site, along the lines of 4shared or Rapid Share, where bloggers could upload their posts before publishing them and have them time- and date-stamped on upload. Then it could be proven positively who published a post first. In a dispute, the entries, including the time/date stamp, could be downloaded and sent to the advertisers, etc.

If something like this could be made to work, it would be a money maker for the owner of the site, as well as a big help in settling disputes over ‘scraping’. I’m sorry if this is a really idiotic suggestion, but it just seems to be a common sense solution, provided the storage site could meet any legal requirements as to the validity of the time/date stamps. If I had any technical know-how or money, I might try it myself. Surely it would be worth, say, $50 a year to you ‘big guys’?
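The record such a service would keep can be sketched as a content hash plus an upload time. This is only an illustration with hypothetical function names. A timestamp you record yourself proves nothing; as the comment says, the value would come from a trusted third party holding the record.

```php
<?php
// Build a fingerprint record for a post body at a given Unix timestamp.
function example_fingerprint($content, $timestamp)
{
    return array(
        'sha256'      => hash('sha256', $content),
        'recorded_at' => gmdate('c', $timestamp), // ISO 8601, UTC
    );
}

// Check whether a piece of content still matches a stored record.
function example_matches($record, $content)
{
    return $record['sha256'] === hash('sha256', $content);
}

$record = example_fingerprint('My original blog post text', time());
var_dump(example_matches($record, 'My original blog post text'));
```

Any change to the text, even a single character, produces a different hash, so the record only vouches for the exact version that was uploaded.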

February 11, 2008 at 11:32pm

Hi Caroline.

Personally I don’t have an issue with scrapers because I have always made sure they link to my site. The more scrapers I get the more links I get.

Scrapers are easy to spot and I doubt we need a service that proves who owns the content.

July 8, 2008 at 3:53am

I would love a solution like this for Typepad.

September 24, 2011 at 4:45pm

Amazing! I have it working now!

Thanks a million. This is working like jam… will get back and check for any interesting scrapers that pull great ideas

Joe

10 trackbacks
