How to foil scrapers on your blog

by Patrick Altoft on January 10, 2008

Scrapers can cause a lot of problems for bloggers, mainly because many of them strip the links back to your blog, making it hard for search engines to decide which blog is the copycat.

Here is what Matt Cutts recently said about how best to protect yourself against duplicate content:

If you are syndicating articles on third party sites make sure they link back to the original article on your site, rather than your homepage.

So internal links within the post, plus perhaps a link to your homepage in the feed footer, are not the best solution. What you really need is a link to the blog post itself from within the feed content. Your feed already carries a link to each post, but most scrapers remove those links and keep only the title and the content.

Find the feed-rss2.php file in the wp-includes folder and add the following code at line 39 (in WP 2.3.1), immediately after <?php the_content() ?>:

<p><a href="<?php the_permalink_rss(); ?>">Permalink + Comments</a></p>

This helps search engines identify the source of the post and gives your readers an extra place to click through to your site.
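Because feed-rss2.php is a core file, the edit will most likely be overwritten when WordPress is upgraded. The same link can probably be added from a small plugin instead. The following is a minimal sketch, not a tested plugin: the plugin name and the function name example_feed_permalink_footer are made up for illustration, and it assumes a WordPress version that provides esc_url() and a feed set to show full post content rather than excerpts.

```php
<?php
/*
Plugin Name: Feed Permalink Footer (hypothetical example)
Description: Appends a link back to the original post to each item in the feed.
*/

// Hypothetical helper name; it is not part of WordPress itself.
function example_feed_permalink_footer( $content ) {
    // Leave normal page views untouched; only change feed output.
    if ( ! is_feed() ) {
        return $content;
    }

    // Build the same link as the template edit, pointing at the post permalink.
    $link = '<p><a href="' . esc_url( get_permalink() ) . '">Permalink + Comments</a></p>';

    return $content . $link;
}

// the_content() runs the 'the_content' filter, and the feed template calls the_content().
add_filter( 'the_content', 'example_feed_permalink_footer' );
```

Saved as a file in wp-content/plugins and activated, this should append the link to every feed item without touching wp-includes, so it would survive an upgrade.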

{ 21 comments… read them below or add one }

Sly from Slyvisions.com 10 Jan 2008 at 12:23 pm

Hmmm. That seems to be a very useful tip. There are a lot of sites that scrape my content, so this should at least do a little something about it.

SC 10 Jan 2008 at 5:05 pm

I am not a coder. Will this work with a CMS like Drupal?

Nick - road2blogging 10 Jan 2008 at 6:46 pm

Really good advice there, think I might have to add this to my feed – really hate scrapers and splogs!

Damien van Holten 10 Jan 2008 at 8:14 pm

Hmm.. this will add some work when building custom feeds (outside of the standard blogs).

Matthew 10 Jan 2008 at 9:00 pm

I get a few sites scraping me. I emailed one last week and he felt he was completely innocent. However, he still obeyed and removed my feed.

I am guessing that this file will be replaced at each WP upgrade and will need to be re-edited. I will add it to the list of checks I run each time I upgrade.

Matthew 10 Jan 2008 at 9:09 pm

Sorry for the second post here. I wonder why Google cannot detect scrapers. The detection logic would be something like this: sites A and B write original content, but C copies A and B. Surely it could spot that C is a copy of A and B, or C=A+B. I guess selective scrapers could pose a problem though: I have seen some scrapers copy every single post I write, and others copy only posts with certain keywords like iPod.

I wonder if the trust rank of a domain could also help, for example by knowing that my site has been around for 2 – 3 years and this new site suddenly has the same content as mine.

Patrick Altoft 10 Jan 2008 at 9:48 pm

This code is for WordPress only. The same principle of adding a link to the blog post will work in other systems, but the code would be different.

Patrick Altoft 10 Jan 2008 at 9:49 pm

Ideally I would have done this using a plugin but I couldn’t figure it out. :)

David Bradley 11 Jan 2008 at 4:20 am

Been testing this with one of my blogs that uses Feedburner. Is that definitely the PHP file one should use? The code goes immediately after php the_content(), right?

db

Patrick Altoft 11 Jan 2008 at 5:29 am

Yes, that’s right.

Joost de Valk 11 Jan 2008 at 7:25 am

I’ve updated my RSS Footer WordPress plugin to be able to do this, with the title of the post as anchor text!

Reed 11 Jan 2008 at 4:01 pm

Thanks for the heads up.

My blog posts are also being copied and I was worried about it.

Can’t scrapers find some way to remove this link code?

David Bradley 11 Jan 2008 at 7:59 pm

Okay! Got it working now!

Many thanks. Will report back if I see any interesting trends re the dozens of scrapers that pull Sciencebase

db

David Bradley 11 Jan 2008 at 8:57 pm

Another thought… presumably this doesn’t work retroactively, so any archived duplicate posts already on scraper sites will not have the embedded permalink…

db

Damien van Holten 11 Jan 2008 at 11:13 pm

Your logic is sound. However, I think most SEOs can speak from experience when they say that Google – advanced as it may be – still has trouble identifying a post’s true owner.

I can’t answer your question though. I had the exact same thought about it: detecting duplicated content shouldn’t be difficult.

ZebZiggle 12 Jan 2008 at 12:00 pm

Blah … I just use jQuery to add a rel=’nofollow’ to all scraped links.

Nice try though.

Joost de Valk 13 Jan 2008 at 12:09 am

@ZebZiggle: jQuery == JavaScript, search engines DON’T DO JavaScript, so those nofollows are nonsense.

ZebZiggle 15 Jan 2008 at 12:30 pm

@Joost: Good point. Better add it to the back-end as well. That should do it.

Caroline Witte 11 Feb 2008 at 11:16 pm

Since I am slightly backward, technically speaking, I don’t know if this would work or not. Suppose someone started a file storage site, on the order of 4shared or Rapid Share, where bloggers could upload their posts before publishing them and have them time and date stamped on upload. Then who first published a given post could be proven positively. In the case of a dispute, the blog entries, including the time/date stamp, could be downloaded and sent to the advertisers, etc. If something like this could be made to work, it would be a money maker for the owner of the site, as well as a big help in settling disputes over ‘scraping’. I’m sorry if this is a really idiotic suggestion, but it just seems to be a common sense solution, provided the storage site could pass any legal requirements as to the validity of the time/date stamps. If I had any technical know-how or money, I might try it myself. It would easily be worth, say, $50 a year to you ‘big guys’.

Patrick Altoft 11 Feb 2008 at 11:32 pm

Hi Caroline.

Personally I don’t have an issue with scrapers because I have always made sure they link to my site. The more scrapers I get the more links I get.

Scrapers are easy to spot and I doubt we need a service that proves who owns the content.

Jay Z 08 Jul 2008 at 3:53 am

I would love a solution like this for Typepad.
