How to foil scrapers on your blog

Scrapers can cause a lot of problems for bloggers, mainly because many of them remove the links back to your blog, making it hard for search engines to decide which blog is the copycat.

Here is what Matt Cutts recently said about how best to protect yourself against duplicate content:

If you are syndicating articles on third party sites make sure they link back to the original article on your site, rather than your homepage.

So internal links within the post, plus perhaps a link to your homepage in your feed footer, aren’t the best solution. What you really need is a link to your blog post from within the feed content. Your feed already includes a link to each post, but most scrapers remove those links and keep only the title and the content.

Find your feed-rss2.php file in the wp-includes folder and add the following code at line 39 (in WP 2.3.1), just after the line that says <?php the_content() ?>

<p><a href="<?php the_guid(); ?>">Permalink + Comments</a></p>

This helps search engines identify the source of the post and gives your readers an extra place to click to visit your site.
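Files in wp-includes are likely to be overwritten when WordPress is upgraded, so the same link could also be added from a small plugin. The following is only a minimal sketch, not a tested plugin for WP 2.3.1. The plugin name and the fpf_ function names are made up for illustration. It assumes the feed template runs post content through the the_content filter, and that is_feed() and get_permalink() behave as they do in current WordPress.

```php
<?php
/*
Plugin Name: Feed Permalink Footer (hypothetical example)
*/

// Pure helper: returns the content with a link back to the post appended.
// The permalink is escaped so characters such as & are valid inside the href.
function fpf_append_permalink($content, $permalink) {
    $link = '<p><a href="' . htmlspecialchars($permalink, ENT_QUOTES, 'UTF-8') . '">'
          . 'Permalink + Comments</a></p>';
    return $content . "\n" . $link;
}

// Filter callback: leave normal page views alone and only change feed output.
function fpf_filter_feed_content($content) {
    if (!is_feed()) {
        return $content;
    }
    return fpf_append_permalink($content, get_permalink());
}

// Register the hook only when running inside WordPress (assumption: the feed
// template passes post content through the 'the_content' filter).
if (function_exists('add_filter')) {
    add_filter('the_content', 'fpf_filter_feed_content');
}
```

Saved as a file in wp-content/plugins and activated, a plugin along these lines would survive core upgrades. Feeds set to show summaries rather than full text would probably need a similar hook on the excerpt.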

28 Reader Comments

Hmmm. That seems to be a very useful tip. There are a lot of sites that scrape my content, so this should at least do a little something about it.

 

[…] Patrick from Blog Storm is blogging about a solution that he read on Matt Cutts’ […]

 

I am not a coder. Will this work with a CMS like Drupal?

SC   January 10, 2008 5:05 pm

This code is for WordPress only. The same principle of adding a link to the blog post will work in other systems, but the code would be different.

 
 

Really good advice there, think I might have to add this to my feed - really hate scrapers and splogs!

 

Hmm.. this will add some work when building custom feeds (outside of the standard blogs).

 

I get a few sites scraping me. I emailed one last week and he felt he was completely innocent. However, he still obeyed and removed my feed.

I am guessing that this file will be replaced at each WP upgrade and will need to be re-edited. I will add it to the list of checks I run each time I upgrade.

Matthew   January 10, 2008 9:00 pm

Ideally I would have done this using a plugin but I couldn’t figure it out. :)

 
 

Sorry for the second post here. I wonder why Google cannot detect scrapers. The logic behind detection would be something like this: sites A and B write original content, but C copies A and B. Surely it could spot that C is a copy of A and B, or C=A+B. I guess it could pose a problem if the scrapers are selective in their posts, though. I have seen some scrapers copy every single post I write, and others copy only posts with certain keywords like iPod.

I wonder if the trust rank of a domain could also help, for example by knowing that my site has been around for 2 - 3 years and this new site suddenly has the same content as mine.

Matthew   January 10, 2008 9:09 pm

Your logic is sound.. however I think most SEOs can speak from experience when they say that Google - advanced as it may be - still has trouble identifying a post’s true owner.

I can’t answer your question though.. I had the exact same thought about it: detecting duplicated content shouldn’t be difficult.

 
 

Been testing this with one of my blogs that uses Feedburner. Is it definitely that PHP file one should use? The code goes immediately after php the_content(), right?

db

 

I’ve updated my RSS Footer WordPress plugin to be able to do this, with the title of the post as anchor text!

 

[…] based on a quote from Matt Cutts in a post by Patrick Altoft, “How to foil scrapers on your blog”, I’ve added the option to add a link back to the post itself, with the title of the post as anchor […]

 

Thanks for the heads up.

My blog posts are also being copied and I was worried about it..

Can’t scrapers remove this link code somehow?

 

Okay! Got it working now!

Many thanks. Will report back if I see any interesting trends re the dozens of scrapers that pull Sciencebase.

db

 

Another thought…presumably, this doesn’t work in retrospect, so any archived duplicated posts on scraper sites will not have embedded the permalink…

db

 

[…] article. Anyone comfortable with PHP can get the same result by adding one line of code to the file […]

 

[…]  It turns out this is common practice out there in the big bad internet. They even have a term for it - Blog Scraping. Well I thought there was nothing I could do about it until I came across this post. […]

 

[…] How to foil scrapers on your blog - by Patrick Altoft […]

 

Blah … I just use jQuery to add a rel=’nofollow’ to all scraped links.

Nice try though.

 

[…] would like to thank Stephan for the Matt Cutts interview and Blogstorm for the PHP code. There is also supposedly a plugin that does these two things, but at the time of the posting, the […]

 

@ZebZiggle: jQuery == JavaScript, and search engines DON’T DO JavaScript, so those nofollows are nonsense.

 

@Joost: Good point. Better add it to the back-end as well. That should do it.

 

[…] to manually add a backlink to text via your HTML code in your control panel which can be found in Blogstorm, But if you’re like me then maybe I think Joost de Valk’s may have solved the issue for all the […]

 

Since I am slightly backward, technically speaking, I don’t know if this would work or not. Suppose someone started a file storage site, along the lines of 4shared or Rapid Share, where bloggers could upload their posts before publishing them and have them time- and date-stamped on upload. Then who first published a given post could be proven positively. In the case of a dispute, the blog entries, including the time/date stamp, could be downloaded and sent to the advertisers, etc. If something like this could be made to work, it would be a money maker for the owner of the site, as well as a big help in settling disputes over ‘scraping’. I’m sorry if this is a really idiotic suggestion, but it just seems to be a common sense solution, if the storage site could pass any legal requirements as to the validity of the time/date stamps. If I had any technical know-how or money, I might try it myself. Wouldn’t it easily be worth, say, $50 a year to you ‘big guys’?

Caroline Witte   February 11, 2008 11:16 pm

Hi Caroline.

Personally I don’t have an issue with scrapers because I have always made sure they link to my site. The more scrapers I get the more links I get.

Scrapers are easy to spot and I doubt we need a service that proves who owns the content.

 
 

[…] of bloggers are forced to deal with spam blogs (splogs, aka scraper blogs), and even though a variety of counter measures exist, they just don’t seem to do the trick. Most of the time, splogs will […]

 
