How to foil scrapers on your blog
Scrapers can cause a lot of problems for bloggers, mainly because many of them remove the links back to your blog, making it hard for search engines to decide which blog is the copycat.
Here is what Matt Cutts recently said about how best to protect yourself against duplicate content:
If you are syndicating articles on third party sites make sure they link back to the original article on your site, rather than your homepage.
So internal links within the post, plus maybe a link to your homepage in your feed footer, aren’t going to be the best solution. What you really need is a link to the blog post itself from within the feed content. Your feed already has a link to each post, of course, but most scrapers remove those links and keep just the title and the content.
Find your feed-rss2.php file in the wp-includes folder and add the following code at line 39 (in WP 2.3.1), just after where it says <?php the_content() ?>. Because this is a core WordPress file, the edit will probably need to be reapplied after each upgrade.
<p><a href="<?php the_guid(); ?>">Permalink + Comments</a></p>
This helps search engines identify the source of the post and gives your readers an extra place to click to visit your site.
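The same idea can be sketched as a small plugin, so that no core file has to be edited. This is only a minimal sketch under a few assumptions: it relies on the `the_content_feed` and `the_excerpt_rss` filters, which means a WordPress release newer than the 2.3.1 mentioned above, and the `fsl_` function names and the plugin name are hypothetical, made up for this illustration. It uses the post title as the anchor text rather than "Permalink + Comments".

```php
<?php
/*
Plugin Name: Feed Source Link (hypothetical example)
*/

// Build the HTML for the link back to the original post.
// Plain PHP with no WordPress calls, so it can be checked on its own.
function fsl_build_source_link( $url, $title ) {
    // The final "false" avoids double-encoding entities that are already escaped.
    $safe_url   = htmlspecialchars( $url, ENT_QUOTES, 'UTF-8', false );
    $safe_title = htmlspecialchars( $title, ENT_QUOTES, 'UTF-8', false );
    return '<p><a href="' . $safe_url . '">' . $safe_title . '</a></p>';
}

// Filter callback: append the link to the current post's feed content.
function fsl_append_source_link( $content ) {
    return $content . fsl_build_source_link( get_permalink(), get_the_title() );
}

// Register the callback only when running inside WordPress.
if ( function_exists( 'add_filter' ) ) {
    add_filter( 'the_content_feed', 'fsl_append_source_link' ); // full-text feeds
    add_filter( 'the_excerpt_rss', 'fsl_append_source_link' );  // summary feeds
}
```

Saved as a file in wp-content/plugins and activated, a plugin along these lines should survive WordPress upgrades, which an edit to feed-rss2.php does not.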

Hmmm. That seems to be a very useful tip. There are a lot of sites that scrape my content, so this should at least do a little something about it.
Sly from Slyvisions.com January 10, 2008 12:23 pm | Reply
[…] Patrick from Blog Storm is blogging about a solution that he read on Matt Cutts’ […]
Advice Network Founders Blog» Blog archives » Stop scrapers from spoiling your fun. January 10, 2008 4:06 pm | Reply
I am not a coder. Will this work with a CMS like Drupal?
SC January 10, 2008 5:05 pm | Reply
This code is for WordPress only. The same principle of adding a link to the blog post will work in other systems but the code would be different.
Patrick Altoft January 10, 2008 9:48 pm | Reply
Really good advice there, think I might have to add this to my feed - really hate scrapers and splogs!
Nick - road2blogging January 10, 2008 6:46 pm | Reply
Hmm.. this will add some work when building custom feeds (outside of the standard blogs).
Damien van Holten January 10, 2008 8:14 pm | Reply
I get a few sites scraping me. I emailed one last week and he felt he was completely innocent. However, he still obeyed and removed my feed.
I am guessing that this file will probably be replaced at each WP upgrade and will need to be re-edited. I will add it to the list of checks I run each time I upgrade.
Matthew January 10, 2008 9:00 pm | Reply
Ideally I would have done this using a plugin but I couldn’t figure it out.
Patrick Altoft January 10, 2008 9:49 pm | Reply
Sorry for the second post here. I wonder why Google cannot detect scrapers. The logic behind detection would be something like this: sites A and B write original content, but C copies A and B. Surely it could spot that C is a copy of A and B, or C=A+B. I guess it could pose a problem if the scrapers are selective in their posts, though, as I have seen some scrapers copy every single post I write and others copy only posts with certain keywords like iPod.
I wonder if the trust rank of a domain could also help, maybe by knowing that my site has been around for 2 - 3 years and this new site suddenly has the same content as mine.
Matthew January 10, 2008 9:09 pm | Reply
Your logic is sound.. however I think most SEOs can speak from experience when they say that Google - advanced as it may be - still has trouble identifying a post’s true owner.
I can’t answer your question though.. I had the exact same thought about it: detecting duplicated content shouldn’t be difficult.
Damien van Holten January 11, 2008 11:13 pm | Reply
Been testing this with one of my blogs that uses Feedburner. Is it definitely that php file one should use? The code goes immediately after php the_content(), right?
db
David Bradley January 11, 2008 4:20 am | Reply
Yes, that’s right.
Patrick Altoft January 11, 2008 5:29 am | Reply
I’ve updated my RSS Footer WordPress plugin to be able to do this, with the title of the post as anchor text!
Joost de Valk January 11, 2008 7:25 am | Reply
[…] based on a quote from Matt Cutts in a post by Patrick Altoft, “How to foil scrapers on your blog”, I’ve added the option to add a link back to the post itself, with the title of the post as anchor […]
Make the scrapers work for you! - SEO WordPress - Joost de Valk's SEO Blog January 11, 2008 7:30 am | Reply
Thanks for the heads up.
My blog posts are also being copied and I was worried about it..
Can’t scrapers remove this link code somehow?
Reed January 11, 2008 4:01 pm | Reply
Okay! Got it working now!
Many thanks. Will report back if I see any interesting trends re the dozens of scrapers that pull Sciencebase
db
David Bradley January 11, 2008 7:59 pm | Reply
Another thought…presumably, this doesn’t work in retrospect, so any archived duplicated posts on scraper sites will not have embedded the permalink…
db
David Bradley January 11, 2008 8:57 pm | Reply
[…] article. Anyone comfortable with PHP can get the same result by adding one line of code to the file for […]
Usar os scrappers em proveito próprio - Marketing de Busca January 11, 2008 11:37 pm | Reply
[…] It turns out this is common practice out there in the big bad internet. They even have a term for it - Blog Scraping. Well I thought there was nothing I could do about it until I came across this post. […]
Save your content from being Stolen | Tenacious Creations January 12, 2008 12:13 am | Reply
[…] How to foil scrapers on your blog - by Patrick Altoft […]
13 Great Articles - January 11, 2007 | My lucky number 13 January 12, 2008 7:39 am | Reply
Blah … I just use jquery to add a rel=’nofollow’ to all scraped links.
Nice try though.
ZebZiggle January 12, 2008 12:00 pm | Reply
[…] would like to thank Stephan for the Matt Cutts interview and Blogstorm for the PHP code. There is also supposedly a plugin that does these two things, but at the time of the posting, the […]
Protect yourself from content theft | Mixed Market Arts January 12, 2008 12:52 pm | Reply
@ZebZiggle: jquery == javascript, search engines DON’T DO javascript, so those nofollows are nonsense.
Joost de Valk January 13, 2008 12:09 am | Reply
@Joost: Good point. Better add it to the back-end as well. That should do it.
ZebZiggle January 15, 2008 12:30 pm | Reply
[…] to manually add a backlink to text via your HTML code in your control panel, which can be found in Blogstorm. But if you’re like me, then maybe Joost de Valk’s may have solved the issue for all the […]
Employ Scrapers for your Blog : Yeepage January 15, 2008 4:04 pm | Reply
Since I am slightly backward, technically speaking, I don’t know if this would work or not, but suppose someone started a file storage site, on the order of 4shared or Rapid Share, where bloggers could upload their posts before publishing them and have them time- and date-stamped on upload. Then it would be possible to prove positively who first published a given post. In the case of a dispute, the blog entries, including the time/date stamp, could be downloaded and sent to the advertisers, etc. If something like this could be made to work, it would be a money maker for the owner of the site, as well as a big help in settling disputes over ‘scraping’. I’m sorry if this is a really idiotic suggestion, but it just seems to be a common sense solution, if the storage site could pass any legal requirements as to the validity of the time/date stamps. If I had any technical know-how or money, I might try it myself. Wouldn’t it easily be worth, say, $50 a year to you ‘big guys’?
Caroline Witte February 11, 2008 11:16 pm | Reply
Hi Caroline.
Personally I don’t have an issue with scrapers because I have always made sure they link to my site. The more scrapers I get the more links I get.
Scrapers are easy to spot and I doubt we need a service that proves who owns the content.
Patrick Altoft February 11, 2008 11:32 pm | Reply
[…] of bloggers are forced to deal with spam blogs (splogs, aka scraper blogs), and even though a variety of counter measures exist, they just don’t seem to do the trick. Most of the time, splogs will […]
Defeat Spam Blogs With IP Based Content Delivery - Nullamatix - Technology Made Simple March 5, 2008 12:14 pm | Reply