Google’s duplicate content filter is broken

by Patrick Altoft on February 25, 2008

Over the last few weeks Blogstorm has been fighting a battle with scraper sites & the Google duplicate content filter. For some queries, the battle has been well and truly lost.

Try this query to see an example for one of my most popular posts.

Most articles continue to rank highly on Google but a number of them have been filtered from the search results because Google thinks they are duplicates of other pages. Most of the other pages are things like scraper sites or Digg/Sphinn stories – all of which link back to the original post. The articles that have been filtered used to rank very well and they only went missing in the last few weeks.

The duplicate content filter seems to have become a lot more reliant on trust recently and perhaps I am seeing a side effect of this. I would hope that Blogstorm, with over 100,000 natural links and good on-page optimisation, might be a trusted site too but it clearly isn’t there yet.

One theory I have is that the duplicate content filter struggles to identify the original source of content whose URL has changed. I changed the URL structure on Blogstorm in January and since then Google has decided that a lot of articles are suddenly duplicate content. One example is the “What do people take a picture of first” post, which used to rank 4th for the term “jpg” and is now classed as duplicate content and filtered from the results when you search for the exact terms in the title tag.

[Image: dupcontent1.gif]

I invented this title, so for my article not to be on the front page is frustrating, to say the least.

If you are wondering how I know that the duplicate content filter is to blame, take a look at the screenshot below. The original article has been filtered from a search for “Top 10 worst websites you’ll wish you hadn’t seen”, but other pages from Blogstorm that reference the article are still showing up, which proves the domain has the authority and relevance to rank. This is happening across a load of queries.

[Image: dupcontent.gif]
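The check described above (the original article is missing from the results while other pages from the same domain still appear) can be written down as a rough sketch. Everything here is an assumption made for illustration: the function names, the example.com URLs, and the idea that you already have the first page of results as an ordered list of URLs. It flags a suspicious pattern; it does not prove what Google is doing.

```python
from urllib.parse import urlparse


def same_site(url, domain):
    """True if url is on domain or one of its subdomains."""
    host = (urlparse(url).hostname or "").lower()
    return host == domain or host.endswith("." + domain)


def looks_filtered(results, original_url, domain):
    """Flag the pattern: original missing, sibling pages from the same domain present.

    results is the ordered list of result URLs for one query (hypothetical input).
    """
    original_present = original_url in results
    siblings = [u for u in results if same_site(u, domain) and u != original_url]
    return (not original_present) and bool(siblings)


# Made-up first page of results for a title search (not real data).
example_results = [
    "http://scraper.example.net/copied-post/",
    "http://www.example.com/category/websites/",
    "http://aggregator.example.org/story/123",
]
flagged = looks_filtered(
    example_results,
    original_url="http://www.example.com/original-post/",
    domain="example.com",
)
```

In this made-up case `flagged` is `True`: a category page from the domain ranks, but the original post does not.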

I only found these issues because I was testing some new software that lets you do bulk rank checking based on page titles, and I ran it on Blogstorm.

Expecting that most articles would rank first for their own unique titles, I was pretty surprised to see that quite a lot didn’t even rank on the first page! I suspect this blog isn’t unique and that millions of pages are being filtered incorrectly without the site owners ever realising.
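For anyone who wants to run a similar check, here is a minimal sketch of bulk rank checking by page title. It is not the software mentioned above, which is not named in this post. The `fetch_results` callable is a hypothetical stand-in for whatever source of ranking data you have, and the titles and URLs are invented.

```python
def rank_of(url, results):
    """Return the 1-based position of url in results, or None if absent."""
    for position, result_url in enumerate(results, start=1):
        if result_url == url:
            return position
    return None


def bulk_title_check(posts, fetch_results, page_size=10):
    """Search for each post's title and record where the post itself ranks.

    posts is a list of (title, url) pairs. fetch_results is a hypothetical
    callable that takes a query string and returns an ordered list of URLs.
    """
    report = []
    for title, url in posts:
        first_page = fetch_results(title)[:page_size]
        report.append((title, rank_of(url, first_page)))
    return report


def missing_from_first_page(report):
    """Titles whose own post did not appear on the first page."""
    return [title for title, rank in report if rank is None]


# Canned results standing in for a real ranking source (made-up data).
canned = {
    "Post A": ["http://example.com/post-a/", "http://other.example/x"],
    "Post B": ["http://scraper.example/post-b", "http://example.com/tag/b/"],
}
posts = [
    ("Post A", "http://example.com/post-a/"),
    ("Post B", "http://example.com/post-b/"),
]
report = bulk_title_check(posts, lambda query: canned.get(query, []))
missing = missing_from_first_page(report)
```

With the canned data, "Post A" ranks first for its own title and "Post B" ends up in `missing`, which is the kind of result worth a closer look.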

If you have an example of this for your site then post a link in the comments and hopefully Google might take a closer look at how the duplicate content algorithm is working. Certainly a site like Digg shouldn’t be outranking the story it links to when the original story is on a trusted blog.

Patrick Altoft is Director of Search at Leeds based digital & SEO agency Branded3. Patrick also runs Blogstorm.

{ 5 comments }

Syam 25 Feb 2008 at 9:42 pm

I am no expert but I guess there are enough crawlers that keep linking to popular articles from reddit, digg and sphinn, effectively making those pages rank best. I believe Google should figure out for themselves that the hub should not be considered the destination.

Chetan 25 Feb 2008 at 9:55 pm

Hope you don't go into a phase like the one John Chow is in. Content scraping was one of the reasons he got penalised in Google.
I actually just hate content scraping sites, the ones that copy every piece of content on my blog and just link back to my original post.
In my mind, these kinds of sites hurt our quality by carrying all that copied matter, and by linking back they make it look as though we have low-quality backlinks from unnatural pages.

Sucker 26 Feb 2008 at 1:23 am

I did a lot of URL structure changes on one of my sites late in 2007 and had this same problem. A few pages were fixed with 301 redirects but Google is disregarding others (or devaluing them a little.)

master_rooter 26 Feb 2008 at 12:33 pm

I get frustrated too because a well-known scraper site is ranking better than I do (I produce the content and someone else copies it). What will Google (which is a quality content driven search engine) do? Nothing.

Patrick Altoft 28 Feb 2008 at 1:32 am

Some of the queries appear to be returning better results today; the duplicate content filter seems to jump around a bit.
