Nov 10

How Do The Duplicate Content Filters Work?

The topic of duplicate content has been covered before but, as far as I know, there haven’t been any good tests to see what trips the duplicate content filters and what does not.

I have just completed my first phase of duplicate content testing and would like to share the results with you.

First off, this test did not go into the effects that duplicate meta tags can have on a site. It also did not test duplicate content issues between sites. Put another way, it only tested duplicate content on a single site and the effects of that duplicate content within the site.

Secondly, I do not have any hard-and-fast numbers on this. I’d love to give them to you but I’m not a statistics geek so I honestly don’t know how to set up that kind of test.

First, I took a list of 23,000 cities throughout the U.S. Next I picked a topic. For instance, one of my topics was plumbing. From that point I created a basic template and put some basic text about plumbing into it. The template also had placeholder tags, $CITY and $STATE, which were swapped out for each city and state to produce one page per city.
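
Here is a rough sketch of that kind of template fill in Python. The template text, the sample cities, the file naming, and the `build_pages` helper are all made up for illustration; the original pages were built by hand, not with this code.

```python
from string import Template

# Hypothetical template text; the original plumbing copy is not shown in the post.
PAGE_TEMPLATE = Template(
    "Looking for a plumber in $CITY, $STATE? "
    "Here is what to know before hiring a $CITY plumber."
)

# Hypothetical sample rows standing in for the 23,000-city list.
CITIES = [("Austin", "TX"), ("Boise", "ID"), ("Tampa", "FL")]


def build_pages(template, cities):
    """Return a dict mapping a file name to the filled-in page text."""
    pages = {}
    for city, state in cities:
        filename = f"{city.lower().replace(' ', '-')}-{state.lower()}-plumber.html"
        pages[filename] = template.substitute(CITY=city, STATE=state)
    return pages


pages = build_pages(PAGE_TEMPLATE, CITIES)
```

Every page produced this way is identical apart from the swapped-in city and state, which is worth keeping in mind for the rest of the discussion.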

With 25 or so links, the sites ranked and I managed to get almost all of the 23,000 pages indexed in Google. They’d come up for phrases like “city, state plumber” and so on. I made probably $5 a day on the site at its peak.

These sites tanked when I went link happy and over-optimized the links. At one point I had over 20,000 links to my plumbing site. I’m sure it was a huge spam flag to Google at the time.

Incidentally, this was before I learned PHP so I had to manually upload all 23,000 of those pages to my host which took a LONG time!

A few months went by and I decided to take what I had learned and push it out to another site that would be PHP driven. Unfortunately I changed the parameters some…

When I created the site this time, I had writers write content on various service industries. Once they had an article written which was nothing more than 30 sets of questions and answers, I stored them in a couple of files.

I had also done a bit more research and now had over 130,000 cities throughout the U.S.

I wrote a PHP script to combine this data, but this time it would simply rotate the questions and answers on the page randomly.

I did the same thing as before and got 25 or so links to it…and the sites went nowhere.

After analyzing the new sites compared to the old sites, as well as looking at the patents regarding ‘shingles’, I have come to the conclusion that the search engines are using shingles to detect duplicate content.

As a result, it was very easy for them to detect that my pages were essentially the same, the only difference being that the sentences were reordered.
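
To see why reordering alone does so little, here is a small sketch. It uses four-word shingles, which are explained further down, and the question-and-answer text is invented for the example. Reversing the order stands in for the random rotation so the run is repeatable.

```python
def shingles(text, size=4):
    """Return the set of overlapping word groups of the given size."""
    words = text.lower().split()
    return {" ".join(words[i:i + size]) for i in range(len(words) - size + 1)}


# Hypothetical question-and-answer pairs standing in for the writers' articles.
qa_pairs = [
    "How often should a water heater be flushed? Most plumbers suggest once a year.",
    "What causes low water pressure? Clogged aerators and old pipes are common culprits.",
    "Can I fix a leaking faucet myself? Usually yes, with a new washer or cartridge.",
    "Why does my drain gurgle? Trapped air from a partial clog is the usual reason.",
]

page_a = " ".join(qa_pairs)            # original order
page_b = " ".join(reversed(qa_pairs))  # same pairs, different order

shared = shingles(page_a) & shingles(page_b)
overlap = len(shared) / len(shingles(page_a))
print(f"{overlap:.0%} of page A's shingles also appear on page B")
```

Only the shingles that straddle the boundary between two pairs change, so the bulk of them survive the shuffle.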

Knowing this bit of information, let’s take a look at the Shingle Algorithm.

Keep in mind that the search engines may or may not use it exactly the way that it’s mentioned in the patents and I might have some minor errors in my understanding. Regardless, I’ll do the best that I can to break this down for you.

Let’s suppose that you have a page that contains the following text:

The swift brown fox jumped over the lazy dog.

For the purposes of this discussion, we’ll assume that the HTML has already been removed.

When you apply the algorithm to this text, you break the text down into groups of words. To make it easier to understand here (and also so we don’t have more shingles than we need to demonstrate the concept) we’ll assume that the shingle size is 4.

Breaking the text into shingles results in (one shingle per line):

The swift brown fox

swift brown fox jumped

brown fox jumped over

fox jumped over the

jumped over the lazy

over the lazy dog

As you can see in the example above, I did not remove the stop words. These are words like ‘the’, ‘and’, ‘to’, ‘of’, etc.
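
The breakdown above can be sketched in a few lines of Python. The `word_shingles` name is made up, and lowercasing and stripping punctuation are my own assumptions; the patents may normalize text differently.

```python
def word_shingles(text, size=4):
    """Return overlapping word groups ("shingles") of the given size, in order."""
    # Assumption: lowercase and strip punctuation so "dog." and "dog" compare equal.
    words = [w.strip(".,!?;:\"'").lower() for w in text.split()]
    words = [w for w in words if w]
    return [" ".join(words[i:i + size]) for i in range(len(words) - size + 1)]


page_one = "The swift brown fox jumped over the lazy dog."
for shingle in word_shingles(page_one):
    print(shingle)
```

A text of n words gives n - size + 1 shingles, which is why the nine-word sentence yields six of them.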

At this point, a formula is run across each of the shingles, which produces a fingerprint of sorts that uniquely identifies that group of words.
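
As an illustration only, a fingerprint can be as simple as a truncated hash. I do not know which formula the search engines actually run, so the SHA-256 choice and the 64-bit size here are assumptions, and `fingerprint` is a made-up name.

```python
import hashlib


def fingerprint(shingle):
    """Map a shingle to a 64-bit integer fingerprint.

    Assumption: the first 8 bytes of a SHA-256 digest stand in for
    whatever function the search engines really use.
    """
    digest = hashlib.sha256(shingle.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


a = fingerprint("the swift brown fox")
b = fingerprint("the swift brown fox")
c = fingerprint("swift brown fox jumped")
print(a == b, a == c)
```

The same words always give the same number, so comparing pages becomes a matter of comparing integers instead of text.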

So, at some point the search engine comes along to ‘score’ the page. It has these fingerprints and it looks across all of the pages on a site for each fingerprint.

If the fingerprint exists on another page then a point value is incremented. If a particular page has more points than the desired threshold then the page is flagged as possible duplicate content.
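
A minimal sketch of that scoring pass might look like the following. Several things are assumptions on my part: the points are divided by the page's shingle count to give a percentage, the 70% cutoff is a guess, the shingle strings themselves are used in place of fingerprints, and the three sample pages are invented.

```python
from collections import Counter


def shingle_set(text, size=4):
    words = text.lower().split()
    return {" ".join(words[i:i + size]) for i in range(len(words) - size + 1)}


def flag_duplicates(pages, threshold=0.7, size=4):
    """Return {url: score} for pages whose share of shingles that also
    appear on another page of the same site exceeds the threshold."""
    page_shingles = {url: shingle_set(text, size) for url, text in pages.items()}
    # Count how many pages each shingle appears on.
    seen_on = Counter(s for group in page_shingles.values() for s in group)
    flagged = {}
    for url, group in page_shingles.items():
        if not group:
            continue
        points = sum(1 for s in group if seen_on[s] > 1)  # one point per repeat
        score = points / len(group)
        if score > threshold:
            flagged[url] = score
    return flagged


# Hypothetical three-page site: two templated city pages and one unique page.
site = {
    "/austin": "call a licensed plumber today for fast reliable service in austin",
    "/boise": "call a licensed plumber today for fast reliable service in boise",
    "/about": "we are a family owned business founded by two brothers in 1998",
}
flagged = flag_duplicates(site)
print(flagged)
```

The two city pages share seven of their eight shingles and get flagged, while the unique page scores zero.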

Taking our example above, let’s say that we had page two that had the following sentence:

My old lady swears that she saw the lazy dog jump over the swift brown fox.

Breaking it down into shingles:

my old lady swears

old lady swears that

lady swears that she

swears that she saw

that she saw the

she saw the lazy

saw the lazy dog

the lazy dog jump

lazy dog jump over

dog jump over the

jump over the swift

over the swift brown

the swift brown fox

If you compare the two sets of shingles, you’ll see that there is only one match - “the swift brown fox”.

Thus, these two documents would not be considered duplicates of each other. If, however, more than 70% of them matched, they might be considered duplicates. Note that the exact percentage is a secret only known to Google and the other search engines.
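
The comparison of the two example pages can be checked with a short script. The overlap measure used here (shared shingles divided by all distinct shingles across both pages) is one common choice and an assumption on my part, not necessarily what any search engine computes.

```python
def shingle_set(text, size=4):
    # Assumption: lowercase and strip punctuation before grouping words.
    words = [w.strip(".,!?;:").lower() for w in text.split()]
    return {" ".join(words[i:i + size]) for i in range(len(words) - size + 1)}


page_one = "The swift brown fox jumped over the lazy dog."
page_two = ("My old lady swears that she saw the lazy dog "
            "jump over the swift brown fox.")

a, b = shingle_set(page_one), shingle_set(page_two)
shared = a & b
resemblance = len(shared) / len(a | b)

print(sorted(shared))         # ['the swift brown fox']
print(len(a), len(b))         # 6 13
print(round(resemblance, 3))  # 0.056
```

One shared shingle out of eighteen distinct ones is nowhere near a 70% cutoff.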

So how, exactly, can you put this to use?

Well, the first thing is that you now have an easy way to ‘score’ your pages to determine if they are duplicates of one another or not. For instance, let’s suppose you have an article directory. Chances are, people will be submitting their article to you as well as hundreds of other directories through resources like Article Marketer.

If that’s the case then you’ll have a huge duplicate content issue with other sites that use these articles unless you can lower the percentage of shingle matches.

That’s a very important concept to understand because as I see it, there are basically only two ways to lower that percentage. The first is to put random text throughout the sentences breaking them up.

It’s an ugly solution because it’s going to look like garbage to the user unless you also use CSS to move the text to a different position on the page. Personally I don’t think this is a good solution because it’s pretty easy to programmatically check.

The second method is that you can add some text around the article (sides, bottom, top). Stuff that makes sense.
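
To get a feel for how much surrounding text that takes, here is a back-of-the-envelope sketch. It assumes every shingle of the article is already matched elsewhere, that none of the added text is, and that the cutoff is the 70% guess used earlier. The `unique_shingles_needed` helper is made up, and it ignores the few shingles that straddle the join.

```python
def unique_shingles_needed(matched, threshold_percent=70):
    """Smallest number of unmatched shingles to add so that
    matched / (matched + added) drops strictly below the threshold.

    Integer arithmetic avoids floating-point rounding at the boundary.
    """
    return matched * (100 - threshold_percent) // threshold_percent + 1


# A 500-word article has 500 - 4 + 1 = 497 four-word shingles.
article_shingles = 500 - 4 + 1
needed = unique_shingles_needed(article_shingles)
print(needed)  # 214
```

Each added word contributes roughly one new shingle, so under these assumptions a 500-word syndicated article needs a bit over 200 words of original text around it.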

Now, for those of you who are thinking - I know what I’ll do. I’ll add an RSS feed to the bottom of my document. That’ll work perfectly.

Do not do it! RSS feeds are by their very nature duplicate content!

Sure, some of you are groaning at this point but go back to our discussion before. You need to get your page below the threshold that trips the duplicate content filter. It doesn’t matter where the duplicate content comes from. We’re using a fingerprint which could be found on any one of the billions of pages throughout the internet.

The RSS feed has shingles which have their own fingerprints and the article has shingles which have their own fingerprints. The result when added together is still duplicate content!
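
A small sketch makes the point. All three texts are invented, the `known` set stands in for everything a search engine has already seen, and the 70% line is the same guess as before.

```python
def shingle_set(text, size=4):
    words = text.lower().split()
    return {" ".join(words[i:i + size]) for i in range(len(words) - size + 1)}


def matched_fraction(page_text, known_shingles, size=4):
    """Share of the page's shingles already present in known_shingles."""
    page = shingle_set(page_text, size)
    return len(page & known_shingles) / len(page) if page else 0.0


# Hypothetical texts: a syndicated article, a feed item, and a hand-written note.
article = "a slow drain usually means a partial clog somewhere in the line"
feed_item = "new post ten tips for winterizing your outdoor faucets this year"
own_note = "our crew fixed exactly this problem at a bakery downtown last spring"

# Both the article and the feed item already exist elsewhere on the web.
known = shingle_set(article) | shingle_set(feed_item)

article_only = matched_fraction(article, known)
with_feed = matched_fraction(article + " " + feed_item, known)
with_note = matched_fraction(article + " " + own_note, known)
print(article_only, with_feed, round(with_note, 2))
```

Appending the feed item barely moves the score, because only the handful of shingles at the join are new. Appending the original note pulls it well under the line.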

Mercy me! If Matt Cutts were dead, he’d be rolling over in his grave about what I’m about to say!

You’ve got to have unique original content on your page to avoid this duplicate filter.

Yes, yes, I hear you saying - “but but but you can use random content!”

If you’re going to use random content, why bother with the articles in the first place?

Alternatively, you could rewrite the articles on the fly with a synonym replacer or figure out something along those lines.

Chances are, however, in the next year or two even those methods will be detectable in some fashion.

On one last note, Levi and I had a nice discussion about this recently.

Levi thinks that adding some original content at the end of the page would not help you. I believe it would because, as I told him, the search engines are not using either of the two methods that would let them detect a full article copied word for word with extra text added at the end.

The only way I see to do this is either to track the location of the shingle on the page in addition to the shingle fingerprint or group shingles into sets. That is, you would add shingle 1 + shingle 2 = shingle set 1 and so on.

Both of these methods have their own flaws so I don’t see them being implemented anytime soon.
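
For the curious, here is what the second idea (grouping shingles into sets) might look like as a sketch. Nothing here is taken from a patent: the sliding window of two shingles, the SHA-256 fingerprint, and the `shingle_sets` and `containment` names are all my own assumptions, and the texts are invented.

```python
import hashlib


def shingles(text, size=4):
    words = text.lower().split()
    return [" ".join(words[i:i + size]) for i in range(len(words) - size + 1)]


def shingle_sets(text, size=4, group=2):
    """Fingerprint each run of `group` consecutive shingles (sliding window)."""
    items = shingles(text, size)
    out = set()
    for i in range(len(items) - group + 1):
        joined = "|".join(items[i:i + group])
        out.add(hashlib.sha256(joined.encode("utf-8")).hexdigest()[:16])
    return out


def containment(source_text, page_text, size=4, group=2):
    """Share of the source's shingle sets that also appear on the page."""
    source = shingle_sets(source_text, size, group)
    page = shingle_sets(page_text, size, group)
    return len(source & page) / len(source) if source else 0.0


# Hypothetical article copied word for word, with original text added at the end.
article = "a slow drain usually means a partial clog somewhere in the line"
padded = article + " our crew fixed exactly this problem at a bakery downtown last spring"

copied = containment(article, padded)
page_share = len(shingle_sets(article) & shingle_sets(padded)) / len(shingle_sets(padded))
print(copied, page_share)
```

Measured from the article's side, the copy is complete even though the padded page as a whole falls under a 70% cutoff. The catch is that the engine has to know which source to measure against, which is part of why this is not cheap to do at scale.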

Finally, the shingle size I used is not what the articles I’ve read on this recommend. They talk about using a shingle size of 15 words. Of course, they show pretty graphs with compute time, false triggers (saying a page is duplicate content when it’s not) and other stuff like that.

You should understand the basics of how duplicate content detection works now. If you’ve got any comments or see any flaws in my thinking, let me know.

G-Man

4 Responses

  1. I agree to disagree on duplicate content! » Making Money Online - Boogybonbon.com » Blog Archive says

    […] Please take a minute to read G-Man’s full article on his idea of duplicate content, as well as subscribe to his feed as he always has great bit of information on BlackHat, and SEO in general. […]

    November 13th, 2006 |

  2. johnmoo says

    Test your idea, G-Man: http://oy-oy.eu/page/shingles/ 

    November 25th, 2006 |

  3. SEORookie » Duplicate Content and the -30 Penalty says

    […] G-Man has a thought provoking post about duplicate content filters. […]

    November 27th, 2006 |

  4. Content shuffle example: an SEO copywriting how-to says

    […] variants and made sure that all sentences were SHORT. This has to do with the way Google calculates shingles and detects duplicate content issues. I also ‘broke’ the shingles by making them as […]

    August 9th, 2008 |


View Geoffrey 
'G-Man' Faivre-Malloy's profile on LinkedIn
