Duplicate Content In WordPress - Begone!
Duplicate content in WordPress? I cast thee from my site!
So where does it come from?
Well, the first source of duplicate content is a robots.txt file that doesn’t block the right paths.
Here’s what my robots.txt file looks like:
User-agent: *
Disallow: /wp-admin/
Disallow: /wp-content/plugins/wp-chunk.php
Disallow: /wp-content/plugins/wp-db-backup.php
Disallow: /wp-content/plugins/st-orderpages.php
Disallow: /wp-content/plugins/pollpress/
Disallow: /wp-content/plugins/buddyCards/
Disallow: /wp-content/plugins/akismet/
Disallow: /wp-content/plugins/advanced-wysiwg.php
Disallow: /wp-content/plugins/sociable/
Disallow: /wp-content/plugins/custom-smileys.php
Disallow: /wp-content/plugins/autometa/
Disallow: /wp-content/plugins/lmbbox-smileys/
Disallow: /wp-content/plugins/wodspewm/
Disallow: /wp-content/themes/
Disallow: /wp-includes/
Disallow: /category/
Disallow: /page/
Disallow: /feed/
Disallow: /comments/
You’ll notice that I blocked my plugins one at a time instead of blocking the whole folder. I could have simply used Disallow: /wp-content/ - unfortunately, doing so would have kept the translated pages from getting indexed, since the translation icons at the bottom of this page link into a plugin that resides in /wp-content/plugins/global-translator.
Anyhow, make sure you don’t block anything that your pages link to directly, like the translator plugin I mentioned, and you’ll be fine. Then block the rest: the includes, category, feed, page, and comments folders.
It’s all precisely designed to keep the duplicate content from your pages!
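If you want to double-check rules like these before uploading them, here's a rough sketch using Python's standard urllib.robotparser module. The rules are a shortened copy of the file above, and the Googlebot agent name and the three test paths are assumptions made up for illustration. This parser only does simple prefix matching, so real crawlers may treat edge cases differently.

```python
from urllib.robotparser import RobotFileParser

# Shortened copy of the rules from the post (not the full file).
SELECTIVE_RULES = """\
User-agent: *
Disallow: /wp-admin/
Disallow: /wp-content/plugins/akismet/
Disallow: /wp-content/themes/
Disallow: /wp-includes/
Disallow: /category/
Disallow: /page/
Disallow: /feed/
Disallow: /comments/
"""

# The simpler alternative the post decided against.
BLANKET_RULES = """\
User-agent: *
Disallow: /wp-content/
"""


def build_parser(rules: str) -> RobotFileParser:
    """Parse robots.txt text directly, without fetching anything."""
    parser = RobotFileParser()
    parser.parse(rules.splitlines())
    return parser


def is_allowed(parser: RobotFileParser, path: str, agent: str = "Googlebot") -> bool:
    """Return True if the given agent may crawl the path under these rules."""
    return parser.can_fetch(agent, path)


selective = build_parser(SELECTIVE_RULES)
blanket = build_parser(BLANKET_RULES)

# Hypothetical paths, chosen only to exercise the rules.
for path in (
    "/category/seo/",
    "/wp-content/plugins/global-translator/",
    "/2006/11/some-post/",
):
    print(path, is_allowed(selective, path), is_allowed(blanket, path))
```

The point to look for is the global-translator line: the one-by-one rules leave it crawlable, while the blanket Disallow: /wp-content/ blocks it.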
The second source of duplicate content in WordPress is the archive pages. This one was a bit trickier to solve. To do so I had to modify my WordPress template. This isn’t for the faint of heart!
Here is the code I now use on my archive pages, which lists only the title of each post:
<h4 class="post-content">
<a title="Permanent Link to <?php the_title(); ?>" rel="bookmark" href="<?php the_permalink(); ?>"><?php the_title(); ?></a></h4>
The code used to look like this:
<div class="entry">
<?php the_content(); ?></div>
<p class="postmetadata">Posted in <?php the_category(', '); ?> | <?php edit_post_link('Edit', '', ' | '); ?> <?php comments_popup_link('No Comments »', '1 Comment »', '% Comments »'); ?></p>
By making these changes you remove the duplicate content from your archive pages because they’ll no longer carry the full text of every article for that month. Each article will only appear in full on its single post page like it should!
G-Man
P.S. Free link to anyone who can tell me how to close up the gap on those links on my archives page so that it doesn’t look like it’s double-spaced 

14 Responses


I wonder how much of this is necessary for a blog, where like mine, I only show snippets on all pages except the post's permalink url. As far as I know, I eliminated the dup content problem this way - no full post text on any page (home page, archives, categories, feeds, etc.) except the actual post itself. But I may have not thought of something. Also, why would plugins cause a dup content problem? Going to one of the plugins urls wouldn't really show anything, would it? At least not post duplication?
November 5th, 2006 |
Well, the problem with WordPress - and I dunno if this exists for other blogs - is that it puts lots of the same content on different pages. When you look at an unmodified WordPress theme, the archives show the same posts that you see on the single pages, as well as on the front page for a time.
This usually lands pages in the supplemental results, which is a duplicate content issue.
Now, as far as blocking my plugins folder, I'll admit I'm being overly cautious, BUT in the case of the translator plugin for WordPress I wanted to make sure the pages would not get blocked, which the Google webmaster tool said they were.
Hopefully that makes more sense.
G-Man
November 6th, 2006 |
Hmmm nice try but it didn't work
I haven't tried only showing an excerpt from the posts. Am thinking of doing that on my main page because some of my posts are rather lengthy. I hate for people to have to scroll TOO much.
G-Man
November 6th, 2006 |
Here's a WMW thread, titled Wordpress and Google: Avoiding Duplicate Content Issues. http://www.webmasterworld.com/forum30/page1.htm
I think Graywolf says something similar in that thread to what Dazzlingdonna says here about displaying only excerpts.
BTW, just disallowing URLs in robots.txt may not do the trick if you have links pointing at those URLs. For example, if you have single posts and your blog index pointing to category pages, chances are those URLs will still show up in the results as URL-only listings. What I prefer is to use a PHP include to selectively add META NOINDEX on pages I want to exclude.
November 6th, 2006 |
See, excerpts are my whole point. I only show excerpts of posts on all pages except the actual post page. So, the posts themselves are never duplicated.
November 6th, 2006 |
Ohhh I see! So if I used the More link in wordpress then it would get rid of the problem because it'd only be a snippet of the full article and the full article would be on the single page.
G-Man
November 6th, 2006 |
hehe… actually I believe it works, if done right :-P
When taking a look at this I see some errors:
post-content, post-content h4, postcontent div, postcontent a{
padding: 0px;
margin: 0px;
line-height:130%;
text-align:justify;
padding:2px 6px 4px;
}
There's a dot missing in front of 'post-content' for it to refer to the class. It should be '.post-content'.
And then 'padding: 0px;' is overruled by the second last line: 'padding:2px 6px 4px;'
November 6th, 2006 |
Stefan - you're right. That does work! Thanks man! Now where should I put that link love?! :P BTW, no worries on the duplicate posts. Akismet is getting a bit aggressive with your posts for some reason and always marks them as spam.
Halfdeck - that might be the case with some search engines, but I had the URLs directly linked down into the plugins/global-translator folder and Google disallowed them based on the Disallow: wp-content/ that I had initially.
Using PHP to add the noindex is a good idea. I'd have to figure out which files to put it in because I certainly don't want it in my template.
G-Man
November 6th, 2006 |
I guess Akismet knows more about me than I thought.. ;-)
About the link, well, maybe some day I'll write something linkworthy. In that case feel free to give me a plug.
November 7th, 2006 |
[quote]Halfdeck - that might be the case with some search engines but I had the url's directly linked down into the plugins/global-translator folder and google disallowed them based on my Disallow: wp-content/ that I had initially.[/quote]
Well, I think it depends largely on the strength of the links pointing to that URL. For example, there was a case where a well-known domain decided to block their entire site using robots.txt. But there were thousands of sites linking to it. In that case, Google did not crawl the URL (parse the HTML) but still chose to display the URL in the SERP. Why? Because if people searched for bmw.com, and they happened to disallow their entire site, Google still wanted to return something in the results.
Also, keep in mind that even if Google disobeys a robots.txt disallow, if a page is weak, it may just not make it into the index. Pages not appearing in the index doesn't necessarily mean Google ignored those pages completely.
META NOINDEX will keep the page out of the SERP completely, or so they tell me.
November 7th, 2006 |
[…] Goeffrey addressed a thesis on WordPress and duplicate content on his blog that I personally find interesting. A large part of the world's blogging scene uses WordPress as its writing tool. […]
November 12th, 2006 |
Howdy, G-Man. We haven’t met (though I’ve seen you at SEOmoz), but I can’t resist a coding puzzle.
This is your code for the Archive for the ‘Sales Letters’ Category:
<div class="post-content">
<h4 class="post-content">Post Title Here
</h4>
<h4 class="post-content">etc
</h4>
</div>
The problem is that H tags by default have top and bottom margins. So, you can either switch to something other than H4 tags (if I were a die-hard CSS/semantics evangelist, I’d be calling for a list, but I’m not). Or you could change the CSS for the post-content class. Something like this ought to do:
h4.post-content {margin-top:0; margin-bottom:0}
Now, it might be that the post-content class is used on other pages (I don’t recall offhand, but it’s likely that it is), or that there’s an ID somewhere above it that interferes. The easiest thing, in such a case, would be to wrap the whole thing in a div with an ID:
<div id="archivelist">
<div class="post-content">
<h4 class="post-content">Post Title Here
</h4>
<h4 class="post-content">etc
</h4>
</div>
</div>
And then:
#archivelist h4.post-content {margin-top:0; margin-bottom:0}
and you can probably get away with
#archivelist h4.post-content {margin:0}
As an aside, I find that the way the default WordPress styles.css files are written is … confusing at best. Styles are broken up weirdly (e.g., an H tag will be defined several times — margin in one place, color in another), which is not the most intuitive or organized method I can think of.
Anyway, good blog; nice content. And nice to meet you.
November 26th, 2006 |
I have implemented many things to prevent this duplicate content problem, but somehow Google is not always obedient enough to follow the noindex tag.
March 1st, 2007 |
In addition, I would consider adding these:
Disallow: /tag/
Disallow: /trackback/
Disallow: /cgi-bin/
Disallow: /author/
Disallow: /backup/
Disallow: /wp-content/cache
Disallow: /*.js$
Disallow: /*.inc$
Disallow: /*.css$
Disallow: /*?
November 3rd, 2008 |