
robots.txt Related Issues



davidp
04-13-2007, 07:32 AM
Hi all,

I have an addon domain with a Joomla installation at / and WordPress at /blog. I have only one robots.txt file, in /, and here is what is currently in it:


User-agent: *
Allow: /sitemap.xml
Allow: /blog/sitemap.xml
Allow: /index.php
Allow: /blog/index.php
Disallow: /*.php$
Disallow: /blog/*.php$
Disallow: /*.js$
Disallow: /*.inc$
Disallow: /*.css$
Disallow: /*.wmv$
Disallow: /*.cgi$
Disallow: /*.xhtml$
Disallow: /*.php-dist$
Disallow: /*.ico$
Disallow: /blog/*.js$
Disallow: /blog/*.inc$
Disallow: /blog/*.css$
Disallow: /blog/*.wmv$
Disallow: /blog/*.cgi$
Disallow: /blog/*.xhtml$
Disallow: /blog/*.php-dist$
Disallow: /blog/*.ico$
Disallow: /administrator/
Disallow: /cache/
Disallow: /components/
Disallow: /editor/
Disallow: /help/
Disallow: /images/
Disallow: /includes/
Disallow: /language/
Disallow: /mambots/
Disallow: /media/
Disallow: /modules/
Disallow: /templates/
Disallow: /installation/
Disallow: /gci-bin/
Disallow: /blog/wp-content/
Disallow: /blog/wp-admin/
Disallow: /blog/wp-includes/

Please take a look at the items highlighted in bold (particularly the first few) and let me know whether they are allowed in a robots.txt file and, especially, whether they will work.

Also, should I have a separate robots.txt file for the /blog directory?

Many thanks!

David
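
On the separate-file question: the original standard linked later in this thread has crawlers request a single file at /robots.txt on the host, so a file at /blog/robots.txt would not normally be read. A minimal sketch of that lookup, assuming a placeholder domain (example.com) and a hypothetical helper name:

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url_for(page_url):
    """Return the robots.txt location a crawler would consult for page_url."""
    parts = urlsplit(page_url)
    # Keep the scheme and host, force the path to /robots.txt,
    # and drop any query string or fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


# example.com is a placeholder, not the poster's real domain.
print(robots_url_for("http://example.com/blog/2007/04/some-post/"))
print(robots_url_for("http://example.com/index.php?option=com_content"))
```

Both calls yield the same root-level URL, which is why the /blog rules have to live in the one file at /.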

hofmax
04-13-2007, 10:16 AM
As far as I'm aware, you can only disallow, not allow. So just leave that blank and make sure it's not disallowed by any of the other rules. Apart from that, the other rules look correct.

KenJackson
04-13-2007, 08:39 PM
Originally posted by hofmax:
    As far as I'm aware you can only disallow not allow.

That's right. Here is the standard (http://www.robotstxt.org/wc/norobots.html).

Also note that the standard supports neither wildcards nor '$'.

Also, "Disallow: /gci-bin/" is a typo (presumably for /cgi-bin/).
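
To illustrate the point about the original standard, here is a minimal sketch of plain prefix matching, where '*' and '$' in a Disallow value are ordinary characters. The function name is hypothetical and the rule list is a small excerpt of the file posted above; real crawlers may behave differently.

```python
def blocked_by_prefix(path, disallow_values):
    """Original-standard style check: a path is blocked when it starts with
    any non-empty Disallow value. '*' and '$' get no special treatment."""
    return any(value and path.startswith(value) for value in disallow_values)


# A few Disallow values taken from the robots.txt in the first post.
disallow = ["/*.php$", "/administrator/", "/gci-bin/", "/blog/wp-admin/"]

print(blocked_by_prefix("/administrator/index.php", disallow))  # True
print(blocked_by_prefix("/contact.php", disallow))              # False
print(blocked_by_prefix("/cgi-bin/test.cgi", disallow))         # False
```

Under this reading, "/*.php$" blocks nothing on a typical site, because no real path starts with the literal characters "/*", and the misspelled /gci-bin/ rule leaves /cgi-bin/ open.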

hofmax
04-14-2007, 04:38 AM
I'm also trying to find a good robots.txt at the moment. This one looks familiar from a blog post I found while googling for an SEO robots.txt for WordPress. Excluding PHP files and the content folder is probably not a good idea.

davidp
04-14-2007, 09:21 AM
Thanks for all your feedback!

The reason I have some "Allow" fields there is because of what is found here: http://www.askapache.com/seo/wordpress-robotstxt-optimized-for-seo.html

Another site, http://www.johntp.com/2007/03/29/create-a-robotstxt-file-and-increase-your-search-engine-rankings/, also (supposedly) demonstrates that the $ suffix is acceptable. Perhaps these sites are wrong?


Originally posted by hofmax:
    Excluding php files and the content folder are probably not a good idea.

Excluding the content folder is fine because none of the items in it are useful for indexing purposes. I think the same may be true for all files besides index.php. Anyone?

davidp
04-14-2007, 09:36 AM
Originally posted by hofmax:
    As far as I'm aware you can only disallow not allow. So just leave that blank and make sure it's not disallowed by some of the other rules. Apart from that the other rules look correct.

Hi!

Take a quick look at Google's robots.txt file... if Google has "Allow" fields, then you can be sure that they are allowed. ;)

http://www.google.com/robots.txt

lazynitwit
04-14-2007, 02:33 PM
Originally posted by davidp:
    Hi!

    Quickly take a look at Google's robots.txt file... if Google has "Allow" fields, then you can be sure that one is allowed them... ;)

    http://www.google.com/robots.txt

Google doesn't always follow standards. However, in this case they sorta do.

There is an "Extended Standard" in which "Allow" is a field, among others. http://www.conman.org/people/spc/robots2.html

Remember that this is a proposed addition to the standard; as such, not all crawlers will follow it.
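
For crawlers that do honor the extensions, here is a rough sketch of how the Allow, '*' and '$' rules from the first post might be evaluated. It assumes the semantics later written down in RFC 9309, which postdates this thread: '*' matches any run of characters, a trailing '$' anchors the end of the path, the longest matching pattern wins, and Allow wins a tie. The function names are hypothetical, and a crawler that follows only the original standard will not behave this way.

```python
import re


def pattern_to_regex(pattern):
    """Translate a robots.txt path pattern with '*' and trailing '$' to a regex."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape the literal pieces and let each '*' match any run of characters.
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile(body + ("$" if anchored else ""))


def is_allowed(path, rules):
    """rules is a list of (allow, pattern) pairs; the longest match decides."""
    best_len, best_allow = -1, True  # no matching rule means allowed
    for allow, pattern in rules:
        if pattern and pattern_to_regex(pattern).match(path):
            if len(pattern) > best_len or (len(pattern) == best_len and allow):
                best_len, best_allow = len(pattern), allow
    return best_allow


# An excerpt of the rules from the first post.
rules = [
    (True, "/sitemap.xml"),
    (True, "/index.php"),
    (True, "/blog/index.php"),
    (False, "/*.php$"),
    (False, "/blog/*.php$"),
    (False, "/*.css$"),
    (False, "/templates/"),
]

for path in ["/index.php", "/blog/index.php", "/contact.php",
             "/blog/wp-login.php", "/templates/site.css", "/about/"]:
    print(path, is_allowed(path, rules))
```

Under these assumed semantics, the two index.php files stay crawlable because their Allow patterns are longer than the "/*.php$" rules that would otherwise block them, while every other .php file is excluded.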