Thread: Robots.txt problem with https

  1. #1

    Robots.txt problem with https

    Google is returning search results for my site that point to https URLs. My site has never used SSL, and these links to "nonexistent" secure pages are riddled with errors.

    Bottom line is I want Google to remove these bad links, and the issue I'm running into is I can't get Googlebot to read a robots.txt file instructing it not to crawl https. The suggestions I find to remove https links all talk about what to do if you have SSL set up on your server. Since I don't have SSL and should have no https pages on my site in the first place, I'm not sure how to get Google to stop crawling and creating these bogus links.

    Can anyone help?

  2. #2

    Well, I'm not 100% happy with the solution I've come up with, but it at least makes the problem better. I cobbled this together from several places, so here are suggestions merged in one place for the benefit of others having the same issue.

    1) The Google crawler looks for a robots.txt file when accessing your site. My first thought was to modify my existing robots.txt to block https. In researching how to do that, I learned that Google treats http and https as two separate sites, so each one needs its own robots.txt file.
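
To make the "two separate sites" point concrete, here is a minimal Python sketch that works out which robots.txt URL applies to a given page. The helper name `robots_url` is made up for this illustration; it only shows that the scheme is part of the location a crawler asks for, and it is not how Googlebot itself is implemented.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url(page_url: str) -> str:
    """Return the robots.txt URL that covers page_url.

    The scheme and host[:port] are kept as they are; only the path is
    replaced, so http and https pages map to different robots.txt URLs.
    """
    parts = urlsplit(page_url)
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


# The same page path under http and https is governed by two different files.
print(robots_url("http://www.exampleSiteHere.com/some/page.html"))
print(robots_url("https://www.exampleSiteHere.com/some/page.html"))
```

So a rule placed in the http robots.txt says nothing about the https URLs, which is why a second file is needed.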

    I created a file called robots_ssl.txt to hold the rules for https links. It contains this text, which tells all crawlers not to crawl any URL:

    Code:
    User-agent: *
    Disallow: /
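
If you want to sanity-check those two lines before uploading them, Python's standard-library robots.txt parser can be used as a rough stand-in for a crawler. This is only a sketch: `is_allowed` is a helper invented for the example, and Googlebot's own parser may differ in edge cases.

```python
from urllib.robotparser import RobotFileParser

# The contents of robots_ssl.txt from the step above.
ROBOTS_SSL = """\
User-agent: *
Disallow: /
"""


def is_allowed(robots_text: str, user_agent: str, url: str) -> bool:
    """Return True if robots_text lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser.can_fetch(user_agent, url)


# With "Disallow: /" every path on the https site is off limits.
print(is_allowed(ROBOTS_SSL, "Googlebot", "https://www.exampleSiteHere.com/any/page.html"))
```

Keep in mind this only tests the rules themselves, not whether the server actually hands this file out over https.
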
    2) Next step is to direct crawlers to the robots.txt file specific for https links instead of the one used for http. I found this recommended code to place in the .htaccess file to do so:

    Code:
    # REDIRECT SEARCH ENGINES TO robots.txt FILE FOR SECURE PAGES
    RewriteEngine On
    RewriteCond %{SERVER_PORT} ^443$
    RewriteRule ^robots\.txt$ robots_ssl.txt [NC,L]
    Unfortunately, this rule doesn't seem to work on Bluehost. When I try accessing

    Code:
    https://exampleSiteHere.com/robots.txt
    the link ends up here:

    Code:
    https://exampleSiteHere.com/~example1/~example1/robots.txt
    which produces a 404 error.

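    I managed to get around this problem by making the .htaccess code more explicit and giving the rule an absolute URL. As far as I can tell, this makes Apache redirect the request to that URL instead of rewriting the path internally: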

    Code:
    RewriteCond %{SERVER_PORT} ^443$
    RewriteRule ^robots\.txt$ http://www.exampleSiteHere.com/robots_ssl.txt [NC,L]
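
To spell out what the port-based rule in step 2 is meant to do, here is a small Python model of the decision. It is an illustration only: `file_served` is a made-up name, the model covers the intended behavior of the first (relative) rule, and it says nothing about why Bluehost mangled the path. It also shows why the dot in the pattern is better escaped, since an unescaped `.` matches any character.

```python
import re

# Same idea as the RewriteRule pattern; the [NC] flag maps to IGNORECASE.
ROBOTS_PATTERN = re.compile(r"^robots\.txt$", re.IGNORECASE)


def file_served(server_port: int, path: str) -> str:
    """Model of: RewriteCond %{SERVER_PORT} ^443$ plus the robots.txt RewriteRule.

    path is the request path as .htaccess sees it, without the leading slash.
    """
    if server_port == 443 and ROBOTS_PATTERN.match(path):
        return "robots_ssl.txt"
    return path


print(file_served(443, "robots.txt"))  # https request gets the blocking file
print(file_served(80, "robots.txt"))   # http request is left alone

# An unescaped dot would also match unrelated names such as "robotsXtxt".
print(bool(re.match(r"^robots.txt$", "robotsXtxt")))
print(bool(ROBOTS_PATTERN.match("robotsXtxt")))
```
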
    3) My hope is that the steps above will stop crawlers from fetching https URLs and that the existing ones will eventually drop out of search results. In the meantime I still have hundreds of bad links to contend with. Since I don't want visitors clicking on links full of errors, I found this tip to redirect them from https to http using .htaccess. There are lots of ways to do this, and either of these seems to work (use one, not both). Keep the robots.txt rule from step 2 above these lines, since rules run in order and a blanket redirect placed first would catch the robots.txt request as well:

    Code:
    # REDIRECT HTTPS TO HTTP (method 1)
    RewriteCond %{SERVER_PORT} ^443$
    RewriteRule ^(.*)$ http://www.exampleSiteHere.com/$1 [R=301,L]
    
    # REDIRECT HTTPS TO HTTP (method 2)
    RewriteCond %{HTTPS} on
    RewriteRule ^(.*)$ http://www.exampleSiteHere.com/$1 [R=301,L]
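
For reference, this is roughly the mapping those redirect rules perform, written as a Python function so the expected targets are easy to see. It is a sketch under assumptions: `redirect_target` and `CANONICAL_HOST` are names invented here, and it assumes the query string is carried over unchanged, which is mod_rewrite's default when the substitution has no query string of its own.

```python
from typing import Optional
from urllib.parse import urlsplit, urlunsplit

# Host taken from the rules above; always redirect to the www form.
CANONICAL_HOST = "www.exampleSiteHere.com"


def redirect_target(url: str) -> Optional[str]:
    """Return the http URL an https request should be 301-redirected to.

    Returns None for URLs that are not https, since the rules skip them.
    """
    parts = urlsplit(url)
    if parts.scheme != "https":
        return None
    # Keep path and query; an empty path becomes "/".
    return urlunsplit(("http", CANONICAL_HOST, parts.path or "/", parts.query, ""))


print(redirect_target("https://exampleSiteHere.com/forum/index.php?t=5"))
print(redirect_target("http://www.exampleSiteHere.com/forum/index.php?t=5"))
```

A list of expected targets like this can be compared by hand against what the live site returns for a few of the bad https links.
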
    =====

    I'm still getting some warnings and errors in Google Webmaster Tools that make me wonder if all this will work long-term, but at least it seems to help.
