Search Spiders Robots.txt (10 posts)

  1. jdcfsu
    Member
    Posted 8 years ago #

    How does a site spider a WordPress blog? Could I disallow everything in my WordPress directory and still have the spider find the different articles, since they are called from the database? Or do I need part of the WordPress directory to be accessible to a site spider?

  2. niziol
    Member
    Posted 8 years ago #

    A spider won't really know whether a page is served from a database or a static file. If you disallow access to the directory where your blog is, spiders will not index it.

  3. Direkt
    Member
    Posted 8 years ago #

    A spider obeys the robots.txt file inside your root directory (generally the good ones will; some will ignore it). Also, the spider isn't going to know whether the content is retrieved from a database; just think of it as a user that stores the content it sees.

  4. jdcfsu
    Member
    Posted 8 years ago #

    Ok, but then what should I disallow in the robots.txt file? Surely I don't want wp-admin and parts of wp-content indexed by search engines. So what should be allowed and what should not?

  5. whooami
    Member
    Posted 8 years ago #

    jdcfsu, spidering is done via links, so content inside a directory is only as accessible as the links that you provide. Meaning that just because a directory exists doesn't mean it will be spidered. Spiders are also subject to the same rules a user would be; i.e., an area requiring authentication (wp-admin, for instance) is not any more accessible to a spider.

    Restricting spiders is very simple though; this is the structure you will need to follow:

    User-agent: Googlebot    # this can also be * to cover all spiders
    Allow: /
    Disallow: /some-dir/
    Disallow: /some-other-dir/

    A good tutorial on using robots.txt is here:

    http://www.searchengineworld.com/robots/robots_tutorial.htm
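
    For a typical WordPress install, that might look something like this (a sketch; the directory names assume a default install at the site root, so adjust them to your own layout):

    User-agent: *
    Disallow: /wp-admin/
    Disallow: /wp-includes/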

  6. jdcfsu
    Member
    Posted 8 years ago #

    Right, I know how to disallow directories, though I am unclear on what I should disallow. My site directory is laid out as follows:
    /index.php
    /wordpress/

    My question is: if I disallow the wordpress directory, will the spider still see the content pages, since they are generated via PHP when the main page is visited?

  7. Chris_K
    Member
    Posted 8 years ago #

    Short answer to your question: No.

    The spider doesn't give a rat's dump about how the pages are generated -- it just follows the links. If you exclude the directory, links into it won't be followed.
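
    For example, with the layout you described, these two lines (a sketch using your /wordpress/ path) would keep compliant spiders out of the whole blog, no matter how the pages are generated:

    User-agent: *
    Disallow: /wordpress/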

  8. jdcfsu
    Member
    Posted 8 years ago #

    Ok, then that brings me to part two of my question: what should be allowed so that the spiders can see it? Is it the cache folder inside the wp-content directory? I'm trying to figure out where this stuff is kept.

  9. Chris_K
    Member
    Posted 8 years ago #

    Well, you don't care where it is "kept". What you care about is where the URLs point. The spider is essentially like any user in front of a web browser -- it goes where the links point.

    If you want your blog's content to be spidered, you can just leave it out of the robots.txt file and, most likely, it'll get crawled.

    Your admin stuff won't get crawled unless the spider magically has your username and password.

  10. niziol
    Member
    Posted 8 years ago #

    Ok, say your blog and WordPress installation are in the same directory: /blog

    So, the robot needs access to /blog, but it does not need access to the wp-admin area (/blog/wp-admin/), and so on and so forth.

    The only thing that matters is the actual URIs used to access the content. So just disallow whichever specific directories you wish to exclude.

    As was previously stated, a spider can try to index it all, but it won't get into protected areas unless it knows your username and password.
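
    Putting that together, a minimal robots.txt for this layout might look like the following (a sketch; the /blog path matches the example above, and you could add other directories, such as parts of wp-content, as needed):

    User-agent: *
    Disallow: /blog/wp-admin/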

Topic Closed

This topic has been closed to new replies.
