WordPress.org

[Resolved] Linkchecker and other legit bots are broken

  • Bug 1: Broken Link Checker is one of the most commonly used plugins, but certain requests, in particular checks of images on my site, are blocked with a 403 error (see below). I did not turn on hotlink protection for images.

    I’m using BPS 47.8 and WP 3.5.1

    Bug 2: I’m also having the same problem this guy is having with the facebook block, that is still unsolved:
    https://wordpress.org/support/topic/403-errors-2?replies=21

    Bug 3: This concerns the new line
    DirectoryIndex index.php index.html /index.php
    It took a while to track down, but it looks like “/index.php” was really messing up my installation, where I have enabled Apache directory listing on certain directories. Specifically, it was causing a 403 error, and commenting out that line fixed the problem. Now anytime this plugin is updated I will have to comment out that line again.

    Thank you very much for your efforts!

    (Log anonymized. Note that the link checker impersonates IE.)
    >>>>>>>>>>> 403 Error Logged – February 1, 2013 – 4:24 pm <<<<<<<<<<<
    REMOTE_ADDR: 123.123.123.123
    Host Name: 123.123.123.123
    HTTP_CLIENT_IP:
    HTTP_FORWARDED:
    HTTP_X_FORWARDED_FOR:
    HTTP_X_CLUSTER_CLIENT_IP:
    REQUEST_METHOD: GET
    HTTP_REFERER: 123.123.123.123
    REQUEST_URI: /wp-content/uploads/2012/09/my-image.png
    QUERY_STRING:
    HTTP_USER_AGENT: Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0)

Viewing 15 replies - 1 through 15 (of 57 total)
  • Plugin Author AITpro

    @aitpro

    1. Hmm, I am not sure this should be categorized as a bug, since BPS is actively blocking something that violates the security rules/filters in the root .htaccess file. I will install and test the Broken Link Checker plugin to see what is being blocked and why. Pending testing.

    2. I don’t believe that this is a bug either. I am not exactly sure what is causing these errors, so I would appreciate any information you can provide to help me narrow down what is actually going on, including whether this is really the facebook bot or just some new spam bot disguised as a legitimate bot. Logically, either some plugin that legitimately connects with facebook is involved, or something has changed about the way facebook now retrieves image files, for example in a way that violates the security rules/filters in the root .htaccess file. So please post any plugins you have installed that have anything at all to do with facebook, or any other relevant cause that you think could be involved.

    3. Some server configurations do not allow certain directives to be used in .htaccess files. One of the more commonly disallowed directives is the Options directive, but I have also seen some hosts disallow the DirectoryIndex directive as well.

    The majority of hosts allow both of these directives in the httpd.conf file, which in turn means they are allowed in .htaccess files. I think the ratio is around 99% of hosts that allow these directives to 1% that do not. I will look into whether it is possible to detect if these directives are allowed on a particular host and then write or skip them based on the result. I don’t really think this is possible, but I will check it out anyway. 😉

    Plugin Author AITpro

    @aitpro

    Now anytime this plugin is updated I will have to comment out that line again.

    Actually you would not have to comment out that line again. BPS updates are now automated. You do not need to click the AutoMagic buttons and activate BulletProof Modes anymore when installing a BPS upgrade. BPS will not change any htaccess code modifications that you have made. BPS will only automatically update the .htaccess files and add new .htaccess code or remove obsolete code or do other htaccess code house cleaning automatically on upgrade.

    So if you used the AutoMagic buttons again then yes you would need to comment out that line again.

    Plugin Author AITpro

    @aitpro

    oh wow! I am seeing the facebook UA in my logs now too. So this is definitely something new that facebook is doing. Ok I am not using any facebook related plugins so that is out. So this is definitely isolated to something new that facebook is doing to retrieve image files or this is some new form of spam/recon/sniffer bot. I will figure this out and post the solution here.

    >>>>>>>>>>> 403 Error Logged - February 6, 2013 - 12:09 pm <<<<<<<<<<<
    REMOTE_ADDR: 69.171.247.112
    Host Name: 69.171.247.112
    HTTP_CLIENT_IP:
    HTTP_FORWARDED:
    HTTP_X_FORWARDED_FOR:
    HTTP_X_CLUSTER_CLIENT_IP:
    REQUEST_METHOD: GET
    HTTP_REFERER:
    REQUEST_URI: /wp-content/uploads/2012/11/Wordfence-P3-Profiler-Scan-1-300x237.png
    QUERY_STRING:
    HTTP_USER_AGENT: facebookexternalhit/1.1 (+http://www.facebook.com/externalhit_uatext.php)

    Plugin Author AITpro

    @aitpro

    What is important to note and keep in mind is that the image files themselves are not being blocked. What is being blocked is the check that verifies whether your image files still exist at that URL. Still trying to track down the script itself. It is probably not publicly available though….

    Plugin Author AITpro

    @aitpro

    aha now I am getting somewhere

    facebook Developers Debugger tool to check Open Graph, etc.

    http://developers.facebook.com/tools/debug

    facebook Crawler / Scraper

    https://developers.facebook.com/docs/ApplicationSecurity/#facebook_scraper

    Thanks for your fast and thorough response. I’ve become very familiar with the .htaccess rules and I’ve spent a while attempting to figure out why it’s blocking the link checker and the facebook bot. I assume the facebook bot is downloading the thumbnail for the page. This is what we want, because people are more likely to click on a link if it has a thumbnail.

    I thought perhaps the link checker was making a HEAD request but it says GET in the log.

    Also, with point 3 let me clarify. I have http://www.mydomain.com/ with wordpress installed at root. I then have http://www.mydomain.com/dir/ which I have placed a .htaccess file in containing a single line: “Options +Indexes” in order to display the index in that one directory but no others. The line that I commented out interfered with my customization.
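
    For anyone reproducing this setup, the per-directory file described above could look roughly like the sketch below. The DirectoryIndex line is an assumption and was not part of the original configuration: a DirectoryIndex set in a subdirectory's .htaccess replaces the inherited list, so the site-wide “/index.php” fallback would no longer be tried there and mod_autoindex can generate the listing. This only works if the host's AllowOverride setting permits Options and DirectoryIndex in .htaccess files.

```apache
# /dir/.htaccess  (path taken from the example in the post)

# Turn on directory listings for this directory only.
Options +Indexes

# Assumption, not from the original post: override the inherited
# DirectoryIndex so the root-level "/index.php" fallback is not tried here.
# If neither file below exists, mod_autoindex generates the listing.
DirectoryIndex index.php index.html
```

    The appeal of this approach is that the root .htaccess stays untouched, so nothing has to be commented out again if that file is regenerated.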

    I also put the following custom code in the custom code tab so I can except my directory from being processed by the wordpress script. This had worked with previous versions of bulletproof security but only recently stopped working. It took a lot of effort to find out what the problem was. I’ve worked around this issue but I thought you might want to know about the issue in case it can help another user.

    # EXCEPTIONS FOR VARIOUS MYCOMPANY DIRECTORIES
    RewriteCond %{REQUEST_URI} ^/dir [NC]
    RewriteRule . - [L]

    RewriteCond %{REQUEST_URI} ^/dir2 [NC]
    RewriteRule . - [L]

    (Please note that the above customizations are on one WordPress installation. Problems 1 and 2 were replicated on a vanilla install on a different server.)

    I will try to get you a rewrite log later.

    Plugin Author AITpro

    @aitpro

    I have not determined yet what the script is doing. From everything I have read so far, all that script does is verify that the image file still exists; it does not do anything else. Once I figure out exactly what it is doing and how, I will create a solution.

    A HEAD Request will be logged as a GET Request.

    Yes, that would make sense because the 2 directives conflict with each other.

    Yep thanks for posting that custom code as it may help someone else out with that exact same scenario. 😉

    Plugin Author AITpro

    @aitpro

    What is interesting is this:

    Using the facebook Developers Debugger tool, the thumbnail image is retrieved successfully and the image file itself is retrieved successfully, but you also see a 206 response code. I keep running into hints that “cache/caching” is somehow involved in this equation.

    http://100pulse.com/http-statuscode/206.jsp

    Scrape Information

    Response Code: 206
    Fetched URL: http://forum.ait-pro.com/wp-content/uploads/2012/11/Wordfence-P3-Profiler-Scan-1-300x237.png
    Canonical URL: http://forum.ait-pro.com/wp-content/uploads/2012/11/Wordfence-P3-Profiler-Scan-1-300x237.png

    Errors That Must Be Fixed

    Can’t Download: Could not retrieve data from URL.

    URLs

    Graph API: http://graph.facebook.com/210058239138438
    Scraped URL: See exactly what our scraper sees for your URL

    Type of Share

    When this URL is shared on facebook, it is treated as a certain type. By putting meta tags on this page, you can influence how it is shared.

    Photo

    A HEAD Request will be logged as a GET Request.

    I will remove HEAD from the htaccess file and let you know if this fixes the link checker. I can’t test the facebook issue until later tonight because current development is on an internal server that can’t be accessed by facebook.
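
    A minimal sketch of the change described above, assuming the root .htaccess contains a request-method filter of roughly this shape. The exact rule that BPS writes is not quoted in this thread and may differ between versions, so treat the pattern as illustrative and compare it against your own file before editing. It also assumes RewriteEngine On is already in effect earlier in the file.

```apache
# Assumed shape of the original filter: any listed method gets a 403.
# RewriteCond %{REQUEST_METHOD} ^(HEAD|TRACE|DELETE|TRACK|DEBUG) [NC]
# RewriteRule ^(.*)$ - [F,L]

# Same filter with HEAD removed, so HEAD requests (used by link checkers
# and some crawlers) are no longer rejected with a 403.
RewriteCond %{REQUEST_METHOD} ^(TRACE|DELETE|TRACK|DEBUG) [NC]
RewriteRule ^(.*)$ - [F,L]
```

    Only the method list changes; the other methods in the assumed filter stay blocked.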

    Plugin Author AITpro

    @aitpro

    Which caching plugin do you use?

    No caching plugins.

    Plugin Author AITpro

    @aitpro

    Hmm, interesting, because I recently deleted my caching plugin and am now doing caching purely with htaccess code. I need to check several sites and compare the differences. Getting warmer.

    It looks like the Broken Link Checker plugin is indeed using HEAD requests. I haven’t had a 403 error since removing the HEAD check.

    If the log incorrectly characterizes a HEAD as a GET, then that’s a problem. It really was a head scratcher.

    this is from broken-link-checker/modules/checkers/http.php

    if ( $nobody ){
        //If possible, use HEAD requests for speed.
        curl_setopt($ch, CURLOPT_NOBODY, true);
    } else {
        //If we must use GET at least limit the amount of downloaded data.
        curl_setopt($ch, CURLOPT_HTTPHEADER, array('Range: bytes=0-2048')); //2 KB
    }

    Side note re caching plugins: I’d love to use one, but I had problems with certain dynamic content.
