WordPress.org

BulletProof Security
[resolved] Linkchecker and other legit bots are broken (58 posts)

  1. wordpressmike
    Member
    Posted 1 year ago #

    Bug 1: Broken Link Checker is one of the most commonly used plugins, but some of its requests, in particular the ones that check images on my site, are blocked with a 403 error (see below). I did not turn on hotlink protection for images.

    I'm using BPS 47.8 and WP 3.5.1

    Bug 2: I'm also having the same problem this guy is having with the facebook block, that is still unsolved:
    https://wordpress.org/support/topic/403-errors-2?replies=21

    Bug 3: This concerns the new line
    DirectoryIndex index.php index.html /index.php
    It took a while to track down, but it looks like the trailing "/index.php" was really messing up my installation, where I have enabled Apache directory listing on certain directories. Specifically, it was causing a 403 error, and commenting out that line fixed the problem. Now anytime this plugin is updated I will have to comment out that line again.

    Thank you very much for your efforts!

    (Log anonymized. Note that the link checker impersonates IE.)
    >>>>>>>>>>> 403 Error Logged - February 1, 2013 - 4:24 pm <<<<<<<<<<<
    REMOTE_ADDR: 123.123.123.123
    Host Name: 123.123.123.123
    HTTP_CLIENT_IP:
    HTTP_FORWARDED:
    HTTP_X_FORWARDED_FOR:
    HTTP_X_CLUSTER_CLIENT_IP:
    REQUEST_METHOD: GET
    HTTP_REFERER: 123.123.123.123
    REQUEST_URI: /wp-content/uploads/2012/09/my-image.png
    QUERY_STRING:
    HTTP_USER_AGENT: Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0)

  2. AITpro
    Member
    Plugin Author

    Posted 1 year ago #

    1. Hmm, I am not sure this should be categorized as a bug, since BPS is actively blocking something that violates the security rules/filters in the root .htaccess file. I will install and test the Broken Link Checker plugin to see what is being blocked and why. Pending testing.

    2. I don't believe that this is a bug either. I am not exactly sure what is causing these errors, so I would appreciate any information that you can provide to help me narrow down what is actually going on, and whether this is really the facebook bot or just some new spam bot disguised as a legitimate bot. Logically, either some plugin that legitimately connects with facebook is in the equation, or something has changed about the way facebook now retrieves image files, for example in a way that violates the security rules/filters in the root .htaccess file. So please post any plugins you have installed that have anything at all to do with facebook, or any other logically relevant cause that you think could be in the equation.

    3. Some server configurations do not allow certain directives to be used in .htaccess files. One of the more commonly disallowed directives is the Options directive, but I have also seen some hosts disallow the DirectoryIndex directive.

    The majority of hosts allow both of these directives in their httpd.conf file, which in turn means they are allowed in .htaccess files. I think the ratio is around 99% of hosts that allow these directives to 1% that do not. I will look into whether it is possible to detect if these directives are allowed on a particular host and then write or skip them based on the result. I don't really think this is possible, but I will check it out anyway. ;)
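
    For reference, whether a directive is accepted in an .htaccess file is governed by the AllowOverride setting in the server configuration. A minimal sketch, assuming a hypothetical document root of /var/www/html (the path and the exact override list are illustrative and not taken from any particular host):

    ```apache
    # httpd.conf (server config, not .htaccess). The path is a hypothetical example.
    <Directory "/var/www/html">
        # Options  -> permits the Options directive in .htaccess
        # Indexes  -> permits DirectoryIndex and other directory-indexing directives
        # FileInfo -> permits mod_rewrite directives (RewriteCond, RewriteRule) and ErrorDocument
        AllowOverride Options Indexes FileInfo
    </Directory>
    ```

    On a host where Options or Indexes is left out of that list, using the matching directive in an .htaccess file typically results in a 500 error rather than the directive being silently ignored.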

  3. AITpro
    Member
    Plugin Author

    Posted 1 year ago #

    Now anytime this plugin is updated I will have to comment out that line again.

    Actually you would not have to comment out that line again. BPS updates are now automated: you no longer need to click the AutoMagic buttons and activate BulletProof Modes when installing a BPS upgrade. BPS will not change any .htaccess code modifications that you have made. On upgrade it only adds new .htaccess code, removes obsolete code, or does other .htaccess housekeeping automatically.

    So only if you used the AutoMagic buttons again would you need to comment out that line again.

  4. AITpro
    Member
    Plugin Author

    Posted 1 year ago #

    Oh wow! I am seeing the facebook UA in my logs now too, so this is definitely something new that facebook is doing. I am not using any facebook-related plugins, so that is ruled out. That isolates it to either something new that facebook is doing to retrieve image files, or some new form of spam/recon/sniffer bot. I will figure this out and post the solution here.

    >>>>>>>>>>> 403 Error Logged - February 6, 2013 - 12:09 pm <<<<<<<<<<<
    REMOTE_ADDR: 69.171.247.112
    Host Name: 69.171.247.112
    HTTP_CLIENT_IP:
    HTTP_FORWARDED:
    HTTP_X_FORWARDED_FOR:
    HTTP_X_CLUSTER_CLIENT_IP:
    REQUEST_METHOD: GET
    HTTP_REFERER:
    REQUEST_URI: /wp-content/uploads/2012/11/Wordfence-P3-Profiler-Scan-1-300x237.png
    QUERY_STRING:
    HTTP_USER_AGENT: facebookexternalhit/1.1 (+http://www.facebook.com/externalhit_uatext.php)
  5. AITpro
    Member
    Plugin Author

    Posted 1 year ago #

  6. AITpro
    Member
    Plugin Author

    Posted 1 year ago #

    What is important to note and keep in mind is that the image files themselves are not being blocked. What is being blocked is the check that verifies your image files still exist at that URL. I am still trying to track down the script itself. It is probably not publicly available though....

  7. AITpro
    Member
    Plugin Author

    Posted 1 year ago #

    aha now I am getting somewhere

    facebook Developers Debugger tool to check Open Graph, etc.

    http://developers.facebook.com/tools/debug

    facebook Crawler / Scraper

    https://developers.facebook.com/docs/ApplicationSecurity/#facebook_scraper

  8. wordpressmike
    Member
    Posted 1 year ago #

    Thanks for your fast and thorough response. I've become very familiar with the .htaccess rules, and I've spent a while attempting to figure out why it's blocking the link checker and the facebook bot. I assume the facebook bot is downloading the thumbnail for the page. This is what we want, because people are more likely to click on a link if it has a thumbnail.

    I thought perhaps the link checker was making a HEAD request but it says GET in the log.

    Also, with point 3 let me clarify. I have http://www.mydomain.com/ with wordpress installed at root. I then have http://www.mydomain.com/dir/ which I have placed a .htaccess file in containing a single line: "Options +Indexes" in order to display the index in that one directory but no others. The line that I commented out interfered with my customization.
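
    One possible way to keep the listing without editing the root file, sketched under the assumption that a DirectoryIndex set in a subdirectory's .htaccess takes precedence over the one inherited from the root, and that the host allows both directives in .htaccess files:

    ```apache
    # /dir/.htaccess (hypothetical per-directory override)
    # Show an auto-generated listing for this directory.
    Options +Indexes
    # Replace the inherited index list so the root-level /index.php fallback
    # is not tried here; with no index.html present, the listing is shown.
    DirectoryIndex index.html
    ```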

    I also put the following custom code in the custom code tab so I can exclude my directory from being processed by the WordPress script. This had worked with previous versions of BulletProof Security and only recently stopped working. It took a lot of effort to find out what the problem was. I've worked around it, but I thought you might want to know about the issue in case it can help another user.

    # EXCEPTIONS FOR VARIOUS MYCOMPANY DIRECTORIES
    RewriteCond %{REQUEST_URI} ^/dir [NC]
    RewriteRule . - [L]

    RewriteCond %{REQUEST_URI} ^/dir2 [NC]
    RewriteRule . - [L]

    (Please note that the above customizations are on one WordPress installation. Problems 1 and 2 were replicated on a vanilla install on a different server.)
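
    A side note on the pattern: ^/dir is a prefix match, so the first condition already covers /dir2 and would also match a path such as /directory. If that is not intended, a tighter variant could look like this (a sketch only, not tested against BPS):

    ```apache
    # Match /dir and /dir2 exactly, or anything beneath them, but not e.g. /directory.
    RewriteCond %{REQUEST_URI} ^/(dir|dir2)(/|$) [NC]
    RewriteRule . - [L]
    ```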

    I will try to get you a rewrite log later.

  9. AITpro
    Member
    Plugin Author

    Posted 1 year ago #

    I have not determined yet what the script is doing. From everything I have read so far, all that script does is verify that the image file still exists, and it does not do anything else. Once I figure out exactly what it is doing and how it is doing it, I will create a solution.

    A HEAD Request will be logged as a GET Request.

    Yes, that would make sense because the 2 directives conflict with each other.

    Yep thanks for posting that custom code as it may help someone else out with that exact same scenario. ;)

  10. AITpro
    Member
    Plugin Author

    Posted 1 year ago #

    What is interesting is this:

    Using the facebook Developers Debugger tool, the thumbnail image is retrieved successfully and the image file itself is retrieved successfully, but you also see a 206 error. I keep running into hints that "cache/caching" is somehow involved in this equation.

    http://100pulse.com/http-statuscode/206.jsp

    Scrape Information

    Response Code: 206
    Fetched URL: http://forum.ait-pro.com/wp-content/uploads/2012/11/Wordfence-P3-Profiler-Scan-1-300x237.png
    Canonical URL: http://forum.ait-pro.com/wp-content/uploads/2012/11/Wordfence-P3-Profiler-Scan-1-300x237.png

    Errors That Must Be Fixed

    Can't Download: Could not retrieve data from URL.

    URLs

    Graph API: http://graph.facebook.com/210058239138438
    Scraped URL: See exactly what our scraper sees for your URL

    Type of Share

    When this URL is shared on facebook, it is treated as a certain type. By putting meta tags on this page, you can influence how it is shared.

    Photo

  11. wordpressmike
    Member
    Posted 1 year ago #

    A HEAD Request will be logged as a GET Request.

    I will remove HEAD from the .htaccess file and let you know if this fixes the link checker. I can't test the facebook issue until later tonight because current development is on an internal server that can't be accessed by facebook.
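
    For anyone following along, the change amounts to dropping HEAD from the request-method filter. A rough sketch, assuming the filter in the root .htaccess looks something like the commented-out lines below (the exact pattern is an assumption and may differ between BPS versions):

    ```apache
    # Assumed original filter: any listed method is answered with 403 Forbidden.
    # RewriteCond %{REQUEST_METHOD} ^(HEAD|TRACE|DELETE|TRACK|DEBUG) [NC]
    # RewriteRule ^(.*)$ - [F,L]

    # Edited filter: HEAD is no longer listed, so HEAD requests pass through.
    RewriteCond %{REQUEST_METHOD} ^(TRACE|DELETE|TRACK|DEBUG) [NC]
    RewriteRule ^(.*)$ - [F,L]
    ```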

  12. AITpro
    Member
    Plugin Author

    Posted 1 year ago #

    Which caching plugin do you use?

  13. wordpressmike
    Member
    Posted 1 year ago #

    No caching plugins.

  14. AITpro
    Member
    Plugin Author

    Posted 1 year ago #

    hmm interesting because i recently just deleted my caching plugin and am now only doing caching purely with htaccess code. I need to check several sites and compare the differences. getting warmer.

  15. wordpressmike
    Member
    Posted 1 year ago #

    It looks like the Broken Link Checker plugin is indeed using HEAD requests. I haven't had a 403 error since removing HEAD checking.

    If the log incorrectly characterizes a HEAD as a GET, then that's a problem. It really was a head scratcher.

    This is from broken-link-checker/modules/checkers/http.php:

    if ( $nobody ){
        //If possible, use HEAD requests for speed.
        curl_setopt($ch, CURLOPT_NOBODY, true);
    } else {
        //If we must use GET at least limit the amount of downloaded data.
        curl_setopt($ch, CURLOPT_HTTPHEADER, array('Range: bytes=0-2048')); //2 KB
    }

  16. wordpressmike
    Member
    Posted 1 year ago #

    Side note re caching plugins- I'd love to use one but I had problems with certain dynamic content.

  17. AITpro
    Member
    Plugin Author

    Posted 1 year ago #

    Nice on the Broken Link Checker fix!

    If this problem is related to caching at all, then it is definitely not related to any WordPress caching plugin; it is something else that I have not figured out yet. I think that it is somehow related to the Header rather than to cache.

    Now I am even more confused after doing several tests with the facebook Developers Debugger. I get a 404, not a 403, error and yet facebook is successfully seeing and retrieving the image??? Total contradiction: either the URL IS found or it is NOT found, it can't be both. Something else is happening here that I cannot see because I do not have access to the facebook externalhit_uatext.php file/script.

    The image file is found, so that part = 200 OK.
    Since facebook is retrieving the image files but saying it is not able to retrieve them, I am stumped and can only guess that the Header that is returned is not being interpreted correctly.
    This 404 error really confuses me. I guess the 404 comes from whatever else the externalhit_uatext.php file/script is trying to retrieve, not from the image file itself???

    >>>>>>>>>>> 404 Error Logged [02/07/2013 10:28 PM] <<<<<<<<<<<
    REMOTE_ADDR: 173.252.110.117
    Host Name: 173.252.110.117
    HTTP_CLIENT_IP:
    HTTP_FORWARDED:
    HTTP_X_FORWARDED_FOR:
    HTTP_X_CLUSTER_CLIENT_IP:
    REQUEST_METHOD: GET
    HTTP_REFERER:
    REQUEST_URI: /aitpro-blog/wp-content/themes/aitpro/images/bps-45-website-protection.png
    QUERY_STRING:
    HTTP_USER_AGENT: facebookexternalhit/1.1 (+http://www.facebook.com/externalhit_uatext.php)
  18. AITpro
    Member
    Plugin Author

    Posted 1 year ago #

    Ok, I am at a dead end here, and the facebook Developers Debug tool is very limited: I can only make 1 request and cannot run more tests for trial and error. I will have to look at this some other day. In any case, what is most important is that image files are being retrieved, so whatever else this error is, it is really not important and does not negatively impact anything besides just being a damn nuisance. ;)

  19. wordpressmike
    Member
    Posted 1 year ago #

    I'd be curious to know if the facebook problem stops if HEAD is removed. I will hopefully be testing this later tonight.

  20. AITpro
    Member
    Plugin Author

    Posted 1 year ago #

    hmm I think that was tried already, but yeah maybe the most obvious thing is the issue. ;)

  21. wordpressmike
    Member
    Posted 1 year ago #

    I wonder if facebook falls back to GET if HEAD fails. In that case we'd record a block, but a second request would also get through.

  22. AITpro
    Member
    Plugin Author

    Posted 1 year ago #

    I removed HEAD and I did not get an error logged, but I do not think the facebook Debugger tool is actually sending Requests to my site anymore. I assume they have some sort of abuse protection set up so that someone cannot just sit there and click Debug all day long?

  23. AITpro
    Member
    Plugin Author

    Posted 1 year ago #

    Going by just the HTTP 206 Error on the facebook end of it, this section of the HTTP/1.1 specification (RFC 2616) covers the areas to look at:

    10.2.7 206 Partial Content

    The server has fulfilled the partial GET request for the resource. The request MUST have included a Range header field (section 14.35) indicating the desired range, and MAY have included an If-Range header field (section 14.27) to make the request conditional.

    The response MUST include the following header fields:

    - Either a Content-Range header field (section 14.16) indicating
      the range included with this response, or a multipart/byteranges
      Content-Type including Content-Range fields for each part. If a
      Content-Length header field is present in the response, its
      value MUST match the actual number of OCTETs transmitted in the
      message-body.
    - Date
    - ETag and/or Content-Location, if the header would have been sent
      in a 200 response to the same request
    - Expires, Cache-Control, and/or Vary, if the field-value might
      differ from that sent in any previous response for the same
      variant

    If the 206 response is the result of an If-Range request that used a strong cache validator (see section 13.3.3), the response SHOULD NOT include other entity-headers. If the response is the result of an If-Range request that used a weak validator, the response MUST NOT include other entity-headers; this prevents inconsistencies between cached entity-bodies and updated headers. Otherwise, the response MUST include all of the entity-headers that would have been returned with a 200 (OK) response to the same request.

    A cache MUST NOT combine a 206 response with other previously cached content if the ETag or Last-Modified headers do not match exactly, see 13.5.4.

    A cache that does not support the Range and Content-Range headers MUST NOT cache 206 (Partial) responses.

  24. wordpressmike
    Member
    Posted 1 year ago #

    Ok, I've run my other site for about 12 hours with BulletProof Security turned on and only HEAD removed.

    I'm still encountering the Facebook problem, and it also looks like WordPress has been blocking itself. See the following, which has been anonymized.

    Any progress?

    >>>>>>>>>>> 403 Error Logged - February 7, 2013 - 11:05 pm <<<<<<<<<<<
    REMOTE_ADDR: 123.123.123.123
    Host Name: myserver.myhost.com
    HTTP_CLIENT_IP:
    HTTP_FORWARDED:
    HTTP_X_FORWARDED_FOR:
    HTTP_X_CLUSTER_CLIENT_IP:
    REQUEST_METHOD: GET
    HTTP_REFERER:
    REQUEST_URI: /wp-admin/post-new.php
    QUERY_STRING:
    HTTP_USER_AGENT: WordPress/3.5.1; http://www.mydomain.com

    Facebook problem:

    >>>>>>>>>>> 403 Error Logged - February 8, 2013 - 12:05 pm <<<<<<<<<<<
    REMOTE_ADDR: 173.252.110.112
    Host Name: 173.252.110.112
    HTTP_CLIENT_IP:
    HTTP_FORWARDED:
    HTTP_X_FORWARDED_FOR:
    HTTP_X_CLUSTER_CLIENT_IP:
    REQUEST_METHOD: GET
    HTTP_REFERER:
    REQUEST_URI: /wp-content/uploads/2011/08/icon.gif
    QUERY_STRING:
    HTTP_USER_AGENT: facebookexternalhit/1.1 (+http://www.facebook.com/externalhit_uatext.php)

  25. AITpro
    Member
    Plugin Author

    Posted 1 year ago #

    For the first error, whitelist the post-new.php file in your wp-admin .htaccess file. I am not sure what is causing that error, but it should be completely safe to whitelist that file. You would add this to BPS Custom Code:

    Add the .htaccess bypass / skip code below to the wp-admin Custom Code box, CUSTOM CODE WPADMIN PLUGIN FIXES, and then activate BulletProof Mode for your wp-admin folder again. The skip rule must be [S=2] because it will be written to your wp-admin .htaccess file above skip / bypass rule [S=1]. This bypass / skip rule is safe to use because the wp-admin area is protected with WP Authentication security.

    # post-new.php bypass / skip rule
    RewriteCond %{REQUEST_URI} (post-new\.php) [NC]
    RewriteRule . - [S=2]
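
    To illustrate why the number has to be 2: the S flag tells mod_rewrite how many of the following RewriteRule lines to skip when the rule matches. A sketch of the resulting order, where the second and third rules are hypothetical stand-ins for what is already in the wp-admin .htaccess file:

    ```apache
    # New bypass rule: if the request is for post-new.php, skip the next 2 rules.
    RewriteCond %{REQUEST_URI} (post-new\.php) [NC]
    RewriteRule . - [S=2]

    # Hypothetical existing bypass rule: skips the next 1 rule when it matches.
    RewriteCond %{REQUEST_URI} (example-plugin-file\.php) [NC]
    RewriteRule . - [S=1]

    # Hypothetical security filter that both bypass rules jump over.
    RewriteCond %{QUERY_STRING} (example-bad-pattern) [NC]
    RewriteRule ^(.*)$ - [F]
    ```

    With [S=1] on the new rule instead, a request for post-new.php would skip only the existing bypass rule and still reach the filter.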

    I ran into a dead end since I cannot view the facebook script (not publicly available) and the facebook Developers Debugger tool does not allow me to do multiple tests. I only get 1 test per session, or whatever other limit facebook has restricted the debugger tool to.

    The issue is some kind of Header problem with the externalhit_uatext.php script. I am guessing, since I cannot view the script. The image files are successfully being retrieved, but the script also seems to be trying to retrieve Header information, and that part is not succeeding. That would be consistent with the status reported on the facebook side, 206 Partial Content: only part of the resource was returned.

    Why the facebook script cannot retrieve the Header I have no idea. This may or may not have anything to do with BPS. When you google this issue you will find plenty of folks discussing it.

    Where I am at is this: I have no idea if this is related to BPS or not. There is no negative impact since images are retrieved successfully. There is only a nuisance factor since these errors are being logged. Since this is only a nuisance issue it has very low priority, but further testing is scheduled. The problem I have is that I am shooting blind since I cannot view the facebook script, which is not publicly available.

  26. wordpressmike
    Member
    Posted 1 year ago #

    Are you familiar with rewrite logging? I can try to get you a rewrite log this weekend.

  27. AITpro
    Member
    Plugin Author

    Posted 1 year ago #

    pending further scheduled testing.

  28. AITpro
    Member
    Plugin Author

    Posted 1 year ago #

    I have done that already. The problem is I cannot see what the facebook script is doing. None of the logs that I am checking (Server Logs, BPS logs, Rewrite Logs, etc.) tell me anything about what the facebook script is trying to do. Shooting blind.

  29. AITpro
    Member
    Plugin Author

    Posted 1 year ago #

    This facebook 206 Error is not a WordPress issue or a BPS issue. I have Googled this and I see this error occurring on non-WordPress sites.

    This is a standard HTML site's log, so obviously there is something that is not quite right with the externalhit_uatext.php script itself, but yeah, I would like to get rid of the nuisance factor of BPS logging this script's issues/problems. ;)
    http://happyhourtvmd.com/logs/access_121210.log

  30. AITpro
    Member
    Plugin Author

    Posted 1 year ago #

    Hmm, that just gave me an idea. The 206 error is being logged as a 403 error because an ErrorDocument 206 directive is not in the root .htaccess file. So logically something like this might work to get rid of the nuisance: add this ErrorDocument directive to your root .htaccess file, create a blank 206.php file, upload it to your site somewhere, and adjust the path in the directive to point at the 206.php file.

    ErrorDocument 206 /206.php

Topic Closed

This topic has been closed to new replies.
