Linkchecker and other legit bots are broken

Resolved wordpressmike
(@wordpressmike)

11 years, 2 months ago

Bug 1: Broken Link Checker is one of the most commonly used plugins, but certain requests, in particular checks of images on my site, are blocked with a 403 error (see the log below). I did not turn on hotlink protection for images.

I’m using BPS 47.8 and WP 3.5.1

Bug 2: I’m also having the same problem as the user in the thread below, where the Facebook bot is blocked. That thread is still unsolved:
https://wordpress.org/support/topic/403-errors-2?replies=21

Bug 3: This concerns the new line DirectoryIndex index.php index.html /index.php
It took a while to track down, but it looks like “/index.php” was breaking my installation, where I have enabled Apache directory listing on certain directories. Specifically, it was causing a 403 error, and commenting out that line fixed the problem. Now anytime this plugin is updated I will have to comment out that line again.

Thank you very much for your efforts!

(Log anonymized. Note that the link checker impersonates IE.)
>>>>>>>>>>> 403 Error Logged - February 1, 2013 - 4:24 pm <<<<<<<<<<<
REMOTE_ADDR: 123.123.123.123
Host Name: 123.123.123.123
HTTP_CLIENT_IP:
HTTP_FORWARDED:
HTTP_X_FORWARDED_FOR:
HTTP_X_CLUSTER_CLIENT_IP:
REQUEST_METHOD: GET
HTTP_REFERER: 123.123.123.123
REQUEST_URI: /wp-content/uploads/2012/09/my-image.png
QUERY_STRING:
HTTP_USER_AGENT: Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0)

Viewing 15 replies - 1 through 15 (of 57 total)


Plugin Author AITpro
(@aitpro)

11 years, 2 months ago

1. Hmm, I am not sure this should be categorized as a bug, since BPS is actively blocking something that violates the security rules/filters in the root .htaccess file. I will install and test the Broken Link Checker plugin to see what is being blocked and why. Pending testing.

2. I don’t believe that this is a bug either. I am not exactly sure what is causing these errors, so I would appreciate any information you can provide to help me narrow down what is actually going on, in particular whether this is really the Facebook bot or just some new spam bot disguised as a legitimate bot. Logically, either some plugin that legitimately connects with Facebook is involved, or something has changed about the way Facebook now retrieves image files. For example, the way the image files are being retrieved may violate the security rules/filters in the root .htaccess file. So please post any plugins you have installed that have anything at all to do with Facebook, or any other relevant cause that you think could be involved.

3. Some server configurations do not allow certain directives to be used in .htaccess files. One of the more common directives that is disallowed on some hosts is the Options directive, but I have also seen some hosts disallow the DirectoryIndex directive.

The majority of hosts permit both of these directives in their httpd.conf configuration, which in turn means they are allowed in .htaccess files. I think the ratio is around 99% of hosts that allow these directives to 1% that do not. I will look into whether it is possible to detect if these directives are allowed on a particular host and then write them or not based on the result. I don’t really think this is possible, but I will check it out anyway. 😉

Plugin Author AITpro
(@aitpro)

11 years, 2 months ago

Now anytime this plugin is updated I will have to comment out that line again.

Actually, you would not have to comment out that line again. BPS updates are now automated: you no longer need to click the AutoMagic buttons and activate BulletProof Modes when installing a BPS upgrade. BPS will not change any .htaccess code modifications that you have made. On upgrade, BPS will only update the .htaccess files automatically by adding new .htaccess code, removing obsolete code, or doing other .htaccess housekeeping.

So if you used the AutoMagic buttons again, then yes, you would need to comment out that line again.

Plugin Author AITpro
(@aitpro)

11 years, 2 months ago

Oh wow! I am seeing the Facebook UA in my logs now too, so this is definitely something new that Facebook is doing. I am not using any Facebook-related plugins, so that is ruled out. That isolates this to either something new that Facebook is doing to retrieve image files or some new form of spam/recon/sniffer bot. I will figure this out and post the solution here.
```
>>>>>>>>>>> 403 Error Logged - February 6, 2013 - 12:09 pm <<<<<<<<<<<
REMOTE_ADDR: 69.171.247.112
Host Name: 69.171.247.112
HTTP_CLIENT_IP:
HTTP_FORWARDED:
HTTP_X_FORWARDED_FOR:
HTTP_X_CLUSTER_CLIENT_IP:
REQUEST_METHOD: GET
HTTP_REFERER:
REQUEST_URI: /wp-content/uploads/2012/11/Wordfence-P3-Profiler-Scan-1-300x237.png
QUERY_STRING:
HTTP_USER_AGENT: facebookexternalhit/1.1 (+http://www.facebook.com/externalhit_uatext.php)
```
Plugin Author AITpro
(@aitpro)

11 years, 2 months ago

The links below describe what this new Facebook UA/script is doing, so I will try to get the script itself if possible.

http://stackoverflow.com/questions/9773954/why-facebook-is-flooding-my-site

http://www.facebook.com/externalhit_uatext.php

http://www.wundercounter.com/counter/ip-tracker/144/

Plugin Author AITpro
(@aitpro)

11 years, 2 months ago

What is important to keep in mind is that the image files themselves are not being blocked. What is being blocked is the check that verifies whether your image files still exist at that URL. I am still trying to track down the script itself. It is probably not publicly available, though…

Plugin Author AITpro
(@aitpro)

11 years, 2 months ago

Aha, now I am getting somewhere.

facebook Developers Debugger tool to check Open Graph, etc.

http://developers.facebook.com/tools/debug

facebook Crawler / Scraper

https://developers.facebook.com/docs/ApplicationSecurity/#facebook_scraper

Thread Starter wordpressmike
(@wordpressmike)

11 years, 2 months ago

Thanks for your fast and thorough response. I’ve become very familiar with the .htaccess rules, and I’ve spent a while attempting to figure out why they are blocking the link checker and the Facebook bot. I assume the Facebook bot is downloading the thumbnail for the page. This is what we want, because people are more likely to click on a link if it has a thumbnail.

I thought perhaps the link checker was making a HEAD request but it says GET in the log.

Also, let me clarify point 3. I have http://www.mydomain.com/ with WordPress installed at the root. I then have http://www.mydomain.com/dir/, in which I have placed a .htaccess file containing a single line, “Options +Indexes”, in order to display the index in that one directory but no others. The line that I commented out interfered with my customization.

I also put the following custom code in the Custom Code tab so I can exempt my directories from being processed by the WordPress script. This had worked with previous versions of BulletProof Security and only recently stopped working. It took a lot of effort to find out what the problem was. I’ve worked around it, but I thought you might want to know in case it can help another user.

# EXCEPTIONS FOR VARIOUS MYCOMPANY DIRECTORIES
RewriteCond %{REQUEST_URI} ^/dir [NC]
RewriteRule . - [L]

RewriteCond %{REQUEST_URI} ^/dir2 [NC]
RewriteRule . - [L]

(Please note that the above customizations are on one WordPress installation. Problems 1 and 2 were replicated on a vanilla install on a different server.)

I will try to get you a rewrite log later.
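
A note on the exception rules above: the patterns are anchored only at the start, so ^/dir also matches /dir2 and anything else that begins with /dir. The following is a minimal Python sketch of that matching behaviour, assuming Python's re module behaves like Apache's regex engine for patterns this simple. The helper cond_matches is hypothetical, and the stricter pattern ^/dir(/|$) is a suggestion for illustration, not something taken from BPS.

```python
import re

def cond_matches(pattern: str, request_uri: str) -> bool:
    """Hypothetical stand-in for a RewriteCond test with the [NC] flag:
    a case-insensitive regex search against REQUEST_URI."""
    return re.search(pattern, request_uri, re.IGNORECASE) is not None

# Example URIs are made up for illustration.
uris = ["/dir/", "/dir2/file.txt", "/DIR/listing", "/directory/page", "/blog/dir/"]

for uri in uris:
    loose = cond_matches(r"^/dir", uri)          # pattern from the post
    strict = cond_matches(r"^/dir(/|$)", uri)    # assumed stricter variant
    print(f"{uri:18} loose={loose} strict={strict}")
```

If only /dir and /dir2 are meant to be exempted, a pattern such as ^/(dir|dir2)(/|$) would probably be closer to the intent, though it has not been tested against this setup.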

Plugin Author AITpro
(@aitpro)

11 years, 2 months ago

I have not yet determined what the script is doing. From everything I have read so far, all that script does is verify that the image file still exists; it does not do anything else. Once I figure out exactly what it is doing and how, I will create a solution.

A HEAD Request will be logged as a GET Request.

Yes, that would make sense because the 2 directives conflict with each other.

Yep thanks for posting that custom code as it may help someone else out with that exact same scenario. 😉

Plugin Author AITpro
(@aitpro)

11 years, 2 months ago

What is interesting is this:

Using the Facebook Developers Debugger tool, the thumbnail image is retrieved successfully and the image file itself is retrieved successfully, but you also see a 206 response code. I keep running into signs that “cache/caching” is somehow involved in this equation.

http://100pulse.com/http-statuscode/206.jsp

Scrape Information

Response Code: 206
Fetched URL: http://forum.ait-pro.com/wp-content/uploads/2012/11/Wordfence-P3-Profiler-Scan-1-300x237.png
Canonical URL: http://forum.ait-pro.com/wp-content/uploads/2012/11/Wordfence-P3-Profiler-Scan-1-300x237.png
Errors That Must Be Fixed

Can’t Download: Could not retrieve data from URL.
URLs

Graph API: http://graph.facebook.com/210058239138438
Scraped URL: See exactly what our scraper sees for your URL
Type of Share

When this URL is shared on facebook, it is treated as a certain type. By putting meta tags on this page, you can influence how it is shared.

Photo
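
For reference on the response code in the debugger output above: 206 is Partial Content, a success status that a server returns when the request carries a Range header. It is not an error status like 403. The sketch below uses Python's standard http.HTTPStatus table to show the difference; describe is a hypothetical helper, and whether the 206 is related to the debugger's “Can’t Download” message is not established in this thread.

```python
from http import HTTPStatus

def describe(code: int) -> str:
    """Hypothetical helper: label a status code using the standard table."""
    status = HTTPStatus(code)
    if 200 <= code < 300:
        kind = "success"
    elif code >= 400:
        kind = "error"
    else:
        kind = "other"
    return f"{code} {status.phrase} ({kind})"

print(describe(206))  # the code shown by the debugger
print(describe(403))  # the code logged by BPS
```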

Thread Starter wordpressmike
(@wordpressmike)

11 years, 2 months ago

A HEAD Request will be logged as a GET Request.

I will remove HEAD from the .htaccess file and let you know if this fixes the link checker. I can’t test the Facebook issue until later tonight because current development is on an internal server that can’t be accessed by Facebook.

Plugin Author AITpro
(@aitpro)

11 years, 2 months ago

Which caching plugin do you use?

Thread Starter wordpressmike
(@wordpressmike)

11 years, 2 months ago

No caching plugins.

Plugin Author AITpro
(@aitpro)

11 years, 2 months ago

Hmm, interesting, because I recently deleted my caching plugin and am now doing caching purely with .htaccess code. I need to check several sites and compare the differences. Getting warmer.

Thread Starter wordpressmike
(@wordpressmike)

11 years, 2 months ago

It looks like the Broken Link Checker plugin is indeed using HEAD requests. I haven’t had a 403 error since removing the HEAD check from the .htaccess file.

If the log incorrectly characterizes a HEAD as a GET, then that’s a problem. It really was a head scratcher.

This is from broken-link-checker/modules/checkers/http.php:
```
if ( $nobody ){
    //If possible, use HEAD requests for speed.
    curl_setopt($ch, CURLOPT_NOBODY, true);
} else {
    //If we must use GET at least limit the amount of downloaded data.
    curl_setopt($ch, CURLOPT_HTTPHEADER, array('Range: bytes=0-2048')); //2 KB
}
```
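
To make the two request shapes in that snippet concrete, here is a minimal Python sketch using only the standard library. It builds the requests without sending them. The function name build_check_request and the example URL (the placeholder domain from this thread plus the path from the first log) are assumptions for illustration, not part of the plugin.

```python
from urllib.request import Request

def build_check_request(url: str, nobody: bool) -> Request:
    """Hypothetical equivalent of the branch above: HEAD when the body is
    not needed, otherwise a GET limited by a Range header."""
    if nobody:
        # Counterpart of CURLOPT_NOBODY: the request method becomes HEAD.
        return Request(url, method="HEAD")
    # Counterpart of the Range header; bytes 0-2048 inclusive.
    return Request(url, headers={"Range": "bytes=0-2048"}, method="GET")

# Placeholder URL, assumed for the example only.
url = "http://www.mydomain.com/wp-content/uploads/2012/09/my-image.png"

head_req = build_check_request(url, nobody=True)
get_req = build_check_request(url, nobody=False)

print(head_req.get_method(), head_req.get_header("Range"))
print(get_req.get_method(), get_req.get_header("Range"))
```

A request-method filter that rejects HEAD would block the first shape, while the second shape is an ordinary GET that a server may answer with a partial response.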
Thread Starter wordpressmike
(@wordpressmike)

11 years, 2 months ago

Side note re caching plugins: I’d love to use one, but I had problems with certain dynamic content.


The topic ‘Linkchecker and other legit bots are broken’ is closed to new replies.
