I am a crawl engineer for the Croatian Web Archive, and our bot has sent this User-Agent header since 2004: Mozilla/5.0 (compatible; SrceHarvester/3.3.1 +http://haw.nsk.hr/)
Our good/legitimate bot is filtered out (status 403) by BPS just because its User-Agent string contains the word “harvest” (defined in a BPS .htaccess RewriteCond). It is also possible that other national archives (http://www.netpreserve.org/) have similar problems, given the common use of the word “harvesting” in the context of web archiving.
We could suggest that site owners manually edit their .htaccess file and remove the word “harvest”, but the average WordPress user is not comfortable editing .htaccess.
Could you exclude “harvest” from the rewrite rules in future BPS releases?
If not, what should we suggest to site owners who want their websites archived?
Thanks.
Plugin Author
AITpro
(@aitpro)
First off, when looking at other security-based plugins and scripts, it is common practice to block the “harvest” user agent string. Since this is basically a standard, there must be a significant/relevant reason for blocking it, and for that reason we would not consider removing it from the standard BPS code.
BPS allows users to decide whether or not they want to allow harvesting/scraping of their website. BPS includes a Security Log file that records exactly what is being blocked, so users can review it and then decide whether or not to allow a particular request.
The BPS philosophy is this: we start with the maximum security possible and then make it easy for users to decrease that security on a case-by-case basis as needed. In this case, all the end user needs to do is modify this security filter and remove “harvest” from it.
RewriteCond %{HTTP_USER_AGENT} (;|<|>|'|"|\)|\(|%0A|%0D|%22|%27|%28|%3C|%3E|%00).*(libwww-perl|wget|python|nikto|curl|scan|java|winhttp|HTTrack|clshttp|archiver|loader|email|harvest|extract|grab|miner) [NC,OR]
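To illustrate why the SrceHarvester User-Agent trips this filter, here is a rough Python approximation of the pattern above. The `[NC]` flag maps to `re.IGNORECASE`; Apache's regex engine differs in details, so treat this only as a sketch of the matching logic, not as the exact BPS implementation:

```python
import re

# Rough Python approximation of the BPS RewriteCond pattern shown above.
# Apache [NC] = case-insensitive matching.
pattern = re.compile(
    r"""(;|<|>|'|"|\)|\(|%0A|%0D|%22|%27|%28|%3C|%3E|%00).*"""
    r"(libwww-perl|wget|python|nikto|curl|scan|java|winhttp|HTTrack|"
    r"clshttp|archiver|loader|email|harvest|extract|grab|miner)",
    re.IGNORECASE,
)

ua = "Mozilla/5.0 (compatible; SrceHarvester/3.3.1 +http://haw.nsk.hr/)"
# "(" and ";" satisfy the first group, and "Harvester" contains "harvest",
# so the filter matches and the request is answered with a 403.
print(bool(pattern.search(ua)))  # True
```

Deleting `harvest|` from the second alternation group, and leaving the rest of the filter intact, is the edit described above; after that change the archive bot's User-Agent no longer matches.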
Hi, I’m getting this warning message: “W3 Total Cache is activated, but W3TC .htaccess code was NOT found in your root .htaccess file.”
The same thing happened last time, but it disappeared after I unlocked the root .htaccess file to allow W3TC to write its code to it.
Now, even though the W3 Total Cache plugin is functioning, the warning message is still there!
Plugin Author
AITpro
(@aitpro)
What I recommend is that you copy the W3TC .htaccess cache code to this BPS Custom Code text box: CUSTOM CODE TOP PHP/PHP.INI HANDLER/CACHE CODE: Add php.ini handler and/or plugin cache code here
Click the Save Root Custom Code button.
Go to the BPS Security Modes page and click the Create secure.htaccess File AutoMagic button and then activate Root Folder BulletProof Mode.
The reason for doing this is that W3TC sometimes writes its .htaccess cache code to the bottom of the root .htaccess file, which does not work – it must be at the top of the root .htaccess file. Adding the W3TC .htaccess code to BPS Custom Code saves it permanently, ensures it will always be in the right place (the top of the root .htaccess file) whenever you click the AutoMagic buttons in the future, and means you will never see the W3TC redeploy alerts again.
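As a rough sketch of the ordering involved (the marker comments below are illustrative placeholders, not the exact W3TC or BPS output), the goal is a root .htaccess laid out like this:

```apacheconf
# --- Top of root .htaccess (placed here via BPS Custom Code) ---
# BEGIN W3TC cache code
#   ...W3TC rewrite rules go here, so cached pages are served
#   before any later rules are evaluated...
# END W3TC cache code

# --- BPS security rules follow below ---
# BEGIN BulletProof Security
#   ...BPS RewriteCond/RewriteRule filters...
# END BulletProof Security
```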
I have to agree with @dceljak.
Blocking user agents is not a strong security method – certainly not ‘BulletProof’. It’s akin to asking someone ‘are you a terrorist?’ and letting them in if they say ‘no’. All programs can change their user agent strings.
What this encourages is for web crawlers (including search engines) to use alternate user agent strings so they’re not arbitrarily blocked by plugins such as this. Anyone who is genuinely intent on causing harm to a website will already be masquerading as Chrome/Firefox/IE/Safari.
This ultimately provides a false sense of ‘blocking’ web crawlers/search/archiving engines. If webmasters genuinely don’t want their site indexed or stored, they should use the relevant directives in robots.txt or the robots meta tag, which are the de facto industry standard. Perhaps BPS could provide an option to set this up?
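For reference, the de facto directives look like this. A site that does not want to be indexed or archived would serve a robots.txt such as (a minimal sketch – well-behaved crawlers honor it, but it is a request, not enforcement):

```txt
# robots.txt at the site root – honored only by well-behaved crawlers
User-agent: *
Disallow: /
```

and/or emit the equivalent meta tag in each page’s head: `<meta name="robots" content="noindex, nofollow">`.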
Plugin Author
AITpro
(@aitpro)
Blocking user agent strings is intended more as nuisance management/prevention (i.e. some bad bots/user agents excessively scrape/harvest/mine content) than as security, and BPS contains only a very small number of bad bot/user agent nuisance filters. Yes, user agent strings can be very easily faked. The primary security filters in BPS instead take an Action Security approach: X does bad action Y and the result Z is Forbidden. The security focus is Y (the bad action), not X (the user agent or other identifier).
Only good/legitimate bots follow the rules/directives in a robots.txt file; bad bots ignore/disregard them. WordPress already comes with an option setting to tell search engines not to crawl and index a website: Settings >>> Reading >>> Search Engine Visibility >>> “Discourage search engines from indexing this site”.
It is up to search engines to honor this request.
Most folks want visitor traffic to their website, so of course they would not want to discourage search engines from indexing it (the setting either creates an entry in the WordPress virtual robots.txt “file” or outputs a meta tag – not really sure). We have several test websites that we do not want indexed or crawled, so we use that option setting on those sites.
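For what it’s worth, recent WordPress versions appear to do both when that setting is enabled: the virtual robots.txt response gains a disallow-all entry, and pages get a robots meta tag roughly like the following (the exact markup varies by WordPress version, so treat this as an assumption to verify against your own page source):

```html
<!-- Emitted in the page head when "Discourage search engines
     from indexing this site" is checked -->
<meta name='robots' content='noindex, nofollow' />
```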
A virtual robots.txt option/tool/feature is scheduled for inclusion in BPS and will be added at some point. It would use this code:
http://forum.ait-pro.com/forums/topic/wordpress-robots-txt-wordpress-virtual-robots-txt/#post-6523
NOTE: A robots.txt file is not designed/intended to be a website security measure; it is an SEO tool.