I am a crawl engineer for the Croatian Web Archive, and our bot has sent this User-Agent header since 2004: Mozilla/5.0 (compatible; SrceHarvester/3.3.1 +http://haw.nsk.hr/)
Our good/legitimate bot is filtered out (status 403) by BPS just because its User-Agent string contains the word “harvest” (defined in a BPS .htaccess RewriteCond). It is also possible that other national archives (http://www.netpreserve.org/) have similar problems, given the common use of the word “harvesting” in the context of web archiving.
We could suggest that site owners manually edit their .htaccess file and remove the word “harvest”, but the average WordPress user is not comfortable editing .htaccess.
Could you exclude “harvest” from the rewrite rules in future BPS releases?
If not, what should we suggest to site owners who want their websites archived?
Thanks.
Plugin Author
AITpro
(@aitpro)
First off, when looking at other security-based plugins and scripts, it is common practice to block the “harvest” user agent string. Since this is basically a standard, there must be a significant/relevant reason for blocking it, and for that reason we would not consider removing it from the standard BPS code.
BPS allows users to decide whether or not they want to allow harvesting/scraping of their website. BPS includes a Security Log file that records exactly what is being blocked, so users can review it and then decide whether or not to allow a particular request.
The BPS philosophy is this: we start with the maximum security possible and then make it easy for users to decrease that security on a case-by-case basis as needed. In this case, all the end user needs to do is modify this security filter and remove “harvest” from it.
RewriteCond %{HTTP_USER_AGENT} (;|<|>|'|"|\)|\(|%0A|%0D|%22|%27|%28|%3C|%3E|%00).*(libwww-perl|wget|python|nikto|curl|scan|java|winhttp|HTTrack|clshttp|archiver|loader|email|harvest|extract|grab|miner) [NC,OR]
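To illustrate why the SrceHarvester User-Agent trips this filter, here is a rough Python approximation of the pattern above. The `[NC]` flag maps to `re.IGNORECASE`; Apache's regex engine differs in details, so treat this only as a sketch of the matching logic, not as the exact BPS implementation:

```python
import re

# Rough Python approximation of the BPS RewriteCond pattern shown above.
# Apache [NC] = case-insensitive matching.
pattern = re.compile(
    r"""(;|<|>|'|"|\)|\(|%0A|%0D|%22|%27|%28|%3C|%3E|%00).*"""
    r"(libwww-perl|wget|python|nikto|curl|scan|java|winhttp|HTTrack|"
    r"clshttp|archiver|loader|email|harvest|extract|grab|miner)",
    re.IGNORECASE,
)

ua = "Mozilla/5.0 (compatible; SrceHarvester/3.3.1 +http://haw.nsk.hr/)"
# "(" and ";" satisfy the first group, and "Harvester" contains "harvest",
# so the filter matches and the request is answered with a 403.
print(bool(pattern.search(ua)))  # True
```

Deleting `harvest|` from the second alternation group, and leaving the rest of the filter intact, is the edit described above; after that change the archive bot's User-Agent no longer matches.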
Hi, I’m getting this warning message: “W3 Total Cache is activated, but W3TC .htaccess code was NOT found in your root .htaccess file.”
The same thing happened last time, but it disappeared after I unlocked the root .htaccess file to allow W3TC to write its code to it.
Now, even though the W3 Total Cache plugin is functioning, the warning message is still there!
Plugin Author
AITpro
(@aitpro)
What I recommend is that you copy the W3TC .htaccess cache code to this BPS Custom Code text box: CUSTOM CODE TOP PHP/PHP.INI HANDLER/CACHE CODE: Add php.ini handler and/or plugin cache code here
Click the Save Root Custom Code button.
Go to the BPS Security Modes page and click the Create secure.htaccess File AutoMagic button and then activate Root Folder BulletProof Mode.
The reason for doing this is that W3TC sometimes writes its .htaccess cache code to the bottom of the root .htaccess file, which does not work – it must be at the top of the root .htaccess file. Adding the W3TC .htaccess code to BPS Custom Code saves it permanently, ensures it will always be in the right place (the top of the root .htaccess file) whenever you click the AutoMagic buttons in the future, and means you will never see the W3TC redeploy alerts again.
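As a rough sketch of the ordering involved (the marker comments below are illustrative placeholders, not the exact W3TC or BPS output), the goal is a root .htaccess laid out like this:

```apacheconf
# --- Top of root .htaccess (placed here via BPS Custom Code) ---
# BEGIN W3TC cache code
#   ...W3TC rewrite rules go here, so cached pages are served
#   before any later rules are evaluated...
# END W3TC cache code

# --- BPS security rules follow below ---
# BEGIN BulletProof Security
#   ...BPS RewriteCond/RewriteRule filters...
# END BulletProof Security
```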
I have to agree with @dceljak.
Blocking user agents is not a strong security method – certainly not ‘BulletProof’. It’s akin to asking someone ‘are you a terrorist?’ and letting them in if they say ‘no’. All programs can change their user agent strings.
What this encourages is for web crawlers (including search engines) to use alternate user agent strings so they’re not arbitrarily blocked by plugins such as this. Anyone who is genuinely intent on causing harm to a website will already be masquerading as Chrome/Firefox/IE/Safari.
This ultimately provides a false sense of ‘blocking’ web crawlers/search/archiving engines. If webmasters genuinely don’t want their site indexed or stored, they should use the relevant directives in robots.txt or the robots meta tag, which are the de facto industry standard. Perhaps BPS could provide an option to set this up?
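For reference, the de facto directives look like this. A site that does not want to be indexed or archived would serve a robots.txt such as (a minimal sketch – well-behaved crawlers honor it, but it is a request, not enforcement):

```txt
# robots.txt at the site root – honored only by well-behaved crawlers
User-agent: *
Disallow: /
```

and/or emit the equivalent meta tag in each page’s head: `<meta name="robots" content="noindex, nofollow">`.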
Plugin Author
AITpro
(@aitpro)
Blocking user agent strings is intended more as nuisance management/prevention (i.e. some bad bots/user agents excessively scrape/harvest/mine content) than as security, and BPS contains only a very small number of bad bot/user agent nuisance filters. Yes, user agent strings can be very easily faked. The primary security filters in BPS instead take an Action Security approach: X does bad action Y and the result Z is Forbidden. The security focus is Y (the bad action), not X (the user agent or other identifier).
Only good/legitimate bots follow the rules/directives in a robots.txt file; bad bots ignore/disregard them. WordPress already comes with an option setting to tell search engines not to crawl and index a website: Settings >>> Reading >>> Search Engine Visibility >>> “Discourage search engines from indexing this site”.
It is up to search engines to honor this request.
Most folks want visitor traffic to their website, so of course they would not want to discourage search engines from indexing it (the setting either creates an entry in the WordPress virtual robots.txt “file” or outputs a meta tag – not really sure). We have several test websites that we do not want indexed or crawled, so we use that option setting on those sites.
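For what it’s worth, recent WordPress versions appear to do both when that setting is enabled: the virtual robots.txt response gains a disallow-all entry, and pages get a robots meta tag roughly like the following (the exact markup varies by WordPress version, so treat this as an assumption to verify against your own page source):

```html
<!-- Emitted in the page head when "Discourage search engines
     from indexing this site" is checked -->
<meta name='robots' content='noindex, nofollow' />
```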
A virtual robots.txt option/tool/feature is scheduled for inclusion in BPS and will be added at some point. It would use this code:
http://forum.ait-pro.com/forums/topic/wordpress-robots-txt-wordpress-virtual-robots-txt/#post-6523
NOTE: A robots.txt file is not designed/intended to be a website security measure; it is an SEO tool.