Well, you could detect the user-agent of the visiting entity and just have your plugin not block that user-agent, providing a whitelist functionality of a sort for your plugin.
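To illustrate, a user-agent whitelist check could be as simple as a substring match. This is only a sketch of the idea: the crawler tokens below are examples I'm assuming, not a maintained list, and (as noted further down) user-agent checks alone can be spoofed.

```python
# Illustrative sketch of user-agent whitelisting; the token list is an
# assumption, not an authoritative or complete list of crawlers.
CRAWLER_TOKENS = ("Googlebot", "Bingbot", "DuckDuckBot")

def is_whitelisted_crawler(user_agent: str) -> bool:
    """Return True if the User-Agent string contains a known crawler token."""
    ua = user_agent.lower()
    return any(token.lower() in ua for token in CRAWLER_TOKENS)

googlebot_ua = ("Mozilla/5.0 (compatible; Googlebot/2.1; "
                "+http://www.google.com/bot.html)")
print(is_whitelisted_crawler(googlebot_ua))                      # True
print(is_whitelisted_crawler("Mozilla/5.0 (Windows NT 10.0)"))   # False
```

The plugin would run a check like this before deciding whether to block the request.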
However, I think the more pertinent question is: why does the user want this? If you know exactly why, there may be a better way, because honestly it makes no sense to hide a site from the public yet have its contents indexed (and cached) by a public search engine.
Thread Starter
jon
(@adiant)
I wondered that too, but after thinking about it for a while, I can see it being a good marketing ploy for someone trying to sell access to his/her site.
Yes, I realized I could check the User-Agent, but I couldn’t find any reliable, performant way to get an always-current list of search-engine agent names to check against. I thought it would be too much of a burden to put on the user of the plugin (to create his/her own whitelist).
Thanks for taking the time to respond!
Yeah, that’s going to be the biggest hurdle.
I have used http://useragentstring.com/ to find these before, for example:
http://useragentstring.com/pages/Googlebot/ and http://useragentstring.com/pages/Bingbot/
And, they do have an API (which I have never used) http://useragentstring.com/pages/api.php
With that said though, I don’t think the site has been updated in a while, as I see some newer popular search engine bots missing, like https://duckduckgo.com/duckduckbot
It also doesn’t match the official list from Google: https://support.google.com/webmasters/answer/1061943
And from Bing: http://www.bing.com/webmaster/help/which-crawlers-does-bing-use-8c184ec0
The problem with whitelisting by user-agent is that anyone with their browser’s developer tools open could spoof the user-agent and get into the site free of charge. (If they found the content on Google, it would make sense that they could get in by spoofing Googlebot’s user-agent, or simply read Google’s cache of the page.)
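To show just how trivial spoofing is: the User-Agent is nothing but a request header chosen by the client, so any HTTP library can claim to be Googlebot. A quick sketch (the URL is a placeholder):

```python
import urllib.request

# The User-Agent header is entirely under the visitor's control;
# any HTTP client can present a Googlebot identity.
req = urllib.request.Request(
    "http://example.com/members-only/",  # placeholder URL
    headers={"User-Agent": "Mozilla/5.0 (compatible; Googlebot/2.1; "
                           "+http://www.google.com/bot.html)"},
)
print(req.get_header("User-agent"))
# A server whitelisting by user-agent string alone would treat this
# request as Googlebot; urllib.request.urlopen(req) would then fetch
# the gated page.
```

This is why user-agent matching can only ever be a convenience, never an access control.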
Whitelisting by IP is much more secure, but the IPs that Google and Bing use change almost daily, and they don’t seem to offer a public API or easily scrapeable list.
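What Google and Bing do document, instead of publishing IP lists, is verifying a crawler by forward-confirmed reverse DNS: reverse-resolve the visiting IP, check the hostname is under a crawler domain, then forward-resolve that hostname and confirm it maps back to the same IP. A sketch of that check (the domain suffixes are my assumption based on the vendors' verification docs, and the DNS part needs network access):

```python
import socket

# Assumed crawler domains, per Google's and Bing's crawler-verification docs.
CRAWLER_DOMAINS = (".googlebot.com", ".google.com", ".search.msn.com")

def hostname_is_crawler(host: str) -> bool:
    """True if a reverse-DNS hostname belongs to a known crawler domain."""
    return host.endswith(CRAWLER_DOMAINS)

def is_verified_crawler(ip: str) -> bool:
    """Forward-confirmed reverse DNS check. Requires live DNS access."""
    try:
        host = socket.gethostbyaddr(ip)[0]        # reverse: IP -> hostname
        if not hostname_is_crawler(host):
            return False
        return ip in socket.gethostbyname_ex(host)[2]  # forward must confirm
    except OSError:
        return False
```

The result would typically be cached per IP, since doing two DNS lookups on every request would be its own performance problem.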