Well, you could detect the user-agent of the visiting entity and just have your plugin not block that user-agent, providing a whitelist functionality of a sort for your plugin.
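To illustrate, a user-agent whitelist check could be as simple as a substring match. This is only a sketch of the idea: the crawler tokens below are examples I'm assuming, not a maintained list, and (as noted further down) user-agent checks alone can be spoofed.

```python
# Illustrative sketch of user-agent whitelisting; the token list is an
# assumption, not an authoritative or complete list of crawlers.
CRAWLER_TOKENS = ("Googlebot", "Bingbot", "DuckDuckBot")

def is_whitelisted_crawler(user_agent: str) -> bool:
    """Return True if the User-Agent string contains a known crawler token."""
    ua = user_agent.lower()
    return any(token.lower() in ua for token in CRAWLER_TOKENS)

googlebot_ua = ("Mozilla/5.0 (compatible; Googlebot/2.1; "
                "+http://www.google.com/bot.html)")
print(is_whitelisted_crawler(googlebot_ua))                      # True
print(is_whitelisted_crawler("Mozilla/5.0 (Windows NT 10.0)"))   # False
```

The plugin would run a check like this before deciding whether to block the request.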
However, I think the more pertinent question is: why does the user want this? If you know exactly why, there may be a better way, because honestly it makes no sense to hide a site from the public yet have its contents indexed (and cached) by a public search engine.
Thread Starter
jon
(@adiant)
I wondered that too, but after thinking about it for a while, I can see it being a good marketing ploy for someone trying to sell access to his/her site.
Yes, I realized I could check the User-Agent, but I couldn’t find any reliable, performant way to get an always-current list of search-engine agent names to check against. I thought it would be too much of a burden to put on the user of the plugin (to create his/her own whitelist).
Thanks for taking the time to respond!
Yeah, that’s going to be the biggest hurdle.
I have used http://useragentstring.com/ to find these before, for example:
http://useragentstring.com/pages/Googlebot/ and http://useragentstring.com/pages/Bingbot/
And, they do have an API (which I have never used) http://useragentstring.com/pages/api.php
With that said though, I don’t think the site has been updated in a while, as I see some newer popular search engine bots missing, like https://duckduckgo.com/duckduckbot
It also doesn’t match the official list from Google: https://support.google.com/webmasters/answer/1061943
And from Bing: http://www.bing.com/webmaster/help/which-crawlers-does-bing-use-8c184ec0
The problem with whitelisting by user-agent is that anyone with their browser’s developer tools open could spoof the user-agent and get into the site free of charge. (If they found the content on Google, it would make sense that they could get in by spoofing Googlebot’s user-agent, or simply read Google’s cache of the page.)
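To show just how trivial spoofing is: the User-Agent is nothing but a request header chosen by the client, so any HTTP library can claim to be Googlebot. A quick sketch (the URL is a placeholder):

```python
import urllib.request

# The User-Agent header is entirely under the visitor's control;
# any HTTP client can present a Googlebot identity.
req = urllib.request.Request(
    "http://example.com/members-only/",  # placeholder URL
    headers={"User-Agent": "Mozilla/5.0 (compatible; Googlebot/2.1; "
                           "+http://www.google.com/bot.html)"},
)
print(req.get_header("User-agent"))
# A server whitelisting by user-agent string alone would treat this
# request as Googlebot; urllib.request.urlopen(req) would then fetch
# the gated page.
```

This is why user-agent matching can only ever be a convenience, never an access control.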
Whitelisting by IP is much more secure, but the IPs that Google and Bing use change almost daily, and they don’t seem to offer a public API or easily scrapeable list.
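What Google and Bing do document, instead of publishing IP lists, is verifying a crawler by forward-confirmed reverse DNS: reverse-resolve the visiting IP, check the hostname is under a crawler domain, then forward-resolve that hostname and confirm it maps back to the same IP. A sketch of that check (the domain suffixes are my assumption based on the vendors' verification docs, and the DNS part needs network access):

```python
import socket

# Assumed crawler domains, per Google's and Bing's crawler-verification docs.
CRAWLER_DOMAINS = (".googlebot.com", ".google.com", ".search.msn.com")

def hostname_is_crawler(host: str) -> bool:
    """True if a reverse-DNS hostname belongs to a known crawler domain."""
    return host.endswith(CRAWLER_DOMAINS)

def is_verified_crawler(ip: str) -> bool:
    """Forward-confirmed reverse DNS check. Requires live DNS access."""
    try:
        host = socket.gethostbyaddr(ip)[0]        # reverse: IP -> hostname
        if not hostname_is_crawler(host):
            return False
        return ip in socket.gethostbyname_ex(host)[2]  # forward must confirm
    except OSError:
        return False
```

The result would typically be cached per IP, since doing two DNS lookups on every request would be its own performance problem.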