Well, I can tell that this isn’t caused by Relevanssi. Are you sure it’s not some other plugin?
Not allowing Google to index search results is probably a good idea; I remember reading in Google’s guidelines that having search results pages indexed is a bad thing.
Thread Starter
Gryz
(@gryz)
After posting here, I realized that this support forum isn’t about Relevanssi, but about plugins in general. Sorry about that. But if Relevanssi isn’t the cause, then maybe this is the right place to ask?
So what is causing this?
The WordPress setup we have is nothing special. Only a handful of plugins. And our site is clearly not the only site that has this problem.
We have had the line:
Disallow: /search
in our robots.txt file for a long time. I guess that isn’t enough, and we need /?s= in there as well. But I’d rather see the root cause disappear than have every WordPress user in the world change their robots.txt file manually.
Thread Starter
Gryz
(@gryz)
Is there a way to see where searches are coming from (IP address or domain name)? When I reset the Relevanssi logs, I get log entries with no-results: in the query string within minutes. I can’t believe Google’s web crawler is that quick to crawl my website. However, if it’s not Google, then how does Google pick up those bogus URLs?
Also, when I disable Relevanssi, is there a way for me to see the query strings that get processed by the default WP search engine? That would allow me to prove to myself that the bogus searches also happen when Relevanssi is disabled.
Thread Starter
Gryz
(@gryz)
I figured out how Google can pick up weird search URLs. We are running Google Analytics. When some broken (or weird) site generates those searches with the nested no-results: queries, the resulting page will trigger Google Analytics, and Google will be notified about the existence of the no-results: page. Maybe Google uses that information in their page-ranking algorithms? Not sure if this is what happens, but it could explain one part of the puzzle.
I don’t know; maybe you could build a filter function that hooks into the_posts and saves the queries? That’s where Relevanssi hooks in.
Thread Starter
Gryz
(@gryz)
Thanks for the suggestion. I’m an old C-programmer who used to write C-code for networking devices. I have no knowledge about php, and I’m not sure I wanna check out all WP code to see how it hangs together. I was hoping for a log-function of WP, where I can just go through all http-requests. Maybe I’ll see if I can write some code.
I’ve grepped through all the php-code. The only place where I could find the exact string “no-results:” was in the google analytics code.
From googleanalytics.php:
} else if ($wp_query->is_search) {
    $pushstr = "'_trackPageview','".get_bloginfo('url')."/?s=";
    if ($wp_query->found_posts == 0) {
        $push[] = $pushstr."no-results:".rawurlencode($wp_query->query_vars['s'])."&cat=no-results'";
    } else
It looks like the string “no-results:” is prepended to the search string whenever a search finds nothing. This seems like a place where the repeated no-results: prefixes could be added.
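To illustrate that suspicion, here is a minimal Python sketch of how the prefixes could pile up. It is only a simulation of the hypothesis, not the plugin's code: it assumes a crawler requests each tracked URL again, and that a query starting with "no-results:" again finds zero posts. The site URL and all function names are made up for the example.

```python
from urllib.parse import quote, urlsplit, parse_qs

SITE = "http://example.com"  # hypothetical blog URL, not from the thread


def tracked_no_results_url(query):
    """Mimic the quoted zero-results branch: build the URL reported to analytics."""
    # quote(..., safe="") percent-encodes like PHP's rawurlencode() for this input
    return SITE + "/?s=" + "no-results:" + quote(query, safe="") + "&cat=no-results"


def query_seen_by_site(url):
    """Return the decoded s parameter the site would see if this URL were requested."""
    return parse_qs(urlsplit(url).query)["s"][0]


def simulate(query, rounds):
    """Feed each tracked URL back into the site, assuming every search finds nothing."""
    urls = []
    for _ in range(rounds):
        url = tracked_no_results_url(query)
        urls.append(url)
        query = query_seen_by_site(url)  # the crawler requests the tracked URL
    return urls


if __name__ == "__main__":
    for url in simulate("foo", 3):
        print(url)
```

Each round adds one more "no-results:" in front of the original query, which matches the shape of the nested queries showing up in the logs.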
I disabled google analytics for a few minutes on our website, and I still saw new searches with the mangled query. 🙁 It’s very weird. I’ll look into it again this weekend.
The problem is happening at many sites.
When searching on Google for “no-results:no-results:” I get an estimated 15.8 million results, although Google only actually lists 355 of them. Still doesn’t look good. I’m surprised nobody has looked at this before.
http://www.google.nl/search?complete=0&q=%22no-results%3Ano-results%3A%22
Thread Starter
Gryz
(@gryz)
It turns out our webhost keeps a logfile with all HTTP requests.
I can see that many of the “no-results” queries are from Googlebot.
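For anyone who wants to do the same check, here is a rough Python sketch that counts the bogus search requests per client. It assumes the host writes the common Apache/NGINX "combined" log format, which may not be true for every host. The sample lines, IP addresses and function name are invented for illustration.

```python
import re
from collections import Counter

# Combined Log Format (assumed):
# ip ident user [time] "request" status size "referer" "user-agent"
LINE_RE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[[^\]]+\] "(?P<method>\S+) (?P<path>\S+)[^"]*" '
    r'\d{3} \S+ "[^"]*" "(?P<agent>[^"]*)"'
)


def bogus_search_hits(lines):
    """Count requests whose path contains s=no-results, keyed by (ip, user agent)."""
    counts = Counter()
    for line in lines:
        m = LINE_RE.match(line)
        if m and "s=no-results" in m.group("path"):
            counts[(m.group("ip"), m.group("agent"))] += 1
    return counts


# Invented sample lines, only to show the expected input shape.
SAMPLE = [
    '192.0.2.10 - - [01/Jan/2000:12:00:00 +0000] '
    '"GET /page/3/?s=no-results:no-results%3Afoo HTTP/1.1" 200 5120 "-" '
    '"Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
    '203.0.113.7 - - [01/Jan/2000:12:00:05 +0000] '
    '"GET /?s=relevanssi HTTP/1.1" 200 4096 "-" "Mozilla/5.0"',
]

if __name__ == "__main__":
    for (ip, agent), n in bogus_search_hits(SAMPLE).most_common():
        print(n, ip, agent)
```

Keep in mind that the user-agent string can be forged, so a line claiming to be Googlebot is a hint rather than proof.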
I now also understand why the
Disallow: /?s=
line in robots.txt didn’t work. It turns out Google does queries for
/page/3/?s=no-results:no-results:<etc>
/page/8/?s=no-results:<etc>
So I added another line to robots.txt.
Disallow: /page/*/?s=
I hope the ? and = characters are not special characters, like * is.
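A small Python sketch can sanity-check those rules. It assumes Google's documented matching, where only * (any run of characters) and a trailing $ (end of URL) are special and everything else, including ? and =, is matched literally from the start of the path. Other crawlers may behave differently, and the helper names are made up for the example.

```python
import re


def robots_pattern_to_regex(pattern):
    """Translate a robots.txt path pattern, treating only * and a trailing $ as special."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape every literal piece so ? and = keep their plain meaning
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile("^" + body + ("$" if anchored else ""))


def is_blocked(pattern, path):
    """True if a Disallow rule with this pattern would match the given path and query."""
    return robots_pattern_to_regex(pattern).match(path) is not None


if __name__ == "__main__":
    paged = "/page/3/?s=no-results:no-results:foo"
    print(is_blocked("/search", "/?s=foo"))     # rule for /search vs. a ?s= URL
    print(is_blocked("/?s=", paged))            # rule for /?s= vs. a paged URL
    print(is_blocked("/page/*/?s=", paged))     # the extra rule added above
```

Under the same assumption, a single pattern such as /*?s= would match both the plain and the paged search URLs, which might be simpler than keeping two rules.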
Msaari, if you are still reading this.
I have a small suggestion.
Maybe you can include a line:
<meta name="robots" content="noindex">
in all result-pages from searches ?
I don’t think people want dynamic search results indexed in search engines anyway. So if WordPress or Relevanssi included the “noindex” tag in all search results, that could prevent problems?
Thread Starter
Gryz
(@gryz)
One more update.
I couldn’t believe something was wrong in the Google Analytics code. Google isn’t that sloppy. But then I realized that the code does not come from Google. It is part of the “Google Analytics for WordPress” plugin, and the code is written by a volunteer from the WP community, not by Google.
When I looked at a lot of the websites that had the same problem, I noticed they were all using the “Google Analytics for WordPress” plugin. So this plugin might very well be the cause of the problems.
I disabled it, and replaced it with another plugin.
Ultimate Google Analytics
Let’s hope this fixes the root of the problem.
We should know in a few days. I’ll update this post.
So there are three parts to the solution.
1) Use a different GA plugin.
2) Add rules to robots.txt to prevent Googlebot from crawling the old malformed search URLs.
3) Wait for the old search URLs to drop out of Google’s index.
If this turns out to fix the problem, I wonder if I should notify the parties involved. Would Google remove the bogus 18 million entries from their database? Should I contact the author of the GA for WordPress plugin?
Anyway, Relevanssi had nothing to do with this. It was a very useful tool to warn us that something was wrong. Thanks for all your effort, msaari !
Relevanssi doesn’t make any changes to the search results template; so far I’ve left that under user control. Also, many people use all sorts of SEO plugins, and a good SEO plugin will cover that. But yeah, maybe I could add the meta noindex tag as an option.
Gryz, did you happen to solve this problem?
Sander