Google crawler and web browser seeing phast.php JSON files as HTML document
-
phast.php may be generating JSON with incorrect MIME types. The Google crawler seems to be doing excessive crawling because it thinks these are all HTML documents.
Can you please look into this?
Thanks!
-
This topic was modified 4 months, 4 weeks ago by kw11.
-
Hi @kw11,
Unless you share a URL of an indexed resource, I cannot check.
However, PhastPress sends JSON as text/plain responses because not all server configurations automatically compress application/json responses. It could be that Google automatically indexes these.
In the latest release, I've added an X-Robots-Tag: none header to these JSON responses to avoid any indexing or link following by Google. This should fix the issue.
–Albert
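As an illustration of the two headers being discussed, here is a minimal PHP sketch. This is not PhastPress's actual code; the function name and payload shape are invented for the example:

```php
<?php
// Sketch of a bundler-style JSON response. Illustrative only;
// the function name and payload shape are hypothetical.
function send_bundle_response(array $bundle): void
{
    // text/plain, because some servers compress only a fixed
    // list of MIME types that may exclude application/json.
    header('Content-Type: text/plain; charset=utf-8');

    // Ask crawlers not to index this response or follow links in it.
    header('X-Robots-Tag: none');

    echo json_encode($bundle);
}

send_bundle_response(['resources' => []]);
```

Note that both `header()` calls must run before any output is sent, or PHP will warn that headers were already sent.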
Thanks for doing this.
However, on closer look, it seems that the phast.php bundler files are being crawled as HTML, which ends up significantly lowering the crawl rate of HTML pages for Googlebot/2.1 Smartphone. The desktop crawl rate is fine, which is odd.
It seems that these text/plain JSON files, which are being interpreted as text/html, are what's causing the drop in crawl rate.
When phast.php was inadvertently blocked from Googlebot, the overall number of pages/files crawled was significantly lower, but the crawl rate of Googlebot Smartphone was magnitudes higher.
Whether or not this is a "bug" in the Google crawler or the Phast bundler, the resulting crawl rates are unacceptable. With the Phast bundler blocked, Googlebot can crawl the whole site within a few days. With phast.php not blocked from robots, it might take weeks or longer.
-
That said, the overall crawling, in terms of items crawled, is about 2-3x higher when phast.php is not blocked from robots. For example, instead of 2,000 documents crawled per day, 6,000 are. Googlebot crawls the Phast bundler JSON files in very high volume but skips crawling the actual page HTML with its smartphone crawler. It also seems that the overall crawl budget for real HTML pages is much lower on the Googlebot smartphone crawler when phast.php is not blocked from robots.
-
I just reread what you wrote and you said it’s because some servers don’t automatically compress application/json.
Is there a way you can instruct the server to do this for the application/json MIME type? If not, can you detect whether a server can compress application/json and, if it can, serve the responses as application/json? The latter is not a perfect solution from an SEO perspective, but it would get rid of this Googlebot issue for servers that can compress application/json.
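For what it's worth, on servers where you control the configuration, compression of application/json can usually be enabled directly. These are illustrative fragments, assuming mod_deflate on Apache and the built-in gzip module on nginx; they are not something the plugin itself can set:

```apache
# Apache (mod_deflate): compress JSON responses
AddOutputFilterByType DEFLATE application/json
```

```nginx
# nginx: add application/json to the gzip MIME-type list
gzip on;
gzip_types application/json;
```

On shared hosting without access to the server configuration, this may not be an option, which is presumably why the plugin defaults to text/plain.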
-
I edited the plugin so the Content-Type is application/json; charset=utf-8. I'll let you know what Googlebot thinks in a few days. On our server, it's automatically Brotli-compressed.
One more question: will blocking phast.php* in robots.txt have any negative consequences for crawling? Or will it just prevent bundled resources from being cached?
It seems like Googlebot and most robots don’t cache anything anyway?
Hi @kw11,
We cannot prevent the initial request to phast.php by changing the headers on that response, because the headers are retrieved only after the request is actually made.
So I suggest adding phast.php to your robots.txt. There's nothing there that should be crawled by Google.
–Albert
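If you go this route, a robots.txt rule along these lines would cover the bundler endpoint. The wildcard pattern is an assumption on my part; it relies on Googlebot's `*` wildcard support and should be checked against your site's actual phast.php URL:

```
User-agent: *
Disallow: /*phast.php
```

Google's robots.txt implementation treats `*` as matching any sequence of characters, so this matches phast.php wherever it sits in the URL path.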
So to be clear, you think it’s fine to block phast.php in robots.txt?
Hi @kw11,
I’m actually not sure what will happen.
On the one hand, it doesn’t stop Google from indexing the content of your page.
On the other hand, it prevents Google from using the Phast bundler, and I’m not sure what that will do to Google’s impression of your site’s performance.
I edited the plugin so the Content-Type is application/json; charset=utf-8. I'll let you know what Googlebot thinks in a few days. On our server, it's automatically Brotli-compressed.
Thanks for this. I’m looking forward to hearing the results of this experiment. If it makes a big difference I may add a setting to the plugin.
–Albert
-
Mobile crawl requests are back to normal (actually above previous normal), and desktop crawl requests also spiking.
Overall crawl requests much higher now.
phast.php* is blocked via robots.txt. I didn't have a chance to test for a significant period with just the application/json MIME type, but I still think text/plain was probably confusing the mobile crawler, which was interpreting it as text/html.
Since the crawler was confused, I would suggest blocking phast.php via robots.txt. Unfortunately, this plugin has the potential to seriously impact indexability and rankings without these steps.
-
Hi @kw11,
Thanks for your testing.
I have released version 2.10, which reverts to using application/json as the MIME type on bundler responses and removes the X-Robots-Tag header.
Hopefully this will prevent the issue in the future.
–Albert