The scanner does not work
-
WordPress 6.2
Broken Link Checker 2.0.0
WPMU DEV 4.11.18
Links found:
- Old: 3.102
- New: after three hours I stopped the process: 1.441.386 (!)
The page I need help with: [log in to see the link]
-
Hi @almendron
I hope you are doing well today.
In your scan I see multiple errors like:
- Request Error: Not Found
- Request Error: (ERR_FR_TOO_MANY_REDIRECTS)
- Request Error: (ERR_TLS_CERT_ALTNAME_INVALID)
- Server Error: Internal Server Error
- Request timed out (connection timeout)
- Request Error: Forbidden
In general, the scan timed out after 3 hours. I pinged our BLC Team to review this scan more closely. We will post an update here as soon as more information is available.
Kind Regards,
Kris
I have accessed the full report and found links that do not correspond to the website.
Active website: https://www.almendron.com/artehistoria/
Example of a link that is not from that website: https://www.almendron.com/blog/sentimientos-en-las-paredes-de-las-ciudades/
In the report there are links to three different sites. There are three WordPress installations in three subfolders.
Hi @almendron
Broken Link Checker scanner scans your site’s front-end (just like any search engine crawler would – but faster) and checks all the links it can find there.
The example link that you shared is part of the scanned page (from a crawler/user perspective) even if – technically speaking – it belongs to a separate WordPress installation (or some other platform, even a pure HTML site).
If you open your “active website”, all it takes to get to that example link manually is to:
– click the “BITACORA” item in the menu (it takes you to the /blog/ page)
– then click the “Arte” category in the sidebar
– then navigate to page number 7 (currently; it may change) using the paging numbers
– and that example link is there.
It doesn’t really matter whether the link “technically” belongs to a particular setup or not; it could even be an external link to some unrelated third-party site. As long as it is available on the checked site, it is discovered and tested.
The “old” engine worked in a different way and only checked the “content” of the page it is on – so yes, it might have skipped a lot of links in the case of such “connected” installs. But still – this isn’t a bug, it’s just how a crawler works.
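To illustrate the point about crawling, here is a minimal sketch of how any front-end crawler discovers links on a page. This is purely an illustration, not the actual BLC Cloud engine code: the page HTML, URLs, and the `same_site` rule below are assumptions for the example.

```python
# Minimal front-end crawler sketch (illustration only, not BLC code).
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

class LinkCollector(HTMLParser):
    """Collects every href found on a page, just like a crawler would."""
    def __init__(self, page_url):
        super().__init__()
        self.page_url = page_url
        self.found = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    # Resolve relative URLs against the page URL.
                    self.found.append(urljoin(self.page_url, value))

def discover_links(page_url, html):
    collector = LinkCollector(page_url)
    collector.feed(html)
    return collector.found

def same_site(start_url, link):
    # From the crawler's perspective, "same site" means same host --
    # it cannot tell that /blog/ and /artehistoria/ are separate
    # WordPress installations behind the scenes.
    return urlparse(start_url).netloc == urlparse(link).netloc

html = '''
<a href="/artehistoria/page2/">Next</a>
<a href="/blog/sentimientos-en-las-paredes-de-las-ciudades/">Arte</a>
<a href="https://example.org/external">External</a>
'''
start = "https://www.almendron.com/artehistoria/"
for link in discover_links(start, html):
    # Every discovered link is checked; same-host links are also crawled further.
    print(link, "crawl" if same_site(start, link) else "check-only")
```

As the output shows, a link into /blog/ looks exactly like any other same-host link, so the crawler follows it.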
In this specific case, some of the checked URLs were responding too slowly, so the cloud engine marked them as “Request timed out” (as it cannot wait indefinitely). After a number of such time-outs, the entire scan is halted in order not to stress the site too much – this is a safety precaution to avoid breaking the site with an intense scan.
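The halt-on-timeouts precaution described above can be sketched roughly as follows. Note that the threshold, the `check` function, and the URLs are all hypothetical; the real engine's values and logic are not public.

```python
# Hedged sketch of aborting a scan after repeated time-outs
# (threshold and checker are illustrative, not BLC's actual values).
MAX_TIMEOUTS = 3  # hypothetical threshold

def run_scan(urls, check):
    """Check URLs one by one; abort the whole scan after too many time-outs."""
    timeouts, results = 0, {}
    for url in urls:
        status = check(url)  # e.g. "ok", "timeout", "404", ...
        results[url] = status
        if status == "timeout":
            timeouts += 1
            if timeouts >= MAX_TIMEOUTS:
                # Safety precaution: stop instead of hammering a slow site.
                results["_halted"] = True
                break
    return results

# Simulated checker: every /tribuna/ URL responds too slowly.
check = lambda u: "timeout" if "/tribuna/" in u else "ok"
urls = [f"https://www.almendron.com/tribuna/p{i}/" for i in range(10)]
report = run_scan(urls, check)
print(report.get("_halted"), len(report) - 1)  # → True 3
```

The design choice is the same trade-off Adam describes: a partial report is preferable to a scan that keeps hitting an already-struggling server.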
And that scan would be pretty intense. Apparently, for some of the URLs found in the footer, the scan identified nearly 21 million (!) occurrences across the entire site – where by “site” I mean what the user sees from the front-end, rather than a single WordPress installation.
This isn’t an easy thing to address for now, I’m afraid, unless you decide to go back to the old “Local” scan. In the future we’ll be adding additional “rate limiting” to the cloud scan, along with other options that will make it handle such huge amounts of links and larger numbers of time-outs better, but I don’t have an ETA on that – it’s currently a work in progress, so it might take a while until it gets updated.
Kind regards,
Adam
Thank you very much for your explanations. However:
I have three WordPress installations, each in its own folder.
https://www.almendron.com/tribuna/
https://www.almendron.com/blog/
https://www.almendron.com/artehistoria/
When I scan “blog”, the plugin should scan only that site. The same goes for “tribuna” and “artehistoria”.
However, the plugin scans all of them.
Hi @almendron
Thank you for your response!
The plugin does indeed scan them all if you use the “Cloud” scan.
Those may be three separate setups but, as I mentioned, the new “Cloud” scan works from the outside. It’s like any crawler – it scans the front-end and follows the links it finds.
In other words – it checks the site it was started for, finds all the links there, and then follows those that, “from the outside”, appear to be part of the same site.
Your setup, despite being three separate installations, appears to be a single site. Even a human user would not think it’s three separate sites, and the crawler doesn’t know that either – all three sites are “interlinked” from menus, live on the same domain, and use a URL structure that makes them appear to be a single site.
It really doesn’t matter whether they are different setups or not. The crawler simply crawls the links and checks them.
Your setup is not really typical, as usually even sites in sub-folders are not so tightly integrated on the front-end. If there were no links from one site to another, the crawler would not follow them. But there are, making it all appear to be a single setup.
The entire Cloud engine works from the outside, and that’s why it sees it this way – it really does pretty much nothing on the site’s end (aside from providing the user interface under the “Link Checker -> Cloud (new)” page and authenticating the site with the Hub to allow the scan).
It’s not a bug – it’s how it is designed to work. The reason it scans all three sites in this case is not the scanner but how those sites are set up; this is the expected and correct way to scan.
—
For now, you can always switch to the “Local (old)” scan, which is the very same engine (exactly the same) as before the 2.0 plugin. It doesn’t use the Cloud scanner, so it will work exactly the way it used to.
In the future, there will also be options to define exclusions for the Cloud scan, making it possible to exclude certain URLs from the scan; but this is planned for later and I don’t have an ETA currently.
Kind regards,
Adam
With all due respect, I think your approach is wrong.
Site 1: domain/blog
Site 2: domain/tribuna
On “Site 1” there are, for example, 10 links pointing to “Site 2”.
The plugin should only analyse these 10 links.
However, the plugin does not only analyse these 10 links; it also looks for all the links on “Site 2”.
Honestly, I think this behaviour should be corrected. The plugin should limit its search to the links it finds in the posts and pages under the base URL (in this case “domain/xxxx”).
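The scoping rule being proposed here could be sketched like this. To be clear, this is a hypothetical illustration of the suggestion, not a feature the plugin currently offers: the base URL and the example links are assumptions.

```python
# Sketch of scoping a crawl to a base path (hypothetical, not a BLC feature).
from urllib.parse import urlparse

BASE = "https://www.almendron.com/blog/"  # the installation being scanned

def in_scope(url):
    """A URL is crawled further only if it lives under the base path."""
    base, link = urlparse(BASE), urlparse(url)
    return link.netloc == base.netloc and link.path.startswith(base.path)

# Links found under /blog/ pointing elsewhere would still be *checked*
# once (is the target alive?), but never *crawled* for more links.
links = [
    "https://www.almendron.com/blog/post-1/",        # crawl + check
    "https://www.almendron.com/tribuna/article-9/",  # check only
    "https://example.org/external/",                 # check only
]
for url in links:
    print(url, "crawl" if in_scope(url) else "check-only")
```

Under this rule, the 10 links from “Site 1” to “Site 2” would each be tested once, but the crawler would stop there instead of walking all of “Site 2”.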
Note: if someone from the team spoke English, they could explain it to me better.
Miguel.
Hi @almendron,
We do understand your concern and thanks for your valuable feedback and for sharing your perspective on the plugin’s functionality. I’ll make sure to bring this further to our developer’s attention to see if there are any further improvements that could be looked at regarding this.
Will keep you posted once we get further feedback.
Kind Regards,
Nithin
Thank you very much
Kind regards,
Miguel.
The topic ‘The scanner does not work’ is closed to new replies.