Support » Plugin: Broken Link Checker » Any issues with false messages about broken links?

  • [Resolved] collapsing_os

    (@collapsing_os)


    Hello everyone,

    I know this topic isn’t new, but over the last few days the number of messages about broken links that aren’t actually broken has increased. For example:

    [ redundant link removed ]

    For me, every link listed works like a charm. Why not for BLC?

    Thanks and best regards,
    Henning

  • Ambyomoron

    (@josiah-s-carberry)

    I just received such a message. Not only was the link not broken; it wasn’t even listed in the plugin back end as a broken link! It makes one lose confidence in the plugin.

    Well, it has been a year since the last update.

    Thank you. I don’t have the time to verify whether a plugin I rely on still works as expected, so I uninstalled it. Now I’m searching for a substitute. “Link Checker” might be the solution. Let’s see.

    For my site, it works internally on its own domain, but I get a lot of false positives on links to my sub-domains. So https://www.mysite.org works, but links to https://blog.mysite.org or https://secure.mysite.org come up as broken. Since the plugin finds too many links that I then have to mark as “not broken”, I’ve had to deactivate it. I’ve gone back to using my non-WordPress tools for finding broken links on web sites: Integrity link checker for Mac and Insite link checker for Windows.

    Hi everyone,

    Disclaimer: I’m the developer of the other Link Checker, not this plugin.

    False positives are mostly caused by badly configured web hosts. In theory, a web host should allow all requests and respond with correct status codes as long as the request is permitted by the robots.txt file and the crawl-delay is respected by the accessor. In practice, many web hosts and firewalls respond with arbitrary, inaccurate status codes when a bot accesses a page.

    These inaccurate status codes make it nearly impossible to get the number of false positives down to zero.
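
    To make this concrete, here is a rough Python sketch (placeholder URL and hypothetical user-agent strings, not code from either plugin) that requests the same page twice, once with a bot-style User-Agent and once with a browser-style one:

        import requests

        URL = "https://example.com/some-page"  # placeholder URL
        AGENTS = {
            "bot": "ExampleLinkChecker/1.0 (+https://example.com/bot)",  # hypothetical checker UA
            "browser": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",      # browser-like UA
        }

        for label, user_agent in AGENTS.items():
            try:
                # HEAD keeps the check cheap; some hosts even answer HEAD and GET
                # differently, which is yet another source of false positives.
                response = requests.head(URL, headers={"User-Agent": user_agent},
                                         timeout=10, allow_redirects=True)
                print(f"{label}: HTTP {response.status_code}")
            except requests.RequestException as exc:
                print(f"{label}: request failed ({exc})")

    Behind an aggressive firewall, the bot request often comes back as 403 or runs into a timeout, while the browser request returns 200 for the very same page; that difference is exactly what shows up in the report as a broken link.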

    This is thus a problem that nearly all link checkers share. The Broken Link Checker only has it for external links, since it doesn’t check all internal links.

    My Link Checker also has this issue with internal links if the web host on which the checked website is hosted behaves badly. However, all potentially broken links are shown in the results. Many sites have no or only a few false positives, but for others the Link Checker is unusable because the results are polluted with false positives.

    Ambyomoron

    (@josiah-s-carberry)

    Thank you for your insight, Marco, which might explain some of the errors we users have encountered.
    I wonder, though, whether dependence on robots.txt is desirable for a broken link checker. It is a problem, given that robots.txt is only a de facto convention and is not defined by an official international standard. That being said, I suspect that most web site owners use robots.txt to control how legitimate robotic crawlers attempt to discover pages, mostly for search engine purposes. But a broken link checker is not trying to discover new pages; it is only trying to confirm the validity of URLs that have previously been discovered by some means – most likely not by a web crawler.
    This situation might be eased somewhat if there were a de facto naming convention for user agents that are used for broken link checking.

    Depending on robots.txt is definitely not desirable. I also make the distinction between discovery and checking single links, and I decided to check blocked external links anyway, but to show the user an appropriate message if the request fails, so that the user knows it is probably a false positive.

    For internal links, I’m very strict, because the robots.txt is normally controlled by the user who runs the link check, and there are bot traps that block access completely when the robots.txt is ignored. I say normally because I have also seen web hosts that manipulate the robots.txt without the client’s consent. Some also block the bot’s IP indefinitely as soon as the first blocked page is accessed.
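
    Conceptually, respecting the robots.txt for internal links boils down to something like this simplified Python sketch (hypothetical user-agent name and placeholder URLs; my actual implementation differs):

        from urllib.robotparser import RobotFileParser

        USER_AGENT = "ExampleLinkChecker"  # hypothetical checker name
        parser = RobotFileParser()
        parser.set_url("https://example.com/robots.txt")  # placeholder site
        parser.read()

        # Honor any crawl-delay and only fetch URLs the robots.txt allows;
        # ignoring it can trigger bot traps that block the checker's IP for good.
        delay = parser.crawl_delay(USER_AGENT) or 0
        for url in ("https://example.com/", "https://example.com/private/page"):
            if parser.can_fetch(USER_AGENT, url):
                print(f"check {url} (waiting {delay}s between requests)")
            else:
                print(f"skip {url}: disallowed by robots.txt")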

    However, the robots.txt is not the main problem, because most sites have no global restrictions for bots in their robots.txt. They just respond with inaccurate status codes like 403 Forbidden or let the request time out.
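
    That is why, for external links, a failed request is better reported as “possibly blocked” than as “broken”. Here is a simplified sketch of that kind of classification (the status-code list is only an example, not my exact rules):

        import requests

        # Codes that bot-blocking firewalls commonly return for pages that work
        # fine in a browser; treating them as "probably blocked" avoids reporting
        # them as hard errors.
        SUSPICIOUS = {401, 403, 429, 503}

        def classify(url, user_agent="ExampleLinkChecker/1.0"):  # hypothetical UA
            try:
                response = requests.get(url, headers={"User-Agent": user_agent},
                                        timeout=10, allow_redirects=True)
            except requests.Timeout:
                return "probably blocked (timeout, possible false positive)"
            except requests.RequestException:
                return "broken (connection error)"
            if response.status_code < 400:
                return "ok"
            if response.status_code in SUSPICIOUS:
                return "probably blocked (possible false positive)"
            return f"broken (HTTP {response.status_code})"

        print(classify("https://example.com/some-page"))  # placeholder URL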

    It may be possible to improve the results by using a pool of local residential IPs, simulating normal visitor behavior, and simulating the use of a real browser by executing all JavaScript. But purchasing residential IPs is expensive, and executing all JavaScript costs a lot of CPU and memory, so that is expensive too. Implementing such mechanisms is also very time-consuming.
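
    For illustration only (this is an assumption on my side, not something either plugin does): simulating a real visitor would mean running every check through a headless browser, roughly like this Playwright sketch, and it is easy to see why that gets expensive for thousands of links:

        # Requires: pip install playwright && playwright install chromium
        from playwright.sync_api import sync_playwright

        def check_like_a_visitor(url):
            with sync_playwright() as p:
                browser = p.chromium.launch()  # a full headless Chromium per check
                page = browser.new_page()
                # goto() loads the page and executes all JavaScript, like a real visitor
                response = page.goto(url, timeout=15000, wait_until="load")
                status = response.status if response else None
                browser.close()
                return status

        print(check_like_a_visitor("https://example.com/some-page"))  # placeholder URL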

    Standards would definitely help, but I don’t see them coming…

  • The topic ‘Any issues with false messages about broken links?’ is closed to new replies.