Hey Torsten,
thanks for getting back so quickly 🙂 A few seconds ago I sent the form, filled out as well as I could (I was not sure about the user agent; I hope it’s the “comment-agent”).
Also, important to say: I’m not really a developer. I have just read parts of the plugin’s code, stumbled through the support forum, and tried to understand a bit and adapt it.
Now to your questions:
Regex I tried out
I thought maybe I could extend the host part in
array(
'host' => '^(www\.)?fkbook\.co\.uk$|^(www\.)?nsru\.net$|^(www\.)?goo\.gl$|^(www\.)?bit\.ly$',
),
for example by adding one of the following:
|^(http|https)(\:\/\/)(www)*(\w+){1}$
|^(http|https)(\:\/\/)(www)*(\[a-z0-1]+){1}$
Worth mentioning: whenever I have to do something with regex, I use https://regexr.com/. Before implementing the patterns, I tested them against:
1. http://test (should match the regex and so be marked as spam)
2. https://test (should match the regex and so be marked as spam)
3. http://www.test.de (should not match the regex and so not be marked as spam)
4. http://test.de (should not match the regex and so not be marked as spam)
The regexes do work in some way, but they also mark comments as spam when the URL looks like examples 3 and 4. So in the end they are not working, or let’s say “they work too well” and mark more than they should.
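To make the four test cases repeatable outside of regexr, here is a minimal sketch in Python (not the plugin's PHP). It assumes the pattern is run against the whole URL string; how the plugin actually fills the 'host' or 'rawurl' field is not checked here, and the names `PATTERN`, `CASES` and `check` are made up for this example.

```python
import re

# The first pattern from above, copied unchanged.
PATTERN = re.compile(r'^(http|https)(\:\/\/)(www)*(\w+){1}$')

# The four test URLs, each with the result I want (True = treat as spam).
CASES = {
    "http://test": True,
    "https://test": True,
    "http://www.test.de": False,
    "http://test.de": False,
}


def check(url: str) -> bool:
    """Return True if the pattern matches the complete URL string."""
    return PATTERN.search(url) is not None


# Collect the actual results so they can be compared with the wanted ones.
results = {url: check(url) for url in CASES}

for url, got in results.items():
    print(f"{url!r}: matched={got}, wanted={CASES[url]}")
```

If the printed results differ from what happens on the site, that would suggest the pattern is being applied to something other than the full URL, or combined with the other patterns in a way this sketch does not cover.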
I also tried the same regexes as an additional array, e.g.
array(
'rawurl' => '^(http|https)(\:\/\/)(www)*(\w+){1}$',
),
Your example at github
If I’m not wrong, curly brackets with a number inside mean “exactly that many repetitions”. That means: if the body consists of a single word exactly 30 characters long, it works, but with more or fewer than 30 characters it fails, and so the comment is not marked as spam.
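A small Python sketch of the quantifier point. The two patterns are hypothetical stand-ins, not the pattern from the GitHub example, and the lower bound of 20 is only an assumed value: `{30}` means exactly 30 repetitions, while `{20,}` means 20 or more.

```python
import re

# Exactly 30 word characters, nothing else in the body.
EXACT = re.compile(r'^\w{30}$')

# 20 or more word characters (the 20 is a hypothetical threshold).
AT_LEAST = re.compile(r'^\w{20,}$')


def matches(pattern: re.Pattern, body: str) -> bool:
    """Return True if the whole comment body matches the pattern."""
    return pattern.search(body) is not None


# Try bodies of different lengths against both patterns.
for length in (19, 20, 29, 30, 31, 45):
    body = "a" * length
    print(length, matches(EXACT, body), matches(AT_LEAST, body))
```

With an open-ended quantifier like `{20,}`, gibberish words of varying length would still be caught, at the price of also catching any legitimate one-word comment that long.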
Gravatars
I disabled Gravatars completely for GDPR reasons (you know, the stupid German data-protection law … for me, at least, the easiest thing was to just get rid of them).
“Use regular expressions”
Yes, I do (double-checked it, just to be sure) 🙂
Some more thoughts from my side
To me, the common pattern seems to be the strange-looking URL. It never has a subdomain, and it has no dot or TLD at the end. That could be a good starting point for a solution whose implementation I’m too dumb to figure out. An additional or alternative starting point might be the comment body: it’s always just one gibberish word, and quite long (though not always the same length).
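The two observations could be combined roughly like this. It is only a sketch in Python, not plugin code: the function names and the minimum length of 20 are assumptions for illustration, and a real rule would have to be expressed in whatever form the plugin accepts.

```python
import re
from urllib.parse import urlparse

# Hypothetical threshold: how long the single word must be to look suspicious.
MIN_BODY_LENGTH = 20

# One run of non-whitespace characters, at least MIN_BODY_LENGTH long.
SINGLE_LONG_WORD = re.compile(r'^\S{%d,}$' % MIN_BODY_LENGTH)


def host_has_no_dot(url: str) -> bool:
    """True if the URL has a host and that host contains no dot (so no TLD)."""
    host = urlparse(url).hostname
    return host is not None and "." not in host


def looks_like_spam(url: str, body: str) -> bool:
    """True only if both signs are present: dotless host and one long word."""
    is_single_long_word = SINGLE_LONG_WORD.search(body.strip()) is not None
    return host_has_no_dot(url) and is_single_long_word


print(looks_like_spam("http://test", "x" * 25))
print(looks_like_spam("http://test.de", "x" * 25))
print(looks_like_spam("https://test", "a normal comment with several words"))
```

Requiring both signs at once keeps the rule narrow. Note that a dotless host such as `localhost` would also count as suspicious here.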