View Issue Details
|ID||Project||Category||View Status||Date Submitted||Last Update|
|0005163||unreal||ircd||public||2018-11-16 22:47||2018-11-18 20:01|
|Target Version||Fixed in Version|
|Summary||0005163: spamfilter regex does not seem to match multibyte characters|
|Description||We're plagued with the infamous Freenodegate spam. An example of a spam line is:|
/?\ ATTN: Th?? ?h?nn?? h?? ?ov?? to?irc.freenode.n?t ?/jo?n /!?
This flood uses multibyte homoglyphs in place of standard ASCII characters.
A regex that works outside of unreal:
When tested on regex101.com using PCRE, it matches the example string above. However, when used in UnrealIRCd's spamfilter, it does not match.
|Steps To Reproduce||1) /spamfilter add -regex cN kill - Freenodegate_Spam [T?][?h][?i?][?s].+[?c?][h?][a?][n?][n?][e??][l?].+[h?][a?][s?].+[m ?][o??]v[e??][d?].+t[o??].+[i?]r[?c?].+fr[e??][?e?][n?][o??][d?][?e?]|
2) Send this string in a channel PRIVMSG: /?\ ATTN: Th?? ?h?nn?? h?? ?ov?? to?irc.freenode.n?t ?/jo?n /!?
The client is not killed even though the regex should match.
|Tags||No tags attached.|
|3rd party modules|
I must add that the strings appear mangled in MantisBT (lots of question marks where multibyte characters should be), but I have the original strings handy if needed.
Charset is UTF-8.
||Could you post them somewhere, like on pastebin or dpaste? Mantis indeed makes this rather unreadable.|
||Sure, here it is: http://dpaste.com/01909HH|
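I see. I think this is because the message is matched as a stream of bytes: the regex is written in UTF-8 but is not interpreted as UTF-8, so each multibyte character in a bracket expression is treated as several separate bytes.
By the way, looking at your spam example (this is the first I have heard of it), it looks quite easy to detect, for instance via a module.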
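To illustrate the byte-versus-character explanation above, here is a minimal sketch using Python's `re` module. This is not UnrealIRCd's regex engine, and the one-letter homoglyph (U+04BB standing in for "h") is an assumption chosen for the demonstration rather than taken from the original spam. It only shows how the same bracket expression can behave differently depending on whether the pattern and subject are treated as characters or as raw UTF-8 bytes.

```python
import re

# U+04BB (CYRILLIC SMALL LETTER SHHA) looks like "h" and is two bytes
# in UTF-8: 0xD2 0xBB. Chosen for illustration only.
HOMOGLYPH_H = "\u04bb"

pattern = "T[h" + HOMOGLYPH_H + "]is"
message = "T" + HOMOGLYPH_H + "is"

# Character-aware matching: the class holds two characters,
# "h" and U+04BB, so the homoglyph is consumed as one unit.
matches_as_text = re.search(pattern, message) is not None

# Byte-wise matching: the class now holds three single bytes
# (0x68, 0xD2, 0xBB). It consumes only the first byte of the
# homoglyph, and the literal "i" then fails against 0xBB.
matches_as_bytes = (
    re.search(pattern.encode("utf-8"), message.encode("utf-8")) is not None
)

print("character mode:", matches_as_text)
print("byte mode:", matches_as_bytes)
```

If the spamfilter behaves like the byte-mode case, that would be consistent with the regex matching on regex101.com but not on the server.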
I believe this GitHub commit, which adds a new module, covers some aspects of this bug: https://github.com/unrealircd/unrealircd/commit/793e82721812b3d87246e31ec77d521f2509b946
It probably does not cover every use case, but that is something that can be enhanced later on.
Right, so I have added the module that The_Myth linked to, called "antimixedutf8", which should solve your spam problem.
I don't think the other issue, the one this bug report is actually about, will be resolved anytime soon.
I'm not sure whether I should leave the bug report open, as presumably your problem will be solved by the module.
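The general idea of flagging text that mixes character sets can be sketched as follows. This is a rough heuristic written for illustration and is not the algorithm used by the antimixedutf8 module. The helper names `script_of` and `script_switches` are hypothetical, and taking the first word of the Unicode character name as a script label is a simplifying assumption.

```python
import unicodedata


def script_of(ch):
    """Return a rough script label for a letter, or None for non-letters.

    Assumption: the first word of the Unicode name (e.g. "LATIN",
    "CYRILLIC", "GREEK") is a good enough label for this sketch.
    """
    if not ch.isalpha():
        return None
    name = unicodedata.name(ch, "")
    return name.split(" ", 1)[0] if name else None


def script_switches(text):
    """Count transitions between scripts across consecutive letters."""
    switches = 0
    previous = None
    for ch in text:
        current = script_of(ch)
        if current is None:
            continue  # skip spaces, digits and punctuation
        if previous is not None and current != previous:
            switches += 1
        previous = current
    return switches


plain = "This channel has moved"
# U+04BB is a Cyrillic look-alike of "h", used here as an example.
mixed = "T\u04bbis c\u04bbannel"

print(script_switches(plain), script_switches(mixed))
```

A filter built on this idea would act once the count exceeds some threshold. Legitimate messages in multilingual channels can also switch scripts, so any such threshold would need tuning.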
Thanks for the feedback! I'll try and give the module a go when I can.
As for the bug report itself, that's really up to you. If there is a need to spamfilter actual UTF-8 characters (i.e. not just because they are mixed with ASCII), then it would probably be worth looking into. Even though the French language uses a few non-ASCII characters, I have never had such a need so far, but it might be quite different for other languages.
After the real-time typo fix (thank you Pegasus and Syzop), the module builds and works like a charm on 4.0.18.
Thanks for the great support! :D
|2018-11-16 22:47||Mr_Smoke||New Issue|
|2018-11-16 22:49||Mr_Smoke||Note Added: 0020377|
|2018-11-17 09:52||syzop||Note Added: 0020378|
|2018-11-17 18:50||Mr_Smoke||Note Added: 0020379|
|2018-11-18 13:09||syzop||Note Added: 0020380|
|2018-11-18 15:39||The_Myth||Note Added: 0020381|
|2018-11-18 18:24||syzop||Note Added: 0020382|
|2018-11-18 19:29||Mr_Smoke||Note Added: 0020383|
|2018-11-18 20:01||Mr_Smoke||Note Added: 0020384|