View Issue Details
ID | Project | Category | View Status | Date Submitted | Last Update |
---|---|---|---|---|---|
0005163 | unreal | ircd | public | 2018-11-16 22:47 | 2020-04-19 18:00 |
Reporter | Le_Coyote | Assigned To | syzop | ||
Priority | normal | Severity | minor | Reproducibility | always |
Status | resolved | Resolution | duplicate | ||
Product Version | 4.0.18 | ||||
Fixed in Version | 5.0.5 | ||||
Summary | 0005163: spamfilter regex does not seem to match multibyte characters | ||||
Description | We're plagued with the infamous Freenodegate spam. An example of a spam line is: /?\ ATTN: Th?? ?h?nn?? h?? ?ov?? to?irc.freenode.n?t ?/jo?n /!? This flood uses multibyte homoglyphs in place of standard ASCII characters. A regex that works outside of unreal: [T?][?h][?i?][?s].+[?c?][h?][a?][n?][n?][e??][l?].+[h?][a?][s?].+[m ?][o??]v[e??][d?].+t[o??].+[i?]r[?c?].+fr[e??][?e?][n?][o??][d?][?e?] When tested on regex101.com using PCRE, it matches the example string above. However, when used in unreal's spamfilter, it does not match. | ||||
Steps To Reproduce | 1) /spamfilter add -regex cN kill - Freenodegate_Spam [T?][?h][?i?][?s].+[?c?][h?][a?][n?][n?][e??][l?].+[h?][a?][s?].+[m ?][o??]v[e??][d?].+t[o??].+[i?]r[?c?].+fr[e??][?e?][n?][o??][d?][?e?] 2) Send this string in a channel PRIVMSG: /?\ ATTN: Th?? ?h?nn?? h?? ?ov?? to?irc.freenode.n?t ?/jo?n /!? The client is not killed even though the regex should match. | ||||
Tags | No tags attached. | ||||
3rd party modules | |||||
|
I must add that the strings appear mangled in MantisBT (lots of question marks where multibyte characters should be), but I have the original strings handy if needed. Charset is UTF-8. |
|
Could you post them somewhere, like on pastebin or dpaste? Mantis indeed makes this rather unreadable. |
|
Sure, here it is: http://dpaste.com/01909HH |
|
I see. I think this is because the character stream is handled as raw bytes, while the regex is written in UTF-8 but not interpreted as such. By the way, your spam example (this is the first time I've heard of it) looks quite easy to detect, for instance via a module. |
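
The bytes-versus-characters explanation above can be illustrated outside of UnrealIRCd. The following is a minimal sketch in Python, not the ircd's actual matching code: the sample line, the pattern, and the helper names are made up for illustration, with Cyrillic "е" (U+0435) assumed as the homoglyph for ASCII "e". It shows that a character class containing a multibyte character matches as expected when the engine works on characters, but only ever consumes a single byte when the same pattern and text are handled as UTF-8 bytes.

```python
import re

# Hypothetical spam fragment: both "e" letters replaced by Cyrillic U+0435.
LINE = "fr\u0435\u0435node"

# Character class listing the ASCII letter and its homoglyph.
CLASS_PATTERN = "fr[e\u0435][e\u0435]node"

# Same intent, written as an alternation instead of a character class.
ALTERNATION_PATTERN = "fr(?:e|\u0435)(?:e|\u0435)node"


def matches_as_text(pattern: str, line: str) -> bool:
    """Match with the engine working on Unicode characters."""
    return re.search(pattern, line) is not None


def matches_as_bytes(pattern: str, line: str) -> bool:
    """Match with pattern and line both treated as raw UTF-8 bytes."""
    return re.search(pattern.encode("utf-8"), line.encode("utf-8")) is not None


if __name__ == "__main__":
    print("class, text mode:       ", matches_as_text(CLASS_PATTERN, LINE))
    print("class, byte mode:       ", matches_as_bytes(CLASS_PATTERN, LINE))
    print("alternation, byte mode: ", matches_as_bytes(ALTERNATION_PATTERN, LINE))
```

In byte mode each `[...]` class consumes exactly one byte, so the two classes together only cover the first two-byte homoglyph and the match fails. In this Python sketch the alternation form still matches in byte mode, because each alternative is a whole byte sequence; whether the same workaround applies to a given spamfilter setup is an assumption that would need testing.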
|
I believe this GitHub commit covers some aspects of this bug (new module): https://github.com/unrealircd/unrealircd/commit/793e82721812b3d87246e31ec77d521f2509b946 ? Probably not all use cases are covered, but that's something that can be enhanced later on. |
|
Right, so I have added the module - which The_Myth pasted - called "antimixedutf8" which should solve your spam problem. I don't think the other thing, what this bug report is about, will be resolved anytime soon. Not sure if I should leave the bug report open or not, as presumably your problem will be solved by the module. |
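
For readers wondering what "detecting a mix" could look like in principle, here is a rough sketch in Python. It is not the antimixedutf8 module's implementation (that lives in the commit linked above); the function names and the scoring idea are assumptions made for illustration. It guesses each letter's script from the first word of its Unicode character name and counts how often the script changes inside a word.

```python
import unicodedata
from typing import Optional


def script_of(ch: str) -> Optional[str]:
    """Rough script guess from the Unicode name, e.g. 'LATIN' or 'CYRILLIC'."""
    if not ch.isalpha():
        return None
    try:
        return unicodedata.name(ch).split()[0]
    except ValueError:
        # Character has no name in the Unicode database.
        return None


def script_switches(text: str) -> int:
    """Count script changes between consecutive letters within each word."""
    switches = 0
    for word in text.split():
        previous = None
        for ch in word:
            current = script_of(ch)
            if current is None:
                continue  # skip digits, punctuation, unnamed characters
            if previous is not None and current != previous:
                switches += 1
            previous = current
    return switches


if __name__ == "__main__":
    print(script_switches("channel has moved"))   # plain ASCII text
    print(script_switches("ch\u0430nnel"))        # Cyrillic "а" inside a Latin word
    print(script_switches("d\u00e9j\u00e0 vu"))   # accented Latin letters
```

Accented Latin letters such as "é" still count as Latin under this heuristic, so ordinary French text would not be penalised, while a word that alternates between Latin and Cyrillic letters accumulates switches quickly. Any threshold on that count would be a tuning decision and is left out here.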
|
Thanks for the feedback! I'll try and give the module a go when I can. As for the bug report itself, that's really up to you. If there is a need to spamfilter actual UTF-8 characters (i.e. not just because there's a mix), then it would probably be worth looking into. Even though the French language has a few UTF-8 chars in use, I've never had such a need until now, but it might be quite different for other languages. |
|
After the real-time typo fix (thank you Pegasus and Syzop), the module builds and works like a charm on 4.0.18. Thanks for the great support! :D |
|
This is more or less a duplicate of 0005187. |
Date Modified | Username | Field | Change |
---|---|---|---|
2018-11-16 22:47 | Le_Coyote | New Issue | |
2018-11-16 22:49 | Le_Coyote | Note Added: 0020377 | |
2018-11-17 09:52 | syzop | Note Added: 0020378 | |
2018-11-17 18:50 | Le_Coyote | Note Added: 0020379 | |
2018-11-18 13:09 | syzop | Note Added: 0020380 | |
2018-11-18 15:39 | PeGaSuS | Note Added: 0020381 | |
2018-11-18 18:24 | syzop | Note Added: 0020382 | |
2018-11-18 19:29 | Le_Coyote | Note Added: 0020383 | |
2018-11-18 20:01 | Le_Coyote | Note Added: 0020384 | |
2020-04-19 18:00 | syzop | Assigned To | => syzop |
2020-04-19 18:00 | syzop | Status | new => resolved |
2020-04-19 18:00 | syzop | Resolution | open => duplicate |
2020-04-19 18:00 | syzop | Fixed in Version | => 5.0.5 |
2020-04-19 18:00 | syzop | Note Added: 0021513 |