View Issue Details

ID: 0005163
Project: unrealircd
Category:
View Status: public
Last Update: 2020-04-19 18:00
Reporter: Le_Coyote
Assigned To: syzop
Status: resolved
Resolution: duplicate
Product Version: 4.0.18
Fixed in Version: 5.0.5
Summary: 0005163: spamfilter regex does not seem to match multibyte characters
Description: We're plagued with the infamous Freenodegate spam. An example of a spam line is:
/?\ ATTN: Th?? ?h?nn?? h?? ?ov?? to?irc.freenode.n?t ?/jo?n /!?
This flood uses multibyte homoglyphs in place of standard ASCII characters.

A regex that works outside of unreal:
[T?][?h][?i?][?s].+[?c?][h?][a?][n?][n?][e??][l?].+[h?][a?][s?].+[m ?][o??]v[e??][d?].+t[o??].+[i?]r[?c?].+fr[e??][?e?][n?][o??][d?][?e?]

When tested with PCRE outside of unreal, it matches the example string above. However, when used in unreal's spamfilter, it does not match.
Steps To Reproduce:
1) /spamfilter add -regex cN kill - Freenodegate_Spam [T?][?h][?i?][?s].+[?c?][h?][a?][n?][n?][e??][l?].+[h?][a?][s?].+[m ?][o??]v[e??][d?].+t[o??].+[i?]r[?c?].+fr[e??][?e?][n?][o??][d?][?e?]
2) Send this string in a channel PRIVMSG: /?\ ATTN: Th?? ?h?nn?? h?? ?ov?? to?irc.freenode.n?t ?/jo?n /!?
The client is not killed even though the regex should match.
Tags: No tags attached.
3rd party modules



2018-11-16 22:49

reporter   ~0020377

I must add that the strings appear mangled in MantisBT (lots of question marks where multibyte characters should be), but I have the original strings handy if needed.
Charset is UTF-8.


2018-11-17 09:52

administrator   ~0020378

Could you post them somewhere, like on pastebin or dpaste? Mantis indeed makes this rather unreadable.


2018-11-17 18:50

reporter   ~0020379

Sure, here it is:


2018-11-18 13:09

administrator   ~0020380

I see. I think this is because the character stream is handled as raw bytes: the regex is written in UTF-8 but is not interpreted as such.
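
To illustrate that hypothesis, here is a minimal Python sketch. It is not UnrealIRCd code and makes no claim about how the spamfilter is actually implemented; it only shows how a character class containing a multibyte character behaves when pattern and subject are treated as raw bytes instead of decoded UTF-8 text. The pattern, the sample messages and the helper names (`matches_as_text`, `matches_as_bytes`) are made up for this example.

```python
import re

# Hypothetical pattern and messages: U+0422 (Cyrillic capital Te) stands in
# for "T" and U+04BB (Cyrillic shha) stands in for "h".
PATTERN = "[T\u0422][h\u04bb]is"
SPAM = "\u0422\u04bbis channel has moved"
PLAIN = "This channel has moved"


def matches_as_text(pattern: str, message: str) -> bool:
    """Match on decoded text, so each class entry is one whole character."""
    return re.search(pattern, message) is not None


def matches_as_bytes(pattern: str, message: str) -> bool:
    """Match on raw UTF-8 bytes, so each class entry is one single byte."""
    raw_pattern = pattern.encode("utf-8")
    raw_message = message.encode("utf-8")
    return re.search(raw_pattern, raw_message) is not None


if __name__ == "__main__":
    for label, msg in (("homoglyph", SPAM), ("plain ASCII", PLAIN)):
        print(label, matches_as_text(PATTERN, msg), matches_as_bytes(PATTERN, msg))
```

In byte mode the class `[T\u0422]` turns into a set of three individual bytes, so it consumes only the first byte of a two-byte character and the rest of the pattern no longer lines up. The plain ASCII line still matches either way, which would explain why the problem only shows up with homoglyphs. PCRE2 has a `PCRE2_UTF` compile option for exactly this distinction; whether and how UnrealIRCd sets it is not stated in this thread.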

By the way, looking at your spam example (this is the first time I've heard about it), it looks quite easy to detect, for example via a module.


2018-11-18 15:39

reporter   ~0020381

I believe this GitHub commit covers some aspects of this bug (new module): ?
Probably not all use cases are covered, but that can be improved later on.


2018-11-18 18:24

administrator   ~0020382

Right, so I have added the module - which The_Myth pasted - called "antimixedutf8" which should solve your spam problem.

I don't think the other thing, the actual subject of this bug report (spamfilter regexes not matching multibyte characters), will be resolved anytime soon.

Not sure if I should leave the bug report open or not, as presumably your problem will be solved by the module.
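
For readers wondering what "mixed UTF-8" detection could look like in principle, here is a rough Python sketch of the general idea: count how often a word switches between scripts. This is an illustration under stated assumptions, not the antimixedutf8 module's actual algorithm. The function names, the use of the first word of the Unicode character name as a script label, and the threshold are all invented for this example.

```python
import unicodedata
from typing import Optional


def script_of(ch: str) -> Optional[str]:
    """Return a rough script label for a letter, or None for non-letters.

    Assumption: the first word of the Unicode character name
    ("LATIN", "CYRILLIC", "GREEK", ...) is a good enough script label.
    """
    if not ch.isalpha():
        return None
    name = unicodedata.name(ch, "")
    return name.split(" ")[0] if name else None


def count_script_switches(text: str) -> int:
    """Count script changes between adjacent letters inside words."""
    switches = 0
    previous = None
    for ch in text:
        current = script_of(ch)
        if current is None:
            previous = None  # spaces and punctuation reset the comparison
            continue
        if previous is not None and current != previous:
            switches += 1
        previous = current
    return switches


def looks_like_mixed_script_spam(text: str, threshold: int = 2) -> bool:
    """Flag text with at least `threshold` switches (arbitrary cut-off)."""
    return count_script_switches(text) >= threshold


if __name__ == "__main__":
    # U+0456 and U+0441 are Cyrillic letters that resemble "i" and "c".
    print(looks_like_mixed_script_spam("Th\u0456s \u0441hannel has moved"))
    print(looks_like_mixed_script_spam("caf\u00e9 au lait"))
```

Because accented Latin letters such as "é" still carry the LATIN label, ordinary French text produces no switches under this heuristic, while text that swaps individual letters for look-alikes from another script does.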


2018-11-18 19:29

reporter   ~0020383

Thanks for the feedback! I'll try and give the module a go when I can.
As for the bug report itself, that's really up to you. If there is a need to spamfilter actual UTF-8 characters (i.e. not just because there's a mix), then it would probably be worth looking into. Even though the French language has a few UTF-8 chars in use, I've never had such a need until now, but it might be quite different for other languages.


2018-11-18 20:01

reporter   ~0020384

After the real-time typo fix (thank you Pegasus and Syzop), the module builds and works like a charm on 4.0.18.
Thanks for the great support! :D


2020-04-19 18:00

administrator   ~0021513

This is more or less a duplicate of 0005187.

Issue History

Date Modified Username Field Change
2018-11-16 22:47 Le_Coyote New Issue
2018-11-16 22:49 Le_Coyote Note Added: 0020377
2018-11-17 09:52 syzop Note Added: 0020378
2018-11-17 18:50 Le_Coyote Note Added: 0020379
2018-11-18 13:09 syzop Note Added: 0020380
2018-11-18 15:39 PeGaSuS Note Added: 0020381
2018-11-18 18:24 syzop Note Added: 0020382
2018-11-18 19:29 Le_Coyote Note Added: 0020383
2018-11-18 20:01 Le_Coyote Note Added: 0020384
2020-04-19 18:00 syzop Assigned To => syzop
2020-04-19 18:00 syzop Status new => resolved
2020-04-19 18:00 syzop Resolution open => duplicate
2020-04-19 18:00 syzop Fixed in Version => 5.0.5
2020-04-19 18:00 syzop Note Added: 0021513