View Issue Details

ID: 0005163
Project: unrealircd
Category:
View Status: public
Last Update: 2018-11-18 20:01
Reporter: Mr_Smoke
Assigned To:
Priority: normal
Severity: minor
Reproducibility: always
Status: new
Resolution: open
Product Version: 4.0.18
Target Version:
Fixed in Version:
Summary: 0005163: spamfilter regex does not seem to match multibyte characters
Description: We're plagued with the infamous Freenodegate spam. An example of a spam line is:
/?\ ATTN: Th?? ?h?nn?? h?? ?ov?? to?irc.freenode.n?t ?/jo?n /!?
This flood uses multibyte homoglyphs in place of standard ASCII characters.

A regex that works outside of unreal:
[T?][?h][?i?][?s].+[?c?][h?][a?][n?][n?][e??][l?].+[h?][a?][s?].+[m ?][o??]v[e??][d?].+t[o??].+[i?]r[?c?].+fr[e??][?e?][n?][o??][d?][?e?]

When tested on regex101.com using pcre, it matches the example string above. However, when used in unreal's spamfilter, it does not match.
Steps To Reproduce:
1) /spamfilter add -regex cN kill - Freenodegate_Spam [T?][?h][?i?][?s].+[?c?][h?][a?][n?][n?][e??][l?].+[h?][a?][s?].+[m ?][o??]v[e??][d?].+t[o??].+[i?]r[?c?].+fr[e??][?e?][n?][o??][d?][?e?]
2) Send this string in a channel PRIVMSG: /?\ ATTN: Th?? ?h?nn?? h?? ?ov?? to?irc.freenode.n?t ?/jo?n /!?
The client is not killed even though the regex should match.
Tags: No tags attached.
3rd party modules:

Activities

Mr_Smoke

2018-11-16 22:49

reporter   ~0020377

I must add that the strings appear mangled in MantisBT (lots of question marks where multibyte characters should be), but I have the original strings handy if needed.
Charset is UTF-8.

syzop

2018-11-17 09:52

administrator   ~0020378

Could you post them somewhere, like on pastebin or dpaste? Mantis indeed makes this rather unreadable.

Mr_Smoke

2018-11-17 18:50

reporter   ~0020379

Sure, here it is: http://dpaste.com/01909HH

syzop

2018-11-18 13:09

administrator   ~0020380

I see. I think this is because the character stream is handled as raw "bytes", while the regex is written in UTF-8 but not interpreted as such.

By the way, looking at your spam example (this is the first I've heard of it), it looks quite easy to detect, for example via a module.
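
To illustrate the "bytes" versus UTF-8 explanation above, here is a minimal sketch using Python's `re` module as a stand-in for the matcher. It is not UnrealIRCd code, and since the original homoglyphs were mangled into question marks in this report, the Cyrillic look-alikes below are assumptions chosen only for illustration. The names `PATTERN`, `SPAM`, `matches_as_text` and `matches_as_bytes` are hypothetical.

```python
import re

# Hypothetical stand-ins: the original homoglyphs were lost to "?" in the
# tracker, so these Cyrillic look-alikes are assumptions for illustration.
PATTERN = "[T\u0422][h\u04bb][i\u0456][s\u0455]"
SPAM = "\u0422\u04bb\u0456\u0455"  # renders like "This"


def matches_as_text(pattern: str, text: str) -> bool:
    """Match with both sides decoded, so each class member is one code point."""
    return re.search(pattern, text) is not None


def matches_as_bytes(pattern: str, text: str) -> bool:
    """Match on raw UTF-8 bytes, so each class member is a single byte."""
    return re.search(pattern.encode("utf-8"), text.encode("utf-8")) is not None


if __name__ == "__main__":
    print("text mode:", matches_as_text(PATTERN, SPAM))
    print("byte mode:", matches_as_bytes(PATTERN, SPAM))
```

In text mode the four character classes consume the four homoglyphs and the pattern matches. In byte mode each two-byte character is split into two separate class members, and every class still consumes only one byte, so four classes cannot line up with the eight bytes of the message and the search fails.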

The_Myth

2018-11-18 15:39

reporter   ~0020381

I believe this GitHub commit (a new module) covers some aspects of this bug: https://github.com/unrealircd/unrealircd/commit/793e82721812b3d87246e31ec77d521f2509b946
It probably does not cover all use cases, but that can be enhanced later on.

syzop

2018-11-18 18:24

administrator   ~0020382

Right, so I have added the module - which The_Myth pasted - called "antimixedutf8" which should solve your spam problem.

I don't think the underlying issue, which is what this bug report is actually about, will be resolved anytime soon.

Not sure if I should leave the bug report open or not, as presumably your problem will be solved by the module.
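
As a rough idea of how text that mixes scripts could be detected, the sketch below counts how often consecutive letters switch between Unicode scripts. This is not the antimixedutf8 implementation; the heuristic, the function names `script_of` and `count_script_switches`, and the sample strings are all assumptions made for illustration. The script label is approximated by the first word of the Unicode character name.

```python
import unicodedata


def script_of(ch: str) -> str:
    """Rough script label: the first word of the Unicode character name."""
    return unicodedata.name(ch, "UNKNOWN").split()[0]


def count_script_switches(text: str) -> int:
    """Count how often consecutive letters belong to different scripts."""
    switches = 0
    previous = None
    for ch in text:
        if not ch.isalpha():
            continue  # ignore spaces, digits and punctuation
        current = script_of(ch)
        if previous is not None and current != previous:
            switches += 1
        previous = current
    return switches


if __name__ == "__main__":
    # Hypothetical sample: Latin text with two Cyrillic look-alike letters.
    mixed = "Th\u0456s \u0441hannel"
    print(count_script_switches("This channel"))
    print(count_script_switches(mixed))
```

Plain Latin or plain Cyrillic text produces no switches, while a line that swaps individual letters for look-alikes switches scripts repeatedly. A real filter would need a threshold tuned to avoid flagging legitimate multilingual messages.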

Mr_Smoke

2018-11-18 19:29

reporter   ~0020383

Thanks for the feedback! I'll try and give the module a go when I can.
As for the bug report itself, that's really up to you. If there is a need to spamfilter actual UTF-8 characters (i.e. not just because they are mixed with other scripts), then it would probably be worth looking into. Even though the French language has a few UTF-8 chars in use, I've never had such a need until now, but it might be quite different for other languages.

Mr_Smoke

2018-11-18 20:01

reporter   ~0020384

After the real-time typo fix (thank you Pegasus and Syzop), the module builds and works like a charm on 4.0.18.
Thanks for the great support! :D

Issue History

Date Modified Username Field Change
2018-11-16 22:47 Mr_Smoke New Issue
2018-11-16 22:49 Mr_Smoke Note Added: 0020377
2018-11-17 09:52 syzop Note Added: 0020378
2018-11-17 18:50 Mr_Smoke Note Added: 0020379
2018-11-18 13:09 syzop Note Added: 0020380
2018-11-18 15:39 The_Myth Note Added: 0020381
2018-11-18 18:24 syzop Note Added: 0020382
2018-11-18 19:29 Mr_Smoke Note Added: 0020383
2018-11-18 20:01 Mr_Smoke Note Added: 0020384