View Issue Details
ID | Project | Category | View Status | Date Submitted | Last Update |
---|---|---|---|---|---|
0005187 | unreal | installing | public | 2019-01-03 06:09 | 2023-09-09 12:27 |
Reporter | kieseen | Assigned To | syzop | ||
Priority | normal | Severity | minor | Reproducibility | always |
Status | resolved | Resolution | fixed | ||
Platform | UnrealIRCd-4.2.1 | OS | Linux (Ubuntu Server) | OS Version | 18.04 |
Product Version | 4.2.1 | ||||
Fixed in Version | 6.1.2-rc1 | ||||
Summary | 0005187: Default configuration can't use \p{} in spamfilter regexes | ||||
Description | Hi, We've been trying to generalise our spam filters to deal with future spam waves with less maintenance. We found that a potentially better compile-time default would be to drop --disable-unicode, so that submitted regexes can use "\p{}" to access and match against a character's UCD properties (like "Latin", "Sc", "Common", or any specified script name). With the current default, such a regex is rejected with an error: "Error in regex '\p{Common}': this version of PCRE2 does not have support for \P, \p, or \X (at character #2)." https://www.pcre.org/current/doc/html/pcre2syntax.html#SEC5 We saw a recommendation to use the antimixedutf8 module against spam, but since it only checks for Latin and Cyrillic, and confusable characters sit in many other ranges, it may not be useful in future spam waves: https://unicode.org/cldr/utility/confusables.jsp?a=aAbBCdDeEfFgGhHiIjJkKlLmMnNoOpPqQrRsStTuUvVwWxXyYzZ&r=None Would building without "--disable-unicode" have any big disadvantages? | ||||
Steps To Reproduce | 1. Compile with default configuration file and settings 2. Run the ircd and connect to the server with an O-line 3. Attempt to run the command "/spamfilter add -regex p warn - PCREPropertyTest \p{Common}" (or prefix it with /quote if your client requires it) | ||||
Tags | pcre, spamfilter, unicode | ||||
3rd party modules | |||||
|
Did you check if it works, if compiled without --disable-unicode? |
|
Yes, it does, this report is about suggesting a better default configuration. |
|
Right, so, just to double check: you tested with other scripts (non-latin) and it correctly identified the character classes using https://www.pcre.org/current/doc/html/pcre2syntax.html#SEC5 ? I ask because some other PCRE2 features don't work like that, even if you enable this option. See 0005163 (though that is about the matcher, the regex, and not the target, the privmsg). |
|
In pcre2_compile we set PCRE2_NEVER_UTF and PCRE2_NEVER_UCP, and similarly we do not set PCRE2_UTF. So yeah, I'm wondering if this would work without changes in those areas. And, related, what would happen if we were to change the above. Taking into account that IRC deals with streams of bytes, which may be latin1, utf8, or some completely different character set. It would be nice if all this already worked just by removing --disable-unicode, but.. that would surprise me. |
|
Two things come to mind: 1) With non-utf8-non-latin character sets (eg: some Russian codepage), it may be possible that characters are misinterpreted (seen as something they are not). This may lead to odd matches and non-matches. How much of a problem would this be? 2) How does PCRE2 deal with "invalid UTF8 characters"? Will it entirely fail to run the regex? Will it skip the character as if it was not there? What will it do? I see https://www.pcre.org/current/doc/html/pcre2unicode.html which says: "When the PCRE2_UTF option is set, the strings passed as patterns and subjects are (by default) checked for validity on entry to the relevant functions. If an invalid UTF string is passed, a negative error code is returned." So yeah, it will fail to run the regex. Or you can set PCRE2_NO_UTF_CHECK: "If you pass an invalid UTF string when PCRE2_NO_UTF_CHECK is set, the result is undefined and your program may crash or loop indefinitely." .. but this does not sound like a good idea :D Hope the above comments help to give you a better picture of the situation we are in. And, again, if somehow the functionality you describe works by just removing --disable-unicode and not having to deal with all the rest, then yeah.. that would be nice... but.. would surprise me. |
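As a rough illustration of the two concerns above, here is a minimal Python sketch. It does not call PCRE2; it only uses Python's strict UTF-8 decoder as a stand-in for the kind of validity check PCRE2 performs when PCRE2_UTF is set. The helper name `is_valid_utf8`, the sample word, and the choice of KOI8-R as "some Russian codepage" are all assumptions made for this example.

```python
def is_valid_utf8(raw: bytes) -> bool:
    """Return True if the byte string decodes as strict UTF-8."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# The same Russian word ("privet") as it might arrive from two different clients.
text = "\u043f\u0440\u0438\u0432\u0435\u0442"
as_utf8 = text.encode("utf-8")    # client configured for UTF-8
as_koi8 = text.encode("koi8_r")   # client using a legacy Russian codepage

# Concern 2: a strict check rejects the legacy bytes outright.
print(is_valid_utf8(as_utf8))
print(is_valid_utf8(as_koi8))

# Concern 1: treating the legacy bytes as Latin-1 never fails,
# but every character is seen as something it is not.
misread = as_koi8.decode("latin-1")
print(misread == text)
```

The sketch only shows why a byte stream of unknown encoding is awkward for a UTF-aware matcher: either the input is rejected, or it is accepted under the wrong interpretation and the regex matches against the wrong characters.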
|
Good news, spamfilter is now UTF8 aware. This is not in 5.0.4 but should be in 5.0.5. See https://github.com/unrealircd/unrealircd/commit/bc70882bd3935be728b953f4252a94f9de6ff3f6 |
|
A month later (May 2020) we had to revert this again. But this issue was never updated with that information. Copy/paste from an IRC conversation of someone asking why \p was not working: "it's a complex subject. it starts with the fact that IRC is not 100% guaranteed valid UTF8. https://github.com/unrealircd/unrealircd/commit/bc70882bd3935be728b953f4252a94f9de6ff3f6 will probably interest you, this added \p support a year ago. unfortunately we had to disable it again because PCRE2 crashed/looped. PCRE2 devs basically said oh yeah this is new etc... since then i have not trusted it anymore and not put it back in. the retraction was mentioned in https://forums.unrealircd.org/viewtopic.php?f=1&t=9013 and the upstream bug was here: https://bugs.exim.org/show_bug.cgi?id=2581 'The invalid utf support is a new feature so there might be issues with it, but we will try to fix them.' i did not want unrealircd to be a test project for them :D clearly hanging/looping/crashing is not acceptable, especially for a long running process like an ircd. antimixedutf8 may interest you though" |
|
We should ask the upstream PCRE2 developers whether they now run PCRE2_MATCH_INVALID_UTF through a tester and fuzzer with random data. That would give us a path to getting this tested (by them, not by our random users crashing/hanging) and eventually back into UnrealIRCd again. |
|
Another attempt for UnrealIRCd 6.0.7(-git). Any feedback would be appreciated :) https://github.com/unrealircd/unrealircd/commit/4b4562516c44650661de47e6f7eb888b738f09ea commit 4b4562516c44650661de47e6f7eb888b738f09ea (HEAD -> unreal60_dev, origin/unreal60_dev, origin/HEAD) Author: Bram Matthys Date: Wed Mar 22 08:56:08 2023 +0100 Another attempt at UTF8-aware spamfilter. This was previously tried at 19-apr-2020 in bc70882bd3935be728b953f4252a94f9de6ff3f6 in UnrealIRCd 5.0.5. Sadly it had to be reverted immediately with a quick 5.0.5.1 release, all because of a PCRE2 100% CPU usage. Since then that bug has been fixed, plus another bug. I'm now readding it "as an option" that is marked experimental. Hopefully people test it out and can report back if it works well and then we can make it the default someday. This makes it a runtime setting so makes it much easier to switch back/forth if there are any issues without recompiling anything. Had to use a bit more code now though to handle the recompiling of spamfilters if the setting is changed. Original issue was https://bugs.unrealircd.org/view.php?id=5187 * [Spamfilter](https://www.unrealircd.org/docs/Spamfilter) can be made UTF8-aware. * This is experimental, to enable: `set { spamfilter { utf8 yes; } }` * Case insensitive matches will then work better. For example, with extended Latin, a spamfilter on `ę` then also matches `Ę`. * Other PCRE2 features such as [\p](https://www.pcre.org/current/doc/html/pcre2syntax.html#SEC5) can then be used. For example you can then set a spamfilter with the regex `\p{Arabic}` to block all Arabic script. Please do use these new tools with care. Blocking an entire language or script is quite a drastic measure. * As a consequence of this we require PCRE2 10.36 or newer. If your system PCRE2 is older, this will mean the UnrealIRCd-shipped-library version will be compiled and `./Config` may take a little longer than usual. |
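The case-insensitivity and `\p{Arabic}` points in the commit message above can be illustrated with a small Python sketch. This is an analogy only, not UnrealIRCd or PCRE2 code: a bytes pattern stands in for the byte-oriented (non-UTF8) mode and a str pattern for the UTF8-aware mode. Python's `re` module has no `\p{...}` support, so the hypothetical helper `contains_arabic` merely approximates `\p{Arabic}` by inspecting Unicode character names. That is an assumption of this example and not how PCRE2 determines scripts.

```python
import re
import unicodedata

lower = "\u0119"  # LATIN SMALL LETTER E WITH OGONEK
upper = "\u0118"  # LATIN CAPITAL LETTER E WITH OGONEK

# Byte-oriented matching: case folding only covers ASCII letters,
# so the UTF-8 encodings of the two letters never compare equal.
byte_match = re.search(lower.encode("utf-8"), upper.encode("utf-8"), re.IGNORECASE)

# Unicode-aware matching: the two letters are folded to the same case.
text_match = re.search(lower, upper, re.IGNORECASE)


def contains_arabic(text: str) -> bool:
    """Rough stand-in for \\p{Arabic}: True if any character name starts with ARABIC."""
    return any(unicodedata.name(ch, "").startswith("ARABIC") for ch in text)


print(byte_match is not None)
print(text_match is not None)
print(contains_arabic("\u0645\u0631\u062d\u0628\u0627"))  # an Arabic greeting
print(contains_arabic("hello"))
```

In UnrealIRCd itself none of this is needed: per the commit message, the behaviour is switched on with the `set { spamfilter { utf8 yes; } }` setting and the matching is done by PCRE2.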
|
In 6.1.2-rc1 we now default to 'yes' for utf8 support. |
Date Modified | Username | Field | Change |
---|---|---|---|
2019-01-03 06:09 | kieseen | New Issue | |
2019-01-03 06:09 | kieseen | Tag Attached: pcre | |
2019-01-03 06:09 | kieseen | Tag Attached: spamfilter | |
2019-01-03 06:09 | kieseen | Tag Attached: unicode | |
2019-01-03 08:41 | syzop | Note Added: 0020446 | |
2019-01-03 09:31 | kieseen | Note Added: 0020447 | |
2019-01-03 09:42 | syzop | Note Added: 0020448 | |
2019-01-03 09:57 | syzop | Note Added: 0020449 | |
2019-01-03 09:59 | syzop | Note Edited: 0020449 | |
2019-01-03 10:07 | syzop | Note Added: 0020450 | |
2019-01-03 10:08 | syzop | Note Edited: 0020450 | |
2020-04-19 18:01 | syzop | Assigned To | => syzop |
2020-04-19 18:01 | syzop | Status | new => resolved |
2020-04-19 18:01 | syzop | Resolution | open => fixed |
2020-04-19 18:01 | syzop | Fixed in Version | => 5.0.5 |
2020-04-19 18:02 | syzop | Note Added: 0021514 | |
2021-06-16 15:17 | syzop | Note Added: 0022014 | |
2021-06-16 15:17 | syzop | Status | resolved => confirmed |
2021-06-16 15:17 | syzop | Fixed in Version | 5.0.5 => |
2021-06-16 15:18 | syzop | Note Added: 0022015 | |
2021-06-16 15:20 | syzop | Note Edited: 0022015 | |
2023-03-22 09:01 | syzop | Note Added: 0022803 | |
2023-09-09 12:27 | syzop | Status | confirmed => resolved |
2023-09-09 12:27 | syzop | Fixed in Version | => 6.1.2-rc1 |
2023-09-09 12:27 | syzop | Note Added: 0023031 |