View Issue Details

ID: 0005187
Project: unreal
Category: installing
View Status: public
Last Update: 2023-09-09 12:27
Reporter: kieseen
Assigned To: syzop
Priority: normal
Severity: minor
Reproducibility: always
Status: resolved
Resolution: fixed
Platform: UnrealIRCd-4.2.1
OS: Linux (Ubuntu Server)
OS Version: 18.04
Product Version: 4.2.1
Fixed in Version: 6.1.2-rc1
Summary: 0005187: Default configuration can't use \p{} in spamfilter regexes

Description:
Hi,

We've been trying to generalise our spam filters to deal with future spam waves with less maintenance. We found that a better default for compiling might be to remove --disable-unicode, so that submitted regexes can use "\p{}" to match against a character's UCD properties (like "Latin", "Sc", "Common", or any specified script name). With the current default, such a regex is rejected with the error: "Error in regex '\p{Common}': this version of PCRE2 does not have support for \P, \p, or \X (at character #2)."

https://www.pcre.org/current/doc/html/pcre2syntax.html#SEC5

We saw a recommendation to use the antimixedutf8 module against spam, but since it only checks for Latin and Cyrillic, and confusable characters sit in many other ranges, it may not be useful in future spam waves: https://unicode.org/cldr/utility/confusables.jsp?a=aAbBCdDeEfFgGhHiIjJkKlLmMnNoOpPqQrRsStTuUvVwWxXyYzZ&r=None

Would building without "--disable-unicode" have any big disadvantages?
Steps To Reproduce:
1. Compile with default configuration file and settings
2. Run the ircd and connect to the server with an O-line
3. Attempt to run the command "/spamfilter add -regex p warn - PCREPropertyTest \p{Common}" (or prefix it with /quote if your client requires it)

Tags: pcre, spamfilter, unicode
3rd party modules

Activities

syzop

2019-01-03 08:41

administrator   ~0020446

Did you check if it works, if compiled without --disable-unicode?

kieseen

2019-01-03 09:31

reporter   ~0020447

Yes, it does. This report is about suggesting a better default configuration.

syzop

2019-01-03 09:42

administrator   ~0020448

Right, so, just to double check: you tested with other scripts (non-Latin) and it correctly identified the character classes using https://www.pcre.org/current/doc/html/pcre2syntax.html#SEC5 ?

I ask because some other PCRE2 features don't work like that, even if you enable this option. See 0005163 (though that is about the matcher, the regex, and not the target, the privmsg).

syzop

2019-01-03 09:57

administrator   ~0020449

Last edited: 2019-01-03 09:59

In pcre2_compile we set PCRE2_NEVER_UTF and PCRE2_NEVER_UCP, and similarly we do not set PCRE2_UTF.
So yeah, I'm wondering if this would work without changes in those areas.

And, related: what would happen if we were to change the above, taking into account that IRC deals with streams of bytes, which may be latin1, utf8, or some completely different character set?

It would be nice if all this already worked just by removing --disable-unicode, but.. that would surprise me.

syzop

2019-01-03 10:07

administrator   ~0020450

Last edited: 2019-01-03 10:08

Two things come to mind:
1) With non-utf8-non-latin character sets (eg: some Russian codepage), it may be possible that characters are misinterpreted (seen as something they are not). This may lead to odd matches and non-matches. How much of a problem would this be?
2) How does PCRE2 deal with "invalid UTF8 characters"? Will it entirely fail to run the regex? Will it skip the character as if it was not there? What will it do?

I see that https://www.pcre.org/current/doc/html/pcre2unicode.html says:
"When the PCRE2_UTF option is set, the strings passed as patterns and subjects are (by default) checked for validity on entry to the relevant functions. If an invalid UTF string is passed, a negative error code is returned."
So yeah, it will fail to run the regex.

Or you can set PCRE2_NO_UTF_CHECK:
"If you pass an invalid UTF string when PCRE2_NO_UTF_CHECK is set, the result is undefined and your program may crash or loop indefinitely."
.. but this does not sound like a good idea :D

Hope the above comments help to give you a better picture of the situation we are in.

And, again, if somehow the functionality you describe works by just removing --disable-unicode and not having to deal with all the rest, then yeah.. that would be nice... but.. would surprise me.
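
To make the two concerns above concrete, here is a minimal sketch in Python (not UnrealIRCd or PCRE2 code; the helper name `is_valid_utf8` and the sample strings are made up for illustration). It shows that bytes from a non-UTF8 character set are often rejected by a strict UTF-8 check, and that some byte sequences pass the check but mean a different character than the sender intended.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes decode as strict UTF-8 (hypothetical helper)."""
    try:
        data.decode("utf-8")  # strict error handling is the default
        return True
    except UnicodeDecodeError:
        return False


# The same Russian word sent by a UTF-8 client and by a KOI8-R client.
utf8_line = "привет".encode("utf-8")
koi8_line = "привет".encode("koi8_r")

# Two Latin-1 characters whose bytes happen to form one valid UTF-8 character.
ambiguous = "Ã©".encode("latin-1")  # the bytes C3 A9

print(is_valid_utf8(utf8_line))   # True
print(is_valid_utf8(koi8_line))   # False: a strict UTF-8 matcher would refuse this subject
print(is_valid_utf8(ambiguous))   # True, but ...
print(ambiguous.decode("utf-8"))  # ... it is read as "é", not the "Ã©" that was sent
```

The KOI8-R line corresponds to point 2 (the subject is not valid UTF-8 at all), and the last case corresponds to point 1 (valid UTF-8, but misinterpreted).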

syzop

2020-04-19 18:02

administrator   ~0021514

Good news, spamfilter is now UTF8 aware. This is not in 5.0.4 but should be in 5.0.5.
See https://github.com/unrealircd/unrealircd/commit/bc70882bd3935be728b953f4252a94f9de6ff3f6

syzop

2021-06-16 15:17

administrator   ~0022014

A month later (May 2020) we had to revert this again. But this issue was never updated with that information.

Copy/paste from an IRC conversation of someone asking why \p was not working:
it's a complex subject. it starts with the fact that IRC is not 100% guaranteed valid UTF8
https://github.com/unrealircd/unrealircd/commit/bc70882bd3935be728b953f4252a94f9de6ff3f6 will probably interest you, this added \p support a year ago
unfortunately we had to disable it again because PCRE2 crashed/looped. PCRE2 devs basically said oh yeah this is new etc...
since then i have not trusted it anymore and not put it back in
the retraction was mentioned in https://forums.unrealircd.org/viewtopic.php?f=1&t=9013
the upstream bug was here https://bugs.exim.org/show_bug.cgi?id=2581
"The invalid utf support is a new feature so there might be issues with it, but we will try to fix them."
i did not want unrealircd to be a test project for them :D
clearly hanging/looping/crashing is not acceptable, especially for a long running process like an ircd
antimixedutf8 may interest you though

syzop

2021-06-16 15:18

administrator   ~0022015

Last edited: 2021-06-16 15:20

We should ask the PCRE2 guys upstream whether they now run PCRE2_MATCH_INVALID_UTF through a tester and fuzzer with random data.

That would give us a path to getting this tested (by them, not by our random users crashing/hanging) and eventually back into unrealircd again.

syzop

2023-03-22 09:01

administrator   ~0022803

Another attempt for UnrealIRCd 6.0.7(-git). Any feedback would be appreciated :)

https://github.com/unrealircd/unrealircd/commit/4b4562516c44650661de47e6f7eb888b738f09ea

commit 4b4562516c44650661de47e6f7eb888b738f09ea (HEAD -> unreal60_dev, origin/unreal60_dev, origin/HEAD)
Author: Bram Matthys
Date: Wed Mar 22 08:56:08 2023 +0100

    Another attempt at UTF8-aware spamfilter.
    
    This was previously tried at 19-apr-2020 in bc70882bd3935be728b953f4252a94f9de6ff3f6
    in UnrealIRCd 5.0.5. Sadly it had to be reverted immediately with a quick 5.0.5.1
    release, all because of a PCRE2 100% CPU usage. Since then that bug has been fixed,
    plus another bug. I'm now readding it "as an option" that is marked experimental.
    Hopefully people test it out and can report back if it works well and then we can
    make it the default someday.
    
    This makes it a runtime setting so makes it much easier to switch back/forth if
    there are any issues without recompiling anything. Had to use a bit more code now
    though to handle the recompiling of spamfilters if the setting is changed.
    
    Original issue was https://bugs.unrealircd.org/view.php?id=5187
    
    * [Spamfilter](https://www.unrealircd.org/docs/Spamfilter) can be made UTF8-aware.
      * This is experimental, to enable: `set { spamfilter { utf8 yes; } }`
      * Case insensitive matches will then work better. For example, with extended
        Latin, a spamfilter on `ę` then also matches `Ę`.
      * Other PCRE2 features such as [\p](https://www.pcre.org/current/doc/html/pcre2syntax.html#SEC5)
        can then be used. For example you can then set a spamfilter with the regex
        `\p{Arabic}` to block all Arabic script.
        Please do use these new tools with care. Blocking an entire language
        or script is quite a drastic measure.
      * As a consequence of this we require PCRE2 10.36 or newer. If your system
        PCRE2 is older than this, the UnrealIRCd-shipped library version
        will be compiled and `./Config` may take a little longer than usual.
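
The case-insensitivity point in the list above can be illustrated with a small sketch. It uses Python's `re` module as a stand-in, not PCRE2 and not the UnrealIRCd code, so it only demonstrates the general difference between byte-oriented and character-oriented matching; the variable names are made up for the example.

```python
import re

lower = "ę"  # U+0119, UTF-8 bytes C4 99
upper = "Ę"  # U+0118, UTF-8 bytes C4 98

# Byte-oriented matching: pattern and subject are raw UTF-8 bytes.
# For bytes patterns, re.IGNORECASE only folds ASCII letters.
byte_filter = re.compile(lower.encode("utf-8"), re.IGNORECASE)
byte_hit = byte_filter.search(upper.encode("utf-8")) is not None

# Character-oriented matching: pattern and subject are decoded text,
# so case folding also covers non-ASCII letters.
text_filter = re.compile(lower, re.IGNORECASE)
text_hit = text_filter.search(upper) is not None

print(byte_hit)  # False
print(text_hit)  # True
```

A matcher that treats the message as plain bytes sees two different byte sequences and has no way to know they are the same letter in different case, which is why a filter on `ę` only catches `Ę` once the matcher is UTF8-aware.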

syzop

2023-09-09 12:27

administrator   ~0023031

In 6.1.2-rc1 we now default to 'yes' for utf8 support.

Issue History

Date Modified Username Field Change
2019-01-03 06:09 kieseen New Issue
2019-01-03 06:09 kieseen Tag Attached: pcre
2019-01-03 06:09 kieseen Tag Attached: spamfilter
2019-01-03 06:09 kieseen Tag Attached: unicode
2019-01-03 08:41 syzop Note Added: 0020446
2019-01-03 09:31 kieseen Note Added: 0020447
2019-01-03 09:42 syzop Note Added: 0020448
2019-01-03 09:57 syzop Note Added: 0020449
2019-01-03 09:59 syzop Note Edited: 0020449
2019-01-03 10:07 syzop Note Added: 0020450
2019-01-03 10:08 syzop Note Edited: 0020450
2020-04-19 18:01 syzop Assigned To => syzop
2020-04-19 18:01 syzop Status new => resolved
2020-04-19 18:01 syzop Resolution open => fixed
2020-04-19 18:01 syzop Fixed in Version => 5.0.5
2020-04-19 18:02 syzop Note Added: 0021514
2021-06-16 15:17 syzop Note Added: 0022014
2021-06-16 15:17 syzop Status resolved => confirmed
2021-06-16 15:17 syzop Fixed in Version 5.0.5 =>
2021-06-16 15:18 syzop Note Added: 0022015
2021-06-16 15:20 syzop Note Edited: 0022015
2023-03-22 09:01 syzop Note Added: 0022803
2023-09-09 12:27 syzop Status confirmed => resolved
2023-09-09 12:27 syzop Fixed in Version => 6.1.2-rc1
2023-09-09 12:27 syzop Note Added: 0023031