View Issue Details

ID: 0005187
Project: unreal
Category: installing
View Status: public
Last Update: 2021-06-16 15:20
Reporter: kieseen
Assigned To: syzop
Priority: normal
Severity: minor
Reproducibility: always
Status: confirmed
Resolution: fixed
Platform: UnrealIRCd-4.2.1
OS: Linux (Ubuntu Server)
OS Version: 18.04
Product Version: 4.2.1
Summary: 0005187: Default configuration can't use \p{} in spamfilter regexes
Description:
Hi,

We've been trying to generalise our spam filters to deal with future spam waves with less maintenance. We found that a better default for compiling would be to drop --disable-unicode, so that submitted regexes can use "\p{}" to match against a character's UCD properties (such as "Latin", "Sc", "Common", or any specified script name). With the current default, the server responds with an error: "Error in regex '\p{Common}': this version of PCRE2 does not have support for \P, \p, or \X (at character #2)."

https://www.pcre.org/current/doc/html/pcre2syntax.html#SEC5

We saw a recommendation that the antimixedutf8 module is used against spam, but since it only checks for Latin and Cyrillic, and confusable characters sit in many other ranges, it may not be useful in future spam waves: https://unicode.org/cldr/utility/confusables.jsp?a=aAbBCdDeEfFgGhHiIjJkKlLmMnNoOpPqQrRsStTuUvVwWxXyYzZ&r=None
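To illustrate the confusables concern, here is a rough sketch in Python (standard library only, not UnrealIRCd code). The helper name `rough_scripts` is hypothetical, and it uses the first word of each character's Unicode name as a stand-in for the Script property, which is only an approximation of what \p{...} would test.

```python
import unicodedata


def rough_scripts(text):
    # Collect a rough script label for every letter in the text.
    # Assumption: the first word of the Unicode character name (LATIN,
    # CYRILLIC, GREEK, ...) approximates the script. Names such as
    # "FULLWIDTH LATIN ..." show that this is not the real Script property.
    labels = set()
    for ch in text:
        if unicodedata.category(ch).startswith("L"):
            labels.add(unicodedata.name(ch, "UNKNOWN").split()[0])
    return labels


plain = rough_scripts("paypal")
cyrillic_mix = rough_scripts("p\u0430ypal")  # U+0430 CYRILLIC SMALL LETTER A
greek_mix = rough_scripts("p\u03b1ypal")     # U+03B1 GREEK SMALL LETTER ALPHA

# General categories such as "Sc" (currency symbol) are exposed directly.
euro_category = unicodedata.category("\u20ac")
```

A check limited to Latin and Cyrillic would flag the second string but not the third, which is the gap described above.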

Would building without "--disable-unicode" have any big disadvantages?
Steps To Reproduce:
1. Compile with default configuration file and settings
2. Run the ircd and connect to the server with an O-line
3. Attempt to run the command "/spamfilter add -regex p warn - PCREPropertyTest \p{Common}" (or prefix it with /quote if your client requires that)
Tags: pcre, spamfilter, unicode
3rd party modules

Activities

syzop

2019-01-03 08:41

administrator   ~0020446

Did you check if it works, if compiled without --disable-unicode?

kieseen

2019-01-03 09:31

reporter   ~0020447

Yes, it does, this report is about suggesting a better default configuration.

syzop

2019-01-03 09:42

administrator   ~0020448

Right, so, just to double check: you tested with other scripts (non-Latin) and it correctly identified the character classes using https://www.pcre.org/current/doc/html/pcre2syntax.html#SEC5 ?

I ask, because, some other PCRE2 features don't work like that, even if you enable this option. See 0005163 (though that is about the matcher, the regex, and not the target, the privmsg).

syzop

2019-01-03 09:57

administrator   ~0020449

Last edited: 2019-01-03 09:59

In pcre2_compile we set PCRE2_NEVER_UTF and PCRE2_NEVER_UCP, and similarly we do not set PCRE2_UTF.
So yeah, I'm wondering if this would work without changes in those areas.

And, related, what would happen if we were to change the above. Taking into account that IRC deals with streams of bytes, which may be latin1, utf8, or some completely different character set.

It would be nice if all this already worked just by removing --disable-unicode, but.. that would surprise me.

syzop

2019-01-03 10:07

administrator   ~0020450

Last edited: 2019-01-03 10:08

Two things come to mind:
1) With non-utf8-non-latin character sets (eg: some Russian codepage), it may be possible that characters are misinterpreted (seen as something they are not). This may lead to odd matches and non-matches. How much of a problem would this be?
2) How does PCRE2 deal with "invalid UTF8 characters"? Will it entirely fail to run the regex? Will it skip the character as if it was not there? What will it do?

I see https://www.pcre.org/current/doc/html/pcre2unicode.html
When the PCRE2_UTF option is set, the strings passed as patterns and subjects are (by default) checked for validity on entry to the relevant functions. If an invalid UTF string is passed, a negative error code is returned.
So yeah, it will fail to run the regex.

Or you can set PCRE2_NO_UTF_CHECK:
If you pass an invalid UTF string when PCRE2_NO_UTF_CHECK is set, the result is undefined and your program may crash or loop indefinitely.
.. but this does not sound like a good idea :D

Hope the above comments help to give you a better picture of the situation we are in.

And, again, if somehow the functionality you describe works by just removing --disable-unicode and not having to deal with all the rest, then yeah.. that would be nice... but.. would surprise me.
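The two concerns above can be shown with a small Python sketch (standard library only, not UnrealIRCd or PCRE2 code). The helper `decode_if_valid_utf8` is a hypothetical name, and Windows-1251 is assumed as the "Russian codepage" purely for illustration.

```python
def decode_if_valid_utf8(raw):
    # Return the decoded text if raw is valid UTF-8, otherwise None.
    try:
        return raw.decode("utf-8")  # strict error handling is the default
    except UnicodeDecodeError:
        return None


# A Russian greeting (six Cyrillic letters) encoded as Windows-1251.
cp1251_hello = "\u043f\u0440\u0438\u0432\u0435\u0442".encode("cp1251")

# Concern 2: these bytes are not valid UTF-8, so a strict check rejects them.
rejected = decode_if_valid_utf8(cp1251_hello)

# Concern 1: read as Latin-1 the same bytes become unrelated accented letters.
as_latin1 = cp1251_hello.decode("latin-1")

# Concern 1 again: some Windows-1251 byte pairs happen to be valid UTF-8,
# so the validity check passes but the text is still misread.
ambiguous = "\u0420\u00b0".encode("cp1251")  # Cyrillic ER plus degree sign
misread = decode_if_valid_utf8(ambiguous)    # one Cyrillic letter instead
```

So a validity check catches some non-UTF8 traffic but cannot tell which character set the sender actually meant.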

syzop

2020-04-19 18:02

administrator   ~0021514

Good news, spamfilter is now UTF8 aware. This is not in 5.0.4 but should be in 5.0.5.
See https://github.com/unrealircd/unrealircd/commit/bc70882bd3935be728b953f4252a94f9de6ff3f6

syzop

2021-06-16 15:17

administrator   ~0022014

A month later (May 2020) we had to revert this again. But this issue was never updated with that information.

Copy/paste from an IRC conversation of someone asking why \p was not working:
it's a complex subject. it starts with the fact that IRC is not 100% guaranteed valid UTF8
https://github.com/unrealircd/unrealircd/commit/bc70882bd3935be728b953f4252a94f9de6ff3f6 will probably interest you, this added \p support a year ago
unfortunately we had to disable it again because PCRE2 crashed/looped
PCRE2 devs basically said oh yeah this is new etc...
since then i have not trusted it anymore and not put it back in
the retraction was mentioned in https://forums.unrealircd.org/viewtopic.php?f=1&t=9013
the upstream bug was here https://bugs.exim.org/show_bug.cgi?id=2581
"The invalid utf support is a new feature so there might be issues with it, but we will try to fix them."
i did not want unrealircd to be a test project for them :D
clearly hanging/looping/crashing is not acceptable, especially for a long running process like an ircd
antimixedutf8 may interest you though

syzop

2021-06-16 15:18

administrator   ~0022015

Last edited: 2021-06-16 15:20

We should ask the PCRE2 guys upstream whether they now run PCRE2_MATCH_INVALID_UTF through a tester and a fuzzer with random data.

That would give us a path to getting this tested (by them, not by our random users crashing/hanging) and eventually back into unrealircd.

Issue History

Date Modified     Username  Field                     Change
2019-01-03 06:09  kieseen   New Issue
2019-01-03 06:09  kieseen   Tag Attached: pcre
2019-01-03 06:09  kieseen   Tag Attached: spamfilter
2019-01-03 06:09  kieseen   Tag Attached: unicode
2019-01-03 08:41  syzop     Note Added: 0020446
2019-01-03 09:31  kieseen   Note Added: 0020447
2019-01-03 09:42  syzop     Note Added: 0020448
2019-01-03 09:57  syzop     Note Added: 0020449
2019-01-03 09:59  syzop     Note Edited: 0020449
2019-01-03 10:07  syzop     Note Added: 0020450
2019-01-03 10:08  syzop     Note Edited: 0020450
2020-04-19 18:01  syzop     Assigned To               => syzop
2020-04-19 18:01  syzop     Status                    new => resolved
2020-04-19 18:01  syzop     Resolution                open => fixed
2020-04-19 18:01  syzop     Fixed in Version          => 5.0.5
2020-04-19 18:02  syzop     Note Added: 0021514
2021-06-16 15:17  syzop     Note Added: 0022014
2021-06-16 15:17  syzop     Status                    resolved => confirmed
2021-06-16 15:17  syzop     Fixed in Version          5.0.5 =>
2021-06-16 15:18  syzop     Note Added: 0022015
2021-06-16 15:20  syzop     Note Edited: 0022015