View Issue Details

ID: 0005187
Project: unreal
Category: installing
View Status: public
Last Update: 2019-01-03 10:08
Reporter: kieseen
Assigned To:
Priority: normal
Severity: minor
Reproducibility: always
Status: new
Resolution: open
Platform: UnrealIRCd-4.2.1
OS: Linux (Ubuntu Server)
OS Version: 18.04
Product Version: 4.2.1
Target Version:
Fixed in Version:
Summary: 0005187: Default configuration can't use \p{} in spamfilter regexes
Description:
Hi,

We've been trying to generalise our spam filters so that they can handle future spam waves with less maintenance. We found that a better default for compiling would be to remove --disable-unicode, so that submitted regexes can use "\p{}" to match against a character's UCD (Unicode Character Database) properties (such as "Latin", "Sc", "Common", or any other script name). With the current default, such a regex is rejected with the error: "Error in regex '\p{Common}': this version of PCRE2 does not have support for \P, \p, or \X (at character #2)."

https://www.pcre.org/current/doc/html/pcre2syntax.html#SEC5

We saw a recommendation to use the antimixedutf8 module against spam. However, it only checks for Latin and Cyrillic, and confusable characters sit in many other ranges, so it may not be useful in future spam waves: https://unicode.org/cldr/utility/confusables.jsp?a=aAbBCdDeEfFgGhHiIjJkKlLmMnNoOpPqQrRsStTuUvVwWxXyYzZ&r=None
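To illustrate the concern about confusables outside Latin and Cyrillic, here is a rough Python sketch. It is not how antimixedutf8 or PCRE2 work internally. Python's standard library has no Script property, so the first word of each character's Unicode name is used as an approximation; `rough_script`, `scripts_used` and the sample strings are hypothetical names chosen for this illustration.

```python
import unicodedata

def rough_script(ch: str) -> str:
    """Approximate a character's script by the first word of its Unicode name.

    Assumption: this is only a stand-in for the real Script property that
    \\p{...} would consult; the stdlib does not expose that property.
    """
    name = unicodedata.name(ch, "")
    return name.split(" ", 1)[0] if name else "UNKNOWN"

def scripts_used(text: str) -> set:
    """Collect the approximate scripts of all alphabetic characters in text."""
    return {rough_script(ch) for ch in text if ch.isalpha()}

latin_cyrillic = "p\u0430ypal"  # Cyrillic small a (U+0430) inside Latin text
latin_greek = "c\u03bfm"        # Greek small omicron (U+03BF) inside Latin text

print(scripts_used(latin_cyrillic))
print(scripts_used(latin_greek))
```

A check that only knows about Latin and Cyrillic would flag the first string but would have no reason to flag the second, which mixes Latin with Greek.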

Would building without "--disable-unicode" have any big disadvantages?
Steps To Reproduce:
1. Compile with default configuration file and settings
2. Run the ircd and connect to the server with an O-line
3. Attempt to run the command "/spamfilter add -regex p warn - PCREPropertyTest \p{Common}" (or prefix it with /quote if your client requires it)
Tags: pcre, spamfilter, unicode
3rd party modules:

Activities

syzop

2019-01-03 08:41

administrator   ~0020446

Did you check if it works, if compiled without --disable-unicode?

kieseen

2019-01-03 09:31

reporter   ~0020447

Yes, it does, this report is about suggesting a better default configuration.

syzop

2019-01-03 09:42

administrator   ~0020448

Right, so, just to double check: you tested with other (non-Latin) scripts and it correctly identified the character classes listed at https://www.pcre.org/current/doc/html/pcre2syntax.html#SEC5 ?

I ask because some other PCRE2 features don't work like that, even if you enable this option. See 0005163 (though that one is about the matcher, i.e. the regex, and not the target, i.e. the privmsg).

syzop

2019-01-03 09:57

administrator   ~0020449

Last edited: 2019-01-03 09:59

In pcre2_compile we set PCRE2_NEVER_UTF and PCRE2_NEVER_UCP, and similarly we do not set PCRE2_UTF.
So yeah, I'm wondering if this would work without changes in those areas.

And, related: what would happen if we were to change the above, taking into account that IRC deals with streams of bytes, which may be latin1, utf8, or some completely different character set?
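As a loose analogy only (this is Python's `re` module, not PCRE2, and it says nothing about how UnrealIRCd calls pcre2_compile), the difference between matching raw bytes and matching decoded Unicode text can be sketched as follows. The sample word and variable names are assumptions made for the illustration.

```python
import re

# Hypothetical sample: the same word as it might arrive from two clients.
word = "привет"
as_utf8 = word.encode("utf-8")
as_cp1251 = word.encode("cp1251")

# Bytes pattern: \w only matches ASCII word characters, so non-ASCII
# bytes are opaque to it regardless of the sender's encoding.
byte_word = re.compile(rb"\w+")

# Text pattern: \w consults Unicode character properties, but this
# requires knowing how to decode the bytes first.
text_word = re.compile(r"\w+")

print(byte_word.search(as_utf8))    # no match: every byte is non-ASCII
print(byte_word.search(as_cp1251))  # no match either
print(text_word.fullmatch(as_utf8.decode("utf-8")))
```

The Unicode-aware matcher gives the richer result, but only because the decode step knew the encoding, which is exactly what an IRC server cannot assume.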

It would be nice if all this already worked just by removing --disable-unicode, but.. that would surprise me.

syzop

2019-01-03 10:07

administrator   ~0020450

Last edited: 2019-01-03 10:08

Two things come to mind:
1) With non-utf8-non-latin character sets (eg: some Russian codepage), it may be possible that characters are misinterpreted (seen as something they are not). This may lead to odd matches and non-matches. How much of a problem would this be?
2) How does PCRE2 deal with "invalid UTF8 characters"? Will it entirely fail to run the regex? Will it skip the character as if it was not there? What will it do?
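Point 1 can be made concrete with a small Python sketch. It only demonstrates how the same bytes read under different assumed encodings and makes no claim about PCRE2's behaviour. The sample text and the helper `try_decode` are hypothetical.

```python
import unicodedata

# Bytes as a client using Windows-1251 (a Russian codepage) would send them.
raw = "привет".encode("cp1251")

def try_decode(data: bytes, encoding: str):
    """Return the decoded text, or None if the bytes are invalid in that encoding."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError:
        return None

as_cp1251 = try_decode(raw, "cp1251")   # what the sender meant
as_latin1 = try_decode(raw, "latin-1")  # every byte maps to some Latin-1 character
as_utf8 = try_decode(raw, "utf-8")      # these bytes are not valid UTF-8

print(as_cp1251)
print(as_latin1)
print(as_utf8)
print([unicodedata.name(c) for c in as_latin1])
```

Read as latin1, the Cyrillic word turns into accented Latin letters, so a script-based rule would see "Latin" where the sender typed Cyrillic. Read as utf8, the same bytes are simply invalid.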

I see https://www.pcre.org/current/doc/html/pcre2unicode.html :
"When the PCRE2_UTF option is set, the strings passed as patterns and subjects are (by default) checked for validity on entry to the relevant functions. If an invalid UTF string is passed, a negative error code is returned."
So yeah, it will fail to run the regex.

Or you can set PCRE2_NO_UTF_CHECK:
"If you pass an invalid UTF string when PCRE2_NO_UTF_CHECK is set, the result is undefined and your program may crash or loop indefinitely."
.. but this does not sound like a good idea :D
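One possible way to avoid both outcomes is to validate the message first and only use a Unicode-aware rule when the bytes are valid UTF-8, falling back to a plain byte rule otherwise. The Python sketch below shows the idea only; it is not UnrealIRCd code and does not use PCRE2. The function `spam_match` and both example rules are hypothetical, and a Cyrillic code point range stands in for \p{Cyrillic}, which Python's `re` does not support.

```python
import re

def spam_match(message: bytes, unicode_rule, byte_rule) -> bool:
    """Apply unicode_rule to valid UTF-8 messages, byte_rule to everything else."""
    try:
        text = message.decode("utf-8")  # strict decode: raises on invalid sequences
    except UnicodeDecodeError:
        # Not valid UTF-8 (could be latin1, a codepage, or garbage):
        # match on raw bytes instead of guessing an encoding.
        return byte_rule.search(message) is not None
    return unicode_rule.search(text) is not None

# Hypothetical rules for the illustration.
unicode_rule = re.compile("[\u0400-\u04ff]")          # any character in the Cyrillic block
byte_rule = re.compile(rb"free money", re.IGNORECASE)  # plain ASCII byte match

print(spam_match("привет".encode("utf-8"), unicode_rule, byte_rule))
print(spam_match("привет".encode("cp1251"), unicode_rule, byte_rule))
print(spam_match(b"FREE MONEY \xff", unicode_rule, byte_rule))
```

With this split, an invalid message never reaches the Unicode-aware matcher, at the cost of property-based rules silently not applying to non-UTF-8 clients.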

Hope the above comments help to give you a better picture of the situation we are in.

And, again, if somehow the functionality you describe works by just removing --disable-unicode and not having to deal with all the rest, then yeah.. that would be nice... but.. would surprise me.

Issue History

Date Modified     Username  Field                     Change
2019-01-03 06:09  kieseen   New Issue
2019-01-03 06:09  kieseen   Tag Attached: pcre
2019-01-03 06:09  kieseen   Tag Attached: spamfilter
2019-01-03 06:09  kieseen   Tag Attached: unicode
2019-01-03 08:41  syzop     Note Added: 0020446
2019-01-03 09:31  kieseen   Note Added: 0020447
2019-01-03 09:42  syzop     Note Added: 0020448
2019-01-03 09:57  syzop     Note Added: 0020449
2019-01-03 09:59  syzop     Note Edited: 0020449
2019-01-03 10:07  syzop     Note Added: 0020450
2019-01-03 10:08  syzop     Note Edited: 0020450