View Issue Details

ID: 0002259
Project: unrealircd
Category:
View Status: public
Last Update: 2013-01-09 10:18
Reporter: codemastr
Assigned To: syzop
Priority: normal
Severity: feature
Reproducibility: N/A
Status: closed
Resolution: no change required
Summary: 0002259: TRE extension for mIRC code stripping
Description: I've been playing around with TRE (I'm submitting a bunch of patches for it). My main reason for doing this was to get a little familiar with the code. My idea is basically one that will solve the problem of badword stripping making color codes disappear and such. How will it do this? Well, we won't strip the color codes. Rather, we will "ignore" them in the regex engine. So for example, if we encounter a \2 (bold), we just ++ past it (in a sense, it always matches). Of course, there might be instances where we do want to match these characters, for example a drone that uses a realname of "foo\2bar". So it will be dynamically controllable: write "(?-C)foo\2bar" and the code stripping is disabled, so it only matches if the \2 is there. By default we'd use a new flag, REG_CODEIGNORE, and the (?-C) construct would turn it off. That way it is backward compatible and also gives the user more control.

I don't think this will be too hard to do, however I'm not yet 100% sure I'm able to do it. I do understand how to add new modifiers (things like (?i)), since one of the patches I'm submitting adds one. Unfortunately, all the patches I'm making only deal with regcomp() not regexec(), so I will need to do more learning before I'm sure this is possible.
Tags: No tags attached.
3rd party modules

Activities

aquanight

2004-12-29 20:44

reporter   ~0008688

Last edited: 2004-12-29 20:45

Just out of curiosity, does this mean it just skips the color and pretends it wasn't there (but possibly includes it in things like captures), or that a color always matches regardless of the character class?

I guess in other words, with this option, would 'e' be treated as 'e[\1\2\3\4\17\37\16\33]*' or '[e\1\2\3\4\17\37\16\33]'?

Reason I ask is because what you said "(in a sense, it always matches)" could suggest it goes either way (though mentioning ++ would suggest the former).

Oh, and would CTCP characters be affected by this at all (even though CTCP isn't a color/format code... it's still in the nonprintable ASCII range)?

*edit* Oh and I know TRE doesn't support the \### octal character notation (unless it does and no one told me ;p ). Also, forgot the ESC character is considered by +c to be a "color code". */edit*

codemastr

2004-12-29 20:49

reporter   ~0008689

Last edited: 2004-12-29 20:53

[quote]Reason I ask is because what you said "(in a sense, it always matches)" could suggest it goes either way (though mentioning ++ would suggest the former).[/quote]
Well I'm thinking, 100% ignored. Like, the color characters become "zero-width."

[quote]Oh, and would CTCP characters be affected by this at all[/quote]
No. It will only find and skip color and formatting codes.

[quote]*edit* Oh and I know TRE doesn't support the \### octal character notation[/quote]
Well it supports \x## where ## is hex.

*Edit:

Better example,
regex: "([a-z]+)" text: "\2a\2b\2c" matches \1 = "a\2b\2c"

aquanight

2004-12-29 20:56

reporter   ~0008690

So basically it's almost as if every character class had a [\x02\x03\x04\x0F\x1F\x0E\x1B]* after it (as far as making the regular expression goes, anyway)? Except you don't have to type out that whole mess every time ;-) . Nice.

Yeah yeah I know more accurate to say it pretends they don't even exist but :) .

(And actually, it'd be more like: ([\x02\x0F\x1F\x0E\x1B]|\x03([0-9][0-9]?(,[0-9][0-9]?))?|\x04[0-9A-Fa-f]{6}(,[0-9A-Fa-f]{6})?)* - to strip the codes, you need to strip the args for mIRC/RGB color too :) )

codemastr

2004-12-29 21:02

reporter   ~0008691

[quote](And actually, it'd be more like: ([\x02\x0F\x1F\x0E\x1B]|\x03([0-9][0-9]?(,[0-9][0-9]?))?|\x04[0-9A-Fa-f]{6}(,[0-9A-Fa-f]{6})?)* - to strip the codes, you need to strip the args for mIRC/RGB color too :) )[/quote]
Yeah, pretty much, though I'd probably just use [:xdigit:] ;). But it should be more efficient. I don't intend to actually make it "expand" to that. I intend to hardcode it into the parser. Like: if ((cflag & REG_IGNORECODE) && *curchar == '\2') curchar++; So the size of the regex won't grow as a result of this.
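
To make the "++ past the code + args" idea concrete, here is a minimal sketch in plain C. It is not TRE code and does not touch TRE's internals: `skip_format_codes()` and `contains_ignoring_codes()` are hypothetical helpers invented for illustration, and the matcher only handles a literal word rather than a full regex. The set of codes follows the list discussed above (\x02, \x0F, \x1F, \x0E, \x1B, plus \x03 and \x04 with their arguments). As an assumption, the background color argument is treated as optional, and a \x03 or \x04 with no valid arguments is still skipped.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Returns 1 if p starts with six hex digits. Stops at the NUL terminator. */
static int six_hex(const char *p)
{
    int i;
    for (i = 0; i < 6; i++)
        if (!isxdigit((unsigned char)p[i]))
            return 0;
    return 1;
}

/* Hypothetical helper: return a pointer just past any run of formatting
 * codes (and their arguments) that starts at p. */
const char *skip_format_codes(const char *p)
{
    assert(p != NULL);
    for (;;) {
        unsigned char c = (unsigned char)*p;
        if (c == 0x02 || c == 0x0F || c == 0x1F || c == 0x0E || c == 0x1B) {
            p++;                                /* single-byte codes */
        } else if (c == 0x03) {                 /* \x03[N[N][,N[N]]] */
            p++;
            if (isdigit((unsigned char)*p)) {
                p++;
                if (isdigit((unsigned char)*p))
                    p++;
                if (p[0] == ',' && isdigit((unsigned char)p[1])) {
                    p += 2;
                    if (isdigit((unsigned char)*p))
                        p++;
                }
            }
        } else if (c == 0x04) {                 /* \x04RRGGBB[,RRGGBB] */
            p++;
            if (six_hex(p)) {
                p += 6;
                if (*p == ',' && six_hex(p + 1))
                    p += 7;
            }
        } else {
            return p;                           /* ordinary character or NUL */
        }
    }
}

/* Literal substring search in which formatting codes inside text are
 * treated as zero-width. Returns 1 on a match, 0 otherwise. */
int contains_ignoring_codes(const char *text, const char *word)
{
    const char *start = text;
    assert(text != NULL && word != NULL);
    for (;;) {
        const char *t, *w = word;
        start = skip_format_codes(start);       /* never start inside a code */
        t = start;
        while (*w != '\0') {
            t = skip_format_codes(t);
            if (*t != *w)
                break;
            t++;
            w++;
        }
        if (*w == '\0')
            return 1;
        if (*start == '\0')
            return 0;
        start++;
    }
}
```

With this sketch, `contains_ignoring_codes("b\002adw\0034,5ord", "badword")` finds the word even though a bold code and a color code are embedded in it, while the message itself is left untouched. The "(?-C)" case from the description corresponds to falling back to an ordinary comparison such as `strstr()`.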

aquanight

2005-01-03 00:35

reporter   ~0008695

Of course not :) . I was mostly thinking appearance, not actual implementation. Of course it would be easier to just ++ past the code + args :P . (On the other hand, it might take a bit of work off on your part... :P)

syzop

2005-01-03 11:35

administrator   ~0008696

As long as it's fast (and doesn't crash) ;).
Obviously, it can only become slower than the current implementation.. since stripping (color) codes once vs doing it for every regex is impossible without any performance penalty. That said, since it's (very) simple, there shouldn't be any noticeable slowdown[1].. and if implemented properly, I would in fact be happy with this feature.. it's clean, and it's useful (or even required) for some (spamfilter) cases :).

[1] Comparing bytes that are in the L1 cache (or will become anyway) and increasing a counter (pointer) are very fast instructions ;p
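
For comparison, the "strip once, then run every regex on the clean copy" approach mentioned above might look roughly like the sketch below. It uses the POSIX `regcomp()`/`regexec()` interface from `<regex.h>`. The names `strip_codes()` and `count_matches()` are hypothetical, the fixed 512-byte buffer is an arbitrary assumption, and for brevity the stripper handles only the single-byte codes and \x03 colors, not \x04 RGB colors. Real code would also compile each pattern once up front rather than on every call.

```c
#include <assert.h>
#include <ctype.h>
#include <regex.h>
#include <stddef.h>

/* Copy in to out with formatting codes removed (simplified: single-byte
 * codes and \x03 colors only). Returns the number of bytes written. */
size_t strip_codes(const char *in, char *out, size_t outsz)
{
    size_t n = 0;
    assert(in != NULL && out != NULL && outsz > 0);
    while (*in != '\0' && n + 1 < outsz) {
        unsigned char c = (unsigned char)*in;
        if (c == 0x03) {                        /* \x03[N[N][,N[N]]] */
            in++;
            if (isdigit((unsigned char)*in)) {
                in++;
                if (isdigit((unsigned char)*in))
                    in++;
                if (in[0] == ',' && isdigit((unsigned char)in[1])) {
                    in += 2;
                    if (isdigit((unsigned char)*in))
                        in++;
                }
            }
        } else if (c == 0x02 || c == 0x0F || c == 0x1F ||
                   c == 0x0E || c == 0x1B) {
            in++;                               /* single-byte codes */
        } else {
            out[n++] = *in++;                   /* keep ordinary characters */
        }
    }
    out[n] = '\0';
    return n;
}

/* Strip the text once, then test the clean copy against every pattern.
 * Returns the number of matching patterns, or -1 if a pattern fails to
 * compile. */
int count_matches(const char *text, const char *const *patterns, size_t npat)
{
    char clean[512];                            /* arbitrary size for the sketch */
    size_t i;
    int hits = 0;

    strip_codes(text, clean, sizeof clean);
    for (i = 0; i < npat; i++) {
        regex_t re;
        if (regcomp(&re, patterns[i],
                    REG_EXTENDED | REG_ICASE | REG_NOSUB) != 0)
            return -1;
        if (regexec(&re, clean, 0, NULL, 0) == 0)
            hits++;
        regfree(&re);
    }
    return hits;
}
```

The stripping cost here is paid once per message no matter how many patterns are checked, which is the baseline the proposed in-engine skipping would be compared against.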

Stealth

2007-04-18 12:46

reporter   ~0013521

*bump*

Is anyone going to work on this?

djGrrr

2007-04-18 17:09

reporter   ~0013537

In my opinion, TRE sucks bigtime. I think unrealircd would be MUCH better suited to using PCRE: it's much faster, and can do much more powerful regular expressions.

vonitsanet

2007-04-18 18:12

reporter   ~0013545

I agree with djGrrr.
Also, as I can see from http://bugs.unrealircd.org/view.php?id=2887, the TRE author is (almost) not working on it anymore (or is he?).

stskeeps

2007-06-21 14:22

reporter   ~0014397

We have included PCRE in 3.3 now.

syzop

2013-01-09 10:17

administrator   ~0017318

scratched (the TRE extension for mIRC color stripping).

Issue History

Date Modified Username Field Change
2004-12-28 16:49 codemastr New Issue
2004-12-29 20:44 aquanight Note Added: 0008688
2004-12-29 20:45 aquanight Note Edited: 0008688
2004-12-29 20:49 codemastr Note Added: 0008689
2004-12-29 20:52 codemastr Note Edited: 0008689
2004-12-29 20:53 codemastr Note Edited: 0008689
2004-12-29 20:56 aquanight Note Added: 0008690
2004-12-29 21:02 codemastr Note Added: 0008691
2005-01-03 00:35 aquanight Note Added: 0008695
2005-01-03 11:35 syzop Note Added: 0008696
2007-04-18 12:46 Stealth Note Added: 0013521
2007-04-18 17:09 djGrrr Note Added: 0013537
2007-04-18 18:12 vonitsanet Note Added: 0013545
2007-06-11 13:11 stskeeps Assigned To codemastr =>
2007-06-21 14:22 stskeeps Note Added: 0014397
2013-01-09 10:17 syzop Note Added: 0017318
2013-01-09 10:17 syzop Status assigned => closed
2013-01-09 10:18 syzop Assigned To => syzop
2013-01-09 10:18 syzop Resolution open => fixed
2013-01-09 10:18 syzop Resolution fixed => no change required