View Issue Details

ID: 0002882
Project: unrealircd
Category:
View Status: public
Last Update: 2010-07-14 17:49
Reporter: Trocotronic
Assigned To:
Priority: normal
Severity: feature
Reproducibility: always
Status: acknowledged
Resolution: open
Platform: AMD K6 32bits
OS: Windows XP Professional
OS Version: SP2
Product Version: 3.2.5
Summary: 0002882: Configurable CASEMAPPING (Lower/Uppercase with charsets)
Description: I know that charsets for different languages are very complex.
I have loaded the Spanish and Catalan charsets.
For example, á is the lowercase of Á, so eáe and eÁe are the same word.
Yes, distinguishing lowercase from uppercase letters for every charset sounds wasteful, but I think it is possible.
Tags: No tags attached.
Attached Files
Locedit.zip (356,926 bytes)
locedit-fixed.rar (251,161 bytes)
3rd party modules

Relationships

related to 0002739 closed Badwords does not support charsets 
related to 0002718 closed Russian nicks is case sensivity 
has duplicate 0003101 closed Suggestion about cyrillic nicks 
related to 0002589 closed syzop mixed charsets in nicknames 

Activities

Stealth

2006-04-13 19:03

reporter   ~0011548

This has been mentioned many times, and will not be fixed. The documentation even describes it:

[quote]NOTE 2: Casemapping (if a certain lowercase character belongs to an upper one) is done according to US-ASCII, this means that ö and Ö are not recognized as 'the same character' and hence someone can have a nick with Bär and someone else BÄr at the same time. This is a limitation of the current system and IRCd standards that cannot be solved anytime soon. People should be aware of this limitation. Note that this limitation has always also been applied to channels, in which nearly all characters were always permitted and US-ASCII casemapping was always performed.[/quote]

Trocotronic

2006-04-13 19:18

reporter   ~0011549

Standards, standards, standards... The best thing about Unreal is that it does not stick to standards.
You will agree with me that Unreal would be a better ircd if it supported or enhanced this feature.
Why can't you break this rule? Is there any special reason to accept US-ASCII as the only option?

Bock

2006-04-19 08:05

reporter   ~0011580

/me sent a patch for this to syzop. He says it may make it into 3.3.

Trocotronic

2006-04-20 16:10

reporter   ~0011589

Bock, could you upload your patch, please?
Thank you.

Spider84

2006-05-19 12:10

reporter   ~0011741

ftp://ftp.bynets.org/sources/unreal3.2.4-bynets.diff
This patch adds support for different locales, loaded from files.
You can change the locale without rebuilding the server, just with the rehash command.

Bock

2006-05-19 16:16

reporter   ~0011742

Yep, this is it.
I hope that this patch will be in 3.3*.
On our network it has run for 180 days without a single bug or server crash.

Bock

2006-06-05 03:35

reporter   ~0011854

This is a clean version of the patch against the current CVS version (2006-06-05). It fixes the lower/uppercase issues for Russian and Belarusian. Read reame.txt in locale/ for your locale.
PS: It has been running on our network (ByNets) since 2006-02 and no bugs or crashes have been found.

Bock

2006-06-11 09:10

reporter   ~0011944

I added a file (Locedit.zip) which contains a GUI editor for locales, a patch for the current unrealircd (3.2.5-rc3), and a directory with locales.
For example, you can look at the belarussian-w1251 or russian-w1251 files to understand the principle.
The author (Killer{R} - [email protected]) says that there may be trouble with multibyte encodings (Chinese, for example), but if there is, he can fix it.
It fixes the problem with channel names too. The advantage of this patch is that locale files can be added and reloaded without recompiling or restarting the ircd; a rehash is enough.
Since 02.2006: no errors, crashes, etc.
See you :]

Bock

2006-06-11 12:38

reporter   ~0011945

The fixed version of locedit (now with multibyte support too), with sources.

avenger

2006-06-13 08:30

reporter   ~0011950

Note that the patch posted by Spider84 not only adds support for locale-aware matching of accented upper/lowercase letters, but also adds new modules (with other, totally unrelated functions) and some ByNets-specific thingies. :)

Bock

2006-06-13 09:23

reporter   ~0011951

2 avenger - yes, that is the patch for our network; the clean version is listed below. :]

syzop

2006-11-01 07:49

administrator   ~0012540

I've linked a couple of bugids to this one.
Renamed this title to 'Configurable CASEMAPPING (Lower/Uppercase with charsets)', since that's what it is...

What is CASEMAPPING? CASEMAPPING decides which characters "belong to each other", or in other words... which uppercase character belongs to which lowercase character.
http://www.irc.org/tech_docs/005.html ctrl+f CASEMAPPING
We currently always use 'ascii', which is what everyone is familiar with I guess.
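
As a rough illustration of what 'ascii' casemapping means for name comparison, here is a minimal Python sketch (not UnrealIRCd code; the function names `ascii_fold` and `same_name` are made up for this example). Only the bytes A-Z are folded, so the latin1 bytes for á (0xE1) and Á (0xC1) stay distinct:

```python
def ascii_fold(name: bytes) -> bytes:
    """Fold only the bytes A-Z to a-z; every other byte is left untouched."""
    return bytes(b + 32 if 0x41 <= b <= 0x5A else b for b in name)


def same_name(a: bytes, b: bytes) -> bool:
    """Compare two nick/channel names under 'ascii' casemapping."""
    return ascii_fold(a) == ascii_fold(b)


# Plain ASCII letters fold as expected.
print(same_name(b"Bar", b"bAR"))        # True

# "eáe" vs "eÁe" encoded in latin1: 0xE1 and 0xC1 are outside A-Z,
# so the two names are treated as different.
print(same_name(b"e\xe1e", b"e\xc1e"))  # False
```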

The idea is to make this a configurable option in the conf to set it to an alternative CASEMAPPING. This one will then be used, and will be properly announced in 005 etc.
What are the limitations?
You can only have ONE casemapping configured (eg 'ascii' or 'some-latin1-thingy'). You cannot do casemapping for a Russian charset, a Hebrew charset, and some eastern European charset all at once... Why not? Because the same character means different things in each charset. This is why bug 0002987 was closed, because some people don't seem to understand that.
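
To make the "same character means different things" point concrete, here is a small Python sketch using the standard `latin-1` and `cp1251` codecs (the helper name `describe` is made up for this example). The byte 0xD7 is the multiplication sign in latin1, which has no lowercase form, but an uppercase Cyrillic letter in windows-1251:

```python
def describe(byte: int, charset: str) -> tuple:
    """Return (character, lowercase form) for one byte in the given charset."""
    char = bytes([byte]).decode(charset)
    return char, char.lower()


# One byte value, two unrelated meanings.
print(describe(0xD7, "latin-1"))  # ('×', '×')  no case at all
print(describe(0xD7, "cp1251"))   # ('Ч', 'ч')  folds to byte 0xF7
```

A server that folded 0xD7 to 0xF7 for Russian users would, for latin1 users, be treating × and ÷ as the same character, which is why only one mapping can be active at a time.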

What IS possible?
It's possible to configure a different CASEMAPPING, for example iso8859-1 (latin1), this will then be used for comparing if things are "the same", such as nicks and channels.
This COULD also be used by something like spamfilter (basically any strcasecmp/stricmp in our code) which can be open to debate whether that's a good idea or bad (I currently don't see how it could be bad, but perhaps someone can tell). I don't know if TRE supports it, but it would make sense if it would.
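
One possible shape for this, sketched in Python rather than in the ircd's C: build a single 256-entry fold table from the configured charset and use it wherever names are compared. This is only an assumption about how it could work, not what the attached patch does; `build_fold_table`, `names_equal` and the charset names passed in are illustrative.

```python
def build_fold_table(charset: str) -> bytes:
    """Build a 256-byte table mapping each byte to its lowercase byte."""
    table = bytearray(range(256))  # start as the identity mapping
    for byte in range(256):
        try:
            char = bytes([byte]).decode(charset)
            folded = char.lower().encode(charset)
        except UnicodeError:
            continue  # byte unassigned, or lowercase not in this charset
        if len(folded) == 1:
            table[byte] = folded[0]
    return bytes(table)


def names_equal(a: bytes, b: bytes, table: bytes) -> bool:
    """Compare two names under the single configured casemapping."""
    return a.translate(table) == b.translate(table)


ascii_table = build_fold_table("ascii")
latin1_table = build_fold_table("latin-1")

# "eáe" vs "eÁe" as latin1 bytes.
print(names_equal(b"e\xe1e", b"e\xc1e", ascii_table))   # False
print(names_equal(b"e\xe1e", b"e\xc1e", latin1_table))  # True
```

Rebuilding the table when the configuration is reloaded would be enough to switch charsets without a restart, at the cost that names already in use might collide under the new mapping.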

As for which technique to use, I haven't looked into it. So maybe it could be discussed here... Something like setlocale() seems to make most sense? What are the disadvantages/advantages of each approach?

syzop

2006-11-01 07:56

administrator   ~0012541

While we're at it, it's worth mentioning that in some character sets like Russian, some characters like the 'a' will look very similar to (or exactly the same as) the Latin (western) 'a'. Not sure if something like that could also be resolved, and if we should even bother to do so... Some people argue that should be handled client-side. Various papers have been written on this, see also the discussion 1/2/3 years back when international domain names were introduced.

It's not that different from 'l' vs 'I', and such things, which look very similar in some fonts like Fixedsys, and I haven't ever heard someone talking about comparing 'l' as if it was equal to 'I' :P. Then again, they still look *similar* and not 100% *equal* :P. Again, is that our problem, or is it the problem of the client / font / etc?
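
If look-alike letters were to be handled server-side, one conceivable approach is to reduce every name to a "skeleton" before comparing. The Python sketch below assumes a tiny hand-made map of Cyrillic letters to their Latin look-alikes; the map and the names `LOOKALIKES`, `skeleton` and `looks_alike` are purely illustrative and are not taken from any patch attached here.

```python
# Hand-picked Cyrillic -> Latin look-alikes (lowercase only, illustrative).
LOOKALIKES = {
    "\u0430": "a",  # Cyrillic a
    "\u0435": "e",  # Cyrillic ie
    "\u043e": "o",  # Cyrillic o
    "\u0440": "p",  # Cyrillic er
    "\u0441": "c",  # Cyrillic es
    "\u0445": "x",  # Cyrillic ha
}


def skeleton(name: str) -> str:
    """Lowercase the name, then replace look-alike letters."""
    return "".join(LOOKALIKES.get(ch, ch) for ch in name.lower())


def looks_alike(a: str, b: str) -> bool:
    """True when two names are visually confusable under the map above."""
    return skeleton(a) == skeleton(b)


# "Bock" vs the same nick typed with Cyrillic o and es.
print(looks_alike("Bock", "B\u043e\u0441k"))  # True
print(looks_alike("Bock", "Back"))            # False
```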

Bock

2006-11-01 11:14

reporter   ~0012546

In my opinion, some letters are identical (for example, in Russian: "e", "T", "P" (which is the Russian "R"), "A", "B" (the Russian "V"), "O", and so on), and in many fonts they look no different from one language to another (I have seen a difference only on some *nix systems; on Windows, with Verdana, Fixedsys, Lucida Console etc., there is none). Some letters match only in uppercase: a capital B looks like the capital Russian V, but the lowercase forms differ.
With the patch that I sent you, and locedit-fixed.rar, you can create a locale file (Russian and Belarusian exist now, Ukrainian may follow) with casemapping AND resolution of the similar-letters problem. I want to find people from other countries to make locale files.
Our network has been running with this patch for 1 year and it works fine. From the people who contacted me about it and who tested it, there has not been a single bad report... only gratitude.
If you don't agree with it, you can still take from the patch the idea of adding locale files to the ircd without recompiling AND without restarting it (that is, adding a language dynamically).

About badwords and spamfilter: when people start spamming, they usually don't vary the letter case, so spamfilter works fine (I mean Russian spam or "happy letters").

[quote]
It's not that different from 'l' vs 'I', and such things, which look very similar in some fonts like Fixedsys, and I haven't ever heard someone talking about comparing 'l' as if it was equal to 'I' :P. [/quote]
If I showed you the Russian "E" and "A", you would not see the difference :]

Bock

2006-11-04 14:24

reporter   ~0012584

hm.. I think, that in phrases about ONLY CASEMAPPING reason..

Issue History

Date Modified Username Field Change
2006-04-13 14:47 Trocotronic New Issue
2006-04-13 19:03 Stealth Note Added: 0011548
2006-04-13 19:18 Trocotronic Note Added: 0011549
2006-04-19 08:05 Bock Note Added: 0011580
2006-04-20 16:10 Trocotronic Note Added: 0011589
2006-05-19 12:10 Spider84 Note Added: 0011741
2006-05-19 16:16 Bock Note Added: 0011742
2006-06-05 03:35 Bock Note Added: 0011854
2006-06-05 03:35 Bock File Added: unreal.3.2.5-locale.tar.gz
2006-06-11 09:10 Bock Note Added: 0011944
2006-06-11 09:18 Bock File Added: Locedit.zip
2006-06-11 12:38 Bock File Added: locedit-fixed.rar
2006-06-11 12:38 Bock Note Added: 0011945
2006-06-13 08:30 avenger Note Added: 0011950
2006-06-13 09:23 Bock Note Added: 0011951
2006-11-01 07:33 syzop Relationship added related to 0003101
2006-11-01 07:33 syzop Relationship added related to 0002739
2006-11-01 07:39 syzop Relationship deleted related to 0003101
2006-11-01 07:39 syzop Relationship added has duplicate 0003101
2006-11-01 07:49 syzop Note Added: 0012540
2006-11-01 07:49 syzop Summary Lower/Uppercase with charsets => Configurable CASEMAPPING (Lower/Uppercase with charsets)
2006-11-01 07:56 syzop Note Added: 0012541
2006-11-01 11:14 Bock Note Added: 0012546
2006-11-03 13:40 syzop Relationship added related to 0002718
2006-11-04 14:24 Bock Note Added: 0012584
2007-04-19 18:37 stskeeps Status new => acknowledged
2007-04-27 05:50 stskeeps Relationship added related to 0002589
2010-07-14 17:49 syzop QA => Not touched yet by developer
2010-07-14 17:49 syzop U4: Need for upstream patch => No need for upstream InspIRCd patch
2010-07-14 17:49 syzop U4: Upstream notification of bug => Not decided
2010-07-14 17:49 syzop U4: Contributor working on this => None
2010-07-14 17:49 syzop Severity minor => feature