0002882: Configurable CASEMAPPING (Lower/Uppercase with charsets)

ID	Project	Category	View Status	Date Submitted	Last Update

0002882	unreal	ircd	public	2006-04-13 14:47	2010-07-14 17:49

Reporter	Trocotronic	Assigned To
Priority	normal	Severity	feature	Reproducibility	always
Status	acknowledged	Resolution	open
Platform	AMD K6 32bits	OS	Windows XP Professional	OS Version	SP2
Product Version	3.2.5

Summary	0002882: Configurable CASEMAPPING (Lower/Uppercase with charsets)
Description	I know that charsets for different languages are very complex. I have loaded the Spanish and Catalan charsets. For example, á is the lowercase of Á, so eáe and eÁe are the same word. Yes, distinguishing lower and upper letters for every charset sounds too wasteful, but I think it could be possible.
Tags	No tags attached.
Attached Files	unreal.3.2.5-locale.tar.gz (8,258 bytes) Locedit.zip (356,926 bytes) locedit-fixed.rar (251,161 bytes)

3rd party modules

Stealth 2006-04-13 19:03 reporter ~0011548	This has been mentioned many times, and will not be fixed. The documentation even describes this: [quote]NOTE 2: Casemapping (if a certain lowercase character belongs to an upper one) is done according to US-ASCII, this means that o" and O" are not recognized as 'the same character' and hence someone can have a nick with B"ar and someone else BA"r at the same time. This is a limitation of the current system and IRCd standards that cannot be solved anytime soon. People should be aware of this limitation. Note that this limitation has always also been applied to channels, in which nearly all characters were always permitted and US-ASCII casemapping was always performed.[/quote]
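
The US-ASCII rule quoted above can be sketched as follows. This is only an illustration in Python, not UnrealIRCd's code: the function names `ascii_fold` and `same_nick_ascii` are made up for the example, and the nicks are assumed to arrive as ISO 8859-1 (latin1) bytes.

```python
def ascii_fold(name: bytes) -> bytes:
    """Lowercase only the bytes A-Z (0x41-0x5A); leave every other byte alone."""
    return bytes(b + 32 if 0x41 <= b <= 0x5A else b for b in name)


def same_nick_ascii(a: bytes, b: bytes) -> bool:
    """Compare two nicks the way a US-ASCII casemapping would."""
    return ascii_fold(a) == ascii_fold(b)


# Assumption for this example: nicks are latin1 bytes,
# so 0xE4 is a-umlaut and 0xC4 is A-umlaut.
lower_umlaut_nick = b"B\xe4r"
upper_umlaut_nick = b"B\xc4r"
```

With this rule `same_nick_ascii(b"Nick", b"NICK")` is true, while the two umlaut nicks compare as different, so both can be registered at the same time, which is the limitation the documentation describes.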

Trocotronic 2006-04-13 19:18 reporter ~0011549	Standards, standards, standards... the best thing about Unreal is that it does not stick to the standards. You will agree with me that if Unreal supports or enhances this feature, it will be a better ircd. Why can't you break this rule? Is there any special reason to accept US-ASCII as the only alternative?

Bock 2006-04-19 08:05 reporter ~0011580	/me sent a patch to syzop about it. He says that it may be in 3.3.

Trocotronic 2006-04-20 16:10 reporter ~0011589	Bock, could you upload your patch, please? Thank you.

Spider84 2006-05-19 12:10 reporter ~0011741	ftp://ftp.bynets.org/sources/unreal3.2.4-bynets.diff This patch adds support for different locales in file mode. You can change the locale without rebuilding the server, just with a rehash command.

Bock 2006-05-19 16:16 reporter ~0011742	Yep, this is it. I hope that this patch will be in 3.3*. On our network there has not been a single bug or server crash with this patch in 180 days.

Bock 2006-06-05 03:35 reporter ~0011854	This is a clean version of the patch against the current CVS version (2006-06-05). It fixes the lower/upper issues for Russian and Belarusian. Read reame.txt in locale/ for your locale. PS: It has worked on our network (ByNets) since 2006-02 and no bugs or crashes have been found.

Bock 2006-06-11 09:10 reporter ~0011944	I added a file (Locedit.zip) which contains a GUI editor for locales, a patch for the current unrealircd (3.2.5-rc3), and a directory with locales. For an example, look at the belarussian-w1251 or russian-w1251 files to understand the principle. The author (Killer{R} - [email protected]) says that there may be trouble with multibyte codes (files), but if there is, he can correct it (Chinese, for example). It fixes the trouble with channel names too. The advantage of this patch: you can add and reload locale files without recompiling or restarting the ircd, with only a rehash. Since 02.2006: no errors, crashes, etc. See you :]

Bock 2006-06-11 12:38 reporter ~0011945	The fixed version of locedit (now supporting multibyte too), with sources.

avenger 2006-06-13 08:30 reporter ~0011950	Note that the patch posted by Spider84 not only adds support for uppercase matching of locale accents, but also adds new modules (with other, totally unrelated functions) and some ByNets-specific thingies. :)

Bock 2006-06-13 09:23 reporter ~0011951	2 avenger - yes, it's the patch for our network; the clean version is listed below. :]

syzop 2006-11-01 07:49 administrator ~0012540	I've linked a couple of bugids to this one. Renamed this title to 'Configurable CASEMAPPING (Lower/Uppercase with charsets)', since that's what it is...
What is CASEMAPPING? CASEMAPPING decides which characters "belong to each other", or in other words... which uppercase character belongs to which lowercase character (see http://www.irc.org/tech_docs/005.html, ctrl+f CASEMAPPING). We currently always use 'ascii', which is what everyone is familiar with I guess. The idea is to make this a configurable option in the conf to set it to an alternative CASEMAPPING. This one will then be used, and will be properly announced in 005 etc.
What are the limitations? You can only have ONE casemapping configured (eg 'ascii' or 'some-latin1-thingy'). You cannot do casemapping for a Russian charset, a Hebrew charset, and some eastern European charset all at once... Why not? Because the same character means different things in each charset. This is why bug 0002987 was closed, because some people don't seem to understand that.
What IS possible? It's possible to configure a different CASEMAPPING, for example iso8859-1 (latin1), which will then be used for comparing whether things are "the same", such as nicks and channels. This COULD also be used by something like spamfilter (basically any strcasecmp/stricmp in our code); whether that's a good or a bad idea is open to debate (I currently don't see how it could be bad, but perhaps someone can tell). I don't know if TRE supports it, but it would make sense if it did.
As for which technique to use, I haven't looked into it. So maybe it could be discussed here... Something like setlocale() seems to make most sense? What are the disadvantages/advantages of each approach?
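
As a rough illustration of the idea in the note above (one configurable, byte-based casemapping at a time), here is a minimal Python sketch. It is not taken from UnrealIRCd or from any attached patch: `build_fold_table`, `same_name` and `isupport_token` are hypothetical names, the table is derived from Python's own codecs rather than from setlocale(), and any CASEMAPPING value other than the ones listed in the 005 document would be a non-standard, made-up token.

```python
def build_fold_table(charset: str) -> bytes:
    """Build a 256-entry byte table mapping each byte to its lowercase form
    in the given single-byte charset; bytes without one map to themselves."""
    table = bytearray(range(256))
    for b in range(256):
        try:
            ch = bytes([b]).decode(charset)
            low = ch.lower().encode(charset)
        except (UnicodeDecodeError, UnicodeEncodeError):
            continue  # byte undefined in this charset, or no single-byte lowercase
        if len(low) == 1:
            table[b] = low[0]
    return bytes(table)


def same_name(a: bytes, b: bytes, table: bytes) -> bool:
    """Compare two nicks or channel names under the configured casemapping."""
    return a.translate(table) == b.translate(table)


def isupport_token(name: str) -> str:
    """Format the 005 token; a non-'ascii' name here is hypothetical."""
    return "CASEMAPPING=" + name


ascii_table = build_fold_table("ascii")
latin1_table = build_fold_table("latin-1")
cp1251_table = build_fold_table("cp1251")
```

The single-casemapping limitation shows up directly in the tables. Under `latin1_table` the bytes 0xC4 and 0xE4 (A-umlaut and a-umlaut) compare as equal, which `ascii_table` does not do. But the bytes 0xD7 and 0xF7 are the unrelated signs × and ÷ in latin1, while in windows-1251 they are the upper and lower forms of the same Cyrillic letter, so `cp1251_table` folds them together and `latin1_table` must not. A server can therefore only apply one such table to a given name.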

syzop 2006-11-01 07:56 administrator ~0012541	While we're at it, it's worth mentioning that in some character sets like Russian, some characters like the 'a' will look very similar to (or exactly the same as) the latin (western) 'a'. Not sure if something like that could also be resolved, and if we should even bother to do so... Some people argue that should be handled client-side. Various papers have been written on this, see also the discussion 1/2/3 years back when international domain names were introduced. It's not that different from 'l' vs 'I', and such things, which look very similar in some fonts like Fixedsys, and I haven't ever heard someone talking about comparing 'l' as if it was equal to 'I' :P. Then again, they still look similar and not 100% equal :P. Again, is that our problem, or is it the problem of the client / font / etc?

Bock 2006-11-01 11:14 reporter ~0012546	In my opinion, some letters are identical (for example, in Russian: "e", "T", "P" (it's the Russian "R"), "A", "B" (like the Russian "V"), "O", etc.), and many fonts now do not render them differently for different languages (I have seen a difference only on some *nix systems; on Windows systems, with Verdana, Fixedsys, Lucida Console, etc., there is no difference between languages). Some letters in uppercase (like B) look like the uppercase Russian V, but in lowercase they do not. With the patch that I sent you, and locedit-fixed.rar, you can create a locale file (Russian and Belarusian are present now, maybe Ukrainian will follow) with casemapping AND resolution of the trouble with similar letters. I want to find people from other countries to make locale files. Our network has run with this patch for 1 year and it works fine. Of the people who contacted me about it and tested it, no one reported anything bad... only gratitude. If you don't agree with it, you can still take from the patch the idea of adding locale files to the ircd without recompiling AND restarting it (that is, adding a language to the ircd dynamically). About badwords and spamfilter: when people start spamming, they usually don't vary the letter case, so spamfilter works fine (I mean Russian spam or "happy letters"). [quote] It's not that differently than 'l' vs 'I', and such things, which look very similar in some fonts like Fixedsys, and I haven't ever heard someone talking about comparing 'l' as if it was equal to 'I' :P. [/quote] If I show you the Russian "E" and "A", you won't see a difference :]
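
The "similar letters" problem discussed in the last two notes could, in principle, be handled with a second lookup table applied after case folding. The Python sketch below is only an assumption about how such a table might look. It is not the format used by the attached locale files, the names `LOOKALIKE_TABLE`, `skeleton` and `confusable` are invented for the example, and the six-letter list is deliberately incomplete.

```python
# Lowercase Cyrillic letters (encoded as windows-1251 bytes) that look like
# Latin letters in many fonts: a, e, o, r, s, kh.
CYRILLIC_LOOKALIKES = "\u0430\u0435\u043e\u0440\u0441\u0445".encode("cp1251")
# The Latin letters they visually resemble, in the same order.
LATIN_TARGETS = b"aeopcx"

# 256-entry table that rewrites each lookalike byte to its Latin twin.
LOOKALIKE_TABLE = bytes.maketrans(CYRILLIC_LOOKALIKES, LATIN_TARGETS)


def skeleton(name: bytes) -> bytes:
    """Replace lookalike Cyrillic bytes with Latin ones.
    Assumes the name has already been lowercased by the casemapping."""
    return name.translate(LOOKALIKE_TABLE)


def confusable(a: bytes, b: bytes) -> bool:
    """True when two already-lowercased names look the same under this table."""
    return skeleton(a) == skeleton(b)
```

For example, `confusable(b"coca", b"\xf1oc\xe0")` is true, because the second name only swaps the first and last letters for their Cyrillic twins. Whether the server should reject such names, or leave the matter to the client and its font as suggested above, is the open question in this thread.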

Bock 2006-11-04 14:24 reporter ~0012584	hm.. I think that those phrases are ONLY about the CASEMAPPING reason..

Date Modified	Username	Field	Change
2006-04-13 14:47	Trocotronic	New Issue
2006-04-13 19:03	Stealth	Note Added: 0011548
2006-04-13 19:18	Trocotronic	Note Added: 0011549
2006-04-19 08:05	Bock	Note Added: 0011580
2006-04-20 16:10	Trocotronic	Note Added: 0011589
2006-05-19 12:10	Spider84	Note Added: 0011741
2006-05-19 16:16	Bock	Note Added: 0011742
2006-06-05 03:35	Bock	Note Added: 0011854
2006-06-05 03:35	Bock	File Added: unreal.3.2.5-locale.tar.gz
2006-06-11 09:10	Bock	Note Added: 0011944
2006-06-11 09:18	Bock	File Added: Locedit.zip
2006-06-11 12:38	Bock	File Added: locedit-fixed.rar
2006-06-11 12:38	Bock	Note Added: 0011945
2006-06-13 08:30	avenger	Note Added: 0011950
2006-06-13 09:23	Bock	Note Added: 0011951
2006-11-01 07:33	syzop	Relationship added	related to 0003101
2006-11-01 07:33	syzop	Relationship added	related to 0002739
2006-11-01 07:39	syzop	Relationship deleted	related to 0003101
2006-11-01 07:39	syzop	Relationship added	has duplicate 0003101
2006-11-01 07:49	syzop	Note Added: 0012540
2006-11-01 07:49	syzop	Summary	Lower/Uppercase with charsets => Configurable CASEMAPPING (Lower/Uppercase with charsets)
2006-11-01 07:56	syzop	Note Added: 0012541
2006-11-01 11:14	Bock	Note Added: 0012546
2006-11-03 13:40	syzop	Relationship added	related to 0002718
2006-11-04 14:24	Bock	Note Added: 0012584
2007-04-19 18:37	~~stskeeps~~	Status	new => acknowledged
2007-04-27 05:50	~~stskeeps~~	Relationship added	related to 0002589
2010-07-14 17:49	syzop	QA	=> Not touched yet by developer
2010-07-14 17:49	syzop	U4: Need for upstream patch	=> No need for upstream InspIRCd patch
2010-07-14 17:49	syzop	U4: Upstream notification of bug	=> Not decided
2010-07-14 17:49	syzop	U4: Contributor working on this	=> None
2010-07-14 17:49	syzop	Severity	minor => feature

Relationships

related to	0002739	closed		Badwords does not support charsets
related to	0002718	closed		Russian nicks is case sensivity
has duplicate	0003101	closed		Suggestion about cyrillic nicks
related to	0002589	closed	syzop	mixed charsets in nicknames