0003719: Add UTF-8 support - UnrealIRCd Bug Tracker

ID	Project	Category	View Status	Date Submitted	Last Update
0003719	unreal	ircd	public	2008-08-14 14:35	2021-03-08 00:51

Reporter	para_1461	Assigned To	syzop
Priority	normal	Severity	feature	Reproducibility	N/A
Status	resolved	Resolution	fixed
OS	N/A	OS Version	N/A
Product Version	4.0.0
Fixed in Version	4.0.17

Summary	0003719: Add UTF-8 support
Description	I'd like to see UTF-8 support for nicks. It's extremely difficult to find an IRCd that supports this, and the only one I know of can't be found on Google to download. I believe it would be a good feature for when our network gains more, and more varied, users.

3rd party modules

Stealth 2008-08-15 00:54 reporter ~0015361	IRCds don't normally allow UTF-8 in nicks due to the simple reason that anyone can use alternate UTF characters that look like other characters to spoof the appearance of another user. For example, my nick (Stealth) can have up to 127 possible fakes with UTF-8. So that means someone can load up to 127 clones with UTF-8 nicks all looking like "Stealth". Or what if someone with a similar host wants to pretend to be me to get my password? Or harass another user? Or carry out some other form of abuse? Then you have the other issues with upper and lower case characters - the same problems are present there as well. Unreal has a setting to enable other character maps for this purpose (set::allowed-nickchars), and that's even questionable because of the issues mentioned above.

n0kS 2010-10-29 16:26 reporter ~0016393	I agree with you, Stealth, but that's why you can require the user to use only one script in their nickname, like: only Cyrillic, only Arabic or only Chinese, with no mixing. Because otherwise, if I start mixing Cyrillic with Latin letters, as you said, I can get a lot of "fakes".
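The single-script rule suggested above can be sketched roughly as follows. This is only an illustration, not UnrealIRCd code: `scripts_used` and `is_single_script` are hypothetical names, and because Python's standard library does not expose the Unicode Script property, the first word of each character's Unicode name is used as a stand-in for its script.

```python
import unicodedata

def scripts_used(nick: str) -> set:
    """Return the script labels of the letters in nick.

    Heuristic (an assumption of this sketch): the first word of a
    character's Unicode name, e.g. "LATIN" or "CYRILLIC", stands in
    for its script.
    """
    return {
        unicodedata.name(ch, "UNKNOWN").split()[0]
        for ch in nick
        if ch.isalpha()  # digits and punctuation are shared between scripts
    }

def is_single_script(nick: str) -> bool:
    """True when all letters in nick come from one script."""
    return len(scripts_used(nick)) <= 1

print(is_single_script("Stealth"))       # True: all Latin
print(is_single_script("Ste\u0430lth"))  # False: U+0430 is Cyrillic "a"
```

A nick written entirely in Cyrillic passes the check; only the mixed form is rejected.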

qdinar 2012-04-22 14:20 reporter ~0016984	This is very useful because people will not need to press a key combination to change keyboard layout just to mention other users.

syzop 2013-01-09 11:10 administrator ~0017339 Last edited: 2013-01-09 20:47	There's a document called 'Unicode Security Considerations' which deals with exactly this: http://www.unicode.org/reports/tr36/ I lost my other link but there are also functions that can see which characters are identical or very similar. --> EDIT: NFKC, combined with 'case folding' to make it case insensitive. If I understand correctly that should solve most if not all of the security concerns (look alike characters). Of course, there are plenty of other things that still have to be solved/done before you have UTF8 support...
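As a rough illustration of the NFKC plus case folding idea, here is a minimal Python sketch using only the standard library. The helper name `nick_key` is hypothetical and this is not how any IRCd implements it; it only shows what the combination does and does not catch.

```python
import unicodedata

def nick_key(nick: str) -> str:
    """Comparison key: NFKC-normalize, case-fold, then normalize again
    (case folding can produce text that is no longer normalized)."""
    folded = unicodedata.normalize("NFKC", nick).casefold()
    return unicodedata.normalize("NFKC", folded)

# "Stealth" written with fullwidth letters (U+FF33, U+FF54, ...)
fullwidth = "\uff33\uff54\uff45\uff41\uff4c\uff54\uff48"

print(nick_key(fullwidth) == nick_key("Stealth"))         # True
print(nick_key("HELL\u00d3") == nick_key("hell\u00f3"))   # True
print(nick_key("Ste\u0430lth") == nick_key("Stealth"))    # False
```

The last line shows the limit of this approach: a Cyrillic "а" (U+0430) is not a compatibility variant of Latin "a", so NFKC leaves it alone. Cross-script lookalikes need something extra, such as a single-script restriction or the confusable-detection data described in the Unicode security reports.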

syzop 2015-12-26 10:31 administrator ~0018945	For some next series (not 4.0.x) I think this would be a nice release goal.

blank 2015-12-29 14:23 reporter ~0018993	YES.

blank 2016-03-20 14:01 reporter ~0019143	@syzop if a network was willing to sponsor this (in € terms), would it speed up getting this added?

syzop 2016-03-27 11:01 administrator ~0019147 Last edited: 2016-03-27 11:02	The next few months I'll mostly be working on things other than UnrealIRCd I'm afraid (so just bug fixes, minor things). I usually do that after such a lengthy period of UnrealIRCd development (a full year on U4 in this case). After that I'm seriously considering looking into this, since I think this would be an important feature.

syzop 2017-11-19 17:28 administrator ~0019973	Depends on https://github.com/ircv3/ircv3-specifications/pull/272 Once spec is agreed on (or direction is clear) we also need some library or drop-in code that IRC servers, services and clients can use to handle this.

k4be 2017-11-25 15:44 developer ~0019977	"I agree with you, Stealth, but that's why you can require the user to use only one script in their nickname, like: only Cyrillic, only Arabic or only Chinese, with no mixing. Because otherwise, if I start mixing Cyrillic with Latin letters, as you said, I can get a lot of "fakes"." Possibly the simplest solution: instead of allowing every possible UTF-8 character, just specify a fixed character list in a config file. This would differ from the old allowed-nickchars in that allowed characters could be longer than one byte. This would be sufficient for (probably) all networks dominated by a single language.
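A fixed character list of this kind might look like the sketch below. The names `ASCII_NICK_CHARS`, `EXTRA_NICK_CHARS` and `nick_allowed` are hypothetical, the Polish letters are just an example list, and the check is simplified (it does not enforce rules such as "a nick may not start with a digit").

```python
import string

# Classic single-byte nick characters (simplified for this sketch).
ASCII_NICK_CHARS = set(string.ascii_letters + string.digits + "[]\\`_^{|}-")

# Hypothetical fixed list an admin might configure for a Polish network.
EXTRA_NICK_CHARS = set("ąćęłńóśźżĄĆĘŁŃÓŚŹŻ")

def nick_allowed(nick: str) -> bool:
    """True if every character is on one of the two allow-lists."""
    return bool(nick) and all(
        ch in ASCII_NICK_CHARS or ch in EXTRA_NICK_CHARS for ch in nick
    )

print(len("ł".encode("utf-8")))  # 2: one character, two bytes on the wire
print(nick_allowed("Michał"))    # True
print(nick_allowed("Михал"))     # False: Cyrillic is not on the list
```

The point of the first print is the difference from a byte-based list: the server has to match whole multi-byte sequences, not individual bytes.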

syzop 2017-11-25 16:23 administrator ~0019978	That is true. The thing is that https://github.com/ircv3/ircv3-specifications/pull/272 also deals with proper CASEMAPPING. So 'hell<o with accent>' is considered the same as 'HELL<O with accent>', as you would expect. So, ideally you would want to fix both these things at the same time. And, at the same time, services adding support for the same. But, yeah, the alternative is to just add the ranges like we do now. And ignore CASEMAPPING for now. That alternative is viable if the previously mentioned github pull request takes too long (and it seems stuck right now). Anyway, more on-topic: Of course, if we permit - say - UTF8 Hebrew then we should only permit the UTF8 ranges and not non-UTF8 Hebrew at the same time, as that would cause the same display and security issues as previously mentioned.

syzop 2017-11-25 21:18 administrator ~0019979	Added, without the casemapping (just like existing set::allowed-nickchars): https://github.com/unrealircd/unrealircd/commit/e3b91f8b94aa775ad2536576a8b5c324754b99ff
* Added UTF8 support in set::allowed-nickchars
  See https://www.unrealircd.org/docs/Nick_Character_Sets
  Example: set { allowed-nickchars { latin-utf8; }; };
Important remarks:
* All your servers must be on UnrealIRCd 4.0.17 (or later)
* Most(?) services do not support this, so users using UTF8 nicknames won't be able to register at NickServ.
* In set::allowed-nickchars you must either choose an utf8 language or a non-utf8 character set. You cannot combine the two.
* You also cannot combine multiple scripts/alphabets, such as: latin, greek, cyrillic and hebrew. You must choose one.
* If you are already using set::allowed-nickchars on your network (eg: 'latin1') then be careful when migrating (to eg: 'latin-utf8'):
  * Your clients may still assume non-UTF8
  * If users registered nicks with accents or other special characters at NickServ then they may not be able to access their account after the migration to UTF8.
[!] Work in progress [!]
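The migration warnings above come down to the same visible nick being different bytes under the two encodings. A small Python sketch of that, using "José" as an arbitrary example nick (nothing here is UnrealIRCd or services code):

```python
nick = "Jos\u00e9"  # "José"

latin1_bytes = nick.encode("latin-1")  # what a latin1 client sends
utf8_bytes = nick.encode("utf-8")      # what a UTF-8 client sends

print(latin1_bytes)                 # b'Jos\xe9'
print(utf8_bytes)                   # b'Jos\xc3\xa9'
print(latin1_bytes == utf8_bytes)   # False: a byte-wise lookup would not match

# A client that still assumes latin1 displays the UTF-8 nick as mojibake.
print(utf8_bytes.decode("latin-1"))  # JosÃ©
```

A nick stored by services as the latin1 bytes would therefore not match the UTF-8 form a user sends after the switch, which is presumably why registered accounts can become unreachable.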

mcken 2018-01-11 17:20 reporter ~0020012	It was a long awaited feature and we are really grateful for having it now. Would it be possible to add an optional full utf-8 support for nicks, where all non-text utf-8 characters such as ??????( ? )? could be used as well? Generally speaking, these UTF-8 characters shouldn't break IRC core functionality when used in nicks.

mcken 2018-01-11 17:24 reporter ~0020013	Sorry for double posting, but editing is not possible. My UTF-8 characters in the previous post are not rendered correctly due to database charset configuration or something similar. The characters I mentioned can be viewed here as an example: http://upli.st/l/list-of-all-ascii-emoticons

syzop 2018-07-14 16:59 administrator ~0020209	Just an update: I'm not working on this for 4.0.19. We'll have to see after that but I'm not aware of services and ircv3 drafts and such catching up.. pity.. hoped I would have started something. Due to different priorities in life and time constraints I have to pick my release targets and this one won't be one of them for next release. As for the last post from mcken: I'm personally kinda reluctant to add such things. As you can see from previous work we try to pick characters/symbols that are "language" so to say, and not symbols like in math or smileys/emoticons and so on.

syzop 2020-09-27 20:07 administrator ~0021773	We added UTF8 nick characters in 4.0.17. Similarly, we have the option to only allow valid utf8 in channel names (it is even the default) since 5.0.0. CASEMAPPING is an entirely different matter though: it still has plenty of problems and is unimplemented: https://bugs.unrealircd.org/view.php?id=2882
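For readers unfamiliar with what "only allow valid utf8" means in practice, a minimal Python sketch of such a check follows. The function name `is_valid_utf8` is hypothetical and this is not the server's actual implementation; it relies on Python's strict UTF-8 decoder rejecting malformed byte sequences.

```python
def is_valid_utf8(raw: bytes) -> bool:
    """True if raw decodes cleanly as UTF-8 (strict mode)."""
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True

print(is_valid_utf8(b"#caf\xc3\xa9"))  # True: "é" encoded as UTF-8
print(is_valid_utf8(b"#caf\xe9"))      # False: lone latin1 byte 0xE9
```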

Date Modified	Username	Field	Change
2008-08-14 14:35	para_1461	New Issue
2008-08-15 00:54	Stealth	Note Added: 0015361
2008-08-15 00:54	Stealth	Status	new => feedback
2008-08-28 01:12	Stealth	Relationship added	has duplicate 0003723
2010-10-29 16:26	n0kS	Note Added: 0016393
2012-04-22 14:20	qdinar	Note Added: 0016984
2013-01-09 11:10	syzop	Note Added: 0017339
2013-01-09 20:45	syzop	Note Edited: 0017339
2013-01-09 20:47	syzop	Note Edited: 0017339
2015-12-26 10:29	syzop	Relationship added	has duplicate 0004503
2015-12-26 10:31	syzop	Note Added: 0018945
2015-12-26 10:31	syzop	Assigned To	=> syzop
2015-12-26 10:31	syzop	Status	feedback => acknowledged
2015-12-26 10:33	syzop	Product Version	3.3-alpha0 => 4.0.0
2015-12-26 10:33	syzop	Summary	UTF-8 charset in UnrealIRCd 3.3 => Add UTF-8 support
2015-12-26 10:33	syzop	Description Updated
2015-12-29 14:23	blank	Note Added: 0018993
2016-03-20 14:01	blank	Note Added: 0019143
2016-03-27 11:01	syzop	Note Added: 0019147
2016-03-27 11:02	syzop	Note Edited: 0019147
2017-11-19 17:28	syzop	Note Added: 0019973
2017-11-25 15:45	k4be	Note Added: 0019977
2017-11-25 16:23	syzop	Note Added: 0019978
2017-11-25 21:18	syzop	Note Added: 0019979
2018-01-11 17:20	mcken	Note Added: 0020012
2018-01-11 17:24	mcken	Note Added: 0020013
2018-07-14 16:59	syzop	Note Added: 0020209
2020-09-27 20:07	syzop	Status	acknowledged => resolved
2020-09-27 20:07	syzop	Resolution	open => fixed
2020-09-27 20:07	syzop	Fixed in Version	=> 4.0.17
2020-09-27 20:07	syzop	Note Added: 0021773

Relationships


has duplicate	0003723	closed		Adding Unicode Support
has duplicate	0004503	closed	syzop	Disallowed umlauts in Username