View Issue Details

IDProjectCategoryView StatusLast Update
0001977unrealircdpublic2015-07-19 19:51
ReporterSET Assigned Tosyzop  
PrioritynormalSeveritymajorReproducibilityalways
Status resolvedResolutionfixed 
PlatformAMD K6OSLinuxOS VersionLinux 2.6.7
Product Version3.2.1 
Fixed in Version3.4-beta2 
Summary0001977: Desyncs after nickcollisions
DescriptionIf two servers are relinking after a netsplit and there is a nickcollision with a winner and a loser (ie not both clients are killed) it will cause desyncs.

Example: "testing" and "observer1" on "server.1" and "testing" and "observer2" on "server.2", everybody is a chanop

A netlink on "server.2" looks like this, if the user on "server.2" loses:

1. *** testing left irc: Nick collision <-- server.2 killed the local user
2. *** testing joined the channel <-- the user on server.1
3. *** observer1 joined the channel
4. *** server.1 sets mode +oo testing observer1
5. *** testing left irc: Killed (server.1 (Nick Collision)) <-- server.1 tried to kill the user on server.2 but is too late, it killed its own user
6. *** testing joined the channel <-- server.1 reintroduced its user

But the user "testing" won't get any chanmodes on this side of the net because server.1 isn't sending them again after reintroducing the user.
As soon as "testing" tries to do something that requires chanop status, like kicking observer2, everyone'll be really surprised.

btw, testing's SWHOIS line is also lost but that's not that important.
TagsNo tags attached.
3rd party modules

Relationships

related to 0002994 resolvedsyzop Missing users after nick collision 

Activities

codemastr

2004-09-04 16:05

reporter   ~0007514

Have you ever actually seen this problem? I've never seen this, nor can I reproduce it. So either this was just a hypothetical scenario, or you're leaving something out of your description.

aquanight

2004-09-24 06:46

reporter   ~0007783

I've just seen this recently:

A net I was on had an extended netsplit, and during such I used a redirector bot to relay messages to a channel between the two sides. It used to same nick on both sides, and when the net remerged, one of the connection was closed, and this very issue occured. I observed from the side with the losing client:

[00:39:51] * Quits: _ ([email protected]) (Nick collision)
[00:39:51] * Joins: sprite ([email protected])
[00:39:51] * Joins: zapth ([email protected])
[00:39:51] * Joins: Drift3r ([email protected])
[00:39:51] * Joins: renegade ([email protected])
[00:39:51] * Joins: shadownight ([email protected])
[00:39:51] * Joins: _ ([email protected])
[00:39:51] * Joins: evilgod69 ([email protected])
[00:39:51] * Joins: w00t ([email protected])
[00:39:51] * Joins: Corey^ ([email protected])
[00:39:51] * Joins: trust ([email protected])
[00:39:51] * Joins: team_42o ([email protected])
[00:39:51] * Mode (boo.netronet.com): +ao sprite sprite
[00:39:51] * Quits: _ ([email protected]) (Killed (boo.netronet.com (Nick Collision)))
[00:39:51] * Joins: _ ([email protected])

As you can see, the losing server killed it's own. The winning server brought it's version in and tried to kill the losing one, but ended up killing it's own, and it reintroduced it. There were no channel modes in effect on either copy, but...

aquanight

2004-09-25 05:51

reporter   ~0007785

Oh I forgot to mention: the two servers the client was on were one hop apart, if that means anything.

aquanight

2004-10-27 15:41

reporter   ~0008132

I think I may have located the relevant code:

This is from m_nick in s_user.c:
        if ((differ && (acptr->lastnick < lastnick)) ||
            (!differ && (acptr->lastnick > lastnick)))
        {
            /*
             * Introduce our "correct" user to the other server
             */

            sendto_one(cptr, ":%s KILL %s :%s (Nick Collision)",
                me.name, parv[1], me.name);
            send_umode(NULL, acptr, 0, SEND_UMODES, buf);
            sendto_one_nickcmd(cptr, acptr, buf);
            if (acptr->user->away)
                sendto_one(cptr, ":%s AWAY :%s", acptr->name,
                    acptr->user->away);
            send_user_joins(cptr, acptr);
            return 0; /* Ignore the NICK */
        }

send_user_joins is (from channel.c):

void send_user_joins(aClient *cptr, aClient *user)
{
    Membership *lp;
    aChannel *chptr;
    int cnt = 0, len = 0, clen;
    char *mask;

    snprintf(buf, sizeof buf, ":%s %s ", user->name,
        (IsToken(cptr) ? TOK_JOIN : MSG_JOIN));
    len = strlen(buf);

    for (lp = user->user->channel; lp; lp = lp->next)
    {
        chptr = lp->chptr;
        if ((mask = index(chptr->chname, ':')))
            if (match(++mask, cptr->name))
                continue;
        if (*chptr->chname == '&')
            continue;
        clen = strlen(chptr->chname);
        if (clen + 1 + len > BUFSIZE - 3)
        {
            if (cnt)
            {
                buf[len - 1] = '\0';
                sendto_one(cptr, "%s", buf);
            }
            snprintf(buf, sizeof buf, ":%s %s ", user->name,
                (IsToken(cptr) ? TOK_JOIN : MSG_JOIN));
            len = strlen(buf);
            cnt = 0;
        }
        (void)strlcpy(buf + len, chptr->chname, sizeof buf-len);
        cnt++;
        len += clen;
        if (lp->next)
        {
            len++;
            (void)strlcat(buf, ",", sizeof buf);
        }
    }
    if (*buf && cnt)
        sendto_one(cptr, "%s", buf);

    return;
}

Note that it does not send mode changes. I think the fix would be to use SJOIN to send the user prefix information as well.

stskeeps

2007-04-15 08:37

reporter   ~0013384

Bump. Did we ever fix this?

aquanight

2007-04-15 11:44

reporter   ~0013388

Just by looking at code, it would appear not (m_nick does not send modes, neither does send_user_joins).

The question becomes which is the preferred solution. Currently send_user_joins tries to cat channels together to avoid sending a potentially big list of commands. Meaning the list must be looped again for modes, or change to sending SJOINS. (Either way a potential crapload of lines goes out.)

Or we could do something else entirely like extend server-server JOIN to accept things like: @#channel1 +#channel2 ~@#channel3 #etc, but then it'd need a timestamp spot.

satmd

2007-04-19 03:29

reporter   ~0013583

I wonder if nick collision kills should be restricted to local users.

I think of it like this:
the losing server kills the nicks that lose locally
and the winning server denies introduction of losing nicks.

From what I'm seeing here, I guess we are sending out multiple kills for the same nick and most likely introduced a race condition.

I say we should not try to kill losers on the other end, but simply deny/drop them during netmerge - the losing server should kill them locally.

I dunno wether/how to code this, so I appreciate a coder's opinion.

stskeeps

2007-04-19 03:36

reporter   ~0013585

Some relevant logs:
[16:36:12]<Trocotronic> I test 0002994
[16:36:53]<Trocotronic> if there is a collision, the nick on the server that made collision hasn't his channel modes
[16:38:07]<Trocotronic> for example
[16:38:24]<Trocotronic> server A and B
[16:39:12]<Trocotronic> well, wait, it's more complex
[16:39:41]<Syzop> maybe you can write some interesting bugnote ;p
[16:39:50] * Syzop not paying too much attention atm (work)
[16:40:02]<Syzop> well, it's up to you, I can read irclogs as well ;)
[17:16:33]<Trocotronic> done
[17:16:42]<Trocotronic> we have a very big desynch
[17:20:17]<Trocotronic> I put it as private
[17:21:16]<Stskeeps> *looks*
[17:22:42]<Stskeeps> Trocotronic: missing user after a nick collision is intentional. We cannot at burst time anticipate a user will be removed later
[17:23:03]<Trocotronic> emmm
[17:23:03]<Stskeeps> the missing user is from the server that the user was behind
[17:23:10]<Trocotronic> I didn't get this message
[17:23:18]<Stskeeps> okay.
[17:23:36]<Stskeeps> When you burst, it sends first NICK introductions, and then comes SJOIN stuff
[17:23:36]<Trocotronic> maybe I forget up me a smode
[17:23:42]<Trocotronic> yes
[17:23:58]<Stskeeps> but okay
[17:23:59]<Trocotronic> but nick is killed again after sjoin
[17:24:17]<Stskeeps> okay, sec, i will look more closely at bug report :P
[17:24:28]<Stskeeps> but Missing user is not a problem, usually
[17:24:29]<Trocotronic> k
[17:24:34]<Stskeeps> as in
[17:24:35]<Stskeeps> the notice
[17:24:39]<Stskeeps> it is just a sideeffect of nick collisions
[17:25:20]<Trocotronic> the problem comes if user has @ or any cmode
[17:25:38]<Stskeeps> are we speaing the user who got killed, or?
[17:25:48]<Trocotronic> yes
[17:25:56]<Trocotronic> no no sorry
[17:26:03]<Trocotronic> the user not killed
[17:26:11]<Stskeeps> okay
[17:26:21]<Stskeeps> hmm.
[17:26:29]<Trocotronic> killed users loss everything (he's killed)
[17:26:50]<Stskeeps> why the devil does he get killed twice..
[17:28:21]<Stskeeps> oh god..
[17:29:04]<Trocotronic> this bug I think acrossed since 3.2betas
[17:29:33]<Stskeeps> and i assume you do not have auto reconnect on?
[17:30:41]<Trocotronic> yes of course
[17:30:47]<Stskeeps> :[email protected] is on the same server as Fourier?
[17:31:17]<Trocotronic> mmm, out of context I can't know
[17:31:24]<Trocotronic> there was two Trocotronic, on both servers
[17:31:32]<Stskeeps> well there's Trocononic! @ virtual-F1F7 smth
[17:31:35]<Stskeeps> and the one with .IP
[17:32:10]<Stskeeps> i assume localhost is the one who is on B
[17:32:17]<Stskeeps> it's just..
[17:32:19]<Trocotronic> yes
[17:32:28]<Trocotronic> localhost is B, and A is remote
[17:32:31] * Stskeeps glares
[17:33:14]<Stskeeps> Fourier isn't on B
[17:33:24]<Stskeeps> <- :irc.redyc.com NOTICE Fourier
[17:33:38]<Stskeeps> so what is what again? :P
[17:33:47]<Trocotronic> ouf
[17:33:49]<Trocotronic> sorry
[17:33:55]<Stskeeps> so A is B and B is A? ;
[17:34:10]<Trocotronic> fourier on A
[17:34:21]<Stskeeps> ok, and Trocotonic on A as well, right?
[17:34:24]<Stskeeps> as in
[17:34:25]<Stskeeps> @localhost
[17:34:42]<Trocotronic> yes
[17:34:45]<Stskeeps> k
[17:35:03]<Trocotronic> A localhost, Fourier on A
[17:35:06]<Stskeeps> k
[17:35:16]<Stskeeps> and B is irc-ssl.redyc.com and A is irc.redyc.com
[17:35:20]<Stskeeps> right?
[17:35:44]<Stskeeps> okay then
[17:35:58]<Stskeeps> what it goes like is .. Trocotonic on A gets quitted cos of nick collision
[17:36:07]<Stskeeps> and Trocotonic B comes in, gets his op
[17:36:16]<Stskeeps> and then B goes in and kills his own user!
[17:36:53]<Stskeeps> O_O
[17:36:59]<Stskeeps> that is what im seeing
[17:36:59]<Trocotronic> it likes to kill me many times
[17:37:10]<Stskeeps> well B kills his own user
[17:37:13]<Stskeeps> which doesnt make sense
[17:37:45]<Stskeeps> i think the problem is we both have Nick collision handling internally, and some that sends off KILLs
[17:37:51]<Stskeeps> and we are seeing a possible race here
[17:38:44]<Stskeeps> and i bet that it sees a nick collision, sneds a KILL to the user
[17:38:46]<Trocotronic> but A sends a KILL message?
[17:38:50]<Stskeeps> no, it doesnt
[17:39:08]<Stskeeps> B does
[17:39:16]<Stskeeps> it kills his own user
[17:39:27]<Trocotronic> mmmm
[17:39:36]<Trocotronic> but it does after sjoin
[17:39:47]<Trocotronic> there aren't any nick message after sjoin
[17:39:53]<Stskeeps> no, but the KILL might be in the buffer
[17:40:01]<Stskeeps> i dont understand why it rejoins, though
[17:40:30]<Stskeeps> i think there's some "reintroduction code"
[17:40:33]<Stskeeps> is this on same machine?
[17:40:36]<Trocotronic> no
[17:41:01]<Trocotronic> I have two machines
[17:41:04]<Trocotronic> .5 and .2
[17:41:08]<Trocotronic> (.5 is localhost)
[17:41:19]<Stskeeps> idbut you are right, that is a bug
[17:42:05]<Stskeeps> arrgh
[17:43:26]<Stskeeps> Syzop: i could use your angle on this
[17:43:51]<Trocotronic> who did this part on m_nick?
[17:44:02]<Stskeeps> it's all the way from dreamforge, afaik
[17:44:42]<Trocotronic> and nobody has solve it?
[17:44:45]<Stskeeps> maybe.
[17:44:50]<Stskeeps> and it's dangerous to touch this code :)
[17:45:04]<Trocotronic> but it's very ugly
[17:45:08]<Stskeeps> yup.
[17:45:15]<Trocotronic> there are a lot of kills
[17:45:47]<Stskeeps> well
[17:45:50]<Stskeeps> we can fix this two ways..
[17:46:04]<Stskeeps> we can try and fix the messy TS code
[17:46:12]<Stskeeps> or we can limit how KILL works
[17:47:02]<Stskeeps> the problem is, there should never be a :server KILL user :reason where user is behind server
[17:47:18]<Stskeeps> it should be :user QUIT :Killed (Nick collision)
[17:47:31]<Stskeeps> am i right?
[17:47:35]<Stskeeps> and that is what the code here relies on
[17:47:37]<Trocotronic> mmmmm
[17:47:41]<Stskeeps> that's the easy fix.
[17:48:05]<Stskeeps> the difficult fix is either relying on that servers kick off the users that nick collide (ie, no kill)
[17:48:14]<Stskeeps> or change nick collisions into kills
[17:48:21]<Stskeeps> you see the problem? ;)
[17:48:38]<Stskeeps> and the last part may cause services to crash
[17:48:49]<Stskeeps> ie, the difficult fix
[17:49:46]<Trocotronic> look at this
[17:49:46]<Trocotronic> <- :[email protected] QUIT :Killed (irc-ssl.redyc.com (Nick Collision))
[17:49:52]<Trocotronic> irc-ssl is remote server
[17:49:57]<Stskeeps> yeah, exactly
[17:50:00]<Stskeeps> that is what im trying to say :)
[17:50:07]<Stskeeps> it kills his own user
[17:50:11]<Stskeeps> which is wrong
[17:50:13]<Stskeeps> in so many ways
[17:50:28]<Stskeeps> the QUIT stuff is generated by the local server
[17:50:31]<Stskeeps> in this instnace
[17:50:31]<Trocotronic> but Trocotronic on irc-ssl still remains!
[17:50:36]<Stskeeps> correct
[17:50:46]<Stskeeps> because the kill came from the server, it doesnt send it back
[17:50:53]<Stskeeps> (hence the Received KILL message from irc-ssl
[17:50:59]<Stskeeps> i think we should change how /kill works instead.
[17:51:12]<Trocotronic> yah
[17:51:19]<Trocotronic> so, irc-ssl sends a kill message to B
[17:51:24]<Trocotronic> to A
[17:51:36]<Stskeeps> yes, and A should just ignore it
[17:51:42]<Stskeeps> because it concerns a user behind B
[17:51:53]<Trocotronic> exactly
[17:51:55]<Stskeeps> and that should be :useronB QUIT :Killed (
[17:52:06]<Stskeeps> we should however be VERY careful not to break other stuff
[17:52:14]<Trocotronic> no, why?
[17:52:30]<Trocotronic> user on A doesn't see any nick introduce
[17:52:34]<Trocotronic> "he wins"
[17:52:47]<Stskeeps> nono
[17:52:49]<Stskeeps> wait :P
[17:53:37]<Stskeeps> when there's a nick collision, and B (irc-ssl) wins, then the client on A quits. B ALSO sends out a KILL for the other client to be sure it's gone
[17:54:06]<Trocotronic> yes
[17:54:55]<Stskeeps> in some cases, B tries to kill client on A, but A misunderstands since it knows the new Trocotonic is from B
[17:55:00]<Stskeeps> err
[17:55:02]<Stskeeps> in some cases, B tries to kill client on A, but A misunderstands since it knows the new Trocotonic is from A
[17:55:12]<Stskeeps> and it kills the Trocotonic from A
[17:55:42]<Stskeeps> or atleast, it does it in it's own little world
[17:55:59]<Stskeeps> and B should have just ignored the KILL Trocotonic-A since it comes from A
[17:56:16]<Stskeeps> there are some interesting consequences to this though
[17:56:27]<Stskeeps> if we fix it so it ignores
[17:56:54]<Stskeeps> you can no longer see if someone kills from other servers, unless if the kill passes your server to find the target
[17:57:18]<Trocotronic> I'm looking for this KILL message
[17:58:17]<Trocotronic> I think it's the last if before nickkilldone
[17:58:55]<Trocotronic> he makes a kill to cptr
[18:02:51]<Trocotronic> mmm but B receives a kill message, then B makes a join
[18:02:53]<Trocotronic> arg
[18:05:56]<Stskeeps> im not sure of the consequences of this
[18:07:25]<Stskeeps> okay
[18:07:32]<Stskeeps> changing KILL semantics is suicide
[18:08:16]<Stskeeps> i vote we go in and assume servers handle nick collisions sanely (quit their own user)
[18:08:51]<Trocotronic> how?
[18:08:59]<Stskeeps> well changing the code..
[18:09:05]<Trocotronic> avoiding kill from serverS?
[18:09:07]<Stskeeps> well
[18:09:13]<Stskeeps> avoiding to send the KILL and reintroduction stuff
[18:09:29]<Stskeeps> this needs BIG CONSIDERATION
[18:10:27]<Stskeeps> but the kill and reintroduction stuff is a lack from the old days of non-TS servers
[18:10:38]<Stskeeps> but we might have services problems with this
[18:13:19]<Trocotronic> send_user_joins(cptr, acptr);
[18:13:28]<Stskeeps> that should set modes too
[18:14:43]<Trocotronic> irc-ssl introduce the nick again
[18:14:58]<Trocotronic> both servers introduce the nick
[18:16:12]<Trocotronic> send_user_joins only sends a JOIN
[18:16:27]<Trocotronic> not MODE to his channels
[18:17:08]<Stskeeps> ah
[18:17:14]<Stskeeps> well we could also fix it like that
[18:17:21]<Stskeeps> by sending the modes
[18:17:22]<Trocotronic> if ((differ && (acptr->lastnick < lastnick)) ||
[18:17:22]<Trocotronic> (!differ && (acptr->lastnick > lastnick)))
[18:17:23]<Trocotronic> {
[18:17:32]<Trocotronic> I think we could remove this if
[18:17:45]<Stskeeps> trust me, that is dangerous code :P
[18:17:55]<Trocotronic> ok ok
[18:18:25]<Trocotronic> this part causes the second kill and rejoin
[18:18:37]<Trocotronic> I dont' know why still is there
[18:20:37]<Stskeeps> well i know some services rely on it
[18:20:41]<Stskeeps> because they dont do TS checking
[18:21:16]<Trocotronic> but this part is only A-B
[18:21:28]<Trocotronic> not for the rest

w00t

2007-04-19 03:59

reporter   ~0013586

My vote, wait for UID.

The problem here is that it's killing the 'wrong' person when they sign back on. UID fixes this in that we don't hit the wrong targets.

We can fix it better, too, by removing KILL on collide and moving towards a system of forcing nickchange to UID.

satmd

2007-04-19 04:23

reporter   ~0013588

I probably should be more precise on my idea, so here a scenario:
[The wording may sound weird - but I'm trying to include scenarios that involve multiple hops as well]
server.A and server.B split
nick1 will be winning on server.A

server.A:
We will win with our nick1, so we keep ours and drop(!) the incoming nick1 from server.B

server.B:
We will lose with our nick1 (or another server on our end of the merge), so we kill our own nick1 first locally and forward the kill to ALL servers on OUR end (and NOT to the server we just try to link to). After that, we'll introduce server.A's nick1 locally and the servers on our own end of the merge.

This should work pretty safe and eliminate the double KILL that currently causes this bug, too. As far as I see, this also works fine if we consider that there might be nicks losing from remote servers (e.g. nick1 came from server.C which is linked to server.B all the time).

Please discuss :)

syzop

2015-07-19 19:51

administrator   ~0018531

This was indeed screwed up. Other servers being confused by this shit.

commit 87c0bbed3ccfd4bf3bf101078e0bc064935328ba
Author: Bram Matthys <[email protected]>
Date: Sun Jul 19 19:47:41 2015 +0200

    Send nick collision kill with a winner to the losing side only. Trust the NICK is in-flight to that server and that it will kill his own and assign ours the winner. This fixes a bug until now where it was killing the wrong user (reproduced locally) and prevents re-introducing of our client which we will never do correctly. (0001977). If you can present a legit case where this is wrong, let me know.


https://github.com/unrealircd/unrealircd/commit/87c0bbed3ccfd4bf3bf101078e0bc064935328ba

So we have a bit more trust now.. I tested a number of nick collisions with artificial lag and it's much better now. Previously I tested with 3 nicks on 3 different servers and at one point I had 4 users in my nick list... interesting desynch. Now it looks good.

Issue History

Date Modified Username Field Change
2004-07-18 21:06 SET New Issue
2004-09-04 16:05 codemastr Note Added: 0007514
2004-09-24 06:46 aquanight Note Added: 0007783
2004-09-25 05:51 aquanight Note Added: 0007785
2004-10-27 15:41 aquanight Note Added: 0008132
2007-04-15 08:37 stskeeps Note Added: 0013384
2007-04-15 08:37 stskeeps Relationship added related to 0002994
2007-04-15 11:44 aquanight Note Added: 0013388
2007-04-19 03:02 stskeeps Status new => confirmed
2007-04-19 03:29 satmd Note Added: 0013583
2007-04-19 03:36 stskeeps Note Added: 0013585
2007-04-19 03:59 w00t Note Added: 0013586
2007-04-19 04:23 satmd Note Added: 0013588
2015-07-19 19:51 syzop Note Added: 0018531
2015-07-19 19:51 syzop Status confirmed => resolved
2015-07-19 19:51 syzop Fixed in Version => 3.4-beta2
2015-07-19 19:51 syzop Resolution open => fixed
2015-07-19 19:51 syzop Assigned To => syzop