|View Issue Details|
|ID||Project||Category||View Status||Date Submitted||Last Update|
|0004044||unreal||ircd||public||2011-10-20 16:32||2015-07-10 13:07|
|Target Version||Fixed in Version|
|Summary||0004044: Servers don't want to link after a split, or netsplit after linking|
|Description||Hello, this is a strange issue we're having on our network. I want to say that we haven't had this problem with UnrealIRCd 3.2.7 or 126.96.36.199.|
Here's the problem (well, there might be two problems):
- Server 2, which is linked to server 1, gets disconnected due to a ping timeout (something very strange) and it can't auto-reconnect. Server 2 (or server 1) has to restart the ircd for the connection to happen (there is no internet connection problem here).
- (this happens only sometimes) After server 2 has restarted the ircd and connected, it immediately enters some kind of ping timeout state: users on server 2 can't talk to users on server 1, nor can they identify to services (which are connected to server 1), etc. The server doesn't respond to anything, but doesn't get delinked for some time.
|Additional Information||[16:06:48] * *** Notice -- (link) ZIPLink irc.politeia.in -> Calculate.linuxmaniac.net[@0:0:0:0:0:ffff:188.8.131.52.58265] established|
[16:06:48] * (link) ZIPLink Calculate.linuxmaniac.net -> irc.politeia.in[@184.108.40.206.0] established
[16:06:48] * *** Notice -- Possible negative TS split at link Calculate.linuxmaniac.net (1319119608 - 1319119610 = -2)
[16:06:48] * *** Notice -- Link Calculate.linuxmaniac.net -> irc.politeia.in is now synced [secs: -2 recv: 0.478 sent: 2.821]
[16:06:48] * *** Notice -- Zipstats for link to Calculate.linuxmaniac.net[@0:0:0:0:0:ffff:220.127.116.11.58265]: decompressed (in): 170=>234 (72.6%), compressed (out): 8679=>2502 (28.8%)
[16:06:49] * (sync) Link irc.politeia.in -> Calculate.linuxmaniac.net is now synced [secs: 2 recv: 2.821 sent: 0.482]
[16:10:04] * *** Notice -- No response from Calculate.linuxmaniac.net[18.104.22.168], closing link
|Tags||No tags attached.|
|3rd party modules|
Thanks for the report.
You say the connection doesn't happen until a restart.. what exactly happens before you do the restart? Only 'No response from ...'? Or also any other messages?
Regarding the ping timeout, this seems strange as well. Normally I'd point towards incorrect time / timers, but in your case the clocks are only 2 seconds apart, which cannot explain this, so it must be something else. I don't have an explanation for this at this point.
You probably would have mentioned it if it did, but I'm asking anyway:
Did you get any 'TimeShift' or other warning prior to the first ping timeout?
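For reference, the 2-second figure can be checked directly from the 'Possible negative TS split' notice in the report. A minimal shell sketch, assuming the notice keeps the '(A - B = D)' layout shown in the log; `ts_delta` is a hypothetical helper name, not part of UnrealIRCd:

```shell
# ts_delta LINE: pull the two timestamps out of "(A - B = D)" and recompute A - B.
# Hypothetical helper for illustration; assumes the notice format seen in this report.
ts_delta() {
    nums=$(printf '%s\n' "$1" | sed -n 's/.*(\([0-9][0-9]*\) - \([0-9][0-9]*\) = .*/\1 \2/p')
    [ -n "$nums" ] || return 1      # line did not look like a TS split notice
    set -- $nums                    # $1 = first timestamp, $2 = second timestamp
    echo $(( $1 - $2 ))
}

# Notice text copied from the 'Additional Information' log above.
line='Possible negative TS split at link Calculate.linuxmaniac.net (1319119608 - 1319119610 = -2)'
ts_delta "$line"    # prints -2
```

A difference of a couple of seconds like this is what the note above refers to as too small to explain a ping timeout.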
Last edited: 2011-10-21 19:05
>You say the connection doesn't happen until a restart.. what exactly happens before you do the restart? Only 'No response from ...'?
>Or also any other messages?
No other messages.
>Did you get any 'TimeShift' or other warning prior to the first ping timeout?
No. Everything shown from when the servers connect to when they disconnect is already pasted under 'Additional Information'.
Also, about 2 days ago we started using 22.214.171.124 and this problem hasn't happened even once. With 3.2.9-rc*, if the problem doesn't appear immediately, it can occur 8 or 12 hours after the link. And then again, one of the servers in the link has to be restarted to get them linked again.
Last edited: 2011-10-25 21:58
Okay this happened today:
[21:30:43] * *** Notice -- No response from Politeia.Services[126.96.36.199], closing link
[21:30:43] * *** Notice -- No response from Politeia.Stats[188.8.131.52], closing link
[21:30:43] * *** Notice -- No response from Politeia.Slaves[184.108.40.206], closing link
The services (Anope/Denora/NeoStats) split from my server. All of them (ircd+services) being on the same box...
Ok, that rules out a ziplinks or SSL problem (which already would make no sense, anyway).
So, what happened 'today', is that with Unreal 3.2.9-rcX again?
I don't see how this can be a fault in UnrealIRCd. However, if you consistently see a clear difference between 3.2.9-rcX behavior and 220.127.116.11 behavior, then... strange.
The only way to debug this (further) would be doing a network capture, with 'tcpdump' or 'wireshark'.
Then, capture the traffic between your server and another server you know will fail eventually (or from 1 server to all servers, don't know your topology)...
Example command: tcpdump -w network_dump.pcap -s 0 -i ethXXX 'port 12345'
Replace ethXXX with the ethernet device that carries the traffic to the other servers.
The 'port 12345' assumes you use a special server port so you can easily filter on that port, otherwise you can use something like 'host 18.104.22.168 or host 22.214.171.124 or host 126.96.36.199' and so on...
Better yet, put an & at the end of that tcpdump line so it runs in the background and you don't need to leave the terminal open...
Now, when you experience the symptoms and you have seen the 'no response' a few times, you can 'killall -15 tcpdump' (preferably -15, and not a hard kill -9), and then your 'network_dump.pcap' contains all the traffic.
You can load this 'network_dump.pcap' in wireshark and scroll down to the last XX frames, where the problem appears.
You can also zip & send the file to me at firstname.lastname@example.org
An alternative is to start the process from above when the symptoms appear. That means you won't be able to debug the initial disconnection, but you can debug why it would not properly reconnect. This alternative results in a much much shorter file, and less (privacy-)sensitive data.
Or: do both (just be sure to log to a different file in the second instance) :)
Let me know if you get any results. And if anything is unclear about what I wrote, just ask.
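The capture procedure from this note could be scripted roughly as follows. This is only a sketch: the function names (`build_host_filter`, `start_capture`, `stop_capture`), the interface `eth0` and the 192.0.2.x addresses are placeholders for illustration, and tcpdump normally needs root.

```shell
# build_host_filter HOST...: join addresses into a capture filter,
# e.g. "host A or host B", for when there is no dedicated server port.
build_host_filter() {
    filter=""
    for h in "$@"; do
        if [ -z "$filter" ]; then
            filter="host $h"
        else
            filter="$filter or host $h"
        fi
    done
    printf '%s\n' "$filter"
}

# start_capture IFACE OUTFILE FILTER: run tcpdump in the background
# (full packets via -s 0) and remember its PID.
start_capture() {
    tcpdump -w "$2" -s 0 -i "$1" "$3" &
    CAPTURE_PID=$!
}

# stop_capture: send signal 15 (SIGTERM), not 9, so the pcap file is closed cleanly.
stop_capture() {
    kill -15 "$CAPTURE_PID"
}

# Example, left commented out because it needs root and real addresses:
#   start_capture eth0 network_dump.pcap "$(build_host_filter 192.0.2.10 192.0.2.20)"
#   ... wait until 'No response from ...' has appeared a few times ...
#   stop_capture
```

Stopping by PID rather than with 'killall' only affects the capture started here, which helps if a second capture is logging to a different file at the same time.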
In my message above I meant that the server even drops servers (services in this case) that are on the same computer as the ircd. That means (as I understand it) it's nearly impossible for this to be a ping timeout or something similar, so it must be something in the code that makes the server drop the other servers.
>if you consistently see a clear difference between 3.2.9-rcX behavior and 188.8.131.52 behavior, then...
Yes. As I said before, I'm having no problems with 184.108.40.206, which is also running on the same system with other services (on different ports, just for testing stability).
I'll try what you suggest with tcpdump, and as soon as I have the data I'll send it to your email.
Ah ok, I misread. I thought you meant all 3 who split were on the same box ;P.
It was also not clear to me that you use 3.2.8 and 3.2.9-rcX at the same time.
Ok, that makes things rather interesting.
Let me know any results from the trace, I've good hopes that it will help us find the issue.
Did you run the trace, n0ks? I don't remember ever receiving anything :)
Anyone else experiencing this?
Hello Syzop, first of all, sorry for not sending you any of what you requested. I had to abandon further investigation of this bug because I shut down one of the two servers on my network.
However, I've seen a couple of people having this same issue (both in the IRC support channel and on other networks). Their servers experienced something similar: two of their servers just disconnected from each other while both of the boxes involved had a perfect connection to the internet (no packets dropped and such).
As this happens on a completely random basis I couldn't make any of the netadmins send me a valid packet capture... Lazy people I guess.
I'm sure of only one thing... It's not always reproducible and it doesn't happen to everyone. It takes a specific network configuration (currently unknown to me) for this to happen.
Also, I just hope all these speculations are wrong and these random disconnects happen because of naturally occurring ping timeouts on the network path between the two boxes, and not because of something in Unreal's core.
Please close this bug, and let's hope someone who hits this same problem opens a bug with more and better info.
Thanks for your time Syzop.
|Closed. And using 0003972 (which may be closed soon too).|
|2011-10-20 16:32||n0kS||New Issue|
|2011-10-21 10:07||syzop||Note Added: 0016756|
|2011-10-21 10:07||syzop||Status||new => acknowledged|
|2011-10-21 18:25||n0kS||Note Added: 0016757|
|2011-10-21 19:05||n0kS||Note Edited: 0016757||View Revisions|
|2011-10-25 21:57||n0kS||Note Added: 0016758|
|2011-10-25 21:58||n0kS||Note Edited: 0016758||View Revisions|
|2011-10-26 10:17||syzop||Note Added: 0016759|
|2011-10-26 10:26||syzop||Note Added: 0016760|
|2011-10-27 02:00||n0kS||Note Added: 0016762|
|2011-10-28 21:40||syzop||Note Added: 0016763|
|2012-08-17 13:31||syzop||Note Added: 0017080|
|2012-08-17 13:31||syzop||Status||acknowledged => feedback|
|2012-08-18 03:07||n0kS||Note Added: 0017093|
|2015-07-10 13:07||syzop||Note Added: 0018471|
|2015-07-10 13:07||syzop||Status||feedback => closed|
|2015-07-10 13:07||syzop||Assigned To||=> syzop|
|2015-07-10 13:07||syzop||Resolution||open => duplicate|