View Issue Details

ID: 0004044
Project: unrealircd
Category:
View Status: public
Last Update: 2015-07-10 13:07
Reporter: n0kS
Assigned To: syzop
Priority: normal
Severity: major
Reproducibility: sometimes
Status: closed
Resolution: duplicate
Platform: Linux
OS: Gentoo
OS Version: 2.1
Product Version: 3.2.9-RC2
Summary: 0004044: Servers doesn't want to link after a split or netsplit after link
Description: Hello, this is a strange issue we're having on our network. We didn't have this problem with UnrealIRCd 3.2.7 or 3.2.8.1.
Here's the problem (well, there might be two problems):
1st.
- Server 2, which is linked to server 1, gets disconnected due to a ping timeout (which is very strange in itself) and can't auto-reconnect. Server 2 (or server 1) has to restart the ircd for the connection to happen (there is no internet connection problem here).
2nd.
- (This happens only sometimes.) After server 2 has restarted the ircd and connected, it immediately enters some kind of ping timeout state: users on server 2 can't talk to users on server 1, can't identify to services (which are connected to server 1), etc. The server doesn't respond to anything, yet it doesn't get delinked for some time.
Additional Information:
[16:06:48] * *** Notice -- (link) ZIPLink irc.politeia.in -> Calculate.linuxmaniac.net[@0:0:0:0:0:ffff:85.217.253.135.58265] established
[16:06:48] * (link) ZIPLink Calculate.linuxmaniac.net -> irc.politeia.in[@90.175.216.100.0] established
[16:06:48] * *** Notice -- Possible negative TS split at link Calculate.linuxmaniac.net (1319119608 - 1319119610 = -2)
[16:06:48] * *** Notice -- Link Calculate.linuxmaniac.net -> irc.politeia.in is now synced [secs: -2 recv: 0.478 sent: 2.821]
[16:06:48] * *** Notice -- Zipstats for link to Calculate.linuxmaniac.net[@0:0:0:0:0:ffff:85.217.253.135.58265]: decompressed (in): 170=>234 (72.6%), compressed (out): 8679=>2502 (28.8%)
[16:06:49] * (sync) Link irc.politeia.in -> Calculate.linuxmaniac.net is now synced [secs: 2 recv: 2.821 sent: 0.482]
[16:10:04] * *** Notice -- No response from Calculate.linuxmaniac.net[85.217.253.135], closing link
Tags: No tags attached.
3rd party modules

Activities

syzop

2011-10-21 10:07

administrator   ~0016756

Thanks for the report.

You say the connection doesn't happen until a restart.. what exactly happens before you do the restart? Only 'No response from ...'? Or also any other messages?

Regarding the ping timeout, this seems strange as well. Normally I would point to incorrect time / timers, but in your case the clocks are only 2 seconds apart, which cannot explain this, so it must be something else. I don't have an explanation at this point.

You probably would have mentioned it if it did, but I'm asking anyway:
Did you get any 'TimeShift' or other warning prior to the first ping timeout?

n0kS

2011-10-21 18:25

reporter   ~0016757

Last edited: 2011-10-21 19:05

>You say the connection doesn't happen until a restart.. what exactly happens before you do the restart? Only 'No response from ...'?
Exactly!

>Or also any other messages?
No other messages.

>Did you get any 'TimeShift' or other warning prior first ping timeout?
No. I've already pasted everything that shows up between the servers connecting and disconnecting under 'Additional Information'.


Also, about 2 days ago we started using 3.2.8.1, and this problem hasn't happened even once since. With 3.2.9-rc*, if the problem doesn't appear immediately, it can occur 8 or 12 hours after the link. And then again, one of the servers in the link has to be restarted to get them linked again.

n0kS

2011-10-25 21:57

reporter   ~0016758

Last edited: 2011-10-25 21:58

Okay this happened today:
[21:30:43] * *** Notice -- No response from Politeia.Services[90.175.216.100], closing link
[21:30:43] * *** Notice -- No response from Politeia.Stats[90.175.216.100], closing link
[21:30:43] * *** Notice -- No response from Politeia.Slaves[90.175.216.100], closing link

The services (Anope/Denora/NeoStats) split from my server, even though all of them (ircd + services) are on the same box...

syzop

2011-10-26 10:17

administrator   ~0016759

Ok, that rules out a ziplinks or SSL problem (which already would make no sense, anyway).

So, what happened 'today', is that with Unreal 3.2.9-rcX again?

I don't see how this can be a fault in UnrealIRCd. However, if you consistently see a clear difference between 3.2.9-rcX behavior and 3.2.8.1 behavior, then... strange.

syzop

2011-10-26 10:26

administrator   ~0016760

The only way to debug this (further) would be to do a network capture with 'tcpdump' or 'wireshark'.
Capture the traffic between your server and another server you know will fail eventually (or from 1 server to all servers, I don't know your topology).
Example command: tcpdump -w network_dump.pcap -s 0 -i ethXXX 'port 12345'
Replace ethXXX with the ethernet device that carries the traffic to the other servers.
The 'port 12345' assumes you use a dedicated server port so you can easily filter on it; otherwise you can use something like 'host 1.2.3.4 or host 5.6.7.8 or host 1.1.1.1' and so on...

Better yet, put an & at the end of that tcpdump line so it runs in the background and you don't need to leave the terminal open...

Now, when you experience the symptoms and have seen the 'no response' a few times, you can run 'killall -15 tcpdump' (preferably -15, and not a hard kill -9); your 'network_dump.pcap' will then contain all the traffic.
You can load this 'network_dump.pcap' in wireshark and scroll down to the latest XX frames where the problem appears.
You can also zip & send the file to me at [email protected]
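
The capture procedure described above could be wrapped in a few small shell functions. This is only a sketch under assumptions: the function names build_host_filter, start_capture and stop_capture are made up for illustration, the interface and peer addresses are placeholders, and tcpdump generally needs root privileges.

```shell
#!/bin/sh
# Sketch only: function names, interface and addresses are illustrative.

# Join peer addresses into a tcpdump filter expression, e.g.
#   build_host_filter 1.2.3.4 5.6.7.8  prints  host 1.2.3.4 or host 5.6.7.8
build_host_filter() {
    filter=""
    for peer in "$@"; do
        if [ -z "$filter" ]; then
            filter="host $peer"
        else
            filter="$filter or host $peer"
        fi
    done
    printf '%s\n' "$filter"
}

# Start a full-packet capture in the background and remember its PID.
# Arguments: interface, output file, then one or more peer addresses.
start_capture() {
    iface=$1
    pcap=$2
    shift 2
    tcpdump -w "$pcap" -s 0 -i "$iface" "$(build_host_filter "$@")" &
    echo $! > "$pcap.pid"
}

# Stop the capture with signal 15 (SIGTERM) rather than a hard kill -9,
# so tcpdump can finish writing the file before it exits.
stop_capture() {
    pcap=$1
    kill -15 "$(cat "$pcap.pid")" && rm -f "$pcap.pid"
}
```

For example, 'start_capture ethXXX network_dump.pcap 1.2.3.4 5.6.7.8' starts the capture, and 'stop_capture network_dump.pcap' ends it once the 'no response' notices have appeared. Storing the PID is a design choice to avoid 'killall', which would also stop any unrelated tcpdump process on the box.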

An alternative is to start the process from above when the symptoms appear. That means you won't be able to debug the initial disconnection, but you can debug why it would not properly reconnect. This alternative results in a much much shorter file, and less (privacy-)sensitive data.

Or: do both (just be sure to log to a different file in the second instance) :)
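
To avoid overwriting the first capture when a second one is started, a helper along these lines could choose an unused file name. The name next_capture_file and the _2, _3 suffix scheme are assumptions for illustration only.

```shell
#!/bin/sh
# Sketch only: prints "<base>.pcap" if that file does not exist yet,
# otherwise "<base>_2.pcap", "<base>_3.pcap", and so on.
next_capture_file() {
    base=$1
    n=1
    candidate="$base.pcap"
    while [ -e "$candidate" ]; do
        n=$((n + 1))
        candidate="${base}_$n.pcap"
    done
    printf '%s\n' "$candidate"
}
```

It could then be used as: tcpdump -w "$(next_capture_file network_dump)" -s 0 -i ethXXX 'port 12345' &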

Let me know if you get any results. And if anything is unclear about what I wrote, just ask.

n0kS

2011-10-27 02:00

reporter   ~0016762

What I meant in my message above is that the server even drops servers (services in this case) that are on the same computer as the ircd. That means (to my understanding) it's nearly impossible for this to be a ping timeout or something similar, so it must be something in the code that makes the server drop the other servers.

>if you consistently see a clear difference between 3.2.9-rcX behavior and 3.2.8.1 behavior, then...
Yes. As I said before, I'm having no problems with 3.2.8.1, which is also running on the same system with other services (on different ports, just for testing stability).

----------------

I'll try what you suggest with tcpdump, and as soon as I have the data I'll send it to your email.

syzop

2011-10-28 21:40

administrator   ~0016763

Ah ok, I misread. I thought you meant all 3 who split were on the same box ;P.
It was also not clear to me that you use 3.2.8 and 3.2.9-rcX at the same time.

Ok, that makes things rather interesting.

Let me know the results of the trace; I have good hopes that it will help us find the issue.

syzop

2012-08-17 13:31

administrator   ~0017080

Did you run the trace, n0kS? I don't remember ever receiving anything :)

Anyone else experiencing this?

n0kS

2012-08-18 03:07

reporter   ~0017093

Hello Syzop, first of all, sorry for not sending you any of what you requested. I had to abandon further investigation of this bug because I shut down one of the two servers on my network.

However, I've seen a couple of people having this same issue (both in the IRC support channel and on other networks). Unexpectedly, their servers experienced something similar: two of their servers just disconnected from each other while both of the boxes involved had a perfect connection to the internet (no packets dropped and such).

As this happens on a completely random basis, I couldn't get any of the netadmins to send me a valid packet capture... Lazy people I guess.

I'm sure of only one thing... It's not always reproducible and it doesn't happen to everyone. Some specific network configuration (currently unknown to me) has to be in place for this to happen.

Also, I just hope all these speculations are wrong and these random disconnects are caused by ordinary ping timeouts on the network route between the two boxes, and not by something in Unreal's core.

Please close this bug, and let's hope someone who hits this same problem opens a bug with more and proper info.

Thanks for your time Syzop.

syzop

2015-07-10 13:07

administrator   ~0018471

Closed. Using 0003972 for this instead (which may be closed soon too).

Issue History

Date Modified Username Field Change
2011-10-20 16:32 n0kS New Issue
2011-10-21 10:07 syzop Note Added: 0016756
2011-10-21 10:07 syzop Status new => acknowledged
2011-10-21 18:25 n0kS Note Added: 0016757
2011-10-21 19:05 n0kS Note Edited: 0016757
2011-10-25 21:57 n0kS Note Added: 0016758
2011-10-25 21:58 n0kS Note Edited: 0016758
2011-10-26 10:17 syzop Note Added: 0016759
2011-10-26 10:26 syzop Note Added: 0016760
2011-10-27 02:00 n0kS Note Added: 0016762
2011-10-28 21:40 syzop Note Added: 0016763
2012-08-17 13:31 syzop Note Added: 0017080
2012-08-17 13:31 syzop Status acknowledged => feedback
2012-08-18 03:07 n0kS Note Added: 0017093
2015-07-10 13:07 syzop Note Added: 0018471
2015-07-10 13:07 syzop Status feedback => closed
2015-07-10 13:07 syzop Assigned To => syzop
2015-07-10 13:07 syzop Resolution open => duplicate