View Issue Details
|ID||Project||Category||View Status||Date Submitted||Last Update|
|0003972||unreal||ircd||public||2010-12-01 16:25||2020-06-26 15:23|
|Fixed in Version||5.0.6|
|Summary||0003972: Autoconnect link does not reconnect after split|
|Description||Sometimes (not always but nearly), when a connection between two hubs is lost, the link does not come back up even though it is set to autoconnect in the config file.|
The ircd "sees" the split (ie "Lost connection to ..." notice and SQUIT in the logs), but remains on its own indefinitely (10 hours in the last instance until someone reconnected manually).
It seems that the problem builds up over time, ie. the first couple of disconnections after a restart/rehash are usually handled correctly (autotonnect kicks in), but later on, autoconnect just does not happen any longer.
|Additional Information||The only clue so far is:|
[Wed Dec 1 00:19:13 2010] - WARNING: Time jumped 23 seconds ahead! (1291159130
[Wed Dec 1 00:19:13 2010] - [TimeShift] Resetting some timers!
The server has ntpd running and has *not* skipped ahead itself.
|Tags||No tags attached.|
|3rd party modules||delaylist, regexcept, quarantine, nocodes|
I've found the mantis bug, it was caused by ~ followed by certain numbers.
I already reported it straight away on http://www.mantisbt.org/bugs/view.php?id=12580 when I heard about this (better email next time ;p), but as you can see I had to trace it myself...
Anyway, the rest will be on-topic.
So far haven't looked at your issue.
The jumps in time are really quite odd.. and big.
Could it be that your server clock is running too slow and ntp is adjusting it every once in a while in a huge leap? Actually ntpd shouldn't do that (except 1st tiem on boot sometimes), only ntpdate does things like that...
also there was a bug resolved which involved autoconnect & time offsets, but that one probably does not apply to you:
- Fixed issue where a negative time offset (either caused by ircd.tune or
timesynch) made autoconnect not work for the duration of the offset
(eg: -60 would make autoconnect wait 60 seconds after boot, instead of
autoconnecting almost immediately). Reported by aragon (0003853).
I mean.. it shouldn't apply, unless you have a negative offset of several days... or something ;)
See ircd.tune, first number, for the TS offset. Or use /TSCTL ALLTIME or something.
I'm running a test ircd alongside one of the production ones so as to mimic the setup as closely as possible. The test daemon is a debug+Debug compile, and it has been patched according to issue 3853, as it was suggested that it might solve my issue. I'll let it run for a while and simulate splits, see what happens.
Also, I forgot to mention that the kernel has several GrSec features enabled, especially some of the /proc restrictions. Any chance it might be related ?
@sysop : ntp resets time by about -0.15s every hour or so, which doesn't seem so bad but still might be improved by re-allowing /proc access and using adjtimex maybe. It never uses bigger leaps. ircd.tune has 0 916 in it.
TSCTL ALLTIME reports that all servers are in sync, except one at the moment running at -3. Never seen a TS offset in the days order of magnitude.
Then I don't think 0003853 is related to this.
I also don't see how grsecurity or any /proc restrictions could have any affect.
The tech info regarding detecting time jumps is:
1) it saves the current time
2) the ircd does what it needs to.. I mean.. accepting connections, reading, writing, etc... the main loop
3) it compares the current time with the saved time
If, like in this case, 23 seconds have passed between 1 and 3, then we assume the clock was adjusted.
It MAY have a different cause, like if you have anything else stalling the ircd for 23 seconds, like a very very heavy spamfilter or you are attaching a debugger and doing things and then 'c' continuing. But.. it is very unlikely that you wouldn't have noticed that if you & others use that server ;)
Is it relatively easy to reproduce? Like if I have your config, modify the server names of course, then /squit or tcpkill the connection a few times and then it stops autoconnecting?
If you can give a test case - like something I can reproduce here - that would be one way to trace it.
Bleh, should have put this all in one comment ;)
Another important question to find out is: is this problem specific to this server?
It's not easy to reproduce. I've set up the test/debug server for that reason. It lives on the same physical host where the failing ircd is, so I hope that it will suffer the same symptom sooner or later, but there's no guarantee. It doesn't seem to do it after a few disconnects, but only after a certain amount of time.
Also, it seems to happen more frequently if the split is caused by external factors rather than SQUIT.
The issue is not specific to this server, we have seen this behaviour on two different hubs.
Regarding the 23 seconds time shift, I have no idea what might have happened :S
same problem here.
My server is connected in irc network using ssl, it's running Debian Squeeze and time is synchronized with level 2 ntp server.
There is many timeshifts every day (from 20s to 50s). And after netsplit, I have to make rehash, otherwise it wont connect.
||This may seem simple but what is the value for connfreq in your server's class block?|
my connfreq is set to 30, but it doesn't want to connect even after an hour :). But only in the case of netsplit - after restart, etc it's ok.
And the timeshift problem it kinda unpleasant too, because all clients are freezed during the timejump :(.
Regarding time jumps, you could run a bash script for a day (or at least to catch one or two jumps) like:
And see if UnrealIRCd is lying or if there are really jumps ;)
Simply look back at date.log whenever unreal says the time jumped.
date.log should show 1 (and sometimes 2) second increments of time, anything big like >5 means the box itself has some kind of clock problem (or major performance issues).
IIRC I coded it that during a jump all client timers are reset so clients shouldn't notice any 'lag', however the freeze might be caused by the problem with the machine itself (I mean, it being a primary symptom caused by the machine itself, rather than a symptom caused by unreal).
Well, if you can, run the script, and see if you get any interesting results ;)
I've experienced the same issue I think, but I only have log files, I was not online so cannot see if it did any (failed) linking attempts for sure.
[Thu Dec 30 10:46:10 2010] - TSCTL: U:line security.xxxx set time to be 1293727573 (timediff: 392, was 391)
[Thu Dec 30 11:03:02 2010] - Lost connection to huba.xxxx[74.xxxx]:Operation timed out
[Thu Dec 30 11:03:02 2010] - SQUIT huba.xxxxx (Operation timed out)
[Thu Dec 30 12:39:28 2010] - OPER (Syzop) by (Syzop!~none@xxxxx)
[Thu Dec 30 12:39:37 2010] - SERVER hubb.xxxx
This server has 2 server classes, one with connfreq 27 and one with connfreq 90.
huba is in the 90s one, and hubb in the 27s one.
It seems neither autoconnect was triggered. When I went online this server was not linked to any server (/MAP).
As soon as I opered up and did a '/connect hubb*' it linked instantly.
Seems we got a real issue here.
Strange thing is, I don't recall having seen it before.
Unreal version: 220.127.116.11
I guess I'll have to go over the autoconnect code (haven't really looked or touched that ever before), and play a bit with tcpkill and such.
Just to add to the already existing weirdness ...
We've had maybe 4 or 5 splits recently. The last one displayed the bug, but the 3 or 4 before that went well, ie. autoconnect kicked in. So it's really quite random ... I'll check the dates, see if anything comes up from that
Quite some code changed in this area, and I also did this (which in itself does not fix this issue, but would fix other cases I guess):
Author: Bram Matthys <firstname.lastname@example.org>
Date: Fri Jul 24 11:57:09 2015 +0200
In fix_timers(), called for time jumps, also deal with aconf->hold (outgoing autoconnects). Change throttling flush to a more conservative approach.
I hate to be the bearer of bad news, but this issue might still be current.
We've got a network (different from the one that caused the initial bug report btw) running 4.0.17 (upgrade in progress). This morning we suffered a couple of splits. The leaf running 4.0.17 reconnected in about 10 seconds. Another running 4.0.14 stayed disconnected until I manually issued a CONNECT.
Aside from the obvious move to upgrade to 4.0.17, what other tools could be used to help diagnose this issue? Certainly 0003973 would help, but I'm open to other suggestions as well.
||I also need to add that another link between two 4.0.14 hubs had no trouble coming back up automatically, so it's still rather random… which makes it all the more difficult to diagnose without some insight as to whether reconnecting is even attempted.|
||Yes, must be rather annoying, especially with the random aspect (which is also annoying for me :D). I'll implement 0003973 in 4.0.19 and hopefully it will give you more information that allows you to help this bug report as well.|
||As promised, 0003973 has been implemented in 4.0.19. Should help a lot to trace this issue.|
||Have you seen this bug happen since 4.0.19 (or later)?|
I confirm the issue.
* *** Connection to (hidden)[(hidden)] activated.
* Lost connection to (hidden)[(hidden)]: SSL_connect(): Internal OpenSSL error or protocol error: ssl_undefined_function: called a function you should not call. Please verify that listen::options::ssl is enabled on port 7029 in (hidden)'s configuration file.
But immediately after rehash (no config change)...
* *** Loading IRCd configuration..
* *** Configuration loaded without any problems.
* *** Connection to (hidden)[(hidden)] activated.
* *** (link) Secure link (hidden) -> (hidden)[@(hidden)] established (TLSv1.2-ECDHE-RSA-AES256-GCM-SHA384)
||I see you mentioned a link error, but.. this issue is meant about failure to reconnect at all, like: no attempts at all. Is that what happened? Like, do you have a connfreq of e.g. 1 minute and it was completely silent for several minutes? (>1m anyway)|
||No attempts are visible on the receiving side, but they are made. The messages i have posted appear on the connecting side. Autoonnect is enabled, connfreq is 100. The issue seems to have appeared after enabling spkifp password on all servers. I think more than half of the splits need intervention (rehash the leaf that won't connect). Version is 4.2.3.|
What k4be talks about at the end is a linking attempt which fails, that is not what this issue was about. This issue was about no linking attempt at all being made anymore (nor logged).
So with respect to this (original) issue: I am marking as fixed since it has not been reported anymore since 4.0.19 eh.. 4.2.0.
I am sorry to report that this problem still exists in v5. Our whole network has been upgraded to 18.104.22.168 and last night, as the main hub experience network issues, this happened:
[Wed Apr 1 22:02:06 2020] - No response from hub3.ournetwork.tld[x.x.x.x], closing link
[Wed Apr 1 22:02:06 2020] - SQUIT hub3.outnetwork.tld (Ping timeout: 140 seconds)
[Wed Apr 1 22:03:22 2020] - Connection handshake timeout while connecting to server 'hub3.ournetwork.tld' (x.x.x.x)
Since then, hub3 does not appear anywhere in the logs: there has been no connection attempts from the main hub to hub3, despite the autoconnect option being set. Every other leaf managed to connect back to the main hub.
After further investigation, 2 out of our 5 leaves were "stuck" in the following state
[Wed Apr 1 19:59:25 2020] - Connection handshake timeout while connecting to server 'hub1.ournetwork.tld' (y.y.y.y)
[Wed Apr 1 19:59:35 2020] - Connection handshake timeout while connecting to server 'hub1.ournetwork.tld' (y.y.y.y)
[Wed Apr 1 19:59:46 2020] - Connection handshake timeout while connecting to server 'hub1.ournetwork.tld' (y.y.y.y)
[Wed Apr 1 20:00:39 2020] - Connection handshake timeout while connecting to server 'hub1.ournetwork.tld' (y.y.y.y)
They remained that way untile we simply rehashed them. In this instance, the autoconnect "worked", but since the erro message is the same as when hub1's autoconnect did not, I'm including this for completeness' sake. Perhaps it deserves its own, additional bug?
||For now just here is fine, thanks for the report and notes.. I'll have a look when I have time again. It at least gives me some scenario's for trying to reproduce it I guess:)|
||Good. Also, all our links are ssl-enabled and use spkifp for authentication, in case it matters.|
Still trying to reproduce it here but still no luck. And yeah I played a little with iptables and that sort of stuff.
Important to note here is that I just wrote something that improves the logging. It previously was "not good enough", so in the future this may help you and others:
commit f437593b8dbd7e86de8b296e25be289e47182742 (HEAD -> unreal50, origin/unreal50) Author: Bram Matthys <email@example.com> Date: Mon Apr 13 13:36:58 2020 +0200 Rewrite and expand notices+logging with regards to server linking / lost link. When connecting, use slightly different wording (and use it consistently): "Trying to activate link with server xyz" When the connection is lost before synced: "Unable to link with server xyz" When the connection is lost after fully synced (eg: minutes later): "Lost server link to xyz" Important small changes (other than text): * Log ERRORs from remote servers to the log (previously only shown to ircops) * Some link errors could have been previously suppressed due to old code assuming other parts of the code would send or log the error (this would be the case for an error when calling SSL/TLS write functions) * More?
The notices from above are in in 5.0.4, which was just released. Also, as mentioned, there there was one case (failing SSL/TLS write) where notices could have been missed entirely. I hope it will help with tracing this issue!
Le_Coyote: just wondering, are you on 5.0.4 (or later)? And are you still experiencing the issue?
I will leave it open for now, even if you report none, but... just wondering.
We are now running 5.0.4 network-wide. We had a couple of long lasting splits after the upgrade. In both cases, a rehash *immediately* solved the problem.
In the first occurrence, the log had lots of: "Connection handshake timeout while trying to link to server 'hub.ournetwork.org'"
In the second, the error was different: "Unable to link with server hub.ournetwork.org[redacted]: SSL_connect(): Internal OpenSSL error or protocol error: ssl_undefined_function: called a function you should not call. Please verify that listen::options::ssl is enabled on port 7026 in hub.ournetwork.org's configuration file."
There had been no configuration change on either end that might explain either message.
I can't be certain these incidents are related to the initial bug.
||I have reported the "called a function you should not call" problem on https://bugs.unrealircd.org/view.php?id=5607|
||Thanks @k4be, I've added a note there as well|
0005607 has been solved now ("called a function you should not call")
Let's close this one (0003972) as well, since it seems logging is working correctly now :). Previously there were no errors at all so it was all a mystery...
|2010-12-01 16:25||Le_Coyote||New Issue|
|2010-12-02 04:02||ohnobinki||Additional Information Updated|
|2010-12-02 04:03||ohnobinki||Additional Information Updated|
|2010-12-10 20:48||syzop||Additional Information Updated|
|2010-12-10 20:48||syzop||Additional Information Updated|
|2010-12-10 20:48||syzop||Additional Information Updated|
|2010-12-10 20:55||syzop||Note Added: 0016447|
|2010-12-10 21:03||syzop||Additional Information Updated|
|2010-12-10 21:04||syzop||Note Added: 0016449|
|2010-12-10 21:06||syzop||Note Added: 0016450|
|2010-12-10 21:54||Le_Coyote||Note Added: 0016451|
|2010-12-11 07:30||Le_Coyote||Note Added: 0016453|
|2010-12-11 12:26||syzop||Note Added: 0016454|
|2010-12-11 12:28||syzop||Note Added: 0016455|
|2010-12-11 12:29||syzop||Note Added: 0016456|
|2010-12-11 17:44||Le_Coyote||Note Added: 0016457|
|2010-12-26 20:52||karel||Note Added: 0016513|
|2010-12-26 23:46||katsklaw||Note Added: 0016514|
|2010-12-27 02:39||karel||Note Added: 0016515|
|2010-12-27 02:39||karel||Note Edited: 0016515|
|2010-12-27 02:42||karel||Note Edited: 0016515|
|2010-12-27 11:09||syzop||Note Added: 0016516|
|2010-12-30 21:03||syzop||Note Added: 0016545|
|2011-01-23 20:33||Le_Coyote||Note Added: 0016587|
|2011-06-05 21:37||syzop||Relationship added||child of 0003776|
|2011-07-17 23:24||syzop||Relationship deleted||child of 0003776|
|2011-07-17 23:24||syzop||Relationship added||child of 0003915|
|2012-08-23 19:23||syzop||Note Edited: 0016516|
|2012-10-06 12:22||syzop||Relationship deleted||child of 0003915|
|2015-07-24 12:00||syzop||Note Added: 0018562|
|2015-07-24 12:00||syzop||Status||new => resolved|
|2015-07-24 12:00||syzop||Fixed in Version||=> 3.4-beta3|
|2015-07-24 12:00||syzop||Resolution||open => fixed|
|2015-07-24 12:00||syzop||Assigned To||=> syzop|
|2018-05-04 11:19||Le_Coyote||Status||resolved => feedback|
|2018-05-04 11:19||Le_Coyote||Resolution||fixed => reopened|
|2018-05-04 11:19||Le_Coyote||Note Added: 0020129|
|2018-05-04 11:24||Le_Coyote||Note Added: 0020130|
|2018-07-14 16:44||syzop||Note Added: 0020202|
|2018-09-09 17:06||syzop||Note Added: 0020303|
|2019-02-01 14:34||syzop||Note Added: 0020474|
|2019-08-05 06:37||k4be||Note Added: 0020797|
|2019-11-20 17:16||syzop||Note Added: 0021105|
|2019-11-20 18:21||k4be||Note Added: 0021115|
|2020-01-20 15:31||syzop||Status||feedback => resolved|
|2020-01-20 15:31||syzop||Fixed in Version||3.4-beta3 => 4.2.0|
|2020-01-20 15:31||syzop||Note Added: 0021232|
|2020-04-02 10:20||Le_Coyote||Status||resolved => feedback|
|2020-04-02 10:20||Le_Coyote||Note Added: 0021393|
|2020-04-02 10:30||syzop||Product Version||3.2.8 => 22.214.171.124|
|2020-04-02 10:40||Le_Coyote||Note Added: 0021394|
|2020-04-02 10:40||syzop||Fixed in Version||4.2.0 =>|
|2020-04-02 10:42||syzop||Note Added: 0021395|
|2020-04-02 11:13||Le_Coyote||Note Added: 0021396|
|2020-04-13 13:46||syzop||Note Added: 0021467|
|2020-04-13 13:46||syzop||Note Edited: 0021467|
|2020-04-19 15:24||syzop||Note Added: 0021512|
|2020-04-19 15:25||syzop||Note Edited: 0021512|
|2020-06-26 07:28||syzop||Note Added: 0021643|
|2020-06-26 12:45||Le_Coyote||Note Added: 0021644|
|2020-06-26 12:54||k4be||Note Added: 0021645|
|2020-06-26 14:40||Le_Coyote||Note Added: 0021647|
|2020-06-26 15:23||syzop||Note Added: 0021650|
|2020-06-26 15:23||syzop||Status||feedback => resolved|
|2020-06-26 15:23||syzop||Fixed in Version||=> 5.0.6|