0003972: Autoconnect link does not reconnect after split

ID	Project	Category	View Status	Date Submitted	Last Update
0003972	unreal	ircd	public	2010-12-01 16:25	2020-06-26 15:23

Reporter	Le_Coyote	Assigned To	syzop
Priority	normal	Severity	major	Reproducibility	sometimes
Status	resolved	Resolution	reopened
Product Version	5.0.3.1
Fixed in Version	5.0.6

Summary	0003972: Autoconnect link does not reconnect after split
Description	Sometimes (not always but nearly), when a connection between two hubs is lost, the link does not come back up even though it is set to autoconnect in the config file. The ircd "sees" the split (ie "Lost connection to ..." notice and SQUIT in the logs), but remains on its own indefinitely (10 hours in the last instance until someone reconnected manually). It seems that the problem builds up over time, ie. the first couple of disconnections after a restart/rehash are usually handled correctly (autotonnect kicks in), but later on, autoconnect just does not happen any longer.
Additional Information	The only clue so far is: [Wed Dec 1 00:19:13 2010] - WARNING: Time jumped 23 seconds ahead! (1291159130 -> 1291159153) [Wed Dec 1 00:19:13 2010] - [TimeShift] Resetting some timers! The server has ntpd running and has not skipped ahead itself.

3rd party modules	delaylist, regexcept, quarantine, nocodes

syzop 2010-12-10 20:55 administrator ~0016447	I've found the mantis bug, it was caused by ~ followed by certain numbers. I already reported it straight away on http://www.mantisbt.org/bugs/view.php?id=12580 when I heard about this (better email next time ;p), but as you can see I had to trace it myself... Anyway, the rest will be on-topic. So far haven't looked at your issue.

syzop 2010-12-10 21:04 administrator ~0016449	The jumps in time are really quite odd.. and big. Could it be that your server clock is running too slow and ntp is adjusting it every once in a while in a huge leap? Actually ntpd shouldn't do that (except 1st tiem on boot sometimes), only ntpdate does things like that...

syzop 2010-12-10 21:06 administrator ~0016450	also there was a bug resolved which involved autoconnect & time offsets, but that one probably does not apply to you: - Fixed issue where a negative time offset (either caused by ircd.tune or timesynch) made autoconnect not work for the duration of the offset (eg: -60 would make autoconnect wait 60 seconds after boot, instead of autoconnecting almost immediately). Reported by aragon (0003853). I mean.. it shouldn't apply, unless you have a negative offset of several days... or something ;) See ircd.tune, first number, for the TS offset. Or use /TSCTL ALLTIME or something.

Le_Coyote 2010-12-10 21:54 reporter ~0016451	I'm running a test ircd alongside one of the production ones so as to mimic the setup as closely as possible. The test daemon is a debug+Debug compile, and it has been patched according to issue 3853, as it was suggested that it might solve my issue. I'll let it run for a while and simulate splits, see what happens. Also, I forgot to mention that the kernel has several GrSec features enabled, especially some of the /proc restrictions. Any chance it might be related ?

Le_Coyote 2010-12-11 07:30 reporter ~0016453	@sysop : ntp resets time by about -0.15s every hour or so, which doesn't seem so bad but still might be improved by re-allowing /proc access and using adjtimex maybe. It never uses bigger leaps. ircd.tune has 0 916 in it. TSCTL ALLTIME reports that all servers are in sync, except one at the moment running at -3. Never seen a TS offset in the days order of magnitude.

syzop 2010-12-11 12:26 administrator ~0016454	Then I don't think 0003853 is related to this. I also don't see how grsecurity or any /proc restrictions could have any affect. The tech info regarding detecting time jumps is: 1) it saves the current time 2) the ircd does what it needs to.. I mean.. accepting connections, reading, writing, etc... the main loop 3) it compares the current time with the saved time If, like in this case, 23 seconds have passed between 1 and 3, then we assume the clock was adjusted. It MAY have a different cause, like if you have anything else stalling the ircd for 23 seconds, like a very very heavy spamfilter or you are attaching a debugger and doing things and then 'c' continuing. But.. it is very unlikely that you wouldn't have noticed that if you & others use that server ;)

syzop 2010-12-11 12:28 administrator ~0016455	Is it relatively easy to reproduce? Like if I have your config, modify the server names of course, then /squit or tcpkill the connection a few times and then it stops autoconnecting? If you can give a test case - like something I can reproduce here - that would be one way to trace it.

syzop 2010-12-11 12:29 administrator ~0016456	Bleh, should have put this all in one comment ;) Another important question to find out is: is this problem specific to this server?

Le_Coyote 2010-12-11 17:44 reporter ~0016457	It's not easy to reproduce. I've set up the test/debug server for that reason. It lives on the same physical host where the failing ircd is, so I hope that it will suffer the same symptom sooner or later, but there's no guarantee. It doesn't seem to do it after a few disconnects, but only after a certain amount of time. Also, it seems to happen more frequently if the split is caused by external factors rather than SQUIT. The issue is not specific to this server, we have seen this behaviour on two different hubs. Regarding the 23 seconds time shift, I have no idea what might have happened :S

karel 2010-12-26 20:52 reporter ~0016513	Hello, same problem here. My server is connected in irc network using ssl, it's running Debian Squeeze and time is synchronized with level 2 ntp server. There is many timeshifts every day (from 20s to 50s). And after netsplit, I have to make rehash, otherwise it wont connect.

katsklaw 2010-12-26 23:46 reporter ~0016514	This may seem simple but what is the value for connfreq in your server's class block?

karel 2010-12-27 02:39 reporter ~0016515 Last edited: 2010-12-27 02:42	Hello katsklaw, my connfreq is set to 30, but it doesn't want to connect even after an hour :). But only in the case of netsplit - after restart, etc it's ok. And the timeshift problem it kinda unpleasant too, because all clients are freezed during the timejump :(.

syzop 2010-12-27 11:09 administrator ~0016516 Last edited: 2012-08-23 19:23	Regarding time jumps, you could run a bash script for a day (or at least to catch one or two jumps) like: #!/bin/bash while true do date >>date.log sleep 1 done And see if UnrealIRCd is lying or if there are really jumps ;) Simply look back at date.log whenever unreal says the time jumped. date.log should show 1 (and sometimes 2) second increments of time, anything big like >5 means the box itself has some kind of clock problem (or major performance issues). IIRC I coded it that during a jump all client timers are reset so clients shouldn't notice any 'lag', however the freeze might be caused by the problem with the machine itself (I mean, it being a primary symptom caused by the machine itself, rather than a symptom caused by unreal). Well, if you can, run the script, and see if you get any interesting results ;)

syzop 2010-12-30 21:03 administrator ~0016545	I've experienced the same issue I think, but I only have log files, I was not online so cannot see if it did any (failed) linking attempts for sure. Anyway, log: [Thu Dec 30 10:46:10 2010] - TSCTL: U:line security.xxxx set time to be 1293727573 (timediff: 392, was 391) [Thu Dec 30 11:03:02 2010] - Lost connection to huba.xxxx[74.xxxx]:Operation timed out [Thu Dec 30 11:03:02 2010] - SQUIT huba.xxxxx (Operation timed out) [Thu Dec 30 12:39:28 2010] - OPER (Syzop) by (Syzop!~none@xxxxx) [Thu Dec 30 12:39:37 2010] - SERVER hubb.xxxx This server has 2 server classes, one with connfreq 27 and one with connfreq 90. huba is in the 90s one, and hubb in the 27s one. It seems neither autoconnect was triggered. When I went online this server was not linked to any server (/MAP). As soon as I opered up and did a '/connect hubb*' it linked instantly. Seems we got a real issue here. Strange thing is, I don't recall having seen it before. Unreal version: 3.2.8.1 I guess I'll have to go over the autoconnect code (haven't really looked or touched that ever before), and play a bit with tcpkill and such.

Le_Coyote 2011-01-23 20:33 reporter ~0016587	Just to add to the already existing weirdness ... We've had maybe 4 or 5 splits recently. The last one displayed the bug, but the 3 or 4 before that went well, ie. autoconnect kicked in. So it's really quite random ... I'll check the dates, see if anything comes up from that

syzop 2015-07-24 12:00 administrator ~0018562	Quite some code changed in this area, and I also did this (which in itself does not fix this issue, but would fix other cases I guess): https://github.com/unrealircd/unrealircd/commit/8bea34626398cb10217e5f097456f113d22dae3b commit 8bea34626398cb10217e5f097456f113d22dae3b Author: Bram Matthys <[email protected]> Date: Fri Jul 24 11:57:09 2015 +0200 In fix_timers(), called for time jumps, also deal with aconf->hold (outgoing autoconnects). Change throttling flush to a more conservative approach.

Le_Coyote 2018-05-04 11:19 reporter ~0020129	I hate to be the bearer of bad news, but this issue might still be current. We've got a network (different from the one that caused the initial bug report btw) running 4.0.17 (upgrade in progress). This morning we suffered a couple of splits. The leaf running 4.0.17 reconnected in about 10 seconds. Another running 4.0.14 stayed disconnected until I manually issued a CONNECT. Aside from the obvious move to upgrade to 4.0.17, what other tools could be used to help diagnose this issue? Certainly 0003973 would help, but I'm open to other suggestions as well.

Le_Coyote 2018-05-04 11:24 reporter ~0020130	I also need to add that another link between two 4.0.14 hubs had no trouble coming back up automatically, so it's still rather random… which makes it all the more difficult to diagnose without some insight as to whether reconnecting is even attempted.

syzop 2018-07-14 16:44 administrator ~0020202	Yes, must be rather annoying, especially with the random aspect (which is also annoying for me :D). I'll implement 0003973 in 4.0.19 and hopefully it will give you more information that allows you to help this bug report as well.

syzop 2018-09-09 17:06 administrator ~0020303	As promised, 0003973 has been implemented in 4.0.19. Should help a lot to trace this issue.

syzop 2019-02-01 14:34 administrator ~0020474	Have you seen this bug happen since 4.0.19 (or later)?

k4be 2019-08-05 06:37 developer ~0020797	I confirm the issue. * *** Connection to (hidden)[(hidden)] activated. * Lost connection to (hidden)[(hidden)]: SSL_connect(): Internal OpenSSL error or protocol error: ssl_undefined_function: called a function you should not call. Please verify that listen::options::ssl is enabled on port 7029 in (hidden)'s configuration file. But immediately after rehash (no config change)... * *** Loading IRCd configuration.. * *** Configuration loaded without any problems. * *** Connection to (hidden)[(hidden)] activated. * *** (link) Secure link (hidden) -> (hidden)[@(hidden)] established (TLSv1.2-ECDHE-RSA-AES256-GCM-SHA384)

syzop 2019-11-20 17:16 administrator ~0021105	I see you mentioned a link error, but.. this issue is meant about failure to reconnect at all, like: no attempts at all. Is that what happened? Like, do you have a connfreq of e.g. 1 minute and it was completely silent for several minutes? (>1m anyway)

k4be 2019-11-20 18:21 developer ~0021115	No attempts are visible on the receiving side, but they are made. The messages i have posted appear on the connecting side. Autoonnect is enabled, connfreq is 100. The issue seems to have appeared after enabling spkifp password on all servers. I think more than half of the splits need intervention (rehash the leaf that won't connect). Version is 4.2.3.

syzop 2020-01-20 15:31 administrator ~0021232	What k4be talks about at the end is a linking attempt which fails, that is not what this issue was about. This issue was about no linking attempt at all being made anymore (nor logged). So with respect to this (original) issue: I am marking as fixed since it has not been reported anymore since 4.0.19 eh.. 4.2.0.

Le_Coyote 2020-04-02 10:20 reporter ~0021393	I am sorry to report that this problem still exists in v5. Our whole network has been upgraded to 5.0.3.1 and last night, as the main hub experience network issues, this happened: [Wed Apr 1 22:02:06 2020] - No response from hub3.ournetwork.tld[x.x.x.x], closing link [Wed Apr 1 22:02:06 2020] - SQUIT hub3.outnetwork.tld (Ping timeout: 140 seconds) [Wed Apr 1 22:03:22 2020] - Connection handshake timeout while connecting to server 'hub3.ournetwork.tld' (x.x.x.x) Since then, hub3 does not appear anywhere in the logs: there has been no connection attempts from the main hub to hub3, despite the autoconnect option being set. Every other leaf managed to connect back to the main hub.

Le_Coyote 2020-04-02 10:40 reporter ~0021394	After further investigation, 2 out of our 5 leaves were "stuck" in the following state [Wed Apr 1 19:59:25 2020] - Connection handshake timeout while connecting to server 'hub1.ournetwork.tld' (y.y.y.y) [Wed Apr 1 19:59:35 2020] - Connection handshake timeout while connecting to server 'hub1.ournetwork.tld' (y.y.y.y) [Wed Apr 1 19:59:46 2020] - Connection handshake timeout while connecting to server 'hub1.ournetwork.tld' (y.y.y.y) [Wed Apr 1 20:00:39 2020] - Connection handshake timeout while connecting to server 'hub1.ournetwork.tld' (y.y.y.y) They remained that way untile we simply rehashed them. In this instance, the autoconnect "worked", but since the erro message is the same as when hub1's autoconnect did not, I'm including this for completeness' sake. Perhaps it deserves its own, additional bug?

syzop 2020-04-02 10:42 administrator ~0021395	For now just here is fine, thanks for the report and notes.. I'll have a look when I have time again. It at least gives me some scenario's for trying to reproduce it I guess:)

Le_Coyote 2020-04-02 11:13 reporter ~0021396	Good. Also, all our links are ssl-enabled and use spkifp for authentication, in case it matters.

syzop 2020-04-13 13:46 administrator ~0021467 Last edited: 2020-04-13 13:46	Still trying to reproduce it here but still no luck. And yeah I played a little with iptables and that sort of stuff. Important to note here is that I just wrote something that improves the logging. It previously was "not good enough", so in the future this may help you and others: commit f437593b8dbd7e86de8b296e25be289e47182742 (HEAD -> unreal50, origin/unreal50) Author: Bram Matthys <[email protected]> Date: Mon Apr 13 13:36:58 2020 +0200 Rewrite and expand notices+logging with regards to server linking / lost link. When connecting, use slightly different wording (and use it consistently): "Trying to activate link with server xyz" When the connection is lost before synced: "Unable to link with server xyz" When the connection is lost after fully synced (eg: minutes later): "Lost server link to xyz" Important small changes (other than text): * Log ERRORs from remote servers to the log (previously only shown to ircops) * Some link errors could have been previously suppressed due to old code assuming other parts of the code would send or log the error (this would be the case for an error when calling SSL/TLS write functions) * More?

syzop 2020-04-19 15:24 administrator ~0021512 Last edited: 2020-04-19 15:25	The notices from above are in in 5.0.4, which was just released. Also, as mentioned, there there was one case (failing SSL/TLS write) where notices could have been missed entirely. I hope it will help with tracing this issue!

syzop 2020-06-26 07:28 administrator ~0021643	Le_Coyote: just wondering, are you on 5.0.4 (or later)? And are you still experiencing the issue? I will leave it open for now, even if you report none, but... just wondering.

Le_Coyote 2020-06-26 12:45 reporter ~0021644	We are now running 5.0.4 network-wide. We had a couple of long lasting splits after the upgrade. In both cases, a rehash immediately solved the problem. In the first occurrence, the log had lots of: "Connection handshake timeout while trying to link to server 'hub.ournetwork.org'" In the second, the error was different: "Unable to link with server hub.ournetwork.org[redacted]: SSL_connect(): Internal OpenSSL error or protocol error: ssl_undefined_function: called a function you should not call. Please verify that listen::options::ssl is enabled on port 7026 in hub.ournetwork.org's configuration file." There had been no configuration change on either end that might explain either message. I can't be certain these incidents are related to the initial bug.

k4be 2020-06-26 12:54 developer ~0021645	I have reported the "called a function you should not call" problem on https://bugs.unrealircd.org/view.php?id=5607

Le_Coyote 2020-06-26 14:40 reporter ~0021647	Thanks @k4be, I've added a note there as well

syzop 2020-06-26 15:23 administrator ~0021650	0005607 has been solved now ("called a function you should not call") Let's close this one (0003972) as well, since it seems logging is working correctly now :). Previously there were no errors at all so it was all a mystery...

Date Modified	Username	Field	Change
2010-12-01 16:25	Le_Coyote	New Issue
2010-12-02 04:02	ohnobinki	Additional Information Updated
2010-12-02 04:03	ohnobinki	Additional Information Updated
2010-12-10 20:48	syzop	Additional Information Updated
2010-12-10 20:48	syzop	Additional Information Updated
2010-12-10 20:48	syzop	Additional Information Updated
2010-12-10 20:55	syzop	Note Added: 0016447
2010-12-10 21:03	syzop	Additional Information Updated
2010-12-10 21:04	syzop	Note Added: 0016449
2010-12-10 21:06	syzop	Note Added: 0016450
2010-12-10 21:54	Le_Coyote	Note Added: 0016451
2010-12-11 07:30	Le_Coyote	Note Added: 0016453
2010-12-11 12:26	syzop	Note Added: 0016454
2010-12-11 12:28	syzop	Note Added: 0016455
2010-12-11 12:29	syzop	Note Added: 0016456
2010-12-11 17:44	Le_Coyote	Note Added: 0016457
2010-12-26 20:52	karel	Note Added: 0016513
2010-12-26 23:46	katsklaw	Note Added: 0016514
2010-12-27 02:39	karel	Note Added: 0016515
2010-12-27 02:39	karel	Note Edited: 0016515
2010-12-27 02:42	karel	Note Edited: 0016515
2010-12-27 11:09	syzop	Note Added: 0016516
2010-12-30 21:03	syzop	Note Added: 0016545
2011-01-23 20:33	Le_Coyote	Note Added: 0016587
2011-06-05 21:37	syzop	Relationship added	child of 0003776
2011-07-17 23:24	syzop	Relationship deleted	child of 0003776
2011-07-17 23:24	syzop	Relationship added	child of 0003915
2012-08-23 19:23	syzop	Note Edited: 0016516
2012-10-06 12:22	syzop	Relationship deleted	child of 0003915
2015-07-24 12:00	syzop	Note Added: 0018562
2015-07-24 12:00	syzop	Status	new => resolved
2015-07-24 12:00	syzop	Fixed in Version	=> 3.4-beta3
2015-07-24 12:00	syzop	Resolution	open => fixed
2015-07-24 12:00	syzop	Assigned To	=> syzop
2018-05-04 11:19	Le_Coyote	Status	resolved => feedback
2018-05-04 11:19	Le_Coyote	Resolution	fixed => reopened
2018-05-04 11:19	Le_Coyote	Note Added: 0020129
2018-05-04 11:24	Le_Coyote	Note Added: 0020130
2018-07-14 16:44	syzop	Note Added: 0020202
2018-09-09 17:06	syzop	Note Added: 0020303
2019-02-01 14:34	syzop	Note Added: 0020474
2019-08-05 06:37	k4be	Note Added: 0020797
2019-11-20 17:16	syzop	Note Added: 0021105
2019-11-20 18:21	k4be	Note Added: 0021115
2020-01-20 15:31	syzop	Status	feedback => resolved
2020-01-20 15:31	syzop	Fixed in Version	3.4-beta3 => 4.2.0
2020-01-20 15:31	syzop	Note Added: 0021232
2020-04-02 10:20	Le_Coyote	Status	resolved => feedback
2020-04-02 10:20	Le_Coyote	Note Added: 0021393
2020-04-02 10:30	syzop	Product Version	3.2.8 => 5.0.3.1
2020-04-02 10:40	Le_Coyote	Note Added: 0021394
2020-04-02 10:40	syzop	Fixed in Version	4.2.0 =>
2020-04-02 10:42	syzop	Note Added: 0021395
2020-04-02 11:13	Le_Coyote	Note Added: 0021396
2020-04-13 13:46	syzop	Note Added: 0021467
2020-04-13 13:46	syzop	Note Edited: 0021467
2020-04-19 15:24	syzop	Note Added: 0021512
2020-04-19 15:25	syzop	Note Edited: 0021512
2020-06-26 07:28	syzop	Note Added: 0021643
2020-06-26 12:45	Le_Coyote	Note Added: 0021644
2020-06-26 12:54	k4be	Note Added: 0021645
2020-06-26 14:40	Le_Coyote	Note Added: 0021647
2020-06-26 15:23	syzop	Note Added: 0021650
2020-06-26 15:23	syzop	Status	feedback => resolved
2020-06-26 15:23	syzop	Fixed in Version	=> 5.0.6

View Issue Details

Activities

Issue History