View Issue Details

ID: 0003972
Project: unrealircd
View Status: public
Last Update: 2020-06-26 15:23
Reporter: Le_Coyote
Assigned To: syzop
Priority: normal
Severity: major
Reproducibility: sometimes
Status: resolved
Resolution: reopened
Platform: x86
OS: Linux
OS Version: 2.6.22
Product Version: 5.0.3.1
Fixed in Version: 5.0.6
Summary: 0003972: Autoconnect link does not reconnect after split
Description: Sometimes (not always, but nearly always), when a connection between two hubs is lost, the link does not come back up even though it is set to autoconnect in the config file.

The ircd "sees" the split (i.e. a "Lost connection to ..." notice and a SQUIT in the logs), but remains on its own indefinitely (10 hours in the last instance, until someone reconnected manually).

It seems that the problem builds up over time, i.e. the first couple of disconnections after a restart/rehash are usually handled correctly (autoconnect kicks in), but later on autoconnect just does not happen any longer.
Additional Information: The only clue so far is:
[Wed Dec 1 00:19:13 2010] - WARNING: Time jumped 23 seconds ahead! (1291159130 -> 1291159153)
[Wed Dec 1 00:19:13 2010] - [TimeShift] Resetting some timers!

The server has ntpd running and has *not* skipped ahead itself.
Tags: No tags attached.
3rd party modules: delaylist, regexcept, quarantine, nocodes

Activities

syzop

2010-12-10 20:55

administrator   ~0016447

I've found the Mantis bug; it was caused by ~ followed by certain numbers.
I already reported it straight away on http://www.mantisbt.org/bugs/view.php?id=12580 when I heard about this (better email next time ;p), but as you can see I had to trace it myself...

Anyway, the rest will be on-topic.
So far haven't looked at your issue.

syzop

2010-12-10 21:04

administrator   ~0016449

The jumps in time are really quite odd... and big.

Could it be that your server clock is running too slow and ntpd is adjusting it every once in a while in a huge leap? Actually, ntpd shouldn't do that (except sometimes on the first sync after boot); only ntpdate does things like that...

syzop

2010-12-10 21:06

administrator   ~0016450

Also, there was a bug resolved which involved autoconnect & time offsets, but that one probably does not apply to you:
- Fixed issue where a negative time offset (either caused by ircd.tune or
  timesynch) made autoconnect not work for the duration of the offset
  (eg: -60 would make autoconnect wait 60 seconds after boot, instead of
  autoconnecting almost immediately). Reported by aragon (0003853).

I mean... it shouldn't apply, unless you have a negative offset of several days... or something ;)
See ircd.tune, first number, for the TS offset. Or use /TSCTL ALLTIME or something.
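For reference, ircd.tune is a tiny two-value state file written by the ircd itself. A rough illustration of its contents (placeholder values: the first number is the TS offset discussed above; the second is presumably the recorded maximum connection count):

-60 916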

Le_Coyote

2010-12-10 21:54

reporter   ~0016451

I'm running a test ircd alongside one of the production ones so as to mimic the setup as closely as possible. The test daemon is a debug+Debug compile, and it has been patched according to issue 3853, as it was suggested that it might solve my issue. I'll let it run for a while and simulate splits, see what happens.

Also, I forgot to mention that the kernel has several GrSec features enabled, especially some of the /proc restrictions. Any chance it might be related?

Le_Coyote

2010-12-11 07:30

reporter   ~0016453

@syzop: ntp resets the time by about -0.15s every hour or so, which doesn't seem so bad, but it still might be improved by re-allowing /proc access and maybe using adjtimex. It never makes bigger leaps. ircd.tune has "0 916" in it.
TSCTL ALLTIME reports that all servers are in sync, except one at the moment running at -3. I've never seen a TS offset on the order of days.

syzop

2010-12-11 12:26

administrator   ~0016454

Then I don't think 0003853 is related to this.
I also don't see how grsecurity or any /proc restrictions could have any effect.

The tech info regarding detecting time jumps is:
1) it saves the current time
2) the ircd does what it needs to do, I mean accepting connections, reading, writing, etc.: the main loop
3) it compares the current time with the saved time
If, like in this case, 23 seconds have passed between 1 and 3, then we assume the clock was adjusted.
It MAY have a different cause, namely anything else stalling the ircd for 23 seconds, like a very, very heavy spamfilter, or you attaching a debugger, doing things and then continuing with 'c'. But it is very unlikely that you wouldn't have noticed that if you & others use that server ;)
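A minimal, self-contained sketch of that check (the names, threshold, and structure below are illustrative only, not the actual UnrealIRCd code):

/* Illustrative sketch of the time-jump detection described above;
 * this is NOT the actual UnrealIRCd code. */
#include <stdio.h>
#include <time.h>
#include <unistd.h>

#define JUMP_THRESHOLD 5   /* seconds considered "too big" for one pass */

int main(void)
{
    time_t saved = time(NULL);      /* 1) save the current time */
    for (;;)
    {
        sleep(1);                   /* 2) stand-in for one pass of the main
                                       loop (accept connections, read, write) */
        time_t now = time(NULL);    /* 3) compare with the saved time */
        if (now - saved > JUMP_THRESHOLD)
        {
            printf("WARNING: Time jumped %ld seconds ahead!\n",
                   (long)(now - saved));
            /* at this point the ircd would reset its timers
               ("[TimeShift] Resetting some timers!") */
        }
        saved = now;
    }
}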

syzop

2010-12-11 12:28

administrator   ~0016455

Is it relatively easy to reproduce? Like, if I take your config (with the server names changed, of course), then /SQUIT or tcpkill the connection a few times, will it then stop autoconnecting?
If you can give a test case - like something I can reproduce here - that would be one way to trace it.

syzop

2010-12-11 12:29

administrator   ~0016456

Bleh, should have put this all in one comment ;)

Another important question to find out is: is this problem specific to this server?

Le_Coyote

2010-12-11 17:44

reporter   ~0016457

It's not easy to reproduce. I've set up the test/debug server for that reason. It lives on the same physical host where the failing ircd is, so I hope that it will suffer the same symptom sooner or later, but there's no guarantee. It doesn't seem to do it after a few disconnects, but only after a certain amount of time.

Also, it seems to happen more frequently if the split is caused by external factors rather than SQUIT.

The issue is not specific to this server, we have seen this behaviour on two different hubs.

Regarding the 23 seconds time shift, I have no idea what might have happened :S

karel

2010-12-26 20:52

reporter   ~0016513

Hello,
same problem here.

My server is connected to the IRC network using SSL; it's running Debian Squeeze and its time is synchronized with a stratum 2 NTP server.

There are many time shifts every day (from 20s to 50s). And after a netsplit, I have to rehash, otherwise it won't reconnect.

katsklaw

2010-12-26 23:46

reporter   ~0016514

This may seem simple, but what is the value of connfreq in your server's class block?

karel

2010-12-27 02:39

reporter   ~0016515

Last edited: 2010-12-27 02:42

Hello katsklaw,
my connfreq is set to 30, but it doesn't want to connect even after an hour :). But only in the case of a netsplit; after a restart etc. it's fine.

And the time-shift problem is kinda unpleasant too, because all clients are frozen during the time jump :(.
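For reference, connfreq lives in the class block that a server's link block points to, and autoconnect is a link option. A rough 3.2.x-style sketch (server name, address, password and values are placeholders; exact syntax differs in later versions):

class servers {
    pingfreq 90;
    connfreq 30;              /* seconds between outgoing autoconnect attempts */
    maxclients 10;
    sendq 1000000;
};

link hub.example.net {
    username *;
    hostname 192.0.2.1;
    bind-ip *;
    port 7029;
    password-connect "linkpass";
    password-receive "linkpass";
    class servers;
    options {
        autoconnect;          /* without this, the link must be brought up with /CONNECT */
    };
};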

syzop

2010-12-27 11:09

administrator   ~0016516

Last edited: 2012-08-23 19:23

Regarding time jumps, you could run a bash script for a day (or at least to catch one or two jumps) like:

#!/bin/bash
while true
do
    date >> date.log
    sleep 1
done

And see if UnrealIRCd is lying or if there really are jumps ;)
Simply look back at date.log whenever Unreal says the time jumped.
date.log should show 1-second (and sometimes 2-second) increments of time; anything big, like >5, means the box itself has some kind of clock problem (or major performance issues).
IIRC I coded it so that during a jump all client timers are reset, so clients shouldn't notice any 'lag'; however, the freeze might be caused by the problem with the machine itself (I mean, it being a primary symptom caused by the machine itself, rather than a symptom caused by Unreal).

Well, if you can, run the script, and see if you get any interesting results ;)

syzop

2010-12-30 21:03

administrator   ~0016545

I think I've experienced the same issue, but I only have log files; I was not online, so I cannot tell for sure whether it made any (failed) linking attempts.

Anyway, log:
[Thu Dec 30 10:46:10 2010] - TSCTL: U:line security.xxxx set time to be 1293727573 (timediff: 392, was 391)
[Thu Dec 30 11:03:02 2010] - Lost connection to huba.xxxx[74.xxxx]:Operation timed out
[Thu Dec 30 11:03:02 2010] - SQUIT huba.xxxxx (Operation timed out)
[Thu Dec 30 12:39:28 2010] - OPER (Syzop) by (Syzop!~none@xxxxx)
[Thu Dec 30 12:39:37 2010] - SERVER hubb.xxxx

This server has 2 server classes, one with connfreq 27 and one with connfreq 90.
huba is in the 90s one, and hubb in the 27s one.

It seems neither autoconnect was triggered. When I went online this server was not linked to any server (/MAP).
As soon as I opered up and did a '/connect hubb*' it linked instantly.

Seems we've got a real issue here.

Strange thing is, I don't recall having seen it before.

Unreal version: 3.2.8.1

I guess I'll have to go over the autoconnect code (haven't really looked at or touched that before), and play a bit with tcpkill and such.

Le_Coyote

2011-01-23 20:33

reporter   ~0016587

Just to add to the already existing weirdness...

We've had maybe 4 or 5 splits recently. The last one displayed the bug, but the 3 or 4 before that went well, i.e. autoconnect kicked in. So it's really quite random... I'll check the dates and see if anything comes up from that.

syzop

2015-07-24 12:00

administrator   ~0018562

Quite a bit of code changed in this area, and I also did this (which in itself does not fix this issue, but would fix other cases, I guess):
https://github.com/unrealircd/unrealircd/commit/8bea34626398cb10217e5f097456f113d22dae3b

commit 8bea34626398cb10217e5f097456f113d22dae3b
Author: Bram Matthys <[email protected]>
Date: Fri Jul 24 11:57:09 2015 +0200

    In fix_timers(), called for time jumps, also deal with aconf->hold (outgoing autoconnects). Change throttling flush to a more conservative approach.
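Conceptually, the aconf->hold part is about keeping the autoconnect schedule attached to the shifted clock. A very rough sketch of the idea, with made-up names and structure (this is not the actual fix_timers() code):

#include <stdio.h>
#include <time.h>

/* Hypothetical stand-in for a configured outgoing link. */
struct link_conf {
    time_t hold;                /* earliest time the next autoconnect may fire */
    struct link_conf *next;
};

/* After a time jump of 'offset' seconds is detected, shift every pending
 * autoconnect hold along with the clock, so connect attempts are neither
 * postponed for the length of the jump nor stuck on the old timeline. */
static void fix_autoconnect_timers(struct link_conf *links, time_t offset)
{
    for (struct link_conf *l = links; l; l = l->next)
        l->hold += offset;
}

int main(void)
{
    struct link_conf hub = { .hold = time(NULL) + 90, .next = NULL };
    fix_autoconnect_timers(&hub, 23);   /* pretend the clock jumped 23s ahead */
    printf("next autoconnect attempt at %ld\n", (long)hub.hold);
    return 0;
}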

Le_Coyote

2018-05-04 11:19

reporter   ~0020129

I hate to be the bearer of bad news, but this issue might still be current.

We've got a network (different from the one that caused the initial bug report btw) running 4.0.17 (upgrade in progress). This morning we suffered a couple of splits. The leaf running 4.0.17 reconnected in about 10 seconds. Another running 4.0.14 stayed disconnected until I manually issued a CONNECT.

Aside from the obvious move to upgrade to 4.0.17, what other tools could be used to help diagnose this issue? Certainly 0003973 would help, but I'm open to other suggestions as well.

Le_Coyote

2018-05-04 11:24

reporter   ~0020130

I also need to add that another link between two 4.0.14 hubs had no trouble coming back up automatically, so it's still rather random… which makes it all the more difficult to diagnose without some insight as to whether reconnecting is even attempted.

syzop

2018-07-14 16:44

administrator   ~0020202

Yes, it must be rather annoying, especially with the random aspect (which is also annoying for me :D). I'll implement 0003973 in 4.0.19, and hopefully it will give you more information that allows you to help with this bug report as well.

syzop

2018-09-09 17:06

administrator   ~0020303

As promised, 0003973 has been implemented in 4.0.19. Should help a lot to trace this issue.

syzop

2019-02-01 14:34

administrator   ~0020474

Have you seen this bug happen since 4.0.19 (or later)?

k4be

2019-08-05 06:37

developer   ~0020797

I confirm the issue.

* *** Connection to (hidden)[(hidden)] activated.
* Lost connection to (hidden)[(hidden)]: SSL_connect(): Internal OpenSSL error or protocol error: ssl_undefined_function: called a function you should not call. Please verify that listen::options::ssl is enabled on port 7029 in (hidden)'s configuration file.

But immediately after rehash (no config change)...

* *** Loading IRCd configuration..
* *** Configuration loaded without any problems.
* *** Connection to (hidden)[(hidden)] activated.
* *** (link) Secure link (hidden) -> (hidden)[@(hidden)] established (TLSv1.2-ECDHE-RSA-AES256-GCM-SHA384)

syzop

2019-11-20 17:16

administrator   ~0021105

I see you mentioned a link error, but... this issue is about failure to reconnect at all, as in: no attempts at all. Is that what happened? Like, do you have a connfreq of e.g. 1 minute and it was completely silent for several minutes (>1m anyway)?

k4be

2019-11-20 18:21

developer   ~0021115

No attempts are visible on the receiving side, but they are made. The messages I have posted appear on the connecting side. Autoconnect is enabled, connfreq is 100. The issue seems to have appeared after enabling spkifp passwords on all servers. I think more than half of the splits need intervention (rehashing the leaf that won't connect). Version is 4.2.3.
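For context, a 4.x-era link block using spkifp authentication looks roughly like the sketch below (placeholder name, fingerprint and values; check the official link block documentation for the authoritative syntax):

link hub.example.net {
    incoming {
        mask *;
    };
    outgoing {
        hostname hub.example.net;
        port 7029;
        options { ssl; autoconnect; };
    };
    password "PLACEHOLDER-SPKI-FINGERPRINT" { spkifp; };
    class servers;
};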

syzop

2020-01-20 15:31

administrator   ~0021232

What k4be talks about at the end is a linking attempt which fails; that is not what this issue was about. This issue was about no linking attempt at all being made anymore (nor logged).
So with respect to this (original) issue: I am marking it as fixed, since it has not been reported anymore since 4.0.19, eh... 4.2.0.

Le_Coyote

2020-04-02 10:20

reporter   ~0021393

I am sorry to report that this problem still exists in v5. Our whole network has been upgraded to 5.0.3.1, and last night, as the main hub experienced network issues, this happened:

[Wed Apr 1 22:02:06 2020] - No response from hub3.ournetwork.tld[x.x.x.x], closing link
[Wed Apr 1 22:02:06 2020] - SQUIT hub3.outnetwork.tld (Ping timeout: 140 seconds)
[Wed Apr 1 22:03:22 2020] - Connection handshake timeout while connecting to server 'hub3.ournetwork.tld' (x.x.x.x)

Since then, hub3 does not appear anywhere in the logs: there have been no connection attempts from the main hub to hub3, despite the autoconnect option being set. Every other leaf managed to connect back to the main hub.

Le_Coyote

2020-04-02 10:40

reporter   ~0021394

After further investigation, 2 out of our 5 leaves were "stuck" in the following state:

[Wed Apr 1 19:59:25 2020] - Connection handshake timeout while connecting to server 'hub1.ournetwork.tld' (y.y.y.y)
[Wed Apr 1 19:59:35 2020] - Connection handshake timeout while connecting to server 'hub1.ournetwork.tld' (y.y.y.y)
[Wed Apr 1 19:59:46 2020] - Connection handshake timeout while connecting to server 'hub1.ournetwork.tld' (y.y.y.y)
[Wed Apr 1 20:00:39 2020] - Connection handshake timeout while connecting to server 'hub1.ournetwork.tld' (y.y.y.y)

They remained that way until we simply rehashed them. In this instance, the autoconnect "worked", but since the error message is the same as when hub1's autoconnect did not, I'm including this for completeness' sake. Perhaps it deserves its own, additional bug?

syzop

2020-04-02 10:42

administrator   ~0021395

For now, just here is fine; thanks for the report and notes. I'll have a look when I have time again. It at least gives me some scenarios for trying to reproduce it, I guess :)

Le_Coyote

2020-04-02 11:13

reporter   ~0021396

Good. Also, all our links are ssl-enabled and use spkifp for authentication, in case it matters.

syzop

2020-04-13 13:46

administrator   ~0021467

Last edited: 2020-04-13 13:46

Still trying to reproduce it here, but no luck so far. And yeah, I played a little with iptables and that sort of stuff.

Important to note here is that I just wrote something that improves the logging. It previously was "not good enough", so in the future this may help you and others:

commit f437593b8dbd7e86de8b296e25be289e47182742 (HEAD -> unreal50, origin/unreal50)
Author: Bram Matthys <[email protected]>
Date:   Mon Apr 13 13:36:58 2020 +0200

    Rewrite and expand notices+logging with regards to server linking / lost link.
    
    When connecting, use slightly different wording (and use it consistently):
    "Trying to activate link with server xyz"
    
    When the connection is lost before synced:
    "Unable to link with server xyz"
    
    When the connection is lost after fully synced (eg: minutes later):
    "Lost server link to xyz"
    
    Important small changes (other than text):
    * Log ERRORs from remote servers to the log (previously only shown to ircops)
    * Some link errors could have been previously suppressed due to
      old code assuming other parts of the code would send or log the error
      (this would be the case for an error when calling SSL/TLS write functions)
    * More?

syzop

2020-04-19 15:24

administrator   ~0021512

Last edited: 2020-04-19 15:25

The notices from above are in 5.0.4, which was just released. Also, as mentioned, there was one case (a failing SSL/TLS write) where notices could have been missed entirely. I hope it will help with tracing this issue!

syzop

2020-06-26 07:28

administrator   ~0021643

Le_Coyote: just wondering, are you on 5.0.4 (or later)? And are you still experiencing the issue?
I will leave it open for now, even if you report no further occurrences, but... just wondering.

Le_Coyote

2020-06-26 12:45

reporter   ~0021644

We are now running 5.0.4 network-wide. We had a couple of long-lasting splits after the upgrade. In both cases, a rehash *immediately* solved the problem.
In the first occurrence, the log had lots of: "Connection handshake timeout while trying to link to server 'hub.ournetwork.org'"
In the second, the error was different: "Unable to link with server hub.ournetwork.org[redacted]: SSL_connect(): Internal OpenSSL error or protocol error: ssl_undefined_function: called a function you should not call. Please verify that listen::options::ssl is enabled on port 7026 in hub.ournetwork.org's configuration file."
There had been no configuration change on either end that might explain either message.

I can't be certain these incidents are related to the initial bug.

k4be

2020-06-26 12:54

developer   ~0021645

I have reported the "called a function you should not call" problem on https://bugs.unrealircd.org/view.php?id=5607

Le_Coyote

2020-06-26 14:40

reporter   ~0021647

Thanks @k4be, I've added a note there as well

syzop

2020-06-26 15:23

administrator   ~0021650

0005607 has been solved now ("called a function you should not call").
Let's close this one (0003972) as well, since it seems logging is working correctly now :). Previously there were no errors at all, so it was all a mystery...

Issue History

Date Modified Username Field Change
2010-12-01 16:25 Le_Coyote New Issue
2010-12-02 04:02 ohnobinki Additional Information Updated
2010-12-02 04:03 ohnobinki Additional Information Updated
2010-12-10 20:48 syzop Additional Information Updated
2010-12-10 20:48 syzop Additional Information Updated
2010-12-10 20:48 syzop Additional Information Updated
2010-12-10 20:55 syzop Note Added: 0016447
2010-12-10 21:03 syzop Additional Information Updated
2010-12-10 21:04 syzop Note Added: 0016449
2010-12-10 21:06 syzop Note Added: 0016450
2010-12-10 21:54 Le_Coyote Note Added: 0016451
2010-12-11 07:30 Le_Coyote Note Added: 0016453
2010-12-11 12:26 syzop Note Added: 0016454
2010-12-11 12:28 syzop Note Added: 0016455
2010-12-11 12:29 syzop Note Added: 0016456
2010-12-11 17:44 Le_Coyote Note Added: 0016457
2010-12-26 20:52 karel Note Added: 0016513
2010-12-26 23:46 katsklaw Note Added: 0016514
2010-12-27 02:39 karel Note Added: 0016515
2010-12-27 02:39 karel Note Edited: 0016515
2010-12-27 02:42 karel Note Edited: 0016515
2010-12-27 11:09 syzop Note Added: 0016516
2010-12-30 21:03 syzop Note Added: 0016545
2011-01-23 20:33 Le_Coyote Note Added: 0016587
2011-06-05 21:37 syzop Relationship added child of 0003776
2011-07-17 23:24 syzop Relationship deleted child of 0003776
2011-07-17 23:24 syzop Relationship added child of 0003915
2012-08-23 19:23 syzop Note Edited: 0016516
2012-10-06 12:22 syzop Relationship deleted child of 0003915
2015-07-24 12:00 syzop Note Added: 0018562
2015-07-24 12:00 syzop Status new => resolved
2015-07-24 12:00 syzop Fixed in Version => 3.4-beta3
2015-07-24 12:00 syzop Resolution open => fixed
2015-07-24 12:00 syzop Assigned To => syzop
2018-05-04 11:19 Le_Coyote Status resolved => feedback
2018-05-04 11:19 Le_Coyote Resolution fixed => reopened
2018-05-04 11:19 Le_Coyote Note Added: 0020129
2018-05-04 11:24 Le_Coyote Note Added: 0020130
2018-07-14 16:44 syzop Note Added: 0020202
2018-09-09 17:06 syzop Note Added: 0020303
2019-02-01 14:34 syzop Note Added: 0020474
2019-08-05 06:37 k4be Note Added: 0020797
2019-11-20 17:16 syzop Note Added: 0021105
2019-11-20 18:21 k4be Note Added: 0021115
2020-01-20 15:31 syzop Status feedback => resolved
2020-01-20 15:31 syzop Fixed in Version 3.4-beta3 => 4.2.0
2020-01-20 15:31 syzop Note Added: 0021232
2020-04-02 10:20 Le_Coyote Status resolved => feedback
2020-04-02 10:20 Le_Coyote Note Added: 0021393
2020-04-02 10:30 syzop Product Version 3.2.8 => 5.0.3.1
2020-04-02 10:40 Le_Coyote Note Added: 0021394
2020-04-02 10:40 syzop Fixed in Version 4.2.0 =>
2020-04-02 10:42 syzop Note Added: 0021395
2020-04-02 11:13 Le_Coyote Note Added: 0021396
2020-04-13 13:46 syzop Note Added: 0021467
2020-04-13 13:46 syzop Note Edited: 0021467
2020-04-19 15:24 syzop Note Added: 0021512
2020-04-19 15:25 syzop Note Edited: 0021512
2020-06-26 07:28 syzop Note Added: 0021643
2020-06-26 12:45 Le_Coyote Note Added: 0021644
2020-06-26 12:54 k4be Note Added: 0021645
2020-06-26 14:40 Le_Coyote Note Added: 0021647
2020-06-26 15:23 syzop Note Added: 0021650
2020-06-26 15:23 syzop Status feedback => resolved
2020-06-26 15:23 syzop Fixed in Version => 5.0.6