UnrealIRCd Bug Tracker

View Issue Details Jump to Notes ] Issue History ] Print ]
IDProjectCategoryView StatusDate SubmittedLast Update
0003972unrealircdpublic2010-12-01 16:252012-08-23 19:23
ReporterMr_Smoke 
Assigned To 
PrioritynormalSeveritymajorReproducibilitysometimes
StatusnewResolutionopen 
Platformx86OSLinuxOS Version2.6.22
Product Version3.2.8 
Target VersionFixed in Version 
Summary0003972: Autoconnect link does not reconnect after split
DescriptionSometimes (not always but nearly), when a connection between two hubs is lost, the link does not come back up even though it is set to autoconnect in the config file.

The ircd "sees" the split (ie "Lost connection to ..." notice and SQUIT in the logs), but remains on its own indefinitely (10 hours in the last instance until someone reconnected manually).

It seems that the problem builds up over time, ie. the first couple of disconnections after a restart/rehash are usually handled correctly (autotonnect kicks in), but later on, autoconnect just does not happen any longer.
Additional InformationThe only clue so far is:
[Wed Dec 1 00:19:13 2010] - WARNING: Time jumped 23 seconds ahead! (1291159130
-> 1291159153)
[Wed Dec 1 00:19:13 2010] - [TimeShift] Resetting some timers!

The server has ntpd running and has *not* skipped ahead itself.
TagsNo tags attached.
3rd party modulesdelaylist, regexcept, quarantine, nocodes
Attached Files

- Relationships

-  Notes
(0016447)
syzop (administrator)
2010-12-10 20:55

I've found the mantis bug, it was caused by ~ followed by certain numbers.
I already reported it straight away on http://www.mantisbt.org/bugs/view.php?id=12580 [^] when I heard about this (better email next time ;p), but as you can see I had to trace it myself...

Anyway, the rest will be on-topic.
So far haven't looked at your issue.
(0016449)
syzop (administrator)
2010-12-10 21:04

The jumps in time are really quite odd.. and big.

Could it be that your server clock is running too slow and ntp is adjusting it every once in a while in a huge leap? Actually ntpd shouldn't do that (except 1st tiem on boot sometimes), only ntpdate does things like that...
(0016450)
syzop (administrator)
2010-12-10 21:06

also there was a bug resolved which involved autoconnect & time offsets, but that one probably does not apply to you:
- Fixed issue where a negative time offset (either caused by ircd.tune or
  timesynch) made autoconnect not work for the duration of the offset
  (eg: -60 would make autoconnect wait 60 seconds after boot, instead of
  autoconnecting almost immediately). Reported by aragon (0003853).

I mean.. it shouldn't apply, unless you have a negative offset of several days... or something ;)
See ircd.tune, first number, for the TS offset. Or use /TSCTL ALLTIME or something.
(0016451)
Mr_Smoke (reporter)
2010-12-10 21:54

I'm running a test ircd alongside one of the production ones so as to mimic the setup as closely as possible. The test daemon is a debug+Debug compile, and it has been patched according to issue 3853, as it was suggested that it might solve my issue. I'll let it run for a while and simulate splits, see what happens.

Also, I forgot to mention that the kernel has several GrSec features enabled, especially some of the /proc restrictions. Any chance it might be related ?
(0016453)
Mr_Smoke (reporter)
2010-12-11 07:30

@sysop : ntp resets time by about -0.15s every hour or so, which doesn't seem so bad but still might be improved by re-allowing /proc access and using adjtimex maybe. It never uses bigger leaps. ircd.tune has 0 916 in it.
TSCTL ALLTIME reports that all servers are in sync, except one at the moment running at -3. Never seen a TS offset in the days order of magnitude.
(0016454)
syzop (administrator)
2010-12-11 12:26

Then I don't think 0003853 is related to this.
I also don't see how grsecurity or any /proc restrictions could have any affect.

The tech info regarding detecting time jumps is:
1) it saves the current time
2) the ircd does what it needs to.. I mean.. accepting connections, reading, writing, etc... the main loop
3) it compares the current time with the saved time
If, like in this case, 23 seconds have passed between 1 and 3, then we assume the clock was adjusted.
It MAY have a different cause, like if you have anything else stalling the ircd for 23 seconds, like a very very heavy spamfilter or you are attaching a debugger and doing things and then 'c' continuing. But.. it is very unlikely that you wouldn't have noticed that if you & others use that server ;)
(0016455)
syzop (administrator)
2010-12-11 12:28

Is it relatively easy to reproduce? Like if I have your config, modify the server names of course, then /squit or tcpkill the connection a few times and then it stops autoconnecting?
If you can give a test case - like something I can reproduce here - that would be one way to trace it.
(0016456)
syzop (administrator)
2010-12-11 12:29

Bleh, should have put this all in one comment ;)

Another important question to find out is: is this problem specific to this server?
(0016457)
Mr_Smoke (reporter)
2010-12-11 17:44

It's not easy to reproduce. I've set up the test/debug server for that reason. It lives on the same physical host where the failing ircd is, so I hope that it will suffer the same symptom sooner or later, but there's no guarantee. It doesn't seem to do it after a few disconnects, but only after a certain amount of time.

Also, it seems to happen more frequently if the split is caused by external factors rather than SQUIT.

The issue is not specific to this server, we have seen this behaviour on two different hubs.

Regarding the 23 seconds time shift, I have no idea what might have happened :S
(0016513)
karel (reporter)
2010-12-26 20:52

Hello,
same problem here.

My server is connected in irc network using ssl, it's running Debian Squeeze and time is synchronized with level 2 ntp server.

There is many timeshifts every day (from 20s to 50s). And after netsplit, I have to make rehash, otherwise it wont connect.
(0016514)
katsklaw (reporter)
2010-12-26 23:46

This may seem simple but what is the value for connfreq in your server's class block?
(0016515)
karel (reporter)
2010-12-27 02:39
edited on: 2010-12-27 02:42

Hello katsklaw,
my connfreq is set to 30, but it doesn't want to connect even after an hour :). But only in the case of netsplit - after restart, etc it's ok.

And the timeshift problem it kinda unpleasant too, because all clients are freezed during the timejump :(.

(0016516)
syzop (administrator)
2010-12-27 11:09
edited on: 2012-08-23 19:23

Regarding time jumps, you could run a bash script for a day (or at least to catch one or two jumps) like:

#!/bin/bash
while true
do
date >>date.log
sleep 1
done

And see if UnrealIRCd is lying or if there are really jumps ;)
Simply look back at date.log whenever unreal says the time jumped.
date.log should show 1 (and sometimes 2) second increments of time, anything big like >5 means the box itself has some kind of clock problem (or major performance issues).
IIRC I coded it that during a jump all client timers are reset so clients shouldn't notice any 'lag', however the freeze might be caused by the problem with the machine itself (I mean, it being a primary symptom caused by the machine itself, rather than a symptom caused by unreal).

Well, if you can, run the script, and see if you get any interesting results ;)
(0016545)
syzop (administrator)
2010-12-30 21:03

I've experienced the same issue I think, but I only have log files, I was not online so cannot see if it did any (failed) linking attempts for sure.

Anyway, log:
[Thu Dec 30 10:46:10 2010] - TSCTL: U:line security.xxxx set time to be 1293727573 (timediff: 392, was 391)
[Thu Dec 30 11:03:02 2010] - Lost connection to huba.xxxx[74.xxxx]:Operation timed out
[Thu Dec 30 11:03:02 2010] - SQUIT huba.xxxxx (Operation timed out)
[Thu Dec 30 12:39:28 2010] - OPER (Syzop) by (Syzop!~none@xxxxx)
[Thu Dec 30 12:39:37 2010] - SERVER hubb.xxxx

This server has 2 server classes, one with connfreq 27 and one with connfreq 90.
huba is in the 90s one, and hubb in the 27s one.

It seems neither autoconnect was triggered. When I went online this server was not linked to any server (/MAP).
As soon as I opered up and did a '/connect hubb*' it linked instantly.

Seems we got a real issue here.

Strange thing is, I don't recall having seen it before.

Unreal version: 3.2.8.1

I guess I'll have to go over the autoconnect code (haven't really looked or touched that ever before), and play a bit with tcpkill and such.
(0016587)
Mr_Smoke (reporter)
2011-01-23 20:33

Just to add to the already existing weirdness ...

We've had maybe 4 or 5 splits recently. The last one displayed the bug, but the 3 or 4 before that went well, ie. autoconnect kicked in. So it's really quite random ... I'll check the dates, see if anything comes up from that

- Issue History
Date Modified Username Field Change
2010-12-01 16:25 Mr_Smoke New Issue
2010-12-02 04:02 ohnobinki Additional Information Updated View Revisions
2010-12-02 04:03 ohnobinki Additional Information Updated View Revisions
2010-12-10 20:48 syzop Additional Information Updated View Revisions
2010-12-10 20:48 syzop Additional Information Updated View Revisions
2010-12-10 20:48 syzop Additional Information Updated View Revisions
2010-12-10 20:49 syzop Note Added: 0016446
2010-12-10 20:53 syzop Note Deleted: 0016446
2010-12-10 20:55 syzop Note Added: 0016447
2010-12-10 21:02 syzop Note Added: 0016448
2010-12-10 21:03 syzop Note Deleted: 0016448
2010-12-10 21:03 syzop Additional Information Updated View Revisions
2010-12-10 21:04 syzop Note Added: 0016449
2010-12-10 21:06 syzop Note Added: 0016450
2010-12-10 21:54 Mr_Smoke Note Added: 0016451
2010-12-11 07:30 Mr_Smoke Note Added: 0016453
2010-12-11 12:26 syzop Note Added: 0016454
2010-12-11 12:28 syzop Note Added: 0016455
2010-12-11 12:29 syzop Note Added: 0016456
2010-12-11 17:44 Mr_Smoke Note Added: 0016457
2010-12-26 20:52 karel Note Added: 0016513
2010-12-26 23:46 katsklaw Note Added: 0016514
2010-12-27 02:39 karel Note Added: 0016515
2010-12-27 02:39 karel Note Edited: 0016515 View Revisions
2010-12-27 02:42 karel Note Edited: 0016515 View Revisions
2010-12-27 11:09 syzop Note Added: 0016516
2010-12-30 21:03 syzop Note Added: 0016545
2011-01-23 20:33 Mr_Smoke Note Added: 0016587
2011-06-05 21:37 syzop Relationship added child of 0003776
2011-07-17 23:24 syzop Relationship deleted child of 0003776
2011-07-17 23:24 syzop Relationship added child of 0003915
2012-08-23 19:23 syzop Note Edited: 0016516 View Revisions
2012-10-06 12:22 syzop Relationship deleted child of 0003915


Copyright © 2000 - 2014 MantisBT Team
Powered by Mantis Bugtracker