View Issue Details

ID: 0003650
Project: unrealircd
Category:
View Status: public
Last Update: 2008-08-08 09:30
Reporter: Monk
Assigned To:
Priority: normal
Severity: crash
Reproducibility: random
Status: closed
Resolution: no change required
Product Version: 3.2.7
Summary: 0003650: Strange crash

Description:
We have 3 servers running at sh3lls.net on FreeBSD boxes. From time to time the ircd just terminates without any apparent reason. The pid file is just left behind but no core file is written.
To analyse the problem I attached gdb with:
$ gdb /path/to/the/ircd PID

Shortly after attaching gdb two servers crashed, one is still running. The backtrace of both follows:

FreeBSD 4.11-STABLE FreeBSD 4.11-STABLE #0: Thu Apr 13 09:29:20 CDT 2006

#0 0x2824b370 in select () from /usr/lib/libc.so.4
No symbol table info available.
#1 0x8057488 in read_message (delay=1, listp=0x8147540) at s_bsd.c:1763
        cptr = (aClient *) 0x81454e0
        nfds = 1
        wait = {tv_sec = 1, tv_usec = 0}
        read_set = {fds_bits = {1756103198, 1677881474, 1092159788, 2151682177, 2172649480, 2248212800, 0, 514, 36896, 2097152, 135266368, 2147483648,
    8388738, 1073741826, 0, 1140852736, 3, 16384, 0, 41943040, 0, 83886336, 144, 66560, 1048576, 17827841, 12, 303039488, 268455936, 536870912, 8522752,
    2097152, 0, 169869440, 0, 2, 2147483649, 0, 0, 4096, 2097168, 1048576, 256, 1073745920, 0, 524800, 0}}
        write_set = {fds_bits = {0 <repeats 47 times>}}
        j = 113
        k = -1077937816
        delay2 = 1
        res = 0
        length = -1077937816
        fd = -1077937628
        i = 93
        sockerr = 93
#2 0x8060010 in main (argc=1, argv=0xbfbffb98) at ircd.c:1597
        oldtimeofday = 1203963888
        argc = 113
        argv = (char **) 0x1
        uid = 1129
        euid = 1129
        gid = 1129
        egid = 1129
        delay = 1
        portarg = 93
        nextfdlistcheck = 1203963889
(gdb)

==========================================================================================================================================
FreeBSD 6.2-RC1 FreeBSD 6.2-RC1 #0: Sat Dec 16 01:29:54 CST 2006

(gdb) bt full
#0 0x0a2e02c3 in select () from /lib/libc.so.6
No symbol table info available.
#1 0x08057059 in read_message (delay=1, listp=0x8146a60) at s_bsd.c:1763
        s = 4
        cptr = (aClient *) 0xbfbfeab0
        nfds = 135547392
        wait = {tv_sec = 1, tv_usec = 0}
        read_set = {__fds_bits = {2168487966, 830996737, 1359478816, 1048708, 134744072, 2048, 268435608, 3229679872, 1744830464, 262208, 34112640,
    2359296, 0, 1073743008, 0, 8388992, 6818048, 268435504, 285212672, 1073774852, 268501121, 4194304, 327680, 134512672, 8, 1243611136, 4098, 1074807297,
    1073741824, 134324224, 4, 0, 2097152, 2684370948, 2151677953, 16844805, 738336768, 151003136, 0, 1082163200, 0, 8912896, 20, 34734112, 135528448,
    1212220040, 16448, 8388608, 8388608, 1111492736, 3145728, 33563648, 65552, 22020113, 49184, 0, 1276125504, 64, 536870913, 134481920, 134217744, 0, 0}}
        write_set = {__fds_bits = {0 <repeats 63 times>}}
        j = 181
        k = -1077941840
        delay2 = 1
        res = 0
        length = -1077941840
        fd = 135547392
        i = 4
        sockerr = 4
#2 0x080602ab in main (argc=135547392, argv=0xbfbfec90) at ircd.c:1597
        uid = 1799
        euid = 1799
        gid = 1799
        egid = 1799
        delay = 1
        portarg = 4
        nextfdlistcheck = 1203963814
(gdb)

Activities

Monk

2008-02-25 18:59

reporter   ~0015176

On a side note:
Is there a way to start Unreal directly in gdb?
Whenever I tried, it just seemed to fork, and gdb told me "Program exited normally."
I wasn't able to find anything like Anope's "run -debug -nofork".

nate

2008-02-25 19:13

reporter   ~0015179

I think you're referencing Unreal's -backtrace trigger (./unreal -backtrace).

Use that trigger and get its output from one of the core dump files that Unreal should have generated on the crash.

Personally, I'd like to think up front that this most likely isn't an UnrealIRCd problem: I was with sh3lls for a little while and ran into a multitude of issues with their servers as well as their services/support/admin. That might be a bit biased of me, though : P

Really though, try to post the backtrace that Unreal produces with its -backtrace feature.

Monk

2008-02-25 19:45

reporter   ~0015182

nate, thanks for the pointer, but as I wrote above, no core file is written in this strange case, so this is of no use.
Regarding the sh3lls admins, I cannot say anything bad about them so far. They are helpful and polite.

nate

2008-02-25 21:51

reporter   ~0015185

You said one of them was still running even after this 'crash', though? O_o

I didn't quite get that. Or are you talking about what happened after attaching gdb to it?

Monk

2008-02-26 06:02

reporter   ~0015188

Yeah, I was not very clear here:
I attached gdb to our 3 different sh3lls and like 5 minutes later two of them crashed with the bt above. Meanwhile the third one also crashed with a similar bt.

syzop

2008-02-29 13:44

administrator   ~0015200

So, after attaching, you did 'c' (continue), and then.. it crashed.. right? (so not just running gdb without the continue? the reason I ask is that it can look identical to this ;p).
hm. I see the backtrace but, what was the message it crashed with... segmentation fault? broken pipe? signal error.. whatever...

as for your question:
gdb src/ircd
(or whatever your 'ircd' BINARY is)
r -F
that's running in foreground mode

Monk

2008-02-29 18:05

reporter   ~0015206

After the command:

gdb /path/to/the/ircd PID

gdb printed some lines about loading symbols ... and then went to the prompt. I did nothing more there, thinking it had attached and that was all there was to it. After about 5 minutes the ircd suddenly disconnected from our network. I waited a while and then typed the "bt full" command. All output is copied above.
Thanks for the foreground argument. As the problem still persists, I will try the direct gdb run.

syzop

2008-03-06 12:51

administrator   ~0015209

Ok, thanks. Then the backtrace wasn't a backtrace of the crash I'm afraid, I'll explain.

When attaching, so when you do:
gdb /path/to/the/ircd PID
then you get a (gdb) prompt, then the ircd (or any program in gdb, really) hangs, until you give it the continue command ('c')
So yeah, after a couple of minutes it would have disconnected, ping timeout probably, because it didn't respond.
So next time:
gdb /path/to/the/ircd PID
-blabla loading symbols bla-
(gdb)
then do: c
then it should continue, until it crashes that is ;)
actually even better would be two commands:
handle SIGPIPE nostop
c

the 'handle SIGPIPE nostop' tells it not to bother you with sigpipe crap.. things that sometimes happen (or happened, I forgot).. without it you may get a (gdb) prompt again a couple of minutes, or hours, later, which will stall everything again for a stupid reason (no crash).

Actually, doing the same for the foreground run might be a good idea as well; then it is:
handle SIGPIPE nostop
r -F

Hope it helps :)
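
A rough analogy for the stall described above, not a description of what gdb does internally: a process held in the stopped state cannot respond to anything (including IRC pings) until it is resumed. The Python sketch below stops a child with SIGSTOP, confirms it is stopped, and resumes it with SIGCONT. It assumes a Unix-like system; the function name `stop_then_resume` and the sleeping child are made up for illustration.

```python
import os
import signal
import subprocess
import sys


def stop_then_resume():
    """Stop a child with SIGSTOP, confirm it is stopped, resume it, then clean up.

    Returns (was_stopped, was_continued).
    """
    # Hypothetical stand-in for the ircd: a child that just sleeps.
    child = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])
    try:
        os.kill(child.pid, signal.SIGSTOP)
        # WUNTRACED makes waitpid report children that stopped, not only exits.
        _pid, status = os.waitpid(child.pid, os.WUNTRACED)
        was_stopped = os.WIFSTOPPED(status)

        os.kill(child.pid, signal.SIGCONT)
        # WCONTINUED makes waitpid report children resumed by SIGCONT.
        _pid, status = os.waitpid(child.pid, os.WCONTINUED)
        was_continued = os.WIFCONTINUED(status)
        return was_stopped, was_continued
    finally:
        # Always terminate and reap the child so nothing is left running.
        child.kill()
        child.wait()


if __name__ == "__main__":
    print(stop_then_resume())
```

While the child is stopped it makes no progress at all, which matches the ping timeout seen after attaching without 'c'.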

Monk

2008-03-07 18:28

reporter   ~0015218

Last edited: 2008-03-07 18:46

Many thanks for the explanation syzop. Following your detailed question about how I did it with gdb I expected something like this ;)

Having run the ircd in gdb with the arguments you suggested, I now have another explanation for why the ircd just terminates, leaving the pid file behind:

=====================
Program received signal SIGKILL, Killed.
0x08056f7b in read_message (delay=1, listp=0x814a3e0) at s_bsd.c:1695
1695 if (IsLog(cptr))
=====================

It just received a SIGKILL. Now this can possibly have two explanations:
1) The folks at sh3lls are nuts - Unlikely as I experienced them as friendly and helpful and they explicitly allowed me to run the ircd in gdb

2) The number of file descriptors is limited, probably with a hard limit in limits.conf. The shell I rented allows me to use 1500 file descriptors. So in the config file the number of clients is limited to 1495 and the server connects to one hub. No other services/bncs/whatsoever are connected to the server. This should in total give no more than 1496 open files.
Is there a way to see how many files were open when it received the SIGKILL?

Edit:
The command
(gdb) shell lsof -p 36309 | wc -l
    1245
(gdb)
would indicate that it was not killed for exceeding a hard security limit, though I don't know if this way of counting open files is valid in this context or if it would include leaked files as well.
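
A note on the count above: `lsof -p PID | wc -l` also counts the header line and rows that are not numbered descriptors (such as `cwd` and `txt`), so it tends to overstate rather than understate the number of open descriptors. The Python sketch below counts only rows whose FD column starts with a digit. It assumes lsof's default column layout with FD as the fourth field; the function name `count_numbered_fds` and the sample text are made up for illustration and are not output from this report.

```python
def count_numbered_fds(lsof_output):
    """Count lsof rows that describe numbered file descriptors."""
    count = 0
    for line in lsof_output.splitlines()[1:]:  # skip the header row
        fields = line.split()
        # Assumption: default layout, FD is the 4th column (e.g. "0u", "12r").
        if len(fields) >= 4 and fields[3][0].isdigit():
            count += 1
    return count


# Made-up sample in lsof's default layout, NOT taken from the real server.
SAMPLE = """\
COMMAND   PID USER   FD   TYPE DEVICE SIZE/OFF NODE NAME
ircd    12345 user  cwd   VDIR    0,5      512    2 /home/user
ircd    12345 user  txt   VREG    0,5   900000    9 /home/user/ircd
ircd    12345 user    0u  VCHR    0,1      0t0    3 /dev/null
ircd    12345 user    4u  IPv4      0      0t0  TCP *:6667 (LISTEN)
ircd    12345 user    5u  IPv4      0      0t0  TCP host:6667->peer:40000 (ESTABLISHED)
"""

if __name__ == "__main__":
    total_rows = len(SAMPLE.splitlines())
    print(f"wc -l would report {total_rows}; "
          f"numbered descriptors: {count_numbered_fds(SAMPLE)}")
```

To apply it to a live process, the text would come from running `lsof -p PID` and passing its standard output to the function.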

syzop

2008-03-29 20:56

administrator   ~0015239

At least we know it isn't a crash :).
Now as to why it receives SIGKILL from somewhere... I've no idea.
If it hits a fd limit, the ircd would just send error messages and such and not SIGKILL... that's my experience.
Perhaps you could bother the provider; various ones kill processes for various reasons, including CPU usage, memory usage, or whatever.
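
To illustrate the point about descriptor limits: on POSIX systems, running into the per-process limit normally makes the failing call return an error (EMFILE) instead of getting the process killed. A minimal Python sketch, assuming a Unix-like system with the `resource` module; the function name `open_until_limit` and the limit of 32 are arbitrary choices for the demo and say nothing about how the provider configured the real shell:

```python
import errno
import os
import resource


def open_until_limit(soft_limit=32):
    """Lower the soft descriptor limit, then open files until the kernel refuses.

    Returns the errno of the failing open(). The original limit is restored.
    """
    old_soft, old_hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    # Never try to raise the limit; only lower it for the duration of the demo.
    if old_soft == resource.RLIM_INFINITY:
        new_soft = soft_limit
    else:
        new_soft = min(soft_limit, old_soft)
    resource.setrlimit(resource.RLIMIT_NOFILE, (new_soft, old_hard))

    opened = []
    try:
        while True:
            opened.append(os.open(os.devnull, os.O_RDONLY))
    except OSError as exc:
        # The process is still alive here: the limit produced an error, not a signal.
        return exc.errno
    finally:
        for fd in opened:
            os.close(fd)
        resource.setrlimit(resource.RLIMIT_NOFILE, (old_soft, old_hard))


if __name__ == "__main__":
    code = open_until_limit()
    print(errno.errorcode.get(code, code))
```

A provider-side watchdog that sends SIGKILL on resource usage is a separate mechanism and would not show up as an error return like this.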

syzop

2008-08-08 09:29

administrator   ~0015344

I'm closing this one, Monk, because I don't think it's a fault in Unreal. Hope you solved things.
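
For reference, a SIGKILL is consistent with the symptoms in the description: it cannot be caught, so the process gets no chance to remove its pid file, and its default action is to terminate without writing a core file. The Python sketch below reproduces the pid file part with a throwaway child. It assumes a Unix-like system; the child script, the file name `child.pid`, and the function name `kill_and_check` are made up for illustration.

```python
import os
import signal
import subprocess
import sys
import tempfile
import time

# Hypothetical stand-in for a daemon: writes a pid file and registers cleanup.
CHILD = r"""
import atexit, os, sys, time
pid_path = sys.argv[1]
with open(pid_path, "w") as fh:
    fh.write(str(os.getpid()))
atexit.register(os.remove, pid_path)  # cleanup that never runs on SIGKILL
time.sleep(60)
"""


def kill_and_check(sig=signal.SIGKILL):
    """Start the child, send it `sig`, and return (returncode, pid_file_left)."""
    with tempfile.TemporaryDirectory() as tmp:
        pid_path = os.path.join(tmp, "child.pid")
        proc = subprocess.Popen([sys.executable, "-c", CHILD, pid_path])
        # Wait (up to 10 s) until the child has written its pid file.
        deadline = time.time() + 10
        while time.time() < deadline:
            if os.path.exists(pid_path) and os.path.getsize(pid_path) > 0:
                break
            time.sleep(0.05)
        proc.send_signal(sig)
        proc.wait()
        # A negative return code means "terminated by that signal number".
        return proc.returncode, os.path.exists(pid_path)


if __name__ == "__main__":
    print(kill_and_check())
```

The child ends with a negative return code for the signal and its pid file is still there, which is the same picture as an ircd that vanishes while leaving its pid file behind.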

Issue History

Date Modified Username Field Change
2008-02-25 18:48 Monk New Issue
2008-02-25 18:59 Monk Note Added: 0015176
2008-02-25 19:13 nate Note Added: 0015179
2008-02-25 19:45 Monk Note Added: 0015182
2008-02-25 21:51 nate Note Added: 0015185
2008-02-26 06:02 Monk Note Added: 0015188
2008-02-29 13:44 syzop Note Added: 0015200
2008-02-29 18:05 Monk Note Added: 0015206
2008-03-06 12:51 syzop Note Added: 0015209
2008-03-07 18:28 Monk Note Added: 0015218
2008-03-07 18:45 Monk Note Edited: 0015218
2008-03-07 18:46 Monk Note Edited: 0015218
2008-03-29 20:56 syzop Note Added: 0015239
2008-08-08 09:29 syzop QA => Not touched yet by developer
2008-08-08 09:29 syzop U4: Need for upstream patch => No need for upstream InspIRCd patch
2008-08-08 09:29 syzop Status new => closed
2008-08-08 09:30 syzop Note Added: 0015344
2008-08-08 09:30 syzop Resolution open => no change required