View Issue Details

ID: 0002499
Project / Category: unrealircd
View Status: public
Last Update: 2005-10-10 14:36
Reporter: Deltaflyer
Assigned To: syzop
Priority: normal
Severity: crash
Reproducibility: random
Status: resolved
Resolution: fixed
Platform: Linux, FreeBSD
Product Version: 3.2.3
Fixed in Version: 3.2.4
Summary: 0002499: find_cache_number crash
Description: Our IRCd servers have crashed a few times since we updated them to 3.2.3.
Now we have a core file.
At the moment we don't know what caused these crashes.

*** SYZOP ****
This is the main bug for the find_cache_number crash. There have been 4 reports in the past few months (and about 10 in the past 2 years). This issue is very hard to track down, so I ask anyone who is experiencing this bug to run valgrind (see bugnotes for instructions), which will hopefully help trace it.
*** /SYZOP ***
Steps To Reproduce: seems semi-random...
Additional Information:
ircd@h64944:~/Unreal3.2$ gdb src/ircd core
gdb: Symbol `emacs_ctlx_keymap' has different size in shared object, consider re-linking
GNU gdb 2002-04-01-cvs
..
#0 0x08066c4d in find_cache_number (rptr=0x0, numb=0x8209ba0 "Ùái@3\203") at res.c:1443
1443 for (i = 0; HE(cp)->h_addr_list[i]; i++)
(gdb) bt
#0 0x08066c4d in find_cache_number (rptr=0x0, numb=0x8209ba0 "Ùái@3\203") at res.c:1443
#1 0x08065a58 in gethost_byaddr (addr=0x8209ba0 "Ùái@3\203", lp=0xbfffec10) at res.c:466
#2 0x0806adad in start_of_normal_client_handshake (acptr=0x8209ba0) at s_bsd.c:1353
#3 0x0806aa98 in add_connection (cptr=0x8171a28, fd=45) at s_bsd.c:1338
#4 0x0806b5bb in read_message (delay=2, listp=0x813a860) at s_bsd.c:1872
#5 0x0805fe6e in main (argc=0, argv=0x81358c9) at ircd.c:1564
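
The crashing line walks the h_addr_list of a cached host entry. The following is only a minimal sketch of that kind of loop with defensive NULL checks, not the actual res.c code: the struct cache_entry layout and the names entry_has_addr and find_by_addr are assumptions made for illustration. NULL checks like these would not catch a dangling or overwritten pointer, which is what the later notes suspect.

```c
#include <netdb.h>   /* struct hostent */
#include <stddef.h>
#include <string.h>

/* Hypothetical cache entry; the real res.c structures are not shown in this report. */
struct cache_entry {
    struct hostent he;
    struct cache_entry *next;
};

/* Return 1 if any address stored in the entry equals addr (len bytes), else 0. */
int entry_has_addr(const struct cache_entry *cp, const void *addr, size_t len)
{
    if (cp == NULL || addr == NULL || cp->he.h_addr_list == NULL)
        return 0;
    if (cp->he.h_length < 0 || (size_t)cp->he.h_length != len)
        return 0;
    for (int i = 0; cp->he.h_addr_list[i] != NULL; i++) {
        if (memcmp(cp->he.h_addr_list[i], addr, len) == 0)
            return 1;
    }
    return 0;
}

/* Walk a singly linked list of entries and return the first match, or NULL. */
const struct cache_entry *find_by_addr(const struct cache_entry *head,
                                       const void *addr, size_t len)
{
    for (const struct cache_entry *cp = head; cp != NULL; cp = cp->next) {
        if (entry_has_addr(cp, addr, len))
            return cp;
    }
    return NULL;
}
```

Calling find_by_addr(head, addr, 4) on an IPv4 cache list returns NULL for an empty list or a missing address instead of dereferencing a NULL entry.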
Tags: No tags attached.
Attached Files
3rd party modules: Modules: [no pattern here]

Relationships

has duplicate 0002551 closed When running with openSSL support, server crashes randomly 
has duplicate 0002558 closed Unknown random crash 
has duplicate 0002559 closed crash at res.c:1301 
has duplicate 0002642 closed Exited on signal 11 (core dumped) 
related to 0002505 closed Crash in find_cache_number - res.c:1446 
related to 0002501 closed Segmentation Fault 
related to 0002618 closed Random Crash 
related to 0002616 closed Unknown crash, maybe the dns cache thing again ? 

Activities

syzop

2005-04-28 16:30

administrator   ~0009850

(modified copypaste of questions @ 0002501)
These bugs are hard to trace, so I'm gonna flood you with some questions:
1. Did you recently reconfigure/recompile unreal, like.. changing a config.h setting or running ./Config to enable/disable something?
2. How many clients were connected locally?
3. How many servers were connected locally?
4. How long have you been running unrealircd without problems? Or did this happen anytime before?
5. If you have multiple servers, did they ever experience the same issue?

FYI, the crash is in the DNS routine. If you can think of anything DNS-related that might have caused this, let us know as well.

Deltaflyer

2005-04-28 17:53

reporter   ~0009853

1. We enabled umlaut support for German umlauts like ä, ö, ü, ß.
Apart from that, I'm not aware of anything we changed.
2. We have about 20 to 60 users on the server.
3. This server connects as a leaf to our hub.
4. Our ircds ran without problems for months with 3.2.2.
But after the upgrade to 3.2.3 (about a month ago) we had several crashes (but without core files).
5. We had the problem once on 2 servers, and more than once on 2 servers.

I don't think we changed anything DNS-related before the crash. We only took 2 servers out of the round robin, but that was 2-3 weeks ago.

Deltaflyer

2005-05-18 05:30

reporter   ~0009955

Any news about this problem?
We had the same crash again a few days ago.

syzop

2005-05-18 10:32

administrator   ~0009957

Hm right, my bad.
You can help out by running the ircd with valgrind (www.valgrind.org, but perhaps your distro already has a nice package for it). Then run the ircd by executing the following command (taken from HERZ ;p):
valgrind --error-limit=no --verbose --time-stamp=yes --log-file=unrealdebug.log src/ircd

Whenever you crash, you can mail me the log (zipped?) to [email protected]

Thanks :)

syzop

2005-06-06 20:37

administrator   ~0010052

{moved all other related bugreports to here + moved users monitoring this bug also to here ;p}

Could everyone experiencing this bug paste me their /etc/resolv.conf? Feel free to xxxx out part of the ip (eg: the last 2 or 3 octets) and the 'search xxxxxx, yyyyyy' etc...

For example mine is:
-
search xxxxxxx
nameserver 192.168.x.xxx
-

Also, did anyone get a crash when running with valgrind? Or did the issue mysteriously disappear for everyone for now ;)

syzop

2005-06-06 20:46

administrator   ~0010054

A shell I was on that has this problem has the following resolv.conf:
-
search xxxxxxxx.com xxxxxx.net
nameserver 0.0.0.0
nameserver 63.xxx.xxx.xxx
nameserver 63.yyy.yyy.yyy
-

firstof9

2005-06-12 04:10

reporter   ~0010078

here's the resolv.conf syzop:

nameserver 127.0.0.1
nameserver 4.2.2.2
nameserver 4.2.2.1

i will try to get a valgrind log.

syzop

2005-06-12 10:43

administrator   ~0010082

Thanks.

Just for the record, I've been playing with a bruteforcer several days ago... but I wasn't able to cause a crash :(

I'm personally using 1 nameserver in my resolv.conf however, so I was wondering if perhaps you need multiple nameservers in resolv.conf to trigger this bug. Or.. perhaps you just have more chance of triggering it then, I don't know.

I suppose that, since nobody else posted a reaction on this bug, they are currently not experiencing this bug (although I don't see why my resolv.conf question cannot be answered).. Which is another property of this bug... it seems it's quite sporadic, and when it happens it often happens in "peaks" (and after that a long period of nothing). I guess that's also one of the things that makes this bug hard to catch... If only I could reproduce it here... :)

firstof9

2005-06-12 14:32

reporter   ~0010085

It was working perfectly fine till a few days ago, but here's the log. (--time-stamp didn't work as a flag for some reason)

syzop

2005-06-12 15:05

administrator   ~0010086

{I noticed you also mailed it, but since it's also here I suppose I'll reply here :)}

Thanks.

Problem though.. why did you start valgrind with --tool=none?
Basically now it did no memory (out-of-bounds/double free/access-after-free/etc) checks whatsoever, so I'm afraid it is not of much use for me :(.

If you could run it again without the --tool=none argument then that would be helpful (--tool=memcheck is the default), thanks :).

Deltaflyer

2005-06-14 10:27

reporter   ~0010090

Hi, sorry for the late response, but I was on holiday :)
We had another crash on Thursday.
This is the resolv.conf of this server:
search xxxx.info
nameserver 217.160.xxx.xxx
nameserver 195.20.yyy.yyy
nameserver 195.20.zzz.zzz

This ircd wasn't started with valgrind, so I have no debug info. Sorry.

Monk

2005-06-24 18:11

reporter   ~0010105

resolv.conf:

nameserver 1.2.3.4
nameserver 1.2.3.5

Regards,

Monk

syzop

2005-06-24 18:31

administrator   ~0010108

Thanks. Got a question though: LAN or remote ips?

--
Forgot to mention this: I got a valgrind log from firstof9 along with core files, I'll be researching this information (further) very soon and hope it will be helpful. (That said, valgrind did not see it as a 'double free' or 'read-after-free' from the resolver, rather as an unknown/damaged pointer thingy).

Monk

2005-06-24 19:35

reporter   ~0010116

Remote IP's

Monk

Deltaflyer

2005-06-26 16:37

reporter   ~0010142

also remote IPs here

syzop

2005-07-03 19:02

administrator   ~0010167

Last edited: 2005-07-03 19:03

If someone is getting (highly) annoyed by this, - if possible - you could try setting up a local bind9/dnscache and having only 1 nameserver entry with 127.0.0.1 as ip (no remote nameservers) in /etc/resolv.conf. Just a wild guess.....
If it works then it is - of course - still just a workaround.

PS: After changing /etc/resolv.conf you'll have to restart the ircd -- unfortunately at this time it won't reread resolv.conf during /rehash.
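
To check whether a given resolv.conf already matches the workaround above (exactly one nameserver entry, and that entry being 127.0.0.1), something like the following could be used. This is a rough sketch under stated assumptions: the function name count_nameservers is made up for illustration, it works on text already read into memory, and it only recognizes plain "nameserver <address>" lines.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Count "nameserver" lines in resolv.conf-style text.
 * If all_loopback is not NULL, *all_loopback is set to 1 only when at least
 * one entry exists and every entry is exactly 127.0.0.1, otherwise to 0. */
int count_nameservers(const char *text, int *all_loopback)
{
    int count = 0;
    int loopback = 1;
    const char *p = text;

    while (*p != '\0') {
        const char *eol = strchr(p, '\n');
        const char *end = eol ? eol : p + strlen(p);
        const char *s = p;

        /* Skip leading whitespace on this line. */
        while (s < end && isspace((unsigned char)*s))
            s++;

        /* Comment lines start with '#' or ';' and therefore never match. */
        if ((size_t)(end - s) > 10 && strncmp(s, "nameserver", 10) == 0
            && isspace((unsigned char)s[10])) {
            s += 10;
            while (s < end && isspace((unsigned char)*s))
                s++;
            const char *tok = s;
            while (s < end && !isspace((unsigned char)*s))
                s++;
            size_t toklen = (size_t)(s - tok);
            if (toklen > 0) {
                count++;
                if (!(toklen == 9 && strncmp(tok, "127.0.0.1", 9) == 0))
                    loopback = 0;
            }
        }
        p = eol ? eol + 1 : end;
    }

    if (all_loopback != NULL)
        *all_loopback = (count > 0 && loopback);
    return count;
}
```

A return value of 1 together with *all_loopback set to 1 corresponds to the suggested setup; the resolv.conf files posted earlier in this report would all give a different result.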

HERZ

2005-07-04 05:50

reporter   ~0010173

Just a wild guess from my side:

Why not activate this setting in Unreal?
      dns
        {
                nameserver 213.131.254.5;
                timeout 2s;
                retries 2;
        };

I know that UnrealIRCd ignores this setting (I don't know why), but
it would make it easier to check whether a server crashes again
when all servers on a network use the same nameserver. :)

And an idea for maybe Unreal3.3:

      dns
        {
                nameserver1 213.131.254.5;
                nameserver2 127.0.0.1;
                timeout 2s;
                retries 2;
        };

That is, add a backup nameserver.

Regards
Chris aka HERZ

syzop

2005-07-28 21:28

administrator   ~0010277

#2603 also reported the same bug.

What is interesting is that this user also has (2) remote nameservers.

--

HERZ: I agree, and it's certainly something I think must be done for 3.3*.

Btw on the "why does it ignore it in 3.2" part: well, since this was never used for years, the thought was: if we would now enable it, we would break hundreds (if not thousands) of irc servers (and not just in an obvious way, but rather.. sneaky ;p).

Stealth

2005-07-28 22:43

reporter   ~0010278

> Btw on the "why does it ignore it in 3.2" part: well, since this was never used for years, the thought was: if we would now enable it, we would break hundreds (if not thousands) of irc servers (and not just in an obvious way, but rather.. sneaky ;p).

That's why people are supposed to read docs/changelog when they are upgrading... I say if they don't, it is their problem.

Deltaflyer

2005-07-29 15:57

reporter   ~0010284

Is there still a need for valgrind debug logs?
I sent you (syzop) one, but never got a reply.

syzop

2005-08-11 17:27

administrator   ~0010326

Deltaflyer: valgrind logs are always welcome :). So far they have only helped a little, but I don't want to miss any log since perhaps it COULD one day contain something useful :).
--
status update: I still have no idea why this happens (and where it happens), main problem is that I cannot reproduce it. I've tried (&coded) stuff like a dns bruteforcer that randomly corrupts bytes, that randomly delays it ("virtual timeouts"), etc... all when connecting ~200 clones every minute or so (and dns TTL of objects at like 60 or 300). I've used both valgrind (a heap debugger) and mudflap (a stack/object debugger included in gcc4) when doing the above described tests. No success... I guess I've spent now like 20-30 hours on this bug, without ever being able to trigger it... annoying (well probably not as annoying as having your server crashed, but...) :(.
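
For readers wondering what "randomly corrupts bytes" can look like in such a test tool, here is a minimal sketch. It is not the bruteforcer described above (that code is not part of this report); the names next_rand and corrupt_bytes, the small linear congruential generator, and the one-bit-per-flip strategy are all assumptions chosen for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Small linear congruential generator so a run can be repeated from a seed. */
uint32_t next_rand(uint32_t *state)
{
    *state = *state * 1664525u + 1013904223u;
    return *state >> 16;   /* use the higher bits, which are better mixed */
}

/* Flip one random bit in each of `flips` randomly chosen bytes of buf.
 * The same position may be picked more than once.
 * Returns the number of flips performed (0 for an empty or NULL buffer). */
size_t corrupt_bytes(unsigned char *buf, size_t len, size_t flips, uint32_t *state)
{
    if (buf == NULL || len == 0)
        return 0;
    for (size_t i = 0; i < flips; i++) {
        size_t pos = next_rand(state) % len;
        unsigned char mask = (unsigned char)(1u << (next_rand(state) % 8));
        buf[pos] ^= mask;
    }
    return flips;
}
```

Applied to a copy of a DNS reply packet before it is handed to the resolver, this kind of helper produces malformed replies; logging the seed makes it possible to replay the exact corruption that led to a crash.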

Cnils

2005-08-28 14:40

reporter   ~0010404

I'd like to put the focus on this bug. It's happening more and more frequently now. I just hope it's not a user-exploitable bug we're looking at here. The increasing number of crashes could suggest that.

syzop

2005-08-28 15:08

administrator   ~0010405

Do you have some more core files? Could you send them to me (along with the src/ircd binary and src/modules/commands.so) to [email protected] (or a URL, of course).

Did you try the suggestion mentioned a few posts back? Having 127.0.0.1 as a nameserver (and only that nameserver) in /etc/resolv.conf.

Gilou

2005-08-28 15:54

reporter   ~0010411

Just to add something to what HERZ suggested, it would be even smarter to allow an NS pool like
set {
      nameservers { ip; ip; ip; };
}


Just in case you were to add this :))

syzop

2005-09-05 22:57

administrator   ~0010444

right.

Important status update:
we are currently (seriously) reviewing to see if we can use another dns library + simplified rewritten caching, this should "only" take (up to) 10h of work.
After it is finished we will do extensive testing ourselves, and if that turns out to be ok, we'll provide a link here and/or send out an email to all known people that are affected by this bug. This would enable you guys to test if that version works properly and does indeed get rid of the issue.
If results are positive, then we will probably integrate them in CVS. If time permits, then it will also be in 3.2.4 (but then we really want to be sure it works ok and has been proven to work on live networks).

I don't know exactly when this version will be available, but it should be a matter of days (3-7d), not weeks.

syzop

2005-09-18 15:54

administrator   ~0010486

UPDATE: Basically it's been working on *NIX for IPv4 (IPv6 is untested) for over a week now, but we are having a struggle with autoconf and stuff which is basically bringing everything to a halt :/.

Windows support is also untested (well, basically it won't work yet ;p), but I doubt anyone HERE cares about that :P.

Cnils

2005-09-18 16:06

reporter   ~0010487

Just FYI: the many crashes we had over a period of a month have diminished to nearly none now. Nothing has changed; everything is set up as it used to be.

syzop

2005-09-21 20:31

administrator   ~0010500

Good :).

I just found a box (vmware ;p) with autoconf 2.53, so I can get further again.
[newer versions are giving problems, that are hard to resolve ;p]

syzop

2005-09-21 20:57

administrator   ~0010501

Last edited: 2005-09-21 20:58

Early alpha version.... I don't recommend using it yet, unless you are crashing like every day at the moment. Just see it as a last resort for now. I still have to add IPv6 support and Windows support and improve some other things (and actually do many more tests, like valgrind).

So.. JUST in case your whole net is crashing and you need something quickly ;).
[/disclaimer]

cvs -d :pserver:[email protected]:/home/cmunk/ircsystems/cvsroot login
[press enter when asked for the password]
cvs -z3 -d :pserver:[email protected]:/home/cmunk/ircsystems/cvsroot co -r c-ares_resolver -d Unreal unreal
And run ./Config and make

So just to repeat: work will continue for the next 7 days or so, until it's finished (IPv6/windows/improved more secure c-ares randomized numbers/more c-ares stats probably), then I'll post again, with the suggestion to actually try it :).

syzop

2005-09-27 20:26

administrator   ~0010523

Last edited: 2005-09-27 20:27

update: ipv6 support is now working as well, and so is win32 (though, win32 is not perfect yet).

syzop

2005-09-27 20:27

administrator   ~0010524

Uhh forget my last comment (which has now been edited out) ;).

On 2nd thought, I think we (unreal team) should betatest it first! :).

syzop

2005-10-01 18:22

administrator   ~0010549

Last edited: 2005-10-01 18:24

Ok, I think I've betatested it extensively now :).
I even printed out the resolver source and hunted down a few bugs, also ran ipv6 tests. Tested it both on i686 and amd64 (both w/valgrind).

So if anyone wants to give it a try :):

-
cvs -d :pserver:[email protected]:/home/cmunk/ircsystems/cvsroot login
[press enter when asked for the password]
cvs -z3 -d :pserver:[email protected]:/home/cmunk/ircsystems/cvsroot co -r c-ares_resolver -d Unreal unreal
And run ./Config and make
-

Now, I don't know what the next decision will be: merge it with unreal3_2_fixes or keep it a separate branch for now. I would have to really think about that (this also affects the 3.2.4 release date) :).

coolvibe

2005-10-10 04:13

reporter   ~0010566

Any updates on this? Does the alternative resolver help? Because I'm having the crashes in the same place.

Do note that I have only one entry in my resolv.conf file.

search xxx.net
nameserver 194.126.xxx.xxx

darko

2005-10-10 06:05

reporter   ~0010567

One thing to note is that, out of the 6 crashes I had in 4 days, all of them happened with the hashv returned from hash_number being 12. I can't figure out why that would be important, but it's a very unusual coincidence that hashing all sorts of hosts produces the same result. Poor hashing algorithm? I am sick as a dog (hence my being home instead of at work) so I am calling it quits for this morning, but someone should take a look at the hashing routines. Did anyone check the distribution of hashes? Perhaps a hidden overflow occurs in one of the buckets?

I should also mention that it has usually been a while between crashes (a couple of months), but now it has happened much more often. Of the other two servers on the network (all 3 running 3.2.2b), one is quite busy, taking about 1500-2000 users (40% of the total), while the other one is populated by a mere 50-100. The 'little one' never crashed and the 'middle one' has crashed only twice since the 3.2.2b bugfix.
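
The question about the distribution of hashes can be checked offline with a simple histogram over a set of sample addresses. The sketch below is only an illustration of that idea: toy_hash is a stand-in and NOT the hash_number routine from res.c (which is not shown in this report), and NBUCKETS and bucket_histogram are likewise hypothetical. To test the real resolver, the real hash function and table size would have to be substituted.

```c
#include <stddef.h>
#include <stdint.h>

#define NBUCKETS 64   /* hypothetical table size, not taken from res.c */

/* Stand-in hash over a 4-byte IPv4 address (NOT the real hash_number). */
unsigned toy_hash(const unsigned char addr[4])
{
    uint32_t h = 0;
    for (int i = 0; i < 4; i++)
        h = h * 31u + addr[i];
    return (unsigned)(h % NBUCKETS);
}

/* addrs holds n consecutive 4-byte addresses.
 * Fills counts[] with the number of addresses per bucket and returns the
 * index of the fullest bucket (the lowest index wins on a tie). */
unsigned bucket_histogram(const unsigned char *addrs, size_t n,
                          size_t counts[NBUCKETS])
{
    unsigned fullest = 0;

    for (unsigned b = 0; b < NBUCKETS; b++)
        counts[b] = 0;
    for (size_t i = 0; i < n; i++)
        counts[toy_hash(addrs + 4 * i)]++;
    for (unsigned b = 1; b < NBUCKETS; b++) {
        if (counts[b] > counts[fullest])
            fullest = b;
    }
    return fullest;
}
```

Feeding in the client addresses seen by a busy server and comparing the fullest bucket with the average would show whether one bucket (such as 12) is unusually crowded or whether the repeated value is a coincidence.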

syzop

2005-10-10 09:33

administrator   ~0010568

Last edited: 2005-10-10 09:41

The alternative resolver should work (ALL old resolver code is gone), but.. I guess nobody with the crashproblem tried so yet :P.

darko: I noticed some unusual pattern too when analyzing ~10 core files, but.. I couldn't trace it down. Also, it was not always the same hash (or near the same hash), but there was a (clear) pattern around 50-55 and 10-15. It doesn't have to necessarily mean something but...
Anyway, I don't intend to take a look at this anymore, I've done so too many times, and now we got the new resolver written... I don't see why :)

I'm seriously considering merging it with current CVS now, since I don't think it will receive much testing otherwise ;). I'll post a comment when done so.
*EDIT: I mean with the unreal3_2_fixes branch of course ;p*

syzop

2005-10-10 14:36

administrator   ~0010575

Done [.387], changelog entry follows:

- Removed all old resolver code and switched over to c-ares (+our caching routines).
  This should get rid of some annoying untraceable (and usually rare) crashbugs in the
  old resolver. Besides that, it makes things look more clean and understandable.
  This should be the fix for the following bugids (all the same issue): 0002499, 0002551, 0002558,
  0002559, #2603, 0002642, 0002502, 0002501, 0002618, 0002616.
  Feedback and testing is very much welcomed ([email protected]).

Issue History

Date Modified Username Field Change
2005-04-24 14:13 Deltaflyer New Issue
2005-04-24 14:13 Deltaflyer 3rd party modules => Modules: m_staff, operjoin, m_jumpserver, operpasswd; Services: kickservices
2005-04-28 16:27 syzop Relationship added related to 0002505
2005-04-28 16:30 syzop Note Added: 0009850
2005-04-28 17:53 Deltaflyer Note Added: 0009853
2005-05-18 05:30 Deltaflyer Note Added: 0009955
2005-05-18 10:32 syzop Note Added: 0009957
2005-05-18 10:37 syzop Relationship added related to 0002501
2005-06-06 20:29 syzop Summary IRCd Crash with no reason, may SSL Problem => find_cache_number crash
2005-06-06 20:30 syzop View Status private => public
2005-06-06 20:37 syzop Note Added: 0010052
2005-06-06 20:38 syzop Relationship added has duplicate 0002551
2005-06-06 20:44 syzop 3rd party modules Modules: m_staff, operjoin, m_jumpserver, operpasswd; Services: kickservices => Modules: [no pattern here]
2005-06-06 20:44 syzop Reproducibility have not tried => random
2005-06-06 20:44 syzop Status new => acknowledged
2005-06-06 20:44 syzop OS debian =>
2005-06-06 20:44 syzop OS Version Kernel 2.6.8-1 =>
2005-06-06 20:44 syzop Platform Linux => Linux, FreeBSD
2005-06-06 20:44 syzop Description Updated
2005-06-06 20:44 syzop Steps to Reproduce Updated
2005-06-06 20:44 syzop Additional Information Updated
2005-06-06 20:46 syzop Note Added: 0010054
2005-06-11 14:47 syzop Relationship added has duplicate 0002558
2005-06-12 00:40 codemastr Relationship added has duplicate 0002559
2005-06-12 04:10 firstof9 Note Added: 0010078
2005-06-12 10:43 syzop Note Added: 0010082
2005-06-12 14:32 firstof9 Note Added: 0010085
2005-06-12 14:32 firstof9 File Added: unrealdebug.log.pid18807
2005-06-12 15:05 syzop Note Added: 0010086
2005-06-14 10:27 Deltaflyer Note Added: 0010090
2005-06-24 18:11 Monk Note Added: 0010105
2005-06-24 18:31 syzop Note Added: 0010108
2005-06-24 19:35 Monk Note Added: 0010116
2005-06-26 16:37 Deltaflyer Note Added: 0010142
2005-07-03 19:02 syzop Note Added: 0010167
2005-07-03 19:02 syzop Note Edited: 0010167
2005-07-03 19:03 syzop Note Edited: 0010167
2005-07-04 05:50 HERZ Note Added: 0010173
2005-07-28 21:28 syzop Note Added: 0010277
2005-07-28 22:43 Stealth Note Added: 0010278
2005-07-29 15:57 Deltaflyer Note Added: 0010284
2005-08-11 17:27 syzop Note Added: 0010326
2005-08-19 11:56 syzop Relationship added related to 0002618
2005-08-19 11:57 syzop Relationship added related to 0002616
2005-08-28 14:40 Cnils Note Added: 0010404
2005-08-28 15:08 syzop Note Added: 0010405
2005-08-28 15:54 Gilou Note Added: 0010411
2005-09-05 22:57 syzop Note Added: 0010444
2005-09-18 15:54 syzop Note Added: 0010486
2005-09-18 16:06 Cnils Note Added: 0010487
2005-09-21 20:13 syzop Relationship added has duplicate 0002642
2005-09-21 20:31 syzop Note Added: 0010500
2005-09-21 20:57 syzop Note Added: 0010501
2005-09-21 20:58 syzop Note Edited: 0010501
2005-09-27 20:26 syzop Note Added: 0010523
2005-09-27 20:27 syzop Note Edited: 0010523
2005-09-27 20:27 syzop Note Added: 0010524
2005-10-01 18:22 syzop Note Added: 0010549
2005-10-01 18:24 syzop Note Edited: 0010549
2005-10-10 04:13 coolvibe Note Added: 0010566
2005-10-10 06:05 darko Note Added: 0010567
2005-10-10 09:33 syzop Note Added: 0010568
2005-10-10 09:41 syzop Note Edited: 0010568
2005-10-10 14:36 syzop Status acknowledged => resolved
2005-10-10 14:36 syzop Fixed in Version => 3.2.4
2005-10-10 14:36 syzop Resolution open => fixed
2005-10-10 14:36 syzop Assigned To => syzop
2005-10-10 14:36 syzop Note Added: 0010575