View Issue Details
ID | Project | Category | View Status | Date Submitted | Last Update |
---|---|---|---|---|---|
0002499 | unreal | ircd | public | 2005-04-24 14:13 | 2005-10-10 14:36 |
Reporter | Deltaflyer | Assigned To | syzop | ||
Priority | normal | Severity | crash | Reproducibility | random |
Status | resolved | Resolution | fixed | ||
Platform | Linux, FreeBSD | ||||
Product Version | 3.2.3 | ||||
Fixed in Version | 3.2.4 | ||||
Summary | 0002499: find_cache_number crash | ||||
Description | Our IRCd servers have crashed a few times since we updated them to 3.2.3. Now we have a core file. At the moment we don't know what caused these crashes. *** SYZOP *** This is the main bug for the find_cache_number crash. There have been 4 reports in the past few months (and about 10 in the past 2 years). This issue is very hard to track down, so I ask anyone who is experiencing this bug to run valgrind (see bugnotes for instructions), which might help track it down (hopefully, that is). *** /SYZOP *** | ||||
Steps To Reproduce | seems semi-random... | ||||
Additional Information | ircd@h64944:~/Unreal3.2$ gdb src/ircd core gdb: Symbol `emacs_ctlx_keymap' has different size in shared object, consider re-linking GNU gdb 2002-04-01-cvs .. #0 0x08066c4d in find_cache_number (rptr=0x0, numb=0x8209ba0 "Ùái@3\203") at res.c:1443 1443 for (i = 0; HE(cp)->h_addr_list[i]; i++) (gdb) bt #0 0x08066c4d in find_cache_number (rptr=0x0, numb=0x8209ba0 "Ùái@3\203") at res.c:1443 #1 0x08065a58 in gethost_byaddr (addr=0x8209ba0 "Ùái@3\203", lp=0xbfffec10) at res.c:466 #2 0x0806adad in start_of_normal_client_handshake (acptr=0x8209ba0) at s_bsd.c:1353 #3 0x0806aa98 in add_connection (cptr=0x8171a28, fd=45) at s_bsd.c:1338 #4 0x0806b5bb in read_message (delay=2, listp=0x813a860) at s_bsd.c:1872 #5 0x0805fe6e in main (argc=0, argv=0x81358c9) at ircd.c:1564 | ||||
Tags | No tags attached. | ||||
Attached Files | |||||
3rd party modules | Modules: [no pattern here] | ||||
has duplicate | 0002551 | closed | When running with openSSL support, server crashes randomly | |
has duplicate | 0002558 | closed | Unknown random crash | |
has duplicate | 0002559 | closed | crash at res.c:1301 | |
has duplicate | 0002642 | closed | Exited on signal 11 (core dumped) | |
related to | 0002505 | closed | Crash in find_cache_number - res.c:1446 | |
related to | 0002501 | closed | Segmentation Fault | |
related to | 0002618 | closed | Random Crash | |
related to | 0002616 | closed | Unknown crash, maybe the dns cache thing again ? |
|
(modified copypaste of questions @ 0002501) These bugs are hard to trace, so I'm gonna flood you with some questions: 1. Did you recently reconfigure/recompile unreal, like.. changing a config.h setting or running ./Config to enable/disable something? 2. How many clients were connected locally? 3. How many servers were connected locally? 4. How long have you been running unrealircd without problems? Or did this happen anytime before? 5. If you have multiple servers, did they ever experience the same issue? FYI, the crash is in the DNS routine, so if you can think of anything DNS related that might have caused this, let us know as well. |
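For readers who do not have res.c at hand: the faulting line in frame #0 of the backtrace walks the address list of a cached host entry. The following is a minimal sketch of that kind of lookup, assuming a simplified cache layout. The struct `cache_entry`, the function `find_by_addr` and the constant `ADDR_LEN` are hypothetical names for illustration and are not Unreal's actual definitions. It shows where NULL checks can guard the walk; such checks cannot catch a non-NULL pointer that has already been damaged, which is why a crash like this one is hard to pin down.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define ADDR_LEN 4  /* IPv4 only in this sketch */

/* Hypothetical, simplified cache entry; not the real res.c structure. */
struct cache_entry {
    struct cache_entry *next;      /* next entry in the same hash bucket */
    const unsigned char **addrs;   /* NULL-terminated list, like h_addr_list */
};

/* Walk one bucket chain and return the entry that holds `addr`, or NULL. */
struct cache_entry *find_by_addr(struct cache_entry *bucket,
                                 const unsigned char *addr)
{
    struct cache_entry *cp;
    size_t i;

    assert(addr != NULL);          /* passing no address is a caller bug */
    for (cp = bucket; cp != NULL; cp = cp->next) {
        if (cp->addrs == NULL)     /* entry without addresses: skip it */
            continue;
        for (i = 0; cp->addrs[i] != NULL; i++) {
            if (memcmp(cp->addrs[i], addr, ADDR_LEN) == 0)
                return cp;
        }
    }
    return NULL;
}
```

An empty bucket or an entry without an address list simply yields no match here instead of a dereference of NULL.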
|
1. We enabled the umlaut support for German umlauts like ä, ö, ü, ß. Apart from that I don't know of anything we changed. 2. We have about 20 to 60 users on the server. 3. This server is connected as a leaf to our hub. 4. Our ircds were running without problems for months with 3.2.2. But after the upgrade to 3.2.3 (about a month ago) we had several crashes (but without core files). 5. We had that problem on 2 servers once, and on 2 servers more than once. I don't think we changed anything DNS related before the crash. We only took 2 servers out of the round robin, but that was 2-3 weeks ago. |
|
Any news about this problem? We had the same crash again a few days ago. |
|
Hm right, my bad. You can help out by running the ircd with valgrind (www.valgrind.org, but perhaps your distro already has a nice package for it). Then run the ircd by executing the following command (taken from HERZ ;p): valgrind --error-limit=no --verbose --time-stamp=yes --log-file=unrealdebug.log src/ircd Whenever you crash, you can mail me the log (zipped?) to [email protected] Thanks :) |
|
{moved all other related bugreports to here + moved users monitoring this bug also to here ;p} Could everyone experiencing this bug paste me their /etc/resolv.conf? Feel free to xxxx out part of the ip (eg: the last 2 or 3 octets) and the 'search xxxxxx, yyyyyy' etc... For example mine is: - search xxxxxxx nameserver 192.168.x.xxx - Also, did anyone get a crash when running with valgrind? Or did the issue mysteriously disappear for everyone for now ;) |
|
A shell I was on that has this problem has the following resolv.conf: - search xxxxxxxx.com xxxxxx.net nameserver 0.0.0.0 nameserver 63.xxx.xxx.xxx nameserver 63.yyy.yyy.yyy - |
|
here's the resolv.conf syzop: nameserver 127.0.0.1 nameserver 4.2.2.2 nameserver 4.2.2.1 i will try to get a valgrind log. |
|
Thanks. Just for the record, I've been playing with a bruteforcer several days ago... but I wasn't able to cause a crash :( I'm personally using 1 nameserver in my resolv.conf however, so I was wondering if perhaps you need multiple nameservers in resolv.conf to trigger this bug. Or.. perhaps you just have more chance of triggering it then, I don't know. I suppose that, since nobody else posted a reaction on this bug, they are currently not experiencing this bug (although I don't see why my resolv.conf cannot be answered).. Which is another property of this bug... it seems it's quite sporadic, and when it happens it often happens in "peaks" (and after that a long period of nothing). I guess that's also one of the things that makes this bug hard to catch... If I only could reproduce it here... :) |
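Since the open question here is whether the number of nameserver entries matters, the resolv.conf files posted in this report can be compared by counting those entries. Below is a minimal sketch of such a count. `count_nameservers` is a hypothetical helper for illustration and is not part of Unreal. It assumes the `nameserver` keyword starts the line and is followed by whitespace and a value.

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

/* Count "nameserver" entries in resolv.conf-style text.
 * Only lines that begin with the keyword are counted, so comment
 * lines starting with '#' or ';' are ignored automatically. */
int count_nameservers(const char *text)
{
    static const char key[] = "nameserver";
    const size_t keylen = sizeof(key) - 1;
    int count = 0;

    assert(text != NULL);
    while (*text != '\0') {
        const char *eol = strchr(text, '\n');
        size_t len = eol ? (size_t)(eol - text) : strlen(text);

        /* keyword, then at least one blank, all within this line */
        if (len > keylen && strncmp(text, key, keylen) == 0 &&
            isspace((unsigned char)text[keylen]))
            count++;

        text = eol ? eol + 1 : text + len;
    }
    return count;
}
```

For the file posted just above (127.0.0.1, 4.2.2.2, 4.2.2.1) this returns 3, while a file with a single local resolver returns 1.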
|
It was working perfectly fine till a few days ago, but here's the log. (--time-stamp didn't work as a flag for some reason) |
|
{I noticed you also mailed it, but since it's also here I suppose I'll reply here :)} Thanks. Problem though.. why did you start valgrind with --tool=none? Basically now it did no memory (out-of-bounds/double free/access-after-free/etc) checks whatsoever, so I'm afraid it is not of much use for me :(. If you could run it again without the --tool=none argument then that would be helpful (--tool=memcheck is the default), thanks :). |
|
Hi, sorry for the late response, but I was on holiday :) We had another crash on Thursday. This is the resolv.conf of this server: search xxxx.info nameserver 217.160.xxx.xxx nameserver 195.20.yyy.yyy nameserver 195.20.zzz.zzz This ircd wasn't started with valgrind, so I have no debug info. Sorry. |
|
resolv.conf: nameserver 1.2.3.4 nameserver 1.2.3.5 Regards, Monk |
|
Thanks. Got a question though: LAN or remote ips? -- Forgot to mention this: I got a valgrind log from firstof9 along with core files, I'll be researching this information (further) very soon and hope it will be helpful. (That said, valgrind did not see it as a 'double free' or 'read-after-free' from the resolver, rather as an unknown/damaged pointer thingy). |
|
Remote IP's Monk |
|
also remote IPs here |
|
If someone is getting (highly) annoyed by this, - if possible - you could try setting up a local bind9/dnscache and having only 1 nameserver entry with 127.0.0.1 as ip (no remote nameservers) in /etc/resolv.conf. Just a wild guess..... If it works then it is - of course - still just a workaround. PS: After changing /etc/resolv.conf you'll have to restart the ircd -- unfortunately at this time it won't reread resolv.conf during /rehash. |
|
Just a wild guess from my side: why not activate the setting in Unreal? dns { nameserver 213.131.254.5; timeout 2s; retries 2; }; I know that UnrealIRCd ignores this setting (I don't know why), but wouldn't it make it easier to check whether a server is crashing again when all servers on a network use the same nameserver? :) And an idea for maybe Unreal3.3: dns { nameserver1 213.131.254.5; nameserver2 127.0.0.1; timeout 2s; retries 2; }; Add a backup nameserver. Regards Chris aka HERZ |
|
#2603 also reported the same bug. What is interesting is that this user also has (2) remote nameservers. -- HERZ: I agree, and it's certainly something I think must be done for 3.3*. Btw on the "why does it ignore it in 3.2" part: well, since this was never used for years, the thought was: if we would now enable it, we would break hundreds (if not thousands) of irc servers (and not just in an obvious way, but rather.. sneaky ;p). |
|
> Btw on the "why does it ignore it in 3.2" part: well, since this was never used for years, the thought was: if we would now enable it, we would break hundreds (if not thousands) of irc servers (and not just in an obvious way, but rather.. sneaky ;p). That's why people are supposed to read docs/changelog when they are upgrading... I say if they don't, it is their problem. |
|
is there still a need of valgrind debug Logs? I send you (syzop) one, but never got a reply. |
|
Deltaflyer: valgrind logs are always welcome :). So far they have only helped a little, but I don't want to miss any log since perhaps it COULD one day contain something useful :). -- status update: I still have no idea why this happens (and where it happens), main problem is that I cannot reproduce it. I've tried (&coded) stuff like a dns bruteforcer that randomly corrupts bytes, that randomly delays it ("virtual timeouts"), etc... all when connecting ~200 clones every minute or so (and dns TTL of objects at like 60 or 300). I've used both valgrind (a heap debugger) and mudflap (a stack/object debugger included in gcc4) when doing the above described tests. No success... I guess I've spent now like 20-30 hours on this bug, without ever being able to trigger it... annoying (well probably not as annoying as having your server crashed, but...) :(. |
|
I'd like to put the focus on this bug. It's happening more and more frequently now. I just hope it's not a user-exploitable bug we're looking at here. The increasing number of crashes could suggest that. |
|
Do you have some more core files? Could you send them to me (along with the src/ircd binary and src/modules/commands.so) to [email protected] (or an url, of course). Did you try the suggestion mentioned a few posts back? Having 127.0.0.1 as a nameserver (and only that nameserver) in /etc/resolv.conf. |
|
Just to add something on what HERZ suggested, it would be even smarter to allow a NS pool like set { nameservers { ip; ip; ip; }; } Just in case you were to add this :)) |
|
right. Important status update: we are currently (seriously) reviewing to see if we can use another dns library + simplified rewritten caching, this should "only" take (up to) 10h of work. After it is finished we will do extensive testing ourselves, and if that turns out to be ok, we'll provide a link here and/or send out an email to all known people that are affected by this bug. This would enable you guys to test if that version works properly and does indeed get rid of the issue. If results are positive, then we will probably integrate them in CVS. If time permits, then it will also be in 3.2.4 (but then we really want to be sure it works ok and has been proven to work on live networks). I don't know exactly when this version will be available, but it should be a matter of days (3-7d), not weeks. |
|
UPDATE: Basically it's been working on *NIX for IPv4 (IPv6 is untested) for over a week now, but we are having a struggle with autoconf and stuff which is basically bringing everything to a halt :/. Windows support is also untested (well, basically it won't work yet ;p), but I doubt anyone HERE cares about that :P. |
|
Just fyi. The many crashes we had over a period of a month have diminished to nearly no crashes now. Nothing's changed, everything is set up as it used to be. |
|
Good :). I just found a box (vmware ;p) with autoconf 2.53, so I can get further again. [newer versions are giving problems, that are hard to resolve ;p] |
|
early alpha version.... I don't recommend using it yet, unless you are crashing like every day at the moment. Just see it as a last resort for now. I gotta add IPv6 support and windows support and improve some other things (and actually do many more tests like valgrind). So.. JUST in case your whole net is crashing and you need something quickly ;). [/disclaimer] cvs -d :pserver:[email protected]:/home/cmunk/ircsystems/cvsroot login [press enter when asked for the password] cvs -z3 -d :pserver:[email protected]:/home/cmunk/ircsystems/cvsroot co -r c-ares_resolver -d Unreal unreal And run ./Config and make So just to repeat: work will continue for the next 7 days or so, until it's finished (IPv6/windows/improved more secure c-ares randomized numbers/more c-ares stats probably), then I'll post again, with the suggestion to actually try it :). |
|
update: ipv6 support is now working as well, and so is win32 (though, win32 is not perfect yet). |
|
Uhh forget my last comment (which has now been edited out) ;). On 2nd thought, I think we (unreal team) should betatest it first! :). |
|
Ok, I think I've betatested it extensively now :). I even printed out the resolver source and hunted down a few bugs, also ran ipv6 tests. Tested it both on i686 and amd64 (both w/valgrind). So if anyone wants to give it a try :): - cvs -d :pserver:[email protected]:/home/cmunk/ircsystems/cvsroot login [press enter when asked for the password] cvs -z3 -d :pserver:[email protected]:/home/cmunk/ircsystems/cvsroot co -r c-ares_resolver -d Unreal unreal And run ./Config and make - Now, I don't know what the next decision will be, merge it with unreal3_2_fixes or keep it a separate branch for now, would have to really think about that (this also affects the 3.2.4 release date) :). |
|
Any updates on this? Does the alternative resolver help? Because I'm having the crashes in the same place. Do note that I have only one entry in my resolv.conf file. search xxx.net nameserver 194.126.xxx.xxx |
|
One thing to note is that, out of the 6 crashes I had in 4 days, all of them happened with the hashv returned from hash_number being 12. I can't figure out why it would be important, but it's a very unusual coincidence that hashing all sorts of hosts produces the same result. Poor hashing algorithm? I am sick as a dog (hence my being home instead of at work) so I am calling it quits for this morning, but someone should take a look at the hashing routines. Did anyone check the distribution of hashes? Perhaps a hidden overflow occurs in one of the buckets? I should also mention that it has usually been a while between crashes, a couple of months, but now it happens much more often. Out of the other two servers on the network (all 3 running 3.2.2b) one is quite busy, taking about 1500-2000 users (40% of the total), while the other one is populated by a mere 50-100. The 'little one' never crashed and the 'middle one' has crashed only twice since the 3.2.2b bugfix. |
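The question about the distribution of hashes can be checked with a simple bucket histogram. Below is a minimal sketch, assuming a stand-in hash. `toy_hash`, `NBUCKETS` and `bucket_histogram` are hypothetical names for illustration, and the real `hash_number` from res.c is not reproduced here. The stand-in sums the four address bytes, which is enough to show how a weak hash piles different hosts into one bucket.

```c
#include <assert.h>
#include <stddef.h>

#define NBUCKETS 101   /* assumed table size for this sketch */

/* Hypothetical stand-in hash: sums the four address bytes. */
unsigned toy_hash(const unsigned char addr[4])
{
    return (unsigned)(addr[0] + addr[1] + addr[2] + addr[3]) % NBUCKETS;
}

/* Fill hist[] with the number of addresses landing in each bucket and
 * return the size of the fullest bucket. */
size_t bucket_histogram(const unsigned char (*addrs)[4], size_t n,
                        size_t hist[NBUCKETS])
{
    size_t i, max = 0;

    assert(addrs != NULL && hist != NULL);
    for (i = 0; i < NBUCKETS; i++)
        hist[i] = 0;
    for (i = 0; i < n; i++) {
        size_t b = toy_hash(addrs[i]);
        if (++hist[b] > max)
            max = hist[b];
    }
    return max;
}
```

With this stand-in, 10.0.0.1, 10.0.1.0 and 10.1.0.0 all land in bucket 11, because a byte sum ignores the order of the octets. Running the same histogram over real client addresses with the real hash would show whether a value such as 12 is over-represented or merely a coincidence.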
|
The alternative resolver should work (ALL old resolver code is gone), but.. I guess nobody with the crash problem has tried it yet :P. darko: I noticed some unusual pattern too when analyzing ~10 core files, but.. I couldn't trace it down. Also, it was not always the same hash (or near the same hash), but there was a (clear) pattern around 50-55 and 10-15. It doesn't have to necessarily mean something but... Anyway, I don't intend to take a look at this anymore, I've done so too many times, and now we got the new resolver written... I don't see why :) I'm seriously considering merging it with current CVS now, since I don't think it will receive much testing otherwise ;). I'll post a comment when done so. *EDIT: I mean with the unreal3_2_fixes branch of course ;p* |
|
Done [.387], changelog entry follows: - Removed all old resolver code and switched over to c-ares (+our caching routines). This should get rid of some annoying untraceable (and usually rare) crashbugs in the old resolver. Besides that, it makes things look more clean and understandable. This should be the fix for the following bugids (all the same issue): 0002499, 0002551, 0002558, 0002559, #2603, 0002642, 0002502, 0002501, 0002618, 0002616. Feedback and testing are very much welcome ([email protected]). |
Date Modified | Username | Field | Change |
---|---|---|---|
2005-04-24 14:13 | Deltaflyer | New Issue | |
2005-04-24 14:13 | Deltaflyer | 3rd party modules | => Modules: m_staff, operjoin, m_jumpserver, operpasswd; Services: kickservices |
2005-04-28 16:27 | syzop | Relationship added | related to 0002505 |
2005-04-28 16:30 | syzop | Note Added: 0009850 | |
2005-04-28 17:53 | Deltaflyer | Note Added: 0009853 | |
2005-05-18 05:30 | Deltaflyer | Note Added: 0009955 | |
2005-05-18 10:32 | syzop | Note Added: 0009957 | |
2005-05-18 10:37 | syzop | Relationship added | related to 0002501 |
2005-06-06 20:29 | syzop | Summary | IRCd Crash with no reason, may SSL Problem => find_cache_number crash |
2005-06-06 20:30 | syzop | View Status | private => public |
2005-06-06 20:37 | syzop | Note Added: 0010052 | |
2005-06-06 20:38 | syzop | Relationship added | has duplicate 0002551 |
2005-06-06 20:44 | syzop | 3rd party modules | Modules: m_staff, operjoin, m_jumpserver, operpasswd; Services: kickservices => Modules: [no pattern here] |
2005-06-06 20:44 | syzop | Reproducibility | have not tried => random |
2005-06-06 20:44 | syzop | Status | new => acknowledged |
2005-06-06 20:44 | syzop | OS | debian => |
2005-06-06 20:44 | syzop | OS Version | Kernel 2.6.8-1 => |
2005-06-06 20:44 | syzop | Platform | Linux => Linux, FreeBSD |
2005-06-06 20:44 | syzop | Description Updated | |
2005-06-06 20:44 | syzop | Steps to Reproduce Updated | |
2005-06-06 20:44 | syzop | Additional Information Updated | |
2005-06-06 20:46 | syzop | Note Added: 0010054 | |
2005-06-11 14:47 | syzop | Relationship added | has duplicate 0002558 |
2005-06-12 00:40 | | Relationship added | has duplicate 0002559 |
2005-06-12 04:10 | firstof9 | Note Added: 0010078 | |
2005-06-12 10:43 | syzop | Note Added: 0010082 | |
2005-06-12 14:32 | firstof9 | Note Added: 0010085 | |
2005-06-12 14:32 | firstof9 | File Added: unrealdebug.log.pid18807 | |
2005-06-12 15:05 | syzop | Note Added: 0010086 | |
2005-06-14 10:27 | Deltaflyer | Note Added: 0010090 | |
2005-06-24 18:11 | Monk | Note Added: 0010105 | |
2005-06-24 18:31 | syzop | Note Added: 0010108 | |
2005-06-24 19:35 | Monk | Note Added: 0010116 | |
2005-06-26 16:37 | Deltaflyer | Note Added: 0010142 | |
2005-07-03 19:02 | syzop | Note Added: 0010167 | |
2005-07-03 19:02 | syzop | Note Edited: 0010167 | |
2005-07-03 19:03 | syzop | Note Edited: 0010167 | |
2005-07-04 05:50 | HERZ | Note Added: 0010173 | |
2005-07-28 21:28 | syzop | Note Added: 0010277 | |
2005-07-28 22:43 | Stealth | Note Added: 0010278 | |
2005-07-29 15:57 | Deltaflyer | Note Added: 0010284 | |
2005-08-11 17:27 | syzop | Note Added: 0010326 | |
2005-08-19 11:56 | syzop | Relationship added | related to 0002618 |
2005-08-19 11:57 | syzop | Relationship added | related to 0002616 |
2005-08-28 14:40 | Cnils | Note Added: 0010404 | |
2005-08-28 15:08 | syzop | Note Added: 0010405 | |
2005-08-28 15:54 | Gilou | Note Added: 0010411 | |
2005-09-05 22:57 | syzop | Note Added: 0010444 | |
2005-09-18 15:54 | syzop | Note Added: 0010486 | |
2005-09-18 16:06 | Cnils | Note Added: 0010487 | |
2005-09-21 20:13 | syzop | Relationship added | has duplicate 0002642 |
2005-09-21 20:31 | syzop | Note Added: 0010500 | |
2005-09-21 20:57 | syzop | Note Added: 0010501 | |
2005-09-21 20:58 | syzop | Note Edited: 0010501 | |
2005-09-27 20:26 | syzop | Note Added: 0010523 | |
2005-09-27 20:27 | syzop | Note Edited: 0010523 | |
2005-09-27 20:27 | syzop | Note Added: 0010524 | |
2005-10-01 18:22 | syzop | Note Added: 0010549 | |
2005-10-01 18:24 | syzop | Note Edited: 0010549 | |
2005-10-10 04:13 | coolvibe | Note Added: 0010566 | |
2005-10-10 06:05 | darko | Note Added: 0010567 | |
2005-10-10 09:33 | syzop | Note Added: 0010568 | |
2005-10-10 09:41 | syzop | Note Edited: 0010568 | |
2005-10-10 14:36 | syzop | Status | acknowledged => resolved |
2005-10-10 14:36 | syzop | Fixed in Version | => 3.2.4 |
2005-10-10 14:36 | syzop | Resolution | open => fixed |
2005-10-10 14:36 | syzop | Assigned To | => syzop |
2005-10-10 14:36 | syzop | Note Added: 0010575 |