View Issue Details
ID | Project | Category | View Status | Date Submitted | Last Update |
---|---|---|---|---|---|
0006484 | unreal | ircd | public | 2024-11-26 06:35 | 2025-10-05 16:24 |
Reporter | craftxbox | Assigned To | syzop | ||
Priority | normal | Severity | minor | Reproducibility | always |
Status | resolved | Resolution | fixed | ||
Platform | x86_64 | OS | Ubuntu | OS Version | Mixed |
Product Version | 6.1.7.2 | ||||
Fixed in Version | 6.2.1 | ||||
Summary | 0006484: Bad performance when handling thousands of users being synced or lost in netsplit. | ||||
Description | I made a pseudo-server script that introduces 10,000 users to a network and joins them all to 10 separate channels. When the pseudo-server links, the introduction of the users works fine with no performance issues. When joining them to a single channel, there are likewise no noticeable performance issues. At 10 channels, there is a 20-second delay between the pseudo-server's end of sync and the sync acknowledgement from the remote server:
```
xmit: :999 EOS
xmit: NETINFO 0 1732596854 6100 * 0 0 0 :CRXB Industries
xmit: PING :test.dev.crxb.cc
[2024-11-26T04:54:25.550Z] Write buffer exhausted.
[2024-11-26T04:54:45.079Z] recv: :3W3 SLOG warn link LINK_UNRELIABLE :Warning, no response from par1.fr.crxb.cc for 15 seconds
[2024-11-26T04:54:45.079Z] recv:
[2024-11-26T04:54:46.010Z] recv: :3W3 SLOG info link SERVER_SYNCED :Link test.dev.crxb.cc -> hel1.fi.crxb.cc is now synced [secs: 31, recv: 3010525, sent: 18319]
[2024-11-26T04:54:46.010Z] recv: :hel1.fi.crxb.cc PONG hel1.fi.crxb.cc :test.dev.crxb.cc
```
This slowdown can also propagate across the network, causing servers to drop with a ping timeout:
```
[01:24:14] hel1.fi.crxb.cc link.SERVER_LINKED [info] Server linked: hel1.fi.crxb.cc -> test.dev.crxb.cc [secure: TLSv1.3-TLS_CHACHA20_POLY1305_SHA256]
[01:24:44] hel1.fi.crxb.cc link.LINK_UNRELIABLE [warn] Warning, no response from par1.fr.crxb.cc for 15 seconds
[01:24:45] hel1.fi.crxb.cc link.SERVER_SYNCED [info] Link test.dev.crxb.cc -> hel1.fi.crxb.cc is now synced [secs: 31, recv: 3010525, sent: 18319]
[01:25:29] hel1.fi.crxb.cc link.LINK_DISCONNECTED [error] Lost server link to par1.fr.crxb.cc [2001:bc8:710:3215:aaaa:dead:beef:cafe]: No response (Ping timeout)
[01:25:36] hel1.fi.crxb.cc link.LINK_RESOLVING [info] Resolving hostname par1.fr.crxb.cc...
[01:25:36] hel1.fi.crxb.cc link.LINK_CONNECTING [info] Trying to activate link with server par1.fr.crxb.cc (2001:bc8:710:3215:dc00:ff:fe3f:5a1:6900)...
[01:25:42] hel1.fi.crxb.cc link.SERVER_LINKED [info] Server linked: hel1.fi.crxb.cc -> par1.fr.crxb.cc [secure: TLSv1.3-TLS_CHACHA20_POLY1305_SHA256]
[01:25:42] hel1.fi.crxb.cc link.SERVER_SYNCED [info] Link par1.fr.crxb.cc -> hel1.fi.crxb.cc is now synced [secs: 0, recv: 14153, sent: 106354]
[01:25:42] hel1.fi.crxb.cc link.SERVER_LINKED_REMOTE [info] Server linked: tor1.ca.crxb.cc -> par1.fr.crxb.cc
[01:25:43] hel1.fi.crxb.cc link.SERVER_LINKED_REMOTE [info] Server linked: stj1.ca.crxb.cc -> tor1.ca.crxb.cc
[01:25:43] hel1.fi.crxb.cc link.SERVER_LINKED_REMOTE [info] Server linked: vrg1.us.crxb.cc -> tor1.ca.crxb.cc
[01:25:44] par1.fr.crxb.cc link.SERVER_SYNCED [info] Link tor1.ca.crxb.cc -> par1.fr.crxb.cc is now synced [secs: 1, recv: 16620, sent: 1856938]
[01:25:45] tor1.ca.crxb.cc link.SERVER_SYNCED [info] Link par1.fr.crxb.cc -> tor1.ca.crxb.cc is now synced [secs: 2, recv: 2103926, sent: 16852]
[01:26:59] hel1.fi.crxb.cc link.LINK_UNRELIABLE [warn] Warning, no response from par1.fr.crxb.cc for 15 seconds
[01:27:18] par1.fr.crxb.cc link.SERVER_SYNCED [info] Link hel1.fi.crxb.cc -> par1.fr.crxb.cc is now synced [secs: 96, recv: 3306511, sent: 24187]
```
The same effect also occurs when the pseudo-server is disconnected. | ||||
Steps To Reproduce | I have included the Node script I used to perform this testing. You will have to change the details, obviously, but you can change authMethod to `pass` and provide a `password` property if you do not feel like setting up spkifp for it. | ||||
Additional Information | I was not stress testing on purpose; the original reason for this user count was bridging a large (~6,000 member) Discord server with puppet users on IRC. While I can't really test this in a live environment, I expect this could also occur when a particularly large network gets netsplit. During the sync/disconnect process I can see unrealircd pinning the entire CPU core it's running on, on pretty much every server in the network. My test network of 5 servers, running on relatively low-spec machines, took 11 minutes to fully stabilize after a test run of 10k users in 50 channels. | ||||
Tags | No tags attached. | ||||
Attached Files | |||||
3rd party modules | |||||
|
Sorry, I don't have time for this at the moment, but I plan to revisit this later in March/April, as I surely want to optimize this :) |
|
I have not tested this particular script, but I have been profiling for a week now, at first optimizing for 1000 local clients, which was the main focus. Today I have been testing with 10k and later 100k clones in 1 channel via a pseudo-server, where the pseudo-server is linked to server B, and then I let server A and B connect, so the 100k clones are introduced and joined all at once (at A). I have made massive performance improvements, cutting this 100k UID+SJOIN stuff down to only a few seconds. Things haven't been tested well yet but... something tells me performance should be much, much better for your script too :D. And I'm not even finished yet; I only started working on the remote server case and the syncing case today. |
|
Ah I can fully reproduce your problem now with squit, will look into it :) |
|
Fixed, thanks for the report, and your patience.

commit af0a7844647277f32e75fd1ce0d371dcb9c75de4
Author: Bram Matthys <[email protected]>
Date: Sun Oct 5 08:24:14 2025 +0200

Make member & membership point to each other so lookups can be much faster.

This also makes them proper list items, again to make certain fast operations possible. Main thing is that removing an entry does not require us to walk all of those lists.

Not all code has been modified yet to benefit this, actually only very little, the most performance-impacting ones.

This fixes SQUIT of a server with 100k users in a single channel taking 40 seconds of 100% CPU. It now takes only 1 second.

Reported by craftxbox in https://bugs.unrealircd.org/view.php?id=6484

(Can't make member & membership one entry atm, that would be too much change in U6) |
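For readers who want a concrete picture of the idea in the commit message above, here is a minimal sketch of two doubly linked lists whose entries point at each other, so that removing a membership needs no list walk. This is not UnrealIRCd's actual code: the struct layouts, the field names (`peer`, `prev`, `next`) and the functions `join_channel`, `part_channel` and `simulate_squit` are all hypothetical and only illustrate the general technique.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical types for illustration only; not UnrealIRCd's real structs. */
typedef struct Member Member;         /* entry in a channel's member list */
typedef struct Membership Membership; /* entry in a user's channel list   */

struct Member {
    Member *prev, *next;   /* doubly linked, so unlinking needs no walk */
    Membership *peer;      /* matching entry on the user side           */
    int user_id;
};

struct Membership {
    Membership *prev, *next;
    Member *peer;          /* matching entry on the channel side */
    int channel_id;
};

typedef struct { Member *members; size_t count; } Channel;
typedef struct { Membership *channels; size_t count; int id; } User;

/* Join: create both entries, cross-link them, push each at its list head. */
Membership *join_channel(User *u, Channel *c, int channel_id)
{
    Member *m = calloc(1, sizeof *m);
    Membership *ms = calloc(1, sizeof *ms);
    if (!m || !ms) { free(m); free(ms); return NULL; }

    m->user_id = u->id;
    m->peer = ms;
    ms->channel_id = channel_id;
    ms->peer = m;

    m->next = c->members;
    if (c->members) c->members->prev = m;
    c->members = m;
    c->count++;

    ms->next = u->channels;
    if (u->channels) u->channels->prev = ms;
    u->channels = ms;
    u->count++;
    return ms;
}

/* Part: O(1) removal from both lists, reached through the peer pointer. */
void part_channel(User *u, Channel *c, Membership *ms)
{
    Member *m = ms->peer;
    assert(m != NULL && m->peer == ms);

    if (m->prev) m->prev->next = m->next; else c->members = m->next;
    if (m->next) m->next->prev = m->prev;
    c->count--;

    if (ms->prev) ms->prev->next = ms->next; else u->channels = ms->next;
    if (ms->next) ms->next->prev = ms->prev;
    u->count--;

    free(m);
    free(ms);
}

/* Join n users to one channel, then remove them all, as a SQUIT would.
 * Returns the number of members left in the channel afterwards. */
size_t simulate_squit(size_t n)
{
    Channel chan = {0};
    User *users;
    size_t i, left;

    if (n == 0) return 0;
    users = calloc(n, sizeof *users);
    if (!users) return (size_t)-1;

    for (i = 0; i < n; i++) {
        users[i].id = (int)i;
        join_channel(&users[i], &chan, 1);
    }
    /* Removal in join order hits the list tail first, which would be the
     * worst case for a search-from-head removal; here each step is O(1). */
    for (i = 0; i < n; i++)
        if (users[i].channels)
            part_channel(&users[i], &chan, users[i].channels);

    left = chan.count;
    free(users);
    return left;
}
```

With singly linked lists and no cross pointers, each removal would have to search the channel's member list, so removing n users from one channel costs on the order of n² steps in the worst case. With the layout sketched here, the same cleanup is n constant-time unlinks, which is consistent with the drop from 40 seconds to 1 second reported in the commit message.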
|
Oh and there have been various follow-up commits to make other things faster too :) |
Date Modified | Username | Field | Change |
---|---|---|---|
2024-11-26 06:35 | craftxbox | New Issue | |
2024-11-26 06:35 | craftxbox | File Added: example.ts | |
2024-11-26 06:35 | craftxbox | File Added: package.json | |
2025-02-16 08:55 | syzop | Note Added: 0023431 | |
2025-10-03 19:22 | syzop | Note Added: 0023518 | |
2025-10-03 19:22 | syzop | Note Edited: 0023518 | |
2025-10-03 19:24 | syzop | Note Edited: 0023518 | |
2025-10-03 19:27 | syzop | Note Edited: 0023518 | |
2025-10-03 19:42 | syzop | Note Edited: 0023518 | |
2025-10-04 19:25 | syzop | Note Added: 0023519 | |
2025-10-04 19:25 | syzop | Assigned To | => syzop |
2025-10-04 19:25 | syzop | Status | new => confirmed |
2025-10-05 15:48 | syzop | Status | confirmed => resolved |
2025-10-05 15:48 | syzop | Resolution | open => fixed |
2025-10-05 15:48 | syzop | Fixed in Version | => 6.2.1 |
2025-10-05 15:48 | syzop | Note Added: 0023520 | |
2025-10-05 16:24 | syzop | Note Added: 0023521 |