Ticket #168 (closed defect: fixed)
Patch: Limit the size of the nad cache
| Reported by: | markdoliner | Owned by: | smoku |
|---|---|---|---|
| Priority: | major | Component: | General |
| Version: | 2.1.17 | Keywords: | |
| Cc: | | Tracforge_linkmap: | |
| Blocking: | | Blocked By: | |
Description
Ok, the other patches I've submitted so far have been minor code cleanup, but this patch is more significant. This patch places limits on the number of nads that get added to the nad cache, and the maximum size of each nad.
I work at the instant messaging company meebo, and we have a pretty large jabber user base (I don't remember the exact numbers, but 30,000 simultaneous users comes to mind).
Messages that are passed around within jabberd2 are put into a 'nad' struct. It's an extremely bare-bones way to handle XML. Each nad struct contains 4 memory buffers. Each of these memory buffers starts off small, but will increase dynamically if more space is needed (so if jabberd2 inserts a large chunk into the cdata section of an XML node, then the cdata buffer might be expanded).
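To make the buffer behaviour concrete, here is a minimal sketch of a buffer that starts small and grows on demand. The `grow_buf` struct, its field names, and the 64-byte starting size are all made up for illustration; this is not jabberd2's actual nad layout or growth policy.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical growable byte buffer; NOT jabberd2's real nad layout. */
struct grow_buf {
    char  *data;  /* heap storage, NULL until the first append */
    size_t len;   /* bytes currently in use */
    size_t cap;   /* bytes currently allocated */
};

/* Append n bytes, doubling the capacity until the data fits.
 * Returns 0 on success, -1 if realloc() fails. */
static int grow_buf_append(struct grow_buf *b, const char *src, size_t n)
{
    assert(b != NULL);
    if (b->len + n > b->cap) {
        size_t new_cap = b->cap ? b->cap : 64;  /* assumed starting size */
        while (new_cap < b->len + n)
            new_cap *= 2;
        char *p = realloc(b->data, new_cap);
        if (p == NULL)
            return -1;
        b->data = p;
        b->cap = new_cap;
    }
    if (n > 0)
        memcpy(b->data + b->len, src, n);
    b->len += n;
    return 0;
}

/* Reset for re-use: the contents are discarded but the capacity is kept,
 * so one large message leaves the buffer large from then on. */
static void grow_buf_reset(struct grow_buf *b)
{
    b->len = 0;
}
```

Note that `grow_buf_reset()` never shrinks the allocation. That is the property that matters for the rest of this description.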
jabberd2 will allocate a new nad struct when handling an incoming or outgoing message. When it's finished with that particular nad it inserts it into a nad cache (it does NOT free the nad). This nad cache contains a bunch of nads that are available for use by jabberd2. The next time jabberd2 needs a new nad struct, it first goes to the nad cache and checks if there are any unused/available nads. If the nad cache is empty then it allocates a new nad. In this way old nad structs are re-used as much as possible. The 4 buffers inside each nad are similarly re-used--they're never freed, and they will increase in size if needed.
All this was done because memory allocation is an 'expensive' operation. By putting nads into the nad cache they can be re-used and there will be fewer calls to malloc() and free(). However, we've found that the ever-increasing size of the nad cache is a far bigger problem than the CPU load incurred from malloc() and free().
If a jabberd2 server has a large spike in traffic, then it might need to allocate a large number of nad structs to handle all the messages. So then you have a lot of relatively small structs that live forever. As jabberd2 continues to send and receive messages, these structs will be rotated in and out of use. That's fine... but some messages are larger than others. Over time the buffers within each nad will grow to accommodate large messages.
Our real-world experience shows that limiting the nad cache has a drastic effect on memory usage. Before the patch our sm binary would start at around 200MB of memory and slowly climb to over 2.5GB before we were forced to restart it. After this patch our sm binary got up to around 1GB of memory and stayed there for four months without needing to be restarted.
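For readers who want the general idea without reading the patch, the two limits (a cap on how many nads sit in the cache, and a cap on how large a cached nad may be) can be sketched roughly as below. This is an illustration only, not the patch itself: `struct item`, `cache_get()`, `cache_put()` and both limit values are hypothetical stand-ins, and a real nad has 4 buffers rather than one.

```c
#include <assert.h>
#include <stdlib.h>

/* Illustrative limits only; the values used by the real patch are not given here. */
#define CACHE_MAX_ITEMS 2
#define ITEM_MAX_BYTES  4096

/* Hypothetical stand-in for a nad: one growable buffer plus a free-list link. */
struct item {
    char        *buf;
    size_t       cap;   /* bytes allocated for buf */
    struct item *next;
};

struct cache {
    struct item *head;   /* singly linked list of idle items */
    size_t       count;  /* number of idle items in the list */
};

/* Take an idle item from the cache, or allocate a fresh one if it is empty. */
static struct item *cache_get(struct cache *c)
{
    assert(c != NULL);
    if (c->head != NULL) {
        struct item *it = c->head;
        c->head = it->next;
        c->count--;
        it->next = NULL;
        return it;
    }
    return calloc(1, sizeof(struct item));
}

/* Hand an item back. It is cached only if the cache has room AND its
 * buffer has not grown past the size limit; otherwise it is freed.
 * Returns 1 if the item was cached, 0 if it was freed. */
static int cache_put(struct cache *c, struct item *it)
{
    assert(c != NULL && it != NULL);
    if (c->count >= CACHE_MAX_ITEMS || it->cap > ITEM_MAX_BYTES) {
        free(it->buf);
        free(it);
        return 0;
    }
    it->next = c->head;
    c->head = it;
    c->count++;
    return 1;
}
```

With limits like these, a traffic spike can still allocate as many structs as it needs, but once the spike is over the extras (and any that ballooned for a huge message) go back to the allocator instead of living forever in the cache.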
