Ticket #168 (closed defect: fixed)

Opened 13 months ago

Last modified 13 months ago

Patch: Limit the size of the nad cache

Reported by: markdoliner Owned by: smoku
Priority: major Component: General
Version: 2.1.17 Keywords:
Cc: Tracforge_linkmap:
Blocking: Blocked By:

Description

Ok, the other patches I've submitted so far have been minor code cleanup, but this patch is more significant. This patch places limits on the number of nads that get added to the nad cache, and the maximum size of each nad.

I work at the instant messaging company meebo, and we have a pretty large jabber user base (I don't remember the exact numbers, but 30,000 simultaneous users comes to mind).

Messages that are passed around within jabberd2 are put into a 'nad' struct. It's an extremely bare-bones way to handle XML. Each nad struct contains 4 memory buffers. Each of these memory buffers starts off small, but will increase dynamically if more space is needed (so if jabberd2 inserts a large chunk into the cdata section of an xml node, then the cdata buffer might be expanded).

jabberd2 will allocate a new nad struct when handling an incoming or outgoing message. When it's finished with that particular nad it inserts it into a nad cache (it does NOT free the nad). This nad cache contains a bunch of nads that are available for use by jabberd2. The next time jabberd2 needs a new nad struct, it first goes to the nad cache and checks if there are any unused/available nads. If the nad cache is empty then it allocates a new nad. In this way old nad structs are re-used as much as possible. The 4 buffers inside each nad are similarly re-used--they're never freed, and they will increase in size if needed.
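To make the re-use pattern above concrete, here is a minimal sketch of an unbounded cache. It is not the jabberd2 code: the names (demo_nad, demo_cache, demo_nad_get, demo_nad_reserve, demo_nad_release) are hypothetical, and each nad is simplified to one growable buffer instead of four.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for a nad: one growable buffer instead of four. */
typedef struct demo_nad {
    char *buf;               /* grows on demand, never shrinks */
    size_t cap;              /* current buffer capacity in bytes */
    struct demo_nad *next;   /* link used while the nad sits in the cache */
} demo_nad;

typedef struct {
    demo_nad *free_list;     /* nads available for re-use */
    size_t cached;           /* how many nads are on free_list */
} demo_cache;

/* Take a nad from the cache if one is available, otherwise allocate one. */
static demo_nad *demo_nad_get(demo_cache *c) {
    assert(c != NULL);
    demo_nad *n = c->free_list;
    if (n != NULL) {
        c->free_list = n->next;
        c->cached--;
        n->next = NULL;
        return n;            /* keeps whatever capacity it grew to earlier */
    }
    return calloc(1, sizeof(*n));
}

/* Make sure the buffer can hold at least `need` bytes. */
static int demo_nad_reserve(demo_nad *n, size_t need) {
    assert(n != NULL);
    if (need <= n->cap)
        return 0;
    char *p = realloc(n->buf, need);
    if (p == NULL)
        return -1;
    n->buf = p;
    n->cap = need;
    return 0;
}

/* Hand a nad back: it always goes into the cache and nothing is freed. */
static void demo_nad_release(demo_cache *c, demo_nad *n) {
    assert(c != NULL && n != NULL);
    n->next = c->free_list;
    c->free_list = n;
    c->cached++;
}
```

In this sketch a nad that once held a large message keeps its large buffer for the rest of the process lifetime, which is the behavior the rest of this description is concerned with.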

All this was done because memory allocation is an 'expensive' operation. By putting nads into the nad cache they can be re-used and there will be fewer calls to malloc() and free(). However, we've found that the ever-increasing size of the nad cache is a far bigger problem than the CPU load incurred from malloc() and free().

If a jabberd2 server has a large spike in traffic, then it might need to allocate a large number of nad structs to handle all the messages. So then you have a lot of relatively small structs that live forever. As jabberd2 continues to send and receive messages, these structs will be rotated in and out of use. That's fine... but some messages are larger than others. Over time the buffers within each nad will grow to accommodate large messages.
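The two limits this patch introduces (a cap on how many nads are cached, and a cap on how large a cached nad may be) could look roughly like the sketch below. This is an illustration only, not the attached diff: the names (bounded_nad, bounded_cache, bounded_new, bounded_release) and both limit values are assumptions, and a nad is again reduced to a single buffer.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical limits; the values are for illustration only. */
#define BOUNDED_MAX_CACHED 2      /* at most this many nads kept in the cache */
#define BOUNDED_MAX_BYTES  1024   /* nads with larger buffers are not cached */

typedef struct bounded_nad {
    char *buf;
    size_t cap;
    struct bounded_nad *next;
} bounded_nad;

typedef struct {
    bounded_nad *free_list;
    size_t cached;
} bounded_cache;

/* Helper for the example: create a nad whose buffer already has `cap` bytes. */
static bounded_nad *bounded_new(size_t cap) {
    bounded_nad *n = calloc(1, sizeof(*n));
    if (n == NULL)
        return NULL;
    if (cap > 0) {
        n->buf = malloc(cap);
        if (n->buf == NULL) {
            free(n);
            return NULL;
        }
        n->cap = cap;
    }
    return n;
}

/* Hand a nad back. Returns 1 if it was cached, 0 if it was freed because
 * the cache is full or the nad has grown past the size limit. */
static int bounded_release(bounded_cache *c, bounded_nad *n) {
    assert(c != NULL && n != NULL);
    if (c->cached >= BOUNDED_MAX_CACHED || n->cap > BOUNDED_MAX_BYTES) {
        free(n->buf);
        free(n);
        return 0;
    }
    n->next = c->free_list;
    c->free_list = n;
    c->cached++;
    return 1;
}
```

With a policy like this, a traffic spike can still allocate many nads, but the extras are freed when they are released instead of living forever, and a nad that grew to hold one unusually large message does not stay in the cache.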

Real-world evidence shows that limiting the nad cache had a drastic effect on memory usage. Before the patch our sm binary would start at around 200MB of memory and slowly climb to over 2.5GB before we were forced to restart it. After this patch our sm binary got up to around 1GB of memory and stayed there for four months without needing to be restarted.

Attachments

jabberd2_limit_nadcache.diff (2.9 KB) - added by markdoliner 13 months ago.

Change History

Changed 13 months ago by markdoliner

Changed 13 months ago by smoku

  • status changed from new to assigned

I do agree with your reasoning. I think the resolution is correct and simple enough.

I applied it with one change - there is no reason to malloc the nad pointer. I've included it directly in the nad_cache structure.

Changed 13 months ago by smoku

  • status changed from assigned to closed
  • resolution set to fixed

Committed in [440].

Changed 13 months ago by markdoliner

Awesome, thanks!
