Ticket #124 (closed defect: fixed)

Opened 16 months ago

Last modified 10 months ago

running jabberd 2.1.12 with electric fence

Reported by: tofu Owned by: smoku
Priority: major Component: General
Version: 2.1.12 Keywords:
Cc: jack Tracforge_linkmap:
Blocking: Blocked By:

Description

Ok, this ticket is for notes on running jabberd 2 on a 64-bit platform with Electric Fence.

When I first run it, it starts up just fine. Then, when a client connects, auths with SASL, and sends <session/>, resolver crashes. The following is the core dump.

Core was generated by `/home/chesspark/dev/bin/resolver -c /home/chesspark/dev/etc/jabberd/resolver.xm'.
Program terminated with signal 11, Segmentation fault.
#0  0x08056e49 in _mio_run (m=0xf7ae9e74, timeout=5) at mio_impl.h:253
253                     MIO_UNSET_READ(m, FD(m,fd));
(gdb) backtrace
#0  0x08056e49 in _mio_run (m=0xf7ae9e74, timeout=5) at mio_impl.h:253
#1  0x0804caa5 in main (argc=3, argv=0xffcb73f4) at resolver.c:622



Change History

Changed 16 months ago by tofu

Now, when I bypass resolver so that it cannot crash, sm crashes instead with the following core dump:

Core was generated by `/home/chesspark/dev/bin/resolver -c /home/chesspark/dev/etc/jabberd/resolver.xm'.
Program terminated with signal 11, Segmentation fault.
#0  0x08056e49 in sx_env_free (env=0xf7ae9e74) at env.c:45
45          free(env->plugins);
(gdb) backtrace
#0  0x08056e49 in sx_env_free (env=0xf7ae9e74) at env.c:45
#1  0x0804caa5 in SSL_CTX_use_certificate_chain_file@plt ()
#2  0xf7ae9e74 in ?? ()
#3  0x00000005 in ?? ()
#4  0x08063448 in log_new (type=log_STDOUT, ident=0xf7fa2cc0 "", facility=0xffcb73c8 "") at log.c:72
#5  0xf7badea8 in ?? ()
#6  0x00000000 in ?? ()

Changed 16 months ago by tofu

Now c2s. This seems to be something that happens without Electric Fence too.

Core was generated by `/home/chesspark/dev/bin/c2s -D -c /home/chesspark/dev/etc/jabberd/c2s.xml'.
Program terminated with signal 11, Segmentation fault.
#0  _mio_run (m=0xf77d4e74, timeout=5) at mio_impl.h:257
257             if(FD(m,fd)->type == type_NORMAL && MIO_CAN_WRITE(m,iter))
(gdb) backtrace
#0  _mio_run (m=0xf77d4e74, timeout=5) at mio_impl.h:257
#1  0x08053215 in main (argc=0, argv=0x0) at main.c:643

Changed 16 months ago by tofu

Ok, I have been trying to debug c2s a bit, looking for why fd is null in this loop.

In line 252 of mio/mio_impl.h, active_READ seems to generate a close, and when we come back to the loop fd is null. So fd is gone for the rest of the loop.

We are not sure what the best way to fix this is. Should c2s or sx not close sockets? Or should the loop have checks for this?

Any ideas?
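
For reference, the "checks in the loop" option could look roughly like the sketch below. It is not the jabberd code: every sketch_* name, the fixed-size slot array, and the close_on_read flag are hypothetical stand-ins for the real MIO structures, and only the type_NORMAL / type_CLOSED names are borrowed from the backtraces above. The point is just that the slot is re-read after the read callback returns, since the callback may have freed it.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-ins for the MIO types; not the real jabberd structures. */
typedef enum { type_CLOSED, type_NORMAL } fd_type_t;

typedef struct {
    fd_type_t type;
    int close_on_read;   /* illustration only: the read callback closes this fd */
} sketch_fd;

#define SKETCH_MAX_FDS 4

typedef struct {
    sketch_fd *fds[SKETCH_MAX_FDS];   /* a NULL slot means the fd was freed */
} sketch_mio;

/* Register a descriptor in the given slot. */
static sketch_fd *sketch_add(sketch_mio *m, int slot, int close_on_read) {
    assert(slot >= 0 && slot < SKETCH_MAX_FDS);
    sketch_fd *fd = malloc(sizeof *fd);
    if (fd == NULL) return NULL;
    fd->type = type_NORMAL;
    fd->close_on_read = close_on_read;
    m->fds[slot] = fd;
    return fd;
}

/* Read callback that may close, free, and NULL its own descriptor. */
static void sketch_read_callback(sketch_mio *m, int slot) {
    if (m->fds[slot]->close_on_read) {
        free(m->fds[slot]);
        m->fds[slot] = NULL;
    }
}

/* One pass over the descriptors; returns how many write checks ran. */
static int sketch_run_once(sketch_mio *m) {
    int write_checks = 0;
    for (int slot = 0; slot < SKETCH_MAX_FDS; slot++) {
        if (m->fds[slot] == NULL) continue;   /* nothing registered here */
        sketch_read_callback(m, slot);
        if (m->fds[slot] == NULL) continue;   /* freed by the callback: skip */
        if (m->fds[slot]->type == type_NORMAL) write_checks++;
    }
    return write_checks;
}
```

With three descriptors registered and one of them closing itself during the read callback, sketch_run_once performs the write check for the other two only, instead of dereferencing the freed slot.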

Changed 16 months ago by jack

It looks to me after investigating that [313] is just plain wrong. You can't free the socket from _mio_close as this can be called from callbacks in mio_run. This means that during the middle of the event processing the fd pointer can get freed and set to NULL.

I don't understand why this change was made. The only thing I can think of is that epoll is not generating an event for when the socket is closed, but that is probably because you are removing the socket from the event queue.

If it is leaking fds, then there needs to be a closing_queue type thing that gets freed by mio_run. For now, it is better to leak than crash!
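
A minimal sketch of the closing_queue idea, assuming a singly linked list hung off the mio object. All sketch_* names and the next_closing field are hypothetical, not jabberd API. Close only marks the fd as type_CLOSED and queues it; the memory is released by a separate reap step that the run loop would call once the current pass is finished, so callbacks never free an fd the loop is still looking at.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-ins for the MIO types; not the real jabberd structures. */
typedef enum { type_CLOSED, type_NORMAL } fd_type_t;

typedef struct sketch_fd {
    fd_type_t type;
    struct sketch_fd *next_closing;   /* link in the pending-close list */
} sketch_fd;

typedef struct {
    sketch_fd *closing_queue;         /* fds waiting to be freed */
} sketch_mio;

/* Safe to call from a callback: marks and queues the fd, frees nothing. */
static void sketch_close(sketch_mio *m, sketch_fd *fd) {
    assert(fd != NULL);
    if (fd->type == type_CLOSED) return;   /* already queued, avoid double free */
    fd->type = type_CLOSED;
    fd->next_closing = m->closing_queue;
    m->closing_queue = fd;
}

/* Called by the run loop after event processing; returns the number freed. */
static int sketch_reap(sketch_mio *m) {
    int freed = 0;
    while (m->closing_queue != NULL) {
        sketch_fd *fd = m->closing_queue;
        m->closing_queue = fd->next_closing;
        free(fd);
        freed++;
    }
    return freed;
}
```

Between sketch_close and sketch_reap the fd pointer stays valid, so the rest of the loop can still test fd->type == type_CLOSED and skip it.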

Changed 16 months ago by jack

I also suggest changing the MIO_FREE_FD macro to additionally set the pointer to NULL explicitly after free is called.
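
Something along these lines, as a sketch only. SKETCH_FREE_FD is a hypothetical macro and does not match the argument list of the real MIO_FREE_FD; it just shows the free-then-NULL pattern, wrapped in do/while so it behaves as a single statement.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical macro, not the real MIO_FREE_FD: free the pointer and
 * clear it so a stale reference shows up as NULL instead of dangling. */
#define SKETCH_FREE_FD(ptr) do { free(ptr); (ptr) = NULL; } while (0)

/* Returns 1 if the pointer is NULL after being released twice. */
static int sketch_demo(void) {
    int *fd_state = malloc(sizeof *fd_state);
    if (fd_state == NULL) return 0;
    SKETCH_FREE_FD(fd_state);
    assert(fd_state == NULL);
    SKETCH_FREE_FD(fd_state);   /* harmless: free(NULL) is a no-op */
    return fd_state == NULL;
}
```

This only clears the one pointer variable passed to the macro. Any other copy of the pointer held elsewhere is still dangling, so it makes the bug easier to spot rather than fixing it.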

Changed 16 months ago by jack

  • cc jack added

Changed 16 months ago by smoku

  • status changed from new to assigned

Yes. [313] is a memory leak fix. An fd that had been removed from epoll never generated events once it was type_CLOSED. I do not use MIO backends other than epoll, so I did not notice that it breaks the other backends.

Changed 16 months ago by smoku

(In [337]) Reverted [313] because it causes crashes in non-epoll MIO backends. Refs #124

Changed 16 months ago by smoku

  • status changed from assigned to new
  • version changed from 2.1.11 to 2.1.12

Changed 16 months ago by tofu

Just a note: we are using the epoll backend, and it seems type_CLOSED events are being generated, although there are some cases where they are not. We sometimes get a crash when it is called; #78 is that crash. I can add another backtrace for that if you would like.

Changed 16 months ago by smoku

(In [343]) Applied MIO memleak fix related to time_checks by Christof Meerwald. Refs #124, fixes #78

Changed 10 months ago by smoku

  • status changed from new to closed
  • resolution set to fixed

Seems that fixed the problem.
