Ticket #78 (closed defect: fixed)

Opened 16 months ago

Last modified 3 months ago

c2s crash

Reported by: tofu Owned by: smoku
Priority: critical Component: c2s
Version: 2.1 Keywords:
Cc: jack Tracforge_linkmap:
Blocking: Blocked By:

Description

Our c2s has been crashing. I have a core dump. I was gonna look at the code, but I thought I would create a ticket and see if someone could help quicker than me going through the code.

This is debian stable running jabberd2-2.1.6

Here is the gdb backtrace.

Core was generated by `/home/chesspark/production/bin/c2s -c /home/chesspark/production/etc/jabberd/c2'.
Program terminated with signal 6, Aborted.
#0  0xffffe410 in __kernel_vsyscall ()
(gdb) backtrace
#0  0xffffe410 in __kernel_vsyscall ()
#1  0xf7bf8811 in raise () from /lib/tls/i686/cmov/libc.so.6
#2  0xf7bf9fb9 in abort () from /lib/tls/i686/cmov/libc.so.6
#3  0xf7c2dd3a in __fsetlocking () from /lib/tls/i686/cmov/libc.so.6
#4  0xf7c33dfc in malloc_usable_size () from /lib/tls/i686/cmov/libc.so.6
#5  0xf7c35c68 in free () from /lib/tls/i686/cmov/libc.so.6
#6  0xf7c3792f in malloc () from /lib/tls/i686/cmov/libc.so.6
#7  0x08059489 in _sx_buffer_new (data=0x0, len=14, notify=0, notify_arg=0x0) at sx.c:212
#8  0x080564b7 in sx_can_read (s=0x808bb20) at io.c:172
#9  0x0804f004 in c2s_router_mio_callback (m=0x808b938, a=action_READ, fd=0x808bb08, data=0x0, arg=0x806e008) at c2s.c:1080
#10 0x0805d389 in _mio_run (m=0x808b938, timeout=5) at mio_impl.h:239
#11 0x08053491 in main (argc=Cannot access memory at address 0x3609
) at main.c:637


Change History

  Changed 16 months ago by tofu

  • priority changed from major to critical

  Changed 16 months ago by jack

I looked into this a bit, and it appears some corruption is happening somewhere. _sx_buffer_new is called with (NULL, 1024, NULL, NULL), yet the backtrace shows len as 14. Also, there is no reason the malloc should fail.

Memory use at the time of this crash was 4GB available (out of 8GB total). CPU load was quite high when this happened. This is a dual quad-core machine, and the 5m average load before the spike was about 1. At the time of the crash it was just under 5.

I'm not sure if the load was caused by c2s or one of our other server components, but it seems suspicious.

  Changed 15 months ago by tofu

  • version changed from 2.1 to 2.1.6

The core keeps happening. Here is a new backtrace on a core dump.

Core was generated by `/home/chesspark/production/bin/c2s -c /home/chesspark/production/etc/jabberd/c2'.
Program terminated with signal 6, Aborted.
#0  0xffffe410 in __kernel_vsyscall ()
(gdb) backtrace
#0  0xffffe410 in __kernel_vsyscall ()
#1  0xf7c67811 in raise () from /lib/tls/i686/cmov/libc.so.6
#2  0xf7c68fb9 in abort () from /lib/tls/i686/cmov/libc.so.6
#3  0xf7c9cd3a in __fsetlocking () from /lib/tls/i686/cmov/libc.so.6
#4  0xf7ca2dfc in malloc_usable_size () from /lib/tls/i686/cmov/libc.so.6
#5  0xf7ca4c68 in free () from /lib/tls/i686/cmov/libc.so.6
#6  0xf7ca692f in malloc () from /lib/tls/i686/cmov/libc.so.6
#7  0x08056d69 in _sx_buffer_new (data=0x0, len=2, notify=0, notify_arg=0x0) at sx.c:212
#8  0x08054b17 in sx_can_read (s=0x83e7000) at io.c:172
#9  0x0805004a in _c2s_client_mio_callback (m=0x8086938, a=action_READ, fd=0x82e1ff0, data=0x0, arg=0x848f148) at c2s.c:427
#10 0x08059eb9 in _mio_run (m=0x8086938, timeout=5) at mio_impl.h:239
#11 0x080522d1 in main (argc=Cannot access memory at address 0x7ee3
) at main.c:637

follow-up: ↓ 5   Changed 15 months ago by smoku

Is it 64bit architecture machine?

in reply to: ↑ 4   Changed 15 months ago by jack

Replying to smoku:

Is it 64bit architecture machine?

Yes.

  Changed 15 months ago by smoku

So this is probably 64bit related bug. Main development platform of jabberd2 is still 32bit platform.

  Changed 15 months ago by jack

I'm happy to give you access to a machine to debug this on if that is what it takes to fix this.

  Changed 14 months ago by tofu

There is a new crash with 2.1.8 :

(gdb) backtrace
#0  0xf7b61c85 in malloc_usable_size () from /lib/tls/i686/cmov/libc.so.6
#1  0xf7b63c68 in free () from /lib/tls/i686/cmov/libc.so.6
#2  0xf7b6592f in malloc () from /lib/tls/i686/cmov/libc.so.6
#3  0xf7c486e6 in XML_ParserReset () from /usr/lib/libexpat.so.1
#4  0xf7c4907c in XML_ParserCreate_MM () from /usr/lib/libexpat.so.1
#5  0xf7c490f0 in XML_ParserCreateNS () from /usr/lib/libexpat.so.1
#6  0x08056d1d in sx_new (env=0x8070740, tag=12, cb=0x804fe10 <_c2s_client_sx_callback>, arg=0x80a3a38) at sx.c:37
#7  0x0804fa90 in _c2s_client_mio_callback (m=0x8086938, a=action_ACCEPT, fd=0x80a7ee8, data=0xff9c215a, arg=0x8069008) at c2s.c:480
#8  0x08059e13 in _mio_run (m=0x8086938, timeout=5) at mio_impl.h:164
#9  0x08051ed1 in main (argc=Cannot access memory at address 0x1
) at main.c:639



  Changed 14 months ago by smoku

Is it also SIGABRT?

What architecture?

  Changed 14 months ago by tofu

  • version changed from 2.1.6 to 2.1

Are you asking if it received signal SIGABRT? Or if it crashes when it gets that?

I believe it is a Core 2 Duo E6700 2.66GHz. Running on debian stable.

follow-up: ↓ 12   Changed 14 months ago by smoku

Yes, I'm asking wether it recieved SIGABRT. The default handler for this signal is crashdump.

Is Core 2 Duo an IA32 or IA64?

in reply to: ↑ 11 ; follow-up: ↓ 13   Changed 14 months ago by tofu

Replying to smoku:

Yes, I'm asking wether it recieved SIGABRT. The default handler for this signal is crashdump.

No, it was just running normally and then crashed.

Is Core 2 Duo an IA32 or IA64?

IA64.

in reply to: ↑ 12   Changed 14 months ago by smoku

Replying to tofu:

Replying to smoku:

Yes, I'm asking wether it recieved SIGABRT. The default handler for this signal is crashdump.

No, it was just running normally and then crashed.

But which signal caused the crash? This really matters, because SIGABRT (on the contrary to ex. SIGSEGV) is generated programatically not by kernel.

Is Core 2 Duo an IA32 or IA64?

IA64.

Shish... Deep in-libc IA64 specific crash... It's going to be hard...

  Changed 14 months ago by tofu

Core was generated by `/home/chesspark/production/bin/c2s -c /home/chesspark/production/etc/jabberd/c2'. Program terminated with signal 11, Segmentation fault. #0 0xf7b61c85 in malloc_usable_size () from /lib/tls/i686/cmov/libc.so.6

So SIGSEGV caused it.

I know it is hard, we are looking into it too. We need all the help we can get, we would like to continue to use jabberd2.

  Changed 14 months ago by smoku

Is it possible that you run the server under Electric Fence? http://perens.com/FreeSoftware/ElectricFence/

It slows down program heavily, but causes a crash right once the wrong memory pointer is used.

  Changed 14 months ago by tofu

Sure, I will get back with the results.

  Changed 13 months ago by jack

  • cc jack added

Note that #124 was filed about errors running under electric fence.

  Changed 13 months ago by jack

We just had another one of these tonight. This is the only weird thing in the log:

Mon Aug 6 23:07:31 2007 [notice] [98] [latency1@…/watch_arbiter] write error: Broken pipe (32)

Looking at the previous crashes in the logs there is one of these lines before every one with this backtrace. There are other instances of Broken pipe errors from regular users, but these don't seem to crash the server. A first guess is that this may be related to legacy auth (since our normal clients use sasl and the latency tester doesn't).

  Changed 13 months ago by smoku

  • status changed from new to closed
  • resolution set to fixed

(In [343]) Applied MIO memleak fix related to time_checks by Christof Meerwald. Refs #124, fixes #78

  Changed 13 months ago by smoku

If it did not actually fix this bug, please reopen the ticket.

Note: See TracTickets for help on using tickets.