Ticket #100 (closed defect: fixed)

Opened 14 months ago

Last modified 3 months ago

connection refused against 5222/5269 - ports appear unopened

Reported by: dwaite Owned by: smoku
Priority: critical Component: General
Version: 2.1.19 Keywords:
Cc: Tracforge_linkmap:
Blocking: Blocked By:

Description

Against 2.1.7, using the default .dist files, I am able to start the server but it does not listen on any ports. netstat -nalp shows no listening ports belonging to c2s or s2s, although both have connections to the router.

compilation is done using ./configure --prefix=/opt/jabberd2 --enable-cyrus --enable-ssl --enable-mysql --disable-gsasl

System is a debian 'testing' system, up to date as of today.

Attachments

debug.log.bz2 (10.2 kB) - added by dwaite 14 months ago.
debug log
jabberlog (11.8 kB) - added by mstevens 14 months ago.
log of failed startup

Change History

  Changed 14 months ago by smoku

Please compile with --enable-debug and start c2s and s2s with -D to see the debug messages.

  Changed 14 months ago by smoku

  • version changed from 2.1 to 2.1.7

  Changed 14 months ago by dwaite

Are there particular debug messages you are looking for, or do you just want a log generated from starting each?

  Changed 14 months ago by smoku

I was counting that you will find what's wrong by yourself. :-)

If you can't you can attach the logsto the ticket.

Changed 14 months ago by dwaite

debug log

follow-ups: ↓ 6 ↓ 14   Changed 14 months ago by dwaite

it appears that the components are not finishing their negotiation with the router, so they do not get to the point of calling mio_listen. I don't really understand which point within the distributed state machine they are getting caught up at, however.

in reply to: ↑ 5   Changed 14 months ago by wiwi

  • priority changed from major to critical

same problem here. Tried 2.1.6 and 2.1.7 under FreeBSD 6.2 *amd64*.

[...] C2S : Tue Jul 10 20:51:23 2007 c2s.c:567 want read C2S : Tue Jul 10 20:51:23 2007 c2s.c:1072 read action on fd 5 C2S : sx (io.c:169) 5 ready for reading C2S : sx (io.c:175) tag 5 event 2 data 0x525500 C2S : Tue Jul 10 20:51:23 2007 c2s.c:577 reading from 5 C2S : Tue Jul 10 20:51:23 2007 c2s.c:602 read 144 bytes C2S : sx (io.c:191) passed 144 read bytes C2S : sx (chain.c:93) calling io read chain C2S : sx (io.c:215) decoded read data (144 bytes): <?xml version='1.0'?><stream:stream xmlns:stream='http://etherx.jabber.org/streams' version='1.0' id='c8ohcefu1dtwmts2v1x6umb22125wdsbhqql88zs'> C2S : sx (client.c:70) stream response: to (null) from (null) version 1.0 id c8ohcefu1dtwmts2v1x6umb22125wdsbhqql88zs C2S : sx (client.c:85) 5 state change from 2 to 3 C2S : sx (client.c:86) tag 5 event 4 data 0x0 C2S : Tue Jul 10 20:51:23 2007 c2s.c:1072 read action on fd 5

[...] C2S : Tue Jul 10 20:51:23 2007 c2s.c:567 want read C2S : Tue Jul 10 20:51:23 2007 c2s.c:1072 read action on fd 5 C2S : sx (io.c:169) 5 ready for reading C2S : sx (io.c:175) tag 5 event 2 data 0x525500 C2S : Tue Jul 10 20:51:23 2007 c2s.c:577 reading from 5 C2S : Tue Jul 10 20:51:23 2007 c2s.c:602 read 144 bytes C2S : sx (io.c:191) passed 144 read bytes C2S : sx (chain.c:93) calling io read chain C2S : sx (io.c:215) decoded read data (144 bytes): <?xml version='1.0'?><stream:stream xmlns:stream='http://etherx.jabber.org/streams' version='1.0' id='c8ohcefu1dtwmts2v1x6umb22125wdsbhqql88zs'> C2S : sx (client.c:70) stream response: to (null) from (null) version 1.0 id c8ohcefu1dtwmts2v1x6umb22125wdsbhqql88zs C2S : sx (client.c:85) 5 state change from 2 to 3 C2S : sx (client.c:86) tag 5 event 4 data 0x0 C2S : Tue Jul 10 20:51:23 2007 c2s.c:1072 read action on fd 5

  Changed 14 months ago by mstevens

  • version changed from 2.1.7 to 2.1.10

I also see this problem with jabberd 2.1.10 on FreeBSD 6.2. I'm getting nothing listening on port 5222, and 4 jabber processes using 100% of CPU between them. I'll attach a log of the debug output.

Changed 14 months ago by mstevens

log of failed startup

  Changed 14 months ago by smoku

I think the problem is specific to Cyrus SASL backend. Please try with GnuSASL backend.

  Changed 14 months ago by sxw

The 'jabberlog' debug trace would appear to be from a system using GnuSASL.

What would help hugely, if you've got processes in a busy wait (using 100% of CPU), is to attach gdb to each of the processes, and produce a backtrace. If you need information on how to do this, let me know and I'd be happy to talk you through it.

Simon.

  Changed 13 months ago by mikiecx

i have the same problem with jabberd-2.1.12. it doesn't listen on 5222. there is only router listening on 5347.

  Changed 10 months ago by wiwi

  • version changed from 2.1.10 to 2.1.19

problem still in 2.1.19 (and FreeBSD 7). Anyone working on this problem? I might help with gdb traces...

sm: (gdb) next 231 MIO_ITERATE_RESULTS(m, retval, iter) (gdb) next 276 } (gdb) next main (argc=Variable "argc" is not available. ) at main.c:337 337 if(sm_logrotate) { (gdb) next 346 if(sm_lost_router) { (gdb) next 334 while(!sm_shutdown) { (gdb) next 335 mio_run(sm->mio, 5); (gdb) next

Breakpoint 1, _mio_run (m=0x801c68c00, timeout=Variable "timeout" is not available. ) at mio_impl.h:246 246 if(FD(m,fd)->type & type_CONNECT && (gdb) next 254 if(FD(m,fd)->type == type_NORMAL && MIO_CAN_READ(m,iter)) (gdb) next 231 MIO_ITERATE_RESULTS(m, retval, iter) (gdb) next 276 } (gdb) next main (argc=Variable "argc" is not available. ) at main.c:337 337 if(sm_logrotate) { (gdb) next 346 if(sm_lost_router) { (gdb) next 334 while(!sm_shutdown) { (gdb) bt #0 main (argc=Variable "argc" is not available. ) at main.c:334

s2s:

#0 0x0000000801a13f4c in select () from /lib/libc.so.7 #1 0x000000000040e62b in _mio_run (m=0x801c13000, timeout=Variable "timeout" is not available. ) at mio_impl.h:91 #2 0x0000000000405123 in main (argc=1194124073, argv=Variable "argv" is not available. ) at main.c:523 (gdb) trace trace command requires an argument (gdb) step Single stepping until exit from function select, which has no line number information. _mio_run (m=0x801c13000, timeout=Variable "timeout" is not available. ) at mio_impl.h:218 218 if(retval == 0) return; (gdb) step 221 if(retval < 0) (gdb) step 231 MIO_ITERATE_RESULTS(m, retval, iter) (gdb) step 208 { (gdb) step 156 newfd = accept(fd->fd, (struct sockaddr*)&serv_addr, &addrlen); (gdb) step 208 { (gdb) step 156 newfd = accept(fd->fd, (struct sockaddr*)&serv_addr, &addrlen); (gdb) step 208 { (gdb) step 156 newfd = accept(fd->fd, (struct sockaddr*)&serv_addr, &addrlen); (gdb) 163 j_inet_ntop(&serv_addr, ip, sizeof(ip)); (gdb) step 233 mio_fd_t fd = MIO_ITERATOR_FD(m,iter); (gdb) step 236 if(FD(m,fd)->type == type_CLOSED) continue; (gdb) step 239 if(FD(m,fd)->type == type_LISTEN && MIO_CAN_READ(m,iter)) (gdb) step 246 if(FD(m,fd)->type & type_CONNECT && (gdb) step 254 if(FD(m,fd)->type == type_NORMAL && MIO_CAN_READ(m,iter)) (gdb) step 231 MIO_ITERATE_RESULTS(m, retval, iter) (gdb) step 276 } (gdb) step main (argc=1194124073, argv=Variable "argv" is not available. ) at main.c:525 525 if(s2s_logrotate) { (gdb) step 534 if(s2s_lost_router) { (gdb) step 556 while(jqueue_size(s2s->dead) > 0) (gdb) step jqueue_size (q=0x801c1b080) at jqueue.c:125 125 int jqueue_size(jqueue_t q) { (gdb) step 127 } (gdb) step main (argc=1194124073, argv=Variable "argv" is not available. ) at main.c:560 560 while(jqueue_size(s2s->dead_conn) > 0) { (gdb) step jqueue_size (q=0x801c1b0b0) at jqueue.c:125 125 int jqueue_size(jqueue_t q) { (gdb) step 127 } (gdb) step main (argc=1194124073, argv=Variable "argv" is not available. ) at main.c:571 571 if(s2s->check_interval > 0 && time(NULL) >= s2s->next_check) { (gdb) step 580 if(time(NULL) > check_time + 60) { (gdb) step 522 while(!s2s_shutdown) { (gdb) step 523 mio_run(s2s->mio, 5); (gdb) step _mio_run (m=0x801c13000, timeout=5) at mio_impl.h:208 208 { (gdb) step 91 MIO_FUNCS

  Changed 10 months ago by kbrantley

  • cc kyle@… added

I'm seeing this same problem on CentOS5, 2.1.19.

Anyone know of the most recent working version ('recent' relative to several months, it seems)? I'm highly interested in getting a working copy of jabberd2, even if it means downgrading in the short term.

Can attach any logs of any length by request to assist in debugging.

  Changed 10 months ago by kbrantley

  • cc kyle@… removed

Okay, I just tried .19, .18, .17, and then jumped back to .6. All of these versions are showing this problem.

Also removing my CC because I just noticed the RSS link at the bottom.

in reply to: ↑ 5   Changed 10 months ago by smoku

Replying to dwaite:

it appears that the components are not finishing their negotiation with the router, so they do not get to the point of calling mio_listen.

I think this is the case. So there is no point really in getting traces from the main mio_run() loop, because the listening socket is not attached to it, and it happened earlier.

It would be interesting to get trace from C2S c2s/c2s.c lines 746-850, to know why it does not open the socket.

  Changed 10 months ago by wiwi

c2s/c2s.c

746: case event_PACKET: 747: nad = (nad_t) data;

(gdb) break c2s.c:747 Breakpoint 1 at 0x407a02: file c2s.c, line 747. (gdb) run Starting program: /usr/local/bin/c2s

the breakpoint isn't ever hit... Other suggestions?

The program still hangs at:

_mio_run (m=0x801c4dc00, timeout=Variable "timeout" is not available.) at mio_impl.h:218 218 if(retval == 0) return; [next] 221 if(retval < 0) [next] 231 MIO_ITERATE_RESULTS(m, retval, iter) [next] 208 { [next] 156 newfd = accept(fd->fd, (struct sockaddr*)&serv_addr, &addrlen); [next] 208 {

  Changed 10 months ago by smoku

Maybe it would be easier if I could get my hands (ssh) on the affected machine? If it is possible, please contact me on xmpp:smoku@xiaoka.com

follow-up: ↓ 19   Changed 10 months ago by kbrantley

I added you the other day and we talked a bit. I've got that server ready to go - your SSH key included. It's currently running .19 with --enable-debug and --enable-developer. The binaries are running in this problematic state at this very moment. If you have any questions or need to get in contact with me, xmpp:kyle@aboveaverageurl.com.

follow-up: ↓ 20   Changed 10 months ago by mikevg

Hi,

Is this coming anywhere?

I tried installing jabberd about 1 month ago and hit at this wall (back then i did not find this ticket). Now I tried the same thing (noticed my gsasl was out of date, so I updated it to the latest). But that didn't make my problem go away.

As I can see it now, there is no way I can configure the jabber server so that any of the 4 client processes can connect to the router. It looks almost as if the router doesn't really accept the connection so the clients (c2s, s2s, resolver and sm) can't start the authentication process.

I tried jabberd jabberd 2.1.19 on latest debian stable. I custom compiled gsasl and libgsasl and expat to be the latest versions.

I haven't tried cyrus sasl, but as I was reading this that shouldn't make any difference. if it does please let me know and I'll try that aswell.

If you want a debian machine to test jabberd, let me know.

Kind regards, Mike van Goor.

in reply to: ↑ 17   Changed 9 months ago by smoku

Replying to kbrantley:

I added you the other day and we talked a bit. I've got that server ready to go - your SSH key included. It's currently running .19 with --enable-debug and --enable-developer. The binaries are running in this problematic state at this very moment. If you have any questions or need to get in contact with me, xmpp:kyle@aboveaverageurl.com.

sx (sasl_cyrus.c:1041) initialising sasl plugin
sx (sasl_cyrus.c:1099) sasl context initialised; appname=jabberd-router

This is definetely related to CyrusSASL backend.

After recompiling with GnuSASL, the c2s connects to router properly and waits for connections.

sx (sasl_gsasl.c:690) initialising sasl plugin
sx (sasl_gsasl.c:718) sasl context initialised
[...]
Sun Dec 16 09:04:57 2007 [notice] [0.0.0.0, port=5222] listening for connections
Sun Dec 16 09:04:57 2007 [notice] ready for connections

in reply to: ↑ 18   Changed 9 months ago by smoku

Replying to mikevg:

If you want a debian machine to test jabberd, let me know.

Please do contact me on xmpp:smoku@xiaoka.com about this.

follow-up: ↓ 24   Changed 8 months ago by holler

This is definitely a problem through a compiler bug. I've had the same problem using the cyrus-backend and have traced it donw to the assembler-level.

I've used gcc 4.1.1 and have compiled cyrus-sasl with -Os and -march=c3-2. In sasl_getprop(sasl_conn_t *conn, int propnum, const void **pvalue) (file lib/common.c) the part responsible for evaluating SASL_MECHNAME will end in this assembler code (the comments are from me):

; *((const char **)pvalue) = ... ; ... ; if(!(*pvalue) && result == SASL_OK) result = SASL_NOTDONE; cmpl $0, (%ebp) ; test if *pvalue (ebp) != 0 movl %eax, (%ebp) ; set *pvalue jne .L228 ; normal return jmp .L264 ; -> return with -6 (SASL_NOTDONE)

This means gcc has ordered the cmpl and movl wrong, it should be movl cmpl. The result is, that sasl_getprop will return a wrong rc if the pointer it should set is zero before. That happens e.g. in router.c resulting that the connection will not be authorized and thus e.g. c2s will not get the bindings and so does not start listening. I assume people using the gsasl-backend will have somewhere a similiar problem (either in jabberd or gsasl), if they compiled using a buggy compiler.

Using gcc 4.2.2 the assembler-source looks ok and the error does not appear, so it seems they have fixed it (I haven't searched through the gcc-buglist).

  Changed 8 months ago by holler

Here is assembler-part preformatted: ;)

        ; *((const char **)pvalue) = ...
        ; ...
        ; if(!(*pvalue) && result == SASL_OK) result = SASL_NOTDONE;
        cmpl    $0, (%ebp) ; test if *pvalue (ebp) != 0
        movl    %eax, (%ebp) ; set *pvalue
        jne     .L228 ; normal return
        jmp     .L264 ; -> return with -6 (SASL_NOTDONE)

  Changed 8 months ago by holler

People who don't want to update/downgrade the compiler should use -O2, this seems to work (at least here).

in reply to: ↑ 21 ; follow-up: ↓ 25   Changed 8 months ago by mikiecx

Replying to holler:

This is definitely a problem through a compiler bug. I've had the same problem using the cyrus-backend and have traced it donw to the assembler-level. I've used gcc 4.1.1 and have compiled cyrus-sasl with -Os and -march=c3-2.

nice research but... my gcc version is 3.4.6. (slackware 11).

mike@elmo:~$ gcc -v
Reading specs from /usr/lib/gcc/i486-slackware-linux/3.4.6/specs
Configured with: ../gcc-3.4.6/configure --prefix=/usr --enable-shared --enable-threads=posix --enable-__cxa_atexit --disable-checking --with-gnu-ld --verbose --target=i486-slackware-linux --host=i486-slackware-linux
Thread model: posix
gcc version 3.4.6

in reply to: ↑ 24   Changed 8 months ago by holler

Replying to mikiecx:

nice research but... my gcc version is 3.4.6. (slackware 11).

Another possible problem I've had (on a debian stable machine) was, that the db was locked (through an unclean shutdown), afterwards the init hanged while trying to initialize the db. db4.2_recover -v -h /var/lib/jabberd2/db/ fixed this. Maybe this helps.

  Changed 8 months ago by smoku

  • status changed from new to closed
  • resolution set to fixed

Added check for optimization level in [530].

Note: See TracTickets for help on using tickets.