Ticket #206 (closed defect: fixed)

Opened 7 months ago

Last modified 6 months ago

resolver is not asynchronous

Reported by: Simon Owned by: Simon
Priority: major Component: resolver
Version: 2.2 Keywords:
Cc: Tracforge_linkmap:
Blocking: Blocked By:

Description

The resolver is not asynchronous and only responds to each request as it comes in. This effectively makes new s2s outgoing connections block and fail if the resolver is busy: 00:00:00 s2s needs to resolve A for outgoing connection, sends to resolver 00:00:01 resolver starts resolving A 00:00:05 s2s needs to resolve B for outgoing connection, sends to resolver 00:01:30 s2s checks timeouts and fails the outgoing connection to both A and B 00:02:01 resolver fails to resolve A and returns failure response 00:02:01 resolver starts resolving B 00:02:02 resolver returns response

The resolver needs to queue and asynchronously resolve hostnames instead to avoid unresponsive DNS resolution blocking other connections from working.

Attachments

patch_206_r1.patch (68.6 KB) - added by Simon 6 months ago.
Initial patch to fix this (note: resolver/ dir deleted)
patch_206_r2.patch (70.5 KB) - added by Simon 6 months ago.
Second revision, checks dns_submit hasn't failed (note: resolver/ dir deleted)
patch_206_r3.patch (80.9 KB) - added by Simon 6 months ago.
Third revision, negative dns caching, and automatic connection retries (note: resolver/ dir deleted)
patch_206_r4.patch (87.5 KB) - added by Simon 6 months ago.
Fourth revision, all planned changes completed (note: resolver/ dir deleted)
patch_206_r5.patch (87.5 KB) - added by Simon 6 months ago.
Fifth revision, fixes use after free if dns cache is disabled and freeing of cancelled active queries (note: resolver/ dir deleted)
patch_206_r6.patch (89.3 KB) - added by Simon 6 months ago.
Fixes retry failure in non-reuse mode. Removes jabberd.cfg references to resolver and resolver man page.
patch_206_r7.patch (89.6 KB) - added by Simon 6 months ago.
Accumulate weighting if a host is repeated, fix a bug in the reuse dest key use.
patch_206_r8.patch (90.0 KB) - added by Simon 6 months ago.
Updated to handle adding up the weights of gmail.com's multiple hostnames all resolving to the same IP/port…
patch_206_r9.patch (90.5 KB) - added by Simon 6 months ago.
Remove hosts from failure list when sending dialbacks (i.e. connection is online)
patch_206_r10.patch (99.0 KB) - added by Simon 6 months ago.
Includes fixes for #216 and #208

Change History

Changed 7 months ago by smoku

  • version changed from 2.1.23 to 2.1.24

This is all true.

I already started working on adns based s2s in [443], but I do not have much time to pursue.

I will gladly accept any contributions though.

Changed 7 months ago by Simon

I'll have a look at it once I've finished with c2s/s2s.

Changed 7 months ago by smoku

OK. So i commited my changes to branches/adns - it might help.

Changed 6 months ago by Simon

  • owner changed from smoku to Simon
  • status changed from new to assigned

Changed 6 months ago by Simon

Initial patch to fix this (note: resolver/ dir deleted)

Changed 6 months ago by Simon

The above patch is against svn trunk r570, the following is my todo list:

  1. [x] outgoing connections should be keyed by domain not host/ip
  2. [x] outgoing packets should find existing connections before using dns cache
  3. [x] dns cache should store multiple responses
  4. [x] outgoing connections should select a host randomly
  5. [x] temporary blacklist of host/ips which fail to accept connections
  6. [x] delete unused outgoing queues
  7. [x] minimum dns ttl (30s)
  8. [x] maximum dns ttl (1d)
  9. [x] cleanup running dns queries when the cache entry is expiried
  10. [x] expire dns cache entry early if all ips are blacklisted and there are expired non-blacklisted ips
  11. [ ] where possible use out->dkey instead of iterating out->routes
  12. [ ] split out_domain off out_packet, returns connection
  13. [ ] retry connection attempts using out_domain if host was bad and there are non-bad hosts
  14. [ ] negative dns cache (30s)

I'll finish the last four soon - the first entry on this list is a configuration option that defaults to the existing behaviour; see s2s.xml.dist.in. It requires more testing in both modes, ieally with short check times, but I think I've fixed all the bugs.

In reuse mode:

  • conn->key is ip/port
  • conn->dkey is NULL
  • conn->routes is the only list of domains for this conn
  • s2s->out_host is unique list of conns
  • s2s->out_dest is for finding existing conns

In non-reuse mode:

  • conn->key is ip/port
  • conn->dkey is domain
  • conn->routes will all be for the same dest domain
  • s2s->out_host is not used
  • s2s->out_dest is unique list of conns

It is important to use the right method to find a unique list of all connections or data will be double-freed. Also, conn->dkey is not useable in reuse mode, so conn->routes must be used instead to find all the out_dest entries.

Changed 6 months ago by Simon

Second revision, checks dns_submit hasn't failed (note: resolver/ dir deleted)

Changed 6 months ago by Simon

Updated to check that the dns_submit call hasn't failed:

  1. [x] where possible use out->dkey instead of iterating out->routes
  2. [x] detect immediate dns submission failures
  3. [ ] split out_domain off out_packet, returns connection
  4. [ ] retry connection attempts using out_domain if host was bad and there are non-bad hosts
  5. [ ] negative dns cache (30s)

Changed 6 months ago by smoku

I'm not sure about

4. [x] outgoing connections should select a host randomly

DNS SRV records have weight and the next candidate to connect should be based on the weight.

We should also use the DNS server round-robin support. When there are multiple answers for the same query, the DNS server is doing round-robin for us, by giving other first entry every time asked. So the best practise to select candidate to connect is to choose the first one on the list recieved from DNS server.

Changed 6 months ago by Simon

You can't rely on the DNS server to give reponses in a different order each time, although with a cache of bad hosts it doesn't really matter.

The weight value could be stored with each host (choosing the lowest weight if multiple SRV records return the same host) and this could then be used to determine what to try next... although it'd still need to use a random host if they've all failed.

Changed 6 months ago by smoku

Sure. You cannot rely on the DNS. But I think we should expect this behavior, because it is a standard, and do gracefull fallback when this is not met.

And it does not really matter which host we try next when all candidates are dead, so random is as good as anything else... ;-)

Changed 6 months ago by Simon

Third revision, negative dns caching, and automatic connection retries (note: resolver/ dir deleted)

Changed 6 months ago by Simon

Updated to support retrying connection attempts immediately before bouncing the queue of packets:

  1. [x] split out_route off out_packet and make use of bad hosts optional
  2. [x] retry connection attempts using out_route if host was bad and there are non-bad hosts
  3. [x] queue retry limit (5m)
  4. [x] allow cache to be disabled
  5. [x] negative dns cache (30s)
  6. [ ] use srv weight values to select hosts

Changed 6 months ago by Simon

Fourth revision, all planned changes completed (note: resolver/ dir deleted)

Changed 6 months ago by Simon

This should be it completed now:

  1. [x] use srv weight values to select hosts

It'll prefer, in order:

  1. reuseable connections
  2. IPv6
  3. IPv4
  4. bad hosts

In each list (aside from the last), it will include only those at the most preferred (lowest) priority and keep track of a running weight which is then used to determine which host is next at random (RFC 2782). I mapped weight 0 to 16 and weights 1-65535 to 256-16776960 (see end of _dns_result_a) so that they're spread out more and weight 0 is given only a very low chance of selection.

Changed 6 months ago by Simon

Fifth revision, fixes use after free if dns cache is disabled and freeing of cancelled active queries (note: resolver/ dir deleted)

Changed 6 months ago by Simon

Fixes retry failure in non-reuse mode. Removes jabberd.cfg references to resolver and resolver man page.

Changed 6 months ago by Simon

Accumulate weighting if a host is repeated, fix a bug in the reuse dest key use.

Changed 6 months ago by Simon

Updated to handle adding up the weights of gmail.com's multiple hostnames all resolving to the same IP/port...

Changed 6 months ago by Simon

Remove hosts from failure list when sending dialbacks (i.e. connection is online)

Changed 6 months ago by Simon

Includes fixes for #216 and #208

Changed 6 months ago by smoku

  • status changed from assigned to closed
  • resolution set to fixed

In [576]: Merged asynchronous domain resolving in s2s component support by Simon Arlott. Closes #206 #208 #216

Changed 6 months ago by smoku

  • version changed from 2.1.24 to 2.2
Note: See TracTickets for help on using tickets.