In the previous post,
Observing ECMP with Dublin Traceroute,
I briefly talked about ECMP and showed how to visualize all (well, hopefully) the paths
between two nodes A and B. In particular I had run
dublin-traceroute from a 3G connection in Ireland towards Google’s public DNS,
8.8.8.8. This is a very interesting IP address: it is a DNS server, it is deployed
via anycast, and it is widely distributed around the world, so I will use it often
throughout this blog.
In this post we will analyze the output of that traceroute, and at the end I’ll show how to interpret certain packet drops and what to do in those cases.
Dublin Traceroute’s output
In the previous post we simply ran
dublin-traceroute 8.8.8.8. The output starts with:
Starting dublin-traceroute
Traceroute from 0.0.0.0:12345 to 8.8.8.8:33434~33453 (probing 20 paths, min TTL is 1, max TTL is 30, delay is 10 ms)
The second line contains some interesting information:
dublin-traceroute is sending probes to
8.8.8.8 with 20 different destination ports (33434 to 33453), the source
port is 12345, and the TTL in the packets varies between 1 and 30.
All the packets are UDP, which is the only layer 4 protocol that dublin-traceroute
implements at the moment; more will come, though. Each packet has a custom payload
that is generated so as to control the UDP checksum (more on this in a future post).
Before sending each packet,
dublin-traceroute will wait 10 milliseconds, which
works out to 100 packets per second. It may seem a small number for a modern network, but
we will see later that for traceroute it could be too large.
All of the above means that
dublin-traceroute will send 30 * 20 = 600 packets
for a full 20-path traceroute, and it will take approximately 6 seconds to send them all.
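The arithmetic above can be reproduced with a few lines of Python. This is only a back-of-the-envelope sketch: the function name probe_budget is made up for illustration, and it assumes that the only cost is the fixed inter-packet delay, ignoring the time spent actually sending packets and waiting for replies.

```python
def probe_budget(min_ttl, max_ttl, num_paths, delay_ms):
    """Return (number of probes, seconds spent waiting between probes).

    Hypothetical helper: it only models the fixed inter-packet delay.
    """
    probes_per_path = max_ttl - min_ttl + 1   # one probe per TTL value
    packets = probes_per_path * num_paths     # every path gets a full TTL sweep
    seconds = packets * delay_ms / 1000       # one delay before each packet
    return packets, seconds


# Values taken from the output line above: TTL 1..30, 20 paths, 10 ms delay.
packets, seconds = probe_budget(min_ttl=1, max_ttl=30, num_paths=20, delay_ms=10)
print(f"{packets} packets, about {seconds} seconds")
```

Changing the delay or the number of paths shows quickly how the total run time scales.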
Next we see a header:
== Flow ID 33434 ==
This header is repeated below with different values, and each occurrence represents a single flow of packets through a unique path (unless the paths change during the traceroute). The flow ID also corresponds to the destination port of the UDP packets and, as seen above, we are probing 20 different paths, which correspond to 20 different destination ports.
As also discussed above,
dublin-traceroute will send 30 packets (one per TTL value) for each
(source port, destination port) combination towards a given host.
Flows, microflows, 5-tuples?
A flow in Dublin Traceroute’s slang is a 5-tuple of source host and port, destination host and port, and protocol. This is also known as a microflow (RFC 2474, Definition of the Differentiated Services Field (DS Field) in the IPv4 and IPv6 Headers), and it is discussed in Paris Traceroute as well. This 5-tuple is used by ECMP to decide which next hop a packet will be forwarded to, and by Dublin Traceroute to force packets through a specific path.
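To make the idea concrete, here is a minimal sketch of hash-based next-hop selection. It is not how any specific router works: real devices use vendor-specific hash functions and may look at different fields. The CRC32 hash, the pick_next_hop function, the next-hop names and the documentation-range IP addresses are all assumptions made for illustration.

```python
import zlib
from typing import NamedTuple


class Flow(NamedTuple):
    """The 5-tuple that identifies a flow (microflow)."""
    src_ip: str
    src_port: int
    dst_ip: str
    dst_port: int
    protocol: int  # 17 is UDP


def pick_next_hop(flow, next_hops):
    """Hypothetical ECMP decision: hash the 5-tuple, pick one equal-cost hop."""
    key = "|".join(str(field) for field in flow).encode()
    return next_hops[zlib.crc32(key) % len(next_hops)]


# Made-up equal-cost next hops and documentation-range addresses.
next_hops = ["next-hop-a", "next-hop-b", "next-hop-c"]
flows = [
    Flow("192.0.2.10", 12345, "198.51.100.53", dst_port, 17)
    for dst_port in range(33434, 33454)  # the 20 destination ports seen above
]

for flow in flows:
    print(flow.dst_port, "->", pick_next_hop(flow, next_hops))
```

The property that matters is that the same 5-tuple always maps to the same next hop, so keeping the tuple constant across TTLs keeps a flow on one path, while changing only the destination port may select a different one.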
Then we see a summary of the replies for each TTL in the given flow. Not all of them are interesting, so only a few are shown below (output line broken for readability):
1 192.168.43.1 (gateway), IP ID: 17503 RTT 7.657 ms ICMP (type=11, code=0) 'TTL expired in transit', NAT ID: 0, flow hash: 25516
This is the first hop, specifically the gateway. We have:
- the IP address that responded to our probe (192.168.43.1),
- its host name as resolved by DNS (gateway),
- the IP ID of the packet that this device received from us (17503). This is one of the new features added by Dublin Traceroute, and it is important to detect NATs,
- the Round-Trip Time in milliseconds (7.657),
- the ICMP type, code, and description (type 11, code 0, 'TTL expired in transit'),
- the NAT ID (0), related to the IP ID above,
- the flow hash (25516), a number used internally to represent a network path between two hosts.
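If you want to process these fields programmatically, a regular expression is enough for a quick experiment. The sketch below assumes that each hop is printed on a single line exactly in the format quoted in this post; the names HOP_RE and parse_hop are made up, and lines that do not match (such as a silent hop) simply yield None.

```python
import re

# Pattern written against the hop lines quoted in this post (an assumption:
# the real tool may wrap or format its output differently).
HOP_RE = re.compile(
    r"^(?P<ttl>\d+)\s+(?P<ip>\S+)\s+\((?P<name>[^)]*)\),\s+"
    r"IP ID:\s+(?P<ip_id>\d+)\s+RTT\s+(?P<rtt_ms>[\d.]+)\s+ms\s+"
    r"ICMP\s+\(type=(?P<icmp_type>\d+),\s+code=(?P<icmp_code>\d+)\)\s+"
    r"'(?P<description>[^']*)',\s+NAT ID:\s+(?P<nat_id>\d+)"
    r"(?P<nat_detected>\s+\(NAT detected\))?,\s+flow hash:\s+(?P<flow_hash>\d+)$"
)


def parse_hop(line):
    """Parse one hop line into a dict, or return None if it does not match."""
    match = HOP_RE.match(line.strip())
    if match is None:
        return None
    fields = match.groupdict()
    return {
        "ttl": int(fields["ttl"]),
        "ip": fields["ip"],
        "name": fields["name"],
        "ip_id": int(fields["ip_id"]),
        "rtt_ms": float(fields["rtt_ms"]),
        "icmp_type": int(fields["icmp_type"]),
        "icmp_code": int(fields["icmp_code"]),
        "description": fields["description"],
        "nat_id": int(fields["nat_id"]),
        "nat_detected": fields["nat_detected"] is not None,
        "flow_hash": int(fields["flow_hash"]),
    }


SAMPLE = (
    "1 192.168.43.1 (gateway), IP ID: 17503 RTT 7.657 ms ICMP (type=11, code=0) "
    "'TTL expired in transit', NAT ID: 0, flow hash: 25516"
)
hop = parse_hop(SAMPLE)
print(hop["ip"], hop["name"], hop["rtt_ms"], hop["nat_detected"])
```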
The silent hop
The second hop did not respond at all, so we just see an asterisk:
Things are getting interesting at hop 11:
11 172.16.101.1 (172.16.101.1), IP ID: 0 RTT 40.522 ms ICMP (type=11, code=0) 'TTL expired in transit', NAT ID: 42753 (NAT detected), flow hash: 25516
Here we see a new piece of information:
NAT ID: 42753 (NAT detected). What this means
is that Dublin Traceroute has detected a device 11 hops away that is
translating network addresses. NATs rewrite part of the IP header to replace at
least the source IP, and possibly more fields, and this condition can be
detected from the data contained in the response. How this is done is material for
another article, so we’ll skip it for now. But the source code is
available, so feel free to find out yourself :-)
One last note on this: Dublin Traceroute can detect multiple different NATs, so if a new address translation happens on the path it will be detected.
And eventually, the destination
18 8.8.8.8 (google-public-dns-a.google.com), IP ID: 39240 RTT 68.68 ms ICMP (type=3, code=3) 'Destination port unreachable', NAT ID: 42753, flow hash: 25516
The last hop is our destination,
8.8.8.8, which diligently responded to our
requests. To be fair, we were diligent too, and used the port ranges dedicated
to traceroute, i.e. 33434 and above.
Side note: 33434 seems to have been born as 2 ^ 15 (half of the port number space) + 666, or in other words traceroute is the tool of the beast.
Note the “Destination port unreachable” ICMP message, different from the ‘TTL expired in transit’ that we received so far. This means that we reached the target, and Dublin Traceroute will consider this the last hop.
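The distinction between the two kinds of reply boils down to the ICMP type and code shown in the output: type 11, code 0 for an intermediate hop, and type 3, code 3 for the destination. A minimal sketch of that decision follows; classify_reply and its return strings are hypothetical and are not taken from Dublin Traceroute's source.

```python
# ICMP values as they appear in the output above (standard ICMP for IPv4).
ICMP_DEST_UNREACHABLE = 3
ICMP_TIME_EXCEEDED = 11
CODE_TTL_EXPIRED_IN_TRANSIT = 0
CODE_PORT_UNREACHABLE = 3


def classify_reply(icmp_type, icmp_code):
    """Hypothetical helper: say what a reply tells us about the probe."""
    if (icmp_type, icmp_code) == (ICMP_TIME_EXCEEDED, CODE_TTL_EXPIRED_IN_TRANSIT):
        return "intermediate hop"      # a router on the path dropped the probe
    if (icmp_type, icmp_code) == (ICMP_DEST_UNREACHABLE, CODE_PORT_UNREACHABLE):
        return "destination reached"   # the target itself answered
    return "other"                     # anything else needs a closer look


print(classify_reply(11, 0))
print(classify_reply(3, 3))
```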
What if some intermediate device or the target did not respond?
Simple enough: we see an asterisk! Does this mean that the device is broken? Not necessarily, there are several possible causes. For example:
- well, yes, the device is broken
- the return path is asymmetric, and something is broken on the return path
- the device is overloaded and dropped our request
- the device is actively dropping some or all of our requests
This list is not exhaustive.
We will see in the future that it’s possible to detect all of these issues, with more or less effort. But let’s at least see what happened to our original traceroute.
If you look at the traceroute from the
previous post you will notice
that some hops are consistently not responding, while a couple of them
18.104.22.168) are not responding only to certain probes.
While it’s not possible to tell with certainty what’s happening, we can interpret it and make good (or bad) guesses. After all, traceroute is all about guessing what’s going on!
The totally unresponsive hops: my guess is that these hops are just configured not to respond with ICMP TTL-expired. Why? Maybe to save resources, since ICMP TTL-expired packets are processed in the slow path, i.e. in the CPU rather than in the faster ASICs, and for this reason they are very expensive to generate. Or maybe the network administrators for that device believe that it’s better to hide it for security reasons. I won’t judge :-)
The partially-responsive hops:
these are more interesting. Why did they respond to certain packets and ignore
others? Maybe we hit a rate limiter: after all, ICMP TTL-expired messages are expensive,
and the network administrators may have configured a rate limiter so that
a flood of ICMP TTL-expired messages cannot take down the network device. Rate limiters
normally use a counter that is shared among all the packets, so running
traceroute from a different IP probably won’t help. Remember when I said
that 100 packets per second could be too much? Well, this could be the case.
We can try and increase the inter-packet delay using the
--delay option of
dublin-traceroute, and the responses should appear again.
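To see why slowing down can help, here is a toy simulation of a rate limiter. Everything in it is an assumption: it models the limiter as a token bucket, and the rate of 10 replies per second and the burst of 5 are invented numbers, not measurements of any real device. Integer arithmetic (thousandths of a token) is used to avoid floating-point rounding surprises.

```python
def count_replies(num_probes, delay_ms, rate_per_s, burst):
    """Count how many probes a hypothetical token-bucket limiter answers.

    Tokens are tracked in thousandths so that everything stays an integer.
    """
    capacity = burst * 1000
    tokens = capacity                 # the bucket starts full
    refill = rate_per_s * delay_ms    # thousandths of a token gained per delay
    replies = 0
    for probe in range(num_probes):
        if probe > 0:
            tokens = min(capacity, tokens + refill)  # time passed since last probe
        if tokens >= 1000:            # a whole token is needed to send one reply
            tokens -= 1000
            replies += 1
    return replies


# 20 probes reaching the same hop, with two different inter-packet delays.
for delay_ms in (10, 100):
    answered = count_replies(num_probes=20, delay_ms=delay_ms, rate_per_s=10, burst=5)
    print(f"delay {delay_ms} ms: {answered} of 20 probes answered")
```

With these made-up numbers only part of the probes get a reply at 10 ms, while all of them do at 100 ms, which is the behaviour described above.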
Or maybe there is an actual problem on some paths, e.g. some faulty network device or cable. In that case we would see the problem consistently, and slowing down the packet rate won’t help. And obviously we would see that all the probes sent on that path stop getting responses after that point. What can we do? Well, wait until it’s fixed, or contact the network administrator :-)
In the next post I will show how the graphical representation of a multipath traceroute can be more immediate to understand, and can be useful to spot problems that would otherwise be hard to see. All of this with some real life example of course :-)