Feb 21, 2013

Last summer, I had some internet connectivity problems. Specifically, I had massive latency issues that affected my conversations on Skype and my efforts at online gaming, which are relatively pathetic under the best of circumstances. It was driving me up a wall and I couldn't figure it out. It hadn't occurred earlier with the same ISP, so I thought it was just a temporary issue with the network. However, the problem went on for weeks, at various hours of the day.

I contacted customer service at my ISP and was dismissed as being crazy. Their website's ping test tool showed me having a ping of around 40 ms and they couldn't see any problem on their end. According to them, the issues I was having were just with the remote sites. The fact that something like 30 different websites or services had this problem never fazed the tech support guy.

I was frustrated, but I couldn't blame them. As far as they could see, no problem existed. And it was one of those evil connection problems that only acts up some of the time on some of the packets. Even when I was having slowdowns, I could open a terminal and run "ping google.com" (or some other site) and it would come back with very reasonable times. Some of the time (2-3% of all packets), however, it would throw huge pings on the order of 300 to 700 ms. The tests that they were doing (send a couple of packets and take the mean) would never find the problem. I needed to collect a lot of pings over a reasonably long length of time to be sure of catching and characterizing the problem.
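To put rough numbers on why a short test misses this, here is a minimal Python sketch. It assumes each ping is independently slow with a fixed probability (2.5%, the middle of the 2-3% range above) and uses 40 ms and 500 ms as illustrative fast and slow values; the function names are made up for this example and real packet delays are probably not independent.

```python
def chance_of_seeing_slow_packet(slow_rate, n_pings):
    """Probability that at least one of n_pings pings is slow.

    Assumes every ping is slow independently with probability slow_rate.
    """
    return 1.0 - (1.0 - slow_rate) ** n_pings


def expected_mean_ms(fast_ms, slow_ms, slow_rate):
    """Long-run mean ping when a share slow_rate of packets is slow."""
    return (1.0 - slow_rate) * fast_ms + slow_rate * slow_ms


# Illustrative values only: 2.5% slow packets, 40 ms fast, 500 ms slow.
for n in (5, 10, 100, 1000):
    chance = chance_of_seeing_slow_packet(0.025, n)
    print(n, "pings ->", round(chance, 3))

print("long-run mean:", expected_mean_ms(40.0, 500.0, 0.025), "ms")
```

Under these assumptions a 10-ping test has only about a one in five chance of containing even a single slow packet, so most short tests report nothing unusual. Even when a slow packet does land in the sample, the mean hides how bad the tail is.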

I had done some research and it seems there is a transmission robustness option for DSL called interleaving that, more or less, queues up packets before sending them. This is known to increase latency. With traceroute, I was able to see that the problem appeared to be on my ISP's network and not on my LAN or the remote host. I grabbed 4 IPs off the traceroute output (my router, the "hop" to my ISP's network, the second hop on the ISP's network and then the remote host, which is my Linode VPS).

A quick Google search pulled up an implementation of ping in Python. I wrote a small collection of scripts to use this implementation of a ping tool and to repeatedly ping the selected IPs. I went a bit overboard and hit each IP 1000 times.
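The collection scripts themselves are not reproduced here, so the following is only a sketch of what such a loop could look like. The ping implementation is passed in as ping_fn (a hypothetical stand-in for whichever Python ping tool is used, assumed to return a round-trip time in ms or None on timeout). The column names targetIP and ping match the ones used in the R code below; the rest of the CSV layout is an assumption.

```python
import csv
import time


def collect_pings(targets, ping_fn, count=1000, pause_s=0.0):
    """Ping every target `count` times and return a list of row dicts.

    ping_fn(ip) is a placeholder for the real ping implementation; it is
    assumed to return the round-trip time in ms, or None on a timeout.
    Targets are visited round-robin so each one is sampled over the same
    stretch of time.
    """
    rows = []
    for _ in range(count):
        for ip in targets:
            rows.append({"targetIP": ip, "ping": ping_fn(ip)})
            if pause_s:
                time.sleep(pause_s)
    return rows


def write_csv(rows, path):
    """Write rows to a CSV file, storing timeouts as NA for R."""
    with open(path, "w", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=["targetIP", "ping"])
        writer.writeheader()
        for row in rows:
            ping = "NA" if row["ping"] is None else row["ping"]
            writer.writerow({"targetIP": row["targetIP"], "ping": ping})


def fake_ping(ip):
    """Stand-in for a real ping call so the sketch runs offline."""
    return 1.0


# 192.0.2.x addresses are documentation placeholders, not the real hops.
demo_rows = collect_pings(["192.0.2.1", "192.0.2.2"], fake_ping, count=3)
print(len(demo_rows), "rows collected")
```

Writing timeouts as NA means R reads them as missing values, which is why na.rm = TRUE is needed later on.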

I took the data collected from the ping test and loaded it into R. Sure enough, there was some funky stuff going on.

pingTimes <- importPingData("~/personalProjects/feelingPingy/hops3.csv")
pingTimes$prettyTargetIP <- ifelse(pingTimes$targetIP == "", "router", 
    ifelse(pingTimes$targetIP == "", "firstHop", ifelse(pingTimes$targetIP == 
        "", "secondHop", "remoteHost")))
# FYI if you are doing this yourself: the column is named targetIP and holds
# the xxx.xxx.xxx.xxx IP address by default
by(pingTimes$ping, pingTimes$prettyTargetIP, sd, na.rm = TRUE)
## pingTimes$prettyTargetIP: firstHop
## [1] 14.41
## -------------------------------------------------------- 
## pingTimes$prettyTargetIP: remoteHost
## [1] 8.255
## -------------------------------------------------------- 
## pingTimes$prettyTargetIP: router
## [1] 0.5125
## -------------------------------------------------------- 
## pingTimes$prettyTargetIP: secondHop
## [1] 16.98

We need the na.rm = TRUE flag because some of the attempts to ping the various IPs actually timed out (ping >= 2,000 ms). We can readily see that the variability increases as soon as you move off the LAN: the first hop (me to my ISP's network) has a standard deviation of 14.4 ms and the second hop has a standard deviation of 17.0 ms. Considering that a good ping is probably under 50 ms to the targeted IP, this isn't a good sign, especially since the spread goes up by so much as soon as traffic leaves the LAN. The packets are screwed out of the gate, so to speak. Let's look at this visually.

library(ggplot2)
ggplot(pingTimes) + geom_density(aes(x = ping, color = prettyTargetIP))

[Figure: density of ping times for each target (router, firstHop, secondHop, remoteHost)]

We can see the greater variance on the remote IPs here. More striking is that the distribution of ping times to the remote host is clearly bimodal (green). This suggests that there are two different processes generating this data: one gives a low ping, the other gives a higher ping. If we look at the two IPs tested between me and the targeted remote host, we see that the first hop seems to be giving the shape to both densities (the second hop's ping is the first hop's ping plus some marginal addition). However, they are all kind of hard to see because of the very high density for the pings on the LAN. Let's redo this looking only at the IPs that aren't local.

ggplot(pingTimes[pingTimes$prettyTargetIP != "router", ]) + geom_density(aes(x = ping, 
    color = prettyTargetIP))

[Figure: density of ping times for the non-local targets (firstHop, secondHop, remoteHost)]

Now that looks better. We can really see the bimodal, almost trimodal, nature of the ping times at the remote host (in green). We can also see the same shape in the densities for the first and second hops (on my ISP's network). Some packets leave right away, some wait a bit longer and some seem to wait forever to make the hop from my modem to the first remote node. We see this shape show up again in the second hop (since the second hop is an additive function of the first, this is expected). If the second hop were also slow or had multiple processes going on, or if the problem were at my VPS, the curves at each node would look different. The fairly constant shape suggests that there is a rate-limiting step that determines the distribution of ping times.
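The additive argument can be illustrated with a small simulation. This is a toy model, not the measured data: every number in it (mode locations, spreads, the 25% slow share, the cutoffs) is invented for illustration, and the function names are hypothetical. The first hop is drawn from a mix of a fast and a slow process, and everything after it adds a small, well-behaved delay.

```python
import random


def simulate_path(n, slow_share=0.25, seed=0):
    """Simulate first-hop and remote-host ping times for n packets.

    All parameters are made-up illustration values, not fitted to data.
    """
    rng = random.Random(seed)
    first_hop, remote = [], []
    for _ in range(n):
        # Two processes at the modem: most packets leave quickly, some queue.
        if rng.random() < slow_share:
            hop1 = rng.gauss(45.0, 3.0)
        else:
            hop1 = rng.gauss(15.0, 3.0)
        # The rest of the path adds a small delay on top of the first hop.
        rest = rng.gauss(20.0, 2.0)
        first_hop.append(hop1)
        remote.append(hop1 + rest)
    return first_hop, remote


def share_above(values, cutoff):
    """Fraction of values strictly greater than cutoff."""
    return sum(v > cutoff for v in values) / len(values)


first_hop, remote = simulate_path(10000)
# Cutoffs sit midway between the two modes at each node.
print("slow share at first hop:", share_above(first_hop, 30.0))
print("slow share at remote host:", share_above(remote, 50.0))
```

In this toy model the remote host shows the same two-humped shape as the first hop, just shifted to the right, with the same share of packets in the slow hump. A second problem further along the path would change that share or add humps of its own.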

The fact that the second and later hops all have the same shape as the first hop suggests that the rate-limiting step is the transfer of packets from my modem to my ISP's network. And the problem was real. The reason it wasn't showing up in their simple "take the mean of 10 pings" type of test is clear in the empirical CDF (ECDF).

ggplot(pingTimes[pingTimes$prettyTargetIP != "router", ]) + stat_ecdf(aes(x = ping, 
    color = prettyTargetIP))

[Figure: empirical CDF of ping times for the non-local targets (firstHop, secondHop, remoteHost)]

Not all the packets were affected; in fact, nearly half left without excessive delay. However, 25% took over 40 ms merely to move from my modem to the ISP's network. Given that 40 ms is a very long time for a single hop (and over half of what I would expect for a round trip time), the impact I was seeing on Skype and other places was real. Armed with the new data*, I was able to get my connection moved from interleaving to fastpath.
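For anyone who wants the numbers behind an ECDF rather than the picture, the share of packets over a threshold can be computed directly. The sketch below uses a handful of invented ping values (None marks a timeout, which is dropped as with na.rm = TRUE); the function names are hypothetical and the data is not from the actual test.

```python
from statistics import mean


def ecdf_at(values, x):
    """Empirical CDF at x: share of non-missing values that are <= x."""
    clean = [v for v in values if v is not None]
    return sum(v <= x for v in clean) / len(clean)


def share_over(values, threshold_ms):
    """Share of non-missing values strictly above threshold_ms."""
    return 1.0 - ecdf_at(values, threshold_ms)


# Invented first-hop pings in ms; None is a timed-out packet.
pings = [12, 14, 15, 41, 13, 55, None, 16, 14]
clean = [p for p in pings if p is not None]

print("mean ping:", mean(clean), "ms")
print("share over 40 ms:", share_over(pings, 40))
```

In this made-up sample the mean looks perfectly healthy even though a quarter of the packets take over 40 ms, which is the same pattern a mean-based test misses.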

I figured I would post this simple analysis and the tools I used in case they can ever help anyone in the future. I can't be the only one who has ever had this type of problem!

*If you ever have internet connection issues that aren't being fixed by over-the-phone support or even tech visits, going to the social media teams (see your ISP's Facebook page or check them out on DSLReports) typically brings faster and better results. Once I knew that the problem was real, it took 2 emails and 18 hours to get it fixed via my ISP's social media support people. This was after 2-3 weeks of dealing with phone support and getting nowhere.

Tags: internet Python R applied statistics


At 2013-06-06 10:55 JustSomeOne writes:

I like your post and how you've found the problem with your ISP. Some of the graphs don't seem to be quite right, though. They're about "rent", "bedrooms", "long" and "lat"...

At 2013-06-06 11:49 Louis writes:

Cool post. Wonder if my ISP will even understand this. Your first chart is probably not the right one? Rooms/rent?

At 2013-06-06 12:55 Michael Spencer writes:

Can you change the first two graphs so they reflect what you talk about in the post?

At 2013-06-06 13:54 Trey writes:

Did you mean to have the bedrooms x rent plot up there? Great post, otherwise.

At 2013-06-06 13:16 Larry writes:

Maybe I'm missing something but the first two graphs don't seem to match your blog content.

At 2013-06-06 11:53 Louis writes:

Actually the second chart also doesn't make sense.

At 2013-06-06 11:47 Paul Courtney writes:

As one who has suffered at the hands of latency, I am glad to see what you have put together. I will try this out at my own home office and see what I see. I am curious though about the labels you have for the first two graphs: the first is rent vs. bedrooms and the second is lat(itude) vs. long(itude). Are you sure you have the right graphs in the post?

At 2013-06-06 14:29 Jacob writes:

Yes, it looks like there was a mistake in the HTML when I converted the knitr output to something that would work with R Bloggers. The graphs have been corrected. Thanks to everyone who noticed!

At 2013-06-09 01:21 efrique writes:

You say things like this: "the second hop has a variance of 17.0 ms". This doesn't make sense - variance is in units-squared, not original units. Do you mean standard deviation?
