Instrumental Variables Simulation

Instrumental Variables

Instrumental variables are an incredibly powerful tool for dealing with unobserved heterogeneity within the context of regression, but the language used to define them is mind bending. Typically, you hear something along the lines of “an instrumental variable is a variable that is correlated with x but uncorrelated with the outcome except through x.” At this point examples are listed — taxes on smoking likely affect health only through their effect on smoking — or the author drops right into the math stats.

I like math stats (when I am not getting a grade for it, at least!) and will work through it. But at some point, I want to play with a simulation of the process. However, books like Mostly Harmless Econometrics just throw a bunch of empirical studies at you. Sure, the OLS and IV estimates are different… but in an empirical setting I don't know for sure which one is right. Simulations are great for this (the lack of simulations is one of my biggest issues with that book. Right after the use of Stata.).

I turned to Google and did several searches, and the only simple simulation I could find was done in Stata. At least it was a simulation… but I wanted more. Since I couldn't find any examples using R, I played around until I got it working and then wrote up the simulation in case it helps anyone.

Overview

Suppose that you have a continuous variable \( y \) with the known mean response function

\[ E(y) = \beta_0 + \beta_1 x + \beta_2 c \]

and further that \( x \) and \( c \) are correlated with each other. If you knew \( c \), estimating \( \beta_1 \) would be easy by simply running lm(y ~ x + c). You would complete the regression, throw together a few diagnostics and call it a day.

But often we don't observe \( c \) in our data. \( c \) could be any number of things, such as treatment practices at a hospital or unmeasured differences between patients, but it is in the direct causal path of \( y \) and you don't observe it. If you instead run the regression lm(y ~ x), you will not get an accurate estimate of \( \beta_1 \). If you think about the model fit in that call, it looks like

\[ y_i = \beta_0 + \beta_1 x_i + \epsilon_i \] but in this case \[ \epsilon_i = f(c_i, \eta_i) \] where \( \eta_i \) is white noise centered on zero. As long as that function is linear, the expectation of \( \epsilon \) will be some multiple of \( c \). The result is that \( x \) and \( \epsilon_i \) are correlated, and the estimates of \( \beta_1 \) will necessarily be wrong. Interested parties can read more here and see just how wrong it will be.
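Concretely, if we write the linear part of \( f(c) \) with coefficient \( \beta_2 \), the naive OLS slope suffers from the standard omitted variable bias (a textbook result, stated here without derivation):

\[ \mathrm{plim} \; \hat{\beta}_1 = \beta_1 + \beta_2 \frac{\mathrm{Cov}(x, c)}{\mathrm{Var}(x)} \]

so the stronger the correlation between \( x \) and \( c \), the worse the bias.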

Suppose there is a third variable \( z \) such that \( x \) can be expressed as some function of \( z \), and \( z \) has no effect on \( y \) except through \( x \) (therefore \( z \) is independent of any other variable that affects \( y \)). Then we have the equations

\[ E(x) = \alpha_0 + \alpha_1 x^* + \alpha_2 z \] and \[ E(y) = \beta_0 + \beta_1 x + f(c) \]

where \( x^* \) is some latent part of \( x \) and \( c \) is still unobserved. Since we don't know what \( x^* \) is, fitting the first equation gives a model with two regression coefficients: the intercept, which is \( \alpha_0 + \alpha_1 E(x^*) \) and is not correlated with \( f(c) \), and the coefficient on \( z \), which is independent of \( f(c) \) by assumption.

If we take the fitted values of \( x \) from the first equation and plug them into the second equation, the \( x \) term is now independent of \( f(c) \). This independence means that we can produce consistent estimates of \( \beta_1 \) by replacing \( x \) with its fitted values from equation one. When the estimator used is OLS, this method is called two-stage least squares (2SLS).

Simulations

We need to start by generating values of \( x \) and \( c \) that are correlated. The easiest way to do this is with the function mvrnorm() from MASS. Let's start by generating a 1000x2 random matrix with \( \rho = 0.5 \).

library(MASS)
# we are really generating x* and c and using a common variance
xStarAndC <- mvrnorm(1000, c(20, 15), matrix(c(1, 0.5, 0.5, 1), 2, 2))
xStar <- xStarAndC[, 1]
c <- xStarAndC[, 2]

and let's generate \( z \) and construct the observed \( x \)

z <- rnorm(1000)
x <- xStar + z

And now generate the response variable

# true coefficients of 1 make it easy to see how 'wrong' an estimator is;
# also toss some noise on y
y <- 1 + x + c + rnorm(1000, 0, 0.5)

Let's check to make sure that everything has worked so far:

cor(x, c)
## [1] 0.3027
cor(z, c)
## [1] -0.02993

We have a moderate correlation between \( x \) and \( c \) and it appears that \( z \) and \( c \) are uncorrelated. So far, so good.

Let's estimate the full model (the model that we would estimate if we knew \( c \))

lm(y ~ x + c)
## 
## Call:
## lm(formula = y ~ x + c)
## 
## Coefficients:
## (Intercept)            x            c  
##       1.716        0.987        0.970

Not bad: 0.987 is basically 1. Now what would happen if we ignored the fact that \( c \) is unknown and estimated anyway?

lm(y ~ x)
## 
## Call:
## lm(formula = y ~ x)
## 
## Coefficients:
## (Intercept)            x  
##        12.0          1.2

Yikes! The complete model definitely works, but when we ignore the heterogeneity caused by \( c \) we get terrible estimates. The estimate is 120% of the true effect of \( x \). That isn't good.

What happens when we break the correlation between the “x part” and the unobserved \( c \) by using the instrument and 2SLS?

xHat <- lm(x ~ z)$fitted.values
lm(y ~ xHat)
## 
## Call:
## lm(formula = y ~ xHat)
## 
## Coefficients:
## (Intercept)         xHat  
##      16.895        0.954

That is a pretty good estimate, only off by 4.6% of the true value of 1. Pretty nifty that we are able to accurately estimate the effect of \( x \) using the instrument despite the unobserved value of \( c \).

It's worth noting that if you want to do this in R, you probably do not want to do 2SLS by hand as I have done here. The parameter estimates are correct but the standard errors are not: in the manual version the second-stage residuals are computed using the fitted values of \( x \) rather than the observed \( x \). The package AER includes a function ivreg() which will take care of this for you.
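For reference, a minimal sketch of what that looks like with the variables simulated above (assuming AER is installed):

library(AER)
# same point estimate as the manual two-stage fit, but with correct standard errors
ivFit <- ivreg(y ~ x | z)
summary(ivFit)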

If you look at the equations, two things probably jump out at you. First, if \( z \) does a poor job predicting \( x \) (i.e., \( z \) is a weak instrument), then it is unlikely that IV and 2SLS will give you meaningful results. In fact, the IV estimates can be even more inaccurate than the naive OLS estimator. A rule of thumb is that the first stage (i.e., the regression of x on the instrument and any covariates) should have an F statistic of at least 10.
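In this simulation the first stage is just lm(x ~ z), so the rule-of-thumb check is a one-liner (using the variables generated above):

# the first-stage F statistic should be comfortably above 10
summary(lm(x ~ z))$fstatistic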

Second, if \( x \) and \( c \) are only weakly correlated, the OLS and IV estimates should converge. As \( x \) and \( c \) become more correlated, however, the estimates move apart quickly and with even mild correlation (\( \rho \sim 0.2 \)) there can be significant differences in the estimates.

The dynamic graph below allows you to explore the effects of altering these parameters on the resulting OLS and IV estimates. The solid black line is the true value, the green line is the estimate using the full model (if you knew \( c \)), red is the IV estimate and blue is using the OLS estimator.
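For a rough, non-interactive version of the same comparison, here is a sketch (reusing mvrnorm() from MASS, loaded above) that re-runs the simulation for a few values of \( \rho \) and reports the full-model, naive OLS and IV estimates of \( \beta_1 \):

simOnce <- function(rho, n = 1000) {
    xStarAndC <- mvrnorm(n, c(20, 15), matrix(c(1, rho, rho, 1), 2, 2))
    xStar <- xStarAndC[, 1]
    cUnobs <- xStarAndC[, 2]
    z <- rnorm(n)
    x <- xStar + z
    y <- 1 + x + cUnobs + rnorm(n, 0, 0.5)
    xHat <- lm(x ~ z)$fitted.values
    c(full = coef(lm(y ~ x + cUnobs))["x"],
      naiveOLS = coef(lm(y ~ x))["x"],
      iv = coef(lm(y ~ xHat))["xHat"])
}
# columns correspond to rho = 0, 0.2, 0.5, 0.8
round(sapply(c(0, 0.2, 0.5, 0.8), simOnce), 3)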

Introducing Hyde

Hyde is a brazen two-column Jekyll theme that pairs a prominent sidebar with uncomplicated content. It's based on Poole, the Jekyll butler.

Built on Poole

Poole is the Jekyll Butler, serving as an upstanding and effective foundation for Jekyll themes by @mdo. Poole, and every theme built on it (like Hyde here) includes the following:

  • Complete Jekyll setup included (layouts, config, 404, RSS feed, posts, and example page)
  • Mobile friendly design and development
  • Easily scalable text and component sizing with rem units in the CSS
  • Support for a wide gamut of HTML elements
  • Related posts (time-based, because Jekyll) below each post
  • Syntax highlighting, courtesy Pygments (the Python-based code snippet highlighter)

Hyde features

In addition to the features of Poole, Hyde adds the following:

  • Sidebar includes support for textual modules and a dynamically generated navigation with active link support
  • Two orientations for content and sidebar, default (left sidebar) and reverse (right sidebar), available via <body> classes
  • Eight optional color schemes, available via <body> classes

Head to the readme to learn more.

Browser support

Hyde is by preference a forward-thinking project. In addition to the latest versions of Chrome, Safari (mobile and desktop), and Firefox, it is only compatible with Internet Explorer 9 and above.

Download

Hyde is developed on and hosted with GitHub. Head to the GitHub repository for downloads, bug reports, and feature requests.

Thanks!

Penalizing P Values


Ioannidis' paper suggesting that most published results in medical research are not true is now high profile enough that even my dad, an artist who wouldn't know a test statistic if it hit him in the face, knows about it. It has even shown up recently in the Economist as a cover article and plays directly into the “decline effect” discussed in a cover story in the New Yorker from 2010. Something is seriously wrong with science if only a small fraction of papers can actually be replicated.

But, placed in the context of the “decline effect,” this result makes sense. And it is a fundamental aspect and potential flaw in the way frequentist inference treats hypotheses.

Using a wild example, suppose I return from a walk in the woods and report an encounter with bigfoot. Now, while it is possible that bigfoot is real, it seems unlikely. But I have some blurry pictures and video of something moving off in the bush. I claim that this is evidence that bigfoot is real.

I show you my evidence of bigfoot and tell you about my encounter. You know me and know that I am fairly sane and always wear my glasses. You think it is unlikely that I would make this up or mistake a deer for bigfoot. You think there is less than a 5% chance that I would produce evidence this convincing or more so if bigfoot were not real. Therefore, the evidence would suggest that you reject the null that bigfoot is not real.

Hopefully, you don't think that is reasonable. But that is exactly how frequentist inference treats evidence for or against the null. The p value is simply \( P(\theta \geq \hat{\theta} | H_0) \). The claim that bigfoot is real is given as much credibility as the claim that smoking causes cancer (RA Fisher might think that is reasonable, but the rest of us have reason for concern). We would probably conclude that it was much more likely that I saw a deer or a hoax than that I saw an actual bigfoot.

This becomes a problem for a few reasons:

  1. We notice things more when they are unexpected
  2. We report things more when they are unexpected
  3. Many things that are unexpected are unexpected for a reason

This problem is especially serious when people “throw statistics” at data with the goal of making causal inference without using a priori theory as a guide. They find that there is a relationship between X and Y that is significant at \( \alpha = 0.05 \) and publish.

The field of Bayesian statistics provides a radically different form of inference that can potentially be used to address this question, but a simple back-of-the-envelope penalty term may work just as well. Consider the simple case of Bayes' theorem,

\[ P(A | B) = \frac{P(B | A) P(A)}{P(B | A) P(A) + P(B | A^c) P(A^c)} \]

Take \( P(B | A) \) to be the same as the p value and \( P(A) \) to be our a priori estimate of how likely it is that the null hypothesis is true. What is the probability of rejecting the null when the null is not true? That is simply the power of the test with the given parameters. Suppose we set \( P(B | A) \) to some constant value (e.g., 0.05) and label anything with \( p \) less than that value as significant and anything greater as non-significant, i.e., \( P(B | A) = \alpha \). We can then calculate the rate of “false positive” results for that value of \( \alpha \) and power with

\[ P(H_0 | \hat{\theta}) = \frac{P(\theta \geq \hat{\theta} | H_0) P(H_0)}{P(\theta \geq \hat{\theta} | H_0) P(H_0) + P(\theta \geq \hat{\theta} | H_0^c) (1 - P(H_0))} \]
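To make the arithmetic concrete, here is a back-of-the-envelope version in R (a sketch of the formula above; falsePositiveRate is just an illustrative name):

# probability the null is true given a "significant" result
falsePositiveRate <- function(alpha, power, priorNull) {
    alpha * priorNull / (alpha * priorNull + power * (1 - priorNull))
}
falsePositiveRate(0.05, 0.80, 0.5)  # well powered, even odds on the null: ~0.06
falsePositiveRate(0.05, 0.35, 0.5)  # low power: roughly doubles to ~0.12
falsePositiveRate(0.05, 0.80, 0.9)  # unlikely hypotheses: ~0.36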

I wanted to get a feel for what this would look like and how these different parameters would interact. Also I needed an excuse to learn Shiny. You can see how this comes together and play with the values in the dynamic graph below.

I would encourage you to play around with it and see how the different values affect the probability that the alternative is true. You can see that in the default case, where we place equal weight on the null being true or false and have well-powered studies, we do pretty well for ourselves. But as soon as you lower the power to a plausible 0.35, the probability of the results being spurious doubles. If you set the power back at 0.80 but set the probability of the null being true at 90%, as Ioannidis suggests, the probability of a false positive at \( \alpha = 0.05 \) is now roughly 35%! If you combine the low power and the unlikeliness of the tested claims, the probability of false conclusions is well over 50% using a standard \( \alpha \).

As exciting as it would be to be known as the guy who found bigfoot, odds are that it was just some high schoolers out to play games with you. The null should be treated differently when we are testing surprising and unexpected claims. Even a simple sanity check like the one described here may reduce the surprisingly and unsustainably large number of later falsified or unreproduced findings. It certainly explains one process by which they may occur.

TV Ratings Myths

TV Show Cancellations: Myths and Models

TV shows are amazing ways to waste time and, on occasion, the story is so good that you actually start to care. The problem is that some shows get cancelled before they jump the shark. Classic examples are shows like Firefly or Arrested Development. With the increasing serialization of TV shows, having a show cancelled early means the story will potentially be unresolved or have a rushed conclusion.

As a result, a cottage industry of decision rules for predicting cancellation has emerged over the past few years. A lot of rules have come up such as “If it is good and it is on Fox, they will cancel it” or “If it airs on Friday, it will not return for next season.” But how accurate are these rules?

The most sophisticated method of tracking viewership I have seen is the Renew/Cancel Index by TV By The Numbers. Their insight is that a network is unlikely to cancel many of its shows in a given season. New series are typically just as costly to make but are not a known quantity, and many shows are cancelled during the course of their first season. So for every \( S \) new series that a network wants on the air, it has to order roughly \( cS \) series, where \( c > 1 \). As such, a network is loath to cancel too many shows due to the added cost of a new project and a possible replacement for that new project. Like the old joke, you don't have to outrun the bear, you just have to outrun the slowest guy in the group.

The Renew/Cancel Index produced by TV By The Numbers takes this to heart. Instead of just looking at the number of viewers or share, they divide the number of viewers of the \( i^{th} \) show on network \( N \) by the average number of viewers of all shows on \( N \) during that week. Now it doesn't matter that a show that would be a good performer for NBC would be a dud on FOX, because the values are normalized to each network's mean.
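In base R, the normalization looks something like this sketch (the ratings data frame and its column names are hypothetical, just to illustrate the idea; this is not their actual code):

# index = viewers of a show / mean viewers of all shows on the same network that week
ratings$index <- ratings$viewers /
    ave(ratings$viewers, ratings$network, ratings$week, FUN = mean)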

If this index holds up as accurate, it provides an interesting way to test those two major “TV Myths.” From the TV By The Numbers website, I pulled 303 renewal decisions for the networks ABC, CBS, FOX and NBC from 2009-2010 to the 2012-2013 seasons.

Data Overview

When we look at the density of the index values split by network, we see that this normalization makes them roughly equal. Each network has some outstanding shows with index values over 1.5 and all have a few duds around 0.5.

ggplot(bigFour, aes(x = index, color = network)) + geom_density() + scale_x_continuous("Renew/Cancel Index")

bigFour$adjusted <- bigFour$index + bigFour$friday * 0.2
ggplot(bigFour, aes(x = adjusted, color = network)) + stat_ecdf() + geom_vline(aes(xintercept = 0.75)) + 
    geom_vline(aes(xintercept = 0.9)) + scale_x_continuous("Adjusted Renewal/Cancel Index") + 
    scale_y_continuous("Cumulative Density")

The “slow runners” on each network are also made clear. The people at TV By The Numbers use the following breaks for their predictions:

  • Index values over 0.90: renewal
  • Index values between 0.75 and 0.90: toss-ups
  • Index values under 0.75: will probably lead to a cancellation

Shows that air on Fridays are given a slight bump of about 0.2 index units to account for the Friday Night Death Slot. These values are the vertical lines in the ecdf plot.

So how well does the index work?

bigFour$adjusted <- bigFour$index + bigFour$friday * 0.2
ggplot(bigFour, aes(x = adjusted, y = status)) + geom_point() + geom_vline(aes(xintercept = 0.75)) + 
    geom_vline(aes(xintercept = 0.9)) + scale_x_continuous("Adjusted Renewal/Cancel Index") + 
    scale_y_continuous("Renewal Status") + geom_smooth(method = "loess", alpha = 0.25)

It captured the expected sigmoidal shape of the response curve! That is something! Looking at their suggested breaks, they nicely land at about 30% and about 70% probability of renewal. Not too shabby for simple rules. Now, if we accept this index as a valid measure of cancellation risk, let's ask the hard questions.
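A quick way to check those break points against the data is to compute the empirical renewal rate within each band (a sketch, assuming status is coded 0/1 in the bigFour data frame used above):

bigFour$band <- cut(bigFour$adjusted, breaks = c(-Inf, 0.75, 0.9, Inf),
    labels = c("likely cancel", "toss-up", "likely renew"))
by(bigFour$status, bigFour$band, mean, na.rm = TRUE)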

Mythbusting: Does FOX Hate TV?

The cancellations of Firefly and Arrested Development in the early 2000s led to the idea that FOX cancels everything good. Is that true? Does FOX really hate TV? Or, more accurately, does FOX have a higher requirement for show performance than the other networks? Let's test that with a model. Let \( Y_{i, N} \) be the renewal status of the \( i^{th} \) show on network \( N \) and \( I_{i, N} \) be the renew/cancel index for the \( i^{th} \) show. I looked at three models: one without a network effect (M1), a model with an intercept that varies by network (M2) and a model with both variable intercept and slope parameters (M3).

bigFour$reference <- ifelse(bigFour$network == "FOX", 1, ifelse(bigFour$network == 
    "ABC", 2, ifelse(bigFour$network == "CBS", 3, 4)))
bigFour$network <- reorder(bigFour$network, bigFour$reference)
m1 <- glm(status ~ adjusted, data = bigFour)
m2 <- glm(status ~ adjusted + network, data = bigFour)
m3 <- glm(status ~ adjusted + network + adjusted * network, data = bigFour)
summary(m1)
## 
## Call:
## glm(formula = status ~ adjusted, data = bigFour)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.1782  -0.3110  -0.0048   0.3421   0.8238  
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -0.3152     0.0673   -4.68  4.3e-06 ***
## adjusted      0.9635     0.0673   14.31  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 0.1426)
## 
##     Null deviance: 70.833  on 293  degrees of freedom
## Residual deviance: 41.629  on 292  degrees of freedom
##   (9 observations deleted due to missingness)
## AIC: 265.6
## 
## Number of Fisher Scoring iterations: 2
summary(m2)
## 
## Call:
## glm(formula = status ~ adjusted + network, data = bigFour)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.0951  -0.3131  -0.0147   0.3393   0.8944  
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.26429    0.07965   -3.32    0.001 ** 
## adjusted     0.95144    0.06759   14.08   <2e-16 ***
## networkABC  -0.05954    0.06359   -0.94    0.350    
## networkCBS   0.00875    0.06236    0.14    0.888    
## networkNBC  -0.11536    0.06578   -1.75    0.081 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 0.1416)
## 
##     Null deviance: 70.833  on 293  degrees of freedom
## Residual deviance: 40.919  on 289  degrees of freedom
##   (9 observations deleted due to missingness)
## AIC: 266.6
## 
## Number of Fisher Scoring iterations: 2
summary(m3)
## 
## Call:
## glm(formula = status ~ adjusted + network + adjusted * network, 
##     data = bigFour)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.1760  -0.3031  -0.0172   0.3297   0.9497  
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          -0.2709     0.1477   -1.83    0.068 .  
## adjusted              0.9584     0.1472    6.51  3.4e-10 ***
## networkABC           -0.0740     0.1876   -0.39    0.693    
## networkCBS            0.1497     0.2016    0.74    0.459    
## networkNBC           -0.2308     0.2083   -1.11    0.269    
## adjusted:networkABC   0.0167     0.1901    0.09    0.930    
## adjusted:networkCBS  -0.1415     0.1972   -0.72    0.474    
## adjusted:networkNBC   0.1240     0.2099    0.59    0.555    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 0.1421)
## 
##     Null deviance: 70.833  on 293  degrees of freedom
## Residual deviance: 40.655  on 286  degrees of freedom
##   (9 observations deleted due to missingness)
## AIC: 270.7
## 
## Number of Fisher Scoring iterations: 2

We see in all three models that the show's Friday-adjusted Renew/Cancel Index is a powerful predictor (OR of around 2.75!) but that the network effects were not significant.
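As an aside, the calls above use glm()'s default gaussian family, so they are linear probability models; a logistic fit, which is where an odds ratio naturally comes from, would look something like this sketch:

m1Logit <- glm(status ~ adjusted, data = bigFour, family = binomial)
exp(coef(m1Logit)["adjusted"])  # odds ratio for a one-unit change in the index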

Looking at this visually…

bigFour$fox <- ifelse(bigFour$network == "FOX", "FOX", "Others")
ggplot(bigFour, aes(x = adjusted, y = status, lty = fox)) + geom_smooth(alpha = 0.25, 
    method = "loess") + scale_x_continuous("Adjusted Index") + scale_y_continuous("Renewal Probability")

This is interesting. FOX is slightly less likely to cancel low-performing shows (see Fringe) than other networks. However, FOX is more likely to cull shows with middling performance. This might be the result of shows like Fringe having favorable economics (syndication rights) that lead to higher renewal rates for FOX at the lower index values. However, it is fairly clear that FOX's reputation as a killer of all that is good on TV is (at least since 2010) unfounded.

Mythbusting: Friday Night Death Slot

One of the most interesting questions regarding the Friday Night Death slot is whether airing on Friday night kills the shows (due to the lack of viewers on Friday evening) or whether being moved to Friday night is the result of a lack of viewership on other nights. That is an interesting question but can't be addressed by the current form of the data I have. I can, however, answer whether Friday night is actually a death slot or a safe harbor.

Because Friday nights are going to have lower viewer numbers regardless of what show airs and because the networks want to have programming on Friday night, shows airing on Friday night might be renewed at viewership numbers that would have resulted in cancellation with extreme prejudice on any other night. This is because the cost of making a new series, and the midseason replacement in case the new project is a flop, is no cheaper and often more expensive than airing the known quantity.

m4 <- glm(status ~ index + friday, data = bigFour)
m4
## 
## Call:  glm(formula = status ~ index + friday, data = bigFour)
## 
## Coefficients:
## (Intercept)        index       friday  
##      -0.300        0.953        0.134  
## 
## Degrees of Freedom: 293 Total (i.e. Null);  291 Residual
##   (9 observations deleted due to missingness)
## Null Deviance:       70.8 
## Residual Deviance: 41.5  AIC: 267

Looking at the two models, shows on Friday don't get a huge break relative to shows airing on other days of the week (about a 0.14 boost compared to the suggested 0.2 used by TV By The Numbers). Here is the same thing visually:

ggplot(bigFour, aes(x = index, y = status, color = as.factor(friday))) + geom_smooth(span = 0.75, 
    method = "loess")

Shows on Friday all have lower viewer counts but are more likely to be renewed at those lower counts than shows on other nights. In this sense, Friday night time slots may offer a chance for shows with cult followings, like Fringe or Chuck, to find an economically viable timeslot and audience. At the very least, the networks do not hold Friday shows to Thursday standards.

Hopefully this year I won't have to worry about getting my six seasons and a movie but if I do, I can rest confident knowing that the network doesn't matter and Friday offers a refuge for cult shows.

Fixing My Internet With R and Python

Last summer, I had some internet connectivity problems. Specifically, I would have massive latency issues that affected my conversations on Skype and my (relatively pathetic, even under the best of circumstances) efforts at online gaming. It was driving me up a wall and I couldn't figure it out. It hadn't occurred earlier with the same ISP, so I thought it was just a temporary issue with the network. However, the problem went on for weeks at various hours of the day.

I contacted customer service at my ISP and was dismissed as being crazy. Their website's ping test tool showed me having a ping of around 40 ms and they couldn't see any problem on their end; the issues I was having were just with the remote sites. The fact that something like 30 different websites or services had this problem never fazed the tech support guy.

I was frustrated, but I couldn't blame them. As far as they could see, no problem existed. And it was one of those evil connection problems that only acts up some of the time on some of the packets. Even when I was having slowdowns, I could open a terminal and run ping google.com or some other site and it would come back with very reasonable times. Some of the time (2-3% of all packets), however, it would throw huge pings on the order of 300 to 700 ms. The tests they were doing (send a couple of packets and take the mean) would never find the problem. I needed to collect a lot of pings over a reasonably long stretch of time to be sure of catching and characterizing the problem.

I had done some research, and it seems there is a transmission robustness option for DSL called interleaving that, more or less, queues up packets before sending them. This is known to increase latency. With traceroute, I was able to see that the problem appeared to be on my ISP's network and not a problem on my LAN or with the remote host. I grabbed 4 IPs off the traceroute (my router, the “hop” to my ISP's network, the second hop on the ISP's network and then the remote host, which is my Linode VPS).

A quick Google search pulled up an implementation of ping in Python. I wrote a small collection of scripts around this implementation to repeatedly ping the selected IPs. I went a bit overboard and hit each IP 1000 times.
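If you would rather stay in R for the collection step, something like this sketch works on a Unix-like system with the ping command on the path (the parsing below is tied to that output format and is only illustrative):

# ping an IP once and return the round-trip time in ms (NA on timeout)
pingOnce <- function(ip) {
    out <- system(paste("ping -c 1", ip), intern = TRUE)
    timeLine <- grep("time=", out, value = TRUE)
    if (length(timeLine) == 0) return(NA)
    as.numeric(sub(".*time=([0-9.]+) ms.*", "\\1", timeLine[1]))
}
pingOnce("192.168.1.1")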

I took the data collected from the ping test and loaded it into R. Sure enough, there was some funky stuff going on.

source("~/personalProjects/feelingPingy/importPingData.R")
pingTimes <- importPingData("~/personalProjects/feelingPingy/hops3.csv")
pingTimes$prettyTargetIP <- ifelse(pingTimes$targetIP == "192.168.1.1", "router", 
    ifelse(pingTimes$targetIP == "209.181.206.17", "firstHop", ifelse(pingTimes$targetIP == 
        "209.181.211.129", "secondHop", "remoteHost")))
# FYI if you are doing this yourself: the column is named targetIP and holds
# the xxx.xxx.xxx.xxx IP address by default
by(pingTimes$ping, pingTimes$prettyTargetIP, sd, na.rm = TRUE)
## pingTimes$prettyTargetIP: firstHop
## [1] 14.41
## -------------------------------------------------------- 
## pingTimes$prettyTargetIP: remoteHost
## [1] 8.255
## -------------------------------------------------------- 
## pingTimes$prettyTargetIP: router
## [1] 0.5125
## -------------------------------------------------------- 
## pingTimes$prettyTargetIP: secondHop
## [1] 16.98

We need the na.rm = TRUE flag because some of the attempts to ping the various IPs actually timed out (ping >= 2,000 ms). We can readily see that the variability increases as you move off the LAN: the first hop (me to my ISP's network) has a standard deviation of 14.4 ms and the second hop has a standard deviation of 17.0 ms. Considering that a good round trip to the targeted IP is probably under 50 ms, standard deviations of that size are not encouraging, especially since the variability jumps so much as soon as the packets leave the LAN. The packets are screwed out of the gate, so to speak. Let's look at this visually.

library(ggplot2)
ggplot(pingTimes) + geom_density(aes(x = ping, color = prettyTargetIP))


We can see the greater variance for the remote IPs here. More striking is that the distribution of ping times to the remote host is clearly bimodal (green). This suggests that there are two different processes generating the data: one gives a low ping, the other gives a higher ping. If we look at the two IPs tested between me and the targeted remote host, we see that the first hop seems to be giving the shape to both densities (the second hop is a function of the first hop's ping plus some marginal addition). However, they are all kind of hard to see because of the very high density for the pings on the LAN. Let's redo this looking only at the IPs that aren't local.

ggplot(pingTimes[pingTimes$prettyTargetIP != "router", ]) + geom_density(aes(x = ping, 
    color = prettyTargetIP))


Now that looks better. We can really see the bimodal, almost trimodal, nature of the ping times at the remote host (in green). We can also see that this shape is clear in the densities for the first and second hops (on my ISP's network). Some packets leave right away, some wait a bit longer and some seem to wait forever to make the hop from my modem to the first remote node. We see this shape show up again in the second hop (since the second hop is an additive function of the first, this is expected). If the second hop were also slow, or had multiple processes going on, or if the problem were at my VPS, the curves at each node would look different. The fairly constant shape suggests that there is a rate-limiting step that determines the distribution of ping times.

The fact that the second and later hops all have the same shape as the first hop suggests that the rate-limiting step is the transfer of packets from my modem to my ISP's network. And the problem was real. The reason it wasn't showing up on their simple mean-with-an-n-of-10 type tests is clear in the ecdf.

ggplot(pingTimes[pingTimes$prettyTargetIP != "router", ]) + stat_ecdf(aes(x = ping, 
    color = prettyTargetIP))


Not all the packets were affected; in fact, nearly half left without excessive delay. However, 25% took over 40 ms merely to move from my modem to the ISP's network. Given that 40 ms is a very long time for a single hop (and over half of what I would expect for a round trip), the impact I was seeing on Skype and elsewhere was real. Armed with the new data*, I was able to get my connection moved from interleaving to fastpath.
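To put numbers on that tail behaviour, per-hop quantiles tell the same story as the ecdf (same pingTimes data frame as above):

# median, upper quartile and 95th percentile of ping times at each hop
by(pingTimes$ping, pingTimes$prettyTargetIP, quantile,
    probs = c(0.5, 0.75, 0.95), na.rm = TRUE)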

I figured I would post this simple analysis and the tools I used in case they can ever help anyone in the future. I can't be the only one who has ever had this type of problem!

*If you ever have internet connection issues that aren't being fixed by the over-the-phone support or even tech visits, going to the social media teams (see your ISP's Facebook page or check them out on DSLReports) typically brings faster and better results. Once I knew that the problem was real, it took 2 emails and 18 hours to get it fixed via my ISP's social media support people. This is after 2-3 weeks of dealing with phone support and getting nowhere.
