I recently discovered the R package `tidytext` and fell in love with it. It combines the "tidy" ecosystem, which I'm very familiar and comfortable with, with natural language processing (something that has been more challenging for me, not least because it rarely is tidy). I loved playing with the package and modeling how my language and sentiments varied across my thesis. The heavy use of Jane Austen in the `tidytext` examples certainly didn't hurt either.
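For a taste of the workflow (a minimal sketch, not my thesis analysis, using `janeaustenr` for a text everyone has):

```r
# A minimal tidy sentiment sketch: tokenize, join a lexicon, count.
library(dplyr)
library(tidytext)
library(janeaustenr)

austen_books() %>%
  filter(book == "Pride & Prejudice") %>%
  unnest_tokens(word, text) %>%                       # one word per row
  inner_join(get_sentiments("bing"), by = "word") %>% # label each word
  count(sentiment, sort = TRUE)                       # tally the sentiments
```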
Cross-validation is a useful approach for estimating out-of-sample error. The `modelr` package has made estimating models with cross-validation much easier, and the `resample` object type is vital for cross-validation with large data in R.
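The basic pattern looks something like this (a sketch on `mtcars`, not any particular analysis); note that `train` and `test` hold `resample` pointers into the original data frame rather than copies:

```r
library(dplyr)
library(purrr)
library(modelr)

cv <- crossv_kfold(mtcars, k = 5)  # 5 folds of lightweight resample objects

cv %>%
  mutate(model = map(train, ~ lm(mpg ~ wt + hp, data = .)),
         rmse  = map2_dbl(model, test, modelr::rmse)) %>%  # out-of-fold error
  summarise(mean_rmse = mean(rmse))
```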
One of the major problems in observational research is estimating the true treatment effect. This is not hard when the selection and outcome processes are uncorrelated and all relevant variables are observed and properly controlled for. However, when selection and outcome are correlated and it is not possible to remove this correlation on the basis of the observables, the estimates will be biased. The Heckman selection model offers one way of dealing with and minimizing this bias. A parallel R-based simulation comparing a Heckman-style estimator to least squares and propensity scores highlights the potential utility of this framework.
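The two-step version of the idea, stripped to its bones on simulated data (the setup and names here are mine, just for illustration): fit a probit for selection, compute the inverse Mills ratio, and add it to the outcome regression.

```r
set.seed(42)
n <- 10000
z <- rnorm(n)  # shifts selection only
x <- rnorm(n)  # observed covariate
# errors in the selection and outcome equations are correlated (rho = 0.7)
err <- MASS::mvrnorm(n, c(0, 0), matrix(c(1, 0.7, 0.7, 1), 2))
sel <- (0.5 * z + 0.5 * x + err[, 1]) > 0  # who is observed
y <- 1 + 2 * x + err[, 2]                  # true effect of x is 2
y[!sel] <- NA

probit <- glm(sel ~ z + x, family = binomial(link = "probit"))
imr <- dnorm(probit$linear.predictors) / pnorm(probit$linear.predictors)
coef(lm(y ~ x, subset = sel))        # naive OLS on the selected: biased
coef(lm(y ~ x + imr, subset = sel))  # Heckman two-step: close to 2
```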
Propensity scores are increasingly in vogue as a way to adjust for differences between populations when estimating treatment effects. Some view propensity scores as an almost mythical way of dealing with confounding. However, they are limited to adjusting for the observables, just like standard regression. This raises the question: how do propensity scores compare as an estimator relative to linear regression? The answer is short --- "not well."
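A toy version of the comparison (all numbers and names are illustrative): inverse-propensity weighting next to plain regression adjustment, on simulated data where both methods have every observable they need.

```r
set.seed(1)
n <- 5000
x <- rnorm(n)                            # the lone confounder
treat <- rbinom(n, 1, plogis(0.8 * x))   # treatment depends on x
y <- 1 + 1.5 * treat + 2 * x + rnorm(n)  # true effect is 1.5

ps <- glm(treat ~ x, family = binomial)$fitted  # estimated propensity
w  <- ifelse(treat == 1, 1 / ps, 1 / (1 - ps))  # inverse-propensity weights

coef(lm(y ~ treat + x))["treat"]           # regression adjustment
coef(lm(y ~ treat, weights = w))["treat"]  # propensity weighting
```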
When you go to war, it can be useful to know how many tanks the other side has. However, they often refuse to tell you. Even worse, they will often vastly inflate production numbers. They are at war, after all. If only there were a way to convert that pesky sequential serial number into an estimate of the total number of tanks...
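There is. Assuming serials run 1..N and captures are random, the classic frequentist estimator is `N_hat = m * (1 + 1/k) - 1`, where `m` is the largest serial observed and `k` is the number of tanks captured. A quick check:

```r
set.seed(7)
N <- 300                 # the true count, unknown in practice
serials <- sample(N, 5)  # serial numbers of five captured tanks
m <- max(serials)
k <- length(serials)
m * (1 + 1 / k) - 1      # estimate of N
```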
You need to come up with a regression model for some response, and you have tons of predictor variables you might want to consider. How do you decide which variables belong in your model? If you started with bivariate correlations between the response and each predictor, you may be in for some trouble.
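A tiny simulation (my own setup, just for illustration) shows why: a predictor can have almost no bivariate correlation with the response and still matter enormously once other variables are in the model.

```r
set.seed(13)
n <- 1000
x1 <- rnorm(n)
x2 <- 0.9 * x1 + rnorm(n, sd = sqrt(1 - 0.9^2))  # highly correlated with x1
y  <- x1 - x2 + rnorm(n)

cor(y, x2)                # near zero: screening would drop x2
summary(lm(y ~ x1 + x2))  # but x2 has a strong, significant effect
```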
Instrumental variables provide a powerful method for getting around unobserved heterogeneity and are increasingly popular in observational research. By exploiting a third variable, known as an instrumental variable, this method breaks the correlation between an independent variable and the omitted or unobserved variables. However, the definitions are mind-boggling and the process is often unclear, even when advertised as "Mostly Harmless." In cases like this, a simulation is often handy, especially one written in R.
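The smallest version of that simulation I can manage (variable names are mine): an unobserved `u` confounds OLS, and two-stage least squares with instrument `z` rescues the estimate.

```r
set.seed(21)
n <- 10000
z <- rnorm(n)                  # the instrument
u <- rnorm(n)                  # unobserved confounder
x <- z + u + rnorm(n)
y <- 2 * x + 3 * u + rnorm(n)  # true effect of x is 2

coef(lm(y ~ x))["x"]           # OLS: biased upward by u
x_hat <- fitted(lm(x ~ z))     # first stage
coef(lm(y ~ x_hat))["x_hat"]   # second stage: close to 2
```

In practice you would reach for something like `AER::ivreg`, which also gets the standard errors right; the hand-rolled second stage above does not.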
If I told you I saw bigfoot, would you believe me? Could I present any evidence that would change your mind? Probably (hopefully) not. The likelihood of bigfoot being real is so small that the only people who report seeing one are the same people who "believe" Ancient Aliens belongs on the History Channel. But if you think that way and penalize an extraordinary claim, why do we use statistical inference that doesn't do the same thing? Could this be the cause of the rash of high-profile failures to reproduce studies and the "decline effect"?
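In Bayesian terms, penalizing the extraordinary claim is just a small prior doing its job. A toy calculation with made-up numbers:

```r
prior <- 1e-6                # P(bigfoot is real)
lr <- 0.99 / 0.01            # evidence 99 times likelier if real
post_odds <- prior / (1 - prior) * lr
post_odds / (1 + post_odds)  # posterior: still only ~ 1e-4
```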
Everyone loves TV and hates it when their favorite show gets canceled. That is why they refuse to watch shows on Fox or anything airing on Friday evenings. But do these commonly held beliefs hold true? Does Fox hate TV, and is Friday night a graveyard for scripted TV?
Over the summer I had some problems with my Internet connection, specifically very high latency. It was causing problems whenever I tried to do anything particularly sensitive to latency; I am already bad enough at online games, and massive lag was not helping. But, like nearly all connection issues, it wasn't happening all the time, or even to all the packets when it was acting up. I called my ISP, but I kept "passing" the ping tests on their end, so they said nothing was wrong and that my 100-300 ms pings to multiple sites were something I was dreaming up. To get the problem fixed, I wrote a little Python script to grab the pings and a second R script to check out what was going on.
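The R half looked roughly like this, assuming the Python script logs a CSV with `timestamp` and `ms` columns (my names here, not necessarily the script's):

```r
pings <- read.csv("pings.csv", stringsAsFactors = FALSE)
pings$timestamp <- as.POSIXct(pings$timestamp)  # assumes a standard format

plot(pings$timestamp, pings$ms, type = "l",
     xlab = "time", ylab = "round-trip time (ms)")
abline(h = 100, lty = 2)               # the latency I was complaining about
quantile(pings$ms, c(0.5, 0.9, 0.99))  # how bad are the bad moments?
```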
Shortly after making yesterday's post, I saw a visualization of apartment rental prices in Boston. As is commonly known, the three most important things in real estate are location, location, location. But location can cut either way for prices: there is a big difference between the nice part of town and living on Yucca Mountain. How can we figure out which locations are good or bad without knowing anything else about Boston? We use the same method discussed in yesterday's simulation, but on good old-fashioned real data.
Observational studies have the same problem as poker: you have to play the cards you are dealt. This can be a problem when you expect people to respond differently to some variable according to some number of unobserved variables. While expectation-maximization probably won't help you in your weekly Texas Hold'em game, it can be an ace up your sleeve in data analysis.
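To make that concrete, here is a compact hand-rolled EM for a two-component normal mixture, the simplest version of the hidden-group problem (the details are mine, not the post's):

```r
set.seed(5)
x <- c(rnorm(300, 0), rnorm(200, 4))  # two hidden groups
mu <- c(-1, 1); p1 <- 0.5             # rough starting values

for (i in 1:100) {
  # E-step: posterior probability each point belongs to group 1
  d1 <- p1 * dnorm(x, mu[1])
  d2 <- (1 - p1) * dnorm(x, mu[2])
  g <- d1 / (d1 + d2)
  # M-step: update the group means and the mixing weight
  mu <- c(weighted.mean(x, g), weighted.mean(x, 1 - g))
  p1 <- mean(g)
}
round(c(mu, p1), 2)  # recovers roughly 0, 4, and 0.6
```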
Everyone complains about how slow R is, especially without vectorized code. I was working on a Project Euler problem and decided to see how slow R really is. I wrote a program to calculate the 10,001st prime using C, Python and R. The results were not pretty (at least for R). Will the byte code compiler save the day?
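The flavor of the experiment, sketched rather than reproduced exactly: a naive trial-division prime finder, timed interpreted and then byte-compiled with `compiler::cmpfun`.

```r
library(compiler)

nth_prime <- function(n) {
  primes <- 2L
  candidate <- 3L
  while (length(primes) < n) {
    divs <- primes[primes <= sqrt(candidate)]  # trial division up to sqrt
    if (all(candidate %% divs != 0L)) primes <- c(primes, candidate)
    candidate <- candidate + 2L
  }
  primes[n]
}

nth_prime_c <- cmpfun(nth_prime)      # byte-compiled version
system.time(print(nth_prime(10001)))  # 104743
system.time(print(nth_prime_c(10001)))
```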