Saturday, February 6, 2016

Time series is a specialty area in the field of analytics. The most well known application is in econometrics. In this post, I'd like to introduce a simple and admittedly lighthearted time series use case to stimulate interest.

The most prevalent analytics formula you will see in statistics is linear regression, something of the form

    y = a + b·x + e

In time series, the independent variables x represent lagged values of the regressand y, or of one or more other independent variables, so the model looks something like

    y(t) = c + m·t + a1·y(t-1) + ... + ap·y(t-p) + b1·x(t-1) + ... + bq·x(t-q) + e(t)

You can see that in addition to the x and y terms, the equation can be further complicated with a constant, a linear trend, and a noise term. It's not hard to see how mathematicians can really run wild with something of this complexity. However, as always, once you break down the different pieces, you find that the beast is tamable after all.

So what are the key components?

1. auto-regression – when y depends on its own past values, it is known as auto-regression, for obvious reasons

2. moving average – when y depends on past values of another independent variable x, it is called moving average, also obvious after you think about it

3. constant – the constant term is a simple bias, usually set to zero to simplify the derivations

4. trend – a linear trend representing a continuous increase or decrease of y over time

5. noise – finally, there is always a noise term which explains away any discrepancy in the predicted y value.

Statisticians use the term iid (independent and identically distributed) to characterize this noise term.
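
To make these components concrete, here is a minimal Python sketch, entirely my own illustration with made-up coefficients, that simulates a series containing all five pieces: an auto-regressive term, a lagged x term, a constant, a linear trend, and iid noise.

    import numpy as np

    # Hypothetical coefficients: constant c, trend slope m, auto-regressive
    # coefficient a1, and lagged-x coefficient b1 -- chosen purely for illustration.
    rng = np.random.default_rng(42)
    n = 200
    c, m, a1, b1 = 0.5, 0.02, 0.6, 0.3

    x = rng.normal(size=n)              # an exogenous driver series
    e = rng.normal(scale=0.1, size=n)   # iid noise
    y = np.zeros(n)
    for t in range(1, n):
        y[t] = c + m * t + a1 * y[t - 1] + b1 * x[t - 1] + e[t]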

So where does time series prediction come into play in the real world? You can probably come up with a dozen different ideas off the bat, but none as entertaining as what these guys Thurman and Fisher published on the age-old question of which came first: the chicken or the egg? You should never accuse statisticians of being humorless after this.

Walter Thurman and Mark Fisher, “Chickens, Eggs, and Causality, or Which Came First?”, American Journal of Agricultural Economics, May 1988

They took actual data on US egg production and chick population from 1930 to 1983, and performed a Granger causality test. This is a test which compares the power to predict y using lagged values of (1) both x and y versus (2) y only. You can find an exercise in R which carries out the computation here, attributed to Cory Lemeister.
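
If R is not at hand, a similar computation can be sketched in Python with statsmodels; note that the two series below are simulated stand-ins rather than the actual egg and chick data.

    import numpy as np
    import pandas as pd
    from statsmodels.tsa.stattools import grangercausalitytests

    # Simulated stand-ins for two annual series (1930-1983 would give 54 points).
    rng = np.random.default_rng(0)
    eggs = np.cumsum(rng.normal(size=54))
    chickens = np.r_[0.0, eggs[:-1]] + rng.normal(size=54)   # depends on lagged eggs

    data = pd.DataFrame({"chickens": chickens, "eggs": eggs})

    # Null hypothesis: the second column ("eggs") does NOT Granger-cause the
    # first column ("chickens"); small p-values reject that null.
    results = grangercausalitytests(data[["chickens", "eggs"]], maxlag=4)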


Note that Clive Granger himself pointed out that his method is often misinterpreted and abused. When the method shows that x causes y, it may not be true causation; there may instead be another factor z which causes both x and y, with its effect on x becoming observable before its effect on y. For this reason, the proper term to use is x Granger-causes y, rather than x causes y.
Professor Dave Giles from the University of Victoria wrote up a very good blog post on the equations and concepts involved in performing a Granger causality test.
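
His setup is, roughly, a pair of equations of the following form (my paraphrase of the standard two-variable VAR; the lag length p and the coefficient names are placeholders):

    y(t) = a0 + a1·y(t-1) + ... + ap·y(t-p) + b1·x(t-1) + ... + bp·x(t-p) + e(t)    (eqn 1)
    x(t) = c0 + c1·x(t-1) + ... + cp·x(t-p) + d1·y(t-1) + ... + dp·y(t-p) + u(t)    (eqn 2)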


First notice the use of the term VAR (Vector Auto Regression). The word vector reflects the fact that there are two equations, predicting both x and y. VAR, along with its partner concept VECM (Vector Error Correction Model), had such a huge impact on the field of econometrics that Christopher Sims, who demonstrated its practical use, received the Nobel prize in economics in 2011.

The gist of the technique is to test whether all the b parameters in eqn 1, and the d parameters in eqn 2, can be eliminated. For example, if all the b parameters in eqn 1 are zero, then y does not depend on past x at all, and therefore x cannot Granger-cause y. In statistics speak, one would reject the null hypothesis that the b parameters are all zero, implying there is Granger causality (or, failing to reject it, conclude there is no evidence of Granger causality). The rejection relies on standard tests: Giles's blog mentions the Wald test, while Lemeister uses p-values, each of which is a lesson in itself.
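
As a rough illustration of that restricted-versus-unrestricted comparison, here is a hand-rolled Python sketch (one lag only, simulated data, and an F test via statsmodels, rather than the exact procedures Giles or Lemeister describe):

    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf

    # Simulated series in which y really does depend on lagged x.
    rng = np.random.default_rng(1)
    n = 200
    x = rng.normal(size=n)
    y = np.zeros(n)
    for t in range(1, n):
        y[t] = 0.5 * y[t - 1] + 0.4 * x[t - 1] + rng.normal(scale=0.1)

    df = pd.DataFrame({"y": y, "x": x})
    df["y_lag1"] = df["y"].shift(1)
    df["x_lag1"] = df["x"].shift(1)
    df = df.dropna()

    unrestricted = smf.ols("y ~ y_lag1 + x_lag1", data=df).fit()   # eqn 1 with the b term
    restricted = smf.ols("y ~ y_lag1", data=df).fit()              # b forced to zero

    # F test of the null hypothesis that the b parameter is zero; a small
    # p-value rejects the null, i.e. x Granger-causes y in this toy setup.
    f_stat, p_value, df_diff = unrestricted.compare_f_test(restricted)
    print(f_stat, p_value)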

If you believe the results, apparently eggs Granger-cause chickens!