Code: https://github.com/thistleknot/R/blob/master/reports/sdiffs-with-arimax-and-arfimax.html

I had an error in my dates (I hate dates… not the fun ones with people, or dried fruit, but date wrangling). I think I've finally mastered them, though, thanks to the eom() function.

Anyway, after fixing the error, I reran my numbers and found a correlated leading indicator, MEAR, for LA condo prices (LXXRCSA).

This is done with stationary (differenced) values, i.e. measuring the rate of change from quarter to quarter (sometimes differencing more than once) to arrive at stationarity.

The best model chosen for the left was ARIMAX, the middle was ETS, and the right was ARIMA.

All values are automatically differenced, both seasonally and non-seasonally, to arrive at stationary variables (i.e., columns of data representing economic indicators to be used as predictors). Stationarity is a model assumption in linear regression, so I started with it (see the reference at the end). A side effect is that I [generally] never have an integrated/differencing (I) term in ARIMA. This also shows up in the residual autocorrelation (ACF) plot for the linear model: none of the residuals are significantly autocorrelated with past values, because I already differenced them. Autocorrelation means the current value is correlated with past values (such as when a trend exists); by differencing (seasonally or not), the trend is controlled for (and captured in the differencing term).
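As a minimal base-R sketch of the differencing step (toy quarterly data standing in for a FRED indicator column, not the actual series):

```r
# Toy quarterly series with a trend and a seasonal pattern
# (hypothetical data, not one of the FRED indicators).
set.seed(42)
n <- 40
trend  <- 0.5 * (1:n)
season <- rep(c(2, -1, -2, 1), length.out = n)
x <- ts(10 + trend + season + rnorm(n), frequency = 4)

# Seasonal difference (lag 4 = one year of quarters), then a regular
# first difference: removes the seasonal pattern and the trend.
x_sdiff <- diff(x, lag = 4)   # seasonal differencing
x_stat  <- diff(x_sdiff)      # non-seasonal differencing

# The differenced series hovers around 0: the trend is now captured
# in the differencing terms rather than left in the data.
round(mean(x_stat), 2)
```

Each seasonal difference costs one year of observations and each regular difference costs one more quarter, which is why the report works with slightly shorter stationary series than the raw columns.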

The cross-correlation function is run against a large array of financial and economic indicators sourced from the St. Louis FRED data repository, plus ETFs (sourced from Yahoo Finance), imported from a CSV file (the file is produced by a separate Python program, FREDdata.ipynb, in my python-stocks GitHub repository).

Cross-correlation finds the ideal lead: the number of quarters to pull a predictor forward against the dependent variable (in the HTML document shown, that's LXXRCSA, LA county condo prices).

The cross-correlation function (CCF) identifies an ideal leading indicator for the dependent variable, based on a significant correlation at a lead of at least 1 time index (max of 4); time indices are measured in quarters. The result is a row offset of the predictor column by the ideal lead value, i.e., the predictor's index is shifted relative to the dependent variable: a lead of 3 means the predictor's value from 3 quarters back is predictive of the dependent variable today.
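A base-R sketch of the lead selection, using synthetic data where the lead is known (3 quarters) rather than the actual indicator columns:

```r
# Hypothetical example: y is led by x with a 3-quarter lag.
set.seed(1)
x <- rnorm(60)
y <- c(rep(0, 3), x[1:57]) + rnorm(60, sd = 0.3)  # y_t ~ x_{t-3}

# In ccf(x, y), correlations at *negative* lags mean x leads y.
cc <- ccf(x, y, lag.max = 4, plot = FALSE)

# Pick the lead (1..4 quarters) with the largest absolute correlation.
lags  <- drop(cc$lag)
acfs  <- drop(cc$acf)
leads <- which(lags <= -1 & lags >= -4)
best_lead <- -lags[leads][which.max(abs(acfs[leads]))]
best_lead  # recovers the 3-quarter lead
```

Offsetting the predictor column by `best_lead` rows then aligns "predictor 3 quarters back" with "dependent variable today" for the regression step.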

I then compare 3 models.

* ARIMA: a univariate model using autoregression (AR) on past values of the series, as well as a moving average (MA) of past errors (roughly, a correction from the last value)
* ARIMAX: a linear model using the leading indicator identified during CCF analysis, with the residuals of the linear model fit as an ARIMA model and added back to the linear model
* ETS: an aggregate of many univariate time series models (including Holt-Winters), from which the best is derived
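The report fits these with auto.arima() and ets() from the forecast package; as a base-R sketch of the same three model shapes, with orders fixed by hand and synthetic data:

```r
# Base-R stand-ins for the three candidates (the report itself uses
# auto.arima() and ets(); orders here are fixed for illustration).
set.seed(7)
n    <- 60
xreg <- rnorm(n)                                  # the CCF-chosen leading indicator
y    <- ts(0.8 * xreg + arima.sim(list(ar = 0.5), n), frequency = 4)

# 1) ARIMA: univariate, AR terms on past values + MA terms on past errors
m_arima  <- arima(y, order = c(1, 0, 1))

# 2) ARIMAX: regression on the leading indicator, with an ARMA model
#    on the regression errors ("regression with ARMA errors")
m_arimax <- arima(y, order = c(1, 0, 0), xreg = xreg)

# 3) ETS-style: Holt's exponential smoothing as a base-R stand-in
#    (gamma = FALSE drops the seasonal component for this toy data)
m_hw     <- HoltWinters(y, gamma = FALSE)

# With a genuinely informative xreg, the ARIMAX fit wins on AIC here.
c(arima = AIC(m_arima), arimax = AIC(m_arimax))
```

Note that stats::arima with xreg implements regression with ARMA errors, which is the same structure described in the ARIMAX bullet above (linear model plus an ARIMA model on its residuals).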

The models are compared using time series cross-validation over a holdout period. The model with the lowest prediction error is chosen to produce the forecast.
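The forecast package provides tsCV() for this; a minimal base-R sketch of the same rolling-origin idea, on synthetic data:

```r
# Rolling-origin ("time series") cross-validation sketch: refit on an
# expanding window, forecast 1 step ahead, and collect the errors.
set.seed(11)
y <- arima.sim(list(ar = 0.6), n = 80)

h <- 1
start <- 60                                        # holdout begins here
errs <- sapply(start:(length(y) - h), function(i) {
  fit <- arima(y[1:i], order = c(1, 0, 0))         # fit on data up to i
  fc  <- predict(fit, n.ahead = h)$pred[h]         # forecast h steps out
  y[i + h] - fc                                    # out-of-sample error
})
rmse <- sqrt(mean(errs^2))
rmse
```

Running this per candidate model and comparing the RMSEs is the "lowest prediction error wins" selection described above.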

The final forecasted values are reconstructed back to their non-differenced levels (using a custom function whose name begins with nv_…) and drawn beyond the last date with their respective lower and upper confidence intervals.
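The core of that reconstruction (the nv_… function itself is custom; this is just the underlying idea) is inverting diff() with a cumulative sum anchored at the last observed level:

```r
# Un-differencing a forecast back to levels (toy numbers).
x  <- c(100, 102, 105, 103, 108)      # original level series
dx <- diff(x)                         # what the models were fit on

# Sanity check: first level + cumulative differences recovers the series
rebuilt <- x[1] + cumsum(dx)          # equals x[2:5]

fc_diff  <- c(1.5, 2.0)               # forecasted *differences*
fc_level <- tail(x, 1) + cumsum(fc_diff)
fc_level                              # 109.5 111.5
```

Seasonal differences invert the same way, except each forecast is added to the level one season (here, four quarters) back instead of the immediately preceding one.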

The best model for LA condo prices was the ARIMAX model which shows a relationship with predictor term INUV.

Note: A concern I have with ARIMAX is that auto.arima doesn't include a constant term, so the model is built without one. If the values are differenced, then they are stationary, which also presumes a mean of 0. Still, it's generally considered best practice to include a constant term; but when I did, auto.arima would sometimes fail to converge during cross-validation. So I pulled the constant term and simply let the best model win on error score alone.

The histograms and model plots show the basic model assumptions for linear regression are met. (As stated earlier, ARIMAX is a linear model with an ARIMA model fit on the linear model's residuals and added back, i.e., auto-adjusting the error term.)

There are four assumptions associated with a linear regression model:

* **Linearity**: The relationship between X and the mean of Y is linear.
* **Homoscedasticity**: The variance of the residuals is the same for any value of X.
* **Independence**: Observations are independent of each other.
* **Normality**: For any fixed value of X, Y is normally distributed.
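These can be checked numerically as well as visually; a quick base-R sketch on a toy model (analogous to the histograms and residual plots in the report):

```r
# Toy linear model satisfying the assumptions by construction.
set.seed(3)
x   <- rnorm(100)
y   <- 2 + 3 * x + rnorm(100)
fit <- lm(y ~ x)
r   <- residuals(fit)

# Normality: large Shapiro-Wilk p-value => no evidence against normality
sw <- shapiro.test(r)$p.value

# Independence: lag-1 residual autocorrelation (cf. the ACF plot)
a1 <- acf(r, plot = FALSE)$acf[2]

# Homoscedasticity (rough check): residual spread in low vs high x
ratio <- sd(r[x < median(x)]) / sd(r[x >= median(x)])

c(shapiro_p = sw, lag1_acf = a1, sd_ratio = ratio)
```

For the differenced series in the report, the independence check is the important one: if the ACF of the residuals shows no significant spikes, the differencing has done its job.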