Google Drive Link
Forecasting window using the same method
Update 2018-02-04
I decided I wanted to merge it with Empirical Cumulative Distribution function. I believe that since it affects both sides of the median equally, it’s not a big concern.
It merely puffs out one side of the median more. Since I merged it with RECDF, I also included a mechanism to do multiple linear regression to achieve a near perfect fit and forecast a 0 and 100% variable accordingly to get a forecasted min/max. Something I’ve always wanted to do. I think this method really captures the raw % of the data.
Original post
I think I finally cracked my normalization problem.
I had written/devised an algorithm that I labelled s transform that was a mix between empirical cumulative distribution function and min/max conversion around average and median.
Then I read about MAD from real statistics for excel. I had a similar idea of using the IQR range to map 25 to 75% and then from 75% to 100% (Max) and 0 (Min) to 25% (1st Quartile). That is still an option, but it would warm with skewed distributions.
This new method I think will work with skewed distributions a little bit better. It will still map 50% of the values between Median and either side of MAD. However, it will allow the Quartiles to slip from the +/-25% position equally, and max/min will not affect the average (unless there are multiple max/min’s, which I have seen, but even then, impact was negligible, <.02% from .5 Average on 294 Records.
I came up with a new method for normalization that produces near linear results and mean and median of .5 and with min 0 and max 1.
I’ve been looking for this method for a while. The average isn’t quite .5, but it’s close enough. In Cell A1 is listed the overall mean for the entire dataset, which is ~.499 I’m not utilizing RECDF. It appears the formula Tarn is using is linear.
I’m simply positioning around the median (as 50%). I utilize Median and MAD to derive a Z Score, then doing a min/max conversion by a simple if < 0 then [0-50%), if > 0 then [50 to 100%] else 50%. This effectively neutralizes the effect of the min and max as outlier’s by setting them to 0 and 1 respectively. Which makes deriving a linear percent with a mean of .5 much easier. Now I achieved a linear grade in the % around .5.
I think this tends to always produce similar results regardless of skew. It allows the quartiles to slip from 25/75% positions respectively to keep median at 50% and min/max’s at 0/100%, which is more/less how s transform worked.
I’ve been looking for this for quite some time. It does have the side affect of normalizing to 0/100% respectively, but that could be adjusted by merging with an RECDF transform to maintain min/max ~0/100% as well as median @50% and Q1/Q3 would weight more towards 25%, but I think it’s unnecessary.
Another holy grail is possibly deriving different Z Scores either side of the median using custom MAD’s, but I think it’s unnecessary. I think this method as is, is a type of integration function.
The purpose of the MAD Z Score is
* to position 50% of the var’s below median and above,
and)
* Using MAD, collect 50% of the values around the median (NOT 25% < 1Q and 25% > 3Q)
