Category Archives: Normalization

MAD Normalization

Google Drive Link

Forecasting window using the same method

Update 2018-02-04

I decided I wanted to merge it with Empirical Cumulative Distribution function.  I believe that since it affects both sides of the median equally, it’s not a big concern.

It merely puffs out one side of the median more.  Since I merged it with RECDF, I also included a mechanism to do multiple linear regression to achieve a near perfect fit and forecast a 0 and 100% variable accordingly to get a forecasted min/max.  Something I’ve always wanted to do.  I think this method really captures the raw % of the data.

Original post

I think I finally cracked my normalization problem.

I had written/devised an algorithm that I labelled s transform that was a mix between empirical cumulative distribution function and min/max conversion around average and median.

Then I read about MAD from real statistics for excel.  I had a similar idea of using the IQR range to map 25 to 75% and then from 75% to 100% (Max) and 0 (Min) to 25% (1st Quartile).  That is still an option, but it would warm with skewed distributions.

This new method I think will work with skewed distributions a little bit better.  It will still map 50% of the values between Median and either side of MAD.  However, it will allow the Quartiles to slip from the +/-25% position equally, and max/min will not affect the average (unless there are multiple max/min’s, which I have seen, but even then, impact was negligible, <.02% from .5 Average on 294 Records.

I came up with a new method for normalization that produces near linear results and mean and median of .5 and with min 0 and max 1.
I’ve been looking for this method for a while.  The average isn’t quite .5, but it’s close enough.  In Cell A1 is listed the overall mean for the entire dataset, which is ~.499  I’m not utilizing RECDF.  It appears the formula Tarn is using is linear. 
I’m simply positioning around the median (as 50%). I utilize Median and MAD to derive a Z Score, then doing a min/max conversion by a simple if < 0 then [0-50%), if > 0 then [50 to 100%] else 50%. This effectively neutralizes the effect of the min and max as outlier’s by setting them to 0 and 1 respectively.  Which makes deriving a linear percent with a mean of .5 much easier.  Now I achieved a linear grade in the % around .5.

I think this tends to always produce similar results regardless of skew.  It allows the quartiles to slip from 25/75% positions respectively to keep median at 50% and min/max’s at 0/100%, which is more/less how s transform worked.

I’ve been looking for this for quite some time.  It does have the side affect of normalizing to 0/100% respectively, but that could be adjusted by merging with an RECDF transform to maintain min/max ~0/100% as well as median @50% and Q1/Q3 would weight more towards 25%, but I think it’s unnecessary.
Another holy grail is possibly deriving different Z Scores either side of the median using custom MAD’s, but I think it’s unnecessary.  I think this method as is, is a type of integration function.
The purpose of the MAD Z Score is
* to position 50% of the var’s below median and above,
* Using MAD, collect 50% of the values around the median (NOT 25% < 1Q and 25% > 3Q)

S Transform updates

Been working on a Multi Criteria Decision Analysis algorithm.

I figured out a way to set the value to ~<50% if null when the other weights are disabled. Without having to do a pad value. I just extended min and max a tiny bit so when min=median (during s transform) and (x-min)/(median-min) results in 0, it won't now because it will be (x-newmin)/(median-newmin). A convenient hack, but does what I want it to do with drawing data. Not currently in the link atm, but in an offline file. Still not 100% sure how I want to extend the min/max, but it's based on 1-(count/(count+1))% degrees to play with.