Category Archives: Statistics

Medical Work Sample Regression Results



6th Try (Sensitivity and Specificity same algo, 2 separate runs flipped bit)

Sensitivity: 97.4%

Found another single variable that accounts for 97.4% of my sample's diagnostics:

End-Stage Renal Disease

Cutoff was .43

I modified my algorithm to find the cutoffs from the training partition rather than the cross-validation test partitions.  I was still trying to solve for specificity, but alas, it converges on sensitivity.

I'm not overly worried about it.  I can always recode the response variable so that the run converging on sensitivity gives me my specificity.


Okay… so I tried for sensitivity and it converged on specificity.

The only change I can think of that did it is switching from cross-validation test-partition cutoffs to training-partition cutoffs.

I use a function optimalCutoff and a var optimizeFor = Zeros or Ones depending on a flag at the beginning.  

I also check for specificity or sensitivity in the confusionMatrix output, based on this flipped flag.
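The actual implementation is in R (optimalCutoff with the optimizeFor variable, plus confusionMatrix). As a rough illustration of the idea only, here is a hypothetical Python sketch of a flag-driven cutoff search; the function name, the 0.01 grid, and the tie-breaking rule are my assumptions, not the original code:

```python
import numpy as np

def optimal_cutoff(y_true, y_prob, optimize_for="ones"):
    # Hypothetical stand-in for R's optimalCutoff: scan a grid of cutoffs
    # and maximize sensitivity (optimize_for="ones") or specificity
    # (optimize_for="zeros"). Ties are broken by the complementary metric
    # so the search does not degenerate to "predict everything positive".
    best = (-1.0, -1.0)
    best_cut = 0.5
    for cut in np.arange(0.01, 1.0, 0.01):
        pred = y_prob >= cut
        sens = np.mean(pred[y_true == 1])    # true positive rate
        spec = np.mean(~pred[y_true == 0])   # true negative rate
        cand = (sens, spec) if optimize_for == "ones" else (spec, sens)
        if cand > best:
            best, best_cut = cand, float(cut)
    return best_cut
```

Flipping optimize_for then swaps which metric is maximized, which is the behavior the flag is meant to produce.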

Anyways… if I flip this flag, it does either sensitivity or specificity, so that part is working.  Why it's inverted from the default parameters… still not sure.

But this IS better than it ALWAYS converging on sensitivity.

I suspect the slightly different results each pass are due to the imputed variables and my static SPSS dataset, which is an output of just one imputed set. I use the same seed (poor programming practice, I know, but data science is supposed to converge on the same results regardless of randomization, aka cross validation). In this case it's not so much the factors that vary, but the classification scores (confusion matrix results).

5th Try (Solved Sensitivity)

Note: “3rd Try” is my specificity model (I coded the 1’s and 0’s backwards and mistook it for the true sensitivity model I was looking for)

An even better model

  • Serum Creatinine
  • PET

Sensitivity: 99.3%

Cutoff: .345035

4th Try

Optimizing for cutoffs

I do not understand why, but when I tell R to test for specificity, I converge on sensitivity.

The Answer is 

  • BMI
  • ESRD
  • Diabetes.Mellitus
  • Liver.cirrhosis
  • Hepatitis.B
  • SOB
  • Coagulopathy
  • Constant Term
  • Cutoff: .475
  • Sensitivity: 93.5%

All metrics are derived from the test partitions of cross validation (including the cutoffs). I'm hitting the ball all right.
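The experiments here go back and forth between deriving the cutoff from the training partition and from the CV test partitions (see the 6th try above). As a hedged illustration of the training-derived variant, here is a hypothetical Python/scikit-learn sketch (the original work is in R, and using Youden's J as the per-fold selection rule is my assumption):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold

# toy stand-in for the medical dataset
X, y = make_classification(n_samples=300, n_features=5, random_state=1)

sens_per_fold = []
for tr, te in StratifiedKFold(n_splits=5, shuffle=True, random_state=1).split(X, y):
    model = LogisticRegression(max_iter=1000).fit(X[tr], y[tr])
    # pick the cutoff on the TRAINING split (maximizing Youden's J,
    # my assumption), so the test split stays untouched
    p_tr = model.predict_proba(X[tr])[:, 1]
    cuts = np.arange(0.05, 0.95, 0.01)
    j = [np.mean((p_tr >= c)[y[tr] == 1]) + np.mean((p_tr < c)[y[tr] == 0]) - 1
         for c in cuts]
    cut = cuts[int(np.argmax(j))]
    # report sensitivity on the held-out split with that training-derived cutoff
    p_te = model.predict_proba(X[te])[:, 1]
    sens_per_fold.append(float(np.mean((p_te >= cut)[y[te] == 1])))
```

Averaging sens_per_fold then gives a sensitivity estimate that never saw its own cutoff during scoring.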

3rd try (Solved Specificity)

  • Cross Validation
  • Binary Logistic
  • Categories

This is my 3rd try at that Medical Data problem and the ask was to test for sensitivity.

This is my final result

I hit the ball out of the park.

Problem with this: I forgot my 1's were incorrectly coded for the class of non-interest.

Optimized for sensitivity using cross validation 🙂

Code is saved on my private github. It was a lot of trial and error, but I got it.

2nd try

Using guide here:

I initially tried cross validation using this hash matrix, but it’s still a WIP

Fundamental weaknesses: not optimized for sensitivity; bugs in the code; 2-level variables shouldn't be factors.



1st Try

I finished a work sample challenge for medical data
I even imputed data!

Yeah, I consider myself a data scientist

70% accuracy

Fundamental weakness: didn’t use factors


Income Regression Model based on State Features


Data is based on 2007 Statistical Abstract of the United States

I've thoroughly analyzed my "best" cross-validated model, further pruned it using backwards stepwise regression on the final dataset from the cross-validated term algorithm, and run model diagnostics. Variables with positive residuals are highlighted in green and those with negative residuals in red (for further analysis).

This is the type of work I was thinking about publishing

This model is the “Income” model.  I’ve included quadratic terms and got an amazing MAPE of 2.65% and an Adjusted R^2 of .97
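For reference, MAPE here is the mean absolute percentage error (the project sources a mape.r file for this; a minimal Python equivalent is shown below as an illustration):

```python
import numpy as np

def mape(actual, predicted):
    # mean absolute percentage error, reported in percent
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.mean(np.abs((actual - predicted) / actual)) * 100)
```

A MAPE of 2.65% means predictions are off by about 2.65% of the actual value on average.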

I've noticed the cross validation pruned the hierarchical dependency (Crime).  I'm not sure what to make of that atm, but I trust it, knowing the MAPE was cross validated.  I know I can exclude INTERACTION hierarchy dependencies (unsure about quadratic), and since Crime is also captured in the interactions, technically maybe that's why it still shows as significant.

I've included studentized residuals and mapped them to Cook's distance, which gives a great view of outliers.

I would give my stamp of approval on this model and say it passes the 4 model assumptions

States of Note

* Alaska

* Arizona

* Connecticut

* Illinois

* Louisiana

* California

* Maryland

Backwards Best Subset Cross Validation including interacted terms

Some best formula inferences

  • Income,Poverty,White,Unemployed,Doctors*Infant.Mort,Doctors*Traf.Deaths,Doctors*Unemployed,Doctors*University,Infant.Mort*Unemployed,Infant.Mort*White
    • MAPE of 4.37%
  • Poverty,Crime,Traf.Deaths,Unemployed,Crime*Infant.Mort,Crime*Unemployed,Crime*White,Doctors*University,Income*Infant.Mort
    • MAPE of 6.57%
  • University,Poverty,Infant.Mort,White,Crime*Income,Doctors*Traf.Deaths,Doctors*Unemployed,Doctors*White
    • MAPE of 7.26%

I used concepts from backward stepwise regression to find the best set of factors to include in the regression. I got the idea from p values, but I wanted to use cross validation instead. There is a post that does exactly this, but I realized its results might not be the best CV scores. So I coded a loop, a few actually. This is an idea I've been working on through my LinkedIn for a while, going back and forth trying to derive millions of combinations of factors (interactions, quadratic terms, etc.) to test.

I came up with the most ingenious solution for the best cross validated formula using interactions.

1. Derive all interactions.

2. Develop Cross Validation train/test splits

3. Build full model

3a. From the current model, drop one variable at a time and keep the drop that performs best over all folds.

3b. Repeat 3a until the model no longer improves. That is the best formula.

It's kind of like a genetic algorithm, except with no mutations.

I got the idea from backwards selection. It’s also a lot easier to code than trying to build a manual backwards selection.

I realized overfitting all terms wouldn't be an issue, because the validation partition wasn't used for training, so the variables that don't belong in the equation will stand out the most.

I imagine something similar could be done for forward regression. Start with the single most powerful variable and work your way forward.
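The steps above (derive interactions, build CV splits, drop one variable at a time until nothing improves) can be sketched in code. This is a hypothetical Python/scikit-learn rendering, not the original R loop; the MAPE scoring string and the greedy acceptance rule are my assumptions:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

def backward_cv(X, y, names, cv=5):
    """Backwards elimination driven by cross validation (steps 3a/3b):
    drop the single variable whose removal most improves the mean CV
    score; stop when no removal helps."""
    def cv_score(cols):
        return cross_val_score(LinearRegression(), X[:, cols], y, cv=cv,
                               scoring="neg_mean_absolute_percentage_error").mean()
    keep = list(range(X.shape[1]))
    best = cv_score(keep)
    while len(keep) > 1:
        # step 3a: score every single-variable drop over all folds
        trials = [(cv_score([c for c in keep if c != i]), i) for i in keep]
        score, drop = max(trials)
        if score <= best:          # step 3b: no drop improves -> stop
            break
        best = score
        keep = [c for c in keep if c != drop]
    return [names[i] for i in keep], best
```

Because the score is always computed on held-out folds, terms that only fit noise are the first candidates to be dropped.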

Github (private):


Outputted scores:

data (states.csv):

Note: the source call to mape.r can be commented out

Cross Validation over every factorial combination

I learn and then apply

I dropped the self-filtering method of deriving multiple R manually with matrix equations in favor of cross validation: the best model (the set of parameters with the lowest RMSE) is chosen, and I then derive the final model against the full dataset.

I use combn to iterate over every combination of factors #dataScience
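R's combn enumerates the k-element subsets of the predictors; the Python equivalent is itertools.combinations. A hedged sketch of the same pipeline, score every subset by cross-validated RMSE, then refit the winner on the full dataset (function name and metric are my assumptions):

```python
from itertools import combinations

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

def best_subset_cv(X, y, names, cv=5):
    """Exhaustively score every subset of predictors (combn-style) by
    cross-validated RMSE; refit the winner on the full dataset."""
    best_rmse, best_cols = np.inf, None
    for k in range(1, len(names) + 1):
        for cols in combinations(range(len(names)), k):
            neg_mse = cross_val_score(LinearRegression(), X[:, cols], y,
                                      cv=cv, scoring="neg_mean_squared_error")
            rmse = float(np.sqrt(-neg_mse.mean()))
            if rmse < best_rmse:
                best_rmse, best_cols = rmse, cols
    final = LinearRegression().fit(X[:, best_cols], y)  # refit on full data
    return [names[i] for i in best_cols], best_rmse, final
```

Note the cost is exponential in the number of factors, which is why the backwards-elimination loop above scales better once interactions are included.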

Multiple Regression Coefficients, Correlation Matrix, and Significance

I redid my Matrix Multiple Regression spreadsheet so it's easier to read, with fewer needless matrix multiplication outputs and less cluttered formulas.


* Inverted Transposed Predictor Matrix * Predictor Matrix
* Covariance Matrix
* Inverted Correlation Matrix


* reproduced all Data Analysis ToolPak output
* didn't have to use LINEST
* nor results from the ToolPak
* nor any 3rd-party plugins
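One of the pieces listed, going from the covariance matrix to the correlation matrix, is a one-liner in matrix terms: divide each entry S_ij by the product of the standard deviations sd_i·sd_j. A small numpy sketch of what the spreadsheet computes cell by cell:

```python
import numpy as np

def corr_from_cov(S):
    # correlation matrix from a covariance matrix:
    # R_ij = S_ij / (sd_i * sd_j), where sd_i = sqrt(S_ii)
    d = np.sqrt(np.diag(S))
    return S / np.outer(d, d)
```

This is why adding the correlation matrix is "pretty easy" once the covariance matrix exists.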


Multiple Regression Coefficients and Significance Using Matrix Algebra

I'm quite proud of myself.  I spent a good portion of the day figuring out how to calculate p-scores by hand (by hand I mean in Excel, as opposed to using R or the Data Analysis ToolPak).

Granted, I did use the Analysis ToolPak to derive the residuals, but that was a shortcut rather than using LINEST.  I DID derive the coefficients manually using matrix algebra, so I could have derived the y's and then the residuals.  But I wasn't focused on that; I was focused on the t scores and the subsequent p values, so I went straight to those, because that's what I want to model in R.

Now I can derive the p scores of trainControl in R 🙂

Next I'm going to add a correlation matrix to this sheet, which is pretty easy since I have the covariance matrix.
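The matrix-algebra route described here is standard OLS: β = (XᵀX)⁻¹Xᵀy, standard errors from the diagonal of s²(XᵀX)⁻¹, then t = β/se and two-sided p values from the t distribution. A Python sketch of the same computation the spreadsheet does:

```python
import numpy as np
from scipy import stats

def ols_matrix(X, y):
    """OLS by matrix algebra: coefficients, t scores, and p values,
    mirroring the Excel computation described above."""
    X = np.column_stack([np.ones(len(X)), X])   # prepend intercept column
    XtX_inv = np.linalg.inv(X.T @ X)
    beta = XtX_inv @ X.T @ y                    # (X'X)^-1 X'y
    resid = y - X @ beta
    dof = X.shape[0] - X.shape[1]               # n - p degrees of freedom
    s2 = resid @ resid / dof                    # residual variance
    se = np.sqrt(s2 * np.diag(XtX_inv))         # coefficient standard errors
    t = beta / se
    p = 2 * stats.t.sf(np.abs(t), dof)          # two-sided p values
    return beta, t, p
```

Deriving the residuals from y − Xβ (rather than from the ToolPak output) would make the whole chain self-contained.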

This file can be found in my uploads

and I got it!




Project homepage Readme

Using ICPSR polling data of 8th & 10th grade Americans, I transform a set of predictor terms into what I call a "semiotic grid" of 1's and 0's, which is then used to identify a class of desired outcomes across 3 specific response terms: GPA, gang fights, and (gasp) presence of psychedelic drug use.

I use Monte Carlo resampling to achieve class balancing and run a modified bestglm algorithm to get a wider set of terms via cross validation, then through cross-validated holdout analysis, then tabulated. That's just for initial factor reduction/pooling of potential candidates. These terms then go through more class balancing and cross validation, this time using the actual, unmodified bestglm, to arrive at a final regression formula as well as the terms that are always population-significant, closing with ROC.
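"Monte Carlo resampling to achieve class balancing" can take several forms; one common version is upsampling the minority class with replacement until the classes are even. A hypothetical Python sketch of that variant (whether the project upsamples or downsamples isn't stated here):

```python
import numpy as np

def balance_classes(X, y, rng=np.random.default_rng(0)):
    """Monte Carlo resampling for a binary response: draw extra minority-class
    rows with replacement until both classes are the same size."""
    idx0, idx1 = np.where(y == 0)[0], np.where(y == 1)[0]
    small, large = (idx0, idx1) if len(idx0) < len(idx1) else (idx1, idx0)
    extra = rng.choice(small, size=len(large) - len(small), replace=True)
    keep = np.concatenate([idx0, idx1, extra])
    return X[keep], y[keep]
```

Balancing before model selection keeps bestglm from favoring formulas that simply predict the majority class.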

I am offering the project as a type of open house to potential employers to determine if my skillset would be a good fit for what you hope to do with numbers.

Alt KNN using simpler 2 factor data

One of my most popular LinkedIn posts

I need to correct my prior statement. What this shows is the Euclidean distance (using the mean of each factor as the base). Since it was so popular, I updated my Google Doc sheet as well as my KNN algorithm to use a method I found that actually maps a linear relationship to the z-scored response variable, where I can set the constant to 0 (sum z to y is the 'propensity threshold' pic)!

Original Post

I was calculating my KNN distances incorrectly. The book gave an example with u as the subtracted nearby element, which I read as μ, the population mean. I thought the book was telling me to subtract the element from the factor's mean, which made sense to me intuitively because of measures of central tendency and ogives/Tukey seven-number summaries.

I also thought you had a set of constant scores (k-means?) for any set of values, but I now see it's more dynamic than that, aggregating over nearest neighbors; however, each value has a static set of neighbors (set before running the algo). The way I just learned it, scores are only derived for test values (except with knndistplot).
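The scheme just described, scores derived only for test points, each from a fixed number of nearest training neighbors, is plain KNN classification. A minimal Python sketch, z-scoring the factors first (per the normalization recommendation mentioned below) and, per the correction above, measuring point-to-point Euclidean distance rather than distance to the mean:

```python
import numpy as np

def knn_predict(X_train, y_train, X_test, k=3):
    """Classify each test point by majority vote of its k nearest
    training neighbors (Euclidean distance on z-scored features)."""
    mu, sd = X_train.mean(axis=0), X_train.std(axis=0)
    Z_train, Z_test = (X_train - mu) / sd, (X_test - mu) / sd
    preds = []
    for z in Z_test:
        # point-to-point distances (the corrected reading of "u"),
        # not distance to the factor means
        d = np.sqrt(((Z_train - z) ** 2).sum(axis=1))
        nearest = y_train[np.argsort(d)[:k]]
        preds.append(int(np.bincount(nearest).argmax()))
    return np.array(preds)
```

Only the test points get scored; the training set just supplies the neighbor pool, which matches the "static set of neighbors" observation.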

There are many algorithms for determining neighbors, so here is my vision.

I got the idea from cutoff thresholds, and the recommendation to normalize.

Visual mockup [] of how the measures of central tendency work when X is sorted and compared with Y (the intent is to show a relationship by mapping Y flag 'propensity' changes along the X ogive). What I do is combine these 'propensities' using Euclidean distances based on z scores of the original factors (note: the top and bottom 4 values of each X factor are "continuous" with a specific response value, i.e. top 4 = 1, bottom 4 = 0).

Actual Data