Category Archives: Machine Learning

LLMs as Entities

One machine said to the other

LLMs are entities
in between people, ideas, and objects

Common to these concepts are ideas: what we normally associate with thoughts, or cognits, similar to word roots (lemmatized ideas). This is because they were built by generalizing a neural network over a subset of humanity’s written thoughts.

I’m positing that with enough of these entities in a room, with few-shot generative adversarial prompts between them, they would synergize (create an interaction) resulting in an emergent conversation that could qualify as sentient. Think of it simply as multiplying the vector space, akin to how sides a and b turn two linear lines into an area. This becomes the inferential space, a product of the inputs.

An idea I’m working on: I’m considering using the outputs of such conversations in a fine-tuning pipeline as a type of reinforcement learning, but my aim is to avoid the need for expensive fine-tuning and instead simply iterate on the prompt engineering, perhaps with an LLM that is doing just that.

I imagine I would hit some qualitative limit as a result of a model’s generalized ability, but that could be solved by upgrading the model when a better one becomes available.

I think something simple would be

  • “How to improve upon this joke?”
  • “How can I improve these few shot learning prompts? Can you think of any meta elements I’m missing that would help grab more attention from the responses?”

Then feed that back and forth between two models, updating on actual responses to questions and updating the few-shot learning prompts.
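A minimal sketch of that back-and-forth loop. The `ask()` function is a hypothetical stub standing in for whatever LLM API would actually be called; only the control flow is shown.

```python
# Sketch of the two-model prompt-refinement loop described above.
# ask() is a hypothetical stand-in for a real LLM API call; it is
# stubbed here so the loop itself is runnable.

def ask(model, prompt):
    # Placeholder: a real implementation would call an LLM API here.
    return f"[{model} response to: {prompt[:40]}]"

def refine(joke, rounds=3):
    """Bounce a joke and its few-shot prompt between two models."""
    prompt = f"How to improve upon this joke? {joke}"
    history = []
    for _ in range(rounds):
        improved = ask("model_a", prompt)
        critique = ask(
            "model_b",
            "How can I improve these few shot learning prompts? "
            f"Any meta elements missing? {improved}",
        )
        history.append((improved, critique))
        prompt = critique  # feed the critique back in as the next prompt
    return history

transcript = refine("Why did the robot cross the road?")
```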

I got this idea from thinking about governmental bodies as entities and walked it back to LLMs.



Neo4j Graph Database


Match 5 conditions
Match 4 conditions


I create relationships from a dataframe’s columns if a given record (in this case, a state) is more than 1 median absolute deviation above the median in the direction of a beneficial outcome (for example, low unemployment, low population, or high income).

Here is the code used to create the graph (I asked ChatGPT a lot of questions to get to this result, but I now know how to implement it properly after this POC).
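As a rough sketch of the MAD helper-column idea, here is the flag logic with illustrative income figures and made-up Cypher labels (not the real dataset or code):

```python
import statistics

# Illustrative income figures in $k, not the real dataset.
incomes = {"CA": 84, "MS": 48, "MD": 95, "WV": 51, "NJ": 90, "TX": 67}

median = statistics.median(incomes.values())
mad = statistics.median(abs(v - median) for v in incomes.values())

# Helper column: 1 when the value is more than 1 MAD above the median.
flags = {state: int(v > median + mad) for state, v in incomes.items()}

# Emit Cypher MERGE statements for the flagged states (hypothetical labels).
cypher = [
    f"MERGE (s:State {{name: '{state}'}})"
    f"-[:HIGH_INCOME]->(:Factor {{name: 'Income'}})"
    for state, flag in flags.items()
    if flag
]
```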

I’m enjoying the ways you can slice and dice a graph database.

This showcases states (and the regions I joined them to) that are 1 median absolute deviation above the median (a binary factor derived in what would otherwise be known as a helper column; all of the ETL logic is done in Python at the moment). This way of splitting the data made the most sense to me for non-normal distributions (for a POC). Otherwise medians are too wishy-washy: their centers can shift and you would get a different mix, whereas this is more akin to identifying upper and lower groups.

Turing GAN

I was thinking about how GANs have been used to create realistic-looking images and thought… we have chatbots that sound realistic enough (ChatGPT). Couldn’t GANs be used to develop a conversation system (chatbot, for lack of a better word) that is validated by a discriminator in much the same way GANs are used to generate realistic images? The generator would just keep correcting itself until it passed the discriminator’s thresholds.


chatGPT – Ice Age, Flood, and Atlantis

How is it that humans survived the Ice Age for thousands of years? Was the whole world covered in ice? If not, were there warm, habitable regions that would have allowed for agricultural farming? What is the likelihood that civilizations sprang up in these areas and, when the ice melted, were lost under the incoming water? Which is more likely: that humans were hunter-gatherers during this entire time, or that they developed basic agricultural skills? Please defend your case with examples.

Humans survived the Ice Age primarily through a combination of hunter-gatherer and, to some degree, agricultural practices. Hunter-gatherer activities such as fishing, hunting, gathering, and scavenging provided a steady and reliable source of sustenance across the entirety of the Ice Age. It is also likely that humans developed some basic agricultural skills during this time period, especially in areas with warmer climates that were not covered in ice. For example, evidence suggests that humans living in what is now the United Kingdom during the Ice Age practiced cereal cultivation on a small scale.

In regards to civilizations potentially springing up in areas with warm climates and then being lost when the ice melted, this is unlikely. While it is possible that such civilizations once existed, the archaeological record does not provide any evidence to suggest that this actually happened. Moreover, due to the migratory nature of humans living during the Ice Age, it is more likely that any civilizations that did form would have been widely dispersed rather than concentrated in one particular area.
Human: So… your take is that the Atlantis story, as described by Plato and handed down by the Egyptians, doesn’t describe what exists in multiple world religions (a deluge) from when the Strait of Gibraltar broke and created the Mediterranean?

No, I do not believe that the Atlantis story as described by Plato is related to the flood described in multiple world religions. While Plato’s story features a large flood, it is likely an exaggeration of a much smaller event or event series that occurred in the region. Additionally, archaeological evidence does not support the idea that these two floods are related in any way.
Human: What does the archaeological evidence say about these two floods then?

The archaeological evidence indicates that the floods described in multiple world religions likely occurred thousands of years before the events described in Plato’s story. Furthermore, the scale of the floods described in world religions is significantly larger and more destructive than the flood described in Plato’s story. This suggests that the two floods are likely not related.
Human: How do we have any basis to know the timeframes of when the two floods might have occurred?

We know that the floods described in multiple world religions occurred prior to the events described in Plato’s story thanks to a variety of archaeological evidence. For example, historians and archaeologists have studied ancient artifacts that predate Plato’s story and determined that they would have been destroyed by the flood writers describe in world religions. Additionally, paleontological evidence suggests that some species of animals were rendered extinct prior to the events described in Plato’s story, indicating that a significant flooding event must have occurred earlier. Finally, geological evidence indicates that land forms in particular regions have changed drastically since before Plato’s story, suggesting that large-scale flooding events must have occurred since then.

CAPM Portfolios

I know how to build a Markowitz weighted portfolio, and how to ‘hack it’: just up the quantities associated with higher betas, which represent the risk premium (i.e. how much return over the risk-free rate is expected, aka the risk premium of the market, based on the DGS3MO).

But I let it resolve to the optimal Sharpe ratio and simply display the betas as derived from MDYG (SP1500).

So based on CAPM expected return (the average risk premium for the past 5 years is 0.0142, or 1.42%), the CAPM return is 4.33% + 1.42% * the portfolio beta of 1.00116592, which comes out to 5.75% for next quarter.
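That arithmetic, spelled out with the numbers from the paragraph above:

```python
# CAPM expected return: risk-free rate + market risk premium * portfolio beta.

risk_free = 0.0433           # 4.33% (3-month Treasury, DGS3MO)
risk_premium = 0.0142        # 1.42% average risk premium, past 5 years
portfolio_beta = 1.00116592  # beta derived against MDYG (SP1500)

expected_return = risk_free + risk_premium * portfolio_beta
print(f"{expected_return:.2%}")  # prints 5.75%
```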

A different forecast, one based on Markowitz simulations, has 9% for next quarter.

Another forecast, based on an expected-return factor model with 13% MAPE, gives a weighted forecasted return of 13% for next quarter (i.e. 13% +/- (13%)^2, which is 13% +/- 1.69%).

What’s frustrating is knowing I hit the ball out of the park when it comes to CAPM portfolios and Markowitz, but also knowing that those in academia who actively trade are not fans of the material they are hamstrung to teach. So I get various strong opinions about what works. Very cult-of-personality about methodologies, but not me. I’m open to trying as much as I can just for the opportunity to learn.

The Inefficient Stock Market is a gold mine in terms of what factors to look for. I’ve been doing my own research (FRED data, commodities, foreign exchanges, indexes, sectors, SP1500 prices, fundamentals, financial statements, critiques of Piotroski, Fama-French 3- and 5-factor models, Arbitrage Pricing Theory). The book suggests improved/revised factor models using a mix of financials and fundamentals, offering 30 factors to look out for.

If it works and proves to match the projected expected returns within the risks shown, then this could be used to borrow money on margin, knowing your returns are modeled/controlled for, and you can make money on the spread, but it’s risky. Borrowed money is usually at around the risk-free rate, so you aim for a risk-premium return by controlling for risk.

The philosophy behind the filters is “this vs. that. Bifurcation.” Split everything, somewhat subjectively, into a simple filter no matter how complex the calculation is on the back end: a 1 or 0 is coded for every value, with the default being 0 (such as for NAs); these filters are added together across ETFs and the top results are sifted out. This allows me to focus on revising and expanding the individual logic of factors encapsulated in SQL and/or Python files, for example modifying thresholds that affect the proportion of occurrence for a given factor (field). If the query logic is based on medians, it’s easy to get 50% of the values every time for each factor.
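A toy sketch of that scoring, with made-up factor names, thresholds, and tickers:

```python
# Each factor maps raw values to a binary flag (default 0, e.g. for missing
# values); flags are summed per ETF and the top results are sifted out.
# Factor names, thresholds, and tickers are illustrative.

FACTORS = {
    "low_pe": lambda f: int(f.get("pe") is not None and f["pe"] < 15),
    "high_roe": lambda f: int(f.get("roe") is not None and f["roe"] > 0.15),
    "positive_fcf": lambda f: int(f.get("fcf") is not None and f["fcf"] > 0),
}

etfs = {
    "AAA": {"pe": 12, "roe": 0.22, "fcf": 1.5},
    "BBB": {"pe": 30, "roe": 0.05, "fcf": -0.2},
    "CCC": {"pe": 14, "roe": None, "fcf": 0.8},  # missing roe defaults to 0
}

scores = {
    name: sum(factor(fund) for factor in FACTORS.values())
    for name, fund in etfs.items()
}
top = sorted(scores, key=scores.get, reverse=True)
```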

Stock Database

I finished the database I was working on for stock market data.

For the SP1500:
  • SEC filings for financial statements
  • what Yahoo offers (financial statements, annual and quarterly; earnings trend estimates)
  • FRED data for econometrics

The whole ETL job now finishes in about 30 minutes, and I’ve encapsulated it into a single folder.

I intend to use Tableau to parse through this and create some choice dashboards.

Once I finalize the dashboards, I intend to migrate them over to Flask.

Dask pygam nested cross validation

I combined new inferences from a few sources on how to do nested cross-validation, so I [re-]wrote it for pygam: it derives all combinations of terms to find the best subsets, then distributes the work across a Dask cluster.

Income and Population are the best predictor terms for Poverty using a GAM (with k = 10 folds).
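The “all combinations of terms” step can be sketched with itertools; the pygam fitting and Dask distribution are omitted here, and the term names are illustrative:

```python
from itertools import combinations

# Enumerate every non-empty subset of candidate predictor terms; each subset
# would then be scored with nested cross-validation (fitting omitted).

terms = ["Income", "Population", "Education"]

subsets = [
    combo
    for r in range(1, len(terms) + 1)
    for combo in combinations(terms, r)
]
# 2**3 - 1 = 7 candidate models; ("Income", "Population") is one of them
```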

Non-dask version:

Time Series Forecasting (ARIMAX, ARIMA, ETS)


I had an error in my dates (I hate dates… not the fun ones with people, or dried fruit, but date wrangling). I think I’ve mastered it now, though, thanks to an end-of-month (eom) function.

Anyway, after fixing the error I reran my numbers and found a correlated leading indicator for LA condo prices (LXXRCSA): MEAR.

This is with stationary (differenced) values, i.e. measuring the rate of change from quarter to quarter (sometimes differenced more than once) to arrive at stationarity.

The best model chosen for the left was ARIMAX, middle was ETS, and right was ARIMA


All values are automatically differenced, both seasonally and non-seasonally, to arrive at stationary variables (i.e. columns of data representing economic indicators to be used as predictors). This is a model assumption in linear regression, so I started with it (see the ending reference). The added effect is that I [generally] never have an integrated/differencing (I) term in ARIMA. This also plays out in the residual autocorrelation (ACF) plot for the linear model: none of the residuals are significantly autocorrelated with past values, because I differenced them already. Autocorrelation means the current value is correlated with a past value (such as when a trend exists); by differencing (seasonally or not), the trend is controlled for (and captured in the differencing term).
The cross-correlation function is run over a large array of financial and economic indicators sourced from the St. Louis FRED data repository and ETFs (sourced from Yahoo Finance), imported from a CSV file (derived from a separate Python program called FREDdata.ipynb in another of my GitHub repositories, titled python-stocks).
Cross-correlation finds the ideal lead number for pulling predictor data forward against the dependent variable (in the HTML document shown, it’s LXXRCSA, LA county condo prices).
The cross-correlation function (CCF) finds an ideal leading indicator (based on significant correlation at a lead of at least 1 time index, up to a max of 4) for the dependent variable (time indexes are measured in quarters). This results in a row offset of the variable/column by the ideal lead value, i.e. the predictor’s index is offset relative to the dependent variable being predicted; a lead of 3 means the predictor variable 3 quarters back is predictive of the dependent variable today.
I then compare 3 models.
  • ARIMA: a univariate model using autoregression (AR) on past values of the series, as well as a moving average (MA) of past errors (roughly a correction from the last value)
  • ARIMAX: a linear model using the predictor identified during CCF analysis, with the residuals of the linear model fit as an ARIMA model and added back to the linear model
  • ETS: an aggregate of many univariate time series models (including Holt-Winters), from which the best is derived
The models are compared using time series cross validation over a holdout period. The model resulting in the lowest prediction error is chosen to do the forecast.
The final forecasted values are reconstructed back to their non-differenced values (using a custom function whose name begins with nv_…) and drawn with their respective lower and upper confidence intervals beyond the last date.
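A sketch of the underlying cumulative-sum idea for undoing a single non-seasonal first difference (the actual nv_… function is custom and not shown here):

```python
from itertools import accumulate

# Each forecast step adds the predicted change onto the last observed level,
# reconstructing level forecasts from forecasts on the differenced scale.

def undo_diff(last_level, diffs):
    return list(accumulate(diffs, initial=last_level))[1:]

levels = [100, 104, 103, 108]                        # observed series (toy data)
diffs = [b - a for a, b in zip(levels, levels[1:])]  # differenced: [4, -1, 5]
forecast_diffs = [2, 3]                  # model output on the differenced scale
forecast_levels = undo_diff(levels[-1], forecast_diffs)  # back to levels
```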
The best model for LA condo prices was the ARIMAX model which shows a relationship with predictor term INUV.
Note: A concern I have with ARIMAX is that auto.arima doesn’t include a constant term, so the model is built without one. If the values are differenced, then they are stationary, which also presumes a mean of 0. However, it’s generally considered best practice to include a constant term. When I did so, auto.arima would sometimes fail to converge during cross-validation, so I pulled the constant term and simply let the best model win on error score alone.
The histograms and model plots show the basic model assumptions for linear regression are met (as stated earlier, ARIMAX is a linear model with an ARIMA model fit on the linear model’s residuals and added back, i.e. it auto-adjusts the error term).

There are four assumptions associated with a linear regression model:

  1. Linearity: The relationship between X and the mean of Y is linear.
  2. Homoscedasticity: The variance of residual is the same for any value of X.
  3. Independence: Observations are independent of each other.
  4. Normality: For any fixed value of X, Y is normally distributed.