A Junior Quant's Guide to Prediction [Code Included]

The prediction game is tough, except for when it's not.

Sep 29, 2024

As a quant, trader, or even just a regular market participant, your job is simple: to predict things.

Sure, you also have to deal with data preparation, risk management, and a million other things, but at the end of the day, you have to make a prediction.

As modern humans, we’ve pretty much perfected the art of prediction (heart attack risk, the likelihood of loan defaults, text that accurately responds to a given prompt) everywhere except in finance.

Take a look at this new study, where even traders who were given inside information in advance struggled to break 50/50: ‘Crystal Ball’ Breaks as Traders Fail to Get Rich in New Study

Most market participants have resigned themselves to the idea that it all comes down to a coin flip, but of course, quantitative investing wouldn’t be a multi-billion-dollar industry if that were the case.

So, today, we’ll be doing a deep dive into the nuts and bolts of actionable, quantitative predictions, allowing you to see that it’s a much bigger game than just predicting a 50/50 up or down.

Once Upon a Time, There Was a Regression Model

We want to start simple, so we’ll start with the simplest form of model — regression.

Regression models are simple: lower/higher values of X, our input, lead to lower/higher values of Y, our output. Easy.

While these are the “trust me bro, the final value will be somewhere around this general area, probably, more likely than not…” of models, that’s not a bug, it’s a feature.

To see why, let’s quickly scrape together a regression model that we can use to predict SPX returns.

But first, we must address the most crucial core concept in predictive modeling:

Garbage In, Garbage Out.

A model is only as good as the underlying data we give it, so it’s crucial to make sure that our features (inputs) make solid fundamental sense.

So, on that thread, think about which variables have a sound rationale for predicting our target, in this case S&P 500 returns.

Seriously, think for a second about what things might be drivers of future returns. For each variable you think of, try to iron out the rationale of why it makes sense.

One option is the VIX index:

The VIX is largely derived from how expensive out-of-the-money options on SPX get:
- If investors are paying more for protection, it’s likely because something bad has happened, is happening, or will happen.
- If investors are paying significantly less for protection, it’s a sign that the bad times are over, or at least perceptions of the bad times are easing.

If we want to model that simple relationship (higher expected volatility = lower expected future SPX returns), a linear model would be perfect.

So, to start, we’ll create just 2 features from the VIX index: the 1-day return and the daily value of the index.

- The 1-day return represents how much the VIX changed on that_day.
- The daily value represents the value of the VIX on that_day.
- that_day is defined as the given date.

Our target will be the return of the S&P 500 the next day.

Now, returns are not normally distributed, and we might screw with our model’s head if the VIX reaches 100 but the actual next-day return is only -0.03%, so we’ll convert this into a binary classification task.

So, if the next day’s return was positive, we convert that value into a 1; if it was negative (or flat), we convert it into a 0.
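To make the setup above concrete, here is a minimal sketch of how the two VIX features and the binary target could be lined up. The function name `build_dataset` and the toy price lists are hypothetical and only there for illustration; the post does not show its own data pipeline, so treat this as one possible arrangement under the definitions above rather than the actual code.

```python
import numpy as np

def build_dataset(vix_close, spx_close):
    """Return (features, target) where row i only uses data known on that_day."""
    vix = np.asarray(vix_close, dtype=float)
    spx = np.asarray(spx_close, dtype=float)

    vix_ret = vix[1:] / vix[:-1] - 1.0   # VIX 1-day return for days 1..n-1
    spx_ret = spx[1:] / spx[:-1] - 1.0   # SPX 1-day return for days 1..n-1

    # Features for days 1..n-2: [VIX 1-day return, VIX level].
    # The final day is dropped because its next-day return is not known yet.
    features = np.column_stack([vix_ret[:-1], vix[1:-1]])

    # Target: 1 if the S&P 500 return on the following day was positive,
    # 0 if it was negative or flat.
    target = (spx_ret[1:] > 0).astype(int)
    return features, target

# Toy prices (made up): VIX jumps then falls, SPX falls then recovers.
X, y = build_dataset([20.0, 22.0, 19.8, 19.8],
                     [5000.0, 5010.0, 4960.0, 5000.0])
print(X)
print(y)
```

The one detail worth double-checking in any version of this is the alignment: the target for a given row has to be the return of the day after the features were observed, otherwise the model is quietly peeking at the answer.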

Here’s what this data will look like:

As you can see in this snippet, when the VIX goes up (daily return > 0), the next-day return tends to be negative (0); when it goes down (daily return < 0), the next-day return tends to be positive (1).

Once we have our dataset, we’ll deploy walk-forward testing:

This is essentially going forward 1 day at a time, training the model on data only available on/before that day, then passing in that day’s data to get a prediction for the next.
- As to the ideal training size for each day, we personally prefer 252 prior samples, with a max of 504.
  - There are 252 trading days per year, so 252 ensures your data has enough samples to establish relationships, while still being relevant enough to predict contemporary data. You don’t want to use data from 2013 to predict a 2024 outcome just because you’ll have more data points.
This helps us get a better idea of how our model would’ve performed in real-life since it is the format we would actually deploy in production.
- Compared to methods common in other sciences, such as k-fold cross-validation, we won’t be predicting hundreds of values at once, especially since each value is a future point in time. In production, we would only have the data before today, and we would pass in just today’s data for tomorrow’s prediction.
  - We’re not ruling out cross-validation, as it can still be effective in gauging model skill, but walk-forward testing keeps things as realistic as possible.
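The walk-forward procedure described above can be sketched roughly as follows. Two assumptions to flag loudly: the post does not say which classifier it uses, so the small gradient-descent logistic regression here (`fit_logistic`, `predict_proba`) is a hypothetical stand-in chosen only to keep the example self-contained, and the window is a fixed rolling 252 rows rather than anything that grows toward the 504 maximum.

```python
import numpy as np

def fit_logistic(X, y, lr=0.1, n_iter=500):
    """Plain gradient-descent logistic regression (stand-in for any classifier)."""
    mu, sigma = X.mean(axis=0), X.std(axis=0)
    sigma = np.where(sigma == 0, 1.0, sigma)        # guard against constant columns
    Xs = np.column_stack([np.ones(len(X)), (X - mu) / sigma])
    w = np.zeros(Xs.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(Xs @ w)))         # predicted P(next day is up)
        grad = Xs.T @ (p - y) / len(y)
        w -= lr * grad
    return w, mu, sigma

def predict_proba(model, x_row):
    """Probability of class 1 for a single feature row."""
    w, mu, sigma = model
    z = w[0] + ((x_row - mu) / sigma) @ w[1:]
    return 1.0 / (1.0 + np.exp(-z))

def walk_forward(X, y, train_size=252):
    """For each day t, train on the prior `train_size` rows, then predict row t."""
    preds = {}
    for t in range(train_size, len(X)):
        # Rows t-train_size .. t-1 only: their next-day labels are already
        # known on day t, so nothing from the future leaks into training.
        model = fit_logistic(X[t - train_size:t], y[t - train_size:t])
        preds[t] = int(predict_proba(model, X[t]) > 0.5)
    return preds
```

In practice you would likely swap the hand-rolled model for a library classifier; the part that matters is the loop, which refits on a trailing window every day and never lets row t or anything after it into the training slice.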

After running the model, we’ll run a few performance metrics to see how well it was able to capture this relationship.
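The post does not name the metrics at this point, so the following is only a guess at a sensible starting set for an up/down classifier: accuracy, the share of up days (the base rate), accuracy of a naive "always pick the majority class" baseline, and how often an "up" call was actually right. The function name `hit_rate_report` and the sample labels are hypothetical.

```python
def hit_rate_report(y_true, y_pred):
    """Basic scorecard for binary up (1) / down (0) predictions."""
    n = len(y_true)
    accuracy = sum(int(t == p) for t, p in zip(y_true, y_pred)) / n
    base_rate = sum(y_true) / n                     # share of days that were up
    naive_accuracy = max(base_rate, 1 - base_rate)  # always guess the majority class
    called_up = [t for t, p in zip(y_true, y_pred) if p == 1]
    precision_up = sum(called_up) / len(called_up) if called_up else float("nan")
    return {
        "accuracy": accuracy,
        "base_rate": base_rate,
        "naive_accuracy": naive_accuracy,
        "edge_over_naive": accuracy - naive_accuracy,
        "precision_up": precision_up,
    }

# Made-up labels, purely to show the output shape.
report = hit_rate_report([1, 0, 1, 1, 0, 1, 0, 1],
                         [1, 0, 0, 1, 1, 1, 0, 1])
print(report)
```

Comparing against the naive baseline matters here because the index has historically had more up days than down days, so a model that always says "up" can look better than a coin flip without having learned anything.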

So, let’s see how it did:
