
The predictive power score

The predictive power score is a summary metric for predictive relations between data series. Like correlation, it is suitable for quick data exploration. Unlike correlation, it can work with non-linear relations, categorical data, and asymmetric relations, where variable A informs on variable B more than variable B informs on variable A. Technically, the score measures how well a Decision Tree model predicts a target variable from a single predictor variable, out-of-sample and relative to naïve approaches. For macro strategy development, predictive power score matrices can be created easily with an existing Python module. They can increase the efficiency of finding hidden patterns in the data and of selecting predictor variables.

Wetschoreck, Florian, “RIP correlation. Introducing the Predictive Power Score”, Towards Data Science, April 23 2020.
Kaggle: Predictive Power Score versus Correlation.
Github: ppscore – a Python implementation of the Predictive Power Score (PPS).

The following are quotes from the posts. Italic text and text in brackets have been added for clarity.

The post ties in with this site’s summary on quantitative methods for macro information efficiency.

In a nutshell

“The predictive power score is an asymmetric, data-type-agnostic score that can detect linear or non-linear relationships between two columns. The score ranges from 0 (no predictive power) to 1 (perfect predictive power).”

“The predictive power score is an alternative to the correlation that finds more patterns in your data…[It] detects linear and non-linear relationships. The predictive power score can be applied to numeric and categoric columns and it is asymmetric…We can think of the predictive power score as a framework for a family of scores…We proposed an implementation and open-sourced a Python package.”

The problem

“[For] correlation…the score ranges from -1 to 1 and indicates if there is a strong linear relationship — either in a positive or negative direction…However, there are many non-linear relationships that the score simply won’t detect. For example, a sinus wave, a quadratic curve or a mysterious step function. The score will just be 0, saying: ‘Nothing interesting here’.”

“Also, correlation is only defined for numeric columns. So [many researchers’ habitual response is] let’s drop all the categoric columns… And no, I won’t convert the columns because they are not ordinal and OneHotEncoding will create a matrix that has more values than there are atoms in the universe.”

“The correlation matrix is symmetric…Symmetry means that the correlation is the same whether you calculate the correlation of A and B or the correlation of B and A. However, relationships in the real world are rarely symmetric. More often, relationships are asymmetric. Here is an example: The last time I checked, my zip code of 60327 tells strangers quite reliably that I am living in Frankfurt, Germany. But when I only tell them my city, somehow they are never able to deduce the correct zip code…Another example is this: a column with 3 unique values will never be able to perfectly predict another column with 100 unique values. But the opposite might be true. Clearly, asymmetry is important because it is so common in the real world.”

“[In] a typical quadratic relationship: the feature x is a uniform variable…and the target y is the square of x plus some error. In this case, x can predict y very well because there is a clear non-linear, quadratic relationship…However, this is not true in the other direction from y to x. For example, if y is 4, it is impossible to predict whether x was roughly 2 or -2. Thus, the predictive relationship is asymmetric and the scores should reflect this… If you don’t already know what you are looking for, the correlation will leave you hanging because the correlation is 0.”
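The quadratic example can be checked with a few lines of plain Python. This is only a sketch on synthetic data: the sample size, the range of x, and the error size are assumptions, and the helper `pearson` is written here for illustration.

```python
import random

def pearson(a, b):
    """Plain Pearson correlation coefficient of two equally long sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    var_a = sum((u - mean_a) ** 2 for u in a)
    var_b = sum((v - mean_b) ** 2 for v in b)
    return cov / (var_a * var_b) ** 0.5

random.seed(0)
# Assumed setup: x uniform on [-2, 2], y is the square of x plus a small error.
x = [random.uniform(-2.0, 2.0) for _ in range(10_000)]
y = [v * v + random.gauss(0.0, 0.1) for v in x]

corr_xy = pearson(x, y)                      # close to 0 although y depends on x
corr_abs = pearson([abs(v) for v in x], y)   # strongly positive once the sign of x is dropped
print(round(corr_xy, 3), round(corr_abs, 3))
```

The near-zero correlation reflects the symmetry of the parabola around zero, not an absence of information in x.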

“The correlation matrix…leaves out many interesting relationships [particularly by leaving out categoric variables and ignoring asymmetric relations].”

The idea

“A score [should] tell if there is any relationship between two columns — no matter if the relationship is linear, non-linear, gaussian…Of course, the score should be asymmetric…The score should be 0 if there is no relationship and the score should be 1 if there is a perfect relationship. And…the score should be able to handle categoric and numeric columns out of the box.”

“Let’s say we have two columns and want to calculate the predictive power score of A predicting B. In this case, we treat B as our target variable and A as our (only) feature. We can now calculate a cross-validated Decision Tree and calculate a suitable evaluation metric. When the target is numeric we can use a Decision Tree Regressor and calculate the Mean Absolute Error (MAE). When the target is categoric, we can use a Decision Tree Classifier and calculate the weighted F1.”

N.B.: Decision Trees are a non-parametric supervised learning method used for classification and regression. They create a model that predicts a target variable by learning simple decision rules through data features. See StatQuest video on Decision Trees here.

“We need to ‘normalize’ our evaluation score. And how do you normalize a score? You define a lower and an upper limit and put the score into perspective…What should the lower and upper limit be? Let’s start with the upper limit because this is usually easier: a perfect F1 is 1. A perfect Mean Absolute Error is 0. [Next] we need to calculate a score for a very naive model. But what is a naive model? For a classification problem, always predicting the most common class is pretty naive. For a regression problem, always predicting the median value is pretty naive.”
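For a numeric target, the recipe quoted above can be sketched with scikit-learn. This is an illustration under assumptions, not the ppscore package's code: the function `pps_numeric`, the four folds, the default tree settings, the in-sample median baseline, and the synthetic quadratic data are all choices made here. Normalizing the MAE between the naive value and the perfect value of 0 gives 1 minus the ratio of the two errors, floored at 0.

```python
import numpy as np
from sklearn.model_selection import cross_val_predict
from sklearn.tree import DecisionTreeRegressor

def pps_numeric(feature, target, folds=4):
    """Sketch of the score for a numeric target: 1 - MAE(tree) / MAE(median guess)."""
    # Out-of-sample predictions from a cross-validated tree with a single feature.
    predicted = cross_val_predict(
        DecisionTreeRegressor(random_state=0), feature.reshape(-1, 1), target, cv=folds
    )
    mae_model = np.mean(np.abs(target - predicted))
    # Naive model: always predict the median (computed on the full sample for simplicity).
    mae_naive = np.mean(np.abs(target - np.median(target)))
    # Floor at 0 so a tree that is worse than the naive guess scores 0.
    return max(0.0, 1.0 - mae_model / mae_naive)

rng = np.random.default_rng(0)
x = rng.uniform(-2.0, 2.0, size=2000)
y = x**2 + rng.normal(scale=0.1, size=2000)  # square of x plus some error

pps_x_to_y = pps_numeric(x, y)  # high: x pins down y up to the error
pps_y_to_x = pps_numeric(y, x)  # near 0: y cannot tell whether x was positive or negative
print(round(pps_x_to_y, 2), round(pps_y_to_x, 2))
```

On this synthetic data the two directions should differ sharply, which is the asymmetry that a single correlation coefficient cannot express.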

N.B.: The F1 score evaluates the performance of a classification model. It is the harmonic mean of precision and recall. Precision is the share of predicted positives that are correct: the number of true positives divided by the number of true positives plus the number of false positives. Recall refers to the ability of a model to find all the relevant cases within a dataset. It is the number of true positives divided by the number of true positives plus the number of false negatives. The F1 score reaches its best value at 1 and worst at 0. See Data Science Dojo video on these concepts here.
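As a small worked example, precision, recall and F1 for a single class can be computed from prediction counts. The counts and the function name `f1_from_counts` are made up for illustration. The weighted F1 used by the score averages such per-class F1 values, weighting each class by its frequency in the data.

```python
def f1_from_counts(true_pos, false_pos, false_neg):
    """Precision, recall and F1 of one class from its prediction counts."""
    precision = true_pos / (true_pos + false_pos)  # share of predicted positives that are right
    recall = true_pos / (true_pos + false_neg)     # share of actual positives that are found
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two
    return precision, recall, f1

# Hypothetical counts: 8 hits, 2 false alarms, 4 misses.
precision, recall, f1 = f1_from_counts(true_pos=8, false_pos=2, false_neg=4)
print(round(precision, 3), round(recall, 3), round(f1, 3))
```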

“[In] the example of the zip codes and the city name…both columns are categoric. [If] we want to calculate the predictive power score of zip code to city…we use the weighted F1 score because city is categoric. Our cross-validated Decision Tree Classifier achieves a score of 0.95 F1. We calculate a baseline score via always predicting the most common city and achieve a score of 0.1 F1. If you normalize the score, you will get a final predictive power score of 0.94 after applying the following normalization formula: (0.95–0.1) / (1–0.1)…[A] predictive power score of 0.94 is rather high, so the zip code seems to have a good predictive power towards the city. However, if we calculate the predictive power score in the opposite direction, we might achieve a predictive power score of close to 0 because the Decision Tree Classifier is not substantially better than just always predicting the most common zip code.”
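The normalization in the zip code example takes only a few lines. The function name `pps_from_f1` is hypothetical, and flooring the result at 0 when the model does not beat the baseline is an assumption made here, in line with the stated range of 0 to 1.

```python
def pps_from_f1(f1_model, f1_naive):
    """Rescale a weighted F1 so the naive baseline maps to 0 and a perfect F1 to 1."""
    if f1_model <= f1_naive:
        return 0.0  # assumed floor: no gain over the naive model means no predictive power
    return (f1_model - f1_naive) / (1.0 - f1_naive)

pps_zip_to_city = pps_from_f1(0.95, 0.10)  # numbers quoted in the example above
pps_no_gain = pps_from_f1(0.10, 0.10)      # a model that only matches the baseline
print(round(pps_zip_to_city, 2), pps_no_gain)
```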

“Looking at the predictive power score matrix, we can see effects that might be explained by causal chains.”

“Although the predictive power score has many advantages over the correlation, there is some drawback: it takes longer to calculate…[Also] you cannot compare the scores for different target variables in a strict mathematical way because they are calculated using different evaluation metrics.”

How to apply the predictive power score

“Find patterns in the data: The predictive power score finds every relationship that the correlation finds — and more. Thus, you can use the predictive power score matrix as an alternative to the correlation matrix to detect and understand linear or nonlinear patterns in your data. This is possible across data types using a single score that always ranges from 0 to 1.”

“Feature selection:…You can use the predictive power score to find good predictors for your target column. Also, you can eliminate features that just add random noise. Those features sometimes still score high in feature importance metrics. In addition, you can eliminate features that can be predicted by other features because they do not add new information. Besides, you can identify pairs of mutually predictive features in the predictive power score matrix — this includes strongly correlated features but will also detect non-linear relationships.”

“Detect information leakage: Use the predictive power score matrix to detect information leakage between variables — even if the information leakage is mediated via other variables.”
N.B.: Information leakage is a problem in machine learning that arises when information from outside the training dataset is used to create the model.

“Data Normalization: Find entity structures in the data via interpreting the predictive power score matrix as a directed graph. This might be surprising when the data contains latent structures that were previously unknown. For example: the TicketID in the Titanic dataset is often an indicator for a family.”

A simple example in Python

The Github page for the ‘ppscore package’ can be found here.

We apply the ppscore functions simply to a set of monthly market returns since 2000 contained in the dataframe dfx, specifically S&P500 future returns (USD_EQ_XR), Eurostoxx future returns (EUR_EQ_XR), EURUSD 1-month forward returns (EUR_FX_XR), 5-year USD interest rate swap receiver returns (USD_IRS_5_XR), 5-year EUR interest rate swap receiver returns (EUR_IRS_5_XR), Brent crude oil front future return (BRT_CO_XR), and the gold future return (GLD_CO_XR).

“The function pps.score(df, x, y…) calculates the Predictive Power Score (PPS) for ‘x predicts y’. The score always ranges from 0 to 1 and is data-type agnostic. A score of 0 means that the column x cannot predict the column y better than a naive baseline model. A score of 1 means that the column x can perfectly predict the column y given the model. A score between 0 and 1 states the ratio of how much potential predictive power the model achieved compared to the baseline model.”

For example, we can assess whether the S&P500 futures return predicts the Eurostoxx futures return in the same month. The function chooses the decision tree regression model and gives this prediction a score of roughly 0.1.

A positive relation between U.S. and European equity returns is plausible. However, predicting the exact size of the latter from the former accurately enough to reduce the forecast error significantly remains challenging.

It is easier to predict the direction of European stock returns with the knowledge of the performance of U.S. stocks. In this case, the predictive power score uses a Decision Tree Classifier and achieves a predictive power score of 0.67.
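The direction-based calculation can be sketched with scikit-learn on synthetic data. This is not the post's dataframe dfx and not the ppscore package's code. The two series, the function `sign_pps`, the four folds, and the majority-class baseline (as in the quoted description) are assumptions for illustration, so the resulting number is not comparable with the 0.67 above.

```python
import numpy as np
from sklearn.metrics import f1_score
from sklearn.model_selection import cross_val_predict
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
n = 2000
# Synthetic stand-ins for two positively correlated return series.
common = rng.normal(size=n)
ret_a = common + 0.5 * rng.normal(size=n)
ret_b = common + 0.5 * rng.normal(size=n)

# Reduce both series to their direction: 1 for a positive return, 0 otherwise.
up_a = (ret_a > 0).astype(int)
up_b = (ret_b > 0).astype(int)

def sign_pps(feature, target):
    """Normalized weighted F1 of a cross-validated tree versus a majority-class guess."""
    predicted = cross_val_predict(
        DecisionTreeClassifier(random_state=0), feature.reshape(-1, 1), target, cv=4
    )
    f1_model = f1_score(target, predicted, average="weighted")
    # Naive model: always predict the most common class of the target.
    majority = np.full_like(target, np.bincount(target).argmax())
    f1_naive = f1_score(target, majority, average="weighted", zero_division=0)
    return max(0.0, (f1_model - f1_naive) / (1.0 - f1_naive))

pps_sign = sign_pps(up_a, up_b)
print(round(pps_sign, 2))
```

Turning returns into signs changes the target from numeric to categorical, which is why the classifier and the weighted F1 replace the regressor and the mean absolute error.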

The key practical advantage of predictive power scores is the ability to glance quickly at a broad range of relations, for example in matrix form. For comparison, we plotted three relationship matrices for the above-mentioned set of monthly market returns from 2000 to 2020, using the seaborn module.

The first is the standard symmetric correlation matrix. It shows high positive correlation within asset classes (82% for equity and 70% for rates) but also positive correlation between the euro, gold and crude (since all are denominated in dollars) and 30-35% negative correlation between equity futures and IRS receivers.

The second is a predictive power score matrix based on exact returns. Note that the way to read the matrix is that the variable on the x-axis is used to predict the corresponding variable on the y-axis.

It shows that out-of-sample prediction of exact returns would not be easy, even if one had known the exact contemporaneous returns in other markets. The only predictive relations that would have accomplished some reduction in mean absolute errors are those between U.S. and European equity futures and between dollar and euro IRS receivers.

The third is a predictive power score matrix based on the sign of monthly returns. It suggests that the prediction of market direction is much easier when knowing the direction of contemporaneous returns in other markets.

From 2000 to 2020 most market return relations have been symmetric. But not all. For example, IRS returns have been a good contemporaneous predictor of the direction of EURUSD FX forward returns, but EURUSD forward returns have not been a good predictor of IRS returns. Similarly, one could have predicted the direction of European equity returns with knowledge of IRS returns but not vice versa.


