Panel regression with JPMaQS #
In this notebook, we show how to apply panel regression models to macro-quantamental datasets. We will leverage the
statsmodels
and
linearmodels
packages in Python, cross-referencing the R package ‘plm’ that is widely used in academic research. In particular, we show the application of pooled regression, fixed-effects regression, random-effects regression, linear mixed-effects models, and seemingly unrelated regressions.
Largely, only the standard packages in the Python data science stack are required to run this notebook. The specialized
macrosynergy
package is also needed to download JPMaQS data and for quick analysis of quantamental data and value propositions.
The notebook is structured into the following key sections:
-
Getting Packages and JPMaQS Data: In this section, we guide you through the process of installing and importing the necessary Python packages as well as formatting the dataset for further analysis.
-
The importance of panel analysis: this section discusses why specific models are needed for macroeconomic panel analysis.
-
Pooled regression: the simplest form of regression, which does not consider any group-specific effects.
-
Fixed-effects regression: the model accounts for individual/group-specific effects.
-
Random effects regression: the model considers group-specific effects, assuming that they are random variables.
-
Linear mixed-effects model: the model handles both fixed and random effects.
-
Seemingly unrelated regressions: SUR models are used when you have multiple equations with correlated errors.
Get packages and JPMaQS data #
# Uncomment below if running on Kaggle
"""
%%capture
! pip install macrosynergy --upgrade"""
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib as mpl
from matplotlib import pyplot as plt
import os
from linearmodels import PooledOLS, PanelOLS
from linearmodels.panel import RandomEffects, compare
from linearmodels.system import SUR
import statsmodels.api as sm
import statsmodels.graphics.tsaplots as tsap
import statsmodels.formula.api as smf
from macrosynergy.download import JPMaQSDownload
%matplotlib inline
import macrosynergy.panel as msp
import warnings
warnings.simplefilter("ignore")
The JPMaQS indicators we consider are downloaded using the J.P. Morgan DataQuery API interface within the
macrosynergy
package. This is done by specifying ticker strings, formed by appending an indicator category code to a currency area code, and requesting expressions of the form
DB(JPMAQS,<cross_section>_<category>,<info>)
, where
<info>
is one of the following:
value
giving the latest available values for the indicator
eop_lag
referring to days elapsed since the end of the observation period
mop_lag
referring to the number of days elapsed since the mean observation period
grade
denoting a grade of the observation, giving a metric of real-time information quality.
After instantiating the
JPMaQSDownload
class within the
macrosynergy.download
module, one can use the
download(tickers,start_date,metrics)
method to easily download the necessary data, where
tickers
is an array of ticker strings,
start_date
is the first collection date to be considered and
metrics
is an array comprising the time series information to be downloaded. For more information see
here
or use the free dataset on
Kaggle
To ensure reproducibility, only samples between January 2000 (inclusive) and May 2023 (exclusive) are considered.
# Lists of cross-sections (countries with IRS markets and appropriate data)
cids_dm = ["AUD", "CAD", "CHF", "EUR", "GBP", "JPY", "NOK", "NZD", "SEK", "USD"]
cids_em = [
"CLP",
"COP",
"CZK",
"HUF",
"IDR",
"ILS",
"INR",
"KRW",
"MXN",
"PLN",
"THB",
"TRY",
"TWD",
"ZAR",
]
cids = cids_dm + cids_em
cids_du = cids_dm + cids_em
cids_dux = list(set(cids_du) - set(["IDR", "NZD"]))
cids_xg2 = list(set(cids_dux) - set(["EUR", "USD"]))
# Lists of main quantamental and return categories
main = [
"CPIC_SA_P1M1ML12",
"CPIC_SJA_P3M3ML3AR",
"CPIC_SJA_P6M6ML6AR",
"CPIH_SA_P1M1ML12",
"CPIH_SJA_P3M3ML3AR",
"CPIH_SJA_P6M6ML6AR",
"INFTEFF_NSA",
"INTRGDP_NSA_P1M1ML12_3MMA",
"INTRGDPv5Y_NSA_P1M1ML12_3MMA",
"PCREDITGDP_SJA_D1M1ML12",
"RGDP_SA_P1Q1QL4_20QMA",
"RYLDIRS02Y_NSA",
"RYLDIRS05Y_NSA",
"PCREDITBN_SJA_P1M1ML12",
]
rets = [
"DU02YXR_NSA",
"DU05YXR_NSA",
"DU02YXR_VT10",
"DU05YXR_VT10",
"EQXR_NSA",
"EQXR_VT10",
"FXXR_NSA",
"FXXR_VT10",
]
xcats = main + rets
The description of each JPMaQS category is available under Macro Quantamental Academy , JPMorgan Markets (password protected), or on Kaggle (only for the tickers used in this notebook). In particular, this notebook uses Consumer price inflation trends , Inflation targets , Intuitive growth estimates , Domestic credit ratios , Long-term GDP growth , Real interest rates , Private credit expansion , Duration returns , Equity index future returns , and FX forward returns .
# Download series from J.P. Morgan DataQuery by tickers
start_date = "2000-01-01"
end_date = "2023-05-01"
tickers = [cid + "_" + xcat for cid in cids for xcat in xcats]
print(f"Maximum number of tickers is {len(tickers)}")
# Retrieve credentials
client_id: str = os.getenv("DQ_CLIENT_ID")
client_secret: str = os.getenv("DQ_CLIENT_SECRET")
with JPMaQSDownload(client_id=client_id, client_secret=client_secret) as dq:
    df = dq.download(
        tickers=tickers,
        start_date=start_date,
        end_date=end_date,
        suppress_warning=True,
        metrics=["all"],
        report_time_taken=True,
        show_progress=True,
    )
Maximum number of tickers is 528
Downloading data from JPMaQS.
Timestamp UTC: 2023-09-19 15:37:31
Connection successful!
Number of expressions requested: 2112
Requesting data: 100%|██████████| 106/106 [00:36<00:00, 2.94it/s]
Downloading data: 100%|██████████| 106/106 [00:14<00:00, 7.20it/s]
Time taken to download data: 51.96 seconds.
Time taken to convert to dataframe: 43.27 seconds.
Average upload size: 0.20 KB
Average download size: 325288.51 KB
Average time taken: 9.17 seconds
Longest time taken: 12.20 seconds
Average transfer rate : 283836.53 Kbps
# uncomment if running on Kaggle
"""for dirname, _, filenames in os.walk('/kaggle/input'):
for filename in filenames:
print(os.path.join(dirname, filename))
df = pd.read_csv('../input/fixed-income-returns-and-macro-trends/JPMaQS_Quantamental_Indicators.csv', index_col=0, parse_dates=['real_date'])"""
display(df["xcat"].unique())
display(df["cid"].unique())
df["ticker"] = df["cid"] + "_" + df["xcat"]
# df.head(3)
array(['CPIC_SA_P1M1ML12', 'CPIC_SJA_P3M3ML3AR', 'CPIC_SJA_P6M6ML6AR',
'CPIH_SA_P1M1ML12', 'CPIH_SJA_P3M3ML3AR', 'CPIH_SJA_P6M6ML6AR',
'FXXR_NSA', 'FXXR_VT10', 'INFTEFF_NSA',
'INTRGDP_NSA_P1M1ML12_3MMA', 'INTRGDPv5Y_NSA_P1M1ML12_3MMA',
'PCREDITBN_SJA_P1M1ML12', 'PCREDITGDP_SJA_D1M1ML12',
'RGDP_SA_P1Q1QL4_20QMA', 'RYLDIRS02Y_NSA', 'RYLDIRS05Y_NSA',
'DU02YXR_NSA', 'DU02YXR_VT10', 'DU05YXR_NSA', 'DU05YXR_VT10',
'EQXR_NSA', 'EQXR_VT10'], dtype=object)
array(['AUD', 'CAD', 'CHF', 'CLP', 'COP', 'CZK', 'EUR', 'GBP', 'HUF',
'IDR', 'ILS', 'INR', 'JPY', 'KRW', 'MXN', 'NOK', 'NZD', 'PLN',
'SEK', 'THB', 'TRY', 'TWD', 'USD', 'ZAR'], dtype=object)
Reformatting the dataset #
We extract a sub-dataframe used for most of the examples below. The key categories used for most models in this notebook are the following:
-
CPIC_SJA_P6M6ML6AR
- Adjusted latest core consumer price trend: % 6m/6m ar
-
DU05YXR_VT10
- Return on fixed receiver position, % of risk capital on position scaled to 10% (annualized) volatility target, assuming monthly roll: 5-year maturity
The choice of variables reflects the simple idea that inflation should have a negative impact on next period’s duration returns. The dataset is the standard data set used for tutorials on JPMaQS-related methods and is available on Kaggle .
# Prepare for filtering
cids = ["AUD", "CAD", "EUR", "GBP", "JPY", "USD"] # only selected cross-section
for_feats = [
"CPIC_SJA_P6M6ML6AR"
] # categories (building blocks) for feature calculation
targs = ["DU05YXR_VT10"] # target returns
start_date = "2000-01-01" # earliest sample date
# Create filters
filt1 = df["xcat"].isin(for_feats) # filter for feature variables
filt2 = df["real_date"] >= pd.to_datetime(start_date) # filter for start date
filt3 = df["cid"].isin(cids) # filter for market
filt4 = df["xcat"].isin(targs)
# Apply filters
dfx_feats = df[filt1 & filt2 & filt3] # data frame for feature variables
dfx_target = df[filt4 & filt2 & filt3] # data frame for target variables
Then we downsample the frequency of the time series and lag the features. The latter aligns past features with present targets, i.e. allows us to assess the predictive power of feature panels.
# Frequency conversion of features and targets to monthly means and sums respectively
scols = ["real_date", "cid", "xcat", "value"]
dfx_feats = dfx_feats[scols]
dfqx_feats = (
dfx_feats.groupby(["cid", "xcat"]).resample("M", on="real_date").mean()["value"]
)
dfqx_feats = dfqx_feats.reset_index()
dfqx_feats = dfqx_feats.dropna()
scols = ["real_date", "cid", "xcat", "value"]
dfx_target = dfx_target[scols]
dfqx_target = (
dfx_target.groupby(["cid", "xcat"]).resample("M", on="real_date").sum()["value"]
)
dfqx_target = dfqx_target.reset_index().dropna()
# Lag features (past values assigned to today's date)
dfqx_feats["value_lag"] = dfqx_feats.groupby(["cid", "xcat"])["value"].shift(1)
dfqx_feats = dfqx_feats.dropna()
# Pivot features and target dataframes to wide time series panel formats
dfqx_feats = dfqx_feats.pivot(
index=["cid", "real_date"], columns="xcat", values="value_lag"
)
dfqx_target = dfqx_target.pivot(
index=["cid", "real_date"], columns="xcat", values="value"
).dropna()
# Merge to joint features/targets dataframe
dfx_pan = pd.merge(
dfqx_target.reset_index(),
dfqx_feats.reset_index(),
on=["cid", "real_date"],
how="outer",
).set_index(["cid", "real_date"])
dfx_pan = dfx_pan[(dfx_pan != 0).all(1)]
dfx_pan = dfx_pan.dropna()
display(dfx_pan.head(5))
xcat              DU05YXR_VT10  CPIC_SJA_P6M6ML6AR
cid real_date
AUD 2001-07-31        1.699832            2.203520
    2001-08-31        6.972805            2.409941
    2001-09-30        2.483685            3.338835
    2001-10-31        3.285013            3.338835
    2001-11-30       -4.312912            3.416523
The importance of panel analysis #
Panels of time series are datasets that are organized in (i) rows related to points in time or periods, and (ii) columns of cross-sections. Panel data contain information on individual cross-sections and on group developments over time. They are a two-dimensional hybrid of time series and cross-sectional information. For panel regression analysis we typically consider multiple panels of the same structure, just as we consider multiple time series for simple time series regression.
JPMaQS quantamental indicators are principally organized in panels: one type of indicator (category) is tracked over time across a range of countries, currencies or markets. Since JPMaQS data are all daily information states, its panels or combinations thereof can easily be investigated with respect to their predictive power.
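As a quick check (a minimal sketch, not part of the original workflow), the dataframe built above can be inspected to confirm that it has the two-level entity/period index that the panel estimators used below expect:
# Sketch: confirm the (entity, period) MultiIndex of the panel dataframe built above.
# Level 0 is the cross-section identifier ("cid"), level 1 the observation date ("real_date").
print(dfx_pan.index.names)
print(dfx_pan.index.get_level_values("cid").unique().tolist())
print(
    dfx_pan.index.get_level_values("real_date").min(),
    dfx_pan.index.get_level_values("real_date").max(),
)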
Research on panels is critically important for macro-quantamental financial market research. Since economic cycles take years to unfold, the quantity and diversity of data are limited, even if one looks back several decades. By contrast, combining the experiences of many countries or markets provides a much greater supply of historic information and faster growth in relevant datasets. It is often useful to test a trading strategy on as many markets as possible, even if one can only trade a subset, since the broader set gives statistical analyses greater power.
Moreover, the two-dimensional structure of panel data offers several specific advantages for empirical macro strategy research:
-
Panels allow assessing influences that predominantly vary across cross-sections , not over time, such as structural features of countries or government balance sheets.
-
Panels allow using a diversity of experiences across countries or markets . While there are common global factors, there are also many idiosyncratic developments, such as policy mistakes, crises, terms-of-trade shocks and so forth. A diverse set of time series makes empirical findings more robust. For example, Japan’s early deflation experience made it a valuable cross-section for all types of proposed macro trading strategies.
-
Panels are necessary to investigate the influence of relative macro factors across countries or markets . Relative macro factors are often critical for FX and cross-market strategies. Even if relative factors are not used directly, maybe because of the enhanced leverage and cost of relative positions, their empirical relevance often backs up the credibility of trading strategies based on outright factors. This is because relative signals often strip out common global factors and, thereby, increase the statistical power of the dataset. For example, if higher relative inflation predicts lower relative fixed-income returns across currency areas, this finding adds credibility to the proposition that higher inflation heralds weaker returns in general and has not just been the artefact of a few global events.
Panels support a special set of regression methods that respect - to varying degrees - their two-dimensional structure.
-
Pooled OLS is simply OLS regression performed on panel data. It ignores time and cross-sectional characteristics and focuses only on the dependencies between the variables. The implicit assumption is that relations depend neither on time nor on cross-sections (or on unobserved variables associated with either of them).
-
The fixed-effects model allows a separate constant for each cross-section, each time period (or combination of both). The implicit assumption is that there has been a structural difference in the analyzed relation associated with the cross-section.
-
The random-effects model also allows differences in the relation across cross-sections. However, model parameters, such as intercepts, are assumed to be random variables. Unlike fixed-effects models, random-effects models assume that cross-sectional effects are not structurally related to the explanatory variables and - hence - are not of interest in themselves, but only in their variance.
-
Mixed effects models combine the latter two. Depending on how we specify the model, we can replicate random effects using mixed effects (this is done later in the notebook).
-
Seemingly unrelated regressions (SUR) generalize the linear regression model to multiple cross-sectional equations, each having its own set of features. SUR allows using a different set of regressors for each cross-section, while exploiting the information from the other cross-sections.
In this notebook we use mainly the
linearmodels
module, which complements the standard
statsmodels
package with panel regression estimations.
Pooled regression #
The easiest way of analysing panel data is to simply pool the observations for all cross-sections and time periods into a one-dimensional dataset and then run a linear OLS regression, as one would for one-dimensional datasets. This method effectively disregards the two-dimensional structure of the dataset when formulating the underlying model.
While pooled regression is simple, it is based on a very restrictive model: all cross-sectional relations are assumed to be governed by the same intercept and slope coefficients. It does not use up many degrees of freedom and is therefore popular for smaller datasets. If the restrictive assumptions are appropriate, pooled regression makes the best use of the available data. Other key assumptions are that there is no correlation between the features and no autocorrelation of the error terms.
The Pooled OLS class of the
linearmodels
module provides “just plain OLS that understands various panel data structures”. The regression simply pools all observations for all markets and estimates one big OLS regression. However, it adds some information that is based on the panel structure.
X = sm.add_constant(dfx_pan["CPIC_SJA_P6M6ML6AR"])
lm_pooled = PooledOLS(dfx_pan.DU05YXR_VT10, X).fit(cov_type="clustered")
print(lm_pooled)
PooledOLS Estimation Summary
================================================================================
Dep. Variable: DU05YXR_VT10 R-squared: 0.0053
Estimator: PooledOLS R-squared (Between): -1.8826
No. Observations: 1600 R-squared (Within): 0.0061
Date: Tue, Sep 19 2023 R-squared (Overall): 0.0053
Time: 16:41:04 Log-likelihood -4261.8
Cov. Estimator: Clustered
F-statistic: 8.4524
Entities: 6 P-value 0.0037
Avg Obs: 266.67 Distribution: F(1,1598)
Min Obs: 219.00
Max Obs: 280.00 F-statistic (robust): 8.0930
P-value 0.0045
Time periods: 281 Distribution: F(1,1598)
Avg Obs: 5.6940
Min Obs: 0.0000
Max Obs: 6.0000
Parameter Estimates
======================================================================================
Parameter Std. Err. T-stat P-value Lower CI Upper CI
--------------------------------------------------------------------------------------
const 0.6162 0.1486 4.1453 0.0000 0.3246 0.9077
CPIC_SJA_P6M6ML6AR -0.2066 0.0726 -2.8448 0.0045 -0.3491 -0.0642
======================================================================================
First, we check the significance of core CPI trends, measured as percent change over the last 6 months versus the previous 6 months, seasonally and jump-adjusted.
The pooled regression diagnoses high significance of the 1-month lagged core CPI trend for duration returns. As should be expected, the relation between inflation and subsequent duration returns has been negative, and the probability of the relationship not being accidental has, according to the pooled model, been over 99% (indicated by the p-value being below 0.01).
The top part of the summary shows standard R-squared and two panel-specific R-squared measures:
-
R-squared (Between) measures how much of the target variance across cross-sections is accounted for by feature variance. Here that is the variation explained by the currency area for which we measure the return.
-
R-squared (Within) measures how much of the model variance is explained by variation of the explanatory variable within the cross-sections. It gives the goodness of fit for data that remove the cross-section mean.
-
R-squared (Overall) is a weighted average of the above two.
F-statistics and their p-values indicate the historic significance of the explanatory variables, but they do so under the assumption that all observations in the pool are independent. For panels of market data and macro quantamental data this is rarely the case. For correlated time series in a pool, this statistic overstates significance , by concatenating related time series and treating them as added independent observations. If data are heavily correlated, their use as a pool is often called pseudo-replication . A similar bias would apply to the coefficient p-values in the output table.
In our example, as expected, inflation is negatively correlated with subsequent returns, and this correlation is significant at the 1% level.
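To illustrate the pseudo-replication point, the sketch below (an illustrative check, not part of the original notebook) compares the coefficient standard errors under an unadjusted covariance estimator with those obtained when clustering by time period, which treats contemporaneous observations as related rather than independent:
# Sketch: standard errors that ignore the panel structure versus period-clustered ones.
se_compare = pd.DataFrame(
    {
        "unadjusted": PooledOLS(dfx_pan.DU05YXR_VT10, X)
        .fit(cov_type="unadjusted")
        .std_errors,
        "clustered_by_period": PooledOLS(dfx_pan.DU05YXR_VT10, X)
        .fit(cov_type="clustered", cluster_time=True)
        .std_errors,
    }
)
print(se_compare.round(4))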
A quick look at the residual errors and at the dispersion of the target variable will establish whether the assumptions of the OLS model hold. We are checking the fitted model for four key properties that impact the goodness of fit and the efficiency of the estimation:
-
Normality : The quantile-quantile “QQ” plot in the top left chart of the panel checks if the residual errors are normally distributed by plotting them against a theoretical distribution. This is important for the reliability of confidence interval estimates and assessments of significance. In the case of a normal distribution, the dots should lie on the diagonal line.
-
Cross-value homoskedasticity : The second plot investigates whether the estimation errors are homoskedastic, i.e. whether the variance of the residual errors is constant across all values of the features and, hence, the OLS estimation is efficient. If the red line is not horizontal and the errors change systematically with the feature values, the estimation has probably not made use of all information.
-
Target-error correlation : A high correlation between target values and regression errors indicates that our estimation does not explain much of the target variation. The information is the inverse of the R-squared value, and high correlation (low R-squared) is common for financial return series.
-
Cross-sectional homoskedasticity : The final chart checks if the targets have approximately similar means and standard deviations across cross-sections. If the means are very different, cross-section dummies may need to be considered in the form of fixed or random effects, if they can be theoretically justified. If the standard deviations are very different, normalization of the targets has to be considered.
sns.set(rc={"figure.figsize": (15, 8)})
fig, ([ax1, ax2], [ax3, ax4]) = plt.subplots(2, 2)
# left = -1.8 #x coordinate for text insert
sm.graphics.qqplot(lm_pooled.resids, line="45", fit=True, ax=ax1)
ax1.set_title("QQ plot")
top = ax1.get_ylim()[1] * 0.75
ax2.set_title("Scatterplot of residuals")
ax2.set_xlim((-3, 8))
sns.residplot(
x="CPIC_SJA_P6M6ML6AR",
y="DU05YXR_VT10",
data=dfx_pan,
lowess=True,
line_kws=dict(color="r"),
color="darkseagreen",
ax=ax2,
)
ax3.set_title("Raw residuals of Pooled OLS versus DU05YXR_VT10")
ax3.set_ylabel("Residual (y - mu)")
ax3.scatter(
dfx_pan.DU05YXR_VT10,
lm_pooled.resids,
s=4,
c="purple",
label="Residual Error",
)
dfx_pan["cid_dummy"] = dfx_pan.index.get_level_values(0).values
sns.boxplot(x="cid_dummy", y=lm_pooled.resids, data=dfx_pan, ax=ax4)
g = sns.stripplot(
x="cid_dummy", y=lm_pooled.resids, data=dfx_pan, palette="bright", ax=ax4
)
ax4.set_title("Pooled residuals by country")
fig.tight_layout()
plt.show()
Sometimes it is useful to ascertain that a relation holds roughly similarly across cross-sections (currency areas). The country-specific relations can be quickly visualized with Seaborn’s
lmplot
function.
grid = sns.lmplot(
x="CPIC_SJA_P6M6ML6AR",
y="DU05YXR_VT10",
col="cid_dummy",
sharex=False,
sharey=True,
aspect=2,
col_wrap=3,
data=dfx_pan,
height=2.5,
)
grid.tight_layout()
Fixed-effects regression #
Basics #
Fixed effects in panel models are influences that are specific to a cross-section or a time period (or both). Also, they are assumed to be constant. Fixed effects are conceptually viewed as unchangeable characteristics of the level, i.e. here of the cross-section or the observation period.
A good summary of the basic fixed-effects model can be found on Wikipedia
Generally, fixed effects are useful to mitigate omitted variable biases, i.e. incorrect attribution of explanatory or predictive power of an unobserved variable to an observed one. For prediction models omitted variable biases are not always a fault.
-
If correlation between observed and unobserved variables is stable, using the observed feature and its estimated influence for forecasts is a valid strategy.
-
However, if the correlation between observed and unobserved variables has been circumstantial and is not expected to persist, the omitted variable bias will mislead the forecaster. In financial market research this danger typically arises for semi-structural features. For example, long-term averages of currency returns may be driven by risk premia. However, if an unrelated slow-moving semi-structural indicator has also recorded long-term differences across currency areas over the sample period, a pooled regression may easily misinterpret it as a predictor of risk premia.
Using fixed effects models for prediction bears its own risks. The model is misleading if cross-sectional differences reflect unobserved features that are actually changing over time. This concern applies, for example, to regressions on currency area-specific returns. While returns may have displayed significant differences historically, due for example to different risk premia, they plausibly reflect the specific history of the currency area over the sample period. The assumption of a stable return outperformance of a currency area, unrelated to any observable feature, is typically unrealistic.
As a rule of thumb, fixed-effects models should not be used for temporary effects. Cross-section fixed effects would be misleading and period-specific effects would just be useless for predictions.
Using dummy variables and
statsmodels
#
Fixed effects can be estimated by adding cross-section- or period-specific dummy variables to the
PooledOLS
estimation, much as one would in a standard
statsmodels
OLS. The estimation below assumes that each currency area adds or subtracts a fixed premium to the target return.
X = [
"CPIC_SJA_P6M6ML6AR",
"cid_dummy",
]
X = sm.add_constant(dfx_pan[X])
sm_fe_cs = PooledOLS(dfx_pan.DU05YXR_VT10, X).fit(cov_type="clustered")
print(sm_fe_cs)
PooledOLS Estimation Summary
================================================================================
Dep. Variable: DU05YXR_VT10 R-squared: 0.0069
Estimator: PooledOLS R-squared (Between): 1.0000
No. Observations: 1600 R-squared (Within): 0.0064
Date: Tue, Sep 19 2023 R-squared (Overall): 0.0069
Time: 16:41:06 Log-likelihood -4260.5
Cov. Estimator: Clustered
F-statistic: 1.8363
Entities: 6 P-value 0.0886
Avg Obs: 266.67 Distribution: F(6,1593)
Min Obs: 219.00
Max Obs: 280.00 F-statistic (robust): 1.7062
P-value 0.1158
Time periods: 281 Distribution: F(6,1593)
Avg Obs: 5.6940
Min Obs: 0.0000
Max Obs: 6.0000
Parameter Estimates
======================================================================================
Parameter Std. Err. T-stat P-value Lower CI Upper CI
--------------------------------------------------------------------------------------
const 0.9753 0.3110 3.1357 0.0017 0.3652 1.5853
CPIC_SJA_P6M6ML6AR -0.2660 0.0874 -3.0447 0.0024 -0.4374 -0.0946
cid_dummy.CAD -0.3074 0.3056 -1.0058 0.3147 -0.9069 0.2921
cid_dummy.EUR -0.2246 0.3185 -0.7052 0.4808 -0.8492 0.4001
cid_dummy.GBP -0.2053 0.3164 -0.6489 0.5165 -0.8260 0.4153
cid_dummy.JPY -0.5607 0.3605 -1.5554 0.1200 -1.2678 0.1464
cid_dummy.USD -0.2882 0.3068 -0.9395 0.3476 -0.8899 0.3135
======================================================================================
The constant in the above output applies to the first (alphabetically, in our case AUD) market. The effects for other markets can be obtained by adding the respective dummy coefficient values to the AUD level.
The significance of the fixed effects is indicated by the t-statistics and p-values of the dummy coefficients. Adding fixed effects for cross-sections only increased the absolute value and strengthened the significance of the coefficient on the chosen independent variable, though none of the fixed effects is significant at the 5% level.
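As a quick illustration (a sketch, not part of the original output), the implied level for each market can be recovered by adding the dummy coefficients to the AUD benchmark constant:
# Sketch: implied per-market intercepts from the dummy-variable regression above.
params = sm_fe_cs.params
implied = {"AUD": params["const"]}  # AUD is the benchmark (alphabetically first)
for name, coef in params.items():
    if name.startswith("cid_dummy."):
        implied[name.split(".")[-1]] = params["const"] + coef
print(pd.Series(implied).round(4))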
Fixed effects across periods can be analyzed similarly. However, with longer time series the output of the fixed-effects panel regression becomes vast and unreadable if we looked at the effect of every recorded period. Hence, here we use calendar years as a dummy variable and still include the country as a dummy variable.
dfx_pan["year"] = dfx_pan.index.get_level_values(1).year.astype("category")
dfx_pan["date"] = dfx_pan.index.get_level_values(1).astype("category")
X = ["CPIC_SJA_P6M6ML6AR", "cid_dummy", "year"]
X = sm.add_constant(dfx_pan[X])
sm_fe_time = PooledOLS(dfx_pan.DU05YXR_VT10, X).fit(cov_type="clustered")
print(sm_fe_time)
PooledOLS Estimation Summary
================================================================================
Dep. Variable: DU05YXR_VT10 R-squared: 0.0950
Estimator: PooledOLS R-squared (Between): 1.0000
No. Observations: 1600 R-squared (Within): 0.0945
Date: Tue, Sep 19 2023 R-squared (Overall): 0.0950
Time: 16:41:06 Log-likelihood -4186.2
Cov. Estimator: Clustered
F-statistic: 5.6829
Entities: 6 P-value 0.0000
Avg Obs: 266.67 Distribution: F(29,1570)
Min Obs: 219.00
Max Obs: 280.00 F-statistic (robust): 5.5167
P-value 0.0000
Time periods: 281 Distribution: F(29,1570)
Avg Obs: 5.6940
Min Obs: 0.0000
Max Obs: 6.0000
Parameter Estimates
======================================================================================
Parameter Std. Err. T-stat P-value Lower CI Upper CI
--------------------------------------------------------------------------------------
const 2.0846 0.5987 3.4816 0.0005 0.9102 3.2590
CPIC_SJA_P6M6ML6AR 0.0375 0.1233 0.3038 0.7613 -0.2045 0.2794
cid_dummy.CAD -0.2337 0.2967 -0.7878 0.4309 -0.8156 0.3482
cid_dummy.EUR 0.0124 0.3147 0.0393 0.9686 -0.6050 0.6297
cid_dummy.GBP -0.1325 0.3043 -0.4355 0.6633 -0.7293 0.4643
cid_dummy.JPY 0.1960 0.4011 0.4886 0.6252 -0.5908 0.9828
cid_dummy.USD -0.1778 0.3001 -0.5925 0.5536 -0.7665 0.4109
year.2001 -0.9743 0.7005 -1.3908 0.1645 -2.3484 0.3998
year.2002 -0.7156 0.6524 -1.0970 0.2728 -1.9952 0.5640
year.2003 -1.9926 0.6915 -2.8815 0.0040 -3.3490 -0.6362
year.2004 -1.5753 0.6313 -2.4952 0.0127 -2.8136 -0.3369
year.2005 -2.0636 0.6395 -3.2267 0.0013 -3.3181 -0.8092
year.2006 -2.9158 0.6147 -4.7437 0.0000 -4.1214 -1.7101
year.2007 -2.4550 0.6802 -3.6093 0.0003 -3.7892 -1.1208
year.2008 -0.8591 0.6585 -1.3046 0.1922 -2.1508 0.4326
year.2009 -2.1720 0.6170 -3.5200 0.0004 -3.3823 -0.9617
year.2010 -1.1628 0.6519 -1.7837 0.0747 -2.4416 0.1159
year.2011 -0.8383 0.5865 -1.4293 0.1531 -1.9886 0.3121
year.2012 -1.2314 0.5901 -2.0870 0.0371 -2.3888 -0.0741
year.2013 -3.2710 0.7114 -4.5981 0.0000 -4.6664 -1.8756
year.2014 -0.5844 0.6014 -0.9718 0.3313 -1.7640 0.5951
year.2015 -1.7535 0.6110 -2.8697 0.0042 -2.9520 -0.5550
year.2016 -1.9195 0.7203 -2.6648 0.0078 -3.3324 -0.5066
year.2017 -2.5581 0.6557 -3.9012 0.0001 -3.8442 -1.2719
year.2018 -1.5821 0.6694 -2.3636 0.0182 -2.8950 -0.2692
year.2019 -0.8103 0.6620 -1.2241 0.2211 -2.1088 0.4881
year.2020 -1.0457 0.6253 -1.6722 0.0947 -2.2723 0.1809
year.2021 -4.0839 0.7970 -5.1240 0.0000 -5.6472 -2.5206
year.2022 -4.4528 0.7404 -6.0138 0.0000 -5.9051 -3.0005
year.2023 -1.4976 0.9669 -1.5488 0.1216 -3.3941 0.3990
======================================================================================
Even if the years’ fixed effects are significant, this information is of little use for forecasting since time periods do not recur. An interesting takeaway from the above summary, however, is that, after taking into account the effects of cross-section and year, the coefficient for inflation turns positive and loses any significance.
Fixed-effect regression with
linearmodels
#
It is often more efficient to perform panel regression with the
statsmodels
extension
linearmodels
. The
linearmodels
module focuses on panel regression, instrumental variable estimators, system estimators and models for estimating asset prices. The package needs to be installed separately with
pip install linearmodels
. While fixed-effects regression is similar to using dummy variables, it is computationally more efficient.
The
linearmodels
module features the class
PanelOLS
, which supports
one way
(any unobserved effects that are different across individuals but fixed across time) and
two-way fixed effects
estimation for panel data. This means one can consider cross-sectional effects (
entity_effects=True
) or period effects (
time_effects=True
) or both. However,
PanelOLS
only supports linear additive effects.
X = ["CPIC_SJA_P6M6ML6AR"]
X = sm.add_constant(dfx_pan[X])
lm_fe_cs = PanelOLS(
dfx_pan.DU05YXR_VT10, X, time_effects=False, entity_effects=True
).fit(
cov_type="clustered"
) # clustering at cross-section level
print(lm_fe_cs)
PanelOLS Estimation Summary
================================================================================
Dep. Variable: DU05YXR_VT10 R-squared: 0.0064
Estimator: PanelOLS R-squared (Between): -3.6440
No. Observations: 1600 R-squared (Within): 0.0064
Date: Tue, Sep 19 2023 R-squared (Overall): 0.0048
Time: 16:41:06 Log-likelihood -4260.5
Cov. Estimator: Clustered
F-statistic: 10.214
Entities: 6 P-value 0.0014
Avg Obs: 266.67 Distribution: F(1,1593)
Min Obs: 219.00
Max Obs: 280.00 F-statistic (robust): 9.2699
P-value 0.0024
Time periods: 281 Distribution: F(1,1593)
Avg Obs: 5.6940
Min Obs: 0.0000
Max Obs: 6.0000
Parameter Estimates
======================================================================================
Parameter Std. Err. T-stat P-value Lower CI Upper CI
--------------------------------------------------------------------------------------
const 0.7193 0.1729 4.1595 0.0000 0.3801 1.0585
CPIC_SJA_P6M6ML6AR -0.2660 0.0874 -3.0447 0.0024 -0.4374 -0.0946
======================================================================================
F-test for Poolability: 0.5157
P-value: 0.7646
Distribution: F(5,1593)
Included effects: Entity
The output above produces similar coefficient estimates and standard errors as the pooled regression with dummy variables for each cross-section (
sm_fe_cs
in the example above). The biggest difference in the statistics comes from the
\(R^2\)
calculation: in the
linearmodels
fixed-effects regression,
\(R^2\)
is given as the “within
\(R^2\)
”, which means that it divides the usual sum of squared residuals by the
demeaned
total sum of squares, accounting for the fixed effects.
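A rough sketch (assuming the reported statistic is computed on entity-demeaned data) replicates the within \(R^2\) by demeaning target and feature by cross-section and re-running a plain OLS:
# Sketch: approximate the "within" R-squared by demeaning by cross-section and running OLS.
cols = ["DU05YXR_VT10", "CPIC_SJA_P6M6ML6AR"]
demeaned = dfx_pan[cols].groupby(level="cid").transform(lambda s: s - s.mean())
ols_within = sm.OLS(
    demeaned["DU05YXR_VT10"], sm.add_constant(demeaned["CPIC_SJA_P6M6ML6AR"])
).fit()
print(round(ols_within.rsquared, 4))  # compare with the R-squared reported by PanelOLS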
An interesting statistic here is the F-test for “poolability” and its p-value. A low p-value means a high probability of effect significance and rejects the null hypothesis of no level-specific effects. This would be taken as an argument against poolability. By contrast, a high p-value of, say, above 0.1 means a lower probability of effect significance and would lend more support to using a pooled regression, which would be less costly in terms of degrees of freedom.
Note that the summary does not display the fixed effect estimates (intercept). This is done to make the output compact: the model could potentially be used to analyse data with hundreds of “units” (“entities”) and displaying all of them in one table would be counter-productive.
The intercepts (constants) can be seen with the command
lm_fe_cs.estimated_effects
; however, they will differ numerically from the pooled regression with dummy variables, where the benchmark was the alphabetically first cross-section. To get numerically the same results, the fixed effects should be estimated using the formula below, after which the intercepts can be extracted.
lm_fe_cs = PanelOLS.from_formula(
"DU05YXR_VT10 ~ 0 + CPIC_SJA_P6M6ML6AR+ EntityEffects",
dfx_pan,
).fit()
chart_lm_fe_cs = lm_fe_cs.estimated_effects.unstack(level=1)
chart_lm_fe_cs = chart_lm_fe_cs[chart_lm_fe_cs.columns[~chart_lm_fe_cs.isnull().all()]]
chart_lm_fe_cs = chart_lm_fe_cs.transpose().reset_index()
filt = chart_lm_fe_cs["real_date"] >= pd.to_datetime(
"2023-02-01"
) # filter for any date (as long as there is no nan or 0)
chart_lm_fe_cs = chart_lm_fe_cs[filt] # filter out relevant data frame
sns.set(rc={"figure.figsize": (5, 3)})
chart_lm_fe_cs = chart_lm_fe_cs.drop(["level_0", "real_date"], axis=1)
sns.barplot(data=chart_lm_fe_cs).set(
title="Estimated Intercepts for each cross-section. Fixed Effects"
)
[Text(0.5, 1.0, 'Estimated Intercepts for each cross-section. Fixed Effects')]
The example below replicates time fixed effects using
linearmodels
. It is equivalent to a PooledOLS estimation using the time period and “cid_dummy” as dummy variables.
X = [
"CPIC_SJA_P6M6ML6AR",
]
X = sm.add_constant(dfx_pan[X])
lm_fe_time = PanelOLS(
dfx_pan.DU05YXR_VT10, X, time_effects=True, entity_effects=True
).fit(cov_type="clustered")
print(lm_fe_time)
PanelOLS Estimation Summary
================================================================================
Dep. Variable: DU05YXR_VT10 R-squared: 0.0003
Estimator: PanelOLS R-squared (Between): 0.1480
No. Observations: 1600 R-squared (Within): 0.0022
Date: Tue, Sep 19 2023 R-squared (Overall): 0.0023
Time: 16:41:06 Log-likelihood -3288.0
Cov. Estimator: Clustered
F-statistic: 0.3870
Entities: 6 P-value 0.5340
Avg Obs: 266.67 Distribution: F(1,1314)
Min Obs: 219.00
Max Obs: 280.00 F-statistic (robust): 0.3234
P-value 0.5697
Time periods: 281 Distribution: F(1,1314)
Avg Obs: 5.6940
Min Obs: 0.0000
Max Obs: 6.0000
Parameter Estimates
======================================================================================
Parameter Std. Err. T-stat P-value Lower CI Upper CI
--------------------------------------------------------------------------------------
const 0.3451 0.1619 2.1316 0.0332 0.0275 0.6627
CPIC_SJA_P6M6ML6AR -0.0506 0.0890 -0.5687 0.5697 -0.2251 0.1240
======================================================================================
F-test for Poolability: 11.001
P-value: 0.0000
Distribution: F(284,1314)
Included effects: Entity, Time
The interesting part here is that, after taking into consideration the effects of cross-section and time, the coefficient for
CPIC_SJA_P6M6ML6AR
is no longer significant. Note, however, that the F-test for poolability now rejects the null hypothesis of no level-specific effects. The issue with the two-way fixed-effects model is a huge loss of degrees of freedom (having estimated a separate effect for each period and cross-section, we lose 284 degrees of freedom, from 1,598 down to 1,314).
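The cost can be quantified directly from the fitted results (a small sketch, assuming the linearmodels result objects expose the df_resid attribute):
# Sketch: residual degrees of freedom of the pooled versus the two-way fixed-effects model.
print("Pooled OLS df_resid:     ", lm_pooled.df_resid)
print("Two-way FE df_resid:     ", lm_fe_time.df_resid)
print("Degrees of freedom lost: ", lm_pooled.df_resid - lm_fe_time.df_resid)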
Random effects regression #
Basics #
Like fixed effects models, random effects models are used to account for cross-section- and period-specific effects. However, the purpose and assumptions of random effects models in financial markets research are different.
-
Random-effects models deal with the problem of pseudo-replication in panels . Random effects are commonalities of measurements that are connected, for example by belonging to the same cross-section or time period. Affected observations are not independent. Treating them as independent leads to pseudo-replication and erroneous inference. For example, coefficient significance in a pooled OLS regression with pseudo-replication may be overstated. Random effects models are made for dealing with non-independence, as opposed to fixed effects, which focus on the actual parameter estimates of cross-section- or period-specific effects.
-
Unlike fixed effects models, random effects models do not consider omitted variable bias as a problem . This often suits the purpose of financial market prediction models if one assumes that observed-unobserved variable correlation is stable. Then the observed feature can represent the effect of the unobserved one. After all, we care about forecast accuracy not about actual causality.
Importantly, the term “random” does not mean that the effects are erratic or not measurable. They are merely influences that are not explained by the features of the model and are specific to the level, i.e. the period or the cross-section. In a random-effects model we consider level-specific commonalities as draws from a random variable and disregard their specific values. The effects themselves do not add any information we are interested in.
Conceptually, random effects represent a sample of all possible values or instances of that effect. Thus, they generalize to all results obtainable from the population. In the case of financial market panel research, they generalize to all possible cross-sections (markets, currency areas or countries) and time periods, including the future. This generalization typically makes the random effects model more suitable for predictions.
“Random effect estimates are a function of the group [cross-section or time period] level information as well as the overall (grand) mean of the random effect. Group levels with low sample size and/or poor information are more strongly influenced by the grand mean, which is serving to add information to an otherwise poorly-estimated group. However, a group with a large sample size and/or strong information will have very little influence of the grand mean and largely reflect the information contained entirely within the group. This process is called partial pooling, as opposed to no pooling where no effect is considered or total pooling where separate models are run for the separate groups. Partial pooling results in the phenomenon known as shrinkage, which refers to the group-level estimates being shrunk toward the mean. What does all this mean? If you use a random effect you should be prepared for your factor levels to have some influence from the overall mean of all levels. With a good, clear signal in each group, you won’t see much influence of the overall mean, but you will with small groups or those without much signal.” Steve Midway
Cross-section random effects with
linearmodels
#
One way of performing cross-sectional random effects panel regression is to use the
RandomEffects
class of the
linearmodels
package. It implements a cross-sectional random effects model.
X = sm.add_constant(dfx_pan["CPIC_SJA_P6M6ML6AR"])
lm_re = RandomEffects(dfx_pan.DU05YXR_VT10, X).fit(cov_type="clustered")
print(lm_re.summary)
RandomEffects Estimation Summary
================================================================================
Dep. Variable: DU05YXR_VT10 R-squared: 0.0053
Estimator: RandomEffects R-squared (Between): -1.8826
No. Observations: 1600 R-squared (Within): 0.0061
Date: Tue, Sep 19 2023 R-squared (Overall): 0.0053
Time: 16:41:06 Log-likelihood -4261.8
Cov. Estimator: Clustered
F-statistic: 8.4524
Entities: 6 P-value 0.0037
Avg Obs: 266.67 Distribution: F(1,1598)
Min Obs: 219.00
Max Obs: 280.00 F-statistic (robust): 8.0930
P-value 0.0045
Time periods: 281 Distribution: F(1,1598)
Avg Obs: 5.6940
Min Obs: 0.0000
Max Obs: 6.0000
Parameter Estimates
======================================================================================
Parameter Std. Err. T-stat P-value Lower CI Upper CI
--------------------------------------------------------------------------------------
const 0.6162 0.1486 4.1453 0.0000 0.3246 0.9077
CPIC_SJA_P6M6ML6AR -0.2066 0.0726 -2.8448 0.0045 -0.3491 -0.0642
======================================================================================
The output is equivalent to the pooled regression. Adding cross-sectional random effects in this example does not change the results much, as none of the cross-sectional fixed effects was significant.
Generally, we can check the importance of random effects by extracting the “theta”, a measure for the importance of the level, i.e. here the cross-sections, in explaining target variance.
The random effects estimator makes use of a quasi-differenced model:
\(y_{it} - \hat{\theta}_i\bar{y}_i = (1-\hat{\theta}_i)\alpha_i + (x_{it} - \hat{\theta}_i\bar{x}_i)\beta + (\epsilon_{it} - \hat{\theta}_i\bar{\epsilon}_i)\)
where y is the target, x the feature, the subscript i denotes a cross-section, and subscript t a period. Here \(\hat\theta_i\) is a function of the variance of the cross-section specific error \(\epsilon_{it}\) , the variance of the intercepts \(\alpha_i\) and the number of observations for entity \(i\) . The coefficient \(\theta_i\) determines how much “demeaning” takes place. When this value is \(0\) , the random effects model effectively becomes a pooled model since this occurs when there is no variance in the effects. When \(\theta_i\) is \(1\) , the model effectively becomes a fixed effects model.
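In standard notation (a textbook expression, not shown in the linearmodels output), the quasi-differencing parameter for entity \(i\) with \(T_i\) observations takes the form \(\hat{\theta}_i = 1 - \sqrt{\hat{\sigma}_{\epsilon}^2 / (T_i \hat{\sigma}_{\alpha}^2 + \hat{\sigma}_{\epsilon}^2)}\) , where \(\hat{\sigma}_{\alpha}^2\) is the estimated variance of the cross-section effects and \(\hat{\sigma}_{\epsilon}^2\) the estimated idiosyncratic error variance.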
For example, in the above regression the \(\theta_i\) is zero for all cross-sections, implying no significant cross-section-specific effect. Put simply, the cross-section does not explain much of the variation of the return. This is plausible, because almost all of the return variation has to do with changeable market conditions rather than the country. For financial time series, period-specific effects are often far more important than cross-section-specific effects.
print(lm_re.theta.transpose())
cid AUD CAD EUR GBP JPY USD
theta 0.0 0.0 0.0 0.0 0.0 0.0
Conveniently, the
compare
function of the
linearmodels.panel
module can be used to show the key output of different panel models side by side.
res = {
"Pooled": lm_pooled,
"One way fixed Effects": sm_fe_cs,
"Random cross-section specific": lm_re,
}
print(compare(res))
Model Comparison
========================================================================================
Pooled One way fixed Effects Random cross-section specific
----------------------------------------------------------------------------------------
Dep. Variable DU05YXR_VT10 DU05YXR_VT10 DU05YXR_VT10
Estimator PooledOLS PooledOLS RandomEffects
No. Observations 1600 1600 1600
Cov. Est. Clustered Clustered Clustered
R-squared 0.0053 0.0069 0.0053
R-Squared (Within) 0.0061 0.0064 0.0061
R-Squared (Between) -1.8826 1.0000 -1.8826
R-Squared (Overall) 0.0053 0.0069 0.0053
F-statistic 8.4524 1.8363 8.4524
P-value (F-stat) 0.0037 0.0886 0.0037
===================== ============== ============== ===============
const 0.6162 0.9753 0.6162
(4.1453) (3.1357) (4.1453)
CPIC_SJA_P6M6ML6AR -0.2066 -0.2660 -0.2066
(-2.8448) (-3.0447) (-2.8448)
cid_dummy.CAD -0.3074
(-1.0058)
cid_dummy.EUR -0.2246
(-0.7052)
cid_dummy.GBP -0.2053
(-0.6489)
cid_dummy.JPY -0.5607
(-1.5554)
cid_dummy.USD -0.2882
(-0.9395)
----------------------------------------------------------------------------------------
T-stats reported in parentheses
Time period-specific random effects with
statsmodels
#
Period-specific random effects are often highly relevant for trading strategy research, as most targets are financial returns of the same type over different markets. The variation of this target is predominantly period-specific and often returns over the same period are highly correlated. Not accounting for such correlation leads to pseudo-replication of data and overstated significance of results.
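To gauge how strong this effect is in our sample, a small illustrative check (not part of the original notebook) measures the average pairwise correlation of the target returns across currency areas within the same month:
# Sketch: average pairwise correlation of monthly duration returns across cross-sections,
# a rough gauge of the pseudo-replication problem discussed above.
ret_wide = dfx_pan["DU05YXR_VT10"].unstack(level="cid")  # columns = cross-sections
corr = ret_wide.corr()
off_diag = corr.where(~np.eye(len(corr), dtype=bool))
print(f"Average pairwise cross-sectional return correlation: {off_diag.stack().mean():.2f}")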
The Macrosynergy post
Testing macro trading factors
explains a period-specific random-effects panel regression that is well suited for testing the influence of macro factors across countries. It is a practical and effective econometric technique to tackle pseudo-replication in panels. There is an R script on
Kaggle
replicating the results of the post. R offers better documentation and more flexibility for random-effects and linear mixed-effects models than Python. This is done with two main packages -
plm
and
lme4
.
Here we replicate this procedure in Python. The easiest way to apply period-specific random effects in Python is to use the mixed-effects regression (
MixedLM
) from the
statsmodels
package. The labeling of the function is, admittedly, a source of confusion. For practical purposes, however, this function gives us the output of a period-specific random-effects model. Indeed, Python in this case tries to replicate the output of R, using the R packages as a benchmark.
y = dfx_pan["DU05YXR_VT10"]
X = dfx_pan["CPIC_SJA_P6M6ML6AR"]
X = sm.add_constant(X)
groups = dfx_pan.reset_index().real_date
re = sm.MixedLM(
y,
X,
groups,
).fit(
reml=True
) # Use restricted maximum likelihood
print(re.summary())
Mixed Linear Model Regression Results
=============================================================
Model: MixedLM Dependent Variable: DU05YXR_VT10
No. Observations: 1600 Method: REML
No. Groups: 280 Scale: 4.3401
Min. group size: 4 Log-Likelihood: -3784.0539
Max. group size: 6 Converged: Yes
Mean group size: 5.7
-------------------------------------------------------------
Coef. Std.Err. z P>|z| [0.025 0.975]
-------------------------------------------------------------
const 0.455 0.199 2.281 0.023 0.064 0.845
CPIC_SJA_P6M6ML6AR -0.098 0.056 -1.753 0.080 -0.207 0.012
Group Var 7.752 0.381
=============================================================
The output of the model has less information than a standard OLS model or even a panel regression from linearmodels. The reason is that many of the standard measurements are not applicable to this model:
-
Method states the fitting method: either maximum likelihood ( ML ) or restricted maximum likelihood ( REML ), a modified version of standard maximum likelihood. Mixed-effects models can be difficult to fit and, since they do not have closed-form solutions, they rely on an optimizer. Standard maximum likelihood often fails by not converging (the summary indicates if the model has converged or not). The REML method often yields a better result.
-
The Log-Likelihood value is the natural logarithm of the probability of observing the sample data, assuming it has been taken from a population that is characterized by the estimated parameters. It helps to compare different models.
-
The coefficients in the bottom part of the table, for the intercept and the covariates, are the fixed effects of the mixed model. Comparing the output with the very first pooled regression model, we notice that the relation between the month-end core CPI trend and next month’s return still shows as negative, but it is no longer significant at the 5% level (the p-value is around 0.08).
-
The Group Var coefficient is the variance of the period-specific random effects over time. The term
Group
is generic, but in this specific case, where we specified the
real_date
category as the group, it refers to time periods.
-
Scale is the (scalar) error variance. It is basically the estimated error variance based on the given estimates of the slopes and the random-effects covariance matrix. For a discussion of its computation please see the source code with remarks
To estimate the importance of the random factors, we extract the scale and the variance of the period-specific random effects from the model and simply divide the variance of the period-specific random effects by the sum of the two variances (and multiply by 100 to get a percentage):
scale = sm.MixedLM(
y,
X,
groups,
).get_scale(re.fe_params, re.cov_re_unscaled, re.vcomp)
RE_explained = ((re.cov_re / (re.cov_re + scale)) * 100).squeeze().round(2)
print("% of Variance attributable to random factors time, in %, is", RE_explained)
% of Variance attributable to random factors time, in %, is 64.11
The period-specific random effects can be extracted and plotted with the
.random_effects
attribute. For plotting see the mixed-effects section below.
random_effects = pd.DataFrame.from_dict(re.random_effects)
random = random_effects.T
print(random.tail(5))
Group Var
2023-01-31 3.099347
2023-02-28 -3.490843
2023-03-31 3.974264
2023-04-30 0.036600
2023-05-31 -0.336469
To see how the model residuals are distributed against the fitted values and to detect a possible pattern, we can use a simple scatter plot:
sns.scatterplot(x=(dfx_pan["DU05YXR_VT10"] - re.resid), y=re.resid, alpha=0.5).set(
title="Residual vs. Fitted"
)
plt.xlabel("Fitted Values")
plt.ylabel("Residuals")
Text(0, 0.5, 'Residuals')
fitted = pd.DataFrame()
fitted["CPIC_SJA_P6M6ML6AR"] = dfx_pan.CPIC_SJA_P6M6ML6AR
fitted["DU05YXR_VT10"] = dfx_pan.DU05YXR_VT10
fitted["re_fit"] = re.fittedvalues
fitted["pool_fit"] = lm_pooled.fitted_values
fitted["fixed_entity_fit"] = lm_fe_cs.fitted_values
fitted["fixed_time_fit"] = sm_fe_time.fitted_values
fitted_l = fitted.reset_index()
fitted_l = pd.melt(
fitted_l,
id_vars=["cid", "real_date"],
value_vars=[
"CPIC_SJA_P6M6ML6AR",
"DU05YXR_VT10",
"re_fit",
"pool_fit",
"fixed_entity_fit",
"fixed_time_fit",
],
var_name="xcat",
value_name="value",
)
cidx = ["AUD", "CAD", "EUR", "GBP", "JPY", "USD"]
msp.view_timelines(
fitted_l,
xcats=["DU05YXR_VT10", "re_fit", "pool_fit", "fixed_entity_fit", "fixed_time_fit"],
cids=cidx,
ncol=3,
start="2000-01-01",
same_y=False,
aspect=1.5,
height=3,
label_adj=0.2,
)
None of the models except for the random-effects model predicts much of the variation. This is to be expected, as the R-squared numbers were generally very low. The random-effects model seems to capture much more.
Linear mixed effects model #
“The term ‘mixed’ or, more fully, ‘mixed effects’, denotes a model that incorporates both fixed- and random-effects terms in a linear predictor expression from which the conditional mean of the response can be evaluated.”
Bates et al
. Note that this paper also describes the fitting of the model in the
lme4
package, part of which is replicated in Python’s
statsmodels
under the
smf.mixedlm
model.
“Linear mixed-effects models are an important class of statistical models that can be used to analyze…continuous, hierarchical data…taking into account the correlation of observations…They allow us to effectively partition overall variation of the dependent variable into components corresponding to different levels of data hierarchy.” Gałecki and Burzykowski
The practical benefit of this type of model is the flexibility to introduce level-specific effects to intercepts and slopes . In particular, one can specify random intercept and random slope coefficients. As before, the model will also estimate fixed coefficients. For example, these models allow random intercepts with fixed means, intercepts that vary across a level and a sub-level, correlated random intercepts and slopes, and uncorrelated random intercepts and slopes. Formally, the model posits that the conditional distribution of a target vector for a realized set of level-specific random effects is normal around a linear combination of the features, the random effects and, possibly, a prior offset term.
Linear mixed-effects models also offer more flexibility in handling unbalanced repeated-measures data. It is important to determine which effects should have random components and which should be fixed. The model is iterative: a tentative model is fitted and subsequently modified to generate a better fit. In general, this model does not have a closed-form solution, so convergence is required for the model to work. With more than one covariate, the user can decide on a parameter-by-parameter basis which regressors should be considered as random. For model specifications please see here
Basic documentation with method descriptions can be viewed here . The use of formulas is described here , and the results are discussed here .
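Before turning to the example, the sketch below (a hypothetical illustration; the variable name cid_groups and the exact re_formula strings are assumptions following the statsmodels formula conventions) shows how the main model variants are specified:
# Sketch: common random-effects specifications with statsmodels' mixedlm formula interface.
# "groups" defines the level at which random effects apply; "re_formula" what varies randomly.
cid_groups = dfx_pan.reset_index().cid
base = "DU05YXR_VT10 ~ CPIC_SJA_P6M6ML6AR"
m_int = smf.mixedlm(base, dfx_pan, groups=cid_groups, re_formula="1")  # random intercepts only
m_slope = smf.mixedlm(base, dfx_pan, groups=cid_groups, re_formula="0 + CPIC_SJA_P6M6ML6AR")  # random slopes only
m_both = smf.mixedlm(base, dfx_pan, groups=cid_groups, re_formula="1 + CPIC_SJA_P6M6ML6AR")  # intercepts and slopes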
In the first example we use the group (the cross-sections) intercepts and slope coefficients as random factors. Compared to the above pooled regression model the panel estimates of the intercept and slope coefficients only change slightly, reflecting that the explanatory power of cross-sections for monthly duration returns is marginal.
groups = dfx_pan.reset_index().cid
lmm1 = smf.mixedlm(
"DU05YXR_VT10 ~ CPIC_SJA_P6M6ML6AR",
dfx_pan,
groups=groups,
re_formula="1+CPIC_SJA_P6M6ML6AR",
).fit()
print(lmm1.summary())
Mixed Linear Model Regression Results
=========================================================================
Model: MixedLM Dependent Variable: DU05YXR_VT10
No. Observations: 1600 Method: REML
No. Groups: 6 Scale: 12.0676
Min. group size: 219 Log-Likelihood: -4265.0655
Max. group size: 280 Converged: Yes
Mean group size: 266.7
-------------------------------------------------------------------------
Coef. Std.Err. z P>|z| [0.025 0.975]
-------------------------------------------------------------------------
Intercept 0.619 0.166 3.733 0.000 0.294 0.945
CPIC_SJA_P6M6ML6AR -0.208 0.073 -2.844 0.004 -0.351 -0.065
Group Var 0.002 0.021
Group x CPIC_SJA_P6M6ML6AR Cov -0.001 0.012
CPIC_SJA_P6M6ML6AR Var 0.001 0.014
=========================================================================
However, unlike in the pooled model, we can now derive cross-section-specific intercepts and - more importantly - slope coefficients by accessing the
random_effects
attribute.
This can be very important for prediction if we have reason to believe that the cross-sectional differences in elasticities are, to some extent, persistent. In this case, considering cross-section-specific slope coefficients can deliver better predictions than the panel or pool coefficient. The conditional means of the random effects given the data can be extracted with the property
random_effects
:
print(pd.DataFrame.from_dict(lmm1.random_effects), "\n")
print(
"Fixed effects for linear mixed effects model with cid as random factor are \n",
lmm1.fe_params,
)
AUD CAD EUR GBP JPY USD
Group -0.004418 -0.000942 0.000963 0.008898 -0.010806 0.006304
CPIC_SJA_P6M6ML6AR 0.003832 0.000402 -0.000495 -0.006088 0.006915 -0.004565
Fixed effects for linear mixed effects model with cid as random factor are
Intercept 0.619364
CPIC_SJA_P6M6ML6AR -0.207805
dtype: float64
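As a final illustration (a sketch, not part of the original output), the cross-section-specific slopes implied by the mixed model can be obtained by adding the random slope deviations to the common fixed-effect slope:
# Sketch: combine the common (fixed) slope with the cross-section-specific random
# deviations to obtain per-currency-area slope estimates.
re_df = pd.DataFrame.from_dict(lmm1.random_effects)  # rows: Group, CPIC_SJA_P6M6ML6AR
cid_slopes = lmm1.fe_params["CPIC_SJA_P6M6ML6AR"] + re_df.loc["CPIC_SJA_P6M6ML6AR"]
print(cid_slopes.round(4))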