Financial market participants work with a growing and broadening range of correlated time series in the operation of trading strategies. This increases the importance of latent factor models, i.e., methods that condense high-dimensional datasets into a low-dimensional set of factors that retains most of the relevant underlying information. There are two principal approaches to finding such factors. The first uses domain knowledge to pick factor proxies up front. The second treats all factors as latent and applies statistical methods, such as principal components analysis, to a comprehensive set of correlated variables. A new paper proposes to combine domain knowledge and statistical methods through penalized reduced-rank regression. The approach promises improved accuracy and robustness.

This post is mainly a quick summary of the method, based on quotes from the above paper. Some technical wording and mathematical symbols in the text have been replaced with plain common language.

This post ties in with this site’s summary of quantitative methods for macro information efficiency.

## The basic idea

“Latent factor model estimation typically relies on __either using domain knowledge to manually pick __several observed covariates [correlated time series] as factor proxies, __or purely conducting multivariate analysis__ such as principal component analysis…We propose to bridge these two approaches [making] the latent factor model estimation robust, flexible, and statistically more accurate.”

“At the heart of our method is __a penalized reduced rank regression to combine information__. To further deal with heavy-tailed data, a computationally attractive penalized robust reduced rank regression method is proposed. We establish faster rates of convergence.”

*Reduced rank regression is a multivariate regression analysis for features with multicollinearity. It is suitable for a high-dimensional set of predictors where few linear combinations explain most of the variation of the target variable. Reduced rank regression addresses multicollinearity by identifying a lower-dimensional subspace within the overall feature space that captures most of the explanatory power. This subspace is typically defined by a smaller number of linear combinations of the original predictors, known as latent variables or components. The approach estimates regression coefficients for the linear combinations of features rather than the features themselves. This reduces the number of parameters to estimate. The reduced rank model aims to find a set of component weights that minimize prediction errors.*
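To illustrate the idea, reduced rank regression can be computed in two steps: an ordinary least squares fit, followed by a projection of the fitted values onto their leading principal directions. Below is a minimal sketch with synthetic data; all names and dimensions are illustrative, not from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
T, p, q, r = 500, 30, 10, 3  # observations, predictors, targets, rank
X = rng.standard_normal((T, p))
C = rng.standard_normal((p, r)) @ rng.standard_normal((r, q))  # true rank-r coefficients
Y = X @ C + 0.1 * rng.standard_normal((T, q))

# Step 1: ordinary least squares fit (full-rank coefficient matrix)
C_ols, *_ = np.linalg.lstsq(X, Y, rcond=None)

# Step 2: project the fitted values onto their top-r principal directions;
# this yields the classical reduced rank regression solution
U, s, Vt = np.linalg.svd(X @ C_ols, full_matrices=False)
C_rr = C_ols @ Vt[:r].T @ Vt[:r]  # reduced-rank coefficient matrix

print(np.linalg.matrix_rank(C_rr))  # 3
```

Because the projection discards only the directions that mainly fit noise, the reduced-rank fit loses little explanatory power relative to ordinary least squares while estimating far fewer effective parameters.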

“__With the emergence of high-dimensional and complex datasets, in recent years, a lot of works focus on the low-rank structure__, the penalization methods, the heavy-tailed problem, and their combinations… Our work is also related to research on supervised dimension reduction, which studies the dimension reduction for one dataset with the help of auxiliary information. Popular methods include supervised principal components, principal fitted components, and sufficient dimension reduction.”

## Finding the factors in a factor model

“Factor model is a useful tool for __modelling common dependence among multivariate outputs__ and it is gaining popularity with the emergence of high-dimensional data.”

“The factor model [relates a multivariate dataset to a small number of common factors]:”
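y_{t} = A f_{t} + e_{t}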

*Here y_{t} is an extensive multivariate data set at time t, f_{t} a small number of common factors at time t, A a “loading matrix,” and e_{t} a set of idiosyncratic errors.*

“Since the true factors are typically unknown, there are two approaches being commonly adopted.

- One is to use domain knowledge to manually __pick several observed covariates as factor proxies, treat them as common factors in the model__, and then estimate the loading matrix and the errors. We refer to this approach as the **factor proxy approach**. A well-known example is the Fama-French three-factor model in finance…
- The second approach is to treat the factors as latent and estimate the whole model via statistical methods… We refer to this approach as the **proxy-agnostic approach**. Such an approach __only utilizes the information in the multivariate data set and cannot incorporate information from other variables__… For the proxy-agnostic approach, some popular methods include the principal components-based methods, the generalized principal components-based methods, and the maximum likelihood-based methods.”

“Methodologically, we propose a novel latent factor model estimation approach to __bridge the factor proxy approach and the proxy-agnostic approach__, and it allows incorporating a large number of factor proxies.”

## Combining factor proxies and statistical methods

“Our target is to utilize the information from both the multivariate observations of main interest [data set] **y**_{t} and a large set of factor proxies **x**_{t} to improve the estimation accuracy of the latent factor model. Specifically, __we consider approximating the latent factors by some linear transformation of [the proxies]__ and regard the following model as our working model:
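**f**_{t} = B **x**_{t} + **u**_{t}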

where B is a [coefficient] matrix, and **u**_{t} is the mean-zero approximation error. “

“To combine information by integrating [the standard factor model with the above linear proxy transformations], __we propose to fit a reduced rank regression with the main data set [y_{t}] onto the proxies [x_{t}] to estimate the loading matrix__, and then recover the latent factors [f_{t}] by projecting the main data set onto the estimated loadings.”

“We are considering the following working model:”
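y_{t} = (AB) x_{t} + (A u_{t} + e_{t})

*This combined model follows from substituting the proxy approximation of the factors into the factor model. The coefficient matrix AB has rank of at most the number of factors, which is why a (penalized) reduced rank regression of y_{t} on x_{t} can estimate the loading space.*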

“Suppose the information provided by [the proxy data set] is useful, that is, the variance of the approximation errors is relatively small, __the reduced rank regression step can be regarded as a denoising procedure with the guidance of the proxies__…We can expect that, compared with the noisy [original] data, the fitted values can help us recover the latent factor structure more accurately.”

“To guarantee the approximation power, we allow the number of proxies to increase so that one can incorporate a large number of proxies, and __we replace the vanilla reduced rank regression with a penalized variant to handle the high dimensionality__. As a bonus, the number of factors is then naturally allowed to grow. The resulting method is named as **Factor Model estimation with Sufficient Proxies** (FMSP).”
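The pipeline can be sketched in a few lines of numpy. This is a stylized illustration, not the authors' implementation: the dimensions are hypothetical, and a simple ridge penalty stands in for the paper's penalization scheme, which is designed for high-dimensional and heavy-tailed settings.

```python
import numpy as np

rng = np.random.default_rng(1)
T, N, p, r = 400, 50, 120, 3  # periods, return series, proxies, factors
F = rng.standard_normal((T, r))               # latent factors (unobserved)
A = rng.standard_normal((N, r))               # loading matrix (unobserved)
Y = F @ A.T + 0.5 * rng.standard_normal((T, N))                            # main data set
X = F @ rng.standard_normal((p, r)).T + 0.3 * rng.standard_normal((T, p))  # factor proxies

# Step 1: penalized regression of the main data set on the high-dimensional
# proxies (ridge penalty used here as a simple stand-in)
lam = 10.0
C = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ Y)

# Step 2: the fitted values are a denoised version of Y; their top-r right
# singular directions estimate the loading space
U, s, Vt = np.linalg.svd(X @ C, full_matrices=False)
A_hat = Vt[:r].T                              # estimated loadings (N x r)

# Step 3: recover the latent factors by projecting the raw data onto the
# estimated loadings
F_hat = Y @ A_hat

print(F_hat.shape)  # (400, 3)
```

The key design choice is that the singular value decomposition is taken on the *fitted* values rather than on the raw data: when the proxies carry genuine factor information, the regression step strips out much of the idiosyncratic noise before the factor structure is extracted.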

## Application to the equity factor zoo

“Empirically, we conduct extensive simulations to study the finite-sample performance of the proposed method under different conditions. Two real applications on estimating the factor structure of monthly stock returns are also presented. The __numerical performance supports our theoretical findings__ and illustrates the advantages of Factor Model estimation with Sufficient Proxies.”

“[An important application of the factor model] is the ‘factor zoo’ problem arising in finance. __Over 400 covariates [correlated time series] have been proposed in the literature and claimed to have approximation power on the latent factors of stock returns__. This chaos raises a lot of concerns and discussions. The existing works mainly focus on the factor selection problem. Instead, we approach the problem from a different angle by considering directly improving the latent factor model estimation accuracy via utilizing the information contained in these covariates…We refer to the observed covariates as factor proxies to emphasize their explanatory power.”

“The dataset consists of the __monthly values of 99 proxies from July 1980 to December 2016.__ These factor proxies are all proposed in the literature and claimed to have approximation power on the latent factors of stock returns. The list of proxies contains 17 publicly available factors, including the Fama-French factors, the q-factors, the liquidity risk, and 82 long-short factors using firm characteristics. The set of proxies covers many risk sources, including Momentum, Value-versus-Growth, Investment, Profitability, Intangibles, and Trading Frictions… We use the 567 stocks in the U.S. market.”

“We evaluate the estimation accuracy based on a rolling-window scheme…__In all situations, Factor Model estimation with Sufficient Proxies shows higher accuracy than the competing methods__, and in most cases, this finding is statistically significant. The results demonstrate the efficiency and robustness of Factor Model estimation with Sufficient Proxies in loading matrix estimation.”