R is an object-oriented programming language and work environment for statistical analysis. It is not just for programmers, but for everyone conducting data analysis, including portfolio managers and traders. Even with limited coding skills, R outclasses Excel spreadsheets and boosts information efficiency. First, like Excel, the R environment is built around data structures, albeit far more flexible ones. Operations on data are simple and efficient, particularly for import, wrangling, and complex transformations. Second, R is a functional programming language. This means that functions can use other functions as arguments, making code succinct and readable. Specialized “functions of functions” map elaborate coding subroutines to data structures. Third, R users have access to a repository of almost 15,000 packages of functions for all sorts of operations and analyses. Finally, R supports vast arrays of visualizations, which are essential in financial research for building intuition and trust in statistical findings.
The post ties in with the SRSV summary on information efficiency.
(A second part of the post focusing on statistical inference and learning will follow).
What is R?
The R project for statistical computing provides the leading open-source programming language and environment for statistical analysis. It has been developed by academics and statisticians for over 25 years.
- As a language, R supports effective object-oriented programming, including all the usual features such as conditionals, loops, and user-defined recursive functions. Unlike Python, it is not a general-purpose language but is heavily geared towards statistical work.
- As a work environment R offers “an integrated suite of software facilities for data manipulation” or “an environment within which statistical techniques are implemented”.
Put simply, R can do whatever a spreadsheet can do, but much faster and far more efficiently, and it extends to vastly more applications. The primary benefits of R are data wrangling (making untidy data usable), data transformation (building customized data sets), data analysis (applying statistical models), all forms of visualization, and machine learning. The default workspace or integrated development environment (IDE) for R is RStudio. Among the many online resources for learning R, the data science courses of DataCamp stand out.
What makes R powerful for macro trading?
Most macro traders and portfolio managers rely on quantitative statistical analysis, typically in the form of charts (often inside Bloomberg, Reuters Eikon and so forth), calculators and spreadsheets. As responsible trading is demanding full-time work, senior traders often lack the time and experience to code up their analytical tools to professional standards. Even cooperation with desk quants that offer programming support can be difficult, as many traders do not wish to reveal their personal methods and struggle to translate their needs into suitable instructions for programmers.
R makes statistical programming and data science accessible. In particular, R is not just for programmers but for all finance professionals with some interest in statistical analysis. That is because R can initially be run interactively with a limited set of basic commands. In some sense, R can be used by non-programmers much like a sophisticated calculator. Even short snippets of code can go a long way in performing operations that would be very tedious in Excel. This means that R can be deployed with minimal programming skills and typically enhances the information efficiency of the investment process quickly. Interest in and advances in programming skills then follow almost naturally.
A personal analytical framework in R is highly extensible and goes far beyond the capacity of Excel spreadsheets. This is because it allows far greater creativity and far more data to be used (view post here).
The power of data structures
The R language and environment are built around data structures:
- The main homogeneous multidimensional data structure in R is the array, which is just a generalization of vectors and matrices. “Homogeneous” here means that the array includes just one type of data, such as numeric. Unlike in an Excel spreadsheet, it is easy and quick to perform a wide range of mathematical operations on one or more of these arrays. Whether one adds two numbers or two large equally-shaped arrays makes practically no difference in R.
- The main heterogeneous data structure is the data frame. It is generally a two-dimensional structure, much like a data table. “Heterogeneous” here means that different columns can contain different types of data, such as numerics, dates, factors, character strings and so forth. Under the hood, a data frame is a list of equal-length vectors, with the latter being the columns of the frame. As with arrays, it is easy to perform logical and mathematical operations on a data frame, albeit with more restrictions due to the different data types involved (a short sketch of both structures follows below).
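As a minimal sketch with made-up numbers (object and column names are purely illustrative), the snippet below applies element-wise arithmetic to an array and adds a derived column to a data frame:

```r
# Homogeneous structure: a 4 x 3 x 2 numeric array of random numbers
a <- array(rnorm(24), dim = c(4, 3, 2))
b <- a * 2 + 1            # the same arithmetic applies to every element at once

# Heterogeneous structure: a data frame with dates, characters and numerics
df <- data.frame(
  date   = as.Date("2019-01-01") + 0:4,
  asset  = c("EQ", "EQ", "FX", "FX", "FX"),
  return = rnorm(5)
)
df$excess <- df$return - mean(df$return)   # column-wise transformation
```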
The import of data structures into R mostly relies on two types of tools. The first is web APIs (application programming interfaces) that link the local R environment with external databases, such as Bloomberg, Datastream, Macrobond and the data services of investment banks. The second is special R functions that read and import data files into the environment, including from Excel spreadsheets, csv files, and SQL databases. A particularly useful set of functions is provided by the readr package, which supports the customized import of all sorts of rectangular data.
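As an illustration of the second route, the snippet below reads a hypothetical csv file of daily returns with readr, declaring the column types explicitly (the file name and columns are made up):

```r
library(readr)

returns <- read_csv(
  "daily_returns.csv",                  # hypothetical file of daily returns
  col_types = cols(
    date   = col_date(format = "%Y-%m-%d"),
    ticker = col_character(),
    return = col_double()
  )
)
```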
R offers a whole host of techniques to deal with the immensely important job of data wrangling, i.e. the transformation of raw irregular data into a clean tidy data set. The tidyr package provides functions through which one can reshape imported data into a standardized format that is conducive to standard operations, estimation and analysis, particularly for other packages of the tidyverse (a collection of standard R packages for data science). A short reshaping sketch follows after the list below.
A tidy data set meets the following conditions:
- Each column represents one variable, whose values represent a single attribute across observational units. This could be the daily returns of a specific asset or the values of a specific business survey.
- Each row represents one observation of these variables. In macro-finance data structures the rows typically span time periods, markets or currency areas.
- There is only one type of observational unit (e.g. years, months, cross-sections) per table. If one wishes to investigate data sets with different observational units, such as higher frequencies of observations, one should use different tables.
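As a minimal sketch of such a reshaping with tidyr (the survey values are made up), the snippet below turns a wide table with one column per currency area into a tidy table with one row per date-area observation:

```r
library(tidyr)

# Wide format: one column per currency area (hypothetical survey values)
wide <- data.frame(
  date = as.Date("2019-01-01") + 0:2,
  USD  = c(52.1, 52.4, 51.8),
  EUR  = c(49.7, 49.9, 50.2)
)

# Tidy format: each row is one observation of the survey for one area
tidy <- pivot_longer(wide, cols = c("USD", "EUR"),
                     names_to = "area", values_to = "survey")
```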
The R language is geared towards the manipulation of tidy data structures, much more so than Python. Selecting and subsetting data structures simply requires position indices, names or logical conditions. Going beyond basic operations, the dplyr package supports a wide range of manipulations of tidy data tables (illustrated in the sketch after the list below), particularly
- the manipulation of cases (the rows of a data table) in the form of summaries, such as means and standard deviations;
- the grouping of cases, for example into specific time periods;
- the extraction of special cases, depending on certain conditions or through random sampling;
- the arrangement of cases, for example in the form of league tables based on the values of one variable;
- the manipulation of variables (the columns of a table), i.e. transforming variables or calculating new variables based on existing ones.
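The sketch below illustrates these verbs on a made-up panel of monthly returns; the data and variable names are purely hypothetical:

```r
library(dplyr)

panel <- data.frame(
  area   = rep(c("USD", "EUR", "JPY"), each = 12),
  month  = rep(1:12, times = 3),
  return = rnorm(36)
)

# Grouping, summaries and arrangement: a league table of average returns
league <- panel %>%
  group_by(area) %>%
  summarise(mean = mean(return), sd = sd(return)) %>%
  arrange(desc(mean))

# Extraction of special cases and calculation of a new variable
large_moves <- panel %>%
  filter(abs(return) > 1) %>%
  mutate(annualized = return * 12)
```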
Most data used for macro trading are time series. The xts (eXtensible Time Series) package has been developed for just this purpose. The package supports a special object class and functions for uniform handling of many R time series classes. An xts object is effectively an extension or special class of zoo object (a class of indexed, totally ordered observations). Its practical benefits include easy conversion and reconversion from and to other classes and bespoke functionality for time series data. Key advantages of xts objects include reliable implementation of time lags, easy and intuitive subsetting with date names, easy extraction of periodicity and time stamps, and consideration of different time zones. A complementary package for specialized operations on dates and times is lubridate, which takes care of time zones, leap days and daylight saving time.
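A minimal sketch of these conveniences, using a made-up daily return series, could look as follows:

```r
library(xts)

dates <- seq(as.Date("2019-01-01"), by = "day", length.out = 100)
x <- xts(rnorm(100), order.by = dates)     # hypothetical daily return series

x_lag <- lag.xts(x, k = 1)                 # reliable one-period time lag
x_jan <- x["2019-01"]                      # intuitive subsetting by date string
x_eom <- apply.monthly(x, last)            # month-end observations
periodicity(x)                             # extract the periodicity of the series
```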
Finally, the popular data.table package allows efficient operations on data structures with short code, particularly subsetting, grouping, updating and univariate variable transformation. Hence, it is particularly suitable for extracting analytical summaries from large databases. The objective of the package is to reduce programming and computing time.
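A short, hypothetical example of the data.table syntax, which combines subsetting, grouping and summarizing in one expression (market labels and returns are made up):

```r
library(data.table)

dt <- data.table(
  market = rep(c("EQ", "FX", "IRS"), each = 250),
  return = rnorm(750)
)

dt[, .(mean = mean(return), sd = sd(return)), by = market]   # grouped summary
dt[market == "FX" & return > 1]                              # fast subsetting
```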
The power of functions
Even non-programmers eventually build their own functions to perform special operations in different contexts. Functions reduce the quantity of code and the scope for errors. They can also make the intention of code much clearer. As a rule of thumb, a snippet of code should be turned into a function if it is being copied and pasted more than two times. It is often best practice to start by [1] solving a specific simple example problem with a snippet, [2] testing and cleaning up the snippet, and then [3] transferring the clearly written working snippet into a function template.
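As a simple illustration of that progression, the snippet below turns a one-off normalization of a return series into a reusable function (the example and the function name are made up):

```r
# [1] Snippet that normalizes a single vector of returns
x <- rnorm(100)
z <- (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)

# [3] The cleaned-up snippet transferred into a function template
zscore <- function(x) {
  (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)
}

zscore(rnorm(250))   # the same operation now applies to any numeric vector
```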
Importantly, R is a functional programming (FP) language. This means that it provides many tools for the creation and manipulation of functions. In particular, R has first-class functions. This means that one can do anything with functions that one can do with data structures, including [i] assigning them to variables, [ii] storing them in lists, [iii] passing them as arguments to other functions, [iv] creating them inside functions, and [v] returning them as the result of a function.
Functional programming simply uses functions as arguments in other functions. It is typically an alternative to for loops and preferable when for loops obscure the purpose of code by displaying repetitive standard procedures. For example, if a macro trading strategy requires a special way of transforming market or macroeconomic data and if that transformation has been captured in a function, this transformation can be applied efficiently and with little code to all relevant data collections.
In particular, a functional is a function that takes another function as an argument. Functionals make code more succinct. As a rule, functionals are preferable to explicit “for loops” because they express a high-level goal clearly. Functionals reduce bugs by better communicating intent. Most importantly, the functionals implemented in base R are well tested and efficient, because they are used by so many people.
There are two popular sets of functionals in R.
- The first is the apply family. These are really equivalent to “for loops” and no more difficult to use. The apply functions just take a collection of data, apply the input function to every element of the collection in turn, and then store the results in another collection.
- Similarly, the map family of functionals from the purrr package generally iterates the application of functions over a wide range of data structures (see the short sketch below).
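A minimal sketch of both families, applied to a made-up list of return series, might look as follows:

```r
# Hypothetical collection of return series for several markets
returns <- list(EQ = rnorm(250), FX = rnorm(250), IRS = rnorm(250))

# apply family: apply a function to every element and simplify the result
sapply(returns, sd)

# purrr equivalent: map_dbl() iterates the same function and returns a numeric vector
library(purrr)
map_dbl(returns, sd)
```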
The power of packages
In R, the fundamental unit of shareable code is the package. A package bundles together code, data, documentation, and tests, and is easy to share with others. At present, there are almost 15,000 packages available on the “Comprehensive R Archive Network” (CRAN). The ability to find a suitable package for the job at hand can save substantial programming resources.
Moreover, portfolio managers can create their own packages, maybe with the help of a more experienced R programmer. In some sense, the creation of a custom package is just the natural progression of creating custom functions. An in-house package typically improves the documentation of such functions and makes them easier to use and to share.
The power of visualization
Visualization is the key link between scientific data analysis and decisions in macro portfolio management. Confidence in data-based decisions requires a good intuitive understanding of the data and trust in statistical findings. Graphics support intuition and trust better than words.
The R base package provides a range of convenient graphical functions for a “quick and dirty” visualization of data, often in the context of exploration of a data set. Many are executed through the generic plot() function. A helpful overview can be found in R Base Graphics: An Idiot’s Guide. Basic graphics are usually used for quick exploratory graphs, with some examples shown below.
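For example, a quick exploratory look at a made-up cumulative return series and its daily changes requires only a few lines of base graphics:

```r
x <- cumsum(rnorm(250))                     # hypothetical cumulative return series
plot(x, type = "l", main = "Cumulative return", xlab = "day", ylab = "%")
hist(diff(x), breaks = 30, main = "Daily changes")
```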
For more flexible and advanced visualization the ggplot2 package provides a system for creating graphics. It is based on “The Grammar of Graphics”, a set of rules derived from the idea that one can build every graph from the same few components: a data set, a set of geoms (geometric objects that represent data points) and a coordinate system. The central activity of visualizing data with ggplot regularly involves three steps: [1] setting the links between data and plot elements (“aesthetic mappings”), [2] specifying the general type of plot (“geom”) to be used, and [3] adding detail such as “graphical primitives” and other added layers.
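A minimal sketch of these three steps, based on made-up cumulative returns for two currency areas, could look as follows:

```r
library(ggplot2)

df <- data.frame(
  date   = rep(seq(as.Date("2019-01-01"), by = "month", length.out = 24), times = 2),
  area   = rep(c("USD", "EUR"), each = 24),
  return = c(cumsum(rnorm(24)), cumsum(rnorm(24)))
)

ggplot(df, aes(x = date, y = return, colour = area)) +   # [1] aesthetic mappings
  geom_line() +                                          # [2] type of plot (geom)
  labs(title = "Cumulative returns", y = "%") +          # [3] added layers and detail
  theme_minimal()
```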
There is hardly a relevant visualization that ggplot2 cannot do (except maybe the manual drawing of trend elements in time series charts that is such a popular feature on Bloomberg and Reuters Eikon). A collection of the top 50 ggplot2 visualizations with related code can be found on r-statistics.co by Selva Prabhakaran, many of which have relevance for macro trading.
A shortlist of simple visualizations for macro trading, based on ggplot2 and some other specialized packages and achievable with little code, includes the following:
Ranges: It is often important to view the range and rough distribution of position returns or trading signals across different markets. This calls for a classical discrete-x-continuous-y geometric representation. In particular, one often needs a quick view of averages, standard deviations, and outliers across sections. In ggplot2 one can visualize cross-section ranges and distributions either in the form of a box-whisker plot or in the form of a jitter-points plot. The box-whisker plot is easier to read and puts more focus on outliers. The jitter-points plot provides more visual information on the distribution.
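Both views can be produced with one short ggplot2 expression each; the panel of returns below is made up:

```r
library(ggplot2)

df <- data.frame(
  area   = rep(c("USD", "EUR", "JPY", "GBP"), each = 36),   # hypothetical panel
  return = rnorm(144)
)

ggplot(df, aes(x = area, y = return)) + geom_boxplot()                          # box-whisker
ggplot(df, aes(x = area, y = return)) + geom_jitter(width = 0.2, alpha = 0.5)   # jitter points
```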
Distributions: The main purpose of viewing distributions is to detect abnormalities. The most common objective of viewing distribution graphs is to detect skewness (asymmetry in the distribution with tails longer on one side) or kurtosis (high or low weight of the tails relative to a normal distribution).
Time-series graphs: These standard graphs can be plotted in various facets and grids to compare time series across sections and categories.
Categories: It is often important to visualize the relative frequency of categories of data. For example, one may need to know how often a series turns negative (for example to see if deflation has been a major issue), how often a series produces zero values (for example to see if there is a problem in updating), or how often various combinations of positive and negative values of two or more series occur (for example to see if equity and FX returns in EM are mostly pointing in the same direction).
Heatmaps: The purpose of heatmaps is to visualize a large number of numerical values across many categories in order to spot specific patterns or value regions.
Scatterplots: Various forms of scatter plots with added fittings can visualize the relation between variables across times and across markets.
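As an illustration, the sketch below plots a hypothetical trading signal against made-up subsequent returns and adds a linear fit:

```r
library(ggplot2)

df <- data.frame(signal = rnorm(200))               # made-up signal values
df$next_return <- 0.3 * df$signal + rnorm(200)      # made-up subsequent returns

ggplot(df, aes(x = signal, y = next_return)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "lm")                        # added linear fitting
```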
Cross-correlations: A matrix of cross-correlations serves to give a quick graphical overview. There are two convenient tools for this, both taking a dataframe and visualizing the correlations of all series. First, the ggpairs() function of the GGally package is suitable for a smaller set of series and gives scatters, correlation coefficients and series densities.
Second, the corrplot() function of the corrplot package is a visualization function for a correlation matrix, which can be calculated with the cor() function of base R. It has various visualization options, each with less information than the ggpairs() function, but with a greater capacity to visualize correlations of a large number of series.
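Both tools take just one line each once a data frame of series is available; the series names and values below are made up:

```r
library(GGally)
library(corrplot)

# Hypothetical data frame of return series
df <- as.data.frame(matrix(rnorm(500), ncol = 5,
                           dimnames = list(NULL, c("EQ", "FX", "IRS", "CDS", "CMD"))))

ggpairs(df)        # scatters, densities and correlation coefficients
corrplot(cor(df))  # colour-coded correlation matrix, scales to many series
```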
Model outputs: Statistical models in R have an internal structure. Estimates can be visualized by using this implied information. This may require some reshaping before passing the data on to the graphics function but otherwise works like any other visualization.