# Rsampling Fama French

These are my notes on the article Rsampling Fama French, which introduces how to conduct k-fold cross-validation in R using the rsample and yardstick packages. For more details, read the original article.

Our goal today is to explore k-fold cross-validation via the rsample package, and a bit of model evaluation via the yardstick package. In this post, we will use the rmse() function from yardstick, but our main focus will be on the vfold_cv() function from rsample. We are going to explore these tools in the context of linear regression and Fama French, which might seem weird since these tools would typically be employed in the realms of machine learning, classification, and the like. We’ll stay in the world of explanatory models via linear regression for a couple of reasons.

First, and this is a personal preference, when getting to know a new package or methodology, I prefer to do so in a context that’s already familiar. I don’t want to learn rsample whilst also getting to know a new data set and learning the complexities of some crazy machine learning model. Since Fama French is familiar from our previous work, we can focus on the new tools in rsample and yardstick. Second, factor models are important in finance, despite relying on good old linear models, and we might even find creative new ways to deploy or visualize them.

Next, we will import daily prices for five ETFs, convert them to returns, then import the Fama French five-factor data and join it to our ETF returns data. Here’s the code to make that happen:

After running that code, we have an object called data_joined_tidy. It holds daily returns for five ETFs and the Fama French factors. Here’s a look at the first row for each ETF.

Let’s work with just one ETF for today and use filter(asset == "AGG") to shrink our data to just that ETF:
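As a minimal sketch of that filtering step: the tibble below is a toy stand-in for data_joined_tidy, not the real data. The tickers other than AGG and all of the numbers are made up, and the column names returns and MKT are assumptions.

```r
library(dplyr)

# Toy stand-in for data_joined_tidy; values and non-AGG tickers are hypothetical.
data_joined_tidy <- tibble(
  asset   = c("SPY", "AGG", "EFA", "AGG"),
  returns = c(0.010, 0.001, -0.004, -0.002),
  MKT     = c(0.009, 0.009, -0.003, -0.003)
)

# Keep only the rows that belong to AGG.
agg_ff_data <- data_joined_tidy %>%
  filter(asset == "AGG")

agg_ff_data
```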

Okay, we’re going to regress the daily returns of AGG on one factor, then three factors, then five factors, and we want to evaluate how well each model explains AGG’s returns. That means we need a way to test the model. Last time, we looked at the adjusted $R^2$ values when the model was run on the entirety of AGG returns. Today, we’ll evaluate the model using k-fold cross-validation. That’s a pretty jargon-heavy phrase that isn’t part of the typical finance lexicon. Let’s start with the second part, cross-validation. Instead of running our model on the entire data set - all daily returns of AGG - we will run it on just part of the data set, then test the results on the part that we did not use. Those different subsets of our original data are often called the training and testing sets, though rsample calls them the analysis and assessment sets. We validate the model results by applying them to the assessment data and seeing how the model performed.

The k-fold bit refers to the fact that we’re not just dividing our data into training and testing subsets; we’re actually going to divide it into a bunch of groups, a k number of groups, or a k number of folds. One of those folds will be used as the validation set; the model will be fit on the other k-1 folds, and then tested on the validation set. This is repeated so that each fold takes one turn as the validation set. We’re doing this within a linear model to see how well it explains the data; it’s typically used in machine learning to see how well a model predicts data.

If you’re like me, it will take a bit of tinkering to really grasp k-fold cross-validation, but rsample has a great function for dividing our data into k folds. If we wish to use five folds (the state of the art seems to be either five or ten folds), we call the vfold_cv() function, pass it our data object agg_ff_data and set v = 5.
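The call described above might look like the sketch below. Because the real agg_ff_data is not reproduced in these notes, the data frame here is simulated: the coefficients, noise levels, and seed are arbitrary assumptions chosen only so that the example runs.

```r
library(rsample)

set.seed(752)

# Simulated stand-in for agg_ff_data (1509 daily observations); not real returns.
agg_ff_data <- data.frame(MKT = rnorm(1509, sd = 0.01))
agg_ff_data$returns <- -0.05 * agg_ff_data$MKT + rnorm(1509, sd = 0.002)

# Divide the data into five folds.
cved_ff <- vfold_cv(agg_ff_data, v = 5)

cved_ff

# Each element of the splits column can be unpacked into its two parts.
first_split <- cved_ff$splits[[1]]
nrow(analysis(first_split))
nrow(assessment(first_split))
```

The folds are assigned at random, so calling set.seed() beforehand keeps the split reproducible between runs.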

We have an object called cved_ff, with a column called splits and a column called id. Let’s peek at the first split.

We see three numbers. The first, 1207, tells us how many observations are in the analysis set. Since we have five folds, we should have about 80% of our data in the analysis set. The second number, 302, tells us how many observations are in the assessment set, which is about 20% of our original data. The third number, 1509, is the total number of observations in our original data.
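Those three numbers can be checked with a little arithmetic. The base R sketch below assumes the 1509 rows are dealt into five folds as evenly as possible, which is what the reported counts suggest; it does not call rsample itself.

```r
n <- 1509  # total observations
v <- 5     # number of folds

# Deal row positions into v folds as evenly as possible.
fold_id <- rep(seq_len(v), length.out = n)
fold_sizes <- as.integer(table(fold_id))
fold_sizes

# Holding out one of the larger folds leaves the rest for the analysis set.
assessment_n <- fold_sizes[1]
analysis_n <- n - assessment_n

c(analysis = analysis_n, assessment = assessment_n, total = n)
```

Since 1509 is not divisible by five, four folds hold 302 observations and one holds 301, so the 80/20 proportions are approximate.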

Next, we want to apply a model to the analysis set of the k-folded data and test the results on the assessment set. Let’s start with one factor and run a simple linear model, lm(returns ~ MKT).

We want to run it on analysis(cved_ff$splits[[1]]) - the analysis set of our first split. Nothing too crazy so far. Now we want to test on our assessment data. The first step is to add the model’s fitted values to the assessment set. We’ll use augment() for that task, and pass it assessment(cved_ff$splits[[1]]).

We just added our fitted values to the assessment data, the subset of the data on which the model was not fit. How well did our model do when we compare the fitted values to the actual values in the held-out set?
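The fit-then-augment step might look like the sketch below. It again uses simulated data in place of the real agg_ff_data, and the object names first_split, ff_model_test, and augmented are illustrative choices rather than names taken from the original article.

```r
library(rsample)
library(broom)

set.seed(752)

# Simulated stand-in for agg_ff_data; not real returns.
agg_ff_data <- data.frame(MKT = rnorm(1509, sd = 0.01))
agg_ff_data$returns <- -0.05 * agg_ff_data$MKT + rnorm(1509, sd = 0.002)
cved_ff <- vfold_cv(agg_ff_data, v = 5)

first_split <- cved_ff$splits[[1]]

# Fit the one-factor model on the analysis set only.
ff_model_test <- lm(returns ~ MKT, data = analysis(first_split))

# Add fitted values for the held-out rows via the newdata argument.
augmented <- augment(ff_model_test, newdata = assessment(first_split))

head(augmented[, c("returns", "MKT", ".fitted")])
```

The .fitted column holds the model’s values for rows it never saw during fitting, which is what makes the comparison with returns a fair test.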

We can use the rmse() function from yardstick to measure our model. RMSE stands for root mean squared error. It’s the square root of the mean of the squared differences between our fitted values and the actual values in the assessment data. A lower RMSE is better!
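To make the definition concrete, the sketch below computes the RMSE twice on one held-out fold: once with yardstick’s rmse() and once by hand. The data are simulated and the object names are illustrative assumptions.

```r
library(rsample)
library(broom)
library(yardstick)

set.seed(752)

# Simulated stand-in for agg_ff_data; not real returns.
agg_ff_data <- data.frame(MKT = rnorm(1509, sd = 0.01))
agg_ff_data$returns <- -0.05 * agg_ff_data$MKT + rnorm(1509, sd = 0.002)
first_split <- vfold_cv(agg_ff_data, v = 5)$splits[[1]]

fit <- lm(returns ~ MKT, data = analysis(first_split))
augmented <- augment(fit, newdata = assessment(first_split))

# RMSE from yardstick: truth is the actual return, estimate is the fitted value.
rmse_yardstick <- rmse(augmented, truth = returns, estimate = .fitted)$.estimate

# The same quantity by hand: square root of the mean squared difference.
rmse_by_hand <- sqrt(mean((augmented$returns - augmented$.fitted)^2))

all.equal(rmse_yardstick, rmse_by_hand)
```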

Now that we’ve done that piece by piece, let’s wrap the whole operation into one function. This function takes one argument, a split, and we’re going to use pull() so we can extract the raw number, instead of the entire tibble result.
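One way such a function might be written is sketched below. The name model_one and the simulated data are assumptions made for illustration; the steps inside follow the piece-by-piece description above.

```r
library(rsample)
library(broom)
library(yardstick)
library(dplyr)

# Hypothetical helper: fit on the analysis set, score on the assessment set.
model_one <- function(split) {
  ff_model <- lm(returns ~ MKT, data = analysis(split))

  augment(ff_model, newdata = assessment(split)) %>%
    rmse(returns, .fitted) %>%
    pull(.estimate)  # return the raw number rather than a one-row tibble
}

set.seed(752)

# Simulated stand-in for agg_ff_data; not real returns.
agg_ff_data <- data.frame(MKT = rnorm(1509, sd = 0.01))
agg_ff_data$returns <- -0.05 * agg_ff_data$MKT + rnorm(1509, sd = 0.002)
cved_ff <- vfold_cv(agg_ff_data, v = 5)

first_rmse <- model_one(cved_ff$splits[[1]])
first_rmse
```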

Now we want to apply that function to each of our five folds that are stored in cved_ff. We do that with a combination of mutate() and map_dbl(). We use map_dbl() instead of map() because we are returning a number here and there’s not a good reason to store that number in a list column.

OK, we have five RMSEs, since we ran the model on five separate analysis sets and tested on five separate assessment sets. Let’s find the average RMSE by taking the mean() of the rmse column. That can help reduce the noise that results from our random creation of those five folds.
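Applying the function across the folds and averaging might look like the sketch below. The helper model_one, the result name rmse_one_model, and the simulated data are illustrative assumptions.

```r
library(rsample)
library(broom)
library(yardstick)
library(dplyr)
library(purrr)

# Hypothetical helper: RMSE of the one-factor model on a split's assessment set.
model_one <- function(split) {
  ff_model <- lm(returns ~ MKT, data = analysis(split))
  augment(ff_model, newdata = assessment(split)) %>%
    rmse(returns, .fitted) %>%
    pull(.estimate)
}

set.seed(752)

# Simulated stand-in for agg_ff_data; not real returns.
agg_ff_data <- data.frame(MKT = rnorm(1509, sd = 0.01))
agg_ff_data$returns <- -0.05 * agg_ff_data$MKT + rnorm(1509, sd = 0.002)
cved_ff <- vfold_cv(agg_ff_data, v = 5)

# One RMSE per fold, stored as a plain numeric column.
rmse_one_model <- cved_ff %>%
  mutate(rmse = map_dbl(splits, model_one))

rmse_one_model

# Average across the five folds.
mean_rmse <- mean(rmse_one_model$rmse)
mean_rmse
```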

That code flow worked just fine, but we would have to repeat ourselves to create a function for each of the other models. Let’s toggle to a flow where we define three models - the ones that we wish to test via cross-validation and RMSE - then pass those models to one function.
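One possible shape for that flow is sketched below; it is not necessarily how the original article does it. The factor column names SMB, HML, RMW, and CMA are the conventional Fama French labels and are assumed here, the data are simulated, and rmse_for_formula is a hypothetical helper.

```r
library(rsample)
library(broom)
library(yardstick)
library(dplyr)
library(purrr)

set.seed(752)
n <- 1509

# Simulated stand-in for agg_ff_data with five factor columns; not real data.
agg_ff_data <- data.frame(
  MKT = rnorm(n, sd = 0.01), SMB = rnorm(n, sd = 0.005),
  HML = rnorm(n, sd = 0.005), RMW = rnorm(n, sd = 0.004),
  CMA = rnorm(n, sd = 0.004)
)
agg_ff_data$returns <- -0.05 * agg_ff_data$MKT + 0.02 * agg_ff_data$HML +
  rnorm(n, sd = 0.002)

cved_ff <- vfold_cv(agg_ff_data, v = 5)

# The three models, expressed as formulas.
models <- list(
  one_factor   = returns ~ MKT,
  three_factor = returns ~ MKT + SMB + HML,
  five_factor  = returns ~ MKT + SMB + HML + RMW + CMA
)

# Hypothetical helper: RMSE of any formula on one split.
rmse_for_formula <- function(split, formula) {
  fit <- lm(formula, data = analysis(split))
  augment(fit, newdata = assessment(split)) %>%
    rmse(returns, .fitted) %>%
    pull(.estimate)
}

# Mean cross-validated RMSE for each model.
cv_rmse <- map_dbl(models, function(f) {
  mean(map_dbl(cved_ff$splits, function(s) rmse_for_formula(s, f)))
})

cv_rmse
```

Keeping the formulas in a named list means a fourth model can be added without writing another function.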

Bùi Nam Phong
