*[Editor's Note: This is a guest post from Christian Hamson. Christian is the Chief Credit Officer at NSR Invest. He has 20 years of experience underwriting and building predictive models specifically for unsecured consumer installment loans. He is now using his expertise to guide the NSRInvest Fund and managed accounts.]*

An important and seemingly simple question is how many loans an investor should buy to lower the variance of their expected return to a tolerable level. Part one addressed this issue using the information readily available from Lending Club and Prosper. Several blogs and other posts have also addressed this question.

A few examples of blog posts that have addressed diversification:

- Lending Robot calculates this number to be 146, using Lending Club historical data and a Monte Carlo methodology.
- Peter Renton at Lend Academy calculated a number of ~500 using Prosper data.
- Simon from LendingMemo says at least 200 is best.

What I hope to do in this post is to use traditional statistical methods to estimate the number of loans necessary for adequate diversification.

All of these posts used historical data, differing but reasonable methodologies, and a few basic assumptions. The Lending Robot post does a great job of making explicit the assumptions it uses to calculate an answer. Two key assumptions made in the Lending Robot methodology are:

- The returns on the loans follow a normal distribution. This is a good assumption. The distribution of a proportion is binomial, and even for fairly small sample sizes the normal distribution can be shown to approximate the binomial distribution. The binomial distribution describes the number of events (defaults) in n independent experiments (loans), where there are only two possible outcomes (default or non-default) and an individual default occurs with probability p.
- Any two loans are independent. If one borrower defaults, that isn’t likely to change the probability of default of some other borrower.
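A minimal sketch of the first assumption, comparing exact binomial probabilities with their normal approximation. The portfolio of 200 loans and the 10% default probability are hypothetical values chosen only for illustration, and the helper names are not from the Lending Robot post.

```python
import math
from statistics import NormalDist


def binomial_cdf(k: int, n: int, p: float) -> float:
    """Exact P(X <= k) for X ~ Binomial(n, p)."""
    return sum(math.comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def normal_approx_cdf(k: int, n: int, p: float) -> float:
    """Normal approximation to P(X <= k), with a continuity correction."""
    mean = n * p
    sd = math.sqrt(n * p * (1 - p))
    return NormalDist(mean, sd).cdf(k + 0.5)


# Hypothetical portfolio: 200 independent loans, each with a 10% default probability.
n_loans, p_default = 200, 0.10
for k in (10, 15, 20, 25, 30):
    exact = binomial_cdf(k, n_loans, p_default)
    approx = normal_approx_cdf(k, n_loans, p_default)
    print(f"P(defaults <= {k}): exact={exact:.4f}  normal={approx:.4f}")
```

The two columns should track each other closely near the expected number of defaults and drift apart somewhat in the tails, where the skew of the binomial shows up.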

## Assumption: Loan Defaults are not correlated

If I make the same two assumptions as Lending Robot, it is straightforward to calculate the sample size (the number of loans an investor ought to buy) using any of the hundreds of sample size calculators freely available on the web. I did a search using Google and found this tool. I used a margin of error of 5%, a confidence level of 95%, a population of 100,000 and a sample proportion of 9.5% (a loss rate that would drive the returns of an investor negative). With these inputs a sample size of 132 is suggested.
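Calculators of this kind typically use the normal-approximation formula for a proportion with a finite population correction. Which formula the tool above uses is an assumption, but the sketch below reproduces the suggested 132 from the same inputs. The function name `sample_size` is made up for this example.

```python
import math
from statistics import NormalDist


def sample_size(margin: float, confidence: float, p: float, population: int) -> int:
    """Loans needed to estimate a proportion p to within +/- margin."""
    # Two-sided critical value, about 1.96 for 95% confidence.
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    # Sample size for an effectively infinite population.
    n0 = z**2 * p * (1 - p) / margin**2
    # Finite population correction (assumed; it barely matters at 100,000).
    n = n0 / (1 + (n0 - 1) / population)
    return math.ceil(n)


print(sample_size(margin=0.05, confidence=0.95, p=0.095, population=100_000))
```

Because the margin of error enters as a square in the denominator, halving it to 2.5% roughly quadruples the number of loans required.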

This number is right in the ballpark of the Lending Robot, Lending Club, and Prosper suggestions. Play with the sample size calculators. Put in your own assumptions. Explore. What if loan defaults are not independent? What if borrowers are subject to factors in the economic environment, such as unemployment, that cause the probabilities of default to correlate? What should the number of loans for adequate diversification be in this case?

## Assumption: Loan Defaults are Correlated to Some Extent

I will repeat the analysis under the assumption that defaults are correlated. Warning! The math gets much, much harder. Suppose default rate is correlated with unemployment rate. Unemployment rates will impact multiple borrowers and thus the probability of default for loans will correlate. Let’s start with some evidence that loan defaults are correlated.

The last recession was a fantastic learning experience. We have a treasure-trove of data (acquired at tremendous expense!) that demonstrates correlation between loan defaults. The above graph is from Calculated Risk. The correlation is obvious to the naked eye. Unsecured installment loans aren’t mortgages, and there were numerous contributors to the mortgage default debacle in addition to unemployment, but at least we have an estimate of the correlation.

The line fit to the data above has an R-squared of 0.46, and consequently the correlation between unemployment and mortgage default is 0.68 (the square root of the R-squared). That’s a lot of correlation.
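Written out, and assuming a simple one-predictor linear fit (where the correlation coefficient is the signed square root of R-squared, taken as positive here on the assumption that delinquencies rise with unemployment):

$$
r = \sqrt{R^2} = \sqrt{0.46} \approx 0.68
$$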

My estimation process of the number of loans needed is based on work done in the medical research field for repeated measures. Suppose we count the number of red blood cells in a patient with anemia, then we give that patient our experimental drug that is supposed to stimulate blood cell production. After a week we count the red blood cells again. Those two counts may differ, but they are obviously correlated.

If we specify a generalized linear model for this case, and input our assumptions for a default rate and correlation, we can, in essence, run the model “backwards” so that a number that is usually an input, the number of observations, is produced as an output. (Note that the repeated measures model is not correct for the loan default situation. A better model would be a hierarchical model.) Since our estimate of correlation is just that, an estimate, we’ll try several values for correlation and determine how the number of loans depends on correlation. The input values run from 0.1 to 0.9 in increments of 0.1.

And the answer is…

For our estimated correlation of 0.68, at the upper 90% confidence level, the number of loans is 1,070.
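For readers who want to experiment, here is a much simpler textbook sketch of why correlation inflates the number of loans needed. It is not the generalized linear model used above and does not reproduce the 1,070 figure. It assumes every pair of loans shares the same pairwise default correlation `rho`, which is a different quantity from the 0.68 unemployment-to-delinquency correlation, so the grid of values is purely illustrative. Under that assumption the variance of the portfolio default rate is the independent-case variance multiplied by a design effect of 1 + (n - 1) * rho.

```python
import math


def portfolio_default_sd(n: int, p: float, rho: float) -> float:
    """Std. dev. of the observed default rate over n loans with pairwise correlation rho."""
    # Design effect: variance inflation relative to independent loans.
    design_effect = 1 + (n - 1) * rho
    return math.sqrt(p * (1 - p) / n * design_effect)


p_default = 0.095  # loss rate used in the independent-case calculation
portfolio_sizes = (132, 1_070, 10_000)

# rho = 0 is the independent baseline; 0.1 to 0.9 mirrors the grid in the post.
for rho in [0.0] + [round(0.1 * i, 1) for i in range(1, 10)]:
    sds = [portfolio_default_sd(n, p_default, rho) for n in portfolio_sizes]
    print(f"rho={rho:.1f}  " + "  ".join(f"n={n}: {sd:.3f}" for n, sd in zip(portfolio_sizes, sds)))
```

In this simplified model the standard deviation levels off at sqrt(rho * p * (1 - p)) as n grows instead of shrinking toward zero, so adding loans helps less and less once defaults move together.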

Please note that we are using data from the Great Recession to estimate the correlation. In no way, shape or form will 1,070 loans get you through the next recession. That’s a topic for another day, and will not be resolved simply by purchasing more loans. I’d love your thoughts and feedback on this post and methodology. I’d really appreciate someone specifying the hierarchical model and running this analysis again. 🙂

Fred says

I find the chart of MortgageDelinquency vs. UnemploymentRate to be “shaky” for the following reasons:

1. the x-axis (UnemploymentRate) spans 4 quarters

2. the y-axis (MortgageDelinquency) spans only for 1 quarter

3. the “dots” are essentially one data point parameterized by US States. If we are talking about the national numbers, we only have 1 “dot”, which does not have R-squared

The chart would be much more robust if OP had used time-aligned data:

1. UnemploymentRate from 2005-2015: https://data.bls.gov/timeseries/LNS14000000

2. MortgageDelinquency from 2005-2015: https://research.stlouisfed.org/fred2/series/DRSFRMACBS

Christian says

Thanks Fred. I agree my MortgageDelinquency v Unemployment graph is a little bit “shaky.” I gave it another go using the data sources you linked to. (I love FRED for data, which means I also like your username.) I’ve posted the graph here: https://www.lendacademy.com/wp-content/uploads/2015/05/MortDelqRateByUnemployment.png

The correlation is 0.94, yikes! Just extrapolating from my other graph of # of loans vs. correlation that would be somewhere in the neighborhood of 1500 loans, if not more given the nonlinearity.

rawraw says

Assuming the correlation from one of the only deep national recessions for all credit events is a bit extreme IMO. Of course defaults were correlated in the last cycle. But will they be this correlated in typical recessions? Likely not, as they tend to be regional in nature.

Edward says

“In no way, shape or form will 1,070 loans get you through the next recession. That’s a topic for another day, and will not be resolved simply by purchasing more loans.”

I hope the author and others will elaborate more on this statement with your thoughts on what we can do before the next recession hits to help us mitigate losses. Thank you.

Christian says

Edward –

I think you’ve just asked the $50,000 question. (Maybe it’s more like the $2 billion question now.) I have plans to post on this topic; I haven’t nailed down the dates yet. My thoughts are that we’ll need a series of posts because the topic is both broad and important, and a couple of different viewpoints as well.

Hit me up with other topic ideas you might have or subtopics within the recession topic.

Thanks

Richard says

Great topic! The mathematics are beyond my comprehension, but I sure get the idea. More is better. A couple of questions came to mind:

1. I have about 1500 notes on Lending Club and 750 on Prosper for a total of 2250 notes. Does the data suggest that the diversification considerations are platform specific, i.e. 1500 and 750? Or would they apply to the entire holdings of 2250?

2. Would the data be applicable only to portfolios that match Lending Club and Prosper’s distribution of notes by grade? If I filter to overweight lower or higher grade notes, intuitively it seems that would affect the correlations?

Keep up the good work!

Ian Ippolito says

Great article Christian and very interesting. I’m voting my plus one for seeing that article on the analysis for the next recession!

Jacob says

Hi Richard,

I’m not a stats expert by any means, but I’ll take a shot at answering your questions as I understand them.

1) If you have notes on both Prosper and Lending Club, that would effectively increase your sample size and reduce your risk. Even though your notes are on two different platforms, the fact that you have more notes overall should decrease the likelihood that your expected return (based on past performance) will deviate from the median return of the entire population (e.g. all notes). In other words, you are effectively reducing your risk that you will generate a wildly different return from the “average return” of all notes on each platform. For example, suppose you are selecting a particular grade of notes (say, A & B grade only) and the historical data says you can expect a return of around 6% for those particular notes (e.g. an average of all notes on the platform for those grades). The return of your own specific “sample” could be higher or lower than that, depending on the mix of notes you select, because they may not behave exactly as the average of the entire population of notes. But as you diversify, you reduce the risk that you will be dramatically higher or lower than the average.

2) If you have a portfolio that is a mixed composition of notes of different grades, you would almost need to treat each grade as a separate population. This is because the population for each grade looks very different as you go further down the credit curve. For example, if you had notes that were very highly weighted to the lower grades, there is obviously going to be more volatility in the default rate due to the higher risk profile of those borrowers. Your actual return is going to be a weighted average of all of the notes in your portfolio. But to mitigate risk, you’d probably want to make sure that you have a sufficient sample for each grade of notes in your portfolio. In the lower risk notes, this probably means you will need far fewer notes (since there is less variance in default), but you’ll likely need many more notes for those riskier borrowers. In any event, you could play with the calculator and see about how big a sample of notes you’d need to be confident that your return would match the population average.

Rob L says

Other than the possibility that availability places an upper limit on the number of loans that meet one’s investment criteria, is there any downside to owning more notes rather than fewer? Granted, if you are not in an IRA there would be more paperwork. I’ll give up the chance of being lucky to avoid the possibility of being unlucky.

Kamal says

I would recommend calculating a Sharpe ratio and seeing how that improves as you add more loans to a portfolio. I suspect you would find very little improvement beyond 100 loans regardless of correlation. A blog post showing the relationship between correlation and Sharpe ratio might be more interesting, as would, of course, various ways of estimating the true correlation.

The last result showing the relationship between correlation and “sample size” is nonsense. I could be persuaded to do a proper analysis if anyone is interested.

Christian says

A Sharpe ratio approach is a good idea. How would you propose calculating the denominator? I don’t think the assumption that returns follow a normal distribution and that we can use the standard deviation is the direction we would want to go with this particular asset class.

Why do you expect to see very little improvement in the Sharpe ratio beyond ~100 loans?

Why is the result showing the relationship between correlation and sample size nonsense and why did you put sample size in quotes?

I’m definitely interested in what you would consider a “proper” analysis. I don’t think such a thing exists outside the classroom. “Essentially, all models are wrong, but some are useful.” – Box