Preface
Continuing the series "Building a Powerful Crypto-Asset Portfolio Using Multi-Factor Models", we have already published two articles: Theoretical Basics and Data Preprocessing.
This is the third article: Factor Validity Testing.
After computing the concrete factor values, we must first run a validity test on the factors and screen out those that meet the requirements of significance, stability, monotonicity, and return contribution. The factor validity test analyzes the relationship between the current period's factor value and the next period's expected return to judge whether a factor is valid. There are three classic methods:
IC/IR method: the IC/IR value is the correlation coefficient between the factor value and the expected return; the larger its absolute value, the better the factor performs.
T-value (regression) method: the T-value reflects the significance of the coefficient obtained by linearly regressing the next period's return on the current period's factor value. By checking whether the regression coefficient passes the t-test, we can judge the contribution of the current period's factor value to the next period's return. This method is typically used in multivariate (i.e., multi-factor) regression models.
Stratified backtesting method: tokens are stratified by factor value, and the return of each layer's token portfolio is then calculated to determine the monotonicity of the factor.
1. IC/IR Method
(1) Definition of IC/IR
IC: Information Coefficient, which measures the factor's ability to predict token returns. The IC of a given period is the correlation coefficient between the current period's factor value and the next period's return.
The closer the IC is to 1, the stronger the positive correlation between the factor value and the next period's return. IC = 1 means the factor's token selection is 100% accurate: the token with the highest factor score has the largest gain in the next rebalancing cycle.
The closer the IC is to -1, the stronger the negative correlation between the factor value and the next period's return. IC = -1 means the token with the highest factor score has the largest decline in the next rebalancing cycle, i.e., the factor is a perfect inverse indicator.
The closer the IC is to 0, the weaker the factor's predictive power, indicating that the factor has no predictive ability for token returns.
IR: Information Ratio, which measures the factor's ability to earn stable alpha. IR is the mean of the ICs over all periods divided by the standard deviation of the ICs over all periods.
When the absolute value of the IC exceeds 0.05 (some practitioners use 0.02), the factor's token-selection ability is strong. When the IR exceeds 0.5, the factor can earn excess returns stably.
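As a minimal numeric sketch of these two definitions (the per-period IC series here is hypothetical):
import numpy as np

# Hypothetical per-period rank ICs of one factor over 8 rebalancing periods
ics = np.array([0.08, 0.03, 0.11, -0.02, 0.06, 0.09, 0.01, 0.07])

ic_mean = ics.mean()            # average predictive power
ir = ic_mean / ics.std(ddof=1)  # IR = mean(IC) / std(IC)

print(f'IC mean: {ic_mean:.3f}')  # |IC mean| > 0.05 -> strong selection ability
print(f'IR: {ir:.3f}')            # IR > 0.5 -> stable excess returns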
(2) How to calculate IC
Normal IC (Pearson correlation): the classic Pearson correlation coefficient. However, this calculation rests on several assumptions: the data are continuous and normally distributed, the two variables have a linear relationship, and so on.
Rank IC (Spearman's rank correlation coefficient): first rank both variables, then compute the Pearson correlation coefficient on the ranks. Spearman's rank correlation evaluates the monotonic relationship between two variables and, because the data are converted to ordinal values, it is less affected by outliers. The Pearson correlation evaluates the linear relationship between two variables; it not only places prerequisites on the raw data but is also strongly affected by outliers. In practical calculations, computing the rank IC is the more robust choice.
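A quick sketch of the difference, using pandas' built-in correlation methods on hypothetical factor/return data; the single outlier distorts the Pearson value much more than the Spearman value:
import pandas as pd

# Hypothetical factor values and next-period returns for 6 tokens
df = pd.DataFrame({
    'factor':  [0.1, 0.2, 0.3, 0.4, 0.5, 0.6],
    'returnM': [0.01, 0.02, 0.03, 0.04, 0.05, 0.90],  # the last return is an outlier
})

normal_ic = df['factor'].corr(df['returnM'])                     # Pearson
rank_ic   = df['factor'].corr(df['returnM'], method='spearman')  # Spearman

print(f'normal IC: {normal_ic:.3f}, rank IC: {rank_ic:.3f}')  # rank IC = 1.0 here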
(3) IC/IR method code implementation
First, build a list of unique dates in ascending order to record the rebalancing dates:
import numpy as np
import pandas as pd

class TestAlpha(object):
    def __init__(self, ini_data):
        self.ini_data = ini_data

    def chooseDate(self, cycle, start_date, end_date):
        '''
        cycle: 'day', 'month', 'quarter', or 'year'
        Picks the last trading date of each cycle within [start_date, end_date]
        from the 'date' column of the original DataFrame as a rebalancing date.
        '''
        chooseDate = []
        dateList = sorted(self.ini_data[self.ini_data['date'].between(start_date, end_date)]['date'].drop_duplicates().values)
        dateList = pd.to_datetime(dateList)
        for i in range(len(dateList) - 1):
            # when the cycle attribute changes, dateList[i] is the last date of its cycle
            if getattr(dateList[i], cycle) != getattr(dateList[i + 1], cycle):
                chooseDate.append(dateList[i])
        chooseDate.append(dateList[-1])
        chooseDate = [date.strftime('%Y-%m-%d') for date in chooseDate]
        return chooseDate
    def ICIR(self, chooseDate, factor):
        # 1. Compute the IC on each rebalancing date, i.e. IC_t
        testIC = pd.DataFrame(index=chooseDate, columns=['normalIC', 'rankIC'])
        dfFactor = self.ini_data[self.ini_data['date'].isin(chooseDate)][['date', 'name', 'price', factor]]
        for i in range(len(chooseDate) - 1):
            # (1) normal IC (Pearson correlation)
            X = dfFactor[dfFactor['date'] == chooseDate[i]][['date', 'name', 'price', factor]].rename(columns={'price': 'close0'})
            Y = pd.merge(X, dfFactor[dfFactor['date'] == chooseDate[i + 1]][['date', 'name', 'price']], on=['name']).rename(columns={'price': 'close1'})
            Y['returnM'] = (Y['close1'] - Y['close0']) / Y['close0']
            Yt = np.array(Y['returnM'])
            Xt = np.array(Y[factor])
            Y_mean = Y['returnM'].mean()
            X_mean = Y[factor].mean()
            num = np.sum((Xt - X_mean) * (Yt - Y_mean))
            den = np.sqrt(np.sum((Xt - X_mean) ** 2) * np.sum((Yt - Y_mean) ** 2))
            normalIC = num / den  # Pearson correlation
            # (2) rank IC (Spearman correlation: Pearson on the ranks)
            Yr = Y['returnM'].rank()
            Xr = Y[factor].rank()
            rankIC = Yr.corr(Xr)
            testIC.iloc[i] = normalIC, rankIC
        testIC = testIC[:-1].astype(float)  # the last date has no next-period return
        # 2. From the IC_t series compute [IC_Mean, IC_Std, IR, proportion of IC<0 (factor direction), proportion of |IC|>0.05]
        '''
        ICmean: |IC|>0.05 -> the factor selects tokens well; the factor value is highly correlated with the next period's return. |IC|<0.05 -> weak selection ability, low correlation with the next period's return.
        IR: |IR|>0.5 -> strong selection ability and a stable IC series. |IR|<0.5 -> the factor is not very effective; close to 0 means basically invalid.
        IClZero (IC less than zero): if IC<0 in roughly half of the periods, the factor has no clear direction; if IC<0 in well over half, it is a negative factor, i.e. returns fall as the factor value rises.
        ICALzpF (|IC| above 0.05): a high proportion of |IC|>0.05 indicates the factor is effective in most periods.
        '''
        IR = testIC.mean() / testIC.std()
        IClZero = testIC[testIC < 0].count() / testIC.count()
        ICALzpF = testIC[abs(testIC) > 0.05].count() / testIC.count()
        combined = pd.concat([testIC.mean(), testIC.std(), IR, IClZero, ICALzpF], axis=1)
        combined.columns = ['ICmean', 'ICstd', 'IR', 'IClZero', 'ICALzpF']
        # 3. Plot the cumulative IC over the backtest period
        print("Test IC Table:")
        print(testIC)
        print("Result:")
        print('normal Skewness:', testIC['normalIC'].skew(), 'rank Skewness:', testIC['rankIC'].skew())
        print('normal Kurtosis:', testIC['normalIC'].kurt(), 'rank Kurtosis:', testIC['rankIC'].kurt())
        return combined, testIC.cumsum().plot()
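A usage sketch, assuming ini_data is a long-format DataFrame with at least the columns 'date', 'name' (token), 'price', and one column per factor; the file name and factor name here are hypothetical:
# Hypothetical panel: one row per token per day, columns date/name/price/mom_21d
ini_data = pd.read_csv('token_panel.csv')

test = TestAlpha(ini_data)
rebalance_dates = test.chooseDate(cycle='month',
                                  start_date='2021-01-01',
                                  end_date='2022-12-31')
combined, _ = test.ICIR(rebalance_dates, factor='mom_21d')
print(combined)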
2. T-value test (regression method)
The T-value method also tests the relationship between the current period's factor value and the next period's return, but it differs from the IC/IR method in how it analyzes that relationship. The T-value method takes the next period's return as the dependent variable Y and the current period's factor value as the independent variable X, regresses Y on X, and then runs a t-test on the regression coefficient of the factor value to check whether it is significantly different from 0, i.e., whether the current period's factor value affects the next period's return.
The essence of this method is to fit a bivariate (simple linear) regression model. The specific formula is as follows:
(1) Regression method theory
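In terms of the quantities the code below computes (a reconstruction from that code), where $X_{i,t}$ is token $i$'s factor value in period $t$, $r_{i,t+1}$ its next-period return, and $n$ the number of tokens in the cross-section:

$$r_{i,t+1} = \hat{a}_t + \hat{b}_t X_{i,t} + u_{i,t}$$

$$t_t = \frac{\hat{b}_t}{se(\hat{b}_t)}, \qquad se(\hat{b}_t) = \sqrt{\frac{\sum_i u_{i,t}^2 / (n-2)}{\sum_i (X_{i,t} - \bar{X}_t)^2}}$$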
(2) Regression method code implementation
    def regT(self, chooseDate, factor, return_24h):
        testT = pd.DataFrame(index=chooseDate, columns=['coef', 'T'])
        for i in range(len(chooseDate) - 1):
            X = self.ini_data[self.ini_data['date'] == chooseDate[i]][factor].values
            Y = self.ini_data[self.ini_data['date'] == chooseDate[i + 1]][return_24h].values
            b, intc = np.polyfit(X, Y, 1)  # slope and intercept
            ut = Y - (b * X + intc)  # residuals
            # t value: t = (\hat{b} - 0) / se(\hat{b})
            n = len(X)
            dof = n - 2  # degrees of freedom
            s = np.sqrt(np.sum(ut ** 2) / dof)  # residual standard error
            se_b = s / np.sqrt(np.sum((X - X.mean()) ** 2))  # standard error of the slope
            t_stat = b / se_b
            testT.iloc[i] = b, t_stat
        testT = testT[:-1].astype(float)
        testT_mean = testT['T'].abs().mean()
        testT_L196 = len(testT[testT['T'].abs() > 1.96]) / len(testT)
        print('testT_mean:', testT_mean)
        print('The proportion of |T| values greater than 1.96:', testT_L196)
        return testT
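As a sanity check, the slope and its t-statistic can be cross-checked with scipy.stats.linregress, which returns the standard error of the slope directly (a sketch, reusing the X and Y arrays from one iteration of the loop above):
from scipy import stats

# X: factor values on rebalancing date t; Y: returns over the following period
result = stats.linregress(X, Y)
t_stat = result.slope / result.stderr  # stderr is the standard error of the slope
print('coef:', result.slope, 't:', t_stat, 'p-value:', result.pvalue)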
3. Stratified backtesting method
Stratification refers to dividing all tokens into layers, and backtesting refers to calculating the return of each layer's token portfolio.
(1) Stratification
First, obtain the factor values for the token pool and sort the tokens by factor value in ascending order, so that tokens with smaller factor values rank first. Then divide the tokens into equal groups according to this ordering: layer 0 holds the tokens with the smallest factor values and layer 9 those with the largest.
In theory, "equal division" means splitting the number of tokens equally, i.e., every layer contains the same number of tokens; this is achieved with quantiles. In practice, the total number of tokens is not necessarily a multiple of the number of layers, so the layers need not contain exactly the same number of tokens.
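pd.qcut handles this uneven case automatically; a minimal sketch with a hypothetical pool of 23 tokens:
import numpy as np
import pandas as pd

# Hypothetical factor values for 23 tokens (23 is not a multiple of 10)
factor_values = pd.Series(np.random.rand(23))

groups = pd.qcut(factor_values, 10, labels=list(range(10)))  # 10 quantile layers
print(groups.value_counts().sort_index())  # layer sizes differ by at most 1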
(2) Backtesting
After dividing the tokens into 10 groups in ascending order of factor value, calculate the return of each token portfolio. This step treats each layer's tokens as one portfolio (the tokens contained in each layer change across backtest periods) and computes the portfolio's overall next-period return. IC/IR and the t-value analyze the relationship between the current factor value and the overall next-period return, but stratified backtesting requires the return of each stratified portfolio in every period of the backtest. Since the backtest contains many periods, stratification and return calculation must be repeated in each one. Finally, the per-period returns of each layer are multiplied cumulatively to obtain the cumulative return of that layer's portfolio (see the sketch below).
Ideally, for a good factor, group 9's net-value curve ends highest and group 0's ends lowest.
The curve of group 9 minus group 0 (i.e., the long-short return) is then monotonically increasing.
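A minimal sketch of that cumulative multiplication, using hypothetical per-period returns for one layer:
import pandas as pd

# Hypothetical per-period returns of one layer over 4 rebalancing cycles
layer_returns = pd.Series([0.02, -0.01, 0.03, 0.015])

net_value = (1 + layer_returns).cumprod()  # net-value curve starting from 1
print(net_value)
print(f'cumulative return: {net_value.iloc[-1] - 1:.4f}')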
(3) Stratified backtesting code implementation
    def layBackTest(self, chooseDate, factor):
        f = {}
        returnM = {}
        for i in range(len(chooseDate) - 1):
            df1 = self.ini_data[self.ini_data['date'] == chooseDate[i]].rename(columns={'price': 'close0'})
            Y = pd.merge(df1, self.ini_data[self.ini_data['date'] == chooseDate[i + 1]][['date', 'name', 'price']], on=['name']).rename(columns={'price': 'close1'})
            f[i] = Y[factor]
            returnM[i] = Y['close1'] / Y['close0'] - 1
        labels = ['0', '1', '2', '3', '4', '5', '6', '7', '8', '9']
        res = pd.DataFrame(index=['0', '1', '2', '3', '4', '5', '6', '7', '8', '9', 'LongShort'])
        res[chooseDate[0]] = 1  # every layer starts with a net value of 1
        for i in range(len(chooseDate) - 1):
            dfM = pd.DataFrame({'factor': f[i], 'returnM': returnM[i]})
            dfM['group'] = pd.qcut(dfM['factor'], 10, labels=labels)  # 10 quantile layers
            dfGM = dfM.groupby('group').mean()[['returnM']]
            dfGM.loc['LongShort'] = dfGM.loc['9'] - dfGM.loc['0']  # long-short: group 9 minus group 0
            res[chooseDate[i + 1]] = res[chooseDate[i]] * (1 + dfGM['returnM'])  # compound period by period
        data = pd.DataFrame({'Hierarchical cumulative return': res.iloc[:10, -1].values,
                             'Group': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]})
        df3 = data.corr()
        print("Correlation Matrix:")
        print(df3)
        return res.T.plot(title='Group backtest net worth curve')
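A usage sketch following on from the earlier example (ini_data and the factor name 'mom_21d' are again hypothetical):
test = TestAlpha(ini_data)
rebalance_dates = test.chooseDate(cycle='month',
                                  start_date='2021-01-01',
                                  end_date='2022-12-31')
test.layBackTest(rebalance_dates, factor='mom_21d')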
