A complete guide to linear regression using gene expression data: regularization
In this third part, I will discuss underfitting and overfitting (and how to detect them), penalized regression (ridge, lasso, and elastic net), and how to implement these methods in Python.
You can find the first part (introduction and math): here
You can find the second part (fitting, problem diagnosis, and solutions): here
Underfitting and overfitting
Let’s briefly discuss an important point: underfitting and overfitting.
The goal of a good model is to generalize well when it encounters new data. This ability is what allows your model to make predictions on unseen data.
Underfitting is when a model cannot accurately capture the dependencies between the dependent and independent variables. In general, this happens because the model is too simple or because it got caught in a local minimum. You typically notice a low R2 on the known data and a poor ability to generalize to unseen data. Overfitting is the opposite: the model adapts itself so closely to the training data that it learns not only the real dependencies but also the random fluctuations. Generally, the model is too complex. You notice a nearly perfect R2 on the known data but poor performance on unseen data. In concrete terms, the model learns the training data too well and loses the ability to generalize: rather than learning from the data, it is memorizing the training set.
The figure below shows an example of underfitting and overfitting. In this case it is a polynomial regression (where some terms are raised to higher powers), but the concept is the same. The top-left panel is an example of underfitting, the top-right is probably the best fit, while the two panels at the bottom are more likely examples of overfitting.
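If you want to generate a figure like this yourself, here is a minimal sketch (it uses a toy sine-shaped dataset for illustration, not our gene expression data, and the panel layout and polynomial degrees are arbitrary choices):
# a minimal sketch to visualize underfitting and overfitting with polynomial regression
# (toy sine-shaped data for illustration, not the gene expression dataset)
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline

rng = np.random.RandomState(0)
x_toy = np.sort(rng.uniform(0, 1, 30))
y_toy = np.sin(2 * np.pi * x_toy) + rng.normal(scale=0.2, size=30)
x_plot = np.linspace(0, 1, 200)

fig, axes = plt.subplots(2, 2, figsize=(10, 8))
for ax, degree in zip(axes.ravel(), [1, 4, 10, 15]):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(x_toy.reshape(-1, 1), y_toy)
    ax.scatter(x_toy, y_toy, s=15)
    ax.plot(x_plot, model.predict(x_plot.reshape(-1, 1)), color="red")
    ax.set_title("degree = %d" % degree)  # a low degree underfits, a high degree overfits
plt.show()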
Penalized regression: ridge, lasso, and elastic net
So, what is the solution to overfitting? Regularization. Let’s discuss it! Conceptually, regularization constrains a set of parameters in order to discourage the model from becoming too complex. In practice, this translates into shrinking the coefficients. Intuitively, if there is some noise, the noisy data points will lie farther from the others, and fitting them forces the model to inflate its weights, which leads to overfitting. Overfitting is a particular risk when you have too many features (and when you have many features you do not know which ones to drop; dropping the wrong features will hurt the algorithm's performance). Concretely, regularization penalizes coefficients with large values, reducing the model complexity. In other words, we change the loss function to include an additional cost for large coefficients. Another problem that regularization solves is that when the coefficients are too large, the model is more sensitive to its inputs and therefore unstable. Keep in mind, however, that adding regularization reduces the variance at the cost of introducing some bias. There are two main variants of regularization:
Ridge regression (also called L2 penalization) adds a penalty term that is equivalent to the square of the magnitude of the coefficients.
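In the usual notation, with n observations, m features, coefficients \beta_j, and \lambda the penalty strength, this loss can be written as:
L^{ridge}(\beta) = \sum_{i=1}^{n} \left( y_i - \sum_{j=1}^{m} x_{ij} \beta_j \right)^2 + \lambda \sum_{j=1}^{m} \beta_j^2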
Or also:
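Writing X for the n \times m matrix of inputs, y for the vector of outcomes, and \beta for the vector of coefficients, the same loss in matrix notation is:
L^{ridge}(\beta) = (y - X\beta)^{T}(y - X\beta) + \lambda \beta^{T}\beta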
And if we solve this equation in matrix notation (setting the gradient with respect to \beta to zero), we obtain the closed-form solution:
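\hat{\beta}^{ridge} = (X^{T}X + \lambda I)^{-1} X^{T} y
where I is the m \times m identity matrix.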
Interestingly, if you remember the bias and variance equations, for the ridge estimator these change as well.
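Under the usual linear-model assumptions (fixed X and noise variance \sigma^{2}), the standard expressions become:
\mathrm{Bias}(\hat{\beta}^{ridge}) = -\lambda (X^{T}X + \lambda I)^{-1} \beta
\mathrm{Var}(\hat{\beta}^{ridge}) = \sigma^{2} (X^{T}X + \lambda I)^{-1} X^{T}X (X^{T}X + \lambda I)^{-1}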
As λ increases, the variance decreases but the bias increases, which means we have to choose the optimal value for λ.
Lasso regression (also called L1 penalization) adds a penalty term that is equivalent to the absolute magnitude of the coefficients. λ is the penalty term: a constant that decides how much the weights are reduced. The input dataset has m features, so the penalty is applied to all m weights.
The equation of ordinary least squares can also be written in this manner (with x_{ij} the value of feature j for observation i and w_j the corresponding weight):
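L^{OLS}(w) = \sum_{i=1}^{n} \left( y_i - \sum_{j=1}^{m} x_{ij} w_j \right)^2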
Basically, we are trying to minimize the difference between the dependent variable and the product of the inputs and their weights. The equation for lasso regression can then be rewritten by adding the L1 penalty term.
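In the same notation as above, this gives:
L^{lasso}(w) = \sum_{i=1}^{n} \left( y_i - \sum_{j=1}^{m} x_{ij} w_j \right)^2 + \lambda \sum_{j=1}^{m} \lvert w_j \rvert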
Some considerations on the two methods:
- You should scale your inputs before using regularization (see the sketch after this list). In ridge regression, scaling ensures that the penalty acts equally on each coefficient. Scaling with a z-score ensures that the standard deviation of each input is equal to 1; otherwise, the standard deviation affects how strongly each coefficient is penalized.
- Ridge shrinks the coefficients towards zero without making them exactly zero, while Lasso favors sparsity. With Lasso, many coefficients are shrunk all the way to zero, which effectively gives you feature selection.
- If there are two or more highly collinear variables, Lasso selects one of them essentially at random (which is not good for the interpretation of the data).
- In general, use Ridge when you have few predictors (few features) and you are sure they're all important.
- In general, the higher the penalization coefficient λ, the stronger the shrinkage of the weights.
- Ridge works better when the weights have similar magnitudes.
- The two methods solve collinearity differently: Ridge leads the coefficients of correlated predictors to be similar, while in Lasso one of the correlated predictors has a larger coefficient and the others are zeroed.
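As a minimal sketch of the z-scoring step with scikit-learn (assuming the X_train / X_test split used throughout this tutorial; the remaining examples keep using the unscaled data):
# z-score the inputs before fitting a penalized model
# (sketch only, assuming the X_train / X_test split created in the previous part)
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()                       # mean 0, standard deviation 1 per feature
X_train_scaled = scaler.fit_transform(X_train)  # fit the scaler on the training set only
X_test_scaled = scaler.transform(X_test)        # reuse the training means/stds on the test set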
The figure below shows what happens to different coefficients (beta) as we use different values of lambda. As you can see, Ridge shrinks the coefficients as λ increases but they never actually become zero, while Lasso shrinks the different coefficients all the way to zero.
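If you want to reproduce this kind of coefficient-path plot on our data, a minimal sketch could look like this (it assumes the X_train and y_train objects used throughout this tutorial, and shows only the first 20 genes to keep the figure readable):
# sketch of a coefficient-path plot: how coefficients shrink as the penalty grows
# (assumes X_train and y_train from the previous part; only the first 20 genes are plotted)
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import Ridge, Lasso

alphas = np.logspace(-3, 2, 30)
ridge_coefs = [Ridge(alpha=a).fit(X_train, y_train).coef_[:20] for a in alphas]
lasso_coefs = [Lasso(alpha=a, max_iter=10000).fit(X_train, y_train).coef_[:20] for a in alphas]

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5), sharey=True)
ax1.plot(alphas, ridge_coefs)   # ridge: coefficients shrink but rarely become exactly zero
ax2.plot(alphas, lasso_coefs)   # lasso: coefficients are driven exactly to zero one by one
for ax, title in zip((ax1, ax2), ("Ridge", "Lasso")):
    ax.set_xscale("log")
    ax.set_xlabel("lambda (alpha in scikit-learn)")
    ax.set_title(title)
ax1.set_ylabel("coefficient value")
plt.show()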
What if we combine the two methods? Elastic net! Elastic net works using a combination of the L2 and L1 penalizations.
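With λ controlling the overall strength of the penalty and α the mix between the L1 and L2 terms (both discussed just below), the formula is:
L^{EN}(\beta) = \sum_{i=1}^{n} \left( y_i - \sum_{j=1}^{m} x_{ij} \beta_j \right)^2 + \lambda \left( \alpha \sum_{j=1}^{m} \lvert \beta_j \rvert + (1 - \alpha) \sum_{j=1}^{m} \beta_j^2 \right)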
Interestingly, you can use λ to select the strength of the penalization: a value of zero means no penalty (basically equal to simple linear regression), while 1 means the full penalty. A very small value of λ is often used (like 0.001). To understand how λ works:
Elastic net loss = loss + (λ * elastic_net_penalty)
You can also select the parameter α, which controls the ratio between the L1 and L2 penalizations (where α = 0 corresponds to Ridge regression and α = 1 corresponds to Lasso regression). With an intermediate α, some coefficients are merely reduced and some become zero. To understand how α works:
Elastic net penalty = (α * l1_penalty) + ((1 - α) * l2_penalty)
One consideration:
- You can control the balance between the two types of penalties, which in some cases results in better performance.
Now we will see this in practice with our dataset.
from sklearn.linear_model import Ridge
from sklearn import metrics
import numpy as np

# fit a ridge model and evaluate it on the test set
model = Ridge(alpha = 0.05)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print('Mean Absolute Error:', metrics.mean_absolute_error(y_test, y_pred))
print('Mean Squared Error:', metrics.mean_squared_error(y_test, y_pred))
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_test, y_pred)))
print('coefficient of determination:', model.score(X_test, y_test))
and for lasso:
from sklearn.linear_model import Lasso
model = Lasso(alpha = 0.05)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print('Mean Absolute Error:', metrics.mean_absolute_error(y_test, y_pred))
print('Mean Squared Error:', metrics.mean_squared_error(y_test, y_pred))
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_test, y_pred)))
print('coefficient of determination:', model.score(X_test, y_test))
We said before that Lasso shrinks coefficients to zero; let's see how many coefficients are still non-zero.
import pandas as pd

# collect the lasso coefficients and keep only the genes with a non-zero coefficient
coefs = model.coef_.flatten()
names = X_train.columns
genes = list(zip(names, coefs))
feature = pd.DataFrame(genes, columns = ["genes", "coefs"])
feature0 = feature[feature["coefs"] != 0]
feature0.shape, feature.shape
Then we can plot the coefficients.
import matplotlib.pyplot as plt
import seaborn as sns

# plot the non-zero coefficients, sorted by value
coefs = feature0.sort_values(by=['coefs'])
plt.figure(figsize=(20, 15))
g = sns.barplot(x="genes", y="coefs", data=coefs, color="lightblue")
plt.xticks(rotation=45)
plt.show()
Using a loop, we can change the penalization and see how many coefficients are still non-zero; we can also print the coefficient of determination to see which is the best model.
# try different penalization strengths and count how many coefficients survive
alphas = [0.5, 0.1, 0.05, 0.01, 0.005, 0.001]
for i in alphas:
    model = Lasso(alpha = i, max_iter = 10000)
    model.fit(X_train, y_train)
    coefs = model.coef_.flatten()
    names = X_train.columns
    genes = list(zip(names, coefs))
    x = "coef" + "_lr1_" + str(i)
    feature1 = pd.DataFrame(genes, columns = ["genes", x])
    # keep the coefficients of each model in a single dataframe
    feature = pd.merge(feature, feature1, left_on='genes', right_on='genes', how='outer')
    feature0 = feature1[feature1[x] != 0]
    print(model.alpha, feature0.shape)
    print('coefficient of determination:', model.score(X_test, y_test))
Now let’s see the elastic net:
from sklearn.linear_model import ElasticNet
model = ElasticNet(alpha=0.5, l1_ratio=0.5)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print('Mean Absolute Error:', metrics.mean_absolute_error(y_test, y_pred))
print('Mean Squared Error:', metrics.mean_squared_error(y_test, y_pred))
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_test, y_pred)))
print('coefficient of determination:', model.score(X_test, y_test))
We can evaluate the model using cross-validation, reporting the average of the mean absolute error.
Cross-validation is a resampling procedure used to evaluate a machine learning model on a limited data sample. There is only one parameter to consider, k, which controls the number of groups the data sample is split into (generally k = 10). The procedure starts by shuffling the dataset randomly and then splitting it into k groups. Each unique group is in turn held out as the test set while the remaining groups form the training set; the model is fit on the training set, its score is evaluated, and the model is then discarded. Once the groups are selected they are not changed, meaning that each observation is used to train k-1 models. In general, you summarize the scores obtained from the k models with their mean and standard deviation.
Take into account that, because of the randomness involved in how the data are shuffled and split (and in the solver, if no random seed is set), the results can change a little at each run.
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedKFold
from sklearn.linear_model import ElasticNet
from numpy import absolute

X = df.drop("MYC", axis=1)
y = df["MYC"]
# define model
model = ElasticNet(alpha=0.5, l1_ratio=0.5)
# define model evaluation method
cv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=1)
# evaluate model
scores = cross_val_score(model, X, y, scoring='neg_mean_absolute_error', cv=cv, n_jobs=-1)
# force scores to be positive
scores = absolute(scores)
np.mean(scores), np.std(scores)
Until now we have just arbitrarily selected the hyperparameters (alpha, l1_ratio); we do not know if that is the best combination. We will use a grid search: the different combinations are evaluated in combination with cross-validation. We use a subset of the features because the search is quite computationally expensive.
from numpy import arange
from pandas import read_csv
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RepeatedKFold
from sklearn.linear_model import ElasticNet

# define model
model = ElasticNet()
# define model evaluation method
cv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=1)
# define grid
grid = dict()
grid['alpha'] = [1e-5, 1e-4, 1e-3, 1e-2, 1e-1, 0.0, 1.0, 10.0, 100.0]
grid['l1_ratio'] = arange(0, 1, 0.01)
# define search
search = GridSearchCV(model, grid, scoring='neg_mean_absolute_error', cv=cv, n_jobs=-1)
# perform the search
results = search.fit(X.iloc[:, 0:100], y)
# summarize
print('MAE: %.3f' % results.best_score_)  # note: with neg_mean_absolute_error, best_score_ is the negative MAE
print('Config: %s' % results.best_params_)
Let’s try these parameters and make a plot.
from sklearn.linear_model import ElasticNet
model = ElasticNet(alpha=0.01, l1_ratio=0.03)
model.fit(X_train.iloc[:, 0:100], y_train)
y_pred = model.predict(X_test.iloc[:, 0:100])
x_ax = range(len(X_test))
plt.figure(figsize=(15, 5))
plt.scatter(x_ax, y_test, color = "blue", label ="real")
plt.plot(x_ax, y_pred, color = "red", label ="prediction")
plt.legend()
plt.show()
Does it look like the oscillations of an elastic?
This was the last part of the tutorial.
Other resources
On the discovery of regression: here
Stochastic gradient descent: here
on bias and variance calculation: here
on regularization: here
about regression assumptions and plots: here
about plot analysis: here
tutorial on polynomial regression: here
about multicollinearity: here
on the model selection using AIC and BIC: here
about Cook’s distance: here
on Ridge and Lasso regression: here
another source about Ridge and Lasso regression: here
about elastic net: here
on k-fold cross-validation: here