# A complete guide to linear regression using gene expression data: regularization

In this third part, I will discuss underfitting and overfitting (and how to detect them), penalized regression (ridge, lasso, and elastic net), and how to implement them in python.

You can find here the first part (introduction and math): here

You can find here the second part (fit, problem diagnosis, and solving): here

# Underfitting and overfitting

The goal of a good model is that is able to generalize well when it encounters new data. This ability allows your model to make predictions on unseen data

Underfitting is when a model can’t accurately capture the dependencies between dependent and independent variables. In general, this because the model is too simple or it got caught in a local minimum. Generally, you notice a low R2 with known data and poor capabilities to generalize with unseen data. Overfitting is the contrary, the model learns and adapts itself to the training data and learn dependencies among data but also some random fluctuations. Generally, the model is too complex. You notice that it has perfect R2 with known data, but poor with unseen data. In concrete, your model learns too well the training data but is losing the ability to generalize. In practice more than learning from the data, the model is memorizing the training set.

As you can see in the figure below an example of underfitting and overfitting. In this case, is a polynomial regression (where some terms are exponential) but the concept is the same. The top left is an example of underfitting, the top right is maybe the best one. While the two at the bottom are more likely examples of over-fitting.

# Penalized regression: ridge, lasso, and elastic net

Ridge regression (called also L2 penalization) and is equivalent to the square of the magnitude of the coefficients.

Or also:

And if we solve this equation in matrix notation:

Interesting if you remember the bias and variance equation these are becoming:

As λ increase the variance decrease but the bias increase, which lead to the fact we have to choose the optimal value for λ.

Lasso regression (called also L1 penalization) adds a penalty term that is equivalent to the absolute magnitude of the coefficients. λ is the penalty term, is a constant that decides how much is the reduction of the weight. The input dataset has a dimension of m, so the penalty is applied to all the weights.

The equation of ordinary least square can be written also in this manner:

Basically, we are trying to minimize the difference between the dependent variable and the product of x and its weight. Then the equation for lasso regression can be also rewritten like this:

Some considerations on the two methods:

• You should scale your inputs before using regularization. In ridge, regression scaling ensures that the penalty acts equally on each coefficient. Scaling with a z-score ensures that the standard deviation is equal to 1 for each input. The standard deviation is affecting the penalty.
• Ridge decreases towards zero the coefficient but without be actually zero, Lasso favors sparsity. Using Lasso many coefficients will be decreased to zero, allowing you feature selections.
• If there are two or more highly collinear variables then Lasso selects one of them randomly (which is not good for the interpretation of the data).
• In general, use Ridge when you have few predictors (few features) and you are sure they're all important.
• In general, the higher is the coefficient of penalization higher is the penalization.
• Ridge is functioning better when the weights have similar dimensions.
• The two methods solve collinearity differently: Ridge leads the coefficients of correlated predictors to be similar, while in Lasso one of the correlated predictors has a larger coefficient and the others are zeroed.

The figure below explains what is happening using different values of lambda to different coefficients (beta). As you can see, Ridge is shrinking the coefficient increasing λ but it is not actually becoming zero, while Lasso is shrinking to zero the different coefficients.

What if we combine the two methods? Elastic Net! Elastic net is working using the combination of L2 and L1 penalization and the formula is:

Interesting you can use the λ to select the strength of the penalization. Where zero is meaning no penalties, 1 is meaning full penalty and zero is basically equal to simple linear regression. A very smaller value of λ is often used (like 0,001). To understand how λ works:

Elastic net loss = loss + (λ * elastic_net_penalty)

You can select also the parameter α which controls the ratio between l1 and l2 penalization (where α = 0 is correspondent to Ridge regression and α = 1 is correspondent to Lasso regression). With an intermediate α, you will have some coefficients that are reduced and some becoming zero. To understand how α works:

Elastic net penalty = (α * l1_penalty) + ((1 — α) * l2_penalty)

One consideration:

• You can control the balance of the two types of penalties, this is generally resulting in a better performance in some cases.

Now we will see this in practice with our dataset.

`from sklearn.linear_model import Ridgemodel = Ridge(alpha = 0.05)model.fit(X_train, y_train)y_pred = model.predict(X_test)print('Mean Absolute Error:', metrics.mean_absolute_error(y_test, y_pred))print('Mean Squared Error:', metrics.mean_squared_error(y_test, y_pred))print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_test, y_pred)))print('coefficient of determination:', model.score(X_test, y_test))`

and for lasso:

`from sklearn.linear_model import Lassomodel = Lasso(alpha = 0.05)model.fit(X_train, y_train)y_pred = model.predict(X_test)print('Mean Absolute Error:', metrics.mean_absolute_error(y_test, y_pred))print('Mean Squared Error:', metrics.mean_squared_error(y_test, y_pred))print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_test, y_pred)))print('coefficient of determination:', model.score(X_test, y_test))`

We say before that Lasso is reducing to zero the coefficients, you can see how many coefficients are now non zero

`coefs = model.coef_.flatten()names = X_train.columnsgenes = list(zip(names, coefs))feature =pd.DataFrame(genes, columns = ["genes", "coefs"])feature0 = feature.loc[(feature!=0).any(axis=1)]feature0 = feature[(feature != 0).all(1)]feature0.shape, feature.shape`

Then we can plot the coefficients.

`coefs =feature0.sort_values(by=['coefs'])plt.figure(figsize=(20, 15))g = sns.barplot(x="genes", y="coefs", data=coefs, color= "lightblue")g.figsize=(16,10)plt.xticks(rotation=45)`

Using a loop we can change the penalization and see how many coefficients are still non-zero, we can also print the coefficient of determination to see which is the best model.

`y = [0.5, 0.1, 0.05, 0.01, 0.005, 0.001]for i in y:    model = Lasso(alpha = i, max_iter = 10000)    model.fit(X_train, y_train)    coefs = model.coef_.flatten()    names = X_train.columns    genes = list(zip(names, coefs))    x = "coef" + "_lr1_" + str(i)    feature1 =pd.DataFrame(genes, columns = ["genes", x])    feature = pd.merge(feature, feature1,               left_on='genes',right_on='genes',how='outer')    feature0 = feature1[(feature1 != 0).all(1)]    print(model.alpha), print(feature0.shape)    print('coefficient of determination:', model.score(X_test,           y_test))`

Now let’s see the elastic net:

`from sklearn.linear_model import ElasticNetmodel = ElasticNet(alpha=0.5, l1_ratio=0.5)model.fit(X_train, y_train)y_pred = model.predict(X_test)print('Mean Absolute Error:', metrics.mean_absolute_error(y_test, y_pred))print('Mean Squared Error:', metrics.mean_squared_error(y_test, y_pred))print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_test, y_pred)))print('coefficient of determination:', model.score(X_test, y_test))`

we can evaluate the model using cross-validation and reporting the average of the mean absolute value.

Cross-validation is a procedure that through resampling you use to evaluate a machine learning model with a limited data sample. There is only one parameter to consider k, which control the number of time a data sample is split (generally k = 10). The procedure starts shuffling the dataset randomly, then splitting into different groups (number of groups equal to k). For each unique group, the group is holding out as a test set while the remaining group is the training set. Then it fit the model on the training set, evaluate the score and discard the model. Once the groups are selected, they are not changed, meaning that each observation is used to train k-1 models. In general, you summarize with mean and standard deviation the score obtained for k models.

Take into account that due to the stochasticity of the algorithm (elastic net is using stochastic gradient descent) the results are changing a little at each run.

`from sklearn.model_selection import cross_val_scorefrom sklearn.model_selection import RepeatedKFoldfrom sklearn.linear_model import ElasticNetfrom numpy import absoluteX = df.drop("MYC", 1)y = df["MYC"]# define modelmodel = ElasticNet(alpha=0.5, l1_ratio=0.5)# define model evaluation methodcv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=1)# evaluate modelscores = cross_val_score(model, X, y, scoring='neg_mean_absolute_error', cv=cv, n_jobs=-1)# force scores to be positivescores = absolute(scores)np.mean(scores), np.std(scores)`

Until now we just randomly select the hyperparameters (alpha, l1_ratio). We do not know if that is the best combination. We will use grid search. The different combinations will be run in combination with cross-validation. We use a subset because is quite computationally expensive.

`from numpy import arangefrom pandas import read_csvfrom sklearn.model_selection import GridSearchCVfrom sklearn.model_selection import RepeatedKFoldfrom sklearn.linear_model import ElasticNet# define modelmodel = ElasticNet()# define model evaluation methodcv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=1)# define gridgrid = dict()grid['alpha'] = [1e-5, 1e-4, 1e-3, 1e-2, 1e-1, 0.0, 1.0, 10.0, 100.0]grid['l1_ratio'] = arange(0, 1, 0.01)# define searchsearch = GridSearchCV(model, grid, scoring='neg_mean_absolute_error', cv=cv, n_jobs=-1)# perform the searchresults = search.fit(X.iloc[:, 0:100], y)# summarizeprint('MAE: %.3f' % results.best_score_)print('Config: %s' % results.best_params_)`

Let’s try these parameters and make a plot.

`from sklearn.linear_model import ElasticNetmodel = ElasticNet(alpha=0.01, l1_ratio=0.03)model.fit(X_train.iloc[:, 0:100], y_train)y_pred = model.predict(X_test.iloc[:, 0:100])x_ax = range(len(X_test))plt.figure(figsize=(15, 5))plt.scatter(x_ax, y_test, color = "blue", label ="real")plt.plot(x_ax, y_pred, color = "red", label ="prediction")plt.legend()plt.show()`

Does it look like the oscillations of an elastic?

This was the last part of the tutorial.

# Other resources

on bias and variance calculation: here

on regularization: here

about regression assumptions and plots: here

tutorial on polynomial regression: here

on the model selection using AIC and BIC: here

on Ridge and Lasso regression: here

another source about Ridge and Lasso regression: here

on k-fold cross-validation: here

# Bibliography

2. Esteva A, Kuprel B, Novoa RA, Ko J, Swetter SM, Blau HM, Thrun S. Dermatologist-level classification of skin cancer with deep neural networks. Nature (2017) 542:115–118. doi:10.1038/nature21056

3. Warnat-Herresthal S, Perrakis K, Taschler B, Becker M, Baßler K, Beyer M, Günther P, Schulte-Schrepping J, Seep L, Klee K, et al. Scalable Prediction of Acute Myeloid Leukemia Using High-Dimensional Machine Learning and Blood Transcriptomics. iScience (2019) 23: doi:10.1016/j.isci.2019.100780

4. Wang C, Zhang J, Yin J, Gan Y, Xu S, Gu Y, Huang W. Alternative approaches to target Myc for cancer treatment. Signal Transduct Target Ther (2021) 6:117. doi:10.1038/s41392–021–00500-y

5. Wang C, Fang H, Zhang J, Gu Y. Targeting “undruggable” c-Myc protein by synthetic lethality. Front Med (2021) doi:10.1007/s11684–020–0780-y

6. Carter JL, Hege K, Yang J, Kalpage HA, Su Y, Edwards H, Hüttemann M, Taub JW, Ge Y. Targeting multiple signaling pathways: the new approach to acute myeloid leukemia therapy. Signal Transduct Target Ther (2020) 5:288. doi:10.1038/s41392–020–00361-x

7. Benetatos L, Benetatou A, Vartholomatos G. Long non-coding RNAs and MYC association in hematological malignancies. Ann Hematol (2020) 99:2231–2242. doi:10.1007/s00277–020–04166–4

8. Burnham KP, Anderson DR. Model Selection and Multimodel Inference: A Practical Information-Theoretic Approach. Springer Science & Business Media (2003).

Bioinformatician and data scientist, passionate and writing about bioinformatics, machine learning, and AI. https://www.linkedin.com/in/salvatore-raieli/

## More from Salvatore Raieli

Bioinformatician and data scientist, passionate and writing about bioinformatics, machine learning, and AI. https://www.linkedin.com/in/salvatore-raieli/