A complete guide on logistic regression with gene expression data: training the model

Salvatore Raieli
11 min readMay 18, 2022

In this second part of the article, I will explain how to train a logistic model and how to evaluate it.

Here you find the first part:

Here you will find the corresponding Colab Notebook with all the code shown here and used for the images.

This is the outline of the article:

1. Preprocessing the dataset

2. Fit the model

3. Evaluate the model

Preprocessing of the dataset

The dataset used in this tutorial is obtained from Warnat-Herresthal et al., who collected and re-analyzed many leukemia datasets. The dataset and the microarray technique are presented in detail in the previous tutorial.

I wrote this tutorial in Python in Google Colab. As in the previous tutorial, I assume you have already uploaded the dataset to Google Drive.

%reload_ext autoreload
%autoreload 2
%matplotlib inline
from google.colab import drive
drive.mount('/content/gdrive', force_remount=True)

Import the necessary libraries and the dataset:

#import necessary libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import umap
#dataset
data = pd.read_table("/content/gdrive/My Drive/aml/201028_GSE122505_Leukemia_clean.txt", sep="\t")

After loading the dataset and the necessary libraries, let's take a look at the number of available disease conditions in the dataset.

#table of the disease
data.disease.value_counts()

Two categories dominate the dataset: Acute Myeloid Leukemia (AML) and Acute Lymphoid Leukemia (ALL). For easier visualization, we group some categories and remove others.

#removing some disease type
data["disease"] = np.where(data["disease"] == "Diabetes_Type_I" , "Diabetes", data["disease"])
data["disease"] = np.where(data["disease"] == "Diabetes_Type_II" , "Diabetes", data["disease"])
other = ['CML', 'clinically_isolated_syndrome', 'MDS', 'DS_transient_myeloproliferative_disorder']
data = data[~data.disease.isin(other)]
data.disease.value_counts()

For this tutorial, we will focus only on these two cancer types:

selected = ['AML', 'ALL']
data = data[data.disease.isin(selected)]
data.disease.value_counts()

And then:

target = data["disease"]
df = data.drop("disease", axis=1)
df = df.drop("GSM", axis=1)
df = df.drop("FAB", axis=1)
df.shape

We also filter out the features with low variance and scale the remaining features.

df = df.drop(df.var()[(df.var() < 0.3)].index, axis=1)
from scipy.stats import zscore
df = df.apply(zscore)
df.shape

Fitting a simple logistic regression model

A small recap.

In single-variable logistic regression, x is a vector containing a single variable. The figure shows the output as a function of the input: the leftmost observations are classified as 0 and the rightmost as 1. Logistic regression finds the weights of the logit f(x) = b₀ + b₁x (the dashed line), while the black curve is the predicted probability p(x) = 1 / (1 + exp(−f(x))), with the decision boundary at 0.5.

Fig 1. Single-variable logistic regression. Figure source: here.

Normally, however, datasets contain many independent variables (x is a matrix); in that case we speak of multi-variable logistic regression. In the figure below we have two variables, x₁ and x₂: both axes represent the inputs, the white dots belong to class 0 and the green ones to class 1, and the dashed line separates the two classes. In this case, the formulas are slightly different:

  • Logit: 𝑓(𝑥₁, 𝑥₂) = 𝑏₀ + 𝑏₁𝑥₁ + 𝑏₂𝑥₂
  • Probabilities: 𝑝(𝑥₁, 𝑥₂) = 1 / (1 + exp(−𝑓(𝑥₁, 𝑥₂)))
Fig 2. Multi-variable logistic regression. Figure source: here.
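
As a quick numerical illustration of these two formulas, here is a minimal NumPy sketch (the coefficient values are made up for the example, not taken from any fitted model):

import numpy as np

# hypothetical coefficients, for illustration only
b0, b1, b2 = -1.0, 2.0, 0.5

def logit(x1, x2):
    # f(x1, x2) = b0 + b1*x1 + b2*x2
    return b0 + b1 * x1 + b2 * x2

def probability(x1, x2):
    # p(x1, x2) = 1 / (1 + exp(-f(x1, x2)))
    return 1.0 / (1.0 + np.exp(-logit(x1, x2)))

probability(1.0, 2.0)          # predicted probability of class 1
probability(1.0, 2.0) > 0.5    # predicted class with a 0.5 decision boundary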

We divide the data into training and test sets, splitting randomly.

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(df, target, test_size=0.25, random_state=0)

We then initialize the model and fit it on the training set, using scikit-learn.

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
model = LogisticRegression(solver='saga', random_state=0)
model.fit(X_train, y_train)

Scikit-learn prints out the parameters it is using. Here are the parameters we can select:

  • penalty. We will discuss this in more detail with regularization, but you can choose among 'l1', 'l2', 'elasticnet', and 'none' (default='l2').
  • dual. Whether to use the primal formulation (False, which is the default) or the dual formulation (True).
  • tol. A floating-point number that defines the stopping criterion (default=1e-4).
  • C. A floating-point number that defines the inverse of the regularization strength (default=1); remember that smaller values specify stronger regularization.
  • fit_intercept. Specifies whether the intercept (b₀) has to be calculated (True, which is the default) or considered constant and equal to 0 (False).
  • intercept_scaling. Default is 1; it decides the scaling of the intercept (b₀).
  • random_state. Seed of the pseudo-random number generator (integer, default=None). You select a number for reproducibility.
  • class_weight. The default is None (which assigns a weight of 1 to each class); you can also select 'balanced', which assigns a different weight to each class, inversely proportional to its frequency.
  • solver. You can choose 'newton-cg', 'lbfgs', 'liblinear', 'sag', or 'saga' (default='lbfgs').
  • max_iter. The maximum number of iterations taken for the solver to converge (default=100).
  • multi_class. Decides the approach when you have more than two classes.
  • verbose. Sets the verbosity when you use the 'liblinear' and 'lbfgs' solvers.
  • warm_start. Default=False; if you set it to True, the solution of the previous call to fit is reused as initialization, so the weights are updated starting from the previous fit.
  • n_jobs. Default=None; number of CPU cores used for parallelization.
  • l1_ratio. Default=None. A float that defines the l1/l2 mixing ratio when you use Elastic-Net: l1_ratio=0 is equivalent to penalty='l2', while l1_ratio=1 is equivalent to penalty='l1'. An intermediate value is a combination of the two.

About how to choose the solver (a short code sketch follows this list):

  • 'newton-cg', 'sag', and 'lbfgs' support only L2 regularization; 'liblinear' supports both L1 and L2; 'saga' supports L1 and L2 and is the only solver that supports Elastic-Net regularization.
  • For large datasets, 'sag' and 'saga' are faster, while for small datasets 'liblinear' is a good choice.
  • For fast convergence with ‘sag’ and ‘saga’, the data need to be scaled.
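
As a minimal sketch of which penalty/solver combinations are valid (the models are only instantiated here, not fitted; the hyper-parameter values are illustrative):

from sklearn.linear_model import LogisticRegression

# L1 penalty: supported by 'liblinear' and 'saga'
model_l1 = LogisticRegression(penalty='l1', solver='liblinear', C=1.0)
# L2 penalty (the default): supported by all solvers
model_l2 = LogisticRegression(penalty='l2', solver='lbfgs', max_iter=1000)
# Elastic-Net: only 'saga' supports it, and l1_ratio must be set
model_en = LogisticRegression(penalty='elasticnet', solver='saga', l1_ratio=0.5)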

Now that we have fit our model, we can inspect the classes.

model.classes_
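
The fitted intercept and weights are also stored on the model object; a quick way to inspect them:

model.intercept_      # the intercept b₀ (one value for a binary problem)
model.coef_.shape     # (1, number of genes): one weight per independent variable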

As you can see, the model has one intercept and one weight for each independent variable.

The model returns, for each observation, the probability of belonging to each of the two classes, as we can see:

model.predict_proba(X_test)[0:10]

If the probability is over the threshold (0.5 by default), the observation is assigned to class 1; otherwise, it is assigned to class 0.

model.predict(X_test)[0:10]
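
As a sanity check, we can reproduce predict() ourselves by thresholding the probability of the second class (a sketch; model.classes_[1] is the class associated with the second column of predict_proba):

import numpy as np
proba = model.predict_proba(X_test)[:, 1]                  # probability of the second class
manual_pred = np.where(proba > 0.5, model.classes_[1], model.classes_[0])
(manual_pred == model.predict(X_test)).all()               # expected to be True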

Model evaluation

We can evaluate the score:

model.score(X_train, y_train)
model.score(X_test, y_test)

We compute the score on both the training and the test set; in general, if the training accuracy is much higher than the test accuracy, the model is over-fitting. The test-set accuracy is the important one for evaluating performance (the model has never seen the test set, so the estimate is unbiased).

Let's start with a refresher on some statistical concepts. In statistical hypothesis testing we have:

  • Type I error: we reject a true null hypothesis and obtain a false positive (for example, a benign mass is predicted as malignant by the algorithm).
  • Type II error: we fail to reject a false null hypothesis and obtain a false negative (a malignant tumor is classified as benign by the algorithm; we are particularly interested in reducing this type of error, since it means leaving a life-threatening disease undiagnosed).
Fig 3. Type I and type II errors in classification. Figure source: here.

The idea is to minimize these errors: the perfect model would have zero false positives and zero false negatives. The Greek letter alpha generally indicates the probability of a false positive (type I error); it is normally set at 0.05, meaning that a 5 % false-positive rate is considered acceptable. Beta indicates the probability of a false negative (type II error).

From this, we can derive the confusion matrix, a table used to visualize the performance of a classifier. Conventions vary; in scikit-learn (used below), the rows represent the true classes, while the columns represent the predicted classes.

Fig 4. The confusion matrix. Figure source: here.
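
A tiny example of this convention with scikit-learn's confusion_matrix (toy labels, for illustration only):

from sklearn.metrics import confusion_matrix
# rows = true classes, columns = predicted classes
confusion_matrix([0, 0, 1, 1], [0, 1, 1, 1])
# array([[1, 1],
#        [0, 2]])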

To measure the performance of a binary classifier, you will often encounter two terms:

  • Sensitivity measures the proportion (often reported as a percentage) of actual positives that are correctly classified (for example, the percentage of tumors correctly classified as tumors). It is also called recall, hit rate, or true positive rate.
  • Specificity measures the proportion of actual negatives that are correctly classified. It is also called selectivity or true negative rate.
Fig 5. Precision versus recall. Figure source: here.
Fig 6. Figure source: here.

From the confusion matrix, we can calculate sensitivity and specificity.

Sensitivity can be calculated as:

Sensitivity = TP / (TP + FN)

Specificity can be calculated as:

Specificity = TN / (TN + FP)
As you can see, from the confusion matrix you can calculate many other metrics.

Fig 7. Metrics derived from the confusion matrix. Figure source: here.

So which metric should we use? It depends on the specific case: for example, in a spam detector a false positive is a serious problem, because important emails can end up in the spam folder. We also have to choose our metric according to the dataset, especially if the classes are unbalanced. A dataset is imbalanced when there is a large difference between the number of observations in one class and the other.

Fig 8. Evaluating classifier performance. Figure source: here.

For instance, accuracy is best suited to balanced datasets. If the dataset is imbalanced, the classifier can perform very poorly on the less represented category and still achieve high accuracy.

Fig 9. An imbalanced dataset. Figure source: here.
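
A quick sketch of why accuracy can be misleading on imbalanced data: a "classifier" that always predicts the majority class on a hypothetical 95/5 split already reaches 95 % accuracy while missing every positive.

import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# hypothetical labels: 95 negatives, 5 positives
y_true = np.array([0] * 95 + [1] * 5)
y_naive = np.zeros(100, dtype=int)      # always predict the majority class

accuracy_score(y_true, y_naive)         # 0.95
recall_score(y_true, y_naive)           # 0.0 - every positive is missed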

Sensitivity is widely used when detecting the positives is the most important thing: it measures which percentage of actual positives are correctly identified, with the aim of reducing false negatives (FN). This is important in cancer detection, where you do not want a tumor to go undiagnosed. Specificity measures how many actual negatives were correctly classified by our model.

Precision, instead, is preferred when you want to minimize false positives (for example, in spam detection). The F1 score is a good choice when you are not sure whether sensitivity or precision is the more appropriate metric, because it depends on both (see the formula below).
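
For reference, the F1 score is the harmonic mean of precision and sensitivity:

F1 = 2 · (precision · sensitivity) / (precision + sensitivity)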

Fig 10. figure source from: adapted from here.

Let's now calculate the confusion matrix:

y_pred = model.predict(X_test)
confusion_matrix(y_test, y_pred)

And plotting it

import seaborn as sns; sns.set_theme()
conf_matrix = confusion_matrix(y_test, y_pred)
ax = sns.heatmap(conf_matrix, annot=True, fmt="d", linewidths=1)
plt.title('Confusion matrix', y=1.1)
plt.ylabel('True label')
plt.xlabel('Predicted label')

Surprisingly, the model classified all the labels correctly. This is probably because we have many variables and the two pathologies are quite different.

As another example, we can decide to build a model that identifies samples with high expression of a particular gene (or based on other characteristics), as in this case:

X = df.drop("MYC", axis=1)
y = df["MYC"]
sns.set(style="whitegrid", palette="muted")
sns.histplot(x=y)
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)

We arbitrarily decide that a z-score above 0 corresponds to class 1, otherwise class 0:

y = np.where(y < 0, 0, 1)
unique, counts = np.unique(y, return_counts=True)
print(np.asarray((unique, counts)).T)
#fitting the model
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
model = LogisticRegression(solver='saga', random_state=1)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
confusion_matrix(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)
ax = sns.heatmap(conf_matrix, annot=True, fmt="d", linewidths=1)
plt.title('Confusion matrix', y=1.1)
plt.ylabel('True label')
plt.xlabel('Predicted label')

And we calculate the metrics seen before:

from sklearn import metrics
print("Accuracy:",metrics.accuracy_score(y_test, y_pred))
print("Precision:",metrics.precision_score(y_test, y_pred))
print("sensitivity:",metrics.recall_score(y_test, y_pred))
print("F1 score:",metrics.f1_score(y_test, y_pred))

Scikit-learn does not provide a function for specificity, so we calculate it from the confusion matrix:

tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
specificity = tn / (tn+fp)
specificity

Receiver operating characteristic curve

The receiver operating characteristic curve (ROC curve) is a plot showing the diagnostic ability of a binary classifier as its decision threshold varies. The ROC is generated by plotting the false positive rate (FPR) on the x-axis and the true positive rate (TPR) on the y-axis.

Fig 11. figure source: here
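
As a sketch, with the second (0/1) model we can draw the ROC curve and compute the area under it using the predicted probabilities for class 1:

from sklearn.metrics import roc_curve, roc_auc_score

y_score = model.predict_proba(X_test)[:, 1]               # probability of class 1
fpr, tpr, thresholds = roc_curve(y_test, y_score)
auc = roc_auc_score(y_test, y_score)

plt.plot(fpr, tpr, label="AUC = %.3f" % auc)
plt.plot([0, 1], [0, 1], linestyle="--", color="grey")     # chance level
plt.xlabel("False positive rate")
plt.ylabel("True positive rate")
plt.legend()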

Other resources

If you want to know more about the odds ratio: here

logistic regression math: here

on the natural logarithm: here

if you want to implement logistic regression from scratch: here

on the confusion matrix: here

on the metric for logistic regression: here

if you have found it interesting:

Please share it, you can also subscribe to get notified when I publish articles, you can also connect or reach me on LinkedIn. Thanks for your support!

Here, is the link to my Github repository where I am planning to collect code, and many resources related to machine learning, artificial intelligence, and more.

Other tutorials on genomic data:
