Feature selection techniques

Notes on feature selection techniques

General notes

Supervised feature selection techniques consider the target variable when determining variable relevance

  • Wrapper - Evaluate the performance of models trained on different subsets of the input features and select the subset that performs best
  • Filter - Feature selection based on relationship between feature and target variables
  • Intrinsic - Automatically perform feature selection during training (e.g., decision trees, LASSO)
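As a rough illustration of the wrapper and intrinsic categories, here is a minimal sketch using scikit-learn on the diabetes dataset. The choice of recursive feature elimination (RFE) as the wrapper, the linear base model, `n_features_to_select=4`, and `alpha=1.0` are all assumptions made for the example, not recommendations

```python
from sklearn.datasets import load_diabetes
from sklearn.feature_selection import RFE
from sklearn.linear_model import Lasso, LinearRegression

# Load features and target as pandas objects
X, y = load_diabetes(return_X_y=True, as_frame=True)

# Wrapper: RFE repeatedly refits the model and drops the weakest feature
# (n_features_to_select=4 is an arbitrary choice for illustration)
rfe = RFE(estimator=LinearRegression(), n_features_to_select=4)
rfe.fit(X, y)
wrapper_features = list(X.columns[rfe.support_])

# Intrinsic: LASSO shrinks some coefficients to exactly zero while training
# (alpha=1.0 is an arbitrary regularization strength)
lasso = Lasso(alpha=1.0)
lasso.fit(X, y)
intrinsic_features = list(X.columns[lasso.coef_ != 0])

print("Wrapper (RFE):", wrapper_features)
print("Intrinsic (LASSO):", intrinsic_features)
```

The two approaches need not agree on which features to keep, since they use different models and criteria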

Unsupervised feature selection techniques ignore the target variable and remove redundant variables (e.g., by using correlation)

Checking correlation to determine redundant variables

Strong correlation between variables may indicate a dependent relationship or redundancy. It is best to remove some of these variables, or otherwise reduce their number, to avoid overfitting
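One common way to act on this is to drop one variable from every strongly correlated pair. The sketch below assumes a pandas dataframe of numeric features; the helper name `drop_correlated`, the 0.7 threshold, and the toy data are hypothetical choices for illustration

```python
import numpy as np
import pandas as pd


def drop_correlated(df, threshold=0.7):
    """Drop the later column of each pair whose |correlation| exceeds threshold."""
    corr = df.corr().abs()
    # Keep only the upper triangle so each pair is considered once
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return df.drop(columns=to_drop), to_drop


# Toy data: "b" is nearly 2 * "a", so it is redundant with "a"
df = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [2.1, 3.9, 6.2, 8.0, 9.9],
    "c": [5.0, 1.0, 4.0, 2.0, 3.0],
})

reduced, dropped = drop_correlated(df)
print(dropped)
```

Which member of a correlated pair to keep is a judgment call; this sketch simply keeps whichever column comes first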

Correlation metrics

Some useful measures of correlation are the Pearson’s correlation coefficient, Spearman’s rank correlation coefficient, and Kendall’s $\tau$ correlation coefficient (not discussed here yet)

Pearson’s correlation coefficient (linear) - Also called “correlation coefficient”, describes the linear correlation between two variables. The correlation coefficient between variables $X$ and $Y$, $\rho$, is

$$\rho_{X,Y}=\frac{cov(X,Y)}{\sigma_X\sigma_Y}$$

  • \(cov\) is the covariance
  • \(\sigma_X\) is the standard deviation of \(X\)
  • \(\sigma_Y\) is the standard deviation of \(Y\)

The coefficient can have values of $-1\le\rho\le+1$. The larger the absolute value of $\rho$ (i.e., closer to $\pm1$), the more strongly correlated the two variables are. $\rho=0$ indicates no correlation, i.e., no linear dependence between the variables

Calculate the Pearson’s correlation coefficient in python with:
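A minimal sketch, assuming SciPy and NumPy are installed. The arrays `x` and `y` are made-up data, and the manual calculation is included only to mirror the formula above

```python
import numpy as np
from scipy.stats import pearsonr

# Made-up example data
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])

# Directly from the definition: cov(X, Y) / (sigma_X * sigma_Y)
# np.cov uses the sample covariance (ddof=1), so the standard deviations must match
rho_manual = np.cov(x, y)[0, 1] / (np.std(x, ddof=1) * np.std(y, ddof=1))

# With SciPy: pearsonr returns the coefficient first, then a p-value
rho_scipy = pearsonr(x, y)[0]

print(rho_manual, rho_scipy)
```

`numpy.corrcoef(x, y)[0, 1]` gives the same coefficient without the p-value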

Spearman’s (rank) correlation coefficient (nonlinear) - A measure of how monotonic the relationship between two variables is, whether that relationship is linear or nonlinear. It is the Pearson correlation coefficient between the rank variables. Spearman’s correlation coefficient, $\rho_{R(X),R(Y)}$ or $r_s$, is calculated as

$$\rho_{R(X),R(Y)}=r_s=\frac{cov(R(X),R(Y))}{\sigma_{R(X)}\sigma_{R(Y)}}$$

  • $\rho$ is the Pearson correlation coefficient applied to the rank variables
  • $cov(R(X),R(Y))$ is the covariance of the rank variables
  • $\sigma_{R(X)}$ and $\sigma_{R(Y)}$ are the standard deviations of the rank variables

Calculate the Spearman’s rank correlation coefficient in python with:
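A minimal sketch, assuming SciPy and NumPy are installed. The data are made up: `y` is a monotonic but nonlinear function of `x`, chosen to show where Spearman’s and Pearson’s coefficients differ

```python
import numpy as np
from scipy.stats import pearsonr, rankdata, spearmanr

# Made-up example data: y increases monotonically but nonlinearly with x
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = x ** 3

# With SciPy: spearmanr returns the coefficient first, then a p-value
r_s = spearmanr(x, y)[0]

# Equivalent by definition: Pearson's coefficient of the rank variables
r_s_manual = pearsonr(rankdata(x), rankdata(y))[0]

# Pearson's coefficient on the raw values, for comparison
rho = pearsonr(x, y)[0]

print(r_s, r_s_manual, rho)
```

Because the relationship is perfectly monotonic, Spearman’s coefficient is 1, while Pearson’s coefficient is below 1 because the relationship is not linear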

Correlation strength is generally understood as

  • STRONG - $0.7\le|\rho|\le1.0$
  • MODERATE - $0.3\le|\rho|<0.7$
  • WEAK - $0.0\le|\rho|<0.3$
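These ranges can be written as a small helper. The function name `correlation_strength` is hypothetical, and the cutoffs are just the rule-of-thumb values listed above

```python
def correlation_strength(rho):
    """Label a correlation coefficient using the rule-of-thumb ranges above."""
    magnitude = abs(rho)
    if magnitude > 1.0:
        raise ValueError("a correlation coefficient must lie in [-1, 1]")
    if magnitude >= 0.7:
        return "strong"
    if magnitude >= 0.3:
        return "moderate"
    return "weak"


print(correlation_strength(-0.85))
print(correlation_strength(0.45))
print(correlation_strength(0.1))
```

The sign is ignored on purpose: a strong negative correlation is just as informative about redundancy as a strong positive one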

Correlation plots

Correlation plots are a visual way to represent the correlation matrix, a table showing correlation coefficients between pairs of variables on the X and Y axes

Calculate the Pearson’s correlation coefficient matrix of a pandas dataframe object, df, with df.corr(method='pearson') or just df.corr()

Using the seaborn module, plot the correlation matrix as a heatmap with seaborn.heatmap(correlation_matrix)

A full example of calculating and plotting the correlation matrix of the sklearn diabetes dataset:

# Import modules for handling dataframes and plotting
import pandas as pd, seaborn as sns

# Import the diabetes dataset module
from sklearn.datasets import load_diabetes

# Load the diabetes dataset into a dataframe
data = load_diabetes(return_X_y=True, as_frame=True)[0]

# Calculate the correlation matrix of the dataframe
corr_mat = data.corr()

# Plot the correlation matrix as a heatmap
sns.heatmap(corr_mat)

Demo correlation heatmap

Selection methods

Python modules can be used to automatically perform feature selection. Jason Brownlee for Machine Learning Mastery recommends the following two

These must be paired with a scoring function (e.g., Pearson’s correlation coefficient, chi-squared, etc.)
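As a minimal sketch of pairing a selector with a scoring function, the example below uses scikit-learn’s `SelectKBest` with `f_regression`, a scoring function derived from the correlation between each feature and the target. The choice of selector (which may or may not be one of the two methods meant above), the scoring function, and `k=4` are all assumptions made for illustration

```python
from sklearn.datasets import load_diabetes
from sklearn.feature_selection import SelectKBest, f_regression

# Load features and target as pandas objects
X, y = load_diabetes(return_X_y=True, as_frame=True)

# Keep the k features with the highest scores (k=4 is an arbitrary choice)
selector = SelectKBest(score_func=f_regression, k=4)
X_selected = selector.fit_transform(X, y)

# Recover the names of the selected columns
selected_features = list(X.columns[selector.get_support()])
print(selected_features)
```

Swapping `score_func` changes the criterion, e.g. `chi2` for non-negative features with a categorical target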

References

Machine Learning Mastery

Feature Engineering and Selection: A Practical Approach for Predictive Models