Variable Identification
- Types of variable – Predictor variables (Input) and Target Variables (Output)
- Data Types – Character (Gender), Numeric (Height, Weight)
- Variable Category – Categorical (Gender), Continuous (Height, Weight)
Missing Value Treatment
While exploring data, what should we do if we encounter missing values? Our first step should be to identify the reason, then impute the missing values or drop the variable using an appropriate method. But what if we have too many missing values? Should we impute them or drop the variable? I would prefer the latter, because such a variable would not add much information about the data set, nor would it help improve the power of the model. Next question: is there a threshold of missing values for dropping a variable? It varies from case to case; if the variable does not carry much information, you can drop it once it has more than roughly 40-50% missing values.
Let's summarise how we can fix missing values (missing value imputation):
- Mean or Median values can be used to replace the missing numbers
- If missing or NaN values are less than 5%, simply drop or delete all those rows
- Most frequent, zero, or a constant value for each column; this works for both categorical and continuous variables
- KNN (impyute package) – typically more accurate than mean, median and most-frequent imputation
- MICE (Multivariate Imputation by Chained Equations) creates multiple imputations, as opposed to a single imputation, which captures the statistical uncertainty in the imputed values. In addition, the chained-equations approach is very flexible and can handle variables of varying types (e.g. continuous or binary)
- Deep Learning (datawig package) works on both categorical and continuous variables. It is a library that learns machine learning models using deep neural networks to impute missing values in a data frame
Imputation is simply the process of substituting the missing values in our dataset with estimated ones, for example:
import numpy as np
from sklearn.impute import SimpleImputer  # SimpleImputer replaces the older Imputer class
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
imputer = imputer.fit(df[['C', 'D']])
df[['C', 'D']] = imputer.transform(df[['C', 'D']])
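Beyond mean imputation, the KNN and MICE approaches listed above can be sketched with scikit-learn's KNNImputer and IterativeImputer (a MICE-style imputer), as an alternative to the impyute and datawig packages mentioned earlier; the toy DataFrame here is an assumption –
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer
from sklearn.experimental import enable_iterative_imputer  # IterativeImputer is still experimental
from sklearn.impute import IterativeImputer

df = pd.DataFrame({'C': [1.0, 2.0, np.nan, 4.0], 'D': [10.0, np.nan, 30.0, 40.0]})  # toy data

# KNN imputation: each missing value is filled in from the k most similar rows
df_knn = pd.DataFrame(KNNImputer(n_neighbors=2).fit_transform(df), columns=df.columns)

# MICE-style imputation: each column is iteratively modelled from the other columns
df_mice = pd.DataFrame(IterativeImputer(max_iter=10, random_state=0).fit_transform(df), columns=df.columns)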
Outliers Detection
Any value below Q1 - 1.5 × IQR or above Q3 + 1.5 × IQR (where IQR = Q3 - Q1, the interquartile range) is treated as an outlier
Visualisation of data shows outliers – Box Plot, Histogram, Scatter Plot
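As a minimal sketch of the IQR rule above (the toy Series is an assumption) –
import pandas as pd

values = pd.Series([10, 12, 11, 13, 12, 95])  # toy data with one extreme value
q1, q3 = values.quantile(0.25), values.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = values[(values < lower) | (values > upper)]  # 95 is flagged as an outlier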
There are two types of outliers:
- Univariate outlier – an extreme value in a single variable (one dimension)
- Multivariate outlier – an outlier in n-dimensional space, i.e. an unusual combination of values across several variables
Now there are numerous ways to identify outliers. One convenient option is the PyOD package, which can be installed as shown below,
pip install pyod
A few of the algorithms supported by PyOD –
- Empirical Cumulative Distribution Functions (ECOD)
- Angle-Based Outlier Detection (ABOD)
- Fast Angle-Based Outlier Detection (FastABOD)
- KNN-based detection (kNN)
- Isolation Forest (IForest)
- Histogram-Based Outlier Score (HBOS)
- Local Correlation Integral (LOCI)
- Cluster-Based Local Outlier Factor (CBLOF)
- Copula-Based Outlier Detection (COPOD)
Code for Outlier Detection –
from pyod.models.ecod import ECOD
clf = ECOD()
clf.fit(X_train)
y_train_scores = clf.decision_scores_ # raw outlier scores on the train data
y_test_scores = clf.decision_function(X_test) # predict raw outlier scores on test
How to remove outliers?
- Deletion of outliers
- Transforming values – Transforming values can also eliminate outliers. Taking the natural log of a value, i.e. log(value), reduces the variation caused by extreme values (see the sketch after this list)
- Binning Values – Binning is a way to group a number of more or less continuous values into a small number of “bins”. For example, if you have data about a group of people, you might want to arrange their ages into smaller intervals, e.g. x < 20, 20 < x < 30, 30 < x < 40, x > 40. The binning method can be based on specified limits, even intervals, an even distribution, unique values, the standard deviation, or sub-string values
- Imputing Outliers – Like the imputation of missing values, we can also impute outliers. We can use mean, median or mode imputation methods. Before imputing values, we should analyse whether it is a natural or an artificial outlier; if it is artificial, we can go ahead and impute. We can also use statistical models to predict the values of outlier observations and then replace them with the predicted values.
- Treat Separately – If there is a significant number of outliers, we should treat them separately in the statistical model. One approach is to treat the two groups differently, build models for both of them, and later combine the output
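A minimal sketch of the treatments above (log transform, binning with pandas.cut, and capping at the IQR fences as a simple form of imputation); the toy Series is an assumption –
import numpy as np
import pandas as pd

values = pd.Series([10, 12, 11, 13, 12, 95])  # toy data

log_values = np.log(values)  # transforming: the log compresses the effect of extreme values

# binning: group more or less continuous values into a small number of intervals
bins = pd.cut(values, bins=[0, 20, 30, 40, np.inf], labels=['<20', '20-30', '30-40', '>40'])

# capping: clip values to the IQR fences instead of deleting them
q1, q3 = values.quantile(0.25), values.quantile(0.75)
iqr = q3 - q1
capped = values.clip(lower=q1 - 1.5 * iqr, upper=q3 + 1.5 * iqr)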
Dimension Reduction Techniques
Dimension Reduction refers to the process of converting a set of data having vast dimensions into data with fewer dimensions, while ensuring that it conveys similar information concisely. These techniques are typically used while solving machine learning problems to obtain better features for a classification or regression task.
Consider two dimensions x1 and x2, which are, let us say, measurements of several objects in cm (x1) and inches (x2). Now, if you were to use both these dimensions in machine learning, they would convey similar information and introduce a lot of noise into the system, so you are better off using just one dimension. Here we have converted the data from 2D (x1 and x2) to 1D (z1), which has made the data relatively easier to explain.
In a similar way, we can reduce the n dimensions of a data set to k dimensions (k < n). These k dimensions can be directly identified (filtered), or can be a combination of dimensions (weighted averages of dimensions), or new dimension(s) that represent multiple existing dimensions well.
One of the most common applications of this technique is image processing. You might have come across the Facebook application “Which Celebrity Do You Look Like?“. But have you ever wondered what algorithm is used for this? Well, to identify the matched celebrity image, we use pixel data, and each pixel is equivalent to one dimension. Every image contains a high number of pixels, i.e. a high number of dimensions, and every dimension is important here. You can’t omit dimensions randomly to make better sense of your overall data set. In such cases, dimension reduction techniques help you find the significant dimension(s) using various methods. We will now look at these methods.
Linear Techniques (Component Based)
A linear dataset or structure arranges data in a sequential manner, e.g. array, linked list, queue, stack
Factor Analysis
The Factor Analysis technique is best suited for situations where we have a highly correlated set of variables. It divides the variables into different groups based on their correlation and represents each group with a factor
Principal Component Analysis (PCA)
PCA is one of the most widely used techniques for dealing with linear data. It transforms the data into a set of components, each of which tries to explain as much of the remaining variance as possible. It is an unsupervised algorithm; the goal of PCA is to find the directions that maximise the variance captured in the dataset
The eigenvectors and eigenvalues of the covariance (or correlation) matrix represent the core of a PCA
The eigenvectors determine the directions of the new feature space
The eigenvalues determine their magnitude (the variance along each direction)
Factor loadings are the correlation coefficients between the variables (rows) and the factors (columns)
PCA scores are the scores of each case (row) on each factor (column)
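A minimal PCA sketch with scikit-learn (the toy feature matrix X is an assumption); components_ holds the eigenvectors and explained_variance_ the corresponding eigenvalues –
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = np.random.rand(100, 5)                  # toy data: 100 rows, 5 features
X_std = StandardScaler().fit_transform(X)   # PCA is sensitive to the scale of the variables

pca = PCA(n_components=2)
scores = pca.fit_transform(X_std)           # PCA scores: each row projected onto the components

print(pca.components_)                      # eigenvectors: directions of the new feature space
print(pca.explained_variance_)              # eigenvalues: variance captured by each component
print(pca.explained_variance_ratio_)        # share of total variance explained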
Independent Component Analysis (ICA)
ICA transforms the data into independent components, which describe the data using a smaller number of components
Linear Discriminant Analysis (LDA)
LDA is a supervised algorithm, as it takes the class label into consideration. LDA finds the centroid of each class's data points, and a new dimension is then determined which should satisfy two criteria (a sketch follows the list):
- Maximise the distance between the centroid of each class
- Minimise the variation within each category
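A minimal LDA sketch with scikit-learn (the iris dataset stands in for X and the class labels y) –
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)           # 4 features, 3 classes

# LDA is supervised: it uses y, and keeps at most (n_classes - 1) components
lda = LinearDiscriminantAnalysis(n_components=2)
X_lda = lda.fit_transform(X, y)             # 4D data projected onto 2 discriminant axes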
Non-Linear Techniques (Projection Based)
A non-linear dataset or structure arranges data in a hierarchical manner, creating relationships among the data elements, e.g. tree, graph
Singular Value Decomposition (SVD)
SVD might be the most popular technique for dimensionality reduction when the data is sparse, as in the case of:
- Recommender systems, where a user has rated very few movies
- Text classification, where some classes have few occurrences
- Bag of Words (BOW), where the count or frequency of most words is 0
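A minimal sketch with scikit-learn's TruncatedSVD on a sparse bag-of-words matrix (the toy corpus is an assumption) –
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import TruncatedSVD

docs = ["the cat sat", "the dog barked", "the cat and the dog"]  # toy corpus
bow = CountVectorizer().fit_transform(docs)     # sparse bag-of-words matrix (mostly zeros)

svd = TruncatedSVD(n_components=2, random_state=0)
reduced = svd.fit_transform(bow)                # dense, low-dimensional representation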
Multi-Dimensional Scaling (MDS)
MDS is a technique that creates a visual representation of the distances or similarities between sets of objects
ISOMAP (Isometric Feature Mapping)
It is a combination of the Floyd-Warshall algorithm with MDS. ISOMAP first builds a neighbourhood graph and uses Floyd-Warshall to compute the pair-wise (geodesic) distances between all points; it then uses MDS, which takes this matrix of pair-wise distances and computes a reduced-dimensional position for each point.
t-SNE
The t-SNE technique works well when the data is strongly non-linear, and it works extremely well for visualisation
UMAP
The UMAP technique works well for high-dimensional data. Its run time is shorter than that of t-SNE
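A minimal visualisation sketch with scikit-learn's TSNE; the UMAP lines assume the separate umap-learn package is installed –
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

X, y = load_digits(return_X_y=True)   # 64-dimensional digit images

# t-SNE: embed the data into 2D for visualisation
X_tsne = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X)

# UMAP (umap-learn package), typically faster than t-SNE on high-dimensional data:
# import umap
# X_umap = umap.UMAP(n_components=2, random_state=0).fit_transform(X)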
Locally -Linear Embedding (LLE)
LLE is similar to ISOMAP, with several advantages including faster optimisation
Laplacian Eigenmaps
Laplacian Eigenmaps uses spectral techniques to perform dimensionality reduction. This technique relies on the basic assumption that the data lies in a low dimensional manifold in a high dimensional space.
Autoencoders
An autoencoder is a feedforward neural network trained to approximate the identity function, used for dimensionality reduction on non-linear data. The network learns to encode the input vector into a small number of dimensions and then decode it back to the original space.
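A minimal autoencoder sketch with Keras (the 64-dimensional toy input and the layer sizes are assumptions); the encoder compresses the input to a small bottleneck and the decoder reconstructs it –
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

X = np.random.rand(1000, 64).astype('float32')   # toy data, 64 input dimensions

inputs = keras.Input(shape=(64,))
encoded = layers.Dense(8, activation='relu')(inputs)        # bottleneck: 64 -> 8 dimensions
decoded = layers.Dense(64, activation='sigmoid')(encoded)   # reconstruction: 8 -> 64

autoencoder = keras.Model(inputs, decoded)
autoencoder.compile(optimizer='adam', loss='mse')
autoencoder.fit(X, X, epochs=5, batch_size=32, verbose=0)   # target = input (identity function)

encoder = keras.Model(inputs, encoded)    # use the encoder alone for dimensionality reduction
X_reduced = encoder.predict(X)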
Benefits of Dimension Reduction
- Reducing the dimensions of data to 2D or 3D may allow us to plot and visualize it precisely, and you can then observe patterns more clearly. For example, 3D data can be converted to 2D by first identifying a 2D plane and then representing the points on the two new axes z1 and z2.
- It helps in compressing the data and reducing the storage space required
- It reduces the time required to perform the same computations. Fewer dimensions mean less computation, and fewer dimensions also allow the use of algorithms that are unfit for a large number of dimensions
- It takes care of multicollinearity, which improves model performance. It removes redundant features; for example, there is no point in storing a value in two different units (metres and inches).
- It also helps with noise removal, and as a result we can improve the performance of models.
Methods to perform Feature Scaling
Rescale Data
When your data is comprised of attributes with varying scales, many machine learning algorithms can benefit from re-scaling the attributes so that they all have the same scale.
Often this is referred to as normalisation, and attributes are typically re-scaled into the range of 0 to 1. It is useful for algorithms like gradient descent, algorithms that weight inputs such as regression and neural networks, and algorithms that use distance measures such as KNN
Binarizer(threshold=0.0)  # all values above the threshold are marked 1 and all values at or below it are marked 0
Normalisation
Normalisation scales features into the range 0 to 1, retaining their proportional range to each other
X' = (x - min(x)) / (max(x) - min(x))
MinMaxScaler(feature_range=(0, 1))  # note that scikit-learn's Normalizer(), by contrast, rescales each sample (row) to unit norm
Standardisation
Standardisation transforms attributes that follow a Gaussian (normal) distribution with differing means and standard deviations into a standard Gaussian distribution with a mean of 0 and a standard deviation of 1
It works best when the input variables follow a Gaussian distribution and suits models that assume re-scaled data, such as linear regression, logistic regression and LDA
For a normal distribution, roughly 68% of the data lies within one standard deviation of the mean, i.e. between -1 and 1 after standardisation
X' = (x - μ) / σ
StandardScaler()
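A minimal sketch contrasting min-max normalisation and standardisation (the toy height/weight data is an assumption) –
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[170.0, 65.0], [180.0, 80.0], [160.0, 55.0]])   # e.g. height (cm), weight (kg)

X_minmax = MinMaxScaler(feature_range=(0, 1)).fit_transform(X)  # each column scaled into [0, 1]
X_std = StandardScaler().fit_transform(X)                       # each column: mean 0, std 1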
Label Encoding
Label encoding converts word labels into numbers so that algorithms can work on them at prediction time. The label encoder encodes labels with values between 0 and n_classes - 1
LabelEncoder()
One- Hot Encoding ( Dummy Encoding)
One-hot encoding returns a vector for each unique value of the categorical column. It creates ‘n’ columns for ‘n’ unique values
One-hot encoding – creates ‘n’ columns
Dummy encoding – creates ‘n-1’ columns
The problem with one-hot encoding is that it leads to multicollinearity: two or more variables become highly correlated, meaning that if we know the value of one variable we can predict the value of the other. It is basically a linear relationship between the variables.
OneHotEncoder()
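A minimal sketch of label, one-hot and dummy encoding (the toy Gender column is an assumption) –
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

df = pd.DataFrame({'Gender': ['Male', 'Female', 'Female', 'Male']})

# Label encoding: values between 0 and n_classes - 1
df['Gender_label'] = LabelEncoder().fit_transform(df['Gender'])

# One-hot encoding: n columns for n unique values
onehot = OneHotEncoder().fit_transform(df[['Gender']]).toarray()

# Dummy encoding: n - 1 columns (one level dropped to avoid multicollinearity)
dummies = pd.get_dummies(df['Gender'], drop_first=True)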
Low Variance
Let’s think of a scenario where we have a constant variable (all observations have the same value, e.g. 5) in our data set. Do you think it can improve the power of the model? Of course NOT, because it has zero variance. When there is a high number of dimensions, we should drop variables having low variance compared to the others, because such variables will not explain the variation in the target variable.
Consider a variable in our dataset where all the observations have the same value, say 1. Calculate the variance of each variable and drop the ones having low variance compared to the others
train.var()
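A minimal sketch of the variance check, plus scikit-learn's VarianceThreshold to drop constant columns (the toy train DataFrame is an assumption) –
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

train = pd.DataFrame({'a': [5, 5, 5, 5],     # constant column: zero variance
                      'b': [1, 2, 3, 4],
                      'c': [10, 20, 30, 40]})

print(train.var())                            # variance of each variable

selector = VarianceThreshold(threshold=0.0)   # drop columns with zero variance
reduced = selector.fit_transform(train)       # column 'a' is removed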
Decision Trees
This technique can be used as an all-in-one approach to tackle multiple challenges like missing values, outliers and identifying significant variables. It worked well in our Data Hackathon as well; several data scientists used decision trees and it worked well for them.
Random Forest
Similar to the decision tree is the Random Forest. This is one of the most commonly used techniques; it tells us the importance of each feature present in the dataset. We can find the importance of each feature and keep only the top features, resulting in dimensionality reduction. I would also recommend using the in-built feature importance provided by random forests to select a smaller subset of input features. Just be careful that random forests have a tendency to be biased towards variables that have a larger number of distinct values, i.e. they favour numeric variables over binary/categorical ones.
model = RandomForestRegressor(random_state=1, max_depth=10)
feature = SelectFromModel(model)
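A minimal end-to-end sketch of the snippet above (the toy regression dataset is an assumption) –
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import SelectFromModel

X, y = make_regression(n_samples=200, n_features=10, n_informative=3, random_state=1)  # toy data

model = RandomForestRegressor(random_state=1, max_depth=10)
model.fit(X, y)
print(model.feature_importances_)               # importance of each feature

feature = SelectFromModel(model, prefit=True)   # keep features whose importance is above the mean
X_selected = feature.transform(X)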
High Correlation
Dimensions exhibiting high correlation can lower the performance of a model. Moreover, it is not good to have multiple variables carrying similar information or variation, a situation known as “multicollinearity”. You can use a Pearson (continuous variables) or polychoric (discrete variables) correlation matrix to identify the variables with high correlation, and select one of them using the VIF (Variance Inflation Factor). Variables with a higher value (VIF > 5) can be dropped.
High correlation between two variables means they carry similar information. This brings down the performance of some models drastically (linear and logistic regression, for instance). As a general guideline, we should keep those variables which show a decent or high correlation with the target variable. Generally, if the correlation between a pair of variables is greater than 0.5 to 0.6, we should consider dropping one of them.
df.corr()
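A minimal sketch of the correlation check plus VIF (statsmodels is assumed to be installed; the toy data deliberately makes x1 and x2 highly correlated) –
import numpy as np
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(0)
x1 = rng.normal(size=100)
df = pd.DataFrame({'x1': x1,
                   'x2': 0.9 * x1 + rng.normal(scale=0.1, size=100),  # nearly duplicates x1
                   'x3': rng.normal(size=100)})

print(df.corr())   # x1 and x2 show a correlation well above 0.5-0.6

vif = pd.Series([variance_inflation_factor(df.values, i) for i in range(df.shape[1])],
                index=df.columns)
print(vif)         # variables with VIF > 5 are candidates for dropping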
Backward Feature Elimination
- In this method, we start with all n dimensions.
- Compute the sum of squared errors (SSR) after eliminating each variable (n times).
- Then identify the variable whose removal has produced the smallest increase in the SSR and remove it, leaving us with n-1 input features.
- Repeat this process until no other variables can be dropped.
- The reverse of this is the “Forward Feature Selection” method, described in the next section: we select one variable at a time and analyse the performance of the model as each variable is added, choosing the variable that gives the greatest improvement in model performance (a sketch covering both directions follows that section).
Forward Feature Selection
- This is opposite of backward feature elimination.
- We start with a single feature. Essentially, we train the model ‘n’ times, using each feature separately.
- The variable giving the best performance is selected as the starting variable.
- Then we repeat this process and add one variable at a time. The variable that produces the highest increase in performance is retained.
- We repeat this process until no significant improvement is seen in the model’s performance
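A minimal sketch of both directions using scikit-learn's SequentialFeatureSelector (the toy dataset and the linear-regression estimator are assumptions; the selector scores candidates by cross-validated model performance rather than the SSR criterion described above) –
from sklearn.datasets import load_diabetes
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

X, y = load_diabetes(return_X_y=True)
estimator = LinearRegression()

# Forward selection: start empty and add the feature that improves the score most
forward = SequentialFeatureSelector(estimator, n_features_to_select=5, direction='forward').fit(X, y)

# Backward elimination: start with all features and drop the least useful one at a time
backward = SequentialFeatureSelector(estimator, n_features_to_select=5, direction='backward').fit(X, y)

print(forward.get_support())    # boolean mask of the selected features
print(backward.get_support())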