Data Normalization in Python

When working on machine learning projects, you need to prepare the data properly before feeding it into a model. One common preprocessing step is normalization.

# **What is Normalized Data?**

Normalized data, in a statistical sense, is data that has had its units of measurement removed. This makes it easy to compare values that are on different scales or that were measured from different sources.

# **Why is Normalized Data Important?**

When training a machine learning model, we want to bring the data to a common scale so that no single feature dominates the others simply because of its units. Data normalization is a way of transforming features that use different units or scales, which lets the model train on comparable features and can lead to more accurate predictions.

## Methods of Normalizing Data

### Simple Feature Scaling

---

![](https://miro.medium.com/max/780/1*Mm1h1gI4DZymZFJWu6JHWA.png)

Simple feature scaling normalizes each value to the range -1 to 1 by dividing it by the maximum absolute value of its column. We can implement this in Python:

```python
# importing pandas
import pandas as pd

# creating a dataframe from our data
df = pd.read_csv('example.csv')

# function that returns the result of the simple feature scaling equation
def norm(item):
    return item / item.abs().max()

# apply the norm function to each column of the dataframe
df = df.apply(norm)
df
```

We can implement this using scikit-learn:

```python
# importing MaxAbsScaler
from sklearn.preprocessing import MaxAbsScaler

# creating a MaxAbsScaler object
norm = MaxAbsScaler()

# applying the scaler to the dataframe
df_norm = pd.DataFrame(norm.fit_transform(df), columns=df.columns)
df_norm
```

### Min-Max

---

![](https://miro.medium.com/max/948/1*A_80h3vCPcSKfGjczkxvXA.png)

The min-max method scales each feature to a fixed range between 0 and 1. We can implement this in Python:

```python
# function that returns the result of the min-max equation
def norm(item):
    return (item - item.min()) / (item.max() - item.min())

# apply the norm function to each column of the dataframe
df = df.apply(norm)
df
```

We can implement this using scikit-learn:

```python
# importing MinMaxScaler
from sklearn.preprocessing import MinMaxScaler

# creating a MinMaxScaler object
norm = MinMaxScaler()

# applying the scaler to the dataframe
df_norm = pd.DataFrame(norm.fit_transform(df), columns=df.columns)
df_norm
```

### Z-Score

---

![](https://miro.medium.com/max/780/1*0jmfp2g2qeMpyUkRCV2i3g.png)

The z-score measures how many standard deviations a value lies from the mean. To calculate it, we first need the mean and the standard deviation of the feature. We can implement this in Python:

```python
# function that returns the result of the z-score equation
def z_score(item):
    return (item - item.mean()) / item.std()

# apply the z_score function to each column of the dataframe
df = df.apply(z_score)
df
```

We can implement this using scikit-learn:

```python
# importing StandardScaler
from sklearn.preprocessing import StandardScaler

# creating a StandardScaler object
norm = StandardScaler()

# applying the scaler to the dataframe
df_norm = pd.DataFrame(norm.fit_transform(df), columns=df.columns)
df_norm
```

Normalizing data transforms every feature to a common scale. As we have shown, it is simple to implement, and scikit-learn lets us normalize data without writing out the equations ourselves.
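To see how the three methods compare side by side, here is a minimal sketch that applies each scaler to a small, made-up DataFrame. The column names and values are purely illustrative assumptions, not taken from `example.csv`:

```python
# minimal comparison sketch; the sample data below is hypothetical
import pandas as pd
from sklearn.preprocessing import MaxAbsScaler, MinMaxScaler, StandardScaler

# hypothetical data: two features on very different scales
df = pd.DataFrame({
    'age': [22, 35, 47, 58, 63],
    'income': [28000, 52000, 61000, 90000, 120000],
})

# apply each scaler and print the results
scalers = {
    'simple feature scaling': MaxAbsScaler(),
    'min-max': MinMaxScaler(),
    'z-score': StandardScaler(),
}

for name, scaler in scalers.items():
    scaled = pd.DataFrame(scaler.fit_transform(df), columns=df.columns)
    print(f"--- {name} ---")
    print(scaled.round(3))
```

One small difference to be aware of: pandas' `std()` uses the sample standard deviation (ddof=1), while `StandardScaler` standardizes with the population standard deviation, so the hand-rolled `z_score` function and scikit-learn's output can differ slightly on small datasets.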