Everything About Pipelines In Machine Learning and How Are They Used? – Analytics India Magazine

In machine learning, building a predictive model for a classification or regression task involves many steps, from exploratory data analysis to visualization and transformation. Several of these transformations pre-process the data to get it ready for modelling, such as missing value treatment, encoding the categorical data, and scaling or normalizing the data. We perform all these steps and build a machine learning model, but when making predictions on the testing data we have to repeat the same steps that were performed while preparing the training data.

With so many steps, it is easy to lose track of these transformations, especially when working on a big project in a team. To resolve this, we use pipelines, which hold every step that is performed, from the first transformation through to fitting the data on the model.

Through this article, we will explore pipelines in machine learning and see how to implement them for a better understanding of all the transformation steps.

What will we learn from this article?

A pipeline is simply an object that holds all the processing steps, from data transformation to model building. Suppose that, while building a model, we encode the categorical data, then scale or normalize the data, and finally fit the training data on the model. If we design a pipeline for this task, the pipeline object holds all of these steps, and calling it runs every defined step in order.

This is very useful when a team is working on the same project, because the pipeline definition gives every team member a clear view of the transformations taking place. sklearn provides a class named Pipeline for this purpose. All the steps in a pipeline are executed sequentially. Every intermediate step must implement both fit and transform, whereas the last step only needs fit, which usually trains the model on the transformed data.

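When we call fit on the pipeline, each intermediate step is fitted on the data and then transforms it, and the transformed output is passed on to the next step; the last step is then fitted on the result. When making predictions with the pipeline, the data again passes through the transform of every intermediate step, this time without refitting, and the last step then makes the prediction.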
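The fit and predict flow described above can be checked with a short sketch. It assumes a synthetic data set from make_classification as a stand-in for real data, and the step names scaler and model, the forest size and the random_state values are illustrative choices rather than anything prescribed by the article.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption, not the article's data set).
X, y = make_classification(n_samples=200, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("model", RandomForestClassifier(n_estimators=20, random_state=0)),
])

# fit: the scaler is fitted on X_train and transforms it,
# then the forest is fitted on the scaled training data.
pipe.fit(X_train, y_train)

# predict, done by hand: transform with the already fitted scaler
# (no refitting on the test data), then predict with the final step.
scaled_test = pipe.named_steps["scaler"].transform(X_test)
manual_pred = pipe.named_steps["model"].predict(scaled_test)

# The pipeline's own predict should give the same labels.
pipeline_pred = pipe.predict(X_test)
same = np.array_equal(manual_pred, pipeline_pred)
print(same)
```

The scaler's statistics come only from the training rows, which is what keeps information from the test set out of the pre-processing.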

Implementing a pipeline is straightforward and mainly involves four steps.

Let us now understand the pipeline in practice by implementing it on a data set. We first import the required libraries and the data set. We then split the data set into training and testing sets, define the pipeline, and finally call its fit and score functions.

The pipeline object is named pipe here, and the programmer can choose any other name. Its two steps are sc, a StandardScaler object, and rfcl, a Random Forest Classifier object.
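The listing that creates these objects is not shown in the text, so the following is only a minimal sketch of what such a setup might look like. It assumes synthetic stand-in data from make_classification instead of the article's data set, and the 70/30 split and the random_state values are likewise assumptions. Only the names sc, rfcl, pipe, X_train, X_test, y_train and y_test come from the article.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in data: 8 numeric features and a binary target (an assumption;
# replace with your own feature matrix X and label vector y).
X, y = make_classification(n_samples=500, n_features=8, random_state=42)

# Hold out 30% of the rows for testing (the split ratio is an assumption).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# One object per step, then the pipeline that chains them in order.
sc = StandardScaler()
rfcl = RandomForestClassifier(random_state=42)
pipe = Pipeline(steps=[("sc", sc), ("rfcl", rfcl)])
```

Each entry in steps is a (name, estimator) pair, and the order of the list is the order in which the steps run.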

pipe.fit(X_train, y_train)

print(pipe.score(X_test, y_test))

Sometimes we do not want to define a separate object for each step, like sc and rfcl for StandardScaler and Random Forest Classifier, since there can be many different transformations. In that case we can use make_pipeline, which can be imported from the pipeline module of sklearn. Refer to the example below.

from sklearn.pipeline import make_pipeline

pipe = make_pipeline(StandardScaler(), RandomForestClassifier())

In this case we passed the StandardScaler and RandomForestClassifier instances directly instead of first assigning them to named objects, and make_pipeline names the steps for us. Now let's see the steps present in this pipeline.

print(pipe.steps)

pipe.fit(X_train,y_train)

print(pipe.score(X_test, y_test))
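As a small illustration of the step names that make_pipeline generates, the sketch below lists them and uses one to reach a step and change a parameter. The value 50 for n_estimators is an arbitrary choice made for the example.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

pipe = make_pipeline(StandardScaler(), RandomForestClassifier())

# make_pipeline names each step after its class, in lowercase.
step_names = [name for name, _ in pipe.steps]
print(step_names)  # ['standardscaler', 'randomforestclassifier']

# A generated name can be used to reach a step...
forest = pipe.named_steps["randomforestclassifier"]

# ...or to set one of its parameters with the <step>__<parameter> syntax.
pipe.set_params(randomforestclassifier__n_estimators=50)
print(forest.n_estimators)  # 50
```

With Pipeline the names are chosen by hand, while make_pipeline trades that control for a shorter definition.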

Conclusion

Through this article, we discussed pipeline construction in machine learning and how pipelines help when different people work on the same project, avoiding confusion and giving a clear understanding of each step that is performed one after another. We then discussed the steps for building a pipeline that had two steps, i.e. scaling and the model, and implemented it on the Pima Indians Diabetes data set. Finally, we explored another way of defining a pipeline, which is building it with make_pipeline.

I am currently enrolled in a Post Graduate Program in Artificial Intelligence and Machine Learning. Data Science enthusiast who likes to draw insights from data. Always amazed by the intelligence of AI. It's really fascinating teaching a machine to see and understand images. Also, the interest gets doubled when the machine can tell you what it just saw. This is where I say I am highly interested in Computer Vision and Natural Language Processing. I love exploring different use cases that can be built with the power of AI. I am the person who first develops something and then explains it to the whole community with my writings.
