Cab Cancellation Prediction using Data Science and Machine Learning!

Let’s try to answer some questions about which day and time give us the best chance of getting a cab, based on data collected from Kaggle!


Our Strategy:

CRISP-DM (Cross-Industry Standard Process for Data Mining) is a common process for working through data science problems.

The phases of the CRISP-DM process are:

- Business Understanding

- Data Understanding

- Data Preparation

- Data Modelling

- Evaluation

- Deployment

A quick glance at the data:

data_frame.head() to view the first 5 rows

The data has detailed information about the Request id, Pickup point (either City or Airport), Driver id, and the timestamps for request and drop. There are 6745 rows in total with 6 columns.
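As a rough sketch of this first look, the snippet below builds a tiny made-up sample in place of the Kaggle file and inspects it. The column names (including a `Status` column) and the helper `parse_timestamps` are assumptions for illustration, not taken from the original notebook.

```python
import pandas as pd

# Tiny made-up sample standing in for the Kaggle file.
# Column names, including "Status", are assumptions for illustration.
sample = pd.DataFrame({
    "Request id": [1, 2, 3],
    "Pickup point": ["City", "Airport", "City"],
    "Driver id": [10.0, None, 12.0],
    "Status": ["Trip Completed", "No Cars Available", "Cancelled"],
    "Request timestamp": ["2016-07-11 06:30:00", "2016-07-11 18:05:00", "2016-07-12 05:45:00"],
    "Drop timestamp": ["2016-07-11 07:10:00", None, None],
})

def parse_timestamps(df):
    """Return a copy with both timestamp columns converted to datetimes."""
    out = df.copy()
    for col in ("Request timestamp", "Drop timestamp"):
        out[col] = pd.to_datetime(out[col])
    return out

data_frame = parse_timestamps(sample)
print(data_frame.head())   # first rows (up to 5)
print(data_frame.shape)    # (rows, columns)
```

With the real file, `sample` would be replaced by whatever `pd.read_csv` returns for the downloaded CSV.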

How many people book a cab from the city and from the airport?

More people book a cab with the city as their pickup location, but the difference from airport pickups is not large.
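A count like this can be sketched with `value_counts`. The sample rows and the column name `Pickup point` are made up for illustration; `pickup_counts` is a hypothetical helper.

```python
import pandas as pd

def pickup_counts(df, col="Pickup point"):
    """Count requests per pickup location (column name is an assumption)."""
    return df[col].value_counts()

# Made-up rows, only to show the shape of the result.
sample = pd.DataFrame({"Pickup point": ["City", "City", "Airport", "City", "Airport"]})
counts = pickup_counts(sample)
print(counts)
```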

How many people got their cabs and how many got cancelled?

From the univariate and bivariate analysis, we can say that the share of requests that do not get a cab is higher from the airport than from the city.
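One way to compare the two pickup points is to compute the share of requests that did not end in a completed trip. This is a minimal sketch: the `Status` column, its `Trip Completed` value, and the helper `unserved_share` are assumptions, and the rows are made up.

```python
import pandas as pd

def unserved_share(df, pickup_col="Pickup point", status_col="Status",
                   completed="Trip Completed"):
    """Fraction of requests per pickup point that did not end in a trip."""
    not_served = df[status_col] != completed
    # The mean of a boolean Series is the fraction of True values.
    return not_served.groupby(df[pickup_col]).mean()

# Made-up rows for illustration only.
sample = pd.DataFrame({
    "Pickup point": ["Airport", "Airport", "Airport", "City", "City"],
    "Status": ["No Cars Available", "Trip Completed", "Cancelled",
               "Trip Completed", "Cancelled"],
})
shares = unserved_share(sample)
print(shares)
```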

What are the total requests that come in for every hour in the day?

From the above analysis, most requests for a cab come in between 5:00–10:00 and 17:00–23:00.

What will my cab cancellation rate be if I want to book the cab at 6:30?
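The hourly totals can be sketched by pulling the hour out of the request timestamp and counting. The column name `Request timestamp`, the helper `requests_per_hour`, and the sample rows are assumptions for illustration.

```python
import pandas as pd

def requests_per_hour(df, ts_col="Request timestamp"):
    """Number of requests for each hour 0..23 (hours with no requests get 0)."""
    hours = pd.to_datetime(df[ts_col]).dt.hour
    return hours.value_counts().reindex(range(24), fill_value=0)

# Made-up timestamps for illustration only.
sample = pd.DataFrame({
    "Request timestamp": ["2016-07-11 06:30:00",
                          "2016-07-11 06:45:00",
                          "2016-07-11 18:05:00"],
})
per_hour = requests_per_hour(sample)
print(per_hour)
```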

I tried several algorithms for this question. Because the data had to be reduced to 24 rows for it, the decision tree algorithm performed better than the others. Coming to our question, the predicted cab cancellation rate was 85%! For now, if we take 70% as the cutoff, we should assume that we may not get the cab, whether because of cancellations or some other reason, so it is better not to depend on it and to look for other ways to travel. Settling on a cutoff needs additional data, as the current data was not sufficient for me to justify one.
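The idea can be sketched as follows, assuming the 24 rows are one per hour of the day and that the target is the fraction of requests not served in that hour. The rates below are made up and do not reproduce the 85% result; the feature choice, the helper `predicted_rate`, and the use of scikit-learn's `DecisionTreeRegressor` are assumptions rather than the original notebook's setup.

```python
from sklearn.tree import DecisionTreeRegressor

# Made-up hourly rates (fraction of requests not served), one per hour 0..23.
hours = list(range(24))
rates = [0.75 if 5 <= h <= 9 else 0.60 if 17 <= h <= 22 else 0.30 for h in hours]

# Fit a decision tree on the single feature "hour of day".
model = DecisionTreeRegressor(random_state=0)
model.fit([[h] for h in hours], rates)

def predicted_rate(model, hour):
    """Predicted not-served rate for a given hour bucket."""
    return float(model.predict([[hour]])[0])

CUTOFF = 0.70                           # cutoff taken from the text above
rate_0630 = predicted_rate(model, 6)    # 6:30 falls in the hour-6 bucket
risky = rate_0630 >= CUTOFF
print(rate_0630, risky)
```

With only 24 training rows, a tree like this simply memorises the hourly values, which is one more reason the cutoff needs additional data before it can be trusted.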


There’s a lot that can be done in the future, such as studying the cancellation rates and the ratio of getting a cab in more detail.

This is my first data science blog and I hope you enjoyed it. Thank you for your attention! Any advice, or anything that should be improved in future blogs, is welcome, as it will really help me to improve and learn.

To view the code:

To connect with me via LinkedIn:

I would love to connect :) See you in the next blog. Until then, keep hustling and stay safe!

Junior Machine Learning Engineer-Omdena | Microsoft Learn Student Ambassador-Beta | Exploring Data Science | Pythoneer | I’m curious about tech and Love chess|