Cab Cancellation Prediction using Data Science and Machine Learning!

Kaushik Tummalapalli
4 min read · Dec 25, 2020

Let’s try to unpack some questions about which day and time are most likely to get us a cab, based on data collected from Kaggle!

Introduction:

Personally, I have had a cab fail to show up or get cancelled on numerous occasions, for many different reasons and across many services. It made me wish for a feature that tells us the cab cancellation rate, so that we don’t wait for a cab that may get cancelled or may never arrive because no cabs are available. Disclaimer: there is an option to book in advance, but people who need to travel urgently and can’t get a cab on time go through a lot of struggle. After many such incidents, I believe that offering this feature across all the leading service providers would be a game changer and would help many people across the world who depend on this kind of transportation.

Our Strategy:

Let’s follow the CRISP-DM process to answer our questions!

CRISP-DM (Cross-Industry Standard Process for Data Mining) is a common process for structuring Data Science projects.

The phases of the CRISP-DM process include:

- Business Understanding

- Data Understanding

- Data Preparation

- Data Modelling

- Evaluation

- Deployment

Quick Glance about the data:

data_frame.head() to view the first 5 rows

The data has detailed information about the Request id, Pickup point (either City or Airport), Driver id, and the time stamps for request and drop. There are 6,745 rows in total with 6 columns.
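As a rough sketch of this first look, the snippet below builds a tiny stand-in table and inspects it the same way. The column names (including `status`) and every value are assumptions made up for illustration; they are not the real Kaggle file.

```python
import pandas as pd

# Toy stand-in for the Kaggle data: column names and values are assumptions.
data_frame = pd.DataFrame({
    "request_id": [1, 2, 3, 4],
    "pickup_point": ["City", "Airport", "City", "Airport"],
    "driver_id": [10.0, None, 12.0, 13.0],
    "status": ["Trip Completed", "No Cars Available",
               "Cancelled", "Trip Completed"],
    "request_timestamp": pd.to_datetime([
        "2020-01-01 06:10", "2020-01-01 06:40",
        "2020-01-01 18:05", "2020-01-01 18:30",
    ]),
    "drop_timestamp": pd.to_datetime([
        "2020-01-01 07:00", None, None, "2020-01-01 19:15",
    ]),
})

# First rows, then the (rows, columns) shape of the table.
print(data_frame.head())
n_rows, n_cols = data_frame.shape
print(f"{n_rows} rows, {n_cols} columns")
```

On the real file, the same two calls would be the quickest way to confirm the row and column counts quoted above.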

How many people book a cab from the city and how many from the airport?

We can see that more people book a cab with the city as their pickup location, but the difference from airport pickups is small.
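A count like this can be sketched with `value_counts`. The `pickup_point` name and the five sample values are assumptions for illustration only.

```python
import pandas as pd

# Made-up pickup points; the real column has thousands of entries.
pickups = pd.Series(
    ["City", "Airport", "City", "City", "Airport"],
    name="pickup_point",
)

# Absolute counts and relative shares per pickup point.
pickup_counts = pickups.value_counts()
pickup_share = pickups.value_counts(normalize=True)

print(pickup_counts)
print(pickup_share.round(2))
```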

How many people got their cabs and how many got cancelled?

From the univariate and bivariate analysis, we can say that the share of requests that do not get a cab is higher for the Airport than for the City.
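One possible way to compare the two pickup points is to compute the share of requests that did not end in a trip. The `status` column, its labels, and the rows below are assumptions; the real data may use different names.

```python
import pandas as pd

# Made-up requests; status labels are assumed, not taken from the real file.
df = pd.DataFrame({
    "pickup_point": ["City", "City", "City",
                     "Airport", "Airport", "Airport"],
    "status": ["Trip Completed", "Trip Completed", "Cancelled",
               "Trip Completed", "No Cars Available", "No Cars Available"],
})

# A request counts as served only if the trip was completed.
df["got_cab"] = df["status"].eq("Trip Completed")

# Share of requests per pickup point that did NOT get a cab.
not_served_rate = 1 - df.groupby("pickup_point")["got_cab"].mean()
print(not_served_rate.round(2))
```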

What are the total requests that come in for every hour in the day?

From the above analysis, most cab requests come in between 5:00–10:00 and 17:00–23:00.

What will my cab cancellation rate be if I want to book the cab at 6:30?
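Counting requests per hour could look roughly like this. The timestamps are invented for the example; only the pattern of extracting the hour and counting is the point.

```python
import pandas as pd

# Made-up request times (assumption: timestamps are already parsed).
requests = pd.Series(
    pd.to_datetime([
        "2020-01-01 05:10", "2020-01-01 05:45",
        "2020-01-01 13:00", "2020-01-01 18:20",
    ]),
    name="request_timestamp",
)

# Count requests per hour of day, keeping empty hours as 0.
requests_per_hour = (
    requests.dt.hour
    .value_counts()
    .reindex(range(24), fill_value=0)
)
print(requests_per_hour)
```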

For this question, we will use machine learning algorithms to predict the cab cancellation rate. Before that, we need to clean the data, which takes most of the time, and calculate the cancellation percentage for each hour of the day.
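Here is a minimal sketch of that preparation step: collapsing individual requests into one row per hour with a percentage of requests that did not get a cab. Column names, status labels, and rows are all assumptions for illustration.

```python
import pandas as pd

# Made-up requests standing in for the cleaned data.
df = pd.DataFrame({
    "request_timestamp": pd.to_datetime([
        "2020-01-01 06:10", "2020-01-01 06:40",
        "2020-01-01 12:00",
        "2020-01-01 18:05", "2020-01-01 18:30",
    ]),
    "status": ["Trip Completed", "Cancelled",
               "Trip Completed",
               "No Cars Available", "No Cars Available"],
})

df["hour"] = df["request_timestamp"].dt.hour
# Anything other than a completed trip counts as "no cab" here.
df["not_served"] = df["status"].ne("Trip Completed")

# One row per hour (24 in total); hours with no requests stay NaN.
hourly = (
    df.groupby("hour")["not_served"]
    .mean()
    .mul(100)
    .reindex(pd.RangeIndex(24, name="hour"))
    .rename("cancel_pct")
    .reset_index()
)
print(hourly.dropna())
```

Treating "no cars available" the same as a cancellation is a choice made for this sketch; counting only explicit cancellations would need a different filter.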

I tried many algorithms, but because the data was reduced to 24 rows (one per hour, which is what we are interested in), the Decision Tree algorithm performed better than the others. Coming to our question, the predicted cab cancellation rate was 85%! For now, if we take a cutoff of 70%, we can assume that we may not get the cab, due to cancellations or some other reason, so we should not depend too much on this cab and should look for other ways to travel. Additional data is needed to choose the cutoff rate properly, as the current data was not sufficient for that.
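To make the idea concrete, here is a hedged sketch of fitting a decision tree on a 24-row hourly table with scikit-learn and comparing the prediction to the cutoff. The hourly percentages are invented (80% in the busy hours, 40% otherwise), so this does not reproduce the 85% result; mapping 6:30 to the hour bucket 6 is also an assumption.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# One row per hour of the day, as in the reduced 24-row table.
hours = np.arange(24).reshape(-1, 1)

# Made-up cancellation percentages: higher in the two busy windows.
h = hours.ravel()
is_peak = ((h >= 5) & (h <= 9)) | (h >= 17)
cancel_pct = np.where(is_peak, 80.0, 40.0)

# Fit a regression tree that maps hour -> cancellation percentage.
model = DecisionTreeRegressor(random_state=0)
model.fit(hours, cancel_pct)

# Assumption: a 6:30 booking falls into the hour-6 bucket.
booking_hour = 6
predicted_pct = float(model.predict([[booking_hour]])[0])

# Compare against the 70% cutoff discussed above.
CUTOFF_PCT = 70.0
likely_cancelled = predicted_pct >= CUTOFF_PCT
print(f"Predicted cancellation rate: {predicted_pct:.0f}%")
print("Look for another way to travel" if likely_cancelled else "Cab looks likely")
```

With only 24 training rows, a fully grown tree simply memorises each hour, so a score on the same rows says little about how well it would generalise.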

Conclusion:

We answered all the questions except part of the last one, which needs more data to settle the cutoff rate.

There’s a lot that can be done in the future, such as studying the cancellation rates further and the ratio of requests that actually get a cab.

This is my first data science blog, and I hope you enjoy it. Thank you for your attention! Any advice or suggestions for improving future blogs are welcome, as they will really help me improve and learn.

To view the code:

To connect with me via LinkedIn:

https://www.linkedin.com/in/kaushik-tummalapalli/

I would love to connect :) See you in the next blog. Until then, keep hustling and stay safe!

Kaushik Tummalapalli

Junior Machine Learning Engineer-Omdena | Microsoft Learn Student Ambassador-Beta | Exploring Data Science | Pythoneer | I’m curious about tech and love chess