Run 30 Machine Learning Models with a Few Lines of Code



When starting a new supervised Machine Learning project, one of the first steps is to analyze the data, understand what we are trying to accomplish, and decide which machine learning algorithms could help us achieve our goals. While the scikit-learn library makes our lives easier by making it possible to run models with a few lines of code, it can also be time-consuming when you need to test multiple models. However, what if we could run multiple vanilla models at once before diving into more complex approaches, and get a better idea of which models are worth our precious time?

That's what lazypredict tries (successfully) to accomplish. It runs 30 machine learning models in just a few seconds and gives us a grasp of how each model will perform on our dataset. To better understand how we can use lazypredict, I created a Titanic Survivor Prediction project so that you can code along. You can find the full notebook here. Basic experience with Python, Pandas, and scikit-learn will help you better understand what is going on.
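To get an intuition for what lazypredict automates, here is a rough sketch of the same idea done by hand with plain scikit-learn: fit several vanilla (default-parameter) models and compare their accuracy. This is an illustrative simplification, not lazypredict's actual implementation, and it uses a synthetic dataset in place of the Titanic data.

```python
# Sketch of the idea behind lazypredict: fit several default-parameter
# models on the same split and compare scores. Synthetic data stands in
# for the Titanic dataset here.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=300, n_features=6, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

models = {
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "DecisionTree": DecisionTreeClassifier(random_state=42),
    "RandomForest": RandomForestClassifier(random_state=42),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = accuracy_score(y_test, model.predict(X_test))

# Print models ranked by accuracy, best first
for name, acc in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {acc:.3f}")
```

lazypredict does essentially this loop across roughly 30 estimators and returns the results as a tidy, sortable table.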

Importing and cleaning data

First, let's import pyforest. PyForest imports the 40 most popular Python libraries with one line of code. I wrote an article about it, and you can find it here. I will turn some ugly warning messages off using the warnings module. I will also import some metrics libraries; we will need them later on.

import pyforest
import warnings
warnings.filterwarnings("ignore")
from sklearn import metrics
from sklearn.metrics import accuracy_score

Now, let's import the dataset we will be using from Kaggle. You can find the dataset at this link. Note that I didn't import Pandas; that's because it comes included with pyforest.

# importing .csv files using Pandas
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')

I will skip some Exploratory Data Analysis in this article because our primary focus is to start using lazypredict. However, in my initial EDA, which you can find in my GitHub, I noticed that we need to convert the Sex column into numeric values. We can easily do that with a lambda function.

train['Sex'] = train['Sex'].apply(lambda x: 1 if x == 'male' else 2)

We can also drop a few categorical columns that we will not use for this micro project. For homework, I recommend trying to play around with these features once you finish this article.

train.drop(columns=['Name', 'Ticket', 'Cabin', 'PassengerId', 'Parch', 'Embarked'], inplace=True)
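For readers new to Pandas, here is the same `drop` call on a tiny made-up DataFrame with the Titanic column names, showing that `inplace=True` mutates the frame directly rather than returning a new one.

```python
import pandas as pd

# Toy frame with the same columns the article drops (made-up rows)
df = pd.DataFrame({
    'Survived': [0, 1],
    'Sex': [1, 2],
    'Name': ['A', 'B'],
    'Ticket': ['T1', 'T2'],
    'Cabin': ['C1', 'C2'],
    'PassengerId': [1, 2],
    'Parch': [0, 1],
    'Embarked': ['S', 'C'],
})

# inplace=True modifies df directly and returns None
df.drop(columns=['Name', 'Ticket', 'Cabin', 'PassengerId', 'Parch', 'Embarked'],
        inplace=True)
print(list(df.columns))  # ['Survived', 'Sex']
```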

Train Test Split

Let's now split our train set into the variables X and y. I will assign all the features to X, except Survived, which is our target label.

X = train.drop(['Survived'], axis=1)
y = train.Survived

And now, let's split the variables into train and test sets. I will go with the default 0.25 for the test size. You can easily set other values using the test_size parameter.

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
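To confirm the default split, here is a minimal self-contained check on a toy list of 100 samples: with no `test_size` argument, scikit-learn holds out 25% of the data.

```python
from sklearn.model_selection import train_test_split

# 100 toy samples with a binary label
X = list(range(100))
y = [i % 2 for i in range(100)]

# Default test_size is 0.25; pass test_size=0.3, etc., to change it
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
print(len(X_train), len(X_test))  # 75 25
```

Fixing `random_state` makes the split reproducible, which matters when you want to compare many models on exactly the same train/test rows, as lazypredict does.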
Acknowledgement and thanks to: Towards Data Science | Ismael Araujo
April 18, 2021