Paul

Python Random Forest Model vs Coin Flip

Updated: Dec 10, 2023


I thought it would be interesting to code and train a random forest classifier model using Python and test how it performs over 5 predictions against random coin flips. This exercise is more about curiosity, because in reality you would use a much larger test dataset to evaluate a classifier model using metrics such as accuracy, precision, recall, f1-score etc. A random forest classification model is an ensemble learning method that combines multiple decision trees to make more accurate predictions. It constructs each tree by training on a random subset of the data and outputs the mode of the trees' predictions for classification tasks. In this post I will talk you through how I coded and trained the random forest classifier using Python and the results of the 5 predictions against the coin flips, as well as discussing the use of a larger test dataset and more rigorous metrics for evaluating the classifier model.
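To make the ensemble idea concrete, here is a minimal stand-alone sketch (not code from the model built in this post) of how a random forest settles on a class by taking the mode, i.e. a majority vote, of its trees' predictions:

# Minimal sketch: majority vote over hypothetical per-tree predictions
# (the votes below are invented purely for illustration)
from collections import Counter

tree_votes = [0, 1, 0, 0, 1]  # one vote per decision tree
majority_class = Counter(tree_votes).most_common(1)[0][0]
print(majority_class)  # 0, i.e. not attended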

The classifier model will predict, for a person's fitness class booking, whether the person will attend or not attend the class they booked. The data used to train and test the model was downloaded from Kaggle and is for a fitness chain located in Canada.


The random forest model will have input variables as follows:

day_of_week - the day of the week the class is to be taken, e.g. Mon, Tue, Wed, Thu, Fri, Sat, Sun
time - the time of day the class is to be taken, e.g. AM, PM
weight - the weight in kg of the person who made the class booking
months_as_member - the number of months the person has been a member of the fitness club
days_before - the number of days before the class that the booking was made

The model will predict the target variable attended, which has the value 0 for not attended and 1 for attended the fitness class. The diagram below shows the model design.
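To illustrate the model design, a single data point presented to the model might look like the following; the values are invented for illustration and are not taken from the Kaggle dataset:

# Hypothetical example of one booking's input features and target
# (all values made up for illustration)
example_booking = {
    "day_of_week": "Mon",
    "time": "AM",
    "weight": 79.5,            # kg
    "months_as_member": 12,
    "days_before": 8,
}
example_target = 1  # attended = 1, not attended = 0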

The dataset was randomly split into a training and test dataset, with 5 data points randomly selected from the test dataset. These 5 data points were used to compare the model's predictive performance against a coin flip. A coin toss of heads was taken to mean attended and tails not attended the class. The website https://flipsimu.com/ was used to simulate coin flips. As we progress through this post I will provide the running results of the random forest model vs the coin flip for the 5 predictions.

The first model vs coin flip prediction resulted in the model predicting not attended and the coin being tails (not attended), with the actual test data point being not attended, so the scores for predictions were model = 1 and coin = 1.

To implement the model in Python I first had to pip install the following libraries:

pip install numpy
pip install pandas
pip install matplotlib
pip install -U scikit-learn
pip install -U imbalanced-learn

In the Python code I entered the following imports:

import pandas as pd
import matplotlib.pyplot as plt
from collections import Counter
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import RandomOverSampler
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    classification_report,
)

To load the fitness class dataset into a pandas dataframe I used the pandas method read_csv().

# Load data
data_df = pd.read_csv("fitness_class_2212.csv")

The second model vs coin flip prediction resulted in the model predicting not attended and the coin being tails (not attended), with the actual test data point being not attended, so the scores for predictions were then model = 2 and coin = 2.

The next steps in the Python code were to analyse and prepare the data for the random forest model training and testing.


I checked the data types of the columns in the dataframe in case I needed to convert them to numeric values or one-hot encodings. I will discuss this later, but my dataset had categorical variables such as day_of_week and time, which were loaded as type object. The numeric variable days_before was also loaded as type object instead of as a numeric type.

# See the column data types
print(data_df.dtypes)

booking_id            int64
months_as_member      int64
weight              float64
days_before          object
day_of_week          object
time                 object
category             object
attended              int64

I wanted to see which columns had null values and how many data points had this issue. A data point is a single row of the pandas dataframe, holding the input variables and target variable for one person's class booking. Below you can see that 20 data points were missing a weight entry. There are different ways to handle null values, but I decided to exclude data points with null values, as described later in the post.

# Check counts of null values in data
print(data_df.isnull().sum())

booking_id          0
months_as_member    0
weight             20
days_before         0
day_of_week         0
time                0
category            0
attended            0

The target variable we are predicting is attended, so I wanted to see if there were greater numbers of attended or not attended data points. If the target variable you are predicting has roughly the same number of data points in each class, the dataset is described as balanced. In our case the class not attended had 1046 data points and the class attended had 454 data points, so we had an imbalanced dataset. When you have an imbalanced dataset it is sometimes possible to improve the classification model you are training by first converting your dataset into a balanced dataset before training the model. A model trained with an imbalanced dataset can be biased towards predicting the majority class and less sensitive to the minority class. There are different methods which can be used to achieve a balanced dataset; later in the post I discuss the application of oversampling to our dataset.

# Check target variable - imbalanced dataset?
# There are more not attended than attended data points
data_count_df = data_df.groupby(["attended"])["attended"].count()
print(data_count_df)

attended
0    1046
1     454
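As a quick sanity check on what this imbalance implies (a side calculation, not part of the original code), the accuracy a naive model would score by always predicting not attended can be estimated directly from the class counts above:

# Sketch: accuracy of always predicting the majority class (not attended)
majority_baseline = 1046 / (1046 + 454)
print(round(majority_baseline, 3))  # roughly 0.697, i.e. about 69.7 %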

Tidying up the input variables to the random forest model, I converted the days_before column to a numeric data type.

# Convert days_before feature to numeric
data_df["days_before"] = pd.to_numeric(data_df["days_before"], errors="coerce")

I also ensured the target variable attended had a numeric data type.

# Convert attended feature to numeric
data_df["attended"] = pd.to_numeric(data_df["attended"], errors="coerce")

We do not want any data points with null values, so I dropped rows with any null value columns.

# Drop any data rows with null values
data_df = data_df.dropna()

I extracted the target variable attended, which we are predicting, from the pandas dataframe to a variable y.

# Extract target to predict
y = data_df.pop("attended")

The third model vs coin flip prediction resulted in the model predicting not attended and the coin being heads (attended), with the actual test data point being attended, so the scores for predictions were then model = 2 and coin = 3.

The next step in the dataset preparation was to drop features not required in the random forest model. The booking_id and category columns were not used as input variables to the model.

# Drop features not required in model
data_df = data_df.drop(["category", "booking_id"], axis=1)

The input variable day_of_week is a categorical type, e.g. Mon, Tue etc., so I checked whether it had a unique value representing each day of the week. You can see below this was not the case, with values "Fri" and "Fri.", "Wed" and "Wednesday" etc.

# See if day_of_week column has unique values
# for each day of the week
print(data_df["day_of_week"].value_counts())

Fri          271
Thu          234
Mon          213
Sun          208
Sat          196
Tue          188
Wed           75
Wednesday     34
Fri.          26
Monday        10

I used the following code to ensure the day_of_week feature had only one value representing each day of the week, i.e. Mon, Tue, Wed, Thu, Fri, Sat and Sun.

# Clean up day_of_week to have a single value per day
data_df.loc[data_df["day_of_week"].isin(["Fri."]), "day_of_week"] = "Fri"
data_df.loc[data_df["day_of_week"].isin(["Wednesday"]), "day_of_week"] = "Wed"
data_df.loc[data_df["day_of_week"].isin(["Monday"]), "day_of_week"] = "Mon"

The input variable time is a categorical type, e.g. AM and PM. I checked whether it had a unique value representing each time, which it did, so no action was required.

# See if time column has unique values for AM and PM
print(data_df["time"].value_counts())

AM    1110
PM     345

I needed to get the list of categorical variables in my dataset so I could convert them to one-hot encodings. One-hot encodings are variables which have a value of either 0 or 1, and are one way to handle categorical variables in machine learning models. The features day_of_week and time will be converted to one-hot encodings later. This involves creating two new variables time_AM and time_PM, which can have the value either 0 or 1, and which replace the single variable time to capture whether the data point is AM or PM. Likewise we will replace the single variable day_of_week with new variables, one for each of its classes.
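To make the encoding concrete, here is a small stand-alone sketch on toy data (not the fitness dataset) showing how pandas turns a time column into time_AM and time_PM columns:

# Toy example of one-hot encoding with pandas (illustrative data only)
import pandas as pd

toy_df = pd.DataFrame({"time": ["AM", "PM", "AM"]})
encoded = pd.get_dummies(toy_df, columns=["time"])
print(encoded)
# Produces columns time_AM and time_PM
# (values appear as 0/1 or True/False depending on the pandas version)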

# Get list of categorical features to convert
# to dummy variables (one-hot encoding)
features_to_encode = list(data_df.select_dtypes(include=["object"]).columns)
print(features_to_encode)

['day_of_week', 'time']

The following code created a variable x which holds all the model input features, with the categorical variables converted to one-hot encodings and the numeric variables left unchanged.

# Format categorical features to one-hot encoding,
# leave numeric features unchanged
x = pd.get_dummies(data_df, columns=features_to_encode)

The next step was to split the dataset model inputs held in x and the target variable y into a training and test dataset. One third of the data was kept as test data. The training data was stored in x_train and y_train and the test data in x_test and y_test.

# Create train and test data
# Seed means the result is reproducible
seed = 10
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.333, random_state=seed
)

The fourth model vs coin flip prediction resulted in the model predicting not attended and the coin being heads (attended), with the actual test data point being not attended, so the scores for predictions were then model = 3 and coin = 3.

As mentioned previously the dataset is imbalanced, so I used oversampling to create a balanced training dataset. Oversampling is the process of randomly selecting data points from the minority target class and adding them to the training data until the training data has equal numbers of each of the target classes, the target classes being attended and not attended for the fitness class. The balanced training dataset had 677 data points of each class.
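Conceptually, the oversampling step amounts to something like the following sketch using plain pandas; the post itself uses imbalanced-learn's RandomOverSampler, shown next:

# Conceptual sketch of random oversampling (the random_state is arbitrary)
train_df = x_train.copy()
train_df["attended"] = y_train

majority = train_df[train_df["attended"] == 0]
minority = train_df[train_df["attended"] == 1]

# sample the minority class with replacement until it matches the majority
minority_upsampled = minority.sample(n=len(majority), replace=True, random_state=12)
balanced_df = pd.concat([majority, minority_upsampled])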

# Dataset is imbalanced, so to create balanced data
# I decided to randomly over sample the minority class
ros = RandomOverSampler(random_state=12)
x_train_ros, y_train_ros = ros.fit_resample(x_train, y_train)
print(sorted(Counter(y_train_ros).items()))

[(0, 677), (1, 677)]

I was then ready to create and train the random forest classifier model using the following code. A random forest model is composed of multiple decision trees; the code below sets the number of decision trees to 250.

# Create and train RF model
# 250 decision trees were trained in the model
rf_classifier = RandomForestClassifier(
    min_samples_leaf=50,
    n_estimators=250,
    bootstrap=True,
    oob_score=True,
    n_jobs=-1,
    random_state=seed,
)
rf_classifier.fit(x_train_ros, y_train_ros)
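Because oob_score=True was passed to RandomForestClassifier, scikit-learn also records an out-of-bag estimate of the model's accuracy during training. As a side note (not shown in the original post), it can be inspected after fitting:

# Out-of-bag accuracy estimate recorded during fitting
# (only available because oob_score=True was set)
print(rf_classifier.oob_score_)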


Using the trained model I made predictions on the test input data x_test and assigned the results to y_pred.

# Predict the test data target attended
y_pred = rf_classifier.predict(x_test)

Given the test data predictions y_pred and the known test data target variable attended values y_test, I calculated the model's accuracy using the method accuracy_score(). The accuracy showed 76.3% of the model's predictions were correct for the test data.

# Output the accuracy using predictions
# and known test data attended values
print(
    f"Accuracy= "
    f"{round(accuracy_score(y_test, y_pred) * 100, 1)} %"
)

Accuracy= 76.3 %

As the dataset was imbalanced, a naive model which always predicted the majority class could potentially show a high accuracy score. So I generated additional metrics including precision, recall and f1-score for the model's predictions against the test data, so I could evaluate the model's performance more rigorously. I used the method classification_report() to create the metrics.

# Output additional metrics
print(classification_report(y_test, y_pred))



The above screen print shows the model's metrics covered 485 test data points, with 339 fitness classes not attended and 146 attended. The target class value 0 relates to not attended and the class value 1 to attended. The 76% accuracy previously described is displayed.

Precision captures the percentage of the model's predictions for a particular target class which are correct. Above, 84% of the model's predictions for not attended were correct, and 60% of the predictions for attended were correct, so the model had greater precision when predicting not attended.

Recall indicates the percentage of a target class correctly predicted by the model, that is, the ability of the model to recall the class. The metrics show 81% of not attended class events were correctly predicted and 65% of attended class events. The model was therefore more sensitive to not attended data points.

The f1-score combines precision and recall using their harmonic mean, so maximizing the f1-score implies simultaneously maximizing both precision and recall. The f1-score for not attended is 0.83 and for attended is 0.62.

Ideally models will have high precision and high recall for all target classes. However, greater precision tends to result in reduced recall, and greater recall in less precision. Whether a model is considered to have good predictive performance depends on the model's use case. For example, for fraud detection recall may be more important, to ensure crimes are not missed. However, for trading strategies precision may be considered essential if large financial losses can result from bad decisions to enter into trades.
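As a rough illustration of where these numbers come from (a sketch using the confusion_matrix import from earlier, not output shown in the original post), the precision, recall and f1-score for the attended class can be recomputed by hand from the confusion matrix:

# Sketch: derive precision, recall and f1 for class 1 (attended)
# confusion matrix layout for binary labels 0/1 is [[TN, FP], [FN, TP]]
cm = confusion_matrix(y_test, y_pred)
tn, fp, fn, tp = cm.ravel()

precision_attended = tp / (tp + fp)
recall_attended = tp / (tp + fn)
f1_attended = (
    2 * precision_attended * recall_attended
    / (precision_attended + recall_attended)
)
print(precision_attended, recall_attended, f1_attended)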

To determine which input variables were most important in the trained random forest model, I created a plot using rf_classifier.feature_importances_.

# Plot the model's most important features
# Larger importance values mean more important
feat_importances = pd.Series(
    rf_classifier.feature_importances_,
    index=x_train.columns,
)
feat_importances.nlargest(15).plot(kind="barh")
plt.title("Top 15 important features")



The plot shows that months_as_member, weight and days_before were the most important features within the model.


The fifth and final model vs coin flip prediction resulted in the model predicting not attended and the coin being heads (attended), with the actual test data point being not attended, so the final scores were model = 4 and coin = 3. The random forest model beat the coin flip over the 5 predictions, but this should not be considered significant over such a small test sample. To rigorously evaluate and compare models you need to use a large test dataset and the appropriate metrics for the model's use case.
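For anyone wanting to repeat the experiment without a coin-flip website, here is a hedged sketch of scoring the model against simulated coin flips on 5 randomly chosen test data points; it reuses the variables defined earlier in the post and random.choice in place of https://flipsimu.com/, so it is an approximation of the manual procedure rather than the exact steps I followed:

import random

# Sketch: compare the trained model against simulated coin flips
# on 5 randomly selected test data points
sample_idx = x_test.sample(n=5, random_state=seed).index
model_score, coin_score = 0, 0

for idx in sample_idx:
    actual = y_test.loc[idx]
    model_pred = rf_classifier.predict(x_test.loc[[idx]])[0]
    coin_pred = random.choice([0, 1])  # heads = attended (1), tails = not attended (0)
    model_score += int(model_pred == actual)
    coin_score += int(coin_pred == actual)

print(f"Model {model_score} vs coin {coin_score}")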


