I thought it would be interesting to code and train a random forest classifier model using Python and test how it performs over 5 predictions against random coin flips used to predict the same outcomes. This exercise is more about curiosity; in reality you would evaluate a classifier model on a much larger test dataset using metrics such as accuracy, precision, recall, F1-score etc.
A random forest classification model is an ensemble learning method that combines multiple decision trees to make more accurate predictions. It constructs each tree by training on a random subset of the data and outputs the mode of the predictions for classification tasks.
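As a toy illustration of taking the mode of the individual trees' predictions, consider five invented tree votes (these are not from the model in this post):

from collections import Counter

# Hypothetical votes from five decision trees
# (1 = attended, 0 = not attended)
tree_votes = [1, 0, 1, 1, 0]

# The forest's classification is the most common vote (the mode)
print(Counter(tree_votes).most_common(1)[0][0])  # 1

Because three of the five trees voted 1, the forest predicts 1 (attended).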
In this post I will talk you through how I coded and trained the random forest classifier using Python, present the results of the 5 predictions against the coin flips, and discuss the use of a larger test dataset and more formal, rigorous metrics for evaluating the classifier model.
The classifier model will predict, for a person's fitness class booking, whether the person will attend or not attend the class they booked. The data used to train and test the model was downloaded from Kaggle and covers a fitness chain located in Canada.
The random forest model will have input variables as follows:
day_of_week - the day of the week the class is to be taken, e.g. Mon, Tue, Wed, Thu, Fri, Sat, Sun
time - the time of day the class is to be taken, e.g. AM, PM
weight - the weight in kg of the person who made the class booking
months_as_member - the number of months the person has been a member of the fitness club
days_before - the number of days before the class the booking was made
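For example, the input variables for a single booking might look like this (the values here are purely illustrative, not taken from the dataset):

# Hypothetical input values for one booking
# (invented for illustration only)
example_booking = {
    "day_of_week": "Mon",
    "time": "AM",
    "weight": 79.5,
    "months_as_member": 12,
    "days_before": 8,
}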
The model will predict the target variable attended which has the value 0 for not attended and 1 for attended the fitness class. The diagram below shows the model design.
The dataset was randomly split into a training and a test dataset, with 5 data points randomly selected from the test dataset. These 5 data points were used to compare the model's predictive performance against a coin flip.
A coin toss of heads was taken to mean attended and tails to mean not attended the class. The website https://flipsimu.com/ was used to simulate the coin flips. As we progress through this post I will provide the results of the random forest model vs coin flip for the 5 predictions.
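As a rough sketch of how the split and the 5-point sample could be produced (data_df holding the input variables and y holding the target are built in the code later in this post, and the actual split settings used may differ):

from sklearn.model_selection import train_test_split

# Split into training and test sets
# (test_size and random_state are assumed values)
X_train, X_test, y_train, y_test = train_test_split(
    data_df, y, test_size=0.2, random_state=42
)

# Randomly pick 5 test data points to compare
# against the coin flips
test_sample = X_test.sample(n=5, random_state=1)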
The first model vs coin flip prediction resulted in the model predicting not attended and the coin being tails (not attended), with the actual test data point being not attended, so the scores for predictions were model = 1 and coin = 1.
To implement the model in Python I had to first pip install the following libraries:
pip install numpy
pip install pandas
pip install matplotlib
pip install -U scikit-learn
pip install -U imbalanced-learn
In the Python code I entered the following imports:

import pandas as pd
import matplotlib.pyplot as plt
from collections import Counter
from sklearn.model_selection import (
    train_test_split,
)
from imblearn.over_sampling import (
    RandomOverSampler,
)
from sklearn.ensemble import (
    RandomForestClassifier,
)
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    classification_report,
)
To load the fitness class dataset into a pandas dataframe I used the pandas method read_csv().
# Load data
data_df = pd.read_csv(
    "fitness_class_2212.csv"
)
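If you want to sanity-check the load, you can print the shape and the first few rows (this check is optional and not part of the model pipeline):

# Quick look at the loaded data
print(data_df.shape)
print(data_df.head())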
The second model vs coin flip prediction resulted in the model predicting not attended and the coin being tails (not attended), with the actual test data point being not attended, so the scores for predictions were then model = 2 and coin = 2.
The next steps in the Python code were to analyse and prepare the data for the random forest model training and testing.
I checked the data types of the columns in the dataframe in case I needed to convert them to numeric values and one-hot encodings. I will discuss this later but my dataset had categorical variables such as the day_of_week, time etc which were loaded as type object. The numeric variable days_before was also loaded as type object instead of as a numeric type.
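One common way to one-hot encode categorical columns is the pandas get_dummies() method. As a rough sketch (the variable name data_df_encoded is just illustrative and the encoding actually applied later in the post may differ):

# Sketch: one-hot encode the categorical columns
data_df_encoded = pd.get_dummies(
    data_df, columns=["day_of_week", "time"]
)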
# See the column data types
print(data_df.dtypes)

booking_id            int64
months_as_member      int64
weight              float64
days_before          object
day_of_week          object
time                 object
category             object
attended              int64

I wanted to see which columns had null values and how many data points had this issue. A data point is a single row in the pandas dataframe, holding the input variables and target variable for a person's class booking. Below you can see 20 data points had a weight without an entry. There are different ways to handle null values, but I decided to exclude data points with null values, as described later in the post (a sketch of one alternative follows the output below).

# Check counts of null
# values in data
print(data_df.isnull().sum())

booking_id          0
months_as_member    0
weight             20
days_before         0
day_of_week         0
time                0
category            0
attended            0
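For completeness, a minimal sketch of one alternative would be to impute the missing weight values with the column median rather than dropping the rows (this is not the approach used in this post):

# Alternative (not used here): fill missing
# weight values with the column median
data_df["weight"] = data_df["weight"].fillna(
    data_df["weight"].median()
)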
The target variable we are predicting is attended, so I wanted to see whether there were more attended or not attended data points. If the classes of the target variable have roughly equal counts, the dataset is described as balanced. In our case the class not attended had 1046 data points and the class attended had 454 data points, so we had an imbalanced dataset. When you have an imbalanced dataset it is sometimes possible to improve the classification model you are training by first converting your dataset into a balanced dataset before training the model. A model trained on an imbalanced dataset can be biased towards predicting the majority class and less sensitive to the minority class. There are different methods that can be used to achieve a balanced dataset; later in the post I discuss the application of oversampling to our dataset (a brief sketch follows the class counts below).

# Check target variable -
# imbalanced dataset?
# There are more not
# attended than attended
# data points
data_count_df = data_df.groupby(
    ["attended"]
)["attended"].count()
print(data_count_df)

attended
0    1046
1     454
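As a preview of that oversampling step, a minimal sketch using RandomOverSampler from imbalanced-learn might look like the following, assuming X_train and y_train are the training features and labels produced by the later train/test split:

# Sketch only: oversample the minority class so
# the attended / not attended counts are equal
# (X_train and y_train come from the later split)
oversampler = RandomOverSampler(random_state=42)
X_balanced, y_balanced = oversampler.fit_resample(X_train, y_train)
print(Counter(y_balanced))

Random oversampling simply duplicates minority-class rows until the class counts match, which is why it is applied only to the training data.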
Tidying up the input variables to the random forest model, I converted the days_before column to a numeric data type.

# Convert days_before
# feature to numeric
data_df["days_before"] = pd.to_numeric(
    data_df["days_before"],
    errors="coerce",
)

I also ensured the target variable attended had a numeric data type.

# Convert attended
# feature to numeric
data_df["attended"] = pd.to_numeric(
    data_df["attended"], errors="coerce"
)

We do not want any data points with null values, so I dropped rows containing any null values.

# Drop any data rows
# with null values
data_df = data_df.dropna()

I extracted the target variable attended, which we are predicting, from the pandas dataframe into a variable y.

# Extract target to predict
y = data_df.pop("attended")

The third model vs coin flip prediction resulted in the model predicting not attended and the coin being heads (attended), with the actual test data point being attended, so the scores for predictions were then model = 2 and coin = 3.

The next step in the dataset preparation was to drop features not required in the random forest model. The booking_id and category columns were not used as input variables to the model.
# Drop features not
# required in model
data_df = data_df.drop(
    ["category", "booking_id"], axis=1
)

The input variable day_of_week is a categorical type, e.g. Mon, Tue etc, so I checked whether it had unique values representing each day of the week. You can see below this was not the case, with values "Fri" and "Fri.", "Wed" and "Wednesday" etc.

# See if day_of_week column
# has unique values for each
# day of the week
print(
    data_df["day_of_week"].value_counts()
)

Fri 271
Thu 234
Mon 213
Sun 208
Sat 196
Tue 188
Wed 75
Wednesday 34
Fri. 26
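A minimal sketch of how these inconsistent labels could be standardised, assuming "Fri." and "Wednesday" are the only variants needing mapping (the cleanup actually applied later in the post may differ):

# Map the inconsistent day labels seen above
# to a single value per day
data_df["day_of_week"] = data_df["day_of_week"].replace(
    {"Fri.": "Fri", "Wednesday": "Wed"}
)
print(data_df["day_of_week"].value_counts())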