Lindley Coetzee

Heart attack prediction model

Introduction

In this post we will cover a simple heart attack prediction model. Many factors can contribute to a heart attack, and we will build a model on a dataset of such factors. You can check out my Kaggle notebook here. We will work on the Heart Attack Analysis & Prediction Dataset.

Exploring the data

Let's have a look at the data.

You may be familiar with some of the headings. All the abbreviations are clearly explained on the dataset's page. Since we are going to build a simple model and our data is relatively clean, we will not be doing anything fancy like feature engineering or hyperparameter tuning.
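To make the first look concrete, here is a minimal sketch using a tiny hand-typed frame with the dataset's column names (the values mirror the first few rows; in the notebook the frame comes from pd.read_csv on the Kaggle CSV, whose exact path is an assumption of your setup):

```python
import pandas as pd

# Tiny stand-in frame with the dataset's column names. Values are copied
# here for illustration; in the notebook you'd use pd.read_csv("heart.csv").
df = pd.DataFrame({
    "age": [63, 37, 41], "sex": [1, 1, 0], "cp": [3, 2, 1],
    "trtbps": [145, 130, 130], "chol": [233, 250, 204], "fbs": [1, 0, 0],
    "restecg": [0, 1, 0], "thalachh": [150, 187, 172], "exng": [0, 0, 0],
    "oldpeak": [2.3, 3.5, 1.4], "slp": [0, 0, 2], "caa": [0, 0, 0],
    "thall": [1, 2, 2], "output": [1, 1, 1],
})

print(df.head())                     # first rows of the data
print(df["output"].value_counts())  # class balance of the label column
```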

Set features and labels

In machine learning, labels are the "targets" or "results". In this case our label is the "output" column, which contains 1s and 0s: a "1" represents a person who suffered a heart attack and a "0" a person who did not. The features are all the inputs you want your model to see and learn from. For a deeper dive on the subject, check out What Are Features And Labels In Machine Learning? | Codeing School.

So our labels will be the "output" column only, and our features will be all the remaining columns. By convention, features are represented as X and labels as y.
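A minimal sketch of carving the dataframe into X and y (shown on a tiny illustrative frame; in the notebook df holds the full dataset):

```python
import pandas as pd

# Small illustrative frame; the real df comes from the Kaggle CSV.
df = pd.DataFrame({"age": [63, 37, 41],
                   "chol": [233, 250, 204],
                   "output": [1, 0, 1]})

X = df.drop("output", axis=1)  # features: every column except the label
y = df["output"]               # labels: 1 = heart attack, 0 = none

print(X.shape, y.shape)
```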

Splitting the data

Now we need to split the data into training data and testing data. The training data is the data our machine learning model will train on and learn from; the test data is held back so we can check the model's predictions against examples it has never seen. In this instance we will use a 67%/33% split of training and testing data respectively. We also need to shuffle the data before splitting. Scikit-learn has a function called train_test_split() that does the shuffling and splitting for you, returning four datasets: X_train, X_test, y_train and y_test.
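Here is a sketch of the call on stand-in arrays (random_state is an assumption I've added so the shuffle is reproducible):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in features and labels just to demonstrate the split.
X = np.arange(100).reshape(50, 2)
y = np.arange(50) % 2

# test_size=0.33 gives the 67%/33% split; shuffling is on by default.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42)

print(len(X_train), len(X_test))  # 33 training rows, 17 test rows
```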

Assigning a classifier

Next, we need to choose our machine learning model. I narrowed it down to the CatBoostClassifier and the DecisionTreeClassifier; the CatBoostClassifier gave me the best results overall, so I went with that one. Now we can fit our model. Fitting a model means finding the pattern that best maps the features to the labels, and it does so using the X_train and y_train data.
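The fit step looks like this. The sketch below uses the runner-up DecisionTreeClassifier on synthetic data so it runs without extra installs; catboost.CatBoostClassifier exposes the same fit()/predict() interface, so swapping it in is a one-line change:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the heart data: 13 feature columns, binary label.
X, y = make_classification(n_samples=300, n_features=13, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=0)

clf = DecisionTreeClassifier(random_state=0)
clf.fit(X_train, y_train)  # learn the pattern mapping features to labels
```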

Predictions

Once we have fitted our model we can make predictions on the X_test data and compare them with the y_test data. The predictions for the first 10 rows were as follows.

Prediction: 0 Actual: 1
Prediction: 0 Actual: 0
Prediction: 1 Actual: 1
Prediction: 1 Actual: 1
Prediction: 1 Actual: 1
Prediction: 1 Actual: 1
Prediction: 0 Actual: 0
Prediction: 0 Actual: 0
Prediction: 1 Actual: 1
Prediction: 1 Actual: 1

Looking good so far: our model predicted correctly in 9 out of 10 cases.
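A comparison like the one above can be printed with a short loop over the predictions (sketched on synthetic data and a DecisionTreeClassifier, since the notebook's own variables aren't reproduced here):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the notebook's split and fitted model.
X, y = make_classification(n_samples=300, n_features=13, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=0)
clf = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Compare the first 10 predictions against the held-out labels.
preds = clf.predict(X_test)
for p, a in zip(preds[:10], y_test[:10]):
    print(f"Prediction: {p} Actual: {a}")
```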

Confusion matrix

"A confusion matrix, also known as an error matrix, is a specific table layout that allows visualization of the performance of an algorithm." - https://en.wikipedia.org/wiki/Confusion_matrix

Below are the confusion matrix results.

True positives = 44 people that the model correctly predicted would suffer a heart attack
True negatives = 63 people that the model correctly predicted would NOT suffer a heart attack
False positives = 8 people that the model incorrectly predicted would suffer a heart attack
False negatives = 7 people that the model incorrectly predicted would NOT suffer a heart attack
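scikit-learn's confusion_matrix gives you these four counts directly; here is a small sketch on hand-made label vectors:

```python
from sklearn.metrics import confusion_matrix

# Toy true labels and predictions, just to show how the counts unpack.
y_true = [1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 0, 1, 1]

# For binary labels, the 2x2 matrix flattens to (tn, fp, fn, tp).
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tn, fp, fn, tp)  # 2 1 1 2
```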

Measuring the performance

We will measure the performance of our model in five different ways:

Sensitivity or the True Positive Rate
Specificity or the True Negative Rate
Precision or the Positive Prediction Value
Accuracy Rate
F1 Score (the harmonic mean of precision and sensitivity)
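All five follow directly from the four confusion-matrix counts reported above:

```python
# Counts from the confusion matrix above.
tp, tn, fp, fn = 44, 63, 8, 7

sensitivity = tp / (tp + fn)                   # true positive rate
specificity = tn / (tn + fp)                   # true negative rate
precision   = tp / (tp + fp)                   # positive predictive value
accuracy    = (tp + tn) / (tp + tn + fp + fn)
f1          = 2 * precision * sensitivity / (precision + sensitivity)

for name, value in [("Sensitivity", sensitivity),
                    ("Specificity", specificity),
                    ("Precision", precision),
                    ("Accuracy", accuracy),
                    ("F1 score", f1)]:
    print(f"{name}: {value:.1%}")
```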

All five performance measures score in the mid-to-high 80s, and I'm happy with these results. You can copy the Kaggle notebook and play with the code to see if you can optimize it further. That's all for now.