Titanic Data Analysis

Statistical Machine Learning: Data Analysis

Author

Aayush Shrestha

Published

March 21, 2025

Introduction

The Titanic dataset from Kaggle (https://www.kaggle.com/datasets/yasserh/titanic-dataset) is a fascinating window into one of history’s most famous maritime disasters. It is a widely used dataset, most often for classification problems, and contains detailed information about the passengers aboard the RMS Titanic. The goal of this project is to build and compare models with survival status as the target variable. I will apply techniques such as logistic regression and k-nearest neighbors (KNN) while incorporating essential machine learning practices like cross-validation and model evaluation. I will also conduct exploratory data analysis (EDA) to identify trends and patterns within the dataset, ensuring that the models are built on meaningful insights.

Beyond predictive modeling, this project reflects key responsibilities in a professional data science role. The workflow includes collecting, cleaning, and analyzing data, ensuring data accuracy and integrity, and developing visual reports and dashboards to track model performance. The ability to identify trends, anomalies, and key relationships within the data is crucial for making informed decisions. This analysis relies on industry-standard tools in R, using tidyverse for data manipulation and visualization and tidymodels for machine learning workflows. A strong grasp of statistical inference, regression and classification models, preprocessing, feature engineering, and model evaluation is necessary for building interpretable models. By following a structured and methodical approach, I not only enhance my technical expertise but also reinforce best practices in data science and machine learning.

Dataset Variables

  • PassengerId: A unique identifier assigned to each passenger. This is a sequential numerical value without predictive significance.
  • Survived: The target variable we aim to predict. It’s a binary indicator where 0 represents a passenger who perished and 1 represents a survivor.
  • Pclass: Passenger class, serving as a socio-economic status proxy. It has three possible values:

    • 1 = First class (Upper)

    • 2 = Second class (Middle)

    • 3 = Third class (Lower)

  • Name: The passenger’s full name, which might contain titles that indicate social status, age, or marital status (e.g., Mr., Mrs., Master, Miss).
  • Sex: Gender of the passenger, recorded as “male” or “female”.
  • Age: The passenger’s age in years. This variable contains 177 missing values in our dataset.
  • SibSp: Number of siblings or spouses aboard the Titanic. This family relation variable counts:

    • Siblings: brothers, sisters

    • Spouses: husbands, wives (mistresses and fiancés were not counted)

  • Parch: Number of parents or children aboard the Titanic. This family relation variable counts:

    • Parents: mother, father

    • Children: daughter, son, including stepchildren

  • Ticket: Ticket number, which might contain alphabetic prefixes or be purely numerical.
  • Fare: The amount paid for the passenger’s ticket, measured in British pounds. Values range from free passage to over £500.
  • Cabin: The cabin number where the passenger stayed. This variable has a high proportion of missing values, as cabin information was not recorded for many passengers.
  • Embarked: Port of embarkation, indicating where the passenger boarded the Titanic:

    • C = Cherbourg, France

    • Q = Queenstown, Ireland (now Cobh)

    • S = Southampton, England

1. Loading Libraries and Data

Code
# Load Required Libraries
library(tidyverse)
library(tidymodels)
library(kknn)
library(ggrepel)
library(yardstick)
library(ggplot2)
library(ggthemes)

# Set Seed for Reproducibility
set.seed(401)  

# "The Titanic is associated with the number 401 because that was the yard number #assigned to the ship by its builders, Harland and Wolff". 
#Just thought this was a fun seed number!

# Load Titanic Dataset
titanic <- read_csv("titanic.csv")

# View First Few Rows
glimpse(titanic)
Rows: 891
Columns: 12
$ PassengerId <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17,…
$ Survived    <dbl> 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1…
$ Pclass      <dbl> 3, 1, 3, 1, 3, 3, 1, 3, 3, 2, 3, 1, 3, 3, 3, 2, 3, 2, 3, 3…
$ Name        <chr> "Braund, Mr. Owen Harris", "Cumings, Mrs. John Bradley (Fl…
$ Sex         <chr> "male", "female", "female", "female", "male", "male", "mal…
$ Age         <dbl> 22, 38, 26, 35, 35, NA, 54, 2, 27, 14, 4, 58, 20, 39, 14, …
$ SibSp       <dbl> 1, 1, 0, 1, 0, 0, 0, 3, 0, 1, 1, 0, 0, 1, 0, 0, 4, 0, 1, 0…
$ Parch       <dbl> 0, 0, 0, 0, 0, 0, 0, 1, 2, 0, 1, 0, 0, 5, 0, 0, 1, 0, 0, 0…
$ Ticket      <chr> "A/5 21171", "PC 17599", "STON/O2. 3101282", "113803", "37…
$ Fare        <dbl> 7.2500, 71.2833, 7.9250, 53.1000, 8.0500, 8.4583, 51.8625,…
$ Cabin       <chr> NA, "C85", NA, "C123", NA, NA, "E46", NA, NA, NA, "G6", "C…
$ Embarked    <chr> "S", "C", "S", "S", "S", "Q", "S", "S", "S", "C", "S", "S"…

We’ve loaded essential packages like tidyverse for data manipulation, tidymodels for building predictive workflows, and visualization tools like ggplot2. I’ve set our random seed to 401 which was the actual yard number assigned to the Titanic by its builders, adding just a touch of historical significance while ensuring reproducible results.

Our initial glimpse at the dataset reveals 891 passenger records with various features. The target variable “Survived” (0 = perished, 1 = survived) is what we’re trying to predict. We have both numerical variables like Age (ranging from 0.42 to 80 years) and Fare (from free passage to a whopping £512.33), as well as categorical variables like passenger class and embarkation point. There are some missing values, particularly in Age (177 missing entries), which we’ll need to address during preprocessing.
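
To confirm these counts, we can tally the missing values in each column directly from the loaded data (a quick sketch using the tidyverse packages loaded above):

Code
# Count missing values per column (sketch)
titanic |>
  summarise(across(everything(), ~ sum(is.na(.x)))) |>
  pivot_longer(everything(), names_to = "variable", values_to = "n_missing") |>
  arrange(desc(n_missing))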

The next step is to clean the data, which we will then use for the rest of the project.

2. Cleaning Data

Code
titanic <- titanic |>
  mutate(
    Survived = factor(Survived),
    Pclass = factor(Pclass),
    Sex = factor(Sex),
    Embarked = factor(Embarked)
  ) |>
  select(-PassengerId, -Name, -Ticket, -Cabin)  # Drop unnecessary columns

In this step, I’ve converted categorical variables (Survived, Pclass, Sex, Embarked) into factors, making them suitable for our classification models. I’ve also removed less useful columns like PassengerId, Name, Ticket, and Cabin to focus on predictive features.

I’ve left missing values intact at this stage rather than imputing them immediately. This is to prevent data leakage. We will handle missing values within our modeling workflow using various imputation strategies to see which performs best.
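
As a quick sanity check (a small sketch, not part of the modeling pipeline), we can confirm that the type conversions took effect and see which columns still contain missing values:

Code
# Confirm column types and remaining NAs after cleaning (sketch)
sapply(titanic, class)      # Survived, Pclass, Sex, Embarked should now be factors
colSums(is.na(titanic))     # missing values remain, mainly in Age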

3. Exploratory Data Analysis (EDA)

Code
# Define colors for consistency
numerical_color <- "#00876c"  # Greenish-blue
categorical_color <- "#e63946"  # Red
survival_colors <- c("#1f78b4", "#33a02c")  # Blue for non-survivors, Green for survivors

# Plot Distribution of Numerical Features
numerical_features <- c("Age", "SibSp", "Parch", "Fare")
for (feature in numerical_features) {
  print(
    ggplot(titanic, aes(x = .data[[feature]])) +
      geom_histogram(binwidth = 10, fill = numerical_color, color = "white", alpha = 0.85) +
      ggtitle(paste("Distribution of", feature)) +
      theme_economist() +
      theme(plot.title = element_text(hjust = 0.5, face = "bold")) +
      labs(x = feature, y = "Count")
  )
}

Code
# Plot Distribution of Categorical Features
categorical_features <- c("Survived", "Pclass", "Sex", "Embarked")
for (feature in categorical_features) {
  print(
    ggplot(titanic, aes(x = .data[[feature]])) +
      geom_bar(fill = categorical_color, color = "black", alpha = 0.85) +
      ggtitle(paste("Distribution of", feature)) +
      theme_economist() +
      theme(plot.title = element_text(hjust = 0.5, face = "bold")) +
      labs(x = feature, y = "Count")
  )
}

Code
# Plot Survival Rate by Different Features
features_to_plot <- c("Pclass", "Sex", "Embarked")
for (feature in features_to_plot) {
  print(
    ggplot(titanic, aes(x = .data[[feature]], fill = as.factor(Survived))) +
      geom_bar(position = "fill", alpha = 0.85) +
      scale_fill_manual(values = survival_colors) +
      ggtitle(paste("Survival Rate by", feature)) +
      theme_economist() +
      theme(plot.title = element_text(hjust = 0.5, face = "bold")) +
      labs(x = feature, y = "Proportion", fill = "Survived")
  )
}

Our EDA reveals some interesting but unsurprising realities about the Titanic disaster. The survival rate shows that significantly more passengers perished than survived.

Diving deeper (no pun intended) into demographics, we observe that passengers between 20 and 40 years old had the highest survival rates relative to other age groups. The most striking pattern emerges when examining passenger class: third-class passengers had dramatically lower survival rates than first-class passengers, highlighting how socioeconomic status influenced survival outcomes.

These visualizations provide crucial insights into potential feature importance for our predictive models and highlight issues, such as missing values and skewed distributions, that we’ll need to address during preprocessing.
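
To complement the plots with exact figures, the survival rates behind these patterns can be tabulated directly (a sketch; it relies on the factor conversions made in Step 2):

Code
# Survival rates by passenger class and by sex (sketch)
titanic |>
  group_by(Pclass) |>
  summarise(n = n(), survival_rate = mean(Survived == "1"))

titanic |>
  group_by(Sex) |>
  summarise(n = n(), survival_rate = mean(Survived == "1"))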

4. Train-Test Split & Cross-Validation

Code
titanic_split <- initial_split(titanic, prop = 0.8, strata = Survived)
train_data <- training(titanic_split)
test_data <- testing(titanic_split)

# 10-Fold Cross-Validation with 10 repeats
titanic_folds <- vfold_cv(train_data, v = 10, strata = Survived, repeats = 10)

Here, we’ve partitioned our data into training (80%) and testing (20%) sets. I’ve used stratification to maintain the same proportion of survivors in both sets, preventing any artificial bias in our evaluation.

I’ve also implemented 10-fold cross-validation with 10 repeats. This rigorous approach means each model will be tested across 100 different training/validation combinations, giving us extremely reliable performance estimates and reducing the variance in our assessments.
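
A couple of quick checks (a sketch) confirm the size of the split and the number of resamples produced:

Code
# Sanity checks on the split and resamples (sketch)
dim(train_data)       # roughly 80% of the 891 rows
dim(test_data)        # roughly 20% held out for final evaluation
nrow(titanic_folds)   # 10 folds x 10 repeats = 100 resamples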

5. Defining our models

Code
# Logistic Regression
log_reg_model <- logistic_reg() |> 
  set_engine("glm") |> 
  set_mode("classification")

# KNN Models with Different k Values
knn5_model <- nearest_neighbor(neighbors = 5) |> 
  set_engine("kknn") |> 
  set_mode("classification")

knn10_model <- nearest_neighbor(neighbors = 10) |> 
  set_engine("kknn") |> 
  set_mode("classification")

For our prediction task, I’ve selected two different types of models. First is logistic regression, which is reliable, interpretable, and effective for understanding feature importance. Then we have k-nearest neighbors (KNN) in two configurations (k=5 and k=10), a more flexible approach that adapts to local patterns in the data.

Since we’re predicting a binary outcome (survived or didn’t survive), all models are set to “classification” mode rather than “regression.” Each model has its strengths: logistic regression excels at providing interpretable odds ratios, while KNN can capture more complex, non-linear relationships in the data.
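
If you want to see exactly what each parsnip specification maps to at the engine level, translate() prints the underlying fitting call (a small sketch):

Code
# Inspect the underlying engine calls for each specification (sketch)
log_reg_model |> translate()
knn10_model |> translate()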

6. Preprocessing Data with Recipes

Code
# For Logistic Regression recipes
logreg_meanimpute <- recipe(Survived ~ ., data = train_data) |> 
  step_nzv(all_predictors()) |>  
  step_zv(all_predictors()) |>
  step_impute_mean(all_numeric_predictors()) |>
  step_unknown(Embarked) |>  # Handle missing values in Embarked
  step_dummy(all_nominal_predictors(), one_hot = FALSE) |>
  step_lincomb(all_numeric_predictors())  

logreg_medianimpute <- recipe(Survived ~ ., data = train_data) |> 
  step_nzv(all_predictors()) |>  
  step_zv(all_predictors()) |>
  step_impute_median(all_numeric_predictors()) |>  
  step_unknown(Embarked) |>  # Handle missing values in Embarked
  step_dummy(all_nominal_predictors(), one_hot = FALSE) |>
  step_lincomb(all_numeric_predictors())  

logreg_knnimpute <- recipe(Survived ~ ., data = train_data) |> 
  step_nzv(all_predictors()) |> 
  step_zv(all_predictors()) |>
  step_impute_knn(all_numeric_predictors()) |>  
  step_unknown(Embarked) |>  # Handle missing values in Embarked
  step_dummy(all_nominal_predictors(), one_hot = FALSE) |>
  step_lincomb(all_numeric_predictors())  

# For KNN recipes
knn_meanimpute <- recipe(Survived ~ ., data = train_data) |> 
  step_nzv(all_predictors()) |>  
  step_zv(all_predictors()) |>
  step_impute_mean(all_numeric_predictors()) |>  
  step_unknown(Embarked) |>  # Handle missing values in Embarked
  step_dummy(all_nominal_predictors(), one_hot = TRUE) |>
  step_lincomb(all_numeric_predictors())  

knn_medianimpute <- recipe(Survived ~ ., data = train_data) |> 
  step_nzv(all_predictors()) |>  
  step_zv(all_predictors()) |>
  step_impute_median(all_numeric_predictors()) |>  
  step_unknown(Embarked) |>  # Handle missing values in Embarked
  step_dummy(all_nominal_predictors(), one_hot = TRUE) |>
  step_lincomb(all_numeric_predictors())  

knn_knnimpute <- recipe(Survived ~ ., data = train_data) |> 
  step_nzv(all_predictors()) |>  
  step_zv(all_predictors()) |>
  step_impute_knn(all_numeric_predictors()) |>  
  step_unknown(Embarked) |>  # Handle missing values in Embarked
  step_dummy(all_nominal_predictors(), one_hot = TRUE) |>
  step_lincomb(all_numeric_predictors())

Here, I’ve created six different preprocessing recipes to transform our raw data into model-ready format. Each recipe follows a similar pattern but with key differences in how missing values are handled:

  1. First, we remove near-zero and zero-variance predictors that don’t contribute meaningful information

  2. Then we impute missing values using one of three strategies:

    • Mean imputation (replacing missing values with the average)

    • Median imputation (using the middle value, more robust to outliers)

    • KNN imputation (using similar instances to estimate missing values)

  3. We also assign missing Embarked values to an explicit “unknown” level (step_unknown), since the imputation step only covers numeric predictors

  4. Next, we convert categorical variables to dummy variables

  5. Finally, we remove linearly dependent variables to avoid multicollinearity

The distinction between logistic regression and KNN recipes lies in the one-hot encoding: logistic regression uses regular dummy variables, while KNN benefits from one-hot encoding (setting one_hot = TRUE) due to its distance-based nature.

This comprehensive approach lets us compare different imputation strategies and determine which works best for our Titanic data.
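
To see what any one of these recipes actually produces, it can be prepped on the training data and baked back out (a sketch using the KNN-imputation logistic regression recipe defined above):

Code
# Prep and bake one recipe to inspect the processed training data (sketch)
logreg_knnimpute |>
  prep(training = train_data) |>
  bake(new_data = NULL) |>
  glimpse()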

7. Creating Model Lists and Workflow Sets

Code
# Logistic Regression Preprocessors
logreg_preprocessors <- list(
  logreg_knn_impute = logreg_knnimpute,
  logreg_mean_impute = logreg_meanimpute,
  logreg_median_impute = logreg_medianimpute
)

# KNN Classification Preprocessors
knn_preprocessors <- list(
  knn_knn_impute = knn_knnimpute,
  knn_mean_impute = knn_meanimpute,
  knn_median_impute = knn_medianimpute
)

# Logistic Regression Model
logreg_models <- list(
  logreg_model = logistic_reg() |> 
    set_engine("glm") |> 
    set_mode("classification")
)

# KNN Models with different k values
knn_models <- list(
  knn5 = nearest_neighbor(neighbors = 5) |> 
    set_engine("kknn") |> 
    set_mode("classification"),
  
  knn10 = nearest_neighbor(neighbors = 10) |> 
    set_engine("kknn") |> 
    set_mode("classification")
)

# Create workflow sets for each model type
logreg_workflows <- workflow_set(logreg_preprocessors, logreg_models, cross = TRUE)
knn_workflows <- workflow_set(knn_preprocessors, knn_models, cross = TRUE)

# Combine all workflow sets
all_models <- bind_rows(logreg_workflows, knn_workflows)

# Print workflow set
all_models
# A workflow set/tibble: 9 × 4
  wflow_id                          info             option    result    
  <chr>                             <list>           <list>    <list>    
1 logreg_knn_impute_logreg_model    <tibble [1 × 4]> <opts[0]> <list [0]>
2 logreg_mean_impute_logreg_model   <tibble [1 × 4]> <opts[0]> <list [0]>
3 logreg_median_impute_logreg_model <tibble [1 × 4]> <opts[0]> <list [0]>
4 knn_knn_impute_knn5               <tibble [1 × 4]> <opts[0]> <list [0]>
5 knn_knn_impute_knn10              <tibble [1 × 4]> <opts[0]> <list [0]>
6 knn_mean_impute_knn5              <tibble [1 × 4]> <opts[0]> <list [0]>
7 knn_mean_impute_knn10             <tibble [1 × 4]> <opts[0]> <list [0]>
8 knn_median_impute_knn5            <tibble [1 × 4]> <opts[0]> <list [0]>
9 knn_median_impute_knn10           <tibble [1 × 4]> <opts[0]> <list [0]>

With our preprocessing recipes and models defined, it’s time to organize everything into a systematic framework, like the Titanic in its early hours. I’ve structured our approach using workflow sets, a powerful tidymodels feature that helps manage complex modeling processes.

I’ve grouped our preprocessing recipes by model type and created lists of our models with different configurations. Then, using workflow_set(), I’ve created all possible combinations of preprocessors and models. This gives us 9 different workflows:

  • 3 logistic regression workflows (one for each imputation strategy)

  • 6 KNN workflows (two different k values, each with three imputation strategies)

This approach keeps everything organized while automating the process of trying multiple model configurations. Rather than writing repetitive code for each combination, we now have a clean, methodical way to evaluate all our options.
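
Any single workflow can be pulled out of the set by its ID for inspection, mirroring how the final model is extracted later in Step 9 (a quick sketch):

Code
# Inspect one workflow from the combined set (sketch)
all_models |>
  extract_workflow("logreg_knn_impute_logreg_model")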

8. Model Training and Evaluation

Code
# Define classification metrics
titanic_class_metrics <- metric_set(accuracy, roc_auc)

# Fit all models using cross-validation
all_fits <- all_models |> 
  workflow_map("fit_resamples",
               resamples = titanic_folds,
               metrics = titanic_class_metrics)

# Collect metrics
collect_metrics(all_fits)
# A tibble: 18 × 9
   wflow_id         .config preproc model .metric .estimator  mean     n std_err
   <chr>            <chr>   <chr>   <chr> <chr>   <chr>      <dbl> <int>   <dbl>
 1 logreg_knn_impu… Prepro… recipe  logi… accura… binary     0.801   100 0.00439
 2 logreg_knn_impu… Prepro… recipe  logi… roc_auc binary     0.847   100 0.00444
 3 logreg_mean_imp… Prepro… recipe  logi… accura… binary     0.798   100 0.00418
 4 logreg_mean_imp… Prepro… recipe  logi… roc_auc binary     0.847   100 0.00458
 5 logreg_median_i… Prepro… recipe  logi… accura… binary     0.797   100 0.00430
 6 logreg_median_i… Prepro… recipe  logi… roc_auc binary     0.846   100 0.00458
 7 knn_knn_impute_… Prepro… recipe  near… accura… binary     0.784   100 0.00431
 8 knn_knn_impute_… Prepro… recipe  near… roc_auc binary     0.825   100 0.00448
 9 knn_knn_impute_… Prepro… recipe  near… accura… binary     0.801   100 0.00397
10 knn_knn_impute_… Prepro… recipe  near… roc_auc binary     0.840   100 0.00453
11 knn_mean_impute… Prepro… recipe  near… accura… binary     0.792   100 0.00424
12 knn_mean_impute… Prepro… recipe  near… roc_auc binary     0.828   100 0.00473
13 knn_mean_impute… Prepro… recipe  near… accura… binary     0.803   100 0.00401
14 knn_mean_impute… Prepro… recipe  near… roc_auc binary     0.839   100 0.00474
15 knn_median_impu… Prepro… recipe  near… accura… binary     0.794   100 0.00417
16 knn_median_impu… Prepro… recipe  near… roc_auc binary     0.825   100 0.00473
17 knn_median_impu… Prepro… recipe  near… accura… binary     0.803   100 0.00381
18 knn_median_impu… Prepro… recipe  near… roc_auc binary     0.836   100 0.00474
Code
# Plot Accuracy for Classification Models
autoplot(all_fits, metric = "accuracy") +
  geom_label_repel(aes(label = wflow_id), 
                   size = 3, 
                   label.size = 0.15, 
                   alpha = 0.7, 
                   max.overlaps = 10, 
                   force = 5, 
                   nudge_x = 0.2, 
                   nudge_y = 0.005, 
                   seed = 42) + 
  theme_bw(base_size = 14) +
  theme(legend.position = "none") +
  labs(title = "Accuracy of Classification Models",
       x = "Workflow Rank",
       y = "Accuracy")

Code
# Plot AUC for Classification Models
autoplot(all_fits, metric = "roc_auc") +
  geom_label_repel(aes(label = wflow_id), 
                   size = 3, 
                   label.size = 0.15, 
                   alpha = 0.7, 
                   max.overlaps = 10, 
                   force = 5, 
                   nudge_x = 0.2, 
                   nudge_y = 0.005, 
                   seed = 42) + 
  theme_bw(base_size = 14) +
  theme(legend.position = "none") +
  labs(title = "AUC-ROC of Classification Models",
       x = "Workflow Rank",
       y = "AUC-ROC")

(If you’re rendering this yourself, please be patient for this step, the Titanic wasn’t built in a day.)

Now comes the exciting part - training and evaluating our models across multiple configurations. Using workflow_map(), I’ve fit all 9 model configurations across our 10-fold cross-validation with 10 repeats, effectively evaluating each model 100 times. This comprehensive approach provides extremely reliable performance estimates.

Looking at the accuracy plot, we can see clear performance differences. The KNN models with k=10 using mean and median imputation (specifically knn_mean_impute_knn10 and knn_median_impute_knn10) achieve the highest accuracy scores, just above 80%. This suggests these configurations classify passengers with the greatest overall correctness.

The ROC AUC visualization tells a somewhat different story. Our logistic regression models (particularly logreg_knn_impute_logreg_model) reach AUC scores approaching 0.85, outperforming the KNN models on this metric. This indicates that while the KNN models have slightly better raw accuracy, the logistic regression models excel at ranking predictions by probability, a crucial capability when we need to understand confidence levels in survival predictions.

From the metric table, we can see that the best logistic regression model achieves an accuracy of approximately 80.1% with an AUC of about 0.85, while the best KNN model reaches about 80.3% accuracy but with a slightly lower AUC of around 0.84. This trade-off between metrics highlights an important consideration in model selection: whether we prioritize overall correctness (accuracy) or probabilistic ranking ability (AUC).
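
To make this comparison explicit, the workflows can be ranked on either metric; the same rank_results() call drives the model selection in the next step (a sketch):

Code
# Rank all workflows by cross-validated ROC AUC (sketch)
all_fits |>
  rank_results(rank_metric = "roc_auc") |>
  filter(.metric == "roc_auc") |>
  select(wflow_id, mean, std_err, rank)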

9. Selecting the Best Model and Making Predictions

Code
# Select the best model
best_model <- all_fits |>
  rank_results(rank_metric = "roc_auc") |>
  filter(.metric == "roc_auc") |>
  slice(1) |>
  pull(wflow_id)

# Fit the best model on the full training data
final_fit <- all_models |>
  extract_workflow(best_model) |>
  fit(data = train_data)

# Make predictions on test data
test_results <- test_data |>
  mutate(
    pred_class = predict(final_fit, new_data = test_data)$.pred_class,
    pred_prob = predict(final_fit, new_data = test_data, type = "prob")$.pred_1
  )

# Create confusion matrix
conf_mat <- test_results |>
  conf_mat(truth = Survived, estimate = pred_class)

# Plot confusion matrix
conf_mat |> 
  autoplot("heatmap") +
  scale_fill_gradient(low = "white", high = "#003f5c") + 
  theme_minimal() +
  labs(title = "Confusion Matrix Heatmap",
       fill = "Count")

Code
# Calculate comprehensive metrics
binary_metrics <- metric_set(accuracy, recall, precision, 
                            specificity, npv)

test_metrics <- test_results |>
  binary_metrics(truth = Survived, estimate = pred_class, event_level = "second")

# Print metrics
test_metrics
# A tibble: 5 × 3
  .metric     .estimator .estimate
  <chr>       <chr>          <dbl>
1 accuracy    binary         0.810
2 recall      binary         0.696
3 precision   binary         0.787
4 specificity binary         0.882
5 npv         binary         0.822

After thorough evaluation, I’ve selected the best-performing model based on ROC AUC: logreg_knn_impute_logreg_model, a logistic regression model with KNN imputation. Ranking by ROC AUC rather than by accuracy reflects performance across all classification thresholds, which matters because accuracy alone can be misleading with the imbalanced classes in the Titanic dataset.
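
Since the selected model is a logistic regression, its coefficients can also be read as odds ratios, the interpretability benefit noted earlier (a sketch, assuming final_fit is the fitted workflow from the code above):

Code
# Coefficients of the final logistic regression expressed as odds ratios (sketch)
final_fit |>
  extract_fit_parsnip() |>
  tidy() |>
  mutate(odds_ratio = exp(estimate)) |>
  arrange(desc(abs(estimate)))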

The confusion matrix heatmap reveals the model’s prediction patterns in more detail. The darker squares along the diagonal (true negatives in the top-left and true positives in the bottom-right) show where our model correctly identified both non-survivors and survivors. From the intensity of these squares, we can see the model is particularly strong at identifying non-survivors (darker top-left square).

Our comprehensive metrics on the test data reveal:

  • Accuracy: 81.0% – the model correctly classifies about 4 out of 5 passengers overall

  • Recall: 69.6% – among actual survivors, the model identifies roughly 7 out of 10

  • Precision: 78.7% – when predicting someone survived, the model is right almost 80% of the time

  • Specificity: 88.2% – the model excels at identifying non-survivors, correctly classifying nearly 9 out of 10

  • NPV: 82.2% – when predicting someone didn’t survive, the model is correct about 82% of the time

These numbers tell us that our model is particularly good at identifying who didn’t survive (high specificity) but slightly less adept at catching all survivors (lower recall). This asymmetry mirrors the historical reality of the Titanic disaster, where the majority did not survive.
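
For reference, the same kind of metric suite can be derived in one pass from the confusion matrix object itself (a sketch; event_level = "second" matches the convention used above):

Code
# Full metric suite computed from the confusion matrix (sketch)
conf_mat |>
  summary(event_level = "second")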

10. ROC Curve Analysis

Code
# Generate points for ROC curve
roc_curve_data <- test_results |>
  roc_curve(truth = Survived, pred_prob, event_level = "second")

# Plot ROC curve
roc_curve_data |>
  autoplot()

Code
# Calculate AUC
auc_value <- test_results |>
  roc_auc(truth = Survived, pred_prob, event_level = "second")

# Print AUC
auc_value
# A tibble: 1 × 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 roc_auc binary         0.865

Another crucial aspect of binary classification models is understanding the tradeoff between sensitivity (catching true survivors) and specificity (correctly identifying non-survivors). The ROC curve visualizes this tradeoff across different probability thresholds.

Our ROC curve shows a pronounced bow toward the upper-left corner of the plot, with the line rising steeply at first before gradually flattening. This shape indicates that our model achieves a good balance between true positive rate and false positive rate across various classification thresholds.

The calculated AUC value of 0.865 quantifies this performance. Interpretively, this means our model has an 86.5% chance of ranking a randomly chosen survivor higher than a randomly chosen non-survivor – significantly better than the 50% we would expect from random guessing. In practical terms, if we had to prioritize one of two passengers for limited lifeboat space based on predicted survival probability, and exactly one of them was a true survivor, the model would point to that survivor about 87% of the time.

This strong AUC score validates our model selection approach and suggests that logistic regression with KNN imputation effectively captures the complex patterns within the Titanic dataset. The curve’s shape also offers insights into potential threshold adjustments we could make to favor either sensitivity or specificity depending on the specific analytical goals.
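
As one illustration of such a threshold adjustment, the test-set metrics can be recomputed at a lower cutoff that trades specificity for recall (a sketch; the 0.35 cutoff is an arbitrary value for illustration, not a tuned choice):

Code
# Re-classify at a custom probability threshold to favor recall (sketch)
threshold <- 0.35  # illustrative cutoff, not tuned

test_results |>
  mutate(pred_class_adj = factor(if_else(pred_prob >= threshold, "1", "0"),
                                 levels = levels(Survived))) |>
  binary_metrics(truth = Survived, estimate = pred_class_adj, event_level = "second")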

Conclusion

This analysis of the Titanic dataset demonstrates a complete data science workflow from preprocessing through model evaluation. Through careful data preprocessing with multiple imputation strategies and appropriate handling of categorical variables, I established a solid foundation for understanding the factors that influenced passenger survival. The systematic approach to model evaluation, comparing logistic regression and KNN models across different configurations using tidymodels, revealed that while KNN models achieved slightly higher accuracy, logistic regression offered better probabilistic ranking capabilities.

The final model achieved 81% accuracy and an AUC of 0.865 on unseen test data. The model’s higher specificity (88.2%) compared to recall (69.6%) reflects the historical reality that the majority of passengers did not survive, with clear patterns emerging around passenger class, gender, and age. These insights not only have statistical significance but also provide a numerical perspective on the historic tragedy.

Throughout this project, I’ve applied a methodical, rigorous approach to data science using industry-standard tools and techniques. From exploratory data analysis that uncovered meaningful patterns to the implementation of cross-validation techniques ensuring reliable model evaluation, each step followed best practices in statistical modeling. This systematic workflow demonstrates how modern machine learning techniques can extract valuable insights even from historical datasets, bringing statistical clarity to complex human events while maintaining technical rigor and analytical integrity.

Note

Some visual elements of this report and some debugging were done with the help of AI. Everything that was done using AI was fully understood and comprehended.