Data Project by Viraj

In this final project, we explore Concrete Jungle’s volunteer data and analyze the journeys volunteers have taken. A volunteer journey refers to the path of interactions a volunteer has with Concrete Jungle. First, we clean and merge the data frames and find some irregularities in the data. Then, we model volunteer journeys as a Markov chain and try to predict the next step of a volunteer given their current activity. Finally, we use machine learning algorithms to group similar volunteer paths and explore their characteristics.

Importing Libraries

Code
# Import packages
#| output: false
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import os

import plotly.express as px
import plotly.graph_objects as go
import plotly.io as pio
#pio.renderers.default = "notebook_connected"

from keras.models import Model
from keras.regularizers import l1
from keras.optimizers import Adam
from keras.layers import Input, Dense
from keras import activations

import tensorflow as tf

from sklearn.metrics import silhouette_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import NearestNeighbors 
from sklearn.manifold import TSNE
from sklearn.decomposition import TruncatedSVD
from sklearn.cluster import DBSCAN
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

Importing Data

Code
# Import data sets

data_folder = "data"
file_name = "full_journey.csv"

def load_csv_as_dataframe(file_name):
    return pd.read_csv(os.path.join(os.path.curdir, data_folder, file_name))

full_journey_df = load_csv_as_dataframe(file_name)

# Unique event_type
unique_events = full_journey_df['Event_Type'].unique()

Markov Chain Modelling of Volunteer Journey

A discrete-time Markov model is a sequence of random variables, known as a stochastic process, in which the value of the next variable depends only on the value of the current variable.

Let \(\{X_t\}_t\) be a stochastic process that captures the state of the system at time \(t\), and let \(\mathcal{S} = \{x_1, x_2, x_3, x_4, x_5 \}\) be the discrete state space that consists of the five event types: Contact Created, Sign up for events, Cancellations, Donations, and Sign up for adventure picks.

We can model each volunteer’s path \(\{X_t\}_t\) as a Markov chain, wherein we assume that a volunteer’s participation in an event at time \(t + 1\) depends purely on their activity at time \(t\).

In other words, the Markov chain has a one-step memory.

Let \(P\) be the \(5 \times 5\) transition matrix. The probability of moving from event \(i\) to event \(j\) is defined as

\[ p_{ij} = \mathbb{P}(X_{t + 1} = x_j \mid X_t = x_i),\]

which we assume does not depend on \(t\). The Markov property states that the next state depends only on the current state:

\[\mathbb{P}(X_{t + 1} = x_j \mid X_{t} = x_i, \dots, X_0 = x_0) = \mathbb{P}(X_{t + 1} = x_j \mid X_{t} = x_i)\]

\[\begin{equation*} P = \begin{bmatrix} p_{11} & p_{12} & \dots & p_{15}\\ \vdots & \vdots & \ddots & \vdots\\ p_{51} & p_{52} & \dots & p_{55} \end{bmatrix} \end{equation*}\]
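To make the estimation concrete, the following is a minimal sketch (using a hypothetical toy list of paths, not the Concrete Jungle data) of how \(\hat{P}\) can be estimated by counting observed one-step transitions and normalizing each row. The full implementation used in this project is the MarkovChain class further below.

Code
import pandas as pd

# Toy paths; each inner list is one volunteer's ordered sequence of events
toy_paths = [
    ['Contact_Created', 'Sign-Up: Event', 'Donation'],
    ['Contact_Created', 'Donation', 'Donation'],
    ['Sign-Up: Event', 'Sign-Up: Adventure Pick'],
]

states = ['Contact_Created', 'Sign-Up: Event', 'Cancellation',
          'Donation', 'Sign-Up: Adventure Pick']

# Count observed one-step transitions
counts = pd.DataFrame(0, index=states, columns=states)
for path in toy_paths:
    for current, nxt in zip(path[:-1], path[1:]):
        counts.loc[current, nxt] += 1

# Normalize each row so that the probabilities sum to one
# (rows with no observed transitions are left as zeros)
row_totals = counts.sum(axis=1)
P_hat = counts.div(row_totals.where(row_totals > 0, 1), axis=0)
print(P_hat.round(2))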

Data Pre-Processing for Markov Model

Code
# Convert Date into datetime object
full_journey_df['Date'] = pd.to_datetime(full_journey_df['Date'])
Code
# Keep the columns needed for journey paths; copy to avoid modifying a view of full_journey_df
journey_subset = full_journey_df[['hashed_name', 'Date', 'Event_Type']].copy()

def event_classification(event):
    '''
    A function to broadly classify event types into four categories:
    1. Cancellations
    2. Sign-Up: Event
    3. Sign-Up: Adventure Pick
    4. Donation
    Note: Event_Type == Contact_Created already exists
    '''
    if str(event).startswith('Cancellation'):
        return 'Cancellation'
    elif str(event) == 'Sign-Up: Adventure Pick':
        return 'Sign-Up: Adventure Pick'
    elif str(event).startswith('Sign-Up:'):
        return 'Sign-Up: Event'
    else:
        return event
    
journey_subset['Event_Type'] = journey_subset['Event_Type']\
    .apply(lambda x: event_classification(x))
Code
# create pivot table with counts of each event by hashed_name and date
pivot_journey = journey_subset.pivot_table(index=['hashed_name', 'Date'], 
                                           columns='Event_Type', aggfunc='size', 
                                           fill_value=0).reset_index()

# group by hashed_name and sum up event counts across all days for each volunteer
grouped = pivot_journey.groupby('hashed_name').sum(numeric_only=True).reset_index()
Code
# Set-up data for Markov Chain model

def conversion_list(x):
    path_list = x.to_list()

    conversion = 'Sign-Up: Adventure Pick'

    # If the volunteer never signed up for an adventure pick,
    # end the journey with the absorbing 'Null' state
    if conversion not in path_list:
        path_list.append('Null')

    return path_list

df_paths = journey_subset.groupby('hashed_name')['Event_Type'].aggregate(
    lambda x: conversion_list(x)).reset_index()

df_paths = df_paths.rename(columns={'Event_Type': 'path'})
list_of_paths = df_paths['path'].to_list()
Code
# Markov Chain Model
class MarkovChain:
    
    '''
    Motivation: https://towardsdatascience.com/marketing-channel-attribution-with-markov-chains-in-python-part-2-the-complete-walkthrough-733c65b23323
    '''

    
    def __init__(self, list_of_paths):
        # stores all paths taken by volunteers/donors
        self.paths = list_of_paths

        # stores all unique events in the list of paths
        self.states = ['Contact_Created', 'Sign-Up: Event', 'Cancellation', 
                       'Donation', 'Sign-Up: Adventure Pick', 'Null']
    
    def conversion_rate(self):

        # Total number of adventure-pick sign-ups across all paths
        total_conversions = sum(path.count('Sign-Up: Adventure Pick') for path in self.paths)
        base_conversion_rate = total_conversions / len(self.paths)

        print(f"Total conversions: {total_conversions}, base conversion rate: {base_conversion_rate:.3f}")
        return base_conversion_rate
    
    def transition_states(self):
        '''
        A function that takes the list of paths and returns a dictionary that 
        represents the transition counts between every event pair in the list.

        '''

        # Initialize a dictionary with all possible transition states
        transition_states = {x + '>' + y: 0 for x in self.states for y in self.states}

        # Loop through each possible state and count the transitions
        for state in self.states:
            # Skip the 'Null' event
            if state == 'Null':
                continue

            # Loop through each user path
            for path in self.paths:

                # If the event is in the path, count its transitions
                if state in path:
                    # Find all occurrences of the event in the path
                    indices = [i for i, s in enumerate(path) if s == state]
                    # Increment the transition count for each occurrence of the event
                    for index in indices:
                        if index < len(path) - 1:
                            transition_states[path[index] + '>' + path[index + 1]] += 1

        return transition_states
    
    def transition_prob(self, transition_dict):

        # Calculate the transition probabilities for each state
        trans_prob = {}
        for state in self.states:
            # Skip 'Null' state
            if state == 'Null':
                continue

            # Calculate the total number of transitions out of the current state
            total_transitions = sum(count for key, count in transition_dict.items() if key.startswith(state + '>'))
            # Guard against states with no outgoing transitions
            if total_transitions == 0:
                continue
            # Calculate the transition probability for each transition out of the current state
            for key, count in transition_dict.items():
                if key.startswith(state + '>'):
                    trans_prob[key] = float(count) / total_transitions

        return trans_prob
    
    def transition_matrix(self, transition_probabilities):

        # Set up the transition matrix, initialized to zero, with 'Null' as an
        # absorbing state (probability 1 of remaining in 'Null')
        transition_matrix = pd.DataFrame(0.00, index = self.states, columns = self.states)
        transition_matrix.loc['Null', 'Null'] = 1.0
            
        for key, value in transition_probabilities.items():
            current_state, next_state = key.split('>')
            transition_matrix.at[current_state, next_state] = value

        return transition_matrix

    def fit(self):

        trans_states = self.transition_states()
        trans_prob = self.transition_prob(trans_states)
        trans_matrix = self.transition_matrix(trans_prob)
        
        return trans_matrix

Markov Model Results

The results from the estimated transition matrix are as follows:

\[ \hat{P} = \begin{bmatrix} & \text{Contact Created} & \text{Event} & \text{Cancellation} & \text{Donation} & \text{Adventure Pick} & \text{Null} \\ \text{Contact Created} & 0.000 & 0.400 & 0.005 & 0.061 & 0.075 & 0.458 \\ \text{Event} & 0.222 & 0.483 & 0.009 & 0.040 & 0.038 & 0.206 \\ \text{Cancellation} & 0.000 & 0.556 & 0.037 & 0.000 & 0.111 & 0.296 \\ \text{Donation} & 0.014 & 0.014 & 0.000 & 0.778 & 0.002 & 0.193 \\ \text{Adventure Pick} & 0.366 & 0.361 & 0.014 & 0.089 & 0.171 & 0.000 \\ \text{Null} & 0.000 & 0.000 & 0.000 & 0.000& 0.000 & 1.000 \\ \end{bmatrix} \]

Code
import seaborn as sns 
model1 = MarkovChain(list_of_paths).fit()
fig, ax = plt.subplots(figsize = (10, 4))
  
# Displaying dataframe as an heatmap 
sns.heatmap(model1, cmap ="Blues" , linewidths = 0.30, annot = True)

Assuming that volunteer paths follow a Markov process, we can claim that there is a 40% chance that volunteers who create contact will sign up for an event at the next time period. Moreover, there is only a 7.5% chance that they will sign up for an adventure pick after creating contact, and a 45.8% chance that they will be inactive after contact creation.

Similarly, if a volunteer signs up for an adventure pick, there is a 17.1% probability that they will sign up for another adventure pick in the next time period. Interestingly, there is a 77.8% probability that an individual who has donated will donate again at the next time step.
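Because each row of the estimated transition matrix is a probability distribution over next events, the fitted matrix can be used directly to predict (or simulate) a volunteer’s next step. The snippet below is a minimal sketch that reuses the `model1` data frame fitted above; the helper names `most_likely_next` and `sample_next` are ours, introduced only for illustration.

Code
import numpy as np

def most_likely_next(trans_matrix, current_state):
    # The most probable next event is the column with the largest
    # probability in the current state's row
    return trans_matrix.loc[current_state].idxmax()

def sample_next(trans_matrix, current_state, rng=np.random.default_rng(0)):
    # Draw the next event according to the row's probability distribution
    row = trans_matrix.loc[current_state]
    return rng.choice(row.index, p=row.values)

print(most_likely_next(model1, 'Contact_Created'))  # 'Null' per the fitted matrix above
print(sample_next(model1, 'Donation'))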

Deep Clustering

We take two approaches to study clusters of volunteer journeys:

  1. Autoencoder + K-Means: We use an autoencoder neural network to perform dimensionality reduction and fit a K-Means clustering algorithm to the encoded data. Next, we feed the low-dimensional cluster centers into the decoder to get a full-dimensional journey image. Each such image is representative of the standard journeys in its cluster; the autoencoder helps stitch the low-dimensional data back into the original shape so that it can be visualized.

  2. Truncated SVD + DBSCAN: Density-based spatial clustering of applications with noise (DBSCAN) is a density-based clustering algorithm that groups points that are closely packed together. We use a dimensionality reduction technique called Truncated SVD to reduce the sparse journey-image matrix. Next, we run the DBSCAN clustering algorithm on the low-dimensional data. The fitted DBSCAN model can then be treated as a predictive model to assign a cluster label to each volunteer journey.

Autoencoder + K-Means Clustering (Without Features)

We use a deep autoencoder clustering algorithm to identify groups of volunteers who have taken a similar path in interacting with the organization. In order to perform the analysis, we take the following steps:

  1. Represent each volunteer’s journey as a 2D image. First, we convert each activity into a \((1 \times 5)\) vector like \([0, 0, 1, \dots, 0]\), whose entries indicate whether a given activity was done \((1)\) or not \((0)\). Next, we stack these vectors into a matrix of ordered activities. A journey path for a volunteer \(i\) might look like:

\[\begin{equation*} V_i = \begin{bmatrix} 0 & 1 & \dots & 0\\ \vdots & \vdots & \vdots & \vdots\\ 1 & 0 & \dots & 0 \end{bmatrix} \end{equation*}\]

  2. We create a \((24 \times 5)\) matrix for each journey, as 24 is a practical upper limit on path lengths after excluding some outliers. For volunteers with fewer than \(24\) steps, the remaining rows are zero vectors. We have a total of \(1283\) volunteers within this limit, so the full set of journeys can be stored in a \((1283 \times 24 \times 5)\) tensor (a small code sketch of this encoding follows the list).

  3. We train an autoencoder neural network that learns to compress each image through its encoder network and recreate the original-sized image through its decoder network. The reduced-form images from the encoder are fed into a K-Means clustering algorithm to estimate groups of similar volunteer paths. Lastly, we use the decoder network to transform the cluster centers back into their original shape and use Sankey plots to visualize the standard paths taken by the volunteers.
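The following is a minimal sketch of the journey-to-image encoding described above, using a hypothetical toy path and an illustrative event ordering; the actual tensor construction on the Concrete Jungle data appears in the “Creating Image Tensor” section below.

Code
import numpy as np

EVENTS = ['Sign-Up: Event', 'Contact_Created', 'Donation',
          'Cancellation', 'Sign-Up: Adventure Pick']
MAX_PATH_LENGTH = 24

def path_to_image(path, events=EVENTS, max_len=MAX_PATH_LENGTH):
    """One-hot encode an ordered list of events into a (max_len x 5) image,
    zero-padding the rows beyond the end of the path."""
    image = np.zeros((max_len, len(events)))
    for step, event in enumerate(path[:max_len]):
        image[step, events.index(event)] = 1.0
    return image

toy_path = ['Contact_Created', 'Sign-Up: Event', 'Donation']
print(path_to_image(toy_path)[:4])  # first rows of the padded journey image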

Code
full_journey_df['Event_Type'] = full_journey_df['Event_Type'].apply(lambda x: event_classification(x))

# % of NA values in each column for each event type
def na_percent(x):
    return (x.isnull().sum() * 100) / len(x)
percent_missing = full_journey_df.groupby('Event_Type').agg(na_percent)

# Choose all columns where percentage of NA values is less than 30%

notna_columns = list(percent_missing[percent_missing < 30].dropna(how = 'all', axis = 1).columns) # convert to list

# Remove redundant or unwanted columns

remove_columns = ['Unnamed: 0.1', 'Unnamed: 0','Earliest sign up', 'Most recent sign up', 'Event Roles', 'Event Type',
                     'Role Type', 'Event Role Name (for confirmation automation)', 'Event Time', 'Event Lead First Name',
                     'Role', 'Email Confirmation Sent',
                     'Total Amount', 'Total donation', 'Total cancellation', 
                     'Total number of sign up', 'Location Name', 'Event Date (For Instruction Emails)']

notna_columns = [col for col in notna_columns if col not in remove_columns]
notna_columns.append('Event_Type')


journey_df = full_journey_df[notna_columns]
hashed_names = journey_df['hashed_name'].unique()

event_types = journey_df['Event_Type'].unique()
event_type_indices = {event_type: i for i, event_type in enumerate(event_types)}

#Re-arrange Columns
pivot_journey = pivot_journey[['hashed_name', 'Date', 'Sign-Up: Event', 
                                'Contact_Created', 'Donation', 'Cancellation',
                                'Sign-Up: Adventure Pick']]

# Maximum steps taken on a path
max_path_length = max(len(x) for x in list_of_paths)
Code
# Inspecting list_of_paths

list_of_paths_df = pd.DataFrame(columns=['path', 'count'])

list_of_paths_df['path'] = list_of_paths
list_of_paths_df['count'] = list_of_paths_df['path'].apply(lambda x: len(x))
#list_of_paths_df.describe()

df_paths['count'] = df_paths['path'].apply(lambda x: len(x))
df_paths.describe()
count
count 2526.000000
mean 5.058195
std 8.716581
min 2.000000
25% 2.000000
50% 3.000000
75% 4.000000
max 109.000000

The range of path lengths is 107, with a mean of 5 and a median of 3. We can see that the path-length distribution is skewed to the right. Some of the longest paths can be attributed to recurring donation payments. Next, we exclude donation-only contacts to get a better sense of the average path length.

Code
Donation = 'Donation'
Null = 'Null'

# New column `Only_Donor` stores 1 if volunteer has only donated, else 0
df_paths['Only_Donor'] = df_paths['path']\
    .apply(lambda x: 1 if set(list(x)) == {'Donation', 'Null'} else 0)

donor_list_of_path_df = df_paths[df_paths['Only_Donor'] == 1]
events_list_of_path_df = df_paths[df_paths['Only_Donor'] == 0]

# Visualize the distribution of path lengths
events_list_of_path_df['count'].plot.hist(bins=100, alpha=0.5)

We drop any volunteer journeys with path length greater than 25. We remove these outliers because they increase the dimensions of the data (\(\text{Max Path Length} \times \text{\# Events}\)) and distort the clustering analysis.

Code
events_list_of_path_df = events_list_of_path_df[events_list_of_path_df['count'] <=25]

Creating Image Tensor

Code
# Create a zero tensor with dimensions 
# (unique hashed_name) x (max path length) x (number of events)

tensor = np.zeros((len(events_list_of_path_df), max(events_list_of_path_df['count']), len(event_types)))
for i, name in enumerate(events_list_of_path_df['hashed_name']):
    name_event = pivot_journey[pivot_journey['hashed_name'] == name][event_types].values
    num_events = name_event.shape[0]
    tensor[i, :num_events, :] = name_event
Code
# Flatten 3D tensor into 2D

X = tensor.reshape((len(tensor), np.prod(tensor.shape[1:])))
X_train, X_test = train_test_split(X, test_size=0.2)

tensor_train, tensor_test = train_test_split(tensor, test_size=0.2)
X.shape, X_train.shape, X_test.shape
((1283, 120), (1026, 120), (257, 120))

Autoencoder Architecture

Representative Autoencoder Design | Source: tikz.net/autoencoder/

Representative Autoencoder Architecture | Source: www.saberhq.com/blog/autoencoders

Code
input_size = 24 * 5 # (Max Path Length) * (Number of Events)
hidden_size_1 = 64
hidden_size_2 = 32
latent_size= 32

# Autoencoder Architecture

input_layer = Input(shape=(input_size,), name='input_layer')
hidden_1 = Dense(hidden_size_1, activation='relu', name = 'hidden_1')(input_layer)
hidden_2 = Dense(hidden_size_2, activation='relu', name = 'hidden_2')(hidden_1) 
latent_layer = Dense(latent_size, activation='relu', name = 'latent_layer')(hidden_2)
hidden_3 = Dense(hidden_size_2, activation='relu', name = 'hidden_3')(latent_layer)
hidden_4 = Dense(hidden_size_1, activation='relu', name = 'hidden_4')(hidden_3)
output_layer = Dense(input_size, activation='relu', name = 'output_layer')(hidden_4)

# Model Training

autoencoder = Model(input_layer, output_layer)
autoencoder.compile(optimizer = Adam(learning_rate = 1e-5), loss = tf.keras.losses.MeanSquaredError())
autoencoder.fit(X_train, X_train, epochs = 100, validation_data = (X_test, X_test))
autoencoder.save_weights(os.path.join(os.path.curdir, 'ae_weights.h5'))
Code
autoencoder.summary()
Model: "model"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 input_layer (InputLayer)    [(None, 120)]             0         
                                                                 
 hidden_1 (Dense)            (None, 64)                7744      
                                                                 
 hidden_2 (Dense)            (None, 32)                2080      
                                                                 
 latent_layer (Dense)        (None, 32)                1056      
                                                                 
 hidden_3 (Dense)            (None, 32)                1056      
                                                                 
 hidden_4 (Dense)            (None, 64)                2112      
                                                                 
 output_layer (Dense)        (None, 120)               7800      
                                                                 
=================================================================
Total params: 21,848
Trainable params: 21,848
Non-trainable params: 0
_________________________________________________________________
Code
# Extracting the encoder network
encoder = Model(input_layer, latent_layer)
encoder.compile(optimizer = Adam(learning_rate = 1e-5), loss=tf.keras.losses.MeanSquaredError(), metrics = ['accuracy'])

# load the weights created by the autoencoder 
encoder.load_weights(os.path.join(os.path.curdir, 'ae_weights.h5'), by_name = True)

# low dimesion encoded data for KMeans()
encoded_data = encoder.predict(X)
Code
# Extracting the decoder network: a Keras Model needs an Input layer, so we feed a
# new latent-sized input through the trained decoder layers of the autoencoder
latent_input = Input(shape=(latent_size,), name='latent_input')
decoded = autoencoder.get_layer('hidden_3')(latent_input)
decoded = autoencoder.get_layer('hidden_4')(decoded)
decoded = autoencoder.get_layer('output_layer')(decoded)
decoder = Model(latent_input, decoded)

# load the weights created by the autoencoder (matched by layer name)
decoder.load_weights(os.path.join(os.path.curdir, 'ae_weights.h5'), by_name = True)
Code
# Choose optimal number of clusters by visual inspection (Elbow Method)
#| warning: false
inertia = []

for i in range(1, 10):
  km = KMeans(n_clusters = i, random_state=0).fit(encoded_data)
  inertia.append([i, km.inertia_])

# Plot Inertia
inertia = np.array(inertia).reshape(-1, 2)
plt.plot(inertia[:, 0], inertia[:, 1])
plt.axvline(x = 5, c = 'red')
plt.xlabel('Number of Clusters')
plt.ylabel('Inertia (SSD)')
plt.title('K-Means Cluster from Encoded Data (Elbow Method)')

Code
N_CLUSTER = 5
encoded_clusters = KMeans(n_clusters = N_CLUSTER)
encoded_clusters.fit(encoded_data)
center = encoded_clusters.cluster_centers_

tsne_center = TSNE(n_components = 2, perplexity= 2, random_state=0).fit_transform(center)

# plot t-SNE clusters

plt.scatter(tsne_center[:, 0], tsne_center[:, 1])
plt.title('Cluster Centers')

Code
def sanky_plot(path_array, iter):

    # create labels for nodes, including an explicit 'end' node for the final step
    labels = list(event_types) + ['end']
    nodes = pd.Index(labels)

    # create dictionary of nodes and their indices
    node_dict = {node: i for i, node in enumerate(nodes)}
    df = pd.DataFrame(path_array, columns = list(event_types))

    # create source and target columns
    def source_finder(x):
        if x['Sign-Up: Event'] == 1.0:
            x['source'] = 'Sign-Up: Event'
            
        elif x['Contact_Created'] == 1.0:
            x['source'] = 'Contact_Created'
            
        elif x['Donation'] == 1.0:
            x['source'] = 'Donation'
            
        elif x['Cancellation'] == 1.0:
            x['source'] = 'Cancellation'
            
        else:
            x['source'] = 'Sign-Up: Adventure Pick'  
        return x 


    df = df.apply(lambda x: source_finder(x), axis = 1)
    df['target'] = df['source'].shift(-1).fillna('end')

    # group data by source and target events and calculate count of links
    link_df = df.groupby(['source', 'target']).size().reset_index(name='count')

    # create dict of links
    links = []
    for i, row in link_df.iterrows():
        source = node_dict[row['source']]
        target = node_dict[row['target']]
        value = row['count']
        links.append(dict(source=source, target=target, value=value))

    # create plotly figure
    fig = go.Figure(data=[go.Sankey(
        node=dict(pad=15,
            thickness=20,
            line=dict(color='black', width=0.5),
            label=nodes
            ),
        link=dict(source=[link['source'] for link in links],
                target=[link['target'] for link in links],
                value=[link['value'] for link in links]))])

    # update layout
    fig.update_layout(
        title_text= f"Volunteer Journey {iter + 1}",
        font=dict(size=12, color='black'),
        plot_bgcolor='white',
        height=400)

    # show figure
    fig.show()
Code
cluster_labels = encoded_clusters.labels_
cluster_center_paths = decoder.predict(encoded_clusters.cluster_centers_)
recreated_centers = np.where(cluster_center_paths > 0.5, 1.0, 0.0)
for i in range(N_CLUSTER):
    center_path = recreated_centers[i].reshape(-1, 5)
    sanky_plot(center_path, i)

DBSCAN Clustering with Truncated SVD

Next, we consider all DBSCAN fits with the parameter eps ranging from 0.5 to 1.5 and select the one with the best silhouette score. The silhouette score compares the mean intra-cluster distance to the mean nearest-cluster distance and ranges from -1 to 1, where -1 represents the worst and 1 the best possible score. Since the data has more than 2 dimensions, we choose MinPts = 2 × dimensions (Sander et al., 1998), i.e. 2 × 10 = 20 for the 10 SVD components. The fitted DBSCAN model can then be used as a predictive technique to cluster future journeys; a short sketch of this idea follows the DBSCAN results below.
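Concretely, for a single journey \(i\), let \(a(i)\) be the mean distance from \(i\) to the other points in its own cluster and \(b(i)\) the mean distance from \(i\) to the points of the nearest neighbouring cluster. The silhouette value of \(i\) is

\[ s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}},\]

and the silhouette score reported below is the average of \(s(i)\) over all points.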

Code
# Truncated SVD Object
svd = TruncatedSVD(n_components = 10, random_state=0)

# Dimesionality Reduction using Truncated SVD
X_svd = svd.fit_transform(X)
Code
# k-distance plot heuristic for choosing eps: for each point, compute the
# distance to its 20th nearest neighbor (MinPts = 2 * dimensions = 20)

neighbors = NearestNeighbors(n_neighbors = 10 * 2)
nbrs = neighbors.fit(X_svd)
distances, indices = nbrs.kneighbors(X_svd)

# Sort the k-th neighbor distances and plot them
distances = np.sort(distances[:, -1])

plt.plot(distances)
plt.xlabel('Points sorted by k-distance')
plt.ylabel('Distance to 20th nearest neighbor')
plt.show()

Code
eps_range = np.arange(0.5, 1.5, 0.01)
output = []
for eps in eps_range:
    labels_loop = DBSCAN(eps = eps, min_samples = 10 * 2).fit(X_svd).labels_
    # silhouette_score is undefined when all points fall into a single cluster
    if len(set(labels_loop)) < 2:
        continue
    score_loop = silhouette_score(X_svd, labels_loop, random_state = 0)
    output.append((eps, score_loop))
Code
optimal_eps, score = sorted(output, key = lambda x : x[-1])[-1]
print(f"Best silhouette_score: {score}")
print(f"eps: {optimal_eps}")
Best silhouette_score: 0.6985920850838758
eps: 0.7300000000000002
Code
# DBSCAN Implementation

# fitting the model
dbscan = DBSCAN(eps = optimal_eps, min_samples = 10 * 2).fit(X_svd) 
labels = dbscan.labels_

# Plotting DBSCAN clusters

X_svd_df = pd.DataFrame(X_svd)
X_svd_df['labels'] = pd.Series(labels)

pca = PCA(n_components = 3)
PCs_3d = pd.DataFrame(pca.fit_transform(X_svd_df.drop(["labels"], axis=1)))
PCs_3d.columns = ["PC1_3d", "PC2_3d", "PC3_3d"]

plot_DBSCAN = pd.concat([X_svd_df, PCs_3d], axis=1, join='inner')
fig = px.scatter_3d(plot_DBSCAN, x = "PC1_3d", y = "PC2_3d", z = "PC3_3d", color = "labels")
fig.update_layout(title = 'DBSCAN Cluster Representation in 3D',
                  scene = dict(xaxis_title = 'PC1',
                               yaxis_title = 'PC2',
                               zaxis_title = 'PC3'),
                  height = 600)

fig.show()

The DBSCAN algorithm groups the reduced data into eleven clusters with labels \(0, 1, \dots, 10\), whereas the outliers are labelled as \(-1\). Out of all the journeys, 218 are labelled as outliers and the rest fall into the eleven clusters. The DBSCAN clustering algorithm is applied to the \(10\)-dimensional reduced data. In order to visually represent the cluster points, the data is further reduced to \(3\) dimensions (using PCA), which may be one reason the clusters appear to overlap. Lastly, the silhouette score of the DBSCAN clustering is \(0.699\).
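Since scikit-learn’s DBSCAN has no predict() method, one simple way to act on the earlier claim that the model can label future journeys is to assign a new point to the cluster of its nearest core sample, provided that core sample lies within eps. The sketch below is our own illustration of this idea, reusing the fitted `dbscan` object and the SVD-reduced data `X_svd`; the helper name `predict_dbscan_label` is hypothetical.

Code
def predict_dbscan_label(model, new_points):
    """Assign each new (already SVD-reduced) journey to the cluster of its
    nearest core sample, or -1 (noise) if that core sample is farther than eps."""
    nn = NearestNeighbors(n_neighbors=1).fit(model.components_)
    dist, idx = nn.kneighbors(new_points)
    core_labels = model.labels_[model.core_sample_indices_]
    pred = core_labels[idx.ravel()]
    pred[dist.ravel() > model.eps] = -1
    return pred

# Example: label the first five reduced journeys with the fitted model
print(predict_dbscan_label(dbscan, X_svd[:5]))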

Code
# DBSCAN Evaluation
print(f"DBSCAN Silhouette Score: {silhouette_score(X_svd_df.drop('labels', axis = 1), X_svd_df['labels']):.3f}")
DBSCAN Silhouette Score: 0.699
Code
pd.DataFrame(plot_DBSCAN.groupby('labels').size(), columns= ['Count'])
Count
labels
-1 218
0 72
1 332
2 285
3 44
4 97
5 63
6 72
7 29
8 24
9 27
10 20

Reference

Sander, J., Ester, M., Kriegel, H.-P., et al. Density-Based Clustering in Spatial Databases: The Algorithm GDBSCAN and Its Applications. Data Mining and Knowledge Discovery 2, 169–194 (1998). https://doi.org/10.1023/A:1009745219419