Cohort Retention Analysis with Python

6 min readMay 16, 2023

Retention has been a popular term in the recent development of the analytical world. Cohort analysis is a well-known technique to easily grasp the fundamental of how retention works and benefits as a metric. This article specifically shows a step-by-step how to create a Python-based cohort retention heat map which is often used as a visual to help analysts better understand the retention rate of a particular subject in the business and simplify it to make the business team easily understand. You can access all my codes here: https://github.com/tiofaizin/retention_cohort_analysis/

(Additional: I also add a SQL code to produce a cohort retention map).

Data Context
The data is daily customer transaction data from a marketplace company. The data only consists of user_id (unique id for each user), order_id (unique id for each purchase transaction made), created_date (user registration date), and trx_date (user transaction date). You can download the data here: https://drive.google.com/file/d/1Iy_K3y5iEr5PblI0CM6dVdwVYUF2Woc9/view?usp=sharing

Goal
To create a monthly user retention heat map. The business team needs to know the monthly retention rate of users since their registration.

Definition
Based on the context and goal above, we need to define several things, such as:

Cohort: a group of users who share common characteristics around a time period. In this, the cohort is: users who register their account every month of 2022
Retention: this is a measure of how well your platform retains users. In this case, if 100 people sign ups for the marketplace in the first month, but only 20 of them have transactions in the next month, your “retention rate” is 20%.

Import required modules

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings("ignore")

Import data

I normally use CSV format. The column arrangement is important to note. You might have various formats of data with different column names and arrangements. But I strongly suggest completely following my data column naming and arrangement (if you wish to use this notebook entirely). Since a different column naming will result in an error when running.

The column should consist of:

Retention subject column: this column represents an entity in the data in which you wish to calculate its retention. In this case, I want to know the user retention rate, so the column contains user_id information. (Other possible data for this column: merchant id, streamer id, user phone number, etc.)
Retention object column: this column is an entity in the data that represent the activity of which the retention subject (user in this case) stated as retained. In this data the column is orderid. (Other possible data for this column: app session id, login session id, streaming id, etc.)
Cohort head column: this column consists of a period of time (mostly in a date format) that your retention subject will be grouped into. As stated before, my cohort is users who register their accounts every month in 2022. So my cohort head column is created_date (user account registration date). (Other possible data for this column: user first time of transaction, user app installation date, etc.)
Retain time column: this column simply represents the time when activity in the retention object column happens. In this case, the activity is a user transaction so the column is trx_date. (Other possible data for this column: user purchase time, post uploaded time, etc.).

If you have other columns in your data (usually product_id, purchase_amount, customer_address, etc.), you must not include those columns for this cohort retention analysis.

# Import data
user_trans = pd.read_csv('user_transaction_data.csv')
user_trans.head()

Convert time data to datetime format

We need to convert created_date and trx_date columns from object datatype to datetime datatype.

user_trans['created_date'] = pd.to_datetime(user_trans.created_date)
user_trans['trx_date'] = pd.to_datetime(user_trans.trx_date)
user_trans['created_month'] = pd.to_datetime(user_trans['created_date'].dt.strftime('%Y-%m-01'))
user_trans['trx_month'] = pd.to_datetime(user_trans['trx_date'].dt.strftime('%Y-%m-01'))
user_trans.info()

Calculate the cohort size

Cohort size is the number of users that belong to each cohort. In this case, we need to calculate the number of users who registered their account in each month of 2022.

cohort_size = user_trans.groupby('created_month').user_id.nunique().reset_index()
cohort_size.rename(columns = {'user_id':'cohort_size'}, inplace = True)
cohort_size

Truncate the date columns

Since we want to get the retention rate on a monthly basis, we need to truncate the date column so it’s rounded to the beginning of the month (e.g. created_date = “2022–04–24” >> “2022–04–01”). After that you need to apply pandas drop_duplicates as the data must have lots of duplicates after the truncation. Note: you only need 1 user transaction of each month to state that the user is a retained user.

user_trans_trunc = user_trans[['user_id','created_month','trx_month']].drop_duplicates().reset_index().drop(['index'], axis=1)

Calculate the time difference between trx_month and created_month

At this step, we add a new column called month_diff which consists of a month difference between the trx_month and created_month. If the created_month is 2022–04–01 and trx_month is 2022–07–01, the month_diff will be 3 (3 months between July and April). It means the user has a transaction in the third month after their registration, and is considered as a retained customer in ‘2022–07–01’.

user_trans_trunc['month_diff'] = user_trans_trunc['trx_month'].dt.to_period('M').astype(int) - user_trans_trunc['created_month'].dt.to_period('M').astype(int)
user_trans_trunc.tail()

Pivot

Let’s pivot the data with month_diff as columns and created_month as rows. The value of each cell will be the count unique of retained users.

monthly_pivot_count = pd.DataFrame(pd.pivot_table(user_trans_trunc, values='user_id', index='created_month', columns='month_diff', aggfunc=lambda x: len(x.unique())).to_records())
monthly_pivot_count

Join with the cohort size data frame

To get the retention rate (in %) we need to compare the number of retained users in each month with the cohort size. At this step, we join the pivot data frame with the cohort size data frame. As can be seen from the result, there are 23,713 users who registered their accounts in January 2022, and only 14,500 of them carried out at least one transaction in the same month.

monthly_pivot_join = pd.merge(
                        monthly_pivot_count,
                        cohort_size,
                        how="inner",
                        on="created_month")

monthly_pivot_join = monthly_pivot_join[['created_month','cohort_size','0','1','2','3','4','5','6','7','8','9','10','11']]
monthly_pivot_join

Get the retention rate

This step simply transforms the number of monthly retained users into a percentage format, dividing them by cohort size. As can be seen from the result, only 61% of the users who registered their accounts in January 2022 have transactions (retained) in the same month.

monthly_pivot_pct = monthly_pivot_join[['created_month','cohort_size']]
for i in range (0,12):
    monthly_pivot_pct[str(i)] = np.round((monthly_pivot_join[str(i)]/monthly_pivot_join['cohort_size'])*100,0).fillna(0).astype(int)

# Transform to percentage (number of retained users in each month/cohort size)
monthly_pivot_pct = monthly_pivot_pct.fillna(0)
monthly_pivot_pct['created_month'] = monthly_pivot_pct['created_month'].astype(str)
monthly_pivot_pct['cohort_size'] = monthly_pivot_pct['cohort_size'].astype(str)
monthly_pivot_pct

Coloring

Let’s color the retention rate and don’t forget to set the min-max value of the rate.

monthly_pivot_pct.style.background_gradient(cmap='Blues',vmin=0,vmax=100)

Highlight

From the result, we can see that the user retention rate goes lower as the month passed. Even users tend to not purchase anything right after their registration, there are months with the lowest retention rate in the same month of users’ registration (February and August, only 27%). This isn’t a good sign for the business. The business team should find a way how to attract users to do more frequent purchases and to acquire more potential users. Or they can reflect on what they are doing in October to December 2022, since the retention in that period is improving.

It has been a long time for me not to work with python. I’m fully aware that my code isn’t perfect. Even I actually found multiple sources out there that perform more advanced code and delivered a more sophisticated result. Nevertheless, I hope it helps!

Cohort Retention Analysis with Python

THANKS!

Written by Tio Faizin

Responses (1)