Indonesian Rainfall Pattern Classification using Time Series K-means

Intermezzo

Flanked by two continents and by two of the main drivers of global climate, the Pacific and Indian Oceans, Indonesia constantly experiences a wide variety of climate and weather events. Its low-latitude location also means Indonesians experience only a few forms of precipitation. Whenever hail pours on a neighborhood, the local news makes it the headline: it is very uncommon here, and hail usually arrives with a devastating thunderstorm. I can't imagine how astonished Indonesians would be if snow ever fell on them (which probably won't happen), because the only form of precipitation we are familiar with is rain.

Indonesian Rainfall Pattern

Rain is hard to predict in Indonesia, as plenty of climate variabilities play a role in triggering it. Java is strongly affected by monsoon activity and usually experiences a high amount of precipitation from December to February. In most parts of Borneo and Sumatra, rainfall usually peaks twice a year, in March and October, following the apparent movement of the sun across the equator. Heavy rain falls mostly in June-July over much of central Sulawesi and Maluku. These three types of rainfall patterns in Indonesia were introduced in 1999 by Bayong in his book "Klimatologi Umum". Much of the subsequent research confirmed and supported these rainfall patterns, such as Aldrian's dissertation, "A regionalization of Indonesian climate based on its annual patterns using double correlation method" (2001).

Bayong, 1999

In this article, I try to classify Indonesian rainfall patterns using a well-known unsupervised learning method, K-means. Since the analysis uses time series data, one small adjustment is needed: I use Dynamic Time Warping (DTW) instead of the Euclidean distance for the distance computation.

Ok well, let the work commence!

Time Series K-means and DTW

K-means is a clustering method that works by allocating every data point to the nearest of a pre-defined number of cluster centroids. There are four major steps in K-means:

  1. Set K: the number of clusters/centroids is picked arbitrarily at the start, which makes the result highly dependent on the chosen K
  2. Assign each observation to the nearest cluster’s centroid
  3. Recalculate the centroids
  4. Repeat steps 2 and 3 until the centroids don’t change.
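The four steps above can be sketched in a few lines of NumPy. This is a toy 1-D example of my own (not the article's code), with two well-separated blobs whose true means are 0 and 5:

```python
import numpy as np

rng = np.random.default_rng(0)
# Two blobs of points around 0 and 5
X = np.concatenate([rng.normal(0, 0.5, 50), rng.normal(5, 0.5, 50)])

# Step 1: set K and initialise the centroids from the data
K = 2
centroids = rng.choice(X, K, replace=False)

for _ in range(100):
    # Step 2: assign each observation to the nearest centroid
    labels = np.argmin(np.abs(X[:, None] - centroids[None, :]), axis=1)
    # Step 3: recalculate the centroids as cluster means
    new = np.array([X[labels == k].mean() for k in range(K)])
    # Step 4: repeat until the centroids no longer change
    if np.allclose(new, centroids):
        break
    centroids = new

print(np.sort(centroids).round(1))  # close to the true means 0 and 5
```

With well-separated data like this, the loop converges in a handful of iterations; with a badly chosen K or unlucky initialisation, plain K-means can land in a poor local optimum.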

The only difference between K-means and time series K-means is the distance computation method: for time series, the proper distance measure is Dynamic Time Warping (DTW).

DTW is well suited to measuring the similarity between two temporal sequences that may vary in speed, length, or timing, where the Euclidean distance is not a suitable choice. The Euclidean distance is computed strictly between the same indices of the two sequences, while DTW computes the distance by mapping each index in one sequence to the nearest matching index in the other. As a result, the Euclidean distance would treat two highly correlated series as totally different if one of them is merely shifted in time.

Conceptualization of the difference between (a) the Euclidean distance and (b) a DTW-based distance (Swanepoel, 2015)

Read this: Dynamic Time Warping for more about DTW.
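To make the contrast concrete, here is a minimal hand-rolled DTW (my own sketch, not the article's or tslearn's implementation) compared against a point-by-point distance on two series where one is just a one-step shifted copy of the other:

```python
import numpy as np

def dtw_distance(s, t):
    """Classic DTW by dynamic programming, absolute-difference cost."""
    n, m = len(s), len(t)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(s[i - 1] - t[j - 1])
            # extend the cheapest of the three admissible warping moves
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

a = np.sin(np.linspace(0, 2 * np.pi, 12))  # one "year" of a smooth cycle
b = np.roll(a, 1)                          # the same cycle, shifted one step
pointwise = np.abs(a - b).sum()            # index-to-index comparison
warped = dtw_distance(a, b)                # DTW realigns the shifted peaks
print(warped < pointwise)                  # True
```

The pointwise distance accumulates an error at every index, while DTW pays only a small boundary cost, which is exactly why it is the better match measure for shifted but correlated series.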

Data Preprocessing

I use ERA5 single-level monthly total precipitation data (2001–2010). Using reanalysis data instead of land observation data (where available) isn't best practice, but for the sake of a first attempt at classifying meteorological data, I think it's acceptable. To represent the whole Indonesian rainfall pattern, I take samples from each of these regions:

  • Sumatra
  • Java-Bali-Nusa Tenggara
  • Borneo
  • Sulawesi
  • Maluku-Papua

Sample size formula, source: SurveyMonkey

I use the formula above to get the sample size for each region and turn it into the Python function below:

import math
import scipy.stats as st

def sample_size(n, prop, conf_level, moe):
    # z-score for the given two-sided confidence level
    z = st.norm.ppf(1 - (1 - conf_level) / 2)
    a = (z**2 * prop * (1 - prop)) / moe**2            # unadjusted sample size
    b = 1 + (z**2 * prop * (1 - prop)) / (moe**2 * n)  # finite-population correction
    return math.ceil(a / b)
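As a quick sanity check of the formula, here is the same computation inlined with made-up inputs of my own (not the article's population): a population of 10,000, a 50% proportion, 95% confidence, and a 3% margin of error reproduce the familiar textbook result.

```python
import math
import scipy.stats as st

# Hypothetical inputs: population 10,000, p = 0.5, 95% confidence, 3% MOE
n, prop, conf_level, moe = 10_000, 0.5, 0.95, 0.03
z = st.norm.ppf(1 - (1 - conf_level) / 2)  # two-sided z-score, about 1.96
a = (z**2 * prop * (1 - prop)) / moe**2    # unadjusted sample size
b = 1 + a / n                              # finite-population correction
print(math.ceil(a / b))                    # 965
```

Using prop = 0.5 is the conservative choice, since p(1 - p) is maximised there and the resulting sample size is the largest.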

The time series K-means Python library expects a 3-dimensional input of shape (number of samples, maximum sample length, number of features). Given the sample size of 1,423, and since the data is a monthly composite consisting of 12 monthly values per sample, the shape of the input data is:

(1423, 12, 1) 

Next, rescale the data using MinMaxScaler so the values range between 0 and 1.
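A minimal sketch of the reshape-and-rescale step, assuming scikit-learn's MinMaxScaler and using random numbers in place of the real precipitation samples:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Random stand-in for 1,423 yearly series of 12 monthly totals
rain = np.random.default_rng(0).random((1423, 12))
# Scale to [0, 1], then add the trailing feature axis tslearn expects
data_scaled = MinMaxScaler().fit_transform(rain)[:, :, np.newaxis]
print(data_scaled.shape, data_scaled.min(), data_scaled.max())
# (1423, 12, 1) 0.0 1.0
```

Note that scikit-learn's MinMaxScaler rescales each column (i.e. each month) independently across samples; tslearn also offers a TimeSeriesScalerMinMax that rescales each series individually, which is a different modelling choice.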

Clustering using tslearn

The complete code for this work is available on my GitHub. The tslearn module provides K-means with a variety of distance computation options. The first step of time series clustering, just as in regular K-means, is deciding the number of clusters K. Even though we are already aware of the three rainfall clusters, it's good to confirm the optimum K first. To do that, we fit the time series K-means with soft-DTW (a variant of DTW that replaces the non-differentiable min operation with a differentiable soft-min operation) for K ranging from 1 to 14.

The metric used is the sum of squared distances between each data point and its nearest cluster centroid (the model's inertia).

from tslearn.clustering import TimeSeriesKMeans

seed = 0
Sum_of_squared_distances = []
K = range(1, 15)
for k in K:
    km = TimeSeriesKMeans(n_clusters=k,
                          metric="softdtw",
                          metric_params={"gamma": .01},
                          verbose=True,
                          random_state=seed)
    km = km.fit(data_scaled)  # data_scaled = your scaled 3D input data
    Sum_of_squared_distances.append(km.inertia_)

Sum of squared distances per K to decide the optimum number of K

Using the elbow method, the optimum K based on the graph above is K = 3: at K = 3, the green line, which represents the sum of squared distances, starts decreasing in a roughly linear fashion.

Let's classify the rainfall data with the number of clusters set to 3.

seed = 0
sdtw_km = TimeSeriesKMeans(n_clusters=3,
                           metric="softdtw",
                           metric_params={"gamma": .01},
                           verbose=True,
                           random_state=seed)
y_pred = sdtw_km.fit_predict(data_scaled)  # data_scaled = your scaled 3D input data

The algorithm assigns each time series sample to the nearest centroid, and each centroid mimics the sequential pattern of its members. In this case, the three resulting clusters represent the previous research on the three Indonesian rainfall patterns quite well. Cluster 1 clearly shows the monsoonal pattern: the amount of rain peaks at both the beginning and the end of the year, while the lowest amounts occur in June-July-August.

Cluster 2 represents the equatorial pattern, which has two rainy-season peaks (March and October). The third cluster, Cluster 3, represents regions with a unimodal pattern peaking in the middle of the year (the local pattern, usually peaking in June-July).

Indonesian Rainfall Pattern Clustering

Cluster 1 (monsoonal pattern) has the most members: 58% of the data belongs to it, followed by Cluster 3 (local pattern) with 24% and Cluster 2 (equatorial pattern) with 17%.
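The membership shares can be read straight off y_pred with numpy.unique. The label counts below are made-up numbers of my own, chosen only to be consistent with the 58/24/17 split and the 1,423-sample size, since the article does not report the raw counts:

```python
import numpy as np

# Made-up labels standing in for sdtw_km.fit_predict(data_scaled)
y_pred = np.array([0] * 830 + [1] * 245 + [2] * 348)
labels, counts = np.unique(y_pred, return_counts=True)
for lab, cnt in zip(labels, counts):
    print(f"Cluster {lab + 1}: {cnt / len(y_pred):.0%}")
# Cluster 1: 58%
# Cluster 2: 17%
# Cluster 3: 24%
```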

In atmospheric science, time series clustering can be useful for various research topics. Discovering patterns, aggregating parameters, and distinguishing every possible cluster are some of the analysis efforts in the long series of attempts to solve the great meteorological conundrum.

References

Aldrian, E. (2001). A regionalization of Indonesian climate based on its annual patterns using double correlation method (dissertation).
Bayong (1999). Klimatologi Umum.
