Usage data of a group of users distributed across a number of categories,
such as songs, movies, webpages, links, regular household products, mobile
apps, games, etc., can be ultra-high dimensional and massive in size. Such data
is often categorical and sparse, which makes it even more difficult to uncover
underlying hidden patterns such as clusters of users. However, if this
structure can be estimated accurately, it can have a large impact in business
areas such as user recommendations for apps,
songs, movies, and other similar products, health analytics using electronic
health record (EHR) data, and driver profiling for insurance premium estimation
or fleet management.
In this work, we propose a clustering strategy for such categorical big data
that exploits the hidden sparsity of the dataset. Most traditional clustering
methods fail to recover meaningful clusters for such data and instead produce
one large cluster surrounded by small ones, irrespective of the true cluster
structure of the data. We propose a feature transformation that maps the
binary-valued usage vector to a lower-dimensional continuous feature space
defined in terms of groups of usage categories, termed covariate classes. The
resulting lower-dimensional feature representations can be used
for clustering. We implemented the proposed strategy and applied it to a large,
very high-dimensional song playlist dataset to validate its performance. We
obtained similar-sized user clusters with minimal between-cluster overlap in
the feature space (8% on average). As the proposed strategy has a very generic
framework, it can be
utilized as the analytic engine for many of the above-mentioned business use
cases, enabling an intelligent and dynamic personal recommendation system or a
support system for smart business decision-making.
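
As an illustration of the kind of feature transformation described above, the following minimal sketch maps binary usage vectors to a lower-dimensional continuous representation. It rests on assumptions not stated in the text: the assignment of usage categories to covariate classes is taken as given, and each feature is simply the fraction of a class's categories that a user has used. The function name covariate_class_features and the toy data are hypothetical, and the actual transformation may differ.

```python
def covariate_class_features(usage, item_class, n_classes):
    """Map binary usage vectors to per-class usage fractions.

    usage      : list of rows, one per user, with 0/1 entries per category
    item_class : list giving the (assumed, pre-computed) covariate class
                 index of each usage category
    n_classes  : number of covariate classes
    """
    # Number of usage categories that belong to each covariate class.
    class_sizes = [0] * n_classes
    for c in item_class:
        class_sizes[c] += 1

    features = []
    for row in usage:
        # Count how many categories of each class this user has used.
        counts = [0] * n_classes
        for used, c in zip(row, item_class):
            counts[c] += used
        # Normalize by class size to get a continuous value in [0, 1];
        # empty classes are mapped to 0.0 to avoid division by zero.
        features.append(
            [cnt / size if size else 0.0 for cnt, size in zip(counts, class_sizes)]
        )
    return features


# Toy example: 3 users, 6 usage categories, 3 hypothetical covariate classes.
usage = [
    [1, 1, 0, 0, 0, 0],
    [1, 0, 0, 0, 0, 1],
    [0, 0, 1, 1, 1, 0],
]
item_class = [0, 0, 1, 1, 1, 2]
features = covariate_class_features(usage, item_class, n_classes=3)
```

Each user is now described by three continuous values instead of six binary ones, and these rows could be passed to any standard clustering routine. On real data the reduction would be from a very large number of sparse binary columns to a small number of covariate classes.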