K-Means : From Start to State..!!

Shrishti Kapoor
Jul 19, 2021

In today’s world, with the increased usage of the Internet, the amount of data generated is incomprehensibly massive. We need big data analysis tools to handle it. Data mining algorithms and techniques, along with machine learning, give us ways of interpreting big data in an understandable way. K-means is a data clustering algorithm that can be used for unsupervised machine learning and is capable of working with a large number of clusters.

Before looking at the importance of the K-means algorithm and its use cases, let’s discuss where the algorithm comes from.

CLUSTERING

Clustering is one of the most widely used data analysis techniques for getting an intuition about the structure of the data. It is a machine learning technique that involves grouping data points, and it is a type of unsupervised learning. In data science, we can use cluster analysis to gain valuable insights from our data by seeing which groups the data points fall into when we apply a clustering algorithm. Clustering (sometimes called cluster analysis) is usually used to organize data into structures that are more easily understood and manipulated.

Now let’s move on to K-means, which is considered one of the most widely used clustering algorithms because of its simplicity.

K-Means

K-Means is probably the most well-known clustering algorithm. It’s taught in a lot of introductory data science and machine learning classes, and it’s easy to understand and implement in code. A clustering algorithm like K-Means can help you split the data into distinct groups so that the data points in each group are similar to each other.

K-Means is an unsupervised learning algorithm that groups an unlabeled dataset into different clusters. It is an iterative process that tries to minimize the distance between each data point and the average data point (the centroid) of its cluster.
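One common way to write this goal down is the within-cluster sum of squared distances. The notation here is an assumption for illustration, not something taken from the article: C_1, ..., C_k are the clusters, x_i are the data points, and mu_j is the centroid (mean) of cluster C_j.

```latex
\[
J = \sum_{j=1}^{k} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^2,
\qquad
\mu_j = \frac{1}{|C_j|} \sum_{x_i \in C_j} x_i
\]
```

A smaller J means the points sit closer to their own cluster’s centroid, which is what “similar to each other” means in this setting.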

The K-Means algorithm calls for:

  1. Choosing the number of clusters “k”.
  2. Randomly assigning each point to a cluster.
  3. Repeating the following steps until the clusters stop changing:
  • For each cluster, compute the cluster centroid by taking the mean vector of the points in the cluster.
  • Assign each data point to the cluster whose centroid is the closest.
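The steps above can be sketched in plain Python. This is a minimal illustration, not a production implementation: the function names, the fixed random seed, the iteration cap, and the way an empty cluster is handled (re-seeding it with a random point) are all assumptions made for this example.

```python
import random

def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean_vector(points):
    """Component-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def k_means(points, k, seed=0, max_iter=100):
    """Cluster `points` into k groups; returns (labels, centroids)."""
    rng = random.Random(seed)
    # Step 2: randomly assign each point to a cluster.
    labels = [rng.randrange(k) for _ in points]
    centroids = [None] * k
    for _ in range(max_iter):
        # Step 3, first bullet: centroid = mean vector of the cluster's points.
        for j in range(k):
            members = [p for p, label in zip(points, labels) if label == j]
            # Assumption: an empty cluster is re-seeded with a random data point.
            centroids[j] = mean_vector(members) if members else rng.choice(points)
        # Step 3, second bullet: assign each point to the closest centroid.
        new_labels = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:  # clusters stopped changing
            break
        labels = new_labels
    return labels, centroids

# Two visibly separate groups of 2-D points (made-up data).
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
labels, centroids = k_means(points, k=2)
```

Here `labels[i]` is the cluster index of `points[i]`. Which group ends up as cluster 0 or cluster 1 depends on the random start, so only the grouping itself is meaningful.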

The steps are shown in the chart below:

The K-Means algorithm is widely used in a variety of applications, such as market segmentation, document clustering, image segmentation, and image compression.

K-Means Advantages:

1) When the number of variables is large, K-Means is usually computationally faster than hierarchical clustering, provided we keep k small.

2) K-Means produces tighter clusters than hierarchical clustering, especially if the clusters are globular.

Use Cases for K-Means

Classification of Network Traffic

Problem: As more and more services begin to use APIs on your application, or as your website grows, it is important to know where the traffic is coming from. For example, you want to be able to block harmful traffic and double down on the areas driving growth. However, it is hard to tell which is which when classifying the traffic.

Solution: K-means clustering is used to group together characteristics of the traffic sources. Once the clusters are created, you can classify the traffic types. The process is faster and more accurate than the previously used AutoClass method. With precise information on traffic sources, you are able to grow your site and plan capacity effectively.

Document Classification

Problem: Cluster documents into multiple categories based on tags, topics, and the content of the document. This is a very standard classification problem.

Solution: K-means is a highly suitable algorithm for this purpose. Initial processing is needed to represent each document as a vector, using term frequency to identify commonly used terms that help classify the document. The document vectors are then clustered to identify similarity among groups of documents.
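The “document as a vector” step can be sketched as follows. This is a simplified example under stated assumptions: whitespace tokenization, relative term frequency as the only weighting, and three made-up documents. Real pipelines usually add stop-word removal and other preprocessing.

```python
from collections import Counter

def term_frequency_vectors(documents):
    """Turn each document into a vector of relative term frequencies."""
    tokenized = [doc.lower().split() for doc in documents]
    # One vector position per distinct term, in alphabetical order.
    vocabulary = sorted({term for tokens in tokenized for term in tokens})
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens) or 1  # guard against an empty document
        vectors.append([counts[term] / total for term in vocabulary])
    return vocabulary, vectors

def squared_distance(a, b):
    """Squared Euclidean distance between two document vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

# Made-up documents: two about football, one about politics.
documents = [
    "football match goal goal",
    "election vote parliament vote",
    "goal keeper football",
]
vocabulary, vectors = term_frequency_vectors(documents)

# The two football documents end up closer to each other than to the politics one.
same_topic = squared_distance(vectors[0], vectors[2])
different_topic = squared_distance(vectors[0], vectors[1])
```

These vectors are what a K-means run would then group, so documents sharing frequent terms tend to land in the same cluster.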

Customer Segmentation

Problem: Marketers want to improve their customer base, work on target areas, and segment customers based on purchase history, interests, or activity monitoring.

Solution: K-means can group customers by purchase history, interests, or monitored activity. For example, telecom providers can cluster pre-paid customers to identify patterns in the money spent on recharging, sending SMS, and browsing the internet. The resulting segments help the company target specific clusters of customers with specific campaigns.

Fantasy league stat analysis

Problem: You want to create a fantasy draft team and would like to identify similar players based on player stats.

Solution: Analysing player stats has always been a critical element of the sporting world, and with increasing competition, machine learning has a critical role to play here. As an interesting exercise, k-means can be a useful option for grouping similar players.

Call detail record analysis

Problem: A call detail record (CDR) is the information captured by telecom companies during the call, SMS, and internet activity of a customer. This information is very important because it provides greater insight into the customer’s needs when used together with customer demographics.

Solution: Cluster customer activity over 24 hours using the unsupervised k-means clustering algorithm. The clusters are used to understand segments of customers with respect to their usage by hour.
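Before any clustering can happen, each customer’s activity has to become a 24-slot vector, one slot per hour. Here is a rough sketch of that preparation step. The record layout (customer id, hour of day) and the choice to normalise counts into shares are assumptions for this example, not a description of a real CDR format.

```python
def hourly_usage_profile(records):
    """Build one 24-slot activity vector per customer from (customer_id, hour) records."""
    counts = {}
    for customer_id, hour in records:
        profile = counts.setdefault(customer_id, [0] * 24)
        profile[hour] += 1
    # Assumption: convert counts to shares so heavy and light users
    # are compared by *when* they are active, not by how much.
    return {
        customer_id: [count / sum(profile) for count in profile]
        for customer_id, profile in counts.items()
    }

# Hypothetical records: one (customer id, hour 0-23) pair per call, SMS, or data session.
records = [("A", 9), ("A", 9), ("A", 10), ("B", 22), ("B", 23), ("B", 23), ("B", 23)]
profiles = hourly_usage_profile(records)
```

Each value in `profiles` is a point in 24-dimensional space, which is the kind of input k-means could then group into segments such as “daytime users” and “late-night users”.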

Cyber-profiling criminals - SECURITY DOMAIN

Problem: In today’s information society, security has become the chief problem. Cyber-profiling is the process of collecting data from individuals and groups to identify significant correlations. The idea of cyber-profiling is derived from criminal profiling, which gives the investigation division information for classifying the types of criminals who were at a crime scene. The difficulty in implementing cyber-profiling lies in the diversity of user data and in the fact that online behaviour is sometimes different from actual behaviour.

Solution: In the study behind this use case, K-Means is used as the algorithm for the cyber-profiling process. It met the expectations of the study because it is a simple algorithm with a good degree of accuracy.

In insurance fraud detection, it is possible to isolate new claims based on their proximity to clusters that indicate fraudulent patterns. Insurance fraud can potentially have a multi-million dollar impact on a company, so the ability to detect fraud is crucial.
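The proximity idea for claims can be sketched like this. Everything specific here is hypothetical: the two centroids, the two already-scaled features per claim, which cluster counts as “fraud-like”, and the distance threshold would all come from an earlier clustering run and from domain knowledge.

```python
import math

def nearest_cluster(point, centroids):
    """Return (index, distance) of the centroid closest to `point`."""
    distances = [math.dist(point, c) for c in centroids]
    index = min(range(len(centroids)), key=distances.__getitem__)
    return index, distances[index]

# Hypothetical centroids from an earlier K-Means run on past claims.
# Assumed features, both scaled to 0..1: (claim amount, number of prior claims).
centroids = [(0.1, 0.1), (0.9, 0.8)]
fraud_like_clusters = {1}  # assumption: cluster 1 was judged to show fraudulent patterns

def looks_suspicious(claim, max_distance=0.3):
    """Flag a claim that lies close to a cluster marked as fraud-like."""
    index, distance = nearest_cluster(claim, centroids)
    return index in fraud_like_clusters and distance <= max_distance

flag_a = looks_suspicious((0.85, 0.75))  # close to the fraud-like centroid
flag_b = looks_suspicious((0.15, 0.20))  # close to the ordinary centroid
flag_c = looks_suspicious((0.60, 0.50))  # nearest to cluster 1, but too far away
```

A flag like this is only a hint for a human reviewer to look closer, not proof of fraud.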

Some more areas are delivery store optimization, identifying crime localities, insurance fraud detection, and rideshare data analysis.

From all this, the use cases and importance of the K-Means algorithm in today’s world are pretty clear, especially in the security domain.

