Cyber Profiling: K-Means Clustering

  • Clustering groups data into clusters so that the data you need is easier to find.
  • Clustering partitions similar objects into separate groups. It is usually applied in statistical data analysis and is used in many fields, for example machine learning, data mining, pattern recognition, image analysis, and bioinformatics.
  • Clustering is a type of unsupervised learning.
  • Four clustering algorithms have been compared on performance: K-Means, hierarchical clustering, self-organizing map (SOM), and expectation maximization (EM clustering).
  • Based on the test results, it can be concluded that K-Means and EM perform better than hierarchical clustering.

In general, partitioning algorithms such as K-Means and EM are recommended for large data sets, whereas hierarchical clustering algorithms perform well on small data sets.

The method of the K-means algorithm is as follows:

  1. Determine the number of clusters k. The choice of k rests on theoretical and conceptual considerations about how many clusters the data should have.
  2. Generate k initial centroids (cluster center points) at random. The initial centroids are chosen at random from the available objects, one per cluster. To calculate the centroid v of a cluster in later iterations, use the following formula:

$$ v = \frac{1}{N} \sum_{i=1}^{N} x_i \tag{1} $$

where x_i is the i-th object in the cluster and N is the number of objects that are members of the cluster.

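3. Calculate the distance from each object to the centroid of each cluster. The distance between an object and a centroid is calculated using the Euclidean distance:

$$ d(a, b) = \sqrt{\sum_{i=1}^{n} (a_i - b_i)^2} $$

where n is the number of attributes (dimensions) of an object, a_i is the i-th attribute of object a, and b_i is the i-th attribute of object b.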

4. Allocate each object to the nearest centroid. During an iteration, allocation can generally be done in two ways: with hard K-Means, where every object is explicitly declared a member of exactly one cluster based on its distance to each cluster's center point, or with fuzzy C-Means, where each object has a degree of membership in every cluster.

5. Iterate: compute the new centroid positions using equation (1).

6. Repeat from step 3 if the new centroid positions differ from the previous ones; otherwise stop.
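
The steps above can be sketched in a few lines of plain Python. This is a minimal illustration of hard K-Means, not the implementation used in the study: the function names (`euclidean`, `mean_point`, `k_means`), the `seed` and `max_iter` parameters, the sample points, and the choice to leave an empty cluster's centroid where it was are all assumptions made for the example.

```python
import math
import random


def euclidean(a, b):
    """Euclidean distance between two equal-length points (step 3)."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))


def mean_point(points):
    """Centroid of a cluster: the per-attribute mean of its members (equation 1)."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def k_means(points, k, seed=0, max_iter=100):
    """Hard K-Means on a list of tuples; returns (centroids, labels)."""
    rng = random.Random(seed)
    # Step 2: pick k distinct objects as the initial centroids.
    centroids = [tuple(p) for p in rng.sample(points, k)]
    labels = []
    for _ in range(max_iter):
        # Steps 3-4: assign every object to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: euclidean(p, centroids[c]))
            for p in points
        ]
        # Step 5: recompute each centroid from its current members.
        new_centroids = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            # Assumption: an empty cluster keeps its previous centroid.
            new_centroids.append(mean_point(members) if members else centroids[c])
        # Step 6: stop once the centroids no longer move.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, labels


if __name__ == "__main__":
    # Made-up 2-D points forming two well-separated groups.
    data = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
    centroids, labels = k_means(data, k=2)
    print(centroids)
    print(labels)
```

On this sample the first three points end up in one cluster and the last three in the other, whichever two points are drawn as the starting centroids. Which cluster receives index 0 still depends on the random draw.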

  • The idea of cyber profiling is derived from criminal profiling, which gives investigators information for classifying the types of criminals who were at the crime scene. Profiling is based on what is known and not known about the criminal.
  • A profile is information about an individual or group of individuals that is accumulated, stored, and used for various purposes, for example by monitoring their behavior through their internet activity.
  • Cyber profiling is difficult to implement because user data is diverse and online behavior sometimes differs from actual behavior. Because individual behavior is so particular, inductive generalizations can be very reliable but can also lead to a misreading of the behavior being analyzed. The cyber-profiling process therefore combines deductive and inductive methods.
  • For investigations, the cyber-profiling process makes a useful contribution to the field of computer forensics.
  • Cyber profiling is one of the efforts an investigator makes to identify alleged offenders by analyzing data patterns that cover aspects of technology, investigation, psychology, and sociology.
  • Profiling of criminals is also often known as cyber-criminal profiling or criminal investigative analysis.
  • Criminal profiles take the form of data on the offender's personal traits, tendencies, habits, and geographic-demographic characteristics (for example age, gender, socioeconomic status, education, and place of origin or residence).
  • Preparing a criminal profile involves analyzing the physical evidence found at the crime scene, building an understanding of the victim (victimology), identifying the modus operandi (whether the crime was planned or unplanned), and tracing what the perpetrator deliberately left behind (signature).
  • A newer approach to cyber profiling uses clustering techniques to classify Web-based content through user preference data. These preferences can be interpreted as an initial grouping of the data, so the resulting clusters represent user profiles.

  • User profiling can be seen as inferring a user's interests, intentions, characteristics, behavior, and preferences.
  • User profiles are created to describe the user's background knowledge. A user profile represents the conceptual model the user holds when searching for information on the web.

  • The K-Means algorithm is used for the cyber profiling process.
  • K-Means meets expectations here because it is a simple algorithm with a good degree of accuracy. Its disadvantage is that the initial cluster centers are chosen at random, which can lead to different clustering results from run to run.
  • In the early stage, the primary data obtained contains information about the websites users accessed via the internet. Besides informative websites, the data also contains operating system updates, web browser updates, and website advertising that usually appears as pop-ups.
  • Applying the K-Means algorithm yields the level of visits to each website. Visits are divided into three groups: low, medium, and high.
  • Clustering with the RapidMiner and SPSS applications produces the same clusters of data. The results show three clusters with different values.
  • Those values represent the number of websites assigned to each cluster.
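
As a rough illustration of the low, medium, and high grouping, the sketch below clusters per-website visit counts with a one-dimensional K-Means. It is not the RapidMiner or SPSS workflow: the site names and counts are made up, the helper names (`kmeans_1d`, `visit_levels`) are hypothetical, and rerunning with several random starts and keeping the run with the smallest within-cluster squared error is a common way to soften the random-initialization drawback, assumed here rather than taken from the study.

```python
import random


def kmeans_1d(values, k, rng, max_iter=100):
    """One run of hard K-Means on 1-D data; returns (centers, labels, sse)."""
    # Random start: k distinct observed values (needs at least k distinct values).
    centers = rng.sample(sorted(set(values)), k)
    labels = []
    for _ in range(max_iter):
        # Assign each value to its nearest center.
        labels = [min(range(k), key=lambda c: abs(v - centers[c])) for v in values]
        # Recompute each center as the mean of its members.
        new_centers = []
        for c in range(k):
            members = [v for v, lab in zip(values, labels) if lab == c]
            new_centers.append(sum(members) / len(members) if members else centers[c])
        if new_centers == centers:
            break
        centers = new_centers
    # Within-cluster sum of squared errors, used to compare runs.
    sse = sum((v - centers[lab]) ** 2 for v, lab in zip(values, labels))
    return centers, labels, sse


def visit_levels(visits, restarts=10, seed=0):
    """Label each site 'low', 'medium', or 'high' by its visit count."""
    rng = random.Random(seed)
    counts = list(visits.values())
    # Several random starts; keep the run with the smallest error.
    centers, labels, _ = min(
        (kmeans_1d(counts, 3, rng) for _ in range(restarts)),
        key=lambda run: run[2],
    )
    # Rank clusters by center so the names follow the visit level.
    order = sorted(range(3), key=lambda c: centers[c])
    names = dict(zip(order, ("low", "medium", "high")))
    return {site: names[lab] for site, lab in zip(visits, labels)}


if __name__ == "__main__":
    # Hypothetical visit counts per website.
    visits = {
        "news.example": 3, "mail.example": 5, "shop.example": 4,
        "video.example": 40, "forum.example": 45,
        "search.example": 120, "social.example": 130,
    }
    for site, level in visit_levels(visits).items():
        print(site, level)
```

A single run started from, say, the two largest counts plus one small one can settle on a poorer grouping that lumps the small and middle counts together, which is why the sketch compares several runs instead of trusting one.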

K-Means is a very effective and easy clustering method in machine learning. It has many use cases in the security domain similar to cyber profiling. Some interesting ones are identifying crime localities, insurance fraud detection, call detail record analysis, automatic clustering of IT alerts, crime document classification, and rideshare data analysis.


