Applied Hierarchical Cluster Analysis with Average Linkage Algorithm

This research was conducted in Sidoarjo District using secondary data from the book "Kabupaten Sidoarjo Dalam Angka 2016". The authors chose 12 variables that represent sub-district characteristics in Sidoarjo. These variables cover four sectors, namely geography, education, agriculture and industry. To assess how evenly geographic, educational, agricultural and industrial conditions are distributed across the sub-districts, an analysis is needed that classifies the sub-districts based on their characteristics. Hierarchical cluster analysis is an analytical technique used to classify objects or cases into relatively homogeneous groups called clusters. The results are expected to provide information about the dominant and non-dominant sub-district characteristics in the four sectors, based on the clusters that are formed.


INTRODUCTION
Sub-district characteristics give a general overview of a sub-district and need to be developed optimally so that they have a positive impact on the sub-district's progress. Sub-district characteristics are divided into several sectors: geography, education, government, social affairs, agriculture, industry, commerce, communications, finance and prices, and regional income. Sidoarjo District is divided into 18 sub-districts: Sidoarjo Sub-district, Buduran Sub-district, Candi Sub-district, Porong Sub-district, Krembung Sub-district, Tulangan Sub-district, Tanggulangin Sub-district, Jabon Sub-district, Krian Sub-district, Balongbendo Sub-district, Wonoayu Sub-district, Tarik Sub-district, Prambon Sub-district, Taman Sub-district, Waru Sub-district, Gedangan Sub-district, Sedati Sub-district and Sukodono Sub-district. The potential of Sidoarjo District is spread evenly over the 18 sub-districts and is reflected in the sub-district characteristics. To support equitable development and improve people's welfare, the government of Sidoarjo District, in collaboration with BPS Sidoarjo District, published "Sidoarjo Dalam Angka 2016", which contains the sub-district characteristics of Sidoarjo. This book is expected to benefit the implementation of development and to help evaluate and supervise the development outcomes of Sidoarjo District. The sub-district data in "Sidoarjo Dalam Angka 2016" have so far been analyzed only descriptively; the authors therefore consider that these data could yield much more information if analyzed further. This research applies hierarchical cluster analysis to four sectors that represent sub-district characteristics, namely geography, education, agriculture and industry. The results are expected to provide information about the dominant and non-dominant sub-district characteristics in the four sectors, based on the clusters that are formed.

METHODS
Cluster analysis is a technique used to classify objects into relatively homogeneous groups, called clusters. Objects in each cluster tend to resemble each other and to differ greatly from objects in other clusters. Cluster analysis preceded by principal component analysis can be applied to interval- and ratio-scaled data. Cluster analysis is also called classification analysis or numerical taxonomy, because it deals with clustering procedures in which each object fits into one and only one cluster, so that clusters do not overlap [1].
Several terms are used in cluster analysis, including the following [2]:
• Agglomeration Schedule: a schedule that shows which objects or cases are merged into clusters at each stage of a hierarchical cluster analysis.
• Cluster Centroid: the mean value of all variables over the objects or cases in a particular cluster.
• Cluster Centers: the starting points of the grouping in non-hierarchical cluster analysis.
• Cluster Membership: indicates the cluster to which each object or case belongs.
• Dendrogram: a graphical tool for presenting the results of cluster analysis, in which vertical lines represent clusters that are merged together. The position of a line on the scale indicates the distance at which the clusters were merged. The dendrogram should be read from left to right.
The assumptions of normality, linearity and homoscedasticity are important in other multivariate analyses, but not in cluster analysis. In cluster analysis, researchers should be more concerned with how well the sample represents the population and with the presence or absence of multicollinearity.
The first step is to formulate the clustering problem by defining the variables on which the grouping will be based. Then an appropriate distance measure should be selected. The distance measure determines the similarity or dissimilarity of the objects to be grouped. Deciding on the number of clusters requires the subjective judgment of the researcher in addition to objective calculation results. The clusters obtained should be interpreted and described in terms of the variables used to form them.
The measure most commonly used for the distance between item X and item Y is the Euclidean distance, which is calculated as follows [3]:
$$d(X, Y) = \sqrt{\sum_{i=1}^{p} (X_i - Y_i)^2}$$
where X_i and Y_i are the values of variable i for items X and Y, and p is the number of variables.
There are two types of cluster analysis: hierarchical and non-hierarchical. Hierarchical clustering has two basic types, namely agglomerative (merging) and divisive (splitting). In the agglomerative method, every object or observation starts as a separate cluster. At each subsequent stage, the two most similar clusters are combined into a new cluster, and so on. The divisive method, in contrast, starts from one large cluster consisting of all objects or observations. The objects or observations that are most dissimilar are then split off, and so on [4].
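As an illustration of the Euclidean distance measure described above, the following is a minimal sketch in Python. The function names `euclidean` and `distance_matrix` and the example values are hypothetical and are not taken from the study data; the study itself was processed in SPSS 20.

```python
import math


def euclidean(x, y):
    """Euclidean distance between two observations measured on the same variables."""
    if len(x) != len(y):
        raise ValueError("observations must have the same number of variables")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def distance_matrix(rows):
    """Symmetric matrix of pairwise Euclidean distances between observations."""
    n = len(rows)
    d = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d[i][j] = d[j][i] = euclidean(rows[i], rows[j])
    return d


# Hypothetical scores of three objects on two variables (not the study data).
objects = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]
D = distance_matrix(objects)
```

An agglomerative procedure starts from a matrix like `D` and repeatedly merges the pair of clusters with the smallest distance.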
There are five algorithms for forming groups with the hierarchical method [5]:
• Single-Linkage
The single linkage method defines the similarity between clusters as the shortest distance from any object in one cluster to any object in the other. If a third object has the closest distance to one of the objects in a group that has already been formed, that object is merged into the group. This process continues until a single group is formed. This is the most flexible of the agglomerative methods.

• Complete Linkage
This method is basically the same as the single linkage method, except that the distance used is the maximum distance. The reason for using the maximum distance is that objects that have little in common can still be connected.

• Average-Linkage
The average linkage method is similar to the two previous methods (single and complete linkage). The only difference is that the distance used is the average distance between all objects in a group and the objects outside the group. Objects are grouped with one another based on the minimum average distance. Because it uses an average, this method is considered more stable and unbiased.

• Centroid method
The distance used in this method is the distance between the center points of the two groups. The center point (centroid) of a group is the mean value of each variable over the objects in the group. Each time a new group is formed, its center point changes. The advantage of this method is that outliers have little effect on the formation of the groups.

• Ward's Method
In Ward's method, the distance calculation is based on the sum of squares between the two groups over all variables. This method can be used if the number of observations is not too large.
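The linkage rules listed above can be compared with a short sketch. The code below is a simplified illustration, not the SPSS procedure used in this research: it implements only the single, complete and average linkage rules (the centroid and Ward's methods are omitted), and the function names and one-variable example data are hypothetical.

```python
import math
from itertools import product


def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def linkage_distance(points, a, b, method="average"):
    """Distance between clusters a and b, given as lists of indices into points."""
    pairwise = [euclidean(points[i], points[j]) for i, j in product(a, b)]
    if method == "single":      # shortest distance between the two clusters
        return min(pairwise)
    if method == "complete":    # maximum distance between the two clusters
        return max(pairwise)
    if method == "average":     # mean of all between-cluster distances
        return sum(pairwise) / len(pairwise)
    raise ValueError(f"unknown method: {method}")


def agglomerate(points, method="average"):
    """Return the agglomeration schedule as (cluster_a, cluster_b, distance) tuples."""
    clusters = [[i] for i in range(len(points))]  # every object starts alone
    schedule = []
    while len(clusters) > 1:
        # Find the pair of clusters with the smallest linkage distance.
        best = None
        for p in range(len(clusters)):
            for q in range(p + 1, len(clusters)):
                d = linkage_distance(points, clusters[p], clusters[q], method)
                if best is None or d < best[0]:
                    best = (d, p, q)
        d, p, q = best
        schedule.append((clusters[p], clusters[q], d))
        merged = clusters[p] + clusters[q]
        clusters = [c for k, c in enumerate(clusters) if k not in (p, q)]
        clusters.append(merged)
    return schedule


# Hypothetical one-variable data: two tight pairs and one distant object.
points = [(0.0,), (1.0,), (5.0,), (6.0,), (20.0,)]
schedule = agglomerate(points, method="average")
```

In this example the two pairs are merged first, then joined to each other, and the distant object is absorbed last, which is the order an agglomeration schedule would report.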
In general, the distance measure used with Ward's method is the squared Euclidean distance.
The counterpart of hierarchical cluster analysis is non-hierarchical cluster analysis. This method does not involve a "treelike construction"; instead, objects are placed directly into clusters. The first step in the non-hierarchical method is to choose an initial cluster center, and all objects within a certain distance of it are placed in the resulting cluster. Then the next cluster center is selected, and the placement of objects continues until all objects have been placed. Objects can be reassigned if they are closer to another cluster than to their original cluster. Non-hierarchical cluster analysis methods are associated with K-means clustering, and there are three approaches for assigning each observation to a single cluster [6]:
• Sequential Threshold: this method starts by selecting one cluster center and placing all objects that lie within a certain distance into that cluster. Once all such objects have been assigned, a second cluster center is selected and all objects within a certain distance of it are assigned to it. Then a third cluster center is selected, and the process continues as before.
• Parallel Threshold: in contrast to the first approach, this method selects several cluster centers simultaneously and places each object into the cluster whose center is nearest, within the threshold distance. During the process, the threshold distance can be adjusted to include more or fewer objects in the clusters. In some variations of this method, objects remain ungrouped if they lie outside a certain distance from every cluster.
• Optimization: this method is similar to the previous two, except that it allows objects to be reassigned to a closer cluster.
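The sequential threshold and optimization approaches can be sketched as follows. This is a simplified illustration under stated assumptions: the first unassigned object is taken as the next cluster center, the centers are not recomputed as a full K-means procedure would do, and the function names, threshold and data are hypothetical.

```python
import math


def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def sequential_threshold(points, threshold):
    """Assign objects to clusters one cluster center at a time.

    Assumption: the first object that is still unassigned becomes the next center.
    """
    labels = [None] * len(points)
    centers = []
    for i, p in enumerate(points):
        if labels[i] is not None:
            continue  # already placed in an earlier cluster
        centers.append(p)
        cluster_id = len(centers) - 1
        # Place every unassigned object within the threshold distance.
        for j, q in enumerate(points):
            if labels[j] is None and euclidean(p, q) <= threshold:
                labels[j] = cluster_id
    return labels, centers


def reassign_to_nearest(points, centers):
    """Optimization step: move each object to the cluster with the nearest center."""
    return [
        min(range(len(centers)), key=lambda k: euclidean(p, centers[k]))
        for p in points
    ]


# Hypothetical one-variable data (not the study data).
points = [(0.0,), (2.0,), (3.0,), (4.0,), (9.0,)]
labels, centers = sequential_threshold(points, threshold=2.5)
optimized = reassign_to_nearest(points, centers)
```

Here the object at 2.0 is first placed in the cluster centered at 0.0, but the optimization step moves it to the cluster centered at 3.0 because that center is closer.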
This research used secondary data sourced from "Kabupaten Sidoarjo Dalam Angka 2016". The units of observation were the 18 sub-districts in Sidoarjo District. The authors chose 12 variables that represent sub-district characteristics. The operational definition of each variable is as follows:
X1: Surface area (km²)
X2: Total population
X3: The number of national and private elementary schools
X4: The number of national and private elementary school students
X5: The number of national and private junior high schools
X6: The number of national and private junior high school students
X7: The number of national and private high schools
X8: The number of national and private high school students
X9: Harvest land area (Ha)
X10: Paddy production (Kw)
X11: The number of large and small industries
X12: The number of workers in large and small industries
The analysis in this research consisted of the following steps:
1. Standardization of the data, because the variables are measured in different units
2. Correlation analysis and principal component analysis of the research variables
3. Classification of the sub-districts in Sidoarjo using hierarchical cluster analysis with the average linkage algorithm
Data processing for the hierarchical cluster analysis with the average linkage algorithm was performed with SPSS 20. Before the cluster analysis, principal component analysis was carried out to deal with correlation between variables, and the variables were standardized so that they share the same unit, making the data eligible for cluster analysis. The results of the cluster analysis were then interpreted and conclusions were drawn.
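The standardization and correlation steps described above can be sketched as follows. This is only an illustration and not the SPSS 20 output: it assumes z-scores based on the sample standard deviation, it omits the principal component analysis step, and the variable values are hypothetical rather than taken from "Kabupaten Sidoarjo Dalam Angka 2016".

```python
import math
from statistics import mean, stdev


def standardize(column):
    """Z-scores of one variable: (value - mean) / sample standard deviation."""
    m, s = mean(column), stdev(column)
    return [(v - m) / s for v in column]


def pearson(x, y):
    """Pearson correlation coefficient between two variables."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical values of two variables measured in very different units.
area_km2 = [10.0, 20.0, 30.0, 40.0]
population = [50000.0, 90000.0, 130000.0, 170000.0]

z_area = standardize(area_km2)
z_pop = standardize(population)

# After standardization both variables have mean 0 and standard deviation 1,
# so neither dominates the Euclidean distance because of its unit.
r = pearson(z_area, z_pop)
```

A correlation close to 1 or -1 between two variables, as in this artificial example, signals the kind of multicollinearity that the principal component analysis step is meant to address.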

RESULTS AND DISCUSSION
The 18 sub-districts of Sidoarjo District formed four clusters; the members of each cluster are given in Table 1. The four clusters are based on data that give a general overview of sub-district characteristics in Sidoarjo, represented by four sectors: geography, education, agriculture and industry. Cluster 1 consists of one sub-district, cluster 2 of 14 sub-districts, cluster 3 of two sub-districts and cluster 4 of one sub-district. A descriptive analysis was conducted to identify the characteristics of each cluster. The members of each cluster are as follows:
1. Members of Cluster 1: Sidoarjo Sub-district
2. Members of Cluster 2: Buduran Sub-district, Candi Sub-district, Porong Sub-district, Krembung Sub-district, Tulangan Sub-district, Tanggulangin Sub-district, Krian Sub-district, Balongbendo Sub-district, Wonoayu Sub-district, Tarik Sub-district, Prambon Sub-district, Taman Sub-district, Gedangan Sub-district and Sukodono Sub-district
3. Members of Cluster 3: Jabon Sub-district and Sedati Sub-district
4. Members of Cluster 4: Waru Sub-district
Once the number of clusters and the members of each cluster were known, a descriptive analysis of each cluster was performed. To find the most dominant characteristics of each cluster, the highest average of each variable across the clusters was identified. A summary of the average values in each cluster is given in Table 2.
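The descriptive step described above, in which the highest cluster average is sought for each variable, can be sketched as follows. The function names, cluster labels and values are hypothetical and do not reproduce Table 2.

```python
from statistics import mean


def cluster_means(rows, labels):
    """Mean of every variable within each cluster.

    rows   : list of observations, each a list of variable values
    labels : cluster label of each observation, in the same order
    """
    grouped = {}
    for row, label in zip(rows, labels):
        grouped.setdefault(label, []).append(row)
    return {
        label: [mean(col) for col in zip(*members)]
        for label, members in grouped.items()
    }


def dominant_cluster(means):
    """For each variable, the label of the cluster with the highest mean."""
    n_vars = len(next(iter(means.values())))
    return [max(means, key=lambda c: means[c][v]) for v in range(n_vars)]


# Hypothetical data: four objects, two variables, three clusters.
rows = [[10.0, 1.0], [2.0, 3.0], [4.0, 5.0], [1.0, 9.0]]
labels = [1, 2, 2, 3]

means = cluster_means(rows, labels)
dominant = dominant_cluster(means)
```

In this sketch a variable is called dominant in the cluster where its average is highest, which mirrors the reading of the cluster averages used in the discussion.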

Table 2. Summary of Average Value in Each Cluster

Based on Table 3, the dominant and non-dominant characteristics of the four clusters can be described as follows:
1. Cluster 1, consisting of Sidoarjo Sub-district, has the dominant variables X2, X3, X4, X5, X6, X7 and X8. This indicates that the average total population, the average number of national and private elementary schools, the average number of national and private elementary school students, the average number of national and private junior high schools, the average number of national and private junior high school students, the average number of national and private high schools and the average number of national and private high school students are highest in Sidoarjo Sub-district. The six dominant variables X3, X4, X5, X6, X7 and X8 represent the condition of education, which shows that education in Sidoarjo Sub-district has progressed the most compared with the 17 other sub-districts.