Cluster Analysis of NYC Taxi Trips Using K-Means
This report describes how the program was implemented and presents the results and further analysis.
Introduction
Clustering problems arise in many different applications, such as data mining and knowledge discovery, data compression and vector quantization, and pattern recognition and pattern classification. Because of the large size of such data, it becomes very difficult to perform effective analysis using existing traditional techniques. Apache Hadoop has become the platform of choice for developing large-scale data-intensive applications.
Implementation
Mapper
Parallelization of this algorithm comes mainly from scanning and reading the data in parallel. We assume that the k center points are small enough to be placed in the distributed cache for sharing. At the beginning of the map phase, the k center points are loaded from the cache file. For each record, the nearest center is determined by comparing the distances to all centers, and the output is <center point ID, current number of points, current center of those points>. In the mapper stage the current number of points is 1, and the current center is simply the coordinates of the current record.
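The sketch below illustrates a mapper of this form. It assumes the trip records have been preprocessed so that the first two comma-separated fields are the pickup coordinates, and that the driver has placed the current centers in the distributed cache under the symlink name centers.txt; class and file names are illustrative rather than the exact ones in our code.

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class KMeansMapper extends Mapper<LongWritable, Text, IntWritable, Text> {
    // Current k centers, loaded once per mapper from the cached file (assumption:
    // the driver shared them via the distributed cache as centers.txt).
    private final List<double[]> centers = new ArrayList<>();

    @Override
    protected void setup(Context context) throws IOException {
        try (BufferedReader reader = new BufferedReader(new FileReader("centers.txt"))) {
            String line;
            while ((line = reader.readLine()) != null) {
                // Tolerate an optional leading center ID (as written by the reducer)
                // by taking the last two fields as the coordinates.
                String[] parts = line.trim().split("[,\\t]");
                double cx = Double.parseDouble(parts[parts.length - 2]);
                double cy = Double.parseDouble(parts[parts.length - 1]);
                centers.add(new double[]{cx, cy});
            }
        }
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Assumption: the record starts with the two pickup coordinates.
        String[] fields = value.toString().split(",");
        double x = Double.parseDouble(fields[0]);
        double y = Double.parseDouble(fields[1]);

        // Find the nearest center by squared Euclidean distance.
        int nearest = 0;
        double best = Double.MAX_VALUE;
        for (int i = 0; i < centers.size(); i++) {
            double dx = x - centers.get(i)[0];
            double dy = y - centers.get(i)[1];
            double d = dx * dx + dy * dy;
            if (d < best) {
                best = d;
                nearest = i;
            }
        }

        // Emit <center ID, "count,x,y">; at the map stage the count is 1 and the
        // current center is the point itself.
        context.write(new IntWritable(nearest), new Text("1," + x + "," + y));
    }
}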
Combiner
The Hadoop combiner is an optional class in the MapReduce framework that sits between the Map and Reduce classes and reduces the amount of data received by the reducers by merging the map output locally. In this class we receive the key-value pairs from the mappers that share the same key (the same center ID), merge their counts and running centers into a new record <center point ID, current number of points, current center of those points>, and pass it on to the reducer.
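A combiner matching this description could look like the sketch below; it reuses the assumed "count,x,y" value encoding from the mapper sketch and merges partial counts and running centers per center ID.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class KMeansCombiner extends Reducer<IntWritable, Text, IntWritable, Text> {
    @Override
    protected void reduce(IntWritable key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        long count = 0;
        double sumX = 0.0;
        double sumY = 0.0;
        for (Text value : values) {
            // Each value is "count,centerX,centerY"; weight the coordinates by the
            // count so merging partial means stays correct.
            String[] parts = value.toString().split(",");
            long n = Long.parseLong(parts[0]);
            count += n;
            sumX += Double.parseDouble(parts[1]) * n;
            sumY += Double.parseDouble(parts[2]) * n;
        }
        // Emit the merged count and the running center for this center ID.
        context.write(key, new Text(count + "," + (sumX / count) + "," + (sumY / count)));
    }
}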
Reducer
Records with the same key are routed to the same node through the partitioner. In the reduce phase, all points belonging to the same center ID are aggregated to recompute the final center, and the output of this iteration becomes the input for the next iteration.
Driver
For better control of the whole system, we developed a driver class; minimal sketches of the reducer and the driver are given below.
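The reducer sketch, under the same assumed "count,x,y" value encoding as the earlier sketches, merges the partial counts and running centers for each center ID into the new center.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class KMeansReducer extends Reducer<IntWritable, Text, IntWritable, Text> {
    @Override
    protected void reduce(IntWritable key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        long count = 0;
        double sumX = 0.0;
        double sumY = 0.0;
        for (Text value : values) {
            // Values arrive as "count,centerX,centerY" from the mapper or combiner.
            String[] parts = value.toString().split(",");
            long n = Long.parseLong(parts[0]);
            count += n;
            sumX += Double.parseDouble(parts[1]) * n;
            sumY += Double.parseDouble(parts[2]) * n;
        }
        // The recomputed center is written out and becomes input to the next iteration.
        context.write(key, new Text((sumX / count) + "," + (sumY / count)));
    }
}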
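The driver can be sketched as follows: it runs one MapReduce job per iteration and points each job's distributed cache at the centers produced by the previous iteration. The paths, the fixed iteration count, and the single-reducer output file name are assumptions, since the report does not state a convergence criterion.

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class KMeansDriver {
    public static void main(String[] args) throws Exception {
        String input = args[0];       // trip records on HDFS
        String centers = args[1];     // file holding the initial k centers
        String outputBase = args[2];  // base directory for per-iteration output
        int iterations = 10;          // assumed fixed iteration count

        for (int i = 0; i < iterations; i++) {
            Configuration conf = new Configuration();
            Job job = Job.getInstance(conf, "k-means iteration " + i);
            job.setJarByClass(KMeansDriver.class);
            job.setMapperClass(KMeansMapper.class);
            job.setCombinerClass(KMeansCombiner.class);
            job.setReducerClass(KMeansReducer.class);
            job.setMapOutputKeyClass(IntWritable.class);
            job.setMapOutputValueClass(Text.class);
            job.setOutputKeyClass(IntWritable.class);
            job.setOutputValueClass(Text.class);

            // Share the current centers with every mapper through the distributed
            // cache; the "#centers.txt" fragment creates the symlink read in setup().
            job.addCacheFile(new URI(centers + "#centers.txt"));

            FileInputFormat.addInputPath(job, new Path(input));
            String output = outputBase + "/iter" + i;
            FileOutputFormat.setOutputPath(job, new Path(output));

            if (!job.waitForCompletion(true)) {
                System.exit(1);
            }
            // Centers produced by this iteration feed the next one.
            centers = output + "/part-r-00000";
        }
    }
}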
Experiment
To evaluate the performance and results of our MapReduce program, we set up two different kinds of experiments. The data set we use comes from NYC. The data in the attached datasets were collected and provided to the NYC Taxi and Limousine Commission (TLC) by technology providers authorized under the Taxicab & Livery Passenger Enhancement Programs (TPEP/LPEP).
Experiment I: Performance with Different Data Scales and Different Numbers of Cluster Nodes. We use three data scales, 4,000, 8,000, and 12,000 records, and two different numbers of nodes to evaluate the performance of the program.
Experiment II: Results for Different k. We use the data set of 12,000 records and set different values of k to examine the distribution of the clusters.
Conclusion
From Experiment I, we can infer that adding more nodes to the cluster speeds up the program by about 40%. In particular, when the scale of the data grows much larger, the benefit brought by additional nodes becomes remarkable.
From Experiment II, the figures show how different values of k divide the data points, although continuing to increase k yields no further meaningful separation.
There are many studies that use NYC taxi data; they focus more on the social behavior revealed in the data. After implementing the program, we found that if we focus more on the other information the data carry, rather than only the geographical locations of the pickup and drop-off points, our research could have more practical meaning. For example, if the target of study is the fare, we can evaluate it along different dimensions such as trip distance, location, and payment type. This could inspire more ideas than what we have done so far.