Detection of Outliers: Data Mining
Abstract
In data mining, preprocessing means preparing the data. It is one of the most important and compulsory tasks. Before a data mining technique such as association, classification, or clustering is applied, noise and outliers should be removed. We have proposed replicator neural networks (RNNs) as an outlier detection algorithm. Here we compare the RNN for outlier detection with three other methods using publicly available data sets (generally small).
Introduction
Data mining is the process of extracting valid, previously unknown, and ultimately comprehensible information from large data sets and using it for organizational decision making. An outlier is defined as a data point that is very different from the rest of the data according to some measure. Such a point often contains useful information about abnormal behavior of the system described by the data. On the other hand, many data mining algorithms in the literature find outliers only as a by-product of clustering. From the viewpoint of a clustering algorithm, outliers are objects not located in any cluster of the data set, and they are usually called noise. Outlier detection is one of the more interesting problems to have arisen recently in data mining research, and a few studies have been conducted on outlier detection for large data sets. Outliers are different from noisy data: in general, noise is not interesting in data analysis, including outlier detection. Outlier detection (OD) has become a significant research problem that aims to find objects that are dissimilar to, exceptional among, or inconsistent with the rest of the data in an existing database. The main aim of detecting outliers is to improve the quality of the data.
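The definition above leaves the "measure" open. As a minimal sketch, assuming a single numeric attribute, one common choice is the modified z-score, which measures distance from the median in units of the median absolute deviation (MAD). The function names, the sample readings, and the 3.5 cutoff are illustrative assumptions, not part of the method discussed in this essay.

```python
import statistics

def modified_z_scores(values):
    """Return the modified z-score of each value, based on median and MAD."""
    med = statistics.median(values)
    abs_dev = [abs(v - med) for v in values]
    mad = statistics.median(abs_dev)
    if mad == 0:
        # Degenerate case: at least half of the values equal the median.
        return [0.0 if d == 0 else float("inf") for d in abs_dev]
    # 0.6745 rescales the MAD so that it is comparable to a standard
    # deviation when the data are normally distributed.
    return [0.6745 * (v - med) / mad for v in values]

def find_outliers(values, threshold=3.5):
    """Return the values whose absolute modified z-score exceeds threshold."""
    scores = modified_z_scores(values)
    return [v for v, s in zip(values, scores) if abs(s) > threshold]

# Hypothetical readings: seven similar values and one very different one.
readings = [10, 11, 9, 10, 12, 10, 11, 50]
print(find_outliers(readings))  # prints [50]
```

The median and MAD are used here instead of the mean and standard deviation because a single extreme value inflates the standard deviation enough to hide itself in a small sample.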
Many machine learning and data mining algorithms do not work well in the presence of outliers. Outliers are interesting because they are suspected of not being generated by the same mechanisms as the rest of the data. Outlier detection is also related to novelty detection in evolving data sets.
Data from wireless sensor networks (WSNs) are especially prone to outliers, for several reasons:
- WSNs report monitored data from the real world using imperfect sensing devices.
- Such devices are battery powered, so their performance tends to degrade as power is exhausted.
- If a WSN is deployed for military or security uses, its sensors are exposed to manipulation by adversaries.
- These networks may include a very large number of sensors, up to millions of nodes depending on the application, so the chance of error is higher than in traditional networks.
Outlier Detection
Outlier detection aims to find patterns in data that do not conform to expected behavior. It is used extensively in a wide variety of applications, such as military surveillance for enemy activities, intrusion detection in cyber security, fraud detection for credit cards, insurance or health care, and fault detection in safety-critical systems. Outliers matter because they can translate into actionable information in these applications.
Defining Outliers
Outliers are patterns in data that do not conform to a well-defined notion of normal behavior. Consider a data set in which most observations lie in two regions of normal behavior. Points that are sufficiently far away from these regions are outliers. Outliers might be induced in the data for a variety of reasons, such as malicious activity (e.g., credit card fraud, cyber-intrusion, or terrorist activity) or the breakdown of a system. An object in a region R outside the normal regions is an outlier in the data set.
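The notion of a point lying "sufficiently far away" from the normal regions can be sketched with a simple distance-based score. The example below assumes two-dimensional points and scores each one by the distance to its k-th nearest neighbour. The function name, the value of k, and the coordinates are hypothetical and chosen only for illustration.

```python
import math

def knn_distance_scores(points, k=2):
    """Score each point by the distance to its k-th nearest neighbour."""
    if not 1 <= k < len(points):
        raise ValueError("k must be between 1 and len(points) - 1")
    scores = []
    for i, p in enumerate(points):
        # Distances from p to every other point, smallest first.
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(dists[k - 1])
    return scores

# Hypothetical data: two dense regions plus one isolated point.
region_1 = [(1.0, 1.0), (1.2, 0.9), (0.9, 1.1), (1.1, 1.2)]
region_2 = [(5.0, 5.0), (5.1, 4.9), (4.9, 5.2), (5.2, 5.1)]
isolated = [(3.0, 9.0)]
points = region_1 + region_2 + isolated

scores = knn_distance_scores(points, k=2)
most_outlying = max(range(len(points)), key=scores.__getitem__)
print(points[most_outlying])  # prints (3.0, 9.0)
```

Points inside either dense region have close neighbours and therefore small scores, while the isolated point has none and receives the largest score. This brute-force version compares every pair of points, so it is only suitable for small data sets.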
Issues
- Resource constraints.
- High communication cost.
- Distributed streaming data.
- Dynamic network topology.
- Large scale deployment.
- Identifying outlier source.
Applications
- Fraud detection - detecting fraudulent applications for credit cards or state benefits, or detecting fraudulent usage of credit cards or mobile phones.
- Loan application processing - detecting fraudulent applications or potentially problematic customers.
- Intrusion detection - detecting unauthorized access in computer networks.
- Activity monitoring - detecting mobile phone fraud by monitoring phone activity, or detecting suspicious trades in the equity markets.
- Network performance - monitoring the performance of computer networks, for example to detect network bottlenecks.
- Fault diagnosis - monitoring processes to detect faults in, for example, motors, generators, pipelines, or space instruments on space shuttles.
- Structural defect detection - monitoring manufacturing lines to detect faulty production runs, for example cracked beams.
Approaches to Outlier Management
Outlier Accommodation
This approach is characterized by the development of a variety of statistical estimation or testing procedures that are robust against, or relatively unaffected by, outliers. In these procedures, the analysis of the main body of data is the key objective, and the outliers themselves are not of prime concern. The approach is difficult to apply where explicit identification of anomalous observations is an important consideration, e.g. suspicious credit card transactions.
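As a small illustration of what "robust against outliers" means in practice, the sketch below compares the ordinary mean with two estimators that are relatively unaffected by a single extreme value: the median and a trimmed mean. The data and the `trimmed_mean` helper are hypothetical and are not taken from any particular statistical package.

```python
import statistics

def trimmed_mean(values, proportion=0.1):
    """Mean after dropping `proportion` of the values from each end."""
    ordered = sorted(values)
    cut = int(len(ordered) * proportion)
    kept = ordered[cut:len(ordered) - cut] if cut else ordered
    return statistics.mean(kept)

# Hypothetical measurements, and the same data with one corrupted value.
clean = [10, 11, 9, 10, 12, 10, 11, 10, 9, 11]
contaminated = clean[:-1] + [500]

for name, data in (("clean", clean), ("contaminated", contaminated)):
    print(name,
          "mean:", statistics.mean(data),
          "median:", statistics.median(data),
          "trimmed mean:", trimmed_mean(data))
```

On this data the mean moves from 10.3 to 59.2 when the corrupted value is introduced, while the median stays at 10 and the trimmed mean changes only from 10.25 to 10.375. Note that neither robust estimator reports which observation was anomalous, which is the limitation described above.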
Retained or Rejected
This approach is characterized by identifying outliers and deciding whether they should be retained or rejected. Many statistical techniques have been proposed to detect outliers, and comprehensive texts on this topic are those by Hawkins and by Barnett and Lewis. These approaches range from informal methods, such as the ordering of multivariate data, the use of graphical and pictorial methods, and the application of simple test statistics, to more formal approaches in which a model for the data is provided and tests of hypotheses that certain observations are outliers are set up against the alternative that they are part of the main body of data.
The identification of outliers has also received much attention from the computing community. However, there appears to be much less work on how to decide whether outliers should be retained or rejected. In the statistical community, a commonly adopted strategy when analyzing data is to carry out the analysis both including and excluding the suspicious values. If there is little difference in the results obtained, then the outliers had minimal effect, but if excluding them does have an effect, it may be better to find an alternative. This is where knowledge-based outlier analysis steps in. In order to distinguish successfully between noisy outlying data and noise-free outliers, different kinds of information are normally needed.
A replicator neural network (RNN) is a variation on the usual regression model: instead of the input vectors being mapped to desired output vectors, the input vectors themselves are also used as the output vectors.
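The replicator idea, in which each input vector is also the training target, can be sketched in a drastically simplified form. The example below is not the network proposed in this essay: it assumes a single linear hidden unit with tied weights, plain gradient descent, and hypothetical two-dimensional data, and it scores each point by its squared reconstruction error on the assumption that poorly replicated points are the outlier candidates.

```python
def train_replicator(data, lr=0.01, epochs=500):
    """Fit a one-hidden-unit linear replicator with tied weights w.

    Each input x is encoded as h = w.x and decoded as x_hat = h * w,
    and w is adjusted so that x_hat reproduces x as closely as possible.
    """
    dim = len(data[0])
    w = [0.5] + [0.1] * (dim - 1)  # fixed start keeps the run reproducible
    for _ in range(epochs):
        grad = [0.0] * dim
        for x in data:
            h = sum(wi * xi for wi, xi in zip(w, x))
            err = [h * wi - xi for wi, xi in zip(w, x)]  # x_hat - x
            err_dot_w = sum(e * wi for e, wi in zip(err, w))
            for i in range(dim):
                # Gradient of ||x_hat - x||^2 with respect to w[i].
                grad[i] += 2 * (err_dot_w * x[i] + h * err[i])
        w = [wi - lr * g for wi, g in zip(w, grad)]
    return w

def reconstruction_errors(data, w):
    """Squared distance between each input and its replicated output."""
    errors = []
    for x in data:
        h = sum(wi * xi for wi, xi in zip(w, x))
        errors.append(sum((h * wi - xi) ** 2 for wi, xi in zip(w, x)))
    return errors

# Hypothetical data: points on the line y = x plus one point off the line.
data = [(i / 5, i / 5) for i in range(-5, 6)] + [(0.5, -0.5)]
weights = train_replicator(data)
errors = reconstruction_errors(data, weights)
suspect = max(range(len(data)), key=errors.__getitem__)
print(data[suspect])  # prints (0.5, -0.5)
```

Because the single hidden unit can only capture the dominant direction in the data, points along y = x are replicated almost exactly, while the off-line point cannot be reproduced and receives a large error. A practical replicator network would use several nonlinear hidden layers; the learning rate, epoch count, and starting weights here are arbitrary choices that happen to suit this small example.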
Conclusion
Many users of data mining assume that noisy data and outliers are the same thing and that both should be removed. Here we have tried to identify the differences between the two: noise is removed in preprocessing, whereas outliers may or may not be removed, depending on the data mining algorithm. Noisy data has no application, whereas outliers may be observed as a by-product of clustering techniques. Outliers are viewed differently depending on the application and the method. Identified outliers should not simply be discarded but analyzed, whereas noisy data is simply removed. Objective measures need to be developed and assessed for their usefulness in comparing outlier detectors.