Project Report On Constructing A Breast Cancer Outcome Prediction Model
Abstract: Breast cancer is one of the most common cancers and a major cause of mortality among women. An accurate and reliable diagnostic system can promote early diagnosis. Breast cancer prediction here means classifying an identified tumour as benign or malignant. In this paper, we implement a combination of the AdaBoost and random forest algorithms to construct a breast cancer outcome prediction model using the Wisconsin Diagnostic Breast Cancer (WDBC) dataset. The main concern was to improve performance as much as possible and to reduce type II errors (false negatives, FN), in which a patient is classified as healthy (benign) although the tumour is actually malignant. In the experiments, the implemented approach achieves an accuracy of 97.07%, compared with 93.7% for Naive Bayes and 95.9% for Random Forest.
Keywords: Breast cancer, diagnosis, Random Forest, AdaBoost, classification.
Introduction
Breast cancer is one of the most common cancers and a major cause of death among women worldwide. When it is detected at a late stage, treatment is often no longer effective. The disease occurs mostly in older women, but it can also occur in younger women. Breast cancer starts when cells of the breast grow out of control and multiply rapidly. Tumours are of two types, malignant and benign. Malignant tumours can spread to surrounding tissues and to other areas and organs of the body. Benign tumours do not spread to other areas and do not invade surrounding tissues. Cancerous cell growth that is diagnosed early has a better chance of being treated successfully. Once the tumour spreads, treatment becomes less effective and the patient's chances of survival are comparatively lower. According to the statistics, more than 90% of women diagnosed with breast cancer at the initial stages survive their disease for at least 5 years, compared with approximately 15% of women diagnosed at advanced stages of the disease [8].
Decision trees are now widely used in the medical domain, and several studies have successfully used them to extract knowledge from medical data sets. AdaBoost is a well-known method owing to its low error rate and good performance on low-noise data sets. As a successor of the original boosting algorithm, it combines a set of weak classifiers to form a model with better predictive performance [9]. We propose a combination of AdaBoost and random forests for predicting breast cancer outcome from the Wisconsin Diagnostic Breast Cancer (WDBC) dataset.
The remainder of this paper is organized as follows. Section II introduces previous related work in this domain. Section III presents the methodology used in this paper. Experimental results and discussion are presented in Section IV. The conclusion and an outline of future work are given in Section V.
Literature survey
Breast cancer affects a large number of women today, and researchers have conducted many studies on its diagnosis. Once a tumour is discovered, doctors have to determine whether it is benign or malignant.
A review of the literature shows that several studies have addressed the early detection and prevention of breast cancer using machine learning techniques.
In [1], Support Vector Machine (SVM) and K-Nearest Neighbours (KNN) classifiers were applied to the Wisconsin Diagnostic Breast Cancer (WDBC) dataset with 629 instances. The proposed SVM method was found to perform better than the other SVM variants and the KNN algorithm, with an accuracy of 98.57%.
In [2], Support Vector Machines and an Artificial Neural Network (ANN) were applied to WDBC with 629 instances. In this study SVM outperformed ANN, with an accuracy of 97.14%.
In [3], a new method to detect breast cancer with high accuracy is proposed. The method consists of two main parts. In the first part, image processing techniques are used to prepare the mammography images for feature and pattern extraction. The extracted features are then used as input to two supervised learning models, a Back Propagation Neural Network (BPNN) and a Logistic Regression (LR) model, and the results and accuracy of the two models are compared. The results show that the LR model used many more features than the BPNN, which achieved a good regression value exceeding 93% with only 240 features.
In [4], the authors inspected the generalization performance of J48, Naïve Bayes, and SVM in order to improve prediction models for decision-making systems in the prediction of breast cancer survivability. They use a voting classifier approach in which all three classification algorithms are combined for the prediction of breast cancer.
In [5], an SVM-based ensemble method is implemented in which twelve SVM variants are hybridized using the proposed Weighted Area Under the Receiver Operating Characteristic Curve Ensemble (WAUCE). On the Wisconsin Diagnostic Breast Cancer (WDBC) dataset, this model achieves an accuracy of around 97%, higher than the common ensemble methods adaptive boosting and bagging, and it outperforms the individual models on small datasets.
In [6], a decision tree algorithm is employed for classification. A hybrid method is proposed to enhance the classification accuracy on breast cancer data sets. The training data is tested with 10-fold cross-validation. The data sets are preprocessed to remove missing values, and feature selection methods are used to eliminate attributes that have no significance in the classification process. Bagging the training data set is noted as one of the most common methods of improving decision trees.
Methodology
In this section we first describe the breast cancer data used. We then present the methodology used in the implementation.
DataSet
Our investigation is based on the original Wisconsin Diagnostic Breast Cancer (WDBC) data set obtained from the UCI Machine Learning Repository, an open online repository [12]. The data set was collected periodically over three years by Dr. William H. Wolberg of the University of Wisconsin Hospitals. It consists of 569 instances classified as benign or malignant: 212 cases are malignant and 357 are benign. The attributes are:
- ID number
- Diagnosis
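10 real-valued features are computed for each cell nucleus (the data set records the mean, standard error, and worst value of each, for 30 feature columns in total):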
- radius (mean of distances from center to points on the perimeter)
- texture (standard deviation of gray-scale values)
- perimeter
- area
- smoothness (local variation in radius lengths)
- compactness
- concavity (severity of concave portions of the contour)
- concave points (number of concave portions of the contour)
- symmetry
- fractal dimension
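As a quick check of the figures above, the following minimal sketch loads a copy of WDBC and counts the classes. It assumes the copy bundled with scikit-learn (`load_breast_cancer`) rather than the UCI file used in this work; that copy omits the ID column and keeps the diagnosis as a separate target, so it exposes 30 feature columns. The helper name `summarize_wdbc` is hypothetical.

```python
from collections import Counter

from sklearn.datasets import load_breast_cancer


def summarize_wdbc():
    """Return (n_instances, n_features, class_counts) for scikit-learn's WDBC copy."""
    data = load_breast_cancer()
    # target_names maps the integer labels to "malignant" / "benign".
    labels = [str(data.target_names[t]) for t in data.target]
    return data.data.shape[0], data.data.shape[1], Counter(labels)


if __name__ == "__main__":
    n_instances, n_features, counts = summarize_wdbc()
    print(n_instances, n_features, dict(counts))
```

The class counts returned by this helper can be compared directly with the 212 malignant and 357 benign cases stated in the data set description.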
Evaluation methodology
We first perform dimensionality reduction using the Principal Component Analysis (PCA) algorithm. PCA obtains the principal components from a set of possibly correlated variables by performing an orthogonal transformation. The principal components are uncorrelated and lie along eigenvectors of the data covariance matrix, each accounting for some proportion of the variance in the data.
Let $X = \{x_1, x_2, \dots, x_N\}$ be the training data, where each $x_i$ represents a tuple with dimension $D$, which is 32 in our case. The aim of PCA is to extract the most important information from $X$ and to compress its dimensionality by retaining only that information. Here PCA is an orthogonal projection of the original $D$-dimensional data onto a new 2-dimensional space, chosen so that the variance of the projected data is maximized.
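The projection described above can be sketched with NumPy as follows. This is an illustration under stated assumptions, not the implementation used in this work: the function name `pca_project` is hypothetical, the data are only mean-centred (no scaling), and the directions are taken as the leading eigenvectors of the sample covariance matrix.

```python
import numpy as np


def pca_project(X, n_components=2):
    """Project X (N x D) onto its top principal directions.

    Returns the projected data (N x n_components), the projection matrix
    (D x n_components) and the fraction of variance each component explains.
    """
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)                 # centre each attribute
    cov = np.cov(Xc, rowvar=False)          # D x D sample covariance
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1][:n_components]
    W = eigvecs[:, order]                   # directions of largest variance
    Z = Xc @ W                              # orthogonal projection
    explained = eigvals[order] / eigvals.sum()
    return Z, W, explained


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Synthetic correlated data stands in for the real attributes here.
    X = rng.normal(size=(200, 6)) @ rng.normal(size=(6, 6))
    Z, W, explained = pca_project(X)
    print(Z.shape, explained)
```

Choosing the eigenvectors with the largest eigenvalues is what makes the projected variance as large as possible, and the resulting components are mutually uncorrelated.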
In the next step we use the AdaBoost algorithm for classifier learning. AdaBoost is a boosting ensemble method that strengthens a weak learner, and we have chosen Random Forest as the weak learner. In each round, a Random Forest is first trained and its weighted error rate is noted. Its weight in the ensemble is then calculated from this error, and the weights of the wrongly classified points are increased so that the next learner concentrates on them. This process is repeated until the configured number of Random Forest learners has been trained. AdaBoost then combines the weighted learners to make the final prediction. In our experiments this technique achieved an accuracy of 97.07%.
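A minimal sketch of this pipeline with scikit-learn is given below. It is not the exact configuration used in this work: the hyperparameters (`n_rounds`, `n_trees`, `max_depth`), the standardization step, and the helper names `build_model` and `cross_validated_accuracy` are assumptions made for illustration, and the data come from scikit-learn's bundled copy of WDBC.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def build_model(n_rounds=10, n_trees=10, max_depth=3, seed=0):
    """PCA to 2 components, then AdaBoost with a Random Forest base learner."""
    forest = RandomForestClassifier(
        n_estimators=n_trees, max_depth=max_depth, random_state=seed
    )
    # The base learner is passed positionally because its keyword name
    # differs between scikit-learn versions.
    booster = AdaBoostClassifier(forest, n_estimators=n_rounds, random_state=seed)
    # Standardization is an assumption of this sketch, not a step from the text.
    return make_pipeline(StandardScaler(), PCA(n_components=2), booster)


def cross_validated_accuracy(model, X, y, n_splits=10, seed=0):
    """Mean accuracy over stratified k-fold cross-validation."""
    cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    return cross_val_score(model, X, y, cv=cv, scoring="accuracy").mean()


if __name__ == "__main__":
    X, y = load_breast_cancer(return_X_y=True)
    print(cross_validated_accuracy(build_model(), X, y))
```

Shallow forests are used here so that each base learner stays weak enough for boosting to have errors to reweight; the accuracy this sketch reports will generally differ from the figures in this paper.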
Experimental results
This section describes the evaluation parameters and presents the results for the classifiers used in the paper. The evaluation techniques discussed below are based on the visualization and results obtained from the confusion matrix.
Confusion matrix
The confusion matrix is a visualization tool commonly used to present the performance of classifiers in classification tasks [10]. The effectiveness of the classification model is calculated from the number of correct and incorrect classifications for each possible value of the variable being classified [11].
It reports the numbers of true positives (TP), false positives (FP), true negatives (TN) and false negatives (FN).
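To make these quantities concrete, the sketch below counts TP, FP, TN and FN from label lists and derives accuracy, sensitivity and specificity from them. It assumes malignant ("M") is the positive class, so that a false negative is a malignant case predicted as benign; the function names and the toy labels are hypothetical.

```python
def confusion_counts(y_true, y_pred, positive="M"):
    """Return (TP, FP, TN, FN), treating `positive` as the positive class."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive and truth == positive:
            tp += 1
        elif pred == positive:
            fp += 1  # benign case flagged as malignant
        elif truth == positive:
            fn += 1  # malignant case missed (type II error)
        else:
            tn += 1
    return tp, fp, tn, fn


def _ratio(numerator, denominator):
    # Avoid a ZeroDivisionError when a class is absent from the data.
    return numerator / denominator if denominator else float("nan")


def evaluation_metrics(tp, fp, tn, fn):
    """Accuracy, sensitivity (recall on malignant) and specificity."""
    return {
        "accuracy": _ratio(tp + tn, tp + fp + tn + fn),
        "sensitivity": _ratio(tp, tp + fn),
        "specificity": _ratio(tn, tn + fp),
    }


if __name__ == "__main__":
    y_true = ["M", "M", "M", "B", "B", "B", "B", "B"]
    y_pred = ["M", "M", "B", "B", "B", "B", "M", "B"]
    print(evaluation_metrics(*confusion_counts(y_true, y_pred)))
```

Sensitivity is the metric tied most directly to the false negatives this work aims to reduce, since it falls whenever a malignant case is classified as benign.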
Discussion
The results in Table 1 and Fig. 1 show that the implemented combination of Random Forest and AdaBoost has the best performance in terms of accuracy, sensitivity and specificity. This implies that it has a higher chance of correctly differentiating between malignant and benign cases.
Conclusion
In this paper we have proposed a combination of the AdaBoost and Random Forest algorithms for constructing a breast cancer diagnostic prediction model. The capability and effectiveness of the proposed method are illustrated using 10-fold cross-validation, accuracy, sensitivity and specificity. The results show that the proposed method reaches an accuracy of 97.07%, higher than that of Naive Bayes (93.7%) and Random Forest (95.9%). These results suggest that the approach is a useful basis for developing further prediction models.
Future work involves increasing the number of classifiers used in the ensemble and studying its efficiency and accuracy in comparison with other ensembles.
References
[1] Md. Milon Islam, Hasib Iqbal, Md. Rezwanul Haque, and Md. Kamrul Hasan, "Prediction of Breast Cancer using Support Vector Machine and K-nearest neighbours", IEEE, Dec. 2017, pp. 2572-7621.
[2] Reem Alyami, Jinnan Alhajjaj, Batool Alnajrani, Ilham Elaalami, Abdullah Alqahtani, Nahier Aldhafferi and Sunday O. Olatunji, "Investigating the effect of Correlation based Feature Selection on breast cancer diagnosis using Artificial Neural Network and Support Vector Machines", IEEE, Feb. 2017, pp. 978-1-4673-8765-1.
[3] Moh'd Rasoul Al-hadidi, Abdulsalam Alarabeyyat, Mohannad Alhanahnah, "Breast Cancer Detection using K-nearest Neighbor Machine Learning Algorithm", IEEE, Aug.-Sept. 2016, pp. 2161-1343.
[4] U. Karthik Kumar, M. B. Sai Nikhil and K. Sumangali, "Prediction of Breast Cancer using Voting Classifier Technique", IEEE, 2-4 Aug. 2017, pp. 978-1-5090-5905-8.
[5] Haifeng Wang, Bichen Zheng, Sang Won Yoon, Hoo Sang Ko, "A support vector machine-based ensemble algorithm for breast cancer diagnosis", Elsevier, 1 June 2018, Volume 267, Issue 2, pp. 687-699.
[6] Lavanya Doddipalli and K. Usha Rani, "Ensemble Decision Tree Classifier For Breast Cancer Data", ijitcs, 2012, 10.5121/ijitcs.2012.2103.
[7] Tan A C and Gilbert D, "Ensemble machine learning on gene expression data for cancer classification", Applied Bioinformatics, 2(3 Suppl):S75-83, February 2003.
[8] https://www.cancerresearchuk.org/about-cancer/cancer-symptoms/why-is-early-diagnosis-important
[9] Pei-Chann Chang, Chen-Hao Liu, Chin-Yuan Fan, Jun-Lin Lin, Chih-Ming Lai, "An Ensemble of Neural Networks for Stock Trading Decision Making", Springer, 2009, Emerging Intelligent Computing Technology and Applications. With Aspects of Artificial Intelligence, pp. 1-10.
[10] J. Han and M. Kamber, "Data mining: concepts and techniques", 2nd ed. San Francisco: Morgan Kaufmann, Elsevier Science, 2006.
[11] P. Cabena, P. Hadjinian, R. Stadler, J. Verhees and A. Zanasi, "Discovering data mining from concept to implementation", Upper Saddle River, N.J.: Prentice Hall, 1998.
[12] M. Lichman, UCI Machine Learning Repository, 2013. [Online]. Available: https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+(Diagnostic)