Machine Learning Approach For Disease
Abstract
Insights from real-time disease surveillance systems help the public take preventive measures against diseases. They also benefit pharmaceutical manufacturers by improving the sales of medicines for a particular disease and ensuring adequate availability of medicines when they are needed. A disease outbreak is an event in which the number of positive cases of a disease rises within a short span of time. An outbreak can be limited to a particular region or time of the year.
Diseases can be detected by several approaches, social media being a preferred method due to the availability of real-time data. Hence, data from social media, especially Twitter, can be used to detect live events and monitor them efficiently. In order to detect diseases precisely, this paper proposes an approach wherein tweets, which are collected and pre-processed, can be effectively vectorized and clustered into the appropriate diseases using the Agglomerative Clustering technique. The tweets can also be visualised using their geo information in order to generate zones that have a high density of diseases. Such a surveillance system can be very useful for early prediction of disease outbreaks, which in turn can facilitate faster and better handling of the situation.
Keywords: Twitter, Disease, Embedding, Agglomerative Clustering, Visualisation
INTRODUCTION
A disease outbreak is an event in which the occurrences of a disease increase in a region. Outbreaks usually last for a few days or a few weeks. An event in which there are a few cases of a previously unheard-of disease, or of a disease caused by a new bacterium or virus, can also be termed a disease outbreak.
Using the tweets posted by people, outbreaks can be effectively detected and alerts can be issued, so that people can take measures to protect themselves and health organisations can undertake the measures necessary to control the outbreak. We believe that our study could benefit both aspiring researchers who are looking to learn valuable lessons in data mining, analysis and visualization, and the health-care sector, which would like to see the potential of Twitter data in providing valuable insights during the surveillance of diseases.
Twitter is an online microblogging and social media platform where a lot of information is exchanged, mostly as messages called 'tweets' that offer insights into many aspects of life. Here, users share links, pictures, comments on live events, and their experiences and opinions on a variety of topics. Twitter produces massive amounts of data at an unprecedented scale of nearly 500 million tweets per day from close to 336 million active users. It is the availability of large amounts of Twitter data, the large outreach of tweets and the ease with which they can be fetched that has made Twitter a very popular source for researchers looking for data to draw insights from. Twitter is also preferred by researchers and data miners due to its API, which allows data to be mined easily.
This makes the publicly available data a valuable resource for mining and for discovering interesting and actionable healthcare insights. Hence, Twitter has also drawn great interest from the public health community as a way to answer many health-related questions regarding the detection and spread of certain diseases. We feel that, compared to conventional methods for the detection of diseases, analysis of data obtained from Twitter is faster, more economical and more precise. The collected Twitter data is cleaned, vectorized and passed through supervised or unsupervised learning algorithms to gain insights. A few such algorithms that are suited to Twitter data include K-means clustering and Agglomerative clustering. Agglomerative clustering is a bottom-up hierarchical clustering method where clusters have sub-clusters, which in turn have sub-clusters. Agglomerative clustering starts with every single object (here, a tweet) in its own cluster.
Then, in each successive iteration, it agglomerates (merges) the closest pair of clusters according to some similarity criterion, until all the tweets are clustered into 'n' clusters.
The rest of this paper is organized as follows: Section II discusses previous related work on the analysis of Twitter data in the field of disease surveillance. Section III describes how the data was collected. Section IV illustrates the methodology used, which comprises Pre-processing and Embeddings. Section V presents the Clustering of the disease tweets. Section VI presents the Tweet Analysis. Section VII describes the Visualization of the geo-tagged tweets. Finally, Section VIII includes the results and the conclusion of the paper.
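The bottom-up merging procedure described in this section can be illustrated with a minimal, library-free sketch. Several details here are assumptions made only for illustration and are not taken from the paper: a plain bag-of-words count stands in for the embeddings, cosine distance is used as the similarity criterion, clusters are compared by average linkage, and the four example tweets are invented.

```python
import math
from collections import Counter


def vectorize(text):
    """Bag-of-words term counts (an assumed stand-in for the paper's embeddings)."""
    return Counter(text.lower().split())


def cosine_distance(a, b):
    """1 - cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


def agglomerative(texts, n_clusters):
    """Merge the closest pair of clusters until n_clusters remain."""
    vectors = [vectorize(t) for t in texts]
    # Start with every tweet in its own cluster (clusters hold tweet indices).
    clusters = [[i] for i in range(len(texts))]

    def average_linkage(c1, c2):
        # Assumed linkage: mean pairwise distance between the two clusters.
        total = sum(cosine_distance(vectors[i], vectors[j]) for i in c1 for j in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > n_clusters:
        best = None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                d = average_linkage(clusters[x], clusters[y])
                if best is None or d < best[0]:
                    best = (d, x, y)
        _, x, y = best
        clusters[x] = clusters[x] + clusters[y]  # agglomerate the closest pair
        del clusters[y]
    return clusters


# Invented example tweets, used only to exercise the sketch.
tweets = [
    "down with the flu and a fever",
    "flu season fever and chills",
    "dengue cases rising after the rain",
    "dengue mosquito cases in the city",
]
clusters = agglomerative(tweets, n_clusters=2)
print(clusters)
```

The exhaustive pair search keeps the sketch short but scales poorly with the number of tweets, so a real pipeline would more likely rely on an optimized library implementation.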
RELATED WORK
Twitter data has been widely used by researchers in the past in order to gain insights and develop models on various topics. Healthcare has been one of the leading domains due to the easy availability of large amounts of data and the presence of a variety of algorithms, which result in a large number of ways of solving a problem. In this section, we present an overview of the related work carried out in this domain in the past.
Kathy Lee et al. (2013) described a real-time flu and cancer surveillance system that uses spatial, temporal and text mining on Twitter data. The real-time analysis results are reported visually in terms of US disease surveillance maps, distribution and timelines of disease types, symptoms, and treatments.
Their surveillance is useful not only for early prediction of seasonal disease outbreaks such as flu, but also for monitoring the distribution of cancer patients with different cancer types and symptoms in each state and the popularity of the treatments used. They have also performed text analysis on different types of cancer and flu.
An approach for reliable classification of tweets based on influenza-related keywords was presented by Kenny Byrd et al. (2016). With this approach, the spread of influenza can be predicted with high accuracy and monitored in selected cities in real time. Their approach consists of efficient extraction of data from Twitter streams, classifying the extracted tweets based on their sentiment and visualizing the data via a real-time interactive map.
The tweets have been classified into positive, negative or neutral sentiments using supervised learning techniques, namely, Naive Bayes, Maximum Entropy and Dynamic Language Model classifiers.
Neha Garg et al. (2017) extracted Twitter data, preprocessed it and clustered the tweets geographically (that is, based on their geographical coordinates) using K-means clustering, with the Elbow method used to determine the number of clusters.
The previous works focus on using Twitter data for surveillance of a particular disease (flu, influenza or cancer), on applying supervised learning techniques to classify tweets based on sentiment and visualizing them on a map, and on geographical clustering of tweets using the K-means technique.
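The combination of K-means and the Elbow method mentioned above for geographical clustering can be sketched as follows. This is an illustration under stated assumptions, not a reconstruction of the cited work: the coordinates are invented, latitude and longitude are treated as flat planar values, the deterministic farthest-point initialization is chosen only for reproducibility, and the 5% threshold used to locate the elbow is one simple heuristic among several.

```python
def sq_dist(p, q):
    """Squared planar distance between two (lat, lon) pairs."""
    return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2


def init_centroids(points, k):
    """Deterministic farthest-point initialization (an assumed choice)."""
    centroids = [points[0]]
    while len(centroids) < k:
        farthest = max(points, key=lambda p: min(sq_dist(p, c) for c in centroids))
        centroids.append(farthest)
    return centroids


def kmeans(points, k, iterations=20):
    """Plain Lloyd iterations; returns the centroids and the inertia."""
    centroids = init_centroids(points, k)
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: sq_dist(p, centroids[i]))
            groups[nearest].append(p)
        # Move each centroid to the mean of its group; keep it if the group is empty.
        centroids = [
            (sum(p[0] for p in g) / len(g), sum(p[1] for p in g) / len(g)) if g else centroids[i]
            for i, g in enumerate(groups)
        ]
    inertia = sum(min(sq_dist(p, c) for c in centroids) for p in points)
    return centroids, inertia


def elbow(inertias, threshold=0.05):
    """Smallest k whose next step cuts inertia by less than `threshold` of the k=1 value.

    inertias[i] is the inertia obtained with k = i + 1.
    """
    base = inertias[0]
    for i in range(len(inertias) - 1):
        if (inertias[i] - inertias[i + 1]) / base < threshold:
            return i + 1
    return len(inertias)


# Invented (lat, lon) points forming three tight geographic groups.
points = [
    (12.9, 77.5), (13.0, 77.6), (13.1, 77.7),
    (19.0, 72.8), (19.1, 72.9), (19.2, 73.0),
    (28.5, 77.1), (28.6, 77.2), (28.7, 77.3),
]
inertias = [kmeans(points, k)[1] for k in range(1, 6)]
best_k = elbow(inertias)
print(best_k)
```

In practice the elbow is often read off a plot of inertia against k rather than computed with a fixed threshold, and distances over large areas would need to account for the curvature of the Earth.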
DATA COLLECTION
Twitter data was collected for the months of January to June using the REST API, which is used to interact with Twitter services. Tweepy, the library interface for the Twitter API, provided access to all of the Twitter RESTful API methods. Each method accepted various parameters and returned responses. The Streaming API was used to retrieve tweets in real time or to create a live feed using a user stream. Tweepy classifies the most common Twitter messages and routes them to appropriately named methods, but these methods are only stubs that have to be overridden. Stream, another Tweepy component, establishes a streaming session and routes messages to the StreamListener instance. StdOutListener is a class that extends StreamListener and prints received tweets to an output stream (file or terminal). filter, a method of Stream, filters the Twitter stream to capture data by keyword.
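The listener pattern described above, in which a streaming session routes each incoming message to a listener object and a keyword filter decides which messages get through, can be illustrated without network access. The sketch below does not use Tweepy: `SketchStream` is a hypothetical stand-in, `StdOutListener` is a hypothetical class that only borrows the name used in the text, the sample messages are invented, and the case-insensitive substring match is a simplification of how the real service matches keywords.

```python
import io
import json
import sys


class StdOutListener:
    """Hypothetical listener: writes the text of each received tweet to a stream."""

    def __init__(self, out=sys.stdout):
        self.out = out  # a file or the terminal

    def on_data(self, raw):
        tweet = json.loads(raw)
        self.out.write(tweet["text"] + "\n")
        return True  # returning False would ask the stream to stop


class SketchStream:
    """Hypothetical streaming session over an iterable of raw JSON messages."""

    def __init__(self, source, listener):
        self.source = source
        self.listener = listener

    def filter(self, track):
        """Route only messages whose text contains one of the tracked keywords."""
        keywords = [k.lower() for k in track]
        for raw in self.source:
            text = json.loads(raw).get("text", "").lower()
            if any(k in text for k in keywords):
                if self.listener.on_data(raw) is False:
                    break


# Invented messages standing in for a live feed.
messages = [
    json.dumps({"text": "Third day of flu, staying home"}),
    json.dumps({"text": "Great match last night"}),
    json.dumps({"text": "Dengue cases reported in my area"}),
]
buffer = io.StringIO()
SketchStream(messages, StdOutListener(out=buffer)).filter(track=["flu", "dengue"])
captured = buffer.getvalue().splitlines()
print(captured)
```

A real collector would replace the in-memory message list with an authenticated Tweepy streaming session and would typically write to a file rather than an in-memory buffer.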