5/10/2018
The aim was to cluster and categorize the articles on the Human Immunodeficiency Virus (HIV) published in Indian English-language newspapers from 2001 to 2018. The findings give insight into the topic and help frame better strategies to curb HIV.
- A web spider built with the Scrapy library in Python was deployed on the websites of these newspapers. This produced an output of the articles with timestamps and titles as separate columns.
- The data obtained was fed into a model built along the lines of the Agile model.
- The missing values were imputed and the dates were converted to the datetime format of the pandas library.
- The Natural Language Toolkit (NLTK) was used to tokenize the text (break it up into words) and to stem the words (linguistic normalization).
- Stop words and punctuation were removed using the NLTK stop word list.
- The word-to-vector (word2vec) approach was used to convert all the preprocessed words into vectors with the help of a shallow two-layer neural network.
- For clustering, an ensemble of three methods was used to group the articles: Density-Based Spatial Clustering of Applications with Noise (DBSCAN), which does not require the number of clusters as input; Agglomerative Hierarchical Clustering, which lets the number of clusters be chosen; and K Nearest Neighbors (KNN), using distance measured with cosine similarity.
- Multidimensional Scaling (MDS) was used for dimensionality reduction (it gives more precision and a faster runtime).
- The model was fit and used for prediction on the data.
- The output was analyzed and the clusters were labelled.
- Important attributes, such as the geo-tags of the articles, were extracted using libraries like GeoText.
- The seaborn library was used to perform temporal and geographical analysis of the data.
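The density-based part of the clustering step above can be sketched in plain Python. This is a minimal illustration, not the project's actual code: the toy 2-D vectors stand in for the word2vec article vectors, and the `eps` and `min_samples` values are assumptions chosen for the toy data. In practice a library implementation such as scikit-learn's `DBSCAN` with `metric="cosine"` would be the usual choice.

```python
import math

NOISE = -1  # label used for points that belong to no cluster


def cosine_distance(u, v):
    """Return 1 - cosine similarity (0 = same direction, 2 = opposite)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 1.0  # assumption: treat zero vectors as orthogonal to everything
    return 1.0 - dot / (norm_u * norm_v)


def dbscan(vectors, eps, min_samples):
    """Return one cluster label per vector; NOISE marks outliers."""
    n = len(vectors)
    # Precompute each point's eps-neighbourhood (the point itself included).
    neighbours = [
        [j for j in range(n) if cosine_distance(vectors[i], vectors[j]) <= eps]
        for i in range(n)
    ]
    labels = [None] * n
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue  # already assigned or marked as noise
        if len(neighbours[i]) < min_samples:
            labels[i] = NOISE  # not a core point; may become a border point later
            continue
        cluster += 1  # start a new cluster from this core point
        labels[i] = cluster
        queue = list(neighbours[i])
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: joins, but is not expanded
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbours[j]) >= min_samples:
                queue.extend(neighbours[j])  # core point: keep growing the cluster
    return labels


# Hypothetical toy vectors: two tight directions plus one outlier.
toy_vectors = [
    [1.0, 0.0], [0.9, 0.1], [0.95, 0.05],
    [0.0, 1.0], [0.1, 0.9], [0.05, 0.95],
    [-1.0, -1.0],
]
labels = dbscan(toy_vectors, eps=0.05, min_samples=2)
print(labels)  # [0, 0, 0, 1, 1, 1, -1]
```

The number of clusters is never passed in; it falls out of `eps` and `min_samples`, which is the property the write-up cites as the reason for including DBSCAN.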
Inferences
Number of articles published year wise
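A year-wise count of this kind can be derived from the scraped timestamps. The sketch below uses only the standard library as a stand-in for the pandas date conversion described earlier; the rows, the `%Y-%m-%d` format and the `parse_date` helper are all hypothetical. Unparseable timestamps are simply skipped here, whereas the write-up imputed missing values.

```python
from collections import Counter
from datetime import datetime

# Hypothetical scraped rows: (timestamp string, title). The format is an assumption.
rows = [
    ("2001-12-01", "Title A"),
    ("2001-07-15", "Title B"),
    ("2018-03-09", "Title C"),
    ("", "Title D"),  # missing timestamp
]


def parse_date(text, fmt="%Y-%m-%d"):
    """Return a datetime, or None when the text does not match the format."""
    try:
        return datetime.strptime(text, fmt)
    except ValueError:
        return None


dates = [parse_date(timestamp) for timestamp, _title in rows]

# Count articles per publication year, ignoring rows without a usable date.
per_year = Counter(d.year for d in dates if d is not None)

for year in sorted(per_year):
    print(year, per_year[year])
```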
Number of mentions of major countries in the articles
IN | LK | US | TR | PK | ZA | CN | PH | CZ | GB | TH | PL |
---|---|---|---|---|---|---|---|---|---|---|---|
2628 | 3 | 338 | 112 | 73 | 44 | 43 | 190 | 107 | 112 | 31 | 61 |
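Per-country totals like these come from tallying place mentions across all articles. The write-up used GeoText for this; the sketch below is a plain dictionary-lookup stand-in and not GeoText's API. The gazetteer, the `country_mentions` helper and the sample sentences are all invented for illustration.

```python
import re
from collections import Counter

# Tiny hypothetical gazetteer: place name -> ISO 3166-1 alpha-2 code.
GAZETTEER = {
    "india": "IN",
    "sri lanka": "LK",
    "united states": "US",
    "south africa": "ZA",
    "china": "CN",
}

# One alternation pattern, longest names first so multi-word names are preferred.
_NAMES = sorted(GAZETTEER, key=len, reverse=True)
_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(name) for name in _NAMES) + r")\b",
    re.IGNORECASE,
)


def country_mentions(text):
    """Count how often each known country is mentioned in one text."""
    return Counter(GAZETTEER[m.group(1).lower()] for m in _PATTERN.finditer(text))


# Invented sample texts standing in for scraped articles.
articles = [
    "India and South Africa expand HIV testing.",
    "New HIV awareness drive launched in India.",
    "Researchers in China and the United States report progress.",
]

# Sum the per-article counts into corpus-wide totals.
totals = Counter()
for article in articles:
    totals.update(country_mentions(article))

print(totals.most_common())
```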
Mention Of Various Major Cities
Cluster | Count |
---|---|
A | 311 |
B | 104 |
C | 953 |
D | 87 |
E | 185 |
F | 223 |
G | 284 |
Cluster Name | Type of Article |
---|---|
A | Statistics relating to HIV |
B | Articles relating to government/individual decisions on HIV-related issues |
C | Spreading stigma on HIV |
D | News informing of positive developments against HIV (non-scientific) |
E | News informing about HIV test methods and previous diagnosis results |
F | Articles spreading awareness |
G | News informing of positive developments against HIV (scientific) |
The analysis can be used to identify current problems with the HIV eradication programme and to identify the principal affected areas. The report can also serve as an indicator of the awareness drives of social bodies, and can be used to measure and improve their efficiency.
By Chinmay Singh