
Uses data scraped from The Hindu and The Times of India (2001 to 2018) for temporal and geographical analysis, and labels the articles into several clusters in Python using GeoText and NLTK.


mengeziml/HIV-Newspaprer-Articles-Clustering

 
 


5/10/2018

HIV articles categorization

Using articles from The Hindu and The Times of India (2001 to 2018)

Introduction

The aim was to cluster and categorize the articles on Human Immunodeficiency Virus (HIV) published in Indian English-language newspapers from 2001 to 2018. The findings give insight into how the topic is covered and can help frame better strategies to curb HIV.

Approach Used

  1. A web spider was deployed on the websites of these newspapers with the help of the Scrapy library in Python. This produced an output of the articles with time stamps and titles as separate columns.
  2. The data obtained was fed into a model built along the lines of the Agile model.
  3. The missing values were imputed and the date column was converted to the pandas datetime format.
  4. The Natural Language Toolkit (NLTK) was used to tokenize the text (break it up into words) and to stem the words (linguistic normalization).
  5. Stop words and punctuation were removed using the NLTK stop word list.
  6. The word-to-vector (word2vec) approach was used to convert all the preprocessed words into vectors with the help of a shallow two-layer neural network.
  7. For clustering, an ensemble was used to group the articles: Density-Based Spatial Clustering of Applications with Noise (DBSCAN), which does not require the number of clusters as input; Agglomerative Hierarchical Clustering, which provides the option to choose the number of clusters; and K Nearest Neighbors (KNN), using distance measured with cosine similarity.
  8. Multidimensional Scaling (MDS) was used for dimensionality reduction (it gave more precision and a faster runtime).
  9. The model was fit and used for prediction on the data.
  10. The output was analyzed and the clusters were labelled.
  11. Important parameters such as the geo-tags of the articles were extracted using libraries like GeoText.
  12. The seaborn library was used to perform temporal and geographical analysis of the data.
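The clustering and scaling steps above can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: the toy `article_vectors` stand in for the real per-article vectors, the `eps`, `min_samples` and `n_clusters` values are assumptions, and the KNN part of the ensemble is left out because the report does not say how the three results were combined. It assumes scikit-learn and NumPy are installed.

```python
import numpy as np
from sklearn.cluster import DBSCAN, AgglomerativeClustering
from sklearn.manifold import MDS
from sklearn.metrics import pairwise_distances
from sklearn.preprocessing import normalize

# Toy stand-in for the article vectors: three tight groups of 10 points
# around three orthogonal directions (an assumption for illustration).
rng = np.random.default_rng(0)
centers = np.eye(3)
article_vectors = np.vstack(
    [center + 0.05 * rng.normal(size=(10, 3)) for center in centers]
)

# Scale every vector to unit length. On unit vectors, Euclidean distance
# grows monotonically with cosine distance, so the default Euclidean
# settings below behave like a cosine-based comparison.
unit_vectors = normalize(article_vectors)

# DBSCAN on a precomputed cosine distance matrix: no cluster count needed.
cosine_dist = pairwise_distances(unit_vectors, metric="cosine")
dbscan_labels = DBSCAN(eps=0.2, min_samples=3, metric="precomputed").fit_predict(
    cosine_dist
)

# Agglomerative clustering: the number of clusters is chosen up front.
agglo_labels = AgglomerativeClustering(n_clusters=3).fit_predict(unit_vectors)

# MDS projects the vectors to 2 dimensions, e.g. for a scatter plot.
embedding = MDS(n_components=2, random_state=0).fit_transform(unit_vectors)

print("DBSCAN labels:       ", dbscan_labels)
print("Agglomerative labels:", agglo_labels)
print("MDS embedding shape: ", embedding.shape)
```

In DBSCAN output the label `-1` marks points treated as noise, which is one reason its result can differ from a method that forces every article into a fixed number of clusters.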

Inferences


Number of articles on HIV, quarter-wise

Multidimensionally scaled model (clusters overlap because of the large number of dimensions)

Number of articles published year-wise
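Quarter-wise and year-wise counts like the ones plotted here could be derived from the scraped time stamps roughly as below. This is only a sketch under assumptions: the column names `title` and `date` and the sample rows are hypothetical, and rows with an unparseable date are simply dropped here, whereas the report says missing values were imputed.

```python
import pandas as pd

# Hypothetical sample of scraped articles; one row has no date.
articles = pd.DataFrame(
    {
        "title": ["a", "b", "c", "d", "e"],
        "date": ["2001-02-14", "2001-03-01", "2001-07-20", None, "2018-12-01"],
    }
)

# Convert to pandas datetimes; anything unparseable becomes NaT.
articles["date"] = pd.to_datetime(articles["date"], errors="coerce")
dated = articles.dropna(subset=["date"])

# Count articles per calendar quarter and per year.
per_quarter = dated.groupby(dated["date"].dt.to_period("Q")).size()
per_year = dated.groupby(dated["date"].dt.year).size()

print(per_quarter)
print(per_year)
```

Either series can then be handed to a plotting library such as seaborn for a bar chart.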

Number of mentions of major countries in the articles

| IN | LK | US | TR | PK | ZA | CN | PH | CZ | GB | TH | PL |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 2628 | 3 | 338 | 112 | 73 | 44 | 43 | 190 | 107 | 112 | 31 | 61 |
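Totals per country code such as these can be obtained by summing per-article mention counts. The sketch below assumes each article has already been reduced to a mapping of country code to count, for example via GeoText's `country_mentions` attribute; the helper name `total_country_mentions` and the sample data are hypothetical.

```python
from collections import Counter


def total_country_mentions(per_article_mentions):
    """Sum country-code counts over an iterable of per-article mappings."""
    totals = Counter()
    for mentions in per_article_mentions:
        totals.update(mentions)  # adds the counts code by code
    return totals


# Hypothetical per-article results, shaped like {country_code: count}.
sample_mentions = [
    {"IN": 3, "US": 1},
    {"IN": 2},
    {"ZA": 1, "US": 2},
]

totals = total_country_mentions(sample_mentions)
print(totals.most_common())
```

The same aggregation works for city names if each article is reduced to a count of the cities it mentions.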

Mentions of various major cities

Frequency of occurrence of various types of articles

| Cluster name | Type of article | Number of articles |
| --- | --- | --- |
| A | Statistics relating to HIV | 311 |
| B | Articles relating to government/individual decisions on HIV-related issues | 104 |
| C | Spreading stigma on HIV | 953 |
| D | News informing of positive developments against HIV (non-scientific) | 87 |
| E | News informing about HIV test methods and previous diagnosis results | 185 |
| F | Articles spreading awareness | 223 |
| G | News informing of positive developments against HIV (scientific) | 284 |

Way Forward

The analysis can be used to identify current problems with the HIV eradication programme and to identify the principally affected areas. The report can also serve as an indicator of the awareness drives run by social bodies, and can be used to measure and improve their efficiency.

Thanking you

A report by Chinmay Singh
