5/10/2018

HIV Articles Categorization

Using articles from The Hindu and The Times of India (2001 to 2018)

Introduction

The aim was to cluster/categorize the articles on Human Immunodeficiency Virus (HIV) published in Indian English-language newspapers from 2001 to 2018. The findings give insight into the topic and help frame a better strategy to curb HIV.

Approach Used

A web spider was deployed on the websites of these newspapers with the help of the Scrapy library in Python. This provided an output of the articles, with time stamps and titles as separate columns.
The data obtained was fed into a processing pipeline developed following the Agile model.
The missing values were imputed and the date was converted to the datetime format of the pandas library.
The Natural Language Toolkit (NLTK) was used to tokenize the text (break it up into words) and to stem the words (linguistic normalization).
The stop words and punctuation were removed using the NLTK stop-word list.
The word-to-vector approach (word2vec) was used to convert all the preprocessed words into vectors with the help of a shallow two-layer neural network.
For clustering, an ensemble of Density-Based Spatial Clustering of Applications with Noise (DBSCAN) (since it does not require the number of clusters as input), Agglomerative Hierarchical Clustering (which provides the option to choose the number of clusters) and K Nearest Neighbors (KNN) (using distance measured with cosine similarity) was used to group the articles.
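
As a rough illustration of the clustering step, the sketch below runs DBSCAN and agglomerative clustering from scikit-learn on synthetic vectors. It is not the code from the notebook: the helper name `cluster_articles`, the parameter values (`eps`, `min_samples`, `n_clusters`) and the random stand-in data are all assumptions, and the KNN part of the ensemble is not reproduced. Vectors are scaled to unit length so that the default Euclidean distance of the agglomerative step behaves similarly to cosine distance.

```python
import numpy as np
from sklearn.cluster import DBSCAN, AgglomerativeClustering
from sklearn.preprocessing import normalize


def cluster_articles(vectors, n_clusters=2, eps=0.1, min_samples=3):
    """Cluster article vectors with DBSCAN and agglomerative clustering."""
    # Scale each vector to unit length so Euclidean distance tracks cosine distance.
    unit = normalize(np.asarray(vectors, dtype=float))

    # DBSCAN needs no cluster count; points it cannot assign are labelled -1 (noise).
    dbscan_labels = DBSCAN(
        eps=eps, min_samples=min_samples, metric="cosine"
    ).fit_predict(unit)

    # Agglomerative clustering takes the desired number of clusters explicitly.
    agglo_labels = AgglomerativeClustering(n_clusters=n_clusters).fit_predict(unit)
    return dbscan_labels, agglo_labels


# Synthetic stand-in for article vectors: two groups pointing in different directions.
rng = np.random.default_rng(0)
group_a = rng.normal(loc=[1.0, 0.0, 0.0], scale=0.05, size=(10, 3))
group_b = rng.normal(loc=[0.0, 1.0, 0.0], scale=0.05, size=(10, 3))
dbscan_labels, agglo_labels = cluster_articles(np.vstack([group_a, group_b]))
print(dbscan_labels)
print(agglo_labels)
```

In practice `eps` would probably need tuning on the real article vectors, since DBSCAN is sensitive to it.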

Multidimensional scaling (MDS) was used for dimensionality reduction (giving better precision and a faster runtime).
The model was fit and used for prediction on the data.
The output was analyzed and the clusters were labelled.
Important attributes such as the geo-tags of the articles were extracted using libraries like GeoText.
The seaborn library was used to perform temporal and geographical analysis of the data.
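
The tokenization and stop-word steps of the approach above can be sketched with the standard library alone. This is only an illustration: the report used NLTK for these steps, whereas here a regular expression stands in for the tokenizer and `STOP_WORDS` is a tiny hypothetical list rather than the NLTK stop-word list. Stemming is left out.

```python
import re

# Tiny illustrative stop-word set; the report used the NLTK stop-word list instead.
STOP_WORDS = {"a", "an", "and", "are", "in", "is", "of", "on", "the", "to", "was"}


def preprocess(text):
    """Lowercase the text, split it into word tokens, and drop stop words."""
    # The regex keeps runs of letters/digits, which discards punctuation.
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [token for token in tokens if token not in STOP_WORDS]


print(preprocess("The stigma on HIV is falling, and awareness drives are working!"))
# ['stigma', 'hiv', 'falling', 'awareness', 'drives', 'working']
```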

Inferences

Number of Articles on HIV, Quarter-Wise

Multidimensionally Scaled Model

(Clusters overlap because of the large number of dimensions.)

Number of Articles Published Year-Wise
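
Year-wise and quarter-wise counts like these could be derived with pandas along the following lines. This is a minimal sketch, not the notebook's code: the column names `title` and `date` and the four sample rows are made up for illustration.

```python
import pandas as pd

# Hypothetical scraped rows; the column names "title" and "date" are assumptions.
articles = pd.DataFrame(
    {
        "title": ["a", "b", "c", "d"],
        "date": ["2001-01-15", "2001-02-20", "2001-05-01", "2018-10-05"],
    }
)

# Convert text dates to pandas datetimes; unparseable values become NaT.
articles["date"] = pd.to_datetime(articles["date"], errors="coerce")

# Count articles per calendar year and per calendar quarter.
per_year = articles["date"].dt.year.value_counts().sort_index()
per_quarter = articles["date"].dt.to_period("Q").value_counts().sort_index()

print(per_year)
print(per_quarter)
```

Either series can then be passed to a plotting library such as seaborn to draw the bar charts.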

Number of Mentions of Major Countries in the Articles

| IN | LK | US | TR | PK | ZA | CN | PH | CZ | GB | TH | PL |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 2628 | 3 | 338 | 112 | 73 | 44 | 43 | 190 | 107 | 112 | 31 | 61 |

Mentions of Various Major Cities

Frequency of Occurrence of Various Types of Articles

Cluster Name	Number of Articles
A	311
B	104
C	953
D	87
E	185
F	223
G	284

Cluster Name	Type of Article
A	Statistics relating to HIV
B	Articles relating to government/individual decisions on HIV-related issues
C	Spreading stigma on HIV
D	News informing of positive developments against HIV (non-scientific)
E	News informing about HIV test methods and previous diagnosis results
F	Articles spreading awareness
G	News informing of positive developments against HIV (scientific)
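
To put the cluster frequencies in proportion, the short sketch below turns the counts from the frequency table into shares of the labelled articles. The counts are copied from the table; reading the first row label as "A" is an assumption based on the cluster names.

```python
# Article counts per cluster, copied from the frequency table above.
# The first row label is read as "A" (an assumption based on the cluster names).
cluster_counts = {"A": 311, "B": 104, "C": 953, "D": 87, "E": 185, "F": 223, "G": 284}

total = sum(cluster_counts.values())
shares = {name: count / total for name, count in cluster_counts.items()}

# Print clusters from largest to smallest share.
for name, share in sorted(shares.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: {cluster_counts[name]} articles ({share:.1%})")
```

On these figures the seven clusters cover 2147 articles, and cluster C alone accounts for roughly 44% of them.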

Way Forward

The analysis can be used to identify current problems with the HIV eradication programme and the principal affected areas. The report can also serve as an indicator of the awareness drives of social bodies, and be used to measure and improve their efficiency.

Thanking you

A report

By Chinmay Singh

mengeziml/HIV-Newspaprer-Articles-Clustering
