
Data Science @ UCD

A collection of projects from selected Data Science modules.


COMP30760 Data Science in Python

Autumn Trimester, 2020

Getting Started

  • Create and activate the ds-env environment.
    conda env create -f environment.yml
    conda activate ds-env
  • Change to the python-comp30760 directory, then run jupyter notebook.
  • The project notebooks can now be run.

A1: Spotify Analysis

The objective of this assignment is to collect a dataset from one or more open web APIs, and to use Python to prepare, analyse, and derive insights from the collected data. A minimal end-to-end sketch follows the task list.

Tasks:

  • Data Identification and Collection:
    • Choose one or more public web APIs.
    • Collect data from your API(s) using Python.
    • Save the collected dataset in JSON format for subsequent analysis.
  • Data Preparation and Analysis:
    • Load the stored JSON dataset, and represent it using an appropriate structure.
    • Apply any pre-processing steps that might be required to clean, filter or engineer the dataset before analysis.
    • Analyse, characterise, and summarise the cleaned dataset, using tables and visualisations where appropriate.
    • Summarise any insights which you gained from your analysis of the dataset, and suggest ideas for further analysis.
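
For illustration, a minimal sketch of this collect-store-analyse workflow using the Spotify Web API; the access token, artist ID, file names, and JSON fields are placeholder assumptions, not the assignment's actual choices.

    import json
    import requests
    import pandas as pd

    TOKEN = "YOUR_ACCESS_TOKEN"   # assumption: obtained via Spotify's OAuth flow
    ARTIST_ID = "SOME_ARTIST_ID"  # placeholder

    # Collect: fetch an artist's top tracks from the web API
    resp = requests.get(
        f"https://api.spotify.com/v1/artists/{ARTIST_ID}/top-tracks",
        headers={"Authorization": f"Bearer {TOKEN}"},
        params={"market": "IE"},
    )
    resp.raise_for_status()

    # Save the raw JSON for subsequent analysis
    with open("raw_tracks.json", "w") as f:
        json.dump(resp.json(), f)

    # Load the stored dataset and flatten it into a tabular structure
    with open("raw_tracks.json") as f:
        raw = json.load(f)
    df = pd.json_normalize(raw["tracks"])

    # Characterise and summarise
    print(df[["name", "popularity"]].describe())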

A2: COVID-19 Mobility Data Analysis

Increasingly, large-scale mobility datasets are being made publicly available for research purposes. This type of data describes the aggregated movement of people across a region or an entire country over time. Mobility data can naturally be represented as a time series, where each day is a separate observation. Recently, Google made mobility data available to help researchers understand the effects of COVID-19 and the associated government policies on public behaviour. This data charts movement patterns across different location categories (e.g. work, retail, etc.). The objective of this assignment is to construct different time series representations for a number of countries based on the supplied mobility data, and to analyse and compare the resulting series. A pandas sketch follows the task list.

Tasks:

  • Within-country analysis (for each of the three selected countries separately)
    • Construct a set of time series that represent the mobility patterns for the different location categories for the country (e.g. workplaces, residential, transit stations, etc.).
    • Characterise and visualise each of these time series. You may choose to apply re-sampling and/or smoothing in order to provide a clearer picture of the trends in the series.
    • Compare and contrast how the series for the different location categories have changed over time for the country. To what extent are these series correlated with one another?
    • Suggest explanations for any differences that you have observed between the time series for the location categories.
  • Between-country analysis (taking the three selected countries together)
    • Construct a set of time series that represent the overall mobility patterns for the three countries.
    • Characterise and visualise each of these time series. You may choose to apply re-sampling and/or smoothing in order to provide a clearer picture of the trends in the series.
    • Compare and contrast how the overall time series for the three countries have changed over time. To what extent are these series correlated with one another?
    • Suggest explanations for any differences that you have observed between the time series for the countries.
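
A minimal pandas sketch of the within-country workflow, assuming the Google COVID-19 Community Mobility Reports CSV (column names as published by Google; Ireland is an arbitrary example country).

    import pandas as pd

    df = pd.read_csv("Global_Mobility_Report.csv", parse_dates=["date"])
    ie = df[df["country_region"] == "Ireland"].set_index("date")

    cats = ["workplaces_percent_change_from_baseline",
            "residential_percent_change_from_baseline",
            "transit_stations_percent_change_from_baseline"]

    # Smooth each series with a 7-day rolling mean to damp weekly seasonality
    smooth = ie[cats].rolling(window=7).mean()
    smooth.plot(title="Ireland: mobility by location category")

    # To what extent are the location categories correlated with one another?
    print(ie[cats].corr())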

COMP30770 Programming for Big Data

Spring Trimester, 2021

Project 1: CLI (Bash) and Data Management for Big Data

For a detailed report, please see proj1-report.pdf.

1. Cleaning a Dataset with Bash

The scripts below can be run in any Linux shell, without command line arguments.
Bash on WSL2 (Windows 10) was used for development.

  • Raw dataset: data/reddit_2021.csv
  • Execute permissions for scripts:
    $ chmod +x ./*.sh
    
  • Performing data cleaning operations:
    $ ./00-replace-protected-commas.sh
    $ ./01-drop-index-and-nsfw.sh
    $ ./02-drop-empty-cols.sh
    $ ./03-drop-single-val-cols.sh
    $ ./04-sec-to-month.sh
    $ ./05-count-posts-per-month.sh
    $ ./06ab-title-lower-no-punc.sh
    $ ./06c-remove-stop-words.sh
    $ ./06d-reduce-to-stem.sh
    $ ./06e-place-clean-titles.sh
    
  • All other files in the data/ directory are regenerated by the scripts above.
  • Clean dataset obtained: data/reddit_2021_clean.csv (an illustrative pandas equivalent of these steps is sketched below).
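
For illustration only, a pandas sketch of the kind of cleaning these Bash scripts perform (the repo itself uses standard shell tools; the created_utc column name is an assumption).

    import pandas as pd

    df = pd.read_csv("data/reddit_2021.csv")

    # Drop columns that are entirely empty or hold a single constant value
    df = df.dropna(axis=1, how="all")
    df = df.loc[:, df.nunique(dropna=False) > 1]

    # Convert a Unix-seconds timestamp to a month label (cf. 04-sec-to-month.sh)
    df["month"] = pd.to_datetime(df["created_utc"], unit="s").dt.to_period("M")

    df.to_csv("data/reddit_2021_clean.csv", index=False)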

2. Data Management

A Docker container with support for MySQL and MongoDB is recommended.
Image used: registry.gitlab.com/roddhjav/ucd-bigdata/db.

  • Stop and remove preexisting containers named comp30770-db, then create a new container using the above image.
    $ ./docker-create.sh
    
  • Copy scripts for MySQL and MongoDB, and the cleaned dataset data/reddit_2021_clean.csv to the container.
    $ ./docker-cp-files.sh
    
  • Start a Bash prompt in the container's /root directory.
    $ ./docker-start.sh
    
  • Create and populate the 'reddit' database in MySQL and MongoDB (in Docker).
    # ./07-mysql-create-db.sh
    # ./08-mysql-populate-db.sh
    # ./09-mongo-populate-db.sh
    
  • Queries are run in the mysql and mongo prompts as described in the report; a Python sketch of one equivalent query pair follows.
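
As a hedged sketch, the same "posts per month" aggregation issued to both stores from Python; the table/collection and column names are assumptions, not the report's actual schema.

    import mysql.connector
    from pymongo import MongoClient

    # MySQL: aggregate with GROUP BY
    conn = mysql.connector.connect(host="localhost", user="root",
                                   password="PASSWORD", database="reddit")
    cur = conn.cursor()
    cur.execute("SELECT month, COUNT(*) FROM posts GROUP BY month")
    print(cur.fetchall())

    # MongoDB: the equivalent aggregation pipeline
    client = MongoClient("localhost", 27017)
    pipeline = [{"$group": {"_id": "$month", "count": {"$sum": 1}}}]
    print(list(client.reddit.posts.aggregate(pipeline)))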

3. Reflection

  • CLI for Big Data
  • Relational (SQL) vs. non-relational (NoSQL) database systems
  • Review on 'Dynamo: Amazon's highly available key-value store'

Project 2: Spark

For a detailed report, please see proj2-report.pdf.

  • Execute permissions for scripts:

    $ chmod +x ./*.sh
    
  • Set up a Docker cluster for Spark:

    $ ./docker-setup-spark.sh
    
  • Download and clean the data/ files, copy them to the Docker container, and start a Bash prompt.
    Both files in the data/ directory are regenerated by this script.

    $ ./docker-start.sh
    
  • In the container, run Spark SQL queries on the GitHub starred projects dataset (a PySpark sketch of this step follows the list).

    bash-5.0# spark-shell -i 01-github.scala
    
  • Run graph processing on the DBLP co-authorship dataset.

    bash-5.0# spark-shell -i 02-dblp.scala
    
  • Reflection: Review on 'Spark: Cluster Computing with Working Sets'.
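
The project scripts are Scala, run via spark-shell; purely for illustration, a PySpark equivalent of the Spark SQL step (the file name and column names are assumptions).

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("github-stars").getOrCreate()
    df = spark.read.csv("data/github.csv", header=True, inferSchema=True)
    df.createOrReplaceTempView("projects")

    # e.g. the ten most-starred repositories
    spark.sql("""
        SELECT repo, stars FROM projects
        ORDER BY stars DESC LIMIT 10
    """).show()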

COMP30850 Network Analysis

Spring Trimester, 2021

A1: Co-stardom Network

The goal of this assignment is to construct and characterise network representations of two movie-related datasets. The networks should model the co-starring relations between actors in these two datasets, i.e. the collaboration network of actors who appear together in the same movies. A NetworkX sketch follows the task list.

Tasks:

For each dataset:

  • Network Construction
    • Parse the JSON data and create an appropriate co-starring network using NetworkX, where nodes represent individual actors.
    • Identify and remove any isolated nodes from the network.
  • Network Characterisation
    • Apply a range of different methods to characterise the structure and connectivity of the network.
    • Apply different centrality measures to identify important nodes in the network.
  • Ego-centric Analysis
    • Select one of the important nodes in the network and generate an ego network for this node.
  • Network Visualisation
    • Export the network as a GEXF file and use Gephi to produce a useful visualisation.
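
A minimal NetworkX sketch of these steps, assuming the JSON is a list of movies, each with a "cast" list of actor names.

    import itertools
    import json
    import networkx as nx

    with open("movies.json") as f:  # file name is a placeholder
        movies = json.load(f)

    # Nodes are actors; an edge joins every pair who co-star in a movie
    G = nx.Graph()
    for movie in movies:
        for a, b in itertools.combinations(movie["cast"], 2):
            G.add_edge(a, b)

    # Remove any isolated nodes
    G.remove_nodes_from(list(nx.isolates(G)))

    # Characterisation and centrality
    print(G.number_of_nodes(), G.number_of_edges(),
          nx.number_connected_components(G))
    deg = nx.degree_centrality(G)
    star = max(deg, key=deg.get)

    # Ego network for an important node, then export for Gephi
    ego = nx.ego_graph(G, star)
    nx.write_gexf(G, "costars.gexf")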

Sample visualisations (see notebook for details).

A2: Twitter Networks

The goal of this assignment is to construct and characterise a range of network representations, created from pre-collected Twitter data for a specific Twitter List of user accounts relating to a particular topic (e.g. technology, sports news, etc.).

Tasks:

For the selected data, construct and characterise five different Twitter network representations; a sketch of one follows the list.

  • Follower network
  • Reply network
  • Mention network
  • User-hashtag network
  • Hashtag co-occurrence network
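
A sketch of one of the five representations (the mention network), assuming pre-collected tweets with "user" and "mentions" fields; the field names are placeholders for whatever the collected data actually uses.

    import json
    import networkx as nx

    with open("tweets.json") as f:
        tweets = json.load(f)

    # Directed, weighted: an edge u -> v each time user u mentions user v
    M = nx.DiGraph()
    for t in tweets:
        for mentioned in t["mentions"]:
            if M.has_edge(t["user"], mentioned):
                M[t["user"]][mentioned]["weight"] += 1
            else:
                M.add_edge(t["user"], mentioned, weight=1)

    # Which accounts are mentioned most, weighting by frequency?
    print(sorted(M.in_degree(weight="weight"), key=lambda x: -x[1])[:5])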

Sample visualisations (see notebook for details).

COMP30390 Optimisation

Autumn Trimester, 2021

Getting Started

cd optimisation-comp30390
conda env create -f env-comp30390.yml
conda activate comp30390
jupyter notebook

A number of classic linear programming problems solved in Julia.
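
The notebooks solve these in Julia; purely for illustration, the same flavour of problem in Python with scipy: maximise 3x + 2y subject to x + y <= 4, x + 3y <= 6, and x, y >= 0.

    from scipy.optimize import linprog

    # linprog minimises, so negate the objective to maximise 3x + 2y
    res = linprog(c=[-3, -2],
                  A_ub=[[1, 1], [1, 3]],
                  b_ub=[4, 6],
                  bounds=[(0, None), (0, None)])
    print(res.x, -res.fun)  # optimal point (4, 0) and objective value 12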

A1: Linear Programming

  • Notebook: a1

A2: Integer Linear Programming

  • Notebook: a2

COMP47490 Machine Learning

Autumn Trimester, 2021

Getting Started

cd machine-learning-comp47490
conda env create -f env-comp47490-m1.yml
conda activate comp47490-m1
jupyter notebook

A1: Austin Animal Shelter Outcomes

  • Notebook: a1
  • Task:
    • Given a sample of the Austin Animal Shelter Outcomes dataset, the objective is to build a data analytics solution for death-risk prediction, helping the shelter plan improvements to animal welfare.
    • The goal is to work with the sample to build and evaluate prediction models that capture the relationship between the attributes and the target feature, outcome. A baseline modelling sketch follows.
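
A hedged baseline for the prediction task; the file name, column names, and target encoding are assumptions about the dataset sample.

    import pandas as pd
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import classification_report
    from sklearn.model_selection import train_test_split

    df = pd.read_csv("animal_shelter_sample.csv")  # placeholder file name
    X = pd.get_dummies(df.drop(columns=["outcome"]))
    y = df["outcome"]

    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2,
                                              stratify=y, random_state=42)
    model = RandomForestClassifier(random_state=42).fit(X_tr, y_tr)
    print(classification_report(y_te, model.predict(X_te)))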

A2: Adult Census Income

  • Notebook: a2
  • Task:
    • Given a sample of the Adult Census Income dataset, the objective is to use ensemble learning methods to identify the extent to which classification performance can be improved by combining multiple models.
    • The data contains 14 attributes, including age, race, sex, and marital status; the goal is to predict whether an individual earns over $50K per year. A sketch of the single-model vs. ensemble comparison follows.
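
A sketch of the comparison under similar assumptions (placeholder file name, simplified preprocessing): does an ensemble of trees beat a single tree?

    import pandas as pd
    from sklearn.ensemble import BaggingClassifier
    from sklearn.model_selection import cross_val_score
    from sklearn.tree import DecisionTreeClassifier

    df = pd.read_csv("adult_sample.csv")  # placeholder file name
    X = pd.get_dummies(df.drop(columns=["income"]))
    y = (df["income"] == ">50K").astype(int)

    tree = DecisionTreeClassifier(random_state=42)
    bag = BaggingClassifier(tree, n_estimators=100, random_state=42)

    # Mean 5-fold accuracy for a single tree vs. the bagged ensemble
    for name, clf in [("single tree", tree), ("bagging", bag)]:
        print(name, cross_val_score(clf, X, y, cv=5).mean())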

Acknowledgements