
birdclef-2022

This repository contains the Data Science at Georgia Tech team's code for the birdclef-2022 Kaggle competition.

quickstart

Development has primarily been done on Windows 10, but the code is generally platform-agnostic and runs on the default Kaggle kernel.

repo and data preparation

Check out the repository to your local machine and download the data from the competition website. Ensure the data is extracted into the data/raw/birdclef-2022 directory.

git clone https://github.com/acmiyaguchi/birdclef-2022
cd birdclef-2022

# download the data to the data/raw directory and extract
mkdir -p data/raw
# ...

# ensure that you can run the following command from the project root
cat data/raw/birdclef-2022/scored_birds.json | wc -l
# 23
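
One way to fetch and extract the data, assuming you use the Kaggle CLI (any other download method works just as well), looks like:

kaggle competitions download -c birdclef-2022 -p data/raw
# archive name may differ depending on the CLI version
unzip data/raw/birdclef-2022.zip -d data/raw/birdclef-2022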

Install the Google Cloud SDK and request access to the birdclef-2022 bucket. Run the following command to verify that you have the correct permissions.

gsutil cat gs://birdclef-2022/processed/model/2022-04-12-v4/metadata.json

{
  "embedding_source": "data/intermediate/embedding/tile2vec-v2/version_2/checkpoints/epoch=2-step=10849.ckpt",
  "embedding_dim": 64,
  "created": "2022-04-12T23:09:51.920185",
  "cens_sr": 10,
  "mp_window": 20,
  "use_ref_motif": false
}
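
If the command fails with a credentials or permissions error, authenticate the SDK first. A typical sequence, assuming a standard Google Cloud SDK install:

gcloud auth login
gsutil ls gs://birdclef-2022/processed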

Run the sync.py script to pull data down from the remote bucket.

python scripts/sync.py down

In particular, this will synchronize shared files from the data/processed directory.
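
A rough manual equivalent for the shared files (a sketch only, not necessarily what sync.py does internally) would be a recursive rsync against the bucket:

gsutil -m rsync -r gs://birdclef-2022/processed data/processed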

python

Install Python 3.7 or above. Install pipx to manage a few utilities like pip-tools and pre-commit.

pip install pipx
pipx install pip-tools
pipx install pre-commit

Install the pre-commit hooks. This will ensure that all the code is formatted correctly.

pre-commit install
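
To check the whole tree once (for example, after a fresh clone), run the hooks against all files:

pre-commit run --all-files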

Create a new virtual environment and activate it.

# create a virtual environment in the venv/ directory
python -m venv venv

# activate on Windows
./venv/Scripts/Activate.ps1

# activate on Linux/MacOS
source venv/bin/activate

Then install all of the dependencies.

pip install -r requirements.txt
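
As a quick sanity check that the environment is set up, and assuming you are in the project root so the birdclef package directory is importable:

python -c "import birdclef"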

running tests

Unit-testing helps with debugging smaller modules in a larger project. For example, we use tests to assert that models accept data in one shape and output predictions in another shape. We use pytest in this project. Running the tests can help ensure that your environment is configured correctly.

pytest -vv tests/

You can select a subset of tests using the -k flag.

pytest -vv tests/ -k embed_tilenet

You can also exit tests early using the -x flag and enter a debugger on failing tests using the --pdb flag.
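
For example, to stop at the first failure and drop into the debugger:

pytest -vv tests/ -x --pdb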

repository structure

The repository is structured in the following way.

birdclef: The primary Python module, which encapsulates all of the competition code.
data: Associated data files, not checked into source control.
notebooks: Notebooks, often for exploration and analysis. The naming convention is YYYY-MM-DD-{initials}-{notebook name}.ipynb.
figures: Figures that are checked into the repository.
notes: Notes about the project. Filenames should be prefixed by GitHub handle.
scripts: Scripts for maintaining the development environment and other miscellaneous tasks.
tests: Unit tests written with pytest.
terraform: Terraform configuration files for the associated cloud resources.
label-studio: Label Studio configuration files (may be deprecated).

The Python module has a few notable submodules.

datasets: Code related to the soundscape task.
models: Code for the different models used throughout the project.
workflows: Code for the workflows, such as the command-line interface.

The data directory has three notable subdirectories.

data/raw: Raw data files provided by the competition.
data/intermediate: Intermediate data files generated by tasks in the repository; generally not shared.
data/processed: Processed data files, shared across the team and used in the Kaggle notebooks.

development

The majority of development notes can be found under the notes directory.

adding dependencies

This repository uses pip-compile to maintain dependencies. Please add direct dependencies to requirements.in, rather than modifying requirements.txt. After adding a dependency, run pip-compile to generate a new requirements.txt file. The sequence looks something like:

pipx install pip-tools  # if you haven't installed it already via the quickstart guide

# add any new direct dependencies to requirements.in
pip-compile
# observe that requirements.txt has changed locally
# commit the result
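
pip-compile can also upgrade a single pinned dependency without re-resolving everything else; for example (the package name here is only a placeholder):

pip-compile --upgrade-package numpy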

kaggle submission

The submission process relies on two notebooks.

The first notebook downloads any models in the shared GCP bucket (gs://birdclef-2022). It also downloads the main package in this repository, using a private GitHub token.

The second notebook contains the actual code. It simply mounts the output of the model-sync notebook and calls the birdclef.workflows.classify command.
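
Assuming the module exposes a command-line entry point (check birdclef/workflows for the exact invocation), something like the following should list the available options:

python -m birdclef.workflows.classify --help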

papers

The approach for this year's competition is focused on unsupervised methods. In particular, the fast similarity matrix profile and tile2vec papers provide the technical foundation for methods found in the repository.