
# How to contribute?

## The basic

A lot of discussions about ideas take place in the Issues section. There you can catch up with what's going on and also suggest new ideas.

  1. Fork this repository
  2. Create your branch: `$ git checkout -b new-stuff`
  3. Commit your changes: `$ git commit -am 'My cool contribution'`
  4. Push the branch to your fork: `$ git push origin new-stuff`
  5. Create a new Pull Request

## Environment

The recommended way of setting your environment up is with Anaconda, a Python distribution with useful packages for Data Science. Download it and create an environment for the project.

```console
$ conda update conda
$ conda create --name serenata_de_amor python=3
$ source activate serenata_de_amor
$ ./setup
```

The `source activate serenata_de_amor` command must be run every time you enter the project folder to start working.

## Best practices

In order to avoid a ton of conflicts when trying to merge Jupyter Notebooks, there are some guidelines we follow.

Basically we have four big directories with different purposes:

| Directory | Purpose | File naming |
|-----------|---------|-------------|
| `develop/` | This is where we explore data; feel free to create your own notebook for your exploration. | `[ISO 8601 date]-[author-initials]-[2-4 word description].ipynb` (e.g. `2016-05-13-ec-air-tickets.ipynb`) |
| `report/` | This is where we write up findings and results, putting together different data, analyses and strategies to make a point; feel free to jump in. | A meaningful title for the report (e.g. `Transport-allowances.ipynb`) |
| `src/` | This is where our auxiliary scripts live: code to scrape data, to convert stuff etc. | Lowercase, no special characters, `-` instead of spaces. |
| `data/` | This is not supposed to be committed, but it is where saved databases are stored locally (scripts from `src/` should be able to get this data for you); a copy of this data will be available elsewhere (just in case). | Lowercase, no special characters, `-` instead of spaces. |
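The `develop/` naming convention above is easy to get wrong by hand, so here is a minimal sketch of a helper that builds a compliant file name. The function name and signature are illustrative assumptions, not project code:

```python
# Sketch of the develop/ naming convention: [ISO 8601 date]-[initials]-[slug].ipynb
# The helper name and signature are assumptions for illustration only.
from datetime import date


def notebook_name(author_initials, description, day=None):
    """Build a develop/ notebook file name following the convention above."""
    day = day or date.today()
    slug = "-".join(description.lower().split())  # lowercase, dashes for spaces
    return "{}-{}-{}.ipynb".format(day.isoformat(), author_initials, slug)


print(notebook_name("ec", "air tickets", date(2016, 5, 13)))
# 2016-05-13-ec-air-tickets.ipynb
```

Note that the slug simply lowercases the description and joins words with `-`, matching the example file name in the table.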

### Source files (`src/`)

Here we explain what each script from `src/` does for you:

#### One script to rule them all

  1. `src/fetch_datasets.py` downloads all the available datasets to `data/` in .xz compressed CSV format, with headers translated to English.

#### Quota for Exercising Parliamentary Activity (CEAP)

  1. `src/fetch_datasets.py --from-source` downloads all CEAP datasets to `data/` from the official source (in XML format, in Portuguese).
  2. Without that flag, `src/fetch_datasets.py` downloads the CEAP datasets into `data/` from our backup server (.xz compressed CSV format, with headers translated to English).
  3. `src/xml2csv.py` converts the original XML datasets to .xz compressed CSV format.
  4. `src/translate_datasets.py` translates the dataset file names and the labels of the variables within these files.
  5. `src/translation_table.py` creates a `data/YYYY-MM-DD-ceap-datasets.md` file with details of the meaning and the translation of each variable from the Quota for Exercising Parliamentary Activity datasets.

#### Suppliers information (CNPJ)

  1. `src/fetch_cnpj_info.py` iterates over the CEAP datasets looking for supplier unique documents (CNPJ) and creates a local dataset with each supplier's info.
  2. `src/clean_cnpj_info_dataset.py` cleans up and translates the supplier info dataset.
  3. `src/geocode_addresses.py` iterates over the supplier info dataset and adds geolocation data to it (it uses the Google Maps API key set in `config.ini`).

#### Miscellaneous

  1. `src/backup_data.py` uploads files from `data/` to an Amazon S3 bucket set in `config.ini`.
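Several of the scripts above read their settings from `config.ini`. As a rough sketch, this is how such a file could be parsed with Python's standard `configparser`; the section and key names below are illustrative assumptions, not the project's actual schema:

```python
# Hypothetical sketch of reading settings the way scripts such as
# src/geocode_addresses.py and src/backup_data.py might; section and key
# names are assumptions for illustration only.
import configparser

SAMPLE_CONFIG = """
[Google]
APIKey = your-google-maps-key

[Amazon]
Bucket = serenata-de-amor-data
Region = sa-east-1
"""

config = configparser.ConfigParser()
config.read_string(SAMPLE_CONFIG)  # a real script would use config.read("config.ini")

api_key = config["Google"]["APIKey"]
bucket = config["Amazon"]["Bucket"]
print(api_key, bucket)
```

Check the repository's own `config.ini` template for the real section and key names before relying on this sketch.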

### Datasets (`data/`)

Here we explain what the datasets inside `data/` are. They are not part of this repository but are downloaded with the scripts from `src/`. Most files are .xz compressed CSV. All files are named with an ISO 8601 date suffix.

  1. `data/YYYY-MM-DD-current-year.xz`, `data/YYYY-MM-DD-last-year.xz` and `data/YYYY-MM-DD-previous-years.xz`: datasets from the Quota for Exercising Parliamentary Activity; for details on their variables and meaning, check `data/YYYY-MM-DD-ceap-datasets.md`.
  2. `data/datasets-format.html`: original HTML in Portuguese from the Chamber of Deputies explaining the CEAP dataset variables.
  3. `data/YYYY-MM-DD-ceap-datasets.md`: table comparing contents from `data/YYYY-MM-DD-datasets_format.html` and our translation of variable names and descriptions.
  4. `data/YYYY-MM-DD-companies.xz`: dataset with supplier info containing all the fields offered by the Federal Revenue alternative API, complemented with geolocation (latitude and longitude) gathered from Google Maps.
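Since most of these files are .xz compressed CSV, they can be read with nothing but the standard library. This sketch builds a tiny in-memory stand-in for a `data/` file; the column names here are made up for illustration (the real ones are documented in `data/YYYY-MM-DD-ceap-datasets.md`):

```python
# Minimal sketch of reading an .xz compressed CSV like the files in data/.
# The sample data and column names are invented for illustration only.
import csv
import io
import lzma

# A tiny .xz compressed CSV standing in for a real data/ file.
raw = "document_id,total_net_value\n1,42.50\n2,10.00\n"
compressed = lzma.compress(raw.encode("utf-8"))

# For a real file: lzma.open("data/2016-08-08-current-year.xz", mode="rt", ...)
with lzma.open(io.BytesIO(compressed), mode="rt", encoding="utf-8") as handle:
    rows = list(csv.DictReader(handle))

total = sum(float(row["total_net_value"]) for row in rows)
print(len(rows), total)  # 2 52.5
```

In practice you would more likely use `pandas.read_csv`, which decompresses `.xz` files transparently, but the stdlib version above avoids any extra dependency.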

## Four moments

The project basically happens in four moments, and contributions are welcome in all of them:

| Moment | Description | Focus | Target |
|--------|-------------|-------|--------|
| Possibilities | Structuring hypotheses and strategies taking into account (a) the source of the data, (b) how feasible it is to get this data, and (c) what the purpose of bringing this data into the project is. | Contributions here require more sagacity than technical skills. | GitHub Issues |
| Data collection | Once we agree that a certain possibility is worth pursuing, we start writing code to get the data (these scripts go into `src/`). | Technical skills in scraping data and using APIs. | `src/` and `data/` |
| Exploring | Once data is ready to be used, we start exploring and analyzing it. | Mostly data science skills. | `develop/` |
| Reporting | Once a relevant finding emerges from the previous stages, it can be gathered with other similar findings in a report (e.g. putting together explorations on air tickets, car rentals and geolocation under a report on transportation). | Contributions here require good communication skills and a very basic understanding of quantitative methods. | `report/` |

## Jarbas

As soon as we started Serenata de Amor we felt the need for a simple web service to browse our data and refer to the documents we analyze. This is how Jarbas was created.

If you fancy web development, feel free to check Jarbas' source code, to check Jarbas' own Issues, and to contribute there too.