Google Summer of Code 2022

Martin Fleischmann edited this page Mar 2, 2022 · 6 revisions

GeoPandas Introduction

GeoPandas is an open-source project that makes it easier to work with geospatial data in Python. GeoPandas combines the capabilities of pandas and shapely (Python interface to the GEOS library), bringing geospatial operations to pandas and offering a high-level, performant interface for applying shapely operations to many geometries at once. It builds on many other libraries in the geospatial ecosystem, including pygeos (vectorized GEOS API), fiona (reading/writing vector data with GDAL), pyproj (projections), rtree (spatial index), and others. GeoPandas enables you to easily do operations in Python that would otherwise require desktop applications like ArcGIS or QGIS or a spatial database such as PostGIS.

Contributing Guide for Students

Please see the NumFOCUS Contributing Guide for Students for helpful suggestions on putting together your project ideas and preparing your proposal.

Proposal template

Ideally, you should use the template to write your proposal.

Project ideas

We have four project ideas for GSoC 2022:

Pure Python GeoPackage IO

GeoPandas currently depends on Fiona or Pyogrio to read and write geospatial file formats. Those, in turn, depend on GDAL, a mighty C/C++ library that can read almost any GIS file. However, installing Fiona and GDAL can be cumbersome due to dependency conflicts, which are not always easy to resolve.

To make it easier to get started with GeoPandas, we would like to make GDAL an optional dependency. To do so, GeoPandas needs to support alternative file I/O that does not depend on GDAL or any additional C library; it should provide this functionality purely in Python. We do not want to replace the full capability of fiona; fiona can still be installed optionally by those who need the full suite of capabilities.

We would like to enable native Python support for reading and writing the three most widely used geospatial file formats:

  • GeoJSON: nearly complete
  • ESRI Shapefile: proof of concept
  • GeoPackage: this project

GeoPandas can already generate GeoJSON-like output and create GeoDataFrames from JSON, and the methods need only minor amendments to fully support GeoJSON I/O.
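
To illustrate why GeoJSON maps naturally onto a table of attributes plus one geometry column, here is a minimal sketch that uses only the standard library. It is not GeoPandas code; the helper names `features_to_rows` and `rows_to_features` and the sample data are hypothetical.

```python
import json

# Hypothetical sample data: a FeatureCollection with two point features.
geojson_text = """
{"type": "FeatureCollection", "features": [
  {"type": "Feature", "properties": {"name": "A"},
   "geometry": {"type": "Point", "coordinates": [1.0, 2.0]}},
  {"type": "Feature", "properties": {"name": "B"},
   "geometry": {"type": "Point", "coordinates": [3.0, 4.0]}}
]}
"""


def features_to_rows(collection):
    """Flatten each feature into one dict: its properties plus a 'geometry' key."""
    return [dict(f["properties"], geometry=f["geometry"]) for f in collection["features"]]


def rows_to_features(rows):
    """Rebuild a FeatureCollection from rows produced by features_to_rows."""
    features = []
    for row in rows:
        props = {k: v for k, v in row.items() if k != "geometry"}
        features.append({"type": "Feature", "properties": props, "geometry": row["geometry"]})
    return {"type": "FeatureCollection", "features": features}


collection = json.loads(geojson_text)
rows = features_to_rows(collection)      # one record per feature
round_trip = rows_to_features(rows)      # back to a GeoJSON-like mapping
```

Each row corresponds to one record of a GeoDataFrame-like table. This sketch assumes no property is itself named `geometry`.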

Preliminary support for ESRI Shapefiles has been developed (see PR #1580) using a relatively light-weight implementation based on pyshp, which deals with the file format itself.

We have started prototyping GeoPackage support in a package called pgpkg and we believe this could become the basis for support within GeoPandas.

The GSoC project should expand and refine pgpkg and turn it into a production-ready library linked directly to GeoPandas to deliver a pure Python interface to GeoPackage, limited to the reading and writing of vector data. Major tasks include updating the code to correctly support the GeoPackage specification, adding tests, and integrating within GeoPandas.
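
To give a sense of what supporting the specification involves, the sketch below decodes the header of a GeoPackage geometry blob using only the standard library. A GeoPackage is a SQLite database that stores each geometry as a small binary header followed by WKB. This is not pgpkg code: the function names are hypothetical, the sample blob is built by hand, and only 2D points are decoded.

```python
import struct

# Envelope sizes in bytes for each envelope indicator code in the flags byte
# (0: none, 1: XY, 2: XYZ, 3: XYM, 4: XYZM), as defined by the GeoPackage spec.
ENVELOPE_SIZES = {0: 0, 1: 32, 2: 48, 3: 48, 4: 64}


def parse_gpkg_header(blob):
    """Split a GeoPackage geometry blob into header fields and the WKB payload."""
    if blob[:2] != b"GP":
        raise ValueError("not a GeoPackage geometry blob")
    version, flags = blob[2], blob[3]
    little_endian = bool(flags & 0b1)        # bit 0: byte order of the header
    envelope_code = (flags >> 1) & 0b111     # bits 1-3: envelope contents
    is_empty = bool((flags >> 4) & 0b1)      # bit 4: empty geometry flag
    if envelope_code not in ENVELOPE_SIZES:
        raise ValueError("invalid envelope indicator")
    order = "<" if little_endian else ">"
    (srs_id,) = struct.unpack(order + "i", blob[4:8])
    wkb_start = 8 + ENVELOPE_SIZES[envelope_code]
    return {"version": version, "srs_id": srs_id, "empty": is_empty, "wkb": blob[wkb_start:]}


def parse_wkb_point(wkb):
    """Decode a 2D WKB point (geometry type 1) into an (x, y) tuple."""
    order = "<" if wkb[0] == 1 else ">"
    (geom_type,) = struct.unpack(order + "I", wkb[1:5])
    if geom_type != 1:
        raise ValueError("only 2D points are handled in this sketch")
    return struct.unpack(order + "dd", wkb[5:21])


# Hand-built sample: little-endian header, no envelope, SRS 4326, one point.
sample_wkb = struct.pack("<BIdd", 1, 1, 16.6, 49.2)
sample_blob = b"GP" + bytes([0, 0b00000001]) + struct.pack("<i", 4326) + sample_wkb

header = parse_gpkg_header(sample_blob)
point = parse_wkb_point(header["wkb"])
```

A full reader would also query the SQLite metadata tables (such as `gpkg_contents` and `gpkg_geometry_columns`) to find feature tables and their geometry columns, and would handle every WKB geometry type.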

This project will contribute toward enabling GeoPandas to natively read and write vector files without fiona or GDAL.

Skills

  • Experience with vector GIS file formats
  • Familiarity with GeoPandas data structures (GeoSeries, GeoDataFrame)

Difficulty level

  • intermediate

Project size

~175 hours

Mentors

Resources

GeoPackage implementation

Relevant Python projects

ESRI Shapefile implementation


S2 - bringing spherical geometry to Python

GeoPandas currently assumes that all geometries it operates on are planar and uses the GEOS C library to manage them. Unfortunately, that means that any point on Earth needs to be represented in a projected coordinate system (in meters, feet, and the like) for operations to work correctly. While that is a standard approach in GIS, it may be suboptimal for some applications. Another option is to replace GEOS with S2, a geometry engine that represents data on a three-dimensional sphere, similar to what the sf package in R does. However, a reliable interface between S2, which is written in C++, and Python does not currently exist.
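
The difference between planar and spherical assumptions can be shown with a small standard-library sketch. It compares a naive Euclidean distance computed on longitude/latitude degrees with a great-circle (haversine) distance on a sphere. This illustrates the underlying problem only; it is not S2's API, and the function names and the mean Earth radius used here are assumptions of the example.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # assumed mean Earth radius; a spherical approximation


def planar_distance_deg(lon1, lat1, lon2, lat2):
    """Euclidean distance treating degrees as planar units (result is in degrees)."""
    return math.hypot(lon2 - lon1, lat2 - lat1)


def great_circle_km(lon1, lat1, lon2, lat2):
    """Haversine distance in kilometres on a sphere."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


# Two pairs of points, each one degree of longitude apart.
at_equator = (0.0, 0.0, 1.0, 0.0)
at_60_north = (0.0, 60.0, 1.0, 60.0)

planar = [planar_distance_deg(*at_equator), planar_distance_deg(*at_60_north)]
spherical = [great_circle_km(*at_equator), great_circle_km(*at_60_north)]
```

The planar calculation reports the same distance for both pairs, while on the sphere the pair at 60° N is only about half as far apart as the pair at the equator. A spherical geometry engine removes the need to pick a suitable projection before making such measurements.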

The GSoC project should explore potential development paths and result in a functional prototype of S2 Python bindings that could be exposed to a user similarly to how PyGEOS exposes GEOS capabilities and eventually embedded in GeoPandas as a second geometry engine.

Skills

  • Experience with Python and C/C++
  • Experience in wrapping C++ libraries in Python is welcome

Difficulty level

  • advanced

Project size

~350 hours

Mentors

Resources


GeoPandas - Dask bridge to scale geospatial analysis

Dask (https://dask.org/) is a library that brings parallel and distributed computing to the PyData ecosystem. For example, it provides a Dask DataFrame that consists of partitioned pandas DataFrames. Each partition can be processed by a different process, enabling the computation to be done in parallel or even out-of-core.
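
The partitioning idea can be sketched without Dask. The example below splits a list of rectangles into chunks, computes a partial result per chunk, and combines the results. It uses `concurrent.futures.ThreadPoolExecutor` from the standard library as a stand-in scheduler, so it shows the split/map/combine pattern only, not Dask's API or its performance characteristics. The helper names and data are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_partitions(rows, npartitions):
    """Split a list into roughly equal, contiguous chunks."""
    size = max(1, -(-len(rows) // npartitions))  # ceiling division, at least 1
    return [rows[i:i + size] for i in range(0, len(rows), size)]


def partition_area(rectangles):
    """Sum the areas of axis-aligned rectangles given as (minx, miny, maxx, maxy)."""
    return sum((maxx - minx) * (maxy - miny) for minx, miny, maxx, maxy in rectangles)


# Hypothetical data: 1000 unit squares laid out in a row.
rectangles = [(i, 0, i + 1, 1) for i in range(1000)]
partitions = split_into_partitions(rectangles, npartitions=4)

# Each partition is handled independently; results are combined at the end.
with ThreadPoolExecutor(max_workers=4) as pool:
    partial_sums = list(pool.map(partition_area, partitions))

total_area = sum(partial_sums)
```

Because each partition is processed independently, the same pattern extends to separate processes or machines, which is the part that Dask takes care of.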

GeoPandas operations relying on GEOS are currently all single-threaded, which severely limits scalability and leaves most CPU cores idle.

However, Dask could provide ways to scale geospatial operations in GeoPandas, much as it does with pandas. There has been some effort to build a bridge between Dask and GeoPandas, currently taking the shape of the dask-geopandas library. While it already supports basic parallelisation, some of the advanced components are not ready yet.

This project should further extend the package by enabling distributed spatial indexing, overlapping computation, and a range of other methods (e.g. overlay, plot), aiming to get closer to a production-ready stage. The project may involve direct contributions to Dask to add any required support in the upstream library.

Skills

  • Experience with Dask
  • Familiarity with GeoPandas

Difficulty level

  • advanced

Project size

~175/350 hours (flexible, depending on the specification of the project proposal)

Mentors

Resources

Current development

Initial efforts and proof of concept

Dask documentation


Beautiful maps made simple: a static plotting project

GeoPandas currently covers a broad range of geospatial tasks, from data exploration to advanced analysis. However, there is one point at which users may be tempted to switch to different software: plotting. GeoPandas can create static maps based on matplotlib, but they are a bit basic at the moment. It isn't straightforward to generate a complex, production-quality map that can go straight into an academic journal or an infographic. We want to change this by removing the current barriers and making it simple to create beautiful maps.

The project is composed of multiple tasks. We need to link matplotlib functionality better to remove existing limitations on plot customisation. We should rework how the legend works and make it easy to adapt to users' needs. Adding a scale bar, a north arrow, and other cartographic features (e.g. graticules) should be straightforward. Further tasks may follow from the student's own analysis.

We need to diagnose what is required, fix the bugs we are aware of, and add new features that are closely linked to the existing plotting ecosystem built around matplotlib and GeoPandas.

Skills

  • Experience with plotting
  • Familiarity with GeoPandas and matplotlib

Difficulty level

  • intermediate

Project size

~175 hours

Mentors

Resources

GeoPandas plotting issues

Relevant Python packages