Skip to content

TM029 Pyret and Google Sheets (Part 2)

Sam Dooman edited this page Nov 10, 2016 · 5 revisions

Pyret and Google Sheets (Part 2)

If you're reading this before completing Part 1, here is a link to Part 1.

This page serves as a tutorial for creating tables in Pyret from data contained in Sheets documents in your Google Drive. If you are unfamiliar with Sheets, Sheets is Google's free spreadsheet software. After this tutorial, you will be able to import tabular data from Sheets documents and manipulate it with Pyret. Warning: This tutorial is still under development, so the content is subject to change.

Table of Contents

0. Terminology

Terminology Meaning
Sheets Google's spreadsheet software
Sheets document A file to be opened in the Sheets software
sheet A single page containing tabular data within a Sheets document
Sanitizer A function the table loader uses to load inconsistently formatted data in a standard way

1. Sanitizing Data

NOTE: If you are not familiar with Pyret's primitive values and data types, please review them here before continuing the tutorial.

We've successfully loaded a small, organized example dataset. Unfortunately, when using real world data sets, it won't always be this easy. For this next exercise, add the following row to your music Sheets document:

Now, when you run your program, you should see this error message:

Pyret is smart. It tries to infer what kind of data is in each column. When loading the table, Pyret will infer from the first few entries in the plays column that it contains Numbers. When loading the entry in our new row, Pyret has to throw an error because "--" isn't a Number, it's a String!

This may seem like a contrived example, but in real world data sets, inconsistencies like this happen all the time. Entries are left blank, entries that are supposed to be numerical use "n/a", "null", or "--" to represent 0, etc. What's worse is data sets often have hundreds of thousands of rows, with too many inconsistencies to manually fix.

Luckily, Pyret has a way to clean up these messy data sets: Sanitizers! Sanitizers are functions applied to each entry of a column during the table loading. These functions allow the programmer to clean up messy data entries by specifying the how to interpret data that doesn't satisfy the inferred type.

First, add the following code to the top of your program file:

import data-source as DS

This library contains all of Pyret's built in sanitizing functions. Here is the source code, which lists all of Pyret's built in sanitizing functions.

Next, modify your previous table loading code to include a sanitizing function:

music-sheet = load-table: title, artist, album, plays
  source: music-sheets-doc.sheet-by-name("Sheet1", true)
  sanitize plays using DS.string-sanitizer
end

Now, when running your program, you'll see that all of the entries in the plays column are treated as Strings (they have quotation marks now!)

Using a string sanitizer allows our code to compile, but there's a problem: plays should contain numbers, so that we can extract this column as a list and perform operations on it. In an ideal situation, we could tell the table loader to turn all instances of "--" into 0s, since it represents a play count of 0. Here's where we use one of Pyret's handiest features: we can define our own custom sanitizer as a function!

First, add this code to the top of your program. This will allow us to use Pyret's implementation of an Option data type (for more information on Options, read this article).

import option as O

Next, we're going to define this custom sanitizer:

fun plays-sanitizer(x, row, col):
  cases(DS.CellContent) x:
    | c-str(s) =>
      cases(O.Option) string-to-number(s):
        | none => 
          if s == "--":
            0
          else:
            raise("cannot convert this string to number of plays")
          end
        | some(n) => n
      end
    | c-num(n) => n
    | c-bool(b) => raise("cannot convert boolean to number of plays")
    | c-custom(datum) => raise("cannot convert this data to number of plays")
    | c-empty => 0
  end
end

There's a lot going on here, so we'll break it down line by line. First, every custom sanitizer must take 3 arguments: the value x in a particular table entry, and row, col which are the row and column of this entry, respectively.

Next, let's examine the function body. Sanitizers are functions that allow the table loader to handle cases where the entry's data type is different than expected. Therefore, a sanitizer must give different output for different data types. The cases statement does just that; for example, if x can be cast as a Number n (this is the question asked by the expression c-num(n)), then the value to be loaded into the table is n. Sanitizers must account for each of the potential data types a Sheets document entry could contain: Booleans, Strings, Numbers, Custom data types, and empty entries.

So what will happen in this case when the table attempts to load "--"? First, Pyret will determine that "--" can be cast as a String s. Then, if the string can be converted into some number n (for instance string-to-number("13") would return 13), this number n is returned. Otherwise, if the string is "--", then the entry represents 0 plays. If the string is not "--", then an exception is thrown.

All that's left is to modify the table loading code to use our custom sanitizer:

music-sheet = load-table: title, artist, album, plays
  source: music-sheets-doc.sheet-by-name("Sheet1", true)
  sanitize plays using plays-sanitizer
end

Now, when you run your code and type music-sheet on the side, you should see a new table. Notice that instead of --, we have a 0 for the last table entry.

You've now successfully loaded and sanitized a table from Google Drive! Now it's time to put your knowledge to the test.

##2. Practice

Suppose you're a teacher running an experiment with your class: each week for 4 weeks, you have the students record (in an anonymized spreadsheet) the number of hours they spent that week playing video games. Unfortunately, the way students record their entries isn't consistent!

Here's a link to the spreadsheet: https://docs.google.com/spreadsheets/d/1R6KxV70c3JOs_ymoCGRXS76dA5iin8Xh7jB1n5l8gAU/edit#gid=0. You can load this spreadsheet by its ID (the text that comes after the "/d/" in the URL) with the following code:

video-games-sheets-doc = GDS.load-spreadsheet("1R6KxV70c3JOs_ymoCGRXS76dA5iin8Xh7jB1n5l8gAU")

NOTE: For documents that aren't in your code.pyret.org folder, the permissions must be specially set to "Anyone with the link Can View". To accomplish this in your own spreadsheet, Go to File -> Share -> Get Sharable Link

  1. Write a custom sanitizer function that will convert string entries to 0.
  2. Load this table with columns student, week1, week2, week3, week4 and apply the sanitizer to each column except student

SCROLL DOWN FOR SOLUTIONS

video-games-sheets-doc = GDS.load-spreadsheet("1R6KxV70c3JOs_ymoCGRXS76dA5iin8Xh7jB1n5l8gAU")

fun video-sanitizer(x, row, col):
  cases(DS.CellContent) x:
    | c-str(s) => 0
    | c-num(n) => n
    | c-bool(b) => raise("cannot convert boolean to hours")
    | c-custom(datum) => raise("cannot convert custom data to hours")
    | c-empty => empty
  end
end

video-sheet = load-table: student, week1, week2, week3, week4
  source: video-games-sheets-doc.sheet-by-name("Sheet1", true)
  sanitize week1 using video-sanitizer
  sanitize week2 using video-sanitizer
  sanitize week3 using video-sanitizer
  sanitize week4 using video-sanitizer
end