Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Canonicalization for cell arrays #552

Draft
wants to merge 35 commits into
base: master
Choose a base branch
from
Draft

Conversation

ajfriend
Copy link
Contributor

No description provided.

@coveralls
Copy link

coveralls commented Dec 22, 2021

Coverage Status

Coverage decreased (-0.6%) to 97.571% when pulling 428854c on ajfriend:low52 into be83872 on uber:master.

if (a < b) return -1;
if (a > b) return +1;
return 0;
```
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: closing }

Copy link
Collaborator

@nrabinowitz nrabinowitz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is really exciting! Made some suggestions, mostly naming

2. canonical
3. compacted and canonical

#### Low52 ordered
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: Not keen on this name. Maybe "cell digit ordered" or similar?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wanted to avoid just using "ordered", since there are potentially multiple different orderings someone might do, like the standard uint64_t ordering, for example.

What don't you like about lower 52? :) I liked that it was a very distinct name, so it was immediately clear to a reader that you're talking about a specific concept. It's also easier to code search. I'm worried that "cell digit ordered" is a bit too generic of a name.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I'd prefer the actual ordering to be opaque to the end user - the idea is:

  • We have a special, somewhat opaque format for sets of H3 indexes, and if you use it, you get access to these set functions.
  • We have 3 levels of canonicalization, L1, L2, L3. Each one is more expensive to apply than the last, but the subsequent runtime of the set functions is faster.

Beyond that, the user shouldn't care. Calling this "Low52" (very concrete), "Canonical" (completely opaque), and "Compacted Canonical" (partly concrete, partly opaque) just seems to invite confusion for the user about what they should use - the names here are about the implementation, not the end use. Treating all of the formats as opaque helps to ensure that they are used only with appropriate functions.

Copy link
Contributor Author

@ajfriend ajfriend Jan 5, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I'm convinced to keep things opaque. And now I'm considering getting rid of "L1", as I don't see any use cases. L1 was more of a by-product of me figuring out how to get things working.

As far as the algorithms are concerned, they won't care if a set is compacted or not as long as it is canonical. Because of that, I might also avoid a separate type for "L3" and just leave it to the user to keep track of whether a set is compacted. (I also don't think there's an obvious/easy test for if a set is compacted; you basically just have to run through the compact logic again and check that there are no changes.)

With that in mind, what would you think of something like this (modulo names):

typedef struct {
    H3Index *cells;
    int64_t numCells;
} CellArray;

bool isCanonicalSet(CellArray A);
H3Error toCanonicalSet(CellArray *A);

// in-place compact algo; no dynamic memory allocation needed.
// result comes out canonical as a nice by-product.
H3Error canonicalCompact(CellArray *A);

// functions below work on canonical; are faster on canonical compacted.
bool setContains(CellArray A, H3Index h);
bool doSetsIntersect(CellArray A, CellArray B);
bool isSubset(CellArray A, CellArray B);
H3Error setIntersection(CellArray A, CellArray B, CellArray *C);
H3Error setUnion(CellArray A, CellArray B, CellArray *C);

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Those look good to me! For clarity, the set functions only work if the set is canonical, right? I'd love some way to enforce this at a type level, e.g. instead of CellArray call it CanonicalSet, then take the args H3Index *cells, int64_t numCells for the functions that don't have this requirement (isCanonicalSet, toCanonicalSet, canonicalCompact -- which BTW I'd call toCompactCanonicalSet). That way the user has to either pass their array through toCanonicalSet or toCompactCanonicalSet in order to use the set functions, or at least they need to explicitly create a CanonicalSet themselves, affirming that their input is canonical.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So you'd have:

typedef struct {
    H3Index *cells;
    int64_t numCells;
} CanonicalSet;

bool isCanonicalSet(H3Index *cells, int64_t numCells);
H3Error toCanonicalSet(H3Index *cells, int64_t numCells, CanonicalSet *out);

// in-place compact algo; no dynamic memory allocation needed.
// result comes out canonical as a nice by-product.
H3Error toCompactCanonicalSet(H3Index *cells, int64_t numCells, CanonicalSet *out);

// functions below work on canonical; are faster on canonical compacted.
bool setContains(CanonicalSet A, H3Index h);
bool doSetsIntersect(CanonicalSet A, CanonicalSet B);
bool isSubset(CanonicalSet A, CanonicalSet B);
H3Error setIntersection(CanonicalSet A, CanonicalSet B, CanonicalSet *out);
H3Error setUnion(CanonicalSet A, CanonicalSet B, CanonicalSet *out);

A couple of questions here:

  • If we're already making this tradeoff between pre-processing and fast operations, do we need the non-compact version? I guess the benefit is that you can get the original cells out, as long as they were unique.
  • In the intersection and union, how do we manage the memory for the output? I'm thinking we might want to offer helpers here like maxSetIntersectionSize (size of the larger set) and maxSetUnionSize (sum of the set sizes) to help callers allocate memory for the out set

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Those look good to me! For clarity, the set functions only work if the set is canonical, right?

Yes. We could write up versions that work on sorted but not canonical sets, but I don't think it is worth it. They're mostly the same; the non-canonical but sorted sets just introduce a few extra annoying edge cases you have to consider.

H3Error toCanonicalSet(H3Index *cells, int64_t numCells, CanonicalSet *out);

I was thinking about this too. My hesitation (and why I originally wrote it as H3Error toCanonicalSet(CellArray *A)) was that out would point to the same memory as cells (since the operation is in-place). Maybe not a big deal, but I worry about issues that come up with multiple references to the same memory, like double calls to free.

  • If we're already making this tradeoff between pre-processing and fast operations, do we need the non-compact version? I guess the benefit is that you can get the original cells out, as long as they were unique.

A set of cells is different from its compact representation (that compact representation could be uncompacted to multiple different resolutions, for example). And if users want to uniquely identify a set of cells with a hash, I think we still want to provide them with a way to get a canonical representation of any set of cells.

I'm imagining situations where uncompacted sets of cells are efficiently stored as a tuple (compacted set id, uncompact resolution), and we'd want a hash that distinguishes the compacted and uncompacted sets.

  • In the intersection and union, how do we manage the memory for the output? I'm thinking we might want to offer helpers here like maxSetIntersectionSize (size of the larger set) and maxSetUnionSize (sum of the set sizes) to help callers allocate memory for the out set

Agreed. But it is actually the sum of the set sizes in both cases, and I was thinking that was easy enough to remember for now. But you're probably right in that we should provide functions so users don't need to know that.

And it's the sum, even for intersection, because things get weird when you start working with compact canonical sets.
For example, the intersection of the first two sets (in the sense we're talking about) here is the third:

Screen Shot 2022-01-05 at 9 50 12 PM

Screen Shot 2022-01-05 at 9 56 52 PM

Screen Shot 2022-01-05 at 9 50 21 PM

(Maybe actually, the worst-case bound is the sum of the set sizes minus 2?)

And it might be possible to have a slightly expensive function that computes the exact intersection size so you could allocate the exact amount of space needed, but that would result in basically running the intersection algorithm twice (I think). But maybe that's worth it?

We can check this property by ensuring that

```c
cmpCanon(a[i-1], a[i]) == -2
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again, naming - maybe just "cmpOrdered" and "cmpOrderedSet"?

Either way, I'd prefer Canonical to Canon (I think of a "canon" in this context as being a set of things, e.g "the Shakespeare canon" is the set of recognized works, each of which is canonical)


typedef struct {
H3Index *cells;
int64_t N;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

numCells?

- `cmpCanon(a, b) == -1` if `a` is a child (or further descendant) of `b`
- `cmpCanon(a, b) == +1` if `b` ... `a`
- `cmpCanon(a, b) == -2` if `a` < `b` in the low52 ordering, but they are not related
- `cmpCanon(a, b) == +2` if `b` < `a` ...
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the use case for this?

int isLow52Sorted(const H3Index *cells, const int64_t N);
H3Error low52Sort(H3Index *cells, const int64_t N);

int isCanonicalCells(const H3Index *cells, const int64_t N);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ugh. Maybe isCanonicalSet or isOrderedSet?

Basically I want singular, type-style names (and maybe type aliases) for the different kinds of arrays, e.g. OrderedList, OrderedSet, CompactOrderedSet. Maybe these should even be structs with a "type" enum specifying which one they are, so that set operations can work on all of them equivalently.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like this idea! I think things like OrderedList, OrderedSet, CompactOrderedSet would definitely help to make things clearer.

But that starts to suggest changes to the API to start using these structs instead of our usual pointer and number of elements. Would we want to do that?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, I was thinking of using "canonical" instead of "set", because they're slightly different. You can have a distinct set of cells that includes parent/child cells, but canonical forbids that. So wanted a concept that was very obvious to a reader as distinct from the usual notion of "set".

What do you think?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about something like Low52List, CanonicalSet? I'm not yet sure if we need a separate type for compact and canonical tho...

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See my comment above. I think having these new functions (and, for the moment, only these new functions) use the structs as semi-opaque objects would be valuable, and I would choose names that indicate the use cases and tradeoffs the user incurs, rather than the implementation that leads to these tradeoffs.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants