Canonicalization for cell arrays #552

ajfriend · 2021-12-22T21:06:52Z

No description provided.

coveralls · 2021-12-22T21:09:34Z

Coverage decreased (-0.6%) to 97.571% when pulling 428854c on ajfriend:low52 into be83872 on uber:master.

isaacbrodsky · 2021-12-28T18:57:09Z

dev-docs/RFCs/canonicalization.md

+    if (a < b) return -1;
+    if (a > b) return +1;
+    return 0;
+```


nit: closing }

nrabinowitz

This is really exciting! Made some suggestions, mostly naming

nrabinowitz · 2022-01-03T22:16:26Z

dev-docs/RFCs/canonicalization.md

+2. canonical
+3. compacted and canonical
+
+#### Low52 ordered


Nit: Not keen on this name. Maybe "cell digit ordered" or similar?

I wanted to avoid just using "ordered", since there are potentially multiple different orderings someone might do, like the standard uint64_t ordering, for example.

What don't you like about lower 52? :) I liked that it was a very distinct name, so it was immediately clear to a reader that you're talking about a specific concept. It's also easier to code search. I'm worried that "cell digit ordered" is a bit too generic of a name.
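To make the distinction between the two orderings concrete, here is a minimal sketch. It assumes the standard H3 index layout, in which the lower 52 bits hold the 7 base-cell bits followed by fifteen 3-bit digits. The names `LOW52_MASK`, `cmpLow52`, and `sortLow52` are illustrative and are not taken from this PR.

```c
#include <stdint.h>
#include <stdlib.h>

/* Assumed layout: bits 0..51 = base cell (7 bits) + 15 digits (3 bits each). */
#define LOW52_MASK ((UINT64_C(1) << 52) - 1)

/* Three-way comparison on the lower 52 bits only; usable with qsort. */
static int cmpLow52(const void *pa, const void *pb) {
    uint64_t a = *(const uint64_t *)pa & LOW52_MASK;
    uint64_t b = *(const uint64_t *)pb & LOW52_MASK;
    if (a < b) return -1;
    if (a > b) return +1;
    return 0;
}

/* Sort an array of cell indexes in place by their lower 52 bits. */
static void sortLow52(uint64_t *cells, size_t n) {
    qsort(cells, n, sizeof(uint64_t), cmpLow52);
}
```

Under these assumptions, a resolution-1 cell in base cell 0 sorts after the resolution-0 base cell 1 as a raw `uint64_t`, but before it in the lower-52 ordering, because the resolution bits sit above bit 51 and are masked out.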

I think I'd prefer the actual ordering to be opaque to the end user - the idea is:

- We have a special, somewhat opaque format for sets of H3 indexes, and if you use it, you get access to these set functions.
- We have 3 levels of canonicalization, L1, L2, L3. Each one is more expensive to apply than the last, but the subsequent runtime of the set functions is faster.

Beyond that, the user shouldn't care. Calling this "Low52" (very concrete), "Canonical" (completely opaque), and "Compacted Canonical" (partly concrete, partly opaque) just seems to invite confusion for the user about what they should use - the names here are about the implementation, not the end use. Treating all of the formats as opaque helps to ensure that they are used only with appropriate functions.

I think I'm convinced to keep things opaque. And now I'm considering getting rid of "L1", as I don't see any use cases. L1 was more of a by-product of me figuring out how to get things working.

As far as the algorithms are concerned, they won't care if a set is compacted or not as long as it is canonical. Because of that, I might also avoid a separate type for "L3" and just leave it to the user to keep track of whether a set is compacted. (I also don't think there's an obvious/easy test for whether a set is compacted; you basically just have to run through the compact logic again and check that there are no changes.)

With that in mind, what would you think of something like this (modulo names):

    typedef struct {
        H3Index *cells;
        int64_t numCells;
    } CellArray;

    bool isCanonicalSet(CellArray A);
    H3Error toCanonicalSet(CellArray *A);

    // in-place compact algo; no dynamic memory allocation needed.
    // result comes out canonical as a nice by-product.
    H3Error canonicalCompact(CellArray *A);

    // functions below work on canonical; are faster on canonical compacted.
    bool setContains(CellArray A, H3Index h);
    bool doSetsIntersect(CellArray A, CellArray B);
    bool isSubset(CellArray A, CellArray B);
    H3Error setIntersection(CellArray A, CellArray B, CellArray *C);
    H3Error setUnion(CellArray A, CellArray B, CellArray *C);

Those look good to me! For clarity, the set functions only work if the set is canonical, right? I'd love some way to enforce this at a type level, e.g. instead of CellArray call it CanonicalSet, then take the args H3Index *cells, int64_t numCells for the functions that don't have this requirement (isCanonicalSet, toCanonicalSet, canonicalCompact -- which BTW I'd call toCompactCanonicalSet). That way the user has to either pass their array through toCanonicalSet or toCompactCanonicalSet in order to use the set functions, or at least they need to explicitly create a CanonicalSet themselves, affirming that their input is canonical.

So you'd have:

    typedef struct {
        H3Index *cells;
        int64_t numCells;
    } CanonicalSet;

    bool isCanonicalSet(H3Index *cells, int64_t numCells);
    H3Error toCanonicalSet(H3Index *cells, int64_t numCells, CanonicalSet *out);

    // in-place compact algo; no dynamic memory allocation needed.
    // result comes out canonical as a nice by-product.
    H3Error toCompactCanonicalSet(H3Index *cells, int64_t numCells, CanonicalSet *out);

    // functions below work on canonical; are faster on canonical compacted.
    bool setContains(CanonicalSet A, H3Index h);
    bool doSetsIntersect(CanonicalSet A, CanonicalSet B);
    bool isSubset(CanonicalSet A, CanonicalSet B);
    H3Error setIntersection(CanonicalSet A, CanonicalSet B, CanonicalSet *out);
    H3Error setUnion(CanonicalSet A, CanonicalSet B, CanonicalSet *out);

A couple of questions here:

- If we're already making this tradeoff between pre-processing and fast operations, do we need the non-compact version? I guess the benefit is that you can get the original cells out, as long as they were unique.
- In the intersection and union, how do we manage the memory for the output? I'm thinking we might want to offer helpers here like maxSetIntersectionSize (size of the larger set) and maxSetUnionSize (sum of the set sizes) to help callers allocate memory for the out set.

> Those look good to me! For clarity, the set functions only work if the set is canonical, right?

Yes. We could write up versions that work on sorted but not canonical sets, but I don't think it is worth it. They're mostly the same; the non-canonical but sorted sets just introduce a few extra annoying edge cases you have to consider.

> H3Error toCanonicalSet(H3Index *cells, int64_t numCells, CanonicalSet *out);

I was thinking about this too. My hesitation (and why I originally wrote it as H3Error toCanonicalSet(CellArray *A)) was that out would point to the same memory as cells (since the operation is in-place). Maybe not a big deal, but I worry about issues that come up with multiple references to the same memory, like double calls to free.
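The aliasing concern can be seen in a small sketch. This is not the PR's `toCanonicalSet`: it only sorts by the lower 52 bits and drops exact duplicates, leaving out any parent/child handling, and every name in it (`CanonicalSetView`, `sortAndDedupeInPlace`) is hypothetical. It assumes valid cell indexes, so equal values end up adjacent after sorting.

```c
#include <stdint.h>
#include <stdlib.h>

typedef uint64_t H3Index; /* mirrors the library's 64-bit index type */

#define LOW52_MASK ((UINT64_C(1) << 52) - 1)

/* Hypothetical output type: a view that borrows the caller's buffer. */
typedef struct {
    H3Index *cells;   /* NOT owned: aliases the input array */
    int64_t numCells;
} CanonicalSetView;

static int cmpLow52(const void *pa, const void *pb) {
    uint64_t a = *(const H3Index *)pa & LOW52_MASK;
    uint64_t b = *(const H3Index *)pb & LOW52_MASK;
    return (a > b) - (a < b);
}

/* Sort in place, drop exact duplicates, and describe the result as a view
 * over the same memory. No allocation happens here. */
static void sortAndDedupeInPlace(H3Index *cells, int64_t numCells,
                                 CanonicalSetView *out) {
    qsort(cells, (size_t)numCells, sizeof *cells, cmpLow52);
    int64_t w = 0; /* write position for the next unique cell */
    for (int64_t r = 0; r < numCells; r++) {
        if (w == 0 || cells[w - 1] != cells[r]) cells[w++] = cells[r];
    }
    out->cells = cells; /* same pointer as the input */
    out->numCells = w;
}
```

After the call, `out->cells` and the original `cells` pointer refer to the same block, so a caller that heap-allocated the input must free it exactly once, through either name but not both.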

> If we're already making this tradeoff between pre-processing and fast operations, do we need the non-compact version? I guess the benefit is that you can get the original cells out, as long as they were unique.

A set of cells is different from its compact representation (that compact representation could be uncompacted to multiple different resolutions, for example). And if users want to uniquely identify a set of cells with a hash, I think we still want to provide them with a way to get a canonical representation of any set of cells.

I'm imagining situations where uncompacted sets of cells are efficiently stored as a tuple (compacted set id, uncompact resolution), and we'd want a hash that distinguishes the compacted and uncompacted sets.

> In the intersection and union, how do we manage the memory for the output? I'm thinking we might want to offer helpers here like maxSetIntersectionSize (size of the larger set) and maxSetUnionSize (sum of the set sizes) to help callers allocate memory for the out set

Agreed. But it is actually the sum of the set sizes in both cases, and I was thinking that was easy enough to remember for now. But you're probably right in that we should provide functions so users don't need to know that.

And it's the sum, even for intersection, because things get weird when you start working with compact canonical sets.
For example, the intersection of the first two sets (in the sense we're talking about) here is the third:

(Maybe actually, the worst-case bound is the sum of the set sizes minus 2?)

And it might be possible to have a slightly expensive function that computes the exact intersection size so you could allocate the exact amount of space needed, but that would result in basically running the intersection algorithm twice (I think). But maybe that's worth it?
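A possible shape for the sizing helpers discussed above, using the sum-of-sizes bound for both operations. The helper names come from this thread, but the signatures, the `bool` return, and the overflow handling are assumptions made for this sketch. The tighter "sum minus 2" bound is only conjectured in the thread, so it is not used here.

```c
#include <stdbool.h>
#include <stdint.h>

/* Writes numA + numB to *out. Returns false on negative sizes or overflow. */
static bool sumOfSetSizes(int64_t numA, int64_t numB, int64_t *out) {
    if (numA < 0 || numB < 0) return false;
    if (numA > INT64_MAX - numB) return false; /* sum would overflow */
    *out = numA + numB;
    return true;
}

/* Upper bound on the number of cells in the union of two sets. */
static bool maxSetUnionSize(int64_t numA, int64_t numB, int64_t *out) {
    return sumOfSetSizes(numA, numB, out);
}

/* Upper bound on the number of cells in the intersection. Per the thread,
 * this is also the sum once compacted sets are involved, not the size of
 * the larger input. */
static bool maxSetIntersectionSize(int64_t numA, int64_t numB, int64_t *out) {
    return sumOfSetSizes(numA, numB, out);
}
```

A caller would allocate `*out` cells for the output set before calling the set operation, then read the actual count from the result.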

nrabinowitz · 2022-01-03T22:21:40Z

dev-docs/RFCs/canonicalization.md

+We can check this property by ensuring that
+
+```c
+cmpCanon(a[i-1], a[i]) == -2


Again, naming - maybe just "cmpOrdered" and "cmpOrderedSet"?

Either way, I'd prefer Canonical to Canon (I think of a "canon" in this context as being a set of things, e.g. "the Shakespeare canon" is the set of recognized works, each of which is canonical)

nrabinowitz · 2022-01-03T22:26:05Z

src/apps/testapps/testLow52.c

+
+typedef struct {
+    H3Index *cells;
+    int64_t N;


numCells?

nrabinowitz · 2022-01-03T22:27:07Z

dev-docs/RFCs/canonicalization.md

+- `cmpCanon(a, b) == -1` if `a` is a child (or further descendant) of `b`
+- `cmpCanon(a, b) == +1` if `b` ... `a`
+- `cmpCanon(a, b) == -2` if `a` < `b` in the low52 ordering, but they are not related
+- `cmpCanon(a, b) == +2` if `b` < `a` ...


What's the use case for this?
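One use of the five-way comparison is the adjacent-pair canonicity check quoted earlier in this review. The sketch below is a guess at how such a comparison could be written from the standard H3 bit layout (resolution in bits 52 to 55, base cell plus digits in the lower 52 bits). It is not the PR's implementation. The names are hypothetical, inputs are assumed to be valid cells, and returning 0 for equal inputs is an assumption, since the RFC excerpt does not show that case.

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t H3Index; /* mirrors the library's 64-bit index type */

#define LOW52_MASK ((UINT64_C(1) << 52) - 1)

static int resOf(H3Index h) { return (int)((h >> 52) & 0xF); }

/* True if a is a strict descendant of b: a is finer, and the two agree on
 * the base cell and on the first resOf(b) digits. */
static bool isDescendant(H3Index a, H3Index b) {
    int rb = resOf(b);
    if (resOf(a) <= rb) return false;
    int shift = 3 * (15 - rb); /* drop the digits below b's resolution */
    return ((a & LOW52_MASK) >> shift) == ((b & LOW52_MASK) >> shift);
}

/* -1/+1: a is a descendant/ancestor of b; -2/+2: unrelated, ordered by the
 * lower 52 bits; 0: equal (assumed). */
static int cmpCanonicalSketch(H3Index a, H3Index b) {
    if (a == b) return 0;
    if (isDescendant(a, b)) return -1;
    if (isDescendant(b, a)) return +1;
    return (a & LOW52_MASK) < (b & LOW52_MASK) ? -2 : +2;
}

/* Canonical here means: every adjacent pair compares as -2, i.e. strictly
 * increasing in the lower-52 order with no duplicates or related cells. */
static bool isCanonicalSketch(const H3Index *cells, int64_t numCells) {
    for (int64_t i = 1; i < numCells; i++) {
        if (cmpCanonicalSketch(cells[i - 1], cells[i]) != -2) return false;
    }
    return true;
}
```

Checking only adjacent pairs is enough under this layout because a cell's descendants all sort immediately before it in the lower-52 order, so any related pair in a sorted array shows up as a related adjacent pair.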

nrabinowitz · 2022-01-03T22:30:51Z

src/h3lib/include/h3api.h.in

+int isLow52Sorted(const H3Index *cells, const int64_t N);
+H3Error low52Sort(H3Index *cells, const int64_t N);
+
+int isCanonicalCells(const H3Index *cells, const int64_t N);


Ugh. Maybe isCanonicalSet or isOrderedSet?

Basically I want singular, type-style names (and maybe type aliases) for the different kinds of arrays, e.g. OrderedList, OrderedSet, CompactOrderedSet. Maybe these should even be structs with a "type" enum specifying which one they are, so that set operations can work on all of them equivalently.

I like this idea! I think things like OrderedList, OrderedSet, CompactOrderedSet would definitely help to make things clearer.

But that starts to suggest changes to the API to start using these structs instead of our usual pointer and number of elements. Would we want to do that?

Also, I was thinking of using "canonical" instead of "set", because they're slightly different. You can have a distinct set of cells that includes parent/child cells, but canonical forbids that. So I wanted a concept that was very obvious to a reader as distinct from the usual notion of "set".

What do you think?

What about something like Low52List, CanonicalSet? I'm not yet sure if we need a separate type for compact and canonical tho...

See my comment above. I think having these new functions (and, for the moment, only these new functions) use the structs as semi-opaque objects would be valuable, and I would choose names that indicate the use cases and tradeoffs the user incurs, rather than the implementation that leads to these tradeoffs.
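For reference, the tagged-struct idea raised earlier in this thread could look roughly like the following. The three kinds reuse the names suggested above. The enum constants, the struct name `TaggedCellArray`, and the helper are made up for illustration and do not appear in the PR.

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t H3Index; /* mirrors the library's 64-bit index type */

/* Which guarantees the array carries; later kinds imply the earlier ones. */
typedef enum {
    CELLS_ORDERED_LIST,       /* sorted, may contain duplicates or relatives */
    CELLS_ORDERED_SET,        /* canonical */
    CELLS_COMPACT_ORDERED_SET /* canonical and compacted */
} CellArrayKind;

/* Semi-opaque container: set functions would inspect `kind` rather than
 * trusting the caller to remember what was done to the array. */
typedef struct {
    CellArrayKind kind;
    H3Index *cells;
    int64_t numCells;
} TaggedCellArray;

/* Example of a check a set function might make before choosing a fast path. */
static bool isCompactKind(const TaggedCellArray *a) {
    return a->kind == CELLS_COMPACT_ORDERED_SET;
}
```

The tag only records what the producer claimed. It does not verify the contents, which is part of why the opaque-constructor approach is attractive.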

function declarations

a2a61f8

ajfriend added 28 commits December 22, 2021 13:44

rfc

7ce1842

types of arrays

ddeaf42

merp

1c741bd

implementations

f1a6e9a

intersection test

d372101

better with bool

b9d5af2

wayLessThan

f87297b

ensureASmaller

8c90200

ternary

d7bf43e

merp

cbf69b3

notes

8b07372

intersectTheyDo_slow

81d7e2f

formatting

b55c886

add some tests

6368edf

trying some stuff

34e6bf2

clean up tests

04c853d

ring_intersect

0355ce7

overlapping disks

39cb4e0

h3api.h

fc2da34

oops, wrong one

bec6d6b

try again

8076474

H3_EXPORT might be the trick

0e810f3

H3_EXPORT all the things

5a6c581

one last straggler

948d2e8

some clean up

201c6c1

cleaner

24f6e2b

trying out t_isLow52 and t_isCanon

26d02dc

t_intersect

a9e1777

ajfriend added 6 commits December 25, 2021 18:21

clean up helper functions

3eed379

sets use capital letters

d624f97

t_intersects

5475831

do some simpler flipping between left and right side of A

ea81e27

simplify input for disjointInsertionPoint

6e3fcaf

tricky ring tests

428854c

pocketken mentioned this pull request Jan 4, 2022

Add cell canonicalization (upstream RFC) pocketken/H3.net#61

Open

ajfriend mentioned this pull request Sep 22, 2022

order-dependent duplicate error in compact_cells #697

Closed
