WIP feat(patterns): pattern-based compression take2 #1584
Draft: erights wants to merge 1 commit into markm-prepare-for-extended-matchers from markm-pattern-based-compression-2 (+1,053 −15)
erights force-pushed the markm-pattern-based-compression-2 branch from 241b2d3 to f57ac4b (May 10, 2023 06:44)
erights force-pushed the markm-pattern-based-compression-2 branch from f57ac4b to 533d62a (May 20, 2023 21:45)
erights force-pushed the markm-pattern-based-compression-2 branch from 533d62a to 7ce2d16 (June 6, 2023 03:22)
erights force-pushed the markm-pattern-based-compression-2 branch from 7ce2d16 to 1025466 (August 8, 2023 02:23)
erights force-pushed the markm-pattern-based-compression-2 branch 2 times, most recently from 18db466 to accc77c (August 8, 2023 02:36)
erights force-pushed the markm-tag-guards-2 branch 3 times, most recently from b05871a to 2a13b3d (August 9, 2023 02:27)
erights force-pushed the markm-pattern-based-compression-2 branch from accc77c to 2e6810f (August 9, 2023 02:34)
erights force-pushed the markm-type-guards branch from a0170df to 505f81f (August 15, 2023 22:53)
erights force-pushed the markm-pattern-based-compression-2 branch from 2e6810f to 99b58d6 (August 15, 2023 23:02)
erights force-pushed the markm-type-guards branch from 505f81f to c2cd034 (August 21, 2023 22:48)
erights force-pushed the markm-pattern-based-compression-2 branch 2 times, most recently from 282fd46 to b77b6f7 (August 28, 2023 05:22)
erights force-pushed the markm-pattern-based-compression-2 branch 2 times, most recently from be5d3aa to 3a169ed (August 30, 2023 01:23)
erights force-pushed the markm-pattern-based-compression-2 branch 2 times, most recently from 7125ac7 to 061c7e6 (September 16, 2023 02:45)
erights force-pushed the markm-pattern-based-compression-2 branch 2 times, most recently from 5497b03 to ce825a7 (September 26, 2023 03:13)
erights force-pushed the markm-pattern-based-compression-2 branch 6 times, most recently from 833067b to 65f26cc (April 29, 2024 03:01)
erights changed the base branch from master to markm-prepare-for-extended-matchers (April 29, 2024 03:02)
erights force-pushed the markm-prepare-for-extended-matchers branch from 1943903 to 896cae3 (April 29, 2024 19:19)
erights force-pushed the markm-pattern-based-compression-2 branch from 65f26cc to 46946b7 (April 29, 2024 19:25)
erights added a commit that referenced this pull request (Apr 29, 2024):
closes: #XXXX
refs: #2248 #1584 Agoric/agoric-sdk#6432
## Description
Pure refactor. Changes only static info. Mostly more consistent and more readable use of `@import`.
One case made less readable: Remove newlines within a large `@import` directive. The reason is that `yarn lerna run build:types` chokes on those newlines. TODO minimal repro + report issue.
Extracted from other PRs #1584 #2248 which are now staged on this one. But this should be a reviewable and mergeable improvement regardless of whether we move forward on the others.
### Security Considerations
none
### Scaling Considerations
none
### Documentation Considerations
none
### Testing Considerations
none
### Compatibility Considerations
none
### Upgrade Considerations
none
- ~[ ] Includes `*BREAKING*:` in the commit message with migration instructions for any breaking change.~
- ~[ ] Updates `NEWS.md` for user-facing changes.~
erights force-pushed the markm-prepare-for-extended-matchers branch from 896cae3 to 2eafdc2 (April 29, 2024 20:20)
erights force-pushed the markm-pattern-based-compression-2 branch from 46946b7 to 8e7925b (April 29, 2024 20:22)
erights force-pushed the markm-prepare-for-extended-matchers branch from 2eafdc2 to e3cfbad (April 30, 2024 19:40)
erights force-pushed the markm-pattern-based-compression-2 branch from 8e7925b to 1f6703c (April 30, 2024 19:41)
erights force-pushed the markm-prepare-for-extended-matchers branch from e3cfbad to d25cfad (May 2, 2024 23:18)
erights force-pushed the markm-pattern-based-compression-2 branch from 1f6703c to c50817b (May 2, 2024 23:19)
erights force-pushed the markm-prepare-for-extended-matchers branch from d25cfad to 5a51499 (May 6, 2024 22:17)
erights force-pushed the markm-pattern-based-compression-2 branch from c50817b to 5e470dc (May 6, 2024 22:18)
erights force-pushed the markm-prepare-for-extended-matchers branch from 5a51499 to f5b2d72 (May 6, 2024 22:22)
erights force-pushed the markm-pattern-based-compression-2 branch from 5e470dc to bc39e81 (May 6, 2024 22:23)
erights force-pushed the markm-prepare-for-extended-matchers branch from f5b2d72 to dd3b3ad (May 7, 2024 18:50)
erights force-pushed the markm-pattern-based-compression-2 branch from bc39e81 to e104e22 (May 7, 2024 18:51)
erights force-pushed the markm-prepare-for-extended-matchers branch from dd3b3ad to aa85135 (May 7, 2024 21:09)
erights force-pushed the markm-pattern-based-compression-2 branch from e104e22 to 35ea462 (May 7, 2024 21:09)
erights force-pushed the markm-prepare-for-extended-matchers branch from aa85135 to b619239 (May 9, 2024 00:09)
erights force-pushed the markm-pattern-based-compression-2 branch from 35ea462 to 737d43c (May 9, 2024 00:10)
erights force-pushed the markm-prepare-for-extended-matchers branch from 964d1ac to 4c7ac33 (May 24, 2024 03:41)
erights force-pushed the markm-pattern-based-compression-2 branch from 6fae12d to 61d0621 (May 24, 2024 03:43)
Staged on #2248
closes: #2112
refs: #1564 Agoric/agoric-sdk#6432
Description
Adds two new exports to `@endo/patterns`: `mustCompress` and its "inverse", `mustDecompress`.
(From Agoric/agoric-sdk#6432 (comment) ):
For example, without compression, the Zoe proposal is stored with a smallcaps body of
'#{"exit":{"afterDeadline":{"deadline":"+11","timer":"$0.Alleged: timer"}},"give":{"Bid":{"brand":"$1.Alleged: simoleans","value":"+37"}},"want":{"Winnings":{"brand":"$2.Alleged: moola","value":{"#tag":"copyBag","payload":[[{"foo":"c"},"+1"],[{"foo":"b"},"+1"],[{"foo":"a"},"+1"]]}}}}'
But with the proposalShape, it compresses to a form whose smallcaps body is
'#[[["c"],["b"],["a"]],"+37","+11"]'
which is 12% as long.
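The "12% as long" figure can be checked directly from the two smallcaps bodies quoted above. The following is only a quick arithmetic check on those strings; the variable names are illustrative and nothing here calls into `@endo/patterns`.

```javascript
// The two smallcaps bodies, copied verbatim from the example above.
const uncompressedBody =
  '#{"exit":{"afterDeadline":{"deadline":"+11","timer":"$0.Alleged: timer"}},"give":{"Bid":{"brand":"$1.Alleged: simoleans","value":"+37"}},"want":{"Winnings":{"brand":"$2.Alleged: moola","value":{"#tag":"copyBag","payload":[[{"foo":"c"},"+1"],[{"foo":"b"},"+1"],[{"foo":"a"},"+1"]]}}}}';
const compressedBody = '#[[["c"],["b"],["a"]],"+37","+11"]';

// Ratio of compressed length to uncompressed length, in characters.
const lengthRatio = compressedBody.length / uncompressedBody.length;
console.log(
  `${compressedBody.length} / ${uncompressedBody.length} chars, about ${Math.round(lengthRatio * 100)}%`,
);
```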
It would take much more work, but if we were able to use matching interface guards on the sending and receiving sides, we'd get similar savings for messages. Agoric/agoric-sdk#6355 may help get there. But note the difficulties explained in "Upgrade Considerations" below.
`mustCompress` is analogous to `mustMatch`. The following equivalences must hold:
- `mustMatch(s,p,l1?)` must succeed iff `mustCompress(s,p,l2?)` succeeds. When they succeed, the label does not matter. When they fail, the label is there to make the diagnostic more informative. Thus, one throws iff the other throws. The diagnostics are not necessarily the same.
- `mustMatch(s,p,l1?)` and therefore `mustCompress(s,p,l2?)` succeeds iff `compress(s,p) === true`.
- `mustCompress(s,p,l?) === c` iff `mustDecompress(c,p,l) === s2`, where `s` and `s2` have the same distributed object semantics: `compareRank(s, s2) === 0`, `isKey(s) === isKey(s2)`, and `isKey(s) => keyEQ(s,s2)`.

The point is that typically `c` is smaller than `s`, though in some cases it may be larger. The space savings should typically be similar to the space savings from schema-based encodings like protobuf or Cap'n Proto. The pattern is analogous to the schema. Anything that must be in all specimens that match a given pattern can be omitted from the compressed form, since those parts can be recovered from the pattern on decompression. Unlike schema-based compression, this can include dynamic elements like brand identity, potentially resulting in greater savings and tighter error checking.

Unlike schema-based compression schemes like protobuf or Cap'n Proto, the layering here makes compression mostly independent of encoding/serialization, as shown by the above example: The compression is independent of whether the result will be encoded with smallcaps, and the smallcaps encoding is independent of whether its input was a compressed or uncompressed specimen. Or rather, mostly independent. We chose a nested-array compression because of its compact JSON representation, preserved by smallcaps.
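The idea that everything fixed by the pattern can be dropped and later recovered can be illustrated with a toy model. This is a minimal sketch under stated assumptions, not the algorithm in this PR: `ANY`, `toyMustCompress`, and `toyMustDecompress` are hypothetical names, patterns are plain JSON-like data with a single wildcard, and the compressed form is a flat array rather than the nested arrays the PR uses.

```javascript
// Hypothetical wildcard standing in for a Matcher; not an @endo/patterns API.
const ANY = Symbol('ANY');

const isRecord = x => typeof x === 'object' && x !== null && !Array.isArray(x);
const sortedKeys = obj => Object.keys(obj).sort();

// Walk specimen and pattern together. Literal parts must be equal and emit
// nothing; only the values found at ANY positions are kept, in traversal order.
// Like mustMatch, this throws when the specimen does not match the pattern.
const toyMustCompress = (specimen, pattern, out = []) => {
  if (pattern === ANY) {
    out.push(specimen);
  } else if (Array.isArray(pattern)) {
    if (!Array.isArray(specimen) || specimen.length !== pattern.length) {
      throw Error('array shape mismatch');
    }
    pattern.forEach((sub, i) => toyMustCompress(specimen[i], sub, out));
  } else if (isRecord(pattern)) {
    const keys = sortedKeys(pattern);
    if (
      !isRecord(specimen) ||
      JSON.stringify(sortedKeys(specimen)) !== JSON.stringify(keys)
    ) {
      throw Error('record shape mismatch');
    }
    keys.forEach(key => toyMustCompress(specimen[key], pattern[key], out));
  } else if (specimen !== pattern) {
    throw Error(`literal mismatch: expected ${String(pattern)}`);
  }
  return out;
};

// Rebuild a specimen: literal parts come from the pattern, and each ANY
// position consumes the next compressed value, in the same traversal order.
const toyMustDecompress = (compressed, pattern) => {
  let next = 0;
  const rebuild = p => {
    if (p === ANY) return compressed[next++];
    if (Array.isArray(p)) return p.map(rebuild);
    if (isRecord(p)) {
      return Object.fromEntries(sortedKeys(p).map(k => [k, rebuild(p[k])]));
    }
    return p;
  };
  const result = rebuild(pattern);
  if (next !== compressed.length) throw Error('compressed length mismatch');
  return result;
};

const proposalPattern = {
  exit: { afterDeadline: { deadline: ANY, timer: 'timer' } },
  give: { Bid: { brand: 'simoleans', value: ANY } },
};
const proposal = {
  exit: { afterDeadline: { deadline: 11, timer: 'timer' } },
  give: { Bid: { brand: 'simoleans', value: 37 } },
};

const compressedProposal = toyMustCompress(proposal, proposalPattern);
const restoredProposal = toyMustDecompress(compressedProposal, proposalPattern);
console.log(compressedProposal, restoredProposal);
```

In this toy, only the deadline and the bid value survive compression; the key names, the timer, and the brand are all recovered from the pattern.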
Security Considerations
If sender and receiver can be led into compressing and decompressing with different patterns, or with different compression/decompression algorithms associated with that pattern's matchers, then compressed data might be decompressed into something arbitrarily different from what the sender meant to send. See "Upgrade Considerations" below.
Aside from that, none.
Scaling Considerations
The whole point. Compression could result in tremendously less data stored, sent, and received. Unfortunately, so far, the informal measurements of the time taken to compress are not encouraging. This needs to be measured carefully, and probably needs to be improved tremendously, before this PR is ready for production use. Ideally:
- `encode(mustCompress(data, pattern))` typically takes both less time and less space than `mustMatch(data, pattern) && encode(data)`.
- `mustDecompress(decode(encodedCompressedData), pattern)` typically takes less time than `decode(encodedUncompressedData)`.

This will depend of course on what `encode` scheme is used.
Documentation Considerations
Testing Considerations
Already includes good manual tests.
Compatibility Considerations
A big advantage of the smallcaps encoding of an uncompressed specimen is that the result is still mostly human readable, and processable using JSON-oriented tooling like jq. The compressed form loses both of these benefits, which also calls into question whether there's any point in smallcaps encoding the compressed form rather than using an unreadable binary encoding like `compactOrdered`, `syrup`, or `cbor`.
`compactOrdered` is both rank equality preserving and rank order preserving. Holding the pattern constant, `compactOrdered` of the compressed form would still be rank equality preserving, but not rank order preserving. Thus, stores will probably continue to encode their keys using `compactOrdered` on the uncompressed form, forfeiting the opportunity to use `keyShape` for compression.
Upgrade Considerations
When the compressed form is communicated instead of the uncompressed form, the sender and receiver must agree precisely on the pattern. If a different pattern is used to decompress than was used to compress, the compressed data might silently decompress into data arbitrarily different from the original specimen. The best way to ensure agreement is to send the pattern as well, somehow, from the sender to the receiver. For small data, this may cost more space than it saves.
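The silent-mismatch hazard can be seen even in a toy model. The sketch below is an assumption-laden illustration, not this PR's code: `HOLE`, `flatCompress`, and `flatDecompress` are hypothetical, and patterns are flat records in which `HOLE` marks the only parts carried in the compressed form.

```javascript
// Hypothetical wildcard marker; not part of @endo/patterns.
const HOLE = Symbol('HOLE');

// Toy compression for flat records: keep only the values at HOLE positions.
const flatCompress = (specimen, pattern) =>
  Object.keys(pattern)
    .sort()
    .filter(key => pattern[key] === HOLE)
    .map(key => specimen[key]);

// Toy decompression: literal parts come from the pattern, holes from the data.
const flatDecompress = (compressed, pattern) => {
  let next = 0;
  return Object.fromEntries(
    Object.keys(pattern)
      .sort()
      .map(key => [key, pattern[key] === HOLE ? compressed[next++] : pattern[key]]),
  );
};

const senderPattern = { brand: 'simoleans', value: HOLE };
const receiverPattern = { brand: 'moola', value: HOLE };

// The sender compresses an amount of simoleans; only the value goes on the wire.
const wire = flatCompress({ brand: 'simoleans', value: 37 }, senderPattern);

// The receiver decompresses with a different pattern. Nothing throws, but the
// brand now comes from the receiver's pattern rather than the sender's data.
const received = flatDecompress(wire, receiverPattern);
console.log(wire, received);
```

Nothing in the compressed data records which pattern produced it, which is why the pattern (or some identifier for it) has to travel with the data or be agreed on out of band.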
SwingSet already stores optional patterns with some large data stores, with an error check to ensure that the data matches the pattern: `keyShape`, `valueShape`, and `stateShape`. Agoric/agoric-sdk#6432 modifies SwingSet to also use the `valueShape` and `stateShape` for compression.
A pattern is a tree of copy-data to be matched literally (the key-like parts), and Matchers, typically expressed in code like `M.bagOf(keyShape, countShape)` in the example above. The overall compression/decompression algorithms are composed from compression/decompression algorithms for each matcher kind. Not only must the sender and receiver agree exactly on the pattern, they must agree exactly on the algorithms associated with each matcher in the pattern. But we'd also like to improve these over time. Thus, this PR includes in each matcher kind definition an optional version number of the compression algorithm it uses. If omitted, that matcher does not compress. Version numbers are assigned in increasing sequence starting with `1`. The algorithm associated with a given sequence number must never change. If a given version of endo supports matcher M sequence number N, then it should also support all sequence numbers prior to N, unless there is a compelling reason to retire an old one.
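The versioning rules for per-matcher compression algorithms might be modeled as a small append-only registry. This is a hypothetical sketch only: `makeCompressionRegistry`, its method names, and the placeholder algorithms are assumptions for illustration and are not how this PR stores versions in matcher kind definitions.

```javascript
// Hypothetical registry; names and shape are illustrative assumptions.
const makeCompressionRegistry = () => {
  // matcher kind -> array of algorithms, where index i holds version i + 1
  const algorithmsByKind = new Map();

  // Versions start at 1 and increase by one. Because only the next version
  // can be registered, an existing version can never be replaced.
  const register = (kind, version, algorithm) => {
    const algorithms = algorithmsByKind.get(kind) || [];
    if (version !== algorithms.length + 1) {
      throw Error(
        `${kind}: expected version ${algorithms.length + 1}, got ${version}`,
      );
    }
    algorithms.push(Object.freeze({ ...algorithm }));
    algorithmsByKind.set(kind, algorithms);
  };

  // Latest locally supported version; undefined means the matcher kind
  // does not compress.
  const latestVersion = kind => {
    const algorithms = algorithmsByKind.get(kind);
    return algorithms ? algorithms.length : undefined;
  };

  // Decompression must use exactly the version that compressed the data.
  const lookup = (kind, version) => {
    const algorithms = algorithmsByKind.get(kind) || [];
    const algorithm = algorithms[version - 1];
    if (algorithm === undefined) {
      throw Error(`${kind}: unsupported version ${version}`);
    }
    return algorithm;
  };

  return { register, latestVersion, lookup };
};

const registry = makeCompressionRegistry();
// Placeholder algorithms, just to exercise the registry.
registry.register('bagOf', 1, { compress: x => x, decompress: x => x });
registry.register('bagOf', 2, { compress: x => [x], decompress: ([x]) => x });

console.log(registry.latestVersion('bagOf'), registry.latestVersion('remotable'));
```

A newer receiver holding versions 1 and 2 can still decompress data from an older sender that only knew version 1, which is the forward-in-time case; the reverse direction is where some form of negotiation would be needed.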
`M.something(...)` matcher makers should generally produce a matcher with the latest locally supported sequence number. Thus, this system supports older senders sending to newer receivers. This works fine for intra-vat storage, as in Agoric/agoric-sdk#6432, since intra-vat storage communicates data only forward in time/versions. However, inter-vat communications must tolerate some version slippage in both directions, which will require design of some kind of pattern negotiation.
- [ ] Includes `*BREAKING*:` in the commit message with migration instructions for any breaking change.
This PR itself does not introduce any breaking changes. But PRs based on it will have more hazards of breaking changes as explained above.
- [ ] Updates `NEWS.md` for user-facing changes.
Many of the points made in this PR note should be summarized in a NEWS.md entry.