Compress the compiled bytecode. #18959
base: master
Conversation
Can we (a) check the size reduction in .vo size across the entire CI somehow, and (b) get monogram statistics for the bytecode words across the entire CI so that we can order them optimally? |
@coqbot bench |
This ensures that the two rightmost bytes are nul in the general case, hence improving the compression ratio.
The compression scheme is quite naive. It is based on the observation that, in most cases, a bytecode word is a small byte followed by three nul bytes. In that case, it is directly stored as a single byte. In the other cases, more bytes are used, up to 5 bytes to store a full word. This brings a 4% reduction of the overall .vo size of the standard library, that is, a 2MB reduction.
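To make the scheme concrete, here is a minimal OCaml sketch of such a variable-length word encoding. The escape byte `0xff` and the exact byte layout are assumptions for illustration, not necessarily what the patch implements.

```ocaml
(* Hypothetical sketch of the scheme described above: a 32-bit bytecode word
   whose three high bytes are nul is stored as a single byte; any other word
   is stored as an escape byte followed by its 4 bytes, for a worst case of
   5 bytes. The escape value 0xff is an assumption, so the word 0xff itself
   must take the long form. *)
let compress_word buf (w : int32) =
  let lo = Int32.to_int (Int32.logand w 0xffl) in
  if Int32.shift_right_logical w 8 = 0l && lo <> 0xff then
    Buffer.add_char buf (Char.chr lo)          (* common case: 1 byte *)
  else begin
    Buffer.add_char buf '\xff';                (* rare case: 5 bytes *)
    for i = 0 to 3 do
      Buffer.add_char buf
        (Char.chr (Int32.to_int
          (Int32.logand (Int32.shift_right_logical w (8 * i)) 0xffl)))
    done
  end

let decompress_word s pos =
  let b = Char.code s.[pos] in
  if b <> 0xff then (Int32.of_int b, pos + 1)  (* small word *)
  else begin
    let w = ref 0l in
    for i = 3 downto 0 do
      w := Int32.logor (Int32.shift_left !w 8)
             (Int32.of_int (Char.code s.[pos + 1 + i]))
    done;
    (!w, pos + 5)
  end
```

Under this layout a word such as `0x0000002a` costs one byte while `0x12345678` costs five, which matches the "4% overall, mostly small words" arithmetic above.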
🏁 Bench results:
🐢 Top 25 slow downs

OLD NEW DIFF %DIFF Ln FILE
62.3430 63.3860 1.0430 1.67% 609 coq-fiat-crypto-with-bedrock/rupicola/bedrock2/bedrock2/src/bedrock2Examples/lightbulb.v.html
26.2760 27.2180 0.9420 3.59% 35 coq-fiat-crypto-with-bedrock/src/Bedrock/End2End/X25519/MontgomeryLadderRISCV.v.html
28.0630 28.9560 0.8930 3.18% 32 coq-fiat-crypto-with-bedrock/src/Bedrock/End2End/X25519/MontgomeryLadderRISCV.v.html
14.2890 14.8380 0.5490 3.84% 1505 coq-vst/floyd/VSU.v.html
1.6240 2.1560 0.5320 32.76% 20 coq-fiat-crypto-with-bedrock/src/Rewriter/Passes/NBE.v.html
2.0480 2.5570 0.5090 24.85% 32 coq-fiat-crypto-with-bedrock/src/Rewriter/Passes/NBE.v.html
17.4980 17.9430 0.4450 2.54% 32 coq-performance-tests-lite/src/pattern.v.html
7.1910 7.5880 0.3970 5.52% 1503 coq-vst/floyd/VSU.v.html
1.0400 1.4110 0.3710 35.67% 854 coq-stdlib/FSets/FMapAVL.v.html
155.3850 155.7300 0.3450 0.22% 1190 coq-unimath/UniMath/CategoryTheory/GrothendieckConstruction/IsPullback.v.html
3.1450 3.4420 0.2970 9.44% 122 coq-stdlib/setoid_ring/Ncring_initial.v.html
7.4060 7.6760 0.2700 3.65% 1501 coq-vst/floyd/VSU.v.html
72.7830 73.0510 0.2680 0.37% 905 coq-unimath/UniMath/ModelCategories/Generated/LNWFSCocomplete.v.html
80.7710 81.0350 0.2640 0.33% 48 coq-fiat-crypto-with-bedrock/src/Curves/Weierstrass/AffineProofs.v.html
0.8270 1.0780 0.2510 30.35% 384 coq-stdlib/MSets/MSetAVL.v.html
3.4070 3.6570 0.2500 7.34% 1060 coq-unimath/UniMath/CategoryTheory/EnrichedCats/Examples/FunctorCategory.v.html
23.9500 24.1950 0.2450 1.02% 85 coq-fiat-crypto-with-bedrock/src/Curves/Montgomery/AffineProofs.v.html
24.5230 24.7540 0.2310 0.94% 12 coq-fourcolor/theories/job517to530.v.html
5.5890 5.8150 0.2260 4.04% 308 coq-iris-examples/theories/logrel/F_mu_ref_conc/binary/examples/stack/refinement.v.html
17.5130 17.7240 0.2110 1.20% 481 coq-verdi-raft/theories/RaftProofs/EndToEndLinearizability.v.html
0.8290 1.0330 0.2040 24.61% 816 coq-stdlib/MSets/MSetRBT.v.html
25.5670 25.7680 0.2010 0.79% 12 coq-fourcolor/theories/job190to206.v.html
0.4820 0.6770 0.1950 40.46% 82 coq-stdlib/Numbers/Cyclic/Int63/Sint63.v.html
23.3020 23.4960 0.1940 0.83% 12 coq-fourcolor/theories/job486to489.v.html
4.4290 4.6200 0.1910 4.31% 5 coq-fiat-crypto-with-bedrock/src/Assembly/Parse/Examples/fiat_p256_square_optimised_seed103.v.html

🐇 Top 25 speed ups

OLD NEW DIFF %DIFF Ln FILE
254.5780 248.8550 -5.7230 -2.25% 8 coq-neural-net-interp-computed-lite/theories/MaxOfTwoNumbersSimpler/Computed/AllLogits.v.html
48.8640 46.2710 -2.5930 -5.31% 110 coq-fiat-crypto-with-bedrock/rupicola/bedrock2/bedrock2/src/bedrock2Examples/full_mul.v.html
218.9550 216.6910 -2.2640 -1.03% 103 coq-fiat-crypto-with-bedrock/src/Arithmetic/BarrettReduction.v.html
96.3470 94.5660 -1.7810 -1.85% 999 coq-performance-tests-lite/src/fiat_crypto_via_setoid_rewrite_standalone.v.html
96.2340 94.5380 -1.6960 -1.76% 968 coq-performance-tests-lite/src/fiat_crypto_via_setoid_rewrite_standalone.v.html
66.1640 65.0260 -1.1380 -1.72% 27 coq-fiat-crypto-with-bedrock/src/Rewriter/Passes/ToFancyWithCasts.v.html
132.9220 131.8880 -1.0340 -0.78% 22 coq-fiat-crypto-with-bedrock/src/Rewriter/Passes/ArithWithCasts.v.html
44.7080 43.9180 -0.7900 -1.77% 827 coq-vst/veric/binop_lemmas4.v.html
2.3200 1.5980 -0.7220 -31.12% 196 coq-stdlib/setoid_ring/Ncring_tac.v.html
256.2570 255.5380 -0.7190 -0.28% 1629 coq-metacoq-pcuic/pcuic/theories/PCUICSR.v.html
4.0550 3.4060 -0.6490 -16.00% 490 coq-stdlib/Reals/Cauchy/ConstructiveCauchyRealsMult.v.html
10.7310 10.1450 -0.5860 -5.46% 307 coq-fiat-crypto-with-bedrock/src/Bedrock/End2End/X25519/EdwardsXYZT.v.html
8.4700 8.0120 -0.4580 -5.41% 359 coq-fiat-crypto-with-bedrock/src/Bedrock/End2End/X25519/EdwardsXYZT.v.html
17.5190 17.0770 -0.4420 -2.52% 607 coq-mathcomp-odd-order/theories/PFsection9.v.html
28.8410 28.4100 -0.4310 -1.49% 144 coq-fiat-crypto-with-bedrock/src/Bedrock/End2End/X25519/GarageDoorTop.v.html
13.1890 12.7710 -0.4180 -3.17% 1028 coq-unimath/UniMath/CategoryTheory/LocalizingClass.v.html
29.3120 28.9020 -0.4100 -1.40% 12 coq-fourcolor/theories/job001to106.v.html
21.8020 21.4400 -0.3620 -1.66% 40 coq-fiat-crypto-with-bedrock/src/PushButtonSynthesis/SolinasReductionReificationCache.v.html
5.0080 4.6470 -0.3610 -7.21% 111 coq-bedrock2/bedrock2/src/bedrock2Examples/full_mul.v.html
18.9430 18.5860 -0.3570 -1.88% 957 coq-unimath/UniMath/CategoryTheory/Monoidal/Examples/DisplayedCartesianMonoidal.v.html
87.4760 87.1350 -0.3410 -0.39% 365 coq-mathcomp-odd-order/theories/PFsection4.v.html
99.7400 99.4130 -0.3270 -0.33% 20 coq-fiat-crypto-with-bedrock/src/Rewriter/Passes/NBE.v.html
23.3530 23.0350 -0.3180 -1.36% 296 coq-fiat-crypto-with-bedrock/src/Bedrock/End2End/X25519/EdwardsXYZT.v.html
39.8580 39.5490 -0.3090 -0.78% 236 coq-rewriter/src/Rewriter/Rewriter/Examples/PerfTesting/LiftLetsMap.v.html
47.6170 47.3140 -0.3030 -0.64% 558 coq-bedrock2/bedrock2/src/bedrock2Examples/insertionsort.v.html
@@ -513,8 +580,7 @@ let to_memory fv code =
     reloc_info = RelocTable.create 91;
   } in
   emit env code [];
-  (** Later uses of this string are all purely functional *)
-  let code = Bytes.sub_string env.out_buffer 0 env.out_position in
+  let code = compress_code env.out_buffer env.out_position in
I think that this is the wrong place to perform the compression: when evaluating non-constants in the VM, the code will be decompressed again right after. This was one of the reasons why I introduced a VM library data structure on disk; the compression should occur around the point where that structure is built.
Good point. That said, I would not be surprised if non-constants were to be rather short, bytecode-wise. (They might be huge, term-wise, especially when doing computational reflection, but most of that size will come from block values whose associated bytecode is trivial.) So, the cost of compressing/decompressing might be in the noise and not worth complicating the code. For example, heavy vm_compute users like Four-Color did not suffer from any slowdown. |
So, on the whole bench, the overall reduction of the .vo size is about 20%, with files from HoTT being reduced by more than 70% (?!). |
30d1ed4 to 3d66700
I think HoTT has a higher than usual proportion of Defined vs Qed. I would expect similar results on unimath. |
Sure. But think about it: by design, this patch cannot shrink the bytecode segment of a .vo file by more than 75% (and in practice, it is more like 65-70%). So, if there is a 73% reduction on the whole .vo file (e.g., |
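The bound above can be spelled out with a bit of arithmetic (my illustration, not from the thread): each 4-byte word compresses to at least 1 byte, so if `f` is the fraction of a .vo file occupied by bytecode, the whole-file reduction is at most `0.75 *. f`; conversely, an observed overall reduction forces a lower bound on `f`.

```ocaml
(* Upper bound on whole-file shrinkage when only the bytecode segment is
   compressed, each 4-byte word costing at least 1 byte after compression. *)
let max_reduction f = 0.75 *. f

(* Minimum fraction of the file that must be bytecode, given an observed
   overall reduction. *)
let min_bytecode_fraction reduction = reduction /. 0.75
```

For instance, a 73% overall reduction would require the bytecode segment to make up more than 97% of the file.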
As an illustration, here is what the factorial function of the standard library looks like before and after this patch. Before, 240 bytes:
After, 63 bytes:
|
This is in line with some profiling I had done before, so I'm not really surprised. |
@coqbot run full ci |
The thing is, the size of the bytecode is directly proportional to the size of the term, since there is no optimization such as inlining. So, to get 2 million opcodes from a minuscule term, the only explanation I can imagine is the existence of an exponential amount of sharing in that term, which I did not think could happen in a non-artificial use case. |
This is not quite true, the current compilation scheme of closures is quadratic in the number of binders. I think this is a very bad design choice for symbolic code such as the kind encountered in UniMath proofs, and more generally all the Coq definitions never meant to be sent to the VM. |
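A back-of-the-envelope model of that quadratic behaviour (my illustration, not the actual compiler): if each of n nested lambdas captures all the binders above it, the closure environments hold 0 + 1 + … + (n−1) slots in total, i.e. n(n−1)/2, even though the term itself is only linear in n.

```ocaml
(* Illustrative model: n nested closures, the k-th capturing its k enclosing
   binders, store 0 + 1 + ... + (n-1) environment slots overall -- quadratic
   in the number of binders. *)
let total_env_slots n =
  List.init n (fun k -> k) |> List.fold_left ( + ) 0
```

So a definition with a few hundred binders that is never meant to run in the VM still pays tens of thousands of environment slots of bytecode.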
Right, this had eluded me. Just to be sure, I suppose you are talking about a term of the form […]?

That said, from a computational point of view, I don't think there is a way around it. Indeed, it is important that a closure does not spuriously keep terms alive. So, it should not store more values in its closure than strictly needed for its evaluation. For example, the closure […].

I suppose we could add an opcode that copies a segment of the closure environment on the stack (i.e., a |
While I agree with this claim in your run-of-the-mill call-by-value boolean evaluator, it's not clear to me at all that for symbolic code with many bound variables that occur with probability 1 on average within subclosures (e.g. in type annotations) the same trade-off is worth it. I think we should experiment. |
How much trouble would it be to support both sorts of closure compilation schemes? It would be nice to be able to manually annotate some definitions as "symbolic-like closures". |
Since #18964 did not reduce the size of the generated bytecode of

fun (i j : G) (g : G i j) (x : D i) (y : E (i; x)) =>
  moveR_transport_V E' (colimp i j g x) y (E_f g x y) (transport_E' g x y)^

The bytecode is about 500 opcodes long (before peephole optimization), 75% of which are there just for creating universe instances. More precisely, 14 instances are created for this small term, of about 6 universes each, with an average cost of 4 opcodes per universe. |
Actually, my mistake, these are the numbers before the end of the section. This example gets even worse once the section is closed, since there are now 15 instances, with up to 11 universes, for a total of 94 universes, and the average cost per universe grows to 4.5 opcodes. Most of these instances are identical, hence shared, in the Coq term, but they get duplicated once compiled to bytecode. |
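Putting the quoted figures together (straightforward arithmetic, my addition): before the section closes, 14 instances of about 6 universes at about 4 opcodes each come to roughly 336 opcodes, consistent with the "75% of ~500 opcodes" estimate; after the section closes, 94 universes at 4.5 opcodes each come to about 423 opcodes.

```ocaml
(* Arithmetic on the figures quoted above. *)
let before_close = 14 * 6 * 4                 (* ~336 opcodes *)
let after_close = int_of_float (94.0 *. 4.5)  (* ~423 opcodes *)
```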
instead of implementing the substitution in bytecode unrolled over the instance to substitute. Since the instances are now handled through structured constants even when variable, they get deduplicated, reducing vo size significantly in universe polymorphic code. For instance:
HoTT total theories/ vo size: 141MB -> 90MB
HoTT Colimit_Flattening.vo size: 8.7MB -> 3.4MB
list with all files: https://gist.github.com/SkySkimmer/5d9eda76b016404b82239bbe352a9c6d
cf discussion around coq#18959 (comment)
cf #18968 to try to reduce univ instance vm code size |
Good catch, I didn't think about the universe instances... |
I am unable to reproduce the marshal failure on |
launched |
So, cosmic ray it was. |