Simplify CAMLalign and use C11 max_align_t #13139
Looks good to me. +1 for using _Alignas and alignas in preference to attributes. One suggestion below concerning the MSVC fallback.
 struct pool_block {
 #ifdef DEBUG
   intnat magic;
 #endif
   struct pool_block *next;
   struct pool_block *prev;
-  union max_align data[]; /* not allocated, used for alignment purposes */
+  max_align_t data[]; /* not allocated, used for alignment purposes */
 };
One downside with this use of double as fallback max_align_t type is that double has rather low alignment: 8 on x86_64, 4 on x86_32, while most SSE vector instructions require 16-alignment. (In Linux x86_64 and macOS x86_64, max_align_t has 16 alignment.) What about using an explicit 16 alignment as the fallback case?
struct pool_block {
#ifdef DEBUG
intnat magic;
#endif
struct pool_block *next;
struct pool_block *prev;
#ifdef HAVE_MAX_ALIGN_T
max_align_t data[]; /* not allocated, used for alignment purposes */
#else
CAMLalign(16) char data[]; /* 16 is a reasonable alignment default */
#endif
};
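To make the difference between the two fallbacks concrete, here is a minimal sketch that compares the payload offset of a header ending in a 16-aligned `char` array with one ending in a `double` array. The struct and function names are hypothetical, and the C11 `alignas` keyword is used directly instead of the runtime's `CAMLalign` macro.

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>

/* Hypothetical stand-in for pool_block with an explicit 16-byte payload
   alignment, as in the suggested fallback. */
struct pool_block_16 {
  struct pool_block_16 *next;
  struct pool_block_16 *prev;
  alignas(16) char data[]; /* not allocated, used for alignment purposes */
};

/* Hypothetical stand-in for pool_block with a double-based fallback. */
struct pool_block_double {
  struct pool_block_double *next;
  struct pool_block_double *prev;
  double data[]; /* aligned like double: 8 on x86_64, 4 on x86_32 */
};

/* The explicit variant always places the payload on a 16-byte offset. */
static_assert(offsetof(struct pool_block_16, data) % 16 == 0,
              "payload offset must be a multiple of 16");

/* Offset of the payload in the 16-aligned variant. */
size_t data_offset_16(void) {
  return offsetof(struct pool_block_16, data);
}

/* Offset of the payload in the double-based variant. */
size_t data_offset_double(void) {
  return offsetof(struct pool_block_double, data);
}
```

On a 64-bit target both offsets are expected to coincide, since the two pointers already occupy 16 bytes; the variants only differ on 32-bit targets.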
I'd tend to align (ha, ha) with your suggestion, but: MSVC C++ cstddef.h header uses double alignment for max_align_t, and clang-cl uses double too, which I think makes it a reasonable default for Windows.
As for a general fallback, I hope that other compilers+libc aren't as buggy, and I wonder if we do need to provide a definition, or rather catch a missing definition as a compilation failure.
> MSVC C++ cstddef.h header uses double alignment for max_align_t
It is in error, then. The program below, compiled with MSVC, prints 16, showing that there are types with alignment greater than 8.
#include <iostream>
#include <xmmintrin.h>
int main() {
std::cout << alignof(__m128) << '\n';
return 0;
}
In this PR, you're not trying to emulate whatever strange choices MSVC makes, but to make sure OCaml's pool allocator works as intended. If the intent is to align for the maximal alignment constraint of the target platform, the alignment must be >= 16 on x86, because SSE instructions. If the intent is to align for the biggest datatype OCaml stores in heap blocks, word-alignment is enough and you don't need to add anything to struct pool_block to guarantee it, as it already contains two word-sized pointers.
What about defaulting to long double rather than double then?
> It is in error, then. The program below, compiled with MSVC, prints 16, showing that there are types with alignment greater than 8.
At least for C++, I don't think there's an error.
> The type max_align_t is a POD type whose alignment requirement is at least as great as that of every scalar type, and whose alignment requirement is supported in every context. [C++11]
#include <iostream>
#include <xmmintrin.h>
#include <cstddef>
#include <type_traits>
int main() {
std::cout << alignof(__m128) << std::endl
<< alignof(std::max_align_t) << std::endl
<< std::is_scalar<__m128>() << std::endl;
return 0;
}
shows 16, 8, 0, indicating that __m128 isn't considered a scalar type, and thus the definition of std::max_align_t is consistent (with respect to the __m128 type).
The definition for C is more vague and I guess it could be argued that 16 would be a correct value.
> max_align_t which is an object type whose alignment is as great as is supported by the implementation in all contexts; [C11]
> What about defaulting to long double rather than double then?

long double is identical to double under MSVC [1].
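The C++11 wording quoted above can be checked from C as well. The sketch below computes the largest alignment among a handful of standard scalar types and compares it with `alignof(max_align_t)`. The function names are hypothetical, and the choice of candidate types is an assumption made for illustration; it does not cover vector types such as `__m128`, which is exactly the point under discussion.

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>

/* max_align_t is at least as aligned as double on hosted C11 toolchains. */
static_assert(alignof(max_align_t) >= alignof(double),
              "max_align_t should cover double");

/* Largest alignment among a few standard scalar types. */
size_t max_scalar_alignment(void) {
  size_t candidates[] = {
    alignof(long long), alignof(double), alignof(long double), alignof(void *)
  };
  size_t max = 1;
  for (size_t i = 0; i < sizeof candidates / sizeof candidates[0]; i++) {
    if (candidates[i] > max) max = candidates[i];
  }
  return max;
}

/* Nonzero when max_align_t is at least as aligned as those scalar types. */
int max_align_covers_scalars(void) {
  return alignof(max_align_t) >= max_scalar_alignment();
}
```

This needs a C standard library that provides `max_align_t` in `<stddef.h>`, which per this thread is not the case for MSVC in C mode.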
Thanks for the thorough review. I think the real question is indeed whether the intent is to align for the maximal alignment constraint of the target platform, or to align for the biggest datatype OCaml stores in heap blocks. None of the fields in the former union max_align had 16-byte alignment, and everything worked well, didn't it? Now it's not clear to me that this field is actually needed.
Footnotes
We could use the same fallback as GCC and clang do. They don't seem to account for vector extensions (also, why stop at __m128 when there are also 256-bit vectors?). My macOS M1, which supports NEON, still defines alignof(max_align_t) == 8.
#if !defined(HAVE_MAX_ALIGN_T)
typedef struct {
CAMLalign(long long) long long ll;
CAMLalign(long double) long double ld;
} max_align_t;
#endif
MinGW-w64 defines a 16-byte alignment for max_align_t, but contrary to MSVC, it supports long double, aligned to 16 bytes.
The original justification comes from this comment:
> That trick with max_align is not needed at all for correctness (void* would do just fine as far as correctness goes); however, it is supposed to increase the chances of getting a more favourable alignment of data in terms of performance (it is important, since data is going to be accessed much more often than the header of the block).
This patch would move data from an 8-byte boundary to a 16-byte boundary. I don't know if it would affect performance, but it would likely waste a bit of space.
Thinking about it, the data field type probably shouldn't be max_align_t but rather char as in your first suggestion.
CAMLalign(max_align_t) char data[];
"this comment" refers to the OCaml 4 memory allocator, which has been extensively rewritten for OCaml 5. Please see #12212 and the corresponding OCaml 5 code to check how alignment is actually handled in pool blocks.
I have the impression that you are talking about two different things, which very conveniently have the exact same name in the codebase. In memory.c (which the PR is hacking on), "pool" refers to a single global pool of blocks (struct pool_block) that have been caml_stat_alloc-ed by the runtime, and is used to make sure that they are all freed on program termination -- I think that it is only used in cleanup-at-exit mode? There is a single memory pool, which is basically a doubly-linked list, and each caml_stat_alloc adds one element to it.
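The memory.c pool described here can be pictured with a small model. This is a sketch of the general technique only (a header linked into one global circular doubly-linked list, with the payload following it), not the runtime's actual code; `struct block`, `pool_alloc`, `pool_free` and `pool_release_all` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical header placed in front of every allocation. */
struct block {
  struct block *next;
  struct block *prev;
  max_align_t data[]; /* payload starts here */
};

/* Sentinel node: an empty pool points to itself in both directions. */
static struct block sentinel = { &sentinel, &sentinel };

/* Allocate size bytes and link the new block right after the sentinel. */
void *pool_alloc(size_t size) {
  struct block *b = malloc(sizeof(struct block) + size);
  if (b == NULL) return NULL;
  b->next = sentinel.next;
  b->prev = &sentinel;
  sentinel.next->prev = b;
  sentinel.next = b;
  return b->data;
}

/* Recover the header from a payload pointer, unlink it, and free it. */
void pool_free(void *p) {
  if (p == NULL) return;
  struct block *b = (struct block *)((char *)p - offsetof(struct block, data));
  b->prev->next = b->next;
  b->next->prev = b->prev;
  free(b);
}

/* Free every block still in the pool, as a cleanup-at-exit pass would.
   Returns the number of blocks released. */
size_t pool_release_all(void) {
  size_t released = 0;
  while (sentinel.next != &sentinel) {
    pool_free(sentinel.next->data);
    released++;
  }
  assert(sentinel.prev == &sentinel);
  return released;
}
```

Because the payload address is the `malloc` result plus the header size, its alignment depends on both, which is what the rest of the discussion turns on.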
In shared_heap.c, struct pool refers to one block or "slab" of memory, of size 4096 words, owned by a domain-local caml_heap_state structure; each pool has its own free list. I think that the alignment constraints that @xavierleroy has in mind apply to values stored in the shared_heap.c pools, but that those are in fact currently not stored in the global memory.c pool, as they are allocated by caml_mem_map in shared_heap.c:pool_acquire. (Large objects are allocated differently in large_allocate, and oddly enough they seem to use malloc and not caml_stat_alloc.)
Thanks for the clarification, it makes more sense now.
So, we're talking about caml_stat_alloc, which should have the same good properties as malloc, in particular: align sufficiently so that all data types supported by the target architecture (not just those used by the OCaml runtime system, because third-party OCaml-C stubs) can be stored safely. For x86, it means 16-alignment, because of SSE 128-bit vector operations. (256- and 512-bit AVX operations support less aligned accesses, albeit slowly.) For the other OCaml target architectures, 8-alignment seems enough, e.g. for ARM64's load/store register pair instructions, and for ARM NEON, but I'm not sure.
(Note that I wrote "stored safely", i.e. the program doesn't crash, not "stored efficiently" with the alignment that gives best hardware performance for vector instructions. The latter may require specific allocators besides malloc, for the reason below.)
Giving struct pool_block an alignment greater than the one guaranteed by malloc is useless. E.g. if struct pool_block is 16-aligned and malloc returns an address = 8 mod 16 (as can happen in 32-bit Windows, I heard), you'll add 16 to this address and get something that is not 16 aligned, but 8 bytes are wasted.
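The arithmetic behind this example can be spelled out in a few lines. The helpers below are hypothetical and model the addresses with plain integers: the header holds two pointers and is padded up to the requested alignment, and the payload lands at the `malloc` residue plus the padded header size.

```c
#include <assert.h>
#include <stddef.h>

/* Size of a two-pointer header once padded up to a multiple of align. */
size_t padded_header_size(size_t pointer_size, size_t align) {
  assert(align != 0);
  size_t raw = 2 * pointer_size;
  return (raw + align - 1) / align * align;
}

/* Residue, modulo align, of the payload address when malloc returns an
   address congruent to malloc_residue modulo align. */
size_t payload_residue(size_t malloc_residue, size_t header_size, size_t align) {
  assert(align != 0);
  return (malloc_residue + header_size) % align;
}
```

With 4-byte pointers and a requested alignment of 16, `padded_header_size(4, 16)` is 16, so 8 bytes are padding. If `malloc` returns an address equal to 8 mod 16, `payload_residue(8, 16, 16)` is 8: the payload is still not 16-aligned, and the padding bought nothing.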
For this reason, I don't believe in the "align on cache lines" argument of "this comment": cache lines are typically 64-byte wide, sometimes 128 or even 256. malloc will not return 64-byte aligned pointers, at least not for small- to medium-sized blocks, because it would waste too much space, and trying to realign to 64 afterward would waste much space too.
What does this mean for this PR?
- The target systems we care about are 64-bit architecture with alignment constraints <= 16 and malloc returning 16-aligned blocks. In this case, the data part of struct pool_block is naturally 16-aligned (because the two pointers before use 16 bytes), and nothing needs to be done. Aligning data using max_align_t should have no effect. (*)
- For a 32-bit architecture with 16-byte alignment constraints and malloc returning 16-aligned blocks (e.g. Linux x86-32), aligning data to 16 seems preferable to me and can be achieved by using max_align_t.
- For a 32-bit architecture with 16-byte alignment constraints and malloc returning 8-aligned blocks (perhaps Windows 32 bits, not sure): no amount of alignment constraints in struct pool_block will give 16-aligned data fields, so you could just as well put no alignment constraints.
(*) There's still an issue with the pesky magic number added in debug mode, which throws alignment off. I wish we would just remove it, as I don't think it adds anything to the debug mechanisms built into most malloc implementations.
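The effect of the debug-mode magic number on the payload offset can be checked at compile time. The sketch below uses hypothetical struct names and `intptr_t` as a stand-in for the runtime's `intnat`; it assumes a target where `intptr_t` and pointers have the same size and alignment.

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>
#include <stdint.h>

/* Debug layout without any alignment request on the payload. */
struct pool_block_debug {
  intptr_t magic; /* stand-in for intnat */
  struct pool_block_debug *next;
  struct pool_block_debug *prev;
  char data[];
};

/* Debug layout with the payload aligned like max_align_t. */
struct pool_block_debug_aligned {
  intptr_t magic;
  struct pool_block_debug_aligned *next;
  struct pool_block_debug_aligned *prev;
  alignas(max_align_t) char data[];
};

/* Holds by construction: the compiler pads the header as needed. */
static_assert(offsetof(struct pool_block_debug_aligned, data)
                  % alignof(max_align_t) == 0,
              "aligned payload offset must be a multiple of alignof(max_align_t)");

/* Payload offset when nothing realigns it: three words. */
size_t debug_payload_offset(void) {
  return offsetof(struct pool_block_debug, data);
}

/* Payload offset once the alignment request is in place. */
size_t debug_aligned_payload_offset(void) {
  return offsetof(struct pool_block_debug_aligned, data);
}
```

On a 64-bit target with 16-byte `max_align_t`, the unaligned offset is three words (24 bytes), which is 8 mod 16, while the aligned variant pads the header to 32 bytes.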
That's all (and everything) I had to say about this PR. Now I'm done with this discussion and this PR.
Thank you for the review and the explanations, there's a lot to be learned here.
I would be happy to approve and merge this PR if the recommendations of Xavier are implemented:
(The footnote I am not sure whether the current state of the PR corresponds to this explicit design choice. (I think it does, but in a non-obvious way.) @MisterDA, can you confirm? Could you maybe include a comment that explains the current design/intent, possibly by just quoting the description of Xavier above? (If you just quote, you could mark him as author of the corresponding git commit.)
I'm not sure either, I'll need some time to convince myself.
I'm thinking of turning the bullet points into some sort of static assertions, to be added to the code or the configure script.
In our public headers, we're using either:
- C23 where `alignas` is a keyword;
- C++11 or later where `alignas` is also available;
- C11/C17 where `_Alignas` is available.
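One way to express that three-way choice is a macro that picks the spelling from the language version. This is a sketch only: `MY_ALIGNAS` and `struct aligned_buffer` are hypothetical names, and the runtime's actual `CAMLalign` definition may differ.

```c
#include <assert.h>
#include <stddef.h>

/* MY_ALIGNAS is a hypothetical name, not the runtime's CAMLalign. */
#if defined(__cplusplus) && __cplusplus >= 201103L
#  define MY_ALIGNAS(n) alignas(n)   /* C++11 or later */
#elif defined(__STDC_VERSION__) && __STDC_VERSION__ >= 202311L
#  define MY_ALIGNAS(n) alignas(n)   /* C23: alignas is a keyword */
#elif defined(__STDC_VERSION__) && __STDC_VERSION__ >= 201112L
#  define MY_ALIGNAS(n) _Alignas(n)  /* C11/C17 */
#else
#  error "C11 or C++11 is required"
#endif

/* A one-byte tag followed by a 16-aligned buffer. */
struct aligned_buffer {
  char tag;
  MY_ALIGNAS(16) char bytes[32];
};

static_assert(offsetof(struct aligned_buffer, bytes) % 16 == 0,
              "bytes must start on a 16-byte offset");

/* Offset of the aligned member within the struct. */
size_t aligned_buffer_offset(void) {
  return offsetof(struct aligned_buffer, bytes);
}
```

Using the keyword forms avoids compiler-specific attributes, which matches the preference stated in the first review comment.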
Explanations from Xavier Leroy at ocaml#13139 (comment)

- The target systems we care about are 64-bit architecture with alignment constraints <= 16 and malloc returning 16-aligned blocks. In this case, the `data` part of `struct pool_block` is naturally 16-aligned (because the two pointers before use 16 bytes), and nothing needs to be done. Aligning data using `max_align_t` should have no effect.
- For a 32-bit architecture with 16-byte alignment constraints and malloc returning 16-aligned blocks (e.g. Linux x86-32), aligning `data` to 16 seems preferable to me and can be achieved by using `max_align_t`.
- For a 32-bit architecture with 16-byte alignment constraints and malloc returning 8-aligned blocks (perhaps Windows 32 bits, not sure): no amount of alignment constraints in `struct pool_block` will give 16-aligned `data` fields, so you could just as well put no alignment constraints.

Unfortunately, MSVC C11 support is incomplete and doesn't define `max_align_t`.
- https://developercommunity.visualstudio.com/t/max_align_t-is-not-provided-by-stddefh/10299797
- https://developercommunity.visualstudio.com/t/stdc11-should-add-max-align-t-to-stddefh/1386891

For C++, MSVC defines `using max_align_t = double`. LLVM's clang-cl copies this. It's unlikely that we need to carry a fallback implementation for other compilers. If so, the following could be used:

    typedef struct {
      alignas(long long) long long ll;
      alignas(long double) long double ld;
    } max_align_t;

https://gcc.gnu.org/git/?p=gcc.git;a=blob;f=gcc/ginclude/stddef.h
https://github.com/llvm/llvm-project/blob/main/clang/lib/Headers/__stddef_max_align_t.h
https://en.cppreference.com/w/c/types/max_align_t
Some cleanups removing checks and workarounds for older compilers, assuming that the compiler supports C11 or C++11 out of the box. We may use _Alignas (since C11) or alignas (since C23) directly, and use the max_align_t type. Unfortunately, support for max_align_t is missing from the Windows C standard library.