
Improve object heap allocation for Compressed References JVM for zLinux #19490

Closed
dmitripivkine opened this issue May 13, 2024 · 8 comments

@dmitripivkine
Contributor

The current object heap allocation direction on zLinux is bottom-up. The bottom-up direction was selected to gain a performance benefit from smaller shift values on the Z platform: for this reason the Compressed Refs shift on Z is supported as (0,1,2,3,4), while on other platforms it is (0,3,4). However, it also means the object heap always consumes part of the memory below the 4GB bar, which prevents Suballocator expansion when necessary (if the default 200MB, or a customer-specified value, is not large enough).

My preferred solution is to change the zLinux behaviour to match the non-Z platforms. However, it is not clear how much performance regression we might have to deal with.

If we decide to continue supporting (0,1,2,3,4) shifts for zLinux, there is an alternative suggestion:

  • change the allocation direction from bottom-up to top-down
  • add top-down allocation attempts below 8GB (1-shift) and below 16GB (2-shift)
    In the worst case (from the point of view of consuming memory below the 4GB bar) the allocation scenario is still the same. However, in the general case there is a chance that pressure on memory below the 4GB bar is reduced, while shift-wise nothing changes.

Please note the allocation scheme for Concurrent Scavenger with HW support (Guarded Storage) will remain bottom-up, due to complications in supporting the HW implementation.

There is also a way to reduce such pressure even further (down to nothing), but at the price of a higher shift value.
There is an "estimated heap start address" variable. It can be set to 0 for the most conservative case. The minimum possible shift for a requested size is calculated from the estimated top heap address, where estimated top heap address = estimated heap start address + requested size.
If allocation is not possible (for instance, there is no realistic way to allocate a 7.5GB heap with 1-shift below 8GB), the allocation fails and a higher shift is attempted.

Here are some examples for illustration:

  1. Most conservative - set "estimated heap start address" to 0GB:
  • for a requested heap size of 5GB
  • the minimum shift for estimated top address = 0GB + 5GB (total 5GB) is 1
  • the maximum address for 1-shift is 8GB
  • so the primary attempted heap range, allocated top-down, is [3GB,8GB]
    In this case we still guarantee the minimum shift, but at the price of consuming 1GB below the 4GB bar.
  2. Most relaxed - set "estimated heap start address" to 4GB:
  • for a requested heap size of 5GB
  • the minimum shift for estimated top address = 4GB + 5GB (total 9GB) is 2
  • the maximum address for 2-shift is 16GB
  • so the primary attempted heap range, allocated top-down, is [11GB,16GB]
    In this case we guarantee zero pressure on memory below the 4GB bar, but at the price of a higher shift (2 instead of 1).
    We currently use this logic on non-Z platforms to protect memory below the bar when switching from shift 3 to shift 4.
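The minimum-shift arithmetic in the two cases above can be sketched as follows (a minimal illustration, not the OMR implementation; `minimum_shift` is a hypothetical helper name):

```python
# Illustrative sketch of the minimum-shift calculation (hypothetical helper,
# not the OMR code). A shift of s lets a 32-bit compressed reference address
# up to 4GB << s, so we select the smallest shift whose addressable ceiling
# covers the estimated top heap address.

GB = 1 << 30

def minimum_shift(estimated_heap_start, requested_size, max_shift=4):
    """Smallest shift whose addressable ceiling covers the estimated heap top."""
    estimated_top = estimated_heap_start + requested_size
    for shift in range(max_shift + 1):
        if estimated_top <= (4 * GB) << shift:
            return shift
    return None  # not coverable even with the maximum shift

# Case 1: estimated start 0GB, 5GB heap -> top 5GB -> shift 1 (ceiling 8GB)
print(minimum_shift(0, 5 * GB))       # 1
# Case 2: estimated start 4GB, 5GB heap -> top 9GB -> shift 2 (ceiling 16GB)
print(minimum_shift(4 * GB, 5 * GB))  # 2
```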

The implementation for this enhancement is pretty simple; it is a low-risk change. I prefer to isolate this change to zLinux only, so there is no risk of breaking other platforms.

Just to summarize:
Case 1 is an enhancement over the current behaviour, but with only partial improvement.
Case 2 should resolve the issue of low memory below the 4GB bar, but it changes the current behaviour, so there might be a negative performance impact related to a larger CR shift value.

@vijaysun-omr @joransiu @TobiAjila @r30shah @amicic What do you think?

@r30shah
Contributor

r30shah commented May 13, 2024

On the code-gen side we would see a performance impact going from shift 1 to higher shifts (not for 0-shift), where we lose the ability to embed the shift into the load/store instruction itself, forcing us to generate an extra instruction for each load/store.
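For context, the encode/decode arithmetic behind that extra instruction can be sketched as follows (illustrative only, assuming a zero heap base; this is not actual JIT output):

```python
# Sketch of compressed-reference encode/decode arithmetic (illustrative,
# assuming a zero heap base). A 32-bit compressed pointer c expands to
# heap_base + (c << shift); as noted in the comment above, on Z a shift of
# up to 1 can be folded into the load/store itself, while larger shifts cost
# an extra instruction per reference access.

GB = 1 << 30

def compress(full, heap_base, shift):
    return (full - heap_base) >> shift

def decompress(compressed, heap_base, shift):
    return heap_base + (compressed << shift)

full_ref = 6 * GB                        # an object located at 6GB
c = compress(full_ref, 0, 1)             # fits in 32 bits thanks to shift=1
print(c < (1 << 32))                     # True
print(decompress(c, 0, 1) == full_ref)   # True
print(compress(full_ref, 0, 0) < (1 << 32))  # False: 6GB needs a shift (or a base)
```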

Performance-wise, I am pasting the old numbers I collected comparing shift 1 vs shift 3. (A few weeks ago I refreshed those numbers, but they are on a machine which is offline now, so I will be able to extract them later this week.) For now I am pasting the old numbers to keep the conversation going, and will update this comment with the latest results (though I think the performance delta was similar).

  1. ILOG-ODM :
    Options : -Xcompressedrefs -Xms1024m -Xmx1024m

    Configuration : 4 application threads, using 4 Logical CPs

| Benchmark | -XXgc:forcedShiftingCompressionAmount=1 | -XXgc:forcedShiftingCompressionAmount=3 |
| --- | --- | --- |
| 300 RuleSet | 100 | 98.50 |
| 5 FastPath | 100 | 98.78 |

@vijaysun-omr
Contributor

@dmitripivkine just so I am clear on the proposal, it is composed of two steps that are different from current default scheme, both of which only apply to the case when heap size could have been allocated below 8gb following a "bottom-up" approach in effect today by default on zLinux.

  1. Change to "top-down" approach and see if it succeeds first in allocating below 8gb
  2. If "top-down" approach did not succeed, then instead of trying "bottom-up" approach, we will try a scheme where we set "estimated start address" to 4gb and thereby avoid allocating in the bottom 4gb completely, thereby incurring a higher risk of shift greater than 1.

Is this understanding correct?

@dmitripivkine
Contributor Author

@dmitripivkine just so I am clear on the proposal, it is composed of two steps that are different from current default scheme, both of which only apply to the case when heap size could have been allocated below 8gb following a "bottom-up" approach in effect today by default on zLinux.

  1. Change to "top-down" approach and see if it succeeds first in allocating below 8gb
  2. If "top-down" approach did not succeed, then instead of trying "bottom-up" approach, we will try a scheme where we set "estimated start address" to 4gb and thereby avoid allocating in the bottom 4gb completely, thereby incurring a higher risk of shift greater than 1.

Is this understanding correct?

No, not exactly. Sorry, I was not clear.

My base suggestion is to change the allocation direction from bottom-up to top-down for all zLinux cases except the Concurrent Scavenger with HW support case (which might be addressed later if we need it; it just requires more work).

Changing the allocation direction will reduce usage of the memory below the 4G bar; it is going to be better, or the same in the worst-case scenario. While currently with bottom-up all free memory below the 4G bar is consumed for sure, with the top-down approach it is consumed only if there is not enough memory between the 4G bar and the maximum address supported by the selected minimum shift.
For example, for a 5G heap the minimum shift is 1. The maximum address for shift 1 is 8G. So, if the heap is larger than the [4G,8G] interval (which it is in this example), the deficit (1G) will be taken from below the 4G bar. If such an allocation attempt fails, the higher shift (2 in this example) is selected and the attempt is repeated with the new maximum address (16G). This is the same behaviour as we have now. Changing the allocation direction to top-down improves the average case without compromising performance by switching to a higher shift. This is the scenario I described in case 1.

Now, with this in place, our allocation policy can optionally be improved further if we like. We can reduce (or eliminate) memory usage below the 4G bar by playing with the parameters (at the price of going to a higher shift sometimes, of course). I tried to explain this in the example in case 2.

I am open to ideas for how heap allocation can be improved (and we do have the tools to do it on zLinux). However, it would be good to have the allocation logic aligned with the other platforms.

@vijaysun-omr
Contributor

Thanks @dmitripivkine

My preference would be to go with the "top-down" scheme described under case 1. But does this scheme not come with its own throughput risk? Specifically, in a case where the heap size was such that the entire heap could have been contained in the lower 4gb, we may have been able to run without any shifting with the "bottom-up" approach, whereas with the "top-down" approach we may do shift=1 (not as bad as shift=3, but also not as good as no shift). If so, maybe we need to compare shift=1 vs no shift (i.e. not what Rahil had collected before).

The optional enhancement described in case 2 can maybe come later, if we find that employing the "top-down" scheme did not help in enough of the cases you are attacking with this proposal. Is it support cases that are driving this proposal, and if so, do you feel that trying just the "top-down" approach change would be worthwhile as an initial step to address what you are seeing in the support cases? We can discuss going further later if needed, in my opinion (but I am happy to hear more reasons to reconsider that position).

@dmitripivkine
Contributor Author

But, does this scheme not come with its own throughput risk ? Specifically, in a case where the heap size was such that the entire heap could have been contained in the lower 4gb, we may have been able to run without any shifting with the "bottom-up" approach, whereas with the "top-down" approach, we may do shift=1 (not as bad as shift=3 but also not as good as no shift). If so, maybe we need to compare shift=1 vs no shift (i.e. not what Rahil had collected before).

@vijaysun-omr No, there is no risk. There are details that have not been described; I focused on the zLinux specifics. The full proposed allocation process is below (we go to the next step if the current one is skipped or unsuccessful; the allocation direction is top-down for all attempts):

  1. If the requested heap size is <= 4G, attempt to allocate below 4G to get shift 0. This step is common for all platforms (except zOS). The only difference for zLinux is the allocation direction: top-down instead of bottom-up.
  2. If the requested heap size is <= 8G, attempt to allocate below 8G to get shift 1. This proposed step is specific to zLinux only.
  3. If the requested heap size is <= 16G, attempt to allocate below 16G to get shift 2. This proposed step is specific to zLinux only.
  4. If the requested heap size is <= 28G, attempt to allocate below 32G to get shift 3. The 28G value is used to protect the [0,4G] range explicitly. If the heap size is larger than 28G, force shift 4. This step is common for all platforms; however, it is new behaviour for zLinux. We can follow the common path or make an exception for zLinux; I prefer the common path.
  5. Allocate the heap below 64G top-down and get the smallest shift possible (expected to be 4). This step is common for all platforms.

So, steps 1, 4 and 5 exist today for all platforms except Z. I am suggesting we apply them to zLinux too, with the addition of steps 2 and 3, which are specific to zLinux only.
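The five steps above can be sketched as a staged search (a toy model, not the actual OMR implementation; `plan_allocation` and `toy_allocate` are hypothetical names):

```python
# Toy model of the staged top-down allocation described above (hypothetical,
# not the OMR code). Each step is skipped when the requested size exceeds its
# size ceiling, and we fall through to the next (larger) shift on failure.

GB = 1 << 30

# (max heap size for this step, address ceiling, resulting shift)
STEPS = [
    (4 * GB,  4 * GB,  0),   # step 1: below 4GB,  shift 0 (all platforms except zOS)
    (8 * GB,  8 * GB,  1),   # step 2: below 8GB,  shift 1 (proposed, zLinux only)
    (16 * GB, 16 * GB, 2),   # step 3: below 16GB, shift 2 (proposed, zLinux only)
    (28 * GB, 32 * GB, 3),   # step 4: below 32GB, shift 3; 28GB cap protects [0,4GB]
    (64 * GB, 64 * GB, 4),   # step 5: below 64GB, shift 4
]

def plan_allocation(requested_size, try_allocate):
    """try_allocate(limit) returns a start address for a top-down attempt, or None."""
    for size_limit, address_limit, shift in STEPS:
        if requested_size > size_limit:
            continue  # step skipped: heap too large for this ceiling
        start = try_allocate(address_limit)
        if start is not None:
            return start, shift
    return None

# Toy allocator: pretend all memory below 4GB is already taken.
def toy_allocate(limit, size=5 * GB, reserved_below=4 * GB):
    start = limit - size
    return start if start >= reserved_below else None

start, shift = plan_allocation(5 * GB, toy_allocate)
print(start // GB, shift)  # 11 2 -- the 5GB heap lands at [11GB,16GB] with shift 2
```

With nothing reserved below 4GB the same 5GB request would land at [3GB,8GB] with shift 1, i.e. step 2 succeeds and the below-bar memory is touched only to cover the 1GB deficit.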

@vijaysun-omr
Contributor

Thanks for those details. I am fine with the proposed "top-down" scheme since it carries no throughput risk.

@dmitripivkine
Contributor Author

Implementation: eclipse/omr#7344

@dmitripivkine
Contributor Author

Here are a few examples of heap placement with the new implementation:

-- 512m, 0-shift

1STHEAPREGION  0x000003FFA4081E10 0x00000000E0000000 0x00000000E0600000 0x0000000000600000 Generational/Tenured Region
1STHEAPREGION  0x000003FFA4081870 0x00000000FFE00000 0x00000000FFF00000 0x0000000000100000 Generational/Nursery Region
1STHEAPREGION  0x000003FFA40812D0 0x00000000FFF00000 0x0000000100000000 0x0000000000100000 Generational/Nursery Region

-- 3G located [1G,4G], 0-shift

1STHEAPREGION  0x000003FF98084800 0x0000000040000000 0x0000000040600000 0x0000000000600000 Generational/Tenured Region
1STHEAPREGION  0x000003FF98084260 0x00000000FFE00000 0x00000000FFF00000 0x0000000000100000 Generational/Nursery Region
1STHEAPREGION  0x000003FF98083CC0 0x00000000FFF00000 0x0000000100000000 0x0000000000100000 Generational/Nursery Region

-- 4G, [4G,8G] 1-shift

1STHEAPREGION  0x000003FFA4084940 0x0000000100000000 0x0000000100600000 0x0000000000600000 Generational/Tenured Region
1STHEAPREGION  0x000003FFA40843A0 0x00000001FFE00000 0x00000001FFF00000 0x0000000000100000 Generational/Nursery Region
1STHEAPREGION  0x000003FFA4083E00 0x00000001FFF00000 0x0000000200000000 0x0000000000100000 Generational/Nursery Region

-- 5G, [3G,8G] 1-shift

1STHEAPREGION  0x000003FF7C084940 0x00000000C0000000 0x00000000C0600000 0x0000000000600000 Generational/Tenured Region
1STHEAPREGION  0x000003FF7C0843A0 0x00000001FFE00000 0x00000001FFF00000 0x0000000000100000 Generational/Nursery Region
1STHEAPREGION  0x000003FF7C083E00 0x00000001FFF00000 0x0000000200000000 0x0000000000100000 Generational/Nursery Region

-- 11G, [5G,16G] 2-shift

1STHEAPREGION  0x000003FFA8084D40 0x0000000140000000 0x0000000140600000 0x0000000000600000 Generational/Tenured Region
1STHEAPREGION  0x000003FFA80847A0 0x00000003FFE00000 0x00000003FFF00000 0x0000000000100000 Generational/Nursery Region
1STHEAPREGION  0x000003FFA8084200 0x00000003FFF00000 0x0000000400000000 0x0000000000100000 Generational/Nursery Region

-- 23G, [9G,32G] 3-shift

1STHEAPREGION  0x000003FFB4084FC0 0x0000000240000000 0x0000000240600000 0x0000000000600000 Generational/Tenured Region
1STHEAPREGION  0x000003FFB4084A20 0x00000007FFE00000 0x00000007FFF00000 0x0000000000100000 Generational/Nursery Region
1STHEAPREGION  0x000003FFB4084480 0x00000007FFF00000 0x0000000800000000 0x0000000000100000 Generational/Nursery Region

-- 27G, [5G,32G] 3-shift

1STHEAPREGION  0x000003FF84085350 0x0000000140000000 0x0000000140600000 0x0000000000600000 Generational/Tenured Region
1STHEAPREGION  0x000003FF84084DB0 0x00000007FFE00000 0x00000007FFF00000 0x0000000000100000 Generational/Nursery Region
1STHEAPREGION  0x000003FF84084810 0x00000007FFF00000 0x0000000800000000 0x0000000000100000 Generational/Nursery Region

-- 29G, [35G,64G] 4-shift <--- this is the only difference from the current behaviour: pushed to 4-shift

1STHEAPREGION  0x000003FF88085350 0x00000008C0000000 0x00000008C0600000 0x0000000000600000 Generational/Tenured Region
1STHEAPREGION  0x000003FF88084DB0 0x0000000FFFE00000 0x0000000FFFF00000 0x0000000000100000 Generational/Nursery Region
1STHEAPREGION  0x000003FF88084810 0x0000000FFFF00000 0x0000001000000000 0x0000000000100000 Generational/Nursery Region

-- 35G, [29G,64G] 4-shift

1STHEAPREGION  0x000003FFB0085490 0x0000000740000000 0x0000000740600000 0x0000000000600000 Generational/Tenured Region
1STHEAPREGION  0x000003FFB0084EF0 0x0000000FFFE00000 0x0000000FFFF00000 0x0000000000100000 Generational/Nursery Region
1STHEAPREGION  0x000003FFB0084950 0x0000000FFFF00000 0x0000001000000000 0x0000000000100000 Generational/Nursery Region

-- 60G, [4G,64G] 4-shift

1STHEAPREGION  0x000003FFB80855D0 0x0000000100000000 0x0000000100600000 0x0000000000600000 Generational/Tenured Region
1STHEAPREGION  0x000003FFB8085030 0x0000000FFFE00000 0x0000000FFFF00000 0x0000000000100000 Generational/Nursery Region
1STHEAPREGION  0x000003FFB8084A90 0x0000000FFFF00000 0x0000001000000000 0x0000000000100000 Generational/Nursery Region
