Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Proper string type for rolldown #427

Open
hyf0 opened this issue Feb 27, 2024 · 16 comments
Open

Proper string type for rolldown #427

hyf0 opened this issue Feb 27, 2024 · 16 comments

Comments

@hyf0
Copy link
Member

hyf0 commented Feb 27, 2024

Rolldown used oxc Atom as the default string type.

After oxc-project/oxc#2295 and oxc-project/oxc#2516, oxc Atom can't be used as a general purposed string type anymore. Not to mentioned that it lacks characteristics that bundler required.

To make updating oxc easier, we need to replace the current oxc Atom and use a new string type as the default. And to gain potential performance, the new string type should be

  • friendly in parallel
  • O(1) clone
  • fast comparing
  • fast hashing/pre-computed hash

Contexts:

  • Due to scope hoisting, rolldown needs to pull out values from the every oxc ast identifier and store them in a different places. Since oxc Atom is not a general purposed string type anymore, we need to converting the Atom to a new string type. This costs some tax just for converting.
@hyf0
Copy link
Member Author

hyf0 commented Feb 27, 2024

I would give a try on https://github.com/dudykr/ddbase/tree/main/crates/hstr. It would be wonderful if @overlookmotel could give some suggestions.

@hyf0
Copy link
Member Author

hyf0 commented Feb 27, 2024

The downside:

With the new string type rolldown used(let's call it Rstr for now), there are three kinds of string type now.

ArenaStr<'_>, CompatStr and Rstr.

ArenaStr would be used as the default string type for the oxc ast. Since ArenaStr is tagged with lifetime, it have to be attached with the ast, which means it could not exist as a standalone item.

For oxc itself, situations like scope analysis or transform, they all required to pull out string value from oxc ast, so ArenaStr<'_> provide to_compat_str method to convert itself to CompatStr. This is an O(n) clone operation. Due ArenaStr can't stand alone, this means that every situation requires string value from the oxc ast would leads to a clone operation in the end.

So far, we could see that only ArenaStr<'_> and CompatStr are introducing some converting fee.

With Rstr, to gain characteristics motioned above, rolldown would need to convert ArenaStr and CompatStr to Rstr. But there still are problems. Rolldown relies on the oxc ast and scope analysis information, which are all built on top of ArenaStr and CompatStr. Though Rstr may gain some performance for some internal usages, rolldown have to interact with ArenaStr and CompatStr. So Rstr would needs to convert back to ArenaStr. Oh man, there are a lot converting.

@Boshen
Copy link
Member

Boshen commented Feb 27, 2024

I can provide a unsafe ArenaStr<'alloc> -> ArenaStr<'static> API which preserves what we had before the change.

We already have the unsafe Program<'static> AST anyway 😅

@hyf0
Copy link
Member Author

hyf0 commented Feb 27, 2024

I can provide a unsafe ArenaStr<'alloc> -> ArenaStr<'static> API which preserves what we had before the change.

We already have the unsafe Program<'static> AST anyway 😅

😅 I don't see it this way. Program<'static> is compiler think it's unsafe but it's safe. ArenaStr<'alloc> -> ArenaStr<'static> is compiler think it's unsafe and it is.

@overlookmotel
Copy link

@hyf0 Good to meet you!

I'm not at all familiar with Rolldown codebase yet, so I may be missing the point, but here are some thoughts:

The intent was for Atom<'_> to evolve into:

  • 16-byte struct.
  • Can contain either up to 16-byte string inline, or for longer strings, a reference to string in source text (or for escaped strings, arena string store).
  • For > 16 byte strings, layout much the same as Box<str>
    • pointer (usize) + length (u32)
    • Immutable - no capacity field, unlike std's String.
    • This leaves 4 bytes free for some other use (undecided what).
  • Use same trick as compact_str to smuggle the inline/ref discriminant and inline length in the last byte (this is what makes it possible to store 16 bytes inline, while keeping Atom<'a>'s size = 16).
  • Has a niche, so Option<Atom<'_>> also 16 bytes.

CompactStr to be replaced with a new type which has identical size and layout to Atom<'_>. Only difference between Atom<'_> and new CompactStr would be that CompactStr stores string content on heap, and owns that allocation (so no lifetime). Like, Atom<'_>, it'd be immutable.

The implications are:

  • For inline strings (i.e. 16 or less bytes), converting Atom<'_> -> CompactStr (or back) is free - just transmute.
  • Equality comparison of 2 x inline strings is cheap - just bitwise comparison of the inline bytes.
  • Equality comparison of 2 x heap/ref strings can compare their lengths first using only inline data, and only has to read from the heap if they're identical length (so cheap in most cases).

Would that fulfill most of the requirements for RStr?

The parts I'm not sure about are:

  1. Friendly in parallel. Presumably this would require RStr to contain an atomic ref count (like Arc).
  2. Hashing.

For inline strings, maybe neither of these is required:

  • Ref count not required, because no heap data to track.
  • Hashing likely not required for comparisons as they're so cheap to compare anyway.
  • Hashing for Hashmap keys could maybe use the inline data as 128-bit pseudo hash?

For non-inline strings, perhaps one or other of these could use the spare 4 bytes in Atom<'_> / CompactStr? Even if Atom<'a> / CompactStr don't use those 4 bytes in the same way, if RStr has same layout, it would make converting Atom<'_> / CompactStr to RStr less expensive.

I'm don't know how common hashing strings is in Rolldown, so how important it is for performance to pre-calculate hashes.

Like I said, I'm not at all familiar with Rolldown yet, so sorry if I'm totally missing the point. If I am, please correct me!

@overlookmotel
Copy link

The other solution, I guess, is to ensure OXC's allocator and ASTs stay alive while Rolldown is processing those files. Then maybe Atom<'_> can be used without conversion in Rolldown. But I don't know what your appetite for lifetimes everywhere is!

@hyf0
Copy link
Member Author

hyf0 commented Feb 27, 2024

@hyf0 Good to meet you!

I'm not at all familiar with Rolldown codebase yet, so I may be missing the point, but here are some thoughts:

The intent was for Atom<'_> to evolve into:

....

Good to know all this, but the current implementation seems not that delicate.

For inline strings (i.e. 16 or less bytes), converting Atom<'_> -> CompactStr (or back) is free

This seems promising. Why not take a further step and make Atom a general purpose string type for most of use cases? So users don't need to convert them back and forth. This new Atom seems only has advantages for parsing faster but it hurts everyone interact with it. Not just rolldown as a bundler, but also oxc semantic and oxc transfomer would have high frequent operations for storing, hashing and comparing these strings.

My point so far is that ArenaStr or Atom<'_> are only a local optimum for parsing. A global optimum would be having only one string type, it might slow down parsing for this goal but would gain performance advantage on scope analysis, ast transform and bundling.

@overlookmotel
Copy link

overlookmotel commented Feb 27, 2024

Why not take a further step and make Atom a general purpose string type for most of use cases?

I totally see your point about local optimums. I guess what you're asking for here is a general-purpose owned string type.

However, as far as I understand, it's unfortunately not possible in OXC's AST, without fundamental changes to OXC.

To retrace how we got here:

Original problem was a memory leak (oxc#1803). Cause of the leak was that Atom did own its string data. But this was incompatible with OXC's use of an arena allocator, which requires that all AST data is stored in the arena. Atom owned memory on general heap outside the arena, and it didn't get freed when the arena is dropped.

oxc#2294 fixed the leak but used an unsound transmute to do so. oxc#2497 removed the unsoundness, by introducing a lifetime - making it explicit that Atom<'_> doesn't own the string data.

There's only 3 ways around this that I can see:

  1. Remove the arena allocator from OXC.
  2. Consumers who want the convenience of not dealing with lifetimes can convert Atom<'_> to CompactStr (at cost of O(n) memory copy, as you point out).
  3. Consumers deal with the lifetime, and use Atom<'_> directly, to avoid any performance hit.

Personally, I don't think (1) is a good option. It would significantly impact OXC's performance, and the arena also enables some significant features which are otherwise impossible (oxc#2409).

The choice between (2) and (3) is a trade-off between performance and ergonomics.

Again, this is just my personal view, but I tend to think that OXC semantic and transfomer could move to (3) without too much trouble. Lifetimes are a pain, but in this case personally I don't find them so bad because all the lifetimes are the same - everything lives as long as the allocator.

For Rolldown, I don't know enough to have any opinion on which is the best trade-off.

But from my understanding, I don't think it's possible to square the circle of combining an owned string type with the arena allocator. The design outlined above is intended to minimize the cost of conversion to an owned type where that's needed, but it can't be removed entirely.

Incidentally, the same dilemma also exists for the other non-owning types in OXC's AST - Box<'a> and Vec<'a>.

Does this make any sense?

@hyf0
Copy link
Member Author

hyf0 commented Feb 29, 2024

I don't think it's possible to square the circle of combining an owned string type with the arena allocator.

I don't quite get this one. oxc could use an owned string type with keeping arena. For example, we could simply replace the Atom<'_> with String, it will slow down the parsing, but we could try different type to reducing the slowing. Or even get the same performance as Atom<'_> with some optimized string type.

I also agree that to enable oxc-project/oxc#2409, everything in the oxc ast should be allocated in the arena. I just consider this is a trade-off issue. If this feature is important enough, I totally agree that it's understandable that to have a unique string type just optimized for the case.

@Boshen
Copy link
Member

Boshen commented Feb 29, 2024

we could simply replace the Atom<'_> with String

The whole AST has no drop method on it, it's a memory leak if it's a String, which was fixed in oxc-project/oxc#2294

The allocator Box and Vec has no drop either :-) https://github.com/oxc-project/oxc/blob/fe777f330f2aed619279236f78e7464ae78c2781/crates/oxc_allocator/src/arena.rs#L17-L20

@hyf0
Copy link
Member Author

hyf0 commented Feb 29, 2024

we could simply replace the Atom<'_> with String

The whole AST has no drop method on it, it's a memory leak if it's a String, which was fixed in oxc-project/oxc#2294

The allocator Box and Vec has no drop either :-) https://github.com/oxc-project/oxc/blob/fe777f330f2aed619279236f78e7464ae78c2781/crates/oxc_allocator/src/arena.rs#L17-L20

Oh, now I understand the drop problem. If we want to have owned string type in the ast, we have to implement a custom drop method that only drop the filed that holds owned string. Is this feasible?

I could think of the downsides, recursive dropping hurts the advantage using arena. Tough.

@hyf0
Copy link
Member Author

hyf0 commented Feb 29, 2024

Updates:

I'm going to using a new string type to replace Atom in rolldown codebase, so rolldown could update oxc to the latest version without being affected by the changes of Atom.

This no doubtfully causes some extra converting, slows down the general bundling performance. But we can live with it so far.

Let's see if we could find some Zero-Cost Abstractions solutions in the futures. For example make the string type a generic type or some other customized ways.


It's pleasure to discuss with @overlookmotel, who always make things easy to understand to everyone.

@Boshen
Copy link
Member

Boshen commented Feb 29, 2024

Can you type alias CompactString and use CompactString as the underlying type? We'll eventually export our own version of CompactString.

@hyf0
Copy link
Member Author

hyf0 commented Feb 29, 2024

Can you type alias CompactString and use CompactString as the underlying type? We'll eventually export our own version of CompactString.

Yeah. I'm gonna to using new type instead of type alias and wrap CompactString as the underlying type.

@overlookmotel
Copy link

overlookmotel commented Feb 29, 2024

It's pleasure to discuss with @overlookmotel, who always make things easy to understand to everyone.

Thanks @hyf0. Some people think I just talk too much! Glad you appreciate my long winding writing.


Is it infeasible to work with Atom<'_> directly in Rolldown without conversion to an owned string type?


I guess the other big outstanding question is hashing. Would we be better off (both in OXC and Rolldown) if Atom<'_> was a "real" atom type which de-duplicates strings? (like string-cache or hstr)

This has pluses and minuses:

  1. Higher cost to create an Atom<'_>, because you have to hash it.
  2. Lower cost to compare two Atom<'_>s.
  3. Good performance single-threaded, but not so good multi-threaded due to locking.

Whether it'd be a gain probably depends on frequency of comparisons vs creations, and whether the Atoms being compared are short or not (short Atoms are cheap to compare anyway without hashing). Parser's performance would suffer, but gains in semantic/transformer/bundler might outweigh that.

Boshen knows this area better than me. servo/string-cache#268

@hyf0
Copy link
Member Author

hyf0 commented Mar 16, 2024

@hyf0 hyf0 changed the title New string type for rolldown Proper string type for rolldown Mar 16, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants