Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[WIP] Implement Encoding in rust #2579

Draft
wants to merge 22 commits into
base: trunk
Choose a base branch
from
Draft

[WIP] Implement Encoding in rust #2579

wants to merge 22 commits into from

Conversation

b-n
Copy link
Member

@b-n b-n commented May 31, 2023

By extracting out Encoding into it's own module, we can start to rely on some of the implementations in spinoso-string to provide the functionality.

Some rough todos:

  • Implement instance methods
  • Implement constants and their binding to the interpreter

Relates: PR #2578
Relates: #183

@b-n b-n added C-enhancement Category: New feature or request. A-ruby-core Area: Ruby Core types. A-spec Area: ruby/spec infrastructure and completeness. S-wip Status: This pull request is a work in progress. B-mruby Backend: Implementation of artichoke-core using mruby. labels May 31, 2023
@b-n b-n changed the base branch from trunk to b-n/extract-encoding May 31, 2023 09:48
Base automatically changed from b-n/extract-encoding to trunk June 4, 2023 18:11
@b-n b-n force-pushed the b-n/encoding-in-rust branch 3 times, most recently from ad68b76 to 37f58a9 Compare June 14, 2023 09:14
Copy link
Member

@lopopolo lopopolo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

some early thoughts

@@ -13,7 +13,7 @@ use crate::Artichoke;

mod array;
mod boolean;
mod boxing;
pub(crate) mod boxing;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this shouldn't need to be made pub. all of the types are pub use'd below

static ESCAPABLE_CONSTANT_CHARS: OnceLock<Regex> = OnceLock::new();
let escapable_const_char_regex = ESCAPABLE_CONSTANT_CHARS.get_or_init(|| {
// Match any `-` and `.` chars. These will be replaced with `_`
Regex::new(r"[-\.]+").unwrap()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

regex is really heavy here. We should be able to use ByteSlice::replace from bstr: https://docs.rs/bstr/latest/bstr/trait.ByteSlice.html#method.replace

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah nice, that is indeed much better.

I was thinking of whether there is another way to create bare symbols. e.g.

irb(main):001:0> constants = Encoding.constants
=> 
[:ISO8859_14,                                                         
...                                                                   
irb(main):002:0> Encoding::UTF_8.replicate("Super Special Encoding")
=> #<Encoding:Super Special Encoding>
irb(main):003:0> Encoding.constants - constants
=> [:Super_Special_Encoding, :SUPER_SPECIAL_ENCODING]
irb(main):004:0> Encoding::Super_Special_Encoding
=> #<Encoding:Super Special Encoding>
irb(main):005:0> Encoding::Super_Special_Encoding.name
=> "Super Special Encoding"

And check this out:

irb(main):007:0> Encoding::UTF_8.replicate("\x7F")
=> #<Encoding:>
irb(main):008:0> Encoding::UTF_8.replicate("\x80")
(Object doesn't support #inspect)
=>                                                                                
irb(main):009:0> _.name
=> "\x80"
irb(main):010:0> _.encoding
=> #<Encoding:US-ASCII>
irb(main):011:0> Encoding.find("\x80")
(Object doesn't support #inspect)
=>                                                               
irb(main):012:0> _.name
=> "\x80"

Valid names apparently need to be asciicompat as per https://github.com/artichoke/artichoke/blob/trunk/artichoke-backend/vendor/ruby/include/ruby/internal/encoding/encoding.h#L766-L793 which is used here: https://github.com/artichoke/artichoke/blob/trunk/artichoke-backend/vendor/ruby/encoding.c#L288-L301

Eventually some more to look into maybe, but:

irb(main):001:0> constants = Encoding.constants
=> 
[:ISO8859_14,                                                          
...                                                                    
irb(main):002:0> (1..127).map(&:chr).each_slice(32).map { |c| Encoding::UTF_8.replicate(c.join) }
=> 
[#<Encoding:                                                           
                                                                       
                                                                       
 >,                                                                    
 #<Encoding:!"#$%&'()*+,-./0123456789:;<=>?@>,                         
 #<Encoding:ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`>,                         
 #<Encoding:abcdefghijklmnopqrstuvwxyz{|}~>]                           
irb(main):003:0> Encoding.constants - constants
=> [:ABCDEFGHIJKLMNOPQRSTUVWXYZ______, :ABCDEFGHIJKLMNOPQRSTUVWXYZ_____]

(Had to chunk it because otherwise some issue with Constants and lengths that I think it was getting mad at).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note: in the between time, going to take the easy Bstr replace approach.

artichoke-backend/src/extn/core/encoding/trampoline.rs Outdated Show resolved Hide resolved
@@ -32,6 +34,7 @@ pub struct State {
pub parser: Option<parser::State>,
pub classes: Registry<class::Spec>,
pub modules: Registry<module::Spec>,
pub encodings: HashSet<encoding::Spec>,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think a better design is to pass encodings around as symbols (i.e. a u32) both in the interpreter storage, mrb_value repr for Encodings, and the flags of a String mrb_value.

If we end up doing that, we should use an FxHashSet here from rustc_hash which will perform much better

@b-n b-n force-pushed the b-n/encoding-in-rust branch 2 times, most recently from efe7e30 to 9efe532 Compare June 18, 2023 18:50
b-n added 13 commits June 18, 2023 20:52
Easier to maintain there, rather than implementing some sort of state engine in
rust.
At first I thought of putting some sort of state machine into the
`mruby.rs`, but realised soon after that this makes more sense in the
trampoline since we're going to be utilizing values from `spinoso-string`
Learning:
- This half works
- It pushes the allocationg of the encoding into the registry, but it's
  still backed by the extn types which feels wrong
- The encoding list doesn't actually contain all the constant registered
  encodings (we need to go the class registry spec route, I now
  understand)
- I'm slowly learning what mrb_data_type and how these things are
  hanging together.

Next steps:
- Implement a Spec type. This is the thing we want to allocate, and
  pretend is a real object. When we list, we list the spec's. Although
  Encoding objects look like they are separate because of their names,
  they are just backed by the same object on the other side.
Most of this is copy and pasting from boxing.rs in an attempt to
directly access the encoding value from an RValue. I'm not 100% sure of
what I'm doing and if this is correct, thus this commit existing for
advice
There is no MRB_TT_ENCODING, so implementing as little as possible.

This code currently segfaults when trying to retrieve the value from
memory. It looks as though it does retrieve the pointer, but is
currently dieing when trying to inspect the innards.
b-n added 2 commits June 18, 2023 20:52
This is needed since we are expecting immediates back. MRuby still needs
some sort of reference to encoding, so this init produces the smallest
possible class for it.

We could get around this by just adding the class type in instead, but
:shrug:
b-n added 4 commits June 24, 2023 14:02
This will make it consistent with what string expects.
I think it's possible to remove the EncodingRegistry entirely and maybe
relace with mezzaluna-type-registry, but will see.
b-n added 3 commits June 25, 2023 11:38
TODO: There is no way to assign these encodings to a String since the
encoding is stored as bitflags at the start of the string.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-ruby-core Area: Ruby Core types. A-spec Area: ruby/spec infrastructure and completeness. B-mruby Backend: Implementation of artichoke-core using mruby. C-enhancement Category: New feature or request. S-wip Status: This pull request is a work in progress.
Development

Successfully merging this pull request may close these issues.

None yet

2 participants