New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[WIP] Implement Encoding
in rust
#2579
base: trunk
Are you sure you want to change the base?
Conversation
ad68b76
to
37f58a9
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
some early thoughts
artichoke-backend/src/convert.rs
Outdated
@@ -13,7 +13,7 @@ use crate::Artichoke; | |||
|
|||
mod array; | |||
mod boolean; | |||
mod boxing; | |||
pub(crate) mod boxing; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this shouldn't need to be made pub. all of the types are pub use
'd below
static ESCAPABLE_CONSTANT_CHARS: OnceLock<Regex> = OnceLock::new(); | ||
let escapable_const_char_regex = ESCAPABLE_CONSTANT_CHARS.get_or_init(|| { | ||
// Match any `-` and `.` chars. These will be replaced with `_` | ||
Regex::new(r"[-\.]+").unwrap() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
regex is really heavy here. We should be able to use ByteSlice::replace
from bstr
: https://docs.rs/bstr/latest/bstr/trait.ByteSlice.html#method.replace
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah nice, that is indeed much better.
I was thinking of whether there is another way to create bare symbols. e.g.
irb(main):001:0> constants = Encoding.constants
=>
[:ISO8859_14,
...
irb(main):002:0> Encoding::UTF_8.replicate("Super Special Encoding")
=> #<Encoding:Super Special Encoding>
irb(main):003:0> Encoding.constants - constants
=> [:Super_Special_Encoding, :SUPER_SPECIAL_ENCODING]
irb(main):004:0> Encoding::Super_Special_Encoding
=> #<Encoding:Super Special Encoding>
irb(main):005:0> Encoding::Super_Special_Encoding.name
=> "Super Special Encoding"
And check this out:
irb(main):007:0> Encoding::UTF_8.replicate("\x7F")
=> #<Encoding:>
irb(main):008:0> Encoding::UTF_8.replicate("\x80")
(Object doesn't support #inspect)
=>
irb(main):009:0> _.name
=> "\x80"
irb(main):010:0> _.encoding
=> #<Encoding:US-ASCII>
irb(main):011:0> Encoding.find("\x80")
(Object doesn't support #inspect)
=>
irb(main):012:0> _.name
=> "\x80"
Valid names apparently need to be asciicompat as per https://github.com/artichoke/artichoke/blob/trunk/artichoke-backend/vendor/ruby/include/ruby/internal/encoding/encoding.h#L766-L793 which is used here: https://github.com/artichoke/artichoke/blob/trunk/artichoke-backend/vendor/ruby/encoding.c#L288-L301
Eventually some more to look into maybe, but:
irb(main):001:0> constants = Encoding.constants
=>
[:ISO8859_14,
...
irb(main):002:0> (1..127).map(&:chr).each_slice(32).map { |c| Encoding::UTF_8.replicate(c.join) }
=>
[#<Encoding:
>,
#<Encoding:!"#$%&'()*+,-./0123456789:;<=>?@>,
#<Encoding:ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`>,
#<Encoding:abcdefghijklmnopqrstuvwxyz{|}~>]
irb(main):003:0> Encoding.constants - constants
=> [:ABCDEFGHIJKLMNOPQRSTUVWXYZ______, :ABCDEFGHIJKLMNOPQRSTUVWXYZ_____]
(Had to chunk it because otherwise some issue with Constants and lengths that I think it was getting mad at).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note: in the between time, going to take the easy Bstr replace approach.
artichoke-backend/src/state/mod.rs
Outdated
@@ -32,6 +34,7 @@ pub struct State { | |||
pub parser: Option<parser::State>, | |||
pub classes: Registry<class::Spec>, | |||
pub modules: Registry<module::Spec>, | |||
pub encodings: HashSet<encoding::Spec>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think a better design is to pass encodings around as symbols (i.e. a u32
) both in the interpreter storage, mrb_value
repr for Encoding
s, and the flags of a String
mrb_value
.
If we end up doing that, we should use an FxHashSet
here from rustc_hash
which will perform much better
efe7e30
to
9efe532
Compare
Easier to maintain there, rather than implementing some sort of state engine in rust.
At first I thought of putting some sort of state machine into the `mruby.rs`, but realised soon after that this makes more sense in the trampoline since we're going to be utilizing values from `spinoso-string`
Learning: - This half works - It pushes the allocationg of the encoding into the registry, but it's still backed by the extn types which feels wrong - The encoding list doesn't actually contain all the constant registered encodings (we need to go the class registry spec route, I now understand) - I'm slowly learning what mrb_data_type and how these things are hanging together. Next steps: - Implement a Spec type. This is the thing we want to allocate, and pretend is a real object. When we list, we list the spec's. Although Encoding objects look like they are separate because of their names, they are just backed by the same object on the other side.
Most of this is copy and pasting from boxing.rs in an attempt to directly access the encoding value from an RValue. I'm not 100% sure of what I'm doing and if this is correct, thus this commit existing for advice
There is no MRB_TT_ENCODING, so implementing as little as possible. This code currently segfaults when trying to retrieve the value from memory. It looks as though it does retrieve the pointer, but is currently dieing when trying to inspect the innards.
This is needed since we are expecting immediates back. MRuby still needs some sort of reference to encoding, so this init produces the smallest possible class for it. We could get around this by just adding the class type in instead, but :shrug:
9efe532
to
6a7db50
Compare
This will make it consistent with what string expects.
I think it's possible to remove the EncodingRegistry entirely and maybe relace with mezzaluna-type-registry, but will see.
63fee63
to
d54e115
Compare
TODO: There is no way to assign these encodings to a String since the encoding is stored as bitflags at the start of the string.
By extracting out
Encoding
into it's own module, we can start to rely on some of the implementations inspinoso-string
to provide the functionality.Some rough todos:
Relates: PR #2578
Relates: #183