
Servo workweek encoding

  • SimonSapin: JS strings are UTF16, except that they're not guaranteed to be well-formed: they can contain unpaired surrogate code units. Those cannot be represented in UTF8, which is the native Rust string representation. That means that converting a JS string to UTF8 and back is lossy: the unpaired surrogates get replaced by the replacement character (U+FFFD).
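A minimal sketch of the lossy round-trip SimonSapin describes, using only the Rust standard library (the sample code units are invented for illustration):

```rust
fn main() {
    // A UTF-16 sequence containing an unpaired high surrogate (0xD800).
    let js_like: [u16; 3] = [0x0048, 0xD800, 0x0069]; // 'H', lone surrogate, 'i'

    // Converting to Rust's native UTF-8 String is lossy: the lone surrogate
    // becomes U+FFFD REPLACEMENT CHARACTER.
    let utf8 = String::from_utf16_lossy(&js_like);
    assert_eq!(utf8, "H\u{FFFD}i");

    // Round-tripping back to UTF-16 does not restore the original.
    let back: Vec<u16> = utf8.encode_utf16().collect();
    assert_ne!(back, js_like);
}
```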
  • kmc: I suspect there are very few sites that depend on the behaviour of unpaired surrogates in the DOM. Could we start with replacing and move to something else if we need to support that?
  • SimonSapin: That was hsivonen's argument. (something about representing binary data)
  • kmc: I think we should wait for that to break.
  • SimonSapin: Hard to tell until we have actual users.
  • pcwalton: Could we create a Firefox addon to look for this?
  • SimonSapin: hsivonen suggested telemetry for Firefox.
  • kmc: If utf8 is workable, a utf8 DOM shouldn't paint us in a corner.
  • pcwalton: Performance problems?
  • kmc: Not concerned about perf of unpaired surrogates.
  • jack: charAt?
  • SimonSapin: We're talking about different things. Option 1: use WTF8 (UTF8 extended to allow surrogates) everywhere, including SpiderMonkey. Requires doing some work in SM, not sure how much. Problem with charAt and other JS APIs based on UCS2 indices. Could optimize the case of sequential charAt calls by keeping prior position information around (see the sketch below). Is random access into strings common on the web? Hard to detect whether access is sequential or random.
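A sketch of that sequential-charAt optimization, assuming UTF-8 storage: the hypothetical IndexedString below caches the last (UTF-16 index, byte offset) pair so that sequential lookups stay cheap. Illustrative only, not Servo or SpiderMonkey code:

```rust
use std::cell::Cell;

/// Hypothetical DOM string stored as UTF-8 but queried by UTF-16 code unit
/// index, as charAt requires.
struct IndexedString {
    text: String,
    /// (UTF-16 code unit index, byte offset) of the last char looked up.
    cache: Cell<(usize, usize)>,
}

impl IndexedString {
    fn new(text: String) -> Self {
        IndexedString { text, cache: Cell::new((0, 0)) }
    }

    /// Returns the UTF-16 code unit at `target`, scanning forward from the
    /// cached position when possible; lookups behind the cache restart from
    /// the beginning of the string.
    fn char_at(&self, target: usize) -> Option<u16> {
        let (mut u16_idx, mut byte_off) = self.cache.get();
        if target < u16_idx {
            u16_idx = 0;
            byte_off = 0;
        }
        for c in self.text[byte_off..].chars() {
            let units = c.len_utf16();
            if (u16_idx..u16_idx + units).contains(&target) {
                // Remember where this char starts for the next lookup.
                self.cache.set((u16_idx, byte_off));
                let mut buf = [0u16; 2];
                return Some(c.encode_utf16(&mut buf)[target - u16_idx]);
            }
            u16_idx += units;
            byte_off += c.len_utf8();
        }
        None
    }
}
```

With this, char_at(0), char_at(1), … costs O(1) amortized, while genuinely random access degrades to a forward scan.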
  • zwarich: Problem when using SM: there are benchmarks that use indexing.
  • kmc: Can't regress on that.
  • SimonSapin: Random access?
  • zwarich: Code taken from actual websites at some point.
  • kmc: We should look at how they use charAt.
  • zwarich: Unless we get SM to agree to have support for variant of UTF8 in SM itself, wouldn't like to copy every string access from UTF8 to...
  • SimonSapin: Other option: keep SM using UCS2, but convert at the binding layer, at the UTF8->UCS2 boundary.
  • kmc: That's what we have today. Could do lazy conversion; as soon as script touches string, do conversion.
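A sketch of the lazy-conversion idea kmc mentions, with hypothetical names: the string stays UTF-8, and the UCS-2 buffer is built (once) only when script touches it:

```rust
use std::cell::RefCell;

/// Hypothetical lazily-converted DOM string.
struct LazyDomString {
    utf8: String,
    /// Filled in the first time the JS bindings need a UTF-16 view.
    utf16: RefCell<Option<Vec<u16>>>,
}

impl LazyDomString {
    fn new(utf8: String) -> Self {
        LazyDomString { utf8, utf16: RefCell::new(None) }
    }

    /// Called at the JS binding boundary; pays the conversion cost at most
    /// once per string.
    fn as_utf16(&self) -> Vec<u16> {
        self.utf16
            .borrow_mut()
            .get_or_insert_with(|| self.utf8.encode_utf16().collect())
            .clone()
    }
}
```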
  • zwarich: When converting in the DOM, lose benefits of (??). Will have different asymptotic behaviour (???)
  • kmc: Need to know what real sites do, but should get a good idea from self-contained experiments. How slow is UCS2 charAt on UTF8? How big must a string be for us to care? Can write self-contained programs to show this. Could have clever tricks besides caching, like wavelet trees.
  • zwarich: Feel only thing in practice will be charAt followed by converting to UTF16. Don't see wavelet trees landing in SM.
  • kmc: Does SM use ropes?
  • zwarich: Everyone does.
  • kmc: If you do, could record number of UCS2 code units in each chunk and should be like skip list on characters.
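A sketch of kmc's rope idea, with hypothetical names: record the cumulative UTF-16 code unit count at each chunk boundary, so a UCS-2 index lookup binary-searches to the right chunk and only scans within it, much like a skip list:

```rust
/// Hypothetical rope over UTF-8 chunks.
struct Rope {
    chunks: Vec<String>,
    /// cumulative_units[i] = total UTF-16 code units in chunks[..=i].
    cumulative_units: Vec<usize>,
}

impl Rope {
    fn new(chunks: Vec<String>) -> Self {
        let mut total = 0;
        let cumulative_units: Vec<usize> = chunks
            .iter()
            .map(|c| {
                total += c.encode_utf16().count();
                total
            })
            .collect();
        Rope { chunks, cumulative_units }
    }

    /// O(log chunks + chunk length) instead of O(total length).
    fn char_at(&self, target: usize) -> Option<u16> {
        // First chunk whose cumulative count exceeds `target`.
        let i = self.cumulative_units.partition_point(|&units| units <= target);
        let chunk = self.chunks.get(i)?;
        let before = if i == 0 { 0 } else { self.cumulative_units[i - 1] };
        chunk.encode_utf16().nth(target - before)
    }
}
```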
  • mbrubeck: We should be willing to regress benchmarks if they're not good representations of reality. Not good PR, but can advance state of the art. SM has been handcuffed by SunSpider for a long time. JSC and v8 have seemed to give up on SS in order to win on others, based on AWFY.
  • zwarich: Don't care about SS since no longer used as marketing benchmark. Everybody's neck and neck. Someone on SM claimed that SS is only benchmark that does charAt in perf-sensitive way, but don't know if that's actually the case. Regressing perf on benchmark also requires convincing SM team too.
  • kmc: I think it's unlikely our team will be able to push SM forward, given our existing resources and planning horizon.
  • zwarich: If we decide to go down a certain path, then by the time we want to optimize memory usage, the SM team needs to be at least on board with the potential plans. Don't want to be trapped into copying every string, unable to get around it without changing the DOM design.
  • kmc: Should get numbers, but think space usage could be helped by (???). Wouldn't require any changes to SM.
  • zwarich: Would negate benefits of UTF8.
  • kmc: More compact, takes less time to stream through memory. html5ever benefited with fewer bytes going through the parser. Better for SIMD, etc. If we end up with an enum that's either str or UTF16, that's probably acceptable if it gets us better perf, instead of storing UTF16 everywhere for all-english pages.
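A sketch of the two-representation enum kmc floats (hypothetical names): keep the compact UTF-8 form for the common all-English page and fall back to UTF-16 only where needed:

```rust
/// Hypothetical DOM string with two representations.
enum DomString {
    Utf8(String),
    Utf16(Vec<u16>),
}

impl DomString {
    /// Payload size in bytes: for mostly-ASCII content the Utf8 variant is
    /// roughly half the size of the Utf16 one.
    fn payload_bytes(&self) -> usize {
        match self {
            DomString::Utf8(s) => s.len(),
            DomString::Utf16(v) => v.len() * 2,
        }
    }
}
```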
  • zwarich: Having latin-1 codepath means don't need to branch.
  • kmc: Might be good to investigate. Right now parser is written around converting everything to UTF8, good for byte scans. Could also work with latin1.
  • zwarich: Should see how the WK parser does it. Might detect latin1 on the fly.
  • kmc: There's a character sniffing algorithm.
  • jack: Have we abandoned WTF8?
  • zwarich: Difference between UTF/WTF8 is pretty small. Only JS engine would need to use WTF8 and we could use UTF8 at DOM level and above.
  • kmc: Kind of arbitrary. No reason not to allow invalid surrogates in DOM if SM supports WTF8.
  • zwarich: Can't pass arbitrary strings; not memory safe. Could make new &str that only points to all chars except last unpaired surrogate.
  • kmc: Wouldn't necessarily need to copy; could scan through WTF8 slice, confirm valid UTF8 slice and transmute. Not clear on near-term decisions that we're trying to make. If big changes to SM are off the table, do we stick with UTF8 or move to UTF16 in the DOM? Would like to put off the latter as long as possible, hope that other approaches work out.
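A sketch of the zero-copy check kmc describes. In well-formed WTF-8, the only byte sequences that are not valid UTF-8 are the 3-byte surrogate encodings (0xED followed by 0xA0..=0xBF), so scanning for that pattern tells you whether the slice can be viewed as UTF-8 directly; this sketch lets std::str::from_utf8 do the final validation rather than transmuting unsafely:

```rust
/// Tries to view a (well-formed) WTF-8 byte slice as UTF-8 without copying.
/// Returns None if an encoded surrogate is present, in which case the
/// caller has to copy or replace.
fn wtf8_as_utf8(wtf8: &[u8]) -> Option<&str> {
    // In WTF-8 a surrogate is encoded as 0xED 0xA0..=0xBF 0x80..=0xBF;
    // 0xED followed by 0xA0..=0xBF is precisely what UTF-8 forbids.
    let has_surrogate = wtf8
        .windows(2)
        .any(|w| w[0] == 0xED && (0xA0..=0xBF).contains(&w[1]));
    if has_surrogate {
        None
    } else {
        // Validates (and thus avoids an unchecked transmute); succeeds for
        // any well-formed WTF-8 input containing no surrogates.
        std::str::from_utf8(wtf8).ok()
    }
}
```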
  • zwarich: Would like to know opinion of SM team on whether direction of WTF8 would be allowed? Maybe talk to Jan de Mooij?
  • jack: So UTF8 DOM and talk to the SM team?
  • kmc: We already have that, since we use String. Should see how lazy conversion stacks up: how often are strings converted, and what's the distribution by length?
  • zwarich: Relatedly, if you have utf8 resource and utf8 strings, instead of making a copy you can just have a pointer that atomically refcounts the resource and contains a span. When parsing, not actually copying any strings. WK is making this change right now.
  • kmc: We're working on switching html5ever to that model using IOBufs (backing buffer + slice). Should be possible to run through parsing pipeline without copies, then lazily convert to UCS2 for the DOM. Does need to be atomic refcounting, and will require a bunch of refcount bumps, so could be perf bottleneck.
  • zwarich: Still probably faster than malloc and free.
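A sketch of the refcounted-span model zwarich and kmc describe (backing buffer plus slice); the names are hypothetical, but the shape is similar to what html5ever is moving toward:

```rust
use std::ops::Range;
use std::sync::Arc;

/// Hypothetical zero-copy string span: an atomically refcounted handle to
/// the decoded UTF-8 resource plus a byte range into it. Cloning a Span
/// bumps a refcount instead of copying text.
#[derive(Clone)]
struct Span {
    buf: Arc<str>,
    range: Range<usize>,
}

impl Span {
    fn new(buf: Arc<str>, range: Range<usize>) -> Self {
        // Only slice on char boundaries so as_str() can never panic.
        assert!(buf.is_char_boundary(range.start) && buf.is_char_boundary(range.end));
        Span { buf, range }
    }

    fn as_str(&self) -> &str {
        &self.buf[self.range.clone()]
    }
}
```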
  • kmc: I think UTF8 being smaller is a pretty big performance win even with higher asymptotic complexity.
  • jack: How will we measure whether performance win or not?
  • kmc: Already did some benchmarking for parsing demonstrating better to use UTF8. Should have microbenchmarks to show how big a string can be before we start to care about charAt. Then look at distribution of strings on the web.
  • jack: Talking about comparing charAt performance of scan of string vs. bare index?
  • kmc: Calling into SM might dwarf cost of scanning, not sure. Should be easy question to answer.
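A self-contained sketch of that microbenchmark (string-side costs only, no SM call overhead; the sizes and contents are made up for illustration): a charAt implemented as a forward scan over UTF-8 versus a bare index into a UTF-16 buffer:

```rust
use std::time::Instant;

fn main() {
    for &len in &[16usize, 256, 4_096, 65_536] {
        let s: String = "héllo wörld ".chars().cycle().take(len).collect();
        let utf16: Vec<u16> = s.encode_utf16().collect();
        let mid = utf16.len() / 2;
        let mut sink = 0u32; // keeps the compiler from optimizing the loops away

        // charAt via forward scan over UTF-8.
        let t = Instant::now();
        for _ in 0..1_000 {
            sink = sink.wrapping_add(s.encode_utf16().nth(mid).unwrap() as u32);
        }
        let scan = t.elapsed();

        // charAt via O(1) index into a UTF-16 buffer.
        let t = Instant::now();
        for _ in 0..1_000 {
            sink = sink.wrapping_add(utf16[mid] as u32);
        }
        let index = t.elapsed();

        println!("len {}: scan {:?} vs index {:?} ({})", len, scan, index, sink);
    }
}
```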