Workweek string interning

Jump to bottom

Josh Matthews edited this page Jul 8, 2014 · 2 revisions

String interning

kmc: I've been looking into interning in the html parser. a string is either statically interned (many strings we know at compile time), dynamically interned during parsing (css classes, can't predict but need fast matching), not interned. datatype has those three variants. static is good, have macros to allow quick matches on strings that optimizes really well - llvm uses bitmask on immediate 64bit values, so most common tags in first 64 are very quick. dynamic should be orthogonal to static interning; one idea is if string is <sizeof(pointer), store those characters if it's not a static interned string. need invariant that if string can be static variant, always is. so string is either statically interned to small integer, dynamically interned to hashtable, dynamically "interned" immediate, not interned.
jack: how do you determined inline vs. in hashtable?
gw: set a bit?
kmc: yes, tagging. could never intern on non-ascii, and ensure high-bit is set if pointer. current impl is rust enum, so two words, but want to turn that into one.
jack: could use invalid utf8?
kmc that's another option. there's a email thread, I'll put it in the etherpad. [ https://groups.google.com/d/msg/mozilla.dev.servo/1K2-Qy27e3A/LqiZ0EIT0EMJ ]
jack: wasn't there something about sha-something? [ https://groups.google.com/d/msg/mozilla.dev.servo/4NJsID0kwoc/3BWneVlCegwJ https://github.com/mozilla/servo/issues/2217 ]
kmc: not that good; need at least sha256, so everything gets bigger. optimizes parsing at expense of layout because you don't synchronize on hashtable, but matching is comparing four words and takes more memory. probably using strong crypto hash function is not a win even if fn is fast; interning by storing small strings immediately should be a win, lots of small strings. less synchronizing on hashtable. don't think we should worry on which datastructure, just touch it less. inline interning is one way to do that. start with a simple mutex-guarded hashtable; measure and replace if it's a problem.
jack: what about a task instead of concurrent hashtable?
kmc: yes, sync on channel instead of mutex.
brson: performance will be worse.
gw: it's a green thread context, though.
brson: will be faster than a kernel context switch on a mutex
kmc: yeah but [on Linux] mutex is futex, so you only hit the kernel in the contended case. I like the idea of interface of "this hands out interned strings" so we can measure different implementations easily.

DOM string representation

kmc: another related topic is how to represent DOM strings.
simonsapin: utf8 vs. ucs2?
kmc: yes. I want to be aggressive in servo about pushing utf8 as far as possible in web platform, assuming people aren't using lone surrogates much.
simonsapin: problem of ucs2 is indexing - slow path is not lone surrogates, but anything outside of Basic Multilingual Plane. if we want utf8 in JS, need to wait on spidermonkey.
kmc: they like the idea. we can do a lazy conversion - guessing that most strings don't get touched by JS, so we could only convert when necessary. win for memory usage, since most text won't be converted at all. parser should be faster since most content comes off the wire as utf8.
jack: what do we do right now?
kmc: we have conversion from utf8 -> utf16 in bindings.
jack: so we do this already?
kmc: but we don't save the results
jack: what would we do for indexing for the lazy conversion?
kmc: if SM is entirely ucs2 and we do lazy conversion, no issue. interesting question is if there is anything outside of JS that requires indexing into ucs2 strings. if in world where some JS strings are not ucs2 - could flag ones that contain non-BMP chars, or even non-ascii. could do it chunk-wise (chunks of similar encodings).
zwarich: could be tricky preserving current optimizations that benchmarks rely upon.
jack: could be fine, since they probably create their own strings and DOM doesn't touch them.
zwarich: all strings in webkit are ucs2 or 8 bit Latin-1. there are more cases in engine where utf8 would suffice, but if ascii then don't worry about JS interface. have specialized parsers for 8 bit strings, which is big performance win.
kmc: my parser exploits that all html chars with special meaning are ascii, so can do byte-oriented parsing. already a big performance win and moreso with SSE string instructions. would be interesting research to try fancy adaptive string representations (e.g. succinct rank-selection dicts for UTF-8 indexing), but our short-term approach that doesn't rely on SM hacking is lazy conversion to utf16.
simonsapin: concern that lone surrogate is not just slow path, but cannot be represented in utf8.
kmc: only way to get lone surrogate is from JS, in which case could be ucs2 forever.
simonsapin: if script creates dom node with lone surrogate, in rust dom node we wouldn't store utf8?
kmc: the lazy-conversion proposal is that DOMString would be an enum of either ucs2 or utf8 string. utf8 only gets created by parsing, ucs2 is created out of script or converted before returning to script. could have small string optimization (attr names could be stored inline). haven't considered text rendering; text node could be either utf8 or utf16, do we convert or have text engine that accepts both?
zwarich: want to avoid doing conversions everywhere.
kmc: pretty sure harfbuzz only takes utf8 right now. ucs2 in Servo only exists in SM at this time.
zwarich: will we convert everything in html parser to utf8? there are other encodings...
kmc: looked at numbers - utf8 is at least 80-90% of content [ed: 80.8% according to http://w3techs.com/technologies/overview/character_encoding/all], next most common are latin1/1252 [9.9%], [then Windows-1251 for Cyrillic, 2.5%], 16 bit chinese encodings [GB2312, 1.6%], couple others...
simonsapin: converting to utf8 is not worse than converting to ucs2.
kmc: better, since halves most content size. almost no content is utf16 so you always have to convert once. if you have invalid utf8, it's handled before parsing stage. previous stage turns invalid utf8 into replacement char.
kmc: parser needs to handle surrogate pair split across document.write calls, so hold buffer of surrogate until second call. when see leading surrogate, can store one code unit and next code unit is either trailing surrogate (now valid utf8) or is anything else and turn surrogate into replacement character. depends on tiny change to web spec which hsivonen is fine with (fine to turn lone surrogates into replacement character). spec assumes all browsers/parsers are ucs2, but probably fine to change spec/break compat on lone surrogates, as long as proper surrogate pairs are handled properly under this scheme.

More on string interning

ryan: q about string intern: if strings in different formats, how to compare for equality? static string and dynamic string.
kmc: for different encoding, natural thing for lazy conversion would be converting to utf16 if necessary.
kmc: for static vs. dynamic interned strings, initially wrote to compare characters always. but for fast matches, we want the invariant if string can be statically interned it always is. however it may be case that strings are sometimes dynamically or not interned, and can be handled by detecting mismatch.
ryan: when I tried last time, the conversion on the fly was quite slow. need something to make it fast, but not sure.
kmc: in my scheme, a dynamically interned string is a pointer to string owned by hashtable. non-interned is an owning pointer to the string. in both cases, you have pointer to characters and can compare. I think you intern dynstring by storing pointer to string itself, so comparison is fast.
ryan: do they point to the same string?
kmc: yes, if both strings are dyn, can compare by pointer. if one isn't, need to compare characters.
gw: we need to refcount hash table entries so we can delete unused interned strings
kmc: can be expensive, I'm not sure how often they're copied during style recalc/selector matching
gw: only applies to strings interned through hashtable...
kmc: if most are statically interned or immediately interned (inline), shouldn't be issue.
gw: think you need to be able to remove from the hashtable.
kmc: probably sites that create classes on fly and fill up the table. no way to determine all owners.
ryan: how do we do case-insensitive comparison on interned strings?
kmc: tricky. for the parser things mostly get lowercased before interning. in general at interning time if you don't know will be compared caselessly you have no benefit from interning.
gw: some parts of the DOM today store both original and lower-case versions; we could just intern both [That's not for case-insensitive matching, though.]
kmc: if you refcounted them together it could save memory. I think many details will wait until we have implementation and see performance numbers.
zwarich: important to think about 8 bit character case, since webkit and blink see big benefit from optimizing. also when there's SM support, there is no need to be fancy to expose those to JS.
kmc: ascii-only strings?
zwarich: Webkit and Blink do it for latin1. when you create string object, can create from ucs2 buffer or ascii buffer. benefit is in JS engine can just pass around smaller, less-fancy indexing data structures.
kmc: fixed-size for certain number of characters is tried and true.
zwarich: CSS parser should probably do similar things to HTML parser.
simonsapin: css parser is utf8, could change. may need to change to handle lone surrogates anyways.
jack: what would that be used for?
jdm: maybe multiple appends of surrogates to inline css block?
simonsapin: not quite like document.write; separate appends are separate rules, so not parsed the same way. still might be found on the web.
jack: at some point it's ok to break these. who's working on this? gw and kmc? who will be the person to commit it to servo?
kmc: do we see value in landing interning before parser is ready?
zwarich: could be good for interface for interning to land before html parser so can update dom bindings, css parser, etc.
kmc: I could pull out the interning and put it into servo before the parser is done.
jack: maybe we can finish benchmarking first.
simonsapin: is it only internal to the parser?
kmc: clients of the treebuilder see Atom rather than String. can compare atoms, turn atom into string slice. basically looks like an immutable String but sometimes comparison is much faster