
Transcription parallelism


String interning

  • We have a string interning version with a linked list of arrays and a hash function. An Atom has a raw pointer to the string and lowercases it. In Servo, every string comparison currently converts to lowercase first, which is really slow. Interning makes these pointer comparisons instead, which is a huge speedup (see the sketch after this list).
  • Comment from bz is that it's not thread-safe. I just use the global hash table. Does it need to be thread-safe?
  • pcwalton: We have also been discussing. I think we have a good idea about how to do this.
  • I am trying to implement a lock-free interning. I am using AtomicPtr. Is that the right way?
  • brson: I would prefer AtomicOption
  • I just need to read the value, but AtomicOption only has take, not read, so I can't read the value unless I remove it from the hash table.
  • jack: We should fix that in the std library
  • brson: May be intentional. If you want multiple threads, you may want an UnsafeAtomicPtr.
  • Currently, I am just using AtomicPtr and compare-and-swap. This work is in progress. Servo does not use string interning yet, but we will get good performance improvements from it. In October, we had our internal evaluation where we showed numbers; that first (sequential) version made a huge contribution.
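
A minimal sketch of the pointer-comparison idea, assuming a global Mutex-guarded table rather than the lock-free AtomicPtr scheme under discussion; the Interner type here is hypothetical, not Servo's:

```rust
use std::collections::HashSet;
use std::sync::Mutex;

// Hypothetical interner; the lock-free AtomicPtr version discussed
// above would replace the Mutex with compare-and-swap insertion.
struct Interner {
    table: Mutex<HashSet<&'static str>>,
}

impl Interner {
    fn new() -> Interner {
        Interner { table: Mutex::new(HashSet::new()) }
    }

    /// Lowercase once at intern time; equal strings then share one
    /// allocation, so later equality checks are pointer comparisons.
    fn intern(&self, s: &str) -> &'static str {
        let lowered = s.to_ascii_lowercase();
        let mut table = self.table.lock().unwrap();
        if let Some(&interned) = table.get(lowered.as_str()) {
            return interned;
        }
        let leaked: &'static str = Box::leak(lowered.into_boxed_str());
        table.insert(leaked);
        leaked
    }
}

fn main() {
    let interner = Interner::new();
    let a = interner.intern("DIV");
    let b = interner.intern("div");
    // One pointer comparison instead of lowercasing both strings and
    // comparing byte by byte on every equality test.
    assert!(std::ptr::eq(a, b));
}
```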

Parallel CSS selector matching

  • I have not implemented this one; another guy did. Similar to the one in the paper.
  • We start with the DOM tree and do a preorder traversal, splitting the resulting vector into 8 pieces (for 4 cores). We send each piece to a task, which does CSS selector matching independently (see the sketch after this list). After we get the results, we do one more traversal to calculate the CSS values for each node. This gave us about a 1.2x performance improvement on our test file (5k nodes, with ~1000 CSS selectors). We also tried to parallelize the pass that calculates the CSS values: it takes 10ms by default, but with two tasks it went up to about 200ms.
  • Originally, matching took 80ms; the first parallel attempt took 400ms, and after the Rust fixes it is 50-60ms. This works because CSS selector matching currently only reads the DOM tree nodes and traverses upwards, so it can be done independently. Sibling selector matching would require us to do it differently.
  • jack: So with 8 tasks, 1.2x speedup?
  • Yes, better than 4 tasks.
  • pcwalton: We probably lost a lot of performance by creating the vector. If we directly followed the links using unsafe code, it would probably be a lot faster. That vector creation probably costs a lot.
  • How do we do that?
  • pcwalton: Traverse the nodes using a task pool with 8 tasks. We probably want to do something like work-stealing. If you want to do a preorder traversal, then once you process a node, enqueue its children.
  • They're independent.
  • pcwalton: Then whenever a task picks off a node, it should push the children onto the queue before running anything, and the other tasks should be looking for work to steal. Sounds like what Leo did in his paper: the moment a parent is processed, it pushes its children.
  • brson: Can we just round-robin the nodes?
  • jack: Just traverse the tree, create a task for selector matching...
  • pcwalton: Too slow to create those tasks.
  • brson: Use a pool, maybe, that's persistent. Traverse by sending a message for each node to the tasks in the task pool. But that may be too slow. So maybe tile and send 4-8 at a time.
  • pcwalton: Won't the tiling be a sequential bottleneck? I would have an atomic counter so that the next child goes into the work queue for the task.
  • kmc: Need a concurrent queue.
  • pcwalton: Need for work-stealing anyway.
  • jack: Can we use what Rust already uses?
  • kmc: Do we need it for layout anyway?
  • brson: Traversing once to distribute work across n cores is going to be very fast. We can send all the messages to the cores quickly.
  • pcwalton: Why not just let the tasks discover their children? Seems like it saves one tree traversal sequentially.
  • brson: I'm just thinking it's easier and faster.
  • kmc: Maybe need work-stealing now.
  • The reason we did this is that we needed to split/chunk the work so we could pass it to other tasks. That gave us a good improvement. Maybe there is another way.
  • pcwalton: Maybe batch up your kids and send them to another task?
  • Maybe batch up 4 or 8 at a time?
  • pcwalton: Ultimately, that's sad; work-stealing is going to be much better.
  • kmc: When you built the vector, did you make sure the vectors are cache line-aligned? So that multiple processors are not touching the same thing.
  • We just have 8 separately allocated vectors. We need to send them unsafely.
  • pcwalton: There's a par function in the stdlib that used to be able to chunk and perform work on vectors in parallel. That's probably what you would want.
  • kmc: Takes a slice and returns two subslices that can be borrowed independently? parMap?
  • pcwalton: Consumes the vector & returns slices that can be sent?
  • brson: Traversing the entire DOM seems like a huge bottleneck.
  • kmc: ad-hoc traversal
  • pcwalton: just work-stealing with a spinlock?
  • brson: pthread_mutex in rust is better than that... Maybe we should start using it everywhere and talk about it.
  • brson: we need a deque. Let's take one and write it in Rust.
  • kmc: jason was going to do it. I should just follow up.
  • lbergstrom: I'll take the CSS selector matching and port work stealing to it.
  • Task spawn takes a lot of time.
  • yichoi: That was in the early stages; maybe we could test it again.
  • brson: task spawning has not been tuned yet.
  • kmc: But green threads are our cheap abstraction.
  • jack: What does a task that does nothing take? 4kb?
  • brson: a little more than that.
  • jack: in Erlang, they preallocate the processes.
  • brson: We cache nothing associated with tasks. It is a fairly big structure. They're not inlined, etc.
  • kmc: There are tricks used in the kernel for this.
  • lbergstrom: Can you provide the test case?
  • Yes.
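
A minimal sketch of the chunked matching described at the top of this list, assuming a preorder vector of nodes; std::thread::scope stands in for Servo's tasks, and Node and match_selectors are placeholders:

```rust
use std::thread;

// Placeholder node; Servo's real DOM node type differs.
struct Node {
    name: &'static str,
}

fn match_selectors(node: &Node) {
    // Stand-in for real matching, which only reads the node and walks
    // upward, so each chunk can be processed independently.
    let _ = node.name;
}

/// Split the preorder vector into roughly one piece per task (the last
/// pieces may differ in size) and match each piece in parallel.
fn parallel_match(preorder: &[Node], tasks: usize) {
    let chunk = (preorder.len() / tasks.max(1)).max(1);
    thread::scope(|s| {
        for piece in preorder.chunks(chunk) {
            s.spawn(move || {
                for node in piece {
                    match_selectors(node);
                }
            });
        }
    });
}

fn main() {
    let nodes: Vec<Node> = (0..5000).map(|_| Node { name: "div" }).collect();
    parallel_match(&nodes, 8); // 8 pieces, as in the experiment above
}
```

The work-stealing variant pcwalton describes would drop the up-front vector entirely: each worker pops a node from a shared deque and pushes that node's children before matching it, so idle workers always find work to steal.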

Succinct (compact) representation

  • I worked on this in graduate school. The idea is to flatten the DOM into an array using a balanced-parentheses scheme. Nodes have ancestor/descendant relationships; imagine parenthesizing the tree. This allows fast traversal and performing DOM operations by reading the open/close parens and counting (see the sketch after this list). It optimizes memory space by using one bit per paren and creating a corresponding array of DOM nodes.
  • So a preorder traversal is just an array walk. We can read the bits from the index to traverse the tree. This might make traversals simple to implement.
  • Counting the number of set bits in a byte is required (or we can hash the values into a table) to make traversals fast. getParent may be slower (if you start from the back and have to read the counts) since there is no back pointer, but we can build a reverse index.
  • kmc: Like succinct dictionary representations?
  • lbergstrom: How do you handle add/removeChild?
  • Updates might be slower because we might have to shift the bits, so instead we leave empty spaces in the array.
  • jack: So bitmask with empty vs. full, too, then?
  • Index on top of them counts the number of bits stored. Forward/backward access allows the traversals. Size of the index is small.
  • jack: Do the nodes still contain pointers to their parents, or only non-pointer data? Are first/last/next_sibling/prev_sibling only in the index now? So, if we wanted to do the COW DOM, would we need two indexes?
  • Yes.
  • pcwalton: Maybe better for the flow tree. That one does not have to interact with the GC (that's difficult for this one). The flow tree we have complete control over. The flow tree also uses a lot of memory. The flow tree does a lot of traversals and its allocation needs to be extremely fast. It's also what we do all the reflow on. So, we might be able to get some interesting optimizations on it. And its shape does not change once it's built.
  • jack: Might work very well for COW DOM, since those pointers are annoying. We were going to have 5 extra pointers. In this representation, we get extra savings.
  • pcwalton: Don't we just need one? For the dirty node?
  • jack: One plus 5 for each that changed. Because the dirty nodes are interior...
  • pcwalton: let's talk later.
  • yichoi: Can save memory usage. Not sure, but maybe we can exploit SIMD with this?
  • kmc: Many database companies use complicated data structures of this kind.
  • Can do; will send it to you.
  • dherman: Are you thinking about all nodes everywhere or just some certain kinds of nodes?
  • Just wanted to see if this technique seems interesting.
  • dherman: There might be 2-3 types of nodes where there are huge wins, e.g., simple nodes with perfect children. Probably try just doing it where we expect it to win, like where the tree is not deep and there are not a lot of bi-directional/sibling traversals. We should see if there's a place where it works.
  • I have a paper on this work.
  • kmc: If modifications are more expensive, maybe store it on the side.
  • Yeah, maybe use more compact representation.
  • brson: Send us papers.
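
A minimal sketch of the balanced-parentheses encoding discussed above, assuming one bit per paren (1 = open, 0 = close, written in preorder); a real succinct structure adds rank/select indexes so these linear scans become constant time:

```rust
/// First child of the node opening at bit `i`: in a preorder paren
/// string, it is simply the next bit if that bit is an open paren.
fn first_child(parens: &[bool], i: usize) -> Option<usize> {
    if parens.get(i + 1) == Some(&true) { Some(i + 1) } else { None }
}

/// Next sibling of the node opening at bit `i`: skip its subtree by
/// counting paren depth until the matching close, then check for an
/// open paren right after it.
fn next_sibling(parens: &[bool], i: usize) -> Option<usize> {
    let mut depth = 0usize;
    for (j, &open) in parens.iter().enumerate().skip(i) {
        if open { depth += 1 } else { depth -= 1 }
        if depth == 0 {
            return if parens.get(j + 1) == Some(&true) { Some(j + 1) } else { None };
        }
    }
    None
}

fn main() {
    // The tree (a (b) (c (d))) encodes as 1 1 0 1 1 0 0 0, with nodes
    // a, b, c, d opening at bits 0, 1, 3, 4 respectively.
    let parens = [true, true, false, true, true, false, false, false];
    assert_eq!(first_child(&parens, 0), Some(1));  // a -> b
    assert_eq!(next_sibling(&parens, 1), Some(3)); // b -> c
    assert_eq!(first_child(&parens, 3), Some(4));  // c -> d
    assert_eq!(next_sibling(&parens, 3), None);    // c is the last child
}
```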

Parallelization ideas

  • A prescanner to scan the incoming stream, find external images, and download them early (see the sketch after this list).
  • We may be able to reduce the number of traversals if we only have inherited & synthesized attributes, i.e., ones that don't depend on descendants.
  • jack: prescanning seems like an obvious win.
  • dherman: Don't we do that?
  • pcwalton: Nobody does it as aggressively as the zoom project did, and that was a huge win. It's also because their success measure was initial page load (ignoring caches). That's primarily network-bound anyway, so it was a huge victory.
  • dherman: This matters for modules. JS will have modules that name external modules, and that has to be implemented on the JS engine side. We could have a prescan_for_requested_modules API. Not immediately necessary, but it should fit into the same infrastructure.
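
A minimal sketch of the prescanning idea, assuming we only want candidate <img> URLs out of raw HTML; a real prescanner would run incrementally over the network stream, handle attribute variations, and kick off the fetches immediately:

```rust
/// Deliberately naive scan for src="..." attributes of <img> tags so
/// downloads can start before full parsing.
fn prescan_image_urls(html: &str) -> Vec<&str> {
    let mut urls = Vec::new();
    let mut rest = html;
    while let Some(pos) = rest.find("<img") {
        rest = &rest[pos..];
        if let Some(src) = rest.find("src=\"") {
            let after = &rest[src + 5..];
            if let Some(end) = after.find('"') {
                urls.push(&after[..end]);
                rest = &after[end..];
                continue;
            }
        }
        break;
    }
    urls
}

fn main() {
    let html = r#"<p>hi</p><img src="a.png"><img src="b.png">"#;
    assert_eq!(prescan_image_urls(html), vec!["a.png", "b.png"]);
}
```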

Rule hash for CSS selector matching

  • Rule hashing is in other browsers and enables other optimizations from Leo's paper. Currently, we take every node and go through all the rules, checking each one to see whether it matches the node. That is obviously slow.
  • Idea: reduce the number of rules we attempt to check. We partition the rules into sets: if the last part of a rule is an ID selector (with a #, e.g., #sidebar), then we know a matching node must have id="sidebar", so we can put the rule in a hash table keyed on that ID. We can do the same for IDs, classes, and elements, which reduces the number of rules we need to check against each node (see the sketch after this list).
  • (See example on the slide for how it reduces them)
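
A minimal sketch of the rule hash, assuming each rule is bucketed by its rightmost simple selector; Rule, Key, and RuleHash are hypothetical stand-ins for Servo's actual rule representation:

```rust
use std::collections::HashMap;

// The rightmost simple selector of a rule, which decides its bucket.
enum Key {
    Id(String),      // e.g. "#sidebar"
    Class(String),   // e.g. ".warning"
    Element(String), // e.g. "div"
    Other,           // rules we must always check
}

struct Rule {
    key: Key,
    // declarations elided
}

#[derive(Default)]
struct RuleHash {
    by_id: HashMap<String, Vec<usize>>,
    by_class: HashMap<String, Vec<usize>>,
    by_element: HashMap<String, Vec<usize>>,
    universal: Vec<usize>,
}

impl RuleHash {
    fn new(rules: &[Rule]) -> RuleHash {
        let mut h = RuleHash::default();
        for (i, rule) in rules.iter().enumerate() {
            match &rule.key {
                Key::Id(id) => h.by_id.entry(id.clone()).or_default().push(i),
                Key::Class(c) => h.by_class.entry(c.clone()).or_default().push(i),
                Key::Element(e) => h.by_element.entry(e.clone()).or_default().push(i),
                Key::Other => h.universal.push(i),
            }
        }
        h
    }

    /// Only rules whose rightmost selector could possibly match the
    /// node, plus the ones we must always check.
    fn candidates(&self, id: Option<&str>, classes: &[&str], element: &str) -> Vec<usize> {
        let mut out = Vec::new();
        if let Some(id) = id {
            if let Some(v) = self.by_id.get(id) { out.extend_from_slice(v); }
        }
        for class in classes {
            if let Some(v) = self.by_class.get(*class) { out.extend_from_slice(v); }
        }
        if let Some(v) = self.by_element.get(element) { out.extend_from_slice(v); }
        out.extend_from_slice(&self.universal);
        out
    }
}

fn main() {
    let rules = vec![
        Rule { key: Key::Id("sidebar".into()) },
        Rule { key: Key::Class("warning".into()) },
        Rule { key: Key::Element("div".into()) },
        Rule { key: Key::Other },
    ];
    let hash = RuleHash::new(&rules);
    // A <div id="sidebar"> with no classes only needs rules 0, 2, and 3.
    assert_eq!(hash.candidates(Some("sidebar"), &[], "div"), vec![0, 2, 3]);
}
```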

With rule hashing

  • 3x speedup on standard web pages with ~1k CSS rules. There is a PR for that. Using this we can implement more optimizations.
  • Redundant selector elimination is possible. We can group rules that imply other rules so that we don't have to worry about multiple matching.
  • ryan: We combined string interning with the rule hash and got a lot of perf improvement.
  • Since not all the hash tables fit in cache, process the ID rules for all nodes first, then all the class rules, then the element rules, then the mandatory ones; that way you avoid cache misses.
  • dherman: No tuning?
  • Yes, just do it one at a time. One more optimization is to preallocate a vector for the rules that match a node, to avoid allocating on the heap. You can have a vector of size 15-ish and only allocate on the heap if you need more.
  • I will be working on this over the coming weeks.
  • pcwalton: I'll r+ it now.
  • jack: Have we started timing how fast we are compared to gecko and webkit?
  • pcwalton: 10x slower; we're missing string interning, and the 50x allocations really hurt us. Once we get those close to zero, we will be very competitive, except where we need the Bloom filter. But that is not all pages; a lot of pages just use ID and class selectors, and then we're fine because the Bloom filter won't help there.
  • kmc: Why is that?
  • pcwalton: It's for remembering your ancestors during descendant selector matching.
  • jack: Match the right-hand side first, then match the parents. The Bloom filter tells you whether a candidate ancestor might be on your path to the root (see the sketch after this list).
  • pcwalton: People mistakenly write "a div div" instead of "a > div > div", which causes an explosion. Web developers don't understand that the first one is wrong and expensive to match.
  • dherman: I just want a list of tips for how to parallelize for servo.
  • pcwalton: We have that for firefox, particularly on selectors.
  • kmc: developer UI could show you hints, and people will go nuts for it.
  • jdm: This is 2x faster because of servo's parallel speedups.
  • pcwalton: A clear helper would help; e.g., if we can detect when floats do not impact later content, we can tell people to add a clear.
  • dherman: at Northeastern, they wrote something that would tell you if you failed
  • jack: Like google search, we should show the parallel speedups we got in the UI. Right in the title bar.
  • lbergstrom: Can't do speedup without a baseline...
  • jack: Just show CPU utilization.
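
A minimal sketch of the ancestor Bloom filter mentioned above, assuming ancestor tags/ids/classes are inserted as the traversal descends; the hash functions and 256-bit size are illustrative, not what Gecko or Servo actually uses:

```rust
// Illustrative ancestor filter: 256 bits, two hash probes per key.
struct BloomFilter {
    bits: [u64; 4],
}

impl BloomFilter {
    fn new() -> BloomFilter {
        BloomFilter { bits: [0; 4] }
    }

    fn hashes(key: &str) -> (usize, usize) {
        // Two cheap FNV-style probes; real implementations differ.
        let mut h1: u64 = 0xcbf29ce484222325;
        for b in key.bytes() {
            h1 = (h1 ^ b as u64).wrapping_mul(0x100000001b3);
        }
        let h2 = h1.rotate_left(31);
        ((h1 % 256) as usize, (h2 % 256) as usize)
    }

    /// Record an ancestor's tag/id/class as we descend into it.
    fn insert(&mut self, key: &str) {
        let (a, b) = Self::hashes(key);
        self.bits[a / 64] |= 1u64 << (a % 64);
        self.bits[b / 64] |= 1u64 << (b % 64);
    }

    /// `false` means "definitely no such ancestor", so the rule can be
    /// rejected without walking up; `true` only means "maybe", since
    /// false positives are possible.
    fn may_contain(&self, key: &str) -> bool {
        let (a, b) = Self::hashes(key);
        (self.bits[a / 64] & (1u64 << (a % 64))) != 0
            && (self.bits[b / 64] & (1u64 << (b % 64))) != 0
    }
}

fn main() {
    let mut ancestors = BloomFilter::new();
    ancestors.insert("div");   // entered a <div>
    ancestors.insert("#page"); // entered an element with id="page"
    assert!(ancestors.may_contain("div")); // no false negatives
    // For a rule like "#sidebar div", a miss here rejects it without
    // touching the ancestor chain at all.
    println!("maybe #sidebar ancestor: {}", ancestors.may_contain("#sidebar"));
}
```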