
Worksheet Reading 3-4x Slower Than Baseline Of Underlying quick_xml Library #372

Open
RoloEdits opened this issue Nov 5, 2023 · 4 comments

@RoloEdits
Contributor

RoloEdits commented Nov 5, 2023

When looking into the worksheet read time, based on earlier conversations here, and seeing that it took up 22 seconds of a ~25 second operation, I began testing the underlying library, quick_xml, and found that reading through the raw worksheet xml file, 1.3GB on disk, took ~6 seconds when doing "nothing".

With that as the baseline we build on when actually doing work with the parsed events, the jump to 22 seconds suggests there could be some big wins to be had.

I'm starting this issue as a thread specifically for optimizing the worksheet reading time. As seen in the issue I created for quick_xml, we could benefit from some #[inline] coating there, but I want to look more into what is done in this crate for improvements in design and implementation. Given the share of time this currently takes, 85%+, any improvement will have a noticeable effect on the overall operation.
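
For reference, a minimal sketch of the kind of "do nothing" baseline read described above, using quick_xml's `read_event_into` loop (the file name is a placeholder for the extracted worksheet xml):

```rust
use quick_xml::events::Event;
use quick_xml::Reader;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Raw worksheet XML extracted from the workbook (placeholder path).
    let mut reader = Reader::from_file("sheet1.xml")?;
    let mut buf = Vec::new();
    let mut events = 0u64;
    loop {
        match reader.read_event_into(&mut buf)? {
            Event::Eof => break,
            // "nothing": just count events until Eof.
            _ => events += 1,
        }
        buf.clear();
    }
    println!("{events} events");
    Ok(())
}
```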

@tafia
Copy link
Owner

tafia commented Nov 7, 2023

Thanks! This is indeed much more than what I had in mind.

Out of curiosity, could you measure

  • the time it takes to read the sheet only (as opposed to the workbook with shared strings etc.)
  • the time it takes to unzip that 1.3GB file?
  • all text is utf8, so could you update the quick-xml bench to convert Events (Start and Text) to &str (not String, as we don't want to count allocation)? See the sketch below.
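
For that third point, a hedged sketch of what the bench tweak might look like: borrowing Start names and Text content as &str without allocating, assuming valid UTF-8 input.

```rust
use quick_xml::events::Event;
use quick_xml::Reader;

// Sketch only: convert Start names and Text content to borrowed &str,
// never allocating a String.
fn read_as_str(xml_path: &str) -> Result<(), Box<dyn std::error::Error>> {
    let mut reader = Reader::from_file(xml_path)?;
    let mut buf = Vec::new();
    loop {
        buf.clear();
        match reader.read_event_into(&mut buf)? {
            Event::Start(e) => {
                // Borrow the tag name as &str; errors if not valid UTF-8.
                let _name: &str = std::str::from_utf8(e.name().as_ref())?;
            }
            Event::Text(e) => {
                // BytesText derefs to [u8], so this borrows the raw text.
                let _text: &str = std::str::from_utf8(&e)?;
            }
            Event::Eof => break,
            _ => {}
        }
    }
    Ok(())
}
```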

And thanks for opening this issue!

@RoloEdits
Contributor Author

RoloEdits commented Jan 30, 2024

Ok, sorry for the long delay, life has been hectic.

I started working on a near-from-scratch branch to get a better feel for the full code base. So far I have a basic working implementation for the case where the program saving the spreadsheet actually makes an effort to follow the specification (looking at you, OnlyOffice).

Starting from scratch allowed me to more easily test performance changes. Using the dataset we have been working against, a "nothing" run, that is, no event handling, just continuing until Eof, took 9.5 seconds on my computer. My base implementation (which also gets font color information) is at ~13 seconds, down from the 22 seconds in the last test (I did not test against the DataRef changes).

The work is all bundled up in a wip module. The api also changed a lot, and there is now an underlying Spreadsheet data structure as an abstraction over a Vec<Cell<'a>>: one contiguous buffer over which a 2d table is abstracted using offset math, row * columns + column.
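
A minimal sketch of that layout; the names and fields here are assumptions based on the description above, not the actual wip code:

```rust
// Assumed shape of the dense, row-major cell store.
struct Cell<'a> {
    value: &'a str,
}

struct Spreadsheet<'a> {
    cells: Vec<Cell<'a>>, // one contiguous, row-major buffer
    columns: usize,       // max column count, fixed once known
}

impl<'a> Spreadsheet<'a> {
    // 2D lookup over the flat buffer: row * columns + column.
    fn get(&self, row: usize, column: usize) -> Option<&Cell<'a>> {
        if column >= self.columns {
            // An out-of-range column would otherwise alias into the next row.
            return None;
        }
        self.cells.get(row * self.columns + column)
    }
}
```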

I am not testing memory usage right now, as the style information for cells and text takes up a lot more memory than just the values do in master.

This is the profile of the nothing run:
[profile screenshot]

This is the most recent profile I took of wip:
[profile screenshot]

The ..sheet one is the Spreadsheet data structure. It's able to take advantage of the dimension element if the specification is actually followed (again, looking at you, OnlyOffice 😠 ). It resizes the underlying Vec<Cell<'a>> to match the dimensions. The middle part, between the iterator going over the attributes and the BufRead, is all the actual work that is added/needed to turn bytes into structure.

The current implementation fills the Spreadsheet with empty Cells, as they would be represented in the actual spreadsheet. This takes time but brings other benefits, and given that it dropped to 13 seconds from 22, I think it can be worked with for now.

The values of the cells are all stored in a small-string-optimized type. This cleans up the api so the cell value is just .value() -> &str; the idea is that parsing can be done by the user of the library. An enum is also kept in the cell, potentially to mimic the current api, but it's decoupled from containing the value. This change is for potential write functionality.
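
As a sketch of that storage, assuming a small-string type such as compact_str::CompactString (the branch's actual type may differ):

```rust
// Sketch only: CompactString stores short values inline and falls back
// to the heap for long ones, so most cell values avoid an allocation.
use compact_str::CompactString;

struct Cell {
    value: CompactString,
}

impl Cell {
    // The whole value api: the user parses numbers/dates themselves.
    fn value(&self) -> &str {
        &self.value
    }
}
```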

The rewrite also restructures the project a bit, or at least starts to. With wip acting as a lib, it keeps the top-level abstractions in it: the workbook trait, worksheet, and cell. The workbooks, in this case Xlsx, implement the workbook trait. The worksheet and cell are abstracted enough to hopefully fit a generic spreadsheet implementation. The xlsx module then holds the workbook-specific implementation details, with the parse submodule used to break out the parsing aspects and parsing-related data structures that don't need to clutter other modules.

Other api changes include cleaning up naming and iterators:

```rust
use calamine::wip::Workbook;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let path = r#"NYC_311_SR_2010-2020-sample-1M.xlsx"#;

    // An enum that picks which workbook implementation to use based on the
    // extension. This mimics the current `open_workbook` functionality.
    let mut workbook = Workbook::open(path)?;

    // This lets you iterate over the worksheet names without parsing them.
    // This is used in nushell to get the names fast in the current api.
    for worksheet in workbook.worksheets() {
        println!("{}", worksheet.name());
    }

    // Returns Option<Result<Worksheet<'_>>> and parses the full worksheet
    // up front.
    let worksheet = workbook.worksheet("Sheet1").unwrap()?;

    // `rows()` returns a `Rows<'_>` iterator yielding `Row<'_>`.
    for row in worksheet.rows() {
        // If the column is out of range, `column` returns `None`.
        let a = row.column(0).unwrap();
        let b = row.column(1);

        // `value` can be `None` if it's an empty cell.
        let Some(name) = a.value() else {
            break;
        };

        let location = b.unwrap();

        println!(
            "| {}:`{}` | {} |",
            name,
            // `font` has `name`, `size`, `argb` and `rgb` getters, returned as `&str`.
            a.font().rgb(),
            location.value().unwrap(),
        );
    }

    Ok(())
}
```

This is still very incomplete and NOT meant to be merged; it's only here for reference and experimentation. But I would love to see what progress can be made going forward.

@tafia
Owner

tafia commented Feb 6, 2024

Thanks for the feedback!
What do you think is the main factor for the speed up you're seeing so far?

  • dimension resizing (can easily be ported, I think? Or do we not parse some dimensions currently?)
  • dense/sparse ranges. This is very worksheet dependent. I'd say defaulting to dense makes more sense, but ideally it could be some input from the api
  • anything else?

@RoloEdits
Contributor Author

What do you think is the main factor for the speed up you're seeing so far?

Hard to say for sure, to be honest. I think it could be a combination of a lot of different things, as the current profile of the master branch doesn't have any one thing that sticks out. The benefit of a rewrite is being able to simplify some things with foresight. I also made an effort to keep scopes in mind for instruction cache locality, i.e. nesting loops with the idea that a hot-path cache could be evicted if the loop body is too encompassing. But as I did this from the start, it's hard to tell what effect it had. As I implement more of the spec, I will keep this updated. But as the biggest part is worksheet parsing, I don't envision a steep falloff from the current times.

I did try atoi to parse ints faster, but I found that the std parse had the same performance. There is some use of unsafe when converting to &str using std::str::from_utf8_unchecked, but the document can be validated before getting to these sections. I did this pretty early on so I may be misremembering, but I believe the validation skip saved about ~500ms. I also use it for the column-row (A1) reference parsing, as validity there should be an invariant. Both of these are hot paths, so any gains add up. One last thing I found was that, where possible, working with raw bytes proved faster than working with &str.
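
A hedged sketch of what byte-level A1 reference parsing can look like (a safe version for illustration; the branch's actual code, including its unsafe UTF-8 skip, may differ):

```rust
// Parse an A1-style reference over raw bytes, e.g. b"C12" -> zero-based (11, 2).
fn parse_a1(reference: &[u8]) -> Option<(u32, u32)> {
    // Letters encode the column in base 26, 'A' = 1.
    let mut column = 0u32;
    let mut i = 0;
    while i < reference.len() && reference[i].is_ascii_uppercase() {
        column = column * 26 + u32::from(reference[i] - b'A' + 1);
        i += 1;
    }
    let digits = &reference[i..];
    if i == 0 || digits.is_empty() {
        return None; // need at least one letter and one digit
    }
    // Remaining bytes encode the one-based row number.
    let mut row = 0u32;
    for &b in digits {
        if !b.is_ascii_digit() {
            return None;
        }
        row = row * 10 + u32::from(b - b'0');
    }
    // Convert one-based A1 to zero-based (row, column); "A0" is rejected.
    Some((row.checked_sub(1)?, column - 1))
}
```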

dimension resizing (can easily be ported I think? Or do we not parse some dimensions currently?

Currently there is a Spreadsheet::insert that is dynamic, and a Spreadsheet::insert_exact that, when paired with Spreadsheet::resize to give the exact dimensions up front, will insert in O(1) time. Spreadsheet::insert will rebalance if the <dimension> tag is not available, so that part is already done. It's mostly columns that need a big readjustment; once the max columns are set, row growth is amortized, so inserts balance out to amortized O(1) over time. It should be fine to have a simple if statement checking whether the dimension tag was found and inserting on that branch, as in the sketch below. It is also something the branch predictor should have no issue handling, as the value is only changed in one location and never changes inside the loop, so there is no need to worry about hot-path branch mispredictions.
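
Illustrated with the Spreadsheet/Cell sketch from earlier (insert, insert_exact, and resize are the branch's names; the wrapper itself is assumed):

```rust
// One branch, decided once per sheet, chooses the insert path.
fn store_cell(sheet: &mut Spreadsheet, has_dimension: bool, row: usize, column: usize, cell: Cell) {
    if has_dimension {
        // <dimension> was present: `resize` already set exact bounds, so O(1).
        sheet.insert_exact(row, column, cell);
    } else {
        // No <dimension>: `insert` rebalances on demand, amortized O(1).
        sheet.insert(row, column, cell);
    }
}
```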

dense/sparse ranges. This is very worksheet dependent. I'd say defaulting to dense makes more sense but ideally it could be some input from the api

This is something I was also grappling with. If the end goal also supports writes, then there needs to be an underlying cell to write to, so having the "canvas" on which to "draw" seems like a must for an implementation, at least when using a Vec as the underlying structure. Otherwise the next option would be a linked list, to insert between two nodes, but parsing into that would mean an allocation every time. With the work already done to reduce allocations, this would basically eat it all back up.

The idea I've currently settled on is to default to a full canvas representation, which will also support cell mutation and random-position inserts, using Spreadsheet::insert to rebalance the structure on demand. This way the representation doesn't leak the abstraction of the worksheet/workbook implementation; it's more in line with expectations. Then, for those who only want to read with a small memory footprint, we can offer an iterator over the rows that returns the parsed cells for each row. As for implementation specifics, there are the unstable yield and gen constructs; these would make it easier to implement, as the state management is done by the compiler, but they may not be stabilized for a while. Hand-rolling it is also possible, of course, just with more boilerplate; it's mainly keeping the cursor state, which should be doable with a &mut to the ZipArchive handle. This should support style info as well, but I have to check more to make sure. If things go well, we can support all metadata.

As for the api, it could either be a Workbook::open_lazy that would lock you into this read-only version, or we could have a Worksheet::rows_lazy iterator method on the current worksheets. The current wip parses the full thing on Workbook::worksheet("Sheet1"), but that can be moved so that each rows or rows_lazy iterator starts the parse instead.

There could be a RowsLazy<'_> iterator that returns the same Row<'_> as Rows<'_>. We can add a Row::insert method that fills in the gaps of the row, so the indexes for the columns are kept, keeping to user expectations. It would go through each row in the xml, insert the cells into the Row<'_> struct, and then yield that row. In step with the lines method on BufRead, we can just allocate every row by default, and then for really pressing needs offer a rows_lazy_into(&mut Vec<u8>) so the buffer can be reused.
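
A hand-rolled version might look roughly like this. Everything here is an assumption: the type name, the owned-String yield (a lending iterator could borrow from the buffer instead), and the omission of gap filling and shared-string lookup.

```rust
use quick_xml::events::Event;
use quick_xml::Reader;
use std::io::BufRead;

// Sketch of a hand-rolled lazy row reader over worksheet XML.
struct LazyRows<R: BufRead> {
    reader: Reader<R>,
    buf: Vec<u8>, // reused across rows, in the spirit of rows_lazy_into(&mut Vec<u8>)
}

impl<R: BufRead> LazyRows<R> {
    fn new(reader: Reader<R>) -> Self {
        Self { reader, buf: Vec::new() }
    }

    // Advances the cursor to the next <row> and collects its <v> values.
    // Errors end iteration in this sketch; gaps between cells are not filled.
    fn next_row(&mut self) -> Option<Vec<String>> {
        let mut row = Vec::new();
        let mut in_row = false;
        let mut in_value = false;
        loop {
            self.buf.clear();
            match self.reader.read_event_into(&mut self.buf).ok()? {
                Event::Start(e) if e.name().as_ref() == b"row" => in_row = true,
                Event::Start(e) if in_row && e.name().as_ref() == b"v" => in_value = true,
                Event::Text(e) if in_value => {
                    row.push(String::from_utf8_lossy(&e).into_owned());
                }
                Event::End(e) if e.name().as_ref() == b"v" => in_value = false,
                Event::End(e) if e.name().as_ref() == b"row" => return Some(row),
                Event::Eof => return None,
                _ => {}
            }
        }
    }
}
```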

anything else?

One issue I haven't been able to think through is merged cells. In the Spreadsheet now, each index points to a single cell, and if we wanted some part of the merged range to point back to the main cell, let's say the top-right-most one of the merged rectangle, then we would have to use smart pointers to support the self-reference. Those would also have to allocate, as the locations of the cells can change due to a rebalance; just doing pointer offsets and unsafe would lead to undefined behavior. I'm not sure how it's handled now, so I'd have to leave the expertise on how to solve this to you, for now.

I'm sure a lot of small things will come up, but the real test for the abstraction will be when I do another format, like ods, and see how well it holds up. The idea, though, is to build an abstraction over the implementation so that in memory they are all the same. Then file conversion can be done between formats, and ideally it would be as "simple" as implementing the save logic for a format to get free conversion from the other formats, even if they are read-only.

tafia added the perf label May 14, 2024