head / tail operations are slow on larger files #64

Closed
jamestexas opened this issue Aug 14, 2022 · 0 comments
jamestexas commented Aug 14, 2022

Howdy -

I wanted to preface this with: If I missed a contributor guideline or anything, please let me know. I did check other issues and did not see one relevant to this.

I am somewhat new to using rich-cli (but am familiar with rich) and recently attempted to parse a somewhat large CSV file (~119 MB, 483k lines).
I did not expect the whole CSV to load quickly, but I was surprised at how long --head and --tail took. Obviously they won't behave like GNU tail / head, but I took a stab at a minimal / naive change and was able to get it much faster. It's around this here;
if you want, I am happy to open a PR. I'll also just put a code block of what I did. I took the somewhat naive approach to file parsing (rather than parsing the buffer stream per line, which would be more efficient for tail) to avoid making a huge change.

  • head uses the existing generator to pull out only the first x rows, filtering out None values. Since the list gets iterated roughly twice, the second iteration (the one that adds indexes) is also much faster, because the list is now short.

  • tail uses the collections.deque example recipe, which still goes through the whole file but does not store the whole file in memory.


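    # Needs `from collections import deque` at the top of the module.
    rows = iter(reader)
    if has_header:
        header = next(rows)
        for column in header:
            table.add_column(column)

    if head is not None:
        # Pull at most `head` rows. next(rows, None) yields None once the
        # reader is exhausted, and filter(None, ...) drops those (along with
        # any other falsy rows, such as empty lists from blank lines).
        table_rows = list(
            filter(
                None,
                (next(rows, None) for _ in range(head)),
            )
        )

    elif tail is not None:
        # A deque bounded to `tail` items still consumes every row, but only
        # the last `tail` rows are kept in memory.
        table_rows = deque(rows, tail)

    else:
        table_rows = list(rows)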
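For anyone who wants to try the idea outside of rich-cli, here is a minimal standalone sketch of the same head / tail selection. The function name select_rows and the in-memory sample data are made up for illustration and are not part of rich-cli. It also swaps the filter / next combination for itertools.islice, which stops after the requested number of rows but, unlike filter(None, ...), does not drop blank rows.

```python
import csv
import io
from collections import deque
from itertools import islice


def select_rows(reader, head=None, tail=None):
    """Return the first `head` or last `tail` rows from an iterable of rows.

    Hypothetical helper for illustration only; not a rich-cli function.
    """
    rows = iter(reader)
    if head is not None:
        # islice stops pulling from the reader after `head` rows.
        return list(islice(rows, head))
    if tail is not None:
        # A bounded deque still reads every row but keeps only the last `tail`.
        return list(deque(rows, maxlen=tail))
    return list(rows)


# Tiny in-memory CSV standing in for the large file.
sample = io.StringIO("\n".join(f"{i},{i * i}" for i in range(10)))
first_three = select_rows(csv.reader(sample), head=3)

sample.seek(0)
last_two = select_rows(csv.reader(sample), tail=2)
```

The head path never touches the rest of the file, which is where most of the saving comes from. The tail path saves memory rather than reads, since every row still passes through the CSV parser.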


These are naive benchmarks, but here is a comparison of the two (where the rich command is the installed CLI, and python3 ./src/rich_cli has my changes):

Head

└> time python3 ./src/rich_cli --head 500 large_csv.csv &> /dev/null                                           [👾 3.10.5]➜
python3 ./src/rich_cli --head 500 large_csv.csv &> /dev/null  0.83s user 0.47s system 94% cpu 1.369 total

└> time rich --head 500 large_csv.csv &> /dev/null                                                             [👾 3.10.5]➜
rich --head 500 large_csv.csv &> /dev/null  2.81s user 0.60s system 99% cpu 3.443 total

Tail

└> time rich --tail 500 large_csv.csv &> /dev/null                                                             [👾 3.10.5]➜
rich --tail 500 large_csv.csv &> /dev/null  2.95s user 0.63s system 99% cpu 3.604 total

└> time python3 ./src/rich_cli --tail 500 large_csv.csv &> /dev/null                                           [👾 3.10.5]➜
python3 ./src/rich_cli --tail 500 large_csv.csv &> /dev/null  1.93s user 0.53s system 96% cpu 2.545 total
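The tail timing is still dominated by reading and parsing the whole file. One possible reading of the more efficient approach mentioned earlier is to read blocks backwards from the end of the file and only hand the last few lines to the CSV parser. The sketch below is a rough illustration of that, not something I benchmarked or what rich-cli does; tail_lines and block_size are made-up names. It assumes UTF-8 text and that no quoted field contains an embedded newline (such rows would be split incorrectly), and the header row would have to be read separately from the start of the file.

```python
import csv
import os
import tempfile


def tail_lines(path, count, block_size=64 * 1024):
    """Return the last `count` lines of `path`, reading blocks from the end.

    Hypothetical helper; assumes UTF-8 and no newlines inside quoted fields.
    """
    if count <= 0:
        return []
    with open(path, "rb") as handle:
        handle.seek(0, os.SEEK_END)
        position = handle.tell()
        data = b""
        # Read backwards until more than `count` newlines are buffered
        # (so the first kept line is complete) or the start is reached.
        while position > 0 and data.count(b"\n") <= count:
            step = min(block_size, position)
            position -= step
            handle.seek(position)
            data = handle.read(step) + data
    lines = data.splitlines()
    return [line.decode("utf-8") for line in lines[-count:]]


# Usage: write a small CSV, then parse only its last two lines.
with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "sample.csv")
    with open(path, "w", newline="") as handle:
        csv.writer(handle).writerows([i, i * i] for i in range(1000))
    last_rows = list(csv.reader(tail_lines(path, 2, block_size=16)))
```

The work here scales with the size of the requested tail rather than the size of the file, at the cost of the embedded-newline caveat above, which is why the deque version is the safer minimal change.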

Anyway, let me know if you want me to do anything here!

jamestexas closed this as not planned on May 15, 2024