pl.Config.set_tbl_formatting("UTF8_FULL_CONDENSED") as default for printing Dataframes #5942

Closed
arturdaraujo opened this issue Dec 29, 2022 · 21 comments · Fixed by #5967
Labels
enhancement New feature or an improvement of an existing feature

Comments

@arturdaraujo

arturdaraujo commented Dec 29, 2022

Problem description

The default DataFrame print format should be condensed. The fully verbose format, with a separator line between every row, does not make sense as the default.

Here is a suggested implementation that uses only standard-library modules, followed by the time each platform check takes:

import os
import sys

import polars as pl

if sys.platform.startswith("linux"):  # could be "linux", "linux2", "linux3", ...
    pl.Config.set_tbl_formatting("ASCII_BORDERS_ONLY_CONDENSED")
elif sys.platform == "darwin":  # macOS
    pl.Config.set_tbl_formatting("UTF8_FULL_CONDENSED")
elif os.name == "nt":  # Windows (either 32-bit or 64-bit)
    pl.Config.set_tbl_formatting("UTF8_FULL_CONDENSED")

%timeit os.name
32 ns ± 0.285 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)

%timeit platform.system()
124 ns ± 1.16 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)

%timeit sys.platform
33.1 ns ± 0.542 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
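
The `%timeit` lines above need IPython. A minimal sketch of the same comparison using only the standard-library `timeit` module might look like this. The helper name `time_ns` is made up for illustration, and the lambda wrappers add a little call overhead, so the absolute numbers will be higher than the ones quoted above and will vary by machine.

```python
import os
import platform
import sys
import timeit


def time_ns(func, number=200_000):
    """Return the average time per call of `func`, in nanoseconds."""
    total_seconds = timeit.timeit(func, number=number)
    return total_seconds / number * 1e9


# Each entry measures one way of asking "which OS am I running on?".
timings = {
    "os.name": time_ns(lambda: os.name),
    "sys.platform": time_ns(lambda: sys.platform),
    "platform.system()": time_ns(platform.system),
}

for label, ns in timings.items():
    print(f"{label}: {ns:.1f} ns per call")
```

Whichever check is used, it would only run once when the configuration is set, so differences at this scale are unlikely to matter in practice.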

I also opened a separate issue, #5941, about differentiating between operating systems.

The current default prints like this:

import polars as pl
from datetime import datetime

df_stock = pl.DataFrame(
    {
        "date_time": pl.date_range(
            low=datetime(2000, 1, 1, 0, 0),
            high=datetime(2023, 1, 1, 0, 0),
            interval="1d",
        ).shuffle(seed=0)
    }
)

shape: (8402, 1)
┌─────────────────────┐
│ date_time           │
│ ---                 │
│ datetime[μs]        │
╞═════════════════════╡
│ 2008-05-13 00:00:00 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2000-11-24 00:00:00 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2005-08-09 00:00:00 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2009-09-04 00:00:00 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ ...                 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2010-06-30 00:00:00 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2010-08-18 00:00:00 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2010-02-06 00:00:00 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2010-04-16 00:00:00 │
└─────────────────────┘

But the default should be this:

pl.Config.set_tbl_formatting("UTF8_FULL_CONDENSED")

print(df_stock)
shape: (8402, 1)
┌─────────────────────┐
│ date_time           │
│ ---                 │
│ datetime[μs]        │
╞═════════════════════╡
│ 2008-05-13 00:00:00 │
│ 2000-11-24 00:00:00 │
│ 2005-08-09 00:00:00 │
│ 2009-09-04 00:00:00 │
│ ...                 │
│ 2010-06-30 00:00:00 │
│ 2010-08-18 00:00:00 │
│ 2010-02-06 00:00:00 │
│ 2010-04-16 00:00:00 │
└─────────────────────┘

@arturdaraujo arturdaraujo added the enhancement New feature or an improvement of an existing feature label Dec 29, 2022
@arturdaraujo arturdaraujo changed the title from 'Make pl.Config.set_tbl_formatting("UTF8_FULL_CONDENSED") the default for printing Dataframes' to 'pl.Config.set_tbl_formatting("UTF8_FULL_CONDENSED") as default for printing Dataframes' Dec 29, 2022
@arturdaraujo
Author

Thanks for trying, @dandxy89. I don't know what went wrong with your PR.

@alexander-beedie
Collaborator

alexander-beedie commented Dec 30, 2022

This seems like it could be an issue with your specific distribution/terminal/font - I don't have any problems with the UTF8 formats on Linux.

Can you provide a screenshot and some distro/terminal/font details for your Linux machine, so we can at least see what's happening for you? (We're not going to want to change the default to ASCII there, given that there are no other reports of problems).

@arturdaraujo
Author

arturdaraujo commented Dec 30, 2022

If you change the default to UTF8_FULL_CONDENSED, the problems still appear on bigger DataFrames (more columns), but they are usually only small misalignments. UTF8_FULL_CONDENSED is certainly good enough, but ideally the appropriate formatting would be chosen by checking the system. UTF8_FULL as the default is just a mess on Linux.

The current default:
image

The UTF8_FULL_CONDENSED formatting:
image

---Version info---
Polars: 0.15.7
Index type: UInt32
Platform: Linux-5.15.79.1-microsoft-standard-WSL2-x86_64-with-glibc2.10
Python: 3.8.15 | packaged by conda-forge | (default, Nov 22 2022, 08:49:35) 
[GCC 10.4.0]
---Optional dependencies---
pyarrow: 10.0.1
pandas: 1.5.2
numpy: 1.22.4
fsspec: 2022.11.0
connectorx: <not installed>
xlsx2csv: <not installed>
matplotlib: 3.6.2

@arturdaraujo
Author

I just tested it on my Linux Mint 21.1 machine and it works fine with UTF8 formatting; the problem only occurs on WSL (Windows Subsystem for Linux). Still, I believe that the condensed UTF8 formatting should be the default.

@stinodego
Member

Personally, I also prefer the look of the condensed formatting. I am not familiar with the intricacies of the various operating systems, but to me it just looks better.

@dandxy89
Contributor

Will update my branch this evening once I'm back from the bouldering gym.

Why did I close it? I realised that the Python test suite wasn't running locally.

@arturdaraujo
Author

arturdaraujo commented Dec 30, 2022

@stinodego it is not only the looks: the output has around 40% fewer lines, which makes a Polars DataFrame much easier to work with. Pandas, which is the "default" DataFrame library, is even more minimal.
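
As a rough check of the line-count claim, the layout of the two example tables earlier in this issue can be turned into a small formula. This is only a sketch under stated assumptions: every row occupies a single text line, the header block is the six lines seen in the examples, and the function names are hypothetical.

```python
# Layout assumed from the example output in this issue (one text line per row).
HEADER_LINES = 6  # shape, top border, column names, "---", dtypes, header separator
FOOTER_LINES = 1  # bottom border


def full_lines(rows: int) -> int:
    """Printed lines when a separator is drawn between every pair of rows."""
    separators = max(rows - 1, 0)
    return HEADER_LINES + rows + separators + FOOTER_LINES


def condensed_lines(rows: int) -> int:
    """Printed lines when the separators between rows are dropped."""
    return HEADER_LINES + rows + FOOTER_LINES


def reduction(rows: int) -> float:
    """Fraction of printed lines saved by the condensed layout."""
    return 1 - condensed_lines(rows) / full_lines(rows)


for rows in (9, 17, 100):
    print(rows, full_lines(rows), condensed_lines(rows), f"{reduction(rows):.0%}")
```

With the 9 displayed rows of the example (8 values plus the `...` row) this gives 24 lines against 16, a saving of about 33%. Under these assumptions the saving reaches 40% at 17 displayed rows and approaches 50% as the number of rows grows.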

@stinodego
Member

It will probably be a bit of a chore to update all the docstring examples, although I'm sure that's doable with some regex magic. Should probably get sign-off from @ritchie46 before going ahead with that.

@ritchie46
Member

It will probably be a bit of a chore to update all the docstring examples, although I'm sure that's doable with some regex magic. Should probably get sign-off from @ritchie46 before going ahead with that.

Do you think we can automate such a change? This would be beneficial many times down the road probably.

I have no strong opinions on the format, so a tight format is fine by me. My only strong opinion is about how we change the docstrings.

@alexander-beedie
Collaborator

alexander-beedie commented Dec 30, 2022

The line-contraction is definitely a font/kerning issue, rather than a polars issue (@arturdaraujo: you could try switching your font under WSL to something like JetBrains Mono, or one of the Menlo Powerline fonts?)

As for a change in default format, I'm definitely biased towards the UTF8_FULL_CONDENSED, not least because I introduced it to comfy-table for use by polars in the first place, heh... It's a big win space-wise, and I don't think readability suffers.

@ritchie46 / @stinodego: if you want to go ahead, I'll volunteer for updating this and the associated mass docstrings update; how about in conjunction with #5513, which would also require a global docstring update? (It's a different issue, but could kill two birds with one stone).

[image: regular_expressions]

@stinodego
Member

stinodego commented Dec 30, 2022

Do you think we can automate such a change? This would be beneficial many times down the road probably.

I doubt it would be worth it. The docstrings right now are all flat text, so you'd have to find some way to auto-generate the example output. I think regex magic is probably a much easier way of going about this.

if you want to go ahead, I'll volunteer for updating this

Let's do it! Regarding #5513, I'll leave a comment in that issue. I'm not as certain about that one.

@cmdlineluser
Contributor

cmdlineluser commented Dec 30, 2022

If it's for the Python files, the ast module can be used to isolate the docstrings, which helps when performing modifications.

import ast
import re

def remove_row_sep_from_docstring(
    text,
    col_sep = "┼",
    dash = "╌",
    df_start = "┌",
    df_end = "┘",
    row_start = "├",
    row_end = "┤",
    node_types = (ast.ClassDef, ast.FunctionDef)
):

    examples_re = r"(?s)\s*Examples\n\s*-+.+"
    dataframe_re = (
        rf"(?s)\s*shape: [(]\d+, \d+[)]\s*"
        rf"{df_start}.+?{df_end}"
        rf"(?=]?\n)"
    )
    row_sep_re = (
        rf"\n\s*{row_start}{dash}+"
        rf"(?:{col_sep}{dash}+)*{row_end}"
        rf"(?=\n)"
    )
    
    tree = ast.parse(text)

    for node in ast.walk(tree):
        if isinstance(node, node_types):
            doc = ast.get_docstring(node, clean=False) or ""
            has_examples = re.search(examples_re, doc)
            if has_examples:
                old_examples = has_examples.group()
                new_examples = old_examples
                for df in re.findall(dataframe_re, old_examples):
                    new_examples = new_examples.replace(
                        df,
                        re.sub(row_sep_re, "", df)
                    )
                doc = doc.replace(old_examples, new_examples)
                # https://github.com/python/cpython/blob/3.11/Lib/ast.py#L294
                node.body[0].value.value = doc
        
    return ast.unparse(tree)

The result could then be passed to black to fix any lost formatting.
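
To see what the row-separator pattern from the function above does in isolation, here is a small sketch that applies it to a hand-written table string. The sample table is typed by hand to resemble the UTF8_FULL layout and is not real polars output; the pattern is the same as `row_sep_re` with the default characters filled in.

```python
import re

# Same pattern as row_sep_re above, with the default characters substituted.
row_sep_re = r"\n\s*├╌+(?:┼╌+)*┤(?=\n)"

# Hand-written sample in the UTF8_FULL style (an approximation, not real output).
full = "\n".join(
    [
        "shape: (2, 2)",
        "┌─────┬─────┐",
        "│ a   ┆ b   │",
        "│ --- ┆ --- │",
        "│ i64 ┆ i64 │",
        "╞═════╪═════╡",
        "│ 1   ┆ 3   │",
        "├╌╌╌╌╌┼╌╌╌╌╌┤",
        "│ 2   ┆ 4   │",
        "└─────┴─────┘",
    ]
)

# Each match includes the preceding newline, so the whole separator line is removed.
condensed = re.sub(row_sep_re, "", full)
print(condensed)
```

Only the dashed separator between data rows is removed; the double-line header separator uses different characters and is left in place.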

@ritchie46
Member

Does unparse do what I think it does?! 👀 That would be amazing.

@alexander-beedie
Collaborator

alexander-beedie commented Dec 30, 2022

Just submitted a corresponding ASCII_FULL_CONDENSED to comfy-table; bit of an oversight on my part not doing that when I added UTF8_FULL_CONDENSED ... (Don't think we'll need to wait for it though; should be possible to declare the preset locally, then swap it out later).

@zundertj
Collaborator

zundertj commented Dec 30, 2022

FYI, I have looked into alternatives for auto-updating the docstring output before, and could only find https://github.com/max-sixty/pytest-accept, but it does not seem maintained, and it relies on pytest, which does not pick up our custom output checker (which we need for IGNORE_RESULT). We have been through this before, and at some point turned it off, only to find later that we had very quickly broken some examples. So yes, it is a pain, but having working examples in docstrings seems key to me.

The suggestion by @cmdlineluser to use the ast module is a good one. Note that Python's built-in doctest does not use the ast module, although a popular alternative, xdoctest (insofar as doctests are popular), does. Not sure if that has any consequences. I have been tempted before to wrap what we have in our script as a separate package; maybe it becomes worth it if we have auto-update as well. Although I realize that is far off from the very targeted use case here.

Not sure how black should be part of this, you are only changing the outputs, right?

@cmdlineluser
Contributor

ast.parse doesn't retain the original formatting:

print(ast.unparse(ast.parse("""
def func(
    foo=1,
    bar=2
) -> DataFrame:
    ...
""")))

Output:

def func(foo=1, bar=2) -> DataFrame:
    ...

Also, I just realized it doesn't retain comments either - which makes it a non-solution - d'oh.
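
The loss of comments can be checked with the standard library alone. The snippet below is a small sketch (it needs Python 3.9 or later for `ast.unparse`; the sample source is made up for illustration) that round-trips a function through `ast.parse` and `ast.unparse`.

```python
import ast

# Made-up sample source with a multi-line signature and an inline comment.
source = '''
def func(
    foo=1,  # default for foo
    bar=2
):
    return foo + bar
'''

# Comments and layout are not stored in the AST, so they cannot be
# reproduced when the tree is turned back into source code.
round_tripped = ast.unparse(ast.parse(source))
print(round_tripped)
```

The printed result has the signature collapsed onto one line and the comment is gone, which is why a tool that preserves the original source text is needed for this kind of rewrite.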

@cmdlineluser
Contributor

cmdlineluser commented Dec 31, 2022

Looks like Parso or LibCST are the recommended tools that retain formatting/comments.

import parso
import re

def remove_row_sep_from_docstring(
    text,
    row_start = "├",
    row_end = "┤"
):
    examples_re = r"(?s)[^\S\n]*Examples\n\s*-+.+"

    tree = parso.parse(text)
    nodes = []
    for cls in tree.iter_classdefs():
        nodes.append(cls)
        nodes.extend(cls.iter_funcdefs())
    nodes.extend(tree.iter_funcdefs())

    for node in nodes:
        docnode = node.get_doc_node()
        if not docnode: 
            continue
        docstring = docnode.value
        has_examples = re.search(examples_re, docstring)
        if has_examples:
            old_examples = has_examples.group()
            new_examples = []
            for line in old_examples.splitlines():
                stripped = line.strip().strip("#").strip()
                if stripped.startswith(row_start) and stripped.endswith(row_end):
                    continue
                new_examples.append(line)
            docnode.value = docstring.replace(
                old_examples,
                "\n".join(new_examples)
            )

    return tree.get_code()

@alexander-beedie
Collaborator

alexander-beedie commented Dec 31, 2022

It's a New Year's gift! comfy-table made a fresh release just for the ASCII_FULL_CONDENSED preset I submitted last night. Will pull the latest code, brew a large cup of tea, and make the update; with a little luck it should be quite painless...

@alexander-beedie
Collaborator

alexander-beedie commented Dec 31, 2022

Have finished updating everything now except for the rather involved docstring of pl.align_frames. The tea is finished, and this one probably calls for a coffee - but then it's all done ;)

@ritchie46
Member

Have finished updating everything now except for the rather involved docstring of pl.align_frames. The tea is finished, and this one probably calls for a coffee - but then it's all done ;)

Wow, that's fast!

@alexander-beedie
Collaborator

Done - unit/doctests all passed locally, so let's see how it does through CI ... ;)

7 participants