pl.Config.set_tbl_formatting("UTF8_FULL_CONDENSED") as default for printing Dataframes #5942

arturdaraujo · 2022-12-29T15:25:41Z

Problem description

The dataframe printing format should be densed. It does not make sense to make the full verbose format.

Here is an implementation suggestion using only built-in packages and their respective processing times below:

if sys.platform.startswith("linux"):  # could be "linux", "linux2", "linux3", ...
    pl.Config.set_tbl_formatting("ASCII_BORDERS_ONLY_CONDENSED")
elif sys.platform == "darwin":
     pl.Config.set_tbl_formatting("UTF8_FULL_CONDENSED")
elif os.name == "not":
    pl.Config.set_tbl_formatting("UTF8_FULL_CONDENSED")
    # Windows, Cygwin, etc. (either 32-bit or 64-bit)

%timeit os.name
32 ns ± 0.285 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)

%timeit platform.system()
124 ns ± 1.16 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)

%timeit sys.platform
33.1 ns ± 0.542 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)

I also created an issue for differentiating systems OS on #5941

The current default prints like this:

import polars as pl
from datetime import datetime

df_stock = pl.DataFrame(
    {
        "date_time": pl.date_range(
            low=datetime(2000, 1, 1, 0, 0),
            high=datetime(2023, 1, 1, 0, 0),
            interval="1d",
        ).shuffle(seed=0)
    }
)

shape: (8402, 1)
┌─────────────────────┐
│ date_time           │
│ ---                 │
│ datetime[μs]        │
╞═════════════════════╡
│ 2008-05-13 00:00:00 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2000-11-24 00:00:00 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2005-08-09 00:00:00 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2009-09-04 00:00:00 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ ...                 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2010-06-30 00:00:00 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2010-08-18 00:00:00 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2010-02-06 00:00:00 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2010-04-16 00:00:00 │
└─────────────────────┘

But the default should be this:

pl.Config.set_tbl_formatting("UTF8_FULL_CONDENSED")

print(df_stock)
shape: (8402, 1)
┌─────────────────────┐
│ date_time           │
│ ---                 │
│ datetime[μs]        │
╞═════════════════════╡
│ 2008-05-13 00:00:00 │
│ 2000-11-24 00:00:00 │
│ 2005-08-09 00:00:00 │
│ 2009-09-04 00:00:00 │
│ ...                 │
│ 2010-06-30 00:00:00 │
│ 2010-08-18 00:00:00 │
│ 2010-02-06 00:00:00 │
│ 2010-04-16 00:00:00 │
└─────────────────────┘

The text was updated successfully, but these errors were encountered:

arturdaraujo · 2022-12-30T05:39:33Z

Thanks for the try @dandxy89. I don't know what went wrong with your PR

alexander-beedie · 2022-12-30T11:58:23Z

This seems like it could be an issue with your specific distribution/terminal/font - I don't have any problems with the UTF8 formats on Linux.

Can you provide a screenshot and some distro/terminal/font details for your Linux machine, so we can at least see what's happening for you? (We're not going to want to change the default to ASCII there, given that there are no other reports of problems).

arturdaraujo · 2022-12-30T14:26:18Z

If you change the default to UTF8_FULL_CONDENSED the problems will appear on bigger dataframes (more columns). However, it is often only small mismatches. UTF8_FULL_CONDENSED is certainly good enough but the ideal would be to check the system for appropriate formatting. UTF8_FULL as default is just a mess on Linux.

The current default:

The UTF8_FULL_CONDENSED formatting:

---Version info---
Polars: 0.15.7
Index type: UInt32
Platform: Linux-5.15.79.1-microsoft-standard-WSL2-x86_64-with-glibc2.10
Python: 3.8.15 | packaged by conda-forge | (default, Nov 22 2022, 08:49:35) 
[GCC 10.4.0]
---Optional dependencies---
pyarrow: 10.0.1
pandas: 1.5.2
numpy: 1.22.4
fsspec: 2022.11.0
connectorx: <not installed>
xlsx2csv: <not installed>
matplotlib: 3.6.2

arturdaraujo · 2022-12-30T14:46:13Z

I just tested it on my Linux Mint 21.1 and it's working fine with utf8 formatting, the problem only occurs on WSL (Windows Subsystem for Linux). Still, I believe that the condensed utf8 formatting should be the default.

stinodego · 2022-12-30T14:52:51Z

Personally, I also prefer the look of the condensed formatting. I am not familiar with the intricacies of the various operating systems, but to me it just looks better.

dandxy89 · 2022-12-30T14:58:36Z

Will update my branch this evening once I'm back from the bouldering gym.

Why did I close it? I realised that the Python test suite wasn't running locally.

arturdaraujo · 2022-12-30T15:02:44Z

@stinodego is not only the looks, it is around 40% less lines and it makes it Polars DataFrame much easier to work with. Pandas, which is the "default" library for DataFrame, is even more minimal.

stinodego · 2022-12-30T15:05:03Z

It will probably be a bit of a chore to update all the docstring examples, although I'm sure that's doable with some regex magic. Should probably get sign-off from @ritchie46 before going ahead with that.

ritchie46 · 2022-12-30T15:31:29Z

It will probably be a bit of a chore to update all the docstring examples, although I'm sure that's doable with some regex magic. Should probably get sign-off from @ritchie46 before going ahead with that.

Do you think we can automate such a change? This would be beneficial many times down the road probably.

I have no strong opinions on the format, so a tight format is fine by me. Only on changing the docstrings.

alexander-beedie · 2022-12-30T15:56:39Z

The line-contraction is definitely a font/kerning issue, rather than a polars issue (@arturdaraujo: you could try switching your font under WSL to something like JetBrains Mono, or one of the Menlo Powerline fonts?)

As for a change in default format, I'm definitely biased towards the UTF8_FULL_CONDENSED, not least because I introduced it to comfy-table for use by polars in the first place, heh... It's a big win space-wise, and I don't think readability suffers.

@ritchie46 / @stinodego: if you want to go ahead, I'll volunteer for updating this and the associated mass docstrings update; how about in conjunction with #5513, which would also require a global docstring update? (It's a different issue, but could kill two birds with one stone).

stinodego · 2022-12-30T16:26:08Z

Do you think we can automate such a change? This would be beneficial many times down the road probably.

I doubt it would be worth it. The docstrings right now are all flat text, you'd have to find some way to auto-generate the example output. I think regex magic is probably a way easier method of going about this.

if you want to go ahead, I'll volunteer for updating this

Let's do it! Regarding #5513, I'll leave a comment in that issue. I'm not as certain about that one.

cmdlineluser · 2022-12-30T18:03:04Z

If it's for the python files - the ast module can be used to isolate the docstrings which can be helpful for performing modifications.

import ast
import re

def remove_row_sep_from_docstring(
    text,
    col_sep = "┼",
    dash = "╌",
    df_start = "┌",
    df_end = "┘",
    row_start = "├",
    row_end = "┤",
    node_types = (ast.ClassDef, ast.FunctionDef)
):

    examples_re = r"(?s)\s*Examples\n\s*-+.+"
    dataframe_re = (
        rf"(?s)\s*shape: [(]\d+, \d+[)]\s*"
        rf"{df_start}.+?{df_end}"
        rf"(?=]?\n)"
    )
    row_sep_re = (
        rf"\n\s*{row_start}{dash}+"
        rf"(?:{col_sep}{dash}+)*{row_end}"
        rf"(?=\n)"
    )
    
    tree = ast.parse(text)

    for node in ast.walk(tree):
        if isinstance(node, node_types):
            doc = ast.get_docstring(node, clean=False) or ""
            has_examples = re.search(examples_re, doc)
            if has_examples:
                old_examples = has_examples.group()
                new_examples = old_examples
                for df in re.findall(dataframe_re, old_examples):
                    new_examples = new_examples.replace(
                        df,
                        re.sub(row_sep_re, "", df)
                    )
                doc = doc.replace(old_examples, new_examples)
                # https://github.com/python/cpython/blob/3.11/Lib/ast.py#L294
                node.body[0].value.value = doc
        
    return ast.unparse(tree)

The result could then be passed to black to fix any lost formatting.

ritchie46 · 2022-12-30T18:34:17Z

Does unparse do what I think it does?! 👀 That would be amazing.

alexander-beedie · 2022-12-30T18:50:15Z

Just submitted a corresponding ASCII_FULL_CONDENSED to comfy-table; bit of an oversight on my part not doing that when I added UTF8_FULL_CONDENSED ... (Don't think we'll need to wait for it though; should be possible to declare the preset locally, then swap it out later).

zundertj · 2022-12-30T20:07:57Z

FYI, I have looked into alternatives before to auto-update the docstring output, and could only find https://github.com/max-sixty/pytest-accept, but does not seem maintained, and relies on pytest, which does not pick up our custom output checker (which we need for IGNORE_RESULT. We have been through this before, and at some point turned it off, only to find later that we had very quickly broken some examples. So yes it is a pain, but having working examples in docstrings seems key to me.

The suggestion by @cmdlineluser to use the ast module is a good one. Note that Python's built-in doctest does not use the ast module, although a popular alternative xdoctest (insofar doctests are popular) does. Not sure if that has any consequences. I have been tempted before to wrap what we have in our script as a separate package, maybe it becomes worth it if we have auto-update as well. Although I realize that is far off from the very targeted use case here.

Not sure how black should be part of this, you are only changing the outputs, right?

cmdlineluser · 2022-12-30T22:22:28Z

ast.parse doesn't retain the original formatting:

print(ast.unparse(ast.parse("""
def func(
    foo=1,
    bar=2
) -> DataFrame:
    ...
""")))

Output:

def func(foo=1, bar=2) -> DataFrame:
    ...

Also, I just realized it doesn't retain comments either - which makes it a non-solution - d'oh.

cmdlineluser · 2022-12-31T00:12:25Z

Looks like Parso or LibCST are the recommended tools that retain formatting/comments.

import parso
import re

def remove_row_sep_from_docstring(
    text,
    row_start = "├",
    row_end = "┤"
):
    examples_re = r"(?s)[^\S\n]*Examples\n\s*-+.+"

    tree = parso.parse(text)
    nodes = []
    for cls in tree.iter_classdefs():
        nodes.append(cls)
        nodes.extend(cls.iter_funcdefs())
    nodes.extend(tree.iter_funcdefs())

    for node in nodes:
        docnode = node.get_doc_node()
        if not docnode: 
            continue
        docstring = docnode.value
        has_examples = re.search(examples_re, docstring)
        if has_examples:
            old_examples = has_examples.group()
            new_examples = []
            for line in old_examples.splitlines():
                stripped = line.strip().strip("#").strip()
                if stripped.startswith(row_start) and stripped.endswith(row_end):
                    continue
                new_examples.append(line)
            docnode.value = docstring.replace(
                old_examples,
                "\n".join(new_examples)
            )

    return tree.get_code()

alexander-beedie · 2022-12-31T07:06:21Z

It's a New Year's gift! comfy-table made a fresh release just for the ASCII_FULL_CONDENSED preset I submitted last night. Will pull the latest code, brew a large cup of tea, and make the update; with a little luck it should be quite painless...

alexander-beedie · 2022-12-31T08:37:32Z

Have finished updating everything now except for the rather involved docstring of pl.align_frames. The tea is finished, and this one probably calls for a coffee - but then it's all done ;)

ritchie46 · 2022-12-31T08:53:15Z

Have finished updating everything now except for the rather involved docstring of pl.align_frames. The tea is finished, and this one probably calls for a coffee - but then it's all done ;)

Wow, that's fast!

alexander-beedie · 2022-12-31T09:26:23Z

Done - unit/doctests all passed locally, so let's see how it does through CI ... ;)

arturdaraujo added the enhancement New feature or an improvement of an existing feature label Dec 29, 2022

arturdaraujo changed the title ~~Make pl.Config.set_tbl_formatting("UTF8_FULL_CONDENSED") the default for printing Dataframes~~ pl.Config.set_tbl_formatting("UTF8_FULL_CONDENSED") as default for printing Dataframes Dec 29, 2022

dandxy89 mentioned this issue Dec 29, 2022

feat(rust, python): Set default table formatting to UTF8_FULL_CONDENSED #5943

Closed

arturdaraujo mentioned this issue Dec 30, 2022

Solution for checking system OS to accurately printing polars DataFrame format. #5941

Closed

2 tasks

alexander-beedie mentioned this issue Dec 31, 2022

feat(python,rust): updated default table format from "UTF8_FULL" to "UTF8_FULL_CONDENSED" #5967

Merged

ritchie46 closed this as completed in #5967 Dec 31, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

pl.Config.set_tbl_formatting("UTF8_FULL_CONDENSED") as default for printing Dataframes #5942

pl.Config.set_tbl_formatting("UTF8_FULL_CONDENSED") as default for printing Dataframes #5942

arturdaraujo commented Dec 29, 2022 •

edited

arturdaraujo commented Dec 30, 2022

alexander-beedie commented Dec 30, 2022 •

edited

arturdaraujo commented Dec 30, 2022 •

edited

arturdaraujo commented Dec 30, 2022

stinodego commented Dec 30, 2022

dandxy89 commented Dec 30, 2022

arturdaraujo commented Dec 30, 2022 •

edited

stinodego commented Dec 30, 2022

ritchie46 commented Dec 30, 2022

alexander-beedie commented Dec 30, 2022 •

edited

stinodego commented Dec 30, 2022 •

edited

cmdlineluser commented Dec 30, 2022 •

edited

ritchie46 commented Dec 30, 2022

alexander-beedie commented Dec 30, 2022 •

edited

zundertj commented Dec 30, 2022 •

edited

cmdlineluser commented Dec 30, 2022

cmdlineluser commented Dec 31, 2022 •

edited

alexander-beedie commented Dec 31, 2022 •

edited

alexander-beedie commented Dec 31, 2022 •

edited

ritchie46 commented Dec 31, 2022

alexander-beedie commented Dec 31, 2022

pl.Config.set_tbl_formatting("UTF8_FULL_CONDENSED") as default for printing Dataframes #5942

pl.Config.set_tbl_formatting("UTF8_FULL_CONDENSED") as default for printing Dataframes #5942

Comments

arturdaraujo commented Dec 29, 2022 • edited

Problem description

arturdaraujo commented Dec 30, 2022

alexander-beedie commented Dec 30, 2022 • edited

arturdaraujo commented Dec 30, 2022 • edited

arturdaraujo commented Dec 30, 2022

stinodego commented Dec 30, 2022

dandxy89 commented Dec 30, 2022

arturdaraujo commented Dec 30, 2022 • edited

stinodego commented Dec 30, 2022

ritchie46 commented Dec 30, 2022

alexander-beedie commented Dec 30, 2022 • edited

stinodego commented Dec 30, 2022 • edited

cmdlineluser commented Dec 30, 2022 • edited

ritchie46 commented Dec 30, 2022

alexander-beedie commented Dec 30, 2022 • edited

zundertj commented Dec 30, 2022 • edited

cmdlineluser commented Dec 30, 2022

cmdlineluser commented Dec 31, 2022 • edited

alexander-beedie commented Dec 31, 2022 • edited

alexander-beedie commented Dec 31, 2022 • edited

ritchie46 commented Dec 31, 2022

alexander-beedie commented Dec 31, 2022

arturdaraujo commented Dec 29, 2022 •

edited

alexander-beedie commented Dec 30, 2022 •

edited

arturdaraujo commented Dec 30, 2022 •

edited

arturdaraujo commented Dec 30, 2022 •

edited

alexander-beedie commented Dec 30, 2022 •

edited

stinodego commented Dec 30, 2022 •

edited

cmdlineluser commented Dec 30, 2022 •

edited

alexander-beedie commented Dec 30, 2022 •

edited

zundertj commented Dec 30, 2022 •

edited

cmdlineluser commented Dec 31, 2022 •

edited

alexander-beedie commented Dec 31, 2022 •

edited

alexander-beedie commented Dec 31, 2022 •

edited