feat(python): Add new alpha, alphanumeric and digit selectors #16310

Merged (2 commits) on May 19, 2024

@alexander-beedie (Collaborator) commented May 18, 2024

New selectors, making it even easier to classify column names by character type:

  • cs.alphanumeric(): only names composed of letters and digits.
  • cs.alpha(): only names composed of letters.
  • cs.digit(): only names composed of digits.
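
None of the examples in this description exercise cs.digit(), so here is a rough pure-Python sketch of the name test it implies. Two things are assumptions rather than facts taken from this PR: that digit-only names are matched with a `\d+`-style pattern, and that an ASCII restriction would apply to digits the same way it does to letters. `is_digit_name` is a hypothetical helper, not part of Polars.

```python
import re


def is_digit_name(name: str, ascii_only: bool = False) -> bool:
    """Hypothetical helper: True if `name` consists solely of decimal digits."""
    # In Python's `re`, \d matches any Unicode decimal digit for str patterns;
    # the re.ASCII flag narrows it to [0-9].
    flags = re.ASCII if ascii_only else 0
    return re.fullmatch(r"\d+", name, flags) is not None


# "\u0662\u0660\u0662\u0664" is "2024" written with Arabic-Indic digits.
names = ["2024", "no1", "\u0662\u0660\u0662\u0664", "hmm"]

digit_names = [n for n in names if is_digit_name(n)]
ascii_digit_names = [n for n in names if is_digit_name(n, ascii_only=True)]
```

Under those assumptions, the Unicode variant keeps both all-digit names, while the ASCII-only variant keeps just "2024".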

A nice bonus of having dedicated selectors is that non-ASCII letters are handled automatically, e.g. accented characters in words such as "tweeëntwintig", kanji such as "東京", hangul, etc.

(I also suspect that, even amongst users familiar with regular expressions, a reasonable number wouldn't immediately know the equivalent cs.matches pattern ^[\p{Alphabetic}]+$, which is also a little more cryptic in a codebase 🤔)

There is an optional flag ascii_only if you want to limit the definition of "alphabetic" to ASCII, but having Unicode letters recognised by default is a good out-of-the-box experience for more languages.
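
To make the Unicode versus ASCII distinction concrete, here is a minimal pure-Python sketch of the name classification, applied to the column names used in the examples. It is an approximation only: Python's `str.isalpha()` and `str.isalnum()` stand in for the regex character classes and can differ from `\p{Alphabetic}` on edge cases, and the two helper functions are hypothetical names, not the Polars implementation.

```python
def is_alpha_name(name: str, ascii_only: bool = False) -> bool:
    """Hypothetical helper: True if every character in `name` is a letter."""
    if ascii_only and not name.isascii():
        return False
    return name.isalpha()


def is_alphanumeric_name(name: str, ascii_only: bool = False) -> bool:
    """Hypothetical helper: True if every character is a letter or a digit."""
    if ascii_only and not name.isascii():
        return False
    return name.isalnum()


# Column names from the example frame; "caf\u00e9" is "café" spelled with a
# precomposed e-acute so that the result does not depend on normalisation.
columns = ["no1", "caf\u00e9", "t/f", "hmm", "都市"]

alpha = [c for c in columns if is_alpha_name(c)]
ascii_alpha = [c for c in columns if is_alpha_name(c, ascii_only=True)]
alnum = [c for c in columns if is_alphanumeric_name(c)]
ascii_alnum = [c for c in columns if is_alphanumeric_name(c, ascii_only=True)]
```

The four lists line up with the columns returned in the examples: the Unicode variants keep "café" and "都市", the ASCII-only variants drop them, and "t/f" is never selected because of the slash.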

Examples

import polars as pl
import polars.selectors as cs

df = pl.DataFrame({
    "no1":  [100, 200, 300],
    "café": ["espresso", "latte", "mocha"],
    "t/f":  [True, False, None],
    "hmm":  ["aaa", "bbb", "ccc"],
    "都市":  ["東京", "大阪", "京都"],
})

Select columns with alphabetic names; note that accented characters and kanji are recognised as valid:

df.select(cs.alpha())
# shape: (3, 3)
# ┌──────────┬─────┬──────┐
# │ café     ┆ hmm ┆ 都市 │
# │ ---      ┆ --- ┆ ---  │
# │ str      ┆ str ┆ str  │
# ╞══════════╪═════╪══════╡
# │ espresso ┆ aaa ┆ 東京 │
# │ latte    ┆ bbb ┆ 大阪 │
# │ mocha    ┆ ccc ┆ 京都 │
# └──────────┴─────┴──────┘

Constrain the definition of "alphabetic" to ASCII characters:

df.select(cs.alpha(ascii_only=True))
# shape: (3, 1)
# ┌─────┐
# │ hmm │
# │ --- │
# │ str │
# ╞═════╡
# │ aaa │
# │ bbb │
# │ ccc │
# └─────┘

Select columns with non-ASCII alphabetic names :)

df.select(cs.alpha() - cs.alpha(ascii_only=True))
# shape: (3, 2)
# ┌──────────┬──────┐
# │ café     ┆ 都市 │
# │ ---      ┆ ---  │
# │ str      ┆ str  │
# ╞══════════╪══════╡
# │ espresso ┆ 東京 │
# │ latte    ┆ 大阪 │
# │ mocha    ┆ 京都 │
# └──────────┴──────┘

Select all columns except for those with alphabetic names:

df.select(~cs.alpha())
# shape: (3, 2)
# ┌─────┬───────┐
# │ no1 ┆ t/f   │
# │ --- ┆ ---   │
# │ i64 ┆ bool  │
# ╞═════╪═══════╡
# │ 100 ┆ true  │
# │ 200 ┆ false │
# │ 300 ┆ null  │
# └─────┴───────┘
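
The difference and negation examples above behave like set operations over the frame's column names. The following is a loose sketch of that logic in plain Python, assuming `str.isalpha()` as an approximate stand-in for the alphabetic test; it models only the name filtering and is not how Polars evaluates selectors.

```python
# Column names from the example frame, in frame order.
columns = ["no1", "caf\u00e9", "t/f", "hmm", "都市"]

# Names an alphabetic test would accept, with and without the ASCII limit.
alpha = {c for c in columns if c.isalpha()}
ascii_alpha = {c for c in alpha if c.isascii()}

# Difference: alphabetic names that are not ASCII-only.
non_ascii_alpha = [c for c in columns if c in alpha - ascii_alpha]

# Negation: everything in the frame that is not alphabetic.
not_alpha = [c for c in columns if c not in alpha]
```

Iterating over `columns` when building the results keeps the original column order, which is what the printed frames show.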

Select alphanumeric names:

df.select(cs.alphanumeric())
# shape: (3, 4)
# ┌─────┬──────────┬─────┬──────┐
# │ no1 ┆ café     ┆ hmm ┆ 都市 │
# │ --- ┆ ---      ┆ --- ┆ ---  │
# │ i64 ┆ str      ┆ str ┆ str  │
# ╞═════╪══════════╪═════╪══════╡
# │ 100 ┆ espresso ┆ aaa ┆ 東京 │
# │ 200 ┆ latte    ┆ bbb ┆ 大阪 │
# │ 300 ┆ mocha    ┆ ccc ┆ 京都 │
# └─────┴──────────┴─────┴──────┘

Select alphanumeric names, constraining the definition to ASCII characters:

df.select(cs.alphanumeric(ascii_only=True))
# shape: (3, 2)
# ┌─────┬─────┐
# │ no1 ┆ hmm │
# │ --- ┆ --- │
# │ i64 ┆ str │
# ╞═════╪═════╡
# │ 100 ┆ aaa │
# │ 200 ┆ bbb │
# │ 300 ┆ ccc │
# └─────┴─────┘

@github-actions bot added the enhancement (New feature or an improvement of an existing feature) and python (Related to Python Polars) labels May 18, 2024

codecov bot commented May 18, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 80.76%. Comparing base (6804f33) to head (e0f3d8b).

Additional details and impacted files
@@           Coverage Diff           @@
##             main   #16310   +/-   ##
=======================================
  Coverage   80.75%   80.76%           
=======================================
  Files        1393     1393           
  Lines      179423   179431    +8     
  Branches     2922     2922           
=======================================
+ Hits       144891   144912   +21     
+ Misses      34029    34016   -13     
  Partials      503      503           

@alexander-beedie added the A-selectors (Area: column selectors) label May 18, 2024
@ritchie46 merged commit eb20a7a into pola-rs:main May 19, 2024
19 checks passed
@alexander-beedie deleted the alpha-digit-selectors branch May 19, 2024 08:35