Format hex code in unicode escape sequences in string literals #2916

Merged
merged 24 commits into from Jan 22, 2023
Commits
add30b8
Format hex code in unicode escape sequences in string literals
Shivansh-007 Jan 30, 2022
483fc15
Format \N character name escapes with uppercased literals
Shivansh-007 Jan 31, 2022
cc48d2d
Fix formatting with correct length for each format
Shivansh-007 Mar 13, 2022
f1dbc96
Add changelog
Shivansh-007 Mar 13, 2022
ef442a6
Move feature to preview styling only
Shivansh-007 Mar 13, 2022
2ada012
Fix typo
Shivansh-007 Mar 13, 2022
125ebec
Change Match[AnyStr] to Match[str]
Shivansh-007 Mar 16, 2022
af86102
Make UNICODE_RE Final and accept multiline strings
Shivansh-007 Mar 16, 2022
69c9664
Reword regex comments to use 'character'
Shivansh-007 Mar 16, 2022
7d0e548
Merge remote-tracking branch 'upstream/main' into format/hex-code-lit…
Shivansh-007 Mar 16, 2022
52bd904
ITS RE.VERBOSE NOT RE.MULTILINE?!
Shivansh-007 Mar 16, 2022
a5c4e62
Merge branch 'main' into format/hex-code-literals
JelleZijlstra Mar 24, 2022
221995e
Update CHANGES.md
Shivansh-007 Mar 24, 2022
d4dde2e
Merge branch 'main' into format/hex-code-literals
JelleZijlstra Apr 2, 2022
3557faf
Merge branch 'main' into format/hex-code-literals
JelleZijlstra Dec 18, 2022
77a48e6
CR improvements
JelleZijlstra Dec 18, 2022
1b9d5fd
fix lint
JelleZijlstra Dec 18, 2022
3c24427
fix my sloppy code
JelleZijlstra Dec 18, 2022
9f35b61
fix the new test; \U requires exactly 8 digits
JelleZijlstra Dec 18, 2022
27d2d86
fix \N escapes
JelleZijlstra Dec 18, 2022
420a8f9
add a test
JelleZijlstra Dec 18, 2022
296cdb9
bytes tests
JelleZijlstra Dec 18, 2022
625c085
Merge branch 'main' into format/hex-code-literals
JelleZijlstra Dec 29, 2022
1511959
Merge branch 'main' into format/hex-code-literals
JelleZijlstra Dec 29, 2022
1 change: 1 addition & 0 deletions CHANGES.md
@@ -16,6 +16,7 @@

<!-- Changes that affect Black's preview style -->

- Format hex code in unicode escape sequences in string literals (#2916)
- Add parentheses around `if`-`else` expressions (#2278)
- Improve the performance on large expressions that contain many strings (#3467)
- Fix a crash in preview style with assert + parenthesized string (#3415)
4 changes: 4 additions & 0 deletions src/black/linegen.py
@@ -58,6 +58,7 @@
    get_string_prefix,
    normalize_string_prefix,
    normalize_string_quotes,
    normalize_unicode_escape_sequences,
)
from black.trans import (
    CannotTransform,
@@ -362,6 +363,9 @@ def visit_factor(self, node: Node) -> Iterator[Line]:
        yield from self.visit_default(node)

    def visit_STRING(self, leaf: Leaf) -> Iterator[Line]:
        if Preview.hex_codes_in_unicode_sequences in self.mode:
            normalize_unicode_escape_sequences(leaf)

        if is_docstring(leaf) and "\\\n" not in leaf.value:
            # We're ignoring docstrings with backslash newline escapes because changing
            # indentation of those changes the AST representation of the code.
1 change: 1 addition & 0 deletions src/black/mode.py
@@ -149,6 +149,7 @@ def supports_feature(target_versions: Set[TargetVersion], feature: Feature) -> b
class Preview(Enum):
    """Individual preview style features."""

    hex_codes_in_unicode_sequences = auto()
    annotation_parens = auto()
    empty_lines_before_class_or_def_with_leading_comments = auto()
    handle_trailing_commas_in_head = auto()
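
The new enum member is what `visit_STRING` tests with `Preview.hex_codes_in_unicode_sequences in self.mode`. A minimal sketch of that gating pattern follows. `SketchMode` and `should_normalize_escapes` are hypothetical names, and the rule that every preview feature is on exactly when `preview` is on is an assumption made for illustration, not a statement about Black's actual mode class.

```python
from dataclasses import dataclass
from enum import Enum, auto


class Preview(Enum):
    """Two of the members shown in the diff."""

    hex_codes_in_unicode_sequences = auto()
    annotation_parens = auto()


@dataclass
class SketchMode:
    """Hypothetical stand-in for the formatter's mode object."""

    preview: bool = False

    def __contains__(self, feature: Preview) -> bool:
        # Assumption: every preview feature is enabled exactly when preview is on.
        return self.preview


def should_normalize_escapes(mode: SketchMode) -> bool:
    # Mirrors the `Preview.hex_codes_in_unicode_sequences in self.mode` check.
    return Preview.hex_codes_in_unicode_sequences in mode


print(should_normalize_escapes(SketchMode(preview=True)))
print(should_normalize_escapes(SketchMode(preview=False)))
```

Routing the check through `in` means the call site does not need to know how features are switched on.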
44 changes: 43 additions & 1 deletion src/black/strings.py
@@ -5,7 +5,9 @@
import re
import sys
from functools import lru_cache
from typing import List, Pattern
from typing import List, Match, Pattern

from blib2to3.pytree import Leaf

if sys.version_info < (3, 8):
    from typing_extensions import Final
@@ -18,6 +20,15 @@
    r"^([" + STRING_PREFIX_CHARS + r"]*)(.*)$", re.DOTALL
)
FIRST_NON_WHITESPACE_RE: Final = re.compile(r"\s*\t+\s*(\S)")
UNICODE_ESCAPE_RE: Final = re.compile(
    r"(?P<backslashes>\\+)(?P<body>"
    r"(u(?P<u>[a-fA-F0-9]{4}))"  # Character with 16-bit hex value xxxx
    r"|(U(?P<U>[a-fA-F0-9]{8}))"  # Character with 32-bit hex value xxxxxxxx
    r"|(x(?P<x>[a-fA-F0-9]{2}))"  # Character with hex value hh
    r"|(N\{(?P<N>[a-zA-Z0-9 \-]{2,})\})"  # Character named name in the Unicode database
    r")",
    re.VERBOSE,
)
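
To see what this pattern captures, here is a small standalone sketch. It copies the pattern so it runs on its own. `describe_escapes` is a hypothetical helper, not part of the PR. It reports the run of backslashes, which alternative matched, and the payload that would later be re-cased.

```python
import re
from typing import List, Tuple

# Same pattern as UNICODE_ESCAPE_RE in the diff, copied so this snippet runs on its own.
UNICODE_ESCAPE_RE = re.compile(
    r"(?P<backslashes>\\+)(?P<body>"
    r"(u(?P<u>[a-fA-F0-9]{4}))"  # \uXXXX
    r"|(U(?P<U>[a-fA-F0-9]{8}))"  # \UXXXXXXXX
    r"|(x(?P<x>[a-fA-F0-9]{2}))"  # \xHH
    r"|(N\{(?P<N>[a-zA-Z0-9 \-]{2,})\})"  # \N{NAME}
    r")",
    re.VERBOSE,
)


def describe_escapes(source_text: str) -> List[Tuple[str, str, str]]:
    """Hypothetical helper: list (backslashes, kind, payload) for each match."""
    found = []
    for m in UNICODE_ESCAPE_RE.finditer(source_text):
        groups = m.groupdict()
        # Exactly one of the four named payload groups is set for any match.
        kind = next(k for k in ("u", "U", "x", "N") if groups[k])
        found.append((groups["backslashes"], kind, groups[kind]))
    return found


# Raw strings stand in for the literal source text that the formatter sees.
print(describe_escapes(r'"\uFaCe \x1F \N{ox}"'))
print(describe_escapes(r'"\u0001F60E"'))  # only the first four hex digits belong to \u
print(describe_escapes(r'"\\x1B"'))  # still matches; the caller must check backslash parity
```

The pattern alone cannot tell an escaped backslash followed by `x1B` from a real `\x1B` escape, which is why the whole backslash run is captured and its length inspected by the caller.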


def sub_twice(regex: Pattern[str], replacement: str, original: str) -> str:
@@ -236,3 +247,34 @@ def normalize_string_quotes(s: str) -> str:
        return s  # Prefer double quotes

    return f"{prefix}{new_quote}{new_body}{new_quote}"


def normalize_unicode_escape_sequences(leaf: Leaf) -> None:
    """Replace hex codes in Unicode escape sequences with lowercase representation."""
Collaborator: This will have to be thought out still, as this comment points out. My two cents: I prefer upper case, and since Black formats hex numbers to upper already I think it would be consistent. The Python repr argument is solid too, but we should think about changing hex literals as well then.

Collaborator: I'd rather not change hex numbers; we already changed our mind there a few times.

Collaborator: So if we're not changing numbers (which I agree with), do y'all share the concern for consistency?

Collaborator: My comments read a bit ambiguously. So to be clear, I'm proposing that we switch the formatting to be upper case to be consistent with hex numbers. Y'all in?

    text = leaf.value
    prefix = get_string_prefix(text)
    if "r" in prefix.lower():
        return

    def replace(m: Match[str]) -> str:
        groups = m.groupdict()
        back_slashes = groups["backslashes"]

        if len(back_slashes) % 2 == 0:
            return back_slashes + groups["body"]

        if groups["u"]:
            # \u
            return back_slashes + "u" + groups["u"].lower()
        elif groups["U"]:
            # \U
            return back_slashes + "U" + groups["U"].lower()
        elif groups["x"]:
            # \x
            return back_slashes + "x" + groups["x"].lower()
        else:
            assert groups["N"], f"Unexpected match: {m}"
            # \N{}
            return back_slashes + "N{" + groups["N"].upper() + "}"

    leaf.value = re.sub(UNICODE_ESCAPE_RE, replace, text)
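
The function above mutates a `Leaf` in place, so it cannot be tried without Black's parser. As a rough sketch of the same logic on plain source text, the snippet below uses a hypothetical `normalize_literal` that takes and returns a `str`. Its prefix detection (the run of letters before the opening quote) is an assumption standing in for `get_string_prefix`.

```python
import re

# Same pattern as in the diff, copied so this snippet runs on its own.
ESCAPE_RE = re.compile(
    r"(?P<backslashes>\\+)(?P<body>"
    r"(u(?P<u>[a-fA-F0-9]{4}))"
    r"|(U(?P<U>[a-fA-F0-9]{8}))"
    r"|(x(?P<x>[a-fA-F0-9]{2}))"
    r"|(N\{(?P<N>[a-zA-Z0-9 \-]{2,})\})"
    r")",
    re.VERBOSE,
)


def normalize_literal(literal: str) -> str:
    """Hypothetical str-in, str-out variant of normalize_unicode_escape_sequences."""
    # Assumption: the prefix is the run of letters before the opening quote.
    prefix = re.match(r"[A-Za-z]*", literal).group(0)
    if "r" in prefix.lower():
        return literal  # raw literals contain no escape sequences

    def replace(m: "re.Match[str]") -> str:
        backslashes = m["backslashes"]
        if len(backslashes) % 2 == 0:
            # The backslashes pair up, so the letter that follows is plain text.
            return m.group(0)
        if m["N"]:
            return backslashes + "N{" + m["N"].upper() + "}"
        body = m["body"]
        # body is "u", "U" or "x" plus hex digits: keep the letter, lowercase the digits.
        return backslashes + body[0] + body[1:].lower()

    return ESCAPE_RE.sub(replace, literal)


# Raw strings stand in for the literal source text.
print(normalize_literal(r'"\\\x1B and \\x1B"'))
print(normalize_literal(r'"\N{ox}"'))
```

In the first call only the escape preceded by three backslashes is rewritten. The one preceded by two is an escaped backslash followed by ordinary characters and is left alone.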
33 changes: 33 additions & 0 deletions tests/data/preview/format_unicode_escape_seq.py
@@ -0,0 +1,33 @@
x = "\x1F"
x = "\\x1B"
x = "\\\x1B"
x = "\U0001F60E"
x = "\u0001F60E"
x = r"\u0001F60E"
x = "don't format me"
x = "\xA3"
x = "\u2717"
x = "\uFaCe"
x = "\N{ox}\N{OX}"
x = "\N{lAtIn smaLL letteR x}"
x = "\N{CYRILLIC small LETTER BYELORUSSIAN-UKRAINIAN I}"
x = b"\x1Fdon't byte"
x = rb"\x1Fdon't format"

# output

x = "\x1f"
x = "\\x1B"
x = "\\\x1b"
x = "\U0001f60e"
x = "\u0001F60E"
x = r"\u0001F60E"
x = "don't format me"
x = "\xa3"
x = "\u2717"
x = "\uface"
x = "\N{OX}\N{OX}"
x = "\N{LATIN SMALL LETTER X}"
x = "\N{CYRILLIC SMALL LETTER BYELORUSSIAN-UKRAINIAN I}"
x = b"\x1fdon't byte"
x = rb"\x1Fdon't format"
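
One way to check that the rewritten literals are safe is to confirm that each input and output pair evaluates to the same string. The sketch below does this with `ast.literal_eval` on a few pairs copied from the test data above. `same_value` is a hypothetical helper. This is an illustration only, not how Black itself verifies its output.

```python
import ast

# A few (input, output) pairs from the test data, written as raw strings so
# that each one holds the literal's source text.
PAIRS = [
    (r'"\x1F"', r'"\x1f"'),
    (r'"\\\x1B"', r'"\\\x1b"'),
    (r'"\U0001F60E"', r'"\U0001f60e"'),
    (r'"\uFaCe"', r'"\uface"'),
    (r'"\N{lAtIn smaLL letteR x}"', r'"\N{LATIN SMALL LETTER X}"'),
]


def same_value(before: str, after: str) -> bool:
    """Hypothetical helper: do two literals evaluate to the same object?"""
    return ast.literal_eval(before) == ast.literal_eval(after)


for before, after in PAIRS:
    assert same_value(before, after), (before, after)

# With an even number of backslashes, "x1B" is ordinary text, so changing its
# case would change the string. That input must therefore stay untouched.
print(same_value(r'"\\x1B"', r'"\\x1b"'))
```

The last call prints `False`, which matches the test expectation that `"\\x1B"` is left as written.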