Match (almost) any valid unicode character #491

rowlesmr · 2023-07-05T03:55:47Z

I've just started playing around with pyparsing, and it looks great.

I'm trying to implement a parser for a file format that needs to support nearly all valid Unicode characters.

Following the examples in the Greek, Latin, Arabic (etc.) Unicode subsections, I've created an allchars* variable, a list of ~1 million characters that input has to be checked against.

Are there any shortcuts here I can take advantage of to avoid having this large list?

*

_allchars_ranges = [(0x0009,), (0x000A,), (0x000D,), (0x0020, 0x007E),
(0x00A0, 0xD7FF), (0xE000, 0xFDCF), (0xFDF0, 0xFFFD),
(0x10000, 0x1FFFD), (0x20000, 0x2FFFD), (0x30000, 0x3FFFD),
(0x40000, 0x4FFFD), (0x50000, 0x5FFFD), (0x60000, 0x6FFFD),
(0x70000, 0x7FFFD), (0x80000, 0x8FFFD), (0x90000, 0x9FFFD),
(0xA0000, 0xAFFFD), (0xB0000, 0xBFFFD), (0xC0000, 0xCFFFD),
(0xD0000, 0xDFFFD), (0xE0000, 0xEFFFD), (0xF0000, 0xFFFFD),
(0x100000, 0x10FFFD)]

tmp = []
for rr in _allchars_ranges:
    tmp.extend(range(rr[0], rr[-1] + 1))

allchars = [chr(c) for c in sorted(set(tmp))]

Although, since my ranges are already sorted and contain no duplicates, I could just write allchars = [chr(c) for c in tmp]

ptmcg · 2023-07-05T04:18:47Z

Maintainer

Welcome to pyparsing, I'm glad it is making sense for you.

The base class pyparsing.unicode is the set of all Unicode characters you are looking for, so there is no need to define your own. I think if you just used this, you'd get what you want:

import pyparsing as pp
ppu = pp.unicode
print(len(ppu.printables))
# prints 1114060


ptmcg · 2023-07-05T04:27:07Z

Maintainer

Sorry, I kind of blew off your existing work, and didn't actually answer your question. With your current list of ranges, here is how you can build a set of all the characters without having to accumulate them into a list first, by using a generator function:

def yield_all_chars():
    for char_range in _allchars_ranges:
        if len(char_range) == 2:
            yield from (chr(c) for c in range(char_range[0], char_range[1]+1))
        else:
            yield chr(char_range[0])


all_chars = set(yield_all_chars())
print(len(all_chars))
# prints 1111936

The generator function will yield all the characters directly into the destination set, without building an intermediate list.
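A different technique, not the one suggested in this thread, is to avoid materializing the characters at all and express the ranges as a regular-expression character class. The sketch below uses only the standard re module and assumes the same range list as in the question (repeated so the block is self-contained). The name char_class_from_ranges is hypothetical.

```python
import re

_allchars_ranges = [(0x0009,), (0x000A,), (0x000D,), (0x0020, 0x007E),
                    (0x00A0, 0xD7FF), (0xE000, 0xFDCF), (0xFDF0, 0xFFFD),
                    (0x10000, 0x1FFFD), (0x20000, 0x2FFFD), (0x30000, 0x3FFFD),
                    (0x40000, 0x4FFFD), (0x50000, 0x5FFFD), (0x60000, 0x6FFFD),
                    (0x70000, 0x7FFFD), (0x80000, 0x8FFFD), (0x90000, 0x9FFFD),
                    (0xA0000, 0xAFFFD), (0xB0000, 0xBFFFD), (0xC0000, 0xCFFFD),
                    (0xD0000, 0xDFFFD), (0xE0000, 0xEFFFD), (0xF0000, 0xFFFFD),
                    (0x100000, 0x10FFFD)]


def char_class_from_ranges(ranges):
    """Hypothetical helper: build a regex character class such as [a-z] from ranges."""
    parts = []
    for r in ranges:
        lo, hi = r[0], r[-1]
        # \UXXXXXXXX escapes keep the pattern pure ASCII and need no extra quoting.
        if lo == hi:
            parts.append(f"\\U{lo:08X}")
        else:
            parts.append(f"\\U{lo:08X}-\\U{hi:08X}")
    return "[" + "".join(parts) + "]"


# One or more allowed characters; the pattern stores 23 ranges, not ~1M characters.
allchars_re = re.compile(char_class_from_ranges(_allchars_ranges) + "+")

print(bool(allchars_re.fullmatch("\u03b1\u03b2\u03b3 \u65e5\u672c\u8a9e \U0001F600")))
# prints True
print(bool(allchars_re.fullmatch("bad\x00char")))
# prints False
```

The pattern string could also be passed to pyparsing's Regex element in place of a Word built from the full character set. Whether that is faster for a given grammar is untested here.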

1 reply

rowlesmr Jul 5, 2023
Author

Ta a lot!

I'll have a play.

rowlesmr · 2023-07-07T15:20:54Z

Author

Well, it works, but it turns out to be nowhere near the rate-determining step...

