-
Welcome to pyparsing, I'm glad it is making sense for you. The base class pyparsing.unicode is the set of all Unicode characters you are looking for, so there is no need to define your own. I think if you just used this, you'd get what you want:

    import pyparsing as pp
    ppu = pp.unicode
    print(len(ppu.printables))
    # prints 1114060
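For anyone curious where that number comes from: as far as I know, printables is every non-whitespace character from U+0020 up to sys.maxunicode. The sketch below is a stdlib-only cross-check under that assumption. The helper name count_non_whitespace is made up for illustration and is not part of pyparsing.

```python
import sys

def count_non_whitespace(first, last):
    """Count code points in [first, last] whose character is not whitespace."""
    return sum(1 for cp in range(first, last + 1) if not chr(cp).isspace())

# Every code point from U+0020 through sys.maxunicode (assumed range).
total = sys.maxunicode + 1 - 0x20
# The same span with whitespace characters filtered out.
printable = count_non_whitespace(0x20, sys.maxunicode)
print(total, printable)
```

If the assumption holds, the second number should line up with the length printed above.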
-
Sorry, I kind of blew off your existing work, and didn't actually answer your question. With your current list of ranges, here is how you can build a set of all the characters without having to accumulate them into a list first, by using a generator function:

    def yield_all_chars():
        for char_range in _allchars_ranges:
            if len(char_range) == 2:
                # two-element entry: inclusive (first, last) span of code points
                yield from (chr(c) for c in range(char_range[0], char_range[1]+1))
            else:
                # one-element entry: a single code point
                yield chr(char_range[0])

    all_chars = set(yield_all_chars())
    print(len(all_chars))
    # prints 1111936

The generator function will yield all the characters directly into the destination set, without building an intermediate list.
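If all you need is a plain membership check outside of pyparsing, you may not need to materialize the characters at all. Here is a minimal sketch that tests a character against the ranges with a binary search. It assumes the ranges are sorted by starting code point and do not overlap; sample_ranges and in_ranges are hypothetical names for illustration, with sample_ranges standing in for your real list.

```python
from bisect import bisect_right

# Hypothetical sample in the same shape the loop above expects:
# (first, last) for an inclusive span, (code_point,) for a single character.
# Assumed sorted by starting code point and non-overlapping.
sample_ranges = [(0x0020, 0x007E), (0x00A0,), (0x0370, 0x03FF)]

starts = [r[0] for r in sample_ranges]
ends = [r[-1] for r in sample_ranges]  # r[-1] is r[0] for single-element entries

def in_ranges(ch):
    """Return True if the single character ch falls inside one of the ranges."""
    cp = ord(ch)
    # Index of the last range whose start is <= cp, or -1 if there is none.
    i = bisect_right(starts, cp) - 1
    return i >= 0 and cp <= ends[i]

print(in_ranges("a"), in_ranges("\u00a0"), in_ranges("\u03b1"), in_ranges("\u0400"))
# prints True True True False
```

This keeps memory proportional to the number of ranges, not the number of characters, at the cost of a binary search per lookup. It does not help if you need to hand pyparsing an actual string of characters.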
-
Well, it works, but it is by far not the rate-determining step...
-
I've just started playing around with pyparsing, and it looks great.
I'm trying to implement a parser for a file format which essentially needs to support nearly all valid UTF-8 characters.
Following the examples in the Greek, Latin, Arabic (etc.) Unicode subsections, I've created an allchars* variable, which is a list with ~1 million elements against which checks need to be made. Are there any shortcuts here I can take advantage of to avoid having this large list?

* Although my ranges are already sorted and are a set, so I could just use:

    allchars = [chr(c) for c in tmp]