Generate unicode aware functions using macros #3

Open
Yamakaky opened this issue Jul 14, 2015 · 3 comments

Comments

@Yamakaky

Currently, you seem to use a Python script to update a Rust source file from the upstream Unicode data. Did you consider doing it at compile time, using macros? BTW, that's what the Elixir language does.

@kwantam
Member

kwantam commented Jul 14, 2015

Thanks for this suggestion. It's interesting, but ultimately it seems like a lot of work for relatively little benefit. After all, the Unicode version changes, what, every year or two?

Consider: beyond translating all of the current scripting support into compile-time macros (which, recall, will have to do crazy things like go grab text files from the internet and feed them to the compiler---among other things, a potential security concern), there's the headache of handling (and testing the code that handles!) error conditions where the tables aren't generated correctly. And beyond all that, now the library's version number and the Unicode version are no longer linked, which means it's harder to debug reported issues. And you can't compile the library without an internet connection. And probably other things I haven't thought of.

If I'm misinterpreting what you've written, and your suggestion is instead that we store the unprocessed Unicode definitions locally and then turn them into tables at compile time, I don't see any benefit compared to what we're doing now other than a warm fuzzy feeling. In other words: why make the compiler do at each compile what a Python script can just do once?
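For illustration, the table-building step under discussion, whether it runs once in a script or on every compile, might look roughly like the Rust sketch below. It assumes the usual semicolon-separated UnicodeData.txt layout, with the code point in field 0 and the simple lowercase mapping in field 13. The names `parse_lowercase_mapping` and `build_lowercase_table` are hypothetical and are not part of this crate.

```rust
/// Parse one UnicodeData.txt record and return (code point, simple lowercase)
/// if the record has a lowercase mapping. Hypothetical helper, for illustration.
pub fn parse_lowercase_mapping(line: &str) -> Option<(char, char)> {
    let fields: Vec<&str> = line.split(';').collect();
    // Field 0: the code point, in hexadecimal.
    let from_hex: &str = fields.first()?;
    // Field 13: the simple lowercase mapping (assumed layout), empty if none.
    let to_hex: &str = fields.get(13)?;
    let from = u32::from_str_radix(from_hex, 16).ok()?;
    let to = u32::from_str_radix(to_hex, 16).ok()?;
    Some((char::from_u32(from)?, char::from_u32(to)?))
}

/// Turn the whole file contents into a table sorted by source code point.
pub fn build_lowercase_table(data: &str) -> Vec<(char, char)> {
    let mut table: Vec<(char, char)> = data
        .lines()
        .filter_map(parse_lowercase_mapping)
        .collect();
    table.sort();
    table
}
```

Records without a lowercase mapping are simply skipped, so the resulting table holds only the code points that actually change.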

@Yamakaky
Author

You can put the Unicode data files in git, so there is no need for an internet connection.

I see one advantage: the code is generated safely, as opposed to the current text-based script. Also, you don't have to wonder "Did I run the script or not?". But you're right, the current approach is OK given how infrequently Unicode is updated.

I was talking about something like that. After reading UnicodeData.txt, Elixir iterates through the code points and associates each one with its downcase character. It uses pattern matching in the function parameters for that.
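A rough Rust analogue of that idea, as a minimal sketch only: a `macro_rules!` macro that expands a list of pairs into a function whose `match` arms play the role of Elixir's pattern-matched function clauses. The macro name `simple_lowercase!` and the function `to_lower_simple` are made up for this example, and the pairs are typed by hand here rather than read from UnicodeData.txt (reading a file at compile time would need a build script or a procedural macro).

```rust
// Expands a list of `from => to` pairs into a lookup function.
// Hypothetical macro, not part of this crate.
macro_rules! simple_lowercase {
    ($($from:pat => $to:expr),* $(,)?) => {
        pub fn to_lower_simple(c: char) -> char {
            match c {
                // One match arm per listed code point.
                $($from => $to,)*
                // Anything not listed maps to itself.
                other => other,
            }
        }
    };
}

// Hand-written sample entries; a real table would be generated.
simple_lowercase! {
    'A' => 'a',
    'B' => 'b',
    '\u{00C9}' => '\u{00E9}', // É -> é
}
```

With this expansion, `to_lower_simple('A')` yields `'a'`, and a character with no entry, such as `'z'`, is returned unchanged.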

Rust definitely needs incremental compilation ^^ Or maybe put the Unicode-aware functions in another inner crate?

@kwantam
Member

kwantam commented Jul 15, 2015

I'm sorry, I'm probably just being slow, but I think probably neither of us is quite understanding the other.

First, no one other than the project maintainers actually needs to run the Python script. Someone else can if he or she wants to, but the Unicode tables are included as part of the release and, as I pointed out above, almost never change.

Second, I'm not sure what you mean when you say "generated safely." What is your definition of safe in this context, and how is the Python script unsafe? Are you assuming that Rust code for text processing is somehow better than equivalent Python code? I just don't see how this can be true in any meaningful sense, unless you're positing that one could come up with some invariants for the type system to enforce (but I'm not convinced that one can get more than superficial guarantees in this way).

Third, I think I do not understand what you are describing with respect to Elixir, but why would we want to iterate through codepoints at runtime rather than preprocessing data into structures that can be searched very quickly? (In principle we could pre-generate structures that are even faster to search, e.g., perfect hash tables. Actually, that would be a nice enhancement...)
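To make "structures that can be searched very quickly" concrete, here is a minimal sketch of one such structure: a static table sorted by code point and searched with binary search. The table `LOWERCASE_TABLE` and the function `to_lower_table` are hypothetical names with a handful of sample entries; they are not this crate's actual tables or API.

```rust
/// Hypothetical excerpt of a pre-generated table.
/// It must stay sorted by the first element for binary search to work.
static LOWERCASE_TABLE: &[(char, char)] = &[
    ('A', 'a'),
    ('B', 'b'),
    ('\u{00C9}', '\u{00E9}'), // É -> é
    ('\u{0394}', '\u{03B4}'), // Δ -> δ
];

/// Look up the simple lowercase mapping in O(log n) comparisons.
pub fn to_lower_table(c: char) -> char {
    match LOWERCASE_TABLE.binary_search_by_key(&c, |&(from, _)| from) {
        Ok(i) => LOWERCASE_TABLE[i].1,
        // Not in the table: the character has no mapping and is returned as is.
        Err(_) => c,
    }
}
```

Because the table is built ahead of time, no code points are iterated at runtime; a lookup touches only a logarithmic number of entries.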

Fourth, I'm not sure what you mean when you say "put the Unicode-aware functions in another crate." That's precisely the point of this crate---it does nothing except provide Unicode-aware functions! For example, when you use Cargo to build a project that includes this crate, you get exactly the situation you described.

I am sure that the Elixir developers have a good reason for their approach, and that the reason I don't see an advantage is that I do not understand what they are doing well enough. When I have some time, I will look into it more closely and try to better understand your suggestion.
