Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

python-format flag incorrectly forced on all extractors #35

Open
wichert opened this issue Jul 27, 2013 · 19 comments · May be fixed by #838
Open

python-format flag incorrectly forced on all extractors #35

wichert opened this issue Jul 27, 2013 · 19 comments · May be fixed by #838

Comments

@wichert
Copy link

wichert commented Jul 27, 2013

This is a copy of old ticket 318.

During extraction the results from an extractor plugin are used to create a message which is then added to the catalog. When the Message instance is created it can set the python-format flag if the text matches a specific regular expression. That means that a text like this:

This is a text about 50% of all users.

will be flagged as a python-format string even it does not originate from python code, or from a piece of python code that will never use formatting. This is breaking our translations in a pretty bad way currently.

As far as I can see the extractor is the only thing that should decide the flags for a message. The Message class itself should never try to guess flags or force them upon plugins, especially since extractors have no way to override this behaviour.

@mitsuhiko
Copy link
Member

That is correct. The extractor interface currently does not provide ways to hand out flags. Not sure yet how to properly fix it, but the place to fix it is definitely the extractor interface. I suppose the correct thing to do would be to add a new optional last item to the message tuple. If it's missing the old guessing behavior kicks in, otherwise the flags are accepted from the extractor.

@lelit
Copy link

lelit commented Jan 10, 2014

Not sure if this is related, but the current heuristic used to determine if a message is actually a python-format does not recognize strftime() format codes: such format strings should not be marked with the python-format flag because otherwise polint tools wrongly mark translations as defective.

@JonathanRRogers
Copy link

Issues with the heuristic detection of Python format strings are separate from this.

@wichert
Copy link
Author

wichert commented Feb 11, 2014

For what it's worth I have been working the last few days adding a extraction system to lingua, which fixes this amongst a few other things.

@JonathanRRogers
Copy link

By "fix", do you mean you extended Babel's extractor interface? If so, could you contribute that babel? I was about to do so, but don't want to duplicate effort. I'm not sure if lingua is meant to be an extension to Babel or an alternative.

@wichert
Copy link
Author

wichert commented Feb 11, 2014

I fixed this purely from lingua's point of view: lingua no longer uses Babel at all anymore now. Instead it now has its own extraction command (pot-extract, command line-compatible with xgettext), and uses a richer API for its own extractors. It is able to run Babel plugins for backwards compatibility to support file formats lingua does not support itself, but obviously that comes with the API limitations Babel plugins have.

@lelit
Copy link

lelit commented Feb 11, 2014

Sigh, a pity the waste of resources in duplicating code bases... :-\

Help babel do the right thing, or help lingua? Both does the wrong thing wrt strftime() formats, but at least babel already works with Python 3 while lingua currently does not...

@wichert
Copy link
Author

wichert commented Feb 11, 2014

Lingua in git works perfectly on Python 3. It needs a few last changes and a bit more documentation before I can make a release, but that should happen in the next few days.

@wichert
Copy link
Author

wichert commented Feb 11, 2014

@lelit can you explain why you think lingua does the wrong thing with strftime formats? If it does something wrong I would like to fix that.

@lelit
Copy link

lelit commented Feb 11, 2014

I tried lingua this morning, from a fresh clone from the git repo: at least this line prevents Python 3 to even load the module, maybe it's one the few last changes, dunno, maybe you have a different branch for Py3?

Wrt strftime format, both projects implement the same heuristic to determine if a string contains %x placeholders, but its not adeguate for strftime, because the translated format may contain different set of them (think about %p, ie "AM"/"PM" which is used by englishmen, while other languages, italian to mention one, does not use it and instead prefer %H for 24-hour-based hours): the usual %m/%d/%y %I:%M %p becomes %d/%m/%y %H:%M, and that %d will be recognised as a "c-format" placeholder.

IMHO, a more correct way, although not perfect, would be to check if all %-placeholders in string belongs to the possible strfmt() codes.

In other words, strftime() formats shouldn't be marked as "python-format", otherwise most polints I know will wrongly emit alerts about non-matching translations.

@wichert
Copy link
Author

wichert commented Feb 11, 2014

Ah, the last commit indeed broke Python 3 support. That was stupid, and I’ll fix that immediately.

The strftime thing is interesting, and something I had never run into before. Luckily it is quite easy to fix. I have a local fix which I’ll push as soon as I have an internet connection again.

For what it is worth I do think Babel’s use of the “python-format” flag is incorrect: as I see it this is really c-format. Python format as defined in the Python documentation is {..} notation, something which Babel does not support currently. This is also the naming used by xgettex.

@lelit
Copy link

lelit commented Feb 11, 2014

I saw your fixes on lingua... I will try it out again :-) But is there any reason to keep two different implementation of the extraction machinery? I mean, is there any chance that lingua and Babel cooperate in this area, instead of wasting precious hours duplicating each other code?

I can agree on what you state about python-format and c-format confusion, but then I do not see the point of python-format at all, unless you restrict it to {digit}, ie positional arguments, and not generically to {name} placeholder. AFAICT, the c-format marker purpose is only to let polints to check that the translated string contains the same set of placeholders as the original string (assuming it is possible to change the order, generically depending on the actual function using the string, for example the fprintf(3) manpage mentions the syntax %m$ with m indicating the index of the argument to use...); when using named placeholders, is there any need for a special marker?

Thank you!

@wichert
Copy link
Author

wichert commented Feb 11, 2014

I see no particular reason that to have two implementations. lingua was originally created because Babel's python extractor did not support the i18n syntax user by Pyramid (via translationstring) and Zope (via zope.i18nmessageid), and I needed a new plugin for ZPT files. Unfortunately Babel has not been actively maintained for a long time and bug reports were ignored, which led to a growing desire to drop our dependency on, which is what I worked on over the last few days. I suspect this is also why polib was created instead of using Babel.

As I see it I Babel currently tries to do too many things: a Python API for CLDR, a pot/po-handling framework, a gettext-implementation and an implementation of PO/MO file format. Perhaps it would be useful to split Babel up into separate pieces, each with a single responsibility. This is the direction I wanted to take with lingua: I want it do be nothing other than a tool to extract messages from source code. I haven't made up my mind if it should include a tool to compile PO files, when msgfmt already does that very well. And yes, removing the polint tool currently in lingua is on the to-do list :)

I do see a point of having a python-format: polint tools can use that to check all placeholders from the msgid are also present in the msgstr, perhaps ignoring changes in format specifier. For example if you translate Hello {name} with Beste {naam} the translation is obviously flawed and would result in your software suddenly raising exceptions, but you need to know the text is python-format to be able to do that check.

@wichert
Copy link
Author

wichert commented Feb 11, 2014

For what its worth I do use Babel for its CLDR API - it is much, much nicer than the CLDR wrappers in zone.i18n (even if that is setting the bar low :) ).

@lelit
Copy link

lelit commented Feb 11, 2014

I do see a point of having a python-format: polint tools can use that
to check all placeholders from the msgid are also present in the
msgstr

Yes, right, I missed the obvious checks on typos.

JonathanRRogers pushed a commit to JonathanRRogers/babel that referenced this issue Oct 20, 2015
During extraction, Message instances can be created with the
"python-format" flag, indicating that the message string contains Python
percent-formatting placeholders. To avoid setting the flag erroneously
because the string source is not Python code or otherwise is not
expected to contain such placeholders, the extractor interface must be
extended to allow extractor functions to indicate which flags are
valid.

Fixes python-babel#35
JonathanRRogers pushed a commit to JonathanRRogers/babel that referenced this issue Oct 21, 2015
During extraction, Message instances can be created with the
"python-format" flag, indicating that the message string contains Python
percent-formatting placeholders. To avoid setting the flag erroneously
because the string source is not Python code or otherwise is not
expected to contain such placeholders, the extractor interface must be
extended to allow extractor functions to indicate which flags are
valid.

Fixes python-babel#35
JonathanRRogers pushed a commit to JonathanRRogers/babel that referenced this issue Oct 21, 2015
During extraction, Message instances can be created with the
"python-format" flag, indicating that the message string contains Python
percent-formatting placeholders. To avoid setting the flag erroneously
because the string source is not Python code or otherwise is not
expected to contain such placeholders, the extractor interface must be
extended to allow extractor functions to indicate which flags are
valid.

Fixes python-babel#35
JonathanRRogers pushed a commit to JonathanRRogers/babel that referenced this issue Feb 11, 2016
During extraction, Message instances can be created with the
"python-format" flag, indicating that the message string contains Python
percent-formatting placeholders. To avoid setting the flag erroneously
because the string source is not Python code or otherwise is not
expected to contain such placeholders, the extractor interface must be
extended to allow extractor functions to indicate which flags are
valid.

Fixes python-babel#35
@Changaco
Copy link
Contributor

Changaco commented Sep 1, 2016

As far as I can see the extractor is the only thing that should decide the flags for a message.

I disagree. For example, how would the python extractor know whether the flag should be python-format or python-brace-format (#333)? It can only guess, and not any better than the Message class can.

xgettext has a --flag argument that's supposed to handle this (doc). Calling xgettext -o - --flag=_:1:python-format file.py, with file.py containing calls like _("Some message"), results in all those messages being flagged with python-format.

JonathanRRogers pushed a commit to JonathanRRogers/babel that referenced this issue Jul 20, 2018
During extraction, Message instances can be created with the
"python-format" flag, indicating that the message string contains Python
percent-formatting placeholders. To avoid setting the flag erroneously
because the string source is not Python code or otherwise is not
expected to contain such placeholders, the extractor interface must be
extended to allow extractor functions to indicate which flags are
valid.

Fixes python-babel#35
@lelit
Copy link

lelit commented Aug 7, 2018

Is there a particular reason why the PYTHON_FORMAT regexp accepts whitespaces after the initial % sign?
That's the main reason why wichert's example at the start (that is, This is a text about 50% of all users.) is being marked with python-format...
For now I replaced that with a FULLWIDTH PERCENT SIGN, to not trigger an error at compile time, but I wonder if the optional spaces should be removed from the regexp instead.

@wichert
Copy link
Author

wichert commented Aug 7, 2018

Because Python allows it:

Python 2.7.15 (default, Jun 17 2018, 12:46:58) 
[GCC 4.2.1 Compatible Apple LLVM 9.1.0 (clang-902.0.39.2)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> '% d' % 1
' 1'

Note the leading whitespace in the result, which is significant here. This is inherited from C, which also allows it:

#include <stdio.h>
#include <stdlib.h>

int main() {
	printf("% d\n", 1);
	return 0;
}

produces:

wichert@notos /tmp $ make x
cc     x.c   -o x
wichert@notos /tmp $ ./x
 1

@lelit
Copy link

lelit commented Aug 7, 2018

Oh, didn't know about that! Thanks!

JonathanRRogers pushed a commit to JonathanRRogers/babel that referenced this issue Aug 23, 2018
During extraction, Message instances can be created with the
"python-format" flag, indicating that the message string contains Python
percent-formatting placeholders. To avoid setting the flag erroneously
because the string source is not Python code or otherwise is not
expected to contain such placeholders, the extractor interface must be
extended to allow extractor functions to indicate which flags are
valid.

Fixes python-babel#35
JonathanRRogers pushed a commit to JonathanRRogers/babel that referenced this issue Dec 8, 2018
During extraction, Message instances can be created with the
"python-format" flag, indicating that the message string contains Python
percent-formatting placeholders. To avoid setting the flag erroneously
because the string source is not Python code or otherwise is not
expected to contain such placeholders, the extractor interface must be
extended to allow extractor functions to indicate which flags are
valid.

Fixes python-babel#35
@ljodal ljodal linked a pull request Jan 31, 2022 that will close this issue
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

7 participants