Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make default file sink encoding utf8 #339

Closed
jessekrubin opened this issue Sep 29, 2020 · 8 comments
Closed

Make default file sink encoding utf8 #339

jessekrubin opened this issue Sep 29, 2020 · 8 comments
Labels
enhancement Improvement to an already existing feature

Comments

@jessekrubin
Copy link
Contributor

Making the default file encoding utf-8 would be quite nice (in my opinion and would be better for JSON serialization).

@Delgan
Copy link
Owner

Delgan commented Sep 29, 2020

Well, I tried to copy the default behavior of open() which uses the encoding returned by locate.getpreferredencoding(). It usually returns "utf-8" on Linux and MacOS, but not necessarily on Windows.

There actually exists a big discussion about the question of default locale encoding for Python files: PEP 597: Use UTF-8 for default text file encoding.

It seems reasonable to make loguru uses "utf8" encoding by default regardless of the platform. I will first need to review the discussion I linked above to make sure there are no possible issues.

As you mentioned JSON serialization, I also think we should set ensure_ascii=False here:

return json.dumps(serializable, default=str) + "\n"

Otherwise, non-ascii characters are escaped despite the file being opened with "utf-8" encoding.

@jessekrubin
Copy link
Contributor Author

I think you kinda have made the decision already tho by including emoji-icon things that don't render on my windows machine. It also creates technically invalid JSON (tho it works with the stdlib json parser/serializer).

I have several scripts that use loguru but monkey patch in a new def _serialize_record(text, record): that uses orjson/python-rapidjson. Both libs are faster than the stdlib json-module AND are stricter about sticking to creating valid JSON. The only thing I have to do is set the encoding of the file sink to make this work on windows.

The PEP discussion you link to is pretty interesting, but I don't think the reasons they mention for not making the default encoding utf8 (that some windows users just want to do their job and not delve into the oddities of file encodings) apply to the average lo-guru.

Unrelated: I was thinking of opening a pull request if/when I find the time to allow the serialize option for a sink to be either a bool or a callable that accepts (test, record) and returns a string.

@Delgan
Copy link
Owner

Delgan commented Oct 1, 2020

I think you kinda have made the decision already tho by including emoji-icon things that don't render on my windows machine.

Fair point. 😄
Although such symbols were not mean to be displayed if sink does not support unicode, be it a file or the console. That's why there is no {level.icon} in the default format. However, it is true that if serialize=True there is no choice anyway.

It also creates technically invalid JSON (tho it works with the stdlib json parser/serializer).

For my information, can you please explain me how this can lead to the creation of an invalid JSON?
I don't see how that's possible. Unless if someone tries to read the file using a different encoding than the one used by loguru (say for example utf8 instead of the one returned by locale.getpreferredencoding()).

The PEP discussion you link to is pretty interesting, but I don't think the reasons they mention for not making the default encoding utf8 (that some windows users just want to do their job and not delve into the oddities of file encodings) apply to the average lo-guru.

Yes, you're right. Actually it will be much better to default to utf8 because of the emoji-icon you mentioned. It means if locale.getpreferredencoding() returns cp1252 then an handler with serialize=True can't write to a file sink except if encoding="utf8" is explicitly specified by the user. Otherwise, an UnicodeEncodeError will be raised. Worse, this error may occur even without serialization if someone writes a message with an unsupported character. This happened before: #124 (comment)

We just need to make sure to bump the Loguru version to v0.6.0 as it's a possible breaking change for Windows users.

I have several scripts that use loguru but monkey patch in a new def _serialize_record(text, record): that uses orjson/python-rapidjson. Both libs are faster than the stdlib json-module AND are stricter about sticking to creating valid JSON. The only thing I have to do is set the encoding of the file sink to make this work on windows.

Unrelated: I was thinking of opening a pull request if/when I find the time to allow the serialize option for a sink to be either a bool or a callable that accepts (test, record) and returns a string.

This is a suggestion that came up several times (#278, #203).
I am not convinced that such a change is necessary. I prefer that the serialize option remains a dump boolean.
It is perfectly possible to invoke orjson using the format argument, while avoiding having to monkeypatch the logger:

def formatter(record):
    record["extra"]["serialized"] = orjson.dumps(record, default=str)
    return "{extra[serialized]}\n"

logger.add("file.log", format=formatter)

There exists an entry about custom serialization in the documentation: Serializing log messages using a custom function.
Would such a solution suit you?
Main problem being that some values are not serializable (like datetime.timedelta()) and thus require default=str.

@Delgan Delgan added the enhancement Improvement to an already existing feature label Oct 4, 2020
@jessekrubin
Copy link
Contributor Author

@Delgan OH NO! I thought I had responded and I did not! Sorry for the delay! :/

You are correct that the 'invalid json' is really me stringifying/making JSON on a windows machine and parsing it on a linux machine. So I was wrong ☹️ .


Per a serialization handler, I looked at the 2 issues you mentioned and my monkey-patching is pretty similar.

This is a suggestion that came up several times (#278, #203).

I quite like the #278 version as it also uses orjson!

There exists an entry about custom serialization in the documentation: Serializing log messages using a custom function.
Would such a solution suit you?
Main problem being that some values are not serializable (like datetime.timedelta()) and thus require default=str.

The serialization you mention WOULD work for me, but (imo) the examples you give don't totally fit the awesome functional-ness of the current version of loguru; I think being able to give a serialize-method to the serialize kwarg for ln.add(..., serialize=callable) would be 1) super slick and 2) fits well into loguru in a predictable way.

as for the timedelta problem, stdlib json, orjson and python-rapidjson all support a default argument which is quite nice.


FOOD FOR THOUGHT:

Would you consider making orjson (and/or python-rapidjson would be another good option)an optional requirement of loguru and have loguru use an orjson-specific serializer? I see a few benefits:

  • orjson (and python-rapidjson) are fast
  • orjson dumps to bytes so writing to a BinaryIO-file-like-object (like stderr/stdout) would be fast and wouldn't need decoding then encoding (as the example in Accept a Callable for the serialize argument when adding a sink #278 shows)
  • orjson has an 'append-new-line' option which would probably be a bit faster than the + '\n' that is currently being used

(You could even use the library (i wrote) that swaps out json-libs if you wanna checkit out => jsonbourne)

(PS: I am pretty sure jsonbourne supports timedelta)


Additional question:

You mention on the loguru README that you were at one point aiming to make loguru faster than the builtin std-lib logging module/pkg (possibly using a c/cpp extension), what happened to that? did you ever make any benchmarks?

@jessekrubin
Copy link
Contributor Author

@Delgan Don't mean to bother you, but I am wondering if you ever read me reply above.

@Delgan
Copy link
Owner

Delgan commented Oct 31, 2020

Hey @jessekrubin.

I had read your comment but I postponed my answer because I was a little short of time. Also, I gave priority to support requests rather than to the discussion here as it seemed less urgent. Because this is a subject worthy of further consideration, it takes me some time to formulate a proper response. :)

Anyway, thanks for the clarification about invalid JSON (de)serialization. So at least it's not a bug in loguru. Once UTF-8 will be used as the default encoding, reading logging file should be less error-prone for the users.

About making the serialize accepts a callable. I'm sure it would be slightly more convenient and quite elegant. Still, I have a few concerns that explain my hesitation.

I see serialize, colorize, diagnose et al. as just basic flags, and I think it's easier to reason about these arguments this way. Their purpose was to make the handler quickly adaptable to the most common use cases: just toggle them based on your preferences. However, the important part of logging logic belong to the sink / filter / format arguments, and that's where the handler behavior should be specialized if needed.

If I give special treatment to serialize, why not give it to diagnose or some other argument too? For example, some users prefer to use stackprinter instead of better_exceptions to format exceptions. One could imagine changing the diagnose argument so that it accepts a callable, or add another argument. This is something I have considered. However, for the same reasons I mentioned here, I'm more comfortable suggesting instead to move the logic to the formatter for now. We can probably find convenient uses for each existing argument. If I want to avoid special cases, it is better to keep all these arguments as simple as possible.

It is also necessary to think about the type of object that would be passed to such function. What I mean is that the record is not fully serializable despite being a dict. Trying to dump() the record["file"] (among others) will raise an error. Thus it would be a design issue to pass a non-serializable object to a function called serialize. So, should we pass the serializable dict used internally by loguru instead? Then, exposing this dict to the public api probably implies documenting and type-hinting it, which I would prefer to avoid for simplification reasons.

Basically, I have quite high threshold for adding new features. There already exists different way to customize the serialization of logged message (through sink, format or patch()) and I don't think it worth adding another one. Even if I agree that it would be more convenient, it only saves very few lines of code. In the end, it's easier for me to suggest the use of a suitable format function than to extend the capabilities of the current arguments.

Following the same reasoning, if someone prefers to use something different than the standard json library, he can import it and configure the logger by himself. I tried to reduce the number of loguru dependencies as much as possible, so I don't want to add new ones. Even if it is only optional, I have little reason to prefer orjson over jsonbourne (lovely name by the way) or any other third-party library. It's certainly much faster, no doubt about it. However, users may wish to use another library for some reason. That's why I prefer to let to the user be responsible of advanced configuration of the logger so that it perfectly suits its needs.

As we are discussing speed: I still plan to re-implement some part of loguru in plain C to maximize performances. This should mark the transition to v1.0.0. It's just that it's going to take me quite a bit of time and for now I have other side-projects that I consider to be more of a priority. I did have some C implementation and some benchmarks in the past that I discarded to focus on features of the Python implementation. Currently, loguru is slower than logging, notably because of the aware datetimes.

That's it. I hope this answers some of your questions. 😉

@Delgan
Copy link
Owner

Delgan commented Jan 24, 2022

Fixed on master branch. The "utf8" encoding is now the default as you suggested, thanks!

@Delgan Delgan closed this as completed Jan 24, 2022
@BenjiaH
Copy link

BenjiaH commented Feb 17, 2022

Great improvement

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement Improvement to an already existing feature
Projects
None yet
Development

No branches or pull requests

3 participants