bgzf: add utf-8 support #4436
This might work :) The BGZF specification (just a couple of pages within the SAM/BAM specification) does not mention encodings. I have not checked the reference implementation source code (https://github.com/samtools/htslib). Do you think we should propose this as a clarification to the definition: https://github.com/samtools/hts-specs/blob/master/SAMv1.tex I imagine how this gets worded would be open to debate, e.g. if using BGZF on UTF-8 encoded unicode files:
I think those are the two logical approaches...
Or:
I should reread our old issues. It looks like I was worried about Python text mode detecting newline mode ... but we expose the BGZF virtual offset anyway so that difference probably does not matter (even in ASCII or Latin1, Python text mode offsets differ from bytes offsets with Windows vs Unix new lines).
# print("Saving %i bytes" % len(block))
if len(block) > 65536:
    raise ValueError(f"{len(block)} Block length > 65536")
assert len(block) <= self._BLOCK_SIZE
Avoid asserts, the exception is preferable.
self._buffer = self._buffer[65535:]
"""Flush data explicitly."""
assert len(self._buffer) < self._BLOCK_SIZE
# TODO is writing an empty buffer/block an intended behavior?
From memory yes, that is intended.
And again, explicit exceptions preferred over asserts.
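To illustrate the point about asserts, here is a minimal sketch of an explicit size check. The names BLOCK_SIZE and check_block are hypothetical and not taken from the PR; the 65536 limit mirrors the value in the diff above. The practical difference is that assert statements are removed when Python runs with the -O flag, while a raised exception always fires.

```python
BLOCK_SIZE = 65536  # assumed limit, mirroring the 65536 used in the diff


def check_block(block: bytes) -> bytes:
    """Return block unchanged, or raise if it exceeds the size limit."""
    if len(block) > BLOCK_SIZE:
        # Unlike an assert, this check still runs under "python -O".
        raise ValueError(f"Block length {len(block)} > {BLOCK_SIZE}")
    return block
```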
For new lines, perhaps all we need to do in the reader in text mode to emulate universal read lines mode is something like map "\r\n" to "\n" and any "\r" to "\n" (subject to some testing vs the standard library behaviour). This seems tangential to the encoding though...
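A rough sketch of that mapping, for discussion only. The function name normalize_newlines is made up here and is not part of Bio.bgzf; it also assumes the whole string is available at once, so a "\r\n" pair split across two reads would need extra handling.

```python
def normalize_newlines(text: str) -> str:
    """Map Windows and old Mac line endings to a plain newline."""
    # Replace "\r\n" first so the pair does not become two newlines.
    return text.replace("\r\n", "\n").replace("\r", "\n")
```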
Thanks for the pointers. I see
I would agree SAM and FASTA usage of BGZF probably assumes ASCII (or Latin1) but there can be unexpected characters in human readable annotation. We can probably avoid splitting multi-byte characters on output, but how feasible is it on input? The Python text IO wrapper must do something similar internally...
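On the input side, one option is the standard library's incremental decoder, which buffers an incomplete trailing byte sequence until the next chunk arrives. The sketch below is only an illustration of that idea: decode_chunks is a hypothetical helper, and the chunks stand in for decompressed BGZF blocks rather than anything the reader in this PR actually does.

```python
import codecs


def decode_chunks(chunks):
    """Yield text from byte chunks, even if a character spans two chunks."""
    decoder = codecs.getincrementaldecoder("utf-8")()
    for chunk in chunks:
        # Incomplete trailing bytes are held back inside the decoder.
        text = decoder.decode(chunk)
        if text:
            yield text
    # final=True raises UnicodeDecodeError if a character was left truncated.
    tail = decoder.decode(b"", final=True)
    if tail:
        yield tail
```

Note that text positions produced this way no longer line up one-to-one with block boundaries, which matters for anything that maps characters back to BGZF virtual offsets.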
Add UTF-8 encoding support to BGZF, both reader and writer.
There were some attempts to do this in the past. Hopefully, this time will be more successful.
The main problem with UTF-8 is multibyte characters, which would be split between BGZF blocks if the writer performs a naive str.encode(). In this PR the writer detects such a multibyte character, which would be split between two blocks, and carries this character (as a whole) to the next block.
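As a rough illustration of the carry-over idea (not the code in this PR), the sketch below finds a safe cut point in already encoded bytes. The helper name split_utf8 and its limit parameter are hypothetical, and it assumes the input is valid UTF-8. It relies on the fact that UTF-8 continuation bytes always have the bit pattern 10xxxxxx, so stepping back past them lands on the start of a character.

```python
def split_utf8(data: bytes, limit: int) -> tuple[bytes, bytes]:
    """Split data into (head, tail) with len(head) <= limit.

    The cut never falls inside a multibyte character, so the whole
    character is carried into tail. Assumes data is valid UTF-8.
    """
    if len(data) <= limit:
        return data, b""
    cut = limit
    # Continuation bytes look like 0b10xxxxxx; step back to a lead byte.
    while cut > 0 and (data[cut] & 0xC0) == 0x80:
        cut -= 1
    return data[:cut], data[cut:]
```

A writer could call this with the block size as limit, emit head as one block, and prepend tail to the data for the next block.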
This implementation should be fully backward compatible with the current user code.
WIP: If the maintainers agree to merge this PR, I'd like to update the docs and add unit tests.
I hereby agree to dual licence this and any previous contributions under both
the Biopython License Agreement AND the BSD 3-Clause License.
I have read the CONTRIBUTING.rst file, have run pre-commit locally, and understand that continuous integration checks will be used to confirm the Biopython unit tests and style checks pass with these changes.
I have added my name to the alphabetical contributors listings in the files NEWS.rst and CONTRIB.rst as part of this pull request, am listed already, or do not wish to be listed. (This acknowledgement is optional.)
Closes #2512 (again)