Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

bgzf: add utf-8 support #4436

Open
wants to merge 1 commit into
base: master
Choose a base branch
from
Open

bgzf: add utf-8 support #4436

wants to merge 1 commit into from

Conversation

ikravets
Copy link

@ikravets ikravets commented Sep 4, 2023

Add UTF-8 encoding support to BGZF, both reader and writer.
There were some attempts to do this in the past. Hopefully, this time will be more successful.
The main problem with UTF-8 is multibyte characters, which would be split between BGZF blocks
if the writer performs a naive str.encode(). In this PR the writer detects such multibyte character,
which would be split between two blocks, and carries this character (as a whole) to the next block.

This implementation should be fully backward compatible with the current user code.

WIP: If the maintainers agree to merge this PR, I'd like to update the docs and add unit tests.


  • I hereby agree to dual licence this and any previous contributions under both
    the Biopython License Agreement AND the BSD 3-Clause License.

  • I have read the CONTRIBUTING.rst file, have run pre-commit
    locally, and understand that continuous integration checks will be used to
    confirm the Biopython unit tests and style checks pass with these changes.

  • I have added my name to the alphabetical contributors listings in the files
    NEWS.rst and CONTRIB.rst as part of this pull request, am listed
    already, or do not wish to be listed. (This acknowledgement is optional.)

Closes #2512 (again)

@peterjc
Copy link
Member

peterjc commented Sep 4, 2023

This might work :)

The BGZF specification (just a couple of pages within the SAM/BAM) specification does not mention encodings. I have not checked the reference implementation source code (bgzf.c and bgzip.c etc):

https://github.com/samtools/htslib

Do you think we should propose this as a clarification to the definition:

https://github.com/samtools/hts-specs/blob/master/SAMv1.tex

I imagine how this gets worded would be open to debate, e.g. if using BGZF on UTF-8 encoded unicode files:

  • the writer must ensure multi-byte characters are not split between blocks, and the reader may assume this
  • the writer should ensure multi-byte characters are not split between blocks, and the reader cannot assume this

I think those are the two logical approaches...

@peterjc
Copy link
Member

peterjc commented Sep 4, 2023

Or:

  • say explicitly unicode and encodings is as per GZIP itself

I should reread our old issues. It looks like I was worried about Python text mode detecting newline mode ... but we expose the BGZF virtual offset anyway so that difference probably does not matter (even in ASCII or Latin1, Python text mode offsets differ from bytes offsets with Windows vs Unix new lines).

# print("Saving %i bytes" % len(block))
if len(block) > 65536:
raise ValueError(f"{len(block)} Block length > 65536")
assert len(block) <= self._BLOCK_SIZE
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Avoid assets, the exception is preferable

self._buffer = self._buffer[65535:]
"""Flush data explicitly."""
assert len(self._buffer) < self._BLOCK_SIZE
# TODO is writing an empty buffer/block an intended behavior?
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From memory yes, that is intended.

And again, explicit exceptions preferred over asserts.

@peterjc
Copy link
Member

peterjc commented Sep 4, 2023

For new lines, perhaps all we need to do in the reader in text mode to emulate universal read lines mode is something like map "\r\n" to "\n" and any "\r" to "\n" (subject to some testing vs the standard library behaviour). This seems tangential to the encoding though...

@ikravets
Copy link
Author

ikravets commented Sep 5, 2023

This might work :)

The BGZF specification (just a couple of pages within the SAM/BAM) specification does not mention encodings. I have not checked the reference implementation source code (bgzf.c and bgzip.c etc):

https://github.com/samtools/htslib

Do you think we should propose this as a clarification to the definition:

https://github.com/samtools/hts-specs/blob/master/SAMv1.tex

I imagine how this gets worded would be open to debate, e.g. if using BGZF on UTF-8 encoded unicode files:

* the writer _must_ ensure multi-byte characters are not split between blocks, and the reader may assume this

* the writer _should_ ensure multi-byte characters are not split between blocks, and the reader _cannot_ assume this

I think those are the two logical approaches...

Thanks for the pointers. I see bgzf.c implementation uses void *data in API, leaving all the encoding/decoding to the caller. It also seems that the callers ignore the character splitting issue completely. My guess is that UTF-8 is only allowed by SAM specs in some fields of the header (e.g. @CO) and they do not intend to seek into the middle of such field.

@peterjc
Copy link
Member

peterjc commented Sep 5, 2023

I would agree SAM and FASTA usage of BGZF probably assumes ASCII (or Latin1) but there can be unexpected characters in human readable annotation.

We can probably avoid splitting multi-bytes characters on output, but how feasible is it on input? The Python text IO wrapper must do something similar internally...

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Unicode encoding in BGZF
2 participants