bgzf: add utf-8 support #4436

ikravets · 2023-09-04T05:43:56Z

Add UTF-8 encoding support to BGZF, both reader and writer.
There were some attempts to do this in the past. Hopefully, this time will be more successful.
The main problem with UTF-8 is multibyte characters, which would be split between BGZF blocks
if the writer performs a naive str.encode(). In this PR the writer detects such multibyte character,
which would be split between two blocks, and carries this character (as a whole) to the next block.

This implementation should be fully backward compatible with the current user code.

WIP: If the maintainers agree to merge this PR, I'd like to update the docs and add unit tests.

I hereby agree to dual licence this and any previous contributions under both
the Biopython License Agreement AND the BSD 3-Clause License.
I have read the CONTRIBUTING.rst file, have run pre-commit
locally, and understand that continuous integration checks will be used to
confirm the Biopython unit tests and style checks pass with these changes.
I have added my name to the alphabetical contributors listings in the files
NEWS.rst and CONTRIB.rst as part of this pull request, am listed
already, or do not wish to be listed. (This acknowledgement is optional.)

Closes #2512 (again)

peterjc · 2023-09-04T07:44:57Z

This might work :)

The BGZF specification (just a couple of pages within the SAM/BAM) specification does not mention encodings. I have not checked the reference implementation source code (bgzf.c and bgzip.c etc):

https://github.com/samtools/htslib

Do you think we should propose this as a clarification to the definition:

https://github.com/samtools/hts-specs/blob/master/SAMv1.tex

I imagine how this gets worded would be open to debate, e.g. if using BGZF on UTF-8 encoded unicode files:

the writer must ensure multi-byte characters are not split between blocks, and the reader may assume this
the writer should ensure multi-byte characters are not split between blocks, and the reader cannot assume this

I think those are the two logical approaches...

peterjc · 2023-09-04T07:49:43Z

Or:

say explicitly unicode and encodings is as per GZIP itself

I should reread our old issues. It looks like I was worried about Python text mode detecting newline mode ... but we expose the BGZF virtual offset anyway so that difference probably does not matter (even in ASCII or Latin1, Python text mode offsets differ from bytes offsets with Windows vs Unix new lines).

peterjc · 2023-09-04T07:51:00Z

Bio/bgzf.py

-        # print("Saving %i bytes" % len(block))
-        if len(block) > 65536:
-            raise ValueError(f"{len(block)} Block length > 65536")
+        assert len(block) <= self._BLOCK_SIZE


Avoid assets, the exception is preferable

peterjc · 2023-09-04T07:52:17Z

Bio/bgzf.py

-            self._buffer = self._buffer[65535:]
+        """Flush data explicitly."""
+        assert len(self._buffer) < self._BLOCK_SIZE
+        # TODO is writing an empty buffer/block an intended behavior?


From memory yes, that is intended.

And again, explicit exceptions preferred over asserts.

peterjc · 2023-09-04T08:26:23Z

For new lines, perhaps all we need to do in the reader in text mode to emulate universal read lines mode is something like map "\r\n" to "\n" and any "\r" to "\n" (subject to some testing vs the standard library behaviour). This seems tangential to the encoding though...

ikravets · 2023-09-05T07:02:09Z

This might work :)

The BGZF specification (just a couple of pages within the SAM/BAM) specification does not mention encodings. I have not checked the reference implementation source code (bgzf.c and bgzip.c etc):

https://github.com/samtools/htslib

Do you think we should propose this as a clarification to the definition:

https://github.com/samtools/hts-specs/blob/master/SAMv1.tex

I imagine how this gets worded would be open to debate, e.g. if using BGZF on UTF-8 encoded unicode files:
* the writer _must_ ensure multi-byte characters are not split between blocks, and the reader may assume this

* the writer _should_ ensure multi-byte characters are not split between blocks, and the reader _cannot_ assume this
I think those are the two logical approaches...

Thanks for the pointers. I see bgzf.c implementation uses void *data in API, leaving all the encoding/decoding to the caller. It also seems that the callers ignore the character splitting issue completely. My guess is that UTF-8 is only allowed by SAM specs in some fields of the header (e.g. @CO) and they do not intend to seek into the middle of such field.

peterjc · 2023-09-05T08:07:49Z

I would agree SAM and FASTA usage of BGZF probably assumes ASCII (or Latin1) but there can be unexpected characters in human readable annotation.

We can probably avoid splitting multi-bytes characters on output, but how feasible is it on input? The Python text IO wrapper must do something similar internally...

bgzf: add utf-8 support

589ad79

ikravets requested a review from peterjc as a code owner September 4, 2023 05:43

peterjc reviewed Sep 4, 2023

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

bgzf: add utf-8 support #4436

bgzf: add utf-8 support #4436

ikravets commented Sep 4, 2023

peterjc commented Sep 4, 2023

peterjc commented Sep 4, 2023

peterjc Sep 4, 2023

peterjc Sep 4, 2023

peterjc commented Sep 4, 2023

ikravets commented Sep 5, 2023

peterjc commented Sep 5, 2023

bgzf: add utf-8 support #4436

Are you sure you want to change the base?

bgzf: add utf-8 support #4436

Conversation

ikravets commented Sep 4, 2023

peterjc commented Sep 4, 2023

peterjc commented Sep 4, 2023

peterjc Sep 4, 2023

Choose a reason for hiding this comment

peterjc Sep 4, 2023

Choose a reason for hiding this comment

peterjc commented Sep 4, 2023

ikravets commented Sep 5, 2023

peterjc commented Sep 5, 2023