Compressed(gzip) csv files support. #4039
Greetings.
Apologies for chiming in. I have dealt with those problems in 3 different ways:
1) When only technical information about the data and types is needed (without the requirement of being human readable), then in my opinion Parquet is the best choice, especially when moving tables between DBs without a direct DB link or network connection. For that purpose I wrote https://github.com/manticore-projects/JDBCParquetWriter, which can transfer any JDBC ResultSet into a Parquet file (with compression). Lots of DBs read Parquet files directly these days.
2) If human readability is an issue, then SQLSheet can help. I took over maintenance and amended the MetaData and Stream capabilities: https://github.com/manticore-projects/sqlsheet
Especially regarding data type information, it allows maintaining extra header rows to store that information (for both reading and writing). The resulting XLSX files are ZIP compressed already, and since styles are used they don't carry too much overhead. It's a good trade-off in my opinion.
3) For everything else, simple, stupid CSV export is fine in my opinion, and yes, compression can be helpful. Be careful of ZIP bombs and the like, though; they are a real issue. Invest in Apache Commons Compress or a similar well-aged solution in order to avoid reinventing the wheel.
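To make the ZIP-bomb concern concrete: a gzip bomb is a tiny compressed payload that inflates to an enormous output, so any decompressing reader should cap the number of bytes it is willing to produce. A minimal sketch using only `java.util.zip` (Commons Compress offers similar building blocks; the class and method names here are invented for illustration):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class SafeGunzip {

    /**
     * Decompresses a gzip stream, but aborts as soon as more than
     * maxBytes of output have been produced, limiting the damage a
     * gzip bomb can do to memory and disk.
     */
    public static byte[] gunzipLimited(InputStream compressed, long maxBytes) throws IOException {
        try (GZIPInputStream gz = new GZIPInputStream(compressed)) {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buf = new byte[8192];
            long total = 0;
            int n;
            while ((n = gz.read(buf)) != -1) {
                total += n;
                if (total > maxBytes) {
                    throw new IOException("decompressed size exceeds limit of " + maxBytes + " bytes");
                }
                out.write(buf, 0, n);
            }
            return out.toByteArray();
        }
    }

    public static void main(String[] args) throws IOException {
        // A highly repetitive payload compresses extremely well --
        // exactly the property a gzip bomb abuses.
        byte[] raw = new byte[1_000_000]; // a million zero bytes
        ByteArrayOutputStream packed = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(packed)) {
            gz.write(raw);
        }
        System.out.println("compressed size: " + packed.size());

        // A generous limit succeeds; a tight one aborts early.
        byte[] restored = gunzipLimited(new ByteArrayInputStream(packed.toByteArray()), 2_000_000);
        System.out.println("restored size: " + restored.length);
        try {
            gunzipLimited(new ByteArrayInputStream(packed.toByteArray()), 10_000);
        } catch (IOException expected) {
            System.out.println("tight limit rejected the stream");
        }
    }
}
```

The same cap idea applies regardless of the codec chosen, which is one argument for funnelling every supported compression format through one guarded reader.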
Feel free to count me in; I have a big interest in such things, as we regularly move data between databases.
All the best
Andreas
On Sat, 2024-04-06 at 23:09 -0700, Evgenij Ryazanov wrote:
Both CSVREAD and CSVWRITE have significant design flaws. Modern versions of H2 don't have any universal data type any more, and there is no way to specify data types to parse values from CSV properly. There are some problems with really large files, the parameters are obscure, and so on. From my personal point of view, we should deprecate both of these functions and introduce a better-designed replacement for them, maybe something similar to the COPY command from PostgreSQL. We can try to add additional filters in the new API, but changes to these legacy functions are premature at this point. If you need a quick fix, you can write some user-defined functions for your application.
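To illustrate the suggested quick fix: the core of a user-defined function that reads gzip-compressed CSV can be written with the plain JDK and registered in H2 via its CREATE ALIAS mechanism. A hedged sketch (the class name GzipCsv and the naive comma split are assumptions; real CSV needs proper quoting support, e.g. via org.h2.tools.Csv, which is not shown here):

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class GzipCsv {

    /**
     * Reads a gzip-compressed CSV stream into rows of String fields.
     * Naive comma split, no quoting/escaping support -- illustration only.
     */
    public static List<String[]> read(InputStream compressed) throws IOException {
        List<String[]> rows = new ArrayList<>();
        try (BufferedReader r = new BufferedReader(
                new InputStreamReader(new GZIPInputStream(compressed), StandardCharsets.UTF_8))) {
            String line;
            while ((line = r.readLine()) != null) {
                rows.add(line.split(",", -1));
            }
        }
        return rows;
    }

    public static void main(String[] args) throws IOException {
        // Round-trip a small in-memory CSV through gzip.
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(bos)) {
            gz.write("ID,NAME\n1,Hello\n2,World\n".getBytes(StandardCharsets.UTF_8));
        }
        List<String[]> rows = read(new ByteArrayInputStream(bos.toByteArray()));
        System.out.println(rows.size() + " rows, header: " + String.join("|", rows.get(0)));
    }
}
```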
GZIP actually provides a relatively low compression ratio for files in this format, so if we introduce it, the next feature request will be about BZIP2, PPMd, or some other compression method. Maybe the possibility to pass data to a third-party program, and to read data from a third-party program, would be a better option, but it should be carefully reviewed from a security perspective.
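For context on the compression-ratio point, it is easy to measure what gzip actually buys on CSV-shaped data with the JDK alone. An illustrative sketch (the synthetic sample data and the class name are invented; real ratios depend heavily on the data):

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPOutputStream;

public class CsvGzipRatio {

    /** Returns the gzip-compressed size in bytes of the given data. */
    public static int gzipSize(byte[] data) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(bos)) {
            gz.write(data);
        }
        return bos.size();
    }

    public static void main(String[] args) throws IOException {
        // Synthetic CSV: numeric IDs and short strings, loosely
        // resembling a typical table export.
        StringBuilder csv = new StringBuilder("ID,NAME,VALUE\n");
        for (int i = 0; i < 10_000; i++) {
            csv.append(i).append(",row-").append(i).append(',').append(i * 37 % 1000).append('\n');
        }
        byte[] raw = csv.toString().getBytes(StandardCharsets.UTF_8);
        int packed = gzipSize(raw);
        System.out.printf("raw=%d bytes, gzip=%d bytes, ratio=%.1fx%n",
                raw.length, packed, (double) raw.length / packed);
    }
}
```

Running the same measurement with other codecs (e.g. via Commons Compress) would quantify how much better BZIP2 or similar methods do on the same data.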
Hello.
The code should be easy, given that all the chores are encapsulated within the CSVREAD/CSVWRITE functions. I will try to write that code and create a PR if there are no objections to that functionality. Just let me know if there are any.
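A possible shape for the core of such a change: compression support for CSVREAD/CSVWRITE can reduce to wrapping the underlying stream, for instance keyed off a ".gz" file-name suffix. A sketch under those assumptions (the suffix-based detection and the class name are mine, not H2's actual design; H2 could equally use an explicit option in the function's parameter string):

```java
import java.io.BufferedInputStream;
import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class CompressedOpen {

    /**
     * Opens a file for reading, transparently decompressing it when the
     * name ends with ".gz". Everything downstream (the CSV parser) keeps
     * working on a plain InputStream and never sees the compression.
     */
    public static InputStream open(String fileName) throws IOException {
        InputStream in = new BufferedInputStream(new FileInputStream(fileName));
        return fileName.endsWith(".gz") ? new GZIPInputStream(in) : in;
    }

    public static void main(String[] args) throws IOException {
        // Write a tiny gzipped CSV file, then read it back transparently.
        File f = File.createTempFile("demo", ".csv.gz");
        f.deleteOnExit();
        try (GZIPOutputStream gz = new GZIPOutputStream(new FileOutputStream(f))) {
            gz.write("ID,NAME\n1,Hello\n".getBytes(StandardCharsets.UTF_8));
        }
        try (BufferedReader r = new BufferedReader(
                new InputStreamReader(open(f.getPath()), StandardCharsets.UTF_8))) {
            System.out.println(r.readLine()); // the CSV header line
        }
    }
}
```

The symmetric write side would wrap the OutputStream in a GZIPOutputStream the same way.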