
ENH: Add serialization options to to_csv #48478

Open
1 of 3 tasks
adrien-berchet opened this issue Sep 9, 2022 · 9 comments
Labels
Enhancement IO CSV read_csv, to_csv Needs Triage Issue that has not been reviewed by a pandas team member

Comments

@adrien-berchet

Feature Type

  • Adding new functionality to pandas

  • Changing existing functionality in pandas

  • Removing existing functionality in pandas

Problem Description

Sometimes we store Python objects (e.g. numpy.ndarray objects) in a column of a DataFrame and want to write it to a CSV file. In this case, each array is converted to a string using the object's __str__ method, which is not the best format for later parsing. I suggest adding an option similar to the cls parameter of json.dumps, which allows encoding specific types in a custom format.

Feature Description

The user can define an encoder:

import numpy


def csv_encoder(obj):
    if isinstance(obj, numpy.ndarray):
        return obj.tolist()
    return obj

Then we can pass it to the to_csv method, which would apply it to each element of the DataFrame:

df.to_csv("/tmp/file.csv", element_encoder=csv_encoder)

Internally, we could just call

df.applymap(element_encoder)

just before saving the file.
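Pending such an option, the proposed behavior can be emulated today with a small wrapper. This is a sketch: the helper name to_csv_encoded mirrors the proposed element_encoder semantics and is not pandas API.

```python
import io
import json

import numpy as np
import pandas as pd


def to_csv_encoded(df, path_or_buf, element_encoder, **kwargs):
    # Hypothetical helper mirroring the proposed option: apply the
    # encoder element-wise, then delegate to a normal to_csv call.
    # DataFrame.map replaced applymap in pandas 2.1; fall back on
    # applymap for older versions.
    encode = getattr(df, "map", df.applymap)
    return encode(element_encoder).to_csv(path_or_buf, **kwargs)


def csv_encoder(obj):
    # Serialize arrays as JSON lists so they can be parsed back later.
    if isinstance(obj, np.ndarray):
        return json.dumps(obj.tolist())
    return obj


df = pd.DataFrame({"a": [1, 2], "b": [np.arange(3), np.arange(1, 4)]})
buf = io.StringIO()
to_csv_encoded(df, buf, csv_encoder, index=False)
print(buf.getvalue())
```

The array column is written as JSON text (e.g. a quoted `[0, 1, 2]` field) instead of numpy's space-separated repr.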

Alternative Solutions

Another solution is to format the DataFrame manually before each call to to_csv():

df = pd.DataFrame({"a": [1, 2, 3], "b": ["a", "b", "c"], "c": [np.array(range(0, 3)), np.array(range(1, 4)), np.array(range(2, 5))]})
formatted_df = df.copy()
formatted_df["c"] = formatted_df["c"].apply(lambda x: x.tolist())
formatted_df.to_csv("/tmp/file.csv")

Additional Context

No response

@adrien-berchet adrien-berchet added Enhancement Needs Triage Issue that has not been reviewed by a pandas team member labels Sep 9, 2022
@Pvlb1998

Pvlb1998 commented Oct 19, 2022

Is this issue still relevant in the latest version of pandas?

@Pvlb1998

Take

@adrien-berchet
Author

> Is this issue still relevant in the latest version of pandas?

As far as I know yes

@Pvlb1998

Pvlb1998 commented Oct 31, 2022

Great! If possible, for the encoder you mention in Feature Description, where exactly in the pandas code do you want to apply it? You used the example of storing Python objects (numpy.ndarray objects), but I wasn't quite sure whether it applies only to numpy or to any function where Python objects can be stored.

@adrien-berchet
Author

Hi!

Let me give a more complete example in which we want to export a DataFrame with four columns (of int, np.array, list, and a custom type):

import json
import numpy as np
import pandas as pd
import re


class Something:
    def __init__(self, a):
        self.a = a


df = pd.DataFrame({"a": [1, 2, 3], "b": [np.arange(0, 5), np.arange(1, 6), np.arange(2, 7)]})
df["c"] = df["b"].apply(lambda x: x.tolist())
df["d"] = [Something(i) for i in range(3)]


def csv_encoder(elem):
    if isinstance(elem, Something):
        elem = f"Something({elem.a})"

    try:
        elem = elem.tolist()
    except AttributeError:
        pass
    
    try:
        elem = json.dumps(elem)
    except TypeError:
        pass

    return elem


def csv_decoder(elem):
    try:
        elem = json.loads(elem)
    except (TypeError, json.JSONDecodeError):
        pass

    if isinstance(elem, str):
        match = re.match(r"Something\((\d+)\)", elem)
        if match:
            elem = Something(int(match.group(1)))

    return elem


df.to_csv("/tmp/test_export.csv", index=False)
default_df = pd.read_csv("/tmp/test_export.csv")
print("\nDefault:\n", default_df)
print(type(default_df.loc[0, "d"]))

df.applymap(csv_encoder).to_csv("/tmp/test_export.csv", index=False)
applymap_df = pd.read_csv("/tmp/test_export.csv").applymap(csv_decoder)
print("\nWith applymap:\n", applymap_df)
print(type(applymap_df.loc[0, "d"]))

This code outputs the following:

Default:
a,b,c,d
1,[0 1 2 3 4],"[0, 1, 2, 3, 4]",<__main__.Something object at 0x7fbe21c6c760>
2,[1 2 3 4 5],"[1, 2, 3, 4, 5]",<__main__.Something object at 0x7fbe21ec1670>
3,[2 3 4 5 6],"[2, 3, 4, 5, 6]",<__main__.Something object at 0x7fbe24d4c4c0>

   a            b                c                                              d
0  1  [0 1 2 3 4]  [0, 1, 2, 3, 4]  <__main__.Something object at 0x7fbe21c6c760>
1  2  [1 2 3 4 5]  [1, 2, 3, 4, 5]  <__main__.Something object at 0x7fbe21ec1670>
2  3  [2 3 4 5 6]  [2, 3, 4, 5, 6]  <__main__.Something object at 0x7fbe24d4c4c0>
<class 'str'>
With applymap:
a,b,c,d
1,"[0, 1, 2, 3, 4]","[0, 1, 2, 3, 4]","""Something(0)"""
2,"[1, 2, 3, 4, 5]","[1, 2, 3, 4, 5]","""Something(1)"""
3,"[2, 3, 4, 5, 6]","[2, 3, 4, 5, 6]","""Something(2)"""

   a                b                c                                              d
0  1  [0, 1, 2, 3, 4]  [0, 1, 2, 3, 4]  <__main__.Something object at 0x7fbe24d4cb50>
1  2  [1, 2, 3, 4, 5]  [1, 2, 3, 4, 5]  <__main__.Something object at 0x7fbe24d4c310>
2  3  [2, 3, 4, 5, 6]  [2, 3, 4, 5, 6]  <__main__.Something object at 0x7fbe24cb9ac0>
<class '__main__.Something'>

As you can see, using applymap casts the np.array and Something objects to proper string values before writing the file, and then reloads them properly (though in this case the np.array is loaded as a list). What I suggest is to add this mechanism inside the to_csv/read_csv functions to simplify the code.

The IO part of the example would thus become:

df.to_csv("/tmp/test_export.csv", element_encoder=csv_encoder, index=False)
encoder_df = pd.read_csv("/tmp/test_export.csv", element_decoder=csv_decoder)

It's not a major improvement, but I think it can simplify the use of CSV files with custom types.

So, I don't know much about the pandas code, but I guess it could go in the CSVFormatter (in pandas/io/formats/csvs.py:L47), which could look like:

class CSVFormatter:
    cols: np.ndarray

    def __init__(
        self,
        formatter: DataFrameFormatter,
        path_or_buf: FilePath | WriteBuffer[str] | WriteBuffer[bytes] = "",
        sep: str = ",",
        cols: Sequence[Hashable] | None = None,
        index_label: IndexLabel | None = None,
        mode: str = "w",
        encoding: str | None = None,
        errors: str = "strict",
        compression: CompressionOptions = "infer",
        quoting: int | None = None,
        lineterminator: str | None = "\n",
        chunksize: int | None = None,
        quotechar: str | None = '"',
        date_format: str | None = None,
        doublequote: bool = True,
        escapechar: str | None = None,
        storage_options: StorageOptions = None,
        csv_encoder=None,
    ) -> None:
        self.fmt = formatter

        if csv_encoder is None:
            self.obj = self.fmt.frame
        else:
            self.obj = self.fmt.frame.applymap(csv_encoder)
        ...

And for the read_csv function, the decoder could be passed to the TextFileReader constructor and then used in the read method, which would just apply it to the df created at pandas/io/parsers/readers.py:L1811.
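For reference, on the reading side pandas already exposes a per-column hook: the converters argument of read_csv, whose callables receive each raw cell string before type inference. It covers part of what element_decoder would do, though column-wise rather than element-wise over the whole frame:

```python
import io
import json

import pandas as pd

# A CSV with a JSON-encoded list column, as produced by the encoder above.
csv_text = 'a,b\n1,"[0, 1, 2]"\n2,"[1, 2, 3]"\n'

# converters maps column names (or indices) to callables applied to the
# raw string value of each cell in that column.
df = pd.read_csv(io.StringIO(csv_text), converters={"b": json.loads})
print(type(df.loc[0, "b"]))  # each cell of "b" is now a Python list
```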

P.S.: The lines in the files are given for the commit eb69d8943f of pandas.

@jreback
Contributor

jreback commented Nov 1, 2022

-1 on adding anything like this to csv

csv is not a high-fidelity format and is not designed in any way for this. You are basically asking pandas to heroically do things, and that's just not supported.

Most of the binary formats (e.g. parquet) fully support nested structures.

@adrien-berchet
Author

This issue is just about properly serializing objects stored in a DataFrame, which does not look so heroic to me, since pandas would just apply a function provided by the user. And pandas actually already does this implicitly; the main issue is that it calls the __repr__ method of the objects, which is not always relevant.
I just think it would simplify some code to have an option to call a specific function or method (e.g. calling a serialize method instead of __repr__), but it's no big deal if you disagree.

@adrien-berchet
Author

Should I close this issue @Pvlb1998 @jreback ?

@simonjayhawkins simonjayhawkins added the IO CSV read_csv, to_csv label Feb 6, 2024
@odelmarcelle

odelmarcelle commented Apr 30, 2024

Just came across this issue. It's niche, but an argument allowing the user to pass a function handling the serialization of complex types would be helpful.

In my case, I want to serialize lists as semicolon-separated values. Even though the problem is solved easily by calling something like df.map(lambda x: ';'.join(x) if isinstance(x, list) else x) before to_csv, it's not the first time I've come across that need. I think CSV is still one of the easiest formats to use and share. Any tool can open a text file and inspect its content, as opposed to binary formats such as parquet.
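A round trip for that semicolon case can be sketched like this (the column name "tags" is illustrative), using apply on the way out and a converters callable on the way back:

```python
import io

import pandas as pd

df = pd.DataFrame({"id": [1, 2], "tags": [["x", "y"], ["z"]]})

# Encode: join each list into a single semicolon-separated string.
out = df.copy()
out["tags"] = out["tags"].apply(";".join)
buf = io.StringIO()
out.to_csv(buf, index=False)

# Decode: split the column back into lists while reading.
buf.seek(0)
back = pd.read_csv(buf, converters={"tags": lambda s: s.split(";")})
print(back.loc[0, "tags"])  # ['x', 'y']
```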

The argument could behave like dtype in read_csv: applied to the entire DataFrame or to specific columns. The argument could also be used in read_csv to customize deserialization.

No branches or pull requests

5 participants