
ENH: Add serialization options to to_csv #48478

Open
1 of 3 tasks
adrien-berchet opened this issue Sep 9, 2022 · 9 comments
Labels
Enhancement IO CSV read_csv, to_csv Needs Triage Issue that has not been reviewed by a pandas team member

Comments

@adrien-berchet

Feature Type

  • Adding new functionality to pandas

  • Changing existing functionality in pandas

  • Removing existing functionality in pandas

Problem Description

Sometimes we store Python objects (e.g. numpy.ndarray objects) in a column of a DataFrame and want to write it to a CSV file. In this case, each array is converted to a string using the object's __str__ method, which is not the best format for later parsing. I suggest adding an option similar to the cls parameter of json.dumps, which allows encoding specific types in a custom format.

Feature Description

The user can define an encoder:

import numpy


def csv_encoder(obj):
    if isinstance(obj, numpy.ndarray):
        return obj.tolist()
    return obj

Then we can pass it to the to_csv method, which would apply it to each element of the DataFrame:

df.to_csv("/tmp/file.csv", element_encoder=csv_encoder)

Internally, we could just call

df.applymap(element_encoder)

just before saving the file.
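Pending such an option, the proposed behavior can be emulated today with a small wrapper. This is a sketch: the helper name to_csv_encoded mirrors the proposed element_encoder semantics and is not pandas API.

```python
import io
import json

import numpy as np
import pandas as pd


def to_csv_encoded(df, path_or_buf, element_encoder, **kwargs):
    # Hypothetical helper mirroring the proposed option: apply the
    # encoder element-wise, then delegate to a normal to_csv call.
    # DataFrame.map replaced applymap in pandas 2.1; fall back on
    # applymap for older versions.
    encode = getattr(df, "map", df.applymap)
    return encode(element_encoder).to_csv(path_or_buf, **kwargs)


def csv_encoder(obj):
    # Serialize arrays as JSON lists so they can be parsed back later.
    if isinstance(obj, np.ndarray):
        return json.dumps(obj.tolist())
    return obj


df = pd.DataFrame({"a": [1, 2], "b": [np.arange(3), np.arange(1, 4)]})
buf = io.StringIO()
to_csv_encoded(df, buf, csv_encoder, index=False)
print(buf.getvalue())
```

The array column is written as JSON text (e.g. a quoted `[0, 1, 2]` field) instead of numpy's space-separated repr.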

Alternative Solutions

Another solution is to format the DataFrame manually before each call to to_csv():

df = pd.DataFrame({"a": [1, 2, 3], "b": ["a", "b", "c"], "c": [np.array(range(0, 3)), np.array(range(1, 4)), np.array(range(2, 5))]})
formatted_df = df.copy()
formatted_df["c"] = formatted_df["c"].apply(lambda x: x.tolist())
formatted_df.to_csv("/tmp/file.csv")

Additional Context

No response

@adrien-berchet adrien-berchet added Enhancement Needs Triage Issue that has not been reviewed by a pandas team member labels Sep 9, 2022
@Pvlb1998

Pvlb1998 commented Oct 19, 2022

Is this issue still relevant in the latest version of pandas?

@Pvlb1998

Take

@adrien-berchet
Author

> Is this issue still relevant in the latest version of pandas?

As far as I know yes

@Pvlb1998

Pvlb1998 commented Oct 31, 2022

Great! If possible, for the encoder you mention in Feature Description, where exactly in the pandas code do you want to apply it? You used the example of storing Python objects (numpy.ndarray objects), but I wasn't quite sure whether it applies only to numpy or to any function where Python objects can be stored.

@adrien-berchet
Author

Hi!

Let me give a more complete example in which we want to export a DataFrame with four columns (of int, np.array, list, and a custom type):

import json
import numpy as np
import pandas as pd
import re


class Something:
    def __init__(self, a):
        self.a = a


df = pd.DataFrame({"a": [1, 2, 3], "b": [np.arange(0, 5), np.arange(1, 6), np.arange(2, 7)]})
df["c"] = df["b"].apply(lambda x: x.tolist())
df["d"] = [Something(i) for i in range(3)]


def csv_encoder(elem):
    if isinstance(elem, Something):
        elem = f"Something({elem.a})"

    try:
        elem = elem.tolist()
    except AttributeError:
        pass
    
    try:
        elem = json.dumps(elem)
    except TypeError:
        pass

    return elem


def csv_decoder(elem):
    try:
        elem = json.loads(elem)
    except (TypeError, json.JSONDecodeError):
        pass

    if isinstance(elem, str):
        match = re.match(r"Something\((\d+)\)", elem)
        if match:
            elem = Something(int(match.group(1)))

    return elem


df.to_csv("/tmp/test_export.csv", index=False)
default_df = pd.read_csv("/tmp/test_export.csv")
print("\nDefault:\n", default_df)
print(type(default_df.loc[0, "d"]))

df.applymap(csv_encoder).to_csv("/tmp/test_export.csv", index=False)
applymap_df = pd.read_csv("/tmp/test_export.csv").applymap(csv_decoder)
print("\nWith applymap:\n", applymap_df)
print(type(applymap_df.loc[0, "d"]))

This code outputs the following:

Default:
a,b,c,d
1,[0 1 2 3 4],"[0, 1, 2, 3, 4]",<__main__.Something object at 0x7fbe21c6c760>
2,[1 2 3 4 5],"[1, 2, 3, 4, 5]",<__main__.Something object at 0x7fbe21ec1670>
3,[2 3 4 5 6],"[2, 3, 4, 5, 6]",<__main__.Something object at 0x7fbe24d4c4c0>

   a            b                c                                              d
0  1  [0 1 2 3 4]  [0, 1, 2, 3, 4]  <__main__.Something object at 0x7fbe21c6c760>
1  2  [1 2 3 4 5]  [1, 2, 3, 4, 5]  <__main__.Something object at 0x7fbe21ec1670>
2  3  [2 3 4 5 6]  [2, 3, 4, 5, 6]  <__main__.Something object at 0x7fbe24d4c4c0>
<class 'str'>
With applymap:
a,b,c,d
1,"[0, 1, 2, 3, 4]","[0, 1, 2, 3, 4]","""Something(0)"""
2,"[1, 2, 3, 4, 5]","[1, 2, 3, 4, 5]","""Something(1)"""
3,"[2, 3, 4, 5, 6]","[2, 3, 4, 5, 6]","""Something(2)"""

   a                b                c                                              d
0  1  [0, 1, 2, 3, 4]  [0, 1, 2, 3, 4]  <__main__.Something object at 0x7fbe24d4cb50>
1  2  [1, 2, 3, 4, 5]  [1, 2, 3, 4, 5]  <__main__.Something object at 0x7fbe24d4c310>
2  3  [2, 3, 4, 5, 6]  [2, 3, 4, 5, 6]  <__main__.Something object at 0x7fbe24cb9ac0>
<class '__main__.Something'>

As you can see, using applymap casts the np.array and Something objects to proper string values before writing the file, and then reloads them properly (though in this case the np.array is loaded as a list). What I suggest is to add this mechanism inside the to_csv/read_csv functions to simplify the code.

The IO part of the example would thus become:

df.to_csv("/tmp/test_export.csv", element_encoder=csv_encoder, index=False)
encoder_df = pd.read_csv("/tmp/test_export.csv", element_decoder=csv_decoder)

It's not a major improvement, but I think it can simplify the use of CSV files with custom types.

So, I don't know much about the pandas code, but I guess it could go in the CSVFormatter (in pandas/io/formats/csvs.py:L47), which could look like:

class CSVFormatter:
    cols: np.ndarray

    def __init__(
        self,
        formatter: DataFrameFormatter,
        path_or_buf: FilePath | WriteBuffer[str] | WriteBuffer[bytes] = "",
        sep: str = ",",
        cols: Sequence[Hashable] | None = None,
        index_label: IndexLabel | None = None,
        mode: str = "w",
        encoding: str | None = None,
        errors: str = "strict",
        compression: CompressionOptions = "infer",
        quoting: int | None = None,
        lineterminator: str | None = "\n",
        chunksize: int | None = None,
        quotechar: str | None = '"',
        date_format: str | None = None,
        doublequote: bool = True,
        escapechar: str | None = None,
        storage_options: StorageOptions = None,
        csv_encoder=None,
    ) -> None:
        self.fmt = formatter

        if csv_encoder is None:
            self.obj = self.fmt.frame
        else:
            self.obj = self.fmt.frame.applymap(csv_encoder)
        ...

And for the read_csv function, the decoder could be passed to the TextFileReader constructor and then used in the read method, which would just apply it to the df created at pandas/io/parsers/readers.py:L1811.
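For reference, on the reading side pandas already exposes a per-column hook: the converters argument of read_csv, whose callables receive each raw cell string before type inference. It covers part of what element_decoder would do, though column-wise rather than element-wise over the whole frame:

```python
import io
import json

import pandas as pd

# A CSV with a JSON-encoded list column, as produced by the encoder above.
csv_text = 'a,b\n1,"[0, 1, 2]"\n2,"[1, 2, 3]"\n'

# converters maps column names (or indices) to callables applied to the
# raw string value of each cell in that column.
df = pd.read_csv(io.StringIO(csv_text), converters={"b": json.loads})
print(type(df.loc[0, "b"]))  # each cell of "b" is now a Python list
```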

P.S.: The lines in the files are given for the commit eb69d8943f of pandas.

@jreback
Contributor

jreback commented Nov 1, 2022

-1 on adding anything like this to csv

csv is not a high-fidelity format and is not designed in any way for this. You are basically asking pandas to heroically do things, and that's just not supported.

Most of the binary formats (e.g. parquet) fully support nested structures.

@adrien-berchet
Author

This issue is just about properly serializing objects stored in a DataFrame, which does not look so heroic to me, since pandas would just apply a function provided by the user. And pandas actually already does this implicitly; the main issue is that it calls the __repr__ method of the objects, which is not always relevant.
I just think it would simplify some code to have an option to call a specific function or method (e.g. calling a serialize method instead of __repr__), but it's no big deal if you disagree.

@adrien-berchet
Author

Should I close this issue @Pvlb1998 @jreback ?

@simonjayhawkins simonjayhawkins added the IO CSV read_csv, to_csv label Feb 6, 2024
@odelmarcelle

odelmarcelle commented Apr 30, 2024

Just came across this issue. It's niche, but an argument allowing the user to pass a function handling the serialization of complex types would be helpful.

In my case, I want to serialize lists as semicolon-separated values. Even though the problem is solved easily by calling something like df.map(lambda x: ';'.join(x) if isinstance(x, list) else x) before to_csv, it's not the first time I've come across that need. I think CSV is still one of the easiest formats to use and share. Any tool can open a text file and inspect its content, as opposed to binary formats such as parquet.
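A round trip for that semicolon case can be sketched like this (the column name "tags" is illustrative), using apply on the way out and a converters callable on the way back:

```python
import io

import pandas as pd

df = pd.DataFrame({"id": [1, 2], "tags": [["x", "y"], ["z"]]})

# Encode: join each list into a single semicolon-separated string.
out = df.copy()
out["tags"] = out["tags"].apply(";".join)
buf = io.StringIO()
out.to_csv(buf, index=False)

# Decode: split the column back into lists while reading.
buf.seek(0)
back = pd.read_csv(buf, converters={"tags": lambda s: s.split(";")})
print(back.loc[0, "tags"])  # ['x', 'y']
```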

The argument could behave like dtype in read_csv: applied to the entire DataFrame or to specific columns. The argument could also be used in read_csv to customize deserialization.

No branches or pull requests

5 participants