ENH: Add serialization options to to_csv #48478
Is this issue still relevant in the latest version of pandas?

Take

As far as I know, yes.

Great! If possible, for the encoder you mention in the Feature Description, where exactly would you want to apply it in pandas' code? You used the example "store Python objects (numpy.ndarray objects)", but I wasn't quite sure if it's only for numpy or for any function where you could store Python objects.
Hi! Let me give a more complete example in which we want to export a DF with 4 columns (of `int`, `np.array`, `list` and a custom type):

```python
import json
import re

import numpy as np
import pandas as pd


class Something:
    def __init__(self, a):
        self.a = a


df = pd.DataFrame({"a": [1, 2, 3], "b": [np.arange(0, 5), np.arange(1, 6), np.arange(2, 7)]})
df["c"] = df["b"].apply(lambda x: x.tolist())
df["d"] = [Something(i) for i in range(3)]


def csv_encoder(elem):
    if isinstance(elem, Something):
        elem = f"Something({elem.a})"
    try:
        elem = elem.tolist()
    except AttributeError:
        pass
    try:
        elem = json.dumps(elem)
    except TypeError:
        pass
    return elem


def csv_decoder(elem):
    try:
        elem = json.loads(elem)
    except (TypeError, json.JSONDecodeError):
        pass
    if isinstance(elem, str):
        match = re.match(r"Something\((\d+)\)", elem)
        if match:
            elem = Something(int(match.group(1)))
    return elem


df.to_csv("/tmp/test_export.csv", index=False)
default_df = pd.read_csv("/tmp/test_export.csv")
print("\nDefault:\n", default_df)
print(type(default_df.loc[0, "d"]))

df.applymap(csv_encoder).to_csv("/tmp/test_export.csv", index=False)
applymap_df = pd.read_csv("/tmp/test_export.csv").applymap(csv_decoder)
print("\nWith applymap:\n", applymap_df)
print(type(applymap_df.loc[0, "d"]))
```

This code outputs the following (the raw CSV file contents are shown before each parsed DataFrame):

```
Default:
a,b,c,d
1,[0 1 2 3 4],"[0, 1, 2, 3, 4]",<__main__.Something object at 0x7fbe21c6c760>
2,[1 2 3 4 5],"[1, 2, 3, 4, 5]",<__main__.Something object at 0x7fbe21ec1670>
3,[2 3 4 5 6],"[2, 3, 4, 5, 6]",<__main__.Something object at 0x7fbe24d4c4c0>

   a            b                c                                              d
0  1  [0 1 2 3 4]  [0, 1, 2, 3, 4]  <__main__.Something object at 0x7fbe21c6c760>
1  2  [1 2 3 4 5]  [1, 2, 3, 4, 5]  <__main__.Something object at 0x7fbe21ec1670>
2  3  [2 3 4 5 6]  [2, 3, 4, 5, 6]  <__main__.Something object at 0x7fbe24d4c4c0>
<class 'str'>

With applymap:
a,b,c,d
1,"[0, 1, 2, 3, 4]","[0, 1, 2, 3, 4]","""Something(0)"""
2,"[1, 2, 3, 4, 5]","[1, 2, 3, 4, 5]","""Something(1)"""
3,"[2, 3, 4, 5, 6]","[2, 3, 4, 5, 6]","""Something(2)"""

   a                b                c                                              d
0  1  [0, 1, 2, 3, 4]  [0, 1, 2, 3, 4]  <__main__.Something object at 0x7fbe24d4cb50>
1  2  [1, 2, 3, 4, 5]  [1, 2, 3, 4, 5]  <__main__.Something object at 0x7fbe24d4c310>
2  3  [2, 3, 4, 5, 6]  [2, 3, 4, 5, 6]  <__main__.Something object at 0x7fbe24cb9ac0>
<class '__main__.Something'>
```

As you can see, using `applymap` with the encoder and decoder lets the custom types survive the round trip. The IO part of the example would thus become:

```python
df.to_csv("/tmp/test_export.csv", element_encoder=csv_encoder, index=False)
encoder_df = pd.read_csv("/tmp/test_export.csv", element_decoder=csv_decoder)
```

It's not a major improvement, but I think it can simplify the use of CSV files with custom types. So, I don't know much about pandas code, but I guess it could go in the class `CSVFormatter`:
```python
class CSVFormatter:
    cols: np.ndarray

    def __init__(
        self,
        formatter: DataFrameFormatter,
        path_or_buf: FilePath | WriteBuffer[str] | WriteBuffer[bytes] = "",
        sep: str = ",",
        cols: Sequence[Hashable] | None = None,
        index_label: IndexLabel | None = None,
        mode: str = "w",
        encoding: str | None = None,
        errors: str = "strict",
        compression: CompressionOptions = "infer",
        quoting: int | None = None,
        lineterminator: str | None = "\n",
        chunksize: int | None = None,
        quotechar: str | None = '"',
        date_format: str | None = None,
        doublequote: bool = True,
        escapechar: str | None = None,
        storage_options: StorageOptions = None,
        csv_encoder=None,
    ) -> None:
        self.fmt = formatter
        if csv_encoder is None:
            self.obj = self.fmt.frame
        else:
            self.obj = self.fmt.frame.applymap(csv_encoder)
        ...
```

P.S.: The lines in the files are given for the commit …
-1 on adding anything like this to csv. CSV is not a high-fidelity format and is not designed in any way for this. You are basically asking pandas to heroically do things, and it's just not supported. Most of the binary formats (e.g. parquet) fully support nested structures.
This issue is just about properly serializing objects stored in a DataFrame, which does not look so heroic to me, since pandas would just apply a function provided by the user. And pandas actually already does this implicitly; the main issue is that it calls the `__str__` method of the objects.
Just came across this issue. It's niche, but having an argument allowing the user to pass a function handling the serialization of complex types would be helpful. In my case, I want to serialize lists as semicolon-separated values. Even though the problem is solved easily by calling something like …, the argument could behave as ….
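The workaround mentioned in this comment can be sketched as follows (the DataFrame, column names and the `;` delimiter are my own illustrative choices):

```python
import pandas as pd

# Illustrative data: a DataFrame with a list-valued column.
df = pd.DataFrame({"id": [1, 2], "tags": [["a", "b"], ["c"]]})

def encode(elem):
    # Serialize lists as semicolon-separated values; leave everything else untouched.
    if isinstance(elem, list):
        return ";".join(str(x) for x in elem)
    return elem

# Today this elementwise pass must be done manually before every to_csv call
# (DataFrame.applymap; renamed DataFrame.map in pandas >= 2.1).
encoded = df.applymap(encode)
print(encoded.loc[0, "tags"])  # a;b
```

A dedicated `to_csv` argument would fold this pass into the export itself.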
Feature Type
- Adding new functionality to pandas
- Changing existing functionality in pandas
- Removing existing functionality in pandas
Problem Description
Sometimes we store Python objects (e.g. `numpy.ndarray` objects) in a column of a `DataFrame` and we want to store it as a CSV file. In this case, the array is converted to a string using the `__str__` method of the object, which is not the best format for later parsing. I suggest adding an option similar to the `cls` parameter of `json.dumps`, which allows encoding a specific type in a custom format.

Feature Description
The user can define an encoder:

Then we can pass it to the `to_csv` method, and it is applied to each element of the DF:

Internally, we could just call the encoder on each element (e.g. via `applymap`) just before saving the file.
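The code blocks for this section were lost in extraction; a minimal sketch consistent with the `csv_encoder` idea from the discussion above (the `element_encoder` keyword is the proposed parameter, which does not exist in pandas):

```python
import numpy as np
import pandas as pd

def csv_encoder(elem):
    # Hypothetical user-defined encoder: store numpy arrays as plain lists
    # instead of their default __str__ form ("[0 1 2]").
    if isinstance(elem, np.ndarray):
        return elem.tolist()
    return elem

df = pd.DataFrame({"a": [1, 2], "b": [np.arange(3), np.arange(3, 6)]})

# Proposed API (does not exist in pandas today):
#   df.to_csv("out.csv", element_encoder=csv_encoder, index=False)

# What pandas could do internally, expressed with the current API:
df.applymap(csv_encoder).to_csv("out.csv", index=False)
```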
Alternative Solutions
Another solution is to format manually before each call to `to_csv()`:
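The code for this alternative was also lost in extraction; it presumably amounts to manual encode/decode passes around every export and import, roughly like this sketch (the file name and the JSON encoding are my own choices):

```python
import json
import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [np.arange(3), np.arange(3, 6)]})

# Manually encode awkward types before every export...
df.applymap(
    lambda e: json.dumps(e.tolist()) if isinstance(e, np.ndarray) else e
).to_csv("manual.csv", index=False)

# ...and manually decode them again after every import.
restored = pd.read_csv("manual.csv")
restored["b"] = restored["b"].apply(lambda s: np.array(json.loads(s)))
```

The proposed `element_encoder`/`element_decoder` arguments would make this boilerplate unnecessary.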
Additional Context
No response