Possible issue with eland.DataFrame.value_counts(): some values are missing from the counts #643

Open
mumuwithw opened this issue Dec 21, 2023 · 0 comments

Comments

@mumuwithw

I tried using eland to read data from two data streams with es_index_pattern=["*java.backend*", "*h3c*"]. The field 'data_stream.dataset' holds the name of the data stream each document belongs to; in this example its values are 'h3c' and 'java.backend'.
When I print the dataframe with 'df', I can see 'h3c' rows in the output, but when I call value_counts() on this field, only 'java.backend' appears. I'm not sure whether this is a bug, because I saw a warning about this field when creating the eland.DataFrame.

The code and its output are as follows:

>>> import eland as ed
>>> from elasticsearch import Elasticsearch
>>> import pandas as pd
>>> escli = Elasticsearch(
...         hosts="https://******",
...         basic_auth=("elastic", "***"),
...         ca_certs='./http_ca.crt',
...     )
>>> df = ed.DataFrame(
...     escli,
...     es_index_pattern=["*java.backend*", "*h3c*"],
...     columns=['@timestamp', 'message', 'data_stream.dataset'],
...     es_index_field='@timestamp'
...     )

# this is the warning mentioned above
......
xxxx\lib\site-packages\eland\field_mappings.py:327: UserWarning: Field data_stream.dataset has conflicting types ('constant_keyword', None) != text
......




# here 'data_stream.dataset' contains both 'h3c' and 'java.backend'
>>> df
                                                     @timestamp  ... data_stream.dataset
2012-12-31T23:59:33.000+08:00         2012-12-31 23:59:33+08:00  ...                 h3c
2012-12-31T23:59:33.000+08:00         2012-12-31 23:59:33+08:00  ...                 h3c
2012-12-31T23:59:48.000+08:00         2012-12-31 23:59:48+08:00  ...                 h3c
2012-12-31T23:59:48.000+08:00         2012-12-31 23:59:48+08:00  ...                 h3c
2012-12-31T23:59:48.000+08:00         2012-12-31 23:59:48+08:00  ...                 h3c
...                                                         ...  ...                 ...
2023-12-19T07:00:08.730Z       2023-12-19 07:00:08.730000+00:00  ...        java.backend
2023-12-19T07:00:08.730Z       2023-12-19 07:00:08.730000+00:00  ...        java.backend
2023-12-19T07:00:08.730Z       2023-12-19 07:00:08.730000+00:00  ...        java.backend
2023-12-19T07:00:08.730Z       2023-12-19 07:00:08.730000+00:00  ...        java.backend
2023-12-19T07:38:46.967Z       2023-12-19 07:38:46.967000+00:00  ...        java.backend

[42240705 rows x 3 columns]



# but value_counts() returns only 'java.backend' (42043023 of the 42240705 rows)
>>> df['data_stream.dataset'].value_counts()
java.backend    42043023
Name: data_stream.dataset, dtype: int64
>>> df['data_stream.dataset'].value_counts(10) 
java.backend    42043023
Name: data_stream.dataset, dtype: int64
>>> df['data_stream.dataset'].value_counts(2)  
java.backend    42043023
Name: data_stream.dataset, dtype: int64
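
A possible way to narrow this down, sketched under assumptions rather than confirmed against eland's internals: check how each index maps 'data_stream.dataset' (the warning suggests the two data streams disagree) and run the terms aggregation directly against Elasticsearch to see whether 'h3c' is missing there too. The helper names below (types_by_index, terms_counts, failed_shards) are hypothetical and only parse standard field_caps and search response bodies; the aggregation name "counts" is likewise an arbitrary choice.

```python
from typing import Any, Dict, List, Mapping

FIELD = "data_stream.dataset"


def types_by_index(field_caps_response: Mapping[str, Any], field: str) -> Dict[str, List[str]]:
    """Map each field type reported by the field_caps API to the indices using it.

    When every index agrees on the type, Elasticsearch omits the "indices"
    key, so an empty list here means "all matched indices".
    """
    caps = field_caps_response.get("fields", {}).get(field, {})
    return {ftype: list(info.get("indices") or []) for ftype, info in caps.items()}


def terms_counts(search_response: Mapping[str, Any], agg_name: str = "counts") -> Dict[str, int]:
    """Turn the buckets of a terms aggregation into a {value: doc_count} dict."""
    buckets = search_response["aggregations"][agg_name]["buckets"]
    return {bucket["key"]: bucket["doc_count"] for bucket in buckets}


def failed_shards(search_response: Mapping[str, Any]) -> int:
    """Number of shards that failed; a non-zero value means partial results."""
    return int(search_response.get("_shards", {}).get("failed", 0))


# Assumed usage with the client from the report (not executed here):
#   caps = escli.field_caps(index="*java.backend*,*h3c*", fields=FIELD).body
#   print(types_by_index(caps, FIELD))
#   resp = escli.search(
#       index="*java.backend*,*h3c*",
#       size=0,
#       aggs={"counts": {"terms": {"field": FIELD}}},
#   ).body
#   print(terms_counts(resp), failed_shards(resp))
```

If the direct aggregation also returns only 'java.backend', or reports failed shards, the gap would seem to come from how the field is mapped in the 'h3c' indices rather than from value_counts() itself; if it returns both values, the problem is more likely on the eland side.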