ENH/API: accept list-like percentiles in describe (WIP) #7088

TomAugspurger · 2014-05-09T18:17:16Z

This is for frames. I'm going to refactor this into generic to cover series / frames.

A couple questions:

API wise, I added a new kwarg percentiles. For backwards compat, we keep the percentile_width kwarg. I changed the default percentile_width from 50 to None (but the default output is the same). Cases:
1. You specify both percentile_width and percentiles -> ValueError
2. You specify neither percentile_width nor percentiles -> percentile_width is set to 50 and the output is the same as before
3. You specify one of those -> everything goes as expected.

I'm accepting either decimals (e.g. [0.25, 0.5, 0.75]) or percentages (e.g. [25, 50, 75]). Those two are equivalent in output. Should I move this logic to .quantile to be more consistent?

I'm choosing not to sort the provided percentiles. It's easy for the user to sort, but hard for them to unsort if for some reason they want a specific order. I'd rather not add another kwarg.
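
The three cases above can be sketched roughly as follows. This is not the code from the PR: resolve_percentiles is a hypothetical helper, and the reading of percentile_width as a band centred on the median (so 50 gives 25% / 50% / 75%) is an assumption based on the old default output.

```python
def resolve_percentiles(percentile_width=None, percentiles=None):
    """Hypothetical helper: turn the two keyword arguments into quantiles."""
    if percentile_width is not None and percentiles is not None:
        # Case 1: both keywords given, so refuse rather than guess.
        raise ValueError("specify percentile_width or percentiles, not both")
    if percentiles is not None:
        # Case 3 (percentiles given): keep the caller's order, no sorting.
        return list(percentiles)
    if percentile_width is None:
        # Case 2: neither given, fall back to the old default width.
        percentile_width = 50
    # Assumption: the width is a band centred on the median.
    lower = 0.5 - percentile_width / 200.0
    return [lower, 0.5, 1.0 - lower]


print(resolve_percentiles())                        # default
print(resolve_percentiles(percentiles=[0.9, 0.1]))  # order preserved
```

With these assumptions the default call yields the same three quantiles as before, and a caller-supplied list comes back in the order it was given.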

TomAugspurger · 2014-05-10T04:14:07Z

I'm close on this... a quick question though. There's also a describe for object types (strs or datetime). Notice the different index order for the last one (this is all current behavior):

# strs only
In [31]: df2 = pd.DataFrame({"C2": ['a', 'a', 'b', 'c']})

In [32]: df2.describe()
Out[32]: 
       C2
count   4
unique  3
top     a
freq    2

[4 rows x 1 columns]

# datetime only
In [28]: df = DataFrame({"C1": pd.date_range('2010-01-01', periods=4, freq='D')})

In [29]: df
Out[29]: 
          C1
0 2010-01-01
1 2010-01-02
2 2010-01-03
3 2010-01-04

[4 rows x 1 columns]

In [30]: df.describe()
Out[30]: 
                         C1
count                     4
unique                    4
first   2010-01-01 00:00:00
last    2010-01-04 00:00:00
top     2010-01-01 00:00:00
freq                      1

[6 rows x 1 columns]

# mix of timestamp and strs
In [33]: df = pd.concat([df, df2], axis=1)

In [35]: df.describe()
Out[35]: 
                         C1   C2
count                     4    4
first   2010-01-01 00:00:00  NaN
freq                      1    2
last    2010-01-04 00:00:00  NaN
top     2010-01-01 00:00:00    a
unique                    4    3

[6 rows x 2 columns]

So the index gets sorted. Is it worth breaking backwards compat to keep the index in a sensible order? I'm not sure.
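
For illustration, one way to get a "sensible" order is to take the union of the per-column labels in first-seen order instead of sorting them. This is only a sketch in plain Python; ordered_union is a hypothetical helper, not something in pandas.

```python
def ordered_union(*name_lists):
    """Union of labels that keeps first-seen order instead of sorting."""
    seen = set()
    out = []
    for names in name_lists:
        for name in names:
            if name not in seen:
                seen.add(name)
                out.append(name)
    return out


# Row labels produced for the two column kinds shown above.
str_names = ['count', 'unique', 'top', 'freq']
date_names = ['count', 'unique', 'first', 'last', 'top', 'freq']

# Sorting reproduces the alphabetical order seen in the mixed frame.
print(sorted(set(str_names) | set(date_names)))
# First-seen order keeps count/unique at the top and top/freq at the bottom.
print(ordered_union(date_names, str_names))
```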

jreback · 2014-05-10T12:11:27Z

yeh these should be in a sensible order I think
u can put it in API changes

TomAugspurger · 2014-05-10T13:42:41Z

Moved to generic (I'm not sure it was worth it; the code got pretty messy with a bunch of if / else.), updated docs. Should be good when travis says so.

jreback · 2014-05-10T13:57:43Z

ok, in theory ndim>=3 should work via .apply FYI

you can put tests in test_generic.py (you can do specific tests or have it create them generically)

jreback · 2014-05-10T13:58:36Z

pandas/core/generic.py

+        # dtypes: numeric only, numeric mixed, objects only
+        data = self._get_numeric_data()
+        if self.ndim > 1:
+            if len(data.columns) == 0:


do this as len(data._info_axis)

TomAugspurger · 2014-05-11T20:17:23Z

@jreback was there anything else you saw here? I think it's ready.

jreback · 2014-05-11T20:46:08Z

didn't realize there is an argument percentile_width

I think u should just rename it to percentiles and make it what u have for percentiles (and if it's a scalar then the meaning is unchanged)

I think it's too confusing with that argument (which is prob not used much at all) - yours is much more useful

TomAugspurger · 2014-05-11T21:43:41Z

Should we do any warning / deprecation? I should be able to handle that very easily.

jreback · 2014-05-11T23:23:10Z

sure, why not deprecate percentile_width and replace it with percentiles

otherwise functionality is the same

TomAugspurger · 2014-05-12T16:26:47Z

Added a note about this deprecation to #6581.

Anything else?

TomAugspurger · 2014-05-12T16:28:18Z

@jorisvandenbossche Does my deprecation note here look ok? That's how the numpy guide said to do it for objects. I assumed it was similar for keyword arguments.

jorisvandenbossche · 2014-05-12T20:05:16Z

pandas/core/generic.py

@@ -3478,6 +3478,152 @@ def _convert_timedeltas(x):

        return np.abs(self)

+    _shared_docs['describe'] = """
+        Generate various summary statistics of self, excluding


I think the "of self" is not very clear for people who don't know the self concept, maybe just leave it out?

jorisvandenbossche · 2014-05-12T20:21:15Z

Some comments:

Is it necessary that both the percentage style (50) and the number style (0.5) are allowed? This seems to make it a bit complex to me (what if you use the first style but want to give 0.5%? Then you have to use the other style, and it becomes confusing I think).
The deprecation of percentile_width is not in the release notes.
About the deprecation note in the docstring: I think this is OK, but another option is to keep the parameter in the parameter section and start its explanation with something like 'Deprecated and will be removed. Use instead ...'. The numpy docstring standard does not really say how to do this; I think in pandas we mostly mention it in the parameter section itself instead of in a separate note (although we maybe mostly forget it ...), e.g. http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_html.html?highlight=deprecated
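
To make the first point concrete, here is a small sketch of why accepting both styles is ambiguous. The heuristic below ("treat the list as percentages if any value exceeds 1") is an assumption made for illustration; the thread does not say how the PR told the two styles apart, and normalize_mixed is a hypothetical name.

```python
def normalize_mixed(percentiles):
    """Hypothetical heuristic: percentages if any value exceeds 1."""
    values = list(percentiles)
    if any(p > 1 for p in values):
        return [p / 100.0 for p in values]
    return values


print(normalize_mixed([25, 50, 75]))  # read as percentages
print(normalize_mixed([0.5]))         # read as the median, not 0.5%
print(normalize_mixed([0.5, 99.5]))   # here the same 0.5 is read as 0.5%
```

Under this heuristic the meaning of 0.5 depends on what else is in the list, which is the kind of confusion described above.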

TomAugspurger · 2014-05-13T18:19:06Z

@jorisvandenbossche thanks for the comments.

I was hoping that accepting both percentages and raw decimals would be less confusing, since I can never remember which we expect.

I actually had a longer reply written and then I realized why it was confusing. I'll switch it back to just expecting decimals between [0, 1], which is at least consistent with quantile.
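
A minimal sketch of what "just expecting decimals between [0, 1]" could look like as a check. check_percentiles and its error message are hypothetical, not taken from the PR.

```python
def check_percentiles(percentiles):
    """Hypothetical check: every value must be a fraction in [0, 1]."""
    values = list(percentiles)
    for p in values:
        if not 0 <= p <= 1:
            raise ValueError(
                "percentiles should all be in the interval [0, 1]; got %r" % (p,)
            )
    return values


print(check_percentiles([0.05, 0.5, 0.95]))
try:
    check_percentiles([25, 50, 75])
except ValueError as err:
    print(err)
```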

jorisvandenbossche · 2014-05-13T18:44:14Z

@TomAugspurger I think quantile was originally designed this way to match R; however, it now does not match np.percentile, which uses [0, 100]. In any case, it is too late to change that in pandas I think, and indeed the most important thing is to be consistent within pandas with quantile.

jorisvandenbossche · 2014-05-13T18:48:14Z

BTW, nice and informative FutureWarning! +1
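
For readers who have not seen the pattern, a deprecated keyword with an informative FutureWarning might look roughly like this. The function name, message text, and the width-to-percentiles conversion are assumptions for illustration, not the code that was merged.

```python
import warnings


def describe_args(percentiles=None, percentile_width=None):
    """Hypothetical shim: accept the old keyword but warn about it."""
    if percentile_width is not None:
        if percentiles is not None:
            raise ValueError("cannot pass both percentiles and percentile_width")
        # Assumption: the width is a band centred on the median.
        lower = 0.5 - percentile_width / 200.0
        percentiles = [lower, 0.5, 1.0 - lower]
        # Tell the caller exactly what to write instead.
        warnings.warn(
            "percentile_width is deprecated; use percentiles=%r instead"
            % (percentiles,),
            FutureWarning,
            stacklevel=2,
        )
    elif percentiles is None:
        percentiles = [0.25, 0.5, 0.75]
    return list(percentiles)
```

Naming the equivalent percentiles in the message is what makes the warning informative: the caller can paste the suggested value straight into their code.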

jreback · 2014-05-14T01:46:45Z

@TomAugspurger why are you using Counter rather than pd.core.algorithms.value_counts() again?

Counter DOES not sort, while value_counts does.

On Windows, test_generic/test_describe_objects is failing on the datetimes because of the arbitrary order from Counter (the counts are all 1, and for some reason it picks the 2nd one).

======================================================================
FAIL: test_describe_objects (pandas.tests.test_generic.TestDataFrame)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "c:\Users\Jeff Reback\Documents\GitHub\pandas\build\lib.win-amd64-2.7\pandas\tests\test_generic.py", line 997, in test_describe_objects
    assert_frame_equal(result, expected)
  File "c:\Users\Jeff Reback\Documents\GitHub\pandas\build\lib.win-amd64-2.7\pandas\util\testing.py", line 573, in assert_frame_equal
    check_exact=check_exact)
  File "c:\Users\Jeff Reback\Documents\GitHub\pandas\build\lib.win-amd64-2.7\pandas\util\testing.py", line 520, in assert_series_equal
    assert_almost_equal(left.values, right.values, check_less_precise)
  File "testing.pyx", line 58, in pandas._testing.assert_almost_equal (pandas\src\testing.c:2465)
  File "testing.pyx", line 93, in pandas._testing.assert_almost_equal (pandas\src\testing.c:1793)
  File "testing.pyx", line 139, in pandas._testing.assert_almost_equal (pandas\src\testing.c:2338)
AssertionError: Timestamp('2010-01-02 00:00:00') != Timestamp('2010-01-01 00:00:00')

----------------------------------------------------------------------
Ran 6971 tests in 194.439s

FAILED (SKIP=130, failures=1)

2.7\pandas\core\generic.py(3576)describe_categorical_1d()
-> if data.dtype == object:
(Pdb) n
> c:\users\jeff reback\documents\github\pandas\build\lib.win-amd64-2.7\pandas\core\generic.py(3585)describe_categorical_1d()
-> elif issubclass(data.dtype.type, np.datetime64):
(Pdb)
> c:\users\jeff reback\documents\github\pandas\build\lib.win-amd64-2.7\pandas\core\generic.py(3586)describe_categorical_1d()
-> names = ['count', 'unique']
(Pdb) p data
0   2010-01-01
1   2010-01-02
2   2010-01-03
3   2010-01-04
Name: C1, dtype: datetime64[ns]
(Pdb) n
> c:\users\jeff reback\documents\github\pandas\build\lib.win-amd64-2.7\pandas\core\generic.py(3587)describe_categorical_1d()
-> asint = data.dropna().values.view('i8')
(Pdb) n
> c:\users\jeff reback\documents\github\pandas\build\lib.win-amd64-2.7\pandas\core\generic.py(3588)describe_categorical_1d()
-> objcounts = compat.Counter(asint)
(Pdb) p asint
array([1262304000000000000, 1262390400000000000, 1262476800000000000,
       1262563200000000000], dtype=int64)
(Pdb) l
3583                        result += [top, freq]
3584
3585                elif issubclass(data.dtype.type, np.datetime64):
3586                    names = ['count', 'unique']
3587                    asint = data.dropna().values.view('i8')
3588 ->                 objcounts = compat.Counter(asint)
3589                    result = [data.count(), len(objcounts)]
3590                    if result[1] > 0:
3591                        top, freq = objcounts.most_common(1)[0]
3592                        names += ['first', 'last', 'top', 'freq']
3593                        result += [lib.Timestamp(asint.min()),
(Pdb) p compat.Counter(asint)
Counter({1262390400000000000: 1, 1262563200000000000: 1, 1262304000000000000: 1, 1262476800000000000: 1})
(Pdb) p asint
array([1262304000000000000, 1262390400000000000, 1262476800000000000,
       1262563200000000000], dtype=int64)
(Pdb) p pd.algorithms.value_counts
*** AttributeError: AttributeError("'module' object has no attribute 'algorithms'",)
(Pdb) p pd.core.algorithms.value_counts
<function value_counts at 0x00000000076FDBA8>
(Pdb) p pd.core.algorithms.value_counts(asint)
1262304000000000000    1
1262563200000000000    1
1262476800000000000    1
1262390400000000000    1
dtype: int64
(Pdb) p pd.core.algorithms.value_counts(asint,sort=True)
1262304000000000000    1
1262563200000000000    1
1262476800000000000    1
1262390400000000000    1
dtype: int64
(Pdb)

TomAugspurger · 2014-05-14T17:11:08Z

Ahh I missed that one. I'll switch it over to use value counts and fix the test so that it isn't ambiguous.
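
As an aside, the ambiguity comes from ties: when every count is 1, which element counts as "top" depends on iteration order. The sketch below shows one deterministic tie-break in plain Python. It is not the fix that went into pandas (that used value_counts); top_and_freq is a hypothetical helper, and breaking ties by the smallest value is just one possible rule.

```python
from collections import Counter


def top_and_freq(values):
    """Most common value and its count, with ties broken by smallest value."""
    counts = Counter(values)
    # Sort key: highest count first, then the smallest value among ties.
    top, freq = min(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return top, freq


# All counts are 1, as in the failing datetime test (values as int64 nanoseconds).
asint = [1262390400000000000, 1262304000000000000,
         1262563200000000000, 1262476800000000000]
print(top_and_freq(asint))

# An unambiguous input: one value clearly occurs most often.
print(top_and_freq(['a', 'a', 'b', 'c']))
```

Making the test data unambiguous (one value repeated more often than the rest) avoids depending on any tie-break rule at all.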

jreback · 2014-05-14T17:13:06Z

awesome just put up a pr and I can test

jreback added API Design labels May 9, 2014

jreback added this to the 0.14.1 milestone May 9, 2014

jreback reviewed May 10, 2014

jsexauer mentioned this pull request May 12, 2014

DEPR: Clean up list of deprecations from prior versions #6581

Closed

1 task

jorisvandenbossche reviewed May 12, 2014

ENH/API: accept percentiles in describe

843aa60

jreback modified the milestones: 0.14.0, 0.14.1 May 13, 2014

TomAugspurger pushed a commit that referenced this pull request May 14, 2014

Merge pull request #7088 from TomAugspurger/describe-quantiles

f26e668

ENH/API: accept list-like percentiles in describe (WIP)

TomAugspurger merged commit f26e668 into pandas-dev:master May 14, 2014

TomAugspurger mentioned this pull request May 14, 2014

BUG: cleanup on describe #7129

Merged

jreback mentioned this pull request Jul 24, 2016

DEPR: deprecations log for removed issues #13777

Closed

TomAugspurger deleted the describe-quantiles branch November 3, 2016 12:37
