box tip aggregate #1843

Open
wants to merge 10 commits into
base: main
Conversation

Contributor

@Fil Fil commented Aug 31, 2023

An alternative to #1839 for boxplot tips; this one shows the number of items in each quartile (and the number of outliers), using numbers and mathematical symbols (as opposed to English-language text). We don't show empty quartiles or redundant information (i.e., when the number of values is small).

I don't know what users of box plots expect from a tip. The visual summary is one thing, but making it into “sentences” is quite challenging, given the complex definition of the elements.

For example, we can't really show percentages, as they are misleading when the data is “discrete” (few values). For instance, if you have 5 values, the proportion that is lower than the median is 2 out of 5 (40%), and the proportion that is lower than or equal to the median is 3 out of 5 (60%). There is never a 50%. So we recount everything with respect to the quartiles and don't display percentages. But of course this makes things much more difficult to grasp, because these statistics come from percentages, and it's hard to understand why the tip would show “made up” values of 4.25, 7, 10.75 as interesting edges to summarize an array of integers.
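The arithmetic above can be checked with a small sketch. It assumes the linear-interpolation quantile definition (often called R-7); the `quantile` helper and the ten-integer array are illustrative only and are not the data shown in the screenshot.

```javascript
// Linear-interpolation quantile (the "R-7" definition).
// Assumption: `sorted` is already sorted in ascending order.
function quantile(sorted, p) {
  const h = (sorted.length - 1) * p;
  const i = Math.floor(h);
  const lower = sorted[i];
  const upper = sorted[Math.min(i + 1, sorted.length - 1)];
  return lower + (h - i) * (upper - lower);
}

// Five values: neither share of the data equals 50%.
const five = [1, 2, 3, 4, 5];
const median = quantile(five, 0.5);
const shareBelow = five.filter((d) => d < median).length / five.length;
const shareAtOrBelow = five.filter((d) => d <= median).length / five.length;

// Hypothetical integer data whose quartiles are not all integers.
const integers = [1, 2, 4, 5, 6, 8, 10, 11, 12, 13];
const quartiles = [0.25, 0.5, 0.75].map((p) => quantile(integers, p));

console.log(median, shareBelow, shareAtOrBelow); // 3 0.4 0.6
console.log(quartiles); // [ 4.25, 7, 10.75 ]
```

With this definition, the quartiles of an integer array land between data points, which is why they can look “made up” in a tip.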

[Screenshot, 2023-08-31 13:26:39]

This is my gripe with the whole concept of a box plot, maybe, more than with the tip :)

We have 5 elements to report, plus the outliers.

When there are fewer than 5 values, a choice must be made. The "report" array is in order of priority (highest priority last, since we use Array.pop).

This seems to work well with 1, 2, 3, 4, and 5 values. By the way, I wonder whether q1 and q3 make sense when there are fewer than 5 values; should we censor them in the visible mark too?
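A minimal sketch of the priority idea follows. The order of the `report` array and the rule "show at most as many statistics as there are values" are assumptions made for illustration; the PR's actual array and selection rule are not shown in this thread.

```javascript
// Hypothetical priority order, lowest priority first, so that
// Array.pop returns the highest-priority statistic first.
const report = ["q1", "q3", "low", "high", "median"];

// Pick the statistics to display for a group of n values.
// Assumption: we never show more statistics than there are values.
function pickStatistics(n, priorities = report) {
  const stack = [...priorities]; // copy, so pop() does not mutate the input
  const picked = [];
  const limit = Math.min(n, priorities.length);
  while (picked.length < limit) {
    picked.push(stack.pop());
  }
  return picked;
}

console.log(pickStatistics(1)); // [ 'median' ]
console.log(pickStatistics(3)); // [ 'median', 'high', 'low' ]
console.log(pickStatistics(8).length); // 5
```

Keeping the highest priority last means the selection loop needs no index bookkeeping: each `pop` yields the next most important element.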
@Fil Fil mentioned this pull request Aug 31, 2023
@mbostock
Member

I think I would find it more readable if the statistics were labeled with their names: min, max, median, q1, q3 (maybe count or n also?). I guess there is the question of whether we use the simple definitions or the more nuanced ones the box plot uses. I would look for precedent in other tools that provide one-dimensional statistical summaries, such as DuckDB and R.

@Fil
Contributor Author

Fil commented Aug 31, 2023

Maybe we don't need to reinforce the difference between counts (123#) and values? Also, we could use the p25 notation, for consistency.

If we could somehow format as a table (or correctly aligned columns), I'd see something like this (made up numbers):

count
    n: 31,256 (+ 10 n/a)
quartiles
  p25:  7,530 < 23.1
  p50: 16,230 < 45
  p75: 23,530 < 56.1
outliers
  low:    156 < 12.3
 high:    347 > 89.1

(12 numbers, that's quite a lot.)

For small n we wouldn't always display all the stuff, but rather:

     n: 3
median: 42
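As a rough sketch of the column alignment described above, labels and values can be right-aligned by padding to the widest entry. The `alignRows` helper is hypothetical, and the numbers reuse the made-up values from the table.

```javascript
// Right-align labels and values into two columns of plain text.
// `rows` is an array of [label, number] pairs.
function alignRows(rows) {
  const labels = rows.map(([label]) => `${label}:`);
  const values = rows.map(([, value]) => value.toLocaleString("en-US"));
  const labelWidth = Math.max(...labels.map((s) => s.length));
  const valueWidth = Math.max(...values.map((s) => s.length));
  return rows
    .map((_, i) => `${labels[i].padStart(labelWidth)} ${values[i].padStart(valueWidth)}`)
    .join("\n");
}

const table = alignRows([
  ["p25", 7530],
  ["p50", 16230],
  ["p75", 23530],
  ["low", 156],
  ["high", 347],
]);

console.log(table);
```

This only lines up in a monospaced font, which is one reason aligned columns are harder to get right in a tip than in a code block.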

@Fil
Contributor Author

Fil commented Sep 1, 2023

A simpler version that neither shows counts nor tries to explain what the 99.3% range is:

[Screenshot, 2023-09-01 06:32:03]

@Fil Fil marked this pull request as ready for review September 1, 2023 04:38
@tophtucker
Contributor

I like this general direction! I think “Min: 1.7” and “Max: 2.08” would be clearer than the [1.7 – 2.08] interval notation.

It seems OK to me if the tooltip ignores the concept of outliers… even though “how is outlier defined?” is always my first question when I see a box plot with them!!

@Fil
Contributor Author

Fil commented Sep 6, 2023

The point is that they're not min and max; they're “low” and “high” values that capture 99.3% of the (estimated) underlying normal distribution, thus defining “outliers”. If you know how a box plot is defined, the bracket notation is fine; if you don't, you’ll need a key, and I don't think that a proper definition can be conveyed in a concise tip. (Words such as “low” and “high” are not self-explanatory.)
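To make the distinction from min and max concrete, here is a sketch under the common Tukey convention: “low” is the smallest value at or above q1 − 1.5 × IQR, and “high” is the largest value at or below q3 + 1.5 × IQR. The helper names and the data are hypothetical, and the convention itself is an assumption about how the box is defined here.

```javascript
// Linear-interpolation quantile; `sorted` must be sorted ascending.
function quantile(sorted, p) {
  const h = (sorted.length - 1) * p;
  const i = Math.floor(h);
  const upper = sorted[Math.min(i + 1, sorted.length - 1)];
  return sorted[i] + (h - i) * (upper - sorted[i]);
}

// Whisker ends and outliers under the assumed 1.5 × IQR rule.
function boxSummary(values) {
  const sorted = [...values].sort((a, b) => a - b);
  const q1 = quantile(sorted, 0.25);
  const q3 = quantile(sorted, 0.75);
  const iqr = q3 - q1;
  const loFence = q1 - 1.5 * iqr;
  const hiFence = q3 + 1.5 * iqr;
  const inside = sorted.filter((d) => d >= loFence && d <= hiFence);
  return {
    low: inside[0], // smallest value within the fences
    high: inside[inside.length - 1], // largest value within the fences
    outliers: sorted.filter((d) => d < loFence || d > hiFence),
  };
}

// Hypothetical data: max is 100, but "high" is 9.
const summary = boxSummary([1, 2, 3, 4, 5, 6, 7, 8, 9, 100]);
console.log(summary); // { low: 1, high: 9, outliers: [ 100 ] }
```

In this example “low” coincides with the minimum, while “high” does not coincide with the maximum, which is the confusion that labeling them min and max would introduce.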

@mbostock
Member

mbostock commented Sep 6, 2023

I think low and high are better (more readable) than the bracket notation, and I agree that they avoid the confusion that min and max would introduce due to outliers. In either case you’d have to look up the details of what the number means, so it's better to at least give it a name so people know what to look for and how to refer to the presented values.

@mbostock
Member

mbostock commented Sep 6, 2023

I would also prefer if the names were bolded like we do with channels, i.e., “p25 42” (with “p25” in bold) instead of “p25: 42”.

@Fil
Contributor Author

Fil commented Sep 9, 2023

I can't find how to add low, high, etc. as independent channels on the tip mark.

I think there's a bug somewhere since, when I do

group({low: loqr1, x: "p50"}, {x, y, z: y, channels: {low: x}, ...options}))

the low channel is generated with the correct number of values (as many as there are groups), but all its values are undefined, because there is no corresponding input "low". If, on the other hand, I specify:

group({low: loqr1, x: "p50"}, {x, y, z: y, low: x, ...options}))

there is now an input for low, and the low output is correctly computed, but it's not considered as a channel.


3 participants