box tip aggregate #1843
We have 5 elements to report, plus the outliers; when there are fewer than 5 values, a choice must be made. The "report" array is in order of priority (highest priority last, since we use Array.pop). This seems to work well with 1, 2, 3, 4, and 5 values. By the way, I wonder whether q1 and q3 make sense when there are < 5 values; should we censor them in the visible mark too?
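To make the priority idea concrete, here is a minimal sketch of picking statistics off the end of a priority-ordered array with Array.pop. The particular order shown and the `pickStats` name are assumptions for illustration only; the actual "report" array in this PR may differ.

```javascript
// Hypothetical priority order: lowest priority first, highest last.
// (Assumption: the real "report" array in the PR may be ordered differently.)
const report = ["q1", "q3", "min", "max", "median"];

// Pick at most n statistics, highest priority first.
function pickStats(n, priorities = report) {
  const stack = [...priorities]; // copy, so pop() does not mutate the original
  const picked = [];
  while (picked.length < n && stack.length > 0) {
    picked.push(stack.pop()); // pop() takes the highest remaining priority
  }
  return picked;
}

console.log(pickStats(1)); // ["median"]
console.log(pickStats(3)); // ["median", "max", "min"]
```

With this layout, adding or reordering statistics only means editing the array; the selection loop stays the same.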
I think I would find it more readable if the statistics were labeled with their names: min, max, median, q1, q3 (maybe count or n also?). I guess there is the question of whether we use the simple definitions or the more nuanced ones the box plot uses. I would look for precedent in other tools that provide one-dimensional statistical summaries, such as DuckDB and R.
Maybe we don't need to reinforce the difference between counts (123#) and values? Also, we could use the p25 notation, for consistency. If we could somehow format as a table (or correctly aligned columns), I'd see something like this (made up numbers):
(12 numbers, that's quite a lot.) For small n we wouldn't always display all the stuff, but rather:
I like this general direction! I think “Min: 1.7” and “Max: 2.08” would be clearer than the [1.7 – 2.08] interval notation. It seems OK to me if the tooltip ignores the concept of outliers… even though “how is outlier defined?” is always my first question when I see a box plot with them!!
The point is, they're not min and max; instead, they're “low” and “high” values that capture about 99.3% of the (estimated) underlying normal distribution (the fences at 1.5 × IQR beyond the quartiles sit at roughly ±2.7σ), thus defining “outliers”. If you know how a box plot is defined, the bracket notation is fine; if you don't, you’ll need a key, and I don't think that a proper definition can be conveyed in a concise tip. (Words such as “low” and “high” are not self-sufficient.)
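For readers who want the definition spelled out, here is a rough sketch of how “low”, “high”, and the outliers can be derived. It assumes the conventional 1.5 × IQR fences and a linear-interpolation quantile; `quantileSorted` and `boxSummary` are hypothetical helper names, not part of any library API.

```javascript
// Linear-interpolation quantile on an already sorted array (assumed definition).
function quantileSorted(sorted, p) {
  const i = (sorted.length - 1) * p;
  const i0 = Math.floor(i);
  const i1 = Math.min(i0 + 1, sorted.length - 1);
  return sorted[i0] + (sorted[i1] - sorted[i0]) * (i - i0);
}

// Box plot summary: low and high are the most extreme data values that
// still fall within k * IQR of the quartiles; everything else is an outlier.
function boxSummary(values, k = 1.5) {
  const sorted = [...values].sort((a, b) => a - b);
  const q1 = quantileSorted(sorted, 0.25);
  const median = quantileSorted(sorted, 0.5);
  const q3 = quantileSorted(sorted, 0.75);
  const iqr = q3 - q1;
  const lowFence = q1 - k * iqr;
  const highFence = q3 + k * iqr;
  const inside = sorted.filter((v) => v >= lowFence && v <= highFence);
  const outliers = sorted.filter((v) => v < lowFence || v > highFence);
  return {
    low: inside[0],
    q1,
    median,
    q3,
    high: inside[inside.length - 1],
    outliers
  };
}

console.log(boxSummary([1, 2, 3, 4, 5, 6, 7, 8, 9, 100]));
// low: 1, q1: 3.25, median: 5.5, q3: 7.75, high: 9, outliers: [100]
```

Note that low and high are always actual data values, whereas the fences themselves are computed thresholds that may not appear in the data.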
I think low and high is better (more readable) than the bracket notation, and agree it avoids the confusion that min and max would introduce due to outliers. In either case you’d have to look up the details of what the number means, so better to give it a name at least so people know what to look for and how to refer to the presented values.
I would also prefer if the names were bolded like we do with channels, i.e., “p25 42” instead of “p25: 42”.
I can't find how to add low, high, etc. as independent channels on the tip mark. I think there's a bug somewhere since, when I do
the low channel is generated with the correct number of values (as many as there are groups), but all its values are
there is now an input for low, and the low output is correctly computed, but it's not considered as a channel.
An alternative to #1839 for boxplot tips; this one shows the # of items in each quartile (and # of outliers), using numbers and mathematical symbols (as opposed to English-language text). We don't show empty quartiles or redundant information (i.e., when the # of values is small).
I don't know what users of box plots expect from a tip. The visual summary is one thing, but making it into “sentences” is quite challenging, given the complex definition of the elements.
For example, we can't really show percentages, as they are misleading when the data is “discrete” (few values). For instance, if you have 5 values, the proportion that are lower than the median is 2 out of 5 (40%), and the proportion that are lower than or equal to the median is 3 out of 5 (60%). There is never a 50%. So, we recount everything with respect to the quartiles, and don't display %. But of course this makes things much more difficult to grasp, because these statistics come from percentages, and it's hard to understand why the tip would show “made up” values of 4.25, 7, 10.75 as interesting edges to summarize an array of integers.
This is my gripe with the whole concept of a box plot, maybe, more than with the tip :)
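The 5-value example above can be checked with a small sketch. It assumes a linear-interpolation quantile; `quantile` and `medianShares` are hypothetical helpers written for this illustration, and the sample arrays are made up.

```javascript
// Linear-interpolation quantile (assumed definition) on unsorted input.
function quantile(values, p) {
  const sorted = [...values].sort((a, b) => a - b);
  const i = (sorted.length - 1) * p;
  const i0 = Math.floor(i);
  const i1 = Math.min(i0 + 1, sorted.length - 1);
  return sorted[i0] + (sorted[i1] - sorted[i0]) * (i - i0);
}

// Count how many values fall strictly below, and at or below, the median.
function medianShares(values) {
  const median = quantile(values, 0.5);
  const below = values.filter((v) => v < median).length;
  const atOrBelow = values.filter((v) => v <= median).length;
  return { median, below, atOrBelow, n: values.length };
}

console.log(medianShares([1, 2, 3, 4, 5]));
// median: 3, below: 2 (40%), atOrBelow: 3 (60%); neither is 50%

console.log(quantile([1, 2, 3, 4, 5, 6], 0.25));
// 2.25: an interpolated quartile that is not one of the integer inputs
```

This is why counts per quartile are shown rather than percentages: with few values, the counts are exact while the nominal 25%/50%/75% splits are not.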