Talk:Box plot
This is the talk page for discussing improvements to the Box plot article. This is not a forum for general discussion of the article's subject.
This article is written in American English, which has its own spelling conventions (color, defense, traveled) and some terms that are used in it may be different or absent from other varieties of English. According to the relevant style guide, this should not be changed without broad consensus.
Statistics: C-class, High-importance
Mathematics: C-class, Mid-priority
This article was the subject of a Wiki Education Foundation-supported course assignment, between 27 August 2021 and 19 December 2021. Further details are available on the course page. Student editor(s): Jx2022 (article contribs).
Question!
I have a question. In the data you use to make the diagram, if a number is repeated, how do you put that on the graph? I'm confused.
—The preceding unsigned comment was added by 74.244.254.26 (talk) 23:08, 6 December 2006 (UTC).
- You don't put it itself in the graph. You use it to compute the median, box edges, and whiskers. Those quantities are the ones plotted. Baccyak4H (Yak!) 03:13, 7 December 2006 (UTC)
Expansion, revision
Hey folks, when I came across this page it was in need of some serious revision. There were grammatical errors and repeated information. I took the liberty of removing some of the repeated information and providing a detailed step-by-step construction of the boxplot.
More suggestions
Something on parallel boxplots would be nice. I am very new here, still feeling my way. I can try to do something, but it might be messy; HTML is not my forte. Plf515 23:52, 26 November 2006 (UTC)plf515
Something should be added about "notches". —Preceding unsigned comment added by CptNautilus (talk • contribs) 17:50, 7 November 2007 (UTC)
Error?
"Histograms and probability density functions require assumptions of the statistical distribution." What about empirical histograms and pdf's? As far as I can see these do not require any a-priori assumptions on distributions?
- I don't know who wrote that, but I agree, so I've altered that section accordingly. Qwfp (talk) 17:43, 22 February 2008 (UTC)
One figure has a German legend, yet it is in the English Wikipedia. Sakaton (talk) 21:49, 5 February 2013 (UTC)
Boxplot Figure
The current figure, Michelsonmorley-boxplot.svg, does not render correctly and could thus confuse readers. I suggest that we revert to the non-svg file R-speed_of_light_boxplot.png. Innohead 13:06, 15 February 2007 (UTC)
How to make a box plot
Ex: Let's say the data is 39, 63, 92, 41, 99, and 55.
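1. Find the median. Ex: Sort the data (39, 41, 55, 63, 92, 99) and pair off values from the two ends: 39 and 99, then 41 and 92. That leaves 55 and 63, so take their mean. (It's 59.)
2. Find the median of the lower numbers, from 39 up to 59. Pair off 39 and 59, leaving 41 and 55. Median: 48.
3. Do the same with the higher numbers, from 59 up to 99. Pair off 59 and 99, leaving 63 and 92. Median: 77.5.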
4. Plot on a number line. Ex:
35--40--45--50--55--60--65--70--75--80--85--90--95--100
   39-----48-------59-------------77.5-------------99
5. Draw a box around the 3 quartiles (the medians).
6. Label. Sealpiano 01:06, 15 June 2007 (UTC)
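For anyone who wants to check the numbers in the steps above, here is a minimal Python sketch. It follows the steps as posted, which means the computed median (59) is placed into both halves before the half-medians are taken; other quartile conventions give different values. The function name is made up for illustration.

```python
from statistics import median

def five_numbers_as_posted(data):
    """Min, lower 'quartile', median, upper 'quartile', max, following the
    steps as posted: the median is included in both halves.
    Assumes no repeated values equal to the median."""
    s = sorted(data)
    med = median(s)
    lower = [x for x in s if x < med] + [med]   # 39, 41, 55, 59
    upper = [med] + [x for x in s if x > med]   # 59, 63, 92, 99
    return s[0], median(lower), med, median(upper), s[-1]

print(five_numbers_as_posted([39, 63, 92, 41, 99, 55]))
# prints (39, 48.0, 59.0, 77.5, 99)
```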
Formatting error
The quartile description is wrongly formatted, with leading [[s but no closing ]]. I'd fix it, but I can't decipher what the correct sentence should be, as I don't understand it and think it may be a partial delete which has become nonsensical. Lilac Soul 08:32, 20 June 2007 (UTC)
Errors
Where did someone get the idea that Sheldon invented the box plot? Tukey invented the plot at least as early as 1970, when he was circulating the manuscript for Exploratory Data Analysis. He adapted the shape from a graphic invented by Mary Ellen Spear in 1952 (Tufte shows a picture of her "range chart" in VDQI). Tukey modified Spear's charts in several significant ways: he used letter values instead of quartiles, he devised distribution-free quantiles for representing outliers and extreme outliers, and (with McGill) he devised confidence intervals on the median and represented them with notches.
The algorithm for producing the box plot in this article page is not Tukey's. A discussant above describes the correct algorithm. Will someone please read Tukey's EDA and correct these serious errors? As it stands, this article only contributes to the confusion over boxplot definitions and makes it seem as if anything goes. As the Frigge, Hoaglin, and Iglewicz article clearly shows, only a few statistics packages (Minitab, SYSTAT, DataDesk) get it right. 202.62.81.253 (talk) 04:15, 23 January 2008 (UTC)
- Feel free to change the article as you see fit, especially if you can include reliable sources. However, it may be prudent to recognize that while there may be only one "right" way to do these, in the sense of doing them as (say) Tukey described, the method is an exploratory one, and the value is not in getting it "right" but in making it useful. Alternative implementations are not a bad thing so long as they are not attributed falsely. Baccyak4H (Yak!) 14:34, 23 January 2008 (UTC)
- For reasons I've explained in a footnote to the article, I don't think these can really be described as "serious errors". Sampling error is always much bigger than the difference between these definitions of quartiles/hinges. And Tukey himself proposed several variations so there's no single "right" version of a boxplot (see also my comments below). Qwfp (talk) 17:41, 22 February 2008 (UTC)
- Cf your footnote: "always dwarfed by sampling variability". It is perfectly valid to use a box plot on data that are not affected by sampling error, so I'd consider "always" a bit too strong. Johannes Hüsing (talk) 12:04, 9 September 2008 (UTC)
In need of attention
1) The external link http://www.physics.csbsju.edu/stats/box2.html contradicts the article with regard to how outliers and suspected outliers are marked on a plot. 2) I believe the "whisker" of a box plot is the line drawn from the box, not the tick mark at the end. 3) The whole article could use some cosmetic work and in-line citation. Ajonlime (talk) 01:35, 29 January 2008 (UTC)
- 1) Precise definitions of what constitutes an outlier and how to mark them do vary. This is not surprising as Tukey himself gave 3 variations on the box plot just in the one book (see mathworld [1] – if you follow the links from that you'll see why he was also famous for introducing a plethora of neologisms and technical meanings for existing words: hinge, fence, H-spread, step, adjacent value... you can't blame people for getting confused). I think a note that "precise definitions vary" is all that's needed, and a link to quantile to explain that there are different ways of calculating quartiles – however sampling variation is always much bigger than the variation between definitions so the latter is of very little consequence.
- 2) Agree. I'll go ahead and change it.
- 3) Maybe, but that's "clean-up" not "expert attention", so when i've done (1) and (2) i'll remove the "expert attention" template.
- Qwfp (talk) 16:58, 22 February 2008 (UTC)
No, Tukey did not give 3 variations on the box plot. You are trusting a secondary source who has not read Tukey carefully and did not know Tukey when he was alive. Go ahead and "follow the links" from a source who never met Tukey, but it will take you to misinformation. Tukey actually cared about the differences you are dismissing with a shrug. Furthermore, the site is obviously being vandalized, since the Sheldon quote has turned up again. This is a minor issue and a small point, but I think it is representative of a Wikipedia problem. Wikipedians have excluded experts by requiring secondary-source attribution, and so the material that ends up in a listing like this is often wrong. You can argue all you want, but I knew Tukey and he would be amused by the content of this article. —Preceding unsigned comment added by 67.173.98.211 (talk) 00:18, 26 March 2008 (UTC)
- 67.173.98.211, If not, it is true that Wikipedia focuses on claims that can be verified. If you can provide a reference other than your personal recollection, I think it would make sense to include it. Do you have one? Pdbailey (talk) 03:25, 26 March 2008 (UTC)
- If you see something wrong it's simpler if you edit the article and give your sources. Wikipedia does not require secondary sources for references, only to establish notability (which is not the issue here) so it's fine to cite Tukey's original works and I'd quite agree that it would be preferable, but unfortunately they aren't always to hand. I've removed the unsourced Sheldon claim — thanks for pointing it out but again why didn't you just go ahead and fix it yourself? I missed that change because I haven't looked at Wikipedia for several days. You've succeeded in making me curious enough to want to dig Tukey's EDA book out of the library though. I'll admit it's several years since I last looked at it. Practises do evolve, and this article is primarily about the box plot as currently used rather than the history of the concept. The history is interesting however and if it's going to be included we should try to get it right. Qwfp (talk) 05:17, 26 March 2008 (UTC)
Thanks. I'm a bit gun-shy about editing and, as you can see, not too literate on the Wiki conventions. I thought it would be better to stay in the background and goad others to think more about the box plot entry. It's fascinating to me that anyone would care to vandalize a page with a "Sheldon" comment without providing a single piece of evidence. I can't imagine the social dynamic going on here. And yes, practices do evolve. One fairly widely used statistics package put the center on the mean, the hinges at one standard deviation, and the end of the whiskers at the extreme values. They called this EDA. It completely defeated the purpose of the box plot and, worse, it led many to think this is what a box plot is. So history does matter. We don't go changing the formula for least-squares regression and call it "least-squares," even though there are more useful regression methods available today. There are better contemporary alternatives to the box plot (for the purpose Tukey had in mind), but we shouldn't change the meaning of the display to defeat his original purpose in inventing it. Thanks again for your thoughtful reply. —Preceding unsigned comment added by 67.173.98.211 (talk) 14:05, 26 March 2008 (UTC)
- Well, thanks for calling my reply courteous. I thought it was slightly curt by my usual WP standards — No angry mastodons had a big influence on me. I'd encourage you to be bold and edit the articles without worrying too much about conventions — if the substance looks right and is referenced somehow, someone else will (usually) fix any perceived problems with the style and format the refs (which I admit gets a bit technical).
- I've no idea what the "Sheldon" business is about either — not come across anything quite like that on other pages. I don't know which stats package uses mean and SD for boxplots but i'd agree that's just plain wrong. I can't get so bothered about minor differences in definitions of quartiles or the distinction between quartiles and hinges. Meant to get EDA out of the library today but forgot my library card... another day. I agree that history matters — I've added the seminal or eponymous ref to several stats articles. I've been wondering what you have in mind when you mention "better contemporary alternatives to the box plot" — violin plots perhaps? I used them quite a bit when I first discovered them but I'd all but forgotten about them since and it seems Wikipedia doesn't have any mention of them (as yet...) Qwfp (talk) 21:18, 26 March 2008 (UTC)
The violin plot is nice, but it's just a kernel density plot. Better to overlay it with a box plot. Then you get both types of information. Also, the violin plot suffers from the bandwidth estimation problem that someone mentioned in the box-plot article. A better alternative is the dot-box plot, found in Wilkinson, L (1999). Dot plots. The American Statistician, 53, 276-281. It overlays a box plot with a dot plot, so you can see all the data and also see the median, outliers, etc. It ameliorates the main deficiency of the box plot - that it can look identical for certain unimodal and multimodal datasets. I also like the Hofmann, Wickham, Kafadar letter-value box plots <http://www.stat.iastate.edu/preprint/articles/2006-10.pdf>. This paper covers a lot of the details I've been bringing up in this discussion.
I want to illustrate the specifics of our disagreement. The argument has nothing to do with the appearance of the box plot (that's the topic of the Mathworld article; Tukey drew several different kinds of box plots -- that's just a matter of surface appearance). It has to do with letter values vs. quantiles. Basically, your statement that "sampling variation is always much bigger than the variation between definitions so the latter is of very little consequence" is false. Take the simple example x = {1,5,6,7,9}. The first 3 Tukey letter values for this batch are 6 (median) and <5, 7> (hinges). The Tukey box plot for these data shows an outlier for 1, a box from 5 to 7 and one whisker from 7 to 9. Now, there are many ways to compute quartiles. Here are just a few results using different algorithms (try it with SAS or another comparable statistics package): <4, 6, 7.5>, <2, 5.5, 6.75>, <5, 6, 7>, <3, 6, 8>. Only one of these (the <5, 6, 7> based on the empirical cumulative distribution function) yields the Tukey letter values; and even that method doesn't always yield letter values. If you draw a box plot based on those estimates, it will look quite different from Tukey's. Only one of the quartile methods yields an outlier. Now, you might say that larger datasets will show less dramatic a difference. That would be generally true, but it is easy to construct counterexamples. The kind of ill-behaved data Tukey anticipated are precisely the ones that are smoked out in his box plot (as opposed to histograms and other density estimators).
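The outlier claim in the example above can be checked with a short Python sketch. It takes the four quartile triples quoted in the comment as given (it does not recompute them) and flags values beyond fences placed 1.5 box lengths outside the box. The 1.5 multiplier is the usual Tukey inner-fence rule and is an assumption here, since the comment does not state it; the function name is illustrative only.

```python
data = [1, 5, 6, 7, 9]

# Quartile triples quoted in the comment above: (lower, median, upper).
quartile_sets = [(4, 6, 7.5), (2, 5.5, 6.75), (5, 6, 7), (3, 6, 8)]

def outside_values(values, q1, q3, k=1.5):
    """Values beyond the fences q1 - k*spread and q3 + k*spread.
    k = 1.5 is assumed (the usual inner-fence multiplier)."""
    spread = q3 - q1
    low_fence, high_fence = q1 - k * spread, q3 + k * spread
    return [v for v in values if v < low_fence or v > high_fence]

for q1, q2, q3 in quartile_sets:
    print((q1, q2, q3), outside_values(data, q1, q3))
```

Under that assumption only the <5, 6, 7> triple flags the value 1; the other three flag nothing.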
Let me describe Tukey's letter-value algorithm, because the poster above didn't quite get it right:
1. Sort the data.
2. Label the sorted list W.
3. Compute the conventional median of W (pick middle value if N is odd, or average two middle values if N is even). Save this letter value.
4. Split W at the median into two lists, L and U. If N is odd, include the median at the end of L and at the beginning of U.
5. Recurse 3-4 for L and U (labeling each as W) until there are no blocks left to split.
The algorithm is most easily programmed as a recursive function, but it is simple to do in a loop with several pointers. For the box plot, we need to recurse only once to get the hinges. Tukey computed more letter values to characterize distributions in more detail. The Hofmann, Wickham, Kafadar letter-value plot exploits this characterization.
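Here is a minimal Python sketch of the algorithm as described above, written as a loop rather than a recursive function. The names are hypothetical, and the sketch makes one interpretive assumption: at each level it keeps only the outward-facing half on each side (the lower part of L, the upper part of U), which is how I read step 5 for letter values. With depth=2 it returns the median and the hinges, which is all the box plot needs.

```python
from statistics import median

def split_at_median(w):
    """Split sorted list w into lower and upper halves.
    If len(w) is odd, the middle value ends L and begins U."""
    n = len(w)
    half = (n + 1) // 2
    return w[:half], w[n - half:]

def letter_values(data, depth=2):
    """Return [(lower, upper), ...] letter-value pairs.
    The first pair is the median twice; the second pair is the hinges."""
    lower = upper = sorted(data)
    pairs = []
    for level in range(depth):
        pairs.append((median(lower), median(upper)))
        if len(lower) < 2:
            break  # nothing left to split
        lower = split_at_median(lower)[0]   # keep the outer (low) half
        upper = split_at_median(upper)[1]   # keep the outer (high) half
    return pairs

print(letter_values([1, 5, 6, 7, 9]))
# prints [(6, 6), (5, 7)]
```

For the six-value batch 39, 41, 55, 63, 92, 99 the same function gives a median of 59 and hinges of 41 and 92.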
Now, why did Tukey do such a peculiar thing?
1. It was simple to do with a paper and pencil. That's one of the main points of EDA. Although Tukey revolutionized statistical computing, he always chose the simpler course over the more complex when he could.
2. Tukey chose actual data values as descriptors (exemplars) instead of latent, hidden, hypothetical, population (pick your word) parameters. There are exceptions, of course, but note how often he used letter values in other, more complex, analytics such as smoothing. He didn't like getting too far away from data, and this was the source of many of the controversies Tukey got into with model-oriented statisticians.
3. Letter values have a precise definition in terms of the data batch. A high-school student can understand the algorithm. If you do some research on quantiles, by contrast, you will find it a morass of different approaches. The elementary statistics book algorithm, based on linear interpolation, barely scratches the surface.
4. Letter values are robust. See Understanding Robust and Exploratory Data Analysis by David C. Hoaglin, Frederick Mosteller, John W. Tukey, John Wiley & Sons, 2000. Many quantile methods depend on restrictive assumptions on the data.
So, this is of more than historical interest. The statistics packages really do differ, sometimes substantially. That's the point of the Hoaglin et al. article cited in the references. So, if I were writing this article, I'd devote a paragraph to letter values vs. quantiles. And I'd point out the difference -- feel free to use my example. It's not that one method's right and the other wrong. It's that the quantile/letter-value distinction can have a profound effect on the appearance of the box plot -- enough to influence what one considers an outlier. You would be surprised at how much box plots differ across statistics packages. On the same data. That's because the quantile-based box plots don't always disclose the algorithm they are using to estimate quantiles. With Tukey letter-values, there's no ambiguity.
I'm going to frustrate you again by not touching the article. You are obviously an intelligent and curious editor experienced with the ways of Wiki and a good monitor of this area (I know that sounds patronizing, but I don't mean it to be). And I suspect you edit in other statistical areas as well. So, in the end, the more research you pick up on your own, the more likely the quality of these articles will improve. I am an expert in this area but I'm not likely to get involved much further. And I really should stop adding comments to this discussion, because it's taking more space than the topic deserves. Thanks for your understanding. 67.173.98.211 (talk) 16:25, 28 March 2008 (UTC)
New Question!
If the only data I have is a box plot, can I determine the SD and SE of the mean? Thanks —Preceding unsigned comment added by 128.118.7.83 (talk) 13:27, 28 March 2008 (UTC)
Actually, you can, but it's kind of pointless. The estimate isn't very good. Estimating the standard deviation from the range was used in the classic quality control literature because it was more difficult to cumulate sums of squares on a simple calculator. See Introduction to Statistical Quality Control, 5th ed., Douglas Montgomery, pages 95-6. If you use a box plot to do this, you will have to be sure there are no extreme outliers, because they will bias the estimate even more than usual. 67.173.98.211 (talk) 14:16, 28 March 2008 (UTC)
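To make the idea concrete, here is a rough Python sketch. Note that it is not the range-based method from the quality control literature mentioned above. It swaps in a different normal-theory rule of thumb: for a normal distribution the interquartile range is about 1.349 standard deviations, so the box length can be rescaled to a rough SD. It assumes approximately normal data and that the sample size n is known, which a box plot alone does not tell you. The function name is made up.

```python
from math import sqrt
from statistics import NormalDist

def sd_se_from_box(q1, q3, n):
    """Rough SD and SE of the mean from the box edges.
    Assumes roughly normal data and a known sample size n."""
    # IQR of a standard normal: 2 * z(0.75), about 1.349
    iqr_per_sigma = 2 * NormalDist().inv_cdf(0.75)
    sd = (q3 - q1) / iqr_per_sigma
    return sd, sd / sqrt(n)

sd, se = sd_se_from_box(q1=48, q3=77.5, n=6)
print(round(sd, 2), round(se, 2))
```

As the reply above says, the estimate isn't very good, and with skewed or heavy-tailed data it can be far off.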
About terminology: boxplot or boxplots
If a figure contains several box plots, like the first figure in this article, should it be called a boxplot or boxplots? The article currently calls it a boxplot (singular). Yebbey (talk) 04:21, 27 April 2011 (UTC)
External Link to Box Plot Tutorial
Re: Link to my web site that I added to the Box Plot page and removed by Glrx. I agree that the link I posted is inconsistent with the WP:NOTHOWTO policy, which I am now familiar with. My intent was not to promote my own products; the link is a tutorial, potentially very useful, though there is promotional material on the page. JonPeltier (talk) 16:16, 17 August 2011 (UTC)
Revise box plot image?
Shouldn't it say "median" instead of "medium" (fig. 5)? — Preceding unsigned comment added by 12.207.23.130 (talk) 02:31, 21 October 2011 (UTC)
- There doesn't seem to be any discussion on the commons page for the image. But yes, it needs to be changed. — Preceding unsigned comment added by 12.107.194.3 (talk) 19:53, 14 May 2012 (UTC)
- Fixed. Glrx (talk) 01:45, 16 May 2012 (UTC)
Figure 3
Shouldn't the whiskers in Figure 3 be of equal length? — Preceding unsigned comment added by Raywood (talk • contribs) 21:23, 2 September 2012 (UTC)
Michelson or Michelson–Morley
The first image in the article is a very nice real-world example of… what? The annotation says it comes from the notable experiments by Michelson and Morley to determine the speed of light, but the information in the file (https://en.wikipedia.org/wiki/File:Michelsonmorley-boxplot.svg) says:
This data is not from the Michelson-Morley experiment but from Michelson's measurement of the speed of light. See MICHELSON, A. A. (1882).
Obviously there is a mistake in either one of the annotations!
--Uncronopio (talk) 10:00, 5 January 2017 (UTC)
Date of introduction?
"Since the mathematician John W. Tukey introduced this type of visual data display in 1969 ..."
This contradicts the article John Tukey, which says: 'He introduced the box plot in his 1977 book, "Exploratory Data Analysis".' So, which is true? --Feldkurat Katz (talk) 01:23, 1 May 2018 (UTC)
Outliers versus outside values
It is incorrect to say that individual values plotted above the upper adjacent value or below the lower adjacent value (i.e., beyond the whiskers) are necessarily outliers. They are defined by Tukey as "outside values" which may or may not be outliers. Outliers are values that may or may not be removed from the data set after careful individual consideration. It is however true that outliers will almost certainly appear as outside values.
See J.M. Chambers, W.S. Cleveland, B. Kleiner, P.A. Tukey, Graphical Methods for Data Analysis, 1983, Wadsworth & Brooks/Cole, ISBN 0-412-05271-7
Pg. 22 reads, in part: "The individual outside values give the viewer an opportunity to consider the question of outliers, that is, observations that seem unusually, or even implausibly, large or small. Outside values are not necessarily outliers (indeed, the [plot in example] suggests that the [outside values shown in example plot] are not), but any outliers will almost certainly appear as outside values." — Preceding unsigned comment added by David Buchan (talk • contribs) 16:34, 22 November 2020 (UTC)
Notched box plot widths
Notched box plots have variable notch height, not width. The width (left-right indentation to the sides of the box) is arbitrarily chosen to be visually pleasing, and should be consistent amongst all box plots being displayed on the same page.
The height of the notch is indeed variable, and represents the uncertainty in the median. If two side-by-side box plots have notches which do not overlap, then it is likely the medians are significantly different.
In summary, the text should be revised to reflect that the height is the statistically variable parameter, not the width (which should be fixed for a given set of box plots).
In addition, I would suggest, for clarity, changing the sentence showing the ±1.58 × IQR / sqrt(n) to state that the upper and lower limits of the notch are:
median ± 1.58 × IQR / sqrt(n).
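As a small illustration of the suggested wording, here is a Python sketch that computes the notch limits from that formula and checks whether two notches overlap. The function names are hypothetical, and the overlap check is only the informal visual comparison described above, not a formal test.

```python
from math import sqrt

def notch_limits(median, q1, q3, n):
    """Lower and upper notch limits: median +/- 1.58 * IQR / sqrt(n)."""
    half_height = 1.58 * (q3 - q1) / sqrt(n)
    return median - half_height, median + half_height

def notches_overlap(a, b):
    """True if two (lower, upper) notch intervals share any point."""
    return a[0] <= b[1] and b[0] <= a[1]

first = notch_limits(median=10, q1=8, q3=12, n=16)    # about (8.42, 11.58)
second = notch_limits(median=14, q1=12, q3=16, n=16)  # about (12.42, 15.58)
print(first, second, notches_overlap(first, second))
```

Here the notch height depends on the IQR and n, while the horizontal indentation of the notch is purely a drawing choice.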