Box plot: Difference between revisions
m Reverted edits by 2.24.145.161 (talk) to last version by Glrx |
added discussion and example of use of notched and variable width box plots; corrected figure numbering |
||
Line 6: | Line 6: | ||
==Alternative forms== |
==Alternative forms== |
||
[[File:Box-Plot mit Min-Max Abstand.png|thumb|Boxplot with whiskers from minimum to maximum]] |
[[File:Box-Plot mit Min-Max Abstand.png|thumb|Figure 2. Boxplot with whiskers from minimum to maximum]] |
||
[[File:Box-Plot mit Interquartilsabstand.png|thumb|Same Boxplot with whiskers with maximum 1.5 IQR]] |
[[File:Box-Plot mit Interquartilsabstand.png|thumb|Figure 3. Same Boxplot with whiskers with maximum 1.5 IQR]] |
||
Box and whisker plots are uniform in their use of the box: the bottom and top of the box are always the 25<sup>th</sup> and 75<sup>th</sup> [[percentile]] (the lower and upper [[quartile]]s, respectively), and the band near the middle of the box is always the 50<sup>th</sup> [[percentile]] (the [[median]]). But the ends of the whiskers can represent several possible alternative values, among them: |
Box and whisker plots are uniform in their use of the box: the bottom and top of the box are always the 25<sup>th</sup> and 75<sup>th</sup> [[percentile]] (the lower and upper [[quartile]]s, respectively), and the band near the middle of the box is always the 50<sup>th</sup> [[percentile]] (the [[median]]). But the ends of the whiskers can represent several possible alternative values, among them: |
||
* the minimum and maximum of all the data<ref>{{Cite journal |
* the minimum and maximum of all the data<ref name="mcgill tukey larsen">{{Cite journal |
||
| author = Robert McGill, [[John W. Tukey]], Wayne A. Larsen |
| author = Robert McGill, [[John W. Tukey]], Wayne A. Larsen |
||
| title = Variations of Box Plots |
| title = Variations of Box Plots |
||
Line 20: | Line 20: | ||
| pages = 12–16 |
| pages = 12–16 |
||
| url = http://lis.epfl.ch/~markus/References/McGill78.pdf |
| url = http://lis.epfl.ch/~markus/References/McGill78.pdf |
||
}}</ref> |
}}</ref> (as in Figure 2) |
||
* the lowest datum still within 1.5 [[Interquartile range|IQR]] of the lower quartile, and the highest datum still within 1.5 [[Interquartile range|IQR]] of the upper quartile (as in Figure |
* the lowest datum still within 1.5 [[Interquartile range|IQR]] of the lower quartile, and the highest datum still within 1.5 [[Interquartile range|IQR]] of the upper quartile (as in Figure 3) |
||
* one standard deviation above and below the mean of the data |
* one standard deviation above and below the mean of the data |
||
* the 9<sup>th</sup> [[percentile]] and the 91<sup>st</sup> [[percentile]] |
* the 9<sup>th</sup> [[percentile]] and the 91<sup>st</sup> [[percentile]] |
||
Line 37: | Line 37: | ||
The unusual percentiles 2%, 9%, 91%, 98% are sometimes used for whisker cross-hatches and whisker ends to show the [[seven-number summary]]. If the [[normal distribution|data are normally distributed]] the locations of the seven marks on the box plot will be equally spaced. |
The unusual percentiles 2%, 9%, 91%, 98% are sometimes used for whisker cross-hatches and whisker ends to show the [[seven-number summary]]. If the [[normal distribution|data are normally distributed]] the locations of the seven marks on the box plot will be equally spaced. |
||
==Variations== |
|||
[[File:Fourboxplots.png|thumb|Figure 4. Four box plots, with and without notches and variable width]] |
|||
Several variations on the traditional box plot have been described. Two of the most common are variable width box plots and notched box plots (see figure 4). |
|||
Variable width box plots illustrate the size of each group whose data is being plotted by making the width of the box proportional to the size of the group. A popular convention is to make the box width proportional to the square root of the size of the group.<ref name="mcgill tukey larsen" /> |
|||
Notched box plots apply a "notch" or narrowing of the box around the median. Notches are useful in offering a rough guide to significance of difference of medians; if the notches of two boxes do not overlap, this offers evidence of a statistically significant difference between the medians.<ref name="mcgill tukey larsen" /> The width of the notches is proportional to the interquartile range of the sample and inversely proportional to the square root of the size of the sample. However, there is uncertainty about the most appropriate multiplier (as this may vary depending on the similarity of the variances of the samples).<ref name="mcgill tukey larsen" /> One convention is to use +/-1.58*IQR/sqrt(n).<ref name="Rboxplotstats"> {{Cite web | title = R: Box Plot Statistics | work = R manual | url = http://stat.ethz.ch/R-manual/R-devel/library/grDevices/html/boxplot.stats.html | accessdate = 26 June 2011}} </ref> |
|||
== Visualization == |
== Visualization == |
||
[[Image:Boxplot vs PDF.svg|thumb|Figure |
[[Image:Boxplot vs PDF.svg|thumb|Figure 5. Boxplot and a [[probability density function]] (pdf) of a Normal N(0,1σ<sup>2</sup>) Population]] |
||
The boxplot is a quick way of examining one or more sets of data graphically. Boxplots may seem more primitive than a [[histogram]] or [[kernel density estimation|kernel density estimate]] but they do have some advantages. They take up less space and are therefore particularly useful for comparing distributions between several groups or sets of data (see Figure 1 for an example). Choice of [[Histogram#Number of bins and width|number and width of bins]] techniques can heavily influence the appearance of a histogram, and choice of bandwidth can heavily influence the appearance of a kernel density estimate. |
The boxplot is a quick way of examining one or more sets of data graphically. Boxplots may seem more primitive than a [[histogram]] or [[kernel density estimation|kernel density estimate]] but they do have some advantages. They take up less space and are therefore particularly useful for comparing distributions between several groups or sets of data (see Figure 1 for an example). Choice of [[Histogram#Number of bins and width|number and width of bins]] techniques can heavily influence the appearance of a histogram, and choice of bandwidth can heavily influence the appearance of a kernel density estimate. |
||
As looking at a statistical distribution is more intuitive than looking at a boxplot, comparing the boxplot against the probability density function (theoretical histogram) for a normal N(0,1σ<sup>2</sup>) distribution may be a useful tool for understanding the boxplot (Figure |
As looking at a statistical distribution is more intuitive than looking at a boxplot, comparing the boxplot against the probability density function (theoretical histogram) for a normal N(0,1σ<sup>2</sup>) distribution may be a useful tool for understanding the boxplot (Figure 5). |
||
==See also== |
==See also== |
Revision as of 04:26, 26 June 2011
In descriptive statistics, a box plot or boxplot (also known as a box-and-whisker diagram or plot) is a convenient way of graphically depicting groups of numerical data through their five-number summaries: the smallest observation (sample minimum), lower quartile (Q1), median (Q2), upper quartile (Q3), and largest observation (sample maximum). A boxplot may also indicate which observations, if any, might be considered outliers.
Boxplots display differences between populations without making any assumptions about the underlying statistical distribution: they are non-parametric. The spacings between the different parts of the box help indicate the degree of dispersion (spread) and skewness in the data, and identify outliers. Boxplots can be drawn either horizontally or vertically.
Alternative forms
Box and whisker plots are uniform in their use of the box: the bottom and top of the box are always the 25th and 75th percentile (the lower and upper quartiles, respectively), and the band near the middle of the box is always the 50th percentile (the median). But the ends of the whiskers can represent several possible alternative values, among them:
- the minimum and maximum of all the data[1] (as in Figure 2)
- the lowest datum still within 1.5 IQR of the lower quartile, and the highest datum still within 1.5 IQR of the upper quartile (as in Figure 3)
- one standard deviation above and below the mean of the data
- the 9th percentile and the 91st percentile
- the 2nd percentile and the 98th percentile.
Any data not included between the whiskers should be plotted as an outlier with a dot, small circle, or star, but occasionally this is not done.
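As a minimal sketch of the 1.5 IQR convention listed above, the following Python snippet computes the quartiles, the whisker ends, and the points that would be drawn as outliers. The function name `box_plot_stats`, the sample data, and the choice of the "inclusive" quartile method are assumptions made for illustration; other quartile definitions give slightly different values.

```python
import statistics

def box_plot_stats(data, whisker_factor=1.5):
    """Quartiles, whisker ends, and outliers under the 1.5 IQR convention.

    Hypothetical helper for illustration; the quartile method is an assumption.
    """
    values = sorted(data)
    # Lower quartile, median, upper quartile (inclusive interpolation).
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - whisker_factor * iqr
    upper_fence = q3 + whisker_factor * iqr
    # Whiskers end at the most extreme data still inside the fences.
    inside = [v for v in values if lower_fence <= v <= upper_fence]
    outliers = [v for v in values if v < lower_fence or v > upper_fence]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "whisker_low": inside[0],
        "whisker_high": inside[-1],
        "outliers": outliers,
    }

# Made-up sample with one unusually large value.
stats = box_plot_stats([1, 2, 3, 4, 5, 6, 7, 8, 9, 30])
print(stats)
```

With this sample the value 30 lies beyond the upper fence, so it would be plotted as an individual point and the upper whisker would stop at 9.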
Some box plots include an additional dot or cross plotted inside the box to represent the mean of the data in addition to the median.
On some box plots a crosshatch is placed on each whisker, before the end of the whisker.
Fairly rarely, box plots can be presented with no whiskers at all.
Because of this variability, it is appropriate to describe the convention being used for the whiskers and outliers in the caption for the plot.
The unusual percentiles 2%, 9%, 91%, 98% are sometimes used for whisker cross-hatches and whisker ends to show the seven-number summary. If the data are normally distributed, the locations of the seven marks on the box plot will be approximately equally spaced.
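The spacing claim can be checked numerically. The sketch below, which assumes a standard normal distribution and uses Python's standard-library `statistics.NormalDist`, computes where the seven percentiles fall and the gaps between neighbouring marks; the variable names are illustrative only.

```python
from statistics import NormalDist

# Percentiles of the seven-number summary: the two outer pairs,
# the quartiles, and the median.
percentiles = [2, 9, 25, 50, 75, 91, 98]

# Position of each mark on a standard normal scale (in standard deviations).
positions = [NormalDist().inv_cdf(p / 100) for p in percentiles]

# Distance between each pair of neighbouring marks.
gaps = [b - a for a, b in zip(positions, positions[1:])]

for p, z in zip(percentiles, positions):
    print(f"{p:>2}th percentile: {z:+.3f}")
print("gaps:", [round(g, 3) for g in gaps])
```

The gaps all come out close to 0.67 to 0.71 standard deviations, so the marks are nearly, though not exactly, evenly spaced.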
Variations
Several variations on the traditional box plot have been described. Two of the most common are variable width box plots and notched box plots (see Figure 4).
Variable width box plots illustrate the size of each group whose data is being plotted by making the width of the box proportional to the size of the group. A popular convention is to make the box width proportional to the square root of the size of the group.[1]
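A minimal sketch of the square-root convention follows. The function `box_widths`, the `max_width` parameter, and the group sizes are hypothetical; the only idea taken from the text is that each width is proportional to the square root of the group size.

```python
import math

def box_widths(group_sizes, max_width=0.8):
    """Box widths proportional to sqrt(n), scaled so the largest group
    gets max_width (an arbitrary plotting unit chosen for illustration)."""
    roots = [math.sqrt(n) for n in group_sizes]
    largest = max(roots)
    return [max_width * r / largest for r in roots]

# Three made-up groups; each is four times the size of the previous one,
# so each box is only twice as wide as the previous one.
widths = box_widths([25, 100, 400])
print(widths)
```

Scaling by the square root keeps very large groups from dominating the figure while still showing which groups contain more data.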
Notched box plots apply a "notch" or narrowing of the box around the median. Notches are useful in offering a rough guide to significance of difference of medians; if the notches of two boxes do not overlap, this offers evidence of a statistically significant difference between the medians.[1] The width of the notches is proportional to the interquartile range of the sample and inversely proportional to the square root of the size of the sample. However, there is uncertainty about the most appropriate multiplier (as this may vary depending on the similarity of the variances of the samples).[1] One convention is to extend the notch to ±1.58 × IQR/√n around the median, where n is the size of the sample.[2]
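The notch convention cited above can be sketched as follows. The helper names `notch_limits` and `notches_overlap`, the quartile method, and the two sample groups are assumptions made for illustration; the 1.58 multiplier is the one mentioned in the text.

```python
import math
import statistics

def notch_limits(data, multiplier=1.58):
    """Return (low, high) notch limits: median +/- multiplier * IQR / sqrt(n)."""
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    half_width = multiplier * (q3 - q1) / math.sqrt(len(data))
    return median - half_width, median + half_width

def notches_overlap(data_a, data_b):
    """True if the notch intervals of the two samples intersect."""
    low_a, high_a = notch_limits(data_a)
    low_b, high_b = notch_limits(data_b)
    return low_a <= high_b and low_b <= high_a

# Two made-up groups of 25 values each, shifted by 20 units.
group_a = list(range(1, 26))
group_b = list(range(21, 46))
print(notch_limits(group_a), notch_limits(group_b))
print(notches_overlap(group_a, group_b))
```

Here the two notch intervals do not intersect, which under this rough guide would be read as evidence that the medians differ; it is not a substitute for a formal test.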
Visualization
The boxplot is a quick way of examining one or more sets of data graphically. Boxplots may seem more primitive than a histogram or kernel density estimate but they do have some advantages. They take up less space and are therefore particularly useful for comparing distributions between several groups or sets of data (see Figure 1 for an example). The choice of the number and width of bins can heavily influence the appearance of a histogram, and the choice of bandwidth can heavily influence the appearance of a kernel density estimate.
Because looking at a statistical distribution is more intuitive than looking at a boxplot, comparing the boxplot against the probability density function (theoretical histogram) for a normal N(0,1σ²) distribution may be a useful tool for understanding the boxplot (Figure 5).
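The comparison with a normal density can also be made numerically. The sketch below assumes a standard normal population and whiskers limited to 1.5 IQR beyond the quartiles (one of the conventions listed under Alternative forms); it reports where the box edges and whisker limits fall and what share of the population lies between the limits.

```python
from statistics import NormalDist

std_normal = NormalDist(mu=0, sigma=1)

# Box edges: the lower and upper quartiles of the normal distribution.
q1 = std_normal.inv_cdf(0.25)
q3 = std_normal.inv_cdf(0.75)
iqr = q3 - q1

# Whisker limits under the assumed 1.5 IQR convention.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Share of the population that falls between the whisker limits.
coverage = std_normal.cdf(upper_fence) - std_normal.cdf(lower_fence)

print(f"quartiles: {q1:+.4f}, {q3:+.4f}  (IQR = {iqr:.4f})")
print(f"whisker limits: {lower_fence:+.4f}, {upper_fence:+.4f}")
print(f"share inside whisker limits: {coverage:.4f}")
```

Under these assumptions the box spans about ±0.67 standard deviations, the whisker limits sit near ±2.7 standard deviations, and roughly 99.3% of a normal population lies inside them, so fewer than 1% of observations would be flagged as outliers.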
See also
Notes
- ^ a b c d Robert McGill, John W. Tukey, Wayne A. Larsen (1978). "Variations of Box Plots" (PDF). The American Statistician. 32 (1): 12–16.
- ^ "R: Box Plot Statistics". R manual. Retrieved 26 June 2011.
References
- John W. Tukey. "Exploratory Data Analysis". Addison-Wesley, Reading, MA. 1977.
- Robert McGill, John W. Tukey, Wayne A. Larsen. "Variations of Box Plots". The American Statistician, Vol. 32 (1), 1978. 12–16.
- Yoav Benjamini. "Opening the Box of a Boxplot". The American Statistician. Vol 42 (4), November 1988. 257–262.
- Michael Frigge and David C. Hoaglin and Boris Iglewicz. "Some Implementations of the Boxplot". The American Statistician. Vol. 43 (1), February 1989. 50–54.
- Peter J. Rousseeuw, Ida Ruts and John W. Tukey. "The Bagplot: A Bivariate Boxplot". The American Statistician. Vol 53 (4), November 1999. 382–387.
External links
- Visual Presentation of Data by Means of Box Plots (PDF)
- On-line box plot calculator with explanations and examples
- Box and Whisker Plots in gnuplot
- Box and Whisker Diagrams: getting Microsoft Excel to plot them for you
- Box and Whisker Plots in Microsoft Excel
- Box plot and whisker plots in Excel 2007
- Box plot explanation, examples and a javascript/css-based box plot
- Beeswarm Boxplot - superimposing a frequency-jittered stripchart on top of a boxplot