Week 3: Box Plotting
A. What is a box-and-whisker plot?
View the following ActivStats applets to learn how to construct and interpret box-and-whisker plots:
Let’s go through the construction and interpretation of the box-and-whisker plot in more detail with another example.
Below are data on a sample of 25 cities in the United States. The city, average high temperature in January and average high temperature in July are given. Let’s use the January high temperatures only in this example.



B. How to construct a box-and-whisker plot by hand.
Minimum = 20.7 degrees
1st quartile = 35.7 degrees
Median = 42.3 degrees
3rd quartile = 55.6 degrees
Maximum = 75.2 degrees
Here’s how to get each of these values:
 Order the values from smallest to largest:
20.7 29.0 30.3 31.9 33.7 34.7 35.7 36.4 37.6 37.7 37.9 40.2 42.3 43.2 45.0 45.9 50.4 54.1 55.6 60.8 61.0 65.7 65.9 65.9 75.2
 Minimum is the smallest value in the data set: 20.7 degrees (Minneapolis)
 Maximum is the largest value in the data set: 75.2 degrees (Miami)
 With 25 observations, the median is the (25+1)/2 = 13th ordered value:
20.7 29.0 30.3 31.9 33.7 34.7 35.7 36.4 37.6 37.7 37.9 40.2 42.3 43.2 45.0 45.9 50.4 54.1 55.6 60.8 61.0 65.7 65.9 65.9 75.2
Median = 42.3 degrees
 First quartile: median of observations less than or equal to the overall median*:
There are 13 observations less than or equal to the overall median of 42.3 degrees. The median of these 13 observations is the (13+1)/2 = 7th ordered value = 35.7 degrees.
20.7 29.0 30.3 31.9 33.7 34.7 35.7 36.4 37.6 37.7 37.9 40.2 42.3
 Third quartile: median of observations greater than or equal to the overall median*:
There are 13 observations greater than or equal to the overall median of 42.3 degrees. The median of these 13 observations is the (13+1)/2 = 7th ordered value (starting at the overall median) = 55.6 degrees.
42.3 43.2 45.0 45.9 50.4 54.1 55.6 60.8 61.0 65.7 65.9 65.9 75.2
*The “rules” for finding the first and third quartiles are from Stats: Data and Models. Sullivan appears to use a different set of “rules” in which the median is NOT included in the calculation of the quartiles.
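The steps above can be sketched in a few lines of Python. This is only an illustration of the rule described here (each half includes the overall median when the number of observations is odd). The function name five_number_summary is made up for this example and does not come from either textbook.

```python
from statistics import median

# January average high temperatures for the 25 sampled cities (from the text).
january = [20.7, 29.0, 30.3, 31.9, 33.7, 34.7, 35.7, 36.4, 37.6, 37.7,
           37.9, 40.2, 42.3, 43.2, 45.0, 45.9, 50.4, 54.1, 55.6, 60.8,
           61.0, 65.7, 65.9, 65.9, 75.2]

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max).

    Hypothetical helper: each half includes the overall median when n is odd.
    """
    data = sorted(values)
    n = len(data)
    half = (n + 1) // 2              # 13 when n = 25; the middle value is shared
    lower = data[:half]              # observations up to and including the median
    upper = data[n - half:]          # observations from the median onward
    return data[0], median(lower), median(data), median(upper), data[-1]

print(five_number_summary(january))
```

For these 25 values the result should match the five-number summary listed at the start of this section. A rule that leaves the median out of both halves would use 12 observations per half and could give slightly different quartiles.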
(instructor notes: perhaps a link to the objective of calculating the five number summary could be used here)
Definitions:
Calculate the fences for this example.
 First, calculate IQR:
Recall, the interquartile range (IQR)* = 3rd quartile – 1st quartile = 55.6 – 35.7 = 19.9 degrees
 Second, substitute appropriate values into the formulas to determine the fences:
Lower fence = 35.7 – (1.5)(19.9) = 5.85 degrees
Upper fence = 55.6 + (1.5)(19.9) = 85.45 degrees
 Note: the fences are used to determine if there are outliers and are NOT included in the box-and-whisker plot (unless they happen to be values in the data set). Since there are no observations less than 5.85 degrees or greater than 85.45 degrees, none of the observations would be considered outliers (according to this rule).
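As a quick check of the fence arithmetic, here is a minimal Python sketch. The function name fences and the parameter k are chosen just for this illustration. The quartiles are the ones found above for the 25 cities.

```python
# January average high temperatures for the 25 sampled cities (from the text).
january = [20.7, 29.0, 30.3, 31.9, 33.7, 34.7, 35.7, 36.4, 37.6, 37.7,
           37.9, 40.2, 42.3, 43.2, 45.0, 45.9, 50.4, 54.1, 55.6, 60.8,
           61.0, 65.7, 65.9, 65.9, 75.2]

def fences(q1, q3, k=1.5):
    """Return (lower fence, upper fence) using the 1.5 x IQR rule."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

lower_fence, upper_fence = fences(35.7, 55.6)

# Any observation beyond a fence is flagged as an outlier by this rule.
outliers = [x for x in january if x < lower_fence or x > upper_fence]

print(round(lower_fence, 2), round(upper_fence, 2), outliers)
```

The list of outliers comes back empty for these data, in agreement with the note above.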
Draw the axis, label it, and give the plot a title.
Attempt steps 4 and 5 for our example.
(Show Example)
Draw the lower and upper whiskers for our example.
(Show Example)
(Example with outliers)
Let’s suppose one additional city was included in this analysis: Fairbanks, Alaska, with an average high temperature in January of 2 degrees. Let’s construct the box-and-whisker plot with the original 25 cities and Fairbanks. Here are the average January high temperatures for these 26 cities listed from lowest to highest:
2.0 20.7 29.0 30.3 31.9 33.7 34.7 35.7 36.4 37.6 37.7 37.9 40.2 42.3 43.2 45.0 45.9 50.4 54.1 55.6 60.8 61.0 65.7 65.9 65.9 75.2
 Five-number summary:
 Minimum: 2.0 degrees (Fairbanks)
 Maximum: 75.2 degrees (Miami)
 Median: (26+1)/2 = 13.5th ordered value (average of 13th and 14th ordered values) = (40.2 + 42.3) / 2 = 41.25 degrees
2.0 20.7 29.0 30.3 31.9 33.7 34.7 35.7 36.4 37.6 37.7 37.9 40.2 42.3 43.2 45.0 45.9 50.4 54.1 55.6 60.8 61.0 65.7 65.9 65.9 75.2
 1st quartile: median of observations less than the overall median*:
There are 13 observations less than the overall median of 41.25 degrees. The median of these 13 observations is the (13+1)/2 = 7th ordered value = 34.7 degrees.
2.0 20.7 29.0 30.3 31.9 33.7 34.7 35.7 36.4 37.6 37.7 37.9 40.2
 3rd quartile: median of observations greater than the overall median*:
There are 13 observations greater than the overall median of 41.25 degrees. The median of these 13 observations is the (13+1)/2 = 7th ordered value starting at 42.3 degrees = 55.6 degrees.
42.3 43.2 45.0 45.9 50.4 54.1 55.6 60.8 61.0 65.7 65.9 65.9 75.2
Therefore, the five-number summary is:
Minimum = 2.0 degrees
1st quartile = 34.7 degrees
Median = 41.25 degrees
3rd quartile = 55.6 degrees
Maximum = 75.2 degrees
*Again, quartiles were calculated based on the rules in Stats: Data and Models by De Veaux, Velleman, and Bock.
 Calculate the fences:
 IQR = 55.6 – 34.7 = 20.9 degrees
 Lower fence = 34.7 – (1.5)(20.9) = 3.35 degrees
 Upper fence = 55.6 + (1.5)(20.9) = 86.95 degrees
Note that the minimum value is less than the lower fence. Therefore, the minimum (Fairbanks, Alaska) is considered an outlier following these rules. On the box-and-whisker plot, some sort of symbol (an asterisk, for example) would be placed above 2 on the axis to represent that 2 is considered an outlier.
Question: Would the lower whisker extend to the lower fence of 3.35 degrees in this example?
 Finally, construct the box-and-whisker plot:
(Box-and-whisker plot; axis: average high temperature in January, degrees Fahrenheit)
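One way to check the question about the lower whisker is to compute where each whisker ends for the 26 cities. The sketch below assumes the usual convention, consistent with the earlier note that fences are not drawn, that each whisker stops at the most extreme observation that is not an outlier. Variable names such as whisker_low are illustrative only.

```python
from statistics import median

# January average high temperatures for the 26 cities, including Fairbanks.
cities26 = [2.0, 20.7, 29.0, 30.3, 31.9, 33.7, 34.7, 35.7, 36.4, 37.6,
            37.7, 37.9, 40.2, 42.3, 43.2, 45.0, 45.9, 50.4, 54.1, 55.6,
            60.8, 61.0, 65.7, 65.9, 65.9, 75.2]

data = sorted(cities26)
n = len(data)
half = (n + 1) // 2                      # 13 observations in each half when n = 26
q1 = median(data[:half])                 # median of the lower 13 values
q3 = median(data[n - half:])             # median of the upper 13 values

iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Observations beyond a fence are outliers; the rest determine the whiskers.
outliers = [x for x in data if x < lower_fence or x > upper_fence]
inside = [x for x in data if lower_fence <= x <= upper_fence]
whisker_low, whisker_high = inside[0], inside[-1]

print("outliers:", outliers)
print("whiskers end at:", whisker_low, "and", whisker_high)
```

Under that convention the lower whisker ends at an actual data value (the smallest observation above the lower fence), not at the fence itself.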
C. Interpreting a box-and-whisker plot
The center and spread of the data can be determined from the box-and-whisker plot.
 The median is indicated with a vertical line inside the box. If no vertical line is inside the box, this indicates that the median is equal to the first or third quartile. This could happen if there were a lot of the same values in a data set. For example, consider these 20 ordered values:
0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 2 2 3 5
 The median is the average of the 10th and 11th ordered values. Both are 1, so the median is 1.
 1st quartile: median of the 10 observations in the lower half of the ordered list, which is the average of the 5th and 6th ordered values = 0. Note: even though the median is 1, the 7th, 8th, 9th, and 10th ordered values (all equal to 1) fall below the median's position and need to be included in the calculation of the first quartile.
 3rd quartile: median of the 10 observations in the upper half of the ordered list, which is the average of the 5th and 6th ordered values beginning at the 11th observation = 1.
Note that the median and the third quartile are the same. In the box-and-whisker plot of these data below, note there is no vertical line inside the box because of this. Also note that there is no lower whisker because the minimum value equals the 1st quartile.
 Although the mean is not typically indicated on a box-and-whisker plot, many software packages have the option of adding a symbol at the mean. As an example, here is the box-and-whisker plot from Minitab of the average high temperatures in January from the sample of 25 U.S. cities – note the mean symbol at about 45 degrees.
 The “box” part of a box-and-whisker plot represents the spread of the middle 50% of the data (i.e. the interquartile range).
 The distance from the observation with the smallest value to the largest value represents the range of the data. If there are no outliers, this is the distance from the left end of the lower whisker to the right end of the upper whisker.
Even some features of the shape can be determined from the box-and-whisker plot.
 The data appear symmetric if all of the following hold:
 the median is in the middle of the box (halfway between the first and third quartiles)
 the length of the left (lower) whisker is about the same as the length of the right (upper) whisker
 the mean is about the same as the median (if the mean is given on the box-and-whisker plot)
 If at least one of the above doesn’t hold, it’s an indication that the data are not symmetric and could be skewed. In particular,
 the data are likely right skewed if the right whisker is longer than the left whisker and the distance between the median and 3rd quartile is larger than the distance between the 1st quartile and the median.
 the data are likely left skewed if the left whisker is longer than the right whisker and the distance between the 1st quartile and the median is larger than the distance between the median and 3rd quartile.
It is a little harder to determine the type of distribution of symmetric data. Below are a couple of examples. The first example is of randomly generated normally distributed data – the histogram of the data is on the left with the corresponding box-and-whisker plot on the right. (Perhaps we can think of the variable as height.)
The second example is of randomly generated data from a uniform distribution – the histogram is on the left and the corresponding box-and-whisker plot is on the right. (Again, perhaps we can think of the variable as height.)
Note that for both, the whiskers in the box-and-whisker plot have about the same length and that the distance from the 1st quartile to the median is about the same as the distance from the median to the third quartile. In addition, notice how the mean and median are nearly identical. Notice that the length of the lower whisker (for example) in the uniformly distributed data is about the same as the distance from the 1st quartile to the median. But, for normally distributed data, the length of each whisker is longer than the distance from the median to either the 1st or 3rd quartile.
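The difference in whisker length between the two shapes can also be seen from the theoretical quartiles. The sketch below is an idealized calculation on the population distributions (a standard normal and a uniform on 0 to 1, both chosen only for illustration), not on the randomly generated samples shown in the plots. For the normal case it reports the distance from the 3rd quartile to the upper fence, which is the longest the whisker could be. A whisker drawn from an actual sample stops at the largest observation inside the fence and so is typically somewhat shorter.

```python
from statistics import NormalDist

# Standard normal: distances are in standard-deviation units.
z75 = NormalDist().inv_cdf(0.75)        # distance from the median to Q3
normal_half_box = z75
normal_iqr = 2 * z75                    # the box is symmetric about the median
normal_max_whisker = 1.5 * normal_iqr   # distance from Q3 to the upper fence

# Uniform on [0, 1]: the quartiles are 0.25, 0.50 and 0.75.
uniform_half_box = 0.75 - 0.50          # distance from the median to Q3
uniform_whisker = 1.00 - 0.75           # distance from Q3 to the maximum (inside the fence)

print("normal:  whisker / half-box is at most", round(normal_max_whisker / normal_half_box, 2))
print("uniform: whisker / half-box is", round(uniform_whisker / uniform_half_box, 2))
```

For the uniform case the whisker and the half-box have the same length, while for the normal case the whisker can be up to about three times the half-box length.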
Below is an example of right skewed data. Notice all the high outliers in the box-and-whisker plot and the lack of a lower whisker. (Again, we can think of the variable as height.)
In the “high temperatures” example, since the median is closer to the first quartile than the third quartile and because the right whisker is a little longer than the left whisker, we would say the data are slightly skewed to the right (higher temperatures). (See box-and-whisker plot below.) This makes sense as most cities have cold average high temperatures in January, but there are a few with somewhat warm temperatures for January.
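The two comparisons used in this argument (whisker lengths and the two halves of the box) can be written as a small check. This is a rough sketch only. The function name describe_shape is invented for this illustration, it assumes there are no outliers so that the whiskers end at the minimum and maximum, and it uses strict comparisons, so it does not capture the “about the same” judgment a person makes when reading a plot.

```python
def describe_shape(minimum, q1, med, q3, maximum):
    """Apply the two skewness comparisons to a five-number summary.

    Hypothetical helper; assumes no outliers (whiskers reach the min and max).
    """
    left_whisker = q1 - minimum
    right_whisker = maximum - q3
    lower_box = med - q1          # distance from Q1 to the median
    upper_box = q3 - med          # distance from the median to Q3

    if right_whisker > left_whisker and upper_box > lower_box:
        return "right skewed"
    if left_whisker > right_whisker and lower_box > upper_box:
        return "left skewed"
    return "no clear skew from these two comparisons"

# Five-number summary of the January high temperatures for the 25 cities.
print(describe_shape(20.7, 35.7, 42.3, 55.6, 75.2))
```

For the January data the right whisker (about 19.6 degrees) is longer than the left (about 15.0 degrees), and the upper part of the box (about 13.3 degrees) is longer than the lower part (about 6.6 degrees), so both comparisons point to right skew.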
D. Using box-and-whisker plots to compare the distribution of a quantitative variable between two or more groups.
When two (or more) box-and-whisker plots are displayed next to each other on the same graph using a common scale on the axis (one for each of the groups being compared), they are called side-by-side box-and-whisker plots. Such plots are very useful for comparing the quantitative variable of interest between two (or more) groups. View the following ActivStats applet to learn more about using side-by-side box-and-whisker plots to compare the distribution of a quantitative variable between groups.
Note there are 2 outliers in July: San Francisco’s average July high temperature of 71.6 degrees and Phoenix at 105.9.
When comparing the distribution of a quantitative variable between two or more groups, we are primarily interested in comparing the center and, perhaps, the spread of the distributions.
Question: Does there appear to be a difference in the “average” high temperatures between January and July?
Question: Is the variation of high temperatures in January different than in July?