Descriptive statistics lets us summarize a data set with a small number of measures that describe how the data are distributed. Some measures describe the central tendency of the data, others its dispersion or the shape of its distribution; several of them are found in the five-number summary.
What is the five-number summary?
Based on the above, the five-number summary can be defined as a set of five statistics of a data set that describe, in a very simple way, the range of the set and its dispersion, and that also provide a measure of its central tendency. The five-number summary can also be represented graphically, which makes these characteristics easy to visualize and makes the data set easy to compare with other related data sets.
What are the five numbers and what do they mean?
The five-number summary is made up of the minimum value, the three quartiles, and the maximum value of a series of statistical data. Quartiles are the values that divide the ordered data set into four subgroups with the same number of elements. Thus, if we have a set of 100 data points, the quartiles are the values that divide it into 4 subsets of 25 data points each.
The quartiles are named in the order in which they appear, from lowest to highest, as the first, second, and third quartiles. They are represented by the capital letter Q followed by the number that indicates their ordinal position. By definition, the second quartile, Q2, is also the median or midpoint of the data. It should not be confused with the mean, which is the arithmetic average of the data.
In addition to the three quartiles (Q1, Q2, and Q3), the five-number summary also includes the minimum value of the data, ordered from smallest to largest, and the maximum value. In other words, the five numbers in this summary are:
- Minimum.– The first value of the data set when it is ordered from lowest to highest; that is, the lowest value.
- Q1 or first quartile.– The value that divides the data set so that 25% (a quarter) of the data lie below it and the other 75% above.
- Q2 or second quartile.– The value that divides the data set into two equal groups, leaving 50% of the data below it and 50% above. It is therefore also the median or midpoint of the data.
- Q3 or third quartile.– The value that leaves 75% (three quarters) of the data below it and the other 25% above.
- Maximum.– The highest value in the data series; that is, the last value when the data are ordered from lowest to highest.
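As a minimal sketch of the list above, the five numbers can be computed with Python's standard library. The function name `five_number_summary` and the sample values are illustrative assumptions, not part of the original example. Note that `statistics.quantiles` with `method="exclusive"` is only one of several common quartile conventions, so other tools may return slightly different quartiles.

```python
import statistics

def five_number_summary(data):
    """Return (minimum, Q1, median, Q3, maximum) for at least two numbers."""
    ordered = sorted(data)
    # method="exclusive" interpolates at the 1-based positions k * (N + 1) / 4.
    q1, q2, q3 = statistics.quantiles(ordered, n=4, method="exclusive")
    return ordered[0], q1, q2, q3, ordered[-1]

# Hypothetical sample, chosen only for illustration.
sample = [7, 1, 5, 3, 6, 2, 4]
print(five_number_summary(sample))
```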
When interpreting the five-number summary, the difference between the maximum and minimum values gives what is known as the range of the data series. The difference between the third and first quartiles, called the interquartile range (IQR), shows how dispersed the data are, since it is the width of the interval that contains the central 50% of the data.
The second quartile or median, in turn, is a measure of central tendency that can be used to represent all the data in the series with a single number. Although the mean is often used as a measure of central tendency, the median has the advantage of not being sensitive to extreme values (too high or too low).
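To make this interpretation concrete, the following sketch computes the range and the IQR from a five-number summary and shows how one extreme value moves the mean but not the median. All numbers are hypothetical and serve only as an illustration.

```python
import statistics

# Hypothetical five-number summary: minimum, Q1, median, Q3, maximum.
minimum, q1, median, q3, maximum = 2, 4, 6, 9, 15

data_range = maximum - minimum   # spread of the whole series
iqr = q3 - q1                    # spread of the central 50% of the data
print(data_range, iqr)

# Replacing one value with an extreme one moves the mean but not the median.
typical = [10, 12, 13, 15, 16]
with_extreme = [10, 12, 13, 15, 100]
print(statistics.mean(typical), statistics.median(typical))
print(statistics.mean(with_extreme), statistics.median(with_extreme))
```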
Box plots: the graphical representation of the five number summary
A practical way to visualize a five-number summary is the box plot. In this type of representation, the interquartile range (IQR) is drawn as a rectangle or box that extends from Q1 to Q3 and is divided in two by a line, perpendicular to the measurement axis, located at Q2, that is, at the median.
Finally, on each side of the box, lines parallel to the measurement axis are drawn from Q1 to the minimum and from Q3 to the maximum, provided that the minimum and maximum lie no more than 1.5 × IQR from Q1 and Q3, respectively. These lines are known as the whiskers of the box. If there are data outside the range bounded by Q1 – 1.5 × IQR and Q3 + 1.5 × IQR, each whisker extends only to the data point furthest from the box that is still within that range, and the data beyond it are marked as outliers.
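The whisker rule described above can be sketched as follows. The function name `whiskers_and_outliers` and the sample values are assumptions made for illustration; the quartiles are passed in as arguments so that any quartile convention can be used.

```python
def whiskers_and_outliers(data, q1, q3):
    """Return (lower whisker end, upper whisker end, sorted outliers)."""
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    inside = [x for x in data if lower_fence <= x <= upper_fence]
    outliers = sorted(x for x in data if x < lower_fence or x > upper_fence)
    # Each whisker ends at the most extreme value still inside the fences.
    return min(inside), max(inside), outliers

# Hypothetical data with one unusually large value; quartiles assumed known.
values = [4, 5, 5, 6, 7, 8, 9, 30]
print(whiskers_and_outliers(values, q1=5, q3=8.75))
```

When no value falls outside the fences, the whiskers simply end at the minimum and the maximum and the list of outliers is empty.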
Example: preparing the five-number summary for a data series
The procedure for preparing a five-number summary from a set of statistical data is presented below, step by step. It also explains how to build the box plot that displays the summary graphically.
The data correspond to the number of items sold in the women’s department of a department store during a 10-week period. The results of the study are presented below:
Week | Monday | Tuesday | Wednesday | Thursday | Friday | Saturday | Sunday |
Week 1 | 158 | 145 | 156 | 156 | 164 | 167 | 147 |
Week 2 | 161 | 146 | 157 | 152 | 162 | 160 | 153 |
Week 3 | 152 | 150 | 157 | 155 | 164 | 166 | 152 |
Week 4 | 150 | 149 | 153 | 162 | 169 | 162 | 149 |
Week 5 | 157 | 152 | 154 | 155 | 168 | 161 | 155 |
Week 6 | 157 | 145 | 160 | 164 | 164 | 168 | 149 |
Week 7 | 160 | 152 | 151 | 152 | 168 | 163 | 145 |
Week 8 | 157 | 152 | 155 | 156 | 162 | 169 | 155 |
Week 9 | 160 | 148 | 157 | 150 | 164 | 170 | 154 |
Week 10 | 158 | 146 | 163 | 158 | 165 | 169 | 150 |
Step 1: Sort all the data from smallest to largest and assign them an index starting with 1.
The result of this step is presented below:
Index | Value | Index | Value | Index | Value | Index | Value |
1 | 145 | 22 | 152 | 43 | 158 | 64 | 168 |
2 | 145 | 23 | 153 | 44 | 160 | 65 | 168 |
3 | 145 | 24 | 153 | 45 | 160 | 66 | 168 |
4 | 146 | 25 | 154 | 46 | 160 | 67 | 169 |
5 | 146 | 26 | 154 | 47 | 160 | 68 | 169 |
6 | 147 | 27 | 155 | 48 | 161 | 69 | 169 |
7 | 148 | 28 | 155 | 49 | 161 | 70 | 170 |
8 | 149 | 29 | 155 | 50 | 162 | | |
9 | 149 | 30 | 155 | 51 | 162 | | |
10 | 149 | 31 | 155 | 52 | 162 | | |
11 | 150 | 32 | 156 | 53 | 162 | | |
12 | 150 | 33 | 156 | 54 | 163 | | |
13 | 150 | 34 | 156 | 55 | 163 | | |
14 | 150 | 35 | 157 | 56 | 164 | | |
15 | 151 | 36 | 157 | 57 | 164 | | |
16 | 152 | 37 | 157 | 58 | 164 | | |
17 | 152 | 38 | 157 | 59 | 164 | | |
18 | 152 | 39 | 157 | 60 | 164 | | |
19 | 152 | 40 | 157 | 61 | 165 | | |
20 | 152 | 41 | 158 | 62 | 166 | | |
21 | 152 | 42 | 158 | 63 | 167 | | |
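Step 1 can be reproduced in a few lines. The sketch below flattens the weekly table into one list, sorts it and numbers the values starting from 1; the variable names are illustrative.

```python
# Items sold per day (Monday to Sunday), one row per week, as in the table above.
weekly_sales = [
    [158, 145, 156, 156, 164, 167, 147],
    [161, 146, 157, 152, 162, 160, 153],
    [152, 150, 157, 155, 164, 166, 152],
    [150, 149, 153, 162, 169, 162, 149],
    [157, 152, 154, 155, 168, 161, 155],
    [157, 145, 160, 164, 164, 168, 149],
    [160, 152, 151, 152, 168, 163, 145],
    [157, 152, 155, 156, 162, 169, 155],
    [160, 148, 157, 150, 164, 170, 154],
    [158, 146, 163, 158, 165, 169, 150],
]

# Flatten the table and sort from smallest to largest.
ordered = sorted(value for week in weekly_sales for value in week)

# Pair every value with a 1-based index, as in the sorted table.
indexed = list(enumerate(ordered, start=1))

print(len(ordered))   # total number of data points, N
print(indexed[:3])    # first three (index, value) pairs
print(indexed[-1])    # last (index, value) pair
```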
Step 2: Determine the quartiles Q1, Q2 and Q3
To determine the quartiles Q1, Q2 and Q3, we begin by calculating the index (position) in the ordered series that corresponds to each quartile. The formula is the following:
$$i_k = \frac{k\,(N + 1)}{4}, \qquad k = 1, 2, 3$$
Where N is the total number of data points and k is the number of the quartile. The result may or may not be an integer, so the procedure is divided into two cases:
Case 1: Integer result
If the result is an integer, then the respective quartile is the value of the data point at that index. For example, if the index of Q1 were 10, Q1 would be the value of data point number 10 (149 in our example).
Case 2: Decimal result
If the index is a decimal number, then the quartile will not correspond exactly to any of the data present in the series. In this case, the index is rounded down and the quartile is interpolated between the data point at that position and the one that follows it, using the following formula:
$$Q_k = x_i + d\,(x_{i+1} - x_i)$$
Where d represents the decimal part of the index, $x_i$ is the data point at the index rounded down, and $x_{i+1}$ is the next data point.
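The index calculation and the two cases can be sketched in one small function. This is an illustration under stated assumptions: it takes the index of quartile k to be k(N + 1)/4, the convention that reproduces the quartile values reported for this example. The name `quartile` is a hypothetical helper, and the sample lists are made up.

```python
import math

def quartile(ordered, k):
    """Quartile k (1, 2 or 3) of data already sorted in ascending order.

    Assumes at least three values and the k * (N + 1) / 4 index convention.
    """
    n = len(ordered)
    index = k * (n + 1) / 4       # assumed 1-based index convention
    i = math.floor(index)         # index rounded down
    d = index - i                 # decimal part of the index
    x_i = ordered[i - 1]          # Python lists are 0-based
    if d == 0:
        return x_i                # Case 1: integer index
    x_next = ordered[i]           # Case 2: interpolate towards the next value
    return x_i + d * (x_next - x_i)

# Case 1: N = 7 gives a Q1 index of 2, so Q1 is the second value.
print(quartile([10, 20, 30, 40, 50, 60, 70], 1))
# Case 2: N = 6 gives a Q1 index of 1.75, between the first and second values.
print(quartile([10, 20, 30, 40, 50, 60], 1))
```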
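In our example N = 70, so the indices of the three quartiles are:
$$i_1 = \frac{1\,(70 + 1)}{4} = 17.75 \qquad i_2 = \frac{2\,(70 + 1)}{4} = 35.5 \qquad i_3 = \frac{3\,(70 + 1)}{4} = 53.25$$
In all cases the result is a decimal number, so we now apply the formula from case 2 to determine the value of each quartile:
$$Q_1 = x_{17} + 0.75\,(x_{18} - x_{17}) = 152 + 0.75\,(152 - 152) = 152$$
$$Q_2 = x_{35} + 0.5\,(x_{36} - x_{35}) = 157 + 0.5\,(157 - 157) = 157$$
$$Q_3 = x_{53} + 0.25\,(x_{54} - x_{53}) = 162 + 0.25\,(163 - 162) = 162.25$$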
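As a cross-check of this calculation, the sketch below rebuilds the 70 sorted values from their frequencies in the sorted table, lists the quartile indices and compares the result with Python's standard library. It assumes the k(N + 1)/4 index convention; `statistics.quantiles` with `method="exclusive"` interpolates at the same positions, so it should agree with the values in the five-number summary.

```python
import statistics

# value: number of times it appears in the sorted table
frequencies = {
    145: 3, 146: 2, 147: 1, 148: 1, 149: 3, 150: 4, 151: 1, 152: 7,
    153: 2, 154: 2, 155: 5, 156: 3, 157: 6, 158: 3, 160: 4, 161: 2,
    162: 4, 163: 2, 164: 5, 165: 1, 166: 1, 167: 1, 168: 3, 169: 3,
    170: 1,
}
ordered = [value for value, count in sorted(frequencies.items())
           for _ in range(count)]
n = len(ordered)

# Quartile positions under the assumed k * (n + 1) / 4 convention.
indices = [k * (n + 1) / 4 for k in (1, 2, 3)]
print(indices)

# Neighbours used in the interpolation (1-based positions 17/18, 35/36, 53/54).
print(ordered[16], ordered[17], ordered[34], ordered[35], ordered[52], ordered[53])

# Library cross-check: method="exclusive" interpolates at the same positions.
q1, q2, q3 = statistics.quantiles(ordered, n=4, method="exclusive")
print(q1, q2, q3)
```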
Step 3: Identify the five numbers
Now that we have the data ordered and we have also determined the values of the three quartiles, the summary of the five numbers is:
Minimum: | 145 |
Q1: | 152 |
Q2 or Median: | 157 |
Q3: | 162.25 |
Maximum: | 170 |
Step 4: Construct the boxplot
We already have everything necessary to build the box plot except for the IQR. Based on the result obtained in the previous step, the difference between Q3 and Q1 is:
$$\mathrm{IQR} = Q_3 - Q_1 = 162.25 - 152 = 10.25$$
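To determine if there are outliers, we calculate Q1 – 1.5 IQR and Q3 + 1.5 IQR and compare them with the minimum and maximum:
$$Q_1 - 1.5 \cdot \mathrm{IQR} = 152 - 1.5 \cdot 10.25 = 136.625$$
$$Q_3 + 1.5 \cdot \mathrm{IQR} = 162.25 + 1.5 \cdot 10.25 = 177.625$$
As we can see, there are no outliers at the low end, since the minimum, 145, is greater than 136.625. There are no outliers at the high end either, since the maximum, 170, is less than 177.625.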
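The outlier check in this step amounts to a few arithmetic operations, sketched here with the values from the five-number summary of the example; the variable names are illustrative.

```python
# Values taken from the five-number summary of the example.
minimum, q1, q3, maximum = 145, 152, 162.25, 170

iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
print(iqr, lower_fence, upper_fence)

# No outliers when both extremes lie inside the fences,
# so the whiskers run all the way to the minimum and the maximum.
has_outliers = minimum < lower_fence or maximum > upper_fence
print(has_outliers)
```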
The following figure shows the result of building the box plot corresponding to the example: