The sample median is always equal to one of the values in the sample

Nonparametric Hypotheses Tests

Sheldon M. Ross, in Introductory Statistics (Third Edition), 2010

Solution

The sample median is the 12th-smallest score, namely, 76. The sequence of 0s and 1s indicating whether each value is less than or equal to or greater than 76 is as follows:

1 0 0 0 0 0 0 0 1 0 0 0 0 1 1 1 1 1 0 1 1 1 1

Thus, this sequence consists of twelve 0s and eleven 1s, and has seven runs. From Program 14-3, we see that the p value = 0.02997, and thus the hypothesis that the data constitute a random sample is rejected at the 5 percent level of significance.
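
The run count and the p value quoted above can be checked with a short sketch. It assumes that the p value is defined as twice the smaller of the two tail probabilities P(R ≤ r) and P(R ≥ r), with every ordering of the 0s and 1s equally likely under the hypothesis of randomness; the function names are illustrative and are not those of Program 14-3.

```python
from math import comb

def count_runs(seq):
    """Number of maximal blocks of equal consecutive symbols."""
    return sum(1 for i, s in enumerate(seq) if i == 0 or s != seq[i - 1])

def runs_pmf(r, n, m):
    """P(R = r) for n ones and m zeros (n, m >= 1), all orderings equally likely."""
    if r < 2:
        return 0.0
    k, odd = divmod(r, 2)
    if not odd:
        # r = 2k: k runs of ones and k runs of zeros, either symbol first
        ways = 2 * comb(n - 1, k - 1) * comb(m - 1, k - 1)
    else:
        # r = 2k + 1: k + 1 runs of one symbol and k runs of the other
        ways = (comb(n - 1, k) * comb(m - 1, k - 1)
                + comb(n - 1, k - 1) * comb(m - 1, k))
    return ways / comb(n + m, n)

def runs_test_p_value(seq):
    """Return (number of runs, two-sided exact p value) for a 0/1 sequence."""
    n = sum(1 for s in seq if s == 1)
    m = len(seq) - n
    r = count_runs(seq)
    lower = sum(runs_pmf(j, n, m) for j in range(2, r + 1))
    upper = sum(runs_pmf(j, n, m) for j in range(r, n + m + 1))
    return r, min(1.0, 2 * min(lower, upper))

scores = [1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1]
r, p = runs_test_p_value(scores)
print(r, round(p, 5))
```

Under these assumptions the sketch finds seven runs and a p value that agrees with the quoted 0.02997 to the digits shown.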

URL:https://www.sciencedirect.com/science/article/pii/B9780123743886000144

Descriptive Statistics

Sheldon M. Ross, in Introduction to Probability and Statistics for Engineers and Scientists (Fifth Edition), 2014

Definition

Order the values of a data set of size n from smallest to largest. If n is odd, the sample median is the value in position (n+1)/2; if n is even, it is the average of the values in positions n/2 and n/2+1.

Thus the sample median of a set of three values is the second smallest; of a set of four values, it is the average of the second and third smallest.
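
The definition translates directly into code. The following is a minimal sketch; the function name `sample_median` is illustrative.

```python
def sample_median(values):
    """Sample median: middle value (n odd) or average of the two middle values (n even)."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("the sample median of an empty data set is undefined")
    mid = n // 2
    if n % 2 == 1:
        # 1-based position (n + 1)/2 is 0-based index n // 2
        return ordered[mid]
    # 1-based positions n/2 and n/2 + 1 are 0-based indices mid - 1 and mid
    return (ordered[mid - 1] + ordered[mid]) / 2

print(sample_median([7, 1, 4]))      # second smallest of three values
print(sample_median([7, 1, 4, 10]))  # average of second and third smallest
```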

Example 2.3c

Find the sample median for the data described in Example 2.3b.

Solution

Since there are 54 data values, it follows that when the data are put in increasing order, the sample median is the average of the values in positions 27 and 28. Thus, the sample median is 18.5.

The sample mean and sample median are both useful statistics for describing the central tendency of a data set. The sample mean makes use of all the data values and is affected by extreme values that are much larger or smaller than the others; the sample median makes use of only one or two of the middle values and is thus not affected by extreme values. Which of them is more useful depends on what one is trying to learn from the data. For instance, if a city government has a flat rate income tax and is trying to estimate its total revenue from the tax, then the sample mean of its residents' income would be a more useful statistic. On the other hand, if the city was thinking about constructing middle-income housing, and wanted to determine the proportion of its population able to afford it, then the sample median would probably be more useful.

Example 2.3d

In a study reported in Hoel, D. G., "A representation of mortality data by competing risks," Biometrics, 28, pp. 475-488, 1972, a group of 5-week-old mice were each given a radiation dose of 300 rad. The mice were then divided into two groups; the first group was kept in a germ-free environment, and the second in conventional laboratory conditions. The numbers of days until death were then observed. The data for those whose death was due to thymic lymphoma are given in the following stem and leaf plots (whose stems are in units of hundreds of days); the first plot is for mice living in the germ-free conditions and the second for mice living under ordinary laboratory conditions.

Germ-Free Mice
1 | 58, 92, 93, 94, 95
2 | 02, 12, 15, 29, 30, 37, 40, 44, 47, 59
3 | 01, 01, 21, 37
4 | 15, 34, 44, 85, 96
5 | 29, 37
6 | 24
7 | 07
8 | 00

Conventional Mice
1 | 59, 89, 91, 98
2 | 35, 45, 50, 56, 61, 65, 66, 80
3 | 43, 56, 83
4 | 03, 14, 28, 32

Determine the sample means and the sample medians for the two sets of mice.
Solution

It is clear from the stem and leaf plots that the sample mean for the set of mice put in the germ-free setting is larger than the sample mean for the set of mice in the usual laboratory setting; indeed, a calculation gives that the former sample mean is 344.07, whereas the latter one is 292.32. On the other hand, since there are 29 data values for the germ-free mice, the sample median is the 15th largest data value, namely, 259; similarly, the sample median for the other set of mice is the 10th largest data value, namely, 265. Thus, whereas the sample mean is quite a bit larger for the first data set, the sample medians are approximately equal. The reason for this is that whereas the sample mean for the first set is greatly affected by the five data values greater than 500, these values have a much smaller effect on the sample median. Indeed, the sample median would remain unchanged if these values were replaced by any other five values greater than or equal to 259. It appears from the stem and leaf plots that the germ-free conditions probably improved the life span of the five longest living mice, but it is unclear what, if any, effect it had on the life spans of the other mice.
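
The figures in this solution can be reproduced with a short sketch. The leaves below are read off the two stem and leaf plots, with each stem counting hundreds of days; the helper name `expand` is illustrative.

```python
from statistics import mean, median

def expand(stem_leaf):
    """Turn {stem: [leaves]} (stems in hundreds of days) into a list of day counts."""
    return [100 * stem + leaf for stem, leaves in stem_leaf.items() for leaf in leaves]

germ_free = expand({
    1: [58, 92, 93, 94, 95],
    2: [2, 12, 15, 29, 30, 37, 40, 44, 47, 59],
    3: [1, 1, 21, 37],
    4: [15, 34, 44, 85, 96],
    5: [29, 37],
    6: [24],
    7: [7],
    8: [0],
})
conventional = expand({
    1: [59, 89, 91, 98],
    2: [35, 45, 50, 56, 61, 65, 66, 80],
    3: [43, 56, 83],
    4: [3, 14, 28, 32],
})

for name, days in (("germ-free", germ_free), ("conventional", conventional)):
    print(name, len(days), round(mean(days), 2), median(days))
```

Replacing the five germ-free values above 500 with any other values of at least 259 leaves `median(germ_free)` unchanged, while `mean(germ_free)` moves with them.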

Another statistic that has been used to indicate the central tendency of a data set is the sample mode, defined to be the value that occurs with the greatest frequency. If no single value occurs most frequently, then all the values that occur at the highest frequency are called modal values.

Example 2.3e

The following frequency table gives the values obtained in 40 rolls of a die.

Value      1  2  3  4  5  6
Frequency  9  8  5  5  6  7

Find (a) the sample mean, (b) the sample median, and (c) the sample mode.
Solution

(a) The sample mean is

$$\bar{x} = (9 + 16 + 15 + 20 + 30 + 42)/40 = 132/40 = 3.3$$

(b) The sample median is the average of the 20th and 21st smallest values, and is thus equal to 3.

(c) The sample mode is 1, the value that occurred most frequently.
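
All three statistics can be computed directly from a frequency table. The sketch below uses the six die values and their frequencies from the example; the function name `summarize` is illustrative.

```python
def summarize(freq):
    """Return (n, mean, median, modal values) for a {value: frequency} table."""
    n = sum(freq.values())
    mean = sum(value * count for value, count in freq.items()) / n
    # Expand the table into the ordered data set to locate the middle positions.
    ordered = sorted(v for v, count in freq.items() for _ in range(count))
    if n % 2 == 1:
        median = ordered[n // 2]
    else:
        median = (ordered[n // 2 - 1] + ordered[n // 2]) / 2
    top = max(freq.values())
    modes = [v for v, count in freq.items() if count == top]
    return n, mean, median, modes

die_rolls = {1: 9, 2: 8, 3: 5, 4: 5, 5: 6, 6: 7}
n, mean, median, modes = summarize(die_rolls)
print(n, mean, median, modes)  # the weighted sum is 132, so the mean is 132/40
```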
URL:https://www.sciencedirect.com/science/article/pii/B9780123948113500022

NONPARAMETRIC HYPOTHESIS TESTS

Sheldon M. Ross, in Introduction to Probability and Statistics for Engineers and Scientists (Fourth Edition), 2009

EXAMPLE 12.5b

The lifetime of 19 successively produced storage batteries is as follows:

145 152 148 155 176 134 184 132 145 162 165 185 174 198 179 194 201 169 182

The sample median is the 10th smallest value, namely, 169. The data indicating whether the successive values are less than or equal to or greater than 169 are as follows:

1 1 1 1 0 1 0 1 1 1 1 0 0 0 0 0 0 1 0

Hence, the number of runs is 8. To determine if this value is statistically significant, we run Program 12.5 (with n = 10, m = 9) to obtain the result:

p-value = .357

Thus the hypothesis of randomness is accepted.

It can be shown that, when n and m are both large and H0 is true, R will have approximately a normal distribution with mean and standard deviation given by

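$$\mu = \frac{2nm}{n+m} + 1 \quad\text{and}\quad \sigma = \sqrt{\frac{2nm(2nm-n-m)}{(n+m)^2(n+m-1)}} \tag{12.5.2}$$

Therefore, when n and m are both large

$$P_{H_0}\{R \le r\} = P_{H_0}\left\{\frac{R-\mu}{\sigma} \le \frac{r-\mu}{\sigma}\right\} \approx P\left\{Z \le \frac{r-\mu}{\sigma}\right\} = \Phi\left(\frac{r-\mu}{\sigma}\right), \quad Z \sim N(0,1)$$

and, similarly,

$$P_{H_0}\{R \ge r\} \approx 1 - \Phi\left(\frac{r-\mu}{\sigma}\right)$$

Hence, for large n and m, the p-value of the runs test for randomness is approximately given by

$$p\text{-value} \approx 2\min\left\{\Phi\left(\frac{r-\mu}{\sigma}\right),\; 1-\Phi\left(\frac{r-\mu}{\sigma}\right)\right\}$$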

where μ and σ are given by Equation 12.5.2 and r is the observed number of runs.
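
As an illustration, the approximation can be evaluated in a few lines. The sketch below applies it to the battery example (r = 8, n = 10, m = 9) purely to show the arithmetic; those sample sizes are not large, so the result need not agree closely with the exact p-value reported above. The function name is illustrative.

```python
from math import sqrt
from statistics import NormalDist

def runs_normal_p_value(r, n, m):
    """Normal approximation to the runs test: returns (mu, sigma, p-value)."""
    mu = 2 * n * m / (n + m) + 1
    sigma = sqrt(2 * n * m * (2 * n * m - n - m) / ((n + m) ** 2 * (n + m - 1)))
    phi = NormalDist().cdf((r - mu) / sigma)  # standard normal CDF at (r - mu)/sigma
    return mu, sigma, 2 * min(phi, 1 - phi)

mu, sigma, p = runs_normal_p_value(r=8, n=10, m=9)
print(round(mu, 4), round(sigma, 4), round(p, 3))
```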

URL:https://www.sciencedirect.com/science/article/pii/B9780123704832000175

SUMMARIZING DATA

Rand R. Wilcox, in Applying Contemporary Statistical Techniques, 2003

3.2.2 The Sample Median

Another important measure of location is the sample median, which is intended as an estimate of the population median. Simply put, if the sample size is odd, the sample median is the middle value after putting the observations in ascending order. If the sample size is even, the sample median is the average of the two middle values.

Chapter 2 noted that for symmetric distributions, the population mean and median are identical, so for this special case the sample median provides another way of estimating the population mean. But for skewed distributions the population mean and median differ, so generally the sample mean and median are attempting to estimate different quantities.

It helps to describe the sample median in a more formal manner in order to illustrate a commonly used notation. For the observations X1, …, Xn, let X(1) represent the smallest number, X(2) the next smallest, and X(n) the largest. More generally,

X(1) ≤ X(2) ≤ X(3) ≤ … ≤ X(n)

is the notation used to indicate that n values are to be put in ascending order. The sample median is computed as follows:

1. If the number of observations, n, is odd, compute m = (n + 1)/2. Then the sample median is

M = X(m),

the mth value after the observations are put in order.

2. If the number of observations, n, is even, compute m = n/2. Then the sample median is

M = (X(m) + X(m+1))/2,

the average of the mth and (m + 1)th observations after putting the observed values in ascending order.

EXAMPLE.

Consider the values 1.1, 2.3, 1.7, 0.9, and 3.1. The smallest of the five observations is 0.9, so X(1) = 0.9. The smallest of the remaining four observations is 1.1, and this is written as X(2) = 1.1. The smallest of the remaining three observations is 1.7, so X(3) = 1.7; the largest of the five values is 3.1, and this is written as X(5) = 3.1.

EXAMPLE.

Seven subjects are given a test that measures depression. The observed scores are

34, 29, 55, 45, 21, 32, 39.

Because the number of observations is n = 7, which is odd, m = (7 + 1)/2 = 4. Putting the observations in order yields

21, 29, 32, 34, 39, 45, 55.

The fourth observation is X(4) = 34, so the sample median is M = 34.

EXAMPLE.

We repeat the last example, only with six subjects having test scores

29, 55, 45, 21, 32, 39.

Because the number of observations is n = 6, which is even, m = 6/2 = 3. Putting the observations in order yields

21, 29, 32, 39, 45, 55.

The third and fourth observations are X(3) = 32 and X(4) = 39, so the sample median is M = (32 + 39)/2 = 35.5.

Notice that nearly half of any n values can be made arbitrarily large without making the value of the sample median arbitrarily large as well. Consequently, the finite-sample breakdown point is approximately .5, the highest possible value. So the mean and median lie at two extremes in terms of their sensitivity to outliers. The sample mean can be affected by a single outlier, but nearly half of the observations can be outliers without affecting the median. For the data in Table 3.2, the sample median is M = 1, which gives a decidedly different picture of what is typical as compared to the mean, which is 64.9.

TABLE 3.2. Desired Number of Sexual Partners for 105 Males

X:     0    1    2    3    4    5    6    7    8    9
f_x:   5   49    4    5    9    4    4    1    1    2

X:    10   11   12   13   15   18   19   30   40   45
f_x:   3    2    3    1    2    1    2    2    1    1

X:   150 6000
f_x:   2    1
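
The contrast between the two measures of location for Table 3.2 can be checked with a short sketch. The frequencies below are an assumed reading of the table; they reproduce the stated sample size of 105, the median of 1, and the mean of 64.9.

```python
from statistics import mean, median

# {response X: frequency f_x}, as read from Table 3.2 (assumed parse of the table)
table = {0: 5, 1: 49, 2: 4, 3: 5, 4: 9, 5: 4, 6: 4, 7: 1, 8: 1, 9: 2,
         10: 3, 11: 2, 12: 3, 13: 1, 15: 2, 18: 1, 19: 2, 30: 2, 40: 1, 45: 1,
         150: 2, 6000: 1}

data = [x for x, f in table.items() for _ in range(f)]
print(len(data), median(data), round(mean(data), 1))

# Dropping the single response of 6000 changes the mean a great deal,
# while the median stays where it was.
trimmed = [x for x in data if x != 6000]
print(len(trimmed), median(trimmed), round(mean(trimmed), 1))
```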

Based on the single criterion of having a high breakdown point, the median beats the mean. But it is stressed that this is not a compelling reason to routinely use the median over the mean. Chapter 4 describes other criteria for judging measures of location, and situations will be described where both the median and mean are unsatisfactory.

Although we will see several practical problems with the mean, it is not being argued that the mean is always inappropriate. Imagine that someone invests $200,000 and reports that the median amount earned per year, over a 10-year period, is $100,000. This sounds good, but now imagine that the earnings for each year are:

$100,000, $200,000, $200,000, $200,000, $200,000, $200,000, $200,000, $300,000, $300,000, $1,800,000.

So at the end of 10 years this individual has earned nothing and in fact has lost the initial $200,000 investment. Certainly the long-term total amount earned is relevant, in which case the sample mean provides a useful summary of the investment strategy that was followed.

URL:https://www.sciencedirect.com/science/article/pii/B9780127515410500249

Computation of null exact distributions

Jaroslav Hájek, ... Pranab K. Sen, in Theory of Rank Tests (Second Edition), 1999

5.2 EXPLICIT FORMULAS FOR DISTRIBUTIONS

5.2.1 Statistics using the scores 0, 1/2, 1

Theorem 1

Let S be the two-sample median test statistic introduced in (4.1.1.10). Then

$$P(S=k)=\binom{[\frac12(m+n)]}{[k]}\binom{[\frac12(m+n)]}{[m-k]}\binom{m+n}{m}^{-1}, \tag{1}$$

where [x] denotes the largest integer not exceeding x. If m + n is even, formula (1) holds for $k = 0, 1, \ldots, m$ in the case of $m \le n$, and for $k=\frac12(m-n), \frac12(m-n)+1, \ldots, \frac12(m+n)$ in the case of $m \ge n$. If m + n is odd, this formula holds for $k=0, \frac12, \ldots, m-\frac12, m$ in the case of m < n, and for $k=\frac12(m-n), \frac12(m-n)+\frac12, \ldots, \frac12(m+n)-\frac12, \frac12(m+n)$ in the case of m > n.

The proof is easy on making use of the box model from Subsection 5.1.2. (See Hájek and Šidák (1967).)
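
A quick numerical check of Theorem 1 is to tabulate the distribution and confirm that the probabilities sum to 1. The sketch below assumes formula (1) reads as the product of the binomial coefficients "[(m+n)/2] choose [k]" and "[(m+n)/2] choose [m−k]" divided by "m+n choose m"; the function name is illustrative. Values of k outside the ranges stated in the theorem get a zero numerator and are dropped.

```python
from fractions import Fraction
from math import comb, floor

def median_test_pmf(m, n):
    """Null distribution {k: P(S = k)} of the two-sample median test statistic."""
    total = m + n
    half = total // 2                                   # [(m + n)/2]
    step = Fraction(1, 2) if total % 2 else Fraction(1)  # half-integer k only if m + n is odd
    pmf = {}
    k = Fraction(0)
    while k <= m:
        ways = comb(half, floor(k)) * comb(half, floor(m - k))
        if ways:                                        # skip k outside the support
            pmf[k] = Fraction(ways, comb(total, m))
        k += step
    return pmf

for m, n in ((2, 2), (2, 3), (3, 2)):
    pmf = median_test_pmf(m, n)
    print(m, n, {str(k): str(p) for k, p in pmf.items()}, sum(pmf.values()))
```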

Theorem 2

If S+ is the sign test statistic given by (4.5.1.12), then

$$P(S^+=k)=\binom{N}{k}2^{-N}\quad\text{for } k=0,1,\ldots,N. \tag{2}$$

Proof

Under $H_1$, this is simply a special case of the binomial distribution with success probability 1/2.
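
Because the null distribution is binomial with success probability 1/2, it can be tabulated exactly. This is a minimal sketch with an illustrative function name; exact fractions are used so that the total can be compared with 1 without rounding.

```python
from fractions import Fraction
from math import comb

def sign_test_pmf(N):
    """Return [P(S+ = 0), ..., P(S+ = N)] for the sign test statistic."""
    return [Fraction(comb(N, k), 2 ** N) for k in range(N + 1)]

pmf = sign_test_pmf(4)
print([str(p) for p in pmf], sum(pmf))
```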

Theorem 3

Let S be the quadrant test statistic defined by (4.6.1.17). Then for N even

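$$P(S=k)=\binom{N}{\frac12 N}^{-1}\binom{\frac12 N}{k}^{2}\quad\text{for } k=0,1,\ldots,\tfrac12 N, \tag{3}$$

and for N odd

$$
\begin{aligned}
P(S=k+\tfrac14) &= \frac{1}{N}\binom{N-1}{\frac12(N-1)}^{-1}\binom{\frac12(N-1)}{k}^{2} && \text{for } k=0,1,\ldots,\tfrac12(N-1),\\
P(S=k+\tfrac12) &= \frac{N-1}{N}\binom{N-1}{\frac12(N-1)}^{-1}\binom{\frac12(N-3)}{k}\binom{\frac12(N-1)}{k} && \text{for } k=0,1,\ldots,\tfrac12(N-3),\\
P(S=k) &= \frac{N-1}{N}\binom{N-1}{\frac12(N-1)}^{-1}\binom{\frac12(N-3)}{k-1}\binom{\frac12(N-1)}{k} && \text{for } k=1,2,\ldots,\tfrac12(N-1).
\end{aligned}
\tag{4}
$$
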
Proof

It can again be carried out by means of a different special box model, but we omit it here (for details see Hájek and Šidák (1967)).
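
As a sanity check on Theorem 3, the probabilities can be tabulated and summed. The sketch below assumes that formula (3) is the squared binomial coefficient "N/2 choose k" divided by "N choose N/2", and that the three parts of (4) carry the factors 1/N, (N−1)/N and (N−1)/N in front of products of binomial coefficients with upper indices (N−1)/2 and (N−3)/2; the function name is illustrative.

```python
from fractions import Fraction
from math import comb

def quadrant_test_pmf(N):
    """Null distribution {value: probability} of the quadrant test statistic."""
    pmf = {}
    if N % 2 == 0:
        h = N // 2
        for k in range(h + 1):
            pmf[Fraction(k)] = Fraction(comb(h, k) ** 2, comb(N, h))
        return pmf
    h = (N - 1) // 2                 # (N - 1)/2; then (N - 3)/2 equals h - 1
    denom = comb(N - 1, h)
    for k in range(h + 1):           # values k + 1/4, k = 0, ..., (N - 1)/2
        pmf[k + Fraction(1, 4)] = Fraction(1, N) * Fraction(comb(h, k) ** 2, denom)
    for k in range(h):               # values k + 1/2, k = 0, ..., (N - 3)/2
        pmf[k + Fraction(1, 2)] = (Fraction(N - 1, N)
                                   * Fraction(comb(h - 1, k) * comb(h, k), denom))
    for k in range(1, h + 1):        # values k, k = 1, ..., (N - 1)/2
        pmf[Fraction(k)] = (Fraction(N - 1, N)
                            * Fraction(comb(h - 1, k - 1) * comb(h, k), denom))
    return pmf

for N in (4, 5, 7):
    print(N, sum(quadrant_test_pmf(N).values()))
```

For each N tried, the probabilities add up to exactly 1, which is consistent with the formulas as read here.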

URL:https://www.sciencedirect.com/science/article/pii/B9780126423501500230

Using Statistics to Summarize Data Sets

Sheldon M. Ross, in Introductory Statistics (Third Edition), 2010

Solution

(a) Since there are 17 data values, the sample median is the 9th smallest. Therefore, the sample median is

m = 30.2

(b) The sum of all 17 values is 517.4, and so the sample mean is

$$\bar{x} = \frac{517.4}{17} \approx 30.435$$

Historical Perspective

The Dutch mathematician Christian Huyghens was one of the early developers of the theory of probability. In 1669 his brother Ludwig, after studying the mortality tables of the time, wrote to his famous older brother that "I have just been making a table showing how long people have to live. Live well! According to my calculations you will live to be about 56½ and I to 55." Christian, intrigued, also looked at the mortality tables but came up with different estimates for how long both he and his brother would live. Why? Because they were looking at different statistics. Ludwig was basing his estimates on the sample median while Christian was basing his on the sample mean!

For data sets that are roughly symmetric about their central values, the sample mean and sample median will have values close to each other. For instance, the data

4, 6, 8, 8, 9, 12, 15, 17, 19, 20, 22

are roughly symmetric about the value 12, which is the sample median. The sample mean is $\bar{x}$ = 140/11 ≈ 12.73, which is close to 12.

The question as to which of the two summarizing statistics is the more informative depends on what you are interested in learning from the data set. For instance, if a city government has a flat-rate income tax and is trying to figure out how much income it can expect, then it would be more interested in the sample mean of the income of its citizens than in the sample median (why is this?). On the other hand, if the city government were planning to construct some middle-income housing and were interested in the proportion of its citizens who would be able to afford such housing, then the sample median might be more informative (why is this?).

Although it is interesting to consider whether the sample mean or sample median is more informative in a particular situation, note that we need never restrict ourselves to a knowledge of just one of these quantities. They are both important, and thus both should always be computed when a data set is summarized.

URL:https://www.sciencedirect.com/science/article/pii/B978012374388600003X

ROBUST AND EXPLORATORY REGRESSION

Rand R. Wilcox, in Applying Contemporary Statistical Techniques, 2003

For the observations $X_1, \ldots, X_n$, let $\hat{\theta}$ be the sample median. Choose a value for β between 0 and 1 and compute

$$W_i=|X_i-\hat{\theta}|,\quad m=[(1-\beta)n],$$

where the notation $[(1-\beta)n]$ is $(1-\beta)n$ rounded down to the nearest integer. Using β = .2 appears to be a good choice in most situations. (The value of β determines the finite-sample breakdown point of a measure of scale used to detect outliers.) Let $W_{(1)} \le \cdots \le W_{(n)}$ be the $W_i$ values written in ascending order and let

$$\hat{\omega}_x=W_{(m)}.$$
URL:https://www.sciencedirect.com/science/article/pii/B9780127515410500341

Estimating Measures of Location and Scale

Rand Wilcox, in Introduction to Robust Estimation and Hypothesis Testing (Fourth Edition), 2017

3.5.3 The Maritz-Jarrett Estimate of the Standard Error of $\hat{x}_q$

Maritz and Jarrett (1978) derived an estimate of the standard error of the sample median, which is easily extended to the more general case involving $\hat{x}_q$. That is, when using a single order statistic, its standard error can be estimated using the method outlined here. It is based on the fact that $E(\hat{x}_q)$ and $E(\hat{x}_q^2)$ can be related to a beta distribution. The beta probability density function, when a and b are positive integers, is

$$f(x)=\frac{(a+b+1)!}{a!\,b!}\,x^{a}(1-x)^{b},\quad 0\le x\le 1. \tag{3.15}$$

Details about the beta distribution are not important here. Interested readers can refer to Johnson and Kotz (1970, Chapter 24).

As before, let $m=[qn+0.5]$. Let Y be a random variable having a beta distribution with $a=m-1$ and $b=n-m$, and let

$$W_i=P\left(\frac{i-1}{n}\le Y\le\frac{i}{n}\right).$$

Many statistical computing packages have functions that evaluate the beta distribution, so evaluating the $W_i$ values is relatively easy to do. In R, there is the function pbeta(x,a,b) that computes $P(Y \le x)$. Thus, $W_i$ can be computed by setting x = i/n, y = (i−1)/n, in which case $W_i$ is pbeta(x,m-1,n-m) minus pbeta(y,m-1,n-m).

Let

$$C_k=\sum_{i=1}^{n}W_iX_{(i)}^{k}.$$

When k = 1, $C_k$ is a linear combination of the order statistics. Linear sums of order statistics are called L-estimators. Other examples of L-estimators are the trimmed and Winsorized means already discussed. The point here is that $C_k$ can be shown to estimate $E(X_{(m)}^{k})$, the kth moment of the mth order statistic. Consequently, the standard error of the mth order statistic, $X_{(m)}=\hat{x}_q$, is estimated with

$$\sqrt{C_2-C_1^{2}}.$$

Note that when n is odd, this last equation provides an alternative to the McKean-Schrader estimate of the standard error of M described in Section 3.3.4. Based on limited studies, it seems that when computing confidence intervals or testing hypotheses based on M, the McKean-Schrader estimator is preferable.
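
The steps above can be sketched without a statistics package. The sketch integrates the density (3.15) with a = m − 1 and b = n − m in closed form: for these integer values its distribution function at x equals the probability that a binomial(n, x) count is at least m. In the usual shape-parameter convention of library routines this corresponds to shapes a + 1 and b + 1, which is an assumption about how the text's a and b map onto such a routine. The function names and the sample data are illustrative.

```python
from math import comb, floor, sqrt

def beta_cdf(x, m, n):
    """P(Y <= x) for the density (3.15) with a = m - 1 and b = n - m."""
    return sum(comb(n, j) * x ** j * (1 - x) ** (n - j) for j in range(m, n + 1))

def maritz_jarrett_weights(n, q=0.5):
    """Weights W_i = P((i - 1)/n <= Y <= i/n) for i = 1, ..., n."""
    m = floor(q * n + 0.5)
    if not 1 <= m <= n:
        raise ValueError("q * n + 0.5 must round down to a position between 1 and n")
    cdf = [beta_cdf(i / n, m, n) for i in range(n + 1)]
    return [cdf[i] - cdf[i - 1] for i in range(1, n + 1)]

def maritz_jarrett_se(x, q=0.5):
    """Estimated standard error sqrt(C_2 - C_1^2) of the m-th order statistic."""
    xs = sorted(x)
    w = maritz_jarrett_weights(len(xs), q)
    c1 = sum(wi * xi for wi, xi in zip(w, xs))
    c2 = sum(wi * xi ** 2 for wi, xi in zip(w, xs))
    return sqrt(max(c2 - c1 ** 2, 0.0))   # guard against tiny negative rounding error

scores = [34, 29, 55, 45, 21, 32, 39]     # illustrative data, n = 7, so m = 4
print(maritz_jarrett_weights(len(scores)))
print(maritz_jarrett_se(scores))
```

The weights add up to 1 and, for the median of an odd number of observations, are symmetric about the middle position, so most of the weight falls on the central order statistics.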

URL:https://www.sciencedirect.com/science/article/pii/B9780128047330000032

DESCRIPTIVE STATISTICS

Sheldon M. Ross, in Introduction to Probability and Statistics for Engineers and Scientists (Fourth Edition), 2009

SOLUTION

Since there are 54 data values, it follows that when the data are put in increasing order, the sample median is the average of the values in positions 27 and 28. Thus, the sample median is 18.5.

The sample mean and sample median are both useful statistics for describing the central tendency of a data set. The sample mean makes use of all the data values and is affected by extreme values that are much larger or smaller than the others; the sample median makes use of only one or two of the middle values and is thus not affected by extreme values. Which of them is more useful depends on what one is trying to learn from the data. For instance, if a city government has a flat rate income tax and is trying to estimate its total revenue from the tax, then the sample mean of its residents' income would be a more useful statistic. On the other hand, if the city was thinking about constructing middle-income housing, and wanted to determine the proportion of its population able to afford it, then the sample median would probably be more useful.

URL:https://www.sciencedirect.com/science/article/pii/B9780123704832000072

Using Statistics to Summarize Data Sets

Sheldon M. Ross, in Introductory Statistics (Fourth Edition), 2017

3.3 Sample Median

The following data represent the number of weeks after completion of a learn-to-drive course that it took a sample of seven people to obtain a driver's license:

2,110,5,7,6,7,3

The sample mean of this data set is $\bar{x}$ = 140/7 = 20; and so six of the seven data values are quite a bit less than the sample mean, and the seventh is much greater. This points out a weakness of the sample mean as an indicator of the center of a data set: namely, its value is greatly affected by extreme data values.

A statistic that is also used to indicate the center of a data set but that is not affected by extreme values is the sample median, defined as the middle value when the data are ranked in order from smallest to largest. We will let m denote the sample median.

Definition

Order the data values from smallest to largest. If the number of data values is odd, then the sample median is the middle value in the ordered list; if it is even, then the sample median is the average of the two middle values.

It follows from this definition that if there are three data values, then the sample median is the second-smallest value; and if there are four, then it is the average of the second- and the third-smallest values.

Example 3.6

The following data represent the number of weeks it took seven individuals to obtain their driver's licenses. Find the sample median.

2,110,5,7,6,7,3

Solution

First arrange the data in increasing order.

2,3,5,6,7,7,110
Since the sample size is 7, it follows that the sample median is the fourth-smallest value. That is, the sample median number of weeks it took to obtain a driver's license is m = 6 weeks.

Example 3.7

The following data represent the number of days it took 6 individuals to quit smoking after completing a course designed for this purpose.

1,2,3,5,8,100
What is the sample median?

Solution

Since the sample size is 6, the sample median is the average of the two middle values; thus,

m = (3 + 5)/2 = 4

That is, the sample median is 4 days.

In general, for a data set of n values, the sample median is the [(n+1)/2]-smallest value when n is odd and is the average of the (n/2)-smallest value and the (n/2+1)-smallest value when n is even.

The sample mean and sample median are both useful statistics for describing the central tendency of a data set. The sample mean, being the arithmetic average, makes use of all the data values. The sample median, which makes use of only one or two middle values, is not affected by extreme values.

Example 3.8

The following data give the names of the National Basketball Association (NBA) individual scoring champions and their season scoring averages in each of the seasons from 2000 to 2015.

(a) Find the sample median of the scoring averages.

(b) Find the sample mean of the scoring averages.

Season    Player, Team                          Scoring average
1999-00   Shaquille O'Neal, L.A. Lakers         29.7
2000-01   Allen Iverson, Philadelphia 76ers     31.1
2001-02   Allen Iverson, Philadelphia 76ers     31.4
2002-03   Tracy McGrady, Orlando Magic          32.1
2003-04   Tracy McGrady, Orlando Magic          28.0
2004-05   Allen Iverson, Philadelphia 76ers     30.7
2005-06   Kobe Bryant, L.A. Lakers              35.4
2006-07   Kobe Bryant, Los Angeles Lakers       31.6
2007-08   Lebron James, Cleveland Cavaliers     30.0
2008-09   Dwyane Wade, Miami Heat               30.2
2009-10   Kevin Durant, Oklahoma Thunder        30.1
2010-11   Kevin Durant, Oklahoma Thunder        27.7
2011-12   Kevin Durant, Oklahoma Thunder        28.0
2012-13   Carmelo Anthony, New York Knicks      28.7
2013-14   Kevin Durant, Oklahoma Thunder        32.0
2014-15   Russell Westbrook, Oklahoma Thunder   28.1
2015-16   Stephen Curry, San Francisco Warriors 30.1

Solution

(a) Since there are 17 data values, the sample median is the 9th smallest. Therefore, the sample median is

m = 30.1

(b) The sum of all 17 values is 514.9, and so the sample mean is

$$\bar{x} = \frac{514.9}{17} \approx 30.288$$

Historical Perspective

The Dutch mathematician Christian Huyghens was one of the early developers of the theory of probability. In 1669 his brother Ludwig, after studying the mortality tables of the time, wrote to his famous older brother that "I have just been making a table showing how long people have to live. Live well! According to my calculations you will live to be about 56½ and I to 55." Christian, intrigued, also looked at the mortality tables but came up with different estimates for how long both he and his brother would live. Why? Because they were looking at different statistics. Ludwig was basing his estimates on the sample median while Christian was basing his on the sample mean!

For data sets that are roughly symmetric about their central values, the sample mean and sample median will have values close to each other. For instance, the data

4,6,8,8,9,12,15,17,19,20,22

are roughly symmetric about the value 12, which is the sample median. The sample mean is $\bar{x}$ = 140/11 ≈ 12.73, which is close to 12.

The question as to which of the two summarizing statistics is the more informative depends on what you are interested in learning from the data set. For instance, if a city government has a flat-rate income tax and is trying to figure out how much income it can expect, then it would be more interested in the sample mean of the income of its citizens than in the sample median (why is this?). On the other hand, if the city government were planning to construct some middle-income housing and were interested in the proportion of its citizens who would be able to afford such housing, then the sample median might be more informative (why is this?).

Although it is interesting to consider whether the sample mean or sample median is more informative in a particular situation, note that we need never restrict ourselves to a knowledge of just one of these quantities. They are both important, and thus both should always be computed when a data set is summarized.

URL:https://www.sciencedirect.com/science/article/pii/B9780128043172000035
