Europe PMC

Abstract 


Background

In systematic reviews and meta-analysis, researchers often pool the results of the sample mean and standard deviation from a set of similar clinical trials. A number of the trials, however, reported the study using the median, the minimum and maximum values, and/or the first and third quartiles. Hence, in order to combine results, one may have to estimate the sample mean and standard deviation for such trials.

Methods

In this paper, we propose to improve the existing literature in several directions. First, we show that the sample standard deviation estimation in Hozo et al.'s method (BMC Med Res Methodol 5:13, 2005) has some serious limitations and is always less satisfactory in practice. Inspired by this, we propose a new estimation method by incorporating the sample size. Second, we systematically study the sample mean and standard deviation estimation problem under several other interesting settings where the interquartile range is also available for the trials.

Results

We demonstrate the performance of the proposed methods through simulation studies for the three frequently encountered scenarios, respectively. For the first two scenarios, our method greatly improves existing methods and provides a nearly unbiased estimate of the true sample standard deviation for normal data and a slightly biased estimate for skewed data. For the third scenario, our method still performs very well for both normal data and skewed data. Furthermore, we compare the estimators of the sample mean and standard deviation under all three scenarios and present some suggestions on which scenario is preferred in real-world applications.

Conclusions

In this paper, we discuss different approximation methods in the estimation of the sample mean and standard deviation and propose some new estimation methods to improve the existing literature. We conclude our work with a summary table (an Excel spreadsheet including all formulas) that serves as comprehensive guidance for performing meta-analysis in different situations.

BMC Med Res Methodol. 2014; 14: 135.
Published online 2014 Dec 19. https://doi.org/10.1186/1471-2288-14-135
PMCID: PMC4383202
PMID: 25524443

Estimating the sample mean and standard deviation from the sample size, median, range and/or interquartile range

Electronic supplementary material

The online version of this article (10.1186/1471-2288-14-135) contains supplementary material, which is available to authorized users.

Keywords: Interquartile range, Median, Meta-analysis, Sample mean, Sample size, Standard deviation

Background

In medical research, it is common to find that several similar trials are conducted to verify the clinical effectiveness of a certain treatment. While an individual trial could fail to show a statistically significant treatment effect, systematic reviews and meta-analysis of combined results might reveal the potential benefits of treatment. For instance, Antman et al. [1] pointed out that systematic reviews and meta-analysis of randomized controlled trials would have led to earlier recognition of the benefits of thrombolytic therapy for myocardial infarction and might have saved a large number of patients.

Prior to the 1990s, the traditional approach to combining results from multiple trials was to conduct narrative (unsystematic) reviews, which were mainly based on the experience and subjectivity of experts in the area [2]. However, this approach suffers from many critical flaws. The major one is the inconsistent criteria of different reviewers. To claim a treatment effect, different reviewers may use different thresholds, which often leads to opposite conclusions from the same study. Hence, from the mid-1980s, systematic reviews and meta-analysis have become an imperative tool in medical effectiveness measurement. Systematic reviews use specific and explicit criteria to identify and assemble related studies and usually provide a quantitative (statistical) estimate of the aggregate effect over all the included studies. The methodology in systematic reviews is usually referred to as meta-analysis. Because systematic reviews combine several studies and take more data into consideration, the accuracy of estimation improves and the treatment effect can be interpreted more precisely via meta-analysis.

In meta-analysis of continuous outcomes, the sample size, mean, and standard deviation are required from the included studies. This, however, can be difficult because results from different studies are often presented in different and inconsistent forms. Specifically in medical research, instead of reporting the sample mean and standard deviation of the trials, some trial studies only report the median, the minimum and maximum values, and/or the first and third quartiles. Therefore, we need to estimate the sample mean and standard deviation from these quantities so that we can pool results in a consistent format. Hozo et al. [3] were the first to address this estimation problem. They proposed a simple method for estimating the sample mean and the sample variance (or equivalently the sample standard deviation) from the median, range, and the size of the sample. Their method is now widely accepted in the literature of systematic reviews and meta-analysis. For instance, a search of Google Scholar on November 12, 2014 showed that the article by Hozo et al. had been cited 722 times, of which 426 citations were made in 2013 and 2014.

In this paper, we will show that the estimation of the sample standard deviation in Hozo et al.’s method has some serious limitations. In particular, their estimator does not fully incorporate the information of the sample size and, consequently, it is always less satisfactory in practice. Inspired by this, we propose a new estimation method that will greatly improve their method. In addition, we will investigate the estimation problem under several other interesting settings where the first and third quartiles are also available for the trials.

Throughout the paper, we define the following summary statistics:

a= the minimum value,

q1= the first quartile,

m= the median,

q3= the third quartile,

b= the maximum value,

n= the sample size.

The {a,q1,m,q3,b} is often referred to as the 5-number summary [4]. Note that the 5-number summary may not always be given in full. The three frequently encountered scenarios are:

$$\mathcal{C}_1 = \{a, m, b; n\},$$

$$\mathcal{C}_2 = \{a, q_1, m, q_3, b; n\},$$

$$\mathcal{C}_3 = \{q_1, m, q_3; n\}.$$

Hozo et al.’s method only addressed the estimation of the sample mean and variance under Scenario $\mathcal{C}_1$, while Scenarios $\mathcal{C}_2$ and $\mathcal{C}_3$ are also common in systematic review and meta-analysis. In Sections 'Methods’ and 'Results’, we study the estimation problem under these three scenarios, respectively. Simulation studies are conducted in each scenario to demonstrate the superiority of the proposed methods. We conclude the paper in Section 'Discussion’ with some discussions and a summary table to provide a comprehensive guidance for performing meta-analysis in different situations.

Methods

Estimating $\bar{X}$ and $S$ from $\mathcal{C}_1 = \{a, m, b; n\}$

Scenario $\mathcal{C}_1$ assumes that the median, the minimum, the maximum and the sample size are given for a clinical trial study. This is the same assumption as made in Hozo et al.’s method. To estimate the sample mean and standard deviation, we first review the Hozo et al.’s method and point out some limitations of their method in estimating the sample standard deviation. We then propose to improve their estimation by incorporating the information of the sample size.

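Throughout the paper, we let $X_1, X_2, \ldots, X_n$ be a random sample of size $n$ from the normal distribution $N(\mu, \sigma^2)$, and $X_{(1)} \le X_{(2)} \le \cdots \le X_{(n)}$ be the order statistics of $X_1, X_2, \ldots, X_n$. Also for the sake of simplicity, we assume that $n = 4Q + 1$ with $Q$ being a positive integer. Then

$$a = X_{(1)} \le \cdots \le X_{(Q+1)} = q_1 \le \cdots \le X_{(2Q+1)} = m \le \cdots \le X_{(3Q+1)} = q_3 \le \cdots \le X_{(4Q+1)} = X_{(n)} = b. \tag{1}$$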
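As a small illustration of the indexing in (1), the following Python sketch extracts the 5-number summary from a sample whose size has the form n = 4Q + 1. The function name `five_number_summary` is hypothetical and not part of the paper; samples of other sizes would need an interpolation rule for the quartiles, which is outside this sketch.

```python
def five_number_summary(sample):
    """Return (a, q1, m, q3, b, n) for a sample of size n = 4Q + 1."""
    n = len(sample)
    if n < 5 or (n - 1) % 4 != 0:
        raise ValueError("this sketch assumes n = 4Q + 1 with Q >= 1")
    q = (n - 1) // 4
    x = sorted(sample)
    # The 1-based order statistics X_(1), X_(Q+1), X_(2Q+1), X_(3Q+1), X_(n)
    # sit at the 0-based indices 0, Q, 2Q, 3Q and n - 1.
    return x[0], x[q], x[2 * q], x[3 * q], x[-1], n


print(five_number_summary([7, 1, 4, 9, 3, 8, 2, 6, 5]))  # (1, 3, 5, 7, 9, 9)
```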

In this section, we are interested in estimating the sample mean $\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i$ and the sample standard deviation $S = \left[\frac{1}{n-1}\sum_{i=1}^{n} (X_i - \bar{X})^2\right]^{1/2}$, given that $a$, $m$, $b$, and $n$ of the data are known.

Hozo et al.’s method

For ease of notation, let M=2Q+1. Then, M=(n+1)/2. To estimate the mean value, Hozo et al. applied the following inequalities:

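$$\begin{aligned}
a \le X_{(1)} &\le a, \\
a \le X_{(i)} &\le m, \quad i = 2, \ldots, M-1, \\
m \le X_{(M)} &\le m, \\
m \le X_{(i)} &\le b, \quad i = M+1, \ldots, n-1, \\
b \le X_{(n)} &\le b.
\end{aligned}$$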

Adding up all above inequalities and dividing by $n$, we have $LB_1 \le \bar{X} \le UB_1$, where the lower and upper bounds are

$$LB_1 = \frac{a+m}{2} + \frac{2b - a - m}{2n}, \qquad UB_1 = \frac{m+b}{2} + \frac{2a - m - b}{2n}.$$

Hozo et al. then estimated the sample mean by

$$\bar{X} \approx \frac{a + 2m + b}{4} + \frac{a - 2m + b}{4n}. \tag{2}$$

Note that the second term in (2) is negligible when the sample size is large. A simplified mean estimation is given as

$$\bar{X} \approx \frac{a + 2m + b}{4}. \tag{3}$$
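The two mean estimates (2) and (3) can be sketched in a few lines of Python. The function name `hozo_mean` and the optional-argument design are illustrative choices, not taken from the paper.

```python
def hozo_mean(a, m, b, n=None):
    """Estimate the sample mean from the minimum a, median m and maximum b.

    Without n this is the simplified formula (3); with n it adds the
    small-sample correction term of formula (2).
    """
    estimate = (a + 2 * m + b) / 4
    if n is not None:
        estimate += (a - 2 * m + b) / (4 * n)
    return estimate


print(hozo_mean(10, 20, 40))        # formula (3)
print(hozo_mean(10, 20, 40, n=25))  # formula (2)
```

The correction term shrinks at rate 1/n, which is why the two estimates are nearly identical for moderately large samples.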

For estimating the sample standard deviation, by assuming that the data are non-negative, Hozo et al. applied the following inequalities:

$$\begin{aligned}
a X_{(1)} \le X_{(1)}^2 &\le a X_{(1)}, \\
a X_{(i)} \le X_{(i)}^2 &\le m X_{(i)}, \quad i = 2, \ldots, M-1, \\
m X_{(M)} \le X_{(M)}^2 &\le m X_{(M)}, \\
m X_{(i)} \le X_{(i)}^2 &\le b X_{(i)}, \quad i = M+1, \ldots, n-1, \\
b X_{(n)} \le X_{(n)}^2 &\le b X_{(n)}.
\end{aligned} \tag{4}$$

With some simple algebra and approximations on the formula (4), we have $LSB_1 \le \sum_{i=1}^{n} X_i^2 \le USB_1$, where the lower and upper bounds are

$$LSB_1 = a^2 + m^2 + b^2 + (M-2)\,\frac{a^2 + am + m^2 + mb}{2}, \qquad USB_1 = a^2 + m^2 + b^2 + (M-2)\,\frac{am + m^2 + mb + b^2}{2}.$$

Then by (3) and the approximation $\sum_{i=1}^{n} X_i^2 \approx (LSB_1 + USB_1)/2$, the sample standard deviation is estimated by $S = \sqrt{S^2}$, where

$$S^2 = \frac{1}{n-1}\left(a^2 + m^2 + b^2 + \frac{n-3}{2}\cdot\frac{(a+m)^2 + (m+b)^2}{4} - \frac{n\,(a + 2m + b)^2}{16}\right).$$

When n is large, it results in the following well-known range rule of thumb:

$$S \approx \frac{b-a}{4}. \tag{5}$$

Note that the range rule of thumb (5) is independent of the sample size. It may not work well in practice, especially when n is extremely small or large. To overcome this problem, Hozo et al. proposed the following improved range rule of thumb with respect to the different size of the sample:

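$$S \approx \begin{cases}
\sqrt{\dfrac{1}{12}\left[\dfrac{(a - 2m + b)^2}{4} + (b-a)^2\right]}, & n \le 15, \\[2ex]
\dfrac{b-a}{4}, & 15 < n \le 70, \\[2ex]
\dfrac{b-a}{6}, & n > 70,
\end{cases} \tag{6}$$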

where the formula for n≤15 is derived under the equidistantly spaced data assumption, and the formula for n>70 is suggested by the Chebyshev’s inequality [5]. Note also that when the data are symmetric, we have a+b≈2m and so

$$S \approx \sqrt{\frac{1}{12}\left[\frac{(a - 2m + b)^2}{4} + (b-a)^2\right]} \approx \frac{b-a}{\sqrt{12}}.$$

Hozo et al. showed that the adaptive formula (6) performs better than the original formula (5) in most settings.
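For reference, Hozo et al.'s adaptive rule of thumb (three sample-size regimes with cutoffs at n = 15 and n = 70) might be coded as below. This is a minimal sketch; the function name `hozo_sd` is hypothetical, and the small-sample branch uses the expression that reduces to (b-a)/√12 for symmetric data, as noted above.

```python
import math


def hozo_sd(a, m, b, n):
    """Hozo et al.'s adaptive range rule of thumb for the standard deviation."""
    if n <= 15:
        # Small samples: formula derived under equidistantly spaced data.
        return math.sqrt(((a - 2 * m + b) ** 2 / 4 + (b - a) ** 2) / 12)
    if n <= 70:
        # Moderate samples: classical range rule of thumb.
        return (b - a) / 4
    # Large samples: divisor suggested by Chebyshev's inequality.
    return (b - a) / 6


for n in (10, 50, 100):
    print(n, hozo_sd(10, 25, 40, n))
```

The jumps in the divisor at the two cutoffs are what produce the sawtooth behaviour discussed later in the simulation study.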

Improved estimation of S

We think, however, that the adaptive formula (6) may still be less accurate for practical use. First, the threshold values 15 and 70 are suggested somewhat arbitrarily. Second, given the normal data $N(\mu, \sigma^2)$ with $\sigma > 0$ being a finite value, we know that $\sigma \approx (b-a)/6 \to \infty$ as $n \to \infty$. This contradicts the assumption that $\sigma$ is a finite value. Third, the non-negative data assumption in Hozo et al.’s method is also quite restrictive.

In this section, we propose a new estimator to further improve (6) and, in addition, we remove the non-negative assumption on the data. Let $Z_1, \ldots, Z_n$ be independent and identically distributed (i.i.d.) random variables from the standard normal distribution $N(0,1)$, and $Z_{(1)} \le \cdots \le Z_{(n)}$ be the order statistics of $Z_1, \ldots, Z_n$. Then $X_i = \mu + \sigma Z_i$ and $X_{(i)} = \mu + \sigma Z_{(i)}$ for $i = 1, \ldots, n$. In particular, we have $a = \mu + \sigma Z_{(1)}$ and $b = \mu + \sigma Z_{(n)}$. Since $E(Z_{(1)}) = -E(Z_{(n)})$, we have $E(b-a) = 2\sigma E(Z_{(n)})$. Hence, by letting $\xi(n) = 2E(Z_{(n)})$, we choose the following estimation for the sample standard deviation:

$$S \approx \frac{b-a}{\xi(n)}. \tag{7}$$

Note that $\xi(n)$ plays an important role in the sample standard deviation estimation. If we let $\xi(n) \equiv 4$, then (7) reduces to the original rule of thumb in (5). If we let $\xi(n) = \sqrt{12}$ for $n \le 15$, 4 for $15 < n \le 70$, or 6 for $n > 70$, then (7) reduces to the improved rule of thumb (6).

Next, we present a method to approximate ξ(n) and establish an adaptive rule of thumb for standard deviation estimation. By David and Nagaraja’s method [6], the expected value of Z(n) is

$$E(Z_{(n)}) = n \int_{-\infty}^{\infty} z\,[\Phi(z)]^{n-1}\,\phi(z)\,dz,$$

where $\phi(z) = \frac{1}{\sqrt{2\pi}} e^{-z^2/2}$ is the probability density function and $\Phi(z) = \int_{-\infty}^{z} \phi(t)\,dt$ is the cumulative distribution function of the standard normal distribution. For ease of reference, we have computed the values of $\xi(n)$ by numerical integration and listed them in Table 1 for n up to 50. From Table 1, it is evident that the adaptive formula (6) in Hozo et al.’s method is less accurate and also less flexible.

Table 1

Values of ξ(n) in the formula (7) and the formula (12) for n ≤ 50

n    ξ(n)     n    ξ(n)     n    ξ(n)     n    ξ(n)     n    ξ(n)
1    0        11   3.173    21   3.778    31   4.113    41   4.341
2    1.128    12   3.259    22   3.819    32   4.139    42   4.361
3    1.693    13   3.336    23   3.858    33   4.165    43   4.379
4    2.059    14   3.407    24   3.895    34   4.189    44   4.398
5    2.326    15   3.472    25   3.931    35   4.213    45   4.415
6    2.534    16   3.532    26   3.964    36   4.236    46   4.433
7    2.704    17   3.588    27   3.997    37   4.259    47   4.450
8    2.847    18   3.640    28   4.027    38   4.280    48   4.466
9    2.970    19   3.689    29   4.057    39   4.301    49   4.482
10   3.078    20   3.735    30   4.086    40   4.322    50   4.498
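The tabulated values of ξ(n) = 2E(Z(n)) can be reproduced approximately with a short numerical integration. The sketch below uses composite Simpson's rule on a truncated interval, which is an assumption of this illustration (the paper does not say which quadrature was used); the function name `xi` and its default arguments are likewise hypothetical.

```python
from statistics import NormalDist

_STD = NormalDist()  # standard normal distribution N(0, 1)


def xi(n, lo=-10.0, hi=10.0, steps=4000):
    """Approximate xi(n) = 2 E(Z_(n)) with composite Simpson's rule.

    The integrand is n * z * Phi(z)^(n-1) * phi(z); `steps` must be even.
    """
    def integrand(z):
        return n * z * _STD.cdf(z) ** (n - 1) * _STD.pdf(z)

    h = (hi - lo) / steps
    total = integrand(lo) + integrand(hi)
    for k in range(1, steps):
        total += (4 if k % 2 else 2) * integrand(lo + k * h)
    return 2 * total * h / 3


for n in (2, 10, 27, 50):
    print(n, round(xi(n), 3))
```

Truncating the integral at ±10 is harmless here because the normal density is negligible beyond that range.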

When n is large (say n>50), we can apply Blom’s method [7] to approximate E(Z(n)). Specifically, Blom suggested the following approximation for the expected values of the order statistics:

$$E(Z_{(r)}) \approx \Phi^{-1}\left(\frac{r - \alpha}{n - 2\alpha + 1}\right), \quad r = 1, \ldots, n, \tag{8}$$

where $\Phi^{-1}(z)$ is the inverse function of $\Phi(z)$, or equivalently, the upper zth percentile of the standard normal distribution. Blom observed that the value of α increases as n increases, with the lowest value being 0.330 for n=2. Overall, Blom suggested α=0.375 as a compromise value for practical use. Further discussion on the choice of α can be seen, for example, in [8] and [9]. Finally, by (7) and (8) with r=n and α=0.375, we estimate the sample standard deviation by

$$S \approx \frac{b-a}{2\,\Phi^{-1}\left(\dfrac{n - 0.375}{n + 0.25}\right)}. \tag{9}$$

In the statistical software R, the upper zth percentile $\Phi^{-1}(z)$ can be computed by the command “qnorm(z)”.
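Outside R, the same quantity is available in most numerical libraries. As a minimal sketch, the Python standard library's `statistics.NormalDist().inv_cdf` plays the role of `qnorm` below; the function names `xi_blom` and `sd_from_range` are hypothetical.

```python
from statistics import NormalDist


def xi_blom(n, alpha=0.375):
    """Blom's approximation of xi(n) = 2 E(Z_(n)), i.e. formula (8) with r = n."""
    return 2 * NormalDist().inv_cdf((n - alpha) / (n - 2 * alpha + 1))


def sd_from_range(a, b, n):
    """Estimate the standard deviation from the range and sample size, formula (9)."""
    return (b - a) / xi_blom(n)


print(round(xi_blom(50), 3))   # close to the tabulated xi(50)
print(round(xi_blom(463), 3))  # close to 6, the large-sample divisor in (6)
print(sd_from_range(10, 40, 50))
```

With the default α = 0.375 the argument of the inverse CDF is (n - 0.375)/(n + 0.25), matching formula (9).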

Estimating $\bar{X}$ and $S$ from $\mathcal{C}_2 = \{a, q_1, m, q_3, b; n\}$

Scenario $\mathcal{C}_2$ assumes that the first quartile, $q_1$, and the third quartile, $q_3$, are also available in addition to $\mathcal{C}_1$. In this setting, Bland’s method [10] extended Hozo et al.’s results by incorporating the additional information of the interquartile range (IQR). He further claimed that the new estimators for the sample mean and standard deviation are superior to those in Hozo et al.’s method. In this section, we first review Bland’s method and point out some limitations of this method. We then, accordingly, propose to improve this method by incorporating the size of a sample.

Bland’s method

Noting that n=4Q+1, we have Q=(n-1)/4. To estimate the sample mean, Bland’s method considered the following inequalities:

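$$\begin{aligned}
a \le X_{(1)} &\le a, \\
a \le X_{(i)} &\le q_1, \quad i = 2, \ldots, Q, \\
q_1 \le X_{(Q+1)} &\le q_1, \\
q_1 \le X_{(i)} &\le m, \quad i = Q+2, \ldots, 2Q, \\
m \le X_{(2Q+1)} &\le m, \\
m \le X_{(i)} &\le q_3, \quad i = 2Q+2, \ldots, 3Q, \\
q_3 \le X_{(3Q+1)} &\le q_3, \\
q_3 \le X_{(i)} &\le b, \quad i = 3Q+2, \ldots, n-1, \\
b \le X_{(n)} &\le b.
\end{aligned}$$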

Adding up all above inequalities and dividing by $n$, it results in $LB_2 \le \bar{X} \le UB_2$, where the lower and upper bounds are

$$LB_2 = \frac{a + q_1 + m + q_3}{4} + \frac{4b - a - q_1 - m - q_3}{4n}, \qquad UB_2 = \frac{q_1 + m + q_3 + b}{4} + \frac{4a - q_1 - m - q_3 - b}{4n}.$$

Bland then estimated the sample mean by (LB2+UB2)/2. When the sample size is large, by ignoring the negligible second terms in LB2 and UB2, a simplified mean estimation is given as

$$\bar{X} \approx \frac{a + 2q_1 + 2m + 2q_3 + b}{8}. \tag{10}$$

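For the sample standard deviation, Bland considered some similar inequalities as in (4). Then with some simple algebra and approximation, it results in $LSB_2 \le \sum_{i=1}^{n} X_i^2 \le USB_2$, where the lower and upper bounds are

$$\begin{aligned}
LSB_2 &= \frac{1}{8}\left[(n+3)(a^2 + q_1^2 + m^2 + q_3^2) + 8b^2 + (n-5)(a q_1 + q_1 m + m q_3 + q_3 b)\right], \\
USB_2 &= \frac{1}{8}\left[8a^2 + (n+3)(q_1^2 + m^2 + q_3^2 + b^2) + (n-5)(a q_1 + q_1 m + m q_3 + q_3 b)\right].
\end{aligned}$$

Next, by the approximation $\sum_{i=1}^{n} X_i^2 \approx (LSB_2 + USB_2)/2$,

$$S^2 \approx \frac{1}{16}\left(a^2 + 2q_1^2 + 2m^2 + 2q_3^2 + b^2\right) + \frac{1}{8}\left(a q_1 + q_1 m + m q_3 + q_3 b\right) - \frac{1}{64}\left(a + 2q_1 + 2m + 2q_3 + b\right)^2. \tag{11}$$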

Bland’s method then took the square root $\sqrt{S^2}$ to estimate the sample standard deviation. Note that the estimator (11) is independent of the sample size n. Hence, it may not be sufficient for general use, especially when n is small or large. In the next section, we propose an improved estimation for the sample standard deviation by incorporating the additional information of the sample size.
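Bland's large-sample estimators can be written down directly from the 5-number summary. The sketch below assumes the widely quoted form of his formulas: the mean as a weighted average with weights 1, 2, 2, 2, 1 over 8, and the variance as a combination of squares and adjacent cross-products minus the squared mean estimate. The function names are hypothetical.

```python
import math


def bland_mean(a, q1, m, q3, b):
    """Bland's simplified mean estimate from the 5-number summary."""
    return (a + 2 * q1 + 2 * m + 2 * q3 + b) / 8


def bland_sd(a, q1, m, q3, b):
    """Bland's standard deviation estimate; note that n does not appear."""
    squares = (a**2 + 2 * q1**2 + 2 * m**2 + 2 * q3**2 + b**2) / 16
    cross = (a * q1 + q1 * m + m * q3 + q3 * b) / 8
    mean_squared = (a + 2 * q1 + 2 * m + 2 * q3 + b) ** 2 / 64
    return math.sqrt(squares + cross - mean_squared)


print(bland_mean(1, 3, 5, 7, 9))
print(bland_sd(1, 3, 5, 7, 9))
```

Because neither function takes the sample size as an argument, the same summary yields the same estimate for n = 9 and n = 9001, which is the limitation the text points out.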

Improved estimation of S

Recall that the range $b-a$ was used to estimate the sample standard deviation in Scenario $\mathcal{C}_1$. Now for Scenario $\mathcal{C}_2$, since the IQR $q_3 - q_1$ is also known, another approach is to estimate the sample standard deviation by $(q_3 - q_1)/\eta(n)$, where $\eta(n)$ is a function of $n$. Taking both methods into account, we propose the following combined estimator for the sample standard deviation:

$$S \approx \frac{1}{2}\left(\frac{b-a}{\xi(n)} + \frac{q_3 - q_1}{\eta(n)}\right). \tag{12}$$

Following Section 'Improved estimation of S’, we have ξ(n)=2E(Z(n)). Now we look for an expression for η(n) so that (q3-q1)/η(n) also provides a good estimate of S. By (1), we have q1=μ+σZ(Q+1) and q3=μ+σZ(3Q+1). Then, q3-q1=σ(Z(3Q+1)-Z(Q+1)). Further, by noting that E(Z(Q+1))=-E(Z(3Q+1)), we have E(q3-q1)=2σE(Z(3Q+1)). This suggests that

$$\eta(n) = 2E(Z_{(3Q+1)}).$$

In what follows, we propose a method to compute the value of η(n). By [6], the expected value of Z(3Q+1) is

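$$E(Z_{(3Q+1)}) = \frac{(4Q+1)!}{Q!\,(3Q)!} \int_{-\infty}^{\infty} z\,[\Phi(z)]^{3Q}\,[1 - \Phi(z)]^{Q}\,\phi(z)\,dz.$$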

In Table 2, we provide the numerical values of $\eta(n) = 2E(Z_{(3Q+1)})$ for $Q \le 50$ using the statistical software R. When n is large, we suggest applying the formula (8) to approximate $\eta(n)$. Specifically, noting that $Q = (n-1)/4$, we have $\eta(n) \approx 2\Phi^{-1}\left((0.75n - 0.125)/(n + 0.25)\right)$ for $r = 3Q+1$ with $\alpha = 0.375$. Consequently, for the scenario $\mathcal{C}_2$ we estimate the sample standard deviation by

$$S \approx \frac{b-a}{4\,\Phi^{-1}\left(\dfrac{n - 0.375}{n + 0.25}\right)} + \frac{q_3 - q_1}{4\,\Phi^{-1}\left(\dfrac{0.75n - 0.125}{n + 0.25}\right)}. \tag{13}$$

Table 2

Values of η(n) in the formula (12) and the formula (15) for Q ≤ 50, where n = 4Q + 1

Q    η(n)     Q    η(n)     Q    η(n)     Q    η(n)     Q    η(n)
1    0.990    11   1.307    21   1.327    31   1.334    41   1.338
2    1.144    12   1.311    22   1.328    32   1.334    42   1.338
3    1.206    13   1.313    23   1.329    33   1.335    43   1.338
4    1.239    14   1.316    24   1.330    34   1.335    44   1.338
5    1.260    15   1.318    25   1.330    35   1.336    45   1.339
6    1.274    16   1.320    26   1.331    36   1.336    46   1.339
7    1.284    17   1.322    27   1.332    37   1.336    47   1.339
8    1.292    18   1.323    28   1.332    38   1.337    48   1.339
9    1.298    19   1.324    29   1.333    39   1.337    49   1.339
10   1.303    20   1.326    30   1.333    40   1.337    50   1.340

We note that the formula (13) is more concise than the formula (11). The numerical comparison between the two formulas will be given in the section of simulation study.
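To make the two ingredients of this section concrete, the sketch below first evaluates η(n) = 2E(Z(3Q+1)) from the order-statistic density by numerical integration, and then applies the combined estimator with the Blom-type approximations for both divisors. The quadrature (composite Simpson's rule on [-10, 10], carried out in log space to avoid overflow of the factorials) and the names `eta_exact` and `sd_c2` are assumptions of this illustration, not the authors' implementation.

```python
import math
from statistics import NormalDist

SQRT2 = math.sqrt(2.0)


def _log_cdf(z):
    # log Phi(z) computed via erfc, which stays positive far into the lower tail.
    return math.log(0.5 * math.erfc(-z / SQRT2))


def eta_exact(q, lo=-10.0, hi=10.0, steps=4000):
    """eta(n) = 2 E(Z_(3Q+1)) for n = 4Q + 1, by composite Simpson's rule."""
    n = 4 * q + 1
    # log of n! / (Q! (3Q)!), the constant of the order-statistic density.
    log_coef = math.lgamma(n + 1) - math.lgamma(q + 1) - math.lgamma(3 * q + 1)

    def integrand(z):
        log_density = (log_coef + 3 * q * _log_cdf(z) + q * _log_cdf(-z)
                       - 0.5 * z * z - 0.5 * math.log(2 * math.pi))
        return z * math.exp(log_density)

    h = (hi - lo) / steps
    total = integrand(lo) + integrand(hi)
    for k in range(1, steps):
        total += (4 if k % 2 else 2) * integrand(lo + k * h)
    return 2 * total * h / 3


def sd_c2(a, q1, q3, b, n):
    """Average of the range-based and IQR-based estimates (Blom approximations)."""
    std = NormalDist()
    xi_approx = 2 * std.inv_cdf((n - 0.375) / (n + 0.25))
    eta_approx = 2 * std.inv_cdf((0.75 * n - 0.125) / (n + 0.25))
    return 0.5 * ((b - a) / xi_approx + (q3 - q1) / eta_approx)


print(round(eta_exact(1), 3), round(eta_exact(10), 3))
print(sd_c2(a=1, q1=3, q3=7, b=9, n=9))
```

For small samples one could substitute the tabulated ξ(n) and η(n) for the two approximations; the structure of the estimator stays the same.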

Estimating $\bar{X}$ and $S$ from $\mathcal{C}_3 = \{q_1, m, q_3; n\}$

Scenario $\mathcal{C}_3$ is an alternative way to report the study other than Scenarios $\mathcal{C}_1$ and $\mathcal{C}_2$. It reports the first and third quartiles instead of the minimum and maximum values. One main reason to report $\mathcal{C}_3$ is that the IQR is usually less sensitive to outliers compared to the range. For the new scenario, we note that Hozo et al.’s method and Bland’s method will no longer be applicable. Particularly, if their ideas are followed, we have the following inequalities:

$$\begin{aligned}
-\infty < X_{(i)} &\le q_1, \quad i = 1, \ldots, Q, \\
q_1 \le X_{(Q+1)} &\le q_1, \\
q_1 \le X_{(i)} &\le m, \quad i = Q+2, \ldots, 2Q, \\
m \le X_{(2Q+1)} &\le m, \\
m \le X_{(i)} &\le q_3, \quad i = 2Q+2, \ldots, 3Q, \\
q_3 \le X_{(3Q+1)} &\le q_3, \\
q_3 \le X_{(i)} &< \infty, \quad i = 3Q+2, \ldots, n,
\end{aligned}$$

where the first Q inequalities are unbounded for the lower limit, and the last Q inequalities are unbounded for the upper limit. Now adding up all above inequalities and dividing by n, we have $-\infty \le \bar{X} \le \infty$. This shows that the approaches based on the inequalities do not apply to Scenario $\mathcal{C}_3$.

In contrast, the following procedure is commonly adopted in the recent literature including [11, 12]: “If the study provided medians and IQR, we imputed the means and standard deviations as described by Hozo et al. [3]. We calculated the lower and upper ends of the range by multiplying the difference between the median and upper and lower ends of the IQR by 2 and adding or subtracting the product from the median, respectively”. This procedure, however, performs very poorly in our simulations (not shown).

A quantile method for estimating $\bar{X}$ and $S$

In this section, we propose a quantile method for estimating the sample mean and the sample standard deviation, respectively. In detail, we first revisit the estimation method in Scenario $\mathcal{C}_2$. By (10), we have

$$\bar{X} \approx \frac{a + 2q_1 + 2m + 2q_3 + b}{8} = \frac{a+b}{8} + \frac{q_1 + m + q_3}{4}.$$

Now for Scenario $\mathcal{C}_3$, $a$ and $b$ are not given. Hence, a reasonable solution is to remove $a$ and $b$ from the estimation and keep the second term. By doing so, we have the estimation form as $\bar{X} \approx (q_1 + m + q_3)/C$, where $C$ is a constant. Finally, noting that $E(q_1 + m + q_3) = 3\mu + \sigma E(Z_{(Q+1)} + Z_{(2Q+1)} + Z_{(3Q+1)}) = 3\mu$, we let $C = 3$ and define the estimator of the sample mean as follows:

$$\bar{X} \approx \frac{q_1 + m + q_3}{3}. \tag{14}$$

For the sample standard deviation, following the idea in constructing (12) we propose the following estimation:

$$S \approx \frac{q_3 - q_1}{\eta(n)}, \tag{15}$$

where $\eta(n) = 2E(Z_{(3Q+1)})$. As mentioned above, $E(q_3 - q_1) = 2\sigma E(Z_{(3Q+1)}) = \sigma\,\eta(n)$; therefore, the estimator (15) provides a good estimate for the sample standard deviation. The numerical values of $\eta(n)$ are given in Table 2 for $Q \le 50$. When n is large, by the approximation $E(Z_{(3Q+1)}) \approx \Phi^{-1}\left((0.75n - 0.125)/(n + 0.25)\right)$, we can also estimate the sample standard deviation by

$$S \approx \frac{q_3 - q_1}{2\,\Phi^{-1}\left(\dfrac{0.75n - 0.125}{n + 0.25}\right)}. \tag{16}$$

A similar estimator for estimating the standard deviation from IQR is provided in the Cochrane Handbook [13], which is defined as

$$S \approx \frac{q_3 - q_1}{1.35}. \tag{17}$$

Note that the estimator (17) is also independent of the sample size n and thus may not be sufficient for general use. As we can see from Table 2, the value of $\eta(n)$ in the formula (15) converges to about 1.35 when n is large. Note also that the denominator in formula (16) converges to $2\Phi^{-1}(0.75)$, which is 1.34898, as n tends to infinity. When the sample size is small, our method will provide more accurate estimates than the formula (17) for the standard deviation estimation.
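The quartile-based estimators of this section, together with the fixed-divisor rule from the Cochrane Handbook, can be compared with a short Python sketch. The function names `mean_c3`, `sd_c3` and `sd_cochrane` are hypothetical, and `sd_c3` uses the large-sample approximation of η(n) rather than the tabulated values.

```python
from statistics import NormalDist


def mean_c3(q1, m, q3):
    """Mean estimate from the quartiles and median, formula (14)."""
    return (q1 + m + q3) / 3


def sd_c3(q1, q3, n):
    """IQR-based estimate with a sample-size dependent divisor, formula (16)."""
    eta_approx = 2 * NormalDist().inv_cdf((0.75 * n - 0.125) / (n + 0.25))
    return (q3 - q1) / eta_approx


def sd_cochrane(q1, q3):
    """Fixed-divisor estimate IQR / 1.35, formula (17)."""
    return (q3 - q1) / 1.35


print(mean_c3(3, 5, 7))
for n in (9, 101, 10001):
    print(n, sd_c3(3, 7, n), sd_cochrane(3, 7))
```

For small n the divisor in `sd_c3` is noticeably below 1.35, so the two estimates differ most in exactly the regime where the fixed rule is least reliable; as n grows they agree to several decimal places.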

Results

Simulation study for $\mathcal{C}_1$

In this section, we conduct simulation studies to compare the performance of Hozo et al.’s method and our new method for estimating the sample standard deviation. Following Hozo et al.’s settings, we consider five different distributions: the normal distribution with mean μ=50 and standard deviation σ=17, the log-normal distribution with location parameter μ=4 and scale parameter σ=0.3, the beta distribution with shape parameters α=9 and β=4, the exponential distribution with rate parameter λ=10, and the Weibull distribution with shape parameter k=2 and scale parameter λ=35. The graph of each of these distributions with the specified parameters is provided in Additional file 1. In each simulation, we first randomly sample n observations and compute the true sample standard deviation using the whole sample. We then use the median, the minimum and maximum values of the sample to estimate the sample standard deviation by the formulas (6) and (9), respectively. To assess the accuracy of the two estimates, we define the relative error of each method as

$$\text{relative error of } S = \frac{\text{the estimated } S - \text{the true } S}{\text{the true } S}. \tag{18}$$
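A scaled-down version of this simulation for the normal case can be sketched as follows. The code is an illustration under stated assumptions rather than the authors' simulation code: the relative error is averaged per simulated sample, the seed and the helper names (`hozo_sd`, `new_sd`, `average_relative_errors`) are arbitrary, and the median of an even-sized sample is taken as the average of the two middle values.

```python
import math
import random
from statistics import NormalDist, median


def hozo_sd(a, m, b, n):
    """Hozo et al.'s adaptive rule of thumb with cutoffs at n = 15 and n = 70."""
    if n <= 15:
        return math.sqrt(((a - 2 * m + b) ** 2 / 4 + (b - a) ** 2) / 12)
    return (b - a) / 4 if n <= 70 else (b - a) / 6


def new_sd(a, b, n):
    """Range divided by the Blom-type approximation of xi(n)."""
    return (b - a) / (2 * NormalDist().inv_cdf((n - 0.375) / (n + 0.25)))


def average_relative_errors(n, sims=1000, mu=50.0, sigma=17.0, seed=2014):
    """Average relative error of both estimators over `sims` normal samples."""
    rng = random.Random(seed)
    hozo_total = new_total = 0.0
    for _ in range(sims):
        x = [rng.gauss(mu, sigma) for _ in range(n)]
        mean = sum(x) / n
        true_s = math.sqrt(sum((v - mean) ** 2 for v in x) / (n - 1))
        a, m, b = min(x), median(x), max(x)
        hozo_total += (hozo_sd(a, m, b, n) - true_s) / true_s
        new_total += (new_sd(a, b, n) - true_s) / true_s
    return hozo_total / sims, new_total / sims


hozo_err, new_err = average_relative_errors(n=50)
print(f"n=50: Hozo {hozo_err:+.3f}, new method {new_err:+.3f}")
```

Repeating the call over a grid of sample sizes would trace out curves analogous to those reported for the normal distribution.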

With 1000 simulations, we report the average relative errors in Figure 1 for the normal distribution with the sample size ranging from 5 to 1001, and in Figure 2 for the four non-normal distributions with the sample size ranging from 5 to 101. For normal data, which are most commonly assumed in meta-analysis, our new method provides a nearly unbiased estimate of the true sample standard deviation. For Hozo et al.’s method, in contrast, we do observe that the best cutoff value is about n=15 for switching between the estimates $\sqrt{\frac{1}{12}\left[\frac{(a-2m+b)^2}{4} + (b-a)^2\right]}$ and $(b-a)/4$, and is about n=70 for switching between $(b-a)/4$ and $(b-a)/6$. However, its overall performance is not satisfactory, since the estimate fluctuates between -20% and 20% of the true sample standard deviation. In addition, we note that $\xi(27) \approx 4$ from Table 1 and $\xi(n) \approx 6$ when $\Phi^{-1}\left((n-0.375)/(n+0.25)\right) = 3$, that is, $n = (0.375 + 0.25\,\Phi(3))/(1 - \Phi(3)) \approx 463$. This coincides with the simulation results in Figure 1, where the method $(b-a)/4$ crosses the x-axis between n=20 and n=30, and the method $(b-a)/6$ crosses the x-axis between n=400 and n=500.

Figure 1

Relative errors of the sample standard deviation estimation for normal data, where the red lines with solid circles represent Hozo et al.’s method, and the green lines with empty circles represent the new method.

Figure 2

Relative errors of the sample standard deviation estimation for non-normal data (log-normal, beta, exponential and Weibull), where the red lines with solid circles represent Hozo et al.’s method, and the green lines with empty circles represent the new method.

From Figure 2 with the skewed data, our proposed method (9) makes a slightly biased estimate, with relative errors of about 5% of the true sample standard deviation. Nevertheless, it is still obvious that the new method is much better compared to Hozo et al.’s method. We also note that, for the beta and Weibull distributions, the best cutoff values of n should be larger than 70 for switching between $(b-a)/4$ and $(b-a)/6$. This again coincides with Table 1 in Hozo et al. [3], where the suggested cutoff value is n=100 for Beta and n=110 for Weibull.

Simulation study for $\mathcal{C}_2$

In this section, we evaluate the performance of the proposed method (13) and compare it to Bland’s method (11). Following Bland’s settings, we consider (i) the normal distribution with mean μ=5 and standard deviation σ=1, and (ii) the log-normal distribution with location parameter μ=5 and scale parameter σ=0.25, 0.5, and 1, respectively. For simplicity, we consider the sample size being n=4Q+1, where Q takes values from 1 to 50. As in Section 'Simulation study for $\mathcal{C}_1$’, we assess the accuracy of the two estimates by the relative error defined in (18).

In each simulation, we draw a total of n observations randomly from the given distribution and compute the true sample standard deviation of the sample. We then use and only use the minimum value, the first quartile, the median, the third quartile, and the maximum value to estimate the sample standard deviation by the formulas (11) and (13), respectively. With 1000 simulations, we report the average relative errors in Figure 3 for the four specified distributions. From Figure 3, we observe that the new method provides a nearly unbiased estimate of the true sample standard deviation. Even for the very highly skewed log-normal data with σ=1, the relative error of the new method is less than 10% for most sample sizes. In contrast, Bland’s method is less satisfactory. As reported in [10], the formula (11) only works for a small range of sample sizes (in our simulations, roughly from 20 to 40). When the sample size gets larger or the distribution is highly skewed, the sample standard deviations will be highly overestimated. Additionally, we note that the sample standard deviations will be seriously underestimated if n is very small. Overall, it is evident that the new method is better than Bland’s method in most settings.

Figure 3

Relative errors of the sample standard deviation estimation for normal data and log-normal data, where the red lines with solid circles represent Bland’s method, and the green lines with empty circles represent the new method.

Simulation study for $\mathcal{C}_3$

In the third simulation study, we conduct a comparison study that not only assesses the accuracy of the proposed method under Scenario $\mathcal{C}_3$, but also addresses a more realistic question in meta-analysis: “For a clinical trial study, which summary statistics should preferably be reported, $\mathcal{C}_1$, $\mathcal{C}_2$ or $\mathcal{C}_3$, and why?”

For the sample mean estimation, we consider the formulas (3), (10), and (14) under the three different scenarios, respectively. The accuracy of the mean estimation is also assessed by the relative error, which is defined in the same way as that for the sample standard deviation estimation. Similarly, for the sample standard deviation estimation, we consider the formulas (9), (13), and (15) under the three different scenarios, respectively. The distributions we consider are the same as in Section 'Simulation study for $\mathcal{C}_1$', i.e., the normal, log-normal, beta, exponential and Weibull distributions with the same parameters as those in the previous two simulation studies.

In each simulation, we first draw a random sample of size n from each distribution. The true sample mean and the true sample standard deviation are computed using the whole sample. The summary statistics are also computed and categorized into Scenarios $\mathcal{C}_1$, $\mathcal{C}_2$ and $\mathcal{C}_3$. We then use the aforementioned formulas to estimate the sample mean and standard deviation, respectively. The sample sizes are n=4Q+1, where Q takes values from 1 to 50. With 1000 simulations, we report the average relative errors in Figure 4 for both $\bar{X}$ and S with the normal distribution, in Figure 5 for the sample mean estimation with the non-normal distributions, and in Figure 6 for the sample standard deviation estimation with the non-normal distributions.
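The mean-estimation part of this comparison can be sketched as follows. The code is a hedged illustration, not the authors' implementation: it assumes that formulas (3), (10) and (14) are (a+2m+b)/4, (a+2q1+2m+2q3+b)/8 and (q1+m+q3)/3, that the quartiles of a sample of size n=4Q+1 are the order statistics X(Q+1), X(2Q+1) and X(3Q+1), and that the relative error is (average estimate − average true value) / average true value. The function names and the exponential rate used in the demo are arbitrary choices made for this example.

```python
import random
from statistics import fmean


def mean_c1(a, m, b):
    """Assumed form of Eq. (3): uses minimum, median, maximum."""
    return (a + 2 * m + b) / 4


def mean_c2(a, q1, m, q3, b):
    """Assumed form of Eq. (10): uses the full five-number summary."""
    return (a + 2 * q1 + 2 * m + 2 * q3 + b) / 8


def mean_c3(q1, m, q3):
    """Assumed form of Eq. (14): uses the three quartiles only."""
    return (q1 + m + q3) / 3


def mean_relative_errors(Q, draw, n_sim=1000, seed=0):
    """Relative error of each scenario's mean estimate, samples of size 4Q+1."""
    rng = random.Random(seed)
    n = 4 * Q + 1
    totals = {"C1": 0.0, "C2": 0.0, "C3": 0.0}
    true_total = 0.0
    for _ in range(n_sim):
        x = sorted(draw(rng) for _ in range(n))
        a, q1, m, q3, b = x[0], x[Q], x[2 * Q], x[3 * Q], x[-1]
        totals["C1"] += mean_c1(a, m, b)
        totals["C2"] += mean_c2(a, q1, m, q3, b)
        totals["C3"] += mean_c3(q1, m, q3)
        true_total += fmean(x)  # true sample mean from the whole sample
    return {k: (v - true_total) / true_total for k, v in totals.items()}


def draw_exponential(rng):
    return rng.expovariate(1.0)  # rate 1.0 is an arbitrary demo value


def draw_normal(rng):
    return rng.gauss(5, 1)


if __name__ == "__main__":
    for Q in (2, 10, 25):
        print(4 * Q + 1, mean_relative_errors(Q, draw_exponential))
```

For right-skewed data such as the exponential draws in the demo, the estimate that relies on the minimum and maximum is pulled upward by the growing maximum, while the quartile-only estimate falls below the true mean.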

Figure 4

Relative errors of the sample mean and standard deviation estimations for normal data, where the black solid circles represent the method under scenario $\mathcal{C}_1$, the red solid triangles represent the method under scenario $\mathcal{C}_2$, and the green empty circles represent the method under scenario $\mathcal{C}_3$.

Figure 5

Relative errors of the sample mean estimation for non-normal data (log-normal, beta, exponential and Weibull), where the black lines with solid circles represent the method under scenario $\mathcal{C}_1$, the red lines with solid triangles represent the method under scenario $\mathcal{C}_2$, and the green lines with empty circles represent the method under scenario $\mathcal{C}_3$.

Figure 6

Relative errors of the sample standard deviation estimation for non-normal data (log-normal, beta, exponential and Weibull), where the black lines with solid circles represent the method under scenario $\mathcal{C}_1$, the red lines with solid triangles represent the method under scenario $\mathcal{C}_2$, and the green lines with empty circles represent the method under scenario $\mathcal{C}_3$.

For normal data, which meta-analyses commonly assume, all three methods provide a nearly unbiased estimate of the true sample mean. The relative errors in the sample standard deviation estimation are also very small in most settings (within 1% in general). Among the three methods, however, we recommend estimating $\bar{X}$ and S using the summary statistics in Scenario $\mathcal{C}_3$. One main reason is that the first and third quartiles are usually less sensitive to outliers than the minimum and maximum values. Consequently, $\mathcal{C}_3$ produces a more stable estimate than $\mathcal{C}_1$, and also than $\mathcal{C}_2$, which is partially affected by the minimum and maximum values.

For non-normal data, we note from Figure 5 that the mean estimation from $\mathcal{C}_2$ is always better than that from $\mathcal{C}_1$. That is, if the additional information in the first and third quartiles is available, we should always use it. On the other hand, the estimation from $\mathcal{C}_2$ may not be consistently better than that from $\mathcal{C}_3$, even though $\mathcal{C}_2$ contains the additional information of the minimum and maximum values. The reason is that this additional information may contain extreme values, which may not be fully reliable and can thus lead to worse estimation. Therefore, we need to be cautious when choosing between $\mathcal{C}_2$ and $\mathcal{C}_3$. It is also noteworthy that (i) the mean estimation from $\mathcal{C}_3$ is not sensitive to the sample size, and (ii) $\mathcal{C}_1$ and $\mathcal{C}_3$ always lead to opposite estimations (one underestimates and the other overestimates the true value). From Figure 6, we observe that (i) the standard deviation estimation from $\mathcal{C}_3$ is quite sensitive to the skewness of the data, (ii) $\mathcal{C}_1$ and $\mathcal{C}_3$ also lead to opposite estimations except for very small sample sizes, and (iii) $\mathcal{C}_2$ turns out to be a good compromise for estimating the sample standard deviation. Taking both into account, we recommend reporting Scenario $\mathcal{C}_2$ in clinical trial studies. However, if we do not have all the information in the 5-number summary and have to decide between $\mathcal{C}_1$ and $\mathcal{C}_3$, we recommend $\mathcal{C}_1$ for small sample sizes (say n≤30), and $\mathcal{C}_3$ for large sample sizes.

Discussion

Researchers often use the sample mean and standard deviation to perform meta-analysis of clinical trials. Sometimes, however, the reported results include only the sample size, median, range and/or IQR. To combine these results in a meta-analysis, we need to estimate the sample mean and standard deviation from them. In this paper, we first show the limitations of the existing work and then propose some new estimation methods. We summarize all discussed and proposed estimators under the different scenarios in Table 3.

Table 3

Summary table for estimating $\bar{X}$ and S under different scenarios

| | Scenario $\mathcal{C}_1$ | Scenario $\mathcal{C}_2$ | Scenario $\mathcal{C}_3$ |
|---|---|---|---|
| Hozo et al. (2005) | $\bar{X}$: Eq. (3); S: Eq. (6) | | |
| Bland (2013) | | $\bar{X}$: Eq. (10); S: Eq. (11) | |
| New methods | $\bar{X}$: Eq. (3); S: Eq. (9) | $\bar{X}$: Eq. (10); S: Eq. (13) | $\bar{X}$: Eq. (14); S: Eq. (16) |

We note that the proposed methods are established under the assumption that the data are normally distributed. In meta-analysis, however, medians and quartiles are often reported when the data do not follow a normal distribution. A natural question arises: “To what extent does it make sense to apply methods that are based on a normal distribution assumption?” In practice, if the entire sample or a large part of it is known, standard statistical methods can be applied to estimate the skewness or even the density of the population. In the current study, however, the information provided is very limited; for example, only a, m, b and n are given in Scenario 1. In such situations, it may not be feasible to obtain a reliable estimate of the skewness unless we specify the underlying distribution of the population, and that distribution is unlikely to be known in practice. If we instead choose a distribution arbitrarily (and thus probably misspecify it), the estimates from the wrong model can be even worse than those obtained under the normal distribution assumption. As a compromise, we expect that the proposed formulas under the normal distribution assumption are among the best we can achieve.

Secondly, we note that even if the means and standard deviations can be satisfactorily estimated from the proposed formulas, it remains a question to what extent it makes sense to use them in a meta-analysis if the underlying distribution is very asymmetric and one must assume that they do not represent location and dispersion adequately. Overall, this is a very practical yet challenging question and may warrant more research. In our future research, we propose to develop some test statistics (likelihood ratio test, score test, etc.) for pre-testing the hypothesis that the distribution is symmetric (or normal) under the scenarios considered in this article. The result of the pre-test will then indicate whether or not we should still include the (very) asymmetric data in the meta-analysis. Other proposals that address this issue will also be considered in our future study.

Finally, to promote usability, we have provided an Excel spreadsheet in Additional file 2 that includes all formulas in Table 3. Specifically, in the Excel spreadsheet, our proposed methods for estimating the sample mean and standard deviation can be applied by simply inputting the sample size, the median, the minimum and maximum values, and/or the first and third quartiles for the appropriate scenario. Furthermore, for ease of comparison, we have also included Hozo et al.’s method and Bland’s method in the Excel spreadsheet.

Conclusions

In this paper, we discuss different approximation methods in the estimation of the sample mean and standard deviation and propose some new estimation methods to improve the existing literature. Through simulation studies, we demonstrate that the proposed methods greatly improve the existing methods and enrich the literature. Specifically, we point out that the widely accepted estimator of standard deviation proposed by Hozo et al. has some serious limitations and is always less satisfactory in practice because the estimator does not fully incorporate the sample size. As we explained in Section 'Estimating $\bar{X}$ and S from $\mathcal{C}_1$', using $(b-a)/6$ for $n>70$ in Hozo et al.’s adaptive estimation is untenable because the range $b-a$ tends to infinity as n approaches infinity if the distribution is not bounded, such as the normal and log-normal distributions. Our estimator replaces the adaptively selected thresholds ($\sqrt{12}$ for $n\le 15$, 4 for $15<n\le 70$, or 6 for $n>70$) with a unified quantity $2\Phi^{-1}\left(\frac{n-0.375}{n+0.25}\right)$, which can be quickly computed and is more stable and adaptive. In addition, our method removes the non-negative data assumption in Hozo et al.’s method and so is more applicable in practice.
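To see how the unified quantity behaves relative to fixed thresholds, one can tabulate it against n. The sketch below is illustrative only. It assumes that the range-based estimator has the form S ≈ (b−a)/ξ(n) with ξ(n) = 2Φ⁻¹((n−0.375)/(n+0.25)), and that the stepwise rule uses the constants √12 (n≤15), 4 (15<n≤70) and 6 (n>70); only the last of these is spelled out in the text above, and the small-sample rule of Hozo et al. also involves the median, so √12 stands in for its range-related constant. All function names are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

PHI_INV = NormalDist().inv_cdf  # quantile function of the standard normal


def unified_divisor(n):
    """Sample-size-dependent divisor 2 * Phi^{-1}((n - 0.375) / (n + 0.25))."""
    return 2 * PHI_INV((n - 0.375) / (n + 0.25))


def stepwise_divisor(n):
    """Assumed fixed thresholds of the stepwise rule (see lead-in)."""
    if n <= 15:
        return sqrt(12)
    if n <= 70:
        return 4.0
    return 6.0


def sd_from_range(a, b, n):
    """Assumed form of the range-based estimate: (b - a) / unified_divisor(n)."""
    return (b - a) / unified_divisor(n)


if __name__ == "__main__":
    for n in (5, 15, 30, 70, 100, 500, 5000):
        print(n, round(unified_divisor(n), 3), round(stepwise_divisor(n), 3))
```

The unified divisor increases smoothly and without bound as n grows, mirroring the growth of the expected range for unbounded distributions, whereas the stepwise divisor jumps at n=15 and n=70 and then stays fixed.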

Bland’s method extended Hozo et al.’s method by using the additional information in the IQR. Since extra information is included, it is expected that Bland’s estimators are superior to those in Hozo et al.’s method. However, the sample size is still not considered in Bland’s method for the sample standard deviation, which again limits its capability in real-world cases. Our simulation studies show that Bland’s estimator significantly overestimates the sample standard deviation when the sample size is large while seriously underestimating it when the sample size is small. Again, we incorporate the information of the sample size in the estimation of the standard deviation via two unified quantities, $4\Phi^{-1}\left(\frac{n-0.375}{n+0.25}\right)$ and $4\Phi^{-1}\left(\frac{0.75n-0.125}{n+0.25}\right)$. With some extra but trivial computing costs, our method makes a significant improvement over Bland’s method when the IQR is available.

Moreover, we pay special attention to an overlooked scenario where the minimum and maximum values are not available. We show that the methodology following the ideas in Hozo et al.’s method and Bland’s method will lead to unbounded estimators and is not feasible. By contrast, we extend the ideas of our proposed methods in the other two scenarios and again construct a simple but still valid estimator. After that, we take a step forward to compare the estimators of the sample mean and standard deviation under all three scenarios. For simplicity, we have only considered the three most commonly used scenarios, $\mathcal{C}_1$, $\mathcal{C}_2$ and $\mathcal{C}_3$, in the current article. Our method, however, can be readily generalized to other scenarios, e.g., when only $\{a, q_1, q_3, b; n\}$ are known or when additional quantile information is given.

Acknowledgements

The authors would like to thank the editor, the associate editor, and two reviewers for their helpful and constructive comments that greatly helped improve the final version of the article. X. Wan’s research was supported by the Hong Kong RGC grant HKBU12202114 and the Hong Kong Baptist University grant FRG2/13-14/005. T.J. Tong’s research was supported by the Hong Kong RGC grant HKBU202711 and the Hong Kong Baptist University grants FRG2/11-12/110, FRG1/13-14/018, and FRG2/13-14/062.

Footnotes

Competing interests

The authors declare that they have no competing interests.

Authors’ contributions

TT, XW, and JL conceived and designed the methods. TT and WW conducted the implementation and experiments. All authors were involved in the manuscript preparation. All authors read and approved the final manuscript.

Contributor Information

Xiang Wan, xwan@comp.hkbu.edu.hk.

Wenqian Wang, wenqianwang2014@u.northwestern.edu.

Jiming Liu, jiming@comp.hkbu.edu.hk.

Tiejun Tong, tongt@hkbu.edu.hk.

References

1. Antman EM, Lau J, Kupelnick B, Mosteller F, Chalmers TC. A comparison of results of meta-analyses of randomized control trials and recommendations of clinical experts: treatments for myocardial infarction. J Am Med Assoc. 1992;268:240–248. doi:10.1001/jama.1992.03490020088036.
2. Cipriani A, Geddes J. Comparison of systematic and narrative reviews: the example of the atypical antipsychotics. Epidemiol Psichiatr Soc. 2003;12:146–153. doi:10.1017/S1121189X00002918.
3. Hozo SP, Djulbegovic B, Hozo I. Estimating the mean and variance from the median, range, and the size of a sample. BMC Med Res Methodol. 2005;5:13. doi:10.1186/1471-2288-5-13.
4. Triola MF. Elementary Statistics, 11th Ed. 2009.
5. Hogg RV, Craig AT. Introduction to Mathematical Statistics. Maxwell: Macmillan Canada; 1995.
6. David HA, Nagaraja HN. Order Statistics, 3rd Ed. 2003.
7. Blom G. Statistical Estimates and Transformed Beta Variables. New York: John Wiley and Sons, Inc.; 1958.
8. Harter HL. Expected values of normal order statistics. Biometrika. 1961;48:151–165. doi:10.1093/biomet/48.1-2.151.
9. Cramér H. Mathematical Methods of Statistics. 1999.
10. Bland M. Estimating mean and standard deviation from the sample size, three quartiles, minimum, and maximum. International Journal of Statistics in Medical Research, in press. 2014.
11. Liu T, Li G, Li L, Korantzopoulos P. Association between c-reactive protein and recurrence of atrial fibrillation after successful electrical cardioversion: a meta-analysis. J Am Coll Cardiol. 2007;49:1642–1648. doi:10.1016/j.jacc.2006.12.042.
12. Zhu A, Ge D, Zhang J, Teng Y, Yuan C, Huang M, Adcock IM, Barnes PJ, Yao X. Sputum myeloperoxidase in chronic obstructive pulmonary disease. Eur J Med Res. 2014;19:12. doi:10.1186/2047-783X-19-12.
13. Higgins JPT, Green S. Cochrane Handbook for Systematic Reviews of Interventions. 2008.

Articles from BMC Medical Research Methodology are provided here courtesy of BMC
