# The Importance of Effect Sizes in the Interpretation of Research

## Primer on Research: Part 3

Clinical practices in speech-language pathology and audiology should be both theoretically motivated and grounded in empirical evidence. To ensure that clinical practices keep pace with empirical literature, researchers need to report study findings in a manner that is meaningful to their clinical audience, and clinicians need to evaluate the importance and relevance of research findings to their clinical practice. To do so, clinicians must draw upon principles learned in their research coursework. One essential principle is that a study's results must be not only *significant* but *meaningful*. In this article, we discuss the difference between significant results and meaningful results and the use of effect sizes for determining the meaningfulness of a result.

When clinicians find a research article that seems relevant to their clinical practice, perhaps one evaluating a state-of-the-art intervention, they might consider adopting the approach if the results show that it has a "statistically significant" positive effect. Unfortunately, statistical significance does not automatically equate to a *meaningful* or *practical* effect. Some statistically significant effects are meaningful, yet others are not. Because statistical significance and practical significance are often conflated when one interprets research findings (i.e., statistical significance is assumed to establish practical significance), researchers now are asked to explicitly interpret the practical import of statistical results by providing estimates of effect sizes. Effect-size estimates are values that characterize "the magnitude of an effect or the strength of a relationship" (APA, 2001, p. 25) in practical terms, such as standard deviation units.

### Beyond Chance

It is important to differentiate the meaning of *significant* in *statistical significance* from the meaning of *significant* in everyday life. In the *Merriam-Webster Dictionary* (1998), "significant" is defined as (1) having meaning, especially a hidden or special meaning; (2) having or likely to have a considerable influence or effect. Synonyms include *important, noteworthy, major,* and *momentous.* But in a research context, statistical significance simply conveys that the "probability of the observed difference arising by chance was sufficiently small" (Norman & Streiner, 2003, p. 32). It does not tell us about the size of the difference or whether the difference is meaningful. To address meaningfulness, researchers can report and interpret an effect-size estimate. So, readers must consider two issues to decide whether research results are sufficient to lead to a change in clinical practice. First, are the results statistically significant? Second, are the results clinically meaningful or relevant?

Before we discuss effect size, it is important to recall two points about statistical significance. Statistical analysis indicates whether a non-zero difference between groups is likely to be a random occurrence or if it is likely to be found again and again if the study is repeated; thus, statistical significance is based on estimates of probabilities. The first point concerns interpretation of *p* values, the most common metric by which statistical significance is determined. Most often, a finding of statistical significance is one in which a particular test value (e.g., from a *t* test or ANOVA) corresponds to a probability estimate (the *p* value) of less than .05; that is, if there were truly no difference, a result at least this large would be expected to arise by chance less than 5% of the time. The *p* value concerns only probability, not importance of findings. Sometimes we find researchers using the phrase *highly significant* when a *p* of .01 or .001 is reported; these words serve to confuse, more than enlighten, the reader.

The second point concerns the influence of sample size on a *p* value (or the likelihood of achieving statistical significance). A study with a large number of participants, for example, a few hundred, may report a statistically significant group difference for a seemingly small numerical difference in the dependent variable. Thompson (2002a) illustrated in a study of 12,000 students how a difference of 0.3 standard score points between two groups based on zip codes achieved statistical significance. The association between sample size and likelihood of achieving statistical significance is also an important consideration for studies with a small number of participants (e.g., low incidence disabilities). With a small sample size, statistical comparisons may show there to be no statistically significant difference between two groups, even when the means of the two groups seem quite different based on informal inspection of the data.
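The influence of sample size can be illustrated with a small sketch. It assumes a hypothetical 1.3-point difference between two equal-sized groups on a scale with a standard deviation of 15, and it uses a normal approximation (a *z* test) rather than the *t* test a real study would likely report. The function names and group sizes are illustrative only.

```python
from math import sqrt
from statistics import NormalDist


def two_sample_z_p_value(mean_diff, sd, n_per_group):
    """Two-sided p value for a difference between two group means.

    Assumes equal group sizes, a common SD, and a normal approximation.
    """
    standard_error = sd * sqrt(2 / n_per_group)
    z = mean_diff / standard_error
    return 2 * (1 - NormalDist().cdf(abs(z)))


def standardized_difference(mean_diff, sd):
    """Mean difference expressed in standard deviation units."""
    return mean_diff / sd


# Same hypothetical 1.3-point difference, three hypothetical group sizes.
for n in (20, 500, 5000):
    p = two_sample_z_p_value(1.3, 15, n)
    d = standardized_difference(1.3, 15)
    print(f"n per group = {n:5d}  p = {p:.4f}  difference in SD units = {d:.3f}")
```

Under these assumptions the *p* value shrinks as the groups grow, while the difference in standard deviation units stays exactly the same.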

### The Magnitude of Difference

These issues concerning statistical significance highlight the importance of considering the meaningfulness and relevance of research findings using other metrics, particularly estimates of effect sizes. Effect-size estimates are metrics designed specifically to characterize results in more functional and meaningful ways by discussing the *magnitude* of an effect in addition to estimates of probability. Reports of research in ASHA journals follow the guidelines of the *Publication Manual of the American Psychological Association* (APA, 2001), which suggest that authors both report and interpret effect size estimates in their results section. There are many effect size indices but all address the magnitude of the difference between groups or the relationship between variables. For the former, differences are typically interpreted based on standard deviation units (e.g., one group's scores are 0.5 standard deviation units greater than those of the other group); for the latter, differences are typically interpreted in terms of percent of variance accounted for (e.g., variable X accounts for 20% of the variance in variable Y). Thus, effect size informs the reader of the practical importance of the research findings.

Let's consider two types of effect sizes used to estimate the magnitude of differences between two or more groups: simple effect size and standardized effect size. Simple effect size is the raw difference between the means of two groups; it is most useful when the variable of interest and the unit of measure are easily interpretable. To illustrate, the *Preschool Language Scale* (PLS; 3rd ed.) was administered to 500 preschoolers, and on the Auditory Comprehension Scale girls had a higher group mean than boys. At first glance, this might suggest that boys need a language enrichment program to catch up to girls. However, an analysis of the simple effect size suggests otherwise. In standard score points, the mean for girls was 1.3 points above the mean for boys. Knowing that the PLS has a mean of 100 and a standard deviation of 15, it is quite easy to interpret the simple effect size. You can readily conclude that this difference is not clinically relevant.

Simple effect size, however, is typically an inadequate index. Most often, the measures used in communication sciences and disorders studies require authors to report a standardized effect size, that is, an effect size that expresses the difference between groups relative to a pooled standard deviation. There are several commonly used indices of effect size, such as Cohen's *d* or eta squared (η²). [See Vacha-Haase and Thompson (2004) on how to calculate effect size.] To illustrate, researchers reported a statistically significant group difference on an experimental measure of print knowledge when preschool children with language impairment (LI) were compared with typically developing preschoolers. The mean for the LI children was 7.5 and for the typical children 11.5. Although the simple effect size of 4 points is meaningless on its own, a standardized effect size estimate that considers the 4-point difference relative to the pooled standard deviation of the two groups (LI SD = 3.33; typical SD = 3.28) can be interpreted. Cohen's *d*, calculated to be 1.21, indicated that the LI group mean was more than one standard deviation below the typical group mean. Knowing the effect size, authors and readers can consider clinical relevance or meaningfulness of findings.
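The arithmetic behind the print-knowledge example can be sketched as follows. The group sizes are not reported here, so this sketch assumes two groups of equal size, in which case the pooled standard deviation is the root mean square of the two group SDs. The function names are illustrative.

```python
from math import sqrt


def pooled_sd(sd_a, sd_b):
    """Pooled SD for two groups, assuming equal group sizes."""
    return sqrt((sd_a ** 2 + sd_b ** 2) / 2)


def cohens_d(mean_a, mean_b, sd_a, sd_b):
    """Difference between two means in pooled standard deviation units."""
    return (mean_a - mean_b) / pooled_sd(sd_a, sd_b)


# Typical group: mean 11.5, SD 3.28. LI group: mean 7.5, SD 3.33.
d = cohens_d(11.5, 7.5, 3.28, 3.33)
print(round(d, 2))  # 1.21
```

With unequal group sizes, the pooled standard deviation would weight each group's variance by its degrees of freedom, and the result could differ slightly.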

### Interpreting Effect Size

Typically, effect-size estimates are interpreted in two ways. One way is to rely on commonly accepted benchmarks that differentiate small, medium, and large effects. Perhaps best known are the benchmarks presented by Cohen (1988) for interpreting Cohen's *d*, whereby 0.2 equates to a small effect, 0.5 equates to a medium effect, and effects larger than 0.8 equate to large effects. Thus, in the example above, the difference represents a large effect. However, concerns have been raised, even by Cohen himself, over blanket use of such benchmarks.
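As a minimal sketch, the benchmark approach amounts to a simple lookup. Treating 0.2, 0.5, and 0.8 as lower cutoffs, and labeling anything under 0.2 as "below small," are assumptions of this illustration; the benchmarks themselves do not prescribe exact boundaries.

```python
def cohen_benchmark(d):
    """Label a Cohen's d value using the 0.2 / 0.5 / 0.8 benchmarks.

    The cutoffs are treated as lower bounds, which is an assumption
    made for this sketch.
    """
    size = abs(d)  # direction of the difference does not affect magnitude
    if size >= 0.8:
        return "large"
    if size >= 0.5:
        return "medium"
    if size >= 0.2:
        return "small"
    return "below small"


print(cohen_benchmark(1.21))  # large
print(cohen_benchmark(0.5))   # medium
```

A label of this kind is only a starting point, given the concerns about applying the benchmarks mechanically.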

The second way to interpret an effect size value is to explicitly compare the reported effect size to those reported in prior studies of a similar nature (Thompson, 2002a; Vacha-Haase & Thompson, 2004). For instance, a researcher might study the impact of a hypothetical 30-week/60-hour home treatment program for hoarseness compared with that of a no-treatment control condition. Let's assume that post-treatment measurement of vocal quality, based on rating-scale scores, indicated an effect size of *d* = 0.5, medium in size based on Cohen's benchmarks. A savvy reader, however, is particularly interested in how this treatment's effect size compares to those of other treatments, such as a clinician-implemented treatment. As a complement to providing the effect size (*d* = 0.5) and its standard interpretation (medium in size), the researcher also should point out how this effect compares with those of other treatments of vocal hoarseness. For example, perhaps a previously published study found an effect size of 0.92 for a 15-week/30-hour clinician-directed treatment. This effect size provides a useful comparison to interpret the impact of the home treatment program. It is not enough to know that one treatment is better than another; readers of the research literature should expect authors to quantify and explain how much better.

Inclusion of effect sizes has an important benefit beyond the calculation of practical effects. Specifically, effect sizes can be compared across studies using a technique called meta-analysis. In a meta-analysis, a researcher statistically summarizes and integrates the effect sizes of multiple studies to calculate an average effect size. For example, Casby (2001) conducted a meta-analysis of the effects of otitis media on language development, and Law, Garrett, and Nye (2004) conducted a meta-analysis of the effects of treatment for children with speech and language disorders. Meta-analyses are useful for characterizing the average effects seen among a set of variables across an accumulated body of research; they are also important for characterizing the state of research in a given area. For instance, readers may be surprised to know that Law and colleagues' meta-analysis showed inconclusive effects for intervention on improving receptive language skills. Meta-analyses are important for establishing the science needed to further the field of communication sciences and disorders and to ensure the effectiveness of our clinical interventions.

By itself, an individual study provides only a limited contribution to a particular area of science. But a consideration of findings across a number of studies, ideally conducted by different research groups, provides an important advancement to the field and allows us to focus scientific resources more carefully.
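One common way to compute an average effect size, shown here only as a sketch, is a fixed-effect, inverse-variance weighted mean of Cohen's *d* values. This is not necessarily the method used in the meta-analyses cited above. The three studies, their sample sizes, and the function names are hypothetical, and the variance formula is the usual large-sample approximation for *d*.

```python
def d_variance(d, n1, n2):
    """Approximate sampling variance of Cohen's d for group sizes n1 and n2."""
    return (n1 + n2) / (n1 * n2) + d ** 2 / (2 * (n1 + n2))


def fixed_effect_mean(studies):
    """Inverse-variance weighted mean of d across (d, n1, n2) tuples."""
    weights = [1 / d_variance(d, n1, n2) for d, n1, n2 in studies]
    weighted_sum = sum(w * d for w, (d, _, _) in zip(weights, studies))
    return weighted_sum / sum(weights)


# Hypothetical studies: (effect size d, n in group 1, n in group 2).
studies = [(0.50, 30, 30), (0.92, 20, 20), (0.30, 100, 100)]
print(round(fixed_effect_mean(studies), 2))
```

Because larger studies estimate *d* more precisely, they receive more weight, so the pooled value sits closer to the largest study's effect than a simple average would.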

### A Closer Look at Effect Size

As Thompson aptly noted, "statistical significance is not sufficiently useful to be invoked as the sole criterion for evaluating the noteworthiness" of research (2002a, p. 66). Therefore, when readers review research literature they should expect authors to both report and interpret effect sizes. An author's interpretation of effect size can make reference to accepted benchmarks but must make comparisons to effects reported in a particular area of study, if available. In addition, authors should inform readers as to the importance of small or large effects in a particular area of study.

### References

**American Psychological Association** (2001). *Publication manual of the American Psychological Association* (5th ed.). Washington, D.C.: Author.

**Casby, M.** (2001). Otitis media and language development: A meta-analysis. *American Journal of Speech-Language Pathology, 10,* 65-80.

**Cohen, J.** (1988). *Statistical power analysis for the behavioral sciences* (2nd ed.). Hillsdale, NJ: Erlbaum.

**Law, J., Garrett, Z., & Nye, C.** (2004). The efficacy of treatment for children with developmental speech and language delay/disorder: A meta-analysis. *Journal of Speech, Language, and Hearing Research, 47,* 924-943.

**Norman, G., & Streiner, D.** (2003). *PDQ statistics.* Hamilton, Ontario: BC Decker.

**Thompson, B.** (2002a). "Statistical," "Practical," and "Clinical": How many kinds of significance do counselors need to consider? *Journal of Counseling and Development, 80,* 64-71.

**Thompson, B.** (2002b). What future quantitative social science research could look like: Confidence intervals for effect sizes. *Educational Researcher, 31*(3), 25-32.

**Vacha-Haase, T., & Thompson, B.** (2004). How to estimate and interpret various effect sizes. *Journal of Counseling Psychology, 51,* 473-481.