#IATEFL 2017: Meta may not be better

Apparently there were several mentions at IATEFL 2017 of John Hattie’s meta-meta-analytical report.

I won’t talk here about the wisdom of importing studies based on secondary school subjects into language learning. What I will do is summarise the main argument from Adrian Simpson (a professor in mathematics education) 1 against using meta-analysis study rankings to drive educational policy. I hope the examples he gives (which I have more or less paraphrased or reported verbatim) will help us be a bit better informed about the growing trust in meta-analysis in language teaching.

A meta-analysis tries to summarise studies of interest using a statistic called the effect size. A meta-meta-analysis summarises other meta-analyses. John Hattie’s report 2 and the Education Endowment Foundation (EEF) 3, a UK government-supported organisation, use meta-meta-analysis.

Simpson defines an effect size as the standardised mean difference, that is, the difference in mean scores between treatment groups divided by a measure of how much those groups vary. This is usually called Cohen’s d statistic.
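Written out, one common form of this definition looks as follows. This is a sketch that assumes the "measure of how those groups vary" is the pooled standard deviation of the two groups; the subscripts T (treatment) and C (comparison) are labels chosen here for illustration, and individual studies may standardise differently.

```latex
d = \frac{\bar{x}_T - \bar{x}_C}{s_p},
\qquad
s_p = \sqrt{\frac{(n_T - 1)\,s_T^2 + (n_C - 1)\,s_C^2}{n_T + n_C - 2}}
```

Everything that follows turns on the fact that d can grow either because the numerator (the mean difference) grows or because the denominator (the spread) shrinks.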

Simpson argues that the numbers produced by Hattie and the EEF in league tables do not reflect larger educational impacts but rather they reflect more sensitive studies. Hence we should not use them to drive educational policy. There is some suggestion that teachers and policy makers uncritically accept that higher ranked factors are educationally more important.

For example at IATEFL 2017 Sarah Mercer in her plenary stated 4:

Some of you may know John Hattie’s meta-study, where basically he looked at lots of research lots of studies that have been done in education and he tried to look at what was the sort of key things that influence education and achievement. And he filtered them down to 138 factors. Where do you think relationships were in this? Relationship between the teacher and the learner? That’s 138 of all the most important factors in education. Number 11. Now I gotta tell you that’s really high. Just to give you a clue. Motivation is down at 51.  So it’s hugely important and makes a massive difference to learning, engagement and other positive outcomes of education.

My emphasis in the quote shows acceptance that higher rankings mean better educational outcomes. Simpson describes three issues that affect the size of the Cohen’s d statistic: comparison groups, sample selection and outcome measures.

Comparison groups
Imagine two farmers. One farmer plants two rows of bean seeds. In the first row, the experimental row, she plants seeds in fertilizer; in the second row, the comparison row, she uses no fertilizer at all. In her comparison row her beans grow to a mean length of 10cm with a standard deviation (SD) of 1cm. In her experimental row the beans grow to a mean length of 11cm with an SD of 1cm. So she reports d = (11 - 10)/1 = 1.
The second farmer thinks his fertilizer is better than manure. So in his experimental row he plants seeds in fertilizer. In his comparison row he plants seeds in manure. He finds his comparison row bean length to have a mean of 10.5cm and an SD of 1cm, and his experimental row a mean of 11cm and an SD of 1cm, so d = (11 - 10.5)/1 = 0.5.

We cannot now say that the first farmer’s fertilizer has a larger impact on bean length than the second farmer’s. Nor can we combine the two d values to provide a meaningful estimate of the effectiveness of the fertilizer. The farmers were using different comparison groups (the first no fertilizer, the second manure).
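The two farmers’ arithmetic can be checked in a few lines. This is only a sketch: the helper name d_from_summary is made up for illustration, and it assumes both rows share the same SD, as they do in the example.

```python
def d_from_summary(mean_experimental, mean_comparison, sd):
    """Standardised mean difference when both rows have the same SD."""
    return (mean_experimental - mean_comparison) / sd

# Farmer 1: fertilizer row vs. a row with no fertilizer at all
d_farmer1 = d_from_summary(11.0, 10.0, 1.0)

# Farmer 2: fertilizer row vs. a row grown in manure
d_farmer2 = d_from_summary(11.0, 10.5, 1.0)

# Averaging the two, as a league table effectively does
naive_average = (d_farmer1 + d_farmer2) / 2

print(d_farmer1, d_farmer2, naive_average)  # 1.0 0.5 0.75
```

The fertilized rows are identical in both trials (mean 11cm, SD 1cm), yet d halves simply because the comparison row changed. The averaged value of 0.75 does not correspond to any comparison that either farmer actually made.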

We should ask, if we think back to the two factors Mercer highlighted – what groups were compared for the teacher relationship effect size and what groups for the motivation effect size?

Sample selection or range restriction
To illustrate this, consider the first farmer choosing seeds from a nursery that did not give very long or very short beans, while the second farmer chooses seeds at random from the nursery. At the end of their trials the first farmer will report a bigger effect size than the second because the first farmer restricted the range of her sample. While both farmers may find similar differences in mean bean length, the first farmer will have a smaller variance, hence the denominator in the calculation of d is smaller and the d statistic (effect size) will be bigger.
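A small deterministic sketch of range restriction follows. All the numbers are invented for illustration: a hypothetical nursery whose untreated beans range from 7cm to 13cm, a fertilizer that adds exactly 1cm to every bean, and a pooled-SD version of Cohen’s d (the helper cohens_d is written here, not taken from Simpson).

```python
from math import sqrt
from statistics import mean, variance

def cohens_d(experimental, comparison):
    """Mean difference divided by the pooled standard deviation."""
    n_e, n_c = len(experimental), len(comparison)
    pooled_var = ((n_e - 1) * variance(experimental)
                  + (n_c - 1) * variance(comparison)) / (n_e + n_c - 2)
    return (mean(experimental) - mean(comparison)) / sqrt(pooled_var)

# Hypothetical untreated bean lengths (cm) across the whole nursery
nursery = [7.0, 8.0, 9.0, 10.0, 11.0, 12.0, 13.0]
FERTILIZER_GAIN = 1.0  # assumed: fertilizer adds 1cm to every bean

# Farmer 1 avoids very long and very short beans
restricted = [x for x in nursery if 9.0 <= x <= 11.0]
d_restricted = cohens_d([x + FERTILIZER_GAIN for x in restricted], restricted)

# Farmer 2 draws from the full range (standing in for a random sample)
d_full_range = cohens_d([x + FERTILIZER_GAIN for x in nursery], nursery)

print(round(d_restricted, 2), round(d_full_range, 2))  # 1.0 0.46
```

The fertilizer does exactly the same thing in both trials, a 1cm gain, but the restricted sample more than doubles the reported effect size.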

So what was the sample selection like for the studies in the teacher relationship meta-analysis compared to sample selection in the studies of the motivation meta-analysis?

Range restriction can be corrected for, but Simpson claims there is no evidence that the meta-meta-analyses of Hattie and the EEF do this.

Design of measures
A researcher can increase their chances of finding a significant difference between groups if they use a test very similar to the nature of the intervention or if they increase the number of test items.
Suppose that, unknown to the farmers, the fertilizer is only effective on beans which are exposed to direct sunlight, not those that are shaded. The first farmer selects beans to measure from those which are easy to reach, i.e. those that tend to be exposed to sunlight. The second farmer selects across the plants, including beans hidden under leaves. The first farmer will report a larger mean difference (and so a bigger effect size) than the second farmer, since all the beans in her sample will have been affected by the fertilizer while the second farmer includes many shaded beans that were not affected.
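The sunlight example can also be sketched with made-up numbers. Here the fertilizer is assumed to add 1cm to sunlit beans and nothing to shaded ones; the bean lengths and the pooled-SD helper cohens_d are illustrative assumptions, not data from Simpson.

```python
from math import sqrt
from statistics import mean, variance

def cohens_d(experimental, comparison):
    """Mean difference divided by the pooled standard deviation."""
    n_e, n_c = len(experimental), len(comparison)
    pooled_var = ((n_e - 1) * variance(experimental)
                  + (n_c - 1) * variance(comparison)) / (n_e + n_c - 2)
    return (mean(experimental) - mean(comparison)) / sqrt(pooled_var)

# Hypothetical lengths (cm) in the unfertilized comparison row
sunlit_plain = [9.0, 10.0, 11.0]
shaded_plain = [9.0, 10.0, 11.0]

# Fertilized row: only the sunlit beans gain 1cm (assumed)
sunlit_fertilized = [10.0, 11.0, 12.0]
shaded_fertilized = [9.0, 10.0, 11.0]

# Farmer 1 measures only the easy-to-reach, sunlit beans
d_easy_reach = cohens_d(sunlit_fertilized, sunlit_plain)

# Farmer 2 measures across the whole plant
d_whole_plant = cohens_d(sunlit_fertilized + shaded_fertilized,
                         sunlit_plain + shaded_plain)

print(round(d_easy_reach, 2), round(d_whole_plant, 2))  # 1.0 0.51
```

Same plants, same fertilizer: the measure that happens to line up with where the treatment works reports roughly twice the effect size.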

We should ask about the relationship of the outcome measures to the interventions for the teacher relationship and motivation studies.

The above illustrates the focus of an outcome measure. The precision or accuracy of a measure can also change the effect size.

Suppose the first farmer measures the mean length of 5 beans chosen at random from each plant and the second farmer measures the mean length of 10 beans. The second farmer could report an effect size much larger than the first. By choosing a larger number of beans to measure, the second farmer gets a more precise estimate of the mean bean length, i.e. reduces the contribution of within-plant variance. So again, as with the restricted range above, we divide by a smaller standard deviation for the second farmer and hence get a larger d.
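The effect of measuring more beans per plant can be sketched analytically. The assumption here is a simple variance-components model: each plant’s mean score varies because plants differ from one another (between-plant SD) and because only a few beans are sampled from each plant (within-plant SD shrunk by the number of beans measured). The function name and all the numbers are invented for illustration.

```python
from math import sqrt

def d_for_plant_means(mean_gain, sd_between_plants, sd_within_plant,
                      beans_per_plant):
    """Effect size when each plant's score is the mean of k sampled beans.

    Assumes variance of a plant mean = between-plant variance
    + within-plant variance / k.
    """
    sd_of_plant_means = sqrt(sd_between_plants ** 2
                             + sd_within_plant ** 2 / beans_per_plant)
    return mean_gain / sd_of_plant_means

# Same hypothetical fertilizer (1cm gain) and same plants for both farmers
d_five = d_for_plant_means(1.0, 0.5, 2.0, beans_per_plant=5)
d_ten = d_for_plant_means(1.0, 0.5, 2.0, beans_per_plant=10)

print(round(d_five, 2), round(d_ten, 2))  # 0.98 1.24
```

Doubling the number of beans measured changes nothing about the fertilizer, yet d rises from about 0.98 to about 1.24 purely because the measurement is less noisy.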

So how precise were the measurements used for the teacher relationship studies compared to the motivation studies?

Simpson suggests that effect size is mis-named. It is better to name it effect clarity, i.e. a large d means a difference between groups is clear. It does not mean that the difference is large or important or educationally significant.

As Simpson emphasizes, individual study decisions regarding the three factors of comparison groups, sample selection and measurement design are a normal part of research design. The argument is that meta-analyses and meta-meta-analyses which attempt to rank-order study interventions based on effect size are misleading.

Thanks for reading.


1. Simpson, A. (2017). The misdirection of public policy: Comparing and combining standardised effect sizes. Journal of Education Policy, 32(4), 450-466.

2. Hattie, J. (2008). Visible learning: A synthesis of over 800 meta-analyses relating to achievement. Routledge.

3. Education Endowment Foundation, Teaching and Learning Toolkit [https://educationendowmentfoundation.org.uk/resources/teaching-learning-toolkit/]

4. IATEFL 2017 Plenary session by Sarah Mercer