Thin word lists and fat concordances

One aspect of the proposed changes to the GCSE modern foreign language (MFL) syllabus in the UK is the use of corpus-derived word lists 1. When the words in a text are counted, their frequencies follow a power law. A well-known power law is the Pareto distribution in economics: “Pareto showed that approximately 80% of the land in Italy was owned by 20% of the population” 2. Similarly, a large percentage of any piece of text comes from a relatively small number of words: the top 100 words in English account for about 50% of any text. The MFL review wants to use word lists of the 2000 most frequent words, which would cover about 80% of any text.
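Coverage figures of this kind are easy to check on any text you have to hand. Here is a minimal sketch in Python, assuming a crude regular-expression tokeniser and a toy sentence of my own; the function name coverage is just an illustrative choice, not something taken from the MFL review.

```python
from collections import Counter
import re

def coverage(text: str, top_n: int) -> float:
    """Share of running words (tokens) accounted for by the top_n most frequent words."""
    # Crude tokeniser (an assumption): lowercase runs of letters and apostrophes.
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return 0.0
    counts = Counter(tokens)
    covered = sum(freq for _, freq in counts.most_common(top_n))
    return covered / len(tokens)

# Toy text, invented for illustration only.
sample = "the cat sat on the mat and the dog sat on the rug"
for n in (1, 3, 8):
    print(n, round(coverage(sample, n), 2))
```

On a real corpus you would swap the toy sentence for a large text file and try top_n values such as 100 and 2000 to see how closely the figures quoted above hold.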

Currently the MFL syllabus is topic based, so one issue is that most words one can use for a particular topic will be limited to that topic. Put another way, a word may be frequent within a topic but lack range, that is, it does not appear in other topics. The NCELP (National Centre for Excellence for Language Pedagogy) writes in Vocabulary lists: Rationales and Uses: “For example, many of the words for pets or hobbies will be low frequency words which are not useful beyond those particular topics.” 3
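The difference between frequency and range can be made concrete with a small sketch. It assumes range is measured simply as the number of topic texts a word turns up in; the three topic texts are invented for illustration and word_range is a hypothetical helper name.

```python
import re

def word_range(word: str, topics: dict[str, str]) -> int:
    """Number of topic texts in which `word` occurs at least once."""
    word = word.lower()
    return sum(
        1 for text in topics.values()
        if word in re.findall(r"[a-z']+", text.lower())
    )

# Invented toy texts, one per topic.
topics = {
    "pets": "I have a hamster and a rabbit. The hamster is small.",
    "hobbies": "I have a guitar and I play football.",
    "school": "We have maths and we play in the yard.",
}

for w in ("hamster", "play", "have"):
    print(w, word_range(w, topics))
```

Here "hamster" is the most frequent content word in the pets text yet has a range of one, while "have" appears in every topic.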

There have been many critics of this wordlist-driven proposal who have pointed out various weaknesses; see the AQA exam board 4, the ASCL (Association of School and College Leaders) 5, Transform MFL 6 and the Linguistics in MFL Project 7.

I want to take a different tack and argue that the wordlist-driven approach is a half-hearted version of what could be a full-blooded corpus approach to vocabulary content.

Corpus stylist Bill Louw writes that he “has become suspicious of decontextualised frequency lists” (Louw & Milojkovic, 2016:32). He calls such lists thin lists because they tend to cover things rather than events (Louw 2010). Events are states of affairs, what J.R. Firth, one of the originators of the notion of meaning by collocation, called the context of situation. Looking at collocates of things in concordance lines allows us to “chunk the context of situation and culture into facts” (Louw 2010).

A concordance brings together and displays instances of use of a particular word from the widely disparate contexts in which it occurs. To cover events one would need to examine collocates in concordances, hence the term fat concordances.
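For readers who have not seen one built, a concordance can be sketched in a few lines of Python. This is a bare-bones keyword-in-context display under my own assumptions (character-based context width, a made-up sample text, a hypothetical function name concordance), not the output format of any particular corpus tool.

```python
import re

def concordance(text: str, node: str, width: int = 30) -> list[str]:
    """Return one keyword-in-context line per occurrence of `node` (a word or phrase)."""
    text = " ".join(text.split())  # collapse line breaks and repeated spaces
    pattern = re.compile(r"\b" + re.escape(node) + r"\b", re.IGNORECASE)
    lines = []
    for m in pattern.finditer(text):
        left = text[max(0, m.start() - width):m.start()].rstrip()
        right = text[m.end():m.end() + width].lstrip()
        # Right-align the left context so the node words line up in a column.
        lines.append(f"{left:>{width}} [{m.group(0)}] {right}")
    return lines

# Invented sample text.
sample = ("The match will take place on Monday. Please take a seat. "
          "Concerts take place every year.")
for line in concordance(sample, "take"):
    print(line)
```

Reading down the column to the right of the node word is what brings collocates such as "place" into view.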

The most frequent words are often bleached of their literal meanings. Take the word “take”: on its own, most people would think of the meanings of “the act of receiving, picking up or even stealing” (Louw & Milojkovic, 2016:5). In a collocation such as “take place”, however, the meaning is distant from the literal meaning of “take” 8. When the NCELP say “Very high frequency words often have multiple meanings.” they are describing the notion of delexicalisation.

To demonstrate context of situation and context of culture, here is corpus linguist John Sinclair’s PhraseBite pamphlet, as reproduced in Louw (2008):

When she was- – – – – Phrasebite© John Sinclair, 2006.

  1. The first grammatical collocate of when is she
  2. The first grammatical collocate of when she is was
  3. The vocabulary collocates of when she was are hair-raising. On the first page:
    diagnosed, pregnant, divorced, raped, assaulted, attacked
    The diagnoses are not good, the pregnancies are all problematic.
  4. Select one that looks neutral: approached
  5. Look at the concordance, first page.
  6. Nos 1, 4, 5, 8,10 are of unpleasant physical attacks
  7. Nos 2, 3, 6, 7, 9 are of excellent opportunities
  8. How can you tell the difference?
  9. the nasties are all of people out and about, while the nice ones are of people working somewhere.
  10. Get wider cotext and look at verb tenses in front of citation.
  11. In all the nasties the verb is past progressive, setting a foreground for the approach.
  12. In the nice ones, the verb is non-progressive, either simple past or past-in-past.

Data for para 4 above.
(1) walking in Burnfield Road , Mansewood , when she was approached by a man who grabbed her bag
(2) teamed up with her mother in business when she was approached by Neiman Marcus , the department store
(3) resolved itself after a few months , when she was approached by Breege Keenan , a nun who
(4) Bridge Road close to the Causeway Hospital when she was approached by three men who attacked her
(5) Drive , off Saughton Mains Street , when she was approached by a man . He began talking
(6) the original film of The Stepford Wives when she was approached by producer Scott Rudin to star as
(7) bony. ‘ ‘ Kidd was just 15 when she was approached to be a model . Posing on
(8) near her home with an 11-year-old friend when she was approached by the fiend . The man
(9) finished a storming set of jazz standards when she was approached by SIR SEAN CONNERY . And she
(10) on Douglas Street in Cork city centre when she was approached by the pervert . The man persuaded

As Louw (2008) puts it:

“The power of this publication, coming as it did so close to Sinclair’s death, is to be found in the detail of his method. By beginning with a single word, she, from the whole of the Bank of English, Sinclair simply requests the most frequent collocate from the Bank of English (approximately 500 million words of running text). The computer provides it: when. The results are then merged: when+she. A new search is initiated for the most frequent collocate of this two-word phrase. The computer provides it: was. The concordances are scrutinized and cultural insights are gathered.”
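The procedure Louw describes can be imitated on a small scale. The sketch below is a simplification under stated assumptions: it looks only at immediately adjacent words and ranks them by raw frequency, whereas the Bank of English software would have worked with a wider span and its own collocation measures. The toy corpus is invented and extend_phrase is a hypothetical name.

```python
from collections import Counter

def extend_phrase(tokens: list[str], phrase: list[str]) -> list[str]:
    """Attach the most frequent immediately adjacent word to `phrase`, on the side where it occurs."""
    n = len(phrase)
    neighbours = Counter()
    for i in range(len(tokens) - n + 1):
        if tokens[i:i + n] == phrase:
            if i > 0:
                neighbours[("L", tokens[i - 1])] += 1
            if i + n < len(tokens):
                neighbours[("R", tokens[i + n])] += 1
    if not neighbours:
        return phrase  # phrase not found: nothing to merge
    (side, word), _ = neighbours.most_common(1)[0]
    return [word] + phrase if side == "L" else phrase + [word]

# Invented toy corpus, already lowercased and split into words.
tokens = ("when she was approached she smiled and when she was young "
          "she sang but when she left he was sad").split()

phrase = ["she"]
for _ in range(2):
    phrase = extend_phrase(tokens, phrase)
    print(" ".join(phrase))
```

Each pass merges the winning collocate into the search phrase, which is the "results are then merged" step; the cultural insights still have to come from reading the concordance lines.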

The ASCL quotes applied linguist Vivian Cook:

“While word frequency has some relevance to teaching, other factors are also important, such as the ease with which the meaning of an item can be demonstrated (’blue’ is easier to explain than ‘local’) and its appropriateness for what pupils want to say (‘plane’ is more useful than ‘system’ if you want to travel)”

Blue is easier to explain than local because most collocates of blue reflect its literal colour meaning, e.g. “blue eyes”. Yet consider this from a children’s corpus:

“There, I feel better. I’ve been needing a good cry for some time, and
now I shall be all right. Never mind it, Polly, I’m nervous and tired;
I’ve danced too much lately, and dyspepsia makes me blue;” and Fanny
wiped her eyes and laughed. (An Old-fashioned Girl, by Louisa May Alcott)

So while it is true that blue is often associated with colour, it also associates with mental states, where the colour meaning is delexicalised, or washed out.

To conclude, the MFL proposals to use corpus-derived word lists to drive content are not taking full advantage of corpora. They are promoting thin word lists when they could also be promoting fat concordances.

Thanks for reading.


  1. MFL consultation –
  2. Pareto –
  3. NCELP –
  4. AQA –
  5. ASCL –
  6. Transform MFL –
  7. Linguistics in MFL Project –
  8. Take place –


Louw, B. (2008). Consolidating empirical method in data-assisted stylistics. Directions in Empirical Literary Studies: In Honor of Willie Van Peer, 5, 243.

Louw, B. (2010). Collocation as instrumentation for meaning: a scientific fact. In Literary education and digital learning: methods and technologies for humanities studies (pp. 79-101). IGI Global.

Louw, B., & Milojkovic, M. (2016). Corpus stylistics as contextual prosodic theory and subtext (Vol. 23). John Benjamins Publishing Company.
