Signs o’ the times – some/any invariant meanings and COCA

I am glad to be writing this particular (rushed, see end) post as it involves corpus linguistics and I have not done such a post for a while. It is also about my current interest – Columbia School linguistics.

Over the years I have become less enamored of the power of corpus linguistics for language teaching. Access to descriptions of language is certainly very useful, but it is not enough – explanations are also needed. Columbia School (CS) linguistics is about analyzing the invariant meanings that motivate choices in both grammar and lexis. It aims at one-form-to-one-meaning mappings – an ideal when looking to help students.

Nadav Sabar (2016) analyses the use of some and any; the following borrows heavily from his paper.

Most pedagogical grammars state (formal) rules such as “any is used in negative sentences and not in affirmative statements”. Yet such rules cannot account for why some is used in contexts where the rules would predict any. Sabar gives the following attested example:

1) When Yvonne lived in Italy, where it seems like the whole country is married, people always wanted to know about her personal life. I remember her telling me that every time she’d come back from a great vacation, the first question from married friends was, “Did you meet anybody?” It was as if the whole point of going on vacation was to meet someone. That she had a great time and saw something new and interesting didn’t matter. The entire vacation was cancelled or a flop because she didn’t meet someone.

Formal accounts can only say that any would also be acceptable, as in she didn’t meet anyone; they are unconcerned with why the writer chose some in this case.

Formal accounts use the sentence as the unit of analysis and see meaning as compositional – i.e. the meanings of the individual words in a sentence add up to the meaning of the whole. CS uses signs (pairings of signal and meaning) as the unit of analysis and sees meaning as instrumental rather than compositional. That is, the individual meanings of signs need not add up to the sentence meaning. There is a distinction between the linguistic code, which has an invariant meaning (one that always corresponds to a linguistic signal), and the interpretation of that code, which is the subjective outcome of messages. Meanings are very sparse in that they do not encode messages but only offer prompts that may suggest message elements.

The meaning hypotheses of some and any are: some – RESTRICTED, any – UNRESTRICTED.

I.e. some as RESTRICTED suggests limits, internal divisions and boundaries, while any as UNRESTRICTED suggests no boundaries, limits or divisions. Note that this does not mean that the domain in question has, in reality, no divisions or boundaries – just that the reality is irrelevant to the message. Also note that in a pedagogical grammar such as Martin Parrott’s, this meaning division between restricted and unrestricted is only described for stressed SOME and ANY.

Sabar uses the following as examples:

2) If you see something, say something. (New York City public safety slogan)
3) No parking any time (street sign)

In 2) some is used because the message suggested is a restriction on the set of things people see and say. The context drives the inference as to the nature of the restriction – suspicious-looking things. Any could also have been used, but that would not have been as effective a message – any would have suggested no restriction, i.e. people should say something no matter what they see.

Similarly in 3) any is used because there is no restriction on the domain of times of the day.

So now for 1) we can see that some is used because the message suggests a restriction of the set of people Yvonne did not meet, and the context shows this restriction to be people who might qualify as marriage potential.

Now the interesting corpus linguistics part.

The methodology of CS first involves a qualitative step in which some aspect of the sign in question is examined. So for some, which suggests restriction, another element suggesting the same message feature is looked for:

4) Some Feds [Federal workers] are held up as national heroes while others are considered a national joke. (ABC Nightline: Income Tax)

Here others is used to refer to a different subset of people within the domain of Federal workers. This message element is also suggested by some – RESTRICTED. This does not mean there is only one reason for the choice of these forms; rather, this message feature of internal division is one of many possible reasons motivating the choice of both forms.

To test this claim more generally, we can look at a corpus to see whether others co-occurs with some significantly more often than it does with any.

We can do this in COCA by using these search terms:

COCA searches for others:

Favoured:    some [up to 9 slots] others
Disfavoured: any [up to 9 slots] others

The following screenshot shows how to find some [up to 9 slots] others (do similar for any):

To find some occurring without others, see the next screenshot (i.e. use the minus sign -):
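Offline, the logic of this window search can be sketched in a few lines of Python. This is a toy illustration only – COCA’s own engine and tokenisation are far more sophisticated, and (as noted in Update 2 below) the search direction matters. Here we look only forward from the trigger word:

```python
def window_cooccurs(tokens, trigger, target, window=9):
    """Return positions of `trigger` with `target` in the next `window` words."""
    hits = []
    for i, tok in enumerate(tokens):
        if tok == trigger and target in tokens[i + 1:i + 1 + window]:
            hits.append(i)
    return hits

# Example 4) from above, lower-cased and roughly tokenised
sent = ("some feds are held up as national heroes "
        "while others are considered a national joke").split()
print(window_cooccurs(sent, "some", "others"))  # → [0]
```

Run over every text in a corpus, counts of such hits for some versus any would fill the contingency table below.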

And tabulating the data in a contingency table:

         others present       others absent
         N        %           N          %
some     19078    90          8946046    65
any      2022     10          4841946    35
Total    21100    100         13787992   100

p < .0001

The table percentages and the significance test support the claim that there is one message feature that motivates the use of both some and others. Note that the meaning hypothesis itself is not directly tested; it is only indirectly tested via the counts in COCA. Sabar goes on to test, both qualitatively and quantitatively, other signals that contribute to the meaning hypotheses of some – RESTRICTED and any – UNRESTRICTED.
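The p-value can be sanity-checked from the raw counts. Below is a minimal sketch using a hand-rolled Pearson chi-square statistic for a 2×2 table (Sabar’s exact test procedure is not specified here); the statistic comes out in the thousands, far beyond the 10.83 critical value for p = .001:

```python
def chi_square_2x2(table):
    """Pearson chi-square statistic for a 2x2 table of observed counts."""
    (a, b), (c, d) = table
    n = a + b + c + d
    rows, cols = [a + b, c + d], [a + c, b + d]
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            expected = rows[i] * cols[j] / n
            chi2 += (obs - expected) ** 2 / expected
    return chi2

# Counts from the some/any vs. others contingency table above
chi2 = chi_square_2x2([[19078, 8946046],   # some: others present / absent
                       [2022, 4841946]])   # any: others present / absent
print(round(chi2))
```

The same function applied to the singular other table further down gives an equally decisive result.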

I wondered how the singular other would distribute with any and some:

         other present        other absent
         N        %           N          %
any      39244    52          4811937    35
some     35175    48          8930621    65
Total    74419    100         13742558   100

p < .0001

Can we say here that singular other contributes to a message meaning of unrestricted? I have no idea, as I have not had time to explore this further!

I hope dear reader you forgive the rushed nature of this post but I wanted to get something up before the risk of forgetting this due to holiday haze!

Thanks for indulging.

Update 1:

Thanks to a heads-up from some tweeters: Michael Lewis, in his 1986 book The English Verb, was also pointing to the primacy of meaning:

Update 2:

Nadav Sabar has pointed out that he looked for others in one direction only, i.e. following some/any, whereas I looked at occurrences of others both before and after some/any.
In addition, a newer version of his paper uses a window size of 2 instead of 9.


Parrott, M. (2000). Grammar for English language teachers: with exercises and a key. Cambridge University Press.

Sabar, N. (2016). Using big data to test meaning hypotheses for any and some. In Otheguy, R., Stern, N., Reid, W. and Ruggles, J. (Eds.) Columbia School linguistics in the 21st century: advances in sign-based linguistics. Amsterdam/Philadelphia: John Benjamins. Retrieved from []


#IATEFL 2016 – Corpus Tweets 2

This is a storify of tweets by Sandy Millin, Dan Ruelle and Leo Selivan on the talk Answering language questions from corpora by James Thomas. Hats off to the tweeters – I know it’s not an easy task!

IATEFL 2016 Corpus Tweets 2

Answering language questions from corpora by James Thomas as reported by Sandy Millin, Dan Ruelle & Leo Selivan

  1. James Thomas on answering language questions from corpora. Did not know Masaryk uni was home of Sketch Engine!
  2. JT has written a book about discovering English through SketchEngine with lots of ways you can search and use the corpus
  3. JT trains his trainees how to use SketchEngine, so they can teach learners how to learn language from language
  4. JT Need to ensure that tasks and texts have a lot of affordances
  5. We live in an era of collocation, multi-word units, pragmatic competence, fuzziness and multiple affordances – James Thomas
  6. JT Why do SS have language questions? Are the rules inadequate? It’s about hierarchy of choice…
  7. JT Not much choice in terms of letters or morphemes, but lots of choice at text level
  8. JT Patterns are visible in corpora. They are regular features and cover a lot of core English
  9. JT What counts as a language pattern? Collocation, word grammar, language chunks, colligation (and more I didn’t get!)
  10. JT Students have questions about lexical cohesion, spelling mistakes, collocations: at every level of hierarchy
  11. JT Examples of q’s: Does whose refer only to people? Can women be described as handsome? Any patterns with tense/aspect clauses?
  12. JT q’s: Does the truth lie? What is friendly fire? What are the collocations of rule?
  13. JT introduces SKELL: Sketch Engine for Language Learning http://skell. (don’t know!)
  14. “Rules don’t tell whole story” – James Thomas making an analogy w/ Einstein who said same about both the wave & the particle theory
  15. JT SKELL selects useful sentences only, excludes proper nouns, obscure words etc. 40 sentences

    Nice simple interface – need to play with it more. #iatefl

  17. JT searched for mansplain in SKELL and it already has 7 or 8 examples in there
  18. JT Algorithm to reduce amount of sentences only works when there are a lot of examples. With a few, sentences often longer
  19. Sketch Engine is a pretty hardcore linguistic tool, but I can see the use of Skell for language learners. #iatefl
  20. JT Corpora can also teach you more about grammar patterns too, for example periphrasis (didn’t get definition fast enough!)
  21. JT Can search for present perfect continuous for example: have been .*ing
  22. JT You can search for ‘could of’ in SKELL – appears fairly often, but relatively insignificant compared to ‘could have’
  23. Can use frequency in corpus search results to gauge which is “more correct” / “the norm”. #iatefl
  24. JT SKELL can sort collocations by whether a noun is the object or subject of a word for example. Can use ‘word sketch’ function
  25. Unclear whether collocation results in Skell are sorted according to “significance” / frequency or randomly #iatefl
  26. JT See @versatilepub for discounts on book about SKELL
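Item 21’s SKELL search can be mimicked offline with an ordinary regular expression – a rough stand-in only, since SKELL’s own query language is not plain regex. The pattern below matches perfect continuous forms (have/has/had been + -ing) in raw text:

```python
import re

# Matches e.g. 'have been working', 'has been studying'
ppc = re.compile(r"\b(?:have|has|had)\s+been\s+\w+ing\b", re.IGNORECASE)

text = "They have been working with corpora, and she has been studying SKELL."
print(ppc.findall(text))  # → ['have been working', 'has been studying']
```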


#IATEFL 2016 – Corpus Tweets 1

This is a storified version of the tweets by Sandy Millin on a talk called Making trouble-free corpus tasks in ten minutes by Jennie Wright @teflhelper. Hopefully there will be other tweeters attending the other corpus-based talks who will be up to the standard set by Sandy : )

IATEFL2016 Corpus Tweets 1

Making trouble free corpus tasks in ten minutes – Jennie Wright as reported by Sandy Millin

  1. Jennie Wright now in Hall 8b ‘Making trouble-free corpus tasks in ten minutes’
  2. Jennie Wright runs the TEFL helper blog: 
  3. Jennie Wright All you need to make quick corpus tasks is a good copy-paster
  4. Jennie Wright Key terms: corpus/corpora: multi-million word collections of lang, concordance lines: search term presented in middle
  5. Jennie Wright POS/grammatical tagging is tagging with noun, verb etc; KWIC is key word in the middle (was wrong before! Sorry)
  6. Jennie Wright COCA is one that her business English students go back to: 
  7. #iatefl Jennie Wright This is the search for COCA. It's very intuitive. Put a key word in the box

  8. Jennie Wright You can click KWIC to see the key word in the centre. Very fast!
  9. Jennie Wright COCA is great because it’s colour coded and the parts of speech are tagged
  10. #iatefl Jennie Wright What's the missing word? Get your concordance lines, then blank out one word

  11. Jennie Wright It’s ‘thingie’ – she wanted her students to use something other than ‘thing’ or ‘stuff’! Good for fossilised errors
  12. Jennie Wright To make it, use a screengrab or clipping tool to get the concordance lines, then block out words you want SS to guess
  13. Jennie Wright Don’t forget to read the concordance lines before you copy and paste! Avoid accidents 🙂
  14. Jennie Wright Activity 2: collocation gamble. Focus on strong and weak collocations and lexical chunking. Good for SS misusing them
  15. #iatefl Jennie Wright. What are the two most common adjective collocations each for bitterly, deeply and sincerely?

  16. @teflhelper 2 most common=’bitterly’: ‘disappointed’/’cold’. ‘deeply’:’concerned’/’sorry’. ‘sincerely’:’interested’/’sorry’
  17. @teflhelper To make collocation gamble, use list function, type ‘bitterly [j*]’ in word box, then adj.ALL in POS list
  18. I think Sandy meant you can use the POS list to insert the code for adjectives if you want, not in addition to the search term given [mura]
  19. #iatefl @teflhelper 3. Colour-code me. COCA helps you to do this easily. Can you colour-code these sentences?

  20. @teflhelper To make this, search for your word, screen shot the answers and retype the sentences for them to colour code
  21. @teflhelper Audience member suggests giving them a key of colours and get them to figure out a sentence which matches
  22. @teflhelper Tips: 1.train a little: better to know one corpus well than a lot of them a little. 2. Imperfections exist in corpora
  23. @teflhelper Tips: Don’t be afraid to oppose what’s in the corpus – you’re the ‘live corpus in the classroom’ Read it carefully
  24. @teflhelper Tips: 3. Choose wisely – never more than 10 lines and don’t overwhelm. 4. What’s the problem you want to solve?
  25. @teflhelper COCA is very helpful when your students don’t believe you 🙂 maximal v. maximum [think Ngrams useful here too]
  26. @teflhelper Tip 5: consider how to do this online or offline. If online, what’s your backup plan? Have paper copies!
  27. @teflhelper COCA bites on Youtube are 2-minute tutorials on how to use COCA
  28. @teflhelper Thanks very much Jennie for an excellent talk. Exactly what I’ve needed for a long time 🙂

IATEFL 2016 – Corpus carnival!

I started following IATEFL online in 2012 but it is only after seeing this year’s programme that I felt any slight regret at not being a member and not going in person. The number of corpus-based talks is encouraging, though it seems from the presenters’ affiliations that most uses are still based in higher/tertiary education. There is also a forum on using corpora in the classroom.

I hope to blog the conference (do check the list of registered bloggers) and if you want to keep up with either TESOL 2016 or IATEFL 2016 tweeting do consider following the bot @TESOL_IATEFL_50.

Finally there is a new dedicated blog Corpus Linguistics for EFL, do check it.

Note for folks interested in TESOL 2016 corpus related talks see this list.

If you spot any relevant corpus talks that are missing let me know, thanks.

Talks using the word corpus or corpora in the IATEFL 2016 programme (pdf):

Wednesday 13 April
Making trouble-free corpus tasks in ten minutes
Jennie Wright (Target Training)
For business English learners who repeatedly misuse specific vocabulary and grammar, using a corpus (electronic multi-million word collections of real-world language examples) significantly enhances accuracy and competence. Accessible to everyone, with masses of free material to exploit, workshop participants will leave knowing how to quickly and easily use corpora to design activities that take less than ten minutes to create.

Using corpora to remedy language errors in L2 writing
Hulya Can (Bilkent University)
I present a classroom study, conducted with 13 intermediate-level university students, which tested if corpora helped learners improve L2 writing. Participants were asked to use a corpus to correct their written language errors and later questionnaires and interviews were carried out. Data analysis suggested a decrease in the number of language errors; furthermore, participants believed it was an effective language tool.

Classroom applications of corpora training for learner autonomy
Federico Espinosa (The University of Birmingham)
There is an established belief in ELT that training learners in strategies for independent language analysis fosters a deeper understanding of English. Following up from last year’s research talk on corpora training for increasing learner autonomy, this practical workshop will present three fully-developed activities to use corpora with learners in a classroom environment.

Conceptual interface of corpus-based error analysis through error mapping
Paschalis Chliaras (University of Birmingham, UK)
This presentation examines the effectiveness of ‘error mapping’ as a macro and micro error analysis of non-native English language learners’ essays. The procedure involved collecting data from essays, interpreting it, reporting information, and implementing it to teaching and learning. Subsequently, students understood their mistakes, identified their needs, learned to avoid mother tongue interference and handed in a competently proofread essay.

Teaching the pragmatics of spoken requests in EAP
Christian Jones (University of Liverpool, UK)
This talk will describe the impact of one explicit interventional treatment on developing pragmatic awareness and production of spoken requests and apologies in an EAP context at a British higher education institution. The talk will describe the effectiveness of the instruction, the linguistic features of successful spoken requests and apologies in this context, and the implications for EAP teaching. (the presenter here assures us “I am not speaking directly about corpora but may slip some mentions in!”)

Thursday 14 April
Answering language questions from corpora
James Thomas (Masaryk University)
There are many language questions that dictionaries, grammar books and native speakers cannot and do not readily answer. The range of questions extends across the whole hierarchy of language from morphology to sentence building to discourse and pragmatics. This talk offers an approach to asking questions to thousands of native speakers whose language has been sampled and stored in corpora.

Using English Grammar Profile to improve curriculum design
Geraldine Mark (Gloucestershire College/Cambridge University Press) & Anne O’Keeffe (Mary Immaculate College, Limerick/Cambridge University Press)
This talk showcases the English Grammar Profile, a new open educational resource developed to enhance our understanding of English learner grammar. Based on the Cambridge Learner Corpus, it provides over 1,200 corpus-based grammar competency statements across the six levels of the CEFR. The talk will showcase the resource and explore its importance for the design of materials and curricula.

Focus on B2 writing: preparing students for Cambridge English: First
Annette Capel (Freelance)
How can students score top marks? What aspects of writing should they work on at B2? This practical session explores the strengths and weaknesses of candidate performance using real answers from the Cambridge Learner Corpus. Participants will work with the Cambridge English Assessment Scale and evaluate preparation strategies. Learner data from the English Grammar Profile will illustrate useful grammatical development.

Electronic theses online – developing domain-specific corpora from open access
Alannah Fitzgerald (Concordia University) & Chris Mansfield (Queen Mary University of London)
Research findings will be presented from a study into the development and evaluation of domain-specific corpora from the Electronic Theses Online Service (EThOS) at the British Library. These collections were built using the interactive FLAX open-source language software for uptake in English for Specific Academic Purposes (ESAP) programmes at Queen Mary University of London.

Friday 15 April
Grammar for academic purposes
Louise Greenwood (Zayed University, Dubai)
Does an explicit focus on grammar help our students? If so, which grammatical structures should we focus on? This talk will argue that form-focused instruction is valuable and that careful selection of structures based on evidence from a corpus is essential in order to plan a targeted syllabus that meets the needs of students preparing for higher education.

Teacher-driven corpus development: the online restaurant review
Chad Langford & Joshua Albair (University Lille 3, France)
We present our project to develop a user-friendly, high-quality corpus of online restaurant reviews, which we consider a specific genre. Our goals are threefold: to present the genesis and results of our project; to elaborate on concrete pedagogical applications (concerning lexis, discourse, grammar and genre-based writing); and to foster collaboration between colleagues eager to develop and share corpora.
Forum on using corpora in the classroom

Guiding EAP learners to autonomously use online corpora: lessons learned
Daniel Ruelle (RMIT University Vietnam)
This presentation outlines the lessons learned from an initiative to guide upper-intermediate EAP learners to independently use online corpora to improve their written lexical range and accuracy. Experienced and less-experienced educators will leave with a better understanding of the benefits and challenges of training learners to use corpora, and several online tools and practical resources to use with their learners.

Learning academic vocabulary through a discovery-based approach
Nicole Keng (University of Vaasa, Finland)
This talk will examine the effectiveness of using corpora to learn academic vocabulary. The learning experiences and vocabulary knowledge of two groups of Finnish students will be compared. The findings will show how a discovery-based approach to academic vocabulary acquisition can profitably be embedded in EAP course design in a Finnish university context.

Exploring EAP teachers’ familiarity and experiences of corpora
Rachel Peacock (University of Nottingham Ningbo China)
This talk will present findings of a questionnaire investigating 52 EAP teachers’ understanding and practical classroom experience of corpora. Results highlight that the pedagogical potential of corpus-based applications remains at the research level. To address this, three user-friendly online reference tools that can be used by students or teachers in various teaching contexts will be introduced.
Data-driven learning – 25 years on
Crayton Walker (University of Birmingham)
Tim Johns from the University of Birmingham came up with the term Data Driven Learning (DDL) to describe the different ways language teachers can use corpora and corpus-based evidence in the classroom to support learning. In this workshop, I revisit DDL in order to find out how the methodology can be used with the online resources we currently have available.

Chatting in the academy: exploring spoken English for academic purposes
Michael McCarthy (Cambridge University Press)
How does spoken academic English typically differ from academic writing in university settings and how might this influence EAP materials? Using illustrations from corpora, this talk will focus on some key differences to be taken into account when planning materials. Practical examples will be drawn from the
new edition of Academic Vocabulary in Use and from Viewpoint (both CUP).

Skylight interview with Gill Francis & Andy Dickinson

Skylight is a relatively new corpus interface designed with teachers and students in mind. Gill Francis, one of the developers, kindly answered some questions. The news about forthcoming suggestions for classroom activities is something to look forward to, as is the collocation feature. It is interesting to note that Gill is very much in favour of the use of keyword-in-context (KWIC) concordance lines. Others, such as the FLAX language learning team, see KWICs as more of a hindrance and propose their own novel interfaces.

Can you share a little of your background?

Andrew Dickinson is a software writer who is interested in the use of corpora in the classroom and Gill Francis (that’s me) is a corpus linguist. In 1991 I joined the pioneering Cobuild project as Senior Grammarian. Cobuild was founded in 1980 by Professor John Sinclair (University of Birmingham). Its aim was to compile and investigate huge collections of written and spoken language in order to produce a range of dictionaries and grammars for learners that reflect how English is actually spoken and written today. My interest and direction in corpus linguistics owes everything to John Sinclair and our colleagues at Cobuild.

The Bank of English corpora grew to about 450 million words by the late 1990s. We used a fast, versatile, and powerful corpus analysis tool called ‘lookup’. As a grammarian, I was responsible for the grammatical information in the second edition of the Collins Cobuild Advanced Learner’s Dictionary (1995), along with Susan Hunston and Elizabeth Manning. The three of us also wrote the Cobuild Grammar Patterns series (1996, 97, and 98). All these publications reflected a detailed study of corpus evidence.

I’ve continued to work and publish in corpus linguistics since leaving Cobuild. (A list of publications is available.) Then a few years ago I got together with Andy to design Skylight, a program with a clear, easy interface for use by teachers and learners. Since then we have presented Skylight at various corpus linguistics conferences and seminars, and are currently developing it for more general release.

You are targeting classroom use by teachers with Skylight so what do you hope to bring that other corpus tools don’t?

1 – A clear, simple interface

Skylight has a clear, visually attractive interface. The query language is simple and intuitive, and can be learned in a couple of minutes. You can make a query by simply typing in a word or phrase without any special spacing or punctuation, for example “in my opinion” or “in the middle of” or “it’s a case of”.

To vary any word in the query, you use a pipe: “in my|his|her opinion”, or “in the middle|midst of”.

If you want to vary the query and see the range of words in a particular phrase or frame, you use one or more asterisks, for example “in my * opinion” will return “in my humble opinion”, “in my honest opinion”, “in my personal opinion” and so on.

This is about as complex as the query language gets – click on the User Manual from any page of Skylight to see examples of each kind of query. The rules are few and easily mastered by teachers and learners.
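The query semantics described above are easy to illustrate in code. The sketch below is my own translation of Skylight-style queries into regular expressions – the function name and the regex approach are assumptions for illustration, not Skylight’s actual implementation:

```python
import re

def skylight_to_regex(query):
    # Hypothetical translation of a Skylight-style query:
    # '|' alternates words, '*' stands for any single word
    tokens = []
    for tok in query.split():
        if tok == "*":
            tokens.append(r"\S+")
        elif "|" in tok:
            tokens.append("(?:" + "|".join(map(re.escape, tok.split("|"))) + ")")
        else:
            tokens.append(re.escape(tok))
    return r"\b" + r"\s+".join(tokens) + r"\b"

pattern = skylight_to_regex("in my|his|her opinion")
print(bool(re.search(pattern, "I said that in her opinion it was fine")))  # → True
```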

2 – Fast, easy alphabetical sorting

If you want to sort concordance lines to the right, or the left, you just click on a button above the lines. This helps you to see at a glance what the right-hand or left-hand collocates of a word or phrase are.
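The right/left sort can be sketched as follows – an illustration of the idea only, assuming concordance lines have already been split into (left context, node, right context) triples:

```python
kwic = [
    ("it is",        "intuitively", "obvious that the"),
    ("we can",       "intuitively", "know the answer"),
    ("results were", "intuitively", "appealing to users"),
]

def sort_kwic(lines, side="right"):
    # Sort by the first word to the right of the node,
    # or by the last word to the left of it
    if side == "right":
        return sorted(lines, key=lambda l: l[2].split()[0].lower())
    return sorted(lines, key=lambda l: l[0].split()[-1].lower())

for left, node, right in sort_kwic(kwic, "right"):
    print(f"{left:>15}  {node}  {right}")
```

Sorting right groups the lines by what follows the node (appealing, know, obvious), which is exactly what makes collocates jump out at a glance.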

3 – Worksheets and classroom activities

If you are a teacher, you can use Skylight to prepare your own worksheets for corpus-based language activities. When you receive the results of a query, you can tailor the lines to fit your teaching point. This means that you can show only the lines you want, or hide those that you don’t, by clicking or entering text. You can copy the result into Word or another application using the Copy to Clipboard button. The results appear as a neat table, properly displayed and ready for your use. See the User Manual for further details and lots of examples.

Ideally, too, teachers and learners would be able to access a corpus at any point during a class, whenever they want to investigate how a word or phrase is used in a range of real language texts and situations.

For initial guidance and ideas, we are also preparing a large number of suggestions for stand-alone classroom activities practising points of grammar, lexis, and phraseology. Some of these activities address language change and the tension between prescription and description in language teaching. We’ll let you know when we release the first batch of these.

4 – A range of corpora

There are several corpora already available on Skylight – choose any one from the drop-down menu. For example, there is a very large general corpus, ukWaC, which contains 1.4 billion words, as well as smaller corpora like the BNC, BASE, and VOICE. Then there are even smaller corpora – for example a corpus of all Shakespeare’s plays and sonnets that is particularly useful for school children studying English literature.

In addition, any corpus can be compiled in response to the needs of groups of users, such as English school children or intermediate level EFL students. This depends, of course, on copyright restrictions. For more information, see the final sections of the User Manual.

Which other corpus tools would you recommend for teachers either in the classroom or outside?

We don’t feel particularly qualified to answer this question. There are a lot of tools that access huge corpora and are extremely useful to linguists and lexicographers, such as Sketch Engine, the COCA (a large corpus of American English) concordancer, and Lancaster’s Corpus Query Processor. If you look up ‘corpus’ and ‘classroom’ together in any search engine, there will be several hits, but we don’t know of anything that combines an easy-to-use interface with really good classroom applications. This doesn’t mean there isn’t anything, of course!

What present and/or future do you see for Google as a corpus in language learning?

One of the drawbacks of compiled corpora, such as ukWaC and the BNC, is that they are a snapshot of how language is used at a particular time (or at successive times, if a corpus is updated on a regular basis). The gathering and cleaning-up of text can take many months, so all corpora – even the most recent – are necessarily out-of-date by the time they appear.

The only way to get today’s language today is to use the web as a corpus (see for example Birmingham City University’s WebCorp). This gives results in the KWIC (Key Word in Context) format, with the word or phrase in the centre. The results are not cleaned up or processed, however, which limits their usefulness in the classroom.

But Google itself won’t give you the output you need for focusing on a word or phrase, sorting it, or looking at collocations. You’ll get plenty of examples, of course, but they won’t be shown in the KWIC format. The KWIC display is probably the most important and exciting development in modern corpus linguistics, and you need it if you are to do real corpus-based language work in the classroom or anywhere else.

Anything else you would like to add?

You asked whether we intend to add information about collocation. We are experimenting with a display modelled on the ‘Picture’ technique used in the lookup software used for the Bank Of English, which shows where collocates appear in relation to the node (the central word or phrase) – whether they tend to occur before or after it, for example.

We call the collocation display ‘Searchlight’. The Searchlight display below shows that the most frequent words immediately after obvious are that, then reasons (plural), then choice, then reason (singular). The most frequent words two to the right are of, for, and is. And so on – the columns are not connected, of course; they simply give positional collocations.

The brilliant thing about ‘picture’ that we want to replicate is that you simply click on any word to go to the relevant concordance lines. So if you click on reasons, you’d get all the lines with the combination obvious reasons. So it gives you a subset of the lines, which can then be sorted and tailored in any way you like.


We will add Searchlight to the Skylight website as soon as possible, though we have not yet decided whether to add statistical information – probably not. In the meantime, I’d just like to say that in my many years of scrolling down concordance lines, I find that alphabetical sorting is a very good guide to the collocations of a word. I happened to search for the word intuitively recently, and it returned 500 lines. If I sort them one to the right and scroll rapidly down, it’s clear that among the most frequent adjectives that follow it are appealing, correct, and obvious, while the verbs are know and understand. If I sort them one to the left, it is clear that one of the most frequent collocates is the verb be in various forms: ‘it is intuitively obvious’ and so on. Sorting one way and the other gives you a quick thumbnail sketch of a word, and is extremely useful.

So go ahead and try Skylight. And above all, click onto the User Manual, which tells you all you need to know and provides lots of examples of searches using different features.

A huge thanks to the Skylight team, and do comment here with your opinions of the interface.

Thanks for reading.

Quick cup of COCA – compound words

A new quick cup of coca post, whayhay. Thanks to Mike Harrison (@harrisonmike) on Twitter who was asking about finding compound adjectives.

Here we can use the wildcard asterisk together with a part-of-speech tag.

So say we were looking for adjectives starting with well-, we could use [well-*].[j*], which gives the following top ten results –

(click on words to see full search results)

To find all compound adjectives we would simply replace the first part of the compound with another wildcard asterisk, like so:


which gives us the following top 10 results:

(click on words to see full search results)

Similarly, if you were looking for noun, adverb or verb compounds, simply use the appropriate POS tag, i.e. [n*], [r*] or [v*] respectively.

Note: do double-check results in the concordance lines, as the POS tagging is sometimes off.
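For those curious, the logic of such a wildcard-plus-tag query can be sketched in Python over a toy list of (word, tag) pairs. The tokens and tags below are invented for illustration, and real COCA/CLAWS tags differ in detail:

```python
import re

# Sketch of the [*-*].[j*] logic: over (word, POS) pairs, keep
# hyphenated words whose tag starts with 'j' (adjective).
tagged = [("well-known", "jj"), ("set-up", "nn"), ("so-called", "jj"),
          ("well", "rr"), ("long-term", "jj")]
compound_adjs = [w for w, pos in tagged
                 if re.fullmatch(r"\w+-\w+", w) and pos.startswith("j")]
print(compound_adjs)
```

Swapping `pos.startswith("j")` for `"n"`, `"r"` or `"v"` mirrors switching the tag to [n*], [r*] or [v*].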

As an interesting aside, a historical search for compound adjectives in COHA gives us a very nice ascending curve. I wonder what the significance of that is?

Compound adjectives over time in COHA (click on graph)

Finally do check out the previous quick cup of coca posts if you want help with searching in COCA.

IATEFL 2015: Recent corpus tools for your students

Jane Templeton’s talk 1 illustrated corpus use by using the wordandphrase tool 2. (Lizzie Pinard has a write-up of the talk 3). I have described using this and other tools on this blog, and there is a nice round-up of corpus tools written by Steve Neufield 4 that looks at just the word, ozdic, word neighbors, netspeak, and stringnet.

This post reports on some more recent tools you may not be aware of (though posted some time ago in the G+ CL community, so do check that if you want the skinny early on :)) – WriteAway, Linggle, SkeLL, NetCollo.

I list them in the order in which I think students will find them easy to use and useful.

1. WriteAway – this tool auto-completes words to help highlight typical structures. For example, it gives two common patterns for Jane’s example of weakness: weakness of something and weakness in something. The first example in pattern one includes the collocation overcomes.

WriteAway screenshot for word weakness

2. Linggle – one could follow up with a search on Linggle, which is basically a souped-up version of just the word and uses a one-trillion-word web-based corpus, as opposed to the much smaller BNC that just the word uses.

It is interesting that overcome weakness is not listed:

Linggle screenshot for verb + weakness (click image to see results)

but a search for overcome followed by a noun shows that this combination occurs in less than 1% of cases in web pages:

Linggle screenshot for overcome + noun (click image to see results)

3. SkeLL from Sketch Engine is neat for its word sketch feature: a look at weakness brings up a nice set of collocations and colligations in one screen:

SkeLL wordsketch for weakness (click image to see results)

4. The NetCollo corpus tool can compare the BNC, a medical corpus and a law corpus, which is useful if you are looking at academic language in medicine and law. For example, using weakness we see that it is much more common in the BNC:

NetCollo result for weakness (click image to see results)

and we can see that the collocation with overcome only appears once in the Medical corpus.

As ever, do try these tools out yourself and then, as Jane says, show (not tell) your students as and when the need arises in class. By the way, do check out the integrative rationale for corpus use by Anna Frankenberg-Garcia 5.

Thanks for reading.


1. IATEFL 2015 video – Bringing corpus research into the language classroom

2. Word and tool

3. IATEFL 2015 Bringing corpus research into the language classroom – Jane Templeton

4. Teacher Development: Five ways to introduce concordances to your students

5. Integrating corpora with everyday language teaching

Fav the PHaVE Pedagogical List for the New Year

Great New Year news for teachers: a new word list of phrasal verbs, the PHaVE List (Garnier & Schmitt, 2014), finds that the top 150 most common phrasal verbs have only 288 meanings in total. That is, on average, about 2 meanings per phrasal verb. Consider that some estimates put the total number of phrasal verbs at nearly 9,000.

You can try out the PHaVE Dictionary yourself.

What you will see are the 150 verbs ranked from 1 to 150 and their most common meanings.

The study used the following criteria to include verbs and their meanings:

Each of the top 150 verbs occurs at least 10 times per million words. For a meaning to be included, it needed to reach 75 percent coverage of the verb’s occurrences in COCA-BYU; if the primary meaning did not reach this, secondary meanings of at least 10 percent were added until either 75 percent was reached or all meanings of at least 10 percent had been used.
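That inclusion rule can be sketched in a few lines of Python (the percentages below are invented, purely to illustrate the rule as I understand it from the paper):

```python
# PHaVE-style meaning selection: take the primary meaning; if coverage
# is under 75%, add secondary meanings of at least 10% until 75% is
# reached or no eligible meanings remain.
def phave_meanings(coverages):  # percentages, most frequent first
    included, total = [], 0.0
    for pct in coverages:
        if total >= 75 or (included and pct < 10):
            break
        included.append(pct)
        total += pct
    return included

print(phave_meanings([60, 20, 8, 5]))  # 60 + 20 reaches the 75% target
print(phave_meanings([80, 10]))        # the primary meaning alone suffices
```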

Thus 6 verbs have 4 meanings, 34 verbs have 3 meanings, 52 verbs have 2 meanings and 58 have 1 meaning.

As the study notes, in the user manual for the list, some of the verbs may well be easier to understand than others i.e. be more semantically transparent. A reminder to users that the list is a general guide and teachers, as ever, need to exercise their judgement.

You can access raw lists.

So do go on and set about exploring the PHaVE pedagogical list for the new year.

A huge thanks to all the readers for your support of the blog these past couple of years, here’s to more and better for 2015.


Garnier, M., & Schmitt, N. (2014). The PHaVE List: A pedagogical list of phrasal verbs and their most frequent meaning senses. Language Teaching Research. Advance online publication. Retrieved January 15 2016

Guy Aston talks speech corpora

I had the pleasure of chatting to Guy Aston as he was staying in Paris on his way back to Italy, where he works at the University of Bologna. Guy has been an active researcher in corpora over the years. Here he recollects one significant event that encouraged him to pursue his interest in corpora and mentions his current area of investigation (best heard using headphones):

Regular readers may know that I have been using the TED Corpus Search Engine a few times recently to get my students to work on phonetic transcriptions. Multimedia corpora offer the possibility of examining the prosodic features of language, and this is what interests Guy about speech corpora.

For example the phrase Thank you was found to have a falling tone most of the time and frequently occurring phrases such as don’t you, last year, I don’t know and I don’t care are expected to have a fast rhythm (examples taken from The prosody of formulaic expressions in the IBM Lancaster Spoken English Corpus by Phoebe Lin).

Guy went on to detail some requirements and challenges involved when setting up speech corpora:

In the next audio Guy gives some examples of a learner using a speech corpus:

The following phone camera video of Guy’s TED speech corpus using Mike Scott’s Wordsmith (version 5 or later) illustrates listening to concordances of matter of fact:

Finally I asked Guy the old favourite about the misplaced early optimism of using corpora in the language classroom:

Guy hinted that a version of his corpus may become available for AntConc (it is currently only compatible with WordSmith), while at the same time hinting that we shouldn’t hold our breath waiting for it :).

Once again thanks to Guy for sharing some of his current work. Check out some of his publications.

Do also check out an interview with another corpus linguist Costas Gabrielatos.

Thanks for reading.

A 2nd tipple of the TED Corpus Search Engine

Here is a short (and no doubt to some obvious) note about working with the TED Corpus Search Engine (TCSE). I was looking at examples of the use of route, which returns 126 results. I want my students to go through these and transcribe the two different ways the word is said. However, 126 lines are too many, so a simple way to reduce them is to use the n-gram feature.

For 2-grams we get (frequencies in brackets):

1. route to (13)
2. the route (12)
3. route of (5)
4. a route (6)
5. this route (4)
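The filtering step can be sketched in Python (toy tokens below, not the actual TED transcripts): count the 2-grams containing the target word, then pick a frequent one as the subset for students to work with:

```python
from collections import Counter

# Count 2-grams containing a target word, so a long concordance can be
# cut down to a manageable subset built around one frequent bigram.
def bigrams_with(tokens, target):
    pairs = zip(tokens, tokens[1:])
    return Counter(" ".join(p) for p in pairs if target in p)

tokens = "the route to success is the route of many a route".split()
print(bigrams_with(tokens, "route").most_common())
```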

So I could use route to with stronger students, who would have 13 examples to listen to, and this route with weaker students, as they would only need to examine 4 examples.

FYI, TCSE can now play the search term immediately, with an option to start 10 seconds earlier.

Thanks for reading.

P.S. Do check out the first tipple of the TCSE if you haven’t yet.