Alphabet Street aka Corpus Symposium at VRTwebcon 8

I was delighted to be able to take part in my first webinar as a presenter. Leo Selivan (@leoselivan) asked me to join the corpus symposium for the 8th VRT web conference along side Jenny Wright (@teflhelper) and Sharon Hartle (@hartle). You can find links to our talks at the end of this post as well as my slides.

Presenting on a webinar is definitely a unique experience like talking to yourself knowing others are watching and listening in. Other things to be noted are making sure your microphone is loud enough and that uploaded powerpoints to online systems like Adobe Connect don’t show your slide notes!

My talk was about using BYU-Wikipedia corpus to help recycle coursebook vocabulary and was titled Darling (BYU) Wiki in homage to the recent passing of the great musician Prince. Another webinar note – people can’t hear the music from your computer if you have headphones on!

As I have already posted about using BYU-Wiki for vocabulary recycling, in this post I want to give some brief notes on designing worksheets using some principles from the research literature. When talking about the slide below I did not really explain in the talk what input enhancement and input flood were. And I also did not point out that my adaptation from Barbieri & Eckhardt (2007) was  very loose : ).


Input  enhancement  draws  learners’  attention  to  targeted grammatical features by visually or acoustically flagging L2 input to  enhance  its  perceptual  saliency but  with  no  guarantee  that  learners will attend to the features” (Kim, 2006: 345).

For written text they include things such as underlining, bolding, italicizing, capitalizing, and colouring. Note that the KWIC output from COCA uses colour to label parts of speech.

Input flood similarly enhances saliency through frequency and draws its basis from studies showing importance of repetition in language learning.

Szudarski & Carter (2015) concluded that a combination of input enhancement and input flood can lead to performance gains in collocational knowledge.

Hopefully this post has briefly highlighted some points I did not cover in my 20 min talk. A huge thanks to those who took the time to attend, to Leo and Heike (Philip, @heikephilp) for organizing things smoothly and my co-presenters Jennie and Sharon. Do browse the recordings of the other talks as there are some very interesting ones to check out.

Talk recording links, slides and related blog posts

Jennie Wright, Making trouble-free tasks with corpora

Sharon Hartle, SkELL as a Key to Unlock Exam Preparation

Mura Nava, Darling (BYU) Wiki

Question and Answer Round

My talk slides (pdf)

Summary Post by Sharon Hartle

8th Virtual Round Table Web Conference 6-8 May 2016 program overview

References and further reading:

Barbieri, F., & Eckhardt, S. E. (2007). Applying corpus-based findings to form-focused instruction: The case of reported speech. Language Teaching Research, 11(3), 319-346

Han, Z.,  Park, E. S., & Combs, C. (2008). Textual enhancement of input: issues and possibilities. Applied Linguistics 29.4: 597–618.

Kim,Y. (2006). Effects of input elaboration on vocabulary acquisition through reading by Korean learners of English as a foreign language. TESOL Quarterly 40.2: 341–373.

Szudarski, P., & Carter, R. (2015). The role of input flood and input enhancement in EFL learners’ acquisition of collocations. International Journal of Applied Linguistics.

Monco and “fresh” make & do collocations

Monco the web news monitor corpus (which means it is continuously updated) has a tremendous collocation feature. I first saw a reference to the collocation feature from a tweet by Diane Nicholls ‏@lexicoloco  but when I tried it the server was acting up. I was reminded to try again by a tweet from Dr. Michael Riccioli ‏@curvedway, whoa it is impressive.
For example let’s see what are the collocates of the famous make and do verbs.

For make here is screenshot of search settings for collocation (to get to collocation function look under tools menu from main Monco page). Note I am looking for nouns that come after the verb make. Also the double asterisk is a short cut to look for all forms of make (try it without the asterisks and see what you get).


I get as results for the top 10 collocates (for all forms of make) the following:

Top 10 collocates-make
click on image for full results

Interesting collocations include make sense, make way, make debut. The results can show you at a glance the types of constructions involved:


Or you can open another window for more details:


The top 10 collocates for do are:

Top 10 collocates-do
click on image for full results

Interesting collocates here are do thing, do anything, do something, do nothing makes a change from do shopping, cooking etc : )

Thanks for reading.

Using BYU-Wikipedia corpus to answer genre related questions

A link was posted recently on Twitter to an IELTS site looking at writing processes and describing graphs.
The following caught my eye:

…natural processes are often described using the active voice, whereas man-made or manufacturing processes are usually described using the passive.

The claim seems to go back to 2011 online (

This is an interesting claim. It has been shown that passives are more common in abstract, technical and formal writing (Biber, 1988 as cited by McEnery & Xiao, 2005). Here the claim is about specific written texts on natural processes and man-made processes.

Well we can simplify this by asking are there more passives used when writing about man-made processes than when writing about natural processes? Since if you use passive clauses then you don’t use active clauses and we can come to a conclusion by deduction.

BYU-Wikipedia corpus can be used to get approximations of natural process writing and man-made process writing. The keywords I used (for the title word) were ecology and manufacturing. Filtering out unwanted texts took longer than expected especially for the manufacturing corpus. In the end I had an ecology corpus of 77 articles and  153,621 words and a manufacturing corpus of 116 articles and 98,195 words.

The search term I used to look for passives was are|were [v?n*]. This gave me a total of 293 passives for ecology and 304 passives for manufacturing. According to the Lancaster LL calculator this showed a significant overuse of passives in manufacturing compared to ecology. According to the log ratio score this is about 2 times as common (if I understand this statistic correctly). Now this does not mean much as a lot of the texts in the wikipedia corpora won’t be specifically about processes but still it is interesting.

What is more interesting are the types of verbs used in passives in ecology and manufacturing. The top ten in each case:
























Thanks for reading.


Biber, D. (1988) Variation Across Speech and Writing(Cambridge: Cambridge University Press).

McEnery, A. M. and Xiao, R. Z. (2005) Passive constructions in English and Chinese: A corpus-based contrastive study . Proceedings from the Corpus Linguistics Conference Series, 1 (1). ISSN 1747-9398 Retrieved from

Using BYU Wiki corpus to recycle coursebook vocabulary in a variety of contexts

Recycling vocabulary in a variety of contexts is recommended by the vocabulary literature. Simply going back to texts one has used in a coursebook is an option but it misses the variety of context.

I need to recycle vocabulary from Unit 1 of my TOEIC book, so I take the topics from the table of contents as input to create a wiki corpus.

The main title of Unit 1 in my book is careers, with sub topics of professions, recruitment, training. I could also add in job interview, job fair, temp agency.

Note for more details on various features of the BYU WIKI corpus do see the videos by Mark Davies, for the rest of this post I assume you have some familiarity with these.

So when creating a corpus in BYU WIKI corpus in my Title word(s) search I enter career* to find all titles with career and careers.

Then in the Words in pages box I enter professions, profession, recruitment, training. Note search for plural and 300 as number of pages:

Screenshot 1: corpus search terms

After pressing submit a screen of a list of wiki pages is presented, you can scroll through this to find pages that may be irrelevant to you:

Screenshot 2: wiki pages

After unticking any irrelevant pages press submit. I won’t talk a lot about filtering your corpus build here. As mentioned do make sure to watch Mark Davies series of videos to get more details.

Now you will see your newly created corpus:

Screenshot 3: my virtual corpora

Tick the Specific radio button:

Screenshot 4: specific key word radio button

and then click the nouns keywords. Skill is the top keyword here which also appears in the wordlist in my book:

Screenshot 5: noun keywords

What I am more interested in is verbs so I click that:

Screenshot 6: verb keywords

The noun requirement, which by the way does not come from the careers unit, appears in the book wordlist but not the verb. So now I can look at some example uses of the verb require that I could use in class.

One step is to see what collocates with require:

Screenshot 7: collocates of require

Clicking on the top 5 collocates brings up some potential language.

Another interesting use is once you have a number of corpora you can see what word appear most in each corpora. The following screenshots show corpora related to the first 3 units of my book i.e. Careers, Workplaces, Communications:

Screenshot 8: my virtual corpora

The greyed lines mean those corpora are omitted from my search. This could be a nice exercise where you take some word and get students to see how they are distributed. So for example you may show the distribution of the verb fill:

Screenshot 9: distribution of verb fill

We see that it appears most in the recruit* corpus. One option now is to get students to predict how the verb is used in that corpus and then click the bar to see some examples.

After this demonstration you can now ask students to guess what words will appear most in the various corpora and do the search for the students to see the resulting graphs.

Hope this has shown how we can use BYU WIKI corpus to recycle vocabulary in different contexts.

Do shoot me any questions as this post may indeed be confusing.

Quick cup of COCA – compare

How are the words rotate and revolve used?

Have a think, then read this description.

Now using COCA-BYU’s compare feature let’s see if the description is supported and whether we can add anything to the picture painted by the Grammarist.

The screen recording above shows:
1. clicking on the compare radio button
2. entering the word rotate in the first box and the word revolve in the second box
3. looking through the results and seeing a rough pattern (of rotate being used with concrete (body) words and revolve with more abstract words)
4. entering noun part of speech (POS) into the collocates box (from the POS drop down menu)
5. now the results are much clearer

The description from the Grammarist article that revolve is used with astronomical objects such as planets is supported. We also see the stronger collocations of more abstract nouns used with revolve such as questions and lives. COCA-BYU also supports the Grammarist description that the primary meaning of rotate to mean spin on axis seems to be most frequent. And the Grammarist description could be added to by saying that rotate is used a lot  with body parts.

Do sip the other cups of COCA if you have not already.

Thanks for reading.

Corpus Linguistics for Grammar – Christian Jones & Daniel Waller interview

CLgrammarFollowing on from James Thomas’s Discovering English with SketchEngine and Ivor Timmis’s Corpus Linguistics for ELT: Research & Practice I am delighted to add an interview with Christan Jones and Daniel Waller authors of Corpus Linguistics for Grammar: A guide for research.

An added bonus are the open access articles listed at the end of the interview. I am very grateful to Christian () and Daniel for taking time to answer my questions.

1. Can you relate some of your background(s)?

We’ve both been involved in ELT for over twenty years and we both worked as teachers and trainers abroad for around a decade; Chris in Japan, Thailand and the UK and Daniel in Turkey. We are now both senior lecturers at the University of Central Lancashire (UCLan, Preston, UK),  where we’ve been involved in a number of programmes including MA and BA TESOL as well as EAP courses.

We both supervise research students and undertake research. Chris’s research is in the areas of spoken language, corpus-informed language teaching and lexis while Daniel focuses on written language, language testing (and the use of corpora in this area) and discourse. We’ve published a number of research papers in these areas and have listed some of these below. We’ve indicated which ones are open-access.

2. The focus in your book is on grammar could you give us a quick (or not so quick) description of how you define grammar in your book?

We could start by saying what grammar isn’t. It isn’t a set of prescriptive rules or the opinion of a self-appointed expert, which is what the popular press tend to bang on about when they consider grammar! Such approaches are inadequate in the definition of grammar and are frequently contradictory and unhelpful (we discuss some of these shortcomings in the book).  Grammar is defined in our book as being (a) descriptive rather than prescriptive (b) the analysis of form and function (c) linked at different levels (d) different in spoken and written contexts (e) a system which operates in contexts to make meaning (f) difficult to separate from vocabulary (g) open to choice.

The use of corpora has revolutionised the ways in which we are now able to explore language and grammar and provides opportunities to explore different modes of text (spoken or written) and different types of text. Any description of grammar must take these into account and part of what we wanted to do was to give readers the tools to carry out their own research into language. When someone is looking at a corpus of a particular type of text, they need to keep in mind the communicative purpose of the text and how the grammar is used to achieve this.

For example, a written text might have a number of complex sentences containing both main and subordinate clauses. It may do so in order to develop an argument but it can also be more complex because the expectation is that a reader has time to process the text, even though it is dense, unlike in spoken language. If we look at a corpus we can discover if there is a general tendency to use a particular pattern such as complex sentences across a number of texts and how it functions within these texts.

3. What corpora do you use in the book?

We have only used open-access corpora in the book including BYU-BNC, COCA, GloWbe, the Hong Kong Corpus of Spoken English. The reason for using open-access corpora was to enable readers to carry out their own examinations of grammar. We really want the book to be a tool for research.

4. Do you have any opinions on the public availability of corpora and whether wider access is something to push for?

Short answer: yes. Longer answer: We would say it’s essential for the development of good language teaching courses, materials and assessments as well as democratising the area of language research. To be fair to many of the big corpora, some like the BNC have allowed limited access for a long time.

5. The book is aimed at research so what can Language Teachers get out of it?

By using the book teachers can undertake small-scale investigations into a piece of language they are about to teach even if it is as simple as finding out which of two forms is the more frequent. We’ve all had situations in our teaching where we’ve come across a particular piece of language and wondered if a form is as frequent as it is made to appear in a text-book, or had a student come up and say ‘can I say X in this text’ and struggled with the answer. Corpora can help us with such questions. We hope the book might make teachers think again about what grammar is and what it is for.

For example, when we consider three forms of marry (marry, marries and married) we find that married is the most common form in both the BYU-BNC newspaper corpus and the COCA spoken corpus. But in the written corpus, the most common pattern is in non-defining relative clauses (Mark, who is married with two children, has been working for two years…). In the spoken corpus, the most common pattern is going to get married e.g. When are they going to get married?

We think that this shows that separating vocabulary and grammar is not always helpful because if a word is presented without its common grammatical patterns then students are left trying to fit the word into a structure and in fact words are patterned in particular ways. In the case of teachers, there is no reason why an initially small piece of research couldn’t become larger and ultimately a publication, so we hope the book will inspire teachers to become interested in investigating language.

6. Anything else you would like to add?

One of the things that got us interested in writing the book was the need for a book pitched at undergraduate students in their final year of their programme and those starting an MA, CELTA or DELTA programme who may not have had much exposure to corpus linguistics previously. We wanted to provide tools and examples to help these readers carry out their own investigations.

Sample Publications

Jones, C., & Waller, D. (2015). Corpus Linguistics for Grammar: A guide for Research. London: Routledge.

Jones, C. (2015).  In defence of teaching and acquiring formulaic sequences. ELT Journal, 69 (3), pp 319-322.

Golebiewksa, P., & Jones, C. (2014). The Teaching and Learning of Lexical Chunks: A Comparison of Observe Hypothesise Experiment and Presentation Practice Production. Journal of Linguistics and Language Teaching, 5 (1), pp.99–115. OPEN ACCESS

Jones, C., & Carter, R. (2014). Teaching spoken discourse markers explicitly: A comparison of III and PPP. International Journal of English Studies, 14 (1), pp.37–54. OPEN ACCESS

Jones, C., & Halenko, N.(2014). What makes a successful spoken request? Using corpus tools to analyse learner language in a UK EAP context. Journal of Applied Language Studies, 8(2), pp. 23–41. OPEN ACCESS

Jones, C., & Horak, T. (2014). Leave it out! The use of soap operas as models of spoken discourse in the ELT classroom. The Journal of Language Teaching and Learning, 4(1), pp.1–14. OPEN ACCESS

Jones, C, Waller, D., & Golebiewska, P. (2013). Defining successful spoken language at B2 Level: Findings from a corpus of learner test data. European Journal of Applied Linguistics and TEFL, 2(2), pp.29–45.

Waller, D., & Jones, C. (2012). Equipping TESOL trainees to teach through discourse. UCLan Journal of Pedagogic Research, 3, pp. 5–11. OPEN ACCESS

Discovering English with SketchEngine – James Thomas interview

2015 seems to be turning into a good year for corpus linguistics books on teaching and learning, you may have read about Ivor Timmis’s Corpus Linguistics for ELT: Research & Practice. There is also a book by Christian Jones and Daniel Waller called Corpus Linguistics for Grammar: A guide for research.

This post is an interview with James Thomas,, on Discovering English with SketchEngine.

1. Can you tell us a bit about you background?

2. Who is your audience for the book?

3. Can your book be used without Sketch Engine?

4. How do you envision people using your book?

5. Do you recommend any other similar books?

6. Anything else you would like to add?

1. Can you tell us a bit about your background?^

Currently I’m head of teacher training in the Department of English and American Studies, Faculty of Arts, Masaryk University, Czech Republic. In addition to standard teacher training courses, I am active in e-learning, corpus work and ICT for ELT. In 2010 my co-author and I were awarded the ELTon for innovation in ELT publishing for our book, Global Issues in ELT. I am secretary of the Corpora SIG of EUROCALL, and a committee member of the biennial conference, TALC (Teaching and Language Corpora).

My work investigates the potential for applying language acquisition and contemporary linguistic findings to the pedagogical use of corpora, and training future teachers to include corpus findings in their lesson preparation and directly with students.

In 1990, I moved to the Czech Republic for a one year contract with ILC/IH and have been here ever since. Up until that time, I had worked as a pianist and music teacher, and had two music theory books published in the early 1990s. Their titles also beginning with “Discovering”! 🙂

2. Who is your audience for the book?^

The book uses the acronym DESKE. Quite a broad catchment area:

  • Teachers of English as a foreign language.
  • Teacher trainees – the digital natives – whether they are doing degree courses or CELTA TESOL Trinity courses.
  • People doing any guise of applied linguistics that involve corpora.
  • Translators, especially those translating into their foreign language. (Only yesterday I presented the book at LEXICOM in Telč.)
  • Students and aficionados of linguistics.
  • Test writers.
  • Advanced students of English who want to become independent learners.

3. Can your book be used without Sketch Engine?^

No. (the answer to the next question explains why not).

Like any book it can be read cover to cover, or aspects of language and linguistics can be found via the indices: (1) Index of names and notions, (2) Lexical focus index.

4. How do you envision people using your book?^

It is pretty essential that the reader has Sketch Engine open most of the time. Apart from some discussions of features of linguistic and English, the book primarily consists of 342 language questions/tasks which are followed by instructions – how to derive the data from the corpus recommended for the specific task, and then how to use Sketch Engine tools to process the data, so that the answer is clear.

Example questions:
About words
Can you say handsome woman in English?
Do marriages break up or down?
How is friend used as a verb?
Which two syllable adjectives form their comparatives with more?
Do men say sorry more than women?

About collocation
I’ve come across boldly go a few times and wonder if it is more than a collocation.
It would be reasonable to expect the words that follow the adverb positively
to be positive, would it not?
Is there anything systematic about the uses of little and small?
What are some adjectives suitable for giving feedback to students?

About phrases and chunks
Does at all reinforce both positive and negative things?
What are those phrase with lastleast; believeears; leadhorse?
How do the structures of to photograph differ from take a photo(graph),
guess with make a guess, smile with give a smile?
Which –ing forms follow verbs like like?

About grammar
How do sentences start with Given?
Who or whom?
Which adverbs are used with the present perfect continuous?
Do the subject and verb typically change places in indirect questions?
How new and how frequent is the question tag, innit?

About text
Are both though and although used to start sentences? Equally?
How much information typically appears in brackets?
Does English permit numbers at the beginning of sentences?
Is it really true that academic prose prefers the passive?
In Pride and Prejudice, are the Darcies ever referred to with their first names?

There is an accompanying website with a glossary – a work eternally in progress, and a page with all the links which appear in the footnotes (142 of them), and another page with the list of questions, which a user might copy and paste into their own document so that they can make notes under them.

5. Do you recommend any other similar books?^

The 223 page book has three interwoven training goals, the upper level being SKE’s interface and tools, the second being a mix of language and linguistics, while the third is training in deriving answers to pre-set questions from data.

AFAIK, there is nothing like this.

6. Anything else you would like to add?^

In all the conference presentations and papers and articles that I have seen and heard over the years in connection with using corpora in ELT, with very few exceptions teachers and researchers focus on a very narrow range of language questions. When my own teacher trainees use corpora to discover features of English in the ways of DESKE, they realise that the steep learning curve is worth it. They are being equipped with a skill for life. It is a professional’s tool.

Sketch Engine consists of both data and software. Both are being constantly updated, which argues well for print-on-demand. It’ll be much easier to bring out updated versions of DESKE than through standard commercial publishers. I’m also expecting feedback from readers, which can also be incorporated into new editions.

My interests in self-publishing are partly related to my interest in ICT. This book is printed through the print-on-demand service, One of the beauties of such a mode of publishing is the relative ease with which the book can be updated as the incremental changes in the software go online. This is in sharp contrast to the economies of scale that dictate large print runs to commercial publishers and the standard five-year interval between editions.

There is a new free student-friendly interface which has its own corpus and interface, known as SKELL which has been available for less than a year. It is also undergoing development at the moment, and I will be preparing a book of worksheets for learners and their teachers (or the other way round). I see it as a 21st cent. replacement of the much missed “COBUILD Corpus Sampler”.

Lastly, I must express my gratitude to Adam Kilgarriff, who owned Sketch Engine until his death from cancer on May 16th, at the age of 55. He was a brilliant linguist, teacher and presenter. He bought 250 copies of my book over a year before it was finished, which freed me up from other obligations – a typical gesture of a wonderful man, greatly missed.

Many thanks to James for taking the time to be interviewed but pity my poor wallet with some very neat CL books to purchase this year. James also mentioned that, for a second edition file, Chapter 1 will be re-written to be able to use the open corpora in SketchEngine.