Corpus Linguistics for Grammar – Christian Jones & Daniel Waller interview

Following on from James Thomas’s Discovering English with SketchEngine and Ivor Timmis’s Corpus Linguistics for ELT: Research & Practice, I am delighted to add an interview with Christian Jones and Daniel Waller, authors of Corpus Linguistics for Grammar: A guide for research.

An added bonus is the set of open access articles listed at the end of the interview. I am very grateful to Christian and Daniel for taking the time to answer my questions.

1. Can you relate some of your background(s)?

We’ve both been involved in ELT for over twenty years and we both worked as teachers and trainers abroad for around a decade; Chris in Japan, Thailand and the UK and Daniel in Turkey. We are now both senior lecturers at the University of Central Lancashire (UCLan, Preston, UK),  where we’ve been involved in a number of programmes including MA and BA TESOL as well as EAP courses.

We both supervise research students and undertake research. Chris’s research is in the areas of spoken language, corpus-informed language teaching and lexis while Daniel focuses on written language, language testing (and the use of corpora in this area) and discourse. We’ve published a number of research papers in these areas and have listed some of these below. We’ve indicated which ones are open-access.

2. The focus in your book is on grammar. Could you give us a quick (or not so quick) description of how you define grammar in your book?

We could start by saying what grammar isn’t. It isn’t a set of prescriptive rules or the opinion of a self-appointed expert, which is what the popular press tend to bang on about when they consider grammar! Such approaches are inadequate as definitions of grammar and are frequently contradictory and unhelpful (we discuss some of these shortcomings in the book). Grammar is defined in our book as being (a) descriptive rather than prescriptive, (b) the analysis of form and function, (c) linked at different levels, (d) different in spoken and written contexts, (e) a system which operates in contexts to make meaning, (f) difficult to separate from vocabulary and (g) open to choice.

The use of corpora has revolutionised the ways in which we are now able to explore language and grammar and provides opportunities to explore different modes of text (spoken or written) and different types of text. Any description of grammar must take these into account and part of what we wanted to do was to give readers the tools to carry out their own research into language. When someone is looking at a corpus of a particular type of text, they need to keep in mind the communicative purpose of the text and how the grammar is used to achieve this.

For example, a written text might have a number of complex sentences containing both main and subordinate clauses. It may do so in order to develop an argument but it can also be more complex because the expectation is that a reader has time to process the text, even though it is dense, unlike in spoken language. If we look at a corpus we can discover if there is a general tendency to use a particular pattern such as complex sentences across a number of texts and how it functions within these texts.

3. What corpora do you use in the book?

We have only used open-access corpora in the book, including BYU-BNC, COCA, GloWbE and the Hong Kong Corpus of Spoken English. The reason for using open-access corpora was to enable readers to carry out their own examinations of grammar. We really want the book to be a tool for research.

4. Do you have any opinions on the public availability of corpora and whether wider access is something to push for?

Short answer: yes. Longer answer: We would say it’s essential for the development of good language teaching courses, materials and assessments as well as democratising the area of language research. To be fair to many of the big corpora, some like the BNC have allowed limited access for a long time.

5. The book is aimed at research so what can Language Teachers get out of it?

By using the book teachers can undertake small-scale investigations into a piece of language they are about to teach even if it is as simple as finding out which of two forms is the more frequent. We’ve all had situations in our teaching where we’ve come across a particular piece of language and wondered if a form is as frequent as it is made to appear in a text-book, or had a student come up and say ‘can I say X in this text’ and struggled with the answer. Corpora can help us with such questions. We hope the book might make teachers think again about what grammar is and what it is for.

For example, when we consider three forms of marry (marry, marries and married) we find that married is the most common form in both the BYU-BNC newspaper corpus and the COCA spoken corpus. But in the written corpus, the most common pattern is in non-defining relative clauses (Mark, who is married with two children, has been working for two years…). In the spoken corpus, the most common pattern is going to get married e.g. When are they going to get married?
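A comparison of forms like this can be tried on any text you have to hand. The snippet below is only a minimal sketch: the sample sentences are invented for illustration (they are not drawn from BYU-BNC or COCA), and the helper name count_forms is hypothetical.

```python
import re
from collections import Counter


def count_forms(text, forms):
    """Count whole-word, case-insensitive occurrences of each form in text."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    return {form: counts[form] for form in forms}


# Invented sample sentences, NOT real corpus data.
sample = (
    "Mark, who is married with two children, has worked here for two years. "
    "When are they going to get married? "
    "She says she will marry him next year."
)

form_counts = count_forms(sample, ["marry", "marries", "married"])

# Print the forms from most to least frequent.
for form, n in sorted(form_counts.items(), key=lambda item: -item[1]):
    print(f"{form}: {n}")
```

Raw counts like these only say which form is more frequent; the patterns around each form still need to be read in the concordance lines.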

We think that this shows that separating vocabulary and grammar is not always helpful because if a word is presented without its common grammatical patterns then students are left trying to fit the word into a structure and in fact words are patterned in particular ways. In the case of teachers, there is no reason why an initially small piece of research couldn’t become larger and ultimately a publication, so we hope the book will inspire teachers to become interested in investigating language.

6. Anything else you would like to add?

One of the things that got us interested in writing the book was the need for a book pitched at undergraduate students in their final year of their programme and those starting an MA, CELTA or DELTA programme who may not have had much exposure to corpus linguistics previously. We wanted to provide tools and examples to help these readers carry out their own investigations.

Sample Publications

Jones, C., & Waller, D. (2015). Corpus Linguistics for Grammar: A guide for Research. London: Routledge.

Jones, C. (2015). In defence of teaching and acquiring formulaic sequences. ELT Journal, 69(3), pp.319–322.

Golebiewska, P., & Jones, C. (2014). The Teaching and Learning of Lexical Chunks: A Comparison of Observe Hypothesise Experiment and Presentation Practice Production. Journal of Linguistics and Language Teaching, 5(1), pp.99–115. OPEN ACCESS

Jones, C., & Carter, R. (2014). Teaching spoken discourse markers explicitly: A comparison of III and PPP. International Journal of English Studies, 14 (1), pp.37–54. OPEN ACCESS

Jones, C., & Halenko, N. (2014). What makes a successful spoken request? Using corpus tools to analyse learner language in a UK EAP context. Journal of Applied Language Studies, 8(2), pp.23–41. OPEN ACCESS

Jones, C., & Horak, T. (2014). Leave it out! The use of soap operas as models of spoken discourse in the ELT classroom. The Journal of Language Teaching and Learning, 4(1), pp.1–14. OPEN ACCESS

Jones, C., Waller, D., & Golebiewska, P. (2013). Defining successful spoken language at B2 Level: Findings from a corpus of learner test data. European Journal of Applied Linguistics and TEFL, 2(2), pp.29–45.

Waller, D., & Jones, C. (2012). Equipping TESOL trainees to teach through discourse. UCLan Journal of Pedagogic Research, 3, pp. 5–11. OPEN ACCESS

IATEFL 2015: Recent corpus tools for your students

Jane Templeton’s talk [1] illustrated corpus use with the wordandphrase tool [2] (Lizzie Pinard has a write-up of the talk [3]). I have described using this and other tools on this blog, and there is a nice round-up of corpus tools written by Steve Neufield [4] that looks at just the word, ozdic, word neighbors, netspeak, and stringnet.

This post reports on some more recent tools you may not be aware of: WriteAway, Linggle, SkeLL and NetCollo. (They were posted some time ago in the G+ CL community, so do check that if you want the skinny early on :)

I list them in order of how easy to use and useful I think students will find them.

1. WriteAway – this tool auto-completes words to help highlight typical structures. For Jane’s example of weakness it gives two common patterns: weakness of something and weakness in something. The first example in pattern one includes the collocation overcomes.

WriteAway screenshot for word weakness

2. Linggle – one could follow up with a search on Linggle, which is basically a souped-up version of just the word and uses a 1 trillion word Web-based corpus as opposed to the much smaller BNC that just the word uses.

It is interesting that overcome weakness is not listed:

Linggle screenshot for verb + weakness

but a search for overcome followed by a noun shows that it occurs less than 1% in web pages:

Linggle screenshot for overcome + noun

3. SkeLL from Sketch Engine is neat for its word-sketch feature so a look at weakness brings up a nice set of collocations and colligations in one screen:

SkeLL wordsketch for weakness

4. The NetCollo corpus tool can compare the BNC, a medical corpus and a law corpus, which is useful if you are looking at academic language in medicine and law. For example, with weakness we see that it is much more common in the BNC:

NetCollo result for weakness

and we can see that the collocation with overcome only appears once in the Medical corpus.

As ever, do try these tools out yourself and then, as Jane says, show rather than tell your students as and when the need arises in class. By the way, do check out the integrative rationale for corpus use by Anna Frankenberg-Garcia [5].

Thanks for reading.

References:

1. IATEFL 2015 video – Bringing corpus research into the language classroom

2. Word and phrase.info tool

3. IATEFL 2015 Bringing corpus research into the language classroom – Jane Templeton

4. Teacher Development: Five ways to introduce concordances to your students

5. Integrating corpora with everyday language teaching

A tipple of the TED Corpus Search Engine

Maybe a series of short posts if things pan out 🙂

Two of my classes are learning the phonetic alphabet. They have already been introduced to it, have had a couple of exercises on it, and have had a go at playing with the Cambridge English phonetics focus set of games and activities.

In a bid to keep a low level of revision going, the TED Corpus Search Engine (TCSE) could be useful. Taking the example of neither (borrowed from a Guy Aston workshop on spoken corpora at Lancaster TaLC 11 this summer), I intend to ask them how they think it is spelt phonetically.

Then I will ask them to search for the word in the TCSE and to look at entry 555 (Michelle Obama) and then entry 768 (David Cameron), and get them to see if they can transcribe the phonetic differences (/ˈniːðər/ and /ˈnaɪðə/ respectively).

Update 1:

I used the above in my classes recently and it went very well. It was integrated with another worksheet they were already doing on pronunciation and phonetics. I introduced it with Google images of Michelle Obama and David Cameron.

The following are some more words I may try in future classes:

Garage
880 Rory Sutherland: Sweat the small stuff UK

1931 Christopher Ryan: Are we designed to be sexual omnivores? US

1911 Yves Morieux: As work gets more complex, 6 rules to simplify Fr

Glacier
561 Yann Arthus-Bertrand: A wide-angle view of fragile Earth Fr

1768 Didier Sornette: How we can predict the next financial crisis Fr

535 Al Gore: What comes after An Inconvenient Truth? US

Zebra
1699 Richard Turere: My invention that made peace with lions Kenyan

735 Kiran Sethi: Kids, take charge Ind

1701 Colin Camerer: Neuroscience, game theory, monkeys US

1103 Paul Root Wolpe: It’s time to question bio-engineering  US

Nuclear
2069 Andrew Connolly: What’s the next window into our universe? UK

2067 Martin Rees: Can we prevent the end of the world? UK

2035 Chris Domas: The 1s and 0s behind cyber warfare US

1979 Michel Laberge: How synchronized hammer strikes could generate nuclear fusion Fr

Update 2:

The TCSE puts in a delay of 10 seconds when playing the YouTube video. To get YouTube to play your search term immediately you need to add in 10s; have a read here by the developer on how to do this.

Update 3:

TCSE plays your search term immediately now with an option to play 10 seconds earlier.

You should have the body!

Mark Hancock has a nice write-up of a talk given by Mike McCarthy on spoken English. The write-up concludes with an interesting metaphor of a corpus being a corpse, language that is no longer alive. It asks whether only using corpus examples is the best way of trying to improve a learner’s use of English.

Very few corpus folks would suggest only using corpus examples, and furthermore a lot of corpus work goes beyond the purely quantitative to also consider the teaching implications.

For example there is a great paper Listening for needles in haystacks: how lecturers introduce key terms by Ron Martinez, Svenja Adolphs, and Ronald Carter on the spoken language of academic lecturers.

They extracted lexical bundles from a spoken corpus of 1.7 million words and then went through those manually to keep only the pedagogically interesting ones, e.g. in other words (kept) vs er this is a (discarded).
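To give a feel for the two steps (automatic extraction, then manual filtering), here is a rough sketch. It treats a lexical bundle simply as a word sequence that recurs at least a minimum number of times, which is an assumption on my part and not necessarily the procedure used in the paper. The transcript is invented, and the function name and the "kept" set are hypothetical.

```python
from collections import Counter


def lexical_bundles(tokens, n=3, min_freq=2):
    """Return (bundle, count) pairs for n-word sequences seen at least min_freq times."""
    ngrams = zip(*(tokens[i:] for i in range(n)))
    counts = Counter(" ".join(gram) for gram in ngrams)
    return [(bundle, c) for bundle, c in counts.most_common() if c >= min_freq]


# Invented lecture-like transcript, NOT data from the paper.
transcript = (
    "in other words the model fails er this is a problem "
    "in other words we need a new model er this is a start"
)

bundles = lexical_bundles(transcript.split(), n=3, min_freq=2)

# The manual step: a teacher decides which bundles are worth teaching.
pedagogically_useful = {"in other words"}  # hypothetical human judgement
kept = [bundle for bundle, _ in bundles if bundle in pedagogically_useful]

print(bundles)
print(kept)
```

The counting is the easy part; the second step stays a human decision, which is rather the point of the paper.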

Manual review of the list also showed them a hitherto under-emphasized aspect of spoken lectures – the introduction and definition of new terms.

Their analysis split these up into the more transparent but less frequent cues such as call and mean, e.g. …what theorists call.., …what do we mean by… and the less transparent but more frequent cues like basically and essentially e.g. …which are basically…, …so it’s essentially

Further, they showed how complex the delivery of many of the definitions or concepts was: there was a lot of rephrasing, sometimes using the word or but many times using no signposting language, and key definitions usually came at the end of a series of connected points (back-loading).

In addition, they found that lecturers often did not explicitly refer to their PowerPoint slides, which could make it difficult for students to pick out the key terms.

A corpus may be like a corpse but, as on the crime show CSI, there is an awful lot that dead bodies can reveal.

Habeas corpus, you should have the body! 🙂

Corpus linguistics community news

This will be an occasional thing on postings I’ve been making to the Corpus Linguistics community on Page&Brin’s site which people visiting this blog for such info may find useful (unfortunately you do need to have a chocolate factory account).

Using regex and the command line

Working with concordances

Example corpus exercises

Making a graded reader corpus

Further corpus related postings will probably end up at the G+ site so do check (and join) that if you are interested in that sort of thing. 🙂

You can read CL community news 2 if you have not already.

TESOL France 2013 – slides, handout, notes

Hmm, going to be a dull week to face after attending the 32nd annual international TESOL France shindig. Loads of great talks (e.g. see reports on materials writing in ELT by Jonathan Sayers, @jo_sayers; experimental practice in ELT by Lexical Leo, @leoselivan; Tailoring ESP courses by Kirstin Lahaye, @kirstinlahaye), amiable company and Celtic dancing.

Seems to be a tradition to post up slides of the talks so I will duly comply. A huge thanks to the folks who came to spend some time with me, and who asked some great questions. A huge thanks to the TESOL committee, volunteers and all attendees, everything seemed to run like clockwork.

Using concordance software to inform classroom practice:

Slides: tesolfr2013-classrmconc

Handout: tesolfr2013-classrmconc-handout

NB1: for a video walkthrough of the getting your own texts part of the talk please have a look at this:

NB2: my reference to a diy corpus as a Frankencorpus, building your own body, comes from a Rob Troyer IALLT 2013 presentation. The Frankenstein icon was made by http://www.visualpharm.com/.

And as Scott Thornbury said in his plenary – the body (corpus) remembers! So why not make use of such memories in your classroom?

Just the Word – alternatives function or how to introduce concordances to your students.

This post may encourage those who have yet to try out concordances in class. Additionally, if you teach the TOEIC using the Cambridge Target Score book (Talcott & Tullis, 2007) you may find this post of interest. It takes advantage of the alternatives function in Just the Word, which replaces each word entered with a similar word and shows their connection strength.

In the last unit (unit 12) of the Cambridge Target Score book, on page 118, there is a collocations exercise focusing on adjective + noun and adverb + adjective patterns. A way to extend this exercise is to use the Just the Word alternatives function.

This works best with the adjective + noun patterns. The first such pattern given in the book is valuable lessons.

Entering valuable lessons then pressing  the alternatives button we get this screen:

valuable_lessons
There are three options when replacing the adjective in valuable lessons:
valuable lesson (36)
important lesson (61)
salutary lesson (23)

Ask students to rank order the above in terms of their frequency.

The blue bars under each alternative show how similar the replacement word is to the original.

An extract of the text in the exercise which illustrates the use of this collocation is shown below:

…as he gives valuable lessons in living and a fresh, first-hand view of American society…

(Talcott & Tullis, 2007, p.118)

Ask students what they notice about this use; elicit the verb give and the preposition in. Note that when working with the text from the exercise for the first time, I usually try to get them to see any interesting chunks, so in this case give lesson in; give first-hand view of.

Give students the concordance lines of valuable lessons (click on valuable lessons which is hyperlinked to the concordance lines) and ask them to note down any patterns, elicit the most common verb learn and the article a:

valuable_lessons-concordances

You can do something similar with the other patterns given in the book exercise or give it as a task for students to do for the following class.

Thanks for reading.

References:

Talcott, C. & Tullis, G. (2007). Target Score: A communicative course for TOEIC Test preparation. (2nd ed.). Cambridge: Cambridge University Press.

Dictionnaire Cobra – A striking corpus based tool

My twitter feed was particularly fecund recently when a nourishing morsel of a tweet by Antoine Besnehard/@Languages2_0 alerted me to the Dictionnaire Cobra, a corpus-based dictionary for English/Dutch to French. I am not yet sure what parallel/comparative corpora it is based on but it has great potential.

This post reports on an initial trial with my TOEIC class and a lesson idea based on a reverse translation.

I trialled it with my TOEIC students as only 5 showed up, the others a no-show possibly because of the torrential downpour in Paris tonight. I divided the students into 2 groups and assigned them one word each to explain to the other group – bid, assembly line. I told them that after they spend a few minutes thinking of an explanation they should use the computer to look up the word in the Dictionnaire Cobra and make notes on what they find.

After the 2 groups presented their words, I asked them about their use of the online dictionary. Both groups seemed to find it very useful.

The lesson idea is based on a variation of a reverse translation as described by Phillip Kerr.

Use the Cobra dictionary to collect sample sentences of the words you are interested in. Use these examples in a gap fill exercise. So using the example of the word bid:

  1. Higher bids are likely.
  2. Canadians are having to guess when the call will go out for bids to replace our search and rescue helicopters.
  3. Late bidding is a strategic response to the presence of bidders placing multiple bids.
  4. young unemployed are encouraged to bid for work in other countries.
  5. Because his contract was up, he had to bid on a new contract.
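If you collect a lot of sample sentences, the gapping itself can be automated. The following is a minimal sketch, not part of the Dictionnaire Cobra: the function name make_gap_fill is hypothetical, and it assumes you want to blank every word beginning with the stem (so bids, bidding and bidders as well as bid), which you may prefer to narrow.

```python
import re


def make_gap_fill(sentences, stem, blank="_____"):
    """Blank out every word starting with stem; return (gapped items, answer key)."""
    pattern = re.compile(rf"\b{re.escape(stem)}\w*", flags=re.IGNORECASE)
    items, answers = [], []
    for sentence in sentences:
        answers.append(pattern.findall(sentence))  # words removed, in order
        items.append(pattern.sub(blank, sentence))
    return items, answers


sentences = [
    "Higher bids are likely.",
    "Canadians are having to guess when the call will go out for bids "
    "to replace our search and rescue helicopters.",
    "Late bidding is a strategic response to the presence of bidders "
    "placing multiple bids.",
    "young unemployed are encouraged to bid for work in other countries.",
    "Because his contract was up, he had to bid on a new contract.",
]

items, answers = make_gap_fill(sentences, "bid")

for number, item in enumerate(items, start=1):
    print(f"{number}. {item}")
```

Note that a simple stem match would also blank unrelated words that happen to start with the same letters, so it is worth reading the output before handing it out.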

Next list the French translations as a handout for the students:

  1. Des offres plus hautes sont probables.
  2. Les Canadiens se perdent en conjectures sur la date à laquelle un appel d’offres sera lancé pour le remplacement de nos hélicoptères de recherche et de sauvetage.
  3. La surenchère tardive est une réponse stratégique à la présence d’enchérisseurs soumettant des offres multiples.
  4. les jeunes chômeurs sont encouragés à faire une offre pour du travail dans d’autres pays.
  5. Lorsque son contrat a expiré, M. Cormier a dû soumissionner pour un nouveau contrat.

After the gap fill, and after telling students to put away their notes on the gap fill, tell them that you will now dictate the sentences. Their job is to match the translated French sentences to the English sentences that they hear. They can compare their matches with a partner.

After the matching they then have to translate the French sentences into English; they can work in pairs to do this.

Finally they check their translated sentences with the original English sentences.

As an extension one could highlight the different forms of bid in sentence 3 (bidding, bidders) and their French equivalents.

Variations of this could include using a set of different words in the initial gap-fill exercise rather than the same word.

Thanks for reading.

Paper Machines – look ma no concordance lines

A major hurdle with corpora is understanding concordance outputs. Current interfaces don’t help much in this regard as documented by Alannah Fitzgerald e.g. Re-using Oxford OpenSpires content in podcast corpora. Geoff Jordan who worked on an early concordancer remarked:

…we found, not surprisingly, that there was a pay-off between simplicity and allowing for “sophisticated” searches.

Geoff Jordan (blog post comment)

So there is a big need to address the question of interface design with concordance tools. Recently, I stumbled across a visualisation plugin for the Zotero program called Paper Machines (the github site is faster) which can help with the interface problem.

Using the Phrase Net function I can get a visualisation of several fixed phrases e.g. binomials i.e. x and y:

xandy

The size of the word is related to the number of occurrences of that word in the phrase, and the thickness of the arrow to how often that phrase occurs. I can of course check these binomials using AntConc.
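For anyone curious about what sits behind such a picture, the x and y pattern can be approximated with a regular expression. This is only a sketch of the general idea and not how Paper Machines itself is implemented; the function name phrase_net and the sample text are invented for illustration.

```python
import re
from collections import Counter


def phrase_net(text, connector="and"):
    """Count (x, y) word pairs that are joined by the connector word."""
    pattern = re.compile(rf"\b(\w+)\s+{re.escape(connector)}\s+(\w+)\b")
    return Counter(pattern.findall(text.lower()))


# Invented sample text, NOT taken from my webmonkey corpus.
text = (
    "Get the site up and running. Drag and drop the file, then drag and "
    "drop the image. The latest and greatest tools are up and running."
)

counts = phrase_net(text, "and")

# Pair frequency is what the arrow thickness would represent.
for (x, y), n in counts.most_common():
    print(f"{x} and {y}: {n}")
```

One limitation of this sketch is that matches do not overlap, so in a chain such as a and b and c only the first pair is counted.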

Here’s another example using x the y:

xthey

Trying to get interesting patterns like this is much more difficult (impossible?) in AntConc. The default phrases in Paper Machines are (you can do your own custom searches):

x and y

x or y

x of the y

x a y

x the y

x at y

x is y

x [space] y which is equivalent to collocations i.e.:

xy

In addition to seeing interesting phrases, the visualisation helps you to get to know your corpus much better, e.g. in a couple of the diagrams I noticed spf and dkim; looking those up using AntConc, I discovered they were abbreviations for sender policy framework and domain keys identified mail respectively. I have no idea what these are but my students may find them important to know.

There are other functions on Paper Machines like Ngram but I could not get that working. For fun, the image below is a Heatmap of the place names mentioned in my webmonkey corpus:

heatmap

Thanks for reading.

What to teach from corpora output – frequency and transparency

Frequency of occurrence is the main way for teachers to choose what to teach when using corpora. However, as Andrew Walkley discusses in “Word choice, frequency and definitions”, using just frequency is not without limitations. In addition to frequency we can use semantic transparency/opacity, that is, how far the meaning of the whole differs from its individual parts. This is also sometimes referred to as how idiomatic a phrase is. Martinez (2013) offers a Frequency-Transparency Framework that teachers can use to help them choose what phrases to teach. Using four collocates of take he presents the following graphic:

The Frequency-Transparency Framework (FTF) using four collocates of the verb take (Martinez, 2013)

The numbered quadrants are the suggested priority of the verb+noun pairs i.e. the most frequent and most opaque phrase would be taught first (1), then the most frequent and transparent phrase (2), followed by the less frequent but opaque phrase (3) and last the least frequent and most transparent phrase (4). As said this is only a suggested priority which can be changed according to the teaching context. For example a further two factors (in addition to word for word decoding) can be considered when evaluating transparency:

  • Is the expression potentially deceptively transparent? – “every so often” can be misread as often; “for some time” can be misunderstood as short amount of time (Martinez & Schmitt, 2012, p.309)
  • Could the learner’s L1 negatively influence accurate perception?
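The quadrant ordering described above boils down to two yes/no judgements, which can be written out as a tiny function. This is just my own sketch of the suggested priority and not anything published with the framework; the function name and the placeholder phrases are hypothetical, and deciding what counts as frequent or opaque remains a judgement call for the teacher.

```python
def ftf_priority(frequent: bool, opaque: bool) -> int:
    """Return the suggested teaching priority (1 = teach first, 4 = teach last)."""
    if frequent:
        return 1 if opaque else 2
    return 3 if opaque else 4


# Placeholder phrases with hypothetical (frequent, opaque) judgements.
judgements = {
    "phrase A": (False, False),
    "phrase B": (True, True),
    "phrase C": (False, True),
    "phrase D": (True, False),
}

# Order the phrases by their suggested priority.
ranked = sorted(judgements, key=lambda phrase: ftf_priority(*judgements[phrase]))
print(ranked)
```

The two extra factors listed above (deceptive transparency and L1 influence) are not captured here; they are reasons to override the default ordering by hand.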

Applying the framework to the binomials list from my webmonkey corpus, I would place up and running in quadrant 1, latest and greatest in quadrant 2, tried and true in quadrant 3 and layout and design in quadrant 4. Note that I did not place drag and drop, the most frequent and somewhat opaque phrase, since it is so well known to my multimedia students (similar to cut and paste) that it would not need teaching.

Thanks for reading.

References:

Martinez, R. (2013). A framework for the inclusion of multi-word expressions in ELT. ELT Journal 67(2): 184-198.

Martinez, R. & Schmitt, N. (2012). A Phrasal Expressions List. Applied Linguistics 33(3): 299-320.