Guy Aston talks speech corpora

I had the pleasure of chatting to Guy Aston as he was staying in Paris on his way back to Italy, where he works at the University of Bologna. Guy has been an active researcher in corpora over the years. Here he recollects one significant event that encouraged him to pursue his interest in corpora and mentions his current area of investigation (best heard using headphones):

Regular readers may know that I have been using the TED Corpus Search Engine a few times recently to get my students to work on phonetic transcriptions. Multi-media corpora offer the possibility of examining the prosodic features of language, and this is what interests Guy about speech corpora.

For example, the phrase Thank you was found to have a falling tone most of the time, and frequently occurring phrases such as don’t you, last year, I don’t know and I don’t care are expected to have a fast rhythm (examples taken from The prosody of formulaic expressions in the IBM Lancaster Spoken English Corpus by Phoebe Lin).

Guy went on to detail some requirements and challenges involved when setting up speech corpora:

In the next audio Guy gives some examples of a learner using a speech corpus:

The following phone camera video of Guy’s TED speech corpus using Mike Scott’s Wordsmith (version 5 or later) illustrates listening to concordances of matter of fact:

Finally, I asked Guy the old favourite about the misplaced early optimism of using corpora in the language classroom:

Guy hinted that a version of his corpus may become available for AntConc (it is currently only compatible with WordSmith), but at the same time he hinted that you should not hold your breath waiting for one :).

Once again thanks to Guy for sharing some of his current work. Check out some of his publications.

Do also check out an interview with another corpus linguist Costas Gabrielatos.

Thanks for reading.

Easy micro-listenings

John Field has made the case for micro-listenings, which are “examples of the same word/phrases in different voices and contexts” (John Field 2013, New directions in second language listening: rethinking the Comprehension Approach presentation, slide 13) that can be replayed easily.

Such listenings help to develop a learner’s decoding skills. These micro-listenings can be embedded into a task that includes, say, transcription exercises.

An easy way to get such micro-listenings automatically is to use Videogrep. This tool allows you to search a subtitle file for a word, a grammatical form or a hypernym and then make a new supercut (edited video) containing every clip that matches your search.

I have a 1-to-1 adult student who is keen on motorbikes and wants to see a documentary film of the Isle of Man TT races called TT3D – Closer to the edge (yeah can u dig it, get your motor running, head out on…ah um back to post). He has yet to see it due to lack of time, so as a way to take advantage of this interest I created a supercut video using Videogrep.

I initially fed the subtitle file into AntWordProfiler to see what word I could cut. I wanted one that was in the 1st 1000 of the GSL (General Service List) but that did not have too many hits. So get, although very interesting (and I may well use it later), had 512 hits, which was way too many for a micro-listen. Looking down the list I noticed set with 9 hits. I had read somewhere that set has the most meanings of any word in English.
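For anyone without AntWordProfiler to hand, a rough hit count can be sketched in a few lines of Python. This is only an illustration and not what AntWordProfiler does: it assumes an SRT-style subtitle file, the sample text below is made up rather than taken from the film, and it makes no attempt to check words against the GSL.

```python
import re
from collections import Counter

# Made-up SRT-style sample for illustration only (not from the film).
SAMPLE_SRT = """1
00:00:01,000 --> 00:00:03,000
Get set, here we go.

2
00:00:04,000 --> 00:00:06,500
You get one chance to get it right.
"""

def word_counts(srt_text):
    """Count word tokens in subtitle text, skipping cue numbers and timestamps."""
    counts = Counter()
    for line in srt_text.splitlines():
        line = line.strip()
        # Skip blank lines, cue numbers and timing lines.
        # (A subtitle line made up only of digits would also be skipped.)
        if not line or line.isdigit() or "-->" in line:
            continue
        counts.update(re.findall(r"[a-z']+", line.lower()))
    return counts

counts = word_counts(SAMPLE_SRT)
print(counts.most_common(3))
```

Scanning the resulting list for a common word with only a handful of hits is the same selection step described above, just done by hand.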

Anyway this seemed ideal so here is the super cut of set:

Notice we have examples of upset, upsetting, settle so a great way to see if my student can distinguish these from the other uses.

Field recommends a task approach, so one can set the instruction before the first listen as “The clips have something in common. What is it?” and then, before the subsequent listens, ask the student to transcribe what uses of set they hear.

I’ll report back here how the student got on when I can.

If you want, check out some more examples, which include a search for adjective-noun grammar forms in a Big Bang Theory episode.

Thanks for reading.

Update 1:

The two students I have used #videogrep-ed micro-listenings with liked them a lot. There were some issues with the difficulty of transcribing some of the clips, which could put off less hardy souls.

Some notes on using videogrep: you can use regular expressions (regex) to tighten up searches. For example, if I just wanted uses of set and not upset, upsetting or settle, I would use \bset\b as the search term, where \b is the regex for a word boundary (\bset on its own rules out upset and upsetting but still matches settle). Here is a list of regexes, though I have yet to have a use for anything other than \b so far.
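A quick way to see what a word-boundary pattern will and will not catch is to try it on a few lines first. The sketch below uses Python's re module on made-up subtitle lines (not from the film), and assumes the search term behaves like an ordinary Python regular expression.

```python
import re

# Made-up subtitle lines for illustration only (not taken from the film).
lines = [
    "He set a new lap record",
    "I was upset after the crash",
    "It takes a while to settle in",
    "The bike is all set",
]

def matching_lines(pattern, lines):
    """Return the lines in which the pattern is found anywhere."""
    return [line for line in lines if re.search(pattern, line)]

# A boundary at the start only: skips "upset" but still catches "settle".
start_only = matching_lines(r"\bset", lines)

# A boundary on both sides: only the whole word "set".
whole_word = matching_lines(r"\bset\b", lines)

print(start_only)
print(whole_word)
```

Note that a boundary on both sides also leaves out forms such as sets and setting, which may or may not be what you want for a given task.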

Also you will find you need to expand some clips which are cut early, so add the option --padding and a number measured in milliseconds. For example, --padding 500 would pad out the beginning and end of each clip by 500ms.
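To make the effect of padding concrete, here is a small sketch of the arithmetic involved. It is not videogrep's own code: pad_clip_ms is a hypothetical helper, and the clip times are invented, expressed in milliseconds from the start of the video.

```python
def pad_clip_ms(start_ms, end_ms, padding_ms=0):
    """Widen a clip by padding_ms on each side, never starting before 0."""
    padded_start = max(0, start_ms - padding_ms)
    padded_end = end_ms + padding_ms
    return padded_start, padded_end

# An invented clip running from 12.0s to 13.2s, padded by 500ms each side.
print(pad_clip_ms(12000, 13200, 500))

# A clip near the very start of the video cannot begin before 0.
print(pad_clip_ms(200, 900, 500))
```

Larger padding values recover more of a clipped word but also pull in more of the surrounding speech, so it is worth trying a couple of values on the same search.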

Update 2:

There is an (experimental) graphical interface version for Apple OS X (103MB), useful for those not comfortable with using the command line.

BNCaudio corpus and TOEIC listening

Those of you who teach the TOEIC or other exams will have wanted from time to time to be able to use “authentic” audio along with its hesitations, pauses, repetitions and so on.

There’s a need to expose learners to the jungle English out in the world compared to the garden English in the classroom, terms coined by Richard Cauldwell and Sheila Thorn; see the short YouTube clip.

John Hughes makes the case for materials to use such audio and video. He points out that using corpora data for this requires context. I agree, though if you want to focus on decoding and building bottom-up listening skills, context is not so important.

I very recently used the Lancaster interface to the BNC audio data in my TOEIC exam class.

For details on getting access to this corpus, see the Google+ community post.

Once you get access make sure the spoken restrictions link is clicked so that it is greyed out as shown in the following screenshot:

BNCaudio-spoken restrictions

Then, after entering the search word contract, I selected the domain as business:


I then looked through the results for some interesting snippets. Note that not all audio can be accessed. Also, as it is in beta, there are still some alignment issues between transcripts and audio, but you can adjust the alignment and give feedback so that it can be improved.

I told the students that they will listen to snippets of audio using the word contract. I asked them to listen for other words – nouns, verbs, adjectives, adverbs related to the use of contract in the audio.

The following is the transcript of the first audio I used:

All members of staff have standard conditions of service as set out here, with the exception of temporary staff or staff who are er [pause] on a short time contract or maternity leave cover who may have a short term er notice er [pause] erm for erm [pause] a period of notice.

After the first listen one of the students recognised the word notice; after the second listen two students recognised temporary, conditions and staff. A third listen produced recognition of standard conditions.

I then dictated the transcript to them (without the hesitations, pauses etc) for them to write down. I then went through other relevant lexis (short time/term contract, maternity leave cover) and checked for understanding.

I repeated the procedure for two more audio snippets containing the keyword contract.

The students did of course find the audio difficult but they liked that it was real audio and made a change from the coursebook audio. I plan to use this process in the remaining classes. Next time I will probably start off with a much shorter clip and move to longer ones.

Thanks for reading.


The SpokesBNC interface allows you to display just the concordances in the BNC that have audio recordings, very useful.