Finding relative frequencies of tenses in the spoken BNC2014 corpus

Ginseng English (@ginsenglish) issued a poll on Twitter asking:

This is a good exercise to do on the new Spoken BNC2014 corpus. See the instructions for getting access to the corpus.

You need to get your head around the part-of-speech (POS) tags. The BNC2014 uses the CLAWS 6 tagset. For the past tense we can use the past tense of lexical verbs and the past tense of DO. Using the past tenses of BE and HAVE would also pull in their uses as auxiliary verbs, which we don’t want. Figuring out how to filter those out could be a neat future exercise. Another time! On to this post.

Simple past:

[pos = "VVD|VDD"]

pos = part of speech

VVD = past tense of lexical(main) verbs

VDD = past tense of DO

| = acts like an OR operator

So the search above looks for tokens tagged as either a past tense lexical verb or a past tense form of DO.
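To see what the VVD|VDD pattern does, here is a minimal Python sketch. It assumes, as CQP does, that the regular expression has to match the whole tag, which is why it uses re.fullmatch. The tagged tokens are invented for illustration and do not come from the corpus.

```python
import re

# CQP matches the regular expression against the whole tag,
# so re.fullmatch is the closest Python equivalent.
PAST_SIMPLE = re.compile(r"VVD|VDD")

# Hypothetical (word, CLAWS 6 tag) pairs, hand-tagged for illustration.
tagged = [("I", "PPIS1"), ("went", "VVD"), ("and", "CC"),
          ("did", "VDD"), ("it", "PPH1"), ("was", "VBDZ")]

# Keep only the words whose tag matches either alternative.
past_simple_hits = [word for word, tag in tagged if PAST_SIMPLE.fullmatch(tag)]
print(past_simple_hits)  # ['went', 'did']
```

Note that "was" (VBDZ) is left out, in line with the decision above not to search for past forms of BE.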

Simple present

The search term for the present simple is also relatively simple, to wit:

[pos = "VVZ"]

VVZ = -s form of lexical verb (e.g. gives, works)

Note that the above captures only third person singular forms. How can we also catch first and second person forms?
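One possible way to widen the search, offered as a suggestion and not as a tested answer: in CLAWS 6, VV0 tags the base form of a lexical verb (e.g. give, work), so adding it to the pattern should pick up the other present forms. The likely trade-off is that imperatives carry the same tag, so the wider pattern would over-count. The sketch below uses invented, hand-tagged tokens to show the difference.

```python
import re

THIRD_PERSON_ONLY = re.compile(r"VVZ")
# Assumption: VV0 (base form of lexical verb) covers the other present forms.
WIDER_PRESENT = re.compile(r"VVZ|VV0")

# Hypothetical (word, tag) pairs; imagine the final "listen" is an imperative.
tagged = [("she", "PPHS1"), ("works", "VVZ"), ("we", "PPIS2"),
          ("work", "VV0"), ("to", "TO"), ("live", "VVI"),
          ("listen", "VV0")]

def hits(pattern, tagged_tokens):
    """Return the words whose whole tag matches the pattern."""
    return [word for word, tag in tagged_tokens if pattern.fullmatch(tag)]

print(hits(THIRD_PERSON_ONLY, tagged))  # ['works']
print(hits(WIDER_PRESENT, tagged))      # ['works', 'work', 'listen']
```

The infinitive "live" (VVI) is correctly skipped, but the imperative "listen" slips in, so manual checking would still be needed.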

Present perfect

[pos = "VH0|VHZ"] [pos = "R.*|MD|XX" & pos != "RL"]{0,4} [pos = "AT.*|APPGE"]? [pos = "JJ.*|N.*"]? [pos = "PPH1|PP.*S.*|PPY|NP.*|D.*|NN.*"]{0,2} [pos = "R.*|MD|XX"]{0,4} [pos = "V.*N"]

The search for the present perfect may seem daunting, but don’t worry: the structure is fairly simple. The first search term, [pos = "VH0|VHZ"], looks for the present tense forms of HAVE (have, has), and the last term, [pos = "V.*N"], looks for past participles (any tag that starts with V and ends in N, such as VVN for lexical verbs).

The terms in between look for optional adverbs and noun phrases that may intervene, namely:

“adverbs (e.g. quite, recently), negatives (not, n’t) or multiword adverbials (e.g. of course, in general); and noun phrases: pronouns or simple NPs consisting of optional premodifiers (such as determiners, adjectives) and nouns. These typically occur in the inverted word order of interrogative utterances (Has he arrived? Have the children eaten yet?)” – Hundt & Smith (2009).

Present progressive

[pos = "VBD.*|VBM|VBR|VBZ"] [pos = "R.*|MD|XX" & pos != "RL"]{0,4} [pos = "AT.*|APPGE"]? [pos = "JJ.*|N.*"]? [pos = "PPH1|PP.*S.*|PPY|NP.*|D.*|NN.*"]{0,2} [pos = "R.*|MD|XX"]{0,4} [pos = "VVG"]

This has a similar structure to the present perfect search. The first term, [pos = "VBD.*|VBM|VBR|VBZ"], looks for past and present forms of BE, and the last term, [pos = "VVG"], for the -ing participle of lexical verbs. The terms in between are for optional adverbs, negatives and noun phrases. Note that because the first term includes the past forms of BE (VBD.*), the search also returns past progressives.
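To get a feel for how a multi-slot search like these works, here is a toy Python matcher for the present perfect pattern. It is only a sketch under stated assumptions: each slot is a (regex, minimum, maximum) triple, the helper names slot, match_at and find_hits are made up for this post, and the matching strategy is a simple greedy backtracking search, not what CQP does internally. The tagged sentences are hand-tagged for illustration.

```python
import re

def slot(pattern, lo=1, hi=1):
    """One query position: a tag regex plus how many times it may repeat."""
    return (re.compile(pattern), lo, hi)

# The present perfect search, slot by slot.
PRESENT_PERFECT = [
    slot(r"VH0|VHZ"),                                  # have / has
    slot(r"(?!RL$)(?:R.*|MD|XX)", 0, 4),               # adverbs, negatives, but not RL
    slot(r"AT.*|APPGE", 0, 1),                         # optional determiner
    slot(r"JJ.*|N.*", 0, 1),                           # optional adjective or noun
    slot(r"PPH1|PP.*S.*|PPY|NP.*|D.*|NN.*", 0, 2),     # pronouns, simple NPs
    slot(r"R.*|MD|XX", 0, 4),                          # more adverbs, negatives
    slot(r"V.*N"),                                     # past participle
]

def match_at(tags, start, slots):
    """Return the end index if the slots match from start, else None."""
    if not slots:
        return start
    (pattern, lo, hi), rest = slots[0], slots[1:]
    count = 0
    # Consume as many matching tags as this slot allows.
    while (count < hi and start + count < len(tags)
           and pattern.fullmatch(tags[start + count])):
        count += 1
    # Back off one tag at a time until the remaining slots fit.
    for n in range(count, lo - 1, -1):
        end = match_at(tags, start + n, rest)
        if end is not None:
            return end
    return None

def find_hits(tags, slots):
    """Try the query at every position and collect (start, end) spans."""
    hits = []
    for start in range(len(tags)):
        end = match_at(tags, start, slots)
        if end is not None:
            hits.append((start, end))
    return hits

words = ["have", "the", "children", "eaten", "yet"]
tags = ["VH0", "AT", "NN2", "VVN", "RR"]
for start, end in find_hits(tags, PRESENT_PERFECT):
    print(" ".join(words[start:end]))  # have the children eaten

# Lexical HAVE with no participle should not match.
print(find_hits(["PPHS1", "VHZ", "AT1", "NN1"], PRESENT_PERFECT))  # []
```

Swapping the first slot for the BE forms and the last for VVG would give the progressive search in the same way.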

Note that all these searches are approximate – manual checking will be needed for more accuracy.

So can you predict the order of these forms? Let me know in the comments what results you get with these search terms, in frequency per million.
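If the interface gives you only raw hit counts, frequency per million is easy to work out by hand. A minimal sketch follows; the function name per_million and the numbers are made up for illustration and are not real results from the corpus.

```python
def per_million(hits, corpus_tokens):
    """Normalise a raw hit count to a frequency per million tokens."""
    if corpus_tokens <= 0:
        raise ValueError("corpus_tokens must be positive")
    return hits / corpus_tokens * 1_000_000

# Hypothetical figures: 2,500 hits in a corpus of 10 million tokens.
print(round(per_million(2500, 10_000_000), 1))  # 250.0
```

Normalising like this is what lets you compare the four searches with each other, and with results from corpora of different sizes.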

Thanks for reading.

Other search terms in spoken BNC2014 corpus.


Ginseng English blogs about frequencies of forms found in one study. Do note that, as there are 6 inflectional categories in English (infinitive, first and second person present, third person singular present, progressive, past tense, and past participle), there are more opportunities to use the simple present form because it covers 2 of these categories.


Hundt, M., & Smith, N. (2009). The present perfect in British and American English: Has there been any change, recently. ICAME journal, 33(1), 45-64. (pdf) Available from


Successful Spoken English – interview with authors

The following is an email interview with the authors, Christian Jones, Shelley Byrne, Nicola Halenko, of the recent Routledge publication Successful Spoken English: Findings from Learner Corpora. Note that I have not yet read this (waiting for a review copy!).

Successful Spoken English

1. Can you explain the origins of the book?

We wanted to explore what successful learners do when they speak and in particular learners from B1-C1 levels, which are, we feel, the most common and important levels. The CEFR gives “can do” statements at each level but these are often quite vague and thus open to interpretation. We wanted to discover what successful learners do in terms of their linguistic, strategic, discourse and pragmatic competence and how this differs from level to level.  

We realised it would be impossible to use data from all the interactions a successful speaker might have so we used interactive speaking tests at each level. We wanted to encourage learners and teachers to look at what successful speakers do and use that, at least in part, as a model to aim for as in many cases the native speaker model is an unrealistic target.

2. What corpora were used?

The main corpus we used was the UCLan Speaking Test Corpus (USTC). This contained data only from students, from a range of nationalities, who had been successful (based on holistic test scoring) at each level, B1-C1. As points of comparison, we also recorded native speakers undertaking each test. We also made some comparisons to the LINDSEI (Louvain International Database of Spoken English Interlanguage) corpus and, to a lesser extent, the spoken section of the BYU-BNC corpus.

Test data does not really provide much evidence of pragmatic competence so we constructed a Speech Act Corpus of English (SPACE) using recordings of computer-animated production tasks by B2 level learners  for requests and apologies in a variety of contexts. These were also rated holistically and we used only those which were rated as appropriate or very appropriate in each scenario. Native speakers also recorded responses and these were used as a point of comparison. 

3. What were the most surprising findings?

In terms of the language learners used, it was a little surprising that as levels increased, learners did not always display a greater range of vocabulary. In fact, at all levels (and in the native speaker data) there was a heavy reliance on the top two thousand words. Instead, it is the flexibility with which learners can use these words which changes as the levels increase, so they begin to use them in more collocations and chunks and with different functions. There was also a tendency across levels to favour use of chunks which can be used for a variety of functions. For example, although we can presume that learners may have been taught phrases such as ‘in my opinion’, this was infrequent and instead they favoured ‘I think’, which can be used to give opinions, to hedge, to buy time etc.

In terms of discourse, the data showed that we really need to pay attention to what McCarthy has called ‘turn grammar’. A big difference as the levels increased was the increasing ability of learners to co-construct  conversations, developing ideas from and contributing to the turns of others. At B1 level, understandably, the focus was much more on the development of their own turns.

4. What findings would be most useful to language teachers?

Hopefully, in the lists of frequent words, keywords and chunks they have something which can inform their teaching at each of these levels. It would seem to be reasonable to use, as an example, the language of successful B2 level speakers to inform what we teach to B1 level speakers. Also, though tutors may present a variety of less frequent or ‘more difficult’ words and chunks to learners, successful speakers will ultimately employ lexis which is more common and more natural sounding in their speech, just as the native speakers in our data also did.

We hope the book will also give clearer guidance as to what the CEFR levels mean in terms of communicative competence and what learners can actually do at different levels. Finally, and related to the last point, we hope that teachers will see how successful speakers need to develop all aspects of communicative competence (linguistic, strategic, discourse and pragmatic competence) and that teaching should focus on each area rather than only one or two of these areas.

There has been some criticism, notably by Stefan Th. Gries and collaborators, that much learner corpus research looks at too few factors when explaining a linguistic phenomenon. Gries calls for a multifactorial approach, whose power can be seen in a 2014 study conducted with Sandra C. Deshors on the uses of may, can and pouvoir by native English users and French learners of English. Using nearly 4000 examples from 3 corpora, annotated with over 20 morphosyntactic and semantic features, they found, for example, that French learners of English see pouvoir as closer to can than to may.

The analysis for Successful Spoken English was described as follows:

“We examined the data with a mixture of quantitative and qualitative data analysis, using measures such as log-likelihood to check significance of frequency counts but then manual examination of concordance line to analyse the function of language.”
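For readers who have not met log-likelihood before, here is a minimal Python sketch of the two-corpus log-likelihood calculation commonly used in corpus linguistics to compare frequency counts. This is an assumption about the kind of measure meant; the authors do not say exactly how they computed theirs. The function name and the counts are made up for illustration.

```python
import math

def log_likelihood(freq_a, size_a, freq_b, size_b):
    """Log-likelihood (G2) for one item's frequency in two corpora."""
    total_freq = freq_a + freq_b
    total_size = size_a + size_b
    # Expected frequencies if the item were equally common in both corpora.
    expected_a = size_a * total_freq / total_size
    expected_b = size_b * total_freq / total_size
    ll = 0.0
    # A zero count contributes nothing (and log(0) is undefined).
    if freq_a > 0:
        ll += freq_a * math.log(freq_a / expected_a)
    if freq_b > 0:
        ll += freq_b * math.log(freq_b / expected_b)
    return 2 * ll

# Hypothetical counts: 150 hits in 1,000 tokens vs 100 hits in 1,000 tokens.
score = log_likelihood(150, 1000, 100, 1000)
print(round(score, 2))  # 10.07

# 3.84 is the usual cut-off for p < 0.05 with one degree of freedom.
print(score > 3.84)  # True
```

A score of 0 means the item is used at the same rate in both corpora; the larger the score, the less likely the difference is down to chance. It says nothing about the function of the item, which is where the manual reading of concordance lines comes in.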

Hopefully, with the increasing use of multifactorial methods, learner corpus analysis can yield even more interesting and useful results than current approaches allow.

Chris and his colleagues kindly answered some follow-up questions:

5. How did you measure/assign CEFR level for students?  

Students were often already in classes where they had been given a proficiency test and placed in a level. We then gave them our speaking test and only took data from students who had been given a global pass score of 3.5 or 4 (on a scale of 0-5). The borderline pass mark was 2.5, so we only chose students who had clearly passed but were not at the very top of the level, and obviously then only those who gave us permission to do so. The speaking tests we used were based on Canale’s (1984) oral proficiency interview design and consisted of a warm up phase, a paired interactive discussion task and a topic specific conversation based on the discussion task. Each lasted between 10 and 15 minutes.

6. So most of the analysis was in relation to successful students who were measured holistically?  


7. And could you explain what holistically means here?

Yes, we looked at successful learners at each CEFR level, according to the test marking criteria. They were graded for grammar, vocabulary, pronunciation, discourse management and interactive ability based on criteria such as  the following (grade 3-3.5) for discourse management ‘Contributions are normally relevant, coherent and of an appropriate length’. These scores were then amalgamated into a global score. These scales are holistic in that they try to assess what learners can do in terms of these competences to gain an overall picture of their spoken English rather than ticking off a list of items they can or cannot use. 

8. Do I understand correctly that comparisons with native speaker corpora were not as much used as with successful vs unsuccessful students? 

No, we did not look at unsuccessful students at all. We were trying to compare successful students at B1-C1 levels and to draw some comparison to native speakers. We also compared our data to the LINDSEI spoken learner corpus to check the use of key words.

9. For the native speaker comparisons what kind of things were compared?

We compared each aspect of communicative competence (linguistic, strategic, discourse and pragmatic competences) to some degree. The native speakers took exactly the same tests, so we compared, as one example, the most frequent words they used.


Thanks for reading.



Deshors, S. C., & Gries, S. T. (2014). A case for the multifactorial assessment of learner language. Human Cognitive Processing (HCP), 179. Retrieved from


Corpus linguistics community news 6

Although some say G+ is a dying forum, the CL G+ group now has 482 members, nice. Admittedly not many interact, but I do (like to) think a lot do appreciate the resources put on there. I had the pleasure of being a mentor on the Lancaster University FutureLearn Corpus Linguistics MOOC last September, which is going to have another round next September, so look out for that.

First up for this installment of news, I check the claim that “elementary, my dear Watson” was never used in the Sherlock Holmes stories.

Next I challenged readers to check how many business idioms in a list can be accounted for in relevant corpora.

There is a useful table of search syntax for the COCA interface.

My search for publicly available spoken corpora.

A list of phrasal verbs Jeremy Corbyn used in his final rally speech before becoming Labour leader.

Using the SKELL interface to get some good examples for a review quiz.

Some AntConc alternatives.

Bypassing limits of spreadsheet rows.

Finally a description of using a scraping tool to get biology and medical abstracts in simpler language that might be suitable for EAP students.

Thanks for reading. And don’t forget to check out the previous corpus linguistics community news if you haven’t already.