Finding relative frequencies of tenses in the spoken BNC2014 corpus

Ginseng English (@ginsenglish) issued a poll on Twitter asking:

This is a good exercise to do on the new spoken BNC2014 corpus. See instructions to get access to the corpus.

You need to get your head around the part-of-speech (POS) tags. The BNC2014 uses the CLAWS6 tagset. For the past tense we can use the past tense of lexical verbs and the past tense of DO. Using the past tenses of BE and HAVE would also pull in their uses as auxiliary verbs, which we don’t want. Figuring out how to filter those out could be a neat future exercise. Another time! On to this post.

Simple past

[pos = "VVD|VDD"]

pos = part of speech

VVD = past tense of lexical (main) verbs

VDD = past tense of DO

| = acts like an OR operator

So the above looks for words tagged as either past tense of lexical verbs or past tense of DO.
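To see what the VVD|VDD alternation does, here is a minimal Python sketch. It assumes, as in CQP, that the pattern has to match the whole tag rather than part of it. The tag list is a hand-picked sample of CLAWS tags and `matches_pos` is a made-up helper name, not part of any corpus tool.

```python
import re

def matches_pos(pattern, tag):
    """Return True if the whole tag matches the pattern (assumed CQP-style anchoring)."""
    return re.fullmatch(pattern, tag) is not None

# A few CLAWS tags, picked by hand purely for illustration.
sample_tags = ["VVD", "VDD", "VVZ", "VVN", "VBDZ", "VHD"]

# Keep only the tags the simple past pattern would accept.
simple_past = [tag for tag in sample_tags if matches_pos("VVD|VDD", tag)]
print(simple_past)  # ['VVD', 'VDD']
```

The past tenses of BE (VBDZ) and HAVE (VHD) fall outside the pattern, which is the point of restricting it to VVD and VDD.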

Simple present

The search term for the simple present is also relatively simple, to wit:

[pos = "VVZ"]

VVZ = -s form of lexical verb (e.g. gives, works)

Note that the above captures only third person singular forms. How can we also catch first and second person forms?

Present perfect

[pos = "VH0|VHZ"] [pos = "R.*|MD|XX" & pos != "RL"]{0,4} [pos = "AT.*|APPGE"]? [pos = "JJ.*|N.*"]? [pos = "PPH1|PP.*S.*|PPY|NP.*|D.*|NN.*"]{0,2} [pos = "R.*|MD|XX"]{0,4} [pos = "V.*N"]

The search for the present perfect may seem daunting, but don’t worry: the structure is fairly simple. The first search term [pos = "VH0|VHZ"] says look for the present tense forms of HAVE (have, has), and the last term [pos = "V.*N"] says look for past participles of any verb (VVN for lexical verbs, plus VBN, VDN and VHN for BE, DO and HAVE).

The other terms look for optional adverbs and noun phrases that may come in between, namely:

“adverbs (e.g. quite, recently), negatives (not, n’t) or multiword adverbials (e.g. of course, in general); and noun phrases: pronouns or simple NPs consisting of optional premodifiers (such as determiners, adjectives) and nouns. These typically occur in the inverted word order of interrogative utterances (Has he arrived? Have the children eaten yet?)” – Hundt & Smith (2009).

Present progressive

[pos = "VBD.*|VBM|VBR|VBZ"] [pos = "R.*|MD|XX" & pos != "RL"]{0,4} [pos = "AT.*|APPGE"]? [pos = "JJ.*|N.*"]? [pos = "PPH1|PP.*S.*|PPY|NP.*|D.*|NN.*"]{0,2} [pos = "R.*|MD|XX"]{0,4} [pos = "VVG"]

This has a similar structure to the present perfect search. The first term [pos = "VBD.*|VBM|VBR|VBZ"] looks for past and present forms of BE (was, were, am, are, is), so the search returns past progressives as well as present progressives. The last term [pos = "VVG"] looks for the -ing participle of lexical verbs. The terms in between are for optional adverbs, negatives and noun phrases.
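If the multi-token searches are hard to picture, the rough Python sketch below mimics them on a list of tags. It is not how CQPweb works internally: it simply joins the tags with spaces and turns each bracketed term into a regex group. The helper names (`tok`, `build`, `has_match`) are made up for the illustration, and the tags in the examples were assigned by hand rather than by the CLAWS tagger.

```python
import re

def tok(pattern, exclude=None):
    """Regex group for one token whose whole tag matches `pattern` but not `exclude`."""
    body = pattern.replace(".", r"\S")  # keep wildcards inside a single tag
    guard = f"(?!(?:{exclude}) )" if exclude else ""
    return f"(?:{guard}(?:{body}) )"

def build(first, last):
    """Auxiliary, optional adverbs/negatives and noun phrase, then the participle."""
    return (
        r"(?<= )"  # only start at a token boundary
        + tok(first)
        + tok("R.*|MD|XX", exclude="RL") + "{0,4}"
        + tok("AT.*|APPGE") + "?"
        + tok("JJ.*|N.*") + "?"
        + tok("PPH1|PP.*S.*|PPY|NP.*|D.*|NN.*") + "{0,2}"
        + tok("R.*|MD|XX") + "{0,4}"
        + tok(last)
    )

PERFECT = build("VH0|VHZ", "V.*N")
PROGRESSIVE = build("VBD.*|VBM|VBR|VBZ", "VVG")

def has_match(regex, tags):
    """True if the tag sequence contains a stretch matching the pattern."""
    return re.search(regex, " " + " ".join(tags) + " ") is not None

# "Have the children eaten yet" (hand-tagged)
print(has_match(PERFECT, ["VH0", "AT", "NN2", "VVN", "RT"]))  # True
# "I have a car" (hand-tagged): HAVE as a main verb, no participle
print(has_match(PERFECT, ["PPIS1", "VH0", "AT1", "NN1"]))     # False
# "She is not working" (hand-tagged)
print(has_match(PROGRESSIVE, ["PPHS1", "VBZ", "XX", "VVG"]))  # True
```

The only thing that changes between the two searches is the first and last term. Everything in between is the same optional filler.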

Note that all these searches are approximate – manual checking will be needed for more accuracy.

So can you predict the order of these forms? Let me know in the comments what results you get from these search terms, in frequency per million.
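If your results come back as raw hit counts, a small Python sketch like the one below converts them to frequency per million. Every number in it is invented purely to show the arithmetic. These are not results from the corpus, so plug in the hit counts and corpus size that your own search reports.

```python
def per_million(hits, corpus_tokens):
    """Normalise a raw hit count to a frequency per million tokens."""
    return hits * 1_000_000 / corpus_tokens

# Placeholder corpus size and hit counts, invented for illustration only.
corpus_tokens = 10_000_000
raw_hits = {"query A": 5_000, "query B": 12_500, "query C": 800}

# Print the queries from most to least frequent.
for name, hits in sorted(raw_hits.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: {per_million(hits, corpus_tokens):.1f} per million")
```

Normalising this way matters most when you want to compare your numbers with counts from a corpus of a different size.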

Thanks for reading.

Other search terms in spoken BNC2014 corpus.


Ginseng English blogs about frequencies of forms found in one study. Do note that there are 6 inflectional categories in English (infinitive, first and second person present, third person singular present, progressive, past tense, and past participle), so the opportunities to use the simple present form are greater because 2 of those categories are present forms.


Hundt, M., & Smith, N. (2009). The present perfect in British and American English: Has there been any change, recently? ICAME Journal, 33(1), 45-64. (pdf) Available from


2 thoughts on “Finding relative frequencies of tenses in the spoken BNC2014 corpus”

    1. Hi Marc yah getting into regex may be worthwhile, though myself getting meta about regex often useful, i.e. being able to look for someone else’s regex solution 😀
