Signs o’ the times – some/any invariant meanings and COCA

I am glad to be writing this particular (rushed, see end) post as it involves corpus linguistics and I have not done such a post for a while. It is also about my current interest – Columbia School linguistics.

Over the years I have become less enamored of the power of corpus linguistics for language teaching. It is certainly very useful to access descriptions of language, but that is not enough: explanations are also needed. Columbia School (CS) linguistics is about analyzing the invariant meanings that motivate choices in both grammar and lexis. It is about one-form-to-one-meaning mappings – an ideal aim when looking to help students.

In a 2016 paper, Nadav Sabar analyses the use of some and any. The following borrows heavily from that paper.

Most pedagogical grammars state (formal) rules such as “any is used in negative sentences and not in affirmative statements”. Yet such rules cannot account for why some is used in contexts where the rules predict any. Sabar gives the following attested example:

1) When Yvonne lived in Italy, where it seems like the whole country is married, people always wanted to know about her personal life. I remember her telling me that every time she’d come back from a great vacation, the first question from married friends was, “Did you meet anybody?” It was as if the whole point of going on vacation was to meet someone. That she had a great time and saw something new and interesting didn’t matter. The entire vacation was cancelled or a flop because she didn’t meet someone.

Formal accounts could only say that any is also acceptable, as in she didn’t meet anyone; they are unconcerned with why the writer chose some in this case.

Formal accounts use the sentence as the unit of analysis and see meaning as compositional – i.e. the meanings of individual words in a sentence add up to the whole. CS uses signs (pairings of signal and meaning) as the unit of analysis and sees meaning as instrumental rather than compositional. That is, the individual meanings of signals need not add up to the sentence meaning. There is a distinction between the linguistic code, which has an invariant meaning (one that always corresponds to a linguistic signal), and the interpretation of the code, which is the subjective outcome of messages. Meanings are very sparse in that they do not encode messages but only offer prompts that may merely suggest message elements.

The meaning hypotheses of some and any are shown below:

some – RESTRICTED
any – UNRESTRICTED

I.e. some as RESTRICTED suggests limits, internal divisions and boundaries, while any as UNRESTRICTED suggests no boundaries, limits or divisions. Note that this does not mean that the domain in question has, in reality, no divisions or boundaries – just that the reality is irrelevant to the message. Also note that in a pedagogical grammar such as Martin Parrott’s, this meaning division between restricted and unrestricted is only described for stressed SOME and ANY.

Sabar uses the following as examples:

2) If you see something, say something. (New York City public safety slogan)
3) No parking any time (street sign)

In 2) some is used because the message suggested is a restriction on the set of things people see and say. The context drives the inference as to the nature of the restriction – suspicious-looking things. Any could also have been used, but that would not have been as effective a message – any would have suggested no restriction, i.e. people should say something no matter what they see.

Similarly in 3) any is used because there is no restriction on the domain of times of the day.

So now for 1) we can see that some is used because the message suggests a restriction of the set of people Yvonne did not meet, and the context shows that this restriction is to people who may qualify as marriage potential.

Now the interesting corpus linguistics part.

The methodology of CS first involves a qualitative step in which some aspect of the sign in question is examined. So for some, which suggests restriction, another element that suggests the same thing is looked for:

4) Some Feds [Federal workers] are held up as national heroes while others are considered a national joke. (ABC Nightline: Income Tax)

Here others is used to refer to a different subset of people within the domain of Federal workers. This message element is also suggested by some – RESTRICTED. This does not mean there is only one reason for the choice of these forms; rather, this message feature of internal division is one out of many possible reasons that motivated the choice of these two forms.

To test this claim generally, we can look at a corpus to see whether others co-occurs with some significantly more often than it co-occurs with any.

We can do this in COCA by using these search terms:

COCA searches for others:

Favoured: some [up to 9 slots] others
Disfavoured: any [up to 9 slots] others

The following screenshot shows how to find some [up to 9 slots] others (do similar for any):

To find some without others see the next screenshot (i.e. use the minus sign -):

And tabulating the data in a contingency table:

          others present         others absent
          N         %            N            %
some      19,078    90           8,946,046    65
any       2,022     10           4,841,946    35
Total     21,100    100          13,787,992   100

p < .0001

The table percentages and significance test support the claim that there is one message feature that motivates the use of both some and others. Note that the meaning hypothesis itself is not directly tested; it is only indirectly tested via the counts in COCA. Sabar goes on to test, both qualitatively and quantitatively, other signals that contribute to the meaning hypotheses of some – RESTRICTED and any – UNRESTRICTED.

I wondered how the singular other would distribute with any and some:

          other present          other absent
          N         %            N            %
any       39,244    52           4,811,937    35
some      35,175    48           8,930,621    65
Total     74,419    100          13,742,558   100

p < .0001
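The p-values reported for these tables can be checked with a quick chi-square test of independence on the raw counts. This is only a sketch using the Python standard library (Sabar’s paper may use a different significance test); since the critical value for p = .0001 at one degree of freedom is about 15.14, any statistic above that confirms p < .0001.

```python
# Chi-square test of independence for the some/any co-occurrence counts
# tabulated above (a sketch; not necessarily the test used in the paper).

def chi_square(table):
    """table: list of rows of observed counts; returns the chi-square statistic."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            expected = row_totals[i] * col_totals[j] / grand
            chi2 += (obs - expected) ** 2 / expected
    return chi2

# plural "others" with some vs. any
plural = [[19078, 8946046],   # some: others present, others absent
          [2022,  4841946]]   # any

# singular "other" with any vs. some
singular = [[39244, 4811937],  # any
            [35175, 8930621]]  # some

for name, table in [("others", plural), ("other", singular)]:
    chi2 = chi_square(table)
    # critical value for p = .0001 at 1 degree of freedom is ~15.14
    print(name, round(chi2, 1), "p < .0001" if chi2 > 15.14 else "n.s.")
```

Both statistics come out far above the critical value, in line with the p < .0001 reported above.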

Here can we say that singular other contributes to a message meaning of UNRESTRICTED? I have no idea, as I have not had time to explore this further!

I hope dear reader you forgive the rushed nature of this post but I wanted to get something up before the risk of forgetting this due to holiday haze!

Thanks for indulging.

Update 1:

Thanks to a heads-up from some tweeters: Michael Lewis, in his 1986 book The English Verb, was also pointing to the primacy of meaning:

Update 2:

Nadav Sabar has pointed out that he looked for others in one direction only, i.e. following some/any, whereas I looked at occurrences of others both following and preceding some/any.
In addition, a newer version of his paper uses a window size of 2 instead of 9.


Parrott, M. (2000). Grammar for English language teachers: with exercises and a key. Cambridge University Press.

Sabar, N. (2016). Using big data to test meaning hypotheses for any and some. In Otheguy, R., Stern, N., Reid, W., & Ruggles, J. (Eds.), Columbia School linguistics in the 21st century: Advances in sign-based linguistics. Amsterdam/Philadelphia: John Benjamins.


The Prime Machine – a new concordancer in town

One of the impulses behind The Prime Machine was to help students distinguish similar or synonymous words. Recently a student of mine asked about the difference between “occasion” and “opportunity”. I used the compare function on the BYU COCA to help the student induce some meaning from the listed collocations. It kinda, sorta, helped.

The features offered by The Prime Machine promise much better help for this kind of question. For example, in the screenshot below the (Neighbourhood) Label function shows the kind of semantic tags associated with the words “occasion” and “opportunity”. Having this information certainly helps reduce the time spent figuring out the differences between the words.

Neighbourhood Labels for the comparison of occasion and opportunity

One of the other sweet new features brought to the concordancer table is a card display system, as seen in the first screenshot below. Another is information based on Michael Hoey’s lexical priming theory, as shown in the second screenshot below.

Card display for comparison of words occasion and opportunity
Paragraph position of the words occasion and opportunity

The developer of the new concordancer, Stephen Jeaco, kindly answered some questions.

1. Can you speak a little about your background?

Well, I’m British but I’ve lived in China for 18 years now.  My first degree was in English Literature and then I did my MA Applied Linguistics/TESOL and my PhD was under the supervision of Michael Hoey with the University of Liverpool.

I took up programming as a hobby in my teens.  If I hadn’t got the grades to read English at York, I would have gone on to study Computer Science somewhere.  In those days the main thing was to choose a degree programme that you felt you would enjoy.  Over the years, though, I’ve kept a technical interest and produced a program here or there for MA projects and things like that.

I’ve worked at XJTLU for 12 years now.  I was the founding director of the English Language Centre, and set up and ran that for 6 years.  After rotating out of role, I moved into what is now called the Department of English where I lecture in linguistics to our undergraduate English majors and to our MA TESOL students.

2. What needs is The Prime Machine setting out to fill?

I started working on The Prime Machine in 2010, at the beginning of my part-time PhD.  At that time, I was interested in corpus linguistics but I found it hard to pass that enthusiasm on to my colleagues and students.  We had some excellent software and some good web tools, but internet access to sites outside China wasn’t always very reliable, and getting started with using corpora for language learning usually meant having to learn quite a lot about what to look for, how to look for it, and also how to understand what the data on-screen could mean.

Having taught EAP for about 10 years at that time, I felt that my Chinese learners of English needed a way to help them see some of the patterns of English which can be found through exploring examples, and in particular I wanted to help them see differences between synonyms and become familiar with how collocation information could help them improve their writing.

I’d read some of Michael Hoey’s work while doing my MA, and in his role of Pro Vice Chancellor for Internationalization I met him at our university in China.  His theory of lexical priming provided both a rationale for how patterns familiar in corpus linguistics relate to acquisition and it also gave me some specific aspects to focus on in terms of thinking about what to encourage students to notice in corpus lines. 

The main aim of The Prime Machine was to provide an easy start to corpus linguistic analysis – or rather an easy start to using corpus tools to explore examples.  Central to the concept were two main ideas: (1) that students would need some additional help finding what to look for and knowing what to compare and (2) that new or enhanced ways of displaying corpus lines and summary data could help draw their attention to different patterns.  Personally, I really like the “Card” display, and while KWIC is always going to be effective for most things, when it comes to trying to work out where specific examples come from and what the wider context might be, I think the cards go a long way towards helping students in their first experiences of DDL.

Practically speaking, another thing I wanted to do was to start with a search screen where they could get very quick feedback on anything that couldn’t be found and whether other corpora on the system would have some results. 

3. What kind of feedback have you got from students and staff on the corpus tool?

I’ve had a lot of feedback and development suggestions from my students at my own institution.  Up until a few weeks ago, The Prime Machine was only accessible to our own staff and students.  The majority of users have been students studying linguistics modules, mostly those who are taking or have taken a module introducing corpus linguistics. However, for several years now I have also had students using it as a research tool for their Final Year Project – a year-long undergraduate dissertation project where typically each of us has 4 to 5 students for one-to-one supervision.  They’ve done a range of projects with it, including trying to apply some of Michaela Mahlberg’s approaches to another author, exploring synonyms, and exploring the naturalness of student paraphrases or exam questions.  People often think of Chinese students as being shy and wanting to avoid direct criticism of the teacher, but our students certainly develop the skills for expressing their thoughts and give me suggestions!

In my own linguistics module on corpus linguistics, I’ve found the new version of The Prime Machine to be a much easier way to get students started at looking at their own English writing or transcripts of their speech and getting them to consider whether evidence about different synonyms and expressions from corpora can help them improve their English production.  Personally, I use it as a stepping stone to introducing features of WordSmith Tools and other resources.

In terms of staff input, I’ve had a couple of more formal projects, getting feedback from colleagues on the ranking features and the Lines and Cards displays.  I’ve also had feedback by running sessions introducing the tool as part of a professional development day and a symposium.  Some of my colleagues have used it a bit with students, but I think while it required access from campus and before I had the website up, it was a bit too tricky even on site. 

On the other hand, I’ve given several conference papers introducing the software, and received some very useful comments and suggestions.

I need to balance my teaching workload, time spent working towards more concrete research outputs and family life, but if we can get over some of the connectivity issues and language teachers want to start using The Prime Machine with their students, I’m going to need as much feedback as possible.  I’d like to hope I could respond and build up or extend the tool, but at the same time there’s a need to try to keep things simple and suitable for beginners. 

 4. You have some extra materials for students at your institution, could you describe these?

There’s nothing really very special about these.  But having the two ways of accessing the server (offsite vs. on-site) means if corpus resources come with access restrictions or if a student wants to set up a larger DIY corpus for a research project I’m able to limit access to these.

Other than additional corpora, there are a few simple wordlists which I use in my own teaching and some additional options for some of the research tools.

5. What developments are in the pipeline for future versions of The Prime Machine?

One of the main reasons I wanted The Prime Machine to be publicly available and available for free was so that others would be able to see some of the features I’ve written about or presented about at conferences in action.  In some ways, my focus has changed a bit towards smaller undergraduate projects for linguistics, but I still have interests and contacts in English language teaching.  Given some of the complications of connecting from Europe to a server in China, unless someone finds it really interesting and wants to set up a mirror server or work more collaboratively, I don’t think I can hope to have a system as widely popular and reliable as the big names in online concordancing tools.  But having interviews like this and getting the message out about the software through social media means that there is a lot more potential for suggestions and feature requests to help me develop in ways I’ve not thought of.

But left to my own perceptions and perhaps through interactions with my MA TESOL students, local high schools and our language centre, I’m interested in adding to the capabilities of the search screen to help students find collocations when the expression they have in mind is wildly different from anything stored in the corpus.  At the moment, it can do quite a good job of suggesting different word forms, giving some collocation suggestions and using other resources to suggest words with a similar meaning.  But sometimes students use words together in ways that (unless they want to use language very creatively) would stump most information retrieval systems.

Another aspect which I could develop would be the DIY text tools, which currently start to slow down quite rapidly when reading more than 80,000 words or so.  That would need a change of underlying data management, even without changing any of the features that the user sees.  I added those features in the last month or two before my current cohort of students were to start their projects, and again, feedback on those tools and some of the experimental features would be really useful.  On the other hand, I point my own students to tools like WordSmith Tools and AntConc when it comes to handling larger amounts of text!

The other thing, of course, is that I’m looking forward to getting hold of the BNC 2014 and adding another corpus or two.  Again, I can’t compete with the enormous corpora available elsewhere, but since most of the features I’m trying to help students notice differ across genre, register and style, I am quite keen on moderately sized corpora which have clearly defined sub-corpora or plenty of metadata.

One thing I would like to explore is porting The Prime Machine to Mac OS, and also possibly to mobile devices and tablets.  But as it stands, using The Prime Machine requires the kind of time commitment and concentration (and multiple searches and shuffling of results) that may not be so suitable for mobile phones.  I sometimes think it is more like the way we’d hunt for a specialist item on Taobao or Ebay when we’re not sure of a brand or even a product name, rather than the kind of Apps we tend to expect from our smart phones which provide instant ready-made answers.  Redesigning it for mobile use will need some thought.

Personally, I’m hoping to start one or two new projects, perhaps working with Chinese and English or looking more generally at Computer Assisted Language Teaching.  

Now that The Prime Machine is available, while of course it would be great if people use it and find it useful, more importantly beyond China I think I’d hope that it could inspire others to try creating new tools.  If someone says to the developer working on their new corpus web interface, “Do you think you could make a display that looks a bit like that?”, or “Can you pull in other data resources so those kinds of suggestions will pop up?”, I think they wouldn’t find it difficult, and we’d probably have more web tools which are a bit more user-friendly in terms of operation and more intuitive in terms of support for interpretation of the results. 

6. What other corpus tools do you recommend for teachers and students?

Well, I love seeing the enhancements and new features we get with new versions of popular corpus tools.  And at conferences, I’m always really impressed by some of the new things people are doing with web-based tools.   But one thing that I would say is that for the students I work with, I think knowing a bit more about the corpus is more useful than having something billions of words in size; being able to explore a good proportion of concordance lines for a mid-frequency item is great.  I think having a list of collocations or lines from millions of different sources to look at isn’t going to help language learners become familiar with the idea that concordance lines and corpus data can help them understand, explore and remember more about how to use words effectively. 

Nevertheless, I think those of us outside Europe should be quite jealous of the Europe-wide university access to Sketch Engine that’s just started for the next 5 years.  I also really like the way the BYU tool has developed.  I was thrilled to get hold of the MAT software for multidimensional analysis.  And I think I’ll always have my WordSmith Tools V4 on my home computer, and a link to our university network version of WordSmith Tools in my office and in the computer labs I use.

Thanks for reading. Do note that if you comment here I need to forward comments to Stephen (as he is behind the great firewall of China), so there may be a delay in any feedback. Alternatively, contact Stephen yourself from the main The Prime Machine website.

Also do note that the current available version of The Prime Machine may not work at the moment but wait a few days for a fix to be applied by Stephen and try again then.

Finding relative frequencies of tenses in the spoken BNC2014 corpus

Ginseng English‏ @ginsenglish issued a poll on twitter asking:

This is a good exercise to do on the new spoken BNC2014 corpus. See instructions to get access to the corpus.

You need to get your head around the part-of-speech (POS) tags. The BNC2014 uses the CLAWS 6 tagset. For the past tense we can use the past tense of lexical verbs and the past tense of DO. Using the past tenses of BE and HAVE would also pull in their uses as auxiliary verbs, which we don’t want. Figuring out how to filter out such uses could be a neat future exercise. Another time! On to this post.

Simple past:

[pos = "VVD|VDD"]


pos = part of speech

VVD = past tense of lexical(main) verbs

VDD = past tense of DO

| = acts like an OR operator

So the above looks for parts of speech tagged as either the past tense of lexical verbs or the past tense of DO.
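Since the pos values are regular expressions over CLAWS tags, the behaviour of | (and of .*, used in the longer queries below) can be sanity-checked with Python's re module. The tag list here is illustrative, not the full CLAWS 6 tagset:

```python
import re

# "|" is alternation: the pattern matches either tag exactly.
past_tense = re.compile(r"VVD|VDD")

tags = ["VVD", "VDD", "VBD", "VVZ", "VVN"]
print([t for t in tags if past_tense.fullmatch(t)])  # → ['VVD', 'VDD']

# ".*" matches any sequence of characters, so "V.*N" covers
# participle tags like VVN.
participle = re.compile(r"V.*N")
print([t for t in tags if participle.fullmatch(t)])  # → ['VVN']
```

The corpus query engine applies these patterns to whole tags, which is why fullmatch (rather than search) is the right analogy here.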

Simple present

The search term for present simple is also relatively simple to wit:

[pos = "VVZ"]


VVZ     -s form of lexical verb (e.g. gives, works)

Note that the above captures only third person singular forms; how can we also catch first and second person forms?

Present perfect

[pos = "VH0|VHZ"] [pos = "R.*|MD|XX" & pos != "RL"]{0,4} [pos = "AT.*|APPGE"]? [pos = "JJ.*|N.*"]? [pos = "PPH1|PP.*S.*|PPY|NP.*|D.*|NN.*"]{0,2} [pos = "R.*|MD|XX"]{0,4} [pos = "V.*N"]

The search term for the present perfect may seem daunting; don’t worry, the structure is fairly simple. The first term [pos = "VH0|VHZ"] says look for present-tense uses of HAVE and the last term [pos = "V.*N"] says look for past participles.

The other terms are looking for optional adverbs and noun phrases that may come in-between namely

“adverbs (e.g. quite, recently), negatives (not, n’t) or multiword adverbials (e.g. of course, in general); and noun phrases: pronouns or simple NPs consisting of optional premodifiers (such as determiners, adjectives) and nouns. These typically occur in the inverted word order of interrogative utterances (Has he arrived? Have the children eaten yet?)” – Hundt & Smith (2009).

Present progressive

[pos = "VBD.*|VBM|VBR|VBZ"] [pos = "R.*|MD|XX" & pos != "RL"]{0,4} [pos = "AT.*|APPGE"]? [pos = "JJ.*|N.*"]? [pos = "PPH1|PP.*S.*|PPY|NP.*|D.*|NN.*"]{0,2} [pos = "R.*|MD|XX"]{0,4} [pos = "VVG"]

A similar structure to the present perfect search. The first term [pos = "VBD.*|VBM|VBR|VBZ"] looks for past and present forms of BE, and the last term [pos = "VVG"] for the -ing participle of lexical verbs. The terms in between are for optional adverbs, negatives and noun phrases.

Note that all these searches are approximate – manual checking will be needed for more accuracy.

So can you predict the order of these forms? Let me know in the comments the results of using these search terms in frequency per million.
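If you run the searches yourself, raw hit counts can be converted to frequency per million words so results are comparable across corpora of different sizes. A minimal helper is sketched below; the hit count is a made-up placeholder, and the corpus size is only the approximate figure usually quoted for the Spoken BNC2014 (roughly 11.5 million words):

```python
# Convert a raw hit count into frequency per million words,
# the unit the corpus interface reports.

def per_million(hits, corpus_size):
    return hits / corpus_size * 1_000_000

SPOKEN_BNC2014 = 11_500_000  # approximate word count, an assumption here

# placeholder count, not a real search result:
print(round(per_million(230_000, SPOKEN_BNC2014), 1))  # → 20000.0
```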

Thanks for reading.

Other search terms in spoken BNC2014 corpus.


Ginseng English blogs about frequencies of forms found in one study. Do note that as there are 6 inflectional categories in English – infinitive, first and second person present, third person singular present, progressive, past tense, and past participle – the opportunities to use the simple present form are greater due to the two categories of present.


Hundt, M., & Smith, N. (2009). The present perfect in British and American English: Has there been any change, recently? ICAME Journal, 33(1), 45–64.

Successful Spoken English – interview with authors

The following is an email interview with the authors, Christian Jones, Shelley Byrne, Nicola Halenko, of the recent Routledge publication Successful Spoken English: Findings from Learner Corpora. Note that I have not yet read this (waiting for a review copy!).

Successful Spoken English

1. Can you explain the origins of the book?

We wanted to explore what successful learners do when they speak and in particular learners from B1-C1 levels, which are, we feel, the most common and important levels. The CEFR gives “can do” statements at each level but these are often quite vague and thus open to interpretation. We wanted to discover what successful learners do in terms of their linguistic, strategic, discourse and pragmatic competence and how this differs from level to level.  

We realised it would be impossible to use data from all the interactions a successful speaker might have so we used interactive speaking tests at each level. We wanted to encourage learners and teachers to look at what successful speakers do and use that, at least in part, as a model to aim for as in many cases the native speaker model is an unrealistic target.

2. What corpora were used?

The main corpus we used was the UCLan Speaking Test Corpus (USTC). This contained data only from students, from a range of nationalities, who had been successful (based on holistic test scoring) at each level, B1-C1. As points of comparison, we also recorded native speakers undertaking each test. We also made some comparisons to the LINDSEI (Louvain International Database of Spoken English Interlanguage) corpus and, to a lesser extent, the spoken section of the BYU-BNC corpus.

Test data does not really provide much evidence of pragmatic competence so we constructed a Speech Act Corpus of English (SPACE) using recordings of computer-animated production tasks by B2 level learners  for requests and apologies in a variety of contexts. These were also rated holistically and we used only those which were rated as appropriate or very appropriate in each scenario. Native speakers also recorded responses and these were used as a point of comparison. 

3. What were the most surprising findings?

In terms of the language learners used, it was a little surprising that as levels increased, learners did not always display a greater range of vocabulary. In fact, at all levels (and in the native speaker data) there was a heavy reliance on the top two thousand words. Instead, it is the flexibility with which learners can use these words which changes as the levels increase, so they begin to use them in more collocations and chunks and with different functions. There was also a tendency across levels to favour chunks which can be used for a variety of functions. For example, although we can presume that learners may have been taught phrases such as ‘in my opinion’, this was infrequent; instead they favoured ‘I think’, which can be used to give opinions, to hedge, to buy time, etc.

In terms of discourse, the data showed that we really need to pay attention to what McCarthy has called ‘turn grammar’. A big difference as the levels increased was the increasing ability of learners to co-construct conversations, developing ideas from and contributing to the turns of others. At B1 level, understandably, the focus was much more on the development of their own turns.

4. What findings would be most useful to language teachers?

Hopefully, in the lists of frequent words, keywords and chunks they have something which can inform their teaching at each of these levels. It would seem to be reasonable to use, as an example, the language of successful B2 level speakers to inform what we teach to B1 level speakers. Also, though tutors may present a variety of less frequent or ‘more difficult’ words and chunks to learners, successful speakers will ultimately employ lexis which is more common and more natural sounding in their speech, just as the native speakers in our data also did.

We hope the book will also give clearer guidance as to what the CEFR levels mean in terms of communicative competence and what learners can actually do at different levels. Finally, and related to the last  point, we hope that teachers will see how successful speakers need to develop all aspects of communicative competence (linguistic, strategic, discourse and pragmatic competence) and that teaching should focus on each area rather than only one of two of these areas.

There has been some criticism, notably by Stefan Th. Gries and collaborators, that much learner corpus research restricts itself to too few factors when explaining a linguistic phenomenon. Gries calls for a multifactorial approach, whose power can be seen in a 2014 study conducted with Sandra C. Deshors on the uses of may, can and pouvoir by native English users and French learners of English. Using nearly 4,000 examples from 3 corpora, annotated with over 20 morphosyntactic and semantic features, they found, for example, that French learners of English treat pouvoir as closer to can than to may.

The analysis for Successful Spoken English was described as follows:

“We examined the data with a mixture of quantitative and qualitative data analysis, using measures such as log-likelihood to check the significance of frequency counts but then manual examination of concordance lines to analyse the function of language.”

Hopefully, with the increasing use of multifactorial methods, learner corpus analysis can yield even more interesting and useful results than current approaches allow.

Chris and his colleagues kindly answered some follow-up questions:

5. How did you measure/assign CEFR level for students?  

Students were often already in classes where they had been given a proficiency test and placed in a level. We then gave them our speaking test and only took data from students who had been given a global pass score of 3.5 or 4 (on a scale of 0-5). The borderline pass mark was 2.5, so we only chose students who had clearly passed but were not at the very top of the level, and obviously only those who gave us permission to do so. The speaking tests we used were based on Canale’s (1984) oral proficiency interview design and consisted of a warm-up phase, a paired interactive discussion task and a topic-specific conversation based on the discussion task. Each lasted between 10 and 15 minutes.

6. So most of the analysis was in relation to successful students who were measured holistically?  


7. And could you explain what holistically means here?

Yes, we looked at successful learners at each CEFR level, according to the test marking criteria. They were graded for grammar, vocabulary, pronunciation, discourse management and interactive ability based on criteria such as the following (grade 3-3.5) for discourse management: ‘Contributions are normally relevant, coherent and of an appropriate length’. These scores were then amalgamated into a global score. These scales are holistic in that they try to assess what learners can do in terms of these competences to gain an overall picture of their spoken English rather than ticking off a list of items they can or cannot use.

8. Do I understand correctly that comparisons with native speaker corpora were not as much used as with successful vs unsuccessful students? 

No, we did not look at unsuccessful students at all. We were trying to compare successful students at B1-C1 levels and to draw some comparison to native speakers. We also compared our data to the LINDSEI spoken learner corpus to check the use of key words.

9. For the native speaker comparisons what kind of things were compared?

We compared each aspect of communicative competence – linguistic, strategic, discourse and pragmatic competences to some degree. The native speakers took exactly the same tests so we compared (as one example), the most frequent words they used.


Thanks for reading.



Deshors, S. C., & Gries, S. T. (2014). A case for the multifactorial assessment of learner language. Human Cognitive Processing (HCP), 179.


CORE blimey – genre language

A #corpusmooc participant, answering a discussion question on what they would like to use corpora for, replied that they wanted a reference book showing common structures in various genres such as “letters of condolence, public service announcements, obituaries”.

The CORE (Corpus of Online Registers of English) corpus at BYU, along with the virtual corpora feature, offers a way to approach this.

For example, the screenshot below shows the keywords of verbs & adjectives in the Reviews genre:

Before I briefly show how to make a virtual corpus, do note that the standard interface allows you to do a lot of things with the various registers. The CORE interface shows you examples of this. For example, the following shows the distribution of the present perfect across the genres:

Create virtual corpora

To create a virtual corpus first go to the CORE start page:

Then click on Texts/Virtual and get this screen:

Next press Create corpus to get this screen:

We want the Reviews genre, so choose it from the drop-down box:

Then press Submit to get the following screen:

Here you can either accept these texts or, if say you want to build only a film review corpus, manually look through the links and filter for film reviews only. Give your corpus a name or add it to an already existing corpus. Here we give it the name “review”:

Then, after submitting, you will be taken to the following screen, which shows your whole virtual corpora collection; we can see the corpus we just created at number 5:

Now you can list keywords.
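Under the hood, a keyword list ranks words that are unusually frequent in your virtual corpus relative to a reference corpus. BYU computes this for you with its own statistics, but as a rough illustration of the idea only (a minimal sketch I wrote myself, not BYU’s actual method), a smoothed frequency-ratio version might look like this:

```python
from collections import Counter

def keywords(target_tokens, reference_tokens, min_freq=2):
    """Rank words by how much more frequent they are in the target
    (e.g. a virtual Reviews corpus) than in a reference corpus,
    using a simple smoothed frequency ratio."""
    t, r = Counter(target_tokens), Counter(reference_tokens)
    nt, nr = len(target_tokens), len(reference_tokens)
    scores = {
        w: (c / nt) / ((r[w] + 1) / (nr + 1))  # +1 smoothing avoids division by zero
        for w, c in t.items()
        if c >= min_freq  # ignore one-off words
    }
    return sorted(scores, key=scores.get, reverse=True)

# toy data: a tiny "reviews" corpus vs a tiny general reference
reviews = "great film great plot dull ending great acting".split()
general = "the film was long and the plot was thin".split()
print(keywords(reviews, general))  # → ['great']
```

With real corpora you would feed in full token lists and get a ranked keyword list much like the screenshot above.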

Do note that the virtual corpora feature is available in most of the BYU collection, so if genre is not your thing, one of the other corpora might be useful.

Thanks for reading and do let me know if anything appears unclear.


#TESOL2017 – Corpus related talks and posters

While IATEFL2017 may well have the razzle-dazzle, TESOL2017 is the big kahuna. Find below corpus-related talks and posters (program pdf). There are some well-known names here – Kiyomi Chujo, Randi Reppen, Diane Schmitt, Dilin Liu, Keith Folse.

Does TESOL record talks like IATEFL does? If not, I am putting my faith in some tweeters to get an inkling of what goes down. You know what to do, folks.

Tuesday 21 March
Developing Academic Discourse Competence Through Formulaic Sequences
Content Area: Vocabulary/Lexicon
The Academic Formulas List and Phrasal Expressions List include formulaic sequences that build on traditional lists, such as the Academic Word List, to better meet student proficiency needs at the discourse level. Participants investigate the lists; experience collaborative activities designed to assist students in acquisition, including online and corpus-based; and discuss considerations for adaptation and implementation. Step-by-step guides provided.
Alissa Nostas, Arizona State University, USA
Mariah Fairley, American University in Cairo, Egypt
Susanne Rizzo, American University in Cairo, Egypt

Wednesday 22 March
Engaging Students in Making Grammar Choices: An In‑Depth Approach
Content Area: Grammar
Appropriate use of grammar structures in academic writing can be a challenge even for advanced ESL writers. Drawing on corpus research on the characteristics of written discourse, the presenters demonstrate how to engage students in making effective grammar choices to improve their academic writing. Sample instructional materials are provided.
Wendy Wang, Eastern Michigan University, USA
Susan Ruellan, Eastern Michigan University, USA

Lexical Bundles in L1 and L2 University Student Argumentative Essays
Content Area: Second Language Writing/Composition
This presentation reports findings of a corpus-based analysis of the use, overuse, and misuse of lexical bundles in L2 university student argumentative essays. The presentation also provides ways ESL composition instructors can assist learners in using lexical bundles more appropriately.
Tetyana Bychkovska, Ohio University, USA

Teachers’ U.S. Corpus
Content Area: Research/Research Methodology
The presenters amassed a linguistic corpus (TUSC) representing approximately 4 million words based on over 50 K–12 content-area textbooks. Findings of the corpus, including word lists representative of academic language, are offered. Participants are invited to discuss ways this corpus may assist K–12 teachers, especially teachers of ELLs.
Seyedjafar Ehsanzadehsorati, Florida International University, USA

And Furthermore
Content Area: Discourse and Pragmatics
Advanced learner materials offer few guidelines for the use of the expressions “moreover,” “furthermore,” “in fact,” “likewise,” “in turn,” and other additive connectors. Grounded in pragmatic theory and drawing on written corpus examples and experimental speaker judgement data, this talk defines optimal uses and paves a path to enlightened class instruction.
Howard Williams, Teachers College, Columbia University, USA

Teacher Electronic Feedback in ESL Writing Course Chats
Content Area: Second Language Writing/Composition
This corpus-based study analyzes the rhetorical moves, uptake, and student perceptions of the teacher-student chats from five freshman ESL writing courses taught by three expert teachers. Findings show that chats are useful for establishing rapport and clarifying feedback, but we suggest that longer chat sessions may be more effective.
Estela Ene, Indiana University Purdue University Indianapolis, USA
Thomas Upton, Indiana University Purdue University Indianapolis, USA

Using Corpus Linguistics in Teaching ESL Writing
Content Area: Applied Linguistics
This session explores the use of corpus linguistics in teaching L2 writing as an effective way to bring authentic language into the classroom. The presenters discuss ways of incorporating corpora in teaching L2 writing and demonstrate a sample activity of how to use a corpus to address discourse competence.
Gusztav Demeter, Case Western Reserve University, USA
Ana Codita, Case Western Reserve University, USA
Hee-Seung Kang, Case Western Reserve University, USA

How Technology Shapes Our Language and Feedback: Mode Matters
Content Area: Applied Linguistics
This presentation explores how the use of evaluative language differs between parallel corpora of text and screencast feedback and what this means for the role of feedback and position of instructor. In understanding the implications of technology choices, instructors can better match tools to their pedagogical purposes.
Kelly Cunningham, Iowa State University, USA

An Effective Bilingual Sentence Corpus for Low-Proficiency EFL Learners
Content Area: CALL/Computer-Assisted Language Learning/
Technology in Education
Kiyomi Chujo, Nihon University, Japan

Propositional Precision in Learner Corpora: Turkish and Greek EFL Learners
Content Area: English as a Foreign Language
Jülide Inözü, Cukurova University, Turkey
Cem Can, Cukurova University, Turkey

Thursday 23 March
Corpus‑Based Learning of Reporting Verbs in L2 Academic Writing
Content Area: Higher Education
We present findings from our study on the effectiveness of corpus-based learning of reporting verbs during a multidraft literature review assignment. The results suggest corpus-based instruction can improve L2 students’ genre awareness and lexical variety without time-consuming training. Participants receive sample corpus-based teaching materials used in the revision workshop.
Ji-young Shin, Purdue University, USA
R. Scott Partridge, Purdue University, USA
Ashley J. Velázquez, Purdue University, USA
Aleksandra Swatek, Purdue University, USA
Shelley Staples, University of Arizona, USA

Providing EAP Listening Input: An Evaluation of Recorded Listening Passages
Content Area: Listening, Speaking/Speech
Are the recorded passages that accompany listening textbooks providing students with exposure to all the necessary elements of academic lecture language? The presenter shares results of a corpus-based study, illustrating what recorded passages do well, where they fall short, and providing activities designed to supplement EAP listening instruction.
Erin Schnur, Northern Arizona University, USA

Developing Learner Resources Using Corpus Linguistics
Randi Reppen, Northern Arizona University, USA

Applying Research Findings to L2 Writing Instruction
Content Area: Second Language Writing/Composition
Effective pedagogical practices have a strong research base and respond directly to students’ learning needs. Presenters share materials developed for such needs in EAP writing classrooms, drawing on grammar/vocabulary corpus research, integration of CBI principles with current L2 writing approaches, and research findings regarding assignment sequencing for larger end-products.
Margi Wald, UC Berkeley, USA
Jan Frodesen, UC Santa Barbara, USA
Diane Schmitt, Nottingham Trent University, United Kingdom (Great Britain)
Gena Bennett, Independent, USA

Teaching Students Self‑Editing in Writing With Interactive Online Corpus Tool
Content Area: CALL/Computer-Assisted Language Learning/
Technology in Education
L2 academic writers often struggle with word choice and collocates when composing in academic English. In this teaching tip, the presenter uses a free corpus-based online interactive tool to show how to teach self-editing strategies to L2 writers and demonstrates activities that can be incorporated into EAP writing courses.
Aleksandra Swatek, Purdue University, USA

Corpus 101: Navigating the Corpus of Contemporary American English (COCA)
Content Area: Vocabulary/Lexicon
The Corpus of Contemporary American English (COCA) may look overwhelming at first, but it is in fact an easy-to-use resource. Presenters guide participants through step-by-step navigation of this valuable tool, sharing tips and ideas for teachers and tasks for students that relate to several of COCA’s search and analysis functions.
Heather Gregg Zitlau, Georgetown University, USA
Heather Weger, Georgetown University, USA
Kelly Hill Zirker, Diplomatic Language Services, USA

Using a Medical Research Corpus to Teach ESP Students
Content Area: English for Specific Purposes
The study discussed investigated how expert writers use lexical bundles in medical research articles. More than 200 bundles were identified using a corpus of more than 1 million words. A structural and functional analysis revealed patterns that can be used in developing materials for medical students in international ESP classes.
Ndeye Bineta Mbodj, Health Department Thies University, Senegal

Using Corpora for Engaging Language Teaching: Effective Techniques and Activities
Using concrete examples from their new book published by TESOL, the presenters introduce some common useful procedures and activities for using corpora to teach various aspects of English, including vocabulary, grammar, and writing. They also explain how to develop and use corpora to assess learner language and develop teaching materials.
Dilin Liu, University of Alabama, USA
Lei Lei, Huazhong University of Science and Technology, China

Flexible, Free, and Open Data‑Driven Learning for the Masses
Content Area: Media (Print, Broadcast, Video, and Digital)
This presentation shares findings from multisite research with the open-source FLAX (Flexible Language Acquisition) project. Open digital collections used in formal classroom-based language education and in non-formal online education (MOOCs) are presented to demonstrate how openly licensed linguistic content using data-driven methods can support learning, teaching, and materials development.
Alannah Fitzgerald, Concordia University, USA

Visualizing Vocabulary Across Cultures: Web Images as a Corpus
Content Area: Vocabulary/Lexicon
Cameron Romney, Doshisha University, Japan
John Campbell-Larsen, Kyoto Women’s University, Japan

Developing Autonomous Academic Writing Competence Through Corpus Linguistics
Content Area: CALL/Computer-Assisted Language Learning/
Technology in Education
Chinger Zapata, Universidad Católica del Norte, Chile
Hugo Keith

Data-Driven Learning (DDL) for Teaching Vocabulary and Grammar
Content Area: Teaching Methodology and Strategy
Pramod Sah, University of British Columbia, Canada
Anu Upadhaya, Tribhuvan University, Nepal

Friday 24 March
16 Keys to Teaching ESL Grammar and Vocabulary
Content Area: Grammar
This session uses corpus linguistics data to examine not only which grammar points should be taught but which vocabulary should be taught with each key grammar point. Sample lessons for teaching vocabulary with grammar and tips for designing and teaching these activities are presented.
Keith Folse, University of Central Florida, USA

Beyond Word Lists: Approaching Verbal Complements Lexicogrammatically and Cognitively
Content Area: Grammar
Gerund and infinitive verbal complements are often taught back-to-back via the use of memorization and word lists. This presentation suggests varying lesson placement, approaching the subject from a position of conceptualization of components drawn from Conti’s rule, and incorporating corpus data in classroom materials to improve salience thereof.
Miranda Hartley, University of Alabama, USA

Corpus‑Based Comparison Between Two Lists of Academic English Words
Content Area: Vocabulary/Lexicon
The study discussed compares Coxhead’s Academic Word List and Gardner and Davies’ Academic Vocabulary List in an independently developed 72-million-token university academic corpus to reveal which list is more suitable for academic vocabulary education across different academic disciplines to improve the effectiveness of English‑medium instruction.
Huamin Qi, Western University, Canada

Fostering Effective Participation in L1 Discourse Communities Through Formulaic Sequences
Content Area: Vocabulary/Lexicon
While vocabulary lists contribute substantially to lexical knowledge, discourse-level proficiency remains a challenge. The Academic Formulas List and Phrasal Expressions List, sets of formulaic sequences, address this challenge, helping learners participate more effectively in L1 discourse communities. Facilitators share online and corpus-based activities for formulaic sequence acquisition.
Susanne Rizzo, American University in Cairo, Egypt
Alissa Nostas, Arizona State University, USA
Mariah Fairley, American University in Cairo, Egypt

Developing an Open Educational Resources EAP Corpus
Content Area: English for Specific Purposes
This presentation focuses on the development of an open educational resources EAP corpus. Presenters demonstrate how the corpus can be accessed and downloaded, reused in a variety of ways, revised, remixed, and redistributed to other interested teachers, researchers, and/or students.
Brent Green, Salt Lake Community College, USA
Dean Huber, Salt Lake Community College, USA
George Ellington, Salt Lake Community College, USA

The Emergence of Academic Language Among Advanced Learners
Content Area: Second Language Writing/Composition
This session addresses the gradual changes of academic language based on a pilot study of 35 students over a 16-week graduate course. Suggestions and practical activities, informed by these findings, are demonstrated, including academic discourse techniques and the use of corpora and other online tools for text analysis.
Cheryl Zimmerman, California State University, Fullerton, USA
Jun Li, California State University, Fullerton, USA

Alphabet Street aka Corpus Symposium at VRTwebcon 8

I was delighted to be able to take part in my first webinar as a presenter. Leo Selivan (@leoselivan) asked me to join the corpus symposium for the 8th VRT web conference alongside Jennie Wright (@teflhelper) and Sharon Hartle (@hartle). You can find links to our talks at the end of this post, as well as my slides.

Presenting in a webinar is definitely a unique experience – like talking to yourself while knowing others are watching and listening in. Other things to note: make sure your microphone is loud enough, and that powerpoints uploaded to online systems like Adobe Connect don’t show your slide notes!

My talk was about using the BYU-Wikipedia corpus to help recycle coursebook vocabulary and was titled Darling (BYU) Wiki, in homage to the recent passing of the great musician Prince. Another webinar note – people can’t hear the music from your computer if you have headphones on!

As I have already posted about using BYU-Wiki for vocabulary recycling, in this post I want to give some brief notes on designing worksheets using some principles from the research literature. When talking about the slide below, I did not really explain in the talk what input enhancement and input flood are. I also did not point out that my adaptation from Barbieri & Eckhardt (2007) was very loose : ).


“Input enhancement draws learners’ attention to targeted grammatical features by visually or acoustically flagging L2 input to enhance its perceptual saliency but with no guarantee that learners will attend to the features” (Kim, 2006: 345).

For written text, enhancements include things such as underlining, bolding, italicizing, capitalizing, and colouring. Note that the KWIC output from COCA uses colour to label parts of speech.
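As a toy illustration of written input enhancement (my own sketch, not from the talk or from Kim’s paper), here is how target collocations in a worksheet text might be flagged with markdown-style bolding:

```python
import re

def enhance(text, targets):
    """Bold each occurrence of the target words/phrases (markdown
    **...**) to raise their perceptual saliency in a written text."""
    for t in targets:
        # whole-word, case-insensitive match of the target phrase
        pattern = re.compile(r"\b" + re.escape(t) + r"\b", re.IGNORECASE)
        text = pattern.sub(lambda m: "**" + m.group(0) + "**", text)
    return text

sample = "She made a decision quickly and then made progress on the report."
print(enhance(sample, ["made a decision", "made progress"]))
# → She **made a decision** quickly and then **made progress** on the report.
```

The same idea extends to underlining or colouring, depending on what your worksheet format supports.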

Input flood similarly enhances saliency, through sheer frequency, and draws on studies showing the importance of repetition in language learning.

Szudarski & Carter (2015) concluded that a combination of input enhancement and input flood can lead to performance gains in collocational knowledge.

Hopefully this post has briefly highlighted some points I did not cover in my 20-minute talk. A huge thanks to those who took the time to attend, to Leo and Heike Philp (@heikephilp) for organizing things smoothly, and to my co-presenters Jennie and Sharon. Do browse the recordings of the other talks as there are some very interesting ones to check out.

Talk recording links, slides and related blog posts

Jennie Wright, Making trouble-free tasks with corpora

Sharon Hartle, SkELL as a Key to Unlock Exam Preparation

Mura Nava, Darling (BYU) Wiki

Question and Answer Round

My talk slides (pdf)

Summary Post by Sharon Hartle

8th Virtual Round Table Web Conference 6-8 May 2016 program overview

References and further reading:

Barbieri, F., & Eckhardt, S. E. (2007). Applying corpus-based findings to form-focused instruction: The case of reported speech. Language Teaching Research, 11(3), 319-346.

Han, Z., Park, E. S., & Combs, C. (2008). Textual enhancement of input: Issues and possibilities. Applied Linguistics, 29(4), 597-618.

Kim, Y. (2006). Effects of input elaboration on vocabulary acquisition through reading by Korean learners of English as a foreign language. TESOL Quarterly, 40(2), 341-373.

Szudarski, P., & Carter, R. (2015). The role of input flood and input enhancement in EFL learners’ acquisition of collocations. International Journal of Applied Linguistics.