Corpus linguistics community news 6

Although some say G+ is a dying forum, the CL G+ group now has 482 members, nice. Admittedly not many interact, but I do (like to) think a lot of them appreciate the resources posted there. I had the pleasure of being a mentor on the Lancaster University FutureLearn Corpus Linguistics MOOC last September; it will run again next September, so look out for that.

First up for this installment of news, I check the claim that “elementary, my dear Watson” was never used in the Sherlock Holmes stories.

Next I challenge readers to check how many business idioms from a list can be accounted for in relevant corpora.

There is a useful table of search syntax for the COCA interface.

My search for publicly available spoken corpora.

A list of phrasal verbs Jeremy Corbyn used in his final rally speech before becoming Labour leader.

Using the SKELL interface to get some good examples for a review quiz.

Some AntConc alternatives.

Bypassing limits of spreadsheet rows.

Finally a description of using a scraping tool to get biology and medical abstracts in simpler language that might be suitable for EAP students.

Thanks for reading. And don’t forget to check out the previous corpus linguistics community news if you haven’t already.


What is the ideal title for a talk/poster at IATEFL and TESOL 2015?

Although a lot of criticism can be made of mainstream teaching conferences, they are not going away (yet?). As a way to get ready for the IATEFL 2015 conference I thought it would be interesting to see what kinds of titles were most common.

Professional development in the classroom – this is the ideal title if you want to present at IATEFL and TESOL 2015.

Of the 680 titles (including talks, posters and forums) at IATEFL 2015 and the 1092 titles at TESOL 2015, the top two-word bundles are (IATEFL and TESOL counts respectively):
professional development (16, 15)
in the (26, 42)
the classroom (11, 15)

Furthermore, “in” + elt/tesol and “in a” are also common bundles.

If you want to aim only for IATEFL then use a “how to” bundle in your title, as this tops IATEFL 2015 with 28 instances.

By contrast, if you want to target TESOL then use a “strategies for” bundle, which has 16 instances.

In addition, TESOL titles prefer “language learners” (14) while IATEFL prefers “language learning” (10). Do also make sure you give added value for TESOL, since “teaching and” and “and the” are common bundles.

If you want, you can download the IATEFL 2015 and TESOL 2015 titles yourself to explore (there may be some errors in terms of duplicates and/or missing titles). I used AntConc to count the 2-gram bundles. All the bundles above were taken from the top 20.
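For anyone who wants to replicate the count outside AntConc, a minimal Python sketch of the same 2-gram tally follows; note the titles below are invented for illustration, not taken from the actual conference programmes:

```python
from collections import Counter
import re

def bigram_bundles(titles):
    """Count two-word bundles across a list of titles, lowercased,
    mirroring AntConc's 2-gram cluster count."""
    counts = Counter()
    for title in titles:
        words = re.findall(r"[a-z']+", title.lower())
        counts.update(zip(words, words[1:]))
    return counts

# Invented example titles, purely to show the mechanics
titles = [
    "Professional development in the classroom",
    "Technology in the classroom",
    "Rethinking professional development",
]
top = bigram_bundles(titles).most_common(3)
```

The same function scales to the full title lists with no changes; only the input differs.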

One could explore the top 20 keywords to add another perspective, or count titles over the years. A look at other major ELT conferences would also be interesting. Someone may also be interested in counting the gender of the presenters.

Thanks for reading.

TESOL France 2013 – slides, handout, notes

Hmm, going to be a dull week to face after attending the 32nd annual international TESOL France shindig. Loads of great talks (e.g. see reports on materials writing in ELT by Jonathan Sayers, @jo_sayers; experimental practice in ELT by Lexical Leo, @leoselivan; tailoring ESP courses by Kirstin Lahaye, @kirstinlahaye), amiable company and Celtic dancing.

Seems to be a tradition to post up slides of the talks so I will duly comply. A huge thanks to the folks who came to spend some time with me, and who asked some great questions. A huge thanks to the TESOL committee, volunteers and all attendees, everything seemed to run like clockwork.

Using concordance software to inform classroom practice:



NB1: for a video walkthrough of the getting your own texts part of the talk please have a look at this:

NB2: my reference to a diy corpus as a Frankencorpus, building your own body, comes from a Rob Troyer IALLT 2013 presentation. The Frankenstein icon was made by

And as Scott Thornbury said in his plenary, the body (corpus) remembers! So why not make use of such memories in your classroom?

Building your own corpus – using general English lexis

I was reading a post on money idioms by Martin Sketchley/@ELTExperiences and was reminded that a lot of the general English lexis and topics which crop up in coursebooks can be a good way into any specialised corpus you build. Usually I work with a single text with my class and then use analysis from my corpus to focus on specific lexis, i.e. multimedia lexis. What you can also do is look at general English words; in this post I will look at the words pay, payments, money, cost, costs, cheap and cheaper.

Note you can use the asterisk (*) wildcard to search for all forms of the words you are interested in, e.g. pay*, cost* and cheap*.

Pay, payments, money, cost, costs, cheap and cheaper return 17, 8, 16, 10, 10, 11 and 7 hits respectively. The following is an example of the pay concordance lines:


There is an option in Tool Preferences>Category>Concordance to Hide search term in KWIC display, which will take out your word and replace it with asterisks, e.g. for the payments concordance lines:


You would then give out the concordances in seven groups along with the seven missing words and ask your students to fill in the gaps. It’s a good idea to examine your concordance lines beforehand in case there is co-text that may be too difficult or needs explaining.

If students have not seen concordance lines before, they should be introduced to the format beforehand.

So for example take a line from each concordance group:

1. Yes, thats right, IE7 users visiting Kogan will pay more than those using a modern web browser. Or
2. H.264 support to decode video. That means royalty payments are covered by hardware makers, not Mozil
3. lla) that Google is willing to spend that kind of money just to keep Microsoft from starting a partn
4.numbers, the Google Maps API will soon be another cost to factor into the plan. And that may be enou
5. some developers were worried about the potential costs. The vast majority of maps hackers and casua
6.ep prices for all smart shoppers down. Sure its a cheap, attention-getting gimmick, but who hasnt wa
7.t if its just backups you want, clearly there are cheaper alternatives. The real appeal of Cloud Dri

and ask students to try to work out the context in terms of the topic text: what comes before the fragment and what comes after. Note that if you want to expand the range of the sentence extract, adjust the Search Window Size; the default is 50 characters to the left and right of the search word.
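If you ever want to generate these gapped lines outside AntConc, a rough Python equivalent of the KWIC display, with a window setting and a hide-search-term option, might look like this (the sample text is invented):

```python
import re

def kwic(text, word, window=50, hide=False):
    """Return concordance lines: `window` characters of co-text either
    side of each hit, optionally masking the search word with asterisks
    (roughly AntConc's Hide search term option)."""
    lines = []
    for m in re.finditer(rf"\b{re.escape(word)}\b", text, re.IGNORECASE):
        left = text[max(0, m.start() - window):m.start()]
        hit = "*" * len(m.group()) if hide else m.group()
        right = text[m.end():m.end() + window]
        lines.append(f"{left}{hit}{right}")
    return lines

sample = "Users visiting the site will pay more. Some worried about what they pay for."
```

Calling `kwic(sample, "pay", hide=True)` gives the gapped lines ready to hand out.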

After the gap-filling exercise, divide students into seven groups and ask each group to examine one set of concordance lines for any patterns they notice.

There is plenty of scope to use general English in terms of grammatical forms as well, which I’ll talk about in another post.

Thanks for reading.

Building your own corpus – first steps in AntConc

Another installment in building your own corpus; check out the previous ones if you haven’t already.

This time we’ll look at the first steps to take in AntConc (Wordlist, Keyword List and saving settings) once you have your text file(s) as you want them.

Step one would be to make a wordlist. First go to the Tool Preferences>Category>Wordlist and make sure the Treat all data as lowercase checkbox is ticked:


Press Apply, then go to the Wordlist tab and click the Start button; you will get a screen like this:


Not surprisingly, function words top the list. We are more interested in content words, so we can go on to step two, the Keyword List function. Go to Tool Preferences>Category>Keyword List:


Here you will need to load a reference corpus, e.g. the Brown word list. Press the Add Files button and locate your copy of the reference corpus. As the reference corpus used here is a word list and not a raw file, you need to click the Use word list(s) radio button, then press the Load button; the bar to the right will fill in green once the reference corpus is loaded:


Now press the Apply button; when you get back to the Keyword List window, press the Start button and your screen will look similar to this:


Finally step three is to go to File>Export Settings to File and save your settings. Now we have everything we need to start exploring our corpus. Stay tuned for related posts.
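As an aside, the keyness figure AntConc reports is, as far as I know, a log-likelihood score by default, comparing each word’s frequency in your corpus against the reference word list. Here is a rough sketch of the common two-term form of that calculation; the counts are invented:

```python
import math

def log_likelihood(freq_target, freq_ref, total_target, total_ref):
    """Two-term log-likelihood keyness for one word: how unexpected its
    target-corpus frequency is, given the reference corpus."""
    combined = freq_target + freq_ref
    total = total_target + total_ref
    # Expected frequencies if the word were evenly spread across both corpora
    expected_target = total_target * combined / total
    expected_ref = total_ref * combined / total
    ll = 0.0
    if freq_target:
        ll += freq_target * math.log(freq_target / expected_target)
    if freq_ref:
        ll += freq_ref * math.log(freq_ref / expected_ref)
    return 2 * ll

# Invented counts: a word occurring 50 times in a 10,000-word DIY corpus
# but only 20 times in a 1,000,000-word reference corpus
keyness = log_likelihood(50, 20, 10_000, 1_000_000)
```

A word used exactly in proportion to the reference scores near zero; the more over-represented it is in your corpus, the higher the score, which is why content words specific to your texts float to the top of the keyword list.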

Building your own corpus – cleaning up your text file(s)

One of the issues with crawling web pages is the irrelevant text that gets collected along the way. This short post describes one way to clean up your text file(s). [Update 1: note if you use ICEweb instead of TextSTAT you won’t need to clean up as much, if at all; update 2: BootCaT also does not require much cleaning.] I assume you are using LibreOffice; I think one can find similar functions or extensions for Microsoft Office.

The extension you need is Alternative dialog Find & Replace for Writer. Load the extension by going to Tools>Extension Manager>Add. Then restart LibreOffice. You will then see the extension under Tools>Add-Ons>Alternative searching.

You need to look at your text and see if you can spot patterns in the parts of the text you do not need. In my case I had to do five passes on the text. The first pass looked like this:


I get the [::BigBlock::] function by selecting Extended>Series of paragraphs (limited between start and end marks). This means See Also[::BigBlock::]File Under will find all text starting from See Also and ending with File Under. Note that you need to have the Regular expressions checkbox ticked. You can cycle through the matches by pressing Find to verify that it is the text you want. Once you have done that you can simply press the Replace All button.

My next pass looked like this:


My third pass was this:


Note the use of | as the OR operator.

My fourth pass removed any odd characters e.g.:


My fifth pass removed any photo or image attributions, so text like “photo by” or “image by” etc.

Of course, depending on how thoroughly you want to clean your text, one or two passes may be enough.
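The same kind of passes can be scripted with regular expressions; here is a sketch in Python, where the patterns are illustrative stand-ins for (not copies of) the ones in my passes:

```python
import re

def clean(text):
    """Run a series of cleanup passes over crawled text, one regex each.
    The patterns below are examples of the kinds of passes described,
    not the exact ones I used."""
    passes = [
        r"See Also.*?File Under",   # big block between two markers
        r"photo by|image by",       # photo/image attributions
        r"[^\x20-\x7E\n]",          # odd (non-printable) characters
    ]
    for pattern in passes:
        # DOTALL lets the block pattern span paragraphs, like [::BigBlock::]
        text = re.sub(pattern, "", text, flags=re.DOTALL | re.IGNORECASE)
    return text
```

The `|` works as the OR operator here too, just as in the extension’s dialog.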

Stay tuned for related posts.

Building your own corpus – binomials example

Note this entry may not mean much if you are new to AntConc; I will post more beginner-type ones later. I wanted to take advantage of a well-known ELT blogger’s post as a hook for this one!

This post shows how useful building your own corpus can be, particularly for ESP purposes. Leo Selivan’s (@leoselivan) latest post describes how to teach binomials; be sure to check that out before proceeding. A great post, but the examples he uses are from general English.

Using AntConc I can see what examples are relevant to my multi-media students.

The following screenshot shows the results of the search /# and #/ in the Clusters/N-Grams tab of AntConc (# is the default wildcard for any one word):


Looking down to a frequency of 4, I get this list:

drag and drop
up and running
audio and video
latest and greatest
desktop and mobile
designers and developers
quickly and easily
sites and apps
tips and tricks
username and password
building and testing
download and install
tried and true
when and where

I also note that there are a number of bundles using “new and …”:

new and different
new and evolving
new and exciting
new and improved
new and underused

There are also a number of others with a frequency of 3 or less, e.g.:

backup and archiving
first and foremost
icons and logos
layout and design
panning and zooming
faster and lighter
save and refresh
scissors and glue
leaps and bounds
looks and feels
new and shiny
pluses and minuses
pros and cons

Further, I can easily use the concordance lines as example sentences, e.g. for latest and greatest:


So this was a quick post to highlight how useful your own corpus can be. Stay tuned for related posts.

Building your own corpus – TextSTAT and AntConc

This post describes how to set up a workflow using two programs to build a corpus of text from the internet. The two programs are TextSTAT and AntConc: TextSTAT is used for its web crawler to gather your corpus [update 1: an alternative program is ICEweb; update 2: BootCaT with custom URLs] and AntConc is used to analyse the corpus. Note that I won’t be detailing any analysis in this post; that needs a series of other posts.


The first program, TextSTAT, is used to bring your text in from the internet. When you first run TextSTAT you will get the screen below:


First create a new corpus by going to Corpus/New corpus.

Next go to Corpus/Add file from web or press this icon:

to get the following dialogue box:


The above screenshot shows that I have entered the directory for posts from February 2011 on the site. You need to have a good idea of how your site is structured in terms of how it archives its text.

I then select 25 as the number of pages to retrieve (you need to be aware of how many posts your particular website averages). Finally, I select domain: only subdirectory. Once you press Search you should get a window like the one below:


Note that some of the links are indexes to the texts rather than texts themselves, and so can be deleted by first selecting them:


And then right clicking:


and finally selecting remove files, ending up with the following screen:


Next switch to the Word Forms tab:


Press the Frequency List button and you will get a screen like this:


Once you save your newly made corpus, you need to change the file into a text file by simply renaming the .crp file to a .txt file. A good idea is to make a copy of the .crp file before renaming, so that you can still use the original file in TextSTAT.
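That copy-and-rename step is easily scripted if you do it often; a couple of lines of Python (the filename is just an example):

```python
import shutil
from pathlib import Path

def crp_to_txt(crp_path):
    """Copy a TextSTAT .crp corpus to a .txt file AntConc can open,
    leaving the original .crp intact for TextSTAT."""
    crp = Path(crp_path)
    txt = crp.with_suffix(".txt")
    shutil.copyfile(crp, txt)
    return txt

# e.g. crp_to_txt("mycorpus.crp") produces mycorpus.txt alongside it
```

Because it copies rather than renames, the .crp file stays usable in TextSTAT.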


The start screen of AntConc looks like this:


Go to File/Open File and you will get a screen like this:


You can see the .txt file you renamed from the file you built in TextSTAT at the top left of the first column. You may notice a difference between the wordlist that AntConc builds and the wordlist TextSTAT builds:


This is because renaming the .crp file to a .txt file is a dirty workaround that does not strip the headers TextSTAT uses in its .crp files. The differences should be negligible for ESP purposes.

Once you have your .txt file in AntConc, it is just a matter of playing around to see how you can use it for your purposes. Use the AntConc website as a jumping-off point to learn about the features. Stay tuned for a series of posts on how I use it for multimedia English classes.