#IATEFL 2016 – Corpus Tweets 4

It is good to see a talk on how to create your own corpus, as this is arguably one of the key strengths of corpus linguistics, i.e. language that is mined for your students in your particular context. That is very far from what a coursebook can address. As ever, much appreciation to Sandy Millin for bringing us this talk.

Teacher-driven corpus development: the online restaurant review by Chad Langford & Joshua Albair as reported by Sandy Millin

  1. Chad Langford and Josh Albair on creating a corpus of restaurant reviews based on TripAdvisor, as they are linguists and teachers

  2. CL/JA They teach adults, not degree seeking, but find writing is a challenge, esp as learners don’t write much, even in L1

  3. CL/JA Genre of these reviews works as learners can relate to it and feel empowered, members of non-geographically bound community

  4. CL/JA By creating a corpus, they believed that would characterise the genre as objectively as possible, and improve materials devmnt

  5. #iatefl CL/JA Basic steps for treating the data to make corpus https://t.co/NobXQJ2782


  6. CL/JA They narrowed down TripAdvisor reviews to London, with 100-200 reviews per restaurant, with 3-dot average

  7. CL/JA They copied over 8000 reviews and copied them into Word – pretty tedious! Huge amount of text and lots to be manually deleted

  8. CL/JA Cleaned data in Word is readable and only has tagline and body of review, maintaining paragraphs for later research

  9. CL/JA Needed to standardise, e.g. three dots for ellipsis, standardise common misspellings, removing extraneous spacing

  10. CL/JA To do this they used Notepad++ which is a free powerful text editor which they used to tidy up formatting

  11. #iatefl CL/JA Examples of coding they were able to learn very quickly https://t.co/fk7uHn48zM


  12. CL/JA Then added POS tagging, metadata about tagline and types of restaurant etc. Used Wordsmith tools which is cheap, but good

  13. CL/JA They used wordlist, keywords and concord tools within WordSmith

  14. CL/JA Final corpus has 67 restaurants, over 8000 reviews and over 1 million words. Can start to identify restaurant review genre

  15. CL/JA Identified positive/neg evaluative adjectives, restaurant-related vocab: experience, description, food, non-food, person, place

  16. CL/JA Also very high frequency of first person pronouns, overwhelming use of was/were (copulative use?)

  17. CL/JA Discourse showed very common to use ‘but’ as marker in 3dot reviews, very rare in 1/5 “good but” v “but good” – meaning change

  18. CL/JA One was much more common than the other. Think it was “good but” – missed it!

  19. CL/JA High instance of subject-less clauses, determiner ellipsis and one more grammar feature I missed

  20. CL/JA Determiner ellipsis is very rarely pointed out to our students, except in headlines. e.g. restaurant was dirty, fish was tasty

  21. CL/JA In class they’ve used it for ranking activity – place five taglines on cline on board, next group can add 5 more/move first

  22. CL/JA Second activity is guided discovery sheet based on authentic review which exemplifies characteristics they’ve identified

  23. CL/JA Can get in touch with them at the University of Lille if you’d like to find out more

  24. CL/JA Tilly Harrison brings up the point that this corpus data draws on comments that perhaps people haven’t given permission to use
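
The standardising steps reported in tweets 9 and 10 (three dots for ellipsis, common misspellings, extraneous spacing) were done by the speakers in Notepad++. As a rough illustration of the same kind of tidying, here is a minimal Python sketch. It is not the speakers' method: the function name, the sample review and the one-entry misspelling table are all made up for the example.

```python
import re

# Hypothetical misspelling table; a real one would be built from the data.
MISSPELLINGS = {"resturant": "restaurant"}

def normalise_review(text: str) -> str:
    """Tidy one review while keeping its paragraph breaks."""
    # Treat the single-character ellipsis as three dots.
    text = text.replace("\u2026", "...")
    # Collapse any run of two or more dots (optionally spaced) to exactly three.
    text = re.sub(r"\.(?:[ \t]*\.)+", "...", text)
    # Collapse runs of spaces and tabs into a single space.
    text = re.sub(r"[ \t]+", " ", text)
    # Remove spaces around line breaks, leaving the line breaks themselves.
    text = re.sub(r" *\n *", "\n", text)
    # Replace each listed misspelling as a whole word.
    for wrong, right in MISSPELLINGS.items():
        text = re.sub(rf"\b{re.escape(wrong)}\b", right, text)
    return text.strip()

sample = "Great  food . . . but   slow service\u2026\n\n  Lovely resturant  overall"
print(normalise_review(sample))
```

Keeping the blank line between paragraphs matters here because the talk mentions maintaining paragraphs for later research.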



Building your own corpus – BootCat custom URL feature

I noticed that #corpusmooc round 2 does have something on collecting corpora from the web, a video of using SketchEngine using its BootCat module that appears under unit 3.1.8 A brief introduction to SketchEngine.

I thought that there would be interest in a screencast of using the custom URL feature of BootCat as you need to pay to use SketchEngine (although they do offer a one month free trial and their analysis engine is pretty powerful).

The only tricky thing in setting up BootCat is getting a Windows Azure Marketplace account key, as BootCat uses the Bing search engine. Luckily the developers have a webpage explaining this.

Finally a tweet by Noor Adnan @n00rbaizura prompted me to think that listing tools available for getting text (non-programmatically) from the net might be useful. So I list four examples in order of usefulness and cross-platform availability. If you know of any others please do let me know.

BootCat – The most useful, though when using the seed feature you need to tweak it; I talk about that here.

CorpusCreator (now called TranslatorBank/CorpusMode) – a simpler interface than BootCat, with a feature to download PDFs and convert them to text. Only available for Windows OS. Some description here.

ICEweb – Windows OS only; I have already described this here.

TextSTAT – I have also already described this here.

Thanks for reading.

Building your own corpus – ICEweb

This post describes another program that you can use to crawl a website. The advantage of ICEweb is that it automatically strips HTML tags for you and it is faster than TextSTAT in retrieving webpages. A disadvantage is that ICEweb does not explicitly save your settings.
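
To give an idea of what stripping HTML tags involves, here is a minimal sketch using Python's standard `html.parser` module. It is only an illustration of the general technique, not how ICEweb works internally; the class and function names are my own.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect visible text, ignoring the contents of script and style tags."""
    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside script/style blocks.
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())

def strip_tags(html: str) -> str:
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    # Joining with spaces is crude: paragraph boundaries are lost.
    return " ".join(parser.parts)

page = ("<html><head><style>p{color:red}</style></head><body>"
        "<h1>Title</h1><p>Some <b>bold</b> text.</p>"
        "<script>var x=1;</script></body></html>")
print(strip_tags(page))
```

A real page would still contain menus, footers and other boilerplate text after this step, so some manual cutting is usually needed.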

Prepare yourself for a deluge of screenshots, you could of course just skip to the help screen of the program and follow that :). Hat tip to Christopher Phipps/ @lousylinguist for alerting me to the program.

When the program starts you will see the following:


You need to name your corpus in terms of a region/category and country/sub-category. Since we are only interested in one website (Webmonkey.com), ignore the region and country labels.

We’ll call the category webmonkey and when we press create category button we’ll see this dialogue box:


This will create the webmonkey folder in the ICEweb folder (note that as I have already created the webmonkey folder you can see it in the screenshot).

You will get a similar dialogue box for when you create the subcategory, in this case I have called that 2013 which should be located in the webmonkey folder:


Again the 2013 folder has already been created. Note that since you already have some folders and files in the 2013 folder, a dialogue box will appear asking whether you want to overwrite files; you should press no. This is where the disadvantage of not being able to save a session appears.

Then press the add/edit URLs button:


A text file opens where you can enter the URLs of the files you want; you then save and exit this text file.

Press the start retrieve button, the program will ask you which folder you want, which will be the subcategory folder i.e. 2013:


and then once the files are downloaded you’ll get some messages in the right-hand pane like so:


Now you can have a look at one of the files by pressing the open result file button and choosing a file from the raw folder as in the following two screenshots:



As you can see compared to the files from TextSTAT (see cleaning up text file post) the text file above is much easier to handle in terms of easily cutting out text you are not interested in.

You can check out the rest of the building your own corpus series if you haven’t already, thanks for reading.

Paper Machines – look ma no concordance lines

A major hurdle with corpora is understanding concordance outputs. Current interfaces don’t help much in this regard as documented by Alannah Fitzgerald e.g. Re-using Oxford OpenSpires content in podcast corpora. Geoff Jordan who worked on an early concordancer remarked:

…we found, not surprisingly, that there was a pay-off between simplicity and allowing for “sophisticated” searches.

Geoff Jordan (blog post comment)

So there is a big need to address the question of interface design with concordance tools. Recently, I stumbled across a visualisation plugin for the Zotero program called Paper Machines (the github site is faster) which can help with the interface problem.

Using the Phrase Net function I can get a visualisation of several fixed phrases e.g. binomials i.e. x and y:


The size of the word is related to the number of occurrences of that word in the phrase, and the thickness of the arrow to how often that phrase occurs. I can of course check these binomials using AntConc.
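
For anyone who wants to see the counting behind an "x and y" pattern, here is a minimal Python sketch. It is not how Paper Machines or AntConc do it; the function name and sample text are invented for the illustration.

```python
import re
from collections import Counter

def phrase_pairs(text: str, link: str = "and") -> Counter:
    """Count 'x <link> y' word pairs, e.g. binomials when link is 'and'."""
    # The lookahead lets 'a and b and c' yield both (a, b) and (b, c).
    pattern = re.compile(rf"\b([a-z']+) {re.escape(link)} (?=([a-z']+))")
    return Counter(pattern.findall(text.lower()))

sample = ("Drag and drop is up and running. Just drag and drop the file; "
          "the latest and greatest build is up and running.")
pairs = phrase_pairs(sample)
for (x, y), count in pairs.most_common():
    print(f"{x} and {y}: {count}")
```

Passing a different `link`, such as "or" or "of the", gives the other default patterns listed below.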

Here’s another example using x the y:


Trying to get interesting patterns like this is much more difficult (impossible?) in AntConc. The default phrases in Paper Machines are (you can do your own custom searches):

x and y

x or y

x of the y

x a y

x the y

x at y

x is y

x [space] y which is equivalent to collocations i.e.:


In addition to showing interesting phrases, the visualisation helps you to get to know your corpus much better. For example, in a couple of the diagrams I noticed spk and dkim; looking those up using AntConc I discovered they were abbreviations for sender policy framework and domain keys identified mail respectively. I have no idea what these are but my students may find them important to know.

There are other functions on Paper Machines like Ngram but I could not get that working. For fun, the image below is a Heatmap of the place names mentioned in my webmonkey corpus:


Thanks for reading.

#Lexconf2013 – Novice workshop-talk hues

As mentioned in my #lexconf2013 conference summary this post outlines my first “proper” workshop-talk experience.

Things will go wrong, more so in cases where technology is involved.* My workshop had 5 computers in total, 4 of which were kindly provided by the event organisers. Unfortunately two programs were not installed on these computers, so bringing a copy of the software on a USB key is vital. I wasted time trying to find a solution to one of the missing programs and only arrived at a partial solution in any case.

Forty-five minutes really translates into 35-40 minutes once you allow 5-10 minutes for conference attendees to arrive and settle. So subtract 10 minutes from the official talk length.

Related to timing with workshops is how important it is to focus on a limited amount of content. I was too ambitious in what I wanted to cover. This I guess comes with experience.

Wish for great plenary sessions in your conference 🙂 so that you have an interesting source of common experience which you can use to connect to your audience. It also helps to calm your nerves as you are thinking about a great plenary you have just heard rather than overthinking your impending talk.

Reading about other teacher experiences really does help e.g. Ava Fruin/@avafruin courageously blogged about her first talk from conception to post talk reflection.

To get a taste of my talk you can check a remixed version (resolution can be switched to high def for clearer image), many thanks to the workshop participants for their questions and feedback.

*Even giants like Google battle the tech gremlins and lose – http://youtu.be/Rd_UrSB1MAY?t=13m8s

What to teach from corpora output – frequency and transparency

Frequency of occurrence is the main way for teachers to choose what to teach when using corpora. However, as Andrew Walkley discusses in “Word choice, frequency and definitions”, using just frequency is not without limitations. In addition to frequency we can use semantic transparency/opacity, that is, how the meaning of the whole differs from its individual parts. This is also sometimes referred to as how idiomatic a phrase is. Martinez (2013) offers a Frequency-Transparency Framework that teachers can use to help them choose what phrases to teach. Using four collocates of take he presents the following graphic:

The Frequency-Transparency Framework (FTF) using four collocates of the verb take (Martinez, 2013)

The numbered quadrants are the suggested priority of the verb+noun pairs i.e. the most frequent and most opaque phrase would be taught first (1), then the most frequent and transparent phrase (2), followed by the less frequent but opaque phrase (3) and last the least frequent and most transparent phrase (4). As said this is only a suggested priority which can be changed according to the teaching context. For example a further two factors (in addition to word for word decoding) can be considered when evaluating transparency:

  • Is the expression potentially deceptively transparent? – “every so often” can be misread as often; “for some time” can be misunderstood as short amount of time (Martinez & Schmitt, 2012, p.309)
  • Could the learner’s L1 negatively influence accurate perception?
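
The quadrant priority described above can be sketched in a few lines of Python. This is only an illustration: the class and function names are mine, and the frequent/opaque flags are simple yes/no judgements that mirror how I place four binomials from my webmonkey corpus in this post, not measured values.

```python
from dataclasses import dataclass

@dataclass
class Phrase:
    text: str
    frequent: bool  # judged high-frequency in the corpus
    opaque: bool    # meaning not recoverable word for word

def quadrant(p: Phrase) -> int:
    """Return the suggested teaching priority (1 first, 4 last)."""
    if p.frequent:
        return 1 if p.opaque else 2
    return 3 if p.opaque else 4

# Illustrative judgements only, listed in no particular order.
phrases = [
    Phrase("layout and design", frequent=False, opaque=False),
    Phrase("tried and true", frequent=False, opaque=True),
    Phrase("up and running", frequent=True, opaque=True),
    Phrase("latest and greatest", frequent=True, opaque=False),
]

teaching_order = sorted(phrases, key=quadrant)
for p in teaching_order:
    print(quadrant(p), p.text)
```

Since the priority is only a suggestion, the two extra factors above (deceptive transparency and L1 influence) would be reasons to override a flag by hand.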

Applying the framework to the binomials list from my webmonkey corpus – I would place up and running in quadrant 1, latest and greatest in quadrant 2, tried and true in quadrant 3 and layout and design in quadrant 4. Note that I did not place drag and drop, the most frequent and somewhat opaque phrase since it is so well-known with my multimedia students (similar to cut and paste) that it would not need teaching. Thanks for reading.


Martinez, R. (2013). A framework for the inclusion of multi-word expressions in ELT. ELT Journal 67(2): 184-198.

Martinez, R. & Schmitt, N. (2012). A Phrasal Expressions List. Applied Linguistics 33(3): 299-320.

Building your own corpus – Such as example

Although the power of a corpus lies in discovering horizontal relations between lexis such as collocation and colligation, the more familiar vertical relations that teachers are used to can also be explored at the same time. Such vertical relations, or semantic preference, have been used traditionally in ELT coursebooks. See the post by Leo Selivan critiquing this use. This post describes using concordance output to look at both collocation (adjective + noun) and semantic preference (hyponymy or general category/specific example relation).

I adapted an activity by Tribble (1997) who in turn based it on an example from Tim Johns who used the keyword “such as”, e.g.:

encompass many _______frequently labeled HTML5 such as the Geolocation API, offline storage API a

Make sure to see the worksheet (in odt format) to understand the following. The first question A in the worksheet is a traditional pre-learning task using the specific examples of the semantic categories, so in the line above the examples of geolocation api and offline storage api are instances of the category of features (the gapped word). The next questions B and C ask students to find the category that the words in bold on the right are examples of (these bold words are the ones in question A). Question D asks students to identify the types of words in the lists and in the italicised words, to get them to notice the adjective + noun structure e.g. important features. Finally question E asks them to identify adjective + noun structures in a single text taken from the website that the corpus is based on.

Less experienced first year multimedia students struggled more with the exercises than more experienced second year students. I assume this is because they were less familiar with the lexis? Though both seemed more at ease with the last task of identifying structures in the text.

I’ll survey some corpora classroom task types in a later post.

Thanks for reading.


Tribble, C. (1997). Improvising corpora for ELT: Quick-and-dirty ways of developing corpora for language teaching. In J. Melia. & B. Lewandowska-Tomaszczyk (Eds.), PALC ’97 Proceedings, Practical Applications in Language Corpora (pp. 106-117). Lodz: Lodz University Press.

Building your own corpus – sorting body idioms

In my continuing and vain efforts to be associated with great bloggers this post could be said to be inspired by Matt Halsdorff/@MattHalsdorff’s post on native speaker use of idioms. 😉

The sort command in a concordance is one of the simplest yet most powerful features in identifying patterns. The following is a screenshot after a search on the word “face”:


Compare the above with the screenshot below which is sorted alphabetically by the word to the left of “face”:


The sort allows us to clearly see two groupings of about face and let’s face it.
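
To show what the sort is doing, here is a minimal Python sketch that builds key-word-in-context lines and sorts them by the word immediately to the left of the search term. It is an illustration only, not AntConc's implementation; the function names and sample text are invented.

```python
import re

def kwic(text: str, keyword: str, window: int = 30):
    """Return (left context, match, right context) for each hit."""
    lines = []
    pattern = rf"\b{re.escape(keyword)}\b"
    for m in re.finditer(pattern, text, flags=re.IGNORECASE):
        left = text[max(0, m.start() - window):m.start()]
        right = text[m.end():m.end() + window]
        lines.append((left, m.group(0), right))
    return lines

def left_word(line) -> str:
    """The word immediately to the left of the search term, lowercased."""
    words = line[0].split()
    return words[-1].lower() if words else ""

def sort_by_left_word(lines):
    return sorted(lines, key=left_word)

sample = ("Let's face it, the redesign was an about face. "
          "You have to face it: users want speed. "
          "Another about face came later.")
sorted_lines = sort_by_left_word(kwic(sample, "face"))
for left, hit, right in sorted_lines:
    print(f"{left:>30} {hit} {right}")
```

After sorting, the two about face lines sit next to each other, which is the grouping effect seen in the screenshot.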

The most common body idiom in my corpus is head (on) (over) to, which occurs 165 times. Other body idioms in this corpus include keep an eye on; on the other hand; fall back to; neck and neck; the bottom line; start (off) on the right foot. So here I have a nice list to explore with students.

Note that the sort command in AntConc can be a little confusing, as the tick box is very close to, and to the left of, the adjustable level boxes. This goes against the usual expectation that tick boxes are to the right of the attribute that they refer to. The following screenshot shows in red what we expect compared to the blue that is the case:


Stay tuned for further diy corpus bulletins.

Building your own corpus – initial exploration

I’ll describe here what can be done initially with your corpus in AntConc. This post should be read after the first steps in AntConc post.

The keyword results, which show the most significant words compared to a reference corpus, are a great way to check what words your class already knows. You can use your judgement as to how many and what words to use. The list can be recycled to check students’ understanding over time, and any number of ways could be used to exploit it.
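
For the curious, here is a minimal Python sketch of how a keyword list can be scored. It uses log-likelihood, which is one commonly used keyness statistic; whether it matches the setting used for this post is an assumption, and the function names and tiny token lists are invented for the example.

```python
import math
from collections import Counter

def log_likelihood(a: int, b: int, c: int, d: int) -> float:
    """a, b: word frequency in study and reference corpus; c, d: corpus sizes."""
    e1 = c * (a + b) / (c + d)  # expected frequency in the study corpus
    e2 = d * (a + b) / (c + d)  # expected frequency in the reference corpus
    ll = 0.0
    if a:
        ll += a * math.log(a / e1)
    if b:
        ll += b * math.log(b / e2)
    return 2 * ll

def keywords(study_tokens, reference_tokens, top: int = 10):
    study, ref = Counter(study_tokens), Counter(reference_tokens)
    c, d = len(study_tokens), len(reference_tokens)
    scored = []
    for word, a in study.items():
        b = ref[word]
        # Keep only words relatively more frequent in the study corpus.
        if a / c > b / d:
            scored.append((word, log_likelihood(a, b, c, d)))
    return sorted(scored, key=lambda item: item[1], reverse=True)[:top]

study = "responsive design makes the responsive layout adapt to the browser".split()
reference = "the cat sat on the mat and the dog sat on the rug".split()
for word, score in keywords(study, reference, top=3):
    print(f"{word}\t{score:.2f}")
```

With real corpora the reference corpus would be far larger than the study corpus, and the scores correspondingly more meaningful.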

Next you can focus on certain words and explore further. Before making use of the Collocation function we need to make sure that in Tool Preferences>Category>Collocates the Selected Collocate measure is set to T-score. Then focusing on say the adjective responsive, we switch to the Collocation tab and enter the word and we get this screen:


We see that the most significant noun which collocates one word to the right of responsive is design. We can do this with other adjectives from the keyword list and can get the following list:

Responsive design
New features
Faster javascript
Latest version
Desktop browser

Depending on your purposes you can add to the list and use it in a matching exercise. Focusing further we can use the collocation function again to look at what goes with responsive design:


The only statistically significant word is of, with a T-score of more than 2. However we also see off, which is curious. Entering off responsive design into the concordance function we discover:


i.e. turn off is the appropriate collocation. The above shows how unexpected uses can arise, i.e. we can ask students to think of a sentence using X + of + responsive design and Y + off + responsive design and see if what they come up with makes sense.
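
As background to the T-score figures above, here is a minimal Python sketch of the textbook formula: observed co-occurrences minus expected co-occurrences, divided by the square root of the observed count. AntConc's exact calculation may differ in detail, and the function name and the frequencies in the example are invented.

```python
import math

def t_score(observed: int, node_freq: int, collocate_freq: int,
            corpus_size: int, span: int = 1) -> float:
    """Textbook T-score for a node word and a collocate."""
    if observed == 0:
        return 0.0
    # Co-occurrences expected by chance within the given span.
    expected = node_freq * collocate_freq * span / corpus_size
    return (observed - expected) / math.sqrt(observed)

# Hypothetical counts: the pair co-occurs 20 times in a 100,000-word corpus.
score = t_score(observed=20, node_freq=50, collocate_freq=400, corpus_size=100_000)
print(round(score, 2))
```

Because the score depends heavily on the observed count, T-score tends to favour frequent pairs, which suits a teaching list.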

Additionally although embrace/s/d/ing is not statistically significant it is an interesting collocation:


Presenting the above sentences we can try to see if the students can work out the meaning of embrace.

Stay tuned for related posts.

Building your own corpus – using general English lexis

I was reading a post on money idioms by Martin Sketchley/ @ELTExperiences and was reminded that a lot of general English lexis and topics which crop up in coursebooks can be a good way into any specialised corpus you build. Usually I work with a single text with my class and then use any analysis from my corpus to focus on specific lexis, i.e. multi-media lexis. What you can also do is look at general English words, in this post I will look at the words – pay, payments, money, cost, costs, cheap, cheaper.

Note you can use the asterisk * wildcard to search for all forms of words you are interested in. e.g. pay*, cost* and cheap*.
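
If you want to see what a wildcard search amounts to, here is a minimal Python sketch that turns a query like pay* into a regular expression and counts the forms it finds. It assumes the asterisk means zero or more word characters; the function names and sample sentence are invented.

```python
import re
from collections import Counter

def wildcard_to_regex(query: str) -> re.Pattern:
    """Escape the query, then let each * stand for zero or more word characters."""
    body = re.escape(query).replace(r"\*", r"\w*")
    return re.compile(rf"\b{body}\b", re.IGNORECASE)

def wildcard_counts(text: str, query: str) -> Counter:
    pattern = wildcard_to_regex(query)
    return Counter(m.group(0).lower() for m in pattern.finditer(text))

sample = ("Users pay more; payments are covered. "
          "Paying is optional, but the payment failed.")
counts = wildcard_counts(sample, "pay*")
for form, n in sorted(counts.items()):
    print(form, n)
```

Note that a wildcard can overmatch, e.g. pay* would also pick up payload, so it is worth scanning the hits before using them.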

Pay, payments, money, cost, costs, cheap and cheaper return 17, 8, 16, 10, 10, 11 and 7 hits respectively. The following is an example of the pay concordance lines:


There is an option in Tool Preferences>Category>Concordance to Hide search term in KWIC display, which will take out your word and replace it with asterisks, e.g. for the payments concordance lines:


You would then give out the concordances in 7 groups along with the 7 missing words and ask your students to fill in the gaps. It’s a good idea to examine your concordance lines in case of co-text that may be too difficult or needs explaining.
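
For those preparing gap-fill lines outside AntConc, here is a minimal Python sketch that hides the search term in each concordance line. It is only an approximation of the Hide search term option: the function name is mine, and replacing the word with one asterisk per letter is an assumption.

```python
import re

def gapped_lines(text: str, word: str, window: int = 50):
    """Return concordance lines with the search word replaced by asterisks."""
    pattern = re.compile(rf"\b{re.escape(word)}\b", re.IGNORECASE)
    flat = " ".join(text.split())  # put the text on one line first
    lines = []
    for m in pattern.finditer(flat):
        left = flat[max(0, m.start() - window):m.start()]
        right = flat[m.end():m.end() + window]
        lines.append(f"{left}{'*' * len(m.group(0))}{right}")
    return lines

sample = "That means royalty payments are covered by hardware makers."
for line in gapped_lines(sample, "payments", window=20):
    print(line)
```

If the number of asterisks gives away too much, a fixed-length gap could be substituted for the `'*' * len(...)` expression.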

If students have not seen concordance lines they should be exposed to them beforehand.

So for example take a line from each concordance group:

1. Yes, thats right, IE7 users visiting Kogan will pay more than those using a modern web browser. Or
2. H.264 support to decode video. That means royalty payments are covered by hardware makers, not Mozil
3. lla) that Google is willing to spend that kind of money just to keep Microsoft from starting a partn
4. numbers, the Google Maps API will soon be another cost to factor into the plan. And that may be enou
5. some developers were worried about the potential costs. The vast majority of maps hackers and casua
6. ep prices for all smart shoppers down. Sure its a cheap, attention-getting gimmick, but who hasnt wa
7. t if its just backups you want, clearly there are cheaper alternatives. The real appeal of Cloud Dri

and ask students to try to work out the context in terms of the topic of the text, what comes before the fragment and what comes after. Note that if you want to expand the range of the sentence extract, adjust the Search Window Size; the default is set to 50 characters to the left and right of the search word.

After the gap filling exercise, you divide students into seven groups and ask each group to examine one set of concordance lines for any patterns that they notice.

There is plenty of scope to use general English in terms of grammatical forms as well which I’ll talk about in another post.

Thanks for reading.