#IATEFL 2016 – Corpus Tweets 4

It is good to see a talk on how to create your own corpus, as this is arguably one of the key strengths of corpus linguistics, i.e. language that is mined for your students in your particular context, which is very far from what a coursebook can address. As ever, much appreciation to Sandy Millin for bringing us this talk.


Teacher-driven corpus development: the online restaurant review by Chad Langford & Joshua Albair as reported by Sandy Millin

  1. Chad Langford and Josh Albair on creating a corpus of restaurant reviews based on TripAdvisor, as they are linguists and teachers

  2. CL/JA They teach adults, not degree seeking, but find writing is a challenge, esp as learners don’t write much, even in L1

  3. CL/JA Genre of these reviews works as learners can relate to it and feel empowered, members of non-geographically bound community

  4. CL/JA By creating a corpus, they believed that would characterise the genre as objectively as possible, and improve materials development

  5. #iatefl CL/JA Basic steps for treating the data to make corpus https://t.co/NobXQJ2782


  6. CL/JA They narrowed down TripAdvisor reviews to London, with 100-200 reviews per restaurant, with 3-dot average

  7. CL/JA They copied over 8000 reviews and pasted them into Word – pretty tedious! Huge amount of text and lots to be manually deleted

  8. CL/JA Cleaned data in Word is readable and only has tagline and body of review, maintaining paragraphs for later research

  9. CL/JA Needed to standardise, e.g. three dots for ellipsis, standardise common misspellings, removing extraneous spacing

  10. CL/JA To do this they used Notepad++, a free, powerful text editor, to tidy up formatting

  11. #iatefl CL/JA Examples of coding they were able to learn very quickly https://t.co/fk7uHn48zM


  12. CL/JA Then added POS tagging, metadata about tagline and types of restaurant etc. Used WordSmith Tools which is cheap, but good

  13. CL/JA They used wordlist, keywords and concord tools within WordSmith

  14. CL/JA Final corpus has 67 restaurants, over 8000 reviews and over 1 million words. Can start to identify restaurant review genre

  15. CL/JA Identified positive/neg evaluative adjectives, restaurant-related vocab: experience, description, food, non-food, person, place

  16. CL/JA Also very high frequency of first person pronouns, overwhelming use of was/were (copulative use?)

  17. CL/JA Discourse showed very common to use ‘but’ as marker in 3-dot reviews, very rare in 1/5. “good but” v “but good” – meaning change

  18. CL/JA One was much more common than the other. Think it was “good but” – missed it!

  19. CL/JA High instance of subject-less clauses, determiner ellipsis and one more grammar feature I missed

  20. CL/JA Determiner ellipsis is very rarely pointed out to our students, except in headlines. e.g. restaurant was dirty, fish was tasty

  21. CL/JA In class they’ve used it for ranking activity – place five taglines on cline on board, next group can add 5 more/move first

  22. CL/JA Second activity is guided discovery sheet based on authentic review which exemplifies characteristics they’ve identified

  23. CL/JA Can get in touch with them at the University of Lille if you’d like to find out more

  24. CL/JA Tilly Harrison brings up the point that this corpus data draws on comments that perhaps people haven’t given permission to use
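
The standardising steps reported in the tweets above (three dots for ellipsis, common misspellings, extraneous spacing) could also be scripted. Here is a minimal Python sketch, which is not the speakers' Notepad++ method; the misspelling list and the function name are made up for illustration.

```python
import re

# Hypothetical misspelling map; the speakers' actual list is not given in the tweets.
MISSPELLINGS = {"resturant": "restaurant", "definately": "definitely"}


def normalise(text: str) -> str:
    """Standardise ellipses, listed misspellings and spacing in one review."""
    # Any run of two or more dots, or the single ellipsis character, becomes three dots.
    text = re.sub(r"\.{2,}|…", "...", text)
    # Replace listed misspellings as whole words (case-insensitive match,
    # lower-case replacement, so capitalisation is not preserved).
    for wrong, right in MISSPELLINGS.items():
        text = re.sub(rf"\b{re.escape(wrong)}\b", right, text, flags=re.IGNORECASE)
    # Collapse runs of spaces and tabs, but keep newlines so paragraphs survive.
    text = re.sub(r"[ \t]+", " ", text)
    # Drop spaces left at the start or end of lines.
    text = re.sub(r" *\n *", "\n", text)
    return text.strip()


if __name__ == "__main__":
    print(normalise("Nice  resturant.....  but   slow\n\n  Would go again… "))
```

Keeping the newlines means the paragraph breaks that the speakers wanted for later research are not lost.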


Building your own corpus – BootCat custom URL feature

I noticed that #corpusmooc round 2 does have something on collecting corpora from the web: a video on using SketchEngine's BootCat module, which appears under unit 3.1.8, A brief introduction to SketchEngine.

I thought that there would be interest in a screencast of using the custom URL feature of BootCat as you need to pay to use SketchEngine (although they do offer a one month free trial and their analysis engine is pretty powerful).

The only tricky thing in setting up BootCat is getting a Windows Azure Marketplace account key, as BootCat uses the Bing search engine. Luckily, the developers have a webpage explaining this.

Finally, a tweet by Noor Adnan @n00rbaizura prompted me to think that listing tools available for getting text (non-programmatically) from the net might be useful. So I list four examples in order of usefulness and cross-platform availability. If you know of any others please do let me know.

BootCat – the most useful, though the seed feature needs some tweaking; I talk about that here.

CorpusCreator – a simpler interface than BootCat, with a feature to download PDFs and convert them to text. Only available for Windows OS. Some description here.

ICEweb – Windows OS only; I have already described this here.

TextSTAT – I have also already described this here.
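
As a rough illustration of what a seed feature does, here is a minimal Python sketch. It assumes that seed words are combined into random tuples which are then sent to the search engine as queries; it is not BootCat's own code, and the seed words, tuple size and function name are invented for the example.

```python
import itertools
import random


def seed_tuples(seeds, size=3, how_many=5, rng_seed=0):
    """Return up to `how_many` distinct random combinations of `size` seed words."""
    # All possible unordered combinations of the seeds.
    combos = list(itertools.combinations(seeds, size))
    # A seeded generator makes the selection repeatable.
    rng = random.Random(rng_seed)
    return rng.sample(combos, min(how_many, len(combos)))


if __name__ == "__main__":
    # Hypothetical seeds for a web development corpus.
    seeds = ["html5", "css", "javascript", "browser", "api"]
    for query in seed_tuples(seeds, size=3, how_many=4):
        print(" ".join(query))
```

Each printed line stands for one search query; tweaking the seeds or the tuple size changes which pages the search engine returns.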

Thanks for reading.

Building your own corpus – ICEweb

This post describes another program that you can use to crawl a website. The advantage of ICEweb is that it automatically strips HTML tags for you and it is faster than TextSTAT in retrieving webpages. A disadvantage is that ICEweb does not explicitly save your settings.

Prepare yourself for a deluge of screenshots; you could of course just skip to the help screen of the program and follow that :). Hat tip to Christopher Phipps/ @lousylinguist for alerting me to the program.

When the program starts you will see the following:


You need to name your corpus in terms of a region/category and country/sub-category. Since we are only interested in one website (Webmonkey.com), ignore the region and country labels.

We’ll call the category webmonkey, and when we press the create category button we’ll see this dialogue box:


This will create the webmonkey folder in the ICEweb folder (note that as I have already created the webmonkey folder you can see it in the screenshot).

You will get a similar dialogue box when you create the subcategory; in this case I have called it 2013, and it should be located in the webmonkey folder:


Again the 2013 folder has already been created. Note that since there are already some folders and files in the 2013 folder, a dialogue box will appear asking whether you want to overwrite files. You should press no; this is where the disadvantage of not being able to save a session appears.

Then press the add/edit URLs button:


A text file opens where you can enter the URLs of the pages you want; you then save and exit this text file.

Press the start retrieve button; the program will ask you which folder you want, which will be the subcategory folder, i.e. 2013:


and then once the files are downloaded you’ll get some messages in the right-hand pane like so:


Now you can have a look at one of the files by pressing the open result file button and choosing a file from the raw folder as in the following two screenshots:



As you can see, compared to the files from TextSTAT (see cleaning up text file post), the text file above is much easier to handle, since it is easy to cut out text you are not interested in.

You can check out the rest of the building your own corpus series if you haven’t already. Thanks for reading.
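
To give an idea of what stripping HTML tags involves, here is a minimal sketch using Python's standard library. It is not how ICEweb does it; the class name, the function name and the list of block-level tags are choices made for this example.

```python
from html.parser import HTMLParser

# Tags after which a line break is inserted (an illustrative, incomplete list).
BLOCK_TAGS = {"p", "div", "li", "br", "h1", "h2", "h3", "h4", "h5", "h6"}


class TextOnly(HTMLParser):
    """Collect visible text, skipping the contents of script and style."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0  # greater than 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1
        elif tag in BLOCK_TAGS:
            self.parts.append("\n")

    def handle_data(self, data):
        if not self._skip:
            self.parts.append(data)


def strip_tags(html: str) -> str:
    parser = TextOnly()
    parser.feed(html)
    parser.close()
    text = "".join(parser.parts)
    # Tidy whitespace inside each line and drop empty lines.
    lines = [" ".join(line.split()) for line in text.splitlines()]
    return "\n".join(line for line in lines if line)


if __name__ == "__main__":
    page = "<html><body><h1>Title</h1><p>Hello <b>world</b>!</p></body></html>"
    print(strip_tags(page))
```

Menus, footers and other page furniture would still need cutting out by hand, as with the ICEweb output.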

Paper Machines – look ma no concordance lines

A major hurdle with corpora is understanding concordance outputs. Current interfaces don’t help much in this regard, as documented by Alannah Fitzgerald, e.g. Re-using Oxford OpenSpires content in podcast corpora. Geoff Jordan, who worked on an early concordancer, remarked:

…we found, not surprisingly, that there was a pay-off between simplicity and allowing for “sophisticated” searches.

Geoff Jordan (blog post comment)

So there is a big need to address the question of interface design with concordance tools. Recently, I stumbled across a visualisation plugin for the Zotero program called Paper Machines (the GitHub site is faster) which can help with the interface problem.

Using the Phrase Net function I can get a visualisation of several fixed phrases e.g. binomials i.e. x and y:


The size of the word is related to the number of occurrences of that word in the phrase, and the thickness of the arrow to how often that phrase occurs. I can of course check these binomials using AntConc.

Here’s another example using x the y:


Trying to get interesting patterns like this is much more difficult (impossible?) in AntConc. The default phrases in Paper Machines are (you can do your own custom searches):

x and y

x or y

x of the y

x a y

x the y

x at y

x is y

x [space] y which is equivalent to collocations i.e.:


In addition to seeing interesting phrases, the visualisation helps you to get to know your corpus much better, e.g. in a couple of the diagrams I noticed spf and dkim. Looking those up using AntConc, I discovered they were abbreviations for sender policy framework and domain keys identified mail respectively. I have no idea what these are but my students may find them important to know.
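
The counting behind a phrase pattern such as x and y can be approximated with a regular expression. This is a minimal Python sketch and not Paper Machines' own code; the function name is made up, and only lower-cased alphabetic words separated by single spaces are handled.

```python
import re
from collections import Counter


def phrase_counts(text, link="and"):
    """Count (x, y) pairs that occur as 'x <link> y' in the text."""
    # The lookahead leaves y unconsumed, so in 'a and b and c'
    # both (a, b) and (b, c) are counted.
    pattern = re.compile(rf"\b([a-z]+) {re.escape(link)} (?=([a-z]+)\b)")
    return Counter(pattern.findall(text.lower()))


if __name__ == "__main__":
    sample = "Drag and drop the file. Up and running, up and running again."
    for (x, y), n in phrase_counts(sample).most_common():
        print(f"{x} and {y}: {n}")
```

Passing `link="the"` or `link="or"` gives the other default patterns; the counts would correspond to the arrow thickness in the visualisation.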

There are other functions on Paper Machines like Ngram but I could not get that working. For fun, the image below is a Heatmap of the place names mentioned in my webmonkey corpus:


Thanks for reading.

#Lexconf2013 – Novice workshop-talk hues

As mentioned in my #lexconf2013 conference summary, this post outlines my first “proper” workshop-talk experience.

Things will go wrong, more so in cases where technology is involved.* My workshop had 5 computers in total, 4 of which were kindly provided by the event organisers. Unfortunately two programs were not installed on these computers, so bringing a copy of the software on a USB key is vital. I wasted time trying to find a solution to one of the missing programs and only arrived at a partial solution in any case.

Forty-five minutes really translates into 35-40 mins once you allow 5-10 mins for conference attendees to arrive and settle. So subtract 10 mins from the official talk length.

Related to timing with workshops is how important it is to focus on a limited amount of content. I was too ambitious in what I wanted to cover. This I guess comes with experience.

Wish for great plenary sessions in your conference 🙂 so that you have an interesting source of common experience which you can use to connect to your audience. It also helps to calm your nerves as you are thinking about a great plenary you have just heard rather than overthinking your impending talk.

Reading about other teachers’ experiences really does help, e.g. Ava Fruin/@avafruin courageously blogged about her first talk from conception to post-talk reflection.

To get a taste of my talk you can check a remixed version (resolution can be switched to high def for a clearer image). Many thanks to the workshop participants for their questions and feedback.

*Even giants like Google battle the tech gremlins and lose – http://youtu.be/Rd_UrSB1MAY?t=13m8s

What to teach from corpora output – frequency and transparency

Frequency of occurrence is the main way for teachers to choose what to teach when using corpora. However, as Andrew Walkley discusses in “Word choice, frequency and definitions”, using just frequency is not without limitations. In addition to frequency we can use semantic transparency/opacity, that is, how the meaning of the whole differs from its individual parts. This is also sometimes referred to as how idiomatic a phrase is. Martinez (2013) offers a Frequency-Transparency Framework that teachers can use to help them choose what phrases to teach. Using four collocates of take he presents the following graphic:

The Frequency-Transparency Framework (FTF) using four collocates of the verb take (Martinez, 2013)

The numbered quadrants are the suggested priority of the verb+noun pairs, i.e. the most frequent and most opaque phrase would be taught first (1), then the most frequent and transparent phrase (2), followed by the less frequent but opaque phrase (3) and last the least frequent and most transparent phrase (4). As said, this is only a suggested priority which can be changed according to the teaching context. For example, a further two factors (in addition to word-for-word decoding) can be considered when evaluating transparency:

  • Is the expression potentially deceptively transparent? – “every so often” can be misread as often; “for some time” can be misunderstood as short amount of time (Martinez & Schmitt, 2012, p.309)
  • Could the learner’s L1 negatively influence accurate perception?

Applying the framework to the binomials list from my webmonkey corpus, I would place up and running in quadrant 1, latest and greatest in quadrant 2, tried and true in quadrant 3 and layout and design in quadrant 4. Note that I did not place drag and drop, the most frequent and somewhat opaque phrase, since it is so well known to my multimedia students (similar to cut and paste) that it would not need teaching. Thanks for reading.
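
The quadrant ordering can be written down as a small function. This is a minimal Python sketch that treats frequency and opacity as yes/no judgements, which simplifies the framework's two axes; the function name is made up, and the judgements are simply those implied by the quadrant placements above.

```python
def ftf_priority(frequent: bool, opaque: bool) -> int:
    """Return the suggested teaching priority (1 is taught first)."""
    if frequent:
        return 1 if opaque else 2
    return 3 if opaque else 4


# (frequent, opaque) judgements implied by the placements in the text.
binomials = {
    "layout and design": (False, False),
    "tried and true": (False, True),
    "latest and greatest": (True, False),
    "up and running": (True, True),
}

if __name__ == "__main__":
    for phrase in sorted(binomials, key=lambda p: ftf_priority(*binomials[p])):
        print(ftf_priority(*binomials[phrase]), phrase)
```

The teaching context still decides the final order, as the exclusion of drag and drop shows.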


Martinez, R. (2013). A framework for the inclusion of multi-word expressions in ELT. ELT Journal 67(2): 184-198.

Martinez, R. & Schmitt, N. (2012). A Phrasal Expressions List. Applied Linguistics 33(3): 299-320.

Building your own corpus – Such as example

Although the power of corpora is in discovering horizontal relations between lexis such as collocation and colligation, the more familiar vertical relations that teachers are used to can also be explored at the same time. Such vertical relations, or semantic preference, have been used traditionally in ELT coursebooks. See the post by Leo Selivan critiquing this use. This post describes using concordance output to look at both collocation (adjective + noun) and semantic preference (hyponymy or general category/specific example relation).

I adapted an activity by Tribble (1997) who in turn based it on an example from Tim Johns who used the keyword “such as”, e.g.:

encompass many _______frequently labeled HTML5 such as the Geolocation API, offline storage API a

Make sure to see the worksheet (in odt format) to understand the following. The first question, A, in the worksheet is a traditional pre-learning task using the specific examples of the semantic categories, so in the line above the examples of geolocation api and offline storage api are instances of the category of features (the gapped word). The next questions, B and C, ask students to find the category that the words in bold on the right are examples of (these bold words are the ones in question A). Question D asks students to identify the types of words in the lists and in the italicised words, to get them to notice the adjective + noun structure, e.g. important features. Finally, question E asks them to identify adjective + noun structures in a single text taken from the website that the corpus is based on.
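
Concordance lines like the one above come from a concordancer, but the idea can be sketched in a few lines of Python. This is a minimal illustration, not the tool used for the worksheet; the function names, the context width and the sample sentence are made up, and the word to gap still has to be chosen by hand.

```python
import re


def kwic(text, keyword="such as", width=40):
    """Return keyword-in-context lines with `width` characters either side."""
    flat = " ".join(text.split())  # put the text on one line
    lines = []
    for m in re.finditer(rf"\b{re.escape(keyword)}\b", flat, flags=re.IGNORECASE):
        left = flat[max(0, m.start() - width):m.start()]
        right = flat[m.end():m.end() + width]
        # Right-align the left context so the keywords line up vertically.
        lines.append(f"{left:>{width}}{m.group(0)}{right}")
    return lines


def gap(line, word):
    """Replace the first occurrence of the chosen category word with a blank."""
    return line.replace(word, "_" * 7, 1)


if __name__ == "__main__":
    sample = "Browsers support many features such as the Geolocation API and offline storage."
    for line in kwic(sample, width=20):
        print(gap(line, "features"))
```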

Less experienced first year multimedia students struggled more with the exercises than more experienced second year students. I assume this is because they were less familiar with the lexis? Though both seemed more at ease with the last task of identifying structures in the text.

I’ll survey some corpora classroom task types in a later post.

Thanks for reading.


Tribble, C. (1997). Improvising corpora for ELT: Quick-and-dirty ways of developing corpora for language teaching. In J. Melia. & B. Lewandowska-Tomaszczyk (Eds.), PALC ’97 Proceedings, Practical Applications in Language Corpora (pp. 106-117). Lodz: Lodz University Press.