Building your own corpus – BootCat custom URL feature

I noticed that #corpusmooc round 2 does have something on collecting corpora from the web: a video on using SketchEngine's BootCat module, which appears under unit 3.1.8, A brief introduction to SketchEngine.

I thought that there would be interest in a screencast of using the custom URL feature of BootCat as you need to pay to use SketchEngine (although they do offer a one month free trial and their analysis engine is pretty powerful).

The only tricky thing in setting up BootCat is getting a Windows Azure Marketplace account key, as BootCat uses the Bing search engine. Luckily the developers have a webpage explaining this.

Finally a tweet by Noor Adnan @n00rbaizura prompted me to think that listing tools available for getting text (non-programmatically) from the net might be useful. So I list four examples in order of usefulness and cross-platform availability. If you know of any others please do let me know.

BootCat – The most useful, though when using the seed feature you need to tweak it; I talk about that here.

CorpusCreator (now called TranslatorBank/CorpusMode)

– a simpler interface than BootCat, with a feature to download PDFs and convert them to text. Only available for Windows OS. Some description here.

ICEweb – Windows OS only; I have already described this here.

TextStat – have also already described this here.

Thanks for reading.


Corpus linguistics community news 3

I have realised it’s been a while since I reported any potentially useful posts I have done over at the G+ CL community. So here is bulletin number 3.

Some pointers when re-writing text for graded readers – this is my interpretation of a Japanese researcher’s slide presentation so I may be talking out me backside!

AntWordProfiler and specialised vocabulary profiling – this arose out of a question that a participant on the iTDi ELT Reading materials design course had.

Videogrep, a tool to make concordances of video – very neat tool

Building your own corpus – TagAnt – continuation of my series showing how to use TagAnt POS tagging to mine your DIY corpus.

Semantic tagging Sugata Mitra – oh noes he’s back! Maybe of interest to upcoming round 2 #corpusmooc folk as I try to make sense of semantic tagging.

Seeding BootCat – notes on how best to seed BootCat when building your own corpus from the web.

Don’t forget CL community news 1 and 2 if you have not checked them already.

Building your own corpus – ICEweb

This post describes another program that you can use to crawl a website. The advantage of ICEweb is that it automatically strips HTML tags for you, and it is faster than TextSTAT at retrieving webpages. A disadvantage is that ICEweb does not explicitly save your settings.

Prepare yourself for a deluge of screenshots, you could of course just skip to the help screen of the program and follow that :). Hat tip to Christopher Phipps/ @lousylinguist for alerting me to the program.

When the program starts you will see the following:


You need to name your corpus in terms of a region/category and country/sub-category. Since we are only interested in one website, ignore the region and country labels.

We’ll call the category webmonkey and when we press create category button we’ll see this dialogue box:


This will create the webmonkey folder in the ICEweb folder (note that as I have already created the webmonkey folder you can see it in the screenshot).

You will get a similar dialogue box for when you create the subcategory, in this case I have called that 2013 which should be located in the webmonkey folder:


Again the 2013 folder has already been created. Note that if you already have some folders and files in the 2013 folder, a dialogue box will appear asking whether you want to overwrite files; you should press no. This is where the disadvantage of not being able to save a session appears.

Then press the add/edit URLs button:


A text file opens where you can enter the URLs of the files you want; you then save and exit this text file.

Press the start retrieve button, the program will ask you which folder you want, which will be the subcategory folder i.e. 2013:


and then once the files are downloaded you’ll get some messages in the right-hand pane like so:


Now you can have a look at one of the files by pressing the open result file button and choosing a file from the raw folder as in the following two screenshots:



As you can see compared to the files from TextSTAT (see cleaning up text file post) the text file above is much easier to handle in terms of easily cutting out text you are not interested in.
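If you are curious what the tag stripping amounts to, here is a minimal Python sketch using only the standard library. It is not how ICEweb itself works (I have not seen its code); the class and function names are my own, and skipping the contents of script and style tags is an assumption about what you would want removed.

```python
from html.parser import HTMLParser
from urllib.request import urlopen


class TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> contents."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside script/style blocks.
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())


def strip_tags(html):
    """Return the text of an HTML string, one text chunk per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.parts)


def fetch_text(url):
    """Download one page and return its text (needs network access)."""
    with urlopen(url) as response:
        html = response.read().decode("utf-8", errors="replace")
    return strip_tags(html)


sample = "<html><body><h1>Title</h1><p>Hello <b>world</b></p></body></html>"
print(strip_tags(sample))
```

You would call fetch_text once per URL in your list and save each result to its own text file. Navigation menus and footers still come through as text, so some cleaning may remain.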

You can check out the rest of the building your own corpus series if you haven’t already, thanks for reading.

Building your own corpus – Such as example

Although the power of a corpus lies in discovering horizontal relations between lexis, such as collocation and colligation, the more familiar vertical relations that teachers are used to can also be explored at the same time. Such vertical relations, or semantic preference, have traditionally been used in ELT coursebooks. See the post by Leo Selivan critiquing this use. This post describes using concordance output to look at both collocation (adjective + noun) and semantic preference (hyponymy, or the general category/specific example relation).

I adapted an activity by Tribble (1997) who in turn based it on an example from Tim Johns who used the keyword “such as”, e.g.:

encompass many _______frequently labeled HTML5 such as the Geolocation API, offline storage API a

Make sure to see the worksheet (in odt format) to understand the following. The first question, A, in the worksheet is a traditional pre-learning task using the specific examples of the semantic categories, so in the line above the examples of geolocation api and offline storage api are instances of the category of features (the gapped word). The next questions, B and C, ask students to find the category that the words in bold on the right are examples of (these bold words are the ones in question A). Question D asks students to identify the types of words in the lists and in the italicised words, to get them to notice the adjective + noun structure, e.g. important features. Finally, question E asks them to identify adjective + noun structures in a single text taken from the website that the corpus is based on.

Less experienced first year multimedia students struggled more with the exercises than more experienced second year students. I assume this is because they were less familiar with the lexis? Though both seemed more at ease with the last task of identifying structures in the text.

I’ll survey some corpora classroom task types in a later post.

Thanks for reading.


Tribble, C. (1997). Improvising corpora for ELT: Quick-and-dirty ways of developing corpora for language teaching. In J. Melia. & B. Lewandowska-Tomaszczyk (Eds.), PALC ’97 Proceedings, Practical Applications in Language Corpora (pp. 106-117). Lodz: Lodz University Press.

Building your own corpus – sorting body idioms

In my continuing and vain efforts to be associated with great bloggers this post could be said to be inspired by Matt Halsdorff/@MattHalsdorff’s post on native speaker use of idioms. 😉

The sort command in a concordance is one of the simplest yet most powerful features in identifying patterns. The following is a screenshot after a search on the word “face”:


Compare the above with the screenshot below which is sorted alphabetically by the word to the left of “face”:


The sort allows us to clearly see two groupings of about face and let’s face it.
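For anyone who wants to see what the sort is doing under the hood, here is a rough Python sketch. It is my own approximation, not AntConc's code: the function names, the 30-character context width and the sample text are all made up for illustration.

```python
import re


def concordance(text, keyword, width=30):
    """Return (left, match, right) tuples for each whole-word hit."""
    pattern = re.compile(r"\b%s\b" % re.escape(keyword), re.IGNORECASE)
    lines = []
    for m in pattern.finditer(text):
        left = text[max(0, m.start() - width):m.start()]
        right = text[m.end():m.end() + width]
        lines.append((left, m.group(), right))
    return lines


def sort_by_left_word(lines):
    """Sort alphabetically by the word immediately left of the keyword."""
    def key(line):
        words = re.findall(r"[\w']+", line[0].lower())
        return words[-1] if words else ""
    return sorted(lines, key=key)


# Invented sample text, not taken from the corpus in the post.
text = ("Let's face it, the layout broke. An about face on pricing followed. "
        "We can't save face now. Let's face it, nobody reads the manual.")

for left, hit, right in sort_by_left_word(concordance(text, "face")):
    print("%30s %s %s" % (left, hit, right))
```

After sorting, lines that share the same left-hand word sit next to each other, which is what makes groupings like about face and let's face it jump out.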

The most common body idiom in my corpus is head (on) (over) to, which occurs 165 times. Other body idioms in this corpus include keep an eye on; on the other hand; fall back to; neck and neck; the bottom line; start (off) on the right foot. So here I have a nice list to explore with students.

Note that the sort command in AntConc can be a little confusing: each tick box sits very close to, and to the left of, the adjustable level box it belongs to, which goes against the usual expectation that a tick box is to the right of the attribute it refers to. The following screenshot shows in red what we expect and in blue what is actually the case:


Stay tuned for further diy corpus bulletins.

Building your own corpus – initial exploration

I’ll describe here what can be done initially with your corpus in AntConc. This post should be read after the first steps in AntConc post.

The keyword results, which show the most significant words compared to a reference corpus, are a great way to check what words your class already knows. You can use your judgement as to how many and which words to use. The list can be recycled to check students' understanding over time, and there are any number of ways to exploit it.
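To give a feel for what "most significant compared to a reference corpus" means, here is a small Python sketch of a keyword list scored with log-likelihood, a commonly used keyness statistic. That AntConc was set to log-likelihood for the results in this post is an assumption on my part, and the function names and the tiny sample texts are invented.

```python
import math
import re
from collections import Counter


def word_freqs(text):
    """Lowercased word frequencies for a text."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def log_likelihood(a, b, c, d):
    """Keyness of a word seen a times in c tokens vs b times in d tokens."""
    e1 = c * (a + b) / (c + d)  # expected frequency in the study corpus
    e2 = d * (a + b) / (c + d)  # expected frequency in the reference corpus
    ll = 0.0
    if a:
        ll += a * math.log(a / e1)
    if b:
        ll += b * math.log(b / e2)
    return 2 * ll


def keywords(study, reference, top=10):
    """Words relatively more frequent in the study corpus, best first."""
    c, d = sum(study.values()), sum(reference.values())
    scored = []
    for word, a in study.items():
        b = reference.get(word, 0)
        if a / c > b / d:
            scored.append((log_likelihood(a, b, c, d), word))
    return [word for _, word in sorted(scored, reverse=True)[:top]]


study = word_freqs("responsive design makes the browser faster "
                   "the browser the design")
reference = word_freqs("the cat sat on the mat and the dog saw the cat")
print(keywords(study, reference))
```

Function words such as the drop out because they are about as common in the reference corpus as in your own, which is why the keyword list is more useful than the raw word list for picking out specialised lexis.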

Next you can focus on certain words and explore further. Before making use of the Collocation function we need to make sure that in Tool Preferences>Category>Collocates the Selected Collocate measure is set to T-score. Then focusing on say the adjective responsive, we switch to the Collocation tab and enter the word and we get this screen:


We see that the most significant noun which collocates one word to the right of responsive is design. We can do this with other adjectives from the keyword list and can get the following list:

Responsive design
New features
Faster javascript
Latest version
Desktop browser
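For the curious, the T-score behind the collocation search above can be sketched in Python. This uses the common textbook formula, (observed - expected) / square root of observed, for the word immediately to the right of the node word; AntConc's own calculation may differ in detail, and the function names and sample sentence are made up.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Split text into lowercased word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def t_scores_right(tokens, node):
    """T-score for each word found immediately to the right of `node`."""
    n = len(tokens)
    freq = Counter(tokens)
    pairs = Counter(
        right for left, right in zip(tokens, tokens[1:]) if left == node
    )
    scores = {}
    for collocate, observed in pairs.items():
        # Expected co-occurrences if the two words were independent.
        expected = freq[node] * freq[collocate] / n
        scores[collocate] = (observed - expected) / math.sqrt(observed)
    return scores


tokens = tokenize("responsive design is here responsive design works "
                  "responsive images too")
for word, score in sorted(t_scores_right(tokens, "responsive").items(),
                          key=lambda item: item[1], reverse=True):
    print(word, round(score, 2))
```

A T-score above about 2 is the usual rule of thumb for significance, which is the threshold used later in this post.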

Depending on your purposes you can add to the list and use it in a matching exercise. Focusing further we can use the collocation function again to look at what goes with responsive design:


The only statistically significant word is of, with a T-score of more than 2. However, we also see off, which is curious. Entering off responsive design into the concordance function, we discover:


i.e. turn off is the appropriate collocation. The above shows how unexpected uses can arise, i.e. we can ask students to think of a sentence using X + of + responsive design and Y + off + responsive design and see if what they come up with makes sense.

Additionally although embrace/s/d/ing is not statistically significant it is an interesting collocation:


Presenting the above sentences we can try to see if the students can work out the meaning of embrace.

Stay tuned for related posts.

Building your own corpus – using general English lexis

I was reading a post on money idioms by Martin Sketchley/ @ELTExperiences and was reminded that a lot of general English lexis and topics which crop up in coursebooks can be a good way into any specialised corpus you build. Usually I work with a single text with my class and then use any analysis from my corpus to focus on specific lexis, i.e. multi-media lexis. What you can also do is look at general English words, in this post I will look at the words – pay, payments, money, cost, costs, cheap, cheaper.

Note you can use the asterisk * wildcard to search for all forms of words you are interested in. e.g. pay*, cost* and cheap*.
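As a rough illustration of what the wildcard does, here is a Python sketch that turns a search term like pay* into a regular expression and counts the hits. Treating * as "zero or more word characters" is my approximation of AntConc's behaviour, and the function names and sample text are invented.

```python
import re


def wildcard_to_regex(pattern):
    """Turn a search term like 'pay*' into a whole-word regex."""
    parts = [re.escape(part) for part in pattern.split("*")]
    return re.compile(r"\b" + r"\w*".join(parts) + r"\b", re.IGNORECASE)


def count_hits(text, pattern):
    """Count the words in text that match the wildcard pattern."""
    return len(wildcard_to_regex(pattern).findall(text))


# Invented sample text, not taken from the corpus in the post.
text = ("They pay now. Payments are paid monthly; "
        "paying costs more than the cheaper cost.")

for term in ("pay*", "cost*", "cheap*"):
    print(term, count_hits(text, term))
```

Note that pay* does not catch the irregular form paid, so a wildcard search is not the same as searching for every form of a verb.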

Pay, payments, money, cost, costs, cheap and cheaper return 17, 8, 16, 10, 10, 11 and 7 hits respectively. The following is an example of the pay concordance lines:


There is an option in Tool Preferences>Category>Concordance to Hide search term in KWIC display, which will take out your word and replace it with asterisks, e.g. for the payments concordance lines:


You would then give out the concordances in 7 groups along with the 7 missing words and ask your students to fill in the gaps. It’s a good idea to examine your concordance lines in case of co-text that may be too difficult or needs explaining.

If students have not seen concordance lines they should be exposed to them beforehand.

So for example take a line from each concordance group:

1. Yes, thats right, IE7 users visiting Kogan will pay more than those using a modern web browser. Or
2. H.264 support to decode video. That means royalty payments are covered by hardware makers, not Mozil
3. lla) that Google is willing to spend that kind of money just to keep Microsoft from starting a partn
4. numbers, the Google Maps API will soon be another cost to factor into the plan. And that may be enou
5. some developers were worried about the potential costs. The vast majority of maps hackers and casua
6. ep prices for all smart shoppers down. Sure its a cheap, attention-getting gimmick, but who hasnt wa
7. t if its just backups you want, clearly there are cheaper alternatives. The real appeal of Cloud Dri

and ask students to try to work out the context in terms of the topic of the text, what comes before the fragment and what comes after. Note that if you want to expand the range of the sentence extract, adjust the Search Window Size; the default is set to 50 characters to the left and right of the search word.
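The Hide search term option and the Search Window Size setting can be pictured with a short Python sketch. This is my own approximation rather than AntConc's code; the function name is made up, and replacing the word with the same number of asterisks as it has letters is an arbitrary choice.

```python
import re


def gapped_kwic(text, word, window=50):
    """Return KWIC lines with the search word replaced by asterisks."""
    flat = " ".join(text.split())  # collapse line breaks and runs of spaces
    pattern = re.compile(r"\b%s\b" % re.escape(word), re.IGNORECASE)
    lines = []
    for m in pattern.finditer(flat):
        # `window` characters of context on each side of the hit.
        left = flat[max(0, m.start() - window):m.start()]
        right = flat[m.end():m.end() + window]
        lines.append(left + "*" * len(m.group()) + right)
    return lines


# Invented sample text, not taken from the corpus in the post.
text = "Users will pay more. Others pay less."
for line in gapped_kwic(text, "pay", window=10):
    print(line)
```

Because the window is counted in characters rather than words, lines start and end mid-word, just as in the numbered extracts above. A larger window gives students more context to work with.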

After the gap filling exercise, you divide students into seven groups and ask each group to examine one set of concordance lines for any patterns that they notice.

There is plenty of scope to use general English in terms of grammatical forms as well which I’ll talk about in another post.

Thanks for reading.

Building your own corpus – first steps in AntConc

Another installment in building your own corpus, check out the previous ones if you haven’t already.

This time we’ll look at the first steps, Wordlist, Keyword list and Save settings, to make in AntConc once you have your text file(s) as you want it.

Step one would be to make a wordlist. First go to the Tool Preferences>Category>Wordlist and make sure the Treat all data as lowercase checkbox is ticked:


Press Apply and then go to the Wordlist tab and click the Start button. You will get a screen like this:


Not surprisingly function words top the list. We are more interested in content words so we can go on to step two which is the keyword list function. Go to the Tool Preferences>Category>Keyword List:


Here you will need to load a reference corpus e.g. the Brown word list. Press the Add Files button and locate your copy of the reference corpus. As the reference corpus used is a word list and not a raw file you need to click the Use word list(s) radio button then press the Load button and the bar to the right will fill in green once the reference corpus is loaded:


Now press the Apply button, when you get back to the Keyword List window press the Start button and your screen will look similar to this:


Finally step three is to go to File>Export Settings to File and save your settings. Now we have everything we need to start exploring our corpus. Stay tuned for related posts.
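If you want to see what the wordlist step is doing, here is a minimal Python sketch of a frequency list with the same "treat all data as lowercase" idea. It is an approximation of my own, not AntConc's code: the function name is invented, and counting only runs of letters as words is a simplifying assumption.

```python
import re
from collections import Counter


def wordlist(text, lowercase=True):
    """Return (word, frequency) pairs, most frequent first."""
    if lowercase:
        text = text.lower()
    tokens = re.findall(r"[A-Za-z]+", text)
    return Counter(tokens).most_common()


sample = "The browser and the Browser"
print(wordlist(sample))
print(wordlist(sample, lowercase=False))
```

With lowercasing switched off, The and the, or Browser and browser, are counted as separate words, which is why the checkbox matters.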

Building your own corpus – cleaning up your text file(s)

One of the issues with crawling web pages is the irrelevant text that is also collected. This short post describes one way to clean up your text file(s). [Update 1: note if you use ICEWeb instead of TextSTAT you won’t need to clean up as much, if at all; update 2: BootCat also does not require much cleaning]. I assume you are using LibreOffice; I think one can find similar functions or extensions for Microsoft Office.

The extension you need is Alternative dialog Find & Replace for Writer. Load the extension by going to Tools>Extension Manager>Add. Then restart LibreOffice. You will then see the extension under Tools>Add-Ons>Alternative searching.

You need to look at your text and see if you can spot patterns to the parts of the text you do not need. For my case I had to do five passes on the text. The first pass looked like this:


I get the [::BigBlock::] function by selecting Extended>Series of paragraphs (limited between start and end marks). This means See Also[::BigBlock::]File Under will find all text starting from See Also and ending with File Under. Note that you need to have the Regular expressions checkbox ticked. You can cycle through the selections by pressing Find to verify that this is the text you want. Once you have done that you can simply press the Replace All button.

My next pass looked like this:


My third pass was this:


Note the use of | as the OR operator.

My fourth pass removed any odd characters e.g.:


My fifth pass looked to remove any photo or image attributions, so text like “photo by” or “image by” etc.

Of course, depending on how thoroughly you want to clean your text, one or two passes may be enough.
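The same kind of passes can be sketched in Python with ordinary regular expressions, if you would rather script the clean-up. This is an assumed equivalent, not what the LibreOffice extension runs: the function names and sample page are made up, and I assume the start and end marks themselves are deleted along with the text between them.

```python
import re


def remove_blocks(text, start, end):
    """Delete everything from `start` through `end`, shortest match first."""
    pattern = re.compile(re.escape(start) + r".*?" + re.escape(end),
                         re.DOTALL)
    return pattern.sub("", text)


def remove_attribution_lines(text):
    """Drop lines that start with 'photo by' or 'image by' (| means OR)."""
    pattern = re.compile(r"^(photo|image) by.*\n?",
                         re.IGNORECASE | re.MULTILINE)
    return pattern.sub("", text)


# Invented sample page, standing in for a crawled text file.
page = ("Article text.\nSee Also\nlink one\nlink two\nFile Under\n"
        "More text.\nPhoto by Someone\nEnd.")

cleaned = remove_attribution_lines(
    remove_blocks(page, "See Also", "File Under"))
print(cleaned)
```

As with Find in the extension, it is worth printing the matches before deleting anything, since a start mark that also appears in the article text would take real content with it.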

Stay tuned for related posts.

Building your own corpus – binomials example

Note this entry may not mean much if you are new to AntConc; I will post more beginner-type ones later. I wanted to take advantage of a well-known ELT blogger’s post as a hook to this one!

This post shows how useful building your own corpus can be, particularly for ESP purposes. Leo Selivan’s/@leoselivan latest post describes how to teach binomials; be sure to check that out before proceeding. A great post, but the examples he uses are from general English.

Using AntConc I can see what examples are relevant to my multi-media students.

The following screenshot shows the results of the search /# and #/ in the Clusters/N-Grams tab of AntConc (# is the default wildcard for any one word):


Looking down to a frequency hit of 4 I get this list:

drag and drop
up and running
audio and video
latest and greatest
desktop and mobile
designers and developers
quickly and easily
sites and apps
tips and tricks
username and password
building and testing
download and install
tried and true
when and where

I also note that there are a number of ones using “new and …”:

new and different
new and evolving
new and exciting
new and improved
new and underused

There are also a number of others with a frequency hit of 3 or less e.g.:

backup and archiving
first and foremost
icons and logos
layout and design
panning and zooming
tips and tricks
faster and lighter
save and refresh
scissors and glue
leaps and bounds
looks and feels
new and shiny
pluses and minuses
pros and cons
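The # and # search behind these lists can be approximated in Python by counting every three-word sequence with and in the middle. This is a sketch of my own, not AntConc's method: the function name and sample text are invented, and unlike a careful search it ignores punctuation, so it will pair up words across sentence boundaries.

```python
import re
from collections import Counter


def binomials(text, link="and", min_freq=1):
    """Count 'X and Y' sequences, most frequent first."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(
        " ".join(tokens[i - 1:i + 2])
        for i in range(1, len(tokens) - 1)
        if tokens[i] == link
    )
    return [(phrase, n) for phrase, n in counts.most_common()
            if n >= min_freq]


# Invented sample text, not taken from the corpus in the post.
text = "Drag and drop the file. Drag and drop again. Tips and tricks."
print(binomials(text))
print(binomials(text, min_freq=2))
```

Raising min_freq plays the same role as reading down the list to a frequency cut-off, and changing link to or would pick up pairs such as sooner or later.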

Further I can easily use the concordance as example sentences e.g. for latest and greatest:


So this was a quick post to highlight how useful your own corpus can be. Stay tuned for related posts.