Building your own corpus – BootCat custom URL feature

I noticed that #corpusmooc round 2 does have something on collecting corpora from the web: a video on using SketchEngine's BootCat module, which appears under unit 3.1.8 A brief introduction to SketchEngine.

I thought that there would be interest in a screencast of using the custom URL feature of BootCat as you need to pay to use SketchEngine (although they do offer a one month free trial and their analysis engine is pretty powerful).

The only tricky thing in setting up BootCat is getting a Windows Azure Marketplace account key, as BootCat uses the Bing search engine. Luckily the developers have a webpage explaining this.

Finally a tweet by Noor Adnan @n00rbaizura prompted me to think that listing tools available for getting text (non-programmatically) from the net might be useful. So I list four examples in order of usefulness and cross-platform availability. If you know of any others please do let me know.

BootCat – the most useful, though when using the seed feature you need to tweak it; I talk about that here.

CorpusCreator – a simpler interface than BootCat, with a feature to download PDFs and convert them to text. Only available for Windows OS. Some description here.

ICEweb – Windows OS only; I have already described this here.

TextSTAT – I have also already described this here.

Thanks for reading.

Corpus linguistics community news 3

I have realised it’s been a while since I have reported any potentially useful posts I have done over at the G+ CL community. So here is bulletin number 3.

Some pointers when re-writing text for graded readers – this is my interpretation of a Japanese researcher’s slide presentation so I may be talking out me backside!

AntWordProfiler and specialised vocabulary profiling – this arose out of a question that a participant on the iTDi ELT Reading materials design course had.

Videogrep, a tool to make concordances of video – very neat tool, see my first comment to post to see examples using The Big Bang Theory.

Building your own corpus – TagAnt – continuation of my series showing how to use TagAnt POS tagging to mine your DIY corpus.

Semantic tagging Sugata Mitra – oh noes he’s back! Maybe of interest to upcoming round 2 #corpusmooc folk as I try to make sense of semantic tagging.

Seeding BootCat – notes on how best to seed BootCat when building your own corpus from the web.

Don’t forget CL community news 1 and 2 if you have not checked them already.

Building your own corpus – ICEweb

This post describes another program that you can use to crawl a website. The advantage of ICEweb is that it automatically strips HTML tags for you, and it is faster than TextSTAT in retrieving webpages. A disadvantage is that ICEweb does not explicitly save your settings.

Prepare yourself for a deluge of screenshots, you could of course just skip to the help screen of the program and follow that :). Hat tip to Christopher Phipps/ @lousylinguist for alerting me to the program.

When the program starts you will see the following:


You need to name your corpus in terms of a region/category and country/sub-category; since we are only interested in one website, ignore the region and country labels.

We’ll call the category webmonkey and when we press create category button we’ll see this dialogue box:


This will create the webmonkey folder in the ICEweb folder (note that as I have already created the webmonkey folder you can see it in the screenshot).

You will get a similar dialogue box for when you create the subcategory, in this case I have called that 2013 which should be located in the webmonkey folder:


Again the 2013 folder has already been created. Note that if you already have some folders and files in the 2013 folder, a dialogue box will appear asking whether you want to overwrite files; you should press no. This is where the disadvantage of not being able to save a session appears.

Then press the add/edit URLs button:


A text file opens where you can enter the URLs of the files you want; you then save and exit this text file.

Press the start retrieve button, the program will ask you which folder you want, which will be the subcategory folder i.e. 2013:


and then once the files are downloaded you’ll get some messages in the right-hand pane like so:


Now you can have a look at one of the files by pressing the open result file button and choosing a file from the raw folder as in the following two screenshots:



As you can see compared to the files from TextSTAT (see cleaning up text file post) the text file above is much easier to handle in terms of easily cutting out text you are not interested in.
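If you are curious what "stripping html tags" involves, here is a minimal Python sketch using only the standard library. It is not how ICEweb itself works (I have not seen its code); the class name TextExtractor and the function strip_tags are names I made up for illustration, and a real webpage will need more cleaning than this.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.parts = []   # pieces of text found so far
        self._skip = 0    # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside script/style blocks.
        if not self._skip and data.strip():
            self.parts.append(data.strip())


def strip_tags(html):
    """Return the text of an HTML string with the tags removed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


# Hypothetical page, just to show the call.
page = "<html><body><h1>Title</h1><p>Some <b>bold</b> text.</p></body></html>"
print(strip_tags(page))
```

Note that joining the pieces with a space is a simplification, so spacing around punctuation may differ slightly from the original page.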

You can check out the rest of the building your own corpus series if you haven’t already, thanks for reading.

Building your own corpus – Such as example

Although the power of a corpus is in discovering horizontal relations between lexis, such as collocation and colligation, the more familiar vertical relations that teachers are used to can also be explored at the same time. Such vertical relations, or semantic preference, have been used traditionally in ELT coursebooks. See the post by Leo Selivan critiquing this use. This post describes using concordance output to look at both collocation (adjective + noun) and semantic preference (hyponymy or general category/specific example relation).

I adapted an activity by Tribble (1997) who in turn based it on an example from Tim Johns who used the keyword “such as”, e.g.:

encompass many _______frequently labeled HTML5 such as the Geolocation API, offline storage API a

Make sure to see the worksheet (in odt format) to understand the following. The first question, A, in the worksheet is a traditional pre-learning task using the specific examples of the semantic categories, so in the line above the examples of geolocation api and offline storage api are instances of the category of features (the gapped word). The next questions, B and C, ask students to find the category that the words in bold on the right are examples of (these bold words are the ones in question A). Question D asks students to identify the types of words in the lists and in the italicised words, to get them to notice the adjective + noun structure, e.g. important features. Finally question E asks them to identify adjective + noun structures in a single text taken from the website that the corpus is based on.

Less experienced first year multimedia students struggled more with the exercises than more experienced second year students. I assume this is because they were less familiar with the lexis? Though both seemed more at ease with the last task of identifying structures in the text.

I’ll survey some corpora classroom task types in a later post.

Thanks for reading.


Tribble, C. (1997). Improvising corpora for ELT: Quick-and-dirty ways of developing corpora for language teaching. In J. Melia & B. Lewandowska-Tomaszczyk (Eds.), PALC ’97 Proceedings, Practical Applications in Language Corpora (pp. 106-117). Lodz: Lodz University Press.

Building your own corpus – sorting body idioms

In my continuing and vain efforts to be associated with great bloggers this post could be said to be inspired by Matt Halsdorff/@MattHalsdorff’s post on native speaker use of idioms. 😉

The sort command in a concordance is one of the simplest yet most powerful features in identifying patterns. The following is a screenshot after a search on the word “face”:


Compare the above with the screenshot below which is sorted alphabetically by the word to the left of “face”:


The sort allows us to clearly see two groupings of about face and let’s face it.
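For the curious, here is a rough idea of what a sort on the first word to the left is doing, as a minimal Python sketch. This is not AntConc's code; the function names kwic and sort_by_left_word, the width of 30 characters and the sample sentences are all my own inventions for illustration.

```python
import re


def kwic(text, keyword, width=30):
    """Return (left context, hit, right context) for each hit of keyword."""
    rows = []
    pattern = r"\b" + re.escape(keyword) + r"\b"
    for m in re.finditer(pattern, text, flags=re.IGNORECASE):
        left = text[max(0, m.start() - width):m.start()]
        right = text[m.end():m.end() + width]
        rows.append((left, m.group(0), right))
    return rows


def sort_by_left_word(rows):
    """Sort concordance rows alphabetically by the word just left of the hit."""
    def left_word(row):
        words = row[0].split()
        return words[-1].lower() if words else ""
    return sorted(rows, key=left_word)


# Made-up sample text, not taken from my corpus.
sample = ("Let's face it, the web changes. The team did an about face "
          "on the plugin. Now let's face it again. Another about face followed.")

for left, hit, right in sort_by_left_word(kwic(sample, "face")):
    print(left.rjust(30), hit, right)
```

Sorting pulls the about face lines together and the let's face it lines together, which is the grouping effect seen in the screenshot.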

The most common body idiom in my corpus is head (on) (over) to, which occurs 165 times. Other body idioms in this corpus include keep an eye on; on the other hand; fall back to; neck and neck; the bottom line; start (off) on the right foot. So here I have a nice list to explore with students.

Note that the sort command in AntConc can be a little confusing: the tick box sits very close to, and to the left of, the adjustable level boxes, which goes against the usual expectation that tick boxes are to the right of the attribute they refer to. The following screenshot shows in red what we expect compared to the blue that is the case:


Stay tuned for further diy corpus bulletins.

Building your own corpus – initial exploration

I’ll describe here what can be done initially with your corpus in AntConc. This post should be read after the first steps in AntConc post.

The keyword results, which show the most significant words compared to a reference corpus, are a great way to check what words your class already knows. You can use your judgement as to how many and what words to use; the list can be recycled to check students’ understanding over time, and any number of ways could be used to exploit the list.

Next you can focus on certain words and explore further. Before making use of the Collocation function we need to make sure that in Tool Preferences>Category>Collocates the Selected Collocate measure is set to T-score. Then focusing on say the adjective responsive, we switch to the Collocation tab and enter the word and we get this screen:


We see that the most significant noun which collocates one word to the right of responsive is design. We can do this with other adjectives from the keyword list and can get the following list:

Responsive design
New features
Faster javascript
Latest version
Desktop browser

Depending on your purposes you can add to the list and use it in a matching exercise. Focusing further we can use the collocation function again to look at what goes with responsive design:


The only statistically significant word is of, with a T-score of more than 2. However, we also see off, which is curious; entering off responsive design into the concordance function we discover:


i.e. turn off is the appropriate collocation. The above shows how unexpected uses can arise; for example, we can ask students to think of a sentence using X + of + responsive design and Y + off + responsive design and see if what they come up with makes sense.
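As an aside on the T-score itself, the version usually given in corpus linguistics textbooks is (observed - expected) / square root of observed, where expected is the co-occurrence count you would get by chance. Below is a minimal Python sketch of that textbook formula. I am assuming, not claiming, that AntConc computes something along these lines, and the counts in the example are hypothetical rather than taken from my corpus.

```python
import math


def t_score(observed, freq_node, freq_collocate, corpus_size):
    """Textbook T-score for a node word and a collocate.

    observed       -- times the two words co-occur
    freq_node      -- frequency of the node word in the corpus
    freq_collocate -- frequency of the collocate in the corpus
    corpus_size    -- total number of tokens in the corpus
    """
    if observed <= 0:
        return 0.0  # no co-occurrences, nothing to score
    expected = freq_node * freq_collocate / corpus_size
    return (observed - expected) / math.sqrt(observed)


# Hypothetical counts, for illustration only.
score = t_score(observed=20, freq_node=50, freq_collocate=400, corpus_size=100000)
print(round(score, 2), "significant" if score > 2 else "not significant")
```

The cut-off of 2 is the same rule of thumb used above for of.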

Additionally although embrace/s/d/ing is not statistically significant it is an interesting collocation:


Presenting the above sentences we can try to see if the students can work out the meaning of embrace.

Stay tuned for related posts.

Building your own corpus – using general English lexis

I was reading a post on money idioms by Martin Sketchley/ @ELTExperiences and was reminded that a lot of general English lexis and topics which crop up in coursebooks can be a good way into any specialised corpus you build. Usually I work with a single text with my class and then use any analysis from my corpus to focus on specific lexis, i.e. multi-media lexis. What you can also do is look at general English words, in this post I will look at the words – pay, payments, money, cost, costs, cheap, cheaper.

Note you can use the asterisk * wildcard to search for all forms of the words you are interested in, e.g. pay*, cost* and cheap*.
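If you want to see what such a wildcard search amounts to, here is a minimal Python sketch that turns a pattern like pay* into a regular expression. I am assuming here that * stands for zero or more word characters; the function name wildcard_to_regex and the sample sentence are made up for illustration and this is not AntConc's own code.

```python
import re


def wildcard_to_regex(pattern):
    """Compile a simple * wildcard pattern into a whole-word regex."""
    # Escape everything, then turn the escaped asterisk into "any word characters".
    escaped = re.escape(pattern).replace(r"\*", r"\w*")
    return re.compile(r"\b" + escaped + r"\b", re.IGNORECASE)


# Made-up sample sentence.
sample = "They pay by payments; paying is paid up. Repay later."
print(wildcard_to_regex("pay*").findall(sample))
```

Note that paid and Repay are not matched, so a wildcard search will not catch irregular forms or prefixed words; those need their own search.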

Pay, payments, money, cost, costs, cheap and cheaper return 17, 8, 16, 10, 10, 11 and 7 hits respectively. The following is an example of the pay concordance lines:


There is an option in Tool Preferences>Category>Concordance to Hide search term in KWIC display, which will take out your word and replace it with asterisks, e.g. for the payments concordance lines:


You would then give out the concordances in 7 groups along with the 7 missing words and ask your students to fill in the gaps. It’s a good idea to examine your concordance lines in case of co-text that may be too difficult or needs explaining.

If students have not seen concordance lines they should be exposed to them beforehand.

So for example take a line from each concordance group:

1. Yes, thats right, IE7 users visiting Kogan will pay more than those using a modern web browser. Or
2. H.264 support to decode video. That means royalty payments are covered by hardware makers, not Mozil
3. lla) that Google is willing to spend that kind of money just to keep Microsoft from starting a partn
4. numbers, the Google Maps API will soon be another cost to factor into the plan. And that may be enou
5. some developers were worried about the potential costs. The vast majority of maps hackers and casua
6. ep prices for all smart shoppers down. Sure its a cheap, attention-getting gimmick, but who hasnt wa
7. t if its just backups you want, clearly there are cheaper alternatives. The real appeal of Cloud Dri

and ask students to try to work out the context in terms of the topic text, what comes before the fragment and what comes after. Note that if you want to expand the range of the sentence extract, adjust the Search Window Size; the default is set to 50 characters to the left and right of the search word.
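To make the window size and the hidden search term concrete, here is a minimal Python sketch that cuts a fixed number of characters either side of the search word and optionally swaps the word for asterisks. It is only an approximation of what AntConc does: the function name kwic_lines is hypothetical, and using one asterisk per letter is my assumption.

```python
import re


def kwic_lines(text, word, window=50, hide=False):
    """Return concordance lines with `window` characters of context each side."""
    lines = []
    for m in re.finditer(r"\b" + re.escape(word) + r"\b", text):
        left = text[max(0, m.start() - window):m.start()]
        right = text[m.end():m.end() + window]
        # Optionally mask the search term, one asterisk per character.
        shown = "*" * len(m.group(0)) if hide else m.group(0)
        lines.append(left + shown + right)
    return lines


# A fragment from concordance line 7 above.
sample = "clearly there are cheaper alternatives. The real appeal"
for line in kwic_lines(sample, "cheaper", window=10, hide=True):
    print(line)
```

A bigger window gives students more co-text to work with, while a small one like the 10 used here shows why fragments often start or end mid-word.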

After the gap filling exercise, you divide students into seven groups and ask each group to examine one set of concordance lines for any patterns that they notice.

There is plenty of scope to use general English in terms of grammatical forms as well which I’ll talk about in another post.

Thanks for reading.