Building your own corpus – BootCat custom URL feature

I noticed that #corpusmooc round 2 does have something on collecting corpora from the web: a video on using SketchEngine's BootCat module, which appears under unit 3.1.8 A brief introduction to SketchEngine.

I thought there would be interest in a screencast of the custom URL feature of BootCat, as you need to pay to use SketchEngine (although they do offer a one-month free trial and their analysis engine is pretty powerful).

The only tricky thing in setting up BootCat is getting a Windows Azure Marketplace account key, as BootCat uses the Bing search engine. Luckily the developers have a webpage explaining this.

Finally, a tweet by Noor Adnan @n00rbaizura prompted me to think that a list of tools for getting text from the net (non-programmatically) might be useful. So I list four examples in order of usefulness and cross-platform availability. If you know of any others, please do let me know.

BootCat – the most useful, though you need to tweak the seed feature when using it; I talk about that here.

CorpusCreator (now called TranslatorBank/CorpusMode) – a simpler interface than BootCat, with a feature to download PDFs and convert them to text. Only available for Windows. Some description here.

IceWeb – Windows only; I have already described this here.

TextStat – I have also already described this here.

Thanks for reading.


10 thoughts on “Building your own corpus – BootCat custom URL feature”

  1. I once had bootcat on my computer for a short time but unfortunately never really used it effectively. This screencast seems to make this a lot easier. Also the list of other applications is quite helpful. Thank you for sharing this.

  2. Hi,

    Not sure that this is the best place to make this comment, but Mike Long’s recent plenary at KOTESOL sent me back to the Doughty (she’s the CALL expert in the duo) and Long 2003 article “OPTIMAL PSYCHOLINGUISTIC ENVIRONMENTS FOR DISTANCE FOREIGN LANGUAGE LEARNING”, available here. Highly recommended. Here’s the relevant bit:

    “Another cautionary note is perhaps in order at this juncture, since the topic of using corpora for language teaching has been raised. Training language learners to use concordancing programs and corpora for the metalinguistic study of language samples is not at all what is being proposed here, although such programs certainly abound (see, e.g., LLT Special Issue Vol. 5, Num. 3). Rather, it is the course developer, not the learner, who should use these tools to build corpora that will have specific relevance to the pedagogic tasks that comprise the foreign language distance learning course (concordances and corpora tutorial). If a learner is to use concordancing tools, then this should be in the service of a generally non-language-based task, such as the work by Weeber in the Medline example cited above. Weeber’s tool allows a search of thousands of on-line publications in order to track the side effects of drugs and to connect them with diseases which might be treated by the same chemicals that cause difficulties in the original patients. Such an activity requires massive exposure to and processing of rich input and could be expected to promote inductive learning as well.”

    1. hi Geoff, nice thanks for that article looks very informative, had a quick scan. so it recommends indirect uses of corpora and reserves any direct uses to non-language tasks.

      i nod my head as when i have tried direct uses of a concordancer it has been hit and miss. but i am unable to imagine of non-language tasks!

      i am now experimenting with vocabulary profilers to help in a writing course early days maybe it will work maybe not!


      1. Hi Mura, I don’t think Cathy Doughty and Mike Long are ruling out the use by learners of concordancers to look at collocations, colligations, etc. I’ve spoken to Cathy about this and she thinks it’s kind of OK. Of course Mike particularly is keen to push the “rich input” bit, but anyway, what I think is very interesting is the recommendation that course developers build their own corpora.

  3. Hi Mura,

    Very clean and understandable video walk through. I can see how this would be really useful for building a corpus for ESP students who are working on a really specific topic. But it also seems like it would be great for more general English classes in which students are doing project based language work. My students are currently doing some research around developing nations and charitable giving (in order to make a fund raising plan). And if I had taken the time to gather a bunch of articles which dealt with this topic and made a corpus, I could have probably cut out a lot of the guesswork when it came to vocabulary work. It’s impossible for me to simplify all the texts that students need to access and use in class, but if I had identified high frequency chunks and some of the more common and more opaque collocations, I could have perhaps made the texts a bit more accessible for the students.

    Anyway, it’s nice to have this tool on my radar.



    1. Hi Kev

      thanks for kind words on vid.

      the seeding function on BootCat is powerful once you tweak it and so could as you say be used as a tool in project based work. maybe for your next one, do share if you do 🙂

