Building your own corpus – BootCat custom URL feature

Posted on October 8, 2014 (updated June 4, 2020) by eflnotes

I noticed that #corpusmooc round 2 does have something on collecting corpora from the web: a video on using SketchEngine's BootCat module, which appears under unit 3.1.8, A brief introduction to SketchEngine.

I thought there would be interest in a screencast on using the custom URL feature of BootCat, as you need to pay to use SketchEngine (although they do offer a one-month free trial and their analysis engine is pretty powerful).

The only tricky thing in setting up BootCat is getting a Windows Azure Marketplace account key, as BootCat uses the Bing search engine. Luckily the developers have a webpage explaining this.

Finally, a tweet by Noor Adnan @n00rbaizura prompted me to think that listing tools available for getting text (non-programmatically) from the net might be useful. So I list four examples in order of usefulness and cross-platform availability. If you know of any others, please do let me know.

BootCat – Most useful, though when using the seed feature you need to tweak it; I talk about that here.

CorpusCreator (now called TranslatorBank/CorpusMode) – a simpler interface than BootCat, with a feature to download PDFs and convert them to text. Only available for Windows OS. ~~Some description here.~~

IceWeb – Windows OS only; I have already described this here.

TextStat – I have also already described this here.

Thanks for reading.

10 thoughts on “Building your own corpus – BootCat custom URL feature”

Parise Peter says:

October 9, 2014 at 00:07

I once had BootCat on my computer for a short time but unfortunately never really used it effectively. This screencast seems to make this a lot easier. Also the list of other applications is quite helpful. Thank you for sharing this.

1. eflnotes says:
  
  October 9, 2014 at 00:25
  
  hi Peter
  yes the URL feature (which is relatively recent) really helps when you get stuck with seeding!
  ta
  mura
  
geoffjordan says:

October 9, 2014 at 09:16

Hi,

Not sure that this is the best place to make this comment, but Mike Long’s recent plenary at KOTESOL sent me back to the Doughty (she’s the CALL expert in the duo) and Long 2003 article “OPTIMAL PSYCHOLINGUISTIC ENVIRONMENTS FOR DISTANCE FOREIGN LANGUAGE LEARNING” available here: http://llt.msu.edu/vol7num3/doughty/default.html. Highly recommended. Here’s the relevant bit:

“Another cautionary note is perhaps in order at this juncture, since the topic of using corpora for language teaching has been raised. Training language learners to use concordancing programs and corpora for the metalinguistic study of language samples is not at all what is being proposed here, although such programs certainly abound (see, e.g., LLT Special Issue Vol. 5, Num. 3). Rather, it is the course developer, not the learner, who should use these tools to build corpora that will have specific relevance to the pedagogic tasks that comprise the foreign language distance learning course (concordances and corpora tutorial). If a learner is to use concordancing tools, then this should be in the service of a generally non-language-based task, such as the work by Weeber in the Medline example cited above. Weeber’s tool allows a search of thousands of on-line publications in order to track the side effects of drugs and to connect them with diseases which might be treated by the same chemicals that cause difficulties in the original patients. Such an activity requires massive exposure to and processing of rich input and could be expected to promote inductive learning as well.”

1. eflnotes says:
  
  October 9, 2014 at 10:41
  
  hi Geoff, nice thanks for that article looks very informative, had a quick scan. so it recommends indirect uses of corpora and reserves any direct uses to non-language tasks.
  
  i nod my head as when i have tried direct uses of a concordancer it has been hit and miss. but i am unable to imagine any non-language tasks!
  
  i am now experimenting with vocabulary profilers to help in a writing course. early days, maybe it will work maybe not!
  
  ta
  mura
  
  1. geoffjordan says:
    
    October 9, 2014 at 11:41
    
    Hi Mura, I don’t think Cathy Doughty and Mike Long are ruling out the use by learners of concordancers to look at collocations, colligations, etc. I’ve spoken to Cathy about this and she thinks it’s kind of OK. Of course Mike particularly is keen to push the “rich input” bit, but anyway, what I think is very interesting is the recommendation that course developers build their own corpora.
  2. eflnotes says:
    
    October 9, 2014 at 11:52
    
    yes i agree, the FLAX project is interesting in this respect e.g. developing open corpora resources for a Law MOOC http://alannahfitzgerald.org/2014/10/06/vici-competition/
    
    ta
    mura
kevchanwow says:

October 10, 2014 at 04:43

Hi Mura,

Very clean and understandable video walk-through. I can see how this would be really useful for building a corpus for ESP students who are working on a really specific topic. But it also seems like it would be great for more general English classes in which students are doing project based language work. My students are currently doing some research around developing nations and charitable giving (in order to make a fundraising plan). And if I had taken the time to gather a bunch of articles which dealt with this topic and made a corpus, I could have probably cut out a lot of the guesswork when it came to vocabulary work. It’s impossible for me to simplify all the texts that students need to access and use in class, but if I had identified high frequency chunks and some of the more common and more opaque collocations, I could have perhaps made the texts a bit more accessible for the students.

Anyway, it’s nice to have this tool on my radar.

Thanks,

Kevin

1. eflnotes says:
  
  October 10, 2014 at 07:39
  
  Hi Kev
  
  thanks for kind words on vid.
  
  the seeding function on BootCat is powerful once you tweak it and so could as you say be used as a tool in project based work. maybe for your next one, do share if you do 🙂
  
  ta
  mura
  
Pingback: Building your own corpus – TextSTAT and AntConc | EFL Notes
Pingback: Building your own corpus – cleaning up your text file(s) | EFL Notes