Building your own corpus – BootCat custom URL feature

I noticed that #corpusmooc round 2 does have something on collecting corpora from the web: a video on using SketchEngine's BootCat module, which appears under unit 3.1.8, A brief introduction to SketchEngine.

I thought that there would be interest in a screencast of using the custom URL feature of BootCat, as you need to pay to use SketchEngine (although they do offer a one-month free trial, and their analysis engine is pretty powerful).

The only tricky thing in setting up BootCat is getting a Windows Azure Marketplace account key, as BootCat uses the Bing search engine. Luckily, the developers have a webpage explaining this.

Finally, a tweet by Noor Adnan @n00rbaizura prompted me to think that listing tools available for getting text (non-programmatically) from the net might be useful. So I list four examples, in order of usefulness and cross-platform availability. If you know of any others, please do let me know.

BootCat – the most useful of the four, though you need to tweak the seed feature a little when using it; I talk about that here.

CorpusCreator (now called TranslatorBank/CorpusMode) – a simpler interface than BootCat, with a feature to download PDFs and convert them to text. Only available for Windows OS. Some description here.

ICEweb – Windows OS only; I have already described this here.

TextSTAT – also already described here.

Thanks for reading.

Building your own corpus – ICEweb

This post describes another program that you can use to crawl a website. The advantages of ICEweb are that it automatically strips HTML tags for you and that it retrieves webpages faster than TextSTAT. A disadvantage is that ICEweb does not explicitly save your settings.
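If you are curious what "stripping HTML tags" involves, the idea can be sketched in a few lines of Python using the standard library's html.parser module. This is only an illustration of the concept, not how ICEweb itself is implemented:

```python
from html.parser import HTMLParser

class TagStripper(HTMLParser):
    """Collects only the text content of an HTML page, dropping all tags."""
    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Called only for text between tags, never for the tags themselves.
        self.chunks.append(data)

    def text(self):
        return "".join(self.chunks)

def strip_tags(html):
    """Return the plain text of an HTML string with all markup removed."""
    parser = TagStripper()
    parser.feed(html)
    return parser.text()

# A toy page standing in for a downloaded article.
page = "<html><body><h1>Webmonkey</h1><p>Some <b>article</b> text.</p></body></html>"
print(strip_tags(page))
```

A real crawler would also need to handle character encodings and HTML entities, which is part of what makes a dedicated tool like ICEweb convenient.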

Prepare yourself for a deluge of screenshots; you could of course just skip to the help screen of the program and follow that :). Hat tip to Christopher Phipps/@lousylinguist for alerting me to the program.

When the program starts you will see the following:


You need to name your corpus in terms of a region/category and country/sub-category; since we are only interested in one website, ignore the region and country labels.

We’ll call the category webmonkey, and when we press the create category button we’ll see this dialogue box:


This will create the webmonkey folder in the ICEweb folder (note that, as I have already created the webmonkey folder, you can see it in the screenshot).

You will get a similar dialogue box when you create the subcategory; in this case I have called it 2013, and it should be located in the webmonkey folder:


Again, the 2013 folder has already been created. Note that if you already have some folders and files in the 2013 folder, a dialogue box will appear asking whether you want to overwrite the files; you should press no. This is where the disadvantage of not being able to save a session appears.

Then press the add/edit URLs button:


A text file opens where you can enter the URLs of the files you want; you then save and exit this text file.

Press the start retrieve button; the program will ask you which folder you want, which will be the subcategory folder, i.e. 2013:


and then once the files are downloaded you’ll get some messages in the right-hand pane like so:


Now you can have a look at one of the files by pressing the open result file button and choosing a file from the raw folder, as in the following two screenshots:



As you can see, compared to the files from TextSTAT (see the cleaning up text files post), the text file above is much easier to handle when cutting out text you are not interested in.
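For readers who would rather script this step than click through a GUI, the retrieve step above (read a list of URLs from a text file and save each page into a raw folder) can be sketched in Python. The file name urls.txt and the folder path here are just for illustration, not anything ICEweb requires:

```python
import os
import re
from urllib.request import urlopen

def read_urls(path):
    """Return the non-empty, non-comment lines of a URL list file."""
    with open(path) as f:
        return [line.strip() for line in f
                if line.strip() and not line.startswith("#")]

def url_to_filename(url):
    """Turn a URL into a safe local file name for the raw folder."""
    name = re.sub(r"^https?://", "", url)
    return re.sub(r"[^A-Za-z0-9.-]+", "_", name) + ".html"

def retrieve_all(url_file, out_dir):
    """Download every URL listed in url_file into out_dir."""
    os.makedirs(out_dir, exist_ok=True)
    for url in read_urls(url_file):
        raw = urlopen(url).read()
        out_path = os.path.join(out_dir, url_to_filename(url))
        with open(out_path, "wb") as f:
            f.write(raw)
        print("saved", out_path)

if __name__ == "__main__":
    # Example call, mirroring the webmonkey/2013 layout used above:
    # retrieve_all("urls.txt", "webmonkey/2013/raw")
    pass
```

Unlike ICEweb, this sketch saves the raw HTML without stripping tags, and it has no politeness delays or error handling, so treat it as a starting point rather than a replacement for the tool.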

You can check out the rest of the building your own corpus series if you haven’t already. Thanks for reading.