Building your own corpus – BootCat custom URL feature

I noticed that #corpusmooc round 2 does have something on collecting corpora from the web: a video showing SketchEngine's BootCat module, which appears under unit 3.1.8, A brief introduction to SketchEngine.

I thought there would be interest in a screencast of using the custom URL feature of BootCat, since you need to pay to use SketchEngine (although they do offer a one-month free trial, and their analysis engine is pretty powerful).

The only tricky thing in setting up BootCat is getting a Windows Azure Marketplace account key, as BootCat uses the Bing search engine. Luckily the developers have a webpage explaining this.

Finally, a tweet by Noor Adnan (@n00rbaizura) prompted me to think that a list of tools for getting text (non-programmatically) from the net might be useful. So I list four examples, in order of usefulness and cross-platform availability. If you know of any others, please do let me know.

BootCat – the most useful, though the seed feature needs some tweaking when you use it; I talk about that here.

CorpusCreator – a simpler interface than BootCat, with a feature to download PDFs and convert them to text. Only available for Windows OS. Some description here.

IceWeb – Windows OS only; I have already described this here.

TextSTAT – also already described here.

Thanks for reading.

Building your own corpus – binomials example

Note: this entry may not mean much if you are new to AntConc; I will post more beginner-friendly ones later. I wanted to take advantage of a well-known ELT blogger's post as a hook for this one!

This post shows how useful building your own corpus can be, particularly for ESP purposes. Leo Selivan's (@leoselivan) latest post describes how to teach binomials; be sure to check that out before proceeding. It is a great post, but the examples he uses are from general English.

Using AntConc, I can see which examples are relevant to my multimedia students.

The following screenshot shows the results of the search /# and #/ in the Clusters/N-Grams tab of AntConc (# is the default wildcard for any one word):


Going down to a minimum frequency of 4, I get this list:

drag and drop
up and running
audio and video
latest and greatest
desktop and mobile
designers and developers
quickly and easily
sites and apps
tips and tricks
username and password
building and testing
download and install
tried and true
when and where

I also note that there are a number of binomials using "new and …":

new and different
new and evolving
new and exciting
new and improved
new and underused

There are also a number of others with a frequency of 3 or less, e.g.:

backup and archiving
first and foremost
icons and logos
layout and design
panning and zooming
tips and tricks
faster and lighter
save and refresh
scissors and glue
leaps and bounds
looks and feels
new and shiny
pluses and minuses
pros and cons

Further, I can easily use the concordance lines as example sentences, e.g. for latest and greatest:


So this was a quick post to highlight how useful your own corpus can be. Stay tuned for related posts.

Building your own corpus – TextSTAT and AntConc

This post describes how to set up a workflow using two programs to build up a database of text from the internet: TextSTAT and AntConc. TextSTAT is used for its webcrawler to build your corpus [update 1: an alternative program, ICEweb; update 2: BootCat custom URL] and AntConc is used to analyse the corpus. Note that I won't be detailing any analysis in this post; that needs a series of other posts.


The first program, TextSTAT, is used to bring in your text from the internet. When you first run TextSTAT you will see the screen below:


First create a new corpus by going to Corpus/New corpus.

Next go to Corpus/Add file from web, or press the corresponding toolbar icon,

to get the following dialogue box:


The above screenshot shows that I have entered the directory for posts from February 2011 on the site, i.e. you need to have a good idea of how your site is structured in terms of how it archives its text.
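The idea behind restricting retrieval to one archive subdirectory can be sketched in Python. This is only an illustration of the principle, not TextSTAT's actual code; the HTML snippet and example.com URLs are invented, and the actual page fetching (e.g. with urllib.request) is left out:

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkParser(HTMLParser):
    # Collect href values from <a> tags on a fetched page.
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def links_in_subdirectory(html, base_url, subdir):
    # Return absolute links that stay inside one archive subdirectory,
    # mirroring the "domain: only subdirectory" option.
    parser = LinkParser()
    parser.feed(html)
    absolute = (urljoin(base_url, href) for href in parser.links)
    return sorted({url for url in absolute if url.startswith(subdir)})

page = '<a href="/2011/02/post-1/">1</a> <a href="/about/">about</a>'
print(links_in_subdirectory(page, "http://example.com/2011/02/",
                            "http://example.com/2011/02/"))
```

A crawler would then fetch each returned link, add its text to the corpus, and repeat up to the page limit you set.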

I then select 25 as the number of pages to retrieve (you need to be aware of how many posts your particular website averages). I finally select domain: only subdirectory. Once you press search, you should get a window like the one below:


Note that some of the links are indexes to the texts rather than texts themselves, and so can be deleted by first selecting them:


and then right-clicking:


and finally selecting remove files, ending up with the following screen:


Next switch to the Word Forms tab:


Press the Frequency List button and you will get a screen like this:


Once you have saved your newly made corpus, you need to change the file into a text file by simply renaming the .crp file to a .txt file. A good idea is to make a copy of the .crp file before renaming, so that you can still use the original file in TextSTAT.
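The copy-then-rename step can also be scripted, so the original corpus is never lost. A minimal sketch in Python (the function name is my own invention):

```python
import shutil
from pathlib import Path

def crp_to_txt(crp_path):
    # Copy a TextSTAT .crp corpus to a .txt file for AntConc,
    # leaving the original .crp file intact for use in TextSTAT.
    crp = Path(crp_path)
    txt = crp.with_suffix(".txt")
    shutil.copyfile(crp, txt)
    return txt
```

Calling `crp_to_txt("mycorpus.crp")` would produce `mycorpus.txt` alongside the original.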


The start screen of AntConc looks like this:


Go to File/Open File and you will get a screen like this:


You can see the .txt file you renamed from the file you built in TextSTAT at the top left of the first column. You may notice a difference between the wordlist that AntConc builds and the wordlist TextSTAT builds:


This is because renaming the .crp file to a .txt file is a dirty workaround which does not strip the headers that TextSTAT uses in its .crp file. The differences should be negligible for ESP purposes.
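What both wordlist tools are doing is essentially a token count. A minimal sketch of the idea in Python, with an invented sample sentence (real tools differ in how they tokenise, which is one source of the discrepancies mentioned above):

```python
import re
from collections import Counter

def word_frequency_list(text, top=None):
    # Build a simple frequency wordlist, in the spirit of AntConc's
    # Word List tool: lowercase, split into word tokens, count.
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words).most_common(top)

sample = "The corpus grows as the corpus crawler adds pages."
print(word_frequency_list(sample, top=3))
```

Running it on the sample puts "the" and "corpus" at the top with two occurrences each.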

Once you have your .txt file in AntConc, it is just a matter of playing around to see how you can use it for your purposes. Use the AntConc website as a jumping-off point to learn about its features. Stay tuned for a series of posts on how I use it for multimedia English classes.