This post describes how to set up a workflow using two programs to build a database of text from the internet. The two programs are TextSTAT and AntConc: TextSTAT is used for its web crawler to build your corpus [Update 1: an alternative program is ICEweb; Update 2: BootCat custom URL] and AntConc is used to analyse the corpus. Note that I won’t be detailing any analysis in this post, as that needs a series of other posts.
TextSTAT
The first program, TextSTAT, is used to bring in your text from the internet. When you first run TextSTAT you will see the screen below:
First create a new corpus by going to Corpus/New corpus.
Next go to Corpus/Add file from web or press this icon:
to get the following dialogue box:
The above screenshot shows that I have entered the directory for posts from January 2011 on the webmonkey.com site, i.e. http://www.webmonkey.com/2011/01/. You need to have a good idea of how your site is structured, in terms of how it archives its text.
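If you want to check candidate archive addresses before pointing TextSTAT at them, a minimal sketch along these lines can list them for you (Python is just my choice of illustration here, and the /year/month/ pattern is an assumption based on how webmonkey.com archives posts, so adjust it to your own site):

base = "http://www.webmonkey.com"

def archive_urls(year, months):
    # Build candidate archive directory URLs of the form base/YYYY/MM/.
    return [f"{base}/{year}/{m:02d}/" for m in months]

for url in archive_urls(2011, range(1, 4)):
    print(url)  # e.g. http://www.webmonkey.com/2011/01/

Each printed address can then be pasted into the Add file from web dialogue box.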
I then set the number of pages to retrieve to 25 (you need to be aware of roughly how many posts your particular website averages). Finally, I select domain: only subdirectory. Once you press search you should get a window like the one below:
Note that links such as http://www.webmonkey.com/2011/01/ and http://www.webmonkey.com/2011/01/page/2/ etc. are indexes to the texts, and so can be deleted by first selecting them:
And then right clicking:
and finally selecting remove files, ending up with the following screen:
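If you end up with a long list and would rather weed out those index pages outside TextSTAT, a minimal sketch like this filters a plain list of addresses (the /page/N/ pattern is an assumption based on the webmonkey.com links above; other sites may paginate their archives differently):

import re

urls = [
    "http://www.webmonkey.com/2011/01/",
    "http://www.webmonkey.com/2011/01/page/2/",
    "http://www.webmonkey.com/2011/01/some-post-title/",
]

# Keep only addresses that are not the archive index itself or a /page/N/ index.
index_pattern = re.compile(r"/\d{4}/\d{2}/(page/\d+/)?$")
texts = [u for u in urls if not index_pattern.search(u)]
print(texts)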
Next switch to the Word Forms tab:
Press the Frequency List button and you will get a screen like this:
Once you have saved your recently made corpus, you need to change the file into a text file by simply renaming the .crp file to a .txt file. A good idea is to make a copy of the .crp file before renaming, so that you can still use the original file in TextSTAT.
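If you would rather do the copy and rename in one step, a minimal sketch like this does both at once (mycorpus.crp is just a placeholder for whatever name you saved your corpus under):

import shutil

# Keep the original .crp for TextSTAT and write a renamed .txt copy for AntConc.
shutil.copyfile("mycorpus.crp", "mycorpus.txt")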
AntConc
The start screen of AntConc looks like this:
Go to File/Open File and you will get a screen like this:
In the top left of the first column you can see the .txt file you renamed from the corpus you built in TextSTAT. You may notice a difference between the wordlist that AntConc builds and the wordlist TextSTAT builds:
This is because renaming the .crp file to a .txt file is a dirty workaround which does not strip out any headers that TextSTAT uses in its .crp file. The differences should be negligible for ESP purposes.
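If you want a rough, independent check on the counts, a minimal sketch like this builds a quick frequency list from the renamed .txt file (the simple tokeniser here is only an approximation and will not match AntConc’s token settings exactly):

import re
from collections import Counter

# Crude frequency list over the renamed corpus file, as a sanity check
# against the lists TextSTAT and AntConc give you.
with open("mycorpus.txt", encoding="utf-8", errors="ignore") as f:
    words = re.findall(r"[a-z']+", f.read().lower())

for word, count in Counter(words).most_common(20):
    print(count, word)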
Once you have your .txt file in AntConc it is just a matter of playing around to see how you can use it for your purposes. Use the AntConc website as a jumping-off point to learn about the features. Stay tuned for a series of posts on how I use it for multimedia English classes.
I’ve had to use AntConc during my MA and it seemed like it had possibilities, though I didn’t go into too much depth with it. The way you’ve shown it above simply shows frequency of words, but I’m assuming it’s more powerful than that with regard to associated words or chunks in frequency, yes?
hi Tyson,
yes that’s right, the collocation and cluster/n-gram functions are very useful, as is the keyword function. will detail those when i get the chance. note that you need to run the freq list in TextSTAT in order to populate the .crp file with your text.
also with TextSTAT some websites won’t let you webcrawl them.
ta
mura
What about a DOC instead of a website?
docs are fine for sure, then you don’t need TextSTAT and can just use AntConc
Mura
You’re a genius!
L
hi leo
an undeserved compliment for sure!
thanks anyway 🙂
Wow… this is an awesome post… thanks for making me aware of such helpful tools…
hi paul, thanks for commenting. if you do use these tools or similar ones do let me know.
currently i am struggling with how to best integrate concordance output in class, a bit of a hit and miss affair at the moment!
ta
mura
This is great Mura – I’ve been thinking of making a corpus as part of my dissertation. I’ll definitely give it a go using this first.
Cheers
hi ben
that’s great to hear!
there’s a google+ group you may like to join to discuss corpora and the classroom https://plus.google.com/communities/101266284417587206243
ta
mura
I use Mike Scott’s WordSmith (his webpage is worth a visit: http://www.lexically.net/wordsmith/index.html ), but TextSTAT is new to me and very useful. Thanks. Looking forward to reading your text analysis posts.
hi there geoff
yes have heard good things about wordsmith e.g. features like concgrams, might have to invest in it sometime.
regarding textstat there’s another program, iceweb, which does the same thing but is faster and strips html tags for you, i’ve written about it here https://eflnotes.wordpress.com/2013/09/04/building-your-own-corpus-iceweb/
i look forward to any comments you may have on the antconc posts
ta
mura
Nice post!
I’m a computational linguist, and I think it’s a shame there aren’t more language-learning programs that take advantage of the processing capabilities we have available at the moment. It seems to me that every language-learning app is basically the same: a vocab manager, targeting new learners in particular, with little support for people who can already read quite well.
At the moment I’m working on an app to generate grammar-based Cloze tests. But, if anyone has any ideas, I’m all ears!
hi thanks for commenting
just checked your app cloze.it very nice 🙂
how does your system tell prepositional to apart from infinitive to?
you are very right about the current generation of language learning programs.
i often have ideas related to such programs but then forget them promptly!
if i remember any will let you know!
ta
mura
At the moment I don’t have any logic to tell apart prepositional and infinitival “to”. The app uses a part-of-speech tagger that’s trained on a common scheme of 45 tags from the Penn Treebank: http://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html
A curiosity of this tag-set is that there’s a special tag just for “to”! I think the tagger should be able to make this distinction quite well, so I’m planning to modify the tag scheme to allow it to do so.
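To see that tag in action, here is a minimal sketch using NLTK’s default tagger (purely an illustration, not necessarily the tagger the app above uses); both uses of “to” are expected to come out with the same TO tag:

import nltk

# One-off downloads for the tokeniser and the default Penn Treebank-style tagger.
nltk.download("punkt")
nltk.download("averaged_perceptron_tagger")

tokens = nltk.word_tokenize("She wants to walk to the station.")
print(nltk.pos_tag(tokens))
# The infinitival "to" (to walk) and the prepositional "to" (to the station)
# should both be tagged TO under this scheme.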
Nice post–this kind of addition to already-available documentation can be super-useful. Thanks!
hi thanks for the comment, readers should definitely check out the zipfslaw blog https://zipfslaw.org/
ta
mura