Building your own corpus – TextSTAT and AntConc

This post describes a workflow that uses two programs, TextSTAT and AntConc, to build up a database of text from the internet. TextSTAT's webcrawler is used to build the corpus and AntConc is used to analyse it. Note that I won’t be detailing any analysis in this post; that needs a series of other posts.

TextSTAT

The first program, TextSTAT, is used to bring in your text from the internet. When you first run TextSTAT you will get the screen below:

Image

First create a new corpus by going to Corpus/New corpus.

Next go to Corpus/Add file from web or press this icon: Image

to get the following dialogue box:

Image

The above screenshot shows that I have entered the directory for posts from January 2011 on the webmonkey.com site, i.e. http://www.webmonkey.com/2011/01/. You need to have a good idea of how your site structures its archive, since the crawl starts from that directory.

I then select 25 as the number of pages to retrieve (you need to be aware of how many posts your particular website averages per month). Finally, I set the domain option to "only subdirectory" so that the crawler stays inside that directory. Once you press search you should get a window like the one below:

TextSTAT_result1

Note that links such as http://www.webmonkey.com/2011/01/ and http://www.webmonkey.com/2011/01/page/2/ etc. are index pages that list the texts rather than the texts themselves, so they can be deleted by first selecting them:

TextSTAT_select

And then right clicking:

TextSTAT_rightclick

and finally selecting remove files, which leaves you with the following screen:

TextSTAT_result2
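
If you ever need to make the same index-page decision in a script rather than by hand, a minimal sketch in Python might look like this. It is not part of TextSTAT. The regular expression is an assumption based only on the two index URLs shown above (a year/month directory, optionally followed by page/N/), and the post URL in the example is made up for illustration.

```python
import re

# Assumed pattern: a year/month archive directory, optionally followed by page/N/.
# This is a guess based on the two index URLs mentioned in the post.
INDEX_PATTERN = re.compile(r"^https?://[^/]+/\d{4}/\d{2}/(page/\d+/)?$")


def is_index_page(url):
    """Return True if the URL looks like an archive index rather than a post."""
    return INDEX_PATTERN.match(url) is not None


def drop_index_pages(urls):
    """Keep only the URLs that do not look like archive index pages."""
    return [u for u in urls if not is_index_page(u)]


crawled = [
    "http://www.webmonkey.com/2011/01/",
    "http://www.webmonkey.com/2011/01/page/2/",
    "http://www.webmonkey.com/2011/01/some-post-title/",  # hypothetical post URL
]
posts = drop_index_pages(crawled)
```

Other sites archive their posts differently, so the pattern would need adjusting to whatever structure your site uses.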

Next switch to the Word Forms tab:

TextSTAT_wordforms

Press the Frequency List button (you need to run the frequency list in order to populate the .crp file with your text) and you will get a screen like this:

TextSTAT_freqlist
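
For readers curious about what a frequency list is underneath, here is a rough sketch in Python. It only illustrates the idea of counting word forms. The tokenising rule (lowercase letters with an optional internal apostrophe) is my own simplification, and TextSTAT's actual rules are not documented in this post and will likely differ.

```python
import re
from collections import Counter


def frequency_list(text):
    """Return (word, count) pairs, most frequent first."""
    # Simplified tokenisation: lowercase ASCII words, allowing one internal apostrophe.
    words = re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())
    return Counter(words).most_common()


sample = "The cat sat on the mat. The end."
freq = frequency_list(sample)
```

Here `freq` starts with the most frequent word form, and the rest follow in descending order of count.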

Once you save your newly made corpus, you need to turn it into a text file by simply renaming the .crp file to a .txt file. It is a good idea to make a copy of the .crp file before renaming so that you can still use the original file in TextSTAT.
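
The copy-then-rename step can also be done in one go with a few lines of Python. This is just a convenience sketch; the function name `export_as_txt` and the file name in the usage note are hypothetical.

```python
import shutil
from pathlib import Path


def export_as_txt(crp_path):
    """Copy a saved .crp corpus to a .txt file next to it, leaving the original untouched."""
    source = Path(crp_path)
    target = source.with_suffix(".txt")  # same name, .txt extension
    shutil.copyfile(source, target)      # the .crp file stays usable in TextSTAT
    return target
```

Calling `export_as_txt("webmonkey.crp")` would leave `webmonkey.crp` in place and create `webmonkey.txt` beside it. Like renaming by hand, this does not change the file contents in any way.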

AntConc

The start screen of AntConc looks like this:

AntCONC_start

Go to File/Open File and you will get a screen like this:

AntCONC_fileopened

In the top left of the first column you can see the .txt file you renamed from the corpus you built in TextSTAT. You may notice a difference between the wordlist that AntConc builds and the one TextSTAT builds:

AntCONC_result_wordlist

This is because renaming the .crp file to a .txt file is a dirty workaround which does not strip any of the headers that TextSTAT uses in its .crp file, so AntConc counts them as part of the text. The differences should be negligible for English for Specific Purposes (ESP) work.
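
If you want to check how small the differences really are, one option is to compare the two wordlists as word counts. The sketch below assumes you have already got each list into Python as word/count pairs (how you export them is up to you), and the numbers in it are made up purely for illustration.

```python
from collections import Counter


def count_differences(counts_a, counts_b):
    """Return {word: (count_a, count_b)} for every word whose counts differ."""
    words = set(counts_a) | set(counts_b)
    return {
        w: (counts_a.get(w, 0), counts_b.get(w, 0))
        for w in sorted(words)
        if counts_a.get(w, 0) != counts_b.get(w, 0)
    }


# Made-up counts, not real output from either program.
textstat_counts = Counter({"the": 120, "web": 40})
antconc_counts = Counter({"the": 120, "web": 41, "crp": 1})
diff = count_differences(textstat_counts, antconc_counts)
```

Words that appear only in the second list, or whose counts are slightly higher there, are the likely candidates for header material.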

Once you have your .txt file in AntConc it is just a matter of playing around to see how you can use it for your purposes. Use the AntConc website as a jumping-off point to learn about the features. Stay tuned for a series of posts on how I use it for multimedia classes.

About eflnotes

ELT teacher based in Paris, France

Discussion

10 Responses to “Building your own corpus – TextSTAT and AntConc”

  1. I’ve had to use AntConc during my MA and it seemed like it had possibilities though I didn’t go into too much depth with it. The way you’ve shown it above simply shows frequency of words, but I’m assuming it’s more powerful than that with regards to associated words or chunks in frequency, yes?

    Posted by Tyson Seburn | March 6, 2013, 03:55
    • hi Tyson,

      yes that’s right, the collocation and cluster/ngram function are very useful as is the keyword function. will detail those when i get the chance. you need to run the freq list in TextSTAT in order to populate the .crp file with your text.
      also with TextSTAT some websites won’t let you webcrawl them.
      ta
      mura

      Posted by eflnotes | March 6, 2013, 08:14
  2. Mura
    You’re a genius!
    L

    Posted by Lexical Leo | March 6, 2013, 19:09
  3. Wow… this is an awesome post… thanks for making me aware of such helpful tools…

    Posted by Paul Read | April 15, 2013, 14:38
    • hi paul, thanks for commenting. if you do use these tools or similar ones do let me know.

      currently i am struggling with how to best integrate concordance output in class, a bit of a hit and miss affair at the moment!

      ta
      mura

      Posted by eflnotes | April 15, 2013, 16:00

Trackbacks/Pingbacks

  1. Pingback: Weekly round up in ELT 08/03/2013 - ELTSquared.co.uk - March 8, 2013

  2. Pingback: #ELTchat » Using corpora in the classroom – an #ELTchat summary (6/3/13) - April 29, 2013
