This post describes another program that you can use to crawl a website: ICEweb. Its advantage is that it automatically strips HTML tags for you, and it retrieves webpages faster than TextSTAT. A disadvantage is that ICEweb does not explicitly save your settings.
Prepare yourself for a deluge of screenshots; you could, of course, just skip to the help screen of the program and follow that :). Hat tip to Christopher Phipps/@lousylinguist for alerting me to the program.
When the program starts you will see the following:
You need to name your corpus in terms of a region/category and a country/sub-category. Since we are only interested in one website (Webmonkey.com), ignore the region and country labels.
We’ll call the category webmonkey, and when we press the create category button we’ll see this dialogue box:
This will create the webmonkey folder in the ICEweb folder (note that, as I have already created the webmonkey folder, you can see it in the screenshot).
You will get a similar dialogue box when you create the subcategory. In this case I have called it 2013, and it should be located in the webmonkey folder:
Again, the 2013 folder has already been created. Note that if you already have some folders and files in the 2013 folder, a dialogue box will appear asking whether you want to overwrite the files; you should press no. This is where the disadvantage of not being able to save a session appears.
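If you want to check the folder layout outside the program, the category/subcategory structure described above can be sketched in Python. This is only an illustration: the `ensure_corpus_folder` helper and the base folder you pass to it are assumptions of mine, not part of ICEweb itself.

```python
from pathlib import Path


def ensure_corpus_folder(base, category, subcategory):
    """Create base/category/subcategory if needed, leaving existing files alone.

    Returns the folder path and a sorted list of entries already inside it.
    """
    folder = Path(base) / category / subcategory
    # exist_ok=True keeps an existing folder instead of raising an error
    folder.mkdir(parents=True, exist_ok=True)
    existing = sorted(p.name for p in folder.iterdir())
    return folder, existing
```

Calling `ensure_corpus_folder("ICEweb", "webmonkey", "2013")` would return the path to the 2013 folder together with whatever is already in it, so you can see at a glance whether an earlier session has left files there before you decide about overwriting.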
Then press the add/edit URLs button:
A text file opens where you can enter the URLs of the files you want; you then save and exit this text file.
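If you have a long list of pages, you could tidy the URLs in a script before pasting them into that text file. Here is a rough sketch, assuming one URL per line (I have not checked what other layouts ICEweb accepts). The `clean_url_list` helper and the example.com addresses are made up for illustration.

```python
from urllib.parse import urlparse


def clean_url_list(urls):
    """Return the http(s) URLs from `urls`, stripped, in order, without duplicates."""
    seen = set()
    cleaned = []
    for url in urls:
        url = url.strip()
        parts = urlparse(url)
        # keep only entries that look like full web addresses
        if parts.scheme not in ("http", "https") or not parts.netloc:
            continue
        if url not in seen:
            seen.add(url)
            cleaned.append(url)
    return cleaned


candidates = [
    "http://www.example.com/2013/01/first-article/",    # placeholder URLs
    "  http://www.example.com/2013/01/first-article/ ",  # duplicate with stray spaces
    "not a url",
    "https://www.example.com/2013/02/second-article/",
]
# one URL per line, ready to paste into the text file
print("\n".join(clean_url_list(candidates)))
```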
Press the start retrieve button; the program will ask you which folder you want, which will be the subcategory folder, i.e. 2013:
Once the files are downloaded, you’ll get some messages in the right-hand pane like so:
Now you can have a look at one of the files by pressing the open result file button and choosing a file from the raw folder, as in the following two screenshots:
As you can see, compared to the files from TextSTAT (see cleaning up text file post), the text file above is much easier to handle, because it is simpler to cut out the text you are not interested in.
You can check out the rest of the building your own corpus series if you haven’t already. Thanks for reading.
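To give an idea of what stripping HTML tags involves, here is a minimal sketch using Python's standard `html.parser` module. This is not how ICEweb does it internally (I don't know that); the `TagStripper` class and `strip_tags` function are my own names for illustration, and the spacing of the output is deliberately crude.

```python
from html.parser import HTMLParser


class TagStripper(HTMLParser):
    """Collect the text of a page, skipping tags and script/style contents."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # keep text only when we are not inside <script> or <style>
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def strip_tags(html):
    """Return the visible text of `html`, with text pieces joined by single spaces."""
    parser = TagStripper()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


sample = (
    "<html><head><style>p {color: red}</style></head>"
    "<body><h1>Title</h1><p>Some <b>bold</b> text.</p></body></html>"
)
print(strip_tags(sample))
```

A real page would still leave you with menus, footers and the like in the output, which is why you will want to cut out the parts you are not interested in afterwards.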