I noticed that #corpusmooc round 2 does have something on collecting corpora from the web: a video on using SketchEngine's BootCat module, which appears under unit 3.1.8, A brief introduction to SketchEngine.
I thought there would be interest in a screencast of the custom URL feature of BootCat, since you need to pay to use SketchEngine (although they do offer a one-month free trial and their analysis engine is pretty powerful).
The only tricky thing in setting up BootCat is getting a Windows Azure Marketplace account key, as BootCat uses the Bing search engine. Luckily, the developers have a webpage explaining this.
Finally, a tweet by Noor Adnan @n00rbaizura prompted me to think that a list of tools for getting text from the net (non-programmatically) might be useful. So I list four examples in order of usefulness and cross-platform availability. If you know of any others, please do let me know.
BootCat – the most useful, though when using the seed feature you need to tweak it; I talk about that here.
CorpusCreator (now called TranslatorBank/CorpusMode) – a simpler interface than BootCat, with a feature to download PDFs and convert them to text. Only available for Windows. Some description here.
IceWeb – Windows only; I have already described this here.
TextStat – I have also already described this here.
Thanks for reading.