Corpus linguistics community news 4

Another installment of the Google+ Corpus Linguistics Community news. In addition I include links to some visual aids that were designed to answer questions people in the second round of #corpusmooc posed.

Bigger is not necessarily better – Here I talk about some minor aspects of the paper, and not the more interesting aspects that is, the paper compiled a list of approximately 14000 pairs of collocations that are worth teaching. The list is not available as of yet. The other main finding is that frequency is more important than dispersion or chronological data when identifying collocations, with human judgement remaining a key factor in deciding on useful collocations.

Google as a corpus with students – Some interesting recent developments on using Google as a corpus with a short list of relevant online reading.

How to develop effective concordance materials using online corpus – My interpretation of a slideshow by a Korean researcher on using data driven learning materials.

#corpusmooc Visual Aids:

Tokens, Types, Lemmas, Word families

Genre Register

Collocations and Colligations

Multi-word expressions

Do check out the other corpus lingustics community news if you haven’t already.

Thanks.

Building your own corpus – BootCat custom URL feature

I noticed that #corpusmooc round 2 does have something on collecting corpora from the web, a video of using SketchEngine using its BootCat module that appears under unit 3.1.8 A brief introduction to SketchEngine.

I thought that there would be interest in a screencast of using the custom URL feature of BootCat as you need to pay to use SketchEngine (although they do offer a one month free trial and their analysis engine is pretty powerful).

The only tricky thing in setting up BootCat is getting a Windows Azure Marketplace account key as BootCat uses Bing search engine. Luckily the developers have a webpage explaining this.

Finally a tweet by Noor Adnan @n00rbaizura prompted me to think that listing tools available for getting text (non-programmatically) from the net might be useful. So I list four examples in order of usefulness and cross-platform availability. If you know of any others please do let me know.

BootCat – Most useful though when using seed feature need to tweak it, I talk about that here.

CorpusCreator  – a simpler interface than BootCat with feature to download pdf and convert to text. Only available for Windows OS. Some description here.

IceWeb  – Windows OS only, have already described this here.

TextStat – have also already described this here.

Thanks for reading.

Corpusmooc round 2

Corpusmooc is back in action.

New course content includes a focus in week 5 (Corpus linguistics in action – looking at social issues through corpora) on forensic linguistics. There is an additional reading by Tim Johns on data driven learning in Week 7 (Language learning and corpus linguistics). In week 4 (How do you build a corpus) there is a new video describing the use of TagAnt part of speech tagger.

Shame no new content on collecting corpora from the web.

There may be other new content I may have overlooked, will update this post if I find any.

New platform additions include a blog aggregator and a twitter visualiser. Unfortunately the forum system is still as clunky as clunky thing can be.

For any language teachers reading this do consider checking out the Google+ Corpus Linguistics community.

For example you may like to read blog posts by participants from round 1 of corpusmooc.