Corpus linguistics community news 7

If you follow events in the UK, you can say, without much risk of being accused of hyperbole, that these are indeed strange times.

So why not turn to the relative sanity of corpus linguistics community news 7?

First up is an example of searching BYU-COCA for use of a preposition of place.

Next, a post on one way to explore some recent audio-video corpora.

A couple of posts related to the history of CL.

A top tip when using BootCat.

My recommended link is to a mini, or maybe it’s a micro, CL course by Oxford Dictionaries.

A tool that uses TF-IDF scores to extract n-grams, using as an example Prime Minister’s Questions from ex-prime minister David Cameron and the still leader of the opposition, Jeremy Corbyn.
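For anyone curious what scoring n-grams by TF-IDF involves, here is a minimal sketch in Python. It is not the code of the tool linked above: the function names (`ngrams`, `tfidf_ngrams`), the plain log-ratio version of IDF, and the two toy texts are all my own assumptions for illustration, and the texts are invented rather than real transcripts.

```python
import math
import re
from collections import Counter


def ngrams(text, n):
    """Lowercase the text, split it into word tokens, and return its n-grams."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def tfidf_ngrams(docs, n=2):
    """Return one {ngram: score} dict per document (hypothetical helper).

    tf  = count of the n-gram in the document / total n-grams in the document
    idf = log(number of documents / number of documents containing the n-gram)
    """
    counts = [Counter(ngrams(doc, n)) for doc in docs]

    # Document frequency: in how many documents does each n-gram occur?
    doc_freq = Counter()
    for c in counts:
        doc_freq.update(c.keys())

    scores = []
    for c in counts:
        total = sum(c.values()) or 1  # avoid dividing by zero on empty texts
        scores.append(
            {g: (k / total) * math.log(len(docs) / doc_freq[g]) for g, k in c.items()}
        )
    return scores


# Invented toy texts standing in for two speakers (not real transcripts).
speaker_a = "mr speaker long term economic plan and a long term economic future"
speaker_b = "mr speaker the people ask about housing and the people ask about wages"

scores_a, scores_b = tfidf_ngrams([speaker_a, speaker_b], n=2)

# Bigrams most distinctive of the first text, highest score first.
top_a = sorted(scores_a, key=scores_a.get, reverse=True)[:2]
print(top_a)
```

The point of the IDF part is that an n-gram both speakers use (here "mr speaker") scores zero, while n-grams repeated by only one speaker float to the top. A real tool would likely add smoothing and work over many more documents.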

Do check previous corpus linguistics community posts if you haven’t yet.

Thanks for reading and have a good summer/winter.

Building your own corpus – BootCat custom URL feature

I noticed that #corpusmooc round 2 does have something on collecting corpora from the web: a video on using SketchEngine and its BootCat module, which appears under unit 3.1.8, A brief introduction to SketchEngine.

I thought there would be interest in a screencast on using the custom URL feature of BootCat, as you need to pay to use SketchEngine (although they do offer a one-month free trial and their analysis engine is pretty powerful).

The only tricky thing in setting up BootCat is getting a Windows Azure Marketplace account key, as BootCat uses the Bing search engine. Luckily the developers have a webpage explaining this.

Finally, a tweet by Noor Adnan @n00rbaizura prompted me to think that a list of tools for getting text (non-programmatically) from the net might be useful. So I list four examples in order of usefulness and cross-platform availability. If you know of any others please do let me know.

BootCat – the most useful, though when using the seed feature you need to tweak it; I talk about that here.

CorpusCreator – a simpler interface than BootCat, with a feature to download PDFs and convert them to text. Only available for Windows OS. Some description here.

IceWeb – Windows OS only; I have already described this here.

TextStat – I have also already described this here.

Thanks for reading.

Corpus linguistics community news 3

I have realised it’s been a while since I reported any potentially useful posts I have made over at the G+ CL community. So here is bulletin number 3.

Some pointers when re-writing text for graded readers – this is my interpretation of a Japanese researcher’s slide presentation so I may be talking out me backside!

AntWordProfiler and specialised vocabulary profiling – this arose out of a question that a participant on the iTDi ELT Reading materials design course had.

Videogrep, a tool to make concordances of video – very neat tool, see my first comment on the post for examples using The Big Bang Theory.

Building your own corpus – TagAnt – continuation of my series showing how to use TagAnt POS tagging to mine your DIY corpus.

Semantic tagging Sugata Mitra – oh noes he’s back! Maybe of interest to upcoming round 2 #corpusmooc folk as I try to make sense of semantic tagging.

Seeding BootCat – notes on how best to seed BootCat when building your own corpus from the web.

Don’t forget CL community news 1 and 2 if you have not checked them already.