CORE blimey – genre language

A #corpusmooc participant, answering a discussion question on what they would like to use corpora for, replied that they wanted a reference book showing common structures in various genres such as “letters of condolence, public service announcements, obituaries”.

The CORE (Corpus of Online Registers) corpus at BYU, together with its virtual corpora feature, offers a way to work towards this.

For example, the screenshot below shows the keywords of verbs & adjectives in the Reviews genre:

Before I briefly show how to make a virtual corpus, do note that the standard interface allows you to do a lot of things with the various registers. The CORE interface shows you examples of this. For example, the following shows the distribution of the present perfect across the genres:
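When comparing a construction across registers of different sizes, raw hit counts are usually normalised to a rate per million words. Here is a minimal sketch of that arithmetic. The register names, counts and sizes are made up for illustration and are not CORE figures, and I am only assuming this is roughly what a distribution chart expresses.

```python
def per_million(hits, corpus_size):
    """Normalise a raw hit count to a rate per million words."""
    return hits * 1_000_000 / corpus_size


# Hypothetical hit counts and section sizes (in words), invented for this sketch.
sections = {
    "register_a": (1_200, 2_000_000),
    "register_b": (4_500, 10_000_000),
}

for name, (hits, size) in sections.items():
    print(f"{name}: {per_million(hits, size):.1f} per million words")
```

The point of normalising is that a smaller section can show the higher rate even though its raw count is lower.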

Create virtual corpora

To create a virtual corpus first go to the CORE start page:

Then click on Texts/Virtual and get this screen:

Next press Create corpus to get this screen:

We want the Reviews Genre so choose it from the drop down box:

Then press Submit to get the following screen:

Here you can either accept these texts or, if you want to build, say, a film review corpus only, manually look through the links and filter for film reviews. Give your corpus a name or add the texts to an already existing corpus. Here we give it the name “review”:

After submitting, you will be taken to the following screen, which shows all your virtual corpora. We can see the corpus we just created at number 5:

Now you can list keywords.
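For anyone curious what a keyword list involves, here is a minimal sketch of one common keyness measure, log-likelihood, comparing a target corpus against a reference corpus. This is an assumption for illustration only: I do not know which statistic the BYU interface uses, the function names are my own, and the toy texts are invented rather than taken from CORE.

```python
import math
from collections import Counter


def log_likelihood(a, b, c, d):
    """Log-likelihood (G2) for a word seen a times in a target corpus
    of c tokens and b times in a reference corpus of d tokens."""
    e1 = c * (a + b) / (c + d)  # expected count in the target corpus
    e2 = d * (a + b) / (c + d)  # expected count in the reference corpus
    g2 = 0.0
    if a > 0:
        g2 += a * math.log(a / e1)
    if b > 0:
        g2 += b * math.log(b / e2)
    return 2 * g2


def keywords(target_tokens, reference_tokens, top=5):
    """Rank words that are relatively more frequent in the target corpus."""
    tgt, ref = Counter(target_tokens), Counter(reference_tokens)
    c, d = len(target_tokens), len(reference_tokens)
    scored = []
    for word, a in tgt.items():
        b = ref[word]
        if a / c > b / d:  # keep only words over-represented in the target
            scored.append((word, log_likelihood(a, b, c, d)))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top]


# Toy data, invented for illustration only (not taken from CORE).
reviews = "great film great acting boring plot great soundtrack".split()
general = "the plot of the meeting was to plan the annual budget".split()

for word, score in keywords(reviews, general):
    print(f"{word}\t{score:.2f}")
```

With real data the target would be the virtual corpus (for example “review”) and the reference would be the rest of the corpus.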

Do note that the virtual corpora feature is available in most of the BYU collection, so if genre is not your thing, one of the other corpora might be useful.

Thanks for reading and do let me know if anything appears unclear.

 


Corpus linguistics community news 8

First up is the news that there are more than 700 members. Nice.

An important date for your diaries is 25 September 2017, when another round of #corpusmooc launches. This time new sections are promised, and the most notable addition is a new version of LancsBox. Check out the following two cute vids being used to promote #corpusmooc 2017:

 

Also, if you use Twitter, you can follow the bot corpusmoocRT (@corpusmoocFav).

Next up are some great plenary videos from this year’s Corpus Linguistics 2017 knees-up in Birmingham, plus related notes from the conference by John Williams.

Checking the distribution of the pair on the one hand/on the other hand in BYU-COCA sections.

A graphic trying to depict keywords as calculated in AntConc.

A possible way to find collocations suitable for various proficiency levels.

And finally, for a bit o’ fun: is this the longest term in ELT? And The Banbury Corpus Revisited by Michael Swan.

Thanks for reading and, for those coming off a summer break, much energy to you for the new teaching year.

 

Corpus linguistics community news 4

Another installment of the Google+ Corpus Linguistics Community news. In addition I include links to some visual aids that were designed to answer questions people in the second round of #corpusmooc posed.

Bigger is not necessarily better – Here I talk about some minor aspects of the paper rather than the more interesting ones, namely that the paper compiled a list of approximately 14,000 pairs of collocations that are worth teaching. The list is not available as of yet. The other main finding is that frequency is more important than dispersion or chronological data when identifying collocations, with human judgement remaining a key factor in deciding on useful collocations.

Google as a corpus with students – Some interesting recent developments on using Google as a corpus with a short list of relevant online reading.

How to develop effective concordance materials using online corpus – My interpretation of a slideshow by a Korean researcher on using data-driven learning materials.

#corpusmooc Visual Aids:

Tokens, Types, Lemmas, Word families

Genre Register

Collocations and Colligations

Multi-word expressions

Do check out the other corpus linguistics community news if you haven’t already.

Thanks.

Building your own corpus – BootCat custom URL feature

I noticed that #corpusmooc round 2 does have something on collecting corpora from the web: a video on using the BootCat module in SketchEngine, which appears under unit 3.1.8 A brief introduction to SketchEngine.

I thought that there would be interest in a screencast of using the custom URL feature of BootCat as you need to pay to use SketchEngine (although they do offer a one month free trial and their analysis engine is pretty powerful).

The only tricky thing in setting up BootCat is getting a Windows Azure Marketplace account key, as BootCat uses the Bing search engine. Luckily the developers have a webpage explaining this.

Finally a tweet by Noor Adnan @n00rbaizura prompted me to think that listing tools available for getting text (non-programmatically) from the net might be useful. So I list four examples in order of usefulness and cross-platform availability. If you know of any others please do let me know.

BootCat – The most useful, though when using the seed feature you need to tweak it; I talk about that here.

CorpusCreator – A simpler interface than BootCat, with a feature to download PDFs and convert them to text. Only available for Windows OS. Some description here.

IceWeb – Windows OS only; I have already described this here.

TextStat – I have also already described this here.

Thanks for reading.

Corpusmooc round 2

Corpusmooc is back in action.

New course content includes a focus in week 5 (Corpus linguistics in action – looking at social issues through corpora) on forensic linguistics. There is an additional reading by Tim Johns on data-driven learning in week 7 (Language learning and corpus linguistics). In week 4 (How do you build a corpus) there is a new video describing the use of the TagAnt part-of-speech tagger.

Shame no new content on collecting corpora from the web.

There may be other new content I have overlooked; I will update this post if I find any.

New platform additions include a blog aggregator and a Twitter visualiser. Unfortunately the forum system is still as clunky as a clunky thing can be.

For any language teachers reading this do consider checking out the Google+ Corpus Linguistics community.

For example you may like to read blog posts by participants from round 1 of corpusmooc.