The Prime Machine – a new concordancer in town

One of the impulses behind The Prime Machine was to help students distinguish similar or synonymous words. Recently a student of mine asked about the difference between “occasion” and “opportunity”. I used the compare function on the BYU COCA to help the student induce some meaning from the listed collocations. It kinda, sorta, helped.

The features offered by The Prime Machine promise much better help for this kind of question. For example, in the screenshot below the (Neighbourhood) Label function shows the kinds of semantic tags associated with the words “occasion” and “opportunity”. Having this info certainly helps reduce the time spent figuring out the differences between the words.

Neighbourhood Labels for the comparison of occasion and opportunity

One of the other sweet new features brought to the concordancer table is a card display system, as seen in the first screenshot below. Another is information based on Michael Hoey’s lexical priming theory, as shown in the second screenshot below.

Card display for comparison of words occasion and opportunity
Paragraph position of the words occasion and opportunity

The developer of the new concordancer, Stephen Jeaco, kindly answered some questions.

1. Can you speak a little about your background?

Well, I’m British but I’ve lived in China for 18 years now.  My first degree was in English Literature, then I did my MA in Applied Linguistics/TESOL, and my PhD was under the supervision of Michael Hoey at the University of Liverpool.

I took up programming as a hobby in my teens.  If I hadn’t got the grades to read English at York, I would have gone on to study Computer Science somewhere.  In those days the main thing was to choose a degree programme that you felt you would enjoy.  Over the years, though, I’ve kept a technical interest and produced a program here or there for MA projects and things like that.

I’ve worked at XJTLU for 12 years now.  I was the founding director of the English Language Centre, and set up and ran that for 6 years.  After rotating out of role, I moved into what is now called the Department of English where I lecture in linguistics to our undergraduate English majors and to our MA TESOL students.

2. What needs is The Prime Machine setting out to fill?

I started working on The Prime Machine in 2010, at the beginning of my part-time PhD.  At that time, I was interested in corpus linguistics but I found it hard to pass that enthusiasm on to my colleagues and students.  We had some excellent software and some good web tools, but internet access to sites outside China wasn’t always very reliable, and getting started with using corpora for language learning usually meant having to learn quite a lot about what to look for, how to look for it, and also how to understand what the data on-screen could mean.

Having taught EAP for about 10 years at that time, I felt that my Chinese learners of English needed a way to help them see some of the patterns of English which can be found through exploring examples, and in particular I wanted to help them see differences between synonyms and become familiar with how collocation information could help them improve their writing.

I’d read some of Michael Hoey’s work while doing my MA, and I met him at our university in China in his role as Pro-Vice-Chancellor for Internationalization.  His theory of lexical priming provided both a rationale for how the patterns familiar in corpus linguistics relate to acquisition and some specific aspects to focus on when thinking about what to encourage students to notice in corpus lines.

The main aim of The Prime Machine was to provide an easy start to corpus linguistic analysis – or rather an easy start to using corpus tools to explore examples.  Central to the concept were two main ideas: (1) that students would need some additional help finding what to look for and knowing what to compare and (2) that new or enhanced ways of displaying corpus lines and summary data could help draw their attention to different patterns.  Personally, I really like the “Card” display, and while KWIC is always going to be effective for most things, when it comes to trying to work out where specific examples come from and what the wider context might be, I think the cards go a long way towards helping students in their first experiences of DDL.

Practically speaking, another thing I wanted to do was to start with a search screen where they could get very quick feedback on anything that couldn’t be found and whether other corpora on the system would have some results. 

3. What kind of feedback have you got from students and staff on the corpus tool?

I’ve had a lot of feedback and development suggestions from students at my own institution.  Up until a few weeks ago, The Prime Machine was only accessible to our own staff and students.  The majority of users have been students studying linguistics modules, mostly those who are taking or have taken a module introducing corpus linguistics. However, for several years now I have also had students using it as a research tool for their Final Year Project – a year-long undergraduate dissertation project where typically each of us has 4 to 5 students for one-to-one supervision.  They’ve done a range of projects with it, including trying to apply some of Michaela Mahlberg’s approaches to another author, exploring synonyms, and exploring the naturalness of student paraphrases or exam questions.  People often think of Chinese students as being shy and wanting to avoid direct criticism of the teacher, but our students certainly develop the skills for expressing their thoughts and give me suggestions!

In my own module on corpus linguistics, I’ve found the new version of The Prime Machine to be a much easier way to get students started looking at their own English writing, or transcripts of their speech, and considering whether evidence about different synonyms and expressions from corpora can help them improve their English production.  Personally, I use it as a stepping stone to introducing features of WordSmith Tools and other resources.

In terms of staff input, I’ve had a couple of more formal projects getting feedback from colleagues on the ranking features and the Lines and Cards displays.  I’ve also had feedback from running sessions introducing the tool as part of a professional development day and a symposium.  Some of my colleagues have used it a bit with students, but I think that while it required access from campus, before I had the website up, it was a bit too tricky even on site.

On the other hand, I’ve given several conference papers introducing the software, and received some very useful comments and suggestions.

I need to balance my teaching workload, time spent working towards more concrete research outputs and family life, but if we can get over some of the connectivity issues and language teachers want to start using The Prime Machine with their students, I’m going to need as much feedback as possible.  I’d like to hope I could respond and build up or extend the tool, but at the same time there’s a need to try to keep things simple and suitable for beginners. 

4. You have some extra materials for students at your institution, could you describe these?

There’s nothing really very special about these.  But having the two ways of accessing the server (off-site vs. on-site) means that if corpus resources come with access restrictions, or if a student wants to set up a larger DIY corpus for a research project, I’m able to limit access accordingly.

Other than additional corpora, there are a few simple wordlists which I use in my own teaching and some additional options for some of the research tools.

5. What developments are in the pipeline for future versions of The Prime Machine?

One of the main reasons I wanted The Prime Machine to be publicly available and free was so that others would be able to see in action some of the features I’ve written about or presented at conferences.  In some ways, my focus has changed a bit towards smaller undergraduate projects for linguistics, but I still have interests and contacts in English language teaching.  Given some of the complications of connecting from Europe to a server in China, unless someone finds it really interesting and wants to set up a mirror server or work more collaboratively, I don’t think I can hope to have a system as widely popular and reliable as the big names in online concordancing tools.  But having interviews like this and getting the message out about the software through social media means that there is a lot more potential for suggestions and feature requests to help me develop it in ways I’ve not thought of.

But going by my own perceptions, and perhaps through interactions with my MA TESOL students, local high schools and our language centre, I’m interested in adding to the capabilities of the search screen to help students find collocations when the expression they have in mind is wildly different from anything stored in the corpus.  At the moment, it can do quite a good job of suggesting different word forms, giving some collocation suggestions and using other resources to suggest words with a similar meaning.  But sometimes students use words together in ways that (unless they want to use language very creatively) would stump most information retrieval systems.

Another aspect which I could develop would be the DIY text tools, which currently start to slow down quite rapidly when reading more than 80,000 words or so.  That would need a change of underlying data management, even without changing any of the features that the user sees.  I added those features in the last month or two before my current cohort of students were to start their projects, and again, feedback on those tools and some of the experimental features would be really useful.  On the other hand, I point my own students to tools like WordSmith Tools and AntConc when it comes to handling larger amounts of text!

The other thing, of course, is that I’m looking forward to getting hold of the BNC 2014 and adding another corpus or two.  Again, I can’t compete with the enormous corpora available elsewhere, but since most of the features I’m trying to help students notice differ across genre, register and style, I am quite keen on moderately sized corpora which have clearly defined sub-corpora or plenty of metadata.

One thing I would like to explore is porting The Prime Machine to macOS, and possibly to mobile devices and tablets.  But as it stands, using The Prime Machine requires the kind of time commitment and concentration (and multiple searches and shuffling of results) that may not be so suitable for mobile phones.  I sometimes think it is more like the way we’d hunt for a specialist item on Taobao or eBay when we’re not sure of a brand or even a product name, rather than the kind of apps we tend to expect from our smartphones, which provide instant ready-made answers.  Redesigning it for mobile use will need some thought.

Personally, I’m hoping to start one or two new projects, perhaps working with Chinese and English or looking more generally at Computer Assisted Language Teaching.  

Now that The Prime Machine is available, while of course it would be great if people use it and find it useful, more importantly beyond China I think I’d hope that it could inspire others to try creating new tools.  If someone says to the developer working on their new corpus web interface, “Do you think you could make a display that looks a bit like that?”, or “Can you pull in other data resources so those kinds of suggestions will pop up?”, I think they wouldn’t find it difficult, and we’d probably have more web tools which are a bit more user-friendly in terms of operation and more intuitive in terms of support for interpretation of the results. 

6. What other corpus tools do you recommend for teachers and students?

Well, I love seeing the enhancements and new features we get with new versions of popular corpus tools.  And at conferences, I’m always really impressed by some of the new things people are doing with web-based tools.   But one thing that I would say is that for the students I work with, I think knowing a bit more about the corpus is more useful than having something billions of words in size; being able to explore a good proportion of concordance lines for a mid-frequency item is great.  I think having a list of collocations or lines from millions of different sources to look at isn’t going to help language learners become familiar with the idea that concordance lines and corpus data can help them understand, explore and remember more about how to use words effectively. 

Nevertheless, I think those of us outside Europe should be quite jealous of the Europe-wide university access to Sketch Engine that’s just started for the next 5 years.  I also really like the way the BYU tool has developed.  I was thrilled to get hold of the MAT software for multidimensional analysis.  And I think I’ll always have my WordSmith Tools V4 on my home computer, and a link to our university network version of WordSmith Tools in my office and in the computer labs I use.

Thanks for reading. Do note that if you comment here I need to forward your comments to Stephen (as he is behind the Great Firewall of China), so there may be a delay in any feedback. Alternatively, contact Stephen yourself via the main The Prime Machine website.

Also do note that the currently available version of The Prime Machine may not work at the moment; wait a few days for Stephen to apply a fix and then try again.


CORE blimey – genre language

A #corpusmooc participant, answering a discussion question about what they would like to use corpora for, replied that they wanted a reference book showing common structures in various genres, such as “letters of condolence, public service announcements, obituaries”.

The CORE (Corpus of Online Registers of English) corpus at BYU, along with the virtual corpora feature, offers a way to get at this.

For example, the screenshot below shows the keywords of verbs & adjectives in the Reviews genre:

Before I briefly show how to make a virtual corpus, do note that the standard interface allows you to do a lot of things with the various registers; the CORE interface shows you examples of this. For example, the following shows the distribution of the present perfect across the genres:

Create virtual corpora

To create a virtual corpus first go to the CORE start page:

Then click on Texts/Virtual and get this screen:

Next press Create corpus to get this screen:

We want the Reviews genre, so choose it from the drop-down box:

Then press Submit to get the following screen:

Here you can either accept these texts or, if you want to build, say, a film-review-only corpus, manually look through the links and filter for film reviews. Give your corpus a name or add it to an already existing corpus. Here we give it the name “review”:

Then after submitting you will be taken to the following screen, which shows your whole virtual corpora collection; we can see the corpus we just created at number 5:

Now you can list keywords.

Do note that the virtual corpora feature is available in most of the BYU collection, so if genre is not your thing the other corpora might be useful.

Thanks for reading and do let me know if anything appears unclear.

 

Affix knowledge test and word part technique

CAT-WPLT

There is a new online test, the CAT-WPLT (computerized adaptive testing version of the Word Part Levels Test), to assess students’ word part knowledge, i.e. knowledge of prefixes, suffixes and stems (though the test itself only covers affixes, for receptive use). The (diagnostic) test is composed of three parts – form, meaning and use. The form part presents 1 real affix and 4 distractor affixes for the test taker to choose between. The meaning part presents 1 correct meaning and 3 distractor meanings, and the use part presents 4 parts of speech, one of which must be matched correctly to the affix.
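To make the three-part item format concrete, here is a minimal sketch of what one such item might look like as a data structure. The affix, distractors and glosses are invented for illustration; they are not items from the actual test.

```python
# A minimal sketch of one CAT-WPLT-style item as a Python data structure.
# The affix, distractors and glosses below are invented for illustration;
# they are not taken from the actual test.
from dataclasses import dataclass

@dataclass
class WordPartItem:
    affix: str               # the real affix being tested
    form_distractors: list   # four invented non-affixes shown alongside it
    meaning_options: list    # one correct meaning plus three distractors
    correct_meaning: str
    use_options: list        # four parts of speech to choose from
    correct_use: str         # the part of speech the affix actually forms

item = WordPartItem(
    affix="-ness",
    form_distractors=["-nass", "-nuss", "-nish", "-nese"],
    meaning_options=["state or quality", "one who does", "opposite of", "capable of"],
    correct_meaning="state or quality",
    use_options=["noun", "verb", "adjective", "adverb"],
    correct_use="noun",      # e.g. happy -> happiness
)

print(item.affix, "->", item.correct_meaning, f"({item.correct_use})")
```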

Try out the test – CAT-WPLT.

The online test takes about 10–15 minutes to complete and ends with a nice feedback screen showing how the test taker did on the form, meaning and use of the affixes. There are comparison profiles for advanced, intermediate and beginner learners.

Figure from Mizumoto, Sasao, & Webb (2017), p. 14

So say you have a profile of a student who shows weakness in form and meaning. What now? Mizumoto, Sasao, & Webb (2017) suggest giving learners their PDF list of 118 affixes (assuming you don’t need to use the test again). So if your learner is at level 1 for recognizing the form of an affix, the affixes listed at level 2 can be focused on.

Another possibility is a memory technique called the word part technique.

Word part technique
Very simply, it means using an already known word which contains the same word stem/root as the new word to be remembered.

More specifically, the system Wei and Nation (2013) describe lists very frequent stems, i.e. stems which appear in words in the most frequent 2,000 words of the BNC. These are then used to learn words containing the same stems among the remaining 8,000 mid-frequency words in the BNC wordlist. For example, a high-frequency word like visit has the root -vis-, which appears in mid-frequency words such as visible, envisage and revise.

Once a form connection is seen between a known high-frequency word and a mid-frequency word, a meaning connection needs to be made, i.e. the form connection has to be explained. So to explain the word visible we can say that visible describes something that you can see. Here the explanation uses the meaning of -vis-, i.e. see.

(high-freq. word)  visit   -> go to see someone
                     |
(stem)             -vis-   -> see
                     |
(mid-freq. word)   visible -> something that you can see
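
For readers who like to tinker, the same linkage can be captured in a tiny lookup table. This is just a sketch: the stem, gloss and word lists follow Wei and Nation’s example, while the table layout and function name are my own.

```python
# The visit -> -vis- -> visible linkage above as a tiny lookup table.
# A sketch only: the stem, gloss and word lists follow Wei and Nation's
# example, while the table layout and function name are my own.
STEMS = {
    "vis": {
        "meaning": "see",
        "known_word": "visit",                              # high-frequency anchor
        "target_words": ["visible", "envisage", "revise"],  # mid-frequency words
    },
}

def explain(stem: str, word: str) -> str:
    """Spell out the form and meaning connection for a target word."""
    entry = STEMS[stem]
    return (f"{word} contains -{stem}- ('{entry['meaning']}'), "
            f"as in the familiar word {entry['known_word']}")

print(explain("vis", "visible"))
# visible contains -vis- ('see'), as in the familiar word visit
```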

According to Wei & Nation (2013), the most difficult step is explaining the connection, though I think the most difficult is the first step – seeing the connection, i.e. spotting the stem/root. Wei & Nation (2013) encouragingly state that both making the connection and explaining it develop with practice.

 


Click here to see top 25 word stems taken from Wei & Nation (2013)


They recommend that once students have worked with this technique with the teacher, they can go on to use it by themselves as a strategy.

The technique’s efficacy is on par with the keyword technique and learners’ own methods or self-strategies (Wei, 2015). The word part technique has the added benefit of drawing on the etymology and history of words.

Thanks for reading.

References

Mizumoto, A., Sasao, Y., & Webb, S. A. (2017). Developing and evaluating a computerized adaptive testing version of the Word Part Levels Test. Language Testing, 0265532217725776.

Wei, Z., & Nation, P. (2013). The word part technique: A very useful vocabulary teaching technique. Modern English Teacher, 22, 12–16.

Wei, Z. (2015). Does teaching mnemonics for vocabulary learning make a difference? Putting the keyword method and the word part technique to the test. Language Teaching Research, 19(1), 43-69.

A FLAIR, VIEW of a couple of interesting language learning apps

FLAIR (Form-focused Linguistically Aware Information Retrieval) is a neat search engine that can filter web texts for 87 grammar items (e.g. to-infinitives, simple prepositions, copular verbs, auxiliary verbs).

The screenshot below shows the results window after a search using the terms “grenfell fire”.

There are 4 areas, which I have marked A, B, C and D, that attracted my attention the most. There are other features which I will leave for you to explore.

A – Here you can filter the results of your search by CEFR level. The numbers in faint grey show how many documents there are in this particular search (a total of 20).

B – Filter by the Academic Word List; the first icon to the right lets you add your own wordlist.

C – The main filter of 87 grammar items. Note that some grammar items are detected more accurately than others (a toy sketch of one such detector follows this list).

D – You can upload your own text for FLAIR to analyze.
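
As a toy illustration of what a grammar-item detector has to do, the sketch below finds to-infinitives with off-the-shelf NLTK part-of-speech tagging. This is my own illustration of the task, not FLAIR’s actual implementation.

```python
# A toy grammar-item detector: find to-infinitives with off-the-shelf
# NLTK part-of-speech tagging. This is my own illustration of the task,
# not FLAIR's actual implementation.
import nltk

# First run only:
# nltk.download("punkt"); nltk.download("averaged_perceptron_tagger")

def find_to_infinitives(text: str) -> list:
    """Return 'to VERB' strings where 'to' is followed by a base-form verb."""
    tagged = nltk.pos_tag(nltk.word_tokenize(text))
    hits = []
    for (word, _), (nxt, nxt_tag) in zip(tagged, tagged[1:]):
        if word.lower() == "to" and nxt_tag == "VB":  # VB = base form (Penn tagset)
            hits.append(f"to {nxt}")
    return hits

print(find_to_infinitives("She decided to leave early to catch the train."))
# ['to leave', 'to catch']
```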

Another feature to highlight is that you can use the “site:” command to search within websites – nice. A paper on FLAIR [1] suggests the following sites to try: https://www.gutenberg.org; http://www.timeforkids.com/news; http://www.bbc.co.uk/bitesize; https://newsela.com; http://onestopenglish.com.

The following screenshot shows an article filtered by C1-C2 level, Academic Word List and Phrasal Verbs:

VIEW (Visual Input Enhancement of the Web) is a related tool that learners of English, German, Spanish and Russian can use to highlight articles, determiners, prepositions, gerunds, noun countability and phrasal verbs in web texts (the full set is currently available only for English). In addition, users can do activities such as clicking, multiple choice and practice (i.e. fill in the blank) to identify grammar items. The developers call VIEW an intelligent automatic workbook.
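
To give a flavour of what an “automatic workbook” activity involves, here is a toy gap-making sketch for prepositions. It is my own illustration, not VIEW’s code, and the preposition list is deliberately small.

```python
# A toy 'automatic workbook' activity: gap out prepositions in a text to
# make a fill-in-the-blank exercise. My own illustration, not VIEW's code,
# and the preposition list is deliberately small.
import re

PREPOSITIONS = {"in", "on", "at", "by", "for", "with", "from", "to", "of"}

def make_cloze(text: str):
    """Replace each listed preposition with a numbered gap; return exercise and key."""
    answers = []
    def gap(match):
        answers.append(match.group(0))
        return f"__({len(answers)})__"
    pattern = r"\b(" + "|".join(PREPOSITIONS) + r")\b"
    return re.sub(pattern, gap, text, flags=re.IGNORECASE), answers

exercise, key = make_cloze("The book on the table was written by a friend of mine.")
print(exercise)  # The book __(1)__ the table was written __(2)__ a friend __(3)__ mine.
print(key)       # ['on', 'by', 'of']
```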

VIEW comes as a browser add-on for Firefox, Chrome and Opera, as well as a web app. The following screenshot shows the menu of the Firefox add-on:

VIEW draws on the idea of input enhancement as the research rationale behind its approach [2].

References:

1. Chinkina, M., & Meurers, D. (2016). Linguistically aware information retrieval: providing input enrichment for second language learners. In Proceedings of the 11th Workshop on Innovative Use of NLP for Building Educational Applications, San Diego, CA. PDF available [http://anthology.aclweb.org/W/W16/W16-0521.pdf]
2. Meurers, D., Ziai, R., Amaral, L., Boyd, A., Dimitrov, A., Metcalf, V., & Ott, N. (2010, June). Enhancing authentic web pages for language learners. In Proceedings of the NAACL HLT 2010 Fifth Workshop on Innovative Use of NLP for Building Educational Applications (pp. 10-18). Association for Computational Linguistics. PDF available [http://www.sfs.uni-tuebingen.de/~dm/papers/meurers-ziai-et-al-10.pdf]

Monco and “fresh” make & do collocations

Monco, the web news monitor corpus (which means it is continuously updated), has a tremendous collocation feature. I first saw a reference to the collocation feature in a tweet by Diane Nicholls ‏@lexicoloco, but when I tried it the server was acting up. I was reminded to try again by a tweet from Dr. Michael Riccioli ‏@curvedway, and whoa, it is impressive.
For example, let’s see what the collocates of the famous make and do verbs are.

For make, here is a screenshot of the search settings for collocation (to get to the collocation function, look under the Tools menu on the main Monco page). Note I am looking for nouns that come after the verb make. Also, the double asterisk is a shortcut for all forms of make (try it without the asterisks and see what you get). A rough offline analogue of this kind of search is sketched at the end of this walkthrough.


I get the following as the top 10 collocates (for all forms of make):

Top 10 collocates of make (click on image for full results)

Interesting collocations include make sense, make way and make debut. The results show you at a glance the types of constructions involved:


Or you can open another window for more details:


The top 10 collocates for do are:

Top 10 collocates of do (click on image for full results)

Interesting collocates here are do thing, do anything, do something and do nothing, which makes a change from do shopping, do cooking, etc. : )
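
For anyone who wants to approximate this kind of verb + following-noun collocate search offline, here is a rough sketch using NLTK’s tagged Brown corpus. This is only an analogue for illustration; Monco’s own engine, news corpus and statistics are not involved.

```python
# A rough offline analogue of a verb + following-noun collocate search,
# using NLTK's tagged Brown corpus. For illustration only: Monco's engine,
# news corpus and statistics are not involved.
from collections import Counter

import nltk
from nltk.corpus import brown

# nltk.download("brown")  # first run only

def noun_collocates(verb_forms: set, top_n: int = 10):
    """Count nouns directly following any of the given verb forms."""
    counts = Counter()
    for sentence in brown.tagged_sents():
        for (word, _), (nxt, nxt_tag) in zip(sentence, sentence[1:]):
            if word.lower() in verb_forms and nxt_tag.startswith("NN"):
                counts[nxt.lower()] += 1  # Brown tagset: noun tags start with NN
    return counts.most_common(top_n)

print(noun_collocates({"make", "makes", "made", "making"}))
print(noun_collocates({"do", "does", "did", "doing", "done"}))
```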

Thanks for reading.

Check out an extended version of this post over at the HLT Magazine – Celebrity Collocations with Monco

Using BYU-Wikipedia corpus to answer genre related questions

A link was posted recently on Twitter to an IELTS site looking at writing processes and describing graphs.
The following caught my eye:

…natural processes are often described using the active voice, whereas man-made or manufacturing processes are usually described using the passive.
(http://iamielts.com/2016/02/descriptive-report-process-descriptions-and-proofreading/)

The claim seems to go back to 2011 online (http://ielts-simon.com/ielts-help-and-english-pr/2011/02/ielts-writing-task-1-describe-a-process-1.html).

This is an interesting claim. It has been shown that passives are more common in abstract, technical and formal writing (Biber, 1988, as cited by McEnery & Xiao, 2005). Here the claim is about specific written texts on natural and man-made processes.

We can simplify this by asking: are more passives used when writing about man-made processes than when writing about natural processes? Since a clause written in the passive is not written in the active, answering this simpler question lets us reach a conclusion about the original claim by deduction.

The BYU-Wikipedia corpus can be used to get approximations of natural-process writing and man-made-process writing. The keywords I used (for the title word) were ecology and manufacturing. Filtering out unwanted texts took longer than expected, especially for the manufacturing corpus. In the end I had an ecology corpus of 77 articles and 153,621 words and a manufacturing corpus of 116 articles and 98,195 words.

The search term I used to look for passives was are|were [v?n*]. This gave me a total of 293 passives for ecology and 304 passives for manufacturing. According to the Lancaster LL calculator, this is a statistically significant overuse of passives in manufacturing compared to ecology. The log ratio score is about 0.7, which means passives are roughly 1.6 times as common in manufacturing (if I understand this statistic correctly). Now this does not mean much on its own, as a lot of the texts in the Wikipedia corpora won’t be specifically about processes, but it is still interesting.
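
For anyone who wants to check these figures, both statistics can be computed directly from the raw counts. The sketch below implements the standard two-corpus log-likelihood (G2) and Hardie’s Log Ratio formulas, which I assume are what the Lancaster calculator uses.

```python
# Checking the figures by hand: two-corpus log-likelihood (G2) and
# Hardie's Log Ratio, computed from the raw counts above. I assume these
# are the formulas behind the Lancaster calculator.
import math

def log_likelihood(o1: int, n1: int, o2: int, n2: int) -> float:
    """G2 for an item occurring o1 times in n1 words vs o2 times in n2 words."""
    e1 = n1 * (o1 + o2) / (n1 + n2)   # expected frequency, corpus 1
    e2 = n2 * (o1 + o2) / (n1 + n2)   # expected frequency, corpus 2
    return 2 * (o1 * math.log(o1 / e1) + o2 * math.log(o2 / e2))

def log_ratio(o1: int, n1: int, o2: int, n2: int) -> float:
    """Binary log of the ratio of the two relative frequencies."""
    return math.log2((o2 / n2) / (o1 / n1))

# ecology: 293 passives in 153,621 words; manufacturing: 304 in 98,195 words
print(round(log_likelihood(293, 153_621, 304, 98_195), 1))  # ~34.8 -> p < 0.0001
print(round(log_ratio(293, 153_621, 304, 98_195), 2))       # ~0.70 -> 2**0.70 = ~1.6x
```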

More interesting are the types of verbs used in passives in ecology and manufacturing. The top ten in each case:

Ecology:

ARE FOUND
ARE CONSIDERED
ARE KNOWN
ARE CALLED
ARE COMPOSED
ARE ADAPTED
ARE USED
ARE DOMINATED
ARE INFLUENCED
ARE DEFINED

Manufacturing:

ARE USED
ARE MADE
ARE KNOWN
ARE PRODUCED
ARE CREATED
WERE MADE
ARE DESIGNED
ARE CALLED
ARE PERFORMED
ARE PLACED

Thanks for reading.

References:

Biber, D. (1988). Variation Across Speech and Writing. Cambridge: Cambridge University Press.

McEnery, A. M., & Xiao, R. Z. (2005). Passive constructions in English and Chinese: A corpus-based contrastive study. Proceedings from the Corpus Linguistics Conference Series, 1(1). ISSN 1747-9398. Retrieved from http://eprints.lancs.ac.uk/63/1/CL2005_(22)_%2D_passive_paper_%2D_McEnery_and_Xiao.pdf

Impassive Pullum on Passives

There’s a module I regularly teach at one school on writing about processes coming up soon, so the focus here is on the use of passive clauses in such contexts. For years I was happily ignorant about this grammar area, misled by inaccurate instruction from books. So it was a blessing to read and watch the noted linguist Geoffrey Pullum pull apart such advice.

As an exercise to help me remember his counsel, I knocked up three infographics; some work better than others. The information for these graphics comes from Fear and Loathing of the English Passive (html), the 6-part video series Pullum on Passives, and On the myths that passives are wordy (pdf).

Types of Passives


Real rules for Passives


Allegations against Passives


Note that Pullum is not really impassive, more impassioned, but that makes the title of this post less groovy : )

Hope these are of use to you, thanks for reading.