Google’s Custom Search Engine with keywords

There have been some recent developments in helping people to build their own corpora. One of these is the Wikipedia corpus builder from BYU-COCA developer Mark Davies. The advantage of this system is the handy keyword extraction tool (noun, verb, adjective, adverb, noun + noun, adjective + noun) as well as the usual functions from the BYU suite of corpus tools.

The disadvantage is that it is limited to texts from Wikipedia.

An alternative is to use Google’s Custom Search Engine (CSE), which allows you to add the websites you are interested in. More recently (well, recent for me) they have added a creation from keywords tool.

I have previously used the CSE in building a Mission to Mars search engine that contained URLs with information related to the NASA mission to Mars. This, however, required some effort in first finding relevant URLs.

Now with the keyword tool this is a much easier process.

I will describe how to do this using the example of setting up a Bridge Building corpus that I want to use with students. The task was to watch a video on a mega-bridge, take notes, then write a summary. I want to use the CSE to get students to look up “incorrect” language use in their summaries and revise it using information from the CSE.

As mentioned I have done this previously with the Mission to Mars CSE but since it took some fiddling to build I had never really followed up on that. Now with the relative speed of keyword search this is something I want to try again.

Note that I will explain how to use the CSE in detail in another post; suffice to say for now that simple Google search features like double quotation marks and the wildcard asterisk go a long way.
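As a rough illustration of those two search features, here is a minimal Python sketch that builds query strings you could paste into the CSE search box. The helper names exact_phrase and with_wildcard are made up for this example and are not part of any Google tool; the example phrases are also just illustrations.

```python
def exact_phrase(phrase):
    """Wrap a phrase in double quotation marks so it is searched as one string."""
    return '"' + phrase.strip() + '"'


def with_wildcard(before, after):
    """Build a quoted query where an asterisk stands in for the unknown word(s)."""
    return exact_phrase(before.strip() + " * " + after.strip())


if __name__ == "__main__":
    # A student unsure which verb fits could search for the frame around it.
    print(exact_phrase("suspension bridge"))
    print(with_wildcard("the bridge was", "in"))
```

Students can then compare what fills the asterisk slot in the search results with what they wrote in their summary.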

Note you do need a Google account.

When you create a new search engine you get this screen:


The “Use the CSE creation from keywords tool” option is circled in red.

Clicking on that takes you to this screen which asks you to name your Search Engine:


Then the following screen is presented where you fill in your relevant keywords:

Here is the screen with keywords I entered for my Bridge Building CSE:


Then you press Expand keywords and you get a screen like this:


These are the words Google will use to retrieve relevant URLs. Note that at this step it is advisable to check the keywords and, if any are not relevant, to use the add negative items feature (circled in red).

As you can see, my initial keyword of center span came up with words related to HTML web programming and design, so I needed to eliminate such words in a cycle of adding negative terms and checking the resulting search words.

Then once you are happy with this press the Generate URL Patterns button shown circled in the next screenshot:

As a result of the previous command you get a list of URLs that you can check for relevancy before adding them to your custom search engine:


Each CSE you build has a public URL you can share with students; here is mine for the Bridge Building CSE.

Thanks for reading and do feel free to fire any questions and comments below.

Corpus linguistics community news 5

It’s been a long time, I shouldn’t have left you
Without some corpus news to read through
(I know you got corpus soul)

First off, if you are a user of the BYU suite of corpus tools, do consider helping to correct their corpus of soap operas; should you get to 500 words, your name will be in the acknowledgements. Nice.

Next up are some interviews with Alex Boulton on some issues in DDL, Ivor Timmis on his new corpus book for ELT and Andrew Caines on a spoken corpus project.

For those interested in XML tagging there is something on using UK 2015 election forewords to follow a tutorial using rhetorical tagging.

For those interested in multi-word tagging, there are some descriptions, part 1 and part 2, of using a program called AMALGrAM 2.0.

Finally, an FYI: check out the latest version of the TED Corpus Search Engine, which now has translations and synced transcriptions.

Till next time.

Thanks for reading.


Labour and the Greens visit Paris (or a debating rhetoric lesson)

This post may be of interest to debating courses, and in particular those wanting to introduce a bit of rhetorical analysis.

First off I used a photo prompt to get students to come up with a debating point. I would normally use a video but finding decent videos takes time. Finding a photo is faster and leaves room for more interpretation. For this class I used a photo of people camping outside an Apple shop.

The students produced three topics – brand addiction, consumer society, power of advertising. They then formulated “This house believes/would” sentences such as This house believes that society suffers from brand addiction. The class then broke into groups and picked a formulation; they had 5 minutes of preparation time and debated for some 10-15 minutes (timings will vary according to student level and emergent issues).

Rhetorical analysis

I showed the class the Veni, Vidi, Vici quote along with the English equivalent (I came, I saw, I conquered) pointing out that this was one kind of rhetorical technique that repeats some language structure (technically called a tricolon though I did not use this term with students). I also noted that this kind of pattern tends to come near the end of a sentence or clause.

I then showed them an extract from Martin Luther King’s I have a dream speech and pointed out the repeated use of the same words (I have a dream, one day), adding that this repeating device often comes at the beginning of a sentence or clause.

I also took the opportunity of highlighting an example of a repeating structure (every valley shall be exalted, and every hill and mountain shall be made low, the rough places will be made plain, and the crooked places will be made straight).

Next I mentioned the upcoming UK elections and that they were going to analyze two introductions from the election manifestos of the Labour party and the Green party. I let them know I picked these two parties because each text fits on one side of A4 paper :).

I put them in pairs and gave them three tasks:
1. count the words in each sentence and plot the number of words vs sentence number
2. identify repeating words
3. identify repeating structures
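
Students did these tasks by hand, but a teacher wanting to check the counts beforehand could use something like the Python sketch below. It is only a rough aid under simple assumptions: sentences are taken to end at a full stop, question mark or exclamation mark followed by a space, hyphenated words are split in two, and the function names are invented for this example. It covers tasks 1 and 2 only; spotting repeating structures (task 3) still needs a human eye.

```python
import re
from collections import Counter


def split_sentences(text):
    """Naive splitter: a sentence ends at ., ! or ? followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def tokenize(text):
    """Lowercased words; apostrophes are kept so that don't stays one word."""
    return re.findall(r"[a-z']+", text.lower())


def sentence_lengths(text):
    """Task 1: number of words in each sentence, in order."""
    return [len(tokenize(s)) for s in split_sentences(text)]


def repeated_words(text, min_count=2):
    """Task 2: words that occur at least min_count times in the text."""
    counts = Counter(tokenize(text))
    return {word: n for word, n in counts.items() if n >= min_count}


def text_plot(lengths):
    """Rough plot of words vs sentence number: one row of # per sentence."""
    return "\n".join(
        f"{i:>2} {'#' * n} ({n})" for i, n in enumerate(lengths, start=1)
    )


if __name__ == "__main__":
    sample = "I came. I saw. I conquered."
    print(text_plot(sentence_lengths(sample)))
    print(repeated_words(sample))
```

On a full manifesto foreword the repeated words will mostly be function words (the, of, and), so in practice you would skim the list for the content words and pronouns that carry the rhetoric.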

Once they had done this they were asked to find another pair with the same text and compare notes.

As I saw pairs finishing I went round and asked them questions about the shape of the graph and why the writer would want to use such a pattern, and about the repeating lexical and structural patterns they identified and the possible effects of such usage on persuading a reader.

Finally pairs did the same with the second manifesto introduction.

For the next class they were told to rewrite their arguments from the earlier mini-debate using the information they had gained from their rhetorical analysis.

Thanks for reading and roll on Thursday May 7 2015.


American Rhetorics: Rhetorical figures in sound

The forest of rhetoric

UK 2015 manifesto forewords

UK 2015 manifestos venncloud

Wmatrix corpus analysis of UK General Election Manifestos

Quick cup of COCA – compound words

A new quick cup of coca post, whayhay. Thanks to Mike Harrison (@harrisonmike) on Twitter who was asking about finding compound adjectives.

Here we can use the wildcard asterisk with a part-of-speech tag.

So say we were looking for adjectives starting with well, we could use [well-*].[j*] to give the following top ten results –

(click on words to see full search results)

To find all compound adjectives we would simply replace the first part of the compound with another wildcard asterisk, like so:

[*-*].[j*]

which gives us the following top 10 results:

(click on words to see full search results)

Similarly if you were looking for noun, adverb or verb compounds simply add the appropriate POS tag i.e. [n*], [r*] and [v*] respectively.

Note: do double-check results in the concordance lines, as sometimes the POS tagging is off.
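
If you want to try the same idea on your own plain-text material rather than in COCA, a loose analogue is a regular expression that picks out hyphenated words. The Python sketch below is only that: it has no POS tagging, so it returns every hyphenated word (nouns and verbs included), and the function name hyphenated_compounds is invented for this example.

```python
import re
from collections import Counter

# One or more letters, then at least one "-letters" part, e.g. well-known.
HYPHENATED = re.compile(r"\b[A-Za-z]+(?:-[A-Za-z]+)+\b")


def hyphenated_compounds(text, first_part=None):
    """Count hyphenated words; optionally keep only those starting with first_part."""
    found = [m.group(0).lower() for m in HYPHENATED.finditer(text)]
    if first_part is not None:
        prefix = first_part.lower() + "-"
        found = [w for w in found if w.startswith(prefix)]
    return Counter(found)


if __name__ == "__main__":
    sample = "A well-known author made a well-known, long-term plan."
    print(hyphenated_compounds(sample).most_common(10))
    print(hyphenated_compounds(sample, first_part="well").most_common(10))
```

Because nothing here knows about parts of speech, checking the hits in context matters even more than it does with the tagged COCA results.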

As an interesting aside a search for compound adjectives historically in COHA gives us a very nice ascending curve. Wonder what the significance of that is?

Compound adjectives over time in COHA (click on graph)

Finally do check out the previous quick cup of coca posts if you want help with searching in COCA.

IATEFL 2015: Recent corpus tools for your students

Jane Templeton’s talk 1 illustrated corpus use by using the wordandphrase tool 2. (Lizzie Pinard has a write-up of the talk 3). I have described using this and other tools on this blog, and there is a nice round-up of corpus tools written by Steve Neufield 4 that looks at just the word, ozdic, word neighbors, netspeak, and stringnet.

This post reports on some more recent tools you may not be aware of (but posted some time ago in the G+ CL community, so do check that if you want the skinny early on :) ) – WriteAway, Linggle, SkeLL, NetCollo.

I list them in order of how easy to use and how useful I think students will find them.

1. WriteAway – this tool auto-completes words to help highlight typical structures, so for example it gives two common patterns for Jane’s example of weakness as weakness of something and weakness in something. The first example in pattern one includes the collocation overcomes.

WriteAway screenshot for word weakness


2. Linggle – one could follow up with a search on Linggle, which is basically a souped-up version of just the word and uses a 1 trillion word Web-based corpus as opposed to the much smaller BNC that just the word uses.

It is interesting that overcome weakness is not listed:

Linggle screenshot for verb + weakness (click image to see results)

but a search for overcome followed by a noun shows that it occurs less than 1% of the time in web pages:

Linggle screenshot for overcome + noun (click image to see results)


3. SkeLL from Sketch Engine is neat for its word-sketch feature so a look at weakness brings up a nice set of collocations and colligations in one screen:

SkeLL wordsketch for weakness (click image to see results)


4. The NetCollo corpus tool can compare the BNC, a medical corpus and a law corpus, which is useful if you are looking at academic language in medicine and law. Taking the example of weakness again, we see that it is much more common in the BNC:

NetCollo result for weakness (click image to see results)

and we can see that the collocation with overcome only appears once in the Medical corpus.

As ever, do try these tools out yourself and then show, not tell (as Jane says), your students as and when the need arises in class. By the way, do check out the integrative rationale for corpus use by Ana Frankenberg-Garcia5.

Thanks for reading.


1. IATEFL 2015 video – Bringing corpus research into the language classroom

2. Word and Phrase tool

3. IATEFL 2015 Bringing corpus research into the language classroom – Jane Templeton

4. Teacher Development: Five ways to introduce concordances to your students

5. Integrating corpora with everyday language teaching

IATEFL 2015: Testing times all-round

I think Geoff Jordan1 has already outlined the general concerns about the lack of adequate examination of issues in testing at IATEFL 2015, mainly based on Jeremy Harmer’s presentation2.

One key area that this presentation overlooked is that of high-stakes testing. Although Harmer may well have felt that passing a music exam on the tuba was very important for him, it does not really compare to the situation of test-taking candidates whose scores can mean getting a place at university, gaining employment or passing immigration requirements.

The Pearson academic test which Harmer was promoting is used in such high-stakes situations, so glossing over this massive issue was telling. The example of brain surgery in the talk was not comparable, as medical education is a long process, and the airline pilots case ignores that such tests cover more or less the whole of a pilot’s syllabus, something language tests are rarely able to do.

The lack of critical discussion of automated scoring was another key area. One would be none the wiser from Harmer’s talk that automated scoring is the site of great debate and controversy. For example there’s the Human Readers movement3 which campaigns for the dropping of automated scoring for high-stakes testing.

Further, some researchers in the field, such as Xiaoming Xi4, claim that automated scoring systems are not ready for high-stakes decision making. She lists two main reasons – they can’t score for “coherence, logic or content like human raters” and the “vulnerability to cheating and test scheming”.

In addition Harmer claimed that “algorithms that are built into the software will grade and evaluate what you say more reliably and as accurately as any human being can.” This confuses the consistency of automated scoring with validity and accuracy, plus such scoring is dependent on human raters in any case.

Controversies such as tests being used to evaluate teachers in the US add to a very contested picture of this area.

It is evident, as Harmer says, that testing is not going away and that teachers will need to engage with the issues, so it was a missed opportunity to help teachers do this instead of simply pushing the onus onto the audience.

One could turn to the following internet sources to get engaged:

Language Testing Bytes – Podcasts on testing compiled by Glenn Fulcher.

Thinking about tests and testing: a short primer in “assessment literacy” – A pdf of some useful basics on assessment.

Assessment is – ITDI blog with straight and useful talk from the chalkface; they have some other issues on this also worth checking.

Finally for a bit of fun and look away if you are sensitive to male body parts:

Thanks for reading (and watching).


1. IATEFL post mortem part 2

2. IATEFL 2015 video – An uncertain and approximate business? Why teachers should love testing

3. Human readers research findings

4. LTJ 27 3 Automated scoring transcript

Grassroots language technology: Glenys Hanson,

“Grassroots entrepreneurship” was listed as one of four characteristics that ELTJam says1 one can use to understand the current so-called ed-tech movement and/or revolution. The others are money, disruption and polarisation/controversy.

Amongst the examples they gave of such entrepreneurship initiatives was Marie Goodwin2, a teacher who wanted a platform to help kids with reading3. The grassroots language technology series is trying to show that many teachers are doing similar, probably much smaller and mostly non-commercial, projects.

Our next person in the series is Glenys Hanson, @GlenysHanson, who I first met on an online pronunciation course. Many thanks to Glenys for sharing her experiences.

1. Can you share a bit about your background?

Glenys: I’m from Wales but I’ve been living in France for nearly 50 years. I was an English as a foreign language teacher at the Centre de linguistique appliquée, Université de Franche-Comté, from 1977 to 2010. I started making sites and interactive exercises while I was there. Starting in 2001, I created what became English Online France, a site of resources for people learning English and their teachers. It still exists with most of the content I created but the presentation has changed.

For the university, I also ran about a dozen distance learning courses on the learning management system, Moodle. These were at “licence” and “masters” level and though they included interactive exercises much of the course work was different kinds of tasks. Half the students were in Africa so they were not blended courses.

I also made a large bilingual site for the association Une Education Pour Demain but their current site is not the one I created. My most recent site is Glenys Hanson’s Blog.

2. What motivated you to set up your online exercises site?

Glenys: I’d originally put my exercises on the English Online France site but by now many of them look very old-fashioned and I feel they need pedagogical updating too. As I’ve retired, I no longer have admin access to my former workplace site so decided to put the revamped versions on my own site: ESL EXOS.

The reason I decided to put learning exercises on line was that I couldn’t find any on the Internet to give my students the kind of practice I felt they needed outside of class. These days it is possible to find a few learning exercises but still not very many. There are, of course, thousands of tests, quizzes and games for English learners but hardly any exercises aiming to help students to discover for themselves how the language functions. Many (?most) teachers don’t even realise that there is a difference between testing and learning exercises.

3. What kind of time commitment is required to design the exercises?

Glenys: People who ask this question seldom go on to make on-line exercises. What great footballer started out by asking “What kind of time commitment is required to learn football?” Either you’re bitten by the bug and just love doing it or you don’t do it at all.

If you want a figure, I’ve read that it takes about 10h of development (pedagogical inspiration, technical realisation plus testing) to create a set of exercises that will take 1h for a student to do. Of course, that hour can be done by hundreds and even thousands of students over and over again.

In fact, learning to use an authoring program such as Hot Potatoes or TexToys is quite quick. It can take less than an hour for a newbie to make a simple MCQ or Cloze exercise. There are other authoring programs, usually Flash based, but they limit you to the question types the creators have decided on. “Out of the box” Hot Potatoes exercises look boring and old-fashioned, but the code can be “hacked” to produce an infinite variety of exercise types and graphic styles.

4. To what extent would you recommend other teachers to try to develop similar language tools?

Glenys: First teachers should determine whether or not they really need to. If they can find exercises that already exist on the Internet that suit the needs of their students, they can simply provide their students with lists of links. There’s no point in duplicating work that exists and is freely available. In the past, I created some listening and reading exercises but I haven’t revised them to put them on ESL EXOS because there’s a lot of good stuff already out there.

Another reason for teachers to create their own exercises is because they need to track and grade their students’ work. Systems like Moodle come with their own built in quiz tools. Moodle’s is very good but if the teacher ever decides to leave Moodle they can’t take the quizzes with them. Hot Potatoes creates web pages which can easily be moved around different types of site. They can also be integrated into Moodle in a way that allows students’ work to be tracked and graded. I’m not sure to what extent this is possible on other LMSs.

The third thing I would recommend is to start on this sort of stuff as young as possible: like learning to drive, it’s a doddle when you’re 15, it’s not when you’re 50. I had no choice – it just wasn’t around when I was young.

5. Do you recommend/know of other non-commercial language tech sites?

Glenys: Not sure what you mean by “language tech sites”. I know of a number of sites created by people who started out as language teachers and who have gone into the technical side of things in different ways.

  • Martin Holmes started out as an English teacher and went on to create TexToys and, with Stewart Arneil, Hot Potatoes. Martin and Stewart are no longer developing Hot Potatoes, but Stan Bogdanov, also an English teacher, is. On his site you can find his Hot Potatoes add-ons. He hosts those created by Michael Rottmeier and Agnès Simonet as well. Stan is also in the process of making versions of Hot Potatoes that will work on mobile devices. At the moment, some do and some don’t.
  • Michael Marzio’s Real English site of videoed street interviews accompanied by interactive exercises is free but funded by ads. A wonderful site!
  • Todd Beuckens’ ELLLO site of short videos of young people discussing a wide range of subjects.
  • Deborah Delin’s Strivney is a Moodle site for children learning English. As well as hundreds of Hot Potatoes exercises, she’s made some amazing Flash ones too. Log in to see, for example: Beginners English – A Rod.
  • Ángel Terán is not a language teacher but his LyricsTraining site is a great tool for language learning. It’s a commercial company but free to use on line.
  • Max Bury creates software and has a lot of stimulating blog posts about learning.


1. IATEFL2015 video: An engaged tone: how ELT might handle the ‘EdTech revolution’

2. ELT Entrepreneur – Marie Goodwyn

3. Bright-Stream