Around the GloWbE in a few clicks – will suit you

I had the chance to try out the new Corpus of Global Web-based English (GloWbE) in my TOEIC class the other day. The corpus claims to have 1.9 billion words; compare that with COCA (Corpus of Contemporary American English), which has 450 million words.

The student question was about the sentence “That will suit you perfectly” (which arose from dialogues they had been inventing between a customer and a salesperson). In fact the query was about the word order of adverbs, and after answering that (but see the StringNet post) I took the excuse to consult GloWbE and COCA.

Comparing the search term /will suit you/ between the two corpora:

COCA (will suit you) – 17 hits

GloWbE (will suit you) – 322 hits
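Since the two corpora differ so much in size, raw hit counts are not directly comparable. A quick sketch of the usual per-million-words normalization, assuming the corpus sizes quoted above are accurate (the names `CORPUS_SIZES`, `HITS` and `per_million` are my own, not part of either corpus interface):

```python
# Corpus sizes in words, as quoted in the post (approximate figures).
CORPUS_SIZES = {"COCA": 450_000_000, "GloWbE": 1_900_000_000}

# Raw hit counts for "will suit you", as reported in the post.
HITS = {"COCA": 17, "GloWbE": 322}


def per_million(hits: int, corpus_size: int) -> float:
    """Return the frequency of a search term per million words."""
    return hits / corpus_size * 1_000_000


# Normalized rate for each corpus, keyed by corpus name.
rates = {name: per_million(HITS[name], CORPUS_SIZES[name]) for name in HITS}

for name, rate in rates.items():
    print(f"{name}: {rate:.3f} per million words")
```

On these figures the phrase comes out at roughly 0.04 per million words in COCA and 0.17 per million in GloWbE, so it is still noticeably more frequent in GloWbE even after allowing for corpus size.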

In GloWbE the most common collocate was “best”; “perfectly” ranked 5th (will suit you results).

As a bonus, one of my students was most impressed with the tool and noted down the URL.

Of course the real power of the new corpus lies in analyzing differences between countries and, for me, web-related lexis.

I decided to see the results for the search term /responsive design/ (a hot topic in web design). On the face of it the numbers are impressive: 1 hit in COCA compared to 966 hits in GloWbE. However, further inspection reveals duplicate text, e.g. with the collocate HTML5 (responsive design results).

The corpus indicates duplicates by a number in brackets and the authors promise that these will be cleaned up over time. So the future looks rosy for the usefulness of this massive new corpus.

6 thoughts on “Around the GloWbE in a few clicks – will suit you”

  1. I’ll definitely check out GloWbE, especially now that I’m looking at lexis in my MSc. Just wondering how you actually used the corpus in class. Did you just check it privately in a spare moment and share the results, project it for the class to see, or involve the learners in some other, more interactive way?

    Thanks

    Ben

    1. hi ben

      thanks for popping by and the like🙂

      i simply used the computer in class with the two students looking over my shoulder.

      it would have been great to have a projector but seeing as only one pair had enquired and rest of class were still working on their dialogues it wasn’t a great teaching opportunity loss.

      ta
      mura

    1. hi tyson,

      apparently they use google to get the web pages so they use their system to identify countries http://support.google.com/webmasters/bin/answer.py?hl=en&answer=62399

      if you read the ‘Creating the corpus’ from the drop down menu they say –
      ——
      For example, for a .com address (where no top-level domain is listed), it will try to use the IP address (which shows where the computer is physically located). But even if that fails, Google could still see that 95% of the visitors to the site come from Singapore, and that 95% of the links to that page are from Singapore (and remember that Google knows both of these things), and it would then guess that the site is probably from Singapore. It isn’t perfect, but it’s very, very good, as is shown in the results from the dialect-oriented searches.
      ——
