Corpora and Statistical Methods

Tutorial 3

1 Introduction

The aim of this tutorial is (a) to familiarise you with the SketchEngine, a web interface to a number of corpora; and (b) to apply some of the concepts about Zipfian distributions and morphological productivity to actual data.

1.1 Accessing the SketchEngine

The SketchEngine is available online. When you go to its URL, you will be prompted for a username and password, which the lecturer should provide.

(An alternative is to create a trial account. This is free but lasts for only a month.)

1.2 Tools you will need

Throughout this tutorial, you will be using the Corpus Query Language (CQL). This is a language which mixes regular expressions with a special syntax to make elaborate queries of corpus data. Examples include:

  • Finding all adjectives in the corpus which end in –ity
  • Finding all occurrences of the verb kill which are followed by a determiner, an adjective, and a noun in that order.
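As a rough sketch of what such queries look like (these assume Penn-Treebank-style tags, where JJ marks adjectives, DT determiners, NN.* nouns and VB.* verbs — check them against the corpus tagset and the CQL tutorial before relying on them), the two examples above might be written as:

```
# Adjectives ending in -ity:
[word=".*ity" & tag="JJ.*"]

# The verb "kill" followed by a determiner, an adjective and a noun:
[lemma="kill" & tag="VB.*"] [tag="DT"] [tag="JJ.*"] [tag="NN.*"]
```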

A tutorial about CQL and regular expressions can be found in a separate file, downloadable from the website.

1.3 Accessing a corpus

Once you log in, you will see a selection of different corpora for many different languages. These include some well-known ones, such as the BNC. There are also a number of web corpora (i.e. corpora built by scraping the web). We shall use one of these, namely ukWaC (UK Web as Corpus). (Please use “ukWaC”, not “ukWaC v1.0 old”, which is also marked on the menu).

Here is some useful info about this corpus:

  • Corpus size: ca. 1.5 billion words
  • Vocabulary size: ca. 11.2 million
  • No. of hapax legomena: 1,949,571 individual lemmas

Click on the ukWaC link. You will be taken to a search form as shown below.

Note:

  1. You can make simple word/phrase queries by typing them in the Query box;
  2. If you click Query Type on the left menu (bottom), a drop-down menu appears which among other things allows you to choose between a simple query, a lemma query (for all morphological forms of a specific lemma), or a CQL query (which allows for complex searches for word/lemma sequences with part of speech tags).

2 Creating CQL queries

For this part of the tutorial, you’ll practise using CQL. Be sure to select the CQL option from the drop-down Query Type box, as shown above. You’ll need to refer to the tagset used in the corpus; this is the tagset originally developed for the Penn Treebank corpus, and a full listing is provided here:

Q1. Using the examples from the CQL tutorial, write CQL queries for the following (the first has been done for you):

  a. Nouns ending in the suffix –ness (e.g. goodness)

CQL = [word=".+ness$" & tag="NN"]

  b. Adjectives preceded by a determiner (e.g. the) and ending with the suffix –ous (e.g. the calamitous...)
  c. Adverbs ending in –ly (e.g. slowly) and followed by a verb (e.g. slowly ran)
  d. Adjectives starting with the negative prefix in- or im- and ending in the suffix –ous (e.g. impecunious)
  e. Complex adjectives involving the prefix non- (with the hyphen)
  f. Complex adjectives involving the prefix well- (with the hyphen)

Try out the queries in the CQL box. Do you get the desired results or does your query overgenerate?
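If a query over- or under-generates, two details of CQL matching are worth keeping in mind (a sketch based on standard Sketch Engine behaviour; verify against the CQL tutorial): regular expressions must match the whole token, so an explicit $ anchor is redundant, and the tag NN matches only singular common nouns in the Penn tagset. A variant of the worked –ness example that also catches plural forms (e.g. kindnesses) would be:

```
[word=".+ness(es)?" & tag="NNS?"]
```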

Note: The search returns what is known as a KWIC (“key word in context”) concordance, i.e. a list of the patterns matched in the context in which they occur, as shown below. You can see the whole context of a specific match by clicking on the matched word or phrase itself, which is highlighted in red as shown below.

2.1 Creating a frequency list

Once you have a page of results, you can generate a frequency list by clicking Node forms under Frequency in the left menu (circled in the diagram above); this gives you the types matched and their frequencies.

Q2. Construct a frequency list for each of the last two queries above (Q1.e & Q1.f). Eyeball the data:

  a. What are the characteristics of the distribution?
  b. Judging by the list of types, do you think that these are productive morphological processes? Are there some cases where the complex adjectives seem to be non-compositional?

Note: It may be easier to save the frequency lists to your desktop and load them into a spreadsheet or statistics program, such as Excel or SPSS. You can do this using the Save button (see below). This takes you to a form; you can leave all the fields as they are, but be sure to set a large value for the maximum number of lines to save (1 million should do it), otherwise you won’t save all the data.
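A saved list can also be inspected programmatically. Here is a minimal Python sketch, assuming the export is a tab-separated file with one type and its frequency per line (the actual export format may differ slightly, so adjust the parsing as needed); it returns the type, token and hapax counts you will need for the later questions:

```python
def summarise_freq_list(lines):
    """Summarise a frequency list given as lines of 'type<TAB>frequency'."""
    freqs = {}
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) != 2 or not parts[1].isdigit():
            continue  # skip headers or malformed rows
        freqs[parts[0]] = int(parts[1])
    n_types = len(freqs)                                 # V: distinct types
    n_tokens = sum(freqs.values())                       # N: total tokens
    n_hapax = sum(1 for f in freqs.values() if f == 1)   # hapax legomena
    return n_types, n_tokens, n_hapax

# Toy example with invented counts:
sample = ["well-known\t120", "well-made\t15",
          "well-oiled\t1", "well-caffeinated\t1"]
print(summarise_freq_list(sample))  # (4, 137, 2)
```

To use it on a real export, pass `open("freqlist.txt", encoding="utf-8")` in place of the toy list.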

2.2 Computing productivity measures

Q3. Count the number of individual hapax legomena for each of your two adjective queries from Q2 above. Based on raw counts, do they differ? What does this suggest to you regarding their productivity?

Q4. Based on your frequency lists, compute the realised, expanding and potential productivity coefficients for each of the two processes.

  a. Do they come out roughly equal on any of the measures, or are there substantial differences? How do you interpret these results?
  b. For each of the two morphological processes, what do you observe about the three measures? Do you think they are roughly the same, or are they very different?
  c. (Slightly more challenging) For each of the two cases, compute a pairwise correlation between the three measures (i.e. a correlation between realised and expanding productivity, expanding and potential productivity, and realised and potential productivity). You can use Pearson’s correlation (denoted r) for this. What do you observe? How should a correlation be interpreted?
  • Note: if you’ve never computed a correlation, don’t worry, we’ll discuss it in class. It’s worth trying, however. You can find information about Pearson’s r here:
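The measures in Q4 can be computed directly from the counts in your frequency lists. The Python sketch below follows Baayen’s standard definitions — realised productivity as the type count V(C), expanding productivity as the category’s share of all hapaxes in the corpus V(1,C)/V(1), and potential productivity as category hapaxes over category tokens V(1,C)/N(C) — together with a plain implementation of Pearson’s r. The counts used below are invented for illustration, not real ukWaC figures:

```python
import math

def productivity(n_types, n_hapax, n_tokens, n_hapax_corpus):
    """Baayen-style productivity measures for one morphological category.

    realised   V(C):         number of distinct types in the category
    expanding  V(1,C)/V(1):  category hapaxes over all corpus hapaxes
    potential  V(1,C)/N(C):  category hapaxes over category token count
    """
    return {
        "realised": n_types,
        "expanding": n_hapax / n_hapax_corpus,
        "potential": n_hapax / n_tokens,
    }

def pearson_r(xs, ys):
    """Pearson's product-moment correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Invented illustrative counts (replace with your own from Q2/Q3);
# 1,949,571 is the ukWaC hapax total given earlier in this tutorial.
non_  = productivity(n_types=800,  n_hapax=450, n_tokens=20000,
                     n_hapax_corpus=1949571)
well_ = productivity(n_types=1200, n_hapax=700, n_tokens=35000,
                     n_hapax_corpus=1949571)
print(non_["potential"], well_["potential"])
```

For Q4c, call `pearson_r` on paired lists of the two processes’ scores; note that with only two data points per pair the correlation is degenerate, so in practice you would correlate the measures across a larger set of affixes.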

Q5. Re-compute the calculations for each process, but this time carry out your query on the BNC (you’ll need to return to the Home menu to select it). Note that the BNC is a much smaller corpus. Moreover, it contains texts up to the early 1990s (whereas ukWaC is much more recent).

  a. Are there more, or fewer, types for the two morphological processes in the BNC, compared to ukWaC?
  b. Do the two morphological processes come out as equally productive based on the BNC data? Why (not)?