Corpus linguistics and stylistics

Stylistics

“Stylistics is the study of literature using methods, theories and concepts from linguistics” (Leech and Short 2007:1). It is “[...] the study of the relationship between linguistic form and literary function [...]” (Leech and Short 2007:3).

Style

Style is an idiosyncratic way of performing a particular action. It can be non-linguistic or it can be linguistic.

“Style is a way in which language is used [...] style consists in choices made from the repertoire of the language [...] Stylistic choice is limited to those aspects of linguistic choice which concern alternative ways of rendering the same subject matter” (Leech and Short 2007:31)

“[...] the recognition and analysis of styles are squarely based on comparison.” (Enkvist 1973: 25-26)

“We match the text against another body of texts which we might label as norm, this norm being chosen because it is contextually relevant as a background for the text. […] Features whose densities are significantly different in the text and in the norm are style markers for the text in relation to the norm used. A change of norm may result in a different inventory of style markers.

The norm may be chosen from a wide field. One portion of a text may be matched against other portions or the whole of the same text. One text may be compared to other texts. Or the text may be set against an imaginary norm that only exists in a critic’s mind” (Enkvist 1973: 25-26)

Corpus linguistics + stylistics = corpus stylistics

  • The advent of corpora, and the availability of a range of corpus-based techniques, have opened up new avenues in the study of literature, and prose fiction in particular.
  • The ‘corpus turn’ (Leech and Short 2007:284) refers to the ongoing trend in stylistics to use methods and tools from corpus linguistics for the analysis of literary and other texts.
  • This approach is usually referred to as corpus stylistics.

Intra-textual vs. inter-textual approaches to electronic text analysis (Adolphs 2006):

Intra-textual: analysis of an individual text (e.g. via concordances, collocations, etc.)

Inter-textual: comparison of an individual text with other related texts or with a larger corpus
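
As an illustration of the intra-textual case, the following is a minimal sketch of a keyword-in-context (KWIC) concordance. The tokenisation rule, the function names and the sample sentence are assumptions made for this example; they are not taken from Adolphs (2006) or from any particular tool.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into word tokens (a deliberately simple rule)."""
    return re.findall(r"[a-z']+", text.lower())

def concordance(tokens, node, width=4):
    """Return (left context, node, right context) for every occurrence of `node`."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == node:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, tok, right))
    return lines

# Invented sample sentence, used only to show the output format.
sample = "It seemed to me that the river seemed endless, as though it had no source."
for left, node, right in concordance(tokenize(sample), "seemed", width=3):
    print(f"{left:>20} [{node}] {right}")
```

Reading the aligned lines side by side is what makes recurrent patterns around the node word visible; an inter-textual analysis would additionally need a second text or corpus to compare against.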

Corpus approaches and genre style

  • Biber (1988) – multivariate statistical techniques (factor analysis) applied to many different variables, where the variables are linguistic features (e.g. passive constructions).
  • Example: narrative versus non-narrative texts. The important variables are past tense verbs, 3rd person pronouns, perfect aspect and present participle clauses. High scores = narrative; low scores = non-narrative.
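
A score of this kind can be sketched as the sum of standardised feature rates. The code below is a simplified illustration, not Biber's procedure: the feature set is taken from the list above, but the per-1,000-word rates, the text names and the equal weighting of features are all invented assumptions, and the factor analysis that selects the features is not shown.

```python
from statistics import mean, stdev

FEATURES = ["past_tense", "third_person_pronouns",
            "perfect_aspect", "present_participle_clauses"]

# Hypothetical rates per 1,000 words for four invented texts.
rates = {
    "novel_a":  {"past_tense": 70, "third_person_pronouns": 50,
                 "perfect_aspect": 12, "present_participle_clauses": 4},
    "novel_b":  {"past_tense": 60, "third_person_pronouns": 45,
                 "perfect_aspect": 10, "present_participle_clauses": 3},
    "report_a": {"past_tense": 20, "third_person_pronouns": 15,
                 "perfect_aspect": 5, "present_participle_clauses": 1},
    "manual_a": {"past_tense": 10, "third_person_pronouns": 10,
                 "perfect_aspect": 3, "present_participle_clauses": 0.5},
}

def dimension_scores(rates, features):
    """Sum each text's z-scores over the given features."""
    scores = {name: 0.0 for name in rates}
    for feature in features:
        values = [text_rates[feature] for text_rates in rates.values()]
        mu, sd = mean(values), stdev(values)
        for name, text_rates in rates.items():
            scores[name] += (text_rates[feature] - mu) / sd
    return scores

scores = dimension_scores(rates, FEATURES)
for name, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(f"{name:10} {score:6.2f}")
```

Standardising each feature before summing keeps a very frequent feature (such as past tense verbs) from dominating a rare one (such as present participle clauses).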

Corpus approaches and authorial style

  • Studies attempting to ‘fingerprint’ authors: i.e. to identify linguistic items that distinguish the works by one author from those of others.
  • Burrows’s (1987) study of Jane Austen’s novels focused on closed-class words, such as the, and, of, a and to.
  • Burrows found that these words can distinguish the works of different authors, different novels, and even the words spoken by different characters.
  • Hoover (2002) studied a series of corpora containing chunks from novels by different authors.
  • The distribution of the 300 most frequent words in the corpus correctly clusters 15 out of 17 novels.
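
The idea behind such frequent-word comparisons can be sketched as follows: each text is reduced to a profile of relative frequencies for a fixed word list, and profiles are then compared. This is only an illustration under stated assumptions. It uses the five closed-class words mentioned above and a plain Manhattan distance, whereas the studies cited use much longer word lists and their own statistical procedures; the three snippets are invented.

```python
import re
from collections import Counter

FUNCTION_WORDS = ["the", "and", "of", "a", "to"]

def profile(text, words=FUNCTION_WORDS):
    """Relative frequency of each listed word in the text."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    total = len(tokens) or 1  # avoid division by zero on empty input
    return [counts[w] / total for w in words]

def manhattan(p, q):
    """Sum of absolute differences between two profiles."""
    return sum(abs(x - y) for x, y in zip(p, q))

# Invented snippets: text_1 and text_2 share function-word usage
# but no content words; text_3 uses function words differently.
texts = {
    "text_1": "the end of the road and the start of a journey to the sea",
    "text_2": "the edge of the wood and the turn of a path to the hill",
    "text_3": "she ran and ran and sang and laughed and wept",
}
profiles = {name: profile(text) for name, text in texts.items()}
for x, y in [("text_1", "text_2"), ("text_1", "text_3")]:
    print(x, y, round(manhattan(profiles[x], profiles[y]), 3))
```

Because the profile ignores content words, two texts on different topics can still come out as close, which is why closed-class words are attractive for authorship questions.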

Corpus approaches and text style

  • Stubbs’s (2005) study of Joseph Conrad’s Heart of Darkness.
  • Comparison between word frequencies in Heart of Darkness and in the ‘Imaginative Writing’ section of the BNC
  • Connection between key-words and Enkvist’s ‘style markers’
  • Stubbs noticed words expressing vagueness and uncertainty were more frequent in Heart of Darkness: ‘as though’, ‘seemed’, ‘kind of’, ‘sort of’.
  • Stubbs shows how the application of corpus methods can provide further justification for well-established interpretations, and new insights into the language and meaning potential of the text.

Corpus approaches and variation inside texts

  • Culpeper (2002) used WordSmith Tools to do a key-word analysis of the speech of the main characters in Romeo and Juliet
  • A file with the words spoken by each character was compared to a ‘reference corpus’ containing the words of all the other characters.
  • Findings are relevant to an understanding of how the characters are linguistically constructed (characterisation).
  • Culpeper (2009) revisited the analysis using WMatrix.
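
The target/reference split described above can be sketched in a few lines. The input format (a list of speaker and speech pairs), the function names and the dialogue lines are assumptions for illustration; the lines are invented and are not quotations from the play.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def split_by_character(lines, character):
    """Return (target_counts, reference_counts): one character versus all the others."""
    target, reference = Counter(), Counter()
    for speaker, speech in lines:
        bucket = target if speaker == character else reference
        bucket.update(tokenize(speech))
    return target, reference

# Invented dialogue with hypothetical speaker labels A, B and C.
lines = [
    ("A", "if you come then I will go"),
    ("B", "I love the night"),
    ("A", "if it rains, if it pours"),
    ("C", "the night is long"),
]
target, reference = split_by_character(lines, "A")
print("target size:", sum(target.values()))
print("reference size:", sum(reference.values()))
print("most frequent in target:", target.most_common(3))
```

The two word lists and their sizes are the inputs that a keyword procedure needs; the choice of keyness statistic is a separate step.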

Corpus tools

  • WordSmith Tools (Scott 2007)
  • AntConc (Anthony 2010)
  • WMatrix (Rayson 2003, 2008)
  • Multilingual Corpus Toolkit (MLCT) (Piao)
WMatrix – a web-based environment that processes texts in three ways:

1) Word level – simple word frequency lists.

2) Grammatical level – CLAWS [1] (Constituent Likelihood Automatic Word-tagging System; see Garside 1987; Leech, Garside and Bryant 1994; Garside 1996; and Garside and Smith 1997) adds part-of-speech (POS) tags (with 96-97% accuracy) to the words in the text.

3) Semantic level – The USAS (UCREL [2] Semantic Analysis System) tagger assigns tags to the words (with 92% accuracy) from a set of predefined semantic fields. Currently there are 21 major fields (see Table 1), which are subdivided into 232 category labels to allow for more fine-grained classification.

(More info about semantic categories at

Keyness – a concept made popular by Mike Scott and WordSmith Tools (Scott 1996, 1999, 2007)

WordSmith, AntConc and WMatrix all have a ‘Keyword’ facility that compares the word-list from one text with the word-list from another text, or reference corpus, to produce a list of words that are statistically ‘over-used’ (or ‘under-used’, depending on the settings selected) in the first text.

WMatrix extends the notion of keyness, as it can compare texts not only at the word level (key words) but also at the grammatical level (key POS) and the semantic level (key concepts).

WMatrix uses log-likelihood [3] to calculate keyness.
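
A log-likelihood keyness score for a single word can be sketched as below. This uses the two-term form of the statistic that is common in corpus comparison; that it matches the tool's exact calculation is an assumption, and the function names and example figures are invented for illustration. Here a and b are the word's frequencies in the text and in the reference corpus, and c and d are the sizes of the two in tokens.

```python
import math

def log_likelihood(a, b, c, d):
    """Log-likelihood for a word occurring a times in c tokens and b times in d tokens."""
    # Expected frequencies if the word were equally likely in both corpora.
    e1 = c * (a + b) / (c + d)
    e2 = d * (a + b) / (c + d)
    total = 0.0
    if a > 0:
        total += a * math.log(a / e1)
    if b > 0:
        total += b * math.log(b / e2)
    return 2 * total

def direction(a, b, c, d):
    """Report whether the word is relatively more or less frequent in the text."""
    return "over-used" if a / c > b / d else "under-used"

# Hypothetical counts: 50 hits in a 1,000-word text,
# 100 hits in a 10,000-word reference corpus.
score = log_likelihood(50, 100, 1000, 10000)
print(round(score, 2), direction(50, 100, 1000, 10000))
```

The score itself is always non-negative, so the direction has to be read off the relative frequencies. A value above 3.84 corresponds to p < 0.05 for a chi-square distribution with one degree of freedom, although stricter cut-offs are often preferred when many words are tested at once.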

Conclusions

  • The notion of ‘style’ is fundamentally based on comparison
  • Corpus-based methods are relevant to the analysis of style in fiction/literature, and can also play a role in explaining and refining interpretations of texts.
  • A variety of corpus methods have been applied to the analysis of genres, authors and texts.
  • Corpus tools can be used for the analysis of the ‘content’ or themes of texts, and for the analysis of some aspects of ‘style’.
  • However, manual analysis and interpretation of the output of the tool are needed.

“[...] ‘corpus stylistics’ is not purely a quantitative study of literature. Rather, it is still a qualitative stylistic approach to the study of language of literature, combined with or supported by corpus-based quantitative methods and technology.” (Ho 2011:10)

References

Adolphs, S. (2006) Introducing Electronic Text Analysis. London: Routledge.

Biber, D. (1988) Variation across speech and writing. Cambridge: Cambridge University Press.

Burrows, J. (1987) Computation into Criticism: A Study of Jane Austen's Novels, and an Experiment in Method. Oxford: Oxford University Press.

Culpeper, J. (2002) ‘Computers, language and characterisation: An Analysis of six characters in Romeo and Juliet’. In: U. Melander-Marttala, C. Ostman and Merja Kyto (eds.), Conversation in Life and in Literature: Papers from the ASLA Symposium, Association Suedoise de Linguistique Appliquee (ASLA), 15. Universitetstryckeriet: Uppsala, pp. 11-30.

Enkvist, N. E. (1973) Linguistic Stylistics. The Hague: Mouton.

Ho, Y. (2011) Corpus Stylistics in Principles and Practice: A Stylistic Exploration of John Fowles’ The Magus. London: Continuum.

Hoover, D. L. (2002) ‘Frequent word sequences and statistical analysis’. Literary and Linguistic Computing, 17, 2, 157-80.

Leech, G. N. and Short, M. (2007) Style in Fiction (2nd ed.). London: Longman.

Stubbs, M. (2005) ‘Conrad in the computer: examples of quantitative stylistic methods’. Language and Literature, 14, 1, 5-24.


[1] See for further information.

[2] University Centre for Computer Corpus Research on Language

[3] log-likelihood info and wizard: