Intuitively, Most People Would Not Say Literature Was Quantifiable, but Data Mining Is

Lauren Marino

CPSC 445 HW1

Most people would not intuitively call literature quantifiable, but data mining is already being used to analyze poetry in languages as diverse as Chinese, Japanese, and English. In 1998, Mayumi Yamasaki, Masayuki Takeda, Tomoko Fukuda, and Ichiro Nanri used machine learning to discover characteristic patterns in collections of classical Japanese poems. They first devised a grammar for describing patterns in the poems, then used data mining techniques to search for those patterns and compare how often each one occurred across five different collections of classical Japanese poems. Though they reached no definite conclusions, they found that some patterns occurred frequently in some collections but not at all in others, while other patterns occurred in all five. These tendencies suggested directions for further research.

In August 2004, researchers Yong Yi, Zhong-shi He, Liang-yan Li, and Tian Yu presented a method that uses machine learning to classify traditional Chinese poetry as either Bold-and-Unrestrained or Graceful-and-Restrained in style. Because poetic style is most often judged subjectively and intuitively by a human reader, it is difficult to derive quantitative principles or format rules from human classifications. With machine learning, the researchers hoped to find a quantitative model for identifying the style of traditional Chinese poetry. Using a naïve Bayesian classifier, they identified poetry style from the occurrence of 1087 commonly used Chinese characters with approximately 90 percent accuracy.
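To make the idea concrete, here is a minimal sketch of a naïve Bayesian classifier that looks only at which characters occur in a poem. It is not the researchers' implementation: the four training snippets, the labels "bold" and "graceful", the tiny vocabulary, and the function names train and classify are all assumptions chosen for illustration, and the real study used 1087 characters and a much larger corpus.

```python
import math
from collections import Counter

def train(samples, vocabulary):
    """Fit a Bernoulli naive Bayes model from (text, label) pairs.

    For each class, estimate the probability that each vocabulary character
    occurs in a poem of that class, with add-one (Laplace) smoothing.
    """
    docs_per_class = Counter(label for _, label in samples)
    occurrences = {label: Counter() for label in docs_per_class}
    for text, label in samples:
        # Count a character at most once per poem: occurrence, not frequency.
        occurrences[label].update(set(text) & vocabulary)
    model = {}
    for label, n_docs in docs_per_class.items():
        log_prior = math.log(n_docs / len(samples))
        probs = {ch: (occurrences[label][ch] + 1) / (n_docs + 2)
                 for ch in vocabulary}
        model[label] = (log_prior, probs)
    return model

def classify(model, text):
    """Return the label with the highest log posterior score for the text."""
    present = set(text)
    best_label, best_score = None, -math.inf
    for label, (log_prior, probs) in model.items():
        score = log_prior
        for ch, p in probs.items():
            # Both the presence and the absence of a character are evidence.
            score += math.log(p if ch in present else 1.0 - p)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Toy training data (an assumption for illustration, not the study's corpus).
training_poems = [
    ("大江东去浪淘尽", "bold"),
    ("金戈铁马气吞万里", "bold"),
    ("寻寻觅觅冷冷清清", "graceful"),
    ("杨柳岸晓风残月", "graceful"),
]
vocabulary = set("".join(text for text, _ in training_poems))
model = train(training_poems, vocabulary)

print(classify(model, "大江万里"))  # bold
print(classify(model, "冷月残柳"))  # graceful
```

The classifier is called naïve because it treats each character as independent evidence of style, which is clearly not true of real poems but often works well enough in practice.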

In November 2004, participants from McMaster University, Open Sky Solutions, the University of Alberta, the University of Georgia, the University of Illinois, the University of Maryland, the University of Nebraska, and the University of Virginia came together to work on the Nora project, a software tool for “discovering, visualizing, and exploring significant patterns across large collections of full text humanities resources in existing digital libraries.” Through the cooperation of several digital libraries, the Nora project has gained access to approximately 10,000 literary texts, or roughly 5 GB of data. That is only a small fraction of the data stored in the world’s digital libraries.

One of the first experiments for the Nora project was an investigation of erotic language in the writings of poet Emily Dickinson. Initially, a user rates how erotic each document in a training set is on a scale of 1 to 5. The software then attempts to classify the rest of the documents. Most interesting, though, is that because it uses a naïve Bayesian classification, the software can tell the user which individual words it treated as potential indicators of the erotic. One researcher became particularly excited upon seeing the word “mine” rank high on that list of words.

“The minute I saw it, I had one of those ‘I knew that’ moments. Besides possessiveness, ‘mine’ connotes delving deep, plumbing, penetrating--all things we associate with the erotic at one point or another. And Emily Dickinson was, by her own accounting and metaphor, a diver who relished going for the pearls. So ‘mine’ should have been identified as a ‘likely hot’ word, but has not been, oddly enough, in the extensive literature on Dickinson’s desires.”
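A word list of this kind falls naturally out of a naïve Bayesian model, because the model stores a separate probability for every word in every class. The sketch below is only an illustration under stated assumptions, not the Nora software: it collapses the 1-to-5 ratings into two hypothetical labels, "hot" and "not", uses four invented sentences rather than Dickinson's texts, and ranks words by a smoothed log-likelihood ratio in a hypothetical function named indicator_words.

```python
import math
from collections import Counter

def indicator_words(labeled_docs, target, top_n=3):
    """Rank words by how strongly they point to the target class.

    The score is log P(word | target) - log P(word | other classes),
    with add-one (Laplace) smoothing so unseen words do not break the log.
    """
    target_counts, other_counts = Counter(), Counter()
    for text, label in labeled_docs:
        words = text.lower().split()
        (target_counts if label == target else other_counts).update(words)
    vocab = set(target_counts) | set(other_counts)
    target_total = sum(target_counts.values()) + len(vocab)
    other_total = sum(other_counts.values()) + len(vocab)

    def log_ratio(word):
        return (math.log((target_counts[word] + 1) / target_total)
                - math.log((other_counts[word] + 1) / other_total))

    # Highest ratio first; break ties alphabetically for a stable order.
    return sorted(vocab, key=lambda w: (-log_ratio(w), w))[:top_n]

# Invented toy documents (an assumption; these are not Dickinson's texts).
rated_docs = [
    ("the pearl is mine", "hot"),
    ("mine by the sea", "hot"),
    ("the bee and the clover", "not"),
    ("the garden in the morning", "not"),
]

top_words = indicator_words(rated_docs, "hot")
print(top_words[0])  # mine
```

A common word such as "the" scores low here because it appears in both classes, while "mine" scores high because it appears only in the documents rated as erotic.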

So even in its early stages, the Nora project was able to reveal something that human scholarship had not previously identified. The Nora project eventually expanded beyond experiments on the works of Dickinson to encompass a variety of 19th-century British and American literature, and even now the Nora project team is working to develop software that uses data mining to aid the work of humanities scholars. In January 2007, the Nora project merged with the WordHoard project at Northwestern University to create MONK (Metadata Offer New Knowledge), a digital environment that aims to help scholars identify and analyze patterns in the texts they study. By June 2008, the team hopes to install beta applications of MONK alongside several large digital text collections so that any scholar may use data mining to gain insight into literary texts.

References

[1] The MONK project website.

[2] Yi, Yong et al. “Studies on Traditional Chinese Poetry Style Identification”, August 2004.

[3] Yamasaki, Mayumi et al. “Discovering Characteristic Patterns from Collections of Classical Japanese Poems”, 1998.

[4] The Nora project website.

[5] Kirschenbaum, Matthew. “Poetry, Patterns, and Provocation: The nora Project”, January 2006.