Syllabus for CSE 659

Fall semester 2001

CSE 659: Statistical Natural Language Processing

Instructor: Geunbae Lee, Eng 2-211, 279-2254

language: English

1. Course objectives

This course introduces various recent statistical methods in natural language processing.
We will cover basic statistical tools for computational linguistics and their application to part-of-speech tagging,
statistical parsing, word sense disambiguation, machine translation, information retrieval and statistical discourse processing.
If time permits, we will briefly touch on some topics of statistical language models for speech recognition and text-to-speech systems.

2. Course prerequisites

CSE 561 (Linguistic Fundamentals for Natural Language Processing) or instructor approval

3. Grading

midterm 30%
final 30%
homeworks 30%
class participation 10%

4. Required texts or references

Text: Manning, C. D., & Schütze, H. (1999). Foundations of Statistical Natural Language Processing. MIT Press.

References: Brigitte Krenn and Christer Samuelsson. The Linguist's Guide to Statistics. Internet shareware.

5. Other notes

Instruction language: English
Homework: hands-on experience with POS tagging and parsing for both Korean and English

6. Lecture note reading lists

1st week: Statistical vs. Structured NLP
2nd week: Basic Statistics and Statistical Models
3rd week: Linguistics Essentials
4th week: Corpus-Based NLP
5th week: Collocations
6th week: Statistical Inference: n-gram Models over Sparse Data
7th week: Word Sense Disambiguation
8th week: Lexical Acquisition (midterm)
9th week: Markov Models
10th week: Part-of-Speech Tagging
11th week: Probabilistic Context-Free Grammars
12th week: Probabilistic Parsing
13th week: Statistical Alignment and Machine Translation
14th week: Clustering
15th week: Topics in Information Retrieval
16th week: Text Categorization (final)

(basic and applied statistics)

- Brigitte Krenn and Christer Samuelsson. The Linguist's Guide to Statistics. Internet shareware.

(general)

- Abney, S. Statistical methods and linguistics. In J. Klavans and P. Resnik (eds.), The Balancing Act, MIT Press, 1996

- Church and Mercer. Introduction to the special issue on computational linguistics using large corpora, Computational Linguistics, 19, 1993

(POS tagging)

- Cutting et al. A practical part-of-speech tagger, In Proceedings of the 3rd conference on applied natural language processing, 1992.

- Church. A stochastic parts program and noun phrase parser for unrestricted text. In Proceedings of the 2nd conference on applied natural language processing, 1988

- Weischedel et al. Coping with ambiguity and unknown words through probabilistic models, Computational Linguistics, 19(2), 1993

- J. Kupiec. Robust part-of-speech tagging using a hidden Markov model, Computer Speech and Language, 6, 1992

- E. Brill. A simple rule-based part-of-speech tagger. Proceedings of the 3rd conference on applied NLP, 1992

- E. Roche and Y. Schabes. Deterministic part-of-speech tagging with finite state transducers. Computational linguistics 21, 1995.

- B. Merialdo. Tagging English text with a probabilistic model. Computational linguistics 20, 1994.

- Jeongwon Cha, Geunbae Lee, Jong-Hyeok Lee. Generalized unknown morpheme guessing for hybrid POS tagging of Korean. Proceedings of the Sixth Workshop on Very Large Corpora, COLING-ACL 98, Montreal, 1998.

(Statistical parsing)

- K. Lari and S. Young. The estimation of stochastic context-free grammars using the inside-outside algorithm. Computer Speech and Language 4, 1990

- F. Pereira and Y. Schabes. Inside-outside reestimation from partially bracketed corpora. ACL 30, 1992

- T. Briscoe and J. Carroll. Generalized probabilistic LR parsing of natural language (corpora) with unification-based grammars. Computational linguistics 19, 1993.

- E. Black et al. Towards history-based grammars: using richer models for probabilistic parsing. ACL 31, 1993.

- D. Magerman. Statistical decision-tree models for parsing, ACL 33, 1995.

- Brill. Automatic grammar induction and parsing free text: a transformation-based approach. ACL 31, 1993

(statistical disambiguation)

- D. Hindle and M. Rooth. Structural ambiguity and lexical relations. Computational linguistics 19, 1993

- K. Church and P. Hanks. Word association norms, mutual information and lexicography, ACL 28, 1990

- Alshawi and Carter. Training and scaling preference functions for disambiguation. Computational Linguistics 20(4), 1994.

(word classes and WSD)

- Gale et al. Work on statistical methods for word-sense disambiguation, Proceedings from AAAI fall symposium: Probabilistic approaches to natural language, 1992.

- Gale et al. Estimating upper and lower bounds on the performance of word-sense disambiguation programs. ACL 30, 1992.

- Yarowsky. Word-sense disambiguation using statistical models of Roget's categories trained on large corpora, Coling, 1992.

- Yarowsky. Unsupervised word-sense disambiguation rivaling supervised methods, ACL 33, 1995

- Pereira et al. Distributional clustering of English words, ACL 31, 1993

- Dagan et al. Contextual word similarity and estimation from sparse data, ACL 31, 1993

- Dagan et al. Similarity-based estimation of word cooccurrence probabilities, ACL 32, 1994.

(text alignment and machine translation)

- Kay and Röscheisen. Text-translation alignment. Computational Linguistics 19, 1993

- Gale and Church. A program for aligning sentences in bilingual corpora, Computational Linguistics 19, 1993

- Brown et al. The mathematics of statistical machine translation: parameter estimation. Computational Linguistics, 1993

- Brown et al. A statistical approach to machine translation. Computational Linguistics 16, 1990

- Wu. Aligning a parallel English-Chinese corpus statistically with lexical criteria. ACL 32, 1994

- Church. Char_align: A program for aligning parallel texts at the character level. ACL 31, 1993

- Sproat et al. A stochastic finite-state word segmentation algorithm for Chinese, ACL 32, 1994

(lexical knowledge acquisition)

- Manning. Automatic acquisition of a large subcategorization dictionary from corpora. ACL 31, 1993

- Smadja. Retrieving collocations from text: Xtract, Computational linguistics, 1993

(speech and others)

- Brown et al. Class-based n-gram models of natural language, Computational Linguistics, 18(4), 1992

- Chien et al. A best-first language processing model integrating the unification-based grammar and Markov language model for speech recognition applications. IEEE Trans. on Speech and Audio Processing, 1(2), 1993

- Derouault and Merialdo. Natural language modeling for phoneme-to-text transcription, IEEE Trans. PAMI, 8(6), 1986

- Seneff. TINA: a natural language system for spoken language applications, Computational Linguistics, 18(1), 1992

- Gupta et al. A language model for very large-vocabulary speech recognition, Computer Speech and Language 6, 1992