Exploiting the positions and occurrences of words to incorporate their inter-dependencies into language modeling, as a means to exploit long contexts more effectively and improve modeling capability
Bold – Questions
Color – Answers
- What is the motivation of this thesis? – Introduction
- What is the problem to solve?
- What is a LM? – To estimate the probability of a word sequence.
- How useful is a LM to different NLP applications? – NLP applications rely on the LMs to hypothesize a linguistically well-formed sentence/paragraph/document under a specific condition – for example speech recognition, machine translation, information retrieval, and handwriting recognition.
- What are the classes of LM?– N-gram, distant-bigram, skip-gram, trigger, bag-of-words, and their log-linear/linear interpolation– each of these LMs reduces the word sequence under different assumptions to keep the data scarcity problem manageable.
- What problems do these LMs address? – To alleviate the data scarcity problem by treating the word sequence in the history as simpler events.
- What problem does this thesis address? – The inter-dependencies among the events are neglected when incorporating multiple events from the history for word prediction.
- How does this problem affect the performance of LMs? – The parameters in the combined model are not consistent with the statistics in the training data.
- How was this problem solved traditionally? – By using a bucketing scheme or a maximum-entropy model to combine the events.
- How does this thesis address the problem? – By exploiting the events jointly: the LM is decomposed into multiple component models to alleviate the data scarcity problem while maintaining the inter-dependencies.
- How does the proposed solution compare to the conventional approaches? – Conventional approaches exploit each event separately with its own model and then combine the component models by linear/log-linear interpolation.
- What are the state-of-the-art approaches for LM?
- What is a LM? – To estimate the probability of a word sequence.
- How is a LM built? – By using the chain rule (see the sketch after this group of questions).
- Corpora for LM? Toolkits?
- How are LMs applied to different applications? – Speech recognition – machine translation – information retrieval – handwriting recognition.
- How is a LM evaluated? – Perplexity – Task-specific evaluation, e.g. WER, BLEU.
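For reference, a minimal sketch of the standard formulation behind these questions (textbook definitions, not specific to this thesis): the chain-rule decomposition, the n-gram truncation of the history, and perplexity.

```latex
% Chain-rule decomposition of a word sequence w_1 ... w_N
P(w_1^N) = \prod_{i=1}^{N} P(w_i \mid w_1^{i-1})

% n-gram approximation: truncate the history to the last n-1 words
P(w_i \mid w_1^{i-1}) \approx P(w_i \mid w_{i-n+1}^{i-1})

% Perplexity of a model on a test sequence of N words
\mathrm{PPL} = P(w_1^N)^{-1/N}
             = \exp\!\Big( -\tfrac{1}{N} \sum_{i=1}^{N} \log P(w_i \mid w_1^{i-1}) \Big)
```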
- How was this problem solved conventionally? – Background
- What problems do these LMs address? – To alleviate the data scarcity problem.
- What are the classes of LM? – N-gram, distant-bigram, skip-gram, trigger, bag-of-words, and their log-linear/linear interpolation – each of these LMs reduces the word sequence in the history under different assumptions to keep the data scarcity problem manageable.
- What are the key concepts & relationships between these classes of LM? – These classes of LM exploit a simpler event derived from the word sequence in the history for word prediction – by dropping words from the sequence (to form a shorter sub-sequence) or by disregarding the arrangement of the words (see the sketch after this group of questions).
- What are the weaknesses & strengths of these LMs?
- What problems still exist in language modelling? – The inter-dependencies among the events are neglected when incorporating multiple events from the history for word prediction.
- How does this problem affect the performance of LMs? – The parameters in the combined model are not consistent with the statistics in the training data.
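To make the "simpler event" idea concrete, a few standard examples of how the history w_1^{i-1} is reduced (textbook formulations, listed here only for reference):

```latex
% n-gram: keep only the most recent n-1 words of the history
P(w_i \mid w_1^{i-1}) \approx P(w_i \mid w_{i-n+1}^{i-1})

% distant bigram: condition on a single word at distance d
P(w_i \mid w_1^{i-1}) \approx P_d(w_i \mid w_{i-d})

% bag-of-words: keep the history words but disregard their arrangement
P(w_i \mid w_1^{i-1}) \approx P\big(w_i \mid \{w_1, \dots, w_{i-1}\}\big)
```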
- Our proposed solution to the problem in LM –
- What is the problem we want to solve?
- We again illustrate the problem in more detail:
- The inter-dependencies among the events are neglected when incorporating multiple events from the history for word prediction.
- How does this thesis address the problem? – By exploiting the events jointly: the LM is decomposed into multiple component models to alleviate the data scarcity problem while maintaining the inter-dependencies.
- Provide details of the proposed solution –
- Elaborate on the formulation of the proposed solution.
- Elaborate on how the proposed solution solves the problem – Provide figures and graphs to support the description.
- E.g. decoupling, generalization, smoothing –
- How to decompose the LM into multiple component models according to the events? – By decoupling the history into the TD and TO components (see the sketch after this group of questions).
- How to compute probabilities from the component models? – Smoothing – Weighting.
- How does TDTO relate to other works?
- Different types of event can be exploited; which types are considered by this thesis, and why?
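A sketch of the general shape of the decoupling referred to above, assuming (from the title) that TD and TO stand for the distance/position and the occurrence information of the history words; the exact factorisation, normalisation, and smoothing are what this chapter has to specify, and the log-linear combination with weights lambda is only one illustrative way to recombine the components.

```latex
% Illustrative form of the decoupled model (not the exact formulation):
% the TO component conditions on which words occur in the history h,
% the TD component on the distances/positions at which they occur,
% and the two are recombined, e.g. log-linearly with weights lambda.
P(w_i \mid h) \;\propto\;
    P_{\mathrm{TO}}\big(w_i \mid \text{words occurring in } h\big)^{\lambda_{\mathrm{TO}}}
    \cdot
    P_{\mathrm{TD}}\big(w_i \mid \text{distances of the words in } h\big)^{\lambda_{\mathrm{TD}}}
```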
- How to show that the proposed solution can actually solve the problem? – Compare the proposed solution to the conventional solutions – Highlight the other benefits of the proposed method.
- Does the TDTO capture long contexts more effectively than a higher-order n-gram model?
- Does the TDTO capture long contexts more effectively than the bag-of-words model?
- Does the TDTO capture long contexts more effectively than the distant-bigram model?
- Does the TDTO capture long contexts more effectively than the trigger model?
- What is the impact on other applications that require a LM? By addressing the problem, does an NLP application benefit? – Speech recognition – Document classification – Word prediction.
- What are the drawbacks of the proposed solution? – The smoothing of the TD and TO component models.
- An extension of Chapter 3’s solution
- Why is there a need for this extension? – The smoothing of the TD and TO component models has been done naively.
- What are the existing solutions for smoothing?
- Describe the various existing solutions.
- The NN method for performing smoothing?
- What are the NN LM approaches?
- History, successes, approaches.
- How does an NN perform smoothing?
- By projecting the input into a continuous space, in which interpolation among neighbouring vectors provides smoothing for unseen events.
- So what can we do? – NNs have been shown to be useful for smoothing LMs.
- Why is an NN suitable for smoothing the TDTO model?
- How to implement the NN-TDTO? – Show the architecture of the NN-TDTO (see the code sketch after this group of questions).
- What is the architecture of the NN-TDTO?
- How to encode the inputs of the NN-TDTO?
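A minimal sketch of what such an architecture could look like, written here only as an illustration: the TO input is assumed to be a bag-of-words vector over the history, the TD input a set of distance-bucket ids, and both are projected into a shared continuous space before a softmax over the vocabulary. The class name, input encodings, and layer sizes are assumptions for illustration, not the thesis's actual design.

```python
# Illustrative NN-TDTO-style architecture (assumed encodings, not the thesis's exact design).
import torch
import torch.nn as nn

class NNTDTO(nn.Module):
    def __init__(self, vocab_size, num_distance_buckets, embed_dim=64, hidden_dim=128):
        super().__init__()
        # TO branch: continuous projection of the word-occurrence (bag-of-words) vector
        self.to_proj = nn.Linear(vocab_size, embed_dim)
        # TD branch: continuous embedding of the word-distance buckets
        self.td_embed = nn.Embedding(num_distance_buckets, embed_dim)
        self.hidden = nn.Linear(2 * embed_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, to_bow, td_buckets):
        # to_bow:     (batch, vocab_size) bag-of-words counts of the history (TO)
        # td_buckets: (batch, history_len) distance-bucket ids of the history words (TD)
        h_to = torch.tanh(self.to_proj(to_bow))
        h_td = torch.tanh(self.td_embed(td_buckets).mean(dim=1))
        h = torch.tanh(self.hidden(torch.cat([h_to, h_td], dim=-1)))
        return torch.log_softmax(self.out(h), dim=-1)  # log P(w_t | TD, TO)

# Dummy forward pass to show the expected tensor shapes.
model = NNTDTO(vocab_size=1000, num_distance_buckets=10)
bow = torch.zeros(2, 1000)            # two histories, bag-of-words counts (TO)
dist = torch.randint(0, 10, (2, 5))   # distance buckets of 5 history words (TD)
log_probs = model(bow, dist)          # shape (2, 1000)
```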
- How to show that the extension of the proposed solution is effective? – Show the perplexity results – Show the WER results.
- Does the NN-TDTO model have lower perplexity than the TDTO model?
- Does the NN-TDTO model have lower WER than the TDTO model?
- How does the NN-TDTO compare to other NN-based LMs? – Show the perplexity results compared to the NN-based n-gram LM – Show the perplexity results compared to the RNNLM.
- How does the perplexity of the NN-TDTO compare to that of the NNLM?
- How does the perplexity of the NN-TDTO compare to that of the RNNLM?
- What are the other benefits of the NN-TDTO? –
- How does the complexity of the NN-TDTO compare to that of the NNLM?
- How does the complexity of the NN-TDTO compare to that of the RNNLM?
- Conclusion
- What has this thesis achieved? –
- How effective is the proposed TDTO model?
- How is the perplexity?
- How is the WER?
- What are the directions for extending this thesis? –
??? Where should the following questions be?
- How was this problem solved traditionally & how does this thesis address the problem? – By using a bucketing scheme or a maximum-entropy model – By exploiting the events jointly: the LM is decomposed into multiple component models to alleviate the data scarcity problem while maintaining the inter-dependencies.