From 56282ca3de76b69ce37140fd1746fa6a90d6c720 Mon Sep 17 00:00:00 2001
From: levdoescode
Date: Sat, 14 Jan 2023 13:04:49 -0500
Subject: [PATCH] Week 6 notes completed

---
 .../Week 6/Week 6 Notes.md | 141 ++++++++++++++++++
 1 file changed, 141 insertions(+)
 create mode 100644 CM3060 Natural Language Processing/Week 6/Week 6 Notes.md

diff --git a/CM3060 Natural Language Processing/Week 6/Week 6 Notes.md b/CM3060 Natural Language Processing/Week 6/Week 6 Notes.md
new file mode 100644
index 0000000..5f2ded5
--- /dev/null
+++ b/CM3060 Natural Language Processing/Week 6/Week 6 Notes.md
@@ -0,0 +1,141 @@
# Language modeling
### Probabilistic language models
How likely is a given sequence of words? A probabilistic language model assigns a probability to a word sequence, which is useful for:

* Machine translation - prefer the more natural phrasing:
> strong tea vs powerful tea

> powerful engine vs strong engine

* Speech recognition - prefer the more likely transcription:
> I can recognize speech

> I can wreck a nice beach

* Summarization, spelling correction, etc.

### Computing probabilities
The probability of a sequence is the joint probability of all the individual words:

$$
P(W) = P(w_1, w_2, w_3, w_4, w_5, \dots, w_n)
$$

We are also interested in the probability of an upcoming word given the previous words (the history):

$$
P(w_5 | w_1, w_2, w_3, w_4)
$$

The chain rule in probability theory decomposes the probability of a sequence into the product of the conditional probabilities of each word given all the words before it:

$$
P(x_1, x_2, x_3, \dots, x_n) = P(x_1)P(x_2|x_1)P(x_3|x_1, x_2) \dots P(x_n|x_1, \dots, x_{n-1})
$$

### Example
e.g. P(mat | the cat sat on the) = count(the cat sat on the mat) / count(the cat sat on the) = 8 / 10 = 0.8

But we **can't estimate all sequences** from **finite training data**.

e.g. P(bed | the cat sat on the) = count(the cat sat on the bed) / count(the cat sat on the) = 0 / 10 = 0.0 ?

Because 'the cat sat on the bed' never occurs in the training data, we estimate a probability of 0 for a sentence that is perfectly plausible.

### A simplifying assumption (Markov condition/assumption)

P(mat | the cat sat on the) ≈ P(mat | the) -> the previous word only

OR

P(mat | the cat sat on the) ≈ P(mat | on the) -> the previous 2 words

In other words, we limit the context.

### Unigrams and bigrams
To simplify the problem and avoid zero-probability results, we approximate the probability of a sequence using only limited context.

* Simplest case: unigram frequencies

$$
P(w_1 w_2 \dots w_n) \approx \prod_i P(w_i)
$$

The probability of the word sequence is approximated by the product of the individual word probabilities.

* Using bigrams to predict words

$$
P(w_i | w_1 w_2 \dots w_{i-1}) \approx P(w_i | w_{i-1})
$$

In this case we use a bigram: we condition the probability of a word only on the previous word.

P(mat | the cat sat on the) ≈ P(mat | the)

This is an over-estimate of the true conditional probability.

We have learned:
1. We can estimate probabilities by counting occurrences
2. Training corpora are finite, so we make simplifying assumptions
3. We can build n-gram models using these assumptions

### Estimating bigram probabilities
We can estimate second-order (bigram) probabilities using a maximum likelihood estimator:

$$
P(w_i|w_{i-1}) = \frac{c(w_{i-1}, w_i)}{c(w_{i-1})}
$$

The probability of $w_i$ given $w_{i-1}$ is the count of occurrences of the bigram $(w_{i-1}, w_i)$ divided by the count of the word $w_{i-1}$.

![Bigram Probabilities](Files/image1.png)

When the test data contains unseen words or n-grams, we can use smoothing.

![Sparse Data Problem](Files/image2.png)

Laplace smoothing: just add one to all the counts.

![Laplace smoothing](Files/image3.png)

However, Laplace smoothing is not used much for language models because it is imprecise: it shifts too much probability mass onto unseen n-grams.
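
A minimal sketch in Python of the counting described above (the toy corpus and the function name `bigram_prob` are hypothetical, not from the course material), estimating bigram probabilities by maximum likelihood with optional add-one (Laplace) smoothing:

```python
from collections import Counter

# Hypothetical toy training corpus
corpus = "the cat sat on the mat the cat sat on the rug".split()

unigram_counts = Counter(corpus)
bigram_counts = Counter(zip(corpus, corpus[1:]))
vocab_size = len(unigram_counts)

def bigram_prob(prev, word, add_one=False):
    """P(word | prev) = c(prev, word) / c(prev), optionally with add-one smoothing."""
    if add_one:
        # Laplace: add 1 to every bigram count and the vocabulary size to the denominator
        return (bigram_counts[(prev, word)] + 1) / (unigram_counts[prev] + vocab_size)
    return bigram_counts[(prev, word)] / unigram_counts[prev]

print(bigram_prob("the", "cat"))                # maximum likelihood estimate: 2/4 = 0.5
print(bigram_prob("the", "bed", add_one=True))  # unseen bigram, smoothed: 1/10 = 0.1
```

With smoothing enabled, an unseen bigram such as (the, bed) receives a small non-zero probability instead of 0.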

Other smoothing techniques include backoff and interpolation.

![Backoff and interpolation](Files/image4.png)

Backoff uses the higher-order n-gram when it has been seen, and otherwise backs off to a lower-order n-gram. Interpolation always mixes the n-gram orders (usually trigram, bigram and unigram) as a weighted sum, using lambda weights that add up to 1:

$$
\hat{P}(w_i|w_{i-2}w_{i-1}) = \lambda_1 P(w_i|w_{i-2}w_{i-1}) + \lambda_2 P(w_i|w_{i-1}) + \lambda_3 P(w_i), \quad \lambda_1 + \lambda_2 + \lambda_3 = 1
$$

# Evaluating language models
Our LM should prefer likely sequences over unlikely ones.
* We **train** our model using **training** data
* We then **test** it using some **unseen** data

Two approaches to evaluation:
* Extrinsic - use the LM in a real application (e.g. speech recognition) and measure task performance (e.g. accuracy)
* Intrinsic - measure perplexity

Perplexity is analogous to randomness or the branching factor: how many words could plausibly come next? e.g.
> The cat sat on the _?
> I saw a _?

### Perplexity
Perplexity is the inverse probability of the test set, normalized by the number of words.

$$
PP(W) = P(w_1 w_2 \dots w_N)^{-\frac{1}{N}}
$$

The $-1$ in the exponent gives the reciprocal and the $\frac{1}{N}$ gives the $N$-th root:

$$
PP(W) = \sqrt[N]{\frac{1}{P(w_1 w_2 \dots w_N)}}
$$

Then, by the chain rule, the probability of the sequence is the product of the conditional probabilities of the words within it:
$$
PP(W) = \sqrt[N]{\prod_{i=1}^{N}\frac{1}{P(w_i|w_1 \dots w_{i-1})}}
$$

High probability ≈ low perplexity (lower perplexity is better).

![Example](Files/image5.png)

# Topic modeling
Topics are an abstract representation of the way the documents in a collection are grouped together.

# Summary
![Summary1](Files/image6.png)
![Summary2](Files/image7.png)
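
A minimal, self-contained sketch (again with a hypothetical toy corpus, not from the course material) of computing the perplexity of a test sentence under an add-one-smoothed bigram model, working in log space to avoid numerical underflow:

```python
import math
from collections import Counter

# Hypothetical toy data: train a smoothed bigram model, then score an unseen test sentence
train = "the cat sat on the mat the cat sat on the rug".split()
test = "the cat sat on the bed".split()

unigrams = Counter(train)
bigrams = Counter(zip(train, train[1:]))
V = len(unigrams)

def prob(prev, word):
    # Add-one smoothed bigram probability, so unseen bigrams like (the, bed) stay non-zero
    return (bigrams[(prev, word)] + 1) / (unigrams[prev] + V)

# PP(W) = (prod_i 1 / P(w_i | w_{i-1}))^(1/N), computed in log space
log_prob = sum(math.log(prob(prev, word)) for prev, word in zip(test, test[1:]))
N = len(test) - 1  # number of bigram predictions made
print(math.exp(-log_prob / N))
```

A model that assigns higher probability to the test text yields lower perplexity.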