Week 6 notes completed

This commit is contained in:
levdoescode
2023-01-14 13:04:49 -05:00
parent 88703fd872
commit 56282ca3de

# Language modeling
### Probabilistic language models
How likely is a given sequence of words?
* Machine translation:
> Strong tea vs powerful tea
> Strong engine vs powerful engine
* Speech recognition
> I can recognize speech
> I can wreck a nice beach
* Summarization, spelling correction, etc.
### Computing probabilities
The probability of a sequence is the joint probability of all the individual words.
P(W) = P(w1, w2, w3, w4, w5, ..., wn)
The probability of an upcoming term given the previous words (the history):
P(w5 | w1, w2, w3, w4)
The chain rule in probability theory:
The probability of a sequence is the product of the conditional probabilities of each word given all the words that come before it:
P(x1, x2, x3, ..., xn) = P(x1)P(x2|x1)P(x3|x1, x2)...P(xn|x1,...,xn-1)
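For example, applying the chain rule to the three-word sequence "the cat sat":
$$
P(\text{the cat sat}) = P(\text{the}) \cdot P(\text{cat} \mid \text{the}) \cdot P(\text{sat} \mid \text{the cat})
$$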
### Example
e.g. P(mat | the cat sat on the) = count(the cat sat on the mat)/count(the cat sat on the) = 8 / 10 = 0.8
But we **can't estimate all sequences** from **finite training data**
e.g. P(bed | the cat sat on the) = count(the cat sat on the bed)/count(the cat sat on the) = 0 / 10 = 0.0 ?
Because 'the cat sat on the bed' never occurs in the training data, we get a probability of 0, even though the sentence should not be impossible.
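A minimal sketch of this counting approach on a toy corpus (the corpus, and therefore the counts, are made up for illustration):
```python
from collections import Counter

# Toy corpus; in practice this would be a large collection of text.
corpus = "the cat sat on the mat . the cat sat on the sofa .".split()

def history_counts(tokens, history):
    """Count occurrences of `history` and which word follows each occurrence."""
    h = history.split()
    followers = Counter()
    occurrences = 0
    for i in range(len(tokens) - len(h)):
        if tokens[i:i + len(h)] == h:
            occurrences += 1
            followers[tokens[i + len(h)]] += 1
    return occurrences, followers

occurrences, followers = history_counts(corpus, "the cat sat on the")
# P(mat | the cat sat on the) = count(history + "mat") / count(history)
print(followers["mat"] / occurrences)   # 0.5 in this toy corpus
print(followers["bed"] / occurrences)   # 0.0 -- unseen continuation gets zero probability
```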
### A simplifying assumption (Markov condition/assumption)
P(mat | the cat sat on the) ≈ P(mat | the) -> condition on the previous word only
OR
P(mat | the cat sat on the) ≈ P(mat | on the) -> condition on the previous 2 words
We limit the context to the last few words instead of the full history.
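In general, an n-gram model conditions only on the previous n-1 words:
$$
P(w_i \mid w_1, \dots, w_{i-1}) \approx P(w_i \mid w_{i-n+1}, \dots, w_{i-1})
$$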
### Unigrams and bigrams
To simplify the problem and avoid zero probabilities, we approximate the probability of a sequence as a product of probabilities over much shorter contexts.
* Simplest case: unigram frequencies
P(w1 w2 ... wn) ≈ Π P(wi)
The probability of the word sequence is the product of the individual word probabilities.
* Using bigrams to predict words
P(wi | w1 w2 ... wi-1) ≈ P(wi | wi-1)
In this case we use bigrams, meaning we condition the probability of each word only on the previous word.
P(mat | the cat sat on the) ≈ P(mat | the)
This is only a rough approximation.
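A minimal sketch of scoring a sequence under these two approximations, assuming unigram and bigram probability tables are already available (the numbers below are made up for illustration):
```python
import math

# Hypothetical probability tables; in practice they are estimated from a corpus.
unigram_p = {"the": 0.07, "cat": 0.01, "sat": 0.005}
bigram_p = {("the", "cat"): 0.02, ("cat", "sat"): 0.1}

def unigram_score(words):
    """P(w1 ... wn) ~= product of P(wi)."""
    return math.prod(unigram_p[w] for w in words)

def bigram_score(words):
    """P(w1 ... wn) ~= P(w1) * product of P(wi | wi-1)."""
    p = unigram_p[words[0]]
    for prev, cur in zip(words, words[1:]):
        p *= bigram_p[(prev, cur)]
    return p

print(unigram_score(["the", "cat", "sat"]))  # 0.07 * 0.01 * 0.005
print(bigram_score(["the", "cat", "sat"]))   # 0.07 * 0.02 * 0.1
```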
We have learned:
1. We can calculate the probabilities by counting occurrences
2. Training corpora are finite, so we make simplifying assumptions
3. We can build n-gram models using these assumptions
### Estimating bigram probabilities
We can estimate second order (bigram) probabilities using a maximum likelihood estimator:
$$
P(w_i \mid w_{i-1}) = \frac{c(w_{i-1}, w_i)}{c(w_{i-1})}
$$
The probability of $w_i$ given $w_{i-1}$ is equal to the count of occurrences of the bigram $(w_{i-1}, w_i)$ divided by the count of the word $w_{i-1}$
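A minimal sketch of this maximum likelihood estimate, counting unigrams and bigrams over a toy tokenized corpus (the sentences are illustrative only):
```python
from collections import Counter

tokens = "<s> i am sam </s> <s> sam i am </s> <s> i do not like green eggs and ham </s>".split()

unigram_counts = Counter(tokens)
# Note: this also counts bigrams that cross sentence boundaries, e.g. ("</s>", "<s>").
bigram_counts = Counter(zip(tokens, tokens[1:]))

def bigram_mle(prev, word):
    """P(word | prev) = c(prev, word) / c(prev)."""
    return bigram_counts[(prev, word)] / unigram_counts[prev]

print(bigram_mle("<s>", "i"))    # 2/3
print(bigram_mle("i", "am"))     # 2/3
print(bigram_mle("sam", "</s>")) # 1/2
```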
![Bigram Probabilities](Files/image1.png)
When the test data contains words or n-grams that never occurred in training (the sparse data problem), we can use smoothing.
![Sparse Data Problem](Files/image2.png)
Laplace smoothing - Just add one to all the counts
![Laplace smoothing](Files/image3.png)
However, it is not used much for n-gram language models because of its imprecision: it moves too much probability mass to unseen events.
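A minimal sketch of the add-one estimate on a toy corpus (hypothetical data; V is the vocabulary size):
```python
from collections import Counter

tokens = "<s> i am sam </s> <s> sam i am </s>".split()
unigram_counts = Counter(tokens)
bigram_counts = Counter(zip(tokens, tokens[1:]))
V = len(unigram_counts)  # vocabulary size (number of distinct word types)

def bigram_laplace(prev, word):
    """Add-one estimate: P(word | prev) = (c(prev, word) + 1) / (c(prev) + V)."""
    return (bigram_counts[(prev, word)] + 1) / (unigram_counts[prev] + V)

# An unseen bigram now gets a small non-zero probability instead of 0.
print(bigram_laplace("i", "sam"))  # (0 + 1) / (2 + 5)
print(bigram_laplace("i", "am"))   # (2 + 1) / (2 + 5)
```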
Other smoothing techniques include backoff and interpolation.
![Backoff and interpolation](Files/image4.png)
Interpolation uses weights (lambdas) that sum to 1; each lambda multiplies the probability estimate from a different n-gram order, usually trigram, bigram, and unigram, and the weighted estimates are added together.
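A minimal sketch of simple linear interpolation; the component models and lambda values below are placeholders (in practice the lambdas are tuned on held-out data):
```python
# Placeholder component models; each would return a real probability estimate.
def p_unigram(w):
    return 0.01

def p_bigram(w, prev):
    return 0.05

def p_trigram(w, prev2, prev1):
    return 0.20

# The lambdas must sum to 1.
LAMBDA_TRI, LAMBDA_BI, LAMBDA_UNI = 0.6, 0.3, 0.1

def p_interpolated(w, prev2, prev1):
    """Weighted mix of trigram, bigram, and unigram estimates."""
    return (LAMBDA_TRI * p_trigram(w, prev2, prev1)
            + LAMBDA_BI * p_bigram(w, prev1)
            + LAMBDA_UNI * p_unigram(w))

print(p_interpolated("mat", "on", "the"))  # 0.6*0.20 + 0.3*0.05 + 0.1*0.01 = 0.136
```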
# Evaluating language models
Our LM should prefer likely sequences over unlikely ones
* We **train** our model using **training** data
* We then **test** it using some **unseen** data
Two approaches to evaluation:
* Extrinsic - use in a real app (e.g. speech recognition), measure performance (e.g. accuracy)
* Intrinsic - use perplexity
Perplexity is analogous to randomness, or to the branching factor: roughly, how many words could plausibly come next, e.g.
> The cat sat on the _?
> I saw a _?
### Perplexity
Perplexity is the inverse probability of the test set, normalized by the number of words.
$$
PP(W) = P(w_1 w_2 \dots w_N)^{-\frac{1}{N}}
$$
The $-1$ in the exponent gives the reciprocal, and the $\frac{1}{N}$ gives the $N$-th root:
$$
PP(W) = \sqrt[N]{\frac{1}{P(w_1w_2...w_N)}}
$$
Then, by the chain rule, the probability of the sequence equals the product of the conditional probabilities of each word given its history:
$$
PP(W) = \sqrt[N]{\prod_{i=1}^{N}\frac{1}{P(w_i|w_1...w_{i-1})}}
$$
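As a sanity check on the branching-factor intuition: a model that assigns every word the same probability $\frac{1}{V}$ over a vocabulary of $V$ words has perplexity exactly $V$:
$$
PP(W) = \sqrt[N]{\prod_{i=1}^{N}\frac{1}{1/V}} = \sqrt[N]{V^N} = V
$$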
High probability ~= low perplexity
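A minimal sketch of computing perplexity for a bigram model on a short test sequence, working in log space to avoid underflow (the bigram probabilities are hypothetical):
```python
import math

# Hypothetical smoothed bigram probabilities P(w | prev).
bigram_p = {("<s>", "the"): 0.25, ("the", "cat"): 0.10,
            ("cat", "sat"): 0.20, ("sat", "</s>"): 0.15}

def perplexity(words):
    """PP(W) = P(w1 ... wN)^(-1/N), computed via log probabilities."""
    log_prob = sum(math.log(bigram_p[(prev, cur)])
                   for prev, cur in zip(words, words[1:]))
    N = len(words) - 1  # number of predicted tokens (everything after <s>)
    return math.exp(-log_prob / N)

print(perplexity(["<s>", "the", "cat", "sat", "</s>"]))  # lower is better
```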
![Example](Files/image5.png)
# Topic modeling
Topics are an abstract representation of the way the documents in a collection are grouped together.
# Summary
![Summary1](Files/image6.png)
![Summary2](Files/image7.png)