diff --git a/CM3060 Natural Language Processing/Week 8/Files/distributed.png b/CM3060 Natural Language Processing/Week 8/Files/distributed.png
new file mode 100644
index 0000000..d36f471
Binary files /dev/null and b/CM3060 Natural Language Processing/Week 8/Files/distributed.png differ
diff --git a/CM3060 Natural Language Processing/Week 8/Files/embeddings.png b/CM3060 Natural Language Processing/Week 8/Files/embeddings.png
new file mode 100644
index 0000000..f6c5eb4
Binary files /dev/null and b/CM3060 Natural Language Processing/Week 8/Files/embeddings.png differ
diff --git a/CM3060 Natural Language Processing/Week 8/Files/topics.png b/CM3060 Natural Language Processing/Week 8/Files/topics.png
new file mode 100644
index 0000000..56ee12e
Binary files /dev/null and b/CM3060 Natural Language Processing/Week 8/Files/topics.png differ
diff --git a/CM3060 Natural Language Processing/Week 8/Files/wordnet.png b/CM3060 Natural Language Processing/Week 8/Files/wordnet.png
new file mode 100644
index 0000000..2386fb3
Binary files /dev/null and b/CM3060 Natural Language Processing/Week 8/Files/wordnet.png differ
diff --git a/CM3060 Natural Language Processing/Week 8/Week 8 notes.md b/CM3060 Natural Language Processing/Week 8/Week 8 notes.md
new file mode 100644
index 0000000..741eb75
--- /dev/null
+++ b/CM3060 Natural Language Processing/Week 8/Week 8 notes.md
@@ -0,0 +1,110 @@
+# Vector semantics and embeddings
+
+WordNet has the shortcoming of being manually maintained: keeping it up to date requires a lot of human effort.
+
+Let's instead define words by the company they keep:
+* Words with **similar contexts** have **similar meanings**. This is also referred to as the **distributional hypothesis**.
+* > If A and B have almost identical environments, we say that they are synonyms
+* E.g. doctor|surgeon (patient, hospital, treatment, etc.)
+
+## The distributional hypothesis
+Distributional models are based on a co-occurrence matrix:
+* Term-document matrix
+* Term-term matrix
+
+### Term-document matrix
+
+| | As You Like It | Twelfth Night | Julius Caesar | Henry V |
+|---|---|---|---|---|
+|battle|1|0|7|13|
+|good|114|80|62|89|
+|fool|36|58|1|4|
+|wit|20|15|2|3|
+
+The overall matrix is |V| by |D| (vocabulary size by number of documents).
+
+**Similar documents** contain **similar words**:
+* Represented by the **column vectors**
+
+**Similar words** occur in **similar documents**:
+* Represented by the **row vectors**
+
+### Term-term matrix
+
+| | computer | data | result | pie | sugar |
+|---|---|---|---|---|---|
+|cherry| 2 | 8 | 9 | 442 | 25 |
+|strawberry | 0 | 0 | 1 | 60 | 19 |
+|digital | 1670 | 1683 | 85 | 5 | 4 |
+|information | 3325 | 3982 | 378 | 5 | 13 |
+
+Term-term matrices are **sparse**:
+* Term vectors are **long** (length |V|)
+* Most entries are **zero**
+
+Raw counts don't reflect underlying linguistic structure: `food is bad` and `meal was awful` share no terms, even though they mean much the same thing.
+
+## Word embeddings
+
+Let's represent words using **low-dimensional** vectors instead:
+* Capture the similarity between terms, e.g. `food|meal`, `bad|awful`, etc.
+* **50-300** dimensions (rather than |V|)
+* Most values are non-zero (dense rather than sparse)
+
+Benefits:
+* Classifiers need to learn far **fewer weights**
+* Helps with **generalization** and avoids **overfitting**
+* Captures **synonymy**
+
+## Word2vec
+Word2vec is a software package that learns:
+* **Static** embeddings, i.e. one vector per word type (unlike contextual models such as BERT or ELMo)
+
+Key idea:
+* **Predict** rather than count
+* **Binary prediction task**: "Is word x likely to co-occur with word y?"
+* Keep the **classifier weights** as the embeddings
+* Running text is the training data
+
+Basic algorithm (skip-gram with negative sampling):
+1. Treat neighboring context words as **positive** samples
+2. Treat other random words in V as **negative** samples
+3. Train a **logistic regression classifier** to distinguish the two classes
+4. Use the **learned weights** as the **embeddings**
+
+## Training the classifier
+**Iterate** through the training data, e.g.
+`The cat sat on the mat`
+
+Generate **positive** samples from a +/- 2 word window around the target, e.g. for 'sat':
+(sat, cat), (sat, The), (sat, on), (sat, the)
+
+Generate **k negative** samples per positive one by drawing random words from V, e.g.
+(sat, trumpet), (sat, nice), (sat, explode), (sat, if), ...
+
+We want to **maximize** the similarity of the (w, c_pos) pairs and **minimize** the similarity of the (w, c_neg) pairs.
+
+Starting with **random** vectors, use **stochastic gradient descent** to:
+* **maximize** the dot product of each word with its **actual** context words
+* **minimize** the dot product of each word with the **negative** non-context words
+
+Outputs: a **target** matrix W and a **context** matrix C.
+Embedding for word i = W_i + C_i
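+
+A minimal sketch of this training loop in plain numpy, for intuition only: the toy corpus, the hyperparameters (`dim`, `window`, `k`, the learning rate) and the uniform negative sampling are illustrative choices, not the actual word2vec implementation.
+
+```python
+import numpy as np
+
+rng = np.random.default_rng(0)
+
+# Toy corpus and vocabulary (illustrative only)
+corpus = "The cat sat on the mat".split()
+vocab = sorted(set(corpus))
+word2id = {w: i for i, w in enumerate(vocab)}
+
+V, dim, window, k, lr, epochs = len(vocab), 10, 2, 2, 0.05, 200
+
+# Target matrix W and context matrix C, randomly initialised
+W = rng.normal(scale=0.1, size=(V, dim))
+C = rng.normal(scale=0.1, size=(V, dim))
+
+def sigmoid(x):
+    return 1.0 / (1.0 + np.exp(-x))
+
+for _ in range(epochs):
+    for pos, word in enumerate(corpus):
+        w = word2id[word]
+        # Positive samples: words within +/- `window` positions of the target
+        for ctx_pos in range(max(0, pos - window), min(len(corpus), pos + window + 1)):
+            if ctx_pos == pos:
+                continue
+            c_pos = word2id[corpus[ctx_pos]]
+            # k negative samples: random vocabulary words
+            # (real word2vec weights this sampling by word frequency)
+            negatives = rng.integers(0, V, size=k)
+
+            # Gradients of the negative-sampling loss:
+            # push sigma(w . c_pos) towards 1 and sigma(w . c_neg) towards 0
+            g_pos = sigmoid(W[w] @ C[c_pos]) - 1.0
+            grad_w = g_pos * C[c_pos]
+            grad_c_pos = g_pos * W[w]
+            neg_grads = []
+            for c_neg in negatives:
+                g_neg = sigmoid(W[w] @ C[c_neg])
+                grad_w += g_neg * C[c_neg]
+                neg_grads.append((c_neg, g_neg * W[w]))
+
+            # Stochastic gradient descent updates
+            W[w] -= lr * grad_w
+            C[c_pos] -= lr * grad_c_pos
+            for c_neg, grad in neg_grads:
+                C[c_neg] -= lr * grad
+
+# Final embedding for word i = W_i + C_i
+embeddings = W + C
+print({w: np.round(embeddings[i][:3], 2) for w, i in word2id.items()})
+```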
+
+## Other static embeddings
+fastText:
+* Deals with **unknown** words and **sparsity** by using **subword models**
+* E.g. character n-grams
+* The embedding for a given word is the **sum** of the embeddings of its constituent n-grams
+
+GloVe:
+* Uses **global** corpus statistics
+* Combines **count-based** models with word2vec-style **linear** structures
+
+![](Files/topics.png)
+![](Files/wordnet.png)
+![](Files/distributed.png)
+![](Files/embeddings.png)
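+
+In practice these models are usually trained or loaded with an off-the-shelf library rather than implemented from scratch. Below is a small usage sketch with gensim; gensim is not mentioned in the lecture, so treat the calls as an assumption about the gensim 4.x API and the parameter values as illustrative only.
+
+```python
+from gensim.models import FastText, Word2Vec
+
+# Tiny illustrative corpus: a list of tokenised sentences
+sentences = [
+    ["the", "cat", "sat", "on", "the", "mat"],
+    ["the", "dog", "sat", "on", "the", "rug"],
+]
+
+# Skip-gram with negative sampling (sg=1, negative=5) gives static embeddings
+w2v = Word2Vec(sentences, vector_size=50, window=2, sg=1, negative=5, min_count=1)
+print(w2v.wv["cat"][:5])           # learned vector for 'cat'
+print(w2v.wv.most_similar("cat"))  # nearest neighbours by cosine similarity
+
+# fastText builds word vectors from character n-gram embeddings,
+# so it can produce a vector even for a word it has never seen
+ft = FastText(sentences, vector_size=50, window=2, min_count=1, min_n=3, max_n=5)
+print(ft.wv["kitten"][:5])         # 'kitten' is not in the training corpus
+
+# Pre-trained GloVe vectors can be loaded through gensim's downloader
+# (requires a network download, hence commented out):
+# import gensim.downloader as api
+# glove = api.load("glove-wiki-gigaword-50")
+```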