# Vector semantics and embeddings

WordNet has the shortcoming of being manually maintained: keeping it up to date requires a great deal of human effort.

Let's define words by the company they keep:

* Words with **similar contexts** have **similar meanings**. Also referred to as the **distributional hypothesis**.
* > If A and B have almost identical environments, we say that they are synonyms.
* E.g. `doctor|surgeon` share contexts (patient, hospital, treatment, etc.)

## The distributional hypothesis

Distributional models are based on a co-occurrence matrix:

* Term-document matrix
* Term-term matrix

### Term-document matrix

| | As you like it | Twelfth night | Julius Caesar | Henry V |
|---|---|---|---|---|
| battle | 1 | 0 | 7 | 13 |
| good | 114 | 80 | 62 | 89 |
| fool | 36 | 58 | 1 | 4 |
| wit | 20 | 15 | 2 | 3 |

The overall matrix is |V| × |D|: one row per vocabulary term, one column per document.

**Similar documents** have **similar words**:

* Represented by the **column vectors**

**Similar words** occur in **similar documents**:

* Represented by the **row vectors**
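
As a quick illustration (a minimal numpy sketch, not part of the original notes), cosine similarity over the columns of the matrix above compares documents, and over the rows compares words:

```python
import numpy as np

# Term-document counts from the table above.
# Rows: battle, good, fool, wit
# Columns: As you like it, Twelfth night, Julius Caesar, Henry V
X = np.array([
    [1,   0,  7, 13],   # battle
    [114, 80, 62, 89],  # good
    [36,  58,  1,  4],  # fool
    [20,  15,  2,  3],  # wit
])

def cosine(a, b):
    """Cosine similarity between two count vectors."""
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Similar documents have similar words: compare column vectors.
print(cosine(X[:, 0], X[:, 1]))  # As you like it vs Twelfth night (both comedies)
print(cosine(X[:, 2], X[:, 3]))  # Julius Caesar vs Henry V

# Similar words occur in similar documents: compare row vectors.
print(cosine(X[2], X[3]))        # fool vs wit
print(cosine(X[0], X[2]))        # battle vs fool
```
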
### Term-term matrix

Each cell counts how often the row word co-occurs with the column word within a context window:

| | computer | data | result | pie | sugar |
|---|---|---|---|---|---|
| cherry | 2 | 8 | 9 | 442 | 25 |
| strawberry | 0 | 0 | 1 | 60 | 19 |
| digital | 1670 | 1683 | 85 | 5 | 4 |
| information | 3325 | 3982 | 378 | 5 | 13 |

Term-term matrices are **sparse**:

* Term vectors are **long** (dimension |V|)
* Most entries are **zero**

Raw count vectors also fail to reflect underlying linguistic structure:

* `food is bad` and `meal was awful` mean roughly the same thing, yet share no terms, so their count vectors do not overlap.
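
A tiny bag-of-words sketch (my own toy example, using the two phrases above) of why raw count vectors miss this similarity:

```python
import numpy as np

# Combined vocabulary of the two phrases (toy example).
vocab = ["food", "is", "bad", "meal", "was", "awful"]

def bow(sentence):
    """Bag-of-words count vector over the toy vocabulary."""
    words = sentence.split()
    return np.array([words.count(w) for w in vocab])

a = bow("food is bad")
b = bow("meal was awful")

# No shared terms, so the dot product (and hence cosine similarity) is zero,
# even though the two phrases mean roughly the same thing.
print(a @ b)  # 0
```
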
## Word embeddings

Let's represent words using **low-dimensional** vectors:

* Capture the similarity between terms, e.g. `food|meal`, `bad|awful`, etc.
* **50-300** dimensions (rather than |V|)
* Most values are non-zero (the vectors are **dense**)

Benefits:

* Classifiers need to learn far **fewer weights** (e.g. 300 per feature rather than |V|)
* Helps with **generalization**, avoids **overfitting**
* Captures **synonymy**

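One way to see the synonymy point with off-the-shelf static vectors (a sketch assuming the gensim package and its downloadable `glove-wiki-gigaword-100` model, neither of which is mentioned in the original notes):

```python
import gensim.downloader as api

# Downloads the pre-trained 100-dimensional GloVe vectors on first use.
wv = api.load("glove-wiki-gigaword-100")

print(wv.similarity("food", "meal"))     # relatively high cosine similarity
print(wv.similarity("bad", "awful"))     # relatively high cosine similarity
print(wv.similarity("food", "trumpet"))  # much lower
print(wv.most_similar("food", topn=5))   # nearest neighbours in embedding space
```
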
## Word2vec

Word2vec is a software package that produces:

* **Static** embeddings: one vector per word type (unlike the contextual embeddings of BERT or ELMo)

Key idea:

* **Predict** rather than count
* **Binary prediction task**: "Is word x likely to co-occur with word y?"
* Throw away the classifier, keep the **classifier weights** as the embeddings
* Running text is the training data

Basic algorithm (skip-gram with negative sampling):

1. Treat neighboring context words as **positive** samples
2. Treat other random words in V as **negative** samples
3. Train a **logistic regression classifier** to distinguish these two classes
4. Use the **learned weights** as **embeddings**

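Concretely (this equation is not in the original notes, but it is the standard skip-gram formulation), the classifier scores a (word, context) pair by the dot product of their vectors, pushed through a sigmoid:

P(+ | w, c) = σ(c · w) = 1 / (1 + e<sup>−c·w</sup>), and P(− | w, c) = 1 − P(+ | w, c)
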
## Training the classifier

**Iterate** through the training data, e.g.

`The cat sat on the mat`

Generate **positive** samples, e.g. a +/- 2 word window around 'sat':

(sat, cat), (sat, The), (sat, on), (sat, the)

Generate **k negative** samples by pairing 'sat' with random words, e.g.

(sat, trumpet), (sat, nice), (sat, explode), (sat, if), ...
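
A minimal sketch of this sampling step (my own illustration: `skipgram_pairs`, the fixed seed, and drawing negatives uniformly from the sentence's own vocabulary are simplifications; real word2vec samples negatives from a weighted unigram distribution over the whole corpus):

```python
import random

def skipgram_pairs(tokens, window=2, k=2, vocab=None, seed=0):
    """Generate (target, context, label) triples: label 1 for real context
    words within the window, label 0 for k random negative samples each."""
    rng = random.Random(seed)
    vocab = vocab or sorted(set(tokens))
    pairs = []
    for i, target in enumerate(tokens):
        contexts = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        for c in contexts:
            pairs.append((target, c, 1))                      # positive sample
            for _ in range(k):
                pairs.append((target, rng.choice(vocab), 0))  # negative sample
    return pairs

for pair in skipgram_pairs("The cat sat on the mat".split()):
    print(pair)
```
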

We want to **maximize** the similarity of (w, c<sub>pos</sub>) pairs and **minimize** the similarity of (w, c<sub>neg</sub>) pairs.

Starting with **random** vectors, use **stochastic gradient descent** to:

* **maximize** the dot product of a word with its **actual** context words
* **minimize** the dot product of a word with the **negative** (non-context) words

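Written out (the standard skip-gram with negative sampling loss; this formula is not in the original notes but follows directly from the two bullets above), for one positive pair (w, c<sub>pos</sub>) and its k negatives, SGD minimizes

$$
L = -\Big[ \log \sigma(c_{pos} \cdot w) + \sum_{i=1}^{k} \log \sigma(-c_{neg_i} \cdot w) \Big]
$$

where σ is the sigmoid function from the classifier above.
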
Outputs: a **target** matrix W and a **context** matrix C

Embedding for word i = W<sub>i</sub> + C<sub>i</sub>

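For reference, a sketch of running this training with the gensim library (gensim is not mentioned in the original notes; the toy corpus only shows the API shape, meaningful similarities need a real corpus):

```python
from gensim.models import Word2Vec

# Skip-gram with negative sampling: sg=1 selects skip-gram, negative=5 sets k,
# vector_size is the embedding dimension, window the context size.
sentences = [["the", "cat", "sat", "on", "the", "mat"],
             ["the", "dog", "sat", "on", "the", "rug"]]
model = Word2Vec(sentences, vector_size=50, window=2, sg=1, negative=5, min_count=1)

print(model.wv["cat"].shape)              # (50,)
print(model.wv.similarity("cat", "dog"))  # cosine similarity of the two embeddings
```
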
## Other static embeddings

fastText:

* Deals with **unknown** words and **sparsity** by using **subword models**
* E.g. character n-grams
* The embedding for a word is the **sum** of the embeddings of its character n-grams (plus the word itself)

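A sketch of the subword idea (my own helper, mirroring the n = 3 case; fastText itself uses n-grams of several lengths):

```python
def char_ngrams(word, n=3):
    """Character n-grams of a word, with < and > marking the word boundaries,
    plus the whole (bracketed) word itself, as in fastText's subword model."""
    padded = f"<{word}>"
    grams = [padded[i:i + n] for i in range(len(padded) - n + 1)]
    return grams + [padded]

print(char_ngrams("where"))
# ['<wh', 'whe', 'her', 'ere', 're>', '<where>']
```
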
GloVe:

* Uses **global** corpus statistics (co-occurrence counts over the whole corpus)
* Combines **count-based** models with word2vec's **linear** structures

|
|

|
|

|
|

|