Vector semantics and embeddings
WordNet has the shortcoming of being manually maintained: keeping it up to date requires a lot of human effort.
Let's define words by the company they keep
- Words with similar contexts have similar meanings. Also referred to as the distributional hypothesis.
- If A and B have almost identical environments, we say that they are synonyms
- E.g. doctor|surgeon (patient, hospital, treatment, etc.)
The distributional hypothesis
Distributional models are based on a co-occurrence matrix
- Term-document matrix
- Term-term matrix
Term-document matrix
| | As you like it | Twelfth night | Julius Caesar | Henry V |
|---|---|---|---|---|
| battle | 1 | 0 | 7 | 13 |
| good | 114 | 80 | 62 | 89 |
| fool | 36 | 58 | 1 | 4 |
| wit | 20 | 15 | 2 | 3 |
The overall matrix is |V| × |D| (vocabulary size by number of documents)
Similar documents have similar words:
- Represented by the column vectors
Similar words occur in similar documents:
- Represented by the row vectors
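Cosine similarity over the counts in the table makes this concrete. A minimal sketch, assuming the four plays and four terms above (the helper function names are illustrative):

```python
# Cosine similarity over the term-document counts shown above.
# Column vectors represent documents; row vectors represent words.
import numpy as np

terms = ["battle", "good", "fool", "wit"]
docs = ["As you like it", "Twelfth night", "Julius Caesar", "Henry V"]
X = np.array([[  1,   0,   7,  13],
              [114,  80,  62,  89],
              [ 36,  58,   1,   4],
              [ 20,  15,   2,   3]], dtype=float)

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

# Document similarity: compare column vectors (the two comedies pattern together)
print(cosine(X[:, docs.index("As you like it")], X[:, docs.index("Twelfth night")]))
# Word similarity: compare row vectors
print(cosine(X[terms.index("fool")], X[terms.index("wit")]))
```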
Term-term matrix
| | computer | data | result | pie | sugar |
|---|---|---|---|---|---|
| cherry | 2 | 8 | 9 | 442 | 25 |
| strawberry | 0 | 0 | 1 | 60 | 19 |
| digital | 1670 | 1683 | 85 | 5 | 4 |
| information | 3325 | 3982 | 378 | 5 | 13 |
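Counts like these come from sliding a context window over a large corpus. A minimal sketch of the counting step, assuming a ±2-word window and a toy sentence (the real table is built from far more text):

```python
# Build a term-term co-occurrence matrix (as nested counts) from running text.
from collections import defaultdict

def cooccurrence_counts(tokens, window=2):
    counts = defaultdict(lambda: defaultdict(int))
    for i, word in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                counts[word][tokens[j]] += 1
    return counts

tokens = "the cherry pie had too much sugar for my taste".split()
counts = cooccurrence_counts(tokens)
print(dict(counts["cherry"]))   # {'the': 1, 'pie': 1, 'had': 1}
```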
Term-term matrices are sparse
- Term vectors are long (length |V|)
- Most entries are zero
Doesn't reflect underlying linguistic structure, e.g. "food is bad" and "meal was awful" mean much the same but share no terms.
Word embeddings
Let's represent words using low-dimensional vectors
- Capture the similarity between terms, e.g. food|meal, bad|awful, etc.
- 50-300 dimensions (rather than |V|)
- Most values are non-zero
Benefits
- Classifiers need to learn far fewer weights
- Helps with generalization, avoids overfitting
- Captures synonymy
Word2vec
Word2vec software package
- Static embeddings (unlike BERT or ELMo)
Key idea:
- Predict rather than count
- Binary prediction task: "Is word x likely to co-occur with word y?"
- Keep the learned classifier weights as the embeddings
- Running text is the training data
Basic algorithm (skip-gram with negative sampling)
- Treat neighboring context words as positive samples
- Treat other random words in V as negative samples
- Train a logistic regression classifier to distinguish these classes
- Use learned weights as embeddings
Training the classifier
Iterate through the training data, e.g.
The cat sat on the mat
Generate positive samples from a ±2-word context window around the target, e.g. for 'sat':
(sat, cat), (sat, The), (sat, on), (sat, the)
Generate k negative samples, e.g.
(sat, trumpet), (sat, nice), (sat, explode), (sat, if),...
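A minimal sketch of this sample generation, assuming a ±2-word window and k=2 (real word2vec draws negatives from a smoothed unigram distribution rather than uniformly):

```python
# Generate (target, context) positives and k random negatives per positive.
import random

def skipgram_samples(tokens, window=2, k=2, rng=random):
    vocab = sorted(set(tokens))          # simplification: vocab = words in this text
    positives, negatives = [], []
    for i, target in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j == i:
                continue
            positives.append((target, tokens[j]))
            negatives.extend((target, rng.choice(vocab)) for _ in range(k))
    return positives, negatives

pos, neg = skipgram_samples("The cat sat on the mat".split())
print(pos[:4])   # [('The', 'cat'), ('The', 'sat'), ('cat', 'The'), ('cat', 'sat')]
```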
We want to maximize the similarity of (w, c_pos) pairs and minimize the similarity of (w, c_neg) pairs (written out as a loss below).
Starting with random vectors, use stochastic gradient descent to:
- maximize dot product of word with actual context words
- minimize dot product of word with negative non-context words
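Written out, for a target word w with one positive context c_pos and k sampled negatives c_neg_1 … c_neg_k, the quantity being minimized is the standard negative-sampling loss:

```latex
L_{\mathrm{CE}}
  = -\log\Big[\sigma(\mathbf{c}_{pos}\cdot\mathbf{w})\prod_{i=1}^{k}\sigma(-\mathbf{c}_{neg_i}\cdot\mathbf{w})\Big]
  = -\Big[\log\sigma(\mathbf{c}_{pos}\cdot\mathbf{w}) + \sum_{i=1}^{k}\log\sigma(-\mathbf{c}_{neg_i}\cdot\mathbf{w})\Big]
```

Here σ is the logistic sigmoid, so lowering the loss raises the dot product with the true context word and lowers it with the sampled negatives.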
Outputs: target matrix W, context matrix C
Embedding for word i = W_i + C_i
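Put together, one SGD update for a single training instance can be sketched as below (array shapes, learning rate, and the toy indices are assumptions; real implementations add subsampling, a smoothed negative-sampling distribution, and multiple epochs):

```python
# One stochastic gradient step for skip-gram with negative sampling.
# W holds target-word vectors, C holds context vectors.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_step(W, C, i_w, i_pos, i_negs, lr=0.05):
    w = W[i_w].copy()
    g_pos = sigmoid(C[i_pos] @ w) - 1.0          # gradient scale for the positive pair
    grad_w = g_pos * C[i_pos]
    C[i_pos] -= lr * g_pos * w                   # pull the true context word closer
    for i in i_negs:
        g_neg = sigmoid(C[i] @ w)                # gradient scale for a negative pair
        grad_w += g_neg * C[i]
        C[i] -= lr * g_neg * w                   # push the sampled negative away
    W[i_w] -= lr * grad_w

rng = np.random.default_rng(0)
V, d = 1000, 100                                 # vocabulary size, embedding dimension
W = rng.normal(scale=0.01, size=(V, d))
C = rng.normal(scale=0.01, size=(V, d))
sgns_step(W, C, i_w=3, i_pos=17, i_negs=[42, 99])
embedding_3 = W[3] + C[3]                        # final embedding, W_i + C_i as above
```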
Other static embeddings
fastText:
- Deals with unknown words and sparsity by using subword models
- E.g. character n-grams
- Embedding for a given word is the sum of the embeddings of its character n-grams
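A minimal sketch of the subword idea, assuming the 3-to-6-character n-gram range and a simple dictionary lookup (real fastText hashes n-grams into a fixed number of buckets):

```python
# Decompose a word into character n-grams and sum their embeddings.
import numpy as np

def char_ngrams(word, n_min=3, n_max=6):
    w = f"<{word}>"                      # boundary markers distinguish prefixes/suffixes
    grams = [w[i:i + n] for n in range(n_min, n_max + 1)
             for i in range(len(w) - n + 1)]
    return grams + [w]                   # the whole word is included as well

d = 100
table = {}                               # hypothetical n-gram -> vector lookup
def ngram_vector(gram):
    return table.setdefault(gram, np.random.normal(scale=0.01, size=d))

def word_embedding(word):
    return np.sum([ngram_vector(g) for g in char_ngrams(word)], axis=0)

print(char_ngrams("where")[:4])          # ['<wh', 'whe', 'her', 'ere']
vec = word_embedding("unseenword")       # still defined for out-of-vocabulary words
```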
GloVe
- Uses global corpus statistics
- Combines count-based models with word2vec linear structures
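For reference, GloVe's objective is a weighted least-squares fit of word-vector dot products to the log of the global co-occurrence counts X_ij, with a weighting function f that damps rare and very frequent pairs:

```latex
J = \sum_{i,j=1}^{|V|} f(X_{ij})\,\big(\mathbf{w}_i \cdot \tilde{\mathbf{w}}_j + b_i + \tilde{b}_j - \log X_{ij}\big)^2
```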



