**Practical breakdown** of embeddings: how they work, how they're trained, and concrete examples you can actually implement.
---
# 🧠 What embeddings actually are (intuitively + formally)
![Image](images/a.jpg)
![Image](images/b.jpg)
![Image](images/c.jpg)
![Image](images/d.jpg)
![Image](images/e.jpg)
![Image](images/f.jpg)
At the core:
* An embedding is a **function**

$$
f : X \rightarrow \mathbb{R}^d
$$

* It maps raw data $x \in X$ (text, images, etc.) to a **dense vector** $f(x)$.
Key property:
* **Semantic structure is encoded as geometry**
* Similar things → vectors close together
* Different things → far apart
This is the core idea behind vector search, RAG, clustering, etc. ([ibm.com][1])
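To make the geometry concrete, here is a minimal NumPy sketch; the 2-d vectors are made up for illustration (real embeddings have hundreds of dimensions):

```python
import numpy as np

def cosine_similarity(a, b):
    # 1.0 = same direction (similar meaning), ~0 = unrelated, -1 = opposite
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Toy 2-d "embeddings" chosen by hand for illustration
cat    = np.array([0.9, 0.8])
kitten = np.array([0.85, 0.75])
car    = np.array([-0.7, 0.2])

print(cosine_similarity(cat, kitten))  # high -> close in meaning
print(cosine_similarity(cat, car))     # low  -> far apart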
---
# ⚙️ Why embeddings work (the underlying theory)
### 1. Distributional hypothesis
> “You shall know a word by the company it keeps”
* Words appearing in similar contexts → similar vectors
* This is the **foundation of Word2Vec, GloVe, BERT, etc.** ([ibm.com][2])
---
### 2. Geometry encodes meaning
Classic example:
```
king - man + woman ≈ queen
```
This works because relationships become **linear directions in vector space**.
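A toy demonstration with hand-picked 3-d vectors (real models learn this structure in hundreds of dimensions, and analogy search usually excludes the query words themselves):

```python
import numpy as np

# Hypothetical 3-d embeddings constructed so the "gender" direction is linear
vec = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "man":   np.array([0.5, 0.8, 0.1]),
    "woman": np.array([0.5, 0.1, 0.9]),
    "queen": np.array([0.9, 0.1, 0.9]),
}

result = vec["king"] - vec["man"] + vec["woman"]

# Nearest neighbour of the result among the vocabulary
nearest = min(vec, key=lambda w: np.linalg.norm(vec[w] - result))
print(nearest)  # -> "queen"
```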
---
### 3. Dense vs sparse representations
| Method     | Representation             |
| ---------- | -------------------------- |
| One-hot    | huge, sparse, no semantics |
| TF-IDF     | frequency signal only      |
| Embeddings | compact and semantic       |
Embeddings are **low-dimensional but information-rich** representations. ([GeeksforGeeks][3])
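The size difference alone is striking; a quick sketch assuming a 50k-word vocabulary:

```python
import numpy as np

vocab_size, dim = 50_000, 300

one_hot = np.zeros(vocab_size)  # 50,000 dims, all but one are zero
one_hot[42] = 1.0               # pure identity: says nothing about similarity

dense = np.random.randn(dim)    # stand-in for a learned vector:
                                # 300 dims, every coordinate carries signal
```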
---
# 🏗️ How embeddings are trained (core methods)
## 1. Prediction-based models (most important)
### Word2Vec (classic foundation)
![Image](images/a1.jpg)
![Image](images/b1.jpg)
![Image](images/c1.jpg)
![Image](images/d1.jpg)
![Image](images/e1.jpg)
![Image](images/f1.jpg)
Two main training strategies:
### (a) Skip-gram
Predict context from a word:
$$
P(\text{context} \mid \text{word})
$$
### (b) CBOW
Predict word from context:
$$
P(\text{word} \mid \text{context})
$$
Mechanism:
* Input = one-hot vector
* Hidden layer = embedding
* Train via gradient descent
👉 The embedding is literally the **weights of the hidden layer**. ([Medium][4])
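You can see why with plain matrix algebra: multiplying a one-hot vector by the weight matrix just selects one row, so the rows *are* the embeddings. A minimal sketch:

```python
import numpy as np

vocab_size, dim = 10, 4
W = np.random.randn(vocab_size, dim)  # hidden-layer weights: one row per word

one_hot = np.zeros(vocab_size)
one_hot[3] = 1.0                      # the word with index 3

# one_hot @ W picks out row 3 of W -- that row is word 3's embedding
assert np.allclose(one_hot @ W, W[3])
```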
---
### Example (Skip-gram training loop)
```python
# Pseudo-code: skip-gram with a sliding context window
for center in corpus:
    for context in get_context_window(center):
        # maximize P(context | center), i.e. minimize the negative log-likelihood
        loss = -log_probability(context, given=center)
        update_weights(loss)  # one gradient-descent step
```
---
## 2. Matrix factorization (GloVe)
Instead of prediction:
* Build co-occurrence matrix
* Factorize it into lower dimensions
Captures **global statistics**, not just local context. ([Medium][5])
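A rough sketch of the factorization idea using plain SVD on a toy co-occurrence matrix (GloVe itself optimizes a weighted least-squares objective over log co-occurrence counts, but the "matrix → low-rank vectors" idea is the same):

```python
import numpy as np

# Toy symmetric co-occurrence counts for a 4-word vocabulary
X = np.array([
    [0, 3, 1, 0],
    [3, 0, 2, 1],
    [1, 2, 0, 4],
    [0, 1, 4, 0],
], dtype=float)

# Factorize the log-counts with SVD and keep the top 2 components
U, S, _ = np.linalg.svd(np.log1p(X))
word_vectors = U[:, :2] * S[:2]  # one 2-d embedding per word
```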
---
## 3. Neural embedding layers (modern approach)
Used in:
* Transformers (BERT, GPT)
* Recommender systems
Mechanism:
* Embedding = **lookup table**
* Trained jointly with model
```python
import torch

embedding = torch.nn.Embedding(vocab_size, dim)  # a learnable lookup table
vector = embedding(token_id)                     # row lookup: token id -> dense vector
```
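"Trained jointly" means gradients from the task loss flow back into the lookup table. A tiny sketch with a dummy loss standing in for a real objective:

```python
import torch

emb = torch.nn.Embedding(100, 8)
opt = torch.optim.SGD(emb.parameters(), lr=0.1)

ids = torch.tensor([3, 7])
loss = emb(ids).sum()  # dummy loss; a real model computes a task loss here
loss.backward()
opt.step()             # only the rows for tokens 3 and 7 are updated
```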
---
## 4. Contrastive learning (modern SOTA)
Used in:
* sentence embeddings
* CLIP (image-text)
* OpenAI embeddings
Core idea:
$$
\text{similar pairs} \rightarrow \text{closer}, \qquad \text{different pairs} \rightarrow \text{farther}
$$
Loss function (a softmax over similarities, as in InfoNCE):

$$
\mathcal{L} = -\log \frac{e^{\mathrm{sim}(x_i, x_j)}}{\sum_k e^{\mathrm{sim}(x_i, x_k)}}
$$
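A direct PyTorch implementation of that loss (the temperature parameter is a common practical addition not shown in the formula above):

```python
import torch
import torch.nn.functional as F

def info_nce_loss(anchors, positives, temperature=0.07):
    # anchors, positives: (batch, dim); row i of each side is a matching pair
    a = F.normalize(anchors, dim=1)
    p = F.normalize(positives, dim=1)
    logits = a @ p.T / temperature           # sim(x_i, x_k) for every pair
    labels = torch.arange(a.size(0))         # positives sit on the diagonal
    return F.cross_entropy(logits, labels)   # = -log softmax of the true pair
```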
---
# 🔬 How modern embeddings (LLMs) differ
Older:
* static embeddings (Word2Vec)
Modern:
* **contextual embeddings**
Example:
* “bank” (river vs finance) → different vectors
This is why models like BERT/GPT outperform Word2Vec.
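A sketch of how to see this yourself, assuming the Hugging Face `transformers` library is installed (model download happens on first run):

```python
import torch
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def embed(sentence):
    inputs = tok(sentence, return_tensors="pt")
    with torch.no_grad():
        return model(**inputs).last_hidden_state[0]  # one vector per token

river = embed("I sat on the bank of the river.")
money = embed("I deposited cash at the bank.")
# The token vectors for "bank" differ between the two sentences,
# unlike a single static Word2Vec vector.
```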
---
# 🧪 Practical training examples
## Example 1 — Train Word2Vec (Gensim)
```python
from gensim.models import Word2Vec

sentences = [["cat", "sat", "mat"], ["dog", "sat", "floor"]]
model = Word2Vec(sentences, vector_size=100, window=5, min_count=1)
vector = model.wv["cat"]  # 100-d embedding for "cat"
```
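Once trained, the model exposes nearest-neighbour queries (on a corpus this tiny the neighbours are essentially noise, but the API is the same at scale):

```python
print(model.wv.most_similar("sat", topn=2))  # [(word, cosine similarity), ...]
```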
---
## Example 2 — Train embeddings in PyTorch
```python
import torch
import torch.nn as nn

embedding = nn.Embedding(10000, 128)  # vocab_size, embedding dim
input_ids = torch.tensor([1, 5, 23])
vectors = embedding(input_ids)        # shape: (3, 128)
```
---
## Example 3 — Train contrastive embeddings
```python
import torch.nn.functional as F

# Sketch: `model` and the texts are placeholders for an encoder and its inputs
anchor = model(text1)    # embedding of the anchor
positive = model(text2)  # a text known to be similar
negative = model(text3)  # an unrelated text
loss = F.triplet_margin_loss(anchor, positive, negative)
```
---
## Example 4 — PCA dimensionality reduction
```python
from sklearn.decomposition import PCA

# X: (n_samples, original_dim) matrix of embeddings, e.g. 1536-d
pca = PCA(n_components=256)
X_reduced = pca.fit_transform(X)  # -> (n_samples, 256)
```
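It's worth checking how much structure survives the compression; scikit-learn exposes this directly:

```python
retained = pca.explained_variance_ratio_.sum()
print(f"PCA to 256 dims keeps {retained:.1%} of the variance")
```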
---
# 📊 Types of embeddings
| Type | Example |
| ---------- | --------------- |
| Word | Word2Vec, GloVe |
| Sentence | SBERT |
| Document | Doc2Vec |
| Image | CLIP |
| Graph | Node2Vec |
| Multimodal | CLIP, Gemini |
---
# 🧩 Key properties you should care about (engineering perspective)
### 1. Dimensionality
* Typical: 128–1536
* Tradeoff: memory vs accuracy
---
### 2. Distance metric
* cosine similarity (most common)
* dot product
* Euclidean
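All three are one-liners, and for L2-normalized vectors they rank neighbours identically, which is why normalization (next point) matters so much:

```python
import numpy as np

a, b = np.random.randn(128), np.random.randn(128)

dot       = a @ b
cosine    = dot / (np.linalg.norm(a) * np.linalg.norm(b))
euclidean = np.linalg.norm(a - b)
# For unit vectors: ||a - b||^2 = 2 - 2 * (a @ b),
# so all three neighbour orderings coincide.
```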
---
### 3. Normalization
Critical for:
* search quality
* clustering
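A minimal L2-normalization helper (assuming one vector per row):

```python
import numpy as np

def l2_normalize(X, eps=1e-12):
    # scale each row to unit length; cosine similarity then becomes a dot product
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    return X / np.maximum(norms, eps)
```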
---
### 4. Training data distribution
Embeddings are only as good as:
* corpus size
* domain relevance
---
# ⚠️ Common pitfalls (important)
### ❌ Mixing embedding spaces
* embeddings from different models are **not compatible**
---
### ❌ Assuming linear compression is harmless
* PCA can distort semantic relationships
---
### ❌ Ignoring normalization
* dot-product and Euclidean scores conflate meaning with vector magnitude without it
---
### ❌ Using embeddings without evaluation
Always test:
* retrieval accuracy
* clustering quality
---
# 🧠 Mental model (most useful takeaway)
Think of embeddings as:
> A learned coordinate system where **meaning = position**
Training = learning that coordinate system so that:
* similar things cluster
* relationships become directions
---
[1]: https://www.ibm.com/think/topics/vector-embedding "What is Vector Embedding? | IBM"
[2]: https://www.ibm.com/think/topics/word-embeddings "What Are Word Embeddings? | IBM"
[3]: https://www.geeksforgeeks.org/nlp/word-embeddings-in-nlp/ "Word Embeddings in NLP"
[4]: https://medium.com/%40manansuri/a-dummys-guide-to-word2vec-456444f3c673 "A Dummy's Guide to Word2Vec - Medium"
[5]: https://medium.com/%40neri.vvo/word-embedding-a-powerful-tool-word2vec-glove-fasttext-dd6e2171d5 "Word Embedding Explained — Word2Vec, GloVe, FastText"