feat: 🎉 Vectors to Vectors, the beginning

Translate 1536 <-> 768, 3072 <-> 2048
2026-04-11 18:05:05 +02:00
parent d98ea7c222
commit 4009a54e39
58 changed files with 5324 additions and 2 deletions

docs/what_is_embeddings.md Normal file

@@ -0,0 +1,355 @@
**Practical breakdown** of embeddings: how they work, how they're trained, and concrete examples you can actually implement.
---
# 🧠 What embeddings actually are (intuitively + formally)
![Image](images/a.jpg)
![Image](images/b.jpg)
![Image](images/c.jpg)
![Image](images/d.jpg)
![Image](images/e.jpg)
![Image](images/f.jpg)
At the core:
* An embedding is a **function** $f : X \rightarrow \mathbb{R}^d$
* It maps raw data (text, image, etc.) into a **dense vector**.
Key property:
* **Semantic structure is encoded as geometry**
* Similar things → vectors close together
* Different things → far apart
This is the core idea behind vector search, RAG, clustering, etc. ([ibm.com][1])
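A quick way to see this geometry in practice, as a minimal sketch assuming the `sentence-transformers` package and its `all-MiniLM-L6-v2` model (not mentioned above, just one convenient choice):
```python
from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer("all-MiniLM-L6-v2")  # 384-dim sentence embeddings
a, b, c = model.encode(["a cat sat on the mat",
                        "a kitten rested on the rug",
                        "quarterly revenue grew by 8%"])

def cos(x, y):
    return float(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))

print(cos(a, b))  # high: similar meaning, nearby vectors
print(cos(a, c))  # low: unrelated meaning, distant vectors
```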
---
# ⚙️ Why embeddings work (the underlying theory)
### 1. Distributional hypothesis
> “You shall know a word by the company it keeps” (J.R. Firth)
* Words appearing in similar contexts → similar vectors
* This is the **foundation of Word2Vec, GloVe, BERT, etc.** ([ibm.com][2])
---
### 2. Geometry encodes meaning
Classic example:
```
king - man + woman ≈ queen
```
This works because relationships become **linear directions in vector space**.
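You can reproduce the analogy with pretrained vectors; a sketch assuming gensim's downloader and its bundled `glove-wiki-gigaword-100` dataset:
```python
import gensim.downloader as api

wv = api.load("glove-wiki-gigaword-100")  # downloads pretrained GloVe vectors on first use
# king - man + woman: "queen" typically appears at or near the top
print(wv.most_similar(positive=["king", "woman"], negative=["man"], topn=3))
```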
---
### 3. Dense vs sparse representations
| Method     | Representation     |
| ---------- | ------------------ |
| One-hot    | huge, no meaning   |
| TF-IDF     | frequency only     |
| Embeddings | compact + semantic |
Embeddings are **low-dimensional but information-rich** representations. ([GeeksforGeeks][3])
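A toy illustration of why one-hot fails (the vocabulary size here is an assumption):
```python
import numpy as np

vocab_size = 50_000
cat = np.zeros(vocab_size); cat[101] = 1.0
dog = np.zeros(vocab_size); dog[202] = 1.0
print(cat @ dog)  # 0.0 — every distinct word pair looks equally unrelated
# A dense 300-dim embedding uses 0.6% of the memory and encodes graded similarity.
```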
---
# 🏗️ How embeddings are trained (core methods)
## 1. Prediction-based models (most important)
### Word2Vec (classic foundation)
![Image](images/a1.jpg)
![Image](images/b1.jpg)
![Image](images/c1.jpg)
![Image](images/d1.jpg)
![Image](images/e1.jpg)
![Image](images/f1.jpg)
Two main training strategies:
### (a) Skip-gram
Predict context from a word:

$$
P(\text{context} \mid \text{word})
$$

### (b) CBOW
Predict word from context:

$$
P(\text{word} \mid \text{context})
$$
Mechanism:
* Input = one-hot vector
* Hidden layer = embedding
* Train via gradient descent
👉 The embedding is literally the **weights of the hidden layer** ([Medium][4])
---
### Example (Skip-gram training loop)
```python
# pseudo-code: slide a context window over the corpus
for center_word in corpus:
    for context_word in get_context_window(center_word):
        loss = -log_prob(context_word, given=center_word)  # -log P(context | word)
        update_weights(loss)  # gradient step on the embedding matrices
```
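For comparison, a minimal runnable version of the same loop in PyTorch, using a full softmax for clarity (real Word2Vec uses negative sampling; the corpus and hyperparameters are illustrative):
```python
import torch
import torch.nn as nn

corpus = ["the cat sat on the mat".split(), "the dog sat on the floor".split()]
vocab = sorted({w for sent in corpus for w in sent})
idx = {w: i for i, w in enumerate(vocab)}

# build (center, context) pairs from a window of size 1
pairs = []
for sent in corpus:
    for i, w in enumerate(sent):
        for j in (i - 1, i + 1):
            if 0 <= j < len(sent):
                pairs.append((idx[w], idx[sent[j]]))

emb = nn.Embedding(len(vocab), 16)           # the embedding matrix we keep
out = nn.Linear(16, len(vocab), bias=False)  # output projection over the vocab
opt = torch.optim.Adam(list(emb.parameters()) + list(out.parameters()), lr=0.01)
loss_fn = nn.CrossEntropyLoss()              # cross-entropy = -log P(context | word)

centers = torch.tensor([c for c, _ in pairs])
contexts = torch.tensor([c for _, c in pairs])
for epoch in range(200):
    opt.zero_grad()
    loss = loss_fn(out(emb(centers)), contexts)
    loss.backward()
    opt.step()

cat_vector = emb.weight[idx["cat"]]          # trained embedding for "cat"
```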
---
## 2. Matrix factorization (GloVe)
Instead of prediction:
* Build co-occurrence matrix
* Factorize it into lower dimensions
Captures **global statistics**, not just local context. ([Medium][5])
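A sketch of the factorization idea, using truncated SVD on log counts as a simpler stand-in for GloVe's actual weighted least-squares objective (the counts below are made up):
```python
import numpy as np
from sklearn.decomposition import TruncatedSVD

# toy word-word co-occurrence counts for a 4-word vocabulary
cooc = np.array([
    [0, 4, 1, 0],
    [4, 0, 2, 1],
    [1, 2, 0, 3],
    [0, 1, 3, 0],
], dtype=float)

svd = TruncatedSVD(n_components=2)
word_vectors = svd.fit_transform(np.log1p(cooc))  # one dense 2-d vector per word
```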
---
## 3. Neural embedding layers (modern approach)
Used in:
* Transformers (BERT, GPT)
* Recommender systems
Mechanism:
* Embedding = **lookup table**
* Trained jointly with model
```python
import torch

embedding = torch.nn.Embedding(vocab_size, dim)  # learnable lookup table: one row per token
vector = embedding(token_id)                     # token_id: LongTensor of token indices
```
---
## 4. Contrastive learning (modern SOTA)
Used in:
* sentence embeddings
* CLIP (image-text)
* OpenAI embeddings
Core idea:

$$
\text{similar pairs} \rightarrow \text{closer}, \qquad
\text{different pairs} \rightarrow \text{farther}
$$

Loss function (the InfoNCE form):

$$
\mathcal{L} = -\log \frac{e^{\operatorname{sim}(x_i, x_j)}}{\sum_k e^{\operatorname{sim}(x_i, x_k)}}
$$
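This loss is short to implement; a minimal sketch with random features standing in for real encoder outputs:
```python
import torch
import torch.nn.functional as F

def info_nce(anchors, positives, temperature=0.07):
    a = F.normalize(anchors, dim=-1)
    p = F.normalize(positives, dim=-1)
    logits = a @ p.T / temperature           # cosine sim(x_i, x_k) for all pairs
    targets = torch.arange(len(a))           # row i's positive sits at column i
    return F.cross_entropy(logits, targets)  # -log softmax of the positive pair

loss = info_nce(torch.randn(8, 128), torch.randn(8, 128))
```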
---
# 🔬 How modern embeddings (LLMs) differ
Older:
* static embeddings (Word2Vec)
Modern:
* **contextual embeddings**
Example:
* “bank” (river vs finance) → different vectors
This is why models like BERT/GPT outperform Word2Vec.
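A sketch that makes this visible, assuming the Hugging Face `transformers` package and `bert-base-uncased`:
```python
import torch
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def word_vector(sentence, word):
    inputs = tok(sentence, return_tensors="pt")
    tokens = tok.convert_ids_to_tokens(inputs["input_ids"][0])
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]  # (seq_len, 768)
    return hidden[tokens.index(word)]

river = word_vector("she sat on the bank of the river", "bank")
money = word_vector("she deposited cash at the bank", "bank")
print(torch.cosine_similarity(river, money, dim=0))  # well below 1: context-dependent
```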
---
# 🧪 Practical training examples
## Example 1 — Train Word2Vec (Gensim)
```python
from gensim.models import Word2Vec

sentences = [["cat", "sat", "mat"], ["dog", "sat", "floor"]]
model = Word2Vec(sentences, vector_size=100, window=5, min_count=1)
vector = model.wv["cat"]  # 100-dimensional numpy array
```
---
## Example 2 — Train embeddings in PyTorch
```python
import torch
import torch.nn as nn

embedding = nn.Embedding(10000, 128)  # vocab size 10000, embedding dim 128
input_ids = torch.tensor([1, 5, 23])
vectors = embedding(input_ids)        # shape: (3, 128)
```
---
## Example 3 — Train contrastive embeddings
```python
import torch.nn.functional as F

# sketch: `model` is assumed to map a batch of texts to (batch, dim) tensors;
# text2 is semantically close to text1, text3 is unrelated
anchor = model(text1)
positive = model(text2)
negative = model(text3)
loss = F.triplet_margin_loss(anchor, positive, negative, margin=1.0)
```
---
## Example 4 — Reduce dimensionality with PCA
```python
from sklearn.decomposition import PCA

# X: (n_samples, n_dims) array of embeddings, e.g. 1536-dim vectors
pca = PCA(n_components=256)
X_reduced = pca.fit_transform(X)  # shape: (n_samples, 256)
```
---
# 📊 Types of embeddings
| Type | Example |
| ---------- | --------------- |
| Word | Word2Vec, GloVe |
| Sentence | SBERT |
| Document | Doc2Vec |
| Image | CLIP |
| Graph | Node2Vec |
| Multimodal | CLIP, Gemini |
---
# 🧩 Key properties you should care about (engineering perspective)
### 1. Dimensionality
* Typical: 128–1536
* Tradeoff: memory vs accuracy
---
### 2. Distance metric
* cosine similarity (most common)
* dot product
* Euclidean
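Side by side, as a numpy sketch with random example vectors:
```python
import numpy as np

a, b = np.random.rand(768), np.random.rand(768)  # two example vectors

dot = a @ b
cosine = dot / (np.linalg.norm(a) * np.linalg.norm(b))
euclidean = np.linalg.norm(a - b)
```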
---
### 3. Normalization
Critical for:
* search quality
* clustering
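It is a one-liner worth building into every pipeline; a sketch with an assumed embedding matrix:
```python
import numpy as np

X = np.random.rand(1000, 768)                     # example embedding matrix
X = X / np.linalg.norm(X, axis=1, keepdims=True)  # each row now has unit length,
                                                  # so dot product == cosine similarity
```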
---
### 4. Training data distribution
Embeddings are only as good as:
* corpus size
* domain relevance
---
# ⚠️ Common pitfalls (important)
### ❌ Mixing embedding spaces
* embeddings from different models are **not compatible**
---
### ❌ Assuming linear compression is harmless
* PCA can distort semantic relationships
---
### ❌ Ignoring normalization
* on unnormalized vectors, dot-product search gives rankings that disagree with cosine similarity
---
### ❌ Using embeddings without evaluation
Always test:
* retrieval accuracy
* clustering quality
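A minimal sketch of a retrieval check; `search(query, k)` and the labeled pairs are assumptions about your setup, not a fixed API:
```python
def recall_at_k(labeled_pairs, search, k=10):
    """labeled_pairs: (query, relevant_doc_id) tuples; search(query, k) -> list of ids."""
    hits = sum(doc_id in search(query, k) for query, doc_id in labeled_pairs)
    return hits / len(labeled_pairs)
```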
---
# 🧠 Mental model (most useful takeaway)
Think of embeddings as:
> A learned coordinate system where **meaning = position**
Training = learning that coordinate system so that:
* similar things cluster
* relationships become directions
---
[1]: https://www.ibm.com/think/topics/vector-embedding "What is Vector Embedding? | IBM"
[2]: https://www.ibm.com/think/topics/word-embeddings "What Are Word Embeddings? | IBM"
[3]: https://www.geeksforgeeks.org/nlp/word-embeddings-in-nlp/ "Word Embeddings in NLP"
[4]: https://medium.com/%40manansuri/a-dummys-guide-to-word2vec-456444f3c673 "A Dummy's Guide to Word2Vec - Medium"
[5]: https://medium.com/%40neri.vvo/word-embedding-a-powerful-tool-word2vec-glove-fasttext-dd6e2171d5 "Word Embedding Explained — Word2Vec GloVe, FastText"