feat: 🎉 Vectors to Vectors, the beginning

Translate 1536 <-> 768, 3072 <-> 2048
2026-04-11 18:05:05 +02:00
parent d98ea7c222
commit 4009a54e39
58 changed files with 5324 additions and 2 deletions

docs/what_is_embeddings.md Normal file

@@ -0,0 +1,355 @@
**Practical breakdown** of embeddings: how they work, how they're trained, and concrete examples you can actually implement.
---
# 🧠 What embeddings actually are (intuitively + formally)
![Image](images/a.jpg)
![Image](images/b.jpg)
![Image](images/c.jpg)
![Image](images/d.jpg)
![Image](images/e.jpg)
![Image](images/f.jpg)
At the core:
* An embedding is a **function** $f : X \rightarrow \mathbb{R}^d$
* It maps raw data (text, image, etc.) into a **dense vector**.
Key property:
* **Semantic structure is encoded as geometry**
* Similar things → vectors close together
* Different things → far apart
This is the core idea behind vector search, RAG, clustering, etc. ([ibm.com][1])
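A quick way to see this geometry in practice, as a minimal sketch assuming the `sentence-transformers` package and its `all-MiniLM-L6-v2` model (not mentioned above, just one convenient choice):
```python
from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer("all-MiniLM-L6-v2")  # 384-dim sentence embeddings
a, b, c = model.encode(["a cat sat on the mat",
                        "a kitten rested on the rug",
                        "quarterly revenue grew by 8%"])

def cos(x, y):
    return float(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))

print(cos(a, b))  # high: similar meaning, nearby vectors
print(cos(a, c))  # low: unrelated meaning, distant vectors
```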
---
# ⚙️ Why embeddings work (the underlying theory)
### 1. Distributional hypothesis
> “You shall know a word by the company it keeps” (J.R. Firth)
* Words appearing in similar contexts → similar vectors
* This is the **foundation of Word2Vec, GloVe, BERT, etc.** ([ibm.com][2])
---
### 2. Geometry encodes meaning
Classic example:
```
king - man + woman ≈ queen
```
This works because relationships become **linear directions in vector space**.
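You can reproduce the analogy with pretrained vectors; a sketch assuming gensim's downloader and its bundled `glove-wiki-gigaword-100` dataset:
```python
import gensim.downloader as api

wv = api.load("glove-wiki-gigaword-100")  # downloads pretrained GloVe vectors on first use
# king - man + woman: "queen" typically appears at or near the top
print(wv.most_similar(positive=["king", "woman"], negative=["man"], topn=3))
```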
---
### 3. Dense vs sparse representations
| Method     | Representation     |
| ---------- | ------------------ |
| One-hot    | huge, no meaning   |
| TF-IDF     | frequency only     |
| Embeddings | compact + semantic |
Embeddings are **low-dimensional but information-rich** representations. ([GeeksforGeeks][3])
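A toy illustration of why one-hot fails (the vocabulary size here is an assumption):
```python
import numpy as np

vocab_size = 50_000
cat = np.zeros(vocab_size); cat[101] = 1.0
dog = np.zeros(vocab_size); dog[202] = 1.0
print(cat @ dog)  # 0.0 — every distinct word pair looks equally unrelated
# A dense 300-dim embedding uses 0.6% of the memory and encodes graded similarity.
```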
---
# 🏗️ How embeddings are trained (core methods)
## 1. Prediction-based models (most important)
### Word2Vec (classic foundation)
![Image](images/a1.jpg)
![Image](images/b1.jpg)
![Image](images/c1.jpg)
![Image](images/d1.jpg)
![Image](images/e1.jpg)
![Image](images/f1.jpg)
Two main training strategies:
### (a) Skip-gram
Predict context from a word:

$$
P(\text{context} \mid \text{word})
$$

### (b) CBOW
Predict word from context:

$$
P(\text{word} \mid \text{context})
$$
Mechanism:
* Input = one-hot vector
* Hidden layer = embedding
* Train via gradient descent
👉 The embedding is literally the **weights of the hidden layer** ([Medium][4])
---
### Example (Skip-gram training loop)
```python
# pseudo-code: slide a context window over the corpus
for center_word in corpus:
    for context_word in get_context_window(center_word):
        loss = -log_prob(context_word, given=center_word)  # -log P(context | word)
        update_weights(loss)  # gradient step on the embedding matrices
```
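For comparison, a minimal runnable version of the same loop in PyTorch, using a full softmax for clarity (real Word2Vec uses negative sampling; the corpus and hyperparameters are illustrative):
```python
import torch
import torch.nn as nn

corpus = ["the cat sat on the mat".split(), "the dog sat on the floor".split()]
vocab = sorted({w for sent in corpus for w in sent})
idx = {w: i for i, w in enumerate(vocab)}

# build (center, context) pairs from a window of size 1
pairs = []
for sent in corpus:
    for i, w in enumerate(sent):
        for j in (i - 1, i + 1):
            if 0 <= j < len(sent):
                pairs.append((idx[w], idx[sent[j]]))

emb = nn.Embedding(len(vocab), 16)           # the embedding matrix we keep
out = nn.Linear(16, len(vocab), bias=False)  # output projection over the vocab
opt = torch.optim.Adam(list(emb.parameters()) + list(out.parameters()), lr=0.01)
loss_fn = nn.CrossEntropyLoss()              # cross-entropy = -log P(context | word)

centers = torch.tensor([c for c, _ in pairs])
contexts = torch.tensor([c for _, c in pairs])
for epoch in range(200):
    opt.zero_grad()
    loss = loss_fn(out(emb(centers)), contexts)
    loss.backward()
    opt.step()

cat_vector = emb.weight[idx["cat"]]          # trained embedding for "cat"
```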
---
## 2. Matrix factorization (GloVe)
Instead of prediction:
* Build co-occurrence matrix
* Factorize it into lower dimensions
Captures **global statistics**, not just local context. ([Medium][5])
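A sketch of the factorization idea, using truncated SVD on log counts as a simpler stand-in for GloVe's actual weighted least-squares objective (the counts below are made up):
```python
import numpy as np
from sklearn.decomposition import TruncatedSVD

# toy word-word co-occurrence counts for a 4-word vocabulary
cooc = np.array([
    [0, 4, 1, 0],
    [4, 0, 2, 1],
    [1, 2, 0, 3],
    [0, 1, 3, 0],
], dtype=float)

svd = TruncatedSVD(n_components=2)
word_vectors = svd.fit_transform(np.log1p(cooc))  # one dense 2-d vector per word
```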
---
## 3. Neural embedding layers (modern approach)
Used in:
* Transformers (BERT, GPT)
* Recommender systems
Mechanism:
* Embedding = **lookup table**
* Trained jointly with model
```python
import torch

embedding = torch.nn.Embedding(vocab_size, dim)  # learnable lookup table: one row per token
vector = embedding(token_id)                     # token_id: LongTensor of token indices
```
---
## 4. Contrastive learning (modern SOTA)
Used in:
* sentence embeddings
* CLIP (image-text)
* OpenAI embeddings
Core idea:

$$
\text{similar pairs} \rightarrow \text{closer}, \qquad
\text{different pairs} \rightarrow \text{farther}
$$

Loss function (the InfoNCE form):

$$
\mathcal{L} = -\log \frac{e^{\operatorname{sim}(x_i, x_j)}}{\sum_k e^{\operatorname{sim}(x_i, x_k)}}
$$
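This loss is short to implement; a minimal sketch with random features standing in for real encoder outputs:
```python
import torch
import torch.nn.functional as F

def info_nce(anchors, positives, temperature=0.07):
    a = F.normalize(anchors, dim=-1)
    p = F.normalize(positives, dim=-1)
    logits = a @ p.T / temperature           # cosine sim(x_i, x_k) for all pairs
    targets = torch.arange(len(a))           # row i's positive sits at column i
    return F.cross_entropy(logits, targets)  # -log softmax of the positive pair

loss = info_nce(torch.randn(8, 128), torch.randn(8, 128))
```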
---
# 🔬 How modern embeddings (LLMs) differ
Older:
* static embeddings (Word2Vec)
Modern:
* **contextual embeddings**
Example:
* “bank” (river vs finance) → different vectors
This is why models like BERT/GPT outperform Word2Vec.
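A sketch that makes this visible, assuming the Hugging Face `transformers` package and `bert-base-uncased`:
```python
import torch
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def word_vector(sentence, word):
    inputs = tok(sentence, return_tensors="pt")
    tokens = tok.convert_ids_to_tokens(inputs["input_ids"][0])
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]  # (seq_len, 768)
    return hidden[tokens.index(word)]

river = word_vector("she sat on the bank of the river", "bank")
money = word_vector("she deposited cash at the bank", "bank")
print(torch.cosine_similarity(river, money, dim=0))  # well below 1: context-dependent
```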
---
# 🧪 Practical training examples
## Example 1 — Train Word2Vec (Gensim)
```python
from gensim.models import Word2Vec

sentences = [["cat", "sat", "mat"], ["dog", "sat", "floor"]]
model = Word2Vec(sentences, vector_size=100, window=5, min_count=1)
vector = model.wv["cat"]  # 100-dimensional numpy array
```
---
## Example 2 — Train embeddings in PyTorch
```python
import torch
import torch.nn as nn

embedding = nn.Embedding(10000, 128)  # vocab size 10000, embedding dim 128
input_ids = torch.tensor([1, 5, 23])
vectors = embedding(input_ids)        # shape: (3, 128)
```
---
## Example 3 — Train contrastive embeddings
```python
import torch.nn.functional as F

# sketch: `model` is assumed to map a batch of texts to (batch, dim) tensors;
# text2 is semantically close to text1, text3 is unrelated
anchor = model(text1)
positive = model(text2)
negative = model(text3)
loss = F.triplet_margin_loss(anchor, positive, negative, margin=1.0)
```
---
## Example 4 — Reduce dimensionality with PCA
```python
from sklearn.decomposition import PCA

# X: (n_samples, n_dims) array of embeddings, e.g. 1536-dim vectors
pca = PCA(n_components=256)
X_reduced = pca.fit_transform(X)  # shape: (n_samples, 256)
```
---
# 📊 Types of embeddings
| Type | Example |
| ---------- | --------------- |
| Word | Word2Vec, GloVe |
| Sentence | SBERT |
| Document | Doc2Vec |
| Image | CLIP |
| Graph | Node2Vec |
| Multimodal | CLIP, Gemini |
---
# 🧩 Key properties you should care about (engineering perspective)
### 1. Dimensionality
* Typical: 128–1536
* Tradeoff: memory vs accuracy
---
### 2. Distance metric
* cosine similarity (most common)
* dot product
* Euclidean
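Side by side, as a numpy sketch with random example vectors:
```python
import numpy as np

a, b = np.random.rand(768), np.random.rand(768)  # two example vectors

dot = a @ b
cosine = dot / (np.linalg.norm(a) * np.linalg.norm(b))
euclidean = np.linalg.norm(a - b)
```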
---
### 3. Normalization
Critical for:
* search quality
* clustering
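It is a one-liner worth building into every pipeline; a sketch with an assumed embedding matrix:
```python
import numpy as np

X = np.random.rand(1000, 768)                     # example embedding matrix
X = X / np.linalg.norm(X, axis=1, keepdims=True)  # each row now has unit length,
                                                  # so dot product == cosine similarity
```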
---
### 4. Training data distribution
Embeddings are only as good as:
* corpus size
* domain relevance
---
# ⚠️ Common pitfalls (important)
### ❌ Mixing embedding spaces
* embeddings from different models are **not compatible**
---
### ❌ Assuming linear compression is harmless
* PCA can distort semantic relationships
---
### ❌ Ignoring normalization
* on unnormalized vectors, dot-product search gives rankings that disagree with cosine similarity
---
### ❌ Using embeddings without evaluation
Always test:
* retrieval accuracy
* clustering quality
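A minimal sketch of a retrieval check; `search(query, k)` and the labeled pairs are assumptions about your setup, not a fixed API:
```python
def recall_at_k(labeled_pairs, search, k=10):
    """labeled_pairs: (query, relevant_doc_id) tuples; search(query, k) -> list of ids."""
    hits = sum(doc_id in search(query, k) for query, doc_id in labeled_pairs)
    return hits / len(labeled_pairs)
```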
---
# 🧠 Mental model (most useful takeaway)
Think of embeddings as:
> A learned coordinate system where **meaning = position**
Training = learning that coordinate system so that:
* similar things cluster
* relationships become directions
---
[1]: https://www.ibm.com/think/topics/vector-embedding "What is Vector Embedding? | IBM"
[2]: https://www.ibm.com/think/topics/word-embeddings "What Are Word Embeddings? | IBM"
[3]: https://www.geeksforgeeks.org/nlp/word-embeddings-in-nlp/ "Word Embeddings in NLP"
[4]: https://medium.com/%40manansuri/a-dummys-guide-to-word2vec-456444f3c673 "A Dummy's Guide to Word2Vec - Medium"
[5]: https://medium.com/%40neri.vvo/word-embedding-a-powerful-tool-word2vec-glove-fasttext-dd6e2171d5 "Word Embedding Explained — Word2Vec GloVe, FastText"