feat: 🎉 Vectors and Vectors, the beginning
Translate 1536 <-> 768, 3072 <-> 2048
BIN  docs/images/a.jpg   (new file, 9.5 KiB)
BIN  docs/images/a1.jpg  (new file, 21 KiB)
BIN  docs/images/b.jpg   (new file, 69 KiB)
BIN  docs/images/b1.jpg  (new file, 36 KiB)
BIN  docs/images/c.jpg   (new file, 153 KiB)
BIN  docs/images/c1.jpg  (new file, 302 KiB)
BIN  docs/images/d.jpg   (new file, 176 KiB)
BIN  docs/images/d1.jpeg (new file, 54 KiB)
BIN  docs/images/e.jpg   (new file, 17 KiB)
BIN  docs/images/e1.jpg  (new file, 26 KiB)
BIN  docs/images/f.jpg   (new file, 88 KiB)
BIN  docs/images/f1.jpg  (new file, 19 KiB)
docs/what_is_embeddings.md (new file, +355 lines)
@@ -0,0 +1,355 @@
A **practical breakdown** of embeddings: how they work, how they are trained, and concrete examples you can actually implement.

---

# 🧠 What embeddings actually are (intuitively + formally)

![](images/a.jpg)

![](images/a1.jpg)

![](images/b.jpg)

![](images/b1.jpg)

![](images/c.jpg)

![](images/c1.jpg)
At the core:

* An embedding is a **function**

$$
f(x) \rightarrow \mathbb{R}^d
$$

* It maps raw data (text, images, etc.) into a **dense vector**.

Key property:

* **Semantic structure is encoded as geometry**

* Similar things → vectors close together
* Different things → far apart

This is the core idea behind vector search, RAG, clustering, etc. ([ibm.com][1])

---
# ⚙️ Why embeddings work (the underlying theory)

### 1. Distributional hypothesis

> “You shall know a word by the company it keeps.”

* Words appearing in similar contexts → similar vectors
* This is the **foundation of Word2Vec, GloVe, BERT, etc.** ([ibm.com][2])

---

### 2. Geometry encodes meaning

Classic example:

```
king - man + woman ≈ queen
```

This works because relationships become **linear directions in vector space**.
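
The analogy can be checked mechanically with a few hand-picked toy vectors (illustrative only; real embeddings are learned and have hundreds of dimensions):

```python
import numpy as np

# Toy, hand-crafted 3-D "embeddings" chosen so the analogy holds.
vecs = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "queen": np.array([0.9, 0.1, 0.8]),
    "man":   np.array([0.1, 0.9, 0.1]),
    "woman": np.array([0.1, 0.2, 0.8]),
}

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# "king - man + woman" lands closest to "queen" in this space.
target = vecs["king"] - vecs["man"] + vecs["woman"]
best = max(vecs, key=lambda w: cosine(vecs[w], target))
print(best)  # queen
```

The same query against a trained model (e.g. Gensim's `most_similar`) works because the learned space exhibits these linear directions at scale.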
---

### 3. Dense vs sparse representations

| Method     | Problem            |
| ---------- | ------------------ |
| One-hot    | huge, no meaning   |
| TF-IDF     | frequency only     |
| Embeddings | compact + semantic |

Embeddings are **low-dimensional but information-rich** representations. ([GeeksforGeeks][3])
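
The size gap is easy to see in code (vocabulary size and dimension below are illustrative):

```python
import numpy as np

vocab_size, dim = 50_000, 128

# One-hot: one 50,000-wide vector per token, a single 1 and 49,999 zeros.
one_hot = np.zeros(vocab_size)
one_hot[42] = 1.0

# Dense embedding: 128 learned floats (random here, purely illustrative).
dense = np.random.default_rng(0).normal(size=dim)

print(one_hot.size, dense.size)  # 50000 128
```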
---

# 🏗️ How embeddings are trained (core methods)

## 1. Prediction-based models (most important)

### Word2Vec (classic foundation)

![](images/d.jpg)

![](images/d1.jpeg)

![](images/e.jpg)

![](images/e1.jpg)

![](images/f.jpg)

![](images/f1.jpg)

Two main training strategies:
### (a) Skip-gram

Predict context from a word:

$$
P(\text{context} \mid \text{word})
$$

### (b) CBOW

Predict a word from its context:

$$
P(\text{word} \mid \text{context})
$$

Mechanism:

* Input = one-hot vector
* Hidden layer = embedding
* Train via gradient descent

👉 The embedding is literally the **weight matrix of the hidden layer** ([Medium][4])
---

### Example (Skip-gram training loop)

```python
# pseudo-code: one pass of skip-gram training
for word in corpus:
    context = get_context_window(word)       # neighbouring words
    loss = -log_probability(context, word)   # -log P(context | word)
    update_weights(loss)                     # gradient step
```
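
For intuition, here is a minimal runnable skip-gram trainer with a full softmax (fine for a toy vocabulary; real implementations use negative sampling). All names and hyperparameters are illustrative, not from any particular library:

```python
import numpy as np

corpus = "the cat sat on the mat the dog sat on the floor".split()
vocab = sorted(set(corpus))
idx = {w: i for i, w in enumerate(vocab)}
V, D, window, lr = len(vocab), 8, 2, 0.05

rng = np.random.default_rng(0)
W_in = rng.normal(scale=0.1, size=(V, D))   # input embeddings (the ones we keep)
W_out = rng.normal(scale=0.1, size=(D, V))  # output (context-prediction) weights

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

for epoch in range(200):
    for pos, word in enumerate(corpus):
        for off in range(-window, window + 1):
            ctx = pos + off
            if off == 0 or ctx < 0 or ctx >= len(corpus):
                continue
            i, j = idx[word], idx[corpus[ctx]]
            h = W_in[i]                 # hidden layer = the embedding row
            p = softmax(h @ W_out)      # P(context | word)
            grad = p.copy()
            grad[j] -= 1.0              # d(-log p_j) / d(logits)
            W_in[i] -= lr * (W_out @ grad)
            W_out -= lr * np.outer(h, grad)

vec_cat = W_in[idx["cat"]]
print(vec_cat.shape)  # (8,)
```

Note that the trained embedding of each word really is just a row of the hidden-layer weight matrix `W_in`, exactly as described above.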
---

## 2. Matrix factorization (GloVe)

Instead of prediction:

* Build a co-occurrence matrix
* Factorize it into lower dimensions

This captures **global statistics**, not just local context. ([Medium][5])
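
The count-then-factorize idea can be sketched with a truncated SVD (GloVe itself fits a weighted least-squares objective; plain SVD on log counts is a simpler stand-in):

```python
import numpy as np

corpus = "the cat sat on the mat the dog sat on the floor".split()
vocab = sorted(set(corpus))
idx = {w: i for i, w in enumerate(vocab)}
V, window = len(vocab), 2

# Step 1: word-word co-occurrence counts within a small window.
C = np.zeros((V, V))
for pos, word in enumerate(corpus):
    for off in range(-window, window + 1):
        ctx = pos + off
        if off != 0 and 0 <= ctx < len(corpus):
            C[idx[word], idx[corpus[ctx]]] += 1.0

# Step 2: factorize -- keep the top-k singular directions as embeddings.
U, S, _ = np.linalg.svd(np.log1p(C))
k = 4
embeddings = U[:, :k] * S[:k]
print(embeddings.shape)  # (7, 4)
```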
---

## 3. Neural embedding layers (modern approach)

Used in:

* Transformers (BERT, GPT)
* Recommender systems

Mechanism:

* Embedding = **lookup table**
* Trained jointly with the rest of the model

```python
import torch

vocab_size, dim = 10_000, 128
embedding = torch.nn.Embedding(vocab_size, dim)  # one learnable row per token

token_id = torch.tensor([42])
vector = embedding(token_id)                     # differentiable row lookup
```
---

## 4. Contrastive learning (modern SOTA)

Used in:

* sentence embeddings
* CLIP (image-text)
* OpenAI embeddings

Core idea:

$$
\text{similar pairs} \rightarrow \text{closer}
$$

$$
\text{different pairs} \rightarrow \text{farther}
$$

Loss function:

$$
\mathcal{L} = -\log \frac{e^{\mathrm{sim}(x_i, x_j)}}{\sum_k e^{\mathrm{sim}(x_i, x_k)}}
$$
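
This loss (commonly called InfoNCE) can be implemented in a few lines of NumPy; the function name, temperature value, and random data below are illustrative:

```python
import numpy as np

def l2_normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def info_nce(anchors, candidates, pos_index, temperature=0.07):
    a = l2_normalize(anchors)            # (n, d)
    c = l2_normalize(candidates)         # (m, d)
    sims = a @ c.T / temperature         # cosine-similarity logits, (n, m)
    sims = sims - sims.max(axis=1, keepdims=True)   # numerical stability
    log_probs = sims - np.log(np.exp(sims).sum(axis=1, keepdims=True))
    # Average -log P(positive | anchor) over the batch.
    return -log_probs[np.arange(len(a)), pos_index].mean()

rng = np.random.default_rng(0)
anchors = rng.normal(size=(4, 16))
candidates = rng.normal(size=(8, 16))
loss = info_nce(anchors, candidates, pos_index=np.array([0, 1, 2, 3]))
print(loss > 0)  # True
```

Minimizing this pulls each anchor toward its positive and pushes it away from all other candidates in the batch.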
---

# 🔬 How modern embeddings (LLMs) differ

Older:

* static embeddings (Word2Vec): one fixed vector per word

Modern:

* **contextual embeddings**: the vector depends on the surrounding sentence

Example:

* “bank” (river vs. finance) → different vectors

This is why models like BERT/GPT outperform Word2Vec on meaning-sensitive tasks.

---

# 🧪 Practical training examples
## Example 1 — Train Word2Vec (Gensim)

```python
from gensim.models import Word2Vec

# Toy corpus: each sentence is a list of tokens
sentences = [["cat", "sat", "mat"], ["dog", "sat", "floor"]]

model = Word2Vec(sentences, vector_size=100, window=5, min_count=1)

vector = model.wv["cat"]  # 100-dimensional numpy array
---

## Example 2 — Train embeddings in PyTorch

```python
import torch
import torch.nn as nn

embedding = nn.Embedding(10000, 128)  # vocab size, embedding dim

input_ids = torch.tensor([1, 5, 23])
vectors = embedding(input_ids)        # shape: (3, 128)
```
---

## Example 3 — Train contrastive embeddings

```python
# pseudo-code: one triplet step
anchor = model(text1)     # embedding of the query
positive = model(text2)   # semantically similar text
negative = model(text3)   # unrelated text

loss = contrastive_loss(anchor, positive, negative)
```
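
A runnable stand-in for `contrastive_loss` is the triplet margin loss, one common contrastive objective (vectors and margin below are illustrative):

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=1.0):
    # Penalize unless the positive is at least `margin` closer
    # to the anchor than the negative is.
    d_pos = np.linalg.norm(anchor - positive)
    d_neg = np.linalg.norm(anchor - negative)
    return max(0.0, d_pos - d_neg + margin)

anchor   = np.array([1.0, 0.0])
positive = np.array([0.9, 0.1])   # close to the anchor
negative = np.array([-1.0, 0.0])  # far from the anchor

print(triplet_loss(anchor, positive, negative))  # 0.0 (triplet already satisfied)
```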
---

## Example 4 — PCA reduction

```python
from sklearn.decomposition import PCA

pca = PCA(n_components=256)        # target dimensionality
X_reduced = pca.fit_transform(X)   # X: (n_samples, original_dim)
```
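
Before trusting a reduction, it is worth checking how much variance survives; a self-contained sketch with synthetic data standing in for real embeddings:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 64))   # 500 "embeddings", 64-dim (synthetic)

pca = PCA(n_components=16)
X_reduced = pca.fit_transform(X)

# Fraction of total variance kept by the 16 components.
retained = pca.explained_variance_ratio_.sum()
print(X_reduced.shape, round(retained, 2))
```

A low retained fraction is an early warning that the compressed vectors may rank neighbors differently than the originals.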
---

# 📊 Types of embeddings

| Type       | Example         |
| ---------- | --------------- |
| Word       | Word2Vec, GloVe |
| Sentence   | SBERT           |
| Document   | Doc2Vec         |
| Image      | CLIP            |
| Graph      | Node2Vec        |
| Multimodal | CLIP, Gemini    |

---

# 🧩 Key properties you should care about (engineering perspective)

### 1. Dimensionality

* Typical: 128–1536
* Tradeoff: memory vs. accuracy
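
The memory side of the tradeoff is simple arithmetic, assuming float32 storage (4 bytes per dimension); the helper name and corpus size are illustrative:

```python
# Back-of-the-envelope size of a flat embedding index.
def index_size_gib(n_vectors, dim, bytes_per_float=4):
    return n_vectors * dim * bytes_per_float / 1024**3

# 10M vectors at 1536 dims vs. the same corpus at 256 dims.
print(round(index_size_gib(10_000_000, 1536), 1))  # ≈ 57.2 GiB
print(round(index_size_gib(10_000_000, 256), 1))   # ≈ 9.5 GiB
```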
---

### 2. Distance metric

* cosine similarity (most common)
* dot product
* Euclidean distance
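
All three metrics on the same pair of vectors, to make the differences concrete:

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])   # same direction as a, twice the length

cosine = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
dot = a @ b
euclidean = np.linalg.norm(a - b)

print(round(cosine, 3), dot, round(euclidean, 3))  # 1.0 28.0 3.742
```

Note the disagreement: cosine says the vectors are identical in direction, while dot product and Euclidean distance are sensitive to magnitude.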
---

### 3. Normalization

Critical for:

* search quality
* clustering
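
A quick sketch of why: after L2 normalization, dot product and cosine similarity coincide, which is why many vector indexes expect unit-length vectors:

```python
import numpy as np

def l2_normalize(x):
    return x / np.linalg.norm(x)

a = l2_normalize(np.array([3.0, 4.0]))
b = l2_normalize(np.array([4.0, 3.0]))

cosine = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
print(np.isclose(a @ b, cosine))  # True: dot == cosine on unit vectors
```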
---

### 4. Training data distribution

Embeddings are only as good as:

* corpus size
* domain relevance

---

# ⚠️ Common pitfalls (important)

### ❌ Mixing embedding spaces

* Embeddings from different models are **not compatible**: a vector only has meaning relative to the model that produced it

---

### ❌ Assuming linear compression is harmless

* PCA can distort semantic relationships; measure retrieval quality before and after reducing

---

### ❌ Ignoring normalization

* dot-product search only matches cosine ranking when vectors are unit length

---

### ❌ Using embeddings without evaluation

Always test:

* retrieval accuracy
* clustering quality

---
# 🧠 Mental model (most useful takeaway)

Think of embeddings as:

> A learned coordinate system where **meaning = position**

Training = learning that coordinate system so that:

* similar things cluster
* relationships become directions

---

[1]: https://www.ibm.com/think/topics/vector-embedding?utm_source=chatgpt.com "What is Vector Embedding? | IBM"
[2]: https://www.ibm.com/think/topics/word-embeddings?utm_source=chatgpt.com "What Are Word Embeddings? | IBM"
[3]: https://www.geeksforgeeks.org/nlp/word-embeddings-in-nlp/?utm_source=chatgpt.com "Word Embeddings in NLP"
[4]: https://medium.com/%40manansuri/a-dummys-guide-to-word2vec-456444f3c673?utm_source=chatgpt.com "A Dummy's Guide to Word2Vec - Medium"
[5]: https://medium.com/%40neri.vvo/word-embedding-a-powerful-tool-word2vec-glove-fasttext-dd6e2171d5?utm_source=chatgpt.com "Word Embedding Explained — Word2Vec GloVe, FastText"