
Practical breakdown of embeddings: how they work, how they're trained, and concrete examples you can actually implement.


🧠 What embeddings actually are (intuitively + formally)


At the core:

  • An embedding is a function \( f : X \rightarrow \mathbb{R}^d \)

  • It maps raw data (text, image, etc.) into a dense vector.

Key property:

  • Semantic structure is encoded as geometry

    • Similar things → vectors close together
    • Different things → far apart

This is the core idea behind vector search, RAG, clustering, etc. (ibm.com)
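
As a minimal sketch of "similar → close" (the vectors and values here are made up purely for illustration; real models produce hundreds to thousands of dimensions):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two vectors: 1 = same direction, 0 = orthogonal."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 4-d "embeddings"
cat = np.array([0.9, 0.1, 0.8, 0.2])
dog = np.array([0.8, 0.2, 0.7, 0.3])
car = np.array([0.1, 0.9, 0.2, 0.8])

print(cosine_similarity(cat, dog))  # high: semantically related
print(cosine_similarity(cat, car))  # low: unrelated
```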


⚙️ Why embeddings work (the underlying theory)

1. Distributional hypothesis

“You shall know a word by the company it keeps” (J.R. Firth)

  • Words appearing in similar contexts → similar vectors
  • This is the foundation of Word2Vec, GloVe, BERT, etc. (ibm.com)

2. Geometry encodes meaning

Classic example:

king - man + woman ≈ queen

This works because relationships become linear directions in vector space.
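
Assuming a Gensim Word2Vec model trained on a large corpus (the toy model trained in the examples below is far too small and doesn't contain these words), the analogy can be queried like this:

```python
# Assumes `model` is a trained gensim Word2Vec model whose vocabulary
# includes "king", "man", and "woman".
result = model.wv.most_similar(positive=["king", "woman"], negative=["man"], topn=1)
print(result)  # with enough training data this typically returns ("queen", score)
```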


3. Dense vs sparse representations

| Method | Properties |
| --- | --- |
| One-hot | huge, no meaning |
| TF-IDF | frequency only |
| Embeddings | compact + semantic |

Embeddings are low-dimensional but information-rich representations. (GeeksforGeeks)
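
A quick back-of-the-envelope comparison (the sizes are illustrative):

```python
import numpy as np

vocab_size = 50_000

# One-hot: one 50,000-d vector per word, exactly one nonzero entry,
# and no notion of similarity between words
one_hot_cat = np.zeros(vocab_size)
one_hot_cat[1234] = 1.0  # arbitrary index for "cat"

# Dense embedding: 300 floats, where distances between vectors carry meaning
dense_cat = np.random.randn(300)  # stand-in for a learned vector

print(one_hot_cat.shape, dense_cat.shape)  # (50000,) (300,)
```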


🏗️ How embeddings are trained (core methods)

1. Prediction-based models (most important)

Word2Vec (classic foundation)


Two main training strategies:

(a) Skip-gram

Predict context from a word:

\[ P(\text{context} \mid \text{word}) \]

(b) CBOW

Predict word from context:

\[ P(\text{word} \mid \text{context}) \]

Mechanism:

  • Input = one-hot vector
  • Hidden layer = embedding
  • Train via gradient descent

👉 The embedding vectors are literally the rows of the hidden-layer weight matrix (Medium)


Example (Skip-gram training loop)

```python
# pseudo-code: maximize P(context | word) over the corpus
for word in corpus:
    context = get_context_window(word)   # e.g. 5 words on each side
    loss = -log_P(context, word)         # -log P(context | word)
    update_weights(loss)                 # gradient step on the embedding weights
```
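
For a runnable (if toy-scale) version of the same loop, here is a minimal skip-gram sketch in PyTorch with a full-softmax output layer; the corpus, window size, and dimensions are illustrative choices, not anything prescribed above. Note that the trained `embedding.weight` matrix is exactly the "hidden-layer weights" mentioned earlier:

```python
import torch
import torch.nn as nn

corpus = "the cat sat on the mat the dog sat on the floor".split()
vocab = sorted(set(corpus))
idx = {w: i for i, w in enumerate(vocab)}

# (center, context) pairs with a window of 1 on each side
pairs = [(idx[corpus[i]], idx[corpus[j]])
         for i in range(len(corpus))
         for j in (i - 1, i + 1) if 0 <= j < len(corpus)]

emb_dim = 16
embedding = nn.Embedding(len(vocab), emb_dim)  # the hidden layer = the embeddings
output = nn.Linear(emb_dim, len(vocab))        # predicts the context word
optimizer = torch.optim.Adam(
    list(embedding.parameters()) + list(output.parameters()), lr=0.01)
loss_fn = nn.CrossEntropyLoss()                # = -log P(context | word) via softmax

centers = torch.tensor([c for c, _ in pairs])
contexts = torch.tensor([c for _, c in pairs])

for epoch in range(100):
    logits = output(embedding(centers))        # (n_pairs, vocab)
    loss = loss_fn(logits, contexts)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

word_vectors = embedding.weight.detach()       # the trained embedding matrix
```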

2. Matrix factorization (GloVe)

Instead of prediction:

  • Build co-occurrence matrix
  • Factorize it into lower dimensions

Captures global statistics, not just local context. (Medium)
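
GloVe's actual objective is a weighted least-squares factorization of log co-occurrence counts; as a simplified illustration of the same idea (not GloVe itself), here is a co-occurrence matrix factorized with truncated SVD:

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD

corpus = "the cat sat on the mat the dog sat on the floor".split()
vocab = sorted(set(corpus))
idx = {w: i for i, w in enumerate(vocab)}

# Symmetric co-occurrence counts within a +/-1 window
cooc = np.zeros((len(vocab), len(vocab)))
for i, w in enumerate(corpus):
    for j in (i - 1, i + 1):
        if 0 <= j < len(corpus):
            cooc[idx[w], idx[corpus[j]]] += 1

# Factorize log-counts into low-dimensional word vectors
# (GloVe uses a weighted least-squares objective instead of plain SVD)
svd = TruncatedSVD(n_components=4)
word_vectors = svd.fit_transform(np.log1p(cooc))  # shape: (vocab, 4)
```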


3. Neural embedding layers (modern approach)

Used in:

  • Transformers (BERT, GPT)
  • Recommender systems

Mechanism:

  • Embedding = lookup table
  • Trained jointly with the rest of the model

```python
embedding = torch.nn.Embedding(vocab_size, dim)  # trainable (vocab_size, dim) lookup table
vector = embedding(token_id)                     # indexing a row yields the token's vector
```

4. Contrastive learning (modern SOTA)

Used in:

  • sentence embeddings
  • CLIP (image-text)
  • OpenAI embeddings

Core idea:

\[ \text{similar pairs} \rightarrow \text{closer}, \qquad \text{dissimilar pairs} \rightarrow \text{farther apart} \]

Loss function (InfoNCE-style):

\[ \mathcal{L} = -\log \frac{e^{\mathrm{sim}(x_i, x_j)}}{\sum_k e^{\mathrm{sim}(x_i, x_k)}} \]
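
A hedged sketch of this loss in PyTorch, using in-batch negatives and the temperature parameter most implementations add (function and parameter names are illustrative):

```python
import torch
import torch.nn.functional as F

def info_nce_loss(anchors: torch.Tensor, positives: torch.Tensor,
                  temperature: float = 0.07) -> torch.Tensor:
    """anchors, positives: (batch, dim). Row i of `positives` is the positive
    for row i of `anchors`; every other row in the batch serves as a negative."""
    anchors = F.normalize(anchors, dim=-1)
    positives = F.normalize(positives, dim=-1)
    logits = anchors @ positives.T / temperature  # (batch, batch) similarity matrix
    targets = torch.arange(anchors.size(0))       # diagonal = matching pairs
    return F.cross_entropy(logits, targets)
```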


🔬 How modern embeddings (LLMs) differ

Older:

  • static embeddings (Word2Vec)

Modern:

  • contextual embeddings

Example:

  • “bank” (river vs finance) → different vectors

This is why models like BERT/GPT outperform Word2Vec.
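
A sketch of the "bank" example with the Hugging Face transformers library, assuming `bert-base-uncased` (where "bank" happens to be a single vocabulary token); the helper function is illustrative:

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def bank_vector(sentence: str) -> torch.Tensor:
    """Return the contextual vector for the token 'bank' in the sentence."""
    inputs = tokenizer(sentence, return_tensors="pt")
    position = inputs.input_ids[0].tolist().index(
        tokenizer.convert_tokens_to_ids("bank"))
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state  # (1, seq_len, 768)
    return hidden[0, position]

river = bank_vector("She sat on the bank of the river.")
money = bank_vector("He deposited cash at the bank.")
print(torch.cosine_similarity(river, money, dim=0))  # < 1: different vectors in context
```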


🧪 Practical training examples

Example 1 — Train Word2Vec (Gensim)

```python
from gensim.models import Word2Vec

# Toy corpus: each sentence is a pre-tokenized list of words
sentences = [["cat", "sat", "mat"], ["dog", "sat", "floor"]]

model = Word2Vec(sentences, vector_size=100, window=5, min_count=1)

vector = model.wv["cat"]  # 100-d numpy vector for "cat"
```

Example 2 — Train embeddings in PyTorch

```python
import torch
import torch.nn as nn

embedding = nn.Embedding(10000, 128)  # vocab size 10,000; 128-d vectors

input_ids = torch.tensor([1, 5, 23])
vectors = embedding(input_ids)        # shape: (3, 128)
```

Example 3 — Train contrastive embeddings

```python
# pseudo-code: triplet-style contrastive step
anchor = model(text1)
positive = model(text2)   # semantically similar to text1
negative = model(text3)   # unrelated to text1

loss = contrastive_loss(anchor, positive, negative)  # pull positive closer, push negative away
```

Example 4 — Reduce embedding dimensionality with PCA

```python
from sklearn.decomposition import PCA

# X: (n_samples, original_dim) array of embeddings, e.g. 1536-d vectors
pca = PCA(n_components=256)
X_reduced = pca.fit_transform(X)  # (n_samples, 256)
```
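
One sanity check worth adding after the fit: how much of the original variance the reduced space retains.

```python
print(pca.explained_variance_ratio_.sum())  # fraction of total variance kept in 256 dims
```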

📊 Types of embeddings

| Type | Example |
| --- | --- |
| Word | Word2Vec, GloVe |
| Sentence | SBERT |
| Document | Doc2Vec |
| Image | CLIP |
| Graph | Node2Vec |
| Multimodal | CLIP, Gemini |

🧩 Key properties you should care about (engineering perspective)

1. Dimensionality

  • Typical: 128–1536
  • Tradeoff: memory vs accuracy

2. Distance metric

  • cosine similarity (most common)
  • dot product
  • Euclidean
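
The three metrics side by side, as an illustrative numpy sketch:

```python
import numpy as np

a, b = np.array([1.0, 2.0, 3.0]), np.array([2.0, 3.0, 4.0])

cosine = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))  # angle only, scale-invariant
dot = np.dot(a, b)                                               # angle + magnitude
euclidean = np.linalg.norm(a - b)                                # straight-line distance
```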

3. Normalization

Critical for:

  • search quality
  • clustering
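
A minimal sketch of why this matters: many vector indexes rank by dot product, which only agrees with cosine similarity once vectors are L2-normalized.

```python
import numpy as np

def l2_normalize(v: np.ndarray) -> np.ndarray:
    return v / np.linalg.norm(v)

a = np.random.randn(128)
b = np.random.randn(128) * 10.0  # same kind of vector, much larger magnitude

# Unnormalized: dot product is dominated by magnitude, not meaning
print(np.dot(a, b))

# Normalized: dot product == cosine similarity, magnitude no longer matters
a_n, b_n = l2_normalize(a), l2_normalize(b)
print(np.dot(a_n, b_n))  # always in [-1, 1]
```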

4. Training data distribution

Embeddings are only as good as:

  • corpus size
  • domain relevance

⚠️ Common pitfalls (important)

Mixing embedding spaces

  • embeddings from different models are not compatible

Assuming linear compression is harmless

  • PCA can distort semantic relationships

Ignoring normalization

  • dot-product search stops matching cosine similarity without it

Using embeddings without evaluation

Always test:

  • retrieval accuracy
  • clustering quality

🧠 Mental model (most useful takeaway)

Think of embeddings as:

A learned coordinate system where meaning = position

Training = learning that coordinate system so that:

  • similar things cluster
  • relationships become directions