
Embeddings
How Do Machines Know Things Are Similar?
One of the most practical capabilities modern AI models develop during training is something called an embedding space. Embeddings allow machines to reason about similarity: which words, documents, images, or ideas are “close” to each other in meaning.
In this article, we’ll explain what embeddings are, how they form, and, most importantly, how they’re used in practice.
Forming Embedding Spaces
This section is a bit more technical. If you’re mainly interested in applications, feel free to skip ahead to the “Using Embeddings in Practice” section below.
At a high level, embedding spaces are learned compressions of higher-dimensional data. The goal of this compression is not to preserve everything, but to preserve what matters for a specific task.
A Simple Intuition: Compressing Space
Imagine the three-dimensional space we experience every day. Any point can be described by coordinates (x, y, z). There are many ways to compress this 3D space into just one dimension:
- Drop the y and z coordinates and keep only x
- Compute the distance from the origin at (0,0,0): sqrt(x^2 + y^2 + z^2)
Both approaches give us a single number, but they preserve very different information. In both cases, we lose a lot of detail about where the object actually is.
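A minimal sketch of the two compressions, using an arbitrary example point:

```python
import math

point = (3.0, 4.0, 12.0)

# Compression 1: keep only the x coordinate.
keep_x = point[0]  # 3.0

# Compression 2: distance from the origin.
distance = math.sqrt(point[0]**2 + point[1]**2 + point[2]**2)  # 13.0

print(keep_x, distance)
```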
Why Learning the Compression Matters
The key idea behind machine learning embeddings is that models learn how to compress data in a way that best supports their goal.
Consider an extreme example.
Suppose our task is to build a model that takes a point in 3D space and outputs 10× its distance from the origin. In this case, the second mapping
sqrt(x^2 + y^2 + z^2)
is perfect. We lose no information that matters for the task. The rest of the model simply multiplies the value by 10.
By contrast, dropping the y and z coordinates would lose crucial information. The model could no longer compute the correct distance unless y^2 + z^2 happened to be the same for every input. This illustrates an important point:
The “best” compression depends entirely on the task.
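To make this concrete, here is a small sketch with made-up points: the distance compression lets a trivial final step solve the task exactly, while keeping only x collapses two points with different targets into the same code.

```python
import math

def compress_keep_x(p):
    # Keep only the x coordinate.
    return p[0]

def compress_distance(p):
    # Distance from the origin.
    x, y, z = p
    return math.sqrt(x**2 + y**2 + z**2)

def rest_of_model(code):
    # The rest of the "model" only needs to multiply by 10.
    return 10 * code

points = [(3, 4, 12), (3, 0, 0)]
for p in points:
    true_target = 10 * compress_distance(p)
    print(rest_of_model(compress_distance(p)), true_target)  # always equal
    # compress_keep_x returns 3 for both points, so no function of it
    # can recover both targets (130 and 30).
```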
Learning an Optimal Compression for Language
Now let’s apply this idea to language models.
A classic language model’s task is to predict the next word y given the words x it has already seen. For example:
["My", "name", "is"] → ?
To do this, the model needs a numerical representation of words.
A Bad Compression
Imagine representing each word with a single number:
"My" → 100
"name" → 41
"is" → 5
This representation already loses a lot of meaning. Why is “My” twenty times “is”? What does that even represent? The model now has to somehow interpret meaning from a single dimension that clearly isn’t rich enough.
Starting With Full Information: One-Hot Encoding
To preserve all information initially, we can give each word its own dimension. For three words, that looks like this:
"My" → [1, 0, 0]
"name" → [0, 1, 0]
"is" → [0, 0, 1]
This is called a one-hot encoding. It preserves all of the information (we can always map back to the original word), but it’s extremely inefficient for computation and doesn’t express similarity at all.
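For illustration, a minimal sketch of building one-hot vectors for this tiny three-word vocabulary:

```python
import numpy as np

vocab = ["My", "name", "is"]
word_to_index = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    # A vector of zeros with a single 1 at the word's index.
    vec = np.zeros(len(vocab))
    vec[word_to_index[word]] = 1.0
    return vec

print(one_hot("name"))  # [0. 1. 0.]
```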
Learning the Embedding Space
The next step is where embeddings come in.
Instead of working directly with one-hot vectors, the model learns a function that maps them into a lower-dimensional embedding space with d dimensions per word. The value of d is chosen so that the model retains enough information to accurately predict the next word.
We won’t go into the training mechanics here, but the key result is this:
The embedding space preserves the information that matters for predicting future words.
Once words are embedded, the rest of the model operates on these dense vectors instead of the original one-hot representations.
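Mechanically, this mapping often amounts to a learned matrix: multiplying a one-hot vector by it simply selects one row, which becomes the word’s d-dimensional embedding. Here is a minimal sketch with a random (untrained) matrix, just to show the lookup:

```python
import numpy as np

vocab_size, d = 3, 2            # tiny numbers, purely for illustration
rng = np.random.default_rng(0)

# In a real model this matrix is learned during training;
# here it is random.
embedding_matrix = rng.normal(size=(vocab_size, d))

one_hot = np.array([0.0, 1.0, 0.0])   # "name"

# Multiplying a one-hot vector by the matrix selects one row...
embedded = one_hot @ embedding_matrix

# ...so in practice models skip the multiplication and just index the row.
assert np.allclose(embedded, embedding_matrix[1])
```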
Why Embeddings Feel Intuitive
Because the embedding space is optimized for language, its dimensions often align with concepts that feel intuitive to humans. Some dimensions may loosely capture ideas like:
- “animal-like”
- “technical”
- “emotional”
- “formal”
Words such as lion, elephant, and animal end up close together because they can often replace each other in similar sentence contexts, and the model will still predict similar next words.
If we change the number of dimensions d, the model may discover a different, but still optimal, representation by splitting or merging these abstract features.
Similar Meaning → Small Distance
Because the model is trained to preserve what matters for prediction, we can draw an important conclusion:
Words that are interchangeable in context end up close together in embedding space.
This is why embeddings are so powerful. Distance in the embedding space becomes a proxy for semantic similarity.
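A common way to measure this closeness is cosine similarity, which compares the angle between two vectors. A minimal sketch with made-up vectors (not real embeddings):

```python
import numpy as np

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Made-up 4-dimensional "embeddings", purely for illustration.
lion     = np.array([0.9, 0.1, 0.3, 0.0])
elephant = np.array([0.8, 0.2, 0.4, 0.1])
car      = np.array([0.1, 0.9, 0.0, 0.7])

print(cosine_similarity(lion, elephant))  # high  -> similar meaning
print(cosine_similarity(lion, car))       # lower -> dissimilar meaning
```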
Using Embeddings in Practice
So now we have a learned representation of meaning in a d-dimensional space. What can we do with it?
First, it’s important to note that embeddings aren’t limited to text. They can be learned for:
- Images
- Audio and video
- User behavior and preferences
- Code
- Even combinations of different data types
- ...
In some systems, embeddings from different sources are learned jointly, making it possible to compare, for example, text with images or users with content.
Similarity Search
The most common use of embeddings is similarity search.
If two items are close together in embedding space, they usually have similar meaning. This allows us to:
- Search documents using natural language queries
- Find related articles or products
- Match users with content they’re likely to enjoy
For example, a paragraph about animals will generally be closer to the word “lion” than to “car” in embedding space.
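Here is a minimal sketch of similarity search over two documents, using made-up embeddings; in a real system, both the query and the documents would be embedded by the same model:

```python
import numpy as np

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Made-up embeddings, purely for illustration.
documents = {
    "Lions and elephants roam the savanna.": np.array([0.9, 0.1, 0.2]),
    "How to change a car tire.":             np.array([0.1, 0.8, 0.3]),
}
query_embedding = np.array([0.85, 0.15, 0.25])  # e.g. the query "animals"

# Return the document whose embedding is closest to the query.
best = max(documents, key=lambda doc: cosine_similarity(query_embedding, documents[doc]))
print(best)  # the animal paragraph wins
```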
Where You See This Today
Many modern AI systems rely on embeddings, often invisibly:
- Website chatbots use embedding search to retrieve relevant information
- Recommendation systems suggest content based on embedding similarity
- Knowledge assistants search internal documents using semantic meaning
Even if you don’t write code, you’re likely interacting with embedding-powered systems every day, often without realizing it.

