Embedding models convert textual information into numerical representations, enabling semantic search across vast text collections. The need for embeddings arises from the complexity of words, whose meaning extends beyond their sequence of characters.
Representing words as raw character sequences in a neural network would impose significant computational overhead, since the network would have to decipher meaning from every permutation and combination of characters. We also want to capture the observation that words occurring in similar contexts tend to carry similar meanings, and should therefore receive similar embedding values. Vector embeddings address both challenges efficiently.
In short, embeddings are representations of data, such as words in a sentence, converted into numerical lists known as vectors.
Each vector is a sequence of numbers that encodes relationships between data points. Acting as multidimensional maps, vectors make it possible to measure similarity. Consider the words "tea" and "coffee": because they frequently occur in similar contexts, they receive similar vector representations. A vector of 10 numbers represents 10 distinct features, one per dimension.
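As a rough sketch of how that similarity is measured, the snippet below compares hypothetical 10-dimensional vectors for "tea", "coffee", and "laptop" using cosine similarity. The numbers are made up purely for illustration and do not come from any real model.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity: close to 1.0 for similar directions, near 0.0 for unrelated ones."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical 10-dimensional embeddings (illustrative values only).
tea    = np.array([0.8, 0.1, 0.6, 0.3, 0.9, 0.2, 0.4, 0.7, 0.1, 0.5])
coffee = np.array([0.7, 0.2, 0.5, 0.4, 0.8, 0.3, 0.5, 0.6, 0.2, 0.4])
laptop = np.array([0.1, 0.9, 0.2, 0.8, 0.1, 0.7, 0.9, 0.1, 0.8, 0.2])

print(cosine_similarity(tea, coffee))  # high: similar contexts, similar vectors
print(cosine_similarity(tea, laptop))  # lower: unrelated concepts
```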
Vectors can also encapsulate information from images and audio files, enabling processes such as Google's image similarity search.
There are many practical ways to generate and store embeddings, but it's essential to understand the underlying process: converting data into numerical representations, storing those representations in a database, and retrieving relevant information from them.
While a basic embedding can be created by one-hot encoding a vocabulary and feeding the resulting vectors into a neural network, this approach undermines the main advantage of embeddings: it doesn't allow meaningful semantic distances between two embeddings to be computed. For instance, with one-hot encoding, "cat" is exactly as far from "car" as it is from "dog", because every pair of distinct one-hot vectors is equally dissimilar. In reality, "cat" and "dog" are far more closely related, and a useful embedding should reflect that.
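To make that limitation concrete, here is a toy sketch with a three-word vocabulary: every pair of distinct one-hot vectors is orthogonal, so "cat" vs "dog" scores exactly the same as "cat" vs "car".

```python
import numpy as np

# Toy vocabulary; each word gets a one-hot vector (illustrative only).
vocab = ["cat", "dog", "car"]
one_hot = {word: np.eye(len(vocab))[i] for i, word in enumerate(vocab)}

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Every pair of distinct one-hot vectors is orthogonal, so both scores are 0.0,
# telling us nothing about which words are semantically closer.
print(cosine_similarity(one_hot["cat"], one_hot["dog"]))  # 0.0
print(cosine_similarity(one_hot["cat"], one_hot["car"]))  # 0.0
```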
Therefore, generating useful embeddings involves complex computation, prior knowledge learned from data, and sophisticated algorithms, unlike the simple, deterministic process of splitting text.
Thankfully, embedding models have already been trained on vast amounts of data; they automatically produce n-dimensional embeddings, with the number of dimensions fixed by the model itself, which greatly simplifies the task for us.
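In practice, this usually means calling a pre-trained model rather than building one yourself. As one illustrative option among many, the sentence-transformers library exposes pre-trained models that turn whole sentences into fixed-length vectors; the model name below is simply one commonly used small model, not a recommendation tied to this text.

```python
from sentence_transformers import SentenceTransformer

# "all-MiniLM-L6-v2" is one commonly used small model; any sentence-embedding
# model from the library would be called the same way.
model = SentenceTransformer("all-MiniLM-L6-v2")

sentences = [
    "I like a cup of tea in the morning.",
    "Coffee is my favourite morning drink.",
    "The laptop needs a new battery.",
]

embeddings = model.encode(sentences)  # one fixed-length vector per sentence
print(embeddings.shape)               # e.g. (3, 384) for this particular model
```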
Once an embedding vector has been created, it can be stored within a vector database, also called a vector store.
These stores offer versatile functionality: search, where results are ranked by relevance to a query; clustering, which groups text strings by similarity; and classification, which assigns text strings to the most fitting labels.
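To make the storage-and-search step concrete, here is a minimal in-memory sketch of a vector store. It is not a substitute for a real vector database, which adds indexing, persistence, and metadata filtering on top of the same idea, but it shows the core mechanism: keep each text alongside its vector and rank stored entries by cosine similarity to a query vector. The `embed` function mentioned in the usage comments is a hypothetical stand-in for whichever embedding model you use.

```python
import numpy as np

class TinyVectorStore:
    """Minimal in-memory vector store: add (text, vector) pairs, search by cosine similarity."""

    def __init__(self):
        self.texts: list[str] = []
        self.vectors: list[np.ndarray] = []

    def add(self, text: str, vector: np.ndarray) -> None:
        # Normalise once so search reduces to a dot product.
        self.texts.append(text)
        self.vectors.append(vector / np.linalg.norm(vector))

    def search(self, query_vector: np.ndarray, k: int = 3) -> list[tuple[str, float]]:
        q = query_vector / np.linalg.norm(query_vector)
        scores = np.array([float(np.dot(q, v)) for v in self.vectors])
        top = np.argsort(scores)[::-1][:k]          # indices of the k best matches
        return [(self.texts[i], float(scores[i])) for i in top]

# Usage (assuming `embed(text)` returns an embedding vector for a string):
# store = TinyVectorStore()
# store.add("Tea is a popular hot drink.", embed("Tea is a popular hot drink."))
# results = store.search(embed("What do people drink in the morning?"), k=2)
```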