Google has introduced Gemini Embedding 2, its first natively multimodal embedding model. What does this mean? A single query can now combine multiple data types, such as an image and text, which the model processes together, capturing semantic connections across formats. Support for more than one hundred languages is promised, making it suitable for global tasks.
What can it do? Here are the main features:
– Text input of up to 8,192 tokens;
– Up to six images in PNG or JPEG format;
– Videos up to two minutes long (MP4 and MOV formats);
– PDF documents of up to six pages;
– Audio, recognized directly without pre-transcription.
The default size of the vector representations is 3,072 dimensions, but with Matryoshka Representation Learning (MRL) it can be reduced to 1,536 or even 768. This saves memory and speeds up search, at a slight cost in accuracy.
The same concept of nested representations used in previous Google models has now been adapted for multimodal data. This approach helps better understand the relationships between different types of information.
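Nested representations work because MRL trains the model so that the leading dimensions carry the most information: a prefix of the full vector is itself a usable embedding. A minimal sketch of truncating and re-normalizing a vector (the values here are synthetic, not real model output):

```python
import math

def truncate_embedding(vec, dim):
    """Keep the first `dim` components of an MRL-style embedding and
    re-normalize to unit length so cosine similarity still works."""
    head = vec[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head]

# Toy stand-in for a 3,072-dimensional embedding (hypothetical values).
full = [math.sin(i * 0.1) for i in range(3072)]
short = truncate_embedding(full, 768)  # 768-dim unit vector
```

Downstream code is unchanged: the shorter vector plugs into the same similarity search, just with a quarter of the storage.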
The model is available via the Gemini API and the Vertex AI platform in public preview. Out of the box, it works with popular tools such as LangChain, LlamaIndex, Haystack, Weaviate, Qdrant, and ChromaDB.
There are example notebooks on GitHub demonstrating how to use Gemini and Vertex, as well as demos of multimodal semantic search for practical testing.
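At its core, the multimodal semantic search in those demos reduces to ranking stored embeddings by similarity to a query embedding, which is what vector databases like Qdrant or Weaviate do at scale. A self-contained sketch with hypothetical 3-dimensional vectors standing in for real model output:

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def search(query_vec, store, top_k=2):
    """Return the ids of the top_k items most similar to the query."""
    ranked = sorted(store, key=lambda item: cosine(query_vec, item["vec"]),
                    reverse=True)
    return [item["id"] for item in ranked[:top_k]]

# In practice these vectors would come from the embedding model, and
# the store would live in a vector DB; items can be any modality.
store = [
    {"id": "photo_of_cat.png", "vec": [0.9, 0.1, 0.0]},
    {"id": "report.pdf",       "vec": [0.1, 0.9, 0.2]},
    {"id": "clip.mp4",         "vec": [0.8, 0.2, 0.1]},
]
query = [1.0, 0.0, 0.0]  # e.g. the embedding of a short text query
print(search(query, store))  # → ['photo_of_cat.png', 'clip.mp4']
```

Because text, images, video, and audio all land in the same vector space, a text query can retrieve an image or a video clip with no extra machinery.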
This marks a significant step forward in artificial intelligence development and multimedia data processing.
