Google has introduced a new version of its Gemini Embedding model, now with multimodal embeddings!
This new model can natively process videos up to 2 minutes long, handle multiple PDF pages, and take in audio alongside text. It is available both in the free tier and via the paid API. The embeddings are structured like a nesting doll (Matryoshka-style): a shorter prefix of the full vector can be used on its own as a smaller embedding, although it is less precise.
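To make the nesting-doll idea concrete, here is a minimal sketch of how a shorter embedding is usually taken from a longer one: keep the first few values and rescale to unit length. The function names and the 8-dimensional toy vectors are made up for illustration; real vectors would come from the model and be much longer.

```python
import math


def truncate_embedding(vec, dims):
    """Keep the first `dims` values of `vec` and rescale them to unit length."""
    if not 0 < dims <= len(vec):
        raise ValueError("dims must be between 1 and len(vec)")
    head = vec[:dims]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        return list(head)  # nothing to rescale in an all-zero prefix
    return [x / norm for x in head]


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy vectors standing in for real model output (hypothetical values).
doc = [0.42, -0.31, 0.27, 0.55, -0.12, 0.08, 0.33, -0.46]
query = [0.40, -0.28, 0.30, 0.50, -0.05, 0.15, 0.25, -0.50]

full_score = cosine(doc, query)
short_score = cosine(truncate_embedding(doc, 4), truncate_embedding(query, 4))
print(f"full: {full_score:.4f}  truncated to 4 dims: {short_score:.4f}")
```

The shorter vectors are cheaper to store and compare, and the two scores will generally differ a little, which is the precision trade-off mentioned above.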
Unfortunately, Google's service prices have risen again. Text processing now costs about $0.20 per million tokens, while the price for multimodal data has increased significantly: for example, video processing now costs $12 per million tokens (approximately 15,000 frames). Google is taking advantage of the fact that it has few competitors in this segment, since other major companies have yet to ship comparably extensive updates. For instance, OpenAI last updated its embedding models in January 2024, in the same release that improved GPT-3.5 Turbo and GPT-4 Turbo.
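For a rough sense of what the prices quoted above mean in practice, here is a small back-of-the-envelope calculator. It uses only the figures from this post; the constant and function names are made up for illustration, and actual billing may differ.

```python
# Figures quoted in this post; treat them as approximate.
TEXT_USD_PER_MILLION_TOKENS = 0.20
VIDEO_USD_PER_MILLION_TOKENS = 12.00
FRAMES_PER_MILLION_TOKENS = 15_000  # "approximately 15,000 frames"


def text_cost(tokens):
    """Estimated USD cost of embedding `tokens` text tokens."""
    return tokens / 1_000_000 * TEXT_USD_PER_MILLION_TOKENS


def video_cost_for_frames(frames):
    """Estimated USD cost of embedding `frames` video frames."""
    return frames / FRAMES_PER_MILLION_TOKENS * VIDEO_USD_PER_MILLION_TOKENS


# Assumption for illustration only: a 2-minute clip sampled at 1 frame
# per second, i.e. 120 frames. The real sampling rate may be different.
clip_frames = 120
print(f"per frame:     ${video_cost_for_frames(1):.4f}")
print(f"2-minute clip: ${video_cost_for_frames(clip_frames):.3f}")
print(f"1M text tokens: ${text_cost(1_000_000):.2f}")
```

At these rates a single frame works out to about $0.0008, so video costs add up quickly on large collections.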
All of this matters because there are so few widely available alternatives on the market.
