@GoogleDevelopers

Check out all the AI videos at Google I/O 2024 → https://goo.gle/io24-ai-yt

@jprak123asd

I wanted to extend my heartfelt thanks for the excellent session on how Retrieval-Augmented Generation (RAG) can be used to ground Large Language Models (LLMs) and build expert systems in the retail, software, automotive, and other sectors.

Your explanation was incredibly clear and insightful, making a complex topic easily understandable. I truly felt like Dr. Watson listening to Sherlock Holmes unravel the mysteries of the universe, marveling at the clarity and depth of the information presented.

Your efforts in breaking down the concepts and applications of RAG in such a straightforward manner have left me feeling both enlightened and excited about the potential this technology holds for our industry.

Thank you once again for your time and for sharing your expertise. I look forward to exploring and implementing these innovative solutions in our own projects.

@sarvariabhinav

WHERE IS THE SAMPLE CODE??????? It's very frustrating to showcase this but not share the code.

@charlesbabbage6786

Couldn't find the exact notebook used here.

@thyagarajesh184

Impressive technology. Look forward to using it for my project.

@dumbol8126

Will there be an open-source version of this, or at least a paper?

@nestorbao2108

Why do you use a multimodal embedding model if you summarize images and ground them in text?

@AGunzOw

How do you handle terabytes of enterprise data? Just embedding groups? Should you generate sub-questions first? And how do you handle a large number of users?

@mariaescobar8003

When I use RAG, am I sharing my data with the model/company? Or is it kept private at an extra cost?

@GAURAVKUMARSVNIT

Thanks for such an informative presentation.

@wolpumba4099

AI Summary

Abstract:

This session, presented by Jeff (Developer Advocate, Google Cloud) and Shela (Engineer, Google), introduces multimodal retrieval-augmented generation (RAG). It addresses the limitations of standard large language models (LLMs) which often lack specific, private, or up-to-date knowledge, leading to generic or incorrect answers, illustrated with a car dashboard warning light example. The presentation explains the core components of RAG: vector embeddings (including multimodal embeddings for text, images, etc.), vector search for efficient retrieval from external knowledge bases, and using LLMs like Gemini to augment responses with retrieved context. Two common multimodal RAG architectures are discussed. A detailed live demo showcases building a RAG pipeline using Gemini 1.5 Pro, Vertex AI Vector Store, and LangChain to query a fictional car's PDF owner's manual using both text and image inputs, demonstrating how RAG grounds LLM answers in specific, first-party data to provide accurate, context-aware responses. The session concludes with potential applications across various industries and resources for getting started.
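
To make the moving parts concrete before the outline below, here is a minimal sketch of the three RAG components the abstract mentions. It is not the session's code: `embed()` and `generate()` are placeholders for whatever embedding model and LLM (e.g. Gemini) you actually call, and retrieval is brute-force cosine similarity rather than the ANN index a production vector store would use.

```python
# Conceptual sketch only: embeddings + vector search + augment-and-generate.
import numpy as np

def embed(text: str) -> np.ndarray:
    """Placeholder: return the embedding vector for a piece of text."""
    raise NotImplementedError

def generate(prompt: str) -> str:
    """Placeholder: send a prompt to an LLM (e.g. Gemini) and return its answer."""
    raise NotImplementedError

def retrieve(query: str, chunks: list[str], chunk_vecs: np.ndarray, k: int = 3) -> list[str]:
    """Brute-force vector search: rank stored chunks by cosine similarity to the query."""
    q = embed(query)
    sims = chunk_vecs @ q / (np.linalg.norm(chunk_vecs, axis=1) * np.linalg.norm(q))
    return [chunks[i] for i in np.argsort(sims)[::-1][:k]]

def answer(query: str, chunks: list[str], chunk_vecs: np.ndarray) -> str:
    """Augment & generate: ground the LLM's answer in the retrieved context."""
    context = "\n\n".join(retrieve(query, chunks, chunk_vecs))
    prompt = f"Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: {query}"
    return generate(prompt)
```

A real deployment swaps the brute-force scan for an approximate nearest neighbor index (as the talk notes around 00:08:00), but the flow is the same.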

Multimodal RAG: Grounding LLM Responses in Your Data

*   00:00:10 Introduction: Jeff and Shela from Google introduce the topic of multimodal retrieval-augmented generation (RAG).
*   00:00:31 The Problem with Standard LLMs: LLMs often lack specific, real-time, or private context (e.g., your car's specific manual or history), leading to generic or incorrect answers when faced with specialized queries like identifying dashboard warning lights. This is the "knowledge gap."
*   00:01:51 RAG as a Solution: RAG enhances LLMs by retrieving relevant information on demand from external, private data sources (like manuals, codebases, documents) and augmenting the LLM's knowledge base to generate tailored, grounded responses. It differs from fine-tuning.
*   00:04:21 Core RAG Components: RAG consists of three main parts:
    *   Vector Embeddings: Numerical representations of data (text, images, etc.) capturing semantic meaning.
    *   Vector Search: Efficiently retrieving relevant embeddings (and thus data) from a database based on query similarity.
    *   Augment & Generate: Using an LLM (like Gemini) to synthesize the retrieved information with its base knowledge for a final response.
*   00:05:01 Embeddings Explained: Embeddings act as a universal translator, turning unstructured data (text, images, audio, video, code) into numerical vectors that machines can understand, capturing context and relationships. Multimodal embeddings allow different data types to exist in the same semantic space.
*   00:08:00 Vector Search Explained: Vector search finds relevant information by comparing embedding similarity, going beyond simple keyword matching. Common approaches include brute force (exact but slow) and Approximate Nearest Neighbor (ANN) search (fast, scalable, uses indexing).
*   00:10:32 Augmentation & Generation with LLMs: After retrieving relevant data chunks via vector search, the LLM (e.g., Gemini) takes the user's query and the retrieved context to generate a synthesized, coherent, and accurate response grounded in the provided facts.
*   00:11:47 RAG Architecture Overview: The process involves turning a user query into an embedding, searching a vector database of private data embeddings, retrieving the top matches (original data chunks/images), and feeding both the query and retrieved context to the LLM for generation.
*   00:13:12 Multimodal RAG Architectures: Two common patterns:
    1.  Summarize all multimodal data into text, then embed and retrieve based on text summaries (risks some info loss).
    2.  Use true multimodal embeddings for all data types, allowing direct cross-modal retrieval (potentially higher accuracy, requires capable embedding models). The demo uses Approach 1.
*   00:15:21 Live Demo Introduction: Demonstrating RAG using Gemini 1.5 Pro and Vertex AI to query a fictional 2024 "Cymbal Starlight" car manual PDF, answering questions an unassisted LLM couldn't.
*   00:16:41 Demo: Pre-processing: The pipeline starts by taking the PDF, splitting it into text chunks, extracting images and tables.
*   00:17:16 Demo: Data Preparation & Summarization: Using a Colab notebook, the code extracts text, images, and tables. Gemini 1.5 Pro is then used to generate concise text summaries of text chunks/tables and detailed descriptions of the images.
*   00:21:50 Demo: Vector Store Setup: Using Vertex AI Vector Store to house the embeddings, including setting up the index and deployment endpoint.
*   00:22:51 Demo: Embedding & Loading: Using LangChain and the Gemini embedding model, text chunks and image summaries are converted into embeddings and streamed into the Vertex AI Vector Store. A document store keeps track of the original data linked by ID.
*   00:24:29 Demo: Q&A Pipeline (RAG Chain): Constructing the RAG chain using LangChain: it takes a query, retrieves relevant documents/images via the vector store, formats a prompt (instructing Gemini to act as an automotive expert), and sends the query + retrieved context to Gemini 1.5 Pro for a grounded answer (see the sketch after this list).
*   00:27:03 Demo: Text Query Example: Asking "How many miles until oil change?" triggers RAG, retrieving manual sections. Gemini synthesizes the answer: 5,000 miles or 6 months, grounded in the manual.
*   00:27:46 Demo: Image Query Example: Uploading an image of a dashboard light (TPMS) and asking "What does this light mean?". RAG retrieves relevant text and the matching image from the manual. Gemini identifies it as the low tire pressure warning.
*   00:29:48 Demo: Follow-up Query: Asking "What should the tire pressure be?" RAG retrieves the specs, and Gemini provides the correct PSI values (35 front / 38 rear) based on the manual.
*   00:31:02 Summary & Benefits Recap: RAG significantly enhances LLM quality and relevance by connecting them to external, specific knowledge sources.
*   00:31:32 Broader Applications: RAG isn't limited to cars; examples include tech (code migration using diagrams, docs, chat logs), retail (image-based product search), and media (mood-based movie recommendations).
*   00:33:14 Resources: Links provided for getting started with Gemini API, Vertex AI, and relevant GitHub code sample repositories.
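
The demo steps above (00:16:41 onwards) follow the "summarize, then embed" pattern (approach 1). Here is a hedged sketch of that flow, reusing the `embed()` and `generate()` placeholders from the sketch above; `describe_image()`, `add_document()`, and the prompt wording are assumptions for illustration, not the notebook's actual code. In the real demo the embeddings live in a Vertex AI Vector Store and the chain is wired up with LangChain.

```python
# Sketch of approach 1: embed text summaries, but hand the ORIGINAL content to the LLM.
import uuid
import numpy as np

doc_store: dict[str, object] = {}   # doc_id -> original text chunk, table, or image bytes
summaries: list[str] = []           # the text summaries that actually get embedded
summary_ids: list[str] = []         # parallel list linking each summary back to its doc_id

def describe_image(image_bytes: bytes) -> str:
    """Placeholder: ask a multimodal model (e.g. Gemini) for a detailed text description."""
    raise NotImplementedError

def add_document(original: object, summary: str) -> None:
    """Store the original under an ID and queue its summary for embedding."""
    doc_id = str(uuid.uuid4())
    doc_store[doc_id] = original
    summaries.append(summary)
    summary_ids.append(doc_id)

# Pre-processing (00:16:41-00:21:50), roughly:
#   for chunk in text_chunks: add_document(chunk, summarize(chunk))
#   for img in images:        add_document(img, describe_image(img))

def ask(query: str) -> str:
    """RAG chain (00:24:29): retrieve by summary similarity, answer from the originals."""
    q = embed(query)
    summary_vecs = np.stack([embed(s) for s in summaries])  # pre-computed and indexed in the demo
    sims = summary_vecs @ q / (np.linalg.norm(summary_vecs, axis=1) * np.linalg.norm(q))
    originals = [doc_store[summary_ids[i]] for i in np.argsort(sims)[::-1][:4]]
    context = "\n\n".join(str(o) for o in originals)  # the real chain passes images as image parts
    prompt = ("You are an automotive expert. Answer using only the owner's manual "
              f"excerpts below.\n\n{context}\n\nQuestion: {query}")
    return generate(prompt)
```

The key point is that only the text summaries are embedded, while the document store maps each ID back to the original chunk, table, or image that gets handed to Gemini.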

I used gemini-2.5-pro-exp-03-25 (input price: 1.25, output price: 10.0, max context length: 128,000) on rocketrecap dot com to summarize the transcript.
Cost (if I didn't use the free tier): $0.07
Input tokens: 24554
Output tokens: 3830

@hasszhao

Where is this notebook in the cookbook repo?

@tanishagrawal9091

Great Presentation!! 👍

@mohamedkarim-p7j

Thanks for sharing 👍

@evanrfraser

Fantastic. Thank You!

@homeandr1

Hello Jeff, could it be that there is a mistake at 24:00? In the for loop, instead of “for i, s in enumerate(texts + table_summaries + image_summaries)” it should be “for i, s in enumerate(text_summaries + table_summaries + image_summaries)”.
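
For anyone else hitting this: a hedged reconstruction of what that loop is presumably doing and why the fix matters. Only the list names quoted above come from the video; `id_key`, `doc_ids`, `retriever`, and the rest of the surrounding setup are assumptions based on LangChain's usual multi-vector retriever pattern, not the notebook's actual code.

```python
import uuid
from langchain_core.documents import Document

# retriever is assumed to be a pre-built LangChain MultiVectorRetriever (vectorstore + docstore).
id_key = "doc_id"  # metadata key linking each summary to its original (assumed)

# Embed the summaries, but keep the originals around so they can be returned at query time.
summaries = text_summaries + table_summaries + image_summaries   # text_summaries, not texts
originals = texts + tables + images

doc_ids = [str(uuid.uuid4()) for _ in summaries]
summary_docs = [
    Document(page_content=s, metadata={id_key: doc_ids[i]})
    for i, s in enumerate(summaries)
]
retriever.vectorstore.add_documents(summary_docs)       # vectors are built from the summaries
retriever.docstore.mset(list(zip(doc_ids, originals)))  # originals stay linked by doc_id
```

If `texts` is passed instead of `text_summaries`, the raw chunks rather than their Gemini-generated summaries end up being embedded, which presumably isn't what the summarize-then-embed design intends.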

@TL735

Nice, but why don't you develop a simple drag-and-drop RAG? E.g., I add a Drive folder link and Google generates a RAG chat based on its content.

@dynamix9916

Thank you 🙏

@anandthakkar6109

Where can we get the link to the demo notebook?

@RiccardoCarlessoGoogle

Is there a link to the python notebook? I'd love to play with it!