Implementing Multimodal RAG with GPT-4V and LangChain
liuian 2025-06-03 23:29 16 views
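Multimodal RAG is set to be a major trend in AI application architecture in 2024. An earlier article, "Building Multimodal RAG Applications with GPT4-V and llama-index", covered llama-index's work in this area; this article [1] uses another mainstream framework, LangChain, to show how multimodal RAG can be implemented.
Overall flow:
1) Embed images and text with a multimodal embedding model (such as CLIP)
2) Use vector retrieval for both images and text
3) Pass the raw images and text chunks to a multimodal LLM (GPT-4V) for answer synthesis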
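The three steps above can be sketched with a toy example. The vectors below are made-up stand-ins for CLIP embeddings (real ones have hundreds of dimensions), and the item names and the `retrieve` helper are hypothetical. The only point is that image and text items share one vector space and are ranked by the same similarity search.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical 3-d embeddings; a multimodal model such as CLIP maps
# both images and text into the same space.
index = [
    ("image", "figure-lagoon.jpg", [0.9, 0.1, 0.0]),
    ("text", "Hunting on the Lagoon, oil on panel", [0.8, 0.2, 0.1]),
    ("text", "Portrait dated 1632", [0.0, 0.1, 0.9]),
]

def retrieve(query_vec, k=2):
    """Return the k items most similar to the query, whatever their modality."""
    ranked = sorted(index, key=lambda item: cosine(query_vec, item[2]), reverse=True)
    return [(kind, name) for kind, name, _ in ranked[:k]]

query = [1.0, 0.0, 0.0]  # pretend embedding of "hunting on the lagoon"
results = retrieve(query)
print(results)
```

In the real pipeline the retrieved image and text would then be handed to the multimodal LLM for answer synthesis.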
Implementation:
1. Install dependencies.
! pip install pdf2image
! pip install pytesseract
! apt install poppler-utils
! apt install tesseract-ocr
#
! pip install -U langchain openai chromadb langchain-experimental # (newest versions required for multi-modal)
#
# lock to 0.10.19 due to a persistent bug in more recent versions
! pip install "unstructured[all-docs]==0.10.19" pillow pydantic lxml pillow matplotlib tiktoken open_clip_torch torch
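2. Download the data (see the original article for the test document).
import os
import shutil
# Create the data directory and download the PDF into it,
# so that it matches the path used in the next step
os.makedirs("/content/Data", exist_ok=True)
! wget -P /content/Data "https://www.getty.edu/publications/resources/virtuallibrary/0892360224.pdf"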
3. Set the path where the extracted images will be saved.
path = "/content/Data/"
#
file_name = os.listdir(path)
4. Use the partition_pdf method from Unstructured to extract text and images.
# Extract images, tables, and chunk text
from unstructured.partition.pdf import partition_pdf
raw_pdf_elements = partition_pdf(
    filename=path + file_name[0],
    extract_images_in_pdf=True,
    infer_table_structure=True,
    chunking_strategy="by_title",
    max_characters=4000,
    new_after_n_chars=3800,
    combine_text_under_n_chars=2000,
    image_output_dir_path=path,
)
5. Classify the text elements by type.
tables = []
texts = []
for element in raw_pdf_elements:
    if "unstructured.documents.elements.Table" in str(type(element)):
        tables.append(str(element))
    elif "unstructured.documents.elements.CompositeElement" in str(type(element)):
        texts.append(str(element))
#
print(len(tables))
print(len(texts))
#### Response
2
194
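Matching on str(type(element)) works, but isinstance is the more usual Python idiom for this kind of split. The sketch below uses small stand-in classes named after unstructured's Table and CompositeElement so that it runs without the library; with the real library one would presumably import those classes from unstructured.documents.elements instead. The `split_by_type` helper and the demo data are hypothetical.

```python
class Element:
    """Stand-in base class; str(element) returns the element text."""
    def __init__(self, text):
        self.text = text

    def __str__(self):
        return self.text

class Table(Element):
    """Stand-in for unstructured's Table element."""

class CompositeElement(Element):
    """Stand-in for unstructured's CompositeElement (a text chunk)."""

def split_by_type(elements):
    """Separate table elements from text chunks using isinstance."""
    tables, texts = [], []
    for element in elements:
        if isinstance(element, Table):
            tables.append(str(element))
        elif isinstance(element, CompositeElement):
            texts.append(str(element))
    return tables, texts

demo = [CompositeElement("chunk one"), Table("a | b"), CompositeElement("chunk two")]
tables_demo, texts_demo = split_by_type(demo)
print(len(tables_demo), len(texts_demo))
```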
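6. The images are stored in the file path.
from PIL import Image
Image.open("/content/Data/figure-26-1.jpg")
7. Embed the document (images and text) with a multimodal embedding model and add it to the vector store.
Here the OpenCLIP multimodal embedding is used. For better performance, a larger model is used (set in
langchain_experimental.open_clip.py).
model_name = "ViT-g-14"
checkpoint = "laion2b_s34b_b88k"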
import os
import uuid
import chromadb
import numpy as np
from langchain.vectorstores import Chroma
from langchain_experimental.open_clip import OpenCLIPEmbeddings
from PIL import Image as _PILImage
# Create chroma
vectorstore = Chroma(
    collection_name="mm_rag_clip_photos", embedding_function=OpenCLIPEmbeddings()
)
# Get image URIs with .jpg extension only
image_uris = sorted(
    [
        os.path.join(path, image_name)
        for image_name in os.listdir(path)
        if image_name.endswith(".jpg")
    ]
)
# Add images
vectorstore.add_images(uris=image_uris)
# Add documents
vectorstore.add_texts(texts=texts)
# Make retriever
retriever = vectorstore.as_retriever()
8. Retrieval-augmented generation.
The vectorstore.add_images method above stores and retrieves images as base64-encoded strings, which are then passed to GPT-4V.
import base64
import io
from io import BytesIO
import numpy as np
from PIL import Image
def resize_base64_image(base64_string, size=(128, 128)):
    """
    Resize an image encoded as a Base64 string.
    Args:
        base64_string (str): Base64 string of the original image.
        size (tuple): Desired size of the image as (width, height).
    Returns:
        str: Base64 string of the resized image.
    """
    # Decode the Base64 string
    img_data = base64.b64decode(base64_string)
    img = Image.open(io.BytesIO(img_data))
    # Resize the image
    resized_img = img.resize(size, Image.LANCZOS)
    # Save the resized image to a bytes buffer
    buffered = io.BytesIO()
    resized_img.save(buffered, format=img.format)
    # Encode the resized image to Base64
    return base64.b64encode(buffered.getvalue()).decode("utf-8")

def is_base64(s):
    """Check if a string is Base64 encoded"""
    try:
        return base64.b64encode(base64.b64decode(s)) == s.encode()
    except Exception:
        return False

def split_image_text_types(docs):
    """Split base64-encoded images and texts"""
    images = []
    text = []
    for doc in docs:
        doc = doc.page_content  # Extract Document contents
        if is_base64(doc):
            # Resize image to avoid OAI server error
            images.append(
                resize_base64_image(doc, size=(250, 250))
            )  # base64 encoded str
        else:
            text.append(doc)
    return {"images": images, "texts": text}
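The image/text split relies on the base64 round-trip check, so it helps to see how that check behaves on different inputs. The sketch below repeats the same test under a hypothetical name, `looks_like_base64`, and uses made-up bytes in place of a real JPEG. An encoded payload passes and ordinary prose fails, but a short string that happens to be valid base64 (such as "abcd") also passes, so the check is a heuristic rather than a guarantee.

```python
import base64

def looks_like_base64(s):
    """Round-trip test: decode, re-encode, and compare with the input."""
    try:
        return base64.b64encode(base64.b64decode(s)) == s.encode()
    except Exception:
        return False

# Stand-in for image bytes; a real pipeline would hold JPEG data here.
fake_image_bytes = bytes(range(16))
encoded = base64.b64encode(fake_image_bytes).decode("utf-8")

is_image = looks_like_base64(encoded)  # an encoded payload
is_prose = looks_like_base64("This painting was discovered only a few years ago.")
is_short = looks_like_base64("abcd")  # short text that is also valid base64

print(is_image, is_prose, is_short)
```

For the text chunks in this document the heuristic is adequate, since they contain spaces and punctuation that do not survive the round trip.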
Use RunnableParallel to format the inputs, and add image support to the ChatPromptTemplates.
from operator import itemgetter
from langchain.chat_models import ChatOpenAI
from langchain_core.messages import HumanMessage, SystemMessage
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnableLambda, RunnablePassthrough, RunnableParallel

def prompt_func(data_dict):
    # Joining the context texts into a single string
    formatted_texts = "\n".join(data_dict["context"]["texts"])
    messages = []
    # Adding image(s) to the messages if present
    if data_dict["context"]["images"]:
        image_message = {
            "type": "image_url",
            "image_url": {
                "url": f"data:image/jpeg;base64,{data_dict['context']['images'][0]}"
            },
        }
        messages.append(image_message)
    # Adding the text message for analysis
    text_message = {
        "type": "text",
        "text": (
            "As an expert art critic and historian, your task is to analyze and interpret images, "
            "considering their historical and cultural significance. Alongside the images, you will be "
            "provided with related text to offer context. Both will be retrieved from a vectorstore based "
            "on user-input keywords. Please use your extensive knowledge and analytical skills to provide a "
            "comprehensive summary that includes:\n"
            "- A detailed description of the visual elements in the image.\n"
            "- The historical and cultural context of the image.\n"
            "- An interpretation of the image's symbolism and meaning.\n"
            "- Connections between the image and the related text.\n\n"
            f"User-provided keywords: {data_dict['question']}\n\n"
            "Text and / or tables:\n"
            f"{formatted_texts}"
        ),
    }
    messages.append(text_message)
    return [HumanMessage(content=messages)]
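To make the message layout easier to inspect, here is a minimal sketch that builds the same kind of content list with plain dictionaries and no LangChain dependency. The `build_content` helper, the shortened prompt text, and the fake image bytes are all hypothetical. Unlike prompt_func, which attaches only the first retrieved image, this variation attaches every image it is given.

```python
import base64
import json

def build_content(question, texts, images_b64):
    """Assemble content parts: zero or more image parts, then one text part."""
    parts = []
    for img in images_b64:
        parts.append({
            "type": "image_url",
            # Images travel inline as a data URL holding the base64 string.
            "image_url": {"url": f"data:image/jpeg;base64,{img}"},
        })
    parts.append({
        "type": "text",
        "text": (
            f"User-provided keywords: {question}\n\n"
            "Text and / or tables:\n" + "\n".join(texts)
        ),
    })
    return parts

# Placeholder bytes; not a real JPEG.
fake_image = base64.b64encode(b"not a real jpeg").decode("utf-8")
content = build_content("hunting on the lagoon", ["Carpaccio ..."], [fake_image])
print(json.dumps(content, indent=2))
```

When no image is retrieved, the list contains only the text part, which is the situation in the second test query later in this article.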
Build the RAG chain with LCEL.
from google.colab import userdata
openai_api_key = userdata.get('OPENAI_API_KEY')
model = ChatOpenAI(temperature=0,
                   openai_api_key=openai_api_key,
                   model="gpt-4-vision-preview",
                   max_tokens=1024)
# RAG pipeline
chain = (
    {
        "context": retriever | RunnableLambda(split_image_text_types),
        "question": RunnablePassthrough(),
    }
    | RunnableParallel({"response": prompt_func | model | StrOutputParser(),
                        "context": itemgetter("context")})
)
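The data flow of this chain can be mimicked with ordinary functions. The sketch below is a plain-Python analogy, not LangChain code: `fake_retriever`, `fake_model`, and `run_chain` are hypothetical stand-ins. It shows that the question feeds both branches of the first dictionary, and that the second stage returns the generated response together with the context it was built from.

```python
def fake_retriever(question):
    """Hypothetical stand-in for retriever | split_image_text_types."""
    return {"images": [], "texts": [f"passage about {question}"]}

def fake_model(data):
    """Hypothetical stand-in for prompt_func | model | StrOutputParser()."""
    n_texts = len(data["context"]["texts"])
    return f"answer to '{data['question']}' using {n_texts} text(s)"

def run_chain(question):
    # Stage 1: the same input feeds both entries of the dict.
    stage1 = {"context": fake_retriever(question), "question": question}
    # Stage 2: both outputs are computed from the same stage-1 dict,
    # so the caller gets the answer and the sources side by side.
    return {"response": fake_model(stage1), "context": stage1["context"]}

result = run_chain("hunting on the lagoon")
print(result["response"])
print(result["context"])
```

Returning the context alongside the response is what lets the tests below print the retrieved passages and images as sources.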
Testing:
q1:
response = chain.invoke("hunting on the lagoon")
#
print(response['response'])
print(response['context'])
############# RESPONSE ###############
The image depicts a serene scene of a lagoon with several groups of people engaged in bird hunting. The visual elements include calm waters, boats with hunters wearing red and white clothing, and birds both in flight and used as decoys. The hunters appear to be using long poles, possibly to navigate through the shallow waters or to assist in the hunting process. In the background, there are simple straw huts, suggesting temporary shelters for the hunters. The sky is painted with soft clouds, and the overall color palette is muted, with the reds of the hunters' clothing standing out against the blues and greens of the landscape.
The historical and cultural context of this image is rooted in the Italian Renaissance, specifically in Venice during the late 15th to early 16th century. Vittore Carpaccio, the artist, was known for his genre paintings, which depicted scenes from everyday life with great detail and realism. This painting, "Hunting on the Lagoon," is a testament to Carpaccio's keen observation of his environment and the activities of his contemporaries. The inclusion of diverse figures, such as some black individuals, reflects the cosmopolitan nature of Venetian society at the time.
Interpreting the symbolism and meaning of the image, one might consider the lagoon as a symbol of Venice itself—a city intertwined with water, where the boundary between land and sea is often blurred. The act of hunting could represent the human endeavor to harness and interact with nature, a common theme during the Renaissance as people sought to understand and depict the natural world with increasing accuracy. The presence of decoys suggests themes of illusion and reality, which were also explored in Renaissance art.
The connection between the image and the related text is clear. The text provides valuable insights into the painting's background, such as its use as a window cover, which adds a layer of functionality and interactivity to the artwork. The trompe l'oeil on the back with the illusionistic cornice and the real hinge further emphasizes the artist's interest in creating a sense of depth and reality. The mention of the lily blossom at the bottom indicates that the painting may have been altered from its original form, which could have included more symbolic elements or been part of a larger composition.
The text also notes that Carpaccio was famous as a landscape painter, which aligns with the detailed and atmospheric depiction of the lagoon setting. The discovery of the painting only a few years ago suggests that there is still much to learn about Carpaccio's work and the nuances of this particular piece. The lack of complete understanding of the subject matter invites further research and interpretation, allowing viewers to ponder the daily life and environment of Renaissance Venice.
{'images': [''],
'texts': ["VITTORE CARPACCIO Venetian, 1455/56-1525/26 Hunting on the Lagoon oil on panel, 75.9x63.7cm 6 Carpaccio is considered to be the first great genre painter of the Italian Renaissance, and it is ob- vious that he was a careful observer of his surroundings. The subject of this unusual painting is not yet completely understood, but it apparently depicts groups of Venetians, including some blacks, hunting for birds on the Venetian lagoon. Some birds standing upright in the boats must be decoys. In the background are huts built of straw, which the hunters must have used as temporary lodging. The back of the painting shows an illusionistic cornice with some letters and memoranda—still legible—fastened to the wall. The presence of a real hinge on the back indicates the painting was used as a door to a cupboard or more probably a window cover. It is therefore possible that one had the illusion of looking into the lagoon when the window was shuttered. The presence of a lily blossom at the bottom implies that the painting has been cut down; originally it may have shown the lily in a vase or it may have been cut from a still larger painting in which our fragment was only the background. Reperse: Trompe l'Oeil ",
'3\n\nThis painting was discovered only a few years ago. Unfortunately very little is known about its',
'18\n\npersonality and artistic interests, but he was most famous as a landscape painter.']}
print(response['context']['images'])
####### RESPONSE ##################
['<base64-encoded image string omitted>']
A helper function that displays the retrieved image as part of the source context for the generated response.
from IPython.display import HTML, display

def plt_img_base64(img_base64):
    # Create an HTML img tag with the base64 string as the source
    image_html = f'<img src="data:image/jpeg;base64,{img_base64}" />'
    # Display the image by rendering the HTML
    display(HTML(image_html))

Display the image related to the retrieved text.
plt_img_base64(response['context']['images'][0])
q2:
response = chain.invoke("Woman with children")
print(response['response'])
print(response['context'])
########### RESPONSE ######################
The image in question appears to be a portrait of a woman with children, painted in oil on canvas and measuring 94.4x114.2 cm. The woman is likely the central figure in the painting, and the children are probably depicted around her, possibly playing with various instruments as suggested by the text. The woman's age is given as 21, and the painting is dated 1632, which places it in the early 17th century.
The historical and cultural context of this image is significant. The early 17th century was a time of great change and upheaval in Europe, with the Thirty Years' War raging and the rise of absolutist monarchies. In the art world, this was the era of the Baroque, characterized by dramatic, emotional, and often theatrical compositions. The fact that the woman is identified by her age suggests that this is a portrait of a specific individual, possibly a member of the nobility or upper class, as such portraits were often commissioned to commemorate important life events or to display wealth and status.
The symbolism and meaning of the image could be interpreted in several ways. The presence of children suggests themes of motherhood, family, and domesticity. The fact that they are playing instruments could symbolize harmony, creativity, and the importance of music and the arts in the family's life. The woman's age, 21, could also be significant, as it is often considered the age of adulthood and independence.
The related text mentions that the painting was discovered only a few years ago and that very little is known about it. This adds an element of mystery to the image and suggests that there may be more to uncover about its history and significance. The text also mentions a French artist, born in 1702 and died in 1766, which could indicate that the painting is French in origin, although the date of the painting does not align with the artist's lifetime. The mention of Marc de Villiers, born in 1671 and the subject of a painting dated 1747, suggests that the image may be part of a larger collection of portraits of notable individuals from this period.
Overall, this image of a woman with children is a rich and complex work that offers insights into the cultural and historical context of the early 17th century. Its symbolism and meaning are open to interpretation, and the connections between the image and the related text suggest that there is still much to learn about this painting and its place in art history.
{'images': [],
'texts': ['31\n\nThis portrait is dated 1632 and gives the age of the sitter, 21. To our eyes she would appear to be',
'3\n\nThis painting was discovered only a few years ago. Unfortunately very little is known about its',
'oil on canvas, 94.4x114.2 cm\n\n4l\n\nat which they want to play their various instruments.',
'French, 1702-1766\n\n46\n\nThe sitter, Marc de Villiers, was born in 1671, and since this painting is signed and dated in 1747,']}
Note: this query has no relevant image, so the retrieved image list is empty.
q3:
response = chain.invoke("Moses and the Messengers from Canaan")
print(response['response'])
print(response['context'])
########### RESPONSE #############
The image you've provided appears to be a classical painting depicting a group of figures in a pastoral landscape. Unfortunately, the image does not directly correspond to the provided keywords "Moses and the Messengers from Canaan," nor does it seem to relate to the text snippets you've included. However, I will do my best to analyze the image based on its visual elements and provide a general interpretation that might align with the themes of historical and cultural significance.
Visual Elements:
- The painting shows a group of people gathered in a natural setting, which seems to be a forest clearing or the edge of a wooded area.
- The figures are dressed in what appears to be classical or ancient attire, suggesting a historical or mythological scene.
- The color palette is composed of earthy tones, with a contrast between the light and shadow that gives depth to the scene.
- The composition is balanced, with trees framing the scene on the left and the background opening up to a brighter, possibly sunlit area.
Historical and Cultural Context:
- The painting style and attire of the figures suggest it could be from the Renaissance or Baroque period, which were times of great interest in classical antiquity and biblical themes.
- The reference to "Arcadian shepherds discovering a tomb" and "Poussin" in the text indicates a connection to Nicolas Poussin, a French painter of the Baroque era known for his classical landscapes and historical scenes.
Interpretation and Symbolism:
- Without a direct connection to the story of Moses and the messengers from Canaan, it's challenging to provide a precise interpretation. However, the painting could be depicting a scene of discovery or revelation, common themes in Poussin's work.
- The pastoral setting might symbolize an idyllic, peaceful world, often associated with the concept of Arcadia in classical literature and art.
- The gathering of figures could represent a moment of communal storytelling or the sharing of important news, which could loosely tie into the idea of messengers or a significant event.
Connections to Related Text:
- The text mentions the theme of "Arcadian shepherds discovering a tomb," which is a motif Poussin famously depicted in his painting "Et in Arcadia ego." While the image does not show a tomb, the pastoral setting and classical attire could suggest a similar thematic exploration.
- The reference to Flemish art and the interaction with Italian Renaissance artists might imply a fusion of Northern European and Italian artistic styles, which could be reflected in the painting's technique and composition.
In conclusion, while the image does not directly depict the story of Moses and the messengers from Canaan, it does evoke the classical and pastoral themes prevalent in the work of artists like Poussin during the Baroque period. The painting may represent a general scene of classical antiquity or a mythological event, characterized by a serene landscape and a gathering of figures engaged in a significant moment. The historical and cultural significance of such a painting would lie in its representation of the values and aesthetics of the time, as well as its potential to blend different artistic traditions.
{'images': ['',
''],
'texts': ['16\n\nThe theme of Arcadian shepherds discovering a tomb originated in painting with Poussin in the',
'Flemish, 1488-1541\n\n20\n\nWhen Italian artists of the Renaissance came into contact with paintings from the north, they']}
Display the retrieved images.
for images in response['context']['images']:
    plt_img_base64(images)
With a multimodal LLM, LangChain, and unstructured, the steps above implement RAG over unstructured data, drawing on both the images embedded in the document and its text.
Reference:
[1] Plaban Nayak: Multimodal RAG using Langchain Expression Language And GPT4-Vision