Extract keywords from images using Python
Question
I am still learning Python. I am currently working on Python code that extracts metadata (user-made keywords) from images. I already tried Pillow and exif, but these exclude the user-made tags or keywords. With applist, I successfully managed to extract the metadata, including my keywords, but when I tried to parse it with ElementTree to extract the parts of interest, I obtained only empty data.
My XML file looks like this (after some manipulation):
<x:xmpmeta xmlns:x="adobe:ns:meta/" x:xmptk="XMP Core 4.4.0">
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
<rdf:Description rdf:about=""
xmlns:dc="http://purl.org/dc/elements/1.1/">
<dc:description>
<rdf:Seq>
<rdf:li xml:lang="x-default">South Carolina, Olivyana, Kumasi</rdf:li>
</rdf:Seq>
</dc:description>
<dc:subject>
<rdf:Bag>
<rdf:li>Kumasi</rdf:li>
<rdf:li>Summer 2016</rdf:li>
<rdf:li>Charlestone</rdf:li>
<rdf:li>SC</rdf:li>
<rdf:li>Beach</rdf:li>
<rdf:li>Olivjana</rdf:li>
</rdf:Bag>
</dc:subject>
<dc:title>
<rdf:Seq>
<rdf:li xml:lang="x-default">P1050365</rdf:li>
</rdf:Seq>
</dc:title>
</rdf:Description>
<rdf:Description rdf:about=""
xmlns:aux="http://ns.adobe.com/exif/1.0/aux/">
<aux:SerialNumber>F360908190331</aux:SerialNumber>
</rdf:Description>
</rdf:RDF>
</x:xmpmeta>
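My code is as follows:
import xml.etree.ElementTree as ET
from PIL import Image, ExifTags

with Image.open("myfile.jpg") as im:
    # applist holds (segment name, raw bytes) pairs for the JPEG APPn segments
    for segment, content in im.applist:
        marker, body = content.split(b'\x00', 1)
        # The XMP packet lives in the APP1 segment tagged with this identifier
        if segment == 'APP1' and marker == b'http://ns.adobe.com/xap/1.0/':
            data = body.decode('utf-8')
            print(data)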
At this point it wasn't possible to pass this to the parser, as there is an empty line that returns an error:
tree = ET.parse(data)
ValueError: embedded null byte
So after removing it, I saved the data in an XML file (the XML data above) and passed it to the parser, but obtained no data:
tree = ET.parse('mytags.xml')
tags = tree.findall('xmpmeta/RDF/Description/subject/Bags')
print (type(tags))
print (len(tags))
<class 'list'>
0
Interestingly, if I use the tags in the form they have in the XML file (i.e. 'x:xmpmeta'), I receive the following error:
SyntaxError: prefix 'x' not found in prefix map
Thanks for your help.
Fabio
Recommended answer
Focusing only on your XML parsing, not the PIL metadata work, three issues are causing your problem:
- You need to define the namespace prefixes when using findall, which you can do with the namespaces argument. Your XPath must then include the prefixes.
- When using findall, do not include the root in the path, since the root is the starting point; begin from its child and work downward.
- There is no local name Bags (plural), only Bag, and a match on it would have length one. If you want its children, go one level deeper.
Consider the adjusted script:
import xml.etree.ElementTree as ET

tree = ET.parse('mytags.xml')

# Map each prefix used in the XPath to its namespace URI
nmspdict = {'x': 'adobe:ns:meta/',
            'rdf': 'http://www.w3.org/1999/02/22-rdf-syntax-ns#',
            'dc': 'http://purl.org/dc/elements/1.1/'}

# The path starts below the root (x:xmpmeta) and ends at the rdf:li children of rdf:Bag
tags = tree.findall('rdf:RDF/rdf:Description/dc:subject/rdf:Bag/rdf:li',
                    namespaces=nmspdict)

print(type(tags))
print(len(tags))
# <class 'list'>
# 6

for i in tags:
    print(i.text)
# Kumasi
# Summer 2016
# Charlestone
# SC
# Beach
# Olivjana
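The same lookup can also run on the decoded packet in memory, without saving it to a file first. ET.parse expects a filename or file object, so handing it the XML text makes Python treat that text as a path, which is likely where "ValueError: embedded null byte" came from; ET.fromstring parses the text itself. What follows is a minimal sketch rather than a definitive fix: the helper name extract_keywords and the shortened sample packet are hypothetical, and dropping NUL characters assumes the stray byte is only padding.

```python
import xml.etree.ElementTree as ET

# Prefix-to-URI map; the prefixes only need to match those used in the path.
NAMESPACES = {
    'rdf': 'http://www.w3.org/1999/02/22-rdf-syntax-ns#',
    'dc': 'http://purl.org/dc/elements/1.1/',
}


def extract_keywords(xmp_text):
    """Return the dc:subject keywords from an XMP packet given as a string."""
    # Assumption: any NUL characters are stray padding and can be dropped.
    root = ET.fromstring(xmp_text.replace('\x00', ''))
    # root is x:xmpmeta, so the path starts at its child rdf:RDF.
    items = root.findall(
        'rdf:RDF/rdf:Description/dc:subject/rdf:Bag/rdf:li',
        namespaces=NAMESPACES)
    return [item.text for item in items]


# Shortened, hypothetical sample standing in for the decoded APP1 body.
sample = (
    '<x:xmpmeta xmlns:x="adobe:ns:meta/">'
    '<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">'
    '<rdf:Description rdf:about="" '
    'xmlns:dc="http://purl.org/dc/elements/1.1/">'
    '<dc:subject><rdf:Bag>'
    '<rdf:li>Kumasi</rdf:li><rdf:li>Beach</rdf:li>'
    '</rdf:Bag></dc:subject>'
    '</rdf:Description>'
    '</rdf:RDF>'
    '</x:xmpmeta>\x00'
)

print(extract_keywords(sample))
```

In the question's loop, this would be called as extract_keywords(data) right after the APP1 body is decoded.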