How to retrieve the author of an Office file in Python?
Question
As the title says: I have some doc and docx files, and I want to retrieve their author information so that I can reorganize my files.
os.stat returns only the size and timestamps, that is, file-system-level information.
open(filename, 'rb').read(200) returns many characters that I could not parse.
There is a module called xlrd for reading xlsx files, but it still doesn't let me read doc or docx files. I am aware that the newer Office formats are not easily read by non-MS Office programs, so if that's impossible, gathering the information from old Office files would suffice.
Recommended answer
Since docx files are just zipped XML, you can unzip the docx file and pull the author information out of one of the XML files inside it. I'm not quite sure where it is stored, but a brief look around leads me to suspect it is stored as dc:creator in docProps/core.xml.
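One way to check that guess is to list the parts inside the archive with the standard zipfile module. This is only a minimal sketch; the helper name list_docx_parts is made up for illustration and is not part of any library:

```python
import zipfile

def list_docx_parts(path):
    """Return the names of all parts stored in a docx (zip) archive."""
    # A docx file is an ordinary zip archive, so ZipFile can open it directly.
    with zipfile.ZipFile(path) as zf:
        return zf.namelist()
```

If the suspicion above is right, the returned list should contain docProps/core.xml next to the parts that hold the document body.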
Here's how you can open the docx file and retrieve the creator:
import zipfile
import lxml.etree

# open the docx file, which is a zip archive
with zipfile.ZipFile('my_doc.docx') as zf:
    # use lxml to parse the XML file we are interested in
    doc = lxml.etree.fromstring(zf.read('docProps/core.xml'))

# retrieve the creator; dc is the Dublin Core namespace
ns = {'dc': 'http://purl.org/dc/elements/1.1/'}
creator = doc.xpath('//dc:creator', namespaces=ns)[0].text
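
If lxml is not installed, the same lookup can probably be done with the standard library alone. The sketch below assumes the author is stored as dc:creator in docProps/core.xml, as suspected above, and returns None when that part or element is missing. The function name get_docx_creator is hypothetical, and this covers docx only, not the older binary doc format:

```python
import zipfile
import xml.etree.ElementTree as ET

# Dublin Core namespace used by the dc:creator element
DC_NS = {'dc': 'http://purl.org/dc/elements/1.1/'}

def get_docx_creator(path):
    """Return the dc:creator text of a docx file, or None if it is absent."""
    with zipfile.ZipFile(path) as zf:
        try:
            core_xml = zf.read('docProps/core.xml')
        except KeyError:
            # the archive has no docProps/core.xml part
            return None
    root = ET.fromstring(core_xml)
    # search the whole tree for the first dc:creator element
    creator = root.find('.//dc:creator', DC_NS)
    return creator.text if creator is not None else None
```

Calling get_docx_creator('my_doc.docx') for each file would give a value to group the files by; files that return None would need to be handled separately.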