How to retrieve the author of an Office file in Python?


Question

The title explains the problem: I have some doc and docx files, and I want to retrieve their author information so that I can reorganize my files.

os.stat returns only size and datetime, i.e. file-system-level information.
open(filename, 'rb').read(200) returns many characters that I could not parse.

There is a module called xlrd for reading xlsx files, but it does not let me read doc or docx files. I am aware that new Office files are not easily read by non-MS Office programs, so if that is impossible, gathering the information from old Office files would suffice.

Recommended answer

Since docx files are just zipped XML, you can unzip the docx file and pull the author information out of one of the XML files. I'm not quite sure where it is stored, but a brief look suggests it is stored as dc:creator in docProps/core.xml.

Here's how you can open the docx file and retrieve the creator:

import zipfile, lxml.etree

# open zipfile
zf = zipfile.ZipFile('my_doc.docx')
# use lxml to parse the xml file we are interested in
doc = lxml.etree.fromstring(zf.read('docProps/core.xml'))
# retrieve creator
ns={'dc': 'http://purl.org/dc/elements/1.1/'}
creator = doc.xpath('//dc:creator', namespaces=ns)[0].text
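
If lxml is not available, the same lookup can be sketched with the standard library alone. This is a minimal sketch, not a definitive implementation: the helper name get_docx_author is hypothetical, and it assumes the author is stored as dc:creator in docProps/core.xml, as suggested above. It returns None when that part or element is missing instead of raising.

```python
import zipfile
import xml.etree.ElementTree as ET

# Dublin Core namespace that the dc: prefix maps to in docProps/core.xml
DC_NS = {"dc": "http://purl.org/dc/elements/1.1/"}


def get_docx_author(path):
    """Return the dc:creator text of a docx file, or None if it is absent.

    Hypothetical helper name, used here for illustration only.
    """
    with zipfile.ZipFile(path) as zf:
        try:
            core_xml = zf.read("docProps/core.xml")
        except KeyError:
            # The archive has no core-properties part
            return None
    root = ET.fromstring(core_xml)
    creator = root.find(".//dc:creator", DC_NS)
    return creator.text if creator is not None else None
```

Calling get_docx_author('my_doc.docx') would give the same string as the lxml version when the element is present. Old binary .doc files are not zip archives, so zipfile.ZipFile raises zipfile.BadZipFile for them; this sketch only covers the docx case.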
