Extract DOCX Comments


Question

I'm a teacher. I want a list of all the students who commented on the essay I assigned, and what they said. The Drive API was too challenging for me, but I figured I could download the essays as a zip and parse the XML.

Each comment is stored in a w:comment tag, with w:t tags holding the comment text. It should be easy, but XML (etree) is killing me.

Following the tutorial (and the official Python docs):

import zipfile
from lxml import etree  # assumption: the same module the answer below uses

z = zipfile.ZipFile('test.docx')
x = z.read('word/comments.xml')
tree = etree.XML(x)

Then I do this:

# iter() replaces the deprecated getiterator()
children = tree.iter()
for c in children:
    print(c.attrib)

Result:

{}
{'{http://schemas.openxmlformats.org/wordprocessingml/2006/main}author': 'Joe Shmoe', '{http://schemas.openxmlformats.org/wordprocessingml/2006/main}id': '1', '{http://schemas.openxmlformats.org/wordprocessingml/2006/main}date': '2017-11-17T16:58:27Z'}
{'{http://schemas.openxmlformats.org/wordprocessingml/2006/main}rsidR': '00000000', '{http://schemas.openxmlformats.org/wordprocessingml/2006/main}rsidDel': '00000000', '{http://schemas.openxmlformats.org/wordprocessingml/2006/main}rsidP': '00000000', '{http://schemas.openxmlformats.org/wordprocessingml/2006/main}rsidRDefault': '00000000', '{http://schemas.openxmlformats.org/wordprocessingml/2006/main}rsidRPr': '00000000'}
{}
{'{http://schemas.openxmlformats.org/wordprocessingml/2006/main}val': '0'}
{'{http://schemas.openxmlformats.org/wordprocessingml/2006/main}val': '0'}
{'{http://schemas.openxmlformats.org/wordprocessingml/2006/main}val': '0'}

After this I'm totally stuck. I've tried element.get() and element.findall() with no luck. Even when I copy and paste the attribute name ('{http://schemas.openxmlformats.org/wordprocessingml/2006/main}val'), I get None in return.

Can anyone help?

Recommended answer

You got remarkably far considering that OOXML is such a complex format.

Here's some sample Python code showing how to access the comments of a DOCX file via XPath:

from lxml import etree
import zipfile

ooXMLns = {'w':'http://schemas.openxmlformats.org/wordprocessingml/2006/main'}

def get_comments(docxFileName):
  docxZip = zipfile.ZipFile(docxFileName)
  commentsXML = docxZip.read('word/comments.xml')
  et = etree.XML(commentsXML)
  comments = et.xpath('//w:comment',namespaces=ooXMLns)
  for c in comments:
    # attributes:
    print(c.xpath('@w:author',namespaces=ooXMLns))
    print(c.xpath('@w:date',namespaces=ooXMLns))
    # string value of the comment:
    print(c.xpath('string(.)',namespaces=ooXMLns))
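
If lxml is not available, roughly the same result can be had with the standard library. The sketch below is not the answer's XPath method: it swaps in xml.etree.ElementTree with a namespace map, and it returns a list instead of printing. The function name list_comments and the dict keys are hypothetical, chosen for illustration. It also hints at one likely reason the element.get() calls in the question returned None: the root element carries no attributes (the first {} in the output), so get() only finds something when called on the w:comment element itself, using the full '{namespace}name' key.

```python
import zipfile
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W_NS}


def list_comments(docx_file):
    """Return one dict per comment: author, date, and text."""
    with zipfile.ZipFile(docx_file) as docx_zip:
        # Assumption: a document without comments may lack this part entirely.
        if "word/comments.xml" not in docx_zip.namelist():
            return []
        root = ET.fromstring(docx_zip.read("word/comments.xml"))

    results = []
    # findall() accepts a prefix-to-URI map, so 'w:comment' resolves correctly.
    for comment in root.findall("w:comment", NS):
        # Attribute names are namespace-qualified, so the key passed to
        # get() must use the full '{namespace}name' form.
        author = comment.get(f"{{{W_NS}}}author")
        date = comment.get(f"{{{W_NS}}}date")
        # A comment can span several runs; join the text of every w:t element.
        text = "".join(t.text or "" for t in comment.iter(f"{{{W_NS}}}t"))
        results.append({"author": author, "date": date, "text": text})
    return results
```

Calling list_comments('test.docx') and looping over the result gives the list of who commented and what they said. Note that the text of a multi-paragraph comment is joined without separators in this sketch; inserting a newline per w:p would be a straightforward extension.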
