Word breaks in text extraction, lxml XPath

Question

I want to extract words with strikethroughs, i.e. text inside the <w:delText> tag. The expression I use extracts them successfully, except that some words come out broken. For example, the word "They" appears as 'T' and 'hey'. Below is an XML sample where the problem occurs:

<w:delText xml:space="preserve">. </w:delText></w:r>
<w:r w:rsidR="0020338C" w:rsidDel="00147CFE">
  <w:rPr>
    <w:rFonts w:ascii="Times New Roman" w:hAnsi="Times New Roman"/>
    <w:sz w:val="24"/>
  </w:rPr>
  <w:delText>T</w:delText>
</w:r>
<w:r w:rsidR="00DF6A7D" w:rsidDel="00147CFE">
  <w:rPr>
    <w:rFonts w:ascii="Times New Roman" w:hAnsi="Times New Roman"/>
    <w:sz w:val="24"/>
  </w:rPr>
  <w:delText>hey</w:delText>
</w:r></w:del>
<w:ins w:id="5" w:author="Author" w:date="2014-08-13T10:08:00Z">
  <w:r w:rsidR="00147CFE">
    <w:rPr>
      <w:rFonts w:ascii="Times New Roman" w:hAnsi="Times New Roman"/>
      <w:sz w:val="24"/>
    </w:rPr>
    <w:t xml:space="preserve"> that helps them</w:t>
  </w:r>
</w:ins>

I used the following code:

from lxml import etree
# lxml_tree is the already parsed document XML
find = etree.XPath("//w:p//.//*[local-name() = 'delText']//text()", namespaces={'w': "http://schemas.openxmlformats.org/wordprocessingml/2006/main"})
list_of_deleted_words = find(lxml_tree)

How could I fix this?

Edit:

I realized the problem only occurs with words that contain capital letters; words like "She" and "He" also get split.

Recommended answer

It is the words. "They" should be counted as one word rather than two, which is what my code currently does.

The problem arises because stretches of text are arbitrarily split into several so-called "runs". In OOXML, text is organized in w:p elements (paragraphs) like this (simplified structure):

<w:p>
  <w:r>
    <w:t>Simpli</w:t>
  </w:r>
  <w:r>
    <w:t>fied structures</w:t>
  </w:r>
</w:p>

As you can see, the actual text is inside w:t elements, which are in turn inside a w:r element, or "run". Unfortunately, this division into separate runs is so haphazard that it can only be called arbitrary. To my knowledge, nobody knows how the decision to start a new run is made.
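To illustrate how one word can be spread over two runs, here is a minimal sketch that reads the simplified paragraph above and joins the w:t fragments in document order. The names `SAMPLE` and `paragraph_text` are made up for this illustration, and the fallback to the standard library parser is only there so the snippet also runs without lxml installed.

```python
try:
    from lxml import etree
except ImportError:  # the standard library parser offers the calls used here
    import xml.etree.ElementTree as etree

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

# The simplified paragraph from the text, with the namespace declared
SAMPLE = (
    '<w:p xmlns:w="%s">'
    '<w:r><w:t>Simpli</w:t></w:r>'
    '<w:r><w:t>fied structures</w:t></w:r>'
    '</w:p>' % W
)

def paragraph_text(p):
    """Concatenate the text of every w:t in a paragraph, in document order."""
    return "".join(t.text or "" for t in p.iter("{%s}t" % W))

paragraph = etree.fromstring(SAMPLE)

# Extracting text node by text node yields the broken pieces...
fragments = [t.text for t in paragraph.iter("{%s}t" % W)]
print(fragments)

# ...while joining them per paragraph restores the original wording.
print(paragraph_text(paragraph))
```

Extracting each text node separately is what the XPath in the question does, which is why a word that spans two runs comes back as two list items.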

Now, turning to your question: w:delText is inside runs, too. And there, too, the fragmentation into runs appears to be purely arbitrary.

With your current method, there is no way of knowing whether the text content of a particular w:delText ever was a whole word. For that, you would have to take into account the whole sequence of runs, both the ones that contain normal text and the ones that contain deleted text.

Chances are that this would work, because deleted text still sits in a run at the position where it was deleted. The example below uses the older Word 2003 XML format, which is slightly different, but that does not matter here:

<w:r>
  <w:t>Normal Text before deletion </w:t>
</w:r>
<aml:annotation aml:id="0"
               w:type="Word.Deletion"
               aml:author="Mathias Müller"
               aml:createdate="2014-09-26T22:25:00Z">
  <aml:content>
     <w:r wsp:rsidDel="00F647B7">
        <w:delText>T</w:delText>
     </w:r>
  </aml:content>
</aml:annotation>
<aml:annotation aml:id="1"
               w:type="Word.Deletion"
               aml:author="Mathias Müller"
               aml:createdate="2014-09-26T22:24:00Z">
  <aml:content>
     <w:r wsp:rsidDel="00F647B7">
        <w:delText>hey </w:delText>
     </w:r>
  </aml:content>
</aml:annotation>
<w:r>
  <w:t>Normal Text after deletion </w:t>
</w:r>

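Put another way,

  • if there are two (or more) deleted "runs" in immediate succession, with no whitespace in either of them, then you know they are just parts of one word.

As for word boundaries,

  • if a deleted run is preceded by a normal run and there is whitespace between them (either at the end of the normal run or at the beginning of the deleted run), then you know that the deleted run starts a new word
  • if a deleted run is preceded by a normal run without any whitespace between them, then you should conclude that only part of a word was deleted and that this deleted run is not a complete word
  • all of the above applies in reverse to a deleted run that is immediately followed by a normal run, with or without whitespace between them.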
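The rules above can be sketched in Python roughly as follows. This is only an illustration under stated assumptions, not a tested solution for real documents: the helper names `pieces` and `deleted_words` and the sample paragraph are invented here, words are delimited by whitespace only, and the fallback import just lets the snippet run without lxml. The idea is to walk w:t and w:delText in document order, remember for every character whether it was deleted, and then look at each whitespace-delimited token as a whole.

```python
import re
from itertools import groupby

try:
    from lxml import etree
except ImportError:  # the standard library parser offers the calls used here
    import xml.etree.ElementTree as etree

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
T, DEL_TEXT = "{%s}t" % W, "{%s}delText" % W

# Hypothetical paragraph: "They" deleted in two runs, "mov" deleted inside "removed"
SAMPLE = (
    '<w:p xmlns:w="%s">'
    '<w:r><w:t xml:space="preserve">Before. </w:t></w:r>'
    '<w:del><w:r><w:delText>T</w:delText></w:r>'
    '<w:r><w:delText>hey</w:delText></w:r></w:del>'
    '<w:ins><w:r><w:t xml:space="preserve"> that helps them</w:t></w:r></w:ins>'
    '<w:r><w:t xml:space="preserve"> re</w:t></w:r>'
    '<w:del><w:r><w:delText>mov</w:delText></w:r></w:del>'
    '<w:r><w:t>ed</w:t></w:r>'
    '</w:p>' % W
)

def pieces(paragraph):
    """Yield (is_deleted, text) for each w:t / w:delText in document order."""
    for el in paragraph.iter():
        if el.tag in (T, DEL_TEXT) and el.text:
            yield el.tag == DEL_TEXT, el.text

def deleted_words(paragraph):
    """Return (fragment, is_whole_word) for every deleted stretch of text."""
    texts, flags = [], []
    for is_deleted, text in pieces(paragraph):
        texts.append(text)
        flags.extend([is_deleted] * len(text))  # one flag per character
    full_text = "".join(texts)

    results = []
    for token in re.finditer(r"\S+", full_text):  # whitespace-delimited words
        span = range(token.start(), token.end())
        # Adjacent deleted characters form one fragment, whichever run they came from
        for is_deleted, group in groupby(span, key=flags.__getitem__):
            if is_deleted:
                idx = list(group)
                fragment = full_text[idx[0]:idx[-1] + 1]
                results.append((fragment, len(idx) == len(span)))
    return results

print(deleted_words(etree.fromstring(SAMPLE)))
```

With this sample, the two deleted runs "T" and "hey" are reported as the single whole word "They", while "mov" is reported as a partial deletion because normal text touches it on both sides without whitespace. Punctuation attached to a word counts as part of the token in this sketch, so a deleted word followed by an undeleted full stop would be flagged as partial.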

We all know, of course, that relying on whitespace to tell words apart is a crude method, but it might be sufficient in this case.
