Extract only body text from arXiv articles formatted as .tex


Question

My dataset is composed of arXiv astrophysics articles as .tex files, and I need to extract only the text of the article body, not any other part of the article (e.g. tables, figures, abstract, title, footnotes, acknowledgements, citations, etc.).

I've been trying with Python 3 and tex2py, but I'm struggling to get a clean corpus, because the files differ in labeling and the text is broken up between labels.

I have attached an SSCCE, a couple of sample LaTeX files with their PDFs, and the parsed corpus. The corpus shows my struggles: sections and subsections are not extracted in order, the text breaks at some labels, and some tables and figures are included.

Code:

import os
from tex2py import tex2py

corpus = open('corpus2.tex', 'a')

def parseFiles():
    """
    Parses downloaded document .tex files for word content.
    We are only interested in the article body, defined by \section tags.
    """

    for file in os.listdir("latex"):
        if file.endswith('.tex'):
            print('\nChecking ' + file + '...')
            with open("latex/" + file) as f:
                try:
                    toc = tex2py(f) # toc = tree of contents
                    # If file is a document, defined as having \begin{document}
                    if toc.source.document:
                        # Iterate over each section in document
                        for section in toc:
                            # Parse the section
                            getText(section)
                    else:
                        print(file + ' is not a document. Discarded.')
                except (EOFError, TypeError, UnicodeDecodeError): 
                    print('Error: ' + file + ' was not correctly formatted. Discarded.')



def getText(section):
    """
    Extracts text from given "section" node and any nested "subsection" nodes. 

    Parameters
    ----------
    section : list
        A "section" node in a .tex document 
    """

    # For each element within the section 
    for x in section:
        if hasattr(x.source, 'name'):
            # If it is a subsection or subsubsection, parse it
            if x.source.name == 'subsection' or x.source.name == 'subsubsection':
                corpus.write('\nSUBSECTION!!!!!!!!!!!!!\n')
                getText(x)
            # Avoid parsing past these sections
            elif x.source.name == 'acknowledgements' or x.source.name == 'appendix':
                return
        # If element is text, add it to corpus
        elif isinstance(x.source, str):
            # If element is inline math, worry about it later
            if x.source.startswith('$') and x.source.endswith('$'):
                continue
            corpus.write(str(x))
        # If element is 'RArg' labelled, e.g. \em for italic, add it to corpus
        elif type(x.source).__name__ == 'RArg':
            corpus.write(str(x.source))


if __name__ == '__main__':
    """Runs if script called on command line"""
    parseFiles()

Links to the rest:

  • Sample .tex file 1 and its pdf
  • Sample .tex file 2 and its pdf
  • Resulting corpus

I'm aware of a related question (Programmatically converting/parsing LaTeX code to plain text), but there seems to be no conclusive answer there.

Answer

To grab all the text from a document, tree.descendants will be a lot friendlier here. This outputs all text in order.

def getText(section):
    for token in section.descendants:
        if isinstance(token, str):
            corpus.write(token)

To capture the edge cases, I wrote a slightly more fleshed-out version. This includes checks for all the conditions you listed above.

from TexSoup import RArg

def getText(section):
    for x in section.descendants:
        if isinstance(x, str):
            # Skip inline math; keep all other plain text
            if x.startswith('$') and x.endswith('$'):
                continue
            corpus.write(str(x))
        # Required arguments (e.g. the text inside \emph{...}) still count as body text
        elif isinstance(x, RArg):
            corpus.write(str(x))
        # Stop once we reach the acknowledgements or the appendix
        elif hasattr(x, 'source') and hasattr(x.source, 'name') and x.source.name in ('acknowledgements', 'appendix'):
            return
