Extract References from pdf - Python


Question

In my Python project, I need to extract the REFERENCES from PDF research papers. I'm using PyPDF2 to read the PDF and extract its text like this:

import PyPDF2

pdfFileObj = open('fileName.pdf', 'rb')
pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
pageCount = pdfReader.numPages
count = 0
text = ''

while count < pageCount:
    pageObj = pdfReader.getPage(count)
    count += 1
    text += pageObj.extractText()

Now this text can be in any format and I'm unable to identify any heading in it. I can't use find('References'), because the paper can also contain this word anywhere else. Some papers put a number before the heading, like 6 REFERENCES, so I can add a regex for that,

but I'm stuck on papers without any numeric value before the heading.
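The regex idea above could be sketched like this (a heuristic sketch, not code from the question; the pattern and the accepted heading variants are assumptions):

```python
import re

# Heuristic: a references heading is a line containing only the word,
# optionally preceded by a section number like "6" or "6." - both the
# pattern and the heading variants here are assumptions.
heading_re = re.compile(r'(?m)^\s*(?:\d+\.?\s+)?(REFERENCES|References|Bibliography)\s*$')

sample = "some body text\n6 REFERENCES\nArto Anttila. 1995. ..."
match = heading_re.search(sample)
if match:
    refs_block = sample[match.end():].strip()
    print(refs_block)  # text after the heading line
```

This still fails when the heading shares a line with other text, which is exactly the case the extracted string below exhibits.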

Here is the PDF I'm currently working on: A non-projective dependency parser,

and this is how I'm getting its references:

References Arto Anttila. 1995. How to recognise subjects in English. In Karlsson et al., chapt. 9, pp. 315-358. Dekang Lin. 1996. Evaluation of Principar with the Susanne corpus. In John Carroll, editor, Work- shop on Robust Parsing, pages 54-69, Prague. Jason M. Eisner. 1996. Three new probabilistic models for dependency parsing: An exploration. In The 16th International Conference on Compu- tational Linguistics, pages 340-345. Copenhagen. David G. Hays. 1964. Dependency theory: A formalism and some observations. Language, 40(4):511-525. Hans Jiirgen Heringer. 1993. Dependency syntax - basic ideas and the classical model. In Joachim Jacobs, Arnim von Stechow, Wolfgang Sternefeld, and Thee Venneman, editors, Syntax - An In- ternational Handbook of Contemporary Research, volume 1, chapter 12, pages 298-316. Walter de Gruyter, Berlin - New York. Richard Hudson. 1991. English Word Grammar. Basil Blackwell, Cambridge, MA. Arvi Hurskainen. 1996. Disambiguation of morpho- logical analysis in Bantu languages. In The 16th International Conference on Computational Lin- guistics, pages 568-573. Copenhagen. Time J~rvinen. 1994. Annotating 200 million words: the Bank of English project. In The 15th International Conference on Computational Lin- guistics Proceedings, pages 565-568. Kyoto. Fred Karlsson, Atro Voutilainen, Juha Heikkil~, and Arto Anttila, editors. 1995. Constraint Gram- mar: a language-independent system for parsing unrestricted text, volume 4 of Natural Language Processing. Mouton de Gruyter, Berlin and N.Y. Fred Karlsson. 1990. Constraint grammar as a framework for parsing running text. In Hans Karl- gren, editor, Papers presented to the 13th Interna- tional Conference on Computational Linguistics, volume 3, pages 168-173, Helsinki, Finland. Michael McCord. 1990. Slot grammar: A system for simpler construction of practical natural language grammars. 
In lq, Studer, editor, Natural Language and Logic: International Scientific Symposium, Lecture Notes in Computer Science, pages 118- 145. Springer, Berlin. Igor A. Mel'~uk. 1987. Dependency Syntax: Theory and Practice. State University of New York Press, Albany. Christer Samuelsson, Pasi Tapanainen, and Atro Voutilainen. 1996. Inducing constraint gram- mars. In Laurent Miclet and Colin de la Higuera, editors, Grammatical Inference: Learning Syntax from Sentences, volume 1147 of Lecture Notes in Artificial Intelligence, pages 146-155, Springer. Daniel Sleator and Davy Temperley. 1991. Parsing English with a link grammar. Technical Report CMU-CS-91-196, Carnegie Mellon University. Pasi Tapanainen and Time J/irvinen. 1994. Syn- tactic analysis of natural language using linguis- tic rules and corpus-based patterns. In The 15th International Conference on Computational Lin- guistics Proceedings, pages 629-634. Kyoto. Pasi Tapanainen. 1996. The Constraint Grammar Parser CG-2. Number 27 in Publications of the Department of General Linguistics, University of Helsinki. Lucien TesniSre. 1959. l~ldments de syntaxe stvuc- turale, l~ditions Klincksieck, Paris. Atro Voutilainen. 1995. Morphological disambigua- tion. In Karlsson et al., chapter 6, pages 165-284. 71

How can I parse this reference string into the individual references as they appear in the PDF? Any kind of help will be appreciated.

Answer

PDF is a very complex format and I'm not a specialist, but I took the source code of extractText() to see how it works, and by using print('>>>', operator, operands) I could see what values it found in the PDF.

In this document, "Tm" is used to move the position to a new line, so I changed the original code in extractText() to add \n on every "Tm", and I got the text in lines:

Arto Anttila. 1995. How to recognise subjects in 
English. In Karlsson et al., chapt. 9, pp. 315-358. 
Dekang Lin. 1996. Evaluation of Principar with the 
Susanne corpus. In John Carroll, editor, Work- 
shop on Robust Parsing, pages 54-69, Prague. 
Jason M. Eisner. 1996. Three new probabilistic 
models for dependency parsing: An exploration. 
In The 16th International Conference on Compu- 
tational Linguistics, pages 340-345. Copenhagen. 
David G. Hays. 1964. Dependency theory: A 
formalism and some observations. Language, 
40(4):511-525. 

or, with --- between the lines:

---
Arto Anttila. 1995. How to recognise subjects in 
---
English. In Karlsson et al., chapt. 9, pp. 315-358. 
---
Dekang Lin. 1996. Evaluation of Principar with the 
---
Susanne corpus. In John Carroll, editor, Work- 
---
shop on Robust Parsing, pages 54-69, Prague. 
---
Jason M. Eisner. 1996. Three new probabilistic 
---
models for dependency parsing: An exploration. 
---
In The 16th International Conference on Compu- 
---
tational Linguistics, pages 340-345. Copenhagen. 
---
David G. Hays. 1964. Dependency theory: A 
---
formalism and some observations. Language, 
---
40(4):511-525. 

It's still not that useful, but here is the code I used to get this result:

import PyPDF2
from PyPDF2.pdf import *  # to import functions used in original `extractText`

# --- functions ---

def myExtractText(self):  
    # code from original `extractText()`
    # https://github.com/mstamy2/PyPDF2/blob/d7b8d3e0f471530267827511cdffaa2ab48bc1ad/PyPDF2/pdf.py#L2645
    
    text = u_("")

    content = self["/Contents"].getObject()

    if not isinstance(content, ContentStream):
        content = ContentStream(content, self.pdf)
    
    for operands, operator in content.operations:
        # used only for test to see values in variables
        #print('>>>', operator, operands)

        if operator == b_("Tj"):
            _text = operands[0]
            if isinstance(_text, TextStringObject):
                text += _text
        elif operator == b_("T*"):
            text += "\n"
        elif operator == b_("'"):
            text += "\n"
            _text = operands[0]
            if isinstance(_text, TextStringObject):
                text += operands[0]
        elif operator == b_('"'):
            _text = operands[2]
            if isinstance(_text, TextStringObject):
                text += "\n"
                text += _text
        elif operator == b_("TJ"):
            for i in operands[0]:
                if isinstance(i, TextStringObject):
                    text += i
            text += "\n"

        # new code to add `\n` when text moves to new line
        elif operator == b_("Tm"):
            text += '\n'
            
    return text
    
# --- main ---

pdfFileObj = open('A97-1011.pdf', 'rb')
pdfReader = PyPDF2.PdfFileReader(pdfFileObj)

text = ''

for page in pdfReader.pages:
    #text += page.extractText()  # original function
    text += myExtractText(page)  # modified function

# get only text after word `References`
pos = text.lower().find('references')
text = text[pos+len('references '):]
    
# print all at once
print(text)

# print line by line
for line in text.split('\n'):
    print(line)
    print('---')

After digging, it seems Tm also carries operands: the new position x, y. I used these to calculate the distance between lines of text and add \n when the distance is bigger than some value. I tested different values, and with 17 I got the expected result:

---
Arto Anttila. 1995. How to recognise subjects in English. In Karlsson et al., chapt. 9, pp. 315-358. 
---
Dekang Lin. 1996. Evaluation of Principar with the Susanne corpus. In John Carroll, editor, Work- shop on Robust Parsing, pages 54-69, Prague. 
---
Jason M. Eisner. 1996. Three new probabilistic models for dependency parsing: An exploration. In The 16th International Conference on Compu- tational Linguistics, pages 340-345. Copenhagen. 
---
David G. Hays. 1964. Dependency theory: A formalism and some observations. Language, 40(4):511-525. 
---

Here is the code:

import PyPDF2
from PyPDF2.pdf import *  # to import functions used in original `extractText`

# --- functions ---

def myExtractText2(self):
    # original code from `page.extractText()`
    # https://github.com/mstamy2/PyPDF2/blob/d7b8d3e0f471530267827511cdffaa2ab48bc1ad/PyPDF2/pdf.py#L2645

    text = u_("")

    content = self["/Contents"].getObject()

    if not isinstance(content, ContentStream):
        content = ContentStream(content, self.pdf)
    
    prev_x = 0
    prev_y = 0
    
    for operands, operator in content.operations:
        # used only for test to see values in variables
        #print('>>>', operator, operands)

        if operator == b_("Tj"):
            _text = operands[0]
            if isinstance(_text, TextStringObject):
                text += _text
        elif operator == b_("T*"):
            text += "\n"
        elif operator == b_("'"):
            text += "\n"
            _text = operands[0]
            if isinstance(_text, TextStringObject):
                text += operands[0]
        elif operator == b_('"'):
            _text = operands[2]
            if isinstance(_text, TextStringObject):
                text += "\n"
                text += _text
        elif operator == b_("TJ"):
            for i in operands[0]:
                if isinstance(i, TextStringObject):
                    text += i
            text += "\n"
            
        elif operator == b_("Tm"):
            x = operands[-2]
            y = operands[-1]

            diff_x = prev_x - x
            diff_y = prev_y - y

            #print('>>>', diff_x, diff_y - y)
            #text += f'| {diff_x}, {diff_y - y} |'
            
            if diff_y > 17 or diff_y < 0:  # (bigger margin) or (move to top in next column)
                text += '\n'
                #text += '\n' # to add empty line between elements
                
            prev_x = x
            prev_y = y
            
    return text
        
# --- main ---
        
pdfFileObj = open('A97-1011.pdf', 'rb')
pdfReader = PyPDF2.PdfFileReader(pdfFileObj)

text = ''

for page in pdfReader.pages:
    #text += page.extractText()  # original function
    text += myExtractText2(page)  # modified function

# get only text after word `References`
pos = text.lower().find('references')
text = text[pos+len('references '):]
    
# print all at once
print(text)

# print line by line
for line in text.split('\n'):
    print(line)
    print('---')


It works for this PDF, but other files may have a different structure or different distances between references, and they may need other changes.

A slightly more universal version: it takes a second argument.

If you run it without the second argument,

 text += myExtractText(page)

then it works like the original extractText() and you get everything in one string.

If the second argument is True,

 text += myExtractText(page, True)

then it adds a new line after every Tm, like in my first version.

If the second argument is an integer number, e.g. 17,

 text += myExtractText(page, 17)

then it adds a new line when the distance is bigger than 17, like in my second version.

import PyPDF2
from PyPDF2.pdf import *  # to import functions used in original `extractText`

# --- functions ---

def myExtractText(self, distance=None):
    # original code from `page.extractText()`
    # https://github.com/mstamy2/PyPDF2/blob/d7b8d3e0f471530267827511cdffaa2ab48bc1ad/PyPDF2/pdf.py#L2645

    text = u_("")

    content = self["/Contents"].getObject()

    if not isinstance(content, ContentStream):
        content = ContentStream(content, self.pdf)
    
    prev_x = 0
    prev_y = 0
    
    for operands, operator in content.operations:
        # used only for test to see values in variables
        #print('>>>', operator, operands)

        if operator == b_("Tj"):
            _text = operands[0]
            if isinstance(_text, TextStringObject):
                text += _text
        elif operator == b_("T*"):
            text += "\n"
        elif operator == b_("'"):
            text += "\n"
            _text = operands[0]
            if isinstance(_text, TextStringObject):
                text += operands[0]
        elif operator == b_('"'):
            _text = operands[2]
            if isinstance(_text, TextStringObject):
                text += "\n"
                text += _text
        elif operator == b_("TJ"):
            for i in operands[0]:
                if isinstance(i, TextStringObject):
                    text += i
            text += "\n"
            
        if operator == b_("Tm"):
        
            if distance is True: 
                text += '\n'
                
            elif isinstance(distance, int):
                x = operands[-2]
                y = operands[-1]

                diff_x = prev_x - x
                diff_y = prev_y - y

                #print('>>>', diff_x, diff_y - y)
                #text += f'| {diff_x}, {diff_y - y} |'
                
                if diff_y > distance or diff_y < 0:  # (bigger margin) or (move to top in next column)
                    text += '\n'
                    #text += '\n' # to add empty line between elements
                    
                prev_x = x
                prev_y = y
            
    return text
        
# --- main ---
        
pdfFileObj = open('A97-1011.pdf', 'rb')
pdfReader = PyPDF2.PdfFileReader(pdfFileObj)

text = ''

for page in pdfReader.pages:
    #text += page.extractText()  # original function
    
    #text += myExtractText(page)        # modified function (works like original version)
    #text += myExtractText(page, True)  # modified function (add `\n` after every `Tm`)
    text += myExtractText(page, 17)  # modified function (add `\n` only if distance is bigger than `17`)

# get only text after word `References`
pos = text.lower().find('references')
text = text[pos+len('references '):]
    
# print all at once
print(text)

# print line by line
for line in text.split('\n'):
    print(line)
    print('---')


BTW: this can be useful not only for References but also for the rest of the text - it seems to split paragraphs.

Result for the beginning of the PDF:

---
A non-projective dependency parser 
---
Pasi Tapanainen and Timo J~irvinen University of Helsinki, Department of General Linguistics Research Unit for Multilingual Language Technology P.O. Box 4, FIN-00014 University of Helsinki, Finland {Pas i. Tapanainen, Timo. Jarvinen}@l ing. Hel s inki. f i 
---
Abstract 
---
We describe a practical parser for unre- stricted dependencies. The parser creates links between words and names the links according to their syntactic functions. We first describe the older Constraint Gram- mar parser where many of the ideas come from. Then we proceed to describe the cen- tral ideas of our new parser. Finally, the parser is evaluated. 
---
1 Introduction 
---
We are concerned with surface-syntactic parsing of running text. Our main goal is to describe syntac- tic analyses of sentences using dependency links that show the he~t-modifier relations between words. In addition, these links have labels that refer to the syntactic function of the modifying word. A simpli- fied example is in Figure 1, where the link between I and see denotes that I is the modifier of see and its syntactic function is that of subject. Similarly, a modifies bird, and it is a determiner. 
---
see bi i ~ d'~b~ bird 
---
Figure 1: Dependencies for sentence I see a bird. 
---
First, in this paper, we explain some central con- cepts of the Constraint Grammar framework from which many of the ideas are derived. Then, we give some linguistic background to the notations we are using, with a brief comparison to other current de- pendency formalisms and systems. New formalism is described briefly, and it is utilised in a small toy grammar to illustrate how the formalism works. Fi- nally, the real parsing system, with a grammar of some 2 500 rules, is evaluated. 
---
64 
---
The parser corresponds to over three man-years of work, which does not include the lexical analyser and the morphological disambiguator, both parts of the existing English Constraint Grammar parser (Karls- son et al., 1995). The parsers can be tested via WWW t . 
---
2 Background 
---
Our work is partly based on the work done with the Constraint Grammar framework that was orig- inally proposed by Fred Karlsson (1990). A de- tMled description of the English Constraint Gram- mar (ENGCG) is in Karlsson et al. (1995). The basic rule types of the Constraint Grammar (Tapanainen, 1996) 2 are REMOVE and SELECT for discarding and se- lecting an alternative reading of a word. Rules also have contextual tests that describe the condition ac- cording to which they may be applied. For example, the rule 
---
