Regular expression for recognizing in-text citations

Views: 32 · Posted: 2021/7/6 20:06:49 · Tag: regex


Problem description

I'm trying to create a regular expression to capture in-text citations.

Here are a few example sentences with in-text citations:

... and the reported results in (Nivre et al., 2007) were not representative ...

... two systems used a Markov chain approach (Sagae and Tsujii 2007).

Nivre (2007) showed that ...

... for attaching and labeling dependencies (Chen et al., 2007; Dredze et al., 2007).

Currently, the regular expression I have is

\(\D*\d\d\d\d\)

This matches examples 1-3, but not example 4. How can I modify it to capture example 4?
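For reference, the behavior described above can be reproduced with a short Python sketch. The `examples` list holds the four sentences from the question. The `repeated` pattern is a hypothetical variant added for illustration only (it does not come from any of the answers): it repeats the "non-digits followed by a four-digit year" unit, so a parenthetical with several years can still reach the closing parenthesis.

```python
import re

# The four example sentences from the question.
examples = [
    "... and the reported results in (Nivre et al., 2007) were not representative ...",
    "... two systems used a Markov chain approach (Sagae and Tsujii 2007).",
    "Nivre (2007) showed that ...",
    "... for attaching and labeling dependencies (Chen et al., 2007; Dredze et al., 2007).",
]

# The pattern from the question: after the first year it requires ")" immediately.
current = re.compile(r"\(\D*\d\d\d\d\)")

# Hypothetical variant (an assumption, not from the answers):
# allow the "non-digits + year" unit to repeat inside one pair of parentheses.
repeated = re.compile(r"\((?:\D*\d{4})+\)")


def first_match(pattern, sentence):
    """Return the first matched substring, or None if there is no match."""
    m = pattern.search(sentence)
    return m.group(0) if m else None


current_hits = [first_match(current, s) for s in examples]
repeated_hits = [first_match(repeated, s) for s in examples]

for sentence, a, b in zip(examples, current_hits, repeated_hits):
    print(repr(a), "|", repr(b))
```

In example 4 the original pattern fails because `\D*` stops at the first year and the next character is `;` rather than `)`. Note that both patterns capture the whole parenthetical, and for example 3 only `(2007)` without the author name.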

Thanks!

Solution

Building on Tex's answer, I've written a very simple Python script called Overcite to do this for a friend (end of semester, lazy referencing you know how it is). It's open source and MIT licensed on Bitbucket.

It covers a few more cases than Tex's which might be helpful (see the test file), including ampersands and references with page numbers. The whole script is basically:

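import re

# One capitalized surname; apostrophes, backticks and hyphens are allowed.
author = r"(?:[A-Z][A-Za-z'`-]+)"
etal = r"(?:et al\.?)"
# Further authors: ", Name", " and Name", " & Name", or " et al."
additional = r"(?:,? (?:(?:and |& )?" + author + "|" + etal + "))"
year_num = r"(?:19|20)[0-9][0-9]"
page_num = r"(?:, p\.? [0-9]+)?"  # Always optional
# The year follows a comma ("..., 2007") or sits in parentheses ("Nivre (2007)").
year = r"(?:, *" + year_num + page_num + r"| *\(" + year_num + page_num + r"\))"
regex = "(" + author + additional + "*" + year + ")"

matches = re.findall(regex, text)  # text: the string to search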
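As a rough illustration of how the composed pattern behaves, the sketch below runs it over the four sentences from the question plus one extra sentence. The fragments are written here as raw strings with the literal dots escaped. The last sentence (Smith & Jones with a page number) is a made-up example, not taken from the Overcite test file.

```python
import re

author = r"(?:[A-Z][A-Za-z'`-]+)"
etal = r"(?:et al\.?)"
additional = r"(?:,? (?:(?:and |& )?" + author + "|" + etal + "))"
year_num = r"(?:19|20)[0-9][0-9]"
page_num = r"(?:, p\.? [0-9]+)?"
year = r"(?:, *" + year_num + page_num + r"| *\(" + year_num + page_num + r"\))"
regex = re.compile("(" + author + additional + "*" + year + ")")

sentences = [
    "... and the reported results in (Nivre et al., 2007) were not representative ...",
    "... two systems used a Markov chain approach (Sagae and Tsujii 2007).",
    "Nivre (2007) showed that ...",
    "... for attaching and labeling dependencies (Chen et al., 2007; Dredze et al., 2007).",
    # Hypothetical sentence exercising the ampersand and page-number branches.
    "... as argued earlier (Smith & Jones, 2010, p. 45).",
]

# findall returns the text of the single capture group for every match.
results = [regex.findall(s) for s in sentences]

for sentence, found in zip(sentences, results):
    print(found)
# ['Nivre et al., 2007']
# []
# ['Nivre (2007)']
# ['Chen et al., 2007', 'Dredze et al., 2007']
# ['Smith & Jones, 2010, p. 45']
```

Unlike the pattern in the question, each citation inside a multi-citation parenthetical is returned separately. The second sentence yields no match because the `year` fragment, as written, expects either a comma before the year or parentheses around it.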
