How to extract citations from a text (PHP)?

Question

Hello!

I would like to extract all citations from a text. Additionally, the name of the cited person should be extracted. DayLife does this very well.

Example:

"They think it’s ‘game over,’ " one senior administration official said.

The phrase "They think it's 'game over'" and the cited person "one senior administration official" should be extracted.

Do you think that's possible? You can only distinguish between citations and words in quotes if you check whether there's a cited person mentioned.

Example:

"I think it is serious and it is deteriorating," Admiral Mullen said Sunday on CNN’s "State of the Union" program.

The passage "State of the Union" is not a quotation. But how do you detect this? a) You check whether a cited person is mentioned. b) You count the blank spaces in the supposed quotation. If there are fewer than 3 blank spaces it won't be a quotation, right? I would prefer b) since a cited person is not always named.

How to start?

I would first replace all types of quotes by a single type so that you'll have to check for only one quote mark later.

<?php
$text = '';
// Map typographic quote marks to the plain ASCII double quote.
$quote_marks = array('“', '”', '„', '»', '«');
$text = str_replace($quote_marks, '"', $text);
?>

Then I would extract all phrases between quotation marks which contain more than 3 blank spaces:

<?php
function extract_quotations($text) {
   // $found_quotations[0] holds the full matches, [1] the text between the quote marks
   $result = preg_match_all('/"([^"]+)"/', $text, $found_quotations);
   if ($result > 0) {
      // TODO: check for count of blank spaces
      return $found_quotations;
   }
   return array();
}
?>
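
The blank-space check that is only noted as a comment above could look roughly like the following. This is a minimal sketch, not a tested solution: the function name `extract_long_quotations` and the `$threshold` parameter are made up for illustration. It follows the "more than 3 blank spaces" wording, because "State of the Union" itself contains exactly three blanks and would pass a "3 or more" test.

```php
<?php
// Hypothetical helper: keeps only quoted phrases with MORE than $threshold blank spaces.
function extract_long_quotations($text, $threshold = 3) {
    $quotations = array();
    // Capture group 1 is the text between two plain double quotes.
    if (preg_match_all('/"([^"]+)"/', $text, $found) > 0) {
        foreach ($found[1] as $phrase) {
            $phrase = trim($phrase);
            if (substr_count($phrase, ' ') > $threshold) {
                $quotations[] = $phrase;
            }
        }
    }
    return $quotations;
}

$text = '"I think it is serious and it is deteriorating," Admiral Mullen said '
      . 'Sunday on CNN\'s "State of the Union" program.';
print_r(extract_long_quotations($text));
```

With this input only the first phrase survives the filter. Short genuine quotations such as "Not necessarily" would be dropped as well, which is the main weakness of approach b).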

How would you improve it?

I hope you can help me. Thank you very much in advance!

Recommended answer

As ceejayoz already pointed out, this won't fit into a single function. What you're describing in your question (detecting the grammatical function of a quoted part of a sentence, i.e. "I think it is serious and it is deteriorating," vs "State of the Union") would be best solved with a library that can break down natural language into tokens. I am not aware of any such library in PHP, but you can get a sense of the project size from something you would use in Python: http://www.nltk.org/

I think the best you can do is define a set of syntax rules that you verify manually. What about something like this:

abstract class QuotationExtractor {

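    // Registry of every extractor created so far; starts empty so the foreach below is safe.
    protected static $instances = array();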

    public static function getAllPossibleQuotations($string) {
        $possibleQuotations = array();
        foreach (self::$instances as $instance) {
            $possibleQuotations = array_merge(
                $possibleQuotations,
                $instance->extractQuotations($string)
            );
        }
        return $possibleQuotations;
    }

    public function __construct() {
        self::$instances[] = $this;
    }

    public abstract function extractQuotations($string);

}

class RegexExtractor extends QuotationExtractor {

    // Each rule is array(regex, index of the quote group, index of the cited-person group).
    protected $rules = array();

    public function extractQuotations($string) {
        $quotes = array();
        foreach ($this->rules as $rule) {
            preg_match_all($rule[0], $string, $matches, PREG_SET_ORDER);
            foreach ($matches as $match) {
                $quotes[] = array(
                    'quote' => trim($match[$rule[1]]),
                    'cited' => trim($match[$rule[2]])
                );
            }
        }
        return $quotes;
    }

    public function addRule($regex, $quoteIndex, $authorIndex) {
        $this->rules[] = array($regex, $quoteIndex, $authorIndex);
    }

}

$regexExtractor = new RegexExtractor();
$regexExtractor->addRule('/"(.*?)[,.]?\h*"\h*said\h*(.*?)\./', 1, 2);
$regexExtractor->addRule('/"(.*?)\h*"(.*)said/', 1, 2);
$regexExtractor->addRule('/\.\h*(.*)(once)?\h*said[\-]*"(.*?)"/', 3, 1);

class AnotherExtractor extends Quot...

If you have a structure like the above you can run the same text through any/all of them and list the possible quotations to select the correct ones. I've run the code with this thread as input for testing and the result was:

array(4) {
  [0]=>
  array(2) {
    ["quote"]=>
    string(15) "Not necessarily"
    ["cited"]=>
    string(8) "ceejayoz"
  }
  [1]=>
  array(2) {
    ["quote"]=>
    string(28) "They think it's `game over,'"
    ["cited"]=>
    string(34) "one senior administration official"
  }
  [2]=>
  array(2) {
    ["quote"]=>
    string(46) "I think it is serious and it is deteriorating,"
    ["cited"]=>
    string(14) "Admiral Mullen"
  }
  [3]=>
  array(2) {
    ["quote"]=>
    string(16) "Not necessarily,"
    ["cited"]=>
    string(0) ""
  }
}
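
For a quick experiment with the rule-table idea, the same approach can be compressed into one function. The following is a simplified sketch under several assumptions: the function name `extract_cited_quotations`, the two regexes, and the sample sentences are illustrative and are not the rules used to produce the output above. Matches with an empty cited person are skipped.

```php
<?php
// Hypothetical rule table: array(regex, quote group index, cited-person group index).
$rules = array(
    // "<quote>" <speaker> said ...
    array('/"([^"]+)"\h*([^"]*?)\h+said\b/u', 1, 2),
    // "<quote>" said <speaker>.
    array('/"([^"]+)"\h*said\h+([^".]+)\./u', 1, 2),
);

function extract_cited_quotations($text, array $rules) {
    $found = array();
    foreach ($rules as $rule) {
        list($regex, $quoteIndex, $citedIndex) = $rule;
        preg_match_all($regex, $text, $matches, PREG_SET_ORDER);
        foreach ($matches as $match) {
            $cited = trim($match[$citedIndex]);
            if ($cited === '') {
                continue; // no speaker found, so treat it as a non-citation
            }
            $found[] = array(
                // Strip surrounding blanks and a trailing comma or period.
                'quote' => trim($match[$quoteIndex], " \t,."),
                'cited' => $cited,
            );
        }
    }
    return $found;
}

$text = '"I think it is serious and it is deteriorating," Admiral Mullen said '
      . 'Sunday on CNN\'s "State of the Union" program. '
      . '"Not necessarily," said ceejayoz.';
print_r(extract_cited_quotations($text, $rules));
```

On this sample the sketch should report Admiral Mullen and ceejayoz as cited persons, and "State of the Union" is ignored because no rule finds a speaker for it. Every new sentence pattern needs another rule, so the approach only scales so far.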
