Text Summarization Evaluation - BLEU vs ROUGE


Question


With the results of two different summary systems (sys1 and sys2) and the same reference summaries, I evaluated them with both BLEU and ROUGE. The problem is: all ROUGE scores of sys1 were higher than those of sys2 (ROUGE-1, ROUGE-2, ROUGE-3, ROUGE-4, ROUGE-L, ROUGE-SU4, ...), but the BLEU score of sys1 was considerably lower than the BLEU score of sys2.


So my question is: both ROUGE and BLEU are n-gram-based measures of the similarity between system summaries and human summaries. Why, then, do the evaluation results differ like this? And what is the main difference between ROUGE and BLEU that explains it?

Answer

In general:


Bleu measures precision: how many of the words (and/or n-grams) in the machine-generated summaries appear in the human reference summaries.
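To make this concrete, here is a minimal sketch of the modified (clipped) unigram precision at the heart of Bleu, assuming pre-tokenized input; the function name `bleu_precision` is illustrative, not part of any standard library:

```python
from collections import Counter

def bleu_precision(candidate, reference, n=1):
    """Modified n-gram precision: the fraction of candidate n-grams that
    also appear in the reference, with counts clipped so a candidate
    n-gram is never credited more often than it occurs in the reference."""
    def ngrams(tokens, n):
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    cand, ref = ngrams(candidate, n), ngrams(reference, n)
    overlap = sum(min(count, ref[gram]) for gram, count in cand.items())
    return overlap / max(sum(cand.values()), 1)

# Every candidate word is in the reference, so precision is perfect,
# even though the candidate misses most of the reference.
print(bleu_precision(["the", "cat"], ["the", "cat", "sat", "on", "the", "mat"]))  # 1.0
```

Note that a short candidate containing only reference words scores full precision; this is exactly the asymmetry the answer goes on to discuss.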


Rouge measures recall: how many of the words (and/or n-grams) in the human reference summaries appear in the machine-generated summaries.


Naturally, these results are complementary, as is often the case with precision vs. recall. If many words from the system results appear in the human references you will have high Bleu, and if many words from the human references appear in the system results you will have high Rouge.


In your case it would appear that sys1 has a higher Rouge than sys2 because its results consistently recovered more words from the human references than sys2's results did. However, since sys1's Bleu score is lower, i.e. sys1 has lower precision than sys2, this suggests that, relative to sys2, a smaller fraction of the words in sys1's results appeared in the human references.


This could happen, for example, if sys1 outputs results that contain the reference words (raising Rouge) but also many words the references didn't include (lowering Bleu). sys2, it seems, gives results in which most of the output words do appear in the human references (raising Bleu), but which also miss many words that appear in the references (lowering Rouge).
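The scenario above can be reproduced with a small worked example; the helper below computes unigram precision (Bleu-like) and recall (Rouge-like) for one candidate/reference pair, and the sys1/sys2 outputs are invented purely for illustration:

```python
from collections import Counter

def overlap_scores(candidate, reference):
    """Unigram precision (Bleu-like) and recall (Rouge-like) for one
    candidate/reference pair, with clipped overlap counts."""
    cand, ref = Counter(candidate), Counter(reference)
    overlap = sum(min(c, ref[w]) for w, c in cand.items())
    return overlap / len(candidate), overlap / len(reference)

reference = "the quick brown fox jumps over the lazy dog".split()
# sys1: recovers every reference word but pads with extras.
sys1 = "the quick brown fox jumps over the lazy dog plus several extra padding words".split()
# sys2: outputs only words found in the reference, but misses most of it.
sys2 = "the quick fox".split()

p1, r1 = overlap_scores(sys1, reference)
p2, r2 = overlap_scores(sys2, reference)
print(p1, r1)  # sys1: lower precision (Bleu), perfect recall (Rouge)
print(p2, r2)  # sys2: perfect precision (Bleu), lower recall (Rouge)
```

Here sys1 beats sys2 on recall while sys2 beats sys1 on precision, mirroring the question's ROUGE-vs-BLEU discrepancy.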


BTW, there's something called the brevity penalty, which is quite important and is part of standard Bleu implementations. It penalizes system results that are shorter than the general length of the reference (read more about it here). This complements the behavior of the n-gram metric, which in effect penalizes results longer than the reference, since the denominator grows with the length of the system result.
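The standard brevity penalty is a simple exponential factor, multiplied into the Bleu score; a sketch of the usual formulation (1 when the candidate is at least reference length, exp(1 - r/c) when shorter):

```python
import math

def brevity_penalty(candidate_len, reference_len):
    """BLEU brevity penalty: no penalty when the candidate is at least
    as long as the reference; otherwise decays exponentially with the
    length ratio."""
    if candidate_len >= reference_len:
        return 1.0
    return math.exp(1 - reference_len / candidate_len)

print(brevity_penalty(9, 9))  # 1.0 (no penalty)
print(brevity_penalty(3, 9))  # exp(1 - 3) ≈ 0.135
```

So a 3-word candidate against a 9-word reference keeps only ~13.5% of its n-gram precision score, which is why very short, high-precision outputs don't automatically win on Bleu.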


You could also implement something similar for Rouge, but this time penalizing system results that are longer than the general reference length, which would otherwise let them obtain artificially high Rouge scores (the longer the result, the higher the chance of hitting some word that appears in the references). In Rouge we divide by the length of the human references, so we would need an additional penalty for longer system results, which could otherwise artificially raise their Rouge score.
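Such a length penalty is not part of any standard ROUGE implementation; as a purely hypothetical sketch, one could mirror Bleu's brevity penalty in the opposite direction:

```python
import math

def rouge_length_penalty(candidate_len, reference_len):
    """Hypothetical mirror of BLEU's brevity penalty (NOT part of any
    standard ROUGE implementation): penalize candidates that are longer
    than the reference, leaving shorter ones untouched."""
    if candidate_len <= reference_len:
        return 1.0
    return math.exp(1 - candidate_len / reference_len)

print(rouge_length_penalty(5, 9))   # 1.0 (no penalty for short output)
print(rouge_length_penalty(18, 9))  # exp(1 - 2) ≈ 0.368
```

Multiplying Rouge recall by this factor would stop a system from inflating its score simply by emitting more words.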


Finally, you could use the F1 measure to make the metrics work together: F1 = 2 * (Bleu * Rouge) / (Bleu + Rouge)
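This is just the harmonic mean of the two scores; a one-liner with a guard for the degenerate all-zero case:

```python
def f1(bleu, rouge):
    """Harmonic mean of Bleu (precision-like) and Rouge (recall-like);
    defined as 0 when both scores are 0."""
    if bleu + rouge == 0:
        return 0.0
    return 2 * bleu * rouge / (bleu + rouge)

# Combining the earlier illustrative sys1 numbers (Bleu-like 0.5625, Rouge-like 1.0):
print(f1(0.5625, 1.0))  # ≈ 0.72
```

As with precision/recall F1, a system must do reasonably well on both metrics to get a high combined score.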
