Flesch-Kincaid Readability: Improve PHP function

Question

I wrote this PHP code to implement the Flesch-Kincaid Readability Score as a function:

function readability($text) {
    $total_sentences = 1; // one full stop = two sentences => start with 1
    $punctuation_marks = array('.', '?', '!', ':');
    foreach ($punctuation_marks as $punctuation_mark) {
        $total_sentences += substr_count($text, $punctuation_mark);
    }
    $total_words = str_word_count($text);
    $total_syllables = 3; // placeholder: assuming this value since I don't know how to count them
    $score = 206.835-(1.015*$total_words/$total_sentences)-(84.6*$total_syllables/$total_words);
    return $score;
}

Do you have suggestions on how to improve the code? Is it correct? Will it work?

I hope you can help me. Thanks in advance!

Recommended answer

The code looks fine as far as a heuristic goes. Here are some points to consider that make the quantities you need considerably difficult for a machine to calculate:

  1. What is a sentence?

Seriously, what is a sentence? We have periods, but they can also be used for Ph.D., e.g., i.e., Y.M.C.A., and other non-sentence-final purposes. When you consider exclamation points, question marks, and ellipses, you're really doing yourself a disservice by assuming a period will do the trick. I've looked at this problem before, and if you really want a more reliable count of sentences in real text, you'll need to parse the text. This can be computationally intensive and time-consuming, and free resources for it are hard to find. In the end, you still have to worry about the error rate of the particular parser implementation. However, only full parsing will tell you what's a sentence and what's just one of a period's many other uses. Furthermore, if you're using text 'in the wild', such as HTML, you're also going to have to worry about sentences ending not with punctuation but with tag endings. For instance, many sites don't add punctuation to h1 and h2 tags, but they're clearly different sentences or phrases.
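
Short of full parsing, a slightly less naive counter than summing substr_count() results might look like the sketch below. It is only a heuristic under stated assumptions: count_sentences() is a hypothetical helper, the abbreviation list contains just the examples mentioned above, and the choice of which closing tags end a sentence is a guess. It will still miscount, for example when an abbreviation happens to end a sentence.

```php
<?php
// Heuristic sentence counter: a sketch, not a parser.
// count_sentences() and its abbreviation list are illustrative assumptions.
function count_sentences(string $text): int {
    // Assume the end of a heading, paragraph, or list item ends a sentence,
    // even when the author wrote no punctuation there.
    $text = preg_replace('~</(h[1-6]|p|li)\s*>~i', '. ', $text);
    $text = strip_tags($text);

    // Remove the periods inside a few known abbreviations so they are
    // not mistaken for sentence terminators.
    $abbreviations = array('Ph.D.', 'e.g.', 'i.e.', 'Y.M.C.A.');
    foreach ($abbreviations as $abbr) {
        $text = str_ireplace($abbr, str_replace('.', '', $abbr), $text);
    }

    // Split on runs of terminators, so "..." or "?!" counts only once.
    $parts = preg_split('/[.?!]+/', $text, -1, PREG_SPLIT_NO_EMPTY);

    // Ignore fragments that contain no letters or digits.
    $parts = array_filter($parts, function ($part) {
        return preg_match('/[A-Za-z0-9]/', $part) === 1;
    });
    return count($parts);
}
```

Note that, unlike the function in the question, this does not treat ':' as a sentence terminator and does not start the count at 1.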

Syllables aren't something we should be approximating

This is a major hallmark of this readability heuristic, and it's the one that makes it the most difficult to implement. Computational analysis of syllable count in a work requires the assumption that the assumed reader speaks the same dialect as whatever your syllable count generator was trained on. How sounds fall around a syllable is actually a major part of what makes accents accents. If you don't believe me, try visiting Jamaica sometime. What this means is that even if a human were to do the calculations for this by hand, it would still be a dialect-specific score.
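
If you still want a number to replace the hard-coded value in the question, the usual shortcut is to count vowel groups. The sketch below does exactly that, and it is precisely the kind of approximation described above: estimate_syllables() and count_syllables() are hypothetical helpers, the rules assume English spelling, and words such as "fire" that are spoken with one or two syllables depending on the speaker get a single fixed answer.

```php
<?php
// Rough syllable estimate for one English word, based on spelling only.
// estimate_syllables() is an illustrative helper, not a phonetic analysis.
function estimate_syllables(string $word): int {
    $word = strtolower(preg_replace('/[^A-Za-z]/', '', $word));
    if ($word === '') {
        return 0;
    }
    // Each run of consecutive vowels (including y) counts as one syllable.
    $count = preg_match_all('/[aeiouy]+/', $word);

    // A trailing silent "e" after a consonant (as in "make") is not counted,
    // unless the word ends in consonant + "le" (as in "table").
    if ($count > 1
        && preg_match('/[^aeiouy]e$/', $word)
        && !preg_match('/[^aeiouy]le$/', $word)) {
        $count--;
    }
    return max(1, $count);
}

// Sum the estimates over every word that str_word_count() finds.
function count_syllables(string $text): int {
    $total = 0;
    foreach (str_word_count($text, 1) as $word) {
        $total += estimate_syllables($word);
    }
    return $total;
}
```

With these rules "make" comes out as 1, "table" as 2, and "readability" as 5, but plenty of ordinary words will be off by one.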

What is a word?

Not to wax psycholinguistic in the slightest, but you will find that space-separated words and what a speaker conceptualizes as words are quite different. This makes the concept of a computable readability score somewhat questionable.
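
Even before you get to what a speaker thinks, two mechanical definitions of "word" already disagree. The sketch below compares PHP's str_word_count(), which only counts alphabetic runs (optionally containing ' and -), with a plain whitespace split. The two wrapper functions and the sample sentence are made up for illustration.

```php
<?php
// Two mechanical ways to count "words" in the same string.
function count_words_builtin(string $text): int {
    // Counts alphabetic tokens only; digits such as "1999" are skipped.
    return str_word_count($text);
}

function count_words_whitespace(string $text): int {
    // Counts every whitespace-separated token, whatever it contains.
    $tokens = preg_split('/\s+/', trim($text), -1, PREG_SPLIT_NO_EMPTY);
    return count($tokens);
}

$sample = "The well-known ice cream shop opened in 1999.";
echo count_words_builtin($sample), "\n";    // 7
echo count_words_whitespace($sample), "\n"; // 8
```

Neither count knows that a reader may well treat "ice cream" as a single word, which is the point made above.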

So in the end, I can answer your question of 'will it work'. If you're looking to take a piece of text and display this readability score among other metrics to offer some kind of conceivable added value, the discerning user will not bring up all of these questions. If you are trying to do something scientific, or even something pedagogical (which is what this score and those like it were ultimately intended for), I wouldn't really bother. In fact, if you're going to use this to make any kind of suggestions to a user about content that they have generated, I would be extremely hesitant.

A better way to measure the reading difficulty of a text would more likely have something to do with the ratio of low-frequency words to high-frequency words, along with the number of hapax legomena in the text. But I wouldn't pursue actually coming up with a heuristic like this, because it would be very difficult to empirically test anything like it.
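
For what it's worth, the raw ingredients are easy to compute; it's turning them into a validated score that is hard. The sketch below counts hapax legomena (word forms that occur exactly once in the text) and the ratio of uncommon to common tokens. lexical_profile() is a hypothetical helper, and $common_words stands in for a real word-frequency list, which this sketch does not supply.

```php
<?php
// Raw lexical statistics for a text: a sketch, not a readability score.
// lexical_profile() and the $common_words parameter are assumptions made
// for illustration; a real frequency list would have to come from a corpus.
function lexical_profile(string $text, array $common_words): array {
    $words  = str_word_count(strtolower($text), 1);
    $counts = array_count_values($words);

    // Hapax legomena: word forms that occur exactly once in this text.
    $hapaxes = array_filter($counts, function ($n) {
        return $n === 1;
    });

    // Classify every token as high-frequency (in the list) or low-frequency.
    $common = array_flip($common_words);
    $high = 0;
    $low  = 0;
    foreach ($words as $word) {
        if (isset($common[$word])) {
            $high++;
        } else {
            $low++;
        }
    }

    return array(
        'hapax_count'       => count($hapaxes),
        // null when the text contains no high-frequency words at all
        'low_to_high_ratio' => $high > 0 ? $low / $high : null,
    );
}
```

Calling lexical_profile($text, array('the', 'and')) returns both numbers; how to weight them against each other is exactly the part that would be hard to test empirically.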
