Creating an effective word counter including Chinese/Japanese and other accented languages

Published: 2021/9/3 18:49:35 | Tags: php, symbols, word-count, non-ascii-characters


Question

I'm trying to build an effective word counter for strings. I know about PHP's built-in str_word_count function, but it doesn't do what I need, because I have to count words in text that includes English, Chinese, Japanese and other accented characters.

However, str_word_count fails to count such words unless you list the extra characters in its third argument. That is impractical: I would have to add every single Chinese, Japanese and accented character (and so on), which is not what I need.

Tests:

str_word_count('The best tool'); // int(3)
str_word_count('最適なツール'); // int(0)
str_word_count('最適なツール', 0, '最ル'); // int(5)

Anyway, I found this function online. It looked like it could do the job, but sadly it fails to count correctly:

function word_count($str)
{
    if($str === '')
    {
        return 0;
    }

    return preg_match_all("/\p{L}[\p{L}\p{Mn}\p{Pd}'\x{2019}]*/u", $str);
}

Tests:

word_count('The best tool'); // int(3)
word_count('最適なツール'); // int(1)

// With spaces
word_count('最 適 な ツ ー ル'); // int(5)

Basically, I'm looking for a word counter with good UTF-8 support that can count words written in any typical script, including accented characters and other language symbols. Is there a possible solution to this?

Recommended answer

You can take a look at the mbstring extension to work with UTF-8 strings.

mb_split() splits a multibyte string using a regular-expression pattern.

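<?php
// Usage: php mb.php "text to count"
printf("Counting words in: %s\n", $argv[1]);
// Make both the mbstring regex engine and the string functions use UTF-8.
mb_regex_encoding('UTF-8');
mb_internal_encoding('UTF-8');
// Split on a single space; each resulting piece is counted as one word.
$r = mb_split(' ', $argv[1]);
print_r($r);
printf("Word count: %d\n", count($r));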

$ php mb.php "foo bar"
Counting words in: foo bar
Array
(
    [0] => foo
    [1] => bar
)
Word count: 2


$ php mb.php "最適な ツール"
Counting words in: 最適な ツール
Array
(
    [0] => 最適な 
    [1] => ツール
)
Word count: 2

~~Note: I had to add 2 spaces between characters to get a correct count.~~ Fixed by setting mb_regex_encoding() and mb_internal_encoding() to UTF-8.
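
As a hedged aside: splitting on one literal space still produces empty entries when the input contains runs of spaces, tabs, or the full-width ideographic space (U+3000). Here is a minimal sketch of a more tolerant splitter, using preg_split() with the u modifier instead of mb_split(). The function name count_space_delimited_words is made up for illustration:

```php
<?php
// Hypothetical helper (not from the original answer): counts chunks of text
// separated by any run of whitespace or Unicode space separators.
function count_space_delimited_words(string $text): int
{
    // \s covers spaces, tabs and newlines; \p{Z} adds Unicode separators such as U+3000.
    $parts = preg_split('/[\s\p{Z}]+/u', $text, -1, PREG_SPLIT_NO_EMPTY);

    // preg_split() returns false on failure (for example invalid UTF-8); treat that as zero words.
    return $parts === false ? 0 : count($parts);
}

var_dump(count_space_delimited_words('foo   bar'));            // int(2)
var_dump(count_space_delimited_words("最適な\u{3000}ツール")); // int(2)
```

Like the mb_split() version, this only counts space-delimited chunks, so unspaced Chinese or Japanese text still counts as a single word.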

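However, Chinese has no notion of a space-delimited "word" (and the same can be true of Japanese in some cases), so you may never get relevant results this way.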
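
One alternative worth trying, which is not part of the original answer: PHP's intl extension exposes ICU's word-break iterator through IntlBreakIterator, and ICU segments Chinese and Japanese text with a built-in dictionary. Here is a minimal sketch, assuming the intl extension is installed. The helper name count_words_icu and the choice to count only segments that contain a letter or digit are assumptions made for illustration:

```php
<?php
// Hypothetical helper: counts the segments reported by ICU's word-break
// iterator, skipping segments that are only whitespace or punctuation.
function count_words_icu(string $text, string $locale = 'ja'): int
{
    $iterator = IntlBreakIterator::createWordInstance($locale);
    if ($iterator === null) {
        return 0; // The iterator could not be created for this locale.
    }
    $iterator->setText($text);

    $count = 0;
    // getPartsIterator() yields the substring between each pair of boundaries.
    foreach ($iterator->getPartsIterator() as $part) {
        // Assumption: a "word" is any segment containing at least one letter or digit.
        if (preg_match('/[\p{L}\p{N}]/u', $part) === 1) {
            $count++;
        }
    }
    return $count;
}

var_dump(count_words_icu('The best tool', 'en')); // int(3)
var_dump(count_words_icu('最適なツール', 'ja'));   // result depends on the ICU dictionary in use
```

The segmentation of unspaced text comes from ICU's data, so counts for Chinese or Japanese input can differ between ICU versions and may not match what a native reader would call a word.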

You may need to write an algorithm that uses a dictionary to determine which groups of characters form a "word".
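
To make the dictionary idea concrete, here is a minimal sketch of greedy longest-match ("forward maximum matching") segmentation. It is only one possible approach. The function name, the three-entry dictionary, and the $maxLength limit are all invented for illustration, and a real word list would be far larger:

```php
<?php
// Hypothetical helper: splits $text into words by repeatedly taking the
// longest dictionary entry that matches at the current position.
function segment_with_dictionary(string $text, array $dictionary, int $maxLength = 4): array
{
    $known = array_flip($dictionary); // word => index, for fast isset() lookups
    $length = mb_strlen($text, 'UTF-8');
    $words = [];
    $pos = 0;

    while ($pos < $length) {
        $matched = null;
        // Try the longest candidate first, then shrink one character at a time.
        for ($len = min($maxLength, $length - $pos); $len >= 1; $len--) {
            $candidate = mb_substr($text, $pos, $len, 'UTF-8');
            if (isset($known[$candidate])) {
                $matched = $candidate;
                break;
            }
        }
        // Unknown text falls back to a single character so the loop always advances.
        if ($matched === null) {
            $matched = mb_substr($text, $pos, 1, 'UTF-8');
        }
        $words[] = $matched;
        $pos += mb_strlen($matched, 'UTF-8');
    }

    return $words;
}

$words = segment_with_dictionary('最適なツール', ['最適', 'な', 'ツール']);
print_r($words);          // 最適, な, ツール
var_dump(count($words));  // int(3)
```

Greedy matching is simple but can pick the wrong split when dictionary entries overlap, which is one reason production segmenters tend to use statistical models or larger curated dictionaries.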
