Count unique appearances of a substring in a list of words without knowing the substring?


Problem description

I am trying to count the unique appearances of a substring inside a list of words: go through the list of words, detect whether any substrings of at least a minimum number of characters occur multiple times, and count them. I do not know any of the substrings in advance.

This is a working solution for when you know the substring, but what if you do not know it? There is a minimum character count that the substrings are based on.

It will find all the words where "Book" is a substring of the word, with the PHP function below.
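The asker's function is not reproduced on this page. As a minimal sketch of the known-substring case, assuming a hypothetical helper named countWordsContaining() and a made-up word list, it might look roughly like this:

```php
<?php
// Hypothetical helper (not the asker's original function): counts how many
// words in $words contain $needle, ignoring case.
function countWordsContaining(array $words, string $needle): int
{
    $count = 0;
    foreach ($words as $word) {
        // stripos() returns false when $needle does not occur in $word
        if (stripos($word, $needle) !== false) {
            $count++;
        }
    }
    return $count;
}

// Made-up sample list, for illustration only
$words = ['Book', 'Bookstore', 'bookworm', 'cooking', 'store'];
echo 'book count (' . countWordsContaining($words, 'book') . ")\n";
```

This only works when the needle is known up front, which is exactly the limitation the question is about.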

Wanted result:

book count (5)
stor count (2)


Recommended answer

Given a string of length 100

book bookstore bookworm booking book cooking boring bookingservice.... ok
0123456789...                                                     ... 100

Your algorithm could be:

Investigate substrings from different starting points and with different substring lengths.
Take all substrings starting at position 0 with a length from 1 to 100, so 0-1, 0-2, 0-3, ..., and see whether any of those substrings occurs more than once in the overall string.
Progress through the string by starting at increasing positions, searching all substrings starting from 1 (i.e. 1-2, 1-3, 1-4, ...), and so on until you reach 99-100.
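The enumeration described here can be sketched as two nested loops over start position and length. The helper name repeatedSubstrings() is an assumption made for illustration, and substr_count() is used for counting, which only counts non-overlapping occurrences:

```php
<?php
// Hypothetical helper: returns every substring of $text that occurs more
// than once, trying each start position and each possible length.
function repeatedSubstrings(string $text): array
{
    $n = strlen($text);
    $repeated = [];
    for ($start = 0; $start < $n; $start++) {
        for ($length = 1; $start + $length <= $n; $length++) {
            $candidate = substr($text, $start, $length);
            // substr_count() counts non-overlapping occurrences
            if (substr_count($text, $candidate) > 1) {
                $repeated[$candidate] = true;
            }
        }
    }
    // array keys that look like integers are cast by PHP, so cast them back
    return array_map('strval', array_keys($repeated));
}

print_r(repeatedSubstrings('book bookworm'));
```

Without any length limits this tries every possible substring, so it is only practical for short inputs.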

Keep a table of all substrings and their number of occurrences; you can then sort it.
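In PHP such a table could simply be an associative array keyed by substring. The following is a sketch under that assumption; substringTable() is a hypothetical name, and arsort() is used because it sorts by value in descending order while keeping the keys:

```php
<?php
// Hypothetical helper: builds a table substring => number of occurrences
// for all substrings of $text, sorted by count in descending order.
function substringTable(string $text): array
{
    $n = strlen($text);
    $table = [];
    for ($start = 0; $start < $n; $start++) {
        for ($length = 1; $start + $length <= $n; $length++) {
            $candidate = substr($text, $start, $length);
            // count each distinct substring only once
            if (!isset($table[$candidate])) {
                $table[$candidate] = substr_count($text, $candidate);
            }
        }
    }
    arsort($table); // highest count first, keys are preserved
    return $table;
}

print_r(substringTable('book bookworm'));
```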

You can optimize by specifying a minimum and maximum length, which reduces the number of searches dramatically and improves the accuracy of the hits. Additionally, once you have counted a substring, save it in an array of searched substrings and skip it if you encounter it again (i.e. hits for book that you already counted should not be counted again when you reach the next book substring). Furthermore, you never have to search for strings longer than half of the total string, since they cannot occur twice without overlapping.
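A minimal sketch of these three optimizations, assuming a hypothetical function countBounded() and non-overlapping counting via substr_count():

```php
<?php
// Hypothetical helper: counts repeated substrings of $text whose length lies
// between $min and $max, counting each distinct substring only once.
function countBounded(string $text, int $min, int $max): array
{
    $n = strlen($text);
    // a substring longer than half the text cannot occur twice without overlap
    $max = min($max, intdiv($n, 2));
    $counts = [];
    for ($start = 0; $start < $n; $start++) {
        for ($length = $min; $length <= $max && $start + $length <= $n; $length++) {
            $candidate = substr($text, $start, $length);
            if (isset($counts[$candidate])) {
                continue; // already counted, skip it
            }
            $counts[$candidate] = substr_count($text, $candidate);
        }
    }
    // keep only substrings that occur more than once
    $counts = array_filter($counts, function ($c) {
        return $c > 1;
    });
    arsort($counts);
    return $counts;
}

print_r(countBounded('book bookworm', 3, 16));
```

Using the result array itself as the "already searched" list keeps the lookup cheap, since isset() on an array key does not scan the array.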

For the example string you might run an additional test for the uniqueness of a string. You'd have

o              x ..
oo             x  7
bo             x  7
ok             x  6 
book           x  5
booking        x  2
bookingservice x  1

Disregarding strings shorter than 3 characters (and longer than half of the total string), you'd get

book           x  5
booking        x  2
bookingservice x  1
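Applied to an existing table of counts, this length filter could be sketched as follows. The counts are copied from the table above; filterByLength() is a hypothetical name, and 50 stands for half of the 100-character example string:

```php
<?php
// Counts taken from the table above
$counts = [
    'oo' => 7,
    'bo' => 7,
    'ok' => 6,
    'book' => 5,
    'booking' => 2,
    'bookingservice' => 1,
];

// Hypothetical helper: keeps entries whose key length is within [$min, $max]
function filterByLength(array $counts, int $min, int $max): array
{
    return array_filter(
        $counts,
        function ($substring) use ($min, $max) {
            $len = strlen((string) $substring);
            return $len >= $min && $len <= $max;
        },
        ARRAY_FILTER_USE_KEY // pass the key (the substring) to the callback
    );
}

print_r(filterByLength($counts, 3, 50));
```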

[edit] This would obviously look through all of the string, not just natural words.
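If only substrings inside individual words are wanted, one option is to split the text on whitespace first and count, per substring, how many words contain it. This is a sketch of that variant rather than the approach described above; countAcrossWords() is a hypothetical name:

```php
<?php
// Hypothetical helper: for each substring of at least $min characters, counts
// the number of words in $words that contain it (case-insensitive).
function countAcrossWords(array $words, int $min): array
{
    $counts = [];
    foreach ($words as $word) {
        $word = strtolower($word);
        $n = strlen($word);
        $seenInWord = [];
        for ($start = 0; $start < $n; $start++) {
            for ($length = $min; $start + $length <= $n; $length++) {
                $seenInWord[substr($word, $start, $length)] = true;
            }
        }
        // each word contributes at most once per distinct substring
        foreach (array_keys($seenInWord) as $substring) {
            $counts[$substring] = ($counts[$substring] ?? 0) + 1;
        }
    }
    // keep only substrings shared by more than one word
    $counts = array_filter($counts, function ($c) {
        return $c > 1;
    });
    arsort($counts);
    return $counts;
}

$words = preg_split('/\s+/', 'book bookstore bookworm booking', -1, PREG_SPLIT_NO_EMPTY);
print_r(countAcrossWords($words, 4));
```

Because substrings never span two words here, results such as "ok b" cannot appear.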

[edit] Normally I don't like writing code for OPs, but in this case I got a bit interested myself:

<?php
$string = "book bookshelf booking foobar bar booking ";
$string .= "selfservice bookingservice cooking";

function search($string, $min = 4, $max = 16, $threshold = 2) {
    echo "<pre><br/>";
    echo "searching <em>'$string'</em> for string occurances ";
    echo "of length $min - $max: <br/>";

    $hits = array();

    // no string longer than half of the total string will be found twice
    if ($max > strlen($string) / 2) {
        $max = intdiv(strlen($string), 2);
    }

    // examine substrings:
    // start from 0, 1, 2, ... $max - 1
    for ($start = 0; $start < $max; $start++) {

        // and string length $min, $min + 1, ... up to the end of the string
        for ($length = $min; $length < strlen($string); $length++) {

            // get the substring in question,
            // but search for natural words (trim);
            // trimming can leave it shorter than $min
            $substring = trim(substr($string, $start, $length));

            // if the substring was not counted yet,
            // add the found count to the hits
            if ($substring !== '' && !isset($hits[$substring])) {
                // preg_quote() escapes regex metacharacters in the substring
                $pattern = '/' . preg_quote($substring, '/') . '/i';
                preg_match_all($pattern, $string, $matches);
                $hits[$substring] = count($matches[0]);
            }
        }
    }

    // sort the hits array desc by number of hits
    arsort($hits);

    // remove substring hits with fewer hits than the threshold
    foreach ($hits as $substring => $count) {
        if ($count < $threshold) {
            unset($hits[$substring]);
        }
    }

    print_r($hits);
}

search($string);

The comments and variable names should make the code self-explanatory. In your case, $string would come from a file that you read. This example would output:

searching 'book bookshelf booking foobar bar booking selfservice 
bookingservice cooking' for string occurances of length 4 - 16: 
Array
(
    [ook] => 6
    [book] => 5
    [boo] => 5
    [bookin] => 3
    [booking] => 3
    [booki] => 3
    [elf] => 2
)
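In the function above, the outer loop only tries start positions below $max, and the length runs to the end of the string, so repeats that begin later in the text (for example ooking or service) do not show up in the output. A sketch of a variant that tries every start position, caps the length at $max and skips candidates containing whitespace might look like this. The name searchAll() and the decision to return the table instead of printing it are assumptions made for illustration:

```php
<?php
// Hypothetical variant of search(): tries every start position, caps the
// length at $max, and returns the table instead of printing it.
function searchAll(string $string, int $min = 4, int $max = 16, int $threshold = 2): array
{
    $total = strlen($string);
    // no string longer than half of the total string is found twice
    $max = min($max, intdiv($total, 2));
    $hits = [];
    for ($start = 0; $start < $total; $start++) {
        for ($length = $min; $length <= $max && $start + $length <= $total; $length++) {
            $substring = substr($string, $start, $length);
            // a candidate with whitespace spans a word boundary, and so
            // would every longer candidate from the same start position
            if (preg_match('/\s/', $substring)) {
                break;
            }
            if (isset($hits[$substring])) {
                continue; // already counted
            }
            // preg_quote() keeps special characters from acting as regex syntax
            $pattern = '/' . preg_quote($substring, '/') . '/i';
            $hits[$substring] = preg_match_all($pattern, $string);
        }
    }
    // drop substrings with fewer hits than the threshold
    $hits = array_filter($hits, function ($count) use ($threshold) {
        return $count >= $threshold;
    });
    arsort($hits); // sort desc by number of hits
    return $hits;
}

$text = "book bookshelf booking foobar bar booking ";
$text .= "selfservice bookingservice cooking";
print_r(searchAll($text));
```

Since trimming is replaced by the whitespace check, every key in the result is at least $min characters long.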

Let me know how you implement it :)
