PHP - Compare multidimensional sub-arrays to each other and merge on similarity threshold


Introduction - This question was updated on 27 May 2018:

I have one PHP multidimensional array containing 6 sub-arrays, each containing 20 sub-sub-arrays, which in turn each contain 2 elements: a string (header) and an array holding an unspecified number of keywords (keywords).

I am looking to compare each of the 120 sub-sub-arrays to the 100 other sub-sub-arrays contained in the remaining 5 sub-arrays, so that sub-sub-array 1 in sub-array 1 is compared to sub-sub-arrays 1 through 20 in each of sub-arrays 2 through 6, and so forth.

If enough keywords in two sub-sub-arrays are deemed identical and headers are as well, both using Levenshtein distance, the sub-sub-arrays will be merged.


Example script

I have written a script doing exactly this, but for two separate arrays to demonstrate my goal:

<?php
// Maximum Levenshtein distance between two keywords for them to be deemed identical. Lower it for stricter matching, raise it for looser matching.
$lev_point_value = 3;

// Minimum number of identical keyword pairs (those within $lev_point_value) two arrays must have in common to be merged.
$merge_tag_value = 4;

// Maximum Levenshtein distance between two headers for the arrays to be merged. A smaller distance means more similar headers.
$merge_head_value = 22;

// Array 1 - A story about a monkey, with a header and keywords.
$array1 = array (
    "header" => "This is a story about a monkey.",
    "keywords" => array( "Trees", "Monkey", "Flying", "Drink", "Vacation", "Coconut", "Big", "Bonobo", "Climbing" )
);

// Array 2 - Another, slightly different story about a monkey, with a header and keywords.
$array2 = array (
    "header" => "This is another, but different story, about a monkey.",
    "keywords" => array( "Monkey", "Big", "Trees", "Bonobo", "Fun", "Dance", "Cow", "Coconuts" )
);

// Compares the keywords of two arrays. Every pair of keywords whose Levenshtein distance
// is at most $lev_point_value increases $merged_tag, which is then returned.
function sim_tag_index($array1, $array2, $lev_point_value) {
    $merged_tag = 0;
    foreach ($array1["keywords"] as $item1) {
        foreach ($array2["keywords"] as $item2) {
            if (levenshtein($item1, $item2) <= $lev_point_value) {
                $merged_tag++;
            }
        }
    }
    return $merged_tag;
}

// Returns the Levenshtein distance between the headers of two arrays.
function sim_header_index($array1, $array2) {
    return levenshtein($array1["header"], $array2["header"]);
}

// Checks sim_tag_index against $merge_tag_value; if that passes, checks sim_header_index
// against $merge_head_value; if that passes as well, merges the keywords of both arrays.
function merge_on_sim($array1, $array2, $merge_tag_value, $merge_head_value, $lev_point_value) {
    $group = array();
    if (sim_tag_index($array1, $array2, $lev_point_value) >= $merge_tag_value) {
        // Headers are similar when their distance is small, so compare with <=.
        if (sim_header_index($array1, $array2) <= $merge_head_value) {
            $group = array_unique(array_merge($array1["keywords"], $array2["keywords"]));
        }
    }
    return $group;
}

// Print the result of merge_on_sim.
print_r(merge_on_sim($array1, $array2, $merge_tag_value, $merge_head_value, $lev_point_value));
?>


Question:

How can I expand or rewrite my script so that it goes through all the sub-sub-arrays, compares each one to every sub-sub-array found in the other sub-arrays, and then merges the sub-sub-arrays that are deemed identical enough?


Multidimensional Array Structure

$array = array (
    // Sub-array 1
    array (
        // Story 'Monkey 1' - Has identical sub-sub-arrays 'Monkey 2' and 'Monkey 3' and will be merged with them.
        array (
            "header" => "This is a story about a monkey.",
            'keywords' => array( "Trees", "Monkey", "Flying", "Drink", "Vacation", "Coconut", "Big", "Bonobo", "Climbing")
        ),
        // Story 'Cat 1' - Has identical sub-sub-array 'Cat 2' and will be merged with it.
        array (
            "header" => "Here's a catarific story about a cat",
            'keywords' => array( "meauw", "raaaw", "kitty", "growup", "Fun", "claws", "fish", "salmon")
        )
    ),
    // Sub-array 2
    array ( 
        // Story 'Monkey 2' - Has identical sub-sub-arrays 'Monkey 1' and 'Monkey 3' and will be merged with them.
        array (
            "header" => "This is another, but different story, about a monkey.",
            'keywords' => array( "Monkey", "Big", "Trees", "Bonobo", "Fun", "Dance", "Cow", "Coconuts")
        ),
        // Story 'Cat 2' - Has identical sub-sub-array 'Cat 1' and will be merged with it.
        array (
            "header" => "Here's a different story about a cat",
            'keywords' => array( "meauwe", "ball", "cat", "kitten", "claws", "sleep", "fish", "purr")
        )
    ),
    // Sub-array 3
    array ( 
        // Story 'Monkey 3' - Has identical sub-sub-arrays 'Monkey 1' and 'Monkey 2' and will be merged with them.
        array (
            "header" => "This is a third story about a monkey.",
            'keywords' => array( "Jungle", "tree", "monkey", "Bonobo", "Fun", "Dance", "climbing", "Coconut", "pretty")
        ),
        // Story 'Fireman 1' - Has no identical sub-sub-arrays and will not be merged.
        array (
            "header" => "This is a story about a fireman",
            'keywords' => array( "fire", "explosion", "burning", "rescue", "happy", "help", "water", "car")
        )
    )
);


Wanted Multidimensional Array

$array = array (
    // Story 'Monkey 1', 'Monkey 2' and 'Monkey 3' merged.
    array (
        "header" => array( "This is a story about a monkey.", "This is another, but different story, about a monkey.", "This is a third story about a monkey."),
        'keywords' => array( "Trees", "Monkey", "Flying", "Drink", "Vacation", "Coconut", "Big", "Bonobo", "Climbing", "Fun", "Dance", "Cow", "Coconuts", "Jungle", "tree", "pretty")
    ),
    // Story 'Cat 1' and 'Cat 2' merged.
    array (
        "header" => array( "Here's a catarific story about a cat", "Here's a different story about a cat"),
        'keywords' => array( "meauw", "raaaw", "kitty", "growup", "Fun", "claws", "fish", "salmon", "ball", "cat", "kitten", "sleep", "purr")
    )
);

Solution

I'll give it a go!

I use preg_grep to find keywords that also occur in the other sub-arrays. Then I use count to see how many matching keywords there are.
That is where the threshold comes in. Currently it is set to 2, and because the comparison is "greater than", two stories are merged when they have more than two matching keywords.
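
As a small illustration of that counting step on its own, here is a sketch using the keyword lists of 'Monkey 1' and 'Monkey 2' from the question. The variable names are made up for this example.

```php
<?php
// Keyword lists copied from 'Monkey 1' and 'Monkey 2' in the question.
$keywordsA = ["Trees", "Monkey", "Flying", "Drink", "Vacation", "Coconut", "Big", "Bonobo", "Climbing"];
$keywordsB = ["Monkey", "Big", "Trees", "Bonobo", "Fun", "Dance", "Cow", "Coconuts"];

// Build one alternation pattern from the first list; the "i" flag ignores case.
$pattern = "/" . implode("|", $keywordsA) . "/i";

// preg_grep returns the entries of $keywordsB that match the pattern,
// keeping their original keys.
$matches = preg_grep($pattern, $keywordsB);

print_r($matches);
echo count($matches), "\n"; // 5
```

The pattern is not anchored, so "Coconuts" counts as a match for "Coconut". That gives a loose kind of similarity, but it is substring matching, not Levenshtein distance.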

// Flatten the array to make it simpler
$new = [];
foreach ($array as $subarr) {
    $new = array_merge($new, $subarr);
}

$threshold = 2;
$merged = [];
$res = []; // initialise so var_dump works even when nothing is merged
foreach ($new as $key => $story) {
    // Create a regex pattern to find similar items; preg_quote escapes regex metacharacters in the keywords
    $quoted = array_map(function ($word) { return preg_quote($word, "/"); }, $story["keywords"]);
    $words = "/" . implode("|", $quoted) . "/i";
    foreach ($new as $key2 => $story2) {
        // Only look at later items that have not been merged already
        if ($key2 > $key && !in_array($key2, $merged)) {
            // If preg_grep finds more than $threshold matching keywords, the stories are mergeable
            if (count(preg_grep($words, $story2["keywords"])) > $threshold) {
                // debug
                //echo $key . " " . $key2 . "\n";
                //echo $story["header"] . " = " . $story2["header"] . "\n\n";

                // If the item does not exist yet, create it first to avoid notices
                if (!isset($res[$key])) $res[$key] = ["header" => [], "keywords" => []];

                // Add headers
                $res[$key]["header"][] = $story["header"];
                $res[$key]["header"][] = $story2["header"];

                // Only keep unique headers
                $res[$key]["header"] = array_unique($res[$key]["header"]);

                // Add keywords and remove duplicates
                $res[$key]["keywords"] = array_merge($res[$key]["keywords"], $story["keywords"], $story2["keywords"]);
                $res[$key]["keywords"] = array_unique($res[$key]["keywords"]);

                // Add key2 to merged so that we don't merge it again
                $merged[] = $key2;
            }
        }
    }
}

var_dump($res);

https://3v4l.org/6cKRq

The output matches the "wanted" array in the question.
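
The code above matches keywords by substring and does not look at the headers. If the Levenshtein criteria from the question are wanted instead, the same flatten-and-loop idea could be adapted roughly as follows. This is only a sketch: the function names count_similar_keywords and merge_similar_stories are hypothetical, the sample data and thresholds are made up for the example, and the header threshold is treated as a maximum distance, which is my assumption about what the question intends.

```php
<?php
// Hypothetical helper: count the keywords of $a that have at least one
// case-insensitive Levenshtein match (distance <= $maxDist) in $b.
function count_similar_keywords(array $a, array $b, int $maxDist): int {
    $count = 0;
    foreach ($a as $wordA) {
        foreach ($b as $wordB) {
            if (levenshtein(strtolower($wordA), strtolower($wordB)) <= $maxDist) {
                $count++;
                break; // count each keyword of $a at most once
            }
        }
    }
    return $count;
}

// Hypothetical helper: flatten the sub-arrays, then merge every later story
// that is similar enough to an earlier story that has not been merged yet.
function merge_similar_stories(array $array, int $maxWordDist, int $minKeywords, int $maxHeadDist): array {
    $stories = array_merge([], ...array_values($array));
    $merged = []; // keys of stories already absorbed into a group
    $groups = [];
    foreach ($stories as $i => $story) {
        if (isset($merged[$i])) {
            continue;
        }
        $group = ["header" => [$story["header"]], "keywords" => $story["keywords"]];
        foreach ($stories as $j => $other) {
            if ($j <= $i || isset($merged[$j])) {
                continue;
            }
            $enoughKeywords = count_similar_keywords($story["keywords"], $other["keywords"], $maxWordDist) >= $minKeywords;
            $closeHeaders = levenshtein($story["header"], $other["header"]) <= $maxHeadDist;
            if ($enoughKeywords && $closeHeaders) {
                $group["header"][] = $other["header"];
                $group["keywords"] = array_values(array_unique(array_merge($group["keywords"], $other["keywords"])));
                $merged[$j] = true;
            }
        }
        // Keep only groups in which something was actually merged
        if (count($group["header"]) > 1) {
            $groups[] = $group;
        }
    }
    return $groups;
}

// Made-up sample data in the same shape as the question's array.
$array = [
    [
        ["header" => "A story about a monkey", "keywords" => ["Monkey", "Trees", "Coconut"]],
        ["header" => "A story about a fireman", "keywords" => ["fire", "rescue", "water"]],
    ],
    [
        ["header" => "Another story about a monkey", "keywords" => ["monkey", "tree", "Dance"]],
    ],
];

$groups = merge_similar_stories($array, 1, 2, 10);
print_r($groups);
```

Two things to keep in mind with this sketch. Like the code above, it flattens first, so stories inside the same sub-array are compared with each other too. And each candidate is only compared with the first story of a group, so the grouping is not transitive. With a word distance as large as 3, short keywords such as "Big", "Fun" and "Cow" all count as matches for each other, so a smaller word distance or a length-relative one may work better.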
