通过underscore.js中的另一个数组过滤json数据 [英] Filter a json data by another array in underscore.js

查看:71
本文介绍了通过underscore.js中的另一个数组过滤json数据的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有一个搜索字段,我想使用underscore.js添加一些复杂的功能.

有时,用户会搜索整个句子".就像三星银河A20s ultra"一样.我想使用搜索字符串中的任何单词来过滤JSON数据,并按包含更多单词的结果进行排序.

样本数据:

var phones = [
{name: "Samsung A10s", id: 845},
{name: "Samsung galaxy", id: 839},
{name: "Nokia 7", id: 814},
{name: "Samsung S20s ultra", id: 514},
{name: "Apple iphone ultra", id: 159},
{name: "LG S20", id: 854}];

在下划线中执行此操作的最佳方法是什么?

解决方案

在此答案中,我将构建一个带有两个参数的函数searchByRelevance:

  1. 具有nameid属性的手机的JSON数组,并且
  2. 搜索字符串,

并返回一个新的JSON数组,只有name与搜索字符串具有至少一个共同词的手机才能排序,以使最常见词的手机排在首位.

让我们首先确定所有子任务以及如何使用Underscore实现它们.完成此操作后,我们可以将它们组合到searchByRelevance函数中.最后,我还将花一些时间谈论如何确定最佳".

子任务

将字符串分割成单词

您不需要下划线.字符串具有内置split方法:

"Samsung galaxy A20s ultra".split(' ')
// [ 'Samsung', 'galaxy', 'A20s', 'ultra' ]

但是,如果您有一个完整的字符串数组,并且想要将它们全部拆分,那么就得到了一个数组数组,您可以使用 _.invoke :

_.invoke([
    'Samsung A10s',
    'Samsung galaxy',
    'Nokia 7',
    'Samsung S20s ultra',
    'Apple iphone ultra',
    'LG S20'
], 'split', ' ')
// [ [ 'Samsung', 'A10s' ],
//   [ 'Samsung', 'galaxy' ],
//   [ 'Nokia', '7' ],
//   [ 'Samsung', 'S20s', 'ultra' ],
//   [ 'Apple', 'iphone', 'ultra' ],
//   [ 'LG', 'S20' ] ]

找到两个数组共同的词

如果您有两个单词数组,

var words1 = [ 'Samsung', 'galaxy', 'A20s', 'ultra' ],
    words2 = [ 'Apple', 'iphone', 'ultra' ];

然后,您可以使用 _.intersection 来获得一个仅包含它们共同词的新数组:

_.intersection(words1, words2) // [ 'ultra' ]

计算数组中的单词数

这再次是您不需要Underscore的东西:

[ 'Samsung', 'A10s' ].length // 2

但是,如果您有多个单词数组,则可以使用 _.map :

_.map([
    [ 'Samsung', 'A10s' ],
    [ 'Samsung', 'galaxy' ],
    [ 'Nokia', '7' ],
    [ 'Samsung', 'S20s', 'ultra' ],
    [ 'Apple', 'iphone', 'ultra' ],
    [ 'LG', 'S20' ]
], 'length')
// [ 2, 2, 2, 3, 3, 2 ]

按某种标准对数组进行排序

_.sortBy 做到这一点.例如,按ID的phones数据:

_.sortBy(phones, 'id')
// [ { name: 'Apple iphone ultra', id: 159 },
//   { name: 'Samsung S20s ultra', id: 514 },
//   { name: 'Nokia 7', id: 814 },
//   { name: 'Samsung galaxy', id: 839 },
//   { name: 'Samsung A10s', id: 845 },
//   { name: 'LG S20', id: 854 } ]

要对降序而不是升序进行排序,可以先对升序进行排序,然后使用 _.last ):

_.sortBy(phones, function(phone) { return _.last(phone.name); })
// [ { name: 'LG S20', id: 854 },
//   { name: 'Nokia 7', id: 814 },
//   { name: 'Samsung S20s ultra', id: 514 },
//   { name: 'Apple iphone ultra', id: 159 },
//   { name: 'Samsung A10s', id: 845 },
//   { name: 'Samsung galaxy', id: 839 } ]

按照某种准则对数组的元素进行分组

除了直接排序以外,我们还可以首先仅按条件对项目进行分组.这是使用 _.groupBy _.first :

_.groupBy(phones, function(phone) { return _.first(phone.name); })
// { S: [ { name: 'Samsung A10s', id: 845 },
//        { name: 'Samsung galaxy', id: 839 },
//        { name: 'Samsung S20s ultra', id: 514 } ],
//   N: [ { name: 'Nokia 7', id: 814 } ],
//   A: [ { name: 'Apple iphone ultra', id: 159 } ],
//   L: [ { name: 'LG S20', id: 854 } ] }

我们已经看到,可以传递键进行排序或分组依据,也可以传递返回某些内容用作标准的函数.我们可以在这里使用第三个选项来代替上面的功能:

_.groupBy(phones, ['name', 0])
// { S: [ { name: 'Samsung A10s', id: 845 },
//        { name: 'Samsung galaxy', id: 839 },
//        { name: 'Samsung S20s ultra', id: 514 } ],
//   N: [ { name: 'Nokia 7', id: 814 } ],
//   A: [ { name: 'Apple iphone ultra', id: 159 } ],
//   L: [ { name: 'LG S20', id: 854 } ] }

获取对象的键

这是 _.keys 的用途:

_.keys({name: "Samsung A10s", id: 845}) // [ 'name', 'id' ]

您也可以使用标准 Object.keys . _.keys在不能使用Object.keys的旧环境中工作.否则,它们是可以互换的.

将一系列事物转化为其他事物

我们之前已经看到过使用_.map来获取多个单词数组的长度.通常,它需要一个数组或对象以及要对该数组或对象的每个元素进行的操作,然后它将返回一个带有结果的数组:

_.map(phones, 'id')
// [ 845, 839, 814, 514, 159, 854 ]
_.map(phones, ['name', 0])
// [ 'S', 'S', 'N', 'S', 'A', 'L' ]
_.map(phones, function(phone) { return _.last(phone.name); })
// [ 's', 'y', '7', 'a', 'a', '0' ]

请注意与_.sortBy_.groupBy的相似性.这是Underscore中的一种通用模式:您有一些东西的集合,并且想要对每个元素做一些事情,以便得出某种结果.您要对每个元素执行的操作称为"iteratee". Underscore具有确保您可以在与iteratee一起使用的所有功能中使用相同的iteratee速记的功能: _.iteratee .

有时,您可能希望对集合的每个元素进行某些操作,并以不同于_.map_.sortBy和其他Underscore函数已经执行的方式的方式组合结果.在这种情况下,您可以使用 _.reduce ,这是它们中最通用的功能.例如,以下是我们通过混合第一个电话名称的第一个字母,第二个电话名称的第二个字母等等来创建电话名称混合的方法:

_.reduce(phones, function(memo, phone, index) {
    return memo + phone.name[index];
}, '')
// 'Sakse0'

传递给_.reduce的功能将为每个电话调用. memo是我们到目前为止构建的结果.该功能的结果将用作我们处理的下一部手机的新memo.通过这种方式,我们可以一次构建一个电话串.在这种情况下,_.reduce的最后一个参数''设置了memo的初始值,因此我们要从头开始.

将多个数组连接为一个数组

为此,我们有 _.flatten :

_.flatten([
    [ 'Samsung', 'A10s' ],
    [ 'Samsung', 'galaxy' ],
    [ 'Nokia', '7' ],
    [ 'Samsung', 'S20s', 'ultra' ],
    [ 'Apple', 'iphone', 'ultra' ],
    [ 'LG', 'S20' ]
])
// [ 'Samsung', 'A10s', 'Samsung', 'galaxy', 'Nokia', '7',
//   'Samsung', 'S20s', 'ultra', 'Apple', 'iphone', 'ultra',
//   'LG', 'S20' ]

将它们放在一起

我们有一系列电话和一个搜索字符串,我们希望以某种方式将这些电话中的每一个与搜索字符串进行比较,最后我们希望将其结果进行组合,以便按相关性获得这些电话.让我们从中间部分开始.

是否其中每个电话"都按门铃?我们正在创建一个iteratee!我们希望它以电话作为参数,并希望它返回其name与搜索字符串相同的单词数.该函数将执行以下操作:

function relevance(phone) {
    return _.intersection(phone.name.split(' '), searchTerms).length;
}

这假定在relevance函数外部定义了searchTerms变量.它必须是在搜索字符串中包含单词的数组.我们待会儿处理;让我们先解决如何合并结果.

虽然有很多可能的方法,但我认为以下内容非常优雅.我先按相关性对电话进行分组,

_.groupBy(phones, relevance)

但是我想省略与搜索字符串有零个单词的电话组:

var groups = _.omit(_.groupBy(phones, relevance), '0');

请注意,我省略了 string '0',而不是 number 0,因为_.groupBy的结果是一个对象,并且对象的键始终是字符串.

现在,我们需要按匹配单词的数量对其余的groups进行排序.通过使用groups

的键,我们知道每个组的匹配单词数

_.keys(groups)

,我们可以先对这些升序进行排序,但是必须注意将它们转换回数字,以便我们将2排在10之前(数字比较),而不是将'10'排在'2'之前(按字典顺序)比较):

_.sortBy(_.keys(groups), Number)

然后我们可以反转此顺序以得出小组的最终顺序.

var tiers = _.sortBy(_.keys(groups), Number).reverse();

现在,我们只需要将此排序的键数组转换为具有实际电话组的数组即可.为此,我们可以使用_.map _.propertyOf :

_.map(tiers, _.propertyOf(groups))

最后,为了使搜索结果具有相关性,我们只需要将其展平为一个大数组即可.

_.flatten(_.map(tiers, _.propertyOf(groups)))

让我们将所有内容包装到我们的searchByRelevance函数中.请记住,我们仍然需要在relevance iteratee之外定义searchTerms:

function searchByRelevance(phones, searchString) {
    var searchTerms = searchString.split(' ');
    function relevance(phone) {
        return _.intersection(phone.name.split(' '), searchTerms).length;
    }
    var groups = _.omit(_.groupBy(phones, relevance), '0');
    var tiers = _.sortBy(_.keys(groups), Number).reverse();
    return _.flatten(_.map(tiers, _.propertyOf(groups)));
}

现在进行测试!

searchByRelevance(phones, 'Samsung galaxy A20s ultra')
// [ { name: 'Samsung galaxy', id: 839 },
//   { name: 'Samsung S20s ultra', id: 514 },
//   { name: 'Samsung A10s', id: 845 },
//   { name: 'Apple iphone ultra', id: 159 } ]

什么是最佳"?

如果您衡量好",则按照代码行数,那么代码越少越好.我们只用了八行代码就实现了上面的searchByRelevance,这看起来还不错.

但是,它有点密集.如果我们使用 chaining 可读性会有所提高. >:

function searchByRelevance(phones, searchString) {
    var searchTerms = searchString.split(' ');
    function relevance(phone) {
        return _.intersection(phone.name.split(' '), searchTerms).length;
    }
    var groups = _.chain(phones)
        .groupBy(relevance)
        .omit('0');
    return groups.keys()
        .sortBy(Number)
        .reverse()
        .map(_.propertyOf(groups.value()))
        .flatten()
        .value();
}

善"的另一个维度是:是性能. searchByRelevance会更快吗?为了大致了解这一点,我们通常会执行最小和最频繁的操作,并计算在给定大小的输入下执行该操作的频率.

searchByRelevance中我们要做的主要事情是比较单词.这不是最小的操作,因为比较单词是由比较字母组成的,但是由于英文单词往往比较短,我们现在可以假装比较两个单词是我们最小且执行最多的操作.这使计算更加容易.

对于每部电话,我们将其名称中的每个单词与搜索字符串中的每个单词进行比较.如果我们有100个电话,并且平均电话名称有3个单词,而搜索字符串有5个单词,那么我们将进行100 * 3 * 5 = 1500个单词比较.

计算机速度很快,所以1500算不上什么.通常,如果执行最小步骤的次数保持在100000(100k)以下,除非最小步骤非常昂贵,否则您甚至根本不会注意到延迟.

但是,输入更大的单词时,单词比较的数量将爆炸性地增长.如果我们有20000(20k)电话,平均名称中的5个单词和10个单词的搜索字符串,那么我们已经进行了100万个单词的比较.这可能意味着在结果显示出来之前凝视您的屏幕几秒钟.

我们可以写一个searchByRelevance的变体,它可以在眨眼间搜索20k带有长名称的电话吗?是的,实际上我们也可以做到一百万甚至更多!我不会逐行介绍细节,但是通过使用适当的查找结构,我们可以提高速度:

// lookup table by word in the name
function createIndex(phones) {
    return _.reduce(phones, function(lookup, phone) {
        _.each(phone.name.split(' '), function(word) {
            var matchingPhones = (lookup[word] || []);
            matchingPhones.push(phone.id);
            lookup[word] = matchingPhones;
        });
        return lookup;
    }, {});
}

// search using lookup tables
function searchByRelevance(phonesById, idsByWord, searchString) {
    var groups = _.chain(searchString.split(' '))
        .map(_.propertyOf(idsByWord))
        .compact()
        .flatten()
        .countBy()
        .pairs()
        .groupBy('1');
    return groups.keys()
        .sortBy(Number)
        .reverse()
        .map(_.propertyOf(groups.value()))
        .flatten(true) // only one level of flattening
        .map('0')
        .map(_.propertyOf(phonesById))
        .value();
}

要使用此功能,我们只需创建一次查找表,然后将其重新用于每次搜索.仅当电话的JSON数据更改时,我们才需要重新创建查找表.

var phonesById = _.indexBy(phones);
var idsByWord = createIndex(phones);

searchByRelevance(phonesById, idsByWord, 'Samsung galaxy A20s ultra')
// [ { name: 'Samsung galaxy', id: 839 },
//   { name: 'Samsung S20s ultra', id: 514 },
//   { name: 'Samsung A10s', id: 845 },
//   { name: 'Apple iphone ultra', id: 159 } ]
searchByRelevance(phonesById, idsByWord, 'Apple')
// [ { name: 'Apple iphone ultra', id: 159 } ]

要了解这有多快,让我们再次计算最小的操作.在createIndex中,最小的最频繁的操作是存储单词和电话ID之间的关联.我们针对每个电话,名称中的每个单词执行一次此操作.在searchByRelevance中,最小的最频繁的操作是在countBy步骤中增加给定电话的相关性.对于搜索字符串中的每个单词,以及与该单词匹配的每个手机,我们都会执行一次.

如果我们做出一些合理的假设,我们可以估计出给定搜索字符串中匹配手机的数量.电话名称中最常见的单词可能是品牌,例如三星",三星",和"Apple".由于至少有十个品牌,我们可以假设与给定搜索词匹配的手机数量通常不到手机总数的10%.因此,执行一次搜索所需的时间是搜索字符串中的单词数乘以电话数乘以10%(即除以10).

因此,如果我们有100个电话,名称中平均包含3个单词,那么索引将需要100 * 3 = 300次,将关联存储在idsByWord查找表中.使用搜索字符串中的5个单词执行搜索仅需要5 * 100 * 10%= 50的相关性增量.这已经比不使用查找表所需的1500个单词比较要快得多,尽管在这种情况下,计算机背后的人员不会注意到它们之间的区别.

通过更大的输入量,带有查找表的方法的速度优势进一步增强:

 ┌───────────────────┬───────┬────────┬───────┐
│ Problem size      │ Small │ Medium │ Large │
├───────────────────┼───────┼────────┼───────┤
│ phones            │   100 │    20k │    1M │
│ words per name    │     3 │      5 │     8 │
│ search terms      │     5 │     10 │    15 │
├───────────────────┼───────┼────────┼───────┤
│ w/o lookup tables │       │        │       │
│ word comparisons  │  1500 │     1M │  120M │
├───────────────────┼───────┼────────┼───────┤
│ w/ lookup tables  │       │        │       │
│ associations      │   300 │   100k │    8M │
│ increments        │    50 │    20k │  1.5M │
└───────────────────┴───────┴────────┴───────┘
 

事实上,这仍然低估了速度优势,因为与给定搜索词匹配的手机所占百分比可能会随着手机数量的增加而下降.

查找表可以使搜索更快.但这是更好吗?如前所述,对于较小的问题,速度差异将不明显.查找表的缺点是,这需要更多的代码,这使它更难理解,并且需要花费更多的精力进行维护.它还需要一个与关联数一样大的查找表,这意味着我们将比以前使用更多的内存.

总之,什么是最佳"?总是取决于不同限制之间的权衡,例如代码大小,速度和内存使用情况.由您决定要如何权衡这些约束.

I have a search field and I want to add some complex functionality using underscore.js.

Sometimes users search for a whole "sentence" like "Samsung galaxy A20s ultra". I want to filter JSON data using any of the words in the search string and sort by results that contain more of the words.

Sample data:

var phones = [
{name: "Samsung A10s", id: 845},
{name: "Samsung galaxy", id: 839},
{name: "Nokia 7", id: 814},
{name: "Samsung S20s ultra", id: 514},
{name: "Apple iphone ultra", id: 159},
{name: "LG S20", id: 854}];

What is the best way to do it in underscore?

解决方案

In this answer, I'll be building a function searchByRelevance that takes two arguments:

  1. a JSON array of phones with name and id properties, and
  2. a search string,

and which returns a new JSON array, with only the phones of which the name has at least one word in common with the search string, sorted such that the phones with the most common words come first.

Let's first identify all the subtasks and how you could implement them with Underscore. Once we've done that, we can compose them into the searchByRelevance function. In the end, I'll also spend some words on how we might determine what is "best".

Subtasks

Split a string into words

You don't need Underscore for this. Strings have a builtin split method:

"Samsung galaxy A20s ultra".split(' ')
// [ 'Samsung', 'galaxy', 'A20s', 'ultra' ]

However, if you have a whole array of strings and you want to split them all, so you get an array of arrays, you can do so using _.invoke:

_.invoke([
    'Samsung A10s',
    'Samsung galaxy',
    'Nokia 7',
    'Samsung S20s ultra',
    'Apple iphone ultra',
    'LG S20'
], 'split', ' ')
// [ [ 'Samsung', 'A10s' ],
//   [ 'Samsung', 'galaxy' ],
//   [ 'Nokia', '7' ],
//   [ 'Samsung', 'S20s', 'ultra' ],
//   [ 'Apple', 'iphone', 'ultra' ],
//   [ 'LG', 'S20' ] ]

Find the words that two arrays have in common

If you have two arrays of words,

var words1 = [ 'Samsung', 'galaxy', 'A20s', 'ultra' ],
    words2 = [ 'Apple', 'iphone', 'ultra' ];

then you can get a new array with just the words they have in common using _.intersection:

_.intersection(words1, words2) // [ 'ultra' ]

Count the number of words in an array

This is again something you don't need Underscore for:

[ 'Samsung', 'A10s' ].length // 2

But if you have multiple arrays of words, you can get the word counts for all of them using _.map:

_.map([
    [ 'Samsung', 'A10s' ],
    [ 'Samsung', 'galaxy' ],
    [ 'Nokia', '7' ],
    [ 'Samsung', 'S20s', 'ultra' ],
    [ 'Apple', 'iphone', 'ultra' ],
    [ 'LG', 'S20' ]
], 'length')
// [ 2, 2, 2, 3, 3, 2 ]

Sort an array by some criterion

_.sortBy does this. For example, the phones data by id:

_.sortBy(phones, 'id')
// [ { name: 'Apple iphone ultra', id: 159 },
//   { name: 'Samsung S20s ultra', id: 514 },
//   { name: 'Nokia 7', id: 814 },
//   { name: 'Samsung galaxy', id: 839 },
//   { name: 'Samsung A10s', id: 845 },
//   { name: 'LG S20', id: 854 } ]

To sort descending instead of ascending, you can first sort ascending and then reverse the result using the builtin reverse method:

_.sortBy(phones, 'id').reverse()
// [ { name: 'LG S20', id: 854 },
//   { name: 'Samsung A10s', id: 845 },
//   { name: 'Samsung galaxy', id: 839 },
//   { name: 'Nokia 7', id: 814 },
//   { name: 'Samsung S20s ultra', id: 514 },
//   { name: 'Apple iphone ultra', id: 159 } ]

You can also pass a criterion function. The function receives the current item and it can do anything, as long as it returns a string or number to use as the rank of the current item. For example, this sorts the phones by the last letter of the name (using _.last):

_.sortBy(phones, function(phone) { return _.last(phone.name); })
// [ { name: 'LG S20', id: 854 },
//   { name: 'Nokia 7', id: 814 },
//   { name: 'Samsung S20s ultra', id: 514 },
//   { name: 'Apple iphone ultra', id: 159 },
//   { name: 'Samsung A10s', id: 845 },
//   { name: 'Samsung galaxy', id: 839 } ]

Group the elements of an array by some criterion

Instead of sorting directly, we might also first only group the items by a criterion. Here's grouping the phones by the first letter of the name, using _.groupBy and _.first:

_.groupBy(phones, function(phone) { return _.first(phone.name); })
// { S: [ { name: 'Samsung A10s', id: 845 },
//        { name: 'Samsung galaxy', id: 839 },
//        { name: 'Samsung S20s ultra', id: 514 } ],
//   N: [ { name: 'Nokia 7', id: 814 } ],
//   A: [ { name: 'Apple iphone ultra', id: 159 } ],
//   L: [ { name: 'LG S20', id: 854 } ] }

We have seen that we can pass keys to sort or group by, or a function that returns something to use as a criterion. There is a third option which we can use here instead of the function above:

_.groupBy(phones, ['name', 0])
// { S: [ { name: 'Samsung A10s', id: 845 },
//        { name: 'Samsung galaxy', id: 839 },
//        { name: 'Samsung S20s ultra', id: 514 } ],
//   N: [ { name: 'Nokia 7', id: 814 } ],
//   A: [ { name: 'Apple iphone ultra', id: 159 } ],
//   L: [ { name: 'LG S20', id: 854 } ] }

Getting the keys of an object

This is what _.keys is for:

_.keys({name: "Samsung A10s", id: 845}) // [ 'name', 'id' ]

You can also do this with the standard Object.keys. _.keys works in old environments where Object.keys doesn't. Otherwise, they are interchangeable.

Turn an array of things into other things

We have previously seen the use of _.map to get the lengths of multiple arrays of words. In general, it takes an array or object and something that you want to be done with each element of that array or object, and it will return an array with the results:

_.map(phones, 'id')
// [ 845, 839, 814, 514, 159, 854 ]
_.map(phones, ['name', 0])
// [ 'S', 'S', 'N', 'S', 'A', 'L' ]
_.map(phones, function(phone) { return _.last(phone.name); })
// [ 's', 'y', '7', 'a', 'a', '0' ]

Note the similarity with _.sortBy and _.groupBy. This is a general pattern in Underscore: you have a collection of something and you want to do something with each element, in order to arrive at some sort of result. The thing you want to do with each element is called the "iteratee". Underscore has a function that ensures you can use the same iteratee shorthands in all functions that work with an iteratee: _.iteratee.

Sometimes you may want to do something with each element of a collection and combine the results in a way that is different from what _.map, _.sortBy and the other Underscore functions already do. In this case, you can use _.reduce, the most general function of them all. For example, here's how we can create a mixture of the names of the phones, by taking the first letter of the name of the first phone, the second letter of the name of the second phone, and so forth:

_.reduce(phones, function(memo, phone, index) {
    return memo + phone.name[index];
}, '')
// 'Sakse0'

The function that we pass to _.reduce is invoked for each phone. memo is the result that we've built so far. The result of the function is used as the new memo for the next phone that we process. In this way, we build our string one phone at a time. The last argument to _.reduce, '' in this case, sets the initial value of memo so we have something to start with.

Concatenate multiple arrays into a single one

For this we have _.flatten:

_.flatten([
    [ 'Samsung', 'A10s' ],
    [ 'Samsung', 'galaxy' ],
    [ 'Nokia', '7' ],
    [ 'Samsung', 'S20s', 'ultra' ],
    [ 'Apple', 'iphone', 'ultra' ],
    [ 'LG', 'S20' ]
])
// [ 'Samsung', 'A10s', 'Samsung', 'galaxy', 'Nokia', '7',
//   'Samsung', 'S20s', 'ultra', 'Apple', 'iphone', 'ultra',
//   'LG', 'S20' ]

Putting it all together

We have an array of phones and a search string, we want to somehow compare each of those phones to the search string, and finally we want to combine the results of that so we get the phones by relevance. Let's start with the middle part.

Does "each of those phones" ring a bell? We are creating an iteratee! We want it to take a phone as its argument, and we want it to return the number of words that its name has in common with the search string. This function will do that:

function relevance(phone) {
    return _.intersection(phone.name.split(' '), searchTerms).length;
}

This assumes that there is a searchTerms variable defined outside of the relevance function. It has to be an array with the words in the search string. We'll deal with this in a moment; let's address how to combine our results first.

While there are many ways possible, I think the following is quite elegant. I start with grouping the phones by relevance,

_.groupBy(phones, relevance)

but I want to omit the group of phones that have zero words in common with the search string:

var groups = _.omit(_.groupBy(phones, relevance), '0');

Note that I'm omitting the string key '0', not the number key 0, because the result of _.groupBy is an object, and the keys of an object are always strings.

Now we need to order the remaining groups by the number of matching words. We know the number of matching words for each group by taking the keys of our groups,

_.keys(groups)

and we can sort these ascending first, but we must take care to cast them back to numbers, so that we will sort 2 before 10 (numerical comparison) instead of '10' before '2' (lexicographical comparison):

_.sortBy(_.keys(groups), Number)

then we can reverse this in order to arrive at the final order of our groups.

var tiers = _.sortBy(_.keys(groups), Number).reverse();

Now we just need to transform this sorted array of keys into an array with the actual groups of phones. To do this, we can use _.map and _.propertyOf:

_.map(tiers, _.propertyOf(groups))

Finally, we only need to flatten this into one big array, in order to have our search results by relevance.

_.flatten(_.map(tiers, _.propertyOf(groups)))

Let's wrap all of this up into our searchByRelevance function. Remember that we still needed to define searchTerms outside of our relevance iteratee:

function searchByRelevance(phones, searchString) {
    var searchTerms = searchString.split(' ');
    function relevance(phone) {
        return _.intersection(phone.name.split(' '), searchTerms).length;
    }
    var groups = _.omit(_.groupBy(phones, relevance), '0');
    var tiers = _.sortBy(_.keys(groups), Number).reverse();
    return _.flatten(_.map(tiers, _.propertyOf(groups)));
}

Now put it to the test!

searchByRelevance(phones, 'Samsung galaxy A20s ultra')
// [ { name: 'Samsung galaxy', id: 839 },
//   { name: 'Samsung S20s ultra', id: 514 },
//   { name: 'Samsung A10s', id: 845 },
//   { name: 'Apple iphone ultra', id: 159 } ]

What is "best"?

If you measure "goodness" by the number of lines of code, then less code is generally better. We implemented searchByRelevance above in just eight lines of code, so that seems pretty good.

It is, however, a bit dense. The number of lines increases, but the readability improves a bit, if we use chaining:

function searchByRelevance(phones, searchString) {
    var searchTerms = searchString.split(' ');
    function relevance(phone) {
        return _.intersection(phone.name.split(' '), searchTerms).length;
    }
    var groups = _.chain(phones)
        .groupBy(relevance)
        .omit('0');
    return groups.keys()
        .sortBy(Number)
        .reverse()
        .map(_.propertyOf(groups.value()))
        .flatten()
        .value();
}

Yet another dimension of "goodness" is performance. Could searchByRelevance be faster? To get a sense of this, we usually take the smallest and most frequent operation, and we calculate how often we'll be executing that operation for a given size of input.

The main thing we'll be doing a lot in searchByRelevance, is comparing words. This is not the smallest operation, because comparing words consists of comparing letters, but because words in English tend to be short, we can pretend for now that comparing two words is our smallest and most executed operation. This makes the calculations a bit easier.

For each phone, we will be comparing each word in its name with each word in our search string. If we have 100 phones, and the average phone name has 3 words, and the search string has 5 words, then we will be making 100 * 3 * 5 = 1500 word comparisons.

Computers are fast, so 1500 is nothing. Generally, if the number of times you execute your smallest step remains under 100000 (100k), you probably won't even notice a delay unless that smallest step is very expensive.

However, the number of word comparisons will grow quite explosively with larger inputs. If we have 20000 (20k) phones, 5 words in the average name and a search string of 10 words, we are already making a million word comparisons. That could mean staring at your screen for a few seconds before the results come in.

Can we write a variant of searchByRelevance that can search 20k phones with long names in an eyeblink? Yes, and in fact we can probably also do a million or more! I won't go into the details line by line, but we can get much better speed by using appropriate lookup structures:

// lookup table by word in the name
function createIndex(phones) {
    return _.reduce(phones, function(lookup, phone) {
        _.each(phone.name.split(' '), function(word) {
            var matchingPhones = (lookup[word] || []);
            matchingPhones.push(phone.id);
            lookup[word] = matchingPhones;
        });
        return lookup;
    }, {});
}

// search using lookup tables
function searchByRelevance(phonesById, idsByWord, searchString) {
    var groups = _.chain(searchString.split(' '))
        .map(_.propertyOf(idsByWord))
        .compact()
        .flatten()
        .countBy()
        .pairs()
        .groupBy('1');
    return groups.keys()
        .sortBy(Number)
        .reverse()
        .map(_.propertyOf(groups.value()))
        .flatten(true) // only one level of flattening
        .map('0')
        .map(_.propertyOf(phonesById))
        .value();
}

To use this, we create the lookup tables once, then reuse them for each search. We need to recreate the lookup tables only if the JSON data of phones change.

var phonesById = _.indexBy(phones);
var idsByWord = createIndex(phones);

searchByRelevance(phonesById, idsByWord, 'Samsung galaxy A20s ultra')
// [ { name: 'Samsung galaxy', id: 839 },
//   { name: 'Samsung S20s ultra', id: 514 },
//   { name: 'Samsung A10s', id: 845 },
//   { name: 'Apple iphone ultra', id: 159 } ]
searchByRelevance(phonesById, idsByWord, 'Apple')
// [ { name: 'Apple iphone ultra', id: 159 } ]

To appreciate how much faster this is, let's count the smallest operations again. In createIndex, the smallest most frequent operation is storing an association between a word and the id of a phone. We do this once for each phone, for each word in its name. In searchByRelevance, the smallest most frequent operation is incrementing the relevance of a given phone in the countBy step. We do this once for each word in the search string, for each phone that matches that word.

We can estimate the number of matching phones for a given search string if we make some reasonable assumptions. The most frequent words in the phone names are probably the brands, such as "Samsung" and "Apple". Since there are at least ten brands, we can assume that the number of phones that match a given search term is generally less than 10% of the total number of phones. So the time it takes to execute one search is the number of words in the search string, times the number of phones, times 10% (i.e., divided by 10).

So if we have 100 phones with on average 3 words in the name, then indexing takes 100 * 3 = 300 times storing an association in the idsByWord lookup table. Performing a search with 5 words in the search string takes only 5 * 100 * 10% = 50 relevance increments. This is already much faster than the 1500 word comparisons we needed without lookup tables, although the human behind the computer will not notice the difference in this case.

The speed advantage of the approach with the lookup table further increases with larger inputs:

┌───────────────────┬───────┬────────┬───────┐
│ Problem size      │ Small │ Medium │ Large │
├───────────────────┼───────┼────────┼───────┤
│ phones            │   100 │    20k │    1M │
│ words per name    │     3 │      5 │     8 │
│ search terms      │     5 │     10 │    15 │
├───────────────────┼───────┼────────┼───────┤
│ w/o lookup tables │       │        │       │
│ word comparisons  │  1500 │     1M │  120M │
├───────────────────┼───────┼────────┼───────┤
│ w/ lookup tables  │       │        │       │
│ associations      │   300 │   100k │    8M │
│ increments        │    50 │    20k │  1.5M │
└───────────────────┴───────┴────────┴───────┘

This is, in fact, still underestimating the speed advantage, since the percentage of phones that match a given search term is likely to drop as the number of phones increases.

Lookup tables make searching much faster. But is it better? As I said before, for small problem sizes, the speed difference will not be noticable. A disadvantage of the lookup tables is that this requires more code, which makes it a bit harder understand, as well as taking more effort to maintain. It also requires a lookup table as large as the number of associations, which means we will be using much more additional memory than before.

To conclude, what is "best" always depends on a tradeoff between different constraints, such as code size, speed and memory usage. It is up to you to decide how you want to weigh these constraints relative to each other.

这篇关于通过underscore.js中的另一个数组过滤json数据的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆