A more efficient 'remove duplicates' function

Question

I manage Google Sheet lists that sometimes exceed 10,000 rows. For sheets with up to around 5,000 rows, the remove-duplicates function below works fine. But for anything above 5,000, I receive the 'Exceeded maximum execution time' error. I would be grateful for some instruction on how to make the code more efficient, so that it runs smoothly even for sheets with 10k+ rows.

function removeDuplicates() {
  var sheet = SpreadsheetApp.getActiveSheet();
  var data = sheet.getDataRange().getValues();
  var newData = new Array();
  for(i in data){
    var row = data[i];
    var duplicate = false;
    for(j in newData){
      if(row.join() == newData[j].join()){
        duplicate = true;
      }
    }
    if(!duplicate){
      newData.push(row);
    }
  }
  sheet.clearContents();
  sheet.getRange(1, 1, newData.length, newData[0].length).setValues(newData);
}


Answer

There are a couple of things that are making your code slow. Let's look at your two for loops:

for (i in data) {
  var row = data[i];
  var duplicate = false;

  for (j in newData){
    if (row.join() == newData[j].join()) {
      duplicate = true;
    }
  }

  if (!duplicate) {
    newData.push(row);
  }
}

On the face of it, you're doing the right things: For every row in the original data, check if the new data already has a matching row. If it doesn't, add the row to the new data. In the process, however, you're doing a lot of extra work.

Consider, for example, the fact that at any given time, a row in data will have no more than one matching row in newData. But in your inner for loop, after you find that one match, it still continues checking the rest of the rows in newData. The solution to this would be to add a break; after duplicate = true; to stop iterating.
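
Applied to the inner loop above, that change is a single extra statement (a minimal sketch; only the break; line is new):

for (j in newData) {
  if (row.join() == newData[j].join()) {
    duplicate = true;
    break; // a match was found; no need to scan the rest of newData
  }
}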

Consider also that for any given j, the value of newData[j].join() will always be the same. Suppose you have 100 rows in data, and no duplicates (the worst case). By the time your function finishes, you'll have calculated newData[0].join() 99 times, newData[1].join() 98 times... all in all you'll have done almost 5,000 calculations to get the same 99 values. A solution to this is memoization, whereby you store the result of a calculation in order to avoid doing the same calculation again later.
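
Here is one way memoization could look in this code (a sketch; the cache array newDataKeys is introduced purely for illustration). Each row of newData is joined exactly once, when it is first added:

var newDataKeys = []; // newDataKeys[j] caches the value of newData[j].join()

for (i in data) {
  var row = data[i];
  var rowKey = row.join(); // computed once per row of data
  var duplicate = false;

  for (j in newDataKeys) {
    if (rowKey == newDataKeys[j]) {
      duplicate = true;
      break;
    }
  }

  if (!duplicate) {
    newData.push(row);
    newDataKeys.push(rowKey); // cache the key alongside the row
  }
}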

Even if you make those two changes, though, your code's time complexity is still O(n²): with no duplicates, the inner loop runs n(n−1)/2 times in total. If you have 100 rows of data, in the worst case the inner loop will run 4,950 times. For 10,000 rows that number is around 50 million.

However, we can do this in O(n) time instead, if we get rid of the inner loop and reformulate the outer loop like so:

var seen = {};

for (var i in data) {
  var row = data[i];
  var key = row.join();

  if (key in seen) {
    continue;
  }
  seen[key] = true;
  newData.push(row);
}

Here, instead of checking every row of newData for a match against row in every iteration, we store every row we've seen so far as a key in the object seen. Then in each iteration we just have to check whether seen has a key matching the current row, an operation we can do in nearly constant time, or O(1).

As a complete function, here's what it looks like:

function removeDuplicates_() {
  const startTime = new Date();
  const sheet = SpreadsheetApp.getActiveSheet();
  const data = sheet.getDataRange().getValues();
  const numRows = data.length;
  const newData = [];
  const seen = {};

  // Keep only the first occurrence of each row: skip any row whose
  // serialized form has been seen before.
  for (var i = 0, row, key; i < numRows && (row = data[i]); i++) {
    key = JSON.stringify(row);
    if (key in seen) {
      continue;
    }
    seen[key] = true;
    newData.push(row);
  }

  // Write the deduplicated rows back in a single batch operation.
  sheet.clearContents();
  sheet.getRange(1, 1, newData.length, newData[0].length).setValues(newData);

  // Show summary
  const secs = (new Date() - startTime) / 1000;
  SpreadsheetApp.getActiveSpreadsheet().toast(
    Utilities.formatString('Processed %d rows in %.2f seconds (%.1f rows/sec); %d deleted',
                           numRows, secs, numRows / secs, numRows - newData.length),
    'Remove duplicates', -1);
}

// Adds a custom "Scripts" menu so the function can be run from the sheet UI.
function onOpen() {
  SpreadsheetApp.getActive().addMenu('Scripts', [
    { name: 'Remove duplicates', functionName: 'removeDuplicates_' }
  ]);
}

You'll see that instead of using row.join() this code uses JSON.stringify(row), because row.join() is fragile (['a,b', 'c'].join() == ['a', 'b,c'].join(), for example). JSON.stringify isn't free, but it's a good compromise for our purposes.
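
As a quick illustration of that fragility, here is a check you could run yourself, using the example values from above:

var a = ['a,b', 'c'];
var b = ['a', 'b,c'];

Logger.log(a.join() == b.join());                   // true:  both yield "a,b,c"
Logger.log(JSON.stringify(a) == JSON.stringify(b)); // false: ["a,b","c"] vs ["a","b,c"]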

In my tests this processes a simple spreadsheet with 50,000 rows and 2 columns in a little over 8 seconds, or around 6,000 rows per second.
