A more efficient 'remove duplicates' function
Problem description
I manage Google Sheet lists that sometimes exceed 10,000 rows. For sheets with up to around 5,000 rows, the remove-duplicates function below works fine. But for anything above 5,000, I receive the 'Exceeded maximum execution time' error. I would be grateful for some instruction on how to make the code more efficient, such that it could run smoothly even for sheets with 10k+ rows.
function removeDuplicates() {
  var sheet = SpreadsheetApp.getActiveSheet();
  var data = sheet.getDataRange().getValues();
  var newData = new Array();
  for (i in data) {
    var row = data[i];
    var duplicate = false;
    for (j in newData) {
      if (row.join() == newData[j].join()) {
        duplicate = true;
      }
    }
    if (!duplicate) {
      newData.push(row);
    }
  }
  sheet.clearContents();
  sheet.getRange(1, 1, newData.length, newData[0].length).setValues(newData);
}
Recommended answer
There are a couple of things that are making your code slow. Let's look at your two for loops:
for (i in data) {
  var row = data[i];
  var duplicate = false;
  for (j in newData) {
    if (row.join() == newData[j].join()) {
      duplicate = true;
    }
  }
  if (!duplicate) {
    newData.push(row);
  }
}
On the face of it, you're doing the right things: For every row in the original data, check if the new data already has a matching row. If it doesn't, add the row to the new data. In the process, however, you're doing a lot of extra work.
Consider, for example, the fact that at any given time, a row in data will have no more than one matching row in newData. But in your inner for loop, after you find that one match, it still continues checking the rest of the rows in newData. The solution to this would be to add a break; after duplicate = true; to stop iterating.
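With that change applied, the inner loop looks like this:

for (j in newData) {
  if (row.join() == newData[j].join()) {
    duplicate = true;
    break; // stop scanning newData as soon as a match is found
  }
}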
Consider also that for any given j, the value of newData[j].join() will always be the same. Suppose you have 100 rows in data, and no duplicates (the worst case). By the time your function finishes, you'll have calculated newData[0].join() 99 times, newData[1].join() 98 times... all in all you'll have done almost 5,000 calculations to get the same 99 values. A solution to this is memoization, whereby you store the result of a calculation in order to avoid doing the same calculation again later.
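As a sketch of that idea (the newKeys array here is my illustration, not code from the original answer; data is the sheet's values, as above), you could compute each row's key once and cache the keys of the rows you keep, so nothing is ever re-joined:

var newData = [];
var newKeys = []; // cached newData[j].join() results
for (var i = 0; i < data.length; i++) {
  var row = data[i];
  var key = row.join(); // computed once per row of data
  var duplicate = false;
  for (var j = 0; j < newKeys.length; j++) {
    if (key == newKeys[j]) { // compare cached strings; no repeated join() work
      duplicate = true;
      break;
    }
  }
  if (!duplicate) {
    newData.push(row);
    newKeys.push(key); // cache the key alongside the kept row
  }
}

This eliminates the repeated join() calls, but the nested loop still does a quadratic number of comparisons.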
Even if you make those two changes, though, your code's time complexity is still O(n²). If you have 100 rows of data, in the worst case the inner loop will run 4,950 times. For 10,000 rows that number is around 50 million.
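(In the worst case the inner loop runs 0 + 1 + ... + (n − 1) = n(n − 1)/2 times: 100 × 99 / 2 = 4,950 for 100 rows, and 10,000 × 9,999 / 2 ≈ 50 million for 10,000.)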
However, we can do this in O(n) time instead, if we get rid of the inner loop and reformulate the outer loop like so:
var seen = {};
for (var i in data) {
  var row = data[i];
  var key = row.join();
  if (key in seen) {
    continue;
  }
  seen[key] = true;
  newData.push(row);
}
Here, instead of checking every row of newData for a row matching row in every iteration, we store every row we've seen so far as a key in the object seen. Then in each iteration we just have to check if seen has a key matching row, an operation we can do in nearly constant time, or O(1).
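As an aside (this variant is not from the original answer): on Apps Script's V8 runtime you could use a native Set for the same near-constant-time membership test. A Set also sidesteps a quirk of the key in seen check: with a plain object, key in seen is true for inherited property names such as 'toString', so a row whose joined key happened to equal one of those would be wrongly dropped as a duplicate.

var seen = new Set();
var newData = [];
for (var i = 0; i < data.length; i++) {
  var row = data[i];
  var key = row.join(); // same key scheme as above
  if (!seen.has(key)) { // average O(1) lookup, no prototype-chain surprises
    seen.add(key);
    newData.push(row);
  }
}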
As a complete function, here's what it looks like:
function removeDuplicates_() {
  const startTime = new Date();
  const sheet = SpreadsheetApp.getActiveSheet();
  const data = sheet.getDataRange().getValues();
  const numRows = data.length;
  const newData = [];
  const seen = {};
  for (var i = 0, row, key; i < numRows && (row = data[i]); i++) {
    key = JSON.stringify(row);
    if (key in seen) {
      continue;
    }
    seen[key] = true;
    newData.push(row);
  }
  sheet.clearContents();
  sheet.getRange(1, 1, newData.length, newData[0].length).setValues(newData);
  // Show summary
  const secs = (new Date() - startTime) / 1000;
  SpreadsheetApp.getActiveSpreadsheet().toast(
    Utilities.formatString('Processed %d rows in %.2f seconds (%.1f rows/sec); %d deleted',
      numRows, secs, numRows / secs, numRows - newData.length),
    'Remove duplicates', -1);
}
function onOpen() {
  SpreadsheetApp.getActive().addMenu('Scripts', [
    { name: 'Remove duplicates', functionName: 'removeDuplicates_' }
  ]);
}
You'll see that instead of using row.join(), this code uses JSON.stringify(row), because row.join() is fragile (['a,b', 'c'].join() == ['a', 'b,c'].join(), for example). JSON.stringify isn't free, but it's a good compromise for our purposes.
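You can check the collision for yourself with a quick snippet in the script editor:

var a = ['a,b', 'c'];
var b = ['a', 'b,c'];
Logger.log(a.join() == b.join());                   // true: join() cannot tell these rows apart
Logger.log(JSON.stringify(a) == JSON.stringify(b)); // false: the quoting keeps them distinct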
In my tests this processes a simple spreadsheet with 50,000 rows and 2 columns in a little over 8 seconds, or around 6,000 rows per second.