删除大型数据集的重复项,包括真实重复项(整个行重复)和基于一列的重复项 [英] Delete Duplicates for Large Dataset, Both True Duplicates (Whole row is duplicate) and Duplicate Based on One Column

查看:120
本文介绍了删除大型数据集的重复项,包括真实重复项(整个行重复)和基于一列的重复项的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

我有一个相当大的数据集.最小行数在8K范围内.我需要在两个条件下删除重复项.第一个就是我所说的真实重复".根据定义,这意味着整行是重复的.这是我适用于这种情况的脚本.

I have a rather large dataset. Minimum rows are in the 8K range. I need to delete duplicates on two conditions. The first would be what I call a "True Duplicate." By definition this means that the entire row is a duplicate. Here is the script I have that works for that scenario.

function removeDuplicates(sheet) {
  var data = sheet.getDataRange().getValues();
  var newData = [];
  var trueDuplicateCount = 0;

  for (var i in data) {
    var row = data[i];
    var duplicate = false;
    for (var j in newData) {
      if (row.join() == newData[j].join()) {//Look for duplicates across all rows. True Duplicate
        duplicate = true;
        trueDuplicateCount = trueDuplicateCount + 1;
      }
    }
    if (!duplicate) {
      newData.push(row);
    }
  }
  sheet.clearContents();
  sheet.getRange(1, 1, newData.length, newData[0].length).setValues(newData);
  return trueDuplicateCount;
}

另一个条件是基于一列信息的重复项.删除真实重复项"后,我需要根据列删除重复项.我想把最早的日期放在另一列.

The other condition would be a duplicate based on one column's info. After having removed "True Duplicates" I need to delete duplicates based on a column. I would like to keep the line with the earliest date on another column.

这是我尝试过的方法,但不适用于这种情况.

function removeDuplicates(sheet) {
  var data = sheet.getDataRange().getValues();
  var newData = [];
  var trueDuplicateCount = 0;
  var diffDateDuplicateCount = 0;

  for (var i in data) {
    var row = data[i];
    var duplicate = false;
    for (var j in newData) {
      if (row.join() == newData[j].join()) {//Look for duplicates across all rows. True Duplicate
        duplicate = true;
        trueDuplicateCount = trueDuplicateCount + 1;
      }
      if(row[1] == newData[j][1] && row[0] > newData[j][0]){
        duplicate = true;
        diffDateDuplicateCount = diffDateDuplicateCount + 1
      }
    }
    if (!duplicate) {
      newData.push(row);
    }
  }
  sheet.clearContents();
  sheet.getRange(1, 1, newData.length, newData[0].length).setValues(newData);
  return [trueDuplicateCount, diffDateDuplicateCount];
}

以下是数据集的示例

删除真实重复项"后

删除重复日期较晚的副本

上面是脚本运行后剩下的内容,然后函数还将返回一个数组,其中删除了每种重复类型的计数.

The above is what would be left after the script ran and then the function would also return an array with the count for each type of duplicate deleted.

当前脚本适用于True Duplicates部分,但是我担心速度以及在更大的数据集上是否可能超时.对于8K行,我已经看到了将近3分钟的运行时间.话虽如此,这是我的要求.

The current script works for the True Duplicates portion, but I'm concerned about speed and possibly timing out on a larger dataset. With 8K rows I'm already seeing almost 3 minute run time. With that said, here are my asks.

条件

  • 速度,速度,速度.有没有更有效的方法来解决这个问题? 这是我最大的担忧.
  • 必须删除日期较晚的重复项,并保留最早的日期.
  • 必须为删除的每种重复类型返回一个计数.

希望这可以消除一些混乱.我已经展示了每个步骤想要做的事情. (Acct Number组成)

Hoping this clears up some confusion. I've shown what I want done with each step. (Acct Number is made up)

关于所选解决方案的评论

我选择了执行速度最快的解决方案.虽然Tanaike和Master都在工作,但我之所以选择Master's是因为我预计将来会有很多产品线.每毫秒都很重要.

I went with the solution that executed the fastest. While both Tanaike's and Master's worked, I went with Master's because I anticipate a lot of lines in the future. Every millisecond counts.

我只想感谢那些特别回答Tanaike付出了很多努力的人.希望这个问题成为删除重复项的圣杯,因为您的非v8解决方案对于那些没有v8的用户仍然非常有用.

I just want to thank those who answered especially Tanaike who put in a lot of work. Hopefully this question becomes the holy grail for deleting duplicates because your non-v8 solution is still great for those without v8.

推荐答案

您可以使用内置的removeDuplicates方法,该方法将就地删除重复项.然后使用哈希对象删除日期重复项:

You can use the inbuilt removeDuplicates method, which will remove duplicates in place. Use hash object to remove date duplicates afterwards:

function remDups(sheet) {
  let sh = sheet || SpreadsheetApp.getActive().getSheetByName('Sheet1');
  let rg = sh.getRange(2, 1, sh.getLastRow() - 1, 2);
  let initDataSz = rg.getNumRows();
  let newRg = rg.removeDuplicates();
  let newDataSz = newRg.getNumRows();
  //console.info({ initDataSz, newDataSz });
  let trueDups = initDataSz - newDataSz;
  let values = newRg.getValues();
  //newRg.copyTo(sh.getRange('C1'));
  newRg.clearContent();

  let out = Object.entries(
    values.reduce((obj, [date, color]) => {
      let oldDate = (obj[color] = obj[color] || Infinity);
      if (oldDate - date > 0) {
        obj[color] = date;
      }
      return obj;
    }, {})
  ).map(e => e.reverse());
  let falseDups = newDataSz - out.length;
  sh.getRange(2, 1, out.length, out[0].length).setValues(out);
  return [`${trueDups}`, `${falseDups}`];
}

性能:

    V8引擎上的15000行
  • 〜2.6秒
  • Performance:

    • ~2.6 seconds for 15000 rows on V8 engine
    • 这篇关于删除大型数据集的重复项,包括真实重复项(整个行重复)和基于一列的重复项的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

查看全文
登录 关闭
扫码关注1秒登录
发送“验证码”获取 | 15天全站免登陆