Removing duplicates, keeping last entry -- optimization

Question

I'm working on a macro that will go through a spreadsheet and remove duplicate entries (rows) based on two criteria that are provided separately in two columns (columns Q and D).

Here is what I have. I tested it out on a small dataset and it's slow.

Sub RemoveDupesKeepLast()
Dim i As Long   'Long rather than Integer: row counts can exceed 32,767
Dim criteria1 As String, criteria2 As String

Application.ScreenUpdating = False
Application.Calculation = xlCalculationManual
Application.EnableEvents = False

'start at bottom of sheet, go up
For i = ActiveSheet.UsedRange.Rows.Count To 2 Step -1

    'if there is no entry, go to next row
    If Cells(i, "Q").Value = "" Then
        GoTo gogo:
    End If

    'set criteria that we will filter for
    criteria1 = Cells(i, "D").Value
    criteria2 = Cells(i, "Q").Value

    'filter for criteria2, then criteria1 to get duplicates
    ActiveSheet.Range("A":"CI").AutoFilter field:=17, Criteria1:=criteria2, Operator:=xlFilterValues
    ActiveSheet.Range("A":"CI").AutoFilter field:=4, Criteria1:=criteria1, Operator:=xlFilterValues

    'if there are duplicates, keep deleting rows until only bottom-most entry is left behind
    Do While Range("Q2", Cells(Rows.Count, "Q").End(xlUp)).Cells.SpecialCells(xlCellTypeVisible).Count > 1
        ActiveSheet.AutoFilter.Range.Offset(1).SpecialCells(xlCellTypeVisible).Cells(1,17).EntireRow.Delete
    Loop

    'reset autofilter
    If ActiveSheet.FilterMode Then
        Cells.AutoFilter
    End If

gogo:
Next i

Application.EnableEvents = True
Application.Calculation = xlCalculationAutomatic
Application.ScreenUpdating = True

End Sub

Is there a different way I can approach this problem to speed things up? As it is right now, I'm basically checking each row until I get to the top. The sheets are actually anywhere from 30,000 rows up to the worksheet maximum. It seems to me that there should be a faster, cleaner way of achieving what I'm trying to do, but I can't seem to think of one.

Answer

100,000 rows × 87 columns in 40.3 seconds.

If your data set(s) start at 30K rows and only get bigger you should be looking to in-memory processing whenever possible¹. I've adapted the methods used in this solution to more closely follow your requirements.
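
The core of the in-memory pattern is one bulk read of the range into a Variant array, all comparison work done in memory, and one bulk write back. A minimal sketch of that pattern follows; the worksheet name "Sheet1" and the empty loop body are placeholders for illustration only.

Sub inMemoryPatternSketch()
    Dim vData As Variant, r As Long

    'one bulk read: the contiguous block around A1 lands in a 1-based 2-D array
    vData = Worksheets("Sheet1").Cells(1, 1).CurrentRegion.Value

    'work entirely in memory - no cell-by-cell access inside the loop
    For r = LBound(vData, 1) To UBound(vData, 1)
        'inspect or modify vData(r, c) here; e.g. vData(r, 17) holds column Q for row r
    Next r

    'one bulk write back to the same footprint
    Worksheets("Sheet1").Cells(1, 1).Resize(UBound(vData, 1), UBound(vData, 2)).Value = vData
End Sub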

The routine below bulk loads all values into a variant array and builds a Scripting.Dictionary object from the results. The 'overwrite' method of adding keys to the dictionary is used so that only the last entry for each key is kept.
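
As a small, self-contained illustration of that 'overwrite' behaviour (the key and values here are made up for the example), assigning to the dictionary's Item property with a key that already exists simply replaces the stored value, so the last assignment wins:

Sub dictionaryOverwriteSketch()
    Dim d As Object
    Set d = CreateObject("Scripting.Dictionary")
    d.CompareMode = vbTextCompare

    'same key assigned twice - the second assignment overwrites the first
    d.Item("D-value" & ChrW(8203) & "Q-value") = "row 2"
    d.Item("D-value" & ChrW(8203) & "Q-value") = "row 9"

    Debug.Print d.Count                                       'prints 1
    Debug.Print d.Item("D-value" & ChrW(8203) & "Q-value")    'prints "row 9" - the last entry kept
End Sub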

When the collation has been performed, the values are returned to a re-dimensioned variant array and restored to the worksheet en masse.

Module1 (Code)

Option Explicit

Sub removeDupesKeepLast()
    Dim d As Long, dDQs As Object, ky As Variant
    Dim r As Long, c As Long, vVALs As Variant, vTMP As Variant

    'appTGGL bTGGL:=False   'uncomment this when you have finished debugging

    Set dDQs = CreateObject("Scripting.Dictionary")
    dDQs.comparemode = vbTextCompare

    'step 1 - bulk load the values
    With Worksheets("Sheet1")   'you should know what worksheet you are on
        With .Cells(1, 1).CurrentRegion 'block of data radiating out from A1
            With .Resize(.Rows.Count - 1, .Columns.Count).Offset(1, 0) 'step off the header row
                vVALs = .Value  'use .Value2 if you do not have dates in unformatted cells
            End With
        End With
    End With

    'step 2 - build the dictionary
    ReDim vTMP(UBound(vVALs, 2) - 1)
    For r = LBound(vVALs, 1) To UBound(vVALs, 1)
        For c = LBound(vVALs, 2) To UBound(vVALs, 2)
            vTMP(c - 1) = vVALs(r, c)
        Next c
        dDQs.Item(vVALs(r, 4) & ChrW(8203) & vVALs(r, 17)) = vTMP  'composite key: column D, a zero-width space delimiter, column Q
    Next r

    'step 3 - put the de-duplicated values back into the array
    r = 0
    ReDim vVALs(1 To dDQs.Count, LBound(vVALs, 2) To UBound(vVALs, 2))
    For Each ky In dDQs
        r = r + 1
        vTMP = dDQs.Item(ky)
        For c = LBound(vTMP) To UBound(vTMP)
            vVALs(r, c + 1) = vTMP(c)
        Next c
    Next ky

    'step 4 - clear the destination; put the de-duplicated values back into the worksheet and reset .UsedRange
    With Worksheets("Sheet1")   'you should know what worksheet you are on
        With .Cells(1, 1).CurrentRegion 'block of data radiating out from A1
            With .Resize(.Rows.Count - 1, .Columns.Count).Offset(1, 0) 'step off the header row
                .ClearContents  'retain formatting if it is there
                .Cells(1, 1).Resize(UBound(vVALs, 1), UBound(vVALs, 2)) = vVALs
            End With
        End With
        .UsedRange   'assert the UsedRange property (refreshes it)
    End With

    dDQs.RemoveAll: Set dDQs = Nothing

    appTGGL
End Sub

Public Sub appTGGL(Optional bTGGL As Boolean = True)
    With Application
        .ScreenUpdating = bTGGL
        .EnableEvents = bTGGL
        .DisplayAlerts = bTGGL
        .AutoRecover.Enabled = bTGGL   'no interruptions with an auto-save
        .Calculation = IIf(bTGGL, xlCalculationAutomatic, xlCalculationManual)
        .CutCopyMode = False
        .StatusBar = vbNullString
    End With
    Debug.Print Timer
End Sub
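
Note the pairing: appTGGL bTGGL:=False at the top of removeDupesKeepLast (left commented out while debugging) switches off screen updating, events, alerts, auto-recover and automatic calculation, and the bare appTGGL call at the end switches them back on; the Debug.Print Timer line is what produced the timing figure quoted above.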

My sample workbook took 100K rows × 87 columns with ~24% duplicates and processed all duplicates (keeping the last entries) in ~40 seconds. The above writes back to Sheet1; my tests were run writing back to Sheet2 in order to retain the original data. If you choose to write back to a different worksheet, make sure that there are some values starting at A1 so that the Range.CurrentRegion property can be properly identified. The test machine was an older laptop running 32-bit Excel 2010; your own results will likely vary.
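
If you do want the de-duplicated results on a second sheet rather than written over the source, only step 4 needs to change. A minimal sketch of that variation follows, assuming a worksheet named "Sheet2" already exists and already carries the same header row in row 1.

    'step 4 (alternative) - write the de-duplicated values to Sheet2, leaving the source sheet untouched
    With Worksheets("Sheet2")
        .Cells(2, 1).Resize(UBound(vVALs, 1), UBound(vVALs, 2)) = vVALs
        .UsedRange   'assert the UsedRange property (refreshes it)
    End With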

¹ See Highlight Duplicates and Filter by color alternative for tips on dealing with large data sets in Excel.
