How do I remove almost-duplicate integers from a list?

Question

I'm parsing some PDFs in Python. These PDFs are visually organized into rows and columns. The pdftohtml script converts these PDFs to an XML format, full of loose <text> tags which don't have any hierarchy. My code then needs to sort these <text> tags back into rows.

Since each <text> tag has attributes like "top" or "left" coordinates, I wrote code to append <text> items with the same "top" coordinate to a list. This list is effectively one row.

My code first iterates over the page, finds all unique "top" values, and appends them to a tops list. Then it iterates over this tops list. For each unique top value, it searches for all items that have that "top" value and adds them to a row list.

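rows = []  # collects every row found across all sides
for side in page:
    # unique "top" values on this side, in ascending order
    tops = list( set( [ d['top'] for d in side ] ) )
    tops.sort()
    for top in tops:
        row = []
        for blob in side:
            # a blob belongs to this row only if its "top" matches exactly
            if int(blob['top']) == int(top):
                row.append(blob)
        rows.append(row)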

This code works great for the majority of the PDFs I'm parsing. But there are cases where items which are on the same row have slightly different top values, off by one or two.

I'm trying to adapt my code to become a bit fuzzier.

The comparison at the bottom seems easy enough to fix. Something like this:

        for blob in side:
            rangeLower = int(top) - 2
            rangeUpper = int(top) + 2
            thisTop = int(blob['top'])
            if rangeLower <= thisTop <= rangeUpper :
                row.append(blob)

But the list of unique top values that I create first is a problem. The code I use is

    tops = list( set( [ d['top'] for d in side ] ) )

In these edge cases, I end up with a list like:

[925, 946, 966, 995, 996, 1015, 1035]

How could I adapt that code to avoid having "995" and "996" in the list? I want to ensure I end up with just one value when integers are within 1 or 2 of each other.

Recommended answer

  • Sort the list to put the close values next to one another
  • Use reduce to filter the value depending on the previous value
  • Code:

      >>> from functools import reduce  # needed on Python 3, where reduce is no longer a builtin
      >>> tops = [925, 946, 966, 995, 996, 1015, 1035]
      >>> threshold = 2
      >>> reduce(lambda x, y: x + [y] if len(x) == 0 or y > x[-1] + threshold else x, sorted(tops), [])
      [925, 946, 966, 995, 1015, 1035]
      

      With multiple consecutive values:

      >>> tops = range(10)
      >>> reduce(lambda x, y: x + [y] if len(x) == 0 or y > x[-1] + threshold else x, sorted(tops), [])
      [0, 3, 6, 9]
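
      Note that the lambda compares each value with the last value that was kept, not with its immediate neighbour, which is why range(10) yields several values rather than one. If a whole run of close values should collapse to a single representative instead, a variant could compare neighbours. This is only a sketch under that assumption; collapse_runs is a hypothetical name, not part of the original answer.

```python
def collapse_runs(values, threshold=2):
    """Keep the first value of each run whose neighbours differ by <= threshold."""
    result = []
    previous = None  # the value seen just before the current one
    for value in sorted(values):
        # start a new run only when the gap to the previous value is large enough
        if previous is None or value - previous > threshold:
            result.append(value)
        previous = value
    return result


print(collapse_runs([925, 946, 966, 995, 996, 1015, 1035]))  # [925, 946, 966, 995, 1015, 1035]
print(collapse_runs(range(10)))  # [0]
```

      Which behaviour is preferable depends on the data: comparing with the last kept value bounds how far a row can drift, while comparing neighbours lets a long chain of small steps merge into one row.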
      

      Edit

      Reduce can be a little cumbersome to read, so here is a more straightforward approach:

      res = []
      for item in sorted(tops):
          if len(res) == 0 or item > res[-1] + threshold:
              res.append(item)
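
      The same loop can also be applied to the blobs themselves, so that each blob lands in exactly one row and the separate list of tops is not needed. The following is a minimal sketch rather than the asker's actual code: group_rows is a hypothetical helper, and it assumes every blob is a dict whose 'top' and 'left' values can be passed to int().

```python
def group_rows(side, threshold=2):
    """Group blobs whose 'top' lies within `threshold` of the first blob in the row."""
    rows = []
    anchor = None  # 'top' of the first blob in the current row
    for blob in sorted(side, key=lambda b: int(b['top'])):
        top = int(blob['top'])
        if anchor is None or top > anchor + threshold:
            rows.append([])  # start a new row
            anchor = top
        rows[-1].append(blob)
    # order each row left to right (assumes a 'left' attribute, as in the question)
    for row in rows:
        row.sort(key=lambda b: int(b['left']))
    return rows


# made-up sample data for illustration
side = [
    {'top': '995', 'left': '300', 'text': 'b'},
    {'top': '996', 'left': '100', 'text': 'a'},
    {'top': '1015', 'left': '100', 'text': 'c'},
]
rows = group_rows(side)
print([[b['text'] for b in row] for row in rows])  # [['a', 'b'], ['c']]
```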
      
