Split string into columns based on distance between values

Question

I am working with unstructured text data exported from a PDF. The original data comes from a table in the PDF that was converted to text format, so all that remains is the general structure of it. A particular section I'm looking at used to be a table.

For example, here is some sample input:

  A        B     C     D         E
 1        2                     3
 4              6     7    

The first line indicates the headers, and the following lines are the values.

Fortunately, the spacing is somewhat preserved: there will always be at least two spaces between columns. However, the actual number of spaces varies, depending on how the parser decided to handle the table's structure.

I want to parse these lines into the following arrays. I would first parse the header to get the columns, and then use that as a template while parsing the rest of the lines.

{"A", "B", "C", "D", "E"}
{"1", "2",  "",  "", "3"}
{"4",  "", "6", "7",  ""}

Is it possible to accurately do this, given only this information?

Recommended answer

I guess you could get the index of each header (A, B, ...) in the String and compare it to the index of each value in every line to find the closest one. I tried it quickly and got this result:

import java.util.HashMap;
import java.util.Map;

// Place this method and the helper methods inside the same class.
public static void main(String[] args) {
    String headerColumn = "  A        B     C     D         E";
    String firstLine = " 1        2                     3";
    String secondLine = " 4              6     7    ";

    // Map each header's character index to its name.
    // Note: every non-space character is treated as its own header,
    // so this only works for single-character header names.
    Map<Integer, String> indexHeaderMap = new HashMap<>();
    for (int i = 0; i < headerColumn.length(); i++) {
        String currChar = String.valueOf(headerColumn.charAt(i));
        if (!currChar.equals(" ")) {
            indexHeaderMap.put(i, currChar);
        }
    }

    // Parse first line
    parseLine(firstLine, indexHeaderMap);
    // Parse second line
    parseLine(secondLine, indexHeaderMap);
}

And the helper methods:

// Prints the nearest header for every non-space character in the line.
// Like the header scan, this treats each character as a separate value.
private static void parseLine(String pLine, Map<Integer, String> pHeaderMap) {
    for (int i = 0; i < pLine.length(); i++) {
        String currChar = String.valueOf(pLine.charAt(i));
        if (!currChar.equals(" ")) {
            int valueColumnIndex = getNearestColumnIndex(i, pHeaderMap);
            System.out.println("Value " + currChar + " is on column " + pHeaderMap.get(valueColumnIndex));
        }
    }
}

// Returns the header index closest to pIndex, or -1 if the map is empty.
private static int getNearestColumnIndex(int pIndex,
        Map<Integer, String> pHeaderMap) {
    // Start from the largest possible distance instead of a magic number,
    // so lines longer than 500 characters are handled too.
    int minDiff = Integer.MAX_VALUE;
    int nearestColumnIndex = -1;
    for (Map.Entry<Integer, String> mapEntry : pHeaderMap.entrySet()) {
        int diff = Math.abs(mapEntry.getKey() - pIndex);
        if (diff < minDiff) {
            minDiff = diff;
            nearestColumnIndex = mapEntry.getKey();
        }
    }

    return nearestColumnIndex;
}

This is the output:

Value 1 is on column A
Value 2 is on column B
Value 3 is on column E
Value 4 is on column A
Value 6 is on column C
Value 7 is on column D

I hope this is helpful enough to get the result you expect!
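
The code in this answer prints one match per character, while the question asks for one array per line. The following is a minimal sketch of how the same nearest-header idea might be extended to build those arrays and to cope with values longer than one character. It assumes, as the question states, that columns are always separated by at least two spaces, and it additionally assumes that a single space never separates two columns. The class and method names (`ColumnSplitter`, `headerOffsets`, `splitRow`) are hypothetical and not part of the original answer.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ColumnSplitter {
    // A cell is a run of non-space characters, optionally joined by single
    // spaces; two or more consecutive spaces end the cell (assumption).
    private static final Pattern CELL = Pattern.compile("\\S+(?: \\S+)*");

    // Returns the start offset of each header cell, left to right.
    static int[] headerOffsets(String header) {
        List<Integer> offsets = new ArrayList<>();
        Matcher m = CELL.matcher(header);
        while (m.find()) {
            offsets.add(m.start());
        }
        return offsets.stream().mapToInt(Integer::intValue).toArray();
    }

    // Builds one cell per header column; columns with no value stay "".
    static String[] splitRow(String line, int[] offsets) {
        String[] row = new String[offsets.length];
        Arrays.fill(row, "");
        if (offsets.length == 0) {
            return row;
        }
        Matcher m = CELL.matcher(line);
        while (m.find()) {
            // Pick the header whose start offset is closest to this cell's start.
            int nearest = 0;
            for (int c = 1; c < offsets.length; c++) {
                if (Math.abs(offsets[c] - m.start()) < Math.abs(offsets[nearest] - m.start())) {
                    nearest = c;
                }
            }
            // If two cells land on the same column, the later one overwrites the earlier.
            row[nearest] = m.group();
        }
        return row;
    }

    public static void main(String[] args) {
        String header = "  A        B     C     D         E";
        int[] offsets = headerOffsets(header);
        System.out.println(Arrays.toString(splitRow(header, offsets)));
        System.out.println(Arrays.toString(splitRow(" 1        2                     3", offsets)));
        System.out.println(Arrays.toString(splitRow(" 4              6     7    ", offsets)));
    }
}
```

For the sample input, each value starts within one character of its header while neighbouring headers are at least six characters apart, so the nearest-offset rule is unambiguous here. On real PDF exports a value may drift closer to a neighbouring header, so the result is only as accurate as the preserved spacing.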
