Handling "","-" CSV with Univocity


Problem description

Any idea how I can get proper lines? Some lines are getting glued together, and I can't figure out why or how to stop it.

  col. 0: Date
  col. 1: Col2
  col. 2: Col3
  col. 3: Col4
  col. 4: Col5
  col. 5: Col6
  col. 6: Col7
  col. 7: Col7
  col. 8: Col8

  col. 0: 2017-05-23
  col. 1: String
  col. 2: lo rem ipsum
  col. 3: dolor sit amet
  col. 4: mcdonalds.com/online.html
  col. 5: null
  col. 6: "","-""-""2017-05-23"
  col. 7: String
  col. 8: lo rem ipsum
  col. 9: dolor sit amet
  col. 10: burgerking.com
  col. 11: https://burgerking.com/
  col. 12: 20
  col. 13: 2
  col. 14: fake

  col. 0: 2017-05-23
  col. 1: String
  col. 2: lo rem ipsum
  col. 3: dolor sit amet
  col. 4: wendys.com
  col. 5: null
  col. 6: "","-""-""2017-05-23"
  col. 7: String
  col. 8: lo rem ipsum
  col. 9: dolor sit amet
  col. 10: buggagump.com
  col. 11: null
  col. 12: "","-""-""2017-05-23"
  col. 13: String
  col. 14: cheese
  col. 15: ad eum
  col. 16: mcdonalds.com/online.html
  col. 17: null
  col. 18: "","-""-""2017-05-23"
  col. 19: String
  col. 20: burger
  col. 21: ludus dissentiet
  col. 22: www.mcdonalds.com
  col. 23: https://www.mcdonalds.com/
  col. 24: 25
  col. 25: 3
  col. 26: fake

  col. 0: 2017-05-23
  col. 1: String
  col. 2: wine
  col. 3: id erat utamur
  col. 4: bubbagump.com
  col. 5: https://buggagump.com/
  col. 6: 25
  col. 7: 3
  col. 8: fake
  done

A sample CSV (the \r\n line endings may have been corrupted when copying and pasting). Available here: https://www.dropbox.com/s/86klza4qok4ty2s/malformed%20csv%20r%20n%20small.csv?dl=0

"Date","Col2","Col3","Col4","Col5","Col6","Col7","Col7","Col8"
"2017-05-23","String","lo rem ipsum","dolor sit amet","mcdonalds.com/online.html","","-","-","-"
"2017-05-23","String","lo rem ipsum","dolor sit amet","burgerking.com","https://burgerking.com/","20","2","fake"
"2017-05-23","String","lo rem ipsum","dolor sit amet","wendys.com","","-","-","-"
"2017-05-23","String","lo rem ipsum","dolor sit amet","buggagump.com","","-","-","-"
"2017-05-23","String","cheese","ad eum","mcdonalds.com/online.html","","-","-","-"
"2017-05-23","String","burger","ludus dissentiet","www.mcdonalds.com","https://www.mcdonalds.com/","25","3","fake"
"2017-05-23","String","wine","id erat utamur","bubbagump.com","https://buggagump.com/","25","3","fake"

Building the settings:

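  import com.univocity.parsers.csv.CsvParser;
  import com.univocity.parsers.csv.CsvParserSettings;
  import java.util.List;

  // getReader(...) below is a helper from the asker's project (not shown) that opens the test file.
  CsvParserSettings settings = new CsvParserSettings();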

  settings.setDelimiterDetectionEnabled(true);
  settings.setQuoteDetectionEnabled(true);

  settings.setLineSeparatorDetectionEnabled(false); // all the same using `true`
  settings.getFormat().setLineSeparator("\r\n");

  CsvParser parser = new CsvParser(settings);

  List<String[]> rows;

  rows = parser.parseAll(getReader("testFiles/" + "malformed csv small.csv"));

  for (String[] row : rows)
  {
    System.out.println("");
    int i = 0;

    for (String element : row)
    {
      System.out.println("col. " + i++ + ": " + element);
    }
  }

  System.out.println("done");

Recommended answer

Since you are testing the auto-detection process, I suggest printing out the detected format with:

CsvFormat format = parser.getDetectedFormat();
System.out.println(format);

This will print out:

CsvFormat:
    Comment character=#
    Field delimiter=,
    Line separator (normalized)=\n
    Line separator sequence=\r\n
    Quote character="
    Quote escape character=-
    Quote escape escape character=null

As you can see, the parser is not detecting the quote escape correctly: it picked - instead of ". While the format detection process is typically very good, it is not guaranteed to always get it right, especially with small test samples. In your sample I can't see why it would pick up - as the escape character, so I opened an issue to investigate what is making it detect that one.
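To illustrate why a wrong quote escape glues rows together, here is a simplified sketch. It is not univocity's actual algorithm: `QuoteEscapeDemo` and `readQuotedField` are hypothetical names, and the lenient rule that a quote only closes a field when followed by a comma, a line break, or the end of input is an assumption made for the demonstration. With `"` as the escape, the field `"-"` ends where it should. With `-` as the escape, the `-"` pair is read as an escaped quote, so the field never closes on its own row and runs into the next one.

```java
public class QuoteEscapeDemo {

    // Reads one quoted field starting at index 0 of input (which must be the opening quote).
    // An escape character followed by a quote yields a literal quote.
    // Assumption: a quote closes the field only if followed by ',', a line break, or end of input.
    static String readQuotedField(String input, char quote, char escape) {
        StringBuilder out = new StringBuilder();
        int i = 1; // skip the opening quote
        while (i < input.length()) {
            char c = input.charAt(i);
            boolean hasNext = i + 1 < input.length();
            if (c == escape && hasNext && input.charAt(i + 1) == quote) {
                out.append(quote); // escaped quote: keep one literal quote
                i += 2;
            } else if (c == quote) {
                char next = hasNext ? input.charAt(i + 1) : '\n';
                if (next == ',' || next == '\r' || next == '\n') {
                    return out.toString(); // proper closing quote
                }
                out.append(c); // unescaped quote inside the value: keep reading
                i++;
            } else {
                out.append(c);
                i++;
            }
        }
        return out.toString(); // input ended before a closing quote was found
    }

    public static void main(String[] args) {
        // Tail of one row from the sample, followed by the start of the next row.
        String input = "\"-\",\"-\"\r\n\"2017-05-23\",\"String\"";

        String correct = readQuotedField(input, '"', '"');
        String glued = readQuotedField(input, '"', '-');

        System.out.println("escape '\"': " + correct); // prints: escape '"': -
        System.out.println("escape '-' swallowed the next row's date: " + glued.contains("2017-05-23"));
    }
}
```

The second call keeps reading past the \r\n, which matches the symptom in the question: the date of the following row ends up inside the previous row's field.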

What you can do right now as a workaround, if you know for a fact that none of your input files will ever use - as the quote escape, is to detect the format, check what it picked up from the input, amend it if needed, and then parse the contents, like this:

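import com.univocity.parsers.csv.CsvFormat;
import com.univocity.parsers.csv.CsvParser;
import com.univocity.parsers.csv.CsvParserSettings;
import java.io.File;
import java.util.List;

public List<String[]> parse(File input, CsvFormat format) {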
    CsvParserSettings settings = new CsvParserSettings();
    if (format == null) { //no format specified? Let's detect what we are dealing with
        settings.detectFormatAutomatically();

        CsvParser parser = new CsvParser(settings);
        parser.beginParsing(input); //just call beginParsing to kick off the auto-detection process
        format = parser.getDetectedFormat(); //capture the format
        parser.stopParsing(); //stop the parser - no need to read anything yet.

        System.out.println(format);

        if (format.getQuoteEscape() == '-') { //got something weird detected? Let's amend it.
            format.setQuoteEscape('"');
        }

        return parse(input, format); //now parse with the intended format
    } else {
        settings.setFormat(format); //this parses with the format adjusted earlier.
        CsvParser parser = new CsvParser(settings);
        return parser.parseAll(input);
    }

}

Now just call the parse method:

List<String[]> rows = parse(new File("/Users/jbax/Downloads/malformed csv r n small.csv"), null);

And your data will be extracted properly. Hope this helps!
