CSVParser processes LF as CRLF

Question

I am trying to parse a CSV file as shown below:

    String NEW_LINE_SEPARATOR = "\r\n";
    CSVFormat csvFileFormat = CSVFormat.DEFAULT.withRecordSeparator(NEW_LINE_SEPARATOR);
    FileReader fr = new FileReader("201404051539.csv");
    CSVParser csvParser = csvFileFormat.withHeader().parse(fr);
    List<CSVRecord> recordsList = csvParser.getRecords();

Normal lines in the file end with CRLF; however, a few lines contain an additional LF character in the middle, i.e.:

    a,b,c,dCRLF --line1
    e,fLF,g,h,iCRLF --line2

Because of this, the parse operation creates three records, whereas there are actually only two.

Is there a way to keep the LF character in the middle of the second line from being treated as a line break, so that parsing yields only two records?

Thanks

Recommended answer

I think uniVocity-parsers is the only parser you will find that will work with line endings as you expect.

The equivalent code using univocity-parsers will be:

    CsvParserSettings settings = new CsvParserSettings(); //many options here, check the tutorial
    settings.getFormat().setLineSeparator("\r\n");
    settings.getFormat().setNormalizedNewline('\u0001'); //uses a special character to represent a new record instead of \n.
    settings.setNormalizeLineEndingsWithinQuotes(false); //does not replace \r\n by the normalized new line when reading quoted values.
    settings.setHeaderExtractionEnabled(true); //extract headers from file
    settings.trimValues(false); //does not remove whitespaces around values 
    CsvParser parser = new CsvParser(settings);

    List<Record> recordsList = parser.parseAllRecords(new File("201404051539.csv"));

If you define a line separator to be \r\n then this is the ONLY sequence of characters that should identify a new record (when outside quotes). All values can have either \r or \n without being enclosed in quotes because that's NOT the line separator sequence.
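The rule described above can be illustrated with a small library-free sketch. This is not how univocity-parsers is implemented; the class `CrlfRecordSplitter` and the method `splitRecords` are hypothetical names introduced here only for illustration, and the sketch assumes `"` is the quote character and does not split records into fields.

```java
import java.util.ArrayList;
import java.util.List;

public class CrlfRecordSplitter {

    /**
     * Splits the input into records, treating only the sequence "\r\n"
     * outside of quotes as a record separator. A lone '\r' or '\n'
     * remains part of the current record.
     */
    public static List<String> splitRecords(String input) {
        List<String> records = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuotes = false;

        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (c == '"') {
                // An escaped quote ("") toggles the flag twice, leaving it unchanged.
                inQuotes = !inQuotes;
                current.append(c);
            } else if (!inQuotes && c == '\r'
                    && i + 1 < input.length() && input.charAt(i + 1) == '\n') {
                // Full CRLF outside quotes: end of the current record.
                records.add(current.toString());
                current.setLength(0);
                i++; // skip the '\n' half of the CRLF pair
            } else {
                // Any other character, including a lone LF or CR, is data.
                current.append(c);
            }
        }
        if (current.length() > 0) {
            records.add(current.toString()); // final record without a trailing CRLF
        }
        return records;
    }

    public static void main(String[] args) {
        String input = "a,b,c,d\r\ne,f\n,g,h,i\r\n";
        List<String> records = splitRecords(input);
        System.out.println("records: " + records.size());
        for (String record : records) {
            // Show the embedded LF as a visible escape sequence.
            System.out.println(record.replace("\n", "\\n"));
        }
    }
}
```

With the sample input from the question, the lone LF after `f` is kept inside the second record, so the sketch produces two records rather than three.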

When parsing the input sample you gave:

    String input = "a,b,c,d\r\ne,f\n,g,h,i\r\n";
    parser.parseAll(new StringReader(input));

The result will be:

    LINE1 = [a, b, c, d]
    LINE2 = [e, f
    , g, h, i]

Disclosure: I'm the author of this library. It's open-source and free (Apache 2.0 license)
