How to read a text file with mixed encodings in Scala or Java?

Question

I am trying to parse a CSV file, ideally using weka.core.converters.CSVLoader. However, the file I have is not a valid UTF-8 file. It is mostly UTF-8, but some of the field values are in different encodings, so there is no encoding in which the whole file is valid, but I need to parse it anyway. Apart from using Java libraries like Weka, I am mainly working in Scala. I am not even able to read the file using scala.io.Source. For example:

import scala.io.Source

// Strict UTF-8 decoding: fails on the first malformed byte.
Source.
  fromFile(filename)("UTF-8").
  foreach(print)

throws:

java.nio.charset.MalformedInputException: Input length = 1
    at java.nio.charset.CoderResult.throwException(CoderResult.java:277)
    at sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:337)
    at sun.nio.cs.StreamDecoder.read(StreamDecoder.java:176)
    at java.io.InputStreamReader.read(InputStreamReader.java:184)
    at java.io.BufferedReader.fill(BufferedReader.java:153)
    at java.io.BufferedReader.read(BufferedReader.java:174)
    at scala.io.BufferedSource$$anonfun$iter$1$$anonfun$apply$mcI$sp$1.apply$mcI$sp(BufferedSource.scala:38)
    at scala.io.Codec.wrap(Codec.scala:64)
    at scala.io.BufferedSource$$anonfun$iter$1.apply(BufferedSource.scala:38)
    at scala.io.BufferedSource$$anonfun$iter$1.apply(BufferedSource.scala:38)
    at scala.collection.Iterator$$anon$14.next(Iterator.scala:150)
    at scala.collection.Iterator$$anon$25.hasNext(Iterator.scala:562)
    at scala.collection.Iterator$$anon$19.hasNext(Iterator.scala:400)
    at scala.io.Source.hasNext(Source.scala:238)
    at scala.collection.Iterator$class.foreach(Iterator.scala:772)
    at scala.io.Source.foreach(Source.scala:181)

I am perfectly happy to throw all the invalid characters away or replace them with some dummy. I am going to have lots of text like this to process in various ways and may need to pass the data to various third-party libraries. An ideal solution would be some kind of global setting that would cause all the low-level Java libraries to ignore invalid bytes in text, so that I can call third-party libraries on this data without modification.

Solution:

import java.nio.charset.CodingErrorAction
import scala.io.{Codec, Source}

// Replace malformed or unmappable input instead of throwing.
implicit val codec: Codec = Codec("UTF-8")
codec.onMalformedInput(CodingErrorAction.REPLACE)
codec.onUnmappableCharacter(CodingErrorAction.REPLACE)

// fromFile picks up the implicit codec defined above.
Source.
  fromFile(filename).
  foreach(print)

Thanks to +Esailija for pointing me in the right direction. This led me to "How to detect illegal UTF-8 byte sequences to replace them in java inputstream?", which provides the core Java solution. In Scala I can make this the default behaviour by making the codec implicit. I think I can make it the default behaviour for the entire package by putting the implicit codec definition in the package object.
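
There does not seem to be a global JVM switch for this, so when a third-party library insists on opening the file itself, one possible workaround is to rewrite the file once as clean UTF-8 and hand the library the copy. The following is only a sketch using the JDK; the class name Utf8Sanitizer and the method sanitize are made up for illustration and are not part of any library.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class Utf8Sanitizer {
    // Hypothetical helper: copies `in` to `out`, replacing every malformed
    // byte sequence with U+FFFD so that the copy is valid UTF-8.
    static void sanitize(Path in, Path out) throws IOException {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPLACE)
                .onUnmappableCharacter(CodingErrorAction.REPLACE);
        try (BufferedReader reader = new BufferedReader(
                     new InputStreamReader(Files.newInputStream(in), decoder));
             BufferedWriter writer = Files.newBufferedWriter(out, StandardCharsets.UTF_8)) {
            char[] buffer = new char[8192];
            int n;
            // Copy raw characters (not lines) so line terminators survive unchanged.
            while ((n = reader.read(buffer)) != -1) {
                writer.write(buffer, 0, n);
            }
        }
    }
}
```

Swapping REPLACE for IGNORE would drop the invalid bytes instead of marking them, at the cost of hiding where the damage was.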

Recommended answer

This is how I managed to do it with Java:

    import java.io.BufferedReader;
    import java.io.FileInputStream;
    import java.io.IOException;
    import java.io.InputStreamReader;
    import java.nio.charset.CharsetDecoder;
    import java.nio.charset.CodingErrorAction;
    import java.nio.charset.StandardCharsets;

    String result = null;
    // Drop malformed byte sequences instead of throwing MalformedInputException.
    CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder();
    decoder.onMalformedInput(CodingErrorAction.IGNORE);
    // try-with-resources closes the reader and the underlying stream, even on failure.
    try (BufferedReader bufferedReader = new BufferedReader(
            new InputStreamReader(new FileInputStream("invalid.txt"), decoder))) {
        StringBuilder sb = new StringBuilder();
        String line = bufferedReader.readLine();
        while (line != null) {
            sb.append(line); // note: readLine() strips the line terminator
            line = bufferedReader.readLine();
        }
        result = sb.toString();
    } catch (IOException e) { // also covers FileNotFoundException
        e.printStackTrace();
    }

    System.out.println(result);

The invalid file was created with these bytes:

0x68, 0x80, 0x65, 0x6C, 0x6C, 0xC3, 0xB6, 0xFE, 0x20, 0x77, 0xC3, 0xB6, 0x9C, 0x72, 0x6C, 0x64, 0x94

This is hellö wörld in UTF-8 with four invalid bytes (0x80, 0xFE, 0x9C, 0x94) mixed in.

With .REPLACE you see the standard Unicode replacement character being used:

//"h�ellö� wö�rld�"

With .IGNORE, you see the invalid bytes ignored:

//"hellö wörld"
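
Both behaviours can also be checked in memory, without creating a file. The sketch below decodes the byte sequence listed earlier with each action; the class name DecodeDemo and the helper decode are illustrative names, not part of the original answer.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class DecodeDemo {
    // Sample bytes from the answer: UTF-8 text with four invalid bytes mixed in.
    static final byte[] SAMPLE = {
        0x68, (byte) 0x80, 0x65, 0x6C, 0x6C, (byte) 0xC3, (byte) 0xB6, (byte) 0xFE, 0x20,
        0x77, (byte) 0xC3, (byte) 0xB6, (byte) 0x9C, 0x72, 0x6C, 0x64, (byte) 0x94
    };

    // Hypothetical helper: decodes UTF-8 using the given action for bad input.
    static String decode(byte[] bytes, CodingErrorAction action)
            throws CharacterCodingException {
        return StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(action)
                .onUnmappableCharacter(action)
                .decode(ByteBuffer.wrap(bytes))
                .toString();
    }

    public static void main(String[] args) throws CharacterCodingException {
        System.out.println(decode(SAMPLE, CodingErrorAction.REPLACE));
        System.out.println(decode(SAMPLE, CodingErrorAction.IGNORE));
    }
}
```

With REPLACE, each of the four invalid bytes becomes one U+FFFD character; with IGNORE, they disappear from the result.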

Without specifying .onMalformedInput, you get:

java.nio.charset.MalformedInputException: Input length = 1
    at java.nio.charset.CoderResult.throwException(Unknown Source)
    at sun.nio.cs.StreamDecoder.implRead(Unknown Source)
    at sun.nio.cs.StreamDecoder.read(Unknown Source)
    at java.io.InputStreamReader.read(Unknown Source)
    at java.io.BufferedReader.fill(Unknown Source)
    at java.io.BufferedReader.readLine(Unknown Source)
    at java.io.BufferedReader.readLine(Unknown Source)
