How to read a text file with mixed encodings in Scala or Java?


Question

I am trying to parse a CSV file, ideally using weka.core.converters.CSVLoader. However, the file I have is not a valid UTF-8 file. It is mostly UTF-8, but some of the field values are in different encodings, so there is no single encoding in which the whole file is valid. I need to parse it anyway. Apart from using Java libraries like Weka, I mainly work in Scala. I am not even able to read the file using scala.io.Source. For example:

import scala.io.Source

Source.
  fromFile(filename)("UTF-8").
  foreach(print)

throws:

java.nio.charset.MalformedInputException: Input length = 1
    at java.nio.charset.CoderResult.throwException(CoderResult.java:277)
    at sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:337)
    at sun.nio.cs.StreamDecoder.read(StreamDecoder.java:176)
    at java.io.InputStreamReader.read(InputStreamReader.java:184)
    at java.io.BufferedReader.fill(BufferedReader.java:153)
    at java.io.BufferedReader.read(BufferedReader.java:174)
    at scala.io.BufferedSource$$anonfun$iter$1$$anonfun$apply$mcI$sp$1.apply$mcI$sp(BufferedSource.scala:38)
    at scala.io.Codec.wrap(Codec.scala:64)
    at scala.io.BufferedSource$$anonfun$iter$1.apply(BufferedSource.scala:38)
    at scala.io.BufferedSource$$anonfun$iter$1.apply(BufferedSource.scala:38)
    at scala.collection.Iterator$$anon$14.next(Iterator.scala:150)
    at scala.collection.Iterator$$anon$25.hasNext(Iterator.scala:562)
    at scala.collection.Iterator$$anon$19.hasNext(Iterator.scala:400)
    at scala.io.Source.hasNext(Source.scala:238)
    at scala.collection.Iterator$class.foreach(Iterator.scala:772)
    at scala.io.Source.foreach(Source.scala:181)

I am perfectly happy to throw all the invalid characters away or replace them with some dummy character. I am going to have lots of text like this to process in various ways and may need to pass the data to various third-party libraries. An ideal solution would be some kind of global setting that causes all the low-level Java libraries to ignore invalid bytes in text, so that I can call third-party libraries on this data without modification.

Solution:

import java.nio.charset.CodingErrorAction
import scala.io.{Codec, Source}

// Replace malformed or unmappable input instead of throwing.
implicit val codec: Codec = Codec("UTF-8")
codec.onMalformedInput(CodingErrorAction.REPLACE)
codec.onUnmappableCharacter(CodingErrorAction.REPLACE)

// fromFile picks up the implicit codec defined above.
Source.
  fromFile(filename).
  foreach(print)

Thanks to +Esailija for pointing me in the right direction. This led me to "How to detect illegal UTF-8 byte sequences to replace them in java inputstream?", which provides the core Java solution. In Scala I can make this the default behaviour by making the codec implicit. I think I can make it the default behaviour for the entire package by putting the implicit codec definition in the package object.
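For third-party libraries that open the file themselves and never see the lenient codec, one possible workaround is to write a cleaned copy first and hand the library that copy. The following is only a sketch in Java: the class `Utf8Sanitizer` and its `sanitize` method are made-up names for illustration, and it assumes that replacing every bad sequence with U+FFFD is acceptable for the data.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper, not part of any library.
final class Utf8Sanitizer {

    // Copies `source` to `target`, replacing malformed UTF-8 sequences with
    // U+FFFD, so that `target` can be read by strict UTF-8 decoders.
    static void sanitize(Path source, Path target) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPLACE)
                .onUnmappableCharacter(CodingErrorAction.REPLACE);
        try (BufferedReader reader = new BufferedReader(
                     new InputStreamReader(Files.newInputStream(source), decoder));
             BufferedWriter writer =
                     Files.newBufferedWriter(target, StandardCharsets.UTF_8)) {
            // Copy characters in chunks so line terminators are preserved.
            char[] buffer = new char[8192];
            int count;
            while ((count = reader.read(buffer)) != -1) {
                writer.write(buffer, 0, count);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

A library can then be pointed at the cleaned path and read it as ordinary UTF-8. The cost is an extra pass over each file and a second copy on disk.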

Recommended answer

This is how I managed to do it with Java:

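    import java.io.BufferedReader;
    import java.io.FileInputStream;
    import java.io.IOException;
    import java.io.InputStreamReader;
    import java.nio.charset.CharsetDecoder;
    import java.nio.charset.CodingErrorAction;
    import java.nio.charset.StandardCharsets;

    String result = null;
    // Drop malformed byte sequences instead of throwing MalformedInputException.
    CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder();
    decoder.onMalformedInput(CodingErrorAction.IGNORE);
    // try-with-resources closes the reader and the underlying stream, even on failure.
    try (BufferedReader bufferedReader = new BufferedReader(
            new InputStreamReader(new FileInputStream("invalid.txt"), decoder))) {
        StringBuilder sb = new StringBuilder();
        String line = bufferedReader.readLine();
        while (line != null) {
            sb.append(line); // readLine() strips the line terminator
            line = bufferedReader.readLine();
        }
        result = sb.toString();
    } catch (IOException e) {
        // Also covers FileNotFoundException, which is a subclass of IOException.
        e.printStackTrace();
    }

    System.out.println(result);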

The invalid file is created with bytes:

0x68, 0x80, 0x65, 0x6C, 0x6C, 0xC3, 0xB6, 0xFE, 0x20, 0x77, 0xC3, 0xB6, 0x9C, 0x72, 0x6C, 0x64, 0x94

Which is hellö wörld in UTF-8 with 4 invalid bytes mixed in.
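To reproduce the test file, the bytes listed above can be written out directly. This is a small sketch; the class name `InvalidFileWriter` is made up for illustration, and the file name `invalid.txt` is taken from the answer's code.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper that recreates the sample file used in the answer.
final class InvalidFileWriter {

    // "hellö wörld" in UTF-8 with four stray bytes (0x80, 0xFE, 0x9C, 0x94) mixed in.
    static final byte[] SAMPLE = {
        0x68, (byte) 0x80, 0x65, 0x6C, 0x6C, (byte) 0xC3, (byte) 0xB6, (byte) 0xFE, 0x20,
        0x77, (byte) 0xC3, (byte) 0xB6, (byte) 0x9C, 0x72, 0x6C, 0x64, (byte) 0x94
    };

    public static void main(String[] args) throws IOException {
        Path target = Paths.get("invalid.txt");
        // Write the raw bytes; no charset is involved at this point.
        Files.write(target, SAMPLE);
        System.out.println("wrote " + SAMPLE.length + " bytes to " + target);
    }
}
```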

With .REPLACE you see the standard Unicode replacement character being used:

//"h�ellö� wö�rld�"

With .IGNORE, you see the invalid bytes ignored:

//"hellö wörld"
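The two actions can also be compared without touching the file system by decoding the same bytes in memory. The sketch below uses a made-up class `DecodeActionsDemo` and helper `decode`; it should reproduce the two strings quoted above and shows that the default action, REPORT, fails instead.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, for illustration only.
final class DecodeActionsDemo {

    // Same bytes as the sample file in the answer.
    static final byte[] SAMPLE = {
        0x68, (byte) 0x80, 0x65, 0x6C, 0x6C, (byte) 0xC3, (byte) 0xB6, (byte) 0xFE, 0x20,
        0x77, (byte) 0xC3, (byte) 0xB6, (byte) 0x9C, 0x72, 0x6C, 0x64, (byte) 0x94
    };

    // Decodes `bytes` as UTF-8 using `action` for bad input.
    // Returns null when decoding fails, which only happens with REPORT.
    static String decode(byte[] bytes, CodingErrorAction action) {
        try {
            return StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(action)
                    .onUnmappableCharacter(action)
                    .decode(ByteBuffer.wrap(bytes))
                    .toString();
        } catch (CharacterCodingException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        System.out.println(decode(SAMPLE, CodingErrorAction.REPLACE));
        System.out.println(decode(SAMPLE, CodingErrorAction.IGNORE));
        System.out.println(decode(SAMPLE, CodingErrorAction.REPORT));
    }
}
```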

Without specifying .onMalformedInput, you get:

java.nio.charset.MalformedInputException: Input length = 1
    at java.nio.charset.CoderResult.throwException(Unknown Source)
    at sun.nio.cs.StreamDecoder.implRead(Unknown Source)
    at sun.nio.cs.StreamDecoder.read(Unknown Source)
    at java.io.InputStreamReader.read(Unknown Source)
    at java.io.BufferedReader.fill(Unknown Source)
    at java.io.BufferedReader.readLine(Unknown Source)
    at java.io.BufferedReader.readLine(Unknown Source)
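The `Input length = 1` in the message is the number of bytes the decoder considered malformed. If it helps to know where the bad bytes are before deciding how to handle them, the lower-level `CharsetDecoder.decode(ByteBuffer, CharBuffer, boolean)` call returns a `CoderResult` instead of throwing. The sketch below is an assumption about how one might use it; `MalformedOffsets` and `find` are made-up names.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, for illustration only.
final class MalformedOffsets {

    // Returns the byte offsets at which the UTF-8 decoder reports an error.
    static List<Integer> find(byte[] bytes) {
        // A fresh decoder uses CodingErrorAction.REPORT by default.
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder();
        ByteBuffer in = ByteBuffer.wrap(bytes);
        CharBuffer out = CharBuffer.allocate(1024);
        List<Integer> offsets = new ArrayList<>();
        while (true) {
            CoderResult result = decoder.decode(in, out, true);
            if (result.isUnderflow()) {
                break; // all input consumed
            }
            if (result.isOverflow()) {
                out.clear(); // decoded text is not needed here, so discard it
                continue;
            }
            // Error: the bad sequence starts at the buffer's current position.
            offsets.add(in.position());
            in.position(in.position() + result.length()); // skip it and carry on
        }
        return offsets;
    }

    public static void main(String[] args) {
        byte[] sample = {0x68, (byte) 0x80, 0x65};
        System.out.println(find(sample));
    }
}
```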
