Spark: importing text file in UTF-8 encoding


Problem Description

I am trying to process a file which contains a lot of special characters such as German umlauts (ä, ü, ö), etc., as follows:

sc.hadoopConfiguration.set("textinputformat.record.delimiter", "\r\n\r\n")
sc.textFile("/file/path/samele_file.txt")

But upon reading the contents, these special characters are not recognized.

I think the default encoding is not UTF-8 or a similar format. I would like to know if there is a way to set the encoding on this textFile method, such as:

sc.textFile("/file/path/samele_file.txt", mode="utf-8")

Recommended Answer

No. If you read a non-UTF-8 file in UTF-8 mode, non-ASCII characters will not be decoded properly. Convert the file to UTF-8 encoding first, then read it. You can refer to Reading file in different formats.
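If converting the file up front is not practical, one common workaround is to read the raw Hadoop Text records and decode the bytes explicitly with the file's actual charset. The sketch below is a minimal example, assuming the file is ISO-8859-1 (Latin-1) encoded; the path comes from the question and the charset is a placeholder you would replace with the real encoding:

import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapred.TextInputFormat
import java.nio.charset.StandardCharsets

// Read (offset, Text) pairs instead of letting textFile decode as UTF-8,
// then build each line from the raw bytes using the file's real charset.
val lines = sc.hadoopFile("/file/path/samele_file.txt",
    classOf[TextInputFormat], classOf[LongWritable], classOf[Text])
  .map { case (_, text) =>
    // Assumption: the source file is ISO-8859-1; change the charset as needed.
    new String(text.getBytes, 0, text.getLength, StandardCharsets.ISO_8859_1)
  }

The textinputformat.record.delimiter setting shown earlier still applies here, because hadoopFile reads through the same Hadoop configuration and TextInputFormat.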

