Java code reads UTF-8 text incorrectly
Question
I'm having a problem reading UTF-8 characters in my code (running in Eclipse).
I have a file, text, which has a few lines in it, for example:
אך 1234
NOTE: There is a \t before the word, and the word should appear on the left with the number on the right. I don't know how to reverse them here, sorry.
That is, a Hebrew word and then a number.
I need to separate the word from the number somehow. I tried this:
BufferedReader br = new BufferedReader(new FileReader(text));
String content;
while ((content = br.readLine()) != null)
{
    String delims = "[ ]+";
    String[] tokens = content.split(delims);
}
The problem is that, for some reason, the code reads content (the first line of the file) as follows:
אך\t1234
...meaning that the space isn't in the correct place.
I suppose I could tokenize the text on the \t, but I'm not sure I should, as the file isn't being read correctly...
Does anyone have any idea why this happens?
Thanks a lot :-)
Recommended answer
I think you are matching a space when there is actually a tab there?
Can you try this:
BufferedReader br = new BufferedReader(new FileReader(text));
String content;
while ((content = br.readLine()) != null)
{
    // \s matches any single whitespace character, including a tab
    String delims = "\\s";
    String[] tokens = content.split(delims);
}
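For completeness, here is a slightly fuller sketch of the same idea. It is only an illustration, not a definitive fix: the class name WhitespaceTokenizer and the methods splitLine and readTokens are made up for this example. It makes two assumptions beyond the snippet above. First, that the file really is UTF-8, so it passes the charset explicitly, since FileReader decodes with the platform default charset. Second, that a line may contain several whitespace characters in a row or a leading tab, so it trims the line and splits on "\\s+" rather than "\\s".

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class; the name and methods are illustrative only.
final class WhitespaceTokenizer {

    // Splits one line on runs of whitespace (spaces or tabs),
    // ignoring whitespace at the start and end of the line.
    static String[] splitLine(String line) {
        String trimmed = line.trim();
        if (trimmed.isEmpty()) {
            return new String[0];
        }
        return trimmed.split("\\s+");
    }

    // Reads the file as UTF-8 explicitly instead of relying on the
    // platform default charset, and tokenizes every line.
    static List<String[]> readTokens(Path file) throws IOException {
        List<String[]> rows = new ArrayList<>();
        try (BufferedReader br = Files.newBufferedReader(file, StandardCharsets.UTF_8)) {
            String content;
            while ((content = br.readLine()) != null) {
                rows.add(splitLine(content));
            }
        }
        return rows;
    }

    public static void main(String[] args) throws IOException {
        // args[0] is assumed to be the path to the text file.
        for (String[] tokens : readTokens(Paths.get(args[0]))) {
            System.out.println(String.join(" | ", tokens));
        }
    }
}
```

Keep in mind that Hebrew is written right to left, so a console or editor may display the word and the number in a different visual order than they are stored in the string. Checking tokens[0] and tokens[1] separately is more reliable than judging by how the whole line is rendered.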