How to avoid junk/garbage characters while reading data from multiple languages?


Problem description


I am parsing RSS news feeds in more than 10 different languages.

All the parsing is done in Java, and the data is stored in MySQL before my APIs, written in PHP, respond to the clients.

I constantly come across garbage characters when I read the data.

What I have tried:

  1. I have configured MySQL to store UTF-8 data. My database, table, and even the column have utf8 as their default character set.
  2. When connecting to the database, I set the character set for results to UTF-8.

When I run the jar file manually to insert the data, the characters appear fine. But when I set up a cron job for the same jar file, the problem comes back.

In English, I particularly face problems like this (http://stackoverflow.com/questions/2477452/%C3%A2%E2%82%AC%E2%84%A2-showing-on-page-instead-of), and in other vernacular languages the characters appear to be total garbage; I can't recognize even a single character.

Is there anything that I am missing?

Sample garbage characters:

Gujarati :"રેલવે મà«àª¸àª¾àª«àª°à«€àª®àª¾àª‚ સામાન ચોરી થશે તો મળશે વળતર!"

Malyalam : "നേപàµà´ªà´¾à´³à´¿à´²àµ‡à´•àµà´•àµà´³àµà´³ കോളàµâ€ നിരകàµà´•àµ à´•àµà´±à´šàµà´šàµ"

English : Bank Board Bureau’s ambit to widen to financial sector PSUs

Solution

The Gujarati starts with રેલવે, correct? And the Malayalam starts with നേപ, correct? And the English should have included Bureau’s.

This is the classic case of

  • The bytes you have in the client are correctly encoded in utf8. (Bureau is encoded in the ASCII/latin1 subset of utf8, but ’ is not the ASCII apostrophe.)
  • You connected with SET NAMES latin1 (or set_charset('latin1') or ...), probably by default. (It should have been utf8.)
  • The column in the table was declared CHARACTER SET latin1. (Or possibly it was inherited from the table/database.) (It should have been utf8.)
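
As a rough illustration of how this combination produces the garbage shown in the question, the sketch below encodes text as UTF-8 and then misreads the bytes as a single-byte charset. It is a minimal sketch, not the poster's code: the class and method names are hypothetical, and windows-1252 is assumed as a stand-in for MySQL's latin1.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class MojibakeDemo {
    // Assumption: windows-1252 stands in for MySQL's latin1 in this sketch.
    private static final Charset LATIN1_LIKE = Charset.forName("windows-1252");

    /** Encodes text as UTF-8, then misreads those bytes as windows-1252. */
    static String garble(String text) {
        byte[] utf8Bytes = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8Bytes, LATIN1_LIKE);
    }

    public static void main(String[] args) {
        // Unicode escapes keep this source file independent of its own encoding.
        String gujarati = "\u0AB0\u0AC7\u0AB2"; // first three characters of the Gujarati sample
        String apostrophe = "\u2019";           // the non-ASCII apostrophe in the English sample
        // What appears on screen also depends on the console's encoding.
        System.out.println(garble(gujarati));
        System.out.println(garble(apostrophe));
    }
}
```

Each non-ASCII character turns into two or three single-byte characters, which matches the "à", "ª", "«" pattern in the samples, while plain ASCII such as Bureau passes through unchanged.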

The fix for the data is a "2-step ALTER".

ALTER TABLE Tbl MODIFY COLUMN col VARBINARY(...) ...;
ALTER TABLE Tbl MODIFY COLUMN col VARCHAR(...) ... CHARACTER SET utf8 ...;

where the lengths are big enough and each remaining "..." stands for whatever else (NOT NULL, etc.) was already on the column.

Unfortunately, if you have a lot of columns to work with, it will take a lot of ALTERs. Within a single table, you can (and should) MODIFY all the necessary columns together, so that the table needs only one pair of ALTERs: one to VARBINARY and one back to VARCHAR.

The fix for the code is to establish utf8 as the connection character set; how to do that depends on the API used in PHP. The ALTERs change the column definition.
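
The answer speaks of the PHP API, but the question's inserts are done from Java. There, one common way to request a UTF-8 connection is through MySQL Connector/J's `useUnicode` and `characterEncoding` URL properties. The sketch below only builds such a URL; the class name, host, and database name are placeholders, and it assumes Connector/J is the driver in use.

```java
final class Utf8JdbcUrl {
    /** Builds a MySQL JDBC URL that asks Connector/J to use UTF-8 on the connection. */
    static String build(String host, String database) {
        return "jdbc:mysql://" + host + "/" + database
                + "?useUnicode=true&characterEncoding=UTF-8";
    }

    public static void main(String[] args) {
        // "localhost" and "news" are placeholder names, not taken from the question.
        System.out.println(build("localhost", "news"));
    }
}
```

The resulting string would be passed to `DriverManager.getConnection(url, user, password)`. Setting the encoding on the connection itself, rather than relying on a default, may also help when the same jar runs in different environments.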

Edit

You have a VARCHAR column with the wrong CHARACTER SET. Hence, you see Mojibake like àª°à«‡àª² where રેલ was intended. Most conversion techniques try to preserve the characters as displayed (the Mojibake), but that is not what you need. Instead, the first step, to VARBINARY, preserves the bits while discarding the old claim that they represent latin1-encoded characters. The second step again preserves the bits, but now declares that they represent utf8 characters.
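
The same "keep the bits, change the label" idea can be sketched on a single string in Java. This is an analogy to the two ALTERs, not a replacement for them: the class and method names are hypothetical, and windows-1252 is again assumed as a stand-in for MySQL's latin1.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class MojibakeRepair {
    // Assumption: windows-1252 stands in for MySQL's latin1 in this sketch.
    private static final Charset LATIN1_LIKE = Charset.forName("windows-1252");

    /** Recovers the raw bytes behind garbled text and reads them as UTF-8. */
    static String repair(String garbled) {
        // Step 1 (like VARBINARY): drop the wrong label, keep only the bytes.
        byte[] originalBytes = garbled.getBytes(LATIN1_LIKE);
        // Step 2 (like VARCHAR ... CHARACTER SET utf8): relabel the same bytes as UTF-8.
        return new String(originalBytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // The garbled form of the first three Gujarati characters, written as escapes.
        String garbled = "\u00E0\u00AA\u00B0\u00E0\u00AB\u2021\u00E0\u00AA\u00B2";
        String fixed = repair(garbled);
        // Compare against the intended text instead of printing it,
        // so the result does not depend on the console's encoding.
        System.out.println(fixed.equals("\u0AB0\u0AC7\u0AB2"));
    }
}
```

One caveat on this sketch: Java's windows-1252 leaves a handful of byte values unmapped, so text whose UTF-8 form contains those bytes may not round-trip this way. The ALTERs act on the stored bytes directly and do not go through such a mapping.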
