How to avoid junk/garbage characters while reading data from multiple languages?

Question

I am parsing RSS news feeds in more than 10 different languages.

All the parsing is done in Java, and the data is stored in MySQL before my APIs, written in PHP, respond to the clients.

I constantly come across garbage characters when I read the data.

What I have tried:

  1. I have configured MySQL to store utf-8 data. My database, table, and even the columns have UTF8 as their default charset.
  2. While connecting to my database, I set the character set results to utf-8.

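When I run the jar file manually to insert the data, the characters display correctly. But when I set up a cron job for the same jar file, I run into the problem again.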

In English, I particularly face problems like this, and in other vernacular languages the characters appear to be total garbage; I can't even recognize a single character.

Is there anything I am missing?

Examples of garbage characters:

Gujarati :"રેલવે મà«àª¸àª¾àª«àª°à«€àª®àª¾àª‚ સામાન ચોરી થશે તો મળશે વળતર!"

Malyalam : "നേപàµà´ªà´¾à´³à´¿à´²àµ‡à´•àµà´•àµà´³àµà´³ കോളàµâ€ നിരകàµà´•àµ à´•àµà´±à´šàµà´šàµ"

English : Bank Board Bureau’s ambit to widen to financial sector PSUs

Recommended answer

The Gujarati should start with રેલવે, correct? And the Malayalam should start with നേപ, correct? And the English should have included Bureau’s.

This is the classic case where:

  • The bytes you have in the client are correctly encoded in utf8. (Bureau is encoded in the ASCII/latin1 subset of utf8, but ’ is not the ASCII apostrophe.)
  • You connected with SET NAMES latin1 (or set_charset('latin1') or ...), probably by default. (It should have been utf8.)
  • The column in the table was declared CHARACTER SET latin1. (Or possibly it was inherited from the table/database.) (It should have been utf8.)
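
The three conditions above can be reproduced outside the database. The following is only a minimal Python sketch, not code from the question: it assumes that MySQL's latin1 behaves like Windows-1252 (Python's cp1252 codec) for these bytes, and the helper names garble and ungarble are hypothetical.

```python
def garble(text: str) -> str:
    """Encode as UTF-8 (what the client sends), then misread the bytes as latin1."""
    return text.encode("utf-8").decode("cp1252")


def ungarble(mojibake: str) -> str:
    """Undo the misreading: recover the original bytes, then read them as UTF-8."""
    return mojibake.encode("cp1252").decode("utf-8")


english = "Bureau\u2019s"          # U+2019 is the curly apostrophe, not ASCII 0x27
gujarati = "\u0ab0\u0ac7\u0ab2"    # the first three code points of the Gujarati sample

garbled_english = garble(english)      # 'Bureauâ€™s': only the apostrophe is damaged
garbled_gujarati = garble(gujarati)    # every character becomes three latin1 characters
restored = ungarble(garbled_gujarati)  # identical to gujarati again
```

The ASCII letters survive because their bytes mean the same thing in both character sets, which is why the English text looks almost right while the Gujarati and Malayalam look like pure garbage.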

The fix for the data is a "two-step ALTER":

ALTER TABLE Tbl MODIFY COLUMN col VARBINARY(...) ...;
ALTER TABLE Tbl MODIFY COLUMN col VARCHAR(...) ... CHARACTER SET utf8 ...;

where the lengths are big enough and the other "..." carry over whatever else (NOT NULL, etc.) was already on the column.

Unfortunately, if you have a lot of columns to work with, it will take a lot of ALTERs. You can (and should) MODIFY all the necessary columns of a single table to VARBINARY in one ALTER and back to VARCHAR in a second, so each table needs only a pair of ALTERs.
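
When many columns are involved, the pair of statements for each table can be generated rather than typed by hand. The sketch below is one possible way to do that; the function name two_step_alters, the table name news, and the column names, lengths, and attributes are all invented for illustration. It reuses the same length for both steps, which assumes the columns start out as latin1 VARCHARs (one byte per stored character).

```python
def two_step_alters(table, columns):
    """Build the pair of ALTERs for one table.

    columns is a list of (name, length, extra) tuples, where extra holds
    whatever else was already on the column (NOT NULL, etc.) or "".
    """
    to_binary = ", ".join(
        f"MODIFY COLUMN {name} VARBINARY({length}) {extra}".strip()
        for name, length, extra in columns
    )
    to_utf8 = ", ".join(
        f"MODIFY COLUMN {name} VARCHAR({length}) CHARACTER SET utf8 {extra}".strip()
        for name, length, extra in columns
    )
    # Step 1 drops the character set label; step 2 relabels the same bytes as utf8.
    return [f"ALTER TABLE {table} {to_binary};", f"ALTER TABLE {table} {to_utf8};"]


# Hypothetical table and columns, for illustration only.
statements = two_step_alters(
    "news",
    [("title", 255, "NOT NULL"), ("summary", 2000, "")],
)
for statement in statements:
    print(statement)
```

Review the generated statements before running them, and test on a copy of the table first, since a wrong length or a dropped attribute changes the column definition.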

The fix for the code is to establish utf8 as the connection character set; how to do that depends on the API used in PHP. The ALTERs will change the column definitions.

Edit

You have VARCHAR with the wrong CHARACTER SET. Hence, you see Mojibake such as àª°à«‡àª² where રેલ was intended. Most conversion techniques try to preserve the Mojibake characters, but that is not what you need. Instead, taking a step to VARBINARY preserves the bits while discarding the old claim that those bits represent latin1-encoded characters. The second step again preserves the bits, but now claims that they represent utf8 characters.
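
The difference between converting characters and relabelling bits can be shown on plain bytes. This is only an analogy in Python, assuming cp1252 stands in for MySQL's latin1; it does not touch a database, and the variable names are illustrative.

```python
intended = "\u0ab0\u0ac7\u0ab2"      # the Gujarati text the feed contained
stored = intended.encode("utf-8")    # the bits sitting in the column (9 bytes)

# What the latin1 label makes of those bits: nine Mojibake characters.
as_latin1 = stored.decode("cp1252")

# A character-preserving conversion keeps the Mojibake and changes the bits,
# so the garbage is now stored as valid utf8 ("double encoding").
converted = as_latin1.encode("utf-8")

# The two-step route: drop the label (VARBINARY), keep the bits untouched,
# then declare that the same bits are utf8.
as_binary = bytes(stored)
relabelled = as_binary.decode("utf-8")
```

Here converted differs from stored and still decodes to the Mojibake, while relabelled equals the intended text, which is the outcome the pair of ALTERs aims for.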
