QString in Persian


Problem description


I have been given a Qt project which needs to support the Persian language. The data is sent from a server. Using the first line below, I get a QByteArray, and using the second line I convert it to a QString:

    QByteArray readData = socket->readAll();
    QString DataAsString = QTextCodec::codecForUtfText(readData)->toUnicode(readData);

When the data sent is in English, everything is fine, but when it is in Persian, instead of

سلام

I get

سÙ\u0084اÙ\u0085

I describe the process so that nobody suggests the usual methods for building a multi-language app with tr(). This is purely about text decoding, not about translation. My OS is Windows 8.1, in case that matters.

I get this hex value when the server sends سلام:

0008d8b3d984d8a7d985

By the way, the server sends two extra bytes at the beginning for a reason I don't know, so I cut them off using:

DataAsString.remove(0,2);

after the data has been converted to QString. That is why the hex value above has some extra bytes at the beginning.

Solution

I was far too curious to wait for a reply and experimented a bit on my own:

I copied the text سلام (in English: "Hello") and pasted it into Notepad++ (which used UTF-8 encoding in my case). Then I switched to View as Hex and got the bytes d8 b3 d9 84 d8 a7 d9 85.

The ASCII dump on the right side of the hex view looks a bit similar to what OP got unexpectedly. This led me to believe that the bytes in readData are encoded in UTF-8. Hence, I took the exposed hex numbers and made a little sample code:

testQPersian.cc:

#include <QtWidgets>

int main(int argc, char **argv)
{
  QByteArray readData = "\xd8\xb3\xd9\x84\xd8\xa7\xd9\x85";
  QString textLatin1 = QString::fromLatin1(readData);
  QString textUtf8 = QString::fromUtf8(readData);
  QApplication app(argc, argv);
  QWidget qWin;
  QGridLayout qGrid;
  qGrid.addWidget(new QLabel("Latin-1:"), 0, 0);
  qGrid.addWidget(new QLabel(textLatin1), 0, 1);
  qGrid.addWidget(new QLabel("UTF-8:"), 1, 0);
  qGrid.addWidget(new QLabel(textUtf8), 1, 1);
  qWin.setLayout(&qGrid);
  qWin.show();
  return app.exec();
}

testQPersian.pro:

SOURCES = testQPersian.cc

QT += widgets

Compiled and tested in Cygwin on Windows 10:

$ qmake-qt5 testQPersian.pro

$ make

$ ./testQPersian

Again, the output as Latin-1 looks a bit similar to what OP got as well as what Notepad++ exposed.

The output as UTF-8 provides the expected text (as expected because I provided a proper UTF-8 encoding as input).

It may be a bit confusing that the ASCII/Latin-1 output varies. There exist multiple single-byte character encodings which share ASCII in the lower half (0 ... 127) but assign different meanings to the bytes in the upper half (128 ... 255). (Have a look at ISO/IEC 8859 to see what I mean. These were introduced as localizations before Unicode became popular as the final solution to the localization problem.)

The Persian characters surely all have Unicode code points beyond 127. (Unicode shares ASCII for the first 128 code points as well.) Such code points are encoded in UTF-8 as sequences of multiple bytes in which every byte has the MSB (the most significant bit, bit 7) set. Hence, if these bytes are (accidentally) interpreted with any ISO 8859 encoding, the upper half becomes relevant, and depending on the ISO 8859 encoding currently in use, this may produce different glyphs.
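The difference between the two interpretations can be reproduced without Qt. The following is only a minimal sketch in plain C++: the helper names (decodeLatin1, decodeUtf8Basic, allBytesHaveMsbSet) are made up for illustration, and the UTF-8 decoder deliberately handles only 1- and 2-byte sequences, which is enough for ASCII and Arabic-script letters but is not a complete or validating decoder.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Latin-1 maps every byte value directly to the code point with the same number,
// so 8 input bytes always become 8 characters.
std::vector<char32_t> decodeLatin1(const std::string &bytes)
{
    std::vector<char32_t> out;
    for (unsigned char b : bytes)
        out.push_back(b);
    return out;
}

// Simplified UTF-8 decoder (illustration only): handles 1- and 2-byte sequences,
// replaces everything else with U+FFFD, and does not reject overlong forms.
std::vector<char32_t> decodeUtf8Basic(const std::string &bytes)
{
    std::vector<char32_t> out;
    for (std::size_t i = 0; i < bytes.size(); ++i) {
        const unsigned char b0 = static_cast<unsigned char>(bytes[i]);
        if (b0 < 0x80) {
            out.push_back(b0);                      // plain ASCII
        } else if ((b0 & 0xE0) == 0xC0 && i + 1 < bytes.size()
                   && (static_cast<unsigned char>(bytes[i + 1]) & 0xC0) == 0x80) {
            const unsigned char b1 = static_cast<unsigned char>(bytes[++i]);
            out.push_back(((b0 & 0x1Fu) << 6) | (b1 & 0x3Fu));  // 110xxxxx 10xxxxxx
        } else {
            out.push_back(0xFFFD);                  // not covered by this sketch
        }
    }
    return out;
}

// True if every byte has its most significant bit (bit 7) set.
bool allBytesHaveMsbSet(const std::string &bytes)
{
    for (unsigned char b : bytes)
        if ((b & 0x80) == 0)
            return false;
    return true;
}
```

Fed with the eight bytes d8 b3 d9 84 d8 a7 d9 85, decodeUtf8Basic yields the four code points U+0633, U+0644, U+0627, U+0645, while decodeLatin1 yields eight unrelated characters starting with U+00D8.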


Some continuation:

OP sent the following snapshot:

So, it seems instead of

d8 b3 d9 84 d8 a7 d9 85

he got

00 08 d8 b3 d9 84 d8 a7 d9 85

A possible interpretation:

The server first sends a 16-bit length, 00 08, which interpreted as a big-endian 16-bit integer is 8, followed by 8 bytes encoded in UTF-8 (which look exactly like the ones I got while experimenting above). (AFAIK, it's not unusual to use big-endian for binary network protocols to prevent endianness issues if sender and receiver natively have different endianness.) Further reading, e.g. here: htons(3) - Linux man page

On the i386 the host byte order is Least Significant Byte first, whereas the network byte order, as used on the Internet, is Most Significant Byte first.
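The byte-order point can be illustrated with a few lines of plain C++. This is a sketch with hypothetical helper names; it builds the value from individual bytes with shifts, so it gives the same result regardless of the host's native byte order. (In a Qt program, qFromBigEndian from <QtEndian> is probably the more idiomatic choice.)

```cpp
#include <array>
#include <cstdint>

// Splits a 16-bit value into network byte order: most significant byte first.
std::array<std::uint8_t, 2> toBigEndian16(std::uint16_t value)
{
    return {{ static_cast<std::uint8_t>(value >> 8),
              static_cast<std::uint8_t>(value & 0xFF) }};
}

// Reads two bytes as a big-endian 16-bit integer (first byte is the high byte).
std::uint16_t fromBigEndian16(std::uint8_t first, std::uint8_t second)
{
    return static_cast<std::uint16_t>((first << 8) | second);
}

// Reads the same two bytes as little-endian, for comparison only.
std::uint16_t fromLittleEndian16(std::uint8_t first, std::uint8_t second)
{
    return static_cast<std::uint16_t>(first | (second << 8));
}
```

For the prefix 00 08, fromBigEndian16(0x00, 0x08) gives 8, which matches the number of payload bytes, whereas the little-endian reading gives 2048, which does not.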


OP states that the protocol in use is Java's DataOutput.writeUTF:

Writes two bytes of length information to the output stream, followed by the modified UTF-8 representation of every character in the string s. If s is null, a NullPointerException is thrown. Each character in the string s is converted to a group of one, two, or three bytes, depending on the value of the character.
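To see how the ten bytes on the wire come about, the sender side can be imitated in plain C++. This is only a sketch of the format as described in the quote, not Java's actual implementation; the function names encodeModifiedUtf8 and writeUtfFrame are invented for this example. The input is a sequence of UTF-16 code units, as in a Java String, and U+0000 is written as the two bytes C0 80, which is the main way "modified" UTF-8 differs from standard UTF-8 for such text.

```cpp
#include <stdexcept>
#include <string>

// Converts each UTF-16 code unit to a group of one, two, or three bytes.
std::string encodeModifiedUtf8(const std::u16string &text)
{
    std::string out;
    for (char16_t c : text) {
        if (c >= 0x0001 && c <= 0x007F) {
            out.push_back(static_cast<char>(c));                        // 0xxxxxxx
        } else if (c <= 0x07FF) {                                       // also U+0000
            out.push_back(static_cast<char>(0xC0 | (c >> 6)));          // 110xxxxx
            out.push_back(static_cast<char>(0x80 | (c & 0x3F)));        // 10xxxxxx
        } else {
            out.push_back(static_cast<char>(0xE0 | (c >> 12)));         // 1110xxxx
            out.push_back(static_cast<char>(0x80 | ((c >> 6) & 0x3F))); // 10xxxxxx
            out.push_back(static_cast<char>(0x80 | (c & 0x3F)));        // 10xxxxxx
        }
    }
    return out;
}

// Prepends the payload size in bytes as a big-endian 16-bit integer.
std::string writeUtfFrame(const std::u16string &text)
{
    const std::string payload = encodeModifiedUtf8(text);
    if (payload.size() > 0xFFFF)
        throw std::length_error("payload does not fit a 16-bit length");
    std::string frame;
    frame.push_back(static_cast<char>(payload.size() >> 8));
    frame.push_back(static_cast<char>(payload.size() & 0xFF));
    frame += payload;
    return frame;
}
```

Calling writeUtfFrame(u"\u0633\u0644\u0627\u0645") produces exactly the sequence 00 08 d8 b3 d9 84 d8 a7 d9 85 reported by OP, which supports the interpretation of the two leading bytes as a length prefix.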

So, the decoding could look like this:

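// Sample frame: 2 length bytes (big-endian) followed by 8 bytes of UTF-8.
QByteArray readData("\x00\x08\xd8\xb3\xd9\x84\xd8\xa7\xd9\x85", 10);
//QByteArray readData = socket->readAll();
// Combine the first two bytes into the payload length.
unsigned length
  = ((uint8_t)readData[0] <<  8) + (uint8_t)readData[1];
// Skip the 2 length bytes and decode exactly `length` bytes as UTF-8.
QString text = QString::fromUtf8(readData.data() + 2, length);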

  1. The first two bytes are extracted from readData and combined into the length (decoding a big-endian 16-bit integer).

  2. The rest of readData is converted to a QString, passing the previously extracted length. Thereby, the first 2 length bytes of readData are skipped.
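One caveat with the two steps above: a single read from a stream socket is not guaranteed to return exactly one complete message. It may return only part of a frame or several frames at once. Below is a hedged sketch of how a receive buffer could be handled, written in plain C++ so that it stands alone. The function takeFrame and its signature are hypothetical, and std::string stands in for QByteArray.

```cpp
#include <cstddef>
#include <string>

// Tries to take one complete frame (2-byte big-endian length + payload) from the
// front of `buffer`. On success, stores the payload, removes the frame from the
// buffer, and returns true. If the frame is still incomplete, it returns false
// and leaves the buffer untouched so that more data can be appended later.
bool takeFrame(std::string &buffer, std::string &payload)
{
    if (buffer.size() < 2)
        return false;                                   // length prefix incomplete
    const std::size_t length =
        (static_cast<std::size_t>(static_cast<unsigned char>(buffer[0])) << 8)
        | static_cast<unsigned char>(buffer[1]);
    if (buffer.size() < 2 + length)
        return false;                                   // payload incomplete
    payload = buffer.substr(2, length);
    buffer.erase(0, 2 + length);
    return true;
}
```

In a Qt program, the bytes returned by each readAll() call would be appended to such a buffer, and each extracted payload could then be passed to QString::fromUtf8.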
