Is there a list of ASCII-extending encodings?
Problem description
I need to decide when (not) to convert a text file based on the known file encoding and the desired output encoding.
If the text is US-ASCII, I don't need to convert it if the output encoding is ASCII, UTF-8, Latin1, ...
Obviously I need to convert a US-ASCII file to UTF-16 or UTF-32.
A list of standard encodings exists at
http://www.iana.org/assignments/character-sets/character-sets.xml
A conversion is necessary if:
- the minimal character size is > 1 byte or
- the first 128 code points (0x00 - 0x7F) are not the same as US-ASCII.
I'd like to know:
- Is there a similar list with details (byte length, ASCII compatibility) about the implementation of each encoding?
- I'd be happy with a list containing only the codecs supported by Qt5.
EDIT
I already found an answer to the question
- Are all 8-bit and variable-width 8-bit-based codecs supersets of ASCII?
- In other words: Can US-ASCII be interpreted as any 8-bit or variable-width 8-bit-based encoding?
here: Character set that is not a superset of ASCII
Instead, it would be helpful to know:
- Is there a list of character sets which are supersets of ASCII?
This looks promising:
mime.charsets - list of character sets which are ASCII supersets,
but I couldn't find an actual mime.charsets file.
An alternative approach is to decode the bytes 0x00 - 0x7F in the given encoding, and check that the characters match ASCII. For example, in Python 3.x:
def is_ascii_superset(encoding):
    for codepoint in range(128):
        if bytes([codepoint]).decode(encoding, 'ignore') != chr(codepoint):
            return False
    return True
This gives:
>>> is_ascii_superset('US-ASCII')
True
>>> is_ascii_superset('windows-1252')
True
>>> is_ascii_superset('ISO-8859-15')
True
>>> is_ascii_superset('UTF-8')
True
>>> is_ascii_superset('UTF-16')
False
>>> is_ascii_superset('IBM500') # a variant of EBCDIC
False
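Building on the check above, one way to get a list rather than a single answer is to run the same test over every codec name Python knows about. The following is only a sketch under a few assumptions: the codec names are taken from `encodings.aliases`, any codec that cannot decode bytes at all is treated as not ASCII-compatible, and `ascii_superset_encodings` is a hypothetical helper name. The result reflects the local Python installation, not Qt or the IANA registry.

```python
import encodings.aliases


def is_ascii_superset(encoding):
    """Return True if bytes 0x00-0x7F decode to the same ASCII characters."""
    for codepoint in range(128):
        try:
            decoded = bytes([codepoint]).decode(encoding, 'ignore')
        except Exception:
            # Unknown codec, non-text codec (e.g. base64), or a codec that
            # rejects this input: treat it as not ASCII-compatible.
            return False
        if decoded != chr(codepoint):
            return False
    return True


def ascii_superset_encodings():
    """Hypothetical helper: list the known codecs that pass the check."""
    # The alias table maps many spellings to one canonical codec name,
    # so de-duplicate the canonical names before testing them.
    names = sorted(set(encodings.aliases.aliases.values()))
    return [name for name in names if is_ascii_superset(name)]


if __name__ == '__main__':
    for name in ascii_superset_encodings():
        print(name)
```

Note that passing this test only says that single bytes 0x00 - 0x7F decode to ASCII. It says nothing about what the codec writes on output; `utf_8_sig`, for example, prepends a byte order mark when encoding.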
EDIT: Get US-ASCII compatibility for each encoding supported by your Qt version in C++:
#include <QTextCodec>
#include <QMap>

typedef enum
{
    eQtCodecUndefined,
    eQtCodecAsciiIncompatible,
    eQtCodecAsciiCompatible,
} tQtCodecType;

QMap<QByteArray, tQtCodecType> QtCodecTypes()
{
    QMap<QByteArray, tQtCodecType> CodecTypes;
    // How to test Qt's interpretation of ASCII data?
    QList<QByteArray> available = QTextCodec::availableCodecs();
    // because Qt has no US-ASCII, but we only test bytes 0-127 and UTF-8 is a superset of US-ASCII
    QTextCodec *referenceCodec = QTextCodec::codecForName("UTF-8");
    if(referenceCodec == nullptr)
    {
        qDebug("Unable to get reference codec 'UTF-8'");
        return CodecTypes;
    }
    for(int i = 0; i < available.count(); i++)
    {
        const QByteArray name = available.at(i);
        QTextCodec *currCodec = QTextCodec::codecForName(name);
        if(currCodec == nullptr)
        {
            qDebug("Unable to get codec for '%s'", qPrintable(QString(name)));
            CodecTypes.insert(name, eQtCodecUndefined);
            continue;
        }
        tQtCodecType type = eQtCodecAsciiCompatible;
        for(uchar j = 0; j < 128; j++) // UTF-8 == US-ASCII in the lower 7 bits
        {
            const char c = static_cast<char>(j); // character to test, < 2^7
            QString sRef, sTest;
            sRef = referenceCodec->toUnicode(&c, 1); // convert character to UTF-16 (QString internal) assuming it is ASCII (via UTF-8)
            sTest = currCodec->toUnicode(&c, 1); // convert character to UTF-16 assuming it is of type [currCodec]
            if(sRef != sTest) // compare both UTF-16 representations -> if they are equal, these codecs are transparent for Qt
            {
                type = eQtCodecAsciiIncompatible;
                break;
            }
        }
        CodecTypes.insert(name, type);
    }
    return CodecTypes;
}
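Coming back to the original question of when a conversion can be skipped, the compatibility test can be folded into a small decision function. This is a minimal sketch in Python, not part of the answers above: `needs_conversion` is a hypothetical name, and it assumes that the file content is available as bytes and that both encoding names are known to Python.

```python
import codecs


def is_ascii_superset(encoding):
    """Return True if bytes 0x00-0x7F decode to the same ASCII characters."""
    for codepoint in range(128):
        if bytes([codepoint]).decode(encoding, 'ignore') != chr(codepoint):
            return False
    return True


def needs_conversion(data, source_encoding, target_encoding):
    """Hypothetical helper: decide whether `data` must be re-encoded."""
    # The same codec under two aliases (e.g. 'latin1' and 'ISO-8859-1')
    # never needs a conversion.
    if codecs.lookup(source_encoding).name == codecs.lookup(target_encoding).name:
        return False
    # Pure 7-bit data is byte-identical in every ASCII superset.
    if (data.isascii()
            and is_ascii_superset(source_encoding)
            and is_ascii_superset(target_encoding)):
        return False
    return True


if __name__ == '__main__':
    print(needs_conversion(b'hello', 'US-ASCII', 'UTF-8'))
    print(needs_conversion(b'hello', 'US-ASCII', 'UTF-16'))
```

The sketch only compares byte values. A target format that expects a byte order mark or another fixed prefix would need an extra check that is not shown here.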