Prothon should not borrow Python strings!


Problem description


I skimmed the tutorial and something alarmed me.

"Strings are a powerful data type in Prothon. Unlike many languages,
they can be of unlimited size (constrained only by memory size) and can
hold any arbitrary data, even binary data such as photos and movies.They
are of course also good for their traditional role of storing and
manipulating text."

This view of strings is about a decade out of date with modern
programming practice. From the programmer's point of view, a string
should be a list of characters. Characters are logical objects that have
properties defined by Unicode. This is the model used by Java,
Javascript, XML and C#.

Characters are an extremely important logical concept for human beings
(computers are supposed to serve human beings!) and they need
first-class representation. It is an accident of history that the
language you grew up with has so few characters that they can have a
one-to-one correspondence with bytes.

I can understand why you might be afraid to tackle all of Unicode for
version 1.0. Don't bother. All you need to do today to avoid the dead
end is DO NOT ALLOW BINARY DATA IN STRINGS. Have a binary data type.
Have a character string type. Give them a common "prototype" if you
wish. Let them share methods. But keep them separate in your code. The
result of reading a file is a binary data string. The result of parsing
an XML file is a character string. These are as different as the bits
that represent an integer in a particular file format and a logical integer.
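A minimal sketch of the split being argued for, using today's Python 3 (which later adopted essentially this model: bytes for binary data, str for character strings):

```python
# Python 3 later adopted this split: bytes for binary data, str for
# character strings. The two share many methods but never mix silently.
data = "ab\u1234".encode("utf-8")   # binary data (a bytes object)
text = data.decode("utf-8")         # character string (a str object)

assert isinstance(data, bytes) and isinstance(text, str)
assert len(text) == 3               # three characters...
assert len(data) == 5               # ...but five bytes in UTF-8
```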

Even if your character data type is today limited to characters between
0 and 255, you can easily extend that later. But once you have megabytes
of code that makes no distinction between characters and bytes it will
be too late. It would be like trying to tease apart integers and floats
after having treated them as indistinguishable. (which brings me to my
next post)

Paul Prescod

Recommended answer

" Paul Prescod" < PA ** @ prescod.net>写了
"Paul Prescod" <pa**@prescod.net> wrote
我可以理解为什么你可能害怕为版本1.0解决所有的Unicode问题。不要打扰。今天你需要做的就是避免死亡结束是不允许在字符串中包含二进制数据。有二进制数据类型。
有一个字符串类型。给他们一个共同的原型如果你愿意的话。让他们分享方法。但是在代码中将它们分开。读取文件的结果是二进制数据字符串。解析XML文件的结果是一个字符串。它们与表示特定文件格式的整数和逻辑
I can understand why you might be afraid to tackle all of Unicode for
version 1.0. Don't bother. All you need to do today to avoid the dead
end is DO NOT ALLOW BINARY DATA IN STRINGS. Have a binary data type.
Have a character string type. Give them a common "prototype" if you
wish. Let them share methods. But keep them separate in your code. The
result of reading a file is a binary data string. The result of parsing
an XML file is a character string. These are as different as the bits
that represent an integer in a particular file format and a logical



integer.

This is very timely. I would like to resolve issues like this by July and
that deadline is coming up very fast.

We have had discussions on the Prothon mailing list about how to handle
Unicode properly but no one pointed this out. It makes perfect sense to me.

Is there any dynamic language that already does this right for us to steal
from or is this new territory? I know for sure that I don't want to steal
Java's streams. I remember hating them with a passion.


Mark Hahn wrote:
" Paul Prescod" < PA ** @ prescod.net>写了




是否有任何动态语言可以帮助我们从这个新领域偷取
?我肯定知道我不想窃取Java的流。我记得以激情憎恨他们。
"Paul Prescod" <pa**@prescod.net> wrote


...

Is there any dynamic language that already does this right for us to steal
from or is this new territory? I know for sure that I don't want to steal
Java's streams. I remember hating them with a passion.





I don't consider myself an expert: there are just some big mistakes that
I can recognize. But I'll give you as much guidance as I can.

Start here:

http://www.joelonsoftware.com/articles/Unicode.html

Summary:

"""It does not make sense to have a string without knowing what encoding
it uses. You can no longer stick your head in the sand and pretend that
"plain" text is ASCII.

There Ain't No Such Thing As Plain Text.

If you have a string, in memory, in a file, or in an email message, you
have to know what encoding it is in or you cannot interpret it or
display it to users correctly."""

One thing I should have told you is that it is just as important to get
your internal APIs right as your syntax. If you embed the "ASCII
assumption" into your APIs you will have a huge legacy of third party
modules that expect all characters to be <255 and you'll be stuck in the
same cul de sac as Python.

I would define macros like

#define PROTHON_CHAR int

and functions like

Prothon_String_As_UTF8
Prothon_String_As_ASCII // raises error if there are high characters

Obviously I can't think through the whole API. Look at Python,
JavaScript and JNI, I guess.

http://java.sun.com/docs/books/jni/h...ypes.html#4001

The gist is that extensions should not poke into the character string
data structure expecting the data to be a "char *" of ASCII bytes.
Rather it should ask you to decode the data into a new buffer. Maybe you
could do some tricky buffer reuse if the encoding they ask for happens
to be the same as your internal structure (look at the Java "isCopy"
stuff). But if you promise users the ability to directly fiddle with the
internal data then you may have to break that promise one day.

To get from a Prothon string to a C string requires encoding because
_there ain't no such thing as a plain string_. If the C programmer
doesn't tell you how they want the data encoded, how will you know?
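The point is easy to demonstrate with today's Python codecs: the same character string has different byte representations under different encodings, so a conversion that does not name an encoding is meaningless:

```python
# One character string, two different byte representations.
s = "caf\u00e9"                           # 'café'
assert s.encode("utf-8") == b"caf\xc3\xa9"
assert s.encode("latin-1") == b"caf\xe9"
assert s.encode("utf-8") != s.encode("latin-1")
```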

If you get the APIs right, it will be much easier to handle everything
else later.

Choosing an internal encoding is actually pretty tricky because there
are space versus time tradeoffs and you need to make some guesses about
how often particular characters are likely to be useful to your users.
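The tradeoff can be made concrete with today's Python codecs: which internal encoding is compact depends entirely on which characters your users actually write:

```python
ascii_text = "hello world " * 100    # mostly-ASCII text (1200 chars)
cjk_text = "\u4f60\u597d" * 100      # CJK text (200 chars)

# UTF-8: 1 byte per ASCII character, 3 bytes per CJK character
assert len(ascii_text.encode("utf-8")) == 1200
assert len(cjk_text.encode("utf-8")) == 600
# UTF-16: 2 bytes per BMP character, regardless of script
assert len(ascii_text.encode("utf-16-le")) == 2400
assert len(cjk_text.encode("utf-16-le")) == 400
```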

==

On the question of types: there are two models that seem to work okay in
practice. Python's split between byte strings and Unicode strings is
actually not bad except that the default string literal is a BYTE string
(for historical reasons) rather than a character string.

a =" a \\\ሴ"
b = u" ab \ u1234"
a
''a \\\\ u1234''b
u''ab \ u1234''len(a)
8 len(b)
3


这里是Javascript的功能(即更好):


< script>

str =" a \\\ሴ"

alert(str.length)// 3

< / script>


===


顺便说一句,如果你有勇气让自己远离阳光下的每一种语言,我会建议你在未知的转义序列上抛出

异常。在Python中很容易让
意外地使用了如上所述不正确的转义序列。另外,

几乎不可能向Python添加新的转义序列,因为它们可能会在某处破坏某些代码。我不明白为什么这个案子足够特别破坏通常的Python承诺不猜。什么

程序员面对模棱两可的意思。这是另一个你必须在一开始就得到的东西,因为很难

稍后改变!另外,我完全不喜欢字符数字不是b $ b分隔的。它应该是\u {1}或\u {1234}或\u {12345}。我发现Python

非常奇怪:

u" \1"
u''\x01''u" \12"
u''\ n''u" \ 0123"
你'''u" \ 1234"
u''S4''u" \ u1234"
u''\ u1234''u" \ u123"
UnicodeDecodeError:''unicodeescape''编解码器无法解码位置上的字节

0-4:转义序列中的字符串结尾


====


无论如何,Python模型是

字符串(Python调用unicode strings)和byte

字符串(称为8位字符串)。如果你想解码从文件读取的数据,你只需:


文件(" filename")。read()。decode (ascii)





文件(" filename")。read()。decode(" utf-8" ;)


这里是一个字符串和

字节字符串之间清晰分割的说明:

file(" ; filename")。read()
< bytestring ['''',''b'',''c''...]> file(" filename")。read()。decode(" ascii")
a = "a \u1234"
b = u"ab\u1234"
a ''a \\u1234'' b u''ab\u1234'' len(a) 8 len(b) 3

Here's what Javascript does (i.e. better):

<script>
str = "a \u1234"
alert(str.length) // 3
</script>

===

By the way, if you have the courage to distance yourself from every
other language under the sun, I would propose that you throw an
exception on unknown escape sequences. It is very easy in Python to
accidentally use an escape sequence that is incorrect, as above. Plus,
it is near impossible to add new escape sequences to Python because they
may break some code somewhere. I don't understand why this case is
special enough to break the usual Python commitment to "not guess" what
programmers mean in the face of ambiguity. This is another one of those
things you have to get right at the beginning because it is tough to
change later! Also, I totally hate how character numbers are not
delimited. It should be \u{1} or \u{1234} or \u{12345}. I find Python
totally weird:
u"\1" u''\x01'' u"\12" u''\n'' u"\123" u''S'' u"\1234" u''S4'' u"\u1234" u''\u1234'' u"\u123" UnicodeDecodeError: ''unicodeescape'' codec can''t decode bytes in position
0-4: end of string in escape sequence
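The octal-escape behaviour that makes this so error-prone survives in today's Python and is easy to check:

```python
# Octal escapes silently consume up to three digits; \u takes exactly
# four hex digits. Mixing them up changes the meaning without warning.
assert u"\123" == "S"          # chr(0o123) is 'S'
assert u"\1234" == "S4"        # octal escape stops after three digits
assert len(u"\u1234") == 1     # a single character, U+1234
```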

====

So anyhow, the Python model is that there is a distinction between
character strings (which Python calls "unicode strings") and byte
strings (called 8-bit strings). If you want to decode data you are
reading from a file, you can just:

file("filename").read().decode("ascii")

or

file("filename").read().decode("utf-8")

Here's an illustration of a clean split between character strings and
byte strings:
file("filename").read() <bytestring [''a'', ''b'', ''c''...]> file("filename").read().decode("ascii")



" abc"

现在Javascript模型似乎也有用,有点不同。
不同。只有一种字符串类型,但每个字符可以使用最多2 ^ 16的
值(稍后更多关于此数字)。

http://www.mozilla.org/js/language/e ... on.html#string

如果您在JavaScript中读取二进制数据,那么实现似乎只需将
映射到相应的Unicode代码点(另一种方式

说这是他们默认使用latin-1编码)。这应该在大多数浏览器中工作:


< SCRIPT language =" Javascript">

datafile =" http://www.python.org/pics/pythonHi.gif"


httpconn = new XMLHttpRequest();

httpconn.open(" GET" ;,datafile,false);

httpconn.send(null);

alert(httpconn.responseText);

< / SCRIPT>

< BODY>< / BODY>

< / HTML>


(忽略对Xml的引用上面。由于某种原因,微软决定将XML和HTTP混合在他们的API中。在这种情况下,我们无论如何都在使用XML,而不是XML。


我打算写Javascript也有一个函数允许你明确解码
。这是合乎逻辑的。您可以想象,您可以根据需要进行多种解码:


objXml.decode(" utf-8")。decode( latin-1。解码(utf-8)。decode(koi8-r)


这个模型有点简单。因为只有一个字符串

对象,程序员只是直接解读它是否已经解码了(或已经解码了多少次,如果对于

一些奇怪的原因它是双重或三重编码的。)


但事实证明我找不到Javascript Unicode解码

通过Google运行。更多证据表明Javascript已经死了脑筋我猜想。


无论如何,这描述了两个模型:一个是字节(0-255)和字符

(0-2 ** 16或2 ** 32)字符串是严格分开的,其中一个字节

字符串只被视为字符串的子集。你b $ b绝对不想要的是将字符处理完全留在应用程序员的

域中作为C和早期版本和

Python做了。


到字符范围。严格来说,Unicode上限为2 ^ 20

个字符。你会注意到这只是超过2 ^ 16,这是一个更方便(和节省空间)的数字。处理这种情况有三种基本方法




1.每个字符可以使用两个字节而忽略该问题。

这些字符不可用。处理它!听起来并不是很疯狂

因为高级角色还没有被普遍使用。


2.你可以直接使用3(或者更可能是每个字符4个字节。

内存很便宜。处理它!


3.你可以做一些技巧,你可以用两个字节切换到两个字节到四个字节的模式,使用替代物" [1]。这实际上距离

" 1"如果你将代理人的操纵完全留在

应用程序代码中。我相信这是Java使用的策略[2]和

Javascript。[3]


[1] http://www.i18nguy.com/surrogates.html


[2]只接受char值的方法不能支持

补充字符。他们将代理

范围内的char值视为未定义字符。

http://java.sun.com/j2se/1.5.0/docs/...Character.html


"字符是单个Unicode 16位代码点。我们用单引号将它们写成

?和?正好有65536个字符:

?? u0000 ??,?? u0001 ??,...,?A?,?B?,?C?,...,?? uFFFF? ? (另请参阅
非ASCII字符的
表示法)。为了本规范的目的,Unicode代理被认为是b / b



[3] http://www.mozilla.org/js/language/j .. ./notation.html

从正确的角度来看,4字节字符显然是b / b
Unicode正确。从性能的角度来看,大多数语言

设计人员已选择扫描表格下的问题并希望

每个字符16位继续足够大多数当时的而那些

那些关心更多的人会明确地编写自己的代码来处理

的高级角色。


Paul Prescod


"abc"

Now the Javascript model, which also seems to work, is a little bit
different. There is only one string type, but each character can take
values up to 2^16 (more on this number later).

http://www.mozilla.org/js/language/e...on.html#string

If you read binary data in JavaScript, the implementations seem to just
map each byte to a corresponding Unicode code point (another way of
saying that is that they default to the latin-1 encoding). This should
work in most browsers:

<SCRIPT language = "Javascript">
datafile = "http://www.python.org/pics/pythonHi.gif"

httpconn = new XMLHttpRequest();
httpconn.open("GET",datafile,false);
httpconn.send(null);
alert(httpconn.responseText);
</SCRIPT>
<BODY></BODY>
</HTML>

(ignore the reference to "Xml" above. For some reason Microsoft decided
to conflate XML and HTTP in their APIs. In this case we are doing
nothing with XML whatsoever)
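The byte-to-code-point default described above is exactly latin-1's defining property: each of the 256 byte values maps to the Unicode code point with the same number. In Python terms:

```python
data = bytes(range(256))        # every possible byte value
text = data.decode("latin-1")   # latin-1 decoding never fails

# Each byte maps to the code point of the same number, losslessly.
assert [ord(c) for c in text] == list(range(256))
assert text.encode("latin-1") == data
```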

I was going to write that Javascript also has a function that allows you
to explicitly decode. That would be logical. You could imagine that you
could do as many levels of decoding as you like:

objXml.decode("utf-8").decode("latin-1").decode("utf-8").decode("koi8-r")

This model is a little bit "simpler" in that there is only one string
object and the programmer just keeps straight in their head whether it
has been decoded already (or how many times it has been decoded, if for
some strange reason it were double or triple-encoded).
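The hazard of keeping "how decoded is it" in your head is the classic mojibake round-trip, sketched here with today's Python:

```python
raw = "caf\u00e9".encode("utf-8")   # b'caf\xc3\xa9' on the wire
wrong = raw.decode("latin-1")       # guessed the wrong encoding
assert wrong == "caf\u00c3\u00a9"   # mojibake: 'cafÃ©'

# Because latin-1 maps bytes to code points losslessly, the damage
# here happens to be reversible; most wrong guesses are not.
fixed = wrong.encode("latin-1").decode("utf-8")
assert fixed == "caf\u00e9"
```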

But it turns out that I can't find a Javascript Unicode decoding
function through Google. More evidence that Javascript is brain-dead I
suppose.

Anyhow, that describes two models: one where byte (0-255) and character
(0-2**16 or 2**32) strings are strictly separated and one where byte
strings are just treated as a subset of character strings. What you
absolutely do not want is to leave character handling totally in the
domain of the application programmer as C and early versions of
Python did.

On to character ranges. Strictly speaking, the Unicode cap is 2^20
characters. You'll notice that this is just beyond 2^16, which is a much
more convenient (and space efficient) number. There are three basic ways
of dealing with this situation.

1. You can use two bytes per character and simply ignore the issue.
"Those characters are not available. Deal with it!" That isn''t as crazy
as it sounds because the high characters are not in common use yet.

2. You could directly use 3 (or more likely 4) bytes per character.
"Memory is cheap. Deal with it!"

3. You could do tricks where you sort of page switch from two-byte to
four-byte mode using "surrogates".[1] This is actually not that far from
"1" if you leave the manipulation of the surrogates entirely in
application code. I believe this is the strategy used by Java[2] and
Javascript.[3]

[1] http://www.i18nguy.com/surrogates.html

[2] "The methods that only accept a char value cannot support
supplementary characters. They treat char values from the surrogate
ranges as undefined characters."

http://java.sun.com/j2se/1.5.0/docs/...Character.html

"Characters are single Unicode 16-bit code points. We write them
enclosed in single quotes ? and ?. There are exactly 65536 characters:
??u0000??, ??u0001??, ...,?A?, ?B?, ?C?, ...,??uFFFF?? (see also
notation for non-ASCII characters). Unicode surrogates are considered to
be pairs of characters for the purpose of this specification."

[3] http://www.mozilla.org/js/language/j.../notation.html
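Option 3 can be seen concretely by encoding a character beyond the BMP to UTF-16, shown here with today's Python (whose strings are code-point based):

```python
ch = "\U00010000"                        # first code point past the BMP
assert len(ch) == 1                      # one logical character

units = ch.encode("utf-16-le")
assert len(units) == 4                   # but two 16-bit code units

hi = int.from_bytes(units[0:2], "little")
lo = int.from_bytes(units[2:4], "little")
assert 0xD800 <= hi <= 0xDBFF            # high surrogate
assert 0xDC00 <= lo <= 0xDFFF            # low surrogate
```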

From a correctness point of view, 4-byte chars are obviously
Unicode-correct. From a performance point of view, most language
designers have chosen to sweep the issue under the table and hope
that 16 bits per char continue to be enough "most of the time" and that
those who care about more will explicitly write their own code to deal
with high characters.

Paul Prescod


Mark Hahn wrote:
Is there any dynamic language that already does this right for us to steal
from or is this new territory? I know for sure that I don't want to steal
Java's streams. I remember hating them with a passion.







Java's bytes being signed also caused no end of annoyance for me.
In our protocol marshalling code (thankfully mostly auto generated)
there was lots of code just to turn the signed bytes back into
unsigned bytes.
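The conversion that marshalling code has to perform everywhere is just a mask; in Python terms (with Java's signed-byte semantics shown for illustration):

```python
def to_unsigned(b):
    """Convert a Java-style signed byte (-128..127) to 0..255."""
    return b & 0xFF

assert to_unsigned(-1) == 255     # Java reads byte 0xFF as -1
assert to_unsigned(-128) == 128   # and byte 0x80 as -128
assert to_unsigned(127) == 127    # 0..127 pass through unchanged
```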

(I also *very* strongly agree with Paul.)

Roger

