Twitter文本压缩的挑战 [英] Twitter text compression challenge
问题描述
规则
- 您的程式必须有两种模式:编码和解码 : 作为输入一些人类可读的
- 忽略标点符号无关紧要。
- 您只需要担心实际的英语单词,
- 任何重音字母都可以转换为简单的ASCII。
- 您可以选择处理数字的方式。
- 123
- 一两三
- 一百二十三
- 123
- 1 2 3
- 一百二十三
- 一两三
- 一百二十三
- 123 b $ b
- 1 2 3
-
输出可在
范围内的U + 0000
- code> -
U + 10FFFF
排除非字符:
-
U + FFFE
-
U + FFFF
-
U +
n
/ em>FFFE
,U +
n
/ em>FFFF
其中n
是1
-10
十六进制 -
U + FDD0
-U + FDF
-
U + D800
DFFF (代理码点)。
-
- code> -
Latin1
文本,大概是英语。 它可以以您选择的任何合理编码输出;
GNU iconv
支持的任何编码
当
解码:- 您的程序应该将 encoding li>
- 文本输出应为输入文本的近似值。
- 越接近原始文字越好。
- 不需要任何标点符号
输出文字应该是人类可读的,大概是英文。
- 可以是L337或lol。
的任何其它输出;
也就是说,你不能在某处上传文本,并输出URL
以供解码过程下载,或者任何类似的东西。
li>
- 您的程序必须是一个脚本可以设置为在具有适当解释器
或可编译成可执行文件的程序的平台上可执行。 - 您的程序必须将其作为第一个参数
编码
或解码
以设置模式。 - 您的程序必须输入至少采用以下方法之一:
- 从标准输入输入,并在标准输出上生成输出。
-
my-program encode< input.txt> output.utf
-
my-program decode< output.utf> output.txt
-
- 从第二个参数中命名的文件获取输入,并在第三个文件中命名的文件中生成输出。
-
my-program encode input.txt output.utf
-
my-program decode output.utf output.txt
-
- 从标准输入输入,并在标准输出上生成输出。
(如果它很长,或需要许多文件来编译,或者什么)。
或如果代码很长,人们会有兴趣的摘要。
尝试改进别人的想法是可以的,但是你必须对它们进行归属。
这些规则是规则的变体 Twitter图片编码挑战 。
不知道我会有时间/能量跟随实际代码,但这里是我的想法:
- 任何任意LATIN 1字符串在一定长度可以简单编码(即使压缩),而不会丢失到140个字符。天真的估计是280个字符,虽然在比赛规则中的代码点限制,它可能比那个稍短。
- 字符串比上述长度稍长(让宾客在280和500个字符)最可能使用标准压缩技术缩减为足够短的字符串,以允许上述编码。
那么,我们开始在文本中丢失信息。因此,执行以下步骤的最小数量以将字符串减少到可以使用上述方法进行压缩/编码的长度。此外,如果只是在子串上执行这些替换,它们将足够短(我可能会通过字符串向后),则不要在整个字符串上执行这些替换。 / p>
- 将所有高于127的拉丁字母1字符(主要是重音字母和时髦符号)替换为非重音字母字符中最接近的等效字符,
- 替换所有非字母数字(任何剩余的符号或标点符号)
- 将所有双字母字母(balon)。
- 用较短的等效字符替换其他常用字母组合(CK为K,WR为R等)
- 替换所有元音(aeiouy)与
- 将所有高字母(bdfhklt)替换为l
- 将所有短字母(cmnrsvwxz) / li>
- 将所有悬挂字母(gjpq)替换为p
- Your program must have two modes: encoding and decoding.
When encoding:
- Your program must take as input some human readable
Latin1
text, presumably English.- It doesn't matter if you ignore punctuation marks.
- You only need to worry about actual English words, not L337.
- Any accented letters may be converted to simple ASCII.
- You may choose how you want to deal with numbers.
- 123
- one two three
- one hundred twenty three
- 123
- 1 2 3
- one hundred twenty three
- one two three
- one hundred twenty three
- 123
- 1 2 3
Your program must output a message which can be represented in
140 code points in the range
U+0000
–U+10FFFF
Excluding non-characters:
U+FFFE
U+FFFF
U+
n
FFFE
,U+
n
FFFF
wheren
is1
–10
hexadecimalU+FDD0
–U+FDEF
U+D800
–U+DFFF
(surrogate code points).
It may be output in any reasonable encoding of your choice; any encoding supported by GNU
iconv
will be considered reasonable, and your platform native encoding or locale encoding would likely be a good choice.- Your program must take as input some human readable
When decoding:
- Your program should take as input the output of your encoding mode.
- The text output should be an approximation of the input text.
- The closer you can get to the original text, the better.
- Doesn't need to have any punctuation.
The output text should be readable by a human, again presumably English.
- Can be L337, or lol.
- The decoding process may have no access to any other output of the encoding process other than the output specified above; that is, you can't upload the text somewhere and output the URL for the decoding process to download, or anything silly like that.
- For the sake of consistency in user interface, your program must behave as follows:
- Your program must be a script that can be set to executable on a platform with the appropriate interpreter, or a program that can be compiled into an executable.
- Your program must take as its first argument either
encode
ordecode
to set the mode. - Your program must take input in at least one of the following ways:
- Take input from standard in and produce output on standard out.
my-program encode <input.txt >output.utf
my-program decode <output.utf >output.txt
- Take input from a file named in the second argument, and produce output in the file named in the third.
my-program encode input.txt output.utf
my-program decode output.utf output.txt
- Take input from standard in and produce output on standard out.
- For your solution, please post:
- Your code, in full, and/or a link to it hosted elsewhere (if it's very long, or requires many files to compile, or something).
- An explanation of how it works, if it's not immediately obvious from the code or if the code is long and people will be interested in a summary.
- An example text, with the original text, the text it compresses down to, and the decoded text.
- If you are building on an idea that someone else had, please attribute them. It's OK to try to do a refinement of someone else's idea, but you must attribute them.
- Any arbitrary LATIN 1 string under a certain length could be simply encoded (not even compressed) with no loss into 140 characters. The naive estimate is 280 characters, although with the code point restrictions in the contest rules, its probably a little shorter than that.
- Strings slightly longer than the above length (lets guestimate between 280 and 500 characters) can most likely be shrunk using standard compression techniques, into a string short enough to allow the above encoding.
- Replace all LATIN 1 characters above 127 (primarily accented letters and funky symbols) with their closest equivalent in non-accented alphabetic characters, or possibly with a generic symbol replacement like "#"
- Replace all uppercase letters with their equivalent lowercase form
- Replace all non-alphanumerics (any remaining symbols or punctuation marks) with a space
- Replace all numbers with 0
- Replace all double-letters (balloon) with a single letter (balon). Will look weird, but still hopefully decipherable by the reader.
- Replace other common letter combinations with shorter equivalents (CK with K, WR with R, etc)
- Replace all vowels (aeiouy) with a
- Replace all "tall" letters (bdfhklt) with l
- Replace all "short" letters (cmnrsvwxz) with n
- Replace all "hanging" letters (gjpq) with p
好的,现在我们' ve消除了尽可能多的超出字符,我们可以合理地摆脱。现在我们要做一些更大的减少:
好的,我们可以去,让文本可读。除此之外,让我们看看我们是否可以想出一个方法,以便文本将类似原始,即使它不是最终 deciperable (再次,执行这一个
字符串由5个可能的值(a,l,n,p和空格)组成,这应该允许我们编码相当长的字符串。
除此之外,我们只需要截断。
为基于字典的编码,常用字或字母组。这可能会给正确的句子一些好处,但可能不是任意字符串。
Rules
The rules are a variation on the rules for Twitter image encoding challenge.
Not sure if I'll have the time/energy to follow this up with actual code, but here's my idea:
Anything longer than that, and we're starting lose information in the text. So execute the minimum number of the following steps to reduce the string to a length that can then be compressed/encoded using the above methods. Also, don't perform these replacements on the entire string if just performing them on a substring will make it short enough (I would probably walk through the string backwards).
Ok, so now we've eliminated as many excess characters as we can reasonably get rid of. Now we're going to do some more dramatic reductions:
Ok, that's about as far as we can go and have the text be readable. Beyond this, lets see if we can come up with a method so that the text will resemble the original, even if it isn't ultimately deciperable (again, perform this one character at a time from the end of the string, and stop when it is short enough):
This should leave us with a string consisting of exactly 5 possible values (a, l, n, p, and space), which should allow us to encode pretty lengthy strings.
Beyond that, we'd simply have to truncate.
Only other technique I can think of would be to do dictionary-based encoding, for common words or groups of letters. This might give us some benefit for proper sentences, but probably not for arbitrary strings.
这篇关于Twitter文本压缩的挑战的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!