Twitter文本压缩的挑战 [英] Twitter text compression challenge

查看:178
本文介绍了Twitter文本压缩的挑战的处理方法,对大家解决问题具有一定的参考价值,需要的朋友们下面随着小编来一起学习吧!

问题描述

规则




  1. 您的程式必须有两种模式:编码解码
  2. 作为输入一些人类可读的 Latin1 文本,大概是英语。


    • 忽略标点符号无关紧要。

    • 您只需要担心实际的英语单词,

    • 任何重音字母都可以转换为简单的ASCII。

    • 您可以选择处理数字的方式。

    • 123


      • 一两三

      • 一百二十三

      • 123

      • 1 2 3


    • 一百二十三


      • 一两三

      • 一百二十三

      • 123
      • b $ b
      • 1 2 3



  3. 输出可在 U + 0000

    范围内的




    • code> - U + 10FFFF



      排除非字符:




      • U + FFFE

      • U + FFFF

      • U + n / em> FFFE U + n / em> FFFF 其中 n 1 - 10 十六进制

      • U + FDD0 - U + FDF​​

      • U + D800 DFFF (代理码点)。



它可以以您选择的任何合理编码输出;
GNU iconv 支持的任何编码


  • 解码


    1. 您的程序应该将 encoding li>
    2. 文本输出应为输入文本的近似值。


      • 越接近原始文字越好。

      • 不需要任何标点符号


    3. 输出文字应该是人类可读的,大概是英文。




      • 可以是L337或lol。


    4. 解码过程可能没有访问权限到除上述输出之外的编码处理
      的任何其它输出;
      也就是说,你不能在某处上传文本,并输出URL
      以供解码过程下载,或者任何类似的东西。

    li>

  • 为了保持用户界面的一致性,程序必须如下所示:


    1. 您的程序必须是一个脚本可以设置为在具有适当解释器
      或可编译成可执行文件的程序的平台上可执行。

    2. 您的程序必须将其作为第一个参数编码解码以设置模式。

    3. 您的程序必须输入至少采用以下方法之一:


      • 从标准输入输入,并在标准输出上生成输出。


        • my-program encode< input.txt> output.utf

        • my-program decode< output.utf> output.txt


      • 从第二个参数中命名的文件获取输入,并在第三个文件中命名的文件中生成输出。


        • my-program encode input.txt output.utf

        • my-program decode output.utf output.txt


      / li>


  • 完整的,和/或其托管在其他地方的链接
    (如果它很长,或需要许多文件来编译,或者什么)。

  • ,如果它不是立即显现从代码
    或如果代码很长,人们会有兴趣的摘要。

  • 一个示例文本,原始文本,文本

  • 如果您正在构建其他人的想法,请将其归因。
    尝试改进别人的想法是可以的,但是你必须对它们进行归属。


  • 这些规则是规则的变体 Twitter图片编码挑战

    解决方案

    不知道我会有时间/能量跟随实际代码,但这里是我的想法:




    • 任何任意LATIN 1字符串在一定长度可以简单编码(即使压缩),而不会丢失到140个字符。天真的估计是280个字符,虽然在比赛规则中的代码点限制,它可能比那个稍短。

    • 字符串比上述长度稍长(让宾客在280和500个字符)最可能使用标准压缩技术缩减为足够短的字符串,以允许上述编码。



    那么,我们开始在文本中丢失信息。因此,执行以下步骤的最小数量以将字符串减少到可以使用上述方法进行压缩/编码的长度。此外,如果只是在子串上执行这些替换,它们将足够短(我可能会通过字符串向后),则不要在整个字符串上执行这些替换。 / p>


    1. 将所有高于127的拉丁字母1字符(主要是重音字母和时髦符号)替换为非重音字母字符中最接近的等效字符,

    2. 替换所有非字母数字(任何剩余的符号或标点符号)


    3. 好的,现在我们' ve消除了尽可能多的超出字符,我们可以合理地摆脱。现在我们要做一些更大的减少:


      1. 将所有双字母字母(balon)。

      2. 用较短的等效字符替换其他常用字母组合(CK为K,WR为R等)

      好的,我们可以去,让文本可读。除此之外,让我们看看我们是否可以想出一个方法,以便文本将类似原始,即使它不是最终 deciperable (再次,执行这一个


      1. 替换所有元音(aeiouy)与

      2. 将所有高字母(bdfhklt)替换为l

      3. 将所有短字母(cmnrsvwxz) / li>
      4. 将所有悬挂字母(gjpq)替换为p

      字符串由5个可能的值(a,l,n,p和空格)组成,这应该允许我们编码相当长的字符串。



      除此之外,我们只需要截断。



      为基于字典的编码,常用字或字母组。这可能会给正确的句子一些好处,但可能不是任意字符串。


      Rules

      1. Your program must have two modes: encoding and decoding.
      2. When encoding:

        1. Your program must take as input some human readable Latin1 text, presumably English.
          • It doesn't matter if you ignore punctuation marks.
          • You only need to worry about actual English words, not L337.
          • Any accented letters may be converted to simple ASCII.
          • You may choose how you want to deal with numbers.
          • 123
            • one two three
            • one hundred twenty three
            • 123
            • 1 2 3
          • one hundred twenty three
            • one two three
            • one hundred twenty three
            • 123
            • 1 2 3
        2. Your program must output a message which can be represented in

          • 140 code points in the range U+0000U+10FFFF

            Excluding non-characters:

            • U+FFFE
            • U+FFFF
            • U+nFFFE, U+nFFFF where n is 110 hexadecimal
            • U+FDD0U+FDEF
            • U+D800U+DFFF (surrogate code points).

        It may be output in any reasonable encoding of your choice; any encoding supported by GNU iconv will be considered reasonable, and your platform native encoding or locale encoding would likely be a good choice.

      3. When decoding:

        1. Your program should take as input the output of your encoding mode.
        2. The text output should be an approximation of the input text.
          • The closer you can get to the original text, the better.
          • Doesn't need to have any punctuation.
        3. The output text should be readable by a human, again presumably English.

          • Can be L337, or lol.
        4. The decoding process may have no access to any other output of the encoding process other than the output specified above; that is, you can't upload the text somewhere and output the URL for the decoding process to download, or anything silly like that.

      4. For the sake of consistency in user interface, your program must behave as follows:

        1. Your program must be a script that can be set to executable on a platform with the appropriate interpreter, or a program that can be compiled into an executable.
        2. Your program must take as its first argument either encode or decode to set the mode.
        3. Your program must take input in at least one of the following ways:
          • Take input from standard in and produce output on standard out.
            • my-program encode <input.txt >output.utf
            • my-program decode <output.utf >output.txt
          • Take input from a file named in the second argument, and produce output in the file named in the third.
            • my-program encode input.txt output.utf
            • my-program decode output.utf output.txt

      5. For your solution, please post:

        1. Your code, in full, and/or a link to it hosted elsewhere (if it's very long, or requires many files to compile, or something).
        2. An explanation of how it works, if it's not immediately obvious from the code or if the code is long and people will be interested in a summary.
        3. An example text, with the original text, the text it compresses down to, and the decoded text.
        4. If you are building on an idea that someone else had, please attribute them. It's OK to try to do a refinement of someone else's idea, but you must attribute them.

      The rules are a variation on the rules for Twitter image encoding challenge.

      解决方案

      Not sure if I'll have the time/energy to follow this up with actual code, but here's my idea:

      • Any arbitrary LATIN 1 string under a certain length could be simply encoded (not even compressed) with no loss into 140 characters. The naive estimate is 280 characters, although with the code point restrictions in the contest rules, its probably a little shorter than that.
      • Strings slightly longer than the above length (lets guestimate between 280 and 500 characters) can most likely be shrunk using standard compression techniques, into a string short enough to allow the above encoding.

      Anything longer than that, and we're starting lose information in the text. So execute the minimum number of the following steps to reduce the string to a length that can then be compressed/encoded using the above methods. Also, don't perform these replacements on the entire string if just performing them on a substring will make it short enough (I would probably walk through the string backwards).

      1. Replace all LATIN 1 characters above 127 (primarily accented letters and funky symbols) with their closest equivalent in non-accented alphabetic characters, or possibly with a generic symbol replacement like "#"
      2. Replace all uppercase letters with their equivalent lowercase form
      3. Replace all non-alphanumerics (any remaining symbols or punctuation marks) with a space
      4. Replace all numbers with 0

      Ok, so now we've eliminated as many excess characters as we can reasonably get rid of. Now we're going to do some more dramatic reductions:

      1. Replace all double-letters (balloon) with a single letter (balon). Will look weird, but still hopefully decipherable by the reader.
      2. Replace other common letter combinations with shorter equivalents (CK with K, WR with R, etc)

      Ok, that's about as far as we can go and have the text be readable. Beyond this, lets see if we can come up with a method so that the text will resemble the original, even if it isn't ultimately deciperable (again, perform this one character at a time from the end of the string, and stop when it is short enough):

      1. Replace all vowels (aeiouy) with a
      2. Replace all "tall" letters (bdfhklt) with l
      3. Replace all "short" letters (cmnrsvwxz) with n
      4. Replace all "hanging" letters (gjpq) with p

      This should leave us with a string consisting of exactly 5 possible values (a, l, n, p, and space), which should allow us to encode pretty lengthy strings.

      Beyond that, we'd simply have to truncate.

      Only other technique I can think of would be to do dictionary-based encoding, for common words or groups of letters. This might give us some benefit for proper sentences, but probably not for arbitrary strings.

      这篇关于Twitter文本压缩的挑战的文章就介绍到这了,希望我们推荐的答案对大家有所帮助,也希望大家多多支持IT屋!

    查看全文
    登录 关闭
    扫码关注1秒登录
    发送“验证码”获取 | 15天全站免登陆