Shorten an already short string in Java


Question


    I'm looking for a way to shorten an already short string as much as possible.

    The string is a hostname:port combo and could look like "my-domain.se:2121" or "123.211.80.4:2122".

    I know regular compression is pretty much out of the question on strings this short, due to the overhead needed and the lack of repetition, but I have an idea of how to do it.

    Because the alphabet is limited to 39 characters ([a-z][0-9]-:.), every character could fit in 6 bits. This reduces the length by up to 25% compared to ASCII. So my suggestion is something along these lines:

    1. Encode the string to a byte array using some kind of custom encoding
    2. Decode the byte array to a UTF-8 or ASCII string (this string will obviously not make any sense).

    And then reverse the process to get the original string.
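
The 6-bit idea above could be sketched roughly as follows. This is only an illustration, not the asker's code: the class name `SixBitCodec`, the ordering of the alphabet, and the choice to reserve code 0 as padding (so the decoder can tell where the text ends) are all assumptions made for the example.

```java
import java.io.ByteArrayOutputStream;

class SixBitCodec {
    // Assumed alphabet: the 39 symbols from the question at codes 1..39.
    // Code 0 is reserved as padding (an assumption of this sketch).
    static final String ALPHABET = "\0abcdefghijklmnopqrstuvwxyz0123456789-:.";

    static byte[] encode(String s) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int buffer = 0; // bit accumulator; only the low `bits` bits are meaningful
        int bits = 0;
        for (int i = 0; i < s.length(); i++) {
            int code = ALPHABET.indexOf(s.charAt(i));
            if (code <= 0) {
                throw new IllegalArgumentException("unsupported character: " + s.charAt(i));
            }
            buffer = (buffer << 6) | code;
            bits += 6;
            while (bits >= 8) { // flush every complete byte
                bits -= 8;
                out.write((buffer >> bits) & 0xFF);
            }
        }
        if (bits > 0) {
            out.write((buffer << (8 - bits)) & 0xFF); // zero-pad the final byte
        }
        return out.toByteArray();
    }

    static String decode(byte[] data) {
        StringBuilder sb = new StringBuilder();
        int buffer = 0;
        int bits = 0;
        for (byte b : data) {
            buffer = (buffer << 8) | (b & 0xFF);
            bits += 8;
            while (bits >= 6) {
                bits -= 6;
                int code = (buffer >> bits) & 0x3F;
                if (code == 0) {
                    return sb.toString(); // a full group of padding marks the end
                }
                if (code >= ALPHABET.length()) {
                    throw new IllegalArgumentException("invalid code: " + code);
                }
                sb.append(ALPHABET.charAt(code));
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String s = "my-domain.se:2121";
        byte[] packed = encode(s);
        System.out.println(s.length() + " chars -> " + packed.length
                + " bytes, round trip: " + decode(packed));
    }
}
```

With this layout a 17-character address needs 13 bytes (102 bits rounded up). One caveat for step 2 of the proposal: arbitrary bytes are generally not valid UTF-8, so turning the packed bytes into a `String` with UTF-8 would likely lose data; a single-byte charset such as ISO-8859-1, or simply keeping the `byte[]`, avoids that.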

    So to my questions:

    1. Could this work?
    2. Is there a better way?
    3. How?

    Solution

    You could encode the string as base 40, which is more compact than base 64. This lets you pack 12 such tokens into a 64-bit long. The 40th token could serve as the end-of-string marker to give you the length (as the encoded data will no longer be a whole number of bytes).
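
As a rough sketch of the "12 tokens per 64-bit long" idea: 40^12 is about 1.68 × 10^19, which is below 2^64 but above 2^63, so the value has to be treated as unsigned. The class name `Base40Long`, the digit order (first character least significant), and the use of digit 0 as the terminator are assumptions of this example rather than part of the answer.

```java
class Base40Long {
    // Digit 0 is '\0' (the terminator); real characters use digits 1..39.
    static final String ALPHABET = "\0abcdefghijklmnopqrstuvwxyz0123456789-:.";
    static final int BASE = 40;
    static final int TOKENS_PER_LONG = 12; // 40^12 < 2^64

    static long[] encode(String s) {
        long[] out = new long[(s.length() + TOKENS_PER_LONG - 1) / TOKENS_PER_LONG];
        for (int chunk = 0; chunk < out.length; chunk++) {
            int start = chunk * TOKENS_PER_LONG;
            int end = Math.min(start + TOKENS_PER_LONG, s.length());
            long value = 0;
            // Walk backwards so the first character ends up as the lowest digit.
            for (int i = end - 1; i >= start; i--) {
                int digit = ALPHABET.indexOf(s.charAt(i));
                if (digit <= 0) {
                    throw new IllegalArgumentException("unsupported character: " + s.charAt(i));
                }
                // May spill into the sign bit; that is fine for unsigned use.
                value = value * BASE + digit;
            }
            out[chunk] = value;
        }
        return out;
    }

    static String decode(long[] data) {
        StringBuilder sb = new StringBuilder();
        for (long value : data) {
            // All real digits are non-zero, so the value reaches 0 exactly
            // after the last character of the chunk.
            while (value != 0) {
                int digit = (int) Long.remainderUnsigned(value, BASE);
                if (digit == 0) {
                    throw new IllegalArgumentException("unexpected terminator inside chunk");
                }
                value = Long.divideUnsigned(value, BASE);
                sb.append(ALPHABET.charAt(digit));
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String s = "123.211.80.4:2122";
        long[] packed = encode(s);
        System.out.println(s + " -> " + packed.length + " longs -> " + decode(packed));
    }
}
```

Stored as whole 8-byte longs, a 17-character address takes 16 bytes, so to actually save space the trailing long would need to be trimmed to the bytes it uses.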

    If you use arithmetic coding, the result could be much smaller, but you would need a table of frequencies for each token (built from a long list of possible examples).
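
To get a feel for what such a frequency table might buy, the sketch below builds one from sample addresses and estimates the ideal coded size of a string (the sum of -log2 p over its symbols), which is roughly what an arithmetic coder with that static table could approach. It is not an arithmetic coder; the class name `FrequencyEstimate`, the explicit end-of-string symbol, and the tiny sample list are assumptions for illustration only.

```java
import java.util.HashMap;
import java.util.Map;

class FrequencyEstimate {
    static final char END = '\0'; // assumed end-of-string symbol

    // Count how often each symbol (plus one END per sample) occurs.
    static Map<Character, Integer> countSymbols(String[] samples) {
        Map<Character, Integer> counts = new HashMap<>();
        for (String s : samples) {
            for (char c : s.toCharArray()) {
                counts.merge(c, 1, Integer::sum);
            }
            counts.merge(END, 1, Integer::sum);
        }
        return counts;
    }

    // Ideal size in bits of s (including END) under the static frequency model.
    static double idealBits(String s, Map<Character, Integer> counts) {
        long total = 0;
        for (int n : counts.values()) {
            total += n;
        }
        double bits = 0;
        for (char c : (s + END).toCharArray()) {
            Integer n = counts.get(c);
            if (n == null) {
                throw new IllegalArgumentException("symbol not in table: " + c);
            }
            bits -= Math.log((double) n / total) / Math.log(2);
        }
        return bits;
    }

    public static void main(String[] args) {
        String[] samples = {
            "twitter.com:2122", "123.211.80.4:2122",
            "my-domain.se:2121", "www.stackoverflow.com:80"
        };
        Map<Character, Integer> counts = countSymbols(samples);
        String s = "my-domain.se:2121";
        System.out.printf("%s: about %.1f bits ideal%n", s, idealBits(s, counts));
    }
}
```

For comparison, plain base 40 spends log2(40), about 5.32 bits, on every token regardless of how common it is. A table built from only a handful of samples will be overly optimistic for the samples themselves, so a realistic estimate needs a much larger list.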

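    import java.io.ByteArrayOutputStream;
    import java.io.DataOutputStream;
    import java.io.IOException;
    import java.util.Arrays;

    class Encoder {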
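      public static final int BASE = 40;
      // Alphabet table: a character's position in `chars` is its base-40 digit.
      StringBuilder chars = new StringBuilder(BASE);
      // Reverse lookup from character code to digit. Characters outside the
      // alphabet (e.g. uppercase) stay at -1 and are not handled by encode().
      byte[] index = new byte[256];
    
      {
        chars.append('\0'); // digit 0 is '\0', usable as the end-of-string marker
        for (char ch = 'a'; ch <= 'z'; ch++) chars.append(ch);
        for (char ch = '0'; ch <= '9'; ch++) chars.append(ch);
        chars.append("-:.");
        Arrays.fill(index, (byte) -1);
        for (byte i = 0; i < chars.length(); i++)
          index[chars.charAt(i)] = i;
      }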
    
      public byte[] encode(String address) {
        try {
          ByteArrayOutputStream baos = new ByteArrayOutputStream();
          DataOutputStream dos = new DataOutputStream(baos);
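          // Pack three characters into each 16-bit value: 40^3 = 64,000 < 65,536.
          // The first character of a group is the least significant digit.
          for (int i = 0; i < address.length(); i += 3) {
            switch (Math.min(3, address.length() - i)) {
              case 1: // last one: a single digit fits in one byte.
                byte b = index[address.charAt(i)];
                dos.writeByte(b);
                break;
    
              case 2: // two trailing characters still need two bytes (40^2 = 1,600).
                char ch = (char) ((index[address.charAt(i+1)]) * BASE + index[address.charAt(i)]);
                dos.writeChar(ch);
                break;
    
              case 3: // a full group of three characters.
                char ch2 = (char) ((index[address.charAt(i+2)] * BASE + index[address.charAt(i + 1)]) * BASE + index[address.charAt(i)]);
                dos.writeChar(ch2);
                break;
            }
          }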
          }
          return baos.toByteArray();
        } catch (IOException e) {
          throw new AssertionError(e);
        }
      }
    
      public static void main(String[] args) {
        Encoder encoder = new Encoder();
        for (String s : "twitter.com:2122,123.211.80.4:2122,my-domain.se:2121,www.stackoverflow.com:80".split(",")) {
          System.out.println(s + " (" + s.length() + " chars) encoded is " + encoder.encode(s).length + " bytes.");
        }
      }
    }
    

    prints

    twitter.com:2122 (16 chars) encoded is 11 bytes.
    123.211.80.4:2122 (17 chars) encoded is 12 bytes.
    my-domain.se:2121 (17 chars) encoded is 12 bytes.
    www.stackoverflow.com:80 (24 chars) encoded is 16 bytes.
    

    I leave decoding as an exercise. ;)
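
For reference, one possible way to do that exercise is sketched below. It assumes the exact byte layout produced by `Encoder` above: each pair of bytes is a big-endian 16-bit value (as written by `DataOutputStream.writeChar`) holding up to three base-40 digits with the first character in the lowest digit, and an odd trailing byte holds a single digit. The class name `Decoder` and the helper `toChar` are made up for this example.

```java
class Decoder {
    // Same alphabet order as the Encoder: digit 0 is '\0', then a-z, 0-9, "-:.".
    static final String CHARS = "\0abcdefghijklmnopqrstuvwxyz0123456789-:.";
    static final int BASE = 40;

    static String decode(byte[] data) {
        StringBuilder sb = new StringBuilder();
        int i = 0;
        // Every full pair of bytes is one big-endian 16-bit group.
        for (; i + 1 < data.length; i += 2) {
            int v = ((data[i] & 0xFF) << 8) | (data[i + 1] & 0xFF);
            // Peel off up to three digits, lowest first. A two-character group
            // simply runs out of digits (v becomes 0) after two steps.
            for (int k = 0; k < 3 && v != 0; k++) {
                sb.append(toChar(v % BASE));
                v /= BASE;
            }
        }
        // An odd trailing byte holds exactly one character.
        if (i < data.length) {
            sb.append(toChar(data[i] & 0xFF));
        }
        return sb.toString();
    }

    private static char toChar(int digit) {
        if (digit <= 0 || digit >= BASE) {
            throw new IllegalArgumentException("invalid digit: " + digit);
        }
        return CHARS.charAt(digit);
    }

    public static void main(String[] args) {
        // "abcd" encodes to (3*40 + 2)*40 + 1 = 4881 = 0x1311, then digit 4.
        byte[] encoded = {0x13, 0x11, 0x04};
        System.out.println(decode(encoded));
    }
}
```

This relies on every real character having a non-zero digit, which is why a group with fewer than three characters can be recognised without storing a length.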
