Is there such a thing as a "user-defined encoding fallback"?


Question

When encoding a string to bytes with the ASCII encoding, characters such as ö come out as ?.

Encoding encoding = Encoding.GetEncoding("us-ascii");     // or: Encoding encoding = Encoding.ASCII;
byte[] data = encoding.GetBytes(s);



I'm looking for a way to replace those characters with different ones, not just a question mark.
Examples:

ä -> ae
ö -> oe
ü -> ue
ß -> ss

If it's not possible to replace one character with several, I would settle for replacing each with a single character (ö -> o).

There are several implementations of EncoderFallback, but I don't understand how they work.
A quick and dirty solution would be to replace all those characters before passing the string to Encoding.GetBytes(), but that doesn't seem to be the "right" way.
I wish I could give a table of replacements to the encoding object.

How can I accomplish this?

Recommended answer

The "most correct" way to achieve what you want is to implement a custom fallback encoder that does a best-fit fallback. The one built in to .NET is, for various reasons, pretty conservative about which characters it will try to best-fit (there are security implications, depending on what use you plan to put the re-encoded string to). Your custom fallback strategy could do best-fit based on whatever rules you want.

Having said that - in your fallback class, you're going to end up writing a giant case statement of all the non-encode-able Unicode code points and manually mapping them to their best-fit alternatives. You can achieve the same goal by simply looping through your string ahead of time and swapping out the unsupported characters for replacements. The main benefit of the fallback strategy is performance: you only end up looping through your string once, instead of at least twice. Unless your strings are huge, though, I wouldn't worry too much about it.
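The simpler pre-processing approach described above could be sketched roughly as follows. The class name GermanTransliteration, its two methods, and the replacement table are assumptions made up for illustration (they are not part of any .NET API), and the table covers only the mappings from the question plus their upper-case forms.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical helper: swaps unsupported characters before encoding.
public static class GermanTransliteration
{
    // Illustrative table; extend it with whatever mappings you need.
    private static readonly Dictionary<char, string> Map = new Dictionary<char, string>
    {
        { '\u00E4', "ae" }, // ä
        { '\u00F6', "oe" }, // ö
        { '\u00FC', "ue" }, // ü
        { '\u00DF', "ss" }, // ß
        { '\u00C4', "Ae" }, // Ä
        { '\u00D6', "Oe" }, // Ö
        { '\u00DC', "Ue" }  // Ü
    };

    // First pass: replace every character that has an entry in the table.
    public static string Transliterate(string s)
    {
        var sb = new StringBuilder(s.Length);
        foreach (char c in s)
        {
            string replacement;
            if (Map.TryGetValue(c, out replacement))
                sb.Append(replacement);
            else
                sb.Append(c);
        }
        return sb.ToString();
    }

    // Second pass: the normal ASCII encoder; anything still unsupported becomes '?'.
    public static byte[] ToAscii(string s)
    {
        return Encoding.ASCII.GetBytes(Transliterate(s));
    }
}
```

This walks the string twice (once to substitute, once to encode), which is the performance trade-off mentioned above.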

If you do want to implement a custom fallback strategy, you should definitely read the article in my comment: Character Encoding in the .NET Framework. It's not really hard, but you have to understand how the encoding fallback works.

You pass an instance of your custom class, which has to derive from EncoderFallback, to the Encoding.GetEncoding method. That class, though, is basically just a wrapper around the real work, which is done in EncoderFallbackBuffer. The reason you need a buffer is that fallback is not necessarily a one-to-one process; in your example, you may end up mapping a single Unicode character to two ASCII characters.
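As a minimal sketch of the wiring, the snippet below passes the built-in EncoderReplacementFallback to the Encoding.GetEncoding overload that takes fallback objects; an instance of a custom EncoderFallback subclass would go in the same position. The class FallbackWiringDemo and the method EncodeWithStar are hypothetical names used only for this illustration.

```csharp
using System;
using System.Text;

// Hypothetical demo class showing where a fallback object is supplied.
public static class FallbackWiringDemo
{
    public static string EncodeWithStar(string s)
    {
        // The second argument is the encoder fallback; the third is the decoder
        // fallback, which is not relevant when only encoding.
        Encoding ascii = Encoding.GetEncoding(
            "us-ascii",
            new EncoderReplacementFallback("*"),
            DecoderFallback.ReplacementFallback);

        byte[] data = ascii.GetBytes(s);

        // Decode again purely so the result is easy to inspect.
        return Encoding.ASCII.GetString(data);
    }
}
```

With this fallback, each character that cannot be encoded is replaced by "*" instead of the default "?".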

At the point where the encoding process first runs into a problem and needs to fall back on your strategy, it uses your EncoderFallback implementation to create an instance of your EncoderFallbackBuffer. It then calls the Fallback method of your custom buffer.

Internally, your buffer builds up the sequence of characters to be returned in place of the non-encodable one, and Fallback returns true. From there, the encoder calls GetNextChar repeatedly for as long as Remaining > 0, or until GetNextChar returns '\u0000', and writes those characters into the encoded result.
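To make that call sequence concrete, here is a rough sketch that plays the encoder's role by hand for a single unencodable character. It is a simplification of what a real encoder does, and FallbackProtocolDemo and Drain are hypothetical names; only the EncoderFallback and EncoderFallbackBuffer members come from .NET.

```csharp
using System;
using System.Text;

// Hypothetical demo: drives a fallback buffer the way an encoder would.
public static class FallbackProtocolDemo
{
    public static string Drain(EncoderFallback fallback, char unknown)
    {
        // The encoder asks the fallback object for a buffer.
        EncoderFallbackBuffer buffer = fallback.CreateFallbackBuffer();
        var result = new StringBuilder();

        // Fallback returns true when replacement characters are available.
        if (buffer.Fallback(unknown, 0))
        {
            // Pull replacement characters until the buffer is empty.
            while (buffer.Remaining > 0)
            {
                char next = buffer.GetNextChar();
                if (next == '\u0000')
                    break; // the buffer signals that it has nothing left
                result.Append(next);
            }
        }
        return result.ToString();
    }
}
```

For example, calling Drain(new EncoderReplacementFallback("oe"), '\u00F6') should hand back the two replacement characters one at a time, which is why a buffer rather than a single return value is needed.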

The article includes an implementation of pretty much exactly what you're trying to do; I've copied out the basic framework below, which should get you started.

using System;
using System.Collections.Generic;
using System.Text;

public class CustomMapper : EncoderFallback
{
   // Users can override the "replacement character", so track what they
   // give us.
   public string DefaultString;

   public CustomMapper() : this("*")
   {
   }

   public CustomMapper(string defaultString)
   {
      this.DefaultString = defaultString;
   }

   public override EncoderFallbackBuffer CreateFallbackBuffer()
   {
      return new CustomMapperFallbackBuffer(this);
   }

   // This is the length of the largest possible replacement string we can
   // return for a single Unicode code point: two characters from the table
   // below, or DefaultString if that is longer.
   public override int MaxCharCount
   {
      get { return Math.Max(2, (DefaultString ?? "").Length); }
   }
}

public class CustomMapperFallbackBuffer : EncoderFallbackBuffer
{
   // Illustrative replacement table based on the examples in the question.
   // Every replacement must itself be encodable in the target encoding,
   // otherwise the encoder will fall back again on the replacement.
   static readonly Dictionary<char, string> Replacements = new Dictionary<char, string>
   {
      { '\u00E4', "ae" }, // ä
      { '\u00F6', "oe" }, // ö
      { '\u00FC', "ue" }, // ü
      { '\u00DF', "ss" }  // ß
   };

   CustomMapper fb;
   string replacement = ""; // characters waiting to be read by the encoder
   int position = 0;        // index of the next character to hand out

   public CustomMapperFallbackBuffer(CustomMapper fallback)
   {
      // We can use the same custom buffer with different fallbacks, e.g.
      // we might have different sets of replacement characters for different
      // cases. This is just a reference to the parent in case we want it.
      this.fb = fallback;
   }

   public override bool Fallback(char charUnknown, int index)
   {
      // Figure out what sequence of characters should replace charUnknown.
      // index is the position of this character in the original string,
      // in case that's relevant (it is not used here).
      if (!Replacements.TryGetValue(charUnknown, out replacement))
      {
         // No mapping: use the parent's DefaultString instead of failing.
         replacement = fb.DefaultString ?? "";
      }
      position = 0;

      // Returning true tells the encoder to start calling GetNextChar.
      return replacement.Length > 0;
   }

   public override bool Fallback(char charUnknownHigh, char charUnknownLow, int index)
   {
      // Same as above, except we have a UTF-16 surrogate pair. The table has
      // no entries for such code points, so use DefaultString for the pair.
      replacement = fb.DefaultString ?? "";
      position = 0;
      return replacement.Length > 0;
   }

   public override char GetNextChar()
   {
      // Return the next character in our internal buffer of replacement
      // characters waiting to be put into the encoded byte stream. If
      // we're all out of characters, return '\u0000'.
      return position < replacement.Length ? replacement[position++] : '\u0000';
   }

   public override bool MovePrevious()
   {
      // Back up to the previous character we returned and get ready
      // to return it again. If that's possible, return true; if that's
      // not possible (e.g. we have no previous character) return false.
      if (position == 0)
      {
         return false;
      }
      position--;
      return true;
   }

   public override int Remaining
   {
      // Return the number of characters that we've got waiting
      // for the encoder to read.
      get { return replacement.Length - position; }
   }

   public override void Reset()
   {
      // Reset our internal state back to the initial one.
      replacement = "";
      position = 0;
   }
}
