How to get correctly-encoded HTML from the clipboard?


Question

Has anyone noticed that if you retrieve HTML from the clipboard, it gets the encoding wrong and injects weird characters?

For example, executing a command like this:

string s = (string) Clipboard.GetData(DataFormats.Html);

Results in stuff like:

<FONT size=-2>Â <A href="/advanced_search?hl=en">Advanced
Search</A><BR>Â <A href="/preferences?hl=en">Preferences</A><BR>Â <A
href="/language_tools?hl=en">Language
Tools</A></FONT>

Not sure how Markdown will process this, but there are weird characters in the resulting markup above.

It appears that the bug is in the .NET Framework. What do you think is the best way to get correctly-encoded HTML from the clipboard?

Solution

In this case the corruption is less visible than it was in mine. Today I tried to copy data from the clipboard that contained a few Unicode characters. The data I got looked as if a UTF-8 encoded file had been read using Windows-1250 (the local encoding on my Windows machine).

Your case seems to be the same. Save the HTML data in Windows-1252 (or Windows-1250; both work), remembering to put a non-breaking space (0xA0), not a standard space, after each Â character. Then open the file as UTF-8 and you will see what should have been there.
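To make the effect concrete, here is a minimal sketch that reproduces the corruption on purpose. The class and method names (MojibakeDemo, Misdecode) are made up for illustration, and ISO-8859-1 (code page 28591) is used as a stand-in for Windows-1252 because it is available without extra setup; the two encodings agree on the bytes 0xA0 to 0xFF, which is all this demo needs.

```csharp
using System;
using System.Text;

public static class MojibakeDemo
{
    // Assumption: ISO-8859-1 stands in for Windows-1252 here.
    // Both map the bytes 0xA0-0xFF to the same characters.
    private static readonly Encoding Latin1 = Encoding.GetEncoding(28591);

    // Encode the text as UTF-8, then decode those bytes with a
    // single-byte encoding. This is the mistake described above.
    public static string Misdecode(string original)
    {
        byte[] utf8Bytes = Encoding.UTF8.GetBytes(original);
        return Latin1.GetString(utf8Bytes);
    }
}
```

Calling MojibakeDemo.Misdecode on a string that starts with a non-breaking space (U+00A0) gives back a string that starts with Â (U+00C2) followed by U+00A0, because the non-breaking space is the two bytes C2 A0 in UTF-8.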

For another project I wrote a function that fixes data with corrupted encoding.

In this case a simple conversion should be sufficient:

byte[] data = Encoding.Default.GetBytes(text);
text = Encoding.UTF8.GetString(data);
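One caveat about the two lines above: Encoding.Default is the system ANSI code page on .NET Framework, but on .NET Core and .NET 5+ it is always UTF-8, so there the conversion would change nothing. A sketch that passes the wrongly used encoding explicitly might look like this (ClipboardTextRepair and Redecode are hypothetical names, not part of any library):

```csharp
using System;
using System.Text;

public static class ClipboardTextRepair
{
    // Turn the mis-decoded string back into its original bytes using the
    // encoding that was wrongly applied, then decode those bytes as UTF-8.
    public static string Redecode(string text, Encoding wrongEncoding)
    {
        byte[] data = wrongEncoding.GetBytes(text);
        return Encoding.UTF8.GetString(data);
    }
}
```

For example, ClipboardTextRepair.Redecode(text, Encoding.GetEncoding(28591)) treats the input as having been decoded as ISO-8859-1. For Windows-1252 on .NET Core or later, Encoding.GetEncoding(1252) should work once CodePagesEncodingProvider.Instance has been registered via Encoding.RegisterProvider.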

My original function is a little more complex and contains tests to ensure that the data are not corrupted:

public static bool FixMisencodedUTF8(ref string text, Encoding encoding)
{
  if (string.IsNullOrEmpty(text))
    return false;
  byte[] data = encoding.GetBytes(text);
  // there should not be any character outside source encoding
  string newStr = encoding.GetString(data);
  if (!string.Equals(text, newStr)) // if there is any character "outside"
    return false; // leave, the input is in a different encoding
  if (IsValidUtf8(data) == 0) // test data to be valid UTF-8 byte sequence
    return false; // if not, can not convert to UTF-8
  text = Encoding.UTF8.GetString(data);
  return true;
}
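The IsValidUtf8 helper is not shown above. The following is only a guess at what such a helper could look like, assuming it returns 0 for an invalid sequence and a non-zero value otherwise (which is how the function above uses it). It relies on a strict UTF8Encoding that throws on malformed bytes instead of silently substituting U+FFFD.

```csharp
using System;
using System.Text;

public static class Utf8Check
{
    // Hypothetical stand-in for the IsValidUtf8 helper used above.
    // Returns 1 if data is a well-formed UTF-8 byte sequence, 0 otherwise.
    public static int IsValidUtf8(byte[] data)
    {
        // throwOnInvalidBytes: true makes decoding throw on malformed input
        // rather than inserting replacement characters.
        var strictUtf8 = new UTF8Encoding(
            encoderShouldEmitUTF8Identifier: false,
            throwOnInvalidBytes: true);
        try
        {
            strictUtf8.GetString(data);
            return 1;
        }
        catch (DecoderFallbackException)
        {
            return 0;
        }
    }
}
```

Note that pure ASCII input is also valid UTF-8, so with this version FixMisencodedUTF8 would return true for ASCII text and leave it unchanged.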

I know this is not the best (or the correct) solution, but I did not find any other way to fix the input.

EDIT: (July 20, 2017)

It seems Microsoft has already found this error, and it now works correctly. I'm not sure whether the problem exists only in some framework versions, but I know for sure that the application now uses a different framework than it did when I wrote this answer (now 4.5; previously 2.0). Now all my code fails when parsing the data. That leaves another problem: determining the correct behaviour for an application both with the fix already applied and without it.
