Encoding issues for UTF8 CSV file when opening Excel and TextEdit


Question

      I recently added a CSV-download button that takes data from the database (Postgres), an array from the server (Ruby on Rails), and turns it into a CSV file on the client side (JavaScript, HTML5). I'm currently testing the CSV file and I am coming across some encoding issues.

      When I view the CSV file via 'less', the file appears fine. But when I open the file in Excel OR TextEdit, I start seeing weird characters like

      â€", â€, “

      appear in the text. Basically, I see the characters that are described here: http://digwp.com/2011/07/clean-up-weird-characters-in-database/

      I read that this sort of issue can arise when the database encoding setting is set to the wrong one. BUT, the database that I am using is set to use UTF8 encoding. And when I debug through the JS code that creates the CSV file, the text appears normal. (This could be a Chrome ability, and less capability)

      I'm feeling frustrated because the only thing I am learning from my online search is that there could be many reasons why encoding is not working, I'm not sure which part is at fault (so excuse me as I initially tag numerous things), and nothing I tried has shed new light on my problem.

      For reference, here's the JavaScript snippet that creates the CSV file!

      $(document).ready(function() {
      var csvData = <%= raw to_csv(@view_scope, clicks_post).as_json %>;
      var csvContent = "data:text/csv;charset=utf-8,";
      csvData.forEach(function(infoArray, index){
        var dataString = infoArray.join(",");
        csvContent += dataString+ "\n";
      }); 
      var encodedUri = encodeURI(csvContent);
      var button = $('<a>');
      button.text('Download CSV');
      button.addClass("button right");
      button.attr('href', encodedUri);
      button.attr('target','_blank');
      button.attr('download','<%=title%>_25_posts.csv');
      $("#<%=title%>_download_action").append(button);
      });
      

      Solution

      As @jlarson updated with the information that Mac was the biggest culprit, we might get somewhat further. Office for Mac has, at least 2011 and back, rather poor support for reading Unicode formats when importing files.

      Support for UTF-8 seems to be close to non-existent: I have read a few comments about it working, whilst the majority say it does not. Unfortunately I do not have any Mac to test on. So again: the files themselves should be fine as UTF-8, but the import step is where things break.

      I wrote up a quick test in JavaScript for exporting percent-escaped UTF-16, little- and big-endian, with/without BOM, etc.

      The code should probably be refactored, but should be OK for testing. It might work better than UTF-8. Of course this also usually means bigger data transfers, as any glyph is two or four bytes.

      You can find a fiddle here:

      Unicode export sample Fiddle

      Note that it does not handle CSV in any particular way. It is mainly meant for pure conversion to a data URL having UTF-8, UTF-16 big/little-endian, and +/- BOM. There is one option in the fiddle to replace commas with tabs, but I believe that would be a rather hackish and fragile solution, if it works at all.


      Typical use looks like:

      // Initiate
      encoder = new DataEnc({
          mime   : 'text/csv',
          charset: 'UTF-16BE',
          bom    : true
      });
      
      // Convert data to percent escaped text
      encoder.enc(data);
      
      // Get result
      var result = encoder.pay();
      

      The object has two result properties:

      1.) encoder.lead

      This is the MIME type, charset, etc. for the data URL. It is built from the options passed to the initializer; one can also say .config({ ... new conf ...}).intro() to re-build it.

      data:[<MIME-type>][;charset=<encoding>][;base64]
      

      You can specify base64, but there is no base64 conversion (at least not so far).
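      If ;base64 output were ever needed, one possible approach (a hypothetical helper of my own, not part of the fiddle) is to convert a fully percent-escaped buffer, such as the UTF-16 paths produce, back to bytes and Base64-encode those:

```javascript
// Hypothetical helper, not part of DataEnc: turn a fully percent-escaped
// byte string such as "%d4%80" (as the UTF-16 paths emit) into Base64.
// Note: the UTF-8 path leaves ASCII unescaped, so it would need extra handling.
function percentToBase64(pct) {
  var bytes = pct.split('%').filter(Boolean).map(function (h) {
    return parseInt(h, 16);
  });
  // btoa(String.fromCharCode.apply(null, bytes)) in browsers; Buffer in Node:
  return Buffer.from(bytes).toString('base64');
}
console.log(percentToBase64('%d4%80')); // "1IA="
```

      The resulting string could then go after a data:...;base64, lead instead of the percent-escaped payload.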

      2.) encoder.buf

      This is a string with the percent-escaped data.

      The .pay() function simply returns 1.) and 2.) as one.


      Main code:


      function DataEnc(a) {
          this.config(a);
          this.intro();
      }
      /*
      * http://www.iana.org/assignments/character-sets/character-sets.xhtml
      * */
      DataEnc._enctype = {
              u8    : ['u8', 'utf8'],
              // RFC-2781, Big endian should be presumed if none given
              u16be : ['u16', 'u16be', 'utf16', 'utf16be', 'ucs2', 'ucs2be'],
              u16le : ['u16le', 'utf16le', 'ucs2le']
      };
      DataEnc._BOM = {
              'none'     : '',
              'UTF-8'    : '%ef%bb%bf', // Discouraged
              'UTF-16BE' : '%fe%ff',
              'UTF-16LE' : '%ff%fe'
      };
      DataEnc.prototype = {
          // Basic setup
          config : function(a) {
              var opt = {
                  charset: 'u8',
                  mime   : 'text/csv',
                  base64 : 0,
                  bom    : 0
              };
              a = a || {};
              this.charset = typeof a.charset !== 'undefined' ?
                              a.charset : opt.charset;
              this.base64 = typeof a.base64 !== 'undefined' ? a.base64 : opt.base64;
              this.mime = typeof a.mime !== 'undefined' ? a.mime : opt.mime;
              this.bom = typeof a.bom !== 'undefined' ? a.bom : opt.bom;
      
              this.enc = this.utf8;
              this.buf = '';
              this.lead = '';
              return this;
          },
          // Create lead based on config
          // data:[<MIME-type>][;charset=<encoding>][;base64],<data>
          intro : function() {
              var
                  g = [],
                  c = this.charset || '',
                  b = 'none'
              ;
              if (this.mime && this.mime !== '')
                  g.push(this.mime);
              if (c !== '') {
                  c = c.replace(/[-\s]/g, '').toLowerCase();
                  if (DataEnc._enctype.u8.indexOf(c) > -1) {
                      c = 'UTF-8';
                      if (this.bom)
                          b = c;
                      this.enc = this.utf8;
                  } else if (DataEnc._enctype.u16be.indexOf(c) > -1) {
                      c = 'UTF-16BE';
                      if (this.bom)
                          b = c;
                      this.enc = this.utf16be;
                  } else if (DataEnc._enctype.u16le.indexOf(c) > -1) {
                      c = 'UTF-16LE';
                      if (this.bom)
                          b = c;
                      this.enc = this.utf16le;
                  } else {
                      if (c === 'copy')
                          c = '';
                      this.enc = this.copy;
                  }
              }
              if (c !== '')
                  g.push('charset=' + c);
              if (this.base64)
                  g.push('base64');
              this.lead = 'data:' + g.join(';') + ',' + DataEnc._BOM[b];
              return this;
          },
          // Deliver
          pay : function() {
              return this.lead + this.buf;
          },
          // UTF-16BE
          utf16be : function(t) { // U+0500 => %05%00
              var i, c, buf = [];
              for (i = 0; i < t.length; ++i) {
                  if ((c = t.charCodeAt(i)) > 0xff) {
                      buf.push(('00' + (c >> 0x08).toString(16)).substr(-2));
                      buf.push(('00' + (c  & 0xff).toString(16)).substr(-2));
                  } else {
                      buf.push('00');
                      buf.push(('00' + (c  & 0xff).toString(16)).substr(-2));
                  }
              }
              this.buf += '%' + buf.join('%');
              // Note the hex array is returned, not string with '%'
              // Might be useful if one want to loop over the data.
              return buf;
          },
          // UTF-16LE
          utf16le : function(t) { // U+0500 => %00%05
              var i, c, buf = [];
              for (i = 0; i < t.length; ++i) {
                  if ((c = t.charCodeAt(i)) > 0xff) {
                      buf.push(('00' + (c  & 0xff).toString(16)).substr(-2));
                      buf.push(('00' + (c >> 0x08).toString(16)).substr(-2));
                  } else {
                      buf.push(('00' + (c  & 0xff).toString(16)).substr(-2));
                      buf.push('00');
                  }
              }
              this.buf += '%' + buf.join('%');
              // Note the hex array is returned, not string with '%'
              // Might be useful if one want to loop over the data.
              return buf;
          },
          // UTF-8
          utf8 : function(t) {
              this.buf += encodeURIComponent(t);
              return this;
          },
          // Direct copy
          copy : function(t) {
              this.buf += t;
              return this;
          }
      };
      


      Previous answer:


      I do not have any setup to replicate yours, but if your case is the same as @jlarson's, then the resulting file should be correct.

      This answer became somewhat long (fun topic, you say?), but it discusses various aspects of the question: what is (likely) happening, and how to actually check what is going on in various ways.

      TL;DR:

      The text is likely imported as ISO-8859-1, Windows-1252, or the like, and not as UTF-8. Force the application to read the file as UTF-8 by using its import function or other means.


      PS: The UniSearcher is a nice tool to have available on this journey.

      The long way around

      The "easiest" way to be 100% sure what we are looking at is to use a hex editor on the result. Alternatively, use hexdump, xxd or the like from the command line to view the file. In this case the byte sequence should be that of UTF-8 as delivered from the script.

      As an example, if we take jlarson's script, it takes the data array:

      data = ['name', 'city', 'state'],
             ['\u0500\u05E1\u0E01\u1054', 'seattle', 'washington']
      

      This one is merged into the string:

       name,city,state<newline>
       \u0500\u05E1\u0E01\u1054,seattle,washington<newline>
      

      which translates by Unicode to:

       name,city,state<newline>
       Ԁסกၔ,seattle,washington<newline>
      

      As UTF-8 uses ASCII as its base (bytes with the highest bit not set are the same as in ASCII), the only special sequence in the test data is "Ԁסกၔ", which in turn is:

      Code-point  Glyph      UTF-8
      ----------------------------
          U+0500    Ԁ        d4 80
          U+05E1    ס        d7 a1
          U+0E01    ก     e0 b8 81
          U+1054    ၔ     e1 81 94
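      We can double-check the UTF-8 column of that table from a console, since encodeURIComponent escapes non-ASCII characters as their UTF-8 bytes:

```javascript
// Read off the UTF-8 bytes of each test glyph via percent escaping.
var glyphs = ['\u0500', '\u05E1', '\u0E01', '\u1054']; // Ԁ ס ก ၔ
var bytes = glyphs.map(function (g) {
  return encodeURIComponent(g).split('%').filter(Boolean).join(' ').toLowerCase();
});
console.log(bytes.join(', ')); // d4 80, d7 a1, e0 b8 81, e1 81 94
```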
      

      Looking at the hex-dump of the downloaded file:

      0000000: 6e61 6d65 2c63 6974 792c 7374 6174 650a  name,city,state.
      0000010: d480 d7a1 e0b8 81e1 8194 2c73 6561 7474  ..........,seatt
      0000020: 6c65 2c77 6173 6869 6e67 746f 6e0a       le,washington.
      

      On the second line we find d480 d7a1 e0b8 81e1 8194, which matches the above:

      0000010: d480  d7a1  e0b8 81  e1 8194 2c73 6561 7474  ..........,seatt
               |   | |   | |     |  |     |  | |  | |  | |
               +-+-+ +-+-+ +--+--+  +--+--+  | |  | |  | |
                 |     |      |        |     | |  | |  | |
                 Ԁ     ס      ก        ၔ     , s  e a  t t
      

      None of the other characters are mangled either.

      Do similar tests if you want. The result should be similar.


      By the sample provided: â€", â€, “

      We can also have a look at the sample provided in the question. It is reasonable to assume that the text is interpreted in Excel / TextEdit as code page 1252.

      To quote Wikipedia on Windows-1252:

      Windows-1252 or CP-1252 is a character encoding of the Latin alphabet, used by default in the legacy components of Microsoft Windows in English and some other Western languages. It is one version within the group of Windows code pages. In LaTeX packages, it is referred to as "ansinew".

      Retrieving the original bytes

      To translate it back into its original form, we can look at the code page layout, from which we get:

      Character:   <â>  <€>  <">  <,>  < >  <â>  <€>  < >  <,>  < >  <â>  <€>  <œ>
      U.Hex    :    e2 20ac 201d   2c   20   e2 20ac   9d   2c   20   e2 20ac  153
      T.Hex    :    e2   80   94   2c   20   e2   80   9d*  2c   20   e2   80   9c
      

      • U is short for Unicode
      • T is short for Translated

      For example:

      â => Unicode 0xe2   => CP-1252 0xe2
      " => Unicode 0x201d => CP-1252 0x94
      € => Unicode 0x20ac => CP-1252 0x80
      

      Special cases like 9d do not have a corresponding code point in CP-1252; these we simply copy directly.
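      That back-translation can also be scripted. The sketch below is my own; its lookup table covers only the code points occurring in the sample, not the full CP-1252 map:

```javascript
// Sketch: map each mangled glyph back to its CP-1252 byte value.
// Only the sample's non-Latin-1 code points are in the table;
// code points below 0x100 keep their byte value here.
function toCp1252Byte(ch) {
  var special = { 0x20ac: 0x80, 0x201d: 0x94, 0x0153: 0x9c }; // € " œ
  var cp = ch.codePointAt(0);
  return special[cp] !== undefined ? special[cp] : cp & 0xff;
}
var mangled = '\u00e2\u20ac\u201d'; // the sample's leading "â€""
var bytes = Array.prototype.map.call(mangled, function (ch) {
  return ('00' + toCp1252Byte(ch).toString(16)).substr(-2);
});
console.log(bytes.join(' ')); // "e2 80 94" -> the UTF-8 bytes of an em-dash
```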

      Note: if one looks at the mangled string by copying the text to a file and doing a hex dump, save the file with, for example, UTF-16 encoding to get the Unicode values as represented in the table. E.g. in Vim:

      set fenc=utf-16
      # Or
      set fenc=ucs-2
      

      Bytes to UTF-8

      We then combine the result, the T.Hex line, into UTF-8. In UTF-8 sequences a leading byte tells us how many subsequent bytes make up the glyph. For example, if a byte has the binary value 110x xxxx, we know that this byte and the next represent one code point, a total of two bytes. 1110 xxxx tells us it is three, and so on. ASCII values do not have the high bit set; as such, any byte matching 0xxx xxxx is standalone, a total of one byte.
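      That leading-byte rule can be written as a tiny helper (my own sketch, not part of the answer's code):

```javascript
// Sequence length of a UTF-8 glyph, read from its leading byte.
function utf8SeqLen(byte) {
  if ((byte & 0x80) === 0x00) return 1; // 0xxx xxxx: ASCII, standalone
  if ((byte & 0xe0) === 0xc0) return 2; // 110x xxxx: two-byte sequence
  if ((byte & 0xf0) === 0xe0) return 3; // 1110 xxxx: three-byte sequence
  if ((byte & 0xf8) === 0xf0) return 4; // 1111 0xxx: four-byte sequence
  return 0;                             // 10xx xxxx: continuation byte
}
console.log(utf8SeqLen(0xe2)); // 3 -> 0xe2 opens a three-byte sequence
console.log(utf8SeqLen(0x2c)); // 1 -> comma is plain ASCII
```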

      0xe2 = 1110 0010bin => 3 bytes => 0xe28094 (em-dash)  —
      0x2c = 0010 1100bin => 1 byte  => 0x2c     (comma)    ,
      0x20 = 0010 0000bin => 1 byte  => 0x20     (space)   
      0xe2 = 1110 0010bin => 3 bytes => 0xe2809d (right-dq) "
      0x2c = 0010 1100bin => 1 byte  => 0x2c     (comma)    ,
      0x20 = 0010 0000bin => 1 byte  => 0x20     (space)   
      0xe2 = 1110 0010bin => 3 bytes => 0xe2809c (left-dq)  "
      

      Conclusion: the original UTF-8 string was:

      —, ", "
      

      Mangling it back

      We can also do the reverse. The original string as bytes:

      UTF-8: e2 80 94 2c 20 e2 80 9d 2c 20 e2 80 9c
      

      Corresponding values in cp-1252:

      e2 => â
      80 => €
      94 => "
      2c => ,
      20 => <space>
      ...
      

      and so on, result:

      â€", â€, “
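      The whole round trip can also be reproduced in one go with TextEncoder/TextDecoder (a sketch of my own; it assumes the runtime's TextDecoder accepts the 'windows-1252' label, which browsers and full-ICU Node builds do):

```javascript
// Reproduce the mangling: encode the real text as UTF-8, then misread
// those bytes as Windows-1252.
var original = '\u2014, \u201d, \u201c'; // em-dash, right and left curly quotes
var utf8Bytes = new TextEncoder().encode(original);
var mangled = new TextDecoder('windows-1252').decode(utf8Bytes);
console.log(mangled); // the familiar â€-garbage; 0x9d decodes to an invisible control
```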
      


      Importing to MS Excel

      In other words: The issue at hand could be how to import UTF-8 text files into MS Excel, and some other applications. In Excel this can be done in various ways.

      • Method one:

      Do not save the file with an extension recognized by the application, like .csv, or .txt, but omit it completely or make something up.

      As an example, save the file as "testfile", with no extension. Then open the file in Excel, confirm that we actually want to open this file, and voilà: we get served the encoding option. Select UTF-8, and the file should be read correctly.

      • Method two:

      Use import data instead of opening the file directly. Something like:

      Data -> Import External Data -> Import Data
      

      Select encoding and proceed.

      Check that Excel and the selected font actually support the glyphs

      We can also test the font support for the Unicode characters by using the, sometimes, friendlier clipboard. For example, copy text from this page into Excel:

      If support for the code points exists, the text should render fine.


      Linux

      On Linux, which is primarily UTF-8 in userland, this should not be an issue. LibreOffice Calc, Vim, etc. show the files correctly rendered.


      Why it works (or should)

      The specification for encodeURI states (also read sec-15.1.3):

      The encodeURI function computes a new version of a URI in which each instance of certain characters is replaced by one, two, three, or four escape sequences representing the UTF-8 encoding of the character.

      We can simply test this in our console by, for example, saying:

      >> encodeURI('Ԁסกၔ,seattle,washington')
      << "%D4%80%D7%A1%E0%B8%81%E1%81%94,seattle,washington"
      

      As we can see, the escape sequences are equal to the ones in the hex dump above:

      %D4%80%D7%A1%E0%B8%81%E1%81%94 (encodeURI in log)
       d4 80 d7 a1 e0 b8 81 e1 81 94 (hex-dump of file)
      

      or, testing a 4-byte code:

      >> encodeURI('
                              
