Character encoding issue when using Google Apps Script to extract data from web page


Problem description



    I have written a script using Google Apps Script to extract text from a web page into Google Sheets. I only need this script to work with a specific web page, so it does not need to be versatile. The script works almost exactly as I want it to except that I have run into a character encoding problem. I am extracting both Hebrew and English text. The meta tag in the HTML has charset=Windows-1255. The English extracts perfectly, but the Hebrew displays as black diamonds containing a question mark.

    I found this question that says to pass the data into a blob then use the getDataAsString method to convert to another encoding. I tried converting to different encodings and got different results. UTF-8 displays the black diamonds with question marks, UTF-16 displays Korean, ISO 8859-8 returns an error and says it's not a valid parameter, and the original Windows-1255 displays one Hebrew character but a bunch of other gibberish.

    However, I am able to copy and paste the Hebrew text into Google Sheets manually and it displays correctly.

    I have even tested passing Hebrew directly from Google Apps Script code like so:

    function passHebrew() {
      return "וַיְדַבֵּר";
    }
    

    This displays the Hebrew text properly on Google Sheets.

    My code is as follows:

    function parseText(book, chapter) {
      //var bk = book;
      //var ch = chapter;
      var bk = '04'; //hard-coded for testing purposes
      var ch = '01'; //hard-coded for testing purposes
      var url = 'http://www.mechon-mamre.org/p/pt/pt' + bk + ch + '.htm';
    
      var xml = UrlFetchApp.fetch(url).getContentText();
    
      //I had to "fix" these xml errors for XmlService.parse(xml) below
      //to function.
      xml = xml.replace('<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">', '<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "">');
      xml = xml.replace('<LINK REL="stylesheet" HREF="p.css" TYPE="text/css">', '<LINK REL="stylesheet" HREF="p.css" TYPE="text/css"></LINK>');
      xml = xml.replace('<meta http-equiv="Content-Type" content="text/html; charset=Windows-1255">', '<meta http-equiv="Content-Type" content="text/html; charset=Windows-1255"></meta>');
      xml = xml.replace(/ALIGN=CENTER/gi, 'ALIGN="CENTER"');
      xml = xml.replace(/<BR>/gi, '<BR></BR>');
      xml = xml.replace(/class=h/gi, 'class="h"');
    
      //This section is the specific route to the table in the page I want
      var document = XmlService.parse(xml);
      var body = document.getRootElement().getChildren("BODY");
      var maintable = body[0].getChildren("TABLE");
      var maintablechildren = maintable[0].getChildren();
    
      //This creates a two-dimensional array so that I can store the Hebrew
      //in the first column and the English in the second column
      var array = new Array(maintablechildren.length);
      for (var i = 0; i < maintablechildren.length; i++) {
        array[i] = new Array(2);
      }
    
      //This is where the table gets parsed into the array
      for (var i = 0; i < maintablechildren.length; i++) {
        var verse = maintablechildren[i].getChildren();
    
        //This is where the encoding problem occurs.
        //I originally tried verse[0].getText() but it didn't work.
        array[i][0] = Utilities.newBlob(verse[0].getText()).getDataAsString('UTF-8');
        //This array receives the English text and works fine.
        array[i][1] = verse[1].getText();
      }
    
      return array;
    }
    

    What am I overlooking, misunderstanding, or doing wrong? I don't have a very good understanding of how encoding works so I don't understand why converting it to UTF-8 isn't working.

    Solution

    Your problem occurs before the lines you've commented as an encoding problem: the response is decoded with UrlFetchApp's default charset instead of the page's Windows-1255, so the Hebrew text is munged from the start.
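To see why the Hebrew turns into black diamonds, here is a minimal sketch in plain JavaScript (runnable in Node.js, not Apps Script itself). It assumes the bytes are being mis-decoded as UTF-8, which is consistent with the symptoms in the question but is an assumption. `decodeHebrewLetters` is a hypothetical, deliberately incomplete decoder that only covers ASCII and the letters alef to tav.

```javascript
// Windows-1255 bytes for five Hebrew letters: vav, yod, dalet, bet, resh
// (the consonants of the first word in the sample page, without vowel points).
const bytes = Uint8Array.from([0xE5, 0xE9, 0xE3, 0xE1, 0xF8]);

// Wrong charset: none of these bytes forms a valid UTF-8 sequence, so each one
// becomes U+FFFD, the replacement character shown as a black diamond.
const wrong = new TextDecoder('utf-8').decode(bytes);

// Hypothetical, simplified Windows-1255 decoder (assumption: only ASCII and
// the letters 0xE0-0xFA, which map to U+05D0-U+05EA, are handled).
function decodeHebrewLetters(input) {
  let out = '';
  for (const b of input) {
    if (b < 0x80) {
      out += String.fromCharCode(b);
    } else if (b >= 0xE0 && b <= 0xFA) {
      out += String.fromCharCode(0x05D0 + (b - 0xE0));
    } else {
      out += '\uFFFD'; // anything else is outside this sketch's scope
    }
  }
  return out;
}

const right = decodeHebrewLetters(bytes);
console.log(JSON.stringify(wrong), right);
```

Once the text has been decoded with the wrong charset, the original bytes are gone, which is why re-encoding the string afterwards through a blob cannot bring the Hebrew back.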

    You should use the variation of the .getContentText() method that returns the content of an HTTP response encoded as a string of the given charset. For your case:

    var xml = UrlFetchApp.fetch(url).getContentText("Windows-1255");
    

    That should be all you need to change. The blob() work-around is no longer needed (it's harmless, though). Other comments:

    • The logical OR operator (||) is very helpful for setting default values. I've tweaked the first few lines to enable testing but still let the function operate normally with arguments.

    • The way you're setting up an empty array before populating it with strings is Bad JavaScript; it's complex code that isn't needed, so toss it. Instead, we'll declare an empty array, then push() rows onto it.

    • The .replace() functions can be reduced with more clever RegExp use; I've included the URLs for demos of the really tricky ones.

    • There were \n newline characters in the text which I guessed were unnecessary for your purposes, so added a replace() for them as well.
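The first two points can be tried in isolation. This is a small sketch in plain JavaScript; `buildUrl` and `pairRows` are hypothetical helper names used only for illustration, not part of the original script.

```javascript
// Default values with ||: a missing (or otherwise falsy) argument falls back
// to the test value on the right-hand side.
function buildUrl(book, chapter) {
  var bk = book || '04';
  var ch = chapter || '01';
  return 'http://www.mechon-mamre.org/p/pt/pt' + bk + ch + '.htm';
}

// Growing a two-dimensional array with push() instead of pre-allocating it.
function pairRows(hebrew, english) {
  var rows = [];
  for (var i = 0; i < hebrew.length; i++) {
    rows.push([hebrew[i], english[i]]); // one [hebrew, english] row per verse
  }
  return rows;
}

console.log(buildUrl());           // no arguments: defaults are used
console.log(buildUrl('01', '02')); // explicit arguments win
console.log(pairRows(['a', 'b'], ['x', 'y']));
```

One caveat with `||`: it treats every falsy value (an empty string, `0`, `null`) as missing, which is fine here because valid book and chapter codes are non-empty strings.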

    Here's what you're left with:

    function parseText(book, chapter) {
      var bk = book || '04'; //hard-coded for testing purposes
      var ch = chapter || '01'; //hard-coded for testing purposes
      var url = 'http://www.mechon-mamre.org/p/pt/pt' + bk + ch + '.htm';
    
      var xml = UrlFetchApp.fetch(url).getContentText("Windows-1255");
    
      //I had to "fix" these xml errors for XmlService.parse(xml) below
      //to function.
      xml = xml.replace(/(<!DOCTYPE.*EN")>/gi, '$1 "">')
               .replace(/(<(LINK|meta).*>)/gi,'$1</$2>')        // https://regex101.com/r/nH3pU8/1
               .replace(/(<.*?=)([^"']*?)([ >])/gi,'$1"$2"$3')  // https://regex101.com/r/eP7wO7/1
               .replace(/<BR>/gi, '<BR/>')
               .replace(/\n/g, '');
    
      //This section is the specific route to the table in the page I want
      var document = XmlService.parse(xml);
      var body = document.getRootElement().getChildren("BODY");
      var maintable = body[0].getChildren("TABLE");
      var maintablechildren = maintable[0].getChildren();
    
      //This is where the table gets parsed into the array
      var array = [];
      for (var i = 0; i < maintablechildren.length; i++) {
        var verse = maintablechildren[i].getChildren();
    
        //I originally tried verse[0].getText() but it didn't work. It does now!
        var hebrew = verse[0].getText();
        //This array receives the English text and works fine.
        var english = verse[1].getText();
        array.push([hebrew,english]);
      }
    
      return array;
    }
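The replace() chain in the function above can be exercised on its own, without UrlFetchApp or XmlService. The sketch below applies the same five patterns to a made-up three-line fragment; `tidyForXml` and the sample markup are assumptions for illustration, not taken from the real page.

```javascript
// Same five replacements as in parseText(), wrapped in a hypothetical helper.
function tidyForXml(html) {
  return html
    .replace(/(<!DOCTYPE.*EN")>/gi, '$1 "">')            // add the missing system identifier
    .replace(/(<(LINK|meta).*>)/gi, '$1</$2>')            // close LINK and meta elements
    .replace(/(<.*?=)([^"']*?)([ >])/gi, '$1"$2"$3')      // quote an unquoted attribute value
    .replace(/<BR>/gi, '<BR/>')                           // self-close line breaks
    .replace(/\n/g, '');                                  // drop newlines
}

// Made-up sample input (assumption), one tag per line.
var sample =
  '<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">\n' +
  '<LINK REL="stylesheet" HREF="p.css" TYPE="text/css">\n' +
  '<TD class=h>text<BR>more</TD>';

var tidied = tidyForXml(sample);
console.log(tidied);
```

In this sample the attribute-quoting pattern only has to handle one unquoted attribute per tag. A tag carrying several unquoted attributes may need a different pattern, so it is worth checking the output against the real page before relying on it.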
    

    Results

     [
      [
        "  וַיְדַבֵּר יְהוָה אֶל-מֹשֶׁה בְּמִדְבַּר סִינַי, בְּאֹהֶל מוֹעֵד:  בְּאֶחָד לַחֹדֶשׁ הַשֵּׁנִי בַּשָּׁנָה הַשֵּׁנִית, לְצֵאתָם מֵאֶרֶץ מִצְרַיִם--לֵאמֹר.",
        " And the LORD spoke unto Moses in the wilderness of Sinai, in the tent of meeting, on the first day of the second month, in the second year after they were come out of the land of Egypt, saying:"
      ],
      [
        "  שְׂאוּ, אֶת-רֹאשׁ כָּל-עֲדַת בְּנֵי-יִשְׂרָאֵל, לְמִשְׁפְּחֹתָם, לְבֵית אֲבֹתָם--בְּמִסְפַּר שֵׁמוֹת, כָּל-זָכָר לְגֻלְגְּלֹתָם.",
        " 'Take ye the sum of all the congregation of the children of Israel, by their families, by their fathers' houses, according to the number of names, every male, by their polls;"
      ],
      [
        "  מִבֶּן עֶשְׂרִים שָׁנָה וָמַעְלָה, כָּל-יֹצֵא צָבָא בְּיִשְׂרָאֵל--תִּפְקְדוּ אֹתָם לְצִבְאֹתָם, אַתָּה וְאַהֲרֹן.",
        " from twenty years old and upward, all that are able to go forth to war in Israel: ye shall number them by their hosts, even thou and Aaron."
      ],
      ...
    
