How to decode Angular's custom HTML encoding with Python

Views: 48 | Published: 2021/4/15 19:06:31 | Tags: python, angular, parsing, web-scraping, beautifulsoup

Problem description

I want to scrape and parse a London Stock Exchange news article.

Almost the entire content of the site comes from a JSON payload that's consumed by JavaScript. However, this can be easily extracted with BeautifulSoup and parsed with the json module.

But the encoding of the script is a bit funky.

The <script> tag has an id of "ng-lseg-state", which means this is Angular's custom HTML encoding.

For example:

&l;div class=\"news-body-content\"&g;&l;html xmlns=\"http://www.w3.org/1999/xhtml\"&g;\n&l;head&g;\n&l;meta http-equiv=\"Content-Type\" content=\"text/html; charset=UTF-8\" /&g;\n&l;title&g;&l;/title&g;\n&l;meta name=\"generator\"

I handle this with a .replace() chain:

import json

import requests
from bs4 import BeautifulSoup

url = "https://www.londonstockexchange.com/news-article/ESNT/date-for-fy-2020-results-announcement/14850033"
script = BeautifulSoup(requests.get(url).text, "lxml").find("script", {"id": "ng-lseg-state"})
article = json.loads(script.string.replace("&q;", '"'))
main_key = "G.{{api_endpoint}}/api/v1/pages?parameters=newsId%3D14850033&a;path=news-article"
article_body = article[main_key]["body"]["components"][1]["content"]["newsArticle"]["value"]
decoded_body = (
    article_body
    .replace('&l;', '<')
    .replace('&g;', '>')
    .replace('&q;', '"')
)
print(BeautifulSoup(decoded_body, "lxml").find_all("p"))

But there are still some characters that I'm not sure how to handle:

&a;#160;
&a;amp;
&s;

just to name a few.

So, the question is, how do I deal with the rest of the chars? Or maybe there's a parser or a reliable char mapping out there that I don't know of?

Solution

Angular encodes transfer state using a special escape function located here:

export function escapeHtml(text: string): string {
  const escapedText: {[k: string]: string} = {
    '&': '&a;',
    '"': '&q;',
    '\'': '&s;',
    '<': '&l;',
    '>': '&g;',
  };
  return text.replace(/[&"'<>]/g, s => escapedText[s]);
}

export function unescapeHtml(text: string): string {
  const unescapedText: {[k: string]: string} = {
    '&a;': '&',
    '&q;': '"',
    '&s;': '\'',
    '&l;': '<',
    '&g;': '>',
  };
  return text.replace(/&[^;]+;/g, s => unescapedText[s]);
}
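
To see how the two tables interact, here is a rough Python port of both functions. The names escape_html and unescape_html are my own, not Angular's, and the unescape pattern is deliberately narrower than the original regex: it only matches the five known sequences. The sketch shows that `&` itself is escaped to `&a;`, which is why an ordinary HTML entity such as `&#160;` shows up in the payload as `&a;#160;`.

```python
import re

# Mirrors the escapedText table from the Angular snippet above.
ESCAPES = {"&": "&a;", '"': "&q;", "'": "&s;", "<": "&l;", ">": "&g;"}
# Inverse table, equivalent to unescapedText.
UNESCAPES = {escaped: raw for raw, escaped in ESCAPES.items()}

ESCAPE_RE = re.compile("[&\"'<>]")
# Assumption: match only the five known sequences instead of /&[^;]+;/.
UNESCAPE_RE = re.compile(r"&[aqslg];")


def escape_html(text: str) -> str:
    """Replace each special character in a single pass (hypothetical helper)."""
    return ESCAPE_RE.sub(lambda m: ESCAPES[m.group(0)], text)


def unescape_html(text: str) -> str:
    """Reverse escape_html in a single pass (hypothetical helper)."""
    return UNESCAPE_RE.sub(lambda m: UNESCAPES[m.group(0)], text)


sample = '<p class="x">Tom&#160;&amp;&#160;Jerry\'s</p>'
encoded = escape_html(sample)
print(encoded)

# The round trip restores the original text, entities included.
assert unescape_html(encoded) == sample
```

Doing the replacement in one pass matters: a literal `&l;` in the source text is escaped to `&a;l;`, and a single-pass unescape turns it back into `&l;` rather than `<`.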

You can reproduce the unescapeHtml function in Python and add html.unescape to resolve additional HTML entities:

import json
import requests
from bs4 import BeautifulSoup
import html

unescapedText = {
    '&a;': '&',
    '&q;': '"',
    '&s;': '\'',
    '&l;': '<',
    '&g;': '>',
}

def unescape(text):
    # Undo Angular's escapes, then resolve standard HTML entities such as &#160;.
    for key, value in unescapedText.items():
        text = text.replace(key, value)
    return html.unescape(text)

url = "https://www.londonstockexchange.com/news-article/ESNT/date-for-fy-2020-results-announcement/14850033"
script = BeautifulSoup(requests.get(url).text, "lxml").find("script", {
    "id": "ng-lseg-state"
})
payload = json.loads(unescape(script.string))
main_key = "G.{{api_endpoint}}/api/v1/pages?parameters=newsId%3D14850033&path=news-article"
article_body = payload[main_key]["body"]["components"][1]["content"]["newsArticle"]["value"]
print(BeautifulSoup(article_body, "lxml").find_all("p"))

You were missing &s; (single quote) and &a; (ampersand).

repl.it: https://replit.com/@bertrandmartel/AngularTransferStateDecode
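
As a possible variation on the code above, the sketch below decodes in two stages and runs offline on a made-up payload (raw_state is invented for illustration and is not real London Stock Exchange data). It undoes Angular's escapes in a single pass, parses the JSON, and only then resolves standard HTML entities. The ordering is an assumption about what is safest: resolving an entity like `&quot;` before json.loads would inject a bare double quote into a JSON string. The helper name unescape_transfer_state is hypothetical.

```python
import html
import json
import re

# Same mapping as the unescapedText table above.
ANGULAR_UNESCAPES = {"&a;": "&", "&q;": '"', "&s;": "'", "&l;": "<", "&g;": ">"}


def unescape_transfer_state(text: str) -> str:
    """Undo Angular's five escapes in one pass (hypothetical helper)."""
    return re.sub(r"&[aqslg];", lambda m: ANGULAR_UNESCAPES[m.group(0)], text)


# Made-up stand-in for the contents of the <script id="ng-lseg-state"> tag.
raw_state = (
    "{&q;value&q;:&q;&l;p&g;Q&a;amp;A&a;#160;it&s;s "
    "&a;quot;ok&a;quot;&l;/p&g;&q;}"
)

# Stage 1: Angular escapes -> plain JSON text (HTML entities still intact).
stage1 = unescape_transfer_state(raw_state)
payload = json.loads(stage1)

# Stage 2: resolve the remaining HTML entities in the extracted value.
stage2 = html.unescape(payload["value"])
print(repr(stage2))
```

If the value is handed straight to BeautifulSoup, as in the scripts above, the parser resolves HTML entities while parsing, so the explicit html.unescape call is mainly useful when you want the decoded string itself.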