Beautiful Soup: how to decode HTML JSON data in a <script> element
Question
I collect text from a website that publishes short news updates. The site's front end was recently upgraded and now uses Angular. The historical documents now load inside a <script> element within a new Angular "news" page.
The data within this script element is HTML stored as JSON. It is encoded in a format that I am unfamiliar with, and I have not been able to decode it. Chrome, however, interprets the HTML elements inside the script element.
Extracts from the element that stores each old document are shown below:
<script id="ng-agritown-state" type="application/json">
{&q;G.{{api_endpoint}}/api/v12/pages?parameters=newsId%3D343436565656&a;path=news-article&q;:{&q;body&q;:{&q;id&q;:&q;8&q;,&q;layout&q;:&q;onecol&q;,&q;info&q;:{&q;title&q;:&q;News article&q;
...
&q;&g;&l;span class=\&q;z\&q;&g;Record harvest 2020&l;/span&g;&l;/p&g;\n&l;p class=\&q;a\&q;&g;&l;span class=\&q;z\&q;&g;We are pleased to announce a record harvest in this current
...
&q;isDataComponentAndIsAvailable&q;:true,&q;status&q;:{&q;refreshedTime&q;:1590993288947,&q;childComponents&q;:[],&q;params&q;:{&q;updates&q;:null,&q;cloneFrom&q;:null,&q;encoder&q;:{},&q;map&q;:null}}}]}}
</script>
Can anyone identify this encoding format? How can I decode it with Python / Beautiful Soup?
Recommended answer
This content seems to use a custom encoding. You can try a simple str.replace:
txt = r'''<script id="ng-agritown-state" type="application/json">
{&q;G.{{api_endpoint}}/api/v12/pages?parameters=newsId%3D343436565656&a;path=news-article&q;:{&q;body&q;:{&q;id&q;:&q;8&q;,&q;layout&q;:&q;onecol&q;,&q;info&q;:{&q;title&q;:&q;News article&q;
...
&q;&g;&l;span class=\&q;z\&q;&g;Record harvest 2020&l;/span&g;&l;/p&g;\n&l;p class=\&q;a\&q;&g;&l;span class=\&q;z\&q;&g;We are pleased to announce a record harvest in this current
...
&q;isDataComponentAndIsAvailable&q;:true,&q;status&q;:{&q;refreshedTime&q;:1590993288947,&q;childComponents&q;:[],&q;params&q;:{&q;updates&q;:null,&q;cloneFrom&q;:null,&q;encoder&q;:{},&q;map&q;:null}}}]}}
</script>'''
from bs4 import BeautifulSoup

soup = BeautifulSoup(txt, 'html.parser')

# Undo the custom escapes for <, > and " in the script's text.
decoded = (soup.script.contents[0]
           .replace('&l;', '<')
           .replace('&g;', '>')
           .replace('&q;', '"'))
print(decoded)
Prints:
{"G.{{api_endpoint}}/api/v12/pages?parameters=newsId%3D343436565656&a;path=news-article":{"body":{"id":"8","layout":"onecol","info":{"title":"News article"
...
"><span class=\"z\">Record harvest 2020</span></p>\n<p class=\"a\"><span class=\"z\">We are pleased to announce a record harvest in this current
...
"isDataComponentAndIsAvailable":true,"status":{"refreshedTime":1590993288947,"childComponents":[],"params":{"updates":null,"cloneFrom":null,"encoder":{},"map":null}}}]}}
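Then use the json / re modules to decode the information. Note that the &a; left in the output appears to be an escaped &; it can be replaced in the same way, after the other replacements.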
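As a minimal sketch of that last step, the decoded text can be passed to json.loads once every escape has been undone. The sequences in the sample resemble the escaping Angular applies to its server-side transfer state, but that is an inference from the question, not something the site confirms. The escape table below is therefore an assumption: &q;, &l;, &g; and &a; appear in the sample, while &s; (single quote) does not. The payload is a hypothetical, complete stand-in for the truncated extract, and unescape_state is an illustrative helper name. In practice the input would be the script text obtained from Beautiful Soup as in the answer above.

```python
import json
import re

# Assumed escape table: &q;, &l;, &g; and &a; are taken from the sample;
# &s; (single quote) is an assumption and does not appear in it.
UNESCAPES = {'&a;': '&', '&q;': '"', '&s;': "'", '&l;': '<', '&g;': '>'}


def unescape_state(text):
    """Replace each known escape in one pass; other text is left untouched."""
    return re.sub(r'&[aqslg];', lambda m: UNESCAPES[m.group(0)], text)


# Hypothetical, complete payload in the same encoding (the original is truncated).
raw = (r'{&q;news?id=1&a;path=news-article&q;:{&q;body&q;:'
       r'&q;&l;p class=\&q;a\&q;&g;Record harvest 2020&l;/p&g;&q;}}')

# After unescaping, the text is ordinary JSON whose values contain HTML.
state = json.loads(unescape_state(raw))
body = state['news?id=1&path=news-article']['body']
print(body)
```

A single regular-expression pass is used instead of chained str.replace calls so that a replacement can never create a new escape sequence: a literal &q; in the source text would be stored as &a;q; and must come back as &q;, not as a double quote. The resulting body string is plain HTML and can be handed to BeautifulSoup again if only the visible text is needed.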