Beautiful Soup: how to decode HTML JSON data in a <script> object


Question

I collect text from a website which publishes short news updates. Recently the site's front-end was upgraded and now uses Angular. The historical documents load within a <script> element inside a new Angular "news" page.

The data within this script element is HTML stored as JSON. It is encoded in a format that I am unfamiliar with, and I have not been able to decode it. However, Chrome correctly interprets the HTML elements stored inside the script element.

An extract from the script element that stores each old document is shown below:

 <script id="ng-agritown-state" type="application/json">

{&q;G.{{api_endpoint}}/api/v12/pages?parameters=newsId%3D343436565656&a;path=news-article&q;:{&q;body&q;:{&q;id&q;:&q;8&q;,&q;layout&q;:&q;onecol&q;,&q;info&q;:{&q;title&q;:&q;News article&q;

    ... 

    &q;&g;&l;span class=\&q;z\&q;&g;Record harvest 2020&l;/span&g;&l;/p&g;\n&l;p class=\&q;a\&q;&g;&l;span class=\&q;z\&q;&g;We are pleased to announce a record harvest in this current

    ...

    &q;isDataComponentAndIsAvailable&q;:true,&q;status&q;:{&q;refreshedTime&q;:1590993288947,&q;childComponents&q;:[],&q;params&q;:{&q;updates&q;:null,&q;cloneFrom&q;:null,&q;encoder&q;:{},&q;map&q;:null}}}]}}

</script>

Can anyone identify this encoding format? How can I decode it with Python / Beautiful Soup?

Recommended answer

This content seems to be custom encoded. You can try a simple str.replace:

txt = r'''<script id="ng-agritown-state" type="application/json">

{&q;G.{{api_endpoint}}/api/v12/pages?parameters=newsId%3D343436565656&a;path=news-article&q;:{&q;body&q;:{&q;id&q;:&q;8&q;,&q;layout&q;:&q;onecol&q;,&q;info&q;:{&q;title&q;:&q;News article&q;

    ...

    &q;&g;&l;span class=\&q;z\&q;&g;Record harvest 2020&l;/span&g;&l;/p&g;\n&l;p class=\&q;a\&q;&g;&l;span class=\&q;z\&q;&g;We are pleased to announce a record harvest in this current

    ...

    &q;isDataComponentAndIsAvailable&q;:true,&q;status&q;:{&q;refreshedTime&q;:1590993288947,&q;childComponents&q;:[],&q;params&q;:{&q;updates&q;:null,&q;cloneFrom&q;:null,&q;encoder&q;:{},&q;map&q;:null}}}]}}

</script>'''


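from bs4 import BeautifulSoup

soup = BeautifulSoup(txt, 'html.parser')

# The <script> body is kept as raw text, so plain string replacement is enough.
# Note: '&a;' (an escaped '&') is not handled here and stays in the output.
decoded = soup.script.contents[0].replace('&l;', '<').replace('&g;', '>').replace('&q;', '"')
print(decoded)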

Prints:

{"G.{{api_endpoint}}/api/v12/pages?parameters=newsId%3D343436565656&a;path=news-article":{"body":{"id":"8","layout":"onecol","info":{"title":"News article"

    ...

    "><span class=\"z\">Record harvest 2020</span></p>\n<p class=\"a\"><span class=\"z\">We are pleased to announce a record harvest in this current

    ...

    "isDataComponentAndIsAvailable":true,"status":{"refreshedTime":1590993288947,"childComponents":[],"params":{"updates":null,"cloneFrom":null,"encoder":{},"map":null}}}]}}

You can then use the json or re module to extract the information from the decoded text.
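As a rough sketch of that last step, the decoded text can be loaded with json once every escape is reversed, including `&a;`, which the replace chain leaves in place. The five-entry mapping below matches the escapes Angular's TransferState applies when it writes state into a script tag; that this site relies on that mechanism is an assumption suggested by the `ng-...-state` id. The helper name `unescape_state`, the shortened payload, and the `content` key are all hypothetical, since the real key names are elided in the question.

```python
import json
import re

# Assumed mapping: the five escapes used by Angular's TransferState.
_UNESCAPE = {"&a;": "&", "&q;": '"', "&s;": "'", "&l;": "<", "&g;": ">"}
_ENTITY = re.compile(r"&[aqslg];")


def unescape_state(text):
    """Undo the escaping in one pass so replaced text is never re-scanned."""
    return _ENTITY.sub(lambda m: _UNESCAPE[m.group(0)], text)


# Hypothetical, shortened payload in the same style as the question.
# The "content" key is invented for illustration; the real key is elided.
escaped = (r'{&q;pages?newsId%3D1&a;path=news-article&q;:{&q;body&q;:{&q;id&q;:&q;8&q;,'
           r'&q;content&q;:&q;&l;p class=\&q;a\&q;&g;Record harvest 2020&l;/p&g;&q;}}}')

data = json.loads(unescape_state(escaped))   # json also resolves the \" sequences
page = data["pages?newsId%3D1&path=news-article"]
html = page["body"]["content"]               # plain HTML string
print(html)
```

The resulting `html` string can be passed to `BeautifulSoup(html, 'html.parser')` if only the text of the article is needed. Doing the replacement in a single regex pass avoids a subtle problem with chained `str.replace` calls, where turning `&a;` into `&` first could accidentally form a new `&q;` sequence.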
