Parsing a script tag with dicts in BeautifulSoup
Question
Working on a partial answer to this question, I came across a bs4.element.Tag that is a mess of nested dicts and lists (s, below).
Is there a way to return a list of urls contained in s without using re.findall? Other comments regarding the structure of this tag are helpful too.
from bs4 import BeautifulSoup
import requests
link = 'https://stackoverflow.com/jobs?med=site-ui&ref=jobs-tab&sort=p'
r = requests.get(link)
soup = BeautifulSoup(r.text, 'html.parser')
s = soup.find('script', type='application/ld+json')
## the first bit of s:
# s
# Out[116]:
# <script type="application/ld+json">
# {"@context":"http://schema.org","@type":"ItemList","numberOfItems":50,
What I've tried:
- randomly perusing through methods with tab completion on s.
- picking through the docs.
My problem is that s only has 1 attribute (type) and doesn't seem to have any child tags.
Answer
You can use s.text to get the content of the script. It's JSON, so you can then just parse it with json.loads. From there, it's simple dictionary access:
import json
from bs4 import BeautifulSoup
import requests
link = 'https://stackoverflow.com/jobs?med=site-ui&ref=jobs-tab&sort=p'
r = requests.get(link)
soup = BeautifulSoup(r.text, 'html.parser')
s = soup.find('script', type='application/ld+json')
# the tag's text is a JSON-LD ItemList; each element carries a 'url' key
urls = [el['url'] for el in json.loads(s.text)['itemListElement']]
print(urls)
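The live markup can change, in which case soup.find returns None and json.loads can raise. A defensive variant might look like the sketch below (the helper name jobs_urls and the empty-list fallback are my own choices, not part of the original answer):

```python
import json
from bs4 import BeautifulSoup

def jobs_urls(html):
    """Return the urls from the page's JSON-LD ItemList, or [] if absent/invalid."""
    soup = BeautifulSoup(html, 'html.parser')
    s = soup.find('script', type='application/ld+json')
    if s is None or not s.string:
        return []          # no JSON-LD script tag on this page
    try:
        data = json.loads(s.string)
    except json.JSONDecodeError:
        return []          # tag present but payload is not valid JSON
    return [item['url'] for item in data.get('itemListElement', [])]

print(jobs_urls('<p>no script here</p>'))  # []
```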