How to convert a BeautifulSoup tag to JSON?
Question
I have an element of type bs4.element.Tag, obtained by web scraping. I usually do: json.loads(soup.find('script', type='application/ld+json').text), but on this page the data only appears inside a plain <script> </script> tag, so I had to do: scripts = soup.find_all('script') and go through the results until I reached the one I am interested in: script = scripts[18].
The variable in question is script. My problem is that I want to access its attributes, for example script['goodsInfo']. Since it is a bs4.element.Tag, I tried script.attrs, which returns {}. Then I tried to convert it to JSON with json.loads(str(script)), which raises the exception: 'JSONDecodeError: Expecting value: line 1 column 1 (char 0)'
This is my code:
import json
from bs4 import BeautifulSoup
import requests
url_aux = 'https://www.shein.com/Mock-neck-Brush-Stroke-Print-Bodycon-Dress-p-941649-cat-1727.html?scici=navbar_2~~tab01navbar04~~4~~real_1727~~~~0~~0'
response = requests.get(url_aux)
soup = BeautifulSoup(response.content, "html.parser")
scripts = soup.find_all('script')
script = scripts[18]
print(json.loads(str(script)))
#output: JSONDecodeError: Expecting value: line 1 column 1 (char 0)
print(type(script))
#output: bs4.element.Tag
print(str(json.loads(str(script))))
Recommended answer
You can use the json module to extract the data, but first you need to locate the right information in the page source; the re module can do that.
For example:
import re
import json
import requests
url = 'https://eur.shein.com/Mock-neck-Brush-Stroke-Print-Bodycon-Dress-p-941649-cat-1727.html?scici=navbar_2~~tab01navbar04~~4~~real_1727~~~~0~~0&ref=www&rep=dir&ret=eur'
txt = re.findall(r'goodsInfo\s*:\s*({.*})', requests.get(url).text)[0]
data = json.loads(txt)
# print(json.dumps(data, indent=4)) # <-- uncomment to see all data
print(data['detail']['goods_name'])
print(data['detail']['brand'])
print('Num of comments:', data['detail']['comment']['comment_num'])
Prints:
Mock-neck Brush Stroke Print Bodycon Dress
SHEIN
Num of comments: 17
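To see why the two attempts in the question fail, here is a small offline sketch. The HTML string, the variable name productIntroData, and the helper extract_object are made up for illustration and are not taken from the real page. It shows that script.attrs only holds HTML attributes of the tag, that str(script) begins with "<script>" and so cannot be parsed as JSON, and that the object can be pulled out of the script text. Instead of capturing the whole object with a regular expression, this variant uses the regular expression only to find where the object starts and lets json.JSONDecoder.raw_decode read exactly one JSON value from that position. It still assumes that the value after goodsInfo is strict JSON.

```python
import json
import re

from bs4 import BeautifulSoup

# Hypothetical stand-in for the real page: the data sits in a JavaScript
# object literal inside a plain <script> tag.
HTML = """
<html><body>
<script>
var productIntroData = {
    goodsInfo: {"detail": {"goods_name": "Example Dress", "brand": "ACME"}},
    other: {"unrelated": true}
};
</script>
</body></html>
"""

soup = BeautifulSoup(HTML, "html.parser")
script = soup.find("script")

# attrs contains HTML attributes of the tag (type, src, ...),
# not JavaScript variables defined inside it.
print(script.attrs)  # {}

# str(script) starts with "<script>", which is not valid JSON.
try:
    json.loads(str(script))
except json.JSONDecodeError as exc:
    print("JSONDecodeError:", exc)


def extract_object(js_text, key):
    """Return the JSON object that follows `key:` in js_text (hypothetical helper)."""
    # Locate "key : " immediately followed by an opening brace.
    match = re.search(r"\b" + re.escape(key) + r"\s*:\s*(?=\{)", js_text)
    if match is None:
        raise ValueError(f"{key!r} not found in script text")
    # raw_decode parses one JSON value starting at the given index
    # and ignores whatever comes after it.
    obj, _end = json.JSONDecoder().raw_decode(js_text, match.end())
    return obj


goods_info = extract_object(str(script.string), "goodsInfo")
print(goods_info["detail"]["goods_name"])
print(goods_info["detail"]["brand"])
```

Because raw_decode stops at the end of the first complete value, this approach does not depend on the object ending at the last closing brace of its line, which the greedy pattern in the answer relies on.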