Scraping wsj.com
I wanted to scrape some data from wsj.com and print it. The actual website is: https://www.wsj.com/market-data/stocks?mod=md_home_overview_stk_main and the data is NYSE Issues Advancing, Declining and NYSE Share Volume Advancing, Declining.
I tried using BeautifulSoup after watching a YouTube video, but I can't get any of the classes inside the body to return a value.
Here is my code:
from bs4 import BeautifulSoup
import requests
source = requests.get('https://www.wsj.com/market-data/stocks?mod=md_home_overview_stk_main').text
soup = BeautifulSoup(source, 'lxml')
body = soup.find('body')
adv = body.find('td', class_='WSJTables--table__cell--2dzGiO7q WSJTheme--table__cell--1At-VGNg ')
print(adv)
Also, while inspecting the page in the Network tab, I noticed that this data is also available as JSON.
Here is the link: https://www.wsj.com/market-data/stocks?id=%7B%22application%22%3A%22WSJ%22%2C%22marketsDiaryType%22%3A%22overview%22%7D&type=mdc_marketsdiary
So I wrote another script to try and parse this data as JSON, but again it's not working.
Here is the code:
import json
import requests
url = 'https://www.wsj.com/market-data/stocks?id=%7B%22application%22%3A%22WSJ%22%2C%22marketsDiaryType%22%3A%22overview%22%7D&type=mdc_marketsdiary'
response = json.loads(requests.get(url).text)
print(response)
The error I get is:
File "C:\Users\User\Anaconda3\lib\json\decoder.py", line 355, in raw_decode
raise JSONDecodeError("Expecting value", s, err.value) from None
JSONDecodeError: Expecting value
I also tried a few different methods from this link and none seem to work.
Can you please set me on the right path to scrape this data?
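The JSONDecodeError is itself a useful clue: json.loads raises "Expecting value" whenever the body it receives is not JSON at all, which here most likely means the server answered the bare request with an HTML page (e.g. a bot-detection or error page) instead of JSON. The failure mode can be reproduced offline; the response body below is a hypothetical stand-in:

```python
import json

# A stand-in for what the server may return to a non-browser request.
body = "<html><body>Access denied</body></html>"

try:
    json.loads(body)
except json.JSONDecodeError as exc:
    # '<' is not the start of any JSON value, hence "Expecting value".
    print("Not JSON:", exc)
```

Checking response.headers.get("Content-Type") before parsing catches this early; the fix is to make the request look like a browser, which the code below does.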
The endpoint does return JSON, but only to a request that looks like a browser: pass the id and type as query parameters and send a User-Agent header.

import json
import requests

# The table id is itself a JSON string; requests URL-encodes it for us.
params = {
    'id': '{"application":"WSJ","marketsDiaryType":"overview"}',
    'type': 'mdc_marketsdiary'
}
# Without a browser-like User-Agent the server does not serve the JSON.
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:73.0) Gecko/20100101 Firefox/73.0"
}
r = requests.get("https://www.wsj.com/market-data/stocks",
                 params=params, headers=headers).json()
data = json.dumps(r, indent=4)  # pretty-print the parsed dict
print(data)
Output:
{
"id": "{\"application\":\"WSJ\",\"marketsDiaryType\":\"overview\"}",
"type": "mdc_marketsdiary",
"data": {
"instrumentSets": [
{
"headerFields": [
{
"value": "name",
"label": "Issues"
}
],
"instruments": [
{
"name": "Advancing",
"NASDAQ": "169",
"NYSE": "69"
},
{
"name": "Declining",
"NASDAQ": "3,190",
"NYSE": "2,973"
},
{
"name": "Unchanged",
"NASDAQ": "24",
"NYSE": "10"
},
{
"name": "Total",
"NASDAQ": "3,383",
"NYSE": "3,052"
}
]
},
{
"headerFields": [
{
"value": "name",
"label": "Issues At"
}
],
"instruments": [
{
"name": "New Highs",
"NASDAQ": "53",
"NYSE": "14"
},
{
"name": "New Lows",
"NASDAQ": "1,406",
"NYSE": "1,620"
}
]
},
{
"headerFields": [
{
"value": "name",
"label": "Share Volume"
}
],
"instruments": [
{
"name": "Total",
"NASDAQ": "4,454,691,895",
"NYSE": "7,790,947,818"
},
{
"name": "Advancing",
"NASDAQ": "506,192,012",
"NYSE": "219,412,232"
},
{
"name": "Declining",
"NASDAQ": "3,948,035,191",
"NYSE": "7,570,377,893"
},
{
"name": "Unchanged",
"NASDAQ": "464,692",
"NYSE": "1,157,693"
}
]
}
],
"timestamp": "4:00 PM EDT 3/09/20"
},
"hash": "{\"id\":\"{\\\"application\\\":\\\"WSJ\\\",\\\"marketsDiaryType\\\":\\\"overview\\\"}\",\"type\":\"mdc_marketsdiary\",\"data\":{\"instrumentSets\":[{\"headerFields\":[{\"value\":\"name\",\"label\":\"Issues\"}],\"instruments\":[{\"name\":\"Advancing\",\"NASDAQ\":\"169\",\"NYSE\":\"69\"},{\"name\":\"Declining\",\"NASDAQ\":\"3,190\",\"NYSE\":\"2,973\"},{\"name\":\"Unchanged\",\"NASDAQ\":\"24\",\"NYSE\":\"10\"},{\"name\":\"Total\",\"NASDAQ\":\"3,383\",\"NYSE\":\"3,052\"}]},{\"headerFields\":[{\"value\":\"name\",\"label\":\"Issues At\"}],\"instruments\":[{\"name\":\"New Highs\",\"NASDAQ\":\"53\",\"NYSE\":\"14\"},{\"name\":\"New Lows\",\"NASDAQ\":\"1,406\",\"NYSE\":\"1,620\"}]},{\"headerFields\":[{\"value\":\"name\",\"label\":\"Share Volume\"}],\"instruments\":[{\"name\":\"Total\",\"NASDAQ\":\"4,454,691,895\",\"NYSE\":\"7,790,947,818\"},{\"name\":\"Advancing\",\"NASDAQ\":\"506,192,012\",\"NYSE\":\"219,412,232\"},{\"name\":\"Declining\",\"NASDAQ\":\"3,948,035,191\",\"NYSE\":\"7,570,377,893\"},{\"name\":\"Unchanged\",\"NASDAQ\":\"464,692\",\"NYSE\":\"1,157,693\"}]}],\"timestamp\":\"4:00 PM EDT 3/09/20\"}}"
}
Note: The response is already parsed into a dict, so you can access it with normal dict operations, e.g. print(r.keys()).
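To pull out just the figures the question asks for (NYSE issues advancing/declining and NYSE share volume advancing/declining), the nested instrumentSets structure can be walked by header label and row name. A minimal sketch against the response shape shown above, using a trimmed sample dict in place of a live request (the helper nyse_value is ad hoc, not part of any API):

```python
# Trimmed sample with the same shape as r["data"] in the output above.
r = {"data": {"instrumentSets": [
    {"headerFields": [{"value": "name", "label": "Issues"}],
     "instruments": [
         {"name": "Advancing", "NASDAQ": "169", "NYSE": "69"},
         {"name": "Declining", "NASDAQ": "3,190", "NYSE": "2,973"},
     ]},
    {"headerFields": [{"value": "name", "label": "Share Volume"}],
     "instruments": [
         {"name": "Advancing", "NASDAQ": "506,192,012", "NYSE": "219,412,232"},
         {"name": "Declining", "NASDAQ": "3,948,035,191", "NYSE": "7,570,377,893"},
     ]},
]}}

def nyse_value(instrument_sets, label, row):
    """Return the NYSE figure for one row of the table with the given label."""
    for iset in instrument_sets:
        if iset["headerFields"][0]["label"] == label:
            for inst in iset["instruments"]:
                if inst["name"] == row:
                    # Figures are strings with thousands separators.
                    return int(inst["NYSE"].replace(",", ""))
    return None

sets = r["data"]["instrumentSets"]
print("NYSE issues advancing:", nyse_value(sets, "Issues", "Advancing"))
print("NYSE issues declining:", nyse_value(sets, "Issues", "Declining"))
print("NYSE volume advancing:", nyse_value(sets, "Share Volume", "Advancing"))
print("NYSE volume declining:", nyse_value(sets, "Share Volume", "Declining"))
```

The same lookup works unchanged on the full live response, since only the shape of data["instrumentSets"] matters.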