Scraping wsj.com

Problem description

I want to scrape some data from wsj.com and print it. The page is https://www.wsj.com/market-data/stocks?mod=md_home_overview_stk_main and the data I need is NYSE Issues Advancing and Declining, and NYSE Share Volume Advancing and Declining.

I tried using BeautifulSoup after watching a YouTube video, but I can't get any of the classes inside body to return a value.

Here is my code:

from bs4 import BeautifulSoup
import requests

source = requests.get('https://www.wsj.com/market-data/stocks?mod=md_home_overview_stk_main').text
soup = BeautifulSoup(source, 'lxml')
body = soup.find('body')
adv = body.find('td', class_='WSJTables--table__cell--2dzGiO7q WSJTheme--table__cell--1At-VGNg ')

print(adv)

Also, while inspecting the page's requests in the browser's Network tab, I noticed that this data is also available as JSON.

Here is the link: https://www.wsj.com/market-data/stocks?id=%7B%22application%22%3A%22WSJ%22%2C%22marketsDiaryType%22%3A%22overview%22%7D&type=mdc_marketsdiary

So I wrote another script to try to parse this data as JSON, but again it's not working.

Here is the code:

import json
import requests

url = 'https://www.wsj.com/market-data/stocks?id=%7B%22application%22%3A%22WSJ%22%2C%22marketsDiaryType%22%3A%22overview%22%7D&type=mdc_marketsdiary'
response = json.loads(requests.get(url).text)

print(response)

The error I get is:

  File "C:\Users\User\Anaconda3\lib\json\decoder.py", line 355, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None

JSONDecodeError: Expecting value

I also tried a few different methods from this link and none seem to work.

Can you please point me in the right direction on how to scrape this data?

Solution

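import json

import requests

# Pass the query parameters as a dict and let requests URL-encode them.
params = {
    'id': '{"application":"WSJ","marketsDiaryType":"overview"}',
    'type': 'mdc_marketsdiary'
}

# Send a browser-like User-Agent; the request in the question did not set one.
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:73.0) Gecko/20100101 Firefox/73.0"
}

response = requests.get(
    "https://www.wsj.com/market-data/stocks", params=params, headers=headers)
response.raise_for_status()  # fail on an HTTP error before trying to decode JSON
r = response.json()

# Pretty-print the parsed response.
data = json.dumps(r, indent=4)
print(data)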

Output:

{
    "id": "{\"application\":\"WSJ\",\"marketsDiaryType\":\"overview\"}",
    "type": "mdc_marketsdiary",
    "data": {
        "instrumentSets": [
            {
                "headerFields": [
                    {
                        "value": "name",
                        "label": "Issues"
                    }
                ],
                "instruments": [
                    {
                        "name": "Advancing",
                        "NASDAQ": "169",
                        "NYSE": "69"
                    },
                    {
                        "name": "Declining",
                        "NASDAQ": "3,190",
                        "NYSE": "2,973"
                    },
                    {
                        "name": "Unchanged",
                        "NASDAQ": "24",
                        "NYSE": "10"
                    },
                    {
                        "name": "Total",
                        "NASDAQ": "3,383",
                        "NYSE": "3,052"
                    }
                ]
            },
            {
                "headerFields": [
                    {
                        "value": "name",
                        "label": "Issues At"
                    }
                ],
                "instruments": [
                    {
                        "name": "New Highs",
                        "NASDAQ": "53",
                        "NYSE": "14"
                    },
                    {
                        "name": "New Lows",
                        "NASDAQ": "1,406",
                        "NYSE": "1,620"
                    }
                ]
            },
            {
                "headerFields": [
                    {
                        "value": "name",
                        "label": "Share Volume"
                    }
                ],
                "instruments": [
                    {
                        "name": "Total",
                        "NASDAQ": "4,454,691,895",
                        "NYSE": "7,790,947,818"
                    },
                    {
                        "name": "Advancing",
                        "NASDAQ": "506,192,012",
                        "NYSE": "219,412,232"
                    },
                    {
                        "name": "Declining",
                        "NASDAQ": "3,948,035,191",
                        "NYSE": "7,570,377,893"
                    },
                    {
                        "name": "Unchanged",
                        "NASDAQ": "464,692",
                        "NYSE": "1,157,693"
                    }
                ]
            }
        ],
        "timestamp": "4:00 PM EDT 3/09/20"
    },
    "hash": "{\"id\":\"{\\\"application\\\":\\\"WSJ\\\",\\\"marketsDiaryType\\\":\\\"overview\\\"}\",\"type\":\"mdc_marketsdiary\",\"data\":{\"instrumentSets\":[{\"headerFields\":[{\"value\":\"name\",\"label\":\"Issues\"}],\"instruments\":[{\"name\":\"Advancing\",\"NASDAQ\":\"169\",\"NYSE\":\"69\"},{\"name\":\"Declining\",\"NASDAQ\":\"3,190\",\"NYSE\":\"2,973\"},{\"name\":\"Unchanged\",\"NASDAQ\":\"24\",\"NYSE\":\"10\"},{\"name\":\"Total\",\"NASDAQ\":\"3,383\",\"NYSE\":\"3,052\"}]},{\"headerFields\":[{\"value\":\"name\",\"label\":\"Issues At\"}],\"instruments\":[{\"name\":\"New Highs\",\"NASDAQ\":\"53\",\"NYSE\":\"14\"},{\"name\":\"New Lows\",\"NASDAQ\":\"1,406\",\"NYSE\":\"1,620\"}]},{\"headerFields\":[{\"value\":\"name\",\"label\":\"Share Volume\"}],\"instruments\":[{\"name\":\"Total\",\"NASDAQ\":\"4,454,691,895\",\"NYSE\":\"7,790,947,818\"},{\"name\":\"Advancing\",\"NASDAQ\":\"506,192,012\",\"NYSE\":\"219,412,232\"},{\"name\":\"Declining\",\"NASDAQ\":\"3,948,035,191\",\"NYSE\":\"7,570,377,893\"},{\"name\":\"Unchanged\",\"NASDAQ\":\"464,692\",\"NYSE\":\"1,157,693\"}]}],\"timestamp\":\"4:00 PM EDT 3/09/20\"}}"
}
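
The figures in the output above are strings with thousands separators (for example "7,570,377,893"), so they need converting before any arithmetic. A minimal sketch of that conversion, assuming every value uses plain comma grouping as in this sample; to_int is a hypothetical helper name, not part of the answer:

```python
def to_int(text: str) -> int:
    """Convert a comma-grouped number string such as "3,190" to an int."""
    return int(text.replace(",", ""))


# Values copied from the NYSE "Share Volume" rows in the output above.
nyse_advancing_volume = to_int("219,412,232")
nyse_declining_volume = to_int("7,570,377,893")

print(nyse_advancing_volume, nyse_declining_volume)
```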

Note: r is a regular Python dict, so you can inspect its top-level keys with print(r.keys()) and index into it as usual.
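
To pull out only the four values the question asks for, the dict can be walked by label and row name. The sketch below is one way to do it, assuming the response keeps the shape shown in the output above. Here r is a trimmed, hard-coded copy of that output so the example runs without a network call, and nyse_values is a hypothetical helper name:

```python
# Trimmed copy of the output above; in practice r would come from
# requests.get(...).json() as in the answer.
r = {
    "data": {
        "instrumentSets": [
            {
                "headerFields": [{"value": "name", "label": "Issues"}],
                "instruments": [
                    {"name": "Advancing", "NASDAQ": "169", "NYSE": "69"},
                    {"name": "Declining", "NASDAQ": "3,190", "NYSE": "2,973"},
                ],
            },
            {
                "headerFields": [{"value": "name", "label": "Share Volume"}],
                "instruments": [
                    {"name": "Advancing", "NASDAQ": "506,192,012", "NYSE": "219,412,232"},
                    {"name": "Declining", "NASDAQ": "3,948,035,191", "NYSE": "7,570,377,893"},
                ],
            },
        ]
    }
}


def nyse_values(response, labels=("Issues", "Share Volume"),
                names=("Advancing", "Declining")):
    """Return {(label, name): NYSE value} for the requested tables and rows."""
    result = {}
    for instrument_set in response["data"]["instrumentSets"]:
        label = instrument_set["headerFields"][0]["label"]
        if label not in labels:
            continue
        for row in instrument_set["instruments"]:
            if row["name"] in names:
                result[(label, row["name"])] = row["NYSE"]
    return result


values = nyse_values(r)
for (label, name), value in values.items():
    print(f"NYSE {label} {name}: {value}")
```

Matching on the label and name strings rather than on list positions keeps the lookup readable, though it would still break if WSJ renamed those fields.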
