Getting an HTML element and sending new JSON requests in Python


Question

I am trying to crawl this link by sending JSON requests. My first request is:

import json
import requests

# `header` was not defined in the original snippet; a minimal placeholder is assumed here.
header = {'User-Agent': 'Mozilla/5.0'}

parameters1 = {'ticker': 'XOM', 'countryCode': 'US',
               'dateTime': '', 'docId': '1222737422 ',
               'docType': '806', 'sequence': 'e5a00f51-8821-4fbc-8ac6-e5f64b5eb0f2',
               'messageNumber': '', 'count': '10',
               'channelName': '/news/latest/company/us/xom', 'topic': '',
               '_': ''}
firstUrl = "http://www.marketwatch.com/news/headline/getheadlines"
html1 = requests.get(firstUrl, params=parameters1, headers=header)
html_json1 = json.loads(html1.text)  # parse the JSON body of the response

To send the next request, I have to extract docId from the corresponding HTML and add it to the new parameters. I don't know how to do that. Do you have any idea how to get the new HTML after sending a JSON request?

Recommended answer

import requests
import json

from bs4 import BeautifulSoup 


def main():

    html_url = 'http://www.marketwatch.com/investing/stock/xom'

    resp = requests.get(html_url)
    if resp.status_code != 200:
        raise Exception("http request failed: %s" % resp)
    soup = BeautifulSoup(resp.text, 'lxml')

    # Take `data-uniqueid` from the last news item under 'MarketWatch News on XOM'.
    li_nodes = soup.select("#mwheadlines > div.headlinewrapper > ol > li[data-uniqueid]")
    if not li_nodes:
        raise Exception("no headline nodes found; the page layout may have changed")
    li_node = li_nodes[-1]
    unique_id = li_node['data-uniqueid']
    print('got unique_id=%r, from %r' % (unique_id, li_node.text.replace('\n', ' ').strip()))


    baseUrl = 'http://www.marketwatch.com/news/headline/getheadlines'
    parameters = {
        'ticker':'XOM',
        'countryCode':'US',
        'docType':'806',
        'docId': '',  # (optional) initial value extracted from the HTML page
        'sequence': 'e5a00f51-8821-4fbc-8ac6-e5f64b5eb0f2',  # initial value extracted from the HTML page
        'messageNumber': '8589',  # initial value extracted from the HTML page
        'count':'10',
        'channelName': '/news/latest/company/us/xom',
    }

    parameters.update(extract_page_params(unique_id))


    while True:
        resp = requests.get(baseUrl, params=parameters)
        data = json.loads(resp.text)  # list of headline dicts (up to `count` items)
        if not data:  # stop when no more headlines are returned
            break
        first = data[0]  # first item of the list
        last = data[-1]  # last item of the list
        print("\ngot %d data, url: %s" % (len(data), resp.url))
        print("\tfirst: %-42s, %s" % (first['UniqueId'], first['SeoHeadlineFragment']))
        print("\t last: %-42s, %s" % (last['UniqueId'], last['SeoHeadlineFragment']))
        print("")


        uid = last['UniqueId'] # get value of UniqueId from dict object `last`

        parameters.update(extract_page_params(uid))

        input("press <enter> to get next")


def extract_page_params(uid):
    sequence = ''
    messageNumber = ''

    docId = ''

    if ':' in uid: # if the symbol ':' in string `uid`
        # uid looks like `e5a00f51-8821-4fbc-8ac6-e5f64b5eb0f2:8499`
        # so split it by ':'
        sequence, messageNumber = uid.split(':')
    else:
        docId = uid

    return {
        'sequence': sequence,
        'messageNumber': messageNumber,
        'docId': docId,
    }


if __name__ == '__main__':
    main()

This is my code to solve your problem.
Since you are new to programming, I have added some comments.
You can copy it and run it directly with Python 3. (Python 2 should mostly work too, but `input` would need to be replaced with `raw_input` there.)
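
To try the paging logic without hitting the site, here is a minimal offline sketch. It assumes the page and the JSON response look like the made-up `SAMPLE_HTML` and `SAMPLE_JSON` below; both samples and the `UniqueIdCollector` helper are hypothetical. It uses the standard-library `html.parser` instead of BeautifulSoup and, as a simplification, collects every `li[data-uniqueid]` rather than only those under `#mwheadlines`.

```python
import json
from html.parser import HTMLParser


class UniqueIdCollector(HTMLParser):
    """Hypothetical helper: collect data-uniqueid from every <li> that has one."""

    def __init__(self):
        super().__init__()
        self.unique_ids = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs with lowercased names
        if tag == 'li':
            value = dict(attrs).get('data-uniqueid')
            if value:
                self.unique_ids.append(value)


def extract_page_params(uid):
    """Same rule as in the answer: 'sequence:messageNumber' or a bare docId."""
    if ':' in uid:
        sequence, message_number = uid.split(':', 1)
        return {'sequence': sequence, 'messageNumber': message_number, 'docId': ''}
    return {'sequence': '', 'messageNumber': '', 'docId': uid}


# Made-up sample data standing in for the real page and the real JSON response.
SAMPLE_HTML = """
<div id="mwheadlines"><div class="headlinewrapper"><ol>
  <li data-uniqueid="e5a00f51-8821-4fbc-8ac6-e5f64b5eb0f2:8600">Headline A</li>
  <li data-uniqueid="e5a00f51-8821-4fbc-8ac6-e5f64b5eb0f2:8589">Headline B</li>
</ol></div></div>
"""
SAMPLE_JSON = (
    '[{"UniqueId": "e5a00f51-8821-4fbc-8ac6-e5f64b5eb0f2:8580"},'
    ' {"UniqueId": "1222737422"}]'
)

# Step 1: parameters for the first request come from the last <li> in the HTML.
collector = UniqueIdCollector()
collector.feed(SAMPLE_HTML)
first_page_params = extract_page_params(collector.unique_ids[-1])

# Step 2: parameters for the next request come from the last item of the JSON batch.
batch = json.loads(SAMPLE_JSON)
next_page_params = extract_page_params(batch[-1]['UniqueId'])

print(first_page_params)
print(next_page_params)
```

The point of the sketch is that the HTML is only needed once, for the first request; after that, each JSON batch supplies the `UniqueId` for the following request.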
