Need help web scraping table with beautifulsoup and selenium webdriver


Problem description


    So I am working on trying to web scrape https://data.bls.gov/cgi-bin/surveymost?bls and was able to figure out how to click through the pages to get to a table.

    The selection that I am practicing on is: select the checkbox associated with "Employment Cost Index (ECI) Civilian (Unadjusted) - CIU1010000000000A" under Compensation, and then select "Retrieve data".

    Once those two steps are processed, a table appears. This is the table I am trying to scrape.

    Below is the code that I have as of right now.

    Note that you have to put your own path for your browser driver where I have put <browser driver>.

    from bs4 import BeautifulSoup
    from urllib.request import urlopen
    import pandas as pd
    import numpy as np
    import requests
    import lxml.html as lh
    
    from selenium import webdriver
    
    url = "https://data.bls.gov/cgi-bin/surveymost?bls"
    ChromeSource = r"<browser driver>"
    
    # Open up a Chrome browser and navigate to the web page.
    options = webdriver.ChromeOptions()
    options.add_argument('--ignore-certificate-errors')
    options.add_argument('--incognito')
    options.add_argument('--headless')  # run without opening a browser window
    driver = webdriver.Chrome(ChromeSource, options=options)  # chrome_options= is deprecated
    driver.get(url)
    
    # Tick the ECI checkbox, then submit the form.
    driver.find_element_by_xpath("//input[@type='checkbox' and @value='CIU1010000000000A']").click()
    driver.find_element_by_xpath("//input[@type='Submit' and @value='Retrieve data']").click()
    
    def myTEST(i):
        # Print the text of every element whose id is "col<i>".
        xpath = '//*[@id="col' + str(i) + '"]'
        cells = driver.find_elements_by_xpath(xpath)
        for cell in cells:
            print(cell.text)
    
    myTEST(2)
    
    # Clean up (close the browser once the task is completed).
    driver.close()
    

    Right now this is only looking at the headers. I would like to get the table contents as well.

    If I make i = 0, it produces "Year". With i = 1, it produces "Period". But if I select i = 2, I get two elements that share the same col2 id: "Estimated Value" and "Standard Error".

    I tried to think of a way to work around this and can't seem to get anything that I have researched to work.

    In essence, it would be better to start at the point where I am done clicking and am at the table of interest, then look at the xpath of the header row and pull in the text of all of its sub-<th> elements.

    <tr>
      <th id="col0">Year</th>
      <th id="col1">Period</th>
      <th id="col2">Estimated Value</th>
      <th id="col2">Standard Error</th>
    </tr>
    

    I am not sure how to do that. I also tried to loop through the i values, but two headers sharing the same id obviously causes an issue; a possible workaround is sketched below.
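
    A minimal sketch of that workaround, assuming the driver is already sitting on the results page (the XPaths here are assumptions about the page structure, not verified against the live page): grab the whole header row at once instead of addressing each header by its non-unique id.

    # Sketch: read the header row and body rows wholesale, so the duplicate
    # "col2" id never has to be referenced. The XPaths are assumptions.
    header_cells = driver.find_elements_by_xpath("//table//tr[1]/th")
    headers = [cell.text.strip() for cell in header_cells]
    print(headers)  # expected: ['Year', 'Period', 'Estimated Value', 'Standard Error']
    
    rows = []
    for tr in driver.find_elements_by_xpath("//table//tr[position() > 1]"):
        cells = tr.find_elements_by_xpath("./td")
        if cells:  # skip rows that have no <td> data cells
            rows.append([td.text.strip() for td in cells])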

    Once I am able to get the header, I want to get the contents. I could use some insight on whether I am on the right path, overthinking it, or whether there is a simpler way to do all of this. I am learning, and this is my first attempt at using the selenium library for clicks. I just want to get it working so I can try it again on a different table and make it as automated and reusable (with tweaking) as possible.

    Solution

    Actually you don't need selenium here. You can just track the POST form data and send the same data in your own POST request.

    Then you can load the table using Pandas easily.

    import requests
    import pandas as pd
    
    # The same form fields the page submits when you click "Retrieve data".
    data = {
        "series_id": "CIU1010000000000A",
        "survey": "bls"
    }
    
    
    def main(url):
        r = requests.post(url, data=data)
        # read_html returns a list of all tables on the page;
        # index 1 is the data table here.
        df = pd.read_html(r.content)[1]
        print(df)
    
    
    main("https://data.bls.gov/cgi-bin/surveymost")
    

    Explanation:

    • Open the site.
    • Select Employment Cost Index (ECI) Civilian (Unadjusted) - CIU1010000000000A.
    • Now open your browser's Developer Tools and navigate to the Network Monitor section, e.g. press Ctrl + Shift + E (Command + Option + E on a Mac).
    • Now you will find a POST request has been made.

    • Navigate to the Params tab.

    • Now you can make the POST request yourself. And since the table is present in the HTML source and is not loaded via JavaScript, you can parse it with bs4 or read it in a nice format using pandas.read_html() (see the bs4 sketch after the note below).

    Note: You can read the table as long as it's not loaded via JavaScript. Otherwise, you can try to track the XHR request (check the previous answer), or you can use selenium or requests_html to render the JS, since requests is an HTTP library and can't render it for you.
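
    A minimal sketch of the bs4 route mentioned above, under the same assumption the pandas version makes: that the second <table> on the page is the data table.

    import requests
    from bs4 import BeautifulSoup
    
    data = {"series_id": "CIU1010000000000A", "survey": "bls"}
    r = requests.post("https://data.bls.gov/cgi-bin/surveymost", data=data)
    
    soup = BeautifulSoup(r.content, "html.parser")
    table = soup.find_all("table")[1]  # same index the pandas version uses
    
    # Walk the rows; header cells (<th>) and data cells (<td>) are read alike.
    for tr in table.find_all("tr"):
        print([c.get_text(strip=True) for c in tr.find_all(["th", "td"])])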
