Need help web scraping table with beautifulsoup and selenium webdriver
Question
So I am trying to web scrape https://data.bls.gov/cgi-bin/surveymost?bls and was able to figure out how to crawl through clicks to get to a table.
The selection that I am practicing on is reached by selecting the checkbox associated with "Employment Cost Index (ECI) Civilian (Unadjusted) - CIU1010000000000A" under Compensation and then selecting "Retrieve data".
Once those two steps are processed, a table appears. This is the table I am trying to scrape.
Below is the code that I have as of right now.
Note that you have to put your own path for your browser driver where I have put <browser driver>.
from bs4 import BeautifulSoup
from urllib.request import urlopen
import pandas as pd
import numpy as np
import requests
import lxml.html as lh
from selenium import webdriver

url = "https://data.bls.gov/cgi-bin/surveymost?bls"
ChromeSource = r"<browser driver>"

# Open up a Chrome browser and navigate to web page.
options = webdriver.ChromeOptions()
options.add_argument('--ignore-certificate-errors')
options.add_argument('--incognito')
options.add_argument('--headless')  # will run without opening browser.
driver = webdriver.Chrome(ChromeSource, chrome_options=options)
driver.get(url)

driver.find_element_by_xpath("//input[@type='checkbox' and @value = 'CIU1010000000000A']").click()
driver.find_element_by_xpath("//input[@type='Submit' and @value = 'Retrieve data']").click()

i = 2

def myTEST(i):
    xpath = '//*[@id="col' + str(i) + '"]'
    TEST = driver.find_elements_by_xpath(xpath)
    num_page_items = len(TEST)
    for i in range(num_page_items):
        print(TEST[i].text)

myTEST(i)

# Clean up (close browser once completed task).
driver.close()
Right now this only looks at the headers. I would like to get the table content as well.
If I set i = 0, it produces "Year". With i = 1, it produces "Period". But if I select i = 2, I get two elements that share the same col2 id: "Estimated Value" and "Standard Error".
I tried to think of a way to work around this and can't seem to get anything that I have researched to work.
In essence, it would be better to start at the point where I am done clicking and am at the table of interest, then look at the xpath of the header row and pull in the text for all of its sub <th>'s.
<tr>
<th id="col0">Year</th>
<th id="col1">Period</th>
<th id="col2">Estimated Value</th>
<th id="col2">Standard Error</th>
</tr>
I am not sure how to do that. I also tried to loop through the i values, but obviously two headers sharing the same id causes an issue.
Once I am able to get the headers, I want to get the contents. I could use some insight on whether I am on the right path, overthinking it, or if there is a simpler way to do all of this. I am learning, and this is my first attempt using the selenium library for clicks. I just want to get it to work so I can try it again on a different table and make it as automated or reusable (with tweaking) as possible.
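As an aside on the duplicate-id problem: BeautifulSoup's find_all returns cells in document order, so indexing by position rather than by id keeps the two col2 cells distinct. A minimal sketch against a hardcoded copy of the header row (on the live page this HTML would come from driver.page_source after the clicks):

```python
from bs4 import BeautifulSoup

# Hardcoded copy of the header row shown above; on the live page this
# would be driver.page_source after clicking through to the table.
html = """
<table>
  <tr>
    <th id="col0">Year</th>
    <th id="col1">Period</th>
    <th id="col2">Estimated Value</th>
    <th id="col2">Standard Error</th>
  </tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
# find_all returns the cells in document order, so the duplicate
# col2 ids never collide.
headers = [th.get_text(strip=True) for th in soup.find_all("th")]
print(headers)  # ['Year', 'Period', 'Estimated Value', 'Standard Error']
```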
Actually you don't need selenium. You can just track the POST form data and send the same values in your own POST request. Then you can load the table easily using Pandas.
import requests
import pandas as pd

data = {
    "series_id": "CIU1010000000000A",
    "survey": "bls"
}

def main(url):
    r = requests.post(url, data=data)
    df = pd.read_html(r.content)[1]
    print(df)

main("https://data.bls.gov/cgi-bin/surveymost")
Explanation:
- Open the site.
- Select "Employment Cost Index (ECI) Civilian (Unadjusted) - CIU1010000000000A".
- Open your browser's Developer Tools and navigate to the Network Monitor section (press Ctrl + Shift + E, or Command + Option + E on a Mac). You will see a POST request being made; navigate to its Params tab.
- Now you can make the same POST request yourself. Since the table is present in the HTML source and is not loaded via JavaScript, you can parse it with bs4 or read it in a nice format using pandas.read_html().
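To show the read_html mechanics without hitting the network, here is a minimal sketch on a toy table of the same shape (on the live site the HTML would be r.content from the POST request, and the data table is the second match on the page, hence the [1] index in the code):

```python
import io
import pandas as pd

# Toy table with the same header shape as the BLS response; on the
# live site this HTML would be r.content from the POST request.
html = """
<table>
  <tr><th>Year</th><th>Period</th><th>Estimated Value</th><th>Standard Error</th></tr>
  <tr><td>2019</td><td>Q4</td><td>139.0</td><td>0.3</td></tr>
</table>
"""

# read_html returns one DataFrame per <table> found in the document;
# the <th> row is picked up as the column header automatically.
df = pd.read_html(io.StringIO(html))[0]
print(df)
```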
Note: You can read the table as long as it's not loaded via JavaScript. Otherwise you can try to track the XHR request (check the previous answer), or you can use selenium or requests_html to render the JS, since requests is an HTTP library and can't render it for you.