Scrape America's Career InfoNet
I've got employer IDs, which can be used to get the business area:
https://www.careerinfonet.org/employ4.asp?emp_id=558742391
The HTML contains the data in tr/td tables:
Business Description: Exporters (Whls)
Primary Industry: Other Miscellaneous Durable Goods Merchant Wholesalers
Related Industry: Sporting and Athletic Goods Manufacturing
So I would like to get
- Exporters (Whls)
- Other Miscellaneous Durable Goods Merchant Wholesalers
- Sporting and Athletic Goods Manufacturing
My example code looks like this:
import requests
from bs4 import BeautifulSoup
page = requests.get("https://www.careerinfonet.org/employ4.asp?emp_id=558742391")
soup = BeautifulSoup(page.text, 'html.parser')
div = soup.find('td', class_='content')
for td in div.find_all('td'):
    print(td.text)
I would like to preface this by saying that this technique is fairly sloppy, but it gets the job done, assuming each page you scrape has a similar layout.
Your code already fetches and parses the page correctly. I simply added a check on every cell to determine whether it is the "Business Description:", "Primary Industry:", or "Related Industry:" label. When it is, the cell that follows it holds the value you want.
import requests
from bs4 import BeautifulSoup
page = requests.get("https://www.careerinfonet.org/employ4.asp?emp_id=558742391")
soup = BeautifulSoup(page.text, 'html.parser')
div = soup.find('td', class_='content')
lst = div.find_all('td')
labels = ("Business Description:", "Primary Industry:", "Related Industry:")
# The value sits in the cell right after its label cell.
for i, td in enumerate(lst[:-1]):
    # strip() tolerates stray whitespace around the label text
    if td.text.strip() in labels:
        print(lst[i + 1].text.strip())
The other small modification I made is storing the result of div.find_all('td') in a variable, lst, so it can be indexed to reach the cell that follows each label.
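If you want the values back as data rather than printed, the same label-then-value lookup can be pulled into a small helper. This is only a sketch: extract_fields is a name I made up, and it assumes you have already reduced the page to a flat list of cell texts in document order. The sample cells below are copied from the question, not fetched from the site.

```python
def extract_fields(cell_texts, labels):
    """Map each label to the text of the cell that follows it.

    cell_texts: cell strings in document order (hypothetical input,
    e.g. the text of every <td> on the page).
    labels: the label strings to look for, e.g. "Primary Industry:".
    """
    wanted = set(labels)
    found = {}
    # Pair every cell with its successor; a label's value is the next cell.
    for current, following in zip(cell_texts, cell_texts[1:]):
        key = current.strip()
        # Keep only the first match for each label.
        if key in wanted and key not in found:
            found[key] = following.strip()
    return found


# Sample data taken from the question, standing in for the scraped cells.
cells = [
    "Business Description:", "Exporters (Whls)",
    "Primary Industry:", "Other Miscellaneous Durable Goods Merchant Wholesalers",
    "Related Industry:", "Sporting and Athletic Goods Manufacturing",
]
labels = ["Business Description:", "Primary Industry:", "Related Industry:"]
result = extract_fields(cells, labels)
for label in labels:
    print(label, result.get(label))
```

With BeautifulSoup, the input list could be built as [td.get_text(strip=True) for td in div.find_all('td')]. A label that is the last cell, or that never appears, is simply left out of the result, so result.get(label) returns None instead of raising an error.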
Hope it helps!