Scrape America's Career InfoNet



Problem description

I have employer IDs, which can be used to get the business area:

https://www.careerinfonet.org/employ4.asp?emp_id=558742391

The HTML contains the data in tr/td tables:

    Business Description: Exporters (Whls)
    Primary Industry: Other Miscellaneous Durable Goods Merchant Wholesalers
    Related Industry: Sporting and Athletic Goods Manufacturing

So I would like to get

  • Exporters (Whls)
  • Other Miscellaneous Durable Goods Merchant Wholesalers
  • Sporting and Athletic Goods Manufacturing

My example code looks like this:

import requests
from bs4 import BeautifulSoup

page = requests.get("https://www.careerinfonet.org/employ4.asp?emp_id=558742391")
soup = BeautifulSoup(page.text, 'html.parser')

div = soup.find('td', class_='content')    
for td in div.find_all('td'):
    print(td.text)

Solution

I would like to preface this by saying that this technique is fairly sloppy, but it gets the job done, assuming each page you scrape has a similar setup.

Your code works well for accessing the page itself. I simply add a check on every element to determine whether it is the "Business Description:", "Primary Industry:", or "Related Industry:" label. When it is, the next td in the list holds the value you want, so you can access that element and use it.

import requests
from bs4 import BeautifulSoup

page = requests.get("https://www.careerinfonet.org/employ4.asp?emp_id=558742391")
soup = BeautifulSoup(page.text, 'html.parser')

# Labels whose following cell holds the value we want
labels = ("Business Description:", "Primary Industry:", "Related Industry:")

div = soup.find('td', class_='content')
lst = div.find_all('td')  # a list of cells, so the next cell can be reached by index
for i, td in enumerate(lst[:-1]):
    # strip() guards against stray whitespace around the label text
    if td.text.strip() in labels:
        print(lst[i + 1].text.strip())

The other small modification I made is storing the result of div.find_all('td') in a list, so that it can be indexed to reach the cell that follows each label.
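
If you need the values for more than one employer ID, the label-then-value lookup above could be wrapped in a small function. The following is only a sketch under a few assumptions: `employer_url` and `pair_labels` are hypothetical helper names, the cell texts are passed in as plain strings (for example `[td.text for td in lst]`), and the sample list is a stand-in for what the real page returns, which may differ.

```python
from urllib.parse import urlencode

BASE_URL = "https://www.careerinfonet.org/employ4.asp"  # URL pattern from the question
LABELS = ("Business Description:", "Primary Industry:", "Related Industry:")


def employer_url(emp_id):
    """Build the page URL for one employer ID (hypothetical helper)."""
    return BASE_URL + "?" + urlencode({"emp_id": emp_id})


def pair_labels(cell_texts, labels=LABELS):
    """Map each label to the text of the cell that follows it.

    cell_texts is a plain list of strings, e.g. [td.text for td in lst].
    """
    cells = [text.strip() for text in cell_texts]
    found = {}
    # Walk over (cell, next cell) pairs; a label's value is the next cell.
    for label, value in zip(cells, cells[1:]):
        if label in labels:
            found[label] = value
    return found


# Stand-in for the scraped cells; the real page layout is an assumption.
sample = [
    "Business Description:", " Exporters (Whls) ",
    "Primary Industry:", "Other Miscellaneous Durable Goods Merchant Wholesalers",
    "Related Industry:", "Sporting and Athletic Goods Manufacturing",
]
fields = pair_labels(sample)
print(fields["Business Description:"])
print(employer_url(558742391))
```

Returning a dict rather than printing keeps the three values tied to their labels, which is handy if a page happens to be missing one of them.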

Hope it helps!
