Unable to grab a field when web scraping with Python


Question

I'm trying to grab all the company names (highlighted) from the site below. This is my first web scraping project, so I'm trying to understand why I can't get the company names even though I believe I have the right parameters in place:

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.76 Safari/537.36'} # This is chrome, you can set whatever browser you like
request = requests.get("https://www.hispanicmeetings.org", verify=False, headers=headers)
soup = bs4.BeautifulSoup(request.content)
soup.find_all("a href") # this is not getting me company names
soup.find_all('alt') #this either

I tried many small variations based on the HTML tags I found in the web page, but nothing seems to work. Any suggestion for collecting all the company names in one place would mean a lot to me.

Recommended answer

You are not referencing the correct tags and attributes with BeautifulSoup. I'd suggest finding a short tutorial on HTML to understand tags and attributes, then looking at how to select them with bs4. From there you can see how to pull out tags and, from those tags, the text and/or attribute values. Try the code below:

import requests
import bs4

# Browser-like User-Agent (Chrome here); any browser string can be used.
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.76 Safari/537.36'}
# verify=False skips TLS certificate verification.
response = requests.get("https://www.cloudtango.org", verify=False, headers=headers)
soup = bs4.BeautifulSoup(response.content, 'html.parser')

# Each company sits in a <td class="company"> cell.
cells = soup.find_all('td', {'class': 'company'})

for cell in cells:
    # The company name is stored in the alt attribute of the <img> inside the cell.
    print(cell.find('img')['alt'])

Output:

Managed Solution
Redcentric
First Focus
K3 Technology
ICC Managed Services
AffinityMSP
BCA IT, Inc.
CloudCoCo Plc (formerly Adept4 PLC)
SCC
Datacom Systems
Compugen
Cancom
All Covered
Computacenter
q.beyond AG
Atos
Controlware GmbH Firmenzentrale
Trustmarque
Bytes
AHEAD
ACP IT Solutions GmbH
PROFI Engineering Systems AG
PQR
Orbit GmbH
SVA System Vertrieb Alexander GmbH
Ensono
Phoenix Software Ltd
Atea Norge AS
Axians
Kick ICT Group
Atea Sverige AB
Catapult Systems LLC
Valid
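
To see why the calls in the question return nothing, here is a minimal offline sketch. The HTML string is hypothetical, made up only to resemble a table of company logos, and the variable names are for illustration; the real page's markup may differ. It shows that `find_all` matches tag names (so `"a href"` and `'alt'` find nothing), and how to select by tag and attribute instead, either with `find_all` or with a CSS selector via `select`.

```python
import bs4

# Hypothetical markup for illustration only; not copied from the real site.
html = """
<table>
  <tr><td class="company"><a href="/acme"><img src="a.png" alt="Acme Corp"></a></td></tr>
  <tr><td class="company"><a href="/globex"><img src="g.png" alt="Globex"></a></td></tr>
  <tr><td class="other"><img src="x.png" alt="Not a company"></td></tr>
</table>
"""

soup = bs4.BeautifulSoup(html, "html.parser")

# find_all() takes a tag NAME. There is no tag called "a href" or "alt",
# so both calls return an empty result.
no_match_1 = soup.find_all("a href")
no_match_2 = soup.find_all("alt")

# Select <a> tags that carry an href attribute, then read the attribute value.
links = [a["href"] for a in soup.find_all("a", href=True)]

# Select <td class="company"> cells, then read the alt attribute of each <img> inside.
names = [
    img["alt"]
    for td in soup.find_all("td", class_="company")
    for img in td.find_all("img", alt=True)
]

# The same selection expressed as a CSS selector.
names_css = [img["alt"] for img in soup.select("td.company img[alt]")]

print(no_match_1, no_match_2)
print(links)
print(names)
print(names_css)
```

Filtering with `alt=True` (or `img[alt]` in the selector) skips images that have no `alt` attribute, which avoids a `KeyError` when reading `img["alt"]`.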
