soup.findAll is not working for table

Question

I am trying to parse this site, https://www.dibbs.bsm.dla.mil/RFQ/RfqRecs.aspx?category=issue&TypeSrch=dt&Value=09-07-2017, using the following code:

from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup
import ssl
context = ssl._create_unverified_context()
dibbsurl = 'https://www.dibbs.bsm.dla.mil/RFQ/RfqRecs.aspx?category=issue&TypeSrch=dt&Value=09-07-2017'
uClient = uReq(dibbsurl, context=context)
dibbshtml = uClient.read()
uClient.close()

#html parser
dibbssoup = soup(dibbshtml, "html.parser")

#grabs each rfq
containers = dibbssoup.findAll("tr",{"Class":"Bgwhite"})

For research purposes, I want to grab the National Stock Number (NSN), the nomenclature, and the quantity (QTY) from each row of the table.

containers = dibbssoup.findAll("tr",{"Class":"Bgwhite"})

This line is meant to grab each row of the table, but containers comes back empty: len(containers) returns 0. Why are the rows not being matched, and how can I fix it?

Update: here is a sample of the HTML from the site.

<tr class="BgWhite">
    <td headers="th0" valign="top">
        1
    </td>
    <td headers="th1" style="width: 125px;" valign="top">
        <a href="https://www.dibbs.bsm.dla.mil/RFQ/RFQNsn.aspx?value=8465015550093&amp;category=issue&amp;Scope=" title="go to NSN view">8465-01-555-0093</a>
    </td>
    <td headers="th2" valign="top">
        SNAP LINK, RAPPELLER
    </td>
    <td headers="th3" valign="top">
        None
    </td>
    <td headers="th4" style="width: 150px;" valign="top">
        <a href="https://dibbs2.bsm.dla.mil/Downloads/RFQ/8/SPE1C117T2608.PDF" title="RFQ document" target="DIBBSDocuments">SPE1C1-17-T-2608</a><br>&nbsp;&nbsp;<span style="font-size: 9px; color: #505050;">» <a href="https://www.dibbs.bsm.dla.mil/rfq/rfqrec.aspx?sn=SPE1C117T2608" title="Package View" class="SubMenuLink">Package View</a></span><a href="https://www.dibbs.bsm.dla.mil/RFQ/RFQQHlp.aspx?ht=fi"><img src="https://www.dibbs.bsm.dla.mil/app_themes/images/icons/iconFastPace.gif" alt="Fast Award Candidate.  Micro-purchase quotes may be awarded prior to the solicitation return date.  See Master Solicitation for Additional Info" width="14" height="11" hspace="0" border="0" align="middle"></a><br><img src="https://www.dibbs.bsm.dla.mil/app_themes/images/icons/iconEproc.gif" width="36" height="16" hspace="1" border="0" alt="DLA E-Procurement" style="border-width:0px;  vertical-align: bottom;">
    </td>
    <td headers="th5" valign="top">
        <span style="color:#000099">Open</span><br><a href="https://www.dibbs.bsm.dla.mil/RA/Quote/QuoteFrm.aspx?sn=SPE1C117T2608"><img src="https://www.dibbs.bsm.dla.mil/app_themes/images/buttons/btnQ.gif" width="18" height="18" border="0" alt="Click to submit Quote" hspace="1" align="bottom"></a><a href="https://www.dibbs.bsm.dla.mil/RA/Quote/QuoteFrm.aspx?sn=SPE1C117T2608"><span style="font-size: 9px;">uote</span></a>&nbsp;&nbsp;<img src="https://www.dibbs.bsm.dla.mil/app_themes/images/icons/iconSpace1010.gif" alt=" " width="18" height="16" hspace="0" border="0">
    </td>
    <td headers="th6" valign="top">
        0070631319<br>QTY: 400
    </td>
    <td headers="th7" valign="top">
        09-07-2017
    </td>
    <td headers="th8" valign="top">
        09-18-2017
    </td>
</tr>

Recommended answer

I looked at the site you want to scrape. Before it shows any content, it serves a warning page, much like a terms-and-conditions notice, that you have to agree to, and agreeing means submitting a form. The solution therefore takes three requests: fetch the warning page, submit its form, then request the page that contains the table.
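As a rough illustration of the "agree" step, the sketch below collects the fields of a form so they can be posted back. The HTML is a made-up stand-in: the field names __VIEWSTATE and butAgree are assumptions, not taken from the real warning page. collect_form_fields is a hypothetical helper, not part of requests or BeautifulSoup.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the warning page. The real field names and
# values are not shown in this post and will differ.
SAMPLE_FORM = """
<form method="post" action="dodwarning.aspx">
  <input type="hidden" name="__VIEWSTATE" value="abc123">
  <input type="submit" name="butAgree" value="OK">
  <input type="text" id="no-name-here">
</form>
"""


def collect_form_fields(html):
    """Return {name: value} for every named <input> in the first <form>."""
    form = BeautifulSoup(html, "html.parser").find("form")
    if form is None:
        return {}
    fields = {}
    for inp in form.find_all("input"):
        name = inp.get("name")
        if name:  # inputs without a name are not submitted by a browser
            fields[name] = inp.get("value", "")
    return fields


payload = collect_form_fields(SAMPLE_FORM)
print(payload)  # {'__VIEWSTATE': 'abc123', 'butAgree': 'OK'}
```

Skipping nameless inputs keeps a None key out of the payload. The resulting dictionary is the kind of value that can be passed as data= to a session's post call.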

I used requests and html5lib in the script below because they are easy to use. You can install both with pip.

The last part, parsing the table, is similar to what you did, except that the selector spells the attribute name class and the value BgWhite exactly as they appear in the HTML.

import requests
from bs4 import BeautifulSoup
import urllib3

urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)

request_headers = {
    'Accept': '*/*',
    'Accept-Encoding': 'gzip, deflate, sdch',
    'Accept-Language': 'en-US,en;q=0.8',
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36',
}

req = requests.Session()
warning_url = 'https://www.dibbs.bsm.dla.mil/dodwarning.aspx'

# get initial warning page
get_warning_page = req.get(warning_url, headers=request_headers, verify=False)
warning_soup = BeautifulSoup(get_warning_page.content, 'html5lib')

# parse forms needed to be submitted later (T&C of the site that you need to agree before proceeding)
payload = {}
for inp in warning_soup.find('form').find_all('input'):
    payload[inp.get('name')] = inp.get('value')

# submit the warning form (means you already agreed on the T&C)
submit_warning_form = req.post(warning_url, headers=request_headers, data=payload, verify=False)

# lastly, navigate to the main page that contains the table
main_page = req.post('https://www.dibbs.bsm.dla.mil/RFQ/RfqRecs.aspx?category=issue&TypeSrch=dt&Value=09-07-2017', headers=request_headers, verify=False)

# parsing of table
dibbssoup = BeautifulSoup(main_page.content, 'html5lib')
#grabs each rfq
containers = dibbssoup.find_all("tr", {"class": "BgWhite"})

print(containers)
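Once the rows are matched, the three fields from the question can be pulled out of each one. The sketch below runs offline against a trimmed copy of the sample row from the question (links and images removed). It assumes the NSN sits in the cell with headers="th1", the nomenclature in "th2", and the quantity after "QTY:" in "th6", as in that sample. parse_row is a hypothetical helper name. It also shows why the selector from the question matches nothing even on the right page.

```python
import re

from bs4 import BeautifulSoup

# Trimmed copy of the sample row from the question.
SAMPLE_ROW = """
<tr class="BgWhite">
    <td headers="th0" valign="top">1</td>
    <td headers="th1" valign="top">
        <a href="#" title="go to NSN view">8465-01-555-0093</a>
    </td>
    <td headers="th2" valign="top">
        SNAP LINK, RAPPELLER
    </td>
    <td headers="th6" valign="top">
        0070631319<br>QTY: 400
    </td>
</tr>
"""

page = BeautifulSoup(SAMPLE_ROW, "html.parser")

# The spelling from the question matches nothing: the parser stores the
# attribute name as lowercase "class", and value matching is case-sensitive.
print(len(page.find_all("tr", {"Class": "Bgwhite"})))  # 0
print(len(page.find_all("tr", {"class": "BgWhite"})))  # 1


def parse_row(row):
    """Extract NSN, nomenclature, and quantity from one table row."""
    nsn = row.find("td", attrs={"headers": "th1"}).get_text(strip=True)
    name = row.find("td", attrs={"headers": "th2"}).get_text(strip=True)
    qty_text = row.find("td", attrs={"headers": "th6"}).get_text(" ", strip=True)
    match = re.search(r"QTY:\s*(\d+)", qty_text)
    qty = int(match.group(1)) if match else None
    return {"nsn": nsn, "nomenclature": name, "qty": qty}


rows = page.find_all("tr", {"class": "BgWhite"})
records = [parse_row(row) for row in rows]
print(records[0])
# {'nsn': '8465-01-555-0093', 'nomenclature': 'SNAP LINK, RAPPELLER', 'qty': 400}
```

Looking cells up by their headers attribute rather than by position is a design choice for this sketch. It keeps working if a column is added, but it depends on the site keeping those header ids stable.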

If you have any questions or run into errors, let me know. If this solved your issue, please mark it as the answer. Thanks!
