BeautifulSoup: Scraping answers from form

Question

I need to scrape the answers to the questions from the following link, including the check boxes.

Here's what I have so far:

from bs4 import BeautifulSoup
import selenium.webdriver as webdriver

url = 'https://www.adviserinfo.sec.gov/IAPD/content/viewform/adv/Sections/iapd_AdvPrivateFundReportingSection.aspx?ORG_PK=161227&FLNG_PK=05C43A1A0008018C026407B10062D49D056C8CC0'

driver = webdriver.Firefox()
driver.get(url)

soup = BeautifulSoup(driver.page_source, "lxml")   # name the parser explicitly

The following gives me all the written answers, if there are any:

soup.find_all('span', {'class':'PrintHistRed'})

and I think I can piece together all the checkbox answers from this:

soup.find_all('img')

but these aren't going to be ordered correctly, because this doesn't pick up the "No Information Filed" answers that aren't written in red.

I also feel like there's a much better way to be doing this. Ideally I want (for the first 6 questions) to return:

['APEX INVESTMENT FUND, V, L.P',
 '805-2054766781',
 'Delaware',
 'United States',
 'APEX MANAGEMENT V, LLC',
 'X',
 'O',
 'No Information Filed',
 'NO',
 'NO']

Edit

Martin's answer below seems to do the trick, however when I put it in a loop, the results begin to change after the 3rd iteration. Any ideas how to fix this?

from bs4 import BeautifulSoup
import requests
import re

for x in range(5):
    url = 'https://www.adviserinfo.sec.gov/IAPD/content/viewform/adv/Sections/iapd_AdvPrivateFundReportingSection.aspx?ORG_PK=161227&FLNG_PK=05C43A1A0008018C026407B10062D49D056C8CC0'
    html = requests.get(url)
    soup = BeautifulSoup(html.text, "lxml")

    tags = list(soup.find_all('span', {'class':'PrintHistRed'}))
    tags.extend(list(soup.find_all('img', alt=re.compile('Radio|Checkbox')))[2:])       # 2: skip "are you an adviser" at the top
    tags.extend([t.parent for t in soup.find_all(text="No Information Filed")])

    output = []

    for entry in sorted(tags):
        if entry.name == 'img':
            alt = entry['alt']
            if 'Radio' in alt:
                output.append('NO' if 'not selected' in alt else 'YES')
            else:
                output.append('O' if 'not checked' in alt else 'X')
        else:
            output.append(entry.text)

    print output[:9] 

Recommended answer

The website does not generate any of the required HTML via JavaScript, so I have chosen to use just requests to get the HTML (which should be faster).

One approach to solving your problem is to store all the tags for your three different types into a single array. If this is then sorted, it will result in the tags being in tree order.

The first search simply uses your PrintHistRed to get the matching span tags. Secondly it finds all img tags that have alt text containing either the word Radio or Checkbox. Lastly it searches for all locations where No Information Filed is found and returns the parent tag.
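As a minimal sketch of those three searches, the snippet below runs them against a small hand-written HTML fragment rather than the live page. The fragment, its alt texts and the variable names are assumptions made for illustration, and the real page's markup may differ. It uses Python's built-in html.parser so that no extra parser has to be installed, and it passes string= (the current name of the text= argument) for the exact text match.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical fragment modelled on the kinds of tags described above.
SAMPLE_HTML = """
<table>
  <tr><td>Name: <span class="PrintHistRed">EXAMPLE FUND, L.P.</span></td></tr>
  <tr><td><img alt="Checkbox checked"/> Master fund</td></tr>
  <tr><td><img alt="Radio button not selected"/> Yes</td></tr>
  <tr><td><img alt="Logo"/></td></tr>
  <tr><td>No Information Filed</td></tr>
</table>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# 1. Written answers: span tags carrying the PrintHistRed class.
spans = soup.find_all("span", {"class": "PrintHistRed"})

# 2. Tick-box answers: img tags whose alt text mentions Radio or Checkbox.
images = soup.find_all("img", alt=re.compile("Radio|Checkbox"))

# 3. Unanswered questions: the parent tag of each exact text match.
unanswered = [t.parent for t in soup.find_all(string="No Information Filed")]

print([s.text for s in spans])
print([i["alt"] for i in images])
print([u.name for u in unanswered])
```

Each search returns its own matches in document order, but the three lists are separate, which is why they still need to be merged into one ordered sequence.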

The tags can now be sorted and a suitable output array built containing the information in the required format:

from bs4 import BeautifulSoup
import requests
import re

url = 'https://www.adviserinfo.sec.gov/IAPD/content/viewform/adv/Sections/iapd_AdvPrivateFundReportingSection.aspx?ORG_PK=161227&FLNG_PK=05C43A1A0008018C026407B10062D49D056C8CC0'
html = requests.get(url)
soup = BeautifulSoup(html.text, "lxml")

tags = list(soup.find_all('span', {'class':'PrintHistRed'}))
tags.extend(list(soup.find_all('img', alt=re.compile('Radio|Checkbox')))[2:])       # 2: skip "are you an adviser" at the top
tags.extend([t.parent for t in soup.find_all(text="No Information Filed")])

output = []

for entry in sorted(tags):
    if entry.name == 'img':
        alt = entry['alt']
        if 'Radio' in alt:
            output.append('NO' if 'not selected' in alt else 'YES')
        else:
            output.append('O' if 'not checked' in alt else 'X')
    else:
        output.append(entry.text)

print output[:9]        # Display the first 9 entries

This gives:

[u'APEX INVESTMENT FUND V, L.P.', u'805-2054766781', u'Delaware', u'United States', 'X', 'O', u'No Information Filed', 'NO', 'YES']
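On the loop problem raised in the question's edit: one possible cause, offered here as an assumption rather than something confirmed in the thread, is that sorted(tags) relies on how Python 2 orders arbitrary objects, which is not guaranteed to match document order. On Python 3, Tag objects are not expected to be orderable at all, so the same call would likely raise a TypeError. A sketch of an alternative is to make a single find_all call with a filter function, since find_all returns its matches in document order. The helper names is_answer and extract_answers and the sample HTML are hypothetical.

```python
import re
from bs4 import BeautifulSoup
from bs4.element import NavigableString

ALT_PATTERN = re.compile("Radio|Checkbox")
PLACEHOLDER = "No Information Filed"


def is_answer(tag):
    """Return True for any of the three kinds of answer tag."""
    if tag.name == "span" and "PrintHistRed" in tag.get("class", []):
        return True
    if tag.name == "img" and ALT_PATTERN.search(tag.get("alt", "")):
        return True
    # Only the direct parent of the placeholder text matches, not its ancestors.
    return any(
        isinstance(child, NavigableString) and child.strip() == PLACEHOLDER
        for child in tag.contents
    )


def extract_answers(html):
    """Collect answers in document order with one pass over the tree."""
    soup = BeautifulSoup(html, "html.parser")
    output = []
    for tag in soup.find_all(is_answer):
        if tag.name == "img":
            alt = tag["alt"]
            if "Radio" in alt:
                output.append("NO" if "not selected" in alt else "YES")
            else:
                output.append("O" if "not checked" in alt else "X")
        else:
            output.append(tag.get_text(strip=True))
    return output


# Hypothetical fragment; the alt texts mirror the substrings tested above.
SAMPLE_HTML = """
<table>
  <tr><td><span class="PrintHistRed">EXAMPLE FUND, L.P.</span></td></tr>
  <tr><td><img alt="Checkbox checked"/> <img alt="Checkbox not checked"/></td></tr>
  <tr><td>No Information Filed</td></tr>
  <tr><td><img alt="Radio button not selected"/> Yes</td></tr>
  <tr><td><span class="PrintHistRed">Delaware</span></td></tr>
</table>
"""

print(extract_answers(SAMPLE_HTML))
# Expected for this fragment:
# ['EXAMPLE FUND, L.P.', 'X', 'O', 'No Information Filed', 'NO', 'Delaware']
```

Because the order comes from the tree walk itself, repeated runs over the same HTML should give the same list. Note that this sketch does not skip the two leading images that the code above drops with [2:], so that step would still need handling on the real page.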