How to extract a table from an SEC N-Q document using BeautifulSoup


Question

(python 2.7, BeautifulSoup4)

I am trying to extract the table contents from SEC N-Q documents. Sample html here: https://www.sec.gov/Archives/edgar/data/36405/000093247115006447/indexfunds_final.htm

The file has no tag at all. I want to search for the section 'C. Futures Contract', find the next <table> after it, and extract the contents of its <tr> rows. There are also multiple 'C. Futures Contract' occurrences in one document.

I've tried the following code but got nothing.

import requests, re
from bs4 import BeautifulSoup
r = requests.get("https://www.sec.gov/Archives/edgar/data/36405/000093247115006447/indexfunds_final.htm")
futures = soup.find_all(re.compile('C. Futures Contract'))
print futures

[]

Recommended answer

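First of all, if you are searching by text, use the text argument (starting from bs 4.4.0 the argument is named string). A pattern passed as the first positional argument to find_all() is matched against tag names, not against text, which is why the original call returned an empty list.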
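
The difference between matching tag names and matching text can be checked offline. The snippet below is a minimal sketch, assuming Python 3 and bs4 4.4.0 or later (for the string keyword); the HTML fragment is made up for illustration and is not taken from the SEC filing.

```python
import re

from bs4 import BeautifulSoup

# Made-up fragment for illustration; not taken from the SEC filing.
html = "<p>C. Futures Contracts</p><table><tr><td>x</td></tr></table>"
soup = BeautifulSoup(html, "html.parser")

pattern = re.compile(r"C\. Futures Contract")

# First positional argument: the pattern is tested against tag names
# (p, table, tr, td), so nothing matches.
by_name = soup.find_all(pattern)

# string= keyword: the pattern is tested against text nodes instead.
by_text = soup.find_all(string=pattern)

print(by_name)
print(by_text)
```

On bs4 versions older than 4.4.0, the same search would be written with text= in place of string=.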

Aside from that, for every futures section, use find_next() to find the next table element.

Working code:

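import re

import requests
from bs4 import BeautifulSoup

response = requests.get("https://www.sec.gov/Archives/edgar/data/36405/000093247115006447/indexfunds_final.htm")
# Name the parser explicitly so the result does not depend on which parsers are installed
soup = BeautifulSoup(response.content, "html.parser")

# text= matches text nodes rather than tag names; the dot is escaped to match a literal "."
futures = soup.find_all(text=re.compile(r'C\. Futures Contract'))
for future in futures:
    # find_next() returns the first <table> that follows the matched text
    for row in future.find_next("table").find_all("tr"):
        # print() with a single argument works on both Python 2.7 and Python 3
        print([cell.get_text(strip=True) for cell in row.find_all("td")])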
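
To try the same approach without downloading the filing, the logic can be wrapped in a function and run against a small inline document. This is only a sketch under stated assumptions: Python 3, bs4 4.4.0 or later, a hypothetical helper name extract_futures_tables, and invented sample HTML (the contract names and numbers are placeholders, not data from the N-Q document).

```python
import re

from bs4 import BeautifulSoup

# Invented sample with two "C. Futures Contracts" sections; placeholder data only.
SAMPLE_HTML = """
<p>C. Futures Contracts: first fund</p>
<p>Some explanatory text.</p>
<table>
  <tr><td>Contract A</td><td>10</td></tr>
  <tr><td>Contract B</td><td>25</td></tr>
</table>
<p>D. Other section</p>
<table><tr><td>not wanted</td></tr></table>
<p>C. Futures Contracts: second fund</p>
<table>
  <tr><td>Contract C</td><td>7</td></tr>
</table>
"""


def extract_futures_tables(html):
    """Return one list of rows per 'C. Futures Contract' heading (hypothetical helper)."""
    soup = BeautifulSoup(html, "html.parser")
    headings = soup.find_all(string=re.compile(r"C\. Futures Contract"))
    tables = []
    for heading in headings:
        # First <table> that appears after the heading text in document order.
        table = heading.find_next("table")
        if table is None:
            continue  # heading with no table after it
        rows = []
        for tr in table.find_all("tr"):
            cells = [td.get_text(strip=True) for td in tr.find_all("td")]
            if cells:  # skip rows that contain no <td> cells
                rows.append(cells)
        tables.append(rows)
    return tables


tables = extract_futures_tables(SAMPLE_HTML)
for rows in tables:
    print(rows)
```

Note that if two matching headings are followed by the same table, that table would be returned twice; the real filing may need an extra check for that case.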
