Extraction of text using Beautiful Soup and regular expressions in 10-K EDGAR filings


Question

I want to automatically extract section "1A. Risk Factors" from around 10,000 files and write it into txt files. A sample URL with a file can be found here

The desired section lies between "Item 1A Risk Factors" and "Item 1B". The catch is that 'item', '1a' and '1b' may look different across these files and may appear in multiple places, not only in the longest, proper section that interests me. Thus, some regular expressions should be used, so that:

  1. The longest part between "1a" and "1b" is extracted (otherwise the table of contents and other useless elements will appear)

  2. Different variants of the expressions are taken into consideration (see the sketch below)
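
One way to satisfy both points at once is a single case-insensitive, DOTALL pattern combined with keeping only the longest hit. A minimal sketch; the exact pattern is an assumption about how the headings vary, not something given in the question:

import re

# Tolerates "Item 1A.", "ITEM 1A:", "Item\xa01A" and similar variants:
# re.IGNORECASE covers the casing and \s* also matches non-breaking spaces.
ITEM_1A = re.compile(r'item\s*1a\b.*?item\s*1b\b', re.IGNORECASE | re.DOTALL)

def longest_section(text: str) -> str:
    # findall returns every "1a ... 1b" span; the table-of-contents hit is
    # short while the real Risk Factors section is long, so keep the longest.
    return max(ITEM_1A.findall(text), default='', key=len)

Note the non-greedy .*? in the pattern: each match then runs from one "Item 1A" only to the next "Item 1B", so the table-of-contents entry and the real section come back as separate candidates instead of one giant span.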

I tried to implement these two goals in the script, but as it's my first project in Python, I just randomly ordered expressions that I thought might work, and apparently they are in the wrong order (I'm sure I should iterate over the "<a>" elements, add each extracted "section" to a list, then choose the longest one and write it to a file, though I don't know how to implement this idea). Currently my method returns very little data between 1a and 1b (I think it's a page number) from the table of contents and then it stops... (?)

My code:

import requests
import re
import csv

from bs4 import BeautifulSoup as bs

with open('indexes.csv', newline='') as csvfile:
    reader = csv.reader(csvfile, delimiter=',')
    for line in reader:
        # Build an output file name from the first four index columns,
        # stripping slashes and backslashes so it is a valid file name.
        fn1 = line[0]
        fn2 = re.sub(r'[/\\]', '', line[1])
        fn3 = re.sub(r'[/\\]', '', line[2])
        fn4 = line[3]
        saveas = '-'.join([fn1, fn2, fn3, fn4])
        f = open(saveas + ".txt", "w+", encoding="utf-8")  # opened but never written to
        url = 'https://www.sec.gov/Archives/' + line[4].strip()
        print(url)
        response = requests.get(url)
        soup = bs(response.content, 'html.parser')
        risks = soup.find_all('a')
        regexTxt = r'item[^a-zA-Z\n]*1a.*item[^a-zA-Z\n]*1b'
        for risk in risks:
            for i in risk.findAllNext():
                i.get_text()  # return value is discarded, so this call has no effect
                sections = re.findall(regexTxt, str(i), re.IGNORECASE | re.DOTALL)
                for section in sections:
                    clean = re.compile('<.*?>')  # strips leftover HTML tags
                    # section = re.sub(r'table of contents', '', section, flags=re.IGNORECASE)
                    # section = section.strip()
                    # section = re.sub('\s+', '', section).strip()
                    print(re.sub(clean, '', section))

The goal is to find the longest part between "1a" and "1b" (regardless of how exactly they look) in the current URL and write it to a file.
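
Putting the pieces together, the idea described above (collect every extracted section into a list, then choose the longest and write it out) could look roughly like this. A hedged sketch: the column layout of indexes.csv is taken from the question's code, everything else is an assumption:

import csv
import re
import requests

from bs4 import BeautifulSoup

ITEM_1A = re.compile(r'item\s*1a\b.*?item\s*1b\b', re.IGNORECASE | re.DOTALL)

with open('indexes.csv', newline='') as csvfile:
    for line in csv.reader(csvfile):
        url = 'https://www.sec.gov/Archives/' + line[4].strip()
        html = requests.get(url).text
        # Search the plain text of the whole filing instead of individual
        # <a> tags, so a match can span the entire Risk Factors section.
        text = BeautifulSoup(html, 'html.parser').get_text()
        sections = ITEM_1A.findall(text)      # all candidates, ToC hit included
        if sections:
            longest = max(sections, key=len)  # the proper, longest section
            with open(line[0] + '.txt', 'w', encoding='utf-8') as f:
                f.write(longest)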

Answer

In the end I used a CSV file that contains a column HTMURL, which is the link to the htm-format 10-K. I got it from Kai Chen, who created this website. I wrote a simple script that writes pure txt into files. Processing it will be a simple task now.

import requests
import csv
from pathlib import Path

from bs4 import BeautifulSoup

with open('index.csv', newline='') as csvfile:
    reader = csv.reader(csvfile, delimiter=',')
    for line in reader:
        url = line[9]  # HTMURL column: link to the htm-format 10-K
        print(url)
        html_doc = requests.get(url).text
        soup = BeautifulSoup(html_doc, 'html.parser')
        # Strip the /DE/, /PA/ state-of-incorporation suffixes before the
        # remaining slashes, otherwise those replacements can never match.
        name = line[1].replace("/DE/", "").replace("/PA/", "").replace('/', '')
        path = Path(name + line[4] + ".txt")
        if path.exists():
            continue  # skip filings that have already been downloaded
        with open(path, "w+", encoding="utf-8") as f:
            f.write(soup.get_text())
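
The deferred processing step can then reuse the longest-match idea from the question on the saved txt files. Again a sketch, under the assumption that the files sit in the working directory and the output folder name is hypothetical:

import re
from pathlib import Path

ITEM_1A = re.compile(r'item\s*1a\b.*?item\s*1b\b', re.IGNORECASE | re.DOTALL)

out = Path('risk_factors')  # hypothetical output directory
out.mkdir(exist_ok=True)
for txt in Path('.').glob('*.txt'):
    sections = ITEM_1A.findall(txt.read_text(encoding='utf-8'))
    if sections:
        # The longest candidate is the Risk Factors section itself.
        (out / txt.name).write_text(max(sections, key=len), encoding='utf-8')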
