Parsing EDGAR filings
Problem description
I would like to use Python 2.7 to remove anything that isn't document text from EDGAR filings (which are available online as .txt files). An example of what the files look like is here:
EDGAR provides its Document Type Definitions starting on page 48 of this file:
The first part of my program gets the .txt file from the EDGAR online database into a local file that I've named "parseme.txt". What I would like to know is how to use the DTD to parse the .txt file. I would use a canned parsing module like BeautifulSoup for the job, but EDGAR's format appears unique, and I hope to avoid a large regex to get the job done.
import os

filename = 'parseme.txt'
# Read the downloaded filing into a list of lines.
with open(filename) as f:
    lines = f.readlines()
My question is related to the questions "Parse SGML with Open Arbitrary Tags in Python 3" and "Use lxml to parse text file with bad header in Python", but I believe it is distinct: my question relates to Python 2.7, and I'm not concerned with the header. I'm only concerned with the text of the file.
The pysec project looks promising. It's a basic Django app that downloads the EDGAR index and then lets you download specific filings and extract financial parameters from the XBRL.
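If only the document text is needed, a lighter-weight route than a full DTD-driven parse may be enough. The sketch below is not taken from pysec; it assumes the usual layout of a full-submission .txt file, where each document sits in a `<DOCUMENT>` block, a `<TYPE>` line names the form or exhibit, and the body is wrapped in `<TEXT>`...`</TEXT>`. The names `extract_documents` and `strip_tags` are made up for illustration. It uses two small regular expressions to find the blocks and the standard-library HTML parser to discard markup inside them.

```python
import re

try:                       # Python 3
    from html.parser import HTMLParser
except ImportError:        # Python 2.7
    from HTMLParser import HTMLParser

# Assumed layout: <DOCUMENT> ... <TYPE>10-K ... <TEXT> body </TEXT> </DOCUMENT>
DOC_RE = re.compile(r'<DOCUMENT>(.*?)</DOCUMENT>', re.DOTALL | re.IGNORECASE)
TYPE_RE = re.compile(r'<TYPE>([^\r\n<]+)', re.IGNORECASE)
TEXT_RE = re.compile(r'<TEXT>(.*?)</TEXT>', re.DOTALL | re.IGNORECASE)


class _TagStripper(HTMLParser):
    """Collects character data and discards every tag."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def strip_tags(markup):
    """Return the text content of an HTML/SGML fragment.

    Note: on Python 2.7 this sketch drops character references such as
    &amp; because they are not routed through handle_data there.
    """
    stripper = _TagStripper()
    stripper.feed(markup)
    stripper.close()
    return ''.join(stripper.chunks)


def extract_documents(raw):
    """Return a list of (type, text) pairs, one per <DOCUMENT> block."""
    results = []
    for doc in DOC_RE.findall(raw):
        text_match = TEXT_RE.search(doc)
        if text_match is None:
            continue  # no body to extract
        type_match = TYPE_RE.search(doc)
        doc_type = type_match.group(1).strip() if type_match else ''
        results.append((doc_type, strip_tags(text_match.group(1)).strip()))
    return results
```

With the file from the question, this could be called as `extract_documents(open('parseme.txt').read())`. Filings can also embed encoded binary attachments inside `<TEXT>`, so filtering the result on the returned type (for example keeping only `10-K`) is probably worthwhile before using the text.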