Parsing EDGAR filings

Problem description

I would like to use Python 2.7 to strip everything that isn't document text from EDGAR filings (which are available online as .txt files). An example of what the files look like is here:

Example

EDGAR provides its Document Type Definitions starting on page 48 of this file:

DTD

The first part of my program downloads the .txt file from the EDGAR online database into a local file that I've named "parseme.txt". What I would like to know is how to use the DTD to parse that file. I would normally use an off-the-shelf parsing module such as BeautifulSoup for the job, but EDGAR's format appears to be unique, and I hope to avoid writing a large regex.

import os
filename = 'parseme.txt'
with open(filename) as f:
    lines = f.readlines()

My question is related to "Parse SGML with Open Arbitrary Tags in Python 3" and "Use lxml to parse text file with bad header in Python", but I believe it is distinct: my question concerns Python 2.7, and I'm not concerned with the header, only with the text of the file.

Solution

The pysec project looks promising. It's a basic Django app that downloads the EDGAR index and then lets you download specific filings and extract financial parameters from the XBRL.
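
Independent of pysec, if only the document text is needed, a small line-based scan may be enough and avoids both a large regex and a full SGML parser. The sketch below is not taken from pysec or from the EDGAR DTD. It assumes that each document in the submission file is wrapped in <DOCUMENT>...</DOCUMENT>, that its body sits between <TEXT> and </TEXT>, and that these tags and the <TYPE> tag each start on their own line. The function name extract_document_texts is hypothetical.

```python
def extract_document_texts(lines):
    """Collect the body of every <TEXT>...</TEXT> block.

    Returns a list of (doc_type, text) pairs, where doc_type is the value
    of the most recent <TYPE> line (or None if none was seen).
    Assumption: <TYPE>, <TEXT>, </TEXT> and </DOCUMENT> each start a line.
    """
    documents = []
    doc_type = None
    body = None  # list of lines while inside <TEXT>, otherwise None

    for line in lines:
        tag = line.strip().upper()
        if body is not None:
            if tag == '</TEXT>':
                # End of the body: store it and leave "inside text" mode.
                documents.append((doc_type, ''.join(body)))
                body = None
            else:
                body.append(line)
        elif tag.startswith('<TYPE>'):
            # Header tags such as <TYPE> carry their value on the same line.
            doc_type = line.strip()[len('<TYPE>'):].strip()
        elif tag == '<TEXT>':
            body = []
        elif tag == '</DOCUMENT>':
            doc_type = None

    return documents
```

Passing the `lines` list read from parseme.txt to extract_document_texts(lines) yields one pair per document. The code uses nothing specific to Python 3, so it should also run under Python 2.7. The extracted body may itself be HTML, in which case it can be handed to BeautifulSoup to strip the remaining markup.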
