Text Scraping (from EDGAR 10K Amazon) code not working

Problem description

I have the code below to count occurrences of a specific list of words in a financial statement (US SEC EDGAR 10-K) text file. I would highly appreciate it if anyone could help me with this. I have manually cross-checked and found the words in the document, but my code is not finding any of them. I am using Python 3.5.3. Thanks in advance.

#!/usr/bin/python
# -*- coding: utf-8 -*-
import urllib.request as urllib2
import time
import csv
import sys

CIK = '0001018724'
Year = '2013'
string_match1 = 'edgar/data/1018724/0001193125-13-028520.txt'
url3 = 'https://www.sec.gov/Archives/' + string_match1
response3 = urllib2.urlopen(url3)
words = [
    'anticipate',
    'believe',
    'depend',
    'fluctuate',
    'indefinite',
    'likelihood',
    'possible',
    'predict',
    'risk',
    'uncertain',
    ]
count = {}  # is a dictionary data structure in Python
for elem in words:
    count[elem] = 0
for line in response3:
    elements = line.split()
    for word in words:
        count[word] = count[word] + elements.count(word)
print(CIK)
print(Year)
print(url3)
print(count)

This is the script output:

0001018724
2013
https://www.sec.gov/Archives/edgar/data/1018724/0001193125-13-028520.txt

{
    'believe': 0,
    'likelihood': 0,
    'anticipate': 0,
    'fluctuate': 0,
    'predict': 0,
    'risk': 0,
    'possible': 0,
    'indefinite': 0,
    'depend': 0,
    'uncertain': 0,
}

Recommended answer

A simplified version of your code seems to work in Python 3.7 with the requests library:

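import requests

url = 'https://www.sec.gov/Archives/edgar/data/1018724/0001193125-13-028520.txt'
response = requests.get(url)

# Same word list as in the question
words = ['anticipate', 'believe', 'depend', 'fluctuate', 'indefinite',
         'likelihood', 'possible', 'predict', 'risk', 'uncertain']

# str() of the downloaded bytes gives a str (the "b'...'" repr text),
# so the str search terms below can match; do this once, outside the loop
info = str(response.content)

count = {}  # maps each word to its number of occurrences
for elem in words:
    # str.count counts substrings, so 'risk' also matches 'risks'
    count[elem] = info.count(elem)

print(count)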

Output:

{'anticipate': 9, 'believe': 32, 'depend': 39, 'fluctuate': 4, 'indefinite': 15,
 'likelihood': 15, 'possible': 25, 'predict': 6, 'risk': 55, 'uncertain': 38}
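
A likely reason the script in the question reports zero everywhere: in Python 3, iterating over the object returned by urllib.request.urlopen yields bytes lines, so line.split() produces bytes tokens, and list.count with a str argument never matches them. Note also that the answer's code counts substrings, while the question counted whole whitespace-separated tokens, so the two approaches will not give the same numbers. The sketch below decodes each line and counts whole words case-insensitively. The function name count_words and the sample lines are hypothetical, and decoding as UTF-8 with replacement is an assumption about the filing's encoding.

```python
import re
from collections import Counter


def count_words(lines, words, encoding="utf-8"):
    """Count whole-word, case-insensitive matches of `words` in byte lines.

    `words` is expected to be lowercase. The encoding is an assumption.
    """
    targets = set(words)
    counts = Counter({w: 0 for w in words})  # keep zero entries visible
    for raw in lines:
        # Decode bytes to str first; comparing bytes to str never matches
        text = raw.decode(encoding, errors="replace").lower()
        for token in re.findall(r"[a-z]+", text):
            if token in targets:
                counts[token] += 1
    return dict(counts)


# Hypothetical sample lines standing in for the downloaded filing
sample = [
    b"We believe the Risk is possible.\n",
    b"Risks are uncertain; we believe so.\n",
]

# The mismatch from the question: a bytes token never equals a str
print(b"risk".split().count("risk"))  # prints 0

result = count_words(sample, ["believe", "risk", "possible", "uncertain"])
print(result)
```

To run this against the real filing, the response object from urlopen could be passed as lines, or response.content.splitlines() when using requests. Because only whole words are counted, 'Risks' in the sample is not counted as 'risk'; whether plurals and other word forms should count depends on the analysis being done.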
