Word count from a web text document results in 0


Problem description

I tried the Python code from Rasha Ashraf's article "Scraping EDGAR with Python". It uses urllib2, which I believe is no longer available in Python 3, so I changed it to urllib.

I can fetch the following EDGAR web page. However, the word counts come out as 0 no matter how I try to fix the code. Please help me fix this. FYI, I manually checked the page at that URL and confirmed that "ADDRESS", "TYPE", and "transaction" occur 5, 9, and 49 times respectively. Nevertheless, my faulty Python code reports 0 for all three words.

Here is Rasha Ashraf's Python code as amended by me (only the urllib part and the web URL). The original URL points to a very large text file, so I switched to a simpler page.

import time
import csv
import sys

CIK = '0001018724'
Year= '2013'
string_match1= 'edgar/data/1018724/000112760220028651/0001127602-20-028651.txt'
url3= 'http://www.sec.gov/Archives/'+string_match1

import urllib.request
 
response3= urllib.request.urlopen(url3)
#output = response3.read()
#print(output)
words=  ['ADDRESS','TYPE', 'transaction']
count= {}
for elem in words:
    count[elem]= 0
    
for line in response3:
    elements= line.split()
    for word in words:
       count[word]= count[word] + elements.count(word)

print (CIK)
print (Year)
print (url3)
print (count)

=> The result of my code so far:

0001018724

2013

http://www.sec.gov/Archives/edgar/data/1018724/000112760220028651/0001127602-20-028651.txt

{'ADDRESS': 0, 'TYPE': 0, 'transaction': 0}

Recommended answer

To get the correct count of the number of times each of your 3 strings (not words!) appears in the filing, try something like this:

import requests
url = "http://www.sec.gov/Archives/edgar/data/1018724/000112760220028651/0001127602-20-028651.txt"
req = requests.get(url)

words = ['address','type','transaction']
filing = req.text
for word in words:
    print(word,': ',filing.lower().count(word))

Output:

address :  5
type :  9
transaction :  49
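
As for why the original loop printed 0: in Python 3, iterating over the object returned by urllib.request.urlopen yields bytes lines, so line.split() produces bytes tokens, and a list of bytes never contains the str 'ADDRESS'. The following is a minimal sketch of the same substring count done on decoded lines. The helper name count_terms, the sample lines, and the UTF-8 encoding are assumptions made for illustration and are not part of the original code or the real filing.

```python
def count_terms(byte_lines, terms, encoding="utf-8"):
    """Count case-insensitive substring occurrences of each term
    in an iterable of bytes lines (hypothetical helper)."""
    counts = {term: 0 for term in terms}
    for raw in byte_lines:
        # Decode bytes to str before comparing against str terms.
        # The encoding is an assumption; undecodable bytes are replaced.
        text = raw.decode(encoding, errors="replace").lower()
        for term in terms:
            counts[term] += text.count(term.lower())
    return counts


# Made-up sample standing in for the lines that urlopen() would yield.
sample = [
    b"<TYPE>4\n",
    b"BUSINESS ADDRESS:\n",
    b"MAIL ADDRESS:\n",
    b"transactionDate transactionCode\n",
]

# The approach from the question: bytes tokens never equal a str term.
naive = sum(line.split().count("TYPE") for line in sample)

result = count_terms(sample, ["ADDRESS", "TYPE", "transaction"])
print(naive)
print(result)
```

With urllib, the response object could presumably be passed in directly, for example count_terms(urllib.request.urlopen(url3), words), since it is iterable line by line. Note that this counts substrings within each line, so a term split across a line break would be missed; the requests version above avoids that by counting over the whole text at once.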
