Facing issue with "wget" in Python


Problem description

I am very new to Python. I am facing an issue with wget as well as with urllib.urlretrieve(str(myurl), tail).

When I run the script it downloads the files, but the filenames end with "?".

My complete code:

import os
import wget
import urllib
import subprocess
with open('/var/log/na/na.access.log') as infile, open('/tmp/reddy_log.txt', 'w') as outfile:
    results = set()
    for line in infile:
        if ' 200 ' in line:
            tokens = line.split()
            results.add(tokens[6]) # 7th token
    for result in sorted(results):
        print >>outfile, result
with open('/tmp/reddy_log.txt') as infile:
    results = set()
    for line in infile:
        head, tail = os.path.split(line)
        print tail
        myurl = "http://data.xyz.com" + str(line)
        print myurl
        wget.download(str(myurl))
        # urllib.urlretrieve(str(myurl), tail)

output :

# python last.py
0011400026_recap.xml

http://data.na.com/feeds/mobile/android/v2.0/video/games/high/0011400026_recap.xml

latest_1.xml

http://data.na.com/feeds/mobile/iphone/article/league/news/latest_1.xml

currenttime.js

Listing the files :

# ls
0011400026_recap.xml?                   currenttime.js?  latest_1.xml?      today.xml?

Solution

A possible explanation of the behaviour you experience is that you do not sanitize your input line:

with open ('/tmp/reddy_log.txt') as infile:
     ...
     for line in infile:
         ...
         myurl = "http://data.xyz.com" + str(line)
         wget.download(str(myurl))

When you iterate over a file object (for line in infile:), each string you get is terminated by a newline ('\n') character. If you do not remove the newline before using line, it is still there in whatever you build from line, which here means the URL and the filename derived from it.
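A minimal sketch of the fix, assuming Python 3 (the question's code uses Python 2 print statements) and leaving out the actual wget.download call so that it runs offline. The helper name build_target is made up for illustration, and the base URL is the placeholder host from the question.

```python
import os

BASE_URL = "http://data.xyz.com"  # placeholder host taken from the question


def build_target(line, base_url=BASE_URL):
    """Return (url, filename) for one line read from the path list.

    build_target is a hypothetical helper, not part of wget or urllib.
    """
    path = line.strip()                # drop the trailing newline first
    url = base_url + path              # the URL no longer ends in a newline
    filename = os.path.split(path)[1]  # last path component, also clean
    return url, filename


raw_line = "/feeds/mobile/iphone/article/league/news/latest_1.xml\n"
url, filename = build_target(raw_line)
print(repr(url))
print(repr(filename))
```

In the original loop this amounts to making line = line.strip() the first statement inside for line in infile:, before line is used for anything else.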

To illustrate how the newline ends up in a filename, have a look at the transcript of a test I've done:

08:28 $ cat > a_file
a
b
c
08:29 $ cat > test.py
data = open('a_file')
for line in data:
    new_file = open(line, 'w')
    new_file.close() 
08:31 $ ls
a_file  test.py
08:31 $ python test.py
08:31 $ ls
a?  a_file  b?  c?  test.py
08:31 $ ls -b
a\n  a_file  b\n  c\n  test.py
08:31 $

As you can see, I read lines from a file and create some files using line as the filename and, guess what, the filenames as listed by ls have a ? at the end. But we can do better, as explained in the fine manual page of ls:

  -b, --escape
         print C-style escapes for nongraphic characters

As you can see in the output of ls -b, the filenames are not terminated by a question mark (that is just the placeholder the ls program uses by default for nongraphic characters) but by a newline character.
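The same check can be done from Python itself, since repr() shows escapes much like ls -b does. The following is only a sketch, assuming Python 3 and a POSIX filesystem that allows a newline in a filename; it works in a throwaway temporary directory so nothing is left behind.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as workdir:
    line = "a\n"  # what iterating over a file hands you

    # Same mistake as in the transcript: the raw line is used as a filename.
    open(os.path.join(workdir, line), "w").close()

    # Stripped version, for comparison.
    open(os.path.join(workdir, line.rstrip("\n")), "w").close()

    names = sorted(os.listdir(workdir))

# repr() makes the trailing newline visible instead of hiding it.
for name in names:
    print(repr(name))
```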

While I'm at it, I have to say that you should avoid using a temporary file to store the intermediate results of your computation.

A nice feature of Python is the presence of generator expressions. If you want, you can write your code as follows:

import wget

# you matched on a '200' on the whole line, I assume that what
# you really want is to match a specific column, the 'error_column'
# that I symbolically load from an external resource
from my_constants import error_column, payload_column

# here it is a sequence of generator expressions, each one relying
# on the previous one

# 1. the lines in the file, stripped from the white space
#    on the right (the newline is considered white space)
#    === not strictly necessary, just convenient because
#    === below we want to test for non-empty lines
lines = (line.rstrip() for line in open('whatever.csv'))

# 2. the lines are converted to a list of 'tokens' 
all_tokens = (line.split() for line in lines if line)

# 3. for each 'tokens' in the 'all_tokens' generator expression, we
#    check for the code '200' and possibly generate a new target
targets = (tokens[payload_column] for tokens in all_tokens if tokens[error_column]=='200')

# eventually, use the 'targets' generator to proceed with the downloads
for target in targets: wget.download(target)

Don't be fooled by the amount of comments; without comments my code is just

import wget
from my_constants import error_column, payload_column

lines = (line.rstrip() for line in open('whatever.csv'))
all_tokens = (line.split() for line in lines if line)
targets = (tokens[payload_column] for tokens in all_tokens if tokens[error_column]=='200')

for target in targets: wget.download(target)
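For a version that can be run without my_constants, a log file or network access, here is a sketch of the same generator pipeline in Python 3. The column numbers are assumptions: PATH_COLUMN = 6 comes from tokens[6] in the question, while STATUS_COLUMN = 8 assumes a common access-log layout and should be checked against the real log. The function name successful_urls and the sample lines are made up for illustration. The base URL is prepended and duplicates are dropped, as the question's code does with its set.

```python
BASE_URL = "http://data.xyz.com"  # placeholder host taken from the question
PATH_COLUMN = 6    # tokens[6] in the question's code
STATUS_COLUMN = 8  # assumption: status code position in a common log format


def successful_urls(log_lines):
    """Return sorted, unique URLs for the requests that answered 200."""
    # 1. strip the trailing newline (and other right-hand whitespace)
    lines = (line.rstrip() for line in log_lines)
    # 2. split every non-empty line into tokens
    all_tokens = (line.split() for line in lines if line)
    # 3. keep the path of every line whose status column is exactly '200'
    paths = (
        tokens[PATH_COLUMN]
        for tokens in all_tokens
        if len(tokens) > STATUS_COLUMN and tokens[STATUS_COLUMN] == "200"
    )
    # a set comprehension drops duplicates, sorted() gives a stable order
    return sorted({BASE_URL + path for path in paths})


# Invented sample lines; a real run would pass an open file object instead.
sample_log = [
    '1.2.3.4 - - [10/Oct/2014:13:55:36 -0700] "GET /feeds/a.xml HTTP/1.1" 200 2326\n',
    '1.2.3.4 - - [10/Oct/2014:13:55:37 -0700] "GET /feeds/b.xml HTTP/1.1" 404 512\n',
    '\n',
    '1.2.3.4 - - [10/Oct/2014:13:55:38 -0700] "GET /feeds/a.xml HTTP/1.1" 200 2326\n',
    '1.2.3.4 - - [10/Oct/2014:13:55:39 -0700] "GET /js/c.js HTTP/1.1" 200 100\n',
]

urls = successful_urls(sample_log)
for url in urls:
    print(url)  # each of these could be handed to wget.download(url)
```

Because every stage is a generator, the log is read one line at a time and no intermediate file such as /tmp/reddy_log.txt is needed.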
