BeautifulSoup not downloading files as expected


Problem Description

I'm trying to download all the .txt files from this website with the following code:

from bs4 import BeautifulSoup as bs
import urllib 
import urllib2

baseurl = "http://m-selig.ae.illinois.edu/props/volume-1/data/"

soup = bs(urllib2.urlopen(baseurl), 'lxml')
links = soup.findAll("a")
for link in links:
    print link.text
    urllib.urlretrieve(baseurl+link.text, link.text)

When I run this code, the print(link.text) line prints the correct file names and the directory gets populated with files with the correct names, but the contents of the files look something like:

<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<html><head>
<title>404 Not Found</title>
</head><body>
<h1>Not Found</h1>
<p>The requested URL /props/volume-1/data/ ance_8.5x6_2849cm_4000.txt was not found on this server.</p>
<p>Additionally, a 404 Not Found
error was encountered while trying to use an ErrorDocument to handle the request.</p>
<hr>
<address>Apache/2.2.29 (Unix) mod_ssl/2.2.29 OpenSSL/1.0.1e-fips mod_bwlimited/1.4 Server at m-selig.ae.illinois.edu Port 80</address>
</body></html>

Thus, I'm sure the communication is working, but I'm not instructing BS correctly on how to save the contents of the files.

Also, I'm currently downloading all the files with the findAll("a") command, but I would actually like to download only specific files, with names such as *geom.txt.

Recommended Answer

You need to pull the href attribute to get the actual link; you can also select only the links whose href contains geom.txt by using a CSS selector:

from bs4 import BeautifulSoup as bs
import urllib
import urllib2
from urlparse import urljoin

baseurl = "http://m-selig.ae.illinois.edu/props/volume-1/data/"

soup = bs(urllib2.urlopen(baseurl), 'lxml')
# Pull the href attribute of each matching anchor, not its display text.
links = (a["href"] for a in soup.select("a[href*=geom.txt]"))
for link in links:
    urllib.urlretrieve(urljoin(baseurl, link), link)
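
The href matters because an anchor's display text need not match it: in the question's 404 output, the requested path contains a stray space before the file name, so link.text did not reproduce the real URL. Printing both side by side makes the mismatch visible (a quick diagnostic sketch):

for a in soup.findAll("a"):
    # repr() makes stray whitespace in the display text easy to spot
    print repr(a.text), repr(a.get("href"))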

a[href*=geom.txt] finds all anchor tags whose href contains geom.txt; it is equivalent to using if substring in main_string in Python.
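
For comparison, the same filtering can be done without a CSS selector by testing each href directly (a small sketch; find_all with href=True skips anchors that have no href attribute):

links = (a["href"] for a in soup.find_all("a", href=True) if "geom.txt" in a["href"])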

You could also use $= in your CSS selector to find hrefs ending in geom.txt:

links = (a["href"] for a in soup.select("a[href$=geom.txt]"))
