BeautifulSoup not downloading files as expected
Question
I'm trying to download all the .txt files from this website with the following code:
from bs4 import BeautifulSoup as bs
import urllib
import urllib2
baseurl = "http://m-selig.ae.illinois.edu/props/volume-1/data/"
soup = bs(urllib2.urlopen(baseurl), 'lxml')
links = soup.findAll("a")
for link in links:
    print link.text
    urllib.urlretrieve(baseurl + link.text, link.text)
When I run this code, the print link.text line prints the correct file names, and the directory gets populated with files with the correct names, but the contents of the files look something like:
<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<html><head>
<title>404 Not Found</title>
</head><body>
<h1>Not Found</h1>
<p>The requested URL /props/volume-1/data/ ance_8.5x6_2849cm_4000.txt was not found on this server.</p>
<p>Additionally, a 404 Not Found
error was encountered while trying to use an ErrorDocument to handle the request.</p>
<hr>
<address>Apache/2.2.29 (Unix) mod_ssl/2.2.29 OpenSSL/1.0.1e-fips mod_bwlimited/1.4 Server at m-selig.ae.illinois.edu Port 80</address>
</body></html>
Thus, I'm sure the communication is working, but I'm not instructing BS correctly on how to save the contents of the files.
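The stray space in the 404 path (/data/ ance_...) hints at the root cause: link.text is the anchor's *visible* text, which can carry whitespace that the href attribute does not. A minimal illustration (the snippet below is invented, not the actual page markup):

```python
from bs4 import BeautifulSoup

# Invented listing fragment: the visible anchor text has a leading space,
# while the href attribute is clean.
html = '<a href="ance_8.5x6_2849cm_4000.txt"> ance_8.5x6_2849cm_4000.txt</a>'
link = BeautifulSoup(html, "html.parser").find("a")

print(repr(link.text))     # visible text, whitespace included
print(repr(link["href"]))  # the attribute value, safe to build a URL from
```

Appending link.text to the base URL then produces ".../data/ ance_...", which the server correctly reports as not found.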
Also, I'm currently downloading all the files with the findAll("a") call, but I would actually like to only download specific files with names such as *geom.txt.
Answer
You need to pull the href to get the link. You can also get just the links that contain geom.txt using a CSS selector:
from bs4 import BeautifulSoup as bs
import urllib
import urllib2
from urlparse import urljoin
baseurl = "http://m-selig.ae.illinois.edu/props/volume-1/data/"
soup = bs(urllib2.urlopen(baseurl), 'lxml')
links = (a["href"] for a in soup.select('a[href*="geom.txt"]'))
for link in links:
    urllib.urlretrieve(urljoin(baseurl, link), link)
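Note that urllib2 and urllib.urlretrieve are Python 2 APIs. A rough Python 3 equivalent of the same approach might look like the sketch below (the function names are my own, and html.parser stands in for lxml to avoid the extra dependency):

```python
from urllib.parse import urljoin
from urllib.request import urlopen, urlretrieve
from bs4 import BeautifulSoup

BASEURL = "http://m-selig.ae.illinois.edu/props/volume-1/data/"

def geom_links(html, base):
    """Absolute URLs for every anchor whose href contains geom.txt."""
    soup = BeautifulSoup(html, "html.parser")
    return [urljoin(base, a["href"]) for a in soup.select('a[href*="geom.txt"]')]

def download_geom_files(base=BASEURL):
    """Fetch the listing page and save each matching file locally."""
    for url in geom_links(urlopen(base).read(), base):
        urlretrieve(url, url.rsplit("/", 1)[-1])

# download_geom_files()  # uncomment to actually download
```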
a[href*="geom.txt"] finds all anchor tags whose href contains geom.txt; it is equivalent to if substring in main_string in Python.
You could also use $= in your CSS to find hrefs ending in geom.txt:
links = (a["href"] for a in soup.select('a[href$="geom.txt"]'))
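The difference between the two selectors can be checked on a throwaway snippet (the file names below are invented):

```python
from bs4 import BeautifulSoup

# Invented listing fragment: one name ends in geom.txt,
# one merely contains it, one has neither.
html = ('<a href="apc_geom.txt">a</a>'
        '<a href="apc_geom.txt.bak">b</a>'
        '<a href="apc_static.txt">c</a>')
soup = BeautifulSoup(html, "html.parser")

# *= matches any href containing the substring
contains = [a["href"] for a in soup.select('a[href*="geom.txt"]')]
# $= matches only hrefs that end with it
ends_with = [a["href"] for a in soup.select('a[href$="geom.txt"]')]

print(contains)
print(ends_with)
```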