Parse URL beautifulsoup

Problem description
import requests
import csv
import re
from bs4 import BeautifulSoup

page = requests.get("https://www.google.com/search?q=cars")
soup = BeautifulSoup(page.content, "lxml")
links = soup.findAll("a")

with open('aaa.csv', 'wb') as myfile:
    wr = csv.writer(myfile, quoting=csv.QUOTE_ALL)
    for link in soup.find_all("a", href=re.compile(r"(?<=/url\?q=)(htt.*://.*)")):
        a = re.split(r":(?=http)", link["href"].replace("/url?q=", ""))
        wr.writerow(a)
The output of this code is a CSV file where 28 URLs are saved, but the URLs are not correct. For example, this is a wrong URL:-
Instead it should be:-
http://www.imdb.com/title/tt0317219/
How can I remove the second part of each URL if it contains "&sa="? Everything starting from "&sa=" should be removed, so that all URLs are saved like the second URL.
I am using Python 2.7 and Ubuntu 16.04.
If the redundant part of the url always starts with &, you can apply split() to each url:
url = 'http://www.imdb.com/title/tt0317219/&sa=U&ved=0ahUKEwjg5fahi7nVAhWdHsAKHSQaCekQFgg9MAk&usg=AFQjCNFu_Vg9v1oVhEtR-vKqCJsR2YGd2A'
url = url.split('&')[0]
print(url)
Output:
http://www.imdb.com/title/tt0317219/
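Putting it together, the same split can be applied inside the scraping loop before each row is written. The sketch below uses a hardcoded list of sample hrefs (hypothetical stand-ins for the live soup.find_all output, so it runs without hitting Google); the clean() helper name is ours, not from the original code:

```python
import csv

# Hypothetical sample hrefs in the "/url?q=..." format Google
# wraps result links in (stand-ins for soup.find_all output).
hrefs = [
    "/url?q=http://www.imdb.com/title/tt0317219/&sa=U&ved=0ahUKEwjg&usg=AFQjCNFu",
    "/url?q=https://en.wikipedia.org/wiki/Car&sa=U&ved=0ahUKEwjh&usg=AFQjCNFv",
]

def clean(href):
    # Strip the "/url?q=" prefix, then drop everything from the
    # first "&" onward (the sa/ved/usg tracking parameters).
    return href.replace("/url?q=", "").split("&")[0]

urls = [clean(h) for h in hrefs]

with open("aaa.csv", "w") as myfile:
    wr = csv.writer(myfile, quoting=csv.QUOTE_ALL)
    for url in urls:
        wr.writerow([url])

print(urls[0])  # http://www.imdb.com/title/tt0317219/
```

Note that writerow() takes a list, so each cleaned URL is wrapped in one; writing the string directly would spread its characters across columns.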