Parse URL beautifulsoup


Problem Description



import csv
import re

import requests
from bs4 import BeautifulSoup

page = requests.get("https://www.google.com/search?q=cars")
soup = BeautifulSoup(page.content, "lxml")

# Google wraps result links as "/url?q=<target>&sa=..."; match only those.
with open('aaa.csv', 'wb') as myfile:  # 'wb' is correct for csv on Python 2.7
    wr = csv.writer(myfile, quoting=csv.QUOTE_ALL)
    for link in soup.find_all("a", href=re.compile(r"(?<=/url\?q=)(htt.*://.*)")):
        a = re.split(":(?=http)", link["href"].replace("/url?q=", ""))
        wr.writerow(a)
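For context, the lookbehind in the `href` regex restricts the match to Google's redirect-style links. A minimal standalone check of what it captures (the sample href below is illustrative, not taken from the page):

```python
import re

# Google result links look like "/url?q=<target>&sa=...": the lookbehind
# anchors the match to start right after the "/url?q=" prefix.
pattern = re.compile(r"(?<=/url\?q=)(htt.*://.*)")

href = "/url?q=http://www.imdb.com/title/tt0317219/&sa=U"  # illustrative
match = pattern.search(href)
print(match.group(1))  # -> http://www.imdb.com/title/tt0317219/&sa=U
```

Note that the captured group still carries everything after the target URL, which is exactly the problem described below.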

The output of this code is a CSV file with 28 saved URLs; however, the URLs are not correct. For example, this is a wrong URL:

http://www.imdb.com/title/tt0317219/&sa=U&ved=0ahUKEwjg5fahi7nVAhWdHsAKHSQaCekQFgg9MAk&usg=AFQjCNFu_Vg9v1oVhEtR-vKqCJsR2YGd2A

Instead it should be:

http://www.imdb.com/title/tt0317219/

How can I remove the second part of each URL if it contains "&sa="? Everything starting from "&sa=" should be removed, so that all URLs are saved like the second URL above.

I am using Python 2.7 and Ubuntu 16.04.

Solution

If the redundant part of the URL always starts with &, you can apply split() to each URL:

url = 'http://www.imdb.com/title/tt0317219/&sa=U&ved=0ahUKEwjg5fahi7nVAhWdHsAKHSQaCekQFgg9MAk&usg=AFQjCNFu_Vg9v1oVhEtR-vKqCJsR2YGd2A'
url = url.split('&')[0]
print(url)

Output:

http://www.imdb.com/title/tt0317219/
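Putting the pieces together, the cleanup can live in a small helper applied to each matched href inside the original loop (the helper names are illustrative, not from the answer). A more robust alternative is to let the standard library decode the q parameter, which also handles percent-encoded characters; note that Python 2.7 ships this as the urlparse module, while Python 3 exposes it as urllib.parse:

```python
try:
    from urllib.parse import urlparse, parse_qs  # Python 3
except ImportError:
    from urlparse import urlparse, parse_qs      # Python 2.7

def clean_href(href):
    # Drop the "/url?q=" prefix and everything from the first "&" onwards,
    # exactly as the answer's split('&')[0] does.
    return href.replace("/url?q=", "").split("&")[0]

def clean_href_qs(href):
    # Alternative sketch: parse the query string and take the "q" parameter,
    # which also percent-decodes the target URL if needed.
    return parse_qs(urlparse(href).query)["q"][0]

href = ("/url?q=http://www.imdb.com/title/tt0317219/&sa=U"
        "&ved=0ahUKEwjg5fahi7nVAhWdHsAKHSQaCekQFgg9MAk"
        "&usg=AFQjCNFu_Vg9v1oVhEtR-vKqCJsR2YGd2A")
print(clean_href(href))     # -> http://www.imdb.com/title/tt0317219/
print(clean_href_qs(href))  # -> http://www.imdb.com/title/tt0317219/
```

The split-based helper is enough for this page because the target URL itself contains no "&"; the parse_qs version is the safer choice when it might.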

