Parse URL beautifulsoup


Problem Description



import csv
import re

import requests
from bs4 import BeautifulSoup

page = requests.get("https://www.google.com/search?q=cars")
soup = BeautifulSoup(page.content, "lxml")

# Google wraps result links as "/url?q=<target>&sa=..."; match only those.
with open('aaa.csv', 'wb') as myfile:  # 'wb' is correct for csv on Python 2.7
    wr = csv.writer(myfile, quoting=csv.QUOTE_ALL)
    for link in soup.find_all("a", href=re.compile(r"(?<=/url\?q=)(htt.*://.*)")):
        a = re.split(":(?=http)", link["href"].replace("/url?q=", ""))
        wr.writerow(a)
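For context, the lookbehind in the `href` regex restricts the match to Google's redirect-style links. A minimal standalone check of what it captures (the sample href below is illustrative, not taken from the page):

```python
import re

# Google result links look like "/url?q=<target>&sa=...": the lookbehind
# anchors the match to start right after the "/url?q=" prefix.
pattern = re.compile(r"(?<=/url\?q=)(htt.*://.*)")

href = "/url?q=http://www.imdb.com/title/tt0317219/&sa=U"  # illustrative
match = pattern.search(href)
print(match.group(1))  # -> http://www.imdb.com/title/tt0317219/&sa=U
```

Note that the captured group still carries everything after the target URL, which is exactly the problem described below.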

The output of this code is a CSV file with 28 saved URLs; however, the URLs are not correct. For example, this is a wrong URL:

http://www.imdb.com/title/tt0317219/&sa=U&ved=0ahUKEwjg5fahi7nVAhWdHsAKHSQaCekQFgg9MAk&usg=AFQjCNFu_Vg9v1oVhEtR-vKqCJsR2YGd2A

Instead it should be:

http://www.imdb.com/title/tt0317219/

How can I remove the second part of each URL if it contains "&sa="? Everything starting from "&sa=" should be removed, so that all URLs are saved like the second URL above.

I am using Python 2.7 and Ubuntu 16.04.

Solution

If the redundant part of the URL always starts with &, you can apply split() to each URL:

url = 'http://www.imdb.com/title/tt0317219/&sa=U&ved=0ahUKEwjg5fahi7nVAhWdHsAKHSQaCekQFgg9MAk&usg=AFQjCNFu_Vg9v1oVhEtR-vKqCJsR2YGd2A'
url = url.split('&')[0]
print(url)

Output:

http://www.imdb.com/title/tt0317219/
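Putting the pieces together, the cleanup can live in a small helper applied to each matched href inside the original loop (the helper names are illustrative, not from the answer). A more robust alternative is to let the standard library decode the q parameter, which also handles percent-encoded characters; note that Python 2.7 ships this as the urlparse module, while Python 3 exposes it as urllib.parse:

```python
try:
    from urllib.parse import urlparse, parse_qs  # Python 3
except ImportError:
    from urlparse import urlparse, parse_qs      # Python 2.7

def clean_href(href):
    # Drop the "/url?q=" prefix and everything from the first "&" onwards,
    # exactly as the answer's split('&')[0] does.
    return href.replace("/url?q=", "").split("&")[0]

def clean_href_qs(href):
    # Alternative sketch: parse the query string and take the "q" parameter,
    # which also percent-decodes the target URL if needed.
    return parse_qs(urlparse(href).query)["q"][0]

href = ("/url?q=http://www.imdb.com/title/tt0317219/&sa=U"
        "&ved=0ahUKEwjg5fahi7nVAhWdHsAKHSQaCekQFgg9MAk"
        "&usg=AFQjCNFu_Vg9v1oVhEtR-vKqCJsR2YGd2A")
print(clean_href(href))     # -> http://www.imdb.com/title/tt0317219/
print(clean_href_qs(href))  # -> http://www.imdb.com/title/tt0317219/
```

The split-based helper is enough for this page because the target URL itself contains no "&"; the parse_qs version is the safer choice when it might.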

