Use Beautiful Soup in scraping multiple websites
Problem description
I want to know why the lists all_links and all_titles don't receive any records from the lists titles and links. I have also tried the .extend() method, but it didn't help.
import requests
from bs4 import BeautifulSoup

all_links = []
all_titles = []

def title_link(page_num):
    page = requests.get(
        'https://www.gumtree.pl/s-mieszkania-i-domy-sprzedam-i-kupie/warszawa/page-%d/v%dc9073l3200008p%d'
        % (page_num, page_num, page_num))
    soup = BeautifulSoup(page.content, 'html.parser')
    links = ['https://www.gumtree.pl' + link.get('href')
             for link in soup.find_all('a', class_="href-link tile-title-text")]
    titles = [flat.next_element for flat in soup.find_all('a', class_="href-link tile-title-text")]
    print(titles)

for i in range(1, 5+1):
    title_link(i)
    all_links = all_links + links
    all_titles = all_titles + titles
    i += 1

print(all_links)

import pandas as pd
df = pd.DataFrame(data={'title': all_titles, 'link': all_links})
df.head(100)
#df.to_csv("./gumtree_page_1.csv", sep=';', index=False, encoding='utf-8')
#df.to_excel('./gumtree_page_1.xlsx')
Recommended answer
Try this:
import requests
from bs4 import BeautifulSoup

all_links = []
all_titles = []

def title_link(page_num):
    page = requests.get(
        'https://www.gumtree.pl/s-mieszkania-i-domy-sprzedam-i-kupie/warszawa/page-%d/v%dc9073l3200008p%d'
        % (page_num, page_num, page_num))
    page.encoding = 'utf-8'
    soup = BeautifulSoup(page.content, 'html.parser', from_encoding='utf-8')
    links = ['https://www.gumtree.pl' + link.get('href')
             for link in soup.find_all('a', class_="href-link tile-title-text")]
    titles = [flat.next_element for flat in soup.find_all('a', class_="href-link tile-title-text")]
    print(titles)
    return links, titles

for i in range(1, 5+1):
    links, titles = title_link(i)
    all_links.extend(links)
    all_titles.extend(titles)
    # i += 1 is not needed in Python

print(all_links)

import pandas as pd
df = pd.DataFrame(data={'title': all_titles, 'link': all_links})
df.head(100)
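If you also want to save the result, the commented-out export lines from the question still apply here; continuing from the script above:

df.to_csv("./gumtree_page_1.csv", sep=';', index=False, encoding='utf-8')
# or: df.to_excel('./gumtree_page_1.xlsx')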
I think you just needed to get links and titles out of title_link(page_num).
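The underlying issue is Python's function scoping: links and titles created inside title_link are local names that disappear when the function returns, so the loop in the question never saw them. A minimal sketch of the difference, with toy names rather than the scraper:

def make_items():
    items = [1, 2, 3]       # 'items' is local to make_items

make_items()
# print(items)              # NameError: 'items' is not defined at module level

def make_items_fixed():
    items = [1, 2, 3]
    return items            # hand the list back to the caller

collected = []
collected.extend(make_items_fixed())
print(collected)            # [1, 2, 3]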
removed the manual incrementing per the comments
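This is because a Python for loop rebinds the loop variable from the iterator on every pass, so a manual i += 1 inside the body is silently discarded; a tiny illustration:

for i in range(3):
    print(i)    # prints 0, 1, 2
    i += 10     # discarded; the next iteration rebinds i from range()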
changed all_links = all_links + links to all_links.extend(links)
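Both forms work at module level, but list.extend() mutates the existing list in place instead of building a new list and rebinding the name; roughly:

a = [1, 2]
b = [3, 4]
a = a + b       # builds a brand-new list and rebinds the name 'a'
a.extend(b)     # mutates the same list in place; no rebinding needed
print(a)        # [1, 2, 3, 4, 3, 4]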
the website is utf-8 encoded, so added page.encoding = 'utf-8' and, as an extra (probably unnecessary) measure, from_encoding='utf-8' to the BeautifulSoup call
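As the answer notes, this is belt and braces: page.encoding only affects how page.text is decoded, while from_encoding is a hint for the raw bytes passed in via page.content, and BeautifulSoup normally sniffs the charset on its own. A hedged sketch of the two knobs:

import requests
from bs4 import BeautifulSoup

page = requests.get('https://www.gumtree.pl/')
page.encoding = 'utf-8'   # only affects how page.text is decoded
# from_encoding is a hint for the raw bytes in page.content;
# BeautifulSoup usually detects the charset itself.
soup = BeautifulSoup(page.content, 'html.parser', from_encoding='utf-8')
print(soup.original_encoding)   # typically 'utf-8'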