Writing a loop: Beautifulsoup and lxml for getting page-content in a page-to-page skip-setting
Question
Update: now with an image of one of the more than 6600 target pages: https://europa.eu/youth/volunteering/organisation/48592 (see below: the image and the explanation and description of the aimed goals and the data which are wanted).
I am pretty new to the field of data work for volunteering services. Any help is appreciated. I have learned a lot in the past few days from some coding heroes such as αԋɱҽԃ αмєяιcαη and KunduK.
Basically our goal is to create a quick overview of a set of opportunities for free volunteering in Europe. I have the list of URLs which I want to use to fetch the data, and I can do it for one URL like the snippets below. Currently I am working on a hands-on approach to dive into Python programming: I have several parser parts that already work; see below for an overview covering several pages. BTW: I guess that we should gather the info with pandas and store it in CSV...
- https://europa.eu/youth/volunteering/organisation/50160
- https://europa.eu/youth/volunteering/organisation/50162
- https://europa.eu/youth/volunteering/organisation/50163
...and so forth and so forth.... [Note: not every URL and id is backed up with a content page, therefore we need an incremental n+1 setting.] Therefore we can count the pages one by one, counting in n+1 increments.
See the examples:
- https://europa.eu/youth/volunteering/organisation/48592
- https://europa.eu/youth/volunteering/organisation/50160
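The incremental n+1 idea above can be sketched as a plain URL generator. This is a minimal sketch; `candidate_urls` is a hypothetical helper name, and since not every id is backed by a content page, a later fetch step still has to check the HTTP status and skip the misses:

```python
# Base pattern of the target pages, with the numeric id as placeholder.
BASE = "https://europa.eu/youth/volunteering/organisation/{}"

def candidate_urls(first_id, last_id):
    """Enumerate candidate page URLs for every id in [first_id, last_id].

    Not every id resolves to a content page, so the fetch step should
    still skip non-200 responses.
    """
    return [BASE.format(n) for n in range(first_id, last_id + 1)]
```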
Approach: I used a CSS selector; XPath and CSS selectors can do the same task, but with both BS4 and lxml we can use either one, or mix them with find() and find_all().
So I run this mini-approach here:
```python
from bs4 import BeautifulSoup
import requests

url = 'https://europa.eu/youth/volunteering/organisation/50160'
response = requests.get(url)
soup = BeautifulSoup(response.content, 'lxml')
tag_info = soup.select('.col-md-12 > p:nth-child(3) > i:nth-child(1)')
print(tag_info[0].text)
```
Output: Norwegian Judo Federation
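As a side note on mixing the two styles: `select_one()` (CSS) and `find()`/`find_all()` can be chained on the same BeautifulSoup tree. A minimal sketch on an inline HTML fragment that is made up here purely to mimic the shape of the target markup:

```python
from bs4 import BeautifulSoup

# Made-up fragment shaped roughly like the target page's markup.
html = '<div class="col-md-12"><p>a</p><p>b</p><p><i>Norwegian Judo Federation</i></p></div>'
soup = BeautifulSoup(html, 'html.parser')

# CSS-selector route, as in the mini-approach above.
via_css = soup.select_one('.col-md-12 > p:nth-child(3) > i').text
# find()/find_all() route to the same node.
via_find = soup.find('div', class_='col-md-12').find_all('p')[2].find('i').text
```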
Mini-approach 2:
```python
from lxml import html
import requests

url = 'https://europa.eu/youth/volunteering/organisation/50160'
response = requests.get(url)
tree = html.fromstring(response.content)
tag_info = tree.xpath("//p[contains(text(),'Norwegian')]")
print(tag_info[0].text)
```
Output: Norwegian Judo Federation (NJF) is a center organisation for Norwegian Judo clubs. NJF has 65 member clubs, which have about 4500 active members. 73 % of the members are between ages of 3 and 19. NJF is organized in The Norwegian Olympic and Paralympic Committee and Confederation of Sports (NIF). We are a member organisation in European Judo Union (EJU) and International Judo Federation (IJF). NJF offers and organizes a wide range of educational opportunities to our member clubs.
And so forth and so forth. What I am trying to achieve: the aim is to gather all the interesting information from all of the 6800 pages, which means information such as:
- the URL of the page, and all the parts of the page that are marked red
- organisation name
- address
- organisation description
- role
- expiry date
- scope
- last updated
- organisation topic (not noted on every page: it appears only occasionally)
...and then iterate to the next page, getting all the information, and so forth. So I try a next step to get some more experience: to gather info from all of the pages. Note: we've got 6926 pages.
The question is, regarding the URLs: how to find out which is the first and which is the last URL? Idea: what if we iterate from zero to 10 000, using the number in the URL!?
```python
import requests
from bs4 import BeautifulSoup
import pandas as pd

numbers = [48592, 50160]

def Main(url):
    with requests.Session() as req:
        for num in numbers:
            response = req.get(url.format(num))
            soup = BeautifulSoup(response.content, 'lxml')
            tag_info = soup.select('.col-md-12 > p:nth-child(3) > i:nth-child(1)')
            print(tag_info[0].text)

Main("https://europa.eu/youth/volunteering/organisation/{}/")
```
But here I run into issues. I guess that I have overlooked something while combining the ideas of the parts mentioned above. Again, I guess that we should gather the info with pandas and store it in CSV...
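On the pandas-and-CSV idea: once each page has been parsed into a dict, collecting the dicts into a DataFrame and writing one CSV is a line each. A minimal sketch, where `rows_to_csv` and the shape of the per-page dicts are assumptions for illustration:

```python
import pandas as pd

def rows_to_csv(rows, path):
    """Collect per-page dicts (one per organisation page) into a
    DataFrame and write them out as a single CSV file."""
    df = pd.DataFrame(rows)
    df.to_csv(path, index=False)
    return df
```

The `rows` list would be filled inside the fetch loop, e.g. with dicts like `{"name": ..., "address": ..., "scope": ...}`, one per successfully parsed page.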
Answer
```python
import requests
from bs4 import BeautifulSoup
import re
import csv
from tqdm import tqdm

first = "https://europa.eu/youth/volunteering/organisations_en?page={}"
second = "https://europa.eu/youth/volunteering/organisation/{}_en"

def catch(url):
    with requests.Session() as req:
        pages = []
        print("Loading All IDS\n")
        for item in tqdm(range(0, 347)):
            r = req.get(url.format(item))
            soup = BeautifulSoup(r.content, 'html.parser')
            numbers = [item.get("href").split("/")[-1].split("_")[0] for item in soup.findAll(
                "a", href=re.compile("^/youth/volunteering/organisation/"), class_="btn btn-default")]
            pages.extend(numbers)  # collect ids from every listing page, not just the last one
        return pages

def parse(url):
    links = catch(first)
    with requests.Session() as req:
        with open("Data.csv", 'w', newline="", encoding="UTF-8") as f:
            writer = csv.writer(f)
            writer.writerow(["Name", "Address", "Site", "Phone",
                             "Description", "Scope", "Rec", "Send", "PIC", "OID", "Topic"])
            print("\nParsing Now... \n")
            for link in tqdm(links):
                r = req.get(url.format(link))
                soup = BeautifulSoup(r.content, 'html.parser')
                task = soup.find("section", class_="col-sm-12").contents
                name = task[1].text
                add = task[3].find(
                    "i", class_="fa fa-location-arrow fa-lg").parent.text.strip()
                try:
                    site = task[3].find("a", class_="link-default").get("href")
                except AttributeError:
                    site = "N/A"
                try:
                    phone = task[3].find(
                        "i", class_="fa fa-phone").next_element.strip()
                except AttributeError:
                    phone = "N/A"
                desc = task[3].find(
                    "h3", class_="eyp-project-heading underline").find_next("p").text
                scope = task[3].findAll("span", class_="pull-right")[1].text
                rec = task[3].select("tbody td")[1].text
                send = task[3].select("tbody td")[-1].text
                pic = task[3].select("span.vertical-space")[0].text.split(" ")[1]
                oid = task[3].select("span.vertical-space")[-1].text.split(" ")[1]
                topic = [item.next_element.strip() for item in task[3].select(
                    "i.fa.fa-check.fa-lg")]
                writer.writerow([name, add, site, phone, desc,
                                 scope, rec, send, pic, oid, "".join(topic)])

parse(second)
```
Note: I've tested the first 10 pages. In case you are looking to gain more speed, I advise you to use concurrent.futures. And if there's any error, use try/except.
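The concurrent.futures suggestion can be sketched as follows; `fetch_all` is a hypothetical wrapper name, and threads suit this I/O-bound scraping. The per-URL try/except would live inside the callable that is passed in:

```python
from concurrent.futures import ThreadPoolExecutor

def fetch_all(urls, fetch, max_workers=10):
    """Apply a per-URL fetch callable concurrently over a thread pool.

    fetch: any callable taking one URL and returning a parsed result
    (e.g. a wrapper around req.get + BeautifulSoup with its own
    try/except). Results come back in the same order as urls.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fetch, urls))
```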