Trying to scrape multiple URLs from one page


Question

I am trying to scrape information from the election results for the 18 NI constituencies here:

http://www.eoni.org.uk/Elections/Election-results-and-statistics/Election-results-and-statistics-2003-onwards/Elections-2019/UK-Parliamentary-Election-2019-Results

Each of the unique URLs starts like this:

http://www.eoni.org.uk/Elections/Election-results-and-statistics/Election-results-and-statistics-2003-onwards/Elections-2019/

The selector for the 18 URLs is as follows:

#container > div.two-column-content.clearfix > div > div.right-column.cms > div > ul > li

What I want to start with is a list of the 18 URLs. This list should be clean (i.e. just the actual addresses, no tags, etc.).
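
For reference, a minimal sketch of how that selector could be fed straight to BeautifulSoup's select(), assuming each li wraps a single link (hence the trailing a appended to the selector):

import requests
from bs4 import BeautifulSoup

url = 'http://www.eoni.org.uk/Elections/Election-results-and-statistics/Election-results-and-statistics-2003-onwards/Elections-2019/UK-Parliamentary-Election-2019-Results'
soup = BeautifulSoup(requests.get(url).text, "html.parser")

# select() accepts the CSS selector as-is; ['href'] keeps just the address
selector = '#container > div.two-column-content.clearfix > div > div.right-column.cms > div > ul > li a'
urls = [a['href'] for a in soup.select(selector)]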

My code so far:

import requests
from bs4 import BeautifulSoup
import pandas as pd
import numpy as np
from time import sleep
from random import randint
from selenium import webdriver

url = 'http://www.eoni.org.uk/Elections/Election-results-and-statistics/Election-results-and-statistics-2003-onwards/Elections-2019/UK-Parliamentary-Election-2019-Results'

response = requests.get(url)
response.status_code

text = requests.get(url).text

soup = BeautifulSoup(text, parser="html5lib")

link_list = []
for a in soup('a'):
    if a.has_attr('href'):
        link_list.append(a)

re_pattern = r"^/Elections/Election-results-and-statistics/Election-results-and-statistics-2003-onwards/Elections-2019/"

This is where I get lost, because I need to search for all 18 URLs that begin with that pattern (and I'm fairly sure the pattern is wrong. Please help!)

The rest of the code:

import re
good_urls = [url for url in link_list if re.match(re_pattern, url)]

Here I get this error:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-36-f3fbbd3199b1> in <module>
----> 1 good_urls = [url for url in link_list if re.match(re_pattern, url)]

<ipython-input-36-f3fbbd3199b1> in <listcomp>(.0)
----> 1 good_urls = [url for url in link_list if re.match(re_pattern, url)]

~/opt/anaconda3/lib/python3.7/re.py in match(pattern, string, flags)
    173     """Try to apply the pattern at the start of the string, returning
    174     a Match object, or None if no match was found."""
--> 175     return _compile(pattern, flags).match(string)
    176 
    177 def fullmatch(pattern, string, flags=0):

TypeError: expected string or bytes-like object

What should I type differently to get those 18 URLs? Thank you!

Answer

This seems to do the trick.

I've removed some unnecessary imports and things that aren't needed here; re-add them if you need them elsewhere, of course.

The error message was due to trying to do a regex comparison on a soup object; it needs to be cast to a string (the same problem discussed in the link @Huzefa posted, so that was definitely relevant).
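
A toy illustration of that cast, using a made-up anchor tag:

from bs4 import BeautifulSoup
import re

tag = BeautifulSoup('<a href="/Elections-2019/x">x</a>', "html.parser").a
# re.match(r"<a href=", tag)      # raises TypeError: expected string or bytes-like object
re.match(r"<a href=", str(tag))   # matches once the tag is cast to str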

Fixing that still left the issue of isolating the correct strings. I've simplified the regex for matching, then used a simple string split on " and selected the second item resulting from the split (which is our URL).

import requests
from bs4 import BeautifulSoup
import re

url = 'http://www.eoni.org.uk/Elections/Election-results-and-statistics/Election-results-and-statistics-2003-onwards/Elections-2019/UK-Parliamentary-Election-2019-Results'
text = requests.get(url).text
soup = BeautifulSoup(text, "html.parser")

# Match any anchor tag whose string form contains the Elections-2019 path
re_pattern = "<a href=\".*/Elections-2019/.*"
link_list = []
for a in soup('a'):
    if a.has_attr('href') and re.match(re_pattern, str(a)):
        # str(a) looks like '<a href="...">...</a>'; splitting on the double
        # quote makes the bare URL the second element
        link_list.append(str(a).split('"')[1])
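
As a variation (my own sketch, reusing the soup object from the snippet above), the same list can be built without the string split by reading the href attribute directly; the 'in' test is slightly looser than the regex, so check the count before relying on it:

# Read the attribute instead of splitting the tag's string form
link_list = [a['href'] for a in soup('a')
             if a.has_attr('href') and '/Elections-2019/' in a['href']]
print(len(link_list))  # expect 18 if the page still lists all constituencies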

Hope it fits your purpose; ask if anything is unclear.
