Unable to scrape certain values of a website using regex
Question
I've been trying to scrape the information inside a particular set of p tags on a website, and I'm running into a lot of trouble.
My code is as follows:
import urllib
import re

def scrape():
    url = "https://www.theWebsite.com"
    statusText = re.compile('<div id="holdsThePtagsIwant">(.+?)</div>')
    htmlfile = urllib.urlopen(url)
    htmltext = htmlfile.read()
    status = re.findall(statusText, htmltext)
    print("Status: " + str(status))

scrape()
Unfortunately, this only returns: "Status: []"
That said, I have no idea what I'm doing wrong, because when I tested on the same website I could use this pattern instead:
statusText = re.compile('<a href="/about">(.+?)</a>')
and I would get what I was after: "Status: ['About', 'About']"
Does anyone know what I can do to get the information within the div tags? Or, more specifically, the single set of p tags that the div contains? I've tried plugging in just about every value I could think of and have gotten nowhere. After searching Google, YouTube, and SO, I'm running out of ideas.
Recommended answer
I use BeautifulSoup to extract information between HTML tags. Suppose you want to extract a div like this: <div class='article_body' itemprop='articleBody'>...</div>
Then you can use BeautifulSoup to extract that div with:
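from bs4 import BeautifulSoup
soup = BeautifulSoup(htmltext, 'html.parser')  # create the BeautifulSoup object from the page's HTML string
ans = soup.find('div', {'class': 'article_body', 'itemprop': 'articleBody'})  # first matching div, or None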
Also see the official bs4 documentation.
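To tie this back to the question, here is a minimal sketch of the same `find` approach applied to the `holdsThePtagsIwant` div. The real page is not shown in the question, so `SAMPLE_HTML` and the helper name `extract_paragraphs` are made up for illustration; only the div id comes from the original code.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML standing in for the real page, which the question does not show.
SAMPLE_HTML = """
<html><body>
<div id="holdsThePtagsIwant">
  <p>First status line</p>
  <p>Second status line</p>
</div>
<div id="other"><p>Unrelated</p></div>
</body></html>
"""

def extract_paragraphs(html, div_id):
    """Return the text of every <p> inside the <div> with the given id."""
    soup = BeautifulSoup(html, "html.parser")
    div = soup.find("div", {"id": div_id})
    if div is None:
        # No such div on the page: return an empty list instead of raising.
        return []
    return [p.get_text(strip=True) for p in div.find_all("p")]

print(extract_paragraphs(SAMPLE_HTML, "holdsThePtagsIwant"))
# ['First status line', 'Second status line']
```

Because the parser works on the document tree rather than on raw text, it does not matter whether the div's contents sit on one line or many.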
As an example, I have edited your code to extract a div from a Bloomberg article; you can make your own changes.
import urllib  # Python 2; on Python 3 use urllib.request.urlopen instead
from bs4 import BeautifulSoup

def scrape():
    url = 'http://www.bloomberg.com/news/2014-02-20/chinese-group-considers-south-africa-platinum-bids-amid-strikes.html'
    htmlfile = urllib.urlopen(url)
    htmltext = htmlfile.read()
    # Parse the page and pick out the div that holds the article body
    soup = BeautifulSoup(htmltext, 'html.parser')
    ans = soup.find('div', {'class': 'article_body', 'itemprop': 'articleBody'})
    print(ans)

scrape()
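As for why the original regex returned an empty list: one likely cause (an assumption, since the page's HTML is not shown) is that the div's contents span several lines. By default `.` does not match a newline, so `(.+?)` cannot cross a line break, whereas the `<a href="/about">` link presumably sat on a single line. The sketch below uses a made-up HTML string to show the difference that `re.DOTALL` makes.

```python
import re

# Hypothetical markup: the opening tag, the <p> content, and the closing tag
# are on separate lines, as is common in real pages.
html = '<div id="holdsThePtagsIwant">\n<p>All systems normal</p>\n</div>'
pattern = r'<div id="holdsThePtagsIwant">(.+?)</div>'

# Without DOTALL, "." stops at the first newline, so nothing matches.
without_dotall = re.findall(pattern, html)

# With DOTALL, "." also matches newlines and the div body is captured.
with_dotall = re.findall(pattern, html, re.DOTALL)

print(without_dotall)  # []
print(with_dotall)     # ['\n<p>All systems normal</p>\n']
```

Even with `re.DOTALL`, the lazy pattern stops at the first `</div>`, so a div nested inside the target would cut the match short. That is one more reason to prefer a parser such as BeautifulSoup here.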
P.S.: I use Scrapy and BeautifulSoup for web scraping, and I am satisfied with them.