Extracting p within h1 with Python/Scrapy
Problem description
I am using Scrapy to extract some data about musical concerts from websites. At least one website I'm working with places a p element within an h1 element (incorrectly, according to the W3C; see "Is it valid to have paragraph elements inside of a heading tag in HTML5 (P inside H1)?"). I need to extract the text within the p element nevertheless, and cannot figure out how.
I have read the documentation and looked around for example uses, but am relatively new to Scrapy. I understand the solution has something to do with setting the Selector type to "xml" rather than "html" in order to recognize any XML tree, but for the life of me I cannot figure out how or where to do that in this instance.
For example, a website has the following HTML:
<h1 class="performance-title">
<p>Bernard Haitink conducts Brahms and Dvořák featuring pianist Emanuel Ax
</p>
</h1>
I have made an item called Concert() that has a field called 'title'. In my item loader, I use:
from scrapy.loader import ItemLoader

def parse_item(self, response):
    thisconcert = ItemLoader(item=Concert(), response=response)
    thisconcert.add_xpath('title', '//h1[@class="performance-title"]/p/text()')
    return thisconcert.load_item()
This returns, in item['title'], a unicode list that does not include the text inside the p element, such as:
['\n', '\n', '\n']
I understand why, but I don't know how to get around it. I have also tried things like:
from scrapy import Selector
def parse_item(self, response):
    s = Selector(text=' '.join(response.xpath('.//section[@id="performers"]/text()').extract()), type='xml')
What am I doing wrong here, and how can I parse HTML that contains this problem (p within h1)?
I have looked at the question "Behavior of the scrapy xpath selector on h1-h6 tags", which covers this specific issue, but it does not provide a complete solution that can be applied to a spider, only an example within a shell session using a given text string.
That was quite baffling. To be frank, I still do not get why this is happening. It turns out that the <p> tag that should be contained within the <h1> tag is not. Curl for the site shows markup of the form <h1><p> </p></h1>, whereas the response obtained from the site shows it as:
<h1 class="performance-title"> </h1> <p>Bernard Haitink conducts Brahms and\xa0Dvo\u0159\xe1k featuring pianist Emanuel Ax </p>
As I mentioned, I do have my doubts but nothing concrete. Anyway, the XPath for getting the text inside the <p> tag is therefore:
response.xpath('//h1[@class="performance-title"]/following-sibling::p/text()').extract()
This works by using the <h1 class="performance-title"> as a landmark and finding its sibling <p> tag.