Extracting p within h1 with Python/Scrapy

Views: 243  Published: 2018/6/25 18:31:23  Tags: python html scrapy lxml


Problem description

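I am using Scrapy to extract some data about musical concerts from websites. At least one website I'm working with uses a p element within an h1 element (incorrectly, according to the W3C; see "Is it valid to have paragraph elements inside of a heading tag in HTML5 (P inside H1)?"). I need to extract the text within the p element nevertheless, and cannot figure out how.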

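I have read the documentation and looked around for example uses, but I am relatively new to Scrapy. I understand the solution has something to do with setting the Selector type to "xml" rather than "html" so that any XML tree is recognized, but for the life of me I cannot figure out how or where to do that in this instance.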

For example, one website has the following HTML:

    <h1 class="performance-title">
    <p>Bernard Haitink conducts Brahms and Dvořák featuring pianist Emanuel Ax
    </p>
    </h1>

I have made an item called Concert() that has a field called 'title'. In my item loader, I use:

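    from scrapy.loader import ItemLoader  # import needed by the snippet

    def parse_item(self, response):
        # Load the concert title from the <p> nested inside the <h1>
        thisconcert = ItemLoader(item=Concert(), response=response)
        thisconcert.add_xpath('title', '//h1[@class="performance-title"]/p/text()')
        return thisconcert.load_item()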

This returns, in item['title'], a unicode list that does not include the text inside the p element, such as:

    ['\n ', '\n ', '\n ']

I understand why, but I don't know how to get around it. I have also tried things like:

    from scrapy import Selector

    def parse_item(self, response):
        s = Selector(text=' '.join(response.xpath('.//section[@id="performers"]/text()').extract()), type='xml')

What am I doing wrong here, and how can I parse HTML that contains this problem (p within h1)?

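I have read the information on this specific issue at "Behavior of the scrapy xpath selector on h1-h6 tags", but it does not provide a complete solution that can be applied to a spider, only an example within a session using a given text string.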

Solution

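That was quite baffling. To be frank, I still do not understand why this is happening. It turns out that the <p> tag that should be contained within the <h1> tag is not inside it. Fetching the page with curl shows markup of the form <h1><p> </p></h1>, whereas the response obtained from the site shows it as:

    <h1 class="performance-title">\n</h1> <p>Bernard Haitink conducts Brahms and\xa0Dvo\u0159\xe1k featuring\npianist Emanuel Ax </p>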

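As I mentioned, I have my doubts about the cause, but nothing concrete. In any case, the XPath for getting the text inside the <p> tag is therefore:

    response.xpath('//h1[@class="performance-title"]/following-sibling::p/text()').extract()

This uses the <h1 class="performance-title"> as a landmark and selects its following sibling <p> tags.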
