How to get plain text in between multiple HTML tags using Scrapy


Problem description


I am trying to grab all the text from multiple tags at a given URL using Scrapy. I am new to Scrapy and don't have much idea how to achieve this; I am learning through examples and other people's experience on Stack Overflow. Here is the list of tags I am targeting.

<div class="TabsMenu fl coloropa2 fontreg"><p>root div<p>
<a class="sub_h" id="mtongue" href="#">Mother tongue</a>
<a class="sub_h" id="caste" href="#">Caste</a>

<a class="sub_h" id="scases" href="#">My name is nand </a> </div>
<div class="BrowseContent fl">
<figure style="display: block;" class="mtongue_h">
<figcaption>
<div class="fullwidth clearfix pl10">Div string for test</div>
<ul>
  <li>Coffee</li>
  <li>Tea</li>
  <li>Milk</li>
</ul>
<div>

<select>
  <option value="volvo">Volvo</option>
  <option value="saab">Saab</option>

</select>

</div>
<li><a title="Hindi UP Matrimony" href="/hindi-up-matrimony-matrimonials"> Hindi-UP </a></li>

The expected output would be:

root div
Mother tongue
Caste
My name is nand
Div string for test
Coffee
Tea
Milk
Volvo
Saab
Hindi-UP

I was trying to get it through XPath. Here is the spider code snippet:

    def parse(self, response):
        for sel in response.xpath('//body'):
            lit = sel.xpath('//*[@id="tab_description"]/ul/li[descendant-or-self::text()]').extract()
            print lit
            string1 = ''.join(lit).encode('utf-8').strip('\r\t\n')
            print string1
            para = sel.xpath('//p/text()').extract()
            span = sel.xpath('//span/text()').extract()
            div = sel.xpath('//div/text()').extract()
            strong = sel.xpath('//span/strong/text()').extract()
            link = sel.xpath('//a/text()').extract()
            string2 = ''.join(para).encode('utf-8').strip('\r\t\n')
            string3 = ''.join(span).encode('utf-8').strip('\r\t\n')
            string4 = ''.join(div).encode('utf-8').strip('\r\t\n')
            string5 = ''.join(strong).encode('utf-8').strip('\r\t\n')
            string6 = ''.join(link).encode('utf-8').strip('\r\t\n')
            string = string6 + string5 + string4 + string3 + string2
            print string

Code snippet for the items:

    import scrapy

    class DmozItem(scrapy.Item):
        title = scrapy.Field()
        link = scrapy.Field()
        desc = scrapy.Field()
        para = scrapy.Field()
        strong = scrapy.Field()
        span = scrapy.Field()
        div = scrapy.Field()

Here is the output:

BROWSE PROFILES BYMother tongueCasteReligionCityOccupationStateNRISpecial Cases Hindi-Delhi  Marathi  Hindi-UP  Punjabi  Telugu  Bengali  Tamil  Gujarati  Malayalam  Kannada  Hindi-MP  Bihari RajasthaniOriyaKonkaniHimachaliHaryanviAssameseKashmiriSikkim/NepaliHindi Brahmin  Sunni  Kayastha  Rajput  Maratha  Khatri  Aggarwal  Arora  Kshatriya  Shwetamber  Yadav  Sindhi  Bania Scheduled CasteNairLingayatJatCatholic - RomanPatelDigamberSikh-JatGuptaCatholicTeliVishwakarmaBrahmin IyerVaishnavJaiswalGujjarSyrianAdi DravidaArya VysyaBalija NaiduBhandariBillavaAnavilGoswamiBrahmin HavyakaKumaoniMadhwaNagarSmarthaVaidikiViswaBuntChambharChaurasiaChettiarDevangaDhangarEzhavasGoudGowda Brahmin IyengarMarwariJatavKammaKapuKhandayatKoliKoshtiKunbiKurubaKushwahaLeva PatidarLohanaMaheshwariMahisyaMaliMauryaMenonMudaliarMudaliar ArcotMogaveeraNadarNaiduNambiarNepaliPadmashaliPatilPillaiPrajapatiReddySadgopeShimpiSomvanshiSonarSutarSwarnkarThevarThiyyaVaishVaishyaVanniyarVarshneyVeerashaivaVellalarVysyaGursikhRamgarhiaSainiMallahShahDhobi-KalarKambojKashmiri PanditRigvediVokkaligaBhavasar KshatriyaAgnikula Audichya Baidya Baishya Bhumihar Bohra Chamar Chasa Chaudhary Chhetri Dhiman Garhwali Gudia Havyaka Kammavar Karana Khandelwal Knanaya Kumbhar Mahajan Mukkulathor Pareek Sourashtra Tanti Thakur Vanjari Vokkaliga Daivadnya Kashyap Kutchi OBC Hindu  Muslim  Christian  Sikh  Jain  Buddhist  Parsi  Jewish  New Delhi  Mumbai  Bangalore  Pune  Hyderabad  Kolkata  Chennai  Lucknow  Ahmedabad  Chandigarh  Nagpur JaipurGurgaonBhopalNoidaIndorePatnaBhubaneshwarGhaziabadKanpurFaridabadLudhianaThaneAlabamaArizonaArkansasCaliforniaColoradoConnecticutDelawareDistrict ColumbiaFloridaIndianaIowaKansasKentuckyMassachusettsMichiganMinnesotaMississippiNew JerseyNew YorkNorth CarolinaNorth DakotaOhioOklahomaOregonPennsylvaniaSouth CarolinaTennesseeTexasVirginiaWashingtonMangalorean  IT Software  Teacher  CA/Accountant  Businessman  Doctors/Nurse  Govt. Services  Lawyers  Defence  IAS  Maharashtra  Uttar Pradesh 

This code snippet gives me all the text, but everything comes out run together without spaces. Is it possible to get each phrase on a new line and keep the spaces between words? Is there an efficient way to do this with Scrapy? Later I want to save the text to a file. Can someone guide me with a code snippet?
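
A minimal sketch of the post-processing being asked for here (an editorial illustration, not code from the question; it assumes the same response object as in the spider above): query the targeted tags once, strip each text node, and join with newlines so every phrase lands on its own line.

    # Editorial sketch: one XPath union over the tags targeted above.
    # Assumes `response` is the same Scrapy response used in parse().
    texts = response.xpath('//p/text() | //div/text() | //a/text() | //li/text() | //option/text()').extract()
    # Drop whitespace-only nodes and put each remaining phrase on its own line.
    lines = [t.strip() for t in texts if t.strip()]
    print('\n'.join(lines))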

Solution

@paultrmbrth suggested this solution to me, and it works for me:

    def parse_item(self, response):
        # Grab every text node under <body>, except text inside <script>
        # or <style>, and write it to a file named "text".
        with open('text', 'wb') as f:
            f.write("".join(response.xpath('//body//*[not(self::script or self::style)]/text()').extract()).encode('utf-8'))

        item = DmozItem()
        yield item
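
Note that the answer as posted joins the text nodes with an empty string, so the saved file will still run the phrases together. A small variation (an editorial sketch, not part of the accepted answer; the file name plain_text.txt is made up) joins with newlines and also stores the result on the DmozItem defined above:

    def parse_item(self, response):
        # Same XPath as the accepted answer: every text node under <body>
        # except those inside <script> or <style> elements.
        texts = response.xpath('//body//*[not(self::script or self::style)]/text()').extract()
        # Strip each node, drop whitespace-only ones, one phrase per line.
        plain = '\n'.join(t.strip() for t in texts if t.strip())

        with open('plain_text.txt', 'wb') as f:  # hypothetical output file name
            f.write(plain.encode('utf-8'))

        item = DmozItem()
        item['desc'] = plain  # keep the extracted text on the item as well
        yield item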
