Flipkart.com product 'price' and product 'title' extraction using Python

Problem description

I have written the following Python code to extract the price of a specified item from flipkart.com:

import sys
import urllib2
import bs4

item = "Wilco Classic Library: Autobiography Of a Yogi (Hardcover)"
# str.replace returns a new string, so the result must be assigned back
item = item.replace(" ", "+")
link = 'http://www.flipkart.com/search/a/all?query={0}&vertical=all&dd=0&autosuggest[as]=off&autosuggest[as-submittype]=entered&autosuggest[as-grouprank]=0&autosuggest[as-overallrank]=0&autosuggest[orig-query]=&autosuggest[as-shown]=off&Search=%C2%A0&otracker=start&_r=YSWdYULYzr4VBYklfpZRbw--&_l=pMHn9vNCOBi05LKC_PwHFQ--&ref=a2c6fadc-2e24-4412-be6a-ce02c9707310&selmitem=All+Categories'.format(item)
r = urllib2.Request(link, headers={"User-Agent": "Python-urlli~"})
try:
    response = urllib2.urlopen(r)
except urllib2.URLError:
    # Without a response there is nothing to parse, so stop here
    print "Internet connection error"
    sys.exit(1)
thePage = response.read()
soup = bs4.BeautifulSoup(thePage)
# First search result block on the page
firstBlockSoup = soup.find('div', attrs={'class': 'fk-srch-item'})
priceSoup = firstBlockSoup.find('b', attrs={'class': 'fksd-bodytext price final-price'})
price = priceSoup.contents[0]
print price

titleSoup = firstBlockSoup.find('a', attrs={'class': 'fk-srch-title-text fksd-bodytext'})
# findAll returns a list of <b> tags, not the title text
title = titleSoup.findAll('b')
print title

When executed, the above code prints the price without issues:

Rs. 138 

But the title is obtained as follows:

[<b>Wilco</b>, <b>Classic</b>, <b>Library</b>, <b>Autobiography</b>, <b>Of</b>, <b>a</b>, <b>Yogi</b>, <b>Hardcover</b>] 

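The reason is apparent from the source code of the product page (http://www.flipkart.com/search-books?query=Wilco%20Classic%20Library:Autobiography%20Of%20a%20Yogi%20%28Hardcover%29&from=all&searchGroup=all&autosuggest%5Bas%5D=off&autosuggest%5Bas-submittype%5D=entered&autosuggest%5Bas-grouprank%5D=0&autosuggest%5Bas-overallrank%5D=0&autosuggest%5Borig-query%5D=&autosuggest%5Bas-shown%5D=off&selmitem=All%20Categories&otracker=start&vertical=all&_l=WE_JphiGfT8Bh6aXp1vT2w--&_r=_MFMNA8pxFY3ZpKGrqRTOA--&ref=b015ecfa-833a-4e1b-b50e-11b9515d2498, use 'Inspect element'): each word of the title is wrapped in its own <b> tag, so findAll('b') returns a list of tags and drops the punctuation between them.

Now, how do I extract the title in the proper format, so that it prints: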

Wilco Classic Library: Autobiography Of a Yogi (Hardcover)
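
Separately from the title problem, one detail of the script above is worth a note: replacing spaces with "+" by hand leaves characters such as ":" and the parentheses unescaped in the query string. A minimal sketch of encoding the whole search term with the standard library follows. It is written for Python 3 (on Python 2 the same function is available as urllib.quote_plus), and the shortened search URL is an assumption made for illustration, not the full URL used above.

```python
from urllib.parse import quote_plus

item = "Wilco Classic Library: Autobiography Of a Yogi (Hardcover)"

# Manual replacement only handles spaces
naive = item.replace(" ", "+")

# quote_plus turns spaces into "+" and percent-encodes other reserved characters
encoded = quote_plus(item)

# Hypothetical, shortened search URL used only to show where the term goes
search_url = "http://www.flipkart.com/search/a/all?query={0}&vertical=all".format(encoded)

print(naive)
print(encoded)
print(search_url)
```

Whether the site accepts the unescaped form is not stated here; encoding the term simply avoids depending on that.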

Solution

Just use the text attribute of titleSoup, which returns all the text inside the tag with the nested <b> markup removed:

>>> titleSoup=firstBlockSoup.find('a',attrs={'class':'fk-srch-title-text fksd-bodytext'})
>>> titleSoup.text
u'Wilco Classic Library: Autobiography Of a Yogi (Hardcover)'
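
To try this without hitting the live site, here is a small self-contained sketch in Python 3 with bs4. The HTML snippet is an assumption, reconstructed from the output shown in the question, and not a copy of the real Flipkart page; get_text() is the method behind the text attribute.

```python
from bs4 import BeautifulSoup

# Assumed markup: every word in its own <b>, punctuation left outside
html = (
    '<div class="fk-srch-item">'
    '<a class="fk-srch-title-text fksd-bodytext" href="#">'
    '<b>Wilco</b> <b>Classic</b> <b>Library</b>: <b>Autobiography</b> '
    '<b>Of</b> <b>a</b> <b>Yogi</b> (<b>Hardcover</b>)'
    '</a>'
    '<b class="fksd-bodytext price final-price">Rs. 138</b>'
    '</div>'
)

soup = BeautifulSoup(html, "html.parser")
block = soup.find("div", attrs={"class": "fk-srch-item"})
title_link = block.find("a", attrs={"class": "fk-srch-title-text fksd-bodytext"})

# find_all gives back the individual <b> tags, as in the question
bold_tags = title_link.find_all("b")

# get_text concatenates every text node under the link, punctuation included
title = title_link.get_text()

print(bold_tags)
print(title)
```

The second print shows the full title, because the text nodes between the <b> tags (spaces, colon, parentheses) are part of the link's text even though they are not inside any <b>.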

This will also work:

invalid_tags = ['b']
titleSoup = firstBlockSoup.find('a', attrs={'class': 'fk-srch-title-text fksd-bodytext'})

# Replace each unwanted tag with its own children, keeping only the text
for tag in invalid_tags:
    for match in titleSoup.findAll(tag):
        match.replaceWithChildren()
print "".join(titleSoup.contents)
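
The same idea can be sketched in Python 3 with the current bs4 spelling, where unwrap() is the method that replaceWithChildren refers to. The one-line HTML fragment is again an assumption for illustration only.

```python
from bs4 import BeautifulSoup

# Assumed fragment: a link whose words are each wrapped in <b>
html = '<a class="title"><b>Autobiography</b> <b>Of</b> <b>a</b> <b>Yogi</b></a>'
soup = BeautifulSoup(html, "html.parser")
link = soup.find("a")

# unwrap() removes the tag itself but leaves its children in place
for bold in link.find_all("b"):
    bold.unwrap()

# Only text nodes remain, so they can be joined directly
title = "".join(link.contents)
print(title)
```

This variant modifies the parsed tree, which is useful if the cleaned markup is needed afterwards; for reading the title alone, the text attribute is the shorter route.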
