Flipkart.com product 'price' and product 'title' extraction using Python
Problem description
I have written the following Python code to extract the price of a specified item from flipkart.com:
import urllib2
import bs4
import re

item = "Wilco Classic Library: Autobiography Of a Yogi (Hardcover)"
# str.replace returns a new string, so the result has to be assigned back
item = item.replace(" ", "+")
link = 'http://www.flipkart.com/search/a/all?query={0}&vertical=all&dd=0&autosuggest[as]=off&autosuggest[as-submittype]=entered&autosuggest[as-grouprank]=0&autosuggest[as-overallrank]=0&autosuggest[orig-query]=&autosuggest[as-shown]=off&Search=%C2%A0&otracker=start&_r=YSWdYULYzr4VBYklfpZRbw--&_l=pMHn9vNCOBi05LKC_PwHFQ--&ref=a2c6fadc-2e24-4412-be6a-ce02c9707310&selmitem=All+Categories'.format(item)
r = urllib2.Request(link, headers={"User-Agent": "Python-urlli~"})
try:
    response = urllib2.urlopen(r)
except urllib2.URLError:
    print "Internet connection error"
    raise  # without a response there is nothing to parse

thePage = response.read()
soup = bs4.BeautifulSoup(thePage)

# the first search result block
firstBlockSoup = soup.find('div', attrs={'class': 'fk-srch-item'})

priceSoup = firstBlockSoup.find('b', attrs={'class': 'fksd-bodytext price final-price'})
price = priceSoup.contents[0]
print price

titleSoup = firstBlockSoup.find('a', attrs={'class': 'fk-srch-title-text fksd-bodytext'})
title = titleSoup.findAll('b')
print title
When executed, the above code prints the price without issues:
Rs. 138
But the title is obtained as follows:
[<b>Wilco</b>, <b>Classic</b>, <b>Library</b>, <b>Autobiography</b>, <b>Of</b>, <b>a</b>, <b>Yogi</b>, <b>Hardcover</b>]
The reason is apparent from the source code of the page (use 'Inspect element'): each word of the title is wrapped in its own <b> tag, so findAll('b') returns a list of tags rather than a single string.
How do I extract the title in the proper format, so that it prints:
Wilco Classic Library: Autobiography Of a Yogi (Hardcover)
Just use the text attribute of titleSoup:
>>> titleSoup=firstBlockSoup.find('a',attrs={'class':'fk-srch-title-text fksd-bodytext'})
>>> titleSoup.text
u'Wilco Classic Library: Autobiography Of a Yogi (Hardcover)'
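To see the difference without hitting the site, here is a minimal, self-contained sketch. The HTML string is a hypothetical stand-in for one search result (the real markup is not shown in the question), and the snippet is written for Python 3 with the `find_all`/`get_text` spellings, which in bs4 correspond to `findAll` and `.text`.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: an assumption about what one result block might look like,
# with every title word in its own <b> tag and punctuation outside the tags.
html = (
    '<div class="fk-srch-item">'
    '<a class="fk-srch-title-text fksd-bodytext" href="#">'
    '<b>Wilco</b> <b>Classic</b> <b>Library</b>: <b>Autobiography</b> '
    '<b>Of</b> <b>a</b> <b>Yogi</b> (<b>Hardcover</b>)'
    '</a></div>'
)

soup = BeautifulSoup(html, "html.parser")
title_soup = soup.find("a", class_="fk-srch-title-text")

# find_all returns the individual <b> tags, which is what the question printed
words = title_soup.find_all("b")
print(words)

# get_text() (the same value as .text) concatenates all text inside the <a> tag,
# including the colon and parentheses that sit between the <b> tags
title = title_soup.get_text()
print(title)  # Wilco Classic Library: Autobiography Of a Yogi (Hardcover)
```

If the real page has extra whitespace around the words, `get_text(" ", strip=True)` may give a cleaner result, at the cost of inserting a space between every text fragment.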
This will also work:
invalid_tags = ['b']
titleSoup = firstBlockSoup.find('a', attrs={'class': 'fk-srch-title-text fksd-bodytext'})
for tag in invalid_tags:
    for match in titleSoup.findAll(tag):
        # replace each <b> tag with its own children, leaving only the text
        match.replaceWithChildren()
print "".join(titleSoup.contents)
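Separately from the title problem, the search URL in the question is built by replacing spaces in the item name with '+'. The title also contains a colon and parentheses, which are normally percent-encoded in a query string. A hedged sketch of doing this with the standard library follows; it uses Python 3 names (`urllib.parse.quote_plus` and `urlencode`; in Python 2 the same functions live in `urllib`), and the shortened parameter set is an assumption for illustration, not Flipkart's required set.

```python
from urllib.parse import quote_plus, urlencode

item = "Wilco Classic Library: Autobiography Of a Yogi (Hardcover)"

# quote_plus turns spaces into '+' and percent-encodes reserved characters
encoded = quote_plus(item)
print(encoded)  # Wilco+Classic+Library%3A+Autobiography+Of+a+Yogi+%28Hardcover%29

# urlencode applies the same encoding to every value of a parameter mapping.
# The two parameters below are a hypothetical subset of the original query string.
params = {"query": item, "vertical": "all"}
url = "http://www.flipkart.com/search/a/all?" + urlencode(params)
print(url)
```

Building the query from a mapping also avoids hand-editing one long string when a parameter changes.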