Flipkart.com product 'price' and product 'title' extraction using Python

Views: 114 Published: 2016/8/5 19:14:50 Tags: python python-2.7 web-scraping beautifulsoup

Problem description

I have written the following Python code to extract the price of a specified item from flipkart.com:

import urllib2
import bs4

item = "Wilco Classic Library: Autobiography Of a Yogi (Hardcover)"
# str.replace returns a new string, so the result has to be assigned back
item = item.replace(" ", "+")
link = 'http://www.flipkart.com/search/a/all?query={0}&vertical=all&dd=0&autosuggest[as]=off&autosuggest[as-submittype]=entered&autosuggest[as-grouprank]=0&autosuggest[as-overallrank]=0&autosuggest[orig-query]=&autosuggest[as-shown]=off&Search=%C2%A0&otracker=start&_r=YSWdYULYzr4VBYklfpZRbw--&_l=pMHn9vNCOBi05LKC_PwHFQ--&ref=a2c6fadc-2e24-4412-be6a-ce02c9707310&selmitem=All+Categories'.format(item)
r = urllib2.Request(link, headers={"User-Agent": "Python-urlli~"})
try:
    response = urllib2.urlopen(r)
except urllib2.URLError:
    # Without a response there is nothing to parse, so report and re-raise
    print "Internet connection error"
    raise
thePage = response.read()
soup = bs4.BeautifulSoup(thePage)

# First search result block on the page
firstBlockSoup = soup.find('div', attrs={'class': 'fk-srch-item'})

# Price of the first result
priceSoup = firstBlockSoup.find('b', attrs={'class': 'fksd-bodytext price final-price'})
price = priceSoup.contents[0]
print price

# Title of the first result: findAll('b') returns a list of <b> tags, not a string
titleSoup = firstBlockSoup.find('a', attrs={'class': 'fk-srch-title-text fksd-bodytext'})
title = titleSoup.findAll('b')
print title

When executed, the code above prints the price without issues:

Rs. 138

But the title is obtained as follows:

[<b>Wilco</b>, <b>Classic</b>, <b>Library</b>, <b>Autobiography</b>, <b>Of</b>, <b>a</b>, <b>Yogi</b>, <b>Hardcover</b>]

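The reason will be apparent if you look at the source code of the product page (http://www.flipkart.com/search-books?query=Wilco%20Classic%20Library:Autobiography%20Of%20a%20Yogi%20%28Hardcover%29&from=all&searchGroup=all&autosuggest%5Bas%5D=off&autosuggest%5Bas-submittype%5D=entered&autosuggest%5Bas-grouprank%5D=0&autosuggest%5Bas-overallrank%5D=0&autosuggest%5Borig-query%5D=&autosuggest%5Bas-shown%5D=off&selmitem=All%20Categories&otracker=start&vertical=all&_l=WE_JphiGfT8Bh6aXp1vT2w--&_r=_MFMNA8pxFY3ZpKGrqRTOA--&ref=b015ecfa-833a-4e1b-b50e-11b9515d2498; use 'Inspect element').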

Now, how do I extract the title in a proper format, so that it prints:

Wilco Classic Library: Autobiography Of a Yogi (Hardcover)

Solution

Just use the text attribute of titleSoup:

>>> titleSoup=firstBlockSoup.find('a',attrs={'class':'fk-srch-title-text fksd-bodytext'})
>>> titleSoup.text
u'Wilco Classic Library: Autobiography Of a Yogi (Hardcover)'
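
To see why this works without hitting the live site, here is a minimal sketch on a hand-written snippet. The markup is an assumption modelled on the output shown in the question (each title word wrapped in its own <b> tag, with the punctuation outside the tags); the real page may differ. It is written for a current bs4 install, where get_text() returns the same string as the .text attribute, and it looks the link up by one of its classes only.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: every word of the title sits in its own <b> tag,
# while the colon and the parentheses are plain text between the tags.
html = ('<a class="fk-srch-title-text fksd-bodytext">'
        '<b>Wilco</b> <b>Classic</b> <b>Library</b>: <b>Autobiography</b> '
        '<b>Of</b> <b>a</b> <b>Yogi</b> (<b>Hardcover</b>)</a>')

soup = BeautifulSoup(html, "html.parser")
title_soup = soup.find('a', attrs={'class': 'fk-srch-title-text'})

# find_all('b') only sees the <b> children, so the text between them is lost
words_only = " ".join(b.get_text() for b in title_soup.find_all('b'))

# get_text() concatenates every string under the tag, punctuation included
title = title_soup.get_text()

print(words_only)  # Wilco Classic Library Autobiography Of a Yogi Hardcover
print(title)       # Wilco Classic Library: Autobiography Of a Yogi (Hardcover)
```

The first print shows what is left when only the <b> tags are collected; the second shows the full title recovered from the enclosing <a> tag.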

This will also work:

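invalid_tags = ['b']
titleSoup = firstBlockSoup.find('a', attrs={'class': 'fk-srch-title-text fksd-bodytext'})

# Replace each unwanted tag with its own children, leaving only the text behind
for tag in invalid_tags:
    for match in titleSoup.findAll(tag):
        match.replaceWithChildren()
# Every remaining child is now a string, so the pieces can be joined directly
print "".join(titleSoup.contents)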
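
The same tag-stripping idea can be tried offline. This is only a sketch under assumptions: the markup is hand-written to resemble the question's output, and it uses unwrap(), the current bs4 name for replaceWithChildren().

```python
from bs4 import BeautifulSoup

# Hypothetical markup modelled on the question's output
html = ('<a class="fk-srch-title-text fksd-bodytext">'
        '<b>Wilco</b> <b>Classic</b> <b>Library</b>: <b>Autobiography</b> '
        '<b>Of</b> <b>a</b> <b>Yogi</b> (<b>Hardcover</b>)</a>')

soup = BeautifulSoup(html, "html.parser")
title_soup = soup.find('a', attrs={'class': 'fk-srch-title-text'})

invalid_tags = ['b']
for tag in invalid_tags:
    # find_all returns a list built up front, so unwrapping while looping is safe
    for match in title_soup.find_all(tag):
        match.unwrap()  # drop the <b> wrapper, keep the text inside it

# Only strings are left under the <a> tag, so they can be joined directly
title = "".join(title_soup.contents)
print(title)  # Wilco Classic Library: Autobiography Of a Yogi (Hardcover)
```

Note that the join only works when every remaining child is a string; if some other tag were left under the link, joining would raise a TypeError, which is one reason the text attribute is the simpler choice here.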
