Extracting/Scraping text from a href inside p inside div

Views: 134 Published: 2016/8/5 19:11:34 Tags: python html web-scraping beautifulsoup screen-scraping

Problem description

I am using Beautiful Soup (bs4) with Python, and I currently have this structure:

<div class="class1">
  <a class="name" href="/doctor/dr-xxxxxxxxx"><h2>Dr. XX XXXX</h2></a>
  <p class="specialties"><a href="/location/abcd">ab cd</a></p>
  <p class="doc-clinic-name">
     <a class="light_grey link" href="/clinic/fff">f ff</a>
  </p>
</div>


<div class="class2">
  <p class="locality">
    <a class="link grey" href="/location/doctors/ccc">c cc</a>
  </p>
  <p class="fees">INR 999</p>
  <div class="timings">
    <p><span class="strong">MON-SAT</span><br/><span>11:00AM-1:00PM</span> <span>6:00PM-8:00PM</span></p>
    <div class="clear"></div>
  </div>
</div>

So far, the code I have is this:

import urllib2
from bs4 import BeautifulSoup

soup = BeautifulSoup(urllib2.urlopen('abc.com').read())

 for post in soup.find("div", "class1"):
print post

for x in soup.find("div", "class2"):
    print x

So basically post and x contain the divs class1 and class2. The information I want to extract is:

DR.XXXXXX
abcd
fff
ccc
INR 999
MON-SAT 11:00AM-1:00PM

How do I drill down inside the post and x variables to get the required info? Thanks.

Edit

I have added spaces in the HTML. Is it possible to produce a CSV in the following format without harming the spaces?

DR. XX XXXX,ab cd,f ff,c cc,INR 999,MON-SAT 11:00AM-1:00PM

Recommended answer

First off, your indentation looks wrong. Secondly, I don't think you need a for loop when you are only using find, since it should return just the first match.
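To make the difference concrete, here is a minimal sketch (assuming Python 3 and the built-in "html.parser" backend; the markup is a trimmed copy of the question's HTML, inlined so it runs without a network call). It shows that find gives back a single tag, that looping over that tag only walks its direct children, and that find_all is the call that returns every match.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Trimmed copy of the question's markup, inlined for illustration.
html = """
<div class="class1">
  <a class="name" href="/doctor/dr-xxxxxxxxx"><h2>Dr. XX XXXX</h2></a>
  <p class="specialties"><a href="/location/abcd">ab cd</a></p>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# find() returns the first matching Tag (or None when nothing matches).
first_div = soup.find("div", "class1")

# Iterating over a Tag walks its direct children (tags and whitespace strings),
# which is what the loops in the question end up doing.
child_tags = [child.name for child in first_div if isinstance(child, Tag)]

# find_all() returns every matching descendant.
link_texts = [a.get_text(strip=True) for a in first_div.find_all("a")]

print(first_div.name)
print(child_tags)
print(link_texts)
```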

If you just want the text of the links, you could try:

# Print the visible text of every <a> inside the first div.class1
for link in soup.find("div", {"class": "class1"}).find_all("a"):
    print(link.text)

Or, if you want the link target (the href) itself:

# Print the href attribute of every <a> inside the first div.class1
for link in soup.find("div", {"class": "class1"}).find_all("a"):
    print(link.get("href"))

You should also note the method used to search for a class: passing a dict to the find method. (I suspect there are other ways of doing this; this is just the way I learnt to do it!)
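For what it's worth, bs4 does offer a couple of other spellings for a class search. The sketch below (again assuming Python 3 and "html.parser", with a one-line sample of the question's markup) compares the dict form with the class_ keyword argument and a CSS selector via select_one; all three should land on the same element here.

```python
from bs4 import BeautifulSoup

# One-line sample based on the question's markup.
html = (
    '<div class="class1">'
    '<a class="name" href="/doctor/dr-xxxxxxxxx"><h2>Dr. XX XXXX</h2></a>'
    "</div>"
)
soup = BeautifulSoup(html, "html.parser")

# 1. Attribute dict, as used in this answer.
by_dict = soup.find("div", {"class": "class1"})

# 2. Keyword argument; the trailing underscore avoids Python's reserved word "class".
by_keyword = soup.find("div", class_="class1")

# 3. CSS selector; select_one() returns the first match or None.
by_css = soup.select_one("div.class1")

print(by_dict == by_keyword == by_css)
```

Which one to use is mostly a matter of taste; the selector form tends to read more compactly once the path gets several levels deep.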

You can therefore be as specific as you need to be, e.g.:

doctorlink = soup.find("div", {"class": "class1"}).find("a", {"class": "name"})
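Putting the pieces together for the CSV asked about in the edit, one possible sketch follows. It assumes Python 3, the "html.parser" backend, and the question's HTML inlined as a string; text_of is a hypothetical helper introduced only for this example. The text is kept exactly as it appears in the markup (so "Dr." is not upper-cased), and internal spaces survive because only leading and trailing whitespace is stripped.

```python
import csv
import io

from bs4 import BeautifulSoup

# The question's markup, inlined so the sketch runs offline.
html = """
<div class="class1">
  <a class="name" href="/doctor/dr-xxxxxxxxx"><h2>Dr. XX XXXX</h2></a>
  <p class="specialties"><a href="/location/abcd">ab cd</a></p>
  <p class="doc-clinic-name">
     <a class="light_grey link" href="/clinic/fff">f ff</a>
  </p>
</div>
<div class="class2">
  <p class="locality">
    <a class="link grey" href="/location/doctors/ccc">c cc</a>
  </p>
  <p class="fees">INR 999</p>
  <div class="timings">
    <p><span class="strong">MON-SAT</span><br/><span>11:00AM-1:00PM</span> <span>6:00PM-8:00PM</span></p>
    <div class="clear"></div>
  </div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
card = soup.find("div", {"class": "class1"})
details = soup.find("div", {"class": "class2"})


def text_of(parent, name, css_class):
    """Hypothetical helper: stripped text of the first match, or '' if absent."""
    tag = parent.find(name, {"class": css_class})
    return tag.get_text(strip=True) if tag else ""


# The first two <span> elements hold the days and the first time slot.
spans = details.find("div", {"class": "timings"}).find_all("span")
timing = " ".join(span.get_text(strip=True) for span in spans[:2])

row = [
    text_of(card, "a", "name"),
    text_of(card, "p", "specialties"),
    text_of(card, "p", "doc-clinic-name"),
    text_of(details, "p", "locality"),
    text_of(details, "p", "fees"),
    timing,
]

# csv.writer handles quoting if a field ever contains a comma.
buffer = io.StringIO()
csv.writer(buffer).writerow(row)
csv_line = buffer.getvalue().strip()
print(csv_line)
```

To write a real file instead of an in-memory buffer, the same writer can be pointed at a file opened with newline="".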
