BeautifulSoup: Replace anchor text with text from another tag


Question

I'm trying to extract all links on a page. So far I can get the links, but their anchor text doesn't provide any relevant information. That information is contained in a sibling tag.

This is the HTML layout:

<tbody>
<tr>
    <td>
        <h3>Driver with license E or F</h3>
        <div class="date">..</div>
        <br>
        <p>...</p>
        <div id='print'>
            <a href="show_classifieds?..." class="bar">Go To Details</a>
        </div>
        <br>
    </td>
</tr>
<tr>
    <td>
        <h3>Payroll Administrator</h3>
        <div class="date">..</div>
        <br>
        <p>...</p>
        <div id='print'>
            <a href="show_classifieds?..." class="bar">Go To Details</a>
        </div>
        <br>
    </td>
</tr>
<tr>
    <td>
        <h3>Head of Sales and Marketing</h3>
        <div class="date">..</div>
        <br>
        <p>...</p>
        <div id='print'>
            <a href="show_classifieds?..." class="bar">Go To Details</a>
        </div>
        <br>
    </td>
</tr>
</tbody>

When I extract the links, I get the following:

<a href="show_classifieds?..." class="bar">Go To Details</a>
<a href="show_classifieds?..." class="bar">Go To Details</a>
<a href="show_classifieds?..." class="bar">Go To Details</a>

But:

  1. I'm interested in replacing the text "Go To Details" with the text of the h3 tag in each case.

  2. These links will be displayed on an external website, so I'd prefer them to be absolute rather than relative.

So in the end I'm hoping for something like this:

<a href="http://www.example.com/show_classifieds?..." class="bar">Driver with license E or F</a>
<a href="http://www.example.com/show_classifieds?..." class="bar">Payroll Administrator</a>
<a href="http://www.example.com/show_classifieds?..." class="bar">Head of Sales and Marketing</a>

Any help will be greatly appreciated.

Solution

To give you a stable solution, you really need to make sure that all pages follow exactly the same pattern as your example.

Basic assumption:

The text you want always resides in the h3 tag that is a sibling of the div with id "print", and that div is the parent of the anchor.

from bs4 import BeautifulSoup

# `html` holds the page source shown in the question
soup = BeautifulSoup(html, 'html.parser')
for a in soup.find_all('a'):
    # a.parent is the div with id "print"; the h3 is one of its previous siblings
    header = a.parent.find_previous_sibling('h3').text
    # set the text of the anchor tag to the text of the h3 tag
    a.string = header
    print(a)

Further reading: the tag.string section of the Beautiful Soup documentation.

If you want absolute URLs, you can use urljoin (urllib.parse.urljoin in Python 3) with the site's base URL to construct them.

Output:

<a class="bar" href="show_classifieds?...">Driver with license E or F</a>
<a class="bar" href="show_classifieds?...">Payroll Administrator</a>
<a class="bar" href="show_classifieds?...">Head of Sales and Marketing</a>
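To also cover the absolute-URL part of the question, here is a minimal sketch that combines the text replacement with urljoin. It assumes the base URL is http://www.example.com/ (taken from the desired output in the question) and uses a trimmed-down copy of the question's markup. The relabel_links helper and the BASE_URL and HTML names are made up for this illustration, and anchors whose parent has no preceding h3 are simply skipped.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Assumed base URL, taken from the desired output in the question.
BASE_URL = "http://www.example.com/"

# Trimmed-down sample of the layout from the question.
HTML = """
<table><tbody>
<tr><td>
  <h3>Driver with license E or F</h3>
  <div class="date">..</div>
  <div id="print"><a href="show_classifieds?..." class="bar">Go To Details</a></div>
</td></tr>
<tr><td>
  <h3>Payroll Administrator</h3>
  <div class="date">..</div>
  <div id="print"><a href="show_classifieds?..." class="bar">Go To Details</a></div>
</td></tr>
</tbody></table>
"""


def relabel_links(html, base_url):
    """Return the anchors with h3 text as their label and absolute hrefs."""
    soup = BeautifulSoup(html, "html.parser")
    links = []
    for a in soup.find_all("a", href=True):
        # The anchor's parent is the div with id "print"; look backwards
        # among that div's siblings for the nearest h3.
        heading = a.parent.find_previous_sibling("h3")
        if heading is None:
            # Layout differs from the assumed pattern: leave this anchor out.
            continue
        a.string = heading.get_text(strip=True)
        # Resolve the relative href against the assumed base URL.
        a["href"] = urljoin(base_url, a["href"])
        links.append(a)
    return links


for link in relabel_links(HTML, BASE_URL):
    print(link)
```

Skipping anchors without a matching h3 is only one possible choice. If every page is supposed to follow the same pattern, raising an error instead would make layout changes easier to notice.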
