Extracting text using BeautifulSoup in Python

Question

Could you please help me with this small problem? I want to extract the email, phone and name values from the code below, which sits in a SCRIPT tag (not in the body), using Beautiful Soup (Python). I am new to Python, and blogs recommend Beautiful Soup for this kind of extraction.

I tried fetching the page with the following code:

fileDetails = BeautifulSoup(urllib2.urlopen('http://www.example.com').read())
results = fileDetails.find("email:")

This Ajax request code appears only once in the page. Can we also wrap the lookup in try/except, so that no error is thrown if it is not found in the page?

<script type="text/javascript" language='javascript'> 
$(document).ready( function (){

   $('#message').click(function(){
       alert();
   });

    $('#addmessage').click(function(){
        $.ajax({ 
            type: "POST",
            url: 'http://www.example.com',
            data: { 
                email: 'abc@g.com', 
                phone: '9999999999', 
                name: 'XYZ'
            }
        });
    });
});
</script>

Once I have these values, I also want to store them in an Excel file.

Thanks in advance.

Recommended answer

You can get the script tag contents via BeautifulSoup and then apply a regex to get the desired data.

Working example (based on what you've described in the question):

import re
from bs4 import BeautifulSoup

data = """
<html>
    <head>
        <title>My Sample Page</title>
        <script>
        $.ajax({
            type: "POST",
            url: 'http://www.example.com',
            data: {
                email: 'abc@g.com',
                phone: '9999999999',
                name: 'XYZ'
            }
        });
        </script>
    </head>
    <body>
        <h1>What a wonderful world</h1>
    </body>
</html>
"""

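soup = BeautifulSoup(data, 'html.parser')  # name the parser explicitly
script = soup.find('script')               # first <script> tag in the document

# Capture every "key: 'value'" pair inside the script body
pattern = re.compile(r"(\w+): '(.*?)'")
fields = dict(pattern.findall(script.string))
print(fields['email'], fields['phone'], fields['name'])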

Prints:

abc@g.com 9999999999 XYZ


I don't really like this solution, since the regex approach is fragile: the pattern expects single-quoted values with exactly one space after the colon, so even small changes to the script would break it. I still think there is a better solution and that we are missing the bigger picture here. A link to the specific site would help a lot, but it is what it is.
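The question also asks how to avoid an error when the script is not present in the page. The following is a minimal sketch of one way to do that, not part of the original answer: the helper name `extract_fields`, the `wanted` parameter, and the slightly looser pattern (any amount of whitespace after the colon) are assumptions made for illustration. It checks for a missing tag instead of catching an exception, and reports absent fields as `None`.

```python
import re
from bs4 import BeautifulSoup

# Matches "key: 'value'" pairs; \s* tolerates any spacing after the colon.
FIELD_PATTERN = re.compile(r"(\w+):\s*'(.*?)'")


def extract_fields(html, wanted=("email", "phone", "name")):
    """Return {field: value or None}; never raises when the script or a field is missing."""
    soup = BeautifulSoup(html, "html.parser")
    # First <script> whose body mentions "email:"; find() returns None if there is none.
    script = soup.find("script", string=re.compile(r"email\s*:"))
    found = dict(FIELD_PATTERN.findall(script.string)) if script else {}
    return {key: found.get(key) for key in wanted}


page = (
    "<html><head><script>$.ajax({data: {email: 'abc@g.com', "
    "phone: '9999999999', name: 'XYZ'}});</script></head></html>"
)
print(extract_fields(page))
print(extract_fields("<html><body><p>no script here</p></body></html>"))
```

Because `find()` returns `None` rather than raising, an explicit check is usually enough here; a `try`/`except` is still worth having around the network request that downloads the page.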

UPD (fixing the code OP provided):

soup = BeautifulSoup(data, 'html.parser')
# Find a <script> that follows the <html> element and contains the document-ready handler
script = soup.html.find_next_sibling('script', string=re.compile(r"\$\(document\)\.ready"))

pattern = re.compile(r"(\w+): '(.*?)'")
fields = dict(pattern.findall(script.string))
print(fields['email'], fields['phone'], fields['name'])

Prints (when run against the page the OP provided, not the sample data above):

abcd@gmail.com 9999999999 Shamita Shetty
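
The question also asks about storing the values in an Excel file, which the answer does not cover. As a rough sketch, assuming a CSV file (which Excel can open) is acceptable, the standard-library `csv` module is enough. The helper name `fields_to_csv` and the file name `contacts.csv` are hypothetical, and the sample row simply reuses the values from the example data.

```python
import csv
import io


def fields_to_csv(rows, fieldnames=("email", "phone", "name")):
    """Serialize a list of field dicts to CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(fieldnames), lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


rows = [{"email": "abc@g.com", "phone": "9999999999", "name": "XYZ"}]
csv_text = fields_to_csv(rows)
print(csv_text)

# To save the result for Excel ("contacts.csv" is a hypothetical file name):
# with open("contacts.csv", "w", newline="", encoding="utf-8") as handle:
#     handle.write(csv_text)
```

A native `.xlsx` workbook would need a third-party library; CSV keeps the sketch dependency-free.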
