Extracting text from script tag using BeautifulSoup in Python


Question

Could you please help me with this small problem? I want to extract the email, phone and name values from the code below, which sits in a SCRIPT tag (not in the body), using Beautiful Soup (Python). I am new to Python, and blogs recommend Beautiful Soup for this kind of extraction.

I tried fetching the page with the following code:

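import urllib2
from bs4 import BeautifulSoup
fileDetails = BeautifulSoup(urllib2.urlopen('http://www.example.com').read())
results = fileDetails.find("email:")  # my attempt; find() looks for a tag with this name, so it returns None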

This Ajax request code appears only once in the page. Can we also wrap the lookup in try/except so that no error is raised if it isn't found in the page?

<script type="text/javascript" language='javascript'> 
$(document).ready( function (){

   $('#message').click(function(){
       alert();
   });

    $('#addmessage').click(function(){
        $.ajax({ 
            type: "POST",
            url: 'http://www.example.com',
            data: { 
                email: 'abc@g.com', 
                phone: '9999999999', 
                name: 'XYZ'
            }
        });
    });
});
</script>

Once I have these values, I also want to store them in an Excel file.

Thanks in anticipation.

Recommended answer

As an alternative to a regex-based approach, you can parse the JavaScript code with the slimit module, which builds an abstract syntax tree and lets you collect all assignments into a dictionary:

from bs4 import BeautifulSoup
from slimit import ast
from slimit.parser import Parser
from slimit.visitors import nodevisitor


data = """
<html>
    <head>
        <title>My Sample Page</title>
        <script>
        $.ajax({
            type: "POST",
            url: 'http://www.example.com',
            data: {
                email: 'abc@g.com',
                phone: '9999999999',
                name: 'XYZ'
            }
        });
        </script>
    </head>
    <body>
        <h1>What a wonderful world</h1>
    </body>
</html>
"""

# get the script tag contents from the html
soup = BeautifulSoup(data, 'html.parser')  # name the parser explicitly to avoid the bs4 warning
script = soup.find('script')

# parse js
parser = Parser()
tree = parser.parse(script.text)
fields = {getattr(node.left, 'value', ''): getattr(node.right, 'value', '')
          for node in nodevisitor.visit(tree)
          if isinstance(node, ast.Assign)}

print(fields)

Prints:

{u'name': u"'XYZ'", u'url': u"'http://www.example.com'", u'type': u'"POST"', u'phone': u"'9999999999'", u'data': '', u'email': u"'abc@g.com'"}
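Note that the values in the printed dictionary still carry their JavaScript quote characters. A minimal sketch of cleaning them up and keeping only the three requested keys follows; the `fields` literal simply repeats the output above, and the names `wanted` and `contact` are illustrative, not part of the original answer.

```python
# Dictionary as printed above (the JavaScript quote characters are kept).
fields = {
    'name': "'XYZ'",
    'url': "'http://www.example.com'",
    'type': '"POST"',
    'phone': "'9999999999'",
    'data': '',
    'email': "'abc@g.com'",
}

wanted = ('email', 'phone', 'name')  # keys of interest from the question

# Strip surrounding single or double quotes from each value;
# a key that is missing from `fields` becomes an empty string.
contact = {key: fields.get(key, '').strip('\'"') for key in wanted}

print(contact)
```

Using `fields.get(key, '')` rather than `fields[key]` means a page without one of the keys does not raise a `KeyError`.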

Among other fields, the dictionary contains the email, name and phone values that you are interested in.
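For comparison, here is a rough sketch of the regex-based approach mentioned above, written so that a missing key yields `None` instead of an error, which is one way to cover the try/except part of the question. The helper name `extract_field` is hypothetical, and `script_text` stands in for `script.text` from the example; it assumes the values are simple quoted strings as in the question's snippet.

```python
import re

# Stand-in for script.text obtained via soup.find('script').
script_text = """
$.ajax({
    type: "POST",
    url: 'http://www.example.com',
    data: {
        email: 'abc@g.com',
        phone: '9999999999',
        name: 'XYZ'
    }
});
"""


def extract_field(js_source, key):
    """Return the quoted string that follows `key:`, or None if absent."""
    # (['\"]) captures the opening quote; \1 requires the same closing quote.
    pattern = r"\b" + re.escape(key) + r"\s*:\s*(['\"])(.*?)\1"
    match = re.search(pattern, js_source)
    return match.group(2) if match else None


result = {key: extract_field(script_text, key)
          for key in ('email', 'phone', 'name')}
print(result)
print(extract_field(script_text, 'fax'))  # key not present in the script
```

Keep in mind that `soup.find('script')` itself returns `None` when the page has no script tag, so check for that before reading `.text`.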

Hope that helps.
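As for storing the result in an Excel file, one simple option is to write a CSV file, which Excel can open. The sketch below assumes Python 3 and a `contact` dictionary with the already cleaned values; the helper `write_contacts` and the file name mentioned afterwards are illustrative only.

```python
import csv
import io

# Assumed input: the cleaned values extracted from the script tag.
contact = {'email': 'abc@g.com', 'phone': '9999999999', 'name': 'XYZ'}
columns = ['email', 'phone', 'name']


def write_contacts(stream, rows):
    """Write a header line and one CSV line per dictionary in `rows`."""
    writer = csv.DictWriter(stream, fieldnames=columns)
    writer.writeheader()
    writer.writerows(rows)


# Demonstrate with an in-memory buffer instead of a real file.
buffer = io.StringIO()
write_contacts(buffer, [contact])
print(buffer.getvalue())
```

To write to disk instead, pass a file opened with `open('contacts.csv', 'w', newline='')` in place of the buffer.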
