Extracting text from script tag using BeautifulSoup in Python

Question

I want to extract the email, phone and name values from the code below, which sits in a SCRIPT tag (not in the body), using Beautiful Soup in Python. As far as I can tell, Beautiful Soup can be used for this kind of extraction.

I tried fetching the page with the following code:

fileDetails = BeautifulSoup(urllib2.urlopen('http://www.example.com').read())
results = fileDetails.find("email:")

This Ajax request code appears only once in the page. Can I also wrap the lookup in try/except so that no error is raised when the code is not found in the page?

<script type="text/javascript" language='javascript'> 
$(document).ready( function (){
   
   $('#message').click(function(){
       alert();
   });

    $('#addmessage').click(function(){
        $.ajax({ 
            type: "POST",
            url: 'http://www.example.com',
            data: { 
                email: 'abc@g.com', 
                phone: '9999999999', 
                name: 'XYZ'
            }
        });
    });
});

Once I have these values, I also want to store them in an Excel file.

Thanks in advance.

Recommended answer

As an alternative to the regex-based approach, you can parse the JavaScript code with the slimit module, which builds an Abstract Syntax Tree and lets you collect all assignments into a dictionary:

from bs4 import BeautifulSoup
from slimit import ast
from slimit.parser import Parser
from slimit.visitors import nodevisitor


data = """
<html>
    <head>
        <title>My Sample Page</title>
        <script>
        $.ajax({
            type: "POST",
            url: 'http://www.example.com',
            data: {
                email: 'abc@g.com',
                phone: '9999999999',
                name: 'XYZ'
            }
        });
        </script>
    </head>
    <body>
        <h1>What a wonderful world</h1>
    </body>
</html>
"""

# get the script tag contents from the html
soup = BeautifulSoup(data, 'html.parser')  # name the parser explicitly to avoid bs4's warning
script = soup.find('script')

# parse js
parser = Parser()
tree = parser.parse(script.text)
fields = {getattr(node.left, 'value', ''): getattr(node.right, 'value', '')
          for node in nodevisitor.visit(tree)
          if isinstance(node, ast.Assign)}

print(fields)

Prints (Python 2 repr, hence the u prefixes):

{u'name': u"'XYZ'", u'url': u"'http://www.example.com'", u'type': u'"POST"', u'phone': u"'9999999999'", u'data': '', u'email': u"'abc@g.com'"}

Among other fields, the dictionary contains the email, name and phone values you are interested in.
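The question also asks about missing values and about saving the result to Excel, which the code above does not cover. The following is a minimal sketch, not part of the original answer. It is written for Python 3, hard-codes a dictionary with the same shape as the printed output, and writes CSV (which Excel can open) instead of a native .xlsx file. The helper names clean_row and write_csv and the file name contacts.csv are hypothetical.

```python
import csv
import io

# Same shape as the dictionary printed above; the values keep their
# JavaScript quote characters, so they need to be stripped.
fields = {
    "name": "'XYZ'",
    "url": "'http://www.example.com'",
    "type": '"POST"',
    "phone": "'9999999999'",
    "data": "",
    "email": "'abc@g.com'",
}

WANTED = ("email", "phone", "name")


def clean_row(fields, wanted=WANTED):
    """Return one unquoted value per wanted key; '' when a key is missing."""
    # dict.get avoids a KeyError, so no try/except is needed for absent keys.
    return [fields.get(key, "").strip("'\"") for key in wanted]


def write_csv(rows, stream, header=WANTED):
    """Write a header line followed by the rows to an open text stream."""
    writer = csv.writer(stream)
    writer.writerow(header)
    writer.writerows(rows)


row = clean_row(fields)
buffer = io.StringIO()  # stand-in for open("contacts.csv", "w", newline="")
write_csv([row], buffer)
print(buffer.getvalue())
```

Because clean_row falls back to an empty string, a page without the Ajax block simply yields an empty row rather than an exception.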

Hope that helps.
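For comparison, the regex-based approach mentioned at the start of this answer might look roughly like the sketch below. It is an illustration under assumptions, not the code from that other answer: it is written for Python 3, the pattern only handles simple quoted values like the ones in the question, and extract_fields is a hypothetical helper. It takes the script text as a plain string (for example the contents of the tag returned by soup.find('script')) and returns an empty dictionary instead of raising when nothing is found.

```python
import re

# Matches e.g.  email: 'abc@g.com'  or  name: "XYZ"  inside the script text.
# Group 2 captures the opening quote so that \2 requires the same closing quote.
FIELD_RE = re.compile(r"""\b(email|phone|name)\s*:\s*(['"])(.*?)\2""")


def extract_fields(script_text):
    """Return a dict of the email/phone/name values found; {} when none match."""
    if not script_text:  # e.g. the page had no script tag at all
        return {}
    return {key: value for key, _quote, value in FIELD_RE.findall(script_text)}


sample = """
$.ajax({
    type: "POST",
    url: 'http://www.example.com',
    data: {
        email: 'abc@g.com',
        phone: '9999999999',
        name: 'XYZ'
    }
});
"""

print(extract_fields(sample))
print(extract_fields("alert();"))
```

A regex like this is shorter than building a syntax tree but more fragile: values containing escaped quotes or built by string concatenation would not be captured correctly, which is where the slimit approach is the safer choice.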
