How do I extract a long string of text from some JavaScript on a web page using BeautifulSoup?


Question

I'm trying to write a script that logs into a website, but to do that I need to present the captcha. The only way to get the direct captcha image from the URL is to extract the giant string named 'challenge', but for some reason I have not been able to do it with BeautifulSoup. What is the best way to extract the long string?

var RecaptchaState = {
    site : '4LfjPgEA56AABAJExraAeYXdMbVhPcG__Hyv-URXF',
    challenge : '03AHJ_VusE_PgNB0vfBpD2h53o8uGMt1MeKi9bzhOTsjt0ze7SKmHVNe8uADceoU3JLPjpp8cJCVDGiYKo1ho-r1JcV19tm26doUHqevixJjH8SZ26i4EWbUOQLEuODf0Kt6JI0ZhtfiIaIXDg9MhUyDCEt_qxFWbSHA',
    is_incorrect : false,
    programming_error : '',
    error_message : '',
    server : 'http://www.google.com/recaptcha/api/',
    timeout : 18000
};

document.write('
<scr>
 ');
</scr>

Solution

I'd just use a regular expression. I'm not sure about this, but I don't think BeautifulSoup parses JavaScript, only (X)HTML. With the page source in a string x:

import re
challenge = re.search(r"challenge *: *'(\S+)'", x).group(1)

Gives:

'03AHJ_VusE_PgNB0vfBpD2h53o8uGMt1MeKi9bzhOTsjt0ze7SKmHVNe8uADceoU3JLPjpp8cJCVDGiYKo1ho-r1JcV19tm26doUHqevixJjH8SZ26i4EWbUOQLEuODf0Kt6JI0ZhtfiIaIXDg9MhUyDCEt_qxFWbSHA'
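If you want something a bit more defensive, here is a minimal sketch of the same idea. It assumes the page looks roughly like the snippet in the question; SAMPLE_PAGE, ScriptCollector and extract_challenge are names made up for this illustration, and the values in the sample are placeholders. It uses the standard library's html.parser to pull out the raw text of each script element (with BeautifulSoup you would take the script text from the parsed tags instead), then runs a regex over that text. The pattern uses [^']+ rather than \S+, so it should still stop at the closing quote if the JavaScript is minified onto a single line, and the function returns None instead of raising when nothing matches.

```python
import re
from html.parser import HTMLParser

# Hypothetical sample page, modelled on the snippet in the question.
SAMPLE_PAGE = """
<html><body>
<script type="text/javascript">
var RecaptchaState = {
    site : 'SITE_KEY_PLACEHOLDER',
    challenge : 'CHALLENGE_PLACEHOLDER-123_abc',
    is_incorrect : false,
    timeout : 18000
};
</script>
</body></html>
"""

# [^']+ stops at the first closing quote, even on one-line JavaScript.
CHALLENGE_RE = re.compile(r"challenge\s*:\s*'([^']+)'")


class ScriptCollector(HTMLParser):
    """Collects the raw text of every <script> element."""

    def __init__(self):
        super().__init__()
        self._in_script = False
        self.scripts = []

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self._in_script = True
            self.scripts.append("")

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_script = False

    def handle_data(self, data):
        # The parser hands over script bodies as plain, unparsed text.
        if self._in_script:
            self.scripts[-1] += data


def extract_challenge(page_source):
    """Return the challenge string, or None if no script contains one."""
    collector = ScriptCollector()
    collector.feed(page_source)
    for script_text in collector.scripts:
        match = CHALLENGE_RE.search(script_text)
        if match:
            return match.group(1)
    return None


print(extract_challenge(SAMPLE_PAGE))  # prints CHALLENGE_PLACEHOLDER-123_abc
```

Either way, the HTML parser only gets you as far as the script text; the value inside the JavaScript object still has to be dug out with a regex or similar.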
