Problems retrieving data with Python web scraping


Question

I wrote a simple script to scrape data from a web page. I specified the tag and class of each element, but my program does not scrape any data. There is also an email address that I want to scrape, but I do not know how to refer to its id or class. Could you please guide me on how to fix this? Thanks!

Here is my code:

import requests
from bs4 import BeautifulSoup
import csv

def get_page(url):
    response = requests.get(url)

    if not response.ok:
        print('server responded:', response.status_code)
        return None  # no page to parse
    return BeautifulSoup(response.text, 'html.parser')

def get_detail_data(soup):

    try:
        title = soup.find('hi',class_="page-header",id=False).text
    except:
        title = 'empty'  
    print(title)
    try:
        email = soup.find('',class_="",id=False).text
    except:
        email = 'empty'  
    print(email)



def main():
    url = "https://www.igrc.org/clergydetail/2747164"
    #get_page(url)
    get_detail_data(get_page(url))
if __name__ == '__main__':
    main()

Recommended answer

Note that the email address is not present in the page as plain text. Its HTML is written out by JavaScript in a script tag:

<script type="text/javascript">document.write(String.fromCharCode(60,97,32,104,114,101,102,61,34,35,34,32,115,116,121,108,101,61,34,117,110,105,99,111,100,101,45,98,105,100,105,58,98,105,100,105,45,111,118,101,114,114,105,100,101,59,100,105,114,101,99,116,105,111,110,58,114,116,108,59,34,32,111,110,99,108,105,99,107,61,34,116,104,105,115,46,104,114,101,102,61,83,116,114,105,110,103,46,102,114,111,109,67,104,97,114,67,111,100,101,40,49,48,57,44,57,55,44,49,48,53,44,49,48,56,44,49,49,54,44,49,49,49,44,53,56,44,49,49,52,44,49,49,49,44,57,56,44,54,52,44,49,48,57,44,49,48,49,44,49,49,54,44,49,48,52,44,49,49,49,44,49,48,48,44,49,48,53,44,49,49,53,44,49,49,54,44,52,54,44,57,57,44,57,57,41,59,34,62,38,35,57,57,59,38,35,57,57,59,38,35,52,54,59,38,35,49,49,54,59,38,35,49,49,53,59,38,35,49,48,53,59,38,35,49,48,48,59,38,35,49,49,49,59,38,35,49,48,52,59,38,35,49,49,54,59,38,35,49,48,49,59,38,35,49,48,57,59,38,35,54,52,59,38,35,57,56,59,38,35,49,49,49,59,38,35,49,49,52,59,60,47,97,62));</script>

The argument list contains the character codes (ASCII codes) of the HTML. Decoded, it gives:

<a href="#" style="unicode-bidi:bidi-override;direction:rtl;" onclick="this.href=String.fromCharCode(109,97,105,108,116,111,58,114,111,98,64,109,101,116,104,111,100,105,115,116,46,99,99);">&#99;&#99;&#46;&#116;&#115;&#105;&#100;&#111;&#104;&#116;&#101;&#109;&#64;&#98;&#111;&#114;</a>

This needs to be decoded too. We only need the mailto value in the onclick attribute: the content of the mailto is unchanged, whereas the text of the a tag is reversed (it is displayed with direction: rtl, as Hugo noticed):

mailto:john@doe.inc
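
The reversal of the link text can be checked offline with a minimal sketch that uses only the standard library. The entity string and the address below are placeholders built from john@doe.inc, not the values served by the page; `html.unescape` turns the numeric entities into characters, and slicing with `[::-1]` undoes the right-to-left ordering.

```python
import html

# Placeholder values in the same shape as the decoded <a> element.
# "john@doe.inc" is a stand-in address, not the one served by the page.
link_text = "&#99;&#110;&#105;&#46;&#101;&#111;&#100;&#64;&#110;&#104;&#111;&#106;"
mailto = "mailto:john@doe.inc"

visible = html.unescape(link_text)  # numeric entities -> characters, still reversed
address = visible[::-1]             # undo the right-to-left ordering

print(visible)
print(address)
print(address == mailto.split(":", 1)[1])  # does the link text match the mailto value?
```

Since both routes lead to the same address, reading the onclick value is the simpler one: it needs no reversal.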

The following Python code extracts the email address:

import re

import requests
from bs4 import BeautifulSoup

r = requests.get("https://www.igrc.org/clergydetail/2747164")
soup = BeautifulSoup(r.text, 'html.parser')

titleContainer = soup.find(class_="page-header")
title = titleContainer.text.strip() if titleContainer else "empty"

def parse(data):
    # Find the first parenthesised, comma-separated list of character codes
    # and turn it back into a string.
    res = re.search(r'\(([\d,]+)\)', data)
    return "".join(chr(int(i)) for i in res.group(1).split(","))

email = "empty"
# The email script is assumed to be the first <script> after the title.
emailScript = titleContainer.find_next("script") if titleContainer else None
if emailScript:
    emailData1 = parse(emailScript.text)     # first pass: the <a> element
    email = parse(emailData1).split(":")[1]  # second pass: "mailto:<address>"

print(title)
print(email)

The encoding can be reproduced in the other direction with the following code:

def encode(data):
    return ",".join([str(ord(i)) for i in data])

mail = "john@doe.inc"
encodedMailTo = encode("mailto:" + mail)
# The link text is stored reversed because direction:rtl displays it right to left.
encodedHtmlEmail = "".join(["&#" + str(ord(i)) + ";" for i in reversed(mail)])

htmlContainer = f'<a href="#" onclick="this.href=String.fromCharCode({encodedMailTo});" style="unicode-bidi:bidi-override;direction:rtl;">{encodedHtmlEmail}</a>'

encodedHtmlContainer = encode(htmlContainer)
scriptContainer = f'<script type="text/javascript">document.write(String.fromCharCode({encodedHtmlContainer}));</script>'

print(scriptContainer)
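
A quick way to convince yourself that decoding and encoding are inverses is a round trip that needs no network access. The sketch below is an illustration under stated assumptions: it redefines `encode` and a `parse` helper so that it is self-contained, uses the placeholder address john@doe.inc, and simplifies the anchor markup (no style attribute, dummy link text).

```python
import re

def encode(data):
    # String -> comma-separated character codes, as String.fromCharCode expects.
    return ",".join(str(ord(ch)) for ch in data)

def parse(data):
    # First parenthesised list of character codes -> string.
    match = re.search(r"\(([\d,]+)\)", data)
    return "".join(chr(int(code)) for code in match.group(1).split(","))

mail = "john@doe.inc"  # placeholder address
mailto_codes = encode("mailto:" + mail)

# Simplified stand-ins for the two layers found on the page.
anchor = f'<a href="#" onclick="this.href=String.fromCharCode({mailto_codes});">link</a>'
script = f"document.write(String.fromCharCode({encode(anchor)}));"

first_pass = parse(script)       # recovers the <a> element
second_pass = parse(first_pass)  # recovers "mailto:<address>"

print(first_pass == anchor)
print(second_pass)
```

Two passes are needed because the address is wrapped twice: once inside the onclick attribute and once more when the whole element is passed to document.write.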
