Scrape information from a challenging website with no guiding HTML structure


Problem description

I need to scrape some information from a very challenging website.

Here is an example:

<div class="overview">
        <span class="course_titles">Courses:</span> 
        <a href="/schools/courses/173/" class="course_name">Math101</a> (Math; Monday; Room 10);
        <a href="/schools/student/1388/" class="coursestudent_name">Mark</a> 17, 
        <a href="/schools/student/1401/" class="coursestudent_name">Alex</a> 18, ), 

        <a href="/schools/courses/2693/" class="course_name">English101</a> (English; Thursdays; Room 12); 
        <a href="/schools/student/1403/" class="coursestudent_name">Sarah</a> 16, 
        <a href="/schools/student/1411/" class="coursestudent_name">Nancy</a> 17, 
        <a href="/schools/student/1390/" class="coursestudent_name">Casey</a> 17 ), 
</div>

Each course has specific students, with each student's age given after their name (the stray characters were already in the source).

I need to scrape each course with its respective students, plus their ages.

Unfortunately, there is no inherent hierarchy besides the all-encompassing div. I tried scraping with BeautifulSoup by "course_name" and then adding all items that have the "coursestudent_name" class, but this way every student on the page gets added to every course.

I wish I could change the website, but I cannot. Does anyone have an idea how I could get the information per course with the correct students?

Thank you!

Recommended answer

You can do most of this with BeautifulSoup, plus a tiny bit of regex to get each student's age, which isn't inside any HTML tag. Because find_all returns the links in document order, every student link belongs to the most recent course link seen before it.

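import re
from bs4 import BeautifulSoup

# `html` is assumed to hold the page source shown in the question
soup = BeautifulSoup(html, "html.parser")
allA = soup.find("div", {"class" : "overview"}).find_all("a")

classInfo = {}
currentClass = None
for item in allA:
    if item['class'] == ['course_name']:
        # A course link starts a new group of students
        classInfo[item.text] = []
        currentClass = item.text
    else:
        # The age is the bare number right after the student's closing </a>;
        # re.escape guards against regex metacharacters in the name
        age = int(re.search(re.escape(item.text) + r"</a> (\d+)", html).group(1))
        classInfo[currentClass].append((item.text, age))

print(classInfo)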

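This outputs (the key order depends on the Python version; from 3.7 on, dicts keep insertion order, so Math101 is printed first):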

{'English101': [('Sarah', 16), ('Nancy', 17), ('Casey', 17)], 'Math101': [('Mark', 17), ('Alex', 18)]}
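
One limitation of the regex step is that it searches the whole page source for the student's name, so two students who share a name would both get the age of the first match. A possible variation, sketched below, reads the age from the text node that directly follows each link (`next_sibling` in BeautifulSoup) instead. The helper name `group_students` and the `HTML` constant are made up for this illustration, and the sketch assumes the age is always the first number after the student's link.

```python
import re
from bs4 import BeautifulSoup

# Sample markup copied from the question, stray characters included.
HTML = """
<div class="overview">
    <span class="course_titles">Courses:</span>
    <a href="/schools/courses/173/" class="course_name">Math101</a> (Math; Monday; Room 10);
    <a href="/schools/student/1388/" class="coursestudent_name">Mark</a> 17,
    <a href="/schools/student/1401/" class="coursestudent_name">Alex</a> 18, ),

    <a href="/schools/courses/2693/" class="course_name">English101</a> (English; Thursdays; Room 12);
    <a href="/schools/student/1403/" class="coursestudent_name">Sarah</a> 16,
    <a href="/schools/student/1411/" class="coursestudent_name">Nancy</a> 17,
    <a href="/schools/student/1390/" class="coursestudent_name">Casey</a> 17 ),
</div>
"""


def group_students(html):
    """Hypothetical helper: map each course name to a list of (student, age)."""
    soup = BeautifulSoup(html, "html.parser")
    courses = {}
    current = None  # name of the most recent course link seen
    for link in soup.find("div", class_="overview").find_all("a"):
        classes = link.get("class", [])
        if "course_name" in classes:
            current = link.get_text(strip=True)
            courses[current] = []
        elif "coursestudent_name" in classes and current is not None:
            # The age lives in the bare text node right after this <a> tag.
            trailing = link.next_sibling
            match = re.search(r"\d+", trailing) if isinstance(trailing, str) else None
            age = int(match.group()) if match else None  # None if no number follows
            courses[current].append((link.get_text(strip=True), age))
    return courses


courses = group_students(HTML)
print(courses)
```

For the sample markup this should produce the same course-to-student pairs as the answer above. Reading the sibling text keeps each age tied to its own link, at the cost of assuming nothing else sits between the link and the number.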
