Scrape information from a challenging website with no guiding HTML structure

Views: 118 Published: 2016/8/5 19:15:10 Tags: python regex web-scraping beautifulsoup

Question

I need to scrape some information from a very challenging website.

Here is an example:

<div class="overview">
        <span class="course_titles">Courses:</span> 
        <a href="/schools/courses/173/" class="course_name">Math101</a> (Math; Monday; Room 10);
        <a href="/schools/student/1388/" class="coursestudent_name">Mark</a> 17, 
        <a href="/schools/student/1401/" class="coursestudent_name">Alex</a> 18, ), 

        <a href="/schools/courses/2693/" class="course_name">English101</a> (English; Thursdays; Room 12); 
        <a href="/schools/student/1403/" class="coursestudent_name">Sarah</a> 16, 
        <a href="/schools/student/1411/" class="coursestudent_name">Nancy</a> 17, 
        <a href="/schools/student/1390/" class="coursestudent_name">Casey</a> 17 ), 
</div>

Each course has specific students, with each student's age given after their name (the stray characters were already in the markup).

I need to scrape each course with its respective students, plus their ages.

Unfortunately, there is no inherent hierarchy besides the all-encompassing div. I tried scraping with BeautifulSoup by "course_name" and then adding all items that have the "coursestudent_name" class, but this way I add every student present to each course.

I wish I could change the website, but I cannot. Does anyone have an idea how I could get the information per course with the correct students?

Thank you!

Recommended answer

You can do it mostly with BeautifulSoup, then a tiny bit of regex to get the student age, which isn't inside any HTML tags.

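import re
from bs4 import BeautifulSoup

# `html` is assumed to hold the page markup shown in the question
soup = BeautifulSoup(html, "html.parser")
allA = soup.find("div", {"class": "overview"}).find_all("a")

classInfo = {}
currentClass = None
for item in allA:
    if item['class'] == ['course_name']:
        # A course link starts a new group; the student links after it belong to it
        classInfo[item.text] = []
        currentClass = item.text
    else:
        # The age is bare text right after the student's closing </a> tag
        age = int(re.search(re.escape(item.text) + r"</a> (\d+)", html).group(1))
        classInfo[currentClass].append((item.text, age))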


print(classInfo)

This outputs (the order of the dictionary keys may vary with the Python version):

{'English101': [('Sarah', 16), ('Nancy', 17), ('Casey', 17)], 'Math101': [('Mark', 17), ('Alex', 18)]}
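
As a variation on the same idea, the age could also be read from the text node that directly follows each student link, rather than by searching the raw HTML for the student's name. Searching by name returns the first match in the whole page, so two students with the same name would presumably get the same age. The sketch below is not part of the original answer; the function name `group_students` and the variable `result` are made up for illustration, and the markup is a trimmed copy of the example from the question.

```python
import re
from bs4 import BeautifulSoup, NavigableString

# Trimmed copy of the markup from the question
html = """
<div class="overview">
    <span class="course_titles">Courses:</span>
    <a href="/schools/courses/173/" class="course_name">Math101</a> (Math; Monday; Room 10);
    <a href="/schools/student/1388/" class="coursestudent_name">Mark</a> 17,
    <a href="/schools/student/1401/" class="coursestudent_name">Alex</a> 18, ),
    <a href="/schools/courses/2693/" class="course_name">English101</a> (English; Thursdays; Room 12);
    <a href="/schools/student/1403/" class="coursestudent_name">Sarah</a> 16,
    <a href="/schools/student/1411/" class="coursestudent_name">Nancy</a> 17,
    <a href="/schools/student/1390/" class="coursestudent_name">Casey</a> 17 ),
</div>
"""


def group_students(markup):
    """Hypothetical helper: map each course name to a list of (student, age)."""
    soup = BeautifulSoup(markup, "html.parser")
    courses = {}
    current = None
    for link in soup.find("div", class_="overview").find_all("a"):
        classes = link.get("class", [])
        if "course_name" in classes:
            # A course link opens a new group
            current = link.get_text(strip=True)
            courses[current] = []
        elif "coursestudent_name" in classes and current is not None:
            # The text node right after the link holds the age, e.g. " 17, "
            sibling = link.next_sibling
            text = sibling if isinstance(sibling, NavigableString) else ""
            match = re.search(r"\d+", text)
            age = int(match.group()) if match else None
            courses[current].append((link.get_text(strip=True), age))
    return courses


result = group_students(html)
print(result)
```

Because each age is taken from the sibling of its own link, the lookup stays local to that student. A student with no number after the name gets `None` instead of raising an error.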
