Scrape information from a challenging website with no guiding HTML structure
Problem description
I need to scrape some information from a very challenging website
Here is an example:
<div class="overview">
<span class="course_titles">Courses:</span>
<a href="/schools/courses/173/" class="course_name">Math101</a> (Math; Monday; Room 10);
<a href="/schools/student/1388/" class="coursestudent_name">Mark</a> 17,
<a href="/schools/student/1401/" class="coursestudent_name">Alex</a> 18, ),
<a href="/schools/courses/2693/" class="course_name">English101</a> (English; Thursdays; Room 12);
<a href="/schools/student/1403/" class="coursestudent_name">Sarah</a> 16,
<a href="/schools/student/1411/" class="coursestudent_name">Nancy</a> 17,
<a href="/schools/student/1390/" class="coursestudent_name">Casey</a> 17 ),
</div>
Each course has specific students with their age given after their name (and those random characters were already in there).
I need to scrape each course with its respective students, plus their ages.
Unfortunately, there is no inherent hierarchy besides the all-encompassing div. I tried scraping with BeautifulSoup by "course_name" and then adding all items that have the "coursestudent_name" class, but this way I add every student on the page to each course.
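The attempt described above might look roughly like the following sketch. It is a reconstruction for illustration, not the asker's actual code: the shortened `html` sample and the variable names `courses`, `students`, and `naive` are assumptions. It shows why two independent class lookups attach every student to every course.

```python
from bs4 import BeautifulSoup

# Shortened copy of the sample markup from the question (assumed for illustration).
html = """
<div class="overview">
<span class="course_titles">Courses:</span>
<a href="/schools/courses/173/" class="course_name">Math101</a> (Math; Monday; Room 10);
<a href="/schools/student/1388/" class="coursestudent_name">Mark</a> 17,
<a href="/schools/courses/2693/" class="course_name">English101</a> (English; Thursdays; Room 12);
<a href="/schools/student/1403/" class="coursestudent_name">Sarah</a> 16 ),
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Two separate lookups: each returns its matches in isolation,
# so the relative order of courses and students is lost.
courses = [a.text for a in soup.find_all("a", class_="course_name")]
students = [a.text for a in soup.find_all("a", class_="coursestudent_name")]

# Every course ends up paired with the full student list.
naive = {course: list(students) for course in courses}
print(naive)
```

Since the only thing tying a student to a course is position in the document, any fix has to walk the links in order rather than query each class separately.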
I wish I could change the website, but I cannot. Does anyone have an idea how I could get the information per course with the correct students?
Thank you!
Recommended answer
You can do most of it with BeautifulSoup, then use a tiny bit of regex to get each student's age, which isn't inside any HTML tag.
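import re
from bs4 import BeautifulSoup

# html is assumed to hold the page source shown in the question
soup = BeautifulSoup(html, "html.parser")
allA = soup.find("div", {"class" : "overview"}).find_all("a")
classInfo = {}
currentClass = None
for item in allA:
    if item['class'] == ['course_name']:
        # A course link starts a new group of students
        classInfo[item.text] = []
        currentClass = item.text
    else:
        # The age is the bare text right after this student's <a> tag;
        # reading the sibling avoids matching another student with the same name
        age = int(re.search(r"\d+", str(item.next_sibling)).group())
        classInfo[currentClass].append((item.text, age))
print(classInfo)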
This outputs:
{'Math101': [('Mark', 17), ('Alex', 18)], 'English101': [('Sarah', 16), ('Nancy', 17), ('Casey', 17)]}
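For a version that runs without any outside setup, the same walk-the-links-in-order idea can be wrapped in a function. This is only a sketch under stated assumptions: `SAMPLE_HTML` is copied from the question, and the function name `group_students_by_course` and its defensive behaviour (storing `None` when no age follows a name, ignoring students that appear before any course) are illustrative choices, not part of the original answer.

```python
import re
from bs4 import BeautifulSoup

# Sample markup copied from the question.
SAMPLE_HTML = """
<div class="overview">
<span class="course_titles">Courses:</span>
<a href="/schools/courses/173/" class="course_name">Math101</a> (Math; Monday; Room 10);
<a href="/schools/student/1388/" class="coursestudent_name">Mark</a> 17,
<a href="/schools/student/1401/" class="coursestudent_name">Alex</a> 18, ),
<a href="/schools/courses/2693/" class="course_name">English101</a> (English; Thursdays; Room 12);
<a href="/schools/student/1403/" class="coursestudent_name">Sarah</a> 16,
<a href="/schools/student/1411/" class="coursestudent_name">Nancy</a> 17,
<a href="/schools/student/1390/" class="coursestudent_name">Casey</a> 17 ),
</div>
"""


def group_students_by_course(html):
    """Map each course name to a list of (student, age) tuples.

    Hypothetical helper: relies only on document order, since the
    markup has no nesting that ties students to courses.
    """
    soup = BeautifulSoup(html, "html.parser")
    overview = soup.find("div", class_="overview")
    courses = {}
    current = None  # student list of the most recently seen course
    for link in overview.find_all("a"):
        classes = link.get("class", [])
        if "course_name" in classes:
            current = courses.setdefault(link.text, [])
        elif "coursestudent_name" in classes and current is not None:
            # The age is loose text directly after the student's link.
            sibling = link.next_sibling
            text = sibling if isinstance(sibling, str) else ""
            match = re.search(r"\d+", text)
            current.append((link.text, int(match.group()) if match else None))
    return courses


result = group_students_by_course(SAMPLE_HTML)
print(result)
```

Checking for the class name with `in` rather than comparing the whole class list keeps the function working if the site later adds extra classes to those links.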