Search HTML line by line with regex in Python

Views: 109 | Published: 2018/6/21 17:30:18 | Tags: python, html, regex

Question

I'm attempting to create a dictionary of hours based on this calendar: http://disneyworld.disney.go.com/parks/magic-kingdom/calendar/

<td class="first"> <div class="dayContainer">
      <a href="/parks/magic-kingdom/calendardayview/?asmbly_day=20120401"> 
         <p class="day"> 1
         </p> <p class="moreLink">Park Hours<br />8:00 AM - 12:00 AM<br /><br/>Extra Magic Hours<br />7:00 AM - 8:00 AM<br /><br/>Extra Magic Hours<br />12:00 AM - 3:00 AM<br /><br/>
         </p> 
      </a> 
   </div>
</td>

Each of the calendar entries is on a single line, so I figured it would be best to go through the HTML line by line and, if a line contains hours, add those hours to a dictionary for the corresponding date (some days have multiple hour entries).

import urllib
import re

source = urllib.urlopen('http://disneyworld.disney.go.com/parks/magic-kingdom/calendar/')
page = source.read()

prkhrs = {}

def main():
    parsehours()

def parsehours():
    #look for #:## AM - #:## PM
    date = r'201204\d{02}'
    hours = r'\d:0{2}\s\w{2}\s-\s\d:0{2}\s\w{2}'

    #go through page line by line
    for line in page:
        times = re.findall(hours, line)
        dates = re.search(date, line)

        if dates:
            start = dates.start()
            end = dates.end()
            curdate = line[start:end]

        #if #:## - #:## is found, a date has been found
        if times:
            #create dictionary from date, stores hours in variable
            #extra magic hours(emh) are stored in same format.
            #if entry has 2/3 hour listings, those listings are emh
            prkhrs[curdate]['hours'] = times

    #just print hours for now. will change later
    print prkhrs

The problem I encounter is that when I put 'print line' inside the for loop that goes through the page, it prints one character at a time, which I assume is what's messing things up.

Right now, 'print prkhrs' prints nothing, but using re.findall for both the dates and the hours prints out the correct times, so I know the regex works. Any suggestions on how I can get it to work?

Recommended answer

Change page = source.read() to page = source.readlines()

source.read() returns the whole page as one big string. Iterating over a string (as when you do for line in page) returns one character at a time. Just because your variables are called line and page doesn't mean Python knows what you want.

source.readlines() returns a list of strings, each of which is a line from the page.
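
The difference can be seen without touching the network. The following is only a minimal sketch, written for Python 3 (the question uses Python 2): io.StringIO stands in for the object returned by urlopen, the sample text is made up, and the date pattern is the question's pattern written as \d{2}.

```python
import io
import re

# Hypothetical stand-in for the downloaded page: three short lines of text.
sample = '<td>\n<a href="/calendardayview/?asmbly_day=20120401">\n</td>\n'

# read() returns one string; iterating over it yields single characters.
page_as_string = io.StringIO(sample).read()
first_items = [item for item in page_as_string][:4]

# readlines() returns a list of strings; iterating over it yields whole lines.
page_as_lines = io.StringIO(sample).readlines()

# The date pattern can never match a single character, only a whole line.
date = r'201204\d{2}'
char_hits = [c for c in page_as_string if re.search(date, c)]
line_hits = [m.group() for m in (re.search(date, line) for line in page_as_lines) if m]

print(first_items)         # ['<', 't', 'd', '>']
print(len(page_as_lines))  # 3
print(char_hits)           # []
print(line_hits)           # ['20120401']
```

This is why the loop in the question never finds a date or any hours: each "line" it tests is a single character. File-like objects such as io.StringIO can usually also be iterated directly (for line in source), which yields lines without building the full list first.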
