Web crawler to extract in between the list
Problem Description
I am writing a web crawler in Python. I wish to get all the content between <li> </li> tags. For example:
<li>January 13, 1991: At least 40 people <a href ="......."> </a> </li>
So here I want to:
a.) extract the date and convert it into dd/mm/yyyy format
b.) extract the number of people.
soup = BeautifulSoup(page1)  # page1 holds the HTML of the page
for item in soup.find_all("li"):
    print(item.get_text().encode('ascii', 'ignore'))
I can only extract the text right now.
Recommended Answer
Get the text with .text, split the string at the first occurrence of :, convert the date string to a datetime using strptime() with the existing %B %d, %Y format, then format it back to a string using strftime() with the desired %d/%m/%Y format, and extract the number using the regular expression At least (\d+), where (\d+) is a capturing group that matches one or more digits:
from datetime import datetime
import re

from bs4 import BeautifulSoup

data = '<li>January 13, 1991: At least 40 people <a href ="......."> </a> </li>'

soup = BeautifulSoup(data, 'html.parser')
date_string, rest = soup.li.text.split(':', 1)
print(datetime.strptime(date_string, '%B %d, %Y').strftime('%d/%m/%Y'))
print(re.match(r'At least (\d+)', rest.strip()).group(1))
Prints:
13/01/1991
40
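The same steps can be applied to every <li> on a page, which is what the loop in the question was driving at. A minimal sketch, assuming a page shaped like the example (the two-entry list below is made up for illustration):

```python
from datetime import datetime
import re

from bs4 import BeautifulSoup

# Hypothetical page with several <li> entries in the question's format
html = """
<ul>
  <li>January 13, 1991: At least 40 people <a href="#"> </a></li>
  <li>March 3, 1992: At least 15 people <a href="#"> </a></li>
</ul>
"""

soup = BeautifulSoup(html, "html.parser")
results = []
for li in soup.find_all("li"):
    # Split off the date before the first colon, parse and reformat it
    date_string, rest = li.get_text().split(":", 1)
    date = datetime.strptime(date_string.strip(), "%B %d, %Y").strftime("%d/%m/%Y")
    # Capture the digits after "At least"
    count = re.match(r"At least (\d+)", rest.strip()).group(1)
    results.append((date, int(count)))

print(results)  # [('13/01/1991', 40), ('03/03/1992', 15)]
```

Collecting tuples into a list this way keeps the parsing in one pass, rather than tracking an index by hand as in the original while loop.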