Web crawler to extract in between the list

Views: 165 Published: 2016/8/5 19:16:21 Tags: python parsing web-scraping beautifulsoup web-crawler

Problem description

I am writing a web crawler in Python. I want to get all the content between <li> and </li> tags. For example:

<li>January 13, 1991: At least 40 people <a href ="......."> </a> </li>

So here I want to:

a.) extract the date and convert it into dd/mm/yyyy format

b.) extract the number that comes before "people".

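from bs4 import BeautifulSoup

# page1 holds the HTML of the page fetched earlier
soup = BeautifulSoup(page1, 'html.parser')
for li in soup.find_all('li'):
    # Print the text of each <li>, dropping any non-ASCII characters
    print(li.get_text().encode('ascii', 'ignore').decode('ascii'))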

Right now I can only extract the text.

Recommended answer

Get the text with .text and split the string at the first occurrence of :. Convert the date string to a datetime using strptime() with the existing %B %d, %Y format, then format it back to a string using strftime() with the desired %d/%m/%Y format. Extract the number using the regular expression At least (\d+), where (\d+) is a capturing group that matches one or more digits:

from datetime import datetime
import re

from bs4 import BeautifulSoup


data = '<li>January 13, 1991: At least 40 people <a href ="......."> </a> </li>'
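soup = BeautifulSoup(data, 'html.parser')

# Split on the first colon only: date on the left, the rest on the right
date_string, rest = soup.li.text.split(':', 1)

# Parse "January 13, 1991" and reformat it as dd/mm/yyyy
print(datetime.strptime(date_string, '%B %d, %Y').strftime('%d/%m/%Y'))
# Capture the digits that follow "At least"
print(re.match(r'At least (\d+)', rest.strip()).group(1))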

Prints:

13/01/1991
40
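
On a real page, not every <li> will follow the "date: At least N people" pattern, so it may help to wrap the steps above in a function that skips items that do not fit. The following is only a sketch under some assumptions: parse_item is a hypothetical helper name, not part of the original answer, and it assumes the number you want is the one directly before the word "people" (which also covers items that do not start with "At least"). It works on plain text, so you would pass it li.get_text() for each element returned by soup.find_all('li').

```python
import re
from datetime import datetime


# Hypothetical helper, not part of the original answer.
def parse_item(text):
    """Return (date as dd/mm/yyyy, count) for one <li> text, or None if it does not fit."""
    date_string, sep, rest = text.partition(':')
    if not sep:
        return None  # no colon, so no "date: description" structure
    try:
        date = datetime.strptime(date_string.strip(), '%B %d, %Y')
    except ValueError:
        return None  # the part before the colon is not a date like "January 13, 1991"
    # Assumption: the number wanted is the one directly before the word "people"
    match = re.search(r'(\d+)\s+people', rest)
    count = int(match.group(1)) if match else None
    return date.strftime('%d/%m/%Y'), count


# Sample texts standing in for li.get_text() results
items = [
    'January 13, 1991: At least 40 people  ',
    'Some unrelated list item',
]
for text in items:
    print(parse_item(text))
```

For the two sample strings this prints ('13/01/1991', 40) and then None. Returning None instead of raising lets the crawl continue past navigation links and other list items that are not events.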
