Regular expression to extract city, state, and country from HTML

Question

I am using OutWit Hub to scrape a website for city, state, and country (USA and Canada only). The program lets me use regular expressions to define the Before and After markers around the text I want to grab. I can also define a Format for the desired text.

Here is a sample of the HTML:

<td width="8%" nowrap="nowrap"></td>                        
<td width="22%" nowrap="nowrap"><strong>
BILLINGS, MT
USA</strong></td>
<td width="10%" align="right" nowrap="nowrap">

I have set up my regular expressions as follows:

CITY - Before (not formatted as a regular expression)

<td width="22%" nowrap="nowrap"><strong>

CITY - After (accounts for states, territories, and provinces)

/(,\s|\bA[BLKSZRAEP]\b|\bBC\b\bC[AOT]\b|\bD[EC]\b|\bF[LM]\b|\bG[AU]\b|\bHI\b|\bI[ADLN]\b|\bK[SY]\b|\bLA\b|\bM[ABDEHINOPST]\b|\bN[BLTSUCDEHJMVY]\b|\bO[HKNR]\b|\bP[AERW]\b|\bQC\b|\bRI\b|\bS[CDK]\b|\bT[NX]\b|\bUT\b|\bV[AIT]\b|\bW[AIVY]\b|\bYT\b|\bUSA|\bCanada)/

STATE - Before

\<td width="22%" nowrap="nowrap"\>\<strong\>\s|,\s

STATE - After

/\bUSA\<\/strong\>\<\/td\>|\bCanada\<\/strong\>\<\/td\>/

STATE - Format

/\b[A-Z][A-Z]\b/

COUNTRY - Before (accounts for states, territories, and provinces)

/(\bA[BLKSZRAEP]\b|\bBC\b\bC[AOT]\b|\bD[EC]\b|\bF[LM]\b|\bG[AU]\b|\bHI\b|\bI[ADLN]\b|\bK[SY]\b|\bLA\b|\bM[ABDEHINOPST]\b|\bN[BLTSUCDEHJMVY]\b|\bO[HKNR]\b|\bP[AERW]\b|\bQC\b|\bRI\b|\bS[CDK]\b|\bT[NX]\b|\bUT\b|\bV[AIT]\b|\bW[AIVY]\b|\bYT\b)\s/

COUNTRY - After (not formatted as a regular expression)

</strong></td><td width="10%" align="right" nowrap="nowrap">

The issue arises when no city or state is listed. I have tried to account for this, but I am only making it worse. Is there a way to clean this up and still handle the possibility of missing information?

Example with no city:

<td width="8%" nowrap="nowrap"></td>                        
<td width="22%" nowrap="nowrap"><strong>
MT
USA</strong></td>
<td width="10%" align="right" nowrap="nowrap">

Example with no city or state (yes, there is an extra line break):

<td width="8%" nowrap="nowrap"></td>                        
<td width="22%" nowrap="nowrap"><strong>

USA</strong></td>
<td width="10%" align="right" nowrap="nowrap">

Thank you for any help you can provide.

Recommended answer

If you have the pro version, you can do it with a single scraper line:

Description: Data
Before: <td width="22%" nowrap="nowrap"><strong>
After: </strong>
Format: (([\w \-]+),)? ?([A-Z]{2})?[\r\n](USA|canada)\s*
Replace: \2##\3##\4
Separator: ##
Labels: City,State,Country
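
The single-pattern idea above can also be tried outside OutWit Hub. The Python sketch below is an illustration only, not OutWit configuration. The function name `extract_location`, the leading `\s*`, and the case-insensitive `canada` alternative are assumptions added for this example, and OutWit's own regex engine may behave differently. It shows how optional groups let one pattern cover all three sample cells from the question.

```python
import re

# Adapted from the answer's Format pattern. The Before/After markers are
# folded into the pattern itself; the \s* parts are an assumption so that
# blank lines and \r\n line endings are tolerated.
LOCATION_RE = re.compile(
    r'<td width="22%" nowrap="nowrap"><strong>\s*'
    r'(?:([\w \-]+),)?'        # optional city, terminated by a comma
    r' ?([A-Z]{2})?'           # optional two-letter state/province code
    r'\s*(USA|(?i:canada))'    # country is always expected
    r'\s*</strong>'
)


def extract_location(html):
    """Return (city, state, country) or None; missing parts are None."""
    match = LOCATION_RE.search(html)
    if match is None:
        return None
    return match.groups()


if __name__ == "__main__":
    cell = '<td width="22%" nowrap="nowrap"><strong>\n{}\nUSA</strong></td>'
    for body in ("BILLINGS, MT", "MT", ""):
        print(extract_location(cell.format(body)))
    # ('BILLINGS', 'MT', 'USA')
    # (None, 'MT', 'USA')
    # (None, None, 'USA')
```

Because the city and state groups are optional, the regex engine falls back to skipping them when no comma or no two-letter code precedes the country, which is the missing-info case the question asks about.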

If you are using the light version, you have to do it in three lines:

Description: City
Before: <td width="22%" nowrap="nowrap"><strong>
After: ,
Format: [^<>]+

Description: State
Before: /<td width="22%" nowrap="nowrap"><strong>[\r\n]([^<>\r\n ]+,)?/
After: /[\r\n]/
Format: [A-Z]{2}

Description: Country
Before:
After: </strong></td>
Format: (USA|canada)
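
The three-line approach can be sketched in Python as three independent patterns, one per field, so that a missing city or state only makes its own pattern fail. This is a rough analogue under stated assumptions, not a reproduction of how OutWit applies Before, After, and Format: the names `CELL`, `CITY_RE`, `STATE_RE`, `COUNTRY_RE`, and `first_group` are hypothetical, and the state pattern allows spaces in the city name, which the answer's pattern does not.

```python
import re

# Literal marker that opens the location cell in the sample HTML.
CELL = r'<td width="22%" nowrap="nowrap"><strong>'

# City: text after the marker up to the first comma (no tags in between).
CITY_RE = re.compile(CELL + r'\s*([^<>,]+),')

# State: two capitals on the first line, after an optional "CITY," prefix.
STATE_RE = re.compile(
    CELL + r'[\r\n]+(?:[^<>\r\n]+,)?[ ]?([A-Z]{2})[\r\n]'
)

# Country: the value directly before the closing tags.
COUNTRY_RE = re.compile(r'(USA|(?i:canada))\s*</strong></td>')


def first_group(pattern, html):
    """Return the stripped first capture group, or None if no match."""
    match = pattern.search(html)
    return match.group(1).strip() if match else None


if __name__ == "__main__":
    html = CELL + '\nMT\nUSA</strong></td>'
    print(first_group(CITY_RE, html))     # None
    print(first_group(STATE_RE, html))    # MT
    print(first_group(COUNTRY_RE, html))  # USA
```

Keeping the fields separate trades one compact pattern for three simpler ones that can be debugged individually.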
