Get content inside HTML tags with RegExp
Question
I'd like to extract the content from a large file of table cells using regexp and process the data using PHP.
This is the data I want to match:
<td>Current Value: </td><td>100.178</td>
I tried using this regexp to match and retrieve the text:
preg_match("<td>Current Value: </td><td>(.+?)</td>", $data, $output);
However I get an "Unknown modifier" warning and my variable $output comes out empty.
How can I accomplish this - and could you give me a brief summary of how the solution works so I can try to understand why my code didn't?
Recommended answer
You need to add delimiters around your regex:
preg_match("#<td>Current Value: </td><td>(.+?)</td>#", $data, $output);
The standard delimiter is /, but you can use other non-alphanumeric characters if you wish (which makes sense here because the regex itself contains slashes). In your case, the regex engine treated the angle brackets as delimiters: the pattern ended at the first >, and the text after it was read as modifiers, which is what triggered the "Unknown modifier" warning.
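As a minimal sketch of the delimiter point, the snippet below runs the corrected pattern against the sample line from the question, once with # as the delimiter and once with the standard / delimiter. The variable name $escaped is only illustrative and does not come from the original post.

```php
<?php
// Sample input copied from the question.
$data = '<td>Current Value: </td><td>100.178</td>';

// '#' as the delimiter: the slashes in </td> need no escaping.
preg_match('#<td>Current Value: </td><td>(.+?)</td>#', $data, $output);

// The same pattern with the standard '/' delimiter:
// every literal slash inside the pattern must now be escaped.
preg_match('/<td>Current Value: <\/td><td>(.+?)<\/td>/', $data, $escaped);

// Index 0 holds the whole match, index 1 the first capture group.
echo $output[1], PHP_EOL;  // prints 100.178
```

Both calls capture the same text; choosing a delimiter that does not occur in the pattern simply keeps the regex easier to read.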
One more tip (aside from the canonical exhortation "Thou shalt not parse HTML with regexen"; using a regex is, I think, perfectly OK in a specific case like this): use ([^<>]+) instead of (.+?). This ensures that your regex will never travel across nested tags, a common source of errors when dealing with markup languages.
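To illustrate how a lazy dot can wander across tag boundaries, here is a small sketch using a hypothetical input in which the value cell is empty and another cell follows. The input string and the variable names $lazy, $strict and $found are assumptions made for this example and are not part of the original question.

```php
<?php
// Hypothetical input: the value cell is empty and a further cell follows it.
$data = '<td>Current Value: </td><td></td><td>42</td>';

// Lazy dot: it must consume at least one character, so it walks past
// the empty cell's closing tag and into the next cell.
preg_match('#<td>Current Value: </td><td>(.+?)</td>#', $data, $lazy);
// $lazy[1] is '</td><td>42', markup and all.

// Negated character class: it cannot consume '<' or '>',
// so the pattern fails instead of returning a misleading value.
$found = preg_match('#<td>Current Value: </td><td>([^<>]+)</td>#', $data, $strict);
// $found is 0 and $strict is an empty array.
```

A failed match is usually easier to detect and handle than a capture that silently contains stray markup, so the stricter pattern tends to be the safer choice.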