How to get correct list position in multi-byte string using preg_match


Question

I am currently matching HTML using this code:

preg_match('/<\/?([a-z]+)[^>]*>|&#?[a-zA-Z0-9]+;/u', $html, $match, PREG_OFFSET_CAPTURE, $position)

It matches everything perfectly. However, if the string contains a multibyte character, that character is counted as 2 when the position is returned.

For example, the returned $match array would give something like:

array
  0 => 
    array
      0 => string '<br />' (length=6)
      1 => int 132
  1 => 
    array
      0 => string 'br' (length=2)
      1 => int 133

The real character position of the <br /> match is 128, but there are 4 multibyte characters before it, so the reported offset is 132. I really thought adding the /u modifier would make it aware of what's going on, but no luck there.
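The mismatch comes from PREG_OFFSET_CAPTURE reporting offsets in bytes, which the /u modifier does not change. As a minimal sketch of one common workaround (not the approach taken in the answer below), the byte offset can be converted to a character offset by counting the characters in the byte prefix. The helper name byte_to_char_offset and the sample string are assumptions made for illustration, and the mbstring extension is assumed to be loaded.

```php
<?php
// Hypothetical helper (not from the original post): turns a byte offset
// reported by preg_match() into a character offset for UTF-8 input.
function byte_to_char_offset(string $subject, int $byteOffset): int
{
    // substr() cuts by bytes; mb_strlen() counts the characters in that prefix.
    return mb_strlen(substr($subject, 0, $byteOffset), 'UTF-8');
}

// Sample input: 11 characters before the tag, two of them 2 bytes long in UTF-8.
$html = "h\u{00E9}llo w\u{00F6}rld<br />";

preg_match('/<\/?([a-z]+)[^>]*>/u', $html, $match, PREG_OFFSET_CAPTURE);

$byteOffset = $match[0][1];                             // offset in bytes
$charOffset = byte_to_char_offset($html, $byteOffset);  // offset in characters

echo $byteOffset, ' ', $charOffset, PHP_EOL;            // prints "13 11"
```

With /u and valid UTF-8 input, a match always starts on a character boundary, so the byte prefix passed to mb_strlen() is itself valid UTF-8.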

Recommended answer

I looked at this suggestion from @Qtax:

UTF-8 characters in preg_match_all (PHP)

And for some more reference, this bug surfaced while using this: Truncate text containing HTML, ignoring tags

The gist of the changes is:

$orig_utf = 'UTF-8';
$new_utf  = 'UTF-32';

mb_regex_encoding( $new_utf );

$html     = mb_convert_encoding( $html, $new_utf, $orig_utf );
$end_char = mb_convert_encoding( $end_char, $new_utf, $orig_utf );


mb_ereg_search_init( $html );

$pattern = '</?([a-z]+)[^>]*>|&#?[a-zA-Z0-9]+;';
$pattern = mb_convert_encoding( $pattern, $new_utf, $orig_utf );

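while ( $printed < $limit && $tag_match = mb_ereg_search_pos( $pattern, $html ) ) {

  // mb_ereg_search_pos() returns the match offset and length in bytes.
  // UTF-32 uses 4 bytes per character, so dividing by 4 gives character units.
  $tag_position = $tag_match[0]/4;
  $tag_length   = $tag_match[1];
  $tag          = mb_substr( $html, $tag_position, $tag_length/4, $new_utf );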
  $tag_name     = preg_replace( '/[\s<>\/]+/', '', $tag );

  // Print text leading up to the tag.
  $str = mb_substr($html, $position, $tag_position - $position, $new_utf );

  .......

} 
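The division by 4 in the loop above works because UTF-32 is a fixed-width encoding: every character occupies exactly 4 bytes, so a byte offset divided by 4 is a character offset. The following is a small illustration of that property only, not part of the original answer. It assumes the mbstring extension, uses 'UTF-32BE' instead of 'UTF-32' to make the byte order explicit, and uses a made-up sample string.

```php
<?php
// Sample input: 5 characters before the tag, one of them 2 bytes long in UTF-8.
$utf8  = "h\u{00E9}llo<br />";
$utf32 = mb_convert_encoding($utf8, 'UTF-32BE', 'UTF-8');

// Every character takes exactly 4 bytes in UTF-32.
$charCount = intdiv(strlen($utf32), 4);

// Byte-oriented search on the UTF-32 string, converted to character units.
$needle      = mb_convert_encoding('<', 'UTF-32BE', 'UTF-8');
$charPos     = intdiv(strpos($utf32, $needle), 4);

// For comparison: byte position of the same tag in the UTF-8 string.
$utf8BytePos = strpos($utf8, '<');

echo $charCount, ' ', $charPos, ' ', $utf8BytePos, PHP_EOL;  // prints "11 5 6"
```

A plain strpos() on UTF-32 bytes can in principle match across a character boundary for other inputs, so this is only meant to show the 4-bytes-per-character arithmetic.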

Also, in reference to the truncate HTML page, there are other necessary changes:

$first_char = mb_substr( $tag, 0, 1, $new_utf );

if ( $first_char == mb_convert_encoding( '&', $new_utf ) ) {
  ...
}

My text editor saves files as UTF-8, so comparing the UTF-32 character directly against the ampersand literal in my file wouldn't work; the literal has to be converted first.
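To see why the conversion is needed, here is a short sketch under the same assumptions as the answer (mbstring loaded, source file saved as UTF-8). It uses 'UTF-32BE' to make the byte order explicit, and the sample entity and variable names are made up for illustration.

```php
<?php
// An entity as it would appear after the HTML has been converted to UTF-32.
$tag        = mb_convert_encoding('&amp;', 'UTF-32BE', 'UTF-8');
$first_char = mb_substr($tag, 0, 1, 'UTF-32BE');   // one character, 4 bytes

// The literal in the source file is a single byte, so this never matches.
$plain_match = ($first_char === '&');

// Converting the literal to the same encoding makes the comparison work.
$converted_match = ($first_char === mb_convert_encoding('&', 'UTF-32BE', 'UTF-8'));

var_dump(strlen($first_char), $plain_match, $converted_match);
// int(4), bool(false), bool(true)
```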
