Convert all types of smart quotes with PHP


Problem description

I am trying to convert all types of smart quotes to regular quotes when working with text. However, the following function I've compiled still seems to be lacking support and proper design.

Does anyone know how to properly get all quote characters converted?

function convert_smart_quotes($string)
{
    $quotes = array(
        "\xC2\xAB"   => '"', // « (U+00AB) in UTF-8
        "\xC2\xBB"   => '"', // » (U+00BB) in UTF-8
        "\xE2\x80\x98" => "'", // ‘ (U+2018) in UTF-8
        "\xE2\x80\x99" => "'", // ’ (U+2019) in UTF-8
        "\xE2\x80\x9A" => "'", // ‚ (U+201A) in UTF-8
        "\xE2\x80\x9B" => "'", // ‛ (U+201B) in UTF-8
        "\xE2\x80\x9C" => '"', // " (U+201C) in UTF-8
        "\xE2\x80\x9D" => '"', // " (U+201D) in UTF-8
        "\xE2\x80\x9E" => '"', // „ (U+201E) in UTF-8
        "\xE2\x80\x9F" => '"', // ‟ (U+201F) in UTF-8
        "\xE2\x80\xB9" => "'", // ‹ (U+2039) in UTF-8
        "\xE2\x80\xBA" => "'", // › (U+203A) in UTF-8
    );
    $string = strtr($string, $quotes);

    // Version 2
    $search = array(
        chr(145),
        chr(146),
        chr(147),
        chr(148),
        chr(151)
    );
    $replace = array("'","'",'"','"',' - ');
    $string = str_replace($search, $replace, $string);

    // Version 3
    $string = str_replace(
        array('&#8216;','&#8217;','&#8220;','&#8221;'),
        array("'", "'", '"', '"'),
        $string
    );

    // Version 4
    $search = array(
        '&lsquo;', 
        '&rsquo;', 
        '&ldquo;', 
        '&rdquo;', 
        '&mdash;',
        '&ndash;',
    );
    $replace = array("'","'",'"','"',' - ', '-');
    $string = str_replace($search, $replace, $string);

    return $string;
}

Note: This question is a complete query about the full gamut of quotes, including the "Microsoft" quotes asked here. It is a "duplicate" only in the same way that asking about all tire sizes is a "duplicate" of asking for a car tire size.

Solution

You need something like this (assuming UTF-8 input, and ignoring CJK (Chinese, Japanese, Korean)):

$chr_map = array(
   // Windows codepage 1252
   "\xC2\x82" => "'", // U+0082⇒U+201A single low-9 quotation mark
   "\xC2\x84" => '"', // U+0084⇒U+201E double low-9 quotation mark
   "\xC2\x8B" => "'", // U+008B⇒U+2039 single left-pointing angle quotation mark
   "\xC2\x91" => "'", // U+0091⇒U+2018 left single quotation mark
   "\xC2\x92" => "'", // U+0092⇒U+2019 right single quotation mark
   "\xC2\x93" => '"', // U+0093⇒U+201C left double quotation mark
   "\xC2\x94" => '"', // U+0094⇒U+201D right double quotation mark
   "\xC2\x9B" => "'", // U+009B⇒U+203A single right-pointing angle quotation mark

   // Regular Unicode     // U+0022 quotation mark (")
                          // U+0027 apostrophe     (')
   "\xC2\xAB"     => '"', // U+00AB left-pointing double angle quotation mark
   "\xC2\xBB"     => '"', // U+00BB right-pointing double angle quotation mark
   "\xE2\x80\x98" => "'", // U+2018 left single quotation mark
   "\xE2\x80\x99" => "'", // U+2019 right single quotation mark
   "\xE2\x80\x9A" => "'", // U+201A single low-9 quotation mark
   "\xE2\x80\x9B" => "'", // U+201B single high-reversed-9 quotation mark
   "\xE2\x80\x9C" => '"', // U+201C left double quotation mark
   "\xE2\x80\x9D" => '"', // U+201D right double quotation mark
   "\xE2\x80\x9E" => '"', // U+201E double low-9 quotation mark
   "\xE2\x80\x9F" => '"', // U+201F double high-reversed-9 quotation mark
   "\xE2\x80\xB9" => "'", // U+2039 single left-pointing angle quotation mark
   "\xE2\x80\xBA" => "'", // U+203A single right-pointing angle quotation mark
);
$chr = array_keys  ($chr_map); // but: for efficiency you should
$rpl = array_values($chr_map); // pre-calculate these two arrays
$str = str_replace($chr, $rpl, html_entity_decode($str, ENT_QUOTES, "UTF-8"));
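The pre-calculation mentioned in the comments could be sketched as below. This is only an illustration, not the answer's own code: the function name `convert_quotes()` is hypothetical, and the map is shortened to six entries where a real one would carry every entry listed above.

```php
<?php
// Hypothetical helper: builds the search/replace arrays on the first call
// and reuses them on later calls through static variables.
function convert_quotes(string $str): string
{
    static $chr = null;
    static $rpl = null;

    if ($chr === null) {
        // Shortened map for illustration; extend it with the full $chr_map.
        $chr_map = array(
            "\xC2\xAB"     => '"', // U+00AB left-pointing double angle quotation mark
            "\xC2\xBB"     => '"', // U+00BB right-pointing double angle quotation mark
            "\xE2\x80\x98" => "'", // U+2018 left single quotation mark
            "\xE2\x80\x99" => "'", // U+2019 right single quotation mark
            "\xE2\x80\x9C" => '"', // U+201C left double quotation mark
            "\xE2\x80\x9D" => '"', // U+201D right double quotation mark
        );
        $chr = array_keys($chr_map);
        $rpl = array_values($chr_map);
    }

    // Decode entities such as &ldquo; or &#8216; first, then map the characters.
    return str_replace($chr, $rpl, html_entity_decode($str, ENT_QUOTES, 'UTF-8'));
}

echo convert_quotes("&ldquo;It\xE2\x80\x99s&rdquo; \xC2\xABok\xC2\xBB"), "\n";
```

Note that `html_entity_decode()` decodes every entity it knows, including `&lt;` and `&amp;`, so this approach suits text that will be escaped again before it is output as HTML.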

Here comes the background:

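Every Unicode character belongs to exactly one "General Category", of which the ones that can contain quote characters are the following: Ps (Punctuation, Open), Pe (Punctuation, Close), Pi (Punctuation, Initial quote), Pf (Punctuation, Final quote) and Po (Punctuation, Other).

(The per-category character listings are handy for checking that you didn't miss anything; there is also an index of categories.)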

It is sometimes useful to match these categories in a Unicode-enabled regex.
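As a rough illustration of such a match, the sketch below lists every character in a string whose general category is Pi (initial quote) or Pf (final quote), together with its UTF-8 bytes. The sample text is made up, and the `\u{...}` string escape assumes PHP 7.0 or later.

```php
<?php
// Made-up sample text containing angle, double and single typographic quotes.
$text = "\u{00AB}a\u{00BB} \u{201C}b\u{201D} \u{2018}c\u{2019} (d)";

// \p{Pi} and \p{Pf} match by Unicode general category; /u enables UTF-8 mode.
preg_match_all('/[\p{Pi}\p{Pf}]/u', $text, $matches);

foreach ($matches[0] as $char) {
    // bin2hex() shows the UTF-8 bytes, usable as a key in a replacement map.
    echo $char, ' => ', strtoupper(bin2hex($char)), "\n";
}
```

The categories mix single and double quotes (and a few other punctuation marks), so a match like this helps to find candidates but does not decide what each one should be replaced with.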

Furthermore, Unicode characters have "properties", of which the one you are interested in is Quotation_Mark. Unfortunately, these are not accessible in a regex.

In Wikipedia you can find the group of characters with the Quotation_Mark property. The definitive reference is PropList.txt on unicode.org, but this is a plain ASCII text file.

In case you need to translate CJK characters too, you only have to get their code points, decide their translation, and find their UTF-8 encoding, e.g., by looking it up in fileformat.info (e.g., for U+301E: http://www.fileformat.info/info/unicode/char/301e/index.htm).
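A small extension map for CJK could look like the sketch below. The choice of code points and the decision to translate each of them to a plain double quote are assumptions made for this example, and the `\u{...}` escape (PHP 7.0 or later) is used here instead of looking up the bytes by hand.

```php
<?php
// Assumed translations for a few CJK quotation marks; choose your own.
$cjk_map = array(
    "\u{300C}" => '"', // U+300C left corner bracket
    "\u{300D}" => '"', // U+300D right corner bracket
    "\u{300E}" => '"', // U+300E left white corner bracket
    "\u{300F}" => '"', // U+300F right white corner bracket
    "\u{301D}" => '"', // U+301D reversed double prime quotation mark
    "\u{301E}" => '"', // U+301E double prime quotation mark
    "\u{301F}" => '"', // U+301F low double prime quotation mark
);

// The escape yields the same three bytes that a lookup of U+301E gives.
var_dump("\u{301E}" === "\xE3\x80\x9E");

echo strtr("\u{300C}hello\u{300D}", $cjk_map), "\n";
```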

Regarding Windows codepage 1252: Unicode defines the first 256 code points to represent exactly the same characters as ISO-8859-1, but ISO-8859-1 is often confused with Windows codepage 1252, so that all browsers render the range 0x80-0x9F, which is "empty" in ISO-8859-1 (more exactly: it contains control characters), as if it were Windows codepage 1252. The table in the Wikipedia page lists the Unicode equivalents.

Note: strtr() is often slower than str_replace(). Time it with your input and your PHP version. If it's fast enough, you can directly use a map like my $chr_map.
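One way to time the two approaches is sketched below. The map is shortened, the input is made up, and `hrtime()` assumes PHP 7.3 or later; the numbers will differ from machine to machine, so treat them as a rough guide only.

```php
<?php
// Shortened map for illustration; use the full $chr_map in practice.
$chr_map = array(
    "\xE2\x80\x98" => "'", // U+2018
    "\xE2\x80\x99" => "'", // U+2019
    "\xE2\x80\x9C" => '"', // U+201C
    "\xE2\x80\x9D" => '"', // U+201D
);
$chr = array_keys($chr_map);
$rpl = array_values($chr_map);

// Made-up sample input, repeated to get a measurable run time.
$input = str_repeat("\xE2\x80\x9CIt\xE2\x80\x99s fine\xE2\x80\x9D ", 10000);

$t0 = hrtime(true);
$a  = strtr($input, $chr_map);          // map used directly
$t1 = hrtime(true);
$b  = str_replace($chr, $rpl, $input);  // pre-split key/value arrays
$t2 = hrtime(true);

printf("strtr:       %.3f ms\n", ($t1 - $t0) / 1e6);
printf("str_replace: %.3f ms\n", ($t2 - $t1) / 1e6);
var_dump($a === $b); // the two results are identical for this map
```

The results agree here because no replacement value contains any of the search strings. `str_replace()` applies its pairs one after another, so with overlapping pairs it can re-replace earlier output, whereas `strtr()` works in a single pass.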


If you are not sure that your input is UTF-8 encoded, AND are willing to assume that if it's not, then it's ISO-8859-1 or Windows codepage 1252, then you can do this before anything else:

if ( !preg_match('/^\\X*$/u', $str)) {
   $str = utf8_encode($str);
}

Warning: this regex can in very rare cases fail to detect a non-UTF-8 encoding, though. E.g.: "Gruß…"/*CP-1252*/=="Gru\xDF\x85" looks like UTF-8 to this regex (U+07C5 is the N'ko digit 5). This regex can be slightly enhanced, but unfortunately it can be shown that there exists NO completely foolproof solution to the problem of encoding detection.
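Since `utf8_encode()` is deprecated as of PHP 8.2, the same check might be written as in the sketch below. The helper name `ensure_utf8()` is hypothetical, the empty pattern with the `/u` modifier stands in for the `\X*` pattern, and the conversion relies on the mbstring extension with iconv as a fallback. It keeps the same assumption as above: input that is not valid UTF-8 is treated as ISO-8859-1.

```php
<?php
// Hypothetical helper mirroring the check above without utf8_encode().
function ensure_utf8(string $str): string
{
    // With the /u modifier, preg_match() fails on malformed UTF-8,
    // so a return value of 1 means the subject is well-formed.
    if (preg_match('//u', $str) === 1) {
        return $str;
    }
    // Same assumption as in the text: anything else is ISO-8859-1.
    if (function_exists('mb_convert_encoding')) {
        return mb_convert_encoding($str, 'UTF-8', 'ISO-8859-1');
    }
    return (string) iconv('ISO-8859-1', 'UTF-8', $str);
}

echo bin2hex(ensure_utf8("Gru\xDF")), "\n";     // ISO-8859-1 input gets converted
echo bin2hex(ensure_utf8("Gru\xDF\x85")), "\n"; // well-formed by accident, left alone
```

The second call reproduces the false negative described in the warning: the CP-1252 bytes happen to form valid UTF-8, so the string is returned unchanged.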


If you want to normalize the range 0x80-0x9F that stems from Windows codepage 1252 to regular Unicode codepoints, you can do this (and remove the first part of the $chr_map above):

$normalization_map = array(
   "\xC2\x80" => "\xE2\x82\xAC", // U+20AC Euro sign
   "\xC2\x82" => "\xE2\x80\x9A", // U+201A single low-9 quotation mark
   "\xC2\x83" => "\xC6\x92",     // U+0192 latin small letter f with hook
   "\xC2\x84" => "\xE2\x80\x9E", // U+201E double low-9 quotation mark
   "\xC2\x85" => "\xE2\x80\xA6", // U+2026 horizontal ellipsis
   "\xC2\x86" => "\xE2\x80\xA0", // U+2020 dagger
   "\xC2\x87" => "\xE2\x80\xA1", // U+2021 double dagger
   "\xC2\x88" => "\xCB\x86",     // U+02C6 modifier letter circumflex accent
   "\xC2\x89" => "\xE2\x80\xB0", // U+2030 per mille sign
   "\xC2\x8A" => "\xC5\xA0",     // U+0160 latin capital letter s with caron
   "\xC2\x8B" => "\xE2\x80\xB9", // U+2039 single left-pointing angle quotation mark
   "\xC2\x8C" => "\xC5\x92",     // U+0152 latin capital ligature oe
   "\xC2\x8E" => "\xC5\xBD",     // U+017D latin capital letter z with caron
   "\xC2\x91" => "\xE2\x80\x98", // U+2018 left single quotation mark
   "\xC2\x92" => "\xE2\x80\x99", // U+2019 right single quotation mark
   "\xC2\x93" => "\xE2\x80\x9C", // U+201C left double quotation mark
   "\xC2\x94" => "\xE2\x80\x9D", // U+201D right double quotation mark
   "\xC2\x95" => "\xE2\x80\xA2", // U+2022 bullet
   "\xC2\x96" => "\xE2\x80\x93", // U+2013 en dash
   "\xC2\x97" => "\xE2\x80\x94", // U+2014 em dash
   "\xC2\x98" => "\xCB\x9C",     // U+02DC small tilde
   "\xC2\x99" => "\xE2\x84\xA2", // U+2122 trade mark sign
   "\xC2\x9A" => "\xC5\xA1",     // U+0161 latin small letter s with caron
   "\xC2\x9B" => "\xE2\x80\xBA", // U+203A single right-pointing angle quotation mark
   "\xC2\x9C" => "\xC5\x93",     // U+0153 latin small ligature oe
   "\xC2\x9E" => "\xC5\xBE",     // U+017E latin small letter z with caron
   "\xC2\x9F" => "\xC5\xB8",     // U+0178 latin capital letter y with diaeresis
);
$chr = array_keys  ($normalization_map); // but: for efficiency you should
$rpl = array_values($normalization_map); // pre-calculate these two arrays
$str = str_replace($chr, $rpl, $str);
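Put together, the order of operations might look like the sketch below: normalize the 0x80-0x9F range first, then translate the typographic quotes. Both maps are cut down to two entries each, and the input string is made up for the example.

```php
<?php
// Two-entry subsets of the maps above, for illustration only.
$normalization_map = array(
    "\xC2\x93" => "\xE2\x80\x9C", // U+0093 => U+201C left double quotation mark
    "\xC2\x94" => "\xE2\x80\x9D", // U+0094 => U+201D right double quotation mark
);
$chr_map = array(
    "\xE2\x80\x9C" => '"', // U+201C left double quotation mark
    "\xE2\x80\x9D" => '"', // U+201D right double quotation mark
);

// Made-up input: CP-1252 quotes that were converted as if they were ISO-8859-1.
$str = "\xC2\x93quoted\xC2\x94";

$str = strtr($str, $normalization_map); // step 1: 0x80-0x9F range to regular code points
$str = strtr($str, $chr_map);           // step 2: typographic quotes to ASCII quotes

echo $str, "\n";
```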
