Strings are identical (using `base::identical`) and yet behave differently with `grepl` / `gsub`


Question

Related to: Convert uppercase words to title case

Some code that uses strings fetched from the web doesn't behave as I expect. You can reproduce the issue by running the following:

library(xml2)
library(magrittr)
x <- xml2::read_html("https://poesie.webnet.fr/lesgrandsclassiques/Authors/B") %>%
  gsub("^.*?<span>(Pierre-Jean de BÉRANGER)</span>.*$","\\1",.)
x # [1] "Pierre-Jean de BÉRANGER"

This string is identical to "Pierre-Jean de BÉRANGER" copied and pasted from the page source, yet the following behavior is very disturbing to me:

y <- "Pierre-Jean de BÉRANGER"
x == y  # TRUE
identical(x, y) # TRUE
gsub("\\b([A-Z])(\\w+)\\b", "\\1\\L\\2", x, perl = TRUE) # [1] "Pierre-Jean de BÉRANGER"
gsub("\\b([A-Z])(\\w+)\\b", "\\1\\L\\2", y, perl = TRUE) # [1] "Pierre-Jean de Béranger"
grepl("\\bB\\w+", x, perl = TRUE) # FALSE
grepl("\\bB\\w+", y, perl = TRUE) # TRUE
grepl("\\bB\\w", x, perl = TRUE)  # TRUE
grepl("\\bB\\w", y, perl = TRUE)  # TRUE

If x and y are identical, how can these calls give different output?

?identical describes the function as:

The safe and reliable way to test two objects for being exactly equal


There is one visible difference:

Encoding(x) # "UTF-8"
Encoding(y) # "latin1"

I'm on Windows.

Recommended answer

If you check the source of the identical() function, you can see that when it is passed a character vector, it compares the string elements (CHARSXP values) with the internal helper function Seql(). That function converts the string values to UTF-8 before doing the comparison. Thus identical() isn't checking that the encodings are the same, just that the value embedded in the encoding is the same.

In a perfect world, identical() would have an ignore.encoding= option, alongside the other properties you can already choose to ignore when doing a comparison.
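The point about identical() can be checked without fetching the web page. The sketch below is only an illustration: the names s_utf8 and s_latin1 are made up for the example, and it assumes iconv() on your platform supports latin1. It stores the same text in two encodings and compares both the values and the raw bytes.

```r
# Hypothetical names; the same text stored in two encodings, no network needed.
s_utf8   <- "B\u00c9RANGER"                               # \u escape yields a UTF-8 string
s_latin1 <- iconv(s_utf8, from = "UTF-8", to = "latin1")  # re-encoded and marked latin1

Encoding(s_utf8)    # "UTF-8"
Encoding(s_latin1)  # "latin1"

# identical() compares the translated values, so the encoding mark is ignored.
identical(s_utf8, s_latin1)                        # TRUE

# The underlying bytes differ: É takes two bytes in UTF-8 and one in latin1.
identical(charToRaw(s_utf8), charToRaw(s_latin1))  # FALSE
length(charToRaw(s_utf8))    # 9
length(charToRaw(s_latin1))  # 8
```

Comparing charToRaw() output, or Encoding() as in the question, is one way to detect the difference that identical() hides.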

But in theory the strings really should behave the same way. So I guess you could blame the "perl" version of the regular expression engine here for not dealing with the encoding properly. The default regular expression engine doesn't seem to have this problem:

grepl("B\\w+", x)
# [1] TRUE
grepl("B\\w+", y)
# [1] TRUE
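If perl = TRUE is needed, one possible workaround (not part of the answer above, so treat it as an assumption to verify on your own machine) is to bring both strings to the same encoding with enc2utf8() and to start the pattern with (*UCP), which ?regex describes as the way to make \w and \b Unicode-aware in PCRE. The strings are rebuilt locally under hypothetical names, and the sketch assumes a PCRE build with Unicode property support.

```r
# Hypothetical stand-ins for x (UTF-8) and y (latin1) from the question.
s_utf8   <- "Pierre-Jean de B\u00c9RANGER"
s_latin1 <- iconv(s_utf8, from = "UTF-8", to = "latin1")

# enc2utf8() re-encodes the latin1 string, so both inputs carry the same bytes.
s_fixed <- enc2utf8(s_latin1)
Encoding(s_fixed)                                 # "UTF-8"
identical(charToRaw(s_fixed), charToRaw(s_utf8))  # TRUE

# (*UCP) asks PCRE to use Unicode properties for \w and \b, so accented
# letters can count as word characters. Results are not annotated here
# because they depend on the PCRE build R was compiled against.
grepl("(*UCP)\\bB\\w+", s_fixed, perl = TRUE)
grepl("(*UCP)\\bB\\w+", s_utf8,  perl = TRUE)
```

Once the two strings have the same bytes and the same encoding mark, both calls receive the same input, which removes the encoding as a source of divergent results.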
