Remove (non-breaking) space character in string in R on Linux
This question seems to make it easy to remove space characters from a string in R. However, when I load the following table, I am not able to remove the space between two numbers (e.g. 11 846.4):
require(XML)
library(RCurl)
link2fetch = 'https://www.destatis.de/DE/ZahlenFakten/Wirtschaftsbereiche/LandForstwirtschaftFischerei/FeldfruechteGruenland/Tabellen/AckerlandHauptfruchtgruppenFruchtarten.html'
theurl = getURL(link2fetch, .opts = list(ssl.verifypeer = FALSE) ) # important!
area_cult10 = readHTMLTable(theurl, stringsAsFactors = FALSE)
area_cult10 = data.table::rbindlist(area_cult10)
test = sub(',', '.', area_cult10$V5) # change , to .
test = gsub('(.+)\\s([A-Z]{1})*', '\\1', test) # remove LETTERS
gsub('\\s', '', test) # remove white space?
Why can't I remove the space in test[1]?
Thanks for any advice! Can this be something other than a space character? Maybe the answer is really easy and I'm overlooking something.
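One way to check whether the character really is an ordinary space is to look at its code points. The sketch below uses a hypothetical sample string (not the actual table value) that contains U+00A0, the non-breaking space the title hints at. Whether '\\s' matches that character in R's default regex engine may depend on the platform, so no output is claimed for that case; the literal replacement shown here does not rely on the engine's definition of whitespace.

```r
# Hypothetical sample value; the real data comes from the scraped table.
x <- "11\u00a0846,4"

# Inspect the Unicode code points: 160 (U+00A0) is a non-breaking space,
# whereas an ordinary space would be 32.
codes <- utf8ToInt(x)
print(codes)

# Remove the character as a literal string, without any regex class.
cleaned <- gsub("\u00a0", "", x, fixed = TRUE)
print(cleaned)
```

If the third code point is 160 rather than 32, the "space" is a non-breaking space, which would explain why a plain whitespace pattern can miss it.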
You may shorten the creation of test to just 2 steps, using just 1 PCRE regex (note the perl=TRUE parameter):
test = sub(",", ".", gsub("(*UCP)[\\s\\p{L}]+|\\W+$", "", area_cult10$V5, perl=TRUE), fixed=TRUE)
Result:
[1] "11846.4" "6529.2" "3282.7" "616.0" "1621.8" "125.7" "14.2"
[8] "401.6" "455.5" "11.7" "160.4" "79.1" "37.6" "29.6"
[15] "" "13.9" "554.1" "236.7" "312.8" "4.6" "136.9"
[22] "1374.4" "1332.3" "1281.8" "3.7" "5.0" "18.4" "23.4"
[29] "42.0" "2746.2" "106.6" "2100.4" "267.8" "258.4" "13.1"
[36] "23.5" "11.6" "310.2"
The gsub regex is worth special attention:
(*UCP) - the PCRE verb that makes the pattern Unicode-aware
[\\s\\p{L}]+ - matches 1+ whitespace or letter characters
| - or (an alternation operator)
\\W+$ - 1+ non-word chars at the end of the string.
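The pieces of the pattern can be tried in isolation. This is a minimal sketch on hypothetical sample values that are only assumed to resemble the scraped column: a non-breaking space as thousands separator, an optional trailing letter, and a lone dash of the kind tables sometimes use for a missing value.

```r
# Hypothetical sample values (assumptions, not taken from the real table).
x <- c("11\u00a0846,4", "6\u00a0529,2 A", "-")

# (*UCP) makes \s and \W Unicode-aware, so \s also matches U+00A0.
# [\s\p{L}]+ removes runs of whitespace or letters anywhere;
# \W+$ removes non-word characters left at the end of the string.
pattern <- "(*UCP)[\\s\\p{L}]+|\\W+$"
stripped <- gsub(pattern, "", x, perl = TRUE)
print(stripped)
```

The decimal comma inside a number survives because it is neither whitespace nor a letter, and it is not at the end of the string.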
Then, sub(",", ".", x, fixed=TRUE) will replace the first , with a . as literal strings; fixed=TRUE saves performance since it does not have to compile a regex.
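The effect of sub versus a literal match can be seen in a small sketch with made-up inputs. The second pair uses "." as the pattern purely for illustration, since that is where treating the pattern as a regex or as a literal string gives different results.

```r
# sub() replaces only the first match; the second comma stays.
first_only <- sub(",", ".", "1,5,2", fixed = TRUE)
print(first_only)

# With fixed = TRUE the pattern "." is a literal dot.
literal <- sub(".", ",", "1.5", fixed = TRUE)
print(literal)

# Without fixed = TRUE the pattern "." is a regex matching any character,
# so the first character of the string is replaced instead.
as_regex <- sub(".", ",", "1.5")
print(as_regex)
```

For a pattern such as "," the two modes match the same text, so in the answer fixed=TRUE mainly avoids the regex compilation step.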