Regular expressions remove everything between <>

Views: 34 Published: 2021/9/24 18:54:58 r regex web-scraping gsub

Question

I am learning to web scrape. I have got hold of a bunch of data, but its structure is messy.
I have a vector of strings of this form:
"9,55< U+00A0>x< U+00A0>1016" (now that I am writing it, I think this is special syntax, because I cannot paste it here without putting a space before the "U"), which on the website I am scraping is written as "9,55*10^16".

My long-term goal is to turn this string into a numeric variable, i.e. 95500000000000000. But first I want to remove everything between the first "<" and the last ">". Below is my attempt.

gsub("<(.*?)>", "", vectorOfStrings)

Edit: the string is best generated in R using "9,55\U{00A0}x\U{00A0}1016", since the "<" and ">" are not actual literals in the string.
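A minimal sketch of why the gsub attempt leaves the string unchanged, assuming the string is built as described in the edit (the variable names here are illustrative, not from the original post). The "<U+00A0>" text is only how R prints the no-break space character, so there is no literal "<" for the pattern to match. Matching the character itself does work:

```r
# Rebuild the example string with real no-break spaces (U+00A0).
x <- "9,55\u00a0x\u00a01016"

# The string contains no literal "<", so the original pattern matches nothing.
has_bracket <- grepl("<", x, fixed = TRUE)
unchanged <- gsub("<(.*?)>", "", x)

# Removing the no-break space character directly does change the string.
stripped <- gsub("\u00a0", "", x, fixed = TRUE)
stripped  # "9,55x1016"
```

This removes only U+00A0; other non-ASCII characters in the scraped data would need their own handling.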

Recommended answer

The characters you're seeing are Unicode (UTF-8, I think), and the less-than/greater-than notation is R's representation of a character when it cannot display it clearly. To remove it, one method is to "convert" the text to ASCII:

iconv(vectorOfStrings, "utf-8", "ASCII", sub = "")

Anything non-translatable should be dropped.
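As a rough sketch of the remaining step toward the numeric value the question asks for, the cleaned string can be rewritten into R's scientific notation. This assumes every element follows the same "mantissa x 10 exponent" layout with a decimal comma, which the source shows for only one example. The names cleaned, sci, and value are hypothetical:

```r
# Example input with real no-break spaces, as in the question's edit.
vectorOfStrings <- "9,55\u00a0x\u00a01016"

# Drop everything that has no ASCII equivalent (the answer's approach).
cleaned <- iconv(vectorOfStrings, "UTF-8", "ASCII", sub = "")

# Assumption: decimal comma, and "x10" always introduces the exponent.
sci <- sub(",", ".", cleaned, fixed = TRUE)
sci <- sub("x10", "e", sci, fixed = TRUE)

# Parse the rewritten string, e.g. "9.55e16", as a double.
value <- as.numeric(sci)
```

Because iconv drops every non-ASCII character, a negative exponent written with a Unicode minus sign would silently lose its sign, so it may be worth checking the scraped data for that case first.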