How to delete the \n\t\t\t in the result from website data collection?


Problem description

I want to retrieve the product names from the website, so I wrote the code below. But the result includes stray whitespace characters such as \n\t\t\t. Can someone help me remove them? Code:

# retrieve name
reddoturl <- 'http://red-dot.de/pd/online-exhibition/?lang=en&c=163&a=0&y=2013&i=0&oes='
library(XML)
doc <- htmlParse(reddoturl)

# review data
reviews <- xpathSApply(doc, '//div[@class="work_contaienterner_headline"]', xmlValue)

Result:

[1] "VZ-C6 / VZ-C3D\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\t\t\tDocument Camera\n\t\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t"

Solution

The go-to functions for "find and replace" operations on strings in R are sub (to replace just the first match) and gsub (to replace all matches). These functions search the string for a pattern given as a regular expression, and replace each match with a replacement string.

For example:

s <- "VZ-C6 / VZ-C3D\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\t\t\tDocument Camera\n\t\t\t\t\t\t\t\t\t\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t"

gsub('\t|\n', '', s)

[1] "VZ-C6 / VZ-C3DDocument Camera"

The pipe operator (|) in the pattern above, \t|\n, means alternation: it matches either \t or \n. The second argument, '', says to replace each match with an empty string (i.e. nothing).
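As a side note, removing the whitespace outright glues "VZ-C3D" and "Document Camera" together. If a separator is wanted, one possible variation (a sketch, not part of the original answer) is to replace each run of tabs and newlines with a single space and then trim the ends with trimws. The string below is a shortened, hypothetical stand-in for the scraped value, and the names removed and collapsed are only for illustration.

```r
# Shortened, hypothetical version of the scraped string from the question.
s <- "VZ-C6 / VZ-C3D\n\t\t\n\t\tDocument Camera\n\t\t\n\t\t"

# A character class is equivalent to the alternation '\t|\n'.
removed <- gsub("[\t\n]", "", s)

# Replace each run of tabs/newlines with one space, then trim both ends.
collapsed <- trimws(gsub("[\t\n]+", " ", s))

print(removed)
print(collapsed)
```

The first call reproduces the answer's behaviour; the second keeps the product name and its category readable as separate words.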

While s above contains just a single element, gsub and sub are vectorised and so will also work on an entire vector of arbitrary length.
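To illustrate the vectorised behaviour and the difference between sub and gsub, here is a minimal sketch. The vector reviews_raw is made up for the example (the second element is purely hypothetical); in the question it would be the reviews vector returned by xpathSApply.

```r
# Hypothetical vector standing in for the scraped 'reviews' result.
reviews_raw <- c(
  "VZ-C6 / VZ-C3D\n\t\tDocument Camera\n\t",
  "Product B\n\t\t\tCategory B\n"
)

# gsub is applied to every element and removes every tab and newline.
cleaned <- gsub("\t|\n", "", reviews_raw)

# sub only replaces the first match in each element.
first_only <- sub("\t|\n", "", "a\n\tb")

print(cleaned)
print(first_only)
```

Here sub removes only the leading newline of the match sequence, so the tab is still present in first_only.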
