Parsing RTF files into R?


Question

I couldn't find much support for this in R. I'm trying to read a number of RTF files into R to construct a data frame, but I'm struggling to find a good way to parse an RTF file while ignoring its structure and formatting. There are really only two pieces of text I want to pull from each file, but they are nested within the file's structure.

I've pasted a sample RTF file below. The two strings I'd like to capture are:

  1. "Buy a 26 Inch LCD-TV Today or a 32 Inch Next Month? Modeling Purchases of High-tech Durable Products"

  2. "The technology level [...] and managerial implications." (the full paragraph)

Any thoughts on how to parse this efficiently? I think regular expressions might help, but I'm struggling to form the right expression to get the job done.

{\rtf1\ansi\ansicpg1252\cocoartf1265
{\fonttbl\f0\fswiss\fcharset0 ArialMT;\f1\froman\fcharset0 Times-Roman;}
{\colortbl;\red255\green255\blue255;\red0\green0\blue0;\red109\green109\blue109;}
\margl1440\margr1440\vieww10800\viewh8400\viewkind0
\deftab720

\itap1\trowd \taflags0 \trgaph108\trleft-108 \trbrdrt\brdrnil \trbrdrl\brdrnil \trbrdrt\brdrnil \trbrdrr\brdrnil 
\clvertalt \clshdrawnil \clwWidth15680\clftsWidth3 \clbrdrt\brdrnil \clbrdrl\brdrnil \clbrdrb\brdrnil \clbrdrr\brdrnil \clpadl0 \clpadr0 \gaph\cellx8640

\itap2\trowd \taflags0 \trgaph108\trleft-108 \trbrdrt\brdrnil \trbrdrl\brdrnil \trbrdrt\brdrnil \trbrdrr\brdrnil 
\clmgf \clvertalt \clshdrawnil \clwWidth14840\clftsWidth3 \clbrdrt\brdrnil \clbrdrl\brdrnil \clbrdrb\brdrnil \clbrdrr\brdrnil \clpadl0 \clpadr0 \gaph\cellx4320
\clmrg \clvertalt \clshdrawnil \clwWidth14840\clftsWidth3 \clbrdrt\brdrnil \clbrdrl\brdrnil \clbrdrb\brdrnil \clbrdrr\brdrnil \clpadl0 \clpadr0 \gaph\cellx8640
\pard\intbl\itap2\pardeftab720

\f0\b\fs26 \cf0 Buy a 26 Inch LCD-TV Today or a 32 Inch Next Month? Modeling Purchases of High-tech Durable Products\nestcell 
\pard\intbl\itap2\nestcell \lastrow\nestrow
\pard\intbl\itap1\pardeftab720

\f1\b0\fs24 \cf0 \
\pard\intbl\itap1\pardeftab720

\f0\fs26 \cf0 The technology level of new high-tech durable products, such as digital cameras and LCD-TVs, continues to go up, while prices continue to go down. Consumers may anticipate these trends. In particular, a consumer faces several options. The first is to buy the current level of technology at the current price. The second is not to buy and stick with the currently owned (old) level of technology. Hence, the consumer postpones the purchase and later on buys the same level of technology at a lower price, or better technology at the same price. We develop a new model to describe consumers\'92 decisions with respect to buying these products. Our model is built on the theory of consumer expectations of price and the well-known utility maximizing framework. Since not every consumer responds the same, we allow for observed and unobserved consumer heterogeneity. We calibrate our model on a panel of several thousand consumers. We have information on the currently owned technology and on purchases in several categories of high-tech durables. Our model provides new insights in these product markets and managerial implications.\cell \lastrow\row
\pard\pardeftab720

\f1\fs24 \cf0 \
}

Recommended answer

1) If you are on Windows, a simple way is to open the file in WordPad or Word and then save it as a plain text document.

2) Alternatively, to parse it directly in R: read in the RTF file and find the lines matching the pattern pat, producing g. Then replace any \' sequences with single quotes, producing noq. Finally, remove the text matched by pat and any trailing junk. Note that only the \' itself is replaced, so an escape such as \'92 leaves its hex digits behind ('92). This works on the sample, but you might need to revise the patterns if the files contain embedded backslash sequences other than the \' that is already handled:

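# Read the RTF file as plain lines of text
Lines <- readLines("myfile.rtf")
# Content lines start with \f0 and carry their text after "\cf0 "
pat <- "^\\\\f0.*\\\\cf0 "
g <- grep(pat, Lines, value = TRUE)
# Replace each \' with a plain single quote
noq <- gsub("\\\\'", "'", g)
# Strip the leading control words, then everything from the next backslash on
sub("\\\\.*", "", sub(pat, "", noq))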

For the sample file, this is the output:

[1] "Buy a 26 Inch LCD-TV Today or a 32 Inch Next Month? Modeling Purchases of High-tech Durable Products"
[2] "The technology level of new high-tech durable products, such as digital cameras and LCD-TVs, continues to go up, while prices continue to go down. Consumers may anticipate these trends. In particular, a consumer faces several options. The first is to buy the current level of technology at the current price. The second is not to buy and stick with the currently owned (old) level of technology. Hence, the consumer postpones the purchase and later on buys the same level of technology at a lower price, or better technology at the same price. We develop a new model to describe consumers'92 decisions with respect to buying these products. Our model is built on the theory of consumer expectations of price and the well-known utility maximizing framework. Since not every consumer responds the same, we allow for observed and unobserved consumer heterogeneity. We calibrate our model on a panel of several thousand consumers. We have information on the currently owned technology and on purchases in several categories of high-tech durables. Our model provides new insights in these product markets and managerial implications."
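
The second string above still contains '92, the leftover hex digits of the RTF escape \'92. As a rough sketch (not part of the original answer), such \'hh escapes could be decoded before the other steps. The helper name decode_rtf_hex is hypothetical, and the code assumes the Windows-1252 code page declared by \ansicpg1252 in the sample header and an iconv build that knows "CP1252":

```r
# Hypothetical helper: decode RTF \'hh escapes into real characters.
# Assumes the file uses Windows-1252, as \ansicpg1252 in the sample suggests.
decode_rtf_hex <- function(x) {
  # Find every \'hh escape: backslash, apostrophe, two hex digits
  m <- gregexpr("\\\\'[0-9a-fA-F]{2}", x)
  regmatches(x, m) <- lapply(regmatches(x, m), function(esc) {
    vapply(esc, function(e) {
      # Drop the leading \' and turn the two hex digits into one byte
      byte <- as.raw(strtoi(substring(e, 3), 16L))
      # Reinterpret that CP1252 byte as a UTF-8 character
      iconv(rawToChar(byte), from = "CP1252", to = "UTF-8")
    }, character(1), USE.NAMES = FALSE)
  })
  x
}
```

Applied to g before the gsub step, this would turn consumers\'92 into "consumers" followed by a right single quotation mark (U+2019), since byte 0x92 maps to that character in Windows-1252. Bytes that are undefined in the code page are not handled here.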

Revised several times. Added WordPad/Word solution.
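
Since the question is about building a data frame from many RTF files, the steps of approach 2 could be wrapped in a function and applied to each file. This is only a sketch: the names extract_rtf_fields and read_rtf_dir are hypothetical, and it assumes every file has the same layout as the sample, so that the first matching line is the title and the second is the paragraph.

```r
# Hypothetical helper: apply the answer's steps to one file, return one row.
extract_rtf_fields <- function(path) {
  Lines <- readLines(path, warn = FALSE)
  pat <- "^\\\\f0.*\\\\cf0 "
  g <- grep(pat, Lines, value = TRUE)
  noq <- gsub("\\\\'", "'", g)
  txt <- sub("\\\\.*", "", sub(pat, "", noq))
  # Assumption: first match is the title, second is the paragraph
  data.frame(file = basename(path),
             title = txt[1],
             abstract = txt[2],
             stringsAsFactors = FALSE)
}

# Hypothetical helper: process every .rtf file in a directory.
read_rtf_dir <- function(dir) {
  files <- list.files(dir, pattern = "\\.rtf$", full.names = TRUE)
  # Stack the one-row data frames into a single data frame
  do.call(rbind, lapply(files, extract_rtf_fields))
}
```

A call such as read_rtf_dir("rtf_dir"), where "rtf_dir" is a placeholder folder name, would then return one row per file. Files with fewer than two matching lines get NA in the missing column, which makes them easy to spot and inspect by hand.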
