How to use awk to read a dictionary and replace words in a file?

Question

We have a source file ("source-A") that looks like this (if you see blue text, it comes from Stack Overflow, not from the text file):

The container of white spirit was made of aluminium.
We will use an aromatic method to analyse properties of white spirit.
No one drank white spirit at stag night.
Many people think that a potato crisp is savoury, but some would rather eat mashed potato.
...
more sentences

Each sentence in "source-A" is on its own line and ends with a newline (\n).

We have a dictionary/conversion file ("converse-B") that looks like this:

aluminium<tab>aluminum
analyse<tab>analyze
white spirit<tab>mineral spirits
stag night<tab>bachelor party
savoury<tab>savory
potato crisp<tab>potato chip
mashed potato<tab>mashed potatoes

"converse-B" is a two-column, tab-delimited file. Each equivalence mapping (term-on-left<tab>term-on-right) is on its own line and ends with a newline (\n).

How can we read "converse-B" and replace terms in "source-A", so that each term found in column 1 of "converse-B" is replaced with the term in column 2, and then write the result to an output file ("output-C")?

For example, "output-C" would look like this:

The container of mineral spirits was made of aluminum.
We will use an aromatic method to analyze properties of mineral spirits.
No one drank mineral spirits at bachelor party.
Many people think that a potato chip is savory, but some would rather eat mashed potatoes.

The tricky part is the term "potato", which appears in both "potato crisp" and "mashed potato".

If a "simple" awk solution cannot handle both a singular term (potato) and a plural term (potatoes), we'll use a manual substitution method. The awk solution can skip that use case.

In other words, an awk solution can stipulate that it only works for an unambiguous word, or for a term composed of space-separated, unambiguous words.

An awk solution will get us to a 90% completion rate; we'll do the remaining 10% manually.

Recommended answer

sed is probably a better fit, since this is only phrase/word replacement. Note that if the same word appears in multiple phrases, it is first come, first served, so order your dictionary accordingly.

$ sed -f <(sed -E 's_(.+)\t(.+)_s/\1/\2/g_' dict) content

The container of mineral spirits was made of aluminum.
We will use an aromatic method to analyze properties of mineral spirits.
No one drank mineral spirits at bachelor party.
Many people think that a potato chip is savory, but some would rather eat mashed potatoes.
...
more sentences
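
The ordering note can be illustrated with a small sketch. The rule "potato -> spud" is hypothetical and is not part of the dictionary in the question; it is only there to overlap with "potato crisp". The rules run in sequence on each line, so an earlier rule can prevent a later one from matching, and a later rule also sees text produced by an earlier one.

```shell
# Hypothetical overlapping rules; "potato -> spud" is NOT in the original dictionary.
line='a potato crisp'

# Short rule first: it rewrites "potato", so the phrase rule no longer matches.
short_first=$(printf '%s\n' "$line" |
  sed -e 's/potato/spud/g' -e 's/potato crisp/potato chip/g')

# Phrase rule first: the phrase is replaced, then the short rule rewrites the result.
phrase_first=$(printf '%s\n' "$line" |
  sed -e 's/potato crisp/potato chip/g' -e 's/potato/spud/g')

printf '%s\n%s\n' "$short_first" "$phrase_first"
# prints:
# a spud crisp
# a spud chip
```

Putting longer phrases before the single words they contain is usually the safer order.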

The sed statement inside the process substitution, <( ... ), converts the dictionary entries into sed expressions, and the main sed uses them for the content replacements (here "dict" is the dictionary file and "content" is the source file).
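
To see what the main sed actually receives, the inner command can be run on its own. The sketch below uses a throwaway two-entry dictionary and, as an assumption for portability, a literal tab held in a shell variable instead of \t (which not every sed understands in a pattern); the variable names are illustrative only.

```shell
# Build a throwaway two-entry dictionary (tab-separated), mirroring "converse-B".
dict=$(mktemp)
printf 'aluminium\taluminum\nwhite spirit\tmineral spirits\n' > "$dict"

# A literal tab, so the pattern does not rely on sed understanding \t.
tab=$(printf '\t')

# Same transformation as the inner sed of the answer:
#   left<tab>right  ->  s/left/right/g
generated=$(sed -E "s_(.+)${tab}(.+)_s/\1/\2/g_" "$dict")
printf '%s\n' "$generated"
# prints:
# s/aluminium/aluminum/g
# s/white spirit/mineral spirits/g

rm -f "$dict"
```

Each dictionary line becomes one substitution command, and sed -f then applies those commands, in dictionary order, to every line of the content file.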

NB: a production-quality script should take care of letter case and also of word boundaries, to eliminate unwanted substring substitutions; both are ignored here.
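
Since the question asked for awk, the same idea can also be sketched in awk. This is a minimal sketch, not a definitive implementation: the function name dict_replace is hypothetical, the dictionary terms are passed to gsub() as regular expressions (so metacharacters in a term, or & in a replacement, would need escaping), and, like the sed version, it ignores letter case and word boundaries.

```shell
# dict_replace DICT SOURCE
# Hypothetical helper: applies every "left<tab>right" rule in DICT to SOURCE.
dict_replace() {
  awk -F '\t' '
    # First file (the dictionary): remember the rules in file order.
    NR == FNR { from[++n] = $1; to[n] = $2; next }
    # Second file (the source): apply the rules in that same order.
    {
      for (i = 1; i <= n; i++)
        gsub(from[i], to[i])   # from[i] is treated as a regular expression
      print
    }
  ' "$1" "$2"
}
```

With the files from the question, it would be called as: dict_replace converse-B source-A > output-C. Keeping the rules in a numbered array rather than looping with "for (k in from)" preserves the dictionary order, which matters for overlapping terms.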
