How can I replace Unicode escape sequences with Turkish characters in a text file with Python?

Question

I am working with Twitter data. I collected tweets through the Stream API, and the app returns JSON. I wrote the tweet data to a text file, and now I see Unicode escape sequences instead of Turkish characters. I don't want to find and replace them by hand in Notepad++. Is there an automatic way in Python to open the txt file, read all of its data, and replace the Unicode escape sequences with Turkish characters?

Here are the Turkish characters and the Unicode escape sequences that I want to replace with them:

  • ğ - \u011f
  • Ğ - \u011e
  • ı - \u0131
  • İ - \u0130
  • ö - \u00f6
  • Ö - \u00d6
  • ü - \u00fc
  • Ü - \u00dc
  • ş - \u015f
  • Ş - \u015e
  • ç - \u00e7
  • Ç - \u00c7

I tried two different approaches:

#!/usr/bin/env python
# -*- coding: utf-8 -*-

import re

dosya = open('veri.txt', 'r')

for line in dosya:
    match = re.search(line, "\u011f")
    if (match):
        replace("\u011f", "ğ")

dosya.close()

and:

#!/usr/bin/env python
# -*- coding: utf-8 -*-

f1 = open('veri.txt', 'r')
f2 = open('veri2.txt', 'w')

for line in f1:
    f2.write=(line.replace('\u011f', 'ğ')) 
    f2.write=(line.replace('\u011e', 'Ğ'))
    f2.write=(line.replace('\u0131', 'ı'))
    f2.write=(line.replace('\u0130', 'İ'))
    f2.write=(line.replace('\u00f6', 'ö'))
    f2.write=(line.replace('\u00d6', 'Ö'))
    f2.write=(line.replace('\u00fc', 'ü'))
    f2.write=(line.replace('\u00dc', 'Ü'))
    f2.write=(line.replace('\u015f', 'ş'))
    f2.write=(line.replace('\u015e', 'Ş'))
    f2.write=(line.replace('\u00e7', 'ç'))
    f2.write=(line.replace('\u00c7', 'Ç'))

f1.close()
f2.close()

Neither of these worked. How can I make this work?

Recommended answer

JSON allows both "escaped" and "unescaped" characters. The Twitter API returns only escaped characters so that its output can be encoded as plain ASCII, which increases interoperability. To store Turkish characters unescaped, you need an encoding that can represent them. The open function assumes your current locale encoding by default, which is probably what your editor expects. If you want the output file to use a different encoding, e.g. ISO-8859-9, you can pass encoding='ISO-8859-9' as an additional parameter to open.
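As a minimal sketch of the encoding parameter mentioned above, the following writes the twelve Turkish characters from the question to a file as ISO-8859-9 and reads them back. The scratch file name and location are assumptions made for illustration only.

```python
import os
import tempfile

text = 'ğĞıİöÖüÜşŞçÇ'

# Hypothetical scratch file; any writable path would do.
path = os.path.join(tempfile.mkdtemp(), 'turkish.txt')

# Write with an explicit encoding instead of relying on the locale default.
with open(path, 'w', encoding='ISO-8859-9') as out:
    out.write(text)

# ISO-8859-9 is a single-byte encoding, so each character occupies one byte.
with open(path, 'rb') as raw:
    raw_bytes = raw.read()

# Reading back with the same encoding recovers the original text.
with open(path, 'r', encoding='ISO-8859-9') as inp:
    round_trip = inp.read()
```

The file must be opened with the same encoding by whatever reads it later, including your editor.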

You can read a file containing a JSON object with the json.load function. This returns a Python object with the escaped characters decoded. Writing it again with json.dump and passing ensure_ascii=False as an argument writes the object back to a file without encoding Turkish characters as escape sequences. An example:

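import json

# Read the escaped JSON; json.load decodes the \uXXXX sequences.
with open('input.txt', 'r') as inp:
    in_as_obj = json.load(inp)

# Write it back without re-escaping non-ASCII characters.
with open('output.txt', 'w') as out:
    json.dump(in_as_obj, out, ensure_ascii=False)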
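To see the effect of ensure_ascii in isolation, here is a small in-memory sketch that needs no files. The sample string is made up for illustration.

```python
import json

# Pure-ASCII JSON text, similar in form to what the API delivers.
escaped = '{"text": "\\u011f\\u00fc\\u015f"}'

# Parsing decodes the escape sequences into real characters.
obj = json.loads(escaped)

# The default (ensure_ascii=True) escapes non-ASCII characters again.
ascii_out = json.dumps(obj)

# ensure_ascii=False keeps the characters as they are.
unicode_out = json.dumps(obj, ensure_ascii=False)

print(obj['text'])
print(ascii_out)
print(unicode_out)
```

Both outputs are valid JSON describing the same object; they differ only in how the non-ASCII characters are spelled.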

Your file isn't really a JSON file, but instead a file containing multiple JSON objects. If each JSON object is on its own line, you can try the following:

import json

with open('input.txt', 'r') as inp, open('output.txt', 'w') as out:
    for line in inp:
        # Copy blank lines through unchanged.
        if not line.strip():
            out.write(line)
            continue
        # Decode the escapes, then write the object back unescaped.
        in_as_obj = json.loads(line)
        json.dump(in_as_obj, out, ensure_ascii=False)
        out.write('\n')
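If some lines are not valid JSON, so that json.loads cannot be used, one possible fallback is to replace the literal backslash-u sequences with a regular expression. This is a different technique from the json-based one above and is only a sketch: the function name unescape_bmp is hypothetical, surrogate halves (which JSON uses for characters such as emoji) are deliberately left untouched, and a backslash that was itself escaped in the source would be misread.

```python
import re

# Matches a literal backslash, the letter u, and exactly four hex digits.
ESCAPE = re.compile(r'\\u([0-9a-fA-F]{4})')

def unescape_bmp(text):
    """Replace literal \\uXXXX sequences with the characters they name."""
    def repl(match):
        code = int(match.group(1), 16)
        # Leave surrogate halves alone; a lone surrogate is not a valid character.
        if 0xD800 <= code <= 0xDFFF:
            return match.group(0)
        return chr(code)
    return ESCAPE.sub(repl, text)

# The doubled backslashes produce the literal six-character sequences
# that appear in the text file.
sample = 'g\\u00fcne\\u015f ve \\u00e7i\\u00e7ek'
print(unescape_bmp(sample))
```

Where the data is valid JSON, the json module remains the safer choice because it handles every escape form the format defines.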

But in your case it's probably better to write unescaped JSON to the file in the first place. Try replacing your on_data method with the following (untested):

def on_data(self, raw_data):
    data = json.loads(raw_data)
    print(json.dumps(data, ensure_ascii=False))
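Since the question writes tweets to a text file rather than printing them, the same idea can be sketched as a standalone helper that appends one unescaped JSON object per line. The function name append_tweet, the default file name, and the choice of UTF-8 are assumptions for illustration, not part of any Twitter library.

```python
import json

def append_tweet(raw_data, path='tweets.txt'):
    """Append one tweet to path as a single line of unescaped JSON."""
    data = json.loads(raw_data)
    # UTF-8 is chosen here because it can represent every character;
    # pass a different encoding if your tools expect one.
    with open(path, 'a', encoding='utf-8') as out:
        out.write(json.dumps(data, ensure_ascii=False) + '\n')
```

Calling append_tweet(raw_data) from on_data would then produce a file that can be read back one line at a time with json.loads.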
