Handling Chinese text in Python: gb2312 <==> utf8

Author: 100test  Published: 2007-04-06 21:44:11
Source: 100Test.Com (百考试题网)


A rough starting point, in the hope of drawing out better answers

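These are notes I collected and organized some time ago. They are somewhat messy, but fairly comprehensive.
They cover Windows, Python 2.3, and PyQt. PyGTK and Tkinter are similar to PyQt in that all of them use Unicode.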


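I think the best solution would be a library that works directly from a gb18030 mapping table.
I have collected a gb18030 mapping table of more than 830K, so the two tables needed for both directions come to more than 1.6M.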


On Windows 2000 SP3 with Python 2.2:

from Tkinter import *
w = Button(text="中国".decode("mbcs"), font="simhei", command=exit)
w.pack()
w.mainloop()
This approach treats the symptom, not the cause.
Sometimes I still mix up mbcs (GB) byte strings and unicode strings.
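One way to avoid mixing the two kinds of string is to decode once, at the point where bytes enter the program, and work with text everywhere else. The following is only a minimal sketch of that idea. It is written for Python 3 (the article's own code is Python 2), it names the encoding explicitly as gb2312 instead of relying on the Windows-only mbcs codec, and `to_text` is a hypothetical helper, not part of any library.

```python
def to_text(value, encoding="gb2312"):
    """Return value as a text (Unicode) string.

    bytes are decoded with the given encoding; text is returned unchanged.
    """
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


gb_bytes = "中国".encode("gb2312")  # a GB-encoded byte string (2 bytes per character)
label = to_text(gb_bytes)           # decoded once, at the boundary
same = to_text(label)               # already text, so it passes through untouched

print(label, same)
```

The resulting text string is what a Unicode-based toolkit such as Tkinter expects for a widget label.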



This approach has a drawback: because it relies on mbcs, it only works on Windows.
One workaround is to install:
http://sourceforge.net/projects/python-codecs/
A SourceForge project working on additional support for Asian codecs for use
with Python. They are in the early stages of development at the time of this
writing -- look in their FTP area for downloadable files.
(see Python Library Reference, section 4.9)
It can be used after a few small modifications.

Download four files:
eucgb23212utf.py (182K),
utf2eucgb2321.py (182K)
( http://cvs.sourceforge.net/cgi-bin/viewcvs.cgi/python-codecs/practicecodecs/ChineseCodecs/chinesecn/Attic/ ),
eucgb2321_cn.py
( http://cvs.sourceforge.net/cgi-bin/viewcvs.cgi/python-codecs/practicecodecs/ChineseCodecs/Python/ ),
test.py

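There is a setup.py, but I did not know how to use it, so I made the changes by hand:

1. Replace EUCGB2321_CN with gb2312 everywhere, both in the file names and in the file contents.

2. Add an entry at the end of aliases.py:
# eucgb2321_cn codec
'gb2312': 'gb2312',

3. In c:\python22\lib\encodings, create a new directory named chinesecn
and put three files in it: gb23122utf.py (182K), utf2gb2312.py (182K),
and an empty __init__.py.

4. Put gb2312.py (originally named eucgb2321_cn.py ?) in the encodings directory.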
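After the manual steps above, it helps to confirm that Python can actually find the codec before running anything else. The sketch below uses `codecs.lookup` from the standard library; `codec_available` is a hypothetical helper name, and the code is written in Python 3 syntax. Note that Python 2.4 and later ship a gb2312 codec in the standard library, so on those versions the check should succeed without any manual installation.

```python
import codecs


def codec_available(name):
    """Return True if Python can find a codec registered under this name."""
    try:
        codecs.lookup(name)
    except LookupError:
        return False
    return True


print("gb2312 codec available:", codec_available("gb2312"))
```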


Note (2003.7):
EUCGB2321_CN is the Chinese character encoding used on Unix.

Or simply download the ready-made package:
http://bbs1.nju.edu.cn/file/gb2312.rar

------------------------------------------------------------------------
Run test.py:

gbstring = "大家好"
#print gbstring

uni = unicode(gbstring, "gb2312")

gstring = uni.encode("gb2312")

print "Original gb2312 encoded string:"
print gbstring
print "Transcode to Unicode encoding:"
print repr(uni)
print "Print as a gb2312 encoded string:"
print gstring

------------------------------------------------------------
Output:
Original gb2312 encoded string:
大家好
Transcode to Unicode encoding:
u'\u5927\u5bb6\u597d'
Print as a gb2312 encoded string:
大家好
------------------------------------------------------------------------------
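For comparison, here is a rough Python 3 equivalent of the same round trip. This is a sketch under the assumption of a Python 3 interpreter, where string literals are already Unicode text and the gb2312 codec is built in, so `unicode(gbstring, "gb2312")` becomes `bytes.decode("gb2312")`.

```python
text = "大家好"

gb_bytes = text.encode("gb2312")  # text -> GB2312 bytes
uni = gb_bytes.decode("gb2312")   # GB2312 bytes -> text
back = uni.encode("gb2312")       # text -> GB2312 bytes again

print("Transcode to Unicode:")
print(ascii(uni))                 # escaped form of the decoded text
print("Round trip preserved the bytes:")
print(back == gb_bytes)
```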
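The drawbacks of this approach: it is a little cumbersome (every string needs unicode(gbstring, "gb2312")),
and it only handles gb2312, not gb18030. For gb18030 I have collected a mapping table of more than 830K, so the two tables for both directions come to more than 1.6M.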

The advantage is portability: it works on both Windows and Linux,
and with Tkinter, PyQt, PyGTK, and wxPython alike.
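The gb2312-only limitation mentioned above can be illustrated with a character that falls outside the GB2312 set. The sketch below assumes Python 3, whose standard library includes both a gb2312 and a gb18030 codec; `can_encode` is a hypothetical helper. It uses the traditional character 龍, which GB2312 (a simplified-character set) does not contain.

```python
def can_encode(text, encoding):
    """Return True if every character in text exists in the given encoding."""
    try:
        text.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True


simplified = "龙"   # simplified form, part of GB2312
traditional = "龍"  # traditional form, outside GB2312

for ch in (simplified, traditional):
    print(ascii(ch), "gb2312:", can_encode(ch, "gb2312"),
          "gb18030:", can_encode(ch, "gb18030"))
```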

---------------------------------------------------------------------------
btw,
eucgb2321: is it 2321 or 2312? This one had me confused ^_^
EUCGB2321_CN is the Chinese character encoding used on Unix.

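I originally used Mr. Du Wenshan's Chinese localization package ( http://dohao.org ), but he can no longer keep it up to date,
so I had to find another way.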



Advice from a Python developer

From: Martin v. Loewis
Subject: Re: Chinese language support of Python?

Newsgroups: comp.lang.python
Date: 2002-07-07 01:01:02 PST



Leon Wang writes:

> But still can not put Chinese directly as string in source, I can not
> live with so much \u... for a whole Chinese sensence/paragraph, its
> impossible to read and edit them

This is a known problem, and it will be addressed with PEP 263
(http://www.python.org/peps/pep-0263.html).

Meanwhile, you have the following options:

- Don't use IDLE to edit Python source code (but, say, notepad), and
only put Chinese text into string literals.
- Set the default encoding in site.py to the encoding you want to use.
- Apply patch
http://sourceforge.net/tracker/index.php?func=detail&aid=508973&group_id=9579&atid=309579

which allows you to declare the source encoding for IDLE.

In either case, you cannot use Chinese in Unicode literals. Instead,
you should always use

unicode("chinese string", "chinese encoding")

For portability, and if your editors support it, I recommend to use
UTF-8 as the "chinese encoding".

Regards,
Martin
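The advice above, together with the gb2312 <==> utf8 conversion in the title, can be sketched as follows. This is a minimal illustration assuming Python 3, where PEP 263 is long since implemented; the first line is a PEP 263 style source encoding declaration, and `transcode` is a hypothetical helper, not a library function.

```python
# -*- coding: utf-8 -*-
# PEP 263 style declaration: it tells Python how this source file is encoded
# (it only takes effect on the first or second line of a file).


def transcode(data, source, target):
    """Re-encode bytes from one encoding to another by way of Unicode text."""
    return data.decode(source).encode(target)


gb_bytes = "中文字符串".encode("gb2312")
utf8_bytes = transcode(gb_bytes, "gb2312", "utf-8")   # gb2312 -> utf8
round_trip = transcode(utf8_bytes, "utf-8", "gb2312")  # utf8 -> gb2312

print(len(gb_bytes), len(utf8_bytes), round_trip == gb_bytes)
```

Going through Unicode text in the middle is the same idea as `unicode("chinese string", "chinese encoding")` in the message: the byte encoding is named explicitly at each boundary.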

