Python 中文正则表达式笔记

2020-03-16 21:12:57

字体：大中小

来源：转载

供稿：网友

总结在 python 语言里使用正则表达式匹配中文的经验。

从字符串的角度来说，中文不如英文整齐、规范，这是不可避免的现实。本文结合网上资料以及个人经验，以 python 语言为例，稍作总结。欢迎补充或挑错。
一点经验
可以使用 repr()函数查看字串的原始格式。这对于写正则表达式有所帮助。
Python 的 re模块有两个相似的函数：re.match(), re.search 。两个函数的匹配过程完全一致，只是起点不同。match只从字串的开始位置进行匹配，如果失败，它就此放弃；而search则会锲而不舍地完全遍历整个字串中所有可能的位置，直到成功地找到一个匹配，或者搜索完字串，以失败告终。如果你了解match的特性（在某些情况下比较快），大可以自由用它；如果不太清楚，search通常是你需要的那个函数。
从一堆文本中，找出所有可能的匹配，以列表的形式返回，这种情况用findall()这个函数。例子见后面的代码。
utf8下，每个汉字占据3个字符位置，正则式为[/x80-/xff]{3}，这个都知道了吧。
unicode下，汉字的格式如/uXXXX，只要找到对应的字符集的范围，就能匹配相应的字串，方便从多语言文本中挑出所需要的某种语言的文本。不过，对于像日文这样的粘着语，既有中文字符，又有平假名片假名，或许结果会有所偏差。
两种字符类可以并列在一起使用，例如，平假名、片假名、中文的放在一起，u"[/u4e00-/u9fa5/u3040-/u309f/u30a0-/u30ff]+"，来自定义所需要匹配的文本。
匹配中文时，正则表达式和目标字串的格式必须相同。这一点至关重要。或者都用默认的utf8，此时你不用额外做什么；如果是unicode，就需要在正则式之前加上u""格式。
可以这样定义unicode字符串：string=u"我爱正则表达式"。如果字串不是unicode的，可以使用unicode()函数转换之。如果你知道源字串的编码，可以使用newstr=unicode(oldstring, original_coding_name)的方式转换，例如 linux 下常用unicode(string, "utf8")，windows 下或许会用cp936吧，没测试。
例程序

复制代码代码如下:

		
		#!/usr/bin/python 
		# -*- coding: utf-8 -*- 
		# 
		#author: rex 
		#blog: http://iregex.org 
		#filename py_utf8_unicode.py 
		#created: 2010-06-27 09:11 
		import re 
		def findPart(regex, text, name): 
		res=re.findall(regex, text) 
		if res: 
		print "There are %d %s parts:/n"% (len(res), name) 
		for r in res: 
		print "/t",r 
		print 
		#sample is utf8 by default. 
		sample='''en: Regular expression is a powerful tool for manipulating text. 
		zh: 正则表达式是一种很有用的处理文本的工具。 
		jp: 正規表現は非常に役に立つツールテキストを操作することです。 
		jp-char: あアいイうウえエおオ 
		kr:정규 표현식은 매우 유용한 도구 텍스트를 조작하는 것입니다. 
		puc: 。？！、，；：“ ”‘ '——……·－·《》〈〉！￥％＆＊＃ 
		''' 
		#let's look its raw representation under the hood: 
		print "the raw utf8 string is:/n", repr(sample) 
		print 
		#find the non-ascii chars: 
		findPart(r"[/x80-/xff]+",sample,"non-ascii") 
		#convert the utf8 to unicode 
		usample=unicode(sample,'utf8') 
		#let's look its raw representation under the hood: 
		print "the raw unicode string is:/n", repr(usample) 
		print 
		#get each language parts: 
		findPart(u"[/u4e00-/u9fa5]+", usample, "unicode chinese") 
		findPart(u"[/uac00-/ud7ff]+", usample, "unicode korean") 
		findPart(u"[/u30a0-/u30ff]+", usample, "unicode japanese katakana") 
		findPart(u"[/u3040-/u309f]+", usample, "unicode japanese hiragana") 
		findPart(u"[/u3000-/u303f/ufb00-/ufffd]+", usample, "unicode cjk Punctuation") 

其输出结果为：

复制代码代码如下:

		
		the raw utf8 string is: 
		'en: Regular expression is a powerful tool for manipulating text./nzh: /xe6/xad/xa3/xe5/x88/x99/xe8/xa1/xa8/xe8/xbe/xbe/xe5/xbc/x8f/xe6/x98/xaf/xe4/xb8/x80/xe7/xa7/x8d/xe5/xbe/x88/xe6/x9c/x89/xe7/x94/xa8/xe7/x9a/x84/xe5/xa4/x84/xe7/x90/x86/xe6/x96/x87/xe6/x9c/xac/xe7/x9a/x84/xe5/xb7/xa5/xe5/x85/xb7/xe3/x80/x82/njp: /xe6/xad/xa3/xe8/xa6/x8f/xe8/xa1/xa8/xe7/x8f/xbe/xe3/x81/xaf/xe9/x9d/x9e/xe5/xb8/xb8/xe3/x81/xab/xe5/xbd/xb9/xe3/x81/xab/xe7/xab/x8b/xe3/x81/xa4/xe3/x83/x84/xe3/x83/xbc/xe3/x83/xab/xe3/x83/x86/xe3/x82/xad/xe3/x82/xb9/xe3/x83/x88/xe3/x82/x92/xe6/x93/x8d/xe4/xbd/x9c/xe3/x81/x99/xe3/x82/x8b/xe3/x81/x93/xe3/x81/xa8/xe3/x81/xa7/xe3/x81/x99/xe3/x80/x82/njp-char: /xe3/x81/x82/xe3/x82/xa2/xe3/x81/x84/xe3/x82/xa4/xe3/x81/x86/xe3/x82/xa6/xe3/x81/x88/xe3/x82/xa8/xe3/x81/x8a/xe3/x82/xaa/nkr:/xec/xa0/x95/xea/xb7/x9c /xed/x91/x9c/xed/x98/x84/xec/x8b/x9d/xec/x9d/x80 /xeb/xa7/xa4/xec/x9a/xb0 /xec/x9c/xa0/xec/x9a/xa9/xed/x95/x9c /xeb/x8f/x84/xea/xb5/xac /xed/x85/x8d/xec/x8a/xa4/xed/x8a/xb8/xeb/xa5/xbc /xec/xa1/xb0/xec/x9e/x91/xed/x95/x98/xeb/x8a/x94 /xea/xb2/x83/xec/x9e/x85/xeb/x8b/x88/xeb/x8b/xa4./npuc: /xe3/x80/x82/xef/xbc/x9f/xef/xbc/x81/xe3/x80/x81/xef/xbc/x8c/xef/xbc/x9b/xef/xbc/x9a/xe2/x80/x9c /xe2/x80/x9d/xe2/x80/x98 /xe2/x80/x99/xe2/x80/x94/xe2/x80/x94/xe2/x80/xa6/xe2/x80/xa6/xc2/xb7/xef/xbc/x8d/xc2/xb7/xe3/x80/x8a/xe3/x80/x8b/xe3/x80/x88/xe3/x80/x89/xef/xbc/x81/xef/xbf/xa5/xef/xbc/x85/xef/xbc/x86/xef/xbc/x8a/xef/xbc/x83/n' 
		There are 14 non-ascii parts: 
		正则表达式是一种很有用的处理文本的工具。 
		正規表現は非常に役に立つツールテキストを操作することです。 
		あアいイうウえエおオ 
		정규 
		표현식은 
		매우 
		유용한 
		도구 
		텍스트를 
		조작하는 
		것입니다 
		。？！、，；：“ 
		”‘ 
		'——……·－·《》〈〉！￥％＆＊＃ 
		the raw unicode string is: 
		u'en: Regular expression is a powerful tool for manipulating text./nzh: /u6b63/u5219/u8868/u8fbe/u5f0f/u662f/u4e00/u79cd/u5f88/u6709/u7528/u7684/u5904/u7406/u6587/u672c/u7684/u5de5/u5177/u3002/njp: /u6b63/u898f/u8868/u73fe/u306f/u975e/u5e38/u306b/u5f79/u306b/u7acb/u3064/u30c4/u30fc/u30eb/u30c6/u30ad/u30b9/u30c8/u3092/u64cd/u4f5c/u3059/u308b/u3053/u3068/u3067/u3059/u3002/njp-char: /u3042/u30a2/u3044/u30a4/u3046/u30a6/u3048/u30a8/u304a/u30aa/nkr:/uc815/uaddc /ud45c/ud604/uc2dd/uc740 /ub9e4/uc6b0 /uc720/uc6a9/ud55c /ub3c4/uad6c /ud14d/uc2a4/ud2b8/ub97c /uc870/uc791/ud558/ub294 /uac83/uc785/ub2c8/ub2e4./npuc: /u3002/uff1f/uff01/u3001/uff0c/uff1b/uff1a/u201c /u201d/u2018 /u2019/u2014/u2014/u2026/u2026/xb7/uff0d/xb7/u300a/u300b/u3008/u3009/uff01/uffe5/uff05/uff06/uff0a/uff03/n' 
		There are 6 unicode chinese parts: 
		正则表达式是一种很有用的处理文本的工具 
		正規表現 
		非常 
		役 
		立 
		操作 
		There are 8 unicode korean parts: 
		정규 
		표현식은 
		매우 
		유용한 
		도구 
		텍스트를 
		조작하는 
		것입니다 
		There are 6 unicode japanese katakana parts: 
		ツールテキスト 
		ア 
		イ 
		ウ 
		エ 
		オ 
		There are 11 unicode japanese hiragana parts: 
		は 
		に 
		に 
		つ 
		を 
		することです 
		あ 
		い 
		う 
		え 
		お 
		There are 5 unicode cjk Punctuation parts: 
		。 
		。 
		。？！、，；： 
		－ 
		《》〈〉！￥％＆＊＃