Generating a Vocabulary from a Given Document

2019-11-06 07:40:20


Source: reposted

Contributed by: a site user

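In natural language processing tasks, text is usually preprocessed before it is used. One important part of that preprocessing is building a vocabulary. The Python code below does this, and the rest of the article walks through it step by step.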

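# Generate the vocabulary file (Python 3)
# START_VOCABULART is not defined in the original article. The reserved
# tokens below are an assumption for illustration only.
START_VOCABULART = ['__PAD__', '__GO__', '__EOS__', '__UNK__']

def gen_vocabulary_file(input_file, output_file):
    vocabulary = {}
    # Decode as UTF-8 while reading; without decoding, the tokens are not Chinese characters
    with open(input_file, encoding='utf-8') as f:
        counter = 0
        for line in f:
            counter += 1
            tokens = [word for word in line.strip()]  # one token per character
            for word in tokens:
                if word in vocabulary:  # already in the vocabulary: increment its count
                    vocabulary[word] += 1
                else:  # first occurrence: start at 1
                    vocabulary[word] = 1
    vocabulary_list = START_VOCABULART + sorted(vocabulary, key=vocabulary.get, reverse=True)
    # Keep the 5000 most frequent entries, which should be about enough
    if len(vocabulary_list) > 5000:
        vocabulary_list = vocabulary_list[:5000]
    print(input_file, " vocabulary size:", len(vocabulary_list))
    with open(output_file, "w", encoding='utf-8') as ff:
        for word in vocabulary_list:
            ff.write(word + '\n')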

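The function takes two parameters: the input file and the output file (the vocabulary). (1) Open the document and split each line into characters so that character frequencies can be counted: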

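with open(input_file, encoding='utf-8') as f:  # Python 3: decode as UTF-8 while reading
    counter = 0
    for line in f:
        counter += 1
        # The text must be decoded as UTF-8 (the original Python 2 code called
        # line.strip().decode('utf-8')); otherwise the vocabulary comes out garbled.
        tokens = [word for word in line.strip()]  # tokens is a list of characters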
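To see why decoding matters, here is a minimal Python 3 sketch that compares iterating over raw UTF-8 bytes with iterating over decoded text. The sample string is made up for illustration and does not come from the original article.

```python
# Sample text is an assumption for illustration.
raw = "你好 NLP".encode("utf-8")  # bytes, as read from a file opened in binary mode

# Iterating over bytes yields integers (one per byte), not characters.
byte_tokens = [b for b in raw.strip()]

# Decoding first yields one token per character.
char_tokens = [ch for ch in raw.strip().decode("utf-8")]

print(len(byte_tokens))  # 10: each Chinese character takes 3 bytes in UTF-8
print(char_tokens)       # ['你', '好', ' ', 'N', 'L', 'P']
```

Only the decoded form produces tokens that can be counted as Chinese characters.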

Count the frequency of each character in a dictionary:

for word in tokens:
    if word in vocabulary:  # already in the vocabulary: increment its count
        vocabulary[word] += 1
    else:  # first occurrence: start at 1
        vocabulary[word] = 1
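The same counting can probably be written more compactly with collections.Counter from the standard library. This is a sketch, not the article's original method. The helper name count_chars and the sample lines are hypothetical.

```python
from collections import Counter

def count_chars(lines):
    """Count how often each character occurs across an iterable of text lines."""
    vocabulary = Counter()
    for line in lines:
        # update() with a string counts each character in it
        vocabulary.update(line.strip())
    return vocabulary

# Sample lines are made up for illustration.
counts = count_chars(["我爱自然语言\n", "我爱编程\n"])
print(counts["我"], counts["爱"], counts["程"])  # 2 2 1
```

A Counter is a dict subclass, so it can be passed to sorted(..., key=vocabulary.get) in the same way as the plain dictionary.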

Build the vocabulary list, sorted by frequency in descending order:

vocabulary_list = START_VOCABULART + sorted(vocabulary, key=vocabulary.get, reverse=True)
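The sorting step can be checked on a small example. The reserved tokens and counts below are hypothetical. They stand in for START_VOCABULART and the real frequency dictionary, neither of which is shown in the article.

```python
# Hypothetical reserved tokens and counts, for illustration only.
special_tokens = ["__PAD__", "__UNK__"]
vocabulary = {"的": 5, "我": 3, "爱": 3, "你": 1}

# Iterating a dict yields its keys; key=vocabulary.get sorts them by count.
by_frequency = sorted(vocabulary, key=vocabulary.get, reverse=True)
vocabulary_list = special_tokens + by_frequency

print(vocabulary_list)  # ['__PAD__', '__UNK__', '的', '我', '爱', '你']
```

Python's sort is stable even with reverse=True, so characters with equal counts keep the order in which they were first seen.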

Keep the first 5000 entries:

if len(vocabulary_list) > 5000:
    vocabulary_list = vocabulary_list[:5000]  # cap the vocabulary at 5000 entries

Print the vocabulary size and write the vocabulary to the output file:

print(input_file, " vocabulary size:", len(vocabulary_list))
with open(output_file, "w", encoding='utf-8') as ff:
    for word in vocabulary_list:
        ff.write(word + '\n')  # '\n' is the newline escape; '/n' would write a literal slash and n
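A round trip is one way to confirm that the file really holds one token per line. The sketch below writes a small list and reads it back into a token-to-index mapping. The helper names write_vocabulary and load_vocabulary, the file name, and the tokens are all hypothetical.

```python
import os
import tempfile

def write_vocabulary(vocabulary_list, output_file):
    """Write one token per line, UTF-8 encoded."""
    with open(output_file, "w", encoding="utf-8") as ff:
        for word in vocabulary_list:
            ff.write(word + "\n")

def load_vocabulary(vocabulary_file):
    """Read the file back and map each token to its line index."""
    with open(vocabulary_file, encoding="utf-8") as f:
        tokens = [line.rstrip("\n") for line in f]
    return {token: index for index, token in enumerate(tokens)}

# Temporary file and sample tokens are made up for illustration.
path = os.path.join(tempfile.mkdtemp(), "vocabulary.txt")
write_vocabulary(["__UNK__", "的", "我"], path)
word_to_id = load_vocabulary(path)
print(word_to_id)  # {'__UNK__': 0, '的': 1, '我': 2}
```

Because the list is sorted by frequency, a lower index corresponds to a more frequent token.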

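If you run into encoding errors under Python 2, add the following at the top of the Python file. This workaround does not exist in Python 3, where passing encoding='utf-8' to open() is the usual fix: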

import sys
reload(sys)
sys.setdefaultencoding('utf-8')
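In Python 3, reload is no longer a builtin and sys.setdefaultencoding is not available. A minimal sketch of the usual alternative is to pass the encoding explicitly on every open call. The file name and contents here are made up for illustration.

```python
import os
import tempfile

# File name and contents are assumptions for illustration.
path = os.path.join(tempfile.mkdtemp(), "sample.txt")

# Pass the encoding explicitly instead of changing the interpreter default.
with open(path, "w", encoding="utf-8") as f:
    f.write("自然语言处理\n")

with open(path, encoding="utf-8") as f:
    text = f.read().strip()

print(list(text))  # ['自', '然', '语', '言', '处', '理']
```

Keeping the encoding next to each open call makes the behavior independent of the platform's default locale.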
