Machine Learning in Action, Chapter 4, Sections 4.6-4.7. Example 1: Spam Filtering. Example 2: Finding Regional Tendencies from Personal Ads

2019-11-06 09:11:25


Source: reposted

Contributor: a site user

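This Machine Learning in Action blog series is mainly about implementing and understanding the code in the book, so it amounts to a set of reading notes. A hands-on book can't be learned by reading alone, and as soon as you start coding you run into all sorts of odd problems. The posts are fairly rough and should be read alongside the book. I am looking things up and learning as I go, so my knowledge is limited; please point out any mistakes in the comments. The book's code and data are widely available online, so please download them yourself.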

4.6 Spam filtering

4.6.1 Prepare the data: tokenizing text

A text string can be split into words with the string's split() method:

>>> mySent = 'This book is the best book on python or M.L. I have ever laid eyes upon'
>>> mySent.split()
['This', 'book', 'is', 'the', 'best', 'book', 'on', 'python', 'or', 'M.L.', 'I', 'have', 'ever', 'laid', 'eyes', 'upon']

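Here the punctuation is treated as part of the words ('M.L.'). A regular expression can be used to split the text instead, with the delimiter being any run of characters other than letters and digits.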

>>> import re
>>> regEX = re.compile(r'\W+')   # one or more non-word characters; \W* would also split on empty matches in Python 3.7+
>>> listOfTokens = regEX.split(mySent)
>>> listOfTokens
['This', 'book', 'is', 'the', 'best', 'book', 'on', 'python', 'or', 'M', 'L', 'I', 'have', 'ever', 'laid', 'eyes', 'upon']

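Next, remove empty strings and convert everything to lowercase. The regular expression already consumed the spaces; the len(tok) > 0 filter is there for the empty strings that re.split returns when the text begins or ends with a delimiter. This sentence has neither, so the first list is unchanged.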

>>> [tok for tok in listOfTokens if len(tok)>0]
['This', 'book', 'is', 'the', 'best', 'book', 'on', 'python', 'or', 'M', 'L', 'I', 'have', 'ever', 'laid', 'eyes', 'upon']
>>> [tok.lower() for tok in listOfTokens if len(tok)>0]
['this', 'book', 'is', 'the', 'best', 'book', 'on', 'python', 'or', 'm', 'l', 'i', 'have', 'ever', 'laid', 'eyes', 'upon']
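To see when the empty-string filter actually matters, here is a minimal sketch using a sentence that ends with punctuation. The names `tokenize` and `sample` are made up for this illustration and do not come from the book.

```python
import re

def tokenize(text):
    """Split on runs of non-word characters, drop empty strings, lowercase."""
    raw_tokens = re.split(r'\W+', text)
    return [tok.lower() for tok in raw_tokens if len(tok) > 0]

# Hypothetical sample text that ends with a delimiter (the final period).
sample = 'Hello, world. This is M.L.'

# Without the filter, the trailing delimiter leaves an empty string at the end.
raw = re.split(r'\W+', sample)
print(raw)      # the last element is ''

# With the filter, the empty string is gone and everything is lowercase.
tokens = tokenize(sample)
print(tokens)
```

The same thing happens at the start of the list when the text begins with a delimiter, which is common in real email bodies.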

4.6.2 Test the algorithm: cross validation with naive Bayes

File parsing and the complete spam test function

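The spam and ham folders each contain 25 emails. Ten of the 50 emails are chosen at random as the test set, and the remaining 40 form the training set. This approach is called hold-out cross validation. Because the selection is random, the output differs from run to run; for a more reliable estimate, repeat the experiment several times and average the error rates.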
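The random split described above can be sketched on its own, independent of the email data. This is only an illustration: `holdout_split` is a hypothetical helper, and it uses `random.sample` rather than the index-deletion loop in the book's code, although both produce a random 40/10 partition.

```python
import random

def holdout_split(n_samples, n_test):
    """Return (train_indices, test_indices) for one random hold-out split."""
    indices = list(range(n_samples))
    # Pick n_test distinct indices at random for the test set.
    test = random.sample(indices, n_test)
    held_out = set(test)
    # Everything that was not held out is used for training.
    train = [i for i in indices if i not in held_out]
    return train, test

# 50 emails in total, 10 held out for testing, as in this section.
train_idx, test_idx = holdout_split(50, 10)
print(len(train_idx), len(test_idx))
```

Each call gives a different partition, which is why the measured error rate changes between runs.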

import random
from numpy import array

# createVocabList, bagOfWords2VecMN, trainNB0 and classifyNB are the
# functions written earlier in this chapter's bayes.py.

def textParse(bigString):
    # Parse one big string into a list of lowercase tokens
    import re
    listOfTokens = re.split(r'\W+', bigString)
    # Keep only tokens longer than 2 characters and convert them to lowercase
    return [tok.lower() for tok in listOfTokens if len(tok) > 2]

def spamTest():
    docList = []; classList = []; fullText = []
    for i in range(1, 26):
        wordList = textParse(open('email/spam/%d.txt' % i).read())
        docList.append(wordList)       # list of lists: [[...], [...], ...]
        fullText.extend(wordList)      # one flat list
        classList.append(1)            # class label: spam
        wordList = textParse(open('email/ham/%d.txt' % i).read())
        docList.append(wordList)
        fullText.extend(wordList)
        classList.append(0)            # class label: ham
    vocabList = createVocabList(docList)          # build the vocabulary
    trainingSet = list(range(50)); testSet = []   # indices of all 50 emails; a list so entries can be deleted
    for i in range(10):                           # randomly move 10 of them to the test set
        randIndex = int(random.uniform(0, len(trainingSet)))
        testSet.append(trainingSet[randIndex])
        del(trainingSet[randIndex])
    trainMat = []; trainClasses = []
    for docIndex in trainingSet:
        trainMat.append(bagOfWords2VecMN(vocabList, docList[docIndex]))   # word vector
        trainClasses.append(classList[docIndex])                           # matching class label
    p0V, p1V, pSpam = trainNB0(array(trainMat), array(trainClasses))       # training yields three probabilities
    errorCount = 0
    for docIndex in testSet:                      # classify the held-out emails
        wordVector = bagOfWords2VecMN(vocabList, docList[docIndex])
        if classifyNB(array(wordVector), p0V, p1V, pSpam) != classList[docIndex]:
            errorCount += 1                       # count the misclassification
            print("classification error", docList[docIndex])
    print('the error rate is: ', float(errorCount)/len(testSet))
    # return vocabList, fullText

>>> bayes.spamTest()
classification error ['yeah', 'ready', 'may', 'not', 'here', 'because', 'jar', 'jar', 'has', 'plane', 'tickets', 'germany', 'for']
the error rate is: 0.1
>>> bayes.spamTest()
the error rate is: 0.0
>>> bayes.spamTest()
classification error ['experience', 'with', 'biggerpenis', 'today', 'grow', 'inches', 'more', 'the', 'safest', 'most', 'effective', 'methods', 'of_penisen1argement', 'save', 'your', 'time', 'and', 'money', 'bettererections', 'with', 'effective', 'ma1eenhancement', 'products', 'ma1eenhancement', 'supplement', 'trusted', 'millions', 'buy', 'today']
classification error ['yeah', 'ready', 'may', 'not', 'here', 'because', 'jar', 'jar', 'has', 'plane', 'tickets', 'germany', 'for']
the error rate is: 0.2
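Since the three runs above give three different error rates, one way to get a steadier number is to repeat the test and average. The sketch below assumes a variant of spamTest that returns its error rate instead of only printing it (the version above does not). `average_error` and `fake_run` are hypothetical names, and `fake_run` is just a stand-in that picks one of the three rates seen above so the snippet runs without the email data.

```python
import random

def average_error(run_once, n_runs=10):
    """Call run_once() n_runs times and return the mean of the returned error rates."""
    rates = [run_once() for _ in range(n_runs)]
    return sum(rates) / len(rates)

def fake_run():
    # Stand-in for a spamTest variant that ends with
    # `return float(errorCount)/len(testSet)` (an assumption, not the book's code).
    return random.choice([0.0, 0.1, 0.2])

mean_rate = average_error(fake_run, n_runs=100)
print('average error rate over 100 runs:', mean_rate)
```

With the real function, the call would look like average_error(spamTest, n_runs=10) once spamTest returns the rate.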