Nutch Quick Start (Nutch 2.2.1)

2019-11-06 08:17:21


Source: reposted

Contributed by: a site user

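Compared with Nutch 1.x, Nutch 2.x splits the storage layer out into Gora, so data can be stored in a variety of databases such as HBase, Cassandra, or MySQL. Nutch 1.7, by contrast, stores its data directly on HDFS.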

1. Install and run HBase

For simplicity, use standalone mode; see the HBase Quick Start.

1.1 Download and extract

wget http://archive.apache.org/dist/hbase/hbase-0.90.4/hbase-0.90.4.tar.gz
tar zxf hbase-0.90.4.tar.gz

1.2 Edit conf/hbase-site.xml

Set its contents as follows:

<configuration>
  <property>
    <name>hbase.rootdir</name>
    <value>file:///DIRECTORY/hbase</value>
  </property>
  <property>
    <name>hbase.zookeeper.property.dataDir</name>
    <value>/DIRECTORY/zookeeper</value>
  </property>
</configuration>

hbase.rootdir is the directory where HBase keeps its data; the default is /tmp/hbase-${user.name}/hbase. hbase.zookeeper.property.dataDir is the directory where ZooKeeper keeps its data (HBase ships with a built-in ZooKeeper); the default is /tmp/hbase-${user.name}/zookeeper.
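As a minimal sketch, the same file can be generated from the shell so that DIRECTORY is substituted consistently in both properties. The variable names DATA_DIR and CONF_DIR are my own, not HBase settings, and CONF_DIR defaults to a scratch directory so the sketch is safe to run as-is; point it at hbase-0.90.4/conf to use it for real.

```shell
#!/bin/sh
# Sketch: write a standalone-mode hbase-site.xml.
# DATA_DIR and CONF_DIR are illustrative names, not HBase settings.
DATA_DIR="${DATA_DIR:-$HOME/hbase-data}"   # plays the role of DIRECTORY above
CONF_DIR="${CONF_DIR:-$(mktemp -d)}"       # set to hbase-0.90.4/conf for real use

mkdir -p "$CONF_DIR"

# The heredoc is unquoted on purpose so that $DATA_DIR is expanded.
cat > "$CONF_DIR/hbase-site.xml" <<EOF
<configuration>
  <property>
    <name>hbase.rootdir</name>
    <value>file://$DATA_DIR/hbase</value>
  </property>
  <property>
    <name>hbase.zookeeper.property.dataDir</name>
    <value>$DATA_DIR/zookeeper</value>
  </property>
</configuration>
EOF

echo "wrote $CONF_DIR/hbase-site.xml"
```

Because DATA_DIR is an absolute path, file://$DATA_DIR expands to the three-slash file:/// form used above.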
1.3 Start HBase

$ ./bin/start-hbase.sh
starting Master, logging to logs/hbase-user-master-example.org.out

1.4 Try the shell

$ ./bin/hbase shell
HBase Shell; enter 'help' for list of supported commands.
Type "exit" to leave the HBase Shell
Version 0.90.4, r1150278, Sun Jul 24 15:53:29 PDT 2011

hbase(main):001:0>
Create a table named test with a single column family named cf. To check that the table was created, list it with the list command, then insert a few values with put.

hbase(main):003:0> create 'test', 'cf'
0 row(s) in 1.2200 seconds
hbase(main):003:0> list 'test'
..
1 row(s) in 0.0550 seconds
hbase(main):004:0> put 'test', 'row1', 'cf:a', 'value1'
0 row(s) in 0.0560 seconds
hbase(main):005:0> put 'test', 'row2', 'cf:b', 'value2'
0 row(s) in 0.0370 seconds
hbase(main):006:0> put 'test', 'row3', 'cf:c', 'value3'
0 row(s) in 0.0450 seconds

Use the scan command to scan the table and confirm that the inserts succeeded.

hbase(main):007:0> scan 'test'
ROW        COLUMN+CELL
row1       column=cf:a, timestamp=1288380727188, value=value1
row2       column=cf:b, timestamp=1288380738440, value=value2
row3       column=cf:c, timestamp=1288380747365, value=value3
3 row(s) in 0.0590 seconds

Now disable and drop the table, which undoes everything done above.

hbase(main):012:0> disable 'test'
0 row(s) in 1.0930 seconds
hbase(main):013:0> drop 'test'
0 row(s) in 0.0770 seconds

Exit the shell:

hbase(main):014:0> exit

1.5 Stop HBase

$ ./bin/stop-hbase.sh
stopping hbase...............

1.6 Start it again
Nutch stores its data in HBase in the steps that follow, so HBase needs to be running.

$ ./bin/start-hbase.sh
starting Master, logging to logs/hbase-user-master-example.org.out

2 Build Nutch 2.2.1

2.1 Download and extract

wget http://www.apache.org/dyn/closer.cgi/nutch/2.2.1/apache-nutch-2.2.1-src.tar.gz
tar zxf apache-nutch-2.2.1-src.tar.gz

2.2 Edit the configuration files
See the Nutch 2.0 Tutorial.

Edit conf/nutch-site.xml:

<property>
  <name>storage.data.store.class</name>
  <value>org.apache.gora.hbase.store.HBaseStore</value>
  <description>Default class for storing data</description>
</property>

Edit ivy/ivy.xml:

<!-- Uncomment this to use HBase as Gora backend. -->
<dependency org="org.apache.gora" name="gora-hbase" rev="0.3" conf="*->default" />

Edit conf/gora.properties and make sure HBaseStore is set as the default datastore:

gora.datastore.default=org.apache.gora.hbase.store.HBaseStore

2.3 Build
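ant runtime

The first run downloads a large number of jars, so expect to wait a while.

You may see errors like the following:

Trying to override old definition of task javac
  [taskdef] Could not load definitions from resource org/sonar/ant/antlib.xml. It could not be found.

ivy-probe-antlib:

ivy-download:
  [taskdef] Could not load definitions from resource org/sonar/ant/antlib.xml. It could not be found.

They are harmless and can be ignored.

The build takes a while to finish. Once it completes, two new directories appear: build and runtime.

Steps 3, 4, 5 and 6 are identical to steps 3, 4, 5 and 6 of the companion post, Nutch Quick Start (Nutch 1.7).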
3 Add seed URLs

mkdir ~/urls
vim ~/urls/seed.txt

Add the seed URL to seed.txt:

http://movie.douban.com/subject/5323968/

4 Set URL filter rules
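If you only want to crawl certain kinds of URLs, put regular expressions in conf/regex-urlfilter.txt; only URLs that match those expressions will be fetched.

For example, to crawl only Douban movie pages, I set it up like this:

# comment out this line
# skip URLs containing certain characters as probable queries, etc.
#-[?*!@=]

# accept anything else
# comment out this line
#+.
+^http:\/\/movie\.douban\.com\/subject\/[0-9]+\/(\?.+)?$

5 Set the agent name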
conf/nutch-site.xml:
<property>
  <name>http.agent.name</name>
  <value>My Nutch Spider</value>
</property>

This step comes from the book Web Crawling and Data Mining with Apache Nutch, page 14.
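Before moving on, steps 3 and 4 can be sanity-checked together: every seed URL should pass the filter rule, otherwise the crawl has nothing to start from. The sketch below approximates the accept rule from step 4 with grep -E. That is an assumption of convenience, since Nutch evaluates the rule with Java regular expressions, but for a pattern this simple the two should agree. The helper name and every sample URL other than the seed are made up for illustration.

```shell
#!/bin/sh
# Sketch: test URLs against the accept rule from conf/regex-urlfilter.txt.
# The leading "+" of the rule is dropped, and "\/" is written as "/",
# which matches the same strings.
PATTERN='^http://movie\.douban\.com/subject/[0-9]+/(\?.+)?$'

# url_passes_filter is a hypothetical helper, not part of Nutch.
# It returns success (0) when the URL given as $1 matches PATTERN.
url_passes_filter() {
  printf '%s\n' "$1" | grep -Eq "$PATTERN"
}

for url in \
  'http://movie.douban.com/subject/5323968/' \
  'http://movie.douban.com/subject/5323968/?from=showing' \
  'http://movie.douban.com/tag/' \
  'http://music.douban.com/subject/25811077/'
do
  if url_passes_filter "$url"; then
    echo "accept $url"
  else
    echo "reject $url"
  fi
done
```

The first two URLs are accepted and the last two are rejected, because the rule only admits numeric subject pages on movie.douban.com, optionally followed by a query string.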
6 Install Solr

Indexing requires Solr, so we need to install and start a Solr server.

See steps 4, 5 and 6 of the Nutch Tutorial, and the Solr Tutorial.

6.1 Download and extract

wget http://mirrors.cnnic.cn/apache/lucene/solr/4.6.1/solr-4.6.1.tgz
tar -zxf solr-4.6.1.tgz
6.2 Run Solr

cd example
java -jar start.jar

Verify that it started: open http://localhost:8983/solr/admin/ in a browser. If the page loads, Solr is running.
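6.3 Integrate Nutch with Solr

Copy NUTCH_DIR/conf/schema-solr4.xml to SOLR_DIR/solr/collection1/conf/, rename it to schema.xml, and add the following line at the end of <fields>...</fields> (for an explanation, see "Solr 4.2 - what is _version_field?"):

<field name="_version_" type="long" indexed="true" stored="true" multiValued="false"/>

Restart Solr:

# Ctrl+C to stop Solr
java -jar start.jar

Steps 7 and 8 are also very similar to steps 7 and 8 of the Nutch 1.7 post. The main difference is that Nutch 2.x no longer keeps its data as files and directories on disk; everything is stored in HBase.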
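The schema edit from 6.3 can also be scripted. The sketch below is one possible way to do it with awk, assuming the schema has a closing </fields> tag on a line of its own. The helper name add_version_field is mine, and the script runs against a small stand-in file (not the real schema-solr4.xml) so it can be tried without a Solr install.

```shell
#!/bin/sh
# Sketch: insert the _version_ field just before </fields> in a schema file.
FIELD='<field name="_version_" type="long" indexed="true" stored="true" multiValued="false"/>'

# add_version_field is a hypothetical helper; $1 is the path to schema.xml.
add_version_field() {
  schema="$1"
  # Do nothing if the field is already present, so the script can be re-run.
  if grep -q 'name="_version_"' "$schema"; then
    return 0
  fi
  tmp="$schema.tmp"
  # Print the new field once, right before the first </fields> line.
  awk -v field="$FIELD" '
    /<\/fields>/ && !done { print "    " field; done = 1 }
    { print }
  ' "$schema" > "$tmp" && mv "$tmp" "$schema"
}

# Stand-in for SOLR_DIR/solr/collection1/conf/schema.xml (contents invented).
WORK_DIR="$(mktemp -d)"
cat > "$WORK_DIR/schema.xml" <<'EOF'
<schema name="nutch" version="1.5">
  <fields>
    <field name="id" type="string" stored="true" indexed="true"/>
  </fields>
</schema>
EOF

add_version_field "$WORK_DIR/schema.xml"
cat "$WORK_DIR/schema.xml"
```

Checking for the field first keeps the edit idempotent, which matters if the copy-and-patch step is repeated after rebuilding Nutch.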
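7 Crawl step by step with individual commands

In this section I follow the crawl steps strictly, one at a time, to demystify the crawler. Interested readers can also read the bin/crawl script, which shows each step clearly.

First delete any data left over from a previous run:

$ rm -rf TestCrawl/

7.1 Basic concepts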
Nutch data is composed of:
The crawl database, or crawldb. This contains information about every URL known to Nutch, including whether it was fetched, and, if so, when.
The link database, or linkdb. This contains the list of known links to each URL, including both the source URL and anchor text of the link.
A set of segments. Each segment is a set of URLs that are fetched as a unit. Segments are directories with the following subdirectories:
  a crawl_generate names a set of URLs to be fetched
  a crawl_fetch contains the status of fetching each URL
  a content contains the raw content retrieved from each URL
  a parse_text contains the parsed text of each URL
  a parse_data contains outlinks and metadata parsed from each URL
  a crawl_parse contains the outlink URLs, used to update the crawldb

7.2 inject: build the crawldb from the seed URL list
$ bin/nutch inject TestCrawl/crawldb ~/urls

This builds a URL database from the seed URLs under ~/urls and places it in the crawldb directory.
7.3 generate
$ bin/nutch generate TestCrawl/crawldb TestCrawl/segments

This generates a fetch list and stores it in a date-named directory under segments/. We save that directory's name in the shell variable s1:
$ s1=`ls -d TestCrawl/segments/2* | tail -1`
$ echo $s1

7.4 fetch

$ bin/nutch fetch $s1

This creates two subdirectories under $s1: crawl_fetch and content.
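The `ls -d ... | tail -1` idiom from 7.3 works because segment directories are named with a timestamp, so sorting by name also sorts by time and the last entry is the newest. A small sketch with stand-in directories shows the effect. Only 20140203004348 appears in this post; the other two names are invented. It uses $( ) instead of backticks, which behaves the same here.

```shell
#!/bin/sh
# Sketch: pick the newest timestamp-named segment directory.
# The directories are created in a scratch location purely for illustration.
CRAWL_DIR="$(mktemp -d)/TestCrawl"
mkdir -p "$CRAWL_DIR/segments/20140203004348" \
         "$CRAWL_DIR/segments/20140203011502" \
         "$CRAWL_DIR/segments/20140202235901"

# ls sorts names lexicographically; with fixed-width timestamps that is
# also chronological order, so the last line is the most recent segment.
s1=$(ls -d "$CRAWL_DIR"/segments/2* | tail -1)
echo "$s1"
```

The printed path ends in 20140203011502, the latest of the three timestamps.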
7.5 parse
$ bin/nutch parse $s1

This creates three subdirectories under $s1: crawl_parse, parse_data and parse_text.
7.6 updatedb
$ bin/nutch updatedb TestCrawl/crawldb $s1

This renames crawldb/current to crawldb/old and generates a new crawldb/current.

7.7 View the results

$ bin/nutch readdb TestCrawl/crawldb/ -stats

7.8 invertlinks
Before indexing, we first invert all the links, so that we can collect every anchor text pointing at a page and index those anchor texts.

$ bin/nutch invertlinks TestCrawl/linkdb -dir TestCrawl/segments

7.9 solrindex: submit data to Solr and build the index

$ bin/nutch solrindex http://localhost:8983/solr TestCrawl/crawldb/ -linkdb TestCrawl/linkdb/ TestCrawl/segments/20140203004348/ -filter -normalize

7.10 solrdedup: remove duplicates from the index
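Sometimes data gets added more than once, leaving duplicates in the index that need to be removed:

$ bin/nutch solrdedup http://localhost:8983/solr

7.11 solrclean: delete from the index

Stale data can also be deleted from the index:

$ bin/nutch solrclean TestCrawl/crawldb/ http://localhost:8983/solr

8 Crawl in one step with the crawl script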
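So far we typed several commands by hand, one step at a time, to complete a crawl. Nutch also ships with a script, ./bin/crawl, that combines all the crawl steps into a single command. Here is its usage:

$ ./bin/crawl
Missing seedDir : crawl <seedDir> <crawlID> <solrURL> <numberOfRounds>

Note that the second argument here is a crawlId, no longer a crawlDir.

First delete the data produced in section 7: in the HBase shell, disable the table and then drop it.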
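A minimal sketch of that cleanup, assuming the table is named after the crawlId with a _webpage suffix (as in the scan command later in this post). The helper only prints the HBase shell commands; its name is hypothetical, and piping the output into ./bin/hbase shell is an assumption about your setup that is worth trying on a throwaway table first.

```shell
#!/bin/sh
# Sketch: print the HBase shell commands that remove a crawl's table.
# hbase_drop_commands is a hypothetical helper, not part of HBase or Nutch.
hbase_drop_commands() {
  crawl_id="$1"
  table="${crawl_id}_webpage"   # assumed naming: <crawlId>_webpage
  # A table must be disabled before it can be dropped.
  printf "disable '%s'\n" "$table"
  printf "drop '%s'\n" "$table"
}

hbase_drop_commands TestCrawl

# To run the commands for real, one option is:
#   hbase_drop_commands TestCrawl | ./bin/hbase shell
```

Printing the commands first makes it easy to confirm the table name before anything is deleted.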
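8.1 Crawl

$ ./bin/crawl ~/urls/ TestCrawl http://localhost:8983/solr/ 2

~/urls is the directory that holds the seed URLs.
TestCrawl is the crawlId; this creates a table in HBase with the crawlId as its prefix, for example TestCrawl_webpage.
http://localhost:8983/solr/ is the Solr server.
2 is numberOfRounds, the number of iterations.

After a while a flood of URLs scrolls by on the screen, showing that the crawler is fetching.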
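To keep the four positional arguments straight, a thin wrapper can name them before calling the script. This is only a sketch: the variable names are mine, the values are the ones used above, and DRY_RUN is a made-up switch that prints the command instead of running it, so the sketch can be tried without Nutch installed.

```shell
#!/bin/sh
# Sketch: name the four positional arguments of bin/crawl.
SEED_DIR="${SEED_DIR:-$HOME/urls/}"                   # seedDir: directory holding seed.txt
CRAWL_ID="${CRAWL_ID:-TestCrawl}"                     # crawlID: prefix of the HBase table
SOLR_URL="${SOLR_URL:-http://localhost:8983/solr/}"   # solrURL
ROUNDS="${ROUNDS:-2}"                                 # numberOfRounds
DRY_RUN="${DRY_RUN:-1}"                               # hypothetical switch: 1 = only print

# Reject a non-numeric round count early instead of letting the crawl fail.
case "$ROUNDS" in
  ''|*[!0-9]*) echo "ROUNDS must be a non-negative integer" >&2; exit 1 ;;
esac

# Build the command as positional parameters so each argument stays one word.
set -- ./bin/crawl "$SEED_DIR" "$CRAWL_ID" "$SOLR_URL" "$ROUNDS"
CMD_LINE="$*"

if [ "$DRY_RUN" = "1" ]; then
  echo "$CMD_LINE"
else
  "$@"
fi
```

Run with DRY_RUN=0 from the Nutch runtime directory to start the crawl itself.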
fetching http://music.douban.com/subject/25811077/ (queue crawl delay=5000ms)
fetching http://read.douban.com/ebook/1919781 (queue crawl delay=5000ms)
fetching http://www.douban.com/online/11670861/ (queue crawl delay=5000ms)
fetching http://book.douban.com/tag/绘本 (queue crawl delay=5000ms)
fetching http://movie.douban.com/tag/科幻 (queue crawl delay=5000ms)
49/50 spinwaiting/active, 56 pages, 0 errors, 0.9 1 pages/s, 332 245 kb/s, 131 URLs in 5 queues
fetching http://music.douban.com/subject/25762454/ (queue crawl delay=5000ms)
fetching http://read.douban.com/reader/ebook/1951242/ (queue crawl delay=5000ms)
fetching http://www.douban.com/mobile/read-notes (queue crawl delay=5000ms)
fetching http://book.douban.com/tag/诗歌 (queue crawl delay=5000ms)
50/50 spinwaiting/active, 61 pages, 0 errors, 0.9 1 pages/s, 334 366 kb/s, 127 URLs in 5 queues

8.2 View the results
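./bin/nutch readdb -crawlId TestCrawl -stats

You can also look in the HBase shell:

cd ~/hbase-0.90.4
./bin/hbase shell
hbase(main):001:0> scan 'TestCrawl_webpage'

The screen fills with continuous output; press Ctrl+C to stop it.

When scanning the table, if you are unsure what a column means, consult conf/gora-hbase-mapping.xml, which defines the column families and the columns.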







