
An Introduction to woody, an HTML Extractor

2020-03-24 17:41:13
Source: reposted
Contributed by: a site user
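woody is an HTML parser/extractor for Java. Its usage is very similar to webmagic: it is a complete rewrite of webmagic's extraction templates, split out into a separate project so that it is easier to reuse.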

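Some new features:

Multiple result data types (String, char, byte, short, int, long, double, float, String[], Set, List, Date)

Support for user-defined script processing functions (currently configured as JavaScript functions)

Swappable CSS and XPath engines

Filter support

Caching of the CSS and XPath engine objects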
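The caching feature in the list above can be pictured with a small sketch. This is not woody's implementation: the class name `CompiledExprCache` and its methods are hypothetical, and `java.util.regex.Pattern` stands in for the compiled CSS or XPath objects so that the example runs without any third-party library. The idea is simply that an expression string is compiled once and the compiled object is reused on later lookups.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper, not part of woody: caches compiled expressions by their source text. */
final class CompiledExprCache {
    // Expression text -> compiled object. Pattern is a stand-in for a compiled CSS/XPath object.
    private final Map<String, Pattern> cache = new ConcurrentHashMap<>();

    /** Compiles the expression on first use and returns the same compiled object afterwards. */
    public Pattern get(String expr) {
        return cache.computeIfAbsent(expr, Pattern::compile);
    }

    /** Returns capture group 1 of the first match, or null when nothing matches. */
    public String extractFirst(String expr, String text) {
        Matcher m = get(expr).matcher(text);
        return m.find() ? m.group(1) : null;
    }

    /** Number of distinct expressions compiled so far. */
    public int size() {
        return cache.size();
    }

    public static void main(String[] args) {
        CompiledExprCache cache = new CompiledExprCache();
        String expr = "(\\d+) comments";
        System.out.println(cache.extractFirst(expr, "12 comments"));
        // Both lookups return the identical compiled instance.
        System.out.println(cache.get(expr) == cache.get(expr));
    }
}
```

Keying the cache on the raw expression string fits annotation-driven extraction, where the same few expressions are evaluated against every page.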

A complete example:

import java.util.Date;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
// woody's own imports (AnnotationExtractor, Model, @Inject, @ExtractBy, ...) are
// omitted here because the source does not give their package names.

public class OsChinaBlog {

    public static void main(String[] args) throws Exception {
        // Fetch the page with jsoup, then hand the raw HTML to woody.
        Document doc = Jsoup.connect("http://www.oschina.net/news/43879/webmagic-0-3-0")
                .timeout(60000)
                .userAgent("Mozilla/5.0 (Macintosh; Intel Mac OS X 10.8; rv:23.0) Gecko/20100101 Firefox/23.0")
                .get();
        String html = doc.html();
        OsChinaBlogModel model = AnnotationExtractor.me().process(html, OsChinaBlogModel.class);
        System.out.println(model.toJson());
    }

    public static class OsChinaBlogModel extends Model {

        public OsChinaBlogModel() {
            // no-arg constructor, used for reflection
        }

        // Two alternative expressions combined with OP.OR.
        @Inject
        @ComboExtract(value = {
                @ExtractBy(value = "h1.OSCTitle", type = ExprType.CSS),
                @ExtractBy(value = "//title/text()", type = ExprType.XPATH) }, op = OP.OR)
        public String title;

        @Inject
        @ExtractBy(value = "p.PubDate a[href~=http://my\\.oschina\\.net/]", type = ExprType.CSS)
        public String author;

        @Inject
        @ExtractBy(value = "发布于.\\s*(\\d+年\\d+月\\d+日)", type = ExprType.REGEX)
        public Date publishDate;

        // A CSS selector and a regex combined with OP.AND.
        @Inject
        @ComboExtract(value = {
                @ExtractBy(value = "p.PubDate", type = ExprType.CSS, setting = @Setting(outerHtml = true)),
                @ExtractBy(value = "(\\d+)评", type = ExprType.REGEX) }, op = OP.AND)
        public int commentNum;

        // A "replace" function is applied to the extracted text before conversion to int.
        @Inject
        @ExtractBy(value = "span#p_favor_count", type = ExprType.CSS,
                setting = @Setting(function = @Function(value = "replace", args = { "+", "" })))
        public int collectNum;

        @Inject
        @ComboExtract(value = {
                @ExtractBy(value = "p[id=userComments]", type = ExprType.CSS, setting = @Setting(outerHtml = true)),
                @ExtractBy(value = "p.TextContent", type = ExprType.CSS) }, op = OP.AND, multi = true)
        public List<String> commentContents;

        // "fliters" is spelled as it appears in the source.
        @Inject
        @ExtractBy(value = "p[id=toolbar_wrapper]", setting = @Setting(fliters = { "b", "span" }),
                type = ExprType.CSS, impl = Document.class)
        public String weibo;
    }
}
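The example combines expressions with `op = OP.OR` and `op = OP.AND`, but the article does not spell out what the two operators do. The sketch below shows one plausible reading, inferred only from how the example uses them: OR tries each expression against the same input and keeps the first non-empty result, while AND chains the expressions so that each one runs on the output of the previous one. Everything here (`ComboSketch`, `Extractor`, `regex`, `or`, `and`) is hypothetical and not woody's API, and plain regular expressions stand in for the CSS and XPath expressions so the code runs with the JDK alone.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch of OR / AND extractor combination; the names are not woody's API. */
final class ComboSketch {

    /** An extractor maps input text to a result, or to null when nothing matches. */
    public interface Extractor extends Function<String, String> {}

    /** Builds an extractor that returns capture group 1 of the first regex match. */
    public static Extractor regex(String expr) {
        Pattern p = Pattern.compile(expr);
        return text -> {
            Matcher m = p.matcher(text);
            return m.find() ? m.group(1) : null;
        };
    }

    /** OR (assumed semantics): run each extractor on the same input, keep the first non-empty result. */
    public static String or(String input, List<Extractor> extractors) {
        for (Extractor e : extractors) {
            String r = e.apply(input);
            if (r != null && !r.isEmpty()) {
                return r;
            }
        }
        return null;
    }

    /** AND (assumed semantics): feed each extractor's output into the next; a miss ends the chain. */
    public static String and(String input, List<Extractor> extractors) {
        String current = input;
        for (Extractor e : extractors) {
            if (current == null) {
                return null;
            }
            current = e.apply(current);
        }
        return current;
    }

    public static void main(String[] args) {
        String html = "<title>Fallback title</title><p class=\"PubDate\">posted, 12 comments</p>";

        // OR: there is no <h1>, so the <title> extractor supplies the value.
        String title = or(html, Arrays.asList(
                regex("<h1>(.*?)</h1>"), regex("<title>(.*?)</title>")));
        System.out.println(title);

        // AND: narrow to the PubDate paragraph first, then pull the number out of it.
        String count = and(html, Arrays.asList(
                regex("(<p class=\"PubDate\">.*?</p>)"), regex("(\\d+) comments")));
        System.out.println(Integer.parseInt(count));
    }
}
```

Under this reading, the `commentNum` field in the example would first select the outer HTML of `p.PubDate` and then apply the regex to that fragment only, which keeps the regex from matching digits elsewhere on the page.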
