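A recent use case was scraping news from the China Nanxun site (www.nanxun.gov.cn). The work splits into two parts: the carousel news images on the home page, and the paginated news list behind "more news".
1. Scraping the home page carousel news, as shown in the figure below
First, view the page source and locate the carousel section, shown below.
Starting from the markup structure, what we need is the href of each of the 5 carousel images; each href points to a specific news detail page.
The Java method for this part is as follows: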
public List<String> getImagesUrls() {
    String url = "http://www.nanxun.gov.cn/";
    String content = HttpKit.get(url);
    Html html = new Html(content);
    // href of every link inside the carousel container
    List<String> urls = html.$("#featured .image a", "href").all();
    // List<String> images = html.$("#featured .image a img", "src").all(); // all image sources
    for (int i = 0; i < urls.size(); i++) {
        // strip the host so that only the site-relative path remains
        urls.set(i, urls.get(i).replaceAll("http://www.nanxun.gov.cn/", "/"));
        // System.out.println(urls.get(i));
    }
    return urls;
}

The key lines are:

String content = HttpKit.get(url);
Html html = new Html(content);
urls = html.$("#featured .image a", "href").all();
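Here the id of the outer div is used to locate the carousel and collect the href attribute of its 5 images. The next step is to scrape the news detail page behind each href.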
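Before moving on, a note on the replaceAll call in getImagesUrls(): its first argument is treated as a regular expression, so each dot in the host name matches any character. A minimal sketch of the same normalization done with java.net.URI follows, assuming every link is either an absolute URL or a path relative to the site root. The CarouselLinks class and its method names are hypothetical and not part of the original code.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;

public class CarouselLinks {
    // Base address taken from the article; everything is resolved against it.
    private static final URI BASE = URI.create("http://www.nanxun.gov.cn/");

    /**
     * Converts an absolute or relative link into a site-relative path
     * such as "/art/2017/2/6/art_57_114777.html".
     * URI.resolve throws IllegalArgumentException for malformed input.
     */
    public static String toSitePath(String href) {
        URI resolved = BASE.resolve(href.trim());
        String path = resolved.getRawPath();
        String query = resolved.getRawQuery();
        return query == null ? path : path + "?" + query;
    }

    /** Applies toSitePath to every link, preserving order. */
    public static List<String> toSitePaths(List<String> hrefs) {
        List<String> result = new ArrayList<>();
        for (String href : hrefs) {
            result.add(toSitePath(href));
        }
        return result;
    }
}
```

Resolving against a base URI also handles links that are already relative, which the string replacement leaves untouched.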
public void getImagesNewsDetail() {
    List<Map<String, String>> mapList = new ArrayList<Map<String, String>>();
    List<String> imagesUrl = getImagesUrls();
    try {
        for (String url : imagesUrl) {
            Map<String, String> map = extractDetailPage(url);
            mapList.add(map);
        }
    } catch (Exception e) {
        e.printStackTrace();
    }
    renderJson(mapList);
}

The method above calls getImagesUrls() to get the 5 URLs, then loops over them and parses the HTML page behind each one. Parsing is done by extractDetailPage(String url), shown below:

/**
 * Extract the article content
 * @param url e.g. art/2016/5/11/art_22_474197.html
 * @return a map holding url, title, publishDate, source, content and image
 * @throws Exception
 */
private Map<String, String> extractDetailPage(String url) throws Exception {
    long ts = System.currentTimeMillis();
    String content = HttpKit.get("http://www.nanxun.gov.cn/" + url);
    Html html = new Html(content);
    String title = html.regex("<\\!--<\\$\\[标题名称\\(html\\)\\]>begin-->(.*?)<\\!--<\\$\\[标题名称\\(html\\)\\]>end-->", 1).get();
    String publishDate = html.regex("<td height=.28. align=.center.>.*<span>日期:(.*?)</span>", 1).get();
    String source = html.regex("<\\!--<\\$\\[信息来源\\]>begin-->(.*?)<\\!--<\\$\\[信息来源\\]>end-->", 1).get();
    String body = html.regex("<\\!--<\\$\\[信息内容\\]>begin-->(.*?)<\\!--<\\$\\[信息内容\\]>end-->", 1).get();
    // String body2 = html.regex("<meta name=.ContentStart.*/?>(.*?)<meta name=.ContentEnd.*/?>", 1).get();
    // turn site-relative image paths into absolute URLs
    body = body.replaceAll("src=\"/", "src=\"http://www.nanxun.gov.cn/");
    Map<String, String> map = new HashMap<String, String>();
    map.put("url", url);
    map.put("title", title);
    map.put("publishDate", publishDate);
    map.put("source", source);
    map.put("content", body);
    // pick out the first image in the body
    String image = getNewsImage(body);
    map.put("image", image);
    // elapsed time in milliseconds
    System.out.println(System.currentTimeMillis() - ts);
    return map;
}

Take one of the URLs as an example: art/2017/2/6/art_57_114777.html
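Again, start by reading the page source, then use regular expressions to pull out what we need: the title, the publish date, the source, the body, and the first image contained in the body.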
String title = html.regex("<\\!--<\\$\\[标题名称\\(html\\)\\]>begin-->(.*?)<\\!--<\\$\\[标题名称\\(html\\)\\]>end-->", 1).get();
String publishDate = html.regex("<td height=.28. align=.center.>.*<span>日期:(.*?)</span>", 1).get();
String source = html.regex("<\\!--<\\$\\[信息来源\\]>begin-->(.*?)<\\!--<\\$\\[信息来源\\]>end-->", 1).get();
String body = html.regex("<\\!--<\\$\\[信息内容\\]>begin-->(.*?)<\\!--<\\$\\[信息内容\\]>end-->", 1).get();
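The same "text between a begin marker and an end marker" idea can be sketched with plain java.util.regex. This is an illustration under stated assumptions, not the webmagic implementation: the MarkerExtractor class and the between method are hypothetical names, Pattern.quote is used so the marker needs no hand-written escaping, and Pattern.DOTALL lets the captured text span several lines.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MarkerExtractor {
    /**
     * Returns the text between <!--<$[name]>begin--> and <!--<$[name]>end-->,
     * or null when the marker pair is not present.
     */
    public static String between(String html, String name) {
        String begin = "<!--<$[" + name + "]>begin-->";
        String end = "<!--<$[" + name + "]>end-->";
        // Pattern.quote turns each marker into a literal, so $, [ and ( are safe.
        Pattern pattern = Pattern.compile(
                Pattern.quote(begin) + "(.*?)" + Pattern.quote(end),
                Pattern.DOTALL);
        Matcher matcher = pattern.matcher(html);
        return matcher.find() ? matcher.group(1) : null;
    }
}
```

With this helper, a call such as between(content, "信息来源") would return the source field, and a missing marker yields null rather than an exception.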
The first image in the body is then extracted with a separate helper method, again using a regular expression:

public static String getNewsImage(String content) {
    String regexImage = "<img.+?src=\"(.+?)\".+?/?>";
    String ImageStr = "";
    String ImageSrcStr = "";
    Pattern p = Pattern.compile(regexImage, Pattern.CASE_INSENSITIVE);
    Matcher m = p.matcher(content);
    if (m.find()) {
        ImageStr = m.group();     // the complete <img> tag
        ImageSrcStr = m.group(1); // the value of the src attribute
        System.out.println(ImageSrcStr);
    }
    return ImageSrcStr;
}

With every step in place, here is the scrape result that getImagesNewsDetail() returns as JSON.
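As you can see, every field was retrieved. With larger volumes of data we could also cache the results or insert them into a database; that is only mentioned in passing here.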
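As a rough idea of what the caching could look like, here is a minimal in-memory sketch keyed by URL. It is an assumption, not the author's code: the DetailCache class is hypothetical, and the loader function stands in for a call to extractDetailPage(url), which would need its checked exception wrapped.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

public class DetailCache {
    // url -> parsed detail fields (title, publishDate, source, content, image)
    private final Map<String, Map<String, String>> cache = new ConcurrentHashMap<>();
    // Hypothetical loader; in the article this role is played by extractDetailPage.
    private final Function<String, Map<String, String>> loader;

    public DetailCache(Function<String, Map<String, String>> loader) {
        this.loader = loader;
    }

    /** Returns the cached entry for url, calling the loader only when it is absent. */
    public Map<String, String> get(String url) {
        return cache.computeIfAbsent(url, loader);
    }

    /** Number of URLs currently cached. */
    public int size() {
        return cache.size();
    }
}
```

A cache like this never expires entries, so a real deployment would also need some eviction or refresh policy for pages that change.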
The HTTP request utility class used above is attached below.
/**
 * Copyright (c) 2011-2016, James Zhan 詹波 (jfinal@126.com).
 *
 * Licensed under the Apache License, Version 2.0 (the "License");
 * you may not use this file except in compliance with the License.
 * You may obtain a copy of the License at
 *
 *      http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

package com.jfinal.kit;

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.io.UnsupportedEncodingException;
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLEncoder;
import java.security.KeyManagementException;
import java.security.NoSuchAlgorithmException;
import java.security.NoSuchProviderException;
import java.security.cert.CertificateException;
import java.security.cert.X509Certificate;
import java.util.Map;
import java.util.Map.Entry;
import javax.net.ssl.HostnameVerifier;
import javax.net.ssl.HttpsURLConnection;
import javax.net.ssl.SSLContext;
import javax.net.ssl.SSLSession;
import javax.net.ssl.SSLSocketFactory;
import javax.net.ssl.TrustManager;
import javax.net.ssl.X509TrustManager;
import javax.servlet.http.HttpServletRequest;
import com.jfinal.log.Logger;

/**
 * HttpKit
 */
public class HttpKit {

    private final static Logger log = Logger.getLogger(HttpKit.class);

    private HttpKit() {}

    /**
     * https hostname verification (accepts any host)
     */
    private class TrustAnyHostnameVerifier implements HostnameVerifier {
        @Override
        public boolean verify(String hostname, SSLSession session) {
            return true;
        }
    }

    /**
     * https certificate management (accepts any certificate)
     */
    private class TrustAnyTrustManager implements X509TrustManager {
        @Override
        public X509Certificate[] getAcceptedIssuers() {
            return null;
        }

        @Override
        public void checkClientTrusted(X509Certificate[] chain, String authType) throws CertificateException {
        }

        @Override
        public void checkServerTrusted(X509Certificate[] chain, String authType) throws CertificateException {
        }
    }

    private static final String GET = "GET";
    private static final String POST = "POST";
    private static String CHARSET = "UTF-8";

    private static final SSLSocketFactory sslSocketFactory = initSSLSocketFactory();
    private static final TrustAnyHostnameVerifier trustAnyHostnameVerifier = new HttpKit().new TrustAnyHostnameVerifier();

    private static SSLSocketFactory initSSLSocketFactory() {
        try {
            TrustManager[] tm = {new HttpKit().new TrustAnyTrustManager() };
            SSLContext sslContext = SSLContext.getInstance("TLS"); // ("TLS", "SunJSSE");
            sslContext.init(null, tm, new java.security.SecureRandom());
            return sslContext.getSocketFactory();
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public static void setCharSet(String charSet) {
        if (StrKit.isBlank(charSet)) {
            throw new IllegalArgumentException("charSet can not be blank.");
        }
        HttpKit.CHARSET = charSet;
    }

    private static HttpURLConnection getHttpConnection(String url, String method, Map<String, String> headers) throws IOException, NoSuchAlgorithmException, NoSuchProviderException, KeyManagementException {
        URL _url = new URL(url);
        HttpURLConnection conn = (HttpURLConnection)_url.openConnection();
        if (conn instanceof HttpsURLConnection) {
            ((HttpsURLConnection)conn).setSSLSocketFactory(sslSocketFactory);
            ((HttpsURLConnection)conn).setHostnameVerifier(trustAnyHostnameVerifier);
        }

        conn.setRequestMethod(method);
        conn.setDoOutput(true);
        conn.setDoInput(true);

        conn.setConnectTimeout(30000);
        conn.setReadTimeout(30000);

        conn.setRequestProperty("Content-Type","application/x-www-form-urlencoded");
        conn.setRequestProperty("User-Agent", "Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/33.0.1750.146 Safari/537.36");

        if (headers != null && !headers.isEmpty())
            for (Entry<String, String> entry : headers.entrySet())
                conn.setRequestProperty(entry.getKey(), entry.getValue());

        return conn;
    }

    /**
     * Send GET request
     */
    public static String get(String url, Map<String, String> queryParas, Map<String, String> headers) {
        HttpURLConnection conn = null;
        try {
            conn = getHttpConnection(buildUrlWithQueryString(url, queryParas), GET, headers);
            conn.connect();
            return readResponseString(conn);
        } catch (Exception e) {
            throw new RuntimeException(e);
        } finally {
            if (conn != null) {
                conn.disconnect();
            }
        }
    }

    public static String get(String url, Map<String, String> queryParas) {
        return get(url, queryParas, null);
    }

    public static String get(String url) {
        return get(url, null, null);
    }

    /**
     * Send POST request
     */
    public static String post(String url, Map<String, String> queryParas, String data, Map<String, String> headers) {
        HttpURLConnection conn = null;
        try {
            conn = getHttpConnection(buildUrlWithQueryString(url, queryParas), POST, headers);
            conn.connect();

            OutputStream out = conn.getOutputStream();
            out.write(data.getBytes(CHARSET));
            out.flush();
            out.close();

            return readResponseString(conn);
        } catch (Exception e) {
            throw new RuntimeException(e);
        } finally {
            if (conn != null) {
                conn.disconnect();
            }
        }
    }

    public static String post(String url, Map<String, String> queryParas, String data) {
        return post(url, queryParas, data, null);
    }

    public static String post(String url, String data, Map<String, String> headers) {
        return post(url, null, data, headers);
    }

    public static String post(String url, String data) {
        return post(url, null, data, null);
    }

    private static String readResponseString(HttpURLConnection conn) {
        StringBuilder sb = new StringBuilder();
        InputStream inputStream = null;
        try {
            inputStream = conn.getInputStream();
            BufferedReader reader = new BufferedReader(new InputStreamReader(inputStream, CHARSET));
            String line = null;
            while ((line = reader.readLine()) != null){
                sb.append(line).append("\n");
            }
            return sb.toString();
        } catch (Exception e) {
            throw new RuntimeException(e);
        } finally {
            if (inputStream != null) {
                try {
                    inputStream.close();
                } catch (IOException e) {
                    log.error(e.getMessage(), e);
                }
            }
        }
    }

    /**
     * Build queryString of the url
     */
    private static String buildUrlWithQueryString(String url, Map<String, String> queryParas) {
        if (queryParas == null || queryParas.isEmpty())
            return url;

        StringBuilder sb = new StringBuilder(url);
        boolean isFirst;
        if (url.indexOf("?") == -1) {
            isFirst = true;
            sb.append("?");
        } else {
            isFirst = false;
        }

        for (Entry<String, String> entry : queryParas.entrySet()) {
            if (isFirst) isFirst = false;
            else sb.append("&");

            String key = entry.getKey();
            String value = entry.getValue();
            if (StrKit.notBlank(value))
                try {value = URLEncoder.encode(value, CHARSET);} catch (UnsupportedEncodingException e) {throw new RuntimeException(e);}
            sb.append(key).append("=").append(value);
        }
        return sb.toString();
    }

    public static String readData(HttpServletRequest request) {
        BufferedReader br = null;
        try {
            StringBuilder result = new StringBuilder();
            br = request.getReader();
            for (String line=null; (line=br.readLine())!=null;) {
                result.append(line).append("\n");
            }
            return result.toString();
        } catch (IOException e) {
            throw new RuntimeException(e);
        } finally {
            if (br != null)
                try {br.close();} catch (IOException e) {log.error(e.getMessage(), e);}
        }
    }

    @Deprecated
    public static String readIncommingRequestData(HttpServletRequest request) {
        return readData(request);
    }
}

Main jar packages used
webmagic-core-0.5.2.jar
webmagic-extension-0.5.2.jar
jsoup-1.7.2.jar
slf4j-api-1.7.2.jar
slf4j-log4j12-1.6.6.jar
Base jar packages
commons-attributes-api-2.1.jar
commons-beanutils-1.7.0.jar
commons-codec-1.3.jar
commons-collections-3.2.1.jar
commons-email-1.2.jar
commons-fileupload-1.2.1.jar
commons-httpclient-3.1.jar
commons-io-2.2.jar
commons-lang-2.6.jar
commons-lang3-3.1.jar
commons-logging-1.1.jar
commons-pool-1.5.5.jar
2. Part two, scraping the paginated news list, is covered in the second post.