Fetching the text content of a web page in Java
The core class is:
URLConnection
The code is as follows:
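import java.io.ByteArrayOutputStream;
import java.io.InputStream;
import java.net.URL;
import java.net.URLConnection;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;

public static String sendGet(String url, HashMap<String, String> requestHead) throws Exception {
    URLConnection connection = new URL(url).openConnection();
    connection.setRequestProperty("Accept", "*/*");
    connection.setRequestProperty("Connection", "Keep-Alive");
    // Apply the caller's extra request headers, if any.
    if (requestHead != null) {
        for (String key : requestHead.keySet()) {
            connection.setRequestProperty(key, requestHead.get(key));
        }
    }
    // Read the whole response body into memory; try-with-resources closes the stream.
    ByteArrayOutputStream outputStream = new ByteArrayOutputStream();
    try (InputStream inputStream = connection.getInputStream()) {
        byte[] bytes = new byte[1024];
        int len;
        while ((len = inputStream.read(bytes)) != -1) {
            outputStream.write(bytes, 0, len);
        }
    }
    byte[] body = outputStream.toByteArray();
    // First pass: decode as ISO-8859-1 (never fails and keeps ASCII intact),
    // only to locate the charset declaration in the page.
    String ret = new String(body, StandardCharsets.ISO_8859_1);
    String charset = getWebCharset(ret);
    // Second pass: decode the same bytes with the detected charset.
    return new String(body, charset);
}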
Here, getWebCharset detects the page's encoding automatically. The code is as follows:
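public static String getWebCharset(String str) {
    // TextUtil.getMiddleText is the author's own helper (not shown here). It presumably
    // returns the text between "charset=" and the next ">", or null when nothing matches.
    String raw = null;
    try {
        raw = TextUtil.getMiddleText(str, "charset=", ">");
    } catch (RuntimeException e) {
        // No usable charset declaration; fall through to the default.
    }
    if (raw == null) {
        return "UTF-8";
    }
    // Strip quotes before checking the prefix, so that charset="gbk" is read as GBK
    // rather than being cut down to an unusable two-letter name.
    String charset = raw.replace("\"", "").replace("'", "").trim().toUpperCase();
    if (charset.startsWith("GB2")) {
        return "GB2312";
    } else if (charset.startsWith("GBK")) {
        return "GBK";
    }
    // UTF-8 itself, plus anything unrecognized, is decoded as UTF-8.
    return "UTF-8";
}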
Of course, there are many ways to match the charset, and you can implement your own.
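As one possible alternative, here is a minimal sketch that uses a regular expression and java.nio.charset.Charset instead of prefix matching. The class name CharsetSniffer and the method name detectCharset are made up for this illustration and are not part of the original code. The sketch assumes the page declares its encoding in a charset=... attribute and falls back to UTF-8 otherwise.

```java
import java.nio.charset.Charset;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of the original post.
public class CharsetSniffer {
    // Matches charset=utf-8, charset="gbk", charset='gb2312' (case-insensitive).
    private static final Pattern CHARSET_PATTERN =
            Pattern.compile("charset\\s*=\\s*[\"']?([\\w.:-]+)", Pattern.CASE_INSENSITIVE);

    /** Returns the canonical name of the declared charset, or "UTF-8" if none is usable. */
    public static String detectCharset(String html) {
        if (html == null) {
            return "UTF-8";
        }
        Matcher m = CHARSET_PATTERN.matcher(html);
        if (m.find()) {
            String name = m.group(1);
            try {
                // Only accept names this JVM can actually decode.
                if (Charset.isSupported(name)) {
                    return Charset.forName(name).name();
                }
            } catch (IllegalArgumentException e) {
                // Syntactically illegal charset name; use the default below.
            }
        }
        return "UTF-8";
    }

    public static void main(String[] args) {
        System.out.println(detectCharset("<meta charset=\"utf-8\">"));
        System.out.println(detectCharset("<meta content='text/html; charset=gbk'>"));
        System.out.println(detectCharset("<html><body>no declaration</body></html>"));
    }
}
```

Asking Charset whether the name is supported means the returned value can be passed straight to new String(bytes, charset) without risking an UnsupportedEncodingException. Which charsets beyond the standard ones are available depends on the JVM in use.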