Fetching Web Page Text Content in Java
The core class is:
URLConnection
The code is as follows:
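import java.io.ByteArrayOutputStream;
import java.io.InputStream;
import java.net.URL;
import java.net.URLConnection;
import java.util.HashMap;

public static String sendGet(String url, HashMap<String,String> requestHead) throws Exception {
URL url1=new URL(url);
URLConnection connection=url1.openConnection();
// Default request headers; setRequestProperty overwrites, so caller-supplied entries with the same key replace these.
connection.setRequestProperty("Accept","*/*");
connection.setRequestProperty("Connection","Keep-Alive");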
// Apply caller-supplied headers, if any.
if(requestHead!=null){
for(String key:requestHead.keySet()){
connection.setRequestProperty(key,requestHead.get(key));
}
}
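ByteArrayOutputStream outputStream=new ByteArrayOutputStream();
// try-with-resources closes the stream even if reading fails.
try(InputStream inputStream=connection.getInputStream()){
byte[] bytes=new byte[1024];
int len;
while((len=inputStream.read(bytes))!=-1){
outputStream.write(bytes,0,len);
}
}
byte[] data=outputStream.toByteArray();
// Decode as ISO-8859-1 first: it maps every byte to a character, so the ASCII
// charset declaration stays readable whatever the page's real encoding is.
String ret=new String(data,"ISO-8859-1");
String charset=getWebCharset(ret);
// Decode again with the detected encoding.
return new String(data,charset);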
}
Here getWebCharset automatically detects the page's encoding. Its code is as follows:
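public static String getWebCharset(String str){
// Fall back to UTF-8 when no usable charset declaration is found.
String charset="UTF-8";
try{
// TextUtil.getMiddleText is a helper not shown here; it is assumed to return
// the text between "charset=" and ">", or null if there is none.
String declared=TextUtil.getMiddleText(str,"charset=",">");
// Strip quotes before matching, so that charset="gb2312" is not cut down to "GB".
declared=declared.replaceAll("[\"']","").trim().toUpperCase();
if(declared.startsWith("UTF")){
charset="UTF-8";
}else if(declared.startsWith("GB2")){
charset="GB2312";
}else if(declared.startsWith("GBK")){
charset="GBK";
}
}catch (RuntimeException e){
// No declaration could be extracted; keep the default.
}
return charset;
}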
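The TextUtil helper used by getWebCharset is not shown in this post. For readers who want to compile the code, here is a minimal stand-in. It is only a sketch under the assumption that getMiddleText returns the text between the first occurrence of the left marker and the next occurrence of the right marker, and null when either is missing; the real helper may behave differently.

```java
// Hypothetical stand-in for the TextUtil helper, which the post uses but does not show.
final class TextUtil {
    private TextUtil() {}

    /**
     * Returns the text between the first occurrence of left and the next
     * occurrence of right after it, or null if either marker is missing.
     */
    static String getMiddleText(String text, String left, String right) {
        if (text == null || left == null || right == null) {
            return null;
        }
        int start = text.indexOf(left);
        if (start < 0) {
            return null;
        }
        start += left.length();
        int end = text.indexOf(right, start);
        if (end < 0) {
            return null;
        }
        return text.substring(start, end);
    }
}
```

Note that with this stand-in the returned text still contains any quotes around the charset name, so the caller has to strip them.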
Of course, there are many ways to detect the encoding, and you can implement your own.
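As one possible alternative, the sketch below extracts the charset with a regular expression and checks it against the charsets the JVM actually supports, instead of matching prefixes by hand. The class name CharsetDetector and the method detect are made up for illustration and are not part of the original code.

```java
import java.nio.charset.Charset;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; the class and method names are illustrative only.
final class CharsetDetector {
    // Matches charset=utf-8, charset="gbk", charset='gb2312' (case-insensitive).
    private static final Pattern CHARSET =
            Pattern.compile("charset\\s*=\\s*[\"']?([\\w.:-]+)", Pattern.CASE_INSENSITIVE);

    private CharsetDetector() {}

    /**
     * Returns the canonical name of the first declared charset that this JVM
     * supports, or fallback if none is declared or supported.
     */
    static String detect(String text, String fallback) {
        if (text == null) {
            return fallback;
        }
        Matcher m = CHARSET.matcher(text);
        while (m.find()) {
            String name = m.group(1);
            try {
                if (Charset.isSupported(name)) {
                    return Charset.forName(name).name();
                }
            } catch (IllegalArgumentException e) {
                // Syntactically illegal charset name; try the next match.
            }
        }
        return fallback;
    }
}
```

The same method works on either source of the declaration: the HTML text itself, or the Content-Type response header, which URLConnection exposes through getContentType(). Checking the header first and the HTML second is a common order, but that choice is a suggestion here, not something the original code does.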