|
[广告] Excel易用宝 - 提升Excel的操作效率 · Excel / WPS表格插件 ★ 免费下载 ★ ★ 使用帮助★
大家好,我是杨明胜。今天给大家介绍一个VBA爬虫的新方法。在介绍新方法的同时,我们和老方法做一个对比。我们以获取百度电影排行榜为例。网址如下:
http://top.baidu.com/buzz/movie.html
用来方法的实现方式如下:
Sub GetBaiduMovieTop50ViaOldMethod()
Dim resp As String, i As Long, arr() As String, reg As Object
Dim Matches As Object, match As Object, str As String
Set reg = CreateObject("vbscript.regexp")
reg.Pattern = "list-title.*?(</a>)" '这个正则表达式需翻源码找规律
reg.Global = True
resp = GetWebTxt("http://top.baidu.com/buzz/movie.html")
Set Matches = reg.Execute(resp)
For Each match In Matches
str = Right(match, Len(match) - InStr(match, """>") - 1)
str = Left(str, InStr(str, "<") - 1)
ReDim Preserve arr(i)
arr(i) = str
i = i + 1
Next
ActiveSheet.Range("A1:A" & 1 + UBound(arr())) = WorksheetFunction.Transpose(arr())
End Sub
'获取网页源码
Public Function GetWebTxt(url As String) As String
Dim xmlHttp As Object
Application.DisplayAlerts = False
Set xmlHttp = CreateObject("MSXML2.XMLHTTP")
xmlHttp.Open "GET", url, False
xmlHttp.Send
While xmlHttp.ReadyState <> 4
DoEvents
Wend
GetWebTxt = BytesToBstr(xmlHttp.responsebody)
End Function
'将字节转换为字符串
Private Function BytesToBstr(Bytes)
Dim Unicode As String
If Len(encode) > 0 Then
Unicode = encode
Else
If IsUTF8(Bytes) Then '如果不是UTF-8编码则按照GB2312来处理
Unicode = "UTF-8"
Else
Unicode = "GB2312"
End If
End If
Dim objstream As Object
Set objstream = CreateObject("ADODB.Stream")
With objstream
.Type = 1
.Mode = 3
.Open
.Write Bytes
.Position = 0
.Type = 2
.Charset = Unicode
BytesToBstr = .ReadText
.Close
End With
End Function
'判断网页编码函数
Private Function IsUTF8(Bytes) As Boolean
Dim i As Long, AscN As Long, Length As Long
Length = UBound(Bytes) + 1
If Length < 3 Then
IsUTF8 = False
Exit Function
ElseIf Bytes(0) = &HEF And Bytes(1) = &HBB And Bytes(2) = &HBF Then
IsUTF8 = True
Exit Function
End If
Do While i <= Length - 1
If Bytes(i) < 128 Then
i = i + 1
AscN = AscN + 1
ElseIf (Bytes(i) And &HE0) = &HC0 And (Bytes(i + 1) And &HC0) = &H80 Then
i = i + 2
ElseIf i + 2 < Length Then
If (Bytes(i) And &HF0) = &HE0 And (Bytes(i + 1) And &HC0) = &H80 And (Bytes(i + 2) And &HC0) = &H80 Then
i = i + 3
Else
IsUTF8 = False
Exit Function
End If
Else
IsUTF8 = False
Exit Function
End If
Loop
If AscN = Length Then
IsUTF8 = False
Else
IsUTF8 = True
End If
End Function
大家可以看到用旧方法需要获取并分析网页源码,用正则匹配源码后还需用字符串处理函数获得电影名称。而用新方法无需查看网页源码,代码量也会少很多,只需以下一段代码即可:
Sub GetBaiduMovieTop50ViaNewMethod()
Dim i As Long, arr(49) As String, doc As Object, nodebefore As String, nodeafter As String
nodebefore = "/html[1]/body[1]/div[1]/div[3]/div[2]/div[1]/table[1]/tr["
nodeafter = "]/td[2]/a[1]"
Set doc = s.getdoc("http://top.baidu.com/buzz/movie.html", "gbk") '获取网页文档
arr(0) = s.getnode(doc, nodebefore & 2 & nodeafter) '前四个电影比较特殊
arr(1) = s.getnode(doc, nodebefore & 4 & nodeafter)
arr(2) = s.getnode(doc, nodebefore & 6 & nodeafter)
arr(3) = s.getnode(doc, nodebefore & 8 & nodeafter)
For i = 5 To 50
arr(i - 1) = s.getnode(doc, nodebefore & 4 + i & nodeafter)
Next i
ActiveSheet.Range("A1:A" & 1 + UBound(arr())) = WorksheetFunction.Transpose(arr())
End Sub
使用新方法获取网页文档后,只需从文档中根据标签路径取出需要的标签即可。那么问题来了,我们该怎样知道标签路径呢?其实方法很简单,只需在立即窗口(Ctrl+G)执行以下语句即可
?s.shownodes("http://top.baidu.com/buzz/movie.html","gbk")
通过以上语句可以将网页中的所有标签解析到工作表中,如下:
根据标签与内容的对应关系需要什么标签就取什么标签即可。
需要注意的是新方法需要引用一个插件sqlcelfuncs,该插件免费且只有三四兆,截至目前有90多个函数,本例爬虫只用了其中3个。sqlcelfuncs下载地址如下:
https://sqlcel.com/sqlcelfuncs/
|
评分
-
2
查看全部评分
-
|