|
下面代码是网抓整个网站的内容,请一定要注意看代码中的注释,还是那个话:建议一本抓一个word文档,你实在要5本一抓也行。
比如:For i = 1 To UBound(arr) 改为:For i = 1 To 1 抓第一本 For i = 2 To 2 抓第三本 For i = 3 To 3 抓第三本;For i = 1 To 5 抓第一本到第五本 For i = 1 To 10 抓第一本到第10本。- Sub shishi()
- Dim strText$, i&, j&, arr, arr1
- URL = "http://www.tcm100.com/"
- Application.ScreenUpdating = True
- With CreateObject("msxml2.xmlhttp")
- .Open "GET", URL & "zhongyiguji.aspx", False
- .send
- strText = .responseText
- arr = Split(strText, "<a target='_blank' href='") '表示有多少本书(本网站共有“762本书”)
- For i = 1 To UBound(arr) '建议每一次就设置“1”就是抓一本书,设置“2”就是抓第二本书,一定要改,否则把整个网站都抓下来了!呵呵呵!!!
- .Open "GET", URL & Split(Split(strText, "<a target='_blank' href='")(i), "'>")(0), False
- .send
- strText = .responseText
- arr1 = Split(strText, "<a target='_blank' href='")
- For j = 1 To UBound(arr1) '这个循环表示每一本书中的数目数量
- .Open "GET", Split(Split(strText, "<a target='_blank' href='")(j), "'>")(0), False
- .send
- strText = strText & .responseText '表示包含所有书的所有页面内容的超文本
- Next
- Next
- ' Debug.Print strText
- End With
- With CreateObject("VBScript.Regexp")
- .Global = True
- .Pattern = "<td[\s\S]*?<div\s*class='title'[\s\S]*?>([\s\S]+?)<[\s\S]*?<div\s* class='content'>([\s\S]+?)</div>[\s\S]*?</td>"
- For Each RegMatch In .Execute(strText)
- t1 = RegMatch.SubMatches(0)
- t2 = RegMatch.SubMatches(1)
- t = t & t1 & Chr(13) & t2 & Chr(13)
- Next
- .Pattern = "(?:<a\s*href=[\s\S]+?>)|</a>": t = .Replace(t, "")
- .Pattern = "(?!<br>)(?: )+": t = .Replace(t, " ")
- .Pattern = "<br>\s+": t = .Replace(t, Chr(13))
- End With
- Application.ScreenUpdating = True
- Documents.Add.Content.Text = t
- End Sub
复制代码 |
|