How does a Python crawler get the next page's URL and page content?
I've scraped the first page with BeautifulSoup, but I don't know how to crawl the remaining pages.
The first page's URL looks like this:
http://gdemba.gicp.net/:82/interunit/ListMain.asp?FirstEnter=Yes&Style=0000100003&UID={A270A117-76A7-4059-AB8F-B11AC370240B}&TimeID=39116.81
You get to the next page by clicking a "next page" (后翻一页) gif image button.
The second page's URL looks like this:
http://gdemba.gicp.net/:82/interunit/ListMain.asp?Keywords=&Style=0000100003&DateLowerLimit= 2000-1-1&DateUpperLimit= 2015-9-11&DateLowerLimitModify= 2000-1-1&DateUpperLimitModify= 2015-9-11&Classification1=0&Classification2=0&Classification3=0&Classification4=0&Classification6=0&Classification7=0&Classification8=0&Class=&Department=001&CreatorName=&CreatorTypeID=&UID={A270A117-76A7-4059-AB8F-B11AC370240B}&SortField=&CustormCondition=&PageNo=2&TimeID=39453.14
How can I work out the pattern in these URLs?
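One way to see the pattern (a Python 3 sketch; the URLs below are shortened to a few of the parameters, for illustration) is to parse both query strings and diff the parameters:

```python
from urllib.parse import urlsplit, parse_qs

# Two consecutive page URLs, trimmed down for the example
url_p2 = "http://gdemba.gicp.net/:82/interunit/ListMain.asp?Keywords=&Department=001&PageNo=2&TimeID=39453.14"
url_p3 = "http://gdemba.gicp.net/:82/interunit/ListMain.asp?Keywords=&Department=001&PageNo=3&TimeID=39509.16"

q2 = parse_qs(urlsplit(url_p2).query, keep_blank_values=True)
q3 = parse_qs(urlsplit(url_p3).query, keep_blank_values=True)

# Keys whose values differ between the two pages
diff = {k for k in q2 if q2.get(k) != q3.get(k)}
print(diff)  # PageNo (the page index) and TimeID (a server timestamp) change
```

Everything else in the two URLs is identical, which suggests PageNo is the parameter that selects the page.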
The "next page" (后翻一页) link looks like this:
<span class="OperateIcon1" title="后翻一页" onclick="javascript:window.location.href = 'ListMain.asp?Keywords=&Style=0000100003&DateLowerLimit= 2000-1-1&DateUpperLimit= 2015-9-11&DateLowerLimitModify= 2000-1-1&DateUpperLimitModify= 2015-9-11&Classification1=0&Classification2=0&Classification3=0&Classification4=0&Classification6=0&Classification7=0&Classification8=0&Class=&Department=001&CreatorName=&CreatorTypeID=&UID={A270A117-76A7-4059-AB8F-B11AC370240B}&SortField=&CustormCondition=&PageNo=3&TimeID=39509.16' ; "><img border="0" src="../images/PageBar/IconNextPage.gif" WIDTH="16" HEIGHT="16"></span>
How do I get the next page's URL and its content?
I can add more information if needed.
My code so far:
# -*- coding: utf-8 -*-
# Python 2 code
import urllib
import urllib2
import cookielib
import csv
from bs4 import BeautifulSoup

# Log in and keep the session cookie
cookie = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cookie))
postdata = urllib.urlencode({
    'LoginName': '02',
    'Password': 'dc20150820if'
})
req = urllib2.Request(
    url='http://gdemba.gicp.net/:82/VerifyUser.asp',
    data=postdata
)
result = opener.open(req)
for item in cookie:
    print 'Cookie:Name = ' + item.name
    print 'Cookie:Value = ' + item.value

# Fetch the first list page with the logged-in opener
result = opener.open('http://gdemba.gicp.net/:82/interunit/ListMain.asp?FirstEnter=Yes&Style=0000100003&UID={4C10B953-C0F3-4114-8341-81EF93DE7C55}&TimeID=49252.53')
info = result.read()
soup = BeautifulSoup(info, from_encoding="gb18030")
table = soup.find(id='Table11')
print table

# Write every 10-cell row of the table to CSV
f = open('table.csv', 'w')
csv_writer = csv.writer(f)
for row in table.find_all("tr"):
    cells = row.find_all("td")
    if len(cells) == 10:
        (client, tag, tel, catalogue, region, client_type,
         email, creater, department, action) = [c.text for c in cells]
        csv_writer.writerow([x.encode('utf-8') for x in
                             [client, tag, tel, catalogue, region,
                              client_type, email, creater, department, action]])
f.close()
Answer:
The <span> contains onclick="javascript:window.location.href=xxxx", doesn't it?
That line is the jump: in your example it goes to ListMain.asp?Keywords=....
If you want to write crawlers, I'd suggest learning some HTML and JS.
Update: grabbing the next page's URL
next_page_tag = soup.find(title='后翻一页')
next_page_onclick = next_page_tag['onclick']
next_page_url = re.search("'(.+)'", next_page_onclick).group(1)
next_page_url = 'http://gdemba.gicp.net/:82/interunit/' + next_page_url
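As a quick check, the same regex can be run against the onclick value from the question (Python 3 here, with the long query string trimmed down for the example):

```python
import re

# The onclick value from the question's <span>, with the query string shortened
onclick = ("javascript:window.location.href = "
           "'ListMain.asp?Keywords=&Department=001&PageNo=3&TimeID=39509.16' ; ")

# Capture everything between the single quotes: the relative next-page URL
relative_url = re.search(r"'(.+)'", onclick).group(1)
next_page_url = 'http://gdemba.gicp.net/:82/interunit/' + relative_url
print(next_page_url)
```

The greedy `'(.+)'` works here because the onclick contains exactly one quoted string.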
Answer:
PageNo is the page number!!
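In other words, you can generate any page's URL by rewriting the PageNo parameter. A Python 3 sketch (the example URL is shortened to a few parameters):

```python
from urllib.parse import urlsplit, parse_qsl, urlencode, urlunsplit

def with_page_no(url, page_no):
    """Return the same URL with its PageNo query parameter replaced."""
    parts = urlsplit(url)
    query = [(k, str(page_no) if k == 'PageNo' else v)
             for k, v in parse_qsl(parts.query, keep_blank_values=True)]
    return urlunsplit(parts._replace(query=urlencode(query)))

url = "http://gdemba.gicp.net/:82/interunit/ListMain.asp?Keywords=&Department=001&PageNo=2"
print(with_page_no(url, 5))
```

Whether the server also checks TimeID is something you'd have to test; if it does, extracting the full URL from the onclick (as in the first answer) is safer.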
Answer:
As the answer above says, the onclick already contains the next page's address; extract it with BeautifulSoup, prepend the host, and that should do it.
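That "prepend the host" step can also be done with urljoin from the Python 3 standard library, which resolves the relative URL against the page it appeared on (query strings shortened for the example):

```python
from urllib.parse import urljoin

# Base: the page the link appeared on; relative: what the onclick contains
base = 'http://gdemba.gicp.net/:82/interunit/ListMain.asp?FirstEnter=Yes'
relative = 'ListMain.asp?Keywords=&Department=001&PageNo=3'

# urljoin resolves the relative URL against the base page's directory
next_url = urljoin(base, relative)
print(next_url)  # http://gdemba.gicp.net/:82/interunit/ListMain.asp?Keywords=&Department=001&PageNo=3
```

This avoids hard-coding the 'http://gdemba.gicp.net/:82/interunit/' prefix by hand.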
Answer:
I first grab the total number of pages, then use a while loop.
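That approach might be sketched like this (Python 3; fetch_page and total_pages stand in for whatever the real site provides — the real fetch_page would build the page's URL and parse its table):

```python
def crawl_all(fetch_page, total_pages):
    """fetch_page(n) returns page n's rows; loop from page 1 to the last page."""
    rows = []
    page_no = 1
    while page_no <= total_pages:
        rows.extend(fetch_page(page_no))
        page_no += 1
    return rows

# Demo with a fake fetcher that returns one labelled row per page
fake = lambda n: ['row-%d' % n]
print(crawl_all(fake, 3))  # ['row-1', 'row-2', 'row-3']
```

Compared with following the "next page" link until it disappears, this needs the total page count up front but makes progress reporting easier.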
That is the full content of "How does a Python crawler get the next page's URL and page content?". Source: utcz.com/a/159945.html