Scraping job listings from [Dajie (大街网)] with requests keeps failing. What is wrong with my code, and how should I fix it?

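I have scraped a few static sites before without much trouble. This time I ran into ajax. After reading a few docs it did not look hard, so I jumped straight in, but I still got stuck.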

Goal:
Scrape the job listings from Dajie.

Steps:
1. Use the browser's Inspect Element tool to find the URL that the data is loaded from dynamically.

2. Set up the requests parameters and headers based on what the browser shows.

data = {

'keyword': 'python',

'order': '0',

'city': '',

'recruitType': '',

'salary': '',

'experience': '',

'page': '5',

'positionFunction': '',

'_CSRFToken': '',

'ajax': '1'

}

headers = {

'accept': 'application/json, text/javascript, */*; q=0.01',

'accept-language': 'zh-CN,zh;q=0.8',

'accept-encoding': 'gzip, deflate, sdch',

'cookie': 'DJ_UVID=MTQ5MDMyMTExNTAzODM2MTc5; DJ_RF=empty; DJ_EU=http%3A%2F%2Fjob.dajie.com%2F; __login_tips=1; dj_cap=9c8c95bdef72e84a9bd7493a5ab91694; USER_ACTION="request^A-^A-^Ajobdetail:^A-"; SO_COOKIE_V2=0c7cGprjIH0q9RHc53CWLLXf151DQ5QvUP5ccPQj4g0B/izuXHm8sp41lJjJJh3nmjAkroj8JczFN/SCLPAUzbOHW7wYWmQ6Zu7s',

'referer': 'https://so.dajie.com/job/search?keyword=%E9%A3%9E%E5%88%A9%E6%B5%A6&from=job&clicktype=blank',

'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/45.0.2454.101 Safari/537.36',

'X-Requested-With': 'XMLHttpRequest',

'method':'get'

}

3. Pass the parameters and headers to requests.get().

response = requests.get('https://so.dajie.com/job/ajax/search/filter', params=data, headers=headers)

4. Inspect what comes back.

print response.url

print ''

print response.request.headers

print ''

print response.headers

print ''

print response.content[-1000:]

print ''

print response

5. The result is not the JSON data I expected. The server sends back an HTML error page instead:

response.url:

https://so.dajie.com/job/ajax/search/filter?salary=&city=&ajax=1&positionFunction=&_CSRFToken=&keyword=python&recruitType=&order=0&experience=&page=5

response.request.headers:

{'accept-language': 'zh-CN,zh;q=0.8', 'accept-encoding': 'gzip, deflate, sdch', 'X-Requested-With': 'XMLHttpRequest', 'accept': 'application/json, text/javascript, */*; q=0.01', 'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/45.0.2454.101 Safari/537.36', 'Connection': 'keep-alive', 'referer': 'https://so.dajie.com/job/search?keyword=%E9%A3%9E%E5%88%A9%E6%B5%A6&from=job&clicktype=blank', 'cookie': 'DJ_UVID=MTQ5MDMyMTExNTAzODM2MTc5; DJ_RF=empty; DJ_EU=http%3A%2F%2Fjob.dajie.com%2F; __login_tips=1; dj_cap=9c8c95bdef72e84a9bd7493a5ab91694; USER_ACTION="request^A-^A-^Ajobdetail:^A-"; SO_COOKIE_V2=0c7cGprjIH0q9RHc53CWLLXf151DQ5QvUP5ccPQj4g0B/izuXHm8sp41lJjJJh3nmjAkroj8JczFN/SCLPAUzbOHW7wYWmQ6Zu7s', 'method': 'get'}

response.headers:

{'Date': 'Wed, 19 Apr 2017 02:00:47 GMT', 'Content-Length': '5944', 'ETag': '"552f21de-1738"', 'Content-Type': 'text/html; charset=UTF-8', 'Connection': 'keep-alive'}

response.content[-1000:]:

,这个页面去火星了,试试搜索一下吧:</p>

<form action="http://so.dajie.com/job/search" target="_top" class="search" method="get">

<input type="text" placeholder="搜索感兴趣的职位" autocomplete="off" name="keyword"/><button type="submit">搜索</button>

<input type="hidden" name="jobsearch" value="8"/>

</form>

</div>

<div class="error-404">

<div class="buttonwrap">

<a class="button guest" id="guest" title="" href="http://www.dajie.com/"><b>逛逛大街</b></a>

<a class="button report" id="report" title="" href="mailto:service@dajie.com"><b>报告管理员</b></a>

</div>

</div>

</div>

<script type="text/javascript">

$(function(){

$('input[placeholder]').each(function(){

var $dom = $(this);

var tip = $dom.attr('placeholder');

$.placeholder($dom, {

placeTextClass : 'placeholder',

placeText : tip

});

});

});

</script>

</body>

</html>

response:

<Response [299]>

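Questions:
1. Why doesn't 'https://so.dajie.com/job/ajax...' open as a page of JSON data? In the tutorials I read, opening the given link shows the data directly, e.g. 'https://rate.tmall.com/list_d...'.
2. This is my first time requesting ajax data with requests. Did I leave something out of the request?
3. So far I have only tried changing the request parameters, and I still cannot get JSON back. Am I looking in the wrong direction?

Thanks.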

Answer:

# coding: utf-8

import requests

url = 'https://so.dajie.com/job/search'

page_url = 'https://so.dajie.com/job/ajax/search/filter?keyword=python&order=0&city=&recruitType=&salary=&experience=&page=1&positionFunction=&_CSRFToken=&ajax=1'

session = requests.Session()

session.headers['referer'] = url

session.headers['user-agent'] = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.133 Safari/537.36'

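# Visit the search page first so the session picks up the cookies
# that the ajax endpoint appears to require.
session.get(url)

r = session.get(page_url)

print(r.text)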
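Since the failing request in the question came back as an HTML error page rather than JSON, it can help to check the body before using it. The following is only a minimal sketch: `try_parse_json` is a hypothetical helper name, and the JSON string used in the demonstration is a made-up shape, not the real response format of the site.

```python
import json


def try_parse_json(text):
    """Return the decoded object, or None when `text` is not valid JSON."""
    try:
        return json.loads(text)
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return None


# An HTML error page (like the one in the question) is not valid JSON.
html_error_page = '<html><body><div class="error-404"></div></body></html>'

# Made-up JSON body, for illustration only.
json_body = '{"result": 0, "data": {"list": []}}'

print(try_parse_json(html_error_page))
print(try_parse_json(json_body))
```

With the session code above, this would be used as `data = try_parse_json(r.text)`. A `None` result suggests the server returned the error page again, which points at the cookies or referer rather than at the query parameters.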

Alternatively, pass the cookies in directly:

# coding: utf-8

import requests

data = {

'keyword': 'python',

'order': '0',

'city': '',

'recruitType': '',

'salary': '',

'experience': '',

'page': '5',

'positionFunction': '',

'_CSRFToken': '',

'ajax': '1'

}

headers = {

'cookie': 'DJ_RF=empty; DJ_EU=http%3A%2F%2Fso.dajie.com%2Fjob%2Fsearch%3Fkeyword%3Dpython%26jobsearch%3D8; DJ_UVID=MTQ5MjU2OTgxOTU1ODg0Mzk1; __login_tips=1; dj_cap=1e41c3c0ca9602c45e6481cb53c19774; SO_COOKIE_V2=6a297gxq5vDDnl9D4q04fhTgrWB11xG9lMj7iLcnP1uM/Zuzzx1dkeHauV4blsO1KsRYQKEQDrDGdiAhRE9efdI8PnREZK1MhzR4',

'referer': 'https://so.dajie.com/job/search',

'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.133 Safari/537.36'

}

# For a GET request the query parameters go in params=, not data=.
r = requests.get('https://so.dajie.com/job/ajax/search/filter', params=data, headers=headers)

print(r.text)
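Instead of pasting the whole cookie string into a header, the same values can be handed to requests as a dict through its `cookies=` argument. The sketch below shows one way to split a browser cookie string into such a dict. `cookie_header_to_dict` is a hypothetical helper name, and the example string reuses a few of the cookie values shown above.

```python
def cookie_header_to_dict(cookie_header):
    """Split a 'name=value; name2=value2' cookie string into a dict."""
    cookies = {}
    for pair in cookie_header.split(';'):
        # Split on the first '=' only, so values containing '=' stay intact.
        name, sep, value = pair.strip().partition('=')
        if sep and name:
            cookies[name] = value
    return cookies


example = 'DJ_RF=empty; __login_tips=1; dj_cap=1e41c3c0ca9602c45e6481cb53c19774'
parsed = cookie_header_to_dict(example)
print(parsed)
```

The resulting dict could then be passed as `requests.get(url, params=data, headers=headers, cookies=parsed)`, with the `'cookie'` entry removed from `headers`. Cookies copied from a browser expire, so the session approach is likely to be more durable.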

