WeiboScrapy is a simple Node.js module for crawling Sina Weibo search pages. WeiboScrapy uses cheerio as a server-side replacement for jQuery, and PhantomJS as a headless WebKit browser to fetch pages that are rendered lazily.
Enter the root folder (the one containing the package.json file) and install the dependency modules (make sure Node.js is installed first):
npm install
Create a new instance of WeiboScrapy.
var scrapy = new WeiboScrapy(options);
The options object is optional. Available options are:

delay
(default: 200) - time in ms to wait for page rendering
dumpFile
(default: 'output.json') - file path where the generated JSON data is stored; only used in debug mode
logFile
(default: '') - file path where log information is stored
cookieFile
(default: 'src/cookie.dat') - the cookie data used to mimic a logged-in user's requests
appKey
(default: '4209225449') - the app key generated by the Sina platform
accessToken
(default: '') - the access token generated by the Sina platform (either appKey or accessToken is required)
eraseFiles
(default: true) - remove the log and dump files before requesting data
usePhantom
(default: false) - whether to use PhantomJS to send requests
os
(default: '') - the running platform, needed when using PhantomJS
sleep
(default: 30000) - time in ms between requests
concurrency
(default: 1) - number of simultaneous requests
debug
(default: false) - print debug information to the console and write the JSON data to output.json in the working folder

The crawl method is the default entry point for crawling Sina Weibo search data:
var scrapy = new WeiboScrapy({debug: true, logFile: 'test.log'});
scrapy.crawl();
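Any option not passed to the constructor falls back to the defaults listed above. A minimal sketch of that fallback behavior (hypothetical — `DEFAULTS` and `mergeOptions` are illustrative names, not the module's actual internals):

```javascript
// Hypothetical sketch of how WeiboScrapy might merge user options
// with its documented defaults. Not the module's actual code.
var DEFAULTS = {
  delay: 200,
  dumpFile: 'output.json',
  logFile: '',
  cookieFile: 'src/cookie.dat',
  appKey: '4209225449',
  accessToken: '',
  eraseFiles: true,
  usePhantom: false,
  os: '',
  sleep: 30000,
  concurrency: 1,
  debug: false
};

function mergeOptions(options) {
  var merged = {};
  Object.keys(DEFAULTS).forEach(function (key) {
    // Prefer the user's value when present, otherwise fall back to the default.
    merged[key] = (options && options.hasOwnProperty(key)) ? options[key] : DEFAULTS[key];
  });
  return merged;
}
```

For example, `mergeOptions({debug: true})` would keep `delay` at 200 while switching `debug` on.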
The get method accepts two arguments: the URL of a page to download, and a callback. The callback receives the JSON data object as its first argument:
var scrapy = new WeiboScrapy(options);
scrapy.get('http://s.weibo.com/wb/vincent&xsort=hot&page=1', function(data) {
console.log(data);
});
The loop method accepts three arguments: a URL, loop options, and a callback.
It lets you loop through pages under specific conditions. The URL has to contain {%i},
which will be replaced with the iterator number. Loop options are:
start
(default: 1) - the start page
step
(default: 1) - the number by which to increase the iterator
end
(default: false) - the page number at which to end the loop
while
(default: false) - can be a regexp (tested against the page content), a function (expected to return true or false), or a string (a jQuery-like selector; if no elements are found, the loop stops). If set, the loop continues until the while condition is no longer met.

Retrieve the article titles of the first 5 pages on net.tutsplus.com:
var scrapy = new WeiboScrapy();
scrapy.loop('http://net.tutsplus.com/page/{%i}/', { end: 5 }, function($) {
$('.post_title a').each(function() {
console.log( $(this).text().trim() );
});
});
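The {%i} substitution and the end/while checks described above can be sketched roughly as follows (hypothetical helper names — this is not the module's actual code):

```javascript
// Hypothetical sketch of loop()'s iteration logic, based on the
// documented options. Not the module's actual implementation.
function buildUrl(template, i) {
  // Replace the {%i} placeholder with the current iterator value.
  return template.replace('{%i}', i);
}

function shouldContinue(opts, i, content, $) {
  // `end` bounds the loop when set to a page number.
  if (opts.end !== false && i > opts.end) return false;
  var w = opts.while;
  if (w instanceof RegExp) return w.test(content);          // regexp: tested against page content
  if (typeof w === 'function') return w(content) === true;  // function: must return true
  if (typeof w === 'string') return $(w).length > 0;        // selector: stop when no elements match
  return true; // no while condition: only `end` bounds the loop
}
```

With `{ end: 5 }`, for instance, `shouldContinue` returns false as soon as the iterator passes 5.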
More examples can be found in the examples folder.