
Vincent Hou / WeiboScrapy


WeiboScrapy

WeiboScrapy is a simple Node.js module for crawling Sina Weibo search pages. WeiboScrapy uses cheerio as a server-side replacement for jQuery, and PhantomJS as a headless WebKit browser to fetch lazily rendered pages.

Install

Enter the root folder (the one containing package.json) and install the dependency modules (make sure Node.js is installed):

npm install

Usage

Create a new instance of WeiboScrapy.

var scrapy = new WeiboScrapy(options);

Options

Options are optional. Available options are:

  • delay (default: 200) - time in ms to wait for page rendering
  • dumpFile (default: 'output.json') - file path where the generated JSON data is stored; only used in debug mode
  • logFile (default: '') - file path where log information is stored
  • cookieFile (default: 'src/cookie.dat') - cookie data used to simulate a logged-in user's requests
  • appKey (default: '4209225449') - the app key generated by the Sina open platform
  • accessToken (default: '') - the access token generated by the Sina open platform (either appKey or accessToken is required)
  • eraseFiles (default: true) - whether to remove the log and dump files before requesting data
  • usePhantom (default: false) - whether to use PhantomJS to send requests
  • os (default: '') - the platform PhantomJS runs on
  • sleep (default: 30000) - time in ms between requests
  • concurrency (default: 1) - number of simultaneous requests
  • debug (default: false) - print debug information to the console and write JSON data to output.json in the working folder
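As a rough sketch of how the options above could be merged with their defaults (only the option names and default values come from the list; the merging logic itself is an assumption, not the module's actual implementation):

```javascript
// Defaults taken from the option list above.
var DEFAULTS = {
  delay: 200,
  dumpFile: 'output.json',
  logFile: '',
  cookieFile: 'src/cookie.dat',
  appKey: '4209225449',
  accessToken: '',
  eraseFiles: true,
  usePhantom: false,
  os: '',
  sleep: 30000,
  concurrency: 1,
  debug: false
};

// Hypothetical helper: user options override the defaults.
function mergeOptions(options) {
  return Object.assign({}, DEFAULTS, options || {});
}

var opts = mergeOptions({ debug: true, sleep: 10000 });
console.log(opts.debug, opts.sleep, opts.delay); // true 10000 200
```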

Available methods

crawl()

The default method for crawling Sina Weibo search data.

var scrapy = new WeiboScrapy({debug: true, logFile: 'test.log'});

scrapy.crawl();

get(url, callback)

Accepts two arguments: the URL of a page to download and a callback. The callback receives the parsed JSON data object as its first argument.

var scrapy = new WeiboScrapy(options);

scrapy.get('http://s.weibo.com/wb/vincent&xsort=hot&page=1', function(data) {
	console.log(data);
});

loop(url, options, callback)

Accepts three arguments - a URL, loop options and a callback. This method allows looping through pages under specific conditions. The URL has to contain {%i}, which will be replaced with the page number. Loop options are:

  • start (default: 1) - start page
  • step (default: 1) - number by which to increase the iterator
  • end (default: false) - the page number at which to end the loop
  • while (default: false) - can be a regexp (tested against the page content), a function (expected to return true or false), or a string (a jQuery-like selector; if no elements are found, the loop stops). If set, the loop continues until the condition fails.
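The three forms of the while option could be evaluated along these lines (a sketch only; shouldContinue is a hypothetical helper, not part of the module's API, and $ stands for a cheerio-loaded page as used elsewhere in this README):

```javascript
// Hypothetical evaluator for the three `while` forms described above.
function shouldContinue(cond, pageHtml, $) {
  if (cond instanceof RegExp) return cond.test(pageHtml);         // regexp form
  if (typeof cond === 'function') return cond(pageHtml) === true; // function form
  if (typeof cond === 'string') return $(cond).length > 0;        // selector form
  return false; // `while` not set: nothing to continue on
}

console.log(shouldContinue(/weibo/, '<p>sina weibo</p>'));       // true
console.log(shouldContinue(function () { return false; }, ''));  // false
```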

Example

Retrieve the article titles of the first 5 pages on net.tutsplus.com:

var scrapy = new WeiboScrapy();

scrapy.loop('http://net.tutsplus.com/page/{%i}/', { end: 5 }, function($) {

	$('.post_title a').each(function() {
		console.log( $(this).text().trim() );
	});

});
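The {%i} expansion that drives the loop above can be sketched as follows (buildPageUrls is a hypothetical illustration of the start/step/end options, not part of the module's API):

```javascript
// Hypothetical sketch: expand a {%i} URL template over the loop options.
function buildPageUrls(template, opts) {
  opts = opts || {};
  var start = opts.start || 1;
  var step = opts.step || 1;
  var end = opts.end === undefined ? false : opts.end;
  var urls = [];
  // Without an `end`, this sketch produces nothing; the real loop()
  // would instead rely on a `while` condition to stop.
  for (var i = start; end !== false && i <= end; i += step) {
    urls.push(template.replace('{%i}', i));
  }
  return urls;
}

// Returns the URLs for pages 1, 2 and 3.
console.log(buildPageUrls('http://net.tutsplus.com/page/{%i}/', { end: 3 }));
```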

More examples can be found in the examples folder.

TODO

  • refine the cookie strategy
  • optimize the crawling strategy
  • support loop function and sleep/concurrency option
  • update document
  • add workers
  • more tests
  • create lightweight branch
