
Vincent Hou / WeiboScrapy


WeiboScrapy

WeiboScrapy is a simple Node.js module for crawling Sina Weibo search pages. WeiboScrapy uses cheerio as a server-side replacement for jQuery, and PhantomJS as a headless WebKit browser to fetch lazily rendered pages.

Install

Enter the root folder (the one containing package.json) and install the dependency modules (make sure Node.js is installed):

npm install

Usage

Create a new instance of WeiboScrapy.

var scrapy = new WeiboScrapy(options);

Options

Options are optional. Available options are:

  • delay (default: 200) - time in ms to wait for page rendering
  • dumpFile (default: 'output.json') - file path where the generated JSON data is stored; only used in debug mode
  • logFile (default: '') - file path where log information is stored
  • cookieFile (default: 'src/cookie.dat') - cookie data used to simulate a logged-in user's requests
  • appKey (default: '4209225449') - the app key generated by the Sina open platform
  • accessToken (default: '') - the access token generated by the Sina open platform (either appKey or accessToken is required)
  • eraseFiles (default: true) - whether to remove the log and dump files before requesting data
  • usePhantom (default: false) - whether to use PhantomJS to send requests
  • os (default: '') - the platform PhantomJS runs on
  • sleep (default: 30000) - time in ms between requests
  • concurrency (default: 1) - number of simultaneous requests
  • debug (default: false) - print debug information to the console and write JSON data to output.json in the working folder
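As a rough sketch of how the options above could be merged with their defaults (only the option names and default values come from the list; the merging logic itself is an assumption, not the module's actual implementation):

```javascript
// Defaults taken from the option list above.
var DEFAULTS = {
  delay: 200,
  dumpFile: 'output.json',
  logFile: '',
  cookieFile: 'src/cookie.dat',
  appKey: '4209225449',
  accessToken: '',
  eraseFiles: true,
  usePhantom: false,
  os: '',
  sleep: 30000,
  concurrency: 1,
  debug: false
};

// Hypothetical helper: user options override the defaults.
function mergeOptions(options) {
  return Object.assign({}, DEFAULTS, options || {});
}

var opts = mergeOptions({ debug: true, sleep: 10000 });
console.log(opts.debug, opts.sleep, opts.delay); // true 10000 200
```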

Available methods

crawl()

The default method for crawling Sina Weibo search data.

var scrapy = new WeiboScrapy({debug: true, logFile: 'test.log'});

scrapy.crawl();

get(url, callback)

Accepts two arguments: the URL of a page to download and a callback. The callback receives the parsed JSON data object as its first argument.

var scrapy = new WeiboScrapy(options);

scrapy.get('http://s.weibo.com/wb/vincent&xsort=hot&page=1', function(data) {
	console.log(data);
});

loop(url, options, callback)

Accepts three arguments - a URL, loop options and a callback. This method allows looping through pages under specific conditions. The URL has to contain {%i}, which will be replaced with the page number. Loop options are:

  • start (default: 1) - start page
  • step (default: 1) - number by which to increase the iterator
  • end (default: false) - the page number at which to end the loop
  • while (default: false) - can be a regexp (tested against the page content), a function (expected to return true or false), or a string (a jQuery-like selector; if no elements are found, the loop stops). If set, the loop continues until the condition fails.
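The three forms of the while option could be evaluated along these lines (a sketch only; shouldContinue is a hypothetical helper, not part of the module's API, and $ stands for a cheerio-loaded page as used elsewhere in this README):

```javascript
// Hypothetical evaluator for the three `while` forms described above.
function shouldContinue(cond, pageHtml, $) {
  if (cond instanceof RegExp) return cond.test(pageHtml);         // regexp form
  if (typeof cond === 'function') return cond(pageHtml) === true; // function form
  if (typeof cond === 'string') return $(cond).length > 0;        // selector form
  return false; // `while` not set: nothing to continue on
}

console.log(shouldContinue(/weibo/, '<p>sina weibo</p>'));       // true
console.log(shouldContinue(function () { return false; }, ''));  // false
```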

Example

Retrieve the article titles of the first 5 pages on net.tutsplus.com:

var scrapy = new WeiboScrapy();

scrapy.loop('http://net.tutsplus.com/page/{%i}/', { end: 5 }, function($) {

	$('.post_title a').each(function() {
		console.log( $(this).text().trim() );
	});

});
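The {%i} expansion that drives the loop above can be sketched as follows (buildPageUrls is a hypothetical illustration of the start/step/end options, not part of the module's API):

```javascript
// Hypothetical sketch: expand a {%i} URL template over the loop options.
function buildPageUrls(template, opts) {
  opts = opts || {};
  var start = opts.start || 1;
  var step = opts.step || 1;
  var end = opts.end === undefined ? false : opts.end;
  var urls = [];
  // Without an `end`, this sketch produces nothing; the real loop()
  // would instead rely on a `while` condition to stop.
  for (var i = start; end !== false && i <= end; i += step) {
    urls.push(template.replace('{%i}', i));
  }
  return urls;
}

// Returns the URLs for pages 1, 2 and 3.
console.log(buildPageUrls('http://net.tutsplus.com/page/{%i}/', { end: 3 }));
```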

More examples can be found in the examples folder.

TODO

  • refine the cookie strategy
  • optimize the crawling strategy
  • support loop function and sleep/concurrency option
  • update document
  • add workers
  • more tests
  • create lightweight branch
