#--------------------------------------------------------------------------------- #说明:源码地址 #https://github.com/yasserg/crawler4j #此项目目的仅为学习此源代码。 #学习样例代码公布地址: #http://www.cnblogs.com/liinux 1.crawler4j源码学习(1):搜狐新闻网新闻标题采集爬虫 2.crawler4j源码学习(2):Ziroom租房网房源信息采集爬虫 #---------------------------------------------------------------------------------
crawler4j is an open source web crawler for Java which provides a simple interface for crawling the Web. Using it, you can setup a multi-threaded web crawler in few minutes.
To use the latest release of crawler4j, please use the following snippet in your pom.xml
<dependency>
<groupId>edu.uci.ics</groupId>
<artifactId>crawler4j</artifactId>
<version>4.2</version>
</dependency>
crawler4j JARs are available on the releases page and at Maven Central.
If you use crawler4j without Maven, be aware that crawler4j jar file has a couple of external dependencies. In releases page, you can find a file named crawler4j-X.Y-with-dependencies.jar that includes crawler4j and all of its dependencies as a bundle. You can add download it and add it to your classpath to get all the dependencies covered.
You need to create a crawler class that extends WebCrawler. This class decides which URLs should be crawled and handles the downloaded page. The following is a sample implementation:
public class MyCrawler extends WebCrawler {
private final static Pattern FILTERS = Pattern.compile(".*(\\.(css|js|gif|jpg"
+ "|png|mp3|mp3|zip|gz))$");
/**
* This method receives two parameters. The first parameter is the page
* in which we have discovered this new url and the second parameter is
* the new url. You should implement this function to specify whether
* the given url should be crawled or not (based on your crawling logic).
* In this example, we are instructing the crawler to ignore urls that
* have css, js, git, ... extensions and to only accept urls that start
* with "http://www.ics.uci.edu/". In this case, we didn't need the
* referringPage parameter to make the decision.
*/
@Override
public boolean shouldVisit(Page referringPage, WebURL url) {
String href = url.getURL().toLowerCase();
return !FILTERS.matcher(href).matches()
&& href.startsWith("http://www.ics.uci.edu/");
}
/**
* This function is called when a page is fetched and ready
* to be processed by your program.
*/
@Override
public void visit(Page page) {
String url = page.getWebURL().getURL();
System.out.println("URL: " + url);
if (page.getParseData() instanceof HtmlParseData) {
HtmlParseData htmlParseData = (HtmlParseData) page.getParseData();
String text = htmlParseData.getText();
String html = htmlParseData.getHtml();
Set<WebURL> links = htmlParseData.getOutgoingUrls();
System.out.println("Text length: " + text.length());
System.out.println("Html length: " + html.length());
System.out.println("Number of outgoing links: " + links.size());
}
}
}
As can be seen in the above code, there are two main functions that should be overridden:
You should also implement a controller class which specifies the seeds of the crawl, the folder in which intermediate crawl data should be stored and the number of concurrent threads:
public class Controller {
public static void main(String[] args) throws Exception {
String crawlStorageFolder = "/data/crawl/root";
int numberOfCrawlers = 7;
CrawlConfig config = new CrawlConfig();
config.setCrawlStorageFolder(crawlStorageFolder);
/*
* Instantiate the controller for this crawl.
*/
PageFetcher pageFetcher = new PageFetcher(config);
RobotstxtConfig robotstxtConfig = new RobotstxtConfig();
RobotstxtServer robotstxtServer = new RobotstxtServer(robotstxtConfig, pageFetcher);
CrawlController controller = new CrawlController(config, pageFetcher, robotstxtServer);
/*
* For each crawl, you need to add some seed urls. These are the first
* URLs that are fetched and then the crawler starts following links
* which are found in these pages
*/
controller.addSeed("http://www.ics.uci.edu/~lopes/");
controller.addSeed("http://www.ics.uci.edu/~welling/");
controller.addSeed("http://www.ics.uci.edu/");
/*
* Start the crawl. This is a blocking operation, meaning that your code
* will reach the line after this only when crawling is finished.
*/
controller.start(MyCrawler.class, numberOfCrawlers);
}
}
The controller class has a mandatory parameter of type CrawlConfig. Instances of this class can be used for configuring crawler4j. The following sections describe some details of configurations.
By default there is no limit on the depth of crawling. But you can limit the depth of crawling. For example, assume that you have a seed page "A", which links to "B", which links to "C", which links to "D". So, we have the following link structure:
A -> B -> C -> D
Since, "A" is a seed page, it will have a depth of 0. "B" will have depth of 1 and so on. You can set a limit on the depth of pages that crawler4j crawls. For example, if you set this limit to 2, it won't crawl page "D". To set the maximum depth you can use:
crawlConfig.setMaxDepthOfCrawling(maxDepthOfCrawling);
Although by default there is no limit on the number of pages to crawl, you can set a limit on this:
crawlConfig.setMaxPagesToFetch(maxPagesToFetch);
crawler4j is designed very efficiently and has the ability to crawl domains very fast (e.g., it has been able to crawl 200 Wikipedia pages per second). However, since this is against crawling policies and puts huge load on servers (and they might block you!), since version 1.3, by default crawler4j waits at least 200 milliseconds between requests. However, this parameter can be tuned:
crawlConfig.setPolitenessDelay(politenessDelay);
Should your crawl run behind a proxy? If so, you can use:
crawlConfig.setProxyHost("proxyserver.example.com");
crawlConfig.setProxyPort(8080);
If your proxy also needs authentication:
crawlConfig.setProxyUsername(username);
crawlConfig.setProxyPassword(password);
Sometimes you need to run a crawler for a long time. It is possible that the crawler terminates unexpectedly. In such cases, it might be desirable to resume the crawling. You would be able to resume a previously stopped/crashed crawl using the following settings:
crawlConfig.setResumableCrawling(true);
However, you should note that it might make the crawling slightly slower.
User-agent string is used for representing your crawler to web servers. See here for more details. By default crawler4j uses the following user agent string:
"crawler4j (https://github.com/yasserg/crawler4j/)"
However, you can overwrite it:
crawlConfig.setUserAgentString(userAgentString);
Copyright (c) 2010-2015 Yasser Ganjisaffar
Published under Apache License 2.0, see LICENSE
此处可能存在不合适展示的内容,页面不予展示。您可通过相关编辑功能自查并修改。
如您确认内容无涉及 不当用语 / 纯广告导流 / 暴力 / 低俗色情 / 侵权 / 盗版 / 虚假 / 无价值内容或违法国家有关法律法规的内容,可点击提交进行申诉,我们将尽快为您处理。