# fess-crawler
**Repository Path**: mirrors_codelibs/fess-crawler
## Basic Information
- **Project Name**: fess-crawler
- **Description**: Web/FileSystem Crawler Library
- **Primary Language**: Unknown
- **License**: Apache-2.0
- **Default Branch**: master
- **Homepage**: None
- **GVP Project**: No
## Statistics
- **Stars**: 0
- **Forks**: 0
- **Created**: 2020-08-08
- **Last Updated**: 2026-05-23
## Categories & Tags
**Categories**: Uncategorized
**Tags**: None
## README
# Fess Crawler
[Build](https://github.com/codelibs/fess-crawler/actions/workflows/maven.yml)
[License: Apache 2.0](https://opensource.org/licenses/Apache-2.0)
[Maven Central](https://maven-badges.herokuapp.com/maven-central/org.codelibs.fess/fess-crawler-parent)
## Overview
**Fess Crawler** is a powerful, flexible Java-based web crawling framework designed for enterprise-scale content extraction and processing. Built with a modular architecture, it supports multiple protocols (HTTP/HTTPS, File System, FTP, SMB, Cloud Storage) and provides extensive content extraction capabilities from various document formats.
### Key Features
- **Multi-Protocol Support**: HTTP/HTTPS, File System, FTP, SMB/CIFS, Cloud Storage (MinIO, S3)
- **Comprehensive Content Extraction**: Office documents, PDFs, archives, images, audio/video files
- **Multi-Threading**: Configurable thread pools for high-performance crawling
- **Fault Tolerance**: Built-in retry mechanisms and error handling
- **Flexible Configuration**: XML-based dependency injection with LastaFlute DI
- **Extensible Architecture**: Plugin system for custom extractors, transformers, and clients
- **Rate Limiting**: Politeness policies and interval controllers
- **URL Filtering**: Regex-based inclusion/exclusion patterns
- **Data Persistence**: Multiple backend options including OpenSearch integration
## Technology Stack
- **Java**: 21 or higher
- **Build System**: Maven 3.6 or higher
- **DI Container**: LastaFlute DI
- **HTTP Client**: Apache HttpComponents
- **Content Extraction**: Apache Tika, Apache POI, PDFBox
- **Testing**: JUnit 4, UTFlute, Testcontainers
- **Storage Backends**: OpenSearch, Memory-based
## Quick Start
### Prerequisites
- Java 21 or higher
- Maven 3.6 or higher
### Installation
Add the following dependency to your `pom.xml`:
```xml
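<!-- Core crawler framework -->
<dependency>
    <groupId>org.codelibs.fess</groupId>
    <artifactId>fess-crawler</artifactId>
    <version>15.2.0-SNAPSHOT</version>
</dependency>
<!-- LastaFlute DI integration -->
<dependency>
    <groupId>org.codelibs.fess</groupId>
    <artifactId>fess-crawler-lasta</artifactId>
    <version>15.2.0-SNAPSHOT</version>
</dependency>
<!-- OpenSearch backend -->
<dependency>
    <groupId>org.codelibs.fess</groupId>
    <artifactId>fess-crawler-opensearch</artifactId>
    <version>15.2.0-SNAPSHOT</version>
</dependency>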
```
### Basic Usage
```java
import org.codelibs.fess.crawler.Crawler;
import org.codelibs.fess.crawler.client.http.HcHttpClient;
import org.codelibs.fess.crawler.container.StandardCrawlerContainer;
import org.codelibs.fess.crawler.transformer.impl.FileTransformer;

public class BasicCrawlerExample {
    public static void main(String[] args) throws Exception {
        // Create crawler container
        StandardCrawlerContainer container = new StandardCrawlerContainer();

        // Configure basic components
        container.singleton("crawler", Crawler.class)
                .singleton("httpClient", HcHttpClient.class)
                .singleton("fileTransformer", FileTransformer.class);

        // Get crawler instance
        Crawler crawler = container.getComponent("crawler");

        // Configure crawling parameters
        crawler.addUrl("https://example.com");
        crawler.crawlerContext.setMaxAccessCount(100);
        crawler.crawlerContext.setNumOfThread(5);
        crawler.urlFilter.addInclude("https://example.com/.*");

        // Execute crawling
        String sessionId = crawler.execute();
        System.out.println("Crawling completed. Session ID: " + sessionId);
    }
}
```
### File System Crawling
```java
import org.codelibs.fess.crawler.client.fs.FileSystemClient;
// Configure for file system crawling
container.singleton("fsClient", FileSystemClient.class);
// Add file URL
crawler.addUrl("file:///path/to/directory");
crawler.urlFilter.addInclude("file:///path/to/directory/.*");
```
## Configuration
### XML Configuration
Fess Crawler uses XML-based configuration with LastaFlute DI. Place configuration files in your classpath:
```xml
```
### Crawler Context Configuration
```java
// Set maximum number of URLs to crawl
crawler.crawlerContext.setMaxAccessCount(1000);
// Set number of crawler threads
crawler.crawlerContext.setNumOfThread(10);
// Set maximum crawl depth
crawler.crawlerContext.setMaxDepth(3);
// Set request interval (politeness)
crawler.crawlerContext.setDefaultIntervalTime(1000); // 1 second
```
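These settings interact: the access limit, thread count, and interval together bound how long a crawl takes. The following is a rough back-of-the-envelope sketch, not part of the library. It assumes each thread pauses for the interval between its own requests and that fetch time is zero, so the result is only a lower bound. The class and method names are hypothetical.

```java
// Hypothetical helper for illustration only; not part of fess-crawler.
class CrawlBudgetSketch {

    /**
     * Lower bound on wall-clock crawl time in milliseconds.
     * Assumption: every thread waits intervalMillis between its own requests
     * and the requests themselves take no time.
     */
    static long minDurationMillis(long maxAccessCount, int numOfThread, long intervalMillis) {
        if (maxAccessCount < 0 || numOfThread <= 0 || intervalMillis < 0) {
            throw new IllegalArgumentException("arguments must be non-negative, threads positive");
        }
        // Requests handled by the busiest thread (ceiling division).
        long requestsPerThread = (maxAccessCount + numOfThread - 1) / numOfThread;
        // n requests are separated by n - 1 pauses.
        return Math.max(0L, requestsPerThread - 1) * intervalMillis;
    }

    public static void main(String[] args) {
        // Values taken from the configuration example above.
        System.out.println(minDurationMillis(1000, 10, 1000) + " ms");
    }
}
```

Under these assumptions, 1000 URLs on 10 threads with a 1000 ms interval need at least 99,000 ms. Real crawls take longer because of network and parsing time.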
### URL Filtering
```java
// Include patterns
crawler.urlFilter.addInclude("https://example.com/.*");
crawler.urlFilter.addInclude(".*\\.pdf$");
// Exclude patterns
crawler.urlFilter.addExclude(".*\\.js$");
crawler.urlFilter.addExclude(".*login.*");
```
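To see how the patterns above combine, here is a minimal standalone sketch using only `java.util.regex`. It is not the library's filter implementation. It assumes that patterns must match the whole URL, that any matching exclude rejects the URL, and that a URL passes if it matches at least one include. `UrlFilterSketch` and its methods are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical stand-in for illustration; not the fess-crawler UrlFilter.
class UrlFilterSketch {
    private final List<Pattern> includes = new ArrayList<>();
    private final List<Pattern> excludes = new ArrayList<>();

    void addInclude(String regex) {
        includes.add(Pattern.compile(regex));
    }

    void addExclude(String regex) {
        excludes.add(Pattern.compile(regex));
    }

    // Assumed semantics: excludes win; with no includes, anything not excluded passes.
    boolean match(String url) {
        for (Pattern p : excludes) {
            if (p.matcher(url).matches()) {
                return false;
            }
        }
        if (includes.isEmpty()) {
            return true;
        }
        for (Pattern p : includes) {
            if (p.matcher(url).matches()) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        UrlFilterSketch filter = new UrlFilterSketch();
        filter.addInclude("https://example.com/.*");
        filter.addInclude(".*\\.pdf$");
        filter.addExclude(".*\\.js$");
        filter.addExclude(".*login.*");

        System.out.println(filter.match("https://example.com/docs/index.html")); // true
        System.out.println(filter.match("https://example.com/app.js"));          // false
        System.out.println(filter.match("https://example.com/login?next=/"));    // false
        System.out.println(filter.match("https://other.org/report.pdf"));        // true
    }
}
```

Note that under these assumed semantics the `.*\\.pdf$` include admits PDFs from any host, not only `example.com`, because includes are combined with OR.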
## Supported Protocols and Formats
### Protocols
- **HTTP/HTTPS**: Full web crawling support with cookies, authentication, redirects
- **File System**: Local and network file system access
- **FTP**: FTP server crawling with authentication
- **SMB/CIFS**: Windows network shares
- **Storage**: Cloud storage systems (MinIO, S3-compatible)
### Content Formats
#### Office Documents
- Microsoft Office (Word, Excel, PowerPoint)
- OpenOffice/LibreOffice documents
- RTF, WordPerfect
#### PDFs and Images
- PDF documents (text and metadata extraction)
- Images (JPEG, PNG, GIF, TIFF, BMP)
- Image metadata (EXIF, IPTC, XMP)
#### Archives and Compressed Files
- ZIP, TAR, GZ archives
- LHA compression format
- Nested archive extraction
#### Web and Markup
- HTML, XHTML with XPath support
- XML documents
- JSON and structured data
#### Media Files
- Audio formats (MP3, WAV, FLAC)
- Video formats (MP4, AVI, MOV)
- Metadata extraction from media files
## Architecture
### Multi-Module Structure
```
fess-crawler-parent/
├── fess-crawler/ # Core crawler framework
│ ├── client/ # Protocol clients (HTTP, FTP, SMB, etc.)
│ ├── extractor/ # Content extractors
│ ├── transformer/ # Data transformers
│ └── service/ # Core services
├── fess-crawler-lasta/ # LastaFlute DI integration
└── fess-crawler-opensearch/ # OpenSearch backend
```
### Key Components
#### Core Engine
- **Crawler**: Main orchestrator managing crawl execution
- **CrawlerContext**: Execution context and configuration
- **CrawlerThread**: Individual crawler thread implementation
#### Client Architecture
- **HcHttpClient**: HTTP/HTTPS client using Apache HttpComponents
- **FileSystemClient**: File system access
- **FtpClient**: FTP protocol support
- **SmbClient**: SMB/CIFS network shares
- **StorageClient**: Cloud storage integration
#### Content Processing Pipeline
- **Extractors**: Content extraction from various formats
- **Transformers**: Data transformation and enrichment
- **Filters**: URL filtering with regex patterns
- **Rules**: Content processing rules and validation
## Building and Testing
### Build Commands
```bash
# Build all modules
mvn clean install
# Build without tests
mvn clean install -DskipTests
# Build specific module
mvn clean install -pl fess-crawler
# Generate test coverage report
mvn jacoco:report
```
### Code Quality
```bash
# Format code
mvn formatter:format
# Update license headers
mvn license:format
# Run static analysis
mvn spotbugs:check
```
### Running Tests
```bash
# Run all tests
mvn test
# Run specific test class
mvn test -Dtest=CrawlerTest
# Run specific test method
mvn test -Dtest=CrawlerTest#test_execute_web
# Run tests for specific module
mvn test -pl fess-crawler
```
## Examples
### Web Crawling with Custom Rules
```java
// Create crawler with custom configuration
StandardCrawlerContainer container = new StandardCrawlerContainer();

// Configure HTTP client with custom settings
container.singleton("httpClient", HcHttpClient.class, client -> {
    client.setUserAgent("MyBot/1.0");
    client.setConnectionTimeout(30000);
    client.setMaxConnections(100);
});

// Configure URL filtering
container.singleton("urlFilter", UrlFilterImpl.class, filter -> {
    filter.addInclude("https://example.com/.*");
    filter.addExclude(".*\\.(css|js|png|jpg|gif)$");
});

// Configure content extraction
container.singleton("tikaExtractor", TikaExtractor.class);
container.singleton("extractorFactory", ExtractorFactory.class, factory -> {
    factory.addExtractor("text/html", container.getComponent("tikaExtractor"));
    factory.addExtractor("application/pdf", container.getComponent("tikaExtractor"));
});

Crawler crawler = container.getComponent("crawler");
crawler.addUrl("https://example.com");
crawler.crawlerContext.setMaxAccessCount(500);
String sessionId = crawler.execute();
```
### Background Crawling
```java
// Configure for background execution
crawler.setBackground(true);
String sessionId = crawler.execute();

// Check crawling status
while (crawler.crawlerContext.getStatus() == CrawlerStatus.RUNNING) {
    Thread.sleep(1000);
    System.out.println("Crawling in progress...");
}

// Wait for completion
crawler.awaitTermination();
System.out.println("Crawling completed");
```
### Custom Content Extractor
```java
public class CustomExtractor extends AbstractExtractor {
    @Override
    public ExtractData getText(final InputStream inputStream, final Map<String, String> params) {
        // Custom extraction logic
        ExtractData extractData = new ExtractData();
        // ... implementation
        return extractData;
    }
}

// Register custom extractor
container.singleton("customExtractor", CustomExtractor.class);
container.singleton("extractorFactory", ExtractorFactory.class, factory -> {
    factory.addExtractor("application/custom", container.getComponent("customExtractor"));
});
```
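The registration above keys extractors by MIME type. The idea can be sketched with a plain map, independent of the library. The sketch below is an illustration only: `ExtractorRegistrySketch` is a hypothetical class, extractors are modeled as simple functions from bytes to text, and the normalization of `Content-Type` parameters plus the fallback behavior are assumptions rather than documented `ExtractorFactory` behavior.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.function.Function;

// Hypothetical MIME-keyed registry for illustration; not the fess-crawler ExtractorFactory.
class ExtractorRegistrySketch {
    private final Map<String, Function<byte[], String>> extractors = new HashMap<>();
    private final Function<byte[], String> fallback;

    ExtractorRegistrySketch(Function<byte[], String> fallback) {
        this.fallback = fallback;
    }

    void addExtractor(String mimeType, Function<byte[], String> extractor) {
        extractors.put(normalize(mimeType), extractor);
    }

    // Picks the extractor registered for the content type, or the fallback if none matches.
    String extract(String contentType, byte[] body) {
        return extractors.getOrDefault(normalize(contentType), fallback).apply(body);
    }

    // Drops parameters such as "; charset=UTF-8" and lower-cases the type.
    private static String normalize(String contentType) {
        int semicolon = contentType.indexOf(';');
        String base = semicolon >= 0 ? contentType.substring(0, semicolon) : contentType;
        return base.trim().toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        ExtractorRegistrySketch registry = new ExtractorRegistrySketch(body -> "");
        registry.addExtractor("application/custom",
                body -> new String(body, StandardCharsets.UTF_8).trim());
        byte[] data = "  hello  ".getBytes(StandardCharsets.UTF_8);
        System.out.println(registry.extract("application/custom; charset=UTF-8", data)); // hello
    }
}
```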
## Advanced Configuration
### Multi-Instance Crawling
```java
// Create multiple crawler instances
Crawler crawler1 = container.getComponent("crawler");
crawler1.setSessionId("session1");
crawler1.addUrl("https://site1.com");
Crawler crawler2 = container.getComponent("crawler");
crawler2.setSessionId("session2");
crawler2.addUrl("https://site2.com");
// Execute concurrently
crawler1.setBackground(true);
crawler2.setBackground(true);
String sessionId1 = crawler1.execute();
String sessionId2 = crawler2.execute();
crawler1.awaitTermination();
crawler2.awaitTermination();
```
### Custom Interval Control
```java
// Configure politeness policy
container.singleton("intervalController", DefaultIntervalController.class, controller -> {
    controller.setDelayMillisForWaitingNewUrl(5000);
    controller.setDefaultIntervalTime(1000);
});
```
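As a rough illustration of what a politeness interval means, the sketch below computes how long a thread should still wait so that consecutive requests are at least one interval apart. This is a generic technique, not a description of how `DefaultIntervalController` is implemented. The class and method names are hypothetical.

```java
// Hypothetical helper for illustration; not the fess-crawler interval controller.
class PolitenessDelaySketch {

    /**
     * Returns the remaining wait in milliseconds so that the next request
     * starts no sooner than intervalMillis after the previous one.
     */
    static long remainingDelayMillis(long lastRequestMillis, long nowMillis, long intervalMillis) {
        long elapsed = nowMillis - lastRequestMillis;
        return Math.max(0L, intervalMillis - elapsed);
    }

    // Sleeps for whatever part of the interval has not yet elapsed.
    static void pause(long lastRequestMillis, long intervalMillis) throws InterruptedException {
        long wait = remainingDelayMillis(lastRequestMillis, System.currentTimeMillis(), intervalMillis);
        if (wait > 0) {
            Thread.sleep(wait);
        }
    }

    public static void main(String[] args) {
        // 400 ms have passed since the last request, the interval is 1000 ms.
        System.out.println(remainingDelayMillis(1_000L, 1_400L, 1_000L)); // 600
    }
}
```

Subtracting the elapsed time, rather than always sleeping the full interval, keeps slow responses from adding to the delay.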
### Sitemap Support
```java
// Enable sitemap processing
container.singleton("sitemapsRule", SitemapsRule.class, rule -> {
    rule.addRule("url", ".*sitemap.*");
});

// Add sitemap URL
crawler.addUrl("https://example.com/sitemap.xml");
```
## Data Access and Storage
### Accessing Crawled Data
```java
// Get data service
DataService dataService = container.getComponent("dataService");

// Iterate through crawled data
dataService.iterate(sessionId, accessResult -> {
    System.out.println("URL: " + accessResult.getUrl());
    System.out.println("Status: " + accessResult.getHttpStatusCode());
    System.out.println("Content Type: " + accessResult.getMimeType());
    System.out.println("Content: " + accessResult.getContent());
    System.out.println("---");
});

// Get specific result
AccessResult result = dataService.getAccessResult(sessionId, url);

// Delete session data
dataService.delete(sessionId);
```
### OpenSearch Integration
```java
// Add OpenSearch dependency and configure
container.singleton("opensearchDataService", OpenSearchDataService.class, service -> {
    service.setIndexName("crawler-data");
    service.setHostname("localhost");
    service.setPort(9200);
});
```
## Performance Tuning
### Thread Configuration
```java
// Optimize thread pool settings
crawler.crawlerContext.setNumOfThread(20); // Number of crawler threads
crawler.crawlerContext.setMaxThreadCheckCount(50); // Thread monitoring frequency
```
### Connection Pool Tuning
```java
container.singleton("httpClient", HcHttpClient.class, client -> {
    client.setMaxConnections(200);         // Total connections
    client.setMaxConnectionsPerRoute(20);  // Per-host connections
    client.setConnectionTimeout(30000);    // Connection timeout
    client.setSocketTimeout(60000);        // Read timeout
});
```
### Memory Management
```java
// Configure memory usage
crawler.crawlerContext.setMaxAccessCount(10000); // Limit crawled URLs
crawler.crawlerContext.setMaxDepth(5);           // Limit crawl depth

// Cap the size of content handled per document
container.singleton("fileTransformer", FileTransformer.class, transformer -> {
    transformer.setMaxContentSize(10 * 1024 * 1024); // 10MB limit
});
```
## Troubleshooting
### Common Issues
#### Connection Timeouts
```java
// Increase timeout values
client.setConnectionTimeout(60000); // 60 seconds
client.setSocketTimeout(120000); // 120 seconds
```
#### Memory Issues
```java
// Reduce concurrent threads and the number of URLs crawled
crawler.crawlerContext.setNumOfThread(5);
crawler.crawlerContext.setMaxAccessCount(1000);
```
#### SSL/TLS Issues
```java
// Configure SSL settings
container.singleton("httpClient", HcHttpClient.class, client -> {
    client.setTrustAllCertificates(true); // For testing only
});
```
### Debug Logging
Enable debug logging by adding to your logging configuration:
```xml
```
### Monitoring
```java
// Monitor crawling progress
while (crawler.crawlerContext.getStatus() == CrawlerStatus.RUNNING) {
    int processed = dataService.getCount(sessionId);
    System.out.println("Processed: " + processed + " URLs");
    Thread.sleep(5000);
}
```
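A monitoring loop like the one above runs for as long as the crawl does. If an upper bound is wanted, the polling can be wrapped with a timeout. The following is a minimal library-independent sketch: `PollingSketch` and `waitUntilDone` are hypothetical names, and the running check is passed in as a `BooleanSupplier` so the sketch makes no assumptions about the crawler API.

```java
import java.util.function.BooleanSupplier;

// Hypothetical polling helper for illustration; not part of fess-crawler.
class PollingSketch {

    /**
     * Polls isRunning every pollMillis until it returns false.
     * Returns true if the task finished, false on timeout or interruption.
     */
    static boolean waitUntilDone(BooleanSupplier isRunning, long pollMillis, long timeoutMillis) {
        final long start = System.nanoTime();
        final long timeoutNanos = timeoutMillis * 1_000_000L;
        while (isRunning.getAsBoolean()) {
            if (System.nanoTime() - start >= timeoutNanos) {
                return false; // still running after the timeout
            }
            try {
                Thread.sleep(pollMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt for callers
                return false;
            }
        }
        return true;
    }
}
```

With the status check from the loop above, a call might look like `PollingSketch.waitUntilDone(() -> crawler.crawlerContext.getStatus() == CrawlerStatus.RUNNING, 5000, 600_000)`.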
## Contributing
We welcome contributions to Fess Crawler! Please follow these guidelines:
1. **Fork** the repository
2. **Create** a feature branch (`git checkout -b feature/amazing-feature`)
3. **Commit** your changes (`git commit -m 'Add amazing feature'`)
4. **Push** to the branch (`git push origin feature/amazing-feature`)
5. **Open** a Pull Request
### Development Setup
```bash
# Clone the repository
git clone https://github.com/codelibs/fess-crawler.git
cd fess-crawler
# Build the project
mvn clean install
# Run tests
mvn test
# Format code before committing
mvn formatter:format
mvn license:format
```
### Code Style
- Follow Java coding conventions
- Use proper JavaDoc comments for public APIs
- Include unit tests for new functionality
- Ensure all tests pass before submitting PR
## License
This project is licensed under the Apache License 2.0 - see the [LICENSE](LICENSE) file for details.