# doc-proc-solution-accelerator
**Repository Path**: mirrors_Azure/doc-proc-solution-accelerator
## Basic Information
- **Project Name**: doc-proc-solution-accelerator
- **Description**: Document Processing Solution Accelerator on Azure by Azure.
- **Primary Language**: Unknown
- **License**: MIT
- **Default Branch**: main
- **Homepage**: None
- **GVP Project**: No
## Statistics
- **Stars**: 0
- **Forks**: 0
- **Created**: 2025-10-11
- **Last Updated**: 2026-04-04
## Categories & Tags
**Categories**: Uncategorized
**Tags**: None
## README
# Document Processing Solution Accelerator
A comprehensive, enterprise-ready document processing solution built on Azure that enables organizations to rapidly deploy and scale document processing workflows. This accelerator combines the power of Azure AI services, cloud-native architecture, and modern development practices to provide a complete platform for document ingestion, processing, and analysis.
## π What is this Solution Accelerator?
This solution accelerator provides a production-ready foundation for building document processing applications on Azure. It includes:
- **Modular Processing Pipeline**: A flexible Python library (`doc-proc-lib`) for creating custom document processing workflows
- **Web-based Management UI**: React + TypeScript frontend (`doc-proc-web`) for managing pipelines, services, and monitoring executions
- **Scalable Worker Architecture**: High-performance background worker service (`doc-proc-worker`) with Azure Storage Queue integration
- **Data Source Crawler**: Scalable background worker service for crawling and retrieving content from sources (`doc-proc-crawler`)
- **RESTful API Backend**: FastAPI-based service (`doc-proc-api`) with Azure Cosmos DB for data persistence
- **Infrastructure as Code**: Bicep templates for automated Azure deployment (`doc-proc-deploy`)
### **[Getting Started β](#getting-started)**
## β¨ Key Benefits
- **πββοΈ Rapid Development**: Get started with document processing in minutes, not months
- **π§ Highly Configurable**: Modular architecture allows customization without core changes
- **βοΈ Cloud-Native**: Built specifically for Azure with best practices and security in mind
- **π Production-Ready**: Includes monitoring, logging, error handling, and scalability features
- **π Extensible**: Easy to integrate with existing systems and add custom processing steps
- **π° Cost-Effective**: Optimized resource usage with serverless and managed services
## ποΈ Architecture Overview
The solution consists of the following building blocks:
## π¦ Core Components
### π§ doc-proc-lib
**The Processing Engine** - A flexible Python library that serves as the heart of the document processing pipeline.
- **Modular Architecture**: Catalog-based configuration for services, steps, and pipelines
- **Azure Integration**: Built-in connectors for Blob Storage, AI Document Intelligence, OpenAI, and more
- **Async Processing**: High-performance asynchronous processing capabilities
- **Custom Components**: Easy framework for building custom processing steps and service integrations
- **Environment Management**: Comprehensive configuration management with environment-specific settings
π **[View detailed documentation β](./doc-proc-lib/README.md)**
### π¨ doc-proc-web
**The Management Interface** - A modern React-based web application for managing and monitoring document processing workflows.
- **Technology Stack**: React 18 + TypeScript + Vite
- **UI Framework**: Radix UI components with Tailwind CSS styling
- **Features**:
- Pipeline configuration and management
- Service and step catalog administration
- Real-time execution monitoring
- Interactive workflow designer
- Responsive design for desktop and mobile
π **[View detailed documentation β](./doc-proc-web/README.md)**
### π doc-proc-api
**The Backend API** - FastAPI-based service providing RESTful APIs for document processing operations.
- **Technology Stack**: FastAPI + Python with Azure Cosmos DB
- **Features**:
- RESTful API for all CRUD operations
- Service catalog management
- Step catalog management
- Pipeline configuration and execution
- Health monitoring and diagnostics
- CORS-enabled for web client integration
- Azure App Configuration integration
- Comprehensive error handling and validation
π **[View detailed documentation β](./doc-proc-api/README.md)**
### β‘ doc-proc-worker
**The Processing Engine** - High-performance background processing service for executing document processing jobs at scale.
- **Technology Stack**: Python with Azure Storage Queue integration
- **Key Features**:
- Asynchronous queue processing with Azure Storage Queues
- Pipeline execution with step-by-step orchestration
- Batch processing with progress tracking and error recovery
- Multiprocessing support for CPU-intensive workloads
- Fault tolerance with retry mechanisms and graceful degradation
- Health monitoring with automatic restart capabilities
- Cloud-native Azure integration
π **[View detailed documentation β](./doc-proc-worker/README.md)**
### π doc-proc-crawler
**The Document Discovery Engine** - Intelligent distributed crawler service for automated document discovery and ingestion from various sources.
- **Technology Stack**: Python with Azure Cosmos DB and distributed coordination
- **Key Features**:
- Distributed coordination with lease-based conflict prevention
- Automatic source discovery from Cosmos DB configuration
- Multi-source support (file systems, cloud storage, databases, APIs)
- Intelligent load balancing across multiple deployment instances
- Self-healing operations with automatic restart and error recovery
- Configurable crawling schedules and polling intervals
- Metadata extraction and content indexing
- Integration with document processing pipelines
π **[View detailed documentation β](./doc-proc-crawler/README.md)**
### ποΈ doc-proc-deploy
**Infrastructure as Code** - Automated deployment templates and scripts for Azure resources.
- **Bicep Templates**: Complete infrastructure provisioning with Azure Bicep
- **Automated Scripts**: Shell scripts for streamlined deployment process
- **Resource Provisioning**: Container Apps, Container Registry, Cosmos DB, Storage Accounts, App Configuration
- **Environment Configuration**: Support for multiple environments (dev, staging, prod)
- **CI/CD Ready**: Scripts designed for integration with automated pipelines
π **[View deployment documentation β](./doc-proc-deploy/README.md)**
## π‘ Use Cases and Scenarios
This solution accelerator can be applied across various industries and document processing workflows. Below are common use cases with domain-specific examples and configuration patterns.
### π Multi-Modal Document Processing
#### **Enterprise Content Unification**
Process diverse document types and formats in a unified workflow with intelligent format detection and specialized extraction:
```
βββββββββββββββββββ βββββββββββββββββββ βββββββββββββββββββ βββββββββββββββββββ
β Mixed Content βββββΆβ Format βββββΆβ Specialized βββββΆβ Unified Data β
β Repository β β Detection β β Extraction β β Structure β
βββββββββββββββββββ βββββββββββββββββββ βββββββββββββββββββ βββββββββββββββββββ
```
**Supported Document Types:**
- **Structured PDFs**: Forms, invoices, contracts with precise field extraction
- **Scanned Images**: Historical documents, handwritten forms with OCR processing
- **Office Documents**: Word, Excel, PowerPoint with native content extraction
- **Email Archives**: Outlook PST/MSG files with attachment processing
- **Web Content**: HTML pages, online forms with content scraping
- **Audio/Video**: Meeting recordings, training materials with transcription
**Intelligent Processing Pipeline:**
- **Format Detection**: Automatic MIME type detection and content analysis
- **Content Routing**: Route documents to specialized extraction services based on format
- **Cross-Reference Linking**: Connect related documents across different formats
- **Metadata Harmonization**: Standardize metadata across diverse document types
- **Quality Validation**: Ensure extraction accuracy with confidence scoring
**Key Benefits:**
- **Format Agnostic**: Single pipeline handles any document type
- **Intelligent Routing**: Automatic selection of optimal processing methods
- **Preservation of Context**: Maintain relationships between multi-format document sets
- **Scalable Processing**: Parallel processing of different formats simultaneously
### π’ Financial Services
#### **Invoice Processing Automation**
Streamline accounts payable workflows with automated invoice processing:
```
βββββββββββββββββββ βββββββββββββββββββ βββββββββββββββββββ βββββββββββββββββββ
β PDF/Email βββββΆβ Document AI βββββΆβ Validation βββββΆβ ERP System β
β Invoice β β Extraction β β & Approval β β Integration β
βββββββββββββββββββ βββββββββββββββββββ βββββββββββββββββββ βββββββββββββββββββ
```
#### **Loan Document Processing**
Accelerate loan application processing with document verification:
- **Bank Statements**: Extract transaction history, balance verification, income calculation
- **Tax Returns**: Parse tax forms, verify income sources, calculate debt-to-income ratios
- **Employment Letters**: Extract salary information, employment status, tenure
#### **Automated Document Ingestion**
Enterprise-scale document discovery and ingestion with intelligent coordination:
```
βββββββββββββββββββ βββββββββββββββββββ βββββββββββββββββββ βββββββββββββββββββ
β File Shares βββββΆβ Distributed βββββΆβ Document βββββΆβ Processing β
β Cloud Storage β β Crawler β β Queue β β Pipeline β
β API Endpoints β β Discovery β β Management β β Execution β
βββββββββββββββββββ βββββββββββββββββββ βββββββββββββββββββ βββββββββββββββββββ
```
**Key Benefits:**
- **Multi-Source Support**: Automatically discover and ingest from file systems, SharePoint, cloud storage, databases, and REST APIs
- **Distributed Coordination**: Multiple crawler instances work together with lease-based coordination to prevent duplicates
- **Smart Scheduling**: Configurable polling intervals and change detection for efficient resource usage
- **Self-Healing**: Automatic restart of failed processes and graceful handling of source availability
### π₯ Healthcare
#### **Medical Records Digitization**
Transform paper-based medical records into structured digital formats:
```
βββββββββββββββββββ βββββββββββββββββββ βββββββββββββββββββ βββββββββββββββββββ
β Scanned Chart βββββΆβ OCR + βββββΆβ FHIR Data βββββΆβ EHR β
β Documents β β Medical AI β β Mapping β β Integration β
βββββββββββββββββββ βββββββββββββββββββ βββββββββββββββββββ βββββββββββββββββββ
```
#### **Clinical Trial Document Processing**
Process research documents and patient data for clinical trials:
- **Consent Forms**: Extract patient consent status, trial parameters
- **Case Report Forms**: Structure clinical observations and measurements
- **Adverse Event Reports**: Parse safety data for regulatory compliance
### βοΈ Legal Services
#### **Contract Analysis and Review**
Automate contract review processes with AI-powered analysis:
```
βββββββββββββββββββ βββββββββββββββββββ βββββββββββββββββββ βββββββββββββββββββ
β Contract βββββΆβ Clause βββββΆβ Risk Analysis βββββΆβ Review β
β Document β β Extraction β β & Compliance β β Dashboard β
βββββββββββββββββββ βββββββββββββββββββ βββββββββββββββββββ βββββββββββββββββββ
```
#### **Legal Discovery Document Processing**
Process large volumes of documents for litigation support:
- **Email Processing**: Extract metadata, identify privileged communications
- **Document Classification**: Categorize documents by relevance and privilege
- **Redaction**: Automatically redact sensitive information
### π Manufacturing & Supply Chain
#### **Quality Control Documentation**
Process inspection reports, certificates, and compliance documents:
```
βββββββββββββββββββ βββββββββββββββββββ βββββββββββββββββββ βββββββββββββββββββ
β Inspection βββββΆβ Data βββββΆβ Compliance βββββΆβ Quality β
β Reports β β Extraction β β Verification β β Dashboard β
βββββββββββββββββββ βββββββββββββββββββ βββββββββββββββββββ βββββββββββββββββββ
```
#### **Supplier Document Management**
- **Certificates of Compliance**: Verify supplier certifications and standards
- **Material Safety Data Sheets**: Extract safety information for regulatory compliance
- **Purchase Orders**: Process and validate supplier documentation
## Getting Started
### Prerequisites
Before getting started, ensure you have the following:
**Deployment/Development Environment:**
- Python 3.11+ with pip
- Node.js 18+ with npm (required for building front-end)
- Docker Desktop (optional, for containerized development)
- Git for version control
- Azure CLI (required for Azure deployment)
- Powershell Version 7.0 or higher (if using powershell scripts for deployment)
- bash (if using bash scripts for deployment)
**Azure Resources:**
- Azure subscription with appropriate permissions
- Resource group with required permissions (Contributor and User Access Administrator)
- Azure AI Model Deployment Quota for AI services
### Quick Start Options
Choose the deployment method that best fits your needs:
#### βοΈ **Azure Cloud Deployment**
Production-ready deployment on Azure with full scalability:
**Using Powershell for deployment**
```powershell
# 0. Clone the repository
git clone https://github.com/Azure/doc-proc-solution-accelerator.git
cd doc-proc-solution-accelerator
# 1. Deploy Azure infrastructure (AI Foundry, Container Apps, Cosmos DB, Storage Account, etc.)
pwsh .\doc-proc-deploy\DeployAzureInfra.ps1 -ResourceGroup myResourceGroup -Location westus -p docproc
# 2. Build and push Docker images to Azure Container Registry
pwsh .\doc-proc-deploy\BuildAndPushImages.ps1 -Registry myregistry.azurecr.io -Tag latest
# 3. Deploy applications to Azure Container Apps
pwsh .\doc-proc-deploy\DeployApps.ps1 -ResourceGroup myResourceGroup
```
**Using shell scripts for deployment**
```bash
# 0. Clone the repository
git clone https://github.com/Azure/doc-proc-solution-accelerator.git
cd doc-proc-solution-accelerator
# 1. Deploy Azure infrastructure (AI Foundry, Container Apps, Cosmos DB, Storage Account, etc.)
./doc-proc-deploy/deploy-azure-infra.sh -g myResourceGroup -l westus -p docproc
# 2. Build and push Docker images to Azure Container Registry
./doc-proc-deploy/build-and-push-images.sh -r myregistry.azurecr.io -t latest
# 3. Deploy applications to Azure Container Apps
./doc-proc-deploy/deploy-apps.sh -g myResourceGroup
```
This creates:
- **Azure Container Registry** for storing Docker images
- **Azure Container Apps** for hosting API, Web, worker, and crawler services
- **Azure Cosmos DB** for configuration data persistence
- **Azure Storage Account** for queue management and blob storage
- **Azure App Configuration** for centralized configuration management
- **Azure AI Foundry** for AI Services
#### π§ **Local Development**
Once the Azure resources are deployed, you can run the solution services locally for development:
```pwsh
cd doc-proc-solution-accelerator
# Configure environment variables for each service
# Copy .env.example to .env and update with your Azure resource endpoints
cp doc-proc-api\.env.example doc-proc-api\.env
cp doc-proc-worker\.env.example doc-proc-worker\.env
cp doc-proc-crawler\.env.example doc-proc-crawler\.env
cp doc-proc-web\.env.example doc-proc-web\.env
# Edit the .env files with your Azure resource information:
# doc-proc-api\.env - Add Azure App Configuration endpoint
# doc-proc-worker\.env - Add Azure App Configuration endpoint
# doc-proc-crawler\.env - Add Azure App Configuration endpoint
# doc-proc-web\.env - Update API base URL if different from http://localhost:8090
# Start all services locally with auto-reload
pwsh .\doc-proc-deploy\StartServicesLocally.ps1
#./doc-proc-deploy/start-services-locally.sh # if using shell
```
**Required Configuration Values:**
- `AZURE_APP_CONFIG_ENDPOINT`: Endpoint URL (format: `https://.azconfig.io`)
- `VITE_API_BASE_URL`: API endpoint for the web application (default: `http://localhost:8090`)
This will start:
- **API Server**: http://localhost:8090 (FastAPI backend)
- **Web Application**: http://localhost:8080 (React frontend)
- **Worker Service**: Background processing service
- **Crawler Service**: Document discovery and ingestion service
π‘ **For detailed instructions and additional options, see the [comprehensive deployment guide β](./doc-proc-deploy/README.md)**
### π Monitoring and Scaling
#### Auto-Scaling Configuration
Container Apps are configured with intelligent auto-scaling:
- **HTTP-based scaling**: Scales based on concurrent requests
- **Queue-based scaling**: Worker scales based on queue depth
- **CPU/Memory scaling**: Scales based on resource utilization
#### Health Monitoring
All services include comprehensive health monitoring:
- **Health check endpoints** for Container Apps
- **Application Insights integration** for telemetry
- **Log Analytics workspace** for centralized logging
### π Security Features
- **Managed Identity**: Services use managed identities for Azure resource access
- **Network isolation**: Container Apps Environment with virtual network integration
- **HTTPS enforcement**: All endpoints secured with SSL/TLS
## π·οΈ Repository Structure
```
doc-proc-solution-accelerator/
βββ doc-proc-lib/ # π§ Core processing library and pipeline engine
β βββ doc/ # Processing modules and components
β βββ examples/ # Example pipelines and usage patterns
β βββ tests/ # Unit and integration tests
β βββ pipeline_config.yaml # Pipeline configuration examples
β βββ service_catalog.yaml # Service definitions and configurations
β βββ step_catalog.yaml # Step definitions and configurations
βββ doc-proc-api/ # π FastAPI backend service
β βββ app/ # Application code
β β βββ db/ # Database models and operations
β β βββ models/ # Pydantic models and schemas
β β βββ routers/ # API route handlers
β β βββ services/ # Business logic services
β βββ infra/ # Infrastructure configuration for API
β βββ Dockerfile # Container configuration
β βββ requirements.txt # Python dependencies
βββ doc-proc-web/ # π¨ React + TypeScript frontend
β βββ src/ # Source code
β β βββ components/ # Reusable UI components
β β βββ pages/ # Application pages
β β βββ services/ # API integration services
β β βββ types/ # TypeScript type definitions
β βββ infra/ # Infrastructure configuration for web
β βββ Dockerfile # Container configuration
β βββ package.json # Node.js dependencies
βββ doc-proc-worker/ # β‘ Background processing worker
β βββ app/ # Worker application code
β βββ demo/ # Demo scripts and examples
β βββ infra/ # Infrastructure configuration for worker
β βββ tmp/ # Temporary processing files
β βββ Dockerfile # Container configuration
β βββ requirements.txt # Python dependencies
βββ doc-proc-crawler/ # π Document discovery and crawling service
β βββ app/ # Crawler application code
β β βββ discovery/ # Distributed coordination and source discovery
β β βββ sources/ # Source connectors (filesystem, cloud, API, etc.)
β β βββ models/ # Data models for crawling and coordination
β β βββ proxy/ # Azure service integration proxies
β βββ infra/ # Infrastructure configuration for crawler
β βββ DISTRIBUTED_ARCHITECTURE.md # Distributed coordination documentation
β βββ Dockerfile # Container configuration
β βββ run_crawler.py # Main crawler entry point
β βββ requirements.txt # Python dependencies
βββ doc-proc-deploy/ # ποΈ Infrastructure as Code and deployment
β βββ infra/
β β βββ bicep/ # Azure Bicep templates
β β βββ main.bicep # Main infrastructure template
β β βββ modules/ # Reusable Bicep modules
β βββ deploy-azure-infra.sh # Deploy infrastructure script
β βββ build-and-push-images.sh # Build and push Docker images
β βββ deploy-apps.sh # Deploy applications script
β βββ start-services-locally.sh # Local development setup
β βββ DEPLOYMENT.md # Detailed deployment guide
βββ bicepconfig.json # Bicep configuration
βββ logo.svg # Solution logo
βββ LICENSE # MIT license
βββ README.md # This documentation
```
## π‘ Planned Features
Some of the great features planned for the next release:
- Deployment as per Zero Trust Architecture best practices with integration with VNETs.
- Pipelines for management of deletions of documents.
- More Sources, Steps and Services for various use cases.
## π€ Contributing
We welcome contributions! Please see our [Contributing Guidelines](CONTRIBUTING.md) for details on how to:
- Submit bug reports and feature requests
- Set up your development environment
- Submit pull requests
- Follow our coding standards
## π License
This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.
## π Support
- **Documentation**: Detailed guides in each component's README
- **Issues**: Report bugs and request features via GitHub Issues
- **Discussions**: Community discussions and Q&A in GitHub Discussions
---
β‘ **Ready to get started?** Follow the [Getting Started](#getting-started) guide above or dive deep into the [doc-proc-lib documentation](./doc-proc-lib/README.md).