# doc-proc-solution-accelerator **Repository Path**: mirrors_Azure/doc-proc-solution-accelerator ## Basic Information - **Project Name**: doc-proc-solution-accelerator - **Description**: Document Processing Solution Accelerator on Azure by Azure. - **Primary Language**: Unknown - **License**: MIT - **Default Branch**: main - **Homepage**: None - **GVP Project**: No ## Statistics - **Stars**: 0 - **Forks**: 0 - **Created**: 2025-10-11 - **Last Updated**: 2026-04-04 ## Categories & Tags **Categories**: Uncategorized **Tags**: None ## README

doc-proc-solution-accelerator

# Document Processing Solution Accelerator A comprehensive, enterprise-ready document processing solution built on Azure that enables organizations to rapidly deploy and scale document processing workflows. This accelerator combines the power of Azure AI services, cloud-native architecture, and modern development practices to provide a complete platform for document ingestion, processing, and analysis. ## πŸš€ What is this Solution Accelerator? This solution accelerator provides a production-ready foundation for building document processing applications on Azure. It includes: - **Modular Processing Pipeline**: A flexible Python library (`doc-proc-lib`) for creating custom document processing workflows - **Web-based Management UI**: React + TypeScript frontend (`doc-proc-web`) for managing pipelines, services, and monitoring executions - **Scalable Worker Architecture**: High-performance background worker service (`doc-proc-worker`) with Azure Storage Queue integration - **Data Source Crawler**: Scalable background worker service for crawling and retrieving content from sources (`doc-proc-crawler`) - **RESTful API Backend**: FastAPI-based service (`doc-proc-api`) with Azure Cosmos DB for data persistence - **Infrastructure as Code**: Bicep templates for automated Azure deployment (`doc-proc-deploy`) ### **[Getting Started β†’](#getting-started)** ## ✨ Key Benefits - **πŸƒβ€β™‚οΈ Rapid Development**: Get started with document processing in minutes, not months - **πŸ”§ Highly Configurable**: Modular architecture allows customization without core changes - **☁️ Cloud-Native**: Built specifically for Azure with best practices and security in mind - **πŸ“Š Production-Ready**: Includes monitoring, logging, error handling, and scalability features - **πŸ”Œ Extensible**: Easy to integrate with existing systems and add custom processing steps - **πŸ’° Cost-Effective**: Optimized resource usage with serverless and managed services ## πŸ—οΈ Architecture Overview The solution consists of the following building blocks: Text changing depending on mode. Light: 'So light!' Dark: 'So dark!' ## πŸ“¦ Core Components ### πŸ”§ doc-proc-lib **The Processing Engine** - A flexible Python library that serves as the heart of the document processing pipeline. - **Modular Architecture**: Catalog-based configuration for services, steps, and pipelines - **Azure Integration**: Built-in connectors for Blob Storage, AI Document Intelligence, OpenAI, and more - **Async Processing**: High-performance asynchronous processing capabilities - **Custom Components**: Easy framework for building custom processing steps and service integrations - **Environment Management**: Comprehensive configuration management with environment-specific settings πŸ“– **[View detailed documentation β†’](./doc-proc-lib/README.md)** ### 🎨 doc-proc-web **The Management Interface** - A modern React-based web application for managing and monitoring document processing workflows. - **Technology Stack**: React 18 + TypeScript + Vite - **UI Framework**: Radix UI components with Tailwind CSS styling - **Features**: - Pipeline configuration and management - Service and step catalog administration - Real-time execution monitoring - Interactive workflow designer - Responsive design for desktop and mobile πŸ“– **[View detailed documentation β†’](./doc-proc-web/README.md)** ### πŸš€ doc-proc-api **The Backend API** - FastAPI-based service providing RESTful APIs for document processing operations. - **Technology Stack**: FastAPI + Python with Azure Cosmos DB - **Features**: - RESTful API for all CRUD operations - Service catalog management - Step catalog management - Pipeline configuration and execution - Health monitoring and diagnostics - CORS-enabled for web client integration - Azure App Configuration integration - Comprehensive error handling and validation πŸ“– **[View detailed documentation β†’](./doc-proc-api/README.md)** ### ⚑ doc-proc-worker **The Processing Engine** - High-performance background processing service for executing document processing jobs at scale. - **Technology Stack**: Python with Azure Storage Queue integration - **Key Features**: - Asynchronous queue processing with Azure Storage Queues - Pipeline execution with step-by-step orchestration - Batch processing with progress tracking and error recovery - Multiprocessing support for CPU-intensive workloads - Fault tolerance with retry mechanisms and graceful degradation - Health monitoring with automatic restart capabilities - Cloud-native Azure integration πŸ“– **[View detailed documentation β†’](./doc-proc-worker/README.md)** ### πŸ” doc-proc-crawler **The Document Discovery Engine** - Intelligent distributed crawler service for automated document discovery and ingestion from various sources. - **Technology Stack**: Python with Azure Cosmos DB and distributed coordination - **Key Features**: - Distributed coordination with lease-based conflict prevention - Automatic source discovery from Cosmos DB configuration - Multi-source support (file systems, cloud storage, databases, APIs) - Intelligent load balancing across multiple deployment instances - Self-healing operations with automatic restart and error recovery - Configurable crawling schedules and polling intervals - Metadata extraction and content indexing - Integration with document processing pipelines πŸ“– **[View detailed documentation β†’](./doc-proc-crawler/README.md)** ### πŸ—οΈ doc-proc-deploy **Infrastructure as Code** - Automated deployment templates and scripts for Azure resources. - **Bicep Templates**: Complete infrastructure provisioning with Azure Bicep - **Automated Scripts**: Shell scripts for streamlined deployment process - **Resource Provisioning**: Container Apps, Container Registry, Cosmos DB, Storage Accounts, App Configuration - **Environment Configuration**: Support for multiple environments (dev, staging, prod) - **CI/CD Ready**: Scripts designed for integration with automated pipelines πŸ“– **[View deployment documentation β†’](./doc-proc-deploy/README.md)** ## πŸ’‘ Use Cases and Scenarios This solution accelerator can be applied across various industries and document processing workflows. Below are common use cases with domain-specific examples and configuration patterns. ### πŸ“„ Multi-Modal Document Processing #### **Enterprise Content Unification** Process diverse document types and formats in a unified workflow with intelligent format detection and specialized extraction: ``` β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚ Mixed Content │───▢│ Format │───▢│ Specialized │───▢│ Unified Data β”‚ β”‚ Repository β”‚ β”‚ Detection β”‚ β”‚ Extraction β”‚ β”‚ Structure β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ ``` **Supported Document Types:** - **Structured PDFs**: Forms, invoices, contracts with precise field extraction - **Scanned Images**: Historical documents, handwritten forms with OCR processing - **Office Documents**: Word, Excel, PowerPoint with native content extraction - **Email Archives**: Outlook PST/MSG files with attachment processing - **Web Content**: HTML pages, online forms with content scraping - **Audio/Video**: Meeting recordings, training materials with transcription **Intelligent Processing Pipeline:** - **Format Detection**: Automatic MIME type detection and content analysis - **Content Routing**: Route documents to specialized extraction services based on format - **Cross-Reference Linking**: Connect related documents across different formats - **Metadata Harmonization**: Standardize metadata across diverse document types - **Quality Validation**: Ensure extraction accuracy with confidence scoring **Key Benefits:** - **Format Agnostic**: Single pipeline handles any document type - **Intelligent Routing**: Automatic selection of optimal processing methods - **Preservation of Context**: Maintain relationships between multi-format document sets - **Scalable Processing**: Parallel processing of different formats simultaneously ### 🏒 Financial Services #### **Invoice Processing Automation** Streamline accounts payable workflows with automated invoice processing: ``` β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚ PDF/Email │───▢│ Document AI │───▢│ Validation │───▢│ ERP System β”‚ β”‚ Invoice β”‚ β”‚ Extraction β”‚ β”‚ & Approval β”‚ β”‚ Integration β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ ``` #### **Loan Document Processing** Accelerate loan application processing with document verification: - **Bank Statements**: Extract transaction history, balance verification, income calculation - **Tax Returns**: Parse tax forms, verify income sources, calculate debt-to-income ratios - **Employment Letters**: Extract salary information, employment status, tenure #### **Automated Document Ingestion** Enterprise-scale document discovery and ingestion with intelligent coordination: ``` β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚ File Shares │───▢│ Distributed │───▢│ Document │───▢│ Processing β”‚ β”‚ Cloud Storage β”‚ β”‚ Crawler β”‚ β”‚ Queue β”‚ β”‚ Pipeline β”‚ β”‚ API Endpoints β”‚ β”‚ Discovery β”‚ β”‚ Management β”‚ β”‚ Execution β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ ``` **Key Benefits:** - **Multi-Source Support**: Automatically discover and ingest from file systems, SharePoint, cloud storage, databases, and REST APIs - **Distributed Coordination**: Multiple crawler instances work together with lease-based coordination to prevent duplicates - **Smart Scheduling**: Configurable polling intervals and change detection for efficient resource usage - **Self-Healing**: Automatic restart of failed processes and graceful handling of source availability ### πŸ₯ Healthcare #### **Medical Records Digitization** Transform paper-based medical records into structured digital formats: ``` β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚ Scanned Chart │───▢│ OCR + │───▢│ FHIR Data │───▢│ EHR β”‚ β”‚ Documents β”‚ β”‚ Medical AI β”‚ β”‚ Mapping β”‚ β”‚ Integration β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ ``` #### **Clinical Trial Document Processing** Process research documents and patient data for clinical trials: - **Consent Forms**: Extract patient consent status, trial parameters - **Case Report Forms**: Structure clinical observations and measurements - **Adverse Event Reports**: Parse safety data for regulatory compliance ### βš–οΈ Legal Services #### **Contract Analysis and Review** Automate contract review processes with AI-powered analysis: ``` β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚ Contract │───▢│ Clause │───▢│ Risk Analysis │───▢│ Review β”‚ β”‚ Document β”‚ β”‚ Extraction β”‚ β”‚ & Compliance β”‚ β”‚ Dashboard β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ ``` #### **Legal Discovery Document Processing** Process large volumes of documents for litigation support: - **Email Processing**: Extract metadata, identify privileged communications - **Document Classification**: Categorize documents by relevance and privilege - **Redaction**: Automatically redact sensitive information ### 🏭 Manufacturing & Supply Chain #### **Quality Control Documentation** Process inspection reports, certificates, and compliance documents: ``` β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚ Inspection │───▢│ Data │───▢│ Compliance │───▢│ Quality β”‚ β”‚ Reports β”‚ β”‚ Extraction β”‚ β”‚ Verification β”‚ β”‚ Dashboard β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ ``` #### **Supplier Document Management** - **Certificates of Compliance**: Verify supplier certifications and standards - **Material Safety Data Sheets**: Extract safety information for regulatory compliance - **Purchase Orders**: Process and validate supplier documentation ## Getting Started ### Prerequisites Before getting started, ensure you have the following: **Deployment/Development Environment:** - Python 3.11+ with pip - Node.js 18+ with npm (required for building front-end) - Docker Desktop (optional, for containerized development) - Git for version control - Azure CLI (required for Azure deployment) - Powershell Version 7.0 or higher (if using powershell scripts for deployment) - bash (if using bash scripts for deployment) **Azure Resources:** - Azure subscription with appropriate permissions - Resource group with required permissions (Contributor and User Access Administrator) - Azure AI Model Deployment Quota for AI services ### Quick Start Options Choose the deployment method that best fits your needs: #### ☁️ **Azure Cloud Deployment** Production-ready deployment on Azure with full scalability: **Using Powershell for deployment** ```powershell # 0. Clone the repository git clone https://github.com/Azure/doc-proc-solution-accelerator.git cd doc-proc-solution-accelerator # 1. Deploy Azure infrastructure (AI Foundry, Container Apps, Cosmos DB, Storage Account, etc.) pwsh .\doc-proc-deploy\DeployAzureInfra.ps1 -ResourceGroup myResourceGroup -Location westus -p docproc # 2. Build and push Docker images to Azure Container Registry pwsh .\doc-proc-deploy\BuildAndPushImages.ps1 -Registry myregistry.azurecr.io -Tag latest # 3. Deploy applications to Azure Container Apps pwsh .\doc-proc-deploy\DeployApps.ps1 -ResourceGroup myResourceGroup ``` **Using shell scripts for deployment** ```bash # 0. Clone the repository git clone https://github.com/Azure/doc-proc-solution-accelerator.git cd doc-proc-solution-accelerator # 1. Deploy Azure infrastructure (AI Foundry, Container Apps, Cosmos DB, Storage Account, etc.) ./doc-proc-deploy/deploy-azure-infra.sh -g myResourceGroup -l westus -p docproc # 2. Build and push Docker images to Azure Container Registry ./doc-proc-deploy/build-and-push-images.sh -r myregistry.azurecr.io -t latest # 3. Deploy applications to Azure Container Apps ./doc-proc-deploy/deploy-apps.sh -g myResourceGroup ``` This creates: - **Azure Container Registry** for storing Docker images - **Azure Container Apps** for hosting API, Web, worker, and crawler services - **Azure Cosmos DB** for configuration data persistence - **Azure Storage Account** for queue management and blob storage - **Azure App Configuration** for centralized configuration management - **Azure AI Foundry** for AI Services #### πŸ”§ **Local Development** Once the Azure resources are deployed, you can run the solution services locally for development: ```pwsh cd doc-proc-solution-accelerator # Configure environment variables for each service # Copy .env.example to .env and update with your Azure resource endpoints cp doc-proc-api\.env.example doc-proc-api\.env cp doc-proc-worker\.env.example doc-proc-worker\.env cp doc-proc-crawler\.env.example doc-proc-crawler\.env cp doc-proc-web\.env.example doc-proc-web\.env # Edit the .env files with your Azure resource information: # doc-proc-api\.env - Add Azure App Configuration endpoint # doc-proc-worker\.env - Add Azure App Configuration endpoint # doc-proc-crawler\.env - Add Azure App Configuration endpoint # doc-proc-web\.env - Update API base URL if different from http://localhost:8090 # Start all services locally with auto-reload pwsh .\doc-proc-deploy\StartServicesLocally.ps1 #./doc-proc-deploy/start-services-locally.sh # if using shell ``` **Required Configuration Values:** - `AZURE_APP_CONFIG_ENDPOINT`: Endpoint URL (format: `https://.azconfig.io`) - `VITE_API_BASE_URL`: API endpoint for the web application (default: `http://localhost:8090`) This will start: - **API Server**: http://localhost:8090 (FastAPI backend) - **Web Application**: http://localhost:8080 (React frontend) - **Worker Service**: Background processing service - **Crawler Service**: Document discovery and ingestion service πŸ’‘ **For detailed instructions and additional options, see the [comprehensive deployment guide β†’](./doc-proc-deploy/README.md)** ### πŸ“Š Monitoring and Scaling #### Auto-Scaling Configuration Container Apps are configured with intelligent auto-scaling: - **HTTP-based scaling**: Scales based on concurrent requests - **Queue-based scaling**: Worker scales based on queue depth - **CPU/Memory scaling**: Scales based on resource utilization #### Health Monitoring All services include comprehensive health monitoring: - **Health check endpoints** for Container Apps - **Application Insights integration** for telemetry - **Log Analytics workspace** for centralized logging ### πŸ” Security Features - **Managed Identity**: Services use managed identities for Azure resource access - **Network isolation**: Container Apps Environment with virtual network integration - **HTTPS enforcement**: All endpoints secured with SSL/TLS ## 🏷️ Repository Structure ``` doc-proc-solution-accelerator/ β”œβ”€β”€ doc-proc-lib/ # πŸ”§ Core processing library and pipeline engine β”‚ β”œβ”€β”€ doc/ # Processing modules and components β”‚ β”œβ”€β”€ examples/ # Example pipelines and usage patterns β”‚ β”œβ”€β”€ tests/ # Unit and integration tests β”‚ β”œβ”€β”€ pipeline_config.yaml # Pipeline configuration examples β”‚ β”œβ”€β”€ service_catalog.yaml # Service definitions and configurations β”‚ └── step_catalog.yaml # Step definitions and configurations β”œβ”€β”€ doc-proc-api/ # πŸš€ FastAPI backend service β”‚ β”œβ”€β”€ app/ # Application code β”‚ β”‚ β”œβ”€β”€ db/ # Database models and operations β”‚ β”‚ β”œβ”€β”€ models/ # Pydantic models and schemas β”‚ β”‚ β”œβ”€β”€ routers/ # API route handlers β”‚ β”‚ └── services/ # Business logic services β”‚ β”œβ”€β”€ infra/ # Infrastructure configuration for API β”‚ β”œβ”€β”€ Dockerfile # Container configuration β”‚ └── requirements.txt # Python dependencies β”œβ”€β”€ doc-proc-web/ # 🎨 React + TypeScript frontend β”‚ β”œβ”€β”€ src/ # Source code β”‚ β”‚ β”œβ”€β”€ components/ # Reusable UI components β”‚ β”‚ β”œβ”€β”€ pages/ # Application pages β”‚ β”‚ β”œβ”€β”€ services/ # API integration services β”‚ β”‚ └── types/ # TypeScript type definitions β”‚ β”œβ”€β”€ infra/ # Infrastructure configuration for web β”‚ β”œβ”€β”€ Dockerfile # Container configuration β”‚ └── package.json # Node.js dependencies β”œβ”€β”€ doc-proc-worker/ # ⚑ Background processing worker β”‚ β”œβ”€β”€ app/ # Worker application code β”‚ β”œβ”€β”€ demo/ # Demo scripts and examples β”‚ β”œβ”€β”€ infra/ # Infrastructure configuration for worker β”‚ β”œβ”€β”€ tmp/ # Temporary processing files β”‚ β”œβ”€β”€ Dockerfile # Container configuration β”‚ └── requirements.txt # Python dependencies β”œβ”€β”€ doc-proc-crawler/ # πŸ” Document discovery and crawling service β”‚ β”œβ”€β”€ app/ # Crawler application code β”‚ β”‚ β”œβ”€β”€ discovery/ # Distributed coordination and source discovery β”‚ β”‚ β”œβ”€β”€ sources/ # Source connectors (filesystem, cloud, API, etc.) β”‚ β”‚ β”œβ”€β”€ models/ # Data models for crawling and coordination β”‚ β”‚ └── proxy/ # Azure service integration proxies β”‚ β”œβ”€β”€ infra/ # Infrastructure configuration for crawler β”‚ β”œβ”€β”€ DISTRIBUTED_ARCHITECTURE.md # Distributed coordination documentation β”‚ β”œβ”€β”€ Dockerfile # Container configuration β”‚ β”œβ”€β”€ run_crawler.py # Main crawler entry point β”‚ └── requirements.txt # Python dependencies β”œβ”€β”€ doc-proc-deploy/ # πŸ—οΈ Infrastructure as Code and deployment β”‚ β”œβ”€β”€ infra/ β”‚ β”‚ └── bicep/ # Azure Bicep templates β”‚ β”‚ β”œβ”€β”€ main.bicep # Main infrastructure template β”‚ β”‚ └── modules/ # Reusable Bicep modules β”‚ β”œβ”€β”€ deploy-azure-infra.sh # Deploy infrastructure script β”‚ β”œβ”€β”€ build-and-push-images.sh # Build and push Docker images β”‚ β”œβ”€β”€ deploy-apps.sh # Deploy applications script β”‚ β”œβ”€β”€ start-services-locally.sh # Local development setup β”‚ └── DEPLOYMENT.md # Detailed deployment guide β”œβ”€β”€ bicepconfig.json # Bicep configuration β”œβ”€β”€ logo.svg # Solution logo β”œβ”€β”€ LICENSE # MIT license └── README.md # This documentation ``` ## πŸ’‘ Planned Features Some of the great features planned for the next release: - Deployment as per Zero Trust Architecture best practices with integration with VNETs. - Pipelines for management of deletions of documents. - More Sources, Steps and Services for various use cases. ## 🀝 Contributing We welcome contributions! Please see our [Contributing Guidelines](CONTRIBUTING.md) for details on how to: - Submit bug reports and feature requests - Set up your development environment - Submit pull requests - Follow our coding standards ## πŸ“„ License This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details. ## πŸ†˜ Support - **Documentation**: Detailed guides in each component's README - **Issues**: Report bugs and request features via GitHub Issues - **Discussions**: Community discussions and Q&A in GitHub Discussions --- ⚑ **Ready to get started?** Follow the [Getting Started](#getting-started) guide above or dive deep into the [doc-proc-lib documentation](./doc-proc-lib/README.md).