Architecture
Understanding ActiCrawl's architecture helps you make the most of our platform and optimize your web scraping workflows.
System Overview
ActiCrawl is built on a distributed, cloud-native architecture designed for reliability, scalability, and performance.
┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐
│ Client Apps │────▶│ API Gateway │────▶│ Load Balancer │
└─────────────────┘ └─────────────────┘ └─────────────────┘
│
┌─────────────────────────┴─────────────────────────┐
│ │
┌───────▼────────┐ ┌────────▼───────┐
│ Authentication │ │ Rate Limiter │
│ Service │ │ Service │
└────────────────┘ └────────────────┘
│ │
└─────────────────┬─────────────────────────────────┘
│
┌────────▼────────┐
│ Task Queue │
│ (Solid Queue) │
└────────┬────────┘
│
┌─────────────────────────┴─────────────────────────┐
│ │
┌───────▼────────┐ ┌────────▼───────┐
│ Scraper Worker │ │ Scraper Worker │
│ Pool (n) │ │ Pool (n+1) │
└────────────────┘ └────────────────┘
│ │
└─────────────────┬─────────────────────────────────┘
│
┌─────────▼─────────┐
│ Data Processing │
│ Pipeline │
└─────────┬─────────┘
│
┌─────────────────┴─────────────────────┐
│ │
┌───────▼────────┐ ┌────────▼───────┐
│ Object Storage │ │ Database │
│ (S3) │ │ (PostgreSQL) │
└────────────────┘ └────────────────┘
Core Components
1. API Gateway
The entry point for all client requests. Responsibilities include:
- Request routing and load balancing
- SSL termination
- Request/response transformation
- API versioning
Technology Stack:
- NGINX as a reverse proxy
- Kong for API management
- Cloudflare for DDoS protection
2. Authentication Service
Handles all authentication and authorization:
- API key validation
- JWT token management
- Permission checking
- Usage tracking
Features:
- Sub-millisecond authentication
- Distributed session management
- Role-based access control (RBAC)
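The checks above can be sketched as a simple lookup: validate the API key, then consult a role-to-permission table. This is a minimal illustration only — the key names, roles, and permissions below are invented for the example, not ActiCrawl's actual schema:

```ruby
# Illustrative API keys and RBAC tables (example data, not real keys).
API_KEYS = {
  "ak_live_example" => { role: :admin },
  "ak_test_example" => { role: :viewer }
}.freeze

ROLE_PERMISSIONS = {
  admin:  [:scrape, :read, :manage_keys],
  viewer: [:read]
}.freeze

# Look up the account for a bearer token; nil means the key is invalid.
def authenticate(headers)
  key = headers["Authorization"].to_s.sub(/\ABearer /, "")
  API_KEYS[key]
end

# Permission check: valid key AND the key's role grants the action.
def authorized?(headers, action)
  account = authenticate(headers)
  return false unless account
  ROLE_PERMISSIONS.fetch(account[:role], []).include?(action)
end
```

In production this lookup is backed by a distributed session store rather than an in-memory hash, which is what makes sub-millisecond authentication feasible.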
3. Task Queue System
Manages asynchronous job processing:
- Job prioritization based on plan tier
- Retry logic with exponential backoff
- Dead letter queue for failed jobs
- Real-time job status updates
Technology:
- SQLite/PostgreSQL for queue management
- Solid Queue for job processing
- ActionCable for real-time updates
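The retry behavior described above — exponential backoff with a dead-letter cutoff — can be sketched as a small scheduling function. The base delay, cap, and attempt limit here are illustrative values, not ActiCrawl's actual configuration:

```ruby
# Example retry parameters (illustrative values).
MAX_ATTEMPTS = 5
BASE_DELAY   = 2.0   # seconds
MAX_DELAY    = 60.0  # cap so backoff never grows unbounded

# Returns the delay before the next attempt, or nil when the job
# should be moved to the dead letter queue instead of retried.
def next_retry_delay(attempt)
  return nil if attempt >= MAX_ATTEMPTS
  [BASE_DELAY * (2**attempt), MAX_DELAY].min
end
```

So attempt 0 retries after 2 s, attempt 3 after 16 s, and once the attempt limit is reached the job is dead-lettered rather than retried.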
4. Scraper Workers
The heart of our scraping engine:
- Headless browser management (Chrome/Firefox)
- JavaScript rendering
- Cookie and session handling
- Anti-detection measures
Key Features:
- Dynamic worker scaling
- Browser fingerprint randomization
- Automatic proxy rotation
- Resource optimization
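Fingerprint randomization can be pictured as drawing each job's browser profile from pools of plausible values. The pools below are tiny examples — real fingerprinting covers many more attributes (fonts, time zones, WebGL parameters, and so on):

```ruby
# Example pools (illustrative; truncated user-agent strings).
USER_AGENTS = [
  "Mozilla/5.0 (Windows NT 10.0; Win64; x64) ...",
  "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) ..."
].freeze

VIEWPORTS = [[1920, 1080], [1366, 768], [1536, 864]].freeze

# Pick a random but internally consistent profile for one job.
def random_fingerprint(rng = Random.new)
  width, height = VIEWPORTS.sample(random: rng)
  {
    user_agent: USER_AGENTS.sample(random: rng),
    viewport:   { width: width, height: height }
  }
end
```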
5. Data Processing Pipeline
Transforms raw scraped data:
- HTML parsing and cleaning
- Content extraction
- Format conversion (Markdown, JSON, etc.)
- AI-powered content enhancement
Processing Steps:
1. Raw HTML collection
2. JavaScript execution (if needed)
3. Content extraction
4. Format transformation
5. Quality validation
6. Compression and storage
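The later steps (extraction, format transformation, validation) can be sketched as a chain of transforms. The logic below is deliberately toy-sized — real extraction is far more involved — and note that heading conversion runs before the remaining tags are stripped:

```ruby
# Format transformation: convert <h1> headings to Markdown
# before the other tags are removed.
def to_markdown(html)
  html.gsub(%r{<h1[^>]*>(.*?)</h1>}m) { "# #{$1}" }
end

# Content extraction: drop scripts, strip remaining tags,
# collapse whitespace.
def extract_text(html)
  html.gsub(%r{<script.*?</script>}m, "")
      .gsub(/<[^>]+>/, " ")
      .gsub(/\s+/, " ")
      .strip
end

# Quality validation: reject empty or near-empty results.
def valid?(text)
  text.length > 10
end

# The pipeline: transform, extract, validate (nil = failed validation).
def process(raw_html)
  text = extract_text(to_markdown(raw_html))
  valid?(text) ? text : nil
end
```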
6. Storage Layer
Distributed storage for reliability:
- Object Storage (S3): Screenshots, raw HTML
- Database (PostgreSQL): Metadata, user data, analytics
- Cache (Solid Cache): Frequently accessed data
- CDN: Global content delivery
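One way to picture the storage layer is as a routing table from artifact type to backend. The mapping below mirrors the list above but is an example, not ActiCrawl's actual configuration:

```ruby
# Illustrative artifact-type -> backend routing.
STORAGE_ROUTES = {
  screenshot: :object_storage,  # S3
  raw_html:   :object_storage,  # S3
  metadata:   :database,        # PostgreSQL
  analytics:  :database,        # PostgreSQL
  hot_result: :cache            # Solid Cache
}.freeze

# Unknown artifact types fall back to object storage.
def store_for(artifact_type)
  STORAGE_ROUTES.fetch(artifact_type, :object_storage)
end
```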
Request Lifecycle
1. Request Initiation
Client → API Gateway → Authentication → Rate Limiting → Task Queue
2. Task Processing
Task Queue → Worker Selection → Browser Launch → Page Load → Content Extraction
3. Response Delivery
Data Processing → Storage → Response Formation → Client Delivery
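The three phases above amount to a job moving through a small state machine. The sketch below is a minimal stand-in with illustrative state names, not the service's real job model:

```ruby
# Each phase advances the job one state; :completed is terminal.
TRANSITIONS = {
  received:   :queued,     # gateway -> auth -> rate limit -> task queue
  queued:     :processing, # a worker picks the job up
  processing: :completed   # processed, stored, and delivered
}.freeze

# Advance a job one step; raises if the job is already terminal.
def advance(job)
  next_state = TRANSITIONS[job[:state]]
  raise "terminal state: #{job[:state]}" unless next_state
  job.merge(state: next_state)
end
```

Clients typically observe these transitions via real-time status updates (ActionCable) or by polling, rather than driving them directly.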
Scaling Strategy
Horizontal Scaling
- Worker Pools: Automatically scale based on queue depth
- Database Replication: Read replicas for query distribution
- Cache Clustering: Solid Cache cluster for high availability
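Scaling worker pools on queue depth can be sketched as a target-size rule: workers proportional to backlog, clamped to a floor and ceiling. The thresholds here are example values:

```ruby
# Example autoscaling parameters (illustrative values).
MIN_WORKERS     = 2
MAX_WORKERS     = 50
JOBS_PER_WORKER = 10  # backlog each worker is expected to absorb

# Desired pool size for the current queue depth.
def target_workers(queue_depth)
  desired = (queue_depth.to_f / JOBS_PER_WORKER).ceil
  desired.clamp(MIN_WORKERS, MAX_WORKERS)
end
```

A pool at the floor stays warm for latency; the ceiling bounds cost during traffic spikes.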
Vertical Scaling
- Resource Allocation: Dynamic CPU/memory allocation per job
- Browser Optimization: Lightweight browser configurations
- Connection Pooling: Efficient resource utilization
High Availability
Multi-Region Deployment
┌──────────────┐ ┌──────────────┐ ┌──────────────┐
│ US-East-1 │────▶│ EU-West-1 │────▶│ AP-Southeast │
│ Primary │ │ Replica │ │ Replica │
└──────────────┘ └──────────────┘ └──────────────┘
Failover Strategy
- Active-Active: Load balanced across regions
- Health Checks: Continuous monitoring
- Automatic Failover: Sub-second switchover
- Data Consistency: Eventually consistent model
Security Architecture
Network Security
- VPC Isolation: Private network segments
- Security Groups: Granular access control
- WAF: Web application firewall
- DDoS Protection: Multi-layer defense
Data Security
- Encryption at Rest: AES-256 encryption
- Encryption in Transit: TLS 1.3
- Key Management: AWS KMS integration
- Access Logging: Comprehensive audit trails
Browser Security
- Sandboxing: Isolated browser environments
- Resource Limits: CPU/memory constraints
- Network Isolation: Separate proxy networks
- Clean State: Fresh browser per request
Performance Optimization
Caching Strategy
┌────────────┐ ┌────────────┐ ┌──────────────┐
│ Client │────▶│ CDN Cache │────▶│Solid Cache │
│ Cache │ │ (Global) │ │ (Local) │
└────────────┘ └────────────┘ └──────────────┘
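The layered lookup above is a read-through pattern: check each cache in order, fall back to origin on a full miss, and backfill the layers on the way out. A minimal sketch, using plain hashes as stand-ins for the cache layers:

```ruby
# layers: fastest-to-slowest list of cache stores (hash-like).
# origin: callable that fetches the value on a full miss.
def read_through(key, layers, origin)
  layers.each_with_index do |layer, i|
    if (value = layer[key])
      # Hit: backfill the faster layers that missed.
      layers[0...i].each { |l| l[key] = value }
      return value
    end
  end
  # Full miss: fetch from origin and populate every layer.
  value = origin.call(key)
  layers.each { |l| l[key] = value }
  value
end
```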
Resource Management
- Browser Pooling: Pre-warmed browsers
- Connection Reuse: HTTP/2 multiplexing
- Lazy Loading: On-demand resource loading
- Compression: Brotli/gzip compression
Monitoring & Observability
Metrics Collection
- Application Metrics: Response times, error rates
- Infrastructure Metrics: CPU, memory, network
- Business Metrics: Usage patterns, success rates
Distributed Tracing
Request ID: abc-123
├─ API Gateway (2ms)
├─ Authentication (1ms)
├─ Queue Insert (3ms)
├─ Worker Processing (2500ms)
│ ├─ Browser Launch (500ms)
│ ├─ Page Load (1500ms)
│ └─ Content Extract (500ms)
└─ Response Delivery (5ms)
Total: 2511ms
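The arithmetic in the trace can be checked directly: the total is the sum of the top-level spans, and a parent span should account for its children. With the numbers copied from the trace above:

```ruby
# Top-level spans from the example trace (milliseconds).
TOP_LEVEL_SPANS_MS = {
  "API Gateway"       => 2,
  "Authentication"    => 1,
  "Queue Insert"      => 3,
  "Worker Processing" => 2500,
  "Response Delivery" => 5
}.freeze

# Child spans nested under Worker Processing.
WORKER_CHILD_SPANS_MS = {
  "Browser Launch"  => 500,
  "Page Load"       => 1500,
  "Content Extract" => 500
}.freeze

def trace_total_ms
  TOP_LEVEL_SPANS_MS.values.sum
end
```

As expected, the top-level spans sum to 2511 ms, and the worker's children account for its full 2500 ms.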
Best Practices for Users
1. Optimize Request Patterns
- Batch similar requests
- Use webhooks for async processing
- Implement client-side caching
2. Choose Appropriate Options
- Select minimal wait strategies
- Use specific selectors
- Enable compression
3. Handle Failures Gracefully
- Implement retry logic
- Use exponential backoff
- Monitor error patterns
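A client-side retry wrapper combining points 3's practices might look like the sketch below. The delay is injectable (`sleeper`) so the behavior is testable; parameter defaults are illustrative:

```ruby
# Retry a block with exponential backoff. Re-raises the last error
# once max_attempts is exhausted.
def with_retries(max_attempts: 4, base_delay: 1.0, sleeper: ->(s) { sleep(s) })
  attempt = 0
  begin
    yield
  rescue StandardError
    attempt += 1
    raise if attempt >= max_attempts
    sleeper.call(base_delay * (2**(attempt - 1)))
    retry
  end
end
```

In a real client you would also restrict this to retryable errors (timeouts, 429s, 5xx) and log each attempt so error patterns can be monitored.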
Future Architecture Plans
Upcoming Enhancements
- Edge Computing: Process closer to data sources
- ML Pipeline: Intelligent content extraction
- GraphQL API: Flexible data queries
- WebSocket Streaming: Real-time data updates
Experimental Features
- Distributed browser farms
- P2P proxy networks
- Blockchain-based authentication
- Quantum-resistant encryption
Conclusion
ActiCrawl's architecture is designed to deliver reliable, scalable, and fast web scraping. Understanding these components will help you optimize your integration and get the most out of the platform.