
Architecture

Understanding ActiCrawl's architecture helps you make the most of our platform and optimize your web scraping workflows.

System Overview

ActiCrawl is built on a distributed, cloud-native architecture designed for reliability, scalability, and performance.

```text
┌─────────────────┐     ┌─────────────────┐     ┌─────────────────┐
│   Client Apps   │────▶│   API Gateway   │────▶│  Load Balancer  │
└─────────────────┘     └─────────────────┘     └─────────────────┘
                                                          │
                                ┌─────────────────────────┴─────────────────────────┐
                                │                                                   │
                        ┌───────▼────────┐                                 ┌────────▼───────┐
                        │ Authentication │                                 │  Rate Limiter  │
                        │    Service     │                                 │    Service     │
                        └────────────────┘                                 └────────────────┘
                                │                                                   │
                                └─────────────────┬─────────────────────────────────┘
                                                  │
                                         ┌────────▼────────┐
                                          │   Task Queue    │
                                          │  (Solid Queue)  │
                                         └────────┬────────┘
                                                  │
                        ┌─────────────────────────┴─────────────────────────┐
                        │                                                   │
                ┌───────▼────────┐                                 ┌────────▼───────┐
                │ Scraper Worker │                                 │ Scraper Worker │
                │   Pool (n)     │                                 │   Pool (n+1)   │
                └────────────────┘                                 └────────────────┘
                        │                                                   │
                        └─────────────────┬─────────────────────────────────┘
                                          │
                                ┌─────────▼─────────┐
                                │  Data Processing  │
                                │     Pipeline      │
                                └─────────┬─────────┘
                                          │
                        ┌─────────────────┴─────────────────────┐
                        │                                       │
                ┌───────▼────────┐                     ┌────────▼───────┐
                │ Object Storage │                     │   Database     │
                │     (S3)       │                     │  (PostgreSQL)  │
                └────────────────┘                     └────────────────┘
```

Core Components

1. API Gateway

The entry point for all client requests. Responsibilities include:
- Request routing and load balancing
- SSL termination
- Request/response transformation
- API versioning

Technology Stack:
- NGINX as the reverse proxy
- Kong for API management
- Cloudflare for DDoS protection
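
To make the flow concrete, here is a minimal client call as it passes through the gateway. The base URL, endpoint, and request fields are illustrative placeholders, not the actual ActiCrawl API surface:

```python
import requests

# Hypothetical endpoint and payload -- consult the API reference for the
# real ones. The gateway terminates TLS, authenticates the key, uses the
# version prefix (/v1) for routing, and forwards to a backend service.
response = requests.post(
    "https://api.acticrawl.example/v1/scrape",
    headers={"Authorization": "Bearer YOUR_API_KEY"},
    json={"url": "https://example.com", "format": "markdown"},
    timeout=30,
)
response.raise_for_status()
print(response.json())
```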

2. Authentication Service

Handles all authentication and authorization:
- API key validation
- JWT token management
- Permission checking
- Usage tracking

Features:
- Sub-millisecond authentication
- Distributed session management
- Role-based access control (RBAC)
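
As a rough sketch of what a token check involves, the snippet below validates a JWT and applies an RBAC rule. It uses PyJWT and invented claim names purely for illustration; the actual service implementation is internal:

```python
import time
import jwt  # PyJWT; an illustrative stand-in for the internal service

SECRET = "server-side-secret"  # in production this lives in a KMS, never in code

def authenticate(token: str) -> dict:
    """Validate a JWT (signature and expiry) and return its claims."""
    claims = jwt.decode(token, SECRET, algorithms=["HS256"])
    # RBAC: only roles permitted to scrape may proceed.
    if claims.get("role") not in {"admin", "scraper"}:
        raise PermissionError("role not allowed")
    return claims

# Round-trip a token to exercise the check.
token = jwt.encode(
    {"sub": "user-1", "role": "scraper", "exp": time.time() + 300},
    SECRET,
    algorithm="HS256",
)
print(authenticate(token))
```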

3. Task Queue System

Manages asynchronous job processing:
- Job prioritization based on plan tier
- Retry logic with exponential backoff
- Dead letter queue for failed jobs
- Real-time job status updates

Technology:
- SQLite/PostgreSQL for queue management
- Solid Queue for job processing
- ActionCable for real-time updates
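
Solid Queue handles retries natively; the sketch below simply illustrates the policy described above, jittered exponential backoff with a hand-off to the dead letter queue after the final attempt:

```python
import random
import time

def run_with_backoff(job, max_attempts=5, base_delay=1.0, cap=60.0):
    """Run a job, retrying failures with jittered exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return job()
        except Exception:
            if attempt == max_attempts:
                raise  # surfaced to the dead letter queue
            delay = min(cap, base_delay * 2 ** (attempt - 1))
            time.sleep(delay * random.uniform(0.5, 1.5))  # jitter avoids thundering herds
```

With these defaults the nominal delays are 1s, 2s, 4s, and 8s (before jitter), after which the job is dead-lettered.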

4. Scraper Workers

The heart of our scraping engine:
- Headless browser management (Chrome/Firefox)
- JavaScript rendering
- Cookie and session handling
- Anti-detection measures

Key Features:
- Dynamic worker scaling
- Browser fingerprint randomization
- Automatic proxy rotation
- Resource optimization
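
A heavily simplified worker might look like the following Playwright sketch. The user agent and proxy lists are placeholders, and the production engine layers fingerprint randomization and stealth measures well beyond this:

```python
import random
from playwright.sync_api import sync_playwright

# Placeholder pools; real workers draw from much larger, managed sets.
USER_AGENTS = ["Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"]
PROXIES = ["http://proxy-1.internal:8080", "http://proxy-2.internal:8080"]

def scrape(url: str) -> str:
    with sync_playwright() as p:
        browser = p.chromium.launch(
            headless=True,
            proxy={"server": random.choice(PROXIES)},  # proxy rotation
        )
        context = browser.new_context(user_agent=random.choice(USER_AGENTS))
        page = context.new_page()
        page.goto(url, wait_until="networkidle")  # let JavaScript settle
        html = page.content()  # fully rendered HTML
        browser.close()
        return html
```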

5. Data Processing Pipeline

Transforms raw scraped data:
- HTML parsing and cleaning
- Content extraction
- Format conversion (Markdown, JSON, etc.)
- AI-powered content enhancement

Processing Steps:
1. Raw HTML collection
2. JavaScript execution (if needed)
3. Content extraction
4. Format transformation
5. Quality validation
6. Compression and storage
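
In spirit, the middle steps reduce to a chain of small transforms. The sketch below uses BeautifulSoup as an illustrative parser; the production pipeline is internal and considerably more involved:

```python
from bs4 import BeautifulSoup  # illustrative parser choice

def extract(html: str) -> str:
    """Step 3: drop non-content tags and pull readable text."""
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup(["script", "style", "nav", "footer"]):
        tag.decompose()
    return soup.get_text(separator="\n", strip=True)

def validate(text: str) -> str:
    """Step 5: reject extractions too small to be a rendered page."""
    if len(text) < 50:
        raise ValueError("suspiciously small extraction")
    return text

def process(rendered_html: str) -> str:
    # Steps 1-2 happen in the worker; step 6 happens at the storage layer.
    return validate(extract(rendered_html))
```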

6. Storage Layer

Distributed storage for reliability:
- Object Storage (S3): Screenshots, raw HTML
- Database (PostgreSQL): Metadata, user data, analytics
- Cache (Solid Cache): Frequently accessed data
- CDN: Global content delivery
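
The division of labor is roughly: bulky artifacts go to object storage compressed, while queryable metadata lands in PostgreSQL. A boto3-flavored sketch, with the bucket name and key layout invented for illustration:

```python
import gzip
import boto3  # assumes S3-compatible object storage

s3 = boto3.client("s3")

def store_artifact(job_id: str, html: str) -> str:
    """Compress raw HTML into object storage and return its key."""
    key = f"results/{job_id}/page.html.gz"  # illustrative key layout
    s3.put_object(
        Bucket="acticrawl-artifacts",  # illustrative bucket name
        Key=key,
        Body=gzip.compress(html.encode("utf-8")),
        ContentEncoding="gzip",
    )
    # The corresponding metadata row (status, timings, key) is written
    # to PostgreSQL, keeping large blobs out of the database.
    return key
```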

Request Lifecycle

1. Request Initiation

```text
Client → API Gateway → Authentication → Rate Limiting → Task Queue
```

2. Task Processing

```text
Task Queue → Worker Selection → Browser Launch → Page Load → Content Extraction
```

3. Response Delivery

```text
Data Processing → Storage → Response Formation → Client Delivery
```
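
From the client's perspective the whole lifecycle reduces to submit-then-collect. The endpoints and field names below are illustrative; webhooks (covered under best practices) avoid the polling loop entirely:

```python
import time
import requests

API = "https://api.acticrawl.example/v1"  # illustrative base URL
HEADERS = {"Authorization": "Bearer YOUR_API_KEY"}

# Request initiation: the job enters the queue and an id comes back.
job = requests.post(
    f"{API}/scrape", headers=HEADERS,
    json={"url": "https://example.com"}, timeout=30,
).json()

# Task processing happens asynchronously; poll until delivery.
while True:
    status = requests.get(f"{API}/jobs/{job['id']}", headers=HEADERS, timeout=30).json()
    if status["state"] in ("completed", "failed"):
        break
    time.sleep(2)

print(status)
```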

Scaling Strategy

Horizontal Scaling

  • Worker Pools: Automatically scale based on queue depth (see the sketch after this list)
  • Database Replication: Read replicas for query distribution
  • Cache Clustering: Solid Cache cluster for high availability
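
As a toy model of the queue-depth rule, worker count tracks the backlog between a floor and a ceiling; the real policy also weighs plan-tier priority and throughput history:

```python
def desired_workers(queue_depth: int, jobs_per_worker: int = 10,
                    min_workers: int = 2, max_workers: int = 100) -> int:
    """Illustrative autoscaling rule: one worker per N queued jobs."""
    target = -(-queue_depth // jobs_per_worker)  # ceiling division
    return max(min_workers, min(max_workers, target))

assert desired_workers(0) == 2       # never below the floor
assert desired_workers(250) == 25    # 250 queued jobs -> 25 workers
assert desired_workers(5000) == 100  # capped at the pool maximum
```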

Vertical Scaling

  • Resource Allocation: Dynamic CPU/memory allocation per job
  • Browser Optimization: Lightweight browser configurations
  • Connection Pooling: Efficient resource utilization

High Availability

Multi-Region Deployment

```text
┌──────────────┐     ┌──────────────┐     ┌──────────────┐
│  US-East-1   │────▶│  EU-West-1   │────▶│ AP-Southeast │
│   Primary    │     │   Replica    │     │   Replica    │
└──────────────┘     └──────────────┘     └──────────────┘
```

Failover Strategy

  1. Active-Active: Load balanced across regions
  2. Health Checks: Continuous monitoring
  3. Automatic Failover: Sub-second switchover
  4. Data Consistency: Eventually consistent model

Security Architecture

Network Security

  • VPC Isolation: Private network segments
  • Security Groups: Granular access control
  • WAF: Web application firewall
  • DDoS Protection: Multi-layer defense

Data Security

  • Encryption at Rest: AES-256 encryption
  • Encryption in Transit: TLS 1.3
  • Key Management: AWS KMS integration
  • Access Logging: Comprehensive audit trails

Browser Security

  • Sandboxing: Isolated browser environments
  • Resource Limits: CPU/memory constraints
  • Network Isolation: Separate proxy networks
  • Clean State: Fresh browser per request

Performance Optimization

Caching Strategy

```text
┌────────────┐     ┌────────────┐     ┌──────────────┐
│   Client   │────▶│ CDN Cache  │────▶│ Solid Cache  │
│   Cache    │     │  (Global)  │     │   (Local)    │
└────────────┘     └────────────┘     └──────────────┘
```

Resource Management

  • Browser Pooling: Pre-warmed browsers
  • Connection Reuse: HTTP/2 multiplexing
  • Lazy Loading: On-demand resource loading
  • Compression: Brotli/gzip compression

Monitoring & Observability

Metrics Collection

  • Application Metrics: Response times, error rates
  • Infrastructure Metrics: CPU, memory, network
  • Business Metrics: Usage patterns, success rates

Distributed Tracing

```text
Request ID: abc-123
├─ API Gateway (2ms)
├─ Authentication (1ms)
├─ Queue Insert (3ms)
├─ Worker Processing (2500ms)
│  ├─ Browser Launch (500ms)
│  ├─ Page Load (1500ms)
│  └─ Content Extract (500ms)
└─ Response Delivery (5ms)
Total: 2511ms
```

Best Practices for Users

1. Optimize Request Patterns

  • Batch similar requests
  • Use webhooks for async processing
  • Implement client-side caching (sketched at the end of this section)

2. Choose Appropriate Options

  • Select minimal wait strategies
  • Use specific selectors
  • Enable compression

3. Handle Failures Gracefully

  • Implement retry logic
  • Use exponential backoff
  • Monitor error patterns
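
Putting the first and third practices together, a thin client wrapper can cache recent results and reuse the backoff pattern shown earlier. The cache here is a deliberately minimal in-process sketch:

```python
import time

_cache: dict[str, tuple[float, dict]] = {}

def cached_fetch(url: str, fetch, ttl: float = 300.0) -> dict:
    """Serve a recent result from memory before spending an API call."""
    now = time.time()
    hit = _cache.get(url)
    if hit and now - hit[0] < ttl:
        return hit[1]
    result = fetch(url)  # wrap this call in retry-with-backoff
    _cache[url] = (now, result)
    return result
```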

Future Architecture Plans

Upcoming Enhancements

  1. Edge Computing: Process closer to data sources
  2. ML Pipeline: Intelligent content extraction
  3. GraphQL API: Flexible data queries
  4. WebSocket Streaming: Real-time data updates

Experimental Features

  • Distributed browser farms
  • P2P proxy networks
  • Blockchain-based authentication
  • Quantum-resistant encryption

Conclusion

ActiCrawl's architecture is designed to deliver reliable, scalable, and fast web scraping. Understanding these components helps you optimize your integration and get the most out of the platform.