
Architecture

Understanding ActiCrawl's architecture helps you make the most of our platform and optimize your web scraping workflows.

System Overview

ActiCrawl is built on a distributed, cloud-native architecture designed for reliability, scalability, and performance.

```text
┌─────────────────┐     ┌─────────────────┐     ┌─────────────────┐
│   Client Apps   │────▶│   API Gateway   │────▶│  Load Balancer  │
└─────────────────┘     └─────────────────┘     └─────────────────┘
                                                          │
                                ┌─────────────────────────┴─────────────────────────┐
                                │                                                   │
                        ┌───────▼────────┐                                 ┌────────▼───────┐
                        │ Authentication │                                 │  Rate Limiter  │
                        │    Service     │                                 │    Service     │
                        └────────────────┘                                 └────────────────┘
                                │                                                   │
                                └─────────────────┬─────────────────────────────────┘
                                                  │
                                         ┌────────▼────────┐
                                          │   Task Queue    │
                                          │  (Solid Queue)  │
                                         └────────┬────────┘
                                                  │
                        ┌─────────────────────────┴─────────────────────────┐
                        │                                                   │
                ┌───────▼────────┐                                 ┌────────▼───────┐
                │ Scraper Worker │                                 │ Scraper Worker │
                │   Pool (n)     │                                 │   Pool (n+1)   │
                └────────────────┘                                 └────────────────┘
                        │                                                   │
                        └─────────────────┬─────────────────────────────────┘
                                          │
                                ┌─────────▼─────────┐
                                │  Data Processing  │
                                │     Pipeline      │
                                └─────────┬─────────┘
                                          │
                        ┌─────────────────┴─────────────────────┐
                        │                                       │
                ┌───────▼────────┐                     ┌────────▼───────┐
                │ Object Storage │                     │   Database     │
                │     (S3)       │                     │  (PostgreSQL)  │
                └────────────────┘                     └────────────────┘
```

Core Components

1. API Gateway

The entry point for all client requests. Responsibilities include:
- Request routing and load balancing
- SSL termination
- Request/response transformation
- API versioning

Technology Stack:
- NGINX as the reverse proxy
- Kong for API management
- Cloudflare for DDoS protection
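
To make the flow concrete, here is a minimal client call as it passes through the gateway. The base URL, endpoint, and request fields are illustrative placeholders, not the actual ActiCrawl API surface:

```python
import requests

# Hypothetical endpoint and payload -- consult the API reference for the
# real ones. The gateway terminates TLS, authenticates the key, uses the
# version prefix (/v1) for routing, and forwards to a backend service.
response = requests.post(
    "https://api.acticrawl.example/v1/scrape",
    headers={"Authorization": "Bearer YOUR_API_KEY"},
    json={"url": "https://example.com", "format": "markdown"},
    timeout=30,
)
response.raise_for_status()
print(response.json())
```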

2. Authentication Service

Handles all authentication and authorization:
- API key validation
- JWT token management
- Permission checking
- Usage tracking

Features:
- Sub-millisecond authentication
- Distributed session management
- Role-based access control (RBAC)
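
As a rough sketch of what a token check involves, the snippet below validates a JWT and applies an RBAC rule. It uses PyJWT and invented claim names purely for illustration; the actual service implementation is internal:

```python
import time
import jwt  # PyJWT; an illustrative stand-in for the internal service

SECRET = "server-side-secret"  # in production this lives in a KMS, never in code

def authenticate(token: str) -> dict:
    """Validate a JWT (signature and expiry) and return its claims."""
    claims = jwt.decode(token, SECRET, algorithms=["HS256"])
    # RBAC: only roles permitted to scrape may proceed.
    if claims.get("role") not in {"admin", "scraper"}:
        raise PermissionError("role not allowed")
    return claims

# Round-trip a token to exercise the check.
token = jwt.encode(
    {"sub": "user-1", "role": "scraper", "exp": time.time() + 300},
    SECRET,
    algorithm="HS256",
)
print(authenticate(token))
```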

3. Task Queue System

Manages asynchronous job processing:
- Job prioritization based on plan tier
- Retry logic with exponential backoff
- Dead letter queue for failed jobs
- Real-time job status updates

Technology:
- SQLite/PostgreSQL for queue management
- Solid Queue for job processing
- ActionCable for real-time updates
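
Solid Queue handles retries natively; the sketch below simply illustrates the policy described above, jittered exponential backoff with a hand-off to the dead letter queue after the final attempt:

```python
import random
import time

def run_with_backoff(job, max_attempts=5, base_delay=1.0, cap=60.0):
    """Run a job, retrying failures with jittered exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return job()
        except Exception:
            if attempt == max_attempts:
                raise  # surfaced to the dead letter queue
            delay = min(cap, base_delay * 2 ** (attempt - 1))
            time.sleep(delay * random.uniform(0.5, 1.5))  # jitter avoids thundering herds
```

With these defaults the nominal delays are 1s, 2s, 4s, and 8s (before jitter), after which the job is dead-lettered.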

4. Scraper Workers

The heart of our scraping engine:
- Headless browser management (Chrome/Firefox)
- JavaScript rendering
- Cookie and session handling
- Anti-detection measures

Key Features:
- Dynamic worker scaling
- Browser fingerprint randomization
- Automatic proxy rotation
- Resource optimization
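
A heavily simplified worker might look like the following Playwright sketch. The user agent and proxy lists are placeholders, and the production engine layers fingerprint randomization and stealth measures well beyond this:

```python
import random
from playwright.sync_api import sync_playwright

# Placeholder pools; real workers draw from much larger, managed sets.
USER_AGENTS = ["Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"]
PROXIES = ["http://proxy-1.internal:8080", "http://proxy-2.internal:8080"]

def scrape(url: str) -> str:
    with sync_playwright() as p:
        browser = p.chromium.launch(
            headless=True,
            proxy={"server": random.choice(PROXIES)},  # proxy rotation
        )
        context = browser.new_context(user_agent=random.choice(USER_AGENTS))
        page = context.new_page()
        page.goto(url, wait_until="networkidle")  # let JavaScript settle
        html = page.content()  # fully rendered HTML
        browser.close()
        return html
```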

5. Data Processing Pipeline

Transforms raw scraped data:
- HTML parsing and cleaning
- Content extraction
- Format conversion (Markdown, JSON, etc.)
- AI-powered content enhancement

Processing Steps:
1. Raw HTML collection
2. JavaScript execution (if needed)
3. Content extraction
4. Format transformation
5. Quality validation
6. Compression and storage
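
In spirit, the middle steps reduce to a chain of small transforms. The sketch below uses BeautifulSoup as an illustrative parser; the production pipeline is internal and considerably more involved:

```python
from bs4 import BeautifulSoup  # illustrative parser choice

def extract(html: str) -> str:
    """Step 3: drop non-content tags and pull readable text."""
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup(["script", "style", "nav", "footer"]):
        tag.decompose()
    return soup.get_text(separator="\n", strip=True)

def validate(text: str) -> str:
    """Step 5: reject extractions too small to be a rendered page."""
    if len(text) < 50:
        raise ValueError("suspiciously small extraction")
    return text

def process(rendered_html: str) -> str:
    # Steps 1-2 happen in the worker; step 6 happens at the storage layer.
    return validate(extract(rendered_html))
```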

6. Storage Layer

Distributed storage for reliability:
- Object Storage (S3): Screenshots, raw HTML
- Database (PostgreSQL): Metadata, user data, analytics
- Cache (Solid Cache): Frequently accessed data
- CDN: Global content delivery
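
The division of labor is roughly: bulky artifacts go to object storage compressed, while queryable metadata lands in PostgreSQL. A boto3-flavored sketch, with the bucket name and key layout invented for illustration:

```python
import gzip
import boto3  # assumes S3-compatible object storage

s3 = boto3.client("s3")

def store_artifact(job_id: str, html: str) -> str:
    """Compress raw HTML into object storage and return its key."""
    key = f"results/{job_id}/page.html.gz"  # illustrative key layout
    s3.put_object(
        Bucket="acticrawl-artifacts",  # illustrative bucket name
        Key=key,
        Body=gzip.compress(html.encode("utf-8")),
        ContentEncoding="gzip",
    )
    # The corresponding metadata row (status, timings, key) is written
    # to PostgreSQL, keeping large blobs out of the database.
    return key
```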

Request Lifecycle

1. Request Initiation

```text
Client → API Gateway → Authentication → Rate Limiting → Task Queue
```

2. Task Processing

```text
Task Queue → Worker Selection → Browser Launch → Page Load → Content Extraction
```

3. Response Delivery

```text
Data Processing → Storage → Response Formation → Client Delivery
```
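
From the client's perspective the whole lifecycle reduces to submit-then-collect. The endpoints and field names below are illustrative; webhooks (covered under best practices) avoid the polling loop entirely:

```python
import time
import requests

API = "https://api.acticrawl.example/v1"  # illustrative base URL
HEADERS = {"Authorization": "Bearer YOUR_API_KEY"}

# Request initiation: the job enters the queue and an id comes back.
job = requests.post(
    f"{API}/scrape", headers=HEADERS,
    json={"url": "https://example.com"}, timeout=30,
).json()

# Task processing happens asynchronously; poll until delivery.
while True:
    status = requests.get(f"{API}/jobs/{job['id']}", headers=HEADERS, timeout=30).json()
    if status["state"] in ("completed", "failed"):
        break
    time.sleep(2)

print(status)
```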

Scaling Strategy

Horizontal Scaling

  • Worker Pools: Automatically scale based on queue depth (see the sketch after this list)
  • Database Replication: Read replicas for query distribution
  • Cache Clustering: Solid Cache cluster for high availability
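
As a toy model of the queue-depth rule, worker count tracks the backlog between a floor and a ceiling; the real policy also weighs plan-tier priority and throughput history:

```python
def desired_workers(queue_depth: int, jobs_per_worker: int = 10,
                    min_workers: int = 2, max_workers: int = 100) -> int:
    """Illustrative autoscaling rule: one worker per N queued jobs."""
    target = -(-queue_depth // jobs_per_worker)  # ceiling division
    return max(min_workers, min(max_workers, target))

assert desired_workers(0) == 2       # never below the floor
assert desired_workers(250) == 25    # 250 queued jobs -> 25 workers
assert desired_workers(5000) == 100  # capped at the pool maximum
```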

Vertical Scaling

  • Resource Allocation: Dynamic CPU/memory allocation per job
  • Browser Optimization: Lightweight browser configurations
  • Connection Pooling: Efficient resource utilization

High Availability

Multi-Region Deployment

```text
┌──────────────┐     ┌──────────────┐     ┌──────────────┐
│  US-East-1   │────▶│  EU-West-1   │────▶│ AP-Southeast │
│   Primary    │     │   Replica    │     │   Replica    │
└──────────────┘     └──────────────┘     └──────────────┘
```

Failover Strategy

  1. Active-Active: Load balanced across regions
  2. Health Checks: Continuous monitoring
  3. Automatic Failover: Sub-second switchover
  4. Data Consistency: Eventually consistent model

Security Architecture

Network Security

  • VPC Isolation: Private network segments
  • Security Groups: Granular access control
  • WAF: Web application firewall
  • DDoS Protection: Multi-layer defense

Data Security

  • Encryption at Rest: AES-256 encryption
  • Encryption in Transit: TLS 1.3
  • Key Management: AWS KMS integration
  • Access Logging: Comprehensive audit trails

Browser Security

  • Sandboxing: Isolated browser environments
  • Resource Limits: CPU/memory constraints
  • Network Isolation: Separate proxy networks
  • Clean State: Fresh browser per request

Performance Optimization

Caching Strategy

```text
┌────────────┐     ┌────────────┐     ┌──────────────┐
│   Client   │────▶│ CDN Cache  │────▶│ Solid Cache  │
│   Cache    │     │  (Global)  │     │   (Local)    │
└────────────┘     └────────────┘     └──────────────┘
```

Resource Management

  • Browser Pooling: Pre-warmed browsers
  • Connection Reuse: HTTP/2 multiplexing
  • Lazy Loading: On-demand resource loading
  • Compression: Brotli/gzip compression

Monitoring & Observability

Metrics Collection

  • Application Metrics: Response times, error rates
  • Infrastructure Metrics: CPU, memory, network
  • Business Metrics: Usage patterns, success rates

Distributed Tracing

```text
Request ID: abc-123
├─ API Gateway (2ms)
├─ Authentication (1ms)
├─ Queue Insert (3ms)
├─ Worker Processing (2500ms)
│  ├─ Browser Launch (500ms)
│  ├─ Page Load (1500ms)
│  └─ Content Extract (500ms)
└─ Response Delivery (5ms)
Total: 2511ms
```

Best Practices for Users

1. Optimize Request Patterns

  • Batch similar requests
  • Use webhooks for async processing
  • Implement client-side caching (sketched at the end of this section)

2. Choose Appropriate Options

  • Select minimal wait strategies
  • Use specific selectors
  • Enable compression

3. Handle Failures Gracefully

  • Implement retry logic
  • Use exponential backoff
  • Monitor error patterns
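
Putting the first and third practices together, a thin client wrapper can cache recent results and reuse the backoff pattern shown earlier. The cache here is a deliberately minimal in-process sketch:

```python
import time

_cache: dict[str, tuple[float, dict]] = {}

def cached_fetch(url: str, fetch, ttl: float = 300.0) -> dict:
    """Serve a recent result from memory before spending an API call."""
    now = time.time()
    hit = _cache.get(url)
    if hit and now - hit[0] < ttl:
        return hit[1]
    result = fetch(url)  # wrap this call in retry-with-backoff
    _cache[url] = (now, result)
    return result
```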

Future Architecture Plans

Upcoming Enhancements

  1. Edge Computing: Process closer to data sources
  2. ML Pipeline: Intelligent content extraction
  3. GraphQL API: Flexible data queries
  4. WebSocket Streaming: Real-time data updates

Experimental Features

  • Distributed browser farms
  • P2P proxy networks
  • Blockchain-based authentication
  • Quantum-resistant encryption

Conclusion

ActiCrawl's architecture is designed to deliver reliable, scalable, and fast web scraping. Understanding these components helps you optimize your integration and get the most out of the platform.