Batch Processing

Process multiple URLs efficiently with ActiCrawl's batch processing capabilities. Learn how to scrape thousands of pages, handle rate limits, and optimize performance at scale.

Overview

Batch processing allows you to:
- Scrape multiple URLs concurrently
- Manage rate limits automatically
- Handle failures gracefully
- Monitor progress in real-time
- Optimize resource usage
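
The snippet below is a minimal sketch of how these capabilities combine in a single call. It uses the Python client and the parameter names shown in the examples later on this page (concurrency, rate_limit, on_partial_success); treat the exact values as placeholders to tune for your own workload.

python
# Minimal sketch combining options documented later on this page.
# Values are placeholders; adjust to your account and target-site limits.
urls = [
    'https://example.com/page1',
    'https://example.com/page2',
]

results = client.batch_scrape(
    urls=urls,
    concurrency=5,                  # scrape several URLs at once
    rate_limit={'requests_per_second': 2, 'per_domain': True},
    on_partial_success=True         # keep whatever succeeded instead of failing the whole batch
)

for result in results:
    print(result['url'], result['status'])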

Basic Batch Scraping

Simple Batch Request

Process multiple URLs with a single request:

javascript
const results = await client.batchScrape({
  urls: [
    'https://example.com/page1',
    'https://example.com/page2',
    'https://example.com/page3'
  ],
  format: 'markdown'
});

// Results array maintains URL order
results.forEach((result, index) => {
  console.log(`URL ${index + 1}:`, result.url);
  console.log('Status:', result.status);
  console.log('Content:', result.content);
});

Batch with Extraction

Apply the same extraction rules to multiple URLs:

python
urls = [
    'https://shop.com/product/1',
    'https://shop.com/product/2',
    'https://shop.com/product/3'
]

results = client.batch_scrape(
    urls=urls,
    extract={
        'name': 'h1.product-name',
        'price': '.price',
        'availability': '.stock-status',
        'image': 'img.main-image@src'
    }
)

# Process results
for result in results:
    if result['status'] == 'success':
        print(f"Product: {result['extracted']['name']}")
        print(f"Price: {result['extracted']['price']}")

Advanced Batch Configuration

Concurrency Control

Optimize performance with concurrent requests:

javascript
const results = await client.batchScrape({
  urls: productUrls,
  concurrency: 5,  // Process 5 URLs simultaneously
  delay: 1000,     // 1 second delay between requests
  retries: 3,      // Retry failed requests up to 3 times
  timeout: 30000   // 30 second timeout per URL
});

Rate Limiting

Respect target website limits:

python
results = client.batch_scrape(
    urls=urls,
    rate_limit={
        'requests_per_second': 2,
        'burst': 5,  # Allow burst of 5 requests
        'per_domain': True  # Apply limits per domain
    }
)

Batch with Headers

Custom headers for all requests:

javascript
const results = await client.batchScrape({
  urls: urls,
  headers: {
    'User-Agent': 'MyBot/1.0',
    'Accept-Language': 'en-US',
    'X-Custom-Header': 'value'
  },
  cookies: {
    'session': 'abc123',
    'preferences': 'lang=en'
  }
});

Batch Crawling

Crawl Multiple Domains

Process entire websites in parallel:

python
crawl_configs = [
    {
        'start_url': 'https://site1.com',
        'max_pages': 100,
        'include_patterns': ['/products/*']
    },
    {
        'start_url': 'https://site2.com',
        'max_pages': 50,
        'include_patterns': ['/blog/*']
    }
]

results = client.batch_crawl(crawl_configs)

Dynamic URL Generation

Generate URLs programmatically:

javascript
// Generate pagination URLs
const baseUrl = 'https://example.com/products';
const urls = Array.from({length: 50}, (_, i) => 
  `${baseUrl}?page=${i + 1}`
);

// Generate category URLs
const categories = ['electronics', 'books', 'clothing'];
const categoryUrls = categories.flatMap(cat => 
  Array.from({length: 10}, (_, i) => 
    `${baseUrl}/${cat}?page=${i + 1}`
  )
);

const results = await client.batchScrape({
  urls: [...urls, ...categoryUrls],
  concurrency: 10
});

Error Handling and Retries

Comprehensive Error Management

javascript
const results = await client.batchScrape({
  urls: urls,
  retry: {
    attempts: 3,
    delay: 2000,  // Base delay in ms, grows exponentially on each retry
    on: ['timeout', 'network', '5xx']  // Retry conditions
  },
  onError: 'continue'  // Don't stop on errors
});

// Process results with error handling
results.forEach(result => {
  if (result.status === 'success') {
    processSuccessfulResult(result);
  } else {
    logError({
      url: result.url,
      error: result.error,
      attempts: result.attempts
    });
  }
});

Partial Results Handling

python
def process_batch_with_fallback(urls):
    results = client.batch_scrape(
        urls=urls,
        on_partial_success=True  # Return partial results
    )

    successful = []
    failed = []

    for result in results:
        if result['status'] == 'success':
            successful.append(result)
        else:
            failed.append(result['url'])

    # Retry failed URLs with different settings
    if failed:
        retry_results = client.batch_scrape(
            urls=failed,
            javascript=True,  # Try with JS rendering
            timeout=60000
        )
        successful.extend([r for r in retry_results if r['status'] == 'success'])  # keep only retries that succeeded

    return successful

Progress Monitoring

Real-time Progress Tracking

javascript
const batch = client.createBatch({
  urls: urls,
  onProgress: (progress) => {
    console.log(`Progress: ${progress.completed}/${progress.total}`);
    console.log(`Success rate: ${progress.successRate}%`);
    console.log(`ETA: ${progress.estimatedTimeRemaining}s`);
  }
});

// Start batch processing
const results = await batch.run();

Webhook Notifications

python
# Configure webhook for batch completion
batch_job = client.batch_scrape(
    urls=urls,
    webhook={
        'url': 'https://myapp.com/webhook/batch-complete',
        'events': ['complete', 'error', 'progress'],
        'headers': {
            'Authorization': 'Bearer token123'
        }
    },
    **{'async': True}  # Non-blocking execution ('async' is a reserved word in Python, so pass it via **kwargs)
)

print(f"Batch job ID: {batch_job['id']}")
print(f"Status URL: {batch_job['status_url']}")

Performance Optimization

Memory-Efficient Processing

javascript
// Stream results instead of loading all in memory
const stream = client.batchStream({
  urls: urls,
  concurrency: 20
});

stream.on('data', (result) => {
  // Process each result as it arrives
  saveToDatabase(result);
});

stream.on('end', () => {
  console.log('Batch processing complete');
});

stream.on('error', (error) => {
  console.error('Batch error:', error);
});

Chunked Processing

python
import time

def process_large_batch(urls, chunk_size=100):
    """Process large URL lists in chunks"""
    total_results = []

    for i in range(0, len(urls), chunk_size):
        chunk = urls[i:i + chunk_size]

        print(f"Processing chunk {i//chunk_size + 1}")

        results = client.batch_scrape(
            urls=chunk,
            concurrency=10
        )

        total_results.extend(results)

        # Save intermediate results
        save_checkpoint(total_results, i)

        # Rate limit between chunks
        time.sleep(5)

    return total_results

Data Export Options

Export to Different Formats

javascript
const results = await client.batchScrape({
  urls: urls,
  export: {
    format: 'csv',  // 'json', 'csv', 'excel', 'parquet'
    path: './exports/batch_results.csv',
    fields: ['url', 'title', 'price', 'timestamp'],
    compress: true  // Creates .csv.gz
  }
});

Streaming Exports

python
# Stream results directly to file
import json

with client.batch_scrape_stream(urls) as stream:
    with open('results.jsonl', 'w') as f:
        for result in stream:
            f.write(json.dumps(result) + '\n')

Best Practices

1. URL Deduplication

javascript
// Remove duplicate URLs before processing
const uniqueUrls = [...new Set(urls)];

// Or with normalization
const normalizedUrls = urls
  .map(url => new URL(url).href)
  .filter((url, index, self) => self.indexOf(url) === index);

2. Batch Size Optimization

python
def optimal_batch_size(total_urls):
    """Calculate optimal batch size based on total URLs"""
    if total_urls < 100:
        return total_urls
    elif total_urls < 1000:
        return 50
    elif total_urls < 10000:
        return 100
    else:
        return 200
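
Wiring this heuristic into the chunked-processing helper from the Performance Optimization section might look like the short sketch below; load_urls is a placeholder for your own URL source.

python
# Hypothetical wiring of the two helpers defined earlier on this page.
urls = load_urls()                          # placeholder for your own URL source
chunk_size = optimal_batch_size(len(urls))  # pick a chunk size from the heuristic above
results = process_large_batch(urls, chunk_size=chunk_size)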

3. Resource Management

javascript
// Clean up resources after batch processing
const batch = client.createBatch({ urls });

try {
  const results = await batch.run();
  return results;
} finally {
  await batch.cleanup();  // Release resources
}

4. Monitoring and Alerting

python
import time

def batch_with_monitoring(urls):
    """Batch processing with monitoring"""
    start_time = time.time()

    results = client.batch_scrape(
        urls=urls,
        on_complete=lambda r: send_metrics({
            'total_urls': len(urls),
            'successful': len([x for x in r if x['status'] == 'success']),
            'duration': time.time() - start_time,
            'success_rate': calculate_success_rate(r)
        })
    )

    # Alert on low success rate
    success_rate = calculate_success_rate(results)
    if success_rate < 0.8:
        send_alert(f"Low success rate: {success_rate}")

    return results

Common Use Cases

E-commerce Price Monitoring

javascript
async function monitorPrices(productUrls) {
  const results = await client.batchScrape({
    urls: productUrls,
    extract: {
      name: 'h1.product-title',
      price: '.current-price',
      originalPrice: '.original-price',
      availability: '.stock-status'
    },
    concurrency: 10
  });

  // Compare with previous prices
  const priceChanges = results
    .filter(r => r.status === 'success')
    .map(r => ({
      ...r.extracted,
      url: r.url,
      priceChange: calculatePriceChange(r.extracted.price)
    }))
    .filter(p => p.priceChange !== 0);

  return priceChanges;
}

Content Aggregation

python
def aggregate_news_articles(news_sites):
    """Aggregate articles from multiple news sites"""

    urls = []
    for site in news_sites:
        # Get article URLs from each site
        homepage = client.scrape(site['url'])
        article_urls = extract_article_urls(homepage, site['selector'])
        urls.extend(article_urls[:10])  # Top 10 from each

    # Batch scrape all articles
    articles = client.batch_scrape(
        urls=urls,
        extract={
            'title': 'h1',
            'author': '.author-name',
            'date': 'time@datetime',
            'content': 'article.content',
            'category': '.category'
        }
    )

    return process_articles(articles)

Troubleshooting

Common Issues

1. Memory Issues with Large Batches
   - Use streaming instead of loading all results into memory
   - Process URLs in chunks
   - Increase Node.js memory: node --max-old-space-size=4096
2. Rate Limit Errors
   - Reduce concurrency
   - Add delays between requests
   - Implement exponential backoff (see the sketch below)
3. Timeout Issues
   - Increase timeout values
   - Reduce batch size
   - Check network connectivity
4. Inconsistent Results
   - Enable JavaScript rendering for dynamic content
   - Add random delays to avoid detection
   - Rotate user agents (see the sketch below)
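
For rate-limit errors and inconsistent results, a client-side retry wrapper is often enough. The sketch below is one possible approach rather than a built-in ActiCrawl feature: it retries failed URLs with exponentially growing pauses and a rotating User-Agent header. The headers option is only shown for the JavaScript client above, so treat its Python form and the user-agent strings as assumptions.

python
import random
import time

# Sketch of manual exponential backoff with user-agent rotation.
# Assumes the Python client accepts a headers option analogous to the
# JavaScript example earlier on this page; user-agent strings are placeholders.
USER_AGENTS = [
    'MyBot/1.0 (+https://myapp.com/bot)',
    'MyBot/1.1 (+https://myapp.com/bot)',
]

def scrape_with_backoff(urls, max_attempts=3, base_delay=2):
    pending = list(urls)
    collected = []

    for attempt in range(max_attempts):
        results = client.batch_scrape(
            urls=pending,
            concurrency=2,  # keep concurrency low when rate limits are the problem
            headers={'User-Agent': random.choice(USER_AGENTS)},
        )
        collected.extend(r for r in results if r['status'] == 'success')
        pending = [r['url'] for r in results if r['status'] != 'success']

        if not pending:
            break
        time.sleep(base_delay * (2 ** attempt))  # wait 2s, 4s, 8s, ...

    return collected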

Next Steps