Batch Processing
Process multiple URLs efficiently with ActiCrawl's batch processing capabilities. Learn how to scrape thousands of pages, handle rate limits, and optimize performance at scale.
Overview
Batch processing allows you to:
- Scrape multiple URLs concurrently
- Manage rate limits automatically
- Handle failures gracefully
- Monitor progress in real-time
- Optimize resource usage
Basic Batch Scraping
Simple Batch Request
Process multiple URLs with a single request:
const results = await client.batchScrape({
  urls: [
    'https://example.com/page1',
    'https://example.com/page2',
    'https://example.com/page3'
  ],
  format: 'markdown'
});

// Results array maintains URL order
results.forEach((result, index) => {
  console.log(`URL ${index + 1}:`, result.url);
  console.log('Status:', result.status);
  console.log('Content:', result.content);
});
Batch with Extraction
Apply the same extraction rules to multiple URLs:
urls = [
    'https://shop.com/product/1',
    'https://shop.com/product/2',
    'https://shop.com/product/3'
]

results = client.batch_scrape(
    urls=urls,
    extract={
        'name': 'h1.product-name',
        'price': '.price',
        'availability': '.stock-status',
        'image': 'img.main-image@src'
    }
)

# Process results
for result in results:
    if result['status'] == 'success':
        print(f"Product: {result['extracted']['name']}")
        print(f"Price: {result['extracted']['price']}")
Advanced Batch Configuration
Concurrency Control
Optimize performance with concurrent requests:
const results = await client.batchScrape({
  urls: productUrls,
  concurrency: 5,  // Process 5 URLs simultaneously
  delay: 1000,     // 1 second delay between requests
  retries: 3,      // Retry failed requests up to 3 times
  timeout: 30000   // 30 second timeout per URL
});
Rate Limiting
Respect target website limits:
results = client.batch_scrape(
    urls=urls,
    rate_limit={
        'requests_per_second': 2,
        'burst': 5,         # Allow burst of 5 requests
        'per_domain': True  # Apply limits per domain
    }
)
Batch with Headers
Custom headers for all requests:
const results = await client.batchScrape({
  urls: urls,
  headers: {
    'User-Agent': 'MyBot/1.0',
    'Accept-Language': 'en-US',
    'X-Custom-Header': 'value'
  },
  cookies: {
    'session': 'abc123',
    'preferences': 'lang=en'
  }
});
Batch Crawling
Crawl Multiple Domains
Process entire websites in parallel:
crawl_configs = [
    {
        'start_url': 'https://site1.com',
        'max_pages': 100,
        'include_patterns': ['/products/*']
    },
    {
        'start_url': 'https://site2.com',
        'max_pages': 50,
        'include_patterns': ['/blog/*']
    }
]

results = client.batch_crawl(crawl_configs)
Dynamic URL Generation
Generate URLs programmatically:
// Generate pagination URLs
const baseUrl = 'https://example.com/products';
const urls = Array.from({length: 50}, (_, i) =>
  `${baseUrl}?page=${i + 1}`
);

// Generate category URLs
const categories = ['electronics', 'books', 'clothing'];
const categoryUrls = categories.flatMap(cat =>
  Array.from({length: 10}, (_, i) =>
    `${baseUrl}/${cat}?page=${i + 1}`
  )
);

const results = await client.batchScrape({
  urls: [...urls, ...categoryUrls],
  concurrency: 10
});
Error Handling and Retries
Comprehensive Error Management
const results = await client.batchScrape({
  urls: urls,
  retry: {
    attempts: 3,
    delay: 2000,                       // Initial delay in ms; backs off exponentially on each retry
    on: ['timeout', 'network', '5xx']  // Retry conditions
  },
  onError: 'continue' // Don't stop on errors
});

// Process results with error handling
results.forEach(result => {
  if (result.status === 'success') {
    processSuccessfulResult(result);
  } else {
    logError({
      url: result.url,
      error: result.error,
      attempts: result.attempts
    });
  }
});
Partial Results Handling
def process_batch_with_fallback(urls):
    results = client.batch_scrape(
        urls=urls,
        on_partial_success=True  # Return partial results
    )

    successful = []
    failed = []

    for result in results:
        if result['status'] == 'success':
            successful.append(result)
        else:
            failed.append(result['url'])

    # Retry failed URLs with different settings
    if failed:
        retry_results = client.batch_scrape(
            urls=failed,
            javascript=True,  # Try with JS rendering
            timeout=60000
        )
        # Keep only the retries that actually succeeded
        successful.extend(r for r in retry_results if r['status'] == 'success')

    return successful
Progress Monitoring
Real-time Progress Tracking
const batch = client.createBatch({
  urls: urls,
  onProgress: (progress) => {
    console.log(`Progress: ${progress.completed}/${progress.total}`);
    console.log(`Success rate: ${progress.successRate}%`);
    console.log(`ETA: ${progress.estimatedTimeRemaining}s`);
  }
});

// Start batch processing
const results = await batch.run();
Webhook Notifications
# Configure webhook for batch completion
batch_job = client.batch_scrape(
    urls=urls,
    webhook={
        'url': 'https://myapp.com/webhook/batch-complete',
        'events': ['complete', 'error', 'progress'],
        'headers': {
            'Authorization': 'Bearer token123'
        }
    },
    # 'async' is a reserved word in Python 3, so pass it via dict unpacking
    **{'async': True}  # Non-blocking execution
)

print(f"Batch job ID: {batch_job['id']}")
print(f"Status URL: {batch_job['status_url']}")
Performance Optimization
Memory-Efficient Processing
// Stream results instead of loading all in memory
const stream = client.batchStream({
  urls: urls,
  concurrency: 20
});

stream.on('data', (result) => {
  // Process each result as it arrives
  saveToDatabase(result);
});

stream.on('end', () => {
  console.log('Batch processing complete');
});

stream.on('error', (error) => {
  console.error('Batch error:', error);
});
Chunked Processing
import time

def process_large_batch(urls, chunk_size=100):
    """Process large URL lists in chunks"""
    total_results = []

    for i in range(0, len(urls), chunk_size):
        chunk = urls[i:i + chunk_size]
        print(f"Processing chunk {i // chunk_size + 1}")

        results = client.batch_scrape(
            urls=chunk,
            concurrency=10
        )
        total_results.extend(results)

        # Save intermediate results
        save_checkpoint(total_results, i)

        # Rate limit between chunks
        time.sleep(5)

    return total_results
Data Export Options
Export to Different Formats
const results = await client.batchScrape({
  urls: urls,
  export: {
    format: 'csv',   // 'json', 'csv', 'excel', 'parquet'
    path: './exports/batch_results.csv',
    fields: ['url', 'title', 'price', 'timestamp'],
    compress: true   // Creates .csv.gz
  }
});
Streaming Exports
import json

# Stream results directly to file
with client.batch_scrape_stream(urls) as stream:
    with open('results.jsonl', 'w') as f:
        for result in stream:
            f.write(json.dumps(result) + '\n')
Best Practices
1. URL Deduplication
// Remove duplicate URLs before processing
const uniqueUrls = [...new Set(urls)];

// Or with normalization
const normalizedUrls = urls
  .map(url => new URL(url).href)
  .filter((url, index, self) => self.indexOf(url) === index);
2. Batch Size Optimization
def optimal_batch_size(total_urls):
    """Calculate optimal batch size based on total URLs"""
    if total_urls < 100:
        return total_urls
    elif total_urls < 1000:
        return 50
    elif total_urls < 10000:
        return 100
    else:
        return 200
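This helper pairs naturally with the chunked processing function from the Performance Optimization section above, for example:

# Choose a chunk size based on the total workload, then process in chunks
chunk_size = optimal_batch_size(len(urls))
results = process_large_batch(urls, chunk_size=chunk_size)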
3. Resource Management
// Clean up resources after batch processing
const batch = client.createBatch({ urls });

try {
  const results = await batch.run();
  return results;
} finally {
  await batch.cleanup(); // Release resources
}
4. Monitoring and Alerting
def batch_with_monitoring(urls):
    """Batch processing with monitoring"""
    start_time = time.time()

    results = client.batch_scrape(
        urls=urls,
        on_complete=lambda r: send_metrics({
            'total_urls': len(urls),
            'successful': len([x for x in r if x['status'] == 'success']),
            'duration': time.time() - start_time,
            'success_rate': calculate_success_rate(r)
        })
    )

    # Alert on low success rate
    success_rate = calculate_success_rate(results)
    if success_rate < 0.8:
        send_alert(f"Low success rate: {success_rate}")

    return results
Common Use Cases
E-commerce Price Monitoring
async function monitorPrices(productUrls) {
  const results = await client.batchScrape({
    urls: productUrls,
    extract: {
      name: 'h1.product-title',
      price: '.current-price',
      originalPrice: '.original-price',
      availability: '.stock-status'
    },
    concurrency: 10
  });

  // Compare with previous prices
  const priceChanges = results
    .filter(r => r.status === 'success')
    .map(r => ({
      ...r.extracted,
      url: r.url,
      priceChange: calculatePriceChange(r.extracted.price)
    }))
    .filter(p => p.priceChange !== 0);

  return priceChanges;
}
Content Aggregation
def aggregate_news_articles(news_sites):
    """Aggregate articles from multiple news sites"""
    urls = []

    for site in news_sites:
        # Get article URLs from each site
        homepage = client.scrape(site['url'])
        article_urls = extract_article_urls(homepage, site['selector'])
        urls.extend(article_urls[:10])  # Top 10 from each

    # Batch scrape all articles
    articles = client.batch_scrape(
        urls=urls,
        extract={
            'title': 'h1',
            'author': '.author-name',
            'date': 'time@datetime',
            'content': 'article.content',
            'category': '.category'
        }
    )

    return process_articles(articles)
Troubleshooting
Common Issues
Memory Issues with Large Batches
- Use streaming instead of loading all results in memory
- Process in chunks
- Increase Node.js memory:
node --max-old-space-size=4096
Rate Limit Errors
- Reduce concurrency
- Add delays between requests
- Implement exponential backoff (see the sketch below)
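A minimal backoff sketch, assuming the Python client returns per-URL results with 'status' and 'url' fields as in the earlier examples:

import time

def scrape_with_backoff(urls, max_attempts=3):
    """Retry failed URLs with exponentially increasing delays and lower concurrency."""
    pending = list(urls)
    collected = []

    for attempt in range(max_attempts):
        results = client.batch_scrape(
            urls=pending,
            concurrency=max(1, 5 - 2 * attempt),  # reduce concurrency on each attempt
            rate_limit={'requests_per_second': 2, 'per_domain': True}
        )
        collected.extend(r for r in results if r['status'] == 'success')
        pending = [r['url'] for r in results if r['status'] != 'success']

        if not pending:
            break
        time.sleep(2 ** attempt)  # 1s, 2s, 4s, ... between attempts

    return collected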
Timeout Issues
- Increase timeout values
- Reduce batch size
- Check network connectivity
Inconsistent Results
- Enable JavaScript rendering for dynamic content
- Add random delays to avoid detection
- Rotate user agents (the sketch below combines these three tactics)
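A minimal sketch combining these tactics, assuming the Python client accepts a `headers` option like the Node.js example earlier (the user-agent strings and chunk size are illustrative):

import random
import time

USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64)',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)',
    'Mozilla/5.0 (X11; Linux x86_64)',
]

def scrape_with_rotation(urls, chunk_size=20):
    """Rotate user agents and pause for a random interval between chunks."""
    results = []
    for i in range(0, len(urls), chunk_size):
        chunk = urls[i:i + chunk_size]
        results.extend(client.batch_scrape(
            urls=chunk,
            javascript=True,                                    # render dynamic content
            headers={'User-Agent': random.choice(USER_AGENTS)}  # rotate user agents
        ))
        time.sleep(random.uniform(1, 5))  # random delay to avoid detection
    return results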
Next Steps
- Learn about Webhooks for async processing
- Explore JavaScript Rendering for dynamic sites
- Read about Error Handling strategies
- Check out Rate Limiting best practices