Web Crawling
Learn how to effectively crawl websites with ActiCrawl's powerful crawling engine that handles modern JavaScript-heavy sites, dynamic content, and complex navigation patterns.
Understanding Web Crawling
Web crawling is the process of systematically browsing and extracting data from websites. ActiCrawl provides a sophisticated crawling engine that can:
- Navigate through multiple pages automatically
- Handle JavaScript-rendered content
- Follow links and discover new pages
- Respect robots.txt and crawl delays
- Manage sessions and authentication
Single Page Scraping
For scraping a single page, use the basic scrape endpoint:
const result = await client.scrape({
url: 'https://example.com/page',
format: 'markdown'
});
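The JavaScript examples on this page assume an already-initialized client. A minimal setup sketch, assuming the SDK is published as an acticrawl npm package with a constructor mirroring the Python client shown later (both the package name and the constructor shape are assumptions):

import { ActiCrawl } from 'acticrawl'; // assumed package name

// Assumed constructor shape; check your SDK version for the exact signature
const client = new ActiCrawl({ apiKey: process.env.ACTICRAWL_API_KEY });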
Multi-Page Crawling
Following Links
Automatically follow and crawl links within a domain:
const crawler = await client.crawl({
startUrl: 'https://example.com',
followLinks: true,
maxPages: 100,
linkSelector: 'a[href]', // CSS selector for links to follow
sameDomain: true, // Only follow links on the same domain
format: 'json'
});
// Process crawled pages
crawler.on('page', (page) => {
console.log(`Crawled: ${page.url}`);
console.log(`Found ${page.links.length} links`);
});
crawler.on('complete', (results) => {
console.log(`Crawled ${results.length} pages`);
});
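If you prefer async/await over event listeners, you can wrap the crawler's events in a Promise. A sketch, assuming the emitter-style crawler and the page/error shapes shown above:

interface CrawlPage { url: string; links: string[]; }
interface CrawlerEmitter {
  on(event: 'page', cb: (page: CrawlPage) => void): void;
  on(event: 'complete', cb: (results: CrawlPage[]) => void): void;
  on(event: 'error', cb: (err: { url: string; message: string }) => void): void;
}

// Resolves with all crawled pages, or rejects on the first reported error.
function waitForCrawl(crawler: CrawlerEmitter): Promise<CrawlPage[]> {
  return new Promise((resolve, reject) => {
    crawler.on('complete', (results) => resolve(results));
    crawler.on('error', (err) => reject(new Error(`${err.url}: ${err.message}`)));
  });
}

// Usage: const pages = await waitForCrawl(crawler);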
URL Patterns
Define specific URL patterns to crawl:
from acticrawl import ActiCrawl
client = ActiCrawl(api_key='YOUR_API_KEY')
crawler = client.create_crawler({
'start_url': 'https://example.com/products',
'url_patterns': [
r'^https://example\.com/products/[\w-]+$', # Product pages
r'^https://example\.com/category/[\w-]+$' # Category pages
],
'max_pages': 500
})
results = crawler.run()
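Before launching a large crawl, it can help to sanity-check your patterns against a few known URLs. A standalone sketch using plain regular expressions (no SDK required; the sample URLs are made up):

// The same patterns as above, written as JavaScript RegExp literals.
const patterns = [
  /^https:\/\/example\.com\/products\/[\w-]+$/, // product pages
  /^https:\/\/example\.com\/category\/[\w-]+$/, // category pages
];

const samples = [
  'https://example.com/products/blue-widget',
  'https://example.com/cart',
];

for (const url of samples) {
  const matched = patterns.some((re) => re.test(url));
  console.log(`${url} -> ${matched ? 'crawl' : 'skip'}`);
}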
Sitemap Crawling
Efficiently crawl sites using their sitemap:
crawler = client.crawl_sitemap(
sitemap_url: 'https://example.com/sitemap.xml',
filter: ->(url) { url.include?('/blog/') }, # Only crawl blog posts
concurrency: 5,
format: 'markdown'
)
crawler.each do |page|
puts "Title: #{page['metadata']['title']}"
puts "Content: #{page['content']}"
end
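To preview which sitemap entries a filter would keep, you can fetch and parse the sitemap yourself before starting the crawl. A standalone sketch using the global fetch API (Node 18+ or a browser) and a simple <loc> extraction; this is an illustration, not ActiCrawl's internal sitemap handling:

async function previewSitemap(sitemapUrl: string): Promise<string[]> {
  const xml = await (await fetch(sitemapUrl)).text();
  // Pull every <loc>...</loc> entry out of the sitemap XML.
  const urls = [...xml.matchAll(/<loc>\s*([^<]+?)\s*<\/loc>/g)].map((m) => m[1]);
  // Keep only blog posts, mirroring the filter above.
  return urls.filter((url) => url.includes('/blog/'));
}

previewSitemap('https://example.com/sitemap.xml')
  .then((urls) => console.log(`Would crawl ${urls.length} URLs`, urls.slice(0, 5)));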
Advanced Crawling Strategies
Depth-First vs Breadth-First
Choose your crawling strategy based on site structure:
// Breadth-first (default) - Good for discovering all pages at each level
const bfsCrawler = await client.crawl({
startUrl: 'https://example.com',
strategy: 'breadth-first',
maxDepth: 3
});
// Depth-first - Good for following specific paths deeply
const dfsCrawler = await client.crawl({
startUrl: 'https://example.com',
strategy: 'depth-first',
maxDepth: 10
});
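The practical difference is how the URL frontier is ordered: breadth-first uses a queue (first in, first out), while depth-first uses a stack (last in, first out). A simplified standalone sketch of the resulting visit order (illustrative only, not ActiCrawl's internals):

type LinkGraph = Record<string, string[]>;

// Returns the order in which URLs would be visited for a given strategy.
function visitOrder(graph: LinkGraph, start: string,
                    strategy: 'breadth-first' | 'depth-first'): string[] {
  const frontier = [start];
  const visited = new Set<string>();
  const order: string[] = [];
  while (frontier.length > 0) {
    // Queue behaviour for BFS (shift from the front), stack behaviour for DFS (pop from the back).
    const url = strategy === 'breadth-first' ? frontier.shift()! : frontier.pop()!;
    if (visited.has(url)) continue;
    visited.add(url);
    order.push(url);
    frontier.push(...(graph[url] ?? []));
  }
  return order;
}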
Pagination Handling
Automatically handle paginated content:
crawler = client.create_crawler({
'start_url': 'https://example.com/listings?page=1',
'pagination': {
'type': 'query_param',
'param': 'page',
'max_pages': 50,
'increment': 1
}
})
# Or paginate by clicking a "next" button

crawler = client.create_crawler({
'start_url': 'https://example.com/results',
'pagination': {
'type': 'click',
'selector': 'button.next-page',
'wait_after_click': 2000,
'max_clicks': 20
}
})
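For query-parameter pagination, the crawler effectively visits one URL per page by incrementing the parameter. A standalone sketch of that expansion, using the same settings as above:

// Expand ?page=1 into ?page=1, ?page=2, ... up to maxPages URLs.
function paginatedUrls(startUrl: string, param: string,
                       maxPages: number, increment = 1): string[] {
  const urls: string[] = [];
  const url = new URL(startUrl);
  const first = Number(url.searchParams.get(param) ?? '1');
  for (let i = 0; i < maxPages; i++) {
    url.searchParams.set(param, String(first + i * increment));
    urls.push(url.toString());
  }
  return urls;
}

console.log(paginatedUrls('https://example.com/listings?page=1', 'page', 3));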
Infinite Scroll
Handle infinite scroll pages:
const result = await client.scrape({
url: 'https://example.com/feed',
infinite_scroll: {
enabled: true,
max_scrolls: 10,
scroll_delay: 1000, // Wait 1s between scrolls
element_count_selector: '.feed-item', // Stop when no new items
timeout: 30000
}
});
Session Management
Maintaining Login State
Crawl authenticated areas:
// First, login
const session = await client.createSession({
loginUrl: 'https://example.com/login',
credentials: {
username: 'user@example.com',
password: 'password'
},
selectors: {
username: '#username',
password: '#password',
submit: 'button[type="submit"]'
}
});
// Then crawl with session
const crawler = await client.crawl({
startUrl: 'https://example.com/dashboard',
sessionId: session.id,
followLinks: true
});
Cookie Management
Use existing cookies for crawling:
cookies = [
{'name': 'session_id', 'value': 'abc123', 'domain': 'example.com'},
{'name': 'auth_token', 'value': 'xyz789', 'domain': 'example.com'}
]
crawler = client.create_crawler({
'start_url': 'https://example.com/protected',
'cookies': cookies,
'follow_links': True
})
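If the cookies come from a browser session, for example copied from a Cookie request header, a small helper can convert them into the name/value/domain objects shown above (the object shape follows the example; adapt it if your SDK expects extra fields):

interface Cookie { name: string; value: string; domain: string; }

// Turn "session_id=abc123; auth_token=xyz789" into cookie objects.
function parseCookieHeader(header: string, domain: string): Cookie[] {
  return header
    .split(';')
    .map((pair) => pair.trim())
    .filter(Boolean)
    .map((pair) => {
      const eq = pair.indexOf('=');
      return { name: pair.slice(0, eq), value: pair.slice(eq + 1), domain };
    });
}

console.log(parseCookieHeader('session_id=abc123; auth_token=xyz789', 'example.com'));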
Crawl Rules and Filters
URL Filtering
Control which URLs to crawl:
const crawler = await client.crawl({
startUrl: 'https://example.com',
urlFilters: {
include: [
/\/products\//, // Include product pages
/\/blog\// // Include blog posts
],
exclude: [
/\/admin\//, // Exclude admin pages
/\.pdf$/, // Exclude PDF files
/#/ // Exclude URL fragments
]
}
});
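A common way to read these filters, assumed here for illustration, is that a URL is crawled only if it matches at least one include pattern and no exclude pattern:

function shouldCrawl(url: string, include: RegExp[], exclude: RegExp[]): boolean {
  const included = include.length === 0 || include.some((re) => re.test(url));
  const excluded = exclude.some((re) => re.test(url));
  return included && !excluded;
}

const include = [/\/products\//, /\/blog\//];
const exclude = [/\/admin\//, /\.pdf$/, /#/];

console.log(shouldCrawl('https://example.com/products/shoes', include, exclude));       // true
console.log(shouldCrawl('https://example.com/products/catalog.pdf', include, exclude)); // false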
Content Filtering
Filter pages based on content:
def should_process(page):
# Only process pages with specific content
return (
page['metadata'].get('language') == 'en' and
len(page['content']) > 1000 and
'product' in page['content'].lower()
)
crawler = client.create_crawler({
'start_url': 'https://example.com',
'content_filter': should_process
})
Performance Optimization
Concurrent Crawling
Speed up crawling with concurrency:
const crawler = await client.crawl({
startUrl: 'https://example.com',
concurrency: 10, // Crawl 10 pages simultaneously
requestDelay: 1000, // Wait 1s between requests
respectRobotsTxt: true
});
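Conceptually, concurrency caps how many pages are fetched at once and requestDelay spaces out each worker's requests. A simplified standalone worker-pool sketch (illustrative only, not ActiCrawl's scheduler):

const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

async function crawlWithLimit(urls: string[], concurrency: number, requestDelay: number,
                              handler: (url: string) => Promise<void>): Promise<void> {
  const queue = [...urls];
  // Start `concurrency` workers that each pull URLs off the shared queue.
  const workers = Array.from({ length: concurrency }, async () => {
    while (queue.length > 0) {
      const url = queue.shift()!;
      await handler(url);
      await sleep(requestDelay); // politeness delay between this worker's requests
    }
  });
  await Promise.all(workers);
}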
Caching
Avoid re-crawling unchanged pages:
crawler = client.create_crawler({
'start_url': 'https://example.com',
'cache': {
'enabled': True,
'ttl': 86400, # Cache for 24 hours
'key_by': 'url_and_content_hash'
}
})
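A 'url_and_content_hash' key implies the cache entry is invalidated whenever the page content changes. A standalone sketch of such a key using Node's crypto module (the exact key format ActiCrawl uses is not documented here; this is an illustration):

import { createHash } from 'node:crypto';

function cacheKey(url: string, content: string): string {
  // Short content fingerprint: same URL + same content => same key.
  const contentHash = createHash('sha256').update(content).digest('hex').slice(0, 16);
  return `${url}#${contentHash}`;
}

console.log(cacheKey('https://example.com/pricing', '<html>...</html>'));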
Selective Extraction
Extract only the data you need to reduce processing:
const crawler = await client.crawl({
startUrl: 'https://example.com/catalog',
extract: {
title: 'h1',
price: '.price',
description: '.product-description',
image: 'img.product-image@src'
},
skipFullContent: true // Don't store full HTML
});
Error Handling and Retries
Automatic Retries
Configure retry behavior:
crawler = client.create_crawler({
'start_url': 'https://example.com',
'retry': {
'max_attempts': 3,
'delay': 2000,
'exponential_backoff': True,
'on_errors': [429, 500, 502, 503, 504]
}
})
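With exponential backoff, the wait between attempts doubles each time (2s, 4s, 8s, ...). A standalone retry helper sketch using the same defaults as the configuration above:

async function withRetry<T>(fn: () => Promise<T>, maxAttempts = 3, delayMs = 2000): Promise<T> {
  for (let attempt = 1; ; attempt++) {
    try {
      return await fn();
    } catch (err) {
      if (attempt >= maxAttempts) throw err;
      // 2000 ms, 4000 ms, 8000 ms, ...
      const wait = delayMs * 2 ** (attempt - 1);
      await new Promise((resolve) => setTimeout(resolve, wait));
    }
  }
}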
Error Recovery
Continue crawling after errors:
const crawler = await client.crawl({
startUrl: 'https://example.com',
errorHandling: {
continueOnError: true,
maxConsecutiveErrors: 5,
errorLog: true
}
});
crawler.on('error', (error) => {
console.error(`Error crawling ${error.url}: ${error.message}`);
// Custom error handling
});
Monitoring and Progress
Real-time Progress
Track crawling progress:
crawler = client.create_crawler({
'start_url': 'https://example.com',
'progress_callback': lambda p: print(f"Progress: {p['crawled']}/{p['total']}")
})
# Or use webhooks
crawler = client.create_crawler({
'start_url': 'https://example.com',
'webhook': {
'url': 'https://your-app.com/crawl-progress',
'events': ['page_crawled', 'error', 'complete']
}
})
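On your side, the webhook URL only needs to accept POST requests. A minimal receiver sketch using Node's built-in http module (the payload fields are assumptions; inspect a real delivery for the exact shape):

import { createServer } from 'node:http';

createServer((req, res) => {
  let body = '';
  req.on('data', (chunk) => (body += chunk));
  req.on('end', () => {
    // e.g. { event: 'page_crawled', url: '...' } -- assumed payload shape
    console.log('Crawl event:', JSON.parse(body || '{}'));
    res.statusCode = 204;
    res.end();
  });
}).listen(3000, () => console.log('Webhook receiver listening on port 3000'));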
Crawl Statistics
Get detailed statistics:
const stats = await crawler.getStats();
console.log({
totalPages: stats.totalPages,
successfulPages: stats.successful,
failedPages: stats.failed,
averageResponseTime: stats.avgResponseTime,
totalDataExtracted: stats.dataSize
});
Best Practices
1. Respect Website Policies
- Always check and respect robots.txt (a simplified check is sketched after this list)
- Implement appropriate delays between requests
- Use reasonable concurrency limits
2. Optimize Selectors
- Use specific CSS selectors for better performance
- Avoid overly broad selectors that match too many elements
- Test selectors before large crawls
3. Handle Dynamic Content
- Wait for content to load before extraction
- Use appropriate wait strategies for JavaScript
- Consider screenshot validation for critical pages
4. Monitor Resource Usage
- Set maximum page limits
- Implement timeouts for long-running crawls
- Use webhooks for long-running crawls instead of polling
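As a starting point for the first practice, here is a simplified standalone sketch that fetches robots.txt and checks a path against its Disallow rules (it ignores User-agent sections, wildcards, and Allow rules, so treat it as an illustration rather than a full parser):

async function isAllowed(siteUrl: string, path: string): Promise<boolean> {
  const robotsUrl = new URL('/robots.txt', siteUrl).toString();
  const text = await (await fetch(robotsUrl)).text();
  // Collect every Disallow prefix, regardless of which User-agent block it belongs to.
  const disallows = text
    .split('\n')
    .map((line) => line.trim())
    .filter((line) => line.toLowerCase().startsWith('disallow:'))
    .map((line) => line.slice('disallow:'.length).trim())
    .filter(Boolean);
  return !disallows.some((rule) => path.startsWith(rule));
}

isAllowed('https://example.com', '/admin/users')
  .then((ok) => console.log(ok ? 'allowed' : 'disallowed'));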
Example: E-commerce Site Crawler
A complete example that crawls an e-commerce site:
async function crawlEcommerceSite() {
const crawler = await client.crawl({
startUrl: 'https://shop.example.com',
// Crawling rules
urlFilters: {
include: [/\/products\//, /\/category\//],
exclude: [/\/cart/, /\/checkout/, /\/account/]
},
// Performance settings
concurrency: 5,
maxPages: 1000,
requestDelay: 2000,
// Extraction rules
extract: {
title: 'h1.product-title',
price: '.price-now',
originalPrice: '.price-was',
description: '.product-description',
images: 'img.product-image@src',
inStock: '.availability',
reviews: {
selector: '.review',
multiple: true,
extract: {
rating: '.rating@data-rating',
text: '.review-text',
author: '.review-author'
}
}
},
// Error handling
retry: {
maxAttempts: 3,
delay: 5000
},
// Progress tracking
webhook: {
url: 'https://your-app.com/crawl-webhook',
events: ['page_crawled', 'complete', 'error']
}
});
return crawler.run();
}
Next Steps
- Learn about Data Extraction techniques
- Explore JavaScript Rendering options
- Read about Proxy Usage for geo-targeted crawling
- Understand Rate Limiting best practices