Web Crawling
Learn how to effectively crawl websites with ActiCrawl's powerful crawling engine that handles modern JavaScript-heavy sites, dynamic content, and complex navigation patterns.
Understanding Web Crawling
Web crawling is the process of systematically browsing and extracting data from websites. ActiCrawl provides a sophisticated crawling engine that can:
- Navigate through multiple pages automatically
- Handle JavaScript-rendered content
- Follow links and discover new pages
- Respect robots.txt and crawl delays
- Manage sessions and authentication
Single Page Scraping
For scraping a single page, use the basic scrape endpoint:
const result = await client.scrape({
url: 'https://example.com/page',
format: 'markdown'
});
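The JavaScript examples on this page assume an already-initialized client. A minimal setup sketch, assuming the SDK is published as an acticrawl npm package with a constructor mirroring the Python client shown later (both the package name and the constructor shape are assumptions):

import { ActiCrawl } from 'acticrawl'; // assumed package name

// Assumed constructor shape; check your SDK version for the exact signature
const client = new ActiCrawl({ apiKey: process.env.ACTICRAWL_API_KEY });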
Multi-Page Crawling
Following Links
Automatically follow and crawl links within a domain:
const crawler = await client.crawl({
startUrl: 'https://example.com',
followLinks: true,
maxPages: 100,
linkSelector: 'a[href]', // CSS selector for links to follow
sameDomain: true, // Only follow links on the same domain
format: 'json'
});
// Process crawled pages
crawler.on('page', (page) => {
console.log(`Crawled: ${page.url}`);
console.log(`Found ${page.links.length} links`);
});
crawler.on('complete', (results) => {
console.log(`Crawled ${results.length} pages`);
});
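If you prefer async/await over event listeners, you can wrap the crawler's events in a Promise. A sketch, assuming the emitter-style crawler and the page/error shapes shown above:

interface CrawlPage { url: string; links: string[]; }
interface CrawlerEmitter {
  on(event: 'page', cb: (page: CrawlPage) => void): void;
  on(event: 'complete', cb: (results: CrawlPage[]) => void): void;
  on(event: 'error', cb: (err: { url: string; message: string }) => void): void;
}

// Resolves with all crawled pages, or rejects on the first reported error.
function waitForCrawl(crawler: CrawlerEmitter): Promise<CrawlPage[]> {
  return new Promise((resolve, reject) => {
    crawler.on('complete', (results) => resolve(results));
    crawler.on('error', (err) => reject(new Error(`${err.url}: ${err.message}`)));
  });
}

// Usage: const pages = await waitForCrawl(crawler);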
URL Patterns
Define specific URL patterns to crawl:
from acticrawl import ActiCrawl
client = ActiCrawl(api_key='YOUR_API_KEY')
crawler = client.create_crawler({
'start_url': 'https://example.com/products',
'url_patterns': [
r'^https://example\.com/products/[\w-]+$', # Product pages
r'^https://example\.com/category/[\w-]+$' # Category pages
],
'max_pages': 500
})
results = crawler.run()
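Before launching a large crawl, it can help to sanity-check your patterns against a few known URLs. A standalone sketch using plain regular expressions (no SDK required; the sample URLs are made up):

// The same patterns as above, written as JavaScript RegExp literals.
const patterns = [
  /^https:\/\/example\.com\/products\/[\w-]+$/, // product pages
  /^https:\/\/example\.com\/category\/[\w-]+$/, // category pages
];

const samples = [
  'https://example.com/products/blue-widget',
  'https://example.com/cart',
];

for (const url of samples) {
  const matched = patterns.some((re) => re.test(url));
  console.log(`${url} -> ${matched ? 'crawl' : 'skip'}`);
}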
Sitemap Crawling
Efficiently crawl sites using their sitemap:
crawler = client.crawl_sitemap(
sitemap_url: 'https://example.com/sitemap.xml',
filter: ->(url) { url.include?('/blog/') }, # Only crawl blog posts
concurrency: 5,
format: 'markdown'
)
crawler.each do |page|
puts "Title: #{page['metadata']['title']}"
puts "Content: #{page['content']}"
end
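To preview which sitemap entries a filter would keep, you can fetch and parse the sitemap yourself before starting the crawl. A standalone sketch using the global fetch API (Node 18+ or a browser) and a simple <loc> extraction; this is an illustration, not ActiCrawl's internal sitemap handling:

async function previewSitemap(sitemapUrl: string): Promise<string[]> {
  const xml = await (await fetch(sitemapUrl)).text();
  // Pull every <loc>...</loc> entry out of the sitemap XML.
  const urls = [...xml.matchAll(/<loc>\s*([^<]+?)\s*<\/loc>/g)].map((m) => m[1]);
  // Keep only blog posts, mirroring the filter above.
  return urls.filter((url) => url.includes('/blog/'));
}

previewSitemap('https://example.com/sitemap.xml')
  .then((urls) => console.log(`Would crawl ${urls.length} URLs`, urls.slice(0, 5)));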
Advanced Crawling Strategies
Depth-First vs Breadth-First
Choose your crawling strategy based on site structure:
// Breadth-first (default) - Good for discovering all pages at each level
const bfsCrawler = await client.crawl({
startUrl: 'https://example.com',
strategy: 'breadth-first',
maxDepth: 3
});
// Depth-first - Good for following specific paths deeply
const dfsCrawler = await client.crawl({
startUrl: 'https://example.com',
strategy: 'depth-first',
maxDepth: 10
});
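The practical difference is how the URL frontier is ordered: breadth-first uses a queue (first in, first out), while depth-first uses a stack (last in, first out). A simplified standalone sketch of the resulting visit order (illustrative only, not ActiCrawl's internals):

type LinkGraph = Record<string, string[]>;

// Returns the order in which URLs would be visited for a given strategy.
function visitOrder(graph: LinkGraph, start: string,
                    strategy: 'breadth-first' | 'depth-first'): string[] {
  const frontier = [start];
  const visited = new Set<string>();
  const order: string[] = [];
  while (frontier.length > 0) {
    // Queue behaviour for BFS (shift from the front), stack behaviour for DFS (pop from the back).
    const url = strategy === 'breadth-first' ? frontier.shift()! : frontier.pop()!;
    if (visited.has(url)) continue;
    visited.add(url);
    order.push(url);
    frontier.push(...(graph[url] ?? []));
  }
  return order;
}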
Pagination Handling
Automatically handle paginated content:
crawler = client.create_crawler({
'start_url': 'https://example.com/listings?page=1',
'pagination': {
'type': 'query_param',
'param': 'page',
'max_pages': 50,
'increment': 1
}
})
# Or paginate by clicking a "next" button

crawler = client.create_crawler({
'start_url': 'https://example.com/results',
'pagination': {
'type': 'click',
'selector': 'button.next-page',
'wait_after_click': 2000,
'max_clicks': 20
}
})
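For query-parameter pagination, the crawler effectively visits one URL per page by incrementing the parameter. A standalone sketch of that expansion, using the same settings as above:

// Expand ?page=1 into ?page=1, ?page=2, ... up to maxPages URLs.
function paginatedUrls(startUrl: string, param: string,
                       maxPages: number, increment = 1): string[] {
  const urls: string[] = [];
  const url = new URL(startUrl);
  const first = Number(url.searchParams.get(param) ?? '1');
  for (let i = 0; i < maxPages; i++) {
    url.searchParams.set(param, String(first + i * increment));
    urls.push(url.toString());
  }
  return urls;
}

console.log(paginatedUrls('https://example.com/listings?page=1', 'page', 3));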
Infinite Scroll
Handle infinite scroll pages:
const result = await client.scrape({
url: 'https://example.com/feed',
infinite_scroll: {
enabled: true,
max_scrolls: 10,
scroll_delay: 1000, // Wait 1s between scrolls
element_count_selector: '.feed-item', // Stop when no new items
timeout: 30000
}
});
Session Management
Maintaining Login State
Crawl authenticated areas:
// First, login
const session = await client.createSession({
loginUrl: 'https://example.com/login',
credentials: {
username: 'user@example.com',
password: 'password'
},
selectors: {
username: '#username',
password: '#password',
submit: 'button[type="submit"]'
}
});
// Then crawl with session
const crawler = await client.crawl({
startUrl: 'https://example.com/dashboard',
sessionId: session.id,
followLinks: true
});
Cookie Management
Use existing cookies for crawling:
cookies = [
{'name': 'session_id', 'value': 'abc123', 'domain': 'example.com'},
{'name': 'auth_token', 'value': 'xyz789', 'domain': 'example.com'}
]
crawler = client.create_crawler({
'start_url': 'https://example.com/protected',
'cookies': cookies,
'follow_links': True
})
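If the cookies come from a browser session, for example copied from a Cookie request header, a small helper can convert them into the name/value/domain objects shown above (the object shape follows the example; adapt it if your SDK expects extra fields):

interface Cookie { name: string; value: string; domain: string; }

// Turn "session_id=abc123; auth_token=xyz789" into cookie objects.
function parseCookieHeader(header: string, domain: string): Cookie[] {
  return header
    .split(';')
    .map((pair) => pair.trim())
    .filter(Boolean)
    .map((pair) => {
      const eq = pair.indexOf('=');
      return { name: pair.slice(0, eq), value: pair.slice(eq + 1), domain };
    });
}

console.log(parseCookieHeader('session_id=abc123; auth_token=xyz789', 'example.com'));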
Crawl Rules and Filters
URL Filtering
Control which URLs to crawl:
const crawler = await client.crawl({
startUrl: 'https://example.com',
urlFilters: {
include: [
/\/products\//, // Include product pages
/\/blog\// // Include blog posts
],
exclude: [
/\/admin\//, // Exclude admin pages
/\.pdf$/, // Exclude PDF files
/#/ // Exclude URL fragments
]
}
});
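A common way to read these filters, assumed here for illustration, is that a URL is crawled only if it matches at least one include pattern and no exclude pattern:

function shouldCrawl(url: string, include: RegExp[], exclude: RegExp[]): boolean {
  const included = include.length === 0 || include.some((re) => re.test(url));
  const excluded = exclude.some((re) => re.test(url));
  return included && !excluded;
}

const include = [/\/products\//, /\/blog\//];
const exclude = [/\/admin\//, /\.pdf$/, /#/];

console.log(shouldCrawl('https://example.com/products/shoes', include, exclude));       // true
console.log(shouldCrawl('https://example.com/products/catalog.pdf', include, exclude)); // false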
Content Filtering
Filter pages based on content:
def should_process(page):
# Only process pages with specific content
return (
page['metadata'].get('language') == 'en' and
len(page['content']) > 1000 and
'product' in page['content'].lower()
)
crawler = client.create_crawler({
'start_url': 'https://example.com',
'content_filter': should_process
})
Performance Optimization
Concurrent Crawling
Speed up crawling with concurrency:
const crawler = await client.crawl({
startUrl: 'https://example.com',
concurrency: 10, // Crawl 10 pages simultaneously
requestDelay: 1000, // Wait 1s between requests
respectRobotsTxt: true
});
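Conceptually, concurrency caps how many pages are fetched at once and requestDelay spaces out each worker's requests. A simplified standalone worker-pool sketch (illustrative only, not ActiCrawl's scheduler):

const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

async function crawlWithLimit(urls: string[], concurrency: number, requestDelay: number,
                              handler: (url: string) => Promise<void>): Promise<void> {
  const queue = [...urls];
  // Start `concurrency` workers that each pull URLs off the shared queue.
  const workers = Array.from({ length: concurrency }, async () => {
    while (queue.length > 0) {
      const url = queue.shift()!;
      await handler(url);
      await sleep(requestDelay); // politeness delay between this worker's requests
    }
  });
  await Promise.all(workers);
}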
Caching
Avoid re-crawling unchanged pages:
crawler = client.create_crawler({
'start_url': 'https://example.com',
'cache': {
'enabled': True,
'ttl': 86400, # Cache for 24 hours
'key_by': 'url_and_content_hash'
}
})
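A 'url_and_content_hash' key implies the cache entry is invalidated whenever the page content changes. A standalone sketch of such a key using Node's crypto module (the exact key format ActiCrawl uses is not documented here; this is an illustration):

import { createHash } from 'node:crypto';

function cacheKey(url: string, content: string): string {
  // Short content fingerprint: same URL + same content => same key.
  const contentHash = createHash('sha256').update(content).digest('hex').slice(0, 16);
  return `${url}#${contentHash}`;
}

console.log(cacheKey('https://example.com/pricing', '<html>...</html>'));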
Selective Extraction
Extract only the data you need to reduce processing:
const crawler = await client.crawl({
startUrl: 'https://example.com/catalog',
extract: {
title: 'h1',
price: '.price',
description: '.product-description',
image: 'img.product-image@src'
},
skipFullContent: true // Don't store full HTML
});
Error Handling and Retries
Automatic Retries
Configure retry behavior:
crawler = client.create_crawler({
'start_url': 'https://example.com',
'retry': {
'max_attempts': 3,
'delay': 2000,
'exponential_backoff': True,
'on_errors': [429, 500, 502, 503, 504]
}
})
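With exponential backoff, the wait between attempts doubles each time (2s, 4s, 8s, ...). A standalone retry helper sketch using the same defaults as the configuration above:

async function withRetry<T>(fn: () => Promise<T>, maxAttempts = 3, delayMs = 2000): Promise<T> {
  for (let attempt = 1; ; attempt++) {
    try {
      return await fn();
    } catch (err) {
      if (attempt >= maxAttempts) throw err;
      // 2000 ms, 4000 ms, 8000 ms, ...
      const wait = delayMs * 2 ** (attempt - 1);
      await new Promise((resolve) => setTimeout(resolve, wait));
    }
  }
}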
Error Recovery
Continue crawling after errors:
const crawler = await client.crawl({
startUrl: 'https://example.com',
errorHandling: {
continueOnError: true,
maxConsecutiveErrors: 5,
errorLog: true
}
});
crawler.on('error', (error) => {
console.error(`Error crawling ${error.url}: ${error.message}`);
// Custom error handling
});
Monitoring and Progress
Real-time Progress
Track crawling progress:
crawler = client.create_crawler({
'start_url': 'https://example.com',
'progress_callback': lambda p: print(f"Progress: {p['crawled']}/{p['total']}")
})
# Or use webhooks
crawler = client.create_crawler({
'start_url': 'https://example.com',
'webhook': {
'url': 'https://your-app.com/crawl-progress',
'events': ['page_crawled', 'error', 'complete']
}
})
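On your side, the webhook URL only needs to accept POST requests. A minimal receiver sketch using Node's built-in http module (the payload fields are assumptions; inspect a real delivery for the exact shape):

import { createServer } from 'node:http';

createServer((req, res) => {
  let body = '';
  req.on('data', (chunk) => (body += chunk));
  req.on('end', () => {
    // e.g. { event: 'page_crawled', url: '...' } -- assumed payload shape
    console.log('Crawl event:', JSON.parse(body || '{}'));
    res.statusCode = 204;
    res.end();
  });
}).listen(3000, () => console.log('Webhook receiver listening on port 3000'));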
Crawl Statistics
Get detailed statistics:
const stats = await crawler.getStats();
console.log({
totalPages: stats.totalPages,
successfulPages: stats.successful,
failedPages: stats.failed,
averageResponseTime: stats.avgResponseTime,
totalDataExtracted: stats.dataSize
});
Best Practices
1. Respect Website Policies
- Always check and respect robots.txt (a simplified check is sketched after this list)
- Implement appropriate delays between requests
- Use reasonable concurrency limits
2. Optimize Selectors
- Use specific CSS selectors for better performance
- Avoid overly broad selectors that match too many elements
- Test selectors before large crawls
3. Handle Dynamic Content
- Wait for content to load before extraction
- Use appropriate wait strategies for JavaScript
- Consider screenshot validation for critical pages
4. Monitor Resource Usage
- Set maximum page limits
- Implement timeouts for long-running crawls
- Use webhooks for long-running crawls instead of polling
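As a starting point for the first practice, here is a simplified standalone sketch that fetches robots.txt and checks a path against its Disallow rules (it ignores User-agent sections, wildcards, and Allow rules, so treat it as an illustration rather than a full parser):

async function isAllowed(siteUrl: string, path: string): Promise<boolean> {
  const robotsUrl = new URL('/robots.txt', siteUrl).toString();
  const text = await (await fetch(robotsUrl)).text();
  // Collect every Disallow prefix, regardless of which User-agent block it belongs to.
  const disallows = text
    .split('\n')
    .map((line) => line.trim())
    .filter((line) => line.toLowerCase().startsWith('disallow:'))
    .map((line) => line.slice('disallow:'.length).trim())
    .filter(Boolean);
  return !disallows.some((rule) => path.startsWith(rule));
}

isAllowed('https://example.com', '/admin/users')
  .then((ok) => console.log(ok ? 'allowed' : 'disallowed'));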
Example: E-commerce Site Crawler
A complete example that crawls an e-commerce site:
async function crawlEcommerceSite() {
const crawler = await client.crawl({
startUrl: 'https://shop.example.com',
// Crawling rules
urlFilters: {
include: [/\/products\//, /\/category\//],
exclude: [/\/cart/, /\/checkout/, /\/account/]
},
// Performance settings
concurrency: 5,
maxPages: 1000,
requestDelay: 2000,
// Extraction rules
extract: {
title: 'h1.product-title',
price: '.price-now',
originalPrice: '.price-was',
description: '.product-description',
images: 'img.product-image@src',
inStock: '.availability',
reviews: {
selector: '.review',
multiple: true,
extract: {
rating: '.rating@data-rating',
text: '.review-text',
author: '.review-author'
}
}
},
// Error handling
retry: {
maxAttempts: 3,
delay: 5000
},
// Progress tracking
webhook: {
url: 'https://your-app.com/crawl-webhook',
events: ['page_crawled', 'complete', 'error']
}
});
return crawler.run();
}
Next Steps
- Learn about Data Extraction techniques
- Explore JavaScript Rendering options
- Read about Proxy Usage for geo-targeted crawling
- Understand Rate Limiting best practices