Web Crawling

Learn how to crawl websites effectively with ActiCrawl's crawling engine, which handles modern JavaScript-heavy sites, dynamic content, and complex navigation patterns.

Understanding Web Crawling

Web crawling is the process of systematically browsing and extracting data from websites. ActiCrawl provides a sophisticated crawling engine that can:

  • Navigate through multiple pages automatically
  • Handle JavaScript-rendered content
  • Follow links and discover new pages
  • Respect robots.txt and crawl delays
  • Manage sessions and authentication

Single Page Scraping

For scraping a single page, use the basic scrape endpoint:

javascript
const result = await client.scrape({
  url: 'https://example.com/page',
  format: 'markdown'
});
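
The JavaScript examples in this guide assume an already-initialized client. A minimal setup sketch, assuming the SDK is distributed as an npm package named acticrawl with an ActiCrawl constructor (mirroring the Python SDK shown later); check your SDK's actual package name and import path:

javascript
// Assumed package name and constructor, based on the Python SDK
// (from acticrawl import ActiCrawl); adjust to match your installed SDK.
const { ActiCrawl } = require('acticrawl');

const client = new ActiCrawl({ apiKey: 'YOUR_API_KEY' });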

Multi-Page Crawling

Following Links

Automatically follow and crawl links within a domain:

javascript
const crawler = await client.crawl({
  startUrl: 'https://example.com',
  followLinks: true,
  maxPages: 100,
  linkSelector: 'a[href]', // CSS selector for links to follow
  sameDomain: true, // Only follow links on the same domain
  format: 'json'
});

// Process crawled pages
crawler.on('page', (page) => {
  console.log(`Crawled: ${page.url}`);
  console.log(`Found ${page.links.length} links`);
});

crawler.on('complete', (results) => {
  console.log(`Crawled ${results.length} pages`);
});

URL Patterns

Define specific URL patterns to crawl:

python
from acticrawl import ActiCrawl

client = ActiCrawl(api_key='YOUR_API_KEY')

crawler = client.create_crawler({
    'start_url': 'https://example.com/products',
    'url_patterns': [
        r'^https://example\.com/products/[\w-]+$',  # Product pages
        r'^https://example\.com/category/[\w-]+$'   # Category pages
    ],
    'max_pages': 500
})

results = crawler.run()

Sitemap Crawling

Efficiently crawl sites using their sitemap:

ruby
crawler = client.crawl_sitemap(
  sitemap_url: 'https://example.com/sitemap.xml',
  filter: ->(url) { url.include?('/blog/') }, # Only crawl blog posts
  concurrency: 5,
  format: 'markdown'
)

crawler.each do |page|
  puts "Title: #{page['metadata']['title']}"
  puts "Content: #{page['content']}"
end

Advanced Crawling Strategies

Depth-First vs Breadth-First

Choose your crawling strategy based on site structure:

javascript
// Breadth-first (default) - Good for discovering all pages at each level
const bfsCrawler = await client.crawl({
  startUrl: 'https://example.com',
  strategy: 'breadth-first',
  maxDepth: 3
});

// Depth-first - Good for following specific paths deeply
const dfsCrawler = await client.crawl({
  startUrl: 'https://example.com',
  strategy: 'depth-first',
  maxDepth: 10
});

Pagination Handling

Automatically handle paginated content:

python
crawler = client.create_crawler({
    'start_url': 'https://example.com/listings?page=1',
    'pagination': {
        'type': 'query_param',
        'param': 'page',
        'max_pages': 50,
        'increment': 1
    }
})

# Or with next button clicking
crawler = client.create_crawler({
    'start_url': 'https://example.com/results',
    'pagination': {
        'type': 'click',
        'selector': 'button.next-page',
        'wait_after_click': 2000,
        'max_clicks': 20
    }
})

Infinite Scroll

Handle infinite scroll pages:

javascript
const result = await client.scrape({
  url: 'https://example.com/feed',
  infinite_scroll: {
    enabled: true,
    max_scrolls: 10,
    scroll_delay: 1000, // Wait 1s between scrolls
    element_count_selector: '.feed-item', // Stop when no new items
    timeout: 30000
  }
});

Session Management

Maintaining Login State

Crawl authenticated areas:

javascript
// First, login
const session = await client.createSession({
  loginUrl: 'https://example.com/login',
  credentials: {
    username: 'user@example.com',
    password: 'password'
  },
  selectors: {
    username: '#username',
    password: '#password',
    submit: 'button[type="submit"]'
  }
});

// Then crawl with session
const crawler = await client.crawl({
  startUrl: 'https://example.com/dashboard',
  sessionId: session.id,
  followLinks: true
});

Cookie Management

Use existing cookies for crawling:

python
cookies = [
    {'name': 'session_id', 'value': 'abc123', 'domain': 'example.com'},
    {'name': 'auth_token', 'value': 'xyz789', 'domain': 'example.com'}
]

crawler = client.create_crawler({
    'start_url': 'https://example.com/protected',
    'cookies': cookies,
    'follow_links': True
})

Crawl Rules and Filters

URL Filtering

Control which URLs to crawl:

javascript
const crawler = await client.crawl({
  startUrl: 'https://example.com',
  urlFilters: {
    include: [
      /\/products\//,  // Include product pages
      /\/blog\//       // Include blog posts
    ],
    exclude: [
      /\/admin\//,     // Exclude admin pages
      /\.pdf$/,        // Exclude PDF files
      /#/              // Exclude URL fragments
    ]
  }
});

Content Filtering

Filter pages based on content:

python
def should_process(page):
    # Only process pages with specific content
    return (
        page['metadata'].get('language') == 'en' and
        len(page['content']) > 1000 and
        'product' in page['content'].lower()
    )

crawler = client.create_crawler({
    'start_url': 'https://example.com',
    'content_filter': should_process
})

Performance Optimization

Concurrent Crawling

Speed up crawling with concurrency:

javascript
const crawler = await client.crawl({
  startUrl: 'https://example.com',
  concurrency: 10, // Crawl 10 pages simultaneously
  requestDelay: 1000, // Wait 1s between requests
  respectRobotsTxt: true
});

Caching

Avoid re-crawling unchanged pages:

python
crawler = client.create_crawler({
    'start_url': 'https://example.com',
    'cache': {
        'enabled': True,
        'ttl': 86400,  # Cache for 24 hours
        'key_by': 'url_and_content_hash'
    }
})

Selective Extraction

Only extract needed data to reduce processing:

javascript
const crawler = await client.crawl({
  startUrl: 'https://example.com/catalog',
  extract: {
    title: 'h1',
    price: '.price',
    description: '.product-description',
    image: 'img.product-image@src'
  },
  skipFullContent: true // Don't store full HTML
});

Error Handling and Retries

Automatic Retries

Configure retry behavior:

python
crawler = client.create_crawler({
    'start_url': 'https://example.com',
    'retry': {
        'max_attempts': 3,
        'delay': 2000,
        'exponential_backoff': True,
        'on_errors': [429, 500, 502, 503, 504]
    }
})

Error Recovery

Continue crawling after errors:

javascript
const crawler = await client.crawl({
  startUrl: 'https://example.com',
  errorHandling: {
    continueOnError: true,
    maxConsecutiveErrors: 5,
    errorLog: true
  }
});

crawler.on('error', (error) => {
  console.error(`Error crawling ${error.url}: ${error.message}`);
  // Custom error handling
});

Monitoring and Progress

Real-time Progress

Track crawling progress:

python
crawler = client.create_crawler({
    'start_url': 'https://example.com',
    'progress_callback': lambda p: print(f"Progress: {p['crawled']}/{p['total']}")
})

# Or use webhooks
crawler = client.create_crawler({
    'start_url': 'https://example.com',
    'webhook': {
        'url': 'https://your-app.com/crawl-progress',
        'events': ['page_crawled', 'error', 'complete']
    }
})

Crawl Statistics

Get detailed statistics:

javascript
const stats = await crawler.getStats();
console.log({
  totalPages: stats.totalPages,
  successfulPages: stats.successful,
  failedPages: stats.failed,
  averageResponseTime: stats.avgResponseTime,
  totalDataExtracted: stats.dataSize
});

Best Practices

1. Respect Website Policies

  • Always check and respect robots.txt
  • Implement appropriate delays between requests
  • Use reasonable concurrency limits (see the sketch below)
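
A minimal polite-crawl configuration sketch, combining the respectRobotsTxt, requestDelay, and concurrency options shown earlier (the values are illustrative, not recommendations for any specific site):

javascript
// Illustrative values only; tune delay and concurrency per target site.
const politeCrawler = await client.crawl({
  startUrl: 'https://example.com',
  respectRobotsTxt: true, // honor robots.txt rules
  requestDelay: 2000,     // wait 2s between requests
  concurrency: 3,         // keep simultaneous requests modest
  maxPages: 500
});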

2. Optimize Selectors

  • Use specific CSS selectors for better performance
  • Avoid overly broad selectors that match too many elements
  • Test selectors before large crawls (see the sketch below)
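
One way to test selectors is to run a one-page crawl against a known URL and inspect the result before scaling up. A sketch using only options shown earlier in this guide (the URL and selectors are placeholders):

javascript
// Dry-run the extraction rules on a single page before a large crawl.
// The URL and selectors are placeholders; replace them with your own.
const testRun = await client.crawl({
  startUrl: 'https://example.com/products/sample-product',
  maxPages: 1,
  extract: {
    title: 'h1.product-title',
    price: '.price'
  }
});

testRun.on('page', (page) => {
  console.log(page); // confirm the extracted fields look correct
});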

3. Handle Dynamic Content

  • Wait for content to load before extraction
  • Use appropriate wait strategies for JavaScript (see the sketch after this list)
  • Consider screenshot validation for critical pages
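
A sketch of waiting for JavaScript-rendered content before extraction. The wait_for and wait options below are hypothetical names used for illustration (the documented waits in this guide are scroll_delay for infinite scroll and wait_after_click for pagination), so check the API reference for the exact option:

javascript
// Hypothetical wait options for illustration only; wait_for and wait are
// assumptions, not confirmed ActiCrawl parameters.
const result = await client.scrape({
  url: 'https://example.com/dashboard',
  wait_for: '.chart-loaded', // assumed: wait until this selector appears
  wait: 3000,                // assumed: or wait a fixed 3s before extracting
  format: 'markdown'
});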

4. Monitor Resource Usage

  • Set maximum page limits
  • Implement timeouts for long-running crawls
  • Use webhooks for long-running crawls instead of polling

Example: E-commerce Site Crawler

A complete example that crawls an e-commerce site:

javascript
async function crawlEcommerceSite() {
  const crawler = await client.crawl({
    startUrl: 'https://shop.example.com',

    // Crawling rules
    urlFilters: {
      include: [/\/products\//, /\/category\//],
      exclude: [/\/cart/, /\/checkout/, /\/account/]
    },

    // Performance settings
    concurrency: 5,
    maxPages: 1000,
    requestDelay: 2000,

    // Extraction rules
    extract: {
      title: 'h1.product-title',
      price: '.price-now',
      originalPrice: '.price-was',
      description: '.product-description',
      images: 'img.product-image@src',
      inStock: '.availability',
      reviews: {
        selector: '.review',
        multiple: true,
        extract: {
          rating: '.rating@data-rating',
          text: '.review-text',
          author: '.review-author'
        }
      }
    },

    // Error handling
    retry: {
      maxAttempts: 3,
      delay: 5000
    },

    // Progress tracking
    webhook: {
      url: 'https://your-app.com/crawl-webhook',
      events: ['page_crawled', 'complete', 'error']
    }
  });

  return crawler.run();
}
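
A short usage sketch for the function above, assuming crawler.run() resolves with the array of crawled pages (matching the complete event shown earlier):

javascript
// Run the crawl and log a brief summary; assumes run() rejects on
// fatal failures, so errors surface in the catch handler.
crawlEcommerceSite()
  .then((results) => {
    console.log(`Crawled ${results.length} product and category pages`);
  })
  .catch((err) => {
    console.error('Crawl failed:', err.message);
  });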

Next Steps