
Error Handling

ActiCrawl provides comprehensive error handling mechanisms to ensure your web scraping operations are robust and reliable. This guide covers error types, handling strategies, and best practices.

Error Types

1. Network Errors

Network errors occur when connecting to target websites or during data transmission.

ruby
# Connection timeout
ActiCrawl::NetworkError::Timeout
# DNS resolution failure  
ActiCrawl::NetworkError::DNSFailure
# Connection refused
ActiCrawl::NetworkError::ConnectionRefused

2. HTTP Errors

HTTP errors are returned by the target server.

ruby
# 404 Not Found
ActiCrawl::HTTPError::NotFound
# 403 Forbidden
ActiCrawl::HTTPError::Forbidden
# 500 Internal Server Error
ActiCrawl::HTTPError::ServerError
# 429 Too Many Requests
ActiCrawl::HTTPError::RateLimited
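
For example, a 429 response is often best handled by honoring the server's Retry-After header before trying again. A minimal sketch, assuming the error object exposes the response headers via a hypothetical `headers` accessor:

ruby
begin
  result = crawler.fetch("https://example.com/page")
rescue ActiCrawl::HTTPError::RateLimited => e
  # `e.headers` is an assumption - adapt to however your error exposes the response
  wait = (e.headers["Retry-After"] || 60).to_i
  sleep(wait)
  retry # cap the number of retries in production code
end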

3. Parsing Errors

Parsing errors occur when extracting data from web pages.

ruby
# Invalid selector
ActiCrawl::ParseError::InvalidSelector
# Element not found
ActiCrawl::ParseError::ElementNotFound
# Invalid HTML structure
ActiCrawl::ParseError::MalformedHTML

4. Validation Errors

Validation errors occur when data doesn't meet expected criteria.

ruby
# Missing required field
ActiCrawl::ValidationError::RequiredFieldMissing
# Invalid data format
ActiCrawl::ValidationError::InvalidFormat
# Data out of range
ActiCrawl::ValidationError::OutOfRange
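
As an illustration, a validator for scraped records might raise these directly. A minimal sketch, assuming each error class accepts a plain message:

ruby
def validate_product!(product)
  # Both raises assume the error classes accept a message string
  raise ActiCrawl::ValidationError::RequiredFieldMissing, "title is required" if product[:title].nil?
  raise ActiCrawl::ValidationError::InvalidFormat, "price must be numeric" unless product[:price].to_s.match?(/\A\d+(\.\d+)?\z/)
  product
end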

Error Handling Strategies

1. Basic Rescue Blocks

Use begin/rescue blocks to handle errors gracefully:

ruby
begin
  result = crawler.fetch("https://example.com/page")
  data = result.extract_data
rescue ActiCrawl::NetworkError => e
  logger.error "Network error: #{e.message}"
  # Implement retry logic
rescue ActiCrawl::ParseError => e
  logger.error "Parse error: #{e.message}"
  # Skip this item
rescue => e
  logger.error "Unexpected error: #{e.message}"
  # General error handling
end

2. Retry Logic

Implement intelligent retry mechanisms for transient errors:

ruby
class RetryableRequest
  MAX_RETRIES = 3
  RETRY_DELAY = 5 # seconds

  def fetch_with_retry(url)
    retries = 0

    begin
      crawler.fetch(url)
    rescue ActiCrawl::NetworkError, ActiCrawl::HTTPError::ServerError => e
      retries += 1

      if retries <= MAX_RETRIES
        logger.warn "Retry #{retries}/#{MAX_RETRIES} after #{RETRY_DELAY}s"
        sleep(RETRY_DELAY * retries)
        retry
      else
        logger.error "Max retries reached for #{url}"
        raise
      end
    end
  end
end
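
The delay grows linearly with each attempt (5s, 10s, 15s). Adding random jitter, for example `sleep(RETRY_DELAY * retries + rand)`, helps prevent many workers from retrying in lockstep against the same host.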

3. Circuit Breaker Pattern

Implement circuit breakers to prevent cascading failures:

ruby
class CircuitBreaker
  FAILURE_THRESHOLD = 5
  TIMEOUT_DURATION = 60 # seconds

  def initialize
    @failure_count = 0
    @last_failure_time = nil
    @state = :closed
  end

  def call
    check_state!

    begin
      result = yield
      reset! if @state == :half_open
      result
    rescue => e
      record_failure!
      raise
    end
  end

  private

  def check_state!
    case @state
    when :open
      if Time.now - @last_failure_time > TIMEOUT_DURATION
        @state = :half_open
      else
        raise ActiCrawl::CircuitBreakerOpen
      end
    end
  end

  def record_failure!
    @failure_count += 1
    @last_failure_time = Time.now

    if @failure_count >= FAILURE_THRESHOLD
      @state = :open
    end
  end

  def reset!
    @failure_count = 0
    @last_failure_time = nil
    @state = :closed
  end
end
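
A usage sketch, assuming a per-domain breaker instance and a `crawler` in scope:

ruby
breaker = CircuitBreaker.new

begin
  result = breaker.call { crawler.fetch("https://example.com/page") }
rescue ActiCrawl::CircuitBreakerOpen
  logger.warn "Circuit open, skipping request"
end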

4. Fallback Strategies

Implement fallback mechanisms for critical operations:

ruby
class DataFetcher
  def fetch_product_data(product_id)
    primary_source_data(product_id)
  rescue ActiCrawl::Error => e
    logger.warn "Primary source failed: #{e.message}"
    # An exception raised inside a rescue clause is not caught by a sibling
    # rescue, so the fallback failure must be handled in a nested begin block.
    begin
      fallback_source_data(product_id)
    rescue => e
      logger.error "All sources failed: #{e.message}"
      cached_data(product_id) || default_data
    end
  end

  private

  def primary_source_data(id)
    crawler.fetch("https://primary.example.com/products/#{id}")
  end

  def fallback_source_data(id)
    crawler.fetch("https://backup.example.com/products/#{id}")
  end

  def cached_data(id)
    cache.get("product:#{id}")
  end

  def default_data
    { status: 'unavailable', message: 'Data temporarily unavailable' }
  end
end
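
Note the ordering: the live fallback source is tried first, the cache second, and a static default last, so callers always receive a well-formed response even when every source is down.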

Error Monitoring

1. Logging

Implement comprehensive logging for all errors:

ruby
class ErrorLogger
  def log_error(error, context = {})
    log_entry = {
      timestamp: Time.now.iso8601,
      error_class: error.class.name,
      message: error.message,
      backtrace: error.backtrace&.first(6),
      context: context
    }

    case error
    when ActiCrawl::NetworkError
      logger.error "NETWORK_ERROR: #{log_entry.to_json}"
    when ActiCrawl::ParseError
      logger.warn "PARSE_ERROR: #{log_entry.to_json}"
    else
      logger.error "UNKNOWN_ERROR: #{log_entry.to_json}"
    end
  end
end
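
Called from a scraping loop, this might look like the following sketch (`error_logger` and `url` are assumed to be in scope):

ruby
begin
  crawler.fetch(url)
rescue => e
  error_logger.log_error(e, url: url, worker_pid: Process.pid)
  raise
end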

2. Metrics Collection

Track error rates and patterns:

ruby
require 'uri'

class ErrorMetrics
  def record_error(error_type, url)
    # Increment error counter
    metrics.increment("errors.#{error_type}")

    # Track error rate per domain
    domain = URI.parse(url).host
    metrics.increment("errors.by_domain.#{domain}")

    # Record elapsed time for the failed request
    # (@start_time must be captured when the request starts)
    metrics.timing("error.response_time", Time.now - @start_time)
  end

  def error_rate(window = 300) # 5 minutes
    total_requests = metrics.get("requests.total", window)
    total_errors = metrics.get("errors.total", window)

    return 0 if total_requests.zero?
    (total_errors.to_f / total_requests * 100).round(2)
  end
end
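
For the rate calculation to work, a `requests.total` counter (and `errors.total`) must be incremented on every request; otherwise the denominator stays at zero and the method always returns 0.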

3. Alerting

Set up alerts for critical error conditions:

ruby
class ErrorAlerter
  ALERT_THRESHOLDS = {
    error_rate: 10, # percentage
    consecutive_failures: 5,
    response_time: 30 # seconds
  }.freeze

  def check_and_alert
    if error_rate > ALERT_THRESHOLDS[:error_rate]
      send_alert("High error rate: #{error_rate}%")
    end

    if consecutive_failures > ALERT_THRESHOLDS[:consecutive_failures]
      send_alert("Multiple consecutive failures detected")
    end
  end

  private

  def send_alert(message)
    # Send email
    AlertMailer.error_alert(message).deliver_later

    # Send Slack notification
    slack_notifier.post(text: "🚨 #{message}")

    # Log alert
    logger.error "ALERT: #{message}"
  end
end

Best Practices

1. Fail Fast

Don't hide errors that indicate serious problems:

ruby
# Bad
def fetch_data
  crawler.fetch(url)
rescue
  nil # Don't do this!
end

# Good
def fetch_data
  crawler.fetch(url)
rescue ActiCrawl::NetworkError => e
  logger.error "Network error: #{e.message}"
  raise # Re-raise or handle appropriately
end

2. Provide Context

Include relevant context when logging errors:

ruby
def process_item(item)
  fetch_item_data(item.url)
rescue => e
  # The stdlib Logger only accepts a single message argument,
  # so serialize the context into the message itself
  context = {
    item_id: item.id,
    url: item.url,
    error: e.message,
    retry_count: item.retry_count
  }
  logger.error "Failed to process item: #{context.to_json}"
  raise
end

3. Use Specific Error Types

Create custom error classes for different scenarios:

ruby
module ActiCrawl
  class Error < StandardError; end

  class NetworkError < Error
    attr_reader :url, :response_code

    def initialize(message, url: nil, response_code: nil)
      super(message)
      @url = url
      @response_code = response_code
    end
  end

  class RateLimitError < NetworkError
    attr_reader :retry_after

    def initialize(message, retry_after: nil, **kwargs)
      super(message, **kwargs)
      @retry_after = retry_after
    end
  end
end
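
With these classes in place, a rate-limited response can carry everything the caller needs to back off. A sketch, assuming a `response` object exposing `code` and `headers`:

ruby
if response.code == 429
  raise ActiCrawl::RateLimitError.new(
    "Rate limited by server",
    url: url,
    response_code: 429,
    retry_after: response.headers["Retry-After"]&.to_i
  )
end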

4. Graceful Degradation

Design your system to continue operating with reduced functionality:

ruby
class ProductScraper
  def scrape_product(url)
    page = crawler.fetch(url)

    product = {
      title: extract_title(page),
      price: extract_price(page),
      description: extract_description(page),
      images: extract_images(page)
    }

    # Validate required fields
    validate_required_fields!(product)
    product
  end

  private

  def extract_title(page)
    # Critical field - raise if not found
    page.at('.product-title')&.text || raise(ActiCrawl::ParseError::ElementNotFound, "Title not found")
  end

  def extract_price(page)
    # Important but not critical - use default
    page.at('.price')&.text || "Price not available"
  end

  def extract_description(page)
    # Optional field - can be nil
    page.at('.description')&.text
  rescue => e
    logger.warn "Failed to extract description: #{e.message}"
    nil
  end
end

Error Recovery

1. Automatic Recovery

Implement automatic recovery mechanisms:

ruby
class AutoRecovery
  def with_recovery(operation, recovery_action = nil)
    operation.call
  rescue => e
    logger.warn "Operation failed, attempting recovery: #{e.message}"

    if recovery_action
      recovery_action.call
    else
      default_recovery(e)
    end

    # Retry operation after recovery
    operation.call
  end

  private

  def default_recovery(error)
    case error
    when ActiCrawl::HTTPError::RateLimited
      wait_time = error.retry_after || 60
      logger.info "Rate limited, waiting #{wait_time}s"
      sleep(wait_time)
    when ActiCrawl::NetworkError::Timeout
      logger.info "Timeout occurred, resetting connection"
      reset_connection
    end
  end
end
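
Because `with_recovery` calls `#call` on its arguments, lambdas work directly. A usage sketch (the recovery step is a hypothetical example):

ruby
recovery = AutoRecovery.new

result = recovery.with_recovery(
  -> { crawler.fetch("https://example.com/products") },
  -> { crawler.reset_session } # hypothetical recovery action
)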

2. Manual Intervention

Some errors require manual intervention:

ruby
class ManualInterventionRequired < ActiCrawl::Error
  def initialize(message, action_required:)
    super(message)
    @action_required = action_required
  end

  def notify_operator
    # Use a descriptive method name; defining `send` would shadow Object#send
    OperatorNotifier.notify(
      error: message,
      action_required: @action_required,
      timestamp: Time.now
    )
  end
end

# Usage
begin
  process_sensitive_data
rescue SecurityError => e
  error = ManualInterventionRequired.new(
    "Security violation detected",
    action_required: "Review security logs and update credentials"
  )
  error.notify_operator
  raise error
end

Testing Error Handling

1. Unit Tests

Test error handling in isolation:

ruby
require 'test_helper'
require 'webmock/minitest' # provides stub_request

class ErrorHandlingTest < ActiveSupport::TestCase
  def setup
    @scraper = ProductScraper.new
  end

  test "handles network timeout gracefully" do
    stub_request(:get, "https://example.com")
      .to_timeout

    assert_raises(ActiCrawl::NetworkError::Timeout) do
      @scraper.fetch("https://example.com")
    end
  end

  test "retries on transient errors" do
    stub_request(:get, "https://example.com")
      .to_return(status: 500).times(2)
      .then.to_return(status: 200, body: "Success")

    result = @scraper.fetch_with_retry("https://example.com")
    assert_equal "Success", result.body
  end
end

2. Integration Tests

Test error handling in real scenarios:

ruby
class ErrorHandlingIntegrationTest < ActionDispatch::IntegrationTest
  test "handles API errors gracefully" do
    # Simulate API error
    mock_api_error(500)

    get "/products/123"

    assert_response :success
    assert_select ".error-message", "Product data temporarily unavailable"
  end

  test "circuit breaker prevents cascading failures" do
    # Trigger multiple failures
    6.times do
      mock_api_error(500)
      get "/products/123"
    end

    # Circuit should be open
    get "/products/124"
    assert_response :service_unavailable
  end
end

Summary

Effective error handling is crucial for building reliable web scraping systems. Key takeaways:

  1. Identify and categorize different types of errors
  2. Implement retry logic for transient failures
  3. Use circuit breakers to prevent cascading failures
  4. Monitor and alert on error patterns
  5. Provide fallback mechanisms for critical operations
  6. Test error handling thoroughly
  7. Log errors with context for debugging
  8. Design for graceful degradation

By following these patterns and best practices, you can build robust scraping systems that handle errors gracefully and maintain high availability.