Error Handling
ActiCrawl provides comprehensive error handling mechanisms to ensure your web scraping operations are robust and reliable. This guide covers error types, handling strategies, and best practices.
Error Types
1. Network Errors
Network errors occur when connecting to target websites or during data transmission.
# Connection timeout
ActiCrawl::NetworkError::Timeout
# DNS resolution failure
ActiCrawl::NetworkError::DNSFailure
# Connection refused
ActiCrawl::NetworkError::ConnectionRefused
2. HTTP Errors
HTTP errors are returned by the target server.
# 404 Not Found
ActiCrawl::HTTPError::NotFound
# 403 Forbidden
ActiCrawl::HTTPError::Forbidden
# 500 Internal Server Error
ActiCrawl::HTTPError::ServerError
# 429 Too Many Requests
ActiCrawl::HTTPError::RateLimited
3. Parsing Errors
Parsing errors occur when extracting data from web pages.
# Invalid selector
ActiCrawl::ParseError::InvalidSelector
# Element not found
ActiCrawl::ParseError::ElementNotFound
# Invalid HTML structure
ActiCrawl::ParseError::MalformedHTML
4. Validation Errors
Validation errors occur when data doesn't meet expected criteria.
# Missing required field
ActiCrawl::ValidationError::RequiredFieldMissing
# Invalid data format
ActiCrawl::ValidationError::InvalidFormat
# Data out of range
ActiCrawl::ValidationError::OutOfRange
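All of these are assumed to share the common ActiCrawl::Error base class shown later in this guide, so a rescue can be as narrow or as broad as the situation calls for. A minimal sketch:
begin
crawler.fetch("https://example.com/products")
rescue ActiCrawl::HTTPError::RateLimited => e
# Most specific: back off when the server throttles us
sleep(60)
rescue ActiCrawl::NetworkError, ActiCrawl::HTTPError => e
# Broader: any transport-level or HTTP failure
logger.warn "Fetch failed: #{e.class} - #{e.message}"
rescue ActiCrawl::Error => e
# Catch-all for ActiCrawl errors without swallowing unrelated bugs
logger.error "Crawl error: #{e.message}"
end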
Error Handling Strategies
1. Basic Rescue Blocks
Use begin/rescue blocks to handle errors gracefully:
begin
result = crawler.fetch("https://example.com/page")
data = result.extract_data
rescue ActiCrawl::NetworkError => e
logger.error "Network error: #{e.message}"
# Implement retry logic
rescue ActiCrawl::ParseError => e
logger.error "Parse error: #{e.message}"
# Skip this item
rescue => e
logger.error "Unexpected error: #{e.message}"
# General error handling
end
2. Retry Logic
Implement intelligent retry mechanisms for transient errors:
class RetryableRequest
MAX_RETRIES = 3
RETRY_DELAY = 5 # seconds
def fetch_with_retry(url)
retries = 0
begin
crawler.fetch(url)
rescue ActiCrawl::NetworkError, ActiCrawl::HTTPError::ServerError => e
retries += 1
if retries <= MAX_RETRIES
logger.warn "Retry #{retries}/#{MAX_RETRIES} after #{RETRY_DELAY}s"
sleep(RETRY_DELAY * retries)
retry
else
logger.error "Max retries reached for #{url}"
raise
end
end
end
end
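The delay grows linearly with each attempt; exponential backoff with jitter is a common alternative for heavily loaded targets. Usage might look like this (a sketch, assuming the class has access to a crawler and logger as above):
request = RetryableRequest.new
begin
response = request.fetch_with_retry("https://example.com/catalog")
rescue ActiCrawl::NetworkError, ActiCrawl::HTTPError::ServerError => e
# Reached only after MAX_RETRIES attempts have all failed
logger.error "Giving up: #{e.message}"
end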
3. Circuit Breaker Pattern
Implement circuit breakers to prevent cascading failures:
class CircuitBreaker
FAILURE_THRESHOLD = 5
TIMEOUT_DURATION = 60 # seconds
def initialize
@failure_count = 0
@last_failure_time = nil
@state = :closed
end
def call(url)
check_state!
begin
result = yield
reset! if @state == :half_open
result
rescue => e
record_failure!
raise
end
end
private
def check_state!
case @state
when :open
if Time.now - @last_failure_time > TIMEOUT_DURATION
@state = :half_open
else
raise ActiCrawl::CircuitBreakerOpen
end
end
end
def record_failure!
@failure_count += 1
@last_failure_time = Time.now
if @failure_count >= FAILURE_THRESHOLD
@state = :open
end
end
def reset!
@failure_count = 0
@last_failure_time = nil
@state = :closed
end
end
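A usage sketch: the breaker runs the block you pass to call, and the url argument is only there for context:
breaker = CircuitBreaker.new
begin
page = breaker.call(url) { crawler.fetch(url) }
rescue ActiCrawl::CircuitBreakerOpen
# The breaker is open: skip the request instead of hammering a failing host
logger.warn "Circuit open for #{url}, skipping"
end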
4. Fallback Strategies
Implement fallback mechanisms for critical operations:
class DataFetcher
def fetch_product_data(product_id)
primary_source_data(product_id)
rescue ActiCrawl::Error => e
logger.warn "Primary source failed: #{e.message}"
# A rescue clause cannot catch errors raised inside another rescue body,
# so the fallback attempt needs its own begin/rescue
begin
fallback_source_data(product_id)
rescue => fallback_error
logger.error "All sources failed: #{fallback_error.message}"
cached_data(product_id) || default_data
end
end
private
def primary_source_data(id)
crawler.fetch("https://primary.example.com/products/#{id}")
end
def fallback_source_data(id)
crawler.fetch("https://backup.example.com/products/#{id}")
end
def cached_data(id)
cache.get("product:#{id}")
end
def default_data
{ status: 'unavailable', message: 'Data temporarily unavailable' }
end
end
Error Monitoring
1. Logging
Implement comprehensive logging for all errors:
class ErrorLogger
def log_error(error, context = {})
log_entry = {
timestamp: Time.now.iso8601,
error_class: error.class.name,
message: error.message,
backtrace: error.backtrace&.first(6),
context: context
}
case error
when ActiCrawl::NetworkError
logger.error "NETWORK_ERROR: #{log_entry.to_json}"
when ActiCrawl::ParseError
logger.warn "PARSE_ERROR: #{log_entry.to_json}"
else
logger.error "UNKNOWN_ERROR: #{log_entry.to_json}"
end
end
end
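Called from a rescue block, passing whatever context is at hand (the job and attempt values here are illustrative):
begin
crawler.fetch(url)
rescue ActiCrawl::Error => e
error_logger.log_error(e, { url: url, job_id: job.id, attempt: attempt })
raise
end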
2. Metrics Collection
Track error rates and patterns:
class ErrorMetrics
def record_error(error_type, url, started_at: nil)
# Increment error counter
metrics.increment("errors.#{error_type}")
# Track error rate per domain
domain = URI.parse(url).host
metrics.increment("errors.by_domain.#{domain}")
# Record time-to-failure when the caller supplies the request start time
metrics.timing("error.response_time", Time.now - started_at) if started_at
end
def error_rate(window = 300) # 5 minutes
total_requests = metrics.get("requests.total", window)
total_errors = metrics.get("errors.total", window)
return 0 if total_requests.zero?
(total_errors.to_f / total_requests * 100).round(2)
end
end
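A hedged usage sketch, passing the request start time so the failure timing can be recorded (the error_metrics instance and its metrics backend are assumed):
started_at = Time.now
begin
crawler.fetch(url)
rescue ActiCrawl::NetworkError => e
error_metrics.record_error("network", url, started_at: started_at)
raise
rescue ActiCrawl::HTTPError => e
error_metrics.record_error("http", url, started_at: started_at)
raise
end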
3. Alerting
Set up alerts for critical error conditions:
class ErrorAlerter
ALERT_THRESHOLDS = {
error_rate: 10, # percentage
consecutive_failures: 5,
response_time: 30 # seconds
}
def check_and_alert
if error_rate > ALERT_THRESHOLDS[:error_rate]
send_alert("High error rate: #{error_rate}%")
end
if consecutive_failures > ALERT_THRESHOLDS[:consecutive_failures]
send_alert("Multiple consecutive failures detected")
end
end
private
def send_alert(message)
# Send email
AlertMailer.error_alert(message).deliver_later
# Send Slack notification
slack_notifier.post(text: "🚨 #{message}")
# Log alert
logger.error "ALERT: #{message}"
end
end
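check_and_alert is intended to run on a schedule. A minimal sketch; in production this would typically live in a cron job or background worker:
alerter = ErrorAlerter.new
loop do
alerter.check_and_alert
sleep(60) # evaluate thresholds once a minute
end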
Best Practices
1. Fail Fast
Don't hide errors that indicate serious problems:
# Bad
def fetch_data
begin
crawler.fetch(url)
rescue => e
nil # Don't do this!
end
end
# Good
def fetch_data
begin
crawler.fetch(url)
rescue ActiCrawl::NetworkError => e
logger.error "Network error: #{e.message}"
raise # Re-raise or handle appropriately
end
end
2. Provide Context
Include relevant context when logging errors:
def process_item(item)
fetch_item_data(item.url)
rescue => e
logger.error "Failed to process item", {
item_id: item.id,
url: item.url,
error: e.message,
retry_count: item.retry_count
}
raise
end
3. Use Specific Error Types
Create custom error classes for different scenarios:
module ActiCrawl
class Error < StandardError; end
class NetworkError < Error
attr_reader :url, :response_code
def initialize(message, url: nil, response_code: nil)
super(message)
@url = url
@response_code = response_code
end
end
class RateLimitError < NetworkError
attr_reader :retry_after
def initialize(message, retry_after: nil, **kwargs)
super(message, **kwargs)
@retry_after = retry_after
end
end
end
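The extra attributes pay off at the call site. A sketch of raising and honouring the rate-limit hint (the response object and its code/headers accessors are assumptions here):
begin
response = crawler.fetch(url)
if response.code == 429
raise ActiCrawl::RateLimitError.new(
"Rate limited by #{url}",
url: url,
response_code: 429,
retry_after: response.headers["Retry-After"]&.to_i
)
end
rescue ActiCrawl::RateLimitError => e
# In real code, cap the number of retries
sleep(e.retry_after || 60)
retry
end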
4. Graceful Degradation
Design your system to continue operating with reduced functionality:
class ProductScraper
def scrape_product(url)
# Fetch the page once and let each extractor read from it
@page = crawler.fetch(url)
product = {
title: extract_title,
price: extract_price,
description: extract_description,
images: extract_images
}
# Validate required fields
validate_required_fields!(product)
product
end
private
attr_reader :page
def extract_title
# Critical field - raise if not found
page.at('.product-title')&.text || raise(ActiCrawl::ParseError::ElementNotFound, "Title not found")
end
def extract_price
# Important but not critical - use default
page.at('.price')&.text || "Price not available"
end
def extract_description
# Optional field - can be nil
page.at('.description')&.text
rescue => e
logger.warn "Failed to extract description: #{e.message}"
nil
end
end
Error Recovery
1. Automatic Recovery
Implement automatic recovery mechanisms:
class AutoRecovery
def with_recovery(operation, recovery_action = nil)
operation.call
rescue => e
logger.warn "Operation failed, attempting recovery: #{e.message}"
if recovery_action
recovery_action.call
else
default_recovery(e)
end
# Retry operation after recovery
operation.call
end
private
def default_recovery(error)
case error
when ActiCrawl::HTTPError::RateLimited
wait_time = error.retry_after || 60
logger.info "Rate limited, waiting #{wait_time}s"
sleep(wait_time)
when ActiCrawl::NetworkError::Timeout
logger.info "Timeout occurred, resetting connection"
reset_connection
end
end
end
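Operations and recovery actions are passed in as callables, so usage is just a lambda per step (a sketch):
recovery = AutoRecovery.new
# With no explicit recovery action, default_recovery handles rate limits and timeouts
page = recovery.with_recovery(-> { crawler.fetch("https://example.com/feed") })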
2. Manual Intervention
Some errors require manual intervention:
class ManualInterventionRequired < ActiCrawl::Error
def initialize(message, action_required:)
super(message)
@action_required = action_required
end
def notify_operator
# Use a named notifier method; .send would collide with Ruby's built-in Object#send
OperatorNotifier.notify(
error: message,
action_required: @action_required,
timestamp: Time.now
)
end
end
# Usage
begin
process_sensitive_data
rescue SecurityError => e
error = ManualInterventionRequired.new(
"Security violation detected",
action_required: "Review security logs and update credentials"
)
error.notify_operator
raise error
end
Testing Error Handling
1. Unit Tests
Test error handling in isolation:
require 'test_helper'
require 'webmock/minitest' # stub_request below is provided by the WebMock gem
class ErrorHandlingTest < ActiveSupport::TestCase
def setup
@scraper = ProductScraper.new
end
test "handles network timeout gracefully" do
stub_request(:get, "https://example.com")
.to_timeout
assert_raises(ActiCrawl::NetworkError::Timeout) do
@scraper.fetch("https://example.com")
end
end
test "retries on transient errors" do
stub_request(:get, "https://example.com")
.to_return(status: 500).times(2)
.then.to_return(status: 200, body: "Success")
result = @scraper.fetch_with_retry("https://example.com")
assert_equal "Success", result.body
end
end
2. Integration Tests
Test error handling in real scenarios:
class ErrorHandlingIntegrationTest < ActionDispatch::IntegrationTest
test "handles API errors gracefully" do
# Simulate API error
mock_api_error(500)
get "/products/123"
assert_response :success
assert_select ".error-message", "Product data temporarily unavailable"
end
test "circuit breaker prevents cascading failures" do
# Trigger multiple failures
6.times do
mock_api_error(500)
get "/products/123"
end
# Circuit should be open
get "/products/124"
assert_response :service_unavailable
end
end
Summary
Effective error handling is crucial for building reliable web scraping systems. Key takeaways:
- Identify and categorize different types of errors
- Implement retry logic for transient failures
- Use circuit breakers to prevent cascading failures
- Monitor and alert on error patterns
- Provide fallback mechanisms for critical operations
- Test error handling thoroughly
- Log errors with context for debugging
- Design for graceful degradation
By following these patterns and best practices, you can build robust scraping systems that handle errors gracefully and maintain high availability.