Data Extraction
Master the art of extracting structured data from web pages with ActiCrawl's extraction engine. This guide covers techniques ranging from simple CSS selectors to AI-powered extraction.
Extraction Basics
ActiCrawl supports multiple methods to extract data from web pages:
- CSS Selectors: Target elements using standard CSS syntax
- XPath: Advanced path-based selection
- JSON-LD: Extract structured data from JSON-LD scripts
- Regex Patterns: Extract text using regular expressions
- AI Extraction: Let AI understand and extract data intelligently
CSS Selector Extraction
The most common and straightforward method:
const result = await client.scrape({
  url: 'https://example.com/product',
  extract: {
    title: 'h1.product-title',
    price: '.price-current',
    description: '.product-description',
    image: 'img.main-image@src',
    rating: '.rating@data-rating'
  }
});
console.log(result.extracted);
// {
// title: "Premium Wireless Headphones",
// price: "$299.99",
// description: "High-quality audio...",
// image: "https://example.com/img/product.jpg",
// rating: "4.5"
// }
Attribute Extraction
Extract specific attributes using the @attribute suffix:
extract: {
  imageUrl: 'img#product-image@src',
  imageAlt: 'img#product-image@alt',
  linkHref: 'a.product-link@href',
  dataId: 'div.product@data-product-id',
  metaDescription: 'meta[name="description"]@content'
}
Multiple Elements
Extract arrays of elements:
extract: {
  // Single element
  title: 'h1',
  // Multiple elements
  features: {
    selector: 'li.feature',
    multiple: true
  },
  // Nested extraction
  reviews: {
    selector: '.review',
    multiple: true,
    extract: {
      author: '.reviewer-name',
      rating: '.stars@data-rating',
      comment: '.review-text',
      date: '.review-date'
    }
  }
}
XPath Extraction
For complex selections that CSS can't handle:
result = client.scrape(
    url='https://example.com/article',
    extract={
        # Text after specific label
        'author': '//span[text()="Author:"]/following-sibling::text()',
        # Table cell by header
        'price': '//th[text()="Price"]/following-sibling::td/text()',
        # Complex conditions
        'in_stock': '//div[@class="availability" and contains(text(), "In Stock")]',
        # Parent navigation
        'category': '//li[@class="current"]/parent::ul/@data-category'
    }
)
Advanced Extraction Patterns
Table Extraction
Extract structured data from tables:
const tableData = await client.scrape({
  url: 'https://example.com/data',
  extract: {
    table: {
      selector: 'table#data-table',
      type: 'table',
      headers: 'auto', // or specify: ['Name', 'Price', 'Stock']
      skipRows: 1 // Skip header row
    }
  }
});
// Result:
// [
// { Name: "Product A", Price: "$100", Stock: "In Stock" },
// { Name: "Product B", Price: "$200", Stock: "Out of Stock" }
// ]
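The header-to-dict mapping behind `type: 'table'` can be reproduced with Python's standard-library HTML parser. A sketch, assuming the first row holds the headers (this is an illustration, not the engine's actual implementation):

```python
from html.parser import HTMLParser

class TableExtractor(HTMLParser):
    """Collect every <tr> of a table as a list of cell strings."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the <tr> currently open, else None
        self._cell = None  # text chunks of the <td>/<th> currently open

    def handle_starttag(self, tag, attrs):
        if tag == 'tr':
            self._row = []
        elif tag in ('td', 'th'):
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ('td', 'th') and self._row is not None:
            self._row.append(''.join(self._cell).strip())
            self._cell = None
        elif tag == 'tr' and self._row:
            self.rows.append(self._row)
            self._row = None

def table_to_dicts(html):
    """First row becomes the headers; remaining rows become dicts."""
    parser = TableExtractor()
    parser.feed(html)
    headers, *body = parser.rows
    return [dict(zip(headers, row)) for row in body]
```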
List Extraction
Extract structured lists:
extract = {
    'products': {
        'selector': '.product-grid .product-card',
        'type': 'list',
        'extract': {
            'name': 'h3',
            'price': '.price',
            'image': 'img@src',
            'specs': {
                'selector': '.spec',
                'multiple': True
            }
        }
    }
}
Pagination Data
Extract data across multiple pages:
const crawler = await client.crawl({
  startUrl: 'https://example.com/products?page=1',
  pagination: {
    nextSelector: 'a.next-page@href',
    maxPages: 10
  },
  extract: {
    products: {
      selector: '.product',
      multiple: true,
      extract: {
        name: '.product-name',
        price: '.product-price'
      }
    }
  }
});
Text Processing
Clean Text Extraction
Remove noise and get clean text:
extract: {
  articleText: {
    selector: 'article',
    textOnly: true,
    clean: true // Removes extra whitespace, ads, etc.
  },
  // With custom cleaning
  description: {
    selector: '.description',
    process: (text) => {
      return text
        .replace(/\s+/g, ' ')
        .trim()
        .substring(0, 200);
    }
  }
}
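The same whitespace-collapsing logic as the process callback above, written as a small Python helper for comparison (names are illustrative):

```python
import re

def clean_text(text, max_length=None):
    """Collapse runs of whitespace into single spaces and trim,
    optionally truncating to max_length characters."""
    cleaned = re.sub(r'\s+', ' ', text).strip()
    return cleaned[:max_length] if max_length else cleaned
```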
Regular Expression Extraction
Extract using regex patterns:
extract = {
    'phone': {
        'selector': '.contact',
        'regex': r'\b\d{3}[-.]?\d{3}[-.]?\d{4}\b'
    },
    'email': {
        'selector': '.contact',
        'regex': r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}'
    },
    'price': {
        'selector': '.price-text',
        'regex': r'\$[\d,]+\.?\d*',
        'type': 'float'  # Convert to number
    }
}
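These patterns work unchanged in plain Python re, which is a quick way to validate them against sample text before putting them in an extract config (the contact string below is made up):

```python
import re

contact = 'Call 555-123-4567 or email sales@example.com for a quote: $1,299.99'

phone = re.search(r'\b\d{3}[-.]?\d{3}[-.]?\d{4}\b', contact).group()
email = re.search(r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}', contact).group()
# the 'type': 'float' conversion: strip '$' and ',' before float()
price = float(re.search(r'\$[\d,]+\.?\d*', contact).group().strip('$').replace(',', ''))
```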
JSON-LD and Microdata
Extract structured data embedded in pages:
const result = await client.scrape({
  url: 'https://example.com/product',
  extract: {
    // Extract from JSON-LD
    structured: {
      selector: 'script[type="application/ld+json"]',
      type: 'json',
      parse: true
    },
    // Extract microdata
    product: {
      selector: '[itemtype="https://schema.org/Product"]',
      microdata: true
    }
  }
});
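Because JSON-LD is just JSON inside a script tag, the parse step amounts to locating those blocks and feeding them to a JSON parser. A regex-based sketch (fine for illustration; a production extractor would walk a real HTML parse tree instead):

```python
import json
import re

def extract_json_ld(html):
    """Parse every application/ld+json script block into a Python object."""
    pattern = r'<script[^>]*type="application/ld\+json"[^>]*>(.*?)</script>'
    return [json.loads(block) for block in re.findall(pattern, html, re.DOTALL)]
```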
AI-Powered Extraction
Let AI understand and extract data:
const result = await client.scrape({
  url: 'https://example.com/article',
  aiExtract: {
    // Natural language queries
    author: "Who wrote this article?",
    publishDate: "When was this published?",
    mainPoints: "What are the main points? (list)",
    sentiment: "What is the overall sentiment?",
    // Structured extraction
    product: {
      query: "Extract product information",
      schema: {
        name: "string",
        price: "number",
        features: "array",
        available: "boolean"
      }
    }
  }
});
Custom AI Prompts
ai_extract = {
    'summary': {
        'prompt': 'Summarize this article in 3 bullet points',
        'max_tokens': 150
    },
    'entities': {
        'prompt': 'Extract all company names, people, and locations mentioned',
        'format': 'json'
    },
    'classification': {
        'prompt': 'Classify this content into one of: news, blog, product, documentation',
        'choices': ['news', 'blog', 'product', 'documentation']
    }
}
Data Transformation
Type Conversion
Convert extracted data to appropriate types:
extract: {
  price: {
    selector: '.price',
    type: 'number', // Converts "$29.99" to 29.99
    currency: 'USD'
  },
  inStock: {
    selector: '.availability',
    type: 'boolean', // Converts "In Stock" to true
    truthy: ['In Stock', 'Available']
  },
  rating: {
    selector: '.stars',
    type: 'float',
    attribute: 'data-rating'
  },
  publishDate: {
    selector: '.date',
    type: 'date',
    format: 'YYYY-MM-DD'
  }
}
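The conversions above boil down to a few small functions. A Python sketch of what each type option might do (function names are illustrative; note that strptime uses %Y-%m-%d codes rather than the YYYY-MM-DD notation shown in the config):

```python
import re
from datetime import datetime

def to_number(raw):
    """'$29.99' -> 29.99: drop currency symbols and thousands separators."""
    return float(re.sub(r'[^\d.]', '', raw))

def to_boolean(raw, truthy=('In Stock', 'Available')):
    """Map availability text onto True/False via a truthy whitelist."""
    return raw.strip() in truthy

def to_date(raw, fmt='%Y-%m-%d'):
    """Parse a date string into a datetime.date."""
    return datetime.strptime(raw.strip(), fmt).date()
```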
Custom Transformations
Apply custom processing functions:
def process_price(value):
    # Remove currency symbol and thousands separators, convert to float
    return float(value.replace('$', '').replace(',', ''))

def normalize_date(value):
    # Parse a variety of date formats into ISO 8601
    from dateutil import parser
    return parser.parse(value).isoformat()

extract = {
    'price': {
        'selector': '.price',
        'transform': process_price
    },
    'date': {
        'selector': '.published',
        'transform': normalize_date
    }
}
Conditional Extraction
Extract based on conditions:
extract: {
  // Extract if element exists
  salePrice: {
    selector: '.sale-price',
    optional: true
  },
  // Conditional extraction
  availability: {
    conditions: [
      {
        selector: '.in-stock',
        exists: true,
        value: 'In Stock'
      },
      {
        selector: '.out-of-stock',
        exists: true,
        value: 'Out of Stock'
      }
    ],
    default: 'Unknown'
  },
  // Fallback selectors
  title: {
    selectors: [
      'h1.product-title',
      'h2.title',
      'meta[property="og:title"]@content'
    ]
  }
}
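Fallback selectors are resolved first-match-wins. The traversal can be sketched in a few lines of Python; here page is a plain dict from selector to extracted text, standing in for a real DOM query, purely to show the ordering:

```python
def resolve_fallbacks(page, selectors, default=None):
    """Return the value of the first selector that yields anything
    non-empty, or the default when none of them match."""
    for selector in selectors:
        value = page.get(selector)
        if value:
            return value
    return default
```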
Performance Optimization
Selective Extraction
Only extract what you need:
// Bad - fetches the entire page, then filters client-side
const page = await client.scrape({
  url: 'https://example.com',
  format: 'json'
});
const title = page.content.querySelector('h1').textContent;

// Good - extracts only the data you need
const result = await client.scrape({
  url: 'https://example.com',
  extract: {
    title: 'h1'
  }
});
Batch Extraction
Extract from multiple URLs efficiently:
urls = [
    'https://example.com/product/1',
    'https://example.com/product/2',
    'https://example.com/product/3'
]

results = client.batch_scrape(
    urls=urls,
    extract={
        'name': 'h1',
        'price': '.price',
        'stock': '.availability'
    },
    concurrency=5
)
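The concurrency option amounts to fanning the URL list across a bounded worker pool while preserving input order. A self-contained sketch with Python's standard library (scrape_one stands in for a single-URL scrape call; batch_scrape here is illustrative, not the client's implementation):

```python
from concurrent.futures import ThreadPoolExecutor

def batch_scrape(urls, scrape_one, concurrency=5):
    """Run scrape_one over urls with at most `concurrency` workers.

    pool.map guarantees results come back in the same order as `urls`,
    regardless of which requests finish first.
    """
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        return list(pool.map(scrape_one, urls))
```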
Error Handling
Handle extraction failures gracefully:
extract: {
  price: {
    selector: '.price',
    required: true,
    onError: 'skip' // or 'default' or 'fail'
  },
  description: {
    selector: '.description',
    default: 'No description available',
    maxLength: 500
  },
  images: {
    selector: 'img.product-image@src',
    multiple: true,
    validate: (urls) => urls.filter(url => url.startsWith('https'))
  }
}
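Conceptually, each extracted value is run through its field rules in turn: missing values fall back to the default (or raise, if the field is required), then length limits and validators apply. A hedged Python sketch of that pipeline (apply_field_rules and its behavior are illustrative assumptions, not the engine's actual code):

```python
def apply_field_rules(value, rule):
    """Post-process one extracted value according to its rule dict."""
    if value is None:
        if rule.get('required') and rule.get('onError') == 'fail':
            raise ValueError('required field missing')
        value = rule.get('default')
    # Truncate over-long strings
    if isinstance(value, str) and 'maxLength' in rule:
        value = value[:rule['maxLength']]
    # Run the user-supplied validator/filter last
    if value is not None and 'validate' in rule:
        value = rule['validate'](value)
    return value
```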
Real-World Examples
E-commerce Product Extraction
const productExtractor = {
  extract: {
    product: {
      name: 'h1[itemprop="name"]',
      brand: '[itemprop="brand"]',
      price: {
        selector: '[itemprop="price"]@content',
        type: 'number'
      },
      currency: '[itemprop="priceCurrency"]@content',
      availability: {
        selector: '[itemprop="availability"]@href',
        transform: (val) => val.includes('InStock')
      },
      images: {
        selector: '.product-images img@src',
        multiple: true,
        limit: 5
      },
      features: {
        selector: '.feature-list li',
        multiple: true
      },
      rating: {
        value: '[itemprop="ratingValue"]@content',
        count: '[itemprop="reviewCount"]'
      }
    }
  }
};
Article/Blog Extraction
article_extractor = {
    'extract': {
        'article': {
            'title': 'h1.article-title',
            'author': '.author-name',
            'publishDate': {
                'selector': 'time[datetime]@datetime',
                'type': 'date'
            },
            'category': '.category a',
            'tags': {
                'selector': '.tag',
                'multiple': True
            },
            'content': {
                'selector': '.article-content',
                'clean': True,
                'markdown': True  # Convert to markdown
            },
            'relatedArticles': {
                'selector': '.related-article',
                'multiple': True,
                'extract': {
                    'title': 'h3',
                    'url': 'a@href'
                }
            }
        }
    }
}
Best Practices
- Use Specific Selectors: More specific selectors are faster and more reliable
- Validate Extracted Data: Always validate critical data
- Handle Missing Data: Use defaults and optional flags
- Test Selectors: Test on multiple pages before production
- Monitor Changes: Set up alerts for extraction failures
- Use AI Wisely: AI extraction is powerful but more expensive
Next Steps
- Explore JavaScript Rendering for dynamic content
- Learn about Error Handling strategies
- Read about Webhooks for async processing
- Check out Batch Processing for scale