Data Extraction
Master the art of extracting structured data from web pages with ActiCrawl's extraction engine. This guide covers techniques ranging from simple CSS selectors to AI-powered extraction.
Extraction Basics
ActiCrawl supports multiple methods to extract data from web pages:
- CSS Selectors: Target elements using standard CSS syntax
- XPath: Advanced path-based selection
- JSON-LD: Extract structured data from JSON-LD scripts
- Regex Patterns: Extract text using regular expressions
- AI Extraction: Let AI understand and extract data intelligently
CSS Selector Extraction
The most common and straightforward method:
const result = await client.scrape({
  url: 'https://example.com/product',
  extract: {
    title: 'h1.product-title',
    price: '.price-current',
    description: '.product-description',
    image: 'img.main-image@src',
    rating: '.rating@data-rating'
  }
});
console.log(result.extracted);
// {
// title: "Premium Wireless Headphones",
// price: "$299.99",
// description: "High-quality audio...",
// image: "https://example.com/img/product.jpg",
// rating: "4.5"
// }
Attribute Extraction
Extract specific attributes using the @attribute suffix:
extract: {
  imageUrl: 'img#product-image@src',
  imageAlt: 'img#product-image@alt',
  linkHref: 'a.product-link@href',
  dataId: 'div.product@data-product-id',
  metaDescription: 'meta[name="description"]@content'
}
Multiple Elements
Extract arrays of elements:
extract: {
  // Single element
  title: 'h1',
  // Multiple elements
  features: {
    selector: 'li.feature',
    multiple: true
  },
  // Nested extraction
  reviews: {
    selector: '.review',
    multiple: true,
    extract: {
      author: '.reviewer-name',
      rating: '.stars@data-rating',
      comment: '.review-text',
      date: '.review-date'
    }
  }
}
XPath Extraction
For complex selections that CSS can't handle:
result = client.scrape(
    url='https://example.com/article',
    extract={
        # Text after specific label
        'author': '//span[text()="Author:"]/following-sibling::text()',
        # Table cell by header
        'price': '//th[text()="Price"]/following-sibling::td/text()',
        # Complex conditions
        'in_stock': '//div[@class="availability" and contains(text(), "In Stock")]',
        # Parent navigation
        'category': '//li[@class="current"]/parent::ul/@data-category'
    }
)
Advanced Extraction Patterns
Table Extraction
Extract structured data from tables:
const tableData = await client.scrape({
  url: 'https://example.com/data',
  extract: {
    table: {
      selector: 'table#data-table',
      type: 'table',
      headers: 'auto', // or specify: ['Name', 'Price', 'Stock']
      skipRows: 1 // Skip header row
    }
  }
});
// Result:
// [
// { Name: "Product A", Price: "$100", Stock: "In Stock" },
// { Name: "Product B", Price: "$200", Stock: "Out of Stock" }
// ]
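The header-to-dict mapping behind `type: 'table'` can be reproduced with Python's standard-library HTML parser. A sketch, assuming the first row holds the headers (this is an illustration, not the engine's actual implementation):

```python
from html.parser import HTMLParser

class TableExtractor(HTMLParser):
    """Collect every <tr> of a table as a list of cell strings."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the <tr> currently open, else None
        self._cell = None  # text chunks of the <td>/<th> currently open

    def handle_starttag(self, tag, attrs):
        if tag == 'tr':
            self._row = []
        elif tag in ('td', 'th'):
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ('td', 'th') and self._row is not None:
            self._row.append(''.join(self._cell).strip())
            self._cell = None
        elif tag == 'tr' and self._row:
            self.rows.append(self._row)
            self._row = None

def table_to_dicts(html):
    """First row becomes the headers; remaining rows become dicts."""
    parser = TableExtractor()
    parser.feed(html)
    headers, *body = parser.rows
    return [dict(zip(headers, row)) for row in body]
```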
List Extraction
Extract structured lists:
extract = {
    'products': {
        'selector': '.product-grid .product-card',
        'type': 'list',
        'extract': {
            'name': 'h3',
            'price': '.price',
            'image': 'img@src',
            'specs': {
                'selector': '.spec',
                'multiple': True
            }
        }
    }
}
Pagination Data
Extract data across multiple pages:
const crawler = await client.crawl({
  startUrl: 'https://example.com/products?page=1',
  pagination: {
    nextSelector: 'a.next-page@href',
    maxPages: 10
  },
  extract: {
    products: {
      selector: '.product',
      multiple: true,
      extract: {
        name: '.product-name',
        price: '.product-price'
      }
    }
  }
});
Text Processing
Clean Text Extraction
Remove noise and get clean text:
extract: {
  articleText: {
    selector: 'article',
    textOnly: true,
    clean: true // Removes extra whitespace, ads, etc.
  },
  // With custom cleaning
  description: {
    selector: '.description',
    process: (text) => {
      return text
        .replace(/\s+/g, ' ')
        .trim()
        .substring(0, 200);
    }
  }
}
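The same whitespace-collapsing logic as the process callback above, written as a small Python helper for comparison (names are illustrative):

```python
import re

def clean_text(text, max_length=None):
    """Collapse runs of whitespace into single spaces and trim,
    optionally truncating to max_length characters."""
    cleaned = re.sub(r'\s+', ' ', text).strip()
    return cleaned[:max_length] if max_length else cleaned
```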
Regular Expression Extraction
Extract using regex patterns:
extract = {
    'phone': {
        'selector': '.contact',
        'regex': r'\b\d{3}[-.]?\d{3}[-.]?\d{4}\b'
    },
    'email': {
        'selector': '.contact',
        'regex': r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}'
    },
    'price': {
        'selector': '.price-text',
        'regex': r'\$[\d,]+\.?\d*',
        'type': 'float'  # Convert to number
    }
}
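These patterns work unchanged in plain Python re, which is a quick way to validate them against sample text before putting them in an extract config (the contact string below is made up):

```python
import re

contact = 'Call 555-123-4567 or email sales@example.com for a quote: $1,299.99'

phone = re.search(r'\b\d{3}[-.]?\d{3}[-.]?\d{4}\b', contact).group()
email = re.search(r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}', contact).group()
# the 'type': 'float' conversion: strip '$' and ',' before float()
price = float(re.search(r'\$[\d,]+\.?\d*', contact).group().strip('$').replace(',', ''))
```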
JSON-LD and Microdata
Extract structured data embedded in pages:
const result = await client.scrape({
  url: 'https://example.com/product',
  extract: {
    // Extract from JSON-LD
    structured: {
      selector: 'script[type="application/ld+json"]',
      type: 'json',
      parse: true
    },
    // Extract microdata
    product: {
      selector: '[itemtype="https://schema.org/Product"]',
      microdata: true
    }
  }
});
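Because JSON-LD is just JSON inside a script tag, the parse step amounts to locating those blocks and feeding them to a JSON parser. A regex-based sketch (fine for illustration; a production extractor would walk a real HTML parse tree instead):

```python
import json
import re

def extract_json_ld(html):
    """Parse every application/ld+json script block into a Python object."""
    pattern = r'<script[^>]*type="application/ld\+json"[^>]*>(.*?)</script>'
    return [json.loads(block) for block in re.findall(pattern, html, re.DOTALL)]
```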
AI-Powered Extraction
Let AI understand and extract data:
const result = await client.scrape({
  url: 'https://example.com/article',
  aiExtract: {
    // Natural language queries
    author: "Who wrote this article?",
    publishDate: "When was this published?",
    mainPoints: "What are the main points? (list)",
    sentiment: "What is the overall sentiment?",
    // Structured extraction
    product: {
      query: "Extract product information",
      schema: {
        name: "string",
        price: "number",
        features: "array",
        available: "boolean"
      }
    }
  }
});
Custom AI Prompts
ai_extract = {
    'summary': {
        'prompt': 'Summarize this article in 3 bullet points',
        'max_tokens': 150
    },
    'entities': {
        'prompt': 'Extract all company names, people, and locations mentioned',
        'format': 'json'
    },
    'classification': {
        'prompt': 'Classify this content into one of: news, blog, product, documentation',
        'choices': ['news', 'blog', 'product', 'documentation']
    }
}
Data Transformation
Type Conversion
Convert extracted data to appropriate types:
extract: {
  price: {
    selector: '.price',
    type: 'number', // Converts "$29.99" to 29.99
    currency: 'USD'
  },
  inStock: {
    selector: '.availability',
    type: 'boolean', // Converts "In Stock" to true
    truthy: ['In Stock', 'Available']
  },
  rating: {
    selector: '.stars',
    type: 'float',
    attribute: 'data-rating'
  },
  publishDate: {
    selector: '.date',
    type: 'date',
    format: 'YYYY-MM-DD'
  }
}
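The conversions above boil down to a few small functions. A Python sketch of what each type option might do (function names are illustrative; note that strptime uses %Y-%m-%d codes rather than the YYYY-MM-DD notation shown in the config):

```python
import re
from datetime import datetime

def to_number(raw):
    """'$29.99' -> 29.99: drop currency symbols and thousands separators."""
    return float(re.sub(r'[^\d.]', '', raw))

def to_boolean(raw, truthy=('In Stock', 'Available')):
    """Map availability text onto True/False via a truthy whitelist."""
    return raw.strip() in truthy

def to_date(raw, fmt='%Y-%m-%d'):
    """Parse a date string into a datetime.date."""
    return datetime.strptime(raw.strip(), fmt).date()
```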
Custom Transformations
Apply custom processing functions:
def process_price(value):
    # Remove currency symbol and thousands separators, convert to float
    return float(value.replace('$', '').replace(',', ''))

def normalize_date(value):
    # Parse a variety of date formats into ISO 8601
    from dateutil import parser
    return parser.parse(value).isoformat()

extract = {
    'price': {
        'selector': '.price',
        'transform': process_price
    },
    'date': {
        'selector': '.published',
        'transform': normalize_date
    }
}
Conditional Extraction
Extract based on conditions:
extract: {
  // Extract if element exists
  salePrice: {
    selector: '.sale-price',
    optional: true
  },
  // Conditional extraction
  availability: {
    conditions: [
      {
        selector: '.in-stock',
        exists: true,
        value: 'In Stock'
      },
      {
        selector: '.out-of-stock',
        exists: true,
        value: 'Out of Stock'
      }
    ],
    default: 'Unknown'
  },
  // Fallback selectors
  title: {
    selectors: [
      'h1.product-title',
      'h2.title',
      'meta[property="og:title"]@content'
    ]
  }
}
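Fallback selectors are resolved first-match-wins. The traversal can be sketched in a few lines of Python; here page is a plain dict from selector to extracted text, standing in for a real DOM query, purely to show the ordering:

```python
def resolve_fallbacks(page, selectors, default=None):
    """Return the value of the first selector that yields anything
    non-empty, or the default when none of them match."""
    for selector in selectors:
        value = page.get(selector)
        if value:
            return value
    return default
```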
Performance Optimization
Selective Extraction
Only extract what you need:
// Bad - fetches the entire page, then filters client-side
const page = await client.scrape({
  url: 'https://example.com',
  format: 'json'
});
const title = page.content.querySelector('h1').textContent;

// Good - extracts only the data you need
const result = await client.scrape({
  url: 'https://example.com',
  extract: {
    title: 'h1'
  }
});
Batch Extraction
Extract from multiple URLs efficiently:
urls = [
    'https://example.com/product/1',
    'https://example.com/product/2',
    'https://example.com/product/3'
]

results = client.batch_scrape(
    urls=urls,
    extract={
        'name': 'h1',
        'price': '.price',
        'stock': '.availability'
    },
    concurrency=5
)
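The concurrency option amounts to fanning the URL list across a bounded worker pool while preserving input order. A self-contained sketch with Python's standard library (scrape_one stands in for a single-URL scrape call; batch_scrape here is illustrative, not the client's implementation):

```python
from concurrent.futures import ThreadPoolExecutor

def batch_scrape(urls, scrape_one, concurrency=5):
    """Run scrape_one over urls with at most `concurrency` workers.

    pool.map guarantees results come back in the same order as `urls`,
    regardless of which requests finish first.
    """
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        return list(pool.map(scrape_one, urls))
```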
Error Handling
Handle extraction failures gracefully:
extract: {
  price: {
    selector: '.price',
    required: true,
    onError: 'skip' // or 'default' or 'fail'
  },
  description: {
    selector: '.description',
    default: 'No description available',
    maxLength: 500
  },
  images: {
    selector: 'img.product-image@src',
    multiple: true,
    validate: (urls) => urls.filter(url => url.startsWith('https'))
  }
}
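Conceptually, each extracted value is run through its field rules in turn: missing values fall back to the default (or raise, if the field is required), then length limits and validators apply. A hedged Python sketch of that pipeline (apply_field_rules and its behavior are illustrative assumptions, not the engine's actual code):

```python
def apply_field_rules(value, rule):
    """Post-process one extracted value according to its rule dict."""
    if value is None:
        if rule.get('required') and rule.get('onError') == 'fail':
            raise ValueError('required field missing')
        value = rule.get('default')
    # Truncate over-long strings
    if isinstance(value, str) and 'maxLength' in rule:
        value = value[:rule['maxLength']]
    # Run the user-supplied validator/filter last
    if value is not None and 'validate' in rule:
        value = rule['validate'](value)
    return value
```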
Real-World Examples
E-commerce Product Extraction
const productExtractor = {
  extract: {
    product: {
      name: 'h1[itemprop="name"]',
      brand: '[itemprop="brand"]',
      price: {
        selector: '[itemprop="price"]@content',
        type: 'number'
      },
      currency: '[itemprop="priceCurrency"]@content',
      availability: {
        selector: '[itemprop="availability"]@href',
        transform: (val) => val.includes('InStock')
      },
      images: {
        selector: '.product-images img@src',
        multiple: true,
        limit: 5
      },
      features: {
        selector: '.feature-list li',
        multiple: true
      },
      rating: {
        value: '[itemprop="ratingValue"]@content',
        count: '[itemprop="reviewCount"]'
      }
    }
  }
};
Article/Blog Extraction
article_extractor = {
    'extract': {
        'article': {
            'title': 'h1.article-title',
            'author': '.author-name',
            'publishDate': {
                'selector': 'time[datetime]@datetime',
                'type': 'date'
            },
            'category': '.category a',
            'tags': {
                'selector': '.tag',
                'multiple': True
            },
            'content': {
                'selector': '.article-content',
                'clean': True,
                'markdown': True  # Convert to markdown
            },
            'relatedArticles': {
                'selector': '.related-article',
                'multiple': True,
                'extract': {
                    'title': 'h3',
                    'url': 'a@href'
                }
            }
        }
    }
}
Best Practices
- Use Specific Selectors: More specific selectors are faster and more reliable
- Validate Extracted Data: Always validate critical data
- Handle Missing Data: Use defaults and optional flags
- Test Selectors: Test on multiple pages before production
- Monitor Changes: Set up alerts for extraction failures
- Use AI Wisely: AI extraction is powerful but more expensive
Next Steps
- Explore JavaScript Rendering for dynamic content
- Learn about Error Handling strategies
- Read about Webhooks for async processing
- Check out Batch Processing for scale