
Data Extraction

Master the art of extracting structured data from web pages using ActiCrawl's powerful extraction engine. Learn various techniques from simple CSS selectors to advanced AI-powered extraction.

Extraction Basics

ActiCrawl supports multiple methods to extract data from web pages:

  • CSS Selectors: Target elements using standard CSS syntax
  • XPath: Advanced path-based selection
  • JSON-LD: Extract structured data from JSON-LD scripts
  • Regex Patterns: Extract text using regular expressions
  • AI Extraction: Let AI understand and extract data intelligently

CSS Selector Extraction

The most common and straightforward method:

javascript
const result = await client.scrape({
  url: 'https://example.com/product',
  extract: {
    title: 'h1.product-title',
    price: '.price-current',
    description: '.product-description',
    image: 'img.main-image@src',
    rating: '.rating@data-rating'
  }
});

console.log(result.extracted);
// {
//   title: "Premium Wireless Headphones",
//   price: "$299.99",
//   description: "High-quality audio...",
//   image: "https://example.com/img/product.jpg",
//   rating: "4.5"
// }

Attribute Extraction

Extract specific attributes using @attribute:

javascript
extract: {
  imageUrl: 'img#product-image@src',
  imageAlt: 'img#product-image@alt',
  linkHref: 'a.product-link@href',
  dataId: 'div.product@data-product-id',
  metaDescription: 'meta[name="description"]@content'
}
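The `selector@attribute` convention splits on the last `@`: everything before it is a CSS selector, everything after it names the attribute to read (no `@` means "take the element's text"). A minimal sketch of that split, purely illustrative (`parse_target` is not part of the ActiCrawl client):

```python
def parse_target(target: str):
    """Split a 'selector@attribute' string into its two parts.

    Returns (selector, attribute); attribute is None when the
    target refers to the element's text content.
    """
    if '@' in target:
        # rsplit: only the last '@' separates selector from attribute
        selector, attribute = target.rsplit('@', 1)
        return selector, attribute
    return target, None

parse_target('img#product-image@src')              # ('img#product-image', 'src')
parse_target('meta[name="description"]@content')   # ('meta[name="description"]', 'content')
parse_target('h1')                                 # ('h1', None)
```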

Multiple Elements

Extract arrays of elements:

javascript
extract: {
  // Single element
  title: 'h1',

  // Multiple elements
  features: {
    selector: 'li.feature',
    multiple: true
  },

  // Nested extraction
  reviews: {
    selector: '.review',
    multiple: true,
    extract: {
      author: '.reviewer-name',
      rating: '.stars@data-rating',
      comment: '.review-text',
      date: '.review-date'
    }
  }
}

XPath Extraction

For complex selections that CSS can't handle:

python
result = client.scrape(
    url='https://example.com/article',
    extract={
        # Text after specific label
        'author': '//span[text()="Author:"]/following-sibling::text()',

        # Table cell by header
        'price': '//th[text()="Price"]/following-sibling::td/text()',

        # Complex conditions
        'in_stock': '//div[@class="availability" and contains(text(), "In Stock")]',

        # Parent navigation
        'category': '//li[@class="current"]/parent::ul/@data-category'
    }
)

Advanced Extraction Patterns

Table Extraction

Extract structured data from tables:

javascript
const tableData = await client.scrape({
  url: 'https://example.com/data',
  extract: {
    table: {
      selector: 'table#data-table',
      type: 'table',
      headers: 'auto', // or specify: ['Name', 'Price', 'Stock']
      skipRows: 1 // Skip header row
    }
  }
});

// Result:
// [
//   { Name: "Product A", Price: "$100", Stock: "In Stock" },
//   { Name: "Product B", Price: "$200", Stock: "Out of Stock" }
// ]
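The `headers: 'auto'` behaviour (first row becomes the keys, each following row becomes a dict) can be reproduced locally with Python's standard-library `html.parser` — useful for checking what a table will flatten to before wiring it into a scrape. This is a sketch of the idea, not ActiCrawl's internal implementation:

```python
from html.parser import HTMLParser

class TableExtractor(HTMLParser):
    """Collect an HTML table's rows as lists of cell strings."""

    def __init__(self):
        super().__init__()
        self.rows, self._row, self._cell = [], None, None

    def handle_starttag(self, tag, attrs):
        if tag == 'tr':
            self._row = []
        elif tag in ('td', 'th'):
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ('td', 'th') and self._row is not None:
            self._row.append(''.join(self._cell).strip())
            self._cell = None
        elif tag == 'tr' and self._row:
            self.rows.append(self._row)
            self._row = None

def table_to_dicts(html: str):
    """First row supplies the headers ('auto'); remaining rows become dicts."""
    parser = TableExtractor()
    parser.feed(html)
    headers, *body = parser.rows
    return [dict(zip(headers, row)) for row in body]
```

For example, `table_to_dicts('<table><tr><th>Name</th><th>Price</th></tr><tr><td>Product A</td><td>$100</td></tr></table>')` yields `[{'Name': 'Product A', 'Price': '$100'}]`.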

List Extraction

Extract structured lists:

python
extract = {
    'products': {
        'selector': '.product-grid .product-card',
        'type': 'list',
        'extract': {
            'name': 'h3',
            'price': '.price',
            'image': 'img@src',
            'specs': {
                'selector': '.spec',
                'multiple': True
            }
        }
    }
}

Pagination Data

Extract data across multiple pages:

javascript
const crawler = await client.crawl({
  startUrl: 'https://example.com/products?page=1',
  pagination: {
    nextSelector: 'a.next-page@href',
    maxPages: 10
  },
  extract: {
    products: {
      selector: '.product',
      multiple: true,
      extract: {
        name: '.product-name',
        price: '.product-price'
      }
    }
  }
});
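Conceptually, the crawler above runs a follow-the-next-link loop with a page cap. A sketch of that loop, where `fetch_page` is a hypothetical stand-in for a single scrape call (not an ActiCrawl function):

```python
def crawl_pages(start_url, fetch_page, max_pages=10):
    """Follow 'next page' links, collecting items from every page.

    fetch_page(url) must return a dict with 'items' (a list) and
    'next_url' (the next page's URL, or None on the last page).
    """
    items, url, seen = [], start_url, set()
    while url and url not in seen and len(seen) < max_pages:
        seen.add(url)              # guard against pagination loops
        page = fetch_page(url)
        items.extend(page['items'])
        url = page.get('next_url')
    return items
```

The `seen` set mirrors why a `maxPages` cap matters: without one, a site whose "next" link cycles back would crawl forever.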

Text Processing

Clean Text Extraction

Remove noise and get clean text:

javascript
extract: {
  articleText: {
    selector: 'article',
    textOnly: true,
    clean: true // Removes extra whitespace, ads, etc.
  },

  // With custom cleaning
  description: {
    selector: '.description',
    process: (text) => {
      return text
        .replace(/\s+/g, ' ')
        .trim()
        .substring(0, 200);
    }
  }
}
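The custom `process` callback above (collapse whitespace, trim, cap at 200 characters) has a one-line Python equivalent, handy when post-processing results client-side:

```python
import re

def clean_text(text: str, limit: int = 200) -> str:
    """Collapse whitespace runs, trim, and cap the length --
    the same steps as the custom `process` callback above."""
    return re.sub(r'\s+', ' ', text).strip()[:limit]

clean_text('  High-quality\n\n  audio  ')  # 'High-quality audio'
```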

Regular Expression Extraction

Extract using regex patterns:

python
extract = {
    'phone': {
        'selector': '.contact',
        'regex': r'\b\d{3}[-.]?\d{3}[-.]?\d{4}\b'
    },
    'email': {
        'selector': '.contact',
        'regex': r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}'
    },
    'price': {
        'selector': '.price-text',
        'regex': r'\$[\d,]+\.?\d*',
        'type': 'float'  # Convert to number
    }
}
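It pays to exercise regex patterns in plain Python before deploying them. The snippet below runs the three patterns above against sample text with the standard-library `re` module, including the strip-and-convert step implied by `'type': 'float'`:

```python
import re

PHONE = re.compile(r'\b\d{3}[-.]?\d{3}[-.]?\d{4}\b')
EMAIL = re.compile(r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}')
PRICE = re.compile(r'\$[\d,]+\.?\d*')

text = 'Call 555-123-4567 or email sales@example.com. Now $1,299.99!'

phone = PHONE.search(text).group()   # '555-123-4567'
email = EMAIL.search(text).group()   # 'sales@example.com'
# Mirror the 'type': 'float' conversion: drop '$' and ',' before parsing
price = float(PRICE.search(text).group().replace('$', '').replace(',', ''))
# price == 1299.99
```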

JSON-LD and Microdata

Extract structured data embedded in pages:

javascript
const result = await client.scrape({
  url: 'https://example.com/product',
  extract: {
    // Extract from JSON-LD
    structured: {
      selector: 'script[type="application/ld+json"]',
      type: 'json',
      parse: true
    },

    // Extract microdata
    product: {
      selector: '[itemtype="https://schema.org/Product"]',
      microdata: true
    }
  }
});

AI-Powered Extraction

Let AI understand and extract data:

javascript
const result = await client.scrape({
  url: 'https://example.com/article',
  aiExtract: {
    // Natural language queries
    author: "Who wrote this article?",
    publishDate: "When was this published?",
    mainPoints: "What are the main points? (list)",
    sentiment: "What is the overall sentiment?",

    // Structured extraction
    product: {
      query: "Extract product information",
      schema: {
        name: "string",
        price: "number",
        features: "array",
        available: "boolean"
      }
    }
  }
});

Custom AI Prompts

python
ai_extract = {
    'summary': {
        'prompt': 'Summarize this article in 3 bullet points',
        'max_tokens': 150
    },
    'entities': {
        'prompt': 'Extract all company names, people, and locations mentioned',
        'format': 'json'
    },
    'classification': {
        'prompt': 'Classify this content into one of: news, blog, product, documentation',
        'choices': ['news', 'blog', 'product', 'documentation']
    }
}

Data Transformation

Type Conversion

Convert extracted data to appropriate types:

javascript
extract: {
  price: {
    selector: '.price',
    type: 'number', // Converts "$29.99" to 29.99
    currency: 'USD'
  },
  inStock: {
    selector: '.availability',
    type: 'boolean', // Converts "In Stock" to true
    truthy: ['In Stock', 'Available']
  },
  rating: {
    selector: '.stars',
    type: 'float',
    attribute: 'data-rating'
  },
  publishDate: {
    selector: '.date',
    type: 'date',
    format: 'YYYY-MM-DD'
  }
}
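The `number` and `boolean` conversions above amount to a currency strip and a membership test against the `truthy` list. A sketch of that logic in Python (illustrative helpers, not part of the client API):

```python
def to_number(raw: str) -> float:
    """'$29.99' -> 29.99, mirroring type: 'number' above."""
    return float(raw.replace('$', '').replace(',', '').strip())

def to_boolean(raw: str, truthy=('In Stock', 'Available')) -> bool:
    """'In Stock' -> True; anything outside the truthy list -> False."""
    return raw.strip() in truthy

to_number('$1,299.99')    # 1299.99
to_boolean('In Stock')    # True
to_boolean('Sold Out')    # False
```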

Custom Transformations

Apply custom processing functions:

python
def process_price(value):
    # Remove currency symbol and convert to float
    return float(value.replace('$', '').replace(',', ''))

def normalize_date(value):
    # Convert various date formats
    from dateutil import parser
    return parser.parse(value).isoformat()

extract = {
    'price': {
        'selector': '.price',
        'transform': process_price
    },
    'date': {
        'selector': '.published',
        'transform': normalize_date
    }
}
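Applying a spec like this to raw extracted strings takes only a small driver: walk the fields, call each one's `transform` if present, and pass everything else through. A sketch (the `apply_transforms` helper is hypothetical, shown only to illustrate the flow):

```python
def apply_transforms(raw: dict, spec: dict) -> dict:
    """Run each field's optional 'transform' callable over the raw
    extracted value; fields without one pass through unchanged."""
    out = {}
    for field, value in raw.items():
        transform = spec.get(field, {}).get('transform')
        out[field] = transform(value) if transform else value
    return out

spec = {
    'price': {'transform': lambda v: float(v.replace('$', '').replace(',', ''))},
}
apply_transforms({'price': '$1,299.00', 'name': 'Widget'}, spec)
# {'price': 1299.0, 'name': 'Widget'}
```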

Conditional Extraction

Extract based on conditions:

javascript
extract: {
  // Extract if element exists
  salePrice: {
    selector: '.sale-price',
    optional: true
  },

  // Conditional extraction
  availability: {
    conditions: [
      {
        selector: '.in-stock',
        exists: true,
        value: 'In Stock'
      },
      {
        selector: '.out-of-stock',
        exists: true,
        value: 'Out of Stock'
      }
    ],
    default: 'Unknown'
  },

  // Fallback selectors
  title: {
    selectors: [
      'h1.product-title',
      'h2.title',
      'meta[property="og:title"]@content'
    ]
  }
}
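Fallback selectors resolve to the first one that produces a non-empty result. A sketch of that resolution order, where `lookup` is a stand-in for running a selector against the page (not an ActiCrawl API):

```python
def resolve_fallbacks(selectors, lookup):
    """Try selectors in order; return the first non-empty match.

    lookup(selector) returns the matched text, or None when the
    selector matches nothing on the page.
    """
    for selector in selectors:
        value = lookup(selector)
        if value:
            return value
    return None

page = {'h2.title': 'Fallback Title'}          # fake page: selector -> text
resolve_fallbacks(['h1.product-title', 'h2.title'], page.get)
# 'Fallback Title' -- h1.product-title matched nothing, h2.title did
```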

Performance Optimization

Selective Extraction

Only extract what you need:

javascript
// Bad - fetches the entire page, then filters client-side
const fullPage = await client.scrape({
  url: 'https://example.com',
  format: 'json'
});
const title = fullPage.content.querySelector('h1').text;

// Good - extracts only the needed data server-side
const result = await client.scrape({
  url: 'https://example.com',
  extract: {
    title: 'h1'
  }
});

Batch Extraction

Extract from multiple URLs efficiently:

python
urls = [
    'https://example.com/product/1',
    'https://example.com/product/2',
    'https://example.com/product/3'
]

results = client.batch_scrape(
    urls=urls,
    extract={
        'name': 'h1',
        'price': '.price',
        'stock': '.availability'
    },
    concurrency=5
)
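The `concurrency` parameter bounds how many scrapes run at once. If you ever need the same behaviour client-side, Python's standard library provides it directly; here `scrape_one` is a hypothetical stand-in for a single scrape call:

```python
from concurrent.futures import ThreadPoolExecutor

def batch_scrape(urls, scrape_one, concurrency=5):
    """Scrape many URLs with a bounded worker pool.

    pool.map preserves input order, so results line up with urls
    even though the calls run concurrently.
    """
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        return list(pool.map(scrape_one, urls))

batch_scrape(['u1', 'u2', 'u3'], lambda u: u.upper(), concurrency=2)
# ['U1', 'U2', 'U3']
```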

Error Handling

Handle extraction failures gracefully:

javascript
extract: {
  price: {
    selector: '.price',
    required: true,
    onError: 'skip' // or 'default' or 'fail'
  },

  description: {
    selector: '.description',
    default: 'No description available',
    maxLength: 500
  },

  images: {
    selector: 'img.product-image@src',
    multiple: true,
    validate: (urls) => urls.filter(url => url.startsWith('https'))
  }
}

Real-World Examples

E-commerce Product Extraction

javascript
const productExtractor = {
  extract: {
    product: {
      name: 'h1[itemprop="name"]',
      brand: '[itemprop="brand"]',
      price: {
        selector: '[itemprop="price"]@content',
        type: 'number'
      },
      currency: '[itemprop="priceCurrency"]@content',
      availability: {
        selector: '[itemprop="availability"]@href',
        transform: (val) => val.includes('InStock')
      },
      images: {
        selector: '.product-images img@src',
        multiple: true,
        limit: 5
      },
      features: {
        selector: '.feature-list li',
        multiple: true
      },
      rating: {
        value: '[itemprop="ratingValue"]@content',
        count: '[itemprop="reviewCount"]'
      }
    }
  }
};

Article/Blog Extraction

python
article_extractor = {
    'extract': {
        'article': {
            'title': 'h1.article-title',
            'author': '.author-name',
            'publishDate': {
                'selector': 'time[datetime]@datetime',
                'type': 'date'
            },
            'category': '.category a',
            'tags': {
                'selector': '.tag',
                'multiple': True
            },
            'content': {
                'selector': '.article-content',
                'clean': True,
                'markdown': True  # Convert to markdown
            },
            'relatedArticles': {
                'selector': '.related-article',
                'multiple': True,
                'extract': {
                    'title': 'h3',
                    'url': 'a@href'
                }
            }
        }
    }
}

Best Practices

  1. Use Specific Selectors: More specific selectors are faster and more reliable
  2. Validate Extracted Data: Always validate critical data
  3. Handle Missing Data: Use defaults and optional flags
  4. Test Selectors: Test on multiple pages before production
  5. Monitor Changes: Set up alerts for extraction failures
  6. Use AI Wisely: AI extraction is powerful but more expensive

Next Steps