Batch Processing

Use ActiCrawl's batch processing to handle multiple URLs efficiently. This guide covers scraping thousands of pages, handling rate limits, and optimizing performance at scale.

Overview

Batch processing lets you:
- Scrape multiple URLs concurrently
- Manage rate limits automatically
- Handle failures gracefully
- Monitor progress in real time
- Optimize resource usage

Basic Batch Scraping

Simple Batch Request

Process multiple URLs in a single request:

javascript

const results = await client.batchScrape({
  urls: [
    'https://example.com/page1',
    'https://example.com/page2',
    'https://example.com/page3'
  ],
  format: 'markdown'
});

// The results array preserves the order of the input URLs
results.forEach((result, index) => {
  console.log(`URL ${index + 1}:`, result.url);
  console.log('Status:', result.status);
  console.log('Content:', result.content);
});

          

Batch with Extraction

Apply the same extraction rules to multiple URLs:

python

urls = [
    'https://shop.com/product/1',
    'https://shop.com/product/2',
    'https://shop.com/product/3'
]

results = client.batch_scrape(
    urls=urls,
    extract={
        'name': 'h1.product-name',
        'price': '.price',
        'availability': '.stock-status',
        'image': 'img.main-image@src'
    }
)

# Process the results, skipping URLs that failed
for result in results:
    if result['status'] == 'success':
        print(f"Product: {result['extracted']['name']}")
        print(f"Price: {result['extracted']['price']}")

          

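Advanced Batch Configuration

Concurrency Control

Tune performance with concurrent requests:

javascript

const results = await client.batchScrape({
  urls: productUrls,
  concurrency: 5,  // process 5 URLs at a time
  delay: 1000,     // 1-second delay between requests (ms)
  retries: 3,      // retry failed requests up to 3 times
  timeout: 30000   // 30-second timeout per URL (ms)
});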
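
To make the effect of the concurrency setting concrete, here is a minimal Python sketch of a concurrency cap built on asyncio.Semaphore. It is not ActiCrawl's implementation: run_with_concurrency and the fetch callable are hypothetical names introduced only for illustration.

```python
import asyncio


async def run_with_concurrency(urls, fetch, concurrency=5):
    """Run fetch(url) for every URL with at most `concurrency` in flight.

    `fetch` is a hypothetical async callable standing in for a real request.
    Results come back in the same order as `urls`.
    """
    semaphore = asyncio.Semaphore(concurrency)

    async def worker(url):
        # Waits here while `concurrency` other workers hold the semaphore
        async with semaphore:
            return await fetch(url)

    # gather() preserves input order regardless of completion order
    return await asyncio.gather(*(worker(url) for url in urls))
```

A higher cap finishes sooner but puts more simultaneous load on the target site, which is why it is usually tuned together with the delay and rate limit settings.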

          

Rate Limiting

Respect the target website's limits:

python

results = client.batch_scrape(
    urls=urls,
    rate_limit={
        'requests_per_second': 2,
        'burst': 5,  # allow bursts of up to 5 requests
        'per_domain': True  # apply the limit per domain
    }
)
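
The requests_per_second and burst settings resemble the parameters of a token bucket. The sketch below shows how such a limiter could behave. It is an assumption about the general technique, not a description of how ActiCrawl enforces limits, and TokenBucket and allow_request are hypothetical names.

```python
import time
from urllib.parse import urlsplit


class TokenBucket:
    """Token bucket: refills at `rate` tokens per second, holds at most `burst`."""

    def __init__(self, rate, burst, clock=time.monotonic):
        self.rate = rate
        self.burst = burst
        self.clock = clock
        self.tokens = float(burst)  # start full so an initial burst is allowed
        self.updated = clock()

    def try_acquire(self):
        """Take one token if available; return True when a request may proceed."""
        now = self.clock()
        # Refill in proportion to elapsed time, capped at the burst size
        self.tokens = min(self.burst, self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


_buckets = {}  # one bucket per hostname, mirroring a per-domain limit


def allow_request(url, rate=2, burst=5):
    """Check the bucket belonging to the URL's host."""
    host = urlsplit(url).hostname
    if host not in _buckets:
        _buckets[host] = TokenBucket(rate, burst)
    return _buckets[host].try_acquire()
```

With rate=2 and burst=5, five requests to one host can go out immediately, after which roughly two per second are admitted. Requests to a different host draw from their own bucket.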

          

Batch with Headers

Send custom headers and cookies with every request:

javascript

const results = await client.batchScrape({
  urls: urls,
  headers: {
    'User-Agent': 'MyBot/1.0',
    'Accept-Language': 'ko-KR',
    'X-Custom-Header': 'value'
  },
  cookies: {
    'session': 'abc123',
    'preferences': 'lang=ko'
  }
});

          

Batch Crawling

Crawling Multiple Domains

Process entire websites in parallel:

python

crawl_configs = [
    {
        'start_url': 'https://site1.com',
        'max_pages': 100,
        'include_patterns': ['/products/*']
    },
    {
        'start_url': 'https://site2.com',
        'max_pages': 50,
        'include_patterns': ['/blog/*']
    }
]

results = client.batch_crawl(crawl_configs)

          

Dynamic URL Generation

Generate URLs programmatically:

javascript

// Generate pagination URLs (pages 1 to 50)
const baseUrl = 'https://example.com/products';
const urls = Array.from({length: 50}, (_, i) =>
  `${baseUrl}?page=${i + 1}`
);

// Generate category URLs (10 pages per category)
const categories = ['electronics', 'books', 'clothing'];
const categoryUrls = categories.flatMap(cat =>
  Array.from({length: 10}, (_, i) =>
    // Encode the name so spaces or non-ASCII characters stay valid in the path
    `${baseUrl}/${encodeURIComponent(cat)}?page=${i + 1}`
  )
);

const results = await client.batchScrape({
  urls: [...urls, ...categoryUrls],
  concurrency: 10
});
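
For Python users, the same URL lists can be built with the standard library. This is a sketch only: paginated_urls and category_urls are hypothetical helper names, and the query parameter name page is taken from the example above.

```python
from urllib.parse import quote, urlencode


def paginated_urls(base_url, pages, **params):
    """Return base_url with ?page=1 .. ?page=<pages>, plus any extra query params."""
    return [
        f"{base_url}?{urlencode({**params, 'page': n})}"
        for n in range(1, pages + 1)
    ]


def category_urls(base_url, categories, pages):
    """Return paginated URLs for every category, percent-encoding each name."""
    return [
        url
        for category in categories
        # safe='' also encodes '/', so a category name cannot add path segments
        for url in paginated_urls(f"{base_url}/{quote(category, safe='')}", pages)
    ]
```

Percent-encoding matters once category names contain spaces, ampersands, or non-ASCII text such as Korean.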

          

Error Handling and Retries

Comprehensive Error Management

javascript

const results = await client.batchScrape({
  urls: urls,
  retry: {
    attempts: 3,
    delay: 2000,  // base delay in ms, applied with exponential backoff
    on: ['timeout', 'network', '5xx']  // conditions that trigger a retry
  },
  onError: 'continue'  // keep processing the batch when a URL fails
});

// Process the results, handling failures separately
results.forEach(result => {
  if (result.status === 'success') {
    processSuccessfulResult(result);
  } else {
    logError({
      url: result.url,
      error: result.error,
      attempts: result.attempts
    });
  }
});
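
The retry block above pairs a base delay with exponential backoff and a list of retry conditions. The Python sketch below shows one plausible reading of those settings. The doubling factor and the way '5xx' maps to status codes are assumptions, and backoff_delays and should_retry are hypothetical helpers, not part of the ActiCrawl client.

```python
def backoff_delays(attempts=3, base_delay_ms=2000, factor=2):
    """Delay before each retry: base, base * factor, base * factor**2, ..."""
    return [base_delay_ms * factor ** i for i in range(attempts)]


def should_retry(error_kind, status_code=None,
                 retry_on=('timeout', 'network', '5xx')):
    """Decide whether a failure matches one of the configured retry conditions.

    ASSUMPTION: '5xx' means any HTTP status from 500 to 599.
    """
    if error_kind in retry_on:
        return True
    return (
        '5xx' in retry_on
        and status_code is not None
        and 500 <= status_code <= 599
    )
```

Under these assumptions, three attempts with a 2000 ms base wait 2, 4, and 8 seconds. Client errors such as 404 are not retried, because repeating the request is unlikely to change the outcome.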

          

Handling Partial Results

python

def process_batch_with_fallback(urls):
    results = client.batch_scrape(
        urls=urls,
        on_partial_success=True  # return partial results
    )

    successful = []
    failed = []

    for result in results:
        if result['status'] == 'success':
            successful.append(result)
        else:
            failed.append(result['url'])

    # Retry the failed URLs with different settings
    if failed:
        retry_results = client.batch_scrape(
            urls=failed,
            javascript=True,  # try again with JS rendering
            timeout=60000
        )
        # Keep only the retries that actually succeeded
        successful.extend(
            r for r in retry_results if r['status'] == 'success'
        )

    return successful

          

Progress Monitoring

Real-Time Progress Tracking

javascript

const batch = client.createBatch({
  urls: urls,
  onProgress: (progress) => {
    console.log(`Progress: ${progress.completed}/${progress.total}`);
    console.log(`Success rate: ${progress.successRate}%`);
    console.log(`Estimated time remaining: ${progress.estimatedTimeRemaining}s`);
  }
});

// Start batch processing
const results = await batch.run();
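
The progress fields reported above could be derived from a few running counters. The following Python sketch shows one way to compute them. It assumes a simple linear estimate for the remaining time and is not necessarily how ActiCrawl calculates these values. progress_snapshot is a hypothetical name.

```python
def progress_snapshot(completed, succeeded, total, elapsed_seconds):
    """Summarize batch progress from running counters.

    success_rate is a percentage of completed URLs; the time estimate assumes
    the remaining URLs take as long on average as the completed ones.
    """
    success_rate = 100 * succeeded / completed if completed else 0.0
    if completed:
        remaining = (elapsed_seconds / completed) * (total - completed)
    else:
        remaining = None  # nothing finished yet, so no basis for an estimate
    return {
        'completed': completed,
        'total': total,
        'success_rate': success_rate,
        'estimated_time_remaining': remaining,
    }
```

For example, 25 of 100 URLs done in 50 seconds averages 2 seconds per URL, so the remaining 75 are estimated at 150 seconds.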

          

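Webhook Notifications

python

# Configure a webhook to be notified about batch events
batch_job = client.batch_scrape(
    urls=urls,
    webhook={
        'url': 'https://myapp.com/webhook/batch-complete',
        'events': ['complete', 'error', 'progress'],
        'headers': {
            'Authorization': 'Bearer token123'
        }
    },
    # `async` is a reserved word in Python 3.7+, so writing async=True is a
    # syntax error; dict unpacking passes the same keyword argument
    **{'async': True}  # non-blocking execution
)

print(f"Batch job ID: {batch_job['id']}")
print(f"Status URL: {batch_job['status_url']}")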
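
On the receiving side, the endpoint should check the Authorization header before trusting a notification. The sketch below is framework-independent and makes a loud assumption: the payload is taken to be a JSON object with an "event" field, which this page does not document. handle_webhook and received are hypothetical names.

```python
import hmac
import json

EXPECTED_TOKEN = 'token123'  # same placeholder token as in the webhook config
KNOWN_EVENTS = ('complete', 'error', 'progress')
received = []  # stand-in for real handling (queue, database, ...)


def handle_webhook(headers, body):
    """Validate an incoming webhook call and return an HTTP status code."""
    supplied = headers.get('Authorization', '').encode()
    expected = f'Bearer {EXPECTED_TOKEN}'.encode()
    # compare_digest avoids leaking the token through timing differences
    if not hmac.compare_digest(supplied, expected):
        return 401
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        return 400
    # ASSUMPTION: the payload is a JSON object carrying an "event" field
    if not isinstance(payload, dict) or payload.get('event') not in KNOWN_EVENTS:
        return 400
    received.append(payload)
    return 200
```

Returning quickly and deferring heavy work to a queue keeps the endpoint responsive when progress events arrive frequently.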

          

Performance Optimization

Memory-Efficient Processing

javascript

// Stream results instead of loading them all into memory
const stream = client.batchStream({
  urls: urls,
  concurrency: 20
});

stream.on('data', (result) => {
  // Handle each result as it arrives
  saveToDatabase(result);
});

stream.on('end', () => {
  console.log('Batch processing complete');
});

stream.on('error', (error) => {
  console.error('Batch error:', error);
});

          

Chunked Processing

python

import time

def process_large_batch(urls, chunk_size=100):
    """Process a large URL list in chunks."""
    total_results = []

    for i in range(0, len(urls), chunk_size):
        chunk = urls[i:i + chunk_size]

        print(f"Processing chunk {i // chunk_size + 1}")

        results = client.batch_scrape(
            urls=chunk,
            concurrency=10
        )

        total_results.extend(results)

        # Save intermediate results so an interrupted run can resume
        save_checkpoint(total_results, i)

        # Pause between chunks to stay under rate limits
        time.sleep(5)

    return total_results
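
save_checkpoint is called above but not defined on this page. One possible implementation, offered as a sketch, writes the accumulated results and the offset of the last finished chunk to a JSON file. The file name, the JSON layout, and the load_checkpoint companion are all assumptions.

```python
import json
import os
import tempfile


def save_checkpoint(results, offset, path='batch_checkpoint.json'):
    """Atomically store the results so far and the offset of the last finished chunk."""
    directory = os.path.dirname(os.path.abspath(path))
    # Write to a temporary file first so a crash never leaves a half-written file
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix='.tmp')
    with os.fdopen(fd, 'w', encoding='utf-8') as f:
        json.dump({'offset': offset, 'results': results}, f)
    os.replace(tmp_path, path)  # atomic rename onto the final name


def load_checkpoint(path='batch_checkpoint.json'):
    """Return (offset, results), or (None, []) when no checkpoint exists."""
    if not os.path.exists(path):
        return None, []
    with open(path, encoding='utf-8') as f:
        data = json.load(f)
    return data['offset'], data['results']
```

A resumed run would continue from offset + chunk_size. Rewriting the full result list on every chunk is simple but grows more expensive as the batch grows; appending to a JSON Lines file is a common alternative.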

Data Export Options

Exporting to Different Formats

javascript

const results = await client.batchScrape({
  urls: urls,
  export: {
    format: 'csv',  // one of 'json', 'csv', 'excel', 'parquet'
    path: './exports/batch_results.csv',
    fields: ['url', 'title', 'price', 'timestamp'],
    compress: true  // produces a .csv.gz file
  }
});

          

Streaming Export

python

import json

# Stream results straight to a file, one JSON object per line
with client.batch_scrape_stream(urls) as stream:
    with open('results.jsonl', 'w', encoding='utf-8') as f:
        for result in stream:
            f.write(json.dumps(result) + '\n')

Best Practices

1. Deduplicate URLs

javascript

// Remove duplicate URLs before processing
const uniqueUrls = [...new Set(urls)];

// Or normalize first, then deduplicate
const normalizedUrls = [
  ...new Set(urls.map(url => new URL(url).href))
];
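
In a Python pipeline the same idea can be sketched with urllib.parse. The normalization rules here (lowercase host, default path of "/", fragment dropped) are assumptions that go slightly further than the URL constructor above, and normalize_url and dedupe_urls are hypothetical helper names.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize_url(url):
    """Lowercase the scheme and host, default the path to '/', drop the fragment.

    ASSUMPTION: URLs carry no user:password@ part, so lowercasing the whole
    netloc is safe.
    """
    parts = urlsplit(url)
    return urlunsplit((
        parts.scheme.lower(),
        parts.netloc.lower(),
        parts.path or '/',
        parts.query,
        '',  # fragments never reach the server, so they do not change the page
    ))


def dedupe_urls(urls):
    """Normalize every URL and drop duplicates while keeping first-seen order."""
    return list(dict.fromkeys(normalize_url(url) for url in urls))
```

dict.fromkeys keeps insertion order, so the output lines up with the order in which each URL first appeared.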

2. Optimize Batch Size

python

def optimal_batch_size(total_urls):
    """Choose a batch size based on the total number of URLs."""
    if total_urls < 100:
        return total_urls
    elif total_urls < 1000:
        return 50
    elif total_urls < 10000:
        return 100
    else:
        return 200

          

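3. Resource Management

javascript

// Clean up resources after batch processing.
// runBatch is an illustrative wrapper name; `return` needs an enclosing function.
async function runBatch(urls) {
  const batch = client.createBatch({ urls });

  try {
    return await batch.run();
  } finally {
    await batch.cleanup();  // release resources even if run() throws
  }
}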

          

4. Monitoring and Alerts

python

import time

def batch_with_monitoring(urls):
    """Run a batch and report metrics about it."""
    start_time = time.time()

    results = client.batch_scrape(
        urls=urls,
        on_complete=lambda r: send_metrics({
            'total_urls': len(urls),
            'successful': len([x for x in r if x['status'] == 'success']),
            'duration': time.time() - start_time,
            'success_rate': calculate_success_rate(r)
        })
    )

    # Alert when the success rate drops below 80%
    success_rate = calculate_success_rate(results)
    if success_rate < 0.8:
        send_alert(f"Low success rate: {success_rate:.0%}")

    return results
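
calculate_success_rate is used above without a definition. Given the 0.8 threshold, it presumably returns a fraction between 0 and 1. A minimal sketch under that assumption:

```python
def calculate_success_rate(results):
    """Return the fraction of results whose status is 'success'.

    An empty batch yields 0.0 rather than raising ZeroDivisionError.
    """
    if not results:
        return 0.0
    successes = sum(1 for r in results if r.get('status') == 'success')
    return successes / len(results)
```

Treating an empty batch as 0.0 means it triggers the low success rate alert. Returning 1.0 instead would be the right choice if empty batches are expected and harmless.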

          

Common Use Cases

E-commerce Price Monitoring

javascript

async function monitorPrices(productUrls) {
  const results = await client.batchScrape({
    urls: productUrls,
    extract: {
      name: 'h1.product-title',
      price: '.current-price',
      originalPrice: '.original-price',
      availability: '.stock-status'
    },
    concurrency: 10
  });

  // Compare against previous prices and keep only products that changed
  const priceChanges = results
    .filter(r => r.status === 'success')
    .map(r => ({
      ...r.extracted,
      url: r.url,
      priceChange: calculatePriceChange(r.extracted.price)
    }))
    .filter(p => p.priceChange !== 0);

  return priceChanges;
}
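
calculatePriceChange is not defined in the example above, and extracted prices normally arrive as text such as "$1,299.00". The Python sketch below shows one way the comparison could work. It assumes comma thousands separators and a dot decimal point, and parse_price and price_change are hypothetical names.

```python
import re
from decimal import Decimal


def parse_price(text):
    """Extract a Decimal from text like '$1,299.00'; return None if no number is found.

    ASSUMPTION: commas separate thousands and a dot marks the decimals.
    """
    match = re.search(r'\d[\d,]*(?:\.\d+)?', text or '')
    if not match:
        return None
    return Decimal(match.group().replace(',', ''))


def price_change(previous_text, current_text):
    """Return current minus previous, or None when either price cannot be parsed."""
    previous = parse_price(previous_text)
    current = parse_price(current_text)
    if previous is None or current is None:
        return None
    return current - previous
```

Decimal avoids the rounding surprises of binary floats, which matters when a change of exactly zero is used to filter out unchanged products.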

          

Content Aggregation

python

def aggregate_news_articles(news_sites):
    """Aggregate articles from multiple news sites."""

    urls = []
    for site in news_sites:
        # Collect article URLs from each site's homepage
        homepage = client.scrape(site['url'])
        article_urls = extract_article_urls(homepage, site['selector'])
        urls.extend(article_urls[:10])  # top 10 from each site

    # Scrape all articles in one batch
    articles = client.batch_scrape(
        urls=urls,
        extract={
            'title': 'h1',
            'author': '.author-name',
            'date': 'time@datetime',
            'content': 'article.content',
            'category': '.category'
        }
    )

    return process_articles(articles)

          

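Troubleshooting

Common Issues

Memory problems with large batches
- Stream results instead of loading them all into memory
- Process in chunks
- Increase the Node.js memory limit: node --max-old-space-size=4096

Rate limit errors
- Reduce concurrency
- Add a delay between requests
- Implement exponential backoff

Timeout problems
- Increase the timeout value
- Reduce the batch size
- Check network connectivity

Inconsistent results
- Enable JavaScript rendering for dynamic content
- Add random delays to avoid detection
- Rotate user agents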
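
The last two suggestions, random delays and user agent rotation, can be sketched in a few lines of Python. The user agent strings are placeholders, and jittered_delay and next_headers are hypothetical helpers rather than ActiCrawl options.

```python
import itertools
import random


def jittered_delay(base_seconds, jitter=0.5, rng=random):
    """Return base_seconds scaled by a random factor in [1 - jitter, 1 + jitter]."""
    return base_seconds * rng.uniform(1 - jitter, 1 + jitter)


# Placeholder values; substitute user agents appropriate for your crawler
USER_AGENTS = [
    'ExampleBot/1.0 (+https://example.com/bot)',
    'ExampleBot/1.1 (+https://example.com/bot)',
    'ExampleBot/2.0 (+https://example.com/bot)',
]
_user_agent_cycle = itertools.cycle(USER_AGENTS)


def next_headers():
    """Return request headers carrying the next user agent in round-robin order."""
    return {'User-Agent': next(_user_agent_cycle)}
```

A call such as time.sleep(jittered_delay(1.0)) between requests spreads them out unevenly, which avoids the fixed rhythm that a constant delay produces.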

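Next Steps

- Learn about webhooks for asynchronous processing
- Explore JavaScript rendering for dynamic sites
- Read about error handling strategies
- Review rate limiting best practices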
