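Data Extraction

Learn how to extract structured data from web pages with ActiCrawl's extraction engine. This guide covers techniques ranging from simple CSS selectors to AI-based extraction.

Extraction Basics

ActiCrawl supports several methods for extracting data from web pages:

CSS selectors: select elements with standard CSS syntax
XPath: path-based selection for more complex queries
JSON-LD: read structured data from JSON-LD scripts
Regex patterns: extract text with regular expressions
AI extraction: let an AI model interpret the page and extract the data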

CSS Selector Extraction

The simplest and most common method:

javascript

const result = await client.scrape({
  url: 'https://example.com/product',
  extract: {
    title: 'h1.product-title',
    price: '.price-current',
    description: '.product-description',
    image: 'img.main-image@src',
    rating: '.rating@data-rating'
  }
});

console.log(result.extracted);
// {
//   title: "Premium Wireless Headphones",
//   price: "₩399,000",
//   description: "High-quality audio...",
//   image: "https://example.com/img/product.jpg",
//   rating: "4.5"
// }

          

Attribute Extraction

Append @attribute to a selector to extract a specific attribute instead of the element's text:

javascript

extract: {
  imageUrl: 'img#product-image@src',
  imageAlt: 'img#product-image@alt',
  linkHref: 'a.product-link@href',
  dataId: 'div.product@data-product-id',
  metaDescription: 'meta[name="description"]@content'
}
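
To make the selector@attribute convention concrete, the following Python sketch splits a spec string into its selector and attribute parts. The function name split_selector is hypothetical and the parsing rule is an assumption made for illustration; ActiCrawl's own parser may treat edge cases differently.

```python
from typing import Optional, Tuple


def split_selector(spec: str) -> Tuple[str, Optional[str]]:
    """Split 'css-selector@attr' into (selector, attr).

    Hypothetical helper: attr is None when the spec has no @attribute suffix.
    """
    selector, sep, attr = spec.rpartition("@")
    # Accept the suffix only if it looks like an attribute name (letters,
    # digits, hyphens, underscores), so an '@' inside the selector is ignored.
    looks_like_attr = attr.replace("-", "").replace("_", "").isalnum()
    if sep and selector and looks_like_attr:
        return selector, attr
    return spec, None


print(split_selector("img#product-image@src"))
print(split_selector("h1"))
```

Splitting on the last @ keeps attribute selectors such as meta[name="description"] intact, because the bracketed part stays on the selector side.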

          

Multiple Elements

Extract an array of elements:

javascript

extract: {
  // Single element
  title: 'h1',

  // Multiple elements
  features: {
    selector: 'li.feature',
    multiple: true
  },

  // Nested extraction
  reviews: {
    selector: '.review',
    multiple: true,
    extract: {
      author: '.reviewer-name',
      rating: '.stars@data-rating',
      comment: '.review-text',
      date: '.review-date'
    }
  }
}

          

XPath Extraction

For complex selections that CSS cannot express:

python

result = client.scrape(
    url='https://example.com/article',
    extract={
        # Text that follows a specific label
        'author': '//span[text()="Author:"]/following-sibling::text()',

        # Table cell located by its header
        'price': '//th[text()="Price"]/following-sibling::td/text()',

        # Compound condition
        'in_stock': '//div[@class="availability" and contains(text(), "In stock")]',

        # Navigate up to the parent
        'category': '//li[@class="current"]/parent::ul/@data-category'
    }
)
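
The header-cell and parent-navigation queries can be tried locally with Python's standard library. This is only a sketch: xml.etree.ElementTree implements a subset of XPath (no following-sibling axis, no text() node test), so the two queries are restated in that subset, and the sample markup is invented for the example.

```python
import xml.etree.ElementTree as ET

# Sample markup invented for this example.
DOC = """
<div>
  <table>
    <tr><th>Price</th><td>₩399,000</td></tr>
    <tr><th>Brand</th><td>Acme</td></tr>
  </table>
  <ul data-category="audio">
    <li>Speakers</li>
    <li class="current">Headphones</li>
  </ul>
</div>
"""

root = ET.fromstring(DOC)

# Table cell by header: the row whose <th> text is "Price", then its <td>.
price = root.find(".//tr[th='Price']/td").text

# Parent navigation: step from the current <li> up to its <ul>.
category = root.find(".//li[@class='current']/..").get("data-category")

print(price, category)
```

A full XPath 1.0 engine is needed for the axis-based expressions shown in the extraction config itself.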

          

Advanced Extraction Patterns

Table Extraction

Extract structured data from a table:

javascript

const tableData = await client.scrape({
  url: 'https://example.com/data',
  extract: {
    table: {
      selector: 'table#data-table',
      type: 'table',
      headers: 'auto', // or specify them: ['Name', 'Price', 'Stock']
      skipRows: 1 // skip the header row
    }
  }
});

// Result:
// [
//   { Name: "Product A", Price: "₩100,000", Stock: "In stock" },
//   { Name: "Product B", Price: "₩200,000", Stock: "Out of stock" }
// ]
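
The mapping from table rows to one object per row can be sketched in a few lines of Python. The helper name rows_to_records is hypothetical, and treating "auto" as "use the first row as column names" is an assumption about the behavior, not a statement of how ActiCrawl implements it.

```python
from typing import Dict, List, Sequence, Union


def rows_to_records(
    rows: Sequence[Sequence[str]],
    headers: Union[str, Sequence[str]] = "auto",
) -> List[Dict[str, str]]:
    """Turn table rows into one dict per data row."""
    if headers == "auto":
        # Assumption: "auto" means the first row holds the column names.
        headers, body = rows[0], rows[1:]
    else:
        body = rows
    return [dict(zip(headers, row)) for row in body]


rows = [
    ["Name", "Price", "Stock"],
    ["Product A", "₩100,000", "In stock"],
    ["Product B", "₩200,000", "Out of stock"],
]
records = rows_to_records(rows)
print(records)
```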

          

List Extraction

Extract a structured list:

python

extract = {
    'products': {
        'selector': '.product-grid .product-card',
        'type': 'list',
        'extract': {
            'name': 'h3',
            'price': '.price',
            'image': 'img@src',
            'specs': {
                'selector': '.spec',
                'multiple': True
            }
        }
    }
}

          

Paginated Data

Extract data that spans multiple pages:

javascript

const crawler = await client.crawl({
  startUrl: 'https://example.com/products?page=1',
  pagination: {
    nextSelector: 'a.next-page@href',
    maxPages: 10
  },
  extract: {
    products: {
      selector: '.product',
      multiple: true,
      extract: {
        name: '.product-name',
        price: '.product-price'
      }
    }
  }
});
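
Conceptually, pagination is a loop that follows the "next" link until it runs out or a page limit is reached. The Python sketch below shows that loop against an invented in-memory site; crawl_pages and FAKE_SITE are hypothetical names, and the real crawler's stopping rules are not documented here.

```python
from typing import Callable, Dict, List, Optional

# Invented in-memory "site": each page lists items and an optional next URL.
FAKE_SITE: Dict[str, dict] = {
    "/products?page=1": {"items": ["A", "B"], "next": "/products?page=2"},
    "/products?page=2": {"items": ["C"], "next": "/products?page=3"},
    "/products?page=3": {"items": ["D"], "next": None},
}


def crawl_pages(
    start_url: str,
    fetch: Callable[[str], dict],
    max_pages: int = 10,
) -> List[str]:
    """Follow 'next' links from start_url, visiting at most max_pages pages."""
    items: List[str] = []
    visited = set()
    url: Optional[str] = start_url
    # Stop on a missing next link, a repeated URL (loop guard), or the limit.
    while url and url not in visited and len(visited) < max_pages:
        visited.add(url)
        page = fetch(url)
        items.extend(page["items"])
        url = page.get("next")
    return items


print(crawl_pages("/products?page=1", FAKE_SITE.__getitem__, max_pages=2))
```

The visited set guards against sites whose last page links back to an earlier one, which would otherwise loop until the page limit.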

          

Text Processing

Clean Text Extraction

Strip noise and return clean text:

javascript

extract: {
  articleText: {
    selector: 'article',
    textOnly: true,
    clean: true // removes extra whitespace, ads, and similar noise
  },

  // Custom cleanup
  description: {
    selector: '.description',
    process: (text) => {
      return text
        .replace(/\s+/g, ' ')
        .trim()
        .substring(0, 200);
    }
  }
}
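
For Python users, the custom cleanup step above (collapse whitespace, trim, cap at 200 characters) can be written as a small function. The name clean_text is hypothetical; whether the Python client accepts such a callable is not covered in this section.

```python
import re


def clean_text(text: str, max_length: int = 200) -> str:
    """Collapse whitespace runs, trim the ends, and cap the length."""
    collapsed = re.sub(r"\s+", " ", text).strip()
    return collapsed[:max_length]


print(clean_text("  Premium\n\n  wireless \t headphones  "))
```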

          

Regex Extraction

Extract text with regular expression patterns:

python

extract = {
    'phone': {
        'selector': '.contact',
        'regex': r'\b\d{3}[-.]?\d{3}[-.]?\d{4}\b'
    },
    'email': {
        'selector': '.contact',
        'regex': r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}'
    },
    'price': {
        'selector': '.price-text',
        'regex': r'₩[\d,]+\.?\d*',
        'type': 'float'  # convert the match to a number
    }
}
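
It helps to test patterns like these on sample text before putting them in a config. The sketch below runs the same three patterns with Python's re module; the sample strings and the first_match helper are invented for the example.

```python
import re
from typing import Optional

PHONE = re.compile(r"\b\d{3}[-.]?\d{3}[-.]?\d{4}\b")
EMAIL = re.compile(r"[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}")
PRICE = re.compile(r"₩[\d,]+\.?\d*")


def first_match(pattern: re.Pattern, text: str) -> Optional[str]:
    """Return the first match of pattern in text, or None."""
    match = pattern.search(text)
    return match.group(0) if match else None


contact = "Call 555-123-4567 or write to sales@example.com"
price_text = "Sale price: ₩399,000 (tax included)"

phone = first_match(PHONE, contact)
email = first_match(EMAIL, contact)
raw_price = first_match(PRICE, price_text)

# Numeric conversion: drop the currency symbol and thousands separators.
price = float(raw_price.replace("₩", "").replace(",", ""))

print(phone, email, price)
```

Note that the price pattern also accepts a trailing period, so a price at the end of a sentence would match with the period included; strip it before converting if that case can occur.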

          

JSON-LD and Microdata

Extract structured data embedded in the page:

javascript

const result = await client.scrape({
  url: 'https://example.com/product',
  extract: {
    // Extract from JSON-LD
    structured: {
      selector: 'script[type="application/ld+json"]',
      type: 'json',
      parse: true
    },

    // Extract microdata
    product: {
      selector: '[itemtype="https://schema.org/Product"]',
      microdata: true
    }
  }
});
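
To show what JSON-LD extraction amounts to, the following sketch collects every application/ld+json script from an HTML string and parses it, using only Python's standard library. The class name JsonLdCollector and the sample page are invented; this is not ActiCrawl's implementation.

```python
import json
from html.parser import HTMLParser


class JsonLdCollector(HTMLParser):
    """Collect parsed JSON-LD blocks from <script type="application/ld+json">."""

    def __init__(self) -> None:
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.items = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        # Script content can arrive in several chunks, so buffer it.
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self._in_jsonld = False
            try:
                self.items.append(json.loads("".join(self._buffer)))
            except json.JSONDecodeError:
                pass  # skip malformed blocks instead of failing the page


PAGE = """
<html><head>
<script type="application/ld+json">
{"@context": "https://schema.org", "@type": "Product",
 "name": "Premium Wireless Headphones",
 "offers": {"@type": "Offer", "price": "399000", "priceCurrency": "KRW"}}
</script>
</head><body><h1>Premium Wireless Headphones</h1></body></html>
"""

collector = JsonLdCollector()
collector.feed(PAGE)
product = collector.items[0]
print(product["name"], product["offers"]["priceCurrency"])
```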

          

AI-Based Extraction

Let an AI model interpret the page and extract the data:

javascript

const result = await client.scrape({
  url: 'https://example.com/article',
  aiExtract: {
    // Natural-language queries
    author: "Who is the author of this article?",
    publishDate: "When was it published?",
    mainPoints: "What are the main points? (list)",
    sentiment: "What is the overall sentiment?",

    // Structured extraction
    product: {
      query: "Extract the product information",
      schema: {
        name: "string",
        price: "number",
        features: "array",
        available: "boolean"
      }
    }
  }
});
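
Because AI output can drift from the requested shape, it is worth checking a structured result against its schema before using it. The Python sketch below does that for the four type names used above; schema_errors and TYPE_MAP are hypothetical, and the mapping of type names to Python types is an assumption.

```python
from typing import Any, Dict, List

# Maps the schema's type names to Python types (an assumption for this sketch).
TYPE_MAP = {"string": str, "number": (int, float), "array": list, "boolean": bool}


def schema_errors(data: Dict[str, Any], schema: Dict[str, str]) -> List[str]:
    """Return a list of problems; an empty list means data fits the schema."""
    errors = []
    for field, type_name in schema.items():
        if field not in data:
            errors.append(f"{field}: missing")
            continue
        value = data[field]
        ok = isinstance(value, TYPE_MAP[type_name])
        # bool is a subclass of int in Python, so exclude it from "number".
        if type_name == "number" and isinstance(value, bool):
            ok = False
        if not ok:
            errors.append(
                f"{field}: expected {type_name}, got {type(value).__name__}"
            )
    return errors


schema = {"name": "string", "price": "number", "features": "array", "available": "boolean"}
good = {"name": "Headphones", "price": 399000, "features": ["ANC"], "available": True}
bad = {"name": "Headphones", "price": "399,000", "features": ["ANC"]}

print(schema_errors(good, schema))
print(schema_errors(bad, schema))
```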

          

Custom AI Prompts

python

ai_extract = {
    'summary': {
        'prompt': 'Summarize this article in 3 key points',
        'max_tokens': 150
    },
    'entities': {
        'prompt': 'Extract all company names, people, and locations mentioned',
        'format': 'json'
    },
    'classification': {
        'prompt': 'Classify this content as one of: news, blog, product, documentation',
        'choices': ['news', 'blog', 'product', 'documentation']
    }
}

          

Data Transformation

Type Conversion

Convert extracted values to the appropriate type:

javascript

extract: {
  price: {
    selector: '.price',
    type: 'number', // converts "₩29,900" to 29900
    currency: 'KRW'
  },
  inStock: {
    selector: '.availability',
    type: 'boolean', // converts "In stock" to true
    truthy: ['In stock', 'Available']
  },
  rating: {
    selector: '.stars',
    type: 'float',
    attribute: 'data-rating'
  },
  publishDate: {
    selector: '.date',
    type: 'date',
    format: 'YYYY-MM-DD'
  }
}
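
The conversions named above can be approximated locally to see what each type setting is expected to produce. This is a sketch with hypothetical helpers (to_number, to_boolean, to_date); the exact rules ActiCrawl applies, for example for decimal separators, are not specified here.

```python
import re
from datetime import date, datetime
from typing import Iterable


def to_number(text: str) -> float:
    """Strip currency symbols and thousands separators, then parse."""
    digits = re.sub(r"[^\d.\-]", "", text)
    return float(digits)


def to_boolean(text: str, truthy: Iterable[str]) -> bool:
    """True when the trimmed text is one of the truthy labels."""
    return text.strip() in set(truthy)


def to_date(text: str, fmt: str = "%Y-%m-%d") -> date:
    """Parse a date; '%Y-%m-%d' is Python's spelling of 'YYYY-MM-DD'."""
    return datetime.strptime(text.strip(), fmt).date()


print(to_number("₩29,900"))
print(to_boolean(" In stock ", ["In stock", "Available"]))
print(to_date("2024-03-15"))
```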

          

Custom Transformations

Apply your own processing functions:

python

from dateutil import parser

def process_price(value):
    # Strip the currency symbol and thousands separators, then convert to float
    return float(value.replace('₩', '').replace(',', '').strip())

def normalize_date(value):
    # Parse a date in any common format and return it as an ISO 8601 string
    return parser.parse(value).isoformat()

extract = {
    'price': {
        'selector': '.price',
        'transform': process_price
    },
    'date': {
        'selector': '.published',
        'transform': normalize_date
    }
}

          

Conditional Extraction

Extract values based on conditions:

javascript

extract: {
  // Extract only if the element exists
  salePrice: {
    selector: '.sale-price',
    optional: true
  },

  // Conditional extraction
  availability: {
    conditions: [
      {
        selector: '.in-stock',
        exists: true,
        value: 'In stock'
      },
      {
        selector: '.out-of-stock',
        exists: true,
        value: 'Out of stock'
      }
    ],
    default: 'Unknown'
  },

  // Fallback selectors
  title: {
    selectors: [
      'h1.product-title',
      'h2.title',
      'meta[property="og:title"]@content'
    ]
  }
}
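
Fallback selectors boil down to "try each selector in order and keep the first non-empty result". The Python sketch below shows that rule with a dictionary standing in for the page; first_available and the sample data are hypothetical, and the assumption that empty strings count as misses is mine.

```python
from typing import Callable, Iterable, Optional


def first_available(
    selectors: Iterable[str],
    lookup: Callable[[str], Optional[str]],
    default: Optional[str] = None,
) -> Optional[str]:
    """Return the first non-empty lookup result, or default if none match."""
    for selector in selectors:
        value = lookup(selector)
        if value:  # skip None and empty strings
            return value
    return default


# Invented page: selector -> extracted text, standing in for a real DOM query.
page = {"h2.title": "Premium Wireless Headphones", "h1.product-title": ""}

title = first_available(
    ["h1.product-title", "h2.title", 'meta[property="og:title"]@content'],
    page.get,
)
print(title)
```

Ordering the list from most to least specific keeps the preferred source in use whenever it is present.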

          

Performance Optimization

Selective Extraction

Extract only what you need:

javascript

// Bad: fetch the whole page, then filter it yourself
const fullPage = await client.scrape({
  url: 'https://example.com',
  format: 'json'
});
const title = fullPage.content.querySelector('h1').text;

// Good: extract only the data you need
const result = await client.scrape({
  url: 'https://example.com',
  extract: {
    title: 'h1'
  }
});

          

Batch Extraction

Extract from multiple URLs efficiently:

python

urls = [
    'https://example.com/product/1',
    'https://example.com/product/2',
    'https://example.com/product/3'
]

results = client.batch_scrape(
    urls=urls,
    extract={
        'name': 'h1',
        'price': '.price',
        'stock': '.availability'
    },
    concurrency=5
)
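
The concurrency setting caps how many requests run at once. As a rough local analogy, not a description of how batch_scrape works internally, the sketch below runs a stand-in scrape function over a URL list with a bounded thread pool; batch_map and fake_scrape are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List


def batch_map(
    urls: List[str],
    scrape_one: Callable[[str], Dict],
    concurrency: int = 5,
) -> List[Dict]:
    """Run scrape_one over urls with at most `concurrency` worker threads.

    executor.map yields results in input order, regardless of finish order.
    """
    with ThreadPoolExecutor(max_workers=concurrency) as executor:
        return list(executor.map(scrape_one, urls))


def fake_scrape(url: str) -> Dict:
    # Stand-in for a network call; returns the product id parsed from the URL.
    return {"url": url, "id": int(url.rsplit("/", 1)[-1])}


urls = [f"https://example.com/product/{i}" for i in (1, 2, 3)]
results = batch_map(urls, fake_scrape, concurrency=5)
print([r["id"] for r in results])
```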

          

Error Handling

Handle extraction failures gracefully:

javascript

extract: {
  price: {
    selector: '.price',
    required: true,
    onError: 'skip' // or 'default' or 'fail'
  },

  description: {
    selector: '.description',
    default: 'No description available',
    maxLength: 500
  },

  images: {
    selector: 'img.product-image@src',
    multiple: true,
    validate: (urls) => urls.filter(url => url.startsWith('https'))
  }
}
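
One plausible reading of the three onError values is sketched below in Python. The semantics are an assumption made for illustration ('skip' omits the field, 'default' substitutes the default value, 'fail' raises), and apply_policies and ExtractionError are hypothetical names; check the API reference for the actual behavior.

```python
from typing import Any, Dict, Optional


class ExtractionError(Exception):
    """Raised when a field is missing and its policy is 'fail'."""


def apply_policies(
    raw: Dict[str, Optional[str]],
    policies: Dict[str, Dict[str, Any]],
) -> Dict[str, Any]:
    """Build a record from raw values, applying each field's onError policy."""
    record: Dict[str, Any] = {}
    for field, policy in policies.items():
        value = raw.get(field)
        if value:  # extracted successfully
            record[field] = value
            continue
        on_error = policy.get("onError", "fail")
        if on_error == "skip":
            continue  # leave the field out of the record
        if on_error == "default":
            record[field] = policy.get("default")
        else:
            raise ExtractionError(f"missing required field: {field}")
    return record


policies = {
    "price": {"onError": "skip"},
    "description": {"onError": "default", "default": "No description available"},
}
print(apply_policies({"price": None, "description": ""}, policies))
```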

          

Real-World Examples

E-commerce Product Extraction

javascript

const productExtractor = {
  extract: {
    product: {
      name: 'h1[itemprop="name"]',
      brand: '[itemprop="brand"]',
      price: {
        selector: '[itemprop="price"]@content',
        type: 'number'
      },
      currency: '[itemprop="priceCurrency"]@content',
      availability: {
        selector: '[itemprop="availability"]@href',
        transform: (val) => val.includes('InStock')
      },
      images: {
        selector: '.product-images img@src',
        multiple: true,
        limit: 5
      },
      features: {
        selector: '.feature-list li',
        multiple: true
      },
      rating: {
        value: '[itemprop="ratingValue"]@content',
        count: '[itemprop="reviewCount"]'
      }
    }
  }
};

          

Article/Blog Extraction

python

article_extractor = {
    'extract': {
        'article': {
            'title': 'h1.article-title',
            'author': '.author-name',
            'publishDate': {
                'selector': 'time[datetime]@datetime',
                'type': 'date'
            },
            'category': '.category a',
            'tags': {
                'selector': '.tag',
                'multiple': True
            },
            'content': {
                'selector': '.article-content',
                'clean': True,
                'markdown': True  # convert to Markdown
            },
            'relatedArticles': {
                'selector': '.related-article',
                'multiple': True,
                'extract': {
                    'title': 'h3',
                    'url': 'a@href'
                }
            }
        }
    }
}

          

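Best Practices

Use specific selectors: more specific selectors are faster and more reliable
Validate extracted data: always validate critical fields
Handle missing data: use default values and optional flags
Test your selectors: test against multiple pages before going to production
Monitor for changes: set up alerts for extraction failures
Use AI judiciously: AI extraction is powerful but costs more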
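
The validation and monitoring advice above can be combined into a simple check: measure how often each field comes back empty and flag fields that cross a threshold. The Python sketch below is one way to do that; the function names and the 20% threshold are arbitrary choices for illustration.

```python
from typing import Dict, Iterable, List, Optional


def missing_rates(
    records: Iterable[Dict[str, Optional[str]]],
    fields: List[str],
) -> Dict[str, float]:
    """Fraction of records in which each field is absent, None, or empty."""
    records = list(records)
    if not records:
        return {field: 0.0 for field in fields}
    return {
        field: sum(1 for r in records if not r.get(field)) / len(records)
        for field in fields
    }


def fields_to_alert(rates: Dict[str, float], threshold: float = 0.2) -> List[str]:
    """Fields whose missing rate exceeds the threshold, sorted by name."""
    return sorted(f for f, rate in rates.items() if rate > threshold)


records = [
    {"name": "Product A", "price": "₩100,000"},
    {"name": "Product B", "price": None},
    {"name": "Product C", "price": ""},
    {"name": "Product D", "price": "₩400,000"},
]
rates = missing_rates(records, ["name", "price"])
print(rates, fields_to_alert(rates))
```

A sudden jump in a field's missing rate is often the first sign that the target site changed its markup and a selector needs updating.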

Next Steps

Explore JavaScript rendering for dynamic content
Learn about error handling strategies
Read about webhooks for asynchronous processing
See batch processing for scaling up