# Web Automation
Extract structured data from websites using AI-powered scraping, multi-page crawling, and intelligent web search.
## Why This Matters
Modern web data extraction requires more than simple HTML parsing. JavaScript-rendered content, anti-bot measures, and unstructured layouts make traditional scrapers ineffective. TEA’s web automation capabilities delegate to specialized AI-powered services that handle the complexity for you, returning clean, structured data ready for your agent workflows.
Whether you need to monitor competitor pricing, aggregate research content, or build data pipelines from web sources, TEA provides the actions to do it declaratively in YAML.
## Quick Example

```yaml
name: product-monitor
description: Extract product data from any e-commerce site

state_schema:
  target_url: str
  products: list

nodes:
  - name: scrape-products
    uses: web.ai_scrape
    with:
      url: "{{ state.target_url }}"
      prompt: "Extract all products with names, prices, and availability"
      output_schema:
        type: object
        properties:
          products:
            type: array
            items:
              type: object
              properties:
                name: { type: string }
                price: { type: string }
                in_stock: { type: boolean }
    output: products

edges:
  - from: __start__
    to: scrape-products
  - from: scrape-products
    to: __end__
```
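On a successful run, each node's result is written back into state under its `output` key. A sketch of the final state for the workflow above, assuming the `products` array from the structured result is what lands in `state.products` (the sample values are illustrative):

```yaml
target_url: "https://shop.example.com/catalog"
products:
  - name: "Widget Pro"   # shape follows the output_schema above
    price: "$29.99"
    in_stock: true
  - name: "Widget Mini"
    price: "$14.99"
    in_stock: false
```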
## Key Features

| Feature | Description |
|---|---|
| AI-Powered Extraction | LLMs understand page structure, not just DOM selectors |
| JavaScript Rendering | Handle dynamic SPAs and client-side rendered content |
| Schema-Driven Output | Define exactly what data you need with JSON Schema |
| Multi-Page Crawling | Recursively crawl sites with path filters and depth limits |
| Web Search | Find relevant pages before scraping with AI-powered search |
| No Browser Dependencies | Cloud APIs mean no Playwright/Selenium to configure |
## Tools

TEA integrates with multiple web automation services:

| Tool | Actions | Best For | API Key Required |
|---|---|---|---|
| Firecrawl | `web.scrape`, `web.crawl` | LLM-ready markdown, multi-page crawling | Yes |
| ScrapeGraphAI | `web.ai_scrape` | AI-powered structured extraction with Pydantic schemas | Yes |
| Perplexity | `web.search` | AI-powered web search with citations | Yes |
| LlamaExtract | `llamaextract.*` | Document extraction (PDFs, images, invoices) | Yes |
## Available Actions

### Web Scraping (`web.*`)

| Action | Description |
|---|---|
| `web.scrape` | Extract LLM-ready markdown from a URL via Firecrawl |
| `web.crawl` | Recursively crawl websites with path filters and depth limits |
| `web.search` | Perform AI-powered web search via Perplexity |
| `web.ai_scrape` | Extract structured data using ScrapeGraphAI with Pydantic schemas |
### Structured Extraction (`llamaextract.*`)

| Action | Description |
|---|---|
| | Extract structured data from documents (PDF, images) |
| | Create/update extraction agent with schema |
| | List available extraction agents |
| | Get agent details and schema |
| | Delete an extraction agent |
## Examples

- Deep Research Crawler - Multi-stage law firm data extraction with sitemap analysis
- ScrapeGraph Simple Test - Quick test for AI-powered extraction
- ScrapeGraph Production Test - Production validation example
## Use Cases

### Research Automation

```yaml
# Search for relevant pages, then scrape them
- name: find-sources
  uses: web.search
  with:
    query: "{{ state.topic }} latest research 2025"
    num_results: 10
  output: sources

- name: scrape-articles
  uses: web.scrape
  with:
    url: "{{ state.sources.results[0].url }}"
    formats: ["markdown", "links"]
  output: article_content
```
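The template `{{ state.sources.results[0].url }}` implies that `web.search` stores an object with a `results` list under `state.sources`. A sketch of that shape; only `results[].url` is implied by the example, and any other fields (such as `title`) are assumptions to verify against the action's actual response:

```yaml
sources:
  results:
    - url: "https://example.org/latest-study"
      title: "..."  # assumed field
    - url: "https://example.org/survey-2025"
      title: "..."
```

Note that this pipeline scrapes only the first hit; fanning out over every result requires whatever iteration construct your TEA version provides.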
### Structured Data Extraction

```yaml
# Extract product data with schema validation
- name: extract-products
  uses: web.ai_scrape
  with:
    url: "{{ state.product_page }}"
    prompt: "Extract all product information"
    schema:
      uses: company/schemas@v1.0.0#products.json  # Git ref to schema
  output: products
```
### Multi-Page Crawling

```yaml
# Crawl documentation site
- name: crawl-docs
  uses: web.crawl
  with:
    url: "https://docs.example.com"
    max_depth: 3
    limit: 50
    include_paths: ["/api/*", "/guides/*"]
    exclude_paths: ["/blog/*", "/changelog/*"]
  output: documentation
```
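The shape of the crawl result stored in `state.documentation` is not specified here. A plausible sketch, assuming TEA surfaces Firecrawl's per-page markdown alongside each source URL (the field names are assumptions):

```yaml
documentation:
  pages:
    - url: "https://docs.example.com/api/auth"        # one entry per crawled page
      markdown: "# Authentication\n..."
    - url: "https://docs.example.com/guides/quickstart"
      markdown: "# Quickstart\n..."
```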
## Configuration

### Environment Variables

```bash
# Firecrawl (scraping and crawling)
export FIRECRAWL_API_KEY="fc-your-key"

# ScrapeGraphAI (AI extraction)
export SCRAPEGRAPH_API_KEY="sgai-your-key"

# Perplexity (web search)
export PERPLEXITY_API_KEY="pplx-your-key"

# LlamaExtract (document extraction)
export LLAMAEXTRACT_API_KEY="llx-your-key"
```
### Schema Loading (Advanced)

Web actions support loading extraction schemas from multiple sources:

```yaml
# Git repository reference
schema:
  uses: company/schemas@v1.0.0#extraction/product.json

# Cloud storage (S3, GCS, Azure)
schema:
  uses: s3://bucket-name/schemas/invoice.json

# Multiple schemas merged (kubectl-style, last wins)
schema:
  uses:
    - base/schemas@v1#common/base.json
    - s3://company-schemas/overlay.json
    - company/private@main#overrides.json
```
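To make the last-wins rule concrete, here is a sketch of how two overlapping schema fragments would combine; the key-by-key merge granularity is an assumption based on the kubectl comparison:

```yaml
# First source (base/schemas@v1#common/base.json)
properties:
  name: { type: string }
  price: { type: string }
---
# Second source (s3://company-schemas/overlay.json) redefines price
properties:
  price: { type: number }
---
# Effective merged schema: later sources win, key by key (assumed)
properties:
  name: { type: string }
  price: { type: number }
```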
## Error Handling

All web actions return consistent error responses:

```python
{
    "success": False,
    "error": "Rate limit exceeded",
    "error_type": "rate_limit"  # one of: configuration, rate_limit, timeout, api_error
}
```
Actions automatically retry on:

- Rate limits (HTTP 429), with exponential backoff
- Server errors (HTTP 5xx)
## Learn More

- Web Actions Story - Technical implementation details
- ScrapeGraphAI Integration - AI-powered extraction
- LlamaExtract Actions - Document extraction
- YAML Reference - Full action documentation

Part of the TEA Capabilities documentation.