Crawling API

API endpoint for crawling websites and building knowledge bases.

POST /api/admin/crawl

Crawls a website starting from the provided URL, extracts text content, chunks it, generates embeddings, and stores everything in the client's knowledge base.

Request

Request Body
JSON payload with crawl configuration
Field      Type     Required   Description
clientId   string   Required   UUID of the client
url        string   Required   Starting URL to crawl
maxPages   number   Optional   Maximum pages to crawl (default: 50)

Example

cURL Request
curl -X POST https://sensei-ai-eight.vercel.app/api/admin/crawl \
  -H "Content-Type: application/json" \
  -d '{
    "clientId": "d454991a-eddb-4e81-959d-87c868e050ca",
    "url": "https://example.com",
    "maxPages": 50
  }'
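
The same request can be made from TypeScript. The sketch below uses the built-in fetch API (Node 18+ or the browser); the CrawlResponse interface and startCrawl function are illustrative names that simply mirror the request and response payloads documented on this page.

TypeScript Request
// Illustrative shape of the success payload (see Response below).
interface CrawlResponse {
  success: boolean;
  sourceId: string;
  pagesIndexed: number;
  documentsCreated: number;
  message: string;
}

async function startCrawl(
  clientId: string,
  url: string,
  maxPages = 50,
): Promise<CrawlResponse> {
  const response = await fetch("https://sensei-ai-eight.vercel.app/api/admin/crawl", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ clientId, url, maxPages }),
  });

  if (!response.ok) {
    // Error responses return an { error, details } JSON body (see 400 below).
    const err = await response.json();
    throw new Error(`Crawl failed: ${err.error} (${err.details})`);
  }

  return response.json();
}

// Example usage (inside an async context):
// const result = await startCrawl("d454991a-eddb-4e81-959d-87c868e050ca", "https://example.com", 50);
// console.log(`${result.pagesIndexed} pages indexed, ${result.documentsCreated} chunks created`);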

Response

200 Success
{
  "success": true,
  "sourceId": "abc123-def456",
  "pagesIndexed": 25,
  "documentsCreated": 78,
  "message": "Crawl completed successfully"
}
400 Bad Request
{
  "error": "Invalid URL provided",
  "details": "URL must be a valid HTTP/HTTPS URL"
}

How Crawling Works

Crawl Process
  1. Fetch Page

     Download the HTML content from the URL.

  2. Extract Content

     Parse the HTML and extract the text content (removing scripts, styles, and navigation).

  3. Find Links

     Discover internal links to crawl next (same domain only). A sketch of steps 1-3 follows this list.

  4. Chunk Text

     Split the content into ~1000-character chunks with overlap (see the chunking sketch below).

  5. Generate Embeddings

     Create a vector embedding for each chunk using OpenAI.

  6. Store in Database

     Save the chunks and embeddings to the documents table (see the embedding and storage sketch below).
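
Steps 1-3 form a simple breadth-first crawl loop. The sketch below is illustrative rather than the service's actual implementation; it assumes Node 18+ (global fetch) and the cheerio package for HTML parsing.

Crawl Loop Sketch (TypeScript)
import * as cheerio from "cheerio";

// Breadth-first crawl: fetch a page, strip non-content markup,
// and queue same-domain links until maxPages pages are collected.
async function crawl(startUrl: string, maxPages = 50): Promise<Map<string, string>> {
  const origin = new URL(startUrl).origin;
  const queue: string[] = [startUrl];
  const visited = new Set<string>();
  const pages = new Map<string, string>(); // url -> extracted text

  while (queue.length > 0 && pages.size < maxPages) {
    const url = queue.shift()!;
    if (visited.has(url)) continue;
    visited.add(url);

    // 1. Fetch Page
    const res = await fetch(url);
    if (!res.ok) continue;
    const html = await res.text();

    // 2. Extract Content: drop scripts, styles, and navigation, keep visible text
    const $ = cheerio.load(html);
    $("script, style, nav").remove();
    const text = $("body").text().replace(/\s+/g, " ").trim();
    pages.set(url, text);

    // 3. Find Links: only follow links on the same domain
    $("a[href]").each((_, el) => {
      const href = $(el).attr("href");
      if (!href) return;
      try {
        const link = new URL(href, url);
        link.hash = ""; // treat #fragment variants as the same page
        if (link.origin === origin && !visited.has(link.href)) {
          queue.push(link.href);
        }
      } catch {
        // ignore malformed hrefs
      }
    });
  }

  return pages;
}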
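
Step 4 splits each page's text into overlapping chunks. The exact boundaries used by the service aren't documented beyond "~1000-character chunks with overlap"; the function below is a minimal sliding-window sketch with an assumed overlap of 200 characters.

Chunking Sketch (TypeScript)
// Split text into ~1000-character chunks with a fixed overlap between
// consecutive chunks so sentences spanning a boundary are not lost.
// chunkSize and overlap defaults are assumptions, not documented values.
function chunkText(text: string, chunkSize = 1000, overlap = 200): string[] {
  const chunks: string[] = [];
  let start = 0;
  while (start < text.length) {
    chunks.push(text.slice(start, start + chunkSize));
    if (start + chunkSize >= text.length) break;
    start += chunkSize - overlap;
  }
  return chunks;
}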
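
Steps 5 and 6 turn each chunk into a vector and persist it. The sketch below assumes the official openai Node SDK with the text-embedding-3-small model and a generic SQL documents table with a vector column accessed through pg; the actual model, schema, and database client used by the service are not documented here.

Embedding and Storage Sketch (TypeScript)
import OpenAI from "openai";
import { Pool } from "pg";

const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment
const pool = new Pool();     // reads PG* connection settings from the environment

// Embed each chunk and insert it into the documents table. The model name,
// column names, and pgvector-style embedding format are assumptions; only
// "create vector embeddings using OpenAI" and "save to the documents table"
// are documented.
async function embedAndStore(sourceId: string, chunks: string[]): Promise<void> {
  const { data } = await openai.embeddings.create({
    model: "text-embedding-3-small", // assumed model
    input: chunks,
  });

  for (let i = 0; i < chunks.length; i++) {
    const embedding = data[i].embedding; // number[]
    await pool.query(
      "INSERT INTO documents (source_id, content, embedding) VALUES ($1, $2, $3)",
      [sourceId, chunks[i], JSON.stringify(embedding)],
    );
  }
}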