Crawling API
API endpoint for crawling websites and building knowledge bases.
POST /api/admin/crawl

Crawls a website starting from the provided URL, extracts text content, chunks it, generates embeddings, and stores everything in the client's knowledge base.
Request
Request Body
JSON payload with crawl configuration
| Field | Type | Required | Description |
|---|---|---|---|
| clientId | string | Required | UUID of the client |
| url | string | Required | Starting URL to crawl |
| maxPages | number | Optional | Maximum pages to crawl (default: 50) |
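The fields in the table map one-to-one onto the JSON request body. As a rough sketch (not taken from the actual client code), an equivalent call from TypeScript might look like this; the base URL and IDs are the same placeholders used in the cURL example below:

```typescript
// Sketch of calling the crawl endpoint from a TypeScript client (Node 18+
// global fetch). Response handling here is illustrative, not prescriptive.
interface CrawlRequest {
  clientId: string;
  url: string;
  maxPages?: number; // server default: 50
}

async function startCrawl(body: CrawlRequest) {
  const res = await fetch("https://sensei-ai-eight.vercel.app/api/admin/crawl", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(body),
  });

  const data = await res.json();
  if (!res.ok) {
    // 400 responses carry { error, details } (see Response below)
    throw new Error(`Crawl failed: ${data.error ?? res.statusText}`);
  }
  // 200 responses carry sourceId, pagesIndexed, documentsCreated, message
  console.log(`Indexed ${data.pagesIndexed} pages as ${data.documentsCreated} documents`);
  return data;
}

startCrawl({
  clientId: "d454991a-eddb-4e81-959d-87c868e050ca",
  url: "https://example.com",
  maxPages: 50,
}).catch(console.error);
```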
Example
cURL Request
curl -X POST https://sensei-ai-eight.vercel.app/api/admin/crawl \
  -H "Content-Type: application/json" \
  -d '{
    "clientId": "d454991a-eddb-4e81-959d-87c868e050ca",
    "url": "https://example.com",
    "maxPages": 50
  }'

Response
200 Success

{
  "success": true,
  "sourceId": "abc123-def456",
  "pagesIndexed": 25,
  "documentsCreated": 78,
  "message": "Crawl completed successfully"
}

400 Bad Request

{
  "error": "Invalid URL provided",
  "details": "URL must be a valid HTTP/HTTPS URL"
}

How Crawling Works
Crawl Process

Each page goes through the following pipeline; rough code sketches of the main stages follow the list.

1. Fetch Page: download the HTML content from the URL.
2. Extract Content: parse the HTML and extract the text content (scripts, styles, and navigation are removed).
3. Find Links: discover internal links to crawl next (same domain only).
4. Chunk Text: split the content into ~1000-character chunks with overlap.
5. Generate Embeddings: create vector embeddings for each chunk using OpenAI.
6. Store in Database: save the chunks and embeddings to the documents table.
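The first three steps form the crawl loop. The sketch below is illustrative only: it assumes Node 18+ for the global fetch and uses the cheerio library for HTML parsing, neither of which is confirmed by this documentation.

```typescript
import * as cheerio from "cheerio";

// Illustrative crawl loop (steps 1-3): breadth-first over same-domain links,
// capped at maxPages. Error handling and politeness delays are omitted.
async function crawlSite(startUrl: string, maxPages = 50): Promise<Map<string, string>> {
  const origin = new URL(startUrl).origin;
  const queue: string[] = [startUrl];
  const pages = new Map<string, string>(); // url -> extracted text

  while (queue.length > 0 && pages.size < maxPages) {
    const url = queue.shift()!;
    if (pages.has(url)) continue;

    // 1. Fetch Page: download the HTML
    const html = await (await fetch(url)).text();
    const $ = cheerio.load(html);

    // 2. Extract Content: drop scripts, styles, and navigation, keep the text
    $("script, style, nav").remove();
    pages.set(url, $("body").text().replace(/\s+/g, " ").trim());

    // 3. Find Links: queue internal links only (same origin)
    $("a[href]").each((_, el) => {
      const href = new URL($(el).attr("href")!, url);
      if (href.origin === origin) queue.push(href.origin + href.pathname);
    });
  }
  return pages;
}
```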
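Step 4 (Chunk Text) can be as simple as a sliding window. The ~1000-character size comes from the list above; the 200-character overlap below is an assumed value for illustration:

```typescript
// Illustrative chunker (step 4): ~1000-character windows with overlap so that
// text falling on a boundary appears in two adjacent chunks.
function chunkText(text: string, chunkSize = 1000, overlap = 200): string[] {
  const chunks: string[] = [];
  const step = chunkSize - overlap;
  for (let start = 0; start < text.length; start += step) {
    chunks.push(text.slice(start, start + chunkSize));
    if (start + chunkSize >= text.length) break;
  }
  return chunks;
}
```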
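Steps 5 and 6 embed each chunk and write it to the documents table. The sketch below uses the official openai Node package; the embedding model and the Supabase-style table schema are assumptions, since the documentation only states that OpenAI embeddings are stored in the documents table.

```typescript
import OpenAI from "openai";
import { createClient } from "@supabase/supabase-js";

const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment
const supabase = createClient(process.env.SUPABASE_URL!, process.env.SUPABASE_SERVICE_KEY!);

// Illustrative embed-and-store pass (steps 5-6). The model name and table
// columns are assumptions, not taken from the actual implementation.
async function embedAndStore(clientId: string, sourceId: string, chunks: string[]) {
  const { data } = await openai.embeddings.create({
    model: "text-embedding-3-small", // assumed model
    input: chunks,
  });

  const rows = chunks.map((content, i) => ({
    client_id: clientId,
    source_id: sourceId,
    content,
    embedding: data[i].embedding, // vector column (e.g. pgvector)
  }));

  const { error } = await supabase.from("documents").insert(rows);
  if (error) throw error;
}
```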
Limitations:
- Only crawls same-domain links
- Respects robots.txt when possible (a minimal check is sketched below)
- JavaScript-rendered content may not be captured
- Large sites should use maxPages to limit scope
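For the robots.txt point above, a minimal check might look like the sketch below. It only honours Disallow rules in the User-agent: * group and ignores wildcards and Allow rules, so it is a simplification of whatever the crawler actually does:

```typescript
// Minimal robots.txt check: fetches /robots.txt and tests the URL path
// against Disallow rules in the "User-agent: *" group.
async function isAllowed(pageUrl: string): Promise<boolean> {
  const { origin, pathname } = new URL(pageUrl);
  const res = await fetch(`${origin}/robots.txt`);
  if (!res.ok) return true; // no robots.txt: assume allowed

  const lines = (await res.text()).split("\n").map((l) => l.trim());
  let inStarGroup = false;
  const disallowed: string[] = [];

  for (const line of lines) {
    const [field, ...rest] = line.split(":");
    const value = rest.join(":").trim();
    if (/^user-agent$/i.test(field)) inStarGroup = value === "*";
    else if (inStarGroup && /^disallow$/i.test(field) && value) disallowed.push(value);
  }
  return !disallowed.some((prefix) => pathname.startsWith(prefix));
}
```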
Tip: For best results, crawl pages that contain the most important information about the client's products, services, pricing, and FAQs.