Crawling API
API endpoint for crawling websites and building knowledge bases.
POST /api/admin/crawl

Crawls a website starting from the provided URL, extracts text content, chunks it, generates embeddings, and stores everything in the client's knowledge base.
Request
Request Body
JSON payload with crawl configuration
| Field | Type | Required | Description |
|---|---|---|---|
| clientId | string | Required | UUID of the client |
| url | string | Required | Starting URL to crawl |
| maxPages | number | Optional | Maximum pages to crawl (default: 50) |
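The fields in the table map one-to-one onto the JSON request body. As a rough sketch (not taken from the actual client code), an equivalent call from TypeScript might look like this; the base URL and IDs are the same placeholders used in the cURL example below:

```typescript
// Sketch of calling the crawl endpoint from a TypeScript client (Node 18+
// global fetch). Response handling here is illustrative, not prescriptive.
interface CrawlRequest {
  clientId: string;
  url: string;
  maxPages?: number; // server default: 50
}

async function startCrawl(body: CrawlRequest) {
  const res = await fetch("https://sensei-ai-eight.vercel.app/api/admin/crawl", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(body),
  });

  const data = await res.json();
  if (!res.ok) {
    // 400 responses carry { error, details } (see Response below)
    throw new Error(`Crawl failed: ${data.error ?? res.statusText}`);
  }
  // 200 responses carry sourceId, pagesIndexed, documentsCreated, message
  console.log(`Indexed ${data.pagesIndexed} pages as ${data.documentsCreated} documents`);
  return data;
}

startCrawl({
  clientId: "d454991a-eddb-4e81-959d-87c868e050ca",
  url: "https://example.com",
  maxPages: 50,
}).catch(console.error);
```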
Example
cURL Request
curl -X POST https://sensei-ai-eight.vercel.app/api/admin/crawl \
  -H "Content-Type: application/json" \
  -d '{
    "clientId": "d454991a-eddb-4e81-959d-87c868e050ca",
    "url": "https://example.com",
    "maxPages": 50
  }'

Response
200 Success

{
  "success": true,
  "sourceId": "abc123-def456",
  "pagesIndexed": 25,
  "documentsCreated": 78,
  "message": "Crawl completed successfully"
}

400 Bad Request

{
  "error": "Invalid URL provided",
  "details": "URL must be a valid HTTP/HTTPS URL"
}

How Crawling Works
Crawl Process

Each page goes through the following pipeline; rough code sketches of the main stages follow the list.

1. Fetch Page: download the HTML content from the URL.
2. Extract Content: parse the HTML and extract the text content (scripts, styles, and navigation are removed).
3. Find Links: discover internal links to crawl next (same domain only).
4. Chunk Text: split the content into ~1000-character chunks with overlap.
5. Generate Embeddings: create vector embeddings for each chunk using OpenAI.
6. Store in Database: save the chunks and embeddings to the documents table.
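The first three steps form the crawl loop. The sketch below is illustrative only: it assumes Node 18+ for the global fetch and uses the cheerio library for HTML parsing, neither of which is confirmed by this documentation.

```typescript
import * as cheerio from "cheerio";

// Illustrative crawl loop (steps 1-3): breadth-first over same-domain links,
// capped at maxPages. Error handling and politeness delays are omitted.
async function crawlSite(startUrl: string, maxPages = 50): Promise<Map<string, string>> {
  const origin = new URL(startUrl).origin;
  const queue: string[] = [startUrl];
  const pages = new Map<string, string>(); // url -> extracted text

  while (queue.length > 0 && pages.size < maxPages) {
    const url = queue.shift()!;
    if (pages.has(url)) continue;

    // 1. Fetch Page: download the HTML
    const html = await (await fetch(url)).text();
    const $ = cheerio.load(html);

    // 2. Extract Content: drop scripts, styles, and navigation, keep the text
    $("script, style, nav").remove();
    pages.set(url, $("body").text().replace(/\s+/g, " ").trim());

    // 3. Find Links: queue internal links only (same origin)
    $("a[href]").each((_, el) => {
      const href = new URL($(el).attr("href")!, url);
      if (href.origin === origin) queue.push(href.origin + href.pathname);
    });
  }
  return pages;
}
```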
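Step 4 (Chunk Text) can be as simple as a sliding window. The ~1000-character size comes from the list above; the 200-character overlap below is an assumed value for illustration:

```typescript
// Illustrative chunker (step 4): ~1000-character windows with overlap so that
// text falling on a boundary appears in two adjacent chunks.
function chunkText(text: string, chunkSize = 1000, overlap = 200): string[] {
  const chunks: string[] = [];
  const step = chunkSize - overlap;
  for (let start = 0; start < text.length; start += step) {
    chunks.push(text.slice(start, start + chunkSize));
    if (start + chunkSize >= text.length) break;
  }
  return chunks;
}
```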
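Steps 5 and 6 embed each chunk and write it to the documents table. The sketch below uses the official openai Node package; the embedding model and the Supabase-style table schema are assumptions, since the documentation only states that OpenAI embeddings are stored in the documents table.

```typescript
import OpenAI from "openai";
import { createClient } from "@supabase/supabase-js";

const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment
const supabase = createClient(process.env.SUPABASE_URL!, process.env.SUPABASE_SERVICE_KEY!);

// Illustrative embed-and-store pass (steps 5-6). The model name and table
// columns are assumptions, not taken from the actual implementation.
async function embedAndStore(clientId: string, sourceId: string, chunks: string[]) {
  const { data } = await openai.embeddings.create({
    model: "text-embedding-3-small", // assumed model
    input: chunks,
  });

  const rows = chunks.map((content, i) => ({
    client_id: clientId,
    source_id: sourceId,
    content,
    embedding: data[i].embedding, // vector column (e.g. pgvector)
  }));

  const { error } = await supabase.from("documents").insert(rows);
  if (error) throw error;
}
```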
Limitations:
- Only crawls same-domain links
- Respects robots.txt when possible (a minimal check is sketched below)
- JavaScript-rendered content may not be captured
- Large sites should use maxPages to limit scope
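For the robots.txt point above, a minimal check might look like the sketch below. It only honours Disallow rules in the User-agent: * group and ignores wildcards and Allow rules, so it is a simplification of whatever the crawler actually does:

```typescript
// Minimal robots.txt check: fetches /robots.txt and tests the URL path
// against Disallow rules in the "User-agent: *" group.
async function isAllowed(pageUrl: string): Promise<boolean> {
  const { origin, pathname } = new URL(pageUrl);
  const res = await fetch(`${origin}/robots.txt`);
  if (!res.ok) return true; // no robots.txt: assume allowed

  const lines = (await res.text()).split("\n").map((l) => l.trim());
  let inStarGroup = false;
  const disallowed: string[] = [];

  for (const line of lines) {
    const [field, ...rest] = line.split(":");
    const value = rest.join(":").trim();
    if (/^user-agent$/i.test(field)) inStarGroup = value === "*";
    else if (inStarGroup && /^disallow$/i.test(field) && value) disallowed.push(value);
  }
  return !disallowed.some((prefix) => pathname.startsWith(prefix));
}
```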
Tip: For best results, crawl pages that contain the most important information about the client's products, services, pricing, and FAQs.