
File Upload for AI Chat Applications — System Design

High-level design for file upload: chunked uploads, multimodal processing, security validation, and integration with AI models.

Iteration: v1 — Core File Upload Design. Next: real-time collaborative uploads, advanced RAG pipelines, multi-region CDN optimization.


Table of Contents

  1. Problem Statement
  2. Requirements
  3. Capacity Estimations
  4. Data Model
  5. API Design
  6. High-Level Architecture
  7. Deep Dive: Core Problems
  8. Critical Tradeoffs
  9. Failure Modes & Recovery
  10. Interview Discussion Points
  11. Extensions for v2
  12. Real-World Implementations

1. Problem Statement

File upload in AI chat applications enables users to share documents, images, code files, and other media that the AI can analyze, understand, and reference during conversations. Unlike traditional file storage systems, AI chat file uploads require:

  1. Content extraction — Converting files into formats AI models can process
  2. Context integration — Making file content available within conversation context
  3. Real-time processing — Handling uploads without blocking user interaction
  4. Multimodal support — Processing diverse file types (text, images, PDFs, code)

Why is this challenging?

| Challenge | Description |
| --- | --- |
| Size limits vs. context windows | AI models have token limits; large files need intelligent chunking |
| Processing latency | Users expect immediate feedback, but extraction takes time |
| Security concerns | Files may contain malware, PII, or sensitive data |
| Format diversity | PDFs, images, spreadsheets, code — each needs different processing |
| Cost management | AI API calls are expensive; inefficient processing burns money |
| Stateful conversations | File context must persist across conversation turns |

The Core Challenge

Without proper file upload design:

  • Large files timeout or fail silently
  • AI can't access file content effectively
  • Security vulnerabilities from unvalidated uploads
  • Poor UX from synchronous blocking processing
  • Context lost when files exceed token limits

With well-designed file upload:

  • Seamless handling of large files via chunked upload
  • Intelligent content extraction and summarization
  • Secure validation pipeline before processing
  • Async processing with real-time status updates
  • Smart chunking to fit within context windows

2. Requirements

2.1 Functional Requirements

| FR# | Requirement | Description |
| --- | --- | --- |
| FR1 | File Upload | Users can upload files (drag-drop, file picker, paste) |
| FR2 | Multiple File Types | Support PDFs, images, text files, code, spreadsheets, documents |
| FR3 | Large File Support | Handle files up to 100MB with resumable uploads |
| FR4 | Content Extraction | Extract text/data from files for AI consumption |
| FR5 | Conversation Context | AI can reference uploaded files in responses |
| FR6 | File Preview | Users can preview uploaded files in the chat |
| FR7 | Download Original | Users can download the original uploaded file |
| FR8 | Progress Indication | Real-time upload and processing progress |
| FR9 | File Deletion | Users can remove files from conversation context |

2.2 Non-Functional Requirements

| NFR | Target | Why it matters |
| --- | --- | --- |
| Upload Speed | > 5MB/s for users on good connections | User experience; waiting is frustrating |
| Processing Time | < 10s for most files, < 60s for large PDFs | Users need quick AI responses |
| Availability | 99.9% uptime | Core functionality for AI conversations |
| Security | Zero malware reaching AI processing | System integrity and user trust |
| Scalability | 10K concurrent uploads | Support growth without degradation |
| Cost Efficiency | < $0.01 per file processed | Sustainable at scale |
| Data Privacy | No unauthorized data access | Compliance and user trust |

2.3 Out of Scope (v1)

  • Real-time collaborative file editing
  • Video/audio file transcription
  • File versioning and history
  • Cross-conversation file sharing
  • Advanced OCR for handwritten text
  • Encrypted file handling (E2E encrypted uploads)

3. Capacity Estimations

3.1 Scale Parameters

| Parameter | Value | Notes |
| --- | --- | --- |
| Daily active users | 1M DAU | Peak hours: 2-3x average |
| Files per user per day | 2-3 files | Power users upload more |
| Average file size | 2MB | Mix of small images and larger docs |
| Max file size | 100MB | Covers most document types |
| Peak concurrent uploads | 10,000 | During business hours |
| File retention | 90 days | Configurable per account tier |

3.2 Storage Calculations

Daily uploads:

  • Users uploading: 1M × 30% = 300K users upload daily
  • Files per uploader: 2.5 files average
  • Daily files: 300K × 2.5 = 750K files/day

Storage:

  • Average size: 2MB per file
  • Daily storage: 750K × 2MB = 1.5TB/day
  • Monthly storage: 1.5TB × 30 = 45TB/month
  • 90-day retention: ~135TB active storage

Extracted content:

  • Extraction ratio: ~10% of original size (text extraction)
  • Daily extracted: 150GB/day
  • 90-day extraction storage: ~13.5TB

3.3 Bandwidth Calculations

Upload bandwidth:

  • Peak uploads: 10,000 concurrent
  • Average upload size: 2MB
  • Upload duration: 2-5 seconds
  • Peak bandwidth: 10,000 × 2MB / 3s = 6.67 GB/s = ~54 Gbps

Download bandwidth (previews + originals):

  • Download requests: 20% of uploads = 150K/day
  • Peak downloads: 2,000 concurrent
  • Peak bandwidth: 2,000 × 2MB / 3s = 1.3 GB/s = ~10 Gbps

3.4 Processing Capacity

Processing queue:

  • Files to process: 750K/day = 8.7 files/second average
  • Peak processing: 50 files/second
  • Processing time: 5-30 seconds average
  • Workers needed: 50 files/s × 30 s = 1,500 concurrent workers (worst case)
  • With auto-scaling: 100-500 workers typical, burst to 1,500

AI API calls:

  • Files needing AI processing: 80% = 600K/day
  • Tokens per file: ~2,000 tokens average (after chunking)
  • Daily tokens: 600K × 2,000 = 1.2B tokens/day
  • Cost (at $0.01/1K tokens): $12,000/day = ~$360K/month

3.5 Infrastructure Summary

| Component | Sizing | Notes |
| --- | --- | --- |
| Object Storage (S3) | 150TB active | Plus Glacier for old files |
| CDN | 100 Gbps capacity | For preview delivery |
| Processing Workers | 100-1,500 (auto-scale) | Kubernetes pods |
| Message Queue | 100K messages/minute | SQS/Kafka |
| Metadata DB | 10TB (PostgreSQL) | File metadata, extraction results |
| Vector DB | 50TB (Pinecone/Weaviate) | For RAG embeddings |
| Cache (Redis) | 100GB | Upload sessions, rate limits |

4. Data Model

4.1 Core Entities

File Upload

FileUpload {
  id: UUID                            // Primary identifier
  conversation_id: UUID               // Parent conversation
  user_id: UUID                       // Uploader

  // Original file info
  original_filename: String           // User's filename
  content_type: String                // MIME type
  size_bytes: Long                    // Original size
  checksum: String                    // SHA-256 hash

  // Storage
  storage_key: String                 // S3 key for original
  cdn_url: String                     // CDN URL for previews (nullable)
  thumbnail_key: String               // Thumbnail S3 key (nullable)

  // Processing status
  status: Enum                        // UPLOADING, PROCESSING, READY, FAILED
  processing_started_at: Timestamp    // When processing began
  processing_completed_at: Timestamp  // When processing finished
  error_message: String               // If FAILED, why

  // Metadata
  created_at: Timestamp
  expires_at: Timestamp               // For retention policy
  deleted_at: Timestamp               // Soft delete
}

Chunked Upload Session

ChunkedUploadSession {
  id: UUID                        // Session identifier
  file_upload_id: UUID            // Target file
  user_id: UUID                   // Owner

  // Chunking config
  total_size: Long                // Expected total bytes
  chunk_size: Integer             // Size per chunk (e.g., 5MB)
  total_chunks: Integer           // Total expected chunks

  // Progress
  chunks_received: Set<Integer>   // Chunk numbers received
  bytes_received: Long            // Total bytes so far

  // Session management
  status: Enum                    // ACTIVE, COMPLETED, EXPIRED, ABORTED
  created_at: Timestamp
  expires_at: Timestamp           // Session timeout (24h)
  last_activity: Timestamp
}

Extracted Content

ExtractedContent {
  id: UUID
  file_upload_id: UUID            // Source file

  // Content type
  extraction_type: Enum           // TEXT, TABLE, IMAGE_DESCRIPTION, CODE

  // Extracted data
  content: Text                   // Extracted text/data
  page_number: Integer            // For paginated docs (nullable)
  section: String                 // Section identifier (nullable)

  // Token estimation
  token_count: Integer            // Estimated tokens

  // For images
  image_description: Text         // AI-generated description
  detected_objects: JSON          // Object detection results

  // Metadata
  extraction_method: String       // "pdfplumber", "tesseract", "gpt-4-vision"
  confidence_score: Float         // Extraction confidence
  created_at: Timestamp
}

File Chunk (for RAG)

FileChunk {
  id: UUID
  file_upload_id: UUID
  extracted_content_id: UUID

  // Chunk info
  chunk_index: Integer            // Order within file
  content: Text                   // Chunk text
  token_count: Integer            // Tokens in chunk

  // Embedding
  embedding_vector: Vector[1536]  // OpenAI ada-002 or similar
  embedding_model: String         // Model used

  // Context
  metadata: JSON                  // Page, section, headers for context
  created_at: Timestamp
}

4.2 Relationships

Conversation ──1:N──► FileUpload ◄──1:1──► ChunkedUploadSession

FileUpload ──1:N──► ExtractedContent ──1:N──► FileChunk ──► Vector DB (for RAG retrieval)

5. API Design

5.1 Upload APIs

Initiate Upload (for chunked uploads)

POST /api/v1/conversations/{conversation_id}/uploads/initiate

Request:
{
  "filename": "quarterly-report.pdf",
  "content_type": "application/pdf",
  "size_bytes": 52428800,
  "checksum": "sha256:abc123..."
}

Response: 201 Created
{
  "upload_id": "uuid",
  "session_id": "uuid",
  "chunk_size": 5242880,
  "total_chunks": 10,
  "upload_urls": [
    {
      "chunk_number": 0,
      "upload_url": "https://presigned-s3-url...",
      "expires_at": "2024-01-01T00:15:00Z"
    },
    // ... more chunks
  ],
  "expires_at": "2024-01-01T12:00:00Z"
}
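On the server side, the presigned chunk URLs in this response would typically come from an S3 multipart upload. A minimal sketch, assuming boto3 and an illustrative bucket and key layout (not the actual implementation):

```python
# Sketch: generating presigned part-upload URLs for a chunked session (assumes boto3 / S3 multipart).
import math
import boto3

s3 = boto3.client("s3")
BUCKET = "chat-uploads"          # assumed bucket name
CHUNK_SIZE = 5 * 1024 * 1024     # 5MB, matching the API example

def initiate_chunked_upload(upload_id: str, size_bytes: int, content_type: str) -> dict:
    key = f"originals/{upload_id}"
    mpu = s3.create_multipart_upload(Bucket=BUCKET, Key=key, ContentType=content_type)
    total_chunks = math.ceil(size_bytes / CHUNK_SIZE)

    upload_urls = []
    for part_number in range(1, total_chunks + 1):  # S3 part numbers start at 1
        url = s3.generate_presigned_url(
            "upload_part",
            Params={"Bucket": BUCKET, "Key": key,
                    "UploadId": mpu["UploadId"], "PartNumber": part_number},
            ExpiresIn=900,  # 15 minutes, matching the expiry in the response example
        )
        upload_urls.append({"chunk_number": part_number - 1, "upload_url": url})

    return {"multipart_upload_id": mpu["UploadId"],
            "chunk_size": CHUNK_SIZE,
            "total_chunks": total_chunks,
            "upload_urls": upload_urls}
```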

Upload Chunk

PUT /api/v1/uploads/{session_id}/chunks/{chunk_number}

Headers:
  Content-Type: application/octet-stream
  Content-Length: 5242880
  X-Chunk-Checksum: sha256:def456...

Body: <binary chunk data>

Response: 200 OK
{
  "chunk_number": 0,
  "received_bytes": 5242880,
  "chunks_completed": 1,
  "chunks_remaining": 9
}

Complete Upload

POST /api/v1/uploads/{session_id}/complete

Request:
{
  "chunk_checksums": [
    {"chunk_number": 0, "checksum": "sha256:..."},
    // ...
  ]
}

Response: 202 Accepted
{
  "upload_id": "uuid",
  "status": "PROCESSING",
  "estimated_completion_seconds": 15,
  "status_url": "/api/v1/uploads/{upload_id}/status"
}

Simple Upload (for small files < 10MB)

POST /api/v1/conversations/{conversation_id}/uploads

Headers:
  Content-Type: multipart/form-data

Body:
  file: <file binary>

Response: 202 Accepted
{
  "upload_id": "uuid",
  "status": "PROCESSING",
  "original_filename": "image.png",
  "size_bytes": 1048576
}

5.2 Status & Retrieval APIs

Get Upload Status

GET /api/v1/uploads/{upload_id}/status

Response: 200 OK
{
  "upload_id": "uuid",
  "status": "READY",                // UPLOADING | PROCESSING | READY | FAILED
  "original_filename": "quarterly-report.pdf",
  "content_type": "application/pdf",
  "size_bytes": 52428800,
  "processing_progress": 100,
  "preview_url": "https://cdn.example.com/previews/...",
  "download_url": "https://cdn.example.com/files/...",
  "extracted_summary": "Q3 financial report showing...",
  "page_count": 45,
  "token_count": 28500,
  "created_at": "2024-01-01T10:00:00Z",
  "expires_at": "2024-04-01T10:00:00Z"
}

Get File Content (for AI context)

GET /api/v1/uploads/{upload_id}/content

Query params:
  format: "full" | "summary" | "chunks"
  max_tokens: 4000
  page: 1 (for paginated access)

Response: 200 OK
{
  "upload_id": "uuid",
  "format": "chunks",
  "total_chunks": 12,
  "chunks": [
    {
      "chunk_id": "uuid",
      "content": "...",
      "token_count": 350,
      "metadata": { "page": 1, "section": "Executive Summary" }
    }
  ],
  "has_more": true,
  "next_page": 2
}

5.3 WebSocket Events (Real-time Updates)

// Client subscribes to upload events
ws://api.example.com/ws/uploads/{conversation_id}

// Server pushes events:

// Upload progress
{
  "event": "upload_progress",
  "upload_id": "uuid",
  "chunks_received": 5,
  "chunks_total": 10,
  "bytes_received": 26214400,
  "bytes_total": 52428800
}

// Processing progress
{
  "event": "processing_progress",
  "upload_id": "uuid",
  "stage": "extracting",        // validating | extracting | embedding | complete
  "progress_percent": 65,
  "current_page": 30,
  "total_pages": 45
}

// Processing complete
{
  "event": "processing_complete",
  "upload_id": "uuid",
  "status": "READY",
  "preview_url": "...",
  "summary": "..."
}

// Error
{
  "event": "processing_error",
  "upload_id": "uuid",
  "error_code": "EXTRACTION_FAILED",
  "error_message": "Unable to parse PDF structure"
}
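One way the processing workers could feed these events to the WebSocket service is a per-conversation pub/sub channel. A minimal sketch using Redis pub/sub; the channel naming and event shape mirror the examples above but are assumptions, not the actual implementation:

```python
# Sketch: worker-side progress publishing via Redis pub/sub (channel name is an assumption).
import json
import redis

r = redis.Redis(host="localhost", port=6379)

def publish_processing_progress(conversation_id: str, upload_id: str,
                                stage: str, progress_percent: int) -> None:
    event = {
        "event": "processing_progress",
        "upload_id": upload_id,
        "stage": stage,               # validating | extracting | embedding | complete
        "progress_percent": progress_percent,
    }
    # The WebSocket service subscribes to this channel and forwards events to connected clients.
    r.publish(f"uploads:{conversation_id}", json.dumps(event))
```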

6. High-Level Architecture

6.1 System Overview

CLIENT LAYER
  • Web App (React), Mobile App (React Native), Desktop (Electron), CLI
  • All clients reach the system through the CDN / edge (CloudFront)

API GATEWAY
  • Load Balancer (ALB) + WAF + rate limiting

APPLICATION LAYER
  • Upload Service: Upload API Controller, Chunk Manager, Presigned URL Generator
  • Session Store (Redis), File Metadata (PostgreSQL), WebSocket Service (real-time updates)

MESSAGE QUEUE
  • SQS / Kafka, feeding the processing layer

PROCESSING LAYER
  • Processing Workers (Kubernetes pods) running the pipeline:
    Security Scanner (malware, MIME, size) → Content Extractor (PDF parser, image OCR, doc parser, code parser) → AI Processor (embeddings, summarizer, chunker)

STORAGE LAYER
  • Object Store (S3): originals, thumbnails, previews
  • Vector Database (Pinecone/Weaviate): embeddings, chunk vectors, similarity index
  • Metadata Store (PostgreSQL): file metadata, user data, conversations

6.2 Upload Flow (Chunked)

Actors: Client, Upload Service, Redis, S3, Queue

  1. Client → Upload Service: initiate upload
  2. Upload Service → Redis: create session
  3. Upload Service → S3: generate presigned URLs
  4. Upload Service → Client: return URLs
  5. Client → S3: upload chunk directly to S3
  6. Client → Upload Service: notify chunk complete
  7. Upload Service → Redis: update session
     (repeat 5-7 for all chunks)
  8. Client → Upload Service: complete upload
  9. Upload Service → S3: verify all chunks
  10. Upload Service → S3: assemble multipart
  11. Upload Service → Queue: queue for processing
  12. Upload Service → Client: accepted

6.3 Processing Pipeline

File Processing Pipeline: Validate → Sanitize → Extract → Chunk → Embed

  • Validate: MIME check, size, malware
  • Sanitize: strip metadata, resize, convert
  • Extract: PDF / doc / image / code parsers
  • Chunk: semantic chunking, ~512 tokens with overlap
  • Embed: OpenAI ada (or similar embedding model)

Processing results:
  • Original stored in S3
  • Thumbnail/preview generated
  • Extracted text stored in PostgreSQL
  • Chunks with embeddings stored in Vector DB
  • Summary generated for quick AI context

7. Deep Dive: Core Problems

7.1 Problem: Large File Uploads

Uploading large files (>10MB) over HTTP is unreliable due to network interruptions, timeouts, and browser limitations.

Challenge Analysis

| Issue | Impact |
| --- | --- |
| Connection drops | Upload fails, user must restart |
| Browser memory | Large files consume client memory |
| Server timeout | Long uploads exceed request limits |
| Progress visibility | Users don't know if upload is working |
| Bandwidth waste | Failed uploads waste already-transmitted data |

Solution: Chunked Resumable Uploads

Chunked upload strategy:

  • Original file: 50MB, split into ten 5MB chunks
  • Chunks 1-3 upload successfully; a network failure occurs at chunk 4
  • Resume: only chunks 4-10 are re-uploaded; chunks 1-3 are never retransmitted

Implementation Components

Client-side chunking:

// Pseudocode for client-side chunking
async function uploadFile(file, sessionInfo) {
  const chunkSize = sessionInfo.chunk_size;
  const totalChunks = Math.ceil(file.size / chunkSize);

  // Track progress locally (for resume)
  const completedChunks = loadCompletedChunks(sessionInfo.session_id);

  for (let i = 0; i < totalChunks; i++) {
    if (completedChunks.has(i)) continue; // Skip completed

    const start = i * chunkSize;
    const end = Math.min(start + chunkSize, file.size);
    const chunk = file.slice(start, end);

    // Upload with retry
    await uploadChunkWithRetry(chunk, i, sessionInfo);

    // Save progress
    saveChunkProgress(sessionInfo.session_id, i);

    // Report progress
    onProgress((i + 1) / totalChunks * 100);
  }

  // Complete the upload
  await completeUpload(sessionInfo.session_id);
}

Server-side assembly:

// Pseudocode for server-side chunk assembly
public void completeMultipartUpload(String sessionId, List<ChunkChecksum> checksums) {
  ChunkedUploadSession session = sessionStore.get(sessionId);

  // Verify all chunks received
  if (session.getChunksReceived().size() != session.getTotalChunks()) {
    throw new IncompleteUploadException("Missing chunks");
  }

  // Verify checksums
  for (ChunkChecksum cs : checksums) {
    String storedChecksum = s3Client.getObjectChecksum(
      getChunkKey(sessionId, cs.getChunkNumber())
    );
    if (!storedChecksum.equals(cs.getChecksum())) {
      throw new ChecksumMismatchException(cs.getChunkNumber());
    }
  }

  // Assemble in S3 (server-side, no download needed)
  s3Client.completeMultipartUpload(
    session.getMultipartUploadId(),
    session.getCompletedParts()
  );

  // Queue for processing
  messageQueue.send(new ProcessFileMessage(session.getFileUploadId()));
}

7.2 Problem: Content Extraction at Scale

Different file types require different extraction strategies. Extraction must be fast, accurate, and cost-efficient.

File Type Processing Matrix

| File Type | Extraction Method | Processing Time | Complexity |
| --- | --- | --- | --- |
| Plain text (.txt, .md) | Direct read | < 1s | Low |
| Code files (.py, .java) | Syntax-aware parsing | 1-2s | Medium |
| PDF (text-based) | pdfplumber / PyMuPDF | 2-10s | Medium |
| PDF (scanned/image) | OCR (Tesseract/Cloud Vision) | 10-60s | High |
| Images | GPT-4 Vision / BLIP-2 | 2-5s | Medium |
| Word docs (.docx) | python-docx | 2-5s | Medium |
| Spreadsheets (.xlsx) | openpyxl with structure detection | 5-15s | High |
| Presentations (.pptx) | python-pptx + image extraction | 10-30s | High |

Extraction Pipeline Architecture

Extraction Router (input: file + MIME type):

  • Route by MIME type to a dedicated extractor:
    - Text → preserve structure
    - Code → AST + comments
    - PDF → text-based (pdfplumber / PyMuPDF) or scanned (OCR pipeline)
    - Image → vision analysis
    - Office → structure detection
  • All paths converge to a unified content output:

    {
      "text": "...",
      "structure": { "pages": [...], "sections": [...] },
      "tables": [...],
      "images": [{ "description": "..." }],
      "code_blocks": [...]
    }
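A minimal sketch of the routing step, assuming each extractor is a callable that returns the unified output shown above (the extractor functions and MIME mapping here are illustrative placeholders):

```python
# Sketch: MIME-based extraction routing (extractor functions are placeholders).
from typing import Callable, Dict

def extract_text(path: str) -> dict: ...      # Preserve structure
def extract_code(path: str) -> dict: ...      # AST + comments
def extract_pdf(path: str) -> dict: ...       # pdfplumber with OCR fallback
def extract_image(path: str) -> dict: ...     # Vision analysis
def extract_office(path: str) -> dict: ...    # Structure detection

EXTRACTORS: Dict[str, Callable[[str], dict]] = {
    "text/plain": extract_text,
    "text/markdown": extract_text,
    "text/x-python": extract_code,
    "application/pdf": extract_pdf,
    "image/png": extract_image,
    "image/jpeg": extract_image,
    "application/vnd.openxmlformats-officedocument.wordprocessingml.document": extract_office,
}

def route_extraction(path: str, mime_type: str) -> dict:
    extractor = EXTRACTORS.get(mime_type)
    if extractor is None:
        raise ValueError(f"Unsupported MIME type: {mime_type}")
    # Every extractor returns the unified shape: text, structure, tables, images, code_blocks
    return extractor(path)
```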

OCR Decision Tree (PDF Processing)

PDF received:

  1. Extract text with pdfplumber.
  2. Text found (> 100 chars)? Yes → use the extracted text (fast path).
  3. No → check whether the PDF is scanned (image-based). If there are no images per page, mark it as empty/unextractable.
  4. Images present → quality check: DPI > 150 → run OCR directly; otherwise upscale, then OCR.
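The text-versus-OCR branch of this tree could be sketched as follows; the pdfplumber calls are real, while the OCR helper and the thresholds are simplified assumptions:

```python
# Sketch: decide between direct text extraction and OCR for a PDF (thresholds from the tree above).
import pdfplumber

def extract_pdf_text(path: str) -> dict:
    with pdfplumber.open(path) as pdf:
        text = "\n".join((page.extract_text() or "") for page in pdf.pages)
        has_images = any(page.images for page in pdf.pages)

    if len(text.strip()) > 100:            # Fast path: text-based PDF
        return {"text": text, "method": "pdfplumber"}
    if not has_images:                     # No text, no images: nothing to extract
        return {"text": "", "method": "unextractable"}
    # Scanned PDF: fall back to OCR (run_ocr is a placeholder for Tesseract / Cloud Vision).
    return {"text": run_ocr(path), "method": "ocr"}

def run_ocr(path: str) -> str:
    raise NotImplementedError("OCR pipeline (Tesseract / Cloud Vision) goes here")
```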

7.3 Problem: Fitting Files into AI Context Windows

AI models have token limits (e.g., 128K for GPT-4, 200K for Claude). Large documents exceed these limits.

Context Window Challenge

Context window problem:

  • Document: 500-page PDF ≈ 300,000 tokens
  • Model context window: 128,000 tokens
  • Conversation history: 20,000 tokens
  • Available for the document: 108,000 tokens
  • Problem: the document is ~2.8x larger than the available context.

Solution Strategy: Hierarchical Retrieval

TIER 1: Summary (always in context)
  • Document summary: 500-1,000 tokens, e.g. "This is a quarterly financial report for Q3 2024 containing revenue data, expense breakdowns, and future projections..."

TIER 2: Section index (on-demand retrieval)
  • Section 1: Executive Summary (pages 1-3, 2,000 tokens)
  • Section 2: Revenue Analysis (pages 4-15, 8,000 tokens)
  • Section 3: Expense Breakdown (pages 16-25, 6,000 tokens)
  • ...

TIER 3: Semantic chunks (RAG retrieval)
  • ~600 chunks × ~500 tokens each
  • Each chunk has an embedding vector for semantic search
  • Retrieved based on similarity to the user query

Runtime example — query "What was Q3 revenue?" assembles:
  • Document summary (1,000 tokens)
  • Retrieved chunks about revenue (3,000 tokens)
  • Conversation history (5,000 tokens)
  • Total: ~9,000 tokens (fits easily in context)
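At query time, the tiers are assembled into a prompt under a token budget. A minimal sketch, assuming a hypothetical vector-store client with a search method and a rough token-counting heuristic:

```python
# Sketch: assemble AI context from summary + retrieved chunks within a token budget.
def count_tokens(text: str) -> int:
    return max(1, len(text) // 4)              # Rough heuristic: ~4 characters per token

def build_context(query: str, doc, vector_store, token_budget: int = 9000) -> str:
    parts = [doc.summary]                      # Tier 1: the summary is always included
    used = count_tokens(doc.summary)

    # Tier 3: pull the most similar chunks until the budget is exhausted
    for chunk in vector_store.search(query, file_id=doc.id, top_k=20):
        if used + chunk.token_count > token_budget:
            break
        parts.append(f"[p.{chunk.page}] {chunk.content}")
        used += chunk.token_count

    return "\n\n".join(parts)
```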

Chunking Strategy

Semantic chunking algorithm (input: extracted document text):

Step 1 — Identify natural boundaries:
  • Paragraph breaks
  • Section headers (H1, H2, H3)
  • Page breaks
  • Sentence boundaries (fallback)

Step 2 — Create chunks with overlap:
  • Target chunk size: 512 tokens
  • Overlap: 50 tokens (context continuity between adjacent chunks)

Step 3 — Preserve metadata. Each chunk includes:
  • Source page number
  • Section header
  • Table context (if from a table)
  • Previous/next chunk IDs (for context expansion)
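A simplified sketch of steps 1-2, treating paragraph breaks as the natural boundary and using a rough characters-per-token heuristic (a real implementation would use a proper tokenizer and the other boundary types):

```python
# Sketch: paragraph-boundary chunking with token overlap (sizes from the strategy above).
def chunk_text(text: str, target_tokens: int = 512, overlap_tokens: int = 50) -> list[str]:
    def tokens(s: str) -> int:
        return max(1, len(s) // 4)             # Rough heuristic: ~4 chars per token

    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks, current = [], []

    for para in paragraphs:
        if current and sum(tokens(p) for p in current) + tokens(para) > target_tokens:
            chunks.append("\n\n".join(current))
            # Carry the tail of the previous chunk forward as overlap
            tail, carried = [], 0
            for p in reversed(current):
                if carried >= overlap_tokens:
                    break
                tail.insert(0, p)
                carried += tokens(p)
            current = tail
        current.append(para)

    if current:
        chunks.append("\n\n".join(current))
    return chunks
```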

7.4 Problem: Security & Validation

Files from users can contain malware, exceed quotas, or violate content policies.

Security Layer Architecture

LAYER 1: Client-side validation (defense in depth, not trusted)
  • File extension check
  • Size limit check
  • Basic MIME type detection

LAYER 2: Upload gateway validation
  • Size enforcement (hard limit)
  • Rate limiting per user/IP
  • Content-Type header validation
  • Request signature verification

LAYER 3: Deep content validation (processing workers)
  • Magic byte verification
  • Malware scanning
  • Content policy check (NSFW, PII detection)
  • Archive bomb check
  • Image validation
  • Document structure validation (PDF/DOCX)

Files that pass continue to processing; rejected files are quarantined and an alert is raised.

Validation Rules

File type whitelist (allow-list approach):

| Category | Extensions | Max Size | Notes |
| --- | --- | --- | --- |
| Documents | .pdf, .docx, .doc, .txt | 50MB | OCR enabled |
| Spreadsheets | .xlsx, .xls, .csv | 25MB | 1M cell max |
| Presentations | .pptx, .ppt | 100MB | |
| Images | .png, .jpg, .gif, .webp | 20MB | 4096px max |
| Code | .py, .js, .java, .go, ... | 5MB | 100K lines |
| Markdown | .md, .mdx | 2MB | |
| Archives | BLOCKED | — | Security |
| Executables | BLOCKED | — | Security |

Rate limits:

| Limit Type | Free Tier | Pro Tier | Enterprise |
| --- | --- | --- | --- |
| Files per hour | 10 | 100 | 1000 |
| Total storage | 100MB | 5GB | 100GB |
| Max file size | 10MB | 50MB | 100MB |
| Concurrent uploads | 2 | 10 | 50 |
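Layer 3's magic-byte verification could look like the sketch below, assuming python-magic for content sniffing; the allow-list is abbreviated from the whitelist table:

```python
# Sketch: server-side MIME verification against the declared type (assumes python-magic).
import magic  # libmagic bindings

ALLOWED_MIME = {
    "application/pdf", "text/plain", "text/markdown",
    "image/png", "image/jpeg", "image/gif", "image/webp",
    "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
    "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
}

def validate_magic_bytes(data: bytes, declared_mime: str) -> None:
    sniffed = magic.from_buffer(data[:8192], mime=True)   # Sniff the real type from content
    if sniffed not in ALLOWED_MIME:
        raise ValueError(f"Disallowed content type: {sniffed}")
    if sniffed != declared_mime:
        # A mismatch often signals a renamed or disguised file; quarantine rather than trust headers.
        raise ValueError(f"Declared {declared_mime} but content looks like {sniffed}")
```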

8. Critical Tradeoffs

8.1 Storage Strategy

| Option | Pros | Cons | When to Use |
| --- | --- | --- | --- |
| S3 Standard | High durability, scalable | Cost at scale | Primary storage for active files |
| S3 Intelligent-Tiering | Auto cost optimization | Monitoring fees | Unknown access patterns |
| S3 Glacier | Very cheap | Retrieval latency | Archival (>90 days) |
| Local/EBS | Low latency | Limited scale, single point of failure | Processing cache only |

Recommendation: S3 Standard for active files, Glacier for archived files with lifecycle policies.


8.2 Sync vs. Async Processing

SYNCHRONOUS (small files < 5MB): user uploads → process immediately → return result
  • Pros: simple implementation, immediate feedback, no state management
  • Cons: blocks the user if processing is slow, timeout risk for larger files, resource contention under load

ASYNCHRONOUS (large files > 5MB): user uploads → queue job → return immediately; worker processes → notify via WebSocket
  • Pros: non-blocking UX, handles large files gracefully, scalable processing
  • Cons: complex state management, requires a notification mechanism, eventual consistency

RECOMMENDATION: hybrid approach
  • Files < 5MB: sync processing, immediate response
  • Files > 5MB: async with WebSocket status updates
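The hybrid rule reduces to a size check at the end of the upload path. A minimal sketch; the queue client and processor interfaces are assumptions:

```python
# Sketch: hybrid sync/async dispatch on file size (queue and processor are assumed interfaces).
import json

SYNC_THRESHOLD_BYTES = 5 * 1024 * 1024   # 5MB, per the recommendation above

def handle_upload(file_upload, queue, processor) -> dict:
    if file_upload.size_bytes <= SYNC_THRESHOLD_BYTES:
        result = processor.process(file_upload)          # Small file: process inline
        return {"status": "READY", "summary": result.summary}
    # Large file: enqueue for the worker pool and return immediately
    queue.send_message(MessageBody=json.dumps({"file_upload_id": str(file_upload.id)}))
    return {"status": "PROCESSING",
            "status_url": f"/api/v1/uploads/{file_upload.id}/status"}
```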

8.3 Direct S3 Upload vs. Server Proxy

| Approach | Pros | Cons | Best For |
| --- | --- | --- | --- |
| Direct to S3 (presigned URLs) | No server bandwidth consumed, faster for large files | Requires client-side complexity, CORS setup | Large files, high volume |
| Server proxy | Simple client, centralized validation | Server becomes a bottleneck, doubles bandwidth | Small files, low volume |
| Hybrid | Best of both | More complex architecture | Production systems |
Direct S3 upload flow:

  1. Client requests a presigned URL from the server.
  2. Server returns the URL.
  3. Client uploads directly to S3.
  4. An S3 event notification triggers a Lambda that enqueues the file for processing.

Benefits:
  • The server handles 0 bytes of file data
  • Parallel uploads to S3
  • S3 handles retries and multipart assembly

Recommendation: Direct S3 upload with presigned URLs for files > 1MB.


8.4 RAG vs. Full Context

| Approach | Pros | Cons | Best For |
| --- | --- | --- | --- |
| Full context | Complete document understanding, no retrieval errors | Token limit constraints, expensive | Small documents (<50K tokens) |
| RAG (retrieval) | Handles any document size, cost-efficient | May miss relevant context, retrieval quality varies | Large documents, knowledge bases |
| Hybrid | Best accuracy for important sections + scale for large docs | Complexity | Production systems |
Decision matrix:

| Document Size | Strategy |
| --- | --- |
| < 10K tokens | Full context (include entire document) |
| 10K-50K tokens | Summary + relevant sections |
| > 50K tokens | Summary + RAG retrieval |

| Query Type | Context Strategy |
| --- | --- |
| Specific question | RAG retrieval (precise chunks) |
| Summary request | Document summary + section summaries |
| Analysis task | Full relevant sections |
| Comparison | Multiple chunk retrieval |
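The document-size half of the matrix translates into a small selector; the thresholds come straight from the table above:

```python
# Sketch: choose a context strategy from document size (thresholds from the decision matrix).
def choose_strategy(doc_token_count: int) -> str:
    if doc_token_count < 10_000:
        return "full_context"            # Include the entire document
    if doc_token_count <= 50_000:
        return "summary_plus_sections"   # Summary + relevant sections
    return "summary_plus_rag"            # Summary + RAG retrieval
```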

8.5 Preprocessing Depth vs. Latency

| Level | Processing | Latency | Storage | Use Case |
| --- | --- | --- | --- | --- |
| Minimal | Store only, extract on-demand | ~1s upload | Low | Infrequent access |
| Standard | Extract text, basic chunking | 5-15s | Medium | Most documents |
| Deep | Extract + embed + summarize | 30-60s | High | Frequently queried docs |
| Premium | All above + multiple model analysis | 2-5min | Very high | Critical documents |

Recommendation: Standard processing by default, with option to trigger deep processing for important documents.


9. Failure Modes & Recovery

9.1 Upload Failures

| Failure | Detection | Recovery | Prevention |
| --- | --- | --- | --- |
| Network interruption | Client detects disconnect | Resume from last chunk | Chunked uploads with session persistence |
| Server timeout | 504 Gateway Timeout | Retry with exponential backoff | Async processing, proper timeouts |
| Storage failure | S3 returns 5xx | Retry to different region | Multi-region replication |
| Quota exceeded | 413 Payload Too Large | Inform user, suggest compression | Pre-flight quota check |
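Several of these recoveries share the same retry-with-exponential-backoff shape. A generic sketch; the delay constants and the error type are illustrative:

```python
# Sketch: retry with exponential backoff and jitter for transient upload/storage errors.
import random
import time

class TransientError(Exception):
    """Placeholder for 5xx / timeout errors raised by the upload or storage client."""

def retry_with_backoff(operation, max_attempts: int = 5, base_delay: float = 0.5):
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts:
                raise                                     # Give up after the last attempt
            delay = base_delay * (2 ** (attempt - 1))     # 0.5s, 1s, 2s, 4s, ...
            time.sleep(delay + random.uniform(0, delay))  # Jitter avoids synchronized retries
```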

9.2 Processing Failures

| Failure | Detection | Recovery | Prevention |
| --- | --- | --- | --- |
| Extraction timeout | Worker timeout | Retry with simpler extraction | Timeout per file type, fallback extractors |
| OCR failure | Tesseract error | Try cloud OCR, then mark as image-only | Multiple OCR providers |
| Malformed file | Parser exception | Mark as unprocessable, store original | Validate before processing |
| AI API failure | API returns 5xx | Retry with backoff, use cached embeddings | Multiple API providers, local fallback |

9.3 System Failures

Failure recovery matrix:

| Component | Failure Impact | Recovery Time |
| --- | --- | --- |
| Upload Service | New uploads fail | Auto-heal: 30s-2min |
| Processing Workers | Queue builds up | Scale up: 1-5min |
| Message Queue | Processing stops | Failover: 30s |
| S3 | Uploads/downloads fail | Region failover: 1-5min |
| PostgreSQL | Metadata unavailable | Replica promotion: 30s |
| Vector DB | RAG retrieval fails | Fallback to summaries |
| Redis | Sessions lost | Clients must re-init |

Graceful degradation strategies:
  1. Processing backlog: accept uploads, delay processing
  2. RAG unavailable: use document summaries only
  3. Embedding unavailable: serve text without semantic search
  4. CDN unavailable: serve directly from S3 (slower)

9.4 Data Recovery

Backup strategy:

| Data Type | Backup Frequency | Retention | RTO | RPO |
| --- | --- | --- | --- | --- |
| Original files (S3) | Continuous (CRR) | 90 days | 1h | 0 |
| File metadata (PG) | Hourly snapshots | 30 days | 30min | 1h |
| Extracted content | Daily backup | 30 days | 2h | 24h |
| Embeddings (Vector) | Weekly backup | 7 days | 4h | 7d |
| Upload sessions | No backup (Redis) | — | — | — |

Recovery procedures are documented in the runbook.

10. Interview Discussion Points

10.1 Clarifying Questions to Ask

  1. Scale: How many concurrent users? Expected file sizes?
  2. File types: Which formats must be supported? Video/audio?
  3. Processing requirements: Real-time or batch? Accuracy vs. speed?
  4. AI model: Which LLM? Context window size?
  5. Security: Compliance requirements (HIPAA, GDPR)?
  6. Multi-tenancy: Shared infrastructure or isolated?

10.2 Key Design Decisions to Justify

| Decision | Why | Alternative Considered |
| --- | --- | --- |
| Chunked uploads | Reliability for large files | Simple POST (fails for >10MB) |
| Presigned URLs | Offload bandwidth from servers | Proxy through server (bottleneck) |
| Async processing | Non-blocking UX | Sync (timeout issues) |
| RAG for large docs | Handle unlimited document size | Full context (token limits) |
| S3 + CDN | Scale and global delivery | Local storage (single point of failure) |

10.3 Deep Dive Topics

  • Chunking strategies: Semantic vs. fixed-size, overlap handling
  • OCR pipeline: When to use local vs. cloud, accuracy tradeoffs
  • Security: Defense in depth, malware scanning pipeline
  • Cost optimization: Caching strategies, embedding model selection
  • Real-time updates: WebSocket vs. polling, connection management

10.4 Red Flags to Avoid

  • ❌ Storing files on application servers
  • ❌ Synchronous processing for all files
  • ❌ No malware scanning
  • ❌ Trusting client-side validation
  • ❌ No rate limiting or quotas
  • ❌ Blocking on AI API calls

11. Extensions for v2

11.1 Planned Enhancements

| Feature | Description | Complexity |
| --- | --- | --- |
| Video/Audio transcription | Whisper API integration for media files | High |
| Collaborative annotations | Multiple users annotating the same document | High |
| Version history | Track file versions and changes | Medium |
| Cross-conversation files | Share files across conversations | Medium |
| Advanced OCR | Handwriting recognition, form extraction | High |
| E2E encryption | Client-side encryption for sensitive files | High |

11.2 Multi-Region Architecture

Multi-region file upload (v2): US-East, EU-West, and AP-South regions sit behind a global router (Route 53 / CloudFront).

Features:
  • Geo-based routing to the nearest region
  • Cross-region replication for disaster recovery
  • Data residency compliance (keep EU data in EU)
  • Global CDN for file delivery

12. Real-World Implementations

12.1 Reference Architectures

| Product | Approach | Notable Features |
| --- | --- | --- |
| ChatGPT | Integrated file upload | Code interpreter, image analysis |
| Claude | Direct file processing | Large context window (200K) |
| Google Workspace | Chunked uploads | Resumable uploads API |
| Dropbox | Block-level dedup | Delta sync, content hashing |
| Notion AI | Workspace-integrated | Embedded in documents |

12.2 Open Source References

  • tus.io: Resumable upload protocol
  • Uppy: File uploader with plugins
  • Minio: S3-compatible object storage
  • Apache Tika: Content extraction
  • LangChain: RAG implementation patterns

12.3 Relevant AWS Services

| Component | AWS Service | Alternative |
| --- | --- | --- |
| Object Storage | S3 | GCS, Azure Blob |
| CDN | CloudFront | Cloudflare, Akamai |
| Message Queue | SQS | Kafka, RabbitMQ |
| Processing Workers | Lambda / ECS | Kubernetes |
| Metadata DB | RDS PostgreSQL | Aurora, CockroachDB |
| Vector DB | OpenSearch | Pinecone, Weaviate |
| Cache | ElastiCache (Redis) | Memcached |
| Malware Scanning | GuardDuty + Custom | ClamAV |
| Monitoring | CloudWatch | Datadog, Prometheus |

Summary

Designing file upload for AI chat applications requires balancing:

  1. User experience: Fast uploads, real-time feedback, seamless AI integration
  2. Scalability: Handling millions of files with varying sizes
  3. Security: Protecting against malware and enforcing content policies
  4. Cost efficiency: Optimizing storage, processing, and AI API costs
  5. AI integration: Making file content accessible within context limits

The key architectural decisions are:

  • Chunked resumable uploads for reliability
  • Direct-to-S3 with presigned URLs for scale
  • Async processing pipeline for non-blocking UX
  • Hierarchical RAG for handling large documents
  • Defense-in-depth security for protection

This design provides a production-ready foundation that can scale to millions of users while maintaining security and cost efficiency.
