Building a Document Processing Platform
docsdingo.com
What Is Docs Dingo?
Docs Dingo is an online PDF processing suite. It handles the document tasks people run into constantly: merging PDFs, splitting pages, compressing files, converting between formats, and extracting text. Everything runs in the browser or on serverless infrastructure, so there is nothing to install.
Files are processed and returned immediately. Documents are not retained after processing: uploads and results are held only temporarily and deleted automatically within an hour.
Architecture Overview
Document Upload
-> Browser (client-side pre-processing when possible)
-> S3 presigned URL (direct upload)
-> Lambda (process document)
-> S3 (store result with TTL)
-> Presigned download URL returned to browser
Processing Pipeline
-> Merge: Lambda combines uploaded PDFs in order
-> Split: Lambda extracts specified page ranges
-> Compress: Lambda reduces image quality + removes metadata
-> Convert: Lambda transforms between PDF/PNG/JPG formats
Files are uploaded directly to S3 using presigned URLs, which bypasses API Gateway's 10 MB payload limit and allows documents up to 100 MB to be processed. The Lambda function reads the file from S3, processes it, writes the result back to S3, and returns a time-limited download link.
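The upload and processing flow can be sketched roughly as follows. This is a minimal illustration, not the production code: the function names, key prefixes, and the 15-minute upload window are assumptions, and `s3` is expected to be a boto3 S3 client (or anything exposing the same four methods). A presigned POST is used here because it lets S3 itself enforce the 100 MB cap through a `content-length-range` condition.

```python
import uuid

MAX_UPLOAD_BYTES = 100 * 1024 * 1024  # 100 MB cap described above
UPLOAD_URL_TTL = 15 * 60              # assumption: upload form valid for 15 minutes
DOWNLOAD_URL_TTL = 30 * 60            # download links expire after 30 minutes


def create_upload(s3, bucket):
    """Return a presigned POST the browser can use to upload straight to S3."""
    key = f"uploads/{uuid.uuid4()}.pdf"  # hypothetical key layout
    post = s3.generate_presigned_post(
        Bucket=bucket,
        Key=key,
        # S3 rejects the upload if the body falls outside this size range.
        Conditions=[["content-length-range", 1, MAX_UPLOAD_BYTES]],
        ExpiresIn=UPLOAD_URL_TTL,
    )
    return {"key": key, "url": post["url"], "fields": post["fields"]}


def process_upload(s3, bucket, key, process):
    """Read the upload, run `process` on its bytes, store the result,
    and return a time-limited download link."""
    source = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
    result_key = f"results/{uuid.uuid4()}.pdf"  # hypothetical key layout
    s3.put_object(Bucket=bucket, Key=result_key, Body=process(source))
    return s3.generate_presigned_url(
        "get_object",
        Params={"Bucket": bucket, "Key": result_key},
        ExpiresIn=DOWNLOAD_URL_TTL,
    )
```

In a Lambda function, `s3` would typically be created once with `boto3.client("s3")`, and `process` would be whichever merge, split, compress, or convert routine the request asks for.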
PDF Processing Tools
Merge combines multiple PDFs into a single document. Users drag files into order and the Lambda function concatenates them while preserving bookmarks and links. Split extracts specific page ranges or individual pages from a PDF.
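To make the Split input concrete, here is a small sketch of how a page-range specification might be validated before any PDF work starts. The comma-and-hyphen syntax (`"1-3, 5, 8-10"`) and the function name are assumptions for illustration; the actual input format may differ.

```python
def parse_page_ranges(spec, page_count):
    """Expand a spec such as "1-3, 5, 8-10" into a list of 1-based page numbers.

    Raises ValueError for malformed parts or pages outside 1..page_count.
    """
    pages = []
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate stray commas
        first, sep, last = part.partition("-")
        start = int(first)                 # int() raises ValueError on bad input
        end = int(last) if sep else start  # a single page is a one-page range
        if not 1 <= start <= end <= page_count:
            raise ValueError(f"page range {part!r} is outside 1-{page_count}")
        pages.extend(range(start, end + 1))
    return pages


print(parse_page_ranges("1-3, 5, 8-10", page_count=10))
# [1, 2, 3, 5, 8, 9, 10]
```

Validating against the document's page count up front means a bad range fails with a clear message before the file is processed.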
Compress reduces file size by downsampling images, removing duplicate resources, and stripping metadata. Files typically shrink by 40-70% without visible quality loss. Convert transforms PDFs to images (PNG/JPG) or images to PDF, handling multi-page documents correctly.
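As a quick illustration of what a 40-70% reduction means in practice, the snippet below computes the size reduction for a compressed file. The helper name is hypothetical and the sizes are made-up example values, not measurements.

```python
def reduction_percent(original_bytes, compressed_bytes):
    """Return how much smaller the compressed file is, as a percentage."""
    if original_bytes <= 0:
        raise ValueError("original size must be positive")
    return (original_bytes - compressed_bytes) * 100 / original_bytes


# A 10 MB PDF compressed to 4.5 MB is a 55% reduction, inside the typical range.
print(reduction_percent(10_000_000, 4_500_000))
# 55.0
```

By the same arithmetic, a 10 MB input in the 40-70% range comes out at roughly 3 to 6 MB.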
Privacy and Cleanup
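All uploaded and processed files are stored in S3 only temporarily and are deleted automatically after one hour. Because S3 lifecycle rules expire objects in whole days, a window this short relies on a scheduled cleanup, with the lifecycle policy acting as a backstop. Download links expire after 30 minutes. No document content is logged or indexed.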
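One way to enforce a one-hour retention window is a scheduled sweep over the bucket, sketched below. This is an assumption about the mechanism rather than a description of the deployed system: S3 lifecycle expiration is configured in days, so an hourly cutoff generally needs a job like this one. The function names are hypothetical, and `s3` is expected to behave like a boto3 S3 client.

```python
from datetime import datetime, timedelta, timezone

MAX_AGE = timedelta(hours=1)  # retention window described above


def expired_keys(contents, now, max_age=MAX_AGE):
    """Pick the keys whose LastModified timestamp is at least max_age old."""
    return [obj["Key"] for obj in contents if now - obj["LastModified"] >= max_age]


def sweep(s3, bucket, now=None):
    """Delete every expired object in the bucket and return the deleted keys."""
    now = now or datetime.now(timezone.utc)
    deleted = []
    kwargs = {"Bucket": bucket}
    while True:
        # Each page holds at most 1000 objects, which is also the
        # per-request limit for delete_objects.
        page = s3.list_objects_v2(**kwargs)
        keys = expired_keys(page.get("Contents", []), now)
        if keys:
            s3.delete_objects(
                Bucket=bucket,
                Delete={"Objects": [{"Key": key} for key in keys]},
            )
            deleted.extend(keys)
        if not page.get("IsTruncated"):
            return deleted
        kwargs["ContinuationToken"] = page["NextContinuationToken"]
```

Run every few minutes from a scheduler, a sweep like this keeps the real retention close to the stated hour, since an object survives at most one interval past its cutoff.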
For smaller operations like page counting or text extraction, processing happens entirely in the browser using client-side JavaScript libraries. The file never leaves the user's device for these lightweight tasks.