PDFs are everywhere in business invoices, tax forms, legal contracts. But when it comes to automating or analyzing them, they become a nightmare. They’re unstructured, unpredictable, and not built for machines.
For my Kaggle Generative AI Capstone Project, I set out to change that. I built a smart document assistant powered by Google’s Gemini, vector search, and retrieval-augmented generation (RAG). It classifies documents, extracts structured JSON from them, and even answers natural language questions like “Who paid the most tax?” or “List property contracts over $500K.”
To simulate a real-world document processing scenario, The dataset contains 15 synthetically generated documents:
• 🧾 Invoices
• 📄 Tax returns
• 🏡 Property sale contracts
These were handcrafted using Jinja templates and exported as PDFs with randomized values to preserve realism while avoiding private data.
Here’s the breakdown:
| Document Type | Count | Description |
|---------------------|-------|----------------------------------------------|
| Invoices | 5 | Billing statements with totals and client info |
| Tax Returns | 5 | U.S. 1040-style income and deductions |
| Property Contracts | 5 | Buyer/seller agreements and sale prices |
{
"invoice_number": "INV-2024-100",
"client_name": "Daniel Lee",
"items": [
{"name": "Consulting Services", "qty": 3, "unit_price": 185, "total": 555},
{"name": "Support Hours", "qty": 2, "unit_price": 183, "total": 366}
],
"subtotal": 921,
"tax": 92.1,
"total": 1013.1
}
Before structured data can be extracted, the system must first determine the type of document being processed.
This is achieved using few-shot prompting, where the model is given a few labeled examples such as an invoice, a tax return, and a property contract and is then asked to classify new, unseen documents based on their content.
This approach enables classification without any model fine-tuning, relying solely on the model’s general language understanding and a few representative examples.
Below is an example of the prompt used for classification:
Classify the type of the following document as one of the following:
invoice, tax_return, property_contract.Document:
INVOICE
Invoice Number: INV-2024-100
Date: 2025-04-05
Due Date: 2025-05-05
Billed To: Daniel Lee...
Type: invoice
Document:
U.S. Individual Income Tax Return
Taxpayer: Alex Miller
Wages: $40,000...
Type: tax_return
Document:
[Insert new document here...]
Type:
In response to the prompt, Gemini correctly identified the document as an invoice. The model made this decision by recognizing keywords such as “Invoice Number”, “Billed To”, and monetary line items — all indicative of a typical billing document.
Predicted Type:
“invoice”
This classification method generalizes well across varied layouts and content styles, making it highly scalable for processing large volumes of business documents without requiring any labeled training data.