Turning PDFs into Structured Intelligence with Generative AI: My Kaggle Capstone Experience | by amir tabatabaei | Apr, 2025

PDFs are everywhere in business: invoices, tax forms, legal contracts. But when it comes to automating or analyzing them, they become a nightmare. They’re unstructured, unpredictable, and not built for machines.

For my Kaggle Generative AI Capstone Project, I set out to change that. I built a smart document assistant powered by Google’s Gemini, vector search, and retrieval-augmented generation (RAG). It classifies documents, extracts structured JSON from them, and even answers natural language questions like “Who paid the most tax?” or “List property contracts over $500K.”

To simulate a real-world document processing scenario, I used a dataset of 15 synthetically generated documents of three types:

• 🧾 Invoices

• 📄 Tax returns

• 🏡 Property sale contracts

These were handcrafted using Jinja templates and exported as PDFs with randomized values to preserve realism while avoiding private data.

Here’s the breakdown:

| Document Type       | Count | Description                                  |
|---------------------|-------|----------------------------------------------|
| Invoices            | 5     | Billing statements with totals and client info |
| Tax Returns         | 5     | U.S. 1040-style income and deductions         |
| Property Contracts  | 5     | Buyer/seller agreements and sale prices       |
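As a rough illustration of the generation step, the sketch below fills a plain-text invoice template with randomized values. It is a stand-in under stated assumptions, not the project's code: it uses Python's built-in `string.Template` instead of Jinja, the helper names (`make_invoice`, `render_invoice`), item names, and value ranges are hypothetical, the 10% tax rate is assumed, and the PDF export step is left out.

```python
import random
from string import Template

# Hypothetical plain-text layout; the project itself used Jinja templates.
INVOICE_TEMPLATE = Template(
    "INVOICE\n"
    "Invoice Number: $invoice_number\n"
    "Billed To: $client_name\n"
    "$item_lines\n"
    "Subtotal: $subtotal\n"
    "Tax: $tax\n"
    "Total: $total\n"
)

CLIENTS = ["Daniel Lee", "Alex Miller"]  # names reused from the article's examples
ITEM_NAMES = ["Consulting Services", "Support Hours"]
TAX_RATE = 0.10  # assumed rate, for illustration only


def make_invoice(seq: int, rng: random.Random) -> dict:
    """Build one invoice record with randomized but internally consistent values."""
    items = []
    for name in ITEM_NAMES:
        qty = rng.randint(1, 5)
        unit_price = rng.randint(50, 250)
        items.append(
            {"name": name, "qty": qty, "unit_price": unit_price, "total": qty * unit_price}
        )
    subtotal = sum(item["total"] for item in items)
    tax = round(subtotal * TAX_RATE, 2)
    return {
        "invoice_number": f"INV-2024-{seq}",
        "client_name": rng.choice(CLIENTS),
        "items": items,
        "subtotal": subtotal,
        "tax": tax,
        "total": round(subtotal + tax, 2),
    }


def render_invoice(invoice: dict) -> str:
    """Fill the text template with the values of one invoice record."""
    item_lines = "\n".join(
        f"{item['name']}: {item['qty']} x {item['unit_price']} = {item['total']}"
        for item in invoice["items"]
    )
    return INVOICE_TEMPLATE.substitute(
        invoice_number=invoice["invoice_number"],
        client_name=invoice["client_name"],
        item_lines=item_lines,
        subtotal=invoice["subtotal"],
        tax=invoice["tax"],
        total=invoice["total"],
    )


if __name__ == "__main__":
    rng = random.Random(42)  # fixed seed so the synthetic data is reproducible
    print(render_invoice(make_invoice(100, rng)))
```

Deriving the subtotal, tax, and total from the random line items, rather than randomizing them independently, keeps each synthetic document arithmetically plausible.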

A single invoice, represented as structured JSON, looks like this:

{
  "invoice_number": "INV-2024-100",
  "client_name": "Daniel Lee",
  "items": [
    {"name": "Consulting Services", "qty": 3, "unit_price": 185, "total": 555},
    {"name": "Support Hours", "qty": 2, "unit_price": 183, "total": 366}
  ],
  "subtotal": 921,
  "tax": 92.1,
  "total": 1013.1
}
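A record like this can be sanity-checked before it is used downstream, since the line totals, subtotal, tax, and total should agree with one another. The sketch below is illustrative only: the function name `check_invoice` and the rounding tolerance are assumptions, not part of the project.

```python
import json
import math

# The sample invoice from the article, as a JSON string.
SAMPLE = """
{
  "invoice_number": "INV-2024-100",
  "client_name": "Daniel Lee",
  "items": [
    {"name": "Consulting Services", "qty": 3, "unit_price": 185, "total": 555},
    {"name": "Support Hours", "qty": 2, "unit_price": 183, "total": 366}
  ],
  "subtotal": 921,
  "tax": 92.1,
  "total": 1013.1
}
"""


def check_invoice(invoice: dict, tol: float = 0.005) -> list[str]:
    """Return a list of consistency problems; an empty list means the numbers add up."""
    problems = []
    for item in invoice["items"]:
        expected = item["qty"] * item["unit_price"]
        if not math.isclose(expected, item["total"], abs_tol=tol):
            problems.append(f"line total mismatch for {item['name']}")
    line_sum = sum(item["total"] for item in invoice["items"])
    if not math.isclose(line_sum, invoice["subtotal"], abs_tol=tol):
        problems.append("subtotal does not equal the sum of line totals")
    # Compare with a tolerance because 92.1 is not exactly representable as a float.
    if not math.isclose(invoice["subtotal"] + invoice["tax"], invoice["total"], abs_tol=tol):
        problems.append("total does not equal subtotal + tax")
    return problems


if __name__ == "__main__":
    print(check_invoice(json.loads(SAMPLE)))  # → []
```

For the sample, 3 × 185 = 555 and 2 × 183 = 366, which sum to the subtotal of 921; adding the tax of 92.1 gives the total of 1013.1.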

Before structured data can be extracted, the system must first determine the type of document being processed.

This is achieved using few-shot prompting: the model is given a few labeled examples, such as an invoice, a tax return, and a property contract, and is then asked to classify new, unseen documents based on their content.

This approach enables classification without any model fine-tuning, relying solely on the model’s general language understanding and a few representative examples.

Below is an example of the prompt used for classification:

Classify the type of the following document as one of the following:
invoice, tax_return, property_contract.
Document:
INVOICE  
Invoice Number: INV-2024-100  
Date: 2025-04-05  
Due Date: 2025-05-05  
Billed To: Daniel Lee...
Type: invoice
Document:
U.S. Individual Income Tax Return  
Taxpayer: Alex Miller  
Wages: $40,000...
Type: tax_return
Document:
[Insert new document here...]
Type:
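A prompt of this shape can be assembled programmatically from a list of labeled examples. The following is a minimal sketch, assuming the examples are kept as (text, label) pairs; the names `FEW_SHOT_EXAMPLES` and `build_classification_prompt` are hypothetical, the example texts are abbreviated as in the prompt above, and the call that sends the prompt to Gemini is deliberately omitted.

```python
LABELS = ["invoice", "tax_return", "property_contract"]

# Labeled examples, abbreviated the same way as in the article's prompt.
FEW_SHOT_EXAMPLES = [
    ("INVOICE\nInvoice Number: INV-2024-100\nBilled To: Daniel Lee...", "invoice"),
    ("U.S. Individual Income Tax Return\nTaxpayer: Alex Miller\nWages: $40,000...", "tax_return"),
]


def build_classification_prompt(document_text: str) -> str:
    """Assemble the instruction, the labeled examples, and the new document into one prompt."""
    parts = [
        "Classify the type of the following document as one of the following:\n"
        + ", ".join(LABELS)
        + "."
    ]
    for text, label in FEW_SHOT_EXAMPLES:
        parts.append(f"Document:\n{text}\nType: {label}")
    # The prompt ends on an open "Type:" so the model completes it with a label.
    parts.append(f"Document:\n{document_text}\nType:")
    return "\n\n".join(parts)


if __name__ == "__main__":
    print(build_classification_prompt("[Insert new document here...]"))
```

Keeping the examples in a list means a new document type can be supported by adding one label and one example, with no change to the prompt-building code.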

In response to the prompt, Gemini correctly identified the document as an invoice. The decision is consistent with cues in the text such as “Invoice Number”, “Billed To”, and monetary line items, all of which are typical of a billing document.

Predicted Type: “invoice”
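Model replies do not always come back as a bare label; they may include quotes, capitalization, or trailing punctuation. A small normalization step can map the raw reply onto the allowed labels. This is a hedged sketch: `parse_label` is a hypothetical helper, and returning `None` for anything unrecognized is one possible design choice, not the project's documented behavior.

```python
from typing import Optional

LABELS = ("invoice", "tax_return", "property_contract")


def parse_label(reply: str) -> Optional[str]:
    """Map a raw model reply to one of the allowed labels, or None if it matches none."""
    # Drop surrounding whitespace, straight or curly quotes, and trailing periods,
    # then normalize case and spacing so "Tax Return." becomes "tax_return".
    cleaned = reply.strip().strip("\"'“”.").strip().lower().replace(" ", "_")
    return cleaned if cleaned in LABELS else None


if __name__ == "__main__":
    print(parse_label("“invoice”"))  # → invoice
    print(parse_label("receipt"))    # → None
```

Rejecting unknown labels instead of guessing makes it easy to route unclassifiable documents to manual review.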

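Because this method relies on a handful of examples in the prompt rather than a trained classifier, it requires no labeled training data, and it can be extended to new layouts, content styles, or document types by adding examples. That makes it a practical starting point for processing large volumes of business documents.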
