Blog: Getting structured data out of images with Granite Vision 4.1 #48
Draft
planetf1 wants to merge 7 commits into generative-computing:main from planetf1:blog/granite-vision-structured-extraction
a55318a docs: draft blog — structured image extraction with Granite Vision 4.1 (planetf1)
f6b6d27 chore: rename receipt image to match blog post slug (planetf1)
ca35393 fix: resolve CI lint error, bug in validation_fn, and add conclusion (planetf1)
abdc88f fix: IVR check and narrative conclusion in vision structured extracti… (planetf1)
4033878 wip: update receipt image and sync blog to 6-item receipt (planetf1)
20df5a9 blog: fix strategy class and strengthen IVR section (planetf1)
f0be9bb blog: address ajbozarth review comments (planetf1)
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,260 @@ | ||
| --- | ||
| title: "Getting Structured Data Out of Images with Granite Vision 4.1" | ||
| date: "2026-06-03" | ||
| author: "Nigel Jones" | ||
| excerpt: "Vision models return prose. This post shows how to get a typed Python object back instead, using Mellea's format= parameter and ImageBlock." | ||
| tags: ["vision", "structured-output", "granite", "IVR", "image-extraction"] | ||
| --- | ||
|
|
||
| Vision models narrate. Hand one a receipt and you get three paragraphs describing it — and now | ||
| you're writing a parser for natural language, which is exactly what you were trying to avoid. | ||
|
|
||
| The usual workaround is to append "respond with JSON matching this schema" to your prompt, hope | ||
| the model complies, catch the `json.JSONDecodeError` when it doesn't, and wonder why extracting | ||
| a number from a picture turned into a reliability project. | ||
|
|
||
| There's a cleaner path. | ||
|
|
||
| --- | ||
|
|
||
| > **EDITORIAL NOTE — remove before publishing** | ||
| > | ||
| > **Status:** Draft — scenario is still being iterated to produce a compelling IVR repair | ||
| > demonstration. Code structure and Mellea API usage are stable; receipt values and the exact | ||
| > IVR check may change before publication. | ||
| > | ||
| > **Model availability:** This blog is written for Ollama (final published form — don't change | ||
| > the code examples). Granite Vision 4.1 is not yet in the Ollama library, but is expected | ||
| > there soon; it is also available in safetensors form on Hugging Face. For testing and review, | ||
| > run the code against mlx-vlm instead: | ||
| > | ||
| > ```bash | ||
| > mkdir granite-vision-test && cd granite-vision-test | ||
| > uv init --bare --python 3.12 | ||
| > uv add mlx-vlm mellea pillow | ||
| > uv run python -m mlx_vlm.server --model ibm-granite/granite-vision-4.1-4b | ||
| > # Serves at http://localhost:8080/v1 — model downloads (~8 GB) on first run. | ||
| > # This is the full bfloat16 safetensors weights, not a quantized GGUF — | ||
| > # expect roughly double the size you'd see from an Ollama pull. | ||
| > ``` | ||
| > | ||
| > Then change the session setup in each code block from: | ||
| > `m = start_session(model_id="granite-vision-4.1")` | ||
| > to: | ||
| > `m = MelleaSession(OpenAIBackend("ibm-granite/granite-vision-4.1-4b", base_url="http://localhost:8080/v1", api_key="mlx"))` | ||
| > | ||
| > Watch the [Ollama library](https://ollama.com/library) for `granite-vision-4.1`. When it | ||
| > lands: delete this note, verify `ollama pull granite-vision-4.1` works, publish. | ||
|
|
||
| --- | ||
|
|
||
| ## Running locally | ||
|
|
||
| [Granite Vision 4.1](https://huggingface.co/ibm-granite/granite-vision-4.1-4b) runs locally | ||
| on Ollama. No API key, no cloud bill: | ||
|
|
||
| ```bash | ||
| ollama pull granite-vision-4.1 | ||
| uv add mellea pillow | ||
| ``` | ||
|
|
||
| ## The problem | ||
|
|
||
| Here's the receipt we'll work with — a deli order with non-trivial quantities and a | ||
| partially smudged subtotal (thermal printer wear): | ||
|
|
||
|  | ||
|
|
||
| Start with the naïve approach: ask the model to describe it. | ||
|
|
||
| ```python | ||
| from mellea import start_session | ||
| from mellea.core import ImageBlock | ||
| from PIL import Image | ||
|
|
||
| m = start_session(model_id="granite-vision-4.1") | ||
| img = ImageBlock.from_pil_image(Image.open("receipt.jpg")) | ||
|
|
||
| result = m.instruct("What's on this receipt?", images=[img]) | ||
| print(result) | ||
| ``` | ||
|
|
||
| Output: | ||
|
|
||
| ```text | ||
| "This receipt is from Grove Street Deli in Portland, dated March 22nd 2026, | ||
| order #2231. It lists three cold brew coffees at $4.75 each, two grain bowls | ||
| at $12.95 each, four granola bars at $2.95 each, three oat milk add-ons at | ||
| $0.75 each, one avocado toast at $11.50, and two blueberry muffins at $3.95 | ||
| each. The subtotal is $73.60, tax at 8.5% is $6.26, for a total of $79.86." | ||
| ``` | ||
|
|
||
| Output will vary — models describe the same receipt differently. The structure of the problem is the same. | ||
|
|
||
| Readable. Useless as data. You can't do `result.total` or `result.items[0].unit_price`. | ||
|
|
||
| ## The return type is the extraction schema | ||
|
|
||
| Define what you want as a Pydantic model and pass it to `format=`. Mellea uses constrained | ||
| decoding to guarantee the output matches — no prompt-engineering the JSON shape, no parse | ||
| errors to catch. | ||
|
|
||
| ```python | ||
| from pydantic import BaseModel | ||
| from mellea import start_session | ||
| from mellea.core import ImageBlock | ||
| from PIL import Image | ||
|
|
||
|
|
||
| class LineItem(BaseModel): | ||
|     description: str | ||
|     quantity: int | ||
|     unit_price: float | ||
|
|
||
|
|
||
| class Receipt(BaseModel): | ||
|     vendor: str | ||
|     date: str | ||
|     items: list[LineItem] | ||
|     subtotal: float | ||
|     tax: float | ||
|     total: float | ||
|
|
||
|
|
||
| m = start_session(model_id="granite-vision-4.1") | ||
| img = ImageBlock.from_pil_image(Image.open("receipt.jpg")) | ||
|
|
||
| result = m.instruct("Extract the receipt data.", images=[img], format=Receipt) | ||
| receipt = Receipt.model_validate_json(str(result)) | ||
|
|
||
| print(receipt.vendor) # "Grove Street Deli" | ||
| print(receipt.total) # 79.86 | ||
| print(receipt.items[0].quantity) # 3 | ||
| ``` | ||
|
|
||
| `ImageBlock.from_pil_image()` converts any PIL image to the base64 PNG the backends expect. | ||
| `format=Receipt` switches the model into constrained decoding. `model_validate_json` gives you | ||
| a fully typed Python object with IDE autocomplete on every field. | ||
|
|
||
| ## When the type isn't enough | ||
|
|
||
| Pydantic catches structural failures: wrong shape, missing fields, values that can't be coerced. | ||
| It won't catch semantic ones. If the model reads the total as `-22.19`, that's valid JSON. | ||
| If it parses the date as `"March 22"` instead of `"2026-03-22"`, the field is populated | ||
| but wrong. | ||
|
|
||
| `requirements=` handles this. Pass plain-English constraints; if the first attempt fails one, | ||
| Mellea repairs and retries with the failure reason fed back into the prompt: | ||
|
|
||
| ```python | ||
| from mellea.stdlib.sampling import RepairTemplateStrategy | ||
|
|
||
| result = m.instruct( | ||
|     "Extract the receipt data.", | ||
|     images=[img], | ||
|     format=Receipt, | ||
|     requirements=[ | ||
|         "total must be a positive number", | ||
|         "date must be in ISO 8601 format (YYYY-MM-DD)", | ||
|         "each item's unit_price must be a positive number", | ||
|     ], | ||
|     strategy=RepairTemplateStrategy(loop_budget=3), | ||
| ) | ||
| receipt = Receipt.model_validate_json(str(result)) | ||
| ``` | ||
|
|
||
| Worth being clear about the limit: requirements validate the *extracted values*, not whether | ||
| they match what's physically in the image. A requirement catches the model hallucinating a | ||
| negative total; it can't verify the number on screen was $79.86 rather than $78.86. For that | ||
| you need an external check. | ||
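
For reference, the three plain-English requirements above correspond to checks like the following. This is only a sketch of the intended semantics in plain Python, run on the raw JSON with the standard library; it is not how Mellea evaluates requirements internally, and `semantic_problems` is a hypothetical helper name.

```python
import json
import re
from datetime import date


def semantic_problems(json_str: str) -> list[str]:
    """Return a list of violated constraints; an empty list means all passed."""
    data = json.loads(json_str)
    problems = []

    if not data["total"] > 0:
        problems.append("total must be a positive number")

    # Require the exact YYYY-MM-DD shape, then confirm it is a real calendar date.
    if not re.fullmatch(r"\d{4}-\d{2}-\d{2}", data["date"]):
        problems.append("date must be in ISO 8601 format (YYYY-MM-DD)")
    else:
        try:
            date.fromisoformat(data["date"])
        except ValueError:
            problems.append("date is not a valid calendar date")

    for item in data["items"]:
        if not item["unit_price"] > 0:
            problems.append(f"unit_price for {item['description']!r} must be positive")

    return problems


bad = json.dumps({
    "total": -22.19,
    "date": "March 22",
    "items": [{"description": "cold brew", "quantity": 3, "unit_price": 4.75}],
})
print(semantic_problems(bad))
```

Both failures here are structurally valid JSON, which is why the type alone cannot catch them.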
|
|
||
| ## When to reach for IVR | ||
|
|
||
| If you have a concrete verifiable property — something independent of the image — wire it as a | ||
| `validation_fn`. Mellea runs it on each attempt and feeds the failure reason back into the | ||
| repair prompt if it fails. | ||
|
|
||
| Receipt arithmetic is the natural case here: the sum of every line item (quantity × unit price) | ||
| must equal the subtotal. The model reads each line independently, so with line totals like | ||
| `3 × $4.75` and `4 × $2.95`, a single misread digit is enough to make the items and the | ||
| subtotal disagree, especially when the printed subtotal is partially obscured. The validation | ||
| function catches it and tells the model exactly what went wrong: | ||
|
|
||
| ```python | ||
| from mellea.stdlib.requirements import req, simple_validate | ||
| from mellea.stdlib.sampling import RepairTemplateStrategy | ||
|
|
||
|
|
||
| def check_line_items(json_str: str) -> tuple[bool, str]: | ||
|     r = Receipt.model_validate_json(json_str) | ||
|     computed = round(sum(i.quantity * i.unit_price for i in r.items), 2) | ||
|     if abs(computed - r.subtotal) > 0.01: | ||
|         # Format to two decimals so the repair message reads like money (73.60, not 73.6). | ||
|         return False, f"line items sum to {computed:.2f}, subtotal shows {r.subtotal:.2f}" | ||
|     return True, "" | ||
|
|
||
|
|
||
| result = m.instruct( | ||
|     "Extract the receipt data.", | ||
|     images=[img], | ||
|     format=Receipt, | ||
|     requirements=[ | ||
|         "total must be a positive number", | ||
|         req("line items sum to subtotal", validation_fn=simple_validate(check_line_items)), | ||
|     ], | ||
|     strategy=RepairTemplateStrategy(loop_budget=3), | ||
| ) | ||
| receipt = Receipt.model_validate_json(str(result)) | ||
| ``` | ||
|
|
||
| When the check fails, Mellea feeds the error string back into the next attempt — | ||
| `"line items sum to 73.60, subtotal shows 70.60"` — so the model knows which numbers | ||
| to revisit rather than starting from scratch. | ||
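
As a worked check, the six line items from the narrated receipt earlier in the post do sum to the printed subtotal. The sketch below redoes that arithmetic with `decimal.Decimal` so cents stay exact. The item values are taken from the sample output above; `items_sum` and `mismatch_message` are illustrative names, not part of Mellea.

```python
from decimal import Decimal, ROUND_HALF_UP

# (quantity, unit price) pairs as narrated in the sample output.
ITEMS = [
    (3, "4.75"),   # cold brew coffee
    (2, "12.95"),  # grain bowl
    (4, "2.95"),   # granola bar
    (3, "0.75"),   # oat milk add-on
    (1, "11.50"),  # avocado toast
    (2, "3.95"),   # blueberry muffin
]


def items_sum(items: list[tuple[int, str]]) -> Decimal:
    """Sum quantity x unit price exactly, with no float rounding."""
    return sum((qty * Decimal(price) for qty, price in items), Decimal("0"))


def mismatch_message(computed: Decimal, printed_subtotal: Decimal) -> str:
    """Empty string when the numbers agree, otherwise a repair hint."""
    if computed == printed_subtotal:
        return ""
    return f"line items sum to {computed}, subtotal shows {printed_subtotal}"


computed = items_sum(ITEMS)
tax = (computed * Decimal("0.085")).quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)
total = computed + tax

print(computed, tax, total)  # 73.60 6.26 79.86
print(mismatch_message(computed, Decimal("70.60")))
```

A subtotal misread as 70.60 disagrees with the items by exactly three dollars, which gives the model a specific digit to re-examine.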
|
|
||
| The general progression: `format=` alone → `requirements=` for semantic constraints → | ||
| `validation_fn` when you have something concrete to verify programmatically. Most image | ||
| extraction stops at step two. Reach for `validation_fn` when you'd be writing the same check | ||
| in post-processing anyway — it belongs in the prompt loop, not after it. | ||
|
|
||
| ## Swapping backends | ||
|
|
||
| `ImageBlock` is backend-agnostic. The only thing that changes is the session setup: | ||
|
|
||
| ```python | ||
| # Ollama (this post) | ||
| from mellea import start_session | ||
| m = start_session(model_id="granite-vision-4.1") | ||
|
|
||
| # Any OpenAI-compatible endpoint (vLLM, mlx-vlm, cloud) | ||
| from mellea import MelleaSession | ||
| from mellea.backends.openai import OpenAIBackend | ||
| m = MelleaSession(OpenAIBackend("ibm-granite/granite-vision-4.1-4b", | ||
|                                 base_url="http://localhost:8080/v1", api_key="mlx")) | ||
| ``` | ||
|
|
||
| The `instruct` call — `images=`, `format=`, `requirements=`, `strategy=` — is identical | ||
| across all backends. | ||
|
|
||
| ## From narration to data | ||
|
|
||
| Vision models are already good at reading documents — they just default to telling you about | ||
| them rather than handing you the data. `format=` shifts that. | ||
|
|
||
| The more important thing to understand about `requirements=` and `validation_fn` is what they | ||
| guarantee. Detection is reliable for the property you check: if the line items don't sum to | ||
| the subtotal, the validation function flags it on every attempt. Repair depends on model | ||
| capacity. A 4b model working from a partially obscured image will not always correct itself | ||
| in three tries; a larger model usually will. The point of wiring the check programmatically | ||
| is that an arithmetically inconsistent extraction can no longer pass silently. Repair success | ||
| is a separate question. | ||
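
Since repair can run out of budget, one way to keep detection explicit is to re-run the same checks on whatever string the loop finally produced and route failures to review. The sketch below assumes you have the final JSON string in hand and that each check has the `(ok, reason)` shape used by `check_line_items`. `accept_or_flag` and `subtotal_check` are hypothetical names, and the post does not show how Mellea itself reports an exhausted budget.

```python
import json
from typing import Callable

# Same shape as check_line_items: takes the JSON string, returns (ok, reason).
Check = Callable[[str], tuple[bool, str]]


def accept_or_flag(final_json: str, checks: list[Check]) -> tuple[bool, list[str]]:
    """Re-run every check on the final output and collect failure reasons."""
    reasons = []
    for check in checks:
        ok, reason = check(final_json)
        if not ok:
            reasons.append(reason)
    return (not reasons, reasons)


def subtotal_check(json_str: str) -> tuple[bool, str]:
    # Stand-in for check_line_items that needs only the standard library.
    r = json.loads(json_str)
    computed = round(sum(i["quantity"] * i["unit_price"] for i in r["items"]), 2)
    if abs(computed - r["subtotal"]) > 0.01:
        return False, f"line items sum to {computed:.2f}, subtotal shows {r['subtotal']:.2f}"
    return True, ""


final = json.dumps({"items": [{"quantity": 3, "unit_price": 4.75}], "subtotal": 11.25})
accepted, reasons = accept_or_flag(final, [subtotal_check])
if not accepted:
    print("needs review:", reasons)
```

Keeping this gate outside the generation call means a failed repair ends in a review queue rather than in your data.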
|
|
||
| ## Going further | ||
|
|
||
| - [Use Images and Vision Models](https://docs.mellea.ai/how-to/use-images-and-vision) — | ||
| image loading, backend configuration, multi-image prompts | ||
| - [Enforce Structured Output](https://docs.mellea.ai/how-to/enforce-structured-output) — | ||
| `format=`, `@generative`, and constrained decoding in detail | ||
| - [The Requirements System](https://docs.mellea.ai/concepts/requirements-system) — | ||
| how `Requirement`, `ValidationResult`, and `simple_validate` work together | ||
| - [Instruct-Validate-Repair](https://docs.mellea.ai/concepts/instruct-validate-repair) — | ||
| the IVR loop, sampling strategies, and repair prompts explained | ||
| - [Write Custom Verifiers](https://docs.mellea.ai/how-to/write-custom-verifiers) — | ||
| validation functions beyond simple string checks | ||