Danni Yang, Sitao Chen, Changyao Tian
If you find our work helpful, please give us a ⭐ or cite our paper. See the InternVL-U technical report appendix for more details.
- [2026/03/06] TextEdit benchmark released.
- [2026/03/06] Evaluation code released.
- [2026/03/06] Leaderboard updated with latest models.
- Precise spatial alignment
- Font and style consistency
- Background preservation
- Layout-constrained reasoning
We introduce TextEdit, a high-quality, multi-scenario benchmark designed to evaluate fine-grained text editing capabilities in image generation models.
TextEdit covers a diverse set of real-world and virtual scenarios, spanning 18 subcategories with a total of 2,148 high-quality source images and manually annotated edited ground-truth images.
To assess model performance, we combine classic OCR and image-fidelity metrics with modern multimodal-LLM-based evaluation across target accuracy, text preservation, scene integrity, local realism, and visual coherence. This dual-track protocol enables comprehensive assessment.
Our goal is to provide a standardized, realistic, and scalable benchmark for text editing research.
## Full Benchmark Results
| Models | # Params | Real OA | Real OP | Real OR | Real F1 | Real NED | Real CLIP | Real AES | Virtual OA | Virtual OP | Virtual OR | Virtual F1 | Virtual NED | Virtual CLIP | Virtual AES |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| **Generation Models** |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |
| Qwen-Image-Edit | 20B | 0.75 | 0.68 | 0.66 | 0.67 | 0.71 | 0.75 | 5.72 | 0.78 | 0.75 | 0.73 | 0.74 | 0.75 | 0.81 | 5.21 |
| GPT-Image-1.5 | - | 0.74 | 0.69 | 0.67 | 0.68 | 0.68 | 0.75 | 5.78 | 0.73 | 0.72 | 0.71 | 0.71 | 0.70 | 0.80 | 5.28 |
| Nano Banana Pro | - | 0.77 | 0.72 | 0.70 | 0.71 | 0.72 | 0.75 | 5.79 | 0.80 | 0.78 | 0.77 | 0.78 | 0.78 | 0.81 | 5.28 |
| **Unified Models** |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |
| Lumina-DiMOO | 8B | 0.22 | 0.23 | 0.19 | 0.20 | 0.19 | 0.69 | 5.53 | 0.22 | 0.25 | 0.21 | 0.22 | 0.20 | 0.72 | 4.76 |
| Ovis-U1 | 2.4B+1.2B | 0.40 | 0.37 | 0.34 | 0.35 | 0.35 | 0.72 | 5.32 | 0.37 | 0.40 | 0.38 | 0.39 | 0.33 | 0.75 | 4.66 |
| BAGEL | 7B+7B | 0.60 | 0.59 | 0.53 | 0.55 | 0.55 | 0.74 | 5.71 | 0.57 | 0.60 | 0.56 | 0.57 | 0.54 | 0.78 | 5.19 |
| InternVL-U (Ours) | 2B+1.7B | 0.77 | 0.73 | 0.70 | 0.71 | 0.72 | 0.75 | 5.70 | 0.79 | 0.77 | 0.75 | 0.75 | 0.77 | 0.80 | 5.12 |
| Models | # Params | Real TA | Real TP | Real SI | Real LR | Real VC | Real Avg | Virtual TA | Virtual TP | Virtual SI | Virtual LR | Virtual VC | Virtual Avg |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| **Generation Models** |  |  |  |  |  |  |  |  |  |  |  |  |  |
| Qwen-Image-Edit | 20B | 0.92 | 0.82 | 0.75 | 0.57 | 0.80 | 0.77 | 0.57 | 0.79 | 0.92 | 0.80 | 0.77 | 0.77 |
| GPT-Image-1.5 | - | 0.96 | 0.94 | 0.86 | 0.80 | 0.93 | 0.90 | 0.82 | 0.93 | 0.96 | 0.91 | 0.87 | 0.90 |
| Nano Banana Pro | - | 0.96 | 0.95 | 0.85 | 0.88 | 0.93 | 0.91 | 0.87 | 0.92 | 0.96 | 0.94 | 0.89 | 0.92 |
| **Unified Models** |  |  |  |  |  |  |  |  |  |  |  |  |  |
| Lumina-DiMOO | 8B | 0.17 | 0.06 | 0.04 | 0.02 | 0.05 | 0.09 | 0.02 | 0.06 | 0.16 | 0.05 | 0.03 | 0.08 |
| Ovis-U1 | 2.4B+1.2B | 0.31 | 0.12 | 0.12 | 0.07 | 0.18 | 0.18 | 0.06 | 0.16 | 0.31 | 0.14 | 0.13 | 0.19 |
| BAGEL | 7B+7B | 0.68 | 0.60 | 0.38 | 0.35 | 0.56 | 0.53 | 0.38 | 0.51 | 0.68 | 0.62 | 0.42 | 0.54 |
| InternVL-U (Ours) | 2B+1.7B | 0.94 | 0.90 | 0.71 | 0.80 | 0.80 | 0.88 | 0.87 | 0.86 | 0.91 | 0.82 | 0.62 | 0.83 |
## Mini-set Benchmark Results (500 samples)
| Models | # Params | Real OA | Real OP | Real OR | Real F1 | Real NED | Real CLIP | Real AES | Virtual OA | Virtual OP | Virtual OR | Virtual F1 | Virtual NED | Virtual CLIP | Virtual AES |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| **Generation Models** |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |
| Qwen-Image-Edit | 20B | 0.76 | 0.69 | 0.67 | 0.67 | 0.70 | 0.75 | 5.81 | 0.74 | 0.71 | 0.70 | 0.70 | 0.70 | 0.80 | 5.27 |
| GPT-Image-1.5 | - | 0.72 | 0.68 | 0.66 | 0.67 | 0.67 | 0.75 | 5.85 | 0.68 | 0.69 | 0.68 | 0.68 | 0.65 | 0.80 | 5.32 |
| Nano Banana Pro | - | 0.76 | 0.71 | 0.69 | 0.70 | 0.70 | 0.75 | 5.86 | 0.77 | 0.76 | 0.75 | 0.75 | 0.76 | 0.81 | 5.32 |
| **Unified Models** |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |
| Lumina-DiMOO | 8B | 0.20 | 0.22 | 0.18 | 0.19 | 0.19 | 0.70 | 5.58 | 0.22 | 0.25 | 0.21 | 0.22 | 0.19 | 0.73 | 4.87 |
| Ovis-U1 | 2.4B+1.2B | 0.37 | 0.34 | 0.32 | 0.32 | 0.33 | 0.72 | 5.39 | 0.39 | 0.41 | 0.38 | 0.39 | 0.33 | 0.74 | 4.75 |
| BAGEL | 7B+7B | 0.61 | 0.59 | 0.52 | 0.54 | 0.54 | 0.74 | 5.79 | 0.53 | 0.58 | 0.53 | 0.55 | 0.51 | 0.78 | 5.25 |
| InternVL-U (Ours) | 2B+1.7B | 0.77 | 0.74 | 0.70 | 0.71 | 0.71 | 0.76 | 5.79 | 0.74 | 0.72 | 0.69 | 0.70 | 0.72 | 0.79 | 5.14 |
| Models | # Params | Real TA | Real TP | Real SI | Real LR | Real VC | Real Avg | Virtual TA | Virtual TP | Virtual SI | Virtual LR | Virtual VC | Virtual Avg |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| **Generation Models** |  |  |  |  |  |  |  |  |  |  |  |  |  |
| Qwen-Image-Edit | 20B | 0.93 | 0.85 | 0.77 | 0.55 | 0.78 | 0.80 | 0.60 | 0.82 | 0.91 | 0.81 | 0.74 | 0.76 |
| GPT-Image-1.5 | - | 0.97 | 0.94 | 0.86 | 0.79 | 0.92 | 0.91 | 0.85 | 0.93 | 0.95 | 0.92 | 0.83 | 0.88 |
| Nano Banana Pro | - | 0.96 | 0.95 | 0.85 | 0.86 | 0.92 | 0.91 | 0.87 | 0.92 | 0.96 | 0.93 | 0.87 | 0.92 |
| **Unified Models** |  |  |  |  |  |  |  |  |  |  |  |  |  |
| Lumina-DiMOO | 8B | 0.16 | 0.04 | 0.04 | 0.02 | 0.06 | 0.08 | 0.02 | 0.05 | 0.19 | 0.07 | 0.03 | 0.10 |
| Ovis-U1 | 2.4B+1.2B | 0.29 | 0.11 | 0.11 | 0.08 | 0.20 | 0.17 | 0.04 | 0.16 | 0.35 | 0.18 | 0.15 | 0.22 |
| BAGEL | 7B+7B | 0.68 | 0.61 | 0.38 | 0.34 | 0.59 | 0.53 | 0.36 | 0.52 | 0.69 | 0.64 | 0.40 | 0.54 |
| InternVL-U (Ours) | 2B+1.7B | 0.94 | 0.91 | 0.72 | 0.73 | 0.75 | 0.89 | 0.88 | 0.87 | 0.90 | 0.78 | 0.57 | 0.79 |
You can download images from this page. The TextEdit benchmark data is organized under `data/` by split and category:
- Virtual (categories `1.x.x`): synthetic/virtual scene images
- Real (categories `2.x`): real-world scene images
Evaluation prompts are provided under eval_prompts/ in two subsets:
| Subset | Directory | Description |
|---|---|---|
| Fullset | `eval_prompts/fullset/` | Complete benchmark with all samples |
| Miniset (500) | `eval_prompts/miniset/` | 500-sample subset uniformly sampled from the fullset |
Each `.jsonl` file contains per-sample fields: `id`, `prompt`, `original_image`, `gt_image`, `source_text`, `target_text`, `gt_caption`.
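As a minimal sketch of consuming these files, the loader below yields one record per line; the field names come from the list above, but the exact path layout inside each record is an assumption:

```python
import json

def load_prompts(jsonl_path):
    """Yield one benchmark record per line of an eval_prompts .jsonl file.

    Each record is a dict with the documented fields (id, prompt,
    original_image, gt_image, source_text, target_text, gt_caption).
    """
    with open(jsonl_path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines defensively
                yield json.loads(line)
```
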
Use your model to run image-editing inference, then organize the outputs in the folder structure shown below to facilitate evaluation.
```
output/
└── internvl-u/                    # your model name
    ├── 1.1.1                      # category name
    │   ├── 1007088003726.0.jpg    # model output images
    │   ├── 1013932004096.0.jpg
    │   └── ...
    ├── 1.1.2
    ├── 1.1.3
    ├── ...
    └── 2.7
```
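Before running evaluation, it can help to check that every expected output image exists. The helper below is a hypothetical convenience (not part of the repo); `expected` maps category names such as `"1.1.1"` to image file names, typically derived from the eval-prompt `.jsonl` files:

```python
from pathlib import Path

def missing_outputs(output_root, model_name, expected):
    """Return (category, image_name) pairs the model has not produced.

    output_root: the output/ directory described above.
    expected: dict mapping category name -> list of image file names.
    """
    model_dir = Path(output_root) / model_name
    missing = []
    for category, images in expected.items():
        for img in images:
            if not (model_dir / category / img).is_file():
                missing.append((category, img))
    return missing
```
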
Classic metrics evaluate text editing quality using OCR-based text accuracy, image-text alignment, and aesthetic quality. All metrics are reported separately for Virtual and Real splits.
| Abbreviation | Metric | Description |
|---|---|---|
| OA | OCR Accuracy | Whether the target text is correctly rendered in the editing region |
| OP | OCR Precision | Precision of text content (target + background) in the generated image |
| OR | OCR Recall | Recall of text content (target + background) in the generated image |
| F1 | OCR F1 | Harmonic mean of OCR Precision and Recall |
| NED | Normalized Edit Distance | ROI-aware normalized edit distance between target and generated text |
| CLIP | CLIPScore | CLIP-based image-text alignment score |
| AES | Aesthetic Score | Predicted aesthetic quality score of the generated image |
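The benchmark's NED is ROI-aware, and its values are reported so that higher is better. As a rough illustration only (the actual implementation in the eval scripts may differ), a plain edit-distance similarity can be sketched as:

```python
def levenshtein(a: str, b: str) -> int:
    """Standard dynamic-programming Levenshtein edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def ned_similarity(pred: str, target: str) -> float:
    """1 - ED / max(len): 1.0 for an exact match, 0.0 for a total miss."""
    if not pred and not target:
        return 1.0
    return 1.0 - levenshtein(pred, target) / max(len(pred), len(target))
```
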
Evaluation scripts are provided separately for fullset and miniset:
- `eval_scripts/classic_metrics_eval_full.sh`: evaluate on the full benchmark
- `eval_scripts/classic_metrics_eval_mini.sh`: evaluate on the 500-sample miniset
Step 1. Edit the configuration variables at the top of the evaluation script (e.g., `eval_scripts/classic_metrics_eval_full.sh`) to match your project directory:

```shell
MODELS="model-a,model-b,model-c"                     # comma-separated list of model names to evaluate
path="your_project_path_here"
CACHE_DIR="$path/TextEdit/checkpoint"                # directory for all model checkpoints (OCR, CLIP, etc.)
BENCHMARK_DIR="$path/TextEdit/eval_prompts/fullset"
GT_ROOT_DIR="$path/TextEdit/data"                    # root path for original & GT images
MODEL_OUTPUT_ROOT="$path/TextEdit/output"            # root path for model inference outputs
OUTPUT_DIR="$path/TextEdit/result/classic_fullset"   # root path for classic-metric evaluation results
```

Note: all required model checkpoints (PaddleOCR, CLIP, aesthetic model, etc.) should be placed under the `CACHE_DIR` directory.
Step 2. Run the evaluation shell script to score your model outputs:

```shell
# Fullset evaluation
bash eval_scripts/classic_metrics_eval_full.sh

# Miniset evaluation
bash eval_scripts/classic_metrics_eval_mini.sh
```

Results are saved as `{model_name}.json` under the output directory, containing per-sample scores and aggregated metrics for both the Virtual and Real splits.
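As an illustration of post-processing the saved results, the sketch below averages per-sample scores by split. It assumes each sample record carries a `split` field (`"Real"`/`"Virtual"`) and one value per metric; the actual JSON layout is defined by the eval scripts and may differ:

```python
import json
from statistics import mean

def summarize(result_path, metrics=("OA", "OP", "OR", "F1", "NED", "CLIP", "AES")):
    """Average per-sample metric scores for each split (assumed layout)."""
    with open(result_path, encoding="utf-8") as f:
        samples = json.load(f)
    report = {}
    for split in ("Real", "Virtual"):
        rows = [s for s in samples if s.get("split") == split]
        if rows:
            report[split] = {m: round(mean(s[m] for s in rows), 4) for m in metrics}
    return report
```
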
Our VLM-based evaluation uses Gemini-3-Pro-Preview as an expert judge to score text editing quality across five fine-grained dimensions. The evaluation is a two-step pipeline.
| Abbreviation | Metric | Description |
|---|---|---|
| TA | Text Accuracy | Spelling correctness and completeness of the target text (1β5) |
| TP | Text Preservation | Preservation of non-target background text (1β5) |
| SI | Scene Integrity | Geometric stability of non-edited background areas (1β5) |
| LR | Local Realism | Inpainting quality, edge cleanness, and seamlessness (1β5) |
| VC | Visual Coherence | Style matching (font, lighting, shadow, texture harmony) (1β5) |
| Avg | Weighted Average | Weighted average of all five dimensions (default weights: 0.4 / 0.3 / 0.1 / 0.1 / 0.1) |
All raw scores (1β5) are normalized to 0β1 for reporting. A cutoff mechanism is available: if TA (Q1) < 4, the remaining dimensions are set to 0, reflecting that a failed text edit invalidates other quality dimensions.
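The cutoff and weighting can be sketched as follows. The linear normalization `(s - 1) / 4` is an assumption (the source only states that 1-5 scores are mapped to 0-1), and the function name is illustrative:

```python
def vlm_weighted_score(raw, weights=(0.4, 0.3, 0.1, 0.1, 0.1), enable_cutoff=True):
    """Combine raw judge scores (TA, TP, SI, LR, VC), each on a 1-5 scale.

    Normalizes each score to 0-1 via (s - 1) / 4 (assumed mapping), then
    applies the cutoff: if TA < 4, the other four dimensions are zeroed,
    since a failed text edit invalidates them.
    """
    ta = raw[0]
    scores = [(s - 1) / 4 for s in raw]
    if enable_cutoff and ta < 4:
        scores = [scores[0], 0.0, 0.0, 0.0, 0.0]
    return sum(w * s for w, s in zip(weights, scores))
```
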
Send (Original Image, GT Image, Edited Image) triplets to the Gemini API for scoring.
Configure and run `eval_scripts/vlm_metrics_eval_step1.sh`:

```shell
API_KEY="your_gemini_api_key_here"
BASE_URL="your_gemini_api_base_url_here"

python eval_pipeline/vlm_metrics_eval_step1.py \
    --input_data_dir <your_path>/TextEdit/eval_prompts/fullset \
    --model_output_root <your_path>/TextEdit/output \
    --gt_data_root <your_path>/TextEdit/data \
    --output_base_dir <your_path>/TextEdit/result/vlm_gemini_full_answers \
    --model_name "gemini-3-pro-preview" \
    --models "model-a,model-b,model-c" \
    --api_key "$API_KEY" \
    --base_url "$BASE_URL" \
    --num_workers 64
```

Per-model `.jsonl` answer files are saved under `output_base_dir`.
Aggregate the per-sample Gemini responses into a final report.
Configure and run `eval_scripts/vlm_metrics_eval_step2.sh`:

```shell
# Fullset report
python eval_pipeline/vlm_metrics_eval_step2.py \
    --answer_dir <your_path>/TextEdit/result/vlm_gemini_full_answers \
    --output_file <your_path>/TextEdit/result/gemini_report_fullset.json \
    --weights 0.4 0.3 0.1 0.1 0.1 \
    --enable_cutoff

# Miniset report
python eval_pipeline/vlm_metrics_eval_step2.py \
    --answer_dir <your_path>/TextEdit/result/vlm_gemini_mini_answers \
    --output_file <your_path>/TextEdit/result/gemini_report_miniset.json \
    --weights 0.4 0.3 0.1 0.1 0.1 \
    --enable_cutoff
```

Key parameters:
- `--weights`: weights for Q1-Q5 (default: `0.4 0.3 0.1 0.1 0.1`).
- `--enable_cutoff`: enable the cutoff mechanism; if Q1 < 4, Q2-Q5 are set to 0.
The output includes a JSON report, a CSV table, and a Markdown-formatted leaderboard printed to the console.
If you find our TextEdit Bench useful, please cite our InternVL-U technical report using this BibTeX.

