Pixels vs. Numbers: PDF Charts Your RAG Can Query

June 27, 2026

Tags: docling, mineru, chart-extraction, rag, document-parsing

Every PDF pipeline eventually meets the same wall: the chart. A quarterly revenue bar graph is trivial for a human - four bars, four labels, four numbers - but for most document parsers it is a dead end. The parser sees a rectangle of pixels, crops it out, drops it into a folder, and moves on. Your RAG system is left holding a .jpg. It cannot answer "what was Q1 2026 revenue?" because that number was never turned into text - it stayed locked inside the image.

I ran the same investor presentation through two of the best open-source document parsers available today - MinerU (OpenDataLab) and Docling (IBM, with the Granite Vision chart-extraction model) - to see exactly where they diverge. The headline result is not subtle. Faced with an identical "Revenue ($M)" bar chart, one tool gave me a picture, and the other gave me a table I can query.

This article walks through why, step by step.

The test: one chart, two parsers

The source is a single slide from SMR-1Q26-Presentation.pdf - a navy-blue bar chart titled Revenue ($M) with four bars:

Revenue ($M)

  • Q2 2025 → $8.1
  • Q3 2025 → $8.2
  • Q4 2025 → $1.8
  • Q1 2026 → $0.6

What MinerU produced

MinerU correctly located the chart and even classified it. Here is the exact record from its _content_list.json:

{
  "type": "chart",
  "img_path": "images/58bd777e...db196463.jpg",
  "content": "",
  "chart_caption": [],
  "chart_footnote": [],
  "bbox": [572, 148, 944, 885],
  "page_idx": 9
}

Look closely at two fields. "type": "chart" shows that MinerU knows the region is a chart. "content": "" shows that it extracted nothing from it. The output Markdown reflects this:

Liquidity & Capital Resources ($M)
![](images/58bd777e...db196463.jpg)

The chart is now a 42 KB JPEG. The numbers $8.1, $8.2, $1.8, $0.6 exist only as colored pixels inside that image. To a downstream LLM, this region is opaque.

See it in the published repo: the cropped chart JPEG, MinerU's full Markdown output, and the _content_list.json record.
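A record like the one above is easy to audit before indexing. The following is a minimal sketch, assuming the content list is a JSON array of region records; the helper name find_unread_charts and the extra text record are illustrative, not part of MinerU.

```python
import json

# Assumed file layout: a JSON array of region records. The chart record is the
# one shown above (elided image path kept as-is); the text record is hypothetical.
CONTENT_LIST_JSON = """
[
  {"type": "text", "text": "Liquidity & Capital Resources ($M)", "page_idx": 9},
  {"type": "chart", "img_path": "images/58bd777e...db196463.jpg", "content": "",
   "chart_caption": [], "chart_footnote": [],
   "bbox": [572, 148, 944, 885], "page_idx": 9}
]
"""


def find_unread_charts(records):
    """Return chart records whose 'content' field is empty or missing."""
    return [
        r for r in records
        if r.get("type") == "chart" and not (r.get("content") or "").strip()
    ]


records = json.loads(CONTENT_LIST_JSON)
unread = find_unread_charts(records)
for r in unread:
    # Each hit is a chart whose numbers never reached the text corpus.
    print(f"page_idx={r['page_idx']} bbox={r['bbox']} -> {r['img_path']}")
```

Counting these records per document gives a rough measure of how much chart data is missing from the index.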

What Docling + Granite Vision produced

From the identical slide, Docling emitted this directly into the Markdown - no image, no OCR guesswork:

Bar chart

| Quarter   |   Value |
|-----------|---------|
| Q2 2025   |     8.1 |
| Q3 2025   |     8.2 |
| Q4 2025   |     1.8 |
| Q1 2026   |     0.6 |

This comes straight from the published Docling output: the Revenue table sits at lines 248–255 of the Markdown file, and the same values are also written out as a machine-readable charts CSV.

Every value matches the slide exactly. The chart is now structured data. A RAG system can read line 255 - | Q1 2026 | 0.6 | - and answer the revenue question without ever touching an image model.
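To make "a table I can query" concrete, here is a minimal sketch that parses the Markdown table above with the standard library only and looks up one quarter. The helper name parse_pipe_table is an assumption for illustration; it handles simple pipe tables like this one, not every Markdown table variant.

```python
MARKDOWN_TABLE = """\
| Quarter   |   Value |
|-----------|---------|
| Q2 2025   |     8.1 |
| Q3 2025   |     8.2 |
| Q4 2025   |     1.8 |
| Q1 2026   |     0.6 |
"""


def parse_pipe_table(markdown):
    """Parse a simple pipe table into a list of dicts keyed by the header row."""
    rows = []
    for line in markdown.strip().splitlines():
        cells = [c.strip() for c in line.strip().strip("|").split("|")]
        # Skip the separator row, which contains only dashes and colons.
        if all(c and set(c) <= set("-:") for c in cells):
            continue
        rows.append(cells)
    header, body = rows[0], rows[1:]
    return [dict(zip(header, r)) for r in body]


rows = parse_pipe_table(MARKDOWN_TABLE)
revenue_by_quarter = {r["Quarter"]: float(r["Value"]) for r in rows}
print(revenue_by_quarter["Q1 2026"])
```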

Same PDF. Same chart. One produced a photograph; the other produced a spreadsheet.

Step by step: why the outputs differ

The difference is not accuracy or model size - it is pipeline design. Each tool makes a deliberate decision about what to do when it meets a figure.

Step 1 - Layout detection (both tools do this)

Both parsers start by segmenting the page into regions: text blocks, titles, tables, and figures. On this run MinerU used its PP-DocLayoutV2 layout detector, with paddleocr_torch for text, unimernet_hf_small_2503 for formulas, and SlanetPlus/UnetStructure for tables; Docling ran its docling-layout-heron layout model with TableFormer for tables. At this stage, both tools agree: "there is a figure here, at this bounding box." So far, identical.

Step 2 - Figure classification (the fork in the road)

Here the paths split.

  • MinerU tags the region with a type - and in newer versions it even distinguishes chart from a plain image. But classification is where its visual understanding ends. There is no downstream model whose job is to read the chart's data. So the region is cropped, saved as a JPEG, and content is left empty.

  • Docling, when you enable do_chart_extraction, automatically enables picture classification as a gate. Each detected picture is classified - bar chart, pie chart, line chart, or "not a chart." A positive classification doesn't end the process; it triggers the next stage.

This is the crucial architectural insight: for MinerU, "chart" is a label. For Docling, "chart" is a routing decision that sends the image to a specialist model.
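The label-versus-routing distinction can be sketched in a few lines. This is a conceptual illustration only, not code from either project: label_only, route, and fake_chart_reader are hypothetical names, and the reader is a canned stand-in for a vision model.

```python
# Chart classes taken from the supported types named in this article.
CHART_LABELS = {"bar_chart", "pie_chart", "line_chart"}


def label_only(label, image_path):
    """'Chart' as a label: record the type, keep the crop, extract nothing."""
    return {"type": label, "img_path": image_path, "content": ""}


def route(label, image_path, chart_reader):
    """'Chart' as a routing decision: chart labels trigger a reader stage."""
    record = {"type": label, "img_path": image_path, "content": ""}
    if label in CHART_LABELS:
        record["content"] = chart_reader(image_path)
    return record


def fake_chart_reader(image_path):
    # Hypothetical stand-in for a chart-to-CSV model; returns canned text.
    return "Quarter,Value\nQ1 2026,0.6"


print(label_only("bar_chart", "chart.jpg"))
print(route("bar_chart", "chart.jpg", fake_chart_reader))
print(route("photo", "photo.jpg", fake_chart_reader))
```

Both functions receive the same label; only the second one acts on it.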

Step 3 - Chart-to-data reconstruction (only Docling has this)

When Docling classifies a picture as a chart, it hands the cropped image to a dedicated IBM Granite Vision model. On the machine used for this test, Docling's default chart-extraction kind is GRANITE_VISION_V4, which loads ibm-granite/granite-vision-4.1-4b - a 4-billion-parameter vision-language model - and drives it with a <chart2csv> task token. (Docling also ships the lighter granite-vision-3.3-2b-chart2csv-preview, which is steered with the natural-language prompt "Convert the information in this chart into a data table in CSV format.")

Either way, the model was trained on code-generated synthetic charts, so it has learned to map bar heights, axis ticks, data labels, and category names back into rows and columns. It outputs structured CSV - not a caption, not a description, but the actual numbers. The supported chart types are bar_chart, pie_chart, and line_chart.

MinerU has no equivalent stage. Its VLM and OCR models are tuned to read text and formulas, not to reverse-engineer a plot's data. There is simply no chart2csv step in its graph, so a chart can never become a table.

Step 4 - Serialization into the document

Docling takes the model's tabular output and threads it back into the document object as a real table. In code, the extracted data lives on the picture item itself:

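from docling_core.types.doc import PictureItem

# `doc` is the DoclingDocument produced by the conversion (e.g. result.document).
for item, _level in doc.iterate_items():
    # Only pictures that went through chart extraction carry tabular_chart.
    if isinstance(item, PictureItem) and item.meta and item.meta.tabular_chart:
        chart_type = item.meta.classification.get_main_prediction().class_name
        table = item.meta.tabular_chart.chart_data
        # Rebuild a dense row-by-column grid from the flat list of cells.
        grid = [[""] * table.num_cols for _ in range(table.num_rows)]
        for cell in table.table_cells:
            grid[cell.start_row_offset_idx][cell.start_col_offset_idx] = cell.text
        # grid is now [["Q2 2025","8.1"], ["Q3 2025","8.2"], ...]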

That grid is what becomes the Markdown table - and, if you want, a CSV or an Excel sheet. MinerU's serializer, by contrast, has only an empty content string and an image path to write, so it emits an ![](...) link.
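As a rough illustration of that last hop, the sketch below renders a grid as a Markdown pipe table and as CSV using only the standard library. It is not Docling's serializer: grid_to_markdown and grid_to_csv are hypothetical helpers, and the grid here is assumed to carry the header as its first row.

```python
import csv
import io


def grid_to_markdown(grid):
    """Render a list-of-rows grid (first row = header) as a pipe table."""
    widths = [max(len(row[i]) for row in grid) for i in range(len(grid[0]))]

    def fmt(row):
        cells = (cell.ljust(w) for cell, w in zip(row, widths))
        return "| " + " | ".join(cells) + " |"

    separator = "|" + "|".join("-" * (w + 2) for w in widths) + "|"
    return "\n".join([fmt(grid[0]), separator] + [fmt(r) for r in grid[1:]])


def grid_to_csv(grid):
    """Render the same grid as CSV text with Unix line endings."""
    buf = io.StringIO()
    csv.writer(buf, lineterminator="\n").writerows(grid)
    return buf.getvalue()


# Assumed grid shape: header row first, then one row per bar.
grid = [
    ["Quarter", "Value"],
    ["Q2 2025", "8.1"],
    ["Q3 2025", "8.2"],
    ["Q4 2025", "1.8"],
    ["Q1 2026", "0.6"],
]
markdown = grid_to_markdown(grid)
csv_text = grid_to_csv(grid)
print(markdown)
print(csv_text)
```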

Putting it in a table

| Stage | MinerU | Docling + Granite Vision |
|-------|--------|--------------------------|
| Locate the figure | ✅ Yes | ✅ Yes |
| Classify as "chart" | ✅ Yes (type: chart) | ✅ Yes (as a routing gate) |
| Read chart data | ❌ None (content: "") | granite-vision-4.1-4b (<chart2csv>) |
| Output for this chart | A .jpg image crop | A Markdown/CSV data table |
| RAG can query the values | ❌ No (pixels) | ✅ Yes (text) |

Why this matters for RAG

A retrieval pipeline can only retrieve what was turned into text. When MinerU emits a chart as an image, the four most important numbers on that slide vanish from the searchable corpus. Your vector index never embeds "$0.6M Q1 2026 revenue," because that string was never produced. The user asks the question, the retriever finds the surrounding prose, and the model either says "I don't know" or - worse - hallucinates a figure.

Docling closes that gap. By reconstructing the chart into | Q1 2026 | 0.6 |, it makes the visual data a first-class, embeddable, retrievable citizen of the document. The chart stops being "dark matter" and becomes part of the knowledge base.
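The retrieval gap is easy to reproduce with a toy scorer. The sketch below uses plain token overlap as a crude stand-in for embedding similarity, which is an assumption made for illustration; a real pipeline would use a vector index, but the point is the same: text that was never produced cannot be matched.

```python
import re


def tokens(text):
    """Lowercase word/number tokens; punctuation such as '|' and '?' is dropped."""
    # The '.' in the character class keeps decimals like 0.6 as one token.
    return set(re.findall(r"[a-z0-9.]+", text.lower()))


def overlap(query, chunk):
    """Count how many query tokens also appear in the chunk."""
    return len(tokens(query) & tokens(chunk))


# The two chunks as each parser emitted them for this slide.
mineru_chunk = (
    "Liquidity & Capital Resources ($M)\n"
    "![](images/58bd777e...db196463.jpg)"
)
docling_chunk = (
    "Bar chart\n"
    "| Quarter   |   Value |\n"
    "|-----------|---------|\n"
    "| Q2 2025   |     8.1 |\n"
    "| Q3 2025   |     8.2 |\n"
    "| Q4 2025   |     1.8 |\n"
    "| Q1 2026   |     0.6 |\n"
)

query = "what was Q1 2026 revenue?"
mineru_score = overlap(query, mineru_chunk)
docling_score = overlap(query, docling_chunk)
print(mineru_score, docling_score)
```

The image-link chunk shares no tokens with the question, while the table chunk matches on the quarter.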

The honest caveats

This is not a claim that Docling is universally better - it's a claim that it does something MinerU structurally does not. A few things to keep in mind:

  • Chart-to-data is probabilistic. Granite Vision reconstructs values from pixels; on unconventional or cluttered charts it can misread a bar or invent a row. Always treat extracted chart tables as high-quality estimates, not audited figures.
  • It costs compute. The chart2csv stage is an extra 4-billion-parameter VLM pass per chart - meaningfully slower, and it really wants a GPU.
  • MinerU is excellent at what it targets. For dense text, formulas (→ LaTeX), and tables (→ HTML) across 109 languages, MinerU is fast and strong. If your documents are text- and equation-heavy rather than chart-heavy, the missing chart2csv step may not matter to you at all.
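Because extracted chart tables are estimates, a cheap sanity check before indexing can catch the most obvious misreads. This is a minimal sketch under stated assumptions: check_chart_rows is a hypothetical helper, rows are (label, value) string pairs, and the two checks shown (numeric values, no repeated labels) are examples rather than a complete validation scheme.

```python
def check_chart_rows(rows):
    """Return human-readable warnings for a list of (label, value) string pairs."""
    warnings = []
    seen = set()
    for label, value in rows:
        # A bar chart value that does not parse as a number is suspect.
        try:
            float(value)
        except ValueError:
            warnings.append(f"non-numeric value {value!r} for {label!r}")
        # A repeated category label may indicate an invented or duplicated row.
        if label in seen:
            warnings.append(f"duplicate label {label!r}")
        seen.add(label)
    return warnings


clean = [("Q2 2025", "8.1"), ("Q3 2025", "8.2"), ("Q4 2025", "1.8"), ("Q1 2026", "0.6")]
suspect = [("Q1 2026", "0.6"), ("Q1 2026", "O.6")]  # hypothetical misread: letter O

print(check_chart_rows(clean))
print(check_chart_rows(suspect))
```

Rows that raise warnings can be routed to manual review or indexed with a lower-trust flag.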

Conclusion

The same PDF, the same bar chart, two philosophies. MinerU treats a chart as an object to preserve - it finds it, labels it, and faithfully saves the picture. Docling, armed with Granite Vision, treats a chart as data to recover - it finds it, routes it to a specialist model, and reconstructs the numbers behind the bars.

For document archival, MinerU's image-faithful output is perfectly reasonable. But for any pipeline whose whole purpose is to make a document answerable - RAG, financial analysis, research aggregation - the difference between 58bd777e....jpg and | Q1 2026 | 0.6 | is the difference between a parser that sees your charts and one that actually reads them.

Appendix: exactly what ran on this machine

For reproducibility, here is the precise stack behind the two outputs above. Both ran locally on an NVIDIA GeForce RTX 5070 (Blackwell, sm_120) with PyTorch 2.12.1+cu130 (CUDA 13.0), on the same source file SMR-1Q26-Presentation.pdf.

Docling (chart → table):

  • docling 2.107.0, docling-core 2.85.0, docling-ibm-models 3.13.3, transformers 5.8.1
  • Layout: docling-project/docling-layout-heron
  • Tables: docling-project/docling-models (TableFormer)
  • Figure/chart gate: docling-project/DocumentFigureClassifier-v2.5
  • Chart-to-CSV: ibm-granite/granite-vision-4.1-4b (default kind GRANITE_VISION_V4, <chart2csv> task token)
  • Pipeline flags: do_chart_extraction=True (which auto-enables picture classification)

MinerU (chart → image):

  • MinerU 3.4.0, hybrid backend, effort medium, OCR disabled (_ocr_enable: false) - from the run's _middle.json
  • VLM: opendatalab/MinerU2.5-Pro-2605-1.2B
  • Pipeline models (opendatalab/PDF-Extract-Kit-1.0, ONNX): layout PP-DocLayoutV2, formula unimernet_hf_small_2503, OCR paddleocr_torch, table classify paddle_table_cls, table structure SlanetPlus / UnetStructure
  • The chart region was recorded in _content_list.json as "type": "chart" with "content": "" and an img_path to the cropped .jpg.

Further reading