Paper Review: “Donut: Document Understanding Transformer without OCR”

Key Points

  • A visual document understanding model that handles text reading and the downstream task in one step, with a single end-to-end model and no separate OCR stage.
  • Outputs are generated as a token sequence formatted so that it can be converted to JSON, which makes the architecture compatible with a variety of downstream tasks.
  • The paper also presents SynthDoG, the synthetic document image generator used in this work.
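
To make the "convertible to JSON" point concrete, here is a minimal sketch of how a generated token sequence might be turned into a nested structure. It assumes fields are wrapped in XML-like tags of the form `<s_key>…</s_key>`; the tag scheme, the function name `tokens_to_dict`, and the sample receipt are illustrative assumptions, not something taken from this review.

```python
import json
import re

# Assumed tag scheme: every field is wrapped as <s_key>value</s_key>.
TAG = re.compile(r"<s_(?P<key>[^>]+)>(?P<body>.*?)</s_(?P=key)>", re.DOTALL)


def tokens_to_dict(seq: str) -> dict:
    """Recursively convert '<s_key>value</s_key>' spans into a nested dict."""
    fields = {}
    for match in TAG.finditer(seq):
        body = match.group("body")
        # A body that itself contains tagged spans is a nested group;
        # otherwise it is treated as a leaf string.
        if TAG.search(body):
            fields[match.group("key")] = tokens_to_dict(body)
        else:
            fields[match.group("key")] = body.strip()
    return fields


# Hypothetical decoder output for a receipt image.
sample = (
    "<s_menu><s_name>Latte</s_name><s_price>4.50</s_price></s_menu>"
    "<s_total>4.50</s_total>"
)
parsed = tokens_to_dict(sample)
as_json = json.dumps(parsed)
```

This sketch does not handle repeated keys or a key nested inside itself; a real converter would need list support for repeated items.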

Overall Structure

Visual Encoder

Textual Decoder

Model Input and Output

  • image
  • text with prompt
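
As a rough illustration of the "text with prompt" input, the sketch below assembles a task prompt that the decoder would continue from. The special-token names (`<s_docvqa>`, `<s_question>`, `<s_answer>`) and the helper `build_prompt` are assumptions made for this example; the actual tokens depend on how a given checkpoint was trained.

```python
from typing import Optional


def build_prompt(task: str, question: Optional[str] = None) -> str:
    """Build a decoder prompt: a task start token, plus a question for VQA."""
    prompt = f"<s_{task}>"  # assumed task start token
    if question is not None:
        # For VQA-style tasks the question is embedded in the prompt and the
        # decoder is expected to continue after the answer start token.
        prompt += f"<s_question>{question}</s_question><s_answer>"
    return prompt


vqa_prompt = build_prompt("docvqa", "What is the total?")
cls_prompt = build_prompt("classification")
```

Under this assumption the same model can serve classification, parsing, and VQA by changing only the prompt, while the image is passed to the visual encoder unchanged.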

Synthetic document generator



Document Classification

Document Parsing

Document VQA




