Paper Review: “Donut: Document Understanding Transformer without OCR”

Key Points

  • A visual document understanding model that handles text reading (normally the job of a separate OCR engine) and the downstream task in one step, using a single end-to-end model.
  • Outputs are generated as a sequence formatted so that it can be converted to JSON, which makes the architecture compatible with a variety of downstream tasks.
  • The paper also presents SynthDoG, a synthetic document image generator used in this work.

Overall Structure

Visual Encoder

Textual Decoder

Model Input and Output

  • image
  • text with prompt
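
To make the "convertible to JSON" idea concrete, here is a minimal sketch of how a generated tagged sequence could be turned into a JSON-ready dictionary. The tag scheme (`<s_field>` ... `</s_field>`), the field names, and the helper `tokens_to_json` are assumptions made for illustration; this review does not spell out the exact output format, so treat this as one plausible reading rather than the model's actual post-processing.

```python
import json
import re

# Hypothetical tag scheme: each field is wrapped as <s_key>value</s_key>.
FIELD_PATTERN = re.compile(r"<s_(\w+)>(.*?)</s_\1>", re.DOTALL)


def tokens_to_json(sequence: str) -> dict:
    """Parse <s_key>value</s_key> spans (nesting allowed) into a dict."""
    result = {}
    for match in FIELD_PATTERN.finditer(sequence):
        key, inner = match.group(1), match.group(2)
        if re.search(r"<s_\w+>", inner):
            # The value contains tagged fields itself, so parse it recursively.
            result[key] = tokens_to_json(inner)
        else:
            result[key] = inner.strip()
    return result


# Illustrative sequence, not real model output.
generated = (
    "<s_menu><s_name>Latte</s_name><s_price>4.50</s_price></s_menu>"
    "<s_total>4.50</s_total>"
)
parsed = tokens_to_json(generated)
print(json.dumps(parsed, indent=2))
```

Because every task's answer can be written as such a sequence, one decoder could in principle serve classification, parsing, and VQA by changing only the prompt and the expected fields. Note that this sketch does not handle repeated sibling fields or a field nested inside another field of the same name.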

Synthetic Document Generator



Document Classification

Document Parsing

Document VQA




Deep Learning Engineer LinkedIn:
