arxiv: https://arxiv.org/abs/2204.08387

key points
- uses a linear embedding of image patches for the image embedding instead of a dedicated CNN-based network, which makes the model simpler
- uses three pretraining tasks: MLM (masked language modeling), MIM (masked image modeling), and WPA (word-patch alignment)

I think this work builds heavily on its predecessor, LayoutLMv2, so if you haven't checked that out, I recommend doing so…
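
To make the linear image embedding concrete, here is a minimal sketch in plain Python. It is not the LayoutLMv3 implementation: the function names (`split_into_patches`, `linear_embed`), the toy sizes, and the random weights are all assumptions for illustration. It also assumes the image height and width are divisible by the patch size. The idea is that the image is cut into non-overlapping patches, each patch is flattened, and one shared linear layer projects it to an embedding vector, with no CNN backbone involved.

```python
import random


def split_into_patches(image, patch_size):
    """Split an H x W x C image (nested lists) into flattened, non-overlapping patches."""
    height, width = len(image), len(image[0])
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            patch = []
            for y in range(top, top + patch_size):
                for x in range(left, left + patch_size):
                    patch.extend(image[y][x])  # append every channel of this pixel
            patches.append(patch)
    return patches


def linear_embed(patches, weight, bias):
    """Project each flattened patch with one shared linear layer: W p + b."""
    return [
        [sum(w * p for w, p in zip(row, patch)) + b for row, b in zip(weight, bias)]
        for patch in patches
    ]


# Toy sizes, chosen for illustration only (not the paper's settings).
HEIGHT, WIDTH, CHANNELS, PATCH_SIZE, EMBED_DIM = 8, 8, 3, 4, 6

rng = random.Random(0)
image = [
    [[rng.random() for _ in range(CHANNELS)] for _ in range(WIDTH)]
    for _ in range(HEIGHT)
]

# Hypothetical, randomly initialised projection; a real model would learn these.
patch_dim = PATCH_SIZE * PATCH_SIZE * CHANNELS
weight = [[rng.uniform(-0.1, 0.1) for _ in range(patch_dim)] for _ in range(EMBED_DIM)]
bias = [0.0] * EMBED_DIM

patches = split_into_patches(image, PATCH_SIZE)
embeddings = linear_embed(patches, weight, bias)
print(len(embeddings), len(embeddings[0]))  # 4 patches, each embedded in 6 dimensions
```

Deep learning frameworks often express the same projection as a single strided convolution whose kernel size and stride both equal the patch size; the explicit loop here just makes the "flatten, then multiply by one matrix" step visible.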