Dynamic Relation Transformer for Contextual Text Block Detection

1 University of Science and Technology of China
2 Xi'an Jiaotong University
3 Microsoft Research Asia
* Equal contribution
† Corresponding author
Fig. 1: Overview of our proposed framework for contextual text block detection.

Abstract

Contextual Text Block Detection (CTBD) is the task of identifying coherent text blocks within natural scenes. Previous methods have treated CTBD either as a visual relation extraction problem in computer vision or as a sequence modeling problem in natural language processing. We introduce a new framework that frames CTBD as graph generation, consisting of two essential steps: detecting individual text units as graph nodes and predicting the sequential reading-order relationships among these units as graph edges. Building on the state-of-the-art DQ-DETR for node detection, our framework introduces a novel Dynamic Relation Transformer (DRFormer) dedicated to edge generation. DRFormer employs a dual interactive transformer decoder that drives a dynamic graph structure refinement process; through this iterative process, the model systematically improves the graph's fidelity and, in turn, the precision of the detected contextual text blocks. Comprehensive experiments on the SCUT-CTW-Context and ReCTS-Context datasets show that our method achieves state-of-the-art results, underscoring the effectiveness and potential of our graph generation framework for advancing CTBD.
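
To make the graph-generation formulation concrete, the sketch below shows how contextual text blocks can be assembled once the nodes (integral texts) and reading-order edges have been predicted: each block is a chain of successor links starting from a node with no predecessor. This is a hypothetical post-processing illustration, not the authors' released code; the node indexing, the (predecessor, successor) edge format, and the function name group_into_blocks are assumptions.

from typing import Dict, List, Set, Tuple


def group_into_blocks(num_nodes: int,
                      edges: List[Tuple[int, int]]) -> List[List[int]]:
    """Chain reading-order edges (predecessor -> successor) into ordered blocks.

    Assumes each node has at most one predecessor and one successor, as a
    reading-order relation implies; nodes with no predecessor start a block.
    """
    successor: Dict[int, int] = {}
    has_predecessor: Set[int] = set()
    for src, dst in edges:
        successor[src] = dst
        has_predecessor.add(dst)

    blocks: List[List[int]] = []
    visited: Set[int] = set()
    for node in range(num_nodes):
        if node in has_predecessor:
            continue  # not the head of a reading-order chain
        block = [node]
        visited.add(node)
        # Follow successor links; the visited guard protects against any
        # accidental cycles in the predicted edges.
        while block[-1] in successor and successor[block[-1]] not in visited:
            nxt = successor[block[-1]]
            block.append(nxt)
            visited.add(nxt)
        blocks.append(block)
    return blocks


if __name__ == "__main__":
    # Toy example: 5 integral texts forming two reading-order chains.
    print(group_into_blocks(5, [(0, 1), (1, 2), (3, 4)]))
    # -> [[0, 1, 2], [3, 4]]

Isolated nodes (no predecessor and no successor) naturally become single-node blocks under this scheme.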

Method

Fig. 2: The architecture of our proposed DRFormer, consisting of a dual interactive decoder. For clarity, only two attention layers of the Transformer decoder are shown and the FFN blocks are omitted.
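
To give a feel for the dual interactive decoder and the dynamic graph structure refinement sketched in Fig. 2, the following minimal PyTorch-style example alternates between updating node queries and relation queries, feeding the current edge estimate back into the node self-attention as an additive bias (a stand-in for relation-aware self-attention). All module names, shapes, and the bias mechanism are illustrative assumptions rather than the authors' implementation.

import torch
import torch.nn as nn


class DualDecoderLayer(nn.Module):
    """One layer of a dual interactive decoder sketch: a node branch and a
    relation branch that exchange information and re-estimate the graph."""

    def __init__(self, d_model: int = 256, nhead: int = 8):
        super().__init__()
        self.nhead = nhead
        self.node_attn = nn.MultiheadAttention(d_model, nhead, batch_first=True)
        self.edge_attn = nn.MultiheadAttention(d_model, nhead, batch_first=True)
        self.edge_proj = nn.Linear(d_model, d_model)

    def forward(self, node_q, edge_q, edge_logits):
        # Relation-aware self-attention (assumption): add the current graph
        # estimate to the node attention weights so likely-related nodes
        # exchange more information.
        bias = edge_logits.detach().repeat_interleave(self.nhead, dim=0)
        node_q, _ = self.node_attn(node_q, node_q, node_q, attn_mask=bias)
        # Relation queries attend to the refined node features.
        edge_q, _ = self.edge_attn(edge_q, node_q, node_q)
        # Dynamic graph structure refinement (assumption): re-score successor
        # links; edge_logits[b, i, j] ~ "node j follows node i in reading order".
        d = node_q.size(-1)
        edge_logits = torch.einsum('bid,bjd->bij',
                                   self.edge_proj(edge_q), node_q) / d ** 0.5
        return node_q, edge_q, edge_logits


if __name__ == "__main__":
    B, N, D = 1, 6, 256
    node_q, edge_q = torch.randn(B, N, D), torch.randn(B, N, D)
    edge_logits = torch.zeros(B, N, N)  # initial graph estimate: no edges
    layer = DualDecoderLayer()
    for _ in range(3):  # iterate the refinement across decoder layers
        node_q, edge_q, edge_logits = layer(node_q, edge_q, edge_logits)
    print(edge_logits.shape)  # torch.Size([1, 6, 6])

In this toy loop the refined edge logits from one layer bias the node self-attention of the next, which is the essence of the iterative graph refinement idea described in the abstract.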

Main Results

Table 1: Quantitative performance comparison of DRFormer with state-of-the-art methods on the SCUT-CTW-Context dataset. (LA: Local Accuracy; LC: Local Continuity; GA: Global Accuracy)
Table 2: Quantitative performance comparison of DRFormer with state-of-the-art methods on the ReCTS-Context dataset. (LA: Local Accuracy; LC: Local Continuity; GA: Global Accuracy)
Table 3: Quantitative performance comparison of DRFormer with state-of-the-art methods on the integral text grouping and ordering task on the SCUT-CTW-Context and ReCTS-Context datasets. (LA: Local Accuracy; LC: Local Continuity; GA: Global Accuracy)

Ablation Study

Table 4: Ablation studies of the various components within DRFormer on the SCUT-CTW-Context dataset. (DGSR: Dynamic Graph Structure Refinement; CAF: Cross-Attention First; RASA: Relation-Aware Self-Attention)

Qualitative Results

Fig. 3: Qualitative comparison between our proposed baseline (top) and DRFormer (bottom).
Black bounding boxes indicate word-level or character-level integral texts, i.e., the graph's nodes; brown arrows represent the reading-order relationships between these integral texts, i.e., the graph's edges, which together yield the contextual text blocks outlined in green bounding boxes.

Citation

@misc{wang2024dynamic,
  title={Dynamic Relation Transformer for Contextual Text Block Detection},
  author={Wang, Jiawei and Zhang, Shunchi and Hu, Kai and Ma, Chixiang and Zhong, Zhuoyao and Sun, Lei and Huo, Qiang},
  year={2024},
  eprint={2401.09232},
  archivePrefix={arXiv},
  primaryClass={cs.CV}
}