Sci Rep. 2025 Apr 7;15(1):11874. doi: 10.1038/s41598-025-96397-6.

Linguistic-visual based multimodal Yi character recognition


Haipeng Sun et al. Sci Rep.

Abstract

The recognition of Yi characters is challenged by considerable variability in their morphological structures and by complex semantic relationships, both of which reduce recognition accuracy. This paper presents a multimodal Yi character recognition method that jointly incorporates linguistic and visual features. In the visual modeling phase, a vision transformer integrated with deformable convolution captures key features and adapts to variations in Yi character images, improving recognition accuracy, particularly for images with deformations and complex backgrounds. In the linguistic modeling phase, a Pyramid Pooling Transformer incorporates semantic contextual information across multiple scales, enhancing feature representation and capturing detailed linguistic structure. Finally, a fusion strategy based on the cross-attention mechanism refines the relationships between feature regions and combines the features of the two modalities, achieving high-precision character recognition. Experimental results demonstrate that the proposed method achieves a recognition accuracy of 99.5%, surpassing baseline methods by 3.4%, thereby validating its effectiveness.
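The fusion step described above can be illustrated with a minimal single-head cross-attention sketch, in which tokens from one modality attend over tokens from the other. The token counts, feature dimension, and identity projections below are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(query_feats, context_feats):
    """Single-head cross-attention: `query_feats` (Lq, d) attends over
    `context_feats` (Lc, d). Projection weights are omitted (identity)
    to keep the sketch minimal."""
    d = query_feats.shape[-1]
    scores = query_feats @ context_feats.T / np.sqrt(d)  # (Lq, Lc)
    weights = softmax(scores, axis=-1)                   # each row sums to 1
    return weights @ context_feats                       # (Lq, d)

# Toy example: 4 visual tokens attend over 6 linguistic tokens (dim 8).
rng = np.random.default_rng(0)
visual = rng.standard_normal((4, 8))
linguistic = rng.standard_normal((6, 8))
fused = cross_attention(visual, linguistic)  # shape (4, 8)
```

In a full model the queries, keys, and values would each pass through learned linear projections, and the fused output would feed the final linear and softmax layers.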

Keywords: Character recognition; Deep learning; Linguistic-visual model; Transformer.


Conflict of interest statement

Declarations. Competing interests: The authors declare no competing interests.

Figures

Figure 1
The overall architecture of the proposed method. The input image is processed through the visual module, where ResNet-45 extracts features related to the structure and strokes of Yi characters, while Deformable DETR captures spatial relationships and deformations, effectively addressing the distinctive characteristics of Yi characters. These features pass through a linear layer and a softmax layer to generate the visual prediction map, which is then input into the linguistic model. The Pyramid Pooling Transformer reduces sequence length and enhances feature representation, and feed-forward layers then generate the linguistic prediction map. Although the linguistic prediction map does not contribute to the final output, it is crucial for training the linguistic model, improving its ability to extract accurate features for Yi character recognition. Finally, cross-attention aligns the visual and linguistic features, and linear and softmax operations produce the final prediction result.
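The sequence-shortening role of the Pyramid Pooling Transformer described in the caption can be sketched as multi-scale average pooling over a token sequence: each scale pools the sequence into a few segments, and the pooled tokens from all scales are concatenated into a shorter key/value sequence. The scales and dimensions below are illustrative assumptions.

```python
import numpy as np

def pyramid_pool(tokens, scales=(1, 2, 4)):
    """Multi-scale average pooling over a token sequence (n, d).
    Each scale s splits the sequence into s roughly equal segments and
    averages each; results from all scales are stacked, shortening the
    sequence from n to sum(scales) tokens. Requires n >= max(scales)."""
    n, d = tokens.shape
    pooled = []
    for s in scales:
        bounds = np.linspace(0, n, s + 1).astype(int)  # segment boundaries
        for i in range(s):
            pooled.append(tokens[bounds[i]:bounds[i + 1]].mean(axis=0))
    return np.stack(pooled)  # (sum(scales), d)

seq = np.arange(12, dtype=float).reshape(6, 2)  # 6 tokens, dim 2
short = pyramid_pool(seq)                        # 1 + 2 + 4 = 7 pooled tokens
```

Attention over the pooled sequence is cheaper than over the full sequence while still mixing context at several granularities, which is the intuition behind using pyramid pooling inside a transformer block.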
Figure 2
Overall structure of the visual model.
Figure 3
Architecture of the deformable DETR.
Figure 4
Overall structure of the linguistic model.
Figure 5
Architecture of the pyramid pooling transformer.
Figure 6
Architecture of the cross-attention module.
Figure 7
Visualization of text recognition results.
