Integrating Text, Voice, and Visual Inputs for a Cohesive Multimodal Conversational Experience: Iranian EFL Intermediate Students in Focus
Hossein Vahid Dastjerdi
- English Department, Najafabad Branch, Islamic Azad University, Najafabad, Iran
Received: 2024-12-21
Revised: 2025-02-14
Accepted: 2025-01-23
Published in Issue: 2025-04-23
Copyright (c) 2025 Hossein Vahid Dastjerdi (Author)

This work is licensed under a Creative Commons Attribution 4.0 International License.
Abstract
This study examined the integration of text, voice, and visual inputs to create a seamless, cohesive multimodal conversational experience within AI chatbots, focusing on Iranian EFL intermediate high school students aged 15 to 19. Existing conversational AI systems typically rely on a single mode of interaction, which limits their effectiveness and usability. By combining text, voice, and visual elements, this research aimed to increase user engagement and satisfaction among the participating students. A convenience sample of 200 male and female students interacted repeatedly with custom-developed AI chatbots over a period of three months, with each participant experiencing text-only, voice-only, visual-only, and multimodal interactions in random order. Data collection included interaction duration, interaction frequency, and satisfaction surveys, supplemented by focus-group discussions. Quantitative analysis using MANOVA, together with qualitative thematic analysis, showed that multimodal interactions significantly enhanced the overall user experience, suggesting new directions for the development of conversational AI technologies.
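The abstract describes a design with one categorical factor (interaction mode: text, voice, visual, multimodal) and three dependent measures analyzed with MANOVA. As a minimal illustrative sketch only, not the authors' actual analysis code, the following Python snippet shows how such a test could be run with statsmodels; the file name `interactions.csv` and the column names `condition`, `duration`, `frequency`, and `satisfaction` are hypothetical.

```python
# Minimal sketch of the reported analysis: a MANOVA testing whether
# interaction mode (text / voice / visual / multimodal) jointly affects
# the three dependent measures. All data and column names are assumed.
import pandas as pd
from statsmodels.multivariate.manova import MANOVA

# One row per participant-condition observation (hypothetical file).
df = pd.read_csv("interactions.csv")

# Dependent variables on the left of ~, categorical predictor on the right.
fit = MANOVA.from_formula(
    "duration + frequency + satisfaction ~ condition",
    data=df,
)

# Prints Wilks' lambda, Pillai's trace, Hotelling-Lawley, and Roy's tests.
print(fit.mv_test())
```

Because each participant experienced all four conditions, a full analysis would also need to account for the repeated-measures structure (e.g., a repeated-measures MANOVA or participant-level aggregation); the sketch above shows only the basic formula interface.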
Keywords
- AI chatbots
- Conversational AI
- Multimodal interaction
- User engagement
- User satisfaction
- Text
- Voice
- Visual inputs
DOI: 10.57647/jntell.2025.0401.05
