Predicting variable English and Cantonese usage in a digital medium: A computational analysis of WhatsApp code-switching in Hong Kong (19988)
AI tools such as ChatGPT are increasingly being explored for language prediction (MacKenzie 2020; Szmrecsanyi et al. 2019). However, existing language models have been trained primarily on dominant languages such as English and neglect less widely studied linguistic practices. These models also tend to rely solely on linguistic variables, which limits their grasp of distinctive linguistic phenomena and sociolinguistic nuance and, consequently, their performance.
This paper addresses this gap by developing a supervised predictive model that incorporates both social and linguistic factors to account for variable language use in underrepresented multilingual practices. The study focuses on Cantonese-English code-switching in Hong Kong, examining how linguistic and social factors influence language choice (English vs. Cantonese) in Computer-Mediated Communication (CMC), specifically WhatsApp instant messaging. The goal is to identify robust predictors of bilingual variation in a digital setting and then to evaluate how accurately a model informed by those predictors can forecast bilingual variation in digital code-switching. For our analysis, we adopted a ‘bag-of-words’ approach to investigating language choice (Goldberg 2017:69). A Bayesian regression analysis of 329,087 Cantonese and English words, drawn from 55,000 utterances by 24 Hong Kong residents and linked to sociolinguistic data collected through a survey, yielded the following results.
The model correctly predicted the outcome (i.e., an English or a Cantonese lexical item) 85% of the time. The correlation between actual and predicted responses is significant (ρ = 0.61, CI [0.59, 0.62], t = 77.356, p < 0.0001). This study stands out for applying methods that are less commonly used for analyzing variation in social media. Sociolinguistic variables, such as part-of-speech, style, and sentiment, are computationally inferred from the linguistic context instead of relying solely on participant reports. We aim to offer insight into another dimension of Cantonese-English code-switching and to contribute to existing knowledge on the variable use of languages in Hong Kong and East Asia.
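The two evaluation figures reported above, accuracy and the correlation between actual and predicted responses, can be computed as in the Python sketch below. This is an illustration under stated assumptions, not the authors' code: outcomes are coded 0 (Cantonese) and 1 (English), the correlation is computed as a Pearson coefficient on those codes (the abstract does not say which coefficient ρ denotes), and the toy labels are invented. The t statistic uses the standard relation t = r·sqrt((n − 2) / (1 − r²)).

```python
import math

def accuracy(actual, predicted):
    """Share of items whose predicted label matches the actual label."""
    return sum(a == p for a, p in zip(actual, predicted)) / len(actual)

def pearson_r(x, y):
    """Pearson correlation; on 0/1 codes this equals the phi coefficient."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def correlation_t(r, n):
    """t statistic for testing a correlation r from n paired observations."""
    return r * math.sqrt((n - 2) / (1 - r ** 2))

# Invented toy labels: 1 = English, 0 = Cantonese.
actual = [1, 1, 0, 0, 1, 0, 1, 0]
predicted = [1, 1, 0, 1, 1, 0, 0, 0]

acc = accuracy(actual, predicted)
r = pearson_r(actual, predicted)
t = correlation_t(r, len(actual))
```

Accuracy and correlation answer different questions: when one language dominates the data, accuracy can be high while the correlation stays moderate, so reporting both gives a fuller picture of model performance.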