1 Introduction

In an age where information floods our digital landscapes, recommendation systems emerge as essential beacons, guiding users efficiently through the vast ocean of data. These systems, designed to analyze and predict user preferences, have revolutionized how content is consumed online. At their core, recommendation systems are sophisticated algorithms that sift through vast datasets to present users with choices tailored to their historical interactions, preferences, and behaviors. This personalized approach not only enhances user experience but also significantly shapes cultural and political narratives by influencing what is watched, read, and discussed across the globe. A prominent example of this influence is YouTube’s recommendation algorithm, which has become a pivotal force in shaping viewing habits among billions of users worldwide. Such algorithms have the power to subtly direct the flow of information, underscoring the importance of understanding their underlying mechanisms and the implications of their widespread adoption. The evolution of recommendation systems, as detailed in Burke et al. (2011) and Lü et al. (2012), spans a trajectory from simple collaborative filtering techniques (where recommendations are made based on the preferences of similar users) to complex, multi-faceted approaches that incorporate a variety of artificial intelligence methodologies. This progression reflects a deepening sophistication in how digital platforms engage with users, ensuring that the content they encounter resonates with their individual tastes and preferences. However, the immense influence wielded by these seemingly impartial systems also brings to the forefront the need for a critical examination of how digital content is curated and the potential consequences of its reach.

At the core of platforms like YouTube, recommender systems stand as technological wonders, adept at predicting and shaping our preferences. They search through vast content, aligning choices with our past interactions to enhance our experience. These systems, by prioritizing content they predict will be of interest, significantly shape our digital diets, potentially narrowing our exposure to a homogenized set of perspectives. However, these algorithms, for all their sophistication, are not without their biases, which can skew content diversity and fairness in information distribution, raising concerns about echo chambers, filter bubbles, and the significant influence on public discourse (Polatidis and Georgiadis 2013). Recent studies have underscored the need to understand and mitigate biases in recommender systems, including but not limited to selection bias, position bias, exposure bias, and popularity bias. Not addressing these inherent biases could lead to serious issues, such as a disparity between offline evaluation and online metrics, negatively affecting user satisfaction and trust in the recommendation service (Chen et al. 2020).

Building on the understanding of recommender systems’ impact on user experience and the potential biases inherent in such systems, this study aims to delve deeper into the specific dynamics at play within YouTube’s ecosystem. YouTube, as a leading platform for video content, offers a fertile ground for examining how recommender algorithms shape the content landscape and user interactions over time. Accordingly, this research seeks to address the following pivotal questions:

  • RQ1: In what ways do YouTube’s recommendation algorithms influence drift within content over time?

  • RQ2: How do YouTube’s algorithms affect content diversity, narrative visibility, fairness of information distribution, user engagement, and echo chamber formation?

This study embarks on a critical examination of YouTube’s recommendation algorithm, specifically its role in driving narrative drift within the digital content ecosystem over time. Narrative drift, defined as the gradual changes in themes, topics, and perspectives within recommended content, significantly influences the diversity and depth of information accessible to users, thereby shaping their informational landscape.

Our investigation employs a multidimensional analytical framework, blending statistical evaluations with advanced analyses of emotion and moral sentiment, among other methodologies, to dissect the complex dynamics of YouTube’s content recommendation system. This comprehensive approach blends quantitative data analysis with qualitative insights, offering a detailed exploration of how algorithms impact user engagement and content evolution.

The remainder of the paper is structured as follows: Sect. 2 introduces the narratives under study, and Sect. 3 reviews related work, setting the stage for the methodologies applied. Section 4 details our data collection processes and analytical methods. Section 5 presents our findings on narrative drift, supplemented by detailed graphical analyses. Finally, Sect. 6 summarizes our study’s key insights.

The goal of this research extends beyond merely mapping narrative drift; it seeks to delve into the implications of algorithmic content curation on content diversity and the fair distribution of information across the digital landscape. By scrutinizing YouTube’s recommendation algorithm and identifying potential biases, this study contributes to the ongoing dialogue on digital media consumption, governance, and its societal impact.

In doing so, we challenge the current state of digital content recommendation, envisioning a path toward more transparent, equitable, and diverse digital ecosystems. Through this exploration of YouTube’s recommendation system, we aim to shed light on the nuances of algorithmic governance, fostering a richer and more inclusive digital commons for all.

2 Background

In this section, we delve into the intricate web of geopolitical issues that shape our world, exploring conflicts and disputes that not only have regional implications but also resonate on the global stage. From the deep-rooted tensions in China’s Xinjiang province to the strategic complexities of the South China Sea dispute, and the nuanced use of historical narratives, such as the story of Cheng Ho, in modern-day diplomacy, these topics offer insight into some of the most pressing and contentious geopolitical challenges of our times.

2.1 China-Uyghur conflict

The Xinjiang conflict, deeply rooted in complex historical, cultural, and political factors, has emerged as a significant issue in global discourse. Central to this conflict is the difficult situation faced by the Uyghur Muslim minority in China’s Xinjiang province, an area filled with ethnic tensions and controversial government actions. Research highlights the cultural and linguistic aspects of the conflict, focusing on the Uyghur identity and language policy, underscoring how identity plays a crucial role in the ongoing tensions (Dwyer 2005). Another study examines the conflict through the lens of majority-minority dynamics within China, providing insights into the socio-political factors that have contributed to the escalation of tensions (Hasmath 2019). Further analysis explores the broader implications of the conflict, particularly China’s national policies and their impact on the Uyghur population, offering a critical view of the government’s approach to handling ethnic diversity and disagreement (Israeli 2010). This is complemented by discussions on the involvement of international organizations like Amnesty International in addressing the discrimination and conflict faced by Uyghurs, highlighting the period from 2018 to 2022 and the international community’s response (Al-Asad and Zarkachi 2023). Additionally, studies on Uyghur Muslim ethnic separatism clarify the complexities of ethnic identity and the desire for self-governance within Xinjiang, illustrating the intricate relationship between ethnic identity, political aspirations, and the broader conflict narrative (Davis 2008). These scholarly perspectives paint a multifaceted picture of the Xinjiang conflict, demonstrating its multi-dimensional nature that includes cultural, political, and international elements.

2.2 South China Sea dispute

The South China Sea dispute represents a complex intersection of geopolitical, economic, and legal challenges, crucial for understanding contemporary international relations. This region, pivotal for global maritime trade, sees an estimated one-third of the world’s shipping pass through its waters, highlighting its significance in international commerce. The area is not only a key maritime route but also possesses considerable untapped natural resources, including substantial oil and gas reserves, making it an economically strategic zone. Central to this dispute is the assertive posture of the People’s Republic of China (PRC), as explored in Chubb (2020). China’s strategy includes extensive island-building and militarization, particularly within the Paracel and Spratly Islands, reshaping the region’s geopolitical landscape. These efforts involve constructing artificial islands and establishing military bases, actions that have significantly altered regional dynamics and heightened tensions among neighboring states. The complexity of the South China Sea dispute extends beyond its strategic maritime routes, as detailed in Fravel (2011). The promise of rich undersea resources, including significant reserves of oil and natural gas, plays a pivotal role in fueling the conflicting territorial claims. This economic potential, closely intertwined with layers of historical claims and national pride, adds to the complexity of the situation. The legal dimension of the dispute gained importance following the 2016 decision by the Permanent Court of Arbitration, which challenged China’s extensive “nine-dash line” territorial claim, deeming it inconsistent with international law (Macaraig and Fenton 2021). Although China has rejected the ruling, it introduced an important legal aspect to the dispute, emphasizing the role of international law in maritime territorial rights. China’s activities in the South China Sea, including land reclamation and militarization, have implications far beyond the immediate region, raising significant concerns regarding the principles of freedom of navigation and overflight in crucial international waters. The South China Sea dispute, with its intricate blend of geopolitical significance, economic interests, and legal complexities, stands as a testament to the challenges facing the international order in the 21st century.

2.3 Cheng Ho propaganda

In the context of current international politics and strategies, the story of Cheng Ho, also known as Zheng He, has gained new significance, especially in how it is used in the propaganda of the Chinese Communist Party (CCP). Historically recognized as a celebrated Chinese naval admiral of the early 15th century, Zheng He is renowned for his extensive maritime voyages across Southeast Asia, India, and the Middle East, as detailed in Wade (2005), Finlay (2008). These expeditions are traditionally viewed as exploratory and diplomatic in nature, emphasizing peaceful engagement and trade. However, in recent times, the CCP has strategically repurposed Zheng He’s legacy, as discussed in Dotson (2011), to serve its contemporary political and strategic agendas. This recontextualization of Zheng He’s historical image is particularly evident in the narrative that portrays him not only as a figure who spread Islam and religious tolerance but also as a symbol of China’s peaceful rise and benevolence. This portrayal aligns with the CCP’s broader objective of countering international criticism regarding its treatment of the Uyghur Muslim population, as well as bolstering its geopolitical influence, especially in relation to the South China Sea conflict. By recasting Zheng He as a compassionate gift-giver and a peaceful diplomat, the narrative serves to project an image of China as a historically tolerant and inclusive nation. Furthermore, the manipulation of Zheng He’s story is intricately linked with China’s “Maritime Silk Road” initiative, which aims to expand its economic and strategic influence across Asia, Africa, and Europe. The narrative serves as a tool to foster regional support for China’s ambitious project, presenting it as a continuation of a peaceful and cooperative maritime tradition. This tactic of referencing historical events is a sophisticated approach to conducting international relations, strengthening credibility within China, and improving its global position. The Cheng Ho propaganda is a testament to the power of historical narratives in serving current political objectives, where the past is actively reinterpreted to shape the present and future.

2.4 Significance of selected geopolitical topics

We have chosen these geopolitical topics due to their inherent controversy and their significant impact on international relations and public opinion. The selected topics include two anti-China perspectives: the China-Uyghur conflict and the South China Sea dispute, and one pro-China narrative, the story of Cheng Ho. This balanced selection allows us to examine issues where bias might be particularly evident due to their polarizing nature.

These geopolitical issues are highly relevant in current global discourse and receive significant media coverage, making them ideal for analyzing recommendation bias on platforms like YouTube. The controversies surrounding these topics often result in strong opinions and divided audiences, providing a fertile ground for studying how recommendation systems present content in response to user interactions. By examining these topics, we aim to uncover potential biases in YouTube’s recommendation algorithms that might influence the direction of content suggestions.

Additionally, these topics encompass a range of cultural, political, and historical elements, providing a comprehensive framework for studying the complexity of bias in recommendation systems. This selection allows us to assess whether YouTube’s algorithms exhibit any tendencies in the progression of recommended content starting from these contentious geopolitical issues. Understanding these tendencies is crucial for recognizing the broader implications of algorithmic bias in shaping public opinion and the potential consequences for international relations and social harmony.

3 Literature review

This literature review systematically explores a wide range of scholarly work on the multifaceted nature of biases in digital spaces, algorithmic influences in recommendation systems, and the psychological and ethical dimensions of online content interaction.

3.1 Recommendation bias

This section delves into the intricate web of biases inherent in recommendation algorithms, examining their profound implications on information consumption, user behavior, and societal discourse across various digital platforms.

The author in Stinson (2022) explores biases inherent in collaborative filtering algorithms, used extensively in recommendation and search systems. Highlighting the cold-start problem, popularity bias, over-specialization, and homogenization, the author argues these statistical biases can further marginalize already marginalized groups. This insight is crucial for the broader discourse on algorithmic fairness, stressing the importance of addressing both data and algorithmic mechanisms to mitigate biases and ensure more equitable outcomes in digital recommendation environments. Building on this foundational understanding of recommendation biases, the comprehensive survey by Chen et al. (2023) meticulously explores the multifaceted nature of biases in recommendation algorithms. They provide a deeper dive into each of the seven identified biases: selection, exposure, conformity, position, inductive, popularity, and unfairness, and examine various strategies for mitigating such biases. Among the debiasing techniques, the survey discusses the effectiveness of propensity score adjustments, adversarial learning, and other methods in enhancing the fairness and diversity of recommendations. By elaborating on the challenges and solutions associated with each bias type, the survey enriches the discourse on creating equitable and inclusive recommender systems, aligning closely with the thematic concerns of recommendation bias in our study.

Similarly, the researchers in Zhan et al. (2022) present a novel approach to address the duration bias in video watch-time prediction models. By employing a causal graph and backdoor adjustment, the study innovatively separates the intrinsic effect of video duration on watch-time from its biased impact on video exposure. This methodology allows for more accurate and fair recommendations by mitigating the undue preference for longer videos, which has been shown to skew platform engagement metrics and user experience. Through extensive offline evaluations and live experiments on the Kuaishou platform, the researchers demonstrate the effectiveness of this approach in enhancing watch-time prediction accuracy and, consequently, video consumption, further emphasizing the importance of addressing biases in recommendation systems.

The researchers in Haroon et al. (2022) conduct a comprehensive audit of YouTube’s recommendation system to assess ideological biases and potential radicalization through recommendations. They employ a novel methodology using “sock puppets” to mimic user behavior across different ideological spectrums. The findings reveal YouTube’s tendency to guide users, particularly those leaning right, towards increasingly radical content. Additionally, the study proposes a bottom-up intervention strategy aimed at mitigating these biases, demonstrating its effectiveness in diversifying recommendations. This research adds critical insights into the ongoing debate on social media’s role in ideological bias and radicalization, highlighting the complex challenges faced by digital platforms in ensuring fair and balanced content distribution.

The study by Nechushtai et al. (2023) investigates the effect of algorithmic recommendation systems on the diversity of news exposure across major digital platforms, presenting a comprehensive analysis that compares the extent of homogenization in news recommendations. It emphasizes the tendency of these platforms to favor nationally oriented news sources over local or regional ones, highlighting concerns regarding the centralization of information, the potential reduction in exposure diversity, and the implications for public discourse. The study employs a crowd-sourced audit methodology to assess the recommendations made by Google, Google News, Facebook, YouTube, and Twitter to a diverse set of users in the United States, examining the interplay between user characteristics and algorithmic sorting in shaping news consumption patterns. This analysis underscores the subtle impacts of YouTube’s recommendation algorithms on content diversity and user perception, illuminating a complex picture of algorithmic bias.

Recent studies, including Cakmak et al. (2024b), Okeke et al. (2023), Cakmak et al. (2024a), Gurung et al. (2024) and Onyepunuka et al. (2023), further contribute to the understanding of YouTube’s recommendation algorithm’s complexities. These analyses reveal trends towards positive emotions, reduced focus on moral dilemmas, systematic content filtration, and shifts in thematic and emotional engagement, which highlight the algorithms’ influence on shaping viewers’ feelings, beliefs, and public discourse dynamics. Specifically, Onyepunuka et al. (2023) explores the Cheng Ho narrative to assess topic and emotion drift, finding that YouTube’s recommendations tend to deviate towards content subtly introducing pro-China topics, which target specific demographics. These findings emphasize the algorithm’s capacity to shift discussion points and influence audience perception, adding depth to the landscape of recommendation biases.

Exploring the nuances of bias and misinformation propagation within digital ecosystems, scholarly investigations such as Kirdemir et al. (2021a), Kirdemir and Agarwal (2022), Kirdemir et al. (2021b), Cakmak et al. (2024a), Cakmak and Agarwal (2024b), Poudel et al. (2024), and Srba et al. (2023) unveil the inherent structural preferences and the algorithm’s role in fostering content homogeneity, tightly knit content communities, and polarized content spheres. These studies underscore the critical need for transparent, accountable algorithmic practices and the development of debiasing interventions to cultivate a more diverse and accurate digital information landscape, thereby reinforcing the pivotal themes discussed in the aforementioned research on recommendation biases.

3.2 Behavioral dynamics in social media

The emotional, moral, and toxic behavioral dynamics of social media are foundational to user engagement and the development of effective recommender systems. Recognizing and interpreting these dynamics is critical for creating algorithms that resonate with user preferences and enhance their online experiences.

3.2.1 Emotions in social media

The emotional dynamics of social media significantly influence user engagement and content dissemination. Recognizing and interpreting emotional expressions in user-generated content is crucial for creating algorithms that enhance user experiences. Research highlights the importance of accurately identifying emotions to prevent the spread of misinformation and counteract online radicalization (Kušen et al. 2017). Studies delve into the role of emotions in shaping user influence and engagement, advocating for emotionally intelligent recommendation systems that capture the essence of user-generated content and ensure emotionally engaging recommendations (Chung and Zeng 2020; Panger 2017). Further research explores the impact of visual and auditory cues on user emotional responses, emphasizing the potential for leveraging these elements to enhance recommendation accuracy and appeal (Cakmak et al. 2024c; Yousefi et al. 2024a). Multimodal emotion analysis, combining textual and auditory data through deep learning, improves emotion detection accuracy and underscores the development of more emotionally intelligent recommendation systems, aligning recommendations more closely with user emotions (Banjo et al. 2022).

3.2.2 Morality in social media

Understanding how moral considerations intersect with social media dynamics is imperative for addressing biases in digital spaces. Research highlights the significant impact of social media on moral reasoning, judgments, and behaviors, emphasizing the need for theoretical contributions to understanding morality on these platforms (Neumann and Rhodes 2024). Studies delve into the effects of moral outrage on political polarization, showing how social media magnifies aggression and withdrawal from political conversations (Carpenter et al. 2020). The interplay between social media and morality can amplify both negative (e.g., outrage, intergroup conflict) and positive (e.g., social support, prosociality) aspects of morality (Van Bavel et al. 2024). The design and algorithmic preferences of social media platforms significantly shape the spread of moral narratives, embedding moral biases within the algorithms responsible for content recommendations. This highlights the importance of incorporating moral values into recommender systems to create a balanced and less biased digital environment (Mbila-Uma et al. 2023).

3.2.3 Toxic behavior in social media

Toxic content on social media presents a significant challenge for the development of unbiased recommendation algorithms. Studies utilizing Reddit data reveal how toxic content impacts communities and biases algorithmic decisions by amplifying negative discourse (Yousefi et al. 2024b). Research on the contagious nature of toxic tweets underscores the urgency for algorithms to understand how harmful content multiplies (Yousefi et al. 2023). Comparisons of toxicity levels across different platforms suggest that distinct strategies may be needed to curb bias (DiCicco et al. 2020; Noor et al. 2023). Recent studies extend our understanding of the toxicity landscape by examining its role in amplifying public health debates and affecting polarization in the wake of feminist protests (Pascual-Ferrá et al. 2021; Estrada et al. 2022). These insights emphasize the necessity for recommendation algorithms to account for toxicity dynamics to refine algorithms, promote healthier discourse, and mitigate biases, ultimately fostering constructive public engagement.

In conclusion, a comprehensive understanding of the emotional, moral, and toxic behavioral dynamics in social media is essential for developing recommendation systems that can effectively manage biases, enhance user engagement, and promote a healthier online environment.

3.3 Topic modeling in social media

Understanding the importance of topic content in social media is pivotal for enhancing recommender systems and ensuring their fairness and relevance. The dynamism of social media platforms, such as Sina Weibo, underscores the necessity to analyze and comprehend the thematic shifts in user-generated content. By examining the distribution of hot topics and their correlations across different platforms, researchers can gain insights into user interests and behaviors, which are crucial for developing more accurate and unbiased recommender systems (Yu et al. 2014). This analysis not only helps in identifying trending topics but also in understanding the broader social context in which these discussions occur.

Moreover, the application of advanced topic modeling techniques to social media content enables the discovery of underlying topic facets and their evolution over time. Such methodologies are instrumental in capturing the rich tapestry of online discourse, facilitating a deeper understanding of the thematic structures within vast datasets (Rohani et al. 2016). This knowledge is invaluable for recommender systems, as it allows for the refinement of content curation algorithms to better match user preferences and mitigate the risk of reinforcing echo chambers.

The significance of topic analysis extends beyond bare content filtering and recommendation. It plays a critical role in identifying shifts in public sentiment and emerging trends, thereby enabling decision support systems to adapt to changing user needs and preferences (Li et al. 2023). Furthermore, the study of changes in social media content over time provides marketers and content creators with insights into the effectiveness of their strategies and the changing interests of their audience, as evidenced by research in marketing science by Zhong and Schweidel (2020).

In conclusion, topic content on social media emerges as a critical element requiring meticulous analysis for identifying and mitigating bias within recommender systems. Its influence on recommendation algorithms underscores the necessity for careful examination to ensure the integrity and fairness of these systems.

3.4 Social network analysis

Social Network Analysis (SNA) has emerged as a vital tool for analyzing the complicated web of interactions within social media platforms, offering profound insights into the diffusion of information and the identification of influential actors within these networks, as described in Harrigan et al. (2021). By mapping out the relationships and flows between users, SNA facilitates a deeper understanding of how information spreads through these digital landscapes, highlighting the individuals who wield disproportionate influence over these processes. Influencers, identified through their central positions within the network, play a critical role in shaping audience attitudes and facilitating the spread of information, thus acting as gatekeepers in the dissemination of content (Khanam et al. 2023). The study by Shaik et al. (2024) further elucidates the role of multimedia in amplifying these dynamics, underscoring the potential of multimedia content to engage and mobilize communities through social networks.

The significance of influencers extends beyond mere popularity, as their strategic position within the network grants them the ability to affect the flow and reach of information significantly. This influence is not uniform but varies based on the network’s structure and the nature of the connections. Research describes distinct types of influencers, such as disseminators, engagers, and leaders, each playing unique roles in information spread (del Fresno García et al. 2024). These distinctions underscore the ways in which influence manifests within social networks, shaping how information is shared and received.

In the context of recommender systems, understanding the dynamics of social networks and the role of influencers is paramount. These systems, designed to select and recommend content to users based on various algorithms, can significantly benefit from including insights derived from SNA. By recognizing and leveraging the influence of key actors, recommender systems can enhance their effectiveness, ensuring that the content reaches a broader audience and resonates more deeply with users (Aïmeur et al. 2023). Moreover, the inclusion of social network insights can help mitigate the challenges of filter bubbles and echo chambers, promoting a more diverse and engaging content landscape.

Incorporating SNA into the examination and refinement of recommender systems on social media platforms presents a strategic approach to mitigating bias and enriching the content landscape. By meticulously identifying and deciphering the roles of influencers within these digital networks, platforms have the opportunity to recalibrate their algorithms to leverage these pivotal actors effectively. This strategy not only amplifies the diversity and relevance of recommendations but also addresses underlying biases by ensuring a broader, more inclusive representation of perspectives and content. As highlighted by Alp et al. (2022), analyzing social media discussions around critical health issues like COVID-19 and vaccines can offer invaluable insights into public sentiment, further enriching the dataset for recommender systems. Consequently, this integration fosters a digital ecosystem that is not only more interconnected and vibrant but also fairer and more transparent. Through such tailored algorithmic adjustments, social media platforms can transcend conventional limitations, offering users a richer, more balanced, and less biased content experience.

Recent studies by Bhattacharya et al. (2024a) highlight the critical role of network analysis in uncovering biases within social networks. The authors used computational methods to analyze identity formation in political protests, showing how social networks can coalesce into cohesive movements and revealing potential biases in these networks. Additionally, Bhattacharya et al. (2024b) examined the socio-technical factors behind modern social movements, emphasizing how network analysis can identify biases and better understand the interplay between solidarity and collective action in digital environments.

3.5 User engagement effect on recommendation systems

In the realm of digital platforms, recommender systems serve as the keystone for navigating the vast array of content available to users, aiming to enhance user engagement by tailoring recommendations to individual preferences. However, the attempt to personalize user experiences and maximize engagement does not come without its challenges, particularly in terms of recommendation bias and its impact on the visibility of diversified content. This understanding of recommender systems functionality underscores the critical balance between user engagement and the equitable representation of content.

The research of Maslowska et al. (2022) emphasizes the pivotal role of recommender systems in fostering user engagement, suggesting that the design and operational variations of recommender systems significantly influence users’ long-term interactions with platforms. While accuracy in predicting user preferences is traditionally valued, their work suggests a broader scope for evaluating recommender systems effectiveness, highlighting the importance of understanding user engagement dynamics in depth.

Complementing this perspective, Ping et al. (2024) investigates the effects of diversity, novelty, and serendipity on user engagement and reveals the intricate relationship between recommender systems design choices and the potential for bias. Their findings illuminate how an overemphasis on popular or trending content could inadvertently marginalize less conventional, yet potentially engaging, content. This phenomenon, often referred to as the popularity bias, underlines the critical trade-offs recommender systems designers face between optimizing for user engagement and ensuring a diverse content ecosystem.

Moreover, the study by Zhao et al. (2018) on the differential impacts of explicit versus implicit feedback on user engagement and satisfaction offers insights into the mechanisms through which recommender systems might amplify or mitigate biases. By highlighting the intricate ways in which user feedback is incorporated into recommender systems, their research underscores the potential for recommender systems to either perpetuate or challenge existing biases, depending on the design and implementation choices made.

Additionally, research by Shajari et al. (2024a, 2024b) explores anomalous engagement and commenter behavior on YouTube, providing valuable insights into how engagement metrics can be manipulated and the implications this has for recommender systems. Adeliyi et al. (2024) further investigate inorganic user engagement, emphasizing the impact of automated and semi-automated activities on the integrity of user engagement metrics. Their findings highlight the importance of implementing robust mechanisms to detect and address such behaviors to maintain the integrity of user engagement metrics.

Addressing these complexities, it becomes evident that the larger goal of recommender systems to enhance user engagement must be critically examined through the lens of recommendation bias. The challenge lies not only in designing recommender systems that are adept at capturing and sustaining user interest but also in ensuring that these systems promote a balanced and inclusive representation of content.

3.6 Statistical evaluation in recommender systems

The advancement of recommender systems has underscored the vital need for robust statistical evaluation frameworks. As these systems increasingly influence user experiences across various digital platforms, the need to ensure their effectiveness and fairness cannot be overstated. Statistical evaluation methodologies provide a foundation for understanding, improving, and benchmarking recommender systems. The integration of Information Retrieval (IR) metrics into recommender system evaluation has emerged as a pivotal area of study, aiming to bridge the gap between predicted user preferences and actual user satisfaction (Shani and Gunawardana 2011). However, this adaptation is not without challenges, as noted in Bellogín et al. (2017), who highlighted the inherent statistical biases, such as sparsity and popularity biases, that could distort evaluation outcomes and obstruct the comparability of recommender systems.

In the realm of evaluating recommender systems, traditional error-based metrics have shown limitations, prompting a shift towards more user-centric evaluation criteria (Knijnenburg and Willemsen 2015). This shift has been influential in capturing the multifaceted nature of user preferences and the complex dynamics of recommendation processes. The work by Shani and Gunawardana (2011) further elaborates on the intricacies of applying IR methodologies to recommender systems, emphasizing the need for a systematic approach to address these challenges.

The evaluation of recommender systems has also brought to light the significance of addressing and mitigating biases inherent in recommendation algorithms. These biases, if unchecked, can skew the recommendation process, potentially leading to a reinforcement of existing user preferences and hindering the discovery of diverse content (Herlocker et al. 2004).

Furthermore, the exploration of statistical robustness in the evaluation of stream-based recommender systems by Vinagre et al. (2019) adds an additional layer of complexity, necessitating the development of dynamic evaluation metrics that can adapt to the evolving nature of user interactions with content. This dynamic evaluation underscores the importance of time aspects in the assessment of recommender systems, highlighting the need for metrics that can capture the transient preferences of users and the fluidity of content relevance.

In conclusion, the literature underscores the paramount importance of statistical measurements in the evaluation of recommender systems. As these systems continue to evolve and play a crucial role in shaping digital experiences, the development and refinement of statistical evaluation methodologies will remain a critical area of research. This endeavor not only aids in benchmarking the performance of recommender systems but also in ensuring their fairness, transparency, and adaptability to the diverse and changing needs of users.

4 Methodology

Our methodology outlines the structured approach we adopt to investigate the intricate dynamics of recommendation algorithms, combining data collection, analysis, and theoretical examination to unveil the underlying biases within digital platforms.

4.1 Data collection

In this section, we detail the rigorous methodology employed for gathering and processing data, setting the foundation for our comprehensive analysis of the narratives explored in this study.

4.1.1 Narrative keywords

To initiate data collection as outlined in Sect. 2, we conducted a series of workshops with subject matter experts. These sessions were instrumental in generating a list of relevant keywords associated with the three narratives. Subsequently, these keywords were utilized to facilitate the search for related videos on YouTube.

  1. China-Uyghur Conflict: in our research on the China-Uyghur conflict, the carefully chosen keywords reflect the critical themes identified in our literature review. These keywords are detailed in Table 1. They cover human rights abuses, cultural and religious identities, and the international response to the conflict. Terms like “Oppression”, “Muslim Uyghur”, and “Stop Genocide” are used to capture the mistreatment of the Uyghur population, their cultural and religious significance, and the global reaction to these issues. By incorporating specific organizations and notable figures, our data collection becomes comprehensive, ensuring our study accurately represents the complexities and documented realities of the China-Uyghur conflict.

  2. South China Sea Dispute: as shown in Table 2, our selected keywords for the South China Sea dispute study encapsulate the conflict’s key aspects: legal rulings (“Permanent Court”, “Arbitration”, “UNCLOS”), geopolitical tensions (“China + Philippines”, “sovereignty”), and economic interests (“economic cooperation”, “natural resources”). These terms are essential to examine the intricate blend of legal, political, and economic factors in the dispute, particularly focusing on China’s territorial claims and the responses of neighboring states like the Philippines. The keywords enable a comprehensive analysis, aligning with the diverse perspectives and complexities discussed in our literature review.

  3. Cheng Ho Propaganda: as outlined in Table 3, our study utilized specific keywords to investigate Cheng Ho’s (Zheng He) maritime expeditions and their modern reinterpretation by the Chinese Communist Party (CCP). “Cheng Ho”, “Zheng He”, and related terms explore his historical significance and cultural impact. Keywords linking to the Uyghur region and figures like “Gavin Menzies” connect his legacy with contemporary geopolitical narratives and popular theories. This selection facilitates a comprehensive examination of Zheng He’s historical role and his portrayal in current political strategies, aligning with our research focus.

Table 1 China-Uyghur conflict related keywords
Table 2 South China Sea dispute related keywords
Table 3 Cheng Ho propaganda related keywords

These keywords, as presented in their respective tables, play a pivotal role in uncovering the complex themes and narratives at the heart of our study. This approach not only structures our in-depth analysis but also incorporates a blend of English and non-English terms. This bilingual approach accounts for the original language of videos and their broader dissemination in English, ensuring a comprehensive understanding of the subject matter from multiple linguistic perspectives. It is important to note that the unbalanced keyword counts across topics do not detract from our study’s validity: each topic is unique and encompasses a varied range of terms. Moreover, we selected an equal number of initial seed videos for each topic, a methodological choice that is elaborated upon in the following section.

4.1.2 Recommendation depth collection

To accurately measure bias in YouTube’s recommendations, we needed to collect the videos recommended by YouTube. Our methodology mirrors the approach used by the authors in Onyepunuka et al. (2023). Initially, we selected seed videos using the keywords mentioned in Sect. 4.1.1. These seed videos were manually chosen based on their relevance to the subject.

For the collection of recommended videos, we employed Selenium, a widely used open-source library for browser automation and web scraping. Selenium uses the WebDriver protocol to control web browsers, in our case Chrome. We opened each seed video individually in the browser and scraped the videos recommended by YouTube, typically displayed in the panel on the right-hand side of the screen. These recommended videos are related to the video currently being viewed. After completing the collection at each recommendation depth, the newly gathered videos served as the starting point for the subsequent depth. This process allowed us to construct a recommendation network spanning four depths beyond the seed videos. The number of videos and their corresponding depths are presented in Table 4.

Table 4 Narrative’s video counts

We initially selected 40 seed videos for each narrative. After each depth, there was a variation in the count of recommended videos. This variation is attributable to YouTube’s algorithm, which factors in aspects such as content metadata, viewer engagement, video length, ongoing algorithmic adjustments, content availability, and current trends. Throughout the video collection process, we ensured an unbiased approach by not logging into any YouTube account. Each browsing session started afresh with cleared cookies to eliminate any influence of user history on the data.
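To make the collection procedure concrete, the sketch below illustrates the depth-wise crawl with Selenium and Chrome. It is an illustrative outline rather than our exact harvesting code: the CSS selector for the recommendation panel, the fixed rendering wait, and the placeholder seed IDs are assumptions, since YouTube’s page structure changes frequently.

```python
# Illustrative sketch of the depth-wise recommendation crawl (not the exact
# production scraper). The CSS selector and seed IDs are placeholders.
import time
from selenium import webdriver
from selenium.webdriver.common.by import By

def get_recommended_ids(driver, video_id, max_items=20):
    """Open one video with cleared cookies and scrape the IDs of the videos
    shown in YouTube's recommendation panel."""
    driver.delete_all_cookies()                       # fresh session, no user history
    driver.get(f"https://www.youtube.com/watch?v={video_id}")
    time.sleep(5)                                     # allow the panel to render (assumed wait)
    links = driver.find_elements(
        By.CSS_SELECTOR, "ytd-compact-video-renderer a#thumbnail")  # assumed selector
    ids = []
    for link in links[:max_items]:
        href = link.get_attribute("href") or ""
        if "watch?v=" in href:
            ids.append(href.split("watch?v=")[1][:11])  # YouTube video IDs are 11 characters
    return ids

driver = webdriver.Chrome()
frontier = ["SEED_ID_1", "SEED_ID_2"]                 # 40 manually chosen seeds per narrative
network = {}                                          # parent video -> recommended (child) videos
for depth in range(4):                                # four recommendation depths
    next_frontier = []
    for vid in frontier:
        children = get_recommended_ids(driver, vid)
        network[vid] = children
        next_frontier.extend(children)
    frontier = list(dict.fromkeys(next_frontier))     # de-duplicate while preserving order
driver.quit()
```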

4.1.3 Attribute retrieval

In this research, we analyzed various video attributes including the title, description, transcription, comments, views, and likes. For all attributes except transcription, we utilized the YouTube Data API v3. This API facilitates the retrieval of feeds related to videos, among other functionalities.
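The following minimal sketch shows how such attributes can be retrieved with the google-api-python-client package; the API key is a placeholder and comment pagination is omitted for brevity.

```python
# Minimal sketch of attribute retrieval via the YouTube Data API v3.
from googleapiclient.discovery import build

youtube = build("youtube", "v3", developerKey="YOUR_API_KEY")  # placeholder key

def fetch_attributes(video_id):
    """Return title, description, view count, and like count for one video."""
    response = youtube.videos().list(part="snippet,statistics", id=video_id).execute()
    item = response["items"][0]
    return {
        "title": item["snippet"]["title"],
        "description": item["snippet"]["description"],
        "views": int(item["statistics"].get("viewCount", 0)),
        "likes": int(item["statistics"].get("likeCount", 0)),
    }

def fetch_comments(video_id, max_results=100):
    """Return up to max_results top-level comment texts (single page only)."""
    response = youtube.commentThreads().list(
        part="snippet", videoId=video_id,
        maxResults=min(max_results, 100), textFormat="plainText").execute()
    return [item["snippet"]["topLevelComment"]["snippet"]["textDisplay"]
            for item in response.get("items", [])]
```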

For transcriptions, we adopted a different approach, as detailed in Cakmak et al. (2023), Cakmak and Agarwal (2024). In these studies, the authors developed a method to efficiently collect video transcripts from YouTube. This process primarily involved the use of the YouTube Transcript API (Depoix 2023), which extracts transcripts from YouTube videos. For videos without available transcripts, we used the OpenAI Whisper model by Radford et al. (2023), which applies automatic speech recognition, to generate the necessary transcriptions. This method effectively streamlined the transcription collection process, demonstrating the practical use of advanced computational techniques in extracting data from online multimedia sources.
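A condensed sketch of this two-step transcript collection is shown below. It assumes that, for videos lacking transcripts, the audio has already been downloaded by a separate tool; note that older releases of the YouTube Transcript API expose a get_transcript method, while newer releases use an instance-based fetch interface.

```python
# Sketch of transcript collection: YouTube Transcript API first, Whisper fallback.
from youtube_transcript_api import YouTubeTranscriptApi
import whisper

whisper_model = whisper.load_model("base")            # smaller checkpoint for illustration

def get_transcript(video_id, audio_path=None):
    """Return the transcript text for a video, falling back to speech recognition."""
    try:
        segments = YouTubeTranscriptApi.get_transcript(video_id)
        return " ".join(seg["text"] for seg in segments)
    except Exception:                                  # no transcript available for this video
        if audio_path is None:                         # audio must be downloaded separately
            return ""
        return whisper_model.transcribe(audio_path)["text"]
```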

Additionally, due to geo-specific issues encountered during our data collection, we dealt with non-English data. To enhance the understanding and accuracy of the models we used, we translated the data into English. This was achieved using the Googletrans library in Python, a free library that interfaces with the Google Translate Ajax API and provides functions such as language detection and translation.
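A minimal sketch of this translation step, assuming the synchronous googletrans interface, is given below.

```python
# Sketch of language detection and translation to English with googletrans.
from googletrans import Translator

translator = Translator()

def to_english(text):
    """Translate non-English text to English; pass English text through unchanged."""
    if translator.detect(text).lang == "en":
        return text
    return translator.translate(text, dest="en").text
```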

4.2 Emotion assessment

The bias inherent in recommended video content can be effectively analyzed through the lens of emotional shifts. Emotions significantly influence our interaction with media, shaping our reactions, responses, and engagement levels. In the realm of recommended videos, a complex relationship exists between the emotional tone of the content and the viewer’s current emotional state. This interaction can result in a skewed selection of recommendations, as algorithms might favor content that evokes emotions leading to higher engagement from viewers. This tendency can create a cycle where viewers are continually presented with content that provokes specific emotional responses, potentially resulting in a more engaged but less diverse viewing experience. Understanding this mechanism is key in identifying and addressing bias in video recommendations, highlighting the subtle ways in which emotional targeting can influence viewing habits and content exposure.

To quantify these emotional shifts in video recommendations, our approach involved analyzing the emotions in each video using a transformer model, a tool at the forefront of advancements in natural language processing. Renowned for their ability to contextually interpret language, models like BERT, GPT, and RoBERTa (Devlin et al. 2019; Radford and Narasimhan 2018; Brown et al. 2020; Liu et al. 2019) are particularly adept at accurate emotion analysis.

Our research utilized RoBERTa and its more efficient variant, DistilRoBERTa. We selected the model (Hartmann 2022) from Hugging Face, a refined version of DistilRoBERTa, which has been meticulously trained on diverse datasets to identify a range of emotions: anger, disgust, fear, joy, neutral, sadness, and surprise.

Employing this model enabled us to conduct a thorough analysis of the emotional content in video titles, descriptions, transcriptions, and user comments. This methodology yielded valuable insights into the nature of emotional content within these videos and its impact on audience engagement. Such findings are integral to deepening our understanding of emotional biases in video recommendations and developing strategies to mitigate their effects.
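As a concrete illustration of this scoring step, the sketch below applies the model through the Hugging Face transformers pipeline; the checkpoint identifier is the publicly released Hartmann (2022) emotion model and is assumed here to correspond to the one described above.

```python
# Sketch of emotion scoring with the DistilRoBERTa emotion classifier.
from transformers import pipeline

emotion_classifier = pipeline(
    "text-classification",
    model="j-hartmann/emotion-english-distilroberta-base",  # assumed checkpoint
    top_k=None)                                             # scores for all seven emotions

def score_emotions(text):
    """Return a dict mapping each emotion label to its probability."""
    result = emotion_classifier(text, truncation=True)
    scores = result[0] if isinstance(result[0], list) else result
    return {item["label"]: item["score"] for item in scores}

# Example: score a video title
print(score_emotions("Protesters demand action over detention camps"))
```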

4.3 Moral foundation assessment

In our exploration of the biases present in video recommendations, we acknowledge that alongside emotions, the subtle yet powerful influence of moral values plays a crucial role in steering viewer choices. These values, which form the backbone of personal ethics and decision-making, are instrumental in shaping how audiences perceive and interact with video content. Our study, therefore, delves into the realm of morality in recommended videos, aiming to unravel how these ethical dimensions influence viewer behavior and content preferences.

To investigate moral values in video content, we employed the extended Moral Foundations Dictionary (eMFD), a sophisticated tool designed for extracting moral content from textual data (Hopp et al. 2021). The eMFD represents a significant advancement in moral analysis, leveraging the input of a large and diverse group of human annotators to capture a wide range of moral intuitions. This methodology contrasts with previous approaches, which often relied on a small group of experts and resulted in a more constrained interpretation of morality.

For this study, we have used the eMFD as a quantitative tool and did not engage in its construction process. The eMFD’s construction involved a detailed annotation process that profoundly enhanced our analysis of moral content in video recommendations. In this process, each word in the eMFD is assigned continuously weighted vectors, reflecting its likelihood of association with five core moral foundations: Care/Harm, Fairness/Cheating, Loyalty/Betrayal, Authority/Subversion, and Sanctity/Degradation. Additionally, a key aspect of our methodology is the eMFD’s capability to evaluate the moral connotations of each word based on its alignment with vice or virtue characteristics. This dual approach, considering both the moral foundations and the vice-virtue spectrum, allows for a nuanced assessment of the moral undertones in video content. By examining how words align with these ethical dimensions, we gain a detailed view of how moral values are conveyed, offering insights into their potential impact on viewer behavior and preferences.

The methodology employed by the eMFD does not utilize probability distributions to analyze or interpret text. Instead, the approach is fundamentally frequency-based and categorical. When applied to a body of text, the eMFD quantifies the presence of moral language by counting the occurrences of words and phrases associated with each of the five moral foundations. This process results in a set of metrics that reflect the extent to which each moral foundation is represented in the text. The output is thus a direct measure of moral rhetoric, expressed through the frequency of specific lexicon usage across the identified moral dimensions. This direct and categorical assessment of moral content allows for a clear understanding of how moral values are embedded and communicated in video content, enhancing our ability to analyze the ethical implications of video recommendations.
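The snippet below gives a simplified, frequency-based sketch of this scoring step. It assumes the eMFD is available locally as a CSV with one row per word and one weighted column per foundation; the file name and column names are placeholders, and the official eMFD distribution and its companion tooling differ in detail.

```python
# Simplified sketch of dictionary-based moral foundation scoring (illustrative only).
import pandas as pd

FOUNDATIONS = ["care", "fairness", "loyalty", "authority", "sanctity"]  # placeholder columns

emfd = pd.read_csv("emfd_weights.csv", index_col="word")  # hypothetical file layout

def moral_profile(text):
    """Average the foundation weights of all dictionary words found in the text."""
    tokens = [t.strip(".,!?\"'").lower() for t in text.split()]
    matched = [t for t in tokens if t in emfd.index]
    if not matched:
        return {f: 0.0 for f in FOUNDATIONS}
    return {f: float(emfd.loc[matched, f].mean()) for f in FOUNDATIONS}
```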

Furthermore, the eMFD incorporates sentiment analysis through the Valence Aware Dictionary and sEntiment Reasoner (VADER), providing an additional layer of depth to the moral language assessment. This combination allows for a sophisticated examination of the moral and ethical themes within the video content, considering not just the presence of moral language but also the context and sentiment surrounding it.
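For completeness, a minimal example of the VADER sentiment layer, using the vaderSentiment package, is shown below.

```python
# Minimal VADER sentiment scoring example.
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

vader = SentimentIntensityAnalyzer()
print(vader.polarity_scores("The crackdown drew widespread condemnation."))
# -> dict with 'neg', 'neu', 'pos', and 'compound' scores
```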

By applying the eMFD to our analysis of recommended videos, we aim to explore the moral dimensions in video titles, descriptions, transcriptions, and user comments. This approach helps us assess narratives and character portrayals, uncovering the moral and ethical implications embedded in these aspects. Understanding how moral values are represented across these video attributes will provide insights into their influence on viewer engagement and preferences, offering a more comprehensive view of moral biases in video recommendations.

4.4 Toxicity assessment

In exploring the biases in video recommendations, the measurement of toxicity is as crucial as the assessment of emotion and moral values. Toxicity, which encompasses rude, disrespectful, or unreasonable content, is a key factor that can significantly influence people’s choices and interactions with online content. Its importance lies in its potential impact on the viewer’s experience and decision-making process. Just like emotional and moral content, toxic elements in videos can subtly shape preferences and behaviors, which in turn may affect the functioning of recommendation algorithms. Therefore, incorporating toxicity as a measure in our analysis is essential to gain a comprehensive understanding of the various factors that contribute to recommendation biases and their implications on user engagement and content consumption patterns.

To methodologically assess toxicity in video content, we employed the Detoxify model (Hanu 2020). Detoxify is a state-of-the-art machine learning model designed to detect and quantify various forms of toxic behavior in textual content. Among the variants of Detoxify, we specifically chose the “unbiased” model, which is designed to minimize biases that often accompany toxicity detection, such as those related to gender, race, or specific ideologies. This model’s architecture is based on a transformer-based framework, leveraging the power of models like BERT for context-sensitive analysis. It has been trained on a large dataset comprising diverse and challenging text samples, allowing it to accurately identify and score a range of toxic behaviors, including insults, threats, and hate speech.

Our use of the unbiased Detoxify model involved analyzing textual elements of videos such as titles, descriptions, transcriptions, and user comments. By applying this model, we were able to generate toxicity scores for each video element, giving us a quantifiable measure of the toxic content present in the recommended videos. This approach allowed us to systematically evaluate the prevalence and severity of toxic content in video recommendations and to understand how such content could influence viewer behavior and preferences.
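A minimal sketch of this scoring step with the detoxify package is shown below; the example comment is a placeholder.

```python
# Sketch of toxicity scoring with the "unbiased" Detoxify checkpoint.
from detoxify import Detoxify

toxicity_model = Detoxify("unbiased")

# predict() returns scores for attributes such as toxicity, insult, and threat
scores = toxicity_model.predict("This is a perfectly civil example comment.")
print(scores)
```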

4.5 Topic analysis

Understanding the biases in video recommendations requires an analysis of the topics presented in these videos. The topic or theme of a video is a crucial element that can influence the recommendation algorithm. If certain topics are consistently recommended while others are neglected, this can indicate a bias in the algorithm. This bias may arise because viewers tend to watch certain topics more frequently or engage more deeply with them, prompting the algorithm to favor these topics in its recommendations. Such a trend can lead to a homogenization of content, where diverse or less popular topics are underrepresented. This aspect of topic analysis is essential in understanding how content diversity is maintained or limited by recommendation systems and its impact on viewer exposure to a broad range of subjects.

To conduct a comprehensive topic analysis, we employed a BERTopic model (Grootendorst 2022), a sophisticated machine learning tool designed for topic classification and extraction. The fine-tuned version of this model, referenced in Grootendorst (2023), is pre-trained on approximately 1,000,000 Wikipedia pages, covering a broad spectrum of knowledge. It is capable of identifying 2,377 distinct topics, providing us with a robust framework for analyzing the thematic content of videos. The BERTopic model operates using a transformer-based architecture, similar to BERT, which excels in understanding and categorizing complex textual data. This capability allows for precise and nuanced topic detection, essential for accurately assessing the range and diversity of topics in video content.

Our methodology involved applying the BERTopic model to various textual elements of videos, such as titles, descriptions, transcriptions, and comments. By doing so, we could systematically categorize videos into specific topics and analyze the distribution of these topics within the recommended videos. This approach enabled us to observe patterns and trends in topic representation, offering insights into how certain topics are either prioritized or overlooked by the recommendation algorithm.
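The sketch below illustrates this topic-assignment step; the Hugging Face checkpoint name is an assumption based on the publicly available Wikipedia-pretrained BERTopic model and requires a recent BERTopic release that supports loading from the Hub.

```python
# Sketch of topic assignment with a pre-trained Wikipedia BERTopic model.
from bertopic import BERTopic

topic_model = BERTopic.load("MaartenGr/BERTopic_Wikipedia")   # assumed checkpoint name

docs = ["Naval exercises continue near the disputed islands",
        "A documentary on 15th-century maritime trade routes"]
topics, probs = topic_model.transform(docs)                   # topic id per document

# Map topic ids to human-readable topic names
info = topic_model.get_topic_info().set_index("Topic")
print([info.loc[t, "Name"] for t in topics])
```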

4.6 Network analysis

Network analysis offers a powerful perspective for unraveling the often hidden biases within video recommendation systems. This approach goes beyond surface-level observations, diving into the complex web of connections that define how videos are interlinked and recommended. By mapping these networks, we can expose subtle patterns and relationships, illuminating how certain videos gain notability or become marginalized within the recommendation ecosystem.

Our focus on network analysis derives from a desire to decode the intricacies of YouTube’s recommendation algorithm, as detailed in Sect. 4.1.2. Here, we examine the structure of the recommendation network, which is composed of parent videos and their associated recommended (child) videos. This network representation is key to understanding how certain content gains traction and influences viewer choices, potentially leading to biases in what is recommended to users.

Central to our network analysis is the application of Eigenvector Centrality (EC), as shown in Eq. 1. This metric is insightful because it evaluates both the number and the quality of connections a video has. In the equation, \(EC(v)\) represents the centrality of a video \(v\). The term \(\lambda\) is the largest eigenvalue of the adjacency matrix \(A\), which normalizes the centrality values. The adjacency matrix \(A\) itself reflects the connection strengths between videos, where \(A_{uv}\) indicates the strength between video \(v\) and its neighbor \(u\). The sum \(\sum _{u \in N(v)}\) takes into account all neighboring videos \(u\) of \(v\). Essentially, a video with high eigenvector centrality, as calculated by this formula, is one that is recommended by other influential videos. This indicates a form of indirect influence that can significantly shape viewer consumption patterns, highlighting videos that are important due to their strong connections to other significant videos.

$$\begin{aligned} EC(v) = \frac{1}{\lambda } \sum _{u \in N(v)} A_{uv} EC(u) \end{aligned}$$
(1)
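As an illustration of how Eq. 1 can be evaluated in practice, the following minimal sketch uses NetworkX on a toy parent-to-recommended-video edge list; it is not the Gephi workflow employed in our analysis, and the edges are placeholders.

```python
# Minimal sketch (not the Gephi workflow used in this study): eigenvector
# centrality on a parent -> recommended-video graph. The edge list below is
# a placeholder for the crawled recommendation edges.
import networkx as nx

edges = [("v0", "v1"), ("v0", "v2"), ("v1", "v2"), ("v2", "v3"), ("v3", "v0")]

G = nx.DiGraph()
G.add_edges_from(edges)

# Eq. 1: EC(v) = (1/lambda) * sum_{u in N(v)} A_uv * EC(u); NetworkX solves
# this iteratively via power iteration on the adjacency matrix.
centrality = nx.eigenvector_centrality(G, max_iter=1000, tol=1e-06)

# Rank videos by centrality to surface the most influential nodes.
for video, score in sorted(centrality.items(), key=lambda kv: -kv[1])[:5]:
    print(video, round(score, 4))
```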

Our analysis, conducted with the aid of Gephi software (Bastian et al. 2009), identified videos that serve as pivotal nodes within the recommendation network. These influential videos can be seen as key influencers or trend starters, guiding the direction of video recommendations based on how viewers interact with them. By examining these key influencers, we aimed to discern whether and how they contribute to perpetuating certain biases within the recommendation algorithm.

We also incorporated modularity into our network analysis to refine the identification of influential nodes within their respective communities or modules. Modularity-based community detection partitions the network into communities, allowing us to target nodes that are not only broadly connected across the entire network but also pivotal within their specific communities. This consideration enhances the effectiveness of strategies that rely on influencer nodes for information dissemination, ensuring that attention is concentrated on videos that can significantly impact their immediate community.

The modularity \(Q\) of a network, particularly utilizing the Louvain method, is defined by the formula shown in Eq. 2.

$$\begin{aligned} Q = \frac{1}{2m} \sum _{i,j} \left[ A_{ij} - \frac{k_i k_j}{2m} \right] \delta (c_i, c_j) \end{aligned}$$
(2)

In this formula, \(A_{ij}\) represents the adjacency matrix element, \(k_i\) and \(k_j\) are the degrees of nodes \(i\) and \(j\), \(m\) is the total weight of all edges, and \(\delta (c_i, c_j)\) is the Kronecker delta function; the Kronecker delta is 1 if nodes \(i\) and \(j\) are in the same community and 0 otherwise. This formula aids in identifying nodes that are central within their communities, thereby highlighting the influencers who play a critical role in the dissemination of content within specific segments of the network.
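For readers who wish to reproduce this step outside Gephi, the sketch below shows one way to obtain a Louvain partition and its modularity \(Q\) with NetworkX; the example graph is a stand-in for the recommendation network, and the per-community ranking by eigenvector centrality is an illustrative choice rather than the exact procedure used here.

```python
# Illustrative sketch (outside Gephi): Louvain communities and modularity Q
# (Eq. 2) with NetworkX. The karate club graph stands in for the
# recommendation network; the per-community ranking is an illustrative choice.
import networkx as nx

G = nx.karate_club_graph()

# Louvain partition (requires networkx >= 2.8); fixed seed for reproducibility.
communities = nx.community.louvain_communities(G, seed=42)

# Modularity Q of the resulting partition, as in Eq. 2.
Q = nx.community.modularity(G, communities)
print(f"{len(communities)} communities, modularity Q = {Q:.3f}")

# Within each community, rank nodes to locate community-level influencers.
centrality = nx.eigenvector_centrality(G)
for i, community in enumerate(communities):
    top = max(community, key=centrality.get)
    print(f"community {i}: most central node = {top}")
```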

4.7 Examination of engagement metrics

The analysis of engagement metrics plays a pivotal role in understanding biases in video recommendation systems. Engagement metrics, such as views, likes, and comment counts, serve as indicators of a video’s popularity and audience interaction. These metrics can be instrumental in revealing whether recommendation algorithms disproportionately favor more popular content, potentially leading to a bias in recommendations.

Our examination focused on the hypothesis that recommendation algorithms might be inclined to suggest videos with higher engagement metrics as users explore content more deeply. This potential bias could manifest in a cycle where already popular videos gain further visibility, overshadowing less-viewed content regardless of its relevance or quality. To investigate this, we considered a video popular if it had a high view count, a substantial number of likes, and a significant number of comments.

In our methodology, we tracked the engagement metrics of videos across various recommendation depths. This approach allowed us to analyze patterns in how the recommendation algorithm prioritizes content based on viewer engagement. If we observe a trend where videos with higher engagement consistently appear in recommendations, it could indicate an algorithmic preference for popular content.
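A minimal sketch of this tracking step is shown below, using pandas to summarize engagement per recommendation depth; the column names and values are placeholder assumptions about how the crawled data might be organized.

```python
# Illustrative sketch: summarizing engagement metrics per recommendation
# depth. The column names and values are placeholder assumptions about how
# the crawled data might be organized.
import pandas as pd

videos = pd.DataFrame({
    "depth":    [0, 0, 1, 1, 2, 2],
    "views":    [1200, 950, 48000, 52000, 61000, 3000],
    "likes":    [40, 31, 2100, 2500, 2900, 90],
    "comments": [5, 3, 310, 290, 400, 12],
})

# Median, mean, and maximum per depth indicate whether deeper recommendation
# levels concentrate on more heavily engaged videos.
summary = videos.groupby("depth")[["views", "likes", "comments"]].agg(
    ["median", "mean", "max"]
)
print(summary)
```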

This analysis is crucial in understanding how engagement metrics could skew the diversity of recommended content. A bias towards highly engaged videos might limit the exposure of newer or equally relevant content, potentially narrowing the spectrum of ideas and perspectives presented to viewers. By examining engagement metrics, we aim to uncover and understand these potential biases, contributing to a more comprehensive understanding of the factors influencing content recommendations on digital platforms.

4.8 Statistical measurement

Our analysis utilizes statistical methods to validate the presence and extent of biases within YouTube’s recommendation system, as outlined in Sect. 4. Through quantitative evaluation, we assess the significance of deviations in content distribution and engagement metrics from expected norms. This approach ensures a robust understanding of recommendation biases, supporting our findings with concrete evidence of how these biases may affect content visibility and user interaction patterns across the platform.

4.8.1 Drift significance

A key component of our statistical analysis is the evaluation of drift significance, particularly how content distribution changes across recommendation depths. For this purpose, we utilized the Chi-Square test (Pearson 1900), a robust statistical method for examining the relationship between categorical variables. This test compares the observed frequency of categories at different recommendation depths against expected frequencies, assuming no underlying bias, as detailed in Eq. 3.

$$\begin{aligned} \chi ^2 = \sum \frac{(O_i - E_i)^2}{E_i} \end{aligned}$$
(3)

In this context, \(O_{i}\) represents the observed frequency of each category within the recommendation depths, while \(E_{i}\) denotes the expected frequency, calculated based on the assumption of uniform distribution across depths. The expected frequencies are derived using the formula shown in Eq. 4. This calculation helps us to establish a baseline against which to measure the extent of deviation (or drift) from expected content distribution patterns. In our case, the rows were the categorical values, and the columns were the recommendation depths. This setup allows for a detailed analysis of how content categories distribute across different levels of recommendation, providing insights into the recommendation algorithm’s behavior and its potential biases.

$$\begin{aligned} E = \frac{(\text {Row Total}) \times (\text {Column Total})}{\text {Grand Total}} \end{aligned}$$
(4)

We set a significance level of 0.05 to ascertain the threshold for rejecting the null hypothesis, which posits that "there is no drift in the distribution of the categorical variables between the different recommendation depths." The degrees of freedom, crucial for understanding the distribution’s variance, are calculated as shown in Eq. 5.

$$\begin{aligned} dof = (\text {Number of rows} - 1) \times (\text {Number of columns} - 1) \end{aligned}$$
(5)

The p-value, which is the area under the Chi-Square distribution curve to the right of the observed \(\chi ^2\) statistic for the calculated degrees of freedom, indicates the probability of observing a result as extreme as, or more extreme than, what was actually found, assuming the null hypothesis is true. If the p-value is small (less than the significance level), then we reject the null hypothesis and conclude that there is evidence of drift in the distribution of the categories between the depths. Conversely, a larger p-value suggests that the observed differences could have occurred by random chance, leading us to fail to reject the null hypothesis, indicating no significant drift or difference between the depths.

As noted above, calculating the p-value requires integrating the Chi-Square probability density function (PDF) from the observed \(\chi ^2\) value to infinity, which is not typically done by hand. Statistical tables exist for this purpose, but because they cover only a limited range of values, we instead used statistical software, specifically the "chi2_contingency" function from the "scipy.stats" Python package. This function applies an approximation method to calculate the p-value for large Chi-Square statistics. For very large Chi-Square statistics, the p-value may be approximated to 0.0 due to limitations in floating-point arithmetic and computational precision. This approximation is reasonable because such large Chi-Square statistics indicate a very strong deviation from the null hypothesis, making it highly unlikely to observe such extreme results under the assumption of independence.
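The sketch below illustrates this computation on a hypothetical contingency table of category counts (rows) by recommendation depth (columns); the counts themselves are placeholders, not values from our dataset.

```python
# Sketch of the drift test with scipy.stats.chi2_contingency: rows are
# content categories, columns are recommendation depths 0-4. The counts
# below are placeholders, not values from our dataset.
import numpy as np
from scipy.stats import chi2_contingency

observed = np.array([
    [120,  90,  60,  40,  30],   # category A across depths 0-4
    [ 40,  70,  95, 110, 130],   # category B
    [ 30,  35,  40,  45,  40],   # category C
])

chi2, p_value, dof, expected = chi2_contingency(observed)
print(f"chi2 = {chi2:.2f}, dof = {dof}, p = {p_value:.4g}")

if p_value < 0.05:
    print("Reject H0: evidence of drift across recommendation depths.")
else:
    print("Fail to reject H0: no significant drift detected.")
```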

Through this statistical framework, we aim to provide a concrete measure of bias, offering a more definitive understanding of how recommendation algorithms might skew content distribution. This methodological rigor enhances the credibility of our findings, facilitating a deeper exploration into the mechanics of bias within YouTube’s recommendation system.

4.8.2 Inequality quantification

In examining biases within YouTube’s content recommendation algorithms, analyzing disparities in engagement metrics reveals critical insights. This method allows us to inspect how viewer interactions are distributed among recommended videos, shedding light on potential inequalities that may affect content visibility and the overall user experience.

The Atkinson Index (Atkinson et al. 1970), a measure initially developed to assess income inequality within populations, provides a useful framework for examining disparities in content engagement on YouTube. This index quantifies the extent to which individual data points (in our case, engagement metrics such as likes, comments, and views) diverge from a perfectly equal distribution. The Atkinson Index is defined as shown in Eq. 6.

$$\begin{aligned} A(\epsilon ) = 1 - \left( \frac{1}{n} \sum _{i=1}^{n} p_{i}^{1-\epsilon } \right) ^{\frac{1}{1-\epsilon }} \end{aligned}$$
(6)

\(A(\epsilon )\) represents the Atkinson Index, with \(\epsilon\) being a parameter that determines the sensitivity of the measure to changes in different parts of the distribution. A higher value of \(\epsilon\) indicates a greater sensitivity to inequalities at the lower end of the distribution. \(n\) is the number of videos considered in a particular depth of recommendation. \(p_i\) is the proportion of total engagement (likes, views, or comments) that the \(i\)-th video receives relative to the total engagement of all videos in the dataset.

In our analysis, we adopted an \(\epsilon\) value of 0.5 to balance the measure’s sensitivity to inequalities at both the lower and upper ends of the engagement spectrum. The choice of \(\epsilon\) is significant because it allows for the adjustment of the index’s focus, with higher values prioritizing the lower end of the distribution. As \(\epsilon\) approaches infinity, the Atkinson Index nears 1, reflecting an increasing emphasis on disparities at the lower end of the engagement spectrum.
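The following sketch transcribes Eq. 6 directly into code and evaluates it on placeholder view counts for a single recommendation depth; it assumes \(\epsilon \ne 1\), since the limiting case is not used in our analysis.

```python
# Direct transcription of Eq. 6 applied to placeholder view counts for the
# videos recommended at a single depth (epsilon must differ from 1 here).
import numpy as np

def atkinson_index(engagement, epsilon=0.5):
    """Atkinson Index over engagement shares p_i, as written in Eq. 6."""
    x = np.asarray(engagement, dtype=float)
    p = x / x.sum()                         # p_i: share of total engagement
    inner = np.mean(p ** (1.0 - epsilon))   # (1/n) * sum_i p_i^(1 - epsilon)
    return 1.0 - inner ** (1.0 / (1.0 - epsilon))

views_at_depth = [1_200_000, 450_000, 90_000, 12_000, 800, 150]
print(f"Atkinson(epsilon=0.5) = {atkinson_index(views_at_depth, 0.5):.3f}")
```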

By applying the Atkinson Index to the engagement metrics of recommended videos, we aim to quantify the level of inequality present within each recommendation depth. This analysis allows us to assess whether the YouTube recommendation algorithm exhibits a bias towards videos with significantly higher engagement metrics, potentially marginalizing content with lower but still substantial engagement levels.

Evaluating engagement inequality with the Atkinson Index sheds light on the dynamics of content recommendation and visibility on YouTube. It helps identify if the recommendation system perpetuates a concentration of attention on a small subset of highly popular videos, thereby reinforcing existing visibility and engagement disparities. Such insights are crucial for understanding the broader implications of algorithmic recommendation practices on content diversity and user exposure.

4.8.3 Understanding Gaussian distributions and statistical measures

In the study of complex datasets, whether examining patterns in digital narratives or analyzing trends in social data, the application of statistical measures provides a foundational framework for both describing and understanding variability within the data. Central to this framework is the concept of the Gaussian distribution, often referred to as the normal distribution, which is a fundamental statistical distribution pattern observed in many natural phenomena and datasets.

The Gaussian distribution is characterized by its symmetric, bell-shaped curve, where the majority of observations cluster around a central value (the mean), decreasing in frequency as they diverge towards the extremes. This distribution is mathematically defined by its mean \(\mu\) and standard deviation \(\sigma\), where the following is true:

  • The mean \(\mu\) represents the average value of the dataset, providing a central point around which the data is distributed.

  • The standard deviation \(\sigma\) quantifies the dispersion or variability of the dataset, indicating how spread out the data points are from the mean.

Formally, the Gaussian distribution can be expressed through its probability density function (PDF) as shown in Eq. 7:

$$\begin{aligned} f(x | \mu , \sigma ) = \frac{1}{\sigma \sqrt{2\pi }} e^{ -\frac{(x-\mu )^2}{2\sigma ^2} } \end{aligned}$$
(7)

Within the context of Gaussian distributions, the concepts of mean + std (\(\mu\) + \(\sigma\)) and mean + 2std (\(\mu\) + \(2\sigma\)) serve as crucial analytical thresholds. These measures are instrumental in understanding the distribution of data:

  • Mean + std (\(\mu\) + \(\sigma\)): Approximately 68% of the data in a Gaussian distribution falls within one standard deviation of the mean. This range identifies the most common variance from the average, encapsulating the bulk of data points in a typical distribution.

  • Mean + 2std (\(\mu\) + \(2\sigma\)): Expanding the range to two standard deviations from the mean encompasses approximately 95% of the data. This broader threshold is critical for identifying outliers, which are the data points that lie beyond the typical range of variation. These outliers can signify extreme cases or occurrences that deviate significantly from the norm.

The application of these statistical measures and thresholds provides a powerful lens for analyzing and interpreting data. In real-world scenarios, understanding the distribution of data within these thresholds enables researchers to do the following:

  • Identify patterns and trends that are central to the dataset

  • Detect outliers or anomalies that may warrant further investigation

  • Make informed decisions based on the statistical behavior of the data

Employing these statistical concepts allows for a nuanced analysis that goes beyond mere averages, offering insights into the variability and extremities of the data. This approach is invaluable across a spectrum of fields, from social sciences to natural phenomena, enabling a deeper comprehension of the underlying patterns and behaviors within complex datasets.
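As a concrete illustration of the mean + 2std threshold described above, the sketch below flags values that exceed \(\mu + 2\sigma\); the toxicity scores are placeholders rather than values from our dataset.

```python
# Sketch: flagging extreme values with the mean + 2*std threshold. The
# toxicity scores below are placeholders, not values from our dataset.
import numpy as np

toxicity = np.array([0.02, 0.05, 0.03, 0.04, 0.06, 0.55, 0.03, 0.72, 0.04])

mu, sigma = toxicity.mean(), toxicity.std()
threshold = mu + 2 * sigma                 # mu + 2*sigma outlier threshold

outliers = toxicity[toxicity > threshold]
print(f"mean = {mu:.3f}, std = {sigma:.3f}, threshold = {threshold:.3f}")
print("high-toxicity outliers:", outliers)
```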

5 Results

In this section, we delve into our findings, unraveling the dynamics of narrative drift across various dimensions, including influencer nodes, engagement metrics, and other pivotal elements that underscore the algorithmic influence on content dissemination and reception.

5.1 Emotion drift

In our comprehensive analysis, as detailed in Sect. 4.2, we embarked on an investigation to discern the presence of emotion drift across various narratives, alongside their respective attributes as elaborated in Sect. 4.1.3. This inquiry was grounded in the hypothesis that emotional tones could shift significantly across the depth of recommendations, a phenomenon we aimed to quantify and understand within the context of digital discourse.

Our findings, illustrated in Fig. 1, particularly spotlight the China-Uyghur Conflict as a case study. Initial expectations, based on the narrative’s nature discussed in Sect. 2.1, suggested predominantly negative emotions. However, the data revealed notable emotion drifts, especially noticeable in the attributes of titles and descriptions (Fig. 1a and b). These attributes exhibited a remarkable increase in neutrality and emergence of joy, with a simultaneous decrease in negative emotions such as anger and fear.

Conversely, the transcriptions and comments associated with the videos exhibited less variation. This could be attributed to the extended length of transcripts, which tends towards neutrality, and the varied nature of comments. Specifically, Fig. 1c demonstrated an increase in neutral expressions and a decrease in disgust, whereas Fig. 1d showcased an uptick in joy and a trace of surprise. Collectively, these findings underscore a general decline in negative sentiment across the recommendation depth, signaling a notable emotion drift within the context of the China-Uyghur Conflict narrative.

For the narrative concerning the South China Sea Dispute, our examination was rooted in expectations of initially negative emotions, as outlined in Sect. 2.2. This narrative, similar to the China-Uyghur conflict, demonstrated a decline in negative sentiments, with a discernible decrease in anger as shown in Fig. 2. Intriguingly, the descriptions, as captured in Fig. 2b, unveiled an initial spike in fear at the outset of recommendations. Yet, this was followed by a notable reduction in fear levels later on. These observations collectively highlight a broader trend towards diminished negative sentiments, underscoring an emotion drift within the discourse of the South China Sea Dispute.

For the Cheng Ho narrative, expectations were set for initially positive emotions, as outlined in Sect. 2.3. True to prediction, this narrative began with a higher combination of neutrality and joy compared to others. As depicted in Fig. 3, these levels largely remained consistent across recommendation depths. Notably, Fig. 3b shows a minor increase in fear, yet the overall emotional distribution maintained its positivity. Thus, the Cheng Ho narrative exhibited minimal emotion drift, with emotional levels showing negligible fluctuation, distinguishing it from the variability observed in other narratives.

In summarizing the emotion analysis across different narratives, our investigation revealed a subtle spectrum of emotion drift. Narratives characterized by negative or contentious themes exhibited notable shifts towards less negative emotional expressions across recommendation depths, indicating a dynamic emotional response to changing content. Conversely, narratives with inherently positive themes demonstrated stability in their emotional tone, with minimal shifts observed.

The analysis revealed significant shifts in the emotional tone of narratives, particularly from initially negative to increasingly neutral and positive emotions. This shift can be attributed to YouTube’s recommendation algorithm, which may prioritize content fostering longer engagement and positive user experiences. As users engage with content, the algorithm adjusts to suggest videos that are less emotionally charged and more balanced in tone. This trend highlights the dynamic nature of YouTube’s recommendation system and its potential impact on user perceptions and behavior. Understanding these shifts is crucial for developing strategies to enhance the quality and impact of online content recommendations, promoting constructive dialogue and reducing polarization.

To meticulously assess the presence and statistical significance of emotion drift across narratives, we utilized the Chi-Square method as outlined in Sect. 4.8.1. With seven emotion categories across five depths, our analysis yielded \((7 - 1) \times (5 - 1) = 24\) degrees of freedom, derived from the formula in Eq. 5.

The p-values obtained for the emotion drift, as detailed in Table 5, were below the strict significance threshold of 0.05 for both the China-Uyghur Conflict and the South China Sea Dispute narratives. This indicates a significant deviation from the null hypothesis, affirming the presence of an emotion drift among the depths. Conversely, the Cheng Ho Propaganda narrative exhibited p-values near, but still below, the threshold, except in one instance where the value exceeded it; this contrasts with the near-zero values observed in the other narratives and suggests a drift that is present but weaker. Finally, given that the sample size for comments is approximately 100 times larger than that for the other elements, it inherently possesses a higher sensitivity to detect changes. This increased sensitivity is reflected in the larger Chi-Square statistics observed for comments.

Table 5: p-values for emotion drift
Fig. 1: Emotion distribution across all attributes of the China-Uyghur conflict

Fig. 2: Emotion distribution across all attributes of the South China Sea dispute

Fig. 3: Emotion distribution across all attributes of the Cheng Ho propaganda

5.2 Morality drift

In our comprehensive analysis of morality drift within digital narratives, we delve into the evolution of moral values across various narratives, examining how these values fluctuate through the recommendation depth of video content. This undertaking, detailed in Sect. 4.3, involves assessing the prevalence of moral virtues and vices, captured through mean scores at the sentence level, to measure the moral tone across different layers of digital discourse.

Our investigation into the China-Uyghur Conflict anticipated a dominance of negative moral values. The findings, as illustrated in Figs. 4 and 5, reveal an initial decline in vices, most notably harm (initially the most prevalent), as well as cheating, betrayal, subversion, and degradation, with a trend towards stabilization at deeper recommendation levels. Conversely, virtue scores, especially loyalty as highlighted in Fig. 5a–d, generally exhibit an increase or remain stable, underscoring a shift towards more positive moral values in the narrative discourse. In summary, for the China-Uyghur Conflict specifically, our analysis revealed a marked decrease in negative moral values (vices) and an increase or stabilization in positive moral values (virtues), notably loyalty.

In analyzing the South China Sea Dispute, subtle shifts in moral values were observed. Vice values, indicated in Fig. 6, showed minor variations; increases in degradation and harm were noted in titles in Fig. 6a, while a slight overall decrease in vices, except for degradation, was seen in descriptions in Fig. 6b. Transcriptions in Fig. 6c revealed a slight rise in harm and degradation, with other values remaining steady. Comments in Fig. 6d initially increased in vices at depth one, but subsequently all vice values declined.

Virtue values presented more fluctuation, particularly in titles in Fig. 7a, where three virtues increased and two decreased, with sanctity experiencing the most significant change. In descriptions in Fig. 7b, we observed an initial decrease and then an increase later on. Transcriptions in Fig. 7c showed virtue levels to be relatively stable. Lastly, in Fig. 7d we noticed an increase in all virtues for comments.

In summarizing the moral dynamics within the South China Sea Dispute, it becomes evident that the shifts in moral values across various depths of recommendations present a complex pattern, lacking a straightforward trajectory.

In the analysis of the Cheng Ho Propaganda, the moral landscape presented diverse shifts. Vice values, referenced in Fig. 8, depicted an uptick in harm across titles as shown in Fig. 8a, alongside minor increases in other vices. Descriptions and transcriptions, in Fig. 8b and c, demonstrated slight decreases in some values, while others remained unchanged. Comments, as per Fig. 8d, showed a mix of stability and slight increases in certain vices.

Virtue values, detailed in Fig. 9, varied across the board, with transcriptions in Fig. 9c experiencing more uniform changes, contrasting with the significant fluctuations in other attributes, both in terms of increases and decreases.

In the examination of narratives surrounding the China-Uyghur Conflict, South China Sea Dispute, and Cheng Ho Propaganda, our investigation into the phenomenon of morality drift reveals a multifaceted spectrum of moral values that vary significantly across different levels of analysis. Through rigorous sentence-level evaluation, this analysis uncovers a dynamic shift accompanied by marked fluctuations in virtues and vices. Some narratives illustrate a discernible movement towards the stabilization of virtues or a reduction in vices, whereas others display complex patterns that go against a linear progression. These observations collectively highlight the relationship between the dissemination of digital content and the changing moral perspectives of audiences.

Building upon this foundation, our subsequent analysis, detailed in Sect. 4.8.1, delves into the statistical significance of morality drift. To build the counts required for the Chi-Square evaluation, we assigned each sentence the moral value with the maximum score; the p-values were then computed from these counts. Note that the moral values plotted in Figs. 4, 5, 6, 7, 8, and 9 are therefore not count distributions; they are mean word-level scores per sentence across the entire dataset. The empirical evidence, as shown in Table 6, characterized by p-values consistently falling below the threshold of 0.05, clearly confirms the presence of significant shifts in moral content across the majority of cases examined. In some instances, the degree of drift is profoundly marked, with p-values approaching zero or, in certain cases, rounding to zero. This quantitative validation reinforces the notion of an ongoing evolution in moral values, further illustrating the complex interplay between content exposure and the evolution of moral perception within the digital era.

Table 6: p-values for morality drift
Fig. 4: Moral vices distribution across all attributes of the China-Uyghur conflict

Fig. 5: Moral virtues distribution across all attributes of the China-Uyghur conflict

Fig. 6: Moral vices distribution across all attributes of the South China Sea dispute

Fig. 7: Moral virtues distribution across all attributes of the South China Sea dispute

Fig. 8: Moral vices distribution across all attributes of the Cheng Ho propaganda

Fig. 9: Moral virtues distribution across all attributes of the Cheng Ho propaganda

5.3 Toxicity drift

In our refined analysis of toxicity shifts within digital narratives, we leveraged the toxicity measurement framework as outlined in Sect. 4.4. This approach emphasizes not only the evaluation of average toxicity levels but also a rigorous examination of extreme toxicity instances, employing the mean + 2std metric. As detailed in Sect. 4.8.3, this metric, grounded in the principles of Gaussian distributions, effectively highlights data points that significantly deviate from the norm, serving as a crucial threshold for identifying highly toxic content.

For our toxicity analysis, we primarily focused on average toxicity values, which constituted a single variable. Recognizing that the Chi-Square test is unsuitable for such cases, we instead applied Gaussian distribution principles to identify and scrutinize outliers, particularly those representing high toxicity levels, thus providing a clearer understanding of underlying trends.

This methodology facilitates a fine-grained understanding of toxicity across different digital platforms, enabling us to distinguish between prevalent toxicity trends and the emergence of content that escalates from being merely unpleasant to unequivocally harmful. By applying the mean + 2std threshold, we can precisely identify and analyze instances of extreme toxicity, thereby illuminating the dynamics of toxicity within digital narratives at various levels of content recommendation.

In the context of our analysis on the narrative surrounding the China-Uyghur conflict, initial findings revealed an intriguing pattern: the mean toxicity levels were notably low from the starting point. More interestingly, a specific trend was observed at the initial depth of content recommendation, where mean toxicity demonstrated a marked reduction, hinting that the narrative was becoming progressively less inflammatory over time. This trend of low mean toxicity levels not only persisted but also stabilized, maintaining a low profile across the entirety of our observation period.

Upon delving into the high toxicity levels, distinct fluctuations across different content aspects became apparent. For instance, the toxicity levels within narrative titles, as illustrated in Fig. 10a, initially decreased, only to rise again, reflecting a fluctuating pattern of toxicity over time. Conversely, the narrative descriptions, shown in Fig. 10b, experienced an initial uptick in toxicity levels, which subsequently decreased, indicating a shift towards moderation after an initial period of heightened toxicity. The most pronounced decline in high toxicity levels was observed in transcription content, as detailed in Fig. 10c, suggesting a significant reduction in the toxicity of this content segment. On the other hand, user-generated comments, as seen in Fig. 10d, experienced a slight increase in toxicity.

Despite these variable trends in instances of high toxicity, the narrative experienced a decrease in aggregate toxicity levels. This observation suggests a gradual improvement in the content recommendation algorithms of digital platforms, steering the narrative towards a more moderated and less extreme discourse on the China-Uyghur conflict. Such a trend is indicative of an evolving digital ecosystem that is becoming increasingly adept at managing and mitigating the spread of toxic content, contributing to a more controlled and constructive online discourse.

In our comprehensive analysis of the South China Sea Dispute narrative, we observed an initial trend where mean toxicity levels were notably low and stable, closely paralleling the findings from the China-Uyghur Conflict. These levels were so minimal that they nearly approached zero, a trend clearly depicted in Fig. 11. This consistency underscores a baseline of low toxicity across different narratives within our dataset.

Remarkably, at the initial recommendation depth (depth 0), our analysis identified no instances of highly toxic content within titles, descriptions, and transcriptions. These content attributes uniformly failed to cross the high toxicity threshold, as delineated in Fig. 11a–c.

However, the narrative complexity increased beyond depth 0, where we began to observe the emergence of high toxicity content. Specifically, Fig. 11a revealed a progressive increase in toxicity levels with each subsequent depth, although this trend plateaued at depth 4, indicating a stabilization in the toxicity of narrative titles. Conversely, Fig. 11b showcased an initial spike in toxicity at depth 1, followed by a significant reduction at depth 2, with only a slight recovery thereafter. A somewhat parallel trend was observed in Fig. 11c, mirroring the behavior of descriptions but with a less pronounced decrease at depth 2, followed by a mild increase in toxicity levels.

The most notable finding was within the user comments, as captured in Fig. 11d, where high toxicity levels peaked, nearing 0.9. This intensity was consistently maintained across different depths, suggesting an area of concentrated toxicity within the narrative. Despite this, the overall mean toxicity distribution remained largely unaffected across all attributes, illustrating a dynamic that mirrors the previously analyzed China-Uyghur Conflict narrative. This observation points to a nuanced understanding of narrative engagement, where, despite the presence of highly toxic comments, the aggregate toxicity level across the narrative’s content did not exhibit a significant shift, maintaining a low and stable mean toxicity level.

In the exploration of the Cheng Ho Propaganda narrative, our analysis found a pattern consistent with other narratives regarding initial mean toxicity levels: predominantly low, with occasional spikes in high toxicity instances, as illustrated in Fig. 12. The titles showed an immediate increase in high toxicity levels in Fig. 12a, followed by subsequent fluctuations. This variability suggests a dynamic narrative engagement from the outset. Meanwhile, descriptions depicted a different pattern, with a notable decrease in high toxicity at depth 1, before increasing again as shown in Fig. 12b. This oscillation in toxicity levels indicates a nuanced content landscape that evolves with user interaction depth.

Contrastingly, transcriptions did not exhibit high toxicity at the surface level (depth 0), as shown in Fig. 12c; however, an increase was observed as users engaged more deeply, eventually stabilizing. Comments, on the other hand, demonstrated a gradual increase in high toxicity levels with each deeper engagement level in Fig. 12d, hinting at a compounding effect of user interactions on toxicity.

Our findings suggest that as users navigate through deeper layers of recommendations, the likelihood of encountering highly toxic content subtly increases. Nonetheless, it appears that content recommendation algorithms are designed with a cautious balance in mind, aiming to maintain overall low toxicity levels. This approach suggests a potential systemic bias towards creating a safer, more welcoming digital environment at the cost of possibly filtering out a broader spectrum of voices. Such a strategy underscores the inherent challenge platforms face in curating content: they must navigate the fine line between reducing exposure to potentially harmful content and preserving a space favorable to free expression and diverse viewpoints. This balancing act reflects the complexities of managing digital narratives, aiming to ensure user safety while fostering an inclusive and neutral platform.

Fig. 10: Toxicity distribution across all attributes of the China-Uyghur conflict

Fig. 11: Toxicity distribution across all attributes of the South China Sea dispute

Fig. 12: Toxicity distribution across all attributes of the Cheng Ho propaganda

5.4 Topic drift

To examine the evolution of topics within the narratives, we employed BERTopic, as detailed in Sect. 4.5. Given the model’s extensive topic generation, we focused on the three most prevalent topics for each depth level, enabling us to track topic transitions effectively. This approach allowed us to identify a range from a minimum of three to a maximum of fifteen topics by the conclusion of depth 4, dependent on the degree of topic overlap and shift.
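A minimal sketch of this per-depth tracking is shown below; the depth values and topic identifiers are placeholders illustrating how the three most prevalent topics at each depth can be extracted.

```python
# Illustrative sketch: the three most prevalent topics at each recommendation
# depth. Depth values and topic ids below are placeholders.
import pandas as pd

assignments = pd.DataFrame({
    "depth": [0, 0, 0, 1, 1, 1, 1, 2, 2, 2],
    "topic": [384, 384, 12, 384, 97, 97, 512, 97, 512, 731],
})

top3_per_depth = (
    assignments.groupby("depth")["topic"]
    .apply(lambda s: s.value_counts().head(3).index.tolist())
)
print(top3_per_depth)
```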

For the analysis of topic drifts, it’s important to note that at each depth level, both the count and labels of topics change; for instance, depth \(a\) might have \(x\) topics, while depth \(b\) might have \(y\) different topics. Given this variability, the Chi-Square test, which requires a fixed distribution for accurate calculations, was not applicable in our case.

In the case of the China-Uyghur conflict, Fig. 13 reveals a significant initial emphasis on the topic of "genocide" (topic_id 384), characterized by keywords such as "genocide, detainees, persecution, internment, holocaust". This topic, visually represented in blue, constitutes a large portion of the dialogue at depth 0.

Across various attributes, we observed a rapid decrease in the prevalence of the genocide topic in the initial depths, eventually vanishing in the latter stages. At the same time, new topics emerged, notably those related to soccer, as shown in Fig. 13a, as well as themes involving actresses and singers, depicted in Fig. 13b, folklore, highlighted in Fig. 13c, and songs, detailed in Fig. 13d by depth 4. This dramatic shift in thematic focus underscores a significant topic drift within the discourse surrounding the China-Uyghur Conflict.

For the South China Sea Dispute narrative, our analysis using BERTopic identified an initial focus on political elements, specifically the topic of "candidate" illustrated by keywords such as "candidacy, candidate, candidates, presidential, presidency" at depth 0. This focus was evident in titles, descriptions, and transcriptions as portrayed in Fig. 14. Early discussions also highlighted environmental concerns, with "reefs" and "corals" mentioned as being at risk from Chinese operations as illustrated in Fig. 14a, alongside "harbour" and "naval" topics, indicating initial sea-related discussions. As the narrative progressed, these topics gave way to more militaristic themes, such as "warships" and "missiles", which maintained a connection to the sea and potential conflicts. Interestingly, the comments diverged, introducing unrelated topics like films, authors, and singers by depth 4 as shown in Fig. 14d, illustrating a topic shift. However, except for comments, the emerging topics remained aligned with the narrative’s core themes, indicating a focused yet evolving discourse.

In the Cheng Ho narrative, the initial dominant topics were associated with keywords like "yang, yin, rituals, religions, shamanism" as illustrated in Fig. 15. This suggests Zheng He’s voyages were not just exploratory but also aimed at fostering harmony and spiritual unity through engagement with various religious practices and shamanistic rituals. Throughout the narrative, the topics remained relevant to these original themes. By depth 4, discussions evolved to include cultural expressions such as festivals, celebrations, dance, and folklore. However, the comments section, as depicted in Fig. 15d, showcased a mix of related and unrelated discussions. This narrative, akin to the South China Sea Dispute, experienced a thematic evolution where initial and later topics, despite their differences, were interconnected. The comments, however, displayed a broader spectrum of topics by the end, indicating a diversification of discussion themes.

Fig. 13: Topic distribution across all attributes of the China-Uyghur conflict

Fig. 14: Topic distribution across all attributes of the South China Sea dispute

Fig. 15: Topic distribution across all attributes of the Cheng Ho propaganda

5.5 Influencer nodes

As detailed in Sect. 4.6, we employed Gephi software for network visualization and identification of influencer nodes. To isolate influential entities, we computed modularity values and eigenvector centrality. Modularity values helped us to dismiss insignificant communities due to their minimal size and relevance. Furthermore, nodes demonstrating higher eigenvector centrality were indicative of greater influence, as represented by their enlarged node sizes. For each network depth, beginning from depth 1, we pinpointed the top 5 influencer nodes. This initial depth was chosen deliberately to concentrate on nodes that exert a direct impact on the network’s foundational layer, thereby playing a pivotal role in the dissemination of information or behaviors. Unlike previous approaches that utilized BERTopic for topic identification, we chose manual inspection of video titles. Our decision to focus on titles, rather than other elements, was strategic. While we acknowledge the importance of various factors in content engagement, titles are often the decisive factor for viewers when navigating through recommendation networks. This approach allowed us to focus on influencer videos effectively, understanding their pivotal role in guiding viewer choices across the network depths.

In exploring the China-Uyghur conflict narrative through YouTube’s recommendation network, we uncover the subtle role of influencer nodes across varying depths. This is illustrated in both Table 7 and Fig. 16. Initially, viewers are presented with a diverse array of videos, from a Hebrew alphabet tutorial to an analysis of Taiwan-China military drills, illustrating the wide-ranging gateway into the network. This initial diversity sets the stage for a journey through thematic shifts and content divergence.

As the viewer explores further, the network skillfully shifts attention, presenting videos on topics like alternative medicine discussions and financial questions. Although these topics are not directly connected to the central theme of geopolitics, they showcase how the algorithm plays a role in expanding the range of the conversation. These topics, despite diverging from the initial narrative, hold high eigenvector centrality scores, indicating their significant influence within the network’s structure.

By the third depth, a thematic centralization occurs around specific sub-themes such as legal critiques and financial scrutiny, further demonstrating the algorithm’s capacity to narrow or expand the viewer’s focus. This stage reflects a complex balance between engaging with the core narrative and exploring peripheral topics.

At the final analyzed depth, the narrative journey expands dramatically, introducing a wide range of topics from U.S. political scandals to space telescope discoveries. This broadening into unrelated areas showcases the influencer nodes’ pivotal role in shaping content pathways, revealing the algorithm’s potential to simultaneously narrow and broaden viewer exposure to diverse content.

Table 7: Eigenvector centrality scores of the China-Uyghur conflict

The South China Sea dispute narrative within YouTube’s recommendation network presented in Table 8 and Fig. 17 offers a concise example of how influencer nodes shape content pathways, starting from a surprising entry point: a blizzard warning in San Diego with the highest eigenvector score at the first depth. This initial weather-related video, seemingly unrelated to geopolitical themes, underscores the algorithm’s capacity to introduce diverse topics, potentially affecting the subsequent recommendation chain.

As the narrative progresses to the second depth, the focus shifts to global issues and environmental concerns, reflecting a broader exploration of themes such as new social orders and nuclear disarmament. The significant presence of videos on environmental protection and global governance illustrates the network’s influence in steering the audience towards a complex understanding of the interplay between geopolitical conflicts and broader global challenges.

By the third depth, the narrative focuses on regional crises, highlighted by detailed coverage of weather-related disasters across California. This focus on immediate, tangible events suggests an algorithmic response to viewer interest, demonstrating the dynamic nature of content recommendations, which can swiftly pivot from global discussions back to localized concerns.

The final depth returns to strategic and speculative themes, with a notable focus on U.S. naval preparation against China, technological advancements, and global demographic issues. The appearance of a strategic military video as the influencer at this depth signifies a full-circle return to geopolitical considerations, although within a much broader context that includes scientific discovery and future technological impacts.

Table 8: Eigenvector centrality scores of the South China Sea dispute

For the Cheng Ho narrative, the YouTube recommendation network depicted in Table 9 and Fig. 18 demonstrates a focused exploration of Cheng Ho’s historical legacy, with less drift in content across the initial depths, highlighting the algorithm’s ability to maintain thematic consistency.

At the first depth, the narrative firmly centers around Cheng Ho’s contributions and legacy, featuring videos on his life, the Cheng Hoo Mosque, and his significance as a Muslim admiral in Semarang. The highest eigenvector score is assigned to a video directly related to Cheng Ho’s legacy, indicating a strong thematic entry point into the narrative. This depth is dedicated to immersing viewers in the historical and cultural impacts of Cheng Ho, emphasizing his historical significance and the spread of Islam in Southeast Asia.

Progressing to the second depth, the focus remains on Cheng Ho, with videos exploring his mosque, the Sam Poo Kong Temple, and his historical footprints in the archipelago. The narrative continues to delve deeper into Cheng Ho’s cultural and religious heritage, maintaining a coherent and focused exploration of his enduring legacy.

By the third depth, while there’s a slight broadening of themes, including a live stream from KOMPASTV, the content largely stays on topic. The sustained interest in Cheng Ho is evident with the reappearance of videos on his expeditions and the introduction of a film about him, suggesting a diversification within the bounds of the Cheng Ho narrative rather than a significant drift.

It is only in the final depth that the narrative begins to shift towards contemporary political and social issues, featuring videos on political commentary, suspicious financial transactions, and controversies surrounding the KPK Chairman. This late-stage drift indicates a departure from the historical and cultural focus of earlier depths, suggesting the algorithm’s inclination to eventually introduce current socio-political discussions, possibly in response to broader viewer engagement trends or the inherent dynamics of the recommendation algorithm.

In conclusion, our findings reveal the sophisticated mechanisms at play within YouTube’s recommendation system, where influencer nodes through their strategic position and thematic influence play a critical role in either maintaining narrative focus or facilitating thematic drift. This insight into the algorithm’s operation not only illuminates the challenges in navigating digital content landscapes but also underscores the importance of understanding influencer nodes’ impact on public discourse and perception.

Table 9: Eigenvector centrality scores of the Cheng Ho propaganda
Fig. 16: Network graphs of the China-Uyghur conflict

Fig. 17: Network graphs of the South China Sea dispute

Fig. 18: Network graphs of the Cheng Ho propaganda

5.6 Engagement bias

In accordance with the methodology outlined in Sect. 4.7, we examined engagement metrics such as views, likes, and comments across various levels of recommendation depth. Utilizing box plots enabled us to illustrate not only the average engagement but also the distribution’s range, median, variance, and other statistical indicators.

Our analysis, as detailed in the narratives highlighted in Figs. 19, 20, and 21, revealed a consistent pattern of engagement metrics ranking in the order of views, likes, and comments. Notably, there was a significant increase in engagement at the initial recommendation depth, followed by minimal increases or stable engagement at subsequent depths. Furthermore, we observed an expansion in the outlier boundaries, increasing by orders of magnitude towards the final depths.

These observations support our hypothesis that the recommendation algorithm favors videos with higher engagement metrics, thus enhancing the visibility of content that is already popular. This trend indicates an algorithmic bias towards popular videos, potentially at the cost of less viewed but equally relevant content.

Further statistical analysis, using the method discussed in Sect. 4.8.2, investigated the distribution of engagement across depths for potential inequality. The results, as presented in Table 10, show values nearing 1, indicating a pronounced inequality in the distribution of videos across depths based on likes, views, and comment counts. Additionally, with each successive depth, the Atkinson index increased, nearly reaching 1 by depth 4. This suggests that with each recommendation cycle, the distribution of video engagements becomes increasingly unequal, highlighting a growing disparity in content visibility based on engagement metrics.

Table 10: Atkinson values for engagement statistics
Fig. 19: China-Uyghur conflict box plot representation for engagement statistics

Fig. 20: South China Sea dispute box plot representation for engagement statistics

Fig. 21: Cheng Ho propaganda box plot representation for engagement statistics

6 Conclusion and discussion

In this study, we embarked on a comprehensive examination of YouTube’s recommendation algorithm to understand its impact on the narrative and diversity of content. Through a methodical approach that included gathering extensive data, analyzing emotions and morals conveyed in videos, assessing content toxicity, exploring topics, conducting network analysis, and scrutinizing engagement metrics, we aimed to uncover the subtle ways in which algorithmic suggestions might shape the evolution of narratives and influence wider discourse on a range of topics. Our investigation sought to peel back the layers of YouTube’s algorithmic ecosystem to reveal how it potentially directs narratives in specific directions, thereby affecting the broader conversation spectrum.

Our exploration of YouTube’s recommendation algorithm revealed a complex and varied terrain of influence. Key findings include:

  • Emotion Drift: We observed a significant evolution in the emotional tone of narratives, shifting from initially negative to increasingly neutral and positive emotions. This pattern was particularly pronounced in the discussions surrounding the China-Uyghur Conflict and the South China Sea Dispute. The significant shift from negative to neutral and positive emotions, as supported by very low p-values, indicates the algorithm’s strong influence in modifying the narrative tone. Conversely, the Cheng Ho narrative displayed remarkable stability, showing minimal alterations, suggesting a resilience against the algorithm’s tendency to modify emotional undertones.

  • Moral Values: Our examination of moral values within the narratives uncovered a complex interplay of ethical considerations. Specifically, we noted a trend towards the promotion of uplifting moral values in the narrative related to the China-Uyghur Conflict, implying an intentional curation by the algorithm. This shift towards more positive moral tones highlights the algorithm’s potential role in presenting sensitive topics more positively.

  • Content Toxicity: Our analysis revealed YouTube’s proficiency in refining its recommendations to sustain minimal toxicity levels. This effective moderation strategy ensures a healthier content ecosystem. However, this selective filtering may also introduce biases, prioritizing certain narratives or viewpoints and subtly influencing the diversity of discourse.

  • Topic Analysis: Significant thematic evolution was observed within the China-Uyghur Conflict narrative, aligning with our observations regarding emotional and moral shifts. This suggests a broader algorithmic effort to enrich and diversify the narrative landscape. In contrast, the Cheng Ho narrative exhibited remarkable thematic consistency, indicating the algorithm’s selective curation strategy to maintain the integrity of specific historical or cultural narratives.

  • Network Analysis: Our network analysis illuminated the pivotal role of specific videos that act as influential nodes within the recommendation ecosystem. These key influencer nodes, identifiable by high levels of engagement or central themes, significantly shape the trajectory of narrative flow, enhancing, altering, or inhibiting the dissemination of narratives across YouTube.

  • Engagement Metrics: The algorithm’s bias towards promoting content with higher user interaction was evident, highlighting its role in shaping the visibility and distribution of content based on viewer engagement levels. This preference influences the broader narrative discourse on the platform.

Collectively, our findings paint a detailed portrait of YouTube’s recommendation algorithm as a powerful influencer in the crafting and governance of narratives. By intricately weaving together elements of emotional resonance, moral context, content toxicity management, thematic direction, network dynamics, and engagement focus, the algorithm positions itself as a central figure in the construction of the digital content ecosystem. It subtly steers user interactions and shapes public conversation, underscoring its role as a critical determinant of the online narrative fabric. This complex coordination highlights the algorithm’s capacity to subtly navigate user experience and influence the broader discourse.

From our analysis, we draw the following generalized takeaways:

  • The recommendation algorithm significantly shifts the emotional tone of content, often from negative to more neutral and positive tones, particularly in sensitive geopolitical topics.

  • There is a notable trend towards the promotion of positive moral values in certain narratives, suggesting an algorithmic bias towards more uplifting content.

  • The platform’s moderation strategies effectively minimize content toxicity, though this selective filtering could introduce biases.

  • Thematic evolution within narratives indicates an effort by the algorithm to diversify content, though certain narratives are maintained with remarkable consistency.

  • Influential nodes within the recommendation network play a pivotal role in shaping narrative flow and user engagement.

  • Engagement metrics reveal a preference for content with higher user interaction, impacting content visibility and distribution.

The significance of our research reaches beyond mere academic interest, providing vital perspectives on the content moderation strategies and ethical frameworks within digital platforms. This work enriches the ongoing dialogue concerning the need for algorithmic openness, equity, and the broader societal consequences stemming from digital media practices. By illuminating the intricate ways in which recommendation algorithms affect content discoverability and audience interaction, our study advocates for a more profound exploration of the moral aspects surrounding algorithmic control.

In wrapping up, our investigation highlights the profound influence of recommendation algorithms in crafting online narratives and shaping user journeys. It encourages continued academic exploration into the ethical and social implications of algorithmic choices, calling for a future in which digital platforms not only seek to engage but also to uphold values of diversity, justice, and openness.

7 Ethical considerations and positionality

We recognize that our perspectives and backgrounds influence our approach and interpretation of the data. As researchers based in the United States, our views on geopolitical issues may be shaped by our sociopolitical context. While striving for objectivity, we acknowledge that complete neutrality is unattainable. Awareness of these potential biases helps us critically evaluate our findings and present a balanced analysis.

Ethical considerations were paramount throughout our study. We ensured that all data collected from YouTube were publicly accessible and did not involve any personally identifiable information, thus complying with ethical standards for data privacy. Although our study did not involve direct interaction with human subjects, we adhered to ethical guidelines for using publicly available data, ensuring transparency and respect for user-generated content.

We considered the potential impact of our findings on public discourse and the digital ecosystem. By highlighting biases in YouTube’s recommendation algorithms, our goal is to contribute to more equitable and transparent digital platforms, promoting constructive dialogue and positive change while being mindful of the ethical implications of our work.