1. Introduction
Community detection has been one of the primary applications of network science [
1]. In a network, a community is a group of nodes that are more likely to be connected if they share common characteristics. In the social context, individuals in a community tend to interact more with each other than with people outside the community. Community detection is used not only in social network analysis but also in other areas of complex network analysis, such as computer, information, and biological networks. Community detection methods have been applied to analyse functional groups in various areas of biology. For instance, functional groups in metabolic networks correspond to biochemical reaction cycles or pathways. In a protein–protein interaction network, communities indicate groups of proteins that exhibit similar functionality within a biological cell [
2,
3].
Numerous methods and algorithms have been proposed for community detection. Most of these methods rely on the notion that nodes within a community are more strongly connected than between nodes in other communities [
4,
5]. However, this definition is not specific, which results in many computational approaches being available. The definition of what constitutes a community is not well posed, making community detection a challenging task [
1].
Some of the earliest algorithms for dividing a network into communities include the minimum-cut method, the Girvan–Newman method, hierarchical clustering, and modularity maximisation. In the minimum-cut method, the network is divided into a predetermined number of groups selected to minimise the number of edges between groups. The Girvan–Newman algorithm identifies links in a network that connect communities based on a betweenness measure and removes them, leaving only the communities themselves [
6]. Hierarchical clustering methods group similar objects into clusters and build a cluster hierarchy. Modularity is often used in community detection methods to measure the strength of a network’s division into communities [
7,
8,
9]. By comparing the number of links within communities to what is expected by chance, modularity allows us to identify significant community structures [
2,
3].
When modelling network structures [
10], it is important to describe the network’s topology in terms of nodes and the connections between them. These connections can be represented as directed links between neighbouring nodes, and they have weights [
8,
11] that indicate their ability to convey information or influence. When modelling the spread of information or influence, we also have to consider network flow models. These models describe the rules by which information or influence propagates from one node to another throughout the network [
12,
13,
14]. In this study, we are particularly interested in non-conserved network processes, where the spread from one node continues to all adjacent nodes. In conserving flows, paths cannot be divided, whereas non-conserving flows allow propagation through multiple paths simultaneously. The relevant literature on network flow models and message passing on complex networks includes works such as [
13,
15].
We have developed a method to detect non-overlapping and overlapping communities in a network using a probability matrix and local maxima of a quality function (objective function). Our approach enables us to treat the exploration of network models and community detection models as distinct tasks. In our study, we model network flow using an influence-spreading model and define a community cohesion measure as the quality function. We consider communities as divisions of the network where the quality function has a local maximum.
We have designed a technique to identify groups of nodes that tend to belong to the same community frequently, based on a detailed network structure of nodes and links between them. Although some of these groups may be considered distinct communities according to our definition, not all of them are entirely self-contained communities. We use the term "building block" to describe such groups [
16], as a similar concept has been used in the literature [
17]. We define a community as a local maximum solution of the quality function, while building blocks are the union of detected communities and their intersections. Communities can be made up of one or more building blocks, and a building block can be a self-contained community or not. The definition of a community is quite general, and it also allows for hierarchical and overlapping communities [
10,
18,
19,
20,
21,
22]. Our approach, which involves searching for local maxima of the quality function, is well suited for identifying complex and overlapping community structures.
Community detection methods aim to accurately identify communities within a network while maintaining computational efficiency. While smaller networks pose no issues for computational efficiency, larger networks often require a trade-off between accuracy and efficiency. In our study, we define communities using a quality function which ensures accuracy by accepting only optimal solutions. However, the exhaustive optimisation of the quality function over the set of all graphs in a network can be computationally expensive [
8]. To ensure that our simulation algorithm effectively detects the most significant communities, traditional methods such as gradient descent or simulated annealing can be employed [
23,
24]. These methods guarantee a similar performance to the algorithm that uses modularity as a quality function. However, the focus of this study is on presenting the modelling principles rather than using traditional methods. Our objective is to offer a complete list of identified communities rather than just the most important ones. Although various approximate techniques [
25,
26,
27,
28] can enhance the efficiency of community detection algorithms, these methods are not ideal for searching overlapping communities, as we aim to identify accurate communities and their building blocks. However, approximate techniques can still be used to initiate the search algorithm. We offer guidelines for optimising the search for overlapping communities in
Appendix C.2. These hints apply to both random and optimised initialisations of the searching algorithm based on the probability matrix. The effectiveness of our influence-spreading model, which is used to construct the probability matrix, has already been discussed in [
29].
In this study, experiments are carried out on multiple small networks and two moderate-size networks that consist of 5000–10,000 nodes. The results show that the method used in the study is both useful and valid for different network structures and moderate network sizes. With the help of the pseudo-algorithm and ideas provided in
Appendix C.2, it is possible to apply our method to larger networks by making the algorithm more efficient. However, detecting overlapping communities in very large networks may require the development of new post-processing methods to present results, depending on the desired granularity of the analysis. This is because a significant number of almost similar overlapping communities can be detected in large networks. Nevertheless, our study shows that the results can be visualised easily for all network sizes. Furthermore, the method used in the study is particularly suitable for detailed network structures, as demonstrated with several small networks.
Several well-known networks are commonly used as test networks in the literature to compare the results of different community detection models. These include the Zachary’s Karate Club network, the American football games network, and the Dolphin social network. Also, the ground-truth communities are known for these networks. The Les Misérables social network is known for its information-theoretic solution. Additionally, we have tested our method on two moderate-size networks and the results have been visualised by a tool that generates the network layout independently without using any information about our community detection results. This allows for a more objective and visual verification of our results.
In
Section 2, we summarise the most commonly used community detection methods in the literature. In
Section 3, we present properties of our influence-spreading model, and in
Section 4, we explain our model for detecting communities and their building blocks. In
Section 5 and
Section 6, we present a pseudo-algorithm that can be used to detect overlapping communities and discuss the accuracy and efficiency of our method. In
Appendix C.2, we provide a more detailed explanation of the accuracy and efficiency of our algorithm and also suggest ways it could be further optimised. In
Section 7, we introduce our models with four small social networks. In addition, we demonstrate the use of the models with two larger social media networks. In
Section 8 and
Section 9, we discuss the properties of our models and present conclusions of this study. In
Appendix A, we provide a simple example of calculating circular effects in our influence-spreading model. In
Appendix B, we have an example with one of our small networks of how to use the model to divide a network into three communities. In
Appendix C, we discuss how to improve the efficiency of our simulation program for detecting overlapping community structures.
2. Related Work
Communities are groups of nodes in a network that are strongly connected or that share similar features or roles [
1]. Detecting communities are useful in studying the structure of complex systems such as social, information, and biological networks. The applications of these methods are numerous, including epidemic spreading, market segmentation, criminal detection, influence spreading, fake news detection, recommendation systems, and more. Various articles have reviewed and discussed community detection methods [
1,
4,
5,
28,
30,
31,
32,
33,
34,
35], including those that consider overlapping communities [
10,
19,
36,
37].
Community detection methods aim to enhance the accuracy of the outcomes while also creating computationally efficient algorithms for various applications. The accuracy of the results is evaluated based on the algorithm’s ability to identify ground-truth communities or recover known communities that are planted in artificial benchmark networks [
37,
38]. Many community detection algorithms aim to optimise a quality function, such as modularity or the one proposed in this study, which measures the quality of potential communities. The development of community detection algorithms is dependent on the desired community characteristics and the computational efficiency requirements. The methods proposed often prioritise the speed of calculation, which is why multi-step algorithms are frequently utilised [
39]. In such cases, the algorithm determines the connection to the community definition indirectly.
The exhaustive optimisation of a quality function over the set of all graphs of a network is computationally hard as the computational complexity of the problem is NP-complete [
8]. The optimisation of the quality function proposed in this study can be compared with the corresponding optimisation of the modularity measure. As mentioned in [
8], practical methods make use of approximate optimisation schemes such as greedy algorithms [
25], simulated annealing [
26,
27,
30,
40], spectral methods [
28], and genetic algorithms [
27]. Simulated annealing [
23] is a probabilistic technique that approximates the global optimum of a given function. It is preferred over exact algorithms like gradient descent [
23] or branch and bound [
24] when finding an approximate global optimum is more important than finding a precise local optimum within a fixed amount of time. However, using simulated annealing for large network problems is not practical because it requires a significant computational effort [
7,
8]. A recent study discussed in [
41] proposes an effective hybrid community detection method to enhance the quality of detected communities. It aims to increase the modularity of a community detected by any detection algorithm and network structure.
Challenges and opportunities of community detection methods have been discussed in [
36]. One approach is to choose a metric that measures the quality of a community and then try to maximise it. Another approach is statistical inference, where a generative model is fitted to the observed network data. One popular generative model is the stochastic block model, where nodes are grouped into blocks and edges are randomly placed between them based on their block assignments. However, this approach has a weakness in that it treats nodes within the same block as statistically independent, which may result in lower and higher-degree blocks rather than traditional community structures [
36]. According to the study in [
10], traditional quality measures are not enough to assess communities in a network. The study suggests that understanding the structural properties of how nodes are organised is crucial. The research in [
10] investigates four quality functions: overlapping normalised mutual information [
37,
42], the Omega index [
37], modularity [
7], and the statistical F1-score.
The Louvain method [
43,
44], InfoMap [
40], and spectral clustering [
28] are three popular algorithms used for community detection. The Louvain and InfoMap methods are particularly efficient for detecting communities in large complex networks. The Louvain method uses modularity [
7] as a quality function, and the algorithm optimises modularity by moving nodes between communities iteratively until no further improvements can be made. It has been shown that modularity suffers from a resolution limit, which means that in large networks, methods based on modularity would fail to detect small communities [
9,
44]. InfoMap, on the other hand, uses the so-called map equation to represent information diffusion on a map, where nodes are connected if they are close in the map’s representation [
40]. Communities are identified by minimising the entropy of the map. Spectral clustering is a method that uses the eigenvectors of the network’s graph Laplacian matrix, which is constructed from the network’s adjacency matrix, to identify different communities within the network [
2,
3].
Traditionally, community detection methods have assumed that nodes belong to disjoint communities, but real-world networks often exhibit overlapping community structures where nodes can participate in multiple groups. Methods and algorithms for detecting overlapping communities have been studied in [
18,
20,
21,
22,
45,
46,
47] to mention a few. An overlapping community detection method in complex networks based on information theory was presented in [
20]. A method to analyse and explore the main statistical features of the sets of overlapping communities was presented in [
18]. The first algorithm that finds simultaneously both overlapping communities and the hierarchical structure in complex networks was introduced in [
21]. Later, overlapping and hierarchical community detection for weighted networks was studied, for example, in [
22].
Recently, different approaches and strategies to detect and analyse overlapping communities have been proposed. In [
45], the algorithm identifies similar seed communities and calculates the similarity between the neighbouring nodes and the community. Nodes that meet the similarity threshold are selected, and an adaptive optimisation function is used to expand the community. Finally, free nodes are divided into communities. The study in [
46] presents two community detection algorithms that use extended modularity and cosine functions as quality functions.
One area of research involves fuzzy overlapping community detection [
48,
49]. In this method, each node is assigned to a community with a belonging factor that reflects the strength of its association. This fuzzy assignment can be turned into a crisp overlapping assignment by setting a belonging threshold. In a crisp overlapping assignment, each node is associated with each community with a binary belonging factor. This approach enables the identification of overlapping nodes at different scales.
Researchers have also developed tools for higher-order network analysis. Such higher-order interactions [
50] have been observed in a variety of systems, including collaboration networks, ecosystems, social networks, and nervous systems. The authors in [
51] found that the existence of higher-order interactions obscured the community structure in the network, and they suggested removing higher-order interactions to improve the accuracy of community detection.
The topic of opinion dynamics has received significant attention in the literature, and several studies have been conducted in [
12,
52,
53,
54]. In [
54], complex contagion is defined as a situation where an individual needs several exposures before adopting an innovation or behaviour change. Unlike a disease that can spread after just one contact (simple contagion) with an infected neighbour, innovation may not spread as easily [
54].
In the following section, we briefly describe our influence-spreading model, which generates an influence-spreading matrix capturing the network structure and spreading probabilities between all node pairs on the network. The model was introduced in previous research [
16,
29]. An alternative to the influence-spreading model is the network connectivity model [
16,
29], which describes the static connectivity in a network structure. In this case, the results are presented in the same matrix form, but we call the matrix a probability matrix. In
Section 4, we introduce our community detection model and the corresponding pseudo-algorithm for detecting overlapping communities. The algorithm is based on a quality function that is a function of the influence-spreading matrix.
Because the influence-spreading matrix includes information on both the structure of the network and the influence-spreading process, the quality function also has similar properties. Thus, our proposed community detection method that incorporates information about the network structure into the quality function can be a solution to the problem mentioned in [
10].
3. Influence-Spreading Model
Social influence can be represented by a probability matrix describing how people in a social network interact. This matrix can be created using different methods such as influence-spreading and connectivity models. To describe influence spreading, we require a network flow model to explain how influence spreads between nodes through paths in the network structure. The primary objective of our influence-spreading model is to calculate the probabilities of the influence between all individual nodes in the network by utilising the given probabilities between neighbouring nodes in the network structure [
16,
29,
55].
Our model uses as the initial information the topological structure of the network and directed link weights for all edges between adjacent nodes in the network. The topological structure is expressed as a list of directed links between the nodes in the network. Link weights are expressed as probability values in the range from 0 to 1. Similarly to link weights, nodes can have node weights. Node and link weights indicate the ability of these network elements to disseminate the influence in the spreading process.
We use probabilistic methods because they ensure the unambiguous interpretation of the influence-spreading matrix itself and the possibility of defining further interpretative quantities of the model. An element of the influence-spreading matrix is the probability of influence from one node to another node through all alternative paths in the network. The link weight between adjacent nodes i and j is interpreted as the probability of transmitting a piece of information or exerting social influence from node i to node j. Notice that link weights are directed and can all differ between neighbouring nodes in the network structure.
The model of network flow determines how interactions within the network structure influence the spreading process. The spreading mechanism can depend on various factors, such as the network structure, link and node properties, and the nodes’ states. In this study, the node state refers to the probability of the node being influenced already. In two extreme cases of our influence-spreading model, the full breakthrough influence and network connectivity models, the node states do not affect the spreading process. However, we do consider the influence spreading through different alternative paths to a target node, following the rules of probability theory to avoid double-counting. Unique in our influence-spreading model is the method of combining the alternative paths coming from different routes from source nodes to target nodes. In the case of full breakthrough effects, we can perform this calculation analytically.
Different models can have restrictions for alternative paths. There are two extreme cases: one model allows all possible paths, including circular and recurrent paths, while the other model only allows self-avoiding paths, where nodes can only appear once. Our influence-spreading model is an example of the first case. Self-avoiding paths are commonly used for modelling virus epidemics, where infected and recovered individuals achieve full immunity to the disease. These kinds of models are also useful for describing the transmission of well-defined information on social networks and other related applications.
Spreading processes are based on the principle that spreading from a node is only possible if the influence has already reached that node. This leads to an attenuating propagation because the node and link weights are typically less than one, and they are multiplied in the probabilistic calculations. Unlike typical Markov chain or random walk models in the literature, our non-conserved model allows a spread to all possible adjacent nodes in the network within the limits of the node and link weights.
Influence spreading and network connectivity models are related, as the connectivity of nodes in a network can be considered as a limiting case of influence spreading without any recurrent or circular effects. We discussed this in our earlier paper [
56]. In the following, we categorise our influence-spreading model with full breakthrough effects as a complex contagion and the network connectivity model as a simple contagion model. These definitions are generalisations of the commonly used concepts in the literature [
57]. Both models consider all possible paths of the network structure limited by the network size or maximum path length parameter
.
A key element of our influence-spreading model is combining the effects of different paths in the network structure [
29]. When influence propagates through multiple paths to a node, we utilise a formula from probability theory that accounts for mutually non-excluding events. This method ensures that influence from multiple neighbouring nodes is calculated only once for a given source node, rather than simply being added up. By doing so, we avoid the possibility of non-physical probabilities greater than one.
We have also published an efficient algorithm [
29] for computing the spreading probabilities, or the influence-spreading matrix elements, in the case of full breakthrough effects [
16,
55]. For large networks, the computation time can be limited by setting a value to the maximum path length
. We still use simulation methods for the network connectivity or spreading model with self-avoiding paths [
58].
Full breakthrough influence does not depend on the node states and does not affect the spreading process. Our current influence-spreading model only accounts for non-influenced and influenced states, where nodes that have already been influenced and those that have forgotten their opinion are considered as one state.
4. Community Detection Model
Our objective in this study is to introduce the method of separating the modelling of network structure or network flow from the modelling of community detection. We propose a community detection approach that is based on a probability matrix or an influence-spreading matrix and local maxima of a quality function.
We denote the influence-spreading matrix by
C and its elements by
where
s is a source node, and
t is a target node in the network structure. The number of nodes in the network is
N. The elements in the matrix describe directed influence probabilities between any two nodes in the network. This definition differs from other matrices used in the literature, such as the adjacency matrix and the Markov matrix [
3].
Centrality measures indicate a node’s importance in the network. These metrics help study network phenomena like opinion spreading and group formation. We define two variants of centrality measures based on the influence-spreading matrix. Out-centrality measures the influence one node exerts on other nodes in the network. In-centrality measures the influence other nodes in the network have on one node. These metrics denote the mean number of influenced or influencing nodes, or probabilities, depending on the normalisation convention. We define the out-centrality of node
s in network
G as
and the in-centrality of node
t as
In the literature, several other centrality measures [
2,
3] have been proposed but usually, they are not based on a consistent model where probabilistic or similar interpretations are possible.
In our model, the method for detecting communities is based on finding local maximum values of the quality function in Equation (
3) computed from the influence-spreading matrix elements
:
If there is a local maximum for a subset of nodes V, we infer that V is a community in network G. Our approach is an application of the general principle in applied mathematics where a local optimum of an optimisation problem is an optimal solution within a neighbouring set of candidate solutions.
Equation (
3) measures the division’s strength into two factions
V and
of the original network
G. The higher the value of
q, the better the sum of the cohesion of the two communities. The value of
q can be used as a quantitative measure for comparing the quality of different divisions of the network structure.
One of the key features of this model is that it can have multiple local maxima with varying strengths in the network structure. This allows for the existence of several overlapping community structures. The quality function in Equation (
3) can be used to measure the quality of each division. For a division, the quality function is a sum of the terms on both factions of the network. The advantage of the method based on searching local maxima of the quality function is that it does not fix the number of nodes in the communities, unlike some other community detection or network partition methods in the literature where the number of communities is predetermined [
3].
We can also express Formula (
3) as
We can identify that the first two terms can be expressed with the help of out-centrality measures as
Equation (
5) defines a new quantity
as a measure of the influence of nodes in
V and
on all nodes in network
G, including
V and
. Similarly, Equation (
5) can be expressed with in-centrality measures as
The centrality measures in (
1) and (
2) and the community influence measure in Equations (
3) and (
4) are closely related, as they are all based on the same influence-spreading matrix
C, ensuring their consistent definition. The sum of all matrix elements of the matrix
C is a constant, and we can define
as a measure of cohesion of the entire network
G. From Equation (
4), we see that maximising
q in Equation (
3) is equivalent to minimising the sum of the last two terms in Equation (
4). We denote this quantity as
in the following formula:
Our approach involves maximising interactions within the two factions
V and
of a network while minimising interactions across the two factions. As a result, the definition of the community quality function, denoted by
q in Equation (
3), does not have cross terms. This community quality function can be compared to the commonly used modularity measure [
3,
8], where networks with high modularity have dense connections between nodes within modules but sparse connections between nodes in different modules.
The quality function that is defined in Equation (
3) or (
4) is useful when comparing divisions within a network. However, when comparing communities in networks of different sizes, it is more appropriate to normalise the measure and take into account that the sums in Equation (
3) include different numbers of links and nodes depending on the sizes of the two factions of a division. Equations (
3) and (
4) can be normalised by dividing the expressions by the value of
, as shown in Equation (
8). The value of
is calculated using the formula:
where
represents the number of nodes in one of the two factions of the division. The number of nodes in the other faction is
. Normalised quality function values can be calculated for each division of a network as
Diagonal elements do not affect the community detection results because we calculate the influence-spreading matrix elements by assuming that the spreading process is initiated from a source node with probability one [
29]. We have set the values of the diagonal elements of matrix
C to zero. In Equation (
8), the number of terms corresponding to the first and second sums in Equation (
3) are
and
, respectively. Source nodes can take part in circular and recurrent spreading events, as long as these events are permitted within the network flow model. However, target nodes are not involved in such events [
29]. This characteristic distinguishes our model from other models, like Markov models, that are commonly found in the literature.
Normalisation has little impact on whether a community exists within a network, but it can influence the ranking of communities. If the network is divided into two communities of unequal sizes, normalisation would assign a lower weight to that division. The decision to use normalised values or not depends on the needs of the application. In this study, we present the findings as un-normalised values, except in
Table 1, where we provide normalised values in column
). It is interesting to note that the community rankings for Zachary’s Karate Club have changed in a way that follows the majority of other community detection models in the literature [
3,
8,
20]. Specifically, the division into communities with
and
nodes depicted in
Figure 1b has the first ranking.
We introduce a broader concept of a building block that comprises all the divisions of a network identified as local maxima of Equation (
3), along with their intersections. It is worth noting that while these intersections may not be local maxima of Equation (
3), many of them are. All communities qualify as building blocks, but not all the building blocks meet the criteria for being communities as defined by our method.
Figure 1 shows an example of the two strongest divisions and building blocks of communities in Zachary’s Karate Club network [
59] detected by our method. In
Section 7.1, we explain by example how these building blocks are detected.
Notice that the sums in Equation (
3) do not include interactions between the two factions of the network. This is a simplification because in real life, communication between different communities continues, albeit the relations inside each community usually are closer and more active. On the other hand, we can assume that the cross-terms cancel out in the process of community formation, which justifies leaving the cross-terms out.
It is worth noting that the model has a specific feature whereby paths between members of the same community via the other community are included in the calculation of Equation (
3). While this can strengthen cohesion within a community, the effects of directly connecting the two partitions are far more significant. Strong or dense connections between the two factions can prevent a local maximum or a subdivision into two communities.
In the literature, communities are often detected based on modularity [
3,
6,
7], a popular quality function in community detection methods. However, modularity is not based on probabilities and, therefore, it has no direct probabilistic or physical interpretation. Modularity has the same assumption as we have in our model that the cross-terms have no effects on community formation.
We define optimal solutions to the community detection problem as divisions into two factions of the original network where
q in Equation (
3) has a local maximum. This means that moving a node from one community to another community decreases the value of
q. We justify our model by the well-known principle of equilibrium, a condition or state in which driving forces are balanced. The probabilistic interpretation of the model enables us to define the quality function as an equilibrium state of the network.
Although the value of the quality function in Equation (
3) is important, there is another aspect related to the likelihood of community formation. To address this, we introduce a second measure that indicates the probability of community formation starting from a random initial setting of nodes in the network. Our statistics are derived from the computational results that we obtain by employing a basic simulation method to search for the local maxima of the quality function. We refer to this measure as a statistical measure of community formation and denote it by
f and define it for a detected community
A as
We define a random setting as a situation where nodes are assigned randomly to two sets that represent the two factions of the division in a network. It is worth noting that in simulations, there is a possibility that all the nodes of the network end up in one group and the second group is empty. Equation (
3) always gives this division the highest value
q. However, we do not count the entire network
G as a separate community. Due to this effect, the statistical community formation measure
f does not add up to 100 per cent. The missing portion can be interpreted as the probability of no communities arising.
It is important to note that the quality function defined in Equation (
3) does not rely on the statistical measure suggested in Equation (
10). As a result, the latter should be viewed as supplementary information that emphasises the difficulties involved in defining a quality function that can accurately detect overlapping communities. The measure
f in Equation (
10) is based on two randomly initiated sets of nodes. However, theoretically, the order of processing nodes in Algorithm 1 affects the value of
f. But it is worth mentioning that the order of processing does not affect the values of
q in Equation (
3) for detected communities because the algorithm only accepts optimal solutions. In this respect, there is no stability issue because optimal solutions of Equation (
3) are well defined.
When exploring network structures with unknown link weights, we can still analyse the network structure by using the link weight as a parameter. In this case, the same link weight value is used for all links in the network. The results of community detection can be influenced by varying the link weight parameter. If the link weight is high, then no communities will be detected. However, if the link weight is lowered, the first division into two communities will appear. Furthermore, if the link weight is further decreased, the number of different divisions into communities will increase. The numerical value of the threshold link weight is not required in the community detection algorithm. It is a theoretical quantity that has a natural interpretation in the model. The value determines a critical point above which the cohesion of the entire network is so high that there is only one community.
Algorithm 1 Detecting overlapping communities. |
1: ▷ Searches communities based on probability matrix M by maximising q (Equation (3)). |
2: procedure Division(control parameters, M) |
3: ▷ Control parameters include stopping rules for the calculation and optimisation rules |
to initialise the simulation and control the order of processing nodes. Maximum number |
of detected communities A and maximum number of iterations B are provided as control |
parameters. Matrix M elements describe directed influence probabilities between all node |
pairs. Matrix M has elements. |
4: |
5: while and no other stopping criteria are fulfilled do |
6: |
7: |
8: while equals and do |
9: or for ▷ Initiate V, random or optimised |
10: Calculate q |
11: for I = 1,N do |
12: ▷ The order of processing nodes can be optimised |
13: Calculate q |
14: if the value of q is higher than the previous q value then |
15: cycle ▷ Skip the remaining statements inside the loop |
16: else |
17: ▷ Restore |
18: |
19: end if |
20: end for |
21: ▷ The number of iterations is increased by one |
22: end while |
23: if the same division has not been detected in earlier iterations then |
24: ▷ The number of detected divisions is increased by one |
25: ▷ Save vector V to list |
26: ▷ Save q to vector |
27: end if |
28: end while |
29: for J = 1,a do |
30: Print to a file |
31: end for |
32: ▷ Now, the list of divisions and their quality function values are saved in a file. |
33: end procedure |
In practice, communities with very low values of the quality function or low formation probability are not very important. However, in general, the increasing number of possible communities indicates a low cohesion and fragility of the community. Low cohesion is a natural consequence of weak ties in the community structure. The cohesion value of a set of nodes
V can be calculated as a function of the link weight by taking the sum of influence matrix elements:
This measure can be calculated for any subset of nodes
in the network. Now, we can express the quality function in Equations (
3) and (
4) yet in another form as a sum of the cohesion values of node sets
V and
Our model can only identify two separate factions of the original network G at a time. Although this may appear to be a constraint, it allows us to examine the intersections and boundaries between these divisions. By calculating the intersections of different divisions, we can more effectively detect new communities in the network structure using computational methods. Some of these intersections correspond to communities, while others do not. If the set of nodes does not form a community, the statistical community measure is zero because it is not a local maximum of the quality function. As mentioned above, we have defined the concept of network building blocks as sets of nodes that are either communities or their intersections. Intersections are potential candidates for communities as members of two or more communities.
The statistical measures can indicate how long it takes to carry out a simulation. However, in some applications, it may be sufficient to find solutions in less time instead of searching for communities that may have a higher strength based on the value of the quality function. Nonetheless, structures with high local cohesion can be of interest. Local communities can emerge when nodes in the network structure locally have many links between each other or link weights are relatively high. Communities can be formed around specific interests, often consisting of a small number of members. In this model, such a scenario would involve the network being asymmetrically divided into a large and a small group, resulting in a local maximum of the quality function in Equation (
3).
Communities have a crucial aspect of stability. If we remove a node or a set of nodes from a community, it can cause the entire community to dissolve, leaving behind fewer nodes that do not form an optimal solution. We can measure the impact of removing a node or nodes by calculating the difference in the value of the quality function before and after the change. This information can be used to combine overlapping communities and reduce the number of communities. However, the combined effect of removing a set of nodes is different from the sum of the individual node-removing effects. In
Section 2, we discussed fuzzy overlapping community detection, where a belonging factor indicates the strength of a node’s association with a community. This factor can be defined for one node but may not be practical if we want to maintain an optimal solution for the quality function while defining the community.
5. Algorithm for Detecting Communities
Our method involves separating the modelling process into two parts: network modelling and analysing the community structure. Algorithm 1 concerns the analysis part of the problem while the network modelling part was presented in our earlier work [
16,
55,
58]. In the past, community detection methods combined these two steps into a single model [
3,
32,
36]. While this approach may be beneficial for optimising traditional community detection algorithms, it can also make the two models less distinct and complicate the use of a well-defined quality function for the community detection problem.
In our approach, we can use the probability matrix for optimising the search algorithm. This is because the matrix contains the needed information about the influence strength between all node pairs in the network structure. We can test how moving a node or a set of nodes affects the quality of a division and use that information to generate a more optimal division of the network. Currently, our algorithm moves one node at a time between the two factions of the division. When no move improves the value of the quality function in Equation (
3), we have detected an optimal division. The order in which we process nodes does not affect the value of the quality function, but it can affect the computing time and the order of detecting communities.
The detection of building blocks begins by searching the network structure’s optimal divisions into two factions (communities). The divisions are then sorted based on a quality function value, and the building blocks are constructed. If we consider all detected divisions, the sorting does not change the result as per our definition of a building block. However, we have the option to focus only on the most significant communities and building blocks based on our requirements in the application. Particularly, when there are numerous divisions, we want to consider only the most important cases where the quality function has a high value or focus on large communities. By default, we use the
q value in Equation (
3) as the quality function. Note that the cohesion value of Equation (
11) can also be calculated separately for the two factions of each division.
The procedure for dividing a network structure into two factions is presented in Algorithm 1. The algorithm takes as input a probability matrix, or an influence-spreading matrix, denoted by
M. The matrix contains
elements, where
N is the number of nodes in the network, and each element represents the probability or strength of the directed influence from one node to another. There are several methods to generate the probability matrix, but we do not discuss them in this study as they depend on the particular application. Examples of different spreading processes or network connectivity models whose results can be expressed in the matrix form can be found in references [
29,
58].
Apart from the probability matrix, several variables are required to control the computation. These variables determine when to stop, how to optimise the initialisation of the simulation, and how to speed up the search process for optimal solutions. In this particular study, the optimal solutions correspond to the maximal values of the quality function in Equation (
3). Since we are dividing the nodes of the network into two factions, we present the solutions in the form of an
N-dimensional vector
V with 1’s and 0’s indicating the faction of each node.
Our model defines two communities for each division, which may overlap with other communities. By analysing these overlaps, we can identify the building blocks of communities. We define building blocks as the communities themselves and all possible intersections of the detected communities.
In some cases, a building block can be classified as a community if the quality function, denoted as
q in Equation (
3), attains its maximum for that specific division. This implies that moving any node would lead to a reduction in the value of
q. Moreover, an optimal combination can be formed by two or more building blocks, even if they are not directly linked. To maintain consistency, we also consider these solutions as communities. This is a convention since these solutions can be removed from the set of detected optimal divisions depending on our definition of a community. It is worth noting that in some studies [
44], internally disconnected communities are considered problematic.
The final step of generating the building blocks from the list of network divisions can be performed in multiple ways. One approach is to use a tool like the Gephi analysis tool or an Excel spreadsheet application. In Algorithm 1, the output of line 30 can be formatted as a comma-separated values (CSV) file, which can be imported into the Gephi analysis tool. Developing software with a programming language may be a suitable alternative if specific post-processing is needed for analysing the results. To provide a concrete example of how the Gephi tool is used, we will demonstrate the building blocks structure of the football games network in
Section 7.4.
6. Accuracy and Efficiency of the Method
When selecting a community detection method, it is important to consider its efficiency for practical applications. However, it is also crucial to choose a method that provides theoretically accurate solutions for specific research problems. In this study, we prioritise focusing on the theoretical aspects of analysing community structure. Our work is based on a framework that includes detail-level network structure, overlapping and hierarchic communities, and a quality function based on a probability matrix. By modelling the detail-level network and analysing community structures, we use the probability matrix to establish the connection between the two models, which helps us to streamline the modelling process and keep it under control.
Community detection methods have traditionally focused on identifying non-overlapping communities [
4]. Some benchmarking methods still focus on detecting non-overlapping communities even though some nodes may belong to multiple communities. While some benchmarking methods have been proposed to measure and compare overlapping community detection methods [
37,
38], this field of study is still evolving. There are many issues to address in this field of study. One of the main challenges is defining what to measure and how to compare results accurately. The current benchmarks are not designed to compare the results of overlapping community detection with communities planted in artificial benchmark networks by a given quality function. Many publicly available benchmarks use the normalised mutual information measure, which is biased [
42]. Hence, we demonstrated the accuracy of our algorithm and quality function using commonly used network structures with ground-truth communities and visualisations that we compared with independently generated Gephi layouts. Our findings are in good agreement with the ground truth, literature, and visualisations.
Normalised mutual information (NMI) is a commonly used metric to evaluate the accuracy of solutions to the community detection problem. However, researchers have found that NMI is often biased and can lead to inaccurate conclusions about the best algorithm for the problem [
42]. Extensive numerical tests on popular algorithms have shown that the biases in the traditional mutual information significantly affect the results [
42]. Although modifications to the NMI metric have been proposed to address this issue, it is still debatable whether NMI is the best measure that can be used to assess the quality of solutions to the community detection problem.
Our approach involves detecting and comparing overlapping communities using a quality function based on the influence-spreading matrix. This matrix is generated from a detailed network model, and it considers variables like node and link weights with a probabilistic interpretation [
16,
55,
58]. The probability matrix’s elements represent the influence strength between nodes in the network, and we define derived quantities based on this matrix. Consequently, the quality function in Equation (
11) serves as a method to measure the strength of communities or other sets of nodes within the network, including the building blocks determined by the intersection of detected communities.
In
Section 4, we introduced another statistical measure, denoted as
f, which quantified the probability of community formation. Although this measure may require significant computation time for large networks, it serves as a theoretical example of an alternative measure to evaluate the quality of detected communities. Our method for detecting communities within a network does not rely on the statistical measure, as this measure is a byproduct of the search process. The quality function
q in Equation (
3), the normalised quality function
Q in Equation (
7), and the statistical measure
f can provide different rankings for the same community (refer to
Table 1 in
Section 4 for an example). The normalised measure
Q is calculated in proportion to the number of possible connections in the communities. We conclude that different quality functions are needed for different applications and requirements. This is also related to the fact that there is currently no widely accepted definition of a community [
4].
Both versions of the quality functions
q in Equation (
3) and
Q in Equation (
7) detect the same communities, but their rankings may be different. Note that
f is zero for building blocks that are not communities or optimal solutions of Equation (
3). However, there is an application of community formation in which we can calculate the value of cohesion even for building blocks that are not alone detected as communities. Strengthening ties in such weakly connected building blocks can lead to community formation. On the other hand, targeting information activities to nodes that belong to interceptions of detected communities can lead to the formation of a larger community. In this way, building blocks are potential groups that can evolve into a community alone or with other building blocks or communities in the network.
Our objective is to identify the divisions within a network structure. The number and comprehensiveness of the divisions required depend on the specific application. Building blocks are determined based on the list of divisions, as outlined above. The optimisation methods employed will differ depending on whether we are searching for a couple of important divisions or a more comprehensive list. In
Appendix C.2, we discuss how Algorithm 1 can be improved for the detection of a large number of divisions within a network structure.
8. Discussion
We developed a technique to identify communities and their substructures in a network structure. To do this, we used a probability matrix that measured the strength of the influence between all pairs of nodes in the network. With the help of that matrix, we defined an objective function, which served as a quality measure for communities in the network. Our approach is highly adaptable and can be used with a wide range of network models and quality functions to suit various applications.
Various network flow and network connectivity models can be used to generate the probability matrix. If the goal is to study influence spreading, the matrix is called an influence-spreading matrix [
29]. If we are interested in network connectivity, the matrix is called a connectivity matrix. We use a detailed topological model of the network structure in both cases, and the links between nodes are modelled bidirectionally. Probabilistic modelling is used to define all measures with physically interpretative quantities, including centrality measures, a cohesion measure, and quality functions for community detection. These measures can be expressed as a function of time or can be calculated for a stationary state where time approaches infinity
[
55].
In this study, we used equal link weights to provide a clear understanding of the method’s general concepts and characteristics, such as overlapping communities, quality functions, building blocks, and cohesion. We also developed detailed models of network connectivity and influence spreading with directed link weights, which describe node-level spreading in network structures. These models [
16,
55,
58] can be used to generate probability matrices that serve as inputs for the community detection method described in this study.
In our influence-spreading model, we calculated the quality function value by averaging the out-centrality and in-centrality values for all nodes within communities. This assumption is reasonable if the strength of social ties between individuals on average are symmetric. Our quality function does not consider the direct influence between different communities, which is also a common feature in other community detection methods [
4]. Different applications may require the use of different forms of quality functions. For instance, in the case of network connectivity, a similar form to ours can be used, or a form based on the connectivity between all pairs of nodes within the candidate communities. In addition, we showed that a normalisation based on the size of communities could give communities different rankings, although the normalisation did not affect which communities were detected. That is, the existence of communities and the classification of their characteristics can be considered separately.
We illustrated our method with the Zachary’s Karate Club social network which has been used as a test network in many other studies of community detection. Another example network is the Les Misérables network, because the same network has been analysed by information-theoretic methods [
17]. Therefore, these results can be compared with our results. The results of the information-theoretic model were almost similar to those provided by our model. In addition, we presented results for the dolphin social network and the American football games network. These results can be compared with the corresponding ground-truth communities and many results from other studies in the literature. Our model produced results similar to those obtained in these studies.
We tested two larger networks and found that local maxima of the quality function could be identified in the network structures, including substructures. These networks represented two different network structures of egocentric social circles and page link connections on Facebook. To visualise these lower-scale structures of the networks, we used the Gephi analysis tool [
64]. Layouts of the networks were generated independently by the analysis tool, and information about detected community structures from our model was added to these graphs. In this way, we showed the usefulness and validity of our model in cases where no ground-truth communities are known.
The identified local maxima indicate groups of nodes that may consist of multiple sub-communities with lower link weights or a lower cohesion of the network. However, it should be noted that not all of these groups are communities, despite what the Gephi visualisations may suggest. Whether a group of nodes is detected as a separate community depends on whether or not it is a local maximum of the quality function. This is a crucial aspect of our model.
Our model can be used to detect the main divisions into communities as a special case, when we increase the strength of social ties or link weights. This leads to only one or few solutions for the optimal community quality function. However, if the link weights are known and interpreted as probabilities, the model can detect several overlapping communities. This is a common scenario as people usually participate in numerous social activities related to work and hobbies.
We acknowledge that there are some limitations and deficiencies in the current state of our work. While we provide some preliminary tests in
Appendix C.2 and discuss some optimisation methods, the software code is not yet fully optimised. We believe that most of the optimisation can be achieved by using functions of the programming language more efficiently, which is not directly related to the method itself. In the future, producing a software product is a possible task. To pinpoint obvious optimisation targets, we provided a pseudoalgorithm in Algorithm 1. However, our method and algorithm can be compared with the widely used method of maximising modularity in other community detection methods. As the optimisation of our proposed quality function and the modularity measure [
8] are similar, we conclude that the computational complexity of both methods is comparable, and similar optimisation methods can be used to improve accuracy and efficiency in both cases.
One limitation of the model is that we do not know the exact link weight value in advance, especially if we are not using a probabilistic model or empirical observations to evaluate it. One way to determine suitable link weight values is to experiment with different values. We can start with a high value and gradually decrease it until communities begin to emerge; there is no need to finish the computation. Depending on the application, we can then analyse the network with a few lower link values, as we did in this study.
The structure of the quality function is a topic of research in itself. There are different ways to consider cross-terms between two or more factions of a division, particularly in network spreading models where they may be ignored or accounted for. We justified our proposed quality function by showing that maximising the quality function was equivalent to minimising the cross-term effects in the network. This is analogous to the definition of modularity that is used in other models. In a more detailed model than the one presented in this study, some effects between communities could be included in the quality function with a lower weight, as in practice, communities are not entirely isolated and tend to interact with each other.
9. Conclusions
We proposed a new method for detecting communities and their building blocks in a network structure. This method relied on a probability matrix and local maxima of a quality function. We defined communities as optimal solutions of the quality function and building blocks as the communities themselves and all possible intersections of the detected communities. Communities were formed by combining these building blocks. The probability matrix showed the strength of influence between all pairs of nodes in the network, while the quality function measured the quality of detected communities. Our model was designed to be flexible and can be implemented with various network models and quality functions that suit different applications.
Our proposed method enables the division of the problem of identifying communities within network structures into separate, independent problems. This approach makes it possible to analyse and compare different network models while keeping the community detection model consistent. Vice versa, different community detection methods can be compared while keeping the same network model.
In our study, we found that our influence-spreading model, which included full breakthrough effects and network community connectivity models, resulted in much the same community structures and building blocks. This is primarily because our model is based on the local maxima of a quality function, which means that communities usually emerge for low link values. When link values are low, the specific network flow model has only a minor effect on spreading probabilities, except in the neighbourhood of source nodes.
We defined a measure of cohesion that was a generalisation of the quality function. This measure can analyse network structures beyond detecting communities in networks, offering a broader perspective. The measure of cohesion applies to any set of nodes, including those with high link weights. Communities and building blocks emerge when there is a decrease in cohesion in the network or a specific part of it. In real social networks, this decrease in cohesion is caused by weaker ties in social relationships, which have a clear and direct interpretation.
As has been noted in the literature, the search for communities on networks is generally not a well-defined problem. In this study, we discussed the possibility of developing separate models for network structure, network flow, and community detection. Our method of using a probability matrix was proposed as a technique to combine network models and community detection models and help specify the community detection problem more consistently. In addition, we discussed the role of quality functions and their applications in community detection methods. Quality functions can be used to specify accurate community detection methods, but the most commonly used modularity function fails to detect small communities in large networks, and many of the proposed community detection algorithms are not based on quality functions or use approximate techniques to enhance computing efficiency. In this study, we proposed a novel quality function based on the probability matrix that can be used to analyse non-overlapping and overlapping community structures. The method was demonstrated with several social networks to show its usefulness and validity.
Although the problem of detecting communities has been extensively studied in numerous scientific articles, we believe that our approach adds significant value to the ongoing discussions. Our proposed method is a novel and consistent way to detect and quantify overlapping communities and their building blocks. Our work emphasises the complexity of the community detection problem, which may have been overlooked by traditional network analysis methods that rely on simple phenomenological models, use only local information of a network, or try to detect only one or a few communities. We believe that our approach offers a possible solution to these issues.