
In this paper we investigate the potential of Subset Multiple Correspondence Analysis (s-MCA), a variant of MCA, to visually explore two-mode networks. We discuss how s-MCA can be useful to focus the analysis on interesting subsets of events in an affiliation network while preserving the properties of the analysis of the complete network. This unique characteristic of the method is also particularly relevant to address the problem of missing data, where it can be used to partial out their influence and reveal the more substantive relational patterns. Similar to ordinary MCA, s-MCA can also alleviate the problem of overcrowded visualizations and can effectively identify associations between observed relational patterns and exogenous variables. All of these properties are illustrated on a student course-taking affiliation network.

Social Network Analysis (SNA; see [

There are primarily two main approaches to the analysis of affiliation networks [

Using either of the two approaches, a fundamental issue is the direct visualization of the affiliation structure (e.g., [

Although it is common to apply CA/MCA to the complete data set, there are cases when the analysis of a subset of the original data may be more appropriate or desirable. For instance, when analyzing a large number of events (columns), the interpretation may be obscured by the large number of points or vectors in the map, so that interpretation and conclusions are limited to broad generalities. The basic problem is that CA/MCA visualizes many different types of relationships simultaneously, so that the factorial maps may not be easily conducive to visualizing those relationships of particular interest to the researcher [

A further analysis of interest would be in the case when some edges (participation of actors in specific events) are missing from the dataset, e.g. due to survey non-response. It is generally accepted that the analysis of social networks is hampered by missing values, because the visualization of the network structure is especially sensitive to missing data. Recent studies have shown the negative effects of missing actors and edges on the structural properties of social networks [

All of the aforementioned issues can be addressed through the use of Subset CA (s-CA), a simple variant of CA. The idea in s-CA, as the name suggests, is to visualize a subset of the rows or a subset of the columns (or both) in subspaces of the same full space as the original complete set [

The paper is organized as follows. In Section 2 we introduce the mathematical background of s-MCA in the context of affiliation networks. Section 3 illustrates three important properties of the method: 1) the handling of missing edges in a student course-taking affiliation network, 2) the visualization of relational patterns in interesting subsets of the network, and 3) the incorporation of external information in a subset analysis to facilitate interpretation of observed relational patterns. Section 4 offers some concluding remarks and future directions.

Using a similar notation to that of [^{th} actor and the size s_{j} of the j^{th} event, respectively. The _{j} is described by a dummy variable with categories _{j} when associated to

Now suppose that the interest is to analyze and visualize a subset of events only (a column subset of Z). For instance, in a student course-taking affiliation network, where actors are students and events are the university courses students are registered for, the interest may lie in a subset of courses with similar content (e.g. science) or a subset of courses of a specific semester or year of study. Another interesting case may be to focus attention on existing edges only, that is, ignoring students who provided no information about a course or groups of courses (rows of Z with 1s in columns

Let

ages of its columns (1 is an × m matrix of ones). The averages of the columns are the column totals of Z divided by Z’s grand total nm, where n is the number of actors and, hence, are exactly the proportions of actors participating (or not participating) in the corresponding events, divided by the number of events, m. Let

and _{a} and column weights D_{h}. Therefore, the original relative frequencies of the categories are maintained and are not re-expressed relative to totals within the subset, as would normally be done in a regular MCA of the subset. The solution can be obtained using the generalized singular value decomposition (GSVD) of the matrix

Step 2 is the SVD, where U and V are the left and right singular vectors, respectively, and

The main output of s-MCA is the joint representation of actors and events in a two-dimensional map, with coordinates in the first two columns of F and ∆, respectively. In this biplot, actors are usually represented as points in the event space spanned by the axes with principal coordinates in F and events are usually represented as vectors in the actor space spanned by the axes with standard coordinates in ∆ [
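The computation outlined above can be illustrated with a minimal NumPy sketch. It follows the usual subset CA recipe (standardized residuals of the full indicator matrix, column selection without recomputing masses, then an ordinary SVD), which we assume is equivalent to the GSVD formulation of this section. The function name `subset_mca`, its arguments and the toy data are hypothetical and are not taken from the paper.

```python
import numpy as np

def subset_mca(Z, subset_cols, n_dims=2):
    """Sketch of subset MCA on an indicator matrix Z (actors x categories).

    subset_cols: indices of the categories (columns) to analyse.
    Masses and centring come from the FULL matrix and are not recomputed.
    Assumes no row or column of Z sums to zero.
    """
    Z = np.asarray(Z, dtype=float)
    P = Z / Z.sum()                      # correspondence matrix
    r = P.sum(axis=1)                    # actor masses
    c = P.sum(axis=0)                    # category masses (full matrix)
    # Standardized residuals of the full matrix
    S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))
    S_sub = S[:, subset_cols]            # keep only the subset of interest
    U, sv, Vt = np.linalg.svd(S_sub, full_matrices=False)
    # Actor principal coordinates and category standard coordinates
    F = (U[:, :n_dims] * sv[:n_dims]) / np.sqrt(r)[:, None]
    G = Vt.T[:, :n_dims] / np.sqrt(c[subset_cols])[:, None]
    return F, G, sv ** 2                 # sv**2: principal inertias

# Toy affiliation matrix: 6 actors, 3 events (illustrative data only)
A = np.array([[1, 0, 1],
              [0, 1, 0],
              [1, 1, 0],
              [0, 0, 1],
              [1, 0, 0],
              [0, 1, 1]])
# Indicator matrix with two columns (yes, no) per event
Z = np.hstack([np.stack([A[:, j], 1 - A[:, j]], axis=1)
               for j in range(A.shape[1])])

F, G, inertias = subset_mca(Z, subset_cols=[0, 1, 2, 3])
```

Because the masses of the full matrix are retained, the inertias of mutually exclusive and exhaustive column subsets add up to the total inertia of the complete analysis, which is one way to check such an implementation.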

The distance between two actors in a factorial map best approximates the chi-square distance among the corresponding actor profiles in the original space, and represents the actors’ relative positions in the network [_{j} in the actor space is represented by two opposite vectors, corresponding to the two poles

Another important aspect of s-MCA is that it allows the projection of supplementary rows or columns as points in the factorial map in order to investigate the association between existing relational patterns and actor or event covariates. Supplementary rows or columns do not participate in the creation of the factorial axes as they have zero masses and their relative positions can be evaluated to facilitate the interpretation. The coordinates of supplementary columns can be calculated as the weighted average of the actor standard coordinates with weights equal to the event profiles of the original affiliation matrix, using the so-called transition formula:
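The weighted-average description above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the function name `supplementary_column_coords`, the toy coordinates and the dummy-coded covariate are hypothetical, and the actor standard coordinates are assumed to be already available from an s-MCA solution.

```python
import numpy as np

def supplementary_column_coords(z_sup, actor_std_coords):
    """Project supplementary columns onto an existing factorial map.

    z_sup: (n_actors,) or (n_actors, n_sup) non-negative counts or 0/1 dummies.
    actor_std_coords: (n_actors, k) actor standard coordinates.
    Returns (n_sup, k): each row is the profile-weighted average of the
    actor standard coordinates.
    """
    z_sup = np.asarray(z_sup, dtype=float)
    if z_sup.ndim == 1:
        z_sup = z_sup[:, None]
    # Column profiles: each supplementary column rescaled to sum to 1
    profiles = z_sup / z_sup.sum(axis=0, keepdims=True)
    return profiles.T @ np.asarray(actor_std_coords, dtype=float)

# Illustrative values only: 4 actors on a two-dimensional map
phi = np.array([[1.0, 0.0],
                [-1.0, 2.0],
                [3.0, -2.0],
                [0.0, 0.0]])
covariate = np.array([1, 1, 0, 0])   # hypothetical dummy, e.g. one attribute category
coords = supplementary_column_coords(covariate, phi)
```

Here the supplementary category is placed at the average position of the first two actors, the only ones that carry it.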

Finally, the quality of representation of each individual point (actor) or vector (event) on a factorial map can be assessed via a set of appropriate indices, such as contribution (CTR), correlation (COR) and quality (QLT), which are part of the standard output of the software packages implementing CA/MCA.
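As an illustration of how such indices are commonly defined in CA, the sketch below computes the contribution of each point to each axis, the squared correlation of each point with each axis, and the quality as the sum of squared correlations over the retained axes. These are the textbook definitions, assumed here rather than taken from the paper. The function name and toy values are hypothetical, and the input must contain the principal coordinates on all axes, not only the displayed ones.

```python
import numpy as np

def quality_indices(coords_full, masses, n_dims=2):
    """Contribution, squared correlation and quality of each point.

    coords_full: (n_points, n_axes) principal coordinates on ALL axes.
    masses: (n_points,) point masses.
    Assumes no point lies exactly at the origin (non-zero distance).
    """
    X = np.asarray(coords_full, dtype=float)
    m = np.asarray(masses, dtype=float)
    weighted_sq = m[:, None] * X ** 2
    inertia_axis = weighted_sq.sum(axis=0)          # principal inertia per axis
    # Share of each axis' inertia due to each point
    ctr = weighted_sq[:, :n_dims] / inertia_axis[:n_dims]
    # Share of each point's squared distance explained by each axis
    cor = X[:, :n_dims] ** 2 / (X ** 2).sum(axis=1, keepdims=True)
    qlt = cor.sum(axis=1)                           # quality on the map
    return ctr, cor, qlt

# Illustrative values only: 3 points described on 3 axes
X = np.array([[1.0, 0.0, 0.0],
              [0.0, 2.0, 0.0],
              [1.0, 1.0, 1.0]])
masses = np.array([0.5, 0.25, 0.25])
ctr, cor, qlt = quality_indices(X, masses, n_dims=2)
```

In this toy case the first two points lie entirely in the plane of the first two axes and have quality 1, whereas the third point has quality 2/3.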

In order to demonstrate the important aspects of s-MCA in the context of affiliation networks, we consider a binary network of student enrollment in elective courses as part of their undergraduate studies in a primary education university department, located in a city of Northern Greece. The training of elementary pre-service teachers has been established as 4-year studies with eight semesters. In each semester, the department offers a wide range of elective courses in science, language, psychology, computer science, mathematics, statistics, social studies, music, art, physical education, and a miscellaneous group of courses. The student course-taking data were collected as part of a larger cross-sectional study aiming to associate students’ course enrollment with the reasons behind their choices and a variety of background characteristics. The affiliation network under study consists of 193 students and 67 elective courses offered by the department, in which participation has been recorded over four academic years (2011/12 through 2015/16), along with student-related attributes of gender, educational background in high school, perception towards post-graduate studies and the reasons for taking these specific courses. The sample is composed of 90% female and 10% male students. Approximately 81% of the students reported that they had a theoretical educational background in high school, 10% had a technological background, 7% had a scientific background and 2% did not provide any data.

Part of the 193 × 67 affiliation matrix used in our analysis is shown in ^{th} semester. During all four years of their undergraduate studies, students had to take ten elective courses in total (one course in each of the 1^{st}, 2^{nd}, 3^{rd}, 4^{th} and 6^{th} semesters, two courses in the 5^{th} and three courses in the 8^{th} semester). Therefore, the total degree of a student with no missing data in the network equals ten (last column in ^{th} semester (all values in the corresponding row are missing, marked “?”), but suppose he/she has done so for the courses in the other seven semesters; hence his/her total degree is lower by two (eight). Therefore, the corresponding indicator matrix

| Student | Voc | EUEdu | MPhil | EUHist | Bio | CDiv | TheHis | Cosmo | Geo | CProg | GrThe | Degree |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 10 |
| 2 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 10 |
| 3 | ? | ? | ? | ? | ? | ? | ? | ? | ? | ? | ? | 8 |
| 4 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 10 |
| 5 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 10 |
| ∙∙∙ | ∙∙∙ | ∙∙∙ | ∙∙∙ | ∙∙∙ | ∙∙∙ | ∙∙∙ | ∙∙∙ | ∙∙∙ | ∙∙∙ | ∙∙∙ | ∙∙∙ | ∙∙∙ |
| Event size | 44 | 25 | 24 | 45 | 44 | 19 | 22 | 9 | 37 | 44 | 33 | |

Columns are the elective courses of the 5^{th} semester. Voc: Vocabulary: description and pedagogy, EUEdu: Trends in European Education, MPhil: Modern Philosophy, EUHist: Modern European History, Bio: Topics in Biology, CDiv: Cultural Diversity in the Classroom, TheHis: History of Theatre, Cosmo: Cosmography, Geo: Geography Education, CProg: Computer Programming, GrThe: Group Theory; “?” indicates a missing edge.

In practice, missing response categories often dominate the CA/MCA factorial map because of high association, forcing a contrast between these and the substantive categories. In order to motivate our approach, we first consider the usual MCA map of the full indicator matrix in

The map of

For the current data, event non-response rates varied between 4% and 11%. To address the effect of missing values, s-MCA was applied to the 193 × 134 subset of Y (see Section 2.2), thus focusing on the subset of observed edges that lie, however, in the subspace of the same full space as the complete set. The corresponding s-MCA map is shown in ^{st} semester pointing to the top-left of the map, theoretical courses related to the teaching profession pointing to the bottom and bottom-right, and science, psychology and mathematics courses pointing to the top-right. However, the interpretation is still complicated because of over-crowding.
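To make the column selection concrete, the following sketch expands an affiliation matrix with missing edges into an indicator matrix and lists the columns of the observed categories. It assumes, as the dimensions quoted above suggest, that each course is coded with three categories (participation, non-participation, missing) and that the subset retains the first two of them, so that 67 courses give 134 columns. The function names and the column ordering are hypothetical.

```python
import numpy as np

def indicator_with_missing(A):
    """Expand an actors x events matrix with entries 1, 0 or NaN.

    Each event yields three dummy columns, in the order
    (participation, non-participation, missing).
    """
    A = np.asarray(A, dtype=float)
    missing = np.isnan(A)
    yes = A == 1          # NaN compares as False
    no = A == 0
    Y = np.stack([yes, no, missing], axis=2).astype(int)   # n x m x 3
    return Y.reshape(A.shape[0], -1)                        # n x 3m

def observed_columns(n_events):
    """Column indices of the two observed categories of every event."""
    return [3 * j + k for j in range(n_events) for k in (0, 1)]

# Illustrative data only: 3 students, 2 courses, one student with missing edges
A = np.array([[1, 0],
              [np.nan, np.nan],
              [0, 1]])
Y = indicator_with_missing(A)
subset = observed_columns(A.shape[1])
Y_observed = Y[:, subset]
```

In a subset analysis the selection is applied to the centred and weighted full matrix rather than to the raw dummies, so that the masses of the complete data set are retained; the indices returned by `observed_columns` would be passed to whatever s-MCA routine is used.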

Another case where s-MCA can be useful is when the purpose is to visualize the relational patterns of a specific subset of events. ^{th} semester only. Recall that in this semester each student should enroll in two of the eleven courses available. The map is the result of an s-MCA on the corresponding subset of 22 columns of the indicator matrix Z. In this map, one can now easily identify groups of courses that were usually chosen together and others that were rarely chosen together. For instance, along the first (horizontal) axis of course-taking, Vocabulary: description and pedagogy (Voc), Modern Philosophy (MPhil) and Trends in European Education (EUEdu) form a group of courses with similar enrolment patterns, whereas courses in this group are rarely chosen together with Geography (Geo) or Modern European History (EUHist). Geo and EUHist are usually chosen together, as the corresponding vectors form a small angle and point in the opposite direction from Voc, MPhil and EUEdu along the first axis. A different story takes place along the second (vertical) axis. At the top, Computer Programming (CProg) and Group Theory (GrThe) form a small group with similar enrolment patterns, in contrast with Topics in Biology (Bio), Cosmography (Cosmo), History of Theatre (TheHis) and Cultural Diversity in the Classroom (CDiv), which form another group at the bottom. In addition, the length of the segment joining the two poles is indicative of the popularity of each course. Thus, the most popular courses in this semester are GrThe, CProg, Geo, Voc, EUHist and Bio, whereas the least popular are Cosmo, TheHis, CDiv, MPhil and EUEdu. A quick look at the frequency of enrolment in

At this point, one could ask how the map of a direct application of MCA to this subset of courses, ignoring the rest, would be different from that of s-MCA in

An emerging question concerns the reasons which could potentially explain the student enrolment patterns observed in

In this paper we have discussed the use of an extension of MCA, Subset MCA, to visually explore relational patterns in subsets of two-mode networks. The application of s-MCA for social network analysis can serve a four-fold purpose: 1) to partial out the influence of missing data in an affiliation matrix, 2) to visualize relational patterns that lie in interesting subsets of the matrix in subspaces of the same full space as the original complete set, 3) to alleviate the problem of crowded representations of large affiliation networks and 4) to identify associations between observed relational patterns and exogenous variables (covariates).

The application of s-MCA to an affiliation matrix with missing data showed that it provided a meaningful approach to reveal substantive relational patterns while ignoring the non-substantive ones. In this context, s-MCA can be applied irrespective of the missing data mechanism present, it is computationally simple and it is able to handle large affiliation matrices. We argue that this exploratory method is easier to apply than the existing multiple imputation methods in which many complexities need to be considered.

When visualizing relatively large affiliation matrices, it is almost always true in practice that the interpretation of the maps is degraded by the large number of points and vectors analyzed, all of which load to a greater or lesser extent on every dimension, thereby limiting the interpretation and conclusions to broad generalities. Once the broad picture is seen in the complete analysis, there is value in a subsequent division of the events into numerous smaller, sensibly selected, mutually exclusive and exhaustive subsets. The actors-by-events structure of a two-mode network fits the structure that is assumed in s-MCA, which can provide a summary of the relationships within each subset.

Finally, we would like to highlight that s-MCA belongs to a large family of exploratory techniques that allow the data analyst to observe the patterns of associations in the data and to generate hypotheses that could be tested in a subsequent stage of research. Other methods of the same family that are worth investigating for the analysis of affiliation networks are [

Achilles Dramalidis, Angelos Markos (2016) Subset Multiple Correspondence Analysis as a Tool for Visualizing Affiliation Networks. Journal of Data Analysis and Information Processing, 04, 81-89. doi: 10.4236/jdaip.2016.42007