Techniques for comparing and recommending conferences
Journal of the Brazilian Computer Society, volume 23, Article number: 4 (2017)
Abstract
This article defines, implements, and evaluates techniques to automatically compare and recommend conferences. The techniques for comparing conferences use familiar similarity measures and a new measure based on coauthorship communities, called the coauthorship network community similarity index. The experiments reported in the article indicate that the technique based on the new measure performs better than the other techniques for comparing conferences, which is therefore the first contribution of the article. Then, the article focuses on three families of techniques for conference recommendation. The first family adopts collaborative filtering based on the conference similarity measures investigated in the first part of the article. The second family includes two techniques based on the idea of finding, for a given author, the most strongly related authors in the coauthorship network and recommending the conferences that his coauthors usually publish in. The first member of this family is based on the Weighted Semantic Connectivity Score—WSCS, which is accurate but quite costly to compute for large coauthorship networks. The second member of this family is based on a new score, called the Modified Weighted Semantic Connectivity Score—MWSCS, which is much faster to compute and as accurate as the WSCS. The third family includes the Cluster-WSCS-based and the Cluster-MWSCS-based conference recommendation techniques, which adopt conference clusters generated using a subgraph of the coauthorship network. The experiments indicate that the Cluster-WSCS-based technique is the best performing conference recommendation technique. This is the second contribution of the article. Finally, the article includes experiments that use data extracted from the DBLP repository and a web-based application that enables users to interactively analyze and compare a set of conferences.
Introduction
Conferences provide an important channel for the exchange of information and experiences among researchers. The academic community organizes a large number of conferences, in the most diverse areas, generating a rich set of bibliographic data. Researchers explore such data to discover topics of interest, find related research groups, and estimate the impact of authors and publications [1–6]. Choosing a good conference or journal in which to publish an article is in fact very important to researchers. The choice is usually based on the researchers’ knowledge of the publication venues in their research area or on matching the conference topics with their paper subject. Indeed, the identification of relevant publication venues presents no problems when the researcher is working in his area. It is less obvious, though, when the researcher moves to a new area.
In this article, we define, implement, and evaluate techniques to automatically compare and recommend conferences that help address the questions of selecting and evaluating the importance of conferences. From a broad perspective, techniques for comparing conferences induce clusters of similar conferences, when applied to a conference catalog. Therefore, when one finds one or more familiar conferences in a cluster, he may consider that the other conferences in the cluster are similar to those he is familiar with. Techniques for recommending conferences, on the other hand, select conferences according to a given criterion and rank them in order of importance. Thus, when one finds a conference closer to the top of the ranked list, he may consider that it is more important than those lower down in the list, within the bounds of the given criterion.
The techniques for comparing conferences adopt familiar similarity measures, such as the Jaccard similarity coefficient, the Pearson correlation similarity and the Cosine similarity, and a new similarity measure, called the coauthorship network community similarity index. The experiments reported in the article indicate that the best performing technique for comparing conferences is that based on the new similarity measure, which is therefore the first contribution of the article.
The article proceeds to define three families of conference recommendation techniques. The first family of techniques adopts collaborative filtering based on the conference similarity measures investigated in the first part of the article. The second family includes two techniques based on the idea of finding, for a given author, the most strongly related authors in the coauthorship network and recommending the conferences that his coauthors usually publish in. The first member of this family is based on the Weighted Semantic Connectivity Score—WSCS, an index for measuring the relatedness of actors. However, since this index proved to be accurate but quite costly to compute for large coauthorship networks, the article introduces a second technique based on a new score, called the Modified Weighted Semantic Connectivity Score—MWSCS, which is much faster to compute and as accurate as the WSCS. The third family of conference recommendation techniques includes the Cluster-WSCS-based and the Cluster-MWSCS-based techniques, which adopt conference clusters generated using a subgraph of the coauthorship network, instead of the full coauthorship network. The experiments suggest that the WSCS-based, MWSCS-based, and Cluster-WSCS-based techniques perform better than the benchmark and better than the techniques based on similarity measures. Furthermore, among these three techniques, the experiments permit us to conclude that the Cluster-WSCS-based technique should be preferred because it is more efficient and shows no statistically significant differences when compared to the WSCS-based and MWSCS-based techniques. This is the second contribution of the article.
The experiments mentioned in the previous paragraphs use data extracted from a triplified version of the dblp computer science bibliography (DBLP) repository, which stores Computer Science bibliographic data for more than 4500 conferences and 1500 journals (as of early 2016). The experiments were performed using a web-based application that enables users to interactively analyze and compare a set of conferences.
The remainder of this article is structured as follows. The “Related work” section summarizes similar work. The “Techniques” section introduces the conference comparison and the conference recommendation techniques. The “Results and discussion” section presents an application that implements the techniques and describes their evaluation. Finally, the “Conclusions” section summarizes the main contributions of this article.
Related work
Henry et al. [1] analyzed a group of the four major conferences in the field of human-computer interaction (HCI). The authors discovered many global and local patterns using only article metadata, such as authors, keywords, and year. Blanchard [2] presented a 10-year analysis of the paper production in intelligent tutoring systems (ITS) and Artificial Intelligence in Education (AIED) conferences and showed that the Western, Educated, Industrialized, Rich, and Democratic bias observed in psychology may be influencing AIED research. Chen, Zhang, and Vogeley [3] proposed an extension of contemporary co-citation network analysis to identify co-citation clusters of cited references. Intuitively, the authors synthesize thematic contexts in which these clusters are cited and trace how the research focus evolved over time. Gasparini, Kimura, and Pimenta [4] presented a visual exploration of the field of human-computer interaction in Brazil from a 15-year analysis of paper production in the Brazilian Symposium on Human Factors in Computing Systems (IHC). Recently, Barbosa et al. [5] published an analysis of the same conference series. Chen, Song, and Zhu [6] analyzed research agendas and trends in the Entity Relationship conferences, opening a wide range of research opportunities.
Zervas et al. [7] applied social network analysis (SNA) metrics to analyze the coauthorship network of the Educational Technology & Society (ETS) Journal. Procópio, Laender, and Moro [8] did a similar analysis for the databases field. Cheong and Corbitt [9, 10] analyzed the Pacific Asia Conference on Information Systems and the Australasian Conference on Information Systems.
Recently, Lopes et al. [11, 12] carried out an extensive analysis of the WEBIST conferences, involving authors, publications, conference impact, topics coverage, community analysis, and other aspects. The analysis starts with simple statistics, such as the number of papers per conference edition, and then moves on to analyze the coauthorship network, estimating the number of communities, for example. The work also covers author indices, such as the h-index, topics and conference areas, and paper citations.
Linked Data principles were also used to publish conference data in [13].
All the above references focus on metrics typical of social network analysis mostly to compare different instances of the same publication venue and do not cover recommendation issues. Contrasting with the above references, in this article, we propose, implement, and evaluate several techniques to compare conferences in general and not a specific conference series. The current implementation works with the triplified version of the DBLP repository, which covers the vast majority of Computer Science conferences.
We now turn to conference recommendation, a problem that attracted attention due to the increase in the number of conferences in recent years.
Medvet et al. [14] considered a venue recommendation system based on the idea of matching the topics of a paper, extracted from the title and abstract, with those of possible publication venues for the paper. We adopted a simpler approach to obtain the topics of a conference from the set of keywords and titles of the papers published in the conference and their frequency, after eliminating synonymous keywords.
Pham et al. [15] proposed a clustering approach based on user social information to derive venue recommendations, combining collaborative filtering and trust-based recommendation. The authors used data from DBLP and Epinions to show that the proposed clustering-based collaborative filtering technique performs better than traditional collaborative filtering algorithms. In this article, we also explore collaborative filtering and conference clustering to define families of conference recommendation techniques.
Chen et al. [16] proposed a method for recommending academic venues based on the PageRank metric. However, unlike the original PageRank method, which would induce a relationship network model of these venues, the authors proposed a method that considers the effects of authors’ preference for each venue. Thus, the PageRank metric is computed on a mixed network where nodes are academic venues and authors and edges are the coauthoring and publishing (author-venue) relationships. The score of a node is then defined as the combination of the effects of coauthoring and publication. The propagation of scores across the network also differs from the original PageRank: each adjacent node propagates its effects proportionally to its similarity with its neighbor. If two authors are similar, the score is more intensely propagated, that is, authors with similar interests influence the score of a venue more strongly.
Boukhris et al. [17] proposed a recommendation technique for academic venues for a particular target researcher, TR. The technique prioritizes the venues most used by the researchers that cite TR. The citation intensities are adjusted with factors that intend to measure the interest of a researcher in the work of TR, so that the venues of the researchers most strongly interested in the work of TR will have greater relevance. To solve the problem of target researchers with few citations, the recommendation process uses coauthors and colleagues from the same institution as TR. A final step in the recommendation process allows filtering the ranking results according to requirements reported by users.
Yang and Davison [18] proposed an interesting approach for venue recommendation based on stylometric features. They argue that writing style and paper format may serve as features for collaborative filtering-based methods. Their results show that the combination of content features with stylometric features (lexical, structural, and syntactical) performs better than when stylometric or content-based features are applied separately. Although the accuracy reported is rather low, linguistic style and paper format remain interesting features to consider.
Huynh and Hoang [19] proposed a simple network model based on social network structure that may serve to represent connections that go beyond classical “who knows whom” connections. Thus, for instance, in their network model, the relationships between researchers can be based on coauthorship measures and author similarity. Their work can benefit from ours by borrowing the metrics proposed here.
Asabere et al. [20] and Hornick et al. [21] addressed the problem of recommending conference sessions to attendees. As in the venue recommendation problem, recommendation techniques such as content-based and collaborative methods are used to match attendees and session presentations. The use of geolocation information [20] and personal information provided at conferences [21] as features may also be incorporated to improve venue recommendation. For instance, conference and researcher locations can be used as features when budget restrictions apply.
Luong et al. [22] proposed and compared three recommendation methods for conferences. The methods find the most appropriate conference for a set of coauthors who want to publish a paper together. The best performing recommendation method, which we will refer to simply as the most frequent conference, is divided into two stages. First, the method recursively collects the coauthors of the coauthors, until a three-level-deep network is created. Second, the method weights the contributions of each coauthor by the number of papers they have coauthored with an author. It is defined as:
where N is the set of coauthors who want to publish a paper together, i is a conference that might be recommended for the set N of coauthors, and coauthors_w _{ i,m } is the weight of conference i for a coauthor m∈N in the coauthorship network. This last function is defined as:
where CoA is the set of coauthors of author m who have published at conference i, w_CoA _{ k,m } is the number of times author m coauthored papers with another author k in the coauthorship network, nfreq_CONF_{ i,m } is the probability of author m publishing in conference i, and likewise nfreq_CONF_{ i,k } is the probability of author k publishing in conference i. In this article, we adopted Luong’s most frequent conference technique as the benchmark and, therefore, included a somewhat more detailed account of their work.
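To make the benchmark concrete, the following Python sketch implements one plausible reading of the definitions above; the data layout (dictionaries mapping authors to per-conference paper counts and to coauthorship counts) and the exact combination of the weights are our assumptions, not Luong et al.’s original code.

```python
def nfreq(pubs, author, conf):
    """Probability that `author` publishes in `conf`: papers in conf / total papers."""
    total = sum(pubs[author].values())
    return pubs[author].get(conf, 0) / total if total else 0.0

def coauthors_w(pubs, coauth_counts, conf, m):
    """Weight of `conf` for author m: contributions of m's coauthors who
    published at `conf`, weighted by how often they coauthored with m
    (a hedged reading of the definition above)."""
    score = 0.0
    for k, w_km in coauth_counts[m].items():
        if pubs[k].get(conf, 0) > 0:
            score += w_km * nfreq(pubs, k, conf) * nfreq(pubs, m, conf)
    return score

def recommend(pubs, coauth_counts, authors, confs):
    """Rank conferences for the set `authors` by the summed coauthor weights."""
    ranking = {i: sum(coauthors_w(pubs, coauth_counts, i, m) for m in authors)
               for i in confs}
    return sorted(ranking, key=ranking.get, reverse=True)
```

For instance, an author whose frequent coauthors mostly publish at one conference would see that conference ranked first.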
In this article, we propose two conference recommendation techniques based on a social network analysis of the coauthorship network, but we adopt a measure of the strength of the connections between the authors in the network which is computed differently from the previous methods. We first propose to estimate the relatedness of actors in a social network by using a semantic connectivity score [23], denoted SCS, which is in turn based on the Katz index [24]. This score takes into account the number of paths between two nodes of the network and the accumulated weights of these paths. Then, we propose a second score that approximates the SCS score and uses the shortest path between two nodes. In addition to these two strategies, we also propose to construct a utility matrix and to implement recommendation techniques based on collaborative filtering using the utility matrix.
Techniques
In this section, we introduce the conference comparison and the conference recommendation techniques, which are the main thrust of the article. We refer the reader to [25] for illustrative examples of the techniques.
Conference comparison techniques
As mentioned in the Introduction section, the techniques for comparing conferences induce clusters of similar conferences, when applied to a conference catalog. They adopt familiar similarity measures, such as the Jaccard similarity coefficient, the Pearson correlation similarity and the Cosine similarity, and a new similarity measure, called the coauthorship network community similarity index.
In what follows, we use the following notation:

C is a set of conferences

A is a set of authors

P is a set of papers

pa : A → 2^{P} is a function that assigns to each author i ∈ A the set of papers pa(i) ⊆ P that author i published (in any conference)

pc : C → 2^{P} is a function that assigns to each conference x ∈ C the set of papers pc(x) ⊆ P that were published in x

pac : A × C → 2^{P} is a function that assigns to each author i ∈ A and each conference x ∈ C the set of papers pac(i, x) ⊆ P that author i published in conference x

A _{ x } and A _{ y } are the sets of authors that published in conferences x and y, that is, A _{ x } = {i ∈ A / pac(i, x) ≠ ∅} and, likewise, A _{ y } = {i ∈ A / pac(i, y) ≠ ∅}

A _{ x,y } is the set of authors that published in both conferences x and y, that is, A _{ x,y } = {i ∈ A / pac(i, x) ≠ ∅ ∧ pac(i, y) ≠ ∅}

G _{ x } = (N _{ x }, E _{ x }), the coauthorship network of conference x, is an undirected and unweighted graph where i ∈ N _{ x } indicates that author i published in conference x and {i, j} ∈ E _{ x } represents that authors i and j coauthored one or more papers published in conference x
Similarity measures based on author information
In what follows, we adapt familiar similarity measures to conferences and authors and introduce a new measure called community similarity.
The Jaccard Similarity Coefficient for conferences x and y is defined as

\( \mathrm{jaccard}\_\mathrm{sim}\left( x, y\right)=\frac{\left|{A}_{x, y}\right|}{\left|{A}_x\cup {A}_y\right|} \)
The utility matrix expresses the preferences of an author for a conference to publish his research. More formally, the utility matrix [r _{ x,i }] is such that a line x represents a conference and a column i represents an author and is defined as:

\( {r}_{x, i}=\left|\mathrm{pac}\left( i, x\right)\right| \)

that is, r _{ x,i } is the number of papers that author i published in conference x.
Based on the utility matrix [r _{ x,i }], we define the Pearson’s Correlation Coefficient between conferences x and y as follows:

\( \mathrm{pearson}\_\mathrm{sim}\left( x, y\right)=\frac{\sum_i\left({r}_{x, i}-\overline{r_x}\right)\left({r}_{y, i}-\overline{r_y}\right)}{\sqrt{\sum_i{\left({r}_{x, i}-\overline{r_x}\right)}^2}\sqrt{\sum_i{\left({r}_{y, i}-\overline{r_y}\right)}^2}} \)
where \( \overline{r_x} \) is the average of the elements of line x of the utility matrix (and likewise for \( \overline{r_y} \)).
Again, based on the utility matrix [r _{ x,i }], we define the Cosine Similarity between conferences x and y as follows:

\( \mathrm{cos}\_\mathrm{sim}\left( x, y\right)=\frac{\sum_i{r}_{x, i}\ {r}_{y, i}}{\sqrt{\sum_i{r}_{x, i}^2}\sqrt{\sum_i{r}_{y, i}^2}} \)
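The three classical measures can be sketched in a few lines of Python; the functions below operate on plain author sets and on lists representing rows of the utility matrix (an illustrative layout, not necessarily the one used in our implementation).

```python
from math import sqrt

def jaccard_sim(authors_x, authors_y):
    """Jaccard coefficient between the author sets of two conferences."""
    union = len(authors_x | authors_y)
    return len(authors_x & authors_y) / union if union else 0.0

def pearson_sim(rx, ry):
    """Pearson correlation between two conference rows of the utility matrix."""
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    num = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    den = sqrt(sum((a - mx) ** 2 for a in rx) * sum((b - my) ** 2 for b in ry))
    return num / den if den else 0.0

def cosine_sim(rx, ry):
    """Cosine similarity between two conference rows of the utility matrix."""
    num = sum(a * b for a, b in zip(rx, ry))
    den = sqrt(sum(a * a for a in rx)) * sqrt(sum(b * b for b in ry))
    return num / den if den else 0.0
```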
We introduce a new similarity measure between conferences based on communities defined over the coauthorship network of the conferences. Given the coauthorship network G _{ x } = (N _{ x }, E _{ x }) of conference x, we define an author community c _{ x } of x as the set of nodes of a connected component of G _{ x }. Let c _{ x } and c _{ y } be author communities in the coauthorship networks of conferences x and y, respectively. We say that c _{ x } and c _{ y } are equivalent w.r.t. a similarity measure sim and a threshold level α iff sim(c _{ x }, c _{ y }) ≥ α. For example, sim may be defined using the Jaccard similarity coefficient between pairs of conferences introduced above.
Let C _{ x } and C _{ y } be the sets of communities of conferences x and y, respectively. Let EQ[sim, α](x, y) be the set of communities in the coauthorship network of conference x that have an equivalent community in the coauthorship network of conference y (and symmetrically EQ[sim, α](y, x)).
The Coauthorship Network Communities Similarity (based on a similarity measure sim and a threshold level α) between conferences x and y is then defined as:

\( \mathrm{c}\_\mathrm{sim}\left[\mathrm{sim},\alpha \right]\left( x, y\right)=\frac{\left| EQ\left[\mathrm{sim},\alpha \right]\left( x, y\right)\right|+\left| EQ\left[\mathrm{sim},\alpha \right]\left( y, x\right)\right|}{\left|{C}_x\right|+\left|{C}_y\right|} \)
Note that |C _{ x }| > 0 and |C _{ y }| > 0 since G _{ x } and G _{ y } must have at least one node each and therefore at least one connected component each.
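A minimal Python sketch of this measure follows; it extracts communities as connected components and counts α-equivalent communities on each side, under the assumption that sim is the Jaccard coefficient between author sets.

```python
def connected_components(edges, nodes):
    """Communities of a conference's coauthorship network = connected components."""
    adj = {n: set() for n in nodes}
    for i, j in edges:
        adj[i].add(j)
        adj[j].add(i)
    seen, comps = set(), []
    for n in nodes:
        if n in seen:
            continue
        stack, comp = [n], set()
        while stack:
            v = stack.pop()
            if v in comp:
                continue
            comp.add(v)
            stack.extend(adj[v] - comp)
        seen |= comp
        comps.append(comp)
    return comps

def jaccard(a, b):
    return len(a & b) / len(a | b) if a | b else 0.0

def community_sim(comms_x, comms_y, sim=jaccard, alpha=0.5):
    """Fraction of communities on either side with an alpha-equivalent
    community on the other side (a plausible reading of the definition)."""
    eq_xy = sum(1 for cx in comms_x if any(sim(cx, cy) >= alpha for cy in comms_y))
    eq_yx = sum(1 for cy in comms_y if any(sim(cy, cx) >= alpha for cx in comms_x))
    return (eq_xy + eq_yx) / (len(comms_x) + len(comms_y))
```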
Similarity measure based on conference keywords
In the previous subsection, we proposed a utility matrix that expresses the preferences of an author for a conference to publish his research. However, we can also express the association of a topic with a conference. Therefore, in this section, we describe an algorithm to obtain the conference topics and introduce a new utility matrix that represents this information.
To obtain the topics of the conference x, we first extract, for each paper p ∈ pc(x), the set of keywords of the paper, denoted by kwrds(p). Then, we define the frequency of a keyword k for a conference x as:
where the function kwrds(p) tries to eliminate synonymous keywords. In our implementation, we used the API of the Big Huge Thesaurus^{Footnote 1} to retrieve the synonyms of a word, in English.
The extraction of keywords for a paper, that is, the computation of kwrds(p), is based on a lexical analysis of paper metadata. This process follows five steps:

1.
Obtain the text for keyword extraction; in our implementation, we used the title and the keyword list of the paper.

2.
Tokenize the extracted text.

3.
Eliminate stopwords (i.e., the most common words in a language).

4.
Eliminate suffixes to obtain the word lexeme.

5.
The resulting token list represents the keywords of the paper.
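The five steps above can be sketched as follows; the stopword list and the crude suffix stripper are illustrative stand-ins for a proper stopword dictionary and stemmer.

```python
import re

STOPWORDS = {'a', 'an', 'and', 'the', 'of', 'for', 'in', 'on', 'to', 'with'}

def stem(token):
    """Crude suffix stripping standing in for a real stemmer (step 4)."""
    for suffix in ('ing', 'ies', 'es', 's'):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def kwrds(title, keyword_list=()):
    """Steps 1-5: text -> tokens -> stopword removal -> stemming -> keyword set."""
    text = ' '.join([title, *keyword_list]).lower()   # step 1: gather text
    tokens = re.findall(r'[a-z]+', text)              # step 2: tokenize
    tokens = [t for t in tokens if t not in STOPWORDS]  # step 3: stopwords
    return {stem(t) for t in tokens}                  # steps 4-5: lexemes
```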
We then define the set of keywords of a conference as follows:
The database vocabulary is the union of all the relevant keywords for the conferences, that is:
where β is a frequency threshold, whose purpose is to eliminate keywords with low frequency.
From the process of obtaining the keywords of a conference, we can establish a new utility matrix that expresses the association of topics (keywords) with conferences. More formally, the utility matrix [s _{ x,k }] is such that a line x represents a conference and a column k represents a keyword and is defined as:
where β is the frequency threshold.
The number of columns of the matrix [s _{ x,k }] is the cardinality of the set K.
The problem of comparing conferences using topics is addressed by defining the similarity functions jaccard_sim_tpc(x,y), pearson_sim_tpc(x,y), cos_sim_tpc(x,y) and c_sim_tpc[sim,α](x,y), analogously to the functions jaccard_sim(x,y), pearson_sim(x,y), cos_sim(x,y), and c_sim[sim,α](x,y), respectively. To define the new functions, we apply the following transformations on the similarity functions introduced in the previous subsection:

We substitute A _{ x } and A _{ y } by K _{ x } and K _{ y }, where K _{ x } and K _{ y } are the sets of keywords that are relevant for conferences x and y, that is, K _{ x } = {k ∈ K / s _{ x,k } > 0} and K _{ y } = {k ∈ K / s _{ y,k } > 0}.

We substitute A _{ x,y } by K _{ x,y }, where K _{ x,y } is the set of keywords relevant for both conferences x and y, that is, K _{ x,y } = {k ∈ K / s _{ x,k } > 0 ∧ s _{ y,k } > 0}.
Conference recommendation techniques
Conference recommendation techniques based on classical similarity measures
As defined in [26], in a recommender system, there are two classes of entities—users and items. Users have preferences for certain items, which must be extracted from the data. The data itself is represented as a utility matrix giving, for each user-item pair, a value that represents what is known about the degree of preference or rating of that user for that item. An unknown rating implies that there is no explicit information about the user’s preference for the item. The goal of a recommendation system is to predict the unknown ratings in the utility matrix.
In our context, we recall from the “Conference comparison techniques” subsection that the utility matrix [r _{ x,i }] is such that r _{ x,i } expresses the preference (i.e., rating) of an author i for a conference x to publish his research. To predict an unknown rating, we compute the similarity between conferences and detect their nearest neighbors or most similar conferences. With this information, the rating of conference x for author i is defined as follows:

\( {r}_{x, i}=\frac{\sum_{y\in {S}_x}\mathrm{sim}\left( x, y\right)\cdot {r}_{y, i}}{\sum_{y\in {S}_x}\mathrm{sim}\left( x, y\right)} \)
where S _{ x } is the set of conferences most similar to x and r _{ y,i } is the rating of conference y for author i.
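The prediction rule can be sketched as an item-based collaborative filtering step in Python; the dictionary layout of the ratings and the fixed neighborhood size k are illustrative assumptions.

```python
def predict_rating(x, i, ratings, sim, k=3):
    """Predict the rating of conference x for author i from the k conferences
    most similar to x (weighted average, standard item-based CF)."""
    neighbours = sorted((c for c in ratings if c != x),
                        key=lambda c: sim(x, c), reverse=True)[:k]
    num = sum(sim(x, c) * ratings[c].get(i, 0) for c in neighbours)
    den = sum(sim(x, c) for c in neighbours)
    return num / den if den else 0.0
```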
Therefore, we may immediately define a family of conference recommendation techniques based on the utility matrix and the classical similarity measures introduced in the “Conference comparison techniques” subsection that we call CF-Jaccard, CF-Pearson, CF-Cosine, and CF-Communities, according to the similarity measure adopted. The “Results and discussion” section analyzes how they perform in detail.
Conference recommendation techniques based on the weighted coauthorship network
Recall from the “Conference comparison techniques” subsection that pa : A → 2^{P} is the function that assigns to each author i ∈ A the set of papers pa(i) ⊆ P that author i published (in any conference). The weighted coauthorship network based on pa is the edge-weighted undirected graph G = (N, E, w), where i ∈ N represents an author, {i, j} ∈ E indicates that i and j are coauthors, that is, {i, j} ∈ E iff pa(i) ∩ pa(j) ≠ ∅, and w({i, j}) assigns a weight to the coauthorship relationship between i and j, and is defined as:

\( w\left(\left\{ i, j\right\}\right)=\frac{\left|\mathrm{pa}(i)\cap \mathrm{pa}(j)\right|}{\left|\mathrm{pa}(i)\cup \mathrm{pa}(j)\right|} \)
Hence, the larger w({i, j}) is, the stronger the coauthorship relationship: if authors i and j coauthored all papers they published, then w({i, j}) = 1; and if they have not coauthored any paper, then the edge {i, j} does not exist.
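Building the weighted coauthorship network can be sketched as follows, under the assumption (consistent with the remark above) that the edge weight is the Jaccard coefficient of the two authors’ paper sets.

```python
from itertools import combinations

def coauthorship_network(pa):
    """Build the weighted coauthorship graph from pa: author -> set of papers.
    Edge weight is the Jaccard coefficient of the two authors' paper sets,
    so w = 1 when every paper of both authors is a joint paper."""
    edges = {}
    for i, j in combinations(sorted(pa), 2):
        shared = pa[i] & pa[j]
        if shared:  # an edge exists iff the authors share at least one paper
            edges[(i, j)] = len(shared) / len(pa[i] | pa[j])
    return edges
```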
The second family of conference recommendation techniques explores the weighted coauthorship network and adopts two scores: the Weighted Semantic Connectivity Score—WSCS and the Modified Weighted Semantic Connectivity Score—MWSCS. Hence, these techniques are called WSCS-based and MWSCS-based recommendation techniques.
The Weighted Semantic Connectivity Score, WSCS_{ e }, is defined by modifying the semantic connectivity score SCS_{ e } [23] to take into account the weight of the paths between two authors i and j, computed as the sum of the weights of the edges in the path:

\( {\mathrm{WSCS}}_e\left( i, j\right)={\displaystyle \sum_{w=1}^T}{\beta}^w\cdot \left|{\mathrm{paths}}_{< i, j>}^{< w>}\right| \)
where \( \left|{\mathrm{paths}}_{< i, j>}^{< w>}\right| \) is the number of paths of weight equal to w between i and j, T is the maximum weight of the paths, and 0 < β ≤ 1 is a positive damping factor.
The conference recommendation technique based on WSCS_{ e } works as follows. Given an author i, it starts by computing WSCS_{ e }(i, j), the score between i and any other author j in the weighted coauthorship network. Then, it sorts authors in decreasing order of WSCS_{ e }, since authors that are more related to author i will have a higher WSCS_{ e }(i, j) value. For better performance, the technique considers only the first n authors in the list ordered by WSCS_{ e }. Call this set F _{ i }. For each author j in F _{ i }, the technique selects the conference c ∈ C with the highest pac(j, c), denoted MaxC _{ j }. The rank of conference x for author i is defined as follows:
where \( g\left( x, j\right)=\left\{\begin{array}{c}\hfill 1, \mathrm{iff}\ x=\mathrm{Max}{C}_j\ \hfill \\ {}\hfill 0,\ \mathrm{otherwise}\ \hfill \end{array}\right. \)
Since computing the WSCS_{ e } score can be very slow for large graphs, we propose to compute only the shortest paths from author i to other authors using Dijkstra’s algorithm. We then redefine the score as follows:

\( {\mathrm{MWSCS}}_e\left( i, j\right)={\beta}^w \)
where w is the length of the shortest path from author i to author j. The recommendation technique remains basically the same, except that it uses the MWSCS score.
The results for the recommendation technique using the MWSCS_{ e } score can be very different from those obtained using the WSCS_{ e } score. Indeed, it is easy to see that, by using the MWSCS_{ e } score, we lose the information about all paths between the authors, except the shortest. For example, in the coauthorship network of Fig. 1, the pairs of authors (A1, A3) and (A1, A2), using Eq. 16, have the same MWSCS_{ e }, whereas the pair (A1,A3) should have a larger value; indeed, the path (A1, A4, A3) is ignored in the calculation of the MWSCS_{ e } score, using Eq. 16.
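The shortest-path variant can be sketched with Dijkstra’s algorithm using Python’s standard library heap; the edge lengths passed to the algorithm (e.g., a distance derived from the coauthorship weights) are left here as an implementation choice.

```python
import heapq

def dijkstra(graph, source):
    """Shortest-path lengths from source in a weighted undirected graph
    given as {node: {neighbour: edge_length}}."""
    dist = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float('inf')):
            continue  # stale queue entry
        for v, w in graph[u].items():
            nd = d + w
            if nd < dist.get(v, float('inf')):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist

def mwscs(graph, i, beta=0.5):
    """MWSCS(i, j) = beta ** (shortest-path length from i to j): only the
    shortest path contributes, and longer paths are damped."""
    dist = dijkstra(graph, i)
    return {j: beta ** d for j, d in dist.items() if j != i}
```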
Conference recommendation techniques based on conference clusters
In the previous subsection, we presented two algorithms to recommend conferences using the coauthorship network. The first algorithm, based on the WSCS_{ e } score, is computationally slower than the second, based on the MWSCS_{ e } score. Both algorithms are sensitive to the network size and, therefore, slower for large networks. In this section, we propose an algorithm to recommend conferences that reduces the problem of recommending conferences using the coauthorship network to the problem of recommending conferences using a subgraph of the coauthorship network.
We may immediately define a third family of conference recommendation techniques that contains two techniques, called Cluster-WSCS-based and Cluster-MWSCS-based, if we use the WSCS and the MWSCS scores, respectively, to recommend conferences using a subgraph of the coauthorship network, instead of the full coauthorship network.
Let u ⊆ C be a conference cluster. The coauthorship network for u is the subgraph G _{ u } = (N _{ u }, E _{ u }, w) of the weighted coauthorship network G = (N, E, w) such that:
This family of recommendation techniques uses the following preprocessing algorithm:

1.
Obtain the set U of conference clusters using a similarity function s.

2.
For each cluster u ∈ U, create the coauthorship network of the cluster.

3.
For each cluster u ∈ U, obtain a vector V _{ u } representing cluster u.

4.
For each author i ∈ A, obtain a vector V _{ i } representing author i.
To define the algorithm, we need a function cluster_score(i, u) : A × U → ℕ that assigns to each author i ∈ A and each cluster u ∈ U a relationship score based on the similarity between vectors V _{ i } and V _{ u }.
Then, the general algorithm to recommend a conference to an author i is defined as:

1.
Select u _{ i } such that \( {u}_i = \underset{u\in U}{\mathrm{argmax}}\left(\mathrm{cluster}\_\mathrm{score}\left( i, u\right)\right) \)

2.
Apply a conference recommendation algorithm (the WSCS-based or the MWSCS-based technique proposed in the previous subsection) using the coauthorship network of cluster u _{ i }.
Steps 3 and 4 of the general algorithm and the definition of cluster_score depend on the choice of the similarity function s used in step 1 of the preprocessing algorithm. If we use one of the similarity functions introduced in the “Similarity measures based on author information” subsection, steps 3 and 4 and the cluster score are defined as:

Step 3 computes, for each cluster u ∈ U, the vector V _{ u } representing cluster u such that \( {V}_u\left[ c\right]=\left\{\begin{array}{cc}\hfill 1\hfill & \hfill, \mathrm{iff}\ c\in u\hfill \\ {}\hfill 0\hfill & \hfill,\ \mathrm{otherwise}\hfill \end{array}\right. \)

Step 4 computes, for each author i ∈ A, the vector V _{ i } representing author i, defined exactly as the column corresponding to author i in the utility matrix [r _{ x,i }] introduced in the “Similarity measures based on author information” subsection.

cluster_score is the similarity function s selected in step 1 of the preprocessing algorithm.

However, if we use one of the similarity functions introduced in the “Similarity measure based on conference keywords” subsection, steps 3 and 4 and the cluster score are defined as:

Step 3 computes for each cluster u ∈ U, the vector V _{ u } representing cluster u such that \( {V}_u\left[ k\right]=\left\{\begin{array}{cc}\hfill 1\hfill & \hfill, \mathrm{iff}\ k\in {\cup}_{c\in u}{K}_c\hfill \\ {}\hfill 0\hfill & \hfill,\ \mathrm{otherwise}\hfill \end{array}\right. \)

Step 4 computes the keywords of the papers belonging to the author. The process is described in the “Similarity measure based on conference keywords” subsection for the case of the conference keywords.

cluster_score is the Jaccard similarity function between vectors V _{ u } and V _{ i }.
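The selection of the best cluster for an author (step 1 of the recommendation algorithm) can be sketched as follows, using the Jaccard similarity between binary vectors as the cluster score; the vector encoding is illustrative.

```python
def jaccard_vec(u, v):
    """Jaccard similarity between two binary vectors, viewed as index sets."""
    a = {k for k, x in enumerate(u) if x}
    b = {k for k, x in enumerate(v) if x}
    return len(a & b) / len(a | b) if a | b else 0.0

def best_cluster(author_vec, cluster_vecs):
    """Pick the cluster whose vector is most similar to the author's vector."""
    return max(cluster_vecs,
               key=lambda u: jaccard_vec(author_vec, cluster_vecs[u]))
```

The recommendation then proceeds on the coauthorship network of the selected cluster only, which is what makes this family faster than working on the full network.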
Results and discussion
Experimental environment
Figure 2 summarizes the architecture of the application developed to run the experiments. The Conferences Data Service handles queries to the triple store with conference data. The Co-authorship Network Service receives data from the Conferences Data Service and handles queries to the Neo4j database. When an analysis is executed, the system stores the results for future reuse; the Previous Calculation Service manages these functions. All experiments that follow were executed on an Intel Core Quad 3.00 GHz machine, with 6 GB RAM, running Windows 7.
Experiments with the conference similarity techniques
We evaluated the conference similarity techniques assuming that the most similar conferences should fall in the same category. We selected as benchmark the List of Computer Science Conferences defined in Wikipedia,^{Footnote 2} which contains 248 academic computer science conferences, classified into 13 categories. That is, the categories define a partition P of the set of conferences. Then, we applied the same clustering algorithm to this set of conferences, using each of the conference similarity measures in turn. Finally, we compared the clusters thus obtained with P. The best conference similarity measure would therefore be the one that results in conference clusters that best match P.
We adopted the hierarchical agglomerative clustering algorithm, which starts with each conference as a singleton cluster and then successively merges (or agglomerates) pairs of clusters, using similarity measures, until achieving the desired number of clusters. To determine how similar two clusters are, and whether to agglomerate them, a linkage criterion is used: at each step, the two clusters with the smallest linkage distance are merged.
Let d(a, b) denote the distance between two elements a and b. Familiar linkage criteria between two sets of elements A and B are:

Complete-linkage clustering: the distance D(A, B) between two clusters A and B equals the distance between the two elements (one in each cluster) that are farthest away from each other:
$$ D(A, B)=\max \{\, d(a, b) \mid a\in A,\ b\in B \,\} $$ (17)
Single-linkage clustering: the distance D(A, B) between two clusters A and B equals the distance between the two elements (one in each cluster) that are closest to each other:
$$ D(A, B)=\min \{\, d(a, b) \mid a\in A,\ b\in B \,\} $$ (18)
Average-linkage clustering: the distance D(A, B) between two clusters A and B is taken as the average of the distances between all pairs of elements (one in each cluster):
$$ D(A, B)=\frac{\sum_{a\in A}\sum_{b\in B} d(a, b)}{|A|\cdot|B|} $$ (19)
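The agglomerative procedure and the three linkage criteria of Eqs. (17)-(19) can be sketched as follows (an illustrative, naive implementation under our own naming; a real system would use an optimized library routine rather than recomputing all pairwise linkages at each step):

```python
import itertools

def agglomerative(points, dist, k, linkage="complete"):
    """Hierarchical agglomerative clustering: start with singleton
    clusters and repeatedly merge the pair of clusters with the
    smallest linkage distance until k clusters remain."""
    link = {
        "complete": max,                          # Eq. (17): farthest pair
        "single": min,                            # Eq. (18): closest pair
        "average": lambda ds: sum(ds) / len(ds),  # Eq. (19): mean of all pairs
    }[linkage]

    def linkage_dist(ci, cj):
        return link([dist(a, b) for a in ci for b in cj])

    clusters = [[p] for p in points]
    while len(clusters) > k:
        # find the pair of clusters with the smallest linkage distance
        i, j = min(itertools.combinations(range(len(clusters)), 2),
                   key=lambda ij: linkage_dist(clusters[ij[0]], clusters[ij[1]]))
        clusters[i] += clusters.pop(j)  # merge them (i < j, so pop(j) is safe)
    return clusters
```

In the experiments, the elements are conferences and dist is derived from one of the similarity measures under evaluation.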
Before explaining the measures used to compare how well different data clustering algorithms perform on a set of data, we need the following definitions. Given a set of n elements S and two partitions X and Y of S, where X is the correct partition and Y is the computed partition, we define:

TP (true positive) is the number of pairs of elements in S that are in the same set in X and in the same set in Y

TN (true negative) is the number of pairs of elements in S that are in different sets in X and in different sets in Y

FN (false negative) is the number of pairs of elements in S that are in the same set in X and in different sets in Y

FP (false positive) is the number of pairs of elements in S that are in different sets in X and in the same set in Y
The measures to evaluate the performance of the clustering algorithms using the proposed similarity functions are:

Rand Index: measures the percentage of correct decisions made by the algorithm:
$$ RI = \frac{TP+TN}{TP+TN+FP+FN} $$ (20)

F-measure: balances the contribution of false negatives by weighting recall through a parameter β > 0:
$$ F=\frac{(\beta^2+1)\, P\, R}{\beta^2 P + R} $$ (21)
where \( P=\frac{TP}{TP+FP} \) and \( R=\frac{TP}{TP+FN} \)
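A hypothetical sketch of how these pair-based counts and measures can be computed, given the correct partition X and the computed partition Y as element-to-group mappings (the names are ours):

```python
import itertools

def pair_counts(X, Y):
    """TP/TN/FP/FN over all pairs of elements, where X maps each
    element to its set in the correct partition and Y maps it to
    its set in the computed partition."""
    tp = tn = fp = fn = 0
    for a, b in itertools.combinations(sorted(X), 2):
        same_x, same_y = X[a] == X[b], Y[a] == Y[b]
        if same_x and same_y:
            tp += 1    # together in both partitions
        elif not same_x and not same_y:
            tn += 1    # separated in both partitions
        elif same_x:
            fn += 1    # together in X, separated in Y
        else:
            fp += 1    # separated in X, together in Y
    return tp, tn, fp, fn

def rand_index(tp, tn, fp, fn):       # Eq. (20)
    return (tp + tn) / (tp + tn + fp + fn)

def f_measure(tp, fp, fn, beta=1.0):  # Eq. (21)
    p, r = tp / (tp + fp), tp / (tp + fn)
    return (beta ** 2 + 1) * p * r / (beta ** 2 * p + r)
```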
Figure 3 shows the Rand Index obtained by executing the hierarchical agglomerative clustering algorithm with different linkage criteria, using the Jaccard, Pearson, Cosine, and Communities similarity measures, based on author information and on conference keywords. Note that, among the measures based on author information, the algorithm based on the communities similarity in general had the best performance, followed by the Jaccard similarity, while the Cosine similarity had the worst behavior. The similarity measures based on conference keywords performed best overall, among which the Pearson and Cosine similarities achieved the best results.
Figure 4 shows the F-measure obtained by executing the same algorithms. Analyzing Fig. 4, we observe that the best performances for the group of similarity measures based on author information were again obtained with the communities and Jaccard similarity measures; the worst performance was obtained with the Pearson similarity measure, except under the single-link criterion, where the Cosine similarity measure performed worst. Again, all similarity measures based on conference keywords had better results than the group based on author information, among which the Cosine similarity stands out.
Therefore, these experiments suggest that the best performing algorithm is that which adopts the communities similarity measure.
Experiments with the conference recommendation techniques
Recall that we proposed three families of recommendation techniques. The first family is based on the similarity measures defined in the “Similarity measures based on author information” subsection. These techniques are called CF-Jaccard, CF-Pearson, CF-Cosine, and CF-Communities because they use, respectively, the Jaccard, Pearson, and Cosine similarities and the new community similarity measure. The second family includes two techniques based on the weighted and the modified weighted semantic connectivity scores, called the WSCS-based and MWSCS-based recommendation techniques. Finally, the third family comprises techniques based on a subgraph of the co-authorship network, called the ClusterWSCS-based and ClusterMWSCS-based techniques. In view of the results of the previous subsection, which evaluated the similarity measures for the clustering step of the ClusterWSCS-based and ClusterMWSCS-based techniques, we selected as clustering technique the agglomerative algorithm with complete link and cos_sim_tpc(x,y), due to the stability of its results.
We evaluated the conference recommendation techniques using the same dataset as in the previous subsection, with the 248 academic computer science conferences, and selected 243 random authors to predict their conference rankings; for that, we deleted all publications of each author in the conferences that we wanted to rank. We adopted Luong’s most frequent conference technique as the benchmark (see the “Related work” section).
Also recall that the mean average precision measures how good a recommendation ranking function is. Intuitively, let a be an author and C_a be a ranked list of conferences recommended for a. Let S_a be a gold standard for a, that is, the set of conferences considered to be the best ones to recommend for a. Then, we have:

Prec@k(C_a), the precision at position k of C_a, is the number of conferences in S_a that occur in C_a up to position k, divided by k

AveP(C_a), the average precision of C_a, is defined as the sum of Prec@k(C_a) for each position k in the ranking C_a in which a relevant conference for a occurs, divided by the cardinality of S_a:
$$ AveP(C_a)=\frac{\sum_k Prec@k(C_a)}{|S_a|} $$ (22)
MAP, the mean average precision of a rank score function over all the authors used in these experiments (represented by the set A), is then defined as follows:
$$ MAP = \mathrm{average}\{\, AveP(C_a) \mid a \in A \,\} $$ (23)
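Prec@k, AveP (Eq. 22), and MAP (Eq. 23) can be computed along the following lines (a sketch under our own naming; rankings maps each author to the recommended list C_a, and gold maps each author to the gold-standard set S_a):

```python
def prec_at_k(ranked, relevant, k):
    """Prec@k: fraction of the first k recommendations that are
    in the gold standard."""
    return sum(1 for c in ranked[:k] if c in relevant) / k

def avep(ranked, relevant):
    """AveP (Eq. 22): sum of Prec@k over each position k holding a
    relevant conference, divided by |S_a|."""
    return sum(prec_at_k(ranked, relevant, k)
               for k, c in enumerate(ranked, start=1)
               if c in relevant) / len(relevant)

def mean_average_precision(rankings, gold):
    """MAP (Eq. 23): average of AveP over all authors."""
    return sum(avep(rankings[a], gold[a]) for a in rankings) / len(rankings)
```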
Moreover, to evaluate whether the differences between the results are statistically significant, a paired Student’s t test [27, 28] was performed. According to Hull [29], the t test performs well even for distributions which are not perfectly normal. A p value is the probability that the observed differences occurred by chance; thus, low p values are good. We adopted the usual threshold of α = 0.01 for statistical significance, i.e., less than 1% probability that the experimental results happened by chance. When a paired t test obtains a p value less than α, there is a significant difference between the compared techniques.
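As a sketch, the paired t statistic underlying this test can be computed from the per-author differences between, e.g., the AveP values of two techniques (illustrative code with our own naming; the p value is then obtained from the t distribution with n - 1 degrees of freedom, which we omit here):

```python
import math

def paired_t_statistic(xs, ys):
    """Paired Student's t statistic for two matched samples, e.g.,
    per-author AveP values of two recommendation techniques."""
    assert len(xs) == len(ys) and len(xs) > 1
    diffs = [x - y for x, y in zip(xs, ys)]
    n = len(diffs)
    mean = sum(diffs) / n
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)  # sample variance
    return mean / math.sqrt(var / n)  # compare against t(n - 1)
```

The pair of techniques is declared significantly different when the two-sided p value derived from this statistic falls below α = 0.01, as in the experiments above.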
Consider first the two conference recommendation techniques based on the co-authorship network, the WSCS-based and MWSCS-based recommendation techniques. To compare them, we performed experiments that measured their runtime, accuracy, and average precision over the top 10 conferences of an author (thus, in this situation, the maximum value of |S_a| used in the AveP calculation is 10). Figure 5 shows the runtime results of the algorithms that implement these recommendation techniques. Note that the WSCS-based algorithm is by far the slowest, followed by the MWSCS-based one. The remaining algorithms had similar runtimes.
Table 1 shows the accuracy and MAP of the eight conference recommendation techniques we proposed, plus the benchmark. Two of the proposed techniques (first two rows of Table 1) have very similar accuracy: of the 243 authors that we tested, the correct predictions were 201 against 197. To better evaluate the results, we applied a paired t test to investigate whether there are statistically significant differences between the MAP results of these conference recommendation techniques. Table 2 shows the p values obtained by all t tests performed, where the boldface results represent differences which are not statistically significant.
Based on these results, the three best-performing techniques (WSCS-based, MWSCS-based, and ClusterWSCS-based) do not have statistically significant differences between their MAP results. The results also indicate that these three techniques have better MAP results than the benchmark (with statistically significant differences). The CF-Jaccard, CF-Communities, and ClusterMWSCS-based techniques have results very close to the benchmark (without statistically significant differences when compared to the benchmark) but worse than the three best-performing techniques (with statistically significant differences when compared to these three). The CF-Pearson and CF-Cosine techniques have poor accuracy (with statistically significant differences when compared to all other techniques).
Thus, among the three best-performing techniques, WSCS-based, MWSCS-based, and ClusterWSCS-based, we may conclude that the ClusterWSCS-based technique should be preferred because it is more efficient and maintains a MAP with no statistically significant differences when compared to the WSCS-based and MWSCS-based techniques.
Conclusions
In this article, we presented techniques to compare and recommend conferences. The techniques to compare conferences are based on some classical similarity measures and on a new similarity measure based on the co-authorship network communities of two conferences. The experiments suggest that the best performance is obtained using the new similarity measure.
We introduced three families of conference recommendation techniques, following the collaborative filtering strategy and based on (1) the similarity measures proposed to compare conferences; (2) the relatedness of two authors in the co-authorship network, using the Weighted and the Modified Weighted Semantic Connectivity Scores; and (3) conference clusters, generated using a subgraph of the co-authorship network instead of the full co-authorship network. The experiments suggest that the WSCS-based, MWSCS-based, and ClusterWSCS-based techniques perform better than the benchmark and better than the techniques based on similarity measures. Furthermore, among these three techniques, the ClusterWSCS-based technique should be preferred because it is more efficient and maintains a MAP with no statistically significant differences when compared to the WSCS-based and MWSCS-based techniques.
These conclusions should be accepted under the limitations of the experiments, though, which we recall adopted a set of 248 academic computer science conferences as gold standard and used a random sample of 243 authors. Further experiments ought to be performed with other sets of conferences and authors, perhaps obtained from sources different from DBLP. However, the question of defining a gold standard remains an issue.
In another direction, some of the techniques described in the paper might be applied to other domains that contain essentially three types of objects, analogous to “conferences,” “papers,” and “authors,” and two relationships, similar to “authored” and “published in.” One such domain would be that of “art museums,” “artworks,” and “artists,” with the relationships “created” and “exhibited in.” However, note that the notion of “co-authorship” would have no relevant parallel in the art domain. Again, the question of finding an appropriate data source and defining a gold standard would be an issue, which could be addressed as in [30].
A preliminary version of these results, except the techniques described in the “Similarity measure based on conference keywords” and “Conference recommendation techniques based on conference clusters” subsections and the t test described in the “Experiments with the conference recommendation techniques” subsection, was presented in [31].
As for future work, we plan to experiment with a similarity measure based on conference keywords expanded to include semantic relationships between keywords other than just synonymy. We also plan to explore other strategies for recommending conferences, such as taking into account the complexity level and writing style of the papers. Finally, we plan to expand the experiments to other publication datasets and other application domains, as already mentioned, and to make the tool and the test datasets openly available.
Abbreviations
DBLP:
dblp computer science bibliography
MWSCS:
Modified Weighted Semantic Connectivity Score
SCS:
Semantic Connectivity Score
SNA:
Social network analysis
WSCS:
Weighted Semantic Connectivity Score
References
 1.
Henry N, Goodell H, Elmqvist N, Fekete JD (2007) 20 years of four HCI conferences: a visual exploration. Int’l J of Human-Comp Inter 23(3):239–285
 2.
Blanchard EG (2012) On the WEIRD nature of ITS/AIED conferences. In: Proceedings of the 11th Int’l. Conf. on Intelligent Tutoring Systems, Chania, Greece, 14–18 June 2012, pp 280–285
 3.
Chen C, Zhang J, Vogeley MS (2009) Visual analysis of scientific discoveries and knowledge diffusion. In: Proceedings of the 12th Int’l. Conf. on Scientometrics and Informetrics—ISSI 2009, Rio de Janeiro, Brazil, 14–17 July 2009
 4.
Gasparini I, Kimura MH, Pimenta MS (2013) Visualizando 15 Anos de IHC. In: Proceedings of the 12th Brazilian Symposium on Human Factors in Computing Systems, Manaus, Brazil, 08–11 October 2013, pp 238–247
 5.
Barbosa SDJ, Silveira MS, Gasparini I (2016) What publications metadata tell us about the evolution of a scientific community: the case of the Brazilian human-computer interaction conference series. Scientometrics, First Online
 6.
Chen C, Song IY, Zhu W (2007) Trends in conceptual modeling: citation analysis of the ER conference papers (1979–2005). In: Proceedings of the 11th Int’l. Conf. of the International Society for Scientometrics and Informetrics, Madrid, Spain, 25–27 June 2007, pp 189–200
 7.
Zervas P, Tsitmidelli A, Sampson DG, Chen NS, Kinshuk (2014) Studying research collaboration patterns via co-authorship analysis in the field of TEL: the case of Educational Technology & Society journal. J Educ Technol Soc 17(4):1–16
 8.
Procópio PS, Laender AHF, Moro MM (2011) Análise da Rede de Coautoria do Simpósio Brasileiro de Bancos de Dados. In: Proceedings of the 26th Brazilian Symposium on Databases, Florianópolis, Brazil, 3–6 Oct. 2011
 9.
Cheong F, Corbitt BJ (2009) A social network analysis of the co-authorship network of the Australasian Conference of Information Systems from 1990 to 2006. In: Proceedings of the 17th European Conf. on Info. Systems, Verona, Italy, 8–10 June 2009
 10.
Cheong F, Corbitt BJ (2009) A social network analysis of the co-authorship network of the Pacific Asia Conference on Information Systems from 1993 to 2008. In: Proceedings of the Pacific Asia Conference on Information Systems 2009, Hyderabad, India, 10–12 July 2009, Paper 23
 11.
Lopes GR, Nunes BP, Leme LAPP, Nurmikko-Fuller T, Casanova MA (2015) Knowing the past to plan for the future—an in-depth analysis of the first 10 editions of the WEBIST conference. In: Proceedings of the 11th Int’l. Conf. on Web Information Systems and Technologies, Lisbon, Portugal, 20–22 May 2015, pp 431–442
 12.
Lopes GR, Nunes BP, Leme LAPP, Nurmikko-Fuller T, Casanova MA (2016) A comprehensive analysis of the first ten editions of the WEBIST conference. Lect. Notes in Business Information Processing 246:252–274
 13.
Batista MGR, Lóscio BF (2013) OpenSBBD: Usando Linked Data para Publicação de Dados Abertos sobre o SBBD. In: Proceedings of the 28th Brazilian Symposium on Databases, Recife, Brazil, 30 Sept.–03 Oct. 2013
 14.
Medvet E, Bartoli A, Piccinin G (2014) Publication venue recommendation based on paper abstract. In: Proceedings of the IEEE 26th International Conference on Tools with Artificial Intelligence, 10–12 Nov. 2014
 15.
Pham MC, Cao Y, Klamma R, Jarke M (2011) A clustering approach for collaborative filtering recommendation using social network analysis. J Univers Comput Sci 17(4):583–604
 16.
Chen Z, Xia F, Jiang H, Liu H, Zhang J (2015) AVER: random walk based academic venue recommendation. In: Companion Proceedings of the 24th International Conference on World Wide Web, pp 579–584
 17.
Boukhris I, Ayachi R (2014) A novel personalized academic venue hybrid recommender. In: Proceedings of the IEEE 15th International Symposium on Computational Intelligence and Informatics, 19–21 Nov. 2014
 18.
Yang Z, Davison BD (2012) Venue recommendation: submitting your paper with style. In: Proceedings of the 11th International Conference on Machine Learning and Applications, 12–15 Dec. 2012, pp 12–15
 19.
Huynh T, Hoang K (2012) Modeling collaborative knowledge of publishing activities for research recommendation. In: Computational Collective Intelligence. Technologies and Applications, volume 7653 of LNCS, pp 41–50
 20.
Asabere NY, Xia F, Wang W, Rodrigues JC, Basso F, Ma J (2014) Improving smart conference participation through socially aware recommendation. IEEE Trans Hum Mach Syst 44(5):689–700
 21.
Hornick M, Tamayo P (2012) Extending recommender systems for disjoint user/item sets: the conference recommendation problem. IEEE T Knowl Data En 24(8):1478–1490
 22.
Luong H, Huynh T, Gauch S, Do L, Hoang K (2012) Publication venue recommendation using author network’s publication history. In: Proceedings of the 4th Asian Conf. on Intelligent Information and Database Systems—ACIIDS 2012, Kaohsiung, Taiwan, 19–21 March 2012, pp 426–435
 23.
Nunes BP, Kawase R, Fetahu B, Dietze S, Casanova MA, Maynard D (2013) Interlinking documents based on semantic graphs. Procedia Comput Sci 22:231–240
 24.
Katz L (1953) A new status index derived from sociometric analysis. Psychometrika 18(1):39–43
 25.
García GM (2016) Analyzing, comparing and recommending conferences. M.Sc. Dissertation, Department of Informatics, PUC-Rio, Rio de Janeiro. https://doi.org/10.17771/PUCRio.acad.27295
 26.
Leskovec J, Rajaraman A, Ullman JD (2014) Mining of Massive Datasets. Cambridge University Press, Cambridge
 27.
Baeza-Yates RA, Ribeiro-Neto BA (2011) Modern information retrieval—the concepts and technology behind search, 2nd edn. Pearson Education Ltd., Harlow, England
 28.
Manning CD, Raghavan P, Schütze H (2008) Introduction to Information Retrieval. Cambridge University Press, New York
 29.
Hull D (1993) Using statistical testing in the evaluation of retrieval experiments. In: Proceedings of the 16th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’93. ACM, New York, NY, USA, pp 329–338
 30.
Ruback L, Casanova MA, Renso C, Lucchese C (2017) SELEcTor: discovering similar entities on LinkEd DaTa by ranking their features. In: Proceedings of the 11th IEEE International Conference on Semantic Computing, San Diego, USA, 30 Jan.–2 Feb. 2017
 31.
García GM, Nunes BP, Lopes GR, Casanova MA (2016) Comparing and recommending conferences. In: Proceedings of the 5th BraSNAM—Brazilian Workshop on Social Network Analysis and Mining, Porto Alegre, Brazil, 05 July 2016
Acknowledgements
This work was partly funded by CNPq under grants 444976/2014-0, 303332/2013-1, 442338/2014-7, and 248743/2013-9 and by FAPERJ under grant E-26/201.337/2014.
Authors’ contributions
GMG defined the new similarity measures based on the coauthorship network, and implemented and evaluated all techniques, under the supervision of MAC, BPN, GRL, and LAPPL. All authors read and approved the final manuscript.
Competing interests
The authors declare that they have no competing interests.
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.
Keywords
 Conference comparison
 Conference recommendation
 Co-authorship networks
 Social network analysis
 Recommender systems
 Linked data