A tagging-based system usually has three sources of information available when it needs to recommend tags: (i) the Web resource, (ii) the folksonomy of the TBS, and (iii) the user’s personomy. Each one of these has a particular importance in the recommendation process.
The Web resource is the main element of a categorization, and its content can be available in many forms, such as text, pictures, videos, and Flash animations. In this work, we considered only textual content (i.e., any Web page with some text). One important aspect of a Web resource’s content that is generally overlooked is that it can express some features of the context of the resource categorization. For any Web resource, several factors influence the context in which it could be used, but for a Web page the context can usually be determined by the way in which the author employs the vocabulary to present the content to the reader. Extracting the context of a Web page is not trivial, since the text of the page may contain misspellings, synonymy, polysemy, inflected terms, and parts that do not refer to the content (e.g., header, footer, menu columns, advertisements), among other issues. Taking all these aspects into account, we decided to represent the context of a Web page by the set of keywords that is most representative of its characteristics and properties, i.e., the most relevant terms contained in its term-vector (Footnote 1) [26]. It is also possible to use Web resource metadata, when available, to represent a summary of the content [30], but, like any summary, it does not convey the richness of the content itself for the generation of the candidate tags.
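As a rough illustration of representing a page by the most relevant terms of its term-vector, the following minimal sketch ranks terms by raw frequency. The weighting scheme, the tiny stop-word list, and the function names `term_vector` and `context_keywords` are assumptions made for this example only; the text does not specify how term relevance is computed.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list (an assumption for this sketch).
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "for", "on", "with", "that"}


def term_vector(text):
    """Map each remaining term of the text to its raw frequency."""
    tokens = re.findall(r"[a-z]+", text.lower())
    # Drop stop words and very short tokens before counting.
    return Counter(t for t in tokens if t not in STOP_WORDS and len(t) > 2)


def context_keywords(text, k=5):
    """Return the k most frequent terms as a stand-in for the page context."""
    vector = term_vector(text)
    # Sort by descending frequency, then alphabetically so ties are deterministic.
    ranked = sorted(vector.items(), key=lambda item: (-item[1], item[0]))
    return [term for term, _ in ranked[:k]]


page = "Python tutorial: learn Python programming. The tutorial covers Python basics."
print(context_keywords(page, k=2))
```

Raw frequency is used here only because it needs no corpus statistics. A real system would also have to deal with the misspellings, inflections, and non-content page regions mentioned above.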
A TBS folksonomy normally reflects the vocabulary that is common to the system’s users [28], providing a social view of the categorized resources, which a single user could never have by themselves. Thus, using the TBS folksonomy data to recommend tags can always be a valid alternative, since the user is categorizing a resource that others in the community have already categorized and, therefore, there may be a common interest, which guarantees the utility of the folksonomy tags. Also, the idiosyncrasies present in a folksonomy may benefit the information retrieval process, as they represent alternative and interesting terms for the users (which makes the serendipity effect possible [38]).
The user’s personomy can bring together a wide diversity of knowledge about the individual, since in a categorization users express, through the tags they use, their knowledge, intentions, and terminology preferences related to the content of each resource [32]. By analyzing a user’s personomy, it is possible to guide the tag recommendation to target the user’s vocabulary, offering terms according to their preferences [34, 41].
Reviewing the recent literature on tag recommendation, we found that current approaches generally focus on the system folksonomy as their main source of information [27]. A small number of approaches also use the Web resource content to assist in the recommendation. Among them are Lu et al. [24], who analyze the Web resource content and combine it with the tags of similar resources to generate the candidate recommendations; Song et al. [37], who extract the document vector of the Web resource and apply statistical techniques over a bipartite graph of the words, tags, and resources to generate the candidate tags; and Heymann et al. [16], who use the Web resource text, anchor text, and surrounding hosts to generate the recommended tags. Another small number of approaches employ information about the user together with information from the folksonomy and Web resource metadata to select the tags that will be recommended. Among them are Lipczak [23], who uses the Web resource title and co-occurrence analysis to expand the set of candidate tags and then filters it with the user’s personomy to obtain the final recommendation; Musto et al. [30], who use the Web resource metadata to generate the candidates and the user information to personalize the final recommendation; and Tatu et al. [40], who use the Web resource to extract a combination of semantic and statistical characteristics to construct models of users and documents, which are then used to generate and select the recommended tags; they also employ WordNet (Footnote 2) to standardize concepts.
From the above review, we can observe that very few systems have tried to combine the three sources of information. Also, although some systems have used a semantic approach, the level of semantics they explore is shallow. Most of them use complex statistical techniques to identify some level of semantic relations among the concepts and use these relations to inform the selection of candidates for the recommended tags. We propose to use these three sources together with an approach based on the recommendation of semantic tags that explores relations among the concepts. The three sources of information taken into account in this proposal can be combined in various ways, which we discuss in the next section.
Possible scenarios for a tag recommendation process in a TBS
Taking into account the three sources of information discussed above, eight different scenarios can occur in a tag recommendation process, as shown in Fig. 1. We can divide them into two groups: those that do not analyze the Web resource (1 through 4) and those that do (5 through 8). Within each group, the only variables are whether there is a sufficient amount of data in the user’s personomy and in the TBS’s folksonomy to be used by the recommender system.
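The eight scenarios are the combinations of three yes/no conditions, which can be sketched as follows. The function name `scenario_number` and the arithmetic encoding are illustrative assumptions, chosen so that the result matches the scenario numbering used in this section's descriptions.

```python
from itertools import product


def scenario_number(resource_analyzed, has_folksonomy, has_personomy):
    """Map the availability of the three sources to a scenario number (1 to 8)."""
    # Personomy adds 1, folksonomy adds 2, and an analyzed Web resource adds 4.
    return 1 + int(has_personomy) + 2 * int(has_folksonomy) + 4 * int(resource_analyzed)


# Enumerate all 2^3 combinations of (resource, folksonomy, personomy).
for resource, folksonomy, personomy in product([False, True], repeat=3):
    number = scenario_number(resource, folksonomy, personomy)
    print(f"Scenario {number}: resource={resource}, "
          f"folksonomy={folksonomy}, personomy={personomy}")
```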
Let us first consider the scenarios where the Web resources are not used as a source of information, since this is the common case for the recommender systems in current TBSs.
Scenario 1: A user, without any information in his/her personomy, is trying to categorize a resource using a TBS without information in its folksonomy about that resource. This is the worst-case scenario for a tag recommender system, and represents the situation faced by a new system user categorizing a new resource. Since we do not have access to any of the three sources of information, there is no way to generate recommendations. This is an instance of the cold start problem (i.e., the problem of generating recommendations for a resource without any source of information from which to take the terms to recommend) and occurs in most of the recommender systems in current TBSs.
Scenario 2: A user, with information in his/her personomy, is trying to categorize a resource using a TBS without information in its folksonomy about that resource. This is the typical case of a user trying to categorize a new resource in a TBS. As a personomy only has information about resources already categorized by the user, there is no way to obtain information about the resource currently being categorized and, therefore, we can only generate recommendations based on the user’s overall interests, given by the most used vocabulary in his/her personomy. Most of the time this kind of data is of little use and generates low-quality recommendations. This is another instance of the cold start problem and also occurs in most recommender systems in current TBSs.
Scenario 3: A user, without information in his/her personomy, is trying to categorize a resource using a TBS with information in its folksonomy about that resource. This is another possible scenario for a new system user, but since the resource has already been evaluated and categorized by other system users, it is possible to generate recommendations from a social point of view. However, it will not be possible to personalize these recommendations to match the user’s vocabulary, since there is no information in their personomy. The use of the folksonomy as the only source of information is the common case for the majority of the recommender systems in current TBSs.
Scenario 4: A user, with information in his/her personomy, is trying to categorize a resource using a TBS with information in its folksonomy about that resource. This could be considered a desirable scenario for a recommender system, since it would be possible to use the folksonomy’s social point of view to generate recommendations, giving priority to the terms most used by the community, and also to use the user’s personomy to further personalize the recommendation data to match their vocabulary. The use of the folksonomy together with the personomy as sources of information is the configuration used by the Delicious system (Footnote 3).
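The behavior described for Scenarios 1 through 4 can be sketched as a single fallback routine. This is a minimal sketch under stated assumptions: the function name `recommend_without_resource`, the flat tag-list inputs, and the rule that tags already in the user's vocabulary rank first are illustrative choices, not the method of any particular TBS.

```python
from collections import Counter


def recommend_without_resource(folksonomy_tags, personomy_tags, k=3):
    """Recommend up to k tags when the Web resource itself is not analyzed.

    folksonomy_tags: tags other users assigned to this resource (may be empty).
    personomy_tags: all tags the current user has assigned so far (may be empty).
    """
    community = Counter(folksonomy_tags)
    personal = Counter(personomy_tags)
    if community:
        # Scenarios 3 and 4: rank community tags by frequency; tags the user
        # already uses come first (assumed personalization rule).
        ranked = sorted(community, key=lambda t: (t not in personal, -community[t], t))
    else:
        # Scenario 2: fall back to the user's most used vocabulary.
        # Scenario 1: both sources are empty, so the result is [] (cold start).
        ranked = sorted(personal, key=lambda t: (-personal[t], t))
    return ranked[:k]


print(recommend_without_resource([], []))
print(recommend_without_resource(["python", "python", "code", "tutorial"], ["tutorial"]))
```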
For the other four scenarios, we will assume that the Web resource was analyzed and a representation of some features of the context of the categorization is available.
Scenario 5: A user, without information in his/her personomy, is trying to categorize a resource using a TBS without information in its folksonomy about that resource. Unlike in Scenario 1, since we have information from the Web resource, it will be possible to generate recommendations from the extracted contextual data, avoiding the cold start problem.
Scenario 6: A user, with information in his/her personomy, is trying to categorize a resource using a TBS without information in its folksonomy about that resource. Again, unlike in Scenario 2, since we have information from the Web resource, it will be possible to generate recommendations from the extracted contextual data, avoiding the cold start problem. In addition, in this scenario it would be possible to personalize the recommendations to the user’s preferences, based on the information contained in their personomy.
Scenario 7: A user, without information on his/her personomy, is trying to categorize a resource using a TBS with information on its folksonomy about that resource. What makes this scenario different from Scenario 3 is that it will be possible to use the folksonomy data to filter the contextual data extracted from the Web resource, giving priority to the terms most used by the community.
Scenario 8: A user, with information on his/her personomy, is trying to categorize a resource using a TBS with information on its folksonomy about that resource. In this scenario, besides the social filter employed in the last scenario, we could also personalize the recommendation data to the user’s preferences based on the vocabulary of his/her personomy. In this way, this could be considered the best case scenario for a recommender system.
Although it is possible to make recommendations without analyzing the Web resource, as discussed in Scenarios 2, 3, and 4, using it as a source of information could lead to better recommendations. This is because the recommended tags will come from terms present in the Web resource, which normally makes them easier for the user to remember. Even better, if the folksonomy information is available, it would be possible to further improve the tags’ quality by applying a social filter to them, giving priority to the tags most used by the community. In addition, if the personomy information is available, it would be possible to make the recommended tags more memorable by taking into account the user’s vocabulary preferences, which should contribute to the retrieval of the resource.
One more aspect is worth mentioning. Scenarios 5 and 6 are the variations of Scenarios 1 and 2, in which the cold start problem normally occurs in the recommender systems of current TBSs. As discussed there, using the Web resource as a source of information allows the system to deliver recommendations to users even in these cases, avoiding the cold start problem.
Taking all these aspects into consideration, we claim that using these three sources of information to recommend semantic tags to the users of a TBS can enrich the quality of the user’s personomy. The adoption of semantic tags would avoid the use of mistaken terms and ambiguity and would improve the quality of the user’s tags from the beginning. To obtain semantic tags, we developed an algorithm that analyzes and combines ontologies that emerge from the three sources discussed, which will be presented in the following sections.
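The layering described above (candidates from the Web resource, then a social filter, then personalization) can be sketched as a simple scoring pass. The function name `recommend_with_resource`, the additive score, and the `PERSONAL_BONUS` weight are hypothetical choices for illustration. The keywords are assumed to have been extracted from the page beforehand and to be listed in order of relevance.

```python
from collections import Counter

# Hypothetical weight for terms already in the user's vocabulary (an assumption).
PERSONAL_BONUS = 0.5


def recommend_with_resource(keywords, folksonomy_tags=(), personomy_tags=(), k=3):
    """Rank page keywords, boosted by community usage and user vocabulary."""
    community = Counter(folksonomy_tags)  # empty in Scenarios 5 and 6
    personal = set(personomy_tags)        # empty in Scenarios 5 and 7
    scored = []
    for position, term in enumerate(keywords):
        # Social filter: terms used more often by the community score higher.
        score = community[term]
        # Personalization: small bonus for terms the user already uses.
        if term in personal:
            score += PERSONAL_BONUS
        # Ties keep the original keyword order via the position field.
        scored.append((-score, position, term))
    return [term for _, _, term in sorted(scored)[:k]]


keywords = ["python", "tutorial", "basics", "programming"]
print(recommend_with_resource(keywords))
print(recommend_with_resource(keywords, ["programming", "programming", "python"], ["basics"]))
```

With both optional sources empty, the routine still returns the top page keywords, which is how the resource-based scenarios sidestep the cold start problem.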