ApproxMap: a method for mapping blank nodes in RDF datasets
 Juliano de Almeida Monte-Mor^{1, 2} and
 Adilson Marques da Cunha^{2}
https://doi.org/10.1186/s13173-015-0022-3
© Monte-Mor and Cunha; licensee Springer. 2015
Received: 21 October 2013
Accepted: 16 October 2014
Published: 28 April 2015
Abstract
Background
Versioning has proven to be essential in areas like software development and data and knowledge management. For systems or applications making use of documents formatted according to the Resource Description Framework (RDF) standard, it is difficult to calculate the difference between two versions, owing to the presence of blank nodes (also known as bnodes) in RDF graphs. These are anonymous nodes that can assume different identifiers between versions. In this case, the challenge lies in finding a mapping between the sets of blank nodes in the two versions while minimizing the operations needed to convert one version into the other.
Methods
Within this context, we propose an algorithm, named ApproxMap, for mapping bnodes based on extended concepts of rough set theory, which provides a way to measure the proximity of bnodes and map them with closer approximations. Our heuristic method considers various strategies for reducing both the number of comparisons between blank nodes and the delta between the compared versions. The proposed algorithm has a worst-case time complexity of O(n^2).
Results
ApproxMap showed satisfactory performance in our groups of experiments, as the algorithm that obtained solutions closest to the optimal values. This algorithm succeeded in finding the optimal delta size in 59% of the tests involving optimal values. ApproxMap achieved a delta size smaller than or equal to those of existing algorithms in at least 95% of the tested cases.
Conclusions
The results show that the proposed algorithm can be successfully applied to versioning RDF documents, such as those produced by software processes with iterative and incremental development. We recommend applying ApproxMap in various situations, particularly those involving similar versions and directly connected bnodes.
Background
In areas such as software engineering, databases, and Web publishing, methods for versioning have already been developed and successfully applied. These methods must be able to calculate the differences (i.e., deltas) between versions to provide efficient storage of subsequent versions.
Particularly in software engineering, versioning algorithms are usually based on a comparison of text lines. However, these methods are not suitable for controlling versions of structured or semi-structured documents. In this article, we focus specifically on the version control of documents following a Semantic Web standard, the Resource Description Framework (RDF) [1]. We have applied Semantic Web technologies in the software configuration management (SCM) domain [2].
RDF defines a basic data model for writing simple statements about Web objects or resources. It allows the definition of sentences through ‘subject-predicate-object’ triples; that is, a resource, a property, and a value (which can be a literal or a resource). An RDF triple, like a graph’s edge, provides a binary relationship (predicate) that relates a subject to an object. Thus, an RDF document or dataset can be represented by a directed graph [3].
The conventional line-oriented mechanisms in software engineering are insufficient in the Semantic Web context because their deltas are based on unique serializations, which do not occur naturally in RDF datasets [4]. These bases usually consist of unordered collections of affirmations about resources; moreover, even when a standard serialization order is imposed (e.g., by sorting), existing comparison tools fail to consider knowledge inferred from schemas associated with RDF datasets [5].
Thus, to obtain the delta between two versions of an RDF dataset, we need to map the nodes in the graphs representing these versions. However, the main problem encountered during calculation of the delta concerns the existence of anonymous nodes (i.e., blank nodes or bnodes) in the RDF graphs. Bnodes represent resources that are not identified by a uniform resource identifier (URI) or literals. In this case, the mapping between bnodes contained in different graph versions directly influences the size of the deltas.
First, we can easily map bnodes ‘_:3’, ‘_:4’, and ‘_:5’ to bnodes ‘_:8’, ‘_:10’, and ‘_:9’, respectively. Then, by mapping bnode ‘_:1’ to ‘_:6’ and ‘_:2’ to ‘_:7’, which seems to be a natural choice, we obtain a delta consisting of four triples. In other words, transforming the first graph into the second requires removing triples <_:1, friend, _:4> and <_:2, friend, _:5> and adding triples <_:1, friend, _:5> and <_:2, friend, _:4>. However, if we were to map bnode ‘_:1’ to ‘_:7’ and ‘_:2’ to ‘_:6’, we would have a delta consisting of two triples; that is, triple <_:1, brother, _:3> must be removed and triple <_:2, brother, _:3> added. The latter mapping is better, owing to the smaller delta size. In the case of directly connected bnodes, we believe that a mapping based on a bottom-up strategy, where nodes in the lower levels are mapped before those in the upper levels, can help reduce the delta size.
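As a concrete illustration of the two mappings above, the graphs can be reconstructed from the triples mentioned in the text (the accompanying figure is not reproduced here, so the exact graphs are an assumption); a minimal Python sketch:

```python
def delta(g1, g2, mapping):
    """Triples to remove from and add to g1 so that, after renaming its
    bnodes with `mapping`, it equals g2."""
    mapped = {(mapping.get(s, s), p, mapping.get(o, o)) for (s, p, o) in g1}
    return mapped - g2, g2 - mapped   # (removed, added)

# Graphs reconstructed from the triples cited in the example (an assumption).
g1 = {("_:1", "friend", "_:4"), ("_:2", "friend", "_:5"), ("_:1", "brother", "_:3")}
g2 = {("_:6", "friend", "_:9"), ("_:7", "friend", "_:10"), ("_:6", "brother", "_:8")}

base = {"_:3": "_:8", "_:4": "_:10", "_:5": "_:9"}
m1 = {**base, "_:1": "_:6", "_:2": "_:7"}   # "natural" choice: delta of 4 triples
m2 = {**base, "_:1": "_:7", "_:2": "_:6"}   # better choice: delta of 2 triples

for m in (m1, m2):
    removed, added = delta(g1, g2, m)
    print(len(removed) + len(added))   # 4, then 2
```

The second mapping wins precisely because the two `friend` triples then line up and only the single `brother` triple changes.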
During bnode mapping, we need to address inaccuracies between the modified bnodes. To facilitate the handling of this imprecision, we chose to extend some concepts of rough set theory (RST) [7]. RST has already been successfully applied in several areas like artificial intelligence and cognitive sciences. Nicoletti et al. [8] presented the following application examples: creation of machine learning methods, knowledge representation, inductive reasoning, data mining, processing of imperfect or incomplete information, pattern recognition, and discovery of knowledge in databases.
In this context, our approach proposes a heuristic method for mapping blank nodes based on RST. This theory serves as the conceptual basis for the definition of metrics to assist in the choice of bnode pairs, providing the necessary support to map a bnode to the candidate with the closest approximation. Our main objective is to create an algorithm that can be successfully applied in software project versioning.
The remainder of this article is organized as follows: ‘Related work’ subsection gives an overview of existing work on calculating deltas and mapping bnodes. In ‘Problem description’ subsection, we formally describe the problem addressed in this work, while ‘Rough set theory’ discusses some basic concepts of RST. In ‘Blank nodes as rough sets’, we define a bnode representation model using rough sets, which is necessary for specifying the proposed mapping algorithm in ‘The ApproxMap method’ section. ‘Results and discussion’ discusses some experimental results, while ‘Conclusions’ presents our conclusions and recommendations.
Related work
Particularly in the software engineering domain, relatively little effort has been made to develop methods for obtaining a better blank node mapping between two versions, by reducing their delta size. Next, we briefly describe some studies on RDF dataset versioning, explaining how they handle blank nodes.
Berners-Lee and Connolly [4] discussed comparing RDF graphs and updating a graph from a calculated set of differences. They emphasized that the order and identification of bnodes can differ arbitrarily between serializations of the same graph. Hence, calculating deltas based on line-oriented approaches is a problem. Computing the differences between two graphs is simple and straightforward if all nodes are named. However, when not all bnodes are named, finding the largest common subgraph becomes an instance of the graph isomorphism problem. The authors further suggested that available solutions for the general isomorphism problem do not appear to be good matches for practical cases. Thus, they proposed an algorithm that produces an RDF difference only for graphs named directly with URIs or indirectly with functional or inverse functional properties. We extend their approach by performing the mapping considering unnamed nodes as well.
Carroll [9] showed that standard algorithms for graph isomorphism can be used to compare RDF graphs. He developed an algorithm considering an iterative vertex classification, used in his RDF toolkit Jena, where each anonymous resource is identified based on the statements in which it appears. Thus, bnodes receive identifiers considering their local contexts, which can change between different versions. In our approach, although we do not produce identifiers for bnodes, we also consider the triples in which they appear to classify approximations between bnode pairs.
Noy et al. [10-12] presented an algorithm, called PromptDiff, which combines different heuristic matchers to map RDF graphs by comparing structural properties of the ontology versions. New matchers, which may be needed to compare anonymous classes, can easily be added. The authors made two observations when comparing versions of the same ontology: a large proportion of the frames remain unchanged between versions; and if two frames have the same type and name (or a very similar name), they are almost certainly copies of one another. We follow the first observation by first mapping equivalent bnodes. We also include some heuristic strategies in the design of our method.
Auer and Herre [13] suggested a framework to support versioning and the evolution of RDF knowledge bases. Their framework is based on atomic changes, including the addition or removal of RDF graph statements. Atomic changes encompass all statements containing bnodes in a delta, where a graph is atomic if it cannot be split into two non-empty graphs with disjoint blank nodes. In contrast to our approach, because Auer and Herre did not aim to find a mapping between bnodes, there was no commitment to obtaining the smallest delta.
Voelkel and Groza [14] showed a versioning approach, called SemVersion, which provides structural and semantic versioning for models in RDF/S and OWL. In their approach, bnodes were given unique identifiers in all versions. To identify equal blank nodes across models, they proposed a method for blank node enrichment, where URIs are attached as inverse functional properties to blank nodes. However, this means that blank nodes with different identifiers cannot be mapped, even if they represent the same element in different versions. Moreover, in our approach, we do not add any information to the datasets and do not consider unique identifiers for bnodes in different versions.
Cassidy and Ballantine [15] and Im et al. [16] presented versioning models for RDF repositories. They provided a collaborative annotation facility to develop and share annotations over the Web. Im et al. proposed a version framework for an RDF data model based on relational databases. None of these authors, however, considered blank nodes in their research or defined any method for mapping bnodes, as we do in our approach. These researchers addressed only procedures enabling versioning in RDF repositories.
Tzitzikas et al. [6] proposed two polynomial-time algorithms for mapping bnodes between two knowledge bases. Seeking to reduce the size of the resulting delta, the authors modeled the problem of bnode mapping as an assignment problem and used the Hungarian method [18], Alg_Hung, to solve it. This method seeks to find the optimal solution with time complexity O(n^3).
Alg_Hung obtains the optimal delta if the considered knowledge bases do not have interconnected bnodes. In the case where the datasets have directly connected bnodes, the authors assume that all neighboring bnodes are equal during mapping. This method cannot be applied to larger knowledge bases, owing to its quadratic space requirement in terms of RAM [6].
These authors also proposed a faster signature-based method, called Alg_Sign, for comparing large knowledge bases, with time complexity O(n · log n). For each bnode, Alg_Sign produces a string based on its direct neighborhood as the bnode’s signature. Thereafter, the mapping phase compares the generated strings, sorted lexicographically to allow a binary search. The cost of reducing the mapping time is a probable increase in the delta size [6].
Through experiments, Tzitzikas et al. verified that their algorithms obtain deltas with large sizes if the number of directly connected bnodes is high. In this case, once the direct neighborhoods lose their discrimination ability, the delta reduction potential becomes more unstable [6].
Because the number of directly connected bnodes affects the results of both Alg_Hung and Alg_Sign, we propose a greedy method with a different strategy: neighboring bnodes are treated as different nodes until they have been mapped in a previous iteration. Our proposal aims to develop a method with lower memory overhead than the Alg_Hung algorithm, while reducing the probable increase in delta size when compared with Alg_Sign.
Research performed before that of Tzitzikas et al. [6] did not seek a mapping that reduces the delta between versions. Tzitzikas et al. were the first to address the bnode mapping problem as an optimization problem, as described in the next section. Accordingly, their work served as the basis for implementing our approach, enabling a comparison between our method and their proposed algorithms.
Problem description
In this section, we describe the problem addressed in this article, as defined by Tzitzikas et al. [6]. An RDF knowledge base, i.e., an RDF graph, consists of a finite set of RDF triples. Each RDF triple (s, p, o) belongs to (W ∪ B) × W × (W ∪ B ∪ L), where W is an infinite set of URIs, B is an infinite set of blank nodes, and L is an infinite set of literals. Assuming W_k, B_k, and L_k are the sets of URIs, blank nodes, and literals of an RDF graph G_k, respectively, the equivalence between two RDF graphs can be defined as follows:
Definition 1.
(from [6]) Let N_1 and N_2 be the sets of nodes of two RDF graphs G_1 and G_2. The graphs are equivalent if a bijection M: N_1 → N_2 exists such that:

M(uri) = uri, for each uri ∈ W_1 ∩ N_1;

M(lit) = lit, for each lit ∈ L_1;

M maps bnodes to bnodes (i.e., for each b ∈ B_1 it holds that M(b) ∈ B_2); and

triple (s, p, o) is in G_1 if, and only if, triple (M(s), p, M(o)) is in G_2.
Tzitzikas et al. denoted this equivalence between two graphs G_1 and G_2 as G_1 ≡_M G_2. Moreover, they also defined the edit distance between two nodes, as given in Definition 2. From these two definitions, the equivalence between graphs G_1 and G_2 can be characterized as in Theorem 1.
Definition 2.
(from [6]) Let o_1 and o_2 be nodes in G_1 and G_2, respectively. Suppose a bijection exists between the nodes of these graphs, i.e., a function M: N_1 → N_2 (obviously |N_1| = |N_2|). Then, the edit distance between o_1 and o_2 over M, denoted by dist_M(o_1, o_2), is the number of additions or deletions of triples required to make the ‘direct neighborhoods’ of o_1 and o_2 the same (that is, where M-mapped nodes are the same). Formally, dist_M(o_1, o_2) is the size of the symmetric difference between the M-mapped direct neighborhood of o_1 and the direct neighborhood of o_2.
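Reading Definition 2 as the size of the symmetric difference between the M-mapped direct neighborhood of o_1 and the direct neighborhood of o_2, a minimal sketch follows (the function names and sample graphs are illustrative, not from [6]):

```python
def neighborhood(node, graph):
    """Triples of `graph` in which `node` appears as subject or object."""
    return {t for t in graph if node in (t[0], t[2])}

def dist(o1, o2, g1, g2, m):
    """Edit distance between o1 and o2 over the bijection m: additions or
    deletions needed to make their direct neighborhoods the same."""
    mapped = {(m.get(s, s), p, m.get(o, o)) for (s, p, o) in neighborhood(o1, g1)}
    return len(mapped ^ neighborhood(o2, g2))   # symmetric difference

# Illustrative graphs: o1 has one extra triple relative to o2.
g1 = {("_:a", "p", "u1"), ("_:a", "q", "u2")}
g2 = {("_:b", "p", "u1")}
print(dist("_:a", "_:b", g1, g2, {"_:a": "_:b"}))   # 1
```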
Theorem 1.
In the case of versioning, current interest lies in nonequivalent knowledge bases. In this case, it is necessary to find a mapping between bnodes in the two knowledge bases, B _{1} and B _{2}, that reduces the delta resulting from a comparison thereof.
Definition 3.
Theorem 2.
Therefore, considering the context of this problem described by Tzitzikas et al., we propose a greedy method that seeks to reduce the delta size between two RDF graphs, obtaining an approximate solution to the bijection between the bnodes of these RDF graphs. For this purpose, we define some metrics extending various concepts of RST. In the next section, we present some basic concepts of this theory, which are considered in the design of our algorithm.
Rough set theory
RST is an extension of set theory, consisting of a mathematical model for uncertainty and imprecision handling, knowledge representation, and rough classification. The main advantage of using RST is that it does not require any preliminary or additional information about the data, such as a probability distribution or membership degree.
In our approach, we adopt RST as the formalism for dealing with imprecision resulting from the comparison of bnode pairs. RST also forms the conceptual basis of defining metrics for measuring the closeness between bnode pairs. Our method aims to map the closest bnode pairs in an attempt to reduce the delta size. Next, we present the main concepts of this theory, extracted from [7,19].
Basic concepts
Let U be a finite, nonempty universe set of objects. In U, we can define subsets using an equivalence relation R, called the indiscernibility relation. Relation R induces a partition (and, consequently, a classification) of the objects in U. Thus, an approximation space consists of an ordered pair A = (U, R), where, given x, y ∈ U, if xRy then x and y are indiscernible in A. The equivalence class defined by x is the same as that defined by y, i.e., [x]_R = [y]_R.

Lower approximation of X in A: formed by the union of all elementary sets of A fully contained in X, i.e., the largest definable set in A contained in X:$$ A_{\text{inf}}(X) = \left\{x \in U \mid [x]_{R} \subseteq X\right\}. $$(7)

Upper approximation of X in A: formed by the union of all elementary sets of A having a nonempty intersection with X, i.e., the smallest definable set in A containing X:$$ A_{\text{sup}}(X) = \left\{x \in U \mid [x]_{R} \cap X \neq \emptyset\right\}. $$(8)

Positive region of X in A: formed by the union of all elementary sets of U fully contained in X:$$ \text{pos}(X) = A_{\text{inf}}(X). $$(9)

Negative region of X in A: formed by the elementary sets of U that have no elements in X:$$ \text{neg}(X) = U - A_{\text{sup}}(X). $$(10)

Doubtful region of X in A: also called the boundary of X, formed by the elementary sets of U that belong to the upper approximation but not to the lower approximation. The membership of an element of this region to set X is uncertain, based only on the equivalence classes of A:$$ \text{duv}(X) = A_{\text{sup}}(X) - A_{\text{inf}}(X). $$(11)
Some RST measures

Internal measure of X in A:$$ \varpi_{\text{Ainf}}(X) = \left| A_{\text{inf}}(X) \right| $$(12)

External measure of X in A:$$ \varpi_{\text{Asup}}(X) = \left| A_{\text{sup}}(X) \right| $$(13)

Quality of the lower approximation of X in A:$$ \gamma_{\text{Ainf}}(X) = \frac{\varpi_{\text{Ainf}}(X)}{\left|U\right|} = \frac{\left| A_{\text{inf}}(X) \right|}{\left|U\right|} $$(14)

Quality of the upper approximation of X in A:$$ \gamma_{\text{Asup}}(X) = \frac{\varpi_{\text{Asup}}(X)}{\left|U\right|} = \frac{\left| A_{\text{sup}}(X) \right|}{\left|U\right|} $$(15)
The internal measure is the number of elements in A that definitely belong to X, while the external measure indicates the number of elements that could belong to X. The metrics for the quality of the lower and upper approximations present these measures as percentages of the total number of elements in A. In particular, we extended γ_Ainf(X) and γ_Asup(X) in the design of our mapping algorithm. As future work, we intend to evaluate the adoption of other RST metrics. In the next section, we describe how bnodes can be modeled as approximate sets in an approximation space.
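The approximations and quality measures above translate directly into set operations; a minimal sketch, with an arbitrary illustrative partition that is not taken from the article:

```python
def approximations(classes, x):
    """Return (A_inf(X), A_sup(X)) for a subset x of U, given the elementary
    sets (equivalence classes of R) that partition U."""
    a_inf, a_sup = set(), set()
    for c in classes:
        if c <= x:      # class fully contained in X -> lower approximation
            a_inf |= c
        if c & x:       # class intersects X -> upper approximation
            a_sup |= c
    return a_inf, a_sup

# Illustrative space: U = {1..6} partitioned into three elementary sets.
classes = [{1, 2}, {3, 4}, {5, 6}]
u = {1, 2, 3, 4, 5, 6}
x = {1, 2, 3}

a_inf, a_sup = approximations(classes, x)
gamma_inf = len(a_inf) / len(u)   # quality of the lower approximation (Eq. 14)
gamma_sup = len(a_sup) / len(u)   # quality of the upper approximation (Eq. 15)
print(a_inf, a_sup)               # {1, 2} and {1, 2, 3, 4}
```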
Methods
We adopted RST in our approach as the basis on which to build a heuristic method to reduce the size of the delta found in the mapping between RDF graphs. To achieve this goal, we must first model the bnodes as sets in an approximation space. The steps required for this transformation are explained below.
Blank nodes as rough sets
Function \(S_{b_{i}}(s, p, o)\) returns an ordered pair (l, n), where n represents the node neighboring b_i (s or o) or the literal ‘σ’, and l represents the connection, or predicate, between b_i and n. When a triple connects b_i to itself, i.e., \(S_{b_{i}}\left (b_{i}, p, b_{i}\right) = (p, \text{`}\sigma\text{'})\), the literal ‘σ’ represents a bnode that is automatically mapped by the mapping of b_i itself.
Elements of the same class are indiscernible according to relation R. Having defined the approximation space and sets representing bnode pairs in this space, in the next section, we discuss how to extend the RST concepts to provide a measure of the closeness of bnodes.
Extending the RST concepts
Given any two approximation sets X_i and X_j in the approximation space A_ij = (U_ij, R), we observe the following properties for the intersection of their approximations [7]: A_inf(X_i) ∩ A_inf(X_j) = A_inf(X_i ∩ X_j) and A_sup(X_i) ∩ A_sup(X_j) ⊇ A_sup(X_i ∩ X_j). For a more accurate analysis of the approximation of X_i and X_j in A_ij, we can extend the concepts of positive, doubtful, and negative regions, considering the intersections between their approximations:
Definition 4.

Positive change region: formed by the union of all elementary sets of U_ij contained entirely in both X_i and X_j:$$ \text{pos}\left(X_{i}, X_{j}\right) = A_{\text{inf}}\left(X_{i}\right) \cap A_{\text{inf}}\left(X_{j}\right). $$(21)

Negative change region: formed by elementary sets of U_ij that have no elements in X_i or X_j:$$ \text{neg}\left(X_{i}, X_{j}\right) = U_{ij} - \left(A_{\text{sup}}\left(X_{i}\right) \cap A_{\text{sup}}\left(X_{j}\right)\right). $$(22)

Doubtful change region: formed by elementary sets of U_ij partially contained in X_i or X_j. In this case, X_i or X_j, but not both, may integrally contain elementary sets of U_ij:$$ \text{duv}\left(X_{i}, X_{j}\right) = \left(A_{\text{sup}}\left(X_{i}\right) \cap A_{\text{sup}}\left(X_{j}\right)\right) - \left(A_{\text{inf}}\left(X_{i}\right) \cap A_{\text{inf}}\left(X_{j}\right)\right). $$(23)
The positive change region pos(X_i, X_j) comprises classes that relate to links existing in both bnodes with the same neighboring nodes; i.e., these classes contain elements representing equivalent links, considering the mapping between bnodes. Classes contained in the doubtful change region duv(X_i, X_j) contain elements representing predicates common to the bnodes but connected to different neighbors, and are considered similar links. They represent change operations on common predicates of the bnodes: rename, extend, or reduce. Finally, the negative change region neg(X_i, X_j) consists of classes that are not found in both bnodes. These classes refer to the addition or removal of bnode predicates, and are considered independent links.
The change regions may provide a way of measuring the approximation between the two sets representing the bnodes. However, before addressing this issue, we analyze some extreme situations involving these regions to improve the understanding thereof. Initially, considering the case where all elements are in the positive change region, we can rank the bnodes as equivalent in A _{ ij }, because there are no differences between the bnode predicates, i.e., \((b_{i} \equiv _{A_{\textit {ij}}} b_{j}) \Leftrightarrow (A_{\text {inf}}(X_{i}) \cap A_{\text {inf}}(X_{j}) = U_{\textit {ij}})\), where this relationship is denoted by the symbol \(\equiv _{A_{\textit {ij}}}\). Otherwise, if this region is empty, the bnodes have no common connections with the same neighboring nodes (equivalent links), i.e., A _{inf}(X _{ i })∩A _{inf}(X _{ j })=∅. In this case, analysis of other change regions is necessary.
Regarding the doubtful change region, if all elements meet in this region it means that the bnodes have similar links with different neighboring nodes, i.e., (A _{inf}(X _{ i })∩A _{inf}(X _{ j })=∅)∧(A _{sup}(X _{ i })∩A _{sup}(X _{ j })=U _{ ij }). If this region is empty, there are no changes in the predicates common to both bnodes, i.e., (A _{sup}(X _{ i })∩A _{sup}(X _{ j }))−(A _{inf}(X _{ i })∩A _{inf}(X _{ j }))=∅. If the positive and/or doubtful regions are not empty and smaller than the universe, we categorize bnodes as approximated in A _{ ij }, represented by the symbol \(\approx _{A_{\textit {ij}}}\), because they have predicates in common, i.e., \((b_{i} \approx _{A_{\textit {ij}}} b_{j}) \Leftrightarrow (\emptyset \neq (A_{\text {sup}}(X_{i}) \cap A_{\text {sup}}(X_{j})) \neq U_{\textit {ij}})\).
Finally, if all the elements are in the negative change region, we classify the bnodes as distinct in A _{ ij }, represented by \(\neq _{A_{\textit {ij}}}\), because they have independent links, i.e., \((b_{i} \neq _{A_{\textit {ij}}} b_{j}) \Leftrightarrow (A_{\text {sup}}(X_{i}) \cap A_{\text {sup}}(X_{j}) = \emptyset)\). On the other hand, if this region is empty, all the connections are common to both bnodes, i.e., A _{sup}(X _{ i })∩A _{sup}(X _{ j })=U _{ ij }.
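Equations 21-23 can be sketched directly as set operations on the approximations of the two bnodes; the inputs below are illustrative, not taken from the article:

```python
def change_regions(u_ij, inf_i, sup_i, inf_j, sup_j):
    """Positive, doubtful, and negative change regions for X_i and X_j in
    A_ij, given their lower (inf) and upper (sup) approximations."""
    pos = inf_i & inf_j                       # equivalent links (Eq. 21)
    duv = (sup_i & sup_j) - (inf_i & inf_j)   # similar links: common predicates,
                                              # different neighbours (Eq. 23)
    neg = u_ij - (sup_i & sup_j)              # independent links (Eq. 22)
    return pos, duv, neg

# Illustrative approximations over a small universe U_ij.
pos, duv, neg = change_regions({1, 2, 3, 4}, {1, 2}, {1, 2, 3}, {1}, {1, 3, 4})
print(pos, duv, neg)   # {1}, {3}, {2, 4}
```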
Therefore, we can evaluate the approximation between bnodes from these change regions. For this purpose, we need to extend the RST measures presented in ‘Some RST measures’ subsection to measure the approximation between sets X _{ i } and X _{ j } in A _{ ij }, by considering the intersection of the approximation of these sets:
Definition 5.

Internal change measure:$$ \varpi_{\text{Ainf}}\left(X_{i}, X_{j}\right) = \left| A_{\text{inf}}\left(X_{i}\right) \cap A_{\text{inf}}\left(X_{j}\right) \right| $$(24)

External change measure:$$ \varpi_{\text{Asup}}\left(X_{i}, X_{j}\right) = \left| A_{\text{sup}}\left(X_{i}\right) \cap A_{\text{sup}}\left(X_{j}\right) \right| $$(25)

Quality of the lower change approximation:$$ \gamma_{\text{Ainf}}\left(X_{i}, X_{j}\right) = \frac{\varpi_{\text{Ainf}}\left(X_{i}, X_{j}\right)}{\left|U_{ij}\right|} = \frac{\left| A_{\text{inf}}(X_{i}) \cap A_{\text{inf}}(X_{j}) \right|}{\left|U_{ij}\right|} $$(26)

Quality of the upper change approximation:$$ \gamma_{\text{Asup}}\left(X_{i}, X_{j}\right) = \frac{\varpi_{\text{Asup}}\left(X_{i}, X_{j}\right)}{\left|U_{ij}\right|} = \frac{\left| A_{\text{sup}}(X_{i}) \cap A_{\text{sup}}(X_{j}) \right|}{\left|U_{ij}\right|} $$(27)
Based on the measures given in Definition 5, we redefine the approximation between two bnodes b_i and b_j in Definition 6. γ_Ainf(X_i, X_j) provides a way of measuring the percentage of identical predicates considering the mapping between X_i and X_j, while γ_Asup(X_i, X_j) provides a way of measuring the approximation between the predicates of X_i and X_j.
Definition 6.

\((b_{i} \equiv_{A_{ij}} b_{j}) \Leftrightarrow (\gamma_{\text{Ainf}}(X_{i}, X_{j}) = 1)\);

\((b_{i} \approx_{A_{ij}} b_{j}) \Leftrightarrow (0 < \gamma_{\text{Asup}}(X_{i}, X_{j}) < 1)\);

\((b_{i} \neq_{A_{ij}} b_{j}) \Leftrightarrow (\gamma_{\text{Asup}}(X_{i}, X_{j}) = 0)\).
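Definition 6 can be applied once the approximations of both bnodes are known; a minimal sketch (how to handle pairs with γ_Asup = 1 but γ_Ainf < 1, which the three cases leave open, is our own assumption here):

```python
def classify(u_ij, inf_i, sup_i, inf_j, sup_j):
    """Classify a bnode pair as 'equivalent', 'approximated', or 'distinct'
    in A_ij, from the change-approximation qualities (Definition 6)."""
    gamma_inf = len(inf_i & inf_j) / len(u_ij)   # Eq. 26
    gamma_sup = len(sup_i & sup_j) / len(u_ij)   # Eq. 27
    if gamma_inf == 1:
        return "equivalent"
    if gamma_sup == 0:
        return "distinct"
    # Assumption: gamma_sup == 1 with gamma_inf < 1 also counts as approximated.
    return "approximated"

print(classify({1, 2}, {1}, {1}, {1}, {1}))   # approximated
```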
Exemplifying the modeling

ϖ_Ainf(X_1, X_2) = 5;

ϖ_Ainf(X_1, X_3) = 3;

ϖ_Asup(X_1, X_2) = 8;

ϖ_Asup(X_1, X_3) = 10;

γ_Ainf(X_1, X_2) = 5/10 = 0.5;

γ_Ainf(X_1, X_3) = 3/12 = 0.25;

γ_Asup(X_1, X_2) = 8/10 = 0.8;

γ_Asup(X_1, X_3) = 10/12 ≈ 0.83.
Thus, we have \(b_{1} \approx_{A_{12}} b_{2}\) and \(b_{1} \approx_{A_{13}} b_{3}\), but as γ_Ainf(X_1, X_2) > γ_Ainf(X_1, X_3), we prefer the mapping between b_1 and b_2. We applied metric γ_Ainf(X_i, X_j) in the mapping between bnode pairs b_i and b_j, with the aim of reducing the delta between the versions. The greater the value of the lower approximation quality, the higher the equivalence between the bnode connections. In cases with equal values for γ_Ainf(X_i, X_j), we prioritize the pairs providing the greatest value for γ_Asup(X_i, X_j), because these are the bnodes with the closest approximations in terms of connections representing the same predicates.
We assume that mapping bnode pairs with higher equivalence or greater approximation between their predicates can reduce the delta size. In the next section, we use the approximation metrics γ _{Ainf}(X _{ i },X _{ j }) and γ _{Asup}(X _{ i },X _{ j }) to design the proposed mapping algorithm.
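The selection rule above, γ_Ainf first and γ_Asup as tie-breaker, can be sketched as a lexicographic maximum; the candidate tuples below reuse the values of the worked example and are otherwise illustrative:

```python
def best_candidate(candidates):
    """Pick the candidate with the highest gamma_Ainf, breaking ties by the
    highest gamma_Asup. Candidates are (bnode, gamma_inf, gamma_sup) tuples."""
    return max(candidates, key=lambda c: (c[1], c[2]))

# Candidates for b_1, using the qualities from the example above, plus an
# illustrative third candidate b_4 that ties on gamma_Ainf.
pairs = [("b2", 0.5, 0.8), ("b3", 0.25, 0.83), ("b4", 0.5, 0.6)]
print(best_candidate(pairs))   # ('b2', 0.5, 0.8)
```

b_3 loses despite the larger γ_Asup, because γ_Ainf is compared first; b_4 ties with b_2 on γ_Ainf and the tie is broken by γ_Asup.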
The ApproxMap method
In this section, we describe the strategies, data structures, and procedures designed to map bnodes in two RDF graphs. We call our mapping algorithm ApproxMap, because the project involves an analysis of the approximation between the sets representing the bnodes.
Heuristic strategies

Two approximation metrics: we use metric γ_Asup(X_i, X_j) if the candidate pairs have the same γ_Ainf(X_i, X_j). A pair with a greater γ_Asup(X_i, X_j) has a higher similarity, owing to the greater number of common predicates. We consider that mapping pairs with more similar predicates can help in reducing the delta size.

Two levels for bnode partitioning: the first level considers the existing hierarchy between directly connected bnodes, classifying the bnodes into four disjoint sets: roots, leaves, intermediates, and no interconnections. Then, in the second partitioning level, we organize the bnodes according to the number of connections with other nodes, allowing quick access to sets of bnodes with a particular number of links.

Unmapped neighboring bnodes are the same for incoming links but differ for outbound links: while neighboring bnodes are unmapped, URIs and literals play an important role in distinguishing blank nodes. The strategy adopted by Tzitzikas et al. [6], whereby all neighbors are considered the same, can increase the delta size if the mapped neighbors differ in the final mapping. Therefore, we aim to mitigate this effect by adopting the strategy described above, which considers the possible impact of different neighbors when computing the delta. With prior mapping of neighboring bnodes, we can find a greater approximation between candidate pairs.

Bottom-up approach to map directly connected bnodes: bnodes in the higher levels are mapped based on prior mappings in the lower levels. We compare each bnode mainly with those in the same hierarchical level, thereby reducing the number of comparisons. Relaxation of the same neighborhood for incoming links is due to this approach.

Top-down approximation during bnode mapping: bnodes are mapped iteratively, considering a decreasing approximation in the interval (0.0, 1.0]. We start the mapping of bnodes with the maximum approximation and, in each iteration, we reduce the lower limit for the desired approximation. Using this approach, we are able to reduce the number of comparisons between bnodes if the datasets contain vastly differing numbers of bnode links. This is because we do not need to compare bnode pairs that differ greatly in their numbers of links, as such pairs cannot reach an approximation greater than or equal to the desired value.

Initial equivalent bnode mapping: by first mapping equivalent bnodes, we can reduce the number of comparisons between the remaining bnodes that have not yet been mapped. Moreover, during the mapping of equivalent bnodes, we can also reduce the comparisons by applying filters to select only those bnodes in the same hierarchical level and with the same number of links.
Our heuristic combines all these strategies in an attempt to produce a solution with a reduced delta size during the mapping of blank nodes of two RDF graphs. For this purpose, we use specific data structures, as described in the next section.
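The strategies above might combine into a loop of the following shape; this is a sketch under our own assumptions, where the `find_pair` helper and the fixed decrement step are illustrative, not the authors' implementation:

```python
def approx_map(unmapped, find_pair, step=0.1):
    """Top-down mapping sketch: repeatedly map bnodes whose best candidate
    reaches the current desired approximation, then relax that threshold."""
    mapping = {}
    approx = 1.0                      # start with equivalent bnodes
    while unmapped and approx > 0.0:
        for b in list(unmapped):
            candidate = find_pair(b, approx)   # best pair with gamma_Ainf >= approx
            if candidate is not None:
                mapping[b] = candidate
                unmapped.remove(b)
        approx -= step                # top-down: lower the desired approximation
    return mapping

# Illustrative run: "a" has an equivalent partner, "b" only a 0.5-approximate one.
thresholds = {"a": 1.0, "b": 0.5}
partners = {"a": "x", "b": "y"}
result = approx_map({"a", "b"},
                    lambda b, approx: partners[b] if thresholds[b] >= approx else None)
print(result)
```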
Data structures
In the first adopted partitioning level, we store the unmapped bnodes of each graph G_k in the data structure TabG_k, which is partitioned into four disjoint sets: roots, leaves, intermediates, or no interconnections. We use the operator '[ ]' to index the partitions of TabG_k, where TabG_k[i] denotes partition i of TabG_k.
Each partition of TabG_k is further partitioned in a second level and indexed by the number of bnode links. This allows us to quickly find bnodes with the same number of predicates, where TabG_k[i][j] returns a reference to the set of bnodes from partition i of TabG_k with j connections to neighboring nodes.
The ApproxMap algorithm also makes use of four arrays, each of size |B_k|, for each graph G_k: alias_k, approxInf_k, approxSup_k, and M_k. Considering that b_i ∈ B_k, alias_k[i] stores the bnode currently mapped to b_i; approxInf_k[i] and approxSup_k[i] refer, respectively, to the values of the lower and upper approximations calculated for b_i and alias_k[i]. Similarly, M_k[i] stores the bnode definitely mapped to b_i.
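As an illustration only (names and shapes are our own assumptions, not the authors' implementation), the two-level partitioning of TabG_k and the per-graph arrays can be sketched in Python as:

```python
# Sketch of the ApproxMap data structures (our naming; the paper's
# actual implementation details may differ).

# Partition indices for the first level of TabG_k.
ROOTS, LEAVES, INTERMEDIATES, NO_INTERCONNECTIONS = 1, 2, 3, 4

def build_tabg(bnodes, links_of, kind_of):
    """TabG_k[i][j] -> set of bnodes in partition i having j links.

    `links_of(b)` returns the number of connections of bnode b, and
    `kind_of(b)` returns one of the four partition indices; both are
    hypothetical helpers standing in for graph inspection.
    """
    tabg = {i: {} for i in (ROOTS, LEAVES, INTERMEDIATES, NO_INTERCONNECTIONS)}
    for b in bnodes:
        tabg[kind_of(b)].setdefault(links_of(b), set()).add(b)
    return tabg

def init_arrays(n):
    """Per-graph arrays indexed by bnode position in B_k:
    alias[i]      -> bnode currently (tentatively) mapped to b_i
    approx_inf[i] -> lower approximation for (b_i, alias[i])
    approx_sup[i] -> upper approximation for (b_i, alias[i])
    mapping[i]    -> bnode definitely mapped to b_i (None while undecided)
    """
    return [None] * n, [0.0] * n, [0.0] * n, [None] * n
```

The nested dictionary makes the second-level lookup TabG_k[i][j] a constant-time operation, which is what allows the filtering by number of links described above.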
Before describing the ApproxMap method, we need to explain the process of finding bnode pairs with the greatest approximation during the mapping. In the next section, we discuss this process, which uses the data structures mentioned above.
Mapping bnode pairs
The algorithm looks for pairs with a value for the quality of lower approximation γ_Ainf(X_i, X_j) greater than or equal to the desired value indicated by approx; values below this limit are discarded. Variable b_m stores the current bnode with the closest approximation to b_i, while api_m and aps_m store, respectively, their lower and upper approximations, calculated by metrics γ_Ainf(X_i, X_j) and γ_Asup(X_i, X_j).
Considering subgraph G_i ⊆ G_k, as defined in Equation 16, |G_i| is the cardinality of G_i, i.e., the number of triples or connections of b_i. In addition, Φ_i is the set of possible p values for triples of the form (s, p, b_i) ∈ G_i, and Θ_i is the set of p values for triples (b_i, p, o) ∈ G_i.
In lines 5 and 6 of the algorithm, we use the values of variables l_inf and l_sup to reduce the comparison space using the top-down approximation approach discussed in the 'Heuristic strategies' subsection. We can only find an approximation greater than or equal to approx in the interval [l_inf, l_sup], considering our second partitioning level. In line 10, a further filtering takes place, whereby only bnodes with at least one predicate in common are compared.
After obtaining the lower approximation between b_i and the candidate b_j, in line 16, we check whether this new approximation is greater than that previously found. If so, the respective bnodes are marked as candidates for mapping, and any previous pairs are discarded. However, if the new value for γ_Ainf(X_i, X_j) is equal to that previously found, we compare the new value of γ_Asup(X_i, X_j), as shown in line 22. If this value is greater than the current value, the respective bnodes are also marked as candidates for mapping.
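The selection rule of this phase can be sketched as follows (a simplified illustration under our own naming; `gamma_pair` stands in for the paper's γ_Ainf and γ_Asup metrics and is an assumption of this sketch):

```python
def better_candidate(gamma_inf, gamma_sup, best_inf, best_sup):
    """Tie-breaking rule: a strictly greater lower approximation wins;
    on equal lower approximations, the greater upper approximation wins."""
    if gamma_inf > best_inf:
        return True
    if gamma_inf == best_inf and gamma_sup > best_sup:
        return True
    return False

def find_closest(b_i, candidates, approx, gamma_pair):
    """Return the candidate with the greatest approximation to b_i,
    discarding pairs below the desired threshold `approx`.

    `gamma_pair(b_i, b_j)` is a hypothetical helper returning the pair
    (gamma_inf, gamma_sup) for the two bnodes.
    """
    b_m, api_m, aps_m = None, -1.0, -1.0
    for b_j in candidates:
        g_inf, g_sup = gamma_pair(b_i, b_j)
        if g_inf < approx:
            continue  # below the desired approximation: discard
        if better_candidate(g_inf, g_sup, api_m, aps_m):
            b_m, api_m, aps_m = b_j, g_inf, g_sup
    return b_m, api_m, aps_m
```

The lexicographic comparison on (γ_Ainf, γ_Asup) mirrors the checks described for lines 16 and 22 of the algorithm.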
After the first phase, we have pairs of candidates with the greatest approximation for mapping, which is finalized in the second phase. Procedure MapApproximations(m, approx), with 1 ≤ m ≤ 4, is used to carry out the mapping. Bnodes in TabG_1[m] with an approximation greater than or equal to parameter approx are permanently mapped.
Procedures FindApproximations and MapApproximations are executed to map similar bnodes. However, we can refine these procedures to filter unmapped bnodes when looking for equivalent pairs, to reduce the search space. Thus, we designed procedure MapEquivalents(m) to map equivalent bnodes in TabG_1[m] and TabG_2[m], where 1 ≤ m ≤ 4. This procedure compares only bnodes with exactly the same incoming and outbound predicates. Thus, we permanently map only those bnode pairs with approximations equal to 1.0.
We also developed a procedure to map the remaining bnodes, after termination of the iterations for the adopted top-down approximation strategy. Procedure MapByOrder() compares bnodes in the same way as FindApproximations. However, the mapping is carried out directly between pairs with the greatest approximation according to the order defined by the partitioning of TabG_1, thereby ignoring the possibility of a closer relationship with another bnode pair.
Proposed method
The ApproxMap algorithm starts by mapping equivalent bnodes in TabG_k[1] and TabG_k[2], as shown in lines 1 and 2. During the mapping of TabG_k[2], we consider the relaxation of neighboring bnodes for inbound links. This mapping is performed only once, because these bnodes are leaves in the hierarchy and do not depend on previous mappings of other bnodes.
The rest of the algorithm consists of a loop, defined between lines 4 and 45, that maps the bnodes using the bottom-up approach discussed in the 'Heuristic strategies' subsection, where the mapping of bnodes contained in tables TabG_k[3] and TabG_k[4] depends on previous mappings of bnodes in lower levels of the hierarchy. The algorithm maps TabG_1[1] (lines 6 to 14), TabG_1[2] (lines 15 to 23), TabG_1[3] (lines 24 to 33), and TabG_1[4] (lines 34 to 43), in that order.
Thus, for each iteration of the outer loop, the value of min is decremented according to step η_2, as expressed in line 5. This value defines the minimum approximation required to map the bnodes in each partition of TabG_1 to those in the other partitions of TabG_2. For the same partitions, the mapping occurs in the inner loops, taking into account step η_1, so that the current approximation is decremented in each iteration (lines 11, 20, 30, and 40) until it reaches the limit set by min. Just prior to termination of the algorithm, in line 46, the remaining bnodes are mapped after 1/η_1 iterations.
We compare the bnodes of TabG_1[m] 1/η_1 times with those in TabG_2[m], and a smaller number of times, 1/η_2, with those in TabG_2[n], where m ≠ n. Therefore, during the search for bnode pairs with greater approximations, the outer loop provides the mapping of bnodes that change partitions between versions, while the inner loop provides the mapping of bnodes that remain in the same hierarchical partition across versions.
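One plausible reading of this iteration scheme is sketched below (control flow only; the mapping calls themselves are elided, and the paper's algorithm listing remains authoritative). With η_1 = 0.05 and η_2 = 0.25, each partition is visited 1/η_1 = 20 times in total across the 1/η_2 = 4 outer iterations:

```python
def approxmap_iterations(eta1, eta2):
    """Sketch of ApproxMap's main-loop schedule (our reconstruction).

    The outer loop lowers `min_approx` in steps of eta2; for each
    partition, the inner loop lowers the current approximation in
    steps of eta1 until it reaches the new minimum.
    Returns the list of (partition, approx) pairs in the order tried.
    """
    schedule = []
    min_approx = 1.0
    while min_approx > 0.0:
        new_min = round(min_approx - eta2, 10)  # line 5: decrement min
        for partition in (1, 2, 3, 4):
            approx = min_approx
            while approx > new_min:
                # a mapping pass over this partition would occur here
                schedule.append((partition, approx))
                approx = round(approx - eta1, 10)  # inner-loop decrement
        min_approx = new_min
    return schedule
```

The `round(..., 10)` calls keep the floating-point step sequence exact, so the loop terminates precisely at 0.0 when η_2 divides 1.0 evenly.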
Method analysis
The proposed method models bnodes as approximate sets, based on their classification as equivalent, similar, or distinct predicates in terms of their connections with other nodes. This organization by approximation classes allows the definition of metrics to measure the approximation between bnodes.
Considering the introductory example in Figure 1, algorithms Alg_Hung and Alg_Sign obtain a mapping resulting in a delta of size 4. Tzitzikas et al. focused on the mapping between pairs (_:1, _:6) and (_:2, _:7) because they considered connected bnodes to be the same, where dist_h(1,6)=0 and dist_h(1,7)=1. We emphasize the adoption of both the bottom-up and different-neighbor strategies in ApproxMap while mapping directly connected bnodes. The first iteration of ApproxMap results in the mapping of bnode pairs (_:3, _:8), (_:4, _:10), and (_:5, _:9), which have an approximation equal to 1.0. From this initial mapping, our method can map pairs (_:1, _:7) and (_:2, _:6), because γ_Ainf(X_1,X_6)=0.50, γ_Ainf(X_1,X_7)=0.67, γ_Ainf(X_2,X_6)=0.67, and γ_Ainf(X_2,X_7)=0.34. The mapping obtained by our method results in a smaller delta size of two triples.
First, ApproxMap maps bnodes ‘_:Product1’ and ‘_:Product3’, corresponding to the pair with the closest approximation. The closest approximations of both ‘_:Product1’ and ‘_:Product2’ are to bnode ‘_:Product3’. However, this mapping represents the lowest cost of transforming some bnode in the first version into ‘_:Product3’. We can change ‘_:Product1’ to ‘_:Product3’ by including only a single triple. However, we would need to include an additional three triples to transform ‘_:Product2’ into ‘_:Product3’.
Therefore, ApproxMap also maps the remaining bnodes '_:Product2' and '_:Product4', resulting in a global delta containing seven triples. However, if we had initially mapped '_:Product1' to '_:Product4', the resulting delta would have size 5, as is the case with the Hungarian algorithm. This occurs because our hypothesis considers only the reduction in delta between individual pairs and not the impact of this reduction on the global delta size. Because the mapping of remaining bnodes considers only unmapped bnode pairs, ApproxMap does not test all mapping possibilities, which can result in a local optimum.
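The delta computation underlying this discussion can be sketched as follows (our own simplification: triples as tuples, and the delta counted as deletions plus insertions after renaming the second version's bnodes through the mapping):

```python
def delta_size(g1, g2, mapping):
    """Size of the delta between two RDF graphs under a bnode mapping.

    `g1` and `g2` are sets of (s, p, o) tuples; `mapping` maps bnode
    identifiers of the second version onto those of the first. The
    delta is the number of triples to delete plus the number to add
    in order to turn g1 into (the renamed) g2.
    """
    def rename(t):
        s, p, o = t
        return (mapping.get(s, s), p, mapping.get(o, o))

    g2_renamed = {rename(t) for t in g2}
    return len(g1 - g2_renamed) + len(g2_renamed - g1)
```

For example, with g1 = {(_:a, p, 1), (_:b, p, 2)} and g2 = {(_:x, p, 1)}, mapping _:x to _:a yields a delta of one triple, whereas mapping _:x to _:b yields three, illustrating how the choice of pairs drives the global delta size.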
Moreover, in ApproxMap, the mapping occurs in the order defined by the adopted partitioning. We also used some ordered structures in the algorithm implementation, optimizing the comparisons between bnodes. The additional insertion cost of these structures is well known, although this is beyond the scope of this article. The adopted order can affect the delta size, mainly in procedure MapByOrder. As before, this may occur because our method does not test all mapping possibilities.
Furthermore, in cases involving completely different datasets, ApproxMap compares all bnodes in the two datasets during the mapping, resulting in the maximum delta, equal to the sum of the triples in the two datasets. We included some optimizations in ApproxMap that reduce the cost of comparing distinct pairs by first checking for the presence of common predicates. The worst-case execution corresponds to a particular case of distinct datasets, where all bnodes have the same predicates. In this case, we obtain dispersed approximate sets representing the bnodes, i.e., sets with an empty lower approximation (γ_Ainf(X_i,X_j)=0) and an upper approximation equal to the set universe (γ_Asup(X_i,X_j)=U) [7], as shown in Figure 14b.
We use step η_1 to control the number of comparisons between bnodes, where the total number is given by 1/η_1 × O(n²). Thus, when the ratio 1/η_1 is considerably smaller than n, where n is the smallest number of bnodes in the two datasets, the worst-case time complexity of the algorithm is O(n²). Conversely, the best-case execution of ApproxMap occurs with equivalent datasets containing bnodes with varying numbers of connections and without any directly connected bnodes. In this case, we need to compare each bnode with exactly one bnode in the other version. Thus, the best-case complexity is Ω(n).
Finally, we intend to apply ApproxMap to configuration management of software engineering projects, specifically to version control of RDF datasets. These projects are characterized by the manipulation of data, information, and knowledge in various types and formats, manually constructed based on the modularity principle, where complex elements are divided into smaller parts. Therefore, we expect great diversity between bnodes in the same version, justifying the application of ApproxMap in this context.
Because the datasets involved are usually constructed using an incremental development approach, we expect satisfactory performance of ApproxMap on similar versions containing several approximately equivalent bnode pairs, as generally occurs in successive versions of software engineering artifacts. A recommended configuration management practice is to perform version control such that the percentage of changes between versions remains low. Otherwise, larger deltas prevent the recovery of intermediate states between successive versions.
As future work, we propose a meticulous analysis of the impact of the adopted metrics and strategies on the mapping. We also intend to verify the applicability of other RST metrics that could provide better approximation measures between bnodes. As further future work, we propose improving the performance of the algorithm by executing some operations in parallel, such as the comparison of approximate sets.
Results and discussion
In analyzing the performance of the ApproxMap algorithm, we considered both the delta size calculated from mapping pairs of RDF datasets and the time spent on this task. This allowed comparison of the results with the values obtained for the Alg_Hung and Alg_Sign algorithms presented by Tzitzikas et al. [6]. All experiments discussed in this section were executed on an Intel Core i7-3537U 2.0 GHz processor with 8 GB RAM, running Ubuntu 13.10. To correct any formatting or encoding issues, preprocessing was carried out on certain pairs of datasets.
Three metrics defined by Tzitzikas et al. [6] were used in the analysis of the experiments: b_density, b_len, and D_a. Let N and B denote, respectively, the sets of nodes and blank nodes of graph G, where B ⊆ N. Further, let conn(b) denote the set of nodes in G directly attached to b ∈ N. Then, b_density = avg_{b∈B}(|conn(b) ∩ B| / |conn(b)|); b_len refers to the average maximum path length over paths whose vertexes consist only of bnodes; and D_a corresponds to the average number of bnode triples.
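A direct reading of the b_density definition can be sketched as follows (the triple and set representations are our own assumptions, not code from [6]):

```python
def b_density(graph_triples, bnodes):
    """Average, over blank nodes b, of |conn(b) ∩ B| / |conn(b)|,
    where conn(b) is the set of nodes directly attached to b.

    `graph_triples` is an iterable of (s, p, o) tuples and `bnodes`
    is the set B of blank-node identifiers.
    """
    def conn(b):
        neighbors = set()
        for s, p, o in graph_triples:
            if s == b:
                neighbors.add(o)
            elif o == b:
                neighbors.add(s)
        return neighbors

    ratios = []
    for b in bnodes:
        neighbors = conn(b)
        if neighbors:  # isolated bnodes contribute no ratio
            ratios.append(len(neighbors & bnodes) / len(neighbors))
    return sum(ratios) / len(ratios) if ratios else 0.0
```

A b_density of 0 thus indicates that no blank node is directly attached to another blank node, matching the real datasets in the first experiment.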
Except for the last experiment, we tested the ApproxMap algorithm with three different sets of parameters: η_1=0.01 and η_2=0.1; η_1=0.05 and η_2=0.125; and η_1=0.05 and η_2=0.25. These tests are denoted, respectively, as ApproxMap 1/10%, ApproxMap 5/12%, and ApproxMap 5/25%. We chose these steps empirically, considering the desired number of iterations. As future work, we propose further analysis of the choice of step values and calibration of ApproxMap.
We used the ApproxMap 5/12% tests as the baseline for comparison when evaluating the impact of changes in η_1 and η_2 on the results. The ApproxMap 5/12% test includes 20 iterations (1/η_1) of the inner loop of the method, comparing each bnode with those in the same hierarchical partition in the second version. In addition, there are eight iterations (1/η_2) of the outer loop, comparing the bnodes with those in the remaining partitions. The ApproxMap 5/25% test was used to verify the impact of an increase in η_2, reducing the comparisons between distinct partitions to 4 iterations. Finally, we used the ApproxMap 1/10% tests to analyze the impact of a reduction in η_1, increasing the comparisons within the same partitions to 100 iterations. In these tests, we also adjusted η_2 to better fit η_1, resulting in ten iterations of the outer loop.
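The iteration counts quoted above follow directly from the steps; a minimal check (names are ours):

```python
def iteration_counts(eta1, eta2):
    """Inner-loop (same partition) and outer-loop (remaining
    partitions) iteration counts implied by steps eta1 and eta2."""
    return round(1 / eta1), round(1 / eta2)

# The three parameter settings tested in the experiments:
for name, (e1, e2) in {
    "ApproxMap 1/10%": (0.01, 0.10),
    "ApproxMap 5/12%": (0.05, 0.125),
    "ApproxMap 5/25%": (0.05, 0.25),
}.items():
    inner, outer = iteration_counts(e1, e2)
    print(f"{name}: {inner} inner, {outer} outer iterations")
```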
We organized the experiments in three groups based on the type of dataset used in each: real, extracted from the Web (i.e., crawled), or synthetic datasets, as discussed in the following sections. The standard units for delta size and mapping time are, respectively, triples and milliseconds. We used a logarithmic scale for charts showing mapping times of the algorithms, thereby providing better visualization and comparison of the results.
Real datasets
Information about real datasets
Dataset  B      G       D_a   b_density  b_len
Swedish  522    3,670   5.47  0.00       0.00
Italian  6,390  49,897  3.42  0.00       0.00
Results of the algorithms applied to real datasets
Dataset              Swedish  Italian
Delta (triples)
  ApproxMap 1/10%    297      6
  ApproxMap 5/12%    297      6
  ApproxMap 5/25%    297      6
  Alg_Hung           297      6
  Alg_Sign           423      6
Time (ms)
  ApproxMap 1/10%    113      170
  ApproxMap 5/12%    36       158
  ApproxMap 5/25%    34       153
  Alg_Hung           4,789    456,173
  Alg_Sign           37       59
Crawled datasets
Owing to the difficulty of finding appropriate real versioned datasets for the experiments, in the second group of experiments we used an RDF crawler, LDSpider [21], to construct pairs of RDF dataset versions. We extracted some versions from randomly chosen links to common datasets in the linked open data (LOD) cloud [22], such as DBpedia and DBLP, as well as FOAF profiles.
Crawled datasets using the load-balancing strategy
Instance number  B    G       D_a   b_density  b_len
1                19   1,048   9.00  0.01       0.11
2                83   11,555  7.31  0.00       0.00
3                361  28,208  5.93  0.00       0.00
4                362  28,219  5.96  0.00       0.00
5                893  15,337  4.40  0.00       0.02
For bnode mapping, algorithm Alg_Hung was the slowest. Considering the differences between the mapping times of the algorithms presented in Figure 16, compared with ApproxMap 5/25%, Alg_Hung showed an increase in mapping time of between 0.50 and 3.16 on the adopted logarithmic scale. ApproxMap 5/25% was faster than Alg_Sign in two instances, with the maximum time increase for Alg_Sign equal to 0.78. Alg_Sign was faster in the remaining instances, with an increase in time for ApproxMap 5/25% of less than 1.03. Finally, considering the differences between steps η_1 and η_2, compared with ApproxMap 5/12%, the mapping time for ApproxMap 1/10% increased by between 0.38 and 0.60, while ApproxMap 5/25% showed a maximum reduction in mapping time of 0.06.
Crawled datasets with breadth-first/load-balancing strategy
Instance  B               G                 D_a             b_density       b_len
number    File 1  File 2  File 1  File 2    File 1  File 2  File 1  File 2  File 1  File 2
1         169     19      4,355   1,048     5.73    9.00    0.21    0.01    16.26   0.11
2         190     83      11,892  11,470    5.82    7.31    0.07    0.00    1.67    0.00
3         1,246   893     24,364  15,337    5.13    4.40    0.10    0.00    10.88   0.02
4         1,963   361     27,650  28,208    6.75    5.93    0.00    0.00    0.00    0.00
5         1,967   362     28,031  28,219    6.74    5.96    0.00    0.00    0.01    0.00
Synthetic datasets
In this final group of experiments, to evaluate the algorithms in the mapping of datasets with some specific features, e.g., directly connected bnodes or equivalent datasets, we generated pairs of synthetic datasets for use in the tests.
Datasets from adapted UnivBench artificial generator
Synthetic datasets generated by Tzitzikas et al. [6]
Instance number  G      D_a   b_density  b_len  Δ_opt/G (%)
1                5,846  13.4  0          0      1
2                5,025  10.5  0.1        1      0.5
3                2,381  7     0.15       1      1.5
4                1,628  5     0.2        1      1.5
5                1,636  5     0.2        1.15   1
6                1,399  4     0.25       1.15   1.7
7                919    3     0.32       1.15   3.2
8                909    3.25  0.4        1.35   2.7
9                1,001  3.94  0.5        21.5   2.5
Datasets with directly connected bnodes
To analyze the performance of the algorithms considering datasets with a higher number of directly connected bnodes, we developed an RDF dataset generator based on that included in the Berlin SPARQL Benchmark (BSBM) [24]. We used this generator to produce pairs of file versions with an average b_density of 0.34 (cv = 7.25%). We discuss the experiments using this generator in the next section.
Datasets from adapted BSBM generator
Our adapted generator is capable of producing two versions of an e-commerce portal, which is used by vendors to offer various products and by consumers to submit reviews about these products. The versions contain descriptions of five different types of resources, as well as three different types of blank nodes. We determined these quantities empirically to obtain the desired value of b_density. Thus, we defined the elements corresponding to products, their types, and characteristics as blank nodes, and the portal also included a hierarchy of product types.
In all experiments, 74.73% of the triples contained bnodes, with a coefficient of variation of cv = 1.28%. The high percentage of bnodes was acceptable because triples without bnodes can be mapped directly, and this was not our concern. Moreover, except for the last experiment, in which we tested large datasets, we limited the maximum number of bnodes to 2,000, so that the datasets could be tested with all algorithms. We constructed the version pairs in such a way as to ensure that changes occurred in isolation: the intersections of the sets of equivalent, added, or removed triples between versions were empty, considering all possible bnode mappings. Based on this, we obtained by construction the optimal delta size for the tested pairs.
The adapted generator accepts as input the number of products sold on the e-commerce portal and then determines the number of other bnodes (product types and characteristics) in terms of this input number. As a result, the values of some metrics, such as the average maximum path length (b_len), were affected by the number of bnodes, owing to variations in the product type hierarchy. However, this metric does not affect the computational cost of ApproxMap; we deal with bnode hierarchies using the adopted bottom-up strategy. Similarly, the absence of bnode interconnections (b_density = 0) does not affect ApproxMap because, in this case, it merely groups the bnodes in the same hierarchical partition. A meticulous analysis of the impact of these metrics on delta size is suggested as future work.
In the next sections, we describe the five experiments performed using datasets produced by our adapted generator. These experiments consider increases in the version and delta sizes, identical or different versions, as well as large datasets.
Changing the size of the datasets
The first experiment using our adapted generator aimed to analyze the impact of an increase in the number of bnodes. For the generation of datasets, we set a fixed ratio of 50% of equivalent elements among pairs, to assess the impact of an increase in version size assuming a moderate delta size.
Datasets with varying version sizes
Instance number  B      G       D_a   b_density  b_len   Δ_opt/G (%)
1                400    3,390   7.29  0.29       60.42   53.39
2                800    7,000   7.78  0.33       127.34  53.02
3                1,200  10,806  8.30  0.37       212.71  52.74
4                1,600  14,061  7.88  0.34       200.95  52.49
5                2,000  17,541  7.86  0.34       270.09  52.71
Regarding bnode mapping times, Alg_Hung was slower than ApproxMap 5/25%, with an increase in time varying between 0.97 and 1.44 on the logarithmic scale. Compared with the Alg_Sign algorithm, the increase in time for ApproxMap 5/25% varied between 0.65 and 1.70. Finally, the maximum increase in the execution time of ApproxMap 1/10% compared with that of ApproxMap 5/12% was 0.45. Similarly, ApproxMap 5/12% showed an increase smaller than 0.12 compared with ApproxMap 5/25%.
Changing delta size
Datasets with varying delta sizes
Instance number  G       D_a   b_density  b_len   Δ_opt/G (%)
1                17,895  8.47  0.38       231.60  15.86
2                17,639  7.99  0.35       304.20  31.73
3                17,868  8.19  0.37       326.40  47.34
4                17,841  8.13  0.36       265.72  62.91
5                18,014  8.28  0.37       340.61  78.35
6                17,695  7.95  0.34       328.19  94.06
As before, Alg_Sign performed the worst in terms of delta size, with the distance to the optimal delta varying between 51.61 and 88.3. Alg_Hung showed a distance to the optimal ranging from 11.45 to 21.28, while ApproxMap 1/10% showed one varying between 1.36 and 7.87. For ApproxMap 5/12% and ApproxMap 5/25%, the distances to the optimal varied from 0.85 to 8.84 and from 5 to 9.39, respectively.
Identical datasets
For a better analysis of the algorithms' behavior, the next two experiments considered extreme cases, with the datasets either identical or completely different. In the first case, we compared the second version of the datasets from the first experiment using our adapted generator with a version created by applying the delta to the first version, i.e., G'_2 = G_1 + Δ. With this, we validated the deltas previously found by ApproxMap. No differences were found by any of the algorithms for the identical datasets, even when considering the second version in reverse order.
Different datasets
Compared with ApproxMap 5/25%, Alg_Hung required an increased time varying between 1.4 and 2.21, while the time reduction of Alg_Sign varied from 0.94 to 1.92. Compared with the time requirement of ApproxMap 5/12%, ApproxMap 1/10% showed an increase ranging from 0.4 to 0.44, while the time reduction for ApproxMap 5/25% varied between 0.11 and 0.15, on the logarithmic scale.
Large datasets
Finally, the last experiment considered the behavior of ApproxMap and Alg_Sign when mapping large datasets. We could not test Alg_Hung in this experiment, owing to its high computational cost. With the aim of reducing the number of comparisons between bnodes, we adopted steps η_1=0.2 and η_2=0.5 for ApproxMap, referred to as ApproxMap 20/50%. Thus, with this choice of steps, there are five iterations comparing bnodes in the same hierarchical partitions and only two iterations comparing bnodes in different partitions.
In the construction of the dataset pairs, the number of bnodes varied between 20,000 and 100,000, in fixed steps of 20,000. We created these five pairs with average numbers of triples (|G|) of 183,732; 356,176; 562,828; 754,524; and 958,038, assuming a maximum value of cv = 0.37%. We created these datasets with 25% different elements, with the average size of the optimal delta equal to 26.41% of the triples, with cv = 0.11%. The values adopted in this experiment were chosen empirically, with the aim of reducing the mapping times of the algorithms. We did not test the algorithms with datasets larger than those generated in this experiment, owing to the computational cost of ApproxMap. However, the considered instances were sufficient to evaluate the behavior of the algorithm with large datasets. Moreover, the construction of a single large dataset is not common practice in the application context of our method, that is, software development projects, where techniques such as modularization are encouraged.
Analysis of results
The satisfactory results of ApproxMap in the experiments confirm our hypothesis that mapping bnode pairs with the highest approximation can assist in reducing the delta size. Considering the tests where optimal delta values are known, ApproxMap obtained the optimal delta size in 59% of the tests. Alg_Hung and Alg_Sign found optimal solutions in 50% and 30% of the test cases, respectively.
Considering all experiments except the final one with large datasets, ApproxMap found a delta size equal to that of Alg_Hung in 55% of the tests and smaller than that of Alg_Hung in 40% of the cases, except in the tests with steps 5/25%, where the delta found by ApproxMap was smaller in 41% of the test cases. Compared with Alg_Sign, ApproxMap obtained a smaller delta in 67% of the cases and the same delta in 33% of the cases. In the experiment with large datasets, ApproxMap performed better than Alg_Sign in all cases. Moreover, when compared to Alg_Sign, Alg_Hung found the same delta in 38% of the tests and a smaller delta in 60% of the tested cases.
Regarding mapping time, ApproxMap was faster than Alg_Hung in 84% of the tests and slower in the remaining 16%, except in the tests with steps 1/10%, where it was outperformed in 21% of the tested instances. ApproxMap was faster than Alg_Sign in 14%, 21%, and 24% of the tests with steps 1/10%, 5/12%, and 5/25%, respectively, and outperformed in the other cases. In the experiment with large datasets, Alg_Sign was faster than ApproxMap in all tests. Alg_Sign was also faster than Alg_Hung in all the tests conducted with these algorithms.
Based on the experimental results, the empirically defined values for parameters η_1 and η_2 are considered satisfactory. Considering the tests with steps 5/12% as our reference, the decrease in η_1 from 5% to 1% (steps 1/10%) caused a reduction in delta size in 12% of the tests, while we obtained the same delta in 81% of the cases. However, the consequent increase in mapping time was confirmed in 98% of the cases, while the time remained the same in the other 2%. On the other hand, with the increase in η_2 from 12.5% to 25% (steps 5/25%), a delta increase occurred in 17% of the tests, while we obtained the same delta in 78% of the cases. However, a consequent reduction in mapping time occurred in 64% of the instances, while the time remained the same in 28% of the tested cases.
Furthermore, considering the impact of interconnected bnodes in the experiments, in cases without directly connected bnodes (b_density = 0), ApproxMap obtained the same delta size as Alg_Hung in all the tests. However, in cases where b_density > 0, ApproxMap obtained the same delta size in 47% of the cases and a smaller size than Alg_Hung in 47% of the tests, with the exception of the tests with steps 5/25%, where the size was smaller in 49% of the tested instances.
On the other hand, analyzing the algorithms' performance in the experiments with equivalent pairs, ApproxMap was faster than Alg_Hung in all tests. Considering the ratio between the times spent by these algorithms, the mapping time of Alg_Hung was up to 283 times greater than that of ApproxMap 5/25%. We also emphasize the results for the real dataset Italian, whose delta contained no triples with bnodes. In this case, Alg_Hung required a mapping time 2,982 times greater than that of ApproxMap 5/25%. However, Alg_Hung yielded a non-empty delta in 25% of the tests with equivalent datasets when the datasets were presented in reverse order. The bnode order did not affect ApproxMap in the experiments, because it imposes an internal ordering.
Based on these results, we can state that the ApproxMap method achieved satisfactory performance in the experiments, and its application is recommended for the versioning of RDF datasets. We intend to apply this algorithm in the design of an SCM method, as part of an integrated environment of tools for software engineering projects based on Semantic Web standards. Moreover, we emphasize the satisfactory performance of ApproxMap in mapping datasets with large numbers of equivalent elements. Thus, we recommend its application for version control following recommended SCM practices, with a low percentage of changes between versions.
Conclusions
This paper aimed to develop a heuristic method for mapping blank nodes. The proposed method, called ApproxMap, applied extended concepts of RST, presented by Pawlak [7], in the handling of imprecision in bnode mapping. RST provided the necessary support to obtain a mapping between bnodes, seeking closer approximations between bnodes of the considered versions. The proposed modeling of blank nodes as approximate sets in an approximation space is an important contribution of this article. This modeling can be reused in other research domains involving blank node mapping.
In our method, the number of comparisons between bnodes is determined by parameter η_1. For small values of the ratio 1/η_1, the proposed algorithm has a worst-case time complexity of O(n²), which arises with two completely different datasets whose bnodes have the same predicates.
ApproxMap showed satisfactory performance in our groups of experiments, being the algorithm that obtained solutions closest to the optimal values. This algorithm succeeded in finding the optimal delta size in 59% of the tests involving optimal values. Considering all tests with different values for parameters η_1 and η_2, ApproxMap achieved a delta size smaller than or equal to those of Alg_Hung and Alg_Sign in at least 95% and 100% of the tested cases, respectively. Regarding mapping time, ApproxMap was faster than Alg_Hung in at least 79% of the instances and slower than Alg_Sign in at least 76% of the tests.
Despite its mapping time being greater than that of Alg_Sign, which has a time cost of O(n log n), we recommend applying ApproxMap in various situations, particularly those involving similar versions and directly connected bnodes. Great diversity among the bnodes in the same version is beneficial for ApproxMap. Thus, our algorithm can be successfully applied in RDF dataset versioning, such as that produced by software processes with iterative and incremental development.
As future work, we propose the creation of a parallel version of the ApproxMap algorithm to reduce the time required to compare the bnodes of the two RDF bases. Furthermore, we propose a meticulous analysis of the appropriate choice of the input steps η _{1} and η _{2}, and of the impact of the adopted metrics and strategies on delta size. We also intend to investigate other RST metrics.
Declarations
Acknowledgements
Many thanks to Christina Lantzaki and Yannis Tzitzikas for their help in executing the tests with the Alg _{ Hung } and Alg _{ Sign } algorithms and for making their synthetic datasets available. We also thank the reviewers for their help in improving the article.
References
 Klyne G, Carroll JJ, McBride B (2014) RDF 1.1 concepts and abstract syntax. World Wide Web Consortium, Recommendation. http://www.w3.org/TR/rdf11concepts.
 MonteMor JA, Cunha AM (2014) Galo: a semantic method for software configuration management. In: Information Technology: New Generations (ITNG), 2014 11th International Conference on, 33–39.
 Antoniou G, van Harmelen F (2004) A Semantic Web Primer. The MIT Press, London, England, p. 238.
 Lee TB, Connolly D (2001) Delta: an ontology for the distribution of differences between RDF graphs. Technical report, W3C. http://www.w3.org/DesignIssues/Diff.
 Zeginis D, Tzitzikas Y, Christophides V (2011) On computing deltas of RDF/S knowledge bases. ACM Trans Web 5(3): 14:1–14:36.
 Tzitzikas Y, Lantzaki C, Zeginis D (2012) Blank node matching and RDF/S comparison functions. In: Proceedings of the 11th International Conference on The Semantic Web - Volume Part I, ISWC'12, 591–607. Springer, Berlin, Heidelberg.
 Pawlak Z (1982) Rough sets. Int J Comput Inform Sci 11: 341–356.
 do Carmo Nicoletti M, Uchôa JQ, Baptistini MTZ (2001) Rough relation properties. Int J Appl Math Comput Sci 11(3): 621–635.
 Carroll JJ (2002) Matching RDF graphs. In: Proceedings of the First International Semantic Web Conference, ISWC '02, 5–15. Springer, London, UK.
 Noy NF, Kunnatur H, Klein M, Musen MA (2004) Tracking changes during ontology evolution. In: ISWC 2004, Proceedings of the 3rd International Semantic Web Conference, Hiroshima, Japan, November 7-11, 2004, 259–273. Springer, Berlin, Heidelberg.
 Noy NF, Musen MA (2002) PromptDiff: a fixed-point algorithm for comparing ontology versions. In: Eighteenth National Conference on Artificial Intelligence, 744–750. American Association for Artificial Intelligence, Menlo Park, CA, USA.
 Noy NF, Musen MA (2004) Ontology versioning in an ontology management framework. IEEE Intell Syst 19(4): 6–13.
 Auer S, Herre H (2006) A versioning and evolution framework for RDF knowledge bases. In: Proceedings of the 6th International Andrei Ershov Memorial Conference on Perspectives of Systems Informatics, PSI'06, 55–69. Springer, Berlin, Heidelberg.
 Völkel M, Groza T (2006) SemVersion: an RDF-based ontology versioning system. In: Nunes MB (ed) Proceedings of IADIS International Conference on WWW/Internet (IADIS 2006), 195–202, Murcia, Spain.
 Cassidy S, Ballantine J (2007) Version control for RDF triple stores. In: Filipe J, Shishkov B, Helfert M (eds) ICSOFT 2007, Proceedings of the Second International Conference on Software and Data Technologies, Volume ISDM/EHST/DC, Barcelona, Spain, July 22-25, 2007, 5–12. INSTICC Press, Setubal, Portugal.
 Im DH, Lee SW, Kim HJ (2012) A version management framework for RDF triple stores. Int J Softw Eng Knowl Eng 22(1): 85–106.
 Zeginis D, Tzitzikas Y, Christophides V (2007) On the foundations of computing deltas between RDF models. In: Proceedings of the 6th International Semantic Web Conference and 2nd Asian Semantic Web Conference, ISWC'07/ASWC'07, 637–651. Springer, Berlin, Heidelberg.
 Kuhn HW (1955) The Hungarian method for the assignment problem. Naval Res Logist Quart 2: 83–97.
 Uchôa JQ (1998) Representação e indução de conhecimento usando teoria de conjuntos aproximados. Master's thesis, Universidade Federal de São Carlos, São Carlos, Brasil.
 Pawlak Z, Skowron A (2007) Rough sets: some extensions. Inform Sci 177(1): 28–40.
 Isele R, Umbrich J, Bizer C, Harth A (2010) LDSpider: an open-source crawling framework for the web of linked data. In: Polleres A, Chen H (eds) ISWC Posters & Demos, CEUR Workshop Proceedings. CEUR-WS.org.
 Bizer C, Heath T, Berners-Lee T (2009) Linked data - the story so far. Int J Semantic Web Inf Syst 5(3): 1–22.
 Guo Y, Pan Z, Heflin J (2005) LUBM: a benchmark for OWL knowledge base systems. Web Semant 3(2-3): 158–182.
 Bizer C, Schultz A (2009) The Berlin SPARQL benchmark. Int J Semantic Web Inform Syst 5(2): 1–24.
Copyright
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited.