Skip to content

DataFusion

olehmberg edited this page Jun 19, 2017 · 4 revisions

Data Fusion

The data fusion process can be executed with the DataFusionEngine. The following Figure illustrates the process:

Data Fusion Overview

The data fusion method expects datasets in a consolidated schema and correspondences between their records. The correspondences can be loaded into a CorrespondenceSet. From these correspondences, groups of records which represent the same entity are collected for each attribute. Such an entity/attribute group then contains all values for this combination from the input datasets. To decide for a final value, a ConflictResolutionFunction is applied to the values. The DataFusionEngine is initialised with a DataFusionStrategy. This strategy defines a ConflictResolutionFunction and an EvaluationRule for each attribute. These functions determine how a final value is chosen from multiple possible values and how it is evaluated.

Implemented Conflict Resolution Functions

  • Voting: Applies majority voting to the values
  • ClusteredVote: Clusters all values using the provided similarity measure and returns the centroid of the largest resulting cluster
  • Intersection: Creates the intersection of all values (applicable if values are sets)
  • IntersectionKSources: Creates a set of all values that are included in at least k input values (applicable if values are sets)
  • Union: Creates the union of all values (applicable if values are sets)
  • FavourSources: Returns the value from the source with the highest score (as defined by the user)
  • MostRecent: Returns the value from the source that is most recent (as defined by the user)
  • Average: Returns the average of all values (numeric)
  • Median: Returns the median of all values (numeric)
  • LongestString: Returns the longest value by character count (strings)
  • ShortestString: Returns the shortest value by character count (strings)

Example

Load the datasets:

// Load the Data into FusableDataSet
FusableDataSet<Movie, Attribute> ds1 = new FusableDataSet<>();
new MovieXMLReader().loadFromXML(
  new File("usecase/movie/input/academy_awards.xml"),
  "/movies/movie",
  ds1);
ds1.printDataSetDensityReport();

FusableDataSet<Movie, Attribute> ds2 = new FusableDataSet<>();
new MovieXMLReader().loadFromXML(
  new File("usecase/movie/input/actors.xml"),
  "/movies/movie",
  ds2);
ds2.printDataSetDensityReport();

FusableDataSet<Movie, Attribute> ds3 = new FusableDataSet<>();
new MovieXMLReader().loadFromXML(
  new File("usecase/movie/input/golden_globes.xml"),
  "/movies/movie",
  ds3);
ds3.printDataSetDensityReport();

Load the record correspondences between the datasets (obtained from identity resolution):

// load correspondences
CorrespondenceSet<Movie, Attribute> correspondences = new CorrespondenceSet<>();
correspondences.loadCorrespondences(
  new File("usecase/movie/correspondences/academy_awards_2_actors_correspondences.csv"),
  ds1,
  ds2);
correspondences.loadCorrespondences(
  new File("usecase/movie/correspondences/actors_2_golden_globes_correspondences.csv"),
  ds2,
  ds3);

Define the data fusion strategy and add fusers for multiple attributes:

// define the fusion strategy
DataFusionStrategy<Movie, Attribute> strategy = new DataFusionStrategy<>(new MovieXMLReader());
// add attribute fusers
strategy.addAttributeFuser(
  Movie.TITLE,
  new TitleFuserShortestString(),
  new TitleEvaluationRule());
strategy.addAttributeFuser(
  Movie.DIRECTOR,
  new DirectorFuserLongestString(),
  new DirectorEvaluationRule());
strategy.addAttributeFuser(
  Movie.DATE,
  new DateFuserVoting(),
  new DateEvaluationRule());
strategy.addAttributeFuser(
  Movie.ACTORS,
  new ActorsFuserUnion(),
  new ActorsEvaluationRule());

Initialise the data fusion engine with the strategy and run it:

// create the fusion engine
DataFusionEngine<Movie, Attribute> engine = new DataFusionEngine<>(strategy);
// run the fusion
FusableDataSet<Movie, Attribute> fusedDataSet = engine.run(correspondences, null);

Load the gold standard and evaluate the fusion result:

// load the gold standard
DataSet<Movie, Attribute> gs = new FusableDataSet<>();
new MovieXMLReader().loadFromXML(
  new File("usecase/movie/goldstandard/fused.xml"),
  "/movies/movie",
  gs);
// evaluate
DataFusionEvaluator<Movie, Attribute> evaluator = new DataFusionEvaluator<>(
  strategy, new RecordGroupFactory<Movie, Attribute>());
evaluator.setVerbose(true);

double accuracy = evaluator.evaluate(fusedDataSet, gs, null);

System.out.println(String.format("Accuracy: %.2f", accuracy));