Andreas Schultz
Andrea Matteini
Robert Isele
Chris Bizer
Christian Becker

Introduction

This page presents a performance evaluation for LDIF using life science use cases.

The following data sources have been used to evaluate the performance of LDIF:

  • Allen Mouse Brain Atlas - a growing collection of online public resources integrating extensive gene expression and neuroanatomical data;
  • KEGG GENES - a collection of gene catalogs for all complete genomes generated from publicly available resources;
  • KEGG Pathway - a collection of pathway maps representing knowledge on the molecular interaction and reaction networks;
  • PharmGKB - a dataset providing gene information, disease and drug pathways, and SNP variants;
  • UniProt - a dataset containing protein sequences, genes and functions.

Use case

In this use case, we evaluated the performance of LDIF when integrating two large life science datasets and translating them into a common target vocabulary.

Datasets

For this use case we used only the KEGG GENES and UniProt datasets. The two datasets differ greatly in size: converted to N-Triples, the complete KEGG GENES dump is about 28 GB, whereas the UniProt dataset contains over 400 GB of data.

For the test, we generated subsets of both data sources that together amount to 25, 100, 150 and 300 million RDF triples. The 3690M dataset includes the complete UniProt and KEGG GENES datasets.

Details about the benchmark datasets are summarized in the following table, which provides statistics about the data integration process for each dataset. The number of input triples decreases during the process because LDIF discards input triples that are irrelevant for the defined mappings and therefore cannot be translated into the target vocabulary. The number decreases again after the actual translation, because the input data uses more verbose vocabularies, so multiple input triples are combined into single triples in the target vocabulary. The size of the final dataset is the number of quads after the mapping phase plus any provided provenance data.

Statistic | 25M | 100M | 150M | 300M | 3690M
Number of input quads | 25,000,000 | 100,000,000 | 150,000,000 | 300,000,000 | 3,687,918,681
Number of quads after irrelevance filter | 13,576,394 | 44,249,757 | - | - | 1,164,250,713
Number of quads after mapping | 4,419,410 | 24,972,112 | - | 32,211,677 | 98,380,001
Number of pairs of equivalent entities resolved | 24,782 | 213,062 | - | 1,084,808 | 6,321,750
Overall file size | 5.6 GB | 23 GB | 35 GB | 67 GB | -
Download link | 25M.zip | 100M.zip | - | - | -
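
To make the reduction described above concrete, the following sketch computes the fraction of input quads that survives the irrelevance filter and the mapping step. The counts are copied from the table; only the columns that report all three counts are included, and the function name `retention` is chosen here for illustration.

```python
# Quad counts copied from the table above: (input, after filter, after mapping).
# Only the 25M, 100M and 3690M columns report all three numbers.
counts = {
    "25M": (25_000_000, 13_576_394, 4_419_410),
    "100M": (100_000_000, 44_249_757, 24_972_112),
    "3690M": (3_687_918_681, 1_164_250_713, 98_380_001),
}


def retention(input_quads, after_filter, after_mapping):
    """Return the fractions of input quads left after the filter and after mapping."""
    return after_filter / input_quads, after_mapping / input_quads


for name, row in counts.items():
    kept_filter, kept_mapping = retention(*row)
    print(f"{name}: {kept_filter:.1%} after filter, {kept_mapping:.1%} after mapping")
```

For the 25M dataset, for example, roughly 54% of the input quads pass the irrelevance filter and roughly 18% remain after mapping.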

Mappings

We defined R2R mappings for translating genes, diseases and pathways from KEGG GENES and genes from UniProt into a proprietary target vocabulary. Some more sophisticated mappings from the use case translate complex structural patterns and perform value transformations (e.g. extracting an integer value from a URI). The prevalent value transformations are extracting strings with a regular expression and modifying the target data types.

Here are two examples of the mappings we used:
  • mapping a KEGG GENES gene to the target vocabulary class Gene
    mp:Gene
      a r2r:ClassMapping;
      r2r:prefixDefinitions
    """category: <http://mywiki/resource/category/> .
    property: <http://mywiki/resource/property/> .
    pathway: <http://wiking.vulcan.com/neurobase/kegg_pathway/resource/vocab/> .
    genes: <http://wiking.vulcan.com/neurobase/kegg_genes/resource/vocab/> .
    xsd: <http://www.w3.org/2001/XMLSchema#> .""";
      r2r:sourcePattern "?SUBJ a genes:gene";
      r2r:targetPattern "?SUBJ a category:Gene";
      .
  • mapping the KEGG GENES externalLink property to the target vocabulary property UniprotId
    mp:GeneLinkUniProt
      a r2r:PropertyMapping;
      r2r:mappingRef mp:Gene;
      r2r:sourcePattern "?SUBJ genes:externalLink ?x";
      r2r:transformation "?id = regexToList('UniProt:(.+)', ?x)";
      r2r:targetPattern "?SUBJ property:UniprotId ?'id'^^xsd:string";
      .
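
The value transformation in the second mapping can be illustrated outside R2R. The sketch below applies the same regular expression in Python. It assumes that `regexToList` returns the captured group of each matching value, which is a simplification of the R2R semantics. The sample `externalLink` values are hypothetical and not taken from the benchmark data.

```python
import re


def extract_uniprot_ids(external_links):
    """Return the captured ID for every value that matches 'UniProt:(.+)'."""
    pattern = re.compile(r"UniProt:(.+)")
    ids = []
    for value in external_links:
        match = pattern.search(value)
        if match:
            ids.append(match.group(1))
    return ids


# Hypothetical externalLink values of a single gene (illustration only).
links = ["UniProt:P12345", "NCBI-GeneID:11287"]
print(extract_uniprot_ids(links))
```

Values that do not carry the `UniProt:` prefix produce no target triple in this sketch, which is in line with the mapping only emitting `property:UniprotId` for UniProt links.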

Link Specifications

We defined the following Link Specification for the identity resolution phase:

<Silk>
	<Prefixes>
		<Prefix id="rdf" namespace="http://www.w3.org/1999/02/22-rdf-syntax-ns#" />
		<Prefix id="rdfs" namespace="http://www.w3.org/2000/01/rdf-schema#" />
		<Prefix id="owl" namespace="http://www.w3.org/2002/07/owl#" />
		<Prefix id="property" namespace="http://mywiki/resource/property/" />
		<Prefix id="category" namespace="http://mywiki/resource/category/" />
	</Prefixes>

	<Interlinks>
		<Interlink id="link">
			<LinkType>owl:sameAs</LinkType>

			<SourceDataset dataSource="Source" var="a">
				<RestrictTo>
					{ ?a rdf:type category:Gene }
					UNION { ?a rdf:type category:Disease }
					UNION { ?a rdf:type category:Pathway }
				</RestrictTo>
			</SourceDataset>

			<TargetDataset dataSource="Target" var="b">
				<RestrictTo>
					{ ?b rdf:type category:Gene }
					UNION { ?b rdf:type category:Disease }
					UNION { ?b rdf:type category:Pathway }
				</RestrictTo>
			</TargetDataset>

			<LinkageRule>
				<Aggregate type="max">
					<Compare metric="equality">
						<Input path="?a/property:UniprotId" />
						<Input path="?b/property:UniprotId" />
					</Compare>
					<Compare metric="equality">
						<Input path="?a/property:EntrezGeneId" />
						<Input path="?b/property:EntrezGeneId" />
					</Compare>
					<Compare metric="equality">
						<Input path="?a/property:MgiMarkerAccessionId" />
						<Input path="?b/property:MgiMarkerAccessionId" />
					</Compare>
					<Compare metric="equality">
						<Input path="?a/property:KeggDiseaseId" />
						<Input path="?b/property:KeggDiseaseId" />
					</Compare>
					<Compare metric="equality">
						<Input path="?a/property:KeggPathwayId" />
						<Input path="?b/property:KeggPathwayId" />
					</Compare>
				</Aggregate>
			</LinkageRule>

			<Filter threshold="1.0" />

		</Interlink>

</Interlinks>
</Silk>
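
The linkage rule above takes the maximum over five equality comparisons and keeps only links with a score of 1.0, so two entities are linked as soon as they agree on any one of the identifier properties. The following sketch mimics that logic in plain Python. It is a simplified model rather than Silk's implementation: entities are represented as hypothetical dictionaries from property name to a list of values, and a property missing on either side is assumed to score 0.0.

```python
# Identifier properties compared by the linkage rule above.
ID_PROPERTIES = [
    "UniprotId",
    "EntrezGeneId",
    "MgiMarkerAccessionId",
    "KeggDiseaseId",
    "KeggPathwayId",
]


def equality(values_a, values_b):
    """Score 1.0 if both sides share at least one value, otherwise 0.0."""
    return 1.0 if set(values_a) & set(values_b) else 0.0


def link_score(a, b):
    """Aggregate the per-property equality scores with max, as in the rule."""
    return max(equality(a.get(p, []), b.get(p, [])) for p in ID_PROPERTIES)


def is_same(a, b, threshold=1.0):
    """Apply the filter threshold to decide whether an owl:sameAs link is emitted."""
    return link_score(a, b) >= threshold


# Hypothetical entities (illustration only).
kegg_gene = {"UniprotId": ["P12345"], "EntrezGeneId": ["11287"]}
uniprot_gene = {"UniprotId": ["P12345"]}
other_gene = {"UniprotId": ["Q99999"]}

print(is_same(kegg_gene, uniprot_gene))
print(is_same(kegg_gene, other_gene))
```

Because the aggregation is `max` and the metric is exact equality, the threshold of 1.0 amounts to a logical OR over the five identifier comparisons.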

Benchmark Machines

We used machines with the following specifications for the benchmark experiments:

FUB machines

These machines were used for the in-memory and TDB tests and for the FUB Hadoop cluster.

Hardware
  Processors: Intel i7 950, 3.07 GHz (quad-core)
  Memory: 24 GB
  Hard Disks: 2 × 1.8 TB (7,200 rpm) SATA2
Software
  Operating System: Ubuntu 11.04 64-bit, Kernel: 2.6.38-10
  Java version: 1.6.0_22

EC2 c1.medium instances

These machines were used for the EC2 Hadoop clusters.

Hardware
  Processor: 5 EC2 Compute Units [1]
  Memory: 1.7 GB
  Hard Disks: 350 GB
  I/O Performance: Moderate
Software
  Operating System: Ubuntu 11.04 32-bit

Hadoop clusters

In all cluster configurations, the master works as job tracker and name node, while the slaves work as data nodes and task trackers.

FUB Hadoop 2-slaves cluster

  • 1 master, 2 slaves (FUB machines)
  • Network: Gigabit Ethernet

EC2 Hadoop X-slaves cluster

  • 1 master, X slaves (EC2 c1.medium instances)

Test Procedure

We used LDIF version 0.5 for the in-memory, TDB and Hadoop tests.

Single machine tests

For each single machine test, we applied the following procedure:

  1. Clear Operating System caches
     echo 2 > /proc/sys/vm/drop_caches 
  2. Run test
     java -server -Xmx20G -jar ldif-single-machine.jar 

Hadoop tests

For each distributed test, we applied the following procedure:

  1. Launch a master instance with an EBS volume containing the input data attached
  2. Launch slave instances
  3. Initialize the cluster (connect via SSH, clear all temporary data, format HDFS, run start-all)
  4. Upload input data into HDFS (from EBS)
  5. Run integration

Details about the Hadoop cluster configurations are available at this page.

Results

The following tables summarize the LDIF run times for the different dataset sizes. The overall run time is split according to the processing steps of the integration process.

25M run times

Phase | In-memory | TDB | Hadoop FUB 2-slaves
Load and build entities for R2R | 186.3 s | 1334 s | 688 s
R2R data translation | 116.4 s | 85 s | 87 s
Build entities for Silk | 12.9 s | 139 s | 225 s
Silk identity resolution | 72.6 s | 213 s | 216 s
URIs rewriting | 16.2 s | 15 s | 427 s
Overall execution | 6.7 min | 29.7 min | 27.4 min

100M run times

Phase | In-memory | TDB | Hadoop FUB 2-slaves
Load and build entities for R2R | 822 s | 7014 s | 3112 s
R2R data translation | 744 s | 607 s | 165 s
Build entities for Silk | 81 s | 806 s | 405 s
Silk identity resolution | 772 s | 4656 s | 544 s
URIs rewriting | 113 s | 118 s | 1056 s
Overall execution | 42.2 min | 220 min | 88 min

150M run times

Phase | In-memory | TDB | Hadoop FUB 2-slaves
Load and build entities for R2R | Out of Memory | 13776 s | 3830 s
R2R data translation | - | 1206 s | 170 s
Build entities for Silk | - | 847 s | 380 s
Silk identity resolution | - | 5328 s | 688 s
URIs rewriting | - | 173 s | 1235 s
Overall execution | - | 355 min | 105 min

300M run times

Phase | In-memory | TDB | Hadoop FUB 2-slaves
Load and build entities for R2R | Out of Memory | 22870 s | 6070 s
R2R data translation | - | 1203 s | 179 s
Build entities for Silk | - | 1006 s | 436 s
Silk identity resolution | - | 8392 s | 1022 s
URIs rewriting | - | 176 s | 1232 s
Overall execution | - | 560 min | 148 min

Phase | Hadoop EC2 8-slaves | Hadoop EC2 16-slaves | Hadoop EC2 32-slaves
Load and build entities for R2R | 7933 s | 4647 s | 2382 s
R2R data translation | 297 s | 173 s | 114 s
Build entities for Silk | 646 s | 421 s | 324 s
Silk identity resolution | 1546 s | 932 s | 580 s
URIs rewriting | 2174 s | 1430 s | 1085 s
Overall execution | 209 min | 126 min | 75 min
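
One way to read the EC2 figures is in terms of speedup and parallel efficiency relative to the smallest cluster. The sketch below derives both from the overall run times listed above, assuming the EC2 results refer to the 300M dataset under whose heading they appear. The function name `scaling` is chosen here for illustration.

```python
# Overall run times in minutes, copied from the EC2 results above
# (number of slaves -> overall execution time).
overall_minutes = {8: 209, 16: 126, 32: 75}


def scaling(base_slaves, slaves):
    """Return (speedup, parallel efficiency) relative to the base cluster size."""
    speedup = overall_minutes[base_slaves] / overall_minutes[slaves]
    efficiency = speedup / (slaves / base_slaves)
    return speedup, efficiency


for n in (16, 32):
    speedup, efficiency = scaling(8, n)
    print(f"{n} slaves: {speedup:.2f}x vs. 8 slaves, efficiency {efficiency:.0%}")
```

Doubling the cluster from 8 to 16 slaves gives a speedup of about 1.66, and quadrupling it to 32 slaves gives about 2.79, so the scaling is sublinear.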

3690M run times

Phase | Hadoop FUB 2-slaves (format: hh:mm:ss)
Load and build entities for R2R | 9:34:44
R2R data translation | 15:22
Build entities for Silk | 51:12
Silk identity resolution | 15:27:52
Find sameAs URI sets | 1:44:24
URIs rewriting | 1:45:29
Overall execution | 29:09:03



[1] According to this article, c1.medium instances with 5 ECU have been observed to run on 2 of the 4 cores of an Intel E5410 processor (4 cores, 2.33 GHz).