LDIF - Benchmark Results

Introduction

This page presents a performance evaluation for LDIF using life science use cases.

The following data sources have been used to evaluate the performance of LDIF:

Allen Mouse Brain Atlas - a growing collection of online public resources integrating extensive gene expression and neuroanatomical data;
KEGG GENES - a collection of gene catalogs for all complete genomes generated from publicly available resources;
KEGG Pathway - a collection of pathway maps representing knowledge on the molecular interaction and reaction networks;
PharmGKB - which provides data on gene information, disease and drug pathways, and SNP variants;
UniProt - a dataset containing protein sequence, gene and function data.

Use case

In this use case, we evaluated the performance of LDIF when integrating two large life science datasets and translating them into a common target vocabulary.

Datasets

For this use case we used only the KEGG GENES and UniProt datasets, which differ greatly in size: converted to N-Triples, the complete KEGG GENES dump is about 28 GB, whereas the UniProt dataset contains over 400 GB of data.

For the test, we generated subsets of both data sources amounting together to 25, 100, 150 and 300 million RDF triples. The 3690M dataset includes the complete UniProt and KEGG GENES datasets.

Details about the benchmark datasets are summarized in the following table, which provides statistics about the data integration process for each dataset. The number of input triples decreases during the process because LDIF discards input triples that are irrelevant for the defined mappings and therefore cannot be translated into the target vocabulary. The number decreases again after the actual translation, because the input data uses more verbose vocabularies, so multiple input triples are combined into single triples in the target vocabulary. The size of the final dataset is the number of quads after the mapping phase plus any provided provenance data.

	25M	100M	150M	300M	3690M
Number of input quads	25,000,000	100,000,000	150,000,000	300,000,000	3,687,918,681
Number of quads after irrelevance filter	13,576,394	44,249,757	-	-	1,164,250,713
Number of quads after mapping	4,419,410	24,972,112	-	32,211,677	98,380,001
Number of pairs of equivalent entities resolved	24,782	213,062	-	1,084,808	6,321,750
Overall file size	5.6 GB	23 GB	35 GB	67 GB	-
Download link	25M.zip	100M.zip	-	-	-
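
To make the reduction described above concrete, the following sketch computes the share of input quads that survives the irrelevance filter and the mapping phase, for the columns where the table reports all three figures. The counts are copied from the table; the dictionary layout and the function name are only illustrative.

```python
# Quad counts copied from the table above (only columns with complete figures).
STATS = {
    "25M":   {"input": 25_000_000,    "filtered": 13_576_394,    "mapped": 4_419_410},
    "100M":  {"input": 100_000_000,   "filtered": 44_249_757,    "mapped": 24_972_112},
    "3690M": {"input": 3_687_918_681, "filtered": 1_164_250_713, "mapped": 98_380_001},
}

def retention(stats):
    """Return (share kept by the filter, share kept after mapping), relative to the input."""
    return stats["filtered"] / stats["input"], stats["mapped"] / stats["input"]

for name, stats in STATS.items():
    kept, mapped = retention(stats)
    print(f"{name}: {kept:.1%} after filter, {mapped:.1%} after mapping")
```

For the 25M dataset, for example, roughly half of the input quads pass the irrelevance filter and fewer than a fifth remain after mapping.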

Mappings

We defined R2R mappings for translating genes, diseases and pathways from KEGG GENES and genes from UniProt into a proprietary target vocabulary. Some more sophisticated mappings from the use case translate complex structural patterns and perform value transformations (e.g. extracting an integer value from a URI). The prevalent value transformations are extracting strings with a regular expression and modifying the target data types.

Here are some examples of the mappings we used for:

mapping a KEGG GENES gene into a target vocabulary Gene

 mp:Gene
   a r2r:ClassMapping;
   r2r:prefixDefinitions
        "category: <http://mywiki/resource/category/> .
         property: <http://mywiki/resource/property/> .
         pathway:  <http://wiking.vulcan.com/neurobase/kegg_pathway/resource/vocab/> .
         genes:    <http://wiking.vulcan.com/neurobase/kegg_genes/resource/vocab/> .
         xsd:      <http://www.w3.org/2001/XMLSchema#> .";
   r2r:sourcePattern "?SUBJ a genes:gene";
   r2r:targetPattern "?SUBJ a category:Gene";
   .

mapping a KEGG GENES externalLink property into a target vocabulary UniprotId property

 mp:GeneLinkUniProt
   a r2r:PropertyMapping;
   r2r:mappingRef mp:Gene;
   r2r:sourcePattern  "?SUBJ genes:externalLink ?x";
   r2r:transformation "?id = regexToList('UniProt:(.+)', ?x)";
   r2r:targetPattern  "?SUBJ property:UniprotId ?'id'^^xsd:string";
   .
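
The transformation in mp:GeneLinkUniProt can be pictured with a small Python sketch. It assumes that regexToList returns the capture-group matches of the pattern as a list and that each list element yields one target triple; this is an interpretation for illustration, not R2R's implementation. The sample externalLink values are hypothetical.

```python
import re

# Same pattern as in the r2r:transformation above.
PATTERN = re.compile(r"UniProt:(.+)")

def uniprot_ids(external_link):
    """Return the identifiers captured from one externalLink value (possibly empty)."""
    return PATTERN.findall(external_link)

# Hypothetical source values, for illustration only.
links = ["UniProt:P12345", "NCBI-GeneID:11287"]
for link in links:
    for uid in uniprot_ids(link):
        # Mirrors the target pattern: ?SUBJ property:UniprotId ?'id'^^xsd:string
        print(f'?SUBJ property:UniprotId "{uid}"^^xsd:string')
```

Values that do not match the pattern produce an empty list and therefore no target triple in this sketch.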

Link Specifications

We defined the following Link Specification for the identity resolution phase:

<Silk>
	<Prefixes>
		<Prefix id="rdf" namespace="http://www.w3.org/1999/02/22-rdf-syntax-ns#" />
		<Prefix id="rdfs" namespace="http://www.w3.org/2000/01/rdf-schema#" />
		<Prefix id="owl" namespace="http://www.w3.org/2002/07/owl#" />
		<Prefix id="property" namespace="http://mywiki/resource/property/" />
		<Prefix id="category" namespace="http://mywiki/resource/category/" />
	</Prefixes>

	<Interlinks>
		<Interlink id="link">
			<LinkType>owl:sameAs</LinkType>

			<SourceDataset dataSource="Source" var="a">
				<RestrictTo>
					{ ?a rdf:type category:Gene }
					UNION { ?a rdf:type category:Disease }
					UNION { ?a rdf:type category:Pathway }
				</RestrictTo>
			</SourceDataset>

			<TargetDataset dataSource="Target" var="b">
				<RestrictTo>
					{ ?b rdf:type category:Gene }
					UNION { ?b rdf:type category:Disease }
					UNION { ?b rdf:type category:Pathway }
				</RestrictTo>
			</TargetDataset>

			<LinkageRule>
			<Aggregate type="max">
					<Compare metric="equality">
						<Input path="?a/property:UniprotId" />
						<Input path="?b/property:UniprotId" />
					</Compare>
					<Compare metric="equality">
						<Input path="?a/property:EntrezGeneId" />
						<Input path="?b/property:EntrezGeneId" />
					</Compare>
					<Compare metric="equality">
						<Input path="?a/property:MgiMarkerAccessionId" />
						<Input path="?b/property:MgiMarkerAccessionId" />
					</Compare>
					<Compare metric="equality">
						<Input path="?a/property:KeggDiseaseId" />
						<Input path="?b/property:KeggDiseaseId" />
					</Compare>
					<Compare metric="equality">
						<Input path="?a/property:KeggPathwayId" />
						<Input path="?b/property:KeggPathwayId" />
					</Compare>
			</Aggregate>
			</LinkageRule>

			<Filter threshold="1.0" />

		</Interlink>

	</Interlinks>
</Silk>
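
The linkage rule above can be read as: two entities are linked when at least one of the five identifier properties matches exactly. The following Python sketch illustrates that reading. It is not Silk's implementation. It assumes that an equality comparison scores 1.0 when both entities share a value and 0.0 otherwise, that a property missing on either side scores 0.0, and it ignores the RestrictTo type restrictions. Entities are represented as plain dictionaries with hypothetical values.

```python
ID_PROPERTIES = [
    "UniprotId", "EntrezGeneId", "MgiMarkerAccessionId",
    "KeggDiseaseId", "KeggPathwayId",
]
THRESHOLD = 1.0  # mirrors <Filter threshold="1.0" />

def equality(values_a, values_b):
    """Score 1.0 if the two value lists share at least one value, else 0.0."""
    return 1.0 if set(values_a) & set(values_b) else 0.0

def link_score(a, b):
    """Aggregate the per-property scores with max, as in <Aggregate type="max">."""
    return max(equality(a.get(p, []), b.get(p, [])) for p in ID_PROPERTIES)

def is_same(a, b):
    """Decide whether an owl:sameAs link would be generated under the assumptions above."""
    return link_score(a, b) >= THRESHOLD

# Hypothetical entities, for illustration only.
kegg_gene = {"UniprotId": ["P12345"], "EntrezGeneId": ["11287"]}
uniprot_gene = {"UniprotId": ["P12345"]}
other_gene = {"UniprotId": ["Q99999"]}

print(is_same(kegg_gene, uniprot_gene))
print(is_same(kegg_gene, other_gene))
```

With max aggregation and a threshold of 1.0, a single exact identifier match is sufficient for a link, and partial similarity never is.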

Benchmark Machines

We used machines with the following specifications for the benchmark experiments:

FUB machines

These machines were used for the in-memory and TDB tests and for the FUB Hadoop cluster.

Hardware	Processors: Intel i7 950, 3.07 GHz (quad-core); Memory: 24 GB; Hard Disks: 2 × 1.8 TB (7,200 rpm) SATA2
Software	Operating System: Ubuntu 11.04 64-bit; Kernel: 2.6.38-10; Java version: 1.6.0_22

EC2 c1.medium instances

These machines were used for the EC2 Hadoop clusters.

Hardware	Processor: 5 EC2 Compute Units¹; Memory: 1.7 GB; Hard Disks: 350 GB; I/O Performance: Moderate
Software	Operating System: Ubuntu 11.04 32-bit

Hadoop clusters

In all cluster configurations, the master works as job tracker and name node, while the slaves work as data nodes and task trackers.

FUB Hadoop 2-slaves cluster

1 master, 2 slaves (FUB machines)
Network: Gigabit Ethernet

EC2 Hadoop X-slaves cluster

1 master, X slaves (EC2 c1.medium instances)

Test Procedure

We used LDIF version 0.5 for the in-memory, TDB and Hadoop tests.

Single machine tests

For each single machine test, we applied the following procedure:

Clear Operating System caches
```
 echo 2 > /proc/sys/vm/drop_caches 
```

Run test

 java -server -Xmx20G -jar ldif-single-machine.jar

Hadoop tests

For each distributed test, we applied the following procedure:

Launch a master instance w/ EBS containing input data attached
Launch slave instances
Init the cluster (connect via SSH, clear all temporary data, format HDFS, run start-all)
Upload input data into HDFS (from EBS)
Run integration

Details about the Hadoop cluster configurations are available at this page.

Results

The following tables summarize the LDIF run times for the different dataset sizes. The overall run time is broken down by the processing steps of the integration process.

25M run times

Phase	In-memory	TDB	Hadoop FUB 2-slaves
Load and build entities for R2R	186.3 s	1334 s	688 s
R2R data translation	116.4 s	85 s	87 s
Build entities for Silk	12.9 s	139 s	225 s
Silk identity resolution	72.6 s	213 s	216 s
URIs rewriting	16.2 s	15 s	427 s
Overall execution	6.7 min	29.7 min	27.4 min
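
As a quick consistency check on the table above, the per-phase times can be summed and converted to minutes. The numbers are copied from the 25M table; the variable and function names are only illustrative.

```python
# Phase durations in seconds, copied from the 25M table above, in row order.
PHASES_25M = {
    "In-memory": [186.3, 116.4, 12.9, 72.6, 16.2],
    "TDB": [1334, 85, 139, 213, 15],
    "Hadoop FUB 2-slaves": [688, 87, 225, 216, 427],
}

def overall_minutes(phase_seconds):
    """Sum the phase durations and convert the total to minutes."""
    return sum(phase_seconds) / 60

for setup, seconds in PHASES_25M.items():
    print(f"{setup}: {overall_minutes(seconds):.1f} min")
```

The sums agree with the reported overall execution times to within about 0.1 min.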

100M run times

Phase	In-memory	TDB	Hadoop FUB 2-slaves
Load and build entities for R2R	822 s	7014 s	3112 s
R2R data translation	744 s	607 s	165 s
Build entities for Silk	81 s	806 s	405 s
Silk identity resolution	772 s	4656 s	544 s
URIs rewriting	113 s	118 s	1056 s
Overall execution	42.2 min	220 min	88 min

150M run times

Phase	In-memory	TDB	Hadoop FUB 2-slaves
Load and build entities for R2R	Out of Memory	13776 s	3830 s
R2R data translation	-	1206 s	170 s
Build entities for Silk	-	847 s	380 s
Silk identity resolution	-	5328 s	688 s
URIs rewriting	-	173 s	1235 s
Overall execution	-	355 min	105 min

300M run times

Phase	In-memory	TDB	Hadoop FUB 2-slaves
Load and build entities for R2R	Out of Memory	22870 s	6070 s
R2R data translation	-	1203 s	179 s
Build entities for Silk	-	1006 s	436 s
Silk identity resolution	-	8392 s	1022 s
URIs rewriting	-	176 s	1232 s
Overall execution	-	560 min	148 min

Phase	Hadoop EC2 8-slaves	Hadoop EC2 16-slaves	Hadoop EC2 32-slaves
Load and build entities for R2R	7933 s	4647 s	2382 s
R2R data translation	297 s	173 s	114 s
Build entities for Silk	646 s	421 s	324 s
Silk identity resolution	1546 s	932 s	580 s
URIs rewriting	2174 s	1430 s	1085 s
Overall execution	209 min	126 min	75 min
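
To see how the 300M run scales with the EC2 cluster size, the sketch below derives speedup and parallel efficiency relative to the 8-slave cluster. The per-phase seconds are copied from the table above; choosing the 8-slave run as the baseline, and the function name, are assumptions made for this illustration.

```python
# Summed phase durations in seconds for the 300M dataset on EC2, from the table above.
EC2_SECONDS = {
    8: 7933 + 297 + 646 + 1546 + 2174,
    16: 4647 + 173 + 421 + 932 + 1430,
    32: 2382 + 114 + 324 + 580 + 1085,
}

def scaling(seconds_by_slaves, baseline=8):
    """Return {slaves: (speedup, parallel efficiency)} relative to the baseline cluster."""
    base = seconds_by_slaves[baseline]
    result = {}
    for slaves, seconds in sorted(seconds_by_slaves.items()):
        speedup = base / seconds
        efficiency = speedup / (slaves / baseline)
        result[slaves] = (speedup, efficiency)
    return result

for slaves, (speedup, efficiency) in scaling(EC2_SECONDS).items():
    print(f"{slaves} slaves: speedup {speedup:.2f}x, efficiency {efficiency:.0%}")
```

Under these figures, quadrupling the slave count from 8 to 32 cuts the summed run time by a factor of roughly 2.8.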

3690M run times

Phase	Hadoop FUB 2-slaves (format: hh:mm:ss)
Load and build entities for R2R	9:34:44
R2R data translation	15:22
Build entities for Silk	51:12
Silk identity resolution	15:27:52
Find sameAs URI sets	1:44:24
URIs rewriting	1:45:29
Overall execution	29:09:03
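
Because this table uses clock-style durations rather than seconds, a small helper is useful for comparing it with the other tables. The sketch below converts the phase durations to seconds and reports each phase's share. It assumes that two-part values such as 15:22 mean mm:ss. Shares are computed relative to the sum of the listed phases rather than the reported overall time, since the two differ slightly.

```python
# Phase durations copied from the 3690M table above.
PHASES_3690M = {
    "Load and build entities for R2R": "9:34:44",
    "R2R data translation": "15:22",
    "Build entities for Silk": "51:12",
    "Silk identity resolution": "15:27:52",
    "Find sameAs URI sets": "1:44:24",
    "URIs rewriting": "1:45:29",
}

def to_seconds(text):
    """Convert 'hh:mm:ss' or 'mm:ss' (assumed reading of two-part values) to seconds."""
    seconds = 0
    for part in text.split(":"):
        seconds = seconds * 60 + int(part)
    return seconds

total = sum(to_seconds(value) for value in PHASES_3690M.values())
for phase, value in PHASES_3690M.items():
    secs = to_seconds(value)
    print(f"{phase}: {secs} s ({secs / total:.1%} of listed phases)")
```

In this breakdown, Silk identity resolution alone accounts for a little over half of the summed phase time.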

[1] According to this article, c1.medium instances with 5 ECU are observed to run as 2 of 4 cores of an Intel E5410 processor (4 cores, 2.33 GHz).