WebPIE

WebPIE (Web-scale Parallel Inference Engine) is a MapReduce distributed RDFS/OWL inference engine written using the Hadoop framework. The engine applies the RDFS and OWL ter Horst rules and materializes all the derived statements.

WebPIE encodes Semantic Web reasoning as a set of Map and Reduce operations, tapping into the power of the MapReduce programming model for performance, fault tolerance and simplified administration. It aims at high scalability in terms of both the number of processing nodes and the data size. This is achieved by optimizing the execution of the joins required to apply the RDFS and OWL ter Horst rules.

On this page we collect all the relevant material published so far on this project. We have released the source code under an open source license; it can be downloaded as explained below. We have also created an Amazon AWS image with the reasoner preinstalled, so that you can compute the inference on your own with almost no effort. The tutorial below explains how to do this step by step.

We are interested in your experience using WebPIE: did it work for you? What tasks did you try? What performance did you get? Please keep us informed by sending a brief email to
j DOT urbani AT few DOT vu DOT nl

Documentation

We have published several papers on WebPIE. If you want to better understand how it works and see how it performs, please read the relevant material:

Urbani, J., Kotoulas, S., Maassen, J., Drost, N., Seinstra, F., van Harmelen, F. & Bal, H. (2010), WebPIE: a Web-scale Parallel Inference Engine, Submission to the SCALE competition at CCGrid '10. [DOWNLOAD]
Urbani, J., Kotoulas, S., Maassen, J., van Harmelen, F. & Bal, H. (2010), OWL reasoning with WebPIE: calculating the closure of 100 billion triples, In Proceedings of ESWC '10. [DOWNLOAD]
Urbani, J., Maassen, J. & Bal, H. (2010), Massive Semantic Web data compression with MapReduce, In Proceedings of the MapReduce workshop at HPDC 2010. [DOWNLOAD]
Urbani, J., Kotoulas, S., Oren, E. & van Harmelen, F. (2009), Scalable Distributed Reasoning using MapReduce, In Proceedings of ISWC '09. (Honorable mention, best student paper award.) [DOWNLOAD]
Urbani, J. (2009), RDFS/OWL reasoning using the MapReduce framework, Master's thesis, Vrije Universiteit. [DOWNLOAD]
Here you can also download a presentation that describes WebPIE and its performance at a high level.

History

This project started as Jacopo Urbani's master thesis in spring 2009. Initially the program was called reasoning-hadoop and it was written against version 0.19.1 of the Hadoop framework. At first the reasoner supported only RDFS reasoning, but it was later extended to the OWL ter Horst fragment. Finally, the project was renamed WebPIE and ported to Hadoop version 0.20.2.

Source code

The 0.19.1 version can still be downloaded at launchpad but is no longer maintained. The new version is incompatible with earlier Hadoop versions and runs only on Hadoop 0.20.2 or later. At the moment, the source code of this new version can only be downloaded here (webpie-1.1.1.tar.gz).

In order to run WebPIE on Amazon, you need to download some scripts to launch and terminate the clusters. The files are available here.

Tutorial: launch WebPIE on Amazon

In this tutorial, we explain how to launch the WebPIE reasoner on the Amazon EC2 cloud. Amazon offers users the possibility to reserve a cluster of machines and pay only for the time the machines are actually used. Before you proceed with the rest of the tutorial, you must set up the Amazon EC2 API command-line tools by following these instructions.

Using this service it is possible, for example, to launch a cluster of 10 machines, compute the closure using WebPIE, and terminate the cluster. The price that Amazon charges is proportional to the time, type and number of machines used.

Requirements

In order to launch WebPIE on the Amazon cluster, you first need to create an account for Amazon EC2 (in addition to your Amazon Web Services account) as explained on the Amazon website. You also need to download the scripts described above.

The scripts we have prepared are a modified version of the scripts distributed with the official Hadoop distribution. Since they are Unix shell scripts, Windows users need to download Cygwin and launch them from this environment.

1st step: set up and launch the cluster

First, you must download the following files: (i) the AWS EC2 certificate (starting with "cert-" and with extension ".pem"), (ii) the private key of the AWS EC2 certificate (starting with "pk-" and ending with ".pem"), and (iii) the private key of the security key pair you want to use to log in to the system (this last file must have 0600 permissions). Copy these files into the same directory and set the following environment variables. Note that you may need to generate a certificate and a key on the Amazon website and in the EC2 Management Console.

  • EC2_CERT: absolute path of the AWS EC2 certificate
  • EC2_PRIVATE_KEY: absolute path of the AWS EC2 private key
  • EC2_URL: URL of the EC2 region. You must set it to: https://ec2.eu-west-1.amazonaws.com
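For example, assuming a bash shell and that the files were saved in ~/.ec2 (the file names below are only placeholders for the certificate and key you actually downloaded), the variables can be set as follows:
export EC2_CERT=~/.ec2/cert-XXXXXXXXXX.pem        # placeholder file name
export EC2_PRIVATE_KEY=~/.ec2/pk-XXXXXXXXXX.pem   # placeholder file name
export EC2_URL=https://ec2.eu-west-1.amazonaws.com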

Next, we must configure the scripts with the credentials of our Amazon account. Uncompress the archive and copy the file hadoop-ec2-env.sh.template to hadoop-ec2-env.sh.

There are some variables that you must set in the file you just copied. These are:

  • AWS_ACCOUNT_ID: your Amazon AWS account number
  • KEY_NAME: the name of the security key pair that you want to use to log in to the machines. You can create it in the AWS console manager.
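For illustration, the copy and the two settings could look as follows (the account number and key pair name below are placeholders, not real values):
cp hadoop-ec2-env.sh.template hadoop-ec2-env.sh
# inside hadoop-ec2-env.sh:
AWS_ACCOUNT_ID=123456789012
KEY_NAME=webpie-keypair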
Now you should be able to launch a Hadoop cluster on the Amazon cloud. Move to the directory containing the scripts and type:
./launch-hadoop-cluster webpie number-of-nodes
The parameter number-of-nodes indicates the number of slaves you want to use for your cluster. When determining the size of your cluster, consider that it always consists of 1 master plus n slaves. Therefore, if you set this number to 4, your cluster will be composed of 5 machines: 1 master + 4 slaves.
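For example, the following command starts a cluster of 1 master and 4 slaves:
./launch-hadoop-cluster webpie 4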

Starting the nodes will take a couple of minutes. Check the EC2 Management Console and wait until all your instances are running. You will find this information when you follow the "Running Instances" link. Warning: Amazon EC2 will start billing you from this point on. Make sure that you terminate your instances after you are finished.
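If you prefer the command line, the EC2 API tools configured earlier also include a command that lists your instances and their state; assuming the tools are on your PATH, a quick check is:
ec2-describe-instances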

2nd step: upload the input data to the cluster

Before we launch the reasoner, we need to upload the input data to the HDFS filesystem and compress it into a suitable format. The input data must consist of gzip-compressed files of triples in N-Triples format. Try to keep the files at similar sizes, and provide more files than there are CPU cores, since each file will be processed by one machine.
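For example, a single large N-Triples file can be split into several gzip-compressed chunks before uploading. A minimal sketch, assuming a file called triples.nt and chunks of roughly one million triples each (the file names are only illustrative):
split -l 1000000 triples.nt chunk_    # split into chunks of about 1M triples
gzip chunk_*                          # gzip-compress each chunk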

We first log in to the master node by launching the command:
./cmd-hadoop-cluster login webpie
Now we are logged in to the master node of our cluster. With the command hadoop we can access the HDFS filesystem and later launch our reasoning jobs. For example, the command
hadoop fs -ls /
will list the root directory of the HDFS filesystem.

Now, you must create the "/input" directory with the command:

hadoop fs -mkdir /input

The input data can be copied to HDFS in several ways. For example, we could store it in Amazon S3 and download it to HDFS. In this tutorial we explain the easiest way, which consists of first copying the data to the master node and then uploading it to HDFS.

We first copy the data to the cluster master machine with the command:

./cmd-hadoop-cluster push webpie input_triples.tar.gz
With this command we copy a generic input archive called input_triples.tar.gz into the home directory of the master node. Once we are logged in to the master node again, we can uncompress the archive (for example with tar xzf input_triples.tar.gz) and copy its content to HDFS with the command:
hadoop fs -copyFromLocal input-files /input
This command copies a directory called "input-files", containing the compressed N-Triples files, to the HDFS directory "/input". Once this operation is completed, we can proceed to compress the input data.

3rd step: compress the input data

The reasoner works with the data in a compressed format. We compress the data with the command:

hadoop jar webpie.jar jobs.FilesImportTriples /input /tmp /pool --maptasks 4 --reducetasks 2 --samplingPercentage 10 --samplingThreshold 1000
This command takes several parameters:
  • maptasks: indicates how many map tasks the job should be split into. Set this value close to the total number of CPU cores in the cluster.
  • reducetasks: the same as above, but for the number of reduce tasks. Set this value close to the total number of CPU cores.
  • samplingPercentage: the compression job extracts the popular resources and treats them differently. This parameter sets the sampling percentage (in this case 10%) used to determine whether a resource is popular or not.
  • samplingThreshold: this parameter is the threshold used to recognize the popular resources within the sample. All resources that appear in the sample more than this number of times are considered popular.

The above command can be read as: launch the compression and split the job between 4 map tasks and 2 reduce tasks, sample the input using 10% of the data, and mark as popular all resources that appear more than 1000 times in this sample.
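On a larger cluster the task counts should be scaled up accordingly. For instance, assuming the 4-slave cluster from the first step has 2 CPU cores per slave (the actual count depends on the instance type you chose), one could use:
hadoop jar webpie.jar jobs.FilesImportTriples /input /tmp /pool --maptasks 8 --reducetasks 8 --samplingPercentage 10 --samplingThreshold 1000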

After this job has finished, the directory /pool contains the compressed input data and we can proceed with the reasoning.

4th step: reasoning

The reasoning process is started with the command:
hadoop jar webpie.jar jobs.Reasoner /pool --fragment owl --rulesStrategy fixed --reducetasks 2 --samplingPercentage 10 --samplingThreshold 1000

With this command we start a sequence of MapReduce jobs which apply all the RDFS and OWL rules to the input directory ("/pool"). If we set the --fragment parameter to RDFS, the reasoner will apply only the RDFS rules.
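For example, an RDFS-only run would mirror the command above, with only the fragment value changed (here assumed to be rdfs, in lower case like owl above; check the program's usage message if in doubt):
hadoop jar webpie.jar jobs.Reasoner /pool --fragment rdfs --rulesStrategy fixed --reducetasks 2 --samplingPercentage 10 --samplingThreshold 1000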

The directory "/pool" has a specific structure and should not be changed manually. Once the reasoning process is finished, the subdirectories "dir-rdfs-output" and "dir-owl-output" will contain all the derived triples in a compressed format. The last operation we need to perform is to decompress these triples back into the original N-Triples format.

5th step: decompress the data and download it

The directory "pool" contains both the input files and the derived ones. If we look inside we note that there are several subdirectories:
  • _dict: this directory contains the dictionary table.
  • dir-synonyms-table: this directory may not exist. It contains the synonyms table.
  • dir-input: the triples in input
  • dir-rdfs-output: the triples derived using the RDFS rules
  • dir-owl-output: the triples derived using the OWL rules
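You can inspect this layout directly with a standard HDFS listing, for example:
hadoop fs -ls /pool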
If we want to decompress the triples we need to launch the command:
hadoop jar webpie.jar jobs.FilesExportTriples /pool /uncompressed_data --maptasks 4 --reducetasks 2 --samplingPercentage 10 --samplingThreshold 1000

This command takes the directory "/pool" as input and the directory "/uncompressed_data" as output. After the command has terminated, /uncompressed_data will contain the triples back in their original format. Keep in mind that this procedure converts both the input triples and the derived ones. If you want to convert only the derived triples, you must first remove (or move) the directory "/pool/dir-input" and then launch the command, as sketched below.
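A minimal sketch of this last case, where the input directory is moved aside before exporting (the backup path is only an example):
hadoop fs -mv /pool/dir-input /dir-input-backup
hadoop jar webpie.jar jobs.FilesExportTriples /pool /uncompressed_data --maptasks 4 --reducetasks 2 --samplingPercentage 10 --samplingThreshold 1000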

Once you have uncompressed the data, you can copy it locally with the command:

hadoop fs -copyToLocal /uncompressed_data .

Finally, the last operation you need to perform is to terminate the Amazon cluster. You can do this by launching the following script on your local machine:

./terminate-hadoop-cluster webpie

and confirm with "yes".

Important: If you do not terminate the cluster, you will continue to be billed by Amazon EC2.

Congratulations! You ran WebPIE on the cloud!

Warning: We cannot be held responsible for any charges incurred on Amazon EC2. Please verify that you have stopped your instances.

Acknowledgments

WebPIE has received crucial support from the LarKC project and, in particular, from Eyal Oren, Spyros Kotoulas and Frank van Harmelen.