SPARQL is a query language currently under design by the Data Access Working Group of the World Wide Web Consortium. It is the first attempt to establish a neutral, standard query language for RDF data, in the hope of maturing the "Semantic Web" effort. With a standardized query language, the underlying RDF implementation can be swapped out while the client-side implementation stays the same. SQL is the positive precedent here: it was developed as a relational database query language, was accepted by the database community, and eases developer transitions between database servers. For more information on SPARQL, see the current W3C working draft or the SPARQL Manual.
This engine was originally created as an open source project funded by Google's Summer of Code 2005 program. The goal of the project was to create a SPARQL query handling engine for Sesame (an open source RDF server solution), which at the time did not have SPARQL handling capabilities. While investigating a solution, it became clear that it wouldn't take much more effort to make the engine usable by any RDF server that wanted SPARQL handling capabilities. Therefore, the design attempts to keep the engine very flexible while still retaining most of the efficiency of a custom-built query engine. An engine like this would be a boon for the RDF server field because 1) it would save time, since similar evaluation code wouldn't have to be written for each server, and 2) updates and changes to the specification could be easily propagated to client servers.
The engine and this document are intended for two audiences/use cases:

- developers of RDF servers who want to write an adapter so that their store can evaluate SPARQL queries, and
- users who want to run SPARQL queries against a store for which an adapter already exists.
Because the engine was originally meant to be fitted to Sesame, a lot of the terminology and class patterns have been taken from the Sesame 2 codebase. Regardless of which RDF server fits your needs, the design of the newest incarnation of Sesame is very elegant and efficient, and this document may reference ideas that were gleaned from that server. That being said, the engine does not depend on the Sesame server, nor is it written to be specific to it. It's possible that another server that uses a very different way of finding triples in its RDF persistence might not be able to use the engine because of the assumptions made. However, at least Jena and Sesame are quite compatible with the architecture, and it's hoped that many others will be as well.
Currently, there is only a released adapter for Sesame, so that's the only RDF store that can be used out of the box. However, a rough Jena adapter exists and will be ready for release shortly. Neither of these is fully optimized yet, so there is still much work to be done on them.
To understand the SPARQL engine or write an adapter for it, it's necessary to know a bit about the underlying design and architecture. For a full discussion, including justifications for the decisions made, see the Design Specification. For now, it's really only necessary to understand the basics of how the engine is used:
1. The static methods in the SPARQLParser class (parse(Reader) or parse(InputStream)) are used to obtain a Query object:

        Query query = SPARQLParser.parse(new StringReader(queryString));

2. The Query object is cast to its specific subinterface (SelectQuery, AskQuery, ConstructQuery, or DescribeQuery):

        if (query instanceof SelectQuery) { SelectQuery select = (SelectQuery) query; }

3. The execute(RdfSource) method is called on the Query object. All of the subinterfaces have this method; they just return different result classes:

        RdfBindingSet results = select.execute(new MyRdfSource());
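Putting the three steps together, here is a minimal end-to-end sketch. The package imports for the engine classes, the MyRdfSource implementation, and the way results are iterated are all assumptions that depend on your adapter and engine version:

```java
import java.io.StringReader;

// Minimal usage sketch; imports for SPARQLParser, Query, SelectQuery,
// RdfSource, and RdfBindingSet depend on the engine's package layout.
public class BasicUseExample {
    public static void main(String[] args) throws Exception {
        String queryString = "SELECT ?person WHERE { ?person ?p ?o }";

        // 1. Parse the query text into a Query object.
        Query query = SPARQLParser.parse(new StringReader(queryString));

        // 2. Narrow to the specific query form before executing.
        if (query instanceof SelectQuery) {
            SelectQuery select = (SelectQuery) query;

            // 3. Evaluate against your RdfSource adapter (MyRdfSource is
            // a stand-in for whatever adapter you use or write).
            RdfBindingSet results = select.execute(new MyRdfSource());

            // How the bindings are iterated is engine-version specific.
        }
    }
}
```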
As can be seen, it's not a terribly complex process. From the engine's point of view, the only thing it really needs for query evaluation is the RdfSource implementation. That interface has seven methods. Six of those are get/hasStatement methods that allow wildcard matching against statements in the RDF store under various named graph contexts. The seventh provides access to the ValueFactory interface, which creates the Value objects (literals, blank nodes, URIs) for the query in order to provide comparable base value objects.
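The exact signatures vary by release, but based on the description above the shape of the interface is roughly the following sketch. Only the get/hasStatement pattern, the seven-method count, and the ValueFactory accessor come from the description; the method names for the named-graph variants, and null acting as the wildcard, are assumptions:

```java
import java.util.Iterator;
import org.openrdf.model.Value;
import org.openrdf.model.ValueFactory;

// Rough sketch of the RdfSource contract; the real signatures may
// differ. Each Iterator yields the engine's GraphStatement objects,
// and null arguments are assumed to act as wildcards in the match.
public interface RdfSourceSketch {
    // Match against the default graph.
    Iterator getDefaultStatements(Value subj, Value pred, Value obj);
    boolean hasDefaultStatement(Value subj, Value pred, Value obj);

    // Match against statements in any named graph.
    Iterator getGraphStatements(Value subj, Value pred, Value obj);
    boolean hasGraphStatement(Value subj, Value pred, Value obj);

    // Match against a specific named graph context.
    Iterator getGraphStatements(Value subj, Value pred, Value obj, Value graph);
    boolean hasGraphStatement(Value subj, Value pred, Value obj, Value graph);

    // Access to the factory that creates comparable Value objects.
    ValueFactory getValueFactory();
}
```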
First of all, it's advised to download the Sesame adapter source for a working example.
The first requirement in writing a SPARQL adapter is to write an implementation/adapter of RdfSource for your choice of persistence. At least Jena and Sesame use an underlying wildcard matching scheme to query their persistence model. If your choice of RDF server does not, it may be a bit more difficult/inefficient to model that behavior.
One of the methods in the RdfSource interface returns a ValueFactory. This is used to create consistently comparable value objects, so that when value equality or binding intersections are calculated, the same representations are being compared. As an example, if one server returns Value objects that test literal equality on the label alone whereas another uses the whole signature, it could create issues when expressions return query-native values that are compared against bound values. It's advisable to create a ValueFactory implementation that is responsible for creating Value adapters around the internal objects the RDF server uses to represent values.
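As a sketch of the idea (NativeLiteral and its accessors are stand-in names for your server's own literal class; in a real adapter this class would also implement whatever openrdf Literal interface the engine expects):

```java
// Hypothetical literal adapter: equality is defined over the whole
// signature (label, language, datatype), not just the label, so the
// engine always compares values consistently.
public class LiteralAdapter {
    private final NativeLiteral wrapped; // stand-in for the server's class

    public LiteralAdapter(NativeLiteral wrapped) { this.wrapped = wrapped; }

    public String getLabel()    { return wrapped.getText(); }    // assumed accessor
    public String getLanguage() { return wrapped.getLang(); }    // assumed accessor
    public String getDatatype() { return wrapped.getTypeUri(); } // assumed accessor

    public NativeLiteral getWrapped() { return wrapped; }

    public boolean equals(Object o) {
        if (!(o instanceof LiteralAdapter)) return false;
        LiteralAdapter other = (LiteralAdapter) o;
        return eq(getLabel(), other.getLabel())
            && eq(getLanguage(), other.getLanguage())
            && eq(getDatatype(), other.getDatatype());
    }

    public int hashCode() { return getLabel().hashCode(); }

    private static boolean eq(Object a, Object b) {
        return (a == null) ? (b == null) : a.equals(b);
    }
}
```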
The matching methods in the RdfSource interface return an Iterator of GraphStatement objects. Most likely, you will need to implement an Iterator that walks the results returned by the persistence layer, along with GraphStatement implementations to return. The GraphStatements consist of Value objects, which should probably just be created with the ValueFactory you implemented to wrap the RDF-native objects.
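A hedged sketch of that wiring, where NativeStatement, MyGraphStatement, and the factory's wrap(...) method are stand-in names:

```java
import java.util.Iterator;

// Hypothetical iterator adapter: walks the server's native results and
// hands the engine GraphStatement objects built from wrapped Values.
public class StatementIteratorAdapter implements Iterator {
    private final Iterator nativeResults; // yields NativeStatement objects
    private final MyValueFactory factory; // the ValueFactory adapter from above

    public StatementIteratorAdapter(Iterator nativeResults, MyValueFactory factory) {
        this.nativeResults = nativeResults;
        this.factory = factory;
    }

    public boolean hasNext() { return nativeResults.hasNext(); }

    public Object next() {
        NativeStatement st = (NativeStatement) nativeResults.next();
        // Wrap each position so the engine gets consistently comparable
        // Value objects (MyGraphStatement's constructor is assumed).
        return new MyGraphStatement(factory.wrap(st.getSubject()),
                                    factory.wrap(st.getPredicate()),
                                    factory.wrap(st.getObject()));
    }

    public void remove() { throw new UnsupportedOperationException(); }
}
```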
The methods in the RdfSource adapter all use openrdf Value objects to specify the statement to match. It's doubtful that the underlying RDF server uses the same interface, so most likely an adapter for the Value objects coming from the statement pattern will have to be created as well (we already handled the Value objects coming from the results). It's advised that these be adapters too, since that is slightly more memory efficient, though in the query this matters far less than in the results. If you used a ValueFactory as above, you can actually just cast the Value objects to your adapter classes and then unwrap them for persistence evaluation.
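In code, the unwrap step might look like this fragment (ValueAdapter, NativeValue, and getWrapped are the stand-in names from the sketches above; null must be preserved if it represents the wildcard, which is an assumption here):

```java
// Hypothetical fragment inside a matching method of the RdfSource
// adapter: cast the incoming openrdf Value back to our adapter class
// and unwrap it so the native persistence layer can evaluate the match.
NativeValue nativeSubj = (subj == null)
    ? null                                 // null stays null: the wildcard
    : ((ValueAdapter) subj).getWrapped();
```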
The results of a query are either RdfBindingSet, RdfGraph, or boolean returns. These objects probably need to be converted to the native format of the RDF server for post-processing, or passed out through a common interface. This is fairly implementation specific, though, so we'll leave that discussion out.
On the front-end side of the adapter, the server's code most likely has to be modified somewhat to allow SPARQL evaluation, which usually means adding the objects or methods that host the code listed in the basic use section above.
This is essentially the same process as writing an adapter, minus worrying about post-handling the binding sets and graphs or pre-handling the query objects. If an adapter already exists, then no adapter work is necessary. Just use the code as described in the basic use section and work with the RdfBindingSet and RdfGraph objects returned by the SPARQLParser and Query objects.
One of the more complex and powerful aspects of the SPARQL engine is the ability to customize the logic within the engine in order to take advantage of implementation-specific nuances or persistence-dependent optimizations. This is slightly more involved and requires a good understanding of the design of the SPARQL engine, so read up.
Though figuring out how to extend the logic of the engine is tricky, it's not hard to understand how it works. The query is parsed, but at that point it's just data; it can't be evaluated in its current state. At evaluation time, a visitor passes over the query structure with a logic factory, which each data part of the query uses to figure out how to handle its aspect of the query evaluation. A user can override the default logic of a query by providing their own logic factory before evaluation time, one that returns custom logic objects that evaluate differently. Essentially, this means you can rewrite the entire evaluation of a query yourself!
One point worth noting on this process:
The exceptions to the non-extendible grammar are the external functions that the specification allows in the language (see section 11). The default factory contains a map from function URIs to ExternalFunction objects, which is used to bind parsed function URIs to the logic that evaluates them. These functions are only used in FILTER value constraints, but they can provide some nice shortcuts and elegance in value evaluation. Just create an instance of the default factory (or whatever extended factory you're using) and call its registerExternalFunction(...) method.
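A sketch of the registration step. The factory class name, the argument types, and the SquareFunction implementation are assumptions; only registerExternalFunction(...) is named above:

```java
// Hypothetical registration sketch: bind a function URI to the logic
// that evaluates it, then hand the factory to the engine before
// evaluation. SquareFunction would implement ExternalFunction, and
// valueFactory is the ValueFactory adapter from earlier.
DefaultLogicFactory factory = new DefaultLogicFactory(); // assumed class name
factory.registerExternalFunction(
    valueFactory.createURI("http://example.org/fn#square"), // assumed URI form
    new SquareFunction());
```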
The SPARQL engine currently comes with several different algorithms for combinatory set logic, which serves as the basis of how all queries are evaluated. Using techniques such as indexing and streaming, it's possible to trade memory space for efficiency and caching. The current logic implementations for each set operation are compared below:
Distinction:

Class Name | Indexed | Cached/Streamed (Efficiency) | Extra Space |
---|---|---|---|
StaticSetDistinctionLogic | Yes | Cached (O(n)) | Large O(n) |
IndexedSetDistinctionLogic (default) | Yes | Streamed (O(1)) | Small O(n) |
StreamedSetDistinctionLogic | No | Streamed (O(n)) | O(1) |
Intersection:

Class Name | Indexed | Cached/Streamed (Efficiency) | Extra Space |
---|---|---|---|
NaiveSetIntersectLogic | No | Cached (O(n*m)) | O(n*m) |
StaticSetIntersectLogic (not done) | Yes | Cached (O(n*s + m)) | O(n*m) |
IndexedSetIntersectLogic (default) | Yes | Streamed (O(s)) | O(m) |
StreamedSetIntersectLogic (not done) | No | Streamed (O(m)) | O(1) |
Join:

Class Name | Indexed | Cached/Streamed (Efficiency) | Extra Space |
---|---|---|---|
NaiveSetJoinLogic | No | Cached (O(n*m)) | O(n*m) |
StaticSetJoinLogic (not done) | Yes | Cached (O(n*s + m)) | O(n*m) |
IndexedSetJoinLogic (default) | Yes | Streamed (O(s)) | O(m) |
StreamedSetJoinLogic (not done) | No | Streamed (O(m)) | O(1) |
Projection:

Class Name | Cached/Streamed (Efficiency) | Extra Space |
---|---|---|
StaticSetProjectionLogic | Cached (O(n)) | O(n) |
StreamedSetProjectionLogic (default) | Streamed (O(1)) | O(1) |
Range:

Class Name | Cached/Streamed (Efficiency) | Extra Space |
---|---|---|
StaticSetRangeLogic | Cached (O(n)) | O(n) |
StreamedSetRangeLogic (default) | Streamed (O(1)) | O(1) |
Union:

Class Name | Cached/Streamed (Efficiency) | Extra Space |
---|---|---|
StaticSetUnionLogic | Cached (O(n+m)) | O(n+m) |
StreamedSetUnionLogic (default) | Streamed (O(1)) | O(1) |
To change the set logic used by the engine, follow the customization process detailed above and override the default factory methods that return the particular piece of set logic to be changed.
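For example, a hedged sketch of swapping the intersection logic. The factory class name, the getSetIntersectLogic method, and the SetIntersectLogic return type are assumptions; only the logic class names come from the tables above:

```java
// Hypothetical factory subclass: return a different set logic than the
// default. Method and return-type names are assumed; the logic
// implementations themselves are the ones documented in the tables.
public class MyLogicFactory extends DefaultLogicFactory {
    public SetIntersectLogic getSetIntersectLogic() {
        // Use the naive cached intersection instead of the default
        // IndexedSetIntersectLogic.
        return new NaiveSetIntersectLogic();
    }
}
```

The factory would then be supplied to the engine before evaluation time, as described in the customization section above.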