SPARQL is a query language currently under design by the Data Access Working Group of the World Wide Web Consortium. It is the first attempt to establish a neutral, standard query language for RDF data, in the hope of maturing the "Semantic Web" effort. With a standardized query language, the underlying RDF implementation can be swapped out while the client-side implementation stays the same. SQL is the positive precedent here: it was developed as a relational database query language, was accepted by the database community, and eases developer transitions between database servers. For more information on SPARQL, see the current W3C working draft or the SPARQL Manual.
This engine was originally created as an open source project funded by Google's Summer of Code 2005 program. The goal of the project was to create a SPARQL query handling engine for Sesame (an open source RDF server solution), which at the time did not have SPARQL handling capabilities. While investigating a solution, it became clear that it wouldn't take much more effort to make the engine usable by any RDF server that wanted SPARQL handling capabilities. Therefore, the design attempts to keep the engine very flexible while still retaining most of the efficiency of a custom-built query engine. An engine like this would be a boon for the RDF server field because 1) it would save time, since similar evaluation code wouldn't have to be written for each server, and 2) updates and changes to the specification could be easily propagated to client servers.
The engine and this document are intended for two audiences/use cases:

- developers of RDF servers who want to write an adapter so that their store can evaluate SPARQL queries, and
- users who want to run SPARQL queries against a store for which an adapter already exists.
Because the engine was originally meant to be fitted to Sesame, a lot of the terminology and class patterns have been taken from the Sesame 2 codebase. Regardless of which RDF server fits your needs, the design of the newest incarnation of Sesame is very elegant and efficient, and this document may reference ideas that were gleaned from that server. That being said, the engine does not depend on the Sesame server, nor is it written to be specific to it. It's possible that another server that uses a very different way of finding triples in its RDF persistence might not be able to use the engine because of the assumptions made. However, at least Jena and Sesame are quite compatible with the architecture, and it's hoped that many others will be as well.
Currently, there is only a released adapter for Sesame, so that's the only RDF store that can be used out of the box. However, a rough Jena adapter exists and will be ready for release shortly. Neither of these is fully optimized yet, so there is still much work to be done on them.
To understand the SPARQL engine or write an adapter for it, it's necessary to know a bit about the underlying design and architecture. For a full discussion, including justifications for the decisions made, see the Design Specification. For now, it's really only necessary to understand the basics of how the engine is used:
1. The static methods in the SPARQLParser class (parse(Reader) or parse(InputStream)) are used to obtain a Query object:

        Query query = SPARQLParser.parse(new StringReader(queryString));

2. The Query object is cast to its specific subinterface (SelectQuery, AskQuery, ConstructQuery, or DescribeQuery):

        if (query instanceof SelectQuery) { SelectQuery select = (SelectQuery) query; }

3. The execute(RdfSource) method is called on the Query object. All of the subinterfaces have this method; they just return different result classes:

        RdfBindingSet results = select.execute(new MyRdfSource());
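Putting the three steps together, here is a minimal end-to-end sketch. The package imports for the engine classes, the MyRdfSource implementation, and the way results are iterated are all assumptions that depend on your adapter and engine version:

```java
import java.io.StringReader;

// Minimal usage sketch; imports for SPARQLParser, Query, SelectQuery,
// RdfSource, and RdfBindingSet depend on the engine's package layout.
public class BasicUseExample {
    public static void main(String[] args) throws Exception {
        String queryString = "SELECT ?person WHERE { ?person ?p ?o }";

        // 1. Parse the query text into a Query object.
        Query query = SPARQLParser.parse(new StringReader(queryString));

        // 2. Narrow to the specific query form before executing.
        if (query instanceof SelectQuery) {
            SelectQuery select = (SelectQuery) query;

            // 3. Evaluate against your RdfSource adapter (MyRdfSource is
            // a stand-in for whatever adapter you use or write).
            RdfBindingSet results = select.execute(new MyRdfSource());

            // How the bindings are iterated is engine-version specific.
        }
    }
}
```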
As can be seen, it's not a terribly complex process. From the engine's point of view, the only thing it really needs for query evaluation is the RdfSource implementation. That interface has seven methods. Six of those are get/hasStatement methods that allow wildcard matching against statements in the RDF store under various named graph contexts. The seventh provides access to the ValueFactory interface, which creates the Value objects (literals, blank nodes, URIs) for the query in order to provide comparable base value objects.
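The exact signatures vary by release, but based on the description above the shape of the interface is roughly the following sketch. Only the get/hasStatement pattern, the seven-method count, and the ValueFactory accessor come from the description; the method names for the named-graph variants, and null acting as the wildcard, are assumptions:

```java
import java.util.Iterator;
import org.openrdf.model.Value;
import org.openrdf.model.ValueFactory;

// Rough sketch of the RdfSource contract; the real signatures may
// differ. Each Iterator yields the engine's GraphStatement objects,
// and null arguments are assumed to act as wildcards in the match.
public interface RdfSourceSketch {
    // Match against the default graph.
    Iterator getDefaultStatements(Value subj, Value pred, Value obj);
    boolean hasDefaultStatement(Value subj, Value pred, Value obj);

    // Match against statements in any named graph.
    Iterator getGraphStatements(Value subj, Value pred, Value obj);
    boolean hasGraphStatement(Value subj, Value pred, Value obj);

    // Match against a specific named graph context.
    Iterator getGraphStatements(Value subj, Value pred, Value obj, Value graph);
    boolean hasGraphStatement(Value subj, Value pred, Value obj, Value graph);

    // Access to the factory that creates comparable Value objects.
    ValueFactory getValueFactory();
}
```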
First of all, it's advised to download the Sesame adapter source for a working example.
The first requirement in writing a SPARQL adapter is to write an implementation/adapter of RdfSource for your choice of persistence. At least Jena and Sesame use an underlying wildcard matching scheme to query their persistence model. If your choice of RDF server does not, it may be a bit more difficult/inefficient to model that behavior.
One of the methods in the RdfSource interface returns a ValueFactory. This is used to create consistently comparable value objects, so that when value equality or binding intersections are calculated, the same representations are being compared. As an example, if one server returns Value objects that test literal equality on the label alone whereas another uses the whole signature, it could create issues when expressions return query-native values that are compared against bound values. It's advisable to create a ValueFactory implementation that is responsible for creating Value adapters around the internal objects the RDF server uses to represent values.
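As a sketch of the idea (NativeLiteral and its accessors are stand-in names for your server's own literal class; in a real adapter this class would also implement whatever openrdf Literal interface the engine expects):

```java
// Hypothetical literal adapter: equality is defined over the whole
// signature (label, language, datatype), not just the label, so the
// engine always compares values consistently.
public class LiteralAdapter {
    private final NativeLiteral wrapped; // stand-in for the server's class

    public LiteralAdapter(NativeLiteral wrapped) { this.wrapped = wrapped; }

    public String getLabel()    { return wrapped.getText(); }    // assumed accessor
    public String getLanguage() { return wrapped.getLang(); }    // assumed accessor
    public String getDatatype() { return wrapped.getTypeUri(); } // assumed accessor

    public NativeLiteral getWrapped() { return wrapped; }

    public boolean equals(Object o) {
        if (!(o instanceof LiteralAdapter)) return false;
        LiteralAdapter other = (LiteralAdapter) o;
        return eq(getLabel(), other.getLabel())
            && eq(getLanguage(), other.getLanguage())
            && eq(getDatatype(), other.getDatatype());
    }

    public int hashCode() { return getLabel().hashCode(); }

    private static boolean eq(Object a, Object b) {
        return (a == null) ? (b == null) : a.equals(b);
    }
}
```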
The matching methods in the RdfSource interface return an Iterator of GraphStatement objects. Most likely, you will need to implement an Iterator that walks the results returned by the persistence layer, along with GraphStatement implementations to return. The GraphStatements consist of Value objects, which should probably just be created with the ValueFactory you implemented to wrap the RDF-native objects.
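A hedged sketch of that wiring, where NativeStatement, MyGraphStatement, and the factory's wrap(...) method are stand-in names:

```java
import java.util.Iterator;

// Hypothetical iterator adapter: walks the server's native results and
// hands the engine GraphStatement objects built from wrapped Values.
public class StatementIteratorAdapter implements Iterator {
    private final Iterator nativeResults; // yields NativeStatement objects
    private final MyValueFactory factory; // the ValueFactory adapter from above

    public StatementIteratorAdapter(Iterator nativeResults, MyValueFactory factory) {
        this.nativeResults = nativeResults;
        this.factory = factory;
    }

    public boolean hasNext() { return nativeResults.hasNext(); }

    public Object next() {
        NativeStatement st = (NativeStatement) nativeResults.next();
        // Wrap each position so the engine gets consistently comparable
        // Value objects (MyGraphStatement's constructor is assumed).
        return new MyGraphStatement(factory.wrap(st.getSubject()),
                                    factory.wrap(st.getPredicate()),
                                    factory.wrap(st.getObject()));
    }

    public void remove() { throw new UnsupportedOperationException(); }
}
```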
The methods in the RdfSource adapter all use openrdf Value objects to specify the statement to match. It's doubtful that the underlying RDF server uses the same interface, so most likely an adapter for the Value objects coming from the statement pattern will have to be created as well (we already handled the Value objects coming from the results). It's advised that these be adapters too, since that is slightly more memory efficient, though in the query this matters far less than in the results. If you used a ValueFactory as above, you can actually just cast the Value objects to your adapter classes and then unwrap them for persistence evaluation.
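In code, the unwrap step might look like this fragment (ValueAdapter, NativeValue, and getWrapped are the stand-in names from the sketches above; null must be preserved if it represents the wildcard, which is an assumption here):

```java
// Hypothetical fragment inside a matching method of the RdfSource
// adapter: cast the incoming openrdf Value back to our adapter class
// and unwrap it so the native persistence layer can evaluate the match.
NativeValue nativeSubj = (subj == null)
    ? null                                 // null stays null: the wildcard
    : ((ValueAdapter) subj).getWrapped();
```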
The results of a query are either RdfBindingSet, RdfGraph, or boolean returns. These objects probably need to be converted to the native format of the RDF server for post-processing, or passed out through a common interface. This is fairly implementation specific, though, so we'll leave that discussion out.
On the front-end side of the adapter, the server's code most likely has to be modified somewhat to allow SPARQL evaluation, which usually means adding the objects or methods that host the code listed in the basic use section above.
This is essentially the same process as writing an adapter, minus worrying about post-handling the binding sets and graphs or pre-handling the query objects. If an adapter already exists, then no adapter work is necessary. Just use the code as described in the basic use section and work with the RdfBindingSet and RdfGraph objects returned by the SPARQLParser and Query objects.
One of the more complex and powerful aspects of the SPARQL engine is the ability to customize the logic within the engine in order to take advantage of implementation-specific nuances or persistence-dependent optimizations. This is slightly more involved and requires a good understanding of the design of the SPARQL engine, so read up.
Though figuring out how to extend the logic of the engine is tricky, it's not hard to understand how it works. The query is parsed, but at that point it's just data; it can't be evaluated in its current state. At evaluation time, a visitor passes over the query structure with a logic factory, which each data part of the query uses to figure out how to handle its aspect of the query evaluation. A user can override the default logic of a query by providing their own logic factory before evaluation time, one that returns custom logic objects that evaluate differently. Essentially, this means you can rewrite the entire evaluation of a query yourself!
One point worth noting on this process:
The exceptions to the non-extendible grammar are the external functions that the specification allows in the language (see section 11). The default factory contains a map from function URIs to ExternalFunction objects, which is used to bind parsed function URIs to the logic that evaluates them. These functions are only used in FILTER value constraints, but they can provide some nice shortcuts and elegance in value evaluation. Just create an instance of the default factory (or whatever extended factory you're using) and call its registerExternalFunction(...) method.
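A sketch of the registration step. The factory class name, the argument types, and the SquareFunction implementation are assumptions; only registerExternalFunction(...) is named above:

```java
// Hypothetical registration sketch: bind a function URI to the logic
// that evaluates it, then hand the factory to the engine before
// evaluation. SquareFunction would implement ExternalFunction, and
// valueFactory is the ValueFactory adapter from earlier.
DefaultLogicFactory factory = new DefaultLogicFactory(); // assumed class name
factory.registerExternalFunction(
    valueFactory.createURI("http://example.org/fn#square"), // assumed URI form
    new SquareFunction());
```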
The SPARQL engine currently comes with several different algorithms for combinatory set logic, which serves as the basis of how all queries are evaluated. Using techniques such as indexing and streaming, it's possible to trade memory space for efficiency and caching. The current logic implementations for each set operation are compared below:
Distinction:

Class Name | Indexed | Cached/Streamed (Efficiency) | Extra Space |
---|---|---|---|
StaticSetDistinctionLogic | Yes | Cached (O(n)) | Large O(n) |
IndexedSetDistinctionLogic (default) | Yes | Streamed (O(1)) | Small O(n) |
StreamedSetDistinctionLogic | No | Streamed (O(n)) | O(1) |
Intersection:

Class Name | Indexed | Cached/Streamed (Efficiency) | Extra Space |
---|---|---|---|
NaiveSetIntersectLogic | No | Cached (O(n*m)) | O(n*m) |
StaticSetIntersectLogic (not done) | Yes | Cached (O(n*s + m)) | O(n*m) |
IndexedSetIntersectLogic (default) | Yes | Streamed (O(s)) | O(m) |
StreamedSetIntersectLogic (not done) | No | Streamed (O(m)) | O(1) |
Join:

Class Name | Indexed | Cached/Streamed (Efficiency) | Extra Space |
---|---|---|---|
NaiveSetJoinLogic | No | Cached (O(n*m)) | O(n*m) |
StaticSetJoinLogic (not done) | Yes | Cached (O(n*s + m)) | O(n*m) |
IndexedSetJoinLogic (default) | Yes | Streamed (O(s)) | O(m) |
StreamedSetJoinLogic (not done) | No | Streamed (O(m)) | O(1) |
Projection:

Class Name | Cached/Streamed (Efficiency) | Extra Space |
---|---|---|
StaticSetProjectionLogic | Cached (O(n)) | O(n) |
StreamedSetProjectionLogic (default) | Streamed (O(1)) | O(1) |
Range:

Class Name | Cached/Streamed (Efficiency) | Extra Space |
---|---|---|
StaticSetRangeLogic | Cached (O(n)) | O(n) |
StreamedSetRangeLogic (default) | Streamed (O(1)) | O(1) |
Union:

Class Name | Cached/Streamed (Efficiency) | Extra Space |
---|---|---|
StaticSetUnionLogic | Cached (O(n+m)) | O(n+m) |
StreamedSetUnionLogic (default) | Streamed (O(1)) | O(1) |
To change the set logic used by the engine, follow the customization process detailed above and override the default factory methods that return the particular piece of set logic to be changed.
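For example, a hedged sketch of swapping the intersection logic. The factory class name, the getSetIntersectLogic method, and the SetIntersectLogic return type are assumptions; only the logic class names come from the tables above:

```java
// Hypothetical factory subclass: return a different set logic than the
// default. Method and return-type names are assumed; the logic
// implementations themselves are the ones documented in the tables.
public class MyLogicFactory extends DefaultLogicFactory {
    public SetIntersectLogic getSetIntersectLogic() {
        // Use the naive cached intersection instead of the default
        // IndexedSetIntersectLogic.
        return new NaiveSetIntersectLogic();
    }
}
```

The factory would then be supplied to the engine before evaluation time, as described in the customization section above.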