Databases and Big Data Set

coordinated by

Dominique Laurent and Anne Laurent

Volume 1

NoSQL Data Models

Trends and Challenges

Edited by

Olivier Pivert

Foreword

This volume is part of a series entitled Databases and Big Data, or DB & BD for short, whose content is motivated by the radical and rapid evolution (not to say revolution) of database systems during the last decade.

Indeed, since the 1970s, inspired by the relational database model, many research topics have emerged in the database community, such as, to cite just a few, Deductive Databases, Object-Oriented Databases, Semi-Structured Databases, Resource Description Framework (RDF), Open Data, Linked Data, Data Warehouses, Data Mining, and more recently, Cloud Computing, NoSQL and Big Data. Currently, the last three topics are becoming increasingly important and attract the most research effort in the domain of databases.

Consequently, considering that most current applications must now handle Big Data environments, the goal of this series is to address some of the latest issues arising in such environments. While reporting on specific recent research results, we aim to provide readers with evidence that database technology is changing significantly, so as to face the important challenges encountered in the majority of these applications.

More precisely, although relational databases are still commonly used in traditional applications, it is clear that most current Big Data applications cannot be handled by Relational DataBase Management Systems (RDBMSs), mainly for the following two reasons:

– the volumes of data involved are so large that they must be distributed over clusters of machines, a setting for which RDBMSs were not designed;

– the data are often heterogeneous, semi-structured or unstructured, and thus do not fit the rigid schemas that relational systems require.

New database systems have been proposed during the past few years, which are known under the generic term NoSQL Databases. These systems aim to solve the previous two points, and all claim to achieve their goal.

However, these systems need to be investigated further, because some important issues remain open (semantics of data, constraint satisfaction, transaction processing, privacy preservation, optimization, etc.). The volumes of this series aim to address some of these challenging issues and to present some of the most recent research results in this field.

Considering that the numerous currently available proposals are based on various concepts and data models (column-based, text-based, graph- or hypergraph-based), this volume addresses trends and challenges related to NoSQL data models.

Anne LAURENT
Dominique LAURENT

Preface

As is well known, a major event in the field of data management was the introduction of the relational model by Codd in the early 1970s, which laid the foundations for a genuine theory of databases. After a somewhat slow start, due to the important Research and Development effort necessary to define efficient systems, relational database management systems reigned supreme for several decades.

However, around the end of the 20th century, several phenomena modified the data management landscape. First, new types of applications in several domains were introduced to handle data for which the relational model appeared inadequate or inefficient. Typical examples are semi-structured data on the one hand, and graphs on the other (social networks, bibliographic databases, cartographic databases, genomic data, etc.), for which specific models and systems had to be designed. Second, a major event was the rise of the Semantic Web whose aim is, according to the W3C, to “provide a common framework that allows data to be shared and reused across application, enterprise and community boundaries”. The Semantic Web uses models and languages specifically designed for linked data, which facilitate automated reasoning on such data. In addition, the amount of useful data in some application domains has become so huge that it cannot be stored or processed by traditional database solutions. This latter phenomenon is commonly referred to as Big Data. In terms of database technology, as a response to these new needs, we have seen the appearance of what have come to be called NoSQL databases.

The term NoSQL was coined in 1998 by Carlo Strozzi, who designed a relational database system that did not use SQL and named it Strozzi NoSQL. However, this system is distinct from the general concept of NoSQL databases that emerged around 2009, which are typically non-relational. Many data models have been proposed: key-value stores, document stores (key-value stores that restrict values to semi-structured formats such as JSON), wide column stores, RDF stores, graph databases, XML databases, etc.
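To illustrate the difference between the first two of these models, here is a minimal sketch using plain Python structures; the keys and field names are invented for the example.

```python
# A key-value store maps an opaque key to an opaque value: the system cannot
# look inside the value, only fetch it by key.
kv_store = {
    "user:42": b'{"name": "Alice", "city": "Rennes"}',  # value is just bytes
}

# A document store restricts values to a semi-structured format (e.g. JSON),
# so the system can index and query individual fields inside the value.
doc_store = {
    "user:42": {"name": "Alice", "city": "Rennes", "friends": ["user:7"]},
    "user:7": {"name": "Bob", "city": "Paris", "friends": []},
}

# With documents, field-level queries become possible:
rennes_users = [k for k, d in doc_store.items() if d.get("city") == "Rennes"]
print(rennes_users)  # ['user:42']
```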

While the management of large volumes of data has always been subject to many research efforts, recent results in both the distributed systems and database communities have led to an important renewal of interest in this topic. Large-scale distributed file systems such as Google File System and parallel processing paradigms/environments such as MapReduce have been the foundation of a new ecosystem with data management contributions in major database conferences and journals. Different (often open-source) systems have been released, such as Pig, Hive or, more recently, Spark and Flink, making it easier to use data center resources to manage Big Data. However, many research challenges remain, related, for instance, to system efficiency, and query language expressiveness and flexibility.
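The MapReduce paradigm mentioned above can be conveyed by the classic word-counting example. The following is a single-process sketch of the three phases; real systems such as Hadoop or Spark distribute them across a cluster.

```python
from collections import defaultdict

def map_phase(documents):
    """Map: emit (word, 1) pairs from each input document."""
    for doc in documents:
        for word in doc.split():
            yield word, 1

def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: aggregate the values associated with each key."""
    return {key: sum(values) for key, values in groups.items()}

docs = ["big data", "big nosql data"]
counts = reduce_phase(shuffle(map_phase(docs)))
print(counts)  # {'big': 2, 'data': 2, 'nosql': 1}
```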

This book presents a sample of recent works by French research teams active in this domain. As the reader will see, it covers various aspects of NoSQL research, from semantic data management to graph databases, as well as Big Data management in cloud environments, dealing with data models, query languages and implementation issues. The book is organized as follows:

Chapter 1, by Kim Nguyễn, from LRI and the University of Paris-Sud, presents an overview of NoSQL languages and systems. The author highlights some of the technical aspects of NoSQL systems (in particular, distributed computation with MapReduce) before discussing current research trends: join implementations on top of MapReduce, models for NoSQL languages and systems, and the perspective that consists of defining a formal model of NoSQL databases and queries.
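Among the join implementations on top of MapReduce discussed in Chapter 1, the reduce-side (or "repartition") join is the most basic: both relations are mapped to (join key, tagged tuple) pairs, the shuffle brings matching keys together, and each reduce call forms the local cross product. A toy sketch, with invented relations:

```python
from collections import defaultdict
from itertools import product

users = [(1, "Alice"), (2, "Bob")]               # (user_id, name)
orders = [(1, "book"), (1, "pen"), (2, "mug")]   # (user_id, item)

def tagged_map(users, orders):
    """Map: tag each tuple with its relation of origin."""
    for uid, name in users:
        yield uid, ("U", name)
    for uid, item in orders:
        yield uid, ("O", item)

def join_reduce(pairs):
    """Shuffle by join key, then cross product of the two sides per key."""
    groups = defaultdict(lambda: {"U": [], "O": []})
    for key, (tag, value) in pairs:
        groups[key][tag].append(value)
    for key, sides in groups.items():
        for name, item in product(sides["U"], sides["O"]):
            yield key, name, item

result = sorted(join_reduce(tagged_map(users, orders)))
print(result)  # [(1, 'Alice', 'book'), (1, 'Alice', 'pen'), (2, 'Bob', 'mug')]
```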

Chapter 2, entitled “Distributed SPARQL Query Processing: A Case Study with Apache SPARK”, by Bernd Amann, Olivier Curé and Hubert Naacke, from the LIP6 laboratory in Paris, is devoted to the issue of evaluating SPARQL queries over large RDF datasets. The authors present a solution that consists of using the MapReduce framework to process SPARQL graph patterns and show how the general-purpose cluster computing platform Apache Spark can be used to this end. They emphasize the importance of the physical data layer for query evaluation efficiency and show that hybrid query plans combining partitioned and broadcast joins improve query performance in almost all cases.
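The core idea the chapter builds on is that evaluating a SPARQL basic graph pattern amounts to joining the matches of its triple patterns on shared variables. A toy in-memory sketch of that idea (triples and the pattern are invented for the example; a distributed engine would implement each step as a partitioned or broadcast join):

```python
triples = [
    ("alice", "knows", "bob"),
    ("bob", "knows", "carol"),
    ("bob", "type", "Person"),
]

# Pattern: ?x knows ?y . ?y type Person   ('?'-prefixed strings are variables)
pattern = [("?x", "knows", "?y"), ("?y", "type", "Person")]

def match_pattern(pattern, triples):
    """Evaluate a basic graph pattern by successive joins on shared variables."""
    bindings = [{}]
    for tp in pattern:                       # one join step per triple pattern
        new_bindings = []
        for b in bindings:
            for t in triples:
                b2 = dict(b)
                ok = True
                for term, value in zip(tp, t):
                    if term.startswith("?"):
                        if b2.setdefault(term, value) != value:
                            ok = False       # variable already bound elsewhere
                            break
                    elif term != value:      # constant mismatch
                        ok = False
                        break
                if ok:
                    new_bindings.append(b2)
        bindings = new_bindings
    return bindings

print(match_pattern(pattern, triples))  # [{'?x': 'alice', '?y': 'bob'}]
```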

Chapter 3, authored by Manel Achichi, Mohamed Ben Ellefi, Zohra Bellahsene and Konstantin Todorov, from the LIRMM laboratory in Montpellier, is entitled “Doing Web Data: From Dataset Recommendation to Data Linking”. It deals with the production of web data and focuses on the data linking stage, seen as an operation which generates a set of links between two different datasets. The authors first study the prior task of discovering relevant datasets, which leads to the identification of similar resources and thereby supports data linking. They provide an overview of recommendation approaches for candidate datasets, then present and classify the different techniques applied by the currently available data linking tools. The main challenge faced by all of these techniques is to overcome the heterogeneity problems that may occur between the considered datasets, such as differences in descriptions at the value, ontological or logical level, in order to compare the resources efficiently. The authors show that further research efforts are still needed to better cope with these heterogeneity issues.
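At its core, data linking compares resources from two datasets and emits a link when they are deemed similar enough. A minimal sketch of a value-level technique, using Jaccard similarity on label tokens; the datasets, labels and the 0.5 threshold are invented for the example, and real tools combine many such measures:

```python
def jaccard(a: str, b: str) -> float:
    """Token-level Jaccard similarity between two labels."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb)

dataset_a = {"a1": "University of Montpellier", "a2": "IRISA Lannion"}
dataset_b = {"b1": "Montpellier University", "b2": "LIP6 Paris"}

links = [
    (ka, kb)
    for ka, la in dataset_a.items()
    for kb, lb in dataset_b.items()
    if jaccard(la, lb) >= 0.5   # declare "same entity" above the threshold
]
print(links)  # [('a1', 'b1')]
```

Value-level differences (word order, stop words) are absorbed by the token-set comparison; ontological or logical heterogeneity, as the chapter discusses, requires considerably more machinery.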

Chapter 4, entitled “Big Data Integration in Cloud Environments: Requirements, Solutions and Challenges”, by Rami Sellami and Bruno Defude, from CETIC Charleroi and Telecom SudParis respectively, presents and discusses the requirements of Big Data integration in cloud environments. In such a context, applications may need to interact with several heterogeneous data stores, depending on the types of data they have to manage (traditional data, documents, graph data from social networks, simple key-value data, etc.). A first constraint is that, to make these interactions possible, programmers have to be familiar with different APIs. A second difficulty is that the execution of complex queries over heterogeneous data models cannot currently be achieved in a declarative way and therefore requires extra implementation efforts. Moreover, cloud discovery as well as application deployment and execution are generally performed manually by programmers. The authors analyze and discuss the current state of the art regarding four requirements (automatic data store selection and discovery, unique access for all data stores, transparent access for all data stores, global query processing and optimization), provide a global synthesis according to three groups of criteria, and highlight important challenges that remain to be tackled.

Chapter 5 is authored by Vijay Ingalalli, Dino Ienco and Pascal Poncelet, from the LIRMM laboratory in Montpellier, and is entitled “Querying RDF Data: A Multigraph-based Approach”. In this chapter, the authors address two challenges faced by the RDF data management community: first, automatically generated queries cannot be bounded in their structural complexity and size; second, the queries generated by retrieval systems (or any other application) need to be answered efficiently, in a reasonable amount of time. In order to address these challenges, the authors advocate an approach to RDF query processing that involves two steps: an offline step where the RDF database is transformed into a multigraph and indexed, and an online step where the SPARQL query is transformed into a multigraph too, which makes query processing boil down to a subgraph homomorphism problem. An RDF query engine based on this strategy is presented, named AMBER, which exploits structural properties of the multigraph query as well as the indices previously built on the multigraph structure.
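The offline step described above can be pictured as follows: vertices are the subjects and objects of the triples, and each predicate becomes one of possibly several labeled edges between the same pair of vertices. A toy sketch with invented triples; AMBER's actual encoding and indices are, of course, more elaborate.

```python
from collections import defaultdict

triples = [
    ("alice", "knows", "bob"),
    ("alice", "worksWith", "bob"),   # a second edge between the same vertices
    ("bob", "knows", "carol"),
]

# A multigraph as a mapping (source, target) -> list of edge labels:
multigraph = defaultdict(list)
for s, p, o in triples:
    multigraph[(s, o)].append(p)

print(multigraph[("alice", "bob")])  # ['knows', 'worksWith']
```

Query answering then looks for a homomorphism from the (multigraph) query onto this structure, guided by indices over vertices and edge-label sets.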

Chapter 6 is entitled “Fuzzy Preference Queries to NoSQL Graph Databases” and is authored by Arnaud Castelltort, Anne Laurent, Olivier Pivert, Olfa Slama and Virginie Thion; the first two authors are affiliated with the LIRMM laboratory in Montpellier, and the last three with the IRISA laboratory in Lannion. This chapter deals with flexible querying of graph databases that may involve gradual relationships. The authors first introduce an extension of attributed graphs where edges may represent a fuzzy concept (such as friend in the case of a social network, or co-author in the case of a bibliographic database). Then, they describe an extension of the query language Cypher that makes it possible to express fuzzy requirements, both on attribute values and on structural aspects of the graph (such as the length or the strength of a path). Finally, they deal with implementation issues and outline a query processing strategy based on the derivation of a regular Cypher query from the fuzzy query to be evaluated, through an add-on built on top of a classical graph database management system.
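The gradual notions mentioned above can be given a quick flavor: each edge carries a membership degree in [0, 1] (e.g. how strongly two people are "friends"), and the strength of a path is classically taken as the minimum of its edge degrees. A minimal sketch, with invented names and degrees:

```python
# Friendship degrees on the edges of a small social graph:
edges = {
    ("alice", "bob"): 0.9,
    ("bob", "carol"): 0.4,
}

def path_strength(path, edges):
    """Strength of a path = min of the membership degrees of its edges."""
    return min(edges[(a, b)] for a, b in zip(path, path[1:]))

# The alice -> bob -> carol path is only as strong as its weakest edge:
print(path_strength(["alice", "bob", "carol"], edges))  # 0.4
```

A fuzzy query may then ask, say, for the nodes reachable through a "strong enough" path, a requirement that the chapter's Cypher extension lets users express declaratively.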

Finally, Chapter 7, by Cédric du Mouza and Nicolas Travers, from CNAM Paris, is entitled “Relevant Filtering in a Distributed Content-based Publish/Subscribe System”, and deals with textual data management. More precisely, it considers a crucial challenge faced by Publish/Subscribe systems, which is to efficiently filter information from feeds in real time. Publish/Subscribe systems make it possible to subscribe to flows of items coming from diverse sources and notify the users according to their interests, but the existing systems hardly address the issue of document relevance. However, numerous sources may provide similar information, or a new piece of information may be “hidden” in a large flow. The authors introduce a real-time filtering process based on relevance that notably integrates the notions of novelty and diversity, and they show how this filtering process can be efficiently implemented in a NoSQL environment.
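A toy sketch of relevance-based filtering with a novelty criterion may help fix the idea: an incoming item is delivered only if it matches the subscription and is not too similar to items already delivered. The Jaccard measure and the 0.6 threshold are invented for the example; the chapter's actual model is considerably richer.

```python
def jaccard(a: set, b: set) -> float:
    """Similarity between two token sets."""
    return len(a & b) / len(a | b) if a | b else 0.0

subscription = {"nosql"}   # the subscriber's terms of interest
delivered = []             # token sets of items already sent

def filter_item(text: str) -> bool:
    """Deliver an item only if it is relevant AND novel."""
    tokens = set(text.lower().split())
    if not (subscription & tokens):                   # interest match
        return False
    if any(jaccard(tokens, d) > 0.6 for d in delivered):
        return False                                  # not novel enough
    delivered.append(tokens)
    return True

print(filter_item("NoSQL stores scale out"))        # True  (new, relevant)
print(filter_item("NoSQL stores scale out fast"))   # False (near-duplicate)
print(filter_item("relational databases"))          # False (no interest match)
```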

Olivier PIVERT
May 2018