This article is the first in a series about Squid Solutions’ technology. French version available here.
Introduction
Way back in 2005, we chose to develop new technology to simplify analytics in massive data warehouses. Our approach, described in this series of upcoming articles, is now called “in-database,” although the term did not exist at the time. Basically, we use high-performance Massively Parallel Processing (MPP) databases to execute in-database analytics. Let’s start by describing what we see in MPP databases.
Shared-Nothing Architecture or MPP (Massively Parallel Processing)
A shared-nothing architecture is distributed in terms of both storage and processing. The idea is to split data automatically across a collection of segments, with each group of segments managed independently by a processing node and all nodes communicating over a dedicated network.
An SQL processor directs the execution of queries between the different nodes. Each node processes the query locally. The results are then concatenated, either by the processor, or directly by the network layer.
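The flow described above (the same query sent to every node, local execution, then concatenation of the partial results) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `NODES` data, the `run_locally` filter and the `scatter_gather` coordinator are hypothetical names invented for this example, and threads stand in for what would be separate machines in a real MPP system.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical data: each inner list holds the rows stored on one node.
NODES = [
    [{"customer": "a", "amount": 10}, {"customer": "b", "amount": 5}],
    [{"customer": "c", "amount": 7}],
    [{"customer": "d", "amount": 3}, {"customer": "e", "amount": 20}],
]


def run_locally(rows, min_amount):
    """Stand-in for a node executing the query on its own rows only."""
    return [row for row in rows if row["amount"] >= min_amount]


def scatter_gather(nodes, min_amount):
    """Stand-in for the SQL processor: dispatch the query, then concatenate."""
    with ThreadPoolExecutor(max_workers=len(nodes)) as pool:
        # Every node receives the same query and works on its local data.
        partials = list(pool.map(lambda rows: run_locally(rows, min_amount), nodes))
    # Concatenate the partial results into the final answer.
    return [row for partial in partials for row in partial]


result = scatter_gather(NODES, min_amount=7)
print(result)
```

No node ever reads another node's rows here, which is the essence of shared-nothing: the only shared step is the final concatenation.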
The distribution of data among the different segments and nodes is critical to the performance of the system. A conventional database also uses data partitioning techniques to maintain good performance levels, but the key to the performance of an MPP system is a distribution key that is independent of the logic of the data.
For instance, a transactional processing system would partition transactions month by month, so that a query touches only the relevant segments and data loading time is reduced. The MPP approach, on the other hand, favors an opaque hash key that distributes the data across a variable number of segments.
The first advantage is the scalability of the system as data volumes increase: with n segments and a hash key that spreads rows evenly, each segment stores only about 1/n of the data.
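To make the hash distribution concrete, here is a minimal sketch of how an opaque hash key can spread rows over n segments. It is not how any particular MPP product implements distribution: the functions `segment_for` and `distribute` are hypothetical, and SHA-256 is used only because it is a stable hash available in the Python standard library.

```python
import hashlib
from collections import Counter


def segment_for(key, n_segments):
    """Map a distribution key to a segment number using a stable hash."""
    digest = hashlib.sha256(str(key).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % n_segments


def distribute(keys, n_segments):
    """Return the number of rows that land on each segment."""
    counts = Counter(segment_for(key, n_segments) for key in keys)
    return [counts.get(segment, 0) for segment in range(n_segments)]


keys = range(100_000)  # hypothetical distribution key values
for n in (4, 8):
    counts = distribute(keys, n)
    shares = [round(count / len(keys), 3) for count in counts]
    print(f"{n} segments -> share per segment: {shares}")
```

Because the hash ignores the meaning of the key, the share held by each segment stays close to 1/n whatever the number of segments. A key with few distinct values, or one heavily skewed value, would break that balance, which is why the choice of key matters.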
The second advantage is to greatly simplify system administration since it is generally not necessary to explicitly partition data (even though most current MPP systems also support partitioning on top of the native distribution mechanism). The addition of segments is usually transparent to the user. The system is able to redistribute all data automatically.
This data redistribution mechanism can also be triggered during a join if the distribution keys of the tables involved are not compatible with each other. As such, the choice of a distribution key is a major driver in the optimization of an MPP system.
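The sketch below illustrates why incompatible distribution keys force a redistribution before a join. It assumes a simple hash-modulo placement and a segment-by-segment join; the tables (`customers`, `orders`), the column names and every function are hypothetical and chosen only for illustration.

```python
import hashlib

N_SEGMENTS = 4


def segment_for(key, n_segments=N_SEGMENTS):
    """Map a distribution key to a segment number using a stable hash."""
    digest = hashlib.sha256(str(key).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % n_segments


def distribute(rows, key_column, n_segments=N_SEGMENTS):
    """Place each row on the segment chosen by hashing key_column."""
    segments = [[] for _ in range(n_segments)]
    for row in rows:
        segments[segment_for(row[key_column], n_segments)].append(row)
    return segments


def redistribute(segments, key_column):
    """Move rows between segments so they are distributed on key_column."""
    rows = [row for segment in segments for row in segment]
    return distribute(rows, key_column, len(segments))


def local_join(left_segments, right_segments, column):
    """Join segment by segment, never looking at another segment's rows."""
    joined = []
    for left, right in zip(left_segments, right_segments):
        index = {}
        for r in right:
            index.setdefault(r[column], []).append(r)
        for l in left:
            for r in index.get(l[column], []):
                joined.append({**l, **r})
    return joined


customers = [{"customer_id": i, "name": f"c{i}"} for i in range(20)]
orders = [{"order_id": 100 + i, "customer_id": i % 20} for i in range(50)]

customers_by_id = distribute(customers, "customer_id")
orders_by_order = distribute(orders, "order_id")  # incompatible with the join

# Joining as-is misses matches that live on different segments.
naive = local_join(orders_by_order, customers_by_id, "customer_id")

# Redistributing orders on the join column co-locates matching rows.
orders_by_customer = redistribute(orders_by_order, "customer_id")
joined = local_join(orders_by_customer, customers_by_id, "customer_id")

print(len(naive), "rows without redistribution,", len(joined), "with it")
```

If `orders` had been distributed on `customer_id` from the start, the `redistribute` step, which corresponds to moving rows over the network, would not be needed at all.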
(To be continued …)