HPCC Systems includes an ETL (extract, transform, load) process along with a set of data management and analytics features: data profiling, data cleansing, snapshot data updates, and a scheduling component. The Roxie cluster provides a powerful query engine with built-in search, enabling extremely fast queries over billions of records.
HPCC Systems also boasts a wide range of predictive modeling tools, including linear regression, logistic regression, decision trees, and random forests.
While this system is quite large, we’ll show you a few basic aspects that may be of use to database marketers.
ECL
Using HPCC Systems is a bit different from using the other analytics tools reviewed on this site. It has no point-and-click GUI for building workflows. Instead, every step of the ETL process is carried out by writing small queries in ECL (Enterprise Control Language), an internally developed query and ETL language designed specifically for processing big data.
ECL is a very high-level language; a program consists mostly of definitions that describe the results you want from your data. It is non-procedural. There are no assignment statements and no variables. Each definition is made exactly once and cannot be redefined. You do not write the executable code yourself; the compiler generates it from the definitions, which are essentially just a series of instructions.
Example definition:
IsSeniorCitizen := People.Age >= 65;
SeniorAvgAge := AVE(People(IsSeniorCitizen), Age);
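To run as a complete job, these definitions need a dataset to refer to and an action that asks for a result. The following is a minimal sketch, assuming a hypothetical inline People dataset; the record layout and sample rows are invented for illustration.

```ecl
// Hypothetical record layout and sample rows, for illustration only
PersonRec := RECORD
  STRING20  Name;
  UNSIGNED1 Age;
END;

People := DATASET([{'Ann', 72},
                   {'Bob', 41},
                   {'Cy',  68}], PersonRec);

// Definitions: each name is defined once and never reassigned
IsSeniorCitizen := People.Age >= 65;
SeniorAvgAge    := AVE(People(IsSeniorCitizen), People.Age);

// Nothing is evaluated until an action such as OUTPUT requests a result
OUTPUT(SeniorAvgAge);
```

Because definitions only describe values, the order in which they appear matters less than in a procedural language; the compiler works out what must be computed to satisfy the OUTPUT action.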
ECL Watch Console
Getting HPCC Systems running requires installing a virtual machine (we used Oracle’s VirtualBox, as recommended) to set up a node. The ECL Watch console is then accessed through a web browser, where one can manage all aspects of the system, including loading and managing data sources, publishing to services, event scheduling, and more.
Data sets can be loaded and “sprayed” into a node for processing purposes. HPCC Systems also includes the ECL IDE (Integrated Development Environment), which is where one creates workflows for managing data.
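Once a file has been sprayed, it can be referenced from ECL by its logical file name. The sketch below assumes a delimited file with one header row; the logical name '~demo::people' and the record layout are hypothetical and would need to match whatever was actually sprayed.

```ecl
// Hypothetical layout; it must match the fields of the sprayed file
PersonRec := RECORD
  STRING20 FirstName;
  STRING20 LastName;
  STRING5  Zip;
END;

// '~demo::people' is a made-up logical file name.
// CSV(HEADING(1)) assumes a delimited file with a single header row.
People := DATASET('~demo::people', PersonRec, CSV(HEADING(1)));

// Return the first 10 records as a quick sanity check
OUTPUT(CHOOSEN(People, 10));
```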
A simple workflow with an inline data set might look something like this:
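A minimal sketch of such a workflow, assuming a hypothetical record layout and a few invented sample rows:

```ecl
// Hypothetical record layout, for illustration only
PersonRec := RECORD
  STRING20  FirstName;
  STRING20  LastName;
  STRING5   Zip;
  UNSIGNED1 Age;
END;

// Inline data set: the rows are written directly in the code
People := DATASET([{'Mary',  'Smith', '33487', 70},
                   {'John',  'Jones', '33431', 42},
                   {'Linda', 'Brown', '33487', 35}], PersonRec);

// Sort by last name and send the result to the workunit output
SortedPeople := SORT(People, LastName);
OUTPUT(SortedPeople, NAMED('People_By_LastName'));
```

An inline data set is convenient for trying out logic before pointing the same definitions at a full-size file.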
Data can be displayed in a graphical format as well.
You’ll note that all of the data processing is handled in a code window. All data preparation and processing is handled this way. For example, converting all names in a data file into upper case would look something like this:
The output of this code shows each name converted to upper case.
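One way to write the upper-case conversion is with PROJECT and a TRANSFORM, using the standard library's string functions. This is a sketch under assumptions: the record layout and sample rows are hypothetical.

```ecl
IMPORT Std;

// Hypothetical record layout and sample rows
PersonRec := RECORD
  STRING20 FirstName;
  STRING20 LastName;
END;

People := DATASET([{'Mary', 'Smith'},
                   {'John', 'Jones'}], PersonRec);

// TRANSFORM describes how each output record is built from an input record
PersonRec ToUpperNames(PersonRec L) := TRANSFORM
  SELF.FirstName := Std.Str.ToUpperCase(L.FirstName);
  SELF.LastName  := Std.Str.ToUpperCase(L.LastName);
END;

// PROJECT applies the TRANSFORM to every record in the data set
UpperPeople := PROJECT(People, ToUpperNames(LEFT));
OUTPUT(UpperPeople);
```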
Querying Data
Getting information from a data file is also relatively straightforward. Here is an example of querying for records with a specific ZIP code, along with the results:
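In ECL, a filter is written in parentheses after the data set name. The following sketch assumes a hypothetical inline data set and an arbitrary ZIP code value:

```ecl
// Hypothetical record layout and sample rows
PersonRec := RECORD
  STRING20 FirstName;
  STRING20 LastName;
  STRING5  Zip;
END;

People := DATASET([{'Mary',  'Smith', '33487'},
                   {'John',  'Jones', '33431'},
                   {'Linda', 'Brown', '33487'}], PersonRec);

// Keep only the records whose Zip field matches the literal value
InZip := People(Zip = '33487');

OUTPUT(InZip);          // the matching records
OUTPUT(COUNT(InZip));   // how many records matched
```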
Web Service Forms
In many cases you will not want to write out a query in code each time, so HPCC Systems lets you publish basic web services for querying this information. The query is changed to accept an input value rather than a hard-coded one.
This enables the web service to generate an input form, which in turn returns readable output.
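A common way to declare such an input is the STORED workflow service, which exposes a definition as a named parameter of the published query. The sketch below assumes a hypothetical inline data set and a parameter named ZipValue; both names are illustrative.

```ecl
// Hypothetical record layout and sample rows
PersonRec := RECORD
  STRING20 FirstName;
  STRING20 LastName;
  STRING5  Zip;
END;

People := DATASET([{'Mary',  'Smith', '33487'},
                   {'John',  'Jones', '33431'},
                   {'Linda', 'Brown', '33487'}], PersonRec);

// STORED makes ZipValue an input of the published query;
// the empty string is the default when no value is supplied
STRING5 ZipValue := '' : STORED('ZipValue');

// Same filter as a hard-coded query, but driven by the input value
Matches := People(Zip = ZipValue);
OUTPUT(Matches);
```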