HPCC Systems

A review of the free HPCC Systems open source data analytics platform for use in marketing data operations.

2.5

About HPCC Systems

HPCC Systems is a powerful enterprise-level big data analytics platform developed by LexisNexis. HPCC, or “High Performance Computing Cluster,” was released as open source and is available as a reusable framework that gives data scientists a way to handle enormous datasets by combining the power of every system running as a node in a supercomputer. Because it is built from nodes, the platform is highly scalable: a small test environment can be created first and then easily moved into production in the cloud.

HPCC Systems is divided into two main cluster types. “Thor” is designed for handling big data workflows, including ETL processes, indexing, data cleansing, and more. “Roxie” is its data delivery engine, designed to handle high volumes of online data processing and data warehousing functions. Small test environments scale up easily, as each node handles both server and agent processes.

Using HPCC Systems is a bit different from using the other analytics tools reviewed on this site, because it has no graphical workflow builder. Every step of the ETL process is handled by writing small queries in ECL (Enterprise Control Language), an internally developed query/ETL language designed specifically for processing big data.

Features

HPCC Systems includes ETL processing and a set of data management and analytics features, including data profiling, data cleansing, snapshot data updates, and a scheduling component. The Roxie cluster contains a powerful query engine and a built-in search engine, enabling extremely fast processing of billions of records.

HPCC Systems also boasts a wide range of predictive modeling tools, including linear regression, logistic regression, decision trees, and random forests.

While this system is quite large, we’ll show you a few basic aspects that may be of use to database marketers.


ECL

ECL is a very high-level language; it mostly consists of a series of definitions that describe the results you want from your data. It is non-procedural. There are no assignment statements and no variables. Each definition is defined exactly once and cannot be redefined. You do not write executable code yourself; generating it is handled by the compiler. An ECL program is essentially just a series of instructions about what result to produce.

Example definition:

IsSeniorCitizen := People.Age >= 65;
SeniorAvgAge := AVE(People(IsSeniorCitizen), Age);
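
To make the two definitions above runnable on their own, here is a minimal sketch that supplies an inline People data set. The record layout, field names, and sample rows are assumptions made for illustration; they are not part of the original example.

```ecl
// Hypothetical record layout and sample rows (assumptions for illustration).
PersonRec := RECORD
    STRING20  FirstName;
    UNSIGNED1 Age;
END;

People := DATASET([{'Ann', 70}, {'Bob', 42}, {'Carla', 67}], PersonRec);

// A definition names an expression; nothing is executed at this point.
IsSeniorCitizen := People.Age >= 65;

// Definitions build on one another: average the ages of the filtered rows.
SeniorAvgAge := AVE(People(IsSeniorCitizen), People.Age);

// An action such as OUTPUT is what actually requests a result.
OUTPUT(SeniorAvgAge);   // (70 + 67) / 2 = 68.5 for the sample rows
```

Note that the only action is the final OUTPUT; everything before it simply tells the compiler what the names mean.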


ECL Watch Console

Getting it running requires installing a virtual machine (we used Oracle’s VirtualBox, as recommended) to set up a node. The console is then accessed through a web browser, where one can manage all aspects of the system, including loading and managing data sources, publishing to services, event scheduling, and more.

 

img1_2.png



Data sets can be loaded and “sprayed” into a node for processing. HPCC Systems also includes the ECL IDE (Integrated Development Environment), which is where one creates workflows for managing data.

A simple workflow with an inline data set might look something like this:

 

img2_2.png
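
Since the screenshot is not reproduced here, the following is a minimal sketch of what a workflow with an inline data set can look like. It is not the code from the screenshot: the record layout, field names, and rows are hypothetical.

```ecl
// Hypothetical record layout; field names are illustrative only.
PersonRec := RECORD
    STRING20  FirstName;
    STRING20  LastName;
    STRING5   Zip;
    UNSIGNED1 Age;
END;

// Inline data set: the rows are written directly in the code
// instead of being read from a sprayed file.
People := DATASET([{'Ann',   'Smith', '10001', 70},
                   {'Bob',   'Jones', '10001', 42},
                   {'Carla', 'Diaz',  '94105', 67}], PersonRec);

// Actions: return the rows, then the number of rows.
OUTPUT(People);
OUTPUT(COUNT(People));
```

An inline data set like this is convenient for trying out logic before pointing the same definitions at a real file.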


Data can be displayed in a graphical format as well:
 

img3_2.png

You’ll note that all of the data processing is handled in a code window. All data preparation and processing is handled this way. For example, converting all names in a data file into upper case would look something like this:
 

img4_1.png


The output of this code is shown below:

 

img5_2.png
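
The upper-casing step described above can be sketched in ECL roughly as follows. This is an illustration under stated assumptions, not the code from the screenshot: the record layout and sample rows are hypothetical, and the sketch uses PROJECT with a TRANSFORM together with Std.Str.ToUpperCase from the ECL standard library.

```ecl
IMPORT Std;

// Hypothetical record layout and rows (assumptions for illustration).
PersonRec := RECORD
    STRING20 FirstName;
    STRING20 LastName;
    STRING5  Zip;
END;

People := DATASET([{'Ann',   'Smith', '10001'},
                   {'Bob',   'Jones', '10001'},
                   {'Carla', 'Diaz',  '94105'}], PersonRec);

// TRANSFORM describes how each output row is built from an input row.
PersonRec UpperNames(PersonRec L) := TRANSFORM
    SELF.FirstName := Std.Str.ToUpperCase(L.FirstName);
    SELF.LastName  := Std.Str.ToUpperCase(L.LastName);
    SELF           := L;   // copy all remaining fields unchanged
END;

// PROJECT applies the TRANSFORM to every row of People.
UpperPeople := PROJECT(People, UpperNames(LEFT));

OUTPUT(UpperPeople);
```

The original People data set is not modified; UpperPeople is a new definition derived from it.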


Querying Data

Getting information from a data file is also relatively straightforward. Here is an example of querying for records with a specific zip code, and the results:

 

img6_1.png img7_1.png
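
As a rough sketch of this kind of query (not the code from the screenshot), a filter in ECL is written in parentheses after the data set name. The record layout, rows, and the zip code value below are hypothetical.

```ecl
// Hypothetical record layout and rows (assumptions for illustration).
PersonRec := RECORD
    STRING20 FirstName;
    STRING20 LastName;
    STRING5  Zip;
END;

People := DATASET([{'Ann',   'Smith', '10001'},
                   {'Bob',   'Jones', '10001'},
                   {'Carla', 'Diaz',  '94105'}], PersonRec);

// Filter: keep only the rows whose Zip field matches the value.
PeopleInZip := People(Zip = '10001');

OUTPUT(PeopleInZip);          // the Ann and Bob rows in this sample
OUTPUT(COUNT(PeopleInZip));   // 2 for this sample
```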

 

Web Service Forms

In many cases you will not want to write a query in code every time, so HPCC Systems lets you create basic web services that expose the query. First, change the query to accept an input, like so:
 

img8_1.png


This will enable the web service to create a form like this:
 

img9.png


Submitting the form then generates readable output:
 

img10.png
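
One way to give a query an input is the STORED workflow service, which exposes a definition as a named value that can be supplied from outside the code. The sketch below assumes that approach; it is not the code from the screenshot, and the record layout, rows, and the input name ZipValue are hypothetical.

```ecl
// Hypothetical record layout and rows (assumptions for illustration).
PersonRec := RECORD
    STRING20 FirstName;
    STRING20 LastName;
    STRING5  Zip;
END;

People := DATASET([{'Ann',   'Smith', '10001'},
                   {'Bob',   'Jones', '10001'},
                   {'Carla', 'Diaz',  '94105'}], PersonRec);

// STORED makes ZipValue an externally supplied input.
// The empty string is the default used when no value is passed in.
STRING5 ZipValue := '' : STORED('ZipValue');

// The filter now depends on the input instead of a hard-coded value.
OUTPUT(People(Zip = ZipValue));
```

With the default empty value the filter matches no rows in this sample; once the query is published, the value typed into the web form takes the place of the default.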

Summary: Key takeaway

HPCC Systems is one of the most powerful data analytics engines available. LexisNexis made its name as a source for finding information about anything and anybody very quickly, and this tool gives other organizations access to the same processing power. If, as a direct marketing organization, you need to process billions of data points regularly and retrieve the information you need quickly, it could be highly effective.

However, the learning curve involved with using HPCC Systems could be a barrier for many companies. Those who do choose to go with this tool will be very impressed by its speed, power, and flexibility. The ECL language itself is intuitive and not difficult to learn, and if you are looking for tools to build your own systems, it could be an excellent choice.

Integrations

  • Spark
  • Pentaho
  • ECL for VS Code
  • JDBC Driver
  • Java API
  • ODBC Driver

Xperra Star Ratings

Overall functionality useful to a direct marketer
4/5

HPCC is an impressive and powerful data transformation and analysis tool for massive datasets. While it can work with smaller amounts of data, it scales rapidly; all that is required is spinning up a few more nodes. That said, for most direct marketing operations this might be a bit of overkill; the cost-value proposition comes down to setup time versus output. If one has a very large set of data that is expected to grow rapidly and needs massive real-time processing, this could be a very useful tool. However, getting it up and running may take some time, so the value depends on the organization.

If you are working with billions of data points, it can be helpful and quite powerful, and its flexibility and scalability make it appealing for organizations that expect to work at this level. For smaller organizations, the learning curve could be prohibitive.

Intuitive User Experience
2/5

There is virtually nothing “out of the box” in HPCC Systems. While getting it up and running isn’t much more time-consuming than other options, learning how to use it is not something you can do on day one. Even data science professionals should expect to spend some time familiarizing themselves with the environment and the manuals.

Active Support Community
3/5

HPCC Systems’ official forums are available directly on its website. While there is a presence, activity seems slow (there had been no new posts in the past month at the time of this writing).

The GitHub repository itself appears to be fairly active (there were commits within the past week). There is no noticeable presence on Stack Overflow for either HPCC or ECL, other than a few sparse questions.

HPCC Systems Official Forum: hpccsystems.com

GitHub: github.com/hpcc-systems

  • Commits: 23954
  • Contributors: 42
  • Releases: 517
  • Watch: 44
  • Star: 394
  • Fork: 222
  • Commits/Contributors: 570

Minimal Technical Skill Required
2/5

Unless you are familiar with programming, this is not a tool you can jump into right away. As mentioned, there is no graphical workflow builder; each process needs to be defined separately in code. Even to create simple components, you need to learn an entirely new programming language.

That said, ECL is a very high-level language; it consists mostly of a series of commands describing what you want to know about your data. It is non-procedural, that is, declarative: you ask it questions and it gives answers, so it is certainly something that can be learned. However, if you are not technically inclined, this tool may not suit your needs.
