Using AWS Redshift was straightforward. Queries use a PostgreSQL-style SQL syntax, so a basic understanding of SQL was enough to retrieve and transfer data without much specialized knowledge.
We found that Redshift performs well with large data sets and, like many of Amazon’s products, integrates well with files hosted on AWS.
Unlike traditional row-based databases, Redshift stores data by column. While this approach may not be as strong for transactional querying, for our data ingestion work it resulted in extremely fast and efficient queries. Column-based storage is therefore well suited to data analytics, especially analyses that must run in real time over large datasets.
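To make the row-versus-column distinction concrete, here is a minimal Python sketch. It is a toy model, not how Redshift is implemented internally, and the table and field names are invented for illustration. It shows why an aggregate over one field touches less data in a column layout.

```python
# Toy illustration of row-oriented vs. column-oriented storage.
# The table and field names are hypothetical, invented for this example.

rows = [
    {"customer_id": 1, "region": "east", "spend": 120.0},
    {"customer_id": 2, "region": "west", "spend": 75.5},
    {"customer_id": 3, "region": "east", "spend": 210.25},
]


def to_columns(records):
    """Pivot a list of row dicts into a dict of column lists."""
    columns = {}
    for record in records:
        for name, value in record.items():
            columns.setdefault(name, []).append(value)
    return columns


def total_from_rows(records, field):
    """Row layout: every full record is visited to read a single field."""
    return sum(record[field] for record in records)


def total_from_columns(columns, field):
    """Column layout: only the requested column is read."""
    return sum(columns[field])


columns = to_columns(rows)
print(total_from_rows(rows, "spend"))
print(total_from_columns(columns, "spend"))
```

Both functions return the same total. The difference is that the column version never looks at customer_id or region, which is the property that benefits analytic queries over wide tables.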
Real Time Processing
The ability to run in real time varied somewhat, depending on the volume of data being imported. One issue with AWS is that its pricing is tied directly to processing time. Because everything scales, this is not much of an issue with relatively small datasets. However, real-time processing of massive amounts of data can be taxing on any system, so the overall cost of handling big data in Redshift can climb quickly unless one keeps a handle on the data flow.
Redshift loads data from file formats such as CSV and JSON, which is fine for most situations. One drawback is that Redshift cannot ingest data directly from CRM systems. That data must be extracted and converted before use, which can limit how close to real time it can be handled.
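The extract-and-convert step can be sketched as follows. This is a minimal example that assumes the CRM export arrives as nested records. The record shape, field names, and the flatten and records_to_csv helpers are all hypothetical, and the output is plain CSV text of the kind that could then be staged for loading.

```python
import csv
import io


def flatten(record, parent_key="", sep="_"):
    """Flatten nested dicts into a single-level dict with joined keys."""
    flat = {}
    for key, value in record.items():
        full_key = f"{parent_key}{sep}{key}" if parent_key else key
        if isinstance(value, dict):
            flat.update(flatten(value, full_key, sep))
        else:
            flat[full_key] = value
    return flat


def records_to_csv(records):
    """Serialize flattened records to CSV text with one consistent header."""
    flat_records = [flatten(r) for r in records]
    # Union of all keys in first-seen order, so sparse records still fit.
    fieldnames = []
    for rec in flat_records:
        for key in rec:
            if key not in fieldnames:
                fieldnames.append(key)
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=fieldnames, restval="", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(flat_records)
    return buffer.getvalue()


# Hypothetical CRM export: nested contact details, inconsistent fields.
crm_export = [
    {"id": 1, "contact": {"name": "Ada Lovelace", "email": "ada@example.com"}},
    {"id": 2, "contact": {"name": "Alan Turing"}, "stage": "lead"},
]
csv_text = records_to_csv(crm_export)
print(csv_text)
```

Missing fields are written as empty values rather than dropped, so every row has the same columns. A conversion pass like this is also where the real-time delay mentioned above comes from.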
Data type detection
One feature we hoped to find was built-in, machine-learning-style analysis for detecting data types. Most systems can identify basic data types such as strings, integers, and dates. More helpful, particularly when working with marketing datasets, would be the ability to identify the content of strings, such as email addresses and mailing addresses, or to classify names as first names and last names.
We found that, like many of Amazon’s products, Redshift provides extensive documentation. However, because Redshift uses a fairly common SQL syntax for querying data sources, we rarely needed to spend much time reading the manuals. Typically, we were able to operate without much trouble by relying on our existing knowledge of SQL.
Custom Coding needed:
Redshift contains no AI component, so data and field content analysis needed to be handled via external code. As a result, we built our own ML tools for this purpose.
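As a rough illustration of what external field-content analysis can look like, here is a small rule-based sketch. It is a stand-in built on regular expressions, not the ML tooling we built, and the patterns, labels, and threshold are assumptions chosen for the example. Telling first names from last names would need a reference list or a trained model, which is beyond this sketch.

```python
import re
from collections import Counter

# Deliberately simple patterns; real-world data needs stricter rules.
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")
# US-style street address: house number, some words, then a street suffix.
ADDRESS_RE = re.compile(
    r"^\d+\s+.+\b(st|street|ave|avenue|rd|road|blvd|dr|drive|ln|lane)\b\.?",
    re.IGNORECASE,
)
# A single alphabetic token, allowing apostrophes and hyphens.
NAME_RE = re.compile(r"^[A-Za-z][A-Za-z'\-]*$")


def classify_value(value):
    """Return a coarse content label for one string value."""
    text = value.strip()
    if EMAIL_RE.match(text):
        return "email"
    if ADDRESS_RE.match(text):
        return "mailing_address"
    if NAME_RE.match(text):
        return "name_token"
    return "string"


def classify_column(values, threshold=0.8):
    """Label a column by its most common value type, if common enough."""
    non_empty = [v for v in values if v and v.strip()]
    if not non_empty:
        return "empty"
    counts = Counter(classify_value(v) for v in non_empty)
    label, count = counts.most_common(1)[0]
    # Fall back to a plain string column when no type clearly dominates.
    return label if count / len(non_empty) >= threshold else "string"


sample = ["a@b.com", "c@d.org", "x@y.net", "z@q.io", "oops"]
print(classify_column(sample))
```

The threshold lets a column tolerate a few malformed values while still being labeled. A learned classifier would replace classify_value and keep the same column-level voting.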
Duplicate File checking
Working with large amounts of data requires consistent, non-redundant information, especially when real-time data arrives from multiple sources that may overlap. Because data leakage and false positives can be a real problem in predictive analytics, it becomes crucial to identify and eliminate any obviously redundant data.
Amazon Redshift unfortunately does not provide duplicate detection as a built-in feature.
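One way to handle this outside Redshift is to fingerprint records before loading them. The sketch below is a minimal example under stated assumptions: the row_fingerprint and deduplicate helpers, the choice of email as the key field, and the normalization rules (trim and lowercase) are all hypothetical and would need to match the actual data.

```python
import hashlib


def row_fingerprint(row, key_fields):
    """Hash the normalized key fields of a row into a stable fingerprint."""
    # The unit separator keeps ("ab", "c") distinct from ("a", "bc").
    normalized = "\x1f".join(
        str(row.get(field, "")).strip().lower() for field in key_fields
    )
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def deduplicate(rows, key_fields):
    """Keep the first row per fingerprint; return (unique, dropped)."""
    seen = set()
    unique, dropped = [], []
    for row in rows:
        fingerprint = row_fingerprint(row, key_fields)
        if fingerprint in seen:
            dropped.append(row)
        else:
            seen.add(fingerprint)
            unique.append(row)
    return unique, dropped


# Hypothetical overlapping records from two sources.
incoming = [
    {"email": "ada@example.com", "source": "web"},
    {"email": " ADA@example.com ", "source": "crm"},
    {"email": "alan@example.com", "source": "web"},
]
unique, dropped = deduplicate(incoming, key_fields=["email"])
print(len(unique), len(dropped))
```

Returning the dropped rows as well as the unique ones makes it possible to audit what was removed. Only exact matches after normalization are caught, so near-duplicates such as misspelled names would need fuzzy matching on top.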
Redshift provides a free trial version. Also, given Amazon’s pricing structure, one can run relatively small trials without exorbitant cost. Expense is tied directly to the amount of processing used, so a small trial gives a reasonably good idea of what it would cost to operate on larger datasets (though, of course, mileage may vary depending on whether sudden bursts of processing power are needed to handle large batches of real-time data).
Real Time Reporting
For marketing campaigns, understanding exactly what is happening at a given moment can be valuable. Events such as a weather event or social activity can affect a campaign immediately (e.g. a regional team wins a championship, resulting in a sudden increase in demand for commemorative t-shirts or memorabilia). For marketing companies that need to move quickly and take advantage of current information, real-time reporting can be a key factor in making quick decisions.
Redshift does provide real-time reporting.
Overall, while Redshift does not provide every feature we needed, its ability to handle real-time data, during both the processing and reporting phases, made it a strong contender as a data ingestion tool. Ease of use, thanks to its standard SQL syntax, also served as a strong selling point. Where it lacks important functions (such as machine learning), it is extensible, making it possible to integrate with external resources.