What is Amberstone?

Amberstone is a scalable compression, storage and batch aggregation engine for machine-generated data. Data is loaded into Amberstone and stored in tightly compressed, column-oriented fact tables. Once the data is loaded, Amberstone can very efficiently output time series aggregate tables, which can be loaded into any database or column store for reporting.

Scalability Through Efficiency

Amberstone's core design philosophy is to achieve high scalability through efficiency. Due to its high compression rates and high speed aggregation, Amberstone can work with hundreds of billions of records on a single server. Amberstone also supports efficient incremental aggregation, so only newly added data needs to be aggregated.

Amberstone can be run on a single node, or in a cluster to increase storage capacity as well as load and aggregation throughput.

Core Use Case

Amberstone is designed to be used as part of the data pipeline feeding a time series reporting system. In this scenario, machine-generated data flows into the data warehouse throughout the day. When it arrives, the data is loaded into Amberstone for long term storage and aggregation. Amberstone outputs time series aggregate tables that are loaded into a database for reporting. The reporting system uses the aggregate tables to generate on-demand time series reports and pivot tables for end users.

Compression

Amberstone's compression strategy is two-pronged. The first part of the strategy is to store data in column-oriented fact tables. This design allows compression to be applied to individual columns, which can result in very high compression rates. The second part is to provide built-in functions at load time that replace text data with surrogate integer keys. This approach creates on-the-fly star schemas for machine-generated data, and leads to very compact fact tables.
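The surrogate key idea can be illustrated with a minimal Python sketch. This is not Amberstone's actual load-time API: the SurrogateKeyDictionary class, the load_rows function and the sample rows are all hypothetical names made up for illustration. It only shows how text values might be swapped for small integers and laid out column by column.

```python
from typing import Dict, List, Sequence, Tuple


class SurrogateKeyDictionary:
    """Maps each distinct text value to a small integer key (hypothetical helper)."""

    def __init__(self) -> None:
        self._key_by_value: Dict[str, int] = {}
        self._value_by_key: List[str] = []

    def key_for(self, value: str) -> int:
        # Assign the next unused integer the first time a value is seen.
        key = self._key_by_value.get(value)
        if key is None:
            key = len(self._value_by_key)
            self._key_by_value[value] = key
            self._value_by_key.append(value)
        return key

    def value_for(self, key: int) -> str:
        # Reverse lookup, used when a report needs the original text back.
        return self._value_by_key[key]


def load_rows(
    rows: Sequence[Tuple[str, ...]],
    dictionaries: Sequence[SurrogateKeyDictionary],
) -> List[List[int]]:
    """Turn rows of text into column-oriented lists of integer keys."""
    columns: List[List[int]] = [[] for _ in dictionaries]
    for row in rows:
        for i, (value, dictionary) in enumerate(zip(row, dictionaries)):
            columns[i].append(dictionary.key_for(value))
    return columns


# Made-up sample data: (HTTP method, URL path).
rows = [
    ("GET", "/index.html"),
    ("GET", "/about.html"),
    ("POST", "/index.html"),
]
dictionaries = [SurrogateKeyDictionary(), SurrogateKeyDictionary()]
fact_columns = load_rows(rows, dictionaries)
print(fact_columns)  # [[0, 0, 1], [0, 1, 0]]
```

Each column ends up as a list of small, highly repetitive integers, which is the kind of data that per-column compression tends to handle well. The dictionaries play the role of the dimension tables in a star schema.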

By combining these two approaches Amberstone can often achieve 90% or better compression rates on machine generated data. High compression rates allow for longer data retention on less hardware.

Aggregation

Amberstone creates time series aggregate tables from the data stored in its fact tables. These aggregate tables are multi-dimensional and automatically include time series dimensions. Amberstone can include any column from the fact table as a dimension and can sum any column in the fact table to build its aggregates.
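As a rough illustration of what such an aggregate looks like, the Python sketch below groups records by a day bucket plus a chosen set of dimension columns and sums one measure column. The aggregate function, the column names (ts, page, country, bytes) and the sample records are assumptions for the example, not Amberstone's configuration or output format.

```python
from collections import defaultdict
from datetime import datetime, timezone
from typing import Dict, Iterable, List, Mapping, Tuple


def aggregate(
    records: Iterable[Mapping[str, int]],
    dimensions: List[str],
    measure: str,
) -> Dict[Tuple, int]:
    """Sum `measure` grouped by UTC day plus the chosen dimension columns."""
    totals: Dict[Tuple, int] = defaultdict(int)
    for record in records:
        # The time series dimension is derived from the record timestamp.
        day = datetime.fromtimestamp(record["ts"], tz=timezone.utc).strftime("%Y-%m-%d")
        key = (day,) + tuple(record[d] for d in dimensions)
        totals[key] += record[measure]
    return dict(totals)


# Made-up fact rows; page and country are surrogate integer keys.
records = [
    {"ts": 0, "page": 1, "country": 7, "bytes": 100},
    {"ts": 3600, "page": 1, "country": 7, "bytes": 50},
    {"ts": 86400, "page": 1, "country": 7, "bytes": 25},
    {"ts": 90000, "page": 2, "country": 7, "bytes": 10},
]
daily = aggregate(records, ["page", "country"], "bytes")
for key, total in sorted(daily.items()):
    print(key, total)
# ('1970-01-01', 1, 7) 150
# ('1970-01-02', 1, 7) 25
# ('1970-01-02', 2, 7) 10
```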

Amberstone performs aggregation at very high speeds. It uses a read-ahead thread to read data off disk in the background while aggregation is being performed. Aggregation is performed in memory, in high performance hash tables optimized to work with the surrogate integer keys in the fact table. Combining these approaches with a high-end 8-12 core CPU can result in aggregation speeds approaching 30 million records per second.
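The read-ahead pattern described above can be sketched in Python with a background thread and a bounded queue. This is only an illustration of the general technique, not Amberstone's implementation: the function names, the chunk format (lists of (key, value) pairs) and the max_buffered parameter are assumptions, and an in-memory list stands in for reads from disk.

```python
import queue
import threading
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Chunk = List[Tuple[int, int]]  # (surrogate key, value) pairs; assumed layout
_DONE = object()  # sentinel marking the end of the input


def read_ahead(chunks: Iterable[Chunk], buffer: queue.Queue) -> None:
    """Background reader: fetch chunks while the main thread aggregates."""
    try:
        for chunk in chunks:
            buffer.put(chunk)  # blocks when the buffer is full
    finally:
        buffer.put(_DONE)  # always signal completion, even on error


def aggregate_with_read_ahead(chunks: Iterable[Chunk], max_buffered: int = 4) -> Dict[int, int]:
    buffer: queue.Queue = queue.Queue(maxsize=max_buffered)
    reader = threading.Thread(target=read_ahead, args=(chunks, buffer), daemon=True)
    reader.start()

    totals: Dict[int, int] = defaultdict(int)  # hash table keyed by integer
    while True:
        chunk = buffer.get()
        if chunk is _DONE:
            break
        for key, value in chunk:
            totals[key] += value
    reader.join()
    return dict(totals)


# Made-up chunks standing in for blocks read from a fact table.
chunks = [[(1, 2), (2, 3)], [(1, 5)], [(3, 1), (2, 2)]]
print(aggregate_with_read_ahead(chunks))  # {1: 7, 2: 5, 3: 1}
```

The bounded queue keeps the reader only a few chunks ahead, so memory use stays flat while the aggregating thread rarely has to wait on I/O.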

Amberstone also aggregates sessions and transactions.

Handling High Cardinality

Amberstone builds its aggregations in memory, which is much faster than the disk based, sort approach taken by Hadoop. But how does Amberstone deal with the issue of high cardinality?

Amberstone has a two-pronged approach for dealing with high cardinality. The first approach is sliding window aggregation. Amberstone's fact tables are sorted in ascending time order. Amberstone reads the fact tables linearly and cuts daily or hourly aggregate files as it goes. Using this approach, only a window of the aggregation is kept in memory at any given time. This allows Amberstone to build aggregations on high cardinality dimensions that could not be built if the entire aggregation were kept in memory at once.
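A minimal Python sketch of the sliding window idea follows. It assumes records arrive as (timestamp, key, value) tuples already sorted by time, and it yields each finished window instead of writing an aggregate file. The function name and record layout are hypothetical, not Amberstone's own.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, Optional, Tuple

SECONDS_PER_DAY = 86_400


def sliding_window_aggregate(
    records: Iterable[Tuple[int, int, int]],
    window_seconds: int = SECONDS_PER_DAY,
) -> Iterator[Tuple[int, Dict[int, int]]]:
    """Yield (window_start, totals) per window; input must be time-sorted."""
    current_window: Optional[int] = None
    totals: Dict[int, int] = defaultdict(int)
    for ts, key, value in records:
        window = ts - ts % window_seconds  # start of the window holding ts
        if current_window is not None and window != current_window:
            # Time moved past the window: emit it and free its memory.
            yield current_window, dict(totals)
            totals = defaultdict(int)
        current_window = window
        totals[key] += value
    if current_window is not None:
        yield current_window, dict(totals)  # flush the final window


# Made-up records: (unix timestamp, surrogate key, value).
records = [(10, 1, 5), (20, 2, 1), (30, 1, 2), (86400, 1, 4), (86500, 3, 1)]
for window_start, totals in sliding_window_aggregate(records):
    print(window_start, totals)
# 0 {1: 7, 2: 1}
# 86400 {1: 4, 3: 1}
```

Because the input is sorted by time, a window can never receive more records once a later timestamp appears, so only one window's hash table has to be held in memory at a time.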

The second approach for handling high cardinality is horizontal partitioning. Amberstone allows fact tables to be split across any number of partitions and then aggregated separately. Using this approach a fact table can be split on a high cardinality dimension across N partitions. Each partition will be aggregated separately in memory. Then the separate aggregate files can be quickly merged into a single master aggregate.
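The partition, aggregate and merge steps can be sketched in Python as follows. The function names, the modulo partitioning rule and the (key, value) record layout are assumptions made for the example. Amberstone's own partitioning scheme and aggregate file format are not shown here.

```python
from collections import Counter
from typing import Iterable, List, Tuple

Record = Tuple[int, int]  # (surrogate key of the split dimension, value)


def partition(records: Iterable[Record], num_partitions: int) -> List[List[Record]]:
    """Split records on the dimension key so each key lives in one partition."""
    partitions: List[List[Record]] = [[] for _ in range(num_partitions)]
    for key, value in records:
        partitions[key % num_partitions].append((key, value))
    return partitions


def aggregate_partition(records: Iterable[Record]) -> Counter:
    """Aggregate one partition in memory; only its keys are held at once."""
    totals: Counter = Counter()
    for key, value in records:
        totals[key] += value
    return totals


def merge(aggregates: Iterable[Counter]) -> Counter:
    """Combine the per-partition aggregates into a single master aggregate."""
    master: Counter = Counter()
    for partial in aggregates:
        master.update(partial)  # Counter.update adds values key by key
    return master


# Made-up records, split across 2 partitions.
records = [(1, 10), (2, 5), (1, 3), (4, 7), (3, 1)]
parts = partition(records, 2)
master = merge(aggregate_partition(p) for p in parts)
print(sorted(master.items()))  # [(1, 13), (2, 5), (3, 1), (4, 7)]
```

Since the split is on the dimension key, the partitions hold disjoint key sets, so each in-memory table holds roughly 1/N of the distinct keys if they are evenly distributed, and the merge is cheap.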

Combining sliding window aggregation with horizontal partitioning allows Amberstone to effectively manage high levels of cardinality.

License

Amberstone is released under the Apache 2.0 open source license.

http://www.apache.org/licenses/LICENSE-2.0