By Daniel Ng, Senior Director, APAC, Cloudera
Today, with the availability of modern hardware profiles, the demands of real-time and interactive data are being met, allowing us to capture, integrate and incorporate more data points from streams, external sources and internal core systems than ever before. We have no doubt made tremendous progress compared to just a few years ago, when we were focused on deploying purpose-built analytic databases to meet the defined demands of the data.
With this shift, users are no longer managing against a scarcity of resources; they are instead empowered to build new functionality that increases their organization’s ability to apply data analytics across more areas of the business.
Apache Hadoop has already delivered on the promise of limitless analytics by providing a distributed framework that allows for the collection of massive amounts of data, at a scale beyond the reach of most analytic database environments.
Furthermore, Hadoop has increased analytic performance by removing many common resource bottlenecks by bringing compute resources closer to the data. Hadoop users also now have a choice of processing frameworks and file systems to meet the discrete demands of their use case, without the need to employ multiple technology solutions.
The ability to act on data in real time is becoming increasingly powerful. Real-world examples of how Hadoop has changed our ability to deliver analytic value include helping retailers provide real-time offers through recommendation engines and enabling rapid location-based targeting from mobile sources. This is reshaping how marketers target their customers and shape their future product development.
In healthcare, we have seen hospitals leverage time-series data to better understand data from bedside monitors. By feeding this time-series data into an analytics environment, medical staff are able to achieve near real-time event monitoring during surgery and recovery.
The realization of these new capabilities also brings about the aspiration to leverage even more types of data. We have recently seen much innovation around solutions for streaming and online data formats, as well as an increase in analytic tooling that is opening up developer access to these new data types.
Some data types have remained challenging to analyze, in particular rapidly changing, or ‘mutable’, data. Here are some examples of the requirements that mutable data places on analytic systems:
· Time-series data — where users need insert, update, scan, and lookup capabilities to address use cases such as real-time streaming.
· Stock Market Data — where users need to run analytics on a full data set while new information streams in as real-time updates.
· Fraud Detection — where systems need to analyze data immediately to actively detect fraudulent activity.
· Operational Data — where users need to store logs for easy lookup and have reliable information for building analytic models.
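To make these requirements concrete, here is a minimal sketch in plain Python of the four operations a mutable store must support over time-series records: insert, in-place update, point lookup, and range scan. The class and method names are our own illustration, not the API of Kudu, HBase, or any real system.

```python
import bisect

class MutableTimeSeriesStore:
    """Toy in-memory store illustrating the operations mutable data demands.

    Illustrative only -- not the API of any real storage system.
    """

    def __init__(self):
        self._rows = {}   # key -> row dict
        self._keys = []   # sorted keys, to support range scans

    def insert(self, key, row):
        # New data streams in continuously (e.g. stock ticks, monitor readings).
        if key not in self._rows:
            bisect.insort(self._keys, key)
        self._rows[key] = dict(row)

    def update(self, key, **changes):
        # Late-arriving corrections must modify rows already stored.
        self._rows[key].update(changes)

    def lookup(self, key):
        # Operational use cases need fast point lookups (e.g. log retrieval).
        return self._rows.get(key)

    def scan(self, start, end):
        # Analytics needs efficient range scans over the full data set,
        # including rows that were updated after they first arrived.
        lo = bisect.bisect_left(self._keys, start)
        hi = bisect.bisect_right(self._keys, end)
        return [self._rows[k] for k in self._keys[lo:hi]]
```

A production system must of course also make these operations durable, distributed, and fast at scale, which is exactly where this simple picture breaks down and dedicated storage engines come in.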
Modern analytic solutions do exist for mutable data types. However, these may require yet another technology deployment or result in redundant storage. Existing solutions have also been plagued by some common drawbacks including poor analytic performance, complex application design, and security or policy enforcement across multiple access engines.
Mutable data types inside of Hadoop have often been handled by data stores like HBase, but often at the expense of analytic performance. This has forced developers to leverage both HDFS and HBase to strike a balance.
The good news is that we are making progress in this area. One of the new solutions now available is Kudu. Currently available as a public beta, Kudu is an updatable columnar store for Hadoop designed for fast analytic performance. It simplifies the architecture for building analytic applications on changing data, complementing the capabilities of HDFS and HBase.
Kudu provides a simpler architecture with superior performance in a single data store to support increasingly common real-time use cases. We expect it to greatly enhance the performance of Hadoop components like Impala and to help continue driving Impala’s performance leadership in the ecosystem.
Another benefit that Kudu brings is that it eliminates the need to explore tiering solutions that complicate Hadoop’s unified design. Developers no longer have to make a choice between the scanning analytic capabilities of HDFS and the insert and update capabilities of Apache HBase.
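To illustrate why an updatable columnar layout suits this middle ground, here is a toy Python sketch (our own illustration, not Kudu’s implementation): values for each column are stored contiguously, so an analytic aggregate reads only the column it needs, while an update rewrites a single cell rather than a whole immutable file.

```python
class ToyColumnStore:
    """Toy columnar table: contiguous per-column storage with in-place updates.

    Illustrative only -- not how Kudu is actually implemented.
    """

    def __init__(self, columns):
        self._cols = {name: [] for name in columns}
        self._n = 0

    def insert(self, row):
        # Append the row's value to each column's contiguous list.
        for name, values in self._cols.items():
            values.append(row[name])
        self._n += 1
        return self._n - 1  # row id

    def update(self, row_id, column, value):
        # In-place update of a single cell -- the operation plain
        # append-only HDFS files cannot do.
        self._cols[column][row_id] = value

    def scan_sum(self, column):
        # An analytic scan touches only one column's data: the
        # columnar advantage for aggregation-heavy workloads.
        return sum(self._cols[column])
```

The point of the sketch is the combination: a row store like HBase makes `update` easy but pays for `scan_sum`, while immutable columnar files make `scan_sum` fast but rule out `update`; an updatable columnar store offers both in one place.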
With data, you cannot have performance without ensuring security. Another solution that we have recently made available for public beta download is RecordService, a new role-based policy enforcement engine for Apache Hadoop. It is designed to provide centralized policy enforcement so that developers and users can continue to add new features to Hadoop against a consistent standard of policy management.
RecordService provides the controls that allow us to integrate sensitive data sources so that we are creating a better, full-fidelity view of data. This is crucial as more organizations are using big data systems to handle highly sensitive data.
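The idea of centralized, role-based enforcement can be sketched in a few lines of Python: a single policy table decides which fields each role may read, and every access path goes through the same filter. The policy contents and function names here are hypothetical illustrations, not RecordService’s actual API or configuration.

```python
# Hypothetical role-to-allowed-columns policy (illustrative values only).
POLICY = {
    "analyst":   {"patient_id", "heart_rate"},              # de-identified view
    "clinician": {"patient_id", "name", "heart_rate"},      # full clinical view
}

def read_records(role, records):
    """Return records with every field the role may not see stripped out.

    Because all readers call this one function, the policy is enforced
    centrally instead of being re-implemented per access engine.
    """
    allowed = POLICY.get(role, set())  # unknown roles see nothing
    return [{k: v for k, v in rec.items() if k in allowed} for rec in records]
```

The benefit mirrored here is the one described above: sensitive sources can be integrated into a full-fidelity view because each consumer sees only what its role permits.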
With data analytics, you have control
There is no doubt that Apache Hadoop is advancing the state of modern analytic databases. Business leaders are saying that data analytics is going to be the future of everything, but we believe that what is more important is that data analytics offers more control now.
To most businesses, the present can feel a little chaotic and overwhelming, with an onslaught of data growth and a competitive environment that is fast transforming. It is crucial that data analytics allows decision makers to measure, monitor and predict so that they can make changes when necessary. Data analytics offers them a view into who their customers are, what they are buying, how their business operation is running, what their market outlook is like, and much more. With a clearer view into what is going on, what data analytics offers is better control.