
Latest Posts

Data security vs usability: you can have it all

Growing up, were you ever told you can’t have it all? That you can’t eat all the snacks in one sitting? That you can’t watch the complete Back to the Future trilogy and also study for your science exam in one evening? Over time, we learn to set priorities, choose one thing over another, and compromise. The same trade-off seems to apply to data access in business.

HBase Clusters Data Synchronization with HashTable/SyncTable tool

Replication (covered in this previous blog article) has been available for a while and is among the most widely used features of Apache HBase. Having clusters replicate data to different peers is a very common deployment, whether as a DR strategy or simply as a seamless way of moving data between production/staging/development environments.
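For readers who want a feel for the tool before diving into the full post, here is a minimal sketch of the usual two-step HashTable/SyncTable workflow, driven from Python. It assumes the hbase launcher is on the PATH of each cluster; the table names, HDFS paths, and ZooKeeper quorum are placeholders rather than values from the article.

```python
# Minimal sketch: driving HBase's HashTable/SyncTable MapReduce jobs from Python.
# Assumes the `hbase` launcher is on the PATH; table names, HDFS paths, and the
# ZooKeeper quorum are placeholders, not values from the post.
import subprocess

SOURCE_TABLE = "my_table"      # hypothetical table on the source cluster
TARGET_TABLE = "my_table"      # hypothetical table on the target cluster
HASH_DIR = "/hashes/my_table"  # HDFS dir where HashTable writes its hashes
SOURCE_ZK = "zk1.example.com,zk2.example.com,zk3.example.com:2181:/hbase"

# Step 1 (run on the source cluster): hash batches of cells for the source table.
subprocess.run(
    ["hbase", "org.apache.hadoop.hbase.mapreduce.HashTable",
     "--batchsize=8000", SOURCE_TABLE, HASH_DIR],
    check=True,
)

# Step 2 (run on the target cluster): compare the target table against those
# hashes and report divergent ranges; --dryrun=true reports without repairing.
subprocess.run(
    ["hbase", "org.apache.hadoop.hbase.mapreduce.SyncTable",
     "--dryrun=true", f"--sourcezkcluster={SOURCE_ZK}",
     f"hdfs://source-namenode:8020{HASH_DIR}", SOURCE_TABLE, TARGET_TABLE],
    check=True,
)
```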

Re-thinking The Insurance Industry In Real-Time To Cope With Pandemic-scale Disruption

The insurance industry is in uncharted waters, and COVID-19 has taken us where no algorithm has gone before. Today’s models, norms, and averages are being rewritten on the fly, with insurers forced to cope with the inevitable conflict between old standards and the new normal.

New Multithreading Model for Apache Impala

Today we are introducing a new series of blog posts that will look at recent enhancements to Apache Impala. Many of these are performance improvements, such as the feature described below, which delivers anywhere from a 2x to 7x speedup by taking better advantage of all the CPU cores. In addition, a lot of work has gone into ensuring that Impala runs optimally in decoupled compute scenarios, where the data lives in object storage or remote HDFS.
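As a rough, unofficial illustration of what taking better advantage of all the CPU cores looks like from a client’s perspective: Impala’s degree of intra-node parallelism is governed by the MT_DOP query option, which can be set per query, for example from Python via the impyla package. The host, port, and table below are placeholders.

```python
# Rough illustration (not from the post): Impala's intra-node parallelism is
# controlled per query with the MT_DOP query option. Host, port, and table are
# placeholders; requires the impyla package.
from impala.dbapi import connect

conn = connect(host="impala-coordinator.example.com", port=21050)
cur = conn.cursor()

cur.execute("SET MT_DOP=4")  # allow up to 4 fragment instances per node for this session
cur.execute("""
    SELECT l_returnflag, COUNT(*) AS cnt
    FROM tpch.lineitem
    GROUP BY l_returnflag
""")
for row in cur.fetchall():
    print(row)
```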

How-to: Index Data from S3 via NiFi Using CDP Data Hubs

Data Discovery and Exploration (DDE) was recently released in tech preview in Cloudera Data Platform in the public cloud. In this blog we will go through the process of indexing data from S3 into Solr in DDE with the help of NiFi in Data Flow. The scenario is the same as in the previous blog, but the ingest pipeline differs: there, Spark served as the ingest tool for Search, whereas here NiFi takes that role.
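The ingest itself is built in NiFi, so there is no code to show for that step here; purely as an illustrative sketch, once documents land in the DDE Solr collection they could be spot-checked from Python with pysolr. The endpoint and collection name below are hypothetical, and the authentication a real DDE endpoint requires is omitted.

```python
# Sketch only: the ingest is done in NiFi, but documents landing in the DDE
# Solr collection can be spot-checked with pysolr. The URL and collection name
# are hypothetical; TLS and authentication setup are omitted.
import pysolr

solr = pysolr.Solr("https://dde-host.example.com:8985/solr/s3-docs", timeout=10)

results = solr.search("*:*", rows=5)  # fetch a handful of indexed docs
print(f"{results.hits} documents in the collection")
for doc in results:
    print(doc.get("id"))
```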

Using Cloudera Machine Learning to Build a Predictive Maintenance Model for Jet Engines

Running a large commercial airline requires the complex management of critical components, including fuel futures contracts, aircraft maintenance, and customer expectations. In the U.S. alone, airlines average about 45,000 daily flights, transporting over 10 million passengers a year (source: FAA). Airlines typically operate on very thin margins, and any schedule delay immediately angers or frustrates customers.
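The post builds its model in Cloudera Machine Learning; the sketch below is only meant to convey the general shape of such a predictive-maintenance model, training a classifier on hypothetical engine-sensor readings. The CSV file, column names, and features are made up for illustration.

```python
# Illustrative sketch only: a predictive-maintenance classifier of the general
# kind the post describes, trained on hypothetical engine-sensor readings.
# The CSV file, column names, and label are made up for illustration.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

df = pd.read_csv("engine_sensor_readings.csv")  # placeholder per-flight sensor snapshots
features = ["egt_margin", "oil_pressure", "vibration", "fuel_flow", "cycles"]

X_train, X_test, y_train, y_test = train_test_split(
    df[features], df["needs_maintenance"], test_size=0.2, random_state=42
)

model = RandomForestClassifier(n_estimators=200, random_state=42)
model.fit(X_train, y_train)
print(classification_report(y_test, model.predict(X_test)))
```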

Apache Spark on Kubernetes: How Apache YuniKorn (Incubating) helps

Apache Spark unifies batch processing, real-time processing, stream analytics, machine learning, and interactive query in one platform. While Apache Spark provides a lot of capabilities to support diversified use cases, it comes with additional complexity and high maintenance costs for cluster administrators. Let’s look at some of the high-level requirements the underlying resource orchestrator must meet to empower Spark as that one platform.
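To make the orchestration question concrete, here is a minimal, hypothetical sketch of submitting a Spark application to Kubernetes with its driver and executor pods labeled for a YuniKorn queue. The API server, image, jar, and queue name are placeholders, and it assumes the cluster is already configured so that Spark pods are scheduled by YuniKorn (for example via its admission controller or a pod template that sets schedulerName: yunikorn).

```python
# Hypothetical sketch: submitting a Spark app to Kubernetes with driver and
# executor pods labeled for a YuniKorn queue. API server, image, jar, and queue
# are placeholders; assumes Spark pods are routed to the YuniKorn scheduler
# (e.g. via its admission controller or a pod template with schedulerName: yunikorn).
import subprocess

submit_cmd = [
    "spark-submit",
    "--master", "k8s://https://k8s-apiserver.example.com:6443",
    "--deploy-mode", "cluster",
    "--name", "spark-pi",
    "--class", "org.apache.spark.examples.SparkPi",
    "--conf", "spark.executor.instances=4",
    "--conf", "spark.kubernetes.container.image=registry.example.com/spark:latest",
    # Queue label that YuniKorn uses for placement (placeholder queue name).
    "--conf", "spark.kubernetes.driver.label.queue=root.sandbox",
    "--conf", "spark.kubernetes.executor.label.queue=root.sandbox",
    "local:///opt/spark/examples/jars/spark-examples.jar",
]
subprocess.run(submit_cmd, check=True)
```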

What you need to know to begin your journey to CDP

Recently, my colleague published a blog, "Build on your investment by Migrating or Upgrading to CDP Data Center," which highlights key CDP Private Cloud Base features. Existing CDH and HDP customers can immediately benefit from this new functionality. This blog focuses on how to accelerate your CDP journey to CDP Private Cloud Base, for both professional services engagements and self-service upgrades.

Cloudera acquires Eventador to accelerate Stream Processing in Public & Hybrid Clouds

We are thrilled to announce that Cloudera has acquired Eventador, a provider of cloud-native services for enterprise-grade stream processing. Eventador, based in Austin, TX, was founded by Erik Beebe and Kenny Gorman in 2016 to address a fundamental business problem: making it simpler to build streaming applications on real-time data. This has typically involved a lot of coding in Java, Scala, or similar technologies.