December 2020

An A-Z Data Adventure on Cloudera's Data Platform

In this blog we will take you through a persona-based data adventure, with short demos attached, to show you the A-Z data worker workflow expedited and made easier through self-service, seamless integration, and cloud-native technologies. You will learn all the parts of Cloudera’s Data Platform that together will accelerate your everyday Data Worker tasks.

How ASEAN Retailers Can Become Insight-Driven with a Hybrid Cloud Data Strategy

There has been an e-commerce explosion this year as consumers seek safety and convenience from the comfort of their own homes, using digital tools to purchase everything from durable hard goods to fashion accessories to daily living consumables such as perishable food, cleaning products, and even school supplies.

Enabling The Full ML Lifecycle For Scaling AI Use Cases

When it comes to machine learning (ML) in the enterprise, there are many misconceptions about what it actually takes to effectively employ machine learning models and scale AI use cases. When many businesses start their journey into ML and AI, it’s common to place a lot of energy and focus on the coding and data science algorithms themselves.

Cloudera Replication Plugin enables x-platform replication for Apache HBase

The Cloudera Data Platform (CDP) is the latest Big Data offering from Cloudera. It includes Apache HBase and Phoenix as part of the platform. These two components are provided in three form factors. Cloudera’s Apache HBase customers typically run mission-critical applications that cannot afford any downtime. They need a way to migrate to a new deployment with no production outage or, at most, a very brief one.

The role of data in COVID-19 vaccination record keeping

Now that the Pfizer vaccine has been approved by the FDA for use in the US, and the Moderna vaccine likely isn’t far behind, we are on the verge of being able to emerge from the social distancing world that began earlier in 2020. Recent news reports have discussed distributing a vaccination record card to everyone who gets a COVID-19 vaccine.

Bringing transaction support to Cloudera Operational Database

We’re excited to share that after adding ANSI SQL, secondary indices, star schema, and view capabilities to Cloudera’s Operational Database, we will be introducing distributed transaction support in the coming months. The ACID model of database design is one of the most important concepts in databases. ACID stands for atomicity, consistency, isolation, and durability. For a very long time, strict adherence to these four properties was required for a commercially successful database.
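
To make the atomicity property concrete, here is a minimal sketch using Python's built-in sqlite3 module as a stand-in. It is not Cloudera Operational Database code; the `accounts` table, the account names, and the `transfer` helper are hypothetical and exist only for illustration. Both updates in a transfer commit together, or neither does.

```python
import sqlite3


def transfer(conn, src, dst, amount):
    """Move `amount` between accounts: both updates commit or neither does."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
            row = conn.execute(
                "SELECT balance FROM accounts WHERE name = ?", (src,)
            ).fetchone()
            if row is None or row[0] < 0:
                raise ValueError("insufficient funds")
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
        return True
    except ValueError:
        return False


# Hypothetical sample data in an in-memory database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)])
conn.commit()

ok = transfer(conn, "alice", "bob", 30)       # succeeds and commits
failed = transfer(conn, "alice", "bob", 500)  # overdraws, so it is rolled back
balances = dict(conn.execute("SELECT name, balance FROM accounts"))
```

The failed transfer leaves no partial debit behind, which is the all-or-nothing behavior that atomicity describes.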

How does Apache Spark 3.0 increase the performance of your SQL workloads

Across nearly every sector working with complex data, Spark has quickly become the de facto distributed computing framework for teams across the data and analytics lifecycle. One of the most awaited features of Spark 3.0 is the new Adaptive Query Execution (AQE) framework, which fixes issues that have plagued many Spark SQL workloads. Those issues were documented in early 2018 in this blog from a mixed Intel and Baidu team.
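
As a rough sketch of how AQE might be switched on for a job, the snippet below assembles a `spark-submit` command line from Spark 3.0 configuration keys. The property names are standard Spark SQL settings; the `spark_submit_args` helper and the `my_job.py` application name are hypothetical, and whether each setting helps depends on the workload.

```python
# Spark 3.0 settings that control Adaptive Query Execution.
AQE_SETTINGS = {
    "spark.sql.adaptive.enabled": "true",                     # master switch
    "spark.sql.adaptive.coalescePartitions.enabled": "true",  # merge small shuffle partitions
    "spark.sql.adaptive.skewJoin.enabled": "true",            # split skewed join partitions
}


def spark_submit_args(settings, app="my_job.py"):
    """Build a spark-submit argument list (hypothetical helper)."""
    args = ["spark-submit"]
    for key, value in settings.items():
        args += ["--conf", f"{key}={value}"]
    return args + [app]


command = spark_submit_args(AQE_SETTINGS)
print(" ".join(command))
```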

Top 4 Reasons Why You Should Upgrade Your Stream Processing Workloads To CDP

If there’s one thing enterprises have learned in 2020, it’s how to navigate through uncertain times, and in 2021, organizations will likely have to continue navigating through a shifting landscape. One trend that we’ve seen this year is that enterprises are leveraging streaming data to get through unplanned disruptions and to make the best business decisions for their stakeholders.

Covid Data: An anomalous blip, or the new normal?

COVID-19 has forced virtually every industry to embrace an acceleration in digital capabilities. While it can be argued that digital transformation was already underway, it’s hard to dispute that it has accelerated in recent months. A recent McKinsey survey, cited in CRN, shows that worldwide, 58 percent of customer interactions were digital as of July 2020.

3x better performance with CDP Data Warehouse compared to EMR in TPC-DS benchmark

In a previous blog post on CDW performance, we compared Azure HDInsight to CDW. In this blog post, we compare Cloudera Data Warehouse (CDW) on Cloudera Data Platform (CDP) using Apache Hive-LLAP to EMR 6.0 (also powered by Apache Hive-LLAP) on Amazon using the TPC-DS 2.9 benchmark. Amazon recently announced their latest EMR version 6.1.0 with support for ACID transactions. This benchmark is run on EMR version 6.0 as we couldn’t get queries to run successfully on version 6.1.0.

How to configure clients to connect to Apache Kafka Clusters securely - Part 2: LDAP

In the previous post, we talked about Kerberos authentication and explained how to configure a Kafka client to authenticate using Kerberos credentials. In this post we will look into how to configure a Kafka client to authenticate using LDAP, instead of Kerberos. We will not cover the server-side configuration in this article but will add some references to it when required to make the examples clearer.
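
A minimal sketch of what such a client configuration could look like is shown below. It assumes the common setup in which the client sends LDAP credentials over SASL/PLAIN on a TLS connection and the broker verifies them against the directory; the helper names, user name, password, and truststore path are hypothetical placeholders.

```python
def ldap_client_properties(username, password, truststore_path):
    """Return Kafka client properties for SASL/PLAIN over TLS (assumed setup)."""
    jaas = (
        "org.apache.kafka.common.security.plain.PlainLoginModule required "
        f'username="{username}" password="{password}";'
    )
    return {
        "security.protocol": "SASL_SSL",  # TLS protects the password in transit
        "sasl.mechanism": "PLAIN",
        "sasl.jaas.config": jaas,
        "ssl.truststore.location": truststore_path,
    }


def render_properties(props):
    """Render the dict in Java .properties format."""
    return "\n".join(f"{key}={value}" for key, value in props.items())


# Placeholder values for illustration only.
ldap_props = ldap_client_properties("alice", "s3cret", "/opt/certs/truststore.jks")
print(render_properties(ldap_props))
```

Because PLAIN transmits the password as-is, this kind of setup is normally paired with TLS (`SASL_SSL`) rather than an unencrypted listener.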

Cost Conscious Data Warehousing with Cloudera Data Platform

Have you been burned by the unexpected costs of a cloud data warehouse? If so, you know about the failed economics of some cloud-native solutions on the market today. If not, before adopting a cloud data warehouse, consider the true costs of a cloud-native data warehouse. Data warehouses have been broadly adopted to provide timely reports and valuable insights. However, traditional deployments are notoriously cumbersome and cost-prohibitive at large scales.

Federated Learning, Machine Learning, Decentralized Data

Two years ago we wrote a research report about Federated Learning. We’re pleased to make the report available to everyone, for free. You can read it online here: Federated Learning. Federated Learning is a paradigm in which machine learning models are trained on decentralized data. Instead of collecting data on a single server or data lake, it remains in place—on smartphones, industrial sensing equipment, and other edge devices—and models are trained on-device.
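
The idea can be illustrated with a toy version of federated averaging. This is a sketch under simplifying assumptions, not code from the report: the model is a single weight `w` for y ~ w * x, the two client datasets are made up, and the function names are hypothetical. Each client trains on its own data, and only the resulting weights are sent back to be averaged.

```python
from typing import List, Tuple

Dataset = List[Tuple[float, float]]  # (x, y) pairs that never leave the device


def local_update(w: float, data: Dataset, lr: float = 0.05, epochs: int = 5) -> float:
    """Run a few gradient-descent steps for y ~ w * x on one device's data."""
    for _ in range(epochs):
        grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
        w -= lr * grad
    return w


def federated_average(w: float, clients: List[Dataset], rounds: int = 20) -> float:
    """Each round: clients train locally, the server averages their weights."""
    for _ in range(rounds):
        updates = [(local_update(w, data), len(data)) for data in clients]
        total = sum(n for _, n in updates)
        # Weight each client's model by how many examples it trained on.
        w = sum(w_k * n for w_k, n in updates) / total
    return w


# Made-up data: both clients observe the same relationship y = 3x.
clients = [
    [(1.0, 3.0), (2.0, 6.0)],
    [(0.5, 1.5), (1.5, 4.5), (3.0, 9.0)],
]
w = federated_average(0.0, clients)
print(round(w, 4))
```

Only the scalar weights cross the network in this sketch; the raw (x, y) pairs stay in each client's list.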

How Cloudera Supports Government Data Encryption Standards

As part of our ongoing commitment to supporting Government regulations and standards in our enterprise solutions, including data protection, Cloudera recently introduced a version of our Cloudera Data Platform, Private Cloud Base product (7.1.5 release) that can be configured to use FIPS compliant cryptography.

Get to Know Your Retail Customer: Accelerating Customer Insight and Relevance

There are lessons to be learned from the brick-and-mortar and pure-play digital retailers that have been successful in the Covid-19 chaos. Although the pandemic’s stress test of e-commerce, in-store insights, supply chain visibility, and fulfillment capabilities has revealed shortcomings and long-lasting consumer experiences, it has also allowed many companies to pivot to very successful strategies built on enterprise data and the digitization efforts that accompany it.

Global View Distributed File System with Mount Points

Apache Hadoop Distributed File System (HDFS) is the most popular file system in the big data world. The Apache Hadoop File System interface has provided integration with many other popular storage systems such as Apache Ozone, S3, and Azure Data Lake Storage. Some HDFS users want to extend the HDFS Namenode capacity by configuring Federation of Namenodes. Other users prefer alternative file systems like Apache Ozone or S3 due to their scaling benefits.

Accelerate Application Development with the Operational Database Demo Highlight

Cloudera Operational Database is a fast, flexible, dbPaaS database that enables faster application development. It simplifies application planning as it grows in scale and importance, and is a great fit for many application types including mobile, web, gaming, ad-tech, IoT, and ML model serving.

How to configure clients to connect to Apache Kafka Clusters securely - Part 1: Kerberos

This is the first installment in a short series of blog posts about security in Apache Kafka. In this article we will explain how to configure clients to authenticate with clusters using different authentication mechanisms.
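
For Kerberos, a client configuration might look like the sketch below. It assumes keytab-based login through the JDK's `Krb5LoginModule` and a broker service principal named `kafka`; the helper name, principal, and file paths are hypothetical placeholders.

```python
def kerberos_client_properties(principal, keytab_path, truststore_path):
    """Return Kafka client properties for SASL/GSSAPI (Kerberos) over TLS."""
    jaas = (
        "com.sun.security.auth.module.Krb5LoginModule required "
        f'useKeyTab=true keyTab="{keytab_path}" principal="{principal}";'
    )
    return {
        "security.protocol": "SASL_SSL",
        "sasl.mechanism": "GSSAPI",
        "sasl.kerberos.service.name": "kafka",  # assumed broker service name
        "sasl.jaas.config": jaas,
        "ssl.truststore.location": truststore_path,
    }


# Placeholder values for illustration only.
krb_props = kerberos_client_properties(
    "alice@EXAMPLE.COM", "/etc/security/alice.keytab", "/opt/certs/truststore.jks"
)
for key, value in krb_props.items():
    print(f"{key}={value}")
```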

Cloudera Operational Database Infrastructure Planning Considerations

In this blog post, let us take a look at the infrastructure planning you may have to do when deploying an operational database cluster on a CDP Private Cloud Base deployment. Note that you may have to make some planning assumptions when designing your initial infrastructure, and it must be flexible enough to scale up or down based on your future needs.