
April 2020

Operational Database Integrity

This blog post is part of a series on Cloudera’s Operational Database (OpDB) in CDP. Each post goes into more detail about new features and capabilities. Start from the beginning of the series with Operational Database in CDP. This blog post provides an overview of the OpDB data integrity capabilities that help you achieve ACID transactions and data consistency. OpDB guarantees certain properties to ensure atomicity, durability, consistency, and visibility.

The challenges you'll face deploying machine learning models (and how to solve them)

In 2019, organizations invested $28.5 billion in machine learning application development (Statista). Yet only 35% of organizations report having analytical models fully deployed in production (IDC). Taken together, those two statistics make it clear that there is a breadth of challenges to overcome before your models are deployed and running.

One billion files in Ozone

Apache Hadoop Ozone is a distributed key-value store that can manage both small and large files alike. Ozone was designed to address the scale limitations of HDFS with respect to small files. HDFS is designed to store large files; the recommended limit is 300 million files per Namenode, and HDFS doesn’t scale well beyond it.

Operational Database Availability

This blog post is part of a series on Cloudera’s Operational Database (OpDB) in CDP. Each post goes into more detail about new features and capabilities. Start from the beginning of the series with Operational Database in CDP. This blog post gives you an overview of the high availability configuration capabilities of Cloudera’s OpDB. Cloudera’s Operational Database (OpDB) is cluster-based software that comes configured for High Availability (HA) out of the box.

Augment EMR Workloads with CDP

The first thing that comes to mind when talking about synergy is how 2+2=5. Being the writer that he is, Mark Twain described it a lot more eloquently as “the bonus that is achieved when things work together harmoniously”. There is a multitude of product and business examples to illustrate the point and I particularly like how car manufacturers can bring together relatively small engines to do big things.

Machine learning in production: Human error is inevitable, here's how to prepare.

You did it. You have machine learning capabilities up and running in your organization. Success! What started as a few nascent experiments (and maybe a few failures) has become a set of carefully constructed models racing along in full production, with the ability to scale into the hundreds or thousands of production models in sight. Assembling your expert team of data scientists and custodians seems like a distant memory. Now you’re looking ahead to the future: growth, innovation, revenue!

Operational Database Management

This blog post is part of a series on Cloudera’s Operational Database (OpDB) in CDP. Each post goes into more detail about new features and capabilities. Start from the beginning of the series with Operational Database in CDP. This blog post gives you an overview of the OpDB management tools and features in the Cloudera Data Platform. The tools discussed in this article will help you understand the various options available to manage the operations of your OpDB cluster.

Challenges of running a big data distro in the cloud

There are many reasons to run a big data distribution, such as Cloudera Data Hub (CDH) and Hortonworks Data Platform (HDP), in the cloud with Infrastructure-as-a-Service (IaaS). The main reason is agility. When the business needs to onboard a new use case, a data admin can bring on additional virtual infrastructure to their clusters in the cloud in minutes or hours. With an on-prem cluster, it may take weeks or months to add the infrastructure capacity for the new use cases.

Evolving Insurance with Data and Analytics

Insurance companies around the world are striving ahead with innovative offerings that are fundamentally changing the insurance landscape. They are creating personalized offerings and products tailored to the specific needs of their customers. For example, they are implementing usage-based insurance (UBI) based on driving habits, miles driven, and driving history, and offering discounts on health insurance based on data from health trackers.

The U.S. Census Enters the Digital Age with Cloudera

2020 brings a new decade, and for the U.S. Census Bureau, a new challenge. As the leading provider of demographic and economic data for the federal government and the nation, its largest initiative is the U.S. Census, which is conducted every 10 years and counts every resident in the United States. For the first time in U.S. history, the census will be conducted primarily online instead of by mail.

Supercharge ML models with Distributed XGBoost on CML

Since childhood, we’ve been taught about the power of coalitions: working together to achieve a shared objective. In nature, we see this repeated frequently – swarms of bees, ant colonies, prides of lions – well, you get the idea. It is no different when it comes to Machine Learning models. Research and practical experience show that groups or ensembles of models do much better than a single silver-bullet model. Intuitively, this makes sense.

Operational Database Administration

This blog post is part of a series on Cloudera’s Operational Database (OpDB) in CDP. Each post goes into more detail about new features and capabilities. Start from the beginning of the series with Operational Database in CDP. This blog post gives you an overview of the operational database (OpDB) administration tools and features in the Cloudera Data Platform.

Benchmarking NiFi Performance and Scalability

Ever wonder how fast Apache NiFi is? Ever wonder how well NiFi scales? When a customer is looking to use NiFi in a production environment, these are usually among the first questions asked. They want to know how much hardware they will need, and whether or not NiFi can accommodate their data rates. This isn’t surprising. Today’s world consists of ever-increasing data volumes. Users need tools that make it easy to handle these data rates.

Hadoop: Decade Two, Day Zero*

One key aspect of the Cloudera Data Platform (CDP), which is just beginning to be understood, is how much of a recombinant-evolution it represents, from an architectural standpoint, vis-à-vis Hadoop in its first decade. I’ve been having a blast showing CDP to customers over the past few months and the response has been nothing short of phenomenal…

Operational Database Accessibility

This blog post is part of a series on Cloudera’s Operational Database (OpDB) in CDP. Each post goes into more detail about new features and capabilities. Start from the beginning of the series with Operational Database in CDP. Cloudera’s OpDB provides a rich set of capabilities to store and access data. In this blog post, we’ll look at the accessibility capabilities of OpDB and how you can use them to access your data.