
September 2022

Data Governance and Strategy for the Global Enterprise

While the word “data” has been common since the 1940s, managing data’s growth, current use, and regulation is a relatively new frontier. Governments and enterprises are working hard today to figure out the structures and regulations needed around data collection and use. According to Gartner, by 2023 65% of the world’s population will have their personal data covered under modern privacy regulations.

Cloudera DataFlow Functions for Public Cloud powered by Apache NiFi

Since its initial release in 2021, Cloudera DataFlow for Public Cloud (CDF-PC) has been helping customers solve data distribution use cases that demand high throughput and low latency, and therefore always-running clusters. CDF-PC’s DataFlow Deployments provide a cloud-native runtime that runs your Apache NiFi flows on auto-scaling Kubernetes clusters, along with centralized monitoring and alerting and an improved SDLC for developers.

Serverless NiFi Flows with DataFlow Functions: The Next Step in the DataFlow Service Evolution

Cloudera DataFlow for the Public Cloud (CDF-PC) is a cloud-native service for Apache NiFi within the Cloudera Data Platform (CDP). CDF-PC enables organizations to take control of their data flows and eliminate ingestion silos by allowing developers to connect to any data source anywhere, with any structure, process it, and deliver it to any destination using a low-code authoring experience.

Announcing GA of DataFlow Functions

Today, we’re excited to announce that DataFlow Functions (DFF), a feature within Cloudera DataFlow for the Public Cloud, is now generally available for AWS, Microsoft Azure, and Google Cloud Platform. DFF provides an efficient, cost-optimized, and scalable way to run NiFi flows in a completely serverless fashion. This is the first complete no-code, no-ops development experience for functions, allowing users to save time and resources.

The Top Three Entangled Trends in Data Architectures: Data Mesh, Data Fabric, and Hybrid Architectures

Data teams have the impossible task of delivering everything (data and workloads) everywhere (on premises and in all clouds) all at once (with little to no latency). They are being bombarded with literature about seemingly independent new trends like data mesh and data fabric while dealing with the reality of having to work with hybrid architectures. Each of these trends claims to be a complete model for data architecture that solves the “everything everywhere all at once” problem.

Improve Underwriting Using Data and Analytics

Insurance carriers are always looking to improve operational efficiency. We’ve previously highlighted opportunities to improve digital claims processing with data and AI. In this post, I’ll explore opportunities to enhance risk assessment and underwriting, especially in personal lines and small and medium-sized enterprises.

SCIM (System for Cross-domain Identity Management)

The identity team at Cloudera has been working to add System for Cross-domain Identity Management (SCIM) support to Cloudera Data Platform (CDP), and we’re happy to announce the general availability of SCIM on Azure Active Directory! Part One, CDP SCIM Support for Active Directory, covers the core elements of CDP’s SCIM support for Azure AD.

A Flexible and Efficient Storage System for Diverse Workloads

Apache Ozone is a distributed, scalable, and high-performance object store, available with Cloudera Data Platform (CDP), that can scale to billions of objects of varying sizes. It was designed as a native object store to provide extreme scale, performance, and reliability, and to handle multiple analytics workloads using either the S3 API or the traditional Hadoop API.

Demystifying Modern Data Platforms

July brings summer vacations, holiday gatherings, and for the first time in two years, the return of the Massachusetts Institute of Technology (MIT) Chief Data Officer symposium as an in-person event. The gathering in 2022 marked the sixteenth year for top data and analytics professionals to come to the MIT campus to explore current and future trends. A key area of focus for the symposium this year was the design and deployment of modern data platforms.

Choose Both: Data Fabric and Data Lakehouse

A key part of business is the drive for continual improvement, to always do better. “Better” can mean different things to different organizations: better products, better services, the same product or service at a better price, or any number of other things. Fundamentally, being “better” requires ongoing analysis of the current state and comparison against previous states. It sounds straightforward: you just need data and the means to analyze it.

Get to anomaly detection faster with Cloudera's Applied Machine Learning Prototypes

The Applied Machine Learning Prototype (AMP) for anomaly detection reduces implementation time by providing a reference model that you can build from. Built by Fast Forward Labs, and tested on AMD EPYC™ CPUs with Dell Technologies, this AMP enables data scientists across industries to truly practice predictive maintenance.
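To make the idea concrete, here is a minimal anomaly-detection baseline using scikit-learn’s IsolationForest. This is an illustrative sketch only; it does not reproduce the AMP’s own Fast Forward Labs reference model, and the simulated sensor data is invented for the example.

```python
# Hedged sketch: flagging anomalous sensor readings with an isolation forest.
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(42)
# Simulated "healthy" sensor readings clustered around 10.0.
normal = rng.normal(loc=10.0, scale=0.5, size=(200, 1))

model = IsolationForest(contamination=0.01, random_state=0)
model.fit(normal)

# predict() returns +1 for inliers and -1 for anomalies.
labels = model.predict(np.array([[10.1], [25.0]]))
print(labels)
```

A reading near the healthy cluster is labeled an inlier, while the far-off reading is flagged as an anomaly; a real predictive-maintenance flow would score streaming telemetry the same way.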

The Modern Data Lakehouse: An Architectural Innovation

Imagine having self-service access to all business data, anywhere it may be, and being able to explore it all at once. Imagine quickly answering burning business questions nearly instantly, without waiting for data to be found, shared, and ingested. Imagine independently discovering rich new business insights from both structured and unstructured data working together, without having to beg for data sets to be made available.

Kubernetes Logs Collection with MiNiFi C++

The MiNiFi C++ agent provides many features for collecting and processing data at the edge. All the strengths of MiNiFi C++ make it a perfect candidate for collecting logs of cloud-native applications running on Kubernetes. This video explains how to use the MiNiFi C++ agent as a sidecar pod or as a DaemonSet to collect logs from Kubernetes applications. It goes through many examples and demonstrations to get you started with your own deployments. Don’t hesitate to reach out to Cloudera to get more details and discuss further options and integrations with Edge Flow Manager.
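The DaemonSet pattern mentioned above can be sketched as a manifest like the following: one MiNiFi C++ agent per node, with the node’s log directory mounted read-only. The image tag, labels, and mount paths here are illustrative assumptions, not values taken from the video.

```yaml
# Hedged sketch of a per-node MiNiFi C++ log-collector DaemonSet.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: minifi-log-collector
spec:
  selector:
    matchLabels:
      app: minifi-log-collector
  template:
    metadata:
      labels:
        app: minifi-log-collector
    spec:
      containers:
        - name: minifi
          image: apache/nifi-minifi-cpp:latest   # assumed image tag
          volumeMounts:
            - name: varlog
              mountPath: /var/log                # node logs, read-only
              readOnly: true
      volumes:
        - name: varlog
          hostPath:
            path: /var/log
```

The sidecar alternative is similar but places the agent container inside each application pod, sharing a volume with the application instead of mounting the host path.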

Large Scale Industrialization Key to Open Source Innovation

We are now well into 2022, and the megatrends that drove the last decade in data—the Apache Software Foundation as a primary innovation vehicle for big data, the arrival of cloud computing, and the debut of cheap distributed storage—have now converged and offer clear patterns for competitive advantage for vendors and value for customers.

Modern Data Architecture for Telecommunications

In the wake of the disruption caused by the world’s turbulence over the past few years, the telecommunications industry has come out reasonably unscathed. There remain challenges in workforce management, particularly in call centers, and order backlogs for fiber broadband and other physical infrastructure are being worked through. But digital transformation programs are accelerating, services innovation around 5G is continuing apace, and results to the stock market have been robust.

Managing agents in Edge Flow Manager

This video explains the Agent Manager view introduced with the 1.4 release. The main goal of this view is to give users a better understanding of, and more control over, the agents in the system. Monitoring individual agents’ health becomes easier because you can see rich details about them. From the Agent Details view, you can also request and download debug logs from the agents, so in case of any issues you don’t need to log in to the agent’s environment. The highly customizable main table and the different tabs (details, alerts, commands, and properties) are explained in detail.

Five Reasons for Migrating HBase Applications to Cloudera Operational Database in the Public Cloud

Apache HBase has long been the database of choice for business-critical applications across industries. This is primarily because HBase provides unmatched scale, performance, and fault tolerance that few other databases can match. Think petabytes of data spread across trillions of rows, ready for consumption in real time.