
April 2021

Managing Python dependencies for Spark workloads in Cloudera Data Engineering

Apache Spark is now widely used in many enterprises for building high-performance ETL and Machine Learning pipelines. For users already familiar with Python, PySpark provides a Python API for Apache Spark. When working with PySpark, users often rely on existing and/or custom Python packages in their programs to extend and complement Apache Spark’s functionality. Apache Spark provides several options to manage these dependencies.

Future of Data Meetup: Exploring Data and Creating Interactive Dashboards in the Cloud

In this meetup, we’re going to once again put ourselves in the shoes of an electric car manufacturer that is deploying a recently developed electric motor in its new cars. We’re going to show how to explore data that has been collected from various sources and stored in Apache Hive within a data warehouse, with the goal of tracking down a specific set of potentially defective parts. We’ll then take the results of this data exploration and create an interactive dashboard that presents them in a visually appealing way using a BI tool that’s integrated right into the same data warehouse.

Fast Forward Live: Few-Shot Text Classification

Join us for this month's Machine Learning research discussion with Cloudera Fast Forward Labs. We will discuss few-shot text classification, including a live demo and Q&A. This is an applied research report by Cloudera Fast Forward. We write reports about emerging technologies. Accompanying each report are working prototypes or code that exhibit the capabilities of the algorithm and offer detailed technical advice on its practical application.

The New Releases of Apache NiFi in Public Cloud and Private Cloud

Cloudera has recently released a lot around Apache NiFi! We just released Cloudera Flow Management (CFM) 2.1.1, which provides Apache NiFi on top of Cloudera Data Platform (CDP) 7.1.6. This major release delivers the latest and greatest of Apache NiFi: it includes Apache NiFi 1.13.2 along with additional improvements, bug fixes, and components. Cloudera also released CDP 7.2.9 on all three major cloud platforms, which brings Flow Management on DataHub with Apache NiFi 1.13.2 and more.

Cable Companies Are Growing Up

Cable and Satellite companies in the US have emerged from a decade of acquisitions, consolidation and shakeout and are beginning to assert themselves as full service providers in the communications and media space. With Comcast just announcing its new suite of cellphone plans this month, and Charter, Altice and Dish ramping up their offerings, the Big Three in wireless – AT&T, Verizon and T-Mobile/Sprint – are looking over their shoulders.

Converting HBase ACLs to Ranger policies

CDP uses Apache Ranger for data security management. If you wish to use Ranger for centralized security administration, HBase ACLs need to be migrated to Ranger policies. This can be done via the Ranger web UI, accessible from Cloudera Manager. But first, let’s take a quick look at HBase’s method for access control.

Cloudera Data Platform (CDP) Private Cloud on Red Hat OpenShift

Learn how Cloudera and Red Hat help enterprise companies securely manage the complete data lifecycle, putting data to work faster and reducing time to value. Cloudera Data Platform (CDP) Private Cloud on Red Hat® OpenShift® aggregates and visualizes data to derive actionable insights in a secure, hybrid, and open-source environment.

HDFS Data Encryption at Rest on Cloudera Data Platform

Encryption of data at rest is a highly desirable, and sometimes mandatory, requirement for data platforms in a range of industry verticals, including healthcare, financial, and government organizations. The capability increases security and protects sensitive data from various kinds of attacks, whether internal or external to the platform.

Apache Ozone and Dense Data Nodes

Today’s enterprise data analytics teams are constantly looking to get the best out of their platforms. Storage plays one of the most important roles in a data platform strategy: it provides the basis on which all compute engines and applications are built. Businesses are also looking to move to a scale-out storage model that provides dense storage along with reliability, scalability, and performance.

Future of Data Meetup: Nice to Meet You, NiFi!

You asked for it, and we are delivering the third in our “Hello:” series of introductory “Big Data” topics. Our next meetup covers using Apache NiFi. Lots of people want to be a data scientist... but what good is machine learning, artificial intelligence or advanced analytics if you don’t have data? Getting data is incredibly important, but getting data in real time or near real time helps you deliver near-real-time insight.

Drinking our own champagne - Cloudera upgrades to CDP Private Cloud

Like most of our customers, Cloudera’s internal operations rely heavily on data. For more than a decade, Cloudera has built internal tools and data analysis primarily on a single production CDH cluster. This cluster runs workloads for every department – from real-time user interfaces for Support to providing recommendations in the Cloudera Data Platform (CDP) Upgrade Advisor to analyzing our business and closing our books.

What is Streaming Analytics?

Streaming Analytics is a type of data analysis that processes data streams for real-time analytics. It continuously processes data from multiple streams and performs anything from simple calculations to complex event processing in order to deliver sophisticated use cases. The primary purpose is to present the most up-to-date operational events so that the user can stay on top of business needs and take action as changes happen in real time.

What's new in CDP Private Cloud Base 7.1.6?

According to IDG, when customers consider updating to the latest release of a product, they expect new features, enhanced security, and better performance, but increasingly want a more streamlined upgrade process. With each new release of CDP Private Cloud, this is exactly what we strive to deliver. Along with a host of new features and capabilities, we are improving the upgrade process to be as painless as possible.

Cloudera Data Engineering - Integration steps to leverage Spark on Kubernetes

Cloudera Data Engineering is a serverless service for Cloudera Data Platform (CDP) that allows you to submit jobs to auto-scaling virtual clusters. CDE enables you to spend more time on your applications, and less time on infrastructure. CDE allows you to create, manage, and schedule Apache Spark jobs without the overhead of creating and maintaining Spark clusters.

No Data Loss and No Service Interruption - HDF to CFM Rolling Migration

The blog “Migrating Apache NiFi Flows from HDF to CFM with Zero Downtime” detailed how many common NiFi dataflows can be easily migrated when the Hortonworks DataFlow and Cloudera Flow Management clusters are running side-by-side. But what if you lack the resources to run multiple NiFi clusters concurrently? Not a problem.

5 Success Stories That Show the Value of Enterprise Data Cloud

What’s the fastest and easiest path towards powerful cloud-native analytics that are secure and cost-efficient? In our humble opinion, we believe that’s Cloudera Data Platform (CDP). And sure, we’re a little biased—but only because we’ve seen firsthand how CDP helps our customers realize the full benefits of public cloud.

10 Steps to Achieve Enterprise Machine Learning Success

You’ve probably heard it more than once: Machine learning (ML) can take your digital transformation to another level. It’s a pie-in-the-sky statement that sounds great, right? And while you’d be forgiven for thinking that it might sound too good to be true, operational ML is, in fact, achievable and sustainable. You can get the very kind of ML you need to increase revenue and lower costs, and to help teams work smarter and do things faster.

The Key to Unlocking IT Modernization's Power? Enterprise level Transformation

The United States Veterans Administration (VA) over the last decade underwent a massive enterprise-wide IT transformation, eliminating its fragmented shadow IT and adopting a centralized system capable of supporting the agency’s 400,000 employees and more effectively utilizing its $240 billion-plus annual budget. The result: A more reliable and modern IT environment that improves access, availability, and user experience, ultimately supporting the VA mission more effectively.

Enabling NVIDIA GPUs to accelerate model development in Cloudera Machine Learning

When working on complex or rigorous enterprise machine learning projects, Data Scientists and Machine Learning Engineers experience various degrees of processing lag when training models at scale. While model training on small data typically takes minutes, doing the same on large volumes of data can take hours or even weeks. To overcome this, practitioners often turn to NVIDIA GPUs to accelerate machine learning and deep learning workloads.

Next Stop - Predicting on Data with Cloudera Machine Learning

This blog series follows the manufacturing and operations data lifecycle stages of an electric car manufacturer – typically experienced in large, data-driven manufacturing companies. The first blog introduced a mock vehicle manufacturing company, The Electric Car Company (ECC) and focused on Data Collection. The second blog dealt with creating and managing Data Enrichment pipelines. The third video in the series highlighted Reporting and Data Visualization.

Building Automated ML Pipelines in Cloudera Machine Learning

In this video, we'll walk through an example of how you can use Cloudera Machine Learning to run some Python code that creates specific Machine Learning models. We’ll then go through some features within Cloudera Machine Learning such as job scheduling and model deployments to see how you can do some more advanced machine development operations!

Enabling kubectl for CDE

The kubectl tool provides direct administrative access to the Kubernetes cluster underlying a CDE service, which is useful for troubleshooting, among other things. This video demonstrates how to set up kubectl access. To enable kubectl, we need a couple of prerequisites. We will need the kubeconfig file from the CDE service. We will also need to get and authorize the IAM user, and then make sure that everything is set up correctly, both for kubectl and for other tools such as k9s.

Cloudera Honored With 5-Star Rating in the 2021 CRN Partner Program Guide

Cloudera is being acknowledged by CRN®, a brand of The Channel Company, in its 2021 Partner Program Guide. This annual guide provides a conclusive list of the most distinguished partner programs from leading technology companies that provide products and services through the IT Channel. The 5-Star rating is awarded to an exclusive group of companies that offer solution providers the best of the best, going above and beyond in their partner programs.

Hybrid Cloud and Strategic Data Use Accelerate State, Army Missions

Some of the most forward-operational elements of the United States federal government are making strides in leveraging data through hybrid cloud environments—and they’re constantly evaluating progress and recalibrating their approaches along the way. At agencies including the Army and the State Department, work is well underway to find ways of employing emerging technologies that build on cloud services and data optimization to realize new levels of effectiveness.

Fast Forward Live: Representation Learning & Image Analysis

Good representations of data (e.g., text, images) are critical for solving many tasks (e.g., search or recommendations). But what exactly are representations, how can they be built and why are deep learning models useful? In this livestream, we will discuss these questions from a software engineering perspective and walk through a live example!