As a developer, you're no stranger to your vast and varied data environment… Or are you? The tremendous amount of data your organization collects is stored in various sources and formats. You need a way to understand where and what data is, to be able to do what you need to do: build amazing event-driven applications.
Heroku is a powerful platform for application development. Users can build and deploy on the cloud, and you can effortlessly scale up once your app takes off. And behind every app, you'll find an equally powerful database: Heroku Postgres. If you're building Heroku apps, you'll find them to be a rich source of operational and customer data. Add in the right Business Intelligence (BI) tools, and you'll be able to derive insights about the inner workings of your organization.
Cloudera Data Engineering is a serverless service for Cloudera Data Platform (CDP) that allows you to submit jobs to auto-scaling virtual clusters. CDE enables you to spend more time on your applications, and less time on infrastructure. CDE allows you to create, manage, and schedule Apache Spark jobs without the overhead of creating and maintaining Spark clusters.
The blog “Migrating Apache NiFi Flows from HDF to CFM with Zero Downtime” detailed how many common NiFi dataflows can be easily migrated when the Hortonworks DataFlow and Cloudera Flow Management clusters are running side-by-side. But what if you lack the resources to run multiple NiFi clusters concurrently? Not a problem.
In this blog post, I’ll set up and run a couple of experiments to demonstrate the effects of different kinds of partition pruning in Spark.
Over the past several years, there has been an explosion of different terms related to the world of IT operations. Not long ago, it was standard practice to separate business functions from IT operations. But those days are a distant memory now, and for good reason.
Since the start of the pandemic nearly a year ago, there's been one word on the lips of every business leader, analyst, and investor around the world: cloud. COVID-19 fundamentally changed the way businesses operate. In response, organizations went all in on cloud, betting on the unmatched scale, speed, and security of SaaS applications to help them weather the storm. Nowhere was this shift more pronounced that in our own data and analytics industry.
Data integration has been around for decades in some form or fashion, as organizations are always looking for ways to combine their enterprise data and collect it in a centralized location. The most commonly used and dominant type of data integration is ETL (extract, transform, load). ETL first extracts data from one or more source systems, transforms it as necessary, and then loads it into a target warehouse or data lake.