Future of Data Meetup | Apache Iceberg: Looking Below the Waterline

Dec 9, 2022

Apache Iceberg has become one of the most popular, fastest-growing, and most widely adopted open table formats in the big data space.

Hear from (and put your questions to!) some of the key partners leading the Iceberg roadmap and enhancements.

AGENDA:

Apache Iceberg for BI use cases
Speakers: Vincent Kulandaisamy and Shaun Ahmadian, Cloudera

This talk will cover the integration of the Iceberg open table format with the Apache Hive and Impala compute engines, support for Iceberg v1 and v2 capabilities, customer use cases, and future Iceberg enhancements and innovations in the works at Cloudera. We'll take a detailed look at the following capabilities supported in Hive and Impala:

  • Critical functional and performance enhancements
  • Materialized views support
  • In-place migration of Hive external tables to Iceberg tables
  • Row level update/delete
  • Table rollback
  • Table maintenance

Learn how Teranet keeps up with the growth and changing requirements of its business by using Apache Iceberg, with Spark and Impala, for its change data capture use case.

Multi-function Analytics with Apache Iceberg
Speaker: Wing Yew Poon, Cloudera

This session will demonstrate using Spark with Iceberg tables, highlighting key Iceberg features. We'll show the interoperability of Spark with Hive and Impala. Along the way, we'll cover Cloudera's contributions to improving Spark and Impala support for Iceberg.

Apache Iceberg's REST Catalog - Real and Potential Uses Beyond Data Workflows
Speaker: Samuel Redai, Netflix

Iceberg's new REST catalog provides a friendly access point for the rich metadata and functionality that come with an Iceberg-powered data warehouse. This makes Iceberg even easier to integrate into compute engines and makes catalog operations available from pretty much any client you can imagine. However, the power of the REST catalog doesn't stop there. A myriad of tools and features that sit on the edge of the data platform benefit greatly from the REST catalog design. In this talk, I want to cover a few creative uses that currently exist as well as some imaginative uses that could exist.
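
To make "available from pretty much any client" concrete, here is a minimal sketch of how a plain HTTP client might address a table through a REST catalog. The path layout follows our reading of the Iceberg REST catalog OpenAPI specification; the host, namespace, and table names are hypothetical, and `table_url` is a helper invented for this illustration rather than part of any Iceberg library.

```python
from urllib.parse import quote


def table_url(base_url: str, namespace: str, table: str, prefix: str = "") -> str:
    """Build the URL a client would GET to load a table's metadata.

    Assumes a single-level namespace and the /v1/{prefix}/namespaces/... layout.
    """
    parts = ["v1"]
    if prefix:
        # Some catalogs are configured with an optional path prefix.
        parts.append(quote(prefix, safe=""))
    parts += ["namespaces", quote(namespace, safe=""), "tables", quote(table, safe="")]
    return base_url.rstrip("/") + "/" + "/".join(parts)


# Hypothetical catalog endpoint, namespace, and table.
url = table_url("http://localhost:8181", "db", "events")
print(url)
```

Any tool that can issue an HTTP GET against a URL like this one (a dashboard, a data-quality checker, a shell script) could read table metadata without embedding a compute engine.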

Incremental compaction using Apache Iceberg
Speaker: Vikram Bohra, LinkedIn

At LinkedIn, streaming data in the form of Kafka topics is ingested into the data lake by low-latency ingestion pipelines powered by Apache Gobblin. This often produces small files that can contain duplicate records because of at-least-once delivery semantics, which led to the creation of another set of pipelines that deduplicate data for correctness and compact it into larger files for storage and query efficiency. Those compaction pipelines are bursty and compute-intensive, and they have higher latency because of their batch-processing nature. As data volume grows, it becomes increasingly important to process data incrementally for optimal resource utilization and lower latency. In this talk, we present how LinkedIn leverages Iceberg to migrate its compaction pipelines from batch to incremental processing models and solve these latency and compute problems. We also show how that improves overall cluster resource utilization and distributes workload more uniformly. Finally, we focus on how we optimize compaction and data deduplication in light of late data.
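
As a toy illustration of the idea (not the pipeline presented in the talk), the sketch below merges only newly arrived records into already-compacted partitions, dropping duplicates and rewriting just the partitions that actually changed. The record shape (`id`, `event_hour`, `payload`) and the function name are assumptions made for this example; a real pipeline would operate on Iceberg snapshots and data files rather than in-memory dictionaries.

```python
from typing import Dict, Iterable, Set

# Compacted state: partition (event hour) -> record id -> payload.
Table = Dict[str, Dict[str, int]]


def compact_increment(table: Table, new_records: Iterable[dict]) -> Set[str]:
    """Merge a new batch into the compacted table; return the partitions rewritten."""
    touched: Set[str] = set()
    for rec in new_records:
        rows = table.setdefault(rec["event_hour"], {})
        # A record id already present is a redelivery (at-least-once) and is skipped.
        if rec["id"] not in rows:
            rows[rec["id"]] = rec["payload"]
            touched.add(rec["event_hour"])
    return touched


table: Table = {}

# First batch: record "a" is delivered twice.
first_touched = compact_increment(table, [
    {"id": "a", "event_hour": "10", "payload": 1},
    {"id": "a", "event_hour": "10", "payload": 1},
    {"id": "b", "event_hour": "10", "payload": 2},
])

# Second batch: "d" is late data for hour 10, and "b" is redelivered.
second_touched = compact_increment(table, [
    {"id": "c", "event_hour": "11", "payload": 3},
    {"id": "d", "event_hour": "10", "payload": 4},
    {"id": "b", "event_hour": "10", "payload": 2},
])

print(sorted(first_touched), sorted(second_touched))
```

Each run does work proportional to the new batch instead of reprocessing the whole dataset, and late records cause a rewrite only in the older partition they belong to.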