Pathway’s Apache Iceberg Connectors for Real-Time Data Pipelines

Shlok Srivastava
·Sergey Kulik
·February 11, 2025

Apache Iceberg + Pathway banner

We’re excited to share that Pathway has officially released Apache Iceberg connectors, enabling you to seamlessly integrate and manage your data in Iceberg with the full power of Pathway’s Live Data Framework. These connectors make it simpler than ever to harness Iceberg’s flexible table format while taking advantage of Pathway’s real-time computation engine.

You'll find a comprehensive summary below. If you'd prefer to start implementing right away, head over to the documentation here; to schedule a call with Pathway about its Apache Iceberg capabilities, book a slot here.

Benefits of Using Pathway’s Iceberg Connectors

Below are some compelling reasons for using the Pathway Iceberg connectors:

  1. Near Real-Time Insights: The streaming capabilities let you capture incremental changes instantly, making it perfect for event-driven use cases or live dashboards.
  2. Simplicity & Efficiency: Setting up your Iceberg read/write logic takes only a handful of lines in Python, reducing the complexity of your pipeline.
  3. Scalability: Pathway’s distributed engine and Iceberg’s optimized file format let you handle very large datasets without sacrificing performance.
  4. Unified Data Workflows: Integrate multiple data sources—like CSV, NATS, Kafka, Postgres, or MongoDB—and funnel all transformations into a single Iceberg table for easy query and analytics.
  5. Ideal for AI and Machine Learning: Low-latency updates keep your models current. Continuous training or inference becomes straightforward when your pipeline is always up to date.

By introducing built-in connectors for Iceberg, Pathway extends its commitment to scalable data ingestion and real-time analytics, letting you tap into the power of Iceberg in just a few lines of code.

Here’s a quick look at what these new connectors bring:

  • Static and Streaming Modes: Read data once (static) or continuously monitor for changes (streaming).
  • Two-Way Integration: Not only can you read from Iceberg, but you can also write changes back into Iceberg storage.
  • Schema-Driven: The connectors rely on Python-based schema definitions, making it easy to pick and choose which fields you need.

The sections below go deeper into the key details of the new connectors.

Key Implementation Details

Reading from Iceberg

The Pathway Iceberg connector enables efficient data retrieval from Iceberg tables. Here’s how it works:

  • Static or Streaming Mode: The connectors support two modes. Static Mode reads your existing data exactly once—ideal for batch analyses. Streaming Mode continuously monitors updates to your Iceberg tables, capturing row additions and deletions in real time.
  • Schema Definition: You can specify each column’s type (e.g., int, bool, str, float) in a Python class. You can also mark certain columns as primary keys using pw.column_definition(primary_key=True) to uniquely identify rows, especially important in streaming mode. Head over to the developer documentation for more details. 
  • Integration with Pathway: The connector automatically reflects changes in your computational graph. Any new data that appears in the underlying Iceberg table is immediately visible in Pathway’s pipeline.

Once you’ve set up your reading mechanism, you’re ready to incorporate this data into your transformations or AI pipelines. 

Next, let’s examine how to write your processed data back to Iceberg.

Writing to Iceberg

After reading and processing your data, you may want to publish your results into an Iceberg table. Below are some highlights of the write connector; head to the API documentation for the full list of configuration knobs you can tune.

  • Automatic Table Creation: If you haven’t created the table or namespace yet, Pathway can do it for you, inferring the schema from the table you’re writing.
  • Change Tracking: Pathway uses two special columns—time (representing the minibatch of computation) and diff (indicating whether a row is being added or removed)—to accurately mirror all real-time changes in your dataset.
  • Commit Frequency: Configure min_commit_frequency to manage how often Pathway writes changes to storage, balancing real-time responsiveness with I/O overhead.

The connectors support all Pathway primitive types (e.g., bool, int, float, str), which map directly to corresponding Iceberg types. Duration, Naive DateTime, and UTC DateTime are also supported, ensuring broad coverage for typical use cases.

What's Next?

In the coming weeks, we’ll publish an in-depth tutorial that dives deeper into advanced configurations, best practices, and performance tuning for large-scale use cases. 

Ready to leverage the power of Apache Iceberg with Pathway? Get started with a Free Pathway Scale or Enterprise License by following the instructions at this link. Or reach out to us at contact@pathway.com to discuss how real-time, high-volume data processing can transform your analytics stack. We can’t wait to see what you’ll build!

Transform Your Data Pipelines with Confidence

Are you eager to accelerate your data processing workflows with Apache Iceberg connectors? Pathway is trusted by industry leaders such as NATO and Intel, and is natively available on both AWS and Azure Marketplaces. Pathway’s experts are here to help. Get a 15-minute, no-obligation consultation focused on your unique data challenges.

Shlok Srivastava

Lead Engineer at Pine Labs

Sergey Kulik

Lead Software Research Engineer at Pathway
