Apache Falcon uses only three types of entities to describe all data management policies and pipelines. These entities are:
- Cluster: represents the "interfaces" to a Hadoop cluster
- Feed: defines a "dataset" (a file, Hive table, or stream)
- Process: consumes feeds, invokes processing logic, and produces feeds
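A cluster entity captures those "interfaces" as endpoints that Falcon uses to read, write, and schedule work. Below is a minimal sketch of a cluster definition; the entity name, colo, and endpoint hosts/ports are placeholders for illustration, not values from this post.

```xml
<!-- Sketch of a Falcon cluster entity; names and endpoints are hypothetical -->
<cluster name="primaryCluster" colo="east-coast" description=""
         xmlns="uri:falcon:cluster:0.1">
  <interfaces>
    <!-- readonly: used for reading data, e.g. as a replication source -->
    <interface type="readonly" endpoint="hftp://namenode:50070" version="2.2.0"/>
    <!-- write: HDFS endpoint Falcon writes data to -->
    <interface type="write" endpoint="hdfs://namenode:8020" version="2.2.0"/>
    <!-- execute: resource manager where jobs are submitted -->
    <interface type="execute" endpoint="resourcemanager:8050" version="2.2.0"/>
    <!-- workflow: Oozie server that runs the generated workflows -->
    <interface type="workflow" endpoint="http://oozie:11000/oozie/" version="4.0.0"/>
    <!-- messaging: JMS broker used for Falcon notifications -->
    <interface type="messaging" endpoint="tcp://broker:61616?daemon=true" version="5.1.6"/>
  </interfaces>
  <locations>
    <location name="staging" path="/apps/falcon/staging"/>
    <location name="working" path="/apps/falcon/working"/>
  </locations>
</cluster>
```

Feeds and processes reference a cluster by its `name`, which is what makes the entities composable across pipelines.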
Using only these three entity types, we can manage replication, archival, and retention of data, and also handle job/process failures and late data arrival.
These Falcon entities:
- Are simple to define using XML.
- Are modular: clusters, feeds, and processes are defined separately, then linked together, and are easy to re-use across multiple pipelines.
- Can be configured for replication, late data arrival, archival, and retention.
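Retention and late-arrival handling, for example, live on the feed entity itself. The sketch below shows where those knobs go; the feed name, paths, and validity window are hypothetical, not taken from this post.

```xml
<!-- Sketch of a Falcon feed entity; names, paths, and dates are hypothetical -->
<feed name="rawInputFeed" description="hourly raw input data"
      xmlns="uri:falcon:feed:0.1">
  <frequency>hours(1)</frequency>

  <!-- instances arriving up to 6 hours late are still picked up -->
  <late-arrival cut-off="hours(6)"/>

  <clusters>
    <cluster name="primaryCluster" type="source">
      <validity start="2016-01-01T00:00Z" end="2099-12-31T00:00Z"/>
      <!-- retention: instances older than 90 days are deleted -->
      <retention limit="days(90)" action="delete"/>
    </cluster>
  </clusters>

  <locations>
    <!-- dated path pattern; Falcon resolves ${YEAR}/${MONTH}/${DAY} per instance -->
    <location type="data" path="/data/raw/${YEAR}/${MONTH}/${DAY}/${HOUR}"/>
  </locations>

  <ACL owner="falcon" group="users" permission="0755"/>
  <schema location="/none" provider="none"/>
</feed>
```

Adding a second `<cluster type="target">` block to the same feed is what turns on replication between clusters, without touching any process definition.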
Using Falcon, a complicated data pipeline like the one below can be simplified to a few Falcon entities, which the Falcon engine itself then converts into multiple Oozie workflows.
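The process entity is the piece that wires feeds together: it names its input and output feeds and points at the workflow logic to run. A minimal sketch, with hypothetical entity and path names, might look like this:

```xml
<!-- Sketch of a Falcon process entity; names and paths are hypothetical -->
<process name="cleanseProcess" xmlns="uri:falcon:process:0.1">
  <clusters>
    <cluster name="primaryCluster">
      <validity start="2016-01-01T00:00Z" end="2099-12-31T00:00Z"/>
    </cluster>
  </clusters>

  <parallel>1</parallel>
  <order>FIFO</order>
  <frequency>hours(1)</frequency>

  <!-- consume one hourly instance of the input feed -->
  <inputs>
    <input name="input" feed="rawInputFeed" start="now(0,0)" end="now(0,0)"/>
  </inputs>
  <!-- produce the matching instance of the output feed -->
  <outputs>
    <output name="output" feed="cleansedFeed" instance="now(0,0)"/>
  </outputs>

  <!-- the actual processing logic: an Oozie workflow on HDFS -->
  <workflow engine="oozie" path="/apps/falcon/workflows/cleanse"/>

  <!-- failure handling: retry up to 3 times, 15 minutes apart -->
  <retry policy="periodic" delay="minutes(15)" attempts="3"/>
</process>
```

From a handful of definitions like these, Falcon generates and schedules the underlying Oozie coordinators and workflows on its own.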
In my next post I will explain how we can define a Falcon process and the prerequisites for doing so.