While working on a project recently, I got a chance to work with Apache Falcon, a data governance engine that defines, schedules, and monitors data management policies. Falcon lets Hadoop administrators centrally define their data pipelines, and then uses those definitions to auto-generate workflows in Apache Oozie.
I ran into some issues setting up my first Falcon job, so in my next few posts I will write more about Falcon.
What does Apache Falcon do?
Apache Falcon simplifies complicated data management workflows into generalized entity definitions. Falcon makes it far easier to:
- Define data pipelines
- Monitor data pipelines in coordination with Ambari, and
- Trace pipelines for dependencies, tagging, audits and lineage.
Without a centralized tool like Falcon, data pipelines are typically stitched together with ad hoc scripts and hand-written workflows, and this results in some common mistakes. Processes might use the wrong copies of data sets. Data sets and processes may be duplicated, and it becomes increasingly difficult to track down where a particular data set originated.
Falcon addresses these data governance challenges with high-level and reusable “entities” that can be defined once and re-used many times. Data management policies are defined in Falcon entities and manifested as Oozie workflows.
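To make the "entity" idea concrete, here is a minimal sketch of a Falcon feed entity. The names, dates, and HDFS paths (rawEmailFeed, primaryCluster, the /user/falcon/... location) are illustrative placeholders, not values from any real deployment; the element structure follows Falcon's feed XML schema. Note how the data lifecycle policy (hourly frequency, 90-day retention) is declared once on the entity, and Falcon turns it into the underlying Oozie workflows:

```xml
<!-- A feed entity: describes a data set, where it lives, and its retention policy.
     All names and paths below are hypothetical examples. -->
<feed name="rawEmailFeed" description="Raw email data landing hourly" xmlns="uri:falcon:feed:0.1">
  <frequency>hours(1)</frequency>

  <clusters>
    <cluster name="primaryCluster" type="source">
      <validity start="2015-01-01T00:00Z" end="2016-12-31T00:00Z"/>
      <!-- Lifecycle policy: keep 90 days of data, then delete -->
      <retention limit="days(90)" action="delete"/>
    </cluster>
  </clusters>

  <locations>
    <!-- Falcon resolves the date variables per feed instance -->
    <location type="data" path="/user/falcon/demo/input/${YEAR}-${MONTH}-${DAY}-${HOUR}"/>
  </locations>

  <ACL owner="falcon" group="users" permission="0755"/>
  <schema location="/none" provider="none"/>
</feed>
```

Once submitted, this single definition can be referenced as the input or output of any number of process entities, which is what "defined once and re-used many times" means in practice.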
Falcon Features
1. Centrally Manage Data Lifecycle
- Centralized definition and management of pipelines for data ingestion, processing, and export across different clusters.
2. Business Continuity & Disaster Recovery
- Out-of-the-box policies for data replication, retention, and archival.
- Configuration and management of late data handling and exception handling.
- Handles process failures and retries.
- End-to-end monitoring of data pipelines.
3. Address audit & compliance requirements
- Visualize data pipeline lineage.
- Track data pipeline audit logs.
- Tag data with business metadata.
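Several of the features above (retries, late data handling) are expressed declaratively on a process entity rather than coded by hand. The sketch below assumes the hypothetical feed and cluster names from earlier, plus a made-up Pig script path; the retry and late-process elements are part of Falcon's process XML schema:

```xml
<!-- A process entity: consumes one feed, produces another, with retry and
     late-arrival policies declared inline. Names and paths are hypothetical. -->
<process name="cleanseEmailProcess" xmlns="uri:falcon:process:0.1">
  <clusters>
    <cluster name="primaryCluster">
      <validity start="2015-01-01T00:00Z" end="2016-12-31T00:00Z"/>
    </cluster>
  </clusters>

  <parallel>1</parallel>
  <order>FIFO</order>
  <frequency>hours(1)</frequency>

  <inputs>
    <input name="input" feed="rawEmailFeed" start="now(0,0)" end="now(0,0)"/>
  </inputs>
  <outputs>
    <output name="output" feed="cleansedEmailFeed" instance="now(0,0)"/>
  </outputs>

  <!-- The actual transformation logic lives in an engine-specific workflow -->
  <workflow name="emailCleanse" engine="pig" path="/user/falcon/demo/apps/pig/cleanse.pig"/>

  <!-- Falcon retries failed instances every 15 minutes, up to 3 times -->
  <retry policy="periodic" delay="minutes(15)" attempts="3"/>

  <!-- If input data arrives late, reprocess using the given workflow -->
  <late-process policy="exp-backoff" delay="hours(1)">
    <late-input input="input" workflow-path="/user/falcon/demo/apps/pig/cleanse.pig"/>
  </late-process>
</process>
```

Falcon compiles these declarations into Oozie coordinators behind the scenes, so the retry and late-data behavior is enforced without any custom orchestration code.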
Reference - http://www.slideshare.net/Hadoop_Summit/driving-enterprise-data-governance-for-big-data-systems-through-apache-falcon