Falcon FEED needs certain features
This is based on experience with Falcon 0.8 and on operational feasibility.
- A feed should be readable from or writable to various data sources or clusters, and retention should be optional (see the sketch after this list)
  - A source that manages its own retention policy makes Falcon retention pointless; if retention is accidentally scheduled, it kicks off an Oozie coordinator that is always in the KILLED state
    - S3
    - Flume
    - HBase
    - Hive
    - Kafka
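A minimal sketch of what such a feed could look like, loosely modelled on the current feed XML. The `source` element with `type="kafka"` and the absence of a `<retention>` element are proposed syntax, not part of the existing schema.

```xml
<feed name="clicks-kafka" xmlns="uri:falcon:feed:0.1">
  <frequency>minutes(30)</frequency>
  <clusters>
    <cluster name="primary" type="source">
      <validity start="2015-01-01T00:00Z" end="2099-12-31T00:00Z"/>
      <!-- proposed: no <retention> element here, because Kafka manages its
           own retention and a Falcon retention coordinator would only sit
           in the KILLED state -->
    </cluster>
  </clusters>
  <!-- proposed: non-HDFS storage for the feed data -->
  <source type="kafka" uri="kafka://broker-1:9092/clicks"/>
</feed>
```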
- A feed should be replicable to any external source/target
  - Needs support for adding a custom jar/path
- Feed replication is not marked complete until we have data from all colos
  - In this case we need the feed to be available for consumption from whichever colos have already replicated
  - Add a promotion property as part of feed replication (e.g. property=promoted) and specify a directory (see the sketch below)
  - This will help when a feed is replicated, by tracking the promoted directory for the instance
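One possible shape for the promotion idea, using the feed-level `<properties>` block that the schema already allows; the property names `promoted` and `promotion.dir` are made up for illustration.

```xml
<!-- hypothetical property names: mark replicated instances as promoted and
     record where the promoted copy lives, so consumers can read from
     whichever colos have finished replicating -->
<properties>
  <property name="promoted" value="true"/>
  <property name="promotion.dir" value="/data/clicks/promoted"/>
</properties>
```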
- A feed should carry a pipeline entity/tag, so the source of the feed is known (see the sketch below)
  - Working this out through dependencies is too cumbersome; there would be lots of discarded or junk process dependencies
  - Feeds which are replicated, promoted, or archived can be tied to pipelines, which helps with maintenance and backlog reprocessing
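A sketch of the pipeline idea, assuming the comma-separated `<pipelines>` element that the process entity already carries were also allowed on feeds:

```xml
<!-- hypothetical on a feed: mirrors the process entity's <pipelines> element,
     so replicated/promoted/archived feeds can be grouped per pipeline during
     maintenance or backlog reprocessing -->
<pipelines>clickstreamPipeline,sessionizePipeline</pipelines>
```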
- Feeds should have properties for replication and archival, such as:
  - replication, archival (optional retention, since it is just a move out of the current source), promotion (it is a move, so no retention is required)
  - Replication and archival should also support fetch/push operations and a data type (see the sketch below)
    - We should be able to replicate data from HDFS to a DB and vice versa
    - This can be helpful for bulk migration
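A hypothetical fetch/push sketch at the feed-cluster level; the `<import>`/`<export>` elements and the `datasource`/`table` attributes are illustrative names, not a description of an existing schema.

```xml
<cluster name="primary" type="source">
  <validity start="2015-01-01T00:00Z" end="2099-12-31T00:00Z"/>
  <!-- hypothetical fetch: pull a table from a database into this feed on HDFS -->
  <import>
    <source datasource="orders-db" table="orders"/>
  </import>
  <!-- hypothetical push: write the feed back out to a database,
       e.g. for bulk migration -->
  <export>
    <target datasource="warehouse-db" table="orders_archive"/>
  </export>
</cluster>
```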
- Feeds which have an end time defined should be retired/deleted from the config store
  - Falcon startup.properties can carry a retention setting for retired feeds (see the sketch below)
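For instance, a single startup.properties knob could control how long retired feed definitions are kept before being purged from the config store; the property name below is hypothetical.

```properties
# hypothetical property name: purge retired feed definitions after 90 days
*.falcon.retired.feed.retention=days(90)
```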
- The feed entity should auto-update its pipelines when a new process is added
  - This will help track whether multiple pipelines are using the same feed
- A feed should validate that it has write permissions (already there) and that the job can be scheduled in the target queue
  - This helps ensure the feed is submitted correctly before it is scheduled
- A feed re-run for archival or retention should validate that the instance has not crossed the retention period
  - Otherwise the jobs keep failing because the source may no longer have the data
- Feeds should maintain stats for the activities they perform, rather than just logging
  - Amount of data transferred and transfer speed for replication/archival
  - Amount of data deleted and time taken for retention
- Feed retention should not be based on the feed frequency, but should run as a 30-minutely, hourly, or daily job (see the sketch below)
  - Only the last instance should take care of all retention
  - If any instance fails, the next successful one should take care of cleaning up the previous instances
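A sketch of decoupling the retention schedule from the feed frequency; the `frequency` attribute on `<retention>` is proposed syntax, not one of the element's current attributes.

```xml
<cluster name="primary" type="source">
  <validity start="2015-01-01T00:00Z" end="2099-12-31T00:00Z"/>
  <!-- proposed: run retention hourly regardless of the feed's own frequency;
       the latest instance cleans up everything, and a successful run also
       covers instances that previously failed -->
  <retention limit="days(7)" action="delete" frequency="hours(1)"/>
</cluster>
```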