CloudxLab

Thursday, April 28, 2016

Falcon FEED needs certain features


This is based on experience from Falcon 0.8 and operational feasibility

  • A feed should be able to read from or write to various data sources or clusters, and retention should be optional (see the sketch after this list)
    • Some of these stores manage their own retention policy, so a Falcon retention job is redundant; if it is accidentally scheduled it kicks off an Oozie coordinator that sits forever in KILLED state
      • S3 (own retention)
      • Flume
      • HBase
      • Hive
      • Kafka
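
A rough sketch of what this could look like, assuming the feed schema were relaxed so that the retention element may be omitted when the backing store (Kafka in this example) manages its own cleanup. The storageType property is purely illustrative and does not exist in Falcon today:

    <feed name="clicks-kafka" xmlns="uri:falcon:feed:0.1">
        <frequency>hours(1)</frequency>
        <clusters>
            <cluster name="primary-cluster" type="source">
                <validity start="2016-04-01T00:00Z" end="2099-01-01T00:00Z"/>
                <!-- proposal: no retention element, Kafka handles its own cleanup -->
            </cluster>
        </clusters>
        <properties>
            <!-- illustrative only: tell Falcon which store backs this feed -->
            <property name="storageType" value="kafka"/>
        </properties>
        <!-- ACL, schema and locations omitted for brevity -->
    </feed>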


  • A feed should be replicable to any external store (a sketch follows this item)
    • this needs support for adding a custom jar/path
      • S3
      • NFS
      • SAN
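
Something along these lines could express an external replication target. In Falcon today a replication target has to be a registered cluster entity, so treating an S3/NFS/SAN endpoint as a target is itself part of the proposal, and the replication.extra.jars property is hypothetical, shown only to mark where a custom jar/path could be declared:

    <clusters>
        <cluster name="primary-cluster" type="source">
            <validity start="2016-04-01T00:00Z" end="2099-01-01T00:00Z"/>
            <retention limit="days(7)" action="delete"/>
        </cluster>
        <cluster name="s3-archive" type="target">
            <validity start="2016-04-01T00:00Z" end="2099-01-01T00:00Z"/>
            <retention limit="months(12)" action="delete"/>
            <locations>
                <location type="data" path="s3n://prod-archive/clicks/${YEAR}/${MONTH}/${DAY}"/>
            </locations>
        </cluster>
    </clusters>
    <properties>
        <!-- hypothetical: extra jars the replication job needs for this target -->
        <property name="replication.extra.jars" value="/apps/falcon/lib/custom-fs.jar"/>
    </properties>
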
  • Feed replication is not complete until data has arrived from all colos
    • in the meantime we need the feed to be consumable for whichever colos have already replicated
    • add a promotion property as part of feed replication (e.g. property=promoted) along with a promotion directory
    • this lets consumers of a partially replicated feed track the promoted directory for each instance (see the sketch below)
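
A minimal sketch of the promotion idea; the promotion and promotionDir property names are made up for illustration:

    <properties>
        <!-- mark an instance as promoted as soon as any colo has replicated it -->
        <property name="promotion" value="promoted"/>
        <property name="promotionDir" value="/data/clicks/promoted/${YEAR}-${MONTH}-${DAY}"/>
    </properties>
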
  • A feed should carry a pipelines element, so the source of the feed is known
    • modelling this through dependencies is too cumbersome; it leaves lots of discarded or junk process dependencies
    • feeds that are replicated, promoted or archived can then be tied to pipelines, which helps during maintenance or backlog reprocessing (sketch below)
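
The process entity already carries a pipelines tag; the proposal is to allow the same element on a feed, roughly like this (not valid feed syntax today):

    <feed name="clicks" xmlns="uri:falcon:feed:0.1">
        <tags>owner=etl,source=weblogs</tags>
        <!-- proposed: same element the process entity already supports -->
        <pipelines>clicks-ingest,clicks-replication</pipelines>
        <!-- rest of the feed definition unchanged -->
    </feed>
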
  • Feeds should have properties describing replication and archival, for example
    • replication, archival (retention optional, since the data is just moved out of the current source), promotion (a move, so no retention required)
    • replication and archival should also declare the transfer mode (fetch/push) and the data type
    • we should be able to replicate data from HDFS to a DB and vice versa
    • this is helpful for bulk migrations (see the sketch after this list)
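
One way this could be expressed as feed properties; every name and value below is illustrative, none of them exist in Falcon today:

    <properties>
        <property name="transfer.type" value="archival"/>      <!-- replication | archival | promotion -->
        <property name="transfer.mode" value="push"/>          <!-- fetch | push -->
        <property name="transfer.dataType" value="hdfs"/>      <!-- hdfs | db -->
        <property name="transfer.target" value="jdbc:mysql://reports-db:3306/warehouse"/>
    </properties>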

  • Feeds that have an end time defined should be retired/deleted from the config store once that end time passes
    • Falcon's startup.properties could carry a retention period for retired feeds (sketch below)
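
For example, a single knob in startup.properties could say how long retired definitions are kept around; the property name here is made up for illustration:

    # hypothetical knob, not an existing Falcon property
    *.falcon.retired.entity.retention=days(30)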


  • The feed entity's pipelines element should update automatically when a new process is added
    • This helps track whether multiple pipelines are using the same feed


  • A feed should validate that it has write permissions (already there) and that it can schedule a job in the target queue
    • This catches misconfiguration at submit time, before scheduling
  • A feed re-run for archival or retention should validate that the instance has not already crossed the retention period
    • Otherwise the jobs keep failing because the source may no longer have the data
  • Feeds should maintain stats for the activities they perform, rather than just logging
    • amount of data transferred and transfer speed for replication/archival
    • amount of data deleted and time taken for retention
  • Feed retention should not run at the feed frequency, but as a 30-minute, hourly or daily job (see the sketch below this list)
    • only the last instance should take care of all retention
    • if an instance fails, the next successful one should take care of the previous instances' cleanup
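
One possible shape for this is a frequency attribute on the existing retention element, decoupling the cleanup cadence from the feed frequency; the attribute is hypothetical:

    <!-- today retention runs at the feed frequency -->
    <!-- proposed: run cleanup once a day regardless of how frequent the feed is -->
    <retention limit="days(7)" action="delete" frequency="days(1)"/>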