CloudxLab

Thursday, April 28, 2016

Falcon FEED needs certain features


This is based on experience from Falcon 0.8 and operational feasibility

  • A feed should be able to read from or write to various data sources or clusters, and retention should be optional (see the sketch after this list)
    • Some of these stores manage their own retention policy, so a Falcon retention job is redundant; if it is accidentally scheduled it kicks off an Oozie coordinator that sits forever in KILLED state
      • S3 (own retention)
      • Flume
      • HBase
      • Hive
      • Kafka
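
A rough sketch of what this could look like, assuming the feed schema were relaxed so that the retention element may be omitted when the backing store (Kafka in this example) manages its own cleanup. The storageType property is purely illustrative and does not exist in Falcon today:

    <feed name="clicks-kafka" xmlns="uri:falcon:feed:0.1">
        <frequency>hours(1)</frequency>
        <clusters>
            <cluster name="primary-cluster" type="source">
                <validity start="2016-04-01T00:00Z" end="2099-01-01T00:00Z"/>
                <!-- proposal: no retention element, Kafka handles its own cleanup -->
            </cluster>
        </clusters>
        <properties>
            <!-- illustrative only: tell Falcon which store backs this feed -->
            <property name="storageType" value="kafka"/>
        </properties>
        <!-- ACL, schema and locations omitted for brevity -->
    </feed>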


  • A feed should be replicable to any external store (a sketch follows this item)
    • this needs support for adding a custom jar/path
      • S3
      • NFS
      • SAN
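
Something along these lines could express an external replication target. In Falcon today a replication target has to be a registered cluster entity, so treating an S3/NFS/SAN endpoint as a target is itself part of the proposal, and the replication.extra.jars property is hypothetical, shown only to mark where a custom jar/path could be declared:

    <clusters>
        <cluster name="primary-cluster" type="source">
            <validity start="2016-04-01T00:00Z" end="2099-01-01T00:00Z"/>
            <retention limit="days(7)" action="delete"/>
        </cluster>
        <cluster name="s3-archive" type="target">
            <validity start="2016-04-01T00:00Z" end="2099-01-01T00:00Z"/>
            <retention limit="months(12)" action="delete"/>
            <locations>
                <location type="data" path="s3n://prod-archive/clicks/${YEAR}/${MONTH}/${DAY}"/>
            </locations>
        </cluster>
    </clusters>
    <properties>
        <!-- hypothetical: extra jars the replication job needs for this target -->
        <property name="replication.extra.jars" value="/apps/falcon/lib/custom-fs.jar"/>
    </properties>
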
  • Feed replication is not complete until data has arrived from all colos
    • in the meantime we need the feed to be consumable for whichever colos have already replicated
    • add a promotion property as part of feed replication (e.g. property=promoted) along with a promotion directory
    • this lets consumers of a partially replicated feed track the promoted directory for each instance (see the sketch below)
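
A minimal sketch of the promotion idea; the promotion and promotionDir property names are made up for illustration:

    <properties>
        <!-- mark an instance as promoted as soon as any colo has replicated it -->
        <property name="promotion" value="promoted"/>
        <property name="promotionDir" value="/data/clicks/promoted/${YEAR}-${MONTH}-${DAY}"/>
    </properties>
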
  • A feed should carry a pipelines element, so the source of the feed is known
    • modelling this through dependencies is too cumbersome; it leaves lots of discarded or junk process dependencies
    • feeds that are replicated, promoted or archived can then be tied to pipelines, which helps during maintenance or backlog reprocessing (sketch below)
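
The process entity already carries a pipelines tag; the proposal is to allow the same element on a feed, roughly like this (not valid feed syntax today):

    <feed name="clicks" xmlns="uri:falcon:feed:0.1">
        <tags>owner=etl,source=weblogs</tags>
        <!-- proposed: same element the process entity already supports -->
        <pipelines>clicks-ingest,clicks-replication</pipelines>
        <!-- rest of the feed definition unchanged -->
    </feed>
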
  • Feeds should have properties describing replication and archival, for example
    • replication, archival (retention optional, since the data is just moved out of the current source), promotion (a move, so no retention required)
    • replication and archival should also declare the transfer mode (fetch/push) and the data type
    • we should be able to replicate data from HDFS to a DB and vice versa
    • this is helpful for bulk migrations (see the sketch after this list)
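
One way this could be expressed as feed properties; every name and value below is illustrative, none of them exist in Falcon today:

    <properties>
        <property name="transfer.type" value="archival"/>      <!-- replication | archival | promotion -->
        <property name="transfer.mode" value="push"/>          <!-- fetch | push -->
        <property name="transfer.dataType" value="hdfs"/>      <!-- hdfs | db -->
        <property name="transfer.target" value="jdbc:mysql://reports-db:3306/warehouse"/>
    </properties>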

  • Feeds that have an end time defined should be retired/deleted from the config store once that end time passes
    • Falcon's startup.properties could carry a retention period for retired feeds (sketch below)
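
For example, a single knob in startup.properties could say how long retired definitions are kept around; the property name here is made up for illustration:

    # hypothetical knob, not an existing Falcon property
    *.falcon.retired.entity.retention=days(30)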


  • The feed entity's pipelines element should update automatically when a new process is added
    • This helps track whether multiple pipelines are using the same feed


  • A feed should validate that it has write permissions (already there) and that it can schedule a job in the target queue
    • This catches misconfiguration at submit time, before scheduling
  • A feed re-run for archival or retention should validate that the instance has not already crossed the retention period
    • Otherwise the jobs keep failing because the source may no longer have the data
  • Feeds should maintain stats for the activities they perform, rather than just logging
    • amount of data transferred and transfer speed for replication/archival
    • amount of data deleted and time taken for retention
  • Feed retention should not run at the feed frequency, but as a 30-minute, hourly or daily job (see the sketch below this list)
    • only the last instance should take care of all retention
    • if an instance fails, the next successful one should take care of the previous instances' cleanup
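
One possible shape for this is a frequency attribute on the existing retention element, decoupling the cleanup cadence from the feed frequency; the attribute is hypothetical:

    <!-- today retention runs at the feed frequency -->
    <!-- proposed: run cleanup once a day regardless of how frequent the feed is -->
    <retention limit="days(7)" action="delete" frequency="days(1)"/>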