Apache Flume Interview Questions & Answers
Apache Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data. It has a simple and flexible architecture based on streaming data flows. Here you will find the most commonly asked Apache Flume interview questions, with answers. These questions will also acquaint you with the kind of questions you may be asked during your interview.
1. What is Apache Flume?
Answer:- Flume is a simple, robust, flexible, and extensible tool for ingesting data from various data producers (such as web servers) into Hadoop.
Apache Flume is a reliable, distributed system for collecting, aggregating, and moving massive quantities of log data. It is a highly available service with tunable recovery mechanisms.
The main idea behind Flume's design is to capture streaming data from various web servers and deliver it to HDFS. Its architecture is built around streaming data flows, and it is fault-tolerant, with mechanisms for failover and recovery.
2. What are the features of Flume?
Answer:- Here are the main features of Flume:
1. Flume carries data between sources and sinks. Data gathering can be either scheduled or event-driven. Flume does not include a query engine; instead, interceptors can modify or drop events in flight before they reach the intended sink.
2. Apache Flume is horizontally scalable.
3. Apache Flume supports a large set of source, channel, and sink types.
4. With Flume, we can collect data from different web servers in real-time as well as in batch mode.
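5. Flume supports contextual routing: an event can be routed to different channels based on its headers.
6. If data arrives faster than it can be written to the destination, the channel buffers the excess, so Flume maintains a steady flow of data between read and write operations.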
3. What is a Flume Agent?
Answer:- An agent is an independent daemon process (JVM) in Flume. It receives data (events) from clients or other agents and forwards it to its next destination (a sink or another agent). A Flume deployment may have more than one agent.
A Flume agent contains three main components: the source, the channel, and the sink.
1. Source: It accepts data from the incoming data stream and stores it in the channel. Example: Exec source, Thrift source, Avro source, Twitter 1% firehose source, etc.
2. Channel: Sources often receive data faster than sinks can write it out, so a buffer is needed to absorb the difference in rates. The channel is that buffer: it acts as temporary storage between the source and the sink, holding each event until a sink has delivered it, which helps prevent data loss. Example: Memory channel, File channel, JDBC channel, etc.
3. Sink: The sink takes events from the channel and delivers them to the destination, for example by writing them permanently to HDFS. Example: HDFS sink.
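The way the three components hand data to each other can be sketched in a few lines of Python. This is only a toy model, not Flume's API: the names `ToyChannel`, `run_source`, and `run_sink` are hypothetical, and a plain list stands in for HDFS.

```python
from collections import deque


class ToyChannel:
    """Bounded in-memory buffer standing in for a Flume channel (hypothetical)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._events = deque()

    def put(self, event):
        # A full channel rejects new events instead of silently dropping them.
        if len(self._events) >= self.capacity:
            raise OverflowError("channel is full")
        self._events.append(event)

    def take(self):
        # Return the oldest buffered event, or None when the channel is empty.
        return self._events.popleft() if self._events else None


def run_source(lines, channel):
    """Source: turn each incoming line into an event and store it in the channel."""
    for line in lines:
        channel.put(line.encode("utf-8"))


def run_sink(channel, destination):
    """Sink: drain the channel and write each event to the destination."""
    event = channel.take()
    while event is not None:
        destination.append(event.decode("utf-8"))
        event = channel.take()


channel = ToyChannel(capacity=100)
store = []  # stands in for HDFS in this sketch
run_source(["GET /index.html", "GET /about.html"], channel)
run_sink(channel, store)
print(store)
```

The source and the sink never talk to each other directly; the channel is the only thing they share, which is what lets them run at different speeds.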
4. What is a Flume Event?
Answer:- A Flume event is the basic unit of data transferred from source to destination. It consists of a byte-array payload (the body) and an optional set of string headers.
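As a rough illustration, an event can be pictured as a byte payload plus a map of string headers. The `Event` class below is a hypothetical Python stand-in for that shape, not Flume's own Java interface, and the header name `host` is just an example.

```python
from dataclasses import dataclass, field


@dataclass
class Event:
    """Hypothetical model of an event: a byte body plus optional string headers."""
    body: bytes
    headers: dict = field(default_factory=dict)


# One log line from a web server, tagged with the host it came from.
event = Event(body=b"127.0.0.1 GET /index.html 200", headers={"host": "web01"})
print(event.headers["host"], len(event.body))
```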
5. How many types of data flow are there in Apache Flume?
Answer:- Flume is a tool for moving log data into HDFS, and it supports complex data flows. There are three types of data flow in Apache Flume:
1. Multi-hop Flow: Within Flume, there can be multiple agents, and an event may travel through more than one agent before reaching its final destination. This is known as multi-hop flow.
2. Fan-out Flow: The data flow from one source to multiple channels is known as fan-out flow.
3. Fan-in Flow: The data flow in which data is transferred from many sources to one channel is known as fan-in flow.
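A small Python sketch can make fan-in and multi-hop concrete. It is a simplified model under stated assumptions: each channel is a plain `deque`, the helper `forward` is hypothetical, and the network hop between agents is reduced to a function call.

```python
from collections import deque


def forward(channel, next_hop):
    """Sink step: drain this agent's channel into the next hop."""
    while channel:
        next_hop.append(channel.popleft())


# Two first-tier agents, one per web server, each with its own channel.
web1_channel = deque([b"web1: GET /"])
web2_channel = deque([b"web2: GET /cart"])

collector_channel = deque()  # channel of a second-tier collector agent
final_store = []             # stands in for HDFS

# Fan-in: both first-tier agents deliver into the same collector channel.
forward(web1_channel, collector_channel)
forward(web2_channel, collector_channel)

# Multi-hop: the collector is the second agent every event passes through.
forward(collector_channel, final_store)
print(final_store)
```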
6. What are the disadvantages of Apache Flume?
Answer:- The core disadvantages of Apache Flume are:
1. Flume topologies can become complex, which makes configuration and maintenance difficult.
2. It does not guarantee exactly-once delivery; duplicate messages may arrive at any time.
3. It does not support data replication.
4. Throughput depends on the backing store of the channel, so scalability and reliability can fall short of expectations.
7. What are the applications of Apache Flume?
Answer:- Some of the core applications of Flume are:
1. Apache Flume is widely used by e-commerce companies to analyze customer behavior across different regions.
2. Flume's main design goal is to ingest huge volumes of log data generated by application servers into HDFS at high speed.
3. A main application of Flume is online analytics.
4. It serves as a backbone for real-time event processing.
8. What are Channel Selectors?
Answer:- The channel selector is the Flume component that determines which channel a particular event should go into when a source is connected to a group of channels. The target can be one channel or several.
The selection happens internally, within the agent. Multiple channels can be handled in two ways, so there are two types of channel selector: default (replicating), which copies each event to every channel, and multiplexing, which routes each event to a subset of channels based on a header value.
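The difference between the two selector types can be sketched as follows. This is an illustrative Python model, not Flume code: the default selector replicates each event to every channel, while the multiplexing one picks channels from a header value. The function names, the `type` header, and the channel names `c1` and `c2` are assumptions made for the example.

```python
def replicating_selector(event, channels):
    """Default behaviour: every channel receives the event."""
    return list(channels)


def multiplexing_selector(event, mapping, default, header="type"):
    """Pick channels from the value of one header; fall back to the default list."""
    return mapping.get(event["headers"].get(header), default)


channels = {"c1": [], "c2": []}
event = {"headers": {"type": "error"}, "body": b"disk full"}

# Route error events to c1 and access events to c2; anything else goes to c2.
targets = multiplexing_selector(
    event, {"error": ["c1"], "access": ["c2"]}, default=["c2"]
)
for name in targets:
    channels[name].append(event)

print(targets, replicating_selector(event, channels))
```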
9. What is the use of Sink Processors?
Answer:- Sink processors are used to invoke a particular sink from a selected group of sinks. They let you create failover paths for your sinks or load-balance events across multiple sinks from a channel.
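The two behaviours, failover and load balancing, can be sketched in Python. This is a minimal model under assumptions, not Flume's implementation: `ToySink`, `failover`, and `load_balance` are hypothetical names, a sink failure is simulated with a flag, and load balancing is shown as simple round-robin.

```python
from itertools import cycle


class ToySink:
    """Hypothetical sink that either accepts events or fails on every write."""

    def __init__(self, name, healthy=True):
        self.name = name
        self.healthy = healthy
        self.received = []

    def write(self, event):
        if not self.healthy:
            raise ConnectionError(f"{self.name} is unavailable")
        self.received.append(event)


def failover(event, sinks_by_priority):
    """Try sinks in priority order; return the name of the first one that accepts."""
    for sink in sinks_by_priority:
        try:
            sink.write(event)
            return sink.name
        except ConnectionError:
            continue  # fall through to the next sink in the group
    raise RuntimeError("all sinks in the group failed")


def load_balance(events, sinks):
    """Spread events across the sinks in round-robin order."""
    for event, sink in zip(events, cycle(sinks)):
        sink.write(event)


primary, backup = ToySink("k1", healthy=False), ToySink("k2")
used = failover(b"e1", [primary, backup])

a, b = ToySink("a"), ToySink("b")
load_balance([b"1", b"2", b"3"], [a, b])
print(used, a.received, b.received)
```

In the failover case the event ends up on the backup sink because the primary refuses it; in the load-balancing case the three events alternate between the two sinks.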
10. What is the use of Interceptors?
Answer:- Interceptors are used to modify or drop events in flight. An interceptor also decides what sort of data should pass through to the channel.
An interceptor can modify or drop events based on any criteria chosen by its developer. Flume supports chaining of interceptors: they are specified as a whitespace-separated list in the source configuration and are invoked in that order.
The following interceptors are available, among others:
1. Timestamp Interceptor
2. Host Interceptor
3. Static Interceptor
4. Regex Filtering Interceptor
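How a chain of interceptors modifies or drops an event can be sketched in Python. The functions below only imitate the idea behind three of the interceptors listed above (the host interceptor is left out for brevity); they are hypothetical, not Flume's classes, and the header names and the `datacenter` value are assumptions made for the example.

```python
import re
import time


def timestamp_interceptor(event):
    """Stamp the event with the current time in milliseconds."""
    event["headers"]["timestamp"] = str(int(time.time() * 1000))
    return event


def static_interceptor(key, value):
    """Build an interceptor that adds a fixed header unless it is already set."""
    def intercept(event):
        event["headers"].setdefault(key, value)
        return event
    return intercept


def regex_filter(pattern, exclude=False):
    """Build an interceptor that keeps (or, with exclude=True, drops) matching events."""
    compiled = re.compile(pattern)

    def intercept(event):
        matched = compiled.search(event["body"].decode("utf-8")) is not None
        return event if matched != exclude else None
    return intercept


def apply_chain(event, interceptors):
    """Run interceptors in order; a None result means the event was dropped."""
    for intercept in interceptors:
        event = intercept(event)
        if event is None:
            return None
    return event


chain = [timestamp_interceptor, static_interceptor("datacenter", "NYC"), regex_filter(r"ERROR")]
kept = apply_chain({"headers": {}, "body": b"ERROR disk full"}, chain)
dropped = apply_chain({"headers": {}, "body": b"INFO started"}, chain)
print(kept["headers"]["datacenter"], dropped)
```

The first event passes the whole chain and gains two headers; the second is dropped by the filter and never reaches the channel.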