What do you know about Apache Spark?

Apache Spark is an open-source cluster-computing framework. Originally developed at the University of California, Berkeley’s AMPLab, the Spark codebase was later donated to the Apache Software Foundation, which has maintained it since.

Apache Spark’s story

To process and analyze huge amounts of data efficiently, Apache Hadoop relied on a processing engine called MapReduce, which soon became the only way of processing and analyzing data in the Hadoop ecosystem. Being the only one of its kind, its limitations motivated the community to develop new engines for processing big data. This led to the creation of Spark at Berkeley's AMPLab. To take advantage of the already established open-source big data community, the AMPLab developers donated the codebase to the Apache Software Foundation, and Apache Spark was born.

What does Apache Spark comprise?

Before discussing what Spark can do, let's take a quick look at what Spark has inside. Besides Spark Core, Apache Spark has four libraries that address four areas. They are:

  1. Spark SQL
  2. Spark Streaming
  3. Spark Machine Learning library (also called Spark MLlib)
  4. GraphX

What can Apache Spark do?

Now that we know what Spark has, let us see what Spark can do.

  1. Unlike Hadoop's batch-only MapReduce, Spark can process streaming data in mini-batches and apply transformations to each batch (via Spark Streaming).
  2. With Spark's distributed machine learning framework, Spark MLlib, machine learning tasks can run on a Spark cluster built from commodity hardware.
  3. Similarly, graph processing can be done on the same distributed framework, using GraphX.
  4. Structured and semi-structured data can be processed using the Spark SQL component.
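The mini-batch idea in point 1 can be illustrated without a cluster. Below is a toy pure-Python sketch — not Spark itself, and the batch size and the squaring transformation are arbitrary choices for illustration — of grouping a stream of records into fixed-size mini-batches and applying a transformation to each batch:

```python
from itertools import islice

def mini_batches(records, batch_size):
    """Yield fixed-size mini-batches from an iterable of records."""
    it = iter(records)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch

def transform(batch):
    """Example transformation: square every value in the batch."""
    return [x * x for x in batch]

stream = range(7)  # stand-in for an incoming data stream
results = [transform(b) for b in mini_batches(stream, batch_size=3)]
print(results)  # [[0, 1, 4], [9, 16, 25], [36]]
```

In real Spark Streaming, the framework handles the batching, scheduling, and fault tolerance across the cluster; the application only supplies the transformations.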

References to learn Apache Spark

If you are interested in learning Apache Spark, here are a few useful links that will help you get started. Feel free to get your hands dirty.

  1. Apache Spark Official by Apache Software Foundation
  2. Apache Spark Tutorial by TutorialKart

How Apache Kafka is helping the industry

Apache Kafka is an open-source stream processing platform developed by the Apache Software Foundation and written in Scala and Java. The project aims to provide a unified, high-throughput, low-latency platform for handling real-time data feeds.

Apache Kafka has become popular in industry with the rise of stream processing. Many organisations are looking to include Kafka in their new projects, while others are trying to incorporate it into their existing applications.

Currently Kafka is being used for:

  • Application Monitoring
  • Data Warehousing
  • Asynchronous Applications
  • Recommendation Engines in Online Retail
  • Dynamic Pricing Applications
  • IoT (Internet of Things)

What is the industry saying about Kafka?

  1. Kafka helps applications work in a loosely coupled manner.
  2. Kafka handles stream processing and thus becomes the underlying data infrastructure.
  3. Kafka enables real-time processing of high volumes of data.
  4. Kafka improves application scalability.
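The loose coupling in point 1 comes from Kafka's publish/subscribe model: producers and consumers never call each other directly, they only share a topic. The following is a toy in-memory sketch of that idea — not Kafka's actual API, and the topic name and event fields are invented for illustration:

```python
from collections import defaultdict

class ToyBroker:
    """In-memory stand-in for a message broker: producers publish to a
    topic, consumers poll it later, and neither knows about the other."""

    def __init__(self):
        self.topics = defaultdict(list)

    def publish(self, topic, message):
        self.topics[topic].append(message)

    def poll(self, topic):
        # Hand over everything buffered so far and clear the topic.
        messages, self.topics[topic] = self.topics[topic], []
        return messages

broker = ToyBroker()

# Producer side: an order service emits events without knowing who listens.
broker.publish("orders", {"order_id": 1, "amount": 9.99})
broker.publish("orders", {"order_id": 2, "amount": 4.50})

# Consumer side: a billing service reads the events on its own schedule.
events = broker.poll("orders")
print(len(events))  # 2
```

Unlike this toy, real Kafka persists topics durably, partitions them for parallelism, and lets many independent consumer groups read the same stream — which is what lets it sit underneath an organisation's data infrastructure.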

Other References

If you are interested in learning Apache Kafka, you may refer to the following links.