Apache Spark is an open-source cluster-computing framework. Originally developed at the University of California, Berkeley’s AMPLab, the Spark codebase was later donated to the Apache Software Foundation, which has maintained it since.
Apache Spark’s story
To process and analyze huge amounts of data efficiently, the Apache Hadoop project adopted an engine called MapReduce, which soon became the only way of processing and analyzing data in the Hadoop ecosystem. Being the only one of its kind, it motivated communities to develop new engines for processing big data, and this led to the creation of Spark at Berkeley's AMPLab. The developers at the AMPLab decided to take advantage of the already established open-source big data community, so they donated the codebase to the Apache Software Foundation, and Apache Spark was born.
What does Apache Spark comprise?
Before discussing what Spark can do, let's take a quick look at what Spark has inside it. Beyond the Spark Core, Apache Spark ships with four libraries that address four areas:

- Spark SQL, for working with structured and semi-structured data
- Spark Streaming, for processing live data streams in mini-batches
- MLlib, a distributed machine learning framework
- GraphX, a distributed graph processing framework
What can Apache Spark do?
Now that we know what Spark has, let's see what it can do.
- Unlike Hadoop's batch-oriented MapReduce, Spark can process live data in mini-batches and apply transformations to each batch.
- With the help of Spark's distributed machine learning framework, MLlib, machine learning tasks can run on a Spark cluster built from commodity hardware.
- Similarly, graph processing can be carried out with the distributed graph framework, GraphX.
- Structured and semi-structured data can be processed using the SQL component of Apache Spark, Spark SQL.
References to learn Apache Spark
If you are interested in learning Apache Spark, here are a few useful links to help you get started. Feel free to get your hands dirty.