What is Apache Spark?

Apache Spark is a framework for processing huge amounts of data in a distributed manner. It is designed to be fast and to let you write distributed operations as if you were working on a single node.

It is a project of the Apache Software Foundation, so it is open source, and many groups contribute to the development and maintenance of Apache Spark.

In addition, it integrates with other frameworks in the Hadoop ecosystem such as Kafka, HBase or Hive.

What is Spark for in the Big Data world?

The 21st century is undoubtedly the technological century, dominated by the large amount of data that is generated every second.

Many of the applications that we carry on our mobile phones are constantly collecting information about our actions and locations.

All of this is stored in large data warehouses containing millions and millions of entries.

How can we extract useful information from this huge sea of data?

This problem is solved with Big Data technology, of which Spark is a part.

This incredible amount of information cannot be stored on a single computer, so it is distributed across different computers that we call nodes. The set of nodes is called a cluster.

When we need to extract information, we have to work through the data held on the different nodes of the system. This is where Spark comes in, simplifying a task that at first seems so complicated.

Apache Spark allows us to orchestrate, distribute and monitor all nodes to obtain the necessary information for our project.

Spark loads data from a distributed file system such as HDFS and creates objects called RDDs (Resilient Distributed Datasets). On these objects we can apply operations such as filters, joins or groupings to obtain the information that interests us from the data.
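As a rough sketch of that idea (the HDFS path, the log format and the column positions are hypothetical, since none of them appear in the text), an RDD pipeline in the Python API could look like this:

```python
from pyspark import SparkContext

sc = SparkContext(appName="rdd-sketch")

# Hypothetical HDFS path; the files are split into partitions across the nodes.
logs = sc.textFile("hdfs:///data/access_logs/*.log")

# Chain a few operations: keep error lines and count occurrences per URL.
errors = logs.filter(lambda line: " 500 " in line)
hits_per_url = (errors
                .map(lambda line: (line.split(" ")[6], 1))  # assumes a common-log-style layout
                .reduceByKey(lambda a, b: a + b))

# Bring a small sample back to the driver.
print(hits_per_url.take(10))
```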

When we chain these operations, a directed acyclic graph (DAG) is formed in which the nodes are the operations to be executed. However, Spark is characterized by what is known as lazy evaluation: the operations are not actually run until a certain event, an action that requires a result, triggers them.
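A minimal illustration of lazy evaluation, using a toy in-memory dataset rather than real cluster data: the transformations only extend the DAG, and the final action is what actually starts the computation.

```python
from pyspark import SparkContext

sc = SparkContext(appName="lazy-eval-sketch")

nums = sc.parallelize(range(1_000_000))

# Transformations: these return immediately and only describe the work (the DAG).
squares = nums.map(lambda x: x * x)
evens = squares.filter(lambda x: x % 2 == 0)

# The action below is the "event" that triggers the distributed computation.
print(evens.count())
```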

A very important point about Spark is that it is a very flexible framework, since it provides APIs that allow it to be used from several programming languages: Scala, Java, Python, SQL and R.

Apache Spark Core Components

Spark Core

Spark Core is the heart of Spark; the rest of the components we will see below are built on top of it.

The Core is in charge of orchestrating and distributing tasks from the master node to the rest of the nodes. It runs on cluster managers such as YARN or Mesos, which are in charge of managing the cluster's resources.
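As a hedged sketch (the resource setting is a made-up value, and "yarn" assumes the Hadoop configuration is visible to the driver), the cluster manager is chosen when the Spark session is created:

```python
from pyspark.sql import SparkSession

# "yarn" assumes HADOOP_CONF_DIR / YARN_CONF_DIR point at the cluster;
# use "local[*]" instead to run everything in a single process for testing.
spark = (SparkSession.builder
         .appName("core-sketch")
         .master("yarn")
         .config("spark.executor.instances", "4")  # example resource request
         .getOrCreate())

print(spark.sparkContext.master)
spark.stop()
```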

Spark Streaming

Spark Streaming is an extension that enables real-time data processing. We live in a data-driven society where new information is generated every second, for example, on social networks.

Many companies are interested in analyzing in real time what is happening around the world. Spark Streaming connects to data sources and processes this information to obtain relevant data on what is happening at all times.
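A small sketch of the classic streaming word count, assuming text lines arrive on a TCP socket at localhost:9999 (for example via `nc -lk 9999`); the source and port are illustrative, not taken from the text above.

```python
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

# Two local threads: one receives the stream, one processes it.
sc = SparkContext("local[2]", "streaming-sketch")
ssc = StreamingContext(sc, batchDuration=5)  # 5-second micro-batches

lines = ssc.socketTextStream("localhost", 9999)

counts = (lines.flatMap(lambda line: line.split(" "))
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))
counts.pprint()  # print each batch's word counts as they are computed

ssc.start()
ssc.awaitTermination()
```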

Spark SQL

Spark SQL is a module that lets you work with tabular data through an object called a DataFrame. DataFrames can be created from multiple sources of information such as Avro, Parquet or JDBC.

Spark SQL also allows you to write queries in SQL, making it easier to obtain data, since SQL reads very close to natural language.
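A brief sketch of that workflow, with a hypothetical Parquet file and column names (any of the sources mentioned above could be substituted):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sql-sketch").getOrCreate()

# Hypothetical Parquet dataset; JSON, Avro, ORC or JDBC sources work similarly.
sales = spark.read.parquet("hdfs:///data/sales.parquet")

# Register the DataFrame as a temporary view so it can be queried in plain SQL.
sales.createOrReplaceTempView("sales")

top_products = spark.sql("""
    SELECT product, SUM(amount) AS total
    FROM sales
    GROUP BY product
    ORDER BY total DESC
    LIMIT 10
""")
top_products.show()
```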

Spark MLlib

MLlib (Machine Learning Library) is a Spark module that allows you to train machine learning models with data that is stored in a distributed file system such as HDFS.

This allows models to be trained on datasets with millions and millions of entries, which can increase the precision of the model.

Some of the machine learning tasks that can be performed with MLlib are:

  • Regression
  • Classification
  • Principal component analysis
  • Algorithms based on decision trees
  • Data engineering
  • Data cleaning
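As a rough sketch of how one of these tasks (classification) might look with MLlib's DataFrame-based API, using a hypothetical dataset, feature columns and label column:

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("mllib-sketch").getOrCreate()

# Hypothetical dataset with numeric feature columns and a binary "label" column.
df = spark.read.parquet("hdfs:///data/customers.parquet")

# MLlib expects the features packed into a single vector column.
assembler = VectorAssembler(inputCols=["age", "monthly_spend", "visits"],
                            outputCol="features")
train, test = assembler.transform(df).randomSplit([0.8, 0.2], seed=42)

model = LogisticRegression(featuresCol="features", labelCol="label").fit(train)
model.transform(test).select("label", "prediction").show(5)
```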