Data lakes, data warehouses, and databases

"Data lakes", "data warehouses", and "databases" are all common terms in data management. But what exactly do they mean, and are they the same or different from each other? Let's explore that in this article. We will start with definitions and then discuss the key differences.

A database is a generic data storage and processing platform, often designed around a specific data model (e.g., relational, hierarchical, etc.), that can be used for different workloads such as OLTP or OLAP. A database is an organized collection of data. Early databases were flat and limited to simple rows and columns. A few related data management terms are also worth defining up front:

Enterprise Data Warehouse (EDW): This is a data warehouse that serves the entire enterprise.

Data Mart: A data mart is used by individual departments or groups and is intentionally limited in scope: it focuses on what users need right now rather than on all the data that already exists.

Data Swamp: When your data lake gets messy and is unmanageable, it becomes a data swamp.

Data warehouse

A data warehouse collects data from various sources, whether internal or external, and optimizes that data for retrieval for business purposes. The data warehouse is designed to gather business insights and allows businesses to integrate, manage, and analyze their data at many levels.

A data warehouse system is a specific data processing system, often built using a database system, for OLAP/analytics workloads. A specific data warehouse uses a relational or multi-dimensional analytics schema, often in the form of a so-called star schema with fact and dimension tables, to answer a predefined set of questions and reports very efficiently. It is thus fully schematized and only stores the minimal amount of data needed to answer that set of questions. Data storage costs are normally high. The data is primarily structured, often coming from relational databases, but it can be unstructured too.
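
To make the star-schema idea concrete, here is a minimal, illustrative sketch using Python's built-in sqlite3 module. The table and column names (fact_sales, dim_product, etc.) are hypothetical; a real warehouse would use a dedicated analytics engine, but the fact/dimension join pattern is the same.

```python
import sqlite3

# In-memory database standing in for a warehouse; table names are illustrative.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Dimension table: descriptive attributes.
cur.execute("CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT, category TEXT)")
# Fact table: measures plus foreign keys to the dimensions.
cur.execute("CREATE TABLE fact_sales (sale_id INTEGER PRIMARY KEY, product_id INTEGER, quantity INTEGER, amount REAL)")

cur.executemany("INSERT INTO dim_product VALUES (?, ?, ?)",
                [(1, "Laptop", "Electronics"), (2, "Desk", "Furniture")])
cur.executemany("INSERT INTO fact_sales VALUES (?, ?, ?, ?)",
                [(1, 1, 2, 2400.0), (2, 1, 1, 1200.0), (3, 2, 4, 800.0)])

# A typical, predefined warehouse question: revenue by product category.
cur.execute("""
    SELECT d.category, SUM(f.amount) AS revenue
    FROM fact_sales f
    JOIN dim_product d ON f.product_id = d.product_id
    GROUP BY d.category
""")
print(cur.fetchall())  # e.g. [('Electronics', 3600.0), ('Furniture', 800.0)]
```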

Data Lake

“A data lake is a place to store your structured and unstructured data, as well as a method for organizing large volumes of highly diverse data from diverse sources.”

A data lake is a generic data repository designed to store any type of data in a format as close to the original as possible, at large scale and low cost, with the ability to schematize the data on demand. It is designed for discovering the questions you want to ask (for example, by applying machine learning algorithms to the data to discover interesting correlations) and for providing the original data so that new questions can be answered later.

Conceptually, a data lake augments the data warehouse by offering the original data at lower cost, so that new data warehouses can be created for new questions as they are discovered. Its analytics engine(s) provide a highly scalable batch processing structure, a custom code execution framework, machine learning infrastructure, integration with streaming pipelines (to form the cold path of a lambda architecture), and the ability to run interactive queries.

Data lake platforms often include data warehousing capabilities by providing a catalog system that stores optimized logical storage abstractions such as tables, and at the very least they are integrated with a data warehouse system.

Data lakes, data warehouses, and databases are all designed to store data. So why are there different ways to store data, and what's significant about them? In this section, we'll cover a few of the key differences between a data lake, a data warehouse, and a database.

Key differences

Data
  • A database stores day-to-day transactions, i.e., current data.
  • A data warehouse only stores data that has been modeled and structured; in other words, a data warehouse stores historical data,
  • while a data lake is no respecter of data: it stores it all, whether structured, semi-structured, or unstructured.
Processing
  • A database contains day-to-day transactions, so processing the data involves operations such as inserting, updating, or deleting records.
  • Before we can load data into a data warehouse, we first need to give it some shape and structure (i.e., we need to pre-process it). That's called schema-on-write.
  • With a data lake, you just load in the raw data as-is, and when you're ready to use the data, that's when you give it shape and structure. That's called schema-on-read (see the sketch after this comparison).
Agility
  • A database contains a well-defined schema of tables for storing data; these tables are related to each other and can be queried jointly to retrieve information.
  • A data warehouse is a highly-structured repository, by definition. It’s not technically hard to change the structure, but it can be very time-consuming given all the business processes that are tied to it.
  • A data lake, on the other hand, lacks the structure of a data warehouse, which gives developers and data scientists the ability to easily configure and reconfigure their models, queries, and apps on the fly.
Security
  • Data in a database can be secured by applying different levels of authentication and access control.
  • Data warehouse technologies have been around for decades, while big data technologies (the underpinnings of a data lake) are relatively new. Thus, the ability to secure data in a data warehouse is much more mature than securing data in a data lake.
Users
  • Since a database stores day-to-day transactions, people working at the clerical and administrative level need to access data from the database.
  • Data warehouses are meant for data analysis, so anyone doing that work, such as higher management, data analysts, and business analysts, is a typical data warehouse user.
  • A data lake, at this point in its maturity, is best suited for data scientists.
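
The schema-on-write versus schema-on-read distinction mentioned above is easiest to see in code. The sketch below assumes a local PySpark installation and a hypothetical raw JSON file of events sitting in a data-lake folder; the schema is only imposed when the data is read.

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, TimestampType

spark = SparkSession.builder.appName("schema-on-read-demo").getOrCreate()

# Schema-on-read: the raw JSON landed in the lake as-is; we only describe
# its structure at query time. The path and field names are illustrative.
event_schema = StructType([
    StructField("user_id", StringType()),
    StructField("action", StringType()),
    StructField("ts", TimestampType()),
])

events = spark.read.schema(event_schema).json("/data/lake/raw/events.json")

# Shape the data on demand, e.g. count actions per user.
events.groupBy("user_id", "action").count().show()
```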

 

Apache Hive – getting started

Hive is essentially a data warehouse infrastructure tool used for processing structured data in Hadoop. Primarily used to summarize and manage big data, Hive makes querying and analysis easy. The Hive data warehouse software facilitates querying and managing large datasets residing in distributed storage.

 

Hive is a powerful tool for ETL. It is, however, relatively slow compared with traditional databases, and it doesn't offer all the same SQL features, or even the same database features, as traditional databases. But it does support SQL, it does function as a database, and it gives access to Hadoop technology to more people (even those who are not programmers). It offers a way to transform unstructured and semi-structured data into usable schema-based data.

History of Hive

A little history about Apache Hive will help you understand why it came into existence. When Facebook started gathering data and ingesting it into Hadoop, the data was coming in at the rate of tens of GBs per day back in 2006. Then, in 2007, it grew to 1 TB/day and within a few years increased to around 15 TB/day. Initially, Python scripts were written to ingest the data into Oracle databases, but with the increasing data rate and the diversity in the sources and types of incoming data, this was becoming difficult. The Oracle instances were getting filled pretty fast, and it was time to develop a new kind of system that handled large amounts of data. Facebook first built Hive so that most people who had SQL skills could use the new system with minimal changes, compared to what was required with other RDBMSs.

The main features of Hive are:

  • Hive provides data summarization, querying, and analysis in a much easier manner.
  • Hive supports external tables, which make it possible to process data without actually storing it in HDFS.
  • Apache Hive fits the low-level interface requirement of Hadoop perfectly.
  • It also supports partitioning of data at the table level to improve performance (both external tables and partitioning are illustrated in the sketch after this list).
  • Hive has a rule-based optimizer for optimizing logical plans.
  • It is scalable, familiar, and extensible.
  • Using HiveQL doesn't require any knowledge of a programming language; knowledge of basic SQL queries is enough.
  • We can easily process structured data in Hadoop using Hive.
  • Querying in Hive is very simple, as it is similar to SQL.
  • We can also run ad-hoc queries for data analysis using Hive.
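
As a hedged illustration of two of the features above (external tables and table-level partitioning), the sketch below uses PySpark with Hive support to issue HiveQL; the database name, table name, and HDFS path are hypothetical.

```python
from pyspark.sql import SparkSession

# Requires a Spark build with Hive support; paths and names are illustrative.
spark = (SparkSession.builder
         .appName("hive-demo")
         .enableHiveSupport()
         .getOrCreate())

# External table: the data stays where it already is (e.g. in HDFS); dropping
# the table does not delete the underlying files.
spark.sql("""
    CREATE EXTERNAL TABLE IF NOT EXISTS web_logs (
        ip      STRING,
        url     STRING,
        status  INT
    )
    PARTITIONED BY (log_date STRING)          -- table-level partitioning
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\\t'
    LOCATION 'hdfs:///data/raw/web_logs'
""")

# Queries filtering on the partition column only scan the matching partitions.
spark.sql("""
    SELECT status, COUNT(*) AS hits
    FROM web_logs
    WHERE log_date = '2024-01-01'
    GROUP BY status
""").show()
```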

The importance of Hive in Hadoop

Apache Hive lets you work with Hadoop in a very efficient manner. It is a complete data warehouse infrastructure that is built on top of the Hadoop framework. Hive is uniquely placed to query data and perform powerful analysis and data summarization while working with large volumes of data. An integral part of Hive is HiveQL, an SQL-like query language that is used extensively to query what is stored in Hive databases.

Hive has the distinct advantage of deploying high-speed data reads and writes within data warehouses while managing large data sets distributed across multiple locations, all thanks to its SQL-like features. It provides a structure to the data that is already stored in the database. Users are able to connect to Hive using a command-line tool and a JDBC driver.
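
Besides the Hive CLI and the JDBC driver, Python programs can also talk to HiveServer2 over its Thrift interface, for example with the PyHive package. A minimal sketch, assuming HiveServer2 runs on localhost:10000 and the hypothetical web_logs table from the previous sketch exists:

```python
from pyhive import hive  # third-party package: pip install pyhive

# Connection details are assumptions about the local setup.
conn = hive.Connection(host="localhost", port=10000, username="hadoop", database="default")
cursor = conn.cursor()

cursor.execute("SELECT url, COUNT(*) AS hits FROM web_logs GROUP BY url LIMIT 10")
for url, hits in cursor.fetchall():
    print(url, hits)

cursor.close()
conn.close()
```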

Apache Hive also has some disadvantages, which are important to know from a learner's point of view.

Some of them are:

  • It does not offer real-time queries.
  • It does not offer row-level updates.
  • It does not provide acceptable latency for interactive data browsing.
  • Sub-queries are not supported in Hive.
  • Latency for Apache Hive queries is generally very high.
  • It is not designed for online transaction processing (OLTP).
  • It supports overwriting or appending data, but not updates and deletes.

However, it is good to have complete knowledge of Apache Hive, for which you can refer to the Hive Tutorial by Edureka.

You can also refer to "What is Hive?"

You should have an understanding of Hadoop as well for a good grip on Hive. Edureka provides a good playlist of Hadoop tutorial videos as well as a Hadoop tutorial blog series.

Following are the books for learning Apache Hive:

  • Programming Hive by Edward Capriolo, Dean Wampler, and Jason Rutherglen – O’Reilly
  • Apache Hive Essentials by Dayong Du – Packt Publishing

Scala vs Python for Spark programming

In big data analysis, Apache Spark is one of the most popular frameworks. Apache Spark is written in Scala, and it has APIs for Scala, Python, Java, and R, so we can work with any of these; however, the most popularly used languages are Scala and Python.

Java does not support a read-evaluate-print loop (REPL), and R is not a general-purpose language. Hence, the data science community is divided into two camps: one which prefers Scala and another which prefers Python.

Scala vs Python- Which one to choose for Spark Programming?

Choosing a programming language for Apache Spark is a subjective matter.  It is useful for a data scientist to learn Scala, Python, R, and Java for programming in Spark and choose the preferred language based on factors like performance.

Scala and Python are both easy to program in and help data experts become productive fast.

Performance

The Scala programming language can be up to 10 times faster than Python for data analysis and processing because it runs directly on the JVM. Performance is only average when Python code is used to make calls to Spark libraries. The PyPy interpreter for Python has a built-in JIT (just-in-time) compiler and is very fast, but it does not support various Python C extensions. In such situations, the CPython interpreter with C-extension libraries outperforms the PyPy interpreter.
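
The "Python making calls to Spark libraries" pattern looks like this in practice: the Python code below only drives the JVM-based Spark engine, so the heavy lifting happens in Scala/Java even though the program is written in Python. The file path and column names are hypothetical.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("pyspark-demo").getOrCreate()

# The DataFrame API calls below are thin Python wrappers; the actual query
# planning and execution run inside the JVM.
df = spark.read.csv("/data/sales.csv", header=True, inferSchema=True)

(df.groupBy("region")
   .agg(F.sum("amount").alias("total_amount"))
   .orderBy(F.desc("total_amount"))
   .show())
```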

Scala is faster than Python when there are fewer cores. As the number of cores increases, the performance advantage of Scala starts to diminish.

When working with a lot of cores, performance is not a major driving factor in choosing the programming language for Apache Spark. However, when there is significant processing logic, performance is a major factor, and Scala definitely offers better performance than Python for programming against Spark.

Moreover, Scala is native to Hadoop since it is based on the JVM. Hadoop is important because Spark was built on top of Hadoop's filesystem, HDFS. Python interacts with Hadoop services rather poorly, so developers have to use third-party libraries (like hadoopy). Scala interacts with Hadoop via Hadoop's native Java API. That's why it's very easy to write native Hadoop applications in Scala.

Learning Pattern

Learning Scala enriches a programmer's knowledge of various novel abstractions in the type system, novel functional programming features, and immutable data. Programmers might find the syntax of Scala for programming in Spark quite hard at times. A few Scala libraries make it possible to define arbitrary symbolic operators that are difficult for inexperienced programmers to understand, so while using Scala, developers need to focus on the readability of the code. Scala is a sophisticated language with flexible syntax compared to Java or Python.

Python is comparatively easier to learn for Java programmers because of its syntax and standard libraries. However, Python is not an ideal choice for highly concurrent and scalable systems.

Concurrency

The complex and diverse infrastructure of big data systems demands a programming language that has the power to integrate across several databases and services.

Scala has multiple standard libraries and cores which allow quick integration of databases in big data ecosystems, and it allows writing code with multiple concurrency primitives.

Python, by contrast, does support heavyweight process forking (for example via uWSGI), but it does not support true multithreading. When using Python for Spark, irrespective of the number of threads the process has, only one CPU is active at a time for a Python process. One workaround is to run one process per CPU core, but the downside is that whenever new code is deployed, more processes need to be restarted, and this also incurs additional memory overhead.
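
The single-active-CPU behaviour described above comes from CPython's global interpreter lock (GIL). The small experiment below, independent of Spark, shows the usual workaround: threads do not speed up a CPU-bound task, but separate processes (roughly one per core) do.

```python
import time
from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor

def cpu_bound(n: int) -> int:
    # A purely CPU-bound task: sum of squares.
    return sum(i * i for i in range(n))

def timed(executor_cls, label):
    with executor_cls(max_workers=4) as pool:
        start = time.perf_counter()
        list(pool.map(cpu_bound, [5_000_000] * 4))
        print(f"{label}: {time.perf_counter() - start:.2f}s")

if __name__ == "__main__":
    timed(ThreadPoolExecutor, "threads  ")   # limited by the GIL
    timed(ProcessPoolExecutor, "processes")  # one Python process per task
```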

Due to these concurrency features, Scala allows better memory management and data processing.

Usability

Both are expressive, and we can achieve a high level of functionality with them. Python is more user-friendly and concise. Scala is more powerful in terms of frameworks, libraries, implicits, macros, etc.

Scala works well within the MapReduce framework because of its functional nature. Many Scala data frameworks follow similar abstract data types that are consistent with Scala's collections API.

But for NLP, Python is preferred, as Scala doesn't have many tools for machine learning or NLP. Moreover, for using GraphX, GraphFrames, and MLlib, Python is preferred. Python's visualization libraries complement PySpark, since neither Spark nor Scala has anything comparable.

Type Safety

When programming with Apache Spark, developers need to continuously refactor the code based on changing requirements. Scala is a statically typed language, though it appears like a dynamically typed one because of its elegant type-inference mechanism. Being statically typed, Scala lets the compiler catch errors at compile time.

Python is an effective choice for smaller ad-hoc experiments against Spark, but it does not scale as efficiently as a statically typed language like Scala for large software engineering efforts in production.

Advanced Features

The Scala programming language has several advanced features such as existential types, macros, and implicits. Its arcane syntax might make it difficult to experiment with these advanced features, which can be incomprehensible to many developers. However, the advantage of Scala comes from using these powerful features in important frameworks and libraries.

In Python, Spark MLlib, the machine learning library, has fewer ML algorithms, but they are well suited for big data processing. Scala, on the other hand, lacks good visualization tools and local data transformations.

Scala is the best pick for the Spark Streaming feature because Python's Spark Streaming support is not as advanced and mature as Scala's.

The conclusion we can draw from the above points is that the language choice for programming in Apache Spark depends on the features that best fit the project needs, as each one has its own pros and cons.

So first analyze the requirements of the project, compare each language's features against those requirements, and then decide: Scala or Python.

Introduction to NoSQL

Relational databases have failed to solve some complex modern problems, such as:

  1. The continuously changing nature of data: structured, semi-structured, unstructured, and polymorphic data.
  2. Applications now serve millions of users in different geo-locations and time zones, and have to be up and running all the time while maintaining data integrity.
  3. Applications nowadays are becoming more distributed with many moving towards cloud computing.

NoSQL plays a vital role in enterprise applications that need to access and analyze massive data sets held on multiple remote servers in cloud infrastructure, especially when the data set is not structured.

Hence, NoSQL databases are designed to overcome the performance, scalability, data-modeling, and distribution limitations of relational databases. You can also watch the following keynote to learn more about NoSQL.

– Martin’s keynote “Introduction to NoSQL”

Characteristics of NoSQL

  • It’s more than rows in tables—NoSQL systems store and retrieve data from many formats: key-value stores, graph databases, column-family (Bigtable) stores, document stores, and even rows in tables.
  • It’s free of joins—NoSQL systems allow you to extract your data using simple interfaces without joins.
  • It’s schema-free—NoSQL systems allow you to drag-and-drop your data into a folder and then query it without creating an entity-relational model.
  • It works on many processors—NoSQL systems allow you to store your database on multiple processors and maintain high-speed performance.
  • It uses shared-nothing commodity computers—Most (but not all) NoSQL systems leverage low-cost commodity processors that have separate RAM and disk.
  • It supports linear scalability—When you add more processors, you get a consistent increase in performance.
  • It’s innovative—NoSQL offers options to a single way of storing, retrieving, and manipulating data. NoSQL supporters (also known as NoSQLers) have an inclusive attitude about NoSQL and recognize SQL solutions as viable options. To the NoSQL community, NoSQL means “Not only SQL”.
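
As a quick, illustrative way to see the "more than rows in tables" point above, the plain Python structures below sketch how the same user record might look in a key-value store, a document store, and a column-family store. This is a conceptual sketch, not the API of any particular database.

```python
# Key-value store: an opaque value looked up by a key.
kv_store = {
    "user:42": '{"name": "Asha", "city": "Pune"}',
}

# Document store: a nested, schema-free document per record.
document = {
    "_id": 42,
    "name": "Asha",
    "city": "Pune",
    "orders": [{"sku": "A-100", "qty": 2}],  # documents can nest freely
}

# Column-family (Bigtable-style) store: row key -> column families -> columns.
column_family_row = {
    "row_key": "42",
    "profile": {"name": "Asha", "city": "Pune"},
    "activity": {"last_login": "2024-01-01"},
}

print(kv_store["user:42"], document["orders"][0]["sku"], column_family_row["profile"]["city"])
```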

ACID and BASE

When we talk about NoSQL databases, the data consistency models they use can sometimes be strikingly different from those used by relational databases. Two consistency models are commonly discussed, known by the acronyms ACID and BASE.

We will discuss the key differences between the ACID and BASE data consistency models.

The ACID Consistency Model

The key ACID guarantee is that it provides a safe environment in which to operate on your data. The ACID acronym stands for:

  • Atomic
    • All operations in a transaction succeed or every operation is rolled back.
  • Consistent
    • On the completion of a transaction, the database is structurally sound.
  • Isolated
    • Transactions do not contend with one another. Contentious access to data is moderated by the database so that transactions appear to run sequentially.
  • Durable
    • The results of applying a transaction are permanent, even in the presence of failures.

Relational databases (and some NoSQL databases) use an ACID consistency model to ensure data is safe and consistently stored; a small example follows.
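
The atomicity guarantee is easiest to see with a small transaction. The sketch below uses Python's built-in sqlite3 (a relational, ACID-compliant store) purely as an illustration: either both account updates commit, or the failed transaction rolls back and neither change is visible.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)])
conn.commit()

def transfer(amount: int) -> None:
    # Atomic: both updates succeed together, or the whole transaction rolls back.
    try:
        with conn:  # the connection acts as a transaction context manager
            conn.execute("UPDATE accounts SET balance = balance - ? WHERE name = 'alice'", (amount,))
            if amount > 100:
                raise ValueError("insufficient funds")  # simulate a failure mid-transaction
            conn.execute("UPDATE accounts SET balance = balance + ? WHERE name = 'bob'", (amount,))
    except ValueError:
        pass  # the rollback happened automatically

transfer(200)  # fails: no partial debit is left behind
transfer(30)   # succeeds
print(conn.execute("SELECT * FROM accounts ORDER BY name").fetchall())
# [('alice', 70), ('bob', 80)]
```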

The BASE Consistency Model

In the NoSQL world, ACID transactions are less fashionable, as some databases have loosened the requirements for immediate consistency, data freshness, and accuracy in order to gain other benefits such as scale and resilience.

Here’s how the BASE acronym breaks down:

  • Basic Availability
    • The database appears to work most of the time.
  • Soft-state
    • Stores don’t have to be write-consistent, nor do different replicas have to be mutually consistent all the time.
  • Eventual consistency
    • Stores exhibit consistency at some later point (e.g., lazily at read time).

BASE properties are much looser than ACID guarantees, but there isn’t a direct one-for-one mapping between the two consistency models.

A BASE datastore values availability (since that is important for scale), but it doesn't offer guaranteed consistency of replicated data at write time. Overall, the BASE consistency model provides a less strict assurance than ACID: the data will become consistent at some point in the future, for example at read time, rather than being guaranteed consistent at all times.

 

MongoDB – a NoSQL store

MongoDB is an open source, cross-platform, and the most popular NoSQL database program.

Databases, collections, and documents are the core terminology in MongoDB.

  • Each database has collections, which in turn contain documents.
  • The data is stored in the form of JSON-style documents (roughly analogous to rows).
  • It is useful in building scalable websites with millions of users.

Relational databases were not designed to cope with the scale and agility challenges that face modern applications, nor were they built to take advantage of the commodity storage and processing power available today. NoSQL is the term used to describe the new database architectures that scale well in a distributed environment.

MongoDB supplies a JavaScript environment with BSON object storage (a binary adaptation of JSON), so reading and writing data from Node is extremely efficient. It stores incoming records in memory, so it is ideal in high-write situations.

Since incoming records are stored in memory, inserting data into Mongo is non-blocking, making it ideal for logging operations and telemetry data.

It also supports JavaScript functions inside queries, making it very powerful in read situations, including MapReduce queries.

Advantages
Schemaless

Documents don't need to have a schema defined beforehand. Instead, fields are created on the fly.
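
A minimal sketch with the official Python driver (pymongo), assuming a MongoDB server on localhost; the database, collection, and field names are illustrative. Note that the two inserted documents have different fields: no schema has to be declared beforehand, and related data can simply be embedded instead of joined.

```python
from pymongo import MongoClient  # pip install pymongo

client = MongoClient("mongodb://localhost:27017")  # assumed local server
users = client["demo_db"]["users"]                 # database and collection are created lazily

# Two documents with different shapes: the fields are created on the fly.
users.insert_one({"name": "Asha", "city": "Pune"})
users.insert_one({
    "name": "Ravi",
    "interests": ["cricket", "python"],
    "address": {"street": "MG Road", "pin": "560001"},  # embedded document instead of a join
})

print(users.find_one({"name": "Ravi"})["address"]["pin"])
```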

Fast Access

It uses internal memory for storing the (windowed) working set, enabling faster access to data.

Replication

MongoDB can provide high availability with replica sets. A replica set consists of two or more MongoDB instances.

No complex joins

Instead of joins, it supports embedding or referencing related data, so there is no need for a complex ORM (mapping from relational database tables to application objects).

Language support

It has official drivers for major programming languages and development environments. The JSON-based structure allows storing complex nested schemas. It plays well with Node.js, which also understands JSON natively.

Learn more about MongoDB from MongoDB University (https://university.mongodb.com/).

Learning Python

Python is used for everything from scripting to building large websites. There is excellent support for machine learning and data science in the form of libraries like NumPy, scikit-learn, pandas, and MLlib. Python is used in big data and MapReduce work through Apache Spark. Python is a first choice for scientific computing and is replacing MATLAB in many cases. Django and Flask are used for building powerful websites like Reddit and Quora that run on Python. Python introduces programming constructs like generators, list comprehensions, and functional programming that can help in writing elegant code.

Here is why you should learn Python  https://dbader.org/blog/whylearn-python 

Functional programming: Python has support for functional programming. This means functions are treated as first-class citizens: you can manipulate functions just like any other data type (such as an int or a string). You can store functions in a list, pass them as parameters, and return them from other functions. A short sketch follows; here is more info on functional programming using Python:
https://www.ibm.com/developerworks/library/l-prog/
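
A short sketch of what "functions as first-class citizens" means in Python: functions stored in a list, passed as parameters, and returned from other functions.

```python
def square(x):
    return x * x

def cube(x):
    return x * x * x

# Functions stored in a list, like any other value.
transforms = [square, cube]
print([f(3) for f in transforms])      # [9, 27]

# A function passed as a parameter.
print(list(map(square, [1, 2, 3])))    # [1, 4, 9]

# A function returned from another function (a closure).
def make_adder(n):
    def add(x):
        return x + n
    return add

add_five = make_adder(5)
print(add_five(10))                    # 15
```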