Introducing Grouply

Happy to introduce Grouply.in – a community of technology professionals sharing great content, similar to Hacker News or Reddit.

Here are some of the groups you may find interesting; many more groups are available.

Machine learning

Python Programming

Jobs and Career

Software Engineering

Tech Interviews

GATE Computer Science

Basic Computer Science

Author info: I am Arvind from Computer Science Careers – India’s largest tech community providing trusted career advice. Thank you for being part of the community; we just crossed 250K members.

Pandas for data analytics

Learn Data Analytics before diving deep into Data Science

Is it true that to become a data scientist you must master all of the following: statistics, linear algebra, calculus, programming, databases, distributed computing, machine learning, visualization, experimental design, clustering, deep learning, natural language processing, and more?

The answer is simply no.

Data science is simply the process of asking interesting questions and then answering those questions using large sets of data. So data science, in general, can be understood as a process that includes the following steps:

  • Define the question you want to answer
  • Gather data that might help you answer that question
  • Clean the data
  • Explore, analyse, and visualize the data
  • Build and evaluate a machine learning model
  • Interpret and present the results

None of the tasks listed above necessarily requires knowledge of advanced mathematics, mastery of deep learning, or many of the other skills mentioned above.

They do, however, require the skill to understand data and the ability to work with it in a programming language such as R or Python. So don’t start with the complex concepts; start with data analytics using pandas.

Why data analysis with pandas?

For working with data in Python, you should learn how to use the pandas library.

Pandas provides a high-performance data structure (called a “DataFrame”) that is suitable for tabular data with columns of different types, similar to an Excel spreadsheet or SQL table.

It includes tools for reading and writing data, handling missing data, filtering data, cleaning messy data, merging datasets, visualizing data, and so much more. In short, learning pandas will significantly increase your efficiency when working with data.
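To make this concrete, here is a minimal, hypothetical sketch of the kind of workflow pandas makes easy (the file name and column names are invented for illustration):

import pandas as pd

# Load a hypothetical CSV of orders into a DataFrame
orders = pd.read_csv("orders.csv")               # assumed columns: order_id, city, amount

# Handle missing data and filter rows
orders = orders.dropna(subset=["amount"])        # drop rows with a missing amount
big_orders = orders[orders["amount"] > 1000]     # keep only the large orders

# Group, aggregate and inspect the result
summary = big_orders.groupby("city")["amount"].agg(["count", "mean"])
print(summary.head())

A few lines like these already cover reading, cleaning, filtering and summarising data, which is most of day-to-day data analytics.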

 

Full Stack development

  1. HTML, CSS, JavaScript, jQuery
  2. Bootstrap and Foundation are popular CSS frameworks
  3. A responsive website adapts its layout to different screen sizes
  4. Frontend build tools speed up the development process – Gulp and Grunt are popular ones
  5. Dependency management – Browserify, Webpack, Require.js, Yeoman
  6. JavaScript frameworks – React.js, Angular.js, Backbone.js, Ember.js, Vue.js
  7. Backend frameworks – this is where the business logic resides – Ruby on Rails, Python Django, PHP, Node.js, Java, C#
  8. Database – persistent storage (MySQL, MongoDB, Redis, PostgreSQL, Cassandra)
  9. Caching – caching reduces the need to hit the database every time (Nginx, Apache, Redis, in-memory); a minimal sketch follows this list
  10. Web platforms – hosting providers (AWS, DigitalOcean, Heroku, Azure)
  11. DevOps – bridges the development process with server administration; automates the workflow and deployment
  12. Docker is used for containerizing applications; Vagrant is used for provisioning virtual machines (ensuring the development environment matches the server)
  13. Server management – configuration management tools, a way of provisioning servers (Salt, Puppet, Chef, Ansible)
  14. Others – authentication, authorization, APIs, RESTful services, SOA, continuous integration and deployment
  15. Tools – FTP, SSH, GitHub
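To illustrate item 9, here is a minimal cache-aside sketch in Python. It is purely illustrative: it assumes the third-party redis client library, and query_user_from_db is a made-up stand-in for a real database call.

import json
import redis  # third-party Redis client (pip install redis)

cache = redis.Redis(host="localhost", port=6379)

def query_user_from_db(user_id):
    # Hypothetical slow database lookup; replace with a real query.
    return {"id": user_id, "name": "example"}

def get_user(user_id):
    key = "user:%s" % user_id
    cached = cache.get(key)
    if cached is not None:                        # cache hit: no database round trip
        return json.loads(cached)
    user = query_user_from_db(user_id)            # cache miss: query the database once...
    cache.set(key, json.dumps(user), ex=300)      # ...and cache the result for 5 minutes
    return user

The pattern (check the cache, fall back to the database, then populate the cache) is what lets the cache absorb most of the read traffic.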

Introduction to Apache Hadoop and friends

Hadoop is a MapReduce framework that enables processing large datasets in parallel on clusters of commodity hardware. It is cheaper because it is an open source solution that runs on commodity hardware while handling petabytes of data, and it is faster on massive data volumes because processing is done in parallel.

A complete Hadoop MapReduce based solution may have the following layers.

1. Hadoop core – HDFS (Hadoop Distributed File System)

This is the data storage system, where data is split into large blocks (around 64 MB or 128 MB each) and saved. It can scale to thousands of nodes and is inspired by the Google File System, which resolved the problem of indexing the web. To cater for fault tolerance, HDFS by default keeps three replicas of each block it stores, so there can occasionally be integrity differences between copies of the data set.

2. MapReduce API

Allows job-based (batch) parallel processing across data in HDFS. The API automatically parallelizes work over huge amounts of data while preserving fault tolerance and adding high availability. It also moves the computation to the data, rather than shipping data across the network to be operated on. MapReduce expects hardware failures on commodity machines, so automatic retry is built in.

Ref: http://mm-tom.s3.amazonaws.com/blog/MapReduce.png

As shown in the referenced diagram, there are three main phases in MapReduce. In the map phase, the incoming files are split into pieces and a key-value pair is emitted for each piece of data. This output is consumed by the shuffle phase, which transfers the data to the reducers; during shuffling, the data can be subjected to optimizations such as sorting or pre-processing that save time at the reducers. Finally, the reducers consume the shuffled output and derive the desired result according to the defined reducing logic.
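As a concrete illustration of the map and reduce phases, here is a minimal word-count sketch written as two small Python scripts for Hadoop Streaming, a standard way to run scripts as MapReduce jobs; the file names are our own, and the reducer relies on the framework delivering its input sorted by key, as the shuffle phase does.

# mapper.py: emit a ("word", 1) pair for every word read from stdin
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print("%s\t%d" % (word.lower(), 1))

# reducer.py: input arrives sorted by key, so counts can be summed in one pass
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t")
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print("%s\t%d" % (current_word, current_count))
        current_word, current_count = word, int(count)
if current_word is not None:
    print("%s\t%d" % (current_word, current_count))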

3. Data access

These layers access the data stored in HDFS in optimized or enhanced ways for particular domains. The following are a few such layers available for use.

● Pig – A scripting language that provides an ‘Extract, Transform, Load’ (ETL) library for HDFS
● Hive – Provides the Hive Query Language (HiveQL), an SQL-like language for querying data in HDFS
● HBase – Inspired by Google BigTable for GFS, HBase provides a more abstract layer that leverages the distributed data storage of HDFS
● Mahout – A data access layer specifically for running scalable machine learning algorithms on HDFS

4. Tools and libraries

● HUE – Hadoop User Experience, which provides a web interface for the Hadoop platform.
● Sqoop – A tool to efficiently transfer bulk data between Hadoop and structured data stores. This is useful when enterprises using relational databases encounter massive growth, leading to volumes of data that can no longer be handled efficiently in relational databases.

5. Monitoring and alerting

Reports cluster node performance and statistics about the jobs currently executing, mostly for use by administrators.

Though Hadoop is a powerful framework, it is not a solution for every scalability and data-volume problem in the enterprise. It should only be used when batch processing and the latencies involved are acceptable to the application.

For example, Hadoop may not be suitable for real-time analysis where interactivity is needed. Facebook uses Hadoop for ad targeting, and Hadoop is also used for point-of-sale transaction analysis, threat analysis, data sandboxing, recommendation engines, risk modelling and trade surveillance when huge volumes of data are involved.

While Hadoop is an open source framework, several vendors such as Cloudera, Hortonworks, MapR, AWS and Microsoft Azure provide a paid premium layer on top of it, with added value and support.

Creating Chatbots using Python

A chatbot is a computer program that holds a conversation or interaction with the user through chat. Chatbots are hot right now because they:

  • create a unique and user-friendly customer experience
  • give you the feeling that you’re talking to a real person rather than a computer
  • can schedule meetings, tell you the weather, provide customer support and much more. That’s why businesses love them and use them as part of their branding.

We conducted a webinar on how to develop a chatbot using Python.

About the Speaker

Srushith, from CodeOps Technology, is an experienced software engineer with a demonstrated history of working in the information technology and services industry. He loves Python, serverless technologies, Multisim, MATLAB, C++, Amazon Web Services, and cloud technologies.

This two-hour webinar covers:

  1. Writing fun, tiny bits of Python code to tweet automatically (and become popular)
  2. Building a simple personal assistant / conversational user interface (chatbot) using Python; a small illustrative sketch follows
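The webinar code itself is not reproduced here, but as a flavour of the second topic, here is a tiny, purely illustrative rule-based chatbot loop in Python (the keywords and replies are made up):

import random

# A very small rule table: keyword -> possible canned replies
RULES = {
    "hello": ["Hi there!", "Hello! How can I help?"],
    "weather": ["I can't see outside, but I hope it's sunny."],
    "meeting": ["Sure, what time should I schedule the meeting for?"],
}

def reply(message):
    text = message.lower()
    for keyword, answers in RULES.items():
        if keyword in text:
            return random.choice(answers)
    return "Sorry, I don't understand that yet."

if __name__ == "__main__":
    print("Bot: Hello! (type 'quit' to exit)")
    while True:
        user = input("You: ")
        if user.strip().lower() == "quit":
            break
        print("Bot:", reply(user))

Real chatbots replace the keyword table with natural language understanding, but the read-a-message, pick-a-reply loop stays the same.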

Introduction to Tensorflow

TensorFlow is an open source software library for numerical computation using dataflow graphs. In these graphs, nodes represent mathematical operations, while edges represent the multidimensional data arrays (tensors) that flow between them. Computations can be deployed on one or more CPUs or GPUs, on desktop or mobile.
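As a minimal illustration of a dataflow graph, here is a tiny sketch written against the TensorFlow 1.x graph API that this article (from around the 1.4 release) describes; in TensorFlow 2.x the Session-based style has been replaced by eager execution:

import tensorflow as tf

# Nodes: two constant operations and a multiply operation
a = tf.constant(3.0, name="a")
b = tf.constant(4.0, name="b")
c = tf.multiply(a, b, name="c")   # the tensors a and b flow along edges into c

# Building the graph computes nothing by itself; running it in a session does the work
with tf.Session() as sess:
    print(sess.run(c))            # 12.0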

TensorFlow for Machine learning and Deep learning

  • Open source software library created by Google.
  • A library for dataflow programming.
  • Both machine learning and deep learning have a pool of powerful algorithms, and both work to train a computer to learn complex problems automatically, make decisions and provide solutions.
  • TensorFlow leverages various optimization techniques to make the calculation of mathematical expressions easier and faster. Because of this, it is becoming the heart of machine learning and deep learning.

Some of the key features of TensorFlow are:

  • TensorFlow is implemented in C++ and provides APIs for C and Python
  • Efficiently works with mathematical expressions involving multi-dimensional arrays
  • Good support for deep neural networks and machine learning concepts
  • GPU/CPU computing where the same code can be executed on both architectures
  • High scalability of computation across machines and huge data sets

Together, these features make TensorFlow an excellent framework for machine intelligence at production scale.
If you’re interested in the details, please refer to the links below.

  1. In this link, you will learn how to use simple yet powerful machine learning methods in TensorFlow, and how to use some of its auxiliary libraries to debug, visualize, and tweak the models created with it.
    https://www.kdnuggets.com/2017/12/getting-started-tensorflow.html
  2. In this link, you will learn about some of the new features in TensorFlow’s 1.4 release.
    https://cloud.google.com/blog/big-data/2017/12/new-in-tensorflow-14-converting-a-keras-model-to-a-tensorflow-estimator

We will keep you updated as we come across good articles.

For complete details on TensorFlow, refer to https://www.tensorflow.org/tutorials/

Introduction to Data Mining

Today, the demand for data analysts and data scientists is so high that companies are struggling to fill their open positions.

Data scientist is among the most in-demand job titles in the market, and the trend suggests it will remain so for the next couple of decades. So learning about data mining techniques will surely help you prepare to become a data analyst or data scientist.

Data mining is the process of extracting useful information from unorganised raw data.

These techniques are used to
  • predict future trends
  • identify customers and develop marketing strategies to increase sales.
The ultimate goal of data mining is prediction, and predictive data mining is the most common kind.
The biggest challenge is to analyse the data and extract meaningful information that can be used to solve a problem or grow the business. Powerful tools and techniques are available to mine data and find insights in it.

There are various data mining techniques. Each technique helps us find different patterns.

Below is the list of the most common data mining techniques.

Classification

  • Assigns data to different classes based on its attributes.
  • These pre-defined classes help segregate the data for further analysis and give better results.
  • Classification analysis is widely used in machine learning algorithms (a small sketch follows).
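As a small illustration (not from the original article), here is a classification sketch using scikit-learn and its bundled iris dataset:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)                       # features and pre-defined class labels
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)
print("accuracy:", clf.score(X_test, y_test))           # fraction of test samples classified correctly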

Regression

  • In this analysis, you find the relationship between multiple variables.
  • It helps identify how strongly a variable depends on other variables.
  • It predicts how one variable will change if a related variable changes (see the sketch below).
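A minimal regression sketch with scikit-learn (the numbers are invented) that fits a line and predicts how one variable changes with another:

import numpy as np
from sklearn.linear_model import LinearRegression

# Invented data: y depends roughly linearly on x
x = np.array([[1], [2], [3], [4], [5]])
y = np.array([2.1, 4.0, 6.2, 7.9, 10.1])

model = LinearRegression().fit(x, y)
print("slope:", model.coef_[0])                  # how much y changes per unit of x
print("prediction at x=6:", model.predict([[6]])[0])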

Association

  • It is used to find relationships between variables in large data sets and to extract hidden patterns from the data.
  • A major application of this technique is in the retail industry, for market-basket analysis (sketched below).
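A small market-basket sketch, assuming the third-party mlxtend library (pip install mlxtend); the shopping baskets are invented:

import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules

# Invented shopping baskets
transactions = [["bread", "milk"], ["bread", "butter"],
                ["bread", "milk", "butter"], ["milk", "butter"]]

# One-hot encode the baskets, then mine frequent itemsets and rules
te = TransactionEncoder()
basket = pd.DataFrame(te.fit(transactions).transform(transactions), columns=te.columns_)
itemsets = apriori(basket, min_support=0.5, use_colnames=True)
rules = association_rules(itemsets, metric="confidence", min_threshold=0.6)
print(rules[["antecedents", "consequents", "support", "confidence"]])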

Clustering

  • Places data into groups based on similar values.
  • The grouping is done so that objects within the same cluster are very similar to each other but very dissimilar to objects in other clusters (see the sketch below).
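A minimal clustering sketch using k-means from scikit-learn (the points are made up):

import numpy as np
from sklearn.cluster import KMeans

# Made-up 2-D points forming two loose groups
points = np.array([[1.0, 1.1], [1.2, 0.9], [0.8, 1.0],
                   [8.0, 8.2], [7.9, 8.1], [8.3, 7.8]])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
print("cluster labels:", kmeans.labels_)         # points in the same group share a label
print("cluster centres:", kmeans.cluster_centers_)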

Anomaly detection

  • As the name suggests, it is used to detect unusual patterns.
  • It has wide applications in detecting fraud in credit/debit card transactions and detecting intrusions in network traffic (a small sketch follows).
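A tiny anomaly-detection sketch using scikit-learn's IsolationForest; the transaction amounts are invented:

import numpy as np
from sklearn.ensemble import IsolationForest

# Invented transaction amounts: mostly small, one suspiciously large
amounts = np.array([[12.0], [15.5], [11.2], [14.8], [13.1], [950.0]])

detector = IsolationForest(contamination=0.2, random_state=0).fit(amounts)
print(detector.predict(amounts))   # -1 marks unusual transactions, 1 marks normal ones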

Decision tree

  • A decision tree is represented graphically as a hierarchical structure, so it has the very useful property of being easy to read and understand.
  • In fact, decision trees are among the few models that are interpretable: you can understand exactly why the classifier makes a particular decision. They can also handle both numerical and categorical data (a small sketch follows).
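A small decision-tree sketch with scikit-learn; export_text prints the learned rules, which is exactly what makes the model easy to read:

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(iris.data, iris.target)

# The fitted tree can be printed as human-readable if/else rules
print(export_text(tree, feature_names=iris.feature_names))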

Neural network

A neural network is an attempt to model computation on the brain: if computers were more like the brain, they could be good at some of the things humans are good at, such as pattern recognition. A neural network simulates a collection of neurons, much as in the brain, and these simulated neurons take inputs and produce outputs through their connections.
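A minimal neural-network sketch using scikit-learn's MLPClassifier (a small multi-layer perceptron), again only for illustration:

from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Handwritten-digit images: a classic pattern-recognition task
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# One hidden layer of 32 simulated neurons connected to inputs and outputs
net = MLPClassifier(hidden_layer_sizes=(32,), max_iter=500, random_state=0).fit(X_train, y_train)
print("test accuracy:", net.score(X_test, y_test))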

R programming language for statistics and data science

R is one of the most widely used of all data analytics tools and languages (some estimates put its adoption near 70%) because it is free, open source software that can easily be extended with a large collection of packages.

For these reasons, R programming is an important skill to acquire in the field of data science.

The importance of R is hard to overstate, so we have tried to consolidate resources for learning it. Here you will find lists of books, MOOCs, tutorials and much more to learn R.

Books

A lot of books have been written on R. R is worthwhile in a variety of fields (data science, business analytics, social media analysis and so on), so here I have tried to categorize the books according to the different uses of R.

1. Books for Beginners

Beginning R – Free Download eBook – pdf

This book examines the R language using simple statistical examples, showing how R operates in a user-friendly context. It is useful for learning simple summary statistics, hypothesis testing, creating graphs, regression, and much more. It covers formula notation, complex statistics, manipulating data, extracting components and rudimentary programming.

Hands-On Programming with R

With this book, you’ll learn how to load data, assemble and disassemble data objects, navigate R’s environment system, write your own functions and use all of R’s programming tools.

RStudio Master Instructor Garrett Grolemund not only teaches you how to program, but also shows you how to get more from R than just visualizing and modelling data.

2. Installing RStudio

Getting Started with RStudio

This concise book provides new and experienced users with an overview of RStudio, as well as hands-on instructions for analysing data, generating reports, and developing R software packages.

3. Working with Graphics in R

R Graphics Cookbook

This practical guide provides more than 150 recipes to help you generate high-quality graphs quickly, without needing much knowledge of R’s graphics systems.

Guidebook to R Graphics Using Microsoft Windows

This book takes readers step by step through the process of creating histograms, boxplots, strip charts, time series graphs, stem-and-leaf displays, scatterplot matrices and map graphs.

4. Data visualization using R and JavaScript

Pro Data Visualization using R and JavaScript

In this book, you will learn how to gather data effectively, and also how to understand the philosophy and implementation of each type of chart, so as to be able to represent the results visually.

5. Data science algorithm implementation in R

Practical Data Science Cookbook

This book guides you from the basics (how to set up your numerical programming environment) to the advanced levels of the data science pipeline (introducing the iterative process of completing a data science project). After working through this book, you will be able to implement data science algorithms in both R and Python.

6. Machine learning algorithms with R

Machine Learning for Hackers

If you are an experienced programmer interested in crunching data this book will get you started with machine learning – a toolkit of algorithms that enables computers to train themselves to automate useful tasks.

Using R programming, you will learn how to analyze sample datasets and write simple code for machine learning algorithms. Machine Learning for Hackers is ideal for programmers from any background, including business, government, and academic research.

Machine Learning with R Cookbook

This book covers the basics of R by setting up a user-friendly programming environment and performing data ETL in R. You will then dive into important machine learning topics, including data classification, regression, clustering, association rule mining, and dimension reduction.

7. Social media analysis with R

Social Media Mining with R

This book provides detailed instructions on how to obtain, process and analyze a variety of socially generated data, while providing the theoretical background to help you accurately interpret your findings.

8. Business analytics and R

Data Mining and Business Analytics with R

In this book, readers are provided with the guidance needed to model and interpret complicated data and to become adept at building powerful models for prediction and classification.

9. Web development in R

Web Application Development with R Using Shiny

After working through this book, you will be able to build useful and engaging web applications with only a few lines of code, no JavaScript required.

10. Analysing Big Data: R and Hadoop

Big Data Analytics with R and Hadoop

This book is focused on the techniques of integrating R and Hadoop by various tools such as RHIPE and RHadoop.

11. 500+ links on R programming, statistics and visualization, by Alket Cecaj on Algorithms and Data Fusion

Tutorials

1. R Tutorial – Code School

Code School teaches programming under the banner of ‘learning by doing’. It is an interactive course and the content presentation is very lucid.

2. DataCamp: The Easy Way To Learn R & Data Science Online

DataCamp is one of the best portals for learning data science. Their tutorials are created in a simple manner.

3. R Tutorial at Tutorialspoint

Tutorialspoint is one of the sites widely known for sharing knowledge about various programming languages. They have created R tutorials as well.

 

MOOCs

Here is a list of different MOOC programs where you can learn R:

  1. R Programming – Johns Hopkins University | Coursera
  2. Introduction to R for Data Science
  3. Free Introduction to R Programming Online Course | DataCamp
  4. R Programming A-Z™: R For Data Science With Real Exercises!
  5. Learn R Programming from Scratch – Udemy
  6. Introduction to R for Data Science | edX
  7. R Fundamentals | Dataquest.io
  8. swirl: Learn R, in R.

Data lakes, data warehouses, and databases

Data lakes, data warehouses, and databases are all terms used in data management. But what exactly do they mean, and are they the same or different from each other? Let’s explore that in this article. We will start with definitions, then discuss the key differences.

A database is a generic data storage and processing platform, often designed for a specific data model (e.g., relational, hierarchical, etc.), that can be used for different workloads such as OLTP or OLAP. A database is an organized collection of data. Early databases were flat and limited to simple rows and columns; today, both relational and non-relational (NoSQL) databases are popular. A few related terms are also worth defining:

Enterprise Data Warehouse (EDW): This is a data warehouse that serves the entire enterprise.

Data Mart: A data mart is used by individual departments or groups and is intentionally limited in scope, focusing on what those users need right now rather than on all the data that already exists.

Data Swamp: When your data lake gets messy and is unmanageable, it becomes a data swamp.

Data warehouse

A data warehouse collects data from various sources, whether internal or external, and optimizes the data for retrieval for business purposes. The data warehouse is designed to gather business insights and allows businesses to integrate their data, manage it, and analyze it at many levels.

A data warehouse system is a specific data processing system, often built on a database system, for OLAP/analytics workloads. A data warehouse uses a relational or multi-dimensional analytics schema, often in the form of a so-called star schema with fact and dimension tables, to answer a predefined set of questions and reports very efficiently. It is thus fully schematized and stores only the data needed to answer that set of questions. Data storage costs are normally high. The data is primarily structured, often coming from relational databases, but it can be unstructured too.

Data Lake

“A data lake is a place to store your structured and unstructured data, as well as a method for organizing large volumes of highly diverse data from diverse sources.”

A data lake is a generic data repository designed to store any type of data in as close to its original format as possible, at large scale and low cost, with the ability to schematize the data on demand. It is designed for discovering the questions you want to ask (for example, by applying machine learning algorithms to the data to discover interesting correlations) and for providing the original data so that new questions can be answered.

At the conceptual level, a data lake augments the data warehouse by offering the original data at lower cost, so that new data warehouses can be created for new questions as they show up and are discovered. Its analytics engines provide a highly scalable batch-processing structure, a custom code execution framework, machine learning infrastructure, integration with streaming pipelines (forming the cold path of a lambda architecture) and the ability to run interactive queries.

Data lake platforms often include data warehousing capabilities by providing a catalog system that stores optimized logical storage abstractions such as tables, and at the very least they are integrated with a data warehouse system.

Data lakes, data warehouses and databases are all designed to store data. So why are there different ways to store data, and what’s significant about them? In this section, we’ll cover a few of the key differences between a data lake, a data warehouse and a database.

Key differences

Data
  • A database stores current data, so it can hold the day-to-day transactions.
  • A data warehouse only stores data that has been modeled/structured; put another way, a data warehouse stores historical data.
  • A data lake, by contrast, is no respecter of data: it stores it all, structured, semi-structured, and unstructured.
Processing
  • A database holds day-to-day transactions, so processing consists of operations such as inserting, updating or deleting data.
  • Before we can load data into a data warehouse, we first need to give it some shape and structure, i.e., we need to pre-process it. That’s called schema-on-write.
  • With a data lake, you just load the raw data in as-is, and only when you’re ready to use the data do you give it shape and structure. That’s called schema-on-read. (A small sketch of the difference follows this list.)
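To make the schema-on-write versus schema-on-read distinction concrete, here is a small hypothetical Python sketch (the file name and fields are invented): the warehouse-style path shapes and validates records before storing them, while the lake-style path stores the raw records untouched and applies a schema only when they are read.

import json
import pandas as pd

raw_events = ['{"user": "a1", "amount": "9.99", "ts": "2017-12-01"}',
              '{"user": "b2", "amount": "12.50"}']            # messy, incomplete record

# Schema-on-write (warehouse-style): shape and validate before storing
def to_warehouse_row(line):
    rec = json.loads(line)
    return {"user": rec["user"],
            "amount": float(rec["amount"]),
            "ts": pd.to_datetime(rec.get("ts"))}              # schema is enforced up front

warehouse = pd.DataFrame([to_warehouse_row(l) for l in raw_events])

# Schema-on-read (lake-style): store the raw lines untouched, parse only when queried
with open("lake_events.json", "w") as f:                      # the "lake" here is just raw files
    f.write("\n".join(raw_events))

lake = pd.read_json("lake_events.json", lines=True)           # schema applied at read time
print(warehouse.dtypes)
print(lake.head())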
Agility
  • A database contains a well-defined schema and tables to store data; these tables are related to each other and can be queried together to get information.
  • A data warehouse is a highly-structured repository, by definition. It’s not technically hard to change the structure, but it can be very time-consuming given all the business processes that are tied to it.
  • A data lake, on the other hand, lacks the structure of a data warehouse—which gives developers and data scientists the ability to easily configure and reconfigure their models, queries, and apps on-the-fly.
Security
  • Data in a database can be secured by applying different levels of authentication and synchronization approaches.
  • Data warehouse technologies have been around for decades, while big data technologies (the underpinnings of a data lake) are relatively new. Thus, the ability to secure data in a data warehouse is much more mature than securing data in a data lake.
Users
  • As a database stores day-to-day transactions, people working at the clerk and admin level need to access data from the database.
  • Data warehouses are meant for data analysis, so anyone working on analysis, such as higher management, data analysts and business analysts, is a user of the data warehouse.
  • A data lake, at this point in its maturity, is best suited for data scientists.