How to Build Big Data Analytics Infrastructure


ref https://www.datasciencecentral.com/profiles/blogs/big-data-analytics-infrastructure

 

Big data can bring huge benefits to businesses of all sizes. However, as with any business project, proper preparation and planning are essential, especially when it comes to infrastructure. Until recently it was hard for companies to get into big data without making heavy infrastructure investments (expensive data warehouses, software, analytics staff, etc.). But times have changed. Cloud computing in particular has opened up a lot of options, as it means businesses can tap into big data without having to invest in massive on-site storage and data processing facilities.

In order to get going with big data and turn it into insights and business value, it’s likely you’ll need to make investments in the following key infrastructure elements: data collection, data storage, data analysis, and data visualization/output. Let’s look at each area in turn.

Data collection

This is where the data arrives. It includes everything from your sales records, customer database, feedback, social media channels, marketing lists, email archives and any data gleaned from monitoring or measuring aspects of your operations. You may already have the data you need, but chances are you need to source some or all of the data required.

If you do need to source new data, this may require new infrastructure investments. Infrastructure requirements for capturing data depend on the type or types of data required, but key options might include: sensors (that could sit in devices, machines, buildings, or on vehicles, packaging, or anywhere else you would like to capture data from); apps which generate user data (for example, a customer app which allows customers to order more easily); CCTV video; beacons (such as iBeacons); changes to your website that prompt customers for more information; and social media profiles.

With a little technical knowledge, you can set many of these systems up yourself, or you can partner with a data company to set up the systems and capture the data on your behalf. Accessing external data sources, such as social media sites, may require little or no infrastructure changes on your part, since you’re accessing data that someone else is capturing and managing. If you’ve got a computer and an internet connection, you’re pretty much good to go.
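To give a feel for what bringing existing data together can look like, here is a minimal Python sketch. It assumes you already have a CSV export of sales records and a JSON feed of events from a customer app. The sample data, the field names and the collect function are all made up for illustration, and a real setup would read from files or services rather than from strings.

```python
import csv
import io
import json

# Hypothetical exports: a CSV of sales records and a JSON feed of app events.
SALES_CSV = "customer,amount\nalice,20.00\nbob,35.50\n"
APP_EVENTS_JSON = '[{"customer": "alice", "event": "order_placed"}]'


def collect(sales_csv, app_events_json):
    """Gather records from both sources, tagging each with where it came from."""
    records = []
    for row in csv.DictReader(io.StringIO(sales_csv)):
        records.append({"source": "sales", **row})
    for event in json.loads(app_events_json):
        records.append({"source": "app", **event})
    return records


collected = collect(SALES_CSV, APP_EVENTS_JSON)
for record in collected:
    print(record)
```

Tagging each record with its source keeps later steps simple, because you can always tell where a piece of data came from.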

Data storage

This is where you keep your data once it is gathered from your sources. As the volume of data generated and stored by companies has exploded, sophisticated but accessible systems and tools have been developed to help with this task. The main storage options include: a traditional data warehouse; a data lake; a distributed/cloud-based storage system; and your company server or a computer hard disk.

Regular hard disks are available at very high capacities and for very little cost these days and, if you’re a small business, this may be all you need. But when you start to deal with storing and analyzing a large amount of data, or if data is going to be a key part of your business going forward, a more sophisticated, distributed (usually cloud-based) system like Hadoop may be called for. 

Data analysis

When you want to use the data you have stored to find out something useful, you will need to process and analyze it. So this layer is all about turning data into insights. This is where programming languages and platforms come into play.

There are three basic steps in this process: 1. preparing the data (identifying, cleaning and formatting the data so it is ready for analysis); 2. building the analytic model; and 3. drawing a conclusion from the insights gained.
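The three steps can be sketched in a few lines of Python. This is only an illustration under simple assumptions: the sales records are invented, the function names (prepare, build_model, conclude) are hypothetical, and the "model" is just an average per region, where a real project would use something much richer.

```python
from statistics import mean

# Hypothetical raw sales records; note the messy formatting and a missing value.
RAW_SALES = [
    {"region": " North ", "amount": "120.50"},
    {"region": "south", "amount": "80"},
    {"region": "North", "amount": ""},  # missing amount, dropped in step 1
    {"region": "South", "amount": "100.00"},
    {"region": "north", "amount": "79.50"},
]


def prepare(records):
    """Step 1: clean and format the data so it is ready for analysis."""
    cleaned = []
    for record in records:
        amount = record["amount"].strip()
        if not amount:
            continue  # skip rows with no amount
        cleaned.append({"region": record["region"].strip().title(),
                        "amount": float(amount)})
    return cleaned


def build_model(records):
    """Step 2: a deliberately simple 'model': the average sale per region."""
    by_region = {}
    for record in records:
        by_region.setdefault(record["region"], []).append(record["amount"])
    return {region: mean(amounts) for region, amounts in by_region.items()}


def conclude(model):
    """Step 3: draw a conclusion, here the region with the highest average sale."""
    best = max(model, key=model.get)
    return f"{best} has the highest average sale ({model[best]:.2f})"


cleaned = prepare(RAW_SALES)
model = build_model(cleaned)
summary = conclude(model)
print(summary)
```

Notice how much of the work sits in step 1: inconsistent capitalization, stray spaces and missing values all have to be dealt with before any analysis makes sense.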

Software exists from vendors such as IBM, Oracle and Google to help you do all of this: turning raw data into insights. Google has BigQuery, which is designed to let anyone with a bit of data science knowledge run queries against vast datasets. Other analytics options include Cloudera, Microsoft HDInsight and Amazon Web Services. And many startups are piling into the market, offering simple solutions which claim to let you feed them all of your data, then sit back while they highlight the most important insights and suggest actions for you to take.
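Running a query against a dataset usually means writing a little SQL. The sketch below is not BigQuery or any other vendor's API. It uses Python's built-in sqlite3 module with an in-memory database as a small local stand-in, and the orders table and its contents are invented for the example. The idea of asking a question of the data with a query is the same, only the scale differs.

```python
import sqlite3

# An in-memory SQLite database stands in for a hosted analytics service.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, product TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [
        ("alice", "widget", 20.0),
        ("bob", "widget", 35.0),
        ("alice", "gadget", 50.0),
        ("carol", "gadget", 10.0),
    ],
)

# Total revenue per product, highest first.
QUERY = """
    SELECT product, SUM(amount) AS revenue
    FROM orders
    GROUP BY product
    ORDER BY revenue DESC
"""
revenue_by_product = conn.execute(QUERY).fetchall()
conn.close()

for product, revenue in revenue_by_product:
    print(f"{product}: {revenue:.2f}")
```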

Data visualization/output

This is how the insights gleaned from analyzing the data are passed on to the people who need them, i.e. the decision makers in your company. Clear and concise communication is essential, and this output can take the form of brief reports, charts, figures and key recommendations.

All too often I see businesses bury the real nuggets of information that could really impact strategy in a 50-page report or a complicated graphic that no one understands. It’s clearly unrealistic to expect busy people to wade through mountains of data with endless spreadsheet appendices and extract the key messages. Remember: if the key insights aren’t clearly presented, they won’t result in action.

Key data output options include management dashboards, commercial data visualization platforms that make the data attractive and easy to understand, and simple graphics (like charts and graphs) that communicate insights.
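Even a very simple graphic can get a message across. As a rough sketch, the Python below turns a handful of figures into a plain-text bar chart, sorted so the biggest number is at the top. The text_bar_chart function and the revenue figures are hypothetical, and it assumes at least one value is greater than zero. A real dashboard or visualization platform would do far more.

```python
def text_bar_chart(figures, width=30):
    """Render a dict of label -> value as a plain-text horizontal bar chart.

    Assumes at least one value is greater than zero.
    """
    largest = max(figures.values())
    label_width = max(len(label) for label in figures)
    lines = []
    # Largest value first, so the key message is at the top.
    for label, value in sorted(figures.items(), key=lambda item: item[1], reverse=True):
        bar = "#" * round(width * value / largest)
        lines.append(f"{label:<{label_width}} | {bar} {value}")
    return "\n".join(lines)


# Hypothetical quarterly revenue by region.
revenue = {"North": 200, "South": 100, "East": 50}
chart = text_bar_chart(revenue, width=20)
print(chart)
```

Sorting the bars is a small design choice, but it means the reader sees the most important figure first without having to hunt for it.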

Together, these four areas represent the key infrastructure requirements for big data analysis.