Data Management Tools and Technologies

A wide range of tools and technologies, most of which have existed for several decades, support the application of consumer analytics. More recently, the need to cope with big data has led to the emergence of new technologies. Prominent among these are Hadoop and Hadoop-related projects, cloud computing, and cognitive systems.

This section describes some of the better-known tools and technologies currently used for managing structured, unstructured and semi-structured data.

Database

A relational database is made up of a collection of tables, in which data is stored in rows and columns. Relational database management systems (RDBMS) store structured data that can be managed using SQL (Structured Query Language). Non-relational databases, on the other hand, do not store data in relational tables; they use other models such as documents, key-value pairs or graphs.
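As a rough illustration of rows, columns and SQL, the sketch below uses Python's built-in sqlite3 module with an in-memory database. The table name, column names and figures are hypothetical and are chosen only for the example.

```python
import sqlite3

# In-memory relational database; nothing is written to disk.
conn = sqlite3.connect(":memory:")

# Hypothetical table: each row is one purchase, each column a fixed field.
conn.execute("CREATE TABLE purchases (shopper_id INTEGER, product TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO purchases VALUES (?, ?, ?)",
    [(1, "coffee", 4.50), (1, "tea", 3.00), (2, "coffee", 4.50)],
)

# SQL query: total spend per shopper.
rows = conn.execute(
    "SELECT shopper_id, SUM(amount) FROM purchases "
    "GROUP BY shopper_id ORDER BY shopper_id"
).fetchall()
print(rows)
```

The query returns one row per shopper, with the spend summed across that shopper's purchases.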

Structured and Unstructured Data

Data that resides in fixed fields (e.g., relational databases, spreadsheet tables) is called structured data, and data that does not reside in fixed fields is called unstructured data.

Unstructured data has grown explosively with the spread of social media. Examples include free-form text (e.g., social media posts and other webpages, books, articles, emails) and untagged audio, image and video data.

Semi-structured data does not conform to fixed fields but contains tags and other markers to separate data elements. Examples of semi-structured data include XML or HTML-tagged text.
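The short sketch below shows how the tags in semi-structured data allow individual elements to be picked out, using Python's standard xml.etree.ElementTree module. The XML snippet, its tag names and its attributes are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML: free-form review text, with tags and attributes
# marking where each data element begins and ends.
xml_text = """
<reviews>
  <review product="coffee" rating="4">Rich flavour, a little pricey.</review>
  <review product="tea" rating="5">Would buy again.</review>
</reviews>
""".strip()

root = ET.fromstring(xml_text)

# Turn each tagged element into a record with named fields.
records = [
    {
        "product": review.get("product"),
        "rating": int(review.get("rating")),
        "text": review.text.strip(),
    }
    for review in root.findall("review")
]
print(records)
```

The review text itself remains free-form, but the markers make it possible to separate the product, the rating and the comment without fixed fields.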

Business Intelligence

The term business intelligence (BI) is used in two different contexts as defined by Forrester Research:

In the broader context: “Business Intelligence is a set of methodologies, processes, architectures, and technologies that transform raw data into meaningful and useful information used to enable more effective strategic, tactical, and operational insights and decision-making.” This definition encompasses a very wide range of theories, methodologies, architectures, and technologies for business analysis and decision-making.
In a narrower context, business intelligence refers to “reporting, analytics and dashboards”.

A wide range of software applications including Tableau, IBM Cognos, Oracle Hyperion and SAP NetWeaver provide facilities for retrieving, analysing and reporting data for business intelligence.

Data Marts and Warehouses

A data warehouse is a repository of structured data from diverse data sources, which is optimally designed for reporting and analysis purposes. Data from the different sources is uploaded using ETL (extract, transform and load) tools.

A data mart is a subset of a data warehouse, used to provide data to users usually through business intelligence tools.

Data Integration

Data integration is the process of combining data residing in different sources to present a unified view of the data. In business applications it is often referred to as “Enterprise Information Integration”.

There is no universal approach to data integration, and many techniques in this field are still evolving. One approach is the common data storage method, also known as data warehousing. This method extracts data from outside sources, transforms it to fit operational needs, and loads it into a repository, the data warehouse. Specialized ETL (extract, transform, and load) software tools are used for this purpose.
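The extract, transform and load steps can be sketched in a few lines of Python. This is a minimal illustration rather than a real ETL tool: the CSV source, the sales_fact table and the cleaning rules are all assumptions made for the example.

```python
import csv
import io
import sqlite3

# Hypothetical source: a CSV export with inconsistent formatting.
SOURCE_CSV = """store,date,sales
north ,2024-01-05,"1,200"
south,2024-01-05,950
"""

def extract(text):
    """Extract: read raw rows from the source."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """Transform: tidy store names and convert sales to numbers."""
    return [
        (r["store"].strip().upper(), r["date"], float(r["sales"].replace(",", "")))
        for r in rows
    ]

def load(conn, rows):
    """Load: write the cleaned rows into the warehouse table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sales_fact (store TEXT, date TEXT, sales REAL)"
    )
    conn.executemany("INSERT INTO sales_fact VALUES (?, ?, ?)", rows)
    conn.commit()

warehouse = sqlite3.connect(":memory:")  # stands in for the data warehouse
load(warehouse, transform(extract(SOURCE_CSV)))
loaded = warehouse.execute(
    "SELECT store, sales FROM sales_fact ORDER BY store"
).fetchall()
print(loaded)
```

Once the data is loaded in a consistent form, reporting queries run against the warehouse table and no longer touch the original sources.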

Because the data has already been extracted, converted and combined, queries to a data warehouse take little time to resolve. From a user or front-end standpoint, data warehousing is therefore an efficient way to work with integrated data. The development of the backend, on the other hand, requires considerable thought and effort, and the warehouse needs to be large enough to accommodate the data from the multiple sources, a challenge considering the pace at which some databases are growing.

The key disadvantage of data warehouses is that the information in them is not always current. A data warehouse might not extract and load data very frequently, which means information may not be reliable for time-sensitive applications.

An alternative approach, which is suited to time-sensitive data, is for the system to pull data directly from individual networked data sources. This is done by mapping the schemas of the data sources to a mediated schema. It requires the development of “wrappers” or adapters that transform queries to the mediated schema into appropriate queries over the respective data sources.
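The idea of wrappers over a mediated schema can be sketched as follows. The two sources, their field names and the Wrapper class are hypothetical, and real systems translate queries into each source's own query language rather than reading in-memory lists.

```python
# Two hypothetical sources that describe customers with different schemas.
CRM_ROWS = [
    {"cust_id": 1, "full_name": "Ana Lim"},
    {"cust_id": 2, "full_name": "Raj Rao"},
]
WEB_ROWS = [{"uid": 3, "name": "Mei Tan"}]

class Wrapper:
    """Translates mediated-schema field names into one source's own names."""

    def __init__(self, rows, mapping):
        self.rows = rows        # stands in for a live data source
        self.mapping = mapping  # mediated field name -> source field name

    def query(self, fields):
        return [{f: row[self.mapping[f]] for f in fields} for row in self.rows]

def mediated_query(wrappers, fields):
    """Send the same mediated-schema query to every source and combine results."""
    results = []
    for wrapper in wrappers:
        results.extend(wrapper.query(fields))
    return results

wrappers = [
    Wrapper(CRM_ROWS, {"customer_id": "cust_id", "customer_name": "full_name"}),
    Wrapper(WEB_ROWS, {"customer_id": "uid", "customer_name": "name"}),
]
unified = mediated_query(wrappers, ["customer_id", "customer_name"])
print(unified)
```

Because each query reads the sources at the moment it is run, the unified view is always current, at the cost of slower queries than a pre-built warehouse.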

Data Fusion

Data fusion is the process of matching two or more data sources at the consumer level to create a unified database. The matching uses information common to both sources, called “linking variables” or “fusion hooks,” to pair up consumers from the respective databases. The unified database contains information from the original data sources, simulating single-source data.
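The matching described above can be sketched as a simple exact match on the linking variables. The records, field names and fusion hooks below are invented for illustration, and commercial fusion typically relies on statistical matching over many variables rather than the first exact match used here.

```python
# Hypothetical consumer panel (what people buy) and audience panel (what they watch).
panel = [
    {"id": "P1", "age_group": "25-34", "region": "north", "buys": "coffee"},
    {"id": "P2", "age_group": "35-44", "region": "south", "buys": "tea"},
]
audience = [
    {"id": "A1", "age_group": "35-44", "region": "south", "watches": "news"},
    {"id": "A2", "age_group": "25-34", "region": "north", "watches": "drama"},
]

# Linking variables ("fusion hooks") common to both sources.
HOOKS = ("age_group", "region")

def fuse(recipients, donors, hooks):
    """Pair each recipient with a donor that has the same hook values."""
    index = {}
    for donor in donors:
        index.setdefault(tuple(donor[h] for h in hooks), []).append(donor)

    fused = []
    for recipient in recipients:
        matches = index.get(tuple(recipient[h] for h in hooks), [])
        if matches:
            donor = matches[0]  # simplification: take the first exact match
            fused.append(
                {**recipient, "watches": donor["watches"], "donor_id": donor["id"]}
            )
    return fused

fused = fuse(panel, audience, HOOKS)
print(fused)
```

Each fused record carries both purchase and viewing information, as if it had come from a single source.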

For example, data from social media, analysed by natural language processing, can be fused with real-time sales data, to determine what effect a marketing campaign is having on customer sentiment and purchasing behaviour.

Similarly, consumer panel data, which tells us what consumers buy, may be fused with audience measurement data, which tells us what programmes they watch. By cross-referencing results in this manner, data fusion links purchasing behaviour with media habits to deliver insights that were otherwise not available.

Since it fits into existing analysis systems and requires no additional primary research, data fusion is a cost-effective means of enhancing existing data.

Distributed System

A distributed system comprises a collection of computers, connected via a network. The system uses middleware (software that serves to “glue together” the multiple computers) to enable the connected computers to coordinate their activities and to share the resources of the system, as a single, integrated computing facility.

A distributed system lowers costs because a cluster of lower-end computers can be less expensive than a single higher-end computer. It also improves reliability and scalability — system resources and capabilities can be increased by simply adding more nodes rather than replacing a central computer.

Google File System and Colossus

Google File System (GFS) is a proprietary, scalable distributed file system developed by Google to provide efficient, reliable access to data on large clusters of inexpensive commodity hardware. The successor to GFS is called Colossus.

BigTable

BigTable is a compressed, high-performance, proprietary distributed data storage system built on GFS and other Google technologies. It provides better scalability and control of performance characteristics for Google’s applications. The company offers access to BigTable through Google App Engine.

BigTable is used by a number of applications including web indexing, MapReduce, Google Maps, Google Book Search, “My Search History”, Google Earth, Blogger.com, Google Code hosting, Orkut, YouTube, and Gmail.

Dynamo

Dynamo is an Amazon proprietary distributed data storage system.

Cassandra

Cassandra is an open-source, freely available database management system originally developed at Facebook and designed to handle large data volumes on a distributed system. The system is now managed by the Apache Software Foundation.

Cloud Computing

Cloud computing is the delivery of computing as a service rather than a product. Configured as a distributed system, it provides scalable computing resources to users’ computers as a utility over the internet.

Hadoop

Hadoop is an open-source computing environment that is widely used for large data operations on distributed clustered systems. Inspired by Google File System and MapReduce, Hadoop was originally developed at Yahoo! and is now managed as a project of the Apache Software Foundation. It is implemented in Java.

Hadoop essentially comprises the Hadoop distributed file system (HDFS), the Hadoop MapReduce model, and Hadoop Common, which contains libraries and utilities needed by other Hadoop modules.

The HDFS breaks down data into blocks that are distributed across the Hadoop cluster. The MapReduce programme performs two distinct functions: map and reduce. Tasks to be performed on a dataset are broken down into smaller sub-tasks and distributed to the DataNodes (i.e. worker nodes). The DataNodes process the sub-tasks in parallel, generating a set of intermediate results. The reduce function then merges the intermediate values to produce the final result.
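The map and reduce functions can be illustrated with a word count, the customary introductory example. The sketch below runs on a single machine in plain Python and does not use Hadoop; the blocks and function names are assumptions, and the grouping step stands in for the shuffle that Hadoop performs between map and reduce.

```python
from collections import defaultdict

# Hypothetical data blocks, as if split across worker nodes.
blocks = ["big data big insight", "data beats opinion", "big opinion"]

def map_block(block):
    """Map: emit a (word, 1) pair for every word in one block."""
    return [(word, 1) for word in block.split()]

def shuffle(pairs):
    """Group the intermediate values by key before reducing."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_counts(key, values):
    """Reduce: merge the intermediate values for one key."""
    return key, sum(values)

# On a cluster each block would be mapped in parallel on a different node.
intermediate = [pair for block in blocks for pair in map_block(block)]
word_counts = dict(reduce_counts(k, v) for k, v in shuffle(intermediate).items())
print(word_counts)
```

Because each block is mapped independently of the others, the map step can be spread across as many nodes as there are blocks.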

The Apache Hadoop platform consists of a number of related projects such as Apache Pig, Apache Hive, Apache HBase, Apache Spark, and others.

HBase

HBase, one of a number of projects related to Hadoop, is a column-oriented database management system that runs on top of HDFS. Unlike relational database systems, it does not support a structured query language like SQL. HBase works well with sparse data sets, which are common in many big data use cases. It is managed as a project of the Apache Software Foundation as part of the Hadoop group of services.

Metadata

Metadata is “data about data”, i.e. it describes the content and context of data files. A webpage, for example, may include metadata specifying what language it is written in, what tools were used to create it, and where to go for more on the subject, allowing web browsers to automatically improve the experience of users.
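As an illustration of the webpage example, the sketch below reads metadata from a small HTML page using Python's standard html.parser module. The page content and the MetaExtractor class are hypothetical.

```python
from html.parser import HTMLParser

# Hypothetical webpage carrying metadata about its language and authoring tool.
PAGE = (
    '<html lang="en"><head>'
    '<meta name="generator" content="ExampleCMS 1.0">'
    "<title>Demo</title></head><body>Hello</body></html>"
)

class MetaExtractor(HTMLParser):
    """Collects the page language and any named <meta> tags."""

    def __init__(self):
        super().__init__()
        self.metadata = {}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "html" and "lang" in attrs:
            self.metadata["language"] = attrs["lang"]
        elif tag == "meta" and "name" in attrs:
            self.metadata[attrs["name"]] = attrs.get("content", "")

extractor = MetaExtractor()
extractor.feed(PAGE)
metadata = extractor.metadata
print(metadata)
```

None of this metadata is displayed to the reader; it describes the page to the software that handles it.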
