Welcome to Debtech International


Onsite Seminar

Introduction to Big Data



Enormous volumes of data are being generated today, more than ever before. They come from indexing large numbers of documents, serving pages on high-traffic websites, handling social networking data, capturing sensor or audio-visual data, and delivering streaming media, to name just a few sources. Such data surpasses the capabilities of traditional DBMSs, a phenomenon commonly called “Big Data”. “Big Data” is characterized by volume, variety and velocity. This course covers the requirements of “Big Data”, including new DBMSs, new data structures, new transaction constraints, new distribution technologies and new communications protocols.

To meet these needs, a whole raft of hardware and software technologies has emerged, among them Hadoop. Hadoop is a software library and framework that allows for the distributed processing of large data sets across clusters of computers using a simple programming model.

This course presents “Big Data” and these new technologies and shows how and where they fit together. Several exercises are used to give the attendees experience in choosing and architecting the right solution. The course will also address the role that data architecture and data management should play.

Agenda

“Big Data”

  • Definition of “Big Data”
  • “Big Data” characteristics
  • “Big Data” market drivers
  • Examples of successful “Big Data” implementations
  • The previous dominance of relational technology
  • What really is a relational DBMS?
  • Types of data
      • Structured
      • Semi-structured
      • Unstructured

NoSQL databases and file management systems

  • Definition
  • Characteristics
  • Integrity rules
  • CAP vs. ACID
      • CAP
          • Eventually consistent
          • Available
          • Partition tolerant
      • ACID
          • Atomic
          • Consistent
          • Isolated
          • Durable
  • Sharding
  • Scalability
  • Companies, products, standards driving NoSQL
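
As an illustration of the sharding topic above, the following is a minimal sketch of hash-based sharding, one common way to spread keys across nodes. It is not taken from the course materials; the function names (shard_for, assign) and the sample keys are hypothetical, and real products typically use more elaborate schemes such as consistent hashing.

```python
import hashlib


def shard_for(key, shard_count):
    """Return the shard index for a key (hypothetical helper).

    A stable hash is used instead of Python's built-in hash(), which is
    randomized per process for strings and would move keys between runs.
    """
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % shard_count


def assign(keys, shard_count):
    """Group keys by the shard they hash to."""
    placement = {shard: [] for shard in range(shard_count)}
    for key in keys:
        placement[shard_for(key, shard_count)].append(key)
    return placement


if __name__ == "__main__":
    # Sample keys are invented for illustration only.
    print(assign(["customer:1", "customer:2", "order:17", "order:18"], 3))
```

Note that with simple modulo placement, changing the number of shards moves most keys, which is one reason scalability and sharding are usually discussed together.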

Key value

  • Definition
  • Characteristics
  • Amazon’s Dynamo
  • Amazon support for SimpleDB
  • Vendor products
  • Relevance

Columnar

  • Definition
  • Characteristics
  • Distinction between wide-column and column-oriented
  • Google’s Bigtable
  • Definition and example of Name-Value Pair
  • Cassandra
  • Sybase IQ and Vertica
  • Other vendor products
  • Relevance

Document

  • Definition
  • Characteristics
  • Vendor products
  • Relevance

Relational

  • Definition
  • Characteristics
  • Misunderstandings
  • Advantages and disadvantages
  • When and where of ACID properties
  • Major vendor products
  • Relevance
  • Restrictions
  • Proper perspective on
  • The future of RDBMS

Hadoop

  • Apache
  • Description of functionality in Hadoop
  • Hadoop components
  • Ring structure and replication

Massively Parallel Processing

  • Definition
  • Shared nothing architecture
  • Characteristics of good data distribution
  • Benefits of
  • Restrictions to use of

MapReduce

  • Overall definition
  • Definition of Map
  • Definition of Reduce
  • API support
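
As an illustration of the Map and Reduce definitions listed above, here is a minimal single-process sketch of the classic word-count example. It only mimics the programming model; it does not use Hadoop or any Hadoop API, and the function names (map_phase, shuffle, reduce_phase, word_count) are hypothetical.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input record."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Group intermediate values by key, as a framework would between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: collapse all values for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    """Run map, shuffle and reduce in sequence over a list of documents."""
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["big data", "big clusters"]))
```

In a real cluster the map calls run in parallel on the nodes holding the input blocks, and the shuffle step moves data across the network; the sketch keeps everything in one process to show only the shape of the two functions.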

Big Data Analytics

  • Definition of “Big Data” Analytics
  • Origins of it in web searching
  • Influence of Google and Yahoo
  • Comparison of traditional data warehousing and “Big Data” analytics
  • The role of existing analytics components:
      • The data warehouse
      • ODS (Operational Data Store)
      • Data mining
      • Near real time data warehousing
  • Positioning of DBMSs that support “Big Data” initiatives
      • Traditional
      • Shared-nothing
      • NoSQL
      • Appliances
      • New analytic engines
  • Layers of components in “Big Data”
      • Analytic applications
      • Fast-loading databases
      • Data mapping
      • Higher level languages
      • Job control
      • Location-aware file systems
      • Original (or source) data

ZooKeeper

  • A Distributed Coordination Service for Distributed Applications
  • Design Goals
  • Data model and the hierarchical namespace
  • Nodes and ephemeral nodes
  • Conditional updates and watches
  • Guarantees
  • Simple API
  • Implementation
  • Uses
  • Performance
  • Reliability
  • The ZooKeeper Project
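
As an illustration of the conditional updates topic above, the following is a toy in-memory sketch of a version-checked write. ZooKeeper keeps a version number on each node and rejects a write whose expected version is stale; the class and method names below (ToyStore, BadVersionError) are hypothetical and are not the ZooKeeper client API.

```python
class BadVersionError(Exception):
    """Raised when a write carries a stale expected version."""


class ToyStore:
    """In-memory stand-in for a hierarchical namespace of versioned nodes."""

    def __init__(self):
        self._nodes = {}  # path -> (data, version)

    def create(self, path, data):
        """Create a node at version 0; fail if the path already exists."""
        if path in self._nodes:
            raise KeyError("node already exists: " + path)
        self._nodes[path] = (data, 0)

    def get(self, path):
        """Return (data, version) so the caller can attempt a conditional write."""
        return self._nodes[path]

    def set(self, path, data, expected_version):
        """Write only if nobody else has updated the node since it was read."""
        _, current = self._nodes[path]
        if expected_version != current:
            raise BadVersionError(
                "expected version %d, found %d" % (expected_version, current)
            )
        self._nodes[path] = (data, current + 1)
        return current + 1


if __name__ == "__main__":
    store = ToyStore()
    store.create("/config", "v1")
    _, version = store.get("/config")
    store.set("/config", "v2", version)      # succeeds, version becomes 1
    try:
        store.set("/config", "v3", version)  # stale version, rejected
    except BadVersionError as err:
        print("rejected:", err)
```

The version check lets two clients race to update the same node without a lock: the loser sees the rejection, re-reads the node and retries.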

Organizational Issues in “Big Data”

  • Ownership and control of “Big Data”
  • Specialized (and departmental) focus of “Big Data”
  • Job roles in “Big Data”

Best Practices

  • Managing growth
  • Managing analytics
  • Managing new data types
  • Determining refresh rates
  • Migrating analytics platforms

Trends, Products and Techniques for “Big Data”

  • Vendors and products
  • Methodologies
  • Trends

Duration
2 days

Course Format
Lecture, group discussion and exercises 

Instructor
Tom Haughey

To request a quote for this in-house seminar
Please call (561) 218-4752 or email info@debtechint.com
