INFO-I 416 Applied Cloud Computing for Data Intensive Sciences

3 credits

Prerequisites: Databases (CSCI-N 211, CSCI 44300, CIT 21400, or INFO-I 308); Programming (CSCI 23000, CIT 21500, INFO-I 210, or NEWM-N 220; INFO-B 210 recommended); and MATH 17100 Multidimensional Mathematics

This course covers data science concepts, techniques, and tools to support big data analytics, including cloud computing, parallel algorithms, nonrelational databases, and high-level language support. The course applies the MapReduce programming model and virtual-machine utility computing environments to data-driven discovery and scalable data processing for scientific applications.

This course includes the following topics:

  • Data intensive sciences and the data center model
  • Cloud service models: infrastructure, platform, and software as a service
  • Virtualization technologies and tools
  • Introduction to FutureGrid (or OpenStack) as an experimental testbed
  • Parallel programming using MapReduce vs. MPI
  • MapReduce and data parallel applications using Hadoop
  • Iterative MapReduce and data mining algorithms using Twister (expectation maximization, clustering, multidimensional scaling, latent Dirichlet allocation, Bayesian networks)
  • MapReduce on multicore processors and graphics processing units (CUDA)
  • NoSQL databases (Google Bigtable and Hadoop HBase) and parallel query processing
  • High-level languages (Hive and Pig)
  • Amazon EC2 and Microsoft Azure and their applications

Learning Outcomes

  • Explain the main concepts, models, technologies, and services of cloud computing, the reasons for the shift to this model, and its advantages and disadvantages.
  • Examine the technical capabilities and commercial benefits of hardware virtualization.
  • Analyze tradeoffs for data centers in performance, efficiency, cost, scalability, and flexibility.
  • Explain the core challenges of cloud computing deployments, including public, private, and community clouds, in terms of privacy, security, and interoperability.
  • Create cloud computing infrastructure models.
  • Demonstrate and compare the use of cloud storage offerings, such as Amazon S3, Microsoft Azure, OpenStack, and the Hadoop Distributed File System (HDFS).
  • Develop, install, and configure cloud-computing applications under software-as-a-service principles, employing Pig, Hive, and other cloud-computing frameworks and libraries.
  • Apply the MapReduce programming model to data analytics in informatics-related domains.
  • Enhance MapReduce performance by redesigning the system architecture (e.g., provisioning and cluster configurations).

Course Delivery

  • On-Campus
  • Online

Course Schedule