INFO-I 416 Applied Cloud Computing for Data Intensive Sciences
Prerequisites: Databases (CSCI-N 211, CSCI 44300, CIT 21400, or INFO-I 308); Programming (CSCI 23000, CIT 21500, INFO-I 210, or NEWM-N 220; INFO-B 210 recommended); and MATH 17100 Multidimensional Mathematics
This course covers data science concepts, techniques, and tools to support big data analytics, including cloud computing, parallel algorithms, nonrelational databases, and high-level language support. The course applies the MapReduce programming model and virtual-machine utility computing environments to data-driven discovery and scalable data processing for scientific applications.
This course includes the following topics:
- Data intensive sciences and the data center model
- Clouds with infrastructure, platform, and software as a service
- Virtualization technologies and tools
- Introduction to FutureGrid (or OpenStack) as an experimental testbed
- Parallel programming using MapReduce vs. MPI
- MapReduce and data parallel applications using Hadoop
- Iterative MapReduce and data mining algorithms using Twister (expectation maximization, clustering, multidimensional scaling, latent Dirichlet allocation, Bayesian networks)
- MapReduce on multicore processors and graphics processing units (CUDA)
- NoSQL databases (Google Bigtable and Hadoop HBase) and parallel query processing
- High-level languages (Hive and Pig)
- Amazon EC2 and Microsoft Azure and their applications
Upon completion of this course, students will be able to:
- Explain the main concepts, models, technologies, and services of cloud computing, the reasons for the shift to this model, and its advantages and disadvantages.
- Examine the technical capabilities and commercial benefits of hardware virtualization.
- Analyze tradeoffs for data centers in performance, efficiency, cost, scalability, and flexibility.
- Explain the core challenges of cloud computing deployments, including public, private, and community clouds, in terms of privacy, security, and interoperability.
- Create cloud computing infrastructure models.
- Demonstrate and compare the use of cloud storage vendor offerings, such as Amazon S3, Microsoft Azure, OpenStack, and the Hadoop Distributed File System (HDFS).
- Develop, install, and configure cloud-computing applications under software-as-a-service principles, employing Pig, Hive, and other cloud-computing frameworks and libraries.
- Apply the MapReduce programming model to data analytics in informatics-related domains.
- Enhance MapReduce performance by redesigning the system architecture (e.g., provisioning and cluster configurations).