INFO-I 416 Applied Cloud Computing for Data Intensive Sciences

3 credits

  • Prerequisites: Databases (CSCI-N 211, CSCI 44300, CIT 21400, or INFO-I 308); Programming (CSCI 23000, CIT 21500, INFO-I 210, or NEWM-N 220; INFO-B 210 recommended); and MATH 17100 Multidimensional Mathematics
  • Delivery: On-Campus, Online
  • This course covers data science concepts, techniques, and tools to support big data analytics, including cloud computing, parallel algorithms, nonrelational databases, and high-level language support. The course applies the MapReduce programming model and virtual-machine utility computing environments to data-driven discovery and scalable data processing for scientific applications.

    This course includes the following topics:

    • Data intensive sciences and the data center model
    • Clouds with infrastructure, platform, and software as a service
    • Virtualization technologies and tools
    • Introduction to FutureGrid (or OpenStack) as an experimental testbed
    • Parallel programming using MapReduce vs. MPI
    • MapReduce and data parallel applications using Hadoop
    • Iterative MapReduce and data mining algorithms using Twister (expectation maximization, clustering, multidimensional scaling, latent Dirichlet allocation, Bayes networks)
    • MapReduce on multicore processors and graphics processing units (CUDA)
    • NoSQL databases (Google Bigtable and Hadoop HBase) and parallel query processing
    • High-level languages (Hive and Pig)
    • Amazon EC2 and Microsoft Azure and their applications

    Learning Outcomes

    • Explain the main concepts, models, technologies, and services of cloud computing, the reasons for the shift to this model, and its advantages and disadvantages.
    • Examine the technical capabilities and commercial benefits of hardware virtualization.
    • Analyze tradeoffs for data centers in performance, efficiency, cost, scalability, and flexibility.
    • Explain the core challenges of cloud computing deployments, including public, private, and community clouds, in terms of privacy, security, and interoperability.
    • Create cloud computing infrastructure models.
    • Demonstrate and compare the use of cloud storage offerings, such as Amazon S3, Microsoft Azure, OpenStack, and the Hadoop Distributed File System (HDFS).
    • Develop, install, and configure cloud-computing applications under software-as-a-service principles, employing Pig, Hive, and other cloud-computing frameworks and libraries.
    • Apply the MapReduce programming model to data analytics in informatics-related domains.
    • Enhance MapReduce performance by redesigning the system architecture (e.g., provisioning and cluster configurations).
