This course provides instruction on the processes and practice of data science, including machine learning and natural language processing. Tools and programming languages covered include Python, IPython, Mahout, Pig, NumPy, pandas, SciPy, scikit-learn, the Natural Language Toolkit (NLTK), and Spark MLlib.

Course Objectives

• Recognize use cases for data science on Hadoop
• Describe the Hadoop and YARN architecture
• Describe the differences between supervised and unsupervised learning
• Use Mahout to run a machine learning algorithm on Hadoop
• Describe the data science life cycle
• Use Pig to transform and prepare data on Hadoop
• Write a Python script
• Describe options for running Python code on a Hadoop cluster
• Write a Pig User-Defined Function in Python
• Use Pig streaming on Hadoop with a Python script
• Use machine learning algorithms
• Describe use cases for Natural Language Processing (NLP)
• Use the Natural Language Toolkit (NLTK)
• Describe the components of a Spark application
• Write a Spark application in Python
• Run machine learning algorithms using Spark MLlib
• Take data science into production

Eligibility / Requirements

Students must have experience with at least one programming or scripting language, knowledge of statistics and/or mathematics, and a basic understanding of big data and Hadoop principles. Students new to Hadoop are encouraged to attend the HDP

