Course Title: CS 6604: Data Challenges in Machine Learning (Spring 2021)

3 credits, Online, CRN: 21452

Instructor: Ismini Lourentzou

Meeting time: Tuesdays and Thursdays, 9:30-10:45am EST, Zoom Link

Office hours: Tuesdays 11:30am-12:30pm EST and Thursdays 3:00-4:00pm EST, Zoom Link

Video lectures available to registered VT students

Background Survey (with an option for force-add requests) to be completed by 01/31.

Reading list has been released.

Course Description:

Machine learning (ML) has revolutionized a wide variety of domains, ranging from computer vision and natural language processing to robotics, speech recognition, and beyond. In general, ML models are exceptional at recognizing subtle patterns when trained on large volumes of high-quality data. Such reliance on well-defined, accessible, and consumption-ready data poses broad challenges to the successful application of machine learning. In this course, we will explore recent advances that address data challenges such as limited or imperfect supervision, low-resource scenarios, outliers, class imbalance, missing attributes/values, robustness and generalization.

The course roles, background survey and other material in this syllabus are based on Anuj Karpatne’s CS6804 course at Virginia Tech, Alec Jacobson’s CSC2521 course at the University of Toronto, and Colin Raffel’s COMP790 course at the University of North Carolina, Chapel Hill.

Prerequisites:

Students should have experience with machine learning, data analytics, and preferably deep learning. Familiarity with linear algebra, statistics and probability is necessary, as is experience with the design and implementation of machine learning models (ideally with a framework that is well-suited for rapid ML prototyping, e.g., PyTorch, TensorFlow, Keras, etc.). Most importantly, students are expected to be able to extract key concepts and ideas from ML conference papers.

Course Topics:

The course is structured around reading, presenting and discussing weekly papers. Each student will have a unique, rotating role per week. This role defines the lens through which each student reads the paper and determines what they prepare for the in-class group discussion. All students, irrespective of role, are expected to have read the assigned papers for each session before class and to come to class ready to discuss them. Students will obtain a thorough understanding of the chosen papers and will develop their paper reading, literature review and prototyping skills.

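Readings will cover a range of topics, including, but not limited to, the topic labels that appear in the schedule: semi-supervised learning, active learning, weak supervision and self-supervision, data augmentation, adversarial training, bias, interpretability, outlier and out-of-distribution detection, class imbalance, missing values and attributes, and robustness and generalization.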

Presentation roles:

This seminar is organized around the different “roles” students play each week: Presenter, Reviewer, Archaeologist, Researcher, Industry Expert, Hacker, and Ethicist. Students will be divided into two groups, one presenting on Tuesdays and the other on Thursdays. In a given class session, each student in the presenting group will be given a rotating role (described below). Roles define the lens through which students read the paper. You should present your findings formally, i.e., with slides prepared for the in-class group discussion. Your assigned role determines what you should include in the slides. Each role, whether designed for one student or two, should create 5-6 slides and present for approximately 8-10 minutes. The length and number of slides can vary; this is merely a recommendation to make sure each role is sufficiently covered. The Hacker role is the only exception: Hackers should provide a Jupyter Notebook instead of slides. Note that two people will team up for each of the Presenter, Archaeologist, Hacker and Industry Expert roles. Pairings can change from class to class. The remaining roles are designed for one student.

The roles may change with course enrollment: for example, roles may be removed or made optional if enrollment decreases, or groups of two students may be allowed for all roles if enrollment increases. Incorporating student feedback as we go through the readings is crucial.

Everyone, every week: After each class session, by 9 pm on the same day, post your thoughts on Piazza: for example, which parts you enjoyed reading, which results and insights you found interesting, or a missing result the paper could have included.
Important: Your post should start with the phrase “Weekly assignment: Paper_Title” (assignment completion is checked automatically). Whenever you agree with the comments in another student’s post, make sure to endorse their answer. You can also post a reply with your additional thoughts.
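To see why the exact opening phrase matters for an automated check, here is a minimal sketch of what such a check could look like. It is only an illustration: the function names are hypothetical, and the tolerance for letter case and extra whitespace is an assumption, so the actual check may be stricter. Copying the phrase exactly is the safe choice.

```python
REQUIRED_PREFIX = "Weekly assignment:"  # phrase required by the syllabus


def _normalize(text: str) -> str:
    """Collapse runs of whitespace and ignore letter case (an assumed leniency)."""
    return " ".join(text.split()).casefold()


def is_valid_weekly_post(post_text: str, paper_title: str) -> bool:
    """Return True if the post starts with 'Weekly assignment: <paper title>'."""
    expected = _normalize(f"{REQUIRED_PREFIX} {paper_title}")
    return _normalize(post_text).startswith(expected)


if __name__ == "__main__":
    good = "Weekly assignment: Deep Sets\nI found the permutation-invariance argument interesting."
    bad = "I found the Deep Sets paper interesting."
    print(is_valid_weekly_post(good, "Deep Sets"))
    print(is_valid_weekly_post(bad, "Deep Sets"))
```

A post that mentions the paper title but does not begin with the required phrase would be missed by a check of this kind.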

Final Project:

The main goal of the project is to engage students in research in related machine learning fields. In particular, students should try to extend papers from topics covered in class and present the outcomes as a research paper in a standard conference paper format. Students are encouraged to work in groups of no more than three members, taking into consideration that the work produced should be proportional to the number of members in a team. Groups are required to include a “contributions” section in the final project report, listing each member’s contributions in detail. Projects will be hosted on GitHub and should include a written report accompanied by a descriptive Jupyter Notebook, with a format similar to this notebook. In addition, groups will present their final projects during the last two class sessions. A PowerPoint or LaTeX final presentation is required.

A list of suggested topics as well as more details about the project proposal will be provided at a later time. Students are encouraged to experimentally demonstrate any limitations of related work and suggest improvements by applying the methods to other public datasets, apart from the ones used in the respective papers.

Technology:

Piazza will be used for announcements, general questions, discussions, etc. If you are unable to register for Piazza, please email me. Please familiarize yourself with GitHub, Zoom, LaTeX and paper writing practices. Unless restricted by low internet bandwidth, please try to keep your video turned on during class. This will enhance class participation and hopefully compensate for the lack of in-person interactions. Please keep your audio muted unless you would like to respond to an ongoing discussion or have a question. You can also use the “raise hand” option, type in the chatbox or use the Zoom reactions for nonverbal feedback. Please remember that all in-class discussions should adhere to Virginia Tech’s Principles of Community. To keep track of student order during office hours, please type your name in the chat as soon as you enter the Zoom room. For one-on-one interactions with the instructor, please post a private note on Piazza or use Slack.

Schedule:

We will update the schedule regularly based on the readings and presentations. Based on students’ suggestions, topics have been merged into pairs: for any four consecutive sessions that share the same pair of topic labels, students can pick any paper from either topic, with the constraint that at most 3 of the 4 papers can come from one of the two topics. For example, up to 3 papers can cover weak supervision, and the remaining one must then cover self-supervision. In other words, if 02/09/2021, 02/11/2021 and 02/16/2021 each have a weak supervision paper, then 02/18/2021 has to be a self-supervision paper.
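The paper-selection constraint above can be expressed as a small check. The sketch below is illustrative only: the function name and the representation of a block as a list of topic labels are assumptions, not part of the course tooling.

```python
from collections import Counter

BLOCK_SIZE = 4      # four papers share one pair of topic labels
MAX_PER_TOPIC = 3   # at most 3 of the 4 papers may come from one topic


def block_is_valid(topics, allowed_pair):
    """Check a four-paper block against the selection constraint.

    topics: topic labels of the chosen papers, in schedule order.
    allowed_pair: the two topic labels merged for this block.
    """
    if len(topics) != BLOCK_SIZE:
        return False
    if any(topic not in allowed_pair for topic in topics):
        return False
    counts = Counter(topics)
    return all(count <= MAX_PER_TOPIC for count in counts.values())


if __name__ == "__main__":
    pair = ("weak supervision", "self-supervision")
    # Three weak supervision papers force the fourth to be self-supervision.
    print(block_is_valid(["weak supervision"] * 3 + ["self-supervision"], pair))
    print(block_is_valid(["weak supervision"] * 4, pair))
```

Because the block holds four papers and each topic is capped at three, any valid block necessarily contains at least one paper from each of the two topics.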

Lecture No. | Date | Reading
1 | Tuesday, 01/19/2021 | Course introduction & logistics: slides
2 | Thursday, 01/21/2021 | Active Learning overview (presented by Ismini)
3 | Tuesday, 01/26/2021 | Semi-supervised: Pseudo-Label: The Simple and Efficient Semi-Supervised Learning Method for Deep Neural Networks
4 | Thursday, 01/28/2021 | Active Learning: Learning How to Active Learn by Dreaming
5 | Tuesday, 02/02/2021 | Semi-supervised: Not All Unlabeled Data are Equal: Learning to Weight Data in Semi-supervised Learning
6 | Thursday, 02/04/2021 | Active Learning: Asking the Right Questions to the Right Users: Active Learning with Imperfect Oracles
7 | Tuesday, 02/09/2021 | Self-supervision: Rethinking the Value of Labels for Improving Class-Imbalanced Learning
8 | Thursday, 02/11/2021 | Weak supervision: Transfer Learning and Distant Supervision for Multilingual Transformer Models: A Study on African Languages
9 | Tuesday, 02/16/2021 | Weak supervision: Stochastic Generalized Adversarial Label Learning
nc | Thursday, 02/18/2021 | Classes canceled
10 | Tuesday, 02/23/2021 | Self-supervision: Contrastive Multi-View Representation Learning on Graphs
nc | Thursday, 02/25/2021 | Spring Break Day - No Class
11 | Tuesday, 03/02/2021 | Data augmentation: FeatMatch: Feature-Based Augmentation for Semi-Supervised Learning
12 | Thursday, 03/04/2021 | Adversarial Training: Adversarial Policies: Attacking Deep Reinforcement Learning
13 | Tuesday, 03/09/2021 | Adversarial Training: Adversarial Self-Supervised Contrastive Learning
14 | Thursday, 03/11/2021 | Bias: Man is to Computer Programmer as Woman is to Homemaker? Debiasing Word Embeddings
15 | Tuesday, 03/16/2021 | Interpretability brainstorming: Semi-Supervised Learning under Class Distribution Mismatch
16 | Thursday, 03/18/2021 | Bias: Language (Technology) is Power: A Critical Survey of “Bias” in NLP
17 | Tuesday, 03/23/2021 | Interpretability: Causal Interpretability for Machine Learning - Problems, Methods and Evaluation
18 | Thursday, 03/25/2021 | Outlier & OoD detection: Enhancing The Reliability of Out-of-distribution Image Detection in Neural Networks
19 | Tuesday, 03/30/2021 | Outlier & OoD detection: Outlier Exposure with Confidence Control for Out-of-Distribution Detection
20 | Thursday, 04/01/2021 | Outlier & OoD detection: Deep Sets
nc | Tuesday, 04/06/2021 | Spring Break Day - No Class
21 | Thursday, 04/08/2021 | Class imbalance: Dice Loss for Data-imbalanced NLP Tasks
22 | Tuesday, 04/13/2021 | Missing values & attributes: Inductive Matrix Completion Based on Graph Neural Networks
23 | Thursday, 04/15/2021 | Project Workshop Day
24 | Tuesday, 04/20/2021 | Robustness & generalization: Robustness in Machine Learning Explanations: Does It Matter?
25 | Thursday, 04/22/2021 | Robustness & generalization: On the Connection Between Adversarial Robustness and Saliency Map Interpretability
26 | Tuesday, 04/27/2021 | TBD based on topic popularity (revisiting concepts & discussion)
27 | Thursday, 04/29/2021 | Final project presentations
28 | Tuesday, 05/04/2021 | Final project presentations

Policies:

Grading:

  1. Readings: 56 points. Each student will be in the presenting role for 12 sessions and in the non-presenting role for the remaining 12. You can earn up to 3.5 points each time you present (all presenting roles are considered equal). When you aren’t presenting, you can earn up to 1 point by completing the non-presenting assignment and participating in class, plus 2 points for peer review in the final class sessions. At the end of the semester, extra credit of up to 3 points will be awarded for the best-made presentation and notebook.

  2. Final Project: 44 points divided into the following categories:

    • Proposal: 5 points.
    • Clarity: 12 points; your paper should be readable, contain well-defined and clear motivation and contribution statements, and appropriately make connections with related work. In general, your project report should follow standard machine learning conference paper formatting and style.
    • Novelty: 5 points; your project should propose something new (a new method, application or perspective).
    • Code: 10 points; the code accompanying your project should be well-documented and your experimental results should be reproducible. Your repository should include a README file with full instructions on how to run the code. Moreover, your code should be easy to run with one simple command; if there are multiple steps involved, please make a bash script.
    • In-class presentation: 12 points; your final presentation should be clear to the audience and provide a solid review of your work as if you were presenting at a conference. You can find examples in the NeurIPS’20 schedule (Oral Spotlight sessions such as this one).
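As a quick sanity check, the point values above add up to 100. The sketch below reproduces the arithmetic; the constant and function names are illustrative, and treating the 2 peer-review points as a one-time addition is an assumption (it is the reading under which the Readings component totals 56). Extra credit is not included.

```python
# Readings component (values taken from the grading description)
PRESENTING_SESSIONS = 12
NON_PRESENTING_SESSIONS = 12
POINTS_PER_PRESENTATION = 3.5
POINTS_PER_NON_PRESENTING = 1.0
PEER_REVIEW_POINTS = 2.0  # assumed to be awarded once, in the final sessions

# Final project component
PROJECT_POINTS = {
    "Proposal": 5,
    "Clarity": 12,
    "Novelty": 5,
    "Code": 10,
    "In-class presentation": 12,
}


def max_readings_points() -> float:
    """Maximum points from presenting, non-presenting work and peer review."""
    presenting = PRESENTING_SESSIONS * POINTS_PER_PRESENTATION              # 12 * 3.5 = 42
    non_presenting = NON_PRESENTING_SESSIONS * POINTS_PER_NON_PRESENTING    # 12 * 1 = 12
    return presenting + non_presenting + PEER_REVIEW_POINTS                 # 42 + 12 + 2 = 56


def max_project_points() -> int:
    """Maximum points from the final project categories."""
    return sum(PROJECT_POINTS.values())                                     # 5 + 12 + 5 + 10 + 12 = 44


if __name__ == "__main__":
    print(max_readings_points(), max_project_points(),
          max_readings_points() + max_project_points())
```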

Attendance and late work:

If you expect to miss a class session in which you are in a “presenting” role, you should still create the presentation for your assigned session and find someone else to present for you before the day of the presentation. Missing the class session in which you are supposed to present without arranging the aforementioned accommodations will result in a penalty of 12 points from your total grade, as this disrupts the whole class.

If you miss a non-presenting assignment, you’ll get a zero for that session. Final project presentations cannot be postponed, as they are scheduled in the course’s last few sessions and students need to present in their assigned timeslot. You are welcome to switch your timeslot with another group, but you are responsible for making such arrangements. Deadlines for other deliverables, such as the final project submission and report, are negotiable depending on the severity of the circumstances, e.g., medical reasons.

At any time during the course, if you are having difficulty meeting the course deliverables or would like to discuss any concerns, you are welcome to contact me over email, Slack or Piazza. Students can also submit anonymous feedback through this link. Students seeking special accommodations based on disabilities should contact me and also coordinate accessibility arrangements with the Services for Students with Disabilities office.

Honor Code Statement:

All assignments submitted shall be considered “graded work” and all aspects of your coursework are covered by the Honor Code. Students enrolled in this course are responsible for abiding by the Honor Code. The Academic Integrity expectations for Hokies are the same in an online class as they are in an in-person class. Hokies are expected to meet the academic integrity standards at Virginia Tech at all times. For additional information about the Honor Code, please visit https://www.honorsystem.vt.edu/ and read the Graduate Honor System Constitution. Ignorance of the rules does not exclude any member of the University community from the requirements and expectations of the Honor Code.

In this class, you must attribute appropriate credit to existing ideas, facts, methods and external sources of code by citing the source. At all times, you should avoid claiming someone else’s work as your own. This course will have a zero-tolerance philosophy regarding plagiarism or other forms of cheating, and incidents of academic dishonesty will be reported. A student who has doubts about how the Honor Code applies to this course should obtain specific guidance from the course instructor before submitting the respective assignment.