Course Title: CS 6604: Data Challenges in Machine Learning (Spring 2021)

3 credits, Online, CRN: 21452

Meeting time: Tuesdays and Thursdays, 9:30-10:45am EST, Zoom Link

Office hours: Tuesdays 11:30am-12:30pm EST and Thursdays 3:00-4:00pm EST, Zoom Link

Video lectures available to registered VT students

Background Survey (with an option for force-add requests) to be completed by 01/31.

Reading list has been released.

Course Description:

Machine learning (ML) has revolutionized a wide variety of domains, ranging from computer vision and natural language processing to robotics, speech recognition, and beyond. In general, ML models are exceptional at recognizing subtle patterns when trained on large volumes of high-quality data. Such reliance on well-defined, accessible, and consumption-ready data poses broad challenges to the successful application of machine learning. In this course, we will explore recent advances that address data challenges such as limited or imperfect supervision, low-resource scenarios, outliers, class imbalance, missing attributes/values, robustness, and generalization.

The course roles, background survey and other material in this syllabus are based on Anuj Karpatne’s CS6804 course at Virginia Tech, Alec Jacobson’s CSC2521 course at the University of Toronto, and Colin Raffel’s COMP790 course at the University of North Carolina, Chapel Hill.

Prerequisites:

Students should have experience with machine learning, data analytics, and preferably deep learning. Familiarity with linear algebra, statistics and probability is necessary, as is experience with the design and implementation of machine learning models (ideally with a framework that is well-suited for rapid ML prototyping, e.g., PyTorch, TensorFlow, Keras, etc.). Most importantly, students are expected to be able to extract key concepts and ideas from ML conference papers.

Course Topics:

The course is structured around reading, presenting and discussing weekly papers. Each student will have a unique, rotating role per week. This role defines the lens through which each student reads the paper and determines what they prepare for the group in-class discussion. All students, irrespective of their role, are expected to have read the assigned readings for each session before class and to come to class ready to discuss them. Students will obtain a thorough understanding of the chosen papers and will develop their paper reading, literature review and prototyping skills.

Readings will cover a range of topics, including, but not limited to:

Active Learning, Semi-supervised Learning and annotation noise
Weak Supervision (e.g., data programming, noisy & partial labels) and Self-supervision
Data Augmentation and Adversarial Training
Data bias and fairness (e.g., selection/sampling bias, confirmation bias, confounding bias)
Class imbalance, skewed distributions, outliers, out-of-distribution instances and class mismatch
Missing attributes/values
Robustness, generalization and interpretability

Presentation roles:

This seminar is organized around the different “roles” students play each week: Presenter, Reviewer, Archaeologist, Researcher, Industry Expert, Hacker, and Ethicist. Students will be divided into two groups, one group presenting on Tuesdays and the other on Thursdays. In a given class session, students in the presenting group will each be given a rotating role (described below). Roles define the lens through which students read the paper. You should present your findings with a formal presentation, i.e., have some slides prepared for the group in-class discussion. Your assigned role determines what you should include in the slides. Each role, irrespective of whether it is designed for two students or one, should create 5-6 slides and present for approximately 8-10 minutes. The length and number of slides can vary; this is merely a recommendation to make sure we have sufficiently covered each role. The Hacker role is the only exception to the rule: hackers should provide a Jupyter Notebook instead of slides. Note that two people will team up for the roles of Presenter, Archaeologist, Hacker and Industry Expert. Pairings can change in each class. The remaining roles are designed for one student.

The roles might change depending on course enrollment. For example, roles may be removed or made optional if enrollment decreases, or groups of two students may be allowed for all roles if enrollment increases. Improving the course based on student feedback as we go along with the readings is crucial.

Presenter: Create the main presentation, describing the motivation, problem definition, method and experimental findings of this paper.
Reviewer: Complete a full review of the paper (critical, but not necessarily negative). Follow the guidelines for NeurIPS reviewers (under “Review Content”). Please answer questions 1-6 under “Review Content”, and assign an Overall score (question 9) and a Confidence score (question 10). Note that you can bypass a question by filling in N/A. For example, if you really liked the paper and can’t think of any disadvantage, you can skip the respective question. Also, please note that this role does not require going over related work, and your review does not need to be an exhaustive list of all arguments you can think of. The goal is to enhance your overall critical thinking. I reserve the right to contact students who overuse the N/A option.
Archaeologist: Determine where this paper sits in the context of previous and subsequent work. Find and report on one older paper that has substantially influenced the current paper and one newer paper citing this current paper.
Researcher: Propose an imaginary follow-up project that has now become possible due to the existence and success of the current paper.
Industry Expert: Propose a new application or company product for the method in the paper (not already discussed in class), and discuss at least one positive and negative impact of this application. Convince your industry boss that it’s worth investing time and money to implement this paper. Your arguments should be particularly applicable to the chosen industry market.
Hacker (optional): Implement a small part of the paper on a small dataset or toy problem, or any other simplified version of the paper. Share a Jupyter Notebook with the code of the algorithm with the class. This role is optional, i.e., you can volunteer to be a Hacker as an alternative for a specific week, for example when you like the topic and would like to get hands-on experience, but no student is required to complete this role. Please do not simply download and run an existing implementation; make an effort to implement at least one core method from the paper. After all, your code does not have to be bug-free or run perfectly in all scenarios. Also, you are welcome to use (and give credit to) an existing implementation for “backbone” code (e.g., model building, data loading, training loop, etc.).
Ethicist: You are an ethical impact assessor from 2021 (or even 2051). What has been the impact (good or bad) of this paper on the economy, society, and/or the environment?

Everyone, every week: After each class session, by 9 pm on the same day, post your thoughts on Piazza, for example, which parts did you enjoy reading, what results and insights did you find interesting, a missing result the paper could have included, etc.
Important: Your post should start with the phrase “Weekly assignment: Paper_Title” (assignment completion checks will be done automatically). Whenever you agree with the comments in another student’s post, make sure to endorse their answer. You can also post a reply with your additional thoughts.
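Because completion is checked automatically, the exact form of the first line matters. The snippet below is only a minimal sketch of what such a check could look like; the actual checker used in the course is not described here. The function name `is_valid_weekly_post`, the case-sensitive match and the requirement of a non-empty title are all assumptions made for illustration.

```python
# Hypothetical sketch of an automatic completion check (not the course's actual script).
REQUIRED_PREFIX = "Weekly assignment:"  # phrase quoted in the syllabus


def is_valid_weekly_post(post_text: str) -> bool:
    """Return True if the first line starts with the required phrase
    and is followed by a non-empty paper title (assumed rule)."""
    lines = post_text.strip().splitlines()
    if not lines:
        return False
    first_line = lines[0]
    # Assumption: the match is case-sensitive and must be at the very start.
    if not first_line.startswith(REQUIRED_PREFIX):
        return False
    title = first_line[len(REQUIRED_PREFIX):].strip()
    return bool(title)


if __name__ == "__main__":
    print(is_valid_weekly_post("Weekly assignment: Deep Sets\nI enjoyed the proofs."))
    print(is_valid_weekly_post("My thoughts on Deep Sets"))
```

Under these assumptions, a post that opens with free text or omits the paper title would not be counted, so it is safest to copy the phrase exactly.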

Final Project:

The main project goal is to engage students in research on related machine learning fields. In particular, students should try to extend papers from topics covered in class and present the research outcomes as a research paper, in a standard conference paper format. Students are encouraged to work in groups of no more than three members, taking into consideration that the work produced should be proportional to the number of members in a team. Groups are required to include a “contributions” section in the final project report, listing each member’s contributions in detail. Projects will be hosted on GitHub and should include a written report accompanied by a descriptive Jupyter Notebook, with a format similar to this notebook. In addition, groups will present their final projects during the last two class sessions. A PowerPoint or LaTeX final presentation is required.

A list of suggested topics as well as more details about the project proposal will be provided at a later time. Students are encouraged to experimentally demonstrate any limitations of related work and suggest improvements by applying the methods to other public datasets, apart from the ones used in the respective papers.

Technology:

Piazza will be used for announcements, general questions and discussions, etc. If you are unable to register for Piazza, please email me. Please familiarize yourself with GitHub, Zoom, LaTeX and paper writing practices. Unless restricted by low internet bandwidth, please try to keep your video turned on during class. This will enhance class participation and hopefully compensate for the lack of in-person interactions. Please keep your audio muted unless you would like to respond to an ongoing discussion or have a question. You can also use the “raise hand” option, type in the chatbox or use the Zoom reactions for nonverbal feedback. Please remember that all in-class discussions should adhere to Virginia Tech’s Principles of Community. To keep track of student order during office hours, please type your name in the chat as soon as you enter the Zoom room. For one-on-one interactions with the instructor, please post a private note on Piazza or use Slack.

Schedule:

We will update the schedule regularly based on the readings and presentations. Based on students’ suggestions, topics have been merged: for any four consecutive weeks that share the same pair of topic labels, students can pick any paper from the topics of those weeks, provided that at most 3 papers come from either one of the two topics. For example, up to 3 papers can cover weak supervision, and the remaining one should then cover self-supervision. In other words, if the sessions on 02/09/2021, 02/11/2021 and 02/16/2021 each cover a weak supervision paper, then the session on 02/18/2021 has to cover a self-supervision paper.
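The paper-selection constraint can be checked mechanically. The following is a minimal sketch, assuming a block consists of four papers (as in the dated example) drawn from a pair of topic labels; the function name `block_is_valid` and the topic strings are hypothetical and used only for illustration.

```python
from collections import Counter

BLOCK_SIZE = 4      # assumed: four papers per block, as in the dated example
MAX_PER_TOPIC = 3   # at most 3 papers may come from one of the two topics


def block_is_valid(topics, allowed_pair):
    """Check a block of chosen paper topics against the merged-topic rule."""
    if len(topics) != BLOCK_SIZE:
        return False
    # Every paper must belong to one of the two topics of the block.
    if any(topic not in allowed_pair for topic in topics):
        return False
    counts = Counter(topics)
    # Neither topic may account for more than MAX_PER_TOPIC papers.
    return all(counts[topic] <= MAX_PER_TOPIC for topic in allowed_pair)


if __name__ == "__main__":
    pair = ("weak supervision", "self-supervision")
    three_plus_one = ["weak supervision"] * 3 + ["self-supervision"]
    all_weak = ["weak supervision"] * 4
    print(block_is_valid(three_plus_one, pair))  # 3 + 1 split
    print(block_is_valid(all_weak, pair))        # 4 + 0 split
```

With this reading of the rule, a 3 + 1 or 2 + 2 split passes, while four papers on the same topic does not.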

Lecture No.	Date	Reading
1	Tuesday, 01/19/2021	Course introduction & logistics: slides
2	Thursday, 01/21/2021	Active Learning overview (presented by Ismini)
3	Tuesday, 01/26/2021	Semi-supervised: Pseudo-Label : The Simple and Efficient Semi-Supervised Learning Method for Deep Neural Networks
4	Thursday, 01/28/2021	Active Learning: Learning How to Active Learn by Dreaming
5	Tuesday, 02/02/2021	Semi-supervised: Not All Unlabeled Data are Equal: Learning to Weight Data in Semi-supervised Learning
6	Thursday, 02/04/2021	Active Learning: Asking the Right Questions to the Right Users: Active Learning with Imperfect Oracles
7	Tuesday, 02/09/2021	Self-supervision: Rethinking the Value of Labels for Improving Class-Imbalanced Learning
8	Thursday, 02/11/2021	Weak supervision: Transfer Learning and Distant Supervision for Multilingual Transformer Models: A Study on African Languages
9	Tuesday, 02/16/2021	Weak supervision: Stochastic Generalized Adversarial Label Learning
nc	Thursday, 02/18/2021	Classes canceled
10	Tuesday, 02/23/2021	Self-supervision: Contrastive Multi-View Representation Learning on Graphs
nc	Thursday, 02/25/2021	Spring Break Day - No Class
11	Tuesday, 03/02/2021	Data augmentation: FeatMatch: Feature-Based Augmentation for Semi-Supervised Learning
12	Thursday, 03/04/2021	Adversarial Training: Adversarial Policies: Attacking Deep Reinforcement Learning
13	Tuesday, 03/09/2021	Adversarial Training: Adversarial Self-Supervised Contrastive Learning
14	Thursday, 03/11/2021	Bias: Man is to Computer Programmer as Woman is to Homemaker? Debiasing Word Embeddings
15	Tuesday, 03/16/2021	Interpretability brainstorming: Semi-Supervised Learning under Class Distribution Mismatch
16	Thursday, 03/18/2021	Bias: Language (Technology) is Power: A Critical Survey of “Bias” in NLP
17	Tuesday, 03/23/2021	Interpretability: Causal Interpretability for Machine Learning - Problems, Methods and Evaluation
18	Thursday, 03/25/2021	Outlier & OoD detection: Enhancing The Reliability of Out-of-distribution Image Detection in Neural Networks
19	Tuesday, 03/30/2021	Outlier & OoD detection: Outlier Exposure with Confidence Control for Out-of-Distribution Detection
20	Thursday, 04/01/2021	Outlier & OoD detection: Deep Sets
nc	Tuesday, 04/06/2021	Spring Break Day - No Class
21	Thursday, 04/08/2021	Class imbalance: Dice Loss for Data-imbalanced NLP Tasks
22	Tuesday, 04/13/2021	Missing values & attributes: Inductive Matrix Completion Based on Graph Neural Networks
23	Thursday, 04/15/2021	Project Workshop Day
24	Tuesday, 04/20/2021	Robustness & generalization: Robustness in Machine Learning Explanations: Does It Matter?
25	Thursday, 04/22/2021	Robustness & generalization: On the Connection Between Adversarial Robustness and Saliency Map Interpretability
26	Tuesday, 04/27/2021	TBD based on topic popularity (revisiting concepts & discussion)
27	Thursday, 04/29/2021	Final project presentations
28	Tuesday, 05/04/2021	Final project presentations

Policies:

Grading:

Readings: 56 points. Each student will be in a presenting role for 12 sessions and in a non-presenting role for the remaining 12. You can earn up to 3.5 points each time you present (all presenting roles are weighted equally), for up to 42 points. When you are not presenting, you can earn up to 1 point per session by completing the non-presenting assignment and participating in class, for up to 12 points. The remaining 2 points are for peer review in the final class sessions. At the end of the semester, up to 3 points of extra credit will be awarded for the best presentation and notebook.
Final Project: 44 points divided into the following categories:
- Proposal: 5 points.
- Clarity: 12 points; your paper should be readable, contain well-defined and clear motivation and contribution statements, and appropriately make connections with related work. In general, your project report should follow standard machine learning conference paper formatting and style.
- Novelty: 5 points; your project should propose something new (a new method, application or perspective).
- Code: 10 points; the code accompanying your project should be well-documented and your experimental results should be reproducible. Your repository should include a README file with full instructions on how to run the code. Moreover, your code should be easy to run with one simple command; if there are multiple steps involved, please make a bash script.
- In-class presentation: 12 points; your final presentation should be clear to the audience and provide a solid review of your work as if you were presenting at a conference. You can find examples in the NeurIPS’20 schedule (Oral Spotlight sessions such as this one).
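As a quick sanity check of the point breakdown above, the arithmetic can be written out as follows. This is only an illustrative sketch: the variable names are made up, and the extra credit of up to 3 points is left out of the totals.

```python
# Readings component, using the session counts and per-session points from the syllabus.
PRESENTING_SESSIONS = 12
NON_PRESENTING_SESSIONS = 12
POINTS_PER_PRESENTATION = 3.5
POINTS_PER_NON_PRESENTING = 1.0
PEER_REVIEW_POINTS = 2.0

readings_max = (
    PRESENTING_SESSIONS * POINTS_PER_PRESENTATION        # 12 * 3.5 = 42
    + NON_PRESENTING_SESSIONS * POINTS_PER_NON_PRESENTING  # 12 * 1 = 12
    + PEER_REVIEW_POINTS                                   # 2
)

# Final project component, one entry per grading category.
FINAL_PROJECT_POINTS = {
    "Proposal": 5,
    "Clarity": 12,
    "Novelty": 5,
    "Code": 10,
    "In-class presentation": 12,
}
project_max = sum(FINAL_PROJECT_POINTS.values())

total_max = readings_max + project_max

print(f"Readings: {readings_max:g}")      # Readings: 56
print(f"Final project: {project_max}")    # Final project: 44
print(f"Total: {total_max:g}")            # Total: 100
```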

Attendance and late work:

If you expect to miss a class session in which you are in a “presenting” role, you should still create the presentation for your assigned session and find someone else to present for you before the day of the presentation. Missing the class session in which you are supposed to present without arranging the aforementioned accommodations will result in a penalty of 12 points from your total grade, as this disrupts the whole class.

If you miss a non-presenting assignment, you’ll get a zero for that session. Final project presentations cannot be postponed, as they are scheduled in the course’s last few sessions and students need to present at their assigned timeslot. You are welcome to switch your timeslot with another group, but you are responsible for making such arrangements. Deadlines for other material, such as the final project submission and report, are negotiable depending on the severity of the circumstances, e.g., medical reasons.

At any time during the course, if you are having difficulty meeting the course deliverables or would like to discuss any concerns, you are welcome to contact me over email, Slack or Piazza. Students can also submit anonymous feedback through this link. Students seeking special accommodations based on disabilities should contact me and also coordinate accessibility arrangements with the Services for Students with Disabilities office.

Honor Code Statement:

All assignments submitted shall be considered “graded work” and all aspects of your coursework are covered by the Honor Code. Students enrolled in this course are responsible for abiding by the Honor Code. The Academic Integrity expectations for Hokies are the same in an online class as they are in an in-person class. Hokies are expected to meet the academic integrity standards at Virginia Tech at all times. For additional information about the Honor Code, please visit https://www.honorsystem.vt.edu/ and read the Graduate Honor System Constitution. Ignorance of the rules does not exclude any member of the University community from the requirements and expectations of the Honor Code.

In this class, you must attribute appropriate credit to existing ideas, facts, methods and external sources of code by citing the source. At all times, you should avoid claiming someone else’s work as your own. This course will have a zero-tolerance philosophy regarding plagiarism or other forms of cheating, and incidents of academic dishonesty will be reported. A student who has doubts about how the Honor Code applies to this course should obtain specific guidance from the course instructor before submitting the respective assignment.