Course Title: CS 5604: Information Storage and Retrieval (Fall 2021)
Instructor: Ismini Lourentzou
Teaching Assistant: Makanjuola Ogunleye
Meeting time: Mondays and Wednesdays 5:30-6:45 PM EST, Torgersen Hall 1020
Instructor Office hours: Mondays 4:00-5:00 PM EST, on Zoom
TA Office Hours: Mondays 10:00-11:00 AM and Wednesdays 3:00-4:00 PM EST on Zoom
Course Description:
Welcome to CS 5604! Ever wondered how Google or other search engines work? How can we retrieve relevant information from large-scale collections of documents, videos, news articles, tweets and forum posts in just a few seconds? In this course, students will learn the basics of information storage and retrieval, as well as explore cutting-edge research trends. The expected outcome is for students to gain an understanding of, and hands-on experience with, the underlying technologies used in modern information retrieval systems. We will cover the algorithms and design of search engines and implement our own retrieval models. Other topics include text mining and analysis, indexing, query understanding and expansion, retrieval models (vector space, probabilistic, learning-to-rank, etc.), evaluation and feedback, recommender systems and personalization. We will also explore applications and recent research trends in information storage and retrieval systems.
Prerequisites:
Programming experience with at least one programming language (Python is recommended and will most likely be used for programming assignments), bash scripting, and Linux usage will be helpful. Check out these video lectures with recommended tools and tips to know. Familiarity with basic math concepts (linear algebra, statistics and probability) will be needed. Any prior experience with machine learning, data analytics and natural language processing is a plus; however, all necessary concepts will be re-introduced in this course. The most important components for successful completion are extracting key concepts and ideas from conference papers, being curious when implementing your IR systems, asking questions and participating in class discussions.
Textbooks:
The official textbook for this course is Introduction to Information Retrieval by C. Manning, P. Raghavan, and H. Schütze (Cambridge University Press, 2008). There exist several other good textbooks, for example:
- Text Data Management and Analysis: A Practical Introduction to Information Retrieval and Text Mining by C. Zhai and S. Massung (Morgan & Claypool, 2016)
- Modern Information Retrieval by R. Baeza-Yates and B. Ribeiro-Neto (Addison-Wesley Professional, 2011)
For cutting-edge research trends, we will also look at recent publications in IR conferences such as SIGIR, The Web Conference, ICTIR, ECIR, CIKM and WSDM.
Topics (tentative and subject to change):
- Introduction: What is IR, Notion of Relevance, IR problems, Conceptual Models
- Information Storage: crawling, text analysis (Zipf’s Law, stop-words, stemming, lemmatization, dimensionality reduction, LSI/LSA/LDA, WordNet, etc.), inverted indices, query processing, multi-modal/multi-media information storage, etc.
- Retrieval models: Boolean, Vector Space Model, probabilistic and language models, learning-to-rank, dynamic retrieval, etc.
- Retrieval evaluation and relevance feedback: implicit user feedback, test collections and evaluation methodology, and an introduction to evaluation metrics such as Mean Average Precision (MAP) and Normalized Discounted Cumulative Gain (NDCG).
- Miscellaneous: PageRank, query expansion, visualization and IR interfaces, etc.
- Applications: recommender systems, personalization, online advertising, etc.
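To give a flavor of the evaluation metrics named in the topics above, here is a minimal Python sketch of Average Precision, Mean Average Precision and NDCG for ranked lists. The function names and the toy relevance labels are illustrative assumptions, not course materials. The sketch assumes every relevant document appears somewhere in the ranked list, and it uses the common linear-gain NDCG with a log2 rank discount; other formulations exist.

```python
import math
from typing import Sequence


def average_precision(relevance: Sequence[int]) -> float:
    """AP for one ranked list of binary labels (1 = relevant, 0 = not).

    Assumption: all relevant documents are present in the list, so the
    number of hits equals the total number of relevant documents.
    """
    hits = 0
    precision_sum = 0.0
    for rank, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            precision_sum += hits / rank  # precision at this rank
    return precision_sum / hits if hits else 0.0


def mean_average_precision(runs: Sequence[Sequence[int]]) -> float:
    """MAP: the mean of AP over several queries."""
    return sum(average_precision(r) for r in runs) / len(runs) if runs else 0.0


def dcg(gains: Sequence[float], k: int) -> float:
    """Discounted cumulative gain over the top-k results."""
    return sum(g / math.log2(rank + 1)
               for rank, g in enumerate(gains[:k], start=1))


def ndcg(gains: Sequence[float], k: int) -> float:
    """DCG normalized by the DCG of the ideal (descending) ordering."""
    ideal = dcg(sorted(gains, reverse=True), k)
    return dcg(gains, k) / ideal if ideal > 0 else 0.0


if __name__ == "__main__":
    # Toy example: two queries with binary labels, one with graded gains.
    print(mean_average_precision([[1, 0, 1, 0], [0, 1]]))
    print(ndcg([3, 2, 1], k=3))
```

A ranking that already places documents in descending order of gain scores an NDCG of 1, and pushing relevant documents lower in the list reduces both metrics.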
Assignments:
There are no exams for this course. Grading is based on hands-on assignments and projects. The grading policy is tentative and subject to change, as students will have the opportunity to provide feedback on the grading components and their respective percentages.
- HW Assignments (40%): There will be 3-4 homework assignments. HW will be a mix of programming and written problem sets. In addition, there will be a leaderboard-based assignment for which we will collaboratively create a test collection and participate in a learning-to-rank competition. All homework assignments except the leaderboard competition are due at the start of class. Policies on late submissions will be discussed during the first few class sessions.
- Class Presentation (20%): Students will present papers from related conferences covering specific assigned topics. The goal is to develop literature review and paper reading skills, so paper selection will be flexible (instructions will be provided).
- Final Project (40%): Students are encouraged to work in groups of no more than three members, taking into consideration that the work produced should be proportional to the number of members in a team. The goal is to engage students in information retrieval research in a collaborative environment. Topic selection is flexible; for example, students could try new research ideas, experimentally demonstrate any limitations of related work, extend papers from topics covered in class, design and implement an IR application, etc. Project submissions should include code (descriptive Jupyter Notebooks are recommended), a written report (written in LaTeX) and a final project presentation (PowerPoint or LaTeX). All project reports should be written as a research paper, in a standard conference paper format. Groups are required to include a “contributions” section in the final project report, listing each member’s contributions in detail. Students are encouraged to host their code in a GitHub repository. A list of suggested topics as well as more details about the project proposal will be provided later.
- Piazza Participation (2%): Additional bonus credit will be given for Piazza participation (asking and answering student questions). This will be based on an aggregated list of Piazza stats, including but not limited to Student Participation, Top Student Askers, Top Student Answerers and Top Good Question Askers. TL;DR: be active, share helpful resources and engage in discussions! But do not try to “trick” the grading, since no extra credit will be gained via spamming.
Notes:
- Piazza will be used for announcements, general questions and discussions, etc.
- Please familiarize yourself with LaTeX and paper writing practices.
- All in-class discussions should adhere to Virginia Tech’s Principles of Community.
- At any time during the course, if you are having difficulty meeting the course deliverables or would like to discuss any concerns, you are welcome to contact me over email or Piazza.
- Students seeking special accommodations based on disabilities should contact me and also coordinate accessibility arrangements with the Services for Students with Disabilities office.
- Students are welcome to submit anonymous feedback through this link.
Honor Code Statement:
All assignments submitted shall be considered “graded work” and all aspects of your coursework are covered by the Honor Code. Students enrolled in this course are responsible for abiding by the Honor Code. The Academic Integrity expectations for Hokies are the same in an online class as they are in an in-person class. Hokies are expected to meet the academic integrity standards at Virginia Tech at all times. For additional information about the Honor Code, please visit https://www.honorsystem.vt.edu/ and read the Graduate Honor System Constitution. Ignorance of the rules does not exclude any member of the University community from the requirements and expectations of the Honor Code. In this class, you must attribute appropriate credit to existing ideas, facts, methods and external sources of code by citing the source. At all times, you should avoid claiming someone else’s work as your own. Whenever I learn that a student has violated the honor code, I am obligated to appropriately report the violation. A student who has doubts about how the Honor Code applies to this course should obtain specific guidance from the course instructor well in advance of homework submission.
COVID-19 Classroom Conduct:
Virginia Tech is committed to protecting the health and safety of all members of its community. By participating in this class, all students agree to abide by the Virginia Tech Wellness principles and the guidance stated in the Fall 2021 plans. To adhere to these, you must do the following in this class:
- Wear a mask at all times while in class.
- Wear a mask during all other activities conducted for the class in public indoor areas.
- Isolate yourself from campus if you test positive for COVID or begin to feel symptoms that might be related to COVID.
- Be respectful of the well-being of others by practicing appropriate personal hygiene and by providing appropriate physical distance when feasible.
Masks may be reusable or homemade cloth masks, dust masks, or surgical masks and should fit close to the face to provide thorough filtration of breathed air. Face shields that are open around the sides do not satisfy this requirement and are currently not accepted as a viable alternative by the university. If a student feels that they cannot wear a mask because of health concerns and must use an alternative form of face covering such as a face shield, they should contact Services for Students with Disabilities to request an accommodation. No exceptions for masks will be provided unless there is an official accommodation notice provided by SSD to the instructor. These requirements will not be waived. The instructor has the authority to terminate the class session early if the health and safety requirements are not maintained. Students who fail to follow the requirements will be reported to the Office of Student Conduct. If a student will miss significant class activities because of the need to self-isolate, then the Dean of Students Office should be contacted for an official absence verification. Prolonged absences may be difficult to make up. Students should consult with their advisor about possible options if too much course work is missed to feasibly make up. As pandemic conditions continue to evolve through the semester, these requirements may need to change. The guidance posted by the university at VT Ready should represent the most up-to-date requirements of the university and should be checked periodically for changes.