Many recent successes in language models (LMs) have been achieved within a “static paradigm”, where the focus is on improving performance on benchmarks created without regard to the temporal aspect of the data: for example, answering questions about events that the model could have learned about during training, or evaluating on text drawn from the same period as the training data. However, our language and knowledge are dynamic and constantly evolving. Therefore, to enable a more realistic evaluation of question-answering models for the next leap in performance, it is important to ensure they are flexible and robust when dealing with new and unseen data.
In 2021, we released Mind the Gap: Assessing Temporal Generalization in Neural Language Models and the dynamic language modeling benchmarks for WMT and arXiv to facilitate language model evaluation that takes temporal dynamics into account. In that paper, we highlighted issues that current state-of-the-art large LMs face with temporal generalization and found that knowledge-intensive tokens take a significant performance hit.
Today, we are publishing two papers and a new benchmark that further advance research on this topic. In StreamingQA: A Benchmark for Adapting to New Knowledge Over Time in Question Answering Models, we study the downstream task of question answering on our newly proposed benchmark, StreamingQA: we want to understand how parametric and retrieval-augmented, semi-parametric question-answering models adapt to new information in order to answer questions about new events. In Internet Augmented Language Models via Few-Shot Prompting for Open Domain Question Answering, we explore the power of combining a few-shot prompted large language model with Google Search as a retrieval component. In this way, we aim to improve the model’s factuality while ensuring it has access to up-to-date information for answering a varied set of questions.
StreamingQA: A Benchmark for Adapting to New Knowledge Over Time in Question Answering Models
Knowledge and language understanding of models evaluated through question answering (QA) have commonly been studied on static snapshots of knowledge sources, such as Wikipedia. To study how semi-parametric QA models and their underlying parametric LMs adapt to evolving knowledge, we constructed the new large-scale benchmark, StreamingQA, with human-written and automatically generated questions asked on a given date, to be answered from 14 years of time-stamped news articles (see Figure 2). We show that parametric models can be updated without full retraining, while avoiding catastrophic forgetting. For semi-parametric models, adding new articles into the search space allows for rapid adaptation; however, models with an outdated underlying LM underperform those with a retrained LM.
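To make this setup concrete, below is a minimal sketch of the semi-parametric evaluation loop, assuming hypothetical `retrieve` and `reader` callables that stand in for a real retriever and reader LM; only the time-filtering of the corpus reflects StreamingQA’s time-stamped design.

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class Article:
    published: date  # publication date of the news article
    text: str

def answer(question: str, asked_on: date, corpus: list[Article],
           retrieve, reader) -> str:
    # Only articles published on or before the question date are
    # visible, mirroring StreamingQA's time-stamped evaluation setup.
    visible = [a for a in corpus if a.published <= asked_on]
    # `retrieve` and `reader` are hypothetical stand-ins for a real
    # retriever (e.g. BM25 or a dense retriever) and a reader LM.
    evidence = retrieve(question, visible, k=5)
    return reader(question, evidence)
```

Updating the semi-parametric model then amounts to appending newly published articles to `corpus`, while updating the parametric reader would require (continued) training of the LM itself.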
Internet Augmented Language Models via Few-Shot Prompting for Open Domain Question Answering
Our goal is to capitalize on the unique few-shot capabilities that large-scale language models offer to overcome some of their challenges with respect to grounding in factual and up-to-date information. Motivated by semi-parametric LMs, which ground their decisions in externally retrieved evidence, we use few-shot prompting to learn to condition LMs on information returned from the web using Google Search, a broad and constantly updated knowledge source. Our approach does not involve fine-tuning or learning additional parameters, thus making it applicable to virtually any language model. And indeed, we find that LMs conditioned on the web outperform closed-book models of similar, or even larger, size in answering open-domain questions.
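As an illustration of this prompting setup, the sketch below assembles a few-shot, retrieval-conditioned prompt from search snippets. The prompt format and all names here are our own illustration under stated assumptions, not the paper’s exact format.

```python
def build_prompt(question: str, snippets: list[str],
                 examples: list[tuple[str, str, str]]) -> str:
    # Each few-shot example is an (evidence, question, answer) triple;
    # `snippets` would come from a search engine such as Google Search.
    # The exact prompt layout is illustrative, not the paper's own.
    parts = []
    for evidence, q, a in examples:
        parts.append(f"Evidence: {evidence}\nQuestion: {q}\nAnswer: {a}")
    parts.append(f"Evidence: {' '.join(snippets)}\n"
                 f"Question: {question}\nAnswer:")
    return "\n\n".join(parts)

# A frozen LM simply completes the prompt; `lm_generate` is a placeholder
# for any text-completion call -- no fine-tuning or new parameters needed.
# answer = lm_generate(build_prompt(question, snippets, examples))
```

Because the model itself is untouched, swapping in a different LM only requires changing the completion call, which is what makes the approach broadly applicable.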