AI in Medicine Is Overhyped

2022-10-18

We use tools that rely on artificial intelligence (AI) every day, with voice assistants like Alexa and Siri being among the most common. These consumer products work reasonably well—Siri understands most of what we say—but they are by no means perfect. We accept their limitations and adapt how we use them until they get the right answer, or we give up. After all, the consequences of Siri or Alexa misunderstanding a user request are usually minor.

However, mistakes by AI models that support doctors’ clinical decisions can mean life or death. Therefore, it’s critical that we understand how well these models work before deploying them. Published reports of this technology currently paint a too-optimistic picture of its accuracy, which at times translates to sensationalized stories in the press. Media are rife with discussions of algorithms that can diagnose early Alzheimer’s disease with up to 74 percent accuracy or that are more accurate than clinicians. The scientific papers detailing such advances may become foundations for new companies, new investments and lines of research, and large-scale implementations in hospital systems. In most cases, the technology is not ready for deployment.

Here’s why: As researchers feed data into AI models, the models are expected to become more accurate, or at least not get worse. However, our work and the work of others have identified the opposite: the reported accuracy of published models decreases as the size of the data set increases.

The cause of this counterintuitive scenario lies in how scientists estimate and report a model’s accuracy. Under best practices, researchers train their AI model on a portion of their data set, holding the rest in a “lockbox.” They then use that “held-out” data to test their model for accuracy. For example, say an AI program is being developed to distinguish people with dementia from people without it by analyzing how they speak. The model is developed using training data that consist of spoken language samples and dementia diagnosis labels, to predict whether a person has dementia from their speech. It is then tested against held-out data of the same type to estimate how accurately it will perform. That estimate of accuracy then gets reported in academic publications; the higher the accuracy on the held-out data, the better the scientists say the algorithm performs.
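
To make that workflow concrete, here is a minimal sketch in Python. Everything in it is a stand-in: the synthetic feature matrix plays the role of speech-derived features, the synthetic labels play the role of dementia diagnoses, and logistic regression stands in for whatever model a study actually uses. The only point is the shape of the procedure: fit on one portion of the data, then estimate accuracy on a held-out portion the model never saw.

```python
# Minimal sketch of hold-out evaluation. The data and model are synthetic
# stand-ins, not the dementia-speech study described above.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 20))             # 200 "speakers", 20 speech-derived features
y = (X[:, 0] + rng.normal(size=200) > 0)   # synthetic diagnosis labels

# Split once; the held-out portion goes into the "lockbox" and is not
# touched again until the model is finished.
X_train, X_heldout, y_train, y_heldout = train_test_split(
    X, y, test_size=0.3, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# The accuracy on the held-out data is the number that gets reported.
print("held-out accuracy:", accuracy_score(y_heldout, model.predict(X_heldout)))
```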

And why does the research say that reported accuracy decreases with increasing data set size? Ideally, the held-out data are never seen by the scientists until the model is completed and fixed. However, scientists may peek at the data, sometimes unintentionally, and modify the model until it yields a high accuracy, a phenomenon known as data leakage. By using the held-out data to modify their model and then to test it, the researchers are virtually guaranteeing the system will correctly predict the held-out data, leading to inflated estimates of the model’s true accuracy. Instead, they need to use new data sets for testing, to see if the model is actually learning and can look at something fairly unfamiliar to come up with the right diagnosis.
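
A toy simulation makes the effect of that peeking visible. In the sketch below, which is an illustration only and not any published group’s workflow, the labels are pure noise, so no model can genuinely do better than about 50 percent accuracy. Yet repeatedly tweaking the model and keeping whichever version happens to score best on the held-out set yields a “reported” accuracy well above chance, while a genuinely unseen data set exposes the true, chance-level performance.

```python
# Toy demonstration of data leakage through repeated "peeks" at held-out data.
# The labels are random noise, so true accuracy is ~50% no matter what.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(1)

def make_data(n_samples, n_features=50):
    # Features carry no information about the labels.
    return rng.normal(size=(n_samples, n_features)), rng.integers(0, 2, size=n_samples)

X_train, y_train = make_data(60)
X_heldout, y_heldout = make_data(60)   # the "lockbox" set
X_fresh, y_fresh = make_data(2000)     # new data nobody ever tuned against

# Leaky workflow: keep modifying the model (here, which features it uses)
# and keep whichever version scores best on the held-out set.
best_acc, best_model, best_cols = 0.0, None, None
for _ in range(200):
    cols = rng.choice(50, size=5, replace=False)
    candidate = LogisticRegression(max_iter=1000).fit(X_train[:, cols], y_train)
    acc = accuracy_score(y_heldout, candidate.predict(X_heldout[:, cols]))
    if acc > best_acc:
        best_acc, best_model, best_cols = acc, candidate, cols

print("reported (leaky) held-out accuracy:", round(best_acc, 2))   # well above 0.5
print("accuracy on truly unseen data:",
      round(accuracy_score(y_fresh, best_model.predict(X_fresh[:, best_cols])), 2))  # ~0.5
```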

While these overoptimistic estimates of accuracy get published in the scientific literature, the lower-performing models are stuffed in the proverbial “file drawer,” never to be seen by other researchers; or, if they are submitted for publication, they are less likely to be accepted. The impacts of data leakage and publication bias are exceptionally large for models trained and evaluated on small data sets. That is, models trained with small data sets are more likely to report inflated estimates of accuracy; therefore we see this peculiar trend in the published literature where models trained on small data sets report higher accuracy than models trained on large data sets.
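
A short simulation shows how these two forces combine. Assume, purely hypothetically, a model with no real predictive power (true accuracy of 50 percent), many groups evaluating it on held-out sets of different sizes, and a publication filter that only lets through results above some “impressive” threshold. The smaller the held-out set, the larger the chance fluctuations, so the more often a result clears the threshold and the more inflated the published accuracies become.

```python
# Toy simulation of publication bias acting on small evaluation sets.
# All numbers here are hypothetical and chosen only for illustration.
import numpy as np

rng = np.random.default_rng(2)
true_accuracy = 0.5        # the model has no real predictive power
publish_threshold = 0.6    # assumed cutoff for an "impressive" result
n_groups = 10_000          # imaginary research groups evaluating the model

for n_heldout in (25, 100, 1000):
    # Each group's measured accuracy fluctuates around 0.5; the fluctuation
    # is larger when the held-out set is small.
    accuracies = rng.binomial(n_heldout, true_accuracy, size=n_groups) / n_heldout
    published = accuracies[accuracies >= publish_threshold]
    mean_published = published.mean() if len(published) else float("nan")
    print(f"held-out size {n_heldout:4d}: "
          f"{len(published) / n_groups:6.2%} of results published, "
          f"mean published accuracy = {mean_published:.2f}")
```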

We can prevent these issues by being more rigorous about how we validate models and how results are reported in the literature. After determining that development of an AI model is ethical for a particular application, the first question an algorithm designer should ask is “Do we have enough data to model a complex construct like human health?” If the answer is yes, then scientists should spend more time on reliable evaluation of models and less time trying to squeeze every ounce of “accuracy” out of a model. Reliable validation of models begins with ensuring we have representative data. The most challenging problem in AI model development is the design of the training and test data itself. While consumer AI companies opportunistically harvest data, clinical AI models require more care because of the high stakes. Algorithm designers should routinely question the size and composition of the data used to train a model to make sure they are representative of the range of a condition’s presentation and of users’ demographics. All datasets are imperfect in some ways. Researchers should aim to understand the limitations of the data used to train and evaluate models and the implications of these limitations on model performance.
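
As one small, hypothetical example of that kind of questioning, a few lines of code can compare the composition of a training set against the population a model is meant to serve. The column names and reference proportions below are invented for illustration; in practice the reference figures would come from the intended clinical population.

```python
# Illustrative composition check. The demographic columns and the reference
# proportions are hypothetical, not drawn from any real study.
import pandas as pd

train = pd.DataFrame({
    "age_group": ["65-74"] * 120 + ["75-84"] * 60 + ["85+"] * 20,
    "sex":       ["F", "M"] * 100,
})

# Assumed make-up of the clinical population the model is meant to serve.
target = {
    "age_group": {"65-74": 0.45, "75-84": 0.35, "85+": 0.20},
    "sex":       {"F": 0.55, "M": 0.45},
}

for column, expected in target.items():
    observed = train[column].value_counts(normalize=True)
    report = pd.DataFrame({
        "in_training_data": observed,
        "intended_population": pd.Series(expected),
    })
    print(report.round(2), "\n")
```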

Unfortunately, there is no silver bullet for reliably validating clinical AI models. Every tool and every clinical population is different. To arrive at satisfactory validation plans that take into account real-world conditions, clinicians and patients need to be involved early in the design process, with input from stakeholders like the Food and Drug Administration. A broader conversation is more likely to ensure that the training data sets are representative; that the parameters for judging whether the model works are relevant; and that what the AI tells a clinician is appropriate. There are lessons to be learned from the reproducibility crisis in clinical research, where strategies like pre-registration and patient centeredness in research were proposed as a means of increasing transparency and fostering trust. Similarly, a sociotechnical approach to AI model design recognizes that building trustworthy and responsible AI models for clinical applications is not strictly a technical problem. It requires deep knowledge of the underlying clinical application area, a recognition that these models exist in the context of larger systems, and an understanding of the potential harms if model performance degrades when deployed.

Without this holistic approach, AI hype will continue. And this is unfortunate because technology has real potential to improve clinical outcomes and extend clinical reach into underserved communities. Adopting a more holistic approach to developing and testing clinical AI models will lead to more nuanced discussions about how well these models can work and their limitations. We think this will ultimately result in the technology reaching its full potential and people benefitting from it.

The authors thank Gautam Dasarathy, Pouria Saidi and Shira Hahn for enlightening conversations on this topic. They helped elucidate some of the points discussed in the article.

This is an opinion and analysis article, and the views expressed by the author or authors are not necessarily those of Scientific American.
