Applying AI for Early Dementia Diagnosis and Prediction

2022-11-17
Illustration: © IoT For All

Mental illnesses, and diseases that cause mental symptoms, are difficult to diagnose because such symptoms are uneven and vary from patient to patient. One such condition is dementia. While it’s impossible to cure dementia caused by degenerative diseases, early diagnosis helps reduce symptom severity with proper treatment or slow down the progression of the illness. Moreover, about 23 percent of dementia cases are believed to be reversible when diagnosed early.

Communicative and reasoning problems are some of the earliest indicators used to identify patients at risk of developing dementia. Applying AI for audio and speech processing significantly improves the diagnostic opportunity for dementia and helps to spot early signs years before significant symptoms develop. 

In this study, we’ll describe our experience creating a speech processing model that predicts dementia risk, including the pitfalls and challenges in speech classification tasks.

AI Speech Processing Techniques

Artificial intelligence offers a range of techniques to classify raw audio information, which often passes through pre-processing and annotation. In audio classification tasks, we generally strive to improve the sound quality and clean up any present anomalies before training the model. 

For classification tasks involving human speech, there are generally two major types of audio-processing techniques used to extract meaningful information:

Automatic speech recognition or ASR is used to recognize or transcribe spoken words into a written form for further processing, feature extraction, and analysis. 

Natural language processing or NLP is a technique for understanding human speech in context by a computer. NLP models generally apply complex linguistic rules to derive meaningful information from sentences, determining syntactic and grammatical relations between words.
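
To make the ASR step concrete, below is a minimal sketch of transcribing a recording with an off-the-shelf speech-recognition model from the Hugging Face hub. The checkpoint name and the audio file are illustrative assumptions, not the exact setup used in this study.

```python
# Minimal sketch: transcribe a recording with an off-the-shelf ASR model.
# The checkpoint and file name below are illustrative, not the study's exact setup.
from transformers import pipeline

asr = pipeline(
    "automatic-speech-recognition",
    model="facebook/wav2vec2-base-960h",  # assumed publicly available ASR checkpoint
)

result = asr("patient_recording.wav")  # hypothetical 16 kHz mono recording
print(result["text"])                  # transcript passed on to NLP/feature extraction
```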

Pauses in speech can also be meaningful to the results of a task, and audio processing models can also distinguish between different sound classes like:

  • human voices
  • animal sounds
  • machine noises
  • ambient sounds

All of the non-speech sounds above may be removed from the target audio files because they can worsen overall audio quality or distort model predictions, as illustrated in the sketch below.
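
As a rough illustration of how such non-speech sounds could be flagged before training, the sketch below tags each clip with a general-purpose audio classifier; the AudioSet-finetuned checkpoint named here is an assumption rather than the tool used in this work.

```python
# Minimal sketch: flag clips dominated by non-speech sound classes so they can be
# cleaned or dropped. The checkpoint name is an assumption for illustration.
from transformers import pipeline

tagger = pipeline(
    "audio-classification",
    model="MIT/ast-finetuned-audioset-10-10-0.4593",  # assumed AudioSet-trained model
)

labels = tagger("clip.wav", top_k=3)  # hypothetical audio clip
if not any("speech" in label["label"].lower() for label in labels):
    print("clip.wav looks like non-speech audio; consider cleaning or removing it")
```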

How Does AI Speech Processing Apply to Dementia Diagnosis?

People with Alzheimer’s disease and other dementias exhibit a number of characteristic communication difficulties, such as struggles with reasoning, problems focusing, and memory loss. This cognitive impairment can be spotted during neuropsychological testing.

If captured in audio recordings, these impairments can be used as features for training a classification model that distinguishes a healthy person from an ill one. Since an AI model can process enormous amounts of data while maintaining classification accuracy, integrating this method into dementia screening can improve overall diagnostic accuracy.

Dementia-detection systems based on neural networks have two potential applications in healthcare:

  1. Early dementia diagnostics. Using recordings of neuropsychological tests, patients can learn about the early signs of dementia long before extensive brain cell damage occurs. Even phone recordings of test sessions appear to be an accessible and fast way to screen the population compared to conventional appointments.
  2. Tracking dementia progression. Dementia is a progressive condition, which means its symptoms tend to progress and manifest differently over time. Classification models for dementia detection can also be used to track changes in a patient’s mental condition and learn how the symptoms develop, or how treatment affects manifestation. 

So now, let’s discuss how we can train the actual model, and what approaches appear most effective in classifying dementia.

How Do You Train AI To Analyze Dementia Patterns?

The goal of this experiment was to detect as many sick people as possible from the available data. For this, we needed a classification model that was able to extract features and find the differences between healthy and ill people. 

The method used for dementia detection applies neural networks both for feature extraction and for classification. Since audio data has a complex and continuous nature with multiple sonic layers, neural networks appear superior to traditional machine learning for feature extraction. In this research, two types of models were used:

  • A speech-representation neural network, which is responsible for extracting speech features (embeddings), and
  • A classification model, which learns patterns from the feature extractor’s output

In terms of data, recordings of the Cookie Theft neuropsychological examination were used to train the model.

Cookie Theft Graphic Task for Dementia Diagnosis

In a nutshell, Cookie Theft is a graphic task that asks patients to describe the events happening in a picture. Since people suffering from early symptoms of dementia experience cognitive problems, they often fail to describe the scene in words, repeat themselves, or lose the narrative thread. All of these symptoms can be spotted in recorded audio and used as features for training classification models.

Analyzing Data

For the model training and evaluation, we used a DementiaBank dataset consisting of 552 Cookie Theft recordings. The data represents people of different ages split into two groups: healthy and those diagnosed with Alzheimer’s disease — the most common cause of dementia. The DementiaBank dataset shows a balanced distribution of healthy and ill people, which means neural networks will consider both classes during the training procedure, without skewing to only one class.

The dataset contains samples with different lengths, loudness, and noise levels. The total length of the dataset is 10 hours 42 minutes, with an average audio length of 70 seconds. During preparation, it was noted that the recordings of healthy people are overall shorter, which is logical since ill people struggle to complete the task.

Audio Length Distribution in DementiaBank Dataset

However, relying on speech length alone doesn’t guarantee meaningful classification results: some patients suffer from only mild symptoms, and the model could become biased toward anyone who describes the picture quickly.
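
For reference, a duration comparison like the one above can be computed in a few lines; the sketch below assumes the recordings are organized into hypothetical control/ and dementia/ folders, which is not the dataset’s actual layout.

```python
# Minimal sketch of the audio-length analysis per group.
# Folder names are hypothetical; DementiaBank is organized differently.
import glob
import numpy as np
import soundfile as sf

for group in ("control", "dementia"):
    durations = [sf.info(path).duration for path in glob.glob(f"{group}/*.wav")]
    print(
        f"{group}: n={len(durations)}, "
        f"mean={np.mean(durations):.1f}s, total={sum(durations) / 3600:.2f}h"
    )
```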

Data Preprocessing

Before actual training, the obtained data has to go through a number of preparation procedures. Audio-processing models are sensitive to recording quality as well as to omitted words in sentences. Poor-quality data may worsen prediction results, since a model may struggle to find relationships in the data when part of a recording is corrupted.

Preprocessing sound entails cleaning up unnecessary noise, improving general audio quality, and annotating the required parts of a recording. Approximately 60 percent of the DementiaBank recordings were initially of poor quality. We tested both AI and non-AI approaches to normalize loudness levels and reduce noise in the recordings.
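
A minimal sketch of the non-AI path is shown below, assuming pyloudnorm for loudness normalization and noisereduce for spectral noise gating; the article only mentions “Python audio processing libraries,” so these particular packages are an assumption.

```python
# Minimal sketch: loudness normalization plus spectral noise reduction.
# Library choice (pyloudnorm, noisereduce) is an assumption for illustration.
import noisereduce as nr
import pyloudnorm as pyln
import soundfile as sf

audio, sr = sf.read("raw_sample.wav")  # hypothetical input recording

# Normalize perceived loudness to a common target of -23 LUFS.
meter = pyln.Meter(sr)
audio = pyln.normalize.loudness(audio, meter.integrated_loudness(audio), -23.0)

# Suppress stationary background noise with spectral gating.
audio = nr.reduce_noise(y=audio, sr=sr)

sf.write("clean_sample.wav", audio, sr)
```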

The MetricGAN model from the Hugging Face Hub was used to automatically improve audio quality, although the majority of the samples weren’t improved enough. Additionally, Python audio-processing libraries and Audacity were used to further improve data quality.
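
For the AI-based path, a sketch along the lines below could be used; the SpeechBrain MetricGAN+ checkpoint referenced here is a publicly available one and is assumed, since the article doesn’t name the exact weights used.

```python
# Minimal sketch: AI-based enhancement with a pretrained MetricGAN+ model.
# The SpeechBrain checkpoint is assumed; the article doesn't name exact weights.
import torch
import torchaudio
from speechbrain.pretrained import SpectralMaskEnhancement

enhancer = SpectralMaskEnhancement.from_hparams(
    source="speechbrain/metricgan-plus-voicebank",
    savedir="pretrained_metricgan",
)

noisy = enhancer.load_audio("noisy_sample.wav").unsqueeze(0)   # hypothetical file
enhanced = enhancer.enhance_batch(noisy, lengths=torch.tensor([1.0]))
torchaudio.save("enhanced_sample.wav", enhanced.cpu(), 16000)  # model works at 16 kHz
```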

For very poor-quality audio, additional preprocessing cycles may be required using different Python libraries or audio mastering tools like iZotope RX. In our case, however, the aforementioned preprocessing steps dramatically increased data quality. During preprocessing, the samples with the poorest quality were deleted: 29 samples (29 minutes 50 seconds of audio), which is only 4 percent of the total dataset length.

Approaches to Speech Classification

As you might remember, neural network models are used in conjunction to extract features and classify recordings. In speech classification tasks, there are generally two approaches:

  1. Converting speech to text, and using text as an input for the classification model training. 
  2. Extracting high-level speech representations to conduct classification on them. This approach is an end-to-end solution since audio data doesn’t require conversion into other formats.

In our research, we use both approaches to see how they differ in terms of classification accuracy.

Another important point is that all feature extractors were trained in two steps. In the first step, the model is pre-trained in a self-supervised way on a pretext (auxiliary) task such as language modeling. In the second step, the model is fine-tuned on downstream tasks in a standard supervised way using human-labeled data.

The pretext task should force the model to encode the data to a meaningful representation that can be reused for fine-tuning later. For example, a speech model trained in a self-supervised way needs to learn about sound structure and characteristics to effectively predict the next audio unit. This speech knowledge can be re-used in a downstream task like converting speech into text.
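
To make the two-step idea concrete, the sketch below loads a self-supervised, pre-trained wav2vec 2.0 checkpoint and runs one supervised fine-tuning step with a hypothetical healthy/dementia label; it illustrates the general recipe, not the study’s actual training code.

```python
# Minimal sketch of the pretrain-then-fine-tune recipe.
# Checkpoint, labels, and the single training step are illustrative assumptions.
import torch
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2ForSequenceClassification

extractor = Wav2Vec2FeatureExtractor(sampling_rate=16000)  # assumed default preprocessing
model = Wav2Vec2ForSequenceClassification.from_pretrained(
    "facebook/wav2vec2-base", num_labels=2  # self-supervised backbone + new 2-class head
)

waveform = torch.randn(16000 * 10)  # stand-in for a 10-second, 16 kHz recording
inputs = extractor(waveform.numpy(), sampling_rate=16000, return_tensors="pt")
labels = torch.tensor([1])          # 1 = dementia, 0 = healthy (hypothetical labeling)

loss = model(**inputs, labels=labels).loss
loss.backward()                     # one supervised fine-tuning step (optimizer omitted)
```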

Modeling

To evaluate the results of model classification, we’ll use a set of metrics that will help us determine the accuracy of the model output.

  • Recall evaluates the fraction of actual dementia recordings that the model correctly identifies. In other words, recall shows how many of the ill patients our model manages to catch.
  • Precision indicates how many of the records classified as dementia actually belong to people with dementia.

The F1 score was used to combine recall and precision into a single number as their harmonic mean. The formula looks like this: F1 = 2*Recall*Precision / (Recall + Precision).

Additionally, for the first approach, where audio is converted to text, Word Error Rate (WER) is used to count the substitutions, deletions, and insertions between the extracted text and the target transcript.
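
The sketch below shows how these metrics might be computed with common Python tooling (scikit-learn and jiwer are assumptions; the article only defines the metrics themselves). The toy labels and transcripts are made up for illustration.

```python
# Minimal sketch: precision, recall, F1 for the classifier and WER for the ASR step.
# Library choice and the toy data are assumptions for illustration.
from jiwer import wer
from sklearn.metrics import precision_recall_fscore_support

y_true = [1, 1, 0, 1, 0, 0, 1, 0]  # 1 = dementia, 0 = healthy (toy labels)
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="binary"
)
print(f"precision={precision:.2f} recall={recall:.2f} F1={f1:.2f}")

reference = "the boy is stealing a cookie from the jar"  # target transcript
hypothesis = "the boy stealing cookie from a jar"         # ASR output
print(f"WER={wer(reference, hypothesis):.2f}")
```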

Approach 1: Speech-to-Text in Dementia Classification

For the first approach, two models were used as feature extractors: wav2vec 2.0 base and NVIDIA NeMo QuartzNet. While these models convert speech into text, the Hugging Face BERT model performs the role of a classifier on the extracted text.

The text extracted by wav2vec 2.0 appeared to be more accurate than the QuartzNet output. On the flip side, wav2vec 2.0 took significantly longer to process audio, which makes it less preferable for real-time tasks. In contrast, QuartzNet is faster thanks to its lower number of parameters.
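
Conceptually, the transcribe-then-classify pipeline looks like the sketch below; the checkpoint names, the file, and the label mapping are illustrative assumptions, and in the actual experiment the BERT head would be fine-tuned on the DementiaBank transcripts.

```python
# Minimal sketch of Approach 1: ASR transcript fed into a BERT text classifier.
# Checkpoints, file name, and label mapping are assumptions for illustration.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer, pipeline

asr = pipeline("automatic-speech-recognition", model="facebook/wav2vec2-base-960h")
transcript = asr("cookie_theft_recording.wav")["text"]  # hypothetical recording

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
classifier = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2  # head to be fine-tuned on dementia transcripts
)

inputs = tokenizer(transcript, truncation=True, return_tensors="pt")
with torch.no_grad():
    probs = classifier(**inputs).logits.softmax(dim=-1)
print({"healthy": probs[0, 0].item(), "dementia": probs[0, 1].item()})
```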

End-to-end Dementia Classification with AI

The next step was feeding the text extracted by both models into the BERT classifier for training. Eventually, the training logs showed that BERT wasn’t learning at all. This could happen due to the following factors:

  1. Converting speech into text means losing information about pitch, pauses, and loudness. Once the text is extracted, there is no way for the feature extractors to convey this information, even though pauses in particular are meaningful for dementia classification.
  2. The BERT model uses a predefined vocabulary to convert word sequences into tokens. Depending on the quality of the recording, the model may drop information it cannot recognize, such as mispronounced or out-of-vocabulary words that would still be meaningful for the prediction.

Since this approach didn’t bring meaningful results, let’s proceed to the end-to-end processing approach and discuss its training results.

Approach 2: End-to-End Processing

Neural networks are a stack of layers, each responsible for capturing certain information. In the early layers, models learn information about raw sound units, also called low-level audio features, which have no human-interpretable meaning. Deeper layers represent more human-understandable features such as phonemes and words.

The end-to-end approach entails the use of speech features from intermediate layers. In this case, speech-representation models (ALBERT or HuBERT) were used as feature extractors. Both feature extractors were used in a transfer-learning setup, while the classification models on top of them were fine-tuned. For the classification task, we used two custom s3prl downstream models: an attention-based classifier originally trained on the SNIPS dataset and a linear classifier originally trained on the Fluent Speech Commands dataset; both were then fine-tuned on the DementiaBank dataset.
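
As a simplified stand-in for the s3prl setup, the sketch below pools frozen HuBERT representations and feeds them to a small trainable head; the checkpoint, the mean pooling, and the linear head are assumptions, since the actual downstream models were the attention-based and linear s3prl classifiers mentioned above.

```python
# Minimal sketch of Approach 2: frozen HuBERT features + a small trainable classifier.
# Checkpoint, pooling, and the linear head are simplifying assumptions.
import torch
from transformers import HubertModel, Wav2Vec2FeatureExtractor

extractor = Wav2Vec2FeatureExtractor(sampling_rate=16000)  # assumed default preprocessing
upstream = HubertModel.from_pretrained("facebook/hubert-base-ls960")
upstream.eval()  # transfer learning: the upstream model stays frozen

classifier = torch.nn.Linear(upstream.config.hidden_size, 2)  # trainable head

waveform = torch.randn(16000 * 10)  # stand-in for a 10-second, 16 kHz recording
inputs = extractor(waveform.numpy(), sampling_rate=16000, return_tensors="pt")

with torch.no_grad():
    hidden = upstream(**inputs).last_hidden_state  # (1, frames, hidden_size)

logits = classifier(hidden.mean(dim=1))            # mean-pool over time, then classify
print(logits.softmax(dim=-1))                      # healthy vs. dementia scores
```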

Dementia Models’ Inference Results

Looking at the inference results of the end-to-end solution, using speech features instead of text with fine-tuned downstream models led to more meaningful results. Namely, the combination of HuBERT and the attention-based classifier showed the most convincing result among all approaches. In this case, the classifiers learned to catch relevant information that helps differentiate between healthy people and those with dementia.

For an explicit description of what models and methods for fine-tuning were used, you can download the PDF of this article.

How to Improve The Results

Given the two different approaches to dementia classification with AI, we can derive a couple of recommendations to improve the model output:

Use more data. Dementia can have different manifestations depending on the cause and the patient’s age, as symptoms will vary from person to person. Obtaining more data samples with dementia speech representations allows us to train models on more diverse data, which can possibly result in more accurate classifications. 

Improve preprocessing procedure. Besides the number of samples, data quality also matters. While we can’t correct the initial defects in speech or actual recording, using preprocessing can significantly improve audio quality. This will result in less meaningful information lost during the feature extraction and have a positive impact on the training.

Alter models. As the end-to-end experiments show, different upstream and downstream models yield different accuracy. Trying other models for speech classification may result in further improvements in classification accuracy.

MobiDev would like to acknowledge and give its warmest thanks to DementiaBank, which made this work possible by providing the dataset.


  • Artificial Intelligence
  • Health and Wellness
  • Healthcare
