
expert reaction to study looking at blood protein ‘signatures’ which could potentially predict risk of long COVID

A study published in Lancet eBioMedicine looks at plasma proteomic signatures and persistent symptoms following SARS-CoV-2 infection.


Prof Kevin McConway, Emeritus Professor of Applied Statistics, The Open University, said:

“This research study does look very interesting to me. I’m a statistician, so can’t comment on the detail of how the patterns of proteins in the participants’ blood might relate to the Covid-19 disease processes. But there is a statistical aspect that I’d like to comment on. It’s about the possibility, mentioned in the research paper and in the top line of the press release, that the pattern of proteins found in someone’s blood might be able to predict whether or not they would develop long Covid.

“Rightly, the researchers are quite cautious about this claim in their research paper. They say no more than that the pattern of proteins in a person’s blood “has the potential to predict those more likely to suffer from persistent symptoms [that is, long Covid].” The press release does appear to be rather more upbeat about this possibility, though the quote from the lead author does make it clear that their “tool predicting long Covid still needs to be validated in an independent, larger group of patients.” You might wonder why all this caution is needed, given that their statistically based tool does appear to do well in the predictions it made for the participants in this study.

“It’s because the researchers were unable to carry out a standard and important aspect of the machine-learning approaches they used to develop their statistical tool for predicting long Covid. The researchers used two methods, commonly used in machine learning, to develop their prediction tool. The primary method, which goes by the odd-sounding name of “random forest”, is a very flexible way of producing predictions. However, an issue with methods like that is that they can pick up patterns in the data that turn out not to have all that much to do with the biology behind what they are actually trying to predict. Those patterns in the data might relate to some aspect that happens to be a feature of the specific patients who provided data for the machine learning and wouldn’t apply in other patients, or sometimes they could even just be random. So it’s standard to do what’s called “validation”, that is, to see how the prediction tool works in a different set of data from that used to develop the tool.

“This validation can be internal, where generally the original set of data is divided into two parts, the tool is developed on one part, and then it is tested on the other part. Usually the tool performs rather less well on the test data set than on the data set used to construct it, because part of the good performance on the original data set will be because some patterns specific to just that part have been built in. But, pretty often, the performance on the test data set is still good and the prediction tool has therefore been shown to be useful.

“Alternatively, or in addition, external validation can be used, where the new prediction tool is tried out on a completely independent data set, perhaps involving data from an entirely separate group of participants from a different place or time.
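
As an editorial illustration (not part of Prof McConway's comment), the internal validation described above can be sketched in a few lines of Python. Everything here is an assumption made for the example: the data are synthetic, the helper names (`split_holdout`, `predict_1nn`, `accuracy`) are hypothetical, and a simple 1-nearest-neighbour rule stands in for a flexible learner. It is not the random forest used in the study.

```python
import random

def split_holdout(rows, labels, test_fraction=0.5, seed=0):
    """Shuffle the indices, then split into a development part and a test part."""
    rng = random.Random(seed)
    idx = list(range(len(rows)))
    rng.shuffle(idx)
    n_test = int(len(idx) * test_fraction)
    test, dev = idx[:n_test], idx[n_test:]
    return ([rows[i] for i in dev], [labels[i] for i in dev],
            [rows[i] for i in test], [labels[i] for i in test])

def predict_1nn(train_rows, train_labels, row):
    """Return the label of the closest training row (squared Euclidean distance)."""
    best = min(range(len(train_rows)),
               key=lambda i: sum((a - b) ** 2 for a, b in zip(train_rows[i], row)))
    return train_labels[best]

def accuracy(train_rows, train_labels, rows, labels):
    """Fraction of rows whose predicted label matches the true label."""
    hits = sum(predict_1nn(train_rows, train_labels, r) == y
               for r, y in zip(rows, labels))
    return hits / len(rows)

# Synthetic data (an assumption): one weakly informative feature plus five noise features.
rng = random.Random(1)
labels = [i % 2 for i in range(200)]
rows = [[y * 1.0 + rng.gauss(0, 1)] + [rng.gauss(0, 1) for _ in range(5)]
        for y in labels]

dev_x, dev_y, test_x, test_y = split_holdout(rows, labels)
dev_acc = accuracy(dev_x, dev_y, dev_x, dev_y)     # scored on the data used to build the tool
test_acc = accuracy(dev_x, dev_y, test_x, test_y)  # scored on the held-out part
print(f"development accuracy: {dev_acc:.2f}, held-out accuracy: {test_acc:.2f}")
```

The development accuracy is perfect by construction, because each row is its own nearest neighbour. The held-out figure is the one that says something about how the tool might behave on new patients.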

“The researchers on this study were unable to carry out any external validation, and only a limited and rather unusual form of internal validation. I’ll explain why, next. But, because of this lack of validation, while their prediction tool certainly looks quite promising, this research can’t provide enough evidence that it can work in a wider context.

“The researchers couldn’t really split their data into a training data set (for developing the tool by machine learning) and a test data set for internal validation, because they didn’t have enough data. The tool was developed using data on the levels of 91 different proteins, but for just 52 patients, those who developed antibodies (“seroconverted”). That’s a pretty small number for developing this kind of predictive tool. Of those 52, just 11 had long Covid in the way defined in this study (persistent symptoms continuing for a year or more). Given that the random forest method can be pretty flexible in the way it learns from the data, it’s not very surprising that the results from applying the tool to just these 52 patients seem to correspond exactly to whether they, in fact, had long Covid.
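
As an editorial illustration (not part of Prof McConway's comment), the sketch below shows why a perfect match on 52 patients and 91 measurements is weak evidence on its own. It uses the sample sizes quoted above, but the "protein levels" are pure random noise, and a 1-nearest-neighbour rule is an assumed stand-in for a flexible learner, not the study's random forest. All names are hypothetical.

```python
import random

# Sample sizes quoted in the comment above; the measurements are simulated noise.
N_PATIENTS, N_PROTEINS, N_LONG_COVID = 52, 91, 11

rng = random.Random(0)
proteins = [[rng.gauss(0.0, 1.0) for _ in range(N_PROTEINS)]
            for _ in range(N_PATIENTS)]
labels = [1] * N_LONG_COVID + [0] * (N_PATIENTS - N_LONG_COVID)

def sq_dist(a, b):
    """Squared Euclidean distance between two measurement vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def predict(i, leave_out_self):
    """Label of the nearest patient to patient i, optionally excluding i itself."""
    candidates = [j for j in range(N_PATIENTS) if not (leave_out_self and j == i)]
    nearest = min(candidates, key=lambda j: sq_dist(proteins[i], proteins[j]))
    return labels[nearest]

def accuracy(leave_out_self):
    """Fraction of patients whose predicted label matches their true label."""
    hits = sum(predict(i, leave_out_self) == labels[i] for i in range(N_PATIENTS))
    return hits / N_PATIENTS

# Accuracy of always predicting "no long Covid": 41 of 52 patients.
baseline = (N_PATIENTS - N_LONG_COVID) / N_PATIENTS
# Scored on the same data used to fit: perfect by construction, even on noise.
resubstitution = accuracy(leave_out_self=False)
# Each patient predicted from the other 51 only.
leave_one_out = accuracy(leave_out_self=True)

print(f"always-negative baseline: {baseline:.3f}")
print(f"resubstitution accuracy:  {resubstitution:.3f}")
print(f"leave-one-out accuracy:   {leave_one_out:.3f}")
```

Two things are worth noting. A rule that ignores the proteins entirely already classifies 41 of the 52 patients correctly, and a flexible enough method reproduces every label in the data it was fitted to even when the measurements carry no signal at all. Only the leave-one-out figure, where each patient is predicted from the others, is informative here.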

“The researchers did carry out a limited type of internal validation, by applying an entirely different machine learning method to the same data. This method, linear discriminant analysis (LDA), is very old (developed in the 1930s and 1940s), but that certainly doesn’t mean it is no good. It can certainly be an acceptable approach in machine learning, and it is quite often used, even if it is less flexible than random forest methods. LDA also performed well, with only two participants being misclassified in terms of whether they had long Covid. But given that it is still based on results from a rather small number of participants, only 11 of whom had long Covid, but is based on quite a large number of protein measurements, I don’t really consider this to be a particularly rigorous internal validation.

“I assume that no external validation was done because the researchers did not have access to an independent data set from a different group of participants, and that’s why the lead author points out that some external validation is needed before they can be sure that their approach really does work well.

“What I’m concerned about, though, is where such a data set might come from. All the data in this study comes from the first wave of the Covid pandemic, before new virus variants emerged and before vaccines were developed, and when the participants (who were all health care workers) were subject to a quite specific set of conditions in their work and in the country generally. The researchers mention this as a limitation in their research paper, pointing out that looking at other variants or at vaccination was beyond the scope of their study. But presumably the patterns of proteins in the blood of infected people might be different in patients infected with a different variant, or after being vaccinated, or even just at a different time in the pandemic. So, in a data set from after the emergence of new variants and/or after vaccination, the specific prediction tool developed in this study might not work well simply because the protein patterns have changed. A validation in a data set like that won’t necessarily tell us much about how good the original prediction tool was, though of course the general approach might well still work in the new data (and maybe internal validation in the new data set would be possible).”



‘Plasma proteomic signature predicts who will get persistent symptoms following SARS-CoV-2 infection’ by Gabriella Captur et al. was published in Lancet eBioMedicine at 00:01 UK time on Wednesday 28 September.




Declared interests

Prof Kevin McConway: “I am a Trustee of the SMC and a member of its Advisory Committee. My quote above is in my capacity as an independent professional statistician.”
