Breast Series: Emerging Technologies (2023)
RC31519-2023
Video Transcription
Pleasure to be here. So, first let's begin with the motivation for developing CAD for mammography. This was originally to assist radiologists in identifying subtle cancers that might otherwise be missed. So, the CAD is going to display marks at potentially abnormal areas, and then it's up to the radiologist to determine whether that area warrants further evaluation or may be dismissed. So, in 1998, CAD was approved by the FDA. By 2002, CMS allowed for reimbursement, and these two events really led to the rapid dissemination of CAD. So, in 2001, only 5 percent of screening mammograms were performed with CAD. By 2008, that number jumped up to 74 percent. In terms of healthcare dollars, in 2009, CAD cost over $400 million, and CAD is still currently used for most screening mammograms performed in the U.S. Many early, well-done studies had shown that there was an increase in the diagnostic accuracy of radiologists who used CAD, but unfortunately, subsequent multi-center studies could not reproduce these results. This is a large study done in the U.S. You can see many mammograms, many radiologists, 66 sites, including academic and community practices, and the authors found no difference in cancer detection rate. Looking at the graph here, you can see that these ROC curves are superimposed, which means similar performance. If you look carefully, the authors found that the sensitivity of radiologists who did not use CAD was actually a little bit higher than those who did. So, I would say, at best, CAD delivered no benefit, and at worst, it actually reduced the radiologist's accuracy. So, this makes CAD the classic example to use for the Gartner hype cycle. This is what happens when a new technology is introduced. When CAD came out, there were lots of inflated expectations. Then, as we used it, there was a little bit of disillusionment with it, and the term for that is an AI winter, where many of us considered CAD to be a disappointment. And what I'm going to do is actually superimpose the reimbursement curve on this cycle. So, you can see, when it first came out, it was paid $15 per exam, and a lot of that was related to the introduction of PACS and digital mammography and intense efforts by the vendors to get paid for this, but you can see this precipitous drop in reimbursement: by 2016, $0.18, and now it's bundled. We don't even get paid for CAD. So, I think now, when we do a root cause analysis and go back, the issue really is that traditional CAD did not address the main problem in screening. It really focused on sensitivity at the expense of decreased specificity, and these traditional systems did not aid with decision-making. We know now that the algorithms had insufficient data for training compared to what we're using for our models today. Reimbursement certainly took away the incentive to improve your model, and these are what we call brittle systems that can only focus on a narrow task. Now, I'm going to try to conceptually explain to you what a deep learning model is. This is where you're going to have your input, the image here of a mammogram; the black box is whatever your computer algorithm is computing; and the output is the prediction. So, what you have is your mammogram, and then there's going to be a filter, looking at very simple things, maybe pixel analysis. 
Then it gets more detailed, zooming in, looking at edge enhancement and at other features that eventually are really imperceptible to the human eye, and a deep learning model is just able to rapidly process this large amount of information, again looking at detailed features that we can't perceive. I'll give the analogy I use when I explain this to my clinicians: think of it as the facial recognition app that we all have on our phones. The first layer is just simple pixels, light or dark; it gets a little bit more sophisticated, and now we're identifying the edges; and then afterwards, the neural network just learns to use layers of increasingly complex rules to characterize complicated shapes, such as faces. So, to summarize, the limitations of CAD were a slightly lower sensitivity and a much lower specificity, which translates into higher recall and biopsy rates; it showed increased cost without improvement in performance; and there was also a poor user experience. I think many of us have what I call alert fatigue due to high false positive rates. In contrast, with deep learning models, the important point is that they're flexible; we call them task agnostic. This means that the model simply learns from the data that it's given. As I showed you, it exploits features that hopefully are more meaningful and also more broadly useful. So, deep learning tools can perform multiple similar tasks well and can be fine-tuned on new tasks with much less effort than the traditional CAD systems, and there's much hope about this concept of precision health: the model is going to analyze pixel-level features of the images, along with patient-level and genomic-level variables. So, now I'm going to shift gears and talk about some of the current applications of AI and the roles for AI in the breast cancer care continuum. I'm going to highlight a few papers in this section. So, this is a diagram showing you the many tasks, in the purple box, that deep learning tools may perform to assist with the breast cancer care continuum. I'm going to start with lesion analysis of the patient's cancer, where you see lots of studies looking at radiomics, which can be used to predict prognosis, likelihood of recurrence, and treatment response, very meaningful for the individual woman. And on the bottom, you look at automated breast density, as well as breast cancer risk prediction. I'm going to begin with the DREAM Challenge. This was an open, crowdsourced algorithmic analysis challenge to reduce the recall rate. We actually had a hot topic session two years ago, Wednesday, 7 a.m., standing room only. You know, there was just such enthusiasm for this. This was a relatively large database, big money; the winner would get a million dollars. And, again, it was well received, with over 1,300 participants. The best team had an AUC of 0.9. However, even when the top three teams worked together, they did not beat the community-based radiologists. Both the radiologists and the challengers had a similar sensitivity, but you see that the radiologists had a much higher specificity. And here, I'm going to plant a point and come back to it: I think, conceptually, in developing models for the screening setting, we should really be focusing on trying to obtain as high a negative predictive value as possible. With that in mind, I'm just going to brief you on this study coming to us from colleagues in the Netherlands. Here, the authors wanted to compare the diagnostic accuracy of radiologists who interpreted mammograms with and without the AI system. This is an image from their system. 
So, this is an interactive decision support tool. If I were to identify this as being suspicious, I click on that, and that activates the decision support, giving the likelihood of malignancy on both views. It also gives a final global score; on their scale, 10 means it's very suspicious. So, initially, only four radiologists called this back, but with the AI system, that jumped up to 11. And this turned out to be a small invasive carcinoma. Looking at the reader performance, you see all the radiologists; it's similar. The dashed line is the ROC curve of the model. And of note, you see a small improvement in sensitivity, but more importantly, the specificity was maintained. And what the authors did cleverly is, at the end, they looked at how the AI does on its own. You can see the AUC was 0.89, quite similar to the radiologists. This, to me, is the most interesting slide from their study. I'm going to focus on this right-hand side. You can see that score I mentioned before, from 1 to 10. Those that are scored 1 through 5, think of those as mainly negative mammograms, and what do you see? The radiologists actually have a shorter interpretation time using this model. This makes sense to me. I think of it as having the help of an excellent resident or fellow; you just get through these images quicker. And again, this is the importance of having that high negative predictive value. Mammograms with a higher score are more complex, so it's not unexpected that it's going to take us a longer time to read those studies. And overall, when the authors analyzed the reading time, they found that, across all of the readers, it saved about 5% of their interpretation time. So, I just want to emphasize that this one system that I selected has actually shown many more possible applications of what I call an AI-based CAD system. They were able to use interactive decision support, determine the likelihood of malignancy, and also function as an independent second reader. I want to emphasize that this improvement was seen across all breast categories, regardless of lesion type and vendor. So, for me, the AI helped the radiologists, with a small increase in sensitivity, but it didn't slow them down. This is a very important workflow issue for us, and it has much more value than a traditional CAD system, which I've already shown you earlier. And the implication here is not only that AI can function as a second reader, but that additional studies will be necessary to see if those mammograms with a very low likelihood of malignancy may be completely triaged and interpreted by the AI systems alone. This is something that's gaining traction, for example in chest X-rays. And at the FDA level, they're now thinking about defining categories of risk based on the level of human supervision over the algorithm. Next, I'm just going to describe the model that my team has developed. We had about 200,000 mammograms, and we also did a reader study. I just want to show you what we did with an example screening mammogram. You can see there's this mass here, and what we superimposed were these heat maps, or saliency maps, in red. This is trying to help radiologists understand what's in the black box; we call it explainable AI. So, in red are those pixels that the computer focused on most to make its prediction. And if you look, that's the area where the mass is most irregular and spiculated, and this was an invasive ductal carcinoma. Another example is a 67-year-old woman. 
A dense, irregular mass, an obvious cancer; green is what we used to indicate what the model determined to be benign. And what we found is that just highlighting the most suspicious region actually improved the interpretation time for the mammograms. Again, the whole goal is to have increased transparency. So, I'm just going to spend a little bit of time on this. This is complicated, but there are a couple of important take-home points. We're looking at how the radiologists did. First, individual radiologists are at 0.78. This number is actually really consistent in the literature, whether using mammography or MR. We showed the model did much better, at 0.88, but not as well as when you assemble all of the radiologists together. But if I'm a radiologist and I'm using my model, I can get to the performance of all the radiologists combined. And of course, the best information is if you, you know, throw in the kitchen sink: you have the results of all the physicians as well as the model. But an important take-home point I want to stress is that, really, since the advent of these clinical decision support tools, human-machine collaborations have performed better than either one alone, and all these studies we're seeing on radiology AI systems are really no different. So, let's just take a break for a minute. Here at this meeting, vendors have been knocking on your door, emailing you, asking you to look at their software, and most of us say, you know, it looks cool, but I'm not paying for it. But what if the model can really predict with high accuracy that a study is negative or normal? If the negative predictive value is high, that allows me to use a system with much more confidence, and I think of it as equivalent to having the Ted Keats normal variants atlas in your back pocket. So, if I have this, it will add value to my practice and I'm willing to pay for it. Again, going back to that concept of high negative predictive value: high NPV is going to equal value. Once it tells me it's normal, okay, then it's up to me to figure out how to analyze the other findings and how they fit into the context of the patient. So, we've talked about lesion analysis. There are lots of models out there now; the vendors will show them. They can predict breast density. I'm just going to focus on breast cancer risk prediction, showing you two papers. This first one comes to us from our colleagues at MGH and MIT, and here the authors wanted to develop a mammography-based deep learning tool to predict breast cancer risk and to compare its performance to a conventional breast cancer risk assessment tool. We already know that the dense breast, you know, is a risk factor, but here the authors are saying, well, what if you analyzed the texture complexity of the mammogram directly? The hypothesis is that the parenchymal pattern is a unique signature, quite specific to a woman, that might be associated with her risk. This builds upon prior texture analysis that has shown this: with dense breast tissue, you think, okay, it could be related to where it's located, the nodularity, but again, many of these are normal features that the human eye cannot perceive. We also know this is seen whether you look at the normal pattern on ultrasound or at the background parenchymal enhancement on MR. So, let's look at their results. 
If you look at the predictive models, in blue here is the Tyrer-Cuzick model, 0.61, and what the authors showed in green is that if you base the prediction on the mammographic features, it's much better, 0.68. And this shows what I emphasized before: more information, incorporating the risk factors plus the mammographic information, gives the best AUC, as you see. An important point is looking at performance in women by ethnicity. Looking at Caucasian women, the models all do relatively well, and again, the hybrid model performs the best. If you look at African-American women, you can see the Tyrer-Cuzick is 0.45, and although the authors didn't have a lot of African-American women, this is an important point. You know, Tyrer-Cuzick is no better than a flip of a coin, and again, you can see that the hybrid model really performs quite well and is not affected by ethnicity. This makes sense: most of the risk models for cardiovascular risk or breast cancer have only been validated in Caucasian men and women. So, here I'm showing you a montage of all these normal mammograms, and the deep learning model could predict which of these patterns were more prone to developing breast cancer. So, take-home point: radiomics is a powerful tool, and there really is a role for a mammography-based AI tool to predict long-term risk beyond breast density. So, the final paper, which I think was a tour de force, is a very large study that really combines AI interpretation of the mammogram with the EHR. So, this is identifying risk factors beyond the Tyrer-Cuzick model that I showed you, and the hypothesis is that combining EHR data with image interpretation is going to lead to a better diagnosis. So, the authors actually developed two networks. One is for mammography, similar to what I showed you earlier. The other looks at the clinical EHR data. By definition, women had to have EHR data from a year before the mammogram. For each woman, they had over 1,300 parameters. What are these parameters? Basic ones like age, weight, and family history, but then lots of routine lab tests, blood tests, urine tests, and so forth. And they wanted to see which of these clinical features could predict which woman would develop a breast cancer in one year. So, I'm going to go through this graph here, first on the left. In terms of those features predicting future breast cancer, some we know: the previous priority score, the age, whether the patient's symptomatic. But it turns out thyroid function tests and white blood cell profiles also actually help to make a better prediction. On the right here are features that contribute to predicting a normal mammogram in two years. Again, these are not factors we routinely incorporate into any risk scores. So, the model could predict which patients would develop breast cancer in the future, but interestingly, it could also predict which women would have a normal mammogram. EHR data alone, with no information from the mammogram: 0.8. Using the mammogram-only model, similar performance: 0.8. The combination does a little bit better. So, for me, the take-home point here is really that precision medicine is at our fingertips. By allowing AI to comb through the medical records to discover additional clues, you know, it can make an earlier or better diagnosis, and it can identify these unexpected risk factors. One of the lessons that we learned is this issue called generalizability. Now, I've shown you all of these models. They all make great predictions, but are they going to be consistent across sites, different vendors, different imaging protocols? 
Will they work as well on mammography or DBT, in different patient populations? And this is important because all of us are training our models on single-institution data sets, and we want to identify any potential bias. So, just a word about assessing the performance of your algorithm. We ideally want the category in the middle, appropriate fitting: this red line is really able to separate most of the X's and O's. If you look at the performance and you have this straight line, this is called underfitting. This means that the model is really too simple, usually because you trained with a sparse data set or your model just has too few parameters. What we look for when we're looking at manuscripts and reviewing grants, you know, is really this concept of overfitting. If you draw this line, that's overfitting, and the results are just too good to be true, because your model has been trained to predict the training data too well. So, going forward, I think that in order for AI to be implemented into our clinical workflow, it's really essential to see how AI works with respect to real-life considerations, and these are the metrics that we're being asked about now. Does the algorithm have a measurable impact on efficiency? Is it reducing variability, you know, between radiologists, between exams? And most importantly, are there going to be sustained improvements in outcomes? And something else that we've learned along the way is that we need to have multiple stakeholders involved as we're, you know, marching on to clinical implementation. So, in terms of stakeholders, this is a recent multi-society statement on the ethics of AI in radiology. Here, we have the ACR and RSNA, as well as European and Canadian societies, and this really outlines key ethical issues as these technologies are developed and implemented in clinical practice, again trying to avoid inherent biases in the system. The other thing I want to show you is that last year, the RSNA sponsored a big summit. Here, the idea is to foster a healthy ecosystem for all of our AI tools, and the word that jumps out at me in this is that there's a big need for standardization across the board. The stakeholders were from academia, industry, societies, and government agencies, and the idea is to develop a guide to allow for the, you know, development and dissemination of AI algorithms throughout healthcare. I'll give an example by showing you image annotation. Quite often now, you will see papers where the label, whether it's cancer or no cancer, is just extracted from a report using natural language processing. Then, sometimes you will have an image-level label for a given view, and then, going from left to right, a bounding box, a polygon, and a pixel-level label, which is probably the most accurate. So, a point here is that your algorithm's output is limited by the type of annotation. Moving from left to right, you get more information, but it requires a lot more input, and, you know, radiologists are expensive. Something that we've been discussing a lot behind the scenes at this meeting is whether there will be a reference for an acceptable curated dataset. The other thing that we've talked about a lot, amongst all the stakeholders, is that the effect of AI right now is mostly an illusion. There are two big impediments. One is the scarcity of training datasets, and the other is sort of the sluggish march to regulatory approval. 
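To make the annotation levels described above a bit more concrete, here is a minimal sketch, in Python, of how the four levels of labeling for a single mammographic view might be represented. The field names and example values are hypothetical, for illustration only; they are not drawn from any of the datasets or papers mentioned in this talk.

```python
# Minimal sketch of increasingly detailed annotation levels for one view.
# All names and values are illustrative only.
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

import numpy as np


@dataclass
class ViewAnnotation:
    # Image-level label, e.g. extracted from the report with NLP: cancer / no cancer.
    image_label: Optional[str] = None
    # Bounding box around the lesion: (row_min, col_min, row_max, col_max) in pixels.
    bounding_box: Optional[Tuple[int, int, int, int]] = None
    # Polygon outline of the lesion: list of (row, col) vertices.
    polygon: List[Tuple[int, int]] = field(default_factory=list)
    # Pixel-level mask, same shape as the image: 1 inside the lesion, 0 elsewhere.
    pixel_mask: Optional[np.ndarray] = None


# Going left to right in the talk's example: each level carries more information
# but costs more radiologist time to produce.
weak_label = ViewAnnotation(image_label="cancer")
box_label = ViewAnnotation(image_label="cancer", bounding_box=(410, 220, 520, 330))
mask = np.zeros((3000, 2400), dtype=np.uint8)
mask[430:500, 240:310] = 1
strong_label = ViewAnnotation(image_label="cancer", pixel_mask=mask)
```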
So, this kind of gives you, I think, the three key ingredients to building a model. Really strong computers: the GPUs get, you know, bigger and more expensive each year. Model size: you've seen how quickly the number of layers, the complexity, and all the weights have grown. But publicly available datasets, that has remained constant over the years. I want to introduce a concept that we're really trying to advance at this meeting. It's been around for a while; it's called federated learning. Here, the idea is that you have all of your data, and it stays in your institution, but your model can train on multiple institutions' datasets, and the data never leaves each institution. How does that work? You have a very secure cloud server, and it allows multiple institutions to collaborate. You can go ahead and train and test your model, but you don't need to directly share sensitive clinical data with each other. So, I'm really hoping that this will be a way for us to leverage each other's datasets and to really test the reproducibility of our algorithms in a much faster way than we can now. Finally, still talking about stakeholders, I'm just going to spend the closing two minutes talking from the standpoint of journals, specifically within radiology. Over the last few years, we have seen a big increase in the number of papers on AI and radiomics. This is just in the Gray Journal alone. We're seeing this in the other subspecialty journals as well, and it's predicted to be about 25 percent by the end of this year. So, what we did, under David Bluemke and the rest of the editorial board, was decide to develop a checklist to improve the soundness and applicability of AI in imaging. This is really an interim step until we develop guidelines for AI in radiology, and we're going to target those for the Radiology: Artificial Intelligence journal. What we came up with were nine key considerations that we think authors, reviewers, and readers should use when preparing a manuscript or reading a paper in our journal. I don't have time to go over all of this, so I'm just going to mention the last point, which is: is the AI algorithm publicly available? Now, for our whole suite of journals, we request that the authors give enough information for other readers to replicate their study. And all authors have to take the computer code they used for modeling and deposit it in some sort of publicly accessible repository. So, in their papers, the authors have to give us a link to where the software can be found, usually a DOI. And I'll emphasize again: we don't want the algorithm itself, we just want a link to it. I'd say, by far, GitHub is the most popular repository, and to date, no authors have turned down this request. So, this is my last slide. Let me go back to my title: what are the lessons that we've learned from CAD and AI? Well, first, we now know CAD failed because there was a difference between its performance in controlled multireader studies and how it actually performed in clinical practice. But we've learned from the past, from these decades of hard-won experience, and now we can use that to guide us as we design, test, and validate AI and develop policy and regulation. What I've shown is that, hopefully, federated learning and publicly deposited code will allow other authors to build upon your work and also to make new discoveries to speed up the field. 
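As a rough illustration of the federated learning idea described above, here is a minimal sketch of federated averaging in Python: each institution updates its own copy of the model on its local data, and only the model weights, never the images or clinical data, are sent to a central server, which averages them. This is a toy example with random arrays standing in for local data and a simple update rule standing in for local training; it is not any specific framework or the speaker's implementation.

```python
# Toy sketch of federated averaging: data stays local, only weights travel.
import numpy as np


def local_update(weights, local_data, lr=0.01):
    """Stand-in for one round of local training at a single institution.
    Here we just nudge the weights toward the mean of the local data;
    a real system would run gradient descent on the local images."""
    return weights + lr * (local_data.mean(axis=0) - weights)


def federated_round(global_weights, institutions):
    """One communication round: each site trains locally, the server averages."""
    local_weights = [local_update(global_weights.copy(), data) for data in institutions]
    # The central server only ever sees weight vectors, never the underlying data.
    return np.mean(local_weights, axis=0)


rng = np.random.default_rng(0)
# Three institutions with private "datasets" (random stand-ins).
institutions = [rng.normal(loc=i, scale=1.0, size=(100, 5)) for i in range(3)]
weights = np.zeros(5)
for round_idx in range(10):
    weights = federated_round(weights, institutions)
print("Global weights after 10 rounds:", weights)
```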
And finally, I want to look at the definition of success for AI in imaging. It's based upon value, but there are many metrics. What I've shown is really increasing diagnostic accuracy and being able to read studies faster. But the two that are most important to me are better outcomes for patients and a better quality of work life for radiologists. On that note, thank you very much for your attention.

I'm presenting for first author Theo Cleland, who is our AI expert. I'm not the expert, so bear with me. The motivation for why we're interested in masking risk is sort of illustrated here. I've shown this kind of slide before at previous presentations. It's a set of mammograms of increasing breast density, and there are any number of studies that have shown that screening sensitivity for mammography decreases with increased breast density. We contend that that's primarily due to masking by that breast density; in other words, the lesions are reduced in conspicuity or they're completely obscured by the normal tissue. And because of this awareness of the reduced sensitivity and the reduced performance of mammography in screening dense breasts, there's sort of a move towards perhaps a stratified screening approach, where women who have dense breasts, say BI-RADS density categories C and D, would go on to have some kind of supplemental test. And that supplemental test could potentially find more of those cancers. If we take a look at what those numbers might be, for example, in 100,000 screening exams, we might find that there are maybe 60 cancers that could be missed in that cohort. This is adapted from data from the Breast Cancer Surveillance Consortium. In those exams, about 44% of women would be classified as having dense breasts, and in that group, there would be about 33 interval cancers. What that means is that if we do supplemental screening for every woman in that category, that's 44,000 additional tests, and this is obviously a huge drain both on financial resources and on workflow for a screening program. And that all assumes, of course, that the supplemental test is a perfect test and that we would find those missed cancers. Just regrouping those numbers in a bit of a block plot, we have the screen-detected cancers here for that 100,000-woman case and the missed cancers on the other side. If we do that supplemental screening, on the top here, we might potentially find 33 of those missed cancers, but at the expense of 44,000 additional exams. Ideally, we want to do this more efficiently. We want to find more of the missed cancers in that supplemental screen while sending fewer women into the supplemental screening. So maybe something like the numbers shown here; these numbers are made up, of course. So we've been interested in masking risk for some time now. We have a cohort with Dr. Jennifer Harvey at the University of Virginia, where they've been looking at breast cancer risk in a fairly large cohort. And what we've done is been able to identify mammograms that were ultimately associated with a non-screen-detected cancer. Those mammograms are referred to as masked, because the cancer was not found at screening time. And those cancers that were found and confirmed on a screening exam are referred to as non-masked. 
And this is a case-case study where both sides of the test have cancer, but their mammograms are different because of the way the cancers were found or not found. We're not looking at finding those cancers. What we're trying to do is assess the masking risk associated with the breast density. And so we only look at the contralateral breast, the disease-free side, to evaluate the density and the texture and so on in those images. Previously, we had presented results of a masking risk indicator based on something referred to as detectability, which is sort of an estimate of the SNR of detection of simulated lesions that could be inserted into a mammogram, to look at things like the image quality of that mammogram. With that kind of masking risk indicator, we were able to show an area under the curve for discriminating between high masking risk and low masking risk mammograms of about 0.75, compared to just using the BI-RADS density categories, which would have given a masking discrimination AUC of 0.64. Now, this is something that's referred to as a handcrafted metric: we picked the features that we wanted to use, that we thought were important. And the question is, can deep learning maybe do a little bit better? The hurdle, of course, is that although this is a nice, well-defined data set, it's a relatively small data set. And when you have small data sets, sometimes you can take advantage of something called pre-training. This is where you take an existing trained neural network. In this case, we used the VGG16 neural net, which is a deep convolutional network, or CNN, that is able to classify natural images. So this thing can be used to identify things like dogs, cats, cars, trains, and so on in images. And then what you do is you actually retrain it a little bit for whatever task you're interested in. In our case, we're interested in assessing masking risk. The way you do that is you take off the last few layers, which act as the decision-making part of the neural net, and you replace them with your own classifier. In this case, our classifier is: is it masked or not masked? You start out with a quick bit of training, you can see there are 50 rounds there, and that just burns in the weights that are needed for that new classifier. And then the second step is that you take the last layer of that very deep convolutional neural network, the layer that would recognize cars and trains and so on, and you allow it to learn. Its weights are adjusted so that it's now sensitive to mammograms and to the features in the mammogram that are associated with masking versus non-masking. We see that here on the bottom left. The dots with the error bars show the ROC curve for BI-RADS categorization of masking risk; the error bars indicate the 95% confidence intervals. The blue line indicates the masking risk based on the pre-trained CNN model, and the gray area corresponds to the 95% confidence intervals of our analysis. We get an area under the curve of 0.82, so we're pretty happy with that. But the question is, how will that do on women who don't have cancer and whose masking risk we don't know? In other words, what would happen? How many women would we have to screen in order to get the right women into that stratified screening program to find those missed cancers? 
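Before turning to that question, here is a minimal sketch of the two-step recipe just described: start from a network pre-trained on natural images, swap its final decision layers for a masked-versus-non-masked head, briefly train only that head, then unfreeze the last convolutional block and fine-tune it on the mammograms. It uses Keras-style calls and assumes a VGG16 backbone, as in the talk, but the hyperparameters, input size, and commented-out training calls are placeholders, not the authors' actual code.

```python
# Sketch: pre-trained VGG16 adapted to classify mammograms as masked vs. non-masked.
# Hyperparameters and the commented-out fit() calls are placeholders.
import tensorflow as tf
from tensorflow.keras import layers, models
from tensorflow.keras.applications import VGG16

# 1) Start from a network trained to recognize natural images (dogs, cats, cars, trains...).
backbone = VGG16(weights="imagenet", include_top=False, input_shape=(224, 224, 3))
backbone.trainable = False  # freeze all convolutional layers for the first step

# 2) Replace the original decision layers with our own masked / non-masked classifier.
model = models.Sequential([
    backbone,
    layers.GlobalAveragePooling2D(),
    layers.Dense(64, activation="relu"),
    layers.Dense(1, activation="sigmoid"),  # probability that the mammogram is "masked"
])
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=[tf.keras.metrics.AUC()])
# Quick burn-in of the new head only (the "50 rounds" mentioned in the talk):
# model.fit(train_images, train_labels, epochs=50, validation_data=val_data)

# 3) Unfreeze only the last convolutional block so its features adapt to mammograms.
backbone.trainable = True
for layer in backbone.layers:
    layer.trainable = layer.name.startswith("block5")
model.compile(optimizer=tf.keras.optimizers.Adam(1e-5),
              loss="binary_crossentropy", metrics=[tf.keras.metrics.AUC()])
# model.fit(train_images, train_labels, epochs=..., validation_data=val_data)
```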
We know that the discriminator, the CNN, doesn't find the cancers itself; rather, it can redirect women into that stratified screening using this approach. But we don't know how many extra screens we would have to do on the cancer-free women. So we simulate that by running the two models, our CNN model and the BI-RADS density model, on 44 non-screen-detected cancers and almost 2,000 cancer-free mammograms, and we look at stratifying those into high masking risk and low masking risk. So it's something like an ROC study, but we don't really know what the masking risk is for the normal women. And it's a plot, again, that looks very much like an ROC curve. On the y-axis, we have the capture fraction, and that's the fraction of cancers that would have been missed that have now been redirected to stratified screening, that secondary screen. And on the x-axis is the recruitment fraction, the fraction of normal women who would get that extra screen but wouldn't necessarily benefit from it, at least not at that point in time when they don't have cancer. And you can pick a different operating point where you have a higher capture fraction, but at the expense of an increased recruitment fraction, and that's sort of shown there in the colored boxes. And the area under the curve, or the so-called C-statistic, for this shows that the CNN model is at about 0.75, compared to BI-RADS doing this stratification for supplemental screening at 0.63. So, for the cost, you can take that sort of curve and actually multiply it out and figure out what that means in terms of, say, screening 100,000 women. I'll just direct you to the star here, the red star. At that point, you're capturing about 40% of the missed cancers into the stratified secondary screens, which would then hopefully find those cancers, and that would be at a cost of 11,000 additional screens. If you work that out, I think it's roughly 500 extra screens required per potential cancer found. The other two spots are shown there to sort of show how that's working. And again, we have the dots corresponding to the BI-RADS classifier for masking risk, and the solid line is the CNN classifier. Basically, for almost all of the settings, the CNN classifier does better. So we have an efficient masking risk classifier. It performs well at catching the mammograms of high masking risk, and it also appears to work well on the normal women, in that it's efficient and you require relatively few extra screens to capture those missed cancers. Now, of course, this is all based on our analysis here, and it is a small cohort. The question, going back to some of the comments in the earlier talks, is: is it generalizable? So we are testing this on other cohorts. We're developing our own cohort at Sunnybrook with Dr. Roberta Zhang, as well as working with cohorts at the University of Cambridge with Dr. Fiona Gilbert. And this is a simulation; we haven't run this in a real situation, and we have assumed that the secondary screen, whatever that is, ultrasound, MR, or something new, is a perfect second screen. In reality it won't be, so we're going to miss cancers even in that second screen, and we need to assess how that would work in a real program. And so the prospective pilot study we're developing is to bring the masking risk indicator into the clinic and to see how that affects decision making and performance in the clinic. 
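As a small numerical illustration of the capture and recruitment fractions described above, here is a sketch in Python of how both could be computed at a single operating threshold from a set of masking-risk scores. The scores, their distributions, and the threshold are invented for illustration; they are not the study's data, and the cohort sizes are simply borrowed from the numbers quoted in the talk.

```python
# Capture fraction vs. recruitment fraction at one operating threshold (toy example).
import numpy as np

rng = np.random.default_rng(42)
# Masking-risk scores for mammograms of non-screen-detected (masked) cancers...
scores_masked_cancers = rng.beta(5, 2, size=44)      # tend to score higher
# ...and for cancer-free mammograms.
scores_cancer_free = rng.beta(2, 5, size=2000)       # tend to score lower

threshold = 0.6  # every woman scoring above this is sent to supplemental screening

# Capture fraction: fraction of missed cancers redirected to the secondary screen.
capture_fraction = np.mean(scores_masked_cancers >= threshold)
# Recruitment fraction: fraction of cancer-free women who get the extra screen.
recruitment_fraction = np.mean(scores_cancer_free >= threshold)

print(f"capture fraction:     {capture_fraction:.2f}")
print(f"recruitment fraction: {recruitment_fraction:.2f}")
# Sweeping the threshold traces out the capture-vs-recruitment curve shown in the talk,
# and its area is the C-statistic the speaker quotes (about 0.75 for the CNN).
```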
And then I'll just follow up with acknowledgments. Thank you.

My name is Abdurahman, but I go by Abe, and I am a machine learning engineer at DeepHealth in Cambridge, Massachusetts. Today I will be discussing the retrospective detection of breast malignancies using deep learning on clinically negative prior screening mammograms of breast cancer patients. Screening mammography is a crucial tool for detecting breast cancer and reducing breast cancer mortality. However, despite its efficacy, it remains an extremely difficult visual task, and even the most experienced interpreters can occasionally miss cancers. Studies have shown that approximately one in three of these missed cancers is visible retrospectively. Now, recent advances in deep learning have shown great promise when it comes to tackling such challenging visual tasks, and here we take our custom deep learning model and test its ability to detect cancers on prior, clinically negative exams of breast cancer patients. To start off, I would like to formalize a few definitions. First, we define an index exam as a screening mammogram which took place up to three months prior to a malignant biopsy. We define a pre-index exam as a screening mammogram which was interpreted as BI-RADS 1 or 2 and which occurred 9 to 24 months prior to a malignant biopsy. And lastly, we define a verified negative exam as a screening mammogram which was interpreted as BI-RADS 1 or 2 and which was followed by at least two additional screening mammograms with a similar interpretation over the next 24 months. So, in other words, the only exams we consider to be true negatives are ones where we can look ahead for two years and ensure that no malignancy occurred. Our test data set consisted of approximately 14,000 screening FFDM exams collected from a U.S. facility between 2011 and 2017. This test set contained 328 index exams and 328 pre-index exams from the same women, and approximately 13,500 verified negative exams from different women. The mean time difference between the pre-index exam and the malignant biopsy was 15 months. Before moving on, I would just like to stress that the model's test and training data sets were obtained from different facilities, meaning that they were drawn from separate distributions. To evaluate our test data, we used our custom top-scoring model from the DREAM Mammography Challenge. This model uses a MobileNet backbone. It was trained on approximately 5,200 cancers, 1,200 benigns, and 111,000 negative exams, and it achieved an AUC of 0.90 on the DREAM test set. This model was trained in two separate stages, the first of which involves patch-level classification. In this stage, we take mammograms where we know the location of the cancer, split these mammograms into several patches, and then train a model to classify these patches as cancer or no cancer. We then use this patch classifier to initialize a more holistic image-level model, which we train on whole mammograms. Now, to clarify the next slides, I am going to be discussing two sets of results: the first comes from the model that I just described, which we submitted to the DREAM Mammography Challenge, and the second is from a more refined version of that model, which we have been working on since submitting this abstract. In this figure, the dashed gray line shows what our model's performance would be if it did no better than chance. 
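As a rough illustration of the two-stage training just described, here is a minimal sketch in Python using Keras-style calls: a small patch-level classifier is trained first on crops around known lesions, and its convolutional trunk is then reused to initialize an image-level model trained on whole mammograms. The architecture, image sizes, and commented-out training calls are placeholders, not the actual model described in the talk.

```python
# Sketch of two-stage training: a patch-level classifier whose convolutional
# trunk then initializes a whole-image model. Sizes and fit() calls are placeholders.
import tensorflow as tf
from tensorflow.keras import layers, models

# Shared convolutional feature extractor used by both stages.
trunk = models.Sequential([
    layers.Conv2D(16, 3, activation="relu", padding="same"),
    layers.MaxPooling2D(2),
    layers.Conv2D(32, 3, activation="relu", padding="same"),
    layers.MaxPooling2D(2),
    layers.Conv2D(64, 3, activation="relu", padding="same"),
])

# Stage 1: classify small patches (cancer vs. no cancer) cropped around known lesions.
patch_model = models.Sequential([
    tf.keras.Input(shape=(256, 256, 1)),
    trunk,
    layers.GlobalAveragePooling2D(),
    layers.Dense(1, activation="sigmoid"),
])
patch_model.compile(optimizer="adam", loss="binary_crossentropy")
# patch_model.fit(patch_images, patch_labels, ...)

# Stage 2: reuse the trained trunk inside an image-level model on whole mammograms.
image_model = models.Sequential([
    tf.keras.Input(shape=(2048, 1664, 1)),  # full-view size is a placeholder
    trunk,                                  # weights carried over from the patch stage
    layers.GlobalMaxPooling2D(),            # pool suspicious responses across the image
    layers.Dense(1, activation="sigmoid"),
])
image_model.compile(optimizer="adam", loss="binary_crossentropy")
# image_model.fit(whole_mammograms, exam_labels, ...)
```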
So, to start off by describing the results of the initial DREAM model: the model achieved an AUC of 0.90 on the index exams of our test set, which lines up with its performance on the DREAM test set. On the pre-index exams, it achieved a lower AUC of 0.70. However, I would like to reiterate that, by definition, these pre-index exams are clinically negative, making them an inherently more difficult test set. At the mean U.S. radiologist specificity, which is 88.9 percent according to the Breast Cancer Surveillance Consortium, our model has a sensitivity of 35 percent on the pre-index exams, meaning that it was able to detect suspicious abnormalities on 35 percent of the pre-index exams. At that specificity, it was able to detect suspicious lesions on pre-index exams throughout the whole defined timeframe of pre-index exams, which is up to 24 months prior to the malignant biopsy. Now, over the past few months, we have continued to improve and refine our model, and we did so by training it on additional exams and by adding the ability to localize lesions. This refined model performs better on both index and pre-index exams, achieving an index AUC of 0.95, up from 0.90, and a pre-index AUC of 0.77, up from 0.70. And again, at mean U.S. radiologist specificity, it achieves a sensitivity of 38% on the pre-index exams. I'm now going to show two examples of cases where our model detected suspicious abnormalities on clinically negative pre-index exams. The images on the right half of the slide are taken from the patient's index exam, and the red bounding boxes show the suspicious region that our model localized on this index exam. These bounding boxes were verified against the interpreting radiologist's report. The images on the left half of the slide come from the patient's pre-index exam, which took place 13 months prior to the index exam. Again, the red bounding boxes show the localized region on the pre-index exam, which lines up with the same region in the index exam. This is an example of a pre-index exam in which the interpreting radiologist did not detect any suspicious abnormalities. In this next example, the interpreting radiologist did detect an abnormality on the pre-index exam, which took place 12 months prior to the index exam. However, they classified this abnormality as benign, whereas our model classified it as malignant. In both of these examples, our model could have helped radiologists catch these cancers approximately one year earlier for both women. To conclude, our model detected suspicious abnormalities on 38 percent of pre-index exams, and it did so at mean U.S. radiologist specificity, or mean U.S. recall rates. The detected abnormalities on the pre-index exams were up to 24 months prior to the malignant biopsy. We believe that using this model in a clinical setting could potentially help radiologists catch some cancers earlier, which could help improve women's prognosis and help further decrease breast cancer mortality. Now, I've discussed one of our model's applications on FFDM exams. Please check out my colleague's talk at 10:45 to hear about some of our work on DBT. Thank you.

I think that AI represents a once-in-a-generation opportunity for us to really dramatically improve patient care and lower the costs of high-quality health care, really something on the order of moving from film to PET-MR. 
And not because AI is going to replace us as radiologists, and not because there's going to be some robot that's going to be superior to us, but really in the same way that, in the industrial revolution of the 19th century, we saw human physical productivity exponentially improve and change. In the same way, I think we're going to see the tools that you're all talking about significantly increase our ability to make diagnoses and improve care for our patients. I will sound a note of some caution, which is that if we do not come together as an ecosystem, and this is a word that you're hearing here at the RSNA and a word that we've been talking about in our Data Science Institute, then we really do risk, through fragmentation, through a number of these barriers, through overblown hype, and through our fragmented health care system, losing this incredible opportunity. And it's important that we also look back at our history and see where perhaps we've had missteps in the past, where innovation has been fraught with risks and is always something where we have to be thoughtful. Let's go back to the very birth of our specialty, when we discovered x-rays and saw very quickly that this innovation was adopted across the ecosystem. Within four months, we're seeing an x-ray taken in rural Ohio. But while we're asking these AI tools now to be safe and effective, in our own history we've seen innovation that has been neither: Madame Curie dying of a radiation-induced illness, the use of radium for all sorts of cosmetic applications. It's been a long journey for us to get to the focus that we have now on the safe and appropriate use of imaging. And I want to talk more about this concept of an ecosystem, and I think Apple provides us with a very nice example there. Apple is not a trillion-dollar company simply because of the phones that it sells; it is a trillion-dollar company because of the community of developers and users and the way in which we all interact with those tools. But this gives me an opportunity to talk about the fact that we have an ecosystem in which we have very different motivators, incentives, and requirements. And while it is absolutely necessary that we as practitioners and industry work together, we must be very aware that, as physicians, we have sworn an oath to care for our patients, while our industry partners have shareholder value that they are required to create. And that creates, I think, a lot of possibilities, but we also have to remain very thoughtful about what those incentives mean for us. I want to go back to what it is that we aim to do as physicians. We aim to prevent illness, help sick patients get healthy as soon as possible, and, for those who increasingly live with chronic disease, manage their conditions. And certainly for those of us who are breast imagers, it is the prevention of illness, the idea that we can detect breast cancer before it becomes apparent on physical exam and while it is still curable, that is so exciting. We've set goals for ourselves as a healthcare system, and again, as breast imagers, we really feel that we have a huge impact in improving the lives of our patients, and in doing so at lower cost when we detect disease earlier. And this is the quadruple aim now, which also speaks to the importance of improving the work-life balance and the work life of those of us who deliver care. One of my signature areas, one of my key areas of focus as an ACR leader, has been what we've called Imaging 3.0. 
And the goal that we set for ourselves was to deliver all of the imaging that's necessary and beneficial, and none that's not. We talked about the culture change, and those of us in breast imaging have always felt that we were ahead of that game with our focus on the patient and our metric-driven practice. But we all needed tools so that we could actually deliver that care more effectively. We see artificial intelligence, for us as breast imagers, as just being another one of those tools. And we are an innovative specialty as radiologists, and I think as breast imagers we've adopted innovation thoughtfully. I use technology, like tomosynthesis, that wasn't even invented when I was a trainee. But our adoption of that has been evidence-based, and we have made sure that it adds value before we've added it to the screening protocol for our patients. This is just one of a long line of innovations that we've adopted. I think the tendency is to think of the benefits of AI in breast imaging as being around the image interpretation, but what I would say is that there is so much opportunity to think about the entire imaging value chain and how we can potentially apply AI to reducing administrative complexity and burden, helping us with the reporting requirements to regulatory authorities, and the ability to deliver data into registries. So I would urge us to think along the entire value chain and along the entire spectrum of what it is that we do as radiologists. Obviously, we're not in this alone. I know that many of you are here from industry, and breast imaging is a very compelling potential use case because we are looking at trying to deliver care to a population, the ideal screening population, as cost-effectively as possible and with as much accuracy as possible. We're delighted to have this amount of interest, but some of the things that give us pause come when we look at the players that are in our space now. Obviously, there's a lot of research interest, but we also see interest in our space from people who operate in a very different financial space. Comparing the market capitalizations of the companies that we've been familiar with as radiologists to the new players in our space, the new players have a combined market cap of $4 trillion, more than we spend on our entire healthcare ecosystem here in the U.S. every year, and some of the philosophies that we've seen espoused by leaders in these industries are "fail fast" and "break things." I think as physicians we are always given pause because, again, we've sworn that oath to our patients to protect them. So again, these industry relationships are something that we must engage in, but we must do so thoughtfully. I apologize if any of the players from either of these entities are in the room, but we are given pause when we see large amounts of money put into high-profile projects that don't get the results that we had hoped for. So I think that what we are looking for is a thoughtful collaboration that can really move the profession forward. One of the things that breast imagers do come into this new space with is a little bit of baggage about what we would have thought of as an artificial intelligence tool surrogate, and that's CAD. And again, apologies to anybody who was involved in that. This is a quote from a former partner of mine: you know, if CAD was a partner in our practice, we'd have fired him or her long ago. 
I think this tells us something. I know that there's a lot of interest from industry in how we pay for these tools, and my background is in economics, but I don't think we saw that tool evolve as it should have, perhaps because of the payment policy that embedded it into our workflow. I think as breast imagers we owe it to ourselves, and to the patients we care for, to really demand more, and to demand more of you in industry and you in data science. So yes, I think that you'll see us wanting to be much more involved in this. Obviously, our notion of what we're expecting artificial intelligence to deliver in breast imaging is changing. We no longer think that a robot is going to be sitting in our chair when we walk in. We recognize, since we're all walking around with artificial intelligence tools on our phones, that we're going to be using more narrow tools, and I think that's something we're all comfortable with. And I think that, not just for breast imagers but across the radiology profession, I'm seeing very little fear, rather excitement and optimism, tempered with a real concern that we do this right. And, you know, again, going back to who we are as specialists, we have always been at the interface of technology and patient care. So I think, in the same way that we're all here at RSNA looking at the latest technology, and we all want the latest iPhone, as breast imagers we are going to want the very best tools that this collaboration of industry and the healthcare delivery community can deliver. This quote is attributed to Curt Langlotz: radiologists who use AI will replace those who don't. I'm going to give you a quote from me, as a humble breast imager, which is that, given the way that we have led in patient- and family-centered care, being in contact with our patients, a standardized reporting system, and the way in which we really use outcomes and metrics to drive our practice, I'm going to say that breast imagers are going to be the ones who lead the way in the safe and effective adoption of AI tools. I do not have time to go into all of the ways in which I think AI can really impact our practice. I will just say, going back to that concept of the value chain, that I would really hope we don't just look at the image interpretation. How is it that we provide education on guidelines to our patients? How is it that we do that in a way that meets their needs, where they consume information? How is it that we optimize protocols, and perhaps simplify and reduce the cost of the accreditation programs that we know have driven quality? As a breast imager, you know, I think there are a number of ways in which every day I'm looking for these tools, and we'll talk a little bit more about those. I'm going to skip over a few of these slides, but just to give you a sense that there are so many ways in which we can improve our practice, specifically beyond the image interpretation: patient care and safety, practice operations. At the end of the day, what we're looking to do is improve our ability to diagnose. This report from the IOM, Improving Diagnosis in Health Care, I love the way it talked about it: when we're looking at how to optimize the diagnostic process, it is because we're looking to integrate and interpret information. 
So bringing together all of the information that we need to give the best diagnosis to the patient. We've heard a lot of discussion of sensitivity and specificity this morning, and, you know, thinking about it, we always hope that we're better than 50-50, although if you look at some of the statistics about a dense mammogram, you know, it is challenging. We know we're not where we need to be in terms of sensitivity. With improved image quality, innovations like tomosynthesis, better quality and safety, and better training, we've moved ourselves up toward that perfect corner. I see AI as something that's going to give us that additional edge. All we're looking to do is improve diagnosis and improve the health of our patients. So let's think about some of the practicalities for an average breast imager like me. Well, I know that I am never going to detect every breast cancer as a human. I would love for AI to help me with that. I know that inevitably I'm going to call patients back for additional testing, sometimes a biopsy. I would love not to have to do that. And at a time when there are multiple sets of confusing guidelines out there, so that when I say to my patient, we'll see you next year, as often as not she says, really? Do I have to come back next year? I need all the time I can get to speak to my patients. So I would love for AI to help me be more efficient so I can spend that time in front of my patient rather than going on to the next test. So I'm going to keep my talk a little bit shorter than my time because I would love to get some questions and comments from you all. But as I was preparing for this talk, there was a piece in Forbes around RSNA last year that said, you might want to get AI to read your mammogram next year. So what I would say as a breast imager to our ecosystem, to our community of physicians, data scientists, and industry, is: yes, that's fine, as long as your algorithm is safe and effective and protects patients and is unbiased, then I'm happy to do that. Because again, as breast imagers, we are so much more than the images we read. So the more that we can do to read those images as sensitively and specifically as possible, the more time that allows us to spend with our patients. And at the end of the day, isn't that what we all went into medicine to do?

I was asked to talk a bit about artificial intelligence for digital breast tomosynthesis, and that's my topic today. I will say that I will add a bit of mammography results and insights because, well, there's just much more information and there are more results about that from the last few years. So, AI in breast imaging, or for medical image interpretation: over the last few years here at RSNA, we've presented results on how some algorithms have been doing. And some people push back against it; they feel threatened and they want to pull a Linda Hamilton and kind of fight it. I do agree with Dr. McGinty when she said that your job, and I'm a physicist, so I'm talking about your job now, is much more than reading images. I don't think AI should be viewed as, well, radiologists should find another job in five years, this is going to die. I don't agree with that. I think AI should be viewed as a much more friendly assistant, something that will make your lives better and easier and, I hope, improve patient outcomes. You've already seen this quote this morning. 
I heard it from a certain Dutch radiologist a couple of years ago, and of course we've all heard it many times since then: radiologists who do AI will replace radiologists who don't. And perhaps that will end up being true. So today I'm going to very briefly talk about why now. You've probably heard how AI works many, many times over the last two years, so I don't want to spend too much time on that. And then I'll talk about how we could use it, perhaps to go faster, to do better, and eventually to do less. So why now? Well, over the last five years, deep learning has been introduced for medical image analysis, or what I call magic, and the performance of these algorithms has changed radically in going from what people would call conventional CAD to the new AI methods. So of course the idea is that we input some mammograms, tomosynthesis images, MRI, whatever it is, and they go through a series of mathematical operations, which are by themselves individually not really complicated, they're just multiplications and convolutions, and out come either maps of the probability of malignancy being present here or there, or a decision, yes or no, cancer is present, or a number which represents the probability that malignancy is present. Here is an example of one of these maps: it shows in red the area that caused our algorithm to give this decision, this probability of malignancy being present in this case. This is an interface from one commercial AI software package. You see, let me see if you can see the arrow, no, now, yes, okay. So you see down here a case-level score, and next to each lesion a probability that the lesion is malignant. So, I actually presented this pretty much exactly one year ago in this room. We did a study to determine how good this one commercial AI algorithm is at detecting cancer in standalone mode. We gathered data from nine ROC studies already published: the mammography images, the case scores from many, many readers, and the truth. These studies spanned seven countries, two continents, and four digital mammography system vendors. We ended up with 2,600 cases, of which 650 were cancers, and these cases were read by 101 breast radiologists. Not all radiologists read all cases; we ended up with 28,000 reads. And when we compared the area under the curve for the system to the average of the radiologists, we ended up with a non-inferior comparison: the AI system turned out to be non-inferior to the average radiologist. In a more recent publication, another group evaluated 720 cases read by 14 readers, with images from four different models of two vendors' systems, and they actually showed an improvement in the area under the curve for detection of breast cancer. In that case, again, this is mammography, and this is supposed to be a DBT talk, so let me show you one quite recent publication from UPenn, in which they used a commercial AI system for breast tomosynthesis reading and compared the performance of 24 MQSA radiologists reading 260 cases with that commercial system. I actually put this together from two figures that appeared in that paper. The dashed line is the radiologists' average, and the solid blue line is the AI, and you can see it's pretty much comparable performance. And all the little circles, which we're going to look at in a few minutes, are, again, the individual readers. 
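To give a concrete picture of the "multiplications and convolutions" described a moment ago, here is a toy sketch in Python of a tiny convolutional network that takes a single (downsampled) mammographic view and outputs both a coarse probability-of-malignancy map and a single case-level score, the two kinds of output mentioned in the talk. The architecture and sizes are invented for illustration and are not any vendor's product.

```python
# Toy convolutional model: a probability-of-malignancy map plus a case-level score.
import tensorflow as tf
from tensorflow.keras import layers

inputs = layers.Input(shape=(512, 512, 1))           # a (downsampled) mammogram view
x = layers.Conv2D(16, 3, padding="same", activation="relu")(inputs)
x = layers.MaxPooling2D(4)(x)                        # repeated convolutions + pooling
x = layers.Conv2D(32, 3, padding="same", activation="relu")(x)
x = layers.MaxPooling2D(4)(x)
x = layers.Conv2D(64, 3, padding="same", activation="relu")(x)

# A 1x1 convolution with a sigmoid gives a coarse per-region probability map
# (the red overlay in the speaker's example).
prob_map = layers.Conv2D(1, 1, activation="sigmoid", name="malignancy_map")(x)
# Taking the maximum over the map gives one case-level suspicion score.
case_score = layers.GlobalMaxPooling2D(name="case_score")(prob_map)

model = tf.keras.Model(inputs, [prob_map, case_score])
model.summary()
```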
So there are quite a few publications that show that the performance of AI during these studies, in comparison to radiologists, is quite comparable, if not even a bit better. But, of course, these three and all the other ones I've seen, studies I've seen, they're all enriched datasets performed in retrospective reading sessions, so there's a laboratory effect. There's a much higher prevalence of cancer cases than what you would expect in a screening dataset, and AI, right now, there's no algorithm that I know of that can read the prior image and do that comparison, that you do, and is so valuable and improves your performance so much. One interesting thing is that there was one paper a couple years ago that added this ability to look at the prior image from that woman, and, you know, supposedly improve its performance, but what they found is that adding that prior, that ability to look at the prior in that image, they don't actually improve the AI's algorithm performance, and I'm not sure why. I do think there might be one of two explanations, or maybe perhaps a combination of both. Well, first of all, perhaps they should have implemented the ability to read the prior better, or perhaps there's information in the current mammogram that is the same information provided by a prior that we humans cannot see or haven't learned how to detect yet, and the AI already did. So when we add the prior information to the AI system, well, the AI already knows whatever that prior information is, so its performance will not improve, and perhaps the answer is a combination of those things, or not. So the key thing is, yes, we do have studies that show that AI performance is much improved compared to conventional CAD, and that we're well on our way, but we don't have final definitive proof with screening data sets in a prospective setting, a real screening setting, to say, okay, we're there, we can do as well as radiologists. So we're on our way, but to where? What do we want to do with these things? Well, of course, one option perhaps is to read faster, and this has been mentioned already a few times this morning, and since you invited a physicist to give a talk, I'm sorry I have to show one equation, and this is how I would say, well, reading time in tomosynthesis takes, it's twice the reading time in digital mammography. This is the standard knowledge that people publish over and over again, and in some cases that might not be a problem, but in other cases it is, and especially where now I live in the Netherlands, when you have a nationwide screening program, all of a sudden if you want to transition from mammography to tomosynthesis, and you require double the resources of time or people to read the same number of images, that is a problem. The resources might actually not be there. So one of the limitations to move from mammography to tomo is this reading time issue. So there are studies and software to try to accelerate the reading of tomosynthesis images. This is one example from just not that long ago, and the idea of this software is it detects the lesion, and it shows on the stack of tomo a box surrounding where it detected the lesion, and this, using this software, the hope is that the radiologist will find this lesion and interpret the image faster. So in this paper, they compared the timing of four radiologists without and with using the system, reading a hundred cases, and their reading time was reduced by 14 percent, and it was statistically significant. 
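The "one equation" referred to above is presumably just the commonly published reading-time relationship, roughly

$$ t_{\text{read}}(\mathrm{DBT}) \approx 2 \times t_{\text{read}}(\mathrm{DM}), $$

that is, a screening program moving from digital mammography to tomosynthesis needs roughly twice the reader time for the same volume, unless something such as AI assistance or triage claws that time back.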
And for the two newer radiologists and the two experienced ones, they both reduced the time to fantastically exactly the same time with the AI, whereas they had a small difference without it. In a different publication, and this with this product, the lesion is marked in all slices of the stack, but when it's off, the slice where the lesion actually is or where the center of it is, the mark is more of a grayish. Let me see. Okay, well, you can see the grayish area, and if you click on that grayish border in any slice, it takes you directly to the slice where is the center of the lesion, and then it appears as a bold black line. So in a study, well, the study I mentioned five minutes ago or so from University of Pennsylvania, this 260 cases, 24 radiologists, they evaluated this system and found a 53 percent reduction in reading time. So considerably faster reading with this solution, and this was consistent for all readers. All 24 did faster except for one, number 18, that was already pretty fast and reading without the AI. One example case, I chose, well, this was up here in the paper, and here you can see how a mark when it's off the slice, the correct slice, it appears more grayish, and when it's in the right location, when the center of the lesion is in that position, you see the bold end with a probability of malignancy estimate. So two studies, okay, two different software, two different representations of the results, but one has 14 percent reduction in time, that one 53. Why is such a big difference? Well, the 14 percent reduction one had a 70 percent of the cases were positive, while the other one had only a 25 percent prevalence. So that, of course, I believe makes a huge difference in how much time the radiologists spend to interpret a positive case versus a negative one. And of course, in screening, we expect more of a closer to 1 percent prevalence, so the 50-ish reduction in time is more relevant. However, I think that we need to evaluate this a bit, because if we're going to reduce the reading time of reading a tomo stack by 50 percent, I want to ask, are we doing AI-assisted radiologist reading, or is this radiologist-assisted AI reading? I believe the only way to go that fast is, well, you're not doing search anymore, you're only looking at what the AI found and deciding, am I gonna, I agree with AI, this should be recalled or not, and you're not looking through the stack to find any other lesions. And I wonder if that's okay. Are people gonna push back? Are we giving up the idea of, you know, the radiologist searching for the lesions first, and then the AI being assisting you in cases where I want to know, I'm not sure, should I recall this or not? Or can we do this? And some people say, yes, we can. So I'm gonna ask you to pull out your phones, or at least get out of Twitter and Facebook and move on to your mobile browsers, and I'm gonna ask you if, what do you think? So if you go to menti.com right now, and that's a code you need to enter to answer the question, so if we could move over to the Menti web browser, please. This is supposed to work. You have to do it in the back. Yes, thank you. So let me ask you, what do you think? Is a 50% reduction in time only possible if we give up human search, and we just, well, you just look at the lesions that AI is showing you? Okay, so we have, apparently I'm not crazy, and we have a considerable majority for yes. Okay, I think we need to move to the next, let me see, I lost the mouse. 
No, let me go back to the web browser again, please. I need to go to the next question. Yeah, there we go. Where's my mouse? Yeah, so is it acceptable for only the AI to do this? This is, okay, yes, I agree. It depends, right? I mean, if we're not missing any cancers, or you're not missing any cancers, then it depends. It might be okay, and that seems to be pretty much your opinion. Great. Thank you very much. Okay, can we move back to the slides? Okay, so we have a 50% reduction in time only possible if we give up human search. Yeah, so what if it's good at it? If it's good at it, it should be okay, right? So that moves on to, okay, better. This is another reason why we would add AI, not only to go faster, but to do at least as well or better. So is that true? If we go back to the study that showed a 14% reduction in time, and we looked at the performance, the performance is actually the same. These are the curves for the four readers without the AI, and these are with, and of course, you can't really see a difference if you go back and forth, and the area under the curve is equivalent. The 53% time reduction study actually showed an improvement, a significant improvement, in the area under the curve of the radiologists with the AI compared to without, and I already showed you part of this curve. This is all 24 radiologists. Those circles are each one. The size of the circles represents the time spent reading, and of course, there's a spread of sensitivity and specificity. If we look at without AI, this is it. With, they all become smaller circles and move up and to the left as we want, and the paper actually showed this nice analysis. If you look at the ones grouped in yellow, they're the more specific, less sensitive radiologists, and with the AI, they move up in sensitivity but don't lose any specificity. If you look at the more sensitive, less specific radiologists in blue, they also move to the left. They don't lose any sensitivity, and they gain specificity, and the ones that have, you know, pretty average sensitivity and specificity, they move up and to the left. So all of them move in the right direction without sacrificing the other side. So we do see, again, this is a retrospective, high-prevalence case set in the lab, but we do seem to have this nice feasibility that AI, for faster, better reading of tomo images, seems to be a really nice option. The other, third application, if you will, is to do less, to read less, and therefore to be able to do more of the other, more interesting stuff, or the more critical work, to spend more time on the cases that really need it. So moving from AI-assisted reading to using the AI standalone, and then determining, for example, that these cases that the AI definitely thinks are normal, we should not read at all. So the idea is that a typical AI solution scores the cases from 1 to 10 and, if the case set is a typical screening data set, spreads all the exams roughly equally across the scores from 1 to 10, and of course, if it works at all, most of the cancer cases will be scored very highly, like in this breakdown of all these data sets. And the idea is we draw a line somewhere, and we say, well, for cases that got a score above this threshold, we read them as we always do. The cases below that threshold, we pre-identify them as normal, and nobody reads them. And of course, that threshold could be anywhere above 5 or maybe above 2, depending on how aggressive we want to be.
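To make the bookkeeping of this triage idea concrete, here is a minimal sketch with synthetic scores and a made-up recall pattern; the prevalence, the score distributions, and the thresholds are assumptions for illustration, not the data from the studies discussed next.

```python
# Minimal sketch of score-based triage with synthetic data (not the actual study data):
# cases below a threshold are pre-classified as normal and never read by a radiologist.
import numpy as np

rng = np.random.default_rng(0)
n = 20_000
is_cancer = rng.random(n) < 0.006                     # screening-like prevalence (assumption)
# Hypothetical AI scores on a 1-10 scale: cancers tend to score high, normals spread out.
scores = np.where(is_cancer,
                  rng.integers(6, 11, n),             # most cancers near the top
                  rng.integers(1, 11, n))             # normals roughly uniform
radiologist_recalled = rng.random(n) < 0.08           # stand-in for human recalls (assumption)

def triage(threshold):
    read_by_human = scores >= threshold               # only these reach a radiologist
    workload_reduction = 1 - read_by_human.mean()
    missed_cancers = np.mean(is_cancer & ~read_by_human) / is_cancer.mean()
    # False positives avoided: recalls that would have happened on triaged-away normals.
    fp_avoided = np.mean(~is_cancer & radiologist_recalled & ~read_by_human) \
                 / np.mean(~is_cancer & radiologist_recalled)
    return workload_reduction, missed_cancers, fp_avoided

for thr in (2, 5):
    w, m, f = triage(thr)
    print(f"threshold {thr}: workload -{w:.0%}, cancers missed {m:.1%}, false positives -{f:.0%}")
```

The percentages reported in the talk come from actual retrospective reads; the sketch only shows how such numbers are computed once a threshold is chosen.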
So we tried this on our data set of 20,000 reads. We tried this retrospectively, and we showed that if we draw that line at 5, of course, we end up with about a 50% reduction in workload, but 7% of cancer cases actually marked as normal, while the benefit is, aside from reducing the workload, we have a 27% reduction in what was marked as false positives. If we want to be less aggressive, we can move that threshold over to the left. Now we have a 20% decrease in workload. We lose 1% of cancers, but we also get rid of 5% of the false positives. If we actually do this and look at the area under the curve, or the ROC curve, for setting the threshold at all those nine possible scores, aside from the very aggressive, you know, score-of-nine threshold, all the curves actually fall one on top of each other, which means the performance is the same, and the number of cancers and the recall rate that we end up with just depend on moving up and down the curve, changing our operating point. So we can reduce the workload and find exactly the same number of cancers, not necessarily the same ones, and do that faster. In a separate study from colleagues in Sweden who used the same software on their own data, on a part of the Malmo breast cancer screening trial for tomosynthesis, they did the same thing. They evaluated how many cancers would be pre-scored as normal, and for the eight cancers that would have been pre-scored as normal, they actually had three breast radiologists look at those cases to see whether those lesions were visible and why the AI missed them. And the thing is, thankfully, I think, the eight cancers were clearly visible. So, okay, that sounds like that's not a good idea. I always argue that with this concept of triaging, it doesn't matter how much the numbers show that it works. As long as somebody can pick up a mammogram and say, look, how could an AI algorithm say this is normal when even a physicist like me can see it, then that's a political bomb that will never fly. But if they are clearly visible, then there is hope that the next version of the AI algorithm, with better training, can actually pick them up. So, low-hanging fruit, possibly. In another publication that looked at this idea of triaging, they developed a deep learning model specifically for this use. They evaluated 26,000 cases read by 23 breast radiologists, only one reading each case, and they showed pretty much similar results: a 20% reduction in workload, the same sensitivity, or at least not statistically significantly different, with actually a slight increase in specificity. Of course, it's not just about the number of cancers and recalls, but also which cancers. One thing we need to look at is whether we are going to be finding fewer or more of the low-grade versus the high-grade cancers, whether there is a difference in the types of cancer that the AI scores incorrectly and calls normal versus the ones that it scores correctly and allows the radiologist to read. You know, we are moving to a point where it's not just about finding cancers, but finding the cancers that we need to find and do something about. So we do have to look at the grade of these different cancers that we're going to be looking at. However, I do have another couple of questions for you regarding this application. So, if we can move on back to the Menti system, and you can pick up your phones. Yeah, okay, apparently you've been answering this question already. So, all these studies up to now have been retrospective, right?
We take the reader results, we throw away the 50% or 20% of the cases, and we say this is what would have happened if we had triaging. But, we're not considering how you would read these cases. Would you read them the same way, or would your operating point vary? And, apparently, you think, and I agree, that you probably wouldn't read the same way. So, that's something that we haven't considered yet, and that's why we need to do this prospectively. So, next question, please. Do you want to know the answer? Do you want to know why the AI has marked this as needing to be human-read? I've asked this question to a few radiologists, and I've, up to now, up to right now, I've gotten a 50-50 response. So, apparently, Dutch radiologists feel differently than the audience. But, okay. So, you do want to know. Surprising. Good. So, but, actually, that makes sense to me, but, of course, I'm not a radiologist. Some radiologists tell me, no, I don't want to know, I want a clear mind, I want to do my thing, and then, if you want, you can tell me. But, apparently, those are the actual minority. So, if we go back to the slides. Thank you very much. So, if you do want to know, and we are going to show you the results, then we have both of what I showed you. We can read half the cases, so that's half the time, and you're going to be, you know, you're going to know what you're doing, so we can use the assisted reading, and that's, again, we showed perhaps a 50% reduction in time. So, we're actually going to end up gaining a 75% reduction from what not using the AI would give us in terms of terminal reading, according to the lab results. If we keep going, you're going to owe us time, as opposed to, you know, reading. Okay, so, there are other applications for AI in TOMO. There's a lot of applications in terms of physics that I'm not going to cover. This is other work we are doing, but regarding interpretation, there's actually, just a couple weeks ago, one FDA-approved product came out that the idea is to move from one millimeter separated slices to six millimeter slabs, and what's contained in those slabs is actually defined by an AI algorithm, so that's a very new development, and, of course, we, what I mentioned is moving from, you know, no reading, well, human reading to no reading for some cases, and, of course, that's the only option pretty much here in the U.S., where you do single reading, but in Europe, you know, double reading of screen is the standard, and perhaps we don't need to go from double reading to no reading, but AI plus single reading could be another application, and we're going to look into that. So final question for you, if we can go back to the Menti system. I would like to know, what is the atmosphere of, do you see AI as an opportunity? Great, you do. I wonder if a couple of years ago if we had asked that question, we would have seen different results. But this is encouraging. And nicely, not many people think this is all just hype. So if we move back to the slides. So the options that we talked about is we can read faster. Apparently, we can. We can use, however, we seem to need to use RAD-assisted AI reading as opposed to vice versa. Better, the lab results say yes. There is a need for large prospective screening trials. And hopefully, we will see those coming in the future. And of course, to read less and therefore gain time that way, we can do triaging, perhaps slabbing, single reading as opposed to double reading. And again, we need more prevalence trials. 
So thank you very much. And thank you for participating in the polls. So I'll preface by saying that I'm the CTO and one of the co-founders at Deep Health. I'll be talking about DBT. And two minutes ago, the majority of you voted that you think AI does have a good opportunity in DBT. So I hope we'll at least stay at that level and maybe even raise it. So today, I'll be talking about our efforts to build AI-optimized synthetic 2D images from DBT data. And I'll preface by saying I'm not a radiologist. I'm a computer scientist. And so while it may be obvious to any radiologist that 2D and 3D mammography are complementary, we've found this to be the case for AI as well. The core underlying challenge is the huge data set size of DBT, which really exacerbates the needle-in-the-haystack nature of mammography. And by that, I mean that lesions tend to take up a small percentage of the overall volume of pixels. And this number of pixels increases by two orders of magnitude when you go to DBT from FFDM. So you can imagine that if you were to give all these pixels to an AI model, it could easily learn features that are spurious and not actually predictive of cancer. Obtaining annotations can help along these lines and reduce overfitting. But they can also be costly and hard to scale, especially for DBT. So for all these reasons, while DBT gives more data and this can be beneficial, there's also a complementary nature to 2D images that can also help AI. Given the overall complementary nature between 2D and 3D mammography, most DBT systems offer a mode where they can generate synthetic 2D images from DBT stacks. And so there are many implementations of this, and you can imagine infinitely more ways to implement it. And the key is that there is plenty of room for improvement. Ultimately, you can think of the problem as developing an algorithm that takes a DBT stack as input and generates a synthetic 2D as output. One particular algorithm you could imagine would be a MIP-based approach, or maximum intensity projection. And so that could be one implementation you might try. For any given algorithm, you'd ultimately want to evaluate how well the algorithm did. And here we're proposing to use AI both for the development of the synthetics and as an initial evaluation of the synthetics themselves, which we can then maybe later follow with a reader study. Sorry about that. So now digging into the algorithm itself. As much as deep learning seems like magic, and I agree with Ioannis, some parts seem magical at times, you do have to encode intuition and structure into your model for it to learn effectively. With this in mind, we want to develop an algorithm that can ultimately optimize for cancer detection, but particularly take advantage of DBT. And one of the core motivations of DBT is handling tissue superposition. So we want to make sure we use this factor when we create our synthetics. And in particular, you can imagine a method like MIP might not use it well. If we're taking a maximum over pixels across all the slices, this essentially could undo a lot of the benefits of DBT. So with this intuition in mind, we've developed an approach where we take an AI model and sift through all the planes in a DBT stack. And at each spatial location, we find the region that looks the most suspicious, according to the AI model.
Then once we have these ROIs at different slices, we merge them onto a final synthetic 2D image and do some additional image processing to result in a final synthetic. And we are calling these synthetics DViews. And overall, you can think of this method as similar to a MIP, except for two key differences. Instead of taking a max over pixel intensity, we're taking a max over cancer suspicion level. And instead of operating at a spatial unit of a pixel, we're looking at ROIs, so bigger regions. So with our algorithm defined, we next designed a study to assess its performance. The study consists of three steps. First, we train the model used in the DView algorithm. Then after that, we train an AI classification model that takes as input the DView images. So this model will take the images as input and then output a score of how likely the model perceives that cancer is present. And then finally, we evaluate this classification model. And so for these three steps, we use three different data sets to best ensure generalizability. And a core benchmark we use is a comparison to the default synthetics. To make this comparison as fair as possible, we use a shared pipeline between the DView pipeline and the default synthetics and share a similar model and training scheme. So first, on a data set A, we train a 2D model. And then, given a data set B, we use that 2D model in the DView algorithm to generate the DViews. And then, from this same model, we fine-tune it on the DViews themselves. We do a similar thing with the default synthetics in parallel, where we take the original 2D model and fine-tune it on the default synthetics. Finally, we go to a data set C, where we evaluate the classification performance for both the classification model trained on the DViews and the one trained on the default synthetics. And you can see in the table the statistics of the data sets we used. The third data set, data set C, is our testing data set. It consists of 1,000 studies with 100 cancers, with the non-cancers consisting of negative BI-RADS or confirmed benigns. And as the results flash up real quick, I'll take another sip of water. So in our evaluation, we looked at two metrics: one is the AUC for an ROC curve, and the next is sensitivity at a fixed specificity, picking the specificity to match the national average according to BCSC. And so you can see that the DView synthetics led to higher performance than the default synthetics. Importantly, though, the predictions from the DViews and the defaults were complementary, in that if we combine the predictions, it leads to even better performance. Finally, here are some examples. In this first sample, there's a malignant mass. And you can see on the DView on the right, if you zoom in where the mass is, you can see the mass and some spiculations. On the default synthetic, it's nearly invisible. And indeed, our AI classification model gave this image a higher classification score for malignancy than on the default synthetic. Next, again, we have an example of a density where you can see the spiculations better in the DView. So this is another confirmed malignancy. And it was scored highly by the classification model. On the default synthetic, the score was much lower. And you can see a lot of the details are lost due to the tissue superposition in this default synthetic.
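To make the contrast with a MIP concrete, here is a minimal sketch of the two swaps just described: a max over an AI suspicion score instead of a max over pixel intensity, and ROIs instead of single pixels. The array shapes, the ROI size, and the toy suspicion function are assumptions; this illustrates the idea, not the DView implementation itself.

```python
# Sketch: collapse a DBT stack (slices x H x W) into a 2D image.
# A MIP takes the max *pixel intensity* over slices; the DView-style idea instead picks,
# per region, the slice whose ROI looks most suspicious to an AI model.
import numpy as np

def suspicion(roi: np.ndarray) -> float:
    """Stand-in for an AI model's cancer-suspicion score for one ROI (assumption:
    the real system uses a trained 2D network, not this toy local-contrast proxy)."""
    return float(roi.std())

def mip(stack: np.ndarray) -> np.ndarray:
    return stack.max(axis=0)                          # max intensity per pixel

def suspicion_merge(stack: np.ndarray, roi: int = 64) -> np.ndarray:
    n, h, w = stack.shape
    out = np.zeros((h, w), dtype=stack.dtype)
    for y in range(0, h, roi):
        for x in range(0, w, roi):
            patches = stack[:, y:y+roi, x:x+roi]      # same spatial ROI in every slice
            best = max(range(n), key=lambda k: suspicion(patches[k]))
            out[y:y+roi, x:x+roi] = patches[best]     # keep the most suspicious slice's ROI
    return out                                        # (real pipeline adds blending/post-processing)

stack = np.random.rand(40, 512, 512).astype(np.float32)   # dummy 40-slice DBT volume
print(mip(stack).shape, suspicion_merge(stack).shape)
```

The real pipeline uses a trained 2D model for the suspicion score and additional image processing at ROI borders, which this sketch glosses over.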
And the last one is another example of a malignant density where you can see the spiculations a little bit better in the DView than you can in the default synthetic, again, because of some of the noise from tissue superposition. So in conclusion, we developed a novel algorithm that collapses suspicious information across the DBT volume into an optimized 2D image, using AI for this approach. And compared to the default synthetics, our DViews enabled higher performance in AI-based classification, and the two were also complementary. And finally, while we have demonstrated that our approach can be useful in training AI algorithms, we also envision that it can potentially be an aid for interpreting radiologists, an avenue we wish to explore in the future. And then lastly, I want to thank one of our collaborators, Mac Bandler, for his continued collaboration, as well as some of our funding sources. By way of introduction to our project, the development of CAD and radiomics can support medical decision-making in diagnosis, prognosis, and treatment of cancer. Machine learning models can use information gathered retrospectively to predict diagnosis and prognosis of lesions prospectively. But we need to understand how combining populations of lesions affects the development and performance of machine learning models in breast cancer diagnosis and prognosis. So here's an example of the kind of breast images that we use in our work. This is a 3D DCE-MRI image, all stacked together. And at the University of Chicago, we use the Quantitative Radiomics Workstation to extract features. We start with 4D DCE-MRI, and a radiologist indicates the tumor center. Then the lesion undergoes segmentation automatically. And then we collect computer-extracted image phenotypes in the five categories seen here. These are five of the six that can be collected by this workstation; in our study, we used the five categories seen here. After extraction, the radiomic features can be merged into tumor signatures. And this is an example of a lesion after undergoing extraction. The motivation for our work is that many machine learning models for breast cancer lesion diagnosis and prognosis make use of images acquired within a single institution. So there is a need to characterize and better understand the classification performance of machine learning models when images of lesions are included from different populations, which may differ in several ways, including patient biology, screening protocols, and imaging protocols. So the objective of our study is to assess the impact of harmonization on the classification performance of radiomic features extracted from dynamic contrast-enhanced magnetic resonance images of breast lesions in two populations. Our hypothesis is that the harmonization of radiomic features between the two populations will improve classification performance using radiomic features in a combined data set from the two populations. This is a description of our database. We worked with 2,229 images of unique lesions, so one lesion per image across the two populations, and one lesion per patient. And as you can see here, there are some differences in the databases. For the United States data set, here are the numbers of benign lesions and malignant lesions, but they have also been separated out by field strength of acquisition as well as by biopsy status, that is, whether the lesion was imaged pre-biopsy or post-biopsy. And then we see the same numbers here for the China data set.
Some differences: most of the lesions in the China data set were imaged at 3T, for example, and most were imaged pre-biopsy. There were other differences between the two populations in terms of image acquisition. Most of the images in the United States data set were acquired in the axial plane, as compared to the sagittal plane in China. There were also differences in the time interval between post-contrast images, as well as in scanner manufacturer and time period of collection. We accomplished computerized feature extraction through these methods. The lesions were first segmented using a fuzzy C-means method. Thirty-two radiomic features were automatically extracted in the five phenotypic feature categories of size, shape, morphology, enhancement texture, and kinetic curve assessment. To accomplish the harmonization across the populations, we used the ComBat data harmonization method. Features were selected for harmonization according to their category's dependence upon system gain, resolution, and noise; that is, features in the categories of morphology, texture, and kinetics. We excluded the washout rate and curve index from harmonization, as well as the volume of most enhancing voxels. The method uses empirical Bayes methods to pool information across lesions in each population, also known as the batch, to shrink the batch-effect parameter estimates of mean and variance. The ComBat method has been previously used with features extracted from FFDM images and from PET images. Three covariates were used to retain the biological nature of the lesions. These were lesion type, benign or cancerous; the status of the lesions as having been imaged pre- or post-biopsy; and the field strength of image acquisition, 1.5 or 3T. To look at the effect of harmonization, the feature dimensions were reduced from 32D to 2D using t-distributed stochastic neighbor embedding (t-SNE), separately for features pre- and post-harmonization. The effect of harmonization was assessed separately for benign lesions and for cancers across the two populations using K-means clustering of the t-SNE values for the two groups. The Davies-Bouldin metric was used as a measure of inter- and intra-cluster agreement; that metric is the ratio of within-cluster distances to between-cluster distances, and an increase in the DB metric indicates that feature values are less clustered by population and thus more overlapped. Classification between malignant and benign lesions was performed using two sets of the 32 radiomic features: first a pre-harmonization feature set, all features in their original form, and then a post-harmonization feature set, made up of the features not eligible for harmonization combined with those that had undergone harmonization. Tenfold cross-validation was performed using a random forest classifier with 100 decision trees, and the classifier output was used in ROC analysis with the area under the ROC curve as the figure of merit. The difference in AUC was considered statistically significant if P was less than 0.05. So, first for results, here are examples of the effect of harmonization on the features. This figure shows, at the top, feature distributions pre-harmonization and then post-harmonization at the bottom. The blue is for the China lesions that were benign, and the green is for the U.S. lesions that were benign.
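For readers who want to see the evaluation steps just described laid out, here is a minimal sketch using scikit-learn with hypothetical feature arrays: t-SNE to two dimensions, K-means clustering scored by the Davies-Bouldin index, and a 100-tree random forest evaluated with ten-fold cross-validation and ROC AUC. The ComBat harmonization itself (the empirical Bayes batch adjustment with lesion type, biopsy status, and field strength as covariates) is assumed to have been applied to the feature matrix beforehand and is not re-implemented here; all array contents and variable names are placeholders.

```python
# Sketch of the evaluation pipeline with hypothetical inputs:
#   features: (n_lesions, 32) radiomic features (pre- or post-ComBat harmonization)
#   population: 0 = US, 1 = China;  malignant: 0 = benign, 1 = cancer
import numpy as np
from sklearn.manifold import TSNE
from sklearn.cluster import KMeans
from sklearn.metrics import davies_bouldin_score, roc_auc_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_predict

rng = np.random.default_rng(0)
features = rng.normal(size=(2229, 32))                # placeholder for the real feature matrix
population = rng.integers(0, 2, 2229)
malignant = rng.integers(0, 2, 2229)

def clustering_index(x, subset):
    """Davies-Bouldin index of 2-means clusters in t-SNE space; a higher value means the
    clusters (and hence the two populations) overlap more, as described in the talk."""
    emb = TSNE(n_components=2, random_state=0).fit_transform(x[subset])
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(emb)
    return davies_bouldin_score(emb, labels)

db_benign = clustering_index(features, malignant == 0)
db_cancer = clustering_index(features, malignant == 1)

# Ten-fold cross-validated random forest (100 trees), scored by ROC AUC.
clf = RandomForestClassifier(n_estimators=100, random_state=0)
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
probs = cross_val_predict(clf, features, malignant, cv=cv, method="predict_proba")[:, 1]
print(db_benign, db_cancer, roc_auc_score(malignant, probs))
```

Running this once on pre-harmonization features and once on post-harmonization features, and comparing the resulting DB indices and AUCs, mirrors the comparison reported in the results that follow.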
Here, these dots show the median of the distributions, and the line is drawn to show the difference, the change in the median after harmonization. And this is a raincloud plot, so you also see the individual data points here at the bottom. This is for the texture feature of entropy, and this is the same feature for the cancers in the two populations pre- and post-harmonization. Here is t-SNE space for the benign lesions. Remember that this helps us reduce the 32 dimensions down to two for the purposes of visualization. And again, blue is the China benign lesions, and green is the U.S. benign lesions, and you can see a clear separation in the space pre-harmonization. And then post-harmonization, you see more overlap. Similarly, for t-SNE space for the cancers pre-harmonization, you see the separation. Yellow is for the China cancer cases. Red is for the U.S. cancer cases. And then here is that t-SNE space post-harmonization. Here is the Davies-Bouldin index for degree of clustering. On the left is pre-harmonization, and then on the right, post-harmonization. Green is for the benign lesions, red for the cancers. The DB index increased by 39 percent for the benign lesions and by 31 percent for the cancers. The increase in the DB metric indicates that feature values are less clustered by population after harmonization and thus more overlapped. Here are the results for classification performance. This is the ROC curve here. The solid line is the ROC curve pre-harmonization, and then the dotted line is post-harmonization. And we see here a statistically significant increase in classification performance when using the post-harmonization features. In conclusion, data harmonization by the ComBat method resulted in a decrease in population-based separation of radiomic features, as indicated by the increase in the Davies-Bouldin metric for each type of lesion after feature dimension reduction through t-SNE. In the task of classifying lesions as benign or malignant, classification performance as given by AUC demonstrated a statistically significant increase when post-harmonization features were used compared to when pre-harmonization features were used. In the future, we are investigating classification performance in classification schemes using separate training and testing sets within the populations and between the populations. Thank you. I will go over breast ultrasound and MRI AI, and I've changed my format, is that okay? I've given a number of these here, and also tomorrow morning I'm talking on radiomics in general, so I've kind of taken a higher-level view and given a cursory overview of the field. And I'm going to thank my lab up front. What we do is discover new ways to use computers to enrich the information extracted from medical images so that radiologists can better find, diagnose, and understand disease. So, many of you probably know this, but I want to get these two terms very clear for the audience. If you think of Where's Waldo, Where's Wally, those books: detection is a computer algorithm that finds anything with red and white stripes, and computer-aided diagnosis is going to say whether those red and white stripes are Waldo or not, okay? And I want to pull these apart, although ultimately, and even now, they are being merged together. So, let's focus first on computer-aided detection. These are often used in breast cancer screening, and we've heard about full-field digital and tomo.
And ultrasound and MRI are now being used as adjuncts to breast cancer screening, especially in those women with dense breasts. Also, for example, with MRI, they're used in high-risk populations, such as those with dense breasts, a family or personal history, or genetic reasons. Now, I'm going to put this in because it gives the overall view of computer-aided detection. Everyone's doing AI now, and I think that's wonderful, and we're seeing all sorts of different ways to do things, but there's a lot that has been learned over the years. So this is our 1994 prototype. It's the first computer-aided detection system. It was very big. You can see the film digitizer, the computer, the printout, and the printout was mainly an arrow: look here, look there. Is this red and white stripes or not? And I want to put this in because this system included convolutional neural networks. It wasn't called that. It was called shift-invariant artificial neural networks. This is deep learning, though rather shallow still, but the concept was there, and we believe this is the first use of convolutional neural networks in medical image analysis. And this was in the detection of calcifications. And the reason I'm talking about this is because computer-aided detection, when it first came out, came out with the radiologists as the primary readers and the computer as the secondary reader. So the approved way of using computer-aided detection was for the radiologists to look at the image, look at the whole image, and once they found something, they could press a button, then request the computer result, and then use it. And the computer was used as a second reader. And this has evolved over the years. It was first approved in 98, and now many institutions use computer-aided detection in screening mammograms. The one thing I want to point out here, and I think this is important: as more AI techniques move over to clinical use, we have to ask the question, how are they being used? This was used as a second reader, and if you don't use it as a second reader, it won't work as it was intended to. However, what's going on now in screening? Now, with more 3D images used in screening, 3D ultrasound and breast MRI, computer-aided detection is moving from a second reader to a concurrent reader. And the goal here is, yes, to improve performance, but also to improve efficiency over the radiologists reading the 3D image data on their own. It's a lot more data to review, so could the computer act as a concurrent reader? And these have been looked at for breast ultrasound, tomo, and breast MRI. So I'm going to go through a few examples, and I'm going to start with this one here as an example of computer-aided detection in automated whole breast ultrasound. Zhang et al. looked at this system and evaluated it as a concurrent reader, that is, a radiologist sits down to read the image while also looking at the computer output at the same time. And as I go through this talk, because I only have 15, 20 minutes, I can't review everything. I'm going to go over some more of the research aspects tomorrow morning, but I want to point out that, as the earlier speaker said, we have to make sure what we're looking at is effective and safe for the patient. And so I've tried in these, where possible, to note whether it's gone through FDA or not. So this is an example of a system that's gone through FDA, and then a reading study was done to see how it affects reader performance.
And this showed a reduction in the exam interpretation time while maintaining diagnostic accuracy. Because for AI to be incorporated into the clinical workflow, we will have to be careful about making sure we maintain or improve efficiency. Here's another study focused on using a computer-aided detection algorithm on whole breast ultrasound. And again, here it was shown to supplement what the radiologist is doing as a concurrent read. We've had concurrent-read AI for breast tomosynthesis. And the study shown here, again using an FDA-cleared system, showed improved accuracy and efficiency when using it. And pushing computer-aided detection in breast cancer screening further: okay, we went from a second reader to a concurrent reader. Can we now push it to an independent reader? And now it's sometimes called CADt, using a computer to help triage. This is a simulation study where they asked the question of, instead of the radiologist looking at all cases, can some of them be told, you're cancer-free, come back next year, with no radiologist reading it. And they showed there that in the simulation study, 20 percent of mammograms were not seen by radiologists. And this showed improvement in radiologist efficiency and specificity without harming sensitivity. So now as we evaluate systems, we'll have to look at both: how does it help radiologists, and how does it improve efficiency? So that's what's going on in computer-aided detection, a little at the higher level rather than in specific techniques. So now if we move to computer-aided diagnosis, basically, how do we use a computer to help in workup? We can do this in full-field digital, tomo, ultrasound, both handheld and automated 3D, as well as MRI. And here the computer is to help characterize the lesion and potentially indicate a computer-determined probability of malignancy of the found lesion. So this is saying, I found red and white stripes, is it Waldo or not? And you can see where this has a clinical role because in workup, it's often used where a radiologist, and this is Dr. Gillian Newstead, looks at all three modalities at once. So there is information overload. And could the computer integrate all this and help with the decision-making? Many people are working on this, and you've heard talks today on both human-engineered radiomics as well as approaches using convolutional neural networks. And we've looked at both, basically looking at human-engineered radiomics where you can indicate the tumor. It's segmented by the computer. The computer extracts characteristics and then merges those with some classifier. Basically here, the computer is acting as a virtual biopsy using automatic segmentation and analysis. It's 3D. It's non-invasive. It covers the complete tumor, and it's repeatable. You can keep imaging the patient. And then after it's extracted, we can calculate various characteristics of the tumor such as size, shape, and morphology. And if it is a contrast-enhanced type of imaging technique, we can look at the texture of the enhancement, the curve shape, and the variance, and look at the kinetics of it. I want to point out one example of these radiomic features, and that is heterogeneity. This is an example for contrast-enhanced MRI, and this analysis is done in 4D. And these are the uptake curves. And within this tumor, these uptake curves vary greatly. So this is looking at the heterogeneity of that contrast uptake, the heterogeneity of the angiogenesis. And on this, you can do a couple of things.
You can look at the texture of that uptake. And if it's very heterogeneous, it tends to be cancerous. You can also look at things such as the most enhancing volume. So there are multiple radiomic features you can pull out by zooming into this uptake aspect. I'm not going to go into detail on that. There have been a lot of talks on that. Or we can now say, let's take a convolutional neural network radiomic approach. And you can do this multiple ways. You can train from scratch, or you can do transfer learning. The advantage here is you don't have to worry about lesion segmentation, and there's no extraction of features based on a segmented tumor. So it makes that training a little easier. However, if you're using transfer learning, you do end up with a lot of other features to worry about. This was looked at across modalities: digital mammography, hand-held ultrasound, and 3D and dynamic contrast-enhanced MRI. Here are the numbers of cases. This is the area under the ROC curve. All of them, and this is in a multifold analysis, show both methods performing promisingly. But all three showed a statistically significant increase when the human-engineered and the deep-learning methods were merged. So the point here is, let's continue down both tracks and keep looking at it. And some of the talks today did just that. Because all of them, here's the ultrasound and here's the MRI. The green is the fusion method that uses both human-engineered and deep-learning computer-aided diagnosis. If you're developing this and you're going to translate it, you need an efficient human-computer interface. And this is showing the one that we developed and showed here back in 2010. Basically, once you segment the lesion, various features are extracted, including size features such as volume and surface area, and various other characteristics, as well as an interface that looks at that most enhancing aspect. And this is a histogram of a known database where red is cancer, green is benign, and the unknown case is sitting here. In the area of computer-aided diagnosis, more and more systems are going through FDA and are shown here. This is the one that was tested here. And the reason I put this up is, yes, it's very important to know how well the computer is doing. But just like with detection, it's how well the radiologist is doing when they use it. So here's the blue curve for reading the MRI cases just clinically, and red is after including this computer aid. This is also being pushed further. Here's another study by another group looking at computer-aided diagnosis for abbreviated breast MRI. They found that the overall performance for abbreviated and full protocols was similar, but one would have to be careful about the kinetic features. And I'll bring this up in a second when we talk about using different data sets. Computer-aided diagnosis is also being looked at for breast ultrasound. This is currently on conventional ultrasound, however, and expected to go to 3D full breast ultrasound. This one is also FDA-cleared. And here it aligns the computer output with the appropriate BI-RADS categories. So these are serving as computer aids to help in the workup and diagnosis. Now as this is done, we get more and more cases every day. So the question is, should we stop and retrain, or should we keep training as we get more data? So an important aspect, for all of these systems, is the effect of continuous learning. And this is an example. This is actually being presented by Dr. Lee on Thursday here.
This is showing, and again, this is only a single institution, so it would have to be expanded, a set of roughly 2,000 cases, where some were from the first half of 2015, then all of 2015, then all of 2015 plus half of 2016, and then all of 2015 and 2016, all tested on an independent data set obtained from the year 2017. So the only change was what year the training data was obtained. And a significant improvement was shown for those using all the data. Now this is just one example. As data sets change and are added, as multi-institutional data is obtained, that will change. Also, we have found that as data comes from different systems, if you use a breast MRI system with 60-second time intervals and another with 90, the kinetics will be greatly different, and which features are important will change. But I think we have to focus on keeping these systems safe and effective, and also on how we're going to handle training as we go on. And my last part, the last objective, is: what about beyond detection and diagnosis? Where do we go? Could we use this as a virtual biopsy to inform multi-omics cancer discovery? So it's been shown for response, and this is from Nola Hylton, that a functional tumor volume can predict recurrence-free survival. Can this be more automated now with AI radiomics? And this is from Karen Drukker, showing the most enhancing tumor volume. Those enhancing voxels that I showed you a few slides earlier, if we just look at those and look at that volume, that's a very active volume within the tumor. And that also showed that it could be used for recurrence-free survival, showing here, for two different subtypes, a pretty big difference between survival and not, in predicting recurrence-free survival. And it's important to do this early on, either pretreatment or early in the treatment. But what about cancer discovery? So we can take an approach similar to genomics, where we first find relationships between imaging data, clinical data, molecular data, genomic data, and others. And then we could develop newer and newer predictive models. The studies that I've shown you so far are focused just on the imaging, because we want to push images as far as possible, and then look at combining them with the other omics data. So if you think about this as a process, a woman comes in, has a screening, might then go to diagnostic workup, and if it's cancer, excuse me, she'll have treatment planning followed by assessment of response, and then assessment of risk of recurrence. Now at this time, she has a biopsy. And at that point, when we can do discovery, at the biopsy, we can relate imaging to all these biological aspects. And what if we find a very unique signature? That signature might be on a breast MRI that we can use in breast MRI screening. Or we can use it later on in assessing risk of recurrence. So what we want to do is build predictive models, and these become our virtual biopsies. And people say, oh, you're going to replace biopsy? And my answer is no, unless, well, some people may need help if they're over-calling. Yes, they will benefit from such methods. But we also want to use virtual biopsy when an actual biopsy is not practical. So we want to learn here from discovery and maybe come up with a signature that we can use here. But we have to go through this first. We did some of this. This is with the NCI, the Cancer Genome Atlas, and the Cancer Imaging Archive Breast Phenotype Research Group, where we mapped breast MRI phenotypes to histopathology and genomics.
This is a very multidisciplinary team. We had our group on imaging, computer vision, and machine learning. We had computational geneticists. We had the NCI with all their expertise. We had various radiologists from across the nation. And we had cancer biologists, with Chuck Perou's group on this. And we looked to relate all of these. Here are some of the items that we found, where we found significant associations between ER, PR, and HER2 status and various radiomic features. And we also showed this in the hierarchical clustering, where these are radiomic features. So basically, when we work with these geneticists, they treat our radiomics as just another phenotype. And they clustered as expected. But I want to show you this one, where we looked at various cancer pathways and how they related to our various radiomic features of shape, size, and heterogeneity. And I know folks will say, I have 400 features. And I'll tell them, well, you have 400 mathematical descriptors of maybe six to ten characteristics. And I think that's very important. A lot of features will overlap, will be correlated. So these are roughly the broad categories. You can see we have multiple features in each of them. And what we found is that the transcriptional activities of various genetic pathways were positively associated with tumor size, blurred tumor margin, and irregular tumor shape. And that microRNA expressions were associated with the tumor radiomic phenotypes of size and enhancement texture, suggesting that microRNAs may mediate the growth of tumors and the heterogeneity of angiogenesis in the tumor. So keep in mind, this heterogeneity radiomic feature that we could visualize, because we did human-engineered radiomics there, is now being associated with the biology. So that's kind of a high-level overview of AI for breast ultrasound and MRI. There are various scientific and review courses going on this week. For the first two objectives, I think it's very important to keep in mind that if these are aids to radiologists, you have to do the reader study to see how the radiologists are going to perform with and without the computer. And if we go beyond that, and we're doing these virtual biopsies, yes, with these virtual biopsies and in diagnosis, we aim to help both the sensitivity and specificity of the overall radiologist read, but also to use them as virtual biopsies in cancer discovery. So I'd like to leave you with this. Keep in mind that AI is a very broad word. I use it to cover anything I don't do on my own, right? And you have to be careful about what biomedical question you're answering. Are you using it as a secondary reader? And if you are, it needs to be used as a secondary reader. Or is it being used as a concurrent reader, which I think things are more and more moving to? Could it be used independently to triage? And then how are we going to use it in cancer discovery? I'd like to end with: be careful, garbage in, garbage out. What you input to your algorithms, whether human-engineered or deep learning, is a very important point. And I think I'll stop there. Thank you.
Video Summary
Breast cancer imaging has evolved significantly with technology, particularly in the realm of computer-aided detection (CAD) and diagnosis, affecting how radiologists perform screenings. Initially, CAD served as a secondary tool for radiologists, marking potential abnormalities on imaging; however, technology advances have shifted its role to a concurrent reader in screening processes. CAD assists in evaluating dense breast tissues, especially in 3D imaging contexts such as whole breast ultrasound and dynamic contrast-enhanced MRI.

CAD systems have been developed to potentially enhance radiologist accuracy and efficiency, a necessity given the increasing volume of 3D imaging data. Studies show CAD's ability to reduce image interpretation time without sacrificing diagnostic accuracy, vital for screening programs with limited resources. Emerging techniques like AI-based synthetic 2D images generated from 3D tomosynthesis stacks add supplementary diagnostic value, assisting in overcoming tissue superposition challenges in breast imaging.

The conversation around CAD evolution includes independent reading, where AI alone might identify normal cases for triage, thereby reducing radiologists' workloads without negatively impacting sensitivity or specificity. Converging CAD and AI further impacts patient management by refining imaging protocols and workflow efficiencies, emphasizing a decrease in false-positive rates while maintaining high diagnostic standards.

Beyond CAD for detection and diagnosis, AI contributes to multi-omics cancer discovery through imaging, assessing tumor features like heterogeneity linked to genetic expressions, suggesting new pathways for precision medicine. Continuous learning from diverse data sets, addressing time-interval disparities, and integrating biologically relevant features into predictive models remain crucial objectives in the evolving imaging landscape. Emphasis lies in AI's role as an aid, not a replacement, ensuring adherence to clinically safe and effective practices while advancing clinical outcomes and patient care delivery.
Keywords
breast cancer imaging
computer-aided detection
radiologists
screenings
dense breast tissues
3D imaging
whole breast ultrasound
dynamic contrast-enhanced MRI
AI-based synthetic images
tissue superposition
independent reading
AI triage
multi-omics cancer discovery
precision medicine
false-positive rates