Description
I had existing code that downloaded clinical data from the TARGET-NBL cohort using the GDCquery_clinic function, downloading the 842 cases that have clinical data. The code is:
GDCquery_clinic(project = "TARGET-NBL", type = "clinical")
Previously, it worked perfectly, but months later I ran the exact same script again and it downloaded a different set of 842 cases that only partially overlapped with the original set. After days of digging, I found the issue:
For some reason, on line 241, there is an if statement with this code that tests if the project is a TCGA project or not:
if (grepl("TCGA",project)){
And only if this evaluates to TRUE, part of the case filter for the API request URL will include selecting only cases where the field files.data_category is equal to "Clinical" (lines 246-247). However, my project was TARGET-NBL, so the else block will instead execute, which leaves out this case filter in the API request URL. Without this filter, there are over 1100 cases to choose from (looking at all cases, whether or not they have "Clinical" data).
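To make the difference between the two branches concrete, here is a rough sketch of what the case filter looks like with and without the data category clause. It follows the GDC API's nested `op`/`content` filter format, but `build_case_filter()` is a hypothetical helper written for this illustration (not the TCGAbiolinks code), and the project field name `project.project_id` is an assumption.

```r
# Hypothetical helper, not part of TCGAbiolinks: builds a GDC-style
# case filter as a nested list using base R only.
build_case_filter <- function(project, clinical_only = TRUE) {
  in_clause <- function(field, value) {
    list(op = "in", content = list(field = field, value = as.list(value)))
  }
  # "project.project_id" is an assumed field name for the cases endpoint.
  clauses <- list(in_clause("project.project_id", project))
  if (clinical_only) {
    # This is the clause that is skipped for non-TCGA projects.
    clauses <- c(clauses, list(in_clause("files.data_category", "Clinical")))
  }
  list(op = "and", content = clauses)
}

with_filter    <- build_case_filter("TARGET-NBL", clinical_only = TRUE)
without_filter <- build_case_filter("TARGET-NBL", clinical_only = FALSE)

length(with_filter$content)     # two clauses: project and data category
length(without_filter$content)  # one clause: project only

# Optional: print the JSON form if jsonlite happens to be installed.
if (requireNamespace("jsonlite", quietly = TRUE)) {
  print(jsonlite::toJSON(with_filter, auto_unbox = TRUE))
}
```

With `clinical_only = FALSE` the request only restricts by project, which matches the behavior described above for TARGET-NBL.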
Importantly, though, on line 233, we set up the "size" (case count) filter, which sets size=842 in the API request URL. Therefore, when we call the function on TARGET-NBL, the request still asks for only 842 cases, so it ends up returning an arbitrary 842 of the more than 1100 cases. This is why the code downloaded a different set of 842 cases the second time.
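The effect of a fixed size applied to a larger, unfiltered pool can be mimicked with a small simulation. This is only a model: the case IDs are made up, the pool is assumed to hold exactly 1100 cases, and shuffling stands in for whatever order the API happens to return (which I have not verified).

```r
set.seed(42)  # only so the illustration is repeatable

# Made-up case IDs; 1100 stands in for the "over 1100" cases in the pool.
all_cases <- sprintf("CASE-%04d", 1:1100)
size <- 842

# Model "no guaranteed order" by shuffling before taking the first `size` cases.
first_run  <- head(sample(all_cases), size)
second_run <- head(sample(all_cases), size)

shared <- intersect(first_run, second_run)
length(shared)  # somewhere between 584 and 842

# Two 842-element subsets of a 1100-element pool must share at least
# 842 + 842 - 1100 = 584 cases, so partial overlap is unavoidable.
min_overlap <- 2 * size - length(all_cases)
```

Under these assumptions the two runs always overlap, but they are very unlikely to be identical, which is consistent with getting a different yet overlapping set of 842 cases.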
The patch I used was just to get rid of the if statement on line 241 and execute the code in the if-true block always. I'm not sure if there was ever a rationale for this if-else code block, but at least for TARGET projects, we need to execute the if-true block that adds the "files.data_category" case filter.
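For anyone who wants to check whether their own downloads are affected, one option is to compare the case IDs from an earlier run with those from a new run. The sketch below uses a hypothetical helper, `compare_case_sets()`, and toy IDs; with real data the ID vectors would come from an identifier column of the returned data frame (for example `submitter_id`, which is an assumption about the output).

```r
# Hypothetical helper: report which case IDs are shared between two runs
# and which appear in only one of them.
compare_case_sets <- function(old_ids, new_ids) {
  list(
    shared   = intersect(old_ids, new_ids),
    only_old = setdiff(old_ids, new_ids),
    only_new = setdiff(new_ids, old_ids)
  )
}

# Toy IDs for illustration only.
old_ids <- c("A", "B", "C", "D")
new_ids <- c("B", "C", "D", "E")

cmp <- compare_case_sets(old_ids, new_ids)
cmp$only_old  # "A"
cmp$only_new  # "E"

# The download is reproducible only if both difference sets are empty.
reproducible <- length(cmp$only_old) == 0 && length(cmp$only_new) == 0
```

If either `only_old` or `only_new` is non-empty for the same project and the same code, the result set changed between runs.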