I have a dream that one day all public information and all software will be free, transparent, and open source, and all people will have the skills to benefit from it. We are not living in my dream world yet. But here at STAT, we’re trying to do our part to open up a little more data stored in government databases.

Our first major project after we launched was a deep data-intensive dive into clinical trials compliance. In December 2015, we published “Failure to Report,” an investigation of the reporting of human study results to ClinicalTrials.gov by government, academic, and industry scientists. We found flagrant violations of federal law and non-existent enforcement.

We didn’t just publish and hope for the best. We went back this year to check whether research organizations were doing a better job posting results. That appears to be mostly true.


But we don’t want you to just take our word for it. The principles of transparency and replication are as important to us as data journalists as they are to researchers.

So we have outlined below the specifics of the methods we used. The details certainly aren’t for everyone who reads STAT, but we’re showing our work for the compliance and data science professionals among you who want to check what we’ve done.

We also have heard from a number of drug companies, who have asked questions about the data and technical details of our methods.

This table (.csv file) summarizes our findings, and it should answer most of the questions.

All of the data are publicly available.

Here’s how we did our analysis.

Step One: Getting the data

We downloaded all ClinicalTrials.gov study record content on September 11, 2017. This is a very large ZIP file (more than 1 GB) containing, in XML format, all registration information and any available results information for all studies. (The structure of study records in XML is defined by this XML schema. See more information on ClinicalTrials.gov.)
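For readers who want to work with the download themselves, here is a minimal Python sketch of reading study records straight out of a ZIP archive without unpacking it. It is not the program STAT ran. The function names and the tiny in-memory archive are hypothetical stand-ins so the example runs without the real file.

```python
import io
import zipfile
import xml.etree.ElementTree as ET


def iter_studies(zip_source):
    """Yield the root element of every XML study record in a ZIP archive."""
    with zipfile.ZipFile(zip_source) as archive:
        for name in archive.namelist():
            if not name.lower().endswith(".xml"):
                continue  # skip directories and non-XML members
            with archive.open(name) as handle:
                yield ET.parse(handle).getroot()


def build_demo_zip():
    """Build a tiny in-memory archive (hypothetical content) for the demo."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr(
            "NCT00000000.xml",
            "<clinical_study><id_info><nct_id>NCT00000000</nct_id></id_info>"
            "</clinical_study>",
        )
        archive.writestr("README.txt", "not a study record")
    buffer.seek(0)
    return buffer


demo_ids = [root.findtext("id_info/nct_id") for root in iter_studies(build_demo_zip())]
print(demo_ids)
```

With the real download, iter_studies would take the path to the ZIP file instead of the in-memory buffer.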

Step Two: Data gap

Some data were not available for download from ClinicalTrials.gov — including data that would have allowed us to exclude from the reporting violators list studies exempted from reporting results because the drug being tested was being assessed by the Food and Drug Administration for initial approval or a new use. So we obtained from the National Institutes of Health a list of trial sponsors that had requested and been granted a “certification for delay” (.xlsx file).

But NIH informed us after publication that trials completed before Jan. 18, 2017, could be exempted without requesting an extension, so the NIH list was incomplete. This means some trials included in our violators list might have been exempt from the reporting requirement. There is no practical way to verifiably identify all exempt trials.

Step Three: Collection

We wrote a Python program to convert 254,163 studies (individual XML files, see step 1) and 5,403 CSV rows (see step 2) into structured JavaScript Object Notation, or JSON, data that we could analyze and visualize.
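As a rough illustration of this conversion step, the sketch below turns one study record into a flat dictionary and prints it as JSON. The field names and element paths come from the data legend at the end of this article. The sample record, the FIELD_PATHS table and the function name are hypothetical, and the real program handled many more fields.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical, heavily trimmed study record.
SAMPLE_XML = """<clinical_study>
  <required_header><url>https://example.org/NCT00000000</url></required_header>
  <id_info><nct_id>NCT00000000</nct_id></id_info>
  <brief_title>Hypothetical study</brief_title>
  <overall_status>Completed</overall_status>
  <phase>Phase 2</phase>
</clinical_study>"""

# JSON key -> path relative to the <clinical_study> root element.
FIELD_PATHS = {
    "key": "id_info/nct_id",
    "title": "brief_title",
    "link": "required_header/url",
    "status": "overall_status",
    "phase": "phase",
}


def study_to_record(root):
    """Collect the selected fields; a missing element becomes None."""
    return {field: root.findtext(path) for field, path in FIELD_PATHS.items()}


record = study_to_record(ET.fromstring(SAMPLE_XML))
print(json.dumps(record, indent=2))
```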

An aside:

– We used Python and JavaScript to process and visualize the data; they are free and open source.

– We created a spreadsheet (.csv file), which you can try to use. Warning: It may be slow to open, depending on your hardware and software, and saving it with spreadsheet software can corrupt many cells. For example, when you open our CSV file, your spreadsheet software will most likely apply its own date formatting to the cells containing dates. The dates will look correct until you save the file. After you save and reopen it, all the dates will be corrupted (see pictures below).

Pic. 1 – Before the file was saved with spreadsheet software

Pic. 2 – After the file was saved with spreadsheet software and reopened with the same software in the same CSV format.

Spreadsheet software may also change the character encoding, which garbles the names of foreign entities that use non-Latin characters, among other errors that may occur.
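One way to avoid both problems is to read the CSV with a script rather than a spreadsheet. The sketch below uses Python's built-in csv module, which keeps every cell as plain text and so never reformats dates. The column names, the sample row and the UTF-8 encoding are assumptions made for the demonstration, not a description of our file.

```python
import csv
import os
import tempfile


def read_rows(path, encoding="utf-8"):
    """Read a CSV file into a list of dicts, leaving every value as text."""
    with open(path, newline="", encoding=encoding) as handle:
        return list(csv.DictReader(handle))


# Demo with a throwaway file; the header and row are hypothetical.
with tempfile.TemporaryDirectory() as folder:
    demo_path = os.path.join(folder, "demo.csv")
    with open(demo_path, "w", newline="", encoding="utf-8") as handle:
        handle.write("key,parent,completion_date\n")
        handle.write("NCT00000000,Université Exemple,March 2014\n")
    rows = read_rows(demo_path)

print(rows[0]["parent"], "|", rows[0]["completion_date"])
```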

Step Four: Analysis (Here’s where it gets really wonky)

All violations were grouped by the parties responsible for reporting trial results, and responsible parties were grouped by category of research organization.
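A minimal sketch of that two-level grouping, assuming each trial is a dictionary with the "newclass" (category) and "parent" (responsible party) fields described in the data legend. The function name and sample trials are hypothetical.

```python
from collections import defaultdict


def nest_trials(trials):
    """Group trials as category -> responsible party -> list of trials."""
    nested = defaultdict(lambda: defaultdict(list))
    for trial in trials:
        nested[trial["newclass"]][trial["parent"]].append(trial)
    # Convert to plain dicts so the result can be written out as JSON.
    return {category: dict(parties) for category, parties in nested.items()}


# Hypothetical trials for illustration only.
sample_trials = [
    {"key": "NCT00000001", "parent": "Example Pharma", "newclass": "Industry"},
    {"key": "NCT00000002", "parent": "Example Pharma", "newclass": "Industry"},
    {"key": "NCT00000003", "parent": "Example University", "newclass": "Academic"},
]

grouped = nest_trials(sample_trials)
print(sorted(grouped))
```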

Here’s an example of what the resulting JSON file looks like:


The program parsed through each XML file, collected necessary information about the study, and filtered only trials whose findings were required by law to be reported.

Parts of an XML document can be accessed using path expressions. Let’s use the following XML document to give an XPath example:

The path to a trial’s NCT number is <clinical_study/id_info/nct_id>. It is the “key” in the deepest layer of our JSON file: the ClinicalTrials.gov Unique Protocol Identification Number (see ClinicalTrials.gov Data Element Definitions).
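To illustrate path expressions in practice, here is a small sketch using Python's built-in xml.etree.ElementTree module, which supports a limited subset of XPath. The sample document is hypothetical and much shorter than a real study record. Because the parsed root is itself the <clinical_study> element, paths are written relative to it.

```python
import xml.etree.ElementTree as ET

# Hypothetical, trimmed study record.
SAMPLE_XML = """<clinical_study>
  <id_info>
    <nct_id>NCT00000000</nct_id>
  </id_info>
  <intervention><intervention_type>Drug</intervention_type></intervention>
  <intervention><intervention_type>Behavioral</intervention_type></intervention>
</clinical_study>"""

root = ET.fromstring(SAMPLE_XML)  # root is the <clinical_study> element

# One value: text of the first element matching the path.
nct_id = root.findtext("id_info/nct_id")

# Many values: every element matching the path.
intervention_types = [
    node.text for node in root.findall("intervention/intervention_type")
]

print(nct_id, intervention_types)
```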

Step Five: Filtering

We filtered the data to remove trials whose findings were not required by law to be reported. This included trials completed prior to 2008, phase 1 safety trials, and those completed less than a year earlier.

Filter steps:

1. We checked that the trial’s NCT number is not in the list of studies that had received an official reporting extension (see step 2).

2. If #1 is true then we checked if any of <clinical_study/intervention//intervention_type/> is ‘Biological’, ‘Device’ or ‘Drug’.

3. If #2 is true then we checked if any of <clinical_study/phase/> is not “N/A”, “Early Phase 1” or “Phase 1”.

4. If #3 is true we checked if any of <clinical_study/location_countries//country> is ‘United States’. We also included trials that failed to disclose information on the location of trial sites.

5. If #4 is true then we checked if XML file has <clinical_study/firstreceived_results_date/>

6. If #5 is true then we checked if the XML file has <clinical_study/primary_completion_date>. The trial is a violation if primary_completion_date is between December 26, 2007, and September 11, 2016, and firstreceived_results_date minus primary_completion_date minus 395 days > 0. (We added 30 days to the 365-day grace period because many trials have only a month and a year on file, but no exact date.)
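The first five filter steps can be sketched roughly as follows. This is an illustration under stated assumptions, not the code we ran: the function and constant names are hypothetical, a record with no phase element is allowed through step 3, and the date arithmetic of step 6 is left out.

```python
import xml.etree.ElementTree as ET

COVERED_INTERVENTIONS = {"Biological", "Device", "Drug"}
EXCLUDED_PHASES = {"N/A", "Early Phase 1", "Phase 1"}


def is_covered(root, exempt_ids):
    """Apply filter steps 1 to 4 to one parsed <clinical_study> element."""
    # Step 1: skip trials that received an official reporting extension.
    if root.findtext("id_info/nct_id") in exempt_ids:
        return False
    # Step 2: at least one intervention must be a biological, device or drug.
    types = {node.text for node in root.findall("intervention/intervention_type")}
    if not types & COVERED_INTERVENTIONS:
        return False
    # Step 3: drop excluded phases (a missing phase passes; an assumption).
    if root.findtext("phase") in EXCLUDED_PHASES:
        return False
    # Step 4: keep U.S. trials and trials that list no countries at all.
    countries = {node.text for node in root.findall("location_countries/country")}
    return not countries or "United States" in countries


def has_results(root):
    """Step 5: does the record carry a results date?"""
    return root.find("firstreceived_results_date") is not None


# Hypothetical record for illustration.
DEMO = ET.fromstring(
    "<clinical_study><id_info><nct_id>NCT00000000</nct_id></id_info>"
    "<intervention><intervention_type>Drug</intervention_type></intervention>"
    "<phase>Phase 3</phase>"
    "<location_countries><country>United States</country></location_countries>"
    "</clinical_study>"
)
print(is_covered(DEMO, exempt_ids=set()), has_results(DEMO))
```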

Step Six: Reconciliation

Some trials failed to disclose information on either start_date or primary_completion_date, or both. In these cases, we used any available alternative date fields (e.g., if the XML file did not have <clinical_study/primary_completion_date>, we checked whether it had <clinical_study/completion_date> as an alternative).

The trial was counted as a violation if the completion date was between December 26, 2007, and September 11, 2016, and firstreceived_results_date minus primary_completion_date minus 395 days > 0.
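The date test can be sketched as below. The two date formats ("March 15, 2014" and "March 2014"), the choice to treat a month-only date as the first of the month, and the inclusive window boundaries are assumptions made for this illustration, as are all the names. Month names are parsed with the English locale that Python uses by default.

```python
from datetime import date, datetime

GRACE_DAYS = 395  # 365-day grace period plus 30 days for month-only dates
WINDOW_START = date(2007, 12, 26)
WINDOW_END = date(2016, 9, 11)


def parse_registry_date(text):
    """Parse 'March 15, 2014' or 'March 2014' (day defaults to the 1st)."""
    for fmt in ("%B %d, %Y", "%B %Y"):
        try:
            return datetime.strptime(text.strip(), fmt).date()
        except ValueError:
            continue
    raise ValueError("unrecognized date: " + repr(text))


def days_late(completion_text, results_text):
    """Results date minus completion date minus the grace period, in days."""
    completed = parse_registry_date(completion_text)
    posted = parse_registry_date(results_text)
    return (posted - completed).days - GRACE_DAYS


def is_late(completion_text, results_text):
    """True if the completion date is in the window and results came late."""
    completed = parse_registry_date(completion_text)
    in_window = WINDOW_START <= completed <= WINDOW_END
    return in_window and days_late(completion_text, results_text) > 0


# 471 days elapsed minus 395 days of grace leaves 76 days late.
print(days_late("March 2014", "June 15, 2015"))
```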

A data legend:

The reference below describes available key-value pairs in the JSON file (the trial level). The trials are nested in the records by parties responsible for reporting trial results, and responsible parties are nested by category.

“key” – ClinicalTrials.gov Unique Protocol Identification Number, <clinical_study/id_info/nct_id>

“title” – brief title, <clinical_study/brief_title>

“link” – study url, <clinical_study/required_header/url>

“sponsor” – responsible party or name of the lead sponsor (if responsible party information is missing in trial’s XML file), <clinical_study/responsible_party/investigator_affiliation>, or <clinical_study/responsible_party/organization>, or <clinical_study/sponsors/lead_sponsor/agency>

“parent” – name of responsible party cleaned by STAT (we replaced original name (col. A) with cleaned (col. C) including different spelling of the same company and subsidiaries)

“agency_class” – agency class of lead sponsor, <clinical_study/sponsors/lead_sponsor/agency_class>

“newclass” – class assigned by STAT (we replaced original agency_class (col. B) with cleaned (col. D))

“status” – overall recruitment status, <clinical_study/overall_status>

“phase” – study phase, <clinical_study/phase>

“enrollment” – total number of participants that are enrolled in the clinical study, <clinical_study/enrollment>

“deaths” – <clinical_study/reported_events//event[sub_title/text() = 'Total, all-cause mortality']/counts/@subjects_affected>

“completion_date” – <clinical_study/primary_completion_date> or <clinical_study/completion_date> (if <clinical_study/primary_completion_date> is missing in trial’s XML file)

“results_date” – <clinical_study/firstreceived_results_date>

“completion_date_f” – “completion_date” in datetime format

“start_date” – <clinical_study/start_date> or <clinical_study/firstreceived_date> (if <clinical_study/start_date> is missing in trial’s XML file)

“start_date_f” – “start_date” in datetime format

“has_results” – “Yes” if the eligible study’s XML file has <clinical_study/firstreceived_results_date>, else: “No, as of 09/11/2017”

“late” – the difference in days that we calculated in filter step 6 (see Step Five)

“has_results15” – defines if the eligible trial was required to post results as of 09/11/2015 based on “completion_date” < “September 11, 2014”

“late15” – defines if the eligible trial has results but posted them late based on “completion_date” < “September 11, 2014” and “firstreceived_results_date” < “September 11, 2015”

“violationsperparent”, “late15perparent”, “no17perparent”, “totalperparent”, “no15perparent”, “late17perparent” and “total15perparent” are totals of the relevant records that share the same parent name. For example, “no17perparent” is the number of records with a given “parent” value whose “has_results” value is “No, as of 09/11/2017”.
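As an illustration of how such per-parent totals can be computed, the sketch below derives “totalperparent” and “no17perparent” for a handful of hypothetical trial records. The function name and sample data are assumptions. The other counters would follow the same pattern with a different condition.

```python
from collections import Counter

NO_RESULTS_2017 = "No, as of 09/11/2017"


def add_parent_counts(trials):
    """Attach per-parent totals to every trial record (modified in place)."""
    total = Counter(trial["parent"] for trial in trials)
    missing = Counter(
        trial["parent"] for trial in trials
        if trial["has_results"] == NO_RESULTS_2017
    )
    for trial in trials:
        trial["totalperparent"] = total[trial["parent"]]
        trial["no17perparent"] = missing[trial["parent"]]
    return trials


# Hypothetical records for illustration only.
sample = add_parent_counts([
    {"key": "NCT00000001", "parent": "Example Pharma", "has_results": "Yes"},
    {"key": "NCT00000002", "parent": "Example Pharma", "has_results": NO_RESULTS_2017},
    {"key": "NCT00000003", "parent": "Example University", "has_results": "Yes"},
])
print(sample[0]["totalperparent"], sample[0]["no17perparent"])
```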
