General-purpose clinical natural language processing (NLP) tools are
increasingly used for the automatic labeling of clinical reports. However,
independent evaluations for specific tasks, such as pediatric chest radiograph
(CXR) report labeling, are limited. This study compares four commercial
clinical NLP systems - Amazon Comprehend Medical (AWS), Google Healthcare NLP
(GC), Azure Clinical NLP (AZ), and SparkNLP (SP) - for entity extraction and
assertion detection in pediatric CXR reports. Additionally, CheXpert and
CheXbert, two dedicated chest radiograph report labelers, were evaluated on the
same task using CheXpert-defined labels. We analyzed 95,008 pediatric CXR
reports from a large academic pediatric hospital. Entities and assertion
statuses (positive, negative, uncertain) from the findings and impression
sections were extracted by the NLP systems, with impression section entities
mapped to 12 disease categories and a No Findings category. CheXpert and
CheXbert extracted the same 13 categories. Outputs were compared using Fleiss
Kappa for inter-system agreement and accuracy against a consensus pseudo-ground
truth. Significant differences were found across NLP systems in both the number
of extracted entities and the distribution of assertion statuses. SP extracted
49,688 unique entities, GC
16,477, AZ 31,543, and AWS 27,216. Assertion accuracy across models averaged
around 62%, with SP highest (76%) and AWS lowest (50%). CheXpert and CheXbert
achieved 56% accuracy. The considerable variability in performance across
systems highlights the need for careful validation and review before deploying
NLP tools for clinical report labeling.
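
To make the evaluation step concrete, the sketch below computes Fleiss Kappa
across systems and per-system accuracy against a consensus for a single disease
label. This is a minimal illustration under assumptions not stated above:
assertion statuses are encoded as integer codes, the consensus pseudo-ground
truth is taken as a simple majority vote across systems, and the data, function
names (e.g., fleiss_kappa), and use of Python/NumPy are hypothetical rather
than the study's actual pipeline.

```python
import numpy as np


def fleiss_kappa(ratings: np.ndarray, n_categories: int) -> float:
    """Fleiss' kappa for an (n_subjects x n_raters) matrix of category codes."""
    n_subjects, n_raters = ratings.shape
    # Count how many raters assigned each subject to each category.
    counts = np.zeros((n_subjects, n_categories))
    for j in range(n_categories):
        counts[:, j] = (ratings == j).sum(axis=1)
    p_j = counts.sum(axis=0) / (n_subjects * n_raters)   # overall category proportions
    P_i = (np.square(counts).sum(axis=1) - n_raters) / (n_raters * (n_raters - 1))
    P_bar, P_e = P_i.mean(), np.square(p_j).sum()
    return (P_bar - P_e) / (1 - P_e)


# Toy example: assertion status for one disease label from four systems over
# six reports, encoded as 0 = negative, 1 = positive, 2 = uncertain.
labels = np.array([
    [1, 1, 1, 2],
    [0, 0, 0, 0],
    [1, 2, 1, 1],
    [0, 0, 1, 0],
    [2, 2, 2, 1],
    [1, 1, 0, 1],
])

kappa = fleiss_kappa(labels, n_categories=3)

# Majority-vote consensus as a pseudo-ground truth (ties resolve to the lowest
# code here; the study's actual consensus rule is not specified above), then
# per-system accuracy against that consensus.
consensus = np.array([np.bincount(row, minlength=3).argmax() for row in labels])
accuracy = (labels == consensus[:, None]).mean(axis=0)

print(f"Fleiss kappa: {kappa:.3f}")
print("Per-system accuracy vs consensus:", np.round(accuracy, 2))
```

In the reported evaluation this computation would be repeated per disease
category and aggregated, but the per-label form above captures the core of the
agreement and accuracy comparison.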