Advertisement
Research Article| Volume 25, 100194, March 2023

Download started.

Ok

Automated data extraction tool (DET) for external applications in radiotherapy

  • Mruga Gurjar
    Correspondence
    Corresponding author.
    Affiliations
    Medical Radiation Sciences, Institute of Clinical Sciences, Sahlgrenska Academy, University of Gothenburg, Sweden
    Search for articles by this author
  • Jesper Lindberg
    Affiliations
    Medical Radiation Sciences, Institute of Clinical Sciences, Sahlgrenska Academy, University of Gothenburg, Sweden

    Department of Medical Physics and Biomedical Engineering, Sahlgrenska University Hospital, Gothenburg, Sweden

    Regional Cancer Centre West, Western Sweden Healthcare Region, Gothenburg, Sweden
    Search for articles by this author
  • Thomas Björk-Eriksson
    Affiliations
    Regional Cancer Centre West, Western Sweden Healthcare Region, Gothenburg, Sweden

    Department of Oncology, Institute of Clinical Sciences, Sahlgrenska Academy, University of Gothenburg, Sweden
    Search for articles by this author
  • Caroline Olsson
    Affiliations
    Medical Radiation Sciences, Institute of Clinical Sciences, Sahlgrenska Academy, University of Gothenburg, Sweden

    Regional Cancer Centre West, Western Sweden Healthcare Region, Gothenburg, Sweden
    Search for articles by this author
Open AccessPublished:January 07, 2023DOI:https://doi.org/10.1016/j.tipsro.2022.12.001

      Highlights

      • DET generates ready-to-use reliable datasets and statistics from radiotherapy oncology information systems.
      • The user-friendly GUI supports easy extraction of information for external use in clinical or non-clinical settings.
      • Automatic data extraction and cleaning ensure fast data processing and reduce manual processing time.

      Abstract

      Purpose

      Oncological Information Systems (OIS) manage information in radiotherapy (RT) departments. Due to database structure limitations, stored information can rarely be directly used except for vendor-specific purposes. Our aim is to enable the use of such data in various external applications by creating a tool for automatic data extraction, cleaning and formatting. Methods and materials: We used OIS data from a nine-linac RT department in Sweden (70 weeks, 2015–16). Extracted data included patients’ referrals and appointments with details for RT sub-tasks. The data extraction tool to prepare the data for external use was built in C# programming language. It used excel-automation queries to remove unassigned/duplicated values, substitute missing data and perform application-specific calculations. Descriptive statistics were used to verify the output with the manually prepared dataset from the corresponding time period. Results: From the initial raw data, 2030 (51 %)/907 (23 %) patients had known curative and palliative treatment intent for 84 different cancer diagnoses. After removal of incomplete entries, 373 (10 %) patients had unknown treatment intents which were substituted based on the known curative/palliative ratio. Automatically- and manuallyprepared datasets differed < 1 % for Mould, Treatment planning, Quality assurance and ± 5 % for Fractions and Magnetic resonance imaging with overestimations in 80/140 (57 %) entries by the tool. Conclusion: We successfully implemented a software tool to prepare ready-to-use OIS datasets for external applications. Our evaluations showed overall results close to the manually-prepared dataset. The time taken to prepare the dataset using our automated strategy can reduce the time for manual preparation from weeks to seconds.

      Keywords

      Introduction

      Oncology Information Systems (OISs) are used to manage information in radiotherapy (RT) departments. With the vast inflow of cancer patients, large amounts of treatment-related data are continuously added to the OIS, partly also including information on patient demographics. These OIS data are valuable to assist in answering clinical as well as research questions and to provide quality insights for organizational improvements of RT. However, due to a lack of guidelines for data entry and database limitations in OISs, stored information can rarely be directly used for other than vendor-specific purposes [
      • Vieira B.
      • Hans E.W.
      • van Vliet-Vroegindeweij C.
      • van de Kamer J.
      • van Harten W.
      Operations research for resource planning and -use in radiotherapy: a literature review.
      ,
      • McNutt T.R.
      • Bowers M.
      • Cheng Z.
      • Han P.
      • Hui X.
      • Moore J.
      • et al.
      Practical data collection and extraction for big data applications in radiotherapy.
      ]. To enable use in various external applications, the raw data stored in OISs first need to be cleaned and formatted to fit their required input formats.
      Use of external applications with comprehensive high quality and consistent datasets is an important step to broaden the usage of OIS data from RT departments [
      • Babashov V.
      • Aivas I.
      • Begen M.A.
      • Cao J.Q.
      • Rodrigues G.
      • D'Souza D.
      • et al.
      Reducing Patient Waiting Times for Radiation Therapy and Improving the Treatment Planning Process: a Discrete-event Simulation Model (Radiation Treatment Planning).
      ]. Examples of such applications include decision-support tools where RT staff can be assisted in decisions on resource allocations according to available capacity [
      • Lindberg J.
      • Gurjar M.
      • Holmstrom P.
      • Hallberg S.
      • Bjork-Eriksson T.
      • Olsson C.E.
      Resource planning principles for the radiotherapy process using simulations applied to a longer vacation period use case.
      ,
      • da Silva R.B.Z.
      • Fogliatto F.S.
      • Krindges A.
      • dos Santos C.M.
      Dynamic capacity allocation in a radiology service considering different types of patients, individual no-show probabilities, and overbooking.
      ]. Other examples concern usage of dose/volume metrics in clinical studies, for quality assurance purposes or for dose–response modelling [
      • Gong Y.U.
      • Yu J.
      • Pang D.
      • Zhen H.
      • Galvin J.
      • Xiao Y.
      Automated Extraction of Dose/Volume Statistics for Radiotherapy-Treatment-Plan Evaluation in Clinical-Trial Quality Assurance.
      ,
      • Stervik L.
      • Pettersson N.
      • Scherman J.
      • Behrens C.F.
      • Ceberg C.
      • Engelholm S.
      • et al.
      Analysis of early respiratory-related mortality after radiation therapy of non-small-cell lung cancer: feasibility of automatic data extraction for dose-response studies.
      ]. Regardless of application, preparing OIS data for external use can be a lengthy (manual) task, particularly for situations where new datasets need to be extracted frequently [
      • McNutt T.R.
      • Bowers M.
      • Cheng Z.
      • Han P.
      • Hui X.
      • Moore J.
      • et al.
      Practical data collection and extraction for big data applications in radiotherapy.
      ,
      • da Silva R.B.Z.
      • Fogliatto F.S.
      • Krindges A.
      • dos Santos C.M.
      Dynamic capacity allocation in a radiology service considering different types of patients, individual no-show probabilities, and overbooking.
      ]. Data efforts in RT rarely focus on the practical side of data collection and extraction as most of the research focuses on development of applications to improve the workflow or other aspects of RT [
      • Vieira B.
      • Demirtas D.
      • B. van de Kamer J.
      • Hans E.W.
      • van Harten W.
      J BvdK, Hans EW, van Harten W: Improving workflow control in radiotherapy using discrete-event simulation.
      ]. So far, there are only a few contributors who emphasize the importance of creating means such as automated strategies for data retrieval to enable increased use of RT data outside OISs [
      • McNutt T.R.
      • Bowers M.
      • Cheng Z.
      • Han P.
      • Hui X.
      • Moore J.
      • et al.
      Practical data collection and extraction for big data applications in radiotherapy.
      ].
      In this study, we describe practical aspects of automating the process of preparing OIS data for external use. The purpose of this work is to create and verify a data extraction tool for a commercial OIS, to automate extraction, cleaning, and formatting of data for use in external applications. We tested the data extracted from the tool as input to an example external application to confirm and update previously reported results on resource use from a simulation model over the RT process [
      • Lindberg J.
      • Gurjar M.
      • Holmstrom P.
      • Hallberg S.
      • Bjork-Eriksson T.
      • Olsson C.E.
      Resource planning principles for the radiotherapy process using simulations applied to a longer vacation period use case.
      ].

      Materials and methods

      Data characteristics

      Information used for development and testing of the data extraction tool in this study was collected from a nine-linac RT department in Sweden. The OIS ARIA (Varian Medical Systems, Inc., Palo Alto, CA, U.S.A.) used in the department includes information about patient demographics, pre-treatment tasks including imaging, quality assurance (QA), treatment planning, and treatment delivery.
      Information required from the OIS ARIA for the example external application used to test performance of the tool (described below), needed to include specifics about each patient’s treatment path, divided into three parts. One part, referral data, with diagnosis code (ICD-10), treatment intent, and referral start dates. Another part, appointment data, with appointment details for mould, imaging (computed tomography [CT], magnetic resonance imaging [MRI], positron-emission tomography [PET]), and QA with unique appointment identifiers (ids). The third part, fraction data, included number of fractions corresponding to the appointment data.
      For tool development and verification, we used extracted data for a 16-month period (January 2015-April 2016) which were manually cleaned and formatted, referred to as the reference dataset. Subsequently, all the manual steps taken to prepare the data for the example external application were automated in the tool. The investigated application also required the structuring of data according to the largest diagnosis and treatment intent groups (corresponding to 80 % of data) and associated statistics for usage of tasks/resources in the RT workflow.
      Extracted raw data from OIS ARIA included inconsistencies in variables for both referral and appointment data, which had to be cleaned to meet the associated input data requirements. The raw data also included missing information which was handled by applying different substitution strategies per diagnosis (all assigned to 1. curative, 2. palliative, 3. equal split between curative/palliative, or 4. known ratio of curative/palliative; details in supplementary material).

      Tool development

      Details on the development of the tool can be found in the supplementary material. In short, the data extraction tool, primarily built in C# (Visual Studio IDE, Version 17.1, Microsoft, Washington, U.S.A), included excel-automation queries to remove unassigned/duplicated values, substitution of missing data, and execution of application-specific calculations. The overall process from retrieval of raw OIS data to cleaned and formatted input data ready for external use is presented in Fig. 1 and in Table 1. The cleaning and formatting steps take place on the back-end but are triggered by the user via the tool’s graphical user interface (GUI; Fig. 2).
      Figure thumbnail gr1
      Fig. 1Data Extraction Tool (DET) process. Stepwise automation of extraction, cleaning and formatting of data from the ARIA oncology information system (OIS) to fit the input data format of the here investigated example external application. Details of calculation statistics are given in the results.
      Table 1The synchronization between front-end and back-end operations of DET to automatically extract and export datasets for use in external applications, developed with C# in Microsoft’s Visual Studio Application.
      ProcessFront EndBack End
      1.Start application

      Specify time-period
      • 1.
        Server connectivity established
      2.Select Referral/Appointment subtask/Fraction data (Click Event)
      • 1.
        Database connectivity established
        Raw data extracted based on the specified time period and saved in an internal file (non-accessible for user) at a pre-defined location
      • 2.
        Existing Excel file with automated formulae for data cleaning automatically linked to the extracted file
      • 3.
        Cleaned data displayed on the GUI
      3.Select Export/Save to a user-defined location (Click Event)
      • 1.
        Cleaned data exported and existing Excel file with automated formulae for statistics (specific to the external application in use) automatically linked to the exported/saved file (user-accessible file 1)
      • 2.
        Statistics exported on application-specific format to a new excel file (user-accessible file 2)
      4.Loop according to default time interval settings or to user-defined time intervals
      Abbreviation: GUI: Graphical User interface.
      Figure thumbnail gr2
      Fig. 2Extraction tool graphical user interface. Dates are selected using the date picker at the top of the window. Clicking Referrals extracts all referral data for the selected time period presented in the white boxes. Similarly, selecting a sub-task will result in an extraction of all appointment data related to the sub-task. Abbreviations: CT = computed tomography, CT-site1 = CT at main department, CT-site2 = CT at satellite department, DET = Data Extraction Tool, MR = magnetic resonance imaging, PET = positron emission tomography, FRAC = fractions, QA = quality assurance.

      Tool verification

      Comparison of manually cleaned data vs automatically cleaned data

      We compared the reference dataset with the automatically-cleaned dataset for the same time period of 2015–16 (tool dataset). We considered three comparison points for verification purposes. For the data cleaning and formatting steps, we compared the ratio of curative to palliative referrals. The ratio was different for each diagnosis; hence, diagnose-specific ratios were assessed. We also compared the diagnosis-intent groups found after manual calculations and after automated calculations from the tool. Finally, the percentage of null values were removed, and percentage of duplicates removed for referral and appointment data were assessed for the tool dataset.

      External application example

      A simulation model of the RT workflow

      The simulation model used to evaluate the data from the extraction tool illustrates resource utilization at the same RT department as the data used for tool development were taken from. The intended use of the model was to help the department to plan for resource allocations [
      • Lindberg J.
      • Gurjar M.
      • Holmstrom P.
      • Hallberg S.
      • Bjork-Eriksson T.
      • Olsson C.E.
      Resource planning principles for the radiotherapy process using simulations applied to a longer vacation period use case.
      ]. The model separates the RT preparatory part from the treatment part and has previously been used to investigate different scenarios around the Swedish summer vacation period (June-August). Using the abovementioned reference dataset from 2015 to 2016, the most preferable scenario, minimizing the impact on the overall patient throughput without violating legislated staff vacation rights, was identified when the preparatory part vacation period started 1–2 weeks prior to the treatment part vacation period. To test performance of the extraction tool in practice, the original simulation results based on the reference dataset were visually compared to simulation results based on the tool dataset with respect to patterns of patients waiting for preparations, patients waiting for treatment, and patients under treatment. An automatically formatted second dataset for a comparable, but more recent, time period was also investigated (January 2020 to April 2021).

      Results

      Data characteristics – Automatically cleaned data from 2015 to 2016

      For the same time period as for the reference dataset, the raw data behind the tool dataset included 3916 patients before being automatically cleaned. The number of patients treated with known curative intent was 2030 (51 %) and with known palliative intent was 907 (23 %). Number of appointments ranged between 684 (MR) to 3735 (CT) with 58,228 fractions in total for all patients.

      Removal of duplicates and null values

      For referral data, after removal of duplicates and null values, the total number of patients was 3310 (Table 2). Patient referral data had a lower number of duplicates (1 %) and higher number of NULL diagnose entries (15 %) in comparison with all sub-tasks from appointment data where number of duplicates was higher (2–24 %) and number of NULL entries was lower (1–4 %). For appointment data, MR had the lowest number for both duplicates and NULL (2 %), QA had the highest number of duplicates (24 %), and Mould had the highest number of NULL (4 %). Patient fraction data had ≤ 1 % duplicates and ≤ 1 % NULL entries.
      Table 2Data-cleaning statistics for the automatically-cleaned dataset from 2015 to 16 including original (raw data) values, number of removals during data cleaning and final values post cleaning. Data extracted for patients treated with either palliative or curative intent at the RT department at the Sahlgrenska University Hospital in Sweden.
      Initial raw data No.Duplicates

      No. (%)
      NULL intent

      No. (%)
      Final No.
      Referral data (For all 84 diagnoses)
      Patients391628 (1)578 (NULL diagnosis [103], blank rows [107], irrelevant entries* [368]) (15)3310
      Appointment data (For all 84 diagnoses)
      Mould166236 (2)66 (4)1560
      CT- Site1373541 (1)52 (1)3642
      MRI68413 (2)17 (2)654
      QA1285304 (24)35 (3)946
      Fraction data (For all 84 diagnoses)
      Fractions58,228 fractionsNone4 (<1)58,224

      Abbreviations: CT = computed tomography, Initial No. = original number obtained from the raw data before data cleaning, MRI = magnetic resonance imaging, QA = quality assurance.
      *An irrelevant entry was defined as the entry with an unrecognized format or entry with corrupted data / meaningless data.

      Substitution strategies for treatment intent

      After having removed duplicates and null values, 373/3310 (12 %) of patient referrals had unknown treatment intent entries. Out of the remaining 2937 referrals with known treatment intent, 69 % had curative and 31 % had palliative intent. From the four investigated substitution strategies, the ratio substitution fared best with closest percentages and ordering for the six largest diagnosis-intent groups compared to the reference dataset (Table 3). Furthermore, the remaining diagnosis-intent groups of the reference dataset remained identical for the ratio substitution strategy with small changes in the ordering from groups 7 to 20. The other three strategies resulted in up to three diagnosis-intent groups that were not part of the largest 20 groups of the reference dataset.
      Table 3Substitution strategy results in terms of the sequence and percentage distributions of the largest 20 diagnosis and intent groups for all four strategies tested on the automatically-cleaned dataset from 2015 to 16 compared to the manually-cleaned dataset from that same time period.
      Automatically-cleaned dataset
      No.Manually- cleaned dataset (reference)All C

      (Strategy 1)
      All P

      (Strategy 2)
      50–50

      (Strategy 3)
      Ratio

      (Strategy 4)
      Diagnosis -Intent%Diagnosis -Intent%Diagnosis -Intent%Diagnosis -Intent%Diagnosis -Intent%
      1C50 - C22 %C50 - C18 %C79 – P*20 %C79 – P*19 %C50 – C21 %
      2C77 - P16 %C79 – P*19 %C50 – C*17 %C50 – C*16 %C77 – P17 %
      3C61 - C11 %C61 - C10 %C61 - C11 %C61 - C10 %C61 - C9 %
      4C34 - C6 %C34 - C6 %C34 - C6 %C34 - C6 %C34 - C6 %
      5C34 - P5 %C34 - P6 %C34 - P5 %C34 - P4 %C34 - P6 %
      6C71 - P2 %C61 – P*4 %C61 – P*5 %C61 – P*4 %C71 - P3 %
      7C61 - P2 %C71 - P2 %C71 - P2 %C71 - P2 %C61 - P2 %
      8C50 - P1 %C54 - C1 %C50 - P1 %C20 - C2 %C90 - P2 %
      9C90 - P1 %C90 - P1 %C90 - P1 %C50 - P1 %C09 - C1 %
      10C15 - C1 %C50 - P1 %C54 - C1 %C09 - C1 %C54 - C1 %
      11C54 - C2 %C09 - C1 %C09 - C1 %C90 - P1 %C53 - C1 %
      12C09 - C2 %C35 - C1 %C20 - C1 %C54 - C1 %C50 - P1 %
      13C20 - C2 %C20 - C1 %C35 - C1 %C53 - C1 %C20 - C2 %
      14C53 - C1 %C43 - C1 %C53 - C1 %C15 - C1 %C15 - C2 %
      15C01 - C1 %C77 - P1 %C77 - P1 %C77 - P1 %C77 - P1 %
      16C44 - C1 %C83 - C1 %C01 - C1 %C83 - C1 %C21 - C1 %
      17C77 - C1 %C44 - C1 %C44 - C1 %C01 - C1 %C01 - C1 %
      18C83 - C1 %C01 - C1 %C83 - C1 %C44 - C1 %C44 - C1 %
      19C21 - C1 %C21 - C1 %C21 - C1 %C21 - C1 %C83 - C1 %
      20C49 - C1 %C79 - C1 %C77 - C1 %C77 - C1 %C49 - C1 %
      Total (20)25992901291228912877
      Total

      patients ALL (84)
      320981 %331078 %331079 %331075 %331081 %
      Abbreviations: 50–50 = Split substitution between curative and palliative, All C = All curative, All P = All palliative, C-Curative intent, CXX: ICD-code for specific diagnosis, P-Palliative intent, Ratio = Palliative to curative ratio-based substitution per diagnosis, No. = Number indicating ordering of diagnosis and intent groups in the manually-cleaned reference dataset.
      *=diagnosis intent category dissimilar to the manually-cleaned dataset for the top six.
      Note that the 20 largest diagnosis and intent groups corresponded to 80% of all data.

      Comparisons between the manually cleaned reference data and the automatically cleaned tool data

      In comparison to the reference dataset, the overall differences in the total number of referred patients, number of appointments for different tasks and fractions for the tool dataset ranged between −5% to 5 % (Table 4). The smaller differences were found for Mould, TP andQA (<1%), and the larger differences were found for Fractions and MR (±5%). The tool data were numerically overestimated in 80/140 (57 %) of comparisons.
      Table 4Referral, appointment and fraction data statistics for the manually cleaned and automatically cleaned datasets from 2015 to 16.
      Diagn-osisPatientsMouldCTMRTPQAFractions
      IntentREFDETDET-REF (%)REFDETDET-REF (%)REFDETDET-REF (%)REFDETDET-REF (%)REFDETDET-REF (%)REFDETDET-REF (%)REFDETDET-REF (%)
      C50 - C−7167171 (0)45472 (4)748726−22 (−3)000 (N/A)837797−40 (−5)01111 (N/A)14,29913,859−440 (−3)
      C77-79 - P515504−11 (−2)3053083 (1)61263927 (4)3331−2 (−6)673631−42 (−6)36415 (14)20941982−112 (−5)
      C61 - C3453494 (1)110 (0)339318−21 (−6)339307−32 (−9)356328−28 (−8)238211−27 (−11)98309281−549 (−6)
      C34 - C20423430 (15)1091189 (8)232227−5 (−2)396 (200)257224−33 (−13)4841−7 (−15)41113962−149 (−4)
      C34 - P16219230 (19)1281368 (6)1851916 (3)352 (67)215204−11 (−5)1816−2 (−11)141814246 (0)
      C71 - P68680 (0)6865−3 (−4)7467−7 (−9)5046−4 (−8)7365−8 (−11)27314 (15)12751178−97 (−8)
      C61 - P569236 (64)24273 (13)609030 (50)93−6 (−67)7168−3 (−4)85−3 (−38)37041040 (11)
      C50 - P418241 (100)18268 (44)4722−25 (−53)220 (0)4946−3 (−6)132 (200)2412443 (1)
      C90 - P406626 (65)20211 (5)47470 (0)033 (N/A)5654−2 (−4)165 (500)21422410 (5)
      C15 - C446521 (48)42475 (12)46493 (7)30−3 (−100)6356−7 (−11)12175 (42)122412240 (0)
      C54 - C56648 (14)11121 (9)63685 (8)220 (0)6057−3 (−5)4838−10 (−21)144814579 (1)
      C09 - C496112 (24)50566 (12)46471 (2)46460 (0)54540 (0)4841−7 (−15)1622171189 (5)
      C20 - C495910 (20)110 (0)4633−13 (−28)033 (N/A)54584 (7)2921−8 (−28)821819−2 (0)
      C53 - C445410 (23)341 (33)3833−5 (−13)63−3 (−50)60677 (12)5243−9 (−17)1253128835 (3)
      C01 - C385113 (34)4038−2 (−5)359−26 (−74)3026−4 (−13)3934−5 (−13)4132−9 (−22)12701123−147 (−12)
      C44 - C315019 (61)3130−1 (−3)3227−5 (−16)20−2 (−100)36360 (0)139−4 (−31)71378269 (10)
      C77 - C48491 (2)20200 (0)43329 (725)102616 (160)5828−30

      (−52)
      24262 (8)83787639 (5)
      C83 - C324917 (53)3431−3 (−9)375821 (57)033 (N/A)4037−3 (−8)9123 (33)506504−2 (0)
      C21 - C354510 (29)440 (0)93324 (267)000 (N/A)4641−5 (−11)3432−2 (−6)935911−24 (−3)
      C49 - C26260 (0)2619−7

      (−27)
      28291 (4)000 (N/A)2928−1 (−3)31−2 (−67)55659842 (8)
      TOTAL (20)25992877278 (11)980101131 (3)2728274618 (1)538515−23 (−4)31262913−213

      (−7)
      690637−53 (−8)45,03743,857−1180 (−3)
      Other610433−177

      (−30)
      573548−25

      (−4)
      83489662 (7)149135−14 (9)7951003214 (27)25830951 (20)10,49714,3673870 (37)
      MEDIAN diff.22 %6 %9 %32 %21 %5 %7 %
      TOTAL ALL32093310101 (3)155315596 (0)3562364280 (2)687650−37 (−5)39213916−5 (0)948946−2 (0)55,53458,2242690 (5)
      Note that the third column under every sub-table (DET-REF[%]) indicates the difference in manually cleaned and automatically cleaned data represented in values and in percentages.
      Abbreviations: C = curative intent, CT = computed tomography, CXX = ICD-codes for specific diagnosis, DET = data extraction tool/automatically cleaned tool dataset, Median Difference = median difference for largest 20 diagnosis = intent categories, MR = magnetic resonance (imaging), P = palliative intent, QA = quality assurance, REF = reference/manually-cleaned dataset, TP = treatment plans. TOTAL (20) = total for the largest 20 diagnosis groups. Other = combined total for the remaining diagnosis groups. Median difference = median difference in the values from the largest 20 diagnosis groups.
      Note that the 20 largest diagnosis and intent groups corresponded to 80% of all data.

      External application example – Using both manually- and automatically-cleaned datasets from two time periods as input data

      Simulation results for the example external application based on the manually-cleaned reference dataset and the automatically-cleaned tool datasets for the time periods 2015–16 and 2020–21 are shown in Fig. 3 and Fig. 4. The largest 20 diagnosis and intent groups corresponded to 80 % of all data.
      Figure thumbnail gr3
      Fig. 3Simulation results for radiotherapy preparation and treatment steps during an eight-week summer vacation period for the radiotherapy department at the Sahlgrenska University Hospital in Sweden. a. manually-cleaned reference dataset for 2015–16, b. automatically-cleaned tool dataset for 2015–16, and c. automatically-cleaned tool dataset for 2020–21. The data here represents patients from the largest 20 diagnosis and intent groups corresponding to 80% of all patient data.
      Figure thumbnail gr4
      Fig. 4Yearly referral inflow pattern at the radiotherapy department at the Sahlgrenska University Hospital in Sweden for the investigated manually-cleaned and automatically-cleaned patient datasets. The data here represents all patient data from 2015 to 16 and 2020–21 with 84 different diagnoses. Abbreviation: DET = Data Extraction Tool.
      The outputs from the 2015–16 datasets differed with respect to the pattern of patients waiting for treatment in the automatically cleaned data compared to the manually cleaned reference data. The patterns of patients waiting for preparation and patients under treatment were otherwise similar between the two datasets.
      Over time, the ratio of curative to palliative treatments had remained similar but referral, appointment and fraction data had increased in 2020–21 in comparison to the situation in 2015–16. There were 81 % more referrals, 31 % more appointments, and 21 % more fractions. The output from the simulation model showed a somewhat different pattern of patients waiting for preparations in 2020–21 than in 2015–16. However, the pattern of patients waiting for treatment and patients under treatment showed uniform behaviour for both time periods in relation to their respective preparation and treatment capacities.

      Discussion

      In this work, we explored an automated data preparation approach based on previously used manual data cleaning and formatting principles for OIS data in RT. We successfully created and verified the ability of a novel data-extraction tool to time-efficiently prepare ready-to-use datasets for an external application of the RT domain. Using real patient data from a large modern RT department in Sweden, we found that original OIS data included both duplicated and missing information motivating both removal and substitution strategies in the data cleaning process. Information relating to referrals generally included more ambiguities than information relating to appointments for different RT subtasks including treatment fractions. Using a ratio substitution strategy for missing information on treatment intent resulted in numerically overall small differences between the investigated manually-cleaned reference dataset and the automatically-cleaned tool dataset for the same time period. The identified differences did not affect the output from the investigated external application.
      Radiation oncology is one of the most quantitative disciplines in the medical field, but tools which effectively make use of clinically registered data to support decision making in different RT settings are lacking [
      • Kalendralis P.
      • Sloep M.
      • van Soest J.
      • Dekker A.
      • Fijten R.
      Making radiotherapy more efficient with FAIR data.
      ]. A PubMed search on July 25th, 2022, gave only five relevant hits on different combinations of “tools”, “automation”, “extraction”, “datasets” and “radiotherapy”. All five publications were based on automatic extraction of RT data but used purpose-specific extraction features, and only supported explicit database structures. Two of these publications related to tools that were specifically or partly developed to capture dose/volume statistics for RT studies with researchers reporting the dataset preparation to be both labour intensive and challenging [
      • Gong Y.U.
      • Yu J.
      • Pang D.
      • Zhen H.
      • Galvin J.
      • Xiao Y.
      Automated Extraction of Dose/Volume Statistics for Radiotherapy-Treatment-Plan Evaluation in Clinical-Trial Quality Assurance.
      ,
      • Stervik L.
      • Pettersson N.
      • Scherman J.
      • Behrens C.F.
      • Ceberg C.
      • Engelholm S.
      • et al.
      Analysis of early respiratory-related mortality after radiation therapy of non-small-cell lung cancer: feasibility of automatic data extraction for dose-response studies.
      ]. In the study by Gong et al. in 2016, they compared the time to use their developed tool to automatically extract MIMvista data for dosimetry review according to a specific trial protocol with the time needed to obtain the same information manually [
      • Gong Y.U.
      • Yu J.
      • Pang D.
      • Zhen H.
      • Galvin J.
      • Xiao Y.
      Automated Extraction of Dose/Volume Statistics for Radiotherapy-Treatment-Plan Evaluation in Clinical-Trial Quality Assurance.
      ]. They found that the automatic extraction for a small-scale example (dose/volume points for 20 dose-volume histograms) was completed in 3 min whilst the manual work took 1 h. In the other more recent study by Stervik et al., they analyzed radiation-induced toxicity for lung cancer treatments with data collected from multiple hospitals. They addressed both the time aspect and challenges with RT data inconsistencies since the databases in question were not entirely consistent in terms of the availability and coordination between patient and treatment related data. They report that analysis of the extracted data with inconsistencies would have resulted in an unfitting selection of patient profiles for their study and especially found the identification of complex inconsistencies to be a lengthy task. They also noted that there are no strict guidelines for linking unique patient data to the corresponding appointment data in RT. In our experience, manual processing of such raw (atomic) data can initially take up to several weeks (primary preparation) while dataset updates take somewhat less (secondary preparation). Our tool overcomes these limitations by introducing a strategy which automates all the laborious procedures required to clean and process patient and appointment data. Its GUI is designed to be user friendly for the hospital staff. The execution to export the here investigated formatted datasets for both primary and secondary preparations took < 15 s.
      To generate a consistent dataset for external use from the raw dataset in our study, a substantial number of NULL and duplicate entries in both patient and appointment data had to be handled. These extracted data were entirely based on the information entered in the OIS-ARIA. Some missing data can potentially be retracted from the Electronic Medical Records (EMR) and manual efforts for retraction should be taken into consideration before applying any dataset-manipulation strategy [
      • Kalendralis P.
      • Sloep M.
      • van Soest J.
      • Dekker A.
      • Fijten R.
      Making radiotherapy more efficient with FAIR data.
      ]. However, If the data cannot be retraced, a substitution strategy for missing information can improve overall data veracity. Our dataset from 2020 to 21 included a high number of referrals with unknown treatment intents, which made a substitution strategy critical. Close to 40 % of data would have been lost if all missing intent referrals had been removed. The 2015–16 dataset had lower number of missing information so a substitution strategy was less critical for this dataset. Another strategy to handle missing data was implemented by Beesley. et. al. in 2019 where they used Markov chain Monte-Carlo algorithm to avoid the bias arising from missing OIS/RT data [
      • Beesley L.J.
      • Morgan T.M.
      • Spratt D.E.
      • Singhal U.
      • Feng F.Y.
      • Furgal A.C.
      • et al.
      Individual and Population Comparisons of Surgery and Radiotherapy Outcomes in Prostate Cancer Using Bayesian Multistate Models.
      ]. This strategy represents the use of probabilistic inference to stabilize the dataset pattern by only keeping the patient entries with no missing information. This did not affect their overall analysis since their study used more than 17 years of patient data. Such large time periods may not be available for some external applications.. To handle various dataset sizes, our strategy focuses on a pre-removal solution to retain the veracity by lowering the removal rate with probabilistic substitution first. In addition to challenges in handling missing data from the OIS database, patterns in RT may also change over time. Patient referral volumes are naturally affected due to the worldwide increasing number of cancer patients [
      • Ferlay J.
      • Colombet M.
      • Soerjomataram I.
      • Parkin D.M.
      • Pĩneros M.
      • Znaor A.
      • et al.
      Cancer statistics for the year 2020: An overview.
      ]. Along with that, appointment/fraction data may differ periodically given new evidence on treatment strategies or resizing of an RT department. For example, a recent 5-year trial including 4096 breast cancer patients, showed that a short-course radiation regimen (26 Gy; 5 fractions; 1 week) was as safe and effective as a protracted treatment course (40 Gy; 15 fractions; 3 weeks) [
      • Murray Brunt A.
      • Haviland J.S.
      • Wheatley D.A.
      • Sydenham M.A.
      • Alhasso A.
      • Bloomfield D.J.
      • et al.
      Hypofractionated breast radiotherapy for 1 week versus 3 weeks (FAST-Forward): 5-year efficacy and late normal tissue effects results from a multicentre, non-inferiority, randomised, phase 3 trial.
      ]. If accepted in the clinic, such results dramatically change the overall throughput of patients and the associated digital RT landscape. In our dataset from 2020 to 21, we noted a peculiarly high surge for curative prostate cancer patients (C61-C) with 279 % more referrals than in the dataset from 2015 to 16. The primary reason for this was the incorporation of a 3-linac satellite department’s data into the same OIS. To adapt to the abovementioned and other non-linear changes at RT departments, our tool by design supports external applications with yearly statistics for the specified variables.
      Strengths of our study include the use of real data from a large RT department to develop, test and verify the tool, including extensive datasets from two separate time periods. Our developed tool can be used by other RT departments with minor changes in the database queries. It can be customized for other external applications in clinical or non-clinical settings. For example, the tool can present statistics on historical data (summarized data with indexes like mean, median and standard deviation), and it can support time-sensitive applications with real-time cleaned datasets. Currently, the data extraction feature is only available for the ARIA (Varian) OIS. For non-Varian platforms, the tool can still be used for cleaning and preparation but requires that the extracted data from non-Varian databases are specified on a pre-defined input format. One limitation with our strategy is the substitution algorithm, which was based on the available data in the selected time frame. If available data are small in numbers, the substitution ratios can be biased and may influence the overall output from the external application. This can be handled by basing the substitution algorithm on historical data, if available, rather than just the selected time frame to improve data veracity. To simplify the patient-appointment data linking process, we assumed that the yearly number of patients and respective appointments were fitted within the investigated time frame. However, in reality, patients may have a larger gap (2 months or more) between referral and scheduled pre-treatment tasks having appointments falling outside a selected time frame. This can introduce a slight skewness in the distribution of appointments, but for larger datasets, as investigated in our study, this will typically be balanced by the set of appointments for such patients from an older time frame. In case of smaller datasets, the tool itself is equipped to handle these data ambiguity situations. When it comes to institute-specific issues which can introduce systematic errors, additional procedures or configuration changes can be implemented in the tool to compensate for this given that the concerns are brought forward and can be quantified. To summarize, our in-depth data cleaning and preparatory algorithm provides a reliable and fast approach to produce datasets for external use and is available on request.
      In conclusion, preparing OIS data for external applications can be time consuming. We successfully implemented a software tool to prepare ready-to-use OIS datasets for external applications. The evaluations for our investigated application showed overall results close to the manually-prepared dataset. Our tool can import data from a specified OIS and will automatically clean and prepare the information before formatting it for external use. In our experience, the time taken to prepare the dataset using our automated strategy can reduce the time for manual preparation from weeks to seconds. This novel approach can help to efficiently manage OIS data for external use and can bolster continuous data-driven development in RT departments.

      Funding

      This work was supported by The Swedish Research Council under Grant 2017-01735, the Swedish state under the agreement between the Swedish government and the county councils, the ALF-agreement under Grants ALFGBG-72011 and ALFGBG-965800, the Kamprad Family Foundation for Entrepreneurship, Research and Charity under Grant 20190083, and Jubileumsklinikens Cancer Research Foundation 2021:347.

      Declaration of Competing Interest

      The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

      Appendix A. Supplementary material

      The following are the Supplementary data to this article:

      References

        • Vieira B.
        • Hans E.W.
        • van Vliet-Vroegindeweij C.
        • van de Kamer J.
        • van Harten W.
        Operations research for resource planning and -use in radiotherapy: a literature review.
        BMC Med Inform Decis Mak. 2016; 16: 149https://doi.org/10.1186/s12911-016-0390-4
        • McNutt T.R.
        • Bowers M.
        • Cheng Z.
        • Han P.
        • Hui X.
        • Moore J.
        • et al.
        Practical data collection and extraction for big data applications in radiotherapy.
        Med Phys. 2018; 45: e863-e869https://doi.org/10.1002/mp.12817
        • Babashov V.
        • Aivas I.
        • Begen M.A.
        • Cao J.Q.
        • Rodrigues G.
        • D'Souza D.
        • et al.
        Reducing Patient Waiting Times for Radiation Therapy and Improving the Treatment Planning Process: a Discrete-event Simulation Model (Radiation Treatment Planning).
        Clin Oncol (R Coll Radiol). 2017; 29: 385-391https://doi.org/10.1016/j.clon.2017.01.039
        • Lindberg J.
        • Gurjar M.
        • Holmstrom P.
        • Hallberg S.
        • Bjork-Eriksson T.
        • Olsson C.E.
        Resource planning principles for the radiotherapy process using simulations applied to a longer vacation period use case.
        Tech Innov Patient Support Radiat Oncol. 2021; 20: 17-22https://doi.org/10.1016/j.tipsro.2021.10.001
        • da Silva R.B.Z.
        • Fogliatto F.S.
        • Krindges A.
        • dos Santos C.M.
        Dynamic capacity allocation in a radiology service considering different types of patients, individual no-show probabilities, and overbooking.
        BMC Health Serv Res. 2021; 21: 968https://doi.org/10.1186/s12913-021-06918-y
        • Gong Y.U.
        • Yu J.
        • Pang D.
        • Zhen H.
        • Galvin J.
        • Xiao Y.
        Automated Extraction of Dose/Volume Statistics for Radiotherapy-Treatment-Plan Evaluation in Clinical-Trial Quality Assurance.
        Front Oncol. 2016; 6: 47https://doi.org/10.3389/fonc.2016.00047
        • Stervik L.
        • Pettersson N.
        • Scherman J.
        • Behrens C.F.
        • Ceberg C.
        • Engelholm S.
        • et al.
        Analysis of early respiratory-related mortality after radiation therapy of non-small-cell lung cancer: feasibility of automatic data extraction for dose-response studies.
        Acta Oncol. 2020; 59: 628-635
        • Vieira B.
        • Demirtas D.
        • B. van de Kamer J.
        • Hans E.W.
        • van Harten W.
        J BvdK, Hans EW, van Harten W: Improving workflow control in radiotherapy using discrete-event simulation.
        BMC Med Inform Decis Mak. 2019; 19https://doi.org/10.1186/s12911-019-0910-0
        • Kalendralis P.
        • Sloep M.
        • van Soest J.
        • Dekker A.
        • Fijten R.
        Making radiotherapy more efficient with FAIR data.
        Phys Med. 2021; 82: 158-162https://doi.org/10.1016/j.ejmp.2021.01.083
        • Beesley L.J.
        • Morgan T.M.
        • Spratt D.E.
        • Singhal U.
        • Feng F.Y.
        • Furgal A.C.
        • et al.
        Individual and Population Comparisons of Surgery and Radiotherapy Outcomes in Prostate Cancer Using Bayesian Multistate Models.
        JAMA Netw Open. 2019; 2: e187765
        • Ferlay J.
        • Colombet M.
        • Soerjomataram I.
        • Parkin D.M.
        • Pĩneros M.
        • Znaor A.
        • et al.
        Cancer statistics for the year 2020: An overview.
        Int J Cancer. 2021; 149 (10.1002/ijc.v149.410.1002/ijc.33588): 778-789
        • Murray Brunt A.
        • Haviland J.S.
        • Wheatley D.A.
        • Sydenham M.A.
        • Alhasso A.
        • Bloomfield D.J.
        • et al.
        Hypofractionated breast radiotherapy for 1 week versus 3 weeks (FAST-Forward): 5-year efficacy and late normal tissue effects results from a multicentre, non-inferiority, randomised, phase 3 trial.
        Lancet. 2020; 395: 1613-1626