
Data Cleansing Procedure

The raw flight log messages obtained from AirData UAV often contain various forms of noise that can significantly hinder subsequent Natural Language Processing (NLP) tasks. This section describes our three-step data cleansing procedure, which transforms the raw messages into a structured, consistent, and semantically clearer format suitable for robust NLP analysis.


1. Raw Data & Repositories

The raw, un-cleansed AirData messages and all intermediate and final cleansing files are hosted in a dedicated GitHub repository for full transparency and reproducibility.


2. Cleansing Procedure: A Three-Step Process

Our data cleansing was a multi-stage process, with each step targeting a specific type of textual noise. All corrections were documented in separate log files and applied using Python scripts to ensure a reproducible pipeline.
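As a rough illustration of how a documented correction log can be replayed by a script, the sketch below maps each original message to its corrected form and records which entries changed. The function name `apply_corrections`, the dictionary-based log layout, and the sample messages are assumptions made for this example; they do not describe the actual format of the log files or scripts in the repository.

```python
def apply_corrections(messages, correction_log):
    """Apply logged corrections and report which messages changed.

    `correction_log` is a hypothetical dict mapping an original message
    to its corrected text; messages without an entry are kept as-is.
    """
    corrected = []
    changed_indices = []
    for index, message in enumerate(messages):
        fixed = correction_log.get(message, message)
        if fixed != message:
            changed_indices.append(index)
        corrected.append(fixed)
    return corrected, changed_indices


# Hypothetical sample data, not taken from the real dataset.
sample_log = {"Battery low.Land now.": "Battery low. Land now."}
sample_messages = ["Battery low.Land now.", "Takeoff complete."]
fixed_messages, changed = apply_corrections(sample_messages, sample_log)
```

Keeping the corrections in a separate log and applying them from a script means the same input always yields the same output, which is the property a reproducible pipeline depends on.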

2.1. Step 1: Ending Period Correction

The first step was to ensure every message ended with a period. Many messages in the raw collection lacked one, and terminal punctuation is essential for accurate sentence boundary detection in subsequent NLP tasks.

  • Procedure: A Python script was used to automatically append a period to any message that did not already end with one. This was a simple, rule-based correction.
  • Resulting File: corrected_data_step_1.xlsx
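The rule described above can be sketched in a few lines of Python. This is a minimal illustration rather than the project's actual script: the function name `ensure_terminal_period`, the stripping of trailing whitespace, and the handling of empty messages are assumptions. Whether messages ending in other marks such as "?" or "!" should be left alone is also not stated, so it is exposed as the hypothetical `terminal_marks` parameter.

```python
def ensure_terminal_period(message, terminal_marks=(".",)):
    """Append a period unless the message already ends with a terminal mark.

    `terminal_marks` is a hypothetical knob: by default only "." counts,
    matching the rule as described; pass (".", "?", "!") to skip others.
    """
    text = message.rstrip()  # assumption: trailing whitespace is noise
    if not text:
        return text  # assumption: empty messages are left empty
    if text.endswith(tuple(terminal_marks)):
        return text
    return text + "."


# Hypothetical sample messages, not taken from the real dataset.
examples = ["Motor overload", "Landing gear lowered.", "Return home now "]
cleaned = [ensure_terminal_period(m) for m in examples]
```

The function is idempotent, so running it over an already cleansed file leaves every message unchanged.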

2.2. Step 2: Missing Punctuation and Space Correction

This step addressed more complex syntactic errors, such as missing spaces, concatenated words, and misplaced commas. These issues must be fixed because they directly affect tokenization and linguistic parsing.
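Some of these errors follow regular patterns that a script could catch. The sketch below is one possible approach using Python's standard `re` module, not the method actually used for this step: the function name `fix_spacing`, the three rules, and the sample message are all assumptions. Concatenated words are deliberately out of scope here, since splitting them generally requires a vocabulary or manual review.

```python
import re


def fix_spacing(message):
    """Apply a few illustrative (assumed) spacing rules to one message."""
    # Remove stray whitespace before a comma: "weak ,fly" -> "weak,fly".
    text = re.sub(r"\s+,", ",", message)
    # Add a space after a comma that is directly followed by a letter.
    # Restricting to letters leaves numbers such as "1,000" untouched.
    text = re.sub(r",(?=[A-Za-z])", ", ", text)
    # Add a space after a period wedged between a lowercase and an
    # uppercase letter; decimals such as "3.5" do not match.
    text = re.sub(r"(?<=[a-z])\.(?=[A-Z])", ". ", text)
    return text


# Hypothetical sample message, not taken from the real dataset.
sample = "Signal weak ,fly with caution.Return home."
result = fix_spacing(sample)
```

Pattern-based rules like these are conservative by design; anything they miss would still need to be corrected by hand and recorded in the log.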

2.3. Step 3: Over-generalized Message Correction

The final and most complex step involved addressing messages that were syntactically correct but semantically over-generalized or ambiguous. This process required domain expertise to infer a more precise meaning where possible.


3. Cleansing Logs & Final Output

The comprehensive cleansing log, detailing every correction made across all three steps, is provided in the correction_log files for full transparency. The final output of our cleansing procedure, corrected_data_step_3.xlsx, serves as the input for our annotation tasks.

Final Note on Data

Once our annotation tasks are complete, this data will also be uploaded to Zenodo, which will provide a persistent, citable DOI and give the community programmatic access via CLI tools.