The reconstruction method starts by grouping n-gram entries by source URL and combining the "pre", "ngram", and "post" fields into textual fragments. These fragments are then joined by detecting word overlaps and considering positional metadata (article deciles). The method includes logic to correct GDELT-specific artefacts, such as misplaced end-of-article content.
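To make the pipeline concrete, here is a minimal Python sketch of the grouping-and-merging idea. The field names (`url`, `pre`, `ngram`, `post`, `decile`) follow the GDELT n-gram layout described above, but the function names and the greedy overlap heuristic are illustrative simplifications, not gdeltnews's actual implementation:

```python
from collections import defaultdict

def merge_fragments(a, b, min_overlap=1):
    """Join two fragments on the longest word-level suffix of `a`
    that prefixes `b` (illustrative heuristic)."""
    wa, wb = a.split(), b.split()
    for k in range(min(len(wa), len(wb)), min_overlap - 1, -1):
        if wa[-k:] == wb[:k]:
            return " ".join(wa + wb[k:])
    return None  # no overlap found; caller keeps the fragments apart

def reconstruct(entries):
    """Rebuild one article per URL from n-gram entries shaped like
    {'url', 'pre', 'ngram', 'post', 'decile'} (a sketch, not the
    gdeltnews API)."""
    by_url = defaultdict(list)
    for e in entries:
        by_url[e["url"]].append(e)

    articles = {}
    for url, group in by_url.items():
        # Positional metadata (article deciles) gives a coarse ordering.
        group.sort(key=lambda e: e["decile"])
        fragments = [" ".join((e["pre"], e["ngram"], e["post"])).strip()
                     for e in group]
        merged = [fragments[0]]
        for frag in fragments[1:]:
            joined = merge_fragments(merged[-1], frag)
            if joined is not None:
                merged[-1] = joined   # fragments chain via shared words
            else:
                merged.append(frag)   # gap in coverage; start a new run
        articles[url] = " ".join(merged)
    return articles
```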
For validation, the authors matched 2,211 articles reconstructed from GDELT data to the original full texts obtained from EventRegistry, covering major U.S. news outlets. After cleaning and tokenising both sets, they compared them using Levenshtein similarity and SequenceMatcher similarity, two metrics that are sensitive to word order, which is critical when the goal is a coherent article narrative.
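Both metrics are available in standard Python tooling. A minimal sketch, assuming the `python-Levenshtein` package and the standard-library `difflib`; the paper's exact cleaning and tokenisation steps are omitted here:

```python
from difflib import SequenceMatcher
import Levenshtein  # pip install python-Levenshtein

def similarities(original: str, reconstructed: str) -> tuple[float, float]:
    """Return (Levenshtein similarity, SequenceMatcher similarity),
    both in [0, 1]; higher means closer to the original text."""
    lev = Levenshtein.ratio(original, reconstructed)
    seq = SequenceMatcher(None, original, reconstructed).ratio()
    return lev, seq
```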
Without filtering, reconstructed articles reached around 75% similarity to the originals; after filtering for articles with at least 80% token overlap, similarity rose to 95%. These results indicate strong reconstruction fidelity, with the remaining differences attributable to minor noise and textual variation.
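The 80% threshold can be read as a multiset token overlap. The sketch below shows one plausible definition; the exact formula used in the paper may differ:

```python
from collections import Counter

def token_overlap(original_tokens, recon_tokens):
    """Share of the original's tokens (with multiplicity) that also
    appear in the reconstruction (one plausible definition)."""
    orig, recon = Counter(original_tokens), Counter(recon_tokens)
    shared = sum((orig & recon).values())  # multiset intersection size
    return shared / max(sum(orig.values()), 1)

def passes_filter(original_tokens, recon_tokens, threshold=0.8):
    # Keep only pairs with at least 80% token overlap before scoring.
    return token_overlap(original_tokens, recon_tokens) >= threshold
```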
Limitations include the absence of article titles in GDELT's n-gram dataset and slow single-process performance, although a parallel version of gdeltnews mitigates the latter, as sketched below. Future improvements aim to support non-space-separated languages and further improve efficiency.
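Because each article is reconstructed independently per URL, the work shards naturally across processes. The following `multiprocessing` sketch illustrates the idea only; it is not gdeltnews's actual parallel implementation, and the per-URL worker is deliberately simplified (no overlap merging):

```python
from multiprocessing import Pool

def reconstruct_group(args):
    """Worker: rebuild one article from its n-gram entries
    (simplified stand-in for the per-URL merging logic above)."""
    url, entries = args
    entries = sorted(entries, key=lambda e: e["decile"])
    text = " ".join(" ".join((e["pre"], e["ngram"], e["post"])).strip()
                    for e in entries)
    return url, text

def reconstruct_parallel(by_url, processes=4):
    # Articles are independent, so the work shards cleanly by URL.
    with Pool(processes=processes) as pool:
        return dict(pool.map(reconstruct_group, list(by_url.items())))
```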
You can learn more by reading the entire paper or accessing the tool on GitHub.