- (1) International Center for Climate and Environment Sciences, Institute of Atmospheric Physics, Chinese Academy of Sciences, Beijing, China, 100029. (songxinyi231@mails.ucas.ac.cn; tanzhetao19@mails.ucas.ac.cn; viktor.gouretski@posteo.de; chenglij@mail.iap.a
- (2) University of Chinese Academy of Sciences, Beijing, China. (songxinyi231@mails.ucas.ac.cn; tanzhetao19@mails.ucas.ac.cn; chenglij@mail.iap.ac.cn)
- (3) NOAA National Centers for Environmental Information, Silver Spring, MD, United States. (ricardo.locarnini@noaa.gov; tim.boyer@noaa.gov)
- (4) Istituto Nazionale di Geofisica e Vulcanologia (INGV), Bologna, Italy. (simona.simoncelli@ingv.it; francoreseghetti@gmail.com)
- (5) Climate Science Centre, Environment, Commonwealth Scientific and Industrial Research Organisation (CSIRO), Hobart, TAS, Australia. (Rebecca.Cowley@csiro.au)
- (6) Physical Oceanography Laboratory, Department of Geophysics, Tohoku University, Sendai, Japan. (kizu@tohoku.ac.jp)
- (7) Italian National Agency for New Technologies, Energy and Sustainable Economic Development (ENEA), Santa Teresa Research Centre, Pozzuolo di Lerici, Italy. (francoreseghetti@gmail.com)
- (8) Scripps Institution of Oceanography (SIO), University of California, San Diego, La Jolla, United States. (castelao@gmail.com)
A high-quality hydrographic observational database is essential for ocean and climate studies and for operational applications. Because numerous global and regional ocean databases exist, duplicate data remain a persistent problem in data management, data processing, and database merging, making it difficult to use oceanographic data effectively and accurately to derive robust statistics and reliable data products. This study provides an algorithm that identifies duplicates and assigns labels to them. We propose, first, definitions of exact duplicates and possible duplicates and, second, an open-source, semi-automatic system (named DC_OCEAN) that detects duplicate data and erroneous metadata through crude screening followed by target screening, after which the identified duplicates are reviewed in a manual expert check. The robustness of the system is evaluated on a subset of the World Ocean Database (WOD18) containing over 600,000 in-situ temperature and salinity profiles. The system is distributed as an open-source Python package, and users can customize its settings. The result of applying it to the WOD18 subset also forms a benchmark dataset, which is available to support future studies on duplicate checking, metadata error identification, and machine learning applications. This duplicate checking system will be incorporated into the International Quality-controlled Ocean Database (IQuOD) data quality control system to ensure the uniqueness of ocean observation data in that product.
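The two-stage idea described above (a cheap crude screening that narrows the candidates, then a targeted pairwise comparison that assigns labels) can be illustrated with a minimal Python sketch. This is not the DC_OCEAN implementation or its API. The abstract does not give the screening criteria or the formal duplicate definitions, so everything here is an assumption made for illustration: the profile dictionary layout, the function names `crude_key`, `classify_pair`, and `find_duplicates`, the rounding precision, the tolerance, and the simplified rule that "exact" means identical metadata and values while "possible" means only one of the two matches.

```python
from collections import defaultdict
from itertools import combinations


def crude_key(profile, decimals=1):
    """Crude screening (assumed rule): coarse fingerprint from rounded position and date."""
    return (
        round(profile["lat"], decimals),
        round(profile["lon"], decimals),
        profile["date"],
    )


def classify_pair(a, b, tol=1e-3):
    """Target screening (assumed rule): compare metadata and temperature values of one pair."""
    same_meta = (a["lat"], a["lon"], a["date"]) == (b["lat"], b["lon"], b["date"])
    same_vals = len(a["temp"]) == len(b["temp"]) and all(
        abs(x - y) <= tol for x, y in zip(a["temp"], b["temp"])
    )
    if same_meta and same_vals:
        return "exact"
    if same_meta or same_vals:
        # Only one aspect matches: flag for the manual expert check.
        return "possible"
    return None


def find_duplicates(profiles):
    """Group profiles by crude key, then label every pair inside each group."""
    buckets = defaultdict(list)
    for p in profiles:
        buckets[crude_key(p)].append(p)

    labelled = []
    for group in buckets.values():
        for a, b in combinations(group, 2):
            label = classify_pair(a, b)
            if label is not None:
                labelled.append((a["id"], b["id"], label))
    return labelled


if __name__ == "__main__":
    # Hypothetical profiles, invented for this example only.
    profiles = [
        {"id": "A", "lat": 10.0, "lon": 120.0, "date": "2001-05-01", "temp": [20.0, 18.5, 15.0]},
        {"id": "B", "lat": 10.0, "lon": 120.0, "date": "2001-05-01", "temp": [20.0, 18.5, 15.0]},
        {"id": "C", "lat": 10.02, "lon": 120.0, "date": "2001-05-01", "temp": [20.0, 18.5, 15.0]},
        {"id": "D", "lat": -35.0, "lon": 150.0, "date": "2010-01-01", "temp": [12.0, 11.0]},
    ]
    for pair in find_duplicates(profiles):
        print(pair)
```

Bucketing first means the pairwise comparison runs only within small candidate groups rather than across all profiles, which is the general motivation for a coarse pass before a detailed one. In this toy data, profile C has the same values as A and B but a slightly different latitude, the kind of case that a "possible duplicate" label would hand to expert review as a potential metadata error.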
How to cite: Song, X., Tan, Z., Locarnini, R., Simoncelli, S., Cowley, R., Kizu, S., Boyer, T., Reseghetti, F., Castelao, G., Gouretski, V., and Cheng, L.: DC_OCEAN: An open-source algorithm for identification of duplicates in ocean databases, EGU General Assembly 2025, Vienna, Austria, 27 Apr–2 May 2025, EGU25-5448, https://doi.org/10.5194/egusphere-egu25-5448, 2025.