Metabolites. 2023 Dec 17;13(12):1199.
doi: 10.3390/metabo13121199.
Free PMC article
Abstract
A major challenge to integrating public metabolic resources is the use of different nomenclatures by individual databases. This paper presents md_harmonize, an open-source Python package for harmonizing compounds and metabolic reactions across various metabolic databases. The md_harmonize package uses a neighborhood-specific graph coloring method to generate a unique identifier for each compound from atom identifiers based on the compound's chemical structure. The resulting harmonized compounds and reactions can be used for various downstream analyses, including the construction of atom-resolved metabolic networks and models for metabolic flux analysis. Parts of the md_harmonize package have been optimized using a variety of computational techniques so that certain NP-complete problems handled by the software remain tractable for these specific use cases. The software is available on GitHub and through the Python Package Index, with end-user documentation hosted on GitHub Pages.
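As a rough illustration of how a neighborhood-based coloring can turn a chemical structure into atom identifiers and then a compound identifier, the sketch below refines each atom's color from the colors of its bonded neighbors. It is a generic, simplified refinement scheme and not the md_harmonize implementation: the function names `color_atoms` and `compound_identifier` are hypothetical, bond orders and stereochemistry are ignored, and symmetric atoms deliberately end up sharing a color.

```python
from collections import defaultdict


def color_atoms(elements, bonds):
    """Return one color string per atom, refined by neighbor colors.

    elements: list of element symbols, one per atom (assumed input format).
    bonds: list of undirected (i, j) atom-index pairs (assumed input format).
    """
    neighbors = defaultdict(list)
    for i, j in bonds:
        neighbors[i].append(j)
        neighbors[j].append(i)

    # Start from the element symbol, then repeatedly append the sorted
    # colors of the bonded neighbors.
    colors = list(elements)
    for _ in range(len(elements)):
        new_colors = [
            colors[i] + "(" + ",".join(sorted(colors[n] for n in neighbors[i])) + ")"
            for i in range(len(elements))
        ]
        # Stop once another round no longer separates any atoms.
        if len(set(new_colors)) == len(set(colors)):
            break
        colors = new_colors
    return colors


def compound_identifier(elements, bonds):
    """Join the sorted atom colors so the result ignores atom numbering."""
    return "|".join(sorted(color_atoms(elements, bonds)))


# Heavy atoms of ethanol, numbered C0-C1-O2.
print(color_atoms(["C", "C", "O"], [(0, 1), (1, 2)]))
# ['C(C)', 'C(C,O)', 'O(C)']
```

Because the atom colors are sorted before joining, renumbering the atoms of the same structure yields the same identifier in this sketch, which is the property that makes such identifiers useful for matching compounds across databases.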
Keywords:
Python package; database harmonization; maximum common substructure; metabolite.
Conflict of interest statement
The authors declare no conflict of interest.
Figures

Figure 1
Example of matrix representation of a compound structure: (A) KEGG compound C00207 with the atoms numbered for comparison to rows and columns in the matrix; (B) matrix representation of KEGG compound C00207.

Figure 2
Example of a mapping matrix between two compound structures: (A) KEGG compound C00207 with atoms numbered to rows in the matrix; (B) KEGG compound C00466 with atoms numbered to columns in the matrix; and (C) mapping matrix between KEGG compound C00207 and KEGG compound C00466.

Figure 3
Flowchart of backtracking algorithm for generating one-to-one atom mappings of two compound structures.
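A backtracking search for one-to-one atom mappings can be sketched as follows. This is a minimal, assumption-laden illustration rather than the algorithm in the flowchart: the name `atom_mappings`, the helper `_adjacency`, and the input format are hypothetical, atoms are matched only on element symbol and neighbor count, and bond orders are ignored.

```python
def _adjacency(n_atoms, bonds):
    """Build a list of neighbor sets from undirected (i, j) bond pairs."""
    adjacency = [set() for _ in range(n_atoms)]
    for i, j in bonds:
        adjacency[i].add(j)
        adjacency[j].add(i)
    return adjacency


def atom_mappings(elements_a, bonds_a, elements_b, bonds_b):
    """Yield every one-to-one mapping of structure A's atoms into structure B."""
    adj_a = _adjacency(len(elements_a), bonds_a)
    adj_b = _adjacency(len(elements_b), bonds_b)

    # Candidate targets play the role of a mapping matrix: atom j of B is
    # a candidate for atom i of A if the element matches and j has at
    # least as many neighbors as i.
    candidates = [
        [j for j, element_b in enumerate(elements_b)
         if element_b == element_a and len(adj_b[j]) >= len(adj_a[i])]
        for i, element_a in enumerate(elements_a)
    ]
    mapping = {}
    used = set()

    def extend(i):
        if i == len(elements_a):
            yield dict(mapping)  # complete mapping found
            return
        for j in candidates[i]:
            if j in used:
                continue
            # Every already-mapped neighbor of i must map to a neighbor of j.
            if all(mapping[k] in adj_b[j] for k in adj_a[i] if k in mapping):
                mapping[i] = j
                used.add(j)
                yield from extend(i + 1)
                # Backtrack: undo the assignment and try the next candidate.
                del mapping[i]
                used.discard(j)

    yield from extend(0)


# Map a C-O fragment onto the heavy atoms of ethanol (C0-C1-O2).
print(list(atom_mappings(["C", "O"], [(0, 1)], ["C", "C", "O"], [(0, 1), (1, 2)])))
# [{0: 1, 1: 2}]
```

The search is exponential in the worst case, which is consistent with the abstract's remark that some of the problems handled by the package are NP-complete; pruning the candidate lists early is what keeps small cases like this one fast.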

Figure 4
Example of the shortest distance between any two atoms in a compound structure: (A) KEGG compound C00466 with atoms numbered to rows and columns in the matrix; (B) the shortest distance matrix D of KEGG compound C00466.
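A shortest distance matrix of this kind can be computed with one breadth-first search per atom. The sketch below assumes that distance is counted as the number of bonds on the shortest path and that the structure is given as undirected atom-index pairs; the function name `shortest_distance_matrix` is hypothetical and the package may define or compute D differently.

```python
from collections import deque


def shortest_distance_matrix(n_atoms, bonds):
    """Return D where D[i][j] is the fewest bonds between atoms i and j."""
    neighbors = [[] for _ in range(n_atoms)]
    for i, j in bonds:
        neighbors[i].append(j)
        neighbors[j].append(i)

    unreachable = float("inf")
    distances = [[unreachable] * n_atoms for _ in range(n_atoms)]
    for source in range(n_atoms):
        distances[source][source] = 0
        queue = deque([source])
        # Breadth-first search visits atoms in order of increasing distance.
        while queue:
            atom = queue.popleft()
            for neighbor in neighbors[atom]:
                if distances[source][neighbor] == unreachable:
                    distances[source][neighbor] = distances[source][atom] + 1
                    queue.append(neighbor)
    return distances


# A branched four-atom skeleton: atom 1 is bonded to atoms 0, 2, and 3.
matrix = shortest_distance_matrix(4, [(0, 1), (1, 2), (1, 3)])
print(matrix[0])
# [0, 1, 2, 2]
```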

Figure 5
Flowchart of the modified Dijkstra algorithm for generating the shortest distance between each atom and the R groups in a compound. The “*” represents the multiplication operator.

Figure 6
Shortest distance to the R groups in a compound structure: (A) KEGG compound C05205 with atoms numbered to match the indices in the array; (B) the array of the shortest distances from each atom to the R groups in KEGG compound C05205.

Figure 7
Organization of the md_harmonize package presented with UML diagrams. (A) UML package diagram of the md_harmonize Python library; (B) UML class diagram of the md_harmonize Python package.

Figure 8
Command line interface of md_harmonize package.

Figure 9
Comparison of substructure performance after algorithm optimization.

Figure 10
Example of incorrect compound pair indicated via HMDB reference with different structure representations. (A) MetaCyc CPD-10813; (B) HMDB HMDB0000265.