The MS-BioGraphs sequence similarity graph datasets are now publicly available on IEEE DataPort: https://doi.org/10.21227/gmd9-1534.
To access the files, register or log in to IEEE DataPort and then visit the MS-BioGraphs page. After saving that page as an HTML file such as dp.html, you can download a dataset (MS1 in this example) using the following script:
dsname="MS1"
html_file="dp.html"
# Decode HTML-escaped ampersands (&amp;), extract the amazonaws links, and keep those of the selected dataset
urls=$(sed -e 's/&amp;/\&/g' "$html_file" | grep -Eo "(http|https)://[a-zA-Z0-9./?&=_%:-]*" | grep amazonaws | sort -u | grep -E "$dsname[-_.]")
for u in $urls; do
    wget "$u"
    if [ $? != 0 ]; then break; fi
done
# Removing query strings from the downloaded file names (searches $1, or the current directory if no argument is given)
for f in $(find "${1:-.}" -type f); do
    if [ "$f" = "${f%%\?*}" ]; then continue; fi
    mv "${f}" "${f%%\?*}"
done
# Linking offsets.bin so that it can be found by ParaGrapher
ln -s "${dsname}_offsets.bin" "${dsname}-underlying_offsets.bin"
Instead of wget, you may use axel -n 10 to download each file over multiple connections (here, 10); see https://manpages.ubuntu.com/manpages/noble/en/man1/axel.1.html.
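As a rough sketch of the axel variant, the loop below assumes that the urls variable has already been set as in the script above and that axel is installed. It passes axel's -o option so that each file is written under its name without the query string, which should make the renaming step unnecessary. The function names url_to_filename and download_with_axel are hypothetical and are not part of the original script.

```shell
# Hypothetical helper: derive a local file name from a URL by dropping
# the query string (everything from the first '?') and the directory part.
url_to_filename() {
    _no_query="${1%%\?*}"
    printf '%s\n' "${_no_query##*/}"
}

# Hypothetical wrapper: download every URL listed in $urls using
# 10 connections per file, stopping at the first failure.
# Assumes $urls is set as in the wget-based script.
download_with_axel() {
    for u in $urls; do
        axel -n 10 -o "$(url_to_filename "$u")" "$u" || return 1
    done
}
```

After defining urls, calling download_with_axel takes the place of the wget loop; the symbolic link for the offsets file is still created as before.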