The graph-algorithm test cases in this document use the official public datasets. Download graph500-23, graph500-24, graph500-25, cit-Patents, uk-2002, enwiki-2018, it-2004, com_orkut, usaRoad, cage14, Twitter-2010, and Graph500-29 from their official sites. All dataset extraction and upload steps below are performed on the server1 node.
cat graph500-23.e | tr " " "," > graph500-23.txt
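The same space-to-comma conversion applies to the other graph500 edge lists. A minimal sketch, assuming each `.e` file holds one space-separated `src dst` pair per line (file names taken from the dataset list above):

```shell
# Convert space-separated .e edge lists to comma-separated .txt files.
# Assumes one "src dst" pair per line, fields separated by single spaces.
for name in graph500-23 graph500-24 graph500-25; do
  [ -f "${name}.e" ] && tr ' ' ',' < "${name}.e" > "${name}.txt"
done
```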
The steps below download the directed-graph dataset it-2004; after extraction it can be used directly.
mkdir -p /test/graph/dataset/
cd /test/graph/dataset/
wget https://suitesparse-collection-website.herokuapp.com/MM/LAW/it-2004.tar.gz
tar -zxvf it-2004.tar.gz
hadoop fs -mkdir -p /tmp/graph/dataset/
hadoop fs -put /test/graph/dataset/it-2004.mtx /tmp/graph/dataset/
hadoop fs -ls /tmp/graph/dataset/it-2004.mtx
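Before the HDFS upload it is worth sanity-checking the extracted file. A small sketch, assuming the standard MatrixMarket layout (a banner line starting with `%%MatrixMarket`, `%`-prefixed comment lines, then a `rows cols nnz` size line):

```shell
# Inspect a MatrixMarket file before putting it on HDFS.
mtx=/test/graph/dataset/it-2004.mtx
head -n 1 "$mtx"                 # banner: %%MatrixMarket matrix coordinate ...
grep -v '^%' "$mtx" | head -n 1  # size line: "<rows> <cols> <nnz>"
```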
When downloading these datasets, note that only three files under each link are needed, with the suffixes .graph, .properties, and .md5sums.
uk-2002: http://law.di.unimi.it/webdata/uk-2002/
enwiki-2018: http://law.di.unimi.it/webdata/enwiki-2018/
The directed graphs uk-2002 and enwiki-2018 require a specific decompression procedure; the following walks through uk-2002 as an example.
Item | Download link
---|---
logback-classic-1.1.7.jar |
logback-core-1.1.7.jar |
fastutil-7.0.12.jar |
sux4j-4.0.0.jar |
dsiutils-2.3.2.jar |
jung-api-2.1.jar |
jung-io-2.1.jar |
jsap-2.1.jar |
junit-4.12.jar |
commons-configuration-1.8.jar |
commons-lang3-3.4.jar |
slf4j-api-1.7.21.jar |
webgraph-3.5.2.jar |
guava-19.0.jar |
uk-2002.properties |
uk-2002.graph |
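Since each LAW download also ships an `.md5sums` file (see the suffix list above), the downloads can be verified before decompression. A sketch, assuming `uk-2002.md5sums` sits in the same directory as the downloaded files and uses the standard `md5sum` output format:

```shell
# Verify the downloaded WebGraph files against the published checksums.
# md5sum -c reads "<hash>  <filename>" lines and rechecks each file.
md5sum -c uk-2002.md5sums
```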
mkdir lib
cd lib
java -cp "./lib/*" it.unimi.dsi.webgraph.ArcListASCIIGraph -l 5000 uk-2002 uk-2002-edgelist.txt
The last two arguments are the input prefix and the output file. Note that if the files are named sample.graph, sample.md5sums, and sample.properties, the second-to-last argument must be their common prefix, sample.
cat uk-2002-edgelist.txt | tr "\t" " " > uk-2002-edgelist.e
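A quick format check on the converted edge list can catch conversion mistakes early. A sketch, assuming the `uk-2002-edgelist.e` file produced above:

```shell
# Every line should now hold exactly two space-separated fields.
awk 'NF != 2 { bad++ } END { print (bad + 0), "malformed lines" }' uk-2002-edgelist.e
wc -l uk-2002-edgelist.e   # total edge count
```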
Download the dataset from its official site; after extraction it can be used directly.
Download link: http://snap.stanford.edu/data/bigdata/communities/com-orkut.ungraph.txt.gz
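The same local-directory pattern used for it-2004 above applies here; a sketch (the target directory is an assumption, and `.txt.gz` is a plain gzip file, so `gzip -d` rather than `tar` is the decompression step):

```shell
mkdir -p /test/graph/dataset/
cd /test/graph/dataset/
wget http://snap.stanford.edu/data/bigdata/communities/com-orkut.ungraph.txt.gz
gzip -d com-orkut.ungraph.txt.gz   # yields com-orkut.ungraph.txt
```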
mkdir -p /test/algo/incpr
vim incDataGenBatch_twitter-2010.sh
incDataGenBatch_twitter-2010.sh:

num_executors=39
executor_cores=7
executor_memory=23
partition=273
java_xms="-Xms${executor_memory}g"
class="com.huawei.graph.IncDataGeneratorBatch"
input_allData="hdfs:/tmp/graph/dataset/twitter-2010-edgelist.txt"
split=","
seed=1
iterNum=100
resetProb=0.15
jar_path="/test/algo/incpr/inc-graph-tools-1.0.0.jar"
batch=5

for rate in {0.001,0.01,0.05}
do
    echo ">>> start [twitter-2010-${rate}]"
    output_incData="/tmp/graph/dataset/twitter-2010_${rate}"
    hadoop fs -rm -r -f hdfs:${output_incData}
    spark-submit \
        --class ${class} \
        --master yarn \
        --num-executors ${num_executors} \
        --executor-memory ${executor_memory}g \
        --executor-cores ${executor_cores} \
        --driver-memory 80g \
        --conf spark.driver.maxResultSize=80g \
        --conf spark.locality.wait.node=0 \
        --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
        --conf spark.kryoserializer.buffer.max=2040m \
        --conf spark.executor.extraJavaOptions=${java_xms} \
        ${jar_path} yarn ${input_allData} ${split} ${output_incData} ${rate} ${partition} ${seed} ${iterNum} ${resetProb} ${batch}
    echo ">>> end [twitter-2010-${rate}]"
done
vim incDataGenBatch_graph500-29.sh
incDataGenBatch_graph500-29.sh:

num_executors=39
executor_cores=7
executor_memory=23
partition=800
java_xms="-Xms${executor_memory}g"
class="com.huawei.graph.IncDataGeneratorBatch"
input_allData="hdfs:/tmp/graph/dataset/graph500-29.e"
split=","
seed=1
iterNum=100
resetProb=0.15
jar_path="/test/algo/incpr/inc-graph-tools-1.0.0.jar"
batch=5

for rate in {0.001,0.01,0.05}
do
    echo ">>> start [graph500-29-${rate}]"
    output_incData="/tmp/graph/dataset/graph500-29_${rate}"
    hadoop fs -rm -r -f hdfs:${output_incData}
    spark-submit \
        --class ${class} \
        --master yarn \
        --num-executors ${num_executors} \
        --executor-memory ${executor_memory}g \
        --executor-cores ${executor_cores} \
        --driver-memory 80g \
        --conf spark.driver.maxResultSize=80g \
        --conf spark.locality.wait.node=0 \
        --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
        --conf spark.kryoserializer.buffer.max=2040m \
        --conf spark.executor.extraJavaOptions=${java_xms} \
        ${jar_path} yarn ${input_allData} ${split} ${output_incData} ${rate} ${partition} ${seed} ${iterNum} ${resetProb} ${batch}
    echo ">>> end [graph500-29-${rate}]"
done
mkdir -p /test/dataset/graph/twitter
cd /test/dataset/graph/twitter
wget https://snap.stanford.edu/data/twitter-2010.txt.gz
gzip -d twitter-2010.txt.gz
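The extracted file is `twitter-2010.txt`, while the upload and the generator script above refer to `twitter-2010-edgelist.txt`, so a normalization step is implied in between. A hypothetical sketch, assuming the extracted file only needs comment lines stripped (the grep pattern and resulting file name are assumptions):

```shell
# Hypothetical cleanup: drop '#'-prefixed comment lines, keep edge pairs.
grep -v '^#' twitter-2010.txt > twitter-2010-edgelist.txt
```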
hdfs dfs -mkdir -p /tmp/graph/dataset
hdfs dfs -put twitter-2010-edgelist.txt /tmp/graph/dataset
mkdir -p /test/dataset/graph/graph500-29
cd /test/dataset/graph/graph500-29
wget https://surfdrive.surf.nl/files/index.php/s/VSXkomtgPGwZMW4/download
tar -I zstd -vxf graph500-29.tar.zst
hdfs dfs -mkdir -p /tmp/graph/dataset
hdfs dfs -put graph500-29.e /tmp/graph/dataset