bash - Sort lots of large compressed files
I have a lot of large compressed files called xaa.gz, xab.gz, xac.gz, etc. Unfortunately, they are not sorted. I would like to do the equivalent of the following:
    zcat x*|sort > largefile
    split -l 1000000 largefile
Then gzip the split files and throw away all the other files made before as well.
The problem is that this makes a massive uncompressed file and then lots of smaller uncompressed split files before compressing them. Is it possible to do the whole thing without making a huge file in the middle of the process, and ideally without saving the split files before compressing them either?
I have 8 cores, so I would like to take advantage of them (I don't have coreutils 8.20, so I can't take advantage of sort --parallel).
This is not full code, but here are some ideas on what you can do.
1) Partition the input files so they can be processed in parallel:
    num_cores=8
    i=0
    for f in x*.gz; do
        part_name=part$i
        # append $f to the variable named by $part_name (part0, part1, ...)
        printf -v "$part_name" '%s' "${!part_name} $f"
        (( i = (i + 1) % num_cores ))
    done
2) Decompress and sort each part of the files in a different process:
    sort -m <(zcat $part0 | sort) <(zcat $part1 | sort) ...
3) Tell split to compress the files immediately:
    # keep the single quotes: split sets $FILE for the shell that runs the filter
    ... | split -l 1000000 --filter='gzip > $FILE.gz'
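The three ideas above can be combined into one function. This is only a rough sketch, not a tested production script. The function name parallel_sort_gz, its argument order and the output prefix are made up for illustration. It uses named pipes instead of the <(...) process substitutions so that the number of parts can vary, and it assumes bash, GNU split with --filter support, and file names without newlines.

```shell
#!/usr/bin/env bash
# Hypothetical helper: parallel_sort_gz NUM_PARTS LINES_PER_CHUNK OUT_PREFIX FILE.gz...
parallel_sort_gz() {
    local n=$1 lines=$2 prefix=$3
    shift 3
    local files=("$@")
    if (( ${#files[@]} == 0 )); then
        echo "usage: parallel_sort_gz NUM_PARTS LINES_PER_CHUNK OUT_PREFIX FILE.gz..." >&2
        return 1
    fi
    # never start more sorters than there are input files
    if (( n > ${#files[@]} )); then
        n=${#files[@]}
    fi

    local tmp p i
    local fifos=() part=()
    tmp=$(mktemp -d) || return 1

    for (( p = 0; p < n; p++ )); do
        # round-robin partition: part p gets files p, p+n, p+2n, ...
        part=()
        for (( i = p; i < ${#files[@]}; i += n )); do
            part+=("${files[i]}")
        done
        mkfifo "$tmp/part$p"
        # one decompress-and-sort pipeline per part, running in the background
        zcat "${part[@]}" | sort > "$tmp/part$p" &
        fifos+=("$tmp/part$p")
    done

    # merge the already-sorted streams and compress each chunk as it is written;
    # $FILE is set by split for the filter command, so it stays in single quotes
    sort -m "${fifos[@]}" | split -l "$lines" --filter='gzip > "$FILE.gz"' - "$prefix"

    wait
    rm -rf "$tmp"
}
```

For the files in the question this would be called roughly as: parallel_sort_gz 8 1000000 sorted_ x*.gz. Each background sort still spills to its own temporary files when its input does not fit in memory, but neither the single huge merged file nor the uncompressed split files are written to disk.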