bash - Sort lots of large compressed files -


I have a lot of large compressed files called xaa.gz, xab.gz, xac.gz, etc. Unfortunately they are not sorted. I would like to do the equivalent of the following.

zcat x* | sort > largefile
split -l 1000000 largefile

Then gzip the split files and throw away all the other files made along the way.

The problem is that this makes a massive uncompressed file and then lots of smaller uncompressed split files before compressing them. Is it possible to do the whole thing without making a huge file in the middle of the process, and ideally without saving the split files before compressing them either?

I have 8 cores, so I would like to take advantage of them too (I don't have coreutils 8.20, so I can't take advantage of sort --parallel).

Not full code, but some ideas on what you can do here.

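1) Partition the input files so they can be processed in parallel. This builds variables part0 .. part7, each holding a space-separated list of file names:

num_cores=8
i=0
while read -r f; do
  part_name=part$i
  # Append $f to the variable whose name is stored in part_name.
  printf -v "$part_name" '%s' "${!part_name} $f"
  (( i=(i+1)%num_cores ))
done < <(ls x*.gz)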

2) Decompress and sort each part of the files in a different process, then merge the sorted streams:

sort -m <(zcat $part0 | sort) <(zcat $part1 | sort) ... 

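3) Tell split to compress the files immediately. GNU split passes the name of each output chunk to the filter command in $FILE (upper case), and the single quotes keep your own shell from expanding it early:

... | split -l 1000000 --filter='gzip > $FILE.gz'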
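The three ideas above can be put together roughly as follows. This is only a sketch, not tested on large data: the function name sort_gz_parts and its arguments are made up for illustration, it assumes bash and a GNU split that supports --filter, and it assumes the file names can be safely quoted with printf %q. It uses gzip -dc, which does the same job as zcat, and eval to build a variable number of process substitutions.

```bash
#!/usr/bin/env bash
# Hypothetical helper: sort_gz_parts CORES LINES_PER_CHUNK PREFIX FILE...
sort_gz_parts() {
  local cores=$1 lines=$2 prefix=$3
  shift 3
  local -a groups=()
  local i=0 f g cmd

  # 1) Round-robin the input files into at most $cores groups.
  for f in "$@"; do
    groups[i]+=" $(printf '%q' "$f")"
    i=$(( (i + 1) % cores ))
  done

  # 2) One "decompress | sort" process substitution per group.
  cmd='sort -m'
  for g in "${groups[@]}"; do
    cmd+=" <(gzip -dc$g | sort)"
  done

  # 3) Merge the sorted streams; split compresses each chunk as it is written,
  #    so no uncompressed intermediate file is stored on disk.
  eval "$cmd" | split -l "$lines" --filter='gzip > "$FILE.gz"' - "$prefix"
}
```

For the files in the question, a call might look like: sort_gz_parts 8 1000000 sorted_ x*.gz, which would write sorted_aa.gz, sorted_ab.gz and so on. Each per-group sort may still use temporary files of its own if a group does not fit in memory.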
