python - Read large file in parallel?
I have a large file that I need to read in and build a dictionary from, as fast as possible. My current Python code is slow. Here is a minimal example that shows the problem.
First, make some fake data:
paste <(seq 20000000) <(seq 2 20000001) > largefile.txt
Now here is a minimal piece of Python code that reads it in and makes the dictionary:
import sys
from collections import defaultdict

fin = open(sys.argv[1])
dict = defaultdict(list)
for line in fin:
    parts = line.split()
    dict[parts[0]].append(parts[1])
Timings:
time ./read.py largefile.txt
real    0m55.746s
However, it is possible to read the whole file much faster with:
time cut -f1 largefile.txt > /dev/null
real    0m1.702s
My CPU has 8 cores. Is it possible to parallelize this program in Python to speed it up?
One possibility might be to read in large chunks of the input and then run 8 processes in parallel on different non-overlapping sub-chunks, making dictionaries in parallel from the data in memory, and then read in another large chunk. Is this possible in Python using multiprocessing somehow?
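A minimal sketch of that idea (not from the original post; the helper names parse_chunk and build_dict are made up for illustration), assuming the file is whitespace-separated key/value pairs as in the example above: split the file into byte ranges, have each worker process parse its own range into a partial dictionary, and merge the partial dictionaries in the parent. Each worker snaps its starting offset to the next line boundary so no line is split between processes.

import os
import sys
from collections import defaultdict
from multiprocessing import Pool

def parse_chunk(args):
    # Build a dict from the lines whose first byte lies in [start, end).
    path, start, end = args
    d = defaultdict(list)
    with open(path, 'rb') as f:
        f.seek(start)
        if start != 0:
            # If we landed mid-line, skip ahead to the next line;
            # the partial line belongs to the previous chunk.
            f.seek(start - 1)
            if f.read(1) != b'\n':
                f.readline()
        while f.tell() < end:
            parts = f.readline().split()
            if len(parts) >= 2:
                d[parts[0]].append(parts[1])
    return d

def build_dict(path, nworkers=8):
    # Split the file into nworkers byte ranges and parse them in parallel.
    size = os.path.getsize(path)
    bounds = [size * i // nworkers for i in range(nworkers + 1)]
    jobs = [(path, bounds[i], bounds[i + 1]) for i in range(nworkers)]
    with Pool(nworkers) as pool:
        partials = pool.map(parse_chunk, jobs)
    # Serial merge of the partial dictionaries in the parent process.
    merged = defaultdict(list)
    for partial in partials:
        for key, values in partial.items():
            merged[key].extend(values)
    return merged

if __name__ == '__main__':
    d = build_dict(sys.argv[1])

Note that keys and values come back as bytes here (decode them if you need str), and that pickling the partial dictionaries back to the parent plus the serial merge can eat a large part of the gain, so it is worth measuring before assuming anything close to an 8x speedup.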
Update: the fake data above is not good because it has only one value per key. Better is:
perl -E 'say int rand 1e7, $", int rand 1e4 for 1 .. 1e7' > largefile.txt
(Related: read in a large file and make a dictionary.)
There was a blog post series, the "Wide Finder Project", several years ago at Tim Bray's site [1]. There you can find a solution [2] by Fredrik Lundh of ElementTree [3] and PIL [4] fame. I know posting links is discouraged on this site, but I think these links give you a better answer than copy-pasting the code would.
[1] http://www.tbray.org/ongoing/when/200x/2007/10/30/wf-results
[2] http://effbot.org/zone/wide-finder.htm
[3] http://docs.python.org/3/library/xml.etree.elementtree.html
[4] http://www.pythonware.com/products/pil/