python - Read large file in parallel?
I have a large file that I need to read in and build a dictionary from, as fast as possible. My current Python code is slow. Here is a minimal example that shows the problem.
First, make some fake data:
paste <(seq 20000000) <(seq 2 20000001) > largefile.txt
Now here is a minimal piece of Python code that reads it in and makes the dictionary:
import sys
from collections import defaultdict

fin = open(sys.argv[1])
dict = defaultdict(list)
for line in fin:
    parts = line.split()
    dict[parts[0]].append(parts[1])
Timings:
time ./read.py largefile.txt
real    0m55.746s
However, it is possible to read the whole file much faster with:
time cut -f1 largefile.txt > /dev/null
real    0m1.702s
My CPU has 8 cores. Is it possible to parallelize this program in Python to speed it up?
One possibility might be to read in large chunks of the input and then run 8 processes in parallel on different non-overlapping sub-chunks, making dictionaries in parallel from the data in memory, and then read in another large chunk. Is this possible in Python using multiprocessing somehow?
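A minimal sketch of that idea (not from the original post; the helper names parse_chunk and build_dict are made up for illustration), assuming the file is whitespace-separated key/value pairs as in the example above: split the file into byte ranges, have each worker process parse its own range into a partial dictionary, and merge the partial dictionaries in the parent. Each worker snaps its starting offset to the next line boundary so no line is split between processes.

import os
import sys
from collections import defaultdict
from multiprocessing import Pool

def parse_chunk(args):
    # Build a dict from the lines whose first byte lies in [start, end).
    path, start, end = args
    d = defaultdict(list)
    with open(path, 'rb') as f:
        f.seek(start)
        if start != 0:
            # If we landed mid-line, skip ahead to the next line;
            # the partial line belongs to the previous chunk.
            f.seek(start - 1)
            if f.read(1) != b'\n':
                f.readline()
        while f.tell() < end:
            parts = f.readline().split()
            if len(parts) >= 2:
                d[parts[0]].append(parts[1])
    return d

def build_dict(path, nworkers=8):
    # Split the file into nworkers byte ranges and parse them in parallel.
    size = os.path.getsize(path)
    bounds = [size * i // nworkers for i in range(nworkers + 1)]
    jobs = [(path, bounds[i], bounds[i + 1]) for i in range(nworkers)]
    with Pool(nworkers) as pool:
        partials = pool.map(parse_chunk, jobs)
    # Serial merge of the partial dictionaries in the parent process.
    merged = defaultdict(list)
    for partial in partials:
        for key, values in partial.items():
            merged[key].extend(values)
    return merged

if __name__ == '__main__':
    d = build_dict(sys.argv[1])

Note that keys and values come back as bytes here (decode them if you need str), and that pickling the partial dictionaries back to the parent plus the serial merge can eat a large part of the gain, so it is worth measuring before assuming anything close to an 8x speedup.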
Update: the fake data above is not good because it has only one value per key. Better is:
perl -E 'say int rand 1e7, $", int rand 1e4 for 1 .. 1e7' > largefile.txt
(Related: read in a large file and make a dictionary.)
There was a blog post series, the "Wide Finder Project", several years ago at Tim Bray's site [1]. There you can find a solution [2] by Fredrik Lundh of ElementTree [3] and PIL [4] fame. I know posting links is discouraged on this site, but I think these links give you a better answer than copy-pasting the code would.
[1] http://www.tbray.org/ongoing/when/200x/2007/10/30/wf-results
[2] http://effbot.org/zone/wide-finder.htm
[3] http://docs.python.org/3/library/xml.etree.elementtree.html
[4] http://www.pythonware.com/products/pil/