JMK no matter what

Making scipy.weave parallel by releasing the GIL

As everybody knows, the current implementation of CPython is pretty much incapable of doing any serious multithreaded computation, because of the GIL. Python's most current solution to this problem is using separate processes instead of threads - in most of the cases, this actually suffices.

A problem I'm trying to solve at work now is embarrassingly parallel. It was natural to try to make use of my quad-core Xeon desktop to decrease my 1-hr runtime down to 15 minutes; however, the programs use inputs from a vast (4+GB) numpy array which is common to all processes. Sending an array of this size through IPC is simply ridiculous. Yes, there are ways to 'share' a single copy of numpy array across processes - but it required me to make drastic changes to the program's structure. I tried numpy.memmap which is essentially an array using memory-mapped files. Single-thread performance was degraded 10x... dude, I was just trying to achieve 3+x speedup. Is this too much???

What I actually ended up doing is writing performance-critical piece of code in C++, using scipy.weave (it basically lets us to inline C++ code in Python, compiling the C++ code as a Python package). Yes, it's not pleasant, but when the C++ code snippet is small, it's doable. Yes, it's not a game-changer - it doesn't allow true multithreading. However: weave is compiled as a C Python module - and they can release the GIL if a macro is used: Py_BEGIN_ALLOW_THREADS.

So, simply adding the two macros at the beginning and the end of the inlined code can make it (almost) truly multithreading.

I did a simple experiment to verify this.

I wrote a program that keeps an array of 1 million integers in memory. I wrote a function that given n, scan the array to find all the integers which are less than or equal to n, and return the sum of number of divisors of such integers. (Yes, it's contrived, but it closely resembles what I'm doing right now.)

For example, suppose the array is [4,7,6,8,11,9] and n is 7. There are 3 numbers less than or equal to 7; and they respectively have 3, 2, 4 divisors. The function will return 3+2+4 = 9.

So, the code follows.

lang:py
import numpy as np
from scipy import weave
from random import randint
from handythread import *
import time

HT = 100*1000 # hundread thousand
inputs = [HT*i for i in [1,4,1,2,4,5,1,1,2,3,4,1,2]]
m = 1000000
common = np.array([randint(0,10*HT) for i in xrange(m)])
code = """
int ret = 0;
for(int i = 0; i < m; ++i)
{  
    if(common[i] <= n)
    {  
        ++ret; // 1
        if(common[i] > 1) ++ret; // common[i]
        for(int j = 2; j*j <= common[i]; ++j)
        {  
            if(common[i] % j == 0)
            {  
                ++ret;
                if(j*j < common[i])
                    ++ret;
            }
        }
    }
}
return_val = ret;
"""

def count_divisors_under(n):
    return weave.inline(code, ["n", "common", "m"])
def count_divisors_under_release(n):
    return weave.inline("Py_BEGIN_ALLOW_THREADS\n" + code + "Py_END_ALLOW_THREADS\n", ["n", "common", "m"])

st = time.time()
print map(count_divisors_under, inputs)
print "Singlethread %.2lf" % (time.time() - st)
st = time.time()
print parallel_map(count_divisors_under, inputs, threads=4)
print "Naive Multithread %.2lf" % (time.time() - st)
st = time.time()
print parallel_map(count_divisors_under_release, inputs, threads=4)
print "ReleaseGIL Multithread %.2lf" % (time.time() - st)

I used handythread library from this cookbook recipe for easier parallelization. It's nothing fancy; it's just a less than a hundred lines of code.

Anyway, the results:

[1177114, 5228072, 1177114, .... 2475507]
Singlethread 4.78
[1177114, 5228072, 1177114, .... 2475507]
Naive Multithread 4.77
[1177114, 5228072, 1177114, .... 2475507]
ReleaseGIL Multithread 1.54

Hooray, 3x speedup! My machine is an i7, which is a quad core. (Don't ask me why it's not 4x... I don't know.)

2009-12-16 15:24:26 | JM | /writings/cs/ | 8 Comments
이희승
2009-12-16 15:50:04
그러니까 스칼라나 그루비 고고싱 ㅡ.ㅡ?
JM
2009-12-16 16:04:53
... 스칼라는 좀 땡기는데요 REPL + 플로팅 등등이 파이썬이 너무 잘되어있어서.. ㅠ.ㅠ
scheme
2009-12-16 18:03:26
why it's not 4x?
개멍
2009-12-16 19:15:23
did you try threads=8? i7 has SMT, you know. since it's mostly integer op...
김기범
2011-02-21 15:10:24
parallel_map이 스레드마다 균등하게 나누나요? work division을 geometric하게 나누게 하면 좀 더 나을 것 같네요.
JM
2011-02-22 03:26:27
@김기범, parallel_map 은 내부적으로 producer-consumer 패턴 써서 돌아가는 걸로 알고 있습니다. 위 cookbook recipe 링크에 가면 보실 수 있지요. geometric 하게 나눈다는 것이 어떤 뜻인지 궁금한데 좀 더 부연설명해 주실 수 있나요?
김기범
2011-02-23 11:57:34
소스를 보니 parallel_map은 스레드가 원소 하나 끝낼때마다 다시 lock하고 다음원소 얻고 하는군요.
geometric하게 나눈다는게 별건 아니고... 예로 원소가 1024개고 스레드가 2개라면, 처음에는 256개 2블록, 다음은 128개 2블록, 다음은 64개 2블록... 이런식으로 데이터를 쪼갠다음에 producer-consumer하게 feed 하는겁니다. lock을 거는 총 횟수를 log(n)으로 줄이면서도, 뒤로 갈수록 work단위가 짧아지기때문에 그냥 큰 블록으로 그냥 나누었을때 발생할 수 있는 스레드 낭비 시간도 거의 없습니다. 1/2보다는, 3/4정도로 geometric하게 나누는게 무난 한 거 같습니다. 근데 애초에 각 function 실행시간이 길어서 서로 lock 거는 오버헤드가 적다면 별로 상관없겠네요 ^^;;
JM
2011-02-24 17:14:45
@김기범, 아, 그렇군요. 인상적이네요. lock contention 을 그렇게 줄이는 거로군요. 이 경우엔 적용되지 않는거 같지만 경우에 따라 매우 유용할 것 같습니다. 좋은 것 배웠네요.

Leave a comment