Talking Python: The GIL and Parallelism

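Python threads are native operating-system threads, but the Python interpreter (CPython) has a global lock, the GIL (Global Interpreter Lock), which requires that at any moment at most one thread has the right to interpret Python code. In other words, because of the GIL, Python threads execute Python code serially rather than in parallel. Anyone coming from C will probably find both the behavior and the name rather deceptive: something that is serial in practice is offered to the user under the label of multithreading. The small example below measures how long the multithreaded version takes, with a multiprocess version for comparison.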

import threading
import multiprocessing
import time


def boring(x=123):
    # boring calculation function whose only purpose is to burn CPU time
    a = x
    for i in range(50000000):
        a = (a + 1) % 2
    return a


def count_time(f, args=()):
    # measure the time taken by running function 'f' with 'args'
    # (perf_counter is a monotonic, high-resolution clock meant for timing)
    t = time.perf_counter()
    f(*args)
    return time.perf_counter() - t


def parallel(n, method, _target, _args=()):
    """
    run '_target' n times in parallel, in the way specified by 'method'

    n: the number of parallel workers
    method: parallel method, either multiprocessing.Process or threading.Thread
    _target: target function
    _args: arguments for running _target
    """

    workers = []
    for i in range(n):
        worker = method(target=_target, args=_args)
        workers.append(worker)
        worker.start()

    # wait until every worker has finished
    for w in workers:
        w.join()


def test():
    single_run_time = count_time(boring)

    # luckily multiprocessing.Process and threading.Thread have similar interfaces
    multi_run_time_Thread = count_time(parallel, args=(3, threading.Thread, boring))
    multi_run_time_Process = count_time(parallel, args=(3, multiprocessing.Process, boring))

    print("single: {}, thread parallel: {}, process parallel: {}".format(
        single_run_time, multi_run_time_Thread, multi_run_time_Process))

if __name__ == "__main__":
    # On Windows, the 'if __name__ == "__main__"' guard is necessary
    # when using multiprocessing.Process

    test()

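Because of the GIL, we would expect multi_run_time_Thread ≈ single_run_time * 3 and multi_run_time_Process ≈ single_run_time. These are of course ideal figures, assuming the CPU has enough cores, process scheduling costs nothing, and so on. In practice multi_run_time_Thread may be noticeably larger than single_run_time * 3 and multi_run_time_Process noticeably larger than single_run_time, because the operating system is running other programs that also compete for the CPU, and because starting processes and handing the GIL back and forth between threads has a cost of its own. Also, on a single-core CPU the multithreaded and multiprocess versions take about the same time, since both degenerate into serial execution. Even so, the measurement still shows that the multithreaded run takes about as long as running the same tasks one after another.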

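The implementation of the GIL is simple. In the Python source (as it looked in the Python 2 era, before the GIL was reworked in Python 3.2), PyEval_EvalFrameEx in Python/ceval.c is the main function of the virtual machine that interprets Python bytecode, and it contains code like this: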

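PyObject *
PyEval_EvalFrameEx(PyFrameObject *f, int throwflag)
{
    ......
    for (;;) {
        /* main loop that executes Python bytecode */

        if (--_Py_Ticker < 0) {
            /* _Py_Ticker is the execution budget of the current thread: it is
               decremented by one for roughly every bytecode instruction, and
               when it runs out the thread tries to let other threads work */

            ......

            _Py_Ticker = _Py_CheckInterval;           /* reset the budget for the next thread */
            PyThread_release_lock(interpreter_lock);  /* release the lock */

            /* Other threads may run now */

            PyThread_acquire_lock(interpreter_lock, 1);
            /* take the lock again; the thread that gets it here may be a different one */

            ......
        }
        ......

        switch (opcode) {
            ......
        }
    }
}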

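The interpreter_lock here is the GIL. A Python thread that wants the right to execute Python bytecode has to wait until it acquires interpreter_lock. If several threads running Python code exist, only one of them can be interpreting opcodes at any moment, while the others are all stuck at PyThread_acquire_lock(interpreter_lock, 1); which is why the execution of Python bytecode is essentially single-threaded.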
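
For comparison, newer interpreters expose the counterpart of _Py_CheckInterval directly. Since Python 3.2 the tick counter has been replaced by a time-based switch interval, readable and settable through sys.getswitchinterval() and sys.setswitchinterval(). The sketch below is only an illustration of those two calls; the helper names current_switch_interval and run_with_switch_interval are made up for this example and are not part of Python.

```python
import sys


def current_switch_interval():
    """Return how long (in seconds) a thread may hold the GIL before
    the interpreter asks it to give other threads a chance."""
    return sys.getswitchinterval()


def run_with_switch_interval(seconds, func):
    """Hypothetical helper: call func() under a temporary switch interval
    and restore the previous value afterwards."""
    old = sys.getswitchinterval()
    sys.setswitchinterval(seconds)
    try:
        return func()
    finally:
        sys.setswitchinterval(old)


if __name__ == "__main__":
    print("switch interval:", current_switch_interval(), "seconds")
    # Inside the helper, the interval reported is the temporary one.
    print("temporary:", run_with_switch_interval(0.01, sys.getswitchinterval))
```

A shorter interval makes threads hand over the GIL more often, at the price of more switching overhead; it does not make Python bytecode run in parallel.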

Doc/faq/library.rst has this to say about the GIL:

Can’t we get rid of the Global Interpreter Lock?

.. XXX mention multiprocessing
.. XXX link to dbeazley’s talk about GIL?

The :term:`global interpreter lock` (GIL) is often seen as a hindrance to Python’s
deployment on high-end multiprocessor server machines, because a multi-threaded
Python program effectively only uses one CPU, due to the insistence that
(almost) all Python code can only run while the GIL is held.

Back in the days of Python 1.5, Greg Stein actually implemented a comprehensive
patch set (the “free threading” patches) that removed the GIL and replaced it
with fine-grained locking. Unfortunately, even on Windows (where locks are very
efficient) this ran ordinary Python code about twice as slow as the interpreter
using the GIL. On Linux the performance loss was even worse because pthread
locks aren’t as efficient.

Since then, the idea of getting rid of the GIL has occasionally come up but
nobody has found a way to deal with the expected slowdown, and users who don’t
use threads would not be happy if their code ran at half the speed. Greg’s
free threading patch set has not been kept up-to-date for later Python versions.

This doesn’t mean that you can’t make good use of Python on multi-CPU machines!
You just have to be creative with dividing the work up between multiple
processes rather than multiple threads. Judicious use of C extensions will
also help; if you use a C extension to perform a time-consuming task, the
extension can release the GIL while the thread of execution is in the C code and
allow other threads to get some work done.

It has been suggested that the GIL should be a per-interpreter-state lock rather
than truly global; interpreters then wouldn’t be able to share objects.
Unfortunately, this isn’t likely to happen either. It would be a tremendous
amount of work, because many object implementations currently have global state.
For example, small integers and short strings are cached; these caches would
have to be moved to the interpreter state. Other object types have their own
free list; these free lists would have to be moved to the interpreter state.
And so on.

And I doubt that it can even be done in finite time, because the same problem
exists for 3rd party extensions. It is likely that 3rd party extensions are
being written at a faster rate than you can convert them to store all their
global state in the interpreter state.

And finally, once you have multiple interpreters not sharing any state, what
have you gained over running each interpreter in a separate process?

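The gist is that Python originally made some objects global, and changes to the state of these global objects call for a global lock. Small integers, short strings and other such resident objects have global state, and Python manages objects mainly through reference counting, so reference counts are incremented and decremented everywhere while code runs; the global lock keeps those updates from colliding. In principle, of course, each of these objects could have its own lock, acquired before the object is accessed and released after it is modified. Someone called Greg Stein did indeed try this, but because reference counts change so often, locks were acquired and released constantly, and the overall speed of Python went down instead, especially for single-threaded code, which gained a huge amount of extra work for nothing. Another awkward point is that many third-party components rely on the GIL, so making them all compatible would be an enormous project. More awkward still, countless open-source enthusiasts around the world are writing third-party components, and they write them faster than anyone can port them...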
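(A personal thought: if the resident global objects and the unnecessary global state were all made private to each thread, could we get somewhat finer-grained parallelism without losing efficiency? This might cause compatibility problems, though; for example, an is None check could go wrong... In short, backward compatibility sometimes becomes the bottleneck of technical progress.)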
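
The reference-count bookkeeping and the cached small integers mentioned above can be observed from Python itself. The following is a minimal sketch that assumes CPython; the function names are invented for this illustration, and the small-integer cache is an implementation detail rather than a language guarantee.

```python
import sys


def refcount_growth(n_refs):
    """Create a fresh object, store n_refs extra references to it in a
    list, and return how much its reference count grew."""
    obj = object()
    before = sys.getrefcount(obj)
    holders = [obj] * n_refs          # each list slot is one more reference
    after = sys.getrefcount(obj)
    assert len(holders) == n_refs     # keep the list alive until measured
    return after - before


def small_ints_are_shared():
    """In CPython, small integers are cached, so two separately computed
    values of 7 are the very same object."""
    return int("7") is int("7")


if __name__ == "__main__":
    print("reference count grew by", refcount_growth(3))
    print("small ints shared:", small_ints_are_shared())
```

Every one of those count updates is a write to memory shared by all threads, which is the kind of state the GIL protects.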

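In my view the GIL does more good than harm, and there is no need to remove it. One important point is that the GIL lets beginners write multithreaded code boldly without thinking too much about race conditions, which lowers the barrier to entry a great deal (so if your concurrent Python program runs without bugs, its logic is not necessarily bug-free; the GIL may be quietly protecting it). Quite possibly many people came to like Python precisely because their programs never ran into race problems. Moreover, the GIL does not hurt performance as badly as the picture of total serialization suggests, because a thread rarely does nothing but compute. When threads perform blocking operations such as accessing a file or waiting for a signal, the GIL is released while they wait, and Python threads behave almost like native threads without the lock, so in practice Python multithreading usually feels far better than the imagined serialization. If you really need parallel performance, Python also offers the multiprocess approach, multiprocessing.Process, whose parameter list closely resembles that of threading.Thread; this is of course deliberate on the part of Python's designers, to make switching easy.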
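
The claim about blocking operations is easy to check. The sketch below uses time.sleep as a stand-in for a blocking call such as waiting on a file or socket (an assumption made for simplicity; sleep releases the GIL while it waits) and compares running three such tasks one after another with running them in three threads. All function names are invented for this example.

```python
import threading
import time


def blocking_task(seconds):
    # Stand-in for blocking I/O: the GIL is released while sleeping.
    time.sleep(seconds)


def run_sequential(n, seconds):
    """Run n blocking tasks one after another and return the elapsed time."""
    start = time.perf_counter()
    for _ in range(n):
        blocking_task(seconds)
    return time.perf_counter() - start


def run_threaded(n, seconds):
    """Run n blocking tasks in n threads and return the elapsed time."""
    start = time.perf_counter()
    threads = [threading.Thread(target=blocking_task, args=(seconds,))
               for _ in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.perf_counter() - start


if __name__ == "__main__":
    print("sequential:", run_sequential(3, 0.2))
    print("threaded:  ", run_threaded(3, 0.2))
```

On an otherwise idle machine the threaded run should take roughly as long as a single task, while the sequential run takes about three times as long, in contrast to the CPU-bound boring() test earlier.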

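From a wider perspective, the GIL answers a need that objectively exists in any parallel programming architecture; Python's designers simply implemented it with the thread as the locking granularity (one could see Python's author as a thorough practitioner of the pessimistic model of parallel programming?), which leaves us slightly uncomfortable, or rather makes most people feel that it could be optimized further. As Greg Stein did, the lock could also be made fine-grained, but that requires acquiring and releasing locks to be extremely cheap if performance is not to suffer noticeably, and the bigger issue is backward compatibility. The GIL surprises users because threads in the Python world and threads in the C world are both native operating-system threads yet behave differently: by making the thread the unit of locking, the GIL serializes threads completely, and users find that too heavy-handed. Of course, I think that if Guido were to rewrite Python, he would take the current state of CPU development fully into account and make the most of the CPU's parallel capabilities.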

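Locks are an essential feature of any system or language capable of parallel processing. To guarantee that multiple tasks execute correctly, C programs use locks too, such as EnterCriticalSection on Windows and pthread_mutex_lock on POSIX systems, except that we apply them to a particular section of code rather than to the whole thread. Besides the locks used at the programmer's level, the system underneath has locks of its own. As said above, locks are an objective requirement of a parallel architecture; without them the system would be unstable. At the lowest level these locks live in the CPU. Today's CPUs generally have several cores, each typically with L1 and L2 caches that hold recently used memory. If the corresponding memory locations are already in cache, writes by several cores to different locations can be regarded as happening simultaneously in the caches, but when they write to the same location, the cache-coherence protocol still guarantees that one memory location is never really modified by two cores at the same time. These mechanisms are built into the CPU, regardless of how the user codes at the upper level. In the more complex SMP architectures, a whole series of locking mechanisms ensures consistency. When these 'locks' come into play, our so-called multithreading has in fact degenerated into serial execution. So the CPU's 'locks' are to the operating system, and critical sections are to C, what the GIL is to Python, only the first two kinds of lock are finer-grained. The ultimate goal of researchers in programming languages and programming models is therefore to bring the cost of the programmer's locking down as close as possible to the hardware level, and to make sure that using locks does not introduce bugs such as deadlock. Some existing programming models balance these aspects very well, for example software transactional memory, but it is not yet used on a large scale; only in databases is the transactional approach widespread.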
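
To make the contrast in granularity concrete, here is a minimal Python sketch of the "lock only a code section" style described above, using threading.Lock in the role that EnterCriticalSection or pthread_mutex_lock plays in C. The Counter class and count_with_threads function are hypothetical names introduced only for this illustration.

```python
import threading


class Counter:
    """A shared counter whose update is the only critical section."""

    def __init__(self):
        self.value = 0
        self._lock = threading.Lock()

    def increment(self):
        # Only this read-modify-write is protected, not the whole thread.
        with self._lock:
            self.value += 1


def count_with_threads(n_threads, n_increments):
    """Let n_threads threads each increment one shared counter
    n_increments times, and return the final value."""
    counter = Counter()

    def worker():
        for _ in range(n_increments):
            counter.increment()

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter.value


if __name__ == "__main__":
    print(count_with_threads(4, 10000))
```

With the lock in place the result is always n_threads * n_increments. The lock covers a single statement, whereas the GIL covers everything a thread does while it interprets bytecode.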