0%

Pyc-Format

Pyc 结构简述

浅浅研究了一下 .pyc 的文件结构。

Pyc 简介

.pyc 是 Python 字节码文件,可以由 Python 虚拟机来执行。

在 Python 3 中,Python 会自动在 __pycache__ 目录里,缓存每个模块编译后的版本,名称为 module.version.pyc ,这就是 Python 字节码文件。其中 version 一般使用 Python 版本号。

字节码在 Python 虚拟机程序里对应的是 PyCodeObject 对象。.pyc 文件是字节码在磁盘上的表现形式。

PyCodeObject 对象的创建时机是模块加载的时候,即 import

如果使用 python test.py 命令会对 test.py 进行编译成字节码并解释执行,但是不会生成 test.pyc

如果 test.py 加载了其他模块,如 import util ,Python 会对 util.py 进行编译成字节码,生成 util.pyc ,然后对字节码解释执行。

如果想生成 test.pyc ,我们可以使用Python内置模块py_compile来编译。

加载模块时,如果同时存在 .py.pyc ,Python 会尝试使用 .pyc ;如果 .pyc 的编译时间早于 .py 的修改时间,则重新编译 .py 并更新 .pyc

Pyc 结构概述

Python 的原始代码在运行前都会被先编译成字节码(二进制),并把编译的结果保存到:

  • 一个四字节 magic number
  • 一个四字节的时间戳
  • 一个 PyCodeObject

把这三部分在内存中以 marshal 格式保存为文件,即 pyc 文件。

01 magic number

iPlayForSG 大佬博客里整理了下面的 Magic Number 对照表。

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
Known values:
# Python 1.5: 20121
# Python 1.5.1: 20121
# Python 1.5.2: 20121
# Python 1.6: 50428
# Python 2.0: 50823
# Python 2.0.1: 50823
# Python 2.1: 60202
# Python 2.1.1: 60202
# Python 2.1.2: 60202
# Python 2.2: 60717
# Python 2.3a0: 62011
# Python 2.3a0: 62021
# Python 2.3a0: 62011 (!)
# Python 2.4a0: 62041
# Python 2.4a3: 62051
# Python 2.4b1: 62061
# Python 2.5a0: 62071
# Python 2.5a0: 62081 (ast-branch)
# Python 2.5a0: 62091 (with)
# Python 2.5a0: 62092 (changed WITH_CLEANUP opcode)
# Python 2.5b3: 62101 (fix wrong code: for x, in ...)
# Python 2.5b3: 62111 (fix wrong code: x += yield)
# Python 2.5c1: 62121 (fix wrong lnotab with for loops and
# storing constants that should have been removed)
# Python 2.5c2: 62131 (fix wrong code: for x, in ... in listcomp/genexp)
# Python 2.6a0: 62151 (peephole optimizations and STORE_MAP opcode)
# Python 2.6a1: 62161 (WITH_CLEANUP optimization)
# Python 2.7a0: 62171 (optimize list comprehensions/change LIST_APPEND)
# Python 2.7a0: 62181 (optimize conditional branches:
# introduce POP_JUMP_IF_FALSE and POP_JUMP_IF_TRUE)
# Python 2.7a0 62191 (introduce SETUP_WITH)
# Python 2.7a0 62201 (introduce BUILD_SET)
# Python 2.7a0 62211 (introduce MAP_ADD and SET_ADD)
# Python 3000: 3000
# 3010 (removed UNARY_CONVERT)
# 3020 (added BUILD_SET)
# 3030 (added keyword-only parameters)
# 3040 (added signature annotations)
# 3050 (print becomes a function)
# 3060 (PEP 3115 metaclass syntax)
# 3061 (string literals become unicode)
# 3071 (PEP 3109 raise changes)
# 3081 (PEP 3137 make __file__ and __name__ unicode)
# 3091 (kill str8 interning)
# 3101 (merge from 2.6a0, see 62151)
# 3103 (__file__ points to source file)
# Python 3.0a4: 3111 (WITH_CLEANUP optimization).
# Python 3.0b1: 3131 (lexical exception stacking, including POP_EXCEPT
#3021)
# Python 3.1a1: 3141 (optimize list, set and dict comprehensions:
# change LIST_APPEND and SET_ADD, add MAP_ADD #2183)
# Python 3.1a1: 3151 (optimize conditional branches:
# introduce POP_JUMP_IF_FALSE and POP_JUMP_IF_TRUE
#4715)
# Python 3.2a1: 3160 (add SETUP_WITH #6101)
# tag: cpython-32
# Python 3.2a2: 3170 (add DUP_TOP_TWO, remove DUP_TOPX and ROT_FOUR #9225)
# tag: cpython-32
# Python 3.2a3 3180 (add DELETE_DEREF #4617)
# Python 3.3a1 3190 (__class__ super closure changed)
# Python 3.3a1 3200 (PEP 3155 __qualname__ added #13448)
# Python 3.3a1 3210 (added size modulo 2**32 to the pyc header #13645)
# Python 3.3a2 3220 (changed PEP 380 implementation #14230)
# Python 3.3a4 3230 (revert changes to implicit __class__ closure #14857)
# Python 3.4a1 3250 (evaluate positional default arguments before
# keyword-only defaults #16967)
# Python 3.4a1 3260 (add LOAD_CLASSDEREF; allow locals of class to override
# free vars #17853)
# Python 3.4a1 3270 (various tweaks to the __class__ closure #12370)
# Python 3.4a1 3280 (remove implicit class argument)
# Python 3.4a4 3290 (changes to __qualname__ computation #19301)
# Python 3.4a4 3300 (more changes to __qualname__ computation #19301)
# Python 3.4rc2 3310 (alter __qualname__ computation #20625)
# Python 3.5a1 3320 (PEP 465: Matrix multiplication operator #21176)
# Python 3.5b1 3330 (PEP 448: Additional Unpacking Generalizations #2292)
# Python 3.5b2 3340 (fix dictionary display evaluation order #11205)
# Python 3.5b3 3350 (add GET_YIELD_FROM_ITER opcode #24400)
# Python 3.5.2 3351 (fix BUILD_MAP_UNPACK_WITH_CALL opcode #27286)
# Python 3.6a0 3360 (add FORMAT_VALUE opcode #25483)
# Python 3.6a1 3361 (lineno delta of code.co_lnotab becomes signed #26107)
# Python 3.6a2 3370 (16 bit wordcode #26647)
# Python 3.6a2 3371 (add BUILD_CONST_KEY_MAP opcode #27140)
# Python 3.6a2 3372 (MAKE_FUNCTION simplification, remove MAKE_CLOSURE
# #27095)
# Python 3.6b1 3373 (add BUILD_STRING opcode #27078)
# Python 3.6b1 3375 (add SETUP_ANNOTATIONS and STORE_ANNOTATION opcodes
# #27985)
# Python 3.6b1 3376 (simplify CALL_FUNCTIONs & BUILD_MAP_UNPACK_WITH_CALL
#27213)
# Python 3.6b1 3377 (set __class__ cell from type.__new__ #23722)
# Python 3.6b2 3378 (add BUILD_TUPLE_UNPACK_WITH_CALL #28257)
# Python 3.6rc1 3379 (more thorough __class__ validation #23722)
# Python 3.7a1 3390 (add LOAD_METHOD and CALL_METHOD opcodes #26110)
# Python 3.7a2 3391 (update GET_AITER #31709)
# Python 3.7a4 3392 (PEP 552: Deterministic pycs #31650)
# Python 3.7b1 3393 (remove STORE_ANNOTATION opcode #32550)
# Python 3.7b5 3394 (restored docstring as the first stmt in the body;
# this might affected the first line number #32911)
# Python 3.8a1 3400 (move frame block handling to compiler #17611)
# Python 3.8a1 3401 (add END_ASYNC_FOR #33041)
# Python 3.8a1 3410 (PEP570 Python Positional-Only Parameters #36540)
# Python 3.8b2 3411 (Reverse evaluation order of key: value in dict
# comprehensions #35224)
# Python 3.8b2 3412 (Swap the position of positional args and positional
# only args in ast.arguments #37593)
# Python 3.8b4 3413 (Fix "break" and "continue" in "finally" #37830)

02 PyCodeObject

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
/* Bytecode object */
typedef struct {
PyObject_HEAD
int co_argcount; /* #arguments, except *args 代码块的位置参数的个数 */
int co_nlocals; /* #local variables 代码块中局部变量的个数,包括其位置参数的个数。 */
int co_stacksize; /* #entries needed for evaluation stack 执行该代码块需要的栈空间。 */
int co_flags; /* 用来表示参数中是否有 *args 或者 **kwargs */
PyObject *co_code; /* instruction opcodes 该段代码块所对应的 opcode 指令序列。 */
PyObject *co_consts; /* list (constants used) 保存代码块中所有的常量 */
PyObject *co_names; /* list of strings (names used) 保存代码块中所有的符号 */
PyObject *co_varnames; /* tuple of strings (local variable names) 代码块中所有局部变量名 */
PyObject *co_freevars; /* tuple of strings (free variable names) 闭包所需要的 */
PyObject *co_cellvars; /* tuple of strings (cell variable names) 闭包所需要的,内部嵌套函数所需要的局部变量集合 */
/* The rest doesn't count for hash/cmp */
PyObject *co_filename; /* string (where it was loaded from) */
PyObject *co_name; /* string (name, for reference) */
int co_firstlineno; /* first source line number */
PyObject *co_lnotab; /* string (encoding addr<->lineno mapping) See
Objects/lnotab_notes.txt for details. */
void *co_zombieframe; /* for optimization only (see frameobject.c) */
PyObject *co_weakreflist;/* to support weakrefs to code objects */
} PyCodeObject;

所有的 PyCodeObject 都是通过调用以下的函数得以运行:

1
PyObject * PyEval_EvalFrameEx(PyFrameObject *f, int throwflag)

这个函数是 Python 的一个重量级的函数,他的作用即是执行中间码,Python 的代码都是通过调用这个函数来运行的。

PyFrameObject 这个数据结构:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
/* Frame object */
typedef struct _frame {
PyObject_VAR_HEAD
struct _frame *f_back; /* previous frame, or NULL */
PyCodeObject *f_code; /* 【重要】code segment */
PyObject *f_builtins; /* builtin symbol table (PyDictObject) */
PyObject *f_globals; /* 【重要】global symbol table (PyDictObject) */
PyObject *f_locals; /* 【重要】local symbol table (any mapping) */
PyObject **f_valuestack; /* points after the last local */
/* Next free slot in f_valuestack. Frame creation sets to f_valuestack.
Frame evaluation usually NULLs it, but a frame that yields sets it
to the current stack top. */
PyObject **f_stacktop; /* 【重要】 */
PyObject *f_trace; /* Trace function */

/* If an exception is raised in this frame, the next three are used to
* record the exception info (if any) originally in the thread state. See
* comments before set_exc_info() -- it's not obvious.
* Invariant: if _type is NULL, then so are _value and _traceback.
* Desired invariant: all three are NULL, or all three are non-NULL. That
* one isn't currently true, but "should be".
*/
PyObject *f_exc_type, *f_exc_value, *f_exc_traceback;

PyThreadState *f_tstate;
int f_lasti; /* Last instruction if called */
/* Call PyFrame_GetLineNumber() instead of reading this field
directly. As of 2.3 f_lineno is only valid when tracing is
active (i.e. when f_trace is set). At other times we use
PyCode_Addr2Line to calculate the line from the current
bytecode index. */
int f_lineno; /* Current line number */
int f_iblock; /* index in f_blockstack */
PyTryBlock f_blockstack[CO_MAXBLOCKS]; /* for try and loop blocks */
PyObject *f_localsplus[1]; /* locals+stack, dynamically sized */
} PyFrameObject;


参考资料:

  1. Python 字节码 https://www.cnblogs.com/ningjing213/p/16224595.html
  2. Python 程序的执行原理 http://tech.uc.cn/?p=1932、
  3. .pyc 文件的结构 https://blog.csdn.net/weixin_45055269/article/details/105682945
  4. python Magic Number对照表以及pyc修复方法 https://www.cnblogs.com/Here-is-SG/p/15885799.html