Python Study Notes

A collection of Python usage notes, kept for future reference.

File I/O

JSON files

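import json

# Each line of the file is one JSON object (a dict)
data_list = []
with open(path, 'r', encoding='utf8') as fp:
    for line in fp:
        if not line.strip():
            continue  # skip blank lines
        # json.loads no longer accepts an encoding argument (removed in Python 3.9)
        json_data = json.loads(line)
        print('JSON data from the file:', json_data)
        print('Type of the parsed data:', type(json_data))
        data_list.append(json_data)
# no fp.close() needed: the with block closes the file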
        
import json

# The whole file is a single JSON array of dicts
with open(path, 'r', encoding='utf8') as fp:
    json_data = json.load(fp)
    print('First 10 JSON records from the file:', json_data[:10])
    print('Type of the parsed data:', type(json_data))
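The snippets above only cover reading. As a complement, here is a minimal sketch of writing both layouts (one JSON object per line, and one JSON array for the whole file) and reading them back. The helper names, the sample records and the temporary file paths are assumptions made up for this illustration.

```python
import json
import os
import tempfile

# Hypothetical sample records, used only for this illustration
records = [{"id": 1, "text": "hello"}, {"id": 2, "text": "世界"}]


def write_json_lines(path, items):
    """Write one JSON object per line."""
    with open(path, "w", encoding="utf8") as fp:
        for item in items:
            # ensure_ascii=False keeps non-ASCII characters readable in the file
            fp.write(json.dumps(item, ensure_ascii=False) + "\n")


def read_json_lines(path):
    """Read a file that holds one JSON object per line, skipping blank lines."""
    with open(path, "r", encoding="utf8") as fp:
        return [json.loads(line) for line in fp if line.strip()]


def write_json_array(path, items):
    """Write the whole list as a single JSON array."""
    with open(path, "w", encoding="utf8") as fp:
        json.dump(items, fp, ensure_ascii=False, indent=2)


with tempfile.TemporaryDirectory() as tmp_dir:
    lines_path = os.path.join(tmp_dir, "data.jsonl")
    array_path = os.path.join(tmp_dir, "data.json")

    write_json_lines(lines_path, records)
    loaded_lines = read_json_lines(lines_path)

    write_json_array(array_path, records)
    with open(array_path, "r", encoding="utf8") as fp:
        loaded_array = json.load(fp)

print(loaded_lines == records)  # True
print(loaded_array == records)  # True
```

The line-per-object layout is convenient for large files because records can be appended and read one at a time, whereas the array layout has to be parsed as a whole.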

Arrays

Getting the index of an element

name1 = ['python', 'java', 'php', 'MySql', 'C++', 'C', 'php', 'C#']
print(name1.index('php'))
'''
2
'''
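Note that 'php' appears twice in the list, but list.index only returns the first position, and it raises ValueError when the value is missing. The sketch below shows one way to collect every position and to fall back to a default. The helper index_or_default is a made-up name for this illustration.

```python
names = ['python', 'java', 'php', 'MySql', 'C++', 'C', 'php', 'C#']

# list.index stops at the first match
first_php = names.index('php')

# enumerate yields (position, value) pairs, so every match can be collected
all_php = [i for i, name in enumerate(names) if name == 'php']


def index_or_default(items, value, default=-1):
    """Return the first index of value, or default if value is absent."""
    try:
        return items.index(value)
    except ValueError:
        return default


print(first_php)                      # 2
print(all_php)                        # [2, 6]
print(index_or_default(names, 'go'))  # -1
```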

# Indices of the maximum and minimum values
import numpy as np
l = [1, 2, 3, 4, 5]
a = np.array(l)
max_index = np.argmax(a)
print(max_index)
min_index = np.argmin(a)
print(min_index)

''' Output
4
0
'''
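When NumPy is not available, the same indices can be found with the built-in max and min over the index range. This is only a sketch for plain lists; the sample values are made up, and ties resolve to the first occurrence, which is also what np.argmax and np.argmin do.

```python
values = [3, 9, 1, 9, 1]

# Compare indices by the value stored at each index
max_index = max(range(len(values)), key=values.__getitem__)
min_index = min(range(len(values)), key=values.__getitem__)

print(max_index)  # 1 (first of the two 9s)
print(min_index)  # 2 (first of the two 1s)
```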

Enumerating combinations and permutations

Lists

import itertools
'''
Combinations (order does not matter): C_N^M
combinations(collection of N items, number of items M per group)
'''
c = list(itertools.combinations([1, 2, 3, 4], 3))
print(c)
print(len(c))
'''
Permutations (order matters): A_N^M
permutations(collection of N items, number of items M per group)
'''
p = list(itertools.permutations([1, 2, 3, 4], 3))
print(p)
print(len(p))

''' Output
[(1, 2, 3), (1, 2, 4), (1, 3, 4), (2, 3, 4)]
4
[(1, 2, 3), (1, 2, 4), (1, 3, 2), (1, 3, 4), (1, 4, 2), (1, 4, 3), (2, 1, 3), (2, 1, 4), (2, 3, 1), (2, 3, 4), (2, 4, 1), (2, 4, 3), (3, 1, 2), (3, 1, 4), (3, 2, 1), (3, 2, 4), (3, 4, 1), (3, 4, 2), (4, 1, 2), (4, 1, 3), (4, 2, 1), (4, 2, 3), (4, 3, 1), (4, 3, 2)]
24
'''
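The lengths 4 and 24 above match the closed-form counts C_N^M = N! / (M! (N-M)!) and A_N^M = N! / (N-M)!. A small sketch that checks this with math.factorial follows; the helper names are made up for this illustration.

```python
import itertools
from math import factorial


def n_combinations(n, m):
    """C_n^m: number of ways to choose m items from n, ignoring order."""
    return factorial(n) // (factorial(m) * factorial(n - m))


def n_permutations(n, m):
    """A_n^m: number of ordered arrangements of m items chosen from n."""
    return factorial(n) // factorial(n - m)


items = [1, 2, 3, 4]
m = 3
combo_count = len(list(itertools.combinations(items, m)))
perm_count = len(list(itertools.permutations(items, m)))

print(combo_count, n_combinations(len(items), m))  # 4 4
print(perm_count, n_permutations(len(items), m))   # 24 24
```

Because these counts grow factorially, it is usually better to iterate over the itertools objects directly than to materialize them with list() for large inputs.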

# Cartesian product of two collections
for item in itertools.product([1, 2, 3, 4], [100, 200]):
    print(item)

''' Output
(1, 100)
(1, 200)
(2, 100)
(2, 200)
(3, 100)
(3, 200)
(4, 100)
(4, 200)
'''

Counting

from collections import Counter
c = Counter('abcasd')
print(c)
''' Output (Python 3.7+: equal counts keep first-seen order)
Counter({'a': 2, 'b': 1, 'c': 1, 's': 1, 'd': 1})
'''
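A few other Counter operations tend to be useful alongside plain counting. The sketch below, with made-up sample data, shows most_common, lookup of a missing key, and update.

```python
from collections import Counter

words = ['a', 'b', 'c', 'a', 's', 'd', 'a', 'b']
counts = Counter(words)

# The n most frequent items as (item, count) pairs
top_two = counts.most_common(2)
print(top_two)          # [('a', 3), ('b', 2)]

# A missing key counts as 0 instead of raising KeyError
print(counts['z'])      # 0

# update adds counts from another iterable
counts.update(['z', 'z'])
print(counts['z'])      # 2
```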

Multiprocessing

import multiprocessing

res = []
pool = multiprocessing.Pool(40)  # 40 worker processes
for i in range(data_size):
    args = [src_datas[i].strip(), tgt_datas[i].strip(), i, data_size]
    # process_single is the function that processes a single record
    res.append(pool.apply_async(process_single, (args,)))
pool.close()  # close the pool; no new tasks can be submitted
pool.join()  # wait for all worker processes to finish
# collect the return values of all tasks
for r in res:
    data = r.get()
    data_list.append(data)

Hanging processes

Environment: Ubuntu 20.04, Python 3.7

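When processing fairly large data sets (on the order of millions of records), the program above hangs: all the data has been processed, but the program never exits. A possible cause is a deadlock among the processes when too much data is handled at once¹.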

Solution

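Process the data in batches. Since a large amount of data cannot be handled all at once, finish each batch (say, 10% of the data) by calling pool.close() and pool.join(), then start a new pool for the next batch.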
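The batching idea can be sketched roughly as follows. This is an illustration under assumptions rather than the original program: process_single here is a stand-in that returns the record index and a length, chunked and run_in_batches are made-up helper names, and the batch size and process count are arbitrary.

```python
import multiprocessing


def process_single(args):
    """Hypothetical stand-in for the per-record function used in these notes."""
    src, tgt, index = args
    return index, len(src) + len(tgt)


def chunked(items, batch_size):
    """Yield consecutive slices of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def run_in_batches(tasks, batch_size, processes=4):
    """Process tasks batch by batch, using a fresh pool for every batch."""
    results = []
    for batch in chunked(tasks, batch_size):
        pool = multiprocessing.Pool(processes)
        pending = [pool.apply_async(process_single, (task,)) for task in batch]
        pool.close()  # no more tasks for this batch
        pool.join()   # wait until this batch is finished
        results.extend(r.get() for r in pending)
    return results


if __name__ == "__main__":
    src_datas = ["a ", "bb ", "ccc ", "dddd ", "eeeee "]
    tgt_datas = ["x ", "y ", "z ", "w ", "v "]
    tasks = [(s.strip(), t.strip(), i)
             for i, (s, t) in enumerate(zip(src_datas, tgt_datas))]
    print(run_in_batches(tasks, batch_size=2, processes=2))
```

Creating a pool per batch costs some startup time, but it bounds how many pending results are held at once and makes it easier to see which batch a slow or stuck record belongs to.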

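Correction: in my case, the hang turned out to be caused by a few extreme inputs that pushed the computation to a very high complexity, $O(n!)$. The processes working on those inputs kept running and never exited, which looked like a hang. Batching is still recommended, though, since it can reduce …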

References

[1] Python doc

[2] Why your multiprocessing Pool is stuck (it’s full of sharks!)

Classes

Special attributes and methods

The __repr__() method: displaying attributes

By default, printing an instance only shows something like "<class name> object at <memory address>".

class Posting(object):
    def __init__(self, docid, tf=0):
        self.docid = docid
        self.tf = tf

    def __repr__(self) -> str:
        return "<docid: %d, tf: %d>" % (self.docid, self.tf)

p=Posting(-1,0)
print(p)
'''
<docid: -1, tf: 0>
'''
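To make the contrast concrete, the sketch below prints one class without __repr__ and one with it. The class Plain is a made-up name for this illustration. It also shows that containers such as lists use __repr__ for their elements.

```python
class Plain:
    """No __repr__ defined, so the default from object is used."""

    def __init__(self, docid):
        self.docid = docid


class Posting:
    def __init__(self, docid, tf=0):
        self.docid = docid
        self.tf = tf

    def __repr__(self) -> str:
        return "<docid: %d, tf: %d>" % (self.docid, self.tf)


# Default form: module and class name plus a memory address that varies per run
print(Plain(1))

# Lists call repr() on each element
postings = [Posting(1, 2), Posting(3, 4)]
print(postings)  # [<docid: 1, tf: 2>, <docid: 3, tf: 4>]
```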

hasattr

Syntax : hasattr(obj, key)
Parameters :
obj : The object whose attribute has to be checked.
key : Attribute which needs to be checked.
Returns : Returns True, if attribute is present else returns False.

Checks whether obj has an attribute named key.
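A short sketch of hasattr in use, reusing a Posting-like class for illustration (the score attribute is made up):

```python
class Posting:
    def __init__(self, docid, tf=0):
        self.docid = docid
        self.tf = tf


p = Posting(7)

has_docid = hasattr(p, "docid")
has_score_before = hasattr(p, "score")

# Attributes can be added at runtime, after which hasattr sees them
p.score = 0.5
has_score_after = hasattr(p, "score")

print(has_docid)               # True
print(has_score_before)        # False
print(has_score_after)         # True
print(hasattr(p, "__init__"))  # True: methods are attributes too
```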

getattr

Syntax : getattr(obj, key, def)
Parameters :
obj : The object whose attributes need to be processed.
key : The attribute of object
def : The default value to return if the attribute is not found.
Returns : The attribute's value if the attribute is present, or the default value if it
is not. Raises AttributeError if the attribute is not present and no default value is
specified.
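The three cases described above (attribute present, attribute missing with a default, attribute missing without a default) can be sketched like this. The Posting class and the score attribute are used purely for illustration.

```python
class Posting:
    def __init__(self, docid, tf=0):
        self.docid = docid
        self.tf = tf


p = Posting(7, 3)

# Attribute present: same as p.tf
tf_value = getattr(p, "tf")

# Attribute missing, default given: the default is returned
score_value = getattr(p, "score", 0.0)

# Attribute missing, no default: AttributeError is raised
try:
    getattr(p, "score")
    raised = False
except AttributeError:
    raised = True

print(tf_value)     # 3
print(score_value)  # 0.0
print(raised)       # True

# getattr is mostly useful when the attribute name is only known at runtime
for name in ("docid", "tf"):
    print(name, getattr(p, name))
```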