A function is an organized, reusable block of code used to implement a single piece of (or related) functionality.
Functions improve an application's modularity and code reuse. You already know that Python provides many built-in functions, such as print(), but you can also create your own, called user-defined functions.
You can define a function that does whatever you want. The simple rules: a function block begins with the def keyword, followed by the function name and parentheses; any parameters go inside the parentheses; the indented body follows the colon; and return [expression] exits the function (a bare return returns None).
def hello():
    print("Hello World!")

i = hello()   # hello() has no return statement, so i is None
i
def max(i, j):   # note: this shadows the built-in max()
    if i > j:
        return i
    else:
        return j

a = 4
b = 5
i = 1
i = max(a, b)
print(i, max(a, b))
# Function that computes an area
def area(width, height):
    return width * height

def print_welcome(name):
    print("Welcome", name)

print_welcome("zhansan")
w = 4
h = 5
print("width =", w, " height =", h, " area =", area(w, h))
Defining a function gives it a name, specifies the parameters it takes, and lays out the structure of its code block.
Once the basic structure of a function is complete, you can execute it by calling it from another function or directly from the Python prompt.
The following example calls the printme() function:
# Define the function
def printme(str):   # note: the parameter name shadows the built-in str
    # Print whatever string is passed in
    print(str)
    return

# Call the function
printme("Calling a user-defined function!")
printme("Calling the same function again")
In Python, strings, tuples, and numbers are immutable objects, while lists, dicts, and the like are mutable.
Immutable types: after a = 5, assigning a = 10 actually creates a new int object 10 and rebinds a to it, discarding 5. The value of a is not modified in place; a is effectively re-created.
Mutable types: after la = [1,2,3,4], assigning la[2] = 5 changes the third element of the list la. The list itself is not replaced; only part of its contents is modified.
Passing an immutable type resembles C++ pass-by-value (e.g., integers, strings, tuples). In fun(a), only the value of a is passed, and the object a itself is unaffected; rebinding a inside fun(a) just creates a new object.
Passing a mutable type resembles C++ pass-by-reference (e.g., lists, dictionaries). In fun(la), la itself is passed, so modifications inside fun affect la outside as well.
Since everything in Python is an object, strictly speaking we should not talk about pass-by-value versus pass-by-reference, but about passing immutable objects versus passing mutable objects.
Example of passing an immutable object, using the id() function to observe how the object identity changes:
# Inspect the id of the parameter
def change(a):
    print(id(a))   # same id as the argument: both names point to one object
    a = 10
    print(id(a))   # a now points to a new object

a = 1
print(id(a))
change(a)
print(a)   # still 1; the rebinding inside change() did not affect the caller
As you can see, before the assignment inside the function, the formal and actual parameters point to the same object (same id); after the formal parameter is rebound inside the function, it points to an object with a different id.
# Passing a mutable object
def changeme(mylist):
    "Modify the passed-in list"
    mylist.append([1, 2, 3, 4])
    print("Inside the function 1: ", mylist)
    return

# Call the changeme function
mylist = [10, 20, 30]
changeme(mylist)
print("Outside the function 2: ", mylist)
The following formal parameter types can be used when calling a function: required arguments, keyword arguments, default arguments, and variable-length arguments.
Required arguments must be passed in the correct order, and the number of arguments in the call must match the declaration.
When calling printme(), you must pass exactly one argument, otherwise a TypeError is raised:
# Calling with no argument raises an error
def printme(str):
    "Print whatever string is passed in"
    print(str)
    return

# Calling printme() without an argument raises a TypeError
printme("hello")
Keyword arguments are closely tied to the function call: the call uses parameter names to determine which value is assigned to which parameter.
Keyword arguments allow the call to list arguments in a different order than declared, because the Python interpreter matches values by parameter name.
The following example calls printme() using the parameter name:
# Pass the argument by keyword
def printme(str):
    "Print whatever string is passed in"
    print(str)
    return

# Call the printme function
printme(str="you are a rookie")
# The following example shows that keyword arguments need not follow the declared order:
def printinfo(name, age):
    "Print the passed-in values"
    print("Name: ", name)
    print("Age: ", age)
    return

# Call the printinfo function
printinfo(age=30, name="zhansan")
# Default argument example
def printinfo(name, age=35):
    "Print the passed-in values"
    print("Name: ", name)
    print("Age: ", age)
    return

# Call the printinfo function
printinfo(age=50, name="大橘子")
print("------------------------")
printinfo(name="二狗子")   # age falls back to the default value 35
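One caveat about default values that the example above does not show: defaults are evaluated once, at definition time, so a mutable default such as a list is shared across calls. A minimal sketch (append_item is an illustrative name, not from the original):

def append_item(item, bucket=[]):     # one list object, created at def time
    bucket.append(item)
    return bucket

print(append_item(1))   # [1]
print(append_item(2))   # [1, 2] -- the same list again!

def append_item_safe(item, bucket=None):
    if bucket is None:                # make a fresh list on every call
        bucket = []
    bucket.append(item)
    return bucket

print(append_item_safe(1))   # [1]
print(append_item_safe(2))   # [2]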
'''
def functionname([formal_args,] *var_args_tuple):
    "function docstring"
    function_suite
    return [expression]
'''
# Variable-length argument example
def printinfo(arg1, *vartuple):
    "Print all passed-in arguments"
    print("Output: ")
    print(arg1)
    print(vartuple)

# Call the printinfo function
printinfo(70, 60, 50, 40, 30, 20, 10)
If no extra positional arguments are passed in the call, vartuple is an empty tuple. We can also simply not pass any unnamed extra arguments, as in the following example:
# The extra arguments can also be iterated over
def printinfo(arg1, *vartuple):
    "Print all passed-in arguments"
    print("Output 1: ")
    print(arg1)
    for var in vartuple:
        print("Output 2:", var)
    return

# Call the printinfo function
printinfo(10)
printinfo(2000, 60, 50)
# Another form takes a parameter with two asterisks **. The basic syntax is:
'''
def functionname([formal_args,] **var_args_dict):
    "function docstring"
    function_suite
    return [expression]
'''
# A parameter with ** collects the extra keyword arguments into a dictionary:
def printinfo(arg1, **vardict):
    "Print all passed-in arguments"
    print("Output: ")
    print(arg1)
    print(vardict)

# Call the printinfo function
printinfo(1, key="myword", b=3)
# Parameters after a bare * must be passed as keyword arguments
def f(a, b, *, c):
    return a + b + c

# f(1, 2, 3)   # TypeError: c is keyword-only
f(1, 2, c=3)   # OK
Python uses lambda to create anonymous functions.
Anonymous means the function is not defined with a standard def statement.
A lambda consists of a single expression, with the following syntax:
'''
lambda [arg1 [, arg2, ..., argn]]: expression
'''
sum = lambda arg1, arg2: arg1 + arg2   # note: this shadows the built-in sum()

# Call the sum lambda
print("The sum is : ", sum(10, 20))
print("The sum is : ", sum(20, 20))
The return [expression] statement exits a function, optionally passing back a value to the caller. A return statement without an expression returns None. The previous examples did not demonstrate returning a value; the following example shows how return is used:
#
# A function with a return value
#
a = 1

def sum(arg1, arg2):   # note: this shadows the built-in sum()
    # Return the sum of the two arguments
    total = arg1 + arg2
    print("Inside the function : ", total + a)   # a is read from the global scope
    return total

# Call the sum function
total = sum(10, 20)
print("Outside the function : ", total)
Python 3.8 added the / parameter syntax, which marks the parameters before it as positional-only: they must be passed by position and cannot be passed as keyword arguments.
In the following example:
def f(a, b, /, c, d, *, e, f):
    print(a, b, c, d, e, f)

# f(10, 20, 30, d=40, e=50, f=60)          # OK: a, b positional; e, f by keyword
# f(10, b=20, c=30, d=40, e=50, f=60)      # error: b is positional-only
f(10, 20, 30, 40, e=50, f=60)              # e and f must be passed as keyword arguments
When programming in the Python interpreter, if you exit and re-enter it, all the functions and variables you defined are lost.
For this reason, Python provides a way to store these definitions in a file for use by scripts or interactive interpreter sessions; such a file is called a module.
A module is a file, with the suffix .py, that contains the functions and variables you define. A module can be imported by other programs in order to use its functions and other features; this is also how the Python standard library is used.
Here is an example that uses a module from the Python standard library:
import sys

print('Command-line arguments:')
for i in sys.argv:
    print(i)
print('\n\nPython path:', sys.path, '\n')
Command-line arguments:
C:\Users\zeng_\AppData\Roaming\Python\Python310\site-packages\ipykernel_launcher.py
--ip=127.0.0.1
--stdin=9003
--control=9001
--hb=9000
--Session.signature_scheme="hmac-sha256"
--Session.key=b"1619aeb2-8e35-428b-8590-960ecd2b487b"
--shell=9002
--transport="tcp"
--iopub=9004
--f=c:\Users\zeng_\AppData\Roaming\jupyter\runtime\kernel-v2-3704NFbhunP4F8Tt.json

Python path: ['c:\\VSWork\\Pythonwork\\0001', 'c:\\ProgramData\\Anaconda3\\envs\\python310\\python310.zip', 'c:\\ProgramData\\Anaconda3\\envs\\python310\\DLLs', 'c:\\ProgramData\\Anaconda3\\envs\\python310\\lib', 'c:\\ProgramData\\Anaconda3\\envs\\python310', '', 'C:\\Users\\zeng_\\AppData\\Roaming\\Python\\Python310\\site-packages', 'C:\\Users\\zeng_\\AppData\\Roaming\\Python\\Python310\\site-packages\\win32', 'C:\\Users\\zeng_\\AppData\\Roaming\\Python\\Python310\\site-packages\\win32\\lib', 'C:\\Users\\zeng_\\AppData\\Roaming\\Python\\Python310\\site-packages\\Pythonwin', 'c:\\ProgramData\\Anaconda3\\envs\\python310\\lib\\site-packages', 'c:\\ProgramData\\Anaconda3\\envs\\python310\\lib\\site-packages\\win32', 'c:\\ProgramData\\Anaconda3\\envs\\python310\\lib\\site-packages\\win32\\lib', 'c:\\ProgramData\\Anaconda3\\envs\\python310\\lib\\site-packages\\Pythonwin']
A module is brought in with the import statement:
import module1[, module2[, ... moduleN]]
A module is imported only once, no matter how many times you execute import; this keeps the imported module from being executed over and over again (if you do need to re-execute an edited module in the same session, see the reload sketch below).
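If you really need to re-execute a module after editing it, the standard library's importlib.reload can be used; a minimal sketch (mymodule is a placeholder name, not from the original):

import importlib
import mymodule               # placeholder module, imported once

# ... after editing mymodule.py on disk ...
importlib.reload(mymodule)    # re-executes the module body and refreshes its names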
When we use an import statement, how does the Python interpreter find the corresponding file?
This involves Python's search path, a list of directory names; the interpreter looks for the imported module in each of these directories in turn.
This looks much like an environment variable, and in fact the search path can also be configured through environment variables.
The search path is determined when Python is compiled or installed, and installing new libraries may modify it as well. It is stored in the path variable of the sys module. As a simple experiment, type the following in an interactive interpreter:
import sys
sys.path
sys.path is a list whose first item may be the empty string '', which stands for the current directory, i.e. the directory the Python interpreter was started from (for a script, the directory containing the script; printing sys.path from a script makes this explicit).
Therefore, if a file with the same name as the module you want to import exists in the current directory, it will shadow the module you intended to import.
Knowing how the search path works, you can modify sys.path in a script to import modules that are not on the search path, as sketched below.
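For example, a script can extend the search path before importing (the directory name below is hypothetical):

import sys
sys.path.append('/path/to/my/modules')   # hypothetical directory holding extra modules
# import mymodule                        # would now be found via the extended path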
Now, in the interpreter's current directory or one of the directories on sys.path, create a file named fibo.py with the following code:
def fib(n):    # write the Fibonacci series up to n
    a, b = 0, 1
    while b < n:
        print(b, end=' ')
        a, b = b, a + b
    print()

def fib2(n):   # return the Fibonacci series up to n
    result = []
    a, b = 0, 1
    while b < n:
        result.append(b)
        a, b = b, a + b
    return result
Then start the Python interpreter and import the module with the following command:
import fibo
fibo.fib(1000)
1 1 2 3 5 8 13 21 34 55 89 144 233 377 610 987
fibo.fib2(100)
[1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89]
fibo.__name__
'fibo'
Python's from statement lets you import specific names from a module into the current namespace. The syntax is:
from modname import name1[, name2[, ... nameN]]
from modname import *
from fibo import fib, fib2
fib(500)
from math import sin, cos
y = sin(2)
print(y)
1 1 2 3 5 8 13 21 34 55 89 144 233 377
0.9092974268256817
NumPy (Numerical Python) is an extension library for the Python language. It supports large multi-dimensional arrays and matrix operations, and also provides a large collection of mathematical functions for operating on arrays.
NumPy's predecessor, Numeric, was originally developed by Jim Hugunin and other collaborators. In 2005, Travis Oliphant combined the features of Numarray, another library of the same nature, into Numeric and added further extensions to create NumPy. NumPy is open source and maintained by many collaborators.
NumPy is a very fast mathematical library, mainly used for array computation. It includes a powerful N-dimensional array object, ndarray, broadcasting functions, tools for integrating C/C++/Fortran code, and linear algebra, Fourier transform, and random number capabilities.
NumPy is usually used together with SciPy (Scientific Python) and Matplotlib (a plotting library). This combination is widely used as a replacement for MatLab; it forms a powerful scientific computing environment that helps us learn data science and machine learning with Python.
SciPy is an open-source Python library of algorithms and mathematical tools.
SciPy contains modules for optimization, linear algebra, integration, interpolation, special functions, fast Fourier transforms, signal and image processing, ordinary differential equation solvers, and other computations common in science and engineering.
Matplotlib is a visualization interface for the Python programming language and its numerical extension NumPy. It provides an API for embedding plots into applications using general-purpose GUI toolkits such as Tkinter, wxPython, Qt, or GTK+.
import numpy as np

a = [1, 2, 3, 4, 5, 6]
b = np.array(a)
b.resize(2, 3)    # reshape b in place to 2x3
b.transpose()     # return the 3x2 transpose
array([[1, 4],
       [2, 5],
       [3, 6]])
a = np.arange(0, 12, 1)   # 0..11
a.resize(3, 4)            # in-place reshape to 3x4
print(a)
[[ 0  1  2  3]
 [ 4  5  6  7]
 [ 8  9 10 11]]
b = np.linspace(1, 7, 14)   # linspace already returns an ndarray
# b.reshape(2, 7) would return a reshaped view; resize reshapes in place instead
b.resize(2, 7)
print(b.shape)
(2, 7)
b.T                   # transposed view, shape (7, 2)
b.argmax()            # index of the largest element in the flattened array
b.conjugate()         # complex conjugate; a no-op for this real-valued array
np.squeeze(b)         # removes axes of length 1 (none yet, so shape stays (2, 7))
b.resize(1, 1, 2, 7)  # in place; the total size must stay at 14 elements
b.squeeze()           # drops the two length-1 axes, giving shape (2, 7) again
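The distinction between reshape() and resize() used above deserves a note: reshape() returns a reshaped view and leaves the original binding untouched, while resize() changes the array in place (and requires the array to own its data). A minimal sketch:

a = np.arange(6)
b = a.reshape(2, 3)       # a new 2x3 view; a itself keeps shape (6,)
print(a.shape, b.shape)   # (6,) (2, 3)

c = np.arange(6)
c.resize(2, 3)            # in place; returns None
print(c.shape)            # (2, 3)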
Pandas is an extension library for the Python language, used for data analysis.
Pandas is an open-source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools.
The name Pandas derives from the terms "panel data" and "Python data analysis".
Pandas is a powerful toolset for analyzing structured data, built on top of NumPy (which provides high-performance matrix operations).
Pandas can import data from various file formats, such as CSV, JSON, SQL, and Microsoft Excel.
Pandas supports a wide range of operations on data, such as merging, reshaping, and selection, as well as data cleaning and data wrangling features.
Pandas is widely used in academic, financial, statistical, and many other data analysis fields.
Pandas' main data structures are Series (one-dimensional data) and DataFrame (two-dimensional data); these two structures are sufficient to handle most typical use cases in finance, statistics, social science, engineering, and other fields.
A Series is a one-dimensional array-like object consisting of a set of data (of any NumPy data type) together with an associated array of data labels, called its index.
A DataFrame is a tabular data structure containing an ordered collection of columns, each of which may hold a different value type (numeric, string, boolean, etc.). A DataFrame has both a row index and a column index; it can be thought of as a dictionary of Series sharing a common index.
Pandas website: https://pandas.pydata.org/
Pandas source code: https://github.com/pandas-dev/pandas
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
See the Intro to data structures section.
As a first example, read_csv() loads a CSV file into a DataFrame; here parse_dates parses the trade_date column as dates and index_col makes it the index:
data=pd.read_csv("./stocktest/stock600848.csv",parse_dates=["trade_date"],index_col=["trade_date"])
data
Unnamed: 0 | ts_code | open | high | low | close | vol | |
---|---|---|---|---|---|---|---|
trade_date | |||||||
2021-10-26 | 0 | 600848.SH | 15.08 | 15.20 | 15.02 | 15.02 | 24291.16 |
2021-10-25 | 1 | 600848.SH | 15.20 | 15.20 | 15.00 | 15.09 | 34938.46 |
2021-10-22 | 2 | 600848.SH | 15.25 | 15.50 | 15.21 | 15.25 | 46717.41 |
2021-10-21 | 3 | 600848.SH | 15.26 | 15.33 | 15.13 | 15.20 | 26697.95 |
2021-10-20 | 4 | 600848.SH | 15.19 | 15.40 | 15.13 | 15.26 | 31790.33 |
... | ... | ... | ... | ... | ... | ... | ... |
2000-01-10 | 4823 | 600848.SH | 10.36 | 10.74 | 10.00 | 10.65 | 13267.00 |
2000-01-07 | 4824 | 600848.SH | 10.27 | 10.44 | 10.02 | 10.35 | 11930.00 |
2000-01-06 | 4825 | 600848.SH | 9.97 | 10.30 | 9.68 | 10.17 | 8910.00 |
2000-01-05 | 4826 | 600848.SH | 9.64 | 10.18 | 9.48 | 10.04 | 8457.00 |
2000-01-04 | 4827 | 600848.SH | 9.45 | 9.66 | 9.15 | 9.60 | 3826.00 |
4828 rows × 7 columns
Creating a Series by passing a list of values, letting pandas create a default integer index:
s = pd.Series([1, 3, 5, np.nan, 6, 8])
s
0    1.0
1    3.0
2    5.0
3    NaN
4    6.0
5    8.0
dtype: float64
dates = pd.date_range("20230101", periods=7)
dates
DatetimeIndex(['2023-01-01', '2023-01-02', '2023-01-03', '2023-01-04', '2023-01-05', '2023-01-06', '2023-01-07'], dtype='datetime64[ns]', freq='D')
df = pd.DataFrame(np.random.randn(7, 4), index=dates, columns=list("ABCD"))
df
A | B | C | D | |
---|---|---|---|---|
2023-01-01 | -1.321121 | 0.057394 | 0.169561 | -0.329157 |
2023-01-02 | 1.986053 | -0.304579 | 0.799928 | 0.716096 |
2023-01-03 | 1.026101 | -0.160384 | -0.237493 | -0.695067 |
2023-01-04 | 1.339618 | -1.477273 | 0.243761 | 0.333993 |
2023-01-05 | 0.868598 | -0.464636 | -0.705627 | 1.197284 |
2023-01-06 | 1.154911 | -1.454133 | -1.478631 | -0.959069 |
2023-01-07 | -0.507658 | 2.229784 | 0.475991 | 0.196368 |
df2 = pd.DataFrame(
{
"A": 1.0,
"B": pd.Timestamp("20130102"),
"C": pd.Series(1, index=list(range(4)), dtype="float32"),
"D": np.array([3] * 4, dtype="int32"),
"E": pd.Categorical(["test", "train", "test", "train"]),
"F": "foo",
}
)
df2
A | B | C | D | E | F | |
---|---|---|---|---|---|---|
0 | 1.0 | 2013-01-02 | 1.0 | 3 | test | foo |
1 | 1.0 | 2013-01-02 | 1.0 | 3 | train | foo |
2 | 1.0 | 2013-01-02 | 1.0 | 3 | test | foo |
3 | 1.0 | 2013-01-02 | 1.0 | 3 | train | foo |
df2.dtypes
A           float64
B    datetime64[ns]
C           float32
D             int32
E          category
F            object
dtype: object
If you’re using IPython, tab completion for column names (as well as public attributes) is automatically enabled. Here’s a subset of the attributes that will be completed:

df2.<TAB>  # noqa: E225, E999
df2.A                  df2.bool
df2.abs                df2.boxplot
df2.add                df2.C
df2.add_prefix         df2.clip
df2.add_suffix         df2.columns
df2.align              df2.copy
df2.all                df2.count
df2.any                df2.combine
df2.append             df2.D
df2.apply              df2.describe
df2.applymap           df2.diff
df2.B                  df2.duplicated
See the Basics section.
Use DataFrame.head() and DataFrame.tail() to view the top and bottom rows of the frame respectively:
df.head()
df.tail(3)
df.index
df.columns
Index(['A', 'B', 'C', 'D'], dtype='object')
df.to_numpy()
array([[ 0.42617925,  0.21491699, -1.39557492, -0.63709394],
       [-0.15597782, -0.08090035,  1.6964907 , -1.42085817],
       [ 1.57716656,  0.2989674 ,  1.12132911, -1.63155905],
       [-0.81414061, -1.50162613,  0.56557247, -0.77898062],
       [ 0.63096165,  2.46513493,  1.93998832,  0.88508883],
       [ 0.01136164, -0.12850564, -0.96812863, -0.07347257],
       [ 0.30480466, -1.65843555, -0.05266834, -0.08216438]])
df2.to_numpy()
array([[1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'test', 'foo'],
       [1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'train', 'foo'],
       [1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'test', 'foo'],
       [1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'train', 'foo']], dtype=object)
df.describe()
A | B | C | D | |
---|---|---|---|---|
count | 7.000000 | 7.000000 | 7.000000 | 7.000000 |
mean | 0.282908 | -0.055778 | 0.415287 | -0.534149 |
std | 0.740172 | 1.367420 | 1.284479 | 0.865241 |
min | -0.814141 | -1.658436 | -1.395575 | -1.631559 |
25% | -0.072308 | -0.815066 | -0.510398 | -1.099919 |
50% | 0.304805 | -0.080900 | 0.565572 | -0.637094 |
75% | 0.528570 | 0.256942 | 1.408910 | -0.077818 |
max | 1.577167 | 2.465135 | 1.939988 | 0.885089 |
df.T
2023-01-01 | 2023-01-02 | 2023-01-03 | 2023-01-04 | 2023-01-05 | 2023-01-06 | 2023-01-07 | |
---|---|---|---|---|---|---|---|
A | 0.426179 | -0.155978 | 1.577167 | -0.814141 | 0.630962 | 0.011362 | 0.304805 |
B | 0.214917 | -0.080900 | 0.298967 | -1.501626 | 2.465135 | -0.128506 | -1.658436 |
C | -1.395575 | 1.696491 | 1.121329 | 0.565572 | 1.939988 | -0.968129 | -0.052668 |
D | -0.637094 | -1.420858 | -1.631559 | -0.778981 | 0.885089 | -0.073473 | -0.082164 |
df.sort_index(axis=1,ascending=False)
D | C | B | A | |
---|---|---|---|---|
2023-01-01 | -0.637094 | -1.395575 | 0.214917 | 0.426179 |
2023-01-02 | -1.420858 | 1.696491 | -0.080900 | -0.155978 |
2023-01-03 | -1.631559 | 1.121329 | 0.298967 | 1.577167 |
2023-01-04 | -0.778981 | 0.565572 | -1.501626 | -0.814141 |
2023-01-05 | 0.885089 | 1.939988 | 2.465135 | 0.630962 |
2023-01-06 | -0.073473 | -0.968129 | -0.128506 | 0.011362 |
2023-01-07 | -0.082164 | -0.052668 | -1.658436 | 0.304805 |
df.sort_values(by="B")
A | B | C | D | |
---|---|---|---|---|
2023-01-07 | 0.304805 | -1.658436 | -0.052668 | -0.082164 |
2023-01-04 | -0.814141 | -1.501626 | 0.565572 | -0.778981 |
2023-01-06 | 0.011362 | -0.128506 | -0.968129 | -0.073473 |
2023-01-02 | -0.155978 | -0.080900 | 1.696491 | -1.420858 |
2023-01-01 | 0.426179 | 0.214917 | -1.395575 | -0.637094 |
2023-01-03 | 1.577167 | 0.298967 | 1.121329 | -1.631559 |
2023-01-05 | 0.630962 | 2.465135 | 1.939988 | 0.885089 |
Note
While standard Python / NumPy expressions for selecting and setting are intuitive and come in handy for interactive work, for production code, we recommend the optimized pandas data access methods, DataFrame.at(), DataFrame.iat(), DataFrame.loc() and DataFrame.iloc().
See the indexing documentation Indexing and Selecting Data and MultiIndex / Advanced Indexing.
Selecting a single column, which yields a Series, equivalent to df.A:
df["A"]
df.index
DatetimeIndex(['2023-01-01', '2023-01-02', '2023-01-03', '2023-01-04', '2023-01-05', '2023-01-06', '2023-01-07'], dtype='datetime64[ns]', freq='D')
df[0:3]
A | B | C | D | |
---|---|---|---|---|
2023-01-01 | 0.426179 | 0.214917 | -1.395575 | -0.637094 |
2023-01-02 | -0.155978 | -0.080900 | 1.696491 | -1.420858 |
2023-01-03 | 1.577167 | 0.298967 | 1.121329 | -1.631559 |
See more in Selection by Label using DataFrame.loc() or DataFrame.at().
For getting a cross section using a label:
df.loc[dates[0]]
A    0.353124
B   -0.952107
C   -0.810359
D    0.319397
Name: 2013-01-01 00:00:00, dtype: float64
df.loc[:,["A","B"]]
A | B | |
---|---|---|
2013-01-01 | 0.353124 | -0.952107 |
2013-01-02 | 1.011881 | -1.467853 |
2013-01-03 | -0.021260 | -0.387072 |
2013-01-04 | -0.471901 | -0.362371 |
2013-01-05 | -0.785147 | 2.458449 |
2013-01-06 | 0.894152 | 0.082539 |
df.loc["20130102":"20130104", ["A", "B"]]
A | B | |
---|---|---|
2013-01-02 | 1.011881 | -1.467853 |
2013-01-03 | -0.021260 | -0.387072 |
2013-01-04 | -0.471901 | -0.362371 |
df.loc["20130102", ["A", "B"]]
A    1.011881
B   -1.467853
Name: 2013-01-02 00:00:00, dtype: float64
df.at[dates[0], "A"]
0.3531243539756235
See more in Selection by Position using DataFrame.iloc() or DataFrame.iat().
Select via the position of the passed integers:
df.iloc[3]
A   -0.471901
B   -0.362371
C   -1.663361
D    0.417683
Name: 2013-01-04 00:00:00, dtype: float64
df.iloc[3:5,0:2]
A | B | |
---|---|---|
2013-01-04 | -0.471901 | -0.362371 |
2013-01-05 | -0.785147 | 2.458449 |
df.iloc[[1, 2, 4], [0, 2]]
A | C | |
---|---|---|
2013-01-02 | 1.011881 | 0.984630 |
2013-01-03 | -0.021260 | 0.094271 |
2013-01-05 | -0.785147 | 0.060556 |
df.iloc[1:3, :]
A | B | C | D | |
---|---|---|---|---|
2013-01-02 | 1.011881 | -1.467853 | 0.984630 | -1.423896 |
2013-01-03 | -0.021260 | -0.387072 | 0.094271 | 0.830802 |
For getting a value explicitly:
df.iloc[1,1]
-1.4678527958278718
For getting fast access to a scalar (equivalent to the prior method):
df.iat[1,1]
-1.4678527958278718
df[df["A"]>0]
A | B | C | D | |
---|---|---|---|---|
2013-01-01 | 0.353124 | -0.952107 | -0.810359 | 0.319397 |
2013-01-02 | 1.011881 | -1.467853 | 0.984630 | -1.423896 |
2013-01-06 | 0.894152 | 0.082539 | 0.198704 | 1.096456 |
df[df > 0]
A | B | C | D | |
---|---|---|---|---|
2013-01-01 | 0.353124 | NaN | NaN | 0.319397 |
2013-01-02 | 1.011881 | NaN | 0.984630 | NaN |
2013-01-03 | NaN | NaN | 0.094271 | 0.830802 |
2013-01-04 | NaN | NaN | NaN | 0.417683 |
2013-01-05 | NaN | 2.458449 | 0.060556 | NaN |
2013-01-06 | 0.894152 | 0.082539 | 0.198704 | 1.096456 |
df2 = df.copy()
df2["E"] = ["one", "one", "two", "three", "four", "three"]
df2
A | B | C | D | E | |
---|---|---|---|---|---|
2013-01-01 | 0.353124 | -0.952107 | -0.810359 | 0.319397 | one |
2013-01-02 | 1.011881 | -1.467853 | 0.984630 | -1.423896 | one |
2013-01-03 | -0.021260 | -0.387072 | 0.094271 | 0.830802 | two |
2013-01-04 | -0.471901 | -0.362371 | -1.663361 | 0.417683 | three |
2013-01-05 | -0.785147 | 2.458449 | 0.060556 | -0.461612 | four |
2013-01-06 | 0.894152 | 0.082539 | 0.198704 | 1.096456 | three |
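A natural follow-up, as in the pandas guide, is filtering with isin(); a sketch using the E column just added, keeping the rows where E is "two" or "four":

df2[df2["E"].isin(["two", "four"])]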
s1 = pd.Series([1, 2, 3, 4, 5, 6], index=pd.date_range("20130102", periods=6))
s1
2013-01-02    1
2013-01-03    2
2013-01-04    3
2013-01-05    4
2013-01-06    5
2013-01-07    6
Freq: D, dtype: int64
df["F"] = s1
df.at[dates[0], "A"] = 0
df.iat[0, 1] = 0
df.loc[:, "D"] = np.array([5] * len(df))
df2 = df.copy()
df2[df2 > 0] = -df2
df2
A | B | C | D | F | |
---|---|---|---|---|---|
2023-01-01 | 0.000000 | 0.000000 | -1.395575 | -5 | NaN |
2023-01-02 | -0.155978 | -0.080900 | -1.696491 | -5 | NaN |
2023-01-03 | -1.577167 | -0.298967 | -1.121329 | -5 | NaN |
2023-01-04 | -0.814141 | -1.501626 | -0.565572 | -5 | NaN |
2023-01-05 | -0.630962 | -2.465135 | -1.939988 | -5 | NaN |
2023-01-06 | -0.011362 | -0.128506 | -0.968129 | -5 | NaN |
2023-01-07 | -0.304805 | -1.658436 | -0.052668 | -5 | NaN |
pandas primarily uses the value np.nan to represent missing data. It is by default not included in computations. See the Missing Data section.
Reindexing allows you to change/add/delete the index on a specified axis. This returns a copy of the data:
df1 = df.reindex(index=dates[0:4], columns=list(df.columns) + ["E"])
df1.loc[dates[0]: dates[1], "E"] = 1
df1
A | B | C | D | F | E | |
---|---|---|---|---|---|---|
2023-01-01 | 0.000000 | 0.000000 | -1.395575 | 5 | NaN | 1.0 |
2023-01-02 | -0.155978 | -0.080900 | 1.696491 | 5 | NaN | 1.0 |
2023-01-03 | 1.577167 | 0.298967 | 1.121329 | 5 | NaN | NaN |
2023-01-04 | -0.814141 | -1.501626 | 0.565572 | 5 | NaN | NaN |
df1.dropna(how="any")   # every row is dropped here, because column F is entirely NaN
A | B | C | D | F | E |
---|---|---|---|---|---|
df1.fillna(value=0)
A | B | C | D | F | E | |
---|---|---|---|---|---|---|
2023-01-01 | 0.000000 | 0.000000 | -1.395575 | 5 | 0.0 | 1.0 |
2023-01-02 | -0.155978 | -0.080900 | 1.696491 | 5 | 0.0 | 1.0 |
2023-01-03 | 1.577167 | 0.298967 | 1.121329 | 5 | 0.0 | 0.0 |
2023-01-04 | -0.814141 | -1.501626 | 0.565572 | 5 | 0.0 | 0.0 |
pd.isna(df1)
A | B | C | D | F | E | |
---|---|---|---|---|---|---|
2023-01-01 | False | False | False | False | True | False |
2023-01-02 | False | False | False | False | True | False |
2023-01-03 | False | False | False | False | True | True |
2023-01-04 | False | False | False | False | True | True |
See the Basic section on Binary Ops.
Operations in general exclude missing data.
Performing a descriptive statistic:
df.mean()
A    0.222025
B   -0.086481
C    0.415287
D    5.000000
F         NaN
dtype: float64
df.mean(1)
2023-01-01    0.901106
2023-01-02    1.614903
2023-01-03    1.999366
2023-01-04    0.812451
2023-01-05    2.509021
2023-01-06    0.978682
2023-01-07    0.898425
Freq: D, dtype: float64
Operating with objects that have different dimensionality and need alignment. In addition, pandas automatically broadcasts along the specified dimension:
s=pd.Series([1, 3, 5, np.nan, 6, 8], index=dates).shift(2)
s
2013-01-01    NaN
2013-01-02    NaN
2013-01-03    1.0
2013-01-04    3.0
2013-01-05    5.0
2013-01-06    NaN
Freq: D, dtype: float64
df.sub(s, axis="index")
A | B | C | D | F | |
---|---|---|---|---|---|
2013-01-01 | NaN | NaN | NaN | NaN | NaN |
2013-01-02 | NaN | NaN | NaN | NaN | NaN |
2013-01-03 | -1.021260 | -1.387072 | -0.905729 | 4.0 | 1.0 |
2013-01-04 | -3.471901 | -3.362371 | -4.663361 | 2.0 | 0.0 |
2013-01-05 | -5.785147 | -2.541551 | -4.939444 | 0.0 | -1.0 |
2013-01-06 | NaN | NaN | NaN | NaN | NaN |
df.apply(np.cumsum)
A | B | C | D | F | |
---|---|---|---|---|---|
2013-01-01 | 0.000000 | 0.000000 | -0.810359 | 5 | NaN |
2013-01-02 | 1.011881 | -1.467853 | 0.174271 | 10 | 1.0 |
2013-01-03 | 0.990621 | -1.854924 | 0.268541 | 15 | 3.0 |
2013-01-04 | 0.518720 | -2.217296 | -1.394820 | 20 | 6.0 |
2013-01-05 | -0.266427 | 0.241153 | -1.334264 | 25 | 10.0 |
2013-01-06 | 0.627724 | 0.323692 | -1.135561 | 30 | 15.0 |
df.apply(lambda x: x.max() - x.min())
A 1.797028 B 3.926302 C 2.647991 D 0.000000 F 4.000000 dtype: float64
s = pd.Series(np.random.randint(0, 7, size=10))
s
0    4
1    5
2    0
3    4
4    3
5    6
6    3
7    5
8    0
9    1
dtype: int32
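Histogramming such a series is typically done with value_counts(), which tallies how many times each value occurs (a sketch on the series above):

s.value_counts()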
Series is equipped with a set of string processing methods in the str attribute that make it easy to operate on each element of the array, as in the code snippet below. Note that pattern-matching in str generally uses regular expressions by default (and in some cases always uses them). See more at Vectorized String Methods.
s = pd.Series(["A", "B", "C", "Aaba", "Baca", np.nan, "CABA", "dog", "cat"])
s.str.lower()
0       a
1       b
2       c
3    aaba
4    baca
5     NaN
6    caba
7     dog
8     cat
dtype: object
pandas provides various facilities for easily combining together Series and DataFrame objects with various kinds of set logic for the indexes and relational algebra functionality in the case of join / merge-type operations.
See the Merging section.
Concatenating pandas objects together along an axis with concat():
df = pd.DataFrame(np.random.randn(10, 4))
df
0 | 1 | 2 | 3 | |
---|---|---|---|---|
0 | 1.070919 | -0.909621 | -0.058982 | 1.065355 |
1 | 0.942190 | 2.307742 | 1.313294 | 0.904494 |
2 | -0.206415 | -0.255212 | 0.415303 | -0.665433 |
3 | 0.528560 | -0.424218 | -0.674206 | 0.179578 |
4 | -1.147629 | 1.593130 | 0.083113 | -0.375661 |
5 | -0.067266 | -0.636576 | 0.025570 | 0.808047 |
6 | -0.480931 | 1.140837 | 0.408431 | 2.374829 |
7 | -0.035620 | 0.351948 | 1.159103 | -1.073382 |
8 | -0.940397 | 0.636860 | -0.273128 | 1.445847 |
9 | -2.517698 | -0.010000 | -0.223322 | 0.710669 |
pieces = [df[:3], df[3:7], df[7:]]
pd.concat(pieces)
0 | 1 | 2 | 3 | |
---|---|---|---|---|
0 | 1.070919 | -0.909621 | -0.058982 | 1.065355 |
1 | 0.942190 | 2.307742 | 1.313294 | 0.904494 |
2 | -0.206415 | -0.255212 | 0.415303 | -0.665433 |
3 | 0.528560 | -0.424218 | -0.674206 | 0.179578 |
4 | -1.147629 | 1.593130 | 0.083113 | -0.375661 |
5 | -0.067266 | -0.636576 | 0.025570 | 0.808047 |
6 | -0.480931 | 1.140837 | 0.408431 | 2.374829 |
7 | -0.035620 | 0.351948 | 1.159103 | -1.073382 |
8 | -0.940397 | 0.636860 | -0.273128 | 1.445847 |
9 | -2.517698 | -0.010000 | -0.223322 | 0.710669 |
Note
Adding a column to a DataFrame is relatively fast. However, adding a row requires a copy, and may be expensive. We recommend passing a pre-built list of records to the DataFrame constructor instead of building a DataFrame by iteratively appending records to it.
merge() enables SQL style join types along specific columns. See the Database style joining section.
left = pd.DataFrame({"key": ["foo", "foo"], "lval": [1, 2]})
right = pd.DataFrame({"key": ["foo", "foo"], "rval": [4, 5]})
pd.merge(left, right, on="key")
key | lval | rval | |
---|---|---|---|
0 | foo | 1 | 4 |
1 | foo | 1 | 5 |
2 | foo | 2 | 4 |
3 | foo | 2 | 5 |
left = pd.DataFrame({"key": ["foo", "bar"], "lval": [1, 2]})
right = pd.DataFrame({"key": ["foo", "bar"], "rval": [4, 5]})
pd.merge(left, right, on="key")
key | lval | rval | |
---|---|---|---|
0 | foo | 1 | 4 |
1 | bar | 2 | 5 |
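merge() performs an inner join by default; other SQL-style joins can be selected with the how parameter. A minimal sketch (left2/right2 are illustrative frames, not from the original):

left2 = pd.DataFrame({"key": ["foo", "bar"], "lval": [1, 2]})
right2 = pd.DataFrame({"key": ["foo", "baz"], "rval": [4, 5]})
pd.merge(left2, right2, on="key", how="left")   # "bar" has no match in right2, so its rval is NaN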
By “group by” we are referring to a process involving one or more of the following steps:
Splitting the data into groups based on some criteria
Applying a function to each group independently
Combining the results into a data structure
See the Grouping section.
df = pd.DataFrame(
{
"A": ["foo", "bar", "foo", "bar", "foo", "bar", "foo", "foo"],
"B": ["one", "one", "two", "three", "two", "two", "one", "three"],
"C": np.random.randn(8),
"D": np.random.randn(8),
}
)
df
A | B | C | D | |
---|---|---|---|---|
0 | foo | one | 0.469626 | -0.453533 |
1 | bar | one | -0.528487 | 0.102063 |
2 | foo | two | -0.165808 | 0.726964 |
3 | bar | three | 0.934137 | 2.426803 |
4 | foo | two | 0.775809 | -0.393736 |
5 | bar | two | 0.792758 | 1.703328 |
6 | foo | one | 1.664355 | -1.335949 |
7 | foo | three | 0.005753 | 0.520489 |
df.groupby("A")[["C", "D"]].sum()
C | D | |
---|---|---|
A | ||
bar | 1.198408 | 4.232194 |
foo | 2.749736 | -0.935764 |
df.groupby(["A", "B"]).sum()
C | D | ||
---|---|---|---|
A | B | ||
bar | one | -0.528487 | 0.102063 |
three | 0.934137 | 2.426803 | |
two | 0.792758 | 1.703328 | |
foo | one | 2.133982 | -1.789481 |
three | 0.005753 | 0.520489 | |
two | 0.610001 | 0.333228 |
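A single aggregation is not the only option: agg() accepts a mapping from column to aggregation function. A sketch on the same frame (the choice of mean/sum here is illustrative):

df.groupby("A").agg({"C": "mean", "D": "sum"})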
tuples = list(
zip(
["bar", "bar", "baz", "baz", "foo", "foo", "qux", "qux"],
["one", "two", "one", "two", "one", "two", "one", "two"],
)
)
index = pd.MultiIndex.from_tuples(tuples, names=["first", "second"])
df = pd.DataFrame(np.random.randn(8, 2), index=index, columns=["A", "B"])
df2 = df[:4]
df2
A | B | ||
---|---|---|---|
first | second | ||
bar | one | 0.592387 | 0.720654 |
two | -0.240999 | 0.132012 | |
baz | one | 1.181864 | 0.469986 |
two | -1.677578 | 0.122411 |
stacked = df2.stack()
stacked
first  second
bar    one     A    0.592387
               B    0.720654
       two     A   -0.240999
               B    0.132012
baz    one     A    1.181864
               B    0.469986
       two     A   -1.677578
               B    0.122411
dtype: float64
stacked.unstack()
stacked.unstack(1)
stacked.unstack(0)
first | bar | baz | |
---|---|---|---|
second | |||
one | A | 0.592387 | 1.181864 |
B | 0.720654 | 0.469986 | |
two | A | -0.240999 | -1.677578 |
B | 0.132012 | 0.122411 |
df = pd.DataFrame(
{
"A": ["one", "one", "two", "three"] * 3,
"B": ["A", "B", "C"] * 4,
"C": ["foo", "foo", "foo", "bar", "bar", "bar"] * 2,
"D": np.random.randn(12),
"E": np.random.randn(12),
}
)
df
A | B | C | D | E | |
---|---|---|---|---|---|
0 | one | A | foo | 0.571207 | 0.443537 |
1 | one | B | foo | 0.059570 | 0.549866 |
2 | two | C | foo | -0.327195 | 0.222671 |
3 | three | A | bar | -0.142887 | -0.280958 |
4 | one | B | bar | 1.484399 | 0.341254 |
5 | one | C | bar | 0.657294 | 0.111515 |
6 | two | A | foo | -1.309137 | -1.301511 |
7 | three | B | foo | -0.382706 | 0.322421 |
8 | one | C | foo | -0.217702 | 0.702098 |
9 | one | A | bar | -0.479563 | -1.872191 |
10 | two | B | bar | 1.102865 | -0.331685 |
11 | three | C | bar | -0.626540 | -0.590578 |
pd.pivot_table(df, values="D", index=["A", "B"], columns=["C"])
C | bar | foo | |
---|---|---|---|
A | B | ||
one | A | -0.479563 | 0.571207 |
B | 1.484399 | 0.059570 | |
C | 0.657294 | -0.217702 | |
three | A | -0.142887 | NaN |
B | NaN | -0.382706 | |
C | -0.626540 | NaN | |
two | A | NaN | -1.309137 |
B | 1.102865 | NaN | |
C | NaN | -0.327195 |
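pivot_table() aggregates duplicate entries with mean by default; a different reducer can be supplied through aggfunc. A sketch (the choice of sum is illustrative):

pd.pivot_table(df, values="D", index=["A", "B"], columns=["C"], aggfunc="sum")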
pandas has simple, powerful, and efficient functionality for performing resampling operations during frequency conversion (e.g., converting secondly data into 5-minutely data). This is extremely common in, but not limited to, financial applications. See the Time Series section.
rng = pd.date_range("1/1/2012", periods=100, freq="S")
ts = pd.Series(np.random.randint(0, 500, len(rng)), index=rng)
ts.resample("5Min").sum()
2012-01-01    23930
Freq: 5T, dtype: int32
Series.tz_localize() localizes a time series to a time zone:
rng = pd.date_range("3/6/2012 00:00", periods=5, freq="D")
ts = pd.Series(np.random.randn(len(rng)), rng)
ts
2012-03-06   -1.069089
2012-03-07   -0.231289
2012-03-08   -0.710142
2012-03-09   -0.830763
2012-03-10    1.825935
Freq: D, dtype: float64
ts_utc = ts.tz_localize("UTC")
ts_utc
2012-03-06 00:00:00+00:00   -1.069089
2012-03-07 00:00:00+00:00   -0.231289
2012-03-08 00:00:00+00:00   -0.710142
2012-03-09 00:00:00+00:00   -0.830763
2012-03-10 00:00:00+00:00    1.825935
Freq: D, dtype: float64
Series.tz_convert() converts a timezone-aware time series to another time zone:
ts_utc.tz_convert("US/Eastern")
2012-03-05 19:00:00-05:00   -1.069089
2012-03-06 19:00:00-05:00   -0.231289
2012-03-07 19:00:00-05:00   -0.710142
2012-03-08 19:00:00-05:00   -0.830763
2012-03-09 19:00:00-05:00    1.825935
Freq: D, dtype: float64
Converting between time span representations:
rng = pd.date_range("1/1/2012", periods=5, freq="M")
ts = pd.Series(np.random.randn(len(rng)), index=rng)
ts
2012-01-31    0.374555
2012-02-29    0.033252
2012-03-31    0.377800
2012-04-30    0.121815
2012-05-31   -0.552315
Freq: M, dtype: float64
ps = ts.to_period()
ps
2012-01    0.374555
2012-02    0.033252
2012-03    0.377800
2012-04    0.121815
2012-05   -0.552315
Freq: M, dtype: float64
ps.to_timestamp()
2012-01-01    0.374555
2012-02-01    0.033252
2012-03-01    0.377800
2012-04-01    0.121815
2012-05-01   -0.552315
Freq: MS, dtype: float64
pandas can include categorical data in a DataFrame. For full docs, see the categorical introduction and the API documentation.
df = pd.DataFrame(
{"id": [1, 2, 3, 4, 5, 6], "raw_grade": ["a", "b", "b", "a", "a", "e"]}
)
df
id | raw_grade | |
---|---|---|
0 | 1 | a |
1 | 2 | b |
2 | 3 | b |
3 | 4 | a |
4 | 5 | a |
5 | 6 | e |
Convert the raw grades to a categorical data type:
df["grade"] = df["raw_grade"].astype("category")
df["grade"]
0    a
1    b
2    b
3    a
4    a
5    e
Name: grade, dtype: category
Categories (3, object): ['a', 'b', 'e']
Rename the categories to more meaningful names:
new_categories = ["very good", "good", "very bad"]
df["grade"] = df["grade"].cat.rename_categories(new_categories)
Reorder the categories and simultaneously add the missing categories (methods under Series.cat() return a new Series by default):
df["grade"] = df["grade"].cat.set_categories(
["very bad", "bad", "medium", "good", "very good"]
)
df["grade"]
0    very good
1         good
2         good
3    very good
4    very good
5     very bad
Name: grade, dtype: category
Categories (5, object): ['very bad', 'bad', 'medium', 'good', 'very good']
df.sort_values(by="grade")
id | raw_grade | grade | |
---|---|---|---|
5 | 6 | e | very bad |
1 | 2 | b | good |
2 | 3 | b | good |
0 | 1 | a | very good |
3 | 4 | a | very good |
4 | 5 | a | very good |
df.groupby("grade").size()
grade
very bad     1
bad          0
medium       0
good         2
very good    3
dtype: int64
import matplotlib.pyplot as plt
plt.close("all")
ts = pd.Series(np.random.randn(1000),
index=pd.date_range("1/1/2000", periods=1000))
ts = ts.cumsum()
ts.plot()
#plt.show()
<AxesSubplot:>
df = pd.DataFrame(
np.random.randn(1000, 4), index=ts.index, columns=["A", "B", "C", "D"]
)
df = df.cumsum()
plt.figure()
df.plot()
plt.legend(loc='best')
<matplotlib.legend.Legend at 0x20ce9501870>
<Figure size 432x288 with 0 Axes>
df.to_csv("foo.csv")
Reading from a CSV file using read_csv():
pd.read_csv("foo.csv")
df.to_excel("foo.xlsx", sheet_name="Sheet1")
pd.read_excel("foo.xlsx", "Sheet1", index_col=None, na_values=["NA"])