A function is an organized, reusable block of code used to implement a single piece of (or related) functionality.
Functions improve an application's modularity and code reuse. You already know that Python provides many built-in functions, such as print(), but you can also create your own, called user-defined functions.
You can define a function that does whatever you want. The simple rules: a function block begins with the def keyword, followed by the function name and parentheses; any parameters go inside the parentheses; the indented body follows the colon; and return [expression] exits the function (a bare return returns None).
def hello():
    print("Hello World!")

i = hello()   # hello() has no return statement, so i is None
i
def max(i, j):   # note: this shadows the built-in max()
    if i > j:
        return i
    else:
        return j

a = 4
b = 5
i = 1
i = max(a, b)
print(i, max(a, b))
# Function that computes an area
def area(width, height):
    return width * height

def print_welcome(name):
    print("Welcome", name)

print_welcome("zhansan")
w = 4
h = 5
print("width =", w, " height =", h, " area =", area(w, h))
Defining a function gives it a name, specifies the parameters it takes, and lays out the structure of its code block.
Once the basic structure of a function is complete, you can execute it by calling it from another function or directly from the Python prompt.
The following example calls the printme() function:
# Define the function
def printme(str):   # note: the parameter name shadows the built-in str
    # Print whatever string is passed in
    print(str)
    return

# Call the function
printme("Calling a user-defined function!")
printme("Calling the same function again")
In Python, strings, tuples, and numbers are immutable objects, while lists, dicts, and the like are mutable.
Immutable types: after a = 5, assigning a = 10 actually creates a new int object 10 and rebinds a to it, discarding 5. The value of a is not modified in place; a is effectively re-created.
Mutable types: after la = [1,2,3,4], assigning la[2] = 5 changes the third element of the list la. The list itself is not replaced; only part of its contents is modified.
Passing an immutable type resembles C++ pass-by-value (e.g., integers, strings, tuples). In fun(a), only the value of a is passed, and the object a itself is unaffected; rebinding a inside fun(a) just creates a new object.
Passing a mutable type resembles C++ pass-by-reference (e.g., lists, dictionaries). In fun(la), la itself is passed, so modifications inside fun affect la outside as well.
Since everything in Python is an object, strictly speaking we should not talk about pass-by-value versus pass-by-reference, but about passing immutable objects versus passing mutable objects.
Example of passing an immutable object, using the id() function to observe how the object identity changes:
# Inspect the id of the parameter
def change(a):
    print(id(a))   # same id as the argument: both names point to one object
    a = 10
    print(id(a))   # a now points to a new object

a = 1
print(id(a))
change(a)
print(a)   # still 1; the rebinding inside change() did not affect the caller
As you can see, before the assignment inside the function, the formal and actual parameters point to the same object (same id); after the formal parameter is rebound inside the function, it points to an object with a different id.
# Passing a mutable object
def changeme(mylist):
    "Modify the passed-in list"
    mylist.append([1, 2, 3, 4])
    print("Inside the function 1: ", mylist)
    return

# Call the changeme function
mylist = [10, 20, 30]
changeme(mylist)
print("Outside the function 2: ", mylist)
The following formal parameter types can be used when calling a function: required arguments, keyword arguments, default arguments, and variable-length arguments.
Required arguments must be passed in the correct order, and the number of arguments in the call must match the declaration.
When calling printme(), you must pass exactly one argument, otherwise a TypeError is raised:
# Calling with no argument raises an error
def printme(str):
    "Print whatever string is passed in"
    print(str)
    return

# Calling printme() without an argument raises a TypeError
printme("hello")
Keyword arguments are closely tied to the function call: the call uses parameter names to determine which value is assigned to which parameter.
Keyword arguments allow the call to list arguments in a different order than declared, because the Python interpreter matches values by parameter name.
The following example calls printme() using the parameter name:
# Pass the argument by keyword
def printme(str):
    "Print whatever string is passed in"
    print(str)
    return

# Call the printme function
printme(str="you are a rookie")
# The following example shows that keyword arguments need not follow the declared order:
def printinfo(name, age):
    "Print the passed-in values"
    print("Name: ", name)
    print("Age: ", age)
    return

# Call the printinfo function
printinfo(age=30, name="zhansan")
# Default argument example
def printinfo(name, age=35):
    "Print the passed-in values"
    print("Name: ", name)
    print("Age: ", age)
    return

# Call the printinfo function
printinfo(age=50, name="大橘子")
print("------------------------")
printinfo(name="二狗子")   # age falls back to the default value 35
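One caveat about default values that the example above does not show: defaults are evaluated once, at definition time, so a mutable default such as a list is shared across calls. A minimal sketch (append_item is an illustrative name, not from the original):

def append_item(item, bucket=[]):     # one list object, created at def time
    bucket.append(item)
    return bucket

print(append_item(1))   # [1]
print(append_item(2))   # [1, 2] -- the same list again!

def append_item_safe(item, bucket=None):
    if bucket is None:                # make a fresh list on every call
        bucket = []
    bucket.append(item)
    return bucket

print(append_item_safe(1))   # [1]
print(append_item_safe(2))   # [2]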
'''
def functionname([formal_args,] *var_args_tuple):
    "function docstring"
    function_suite
    return [expression]
'''
# Variable-length argument example
def printinfo(arg1, *vartuple):
    "Print all passed-in arguments"
    print("Output: ")
    print(arg1)
    print(vartuple)

# Call the printinfo function
printinfo(70, 60, 50, 40, 30, 20, 10)
If no extra positional arguments are passed in the call, vartuple is an empty tuple. We can also simply not pass any unnamed extra arguments, as in the following example:
# The extra arguments can also be iterated over
def printinfo(arg1, *vartuple):
    "Print all passed-in arguments"
    print("Output 1: ")
    print(arg1)
    for var in vartuple:
        print("Output 2:", var)
    return

# Call the printinfo function
printinfo(10)
printinfo(2000, 60, 50)
# Another form takes a parameter with two asterisks **. The basic syntax is:
'''
def functionname([formal_args,] **var_args_dict):
    "function docstring"
    function_suite
    return [expression]
'''
# A parameter with ** collects the extra keyword arguments into a dictionary:
def printinfo(arg1, **vardict):
    "Print all passed-in arguments"
    print("Output: ")
    print(arg1)
    print(vardict)

# Call the printinfo function
printinfo(1, key="myword", b=3)
# Parameters after a bare * must be passed as keyword arguments
def f(a, b, *, c):
    return a + b + c

# f(1, 2, 3)   # TypeError: c is keyword-only
f(1, 2, c=3)   # OK
Python uses lambda to create anonymous functions.
Anonymous means the function is not defined with a standard def statement.
A lambda consists of a single expression, with the following syntax:
'''
lambda [arg1 [, arg2, ..., argn]]: expression
'''
sum = lambda arg1, arg2: arg1 + arg2   # note: this shadows the built-in sum()

# Call the sum lambda
print("The sum is : ", sum(10, 20))
print("The sum is : ", sum(20, 20))
The return [expression] statement exits a function, optionally passing back a value to the caller. A return statement without an expression returns None. The previous examples did not demonstrate returning a value; the following example shows how return is used:
#
# A function with a return value
#
a = 1

def sum(arg1, arg2):   # note: this shadows the built-in sum()
    # Return the sum of the two arguments
    total = arg1 + arg2
    print("Inside the function : ", total + a)   # a is read from the global scope
    return total

# Call the sum function
total = sum(10, 20)
print("Outside the function : ", total)
Python 3.8 added the / parameter syntax, which marks the parameters before it as positional-only: they must be passed by position and cannot be passed as keyword arguments.
In the following example:
def f(a, b, /, c, d, *, e, f):
    print(a, b, c, d, e, f)

# f(10, 20, 30, d=40, e=50, f=60)          # OK: a, b positional; e, f by keyword
# f(10, b=20, c=30, d=40, e=50, f=60)      # error: b is positional-only
f(10, 20, 30, 40, e=50, f=60)              # e and f must be passed as keyword arguments
When programming in the Python interpreter, if you exit and re-enter it, all the functions and variables you defined are lost.
For this reason, Python provides a way to store these definitions in a file for use by scripts or interactive interpreter sessions; such a file is called a module.
A module is a file, with the suffix .py, that contains the functions and variables you define. A module can be imported by other programs in order to use its functions and other features; this is also how the Python standard library is used.
Here is an example that uses a module from the Python standard library:
import sys

print('Command-line arguments:')
for i in sys.argv:
    print(i)
print('\n\nPython path:', sys.path, '\n')
Command-line arguments:
C:\Users\zeng_\AppData\Roaming\Python\Python310\site-packages\ipykernel_launcher.py
--ip=127.0.0.1
--stdin=9003
--control=9001
--hb=9000
--Session.signature_scheme="hmac-sha256"
--Session.key=b"1619aeb2-8e35-428b-8590-960ecd2b487b"
--shell=9002
--transport="tcp"
--iopub=9004
--f=c:\Users\zeng_\AppData\Roaming\jupyter\runtime\kernel-v2-3704NFbhunP4F8Tt.json

Python path: ['c:\\VSWork\\Pythonwork\\0001', 'c:\\ProgramData\\Anaconda3\\envs\\python310\\python310.zip', 'c:\\ProgramData\\Anaconda3\\envs\\python310\\DLLs', 'c:\\ProgramData\\Anaconda3\\envs\\python310\\lib', 'c:\\ProgramData\\Anaconda3\\envs\\python310', '', 'C:\\Users\\zeng_\\AppData\\Roaming\\Python\\Python310\\site-packages', 'C:\\Users\\zeng_\\AppData\\Roaming\\Python\\Python310\\site-packages\\win32', 'C:\\Users\\zeng_\\AppData\\Roaming\\Python\\Python310\\site-packages\\win32\\lib', 'C:\\Users\\zeng_\\AppData\\Roaming\\Python\\Python310\\site-packages\\Pythonwin', 'c:\\ProgramData\\Anaconda3\\envs\\python310\\lib\\site-packages', 'c:\\ProgramData\\Anaconda3\\envs\\python310\\lib\\site-packages\\win32', 'c:\\ProgramData\\Anaconda3\\envs\\python310\\lib\\site-packages\\win32\\lib', 'c:\\ProgramData\\Anaconda3\\envs\\python310\\lib\\site-packages\\Pythonwin']
A module is brought in with the import statement:
import module1[, module2[, ... moduleN]]
A module is imported only once, no matter how many times you execute import; this keeps the imported module from being executed over and over again (if you do need to re-execute an edited module in the same session, see the reload sketch below).
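If you really need to re-execute a module after editing it, the standard library's importlib.reload can be used; a minimal sketch (mymodule is a placeholder name, not from the original):

import importlib
import mymodule               # placeholder module, imported once

# ... after editing mymodule.py on disk ...
importlib.reload(mymodule)    # re-executes the module body and refreshes its names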
When we use an import statement, how does the Python interpreter find the corresponding file?
This involves Python's search path, a list of directory names; the interpreter looks for the imported module in each of these directories in turn.
This looks much like an environment variable, and in fact the search path can also be configured through environment variables.
The search path is determined when Python is compiled or installed, and installing new libraries may modify it as well. It is stored in the path variable of the sys module. As a simple experiment, type the following in an interactive interpreter:
import sys
sys.path
sys.path is a list whose first item may be the empty string '', which stands for the current directory, i.e. the directory the Python interpreter was started from (for a script, the directory containing the script; printing sys.path from a script makes this explicit).
Therefore, if a file with the same name as the module you want to import exists in the current directory, it will shadow the module you intended to import.
Knowing how the search path works, you can modify sys.path in a script to import modules that are not on the search path, as sketched below.
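For example, a script can extend the search path before importing (the directory name below is hypothetical):

import sys
sys.path.append('/path/to/my/modules')   # hypothetical directory holding extra modules
# import mymodule                        # would now be found via the extended path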
Now, in the interpreter's current directory or one of the directories on sys.path, create a file named fibo.py with the following code:
def fib(n):    # write the Fibonacci series up to n
    a, b = 0, 1
    while b < n:
        print(b, end=' ')
        a, b = b, a + b
    print()

def fib2(n):   # return the Fibonacci series up to n
    result = []
    a, b = 0, 1
    while b < n:
        result.append(b)
        a, b = b, a + b
    return result
Then start the Python interpreter and import the module with the following command:
import fibo
fibo.fib(1000)
1 1 2 3 5 8 13 21 34 55 89 144 233 377 610 987
fibo.fib2(100)
[1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89]
fibo.__name__
'fibo'
Python's from statement lets you import specific names from a module into the current namespace. The syntax is:
from modname import name1[, name2[, ... nameN]]
from modname import *
from fibo import fib, fib2
fib(500)
from math import sin, cos
y = sin(2)
print(y)
1 1 2 3 5 8 13 21 34 55 89 144 233 377
0.9092974268256817
NumPy (Numerical Python) is an extension library for the Python language. It supports large multi-dimensional arrays and matrix operations, and also provides a large collection of mathematical functions for operating on arrays.
NumPy's predecessor, Numeric, was originally developed by Jim Hugunin and other collaborators. In 2005, Travis Oliphant combined the features of Numarray, another library of the same nature, into Numeric and added further extensions to create NumPy. NumPy is open source and maintained by many collaborators.
NumPy is a very fast mathematical library, mainly used for array computation. It includes a powerful N-dimensional array object, ndarray, broadcasting functions, tools for integrating C/C++/Fortran code, and linear algebra, Fourier transform, and random number capabilities.
NumPy is usually used together with SciPy (Scientific Python) and Matplotlib (a plotting library). This combination is widely used as a replacement for MatLab; it forms a powerful scientific computing environment that helps us learn data science and machine learning with Python.
SciPy is an open-source Python library of algorithms and mathematical tools.
SciPy contains modules for optimization, linear algebra, integration, interpolation, special functions, fast Fourier transforms, signal and image processing, ordinary differential equation solvers, and other computations common in science and engineering.
Matplotlib is a visualization interface for the Python programming language and its numerical extension NumPy. It provides an API for embedding plots into applications using general-purpose GUI toolkits such as Tkinter, wxPython, Qt, or GTK+.
import numpy as np

a = [1, 2, 3, 4, 5, 6]
b = np.array(a)
b.resize(2, 3)    # reshape b in place to 2x3
b.transpose()     # return the 3x2 transpose
array([[1, 4],
       [2, 5],
       [3, 6]])
a = np.arange(0, 12, 1)   # 0..11
a.resize(3, 4)            # in-place reshape to 3x4
print(a)
[[ 0  1  2  3]
 [ 4  5  6  7]
 [ 8  9 10 11]]
b = np.linspace(1, 7, 14)   # linspace already returns an ndarray
# b.reshape(2, 7) would return a reshaped view; resize reshapes in place instead
b.resize(2, 7)
print(b.shape)
(2, 7)
b.T                   # transposed view, shape (7, 2)
b.argmax()            # index of the largest element in the flattened array
b.conjugate()         # complex conjugate; a no-op for this real-valued array
np.squeeze(b)         # removes axes of length 1 (none yet, so shape stays (2, 7))
b.resize(1, 1, 2, 7)  # in place; the total size must stay at 14 elements
b.squeeze()           # drops the two length-1 axes, giving shape (2, 7) again
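The distinction between reshape() and resize() used above deserves a note: reshape() returns a reshaped view and leaves the original binding untouched, while resize() changes the array in place (and requires the array to own its data). A minimal sketch:

a = np.arange(6)
b = a.reshape(2, 3)       # a new 2x3 view; a itself keeps shape (6,)
print(a.shape, b.shape)   # (6,) (2, 3)

c = np.arange(6)
c.resize(2, 3)            # in place; returns None
print(c.shape)            # (2, 3)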
Pandas is an extension library for the Python language, used for data analysis.
Pandas is an open-source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools.
The name Pandas derives from the terms "panel data" and "Python data analysis".
Pandas is a powerful toolset for analyzing structured data, built on top of NumPy (which provides high-performance matrix operations).
Pandas can import data from various file formats, such as CSV, JSON, SQL, and Microsoft Excel.
Pandas supports a wide range of operations on data, such as merging, reshaping, and selection, as well as data cleaning and data wrangling features.
Pandas is widely used in academic, financial, statistical, and many other data analysis fields.
Pandas' main data structures are Series (one-dimensional data) and DataFrame (two-dimensional data); these two structures are sufficient to handle most typical use cases in finance, statistics, social science, engineering, and other fields.
A Series is a one-dimensional array-like object consisting of a set of data (of any NumPy data type) together with an associated array of data labels, called its index.
A DataFrame is a tabular data structure containing an ordered collection of columns, each of which may hold a different value type (numeric, string, boolean, etc.). A DataFrame has both a row index and a column index; it can be thought of as a dictionary of Series sharing a common index.
Pandas website: https://pandas.pydata.org/
Pandas source code: https://github.com/pandas-dev/pandas
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
See the Intro to data structures section.
As a first example, read_csv() loads a CSV file into a DataFrame; here parse_dates parses the trade_date column as dates and index_col makes it the index:
data=pd.read_csv("./stocktest/stock600848.csv",parse_dates=["trade_date"],index_col=["trade_date"])
data
Unnamed: 0 | ts_code | open | high | low | close | vol | |
---|---|---|---|---|---|---|---|
trade_date | |||||||
2021-10-26 | 0 | 600848.SH | 15.08 | 15.20 | 15.02 | 15.02 | 24291.16 |
2021-10-25 | 1 | 600848.SH | 15.20 | 15.20 | 15.00 | 15.09 | 34938.46 |
2021-10-22 | 2 | 600848.SH | 15.25 | 15.50 | 15.21 | 15.25 | 46717.41 |
2021-10-21 | 3 | 600848.SH | 15.26 | 15.33 | 15.13 | 15.20 | 26697.95 |
2021-10-20 | 4 | 600848.SH | 15.19 | 15.40 | 15.13 | 15.26 | 31790.33 |
... | ... | ... | ... | ... | ... | ... | ... |
2000-01-10 | 4823 | 600848.SH | 10.36 | 10.74 | 10.00 | 10.65 | 13267.00 |
2000-01-07 | 4824 | 600848.SH | 10.27 | 10.44 | 10.02 | 10.35 | 11930.00 |
2000-01-06 | 4825 | 600848.SH | 9.97 | 10.30 | 9.68 | 10.17 | 8910.00 |
2000-01-05 | 4826 | 600848.SH | 9.64 | 10.18 | 9.48 | 10.04 | 8457.00 |
2000-01-04 | 4827 | 600848.SH | 9.45 | 9.66 | 9.15 | 9.60 | 3826.00 |
4828 rows × 7 columns
Creating a Series by passing a list of values, letting pandas create a default integer index:
s = pd.Series([1, 3, 5, np.nan, 6, 8])
s
0    1.0
1    3.0
2    5.0
3    NaN
4    6.0
5    8.0
dtype: float64
dates = pd.date_range("20230101", periods=7)
dates
DatetimeIndex(['2023-01-01', '2023-01-02', '2023-01-03', '2023-01-04', '2023-01-05', '2023-01-06', '2023-01-07'], dtype='datetime64[ns]', freq='D')
df = pd.DataFrame(np.random.randn(7, 4), index=dates, columns=list("ABCD"))
df
A | B | C | D | |
---|---|---|---|---|
2023-01-01 | -1.321121 | 0.057394 | 0.169561 | -0.329157 |
2023-01-02 | 1.986053 | -0.304579 | 0.799928 | 0.716096 |
2023-01-03 | 1.026101 | -0.160384 | -0.237493 | -0.695067 |
2023-01-04 | 1.339618 | -1.477273 | 0.243761 | 0.333993 |
2023-01-05 | 0.868598 | -0.464636 | -0.705627 | 1.197284 |
2023-01-06 | 1.154911 | -1.454133 | -1.478631 | -0.959069 |
2023-01-07 | -0.507658 | 2.229784 | 0.475991 | 0.196368 |
df2 = pd.DataFrame(
{
"A": 1.0,
"B": pd.Timestamp("20130102"),
"C": pd.Series(1, index=list(range(4)), dtype="float32"),
"D": np.array([3] * 4, dtype="int32"),
"E": pd.Categorical(["test", "train", "test", "train"]),
"F": "foo",
}
)
df2
A | B | C | D | E | F | |
---|---|---|---|---|---|---|
0 | 1.0 | 2013-01-02 | 1.0 | 3 | test | foo |
1 | 1.0 | 2013-01-02 | 1.0 | 3 | train | foo |
2 | 1.0 | 2013-01-02 | 1.0 | 3 | test | foo |
3 | 1.0 | 2013-01-02 | 1.0 | 3 | train | foo |
df2.dtypes
A           float64
B    datetime64[ns]
C           float32
D             int32
E          category
F            object
dtype: object
If you’re using IPython, tab completion for column names (as well as public attributes) is automatically enabled. Here’s a subset of the attributes that will be completed:

df2.<TAB>  # noqa: E225, E999
df2.A                  df2.bool
df2.abs                df2.boxplot
df2.add                df2.C
df2.add_prefix         df2.clip
df2.add_suffix         df2.columns
df2.align              df2.copy
df2.all                df2.count
df2.any                df2.combine
df2.append             df2.D
df2.apply              df2.describe
df2.applymap           df2.diff
df2.B                  df2.duplicated
See the Basics section.
Use DataFrame.head() and DataFrame.tail() to view the top and bottom rows of the frame respectively:
df.head()
df.tail(3)
df.index
df.columns
Index(['A', 'B', 'C', 'D'], dtype='object')
df.to_numpy()
array([[ 0.42617925,  0.21491699, -1.39557492, -0.63709394],
       [-0.15597782, -0.08090035,  1.6964907 , -1.42085817],
       [ 1.57716656,  0.2989674 ,  1.12132911, -1.63155905],
       [-0.81414061, -1.50162613,  0.56557247, -0.77898062],
       [ 0.63096165,  2.46513493,  1.93998832,  0.88508883],
       [ 0.01136164, -0.12850564, -0.96812863, -0.07347257],
       [ 0.30480466, -1.65843555, -0.05266834, -0.08216438]])
df2.to_numpy()
array([[1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'test', 'foo'],
       [1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'train', 'foo'],
       [1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'test', 'foo'],
       [1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'train', 'foo']], dtype=object)
df.describe()
A | B | C | D | |
---|---|---|---|---|
count | 7.000000 | 7.000000 | 7.000000 | 7.000000 |
mean | 0.282908 | -0.055778 | 0.415287 | -0.534149 |
std | 0.740172 | 1.367420 | 1.284479 | 0.865241 |
min | -0.814141 | -1.658436 | -1.395575 | -1.631559 |
25% | -0.072308 | -0.815066 | -0.510398 | -1.099919 |
50% | 0.304805 | -0.080900 | 0.565572 | -0.637094 |
75% | 0.528570 | 0.256942 | 1.408910 | -0.077818 |
max | 1.577167 | 2.465135 | 1.939988 | 0.885089 |
df.T
2023-01-01 | 2023-01-02 | 2023-01-03 | 2023-01-04 | 2023-01-05 | 2023-01-06 | 2023-01-07 | |
---|---|---|---|---|---|---|---|
A | 0.426179 | -0.155978 | 1.577167 | -0.814141 | 0.630962 | 0.011362 | 0.304805 |
B | 0.214917 | -0.080900 | 0.298967 | -1.501626 | 2.465135 | -0.128506 | -1.658436 |
C | -1.395575 | 1.696491 | 1.121329 | 0.565572 | 1.939988 | -0.968129 | -0.052668 |
D | -0.637094 | -1.420858 | -1.631559 | -0.778981 | 0.885089 | -0.073473 | -0.082164 |
df.sort_index(axis=1,ascending=False)
D | C | B | A | |
---|---|---|---|---|
2023-01-01 | -0.637094 | -1.395575 | 0.214917 | 0.426179 |
2023-01-02 | -1.420858 | 1.696491 | -0.080900 | -0.155978 |
2023-01-03 | -1.631559 | 1.121329 | 0.298967 | 1.577167 |
2023-01-04 | -0.778981 | 0.565572 | -1.501626 | -0.814141 |
2023-01-05 | 0.885089 | 1.939988 | 2.465135 | 0.630962 |
2023-01-06 | -0.073473 | -0.968129 | -0.128506 | 0.011362 |
2023-01-07 | -0.082164 | -0.052668 | -1.658436 | 0.304805 |
df.sort_values(by="B")
A | B | C | D | |
---|---|---|---|---|
2023-01-07 | 0.304805 | -1.658436 | -0.052668 | -0.082164 |
2023-01-04 | -0.814141 | -1.501626 | 0.565572 | -0.778981 |
2023-01-06 | 0.011362 | -0.128506 | -0.968129 | -0.073473 |
2023-01-02 | -0.155978 | -0.080900 | 1.696491 | -1.420858 |
2023-01-01 | 0.426179 | 0.214917 | -1.395575 | -0.637094 |
2023-01-03 | 1.577167 | 0.298967 | 1.121329 | -1.631559 |
2023-01-05 | 0.630962 | 2.465135 | 1.939988 | 0.885089 |
Note
While standard Python / NumPy expressions for selecting and setting are intuitive and come in handy for interactive work, for production code, we recommend the optimized pandas data access methods, DataFrame.at(), DataFrame.iat(), DataFrame.loc() and DataFrame.iloc().
See the indexing documentation Indexing and Selecting Data and MultiIndex / Advanced Indexing.
Selecting a single column, which yields a Series, equivalent to df.A:
df["A"]
df.index
DatetimeIndex(['2023-01-01', '2023-01-02', '2023-01-03', '2023-01-04', '2023-01-05', '2023-01-06', '2023-01-07'], dtype='datetime64[ns]', freq='D')
df[0:3]
A | B | C | D | |
---|---|---|---|---|
2023-01-01 | 0.426179 | 0.214917 | -1.395575 | -0.637094 |
2023-01-02 | -0.155978 | -0.080900 | 1.696491 | -1.420858 |
2023-01-03 | 1.577167 | 0.298967 | 1.121329 | -1.631559 |
See more in Selection by Label using DataFrame.loc() or DataFrame.at().
For getting a cross section using a label:
df.loc[dates[0]]
A    0.353124
B   -0.952107
C   -0.810359
D    0.319397
Name: 2013-01-01 00:00:00, dtype: float64
df.loc[:,["A","B"]]
A | B | |
---|---|---|
2013-01-01 | 0.353124 | -0.952107 |
2013-01-02 | 1.011881 | -1.467853 |
2013-01-03 | -0.021260 | -0.387072 |
2013-01-04 | -0.471901 | -0.362371 |
2013-01-05 | -0.785147 | 2.458449 |
2013-01-06 | 0.894152 | 0.082539 |
df.loc["20130102":"20130104", ["A", "B"]]
A | B | |
---|---|---|
2013-01-02 | 1.011881 | -1.467853 |
2013-01-03 | -0.021260 | -0.387072 |
2013-01-04 | -0.471901 | -0.362371 |
df.loc["20130102", ["A", "B"]]
A    1.011881
B   -1.467853
Name: 2013-01-02 00:00:00, dtype: float64
df.at[dates[0], "A"]
0.3531243539756235
See more in Selection by Position using DataFrame.iloc() or DataFrame.iat().
Select via the position of the passed integers:
df.iloc[3]
A   -0.471901
B   -0.362371
C   -1.663361
D    0.417683
Name: 2013-01-04 00:00:00, dtype: float64
df.iloc[3:5,0:2]
A | B | |
---|---|---|
2013-01-04 | -0.471901 | -0.362371 |
2013-01-05 | -0.785147 | 2.458449 |
df.iloc[[1, 2, 4], [0, 2]]
A | C | |
---|---|---|
2013-01-02 | 1.011881 | 0.984630 |
2013-01-03 | -0.021260 | 0.094271 |
2013-01-05 | -0.785147 | 0.060556 |
df.iloc[1:3, :]
A | B | C | D | |
---|---|---|---|---|
2013-01-02 | 1.011881 | -1.467853 | 0.984630 | -1.423896 |
2013-01-03 | -0.021260 | -0.387072 | 0.094271 | 0.830802 |
For getting a value explicitly:
df.iloc[1,1]
-1.4678527958278718
For getting fast access to a scalar (equivalent to the prior method):
df.iat[1,1]
-1.4678527958278718
df[df["A"]>0]
A | B | C | D | |
---|---|---|---|---|
2013-01-01 | 0.353124 | -0.952107 | -0.810359 | 0.319397 |
2013-01-02 | 1.011881 | -1.467853 | 0.984630 | -1.423896 |
2013-01-06 | 0.894152 | 0.082539 | 0.198704 | 1.096456 |
df[df > 0]
A | B | C | D | |
---|---|---|---|---|
2013-01-01 | 0.353124 | NaN | NaN | 0.319397 |
2013-01-02 | 1.011881 | NaN | 0.984630 | NaN |
2013-01-03 | NaN | NaN | 0.094271 | 0.830802 |
2013-01-04 | NaN | NaN | NaN | 0.417683 |
2013-01-05 | NaN | 2.458449 | 0.060556 | NaN |
2013-01-06 | 0.894152 | 0.082539 | 0.198704 | 1.096456 |
df2 = df.copy()
df2["E"] = ["one", "one", "two", "three", "four", "three"]
df2
A | B | C | D | E | |
---|---|---|---|---|---|
2013-01-01 | 0.353124 | -0.952107 | -0.810359 | 0.319397 | one |
2013-01-02 | 1.011881 | -1.467853 | 0.984630 | -1.423896 | one |
2013-01-03 | -0.021260 | -0.387072 | 0.094271 | 0.830802 | two |
2013-01-04 | -0.471901 | -0.362371 | -1.663361 | 0.417683 | three |
2013-01-05 | -0.785147 | 2.458449 | 0.060556 | -0.461612 | four |
2013-01-06 | 0.894152 | 0.082539 | 0.198704 | 1.096456 | three |
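A natural follow-up, as in the pandas guide, is filtering with isin(); a sketch using the E column just added, keeping the rows where E is "two" or "four":

df2[df2["E"].isin(["two", "four"])]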
s1 = pd.Series([1, 2, 3, 4, 5, 6], index=pd.date_range("20130102", periods=6))
s1
2013-01-02    1
2013-01-03    2
2013-01-04    3
2013-01-05    4
2013-01-06    5
2013-01-07    6
Freq: D, dtype: int64
df["F"] = s1
df.at[dates[0], "A"] = 0
df.iat[0, 1] = 0
df.loc[:, "D"] = np.array([5] * len(df))
df2 = df.copy()
df2[df2 > 0] = -df2
df2
A | B | C | D | F | |
---|---|---|---|---|---|
2023-01-01 | 0.000000 | 0.000000 | -1.395575 | -5 | NaN |
2023-01-02 | -0.155978 | -0.080900 | -1.696491 | -5 | NaN |
2023-01-03 | -1.577167 | -0.298967 | -1.121329 | -5 | NaN |
2023-01-04 | -0.814141 | -1.501626 | -0.565572 | -5 | NaN |
2023-01-05 | -0.630962 | -2.465135 | -1.939988 | -5 | NaN |
2023-01-06 | -0.011362 | -0.128506 | -0.968129 | -5 | NaN |
2023-01-07 | -0.304805 | -1.658436 | -0.052668 | -5 | NaN |
pandas primarily uses the value np.nan to represent missing data. It is by default not included in computations. See the Missing Data section.
Reindexing allows you to change/add/delete the index on a specified axis. This returns a copy of the data:
df1 = df.reindex(index=dates[0:4], columns=list(df.columns) + ["E"])
df1.loc[dates[0]: dates[1], "E"] = 1
df1
A | B | C | D | F | E | |
---|---|---|---|---|---|---|
2023-01-01 | 0.000000 | 0.000000 | -1.395575 | 5 | NaN | 1.0 |
2023-01-02 | -0.155978 | -0.080900 | 1.696491 | 5 | NaN | 1.0 |
2023-01-03 | 1.577167 | 0.298967 | 1.121329 | 5 | NaN | NaN |
2023-01-04 | -0.814141 | -1.501626 | 0.565572 | 5 | NaN | NaN |
df1.dropna(how="any")   # every row is dropped here, because column F is entirely NaN
A | B | C | D | F | E |
---|---|---|---|---|---|
df1.fillna(value=0)
A | B | C | D | F | E | |
---|---|---|---|---|---|---|
2023-01-01 | 0.000000 | 0.000000 | -1.395575 | 5 | 0.0 | 1.0 |
2023-01-02 | -0.155978 | -0.080900 | 1.696491 | 5 | 0.0 | 1.0 |
2023-01-03 | 1.577167 | 0.298967 | 1.121329 | 5 | 0.0 | 0.0 |
2023-01-04 | -0.814141 | -1.501626 | 0.565572 | 5 | 0.0 | 0.0 |
pd.isna(df1)
A | B | C | D | F | E | |
---|---|---|---|---|---|---|
2023-01-01 | False | False | False | False | True | False |
2023-01-02 | False | False | False | False | True | False |
2023-01-03 | False | False | False | False | True | True |
2023-01-04 | False | False | False | False | True | True |
See the Basic section on Binary Ops.
Operations in general exclude missing data.
Performing a descriptive statistic:
df.mean()
A    0.222025
B   -0.086481
C    0.415287
D    5.000000
F         NaN
dtype: float64
df.mean(1)
2023-01-01    0.901106
2023-01-02    1.614903
2023-01-03    1.999366
2023-01-04    0.812451
2023-01-05    2.509021
2023-01-06    0.978682
2023-01-07    0.898425
Freq: D, dtype: float64
Operating with objects that have different dimensionality and need alignment. In addition, pandas automatically broadcasts along the specified dimension:
s=pd.Series([1, 3, 5, np.nan, 6, 8], index=dates).shift(2)
s
2013-01-01    NaN
2013-01-02    NaN
2013-01-03    1.0
2013-01-04    3.0
2013-01-05    5.0
2013-01-06    NaN
Freq: D, dtype: float64
df.sub(s, axis="index")
A | B | C | D | F | |
---|---|---|---|---|---|
2013-01-01 | NaN | NaN | NaN | NaN | NaN |
2013-01-02 | NaN | NaN | NaN | NaN | NaN |
2013-01-03 | -1.021260 | -1.387072 | -0.905729 | 4.0 | 1.0 |
2013-01-04 | -3.471901 | -3.362371 | -4.663361 | 2.0 | 0.0 |
2013-01-05 | -5.785147 | -2.541551 | -4.939444 | 0.0 | -1.0 |
2013-01-06 | NaN | NaN | NaN | NaN | NaN |
df.apply(np.cumsum)
A | B | C | D | F | |
---|---|---|---|---|---|
2013-01-01 | 0.000000 | 0.000000 | -0.810359 | 5 | NaN |
2013-01-02 | 1.011881 | -1.467853 | 0.174271 | 10 | 1.0 |
2013-01-03 | 0.990621 | -1.854924 | 0.268541 | 15 | 3.0 |
2013-01-04 | 0.518720 | -2.217296 | -1.394820 | 20 | 6.0 |
2013-01-05 | -0.266427 | 0.241153 | -1.334264 | 25 | 10.0 |
2013-01-06 | 0.627724 | 0.323692 | -1.135561 | 30 | 15.0 |
df.apply(lambda x: x.max() - x.min())
A 1.797028 B 3.926302 C 2.647991 D 0.000000 F 4.000000 dtype: float64
s = pd.Series(np.random.randint(0, 7, size=10))
s
0    4
1    5
2    0
3    4
4    3
5    6
6    3
7    5
8    0
9    1
dtype: int32
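Histogramming such a series is typically done with value_counts(), which tallies how many times each value occurs (a sketch on the series above):

s.value_counts()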
Series is equipped with a set of string processing methods in the str attribute that make it easy to operate on each element of the array, as in the code snippet below. Note that pattern-matching in str generally uses regular expressions by default (and in some cases always uses them). See more at Vectorized String Methods.
s = pd.Series(["A", "B", "C", "Aaba", "Baca", np.nan, "CABA", "dog", "cat"])
s.str.lower()
0       a
1       b
2       c
3    aaba
4    baca
5     NaN
6    caba
7     dog
8     cat
dtype: object
pandas provides various facilities for easily combining together Series and DataFrame objects with various kinds of set logic for the indexes and relational algebra functionality in the case of join / merge-type operations.
See the Merging section.
Concatenating pandas objects together along an axis with concat():
df = pd.DataFrame(np.random.randn(10, 4))
df
0 | 1 | 2 | 3 | |
---|---|---|---|---|
0 | 1.070919 | -0.909621 | -0.058982 | 1.065355 |
1 | 0.942190 | 2.307742 | 1.313294 | 0.904494 |
2 | -0.206415 | -0.255212 | 0.415303 | -0.665433 |
3 | 0.528560 | -0.424218 | -0.674206 | 0.179578 |
4 | -1.147629 | 1.593130 | 0.083113 | -0.375661 |
5 | -0.067266 | -0.636576 | 0.025570 | 0.808047 |
6 | -0.480931 | 1.140837 | 0.408431 | 2.374829 |
7 | -0.035620 | 0.351948 | 1.159103 | -1.073382 |
8 | -0.940397 | 0.636860 | -0.273128 | 1.445847 |
9 | -2.517698 | -0.010000 | -0.223322 | 0.710669 |
pieces = [df[:3], df[3:7], df[7:]]
pd.concat(pieces)
0 | 1 | 2 | 3 | |
---|---|---|---|---|
0 | 1.070919 | -0.909621 | -0.058982 | 1.065355 |
1 | 0.942190 | 2.307742 | 1.313294 | 0.904494 |
2 | -0.206415 | -0.255212 | 0.415303 | -0.665433 |
3 | 0.528560 | -0.424218 | -0.674206 | 0.179578 |
4 | -1.147629 | 1.593130 | 0.083113 | -0.375661 |
5 | -0.067266 | -0.636576 | 0.025570 | 0.808047 |
6 | -0.480931 | 1.140837 | 0.408431 | 2.374829 |
7 | -0.035620 | 0.351948 | 1.159103 | -1.073382 |
8 | -0.940397 | 0.636860 | -0.273128 | 1.445847 |
9 | -2.517698 | -0.010000 | -0.223322 | 0.710669 |
Note
Adding a column to a DataFrame is relatively fast. However, adding a row requires a copy, and may be expensive. We recommend passing a pre-built list of records to the DataFrame constructor instead of building a DataFrame by iteratively appending records to it.
merge() enables SQL style join types along specific columns. See the Database style joining section.
left = pd.DataFrame({"key": ["foo", "foo"], "lval": [1, 2]})
right = pd.DataFrame({"key": ["foo", "foo"], "rval": [4, 5]})
pd.merge(left, right, on="key")
key | lval | rval | |
---|---|---|---|
0 | foo | 1 | 4 |
1 | foo | 1 | 5 |
2 | foo | 2 | 4 |
3 | foo | 2 | 5 |
left = pd.DataFrame({"key": ["foo", "bar"], "lval": [1, 2]})
right = pd.DataFrame({"key": ["foo", "bar"], "rval": [4, 5]})
pd.merge(left, right, on="key")
key | lval | rval | |
---|---|---|---|
0 | foo | 1 | 4 |
1 | bar | 2 | 5 |
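merge() performs an inner join by default; other SQL-style joins can be selected with the how parameter. A minimal sketch (left2/right2 are illustrative frames, not from the original):

left2 = pd.DataFrame({"key": ["foo", "bar"], "lval": [1, 2]})
right2 = pd.DataFrame({"key": ["foo", "baz"], "rval": [4, 5]})
pd.merge(left2, right2, on="key", how="left")   # "bar" has no match in right2, so its rval is NaN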
By “group by” we are referring to a process involving one or more of the following steps:
Splitting the data into groups based on some criteria
Applying a function to each group independently
Combining the results into a data structure
See the Grouping section.
df = pd.DataFrame(
{
"A": ["foo", "bar", "foo", "bar", "foo", "bar", "foo", "foo"],
"B": ["one", "one", "two", "three", "two", "two", "one", "three"],
"C": np.random.randn(8),
"D": np.random.randn(8),
}
)
df
A | B | C | D | |
---|---|---|---|---|
0 | foo | one | 0.469626 | -0.453533 |
1 | bar | one | -0.528487 | 0.102063 |
2 | foo | two | -0.165808 | 0.726964 |
3 | bar | three | 0.934137 | 2.426803 |
4 | foo | two | 0.775809 | -0.393736 |
5 | bar | two | 0.792758 | 1.703328 |
6 | foo | one | 1.664355 | -1.335949 |
7 | foo | three | 0.005753 | 0.520489 |
df.groupby("A")[["C", "D"]].sum()
C | D | |
---|---|---|
A | ||
bar | 1.198408 | 4.232194 |
foo | 2.749736 | -0.935764 |
df.groupby(["A", "B"]).sum()
C | D | ||
---|---|---|---|
A | B | ||
bar | one | -0.528487 | 0.102063 |
three | 0.934137 | 2.426803 | |
two | 0.792758 | 1.703328 | |
foo | one | 2.133982 | -1.789481 |
three | 0.005753 | 0.520489 | |
two | 0.610001 | 0.333228 |
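A single aggregation is not the only option: agg() accepts a mapping from column to aggregation function. A sketch on the same frame (the choice of mean/sum here is illustrative):

df.groupby("A").agg({"C": "mean", "D": "sum"})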
tuples = list(
zip(
["bar", "bar", "baz", "baz", "foo", "foo", "qux", "qux"],
["one", "two", "one", "two", "one", "two", "one", "two"],
)
)
index = pd.MultiIndex.from_tuples(tuples, names=["first", "second"])
df = pd.DataFrame(np.random.randn(8, 2), index=index, columns=["A", "B"])
df2 = df[:4]
df2
A | B | ||
---|---|---|---|
first | second | ||
bar | one | 0.592387 | 0.720654 |
two | -0.240999 | 0.132012 | |
baz | one | 1.181864 | 0.469986 |
two | -1.677578 | 0.122411 |
stacked = df2.stack()
stacked
first  second
bar    one     A    0.592387
               B    0.720654
       two     A   -0.240999
               B    0.132012
baz    one     A    1.181864
               B    0.469986
       two     A   -1.677578
               B    0.122411
dtype: float64
stacked.unstack()
stacked.unstack(1)
stacked.unstack(0)
first | bar | baz | |
---|---|---|---|
second | |||
one | A | 0.592387 | 1.181864 |
B | 0.720654 | 0.469986 | |
two | A | -0.240999 | -1.677578 |
B | 0.132012 | 0.122411 |
df = pd.DataFrame(
{
"A": ["one", "one", "two", "three"] * 3,
"B": ["A", "B", "C"] * 4,
"C": ["foo", "foo", "foo", "bar", "bar", "bar"] * 2,
"D": np.random.randn(12),
"E": np.random.randn(12),
}
)
df
A | B | C | D | E | |
---|---|---|---|---|---|
0 | one | A | foo | 0.571207 | 0.443537 |
1 | one | B | foo | 0.059570 | 0.549866 |
2 | two | C | foo | -0.327195 | 0.222671 |
3 | three | A | bar | -0.142887 | -0.280958 |
4 | one | B | bar | 1.484399 | 0.341254 |
5 | one | C | bar | 0.657294 | 0.111515 |
6 | two | A | foo | -1.309137 | -1.301511 |
7 | three | B | foo | -0.382706 | 0.322421 |
8 | one | C | foo | -0.217702 | 0.702098 |
9 | one | A | bar | -0.479563 | -1.872191 |
10 | two | B | bar | 1.102865 | -0.331685 |
11 | three | C | bar | -0.626540 | -0.590578 |
pd.pivot_table(df, values="D", index=["A", "B"], columns=["C"])
C | bar | foo | |
---|---|---|---|
A | B | ||
one | A | -0.479563 | 0.571207 |
B | 1.484399 | 0.059570 | |
C | 0.657294 | -0.217702 | |
three | A | -0.142887 | NaN |
B | NaN | -0.382706 | |
C | -0.626540 | NaN | |
two | A | NaN | -1.309137 |
B | 1.102865 | NaN | |
C | NaN | -0.327195 |
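pivot_table() aggregates duplicate entries with mean by default; a different reducer can be supplied through aggfunc. A sketch (the choice of sum is illustrative):

pd.pivot_table(df, values="D", index=["A", "B"], columns=["C"], aggfunc="sum")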
pandas has simple, powerful, and efficient functionality for performing resampling operations during frequency conversion (e.g., converting secondly data into 5-minutely data). This is extremely common in, but not limited to, financial applications. See the Time Series section.
rng = pd.date_range("1/1/2012", periods=100, freq="S")
ts = pd.Series(np.random.randint(0, 500, len(rng)), index=rng)
ts.resample("5Min").sum()
2012-01-01    23930
Freq: 5T, dtype: int32
Series.tz_localize() localizes a time series to a time zone:
rng = pd.date_range("3/6/2012 00:00", periods=5, freq="D")
ts = pd.Series(np.random.randn(len(rng)), rng)
ts
2012-03-06   -1.069089
2012-03-07   -0.231289
2012-03-08   -0.710142
2012-03-09   -0.830763
2012-03-10    1.825935
Freq: D, dtype: float64
ts_utc = ts.tz_localize("UTC")
ts_utc
2012-03-06 00:00:00+00:00   -1.069089
2012-03-07 00:00:00+00:00   -0.231289
2012-03-08 00:00:00+00:00   -0.710142
2012-03-09 00:00:00+00:00   -0.830763
2012-03-10 00:00:00+00:00    1.825935
Freq: D, dtype: float64
Series.tz_convert() converts a timezone-aware time series to another time zone:
ts_utc.tz_convert("US/Eastern")
2012-03-05 19:00:00-05:00   -1.069089
2012-03-06 19:00:00-05:00   -0.231289
2012-03-07 19:00:00-05:00   -0.710142
2012-03-08 19:00:00-05:00   -0.830763
2012-03-09 19:00:00-05:00    1.825935
Freq: D, dtype: float64
Converting between time span representations:
rng = pd.date_range("1/1/2012", periods=5, freq="M")
ts = pd.Series(np.random.randn(len(rng)), index=rng)
ts
2012-01-31    0.374555
2012-02-29    0.033252
2012-03-31    0.377800
2012-04-30    0.121815
2012-05-31   -0.552315
Freq: M, dtype: float64
ps = ts.to_period()
ps
2012-01    0.374555
2012-02    0.033252
2012-03    0.377800
2012-04    0.121815
2012-05   -0.552315
Freq: M, dtype: float64
ps.to_timestamp()
2012-01-01    0.374555
2012-02-01    0.033252
2012-03-01    0.377800
2012-04-01    0.121815
2012-05-01   -0.552315
Freq: MS, dtype: float64
pandas can include categorical data in a DataFrame. For full docs, see the categorical introduction and the API documentation.
df = pd.DataFrame(
{"id": [1, 2, 3, 4, 5, 6], "raw_grade": ["a", "b", "b", "a", "a", "e"]}
)
df
id | raw_grade | |
---|---|---|
0 | 1 | a |
1 | 2 | b |
2 | 3 | b |
3 | 4 | a |
4 | 5 | a |
5 | 6 | e |
Convert the raw grades to a categorical data type:
df["grade"] = df["raw_grade"].astype("category")
df["grade"]
0    a
1    b
2    b
3    a
4    a
5    e
Name: grade, dtype: category
Categories (3, object): ['a', 'b', 'e']
Rename the categories to more meaningful names:
new_categories = ["very good", "good", "very bad"]
df["grade"] = df["grade"].cat.rename_categories(new_categories)
Reorder the categories and simultaneously add the missing categories (methods under Series.cat() return a new Series by default):
df["grade"] = df["grade"].cat.set_categories(
["very bad", "bad", "medium", "good", "very good"]
)
df["grade"]
0    very good
1         good
2         good
3    very good
4    very good
5     very bad
Name: grade, dtype: category
Categories (5, object): ['very bad', 'bad', 'medium', 'good', 'very good']
df.sort_values(by="grade")
id | raw_grade | grade | |
---|---|---|---|
5 | 6 | e | very bad |
1 | 2 | b | good |
2 | 3 | b | good |
0 | 1 | a | very good |
3 | 4 | a | very good |
4 | 5 | a | very good |
df.groupby("grade").size()
grade
very bad     1
bad          0
medium       0
good         2
very good    3
dtype: int64
import matplotlib.pyplot as plt
plt.close("all")
ts = pd.Series(np.random.randn(1000),
index=pd.date_range("1/1/2000", periods=1000))
ts = ts.cumsum()
ts.plot()
#plt.show()
<AxesSubplot:>
df = pd.DataFrame(
np.random.randn(1000, 4), index=ts.index, columns=["A", "B", "C", "D"]
)
df = df.cumsum()
plt.figure()
df.plot()
plt.legend(loc='best')
<matplotlib.legend.Legend at 0x20ce9501870>
<Figure size 432x288 with 0 Axes>
df.to_csv("foo.csv")
Reading from a CSV file using read_csv():
pd.read_csv("foo.csv")
df.to_excel("foo.xlsx", sheet_name="Sheet1")
pd.read_excel("foo.xlsx", "Sheet1", index_col=None, na_values=["NA"])