Python数据分析大杀器之Pandas基础2万字详解（学pandas基础，这一篇就够啦）

文档中心

Python数据分析

🌸个人主页：JoJo的数据分析历险记
📝个人介绍：小编大四统计在读，目前保研到统计学top3高校继续攻读统计研究生
💌如果文章对你有帮助，欢迎关注、点赞、收藏、订阅专栏

本专栏主要介绍python数据分析领域的应用
参考资料:
利用python数据分析

文章目录

Python数据分析
💮1.Series 对象
- 🏵️1.1 Series基本操作
🌹2.DataFrame对象
🥀3.pandas基本数据运算
- 🌺3.1 算术运算
- 🌻3.2 基本算术运算符
- 🌼3.3 函数映射
🌷4.统计函数
- 🌱4.1 相关性和协方差
- - 🌲4.1.1 Series对象
  - 🌳4.1.2DataFrame对象
  - 🌴4.1.3DataFrame和Series相关性
- 🌵4.2排序和秩
🌾5.Pandas缺失值处理
- 🌿5.1 创建NaN数据
- ☘️5.2 删除NaN
- 🍀5.3 为NaN元素填充其他值
🍁6. 层级索引和分层统计
- 🍂6.1 unstack()函数和stack()函数
- 🍃6.2调整层级顺序
- 🌍6.3按层级统计数据
🌎7.数据导入
🌏8.数据处理
- 🌐8.1 连接
- - 🎇8.1.1 内连接
  - 🎉8.1.2 外连接
  - 🎊8.1.3 以索引作为键进行连接
- 🎄8.2拼接
- 🎋8.3组合
🏆文章推荐

我们介绍了Numpy在数据处理方面的应用，本文介绍一下pandas在数据处理方面的应用，pandas可以是基于numpy构建的，但是可以让数据处理变得更便捷

导入相关库

import numpy as npimport pandas as pd

💮1.Series 对象

pandas主要有两个数据对象，一个是Series，类似于一个向量的形式，另一个是DataFrame数据框形式。我们先来看一下如何创建一个Series数据对象。

s = pd.Series([12,-4,7,9])s

0    121    -42     73     9dtype: int64

🏵️1.1 Series基本操作

#选择内部元素s[2]

#为元素赋值s[2]=5ss['a'] = 4s

0    121    -42     53     9a     4dtype: int64

#用其它对象定义新的series对象arr = np.array([1,2,3,4])#此时s2只是原来的一个动态视图，会随数组的改变而改变，例如我们改变原来数组中的第二个元素值s2 = pd.Series(arr)s2arr[1] = 9s2

0    11    92    33    4dtype: int32

#筛选元素s[s>8]

0    123     9dtype: int64

#Series对象的组成元素serd = pd.Series([1,0,2,1,2,3], index=['white','white','blue','green','green','yellow'])serd

white     1white     0blue      2green     1green     2yellow    3dtype: int64

#unique去重 返回一个数组serd.unique()

array([1, 0, 2, 3], dtype=int64)

#value_counts 去重 返回出现次数serd.value_counts()

2    21    23    10    1dtype: int64

#isin 函数，返回布尔值serd.isin([0,3])

white     Falsewhite      Trueblue      Falsegreen     Falsegreen     Falseyellow     Truedtype: bool

serd[serd.isin([0,3])]

white     0yellow    3dtype: int64

#NaNs2 = pd.Series([-5,3,np.NaN,14])s2

0    -5.01     3.02     NaN3    14.0dtype: float64

# 用isnull 和 notnull 来进行判断s2.isnull()s2.notnull()

0     True1     True2    False3     Truedtype: bool

s2

0    -5.01     3.02     NaN3    14.0dtype: float64

#用作字典mydict = {'red':2000,'blue':1000,'yellow':500,'orange':1000}myseries = pd.Series(mydict)myseries

red2000blue      1000yellow     500orange    1000dtype: int64

当出现缺失值时，会直接用NaN替代

colors = ['red','blue','yellow','orange','green']myseries = pd.Series(mydict, index = colors)myseries

red2000.0blue      1000.0yellow     500.0orange    1000.0green NaNdtype: float64

进行运算时有NaN为NaN

mydict2 ={'red':400,'yellow':1000,"black":700}myseries2 = pd.Series(mydict2)myseries.fillna(0) + myseries2.fillna(0)

black NaNblue  NaNgreen NaNorangeNaNred2400.0yellow    1500.0dtype: float64

🌹2.DataFrame对象

DataFrame对象是我们在进行数据分析时最常见的数据格式，相当于一个矩阵数据，由不同行不同列组成，通常每一列代表一个变量，每一行代表一个观察数据。我们先来看一下DataFrame的一些基础应用。

创建DataFrame对象

#DataFrame对象data = {'color':['blue','green','yellow','red','white'], 'object':['ball','pen','pencil','paper','mug'], 'price':[1.2,1.0,0.6,0.9,1.7]}frame = pd.DataFrame(data)frame

	color	object	price
0	blue	ball	1.2
1	green	pen	1.0
2	yellow	pencil	0.6
3	red	paper	0.9
4	white	mug	1.7

frame2 = pd.DataFrame(data, columns=['object','price'])frame2

	object	price
0	ball	1.2
1	pen	1.0
2	pencil	0.6
3	paper	0.9
4	mug	1.7

frame3 = pd.DataFrame(data,index=['one','two','three','four','five'])frame3

	color	object	price
one	blue	ball	1.2
two	green	pen	1.0
three	yellow	pencil	0.6
four	red	paper	0.9
five	white	mug	1.7

#选取元素#获得所有列的名称frame.columns

Index(['color', 'object', 'price'], dtype='object')

#获得所有行的名称frame.index

RangeIndex(start=0, stop=5, step=1)

#获得所有值frame.values

array([['blue', 'ball', 1.2],['green', 'pen', 1.0],['yellow', 'pencil', 0.6],['red', 'paper', 0.9],['white', 'mug', 1.7]], dtype=object)

#获得某一列的值frame['price']

0    1.21    1.02    0.63    0.94    1.7Name: price, dtype: float64

#获得行的值 用ix属性和行的索引项frame.iloc[2]

color     yellowobject    pencilprice 0.6Name: 2, dtype: object

#指定多个索引值能选取多行frame.iloc[[2,4]]

	color	object	price
2	yellow	pencil	0.6
4	white	mug	1.7

#可以用frame[0:1]或者frame[0:2]选择行  但切记frame[0]没有数frame[0:4]

对DataFrame进行行选择时，使用索引frame[0:1]返回第一行数据，[1:2]返回第二行数据

	color	object	price
0	blue	ball	1.2
1	green	pen	1.0
2	yellow	pencil	0.6
3	red	paper	0.9

#如果要获取其中的一个元素，必须依次指定元素所在的列名称、行的索引值或标签frame['object'][3]

'paper'

#赋值frame['new']=12 #直接添加某一列frame

	color	object	price	new
0	blue	ball	1.2	12
1	green	pen	1.0	12
2	yellow	pencil	0.6	12
3	red	paper	0.9	12
4	white	mug	1.7	12

frame['new']=[1,2,3,4,5]frame

	color	object	price	new
0	blue	ball	1.2	1
1	green	pen	1.0	2
2	yellow	pencil	0.6	3
3	red	paper	0.9	4
4	white	mug	1.7	5

#修改单个元素的方法frame['price'][2]=3.3frame

	color	object	price	new
0	blue	ball	1.2	1
1	green	pen	1.0	2
2	yellow	pencil	3.3	3
3	red	paper	0.9	4
4	white	mug	1.7	5

# 删除一整列的所有数据，用delframe['new'] = 12framedel frame['new']frame

	color	object	price
0	blue	ball	1.2
1	green	pen	1.0
2	yellow	pencil	3.3
3	red	paper	0.9
4	white	mug	1.7

#筛选元素frame3 = pd.DataFrame(np.arange(16).reshape((4,4)),index = ['red','white','blue','green'], columns=['ball','pen','pencil','paper'])frame3frame3[frame3>12]

	ball	pen	pencil	paper
red	NaN	NaN	NaN	NaN
white	NaN	NaN	NaN	NaN
blue	NaN	NaN	NaN	NaN
green	NaN	13.0	14.0	15.0

#用嵌套字典生成DataFrame对象 当出现缺失值时用NaN替代nestdict = {'red':{2012:22, 2013:33},'white':{2011: 13,2012:22,2013:16},'blue':{2011:17,2012:27,2013:48}}nestdict

{'red': {2012: 22, 2013: 33}, 'white': {2011: 13, 2012: 22, 2013: 16}, 'blue': {2011: 17, 2012: 27, 2013: 48}}

frame2 = pd.DataFrame(nestdict)frame2

	red	white	blue
2011	NaN	13	17
2012	22.0	22	27
2013	33.0	16	48

进行转置

frame2.T

	2011	2012	2013
red	NaN	22.0	33.0
white	13.0	22.0	16.0
blue	17.0	27.0	48.0

#index对象ser = pd.Series([5,0,3,8,4], index=['red','blue','yellow','white','green'])ser.index

Index(['red', 'blue', 'yellow', 'white', 'green'], dtype='object')

ser.idxmax()

'white'

ser.idxmin()

'blue'

#含重复标签的Indexserd = pd.Series(range(6), index=['white','white','blue','green','green','yellow'])serd

white     0white     1blue      2green     3green     4yellow    5dtype: int64

#当一个标签对应多个元素时，返回一个Series对象 而不是单个元素serd['white']

white    0white    1dtype: int64

#判断是否由重复值， is_unique

#索引对象的其他功能ser = pd.Series([2,5,7,4],index = ['one','two','three','four'])ser

one      2two      5three    7four     4dtype: int64

#reindex()函数可以更换series对象的索引，生成一个新的series对象ser.reindex(['three','one','five','two'])

three    7.0one      2.0five     NaNtwo      5.0dtype: float64

ser3 = pd.Series([1,5,6,3],index=[0,3,5,6])ser3

0    13    55    66    3dtype: int64

#自动插补#reindex()函数，method：ffill 表示插补的数为前面的值，bfill表示插补的数为后面的值ser3.reindex(range(6),method='ffill')

0    11    12    13    54    55    6dtype: int64

ser3.reindex(range(8),method='bfill')

0    1.01    5.02    5.03    5.04    6.05    6.06    3.07    NaNdtype: float64

frame.reindex(range(5), method='ffill',columns=['colors','price','new','object'])

	colors	price	new	object
0	blue	1.2	blue	ball
1	green	1.0	green	pen
2	yellow	3.3	yellow	pencil
3	red	0.9	red	paper
4	white	1.7	white	mug

ser = pd.Series(np.arange(4.),index=['red','blue','yellow','white'])ser

red0.0blue      1.0yellow    2.0white     3.0dtype: float64

ser.drop('yellow')

red      0.0blue     1.0white    3.0dtype: float64

ser.drop(['blue','white'])

red0.0yellow    2.0dtype: float64

frame = pd.DataFrame(np.arange(16).reshape((4,4)),      index=['red','blue','yellow','white'],      columns=['ball','pen','pencil','paper'])frame

	ball	pen	pencil	paper
red	0	1	2	3
blue	4	5	6	7
yellow	8	9	10	11
white	12	13	14	15

#删除时默认是行 axis指定轴，1为列frame.drop(['pen'],axis=1)

	ball	pencil	paper
red	0	2	3
blue	4	6	7
yellow	8	10	11
white	12	14	15

🥀3.pandas基本数据运算

🌺3.1 算术运算

当有两个series或DataFrame对象时，如果一个标签，两个对象都有，则把他们的值相加
当一个标签只有一个对象有时，则为NaN

s1 = pd.Series([3,2,5,1],index=['white','yellow','green','blue'])s1

white     3yellow    2green     5blue      1dtype: int64

s2 = pd.Series([1,4,7,2,1],['white','yellow','black','blue','brown'])s1 + s2

black     NaNblue      3.0brown     NaNgreen     NaNwhite     4.0yellow    6.0dtype: float64

# DateFrame对象也一样frame1 = pd.DataFrame(np.arange(16).reshape((4,4)),columns=['ball','pen','pencil','paper'], index = ['red','blue','yellow','white'])frame1

	ball	pen	pencil	paper
red	0	1	2	3
blue	4	5	6	7
yellow	8	9	10	11
white	12	13	14	15

frame2 = pd.DataFrame(np.arange(12).reshape((4,3)),index = ['blue','yellow','green','white'],columns=['ball','pen','mug'])frame2

	ball	pen	mug
blue	0	1	2
yellow	3	4	5
green	6	7	8
white	9	10	11

frame3 = frame1+frame2frame3

	ball	mug	paper	pen	pencil
blue	4.0	NaN	NaN	6.0	NaN
green	NaN	NaN	NaN	NaN	NaN
red	NaN	NaN	NaN	NaN	NaN
white	21.0	NaN	NaN	23.0	NaN
yellow	11.0	NaN	NaN	13.0	NaN

🌻3.2 基本算术运算符

主要的算术运算符如下

add() frame1.add(frame2) = frame1+frame2
sub()
div()
mul()

下面通过一些案例来说明

frame = pd.DataFrame(np.arange(16).reshape((4,4)),columns=['ball','pen','pencil','paper'], index = ['red','blue','yellow','white'])frame

	ball	pen	pencil	paper
red	0	1	2	3
blue	4	5	6	7
yellow	8	9	10	11
white	12	13	14	15

ser = pd.Series(np.arange(4),['ball','pen','pencil','paper'])ser #与frame 的列名称保持一致，行不可以

ball      0pen1pencil    2paper     3dtype: int32

frame-ser

	ball	pen	pencil	paper
red	0	0	0	0
blue	4	4	4	4
yellow	8	8	8	8
white	12	12	12	12

当索引项只存在于其中一个数据结构时，那么运算结果会为其产生一个新的索引项，但其值为NaN

具体案例如下，我们给ser增加一列mug

ser['mug'] = 9ser

ball      0pen1pencil    2paper     3mug9dtype: int64

frame - ser

	ball	mug	paper	pen	pencil
red	0	NaN	0	0	0
blue	4	NaN	4	4	4
yellow	8	NaN	8	8	8
white	12	NaN	12	12	12

🌼3.3 函数映射

在dataframe和series数据对象中，可以使用函数对所有元素进行操作

frame

	ball	pen	pencil	paper
red	0	1	2	3
blue	4	5	6	7
yellow	8	9	10	11
white	12	13	14	15

# 求所有元素的平方根np.sqrt(frame)

	ball	pen	pencil	paper
red	0.000000	1.000000	1.414214	1.732051
blue	2.000000	2.236068	2.449490	2.645751
yellow	2.828427	3.000000	3.162278	3.316625
white	3.464102	3.605551	3.741657	3.872983

#定义函数#法一：f = lambda x:x.max()-x.min()#返回数组取值范围#法二：def f(x):    return x.max()-x.min()

# apply函数可以调用定义的函数frame.apply(f)

ball      12pen12pencil    12paper     12dtype: int64

def f(x):    return pd.Series([x.min(),x.max()],index = ['min','max'])

frame.apply(f,axis = 1)# 默认axis=0

	min	max
red	0	3
blue	4	7
yellow	8	11
white	12	15

🌷4.统计函数

数组大多数统计函数对DataFrame对象有用，故可以直接使用

frame.sum()

ball      24pen28pencil    32paper     36dtype: int64

frame.mean()

ball      6.0pen7.0pencil    8.0paper     9.0dtype: float64

# describe()函数可以计算多个统计量frame.describe()

	ball	pen	pencil	paper
count	4.000000	4.000000	4.000000	4.000000
mean	6.000000	7.000000	8.000000	9.000000
std	5.163978	5.163978	5.163978	5.163978
min	0.000000	1.000000	2.000000	3.000000
25%	3.000000	4.000000	5.000000	6.000000
50%	6.000000	7.000000	8.000000	9.000000
75%	9.000000	10.000000	11.000000	12.000000
max	12.000000	13.000000	14.000000	15.000000

```python ser.rank(method='first') ```

red4.0blue      1.0yellow    2.0white     5.0green     3.0dtype: float64

🌱4.1 相关性和协方差

🌲4.1.1 Series对象

通常涉及两个数据对象
函数分别corr()和cov()

seq = pd.Series([3,4,3,4,5,4,3,2],['2006','2007','2008','2009','2010','2011','2012','2013'])seq

2006    32007    42008    32009    42010    52011    42012    32013    2dtype: int64

seq2 = pd.Series([1,2,3,4,4,3,2,1],['2006','2007','2008','2009','2010','2011','2012','2013'])seq2

2006    12007    22008    32009    42010    42011    32012    22013    1dtype: int64

seq.corr(seq2)

0.7745966692414834

seq.cov(seq2)

0.8571428571428571

🌳4.1.2DataFrame对象

DataFrame对象计算相关性和协方差依然返回一个dataframe对象

frame2 = pd.DataFrame([[1,4,3,6],[4,5,6,1],[3,3,1,5],[4,1,6,4]],index=['red','blue','yellow','white'],columns=['ball','pen','pencil','paper'])frame2

	ball	pen	pencil	paper
red	1	4	3	6
blue	4	5	6	1
yellow	3	3	1	5
white	4	1	6	4

frame2.corr()

	ball	pen	pencil	paper
ball	1.000000	-0.276026	0.577350	-0.763763
pen	-0.276026	1.000000	-0.079682	-0.361403
pencil	0.577350	-0.079682	1.000000	-0.692935
paper	-0.763763	-0.361403	-0.692935	1.000000

frame2.cov()

	ball	pen	pencil	paper
ball	2.000000	-0.666667	2.000000	-2.333333
pen	-0.666667	2.916667	-0.333333	-1.333333
pencil	2.000000	-0.333333	6.000000	-3.666667
paper	-2.333333	-1.333333	-3.666667	4.666667

🌴4.1.3DataFrame和Series相关性

corrwith()可以计算DataFrame对象的列或行与Series对象或者其他DataFrame对象元素两两之间的相关性

ser

red5blue      0yellow    3white     8green     4dtype: int64

frame2.corrwith(ser)

ball     -0.140028pen      -0.869657pencil    0.080845paper     0.595854dtype: float64

frame2.corrwith(frame)

ball      0.730297pen      -0.831522pencil    0.210819paper    -0.119523dtype: float64

🌵4.2排序和秩

Series用sort_values()和rank()，默认是升序，使用ascending=False改变为升序，下同
DataFrame用sort_index(by=‘’)和rank()

对ser排序

ser.sort_values()

blue      0yellow    3green     4red5white     8dtype: int64

对ser求秩

ser.rank()

red4.0blue      1.0yellow    2.0white     5.0green     3.0dtype: float64

安装pen对frame进行排序

frame.sort_values(by='pen')

	ball	pen	pencil	paper
red	0	1	2	3
blue	4	5	6	7
yellow	8	9	10	11
white	12	13	14	15

ser = pd.Series([5,0,3,8,4],index=['red','blue','yellow','white','green'])ser

red5blue      0yellow    3white     8green     4dtype: int64

ser.sort_index()#按字母表顺序升序排列

blue      0green     4red5white     8yellow    3dtype: int64

ser.sort_index(ascending=False)# 改为降序

yellow    3white     8red5green     4blue      0dtype: int64

ser.sort_values()#根据值排序

blue      0yellow    3green     4red5white     8dtype: int64

frame

	ball	pen	pencil	paper
red	0	1	2	3
blue	4	5	6	7
yellow	8	9	10	11
white	12	13	14	15

frame.sort_index()

	ball	pen	pencil	paper
blue	4	5	6	7
red	0	1	2	3
white	12	13	14	15
yellow	8	9	10	11

axis代表轴，1表示纵轴，0表示横轴

frame.sort_index(axis=1)

	ball	paper	pen	pencil
red	0	3	1	2
blue	4	7	5	6
yellow	8	11	9	10
white	12	15	13	14

🌾5.Pandas缺失值处理

🌿5.1 创建NaN数据

为数据赋NaN值用np.nan

ser = pd.Series([0,1,2,np.NaN,9], index=['red','blue','yellow','white','green'])ser

red0.0blue      1.0yellow    2.0white     NaNgreen     9.0dtype: float64

ser['white']

nan

☘️5.2 删除NaN

dropna()
ser[ser.notnull()]
DataFrame中去除时为避免删除整行或整列，用how='all’来表示只删除所有元素均为NAN的行或列，如果使用how=‘any’，则只要这一列有缺失值就删除整列

frame3.dropna(how='all')

	ball	mug	paper	pen	pencil
blue	4.0	NaN	NaN	6.0	NaN
white	21.0	NaN	NaN	23.0	NaN
yellow	11.0	NaN	NaN	13.0	NaN

🍀5.3 为NaN元素填充其他值

fillna()函数·

frame3.fillna(0)

	ball	pen
blue	4.0	6.0
green	0.0	0.0
red	0.0	0.0
white	21.0	23.0
yellow	11.0	13.0

#若要将不同列的NaN换成不同元素，依次指定列名称及要替换成的元素即可

frame3.fillna({'ball':1,"pen":99})

	ball	mug	paper	pen	pencil
blue	4.0	NaN	NaN	6.0	NaN
green	1.0	NaN	NaN	99.0	NaN
red	1.0	NaN	NaN	99.0	NaN
white	21.0	NaN	NaN	23.0	NaN
yellow	11.0	NaN	NaN	13.0	NaN

🍁6. 层级索引和分层统计

有时候我们需要对数据进行分层级的索引，具体看下面这个例子

mser = pd.Series(np.random.rand(8),index=[['white','white','white','blue','blue','red','red','red'],      ['up','down','right','up','down','up','down','left']])mser

white  up0.323513down     0.080292right    0.503630blue   up0.201143down     0.173879red    up0.866267down     0.601906left     0.140885dtype: float64

mser.index

MultiIndex(levels=[['blue', 'red', 'white'], ['down', 'left', 'right', 'up']],    codes=[[2, 2, 2, 0, 0, 1, 1, 1], [3, 0, 2, 3, 0, 3, 0, 1]])

mser['white']

up0.323513down     0.080292right    0.503630dtype: float64

mser[:,'up']

white    0.323513blue     0.201143red      0.866267dtype: float64

mser[:,'right']

white    0.50363dtype: float64

mser['white','up']#可以得到某一特定元素的

0.32351250980575463

🍂6.1 unstack()函数和stack()函数

unstack把等级索引Series对象转换为一个简单的DataFrame对象，把第二列索引转换为相应的列，stack则相反，具体如下

mser.unstack() #将series转换为dataframemser.unstack().fillna(0)

	down	left	right	up
blue	0.173879	0.000000	0.00000	0.201143
red	0.601906	0.140885	0.00000	0.866267
white	0.080292	0.000000	0.50363	0.323513

frame

	ball	pen	pencil	paper
red	0	1	2	3
blue	4	5	6	7
yellow	8	9	10	11
white	12	13	14	15

frame.stack()#将dataframe转换为series对象

red     ball0 pen 1 pencil     2 paper      3blue    ball4 pen 5 pencil     6 paper      7yellow  ball8 pen 9 pencil    10 paper     11white   ball      12 pen13 pencil    14 paper     15dtype: int32

dataframe对象的行与列也可以定义分层级索引

mframe = pd.DataFrame(np.arange(16).reshape((4,4)),index = [['white','white','red','red'],['up','down','up','down']],columns=[['pen','pen','paper','paper'],[1,2,1,2]])mframe

		pen		paper
		1	2	1	2
white	up	0	1	2	3
white	down	4	5	6	7
red	up	8	9	10	11
red	down	12	13	14	15

🍃6.2调整层级顺序

swaplevel()函数以要互换位置的两个层级的名称为参数，返回交换位置后的一个新对象，其中的个元素的顺序保持不变

mframe.columns.names = ['object','id']mframe.index.names = ['colors','status']mframe

	object	pen		paper
	id	1	2	1	2
colors	status
white	up	0	1	2	3
white	down	4	5	6	7
red	up	8	9	10	11
red	down	12	13	14	15

mframe.swaplevel('colors','status')

	object	pen		paper
	id	1	2	1	2
status	colors
up	white	0	1	2	3
down	white	4	5	6	7
up	red	8	9	10	11
down	red	12	13	14	15

🌍6.3按层级统计数据

直接把层级的名称赋给level选项

mframe.sum(level='colors')

object	pen		paper
id	1	2	1	2
colors
white	4	6	8	10
red	20	22	24	26

若想对某一层级的列进行统计，则需要把axis的值设置为1

mframe.sum(level='id', axis=1)

	id	1	2
colors	status
white	up	2	4
white	down	10	12
red	up	18	20
red	down	26	28

🌎7.数据导入

很多时候，我们要分析的数据来自电脑上保存的数据文件，本文介绍一下如何导入我们最常用的csv文件，后续我还会介绍如何导入json数据、以及连接SQL数据库等其他的方式来导入数据

import pandas as pd # 加载模块df = pd.read_csv('student.csv')df

Student ID name   age  gender11 1111    Dw    3  Female12 1112     Q   23    Male13 1113     W   21  Female

	id	color	brand_x	sid	brand_y
0	ball	white	OMG	ball	ABC
1	pencil	red	ABC	pencil	OMG
2	pencil	red	ABC	pencil	POD
3	pen	red	ABC	pen	POD

🌏8.数据处理

表连接
拼接
组合

🌐8.1 连接

使用merge()函数类似sql中的多表连接

🎇8.1.1 内连接

import numpy as npimport pandas as pd

frame1 = pd.DataFrame({'id':['ball','pencil','pen','mug','ashtray'], 'price':[12.33,11.44,33.21,13.23,33.62]})frame1

	id	price
0	ball	12.33
1	pencil	11.44
2	pen	33.21
3	mug	13.23
4	ashtray	33.62

frame2 = pd.DataFrame({'id':['pencil','pencil','ball','pen'], 'color':['white','red','red','black']})frame2

	id	color
0	pencil	white
1	pencil	red
2	ball	red
3	pen	black

pd.merge(frame1,frame2)

	id	price	color
0	ball	12.33	red
1	pencil	11.44	white
2	pencil	11.44	red
3	pen	33.21	black

上述返回的DataFrame对象由原来的两个DataFrame对象中ID相同的行组成并且没有指定基于哪一列进行合并，实际应用中通常要指定连接条件，用on来zhid

frame1 = pd.DataFrame({'id':['ball','pencil','pen','mug','ashtray'], 'color':['white','red','red','black','green'], 'brand':['OMG','ABC','ABC','POD','POD']})frame1

	id	color	brand
0	ball	white	OMG
1	pencil	red	ABC
2	pen	red	ABC
3	mug	black	POD
4	ashtray	green	POD

frame2 = pd.DataFrame({'id':['pencil','pencil','ball','pen'], 'brand':['OMG','POD','ABC','POD']})frame2

	id	brand
0	pencil	OMG
1	pencil	POD
2	ball	ABC
3	pen	POD

pd.merge(frame1,frame2,on='id') # 以id进行合并

	id	color	brand_x	brand_y
0	ball	white	OMG	ABC
1	pencil	red	ABC	OMG
2	pencil	red	ABC	POD
3	pen	red	ABC	POD

pd.merge(frame1,frame2,on='brand') # 以brand进行合并

	id_x	color	brand	id_y
0	ball	white	OMG	pencil
1	pencil	red	ABC	ball
2	pen	red	ABC	ball
3	mug	black	POD	pencil
4	mug	black	POD	pen
5	ashtray	green	POD	pencil
6	ashtray	green	POD	pen

当出现两个列的名称不一致的时候，使用left_on 和 right_on，例如，下面两个表，一个是id,一个是sid,我们相当于是用第一个表的id和第二个表的sid连接

frame2.columns = ['sid','brand']frame2

	sid	brand
0	pencil	OMG
1	pencil	POD
2	ball	ABC
3	pen	POD

pd.merge(frame1,frame2,left_on = 'id',right_on ='sid')

	id	color	brand_x	sid	brand_y
0	ball	white	OMG	ball	ABC
1	pencil	red	ABC	pencil	OMG
2	pencil	red	ABC	pencil	POD
3	pen	red	ABC	pen	POD

merge()函数默认的是内连接，上述结果中的键是由交叉操作出来的

🎉8.1.2 外连接

连接类型用how选项指定
左连接共有的加上左边的
右连接共有的加上右边的
外连接把所有的键整合到一起

frame2.columns=['id','brand']

pd.merge(frame1,frame2,how='outer')

	id	color	brand
0	ball	white	OMG
1	pencil	red	ABC
2	pen	red	ABC
3	mug	black	POD
4	ashtray	green	POD
5	pencil	NaN	OMG
6	pencil	NaN	POD
7	ball	NaN	ABC
8	pen	NaN	POD

pd.merge(frame1,frame2,how='left')

	id	color	brand
0	ball	white	OMG
1	pencil	red	ABC
2	pen	red	ABC
3	mug	black	POD
4	ashtray	green	POD

pd.merge(frame1,frame2,how='right')

	id	color	brand
0	pencil	NaN	OMG
1	pencil	NaN	POD
2	ball	NaN	ABC
3	pen	NaN	POD

要合并多个键，则把多个键给on选项

pd.merge(frame1,frame2,on=['id','brand'],how='outer')

	id	color	brand
0	ball	white	OMG
1	pencil	red	ABC
2	pen	red	ABC
3	mug	black	POD
4	ashtray	green	POD
5	pencil	NaN	OMG
6	pencil	NaN	POD
7	ball	NaN	ABC
8	pen	NaN	POD

🎊8.1.3 以索引作为键进行连接

#方法一pd.merge(frame1,frame2,left_index=True,right_index=True)

	id_x	color	brand_x	id_y	brand_y
0	ball	white	OMG	pencil	OMG
1	pencil	red	ABC	pencil	POD
2	pen	red	ABC	ball	ABC
3	mug	black	POD	pen	POD

#方法二 用join函数 默认左连接frame2.columns = ['id2','brand2']frame1.join(frame2)

	id	color	brand	id2	brand2
0	ball	white	OMG	pencil	OMG
1	pencil	red	ABC	pencil	POD
2	pen	red	ABC	ball	ABC
3	mug	black	POD	pen	POD
4	ashtray	green	POD	NaN	NaN

🎄8.2拼接

numpy中的concatenation()函数可以用来进行拼接操作
pandas的concat()函数实现了按轴拼接的功能()

arr1 = np.arange(9).reshape(3,3)arr1

array([[0, 1, 2],[3, 4, 5],[6, 7, 8]])

arr2 = np.arange(6,15).reshape(3,3)arr2

array([[ 6,  7,  8],[ 9, 10, 11],[12, 13, 14]])

np.concatenate([arr1,arr2],axis=1)#默认是axis=0

array([[ 0,  1,  2,  6,  7,  8],[ 3,  4,  5,  9, 10, 11],[ 6,  7,  8, 12, 13, 14]])

ser1 = pd.Series(np.random.rand(4), index = [1,2,3,4])ser1

1    0.1801912    0.0616493    0.2363784    0.105309dtype: float64

ser2 = pd.Series(np.random.rand(4), index = [5,6,7,8])ser2

5    0.9352776    0.5161467    0.2104618    0.912048dtype: float64

pd.concat([ser1,ser2])

1    0.1801912    0.0616493    0.2363784    0.1053095    0.9352776    0.5161467    0.2104618    0.912048dtype: float64

pd.concat([ser1,ser2],axis = 1)

	0	1
1	0.180191	NaN
2	0.061649	NaN
3	0.236378	NaN
4	0.105309	NaN
5	NaN	0.935277
6	NaN	0.516146
7	NaN	0.210461
8	NaN	0.912048

默认是外连接

pd.concat([ser1,ser2],axis=1,join='inner')

	0	1

如果想要创建等级索引，需要用keys选项来完成

pd.concat([ser1,ser2],keys=[1,2])

1  1    0.180191   2    0.061649   3    0.236378   4    0.1053092  5    0.935277   6    0.516146   7    0.210461   8    0.912048dtype: float64

pd.concat([ser1,ser2],axis=1,keys=[1,2])

	1	2
1	0.180191	NaN
2	0.061649	NaN
3	0.236378	NaN
4	0.105309	NaN
5	NaN	0.935277
6	NaN	0.516146
7	NaN	0.210461
8	NaN	0.912048

🎋8.3组合

当无法通过合并或者拼接方法组合数据用组合函数
combine_first()函数可以用来组合Series对象，同时对齐数据

ser1 = pd.Series(np.random.rand(5), index=[1,2,3,4,5])ser1

1    0.7082792    0.2330483    0.0309914    0.2612915    0.379752dtype: float64

ser2 = pd.Series(np.random.rand(4), index = [2,4,5,6])ser2

2    0.0173974    0.7642955    0.4075526    0.352605dtype: float64

ser1.combine_first(ser2)

1    0.7082792    0.2330483    0.0309914    0.2612915    0.3797526    0.352605dtype: float64

pd.concat([ser1,ser2])

1    0.7082792    0.2330483    0.0309914    0.2612915    0.3797522    0.0173974    0.7642955    0.4075526    0.352605dtype: float64

🏆文章推荐

Python数据可视化大杀器之Seaborn：学完可实现90%数据分析绘图

Python数据分析大杀器之Numpy详解