> 文档中心 > Python数据分析大杀器之Pandas基础2万字详解(学pandas基础,这一篇就够啦)

Python数据分析大杀器之Pandas基础2万字详解(学pandas基础,这一篇就够啦)


Python数据分析


  • 🌸个人主页:JoJo的数据分析历险记
  • 📝个人介绍:小编大四统计在读,目前保研到统计学top3高校继续攻读统计研究生
  • 💌如果文章对你有帮助,欢迎关注、点赞、收藏、订阅专栏

本专栏主要介绍python数据分析领域的应用
参考资料:
利用python数据分析

文章目录

  • Python数据分析
  • 💮1.Series 对象
    • 🏵️1.1 Series基本操作
  • 🌹2.DataFrame对象
  • 🥀3.pandas基本数据运算
    • 🌺3.1 算术运算
    • 🌻3.2 基本算术运算符
    • 🌼3.3 函数映射
  • 🌷4.统计函数
    • 🌱4.1 相关性和协方差
      • 🌲4.1.1 Series对象
      • 🌳4.1.2DataFrame对象
      • 🌴4.1.3DataFrame和Series相关性
    • 🌵4.2排序和秩
  • 🌾5.Pandas缺失值处理
    • 🌿5.1 创建NaN数据
    • ☘️5.2 删除NaN
    • 🍀5.3 为NaN元素填充其他值
  • 🍁6. 层级索引和分层统计
    • 🍂6.1 unstack()函数和stack()函数
    • 🍃6.2调整层级顺序
    • 🌍6.3按层级统计数据
  • 🌎7.数据导入
  • 🌏8.数据处理
    • 🌐8.1 连接
      • 🎇8.1.1 内连接
      • 🎉8.1.2 外连接
      • 🎊8.1.3 以索引作为键进行连接
    • 🎄8.2拼接
    • 🎋8.3组合
  • 🏆文章推荐

我们介绍了Numpy在数据处理方面的应用,本文介绍一下pandas在数据处理方面的应用,pandas可以是基于numpy构建的,但是可以让数据处理变得更便捷

导入相关库

import numpy as npimport pandas as pd

💮1.Series 对象

pandas主要有两个数据对象,一个是Series,类似于一个向量的形式,另一个是DataFrame数据框形式。我们先来看一下如何创建一个Series数据对象。

s = pd.Series([12,-4,7,9])s
0    121    -42     73     9dtype: int64

🏵️1.1 Series基本操作

#选择内部元素s[2]
7
#为元素赋值s[2]=5ss['a'] = 4s
0    121    -42     53     9a     4dtype: int64
#用其它对象定义新的series对象arr = np.array([1,2,3,4])#此时s2只是原来的一个动态视图,会随数组的改变而改变,例如我们改变原来数组中的第二个元素值s2 = pd.Series(arr)s2arr[1] = 9s2
0    11    92    33    4dtype: int32
#筛选元素s[s>8]
0    123     9dtype: int64
#Series对象的组成元素serd = pd.Series([1,0,2,1,2,3], index=['white','white','blue','green','green','yellow'])serd
white     1white     0blue      2green     1green     2yellow    3dtype: int64
#unique去重 返回一个数组serd.unique()
array([1, 0, 2, 3], dtype=int64)
#value_counts 去重 返回出现次数serd.value_counts()
2    21    23    10    1dtype: int64
#isin 函数,返回布尔值serd.isin([0,3])
white     Falsewhite      Trueblue      Falsegreen     Falsegreen     Falseyellow     Truedtype: bool
serd[serd.isin([0,3])]
white     0yellow    3dtype: int64
#NaNs2 = pd.Series([-5,3,np.NaN,14])s2
0    -5.01     3.02     NaN3    14.0dtype: float64
# 用isnull 和 notnull 来进行判断s2.isnull()s2.notnull()
0     True1     True2    False3     Truedtype: bool
s2
0    -5.01     3.02     NaN3    14.0dtype: float64
#用作字典mydict = {'red':2000,'blue':1000,'yellow':500,'orange':1000}myseries = pd.Series(mydict)myseries
red2000blue      1000yellow     500orange    1000dtype: int64

当出现缺失值时,会直接用NaN替代

colors = ['red','blue','yellow','orange','green']myseries = pd.Series(mydict, index = colors)myseries
red2000.0blue      1000.0yellow     500.0orange    1000.0green NaNdtype: float64

进行运算时有NaN为NaN

mydict2 ={'red':400,'yellow':1000,"black":700}myseries2 = pd.Series(mydict2)myseries.fillna(0) + myseries2.fillna(0)
black NaNblue  NaNgreen NaNorangeNaNred2400.0yellow    1500.0dtype: float64

🌹2.DataFrame对象

DataFrame对象是我们在进行数据分析时最常见的数据格式,相当于一个矩阵数据,由不同行不同列组成,通常每一列代表一个变量,每一行代表一个观察数据。我们先来看一下DataFrame的一些基础应用。

创建DataFrame对象

#DataFrame对象data = {'color':['blue','green','yellow','red','white'], 'object':['ball','pen','pencil','paper','mug'], 'price':[1.2,1.0,0.6,0.9,1.7]}frame = pd.DataFrame(data)frame
color object price
0 blue ball 1.2
1 green pen 1.0
2 yellow pencil 0.6
3 red paper 0.9
4 white mug 1.7
frame2 = pd.DataFrame(data, columns=['object','price'])frame2
object price
0 ball 1.2
1 pen 1.0
2 pencil 0.6
3 paper 0.9
4 mug 1.7
frame3 = pd.DataFrame(data,index=['one','two','three','four','five'])frame3
color object price
one blue ball 1.2
two green pen 1.0
three yellow pencil 0.6
four red paper 0.9
five white mug 1.7
#选取元素#获得所有列的名称frame.columns
Index(['color', 'object', 'price'], dtype='object')
#获得所有行的名称frame.index
RangeIndex(start=0, stop=5, step=1)
#获得所有值frame.values
array([['blue', 'ball', 1.2],['green', 'pen', 1.0],['yellow', 'pencil', 0.6],['red', 'paper', 0.9],['white', 'mug', 1.7]], dtype=object)
#获得某一列的值frame['price']
0    1.21    1.02    0.63    0.94    1.7Name: price, dtype: float64
#获得行的值 用ix属性和行的索引项frame.iloc[2]

color     yellowobject    pencilprice 0.6Name: 2, dtype: object
#指定多个索引值能选取多行frame.iloc[[2,4]]
color object price
2 yellow pencil 0.6
4 white mug 1.7
#可以用frame[0:1]或者frame[0:2]选择行  但切记frame[0]没有数frame[0:4]

对DataFrame进行行选择时,使用索引frame[0:1]返回第一行数据,[1:2]返回第二行数据

color object price
0 blue ball 1.2
1 green pen 1.0
2 yellow pencil 0.6
3 red paper 0.9
#如果要获取其中的一个元素,必须依次指定元素所在的列名称、行的索引值或标签frame['object'][3]
'paper'
#赋值frame['new']=12 #直接添加某一列frame
color object price new
0 blue ball 1.2 12
1 green pen 1.0 12
2 yellow pencil 0.6 12
3 red paper 0.9 12
4 white mug 1.7 12
frame['new']=[1,2,3,4,5]frame
color object price new
0 blue ball 1.2 1
1 green pen 1.0 2
2 yellow pencil 0.6 3
3 red paper 0.9 4
4 white mug 1.7 5
#修改单个元素的方法frame['price'][2]=3.3frame
color object price new
0 blue ball 1.2 1
1 green pen 1.0 2
2 yellow pencil 3.3 3
3 red paper 0.9 4
4 white mug 1.7 5
# 删除一整列的所有数据,用delframe['new'] = 12framedel frame['new']frame
color object price
0 blue ball 1.2
1 green pen 1.0
2 yellow pencil 3.3
3 red paper 0.9
4 white mug 1.7
#筛选元素frame3 = pd.DataFrame(np.arange(16).reshape((4,4)),index = ['red','white','blue','green'], columns=['ball','pen','pencil','paper'])frame3frame3[frame3>12]
ball pen pencil paper
red NaN NaN NaN NaN
white NaN NaN NaN NaN
blue NaN NaN NaN NaN
green NaN 13.0 14.0 15.0
#用嵌套字典生成DataFrame对象 当出现缺失值时用NaN替代nestdict = {'red':{2012:22, 2013:33},'white':{2011: 13,2012:22,2013:16},'blue':{2011:17,2012:27,2013:48}}nestdict
{'red': {2012: 22, 2013: 33}, 'white': {2011: 13, 2012: 22, 2013: 16}, 'blue': {2011: 17, 2012: 27, 2013: 48}}
frame2 = pd.DataFrame(nestdict)frame2
red white blue
2011 NaN 13 17
2012 22.0 22 27
2013 33.0 16 48

进行转置

frame2.T
2011 2012 2013
red NaN 22.0 33.0
white 13.0 22.0 16.0
blue 17.0 27.0 48.0
#index对象ser = pd.Series([5,0,3,8,4], index=['red','blue','yellow','white','green'])ser.index
Index(['red', 'blue', 'yellow', 'white', 'green'], dtype='object')
ser.idxmax()
'white'
ser.idxmin()
'blue'
#含重复标签的Indexserd = pd.Series(range(6), index=['white','white','blue','green','green','yellow'])serd
white     0white     1blue      2green     3green     4yellow    5dtype: int64
#当一个标签对应多个元素时,返回一个Series对象 而不是单个元素serd['white']
white    0white    1dtype: int64
#判断是否由重复值, is_unique
#索引对象的其他功能ser = pd.Series([2,5,7,4],index = ['one','two','three','four'])ser
one      2two      5three    7four     4dtype: int64
#reindex()函数可以更换series对象的索引,生成一个新的series对象ser.reindex(['three','one','five','two'])
three    7.0one      2.0five     NaNtwo      5.0dtype: float64
ser3 = pd.Series([1,5,6,3],index=[0,3,5,6])ser3
0    13    55    66    3dtype: int64
#自动插补#reindex()函数,method:ffill 表示插补的数为前面的值,bfill表示插补的数为后面的值ser3.reindex(range(6),method='ffill')
0    11    12    13    54    55    6dtype: int64
ser3.reindex(range(8),method='bfill')
0    1.01    5.02    5.03    5.04    6.05    6.06    3.07    NaNdtype: float64
frame.reindex(range(5), method='ffill',columns=['colors','price','new','object'])
colors price new object
0 blue 1.2 blue ball
1 green 1.0 green pen
2 yellow 3.3 yellow pencil
3 red 0.9 red paper
4 white 1.7 white mug
ser = pd.Series(np.arange(4.),index=['red','blue','yellow','white'])ser
red0.0blue      1.0yellow    2.0white     3.0dtype: float64
ser.drop('yellow')
red      0.0blue     1.0white    3.0dtype: float64
ser.drop(['blue','white'])
red0.0yellow    2.0dtype: float64
frame = pd.DataFrame(np.arange(16).reshape((4,4)),      index=['red','blue','yellow','white'],      columns=['ball','pen','pencil','paper'])frame
ball pen pencil paper
red 0 1 2 3
blue 4 5 6 7
yellow 8 9 10 11
white 12 13 14 15
#删除时默认是行 axis指定轴,1为列frame.drop(['pen'],axis=1)
ball pencil paper
red 0 2 3
blue 4 6 7
yellow 8 10 11
white 12 14 15

🥀3.pandas基本数据运算

🌺3.1 算术运算

  • 当有两个series或DataFrame对象时,如果一个标签,两个对象都有,则把他们的值相加
  • 当一个标签只有一个对象有时,则为NaN
s1 = pd.Series([3,2,5,1],index=['white','yellow','green','blue'])s1
white     3yellow    2green     5blue      1dtype: int64
s2 = pd.Series([1,4,7,2,1],['white','yellow','black','blue','brown'])s1 + s2
black     NaNblue      3.0brown     NaNgreen     NaNwhite     4.0yellow    6.0dtype: float64
# DateFrame对象也一样frame1 = pd.DataFrame(np.arange(16).reshape((4,4)),columns=['ball','pen','pencil','paper'], index = ['red','blue','yellow','white'])frame1
ball pen pencil paper
red 0 1 2 3
blue 4 5 6 7
yellow 8 9 10 11
white 12 13 14 15
frame2 = pd.DataFrame(np.arange(12).reshape((4,3)),index = ['blue','yellow','green','white'],columns=['ball','pen','mug'])frame2
ball pen mug
blue 0 1 2
yellow 3 4 5
green 6 7 8
white 9 10 11
frame3 = frame1+frame2frame3
ball mug paper pen pencil
blue 4.0 NaN NaN 6.0 NaN
green NaN NaN NaN NaN NaN
red NaN NaN NaN NaN NaN
white 21.0 NaN NaN 23.0 NaN
yellow 11.0 NaN NaN 13.0 NaN

🌻3.2 基本算术运算符

主要的算术运算符如下

  • add() frame1.add(frame2) = frame1+frame2
  • sub()
  • div()
  • mul()

下面通过一些案例来说明

frame = pd.DataFrame(np.arange(16).reshape((4,4)),columns=['ball','pen','pencil','paper'], index = ['red','blue','yellow','white'])frame
ball pen pencil paper
red 0 1 2 3
blue 4 5 6 7
yellow 8 9 10 11
white 12 13 14 15
ser = pd.Series(np.arange(4),['ball','pen','pencil','paper'])ser #与frame 的列名称保持一致,行不可以
ball      0pen1pencil    2paper     3dtype: int32
frame-ser
ball pen pencil paper
red 0 0 0 0
blue 4 4 4 4
yellow 8 8 8 8
white 12 12 12 12

当索引项只存在于其中一个数据结构时,那么运算结果会为其产生一个新的索引项,但其值为NaN

具体案例如下,我们给ser增加一列mug

ser['mug'] = 9ser
ball      0pen1pencil    2paper     3mug9dtype: int64
frame - ser
ball mug paper pen pencil
red 0 NaN 0 0 0
blue 4 NaN 4 4 4
yellow 8 NaN 8 8 8
white 12 NaN 12 12 12

🌼3.3 函数映射

在dataframe和series数据对象中,可以使用函数对所有元素进行操作

frame
ball pen pencil paper
red 0 1 2 3
blue 4 5 6 7
yellow 8 9 10 11
white 12 13 14 15
# 求所有元素的平方根np.sqrt(frame)
ball pen pencil paper
red 0.000000 1.000000 1.414214 1.732051
blue 2.000000 2.236068 2.449490 2.645751
yellow 2.828427 3.000000 3.162278 3.316625
white 3.464102 3.605551 3.741657 3.872983
#定义函数#法一:f = lambda x:x.max()-x.min()#返回数组取值范围#法二:def f(x):    return x.max()-x.min()
# apply函数可以调用定义的函数frame.apply(f)
ball      12pen12pencil    12paper     12dtype: int64
def f(x):    return pd.Series([x.min(),x.max()],index = ['min','max'])
frame.apply(f,axis = 1)# 默认axis=0
min max
red 0 3
blue 4 7
yellow 8 11
white 12 15

🌷4.统计函数

  • 数组大多数统计函数对DataFrame对象有用,故可以直接使用
frame.sum()
ball      24pen28pencil    32paper     36dtype: int64
frame.mean()
ball      6.0pen7.0pencil    8.0paper     9.0dtype: float64
# describe()函数可以计算多个统计量frame.describe()
ball pen pencil paper
count 4.000000 4.000000 4.000000 4.000000
mean 6.000000 7.000000 8.000000 9.000000
std 5.163978 5.163978 5.163978 5.163978
min 0.000000 1.000000 2.000000 3.000000
25% 3.000000 4.000000 5.000000 6.000000
50% 6.000000 7.000000 8.000000 9.000000
75% 9.000000 10.000000 11.000000 12.000000
max 12.000000 13.000000 14.000000 15.000000

```python ser.rank(method='first') ```

red4.0blue      1.0yellow    2.0white     5.0green     3.0dtype: float64

🌱4.1 相关性和协方差

🌲4.1.1 Series对象

  • 通常涉及两个数据对象
  • 函数分别corr()和cov()
seq = pd.Series([3,4,3,4,5,4,3,2],['2006','2007','2008','2009','2010','2011','2012','2013'])seq
2006    32007    42008    32009    42010    52011    42012    32013    2dtype: int64
seq2 = pd.Series([1,2,3,4,4,3,2,1],['2006','2007','2008','2009','2010','2011','2012','2013'])seq2
2006    12007    22008    32009    42010    42011    32012    22013    1dtype: int64
seq.corr(seq2)
0.7745966692414834
seq.cov(seq2)
0.8571428571428571

🌳4.1.2DataFrame对象

DataFrame对象计算相关性和协方差依然返回一个dataframe对象

frame2 = pd.DataFrame([[1,4,3,6],[4,5,6,1],[3,3,1,5],[4,1,6,4]],index=['red','blue','yellow','white'],columns=['ball','pen','pencil','paper'])frame2
ball pen pencil paper
red 1 4 3 6
blue 4 5 6 1
yellow 3 3 1 5
white 4 1 6 4
frame2.corr()
ball pen pencil paper
ball 1.000000 -0.276026 0.577350 -0.763763
pen -0.276026 1.000000 -0.079682 -0.361403
pencil 0.577350 -0.079682 1.000000 -0.692935
paper -0.763763 -0.361403 -0.692935 1.000000
frame2.cov()
ball pen pencil paper
ball 2.000000 -0.666667 2.000000 -2.333333
pen -0.666667 2.916667 -0.333333 -1.333333
pencil 2.000000 -0.333333 6.000000 -3.666667
paper -2.333333 -1.333333 -3.666667 4.666667

🌴4.1.3DataFrame和Series相关性

corrwith()可以计算DataFrame对象的列或行与Series对象或者其他DataFrame对象元素两两之间的相关性

ser
red5blue      0yellow    3white     8green     4dtype: int64
frame2.corrwith(ser)
ball     -0.140028pen      -0.869657pencil    0.080845paper     0.595854dtype: float64
frame2.corrwith(frame)
ball      0.730297pen      -0.831522pencil    0.210819paper    -0.119523dtype: float64

🌵4.2排序和秩

  • Series用sort_values()和rank(),默认是升序,使用ascending=False改变为升序,下同
  • DataFrame用sort_index(by=‘’)和rank()

对ser排序

ser.sort_values()
blue      0yellow    3green     4red5white     8dtype: int64

对ser求秩

ser.rank()
red4.0blue      1.0yellow    2.0white     5.0green     3.0dtype: float64

安装pen对frame进行排序

frame.sort_values(by='pen')
ball pen pencil paper
red 0 1 2 3
blue 4 5 6 7
yellow 8 9 10 11
white 12 13 14 15
ser = pd.Series([5,0,3,8,4],index=['red','blue','yellow','white','green'])ser
red5blue      0yellow    3white     8green     4dtype: int64
ser.sort_index()#按字母表顺序升序排列
blue      0green     4red5white     8yellow    3dtype: int64
ser.sort_index(ascending=False)# 改为降序
yellow    3white     8red5green     4blue      0dtype: int64
ser.sort_values()#根据值排序
blue      0yellow    3green     4red5white     8dtype: int64
frame
ball pen pencil paper
red 0 1 2 3
blue 4 5 6 7
yellow 8 9 10 11
white 12 13 14 15
frame.sort_index()
ball pen pencil paper
blue 4 5 6 7
red 0 1 2 3
white 12 13 14 15
yellow 8 9 10 11

axis代表轴,1表示纵轴,0表示横轴

frame.sort_index(axis=1)
ball paper pen pencil
red 0 3 1 2
blue 4 7 5 6
yellow 8 11 9 10
white 12 15 13 14

🌾5.Pandas缺失值处理

🌿5.1 创建NaN数据

  • 为数据赋NaN值 用np.nan
ser = pd.Series([0,1,2,np.NaN,9], index=['red','blue','yellow','white','green'])ser
red0.0blue      1.0yellow    2.0white     NaNgreen     9.0dtype: float64
ser['white']
nan

☘️5.2 删除NaN

  • dropna()
  • ser[ser.notnull()]
  • DataFrame中去除时 为避免删除整行或整列,用how='all’来表示只删除所有元素均为NAN的行或列,如果使用how=‘any’,则只要这一列有缺失值就删除整列
frame3.dropna(how='all')
ball mug paper pen pencil
blue 4.0 NaN NaN 6.0 NaN
white 21.0 NaN NaN 23.0 NaN
yellow 11.0 NaN NaN 13.0 NaN

🍀5.3 为NaN元素填充其他值

  • fillna()函数·
frame3.fillna(0)
ball mug paper pen pencil
blue 4.0 0.0 0.0 6.0 0.0
green 0.0 0.0 0.0 0.0 0.0
red 0.0 0.0 0.0 0.0 0.0
white 21.0 0.0 0.0 23.0 0.0
yellow 11.0 0.0 0.0 13.0 0.0

#若要将不同列的NaN换成不同元素,依次指定列名称及要替换成的元素即可

frame3.fillna({'ball':1,"pen":99})
ball mug paper pen pencil
blue 4.0 NaN NaN 6.0 NaN
green 1.0 NaN NaN 99.0 NaN
red 1.0 NaN NaN 99.0 NaN
white 21.0 NaN NaN 23.0 NaN
yellow 11.0 NaN NaN 13.0 NaN

🍁6. 层级索引和分层统计

有时候我们需要对数据进行分层级的索引,具体看下面这个例子

mser = pd.Series(np.random.rand(8),index=[['white','white','white','blue','blue','red','red','red'],      ['up','down','right','up','down','up','down','left']])mser
white  up0.323513down     0.080292right    0.503630blue   up0.201143down     0.173879red    up0.866267down     0.601906left     0.140885dtype: float64
mser.index
MultiIndex(levels=[['blue', 'red', 'white'], ['down', 'left', 'right', 'up']],    codes=[[2, 2, 2, 0, 0, 1, 1, 1], [3, 0, 2, 3, 0, 3, 0, 1]])
mser['white']
up0.323513down     0.080292right    0.503630dtype: float64
mser[:,'up']
white    0.323513blue     0.201143red      0.866267dtype: float64
mser[:,'right']
white    0.50363dtype: float64
mser['white','up']#可以得到某一特定元素的
0.32351250980575463

🍂6.1 unstack()函数和stack()函数

unstack把等级索引Series对象转换为一个简单的DataFrame对象,把第二列索引转换为相应的列,stack则相反,具体如下

mser.unstack() #将series转换为dataframemser.unstack().fillna(0)
down left right up
blue 0.173879 0.000000 0.00000 0.201143
red 0.601906 0.140885 0.00000 0.866267
white 0.080292 0.000000 0.50363 0.323513
frame
ball pen pencil paper
red 0 1 2 3
blue 4 5 6 7
yellow 8 9 10 11
white 12 13 14 15
frame.stack()#将dataframe转换为series对象
red     ball0 pen 1 pencil     2 paper      3blue    ball4 pen 5 pencil     6 paper      7yellow  ball8 pen 9 pencil    10 paper     11white   ball      12 pen13 pencil    14 paper     15dtype: int32

dataframe对象的行与列也可以定义分层级索引

mframe = pd.DataFrame(np.arange(16).reshape((4,4)),index = [['white','white','red','red'],['up','down','up','down']],columns=[['pen','pen','paper','paper'],[1,2,1,2]])mframe
pen paper
1 2 1 2
white up 0 1 2 3
down 4 5 6 7
red up 8 9 10 11
down 12 13 14 15

🍃6.2调整层级顺序

  • swaplevel()函数以要互换位置的两个层级的名称为参数,返回交换位置后的一个新对象,其中的个元素的顺序保持不变
mframe.columns.names = ['object','id']mframe.index.names = ['colors','status']mframe
object pen paper
id 1 2 1 2
colors status
white up 0 1 2 3
down 4 5 6 7
red up 8 9 10 11
down 12 13 14 15
mframe.swaplevel('colors','status')
object pen paper
id 1 2 1 2
status colors
up white 0 1 2 3
down white 4 5 6 7
up red 8 9 10 11
down red 12 13 14 15

🌍6.3按层级统计数据

  • 直接把层级的名称赋给level选项
mframe.sum(level='colors')
object pen paper
id 1 2 1 2
colors
white 4 6 8 10
red 20 22 24 26

若想对某一层级的列进行统计,则需要把axis的值设置为1

mframe.sum(level='id', axis=1)
id 1 2
colors status
white up 2 4
down 10 12
red up 18 20
down 26 28

🌎7.数据导入

很多时候,我们要分析的数据来自电脑上保存的数据文件,本文介绍一下如何导入我们最常用的csv文件,后续我还会介绍如何导入json数据、以及连接SQL数据库等其他的方式来导入数据

import pandas as pd # 加载模块df = pd.read_csv('student.csv')df
Student ID name   age  gender11 1111    Dw    3  Female12 1112     Q   23    Male13 1113     W   21  Female
id color brand_x sid brand_y
0 ball white OMG ball ABC
1 pencil red ABC pencil OMG
2 pencil red ABC pencil POD
3 pen red ABC pen POD

🌏8.数据处理

  • 表连接
  • 拼接
  • 组合

🌐8.1 连接

使用merge()函数 类似sql中的多表连接

🎇8.1.1 内连接

import numpy as npimport pandas as pd
frame1 = pd.DataFrame({'id':['ball','pencil','pen','mug','ashtray'], 'price':[12.33,11.44,33.21,13.23,33.62]})frame1
id price
0 ball 12.33
1 pencil 11.44
2 pen 33.21
3 mug 13.23
4 ashtray 33.62
frame2 = pd.DataFrame({'id':['pencil','pencil','ball','pen'], 'color':['white','red','red','black']})frame2
id color
0 pencil white
1 pencil red
2 ball red
3 pen black
pd.merge(frame1,frame2)
id price color
0 ball 12.33 red
1 pencil 11.44 white
2 pencil 11.44 red
3 pen 33.21 black

上述返回的DataFrame对象由原来的两个DataFrame对象中ID相同的行组成 并且没有指定基于哪一列进行合并,实际应用中通常要指定连接条件, 用on来zhid

frame1 = pd.DataFrame({'id':['ball','pencil','pen','mug','ashtray'], 'color':['white','red','red','black','green'], 'brand':['OMG','ABC','ABC','POD','POD']})frame1
id color brand
0 ball white OMG
1 pencil red ABC
2 pen red ABC
3 mug black POD
4 ashtray green POD
frame2 = pd.DataFrame({'id':['pencil','pencil','ball','pen'], 'brand':['OMG','POD','ABC','POD']})frame2
id brand
0 pencil OMG
1 pencil POD
2 ball ABC
3 pen POD
pd.merge(frame1,frame2,on='id') # 以id进行合并
id color brand_x brand_y
0 ball white OMG ABC
1 pencil red ABC OMG
2 pencil red ABC POD
3 pen red ABC POD
pd.merge(frame1,frame2,on='brand') # 以brand进行合并
id_x color brand id_y
0 ball white OMG pencil
1 pencil red ABC ball
2 pen red ABC ball
3 mug black POD pencil
4 mug black POD pen
5 ashtray green POD pencil
6 ashtray green POD pen

当出现两个列的名称不一致的时候,使用left_on 和 right_on,例如,下面两个表,一个是id,一个是sid,我们相当于是用第一个表的id和第二个表的sid连接

frame2.columns = ['sid','brand']frame2
sid brand
0 pencil OMG
1 pencil POD
2 ball ABC
3 pen POD
pd.merge(frame1,frame2,left_on = 'id',right_on ='sid')
id color brand_x sid brand_y
0 ball white OMG ball ABC
1 pencil red ABC pencil OMG
2 pencil red ABC pencil POD
3 pen red ABC pen POD

merge()函数默认的是内连接,上述结果中的键是由交叉操作出来的

🎉8.1.2 外连接

  • 连接类型用how选项指定
  • 左连接 共有的加上左边的
  • 右连接 共有的加上右边的
  • 外连接把所有的键整合到一起
frame2.columns=['id','brand']
pd.merge(frame1,frame2,how='outer')
id color brand
0 ball white OMG
1 pencil red ABC
2 pen red ABC
3 mug black POD
4 ashtray green POD
5 pencil NaN OMG
6 pencil NaN POD
7 ball NaN ABC
8 pen NaN POD
pd.merge(frame1,frame2,how='left')
id color brand
0 ball white OMG
1 pencil red ABC
2 pen red ABC
3 mug black POD
4 ashtray green POD
pd.merge(frame1,frame2,how='right')
id color brand
0 pencil NaN OMG
1 pencil NaN POD
2 ball NaN ABC
3 pen NaN POD

要合并多个键,则把多个键给on选项

pd.merge(frame1,frame2,on=['id','brand'],how='outer')
id color brand
0 ball white OMG
1 pencil red ABC
2 pen red ABC
3 mug black POD
4 ashtray green POD
5 pencil NaN OMG
6 pencil NaN POD
7 ball NaN ABC
8 pen NaN POD

🎊8.1.3 以索引作为键进行连接

#方法一pd.merge(frame1,frame2,left_index=True,right_index=True)
id_x color brand_x id_y brand_y
0 ball white OMG pencil OMG
1 pencil red ABC pencil POD
2 pen red ABC ball ABC
3 mug black POD pen POD
#方法二 用join函数 默认左连接frame2.columns = ['id2','brand2']frame1.join(frame2)
id color brand id2 brand2
0 ball white OMG pencil OMG
1 pencil red ABC pencil POD
2 pen red ABC ball ABC
3 mug black POD pen POD
4 ashtray green POD NaN NaN

🎄8.2拼接

  • numpy中的concatenation()函数可以用来进行拼接操作
  • pandas的concat()函数实现了按轴拼接的功能()
arr1 = np.arange(9).reshape(3,3)arr1
array([[0, 1, 2],[3, 4, 5],[6, 7, 8]])
arr2 = np.arange(6,15).reshape(3,3)arr2
array([[ 6,  7,  8],[ 9, 10, 11],[12, 13, 14]])
np.concatenate([arr1,arr2],axis=1)#默认是axis=0
array([[ 0,  1,  2,  6,  7,  8],[ 3,  4,  5,  9, 10, 11],[ 6,  7,  8, 12, 13, 14]])
ser1 = pd.Series(np.random.rand(4), index = [1,2,3,4])ser1
1    0.1801912    0.0616493    0.2363784    0.105309dtype: float64
ser2 = pd.Series(np.random.rand(4), index = [5,6,7,8])ser2
5    0.9352776    0.5161467    0.2104618    0.912048dtype: float64
pd.concat([ser1,ser2])
1    0.1801912    0.0616493    0.2363784    0.1053095    0.9352776    0.5161467    0.2104618    0.912048dtype: float64
pd.concat([ser1,ser2],axis = 1)
0 1
1 0.180191 NaN
2 0.061649 NaN
3 0.236378 NaN
4 0.105309 NaN
5 NaN 0.935277
6 NaN 0.516146
7 NaN 0.210461
8 NaN 0.912048

默认是外连接

pd.concat([ser1,ser2],axis=1,join='inner')
0 1

如果想要创建等级索引,需要用keys选项来完成

pd.concat([ser1,ser2],keys=[1,2])
1  1    0.180191   2    0.061649   3    0.236378   4    0.1053092  5    0.935277   6    0.516146   7    0.210461   8    0.912048dtype: float64
pd.concat([ser1,ser2],axis=1,keys=[1,2])
1 2
1 0.180191 NaN
2 0.061649 NaN
3 0.236378 NaN
4 0.105309 NaN
5 NaN 0.935277
6 NaN 0.516146
7 NaN 0.210461
8 NaN 0.912048

🎋8.3组合

  • 当无法通过合并或者拼接方法组合数据用组合函数
  • combine_first()函数可以用来组合Series对象,同时对齐数据
ser1 = pd.Series(np.random.rand(5), index=[1,2,3,4,5])ser1
1    0.7082792    0.2330483    0.0309914    0.2612915    0.379752dtype: float64
ser2 = pd.Series(np.random.rand(4), index = [2,4,5,6])ser2
2    0.0173974    0.7642955    0.4075526    0.352605dtype: float64
ser1.combine_first(ser2)
1    0.7082792    0.2330483    0.0309914    0.2612915    0.3797526    0.352605dtype: float64
pd.concat([ser1,ser2])
1    0.7082792    0.2330483    0.0309914    0.2612915    0.3797522    0.0173974    0.7642955    0.4075526    0.352605dtype: float64

🏆文章推荐

Python数据可视化大杀器之Seaborn:学完可实现90%数据分析绘图

Python数据分析大杀器之Numpy详解