Python Data Analysis from Beginner to Advanced: Importing Data (Including Database Connections)

Python Data Science Series

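🌸 Homepage: JoJo的数据分析历险记 (JoJo's Data Analysis Adventures)
📝 About me: I am a senior undergraduate majoring in statistics, and I have been recommended for admission to a top-3 statistics university, where I will continue with graduate studies in statistics
💌 If this article helps you, feel free to follow, like, bookmark, and subscribe to the column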

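Contents

- Python Data Science Series
🥇1. Loading datasets bundled with sklearn
🥇2. Creating simulated datasets
- 🥈2.1 Regression dataset
- 🥈2.2 Simulated classification dataset
- 🥈2.3 Clustering dataset
🥇3. Loading a CSV file
🥇4. Loading an Excel file
🥇5. Querying a SQL database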

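Loading data is the first step of any data analysis. This article covers several common ways to bring a dataset into Python:

Loading the datasets bundled with scikit-learn
Creating simulated datasets
Importing a CSV dataset
Importing an Excel dataset
Connecting to a MySQL database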

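🥇1. Loading datasets bundled with sklearn

sklearn (scikit-learn) is a machine learning library that ships with a number of ready-made datasets. For example:

load_boston: Boston house-price observations, used for studying regression algorithms (removed in scikit-learn 1.2, so it is only available in older versions)
load_iris: measurements of 150 iris flowers, used for studying classification algorithms
load_digits: observations of handwritten digit images, a good dataset for studying image classification algorithms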

from sklearn import datasets

# Load the handwritten digits dataset
digits = datasets.load_digits()

# Create the feature matrix
features = digits.data
# Create the target vector
target = digits.target

# Inspect the first observation
features[0]

array([ 0.,  0.,  5., 13.,  9.,  1.,  0.,  0.,  0.,  0., 13., 15., 10.,
       15.,  5.,  0.,  0.,  3., 15.,  2.,  0., 11.,  8.,  0.,  0.,  4.,
       12.,  0.,  0.,  8.,  8.,  0.,  0.,  5.,  8.,  0.,  0.,  9.,  8.,
        0.,  0.,  4., 11.,  0.,  1., 12.,  7.,  0.,  0.,  2., 14.,  5.,
       10., 12.,  0.,  0.,  0.,  0.,  6., 13., 10.,  0.,  0.,  0.])
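The 64 values printed above are the pixels of one 8×8 image laid out in a single row. The following is a minimal sketch of how one might confirm this; the names `first_image` and `same_pixels` are illustrative and do not come from the original post.

```python
import numpy as np
from sklearn import datasets

digits = datasets.load_digits()
features = digits.data   # one row per image, 64 flattened pixel values
target = digits.target   # digit label for each image

# Reshape the first row back into its 8x8 pixel grid
first_image = features[0].reshape(8, 8)

print(features.shape, target.shape)
print(first_image)
print("label:", target[0])

# digits.images stores the same pixels, already in 8x8 form
same_pixels = np.array_equal(first_image, digits.images[0])
print("matches digits.images[0]:", same_pixels)
```

Keeping the data in flattened form is convenient because most scikit-learn estimators expect a two-dimensional feature matrix with one row per observation.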

🥇2. Creating simulated datasets

🥈2.1 Regression dataset

Below we use make_regression to simulate a regression dataset.

from sklearn.datasets import make_regression

# Generate the feature matrix, the target vector, and the true coefficients
features, target, coefficients = make_regression(
    n_samples=100,
    n_features=3,
    n_informative=3,
    n_targets=1,
    noise=0,
    coef=True,
    random_state=1,
)

print('Feature Matrix\n', features[:3])
print('Target Vector\n', target[:3])

Feature Matrix
 [[ 1.29322588 -0.61736206 -0.11044703]
 [-2.793085    0.36633201  1.93752881]
 [ 0.80186103 -0.18656977  0.0465673 ]]
Target Vector
 [-10.37865986  25.5124503   19.67705609]
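Because the call above sets noise=0 and leaves the bias at its default of 0, the target should be an exact linear combination of the features. The sketch below checks this with the returned coefficients; `reconstructed` and `max_abs_error` are illustrative names, not part of the original post.

```python
import numpy as np
from sklearn.datasets import make_regression

features, target, coefficients = make_regression(
    n_samples=100,
    n_features=3,
    n_informative=3,
    n_targets=1,
    noise=0,
    coef=True,
    random_state=1,
)

# With no noise and zero bias, target should equal features @ coefficients
reconstructed = features @ coefficients
max_abs_error = np.max(np.abs(reconstructed - target))

print("coefficients:", coefficients)
print("max abs error:", max_abs_error)
```

Setting noise to a positive value would break this exact relationship, which is usually what you want when testing how a regression model copes with imperfect data.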

🥈2.2 Simulated classification dataset

Use make_classification to create a classification dataset.

from sklearn.datasets import make_classification

# Two classes with an approximate 25% / 75% split
features, target = make_classification(
    n_samples=100,
    n_features=3,
    n_informative=3,
    n_redundant=0,
    n_classes=2,
    weights=[.25, .75],
    random_state=1,
)

print('Feature Matrix\n', features[:3])
print('Target Vector\n', target[:3])

Feature Matrix
 [[ 1.06354768 -1.42632219  1.02163151]
 [ 0.23156977  1.49535261  0.33251578]
 [ 0.15972951  0.83533515 -0.40869554]]
Target Vector
 [1 0 0]

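import matplotlib.pyplot as plt
%matplotlib inline

# Plot the first two features, colored by class
plt.scatter(features[:, 0], features[:, 1], c=target)

[Figure: scatter plot of the first two features, colored by class]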
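The weights argument controls how unbalanced the classes are. As a rough check, one could count the labels that come back; the names `class_counts` and `class_share` below are illustrative. The shares may not match the requested weights exactly, since make_classification randomly flips a small fraction of labels by default.

```python
import numpy as np
from sklearn.datasets import make_classification

features, target = make_classification(
    n_samples=100,
    n_features=3,
    n_informative=3,
    n_redundant=0,
    n_classes=2,
    weights=[.25, .75],
    random_state=1,
)

# Count how many observations fall into each class
class_counts = np.bincount(target)
class_share = class_counts / class_counts.sum()

print("counts per class:", class_counts)
print("share per class:", class_share)
```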

🥈2.3 Clustering dataset

Use make_blobs to create a clustering dataset.

# For clustering
from sklearn.datasets import make_blobs

features, target = make_blobs(
    n_samples=100,
    n_features=2,
    centers=3,
    cluster_std=0.5,
    shuffle=True,
    random_state=1,
)

print('Feature Matrix\n', features[:3])
print('Target Vector\n', target[:3])

Feature Matrix
 [[ -1.22685609   3.25572052]
 [ -9.57463218  -4.38310652]
 [-10.71976941  -4.20558148]]
Target Vector
 [0 1 1]

plt.scatter(features[:, 0], features[:, 1], c=target)

[Figure: scatter plot of the three clusters, colored by cluster label]
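To see numerically what the scatter plot shows, the sketch below summarizes each generated cluster by its size, mean, and standard deviation. The `cluster_sizes` dictionary and loop variable names are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import make_blobs

features, target = make_blobs(
    n_samples=100,
    n_features=2,
    centers=3,
    cluster_std=0.5,
    shuffle=True,
    random_state=1,
)

labels = np.unique(target)
cluster_sizes = {}

for label in labels:
    members = features[target == label]          # rows belonging to this cluster
    cluster_sizes[int(label)] = len(members)
    print(
        "cluster", label,
        "size", len(members),
        "mean", members.mean(axis=0),
        "std", members.std(axis=0),
    )
```

The per-cluster standard deviations should land near the requested cluster_std of 0.5, while the means show where each randomly placed center ended up.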

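🥇3. Loading a CSV file

CSV is the data format we meet most often in data analysis. The pandas library in Python offers a very simple way to import it:

import pandas as pd

file = r'C:\Users\DELL\Desktop\Statistic learning\ISLR\data\auto.csv'
df = pd.read_csv(file)  # if the data has no header row, pass header=None
df.head()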

	mpg	cylinders	displacement	horsepower	weight	acceleration	year	origin	name
0	18.0	8	307.0	130	3504	12.0	70	1	chevrolet chevelle malibu
1	15.0	8	350.0	165	3693	11.5	70	1	buick skylark 320
2	18.0	8	318.0	150	3436	11.0	70	1	plymouth satellite
3	16.0	8	304.0	150	3433	12.0	70	1	amc rebel sst
4	17.0	8	302.0	140	3449	10.5	70	1	ford torino
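The header=None option mentioned in the comment above matters when the file has no header row. Here is a small sketch of the difference, using an in-memory string in place of a file on disk; the sample rows and the column names passed to names are made up for illustration.

```python
import io
import pandas as pd

# Hypothetical headerless CSV content, standing in for a file on disk
csv_text = "18.0,8,307.0\n15.0,8,350.0\n"

# Default behavior: the first data row is consumed as the header
df_default = pd.read_csv(io.StringIO(csv_text))

# header=None keeps every row as data; names supplies the column labels
df_no_header = pd.read_csv(
    io.StringIO(csv_text),
    header=None,
    names=["mpg", "cylinders", "displacement"],
)

print(df_default.shape)
print(df_no_header)
```

Without header=None the first observation silently becomes the column labels, so the DataFrame ends up one row short.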

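🥇4. Loading an Excel file

url = r'C:\Users\DELL\Desktop\我的文件\学校课程\大三上复习资料\多元统计\例题数据及程序整理\例3-1.xlsx'
# header=1 uses the second row of the sheet as the column names
df = pd.read_excel(url, header=1)
# sheet_name selects which worksheet to read; to load several worksheets at once, pass them together as a list
df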

	No.	Wholesale and retail	Transport, storage and postal services	Accommodation and catering	Finance	Real estate	Water conservancy, environment and public facilities management	Region	Unit type
0	1	53918.0	31444.0	47300.0	38959.0	47123.0	35375.0	Beijing	Collective
1	2	61149.0	39936.0	45063.0	116756.0	48572.0	47389.0	Shanghai	Collective
2	3	34046.0	47754.0	39653.0	111004.0	46593.0	37562.0	Jiangsu	Collective
3	4	50269.0	51772.0	39072.0	125483.0	56055.0	43525.0	Zhejiang	Collective
4	5	27341.0	43153.0	40554.0	79899.0	44936.0	42788.0	Guangdong	Collective
5	6	129199.0	90183.0	59309.0	224305.0	80317.0	74290.0	Beijing	State-owned
6	7	89668.0	100042.0	64674.0	208343.0	88977.0	77464.0	Shanghai	State-owned
7	8	69904.0	72784.0	45581.0	105894.0	65904.0	59963.0	Jiangsu	State-owned
8	9	108473.0	86648.0	51239.0	163834.0	69972.0	56899.0	Zhejiang	State-owned
9	10	63247.0	76359.0	52359.0	138830.0	54179.0	47487.0	Guangdong	State-owned
10	11	93769.0	80563.0	50984.0	248919.0	87522.0	73048.0	Beijing	Other
11	12	118433.0	99719.0	52295.0	208705.0	82743.0	73241.0	Shanghai	Other
12	13	63340.0	65300.0	42071.0	126708.0	67070.0	50145.0	Jiangsu	Other
13	14	61801.0	71794.0	41879.0	125875.0	66284.0	52655.0	Zhejiang	Other
14	15	62271.0	80955.0	43174.0	145913.0	68469.0	52324.0	Guangdong	Other

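🥇5. Querying a SQL database

In real business analysis, data is very often stored in a database, so knowing how to connect to one is essential. An earlier post, "R语言连接mysql数据库" (connecting to a MySQL database from R), covered how to do this in R; here we look at how to do it in Python. First install the pymysql package with pip install pymysql. The code is as follows.

Import the required libraries

import pandas as pd
import pymysql

Connect to the MySQL database, specifying the relevant parameters

dbconn = pymysql.connect(
    host="localhost",
    database="test",  # database to connect to
    user="root",
    password="password",  # replace with your own password
    port=3306,  # port number
    charset='utf8'
)

Read the data. read_sql loads the result of a SQL query into a pandas DataFrame; the result is shown below.

sql = "select * from goods;"
df = pd.read_sql(sql=sql, con=dbconn)
df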

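	id	category_id	category	NAME	price	stock	upper_time
0	1	1	Women's clothing/Women's boutique	T-shirt	39.9	1000	2020-11-10
1	2	1	Women's clothing/Women's boutique	Dress	79.9	2500	2020-11-10
2	3	1	Women's clothing/Women's boutique	Sweatshirt	89.9	1500	2020-11-10
3	4	1	Women's clothing/Women's boutique	Jeans	89.9	3500	2020-11-10
4	5	1	Women's clothing/Women's boutique	Pleated skirt	29.9	500	2020-11-10
5	6	1	Women's clothing/Women's boutique	Woolen coat	399.9	1200	2020-11-10
6	7	2	Outdoor sports	Bicycle	399.9	1000	2020-11-10
7	8	2	Outdoor sports	Mountain bike	1399.9	2500	2020-11-10
8	9	2	Outdoor sports	Trekking pole	59.9	1500	2020-11-10
9	10	2	Outdoor sports	Cycling gear	399.9	3500	2020-11-10
10	11	2	Outdoor sports	Sports jacket	799.9	500	2020-11-10
11	12	2	Outdoor sports	Skateboard	499.9	1200	2020-11-10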
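If no MySQL server is at hand, the same read_sql pattern can be tried against an in-memory SQLite database from the standard library. This is only a sketch under that substitution: the three-column goods table below is a cut-down, made-up stand-in for the table shown above, and the `?` placeholder is SQLite's parameter style (pymysql uses `%s`).

```python
import sqlite3
import pandas as pd

# In-memory SQLite database standing in for the MySQL server
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE goods (id INTEGER, name TEXT, price REAL)")
conn.executemany(
    "INSERT INTO goods VALUES (?, ?, ?)",
    [(1, "T-shirt", 39.9), (2, "Dress", 79.9), (3, "Bicycle", 399.9)],
)
conn.commit()

# Pass query values through params instead of formatting them into the SQL string
sql = "SELECT * FROM goods WHERE price < ? ORDER BY id"
cheap = pd.read_sql(sql, con=conn, params=(100,))

conn.close()
print(cheap)
```

Passing values through params keeps user input out of the SQL text, which is the usual way to avoid SQL injection when a query depends on outside values.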

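That wraps up this chapter. In later posts I plan to cover feature engineering, data cleaning, model building, and some hands-on data mining projects in Python. Please like, bookmark, comment, and follow to show your support!!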
