Testing Whether Body Temperature Follows a Normal Distribution

import requests
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

Step 1: Get the data

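resp = requests.get("http://jse.amstat.org/datasets/normtemp.dat.txt")
resp.raise_for_status()  # stop early if the download failed
resp.encoding = "utf-8"
with open("normtemp.dat.txt", "w", encoding="utf-8") as f:
    f.write(resp.text)
# sep=r'\s+' is a regular expression: the separator is one or more whitespace characters (\s matches spaces and tabs)
df = pd.read_csv("normtemp.dat.txt", header=None, sep=r"\s+")
# Column names: 体温 = body temperature (°F), 性别 = sex (1 = male, 2 = female), 心率 = heart rate (beats per minute)
df.columns = ['体温','性别','心率']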
df.head()

	体温	性别	心率
0	96.3	1	70
1	96.7	1	71
2	96.9	1	74
3	97.0	1	80
4	97.1	1	73
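To see what the regular-expression separator does without touching the network, the following minimal sketch parses the five rows shown above from an inline string. The mixed spaces and tabs in `raw` are an assumption made for illustration; the real file may use a different layout.

```python
import io
import pandas as pd

# Inline sample: the five rows printed above, deliberately separated by a mix
# of spaces and tabs (an assumption, not the real file layout).
raw = "96.3   1  70\n96.7\t1\t71\n96.9  1   74\n97.0 1 80\n97.1\t 1  73\n"

# r"\s+" treats any run of whitespace as a single separator.
sample = pd.read_csv(io.StringIO(raw), header=None, sep=r"\s+")
sample.columns = ['体温', '性别', '心率']

print(sample.shape)
print(sample.dtypes)
```

Using a raw string (`r"\s+"`) avoids the invalid-escape warning that newer Python versions emit for a plain `"\s+"`.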

Step 2: Explore the data

df.describe()

	体温	性别	心率
count	130.000000	130.000000	130.000000
mean	98.249231	1.500000	73.761538
std	0.733183	0.501934	7.062077
min	96.300000	1.000000	57.000000
25%	97.800000	1.000000	69.000000
50%	98.300000	1.500000	74.000000
75%	98.700000	2.000000	79.000000
max	100.800000	2.000000	89.000000
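As a quick sanity check before plotting anything, the summary statistics above can be compared with what a normal model would predict. The sketch below takes the mean, standard deviation and quartiles from the `describe()` output and computes the quartiles of a normal distribution with the same mean and standard deviation. It is an informal check, not a test.

```python
from scipy.stats import norm

# Summary statistics copied from the describe() output above (体温 column).
mean, std = 98.249231, 0.733183
observed = {0.25: 97.8, 0.50: 98.3, 0.75: 98.7}

for q, obs in observed.items():
    # Quantile of a normal distribution with the sample mean and std
    expected = norm.ppf(q, loc=mean, scale=std)
    print(f"quantile {q:.2f}: observed {obs:.1f}, normal model {expected:.2f}")
```

If the observed quartiles sit close to the model quartiles, as they appear to here, the data are at least roughly symmetric around the mean with normal-like spread.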

Scatter plots

plt.rcParams['font.sans-serif']=['SimHei'] # font that can render Chinese labels
plt.rcParams['axes.unicode_minus']=False # render the minus sign correctly

fig=plt.figure(figsize=(16,5))
plt.scatter(df['性别'],df['体温'],c='b',marker='o',alpha=0.7)
plt.title('Scatter')
plt.xlabel('sex')
plt.ylabel('temp')
plt.grid(True)
plt.show()

plt.scatter(df['心率'],df['体温'],c='b',marker='<',alpha=0.7)
plt.title('scatter')
plt.xlabel('heart')
plt.ylabel('temp')
plt.grid(True)
plt.show()

fig=plt.figure(figsize=(16,5))
plt.scatter(df['性别'],df['心率'],c='b',marker='o',alpha=0.7)
plt.title('Scatter')
plt.xlabel('sex')
plt.ylabel('heart')
plt.grid(True)
plt.show()

Bar charts

# Notes on the functions used:
# arange([start,] stop[, step,], dtype=None) returns an ndarray covering the range from start to stop (exclusive) with the given step.
# >>> arange(0,1,0.1)
# array([ 0. ,  0.1,  0.2,  0.3,  0.4,  0.5,  0.6,  0.7,  0.8,  0.9])

# range()
# >>> range(0,5) 			 	# returns a range object, not the list [0,1,2,3,4]
# range(0, 5)   
# >>> c = [i for i in range(0,5)] 	 # from 0 up to 4 (5 is excluded), default step 1
# >>> c
# [0, 1, 2, 3, 4]
# >>> c = [i for i in range(0,5,2)] 	 # step set to 2
# >>> c
# [0, 2, 4]
x=np.arange(0,130,1)
y=df['体温'].values
plt.bar(x,y)

<BarContainer object of 130 artists>

x=np.arange(0,130,1)
y=df['心率'].values
plt.bar(x,y)

<BarContainer object of 130 artists>

Histogram

df['体温'].hist(bins=20,alpha=0.5)

<matplotlib.axes._subplots.AxesSubplot at 0x1218097f0>

# A density plot is also known as a KDE (kernel density estimate) plot.
# Passing kind='kde' to Series.plot draws one.
df['体温'].plot(kind='kde',secondary_y=True)

<matplotlib.axes._subplots.AxesSubplot at 0x1219a1198>

Histogram with density curve

df['体温'].hist(bins=20,alpha=0.5)
df['体温'].plot(kind='kde',secondary_y=True)

<matplotlib.axes._subplots.AxesSubplot at 0x1a234f0940>

Drawing a fitted curve over the histogram with seaborn's distplot

import seaborn as sns 
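sns.set_palette("hls") # set the colour palette for all plots, using the hls colour space
# Note: distplot is deprecated since seaborn 0.11; sns.histplot(df['体温'], bins=30, kde=True) is the closest replacement.
sns.distplot(df['体温'],color="r",bins=30,kde=True)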
plt.show()
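Because `distplot` is deprecated in newer seaborn releases, here is a minimal alternative sketch that uses only matplotlib and scipy. Note that it overlays the pdf of a normal distribution fitted by the sample mean and standard deviation, not a kernel density estimate, so it shows directly how well a normal curve matches the bars. The helper name `hist_with_normal_fit` and the synthetic `demo` data are assumptions for illustration; pass `df['体温'].values` to use the real data.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm

def hist_with_normal_fit(values, bins=30, ax=None):
    """Plot a density-scaled histogram and overlay the fitted normal pdf."""
    values = np.asarray(values, dtype=float)
    if ax is None:
        _, ax = plt.subplots(figsize=(6, 4))
    mu, sigma = values.mean(), values.std(ddof=1)
    # density=True scales the bars so that their total area is 1
    ax.hist(values, bins=bins, density=True, color="b", alpha=0.5)
    grid = np.linspace(values.min(), values.max(), 200)
    ax.plot(grid, norm.pdf(grid, loc=mu, scale=sigma), color="seagreen", lw=3)
    ax.set_xlabel("temp")
    ax.set_ylabel("density")
    return mu, sigma

# Synthetic stand-in data (an assumption); replace with df['体温'].values.
rng = np.random.default_rng(0)
demo = rng.normal(loc=98.25, scale=0.73, size=130)
mu, sigma = hist_with_normal_fit(demo)
```

Call `plt.show()` afterwards to display the figure in a script; in a notebook it renders automatically.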


# For finer control, pass kde_kws (settings for the fitted curve) and hist_kws (settings for the histogram bars)
import seaborn as sns 
import matplotlib as mpl 
sns.set_palette("hls") 
mpl.rc("figure", figsize=(6,4)) 
# lw is the line width of the curve
sns.distplot(df['体温'],bins=30,kde_kws={"color":"seagreen", "lw":3 }, hist_kws={ "color": "b" }) 
plt.show()
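A normal Q-Q plot is another common visual check and complements the histograms above: if the data are close to normal, the points fall near a straight line. The sketch below uses `scipy.stats.probplot`; the helper name `qq_plot` and the synthetic `demo` sample are assumptions for illustration, and `df['体温'].values` can be passed in their place.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

def qq_plot(values, ax=None):
    """Draw a normal Q-Q plot and return the correlation r of the fitted line."""
    if ax is None:
        _, ax = plt.subplots(figsize=(5, 5))
    # probplot returns ((theoretical quantiles, ordered values),
    #                   (slope, intercept, r)) and draws onto ax
    (_, _), (slope, intercept, r) = stats.probplot(values, dist="norm", plot=ax)
    ax.set_title("Normal Q-Q plot")
    return r

# Synthetic stand-in data (an assumption); replace with df['体温'].values.
rng = np.random.default_rng(0)
demo = rng.normal(loc=98.25, scale=0.73, size=130)
r = qq_plot(demo)
print(f"correlation of the Q-Q line: {r:.4f}")
```

A value of `r` close to 1 means the ordered sample values line up well with the theoretical normal quantiles.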

Step 3: Test whether body temperature is normally distributed

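The first three tests below all give p-values greater than 0.05, so normality cannot be rejected at the 5% level: the body-temperature data are consistent with a normal distribution. The Anderson-Darling statistic in method 4 is also below every listed critical value, which points the same way.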
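The decision rule behind this statement can be sketched as follows, using the p-values reported in the outputs of methods 1 to 3 further down in this section. The 0.05 threshold is the conventional choice used in the text, not something the tests dictate.

```python
ALPHA = 0.05  # significance level used in the text

# p-values as reported in the outputs of methods 1 to 3 in this section
p_values = {
    "normaltest": 0.2587479863488212,
    "shapiro": 0.233174666762352,
    "kstest": 0.6450307317439967,
}

for name, p in p_values.items():
    # A p-value above ALPHA means the null hypothesis (normality) is not rejected
    verdict = "fail to reject normality" if p > ALPHA else "reject normality"
    print(f"{name}: p = {p:.4f} -> {verdict}")
```

Failing to reject is weaker than proving normality: it only says the sample gives no strong evidence against it.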

Method 1: scipy.stats.normaltest(a, axis=0)

# a - the data to test
# axis - an integer or None. If None, the whole array is tested as a single data set. The default is 0, meaning the test is computed along axis 0.

import scipy.stats
scipy.stats.normaltest(df['体温'])

NormaltestResult(statistic=2.703801433319236, pvalue=0.2587479863488212)
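`normaltest` implements the D'Agostino-Pearson omnibus test, which combines a skewness test and a kurtosis test. The sketch below rebuilds its statistic from `scipy.stats.skewtest` and `scipy.stats.kurtosistest` and recovers the p-value from a chi-squared distribution with 2 degrees of freedom. The sample `x` is synthetic stand-in data (an assumption); replace it with `df['体温'].values` for the real data.

```python
import numpy as np
from scipy import stats

# Synthetic stand-in data (an assumption); replace with df['体温'].values.
rng = np.random.default_rng(42)
x = rng.normal(loc=98.25, scale=0.73, size=130)

z_skew = stats.skewtest(x).statistic      # z-score of the sample skewness
z_kurt = stats.kurtosistest(x).statistic  # z-score of the sample kurtosis
k2 = z_skew**2 + z_kurt**2                # omnibus statistic

result = stats.normaltest(x)
p_from_chi2 = stats.chi2.sf(k2, df=2)     # upper tail of chi-squared(2)

print(k2, result.statistic)
print(p_from_chi2, result.pvalue)
```

This makes clear what the test is sensitive to: departures from normality that show up as asymmetry or as unusually heavy or light tails.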

Method 2: Shapiro-Wilk test, scipy.stats.shapiro(x)

Parameters: x - the data to test

Returns: W - the test statistic; p-value - the p-value

scipy.stats.shapiro(df['体温'].values)

(0.9865770936012268, 0.233174666762352)
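To get a feel for how W and the p-value behave, the following sketch runs `scipy.stats.shapiro` on two synthetic samples of the same size as the data set: one drawn from a normal distribution and one from a clearly skewed exponential distribution. Both samples are assumptions made for illustration and have nothing to do with the temperature data.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
normal_sample = rng.normal(size=130)       # should look normal
skewed_sample = rng.exponential(size=130)  # strongly right-skewed

for label, sample in [("normal", normal_sample), ("exponential", skewed_sample)]:
    w, p = stats.shapiro(sample)
    print(f"{label}: W = {w:.4f}, p = {p:.4g}")
```

W close to 1 with a large p-value, as in the output above for the temperature data, is what a sample from a normal distribution typically produces; the skewed sample should give a visibly smaller W and a very small p-value.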

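Method 3: Kolmogorov-Smirnov test, scipy.stats.kstest(rvs, cdf, args=(), N=20, alternative='two-sided', mode='approx')

rvs - the data to test; may be a string or an array

cdf - the distribution to test against; set to 'norm' here, which makes this a normality test

args - the parameters of that distribution; here the sample mean and standard deviation

alternative - one-sided or two-sided test; the default is 'two-sided'

Returns: D - the KS test statistic; p-value - the p-value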

u = df['体温'].mean()
std = df['体温'].std()
scipy.stats.kstest(df['体温'].values,'norm',args=(u,std))

KstestResult(statistic=0.06472685044046644, pvalue=0.6450307317439967)
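Passing `args=(u, std)` is equivalent to standardizing the data first and testing against the standard normal distribution, as the sketch below shows on synthetic stand-in data (an assumption; use `df['体温'].values` for the real data). One caveat worth keeping in mind: because the mean and standard deviation are estimated from the same sample, the KS p-value tends to be too large, and the Lilliefors variant of the test is usually suggested to correct for this.

```python
import numpy as np
from scipy import stats

# Synthetic stand-in data (an assumption); replace with df['体温'].values.
rng = np.random.default_rng(7)
x = rng.normal(loc=98.25, scale=0.73, size=130)
mu, sigma = x.mean(), x.std(ddof=1)  # ddof=1 matches pandas' Series.std()

# Test against N(mu, sigma^2) directly ...
direct = stats.kstest(x, "norm", args=(mu, sigma))
# ... or standardize first and test against N(0, 1).
standardized = stats.kstest((x - mu) / sigma, "norm")

print(direct.statistic, standardized.statistic)
print(direct.pvalue, standardized.pvalue)
```

Without `args`, `kstest(x, 'norm')` would compare the raw temperatures with a standard normal distribution and reject immediately, which is why the parameters must be supplied one way or the other.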

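Method 4: Anderson-Darling test, scipy.stats.anderson(x, dist='norm')

The Anderson-Darling test is a refinement of the Kolmogorov-Smirnov test. It can test against several distributions, including the normal, exponential, logistic and Gumbel distributions. The default is dist='norm', a normality test.

Parameters: x - the data to test; dist - the distribution to test against

Returns: statistic - the test statistic; critical_values - the critical values; significance_level - the significance levels (in percent) that correspond to the critical values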

scipy.stats.anderson(df['体温'].values,dist="norm")

AndersonResult(statistic=0.5201038826714353, critical_values=array([0.56 , 0.637, 0.765, 0.892, 1.061]), significance_level=array([15. , 10. ,  5. ,  2.5,  1. ]))
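Unlike the first three tests, this result has no p-value; it is read by comparing the statistic with each critical value. A minimal sketch of that comparison, using the numbers from the output above:

```python
# Values copied from the AndersonResult shown above.
statistic = 0.5201038826714353
critical_values = [0.56, 0.637, 0.765, 0.892, 1.061]
significance_levels = [15.0, 10.0, 5.0, 2.5, 1.0]  # in percent

for crit, level in zip(critical_values, significance_levels):
    # Normality is rejected at a given level only if the statistic exceeds
    # the critical value for that level.
    decision = "reject" if statistic > crit else "fail to reject"
    print(f"{level:>4}% level: critical value {crit:.3f} -> {decision} normality")
```

The statistic is below every critical value, including the one for the 15% level, so this test does not reject normality either.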
