Good franchise projects for getting rich? A simple visual analysis of what is actually on 78网

2016/12/24

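The data consists of 768 records scraped from 78网 earlier this month. The dataset is small, but there is still plenty in it worth digging into. For lack of time, this post only does some simple data processing and visualization.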

I. Dataset overview

	aear	cate	subcate	name	maxmoney	minmoney	activetime
0	广东省	服装鞋包	女装	金蝶茜妮服饰	3	None	22
1	湖北省	服装鞋包	女装	优尚美女装	5	3	2
2	湖北省	服装鞋包	女装	爱依莲服饰	5	3	9
3	北京市	服装鞋包	品牌	梦回唐朝布鞋	5	3	22
4	广东省	服装鞋包	女装	优尚美女装	3	None	4
5	广东省	服装鞋包	女装	优尚美女装	3	None	20
6	广东省	服装鞋包	品牌	凯缇猫童装	3	None	8
7	湖北省	服装鞋包	女装	亲闺密语内衣	5	3	0
8	广东省	服装鞋包	品牌	太阳公公童装	3	None	10
9	山东省	特色餐饮	烧烤	一锅两头牛火锅	50	20	22
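To try the snippets in this post without access to the scraped data, the ten rows shown above can be rebuilt by hand. This is only a sketch: how the full dataset is loaded is not shown here, and treating the None entries as missing values is an assumption.

```python
import pandas as pd

# The ten sample rows from the overview, typed in by hand.
# Column names (including the spelling "aear") follow the original dataset.
rows = [
    ("广东省", "服装鞋包", "女装", "金蝶茜妮服饰", 3, None, 22),
    ("湖北省", "服装鞋包", "女装", "优尚美女装", 5, 3, 2),
    ("湖北省", "服装鞋包", "女装", "爱依莲服饰", 5, 3, 9),
    ("北京市", "服装鞋包", "品牌", "梦回唐朝布鞋", 5, 3, 22),
    ("广东省", "服装鞋包", "女装", "优尚美女装", 3, None, 4),
    ("广东省", "服装鞋包", "女装", "优尚美女装", 3, None, 20),
    ("广东省", "服装鞋包", "品牌", "凯缇猫童装", 3, None, 8),
    ("湖北省", "服装鞋包", "女装", "亲闺密语内衣", 5, 3, 0),
    ("广东省", "服装鞋包", "品牌", "太阳公公童装", 3, None, 10),
    ("山东省", "特色餐饮", "烧烤", "一锅两头牛火锅", 50, 20, 22),
]
columns = ["aear", "cate", "subcate", "name", "maxmoney", "minmoney", "activetime"]
data = pd.DataFrame(rows, columns=columns)

# Rows without a minimum price are treated as missing values
missing_min = int(data["minmoney"].isna().sum())
print(data.shape, missing_min)
```

With only ten rows the charts will of course look different from the ones described in the text.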

II. Loading the libraries needed for the analysis

The modules involved are pymongo, pandas, matplotlib and numpy.

import pymongo
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
%matplotlib inline

# Make Chinese labels and the minus sign render correctly
import matplotlib as mpl
mpl.rcParams['font.sans-serif'] = ['SimHei']
mpl.rcParams['axes.unicode_minus'] = False

III. A first look at the data

List the regions where the projects are headquartered:

pd.unique(data.aear)

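Result:

广东省, 湖北省, 北京市, 山东省, 河北省, 黑龙江省, 上海市, 江苏省, 四川省, 山西省, 辽宁省, 安徽省, 浙江省, 湖南省, 河南省, 未知, 福建省, 陕西省

78网 lists projects from 17 provinces and municipalities, plus one group labelled "未知" (unknown) whose origin is not given.

First, let's see how many projects come from each region.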

# Group the dataset by region
aear_group = data.groupby('aear')
aear_group_sums = aear_group['name'].count().sort_values()
# Mean number of projects per region
aear_group_mean = aear_group_sums.mean()

# Horizontal bar chart of the number of projects per region
x = np.arange(len(aear_group_sums))
plt.figure(figsize=(10,5))
plt.title("78网加盟项目区域分布直方图")
plt.yticks(x, aear_group_sums.index)
plt.barh(x, aear_group_sums, align='center', color='dodgerblue')
# Red vertical line marking the mean
plt.axvline(aear_group_mean, c='r')
plt.grid()
# Write each count next to its bar
for a, b in zip(x, aear_group_sums.values):
    plt.text(x=b+1, y=a-0.3, s=str(b))
plt.text(x=aear_group_mean-12, y=0, s="平均项目数量"+str(int(aear_group_mean)))

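The project counts fall into three tiers: above 100 in the first, between 70 and 80 in the second, and below 50 in the third. Projects from Shandong and Beijing together make up nearly half of the total.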
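As a shorter alternative to the groupby above, the per-region totals and the comparison against the mean can be sketched with value_counts. The region list here is just the aear column of the ten sample rows from the overview, so the numbers are not those of the full dataset; the two approaches agree as long as no project name is missing.

```python
import pandas as pd

# Region column of the ten sample rows from the overview
aear = pd.Series(
    ["广东省", "湖北省", "湖北省", "北京市", "广东省",
     "广东省", "广东省", "湖北省", "广东省", "山东省"],
    name="aear",
)

# Number of projects per region, smallest first
counts = aear.value_counts().sort_values()
mean_count = counts.mean()

# Regions above the mean are the bars that cross the red line in the chart
above_mean = sorted(counts[counts > mean_count].index)
print(counts.to_dict(), mean_count, above_mean)
```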

Next, let's see which categories the projects fall into.

# Group the dataset by cate
data_groupby_cate = data.groupby('cate')
# Number of projects in each category
data_groupby_cate_count = data_groupby_cate['name'].count().sort_values()
# Total number of projects (despite the name, this is a count, not a percentage)
data_groupby_cate_percentage = data_groupby_cate_count.sum()
data_groupby_cate_percentage
data_groupby_cate_count

饰品玩具 7
教育网络 23
美容养生 28
服装鞋包 71
节能环保 72
家居建材 87
生活服务 135
特色餐饮 345

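There are 8 categories. From most to fewest projects they are: 特色餐饮 (specialty catering), 生活服务 (lifestyle services), 家居建材 (home furnishing and building materials), 节能环保 (energy saving and environmental protection), 服装鞋包 (clothing, shoes and bags), 美容养生 (beauty and wellness), 教育网络 (education and internet), and 饰品玩具 (accessories and toys).

A pie chart shows the split: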

# Pie chart of the project categories
plt.figure(figsize=(6,6))
# Optionally use the ggplot colour scheme
# plt.style.use('ggplot')
labels = []
colors = ['r','g','b','c','m','y','k','tan']
# Label each slice with the category name and its share of all projects
for i, p in zip(data_groupby_cate_count.index, data_groupby_cate_count.values):
    s = i + str(round(p/data_groupby_cate_percentage*100, 2)) + '%'
    labels.append(s)
plt.title("78网创业加盟项目分类饼图")
plt.pie(data_groupby_cate_count, labels=labels, colors=colors)

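78网 lists 768 franchise projects in total. Specialty catering accounts for 44.92% of them, and lifestyle services for close to 20%. Home furnishing and building materials, energy saving and environmental protection, and clothing, shoes and bags are close to one another at around 10% each. The smallest category is accessories and toys, with less than 1% of the projects.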
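The shares quoted here can be checked directly from the category counts listed in the output above. A minimal sketch in plain Python:

```python
# Category counts as listed in the output above
counts = {
    "饰品玩具": 7, "教育网络": 23, "美容养生": 28, "服装鞋包": 71,
    "节能环保": 72, "家居建材": 87, "生活服务": 135, "特色餐饮": 345,
}

# Total number of projects and each category's share in percent
total = sum(counts.values())
shares = {cate: round(n / total * 100, 2) for cate, n in counts.items()}

# Print the categories from largest to smallest share
for cate, share in sorted(shares.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{cate}: {share}%")
```

The counts add up to 768, which matches the number of scraped records.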

Next, let's look at how each category is distributed across regions:

# Number of projects per region within each category
cate_list = list(data_groupby_cate_count.index)
n = 1
plt.figure(figsize=(15,10))
plt.subplots_adjust(hspace=0.5)
for c in cate_list:
    # Projects of category c, counted per region
    data_cate = data[data['cate'] == c]
    data_cate_sum_aear = data_cate.groupby('aear')['name'].count()
    x = np.arange(len(data_cate_sum_aear))
    plt.subplot(3,3,n)
    plt.xticks(x, data_cate_sum_aear.index, rotation=50)
    plt.bar(x, data_cate_sum_aear.values, align='center', color='purple')
    # Write each count above its bar
    for a, b in zip(x, data_cate_sum_aear.values):
        plt.text(x=a, y=b, s=str(b))
    plt.title(c, horizontalalignment='right')
    plt.grid()
    n += 1

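The panels above are ordered by category size, smallest first. Accessories and toys, the category with the fewest projects, also covers the fewest regions: only 4, with Beijing accounting for more than half.

In education and internet, the regions are Shanghai, Beijing, Zhejiang, Shandong and Hubei, all of them strong in education.

In beauty and wellness, Beijing is far ahead.

In clothing, shoes and bags, Guangdong, with its many factories, deservedly comes first. Surprisingly, Hubei comes second.

In energy saving and environmental protection, Shandong leads.

In home furnishing and building materials, Anhui, Fujian, Shandong and Beijing take the largest shares.

In lifestyle services, Beijing again takes the largest share.

In specialty catering, Shandong and Hebei stand out the most.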

Now a set of pie charts, one per region, shows how each region's projects split across categories:

# One pie chart of categories per region
plt.figure(figsize=(16,16))
# Use the ggplot colour scheme
plt.style.use('ggplot')
plt.subplots_adjust(hspace=0.5, wspace=0.5)
n = 1
for name, group in aear_group:
    # Number of projects per category within this region
    aear_group_cate_count = group.groupby('cate')['name'].count()
    plt.subplot(5,5,n)
    # Label each slice with the category name and its project count
    labels = []
    for i, p in zip(aear_group_cate_count.index, aear_group_cate_count.values):
        labels.append(i + str(p))
    plt.pie(aear_group_cate_count, labels=labels)
    plt.title(name)
    n += 1

# Bar chart of the number of categories present in each region
name_list = []
y = []
x = np.arange(len(aear_group_sums))
for name, group in aear_group:
    aear_group_cate_sum = group.groupby('cate')['name'].count()
    name_list.append(name)
    y.append(len(aear_group_cate_sum.index))
# Start a new figure rather than drawing into the last pie subplot
plt.figure()
plt.xticks(x, name_list, rotation=90)
plt.bar(x, y, align='center')
plt.title("78网各地区项目分类分布")
# Write each count above its bar
for a, b in zip(x, y):
    plt.text(x=a-0.25, y=b, s=str(b))

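Six regions have projects in only one category: Shanxi (specialty catering), Henan (specialty catering), Liaoning (specialty catering), Shaanxi (lifestyle services), Heilongjiang (specialty catering) and Fujian (home furnishing and building materials).

Regions with more categories: Beijing has 8, Shandong 7, Anhui and Guangdong 6 each, Jiangsu, Zhejiang and Hubei 5 each, and Shanghai and Hunan 4 each.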
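The per-region category counts can also be obtained in one step from a region-by-category table. The sketch below uses pd.crosstab on a hypothetical six-row sample (the real data has 768 rows), so its output only illustrates the shape of the result.

```python
import pandas as pd

# Hypothetical mini-sample with the same column names as the dataset
data = pd.DataFrame({
    "aear": ["北京市", "北京市", "北京市", "山东省", "山东省", "福建省"],
    "cate": ["特色餐饮", "生活服务", "特色餐饮", "特色餐饮", "节能环保", "家居建材"],
})

# Rows are regions, columns are categories, cells are project counts
table = pd.crosstab(data["aear"], data["cate"])

# Number of distinct categories present in each region
n_cates = (table > 0).sum(axis=1)

# Regions whose projects all belong to a single category
single = list(n_cates[n_cates == 1].index)
print(table)
print(n_cates.to_dict(), single)
```

The same table could feed a stacked bar chart, for example with table.plot(kind="bar", stacked=True), as an alternative to one pie chart per region.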

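That wraps up the visual analysis of regions and categories. When investing in a project, besides choosing a good project and a reliable service, the start-up capital is also an important factor.

Next, let's look at investment prices. After all, nothing that involves money is a small matter.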

# Investment price distribution
# fillna returns a new frame, so data itself is left unchanged
data_money = data.fillna(0)

plt.figure(figsize=(10,4))

# Count the projects at each maximum investment price
data_groupby_maxmoney = data_money.groupby('maxmoney')['maxmoney']
maxs = {}
for n, g in data_groupby_maxmoney:
    print(n, g.count())
    maxs[n] = g.count()
# Bar chart of the maximum investment price
plt.subplot(1,2,1)
x = np.arange(len(maxs))
xtick = list(maxs.keys())    # 3, 5, 10, 20, 50 for this dataset
y = list(maxs.values())      # 64, 396, 206, 85, 17 for this dataset
plt.xticks(x, xtick)
plt.title("78网项目投资最高价分布直方图")
plt.xlabel("项目投资最高价（万元）")
plt.ylabel("项目数量")
plt.bar(x, y, align='center')
for a, b in zip(x, y):
    plt.text(x=a, y=b-15, s=str(b))

# Count the projects at each minimum investment price (0 means none was given)
data_groupby_minmoney = data_money.groupby('minmoney')['minmoney']
mins = {}
for name, group in data_groupby_minmoney:
    print(name, group.count())
    mins[name] = group.count()
# Bar chart of the minimum investment price
plt.subplot(1,2,2)
x = np.arange(len(mins))
xtick = list(mins.keys())    # 0, 3, 5, 10, 20 for this dataset
y = list(mins.values())      # 64, 396, 206, 85, 17 for this dataset
plt.title("78网项目投资最低价格数量分布")
plt.xlabel("项目投资最低价（万元）")
plt.ylabel("项目数量")
plt.xticks(x, xtick)
plt.bar(x, y, align='center')
for a, b in zip(x, y):
    plt.text(x=a, y=b-15, s=str(b))

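Projects with an investment of 30,000-50,000 yuan are the most common, which is also the price range that ordinary investors in third- and fourth-tier cities tend to favour.

Projects at 50,000-100,000 yuan also make up a large share.

Projects at 200,000-500,000 yuan are the fewest, only 17. Generally speaking, investors with that much capital would not look for projects on this kind of franchise site.

Overall, projects at 30,000-100,000 yuan are the core of 78网 and the basis on which it attracts franchise investors. Projects in the other price ranges are mostly there as decoration, to make the site's offering look rich.
(The investment prices by category and by region could be explored further here.)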
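How dominant the 30,000-100,000 yuan range is can be worked out from the counts used in the two bar charts. The sketch assumes that both charts describe the same five (minimum, maximum) brackets, which the identical counts suggest but the post does not state outright.

```python
# (min, max) bracket in units of 10,000 yuan -> number of projects.
# Pairing the minimum and maximum prices into brackets is an assumption.
brackets = {(0, 3): 64, (3, 5): 396, (5, 10): 206, (10, 20): 85, (20, 50): 17}

total = sum(brackets.values())

# Projects in the 3-5 and 5-10 brackets, i.e. 30,000-100,000 yuan
core = brackets[(3, 5)] + brackets[(5, 10)]
core_share = round(core / total * 100, 1)
print(total, core, core_share)
```

Under that assumption, a little over three quarters of the projects sit in the 30,000-100,000 yuan range.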

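Finally, let's look at the projects by how long they have been active. A project that has been around for a long time intuitively seems seasoned and experienced, while a newly launched one makes the risk of joining more apparent. On the other hand, a mature project means fewer opportunities, while an emerging one may be a gold mine waiting to be dug.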

# Activity time of the projects
# Convert to numbers and keep the result of dropna (it returns a new Series)
activetime = pd.to_numeric(data['activetime'], errors='coerce').dropna()
# Mean, longest and shortest activity time, taken over the values themselves
# rather than over the number of projects sharing each value
means_time = activetime.mean()
max_time = activetime.max()
min_time = activetime.min()
t = [means_time, max_time, min_time]
plt.title("78网加盟项目活动时间比较")
plt.yticks([1,2,3], ["平均时间","最长时间","最短时间"])
plt.barh([1,2,3], t, align='center')

Across all projects, the longest time in existence is 120 months, the shortest is 1 month, and the average is 21 months.
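When summarising activity time, it matters whether the statistics are taken over the activetime values or over the number of projects that share each value, and the two are easy to mix up after a groupby. A small sketch in plain Python, using the activetime column of the ten sample rows from the overview:

```python
from collections import Counter
from statistics import mean

# activetime column of the ten sample rows from the overview
activetime = [22, 2, 9, 22, 4, 20, 8, 0, 10, 22]

# Statistics over the values themselves: mean, longest, shortest
value_stats = (mean(activetime), max(activetime), min(activetime))

# Statistics over the group sizes: how many projects share each value
group_sizes = list(Counter(activetime).values())
size_stats = (mean(group_sizes), max(group_sizes), min(group_sizes))
print(value_stats, size_stats)
```

For these ten rows the shortest activity time is 0, so the figures quoted for the full dataset are worth recomputing over the values themselves.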

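(The time in existence by category and by region could be explored further here.)

To finish, let's generate a word cloud from the project names.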

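1. Collect all the project names:

# Collect the names of all franchise projects
data_name = data['name']
# One name per line, so that words are not merged across neighbouring names
text = "\n".join(data_name)
with open("./78name.txt", 'w', encoding='utf-8') as file:
    file.write(text)

2. Segment the Chinese names into words:

# Chinese word segmentation
import jieba
with open("./78name.txt", "r", encoding='utf-8') as rfile:
    fenci = jieba.cut(rfile.read())
    text = " ".join(fenci)
# Save the space-separated words for the frequency count
with open("./fenci_78name.txt", "w", encoding='utf-8') as wfile:
    wfile.write(text)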

3. Word frequencies and word cloud

# Build the word cloud
import nltk
import wordcloud
from nltk.corpus import PlaintextCorpusReader

# Load the corpus: every .txt file in the current directory
corpath = "./"
wordlist = PlaintextCorpusReader(corpath, r'.*\.txt')
# Wrap the segmented names as an nltk Text
nameword = nltk.Text(wordlist.words("fenci_78name.txt"))
# Word frequencies
fdist = nltk.FreqDist(nameword)
# Keep words of 2 to 19 characters; recent wordcloud versions expect a dict
word_top = {}
for item in fdist:
    if 1 < len(item) < 20:
        word_top[item] = fdist[item]
# print(word_top)
# Create a WordCloud; font_path must point to a font that contains Chinese glyphs
wc = wordcloud.WordCloud(font_path='c:\\Python34\\Lib\\site-packages\\matplotlib\\mpl-data\\fonts\\ttf\\msyh.ttf')
# Load the words and their frequencies
wc.generate_from_frequencies(word_top)
plt.figure(figsize=(12,12))
plt.imshow(wc)
plt.grid(False)
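The frequency count and length filter above can also be sketched with the standard library alone. The token list below is hypothetical; real tokens would come from the jieba segmentation step.

```python
from collections import Counter

# Hypothetical segmentation output, for illustration only
tokens = ["优尚美", "女装", "爱依莲", "服饰", "优尚美",
          "女装", "金蝶茜妮", "服饰", "的", "火锅"]

# Count every token, then keep words of 2 to 19 characters
freq = Counter(tokens)
word_top = {w: c for w, c in freq.items() if 1 < len(w) < 20}

# Most frequent words first, ties broken by the word itself
top3 = sorted(word_top.items(), key=lambda kv: (-kv[1], kv[0]))[:3]
print(top3)
```

A dict of this shape is what generate_from_frequencies takes in recent versions of wordcloud.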

And finally, the word cloud: