A Collection of Commonly Used Machine Learning Datasets

UCI Machine Learning Adult Dataset

Business problem: classification (does a person earn more than 50K or not). Target variable: the income label. Predictors: country, age, education, occupation, marital status, etc.

Article: https://towardsdatascience.com/pandas-index-explained-b131beaf6f7b
Dataset: https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data
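
For reference, here is a minimal sketch (not part of the original list) of loading this dataset with pandas; the column names are taken from the UCI adult.names description.

# Minimal sketch: load the UCI Adult data with pandas.
# Column names follow the UCI adult.names description.
import pandas as pd

columns = [
    "age", "workclass", "fnlwgt", "education", "education-num",
    "marital-status", "occupation", "relationship", "race", "sex",
    "capital-gain", "capital-loss", "hours-per-week", "native-country", "income",
]
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data"
df = pd.read_csv(url, header=None, names=columns, skipinitialspace=True)

# binary label: does this person earn more than 50K?
df["label"] = (df["income"] == ">50K").astype(int)
print(df[["age", "education", "occupation", "label"]].head())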

Kaggle – Avazu: Click-Through Rate Prediction

Predict whether a mobile ad will be clicked
In online advertising, click-through rate (CTR) is a very important metric for evaluating ad performance. As a result, click prediction systems are essential and widely used for sponsored search and real-time bidding.

Kaggle link:
https://www.kaggle.com/c/avazu-ctr-prediction/overview

UCI – Adult Data Set ($50K)

Predict whether income exceeds $50K/yr based on census data. Also known as the "Census Income" dataset.
https://archive.ics.uci.edu/ml/datasets/Adult

UCI – Iris Data Set

This is perhaps the best known database to be found in the pattern recognition literature. Fisher's paper is a classic in the field and is referenced frequently to this day.
https://archive.ics.uci.edu/ml/datasets/Iris

Kaggle Titanic: Machine Learning from Disaster

Use machine learning to create a model that predicts which passengers survived the Titanic shipwreck.

https://www.kaggle.com/c/titanic

CTR Prediction: How Do You Feed a (Tag, Weight) List Feature into a Model?

Question:

I am building a CTR prediction model.

From earlier data mining, I have each user's tag-preference data:

[('tag1', 0.8), ('tag2', 0.65), ('tag3', 0.32), ('tag4', 0.05)]

Each element of the list has two components: the tag and the preference weight.

The question is: how do I feed such a feature into a model?

I only know that tags like tag1, tag2, tag3 are categorical features and can be one-hot encoded.

What I don't know is how to use the weights. I can think of two approaches: 1) one-hot encode tag1/tag2/tag3, find the position of each 1, and replace that 1 with the weight; 2) map all tags into one large array, find each tag's index in that array, set that position to the tag's weight, and leave everything else 0.

Which of these two is better, or is there another approach?

The same question shows up in other scenarios, e.g. a user's play history (a list of played item IDs), which aggregates into:

a list of [played itemid, frequency] pairs; how should that be fed into the model as a feature?

Answers:

1. You can use the weight directly as the value of that dimension: e.g. for a feature like favori_entity_延禧攻略, the feature value is simply the weight.

2. You can treat it as a special one-hot encoding: replace the 1 in the one-hot vector with the weight, and use it as a continuous-valued feature.

In other words, flatten the list into separate per-tag feature dimensions, as in the sketch below.
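
As a rough illustration of this flattening (the tag vocabulary here is made up), a dense weighted multi-hot vector can be built like this:

# Sketch: map each tag to a fixed index and put its weight at that position.
import numpy as np

tag_vocab = {"tag1": 0, "tag2": 1, "tag3": 2, "tag4": 3}  # assumed global tag index


def weighted_multi_hot(tag_weights, vocab):
    """Turn [(tag, weight), ...] into a dense vector with weights at tag positions."""
    vec = np.zeros(len(vocab), dtype=np.float32)
    for tag, weight in tag_weights:
        if tag in vocab:
            vec[vocab[tag]] = weight
    return vec


print(weighted_multi_hot([("tag1", 0.8), ("tag2", 0.65), ("tag3", 0.32), ("tag4", 0.05)], tag_vocab))
# -> [0.8  0.65 0.32 0.05]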

If you are using TensorFlow, the answer is in this article:
https://zhuanlan.zhihu.com/p/41663141
It is extremely detailed. It covers: 1) how to handle a single feature with multiple category values, e.g. [('cat_a', 'cat_b'), ('cat_b', 'cat_c')]; 2) how to handle a weighted categorical list, e.g. a user's tag-preference list: [('IT', 0.8), ('music', 0.6)].

For the first case, use categorical_column_with_vocabulary_list plus indicator_column; for the second, use weighted_categorical_column.
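
A minimal sketch of these two feature columns with the TF 1.x feature-column API; the key names "tags" and "tag_weights" and the vocabulary are assumptions for illustration.

import tensorflow as tf

# 1) multi-valued categorical feature: vocabulary list + indicator_column
tags = tf.feature_column.categorical_column_with_vocabulary_list(
    key="tags", vocabulary_list=["tag1", "tag2", "tag3", "tag4"])
tags_indicator = tf.feature_column.indicator_column(tags)

# 2) weighted categorical list: the weight replaces the 1 in the multi-hot encoding
weighted_tags = tf.feature_column.weighted_categorical_column(
    categorical_column=tags, weight_feature_key="tag_weights")
weighted_indicator = tf.feature_column.indicator_column(weighted_tags)

# one user with the preference list from the question above
features = {
    "tags": tf.constant([["tag1", "tag2", "tag3", "tag4"]]),
    "tag_weights": tf.constant([[0.8, 0.65, 0.32, 0.05]]),
}
# in TF 1.x graph mode, evaluate `dense` inside a tf.Session to see the values
dense = tf.feature_column.input_layer(features, [weighted_indicator])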

Building a Convolutional Network for Text Classification with PaddlePaddle

PaddlePaddle is Baidu's open-source deep learning framework. It builds deep neural networks from layers, much like Caffe, and a new Fluid version is being rolled out that offers operator-level network construction. I recently had a text classification task and experimented with Paddle; my impressions of using it are:

  • The documentation is incomplete and rather sparse
  • The model zoo is good: even for usage I did not understand, I could search the example code to find out how things are used
  • Issues on GitHub get answered fairly quickly

Paddle is clearly being heavily promoted and actively developed; if anyone on the Paddle team reads this, please flesh out the documentation.

My input data has two parts:

  • A user profile, including gender, age, occupation, and similar attributes
  • The user's search query list, kept in time order, after word segmentation and stop-word filtering

After running the categorical and word features through StringIndexer (done in Spark; a minimal sketch follows the sample below), the input data looks like this:

1	0	6	7	3	1	11069|36027|15862|11069|48152|36027|11069|33830|48152|36027|11069|50730|11069|50730|11069|47002	1
1	0	6	7	3	1	62151|21292|21666|53679|21292|21666|34384|26807|53680|381|2992|64045|2992|69922|62902|3346	0

Each single-number column is a categorical profile attribute, the |-separated list of numbers is the word list, and the last column is the classification target.
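
For completeness, here is a minimal PySpark sketch of the StringIndexer step mentioned above; the column names and values are made up for illustration.

from pyspark.ml.feature import StringIndexer
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("string-indexer-demo").getOrCreate()
df = spark.createDataFrame(
    [("male", "engineer"), ("female", "teacher"), ("male", "teacher")],
    ["gender", "job"])

# map each categorical string column to an integer index
for col in ["gender", "job"]:
    df = StringIndexer(inputCol=col, outputCol=col + "_idx").fit(df).transform(df)

df.show()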

The implementation has three parts: the data reader, the training code, and the inference code.

Implementing the data reader

# coding:utf8


"""
Implements the data readers that Paddle needs.
It covers:
1. reading the whole dataset;
2. splitting it into a training set and a test set;
3. providing reader() functions that Paddle can use directly, yielding samples via yield.
"""

import random


def read_datas():
    """
    Read the whole dataset.
    :return:
    """
    fpath = "./inputdatas_numbers.txt"

    results_datas = []
    for line in open(fpath):
        line = line[:-1]
        if not line or len(line) == 0:
            continue
        fields = line.split("\t")
        if len(fields) != 8: continue

        # extract the fields by attribute
        gender, age, lifeStage, trade, educationalLevel, job, words, label = fields

        results_datas.append([
            int(float(gender)),
            int(float(age)),
            int(float(lifeStage)),
            int(float(trade)),
            int(float(educationalLevel)),
            int(float(job)),
            # note: Paddle sequence data (sequence_data) is simply an element that is itself a list
            [int(x) for x in words.split("|")],
            int(float(label))
        ])

    return results_datas


def split_data_train_test(results_datas, rand_seed=37, test_ratio=0.1):
    """
    Split the data into a training set and a test set by random sampling.
    :param results_datas: the full dataset
    :param rand_seed: seed for the random generator
    :param test_ratio: fraction of samples to put in the test set
    :return: (train_data, test_data)
    """
    rand = random.Random(x=rand_seed)
    train_data, test_data = [], []
    for line in results_datas:
        if rand.random() > test_ratio:
            train_data.append(line)
        else:
            test_data.append(line)
    return train_data, test_data


def split_data_train_test_avg(results_datas, test_ratio=0.03):
    """
    Sort by the label column (descending), then sample the test set uniformly,
    so both sets keep roughly the same label distribution.
    :param results_datas: the full dataset
    :param test_ratio: fraction of samples to put in the test set
    :return: (train_data, test_data)
    """
    total_len = len(results_datas)
    test_datas_cnt = total_len * test_ratio
    sample_gap = int(total_len * 1.0 / test_datas_cnt)

    # sort by the label field (index 7); the original cmp= lambda returned a
    # bool instead of -1/0/1, so key= is used here instead
    sort_datas = sorted(results_datas, key=lambda x: int(x[7]), reverse=True)

    train_data, test_data = [], []

    for i in range(len(sort_datas)):
        if i % sample_gap == 0:
            test_data.append(sort_datas[i])
        else:
            train_data.append(sort_datas[i])

    return train_data, test_data


results_datas = read_datas()
print "数据读取完毕:", len(results_datas)
train_data, test_data = split_data_train_test_avg(results_datas, 0.03)

print "data read over."
print "训练集合数据大小:", len(train_data)
print "测试集合数据大小:", len(test_data)


def train_reader():
    """
    Required by Paddle: yields the training samples.
    :return:
    """
    for line in train_data:
        yield line


def test_reader():
    """
    Required by Paddle: yields the test samples.
    :return:
    """
    for line in test_data:
        yield line


if __name__ == "__main__":
    print len(train_data)
    print len(test_data)

The thing to note is the meaning of Paddle's sequence data: it is simply a list, so the query word list has to be packed as the nested list inside [x, y, list(), z].

The output of running this script:

data loaded: 20088
data read over.
train set size: 19479
test set size: 609
19479
609
[1, 0, 6, 7, 3, 1, [62151, 21292, 21666, 53679, 21292, 21666, 34384, 26807, 53680, 381, 2992, 64045, 2992, 69922, 62902, 3346], 0]


Implementing the training code

# coding:utf8


"""
Build the deep network with Paddle and train it.
"""

import os
import paddle.v2 as paddle
import reader_paddle_sequence

import sys

# This dict is the input schema; e.g. fields[feeding["gender"]] gives the gender value
feeding = {
    'gender': 0,
    'age': 1,
    'lifeStage': 2,
    'trade': 3,
    'educationalLevel': 4,
    'job': 5,
    'words': 6,
    'label': 7
}


def convr_perceptron():
    """
    The assembled convolutional network.
    :return: the prediction layer
    """
    # convolution layers over the word sequence
    conv1, conv2 = get_words_conv()
    # user-profile feature layer
    features_fc = get_usr_combined_features()

    # concatenate the convolution layers and the profile layer
    concat_layer = paddle.layer.concat(
        input=[
            features_fc, conv1, conv2
        ])
    # add a dropout layer to reduce overfitting
    dropout_layer = paddle.layer.dropout(input=concat_layer, dropout_rate=0.6)

    # final classification layer: softmax over the 6 classes
    predict = paddle.layer.fc(input=dropout_layer, size=6, act=paddle.activation.Softmax())

    return predict


def get_usr_combined_features():
    """
    User-profile features: each attribute goes through an embedding and an FC layer.
    :return:
    """
    gender = paddle.layer.data(name='gender', type=paddle.data_type.integer_value(2))
    gender_emb = paddle.layer.embedding(input=gender, size=16)
    gender_fc = paddle.layer.fc(input=gender_emb, size=16)

    age = paddle.layer.data(name='age', type=paddle.data_type.integer_value(6))
    age_emb = paddle.layer.embedding(input=age, size=16)
    age_fc = paddle.layer.fc(input=age_emb, size=16)

    lifeStage = paddle.layer.data(name='lifeStage', type=paddle.data_type.integer_value(10))
    lifeStage_emb = paddle.layer.embedding(input=lifeStage, size=16)
    lifeStage_fc = paddle.layer.fc(input=lifeStage_emb, size=16)

    trade = paddle.layer.data(name='trade', type=paddle.data_type.integer_value(23))
    trade_emb = paddle.layer.embedding(input=trade, size=16)
    trade_fc = paddle.layer.fc(input=trade_emb, size=16)

    educationalLevel = paddle.layer.data(name='educationalLevel', type=paddle.data_type.integer_value(4))
    educationalLevel_emb = paddle.layer.embedding(input=educationalLevel, size=16)
    educationalLevel_fc = paddle.layer.fc(input=educationalLevel_emb, size=16)

    job = paddle.layer.data(name='job', type=paddle.data_type.integer_value(6))
    job_emb = paddle.layer.embedding(input=job, size=16)
    job_fc = paddle.layer.fc(input=job_emb, size=16)

    usr_combined_features = paddle.layer.fc(
        input=[gender_fc, age_fc, lifeStage_fc, trade_fc, educationalLevel_fc, job_fc],
        size=200,
        act=paddle.activation.Tanh())

    return usr_combined_features


def get_words_conv():
    """
    The word-sequence input, fed through convolution layers.
    :return:
    """
    # vocabulary size; this number comes from the size of the word index list built after segmentation
    word_dict_len = 73614
    emb_dim = 8
    hid_dim = 256

    # note: integer_value_sequence means a list like [1, 2, 3, 4]
    words = paddle.layer.data(name='words', type=paddle.data_type.integer_value_sequence(word_dict_len))
    words_emb = paddle.layer.embedding(input=words, size=emb_dim)

    # build the convolution layers; several can be used with different context lengths
    conv1 = paddle.networks.sequence_conv_pool(input=words_emb, context_len=3, hidden_size=hid_dim)
    conv2 = paddle.networks.sequence_conv_pool(input=words_emb, context_len=4, hidden_size=hid_dim)
    return conv1, conv2


def train():
    """
    Run training.
    """
    # initialize paddle
    paddle.init(use_gpu=False, trainer_count=1)
    # network config
    y = paddle.layer.data(name='label', type=paddle.data_type.integer_value(6))

    # get the network's prediction output
    y_predict = convr_perceptron()

    # set the cost to the classification cost
    cost = paddle.layer.classification_cost(input=y_predict, label=y)

    # randomly initialize the parameters
    parameters = paddle.parameters.create(cost)

    # create the optimizer; mainly sets L2 regularization and the learning rate
    adam_optimizer = paddle.optimizer.Adam(
        learning_rate=2e-4,
        regularization=paddle.optimizer.L2Regularization(rate=0.9),
        model_average=paddle.optimizer.ModelAverage(average_window=0.5, max_average_window=10000))

    # use SGD as the trainer
    trainer = paddle.trainer.SGD(cost=cost, parameters=parameters, update_equation=adam_optimizer)

    # save the inference topology; it will be used later for inference
    inference_topology = paddle.topology.Topology(layers=y_predict)
    with open("inference_topology_conv.pkl", 'wb') as f:
        inference_topology.serialize_for_inference(f)

    # record training and test error per pass, for plotting curves later
    fout_pass_err = open("train_pass_error_conv.txt", "w")
    fout_pass_err.write("passid\ttest_data_accurcy\ttrain_data_accurcy\n")

    # event handler: log progress and save checkpoints each pass
    def event_handler(event):
        if isinstance(event, paddle.event.EndIteration):
            if event.batch_id % 100 == 0:
                print "\nPass %d, Batch %d, Cost %f, %s" % (
                    event.pass_id, event.batch_id, event.cost, event.metrics)
            else:
                sys.stdout.write('.')
                sys.stdout.flush()
        if isinstance(event, paddle.event.EndPass):
            with open('./params_pass_conv_%d.tar' % event.pass_id, 'w') as f:
                trainer.save_parameter_to_tar(f)

            result_test = trainer.test(
                reader=paddle.batch(
                    paddle.reader.shuffle(reader_paddle_sequence.test_reader, buf_size=50000),
                    batch_size=100),
                feeding=feeding)
            print "\nTest with Pass %d, %s" % (event.pass_id, result_test.metrics["classification_error_evaluator"])

            result_train = trainer.test(
                reader=paddle.batch(
                    paddle.reader.shuffle(reader_paddle_sequence.train_reader, buf_size=50000),
                    batch_size=100),
                feeding=feeding)
            print "\nTrain with Pass %d, %s" % (event.pass_id, result_train.metrics["classification_error_evaluator"])

            fout_pass_err.write("%s\t%s\t%s\n" % (
                str(event.pass_id),
                str(float(result_test.metrics["classification_error_evaluator"])),
                str(float(result_train.metrics["classification_error_evaluator"]))
            )
                                )
            fout_pass_err.flush()

    # run training
    trainer.train(
        reader=paddle.batch(
            paddle.reader.shuffle(reader_paddle_sequence.train_reader, buf_size=50000),
            batch_size=100),
        feeding=feeding,
        event_handler=event_handler,
        num_passes=300)

    fout_pass_err.flush()
    fout_pass_err.close()


if __name__ == '__main__':
    train()


After training, files like these appear in the current directory:

-rw-r--r--  1 baidu  staff   2.4M May 22 16:06 params_pass_conv_0.tar
-rw-r--r--  1 baidu  staff   2.4M May 22 16:06 params_pass_conv_1.tar
-rw-r--r--  1 baidu  staff   2.4M May 22 16:07 params_pass_conv_2.tar
-rw-r--r--  1 baidu  staff   6.0K May 17 16:14 inference_topology.pkl


The training process also prints progress information:

I0523 15:53:02.666565 2921214848 Util.cpp:166] commandline:  --use_gpu=False --trainer_count=1 
I0523 15:53:02.690153 2921214848 GradientMachine.cpp:94] Initing parameters..
I0523 15:53:02.708933 2921214848 GradientMachine.cpp:101] Init parameters done.

Pass 0, Batch 0, Cost 1.769033, {'classification_error_evaluator': 0.7699999809265137}
...................................................................................................
Pass 0, Batch 100, Cost 1.773714, {'classification_error_evaluator': 0.8100000023841858}
..............................................................................................
Test with Pass 0, 0.750410497189

Train with Pass 0, 0.740233063698

Pass 1, Batch 0, Cost 1.779777, {'classification_error_evaluator': 0.75}
...................................................................................................
Pass 1, Batch 100, Cost 1.677371, {'classification_error_evaluator': 0.7200000286102295}
..............................................................................................
Test with Pass 1, 0.610837459564

Train with Pass 1, 0.5671235919


I compared GPU and CPU training: the GPU really is many times faster; deep learning is indeed a technology built on GPUs.

You can also plot the accuracy curves on the training and test sets:

[Figure: training and test accuracy per pass]

As the curves show, the model reaches its best point around pass 12 and then starts to overfit; the best overall accuracy is 88%.

Repeatedly tuning the dropout and L2 regularization parameters did not improve on this accuracy, so I stopped tuning; improving it further will require collecting more data.

Using the model for prediction

Now that training is done, how do we use the model? See the code:

# coding:utf8

"""
Run prediction with the trained model.
"""

import os
import paddle.v2 as paddle
import reader_paddle_sequence

import sys

# must be the same feeding dict used during training
feeding = {
    'gender': 0,
    'age': 1,
    'lifeStage': 2,
    'trade': 3,
    'educationalLevel': 4,
    'job': 5,
    'words': 6,
    'label': 7
}


def test():
    paddle.init(use_gpu=False, trainer_count=1)
    # load the parameter file from the best pass
    tarfn = "params_pass_conv_1.tar"
    # load the model topology file
    topology_filepath = "inference_topology_conv.pkl"

    # load the parameters and the topology into an Inference object
    with open(tarfn) as param_f, open(topology_filepath) as topo_f:
        params = paddle.parameters.Parameters.from_tar(param_f)
        inferer = paddle.inference.Inference(parameters=params, fileobj=topo_f)

    # run inference on one sample from the test set
    # this also shows that prediction inputs must be preprocessed into the same format as the train/test data
    reader = reader_paddle_sequence.test_reader
    for k in reader():
        print k[:-1]
        res = inferer.infer(input=[k,], feeding=feeding)
        print res
        break


if __name__ == '__main__':
    # run prediction
    test()


The output:

I0523 16:00:56.536753 2921214848 Util.cpp:166] commandline:  --use_gpu=False --trainer_count=1 
[1, 0, 6, 7, 3, 1, [11069, 36027, 15862, 11069, 48152, 36027, 11069, 33830, 48152, 36027, 11069, 50730, 11069, 50730, 11069, 47002]]
[[0.25260347 0.15734845 0.1648475  0.17209302 0.14046918 0.11263847]]


The last line prints the predicted probabilities for the 6 classes.

Summary

The above walks through the whole workflow of building a neural network with Paddle: reading data, building the network, training, and applying the model.

A trained model can be loaded with Python Flask or Django and exposed as a remote service, as in the sketch below.
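
As a rough Flask sketch (not from the original post): the endpoint name and JSON request format are my own assumptions, while the parameter and topology file names are the ones produced by the training script above.

import paddle.v2 as paddle
from flask import Flask, jsonify, request

app = Flask(__name__)

# load parameters and topology once at startup, as in the prediction script
paddle.init(use_gpu=False, trainer_count=1)
with open("params_pass_conv_1.tar") as param_f, open("inference_topology_conv.pkl") as topo_f:
    params = paddle.parameters.Parameters.from_tar(param_f)
    inferer = paddle.inference.Inference(parameters=params, fileobj=topo_f)

# must match the feeding dict used during training
feeding = {'gender': 0, 'age': 1, 'lifeStage': 2, 'trade': 3,
           'educationalLevel': 4, 'job': 5, 'words': 6, 'label': 7}


@app.route("/predict", methods=["POST"])
def predict():
    # expected JSON body (assumed format):
    # {"sample": [gender, age, lifeStage, trade, educationalLevel, job, [word ids...], label]}
    sample = request.get_json()["sample"]
    probs = inferer.infer(input=[sample], feeding=feeding)
    return jsonify({"probabilities": probs.tolist()})


if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)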

Once one network has been built and trained end to end, Paddle becomes much easier to understand. I also tried fully connected and LSTM networks; they are very similar to the convolutional one, and only the convolution part needs to be replaced.

Although Paddle is still not fully polished, it is a home-grown deep learning framework and it does get the job done, so it deserves continued support.


Original post: http://crazyant.net/2177.html. Please credit the source when reposting.

Clustering Word2vec Output with KMeans

Word2vec produces a weight vector for every word.

Using these vectors, all of the words can be clustered directly.

The following code takes a Word2vec model as input, trains KMeans, and iterates over K, computing the WSSSE for each value so a suitable K can be picked (via the elbow method below).

    /**
      * Use the word2vec vectors as the KMeans input for clustering; iterate over
      * several values of K and compute the WSSSE for each.
      * @param spark
      * @param model
      */
    def word2vecToKmeans(spark: SparkSession, model: org.apache.spark.mllib.feature.Word2VecModel) = {
        import org.apache.spark.mllib.clustering.{KMeans, KMeansModel}
        import org.apache.spark.mllib.linalg.Vectors

        // val parsedData = data.map(s => Vectors.dense(s.split(' ').map(_.toDouble))).cache()
        val parsedData = model.getVectors.map(row => Vectors.dense(row._2.map(_.toDouble))).toSeq
        val parsedDataRDD = spark.sparkContext.parallelize(parsedData).cache()

        // Cluster the data with KMeans for K = 2 to 20

        val numKList = 2 to 20

        numKList.foreach(
            k => {
                val numIterations = 50
                val clusters = KMeans.train(parsedDataRDD, k, numIterations)

                // Evaluate clustering by computing Within Set Sum of Squared Errors
                val WSSSE = clusters.computeCost(parsedDataRDD)
                println(s"K==${k}, Within Set Sum of Squared Errors = $WSSSE")
            }
        )
    }

This uses the RDD-based spark.mllib library.

The resulting K-to-WSSSE values are:

2	737409.9793517443
3	680667.1717807942
4	646796.9586209953
5	621979.831387794
6	600079.2948154274
7	583517.901818578
8	568308.9391577758
9	558225.3643934435
10	553948.317112428
11	548844.8163327919
12	534551.2249848123
13	530924.4903488192
14	525710.9272857339
15	523946.17442620965
16	516929.85870202346
17	511611.2490293131
18	510014.93372050225
19	503478.81601442746
20	500293.188117236


The following code plots the curve:

#coding:utf8

import matplotlib.pyplot as plt

x = []
wssse = []
for line in open("kmeans_k_wssse.txt"):
    line = line[:-1]
    fields = line.split("\t")
    if len(fields) != 2:
        continue
    x.append(int(fields[0]))
    wssse.append(float(fields[1]))

plt.xlabel('k')
plt.ylabel('SSE')
plt.plot(x,wssse,'o-')
plt.show()


The resulting figure:

[Figure: WSSSE vs. K (elbow plot)]

It is not a perfect elbow, but the bend is roughly at K = 8 or 9, so clustering with 8 or 9 centers is a reasonable choice.


We can also print the 10 points closest to each cluster center:

        val distData = model.getVectors.map(row => {
            val word = row._1
            val probVector = Vectors.dense(row._2.map(_.toDouble))
            val predictK = clusters.predict(probVector)
            val centerVector = clusters.clusterCenters(predictK)
            // distance from this point to its assigned cluster center
            val dist = Vectors.sqdist(probVector, centerVector)
            (predictK, word, dist)
        }).toSeq
        val distRdd = spark.sparkContext.parallelize(distData)

        val groupData = distRdd.map(row => (row._1, (row._2, row._3))).groupByKey()
        // for each center, keep and print the 10 closest points
        groupData.map(row => {
            (row._1, row._2.toList.sortWith((a, b) => a._2 < b._2).take(10))
        }).collect().foreach(row => {
            row._2.foreach(
                row2 => println(s"${row._1}\t${row2._1}\t${row2._2}")
            )
        })


However, inspecting the results did not reveal why the words were grouped the way they were; the clustering output is hard to interpret.


References:

Choosing the best K for KMeans with the elbow method: https://blog.csdn.net/qq_15738501/article/details/79036255

Spark K-means documentation: https://spark.apache.org/docs/2.2.0/mllib-clustering.html#k-means