How to create a federated dataset from a CSV file?

Question

I have selected this dataset: https://www.kaggle.com/karangadiya/fifa19

Now I would like to convert this CSV file into a federated dataset that I can use to fit a model.

TensorFlow provides tutorials on federated learning, but they use a pre-defined dataset. How can I use this particular dataset in a federated learning scenario?

Recommended answer

I'll use a different CSV dataset, but this should still address the core of this question, which is how to create a federated dataset from a CSV. Let's also assume that there is a column in that dataset which you would like to represent the client_ids for your data.

import pandas as pd
import tensorflow as tf
import tensorflow_federated as tff

csv_url = "https://docs.google.com/spreadsheets/d/1eJo2yOTVLPjcIbwe8qSQlFNpyMhYj-xVnNVUTAhwfNU/gviz/tq?tqx=out:csv"

df = pd.read_csv(csv_url, na_values=("?",))

client_id_colname = 'native.country' # the column that represents client ID
SHUFFLE_BUFFER = 1000
NUM_EPOCHS = 1

# split client ids into train and test clients
# .unique() returns a NumPy array, which has no .sample(); wrap it in a Series
client_ids = pd.Series(df[client_id_colname].dropna().unique())
train_client_ids = client_ids.sample(frac=0.5).tolist()
test_client_ids = [x for x in client_ids if x not in train_client_ids]
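
The split above draws a different random half on every run. If a reproducible split is useful for a simulation, one option is a small seeded helper. This is only a sketch using the standard library; the function name split_client_ids and its parameters are hypothetical and are not part of pandas or TFF.

```python
import random

def split_client_ids(client_ids, train_frac=0.5, seed=0):
    """Return (train_ids, test_ids): disjoint, and stable for a given seed."""
    ids = sorted(set(client_ids))   # de-duplicate and fix a starting order
    rng = random.Random(seed)       # local RNG, so global random state is untouched
    rng.shuffle(ids)
    n_train = int(round(len(ids) * train_frac))
    return ids[:n_train], ids[n_train:]

# Illustrative client IDs only.
train_ids, test_ids = split_client_ids(["Cuba", "India", "Mexico", "Japan"])
```

Calling the helper again with the same IDs and seed returns the same two lists, which makes experiments easier to compare.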

There are a few ways to do this, but the way I'll illustrate here uses tff.simulation.ClientData.from_clients_and_fn, which requires that we write a function that accepts a client_id as input and returns a tf.data.Dataset. We can easily construct this from the dataframe.

def create_tf_dataset_for_client_fn(client_id):
  # a function which takes a client_id and returns a
  # tf.data.Dataset for that client
  client_data = df[df[client_id_colname] == client_id]
  dataset = tf.data.Dataset.from_tensor_slices(client_data.to_dict('list'))
  dataset = dataset.shuffle(SHUFFLE_BUFFER).batch(1).repeat(NUM_EPOCHS)
  return dataset

Now, we can use the function above to create a ConcreteClientData object for our training and test data:

train_data = tff.simulation.ClientData.from_clients_and_fn(
        client_ids=train_client_ids,
        create_tf_dataset_for_client_fn=create_tf_dataset_for_client_fn
    )
test_data = tff.simulation.ClientData.from_clients_and_fn(
        client_ids=test_client_ids,
        create_tf_dataset_for_client_fn=create_tf_dataset_for_client_fn
    )

To see one instance of the dataset, try:

example_dataset = train_data.create_tf_dataset_for_client(
        train_data.client_ids[0]
    )
print(type(example_dataset))
example_element = next(iter(example_dataset))  # Python 3: iterators have no .next() method
print(example_element)
# <class 'tensorflow.python.data.ops.dataset_ops.RepeatDataset'>
# {'age': <tf.Tensor: shape=(1,), dtype=int32, numpy=array([37], dtype=int32)>, 'workclass': <tf.Tensor: shape=(1,), dtype=string, numpy=array([b'Local-gov'], dtype=object)>, ...

Each element of example_dataset is a Python dictionary where the keys are strings representing feature names, and the values are tensors with one batch of those features. Now, you have a federated dataset that can be preprocessed and used for modeling.
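To apply the same idea to your own CSV, the main decision is which column plays the role of the client ID. The partitioning step itself can be sketched with only the standard library, as below. The helper name group_rows_by_client, the sample rows, and the column names are all made up for illustration; substitute whatever column in your file identifies a client.

```python
import csv
import io
from collections import defaultdict

def group_rows_by_client(csv_file, client_col):
    """Map each client ID to a column-oriented dict of that client's rows."""
    clients = defaultdict(lambda: defaultdict(list))
    for row in csv.DictReader(csv_file):
        client_id = row.get(client_col)
        if not client_id:        # skip rows with a missing client ID
            continue
        for name, value in row.items():
            clients[client_id][name].append(value)
    return {cid: dict(cols) for cid, cols in clients.items()}

# Tiny in-memory stand-in for a real CSV file (hypothetical columns).
sample = io.StringIO(
    "Name,Club,Age\n"
    "A,Red FC,21\n"
    "B,Blue FC,30\n"
    "C,Red FC,25\n"
    "D,,19\n"
)
per_client = group_rows_by_client(sample, client_col="Club")
print(sorted(per_client))
print(per_client["Red FC"]["Age"])
```

Each per-client value has the same column-to-list shape as client_data.to_dict('list') in the answer above. Note that the csv module leaves every value as a string, so numeric columns would need converting before they are turned into tensors.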
