Open
Description
I checked the latest PR in spark-deep-learning, KerasImageFileEstimator, and when I reviewed the code I found that it collects all training data to the driver and then broadcasts it to the executors. This means the entire training set has to fit in one server's memory, which will definitely not work in the real world, especially since deep learning is a data-hungry class of ML algorithms.
Maybe we could write the training data to a distributed message queue, e.g. Kafka, then have a TF queue receive the data from Kafka and consume it when the TF session starts. The estimator's signature could look something like this:
class KerasImageFileEstimator(Estimator, HasInputCol, HasInputImageNodeName,
                              HasOutputCol, HasOutputNodeName, HasLabelCol,
                              HasKerasModel, HasKerasOptimizer, HasKerasLoss,
                              CanLoadImage, HasOutputMode,
                              DistributedModel="ParamsParallel", KafkaServer="127.0.0.1"):
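As a very rough illustration of the consumer side (only a sketch, not anything in the repo): each worker would pull its share of records from Kafka and stream them into the training loop. The topic name training-images, the broker address, and the record layout (4-byte label prefix plus raw JPEG bytes) are all made up here, the kafka-python client is assumed, and tf.data is used in place of the older TF queue runners mentioned above.

import struct

import tensorflow as tf
from kafka import KafkaConsumer


def kafka_records(topic="training-images", bootstrap_servers="127.0.0.1:9092"):
    """Yield (jpeg_bytes, label) pairs consumed from a Kafka topic."""
    consumer = KafkaConsumer(
        topic,
        bootstrap_servers=bootstrap_servers,
        auto_offset_reset="earliest",
        group_id="keras-image-file-estimator",  # one consumer group per training job
        consumer_timeout_ms=10000,              # stop iterating after 10s with no data
    )
    for message in consumer:
        # Hypothetical record layout: 4-byte big-endian label, then the raw JPEG.
        label = struct.unpack(">i", message.value[:4])[0]
        yield message.value[4:], label


def decode(jpeg_bytes, label):
    image = tf.io.decode_jpeg(jpeg_bytes, channels=3)
    image = tf.image.resize(image, [224, 224]) / 255.0
    return image, label


dataset = (
    tf.data.Dataset.from_generator(
        kafka_records,
        output_signature=(
            tf.TensorSpec(shape=(), dtype=tf.string),
            tf.TensorSpec(shape=(), dtype=tf.int32),
        ),
    )
    .map(decode)
    .batch(32)
    .prefetch(tf.data.AUTOTUNE)
)
# model.fit(dataset, ...) can then stream batches from Kafka instead of
# holding the whole training set in driver memory.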
We could also support putting the data in HDFS as an option, but a message queue seems to be the better fit; a rough producer-side sketch follows.
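In this sketch each Spark partition publishes its rows to Kafka directly from the executor, so the driver never has to collect them. The topic name, broker address, and the filePath/label column names are hypothetical, kafka-python is assumed to be installed on the executors, and filePath is assumed to point at a location the executors can read directly.

import struct

from kafka import KafkaProducer


def publish_partition(rows):
    """Publish one partition's (filePath, label) rows to Kafka from the executor."""
    producer = KafkaProducer(bootstrap_servers="127.0.0.1:9092")
    for row in rows:
        # Assumes filePath is readable from the executor (e.g. a shared filesystem).
        with open(row["filePath"], "rb") as f:
            jpeg_bytes = f.read()
        # Same hypothetical record layout as the consumer: 4-byte label + raw JPEG.
        producer.send("training-images", struct.pack(">i", row["label"]) + jpeg_bytes)
    producer.flush()


# image_df stands in for the estimator's input DataFrame of image paths and labels.
image_df.select("filePath", "label").rdd.foreachPartition(publish_partition)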
Activity
[PR:#45] Avoid collecting training data to the driver and broadcasting it