Open
Description
I checked the latest PR in spark-deep-learning, KerasImageFileEstimator, and when I reviewed the code I found that it collects all training data to the driver and then broadcasts it to the executors. This means the entire training set has to fit in one server's memory, which will definitely not work in the real world, especially since deep learning is a data-hungry class of ML algorithms.
Maybe we could write the training data to a distributed message queue, e.g. Kafka, then have a TF queue receive the data from Kafka and consume it when the TF session starts. The estimator's signature could look something like this:
class KerasImageFileEstimator(Estimator, HasInputCol, HasInputImageNodeName,
                              HasOutputCol, HasOutputNodeName, HasLabelCol,
                              HasKerasModel, HasKerasOptimizer, HasKerasLoss,
                              CanLoadImage, HasOutputMode,
                              DistributedModel="ParamsParallel", KafkaServer="127.0.0.1"):
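As a very rough illustration of the consumer side (only a sketch, not anything in the repo): each worker would pull its share of records from Kafka and stream them into the training loop. The topic name training-images, the broker address, and the record layout (4-byte label prefix plus raw JPEG bytes) are all made up here, the kafka-python client is assumed, and tf.data is used in place of the older TF queue runners mentioned above.

import struct

import tensorflow as tf
from kafka import KafkaConsumer


def kafka_records(topic="training-images", bootstrap_servers="127.0.0.1:9092"):
    """Yield (jpeg_bytes, label) pairs consumed from a Kafka topic."""
    consumer = KafkaConsumer(
        topic,
        bootstrap_servers=bootstrap_servers,
        auto_offset_reset="earliest",
        group_id="keras-image-file-estimator",  # one consumer group per training job
        consumer_timeout_ms=10000,              # stop iterating after 10s with no data
    )
    for message in consumer:
        # Hypothetical record layout: 4-byte big-endian label, then the raw JPEG.
        label = struct.unpack(">i", message.value[:4])[0]
        yield message.value[4:], label


def decode(jpeg_bytes, label):
    image = tf.io.decode_jpeg(jpeg_bytes, channels=3)
    image = tf.image.resize(image, [224, 224]) / 255.0
    return image, label


dataset = (
    tf.data.Dataset.from_generator(
        kafka_records,
        output_signature=(
            tf.TensorSpec(shape=(), dtype=tf.string),
            tf.TensorSpec(shape=(), dtype=tf.int32),
        ),
    )
    .map(decode)
    .batch(32)
    .prefetch(tf.data.AUTOTUNE)
)
# model.fit(dataset, ...) can then stream batches from Kafka instead of
# holding the whole training set in driver memory.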
We could also support putting the data in HDFS as an option, but a message queue seems to be the better fit; a rough producer-side sketch follows.
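In this sketch each Spark partition publishes its rows to Kafka directly from the executor, so the driver never has to collect them. The topic name, broker address, and the filePath/label column names are hypothetical, kafka-python is assumed to be installed on the executors, and filePath is assumed to point at a location the executors can read directly.

import struct

from kafka import KafkaProducer


def publish_partition(rows):
    """Publish one partition's (filePath, label) rows to Kafka from the executor."""
    producer = KafkaProducer(bootstrap_servers="127.0.0.1:9092")
    for row in rows:
        # Assumes filePath is readable from the executor (e.g. a shared filesystem).
        with open(row["filePath"], "rb") as f:
            jpeg_bytes = f.read()
        # Same hypothetical record layout as the consumer: 4-byte label + raw JPEG.
        producer.send("training-images", struct.pack(">i", row["label"]) + jpeg_bytes)
    producer.flush()


# image_df stands in for the estimator's input DataFrame of image paths and labels.
image_df.select("filePath", "label").rdd.foreachPartition(publish_partition)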
Activity
[PR:#45] Avoid collecting training data to the driver and broadcasting it