By: Adam Pisoni
RE: Streaming example? 2008-08-28 23:52
The examples directory has an example of distributed_find which does what you want.
Here's how it works. It takes your query and modifies it so that, instead of retrieving the rows your query would return, it returns ranges of ids based on your batch size (1000 is the default, I believe). So each of the rows will look something like
1, 5, 1504
2, 1505, 2452
3, 3401, 4444
The examples I gave above are non-contiguous to show that most tables are sparse. The special query figures out the starting and ending id of each batch of 1000 rows in your table. It then sends the start and end id, as well as the original query, to each worker (there are as many jobs as there are batches). Each worker then executes the query within the range it was given and operates on those models.
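To make the batching idea concrete, here is a rough sketch in plain Ruby. It is not the actual distributed_find code and it does not call Skynet at all: the helper names `id_ranges` and `ids_in_range` are made up for illustration, and an in-memory array of ids stands in for the table. It only shows how a sparse set of ids can be cut into (batch number, start id, end id) rows, and how each worker would then touch only the ids inside its own range.

```ruby
# Hypothetical helper (not part of Skynet): cut a sorted list of ids into
# rows of [batch_number, first_id, last_id], batch_size ids per row.
def id_ranges(sorted_ids, batch_size = 1000)
  sorted_ids.each_slice(batch_size).each_with_index.map do |slice, index|
    [index + 1, slice.first, slice.last]
  end
end

# Hypothetical stand-in for one worker: it receives only a start and end id
# and picks out the ids that fall inside that range.
def ids_in_range(sorted_ids, first_id, last_id)
  sorted_ids.select { |id| id >= first_id && id <= last_id }
end

# A sparse "table" of ids, with a small batch size so the result is easy to read.
ids = [5, 9, 10, 42, 57, 60, 98, 120, 121, 300]
ranges = id_ranges(ids, 4)
# ranges is [[1, 5, 42], [2, 57, 120], [3, 121, 300]]

ranges.each do |batch, first_id, last_id|
  puts "batch #{batch}: ids #{ids_in_range(ids, first_id, last_id).inspect}"
end
```

Because the ranges never overlap, no shared bookkeeping is needed in this sketch: each id belongs to exactly one batch, so each one is handled by exactly one worker.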
Does that make sense?
adam
|
By: Jeff Doran
Streaming example? 2008-08-28 23:48
Are there any examples of streaming in large data sets, say 10 million rows of accounts in a database, into a self.map method?
What I'm trying to understand is how you could partition this data set among, say 3 skynet managers on different servers with 10 workers each. How does an Enumerable keep track of what accounts have been processed across multiple servers? Is a temporary db table or the like used to track what's been done?
Thanks.
-Jeff