Telegram chat of the moscowspark group, page 513

Return an RDD created by piping elements to a forked external process. The resulting RDD is computed by executing the given process once per partition. All elements of each input partition are written to a process's stdin as lines of input separated by a newline. The resulting partition consists of the process's stdout output, with each line of stdout resulting in one element of the output partition. A process is invoked even for empty partitions.
The print behavior can be customized by providing two functions.
Params:
command – command to run in forked process.
env – environment variables to set.
printPipeContext – Before piping elements, this function is called as an opportunity to pipe context data. Print line function (like out.println) will be passed as printPipeContext's parameter.
printRDDElement – Use this function to customize how to pipe elements. This function will be called with each RDD element as the 1st parameter, and the print line function (like out.println()) as the 2nd parameter. An example of pipe the RDD data of groupBy() in a streaming way, instead of constructing a huge String to concat all the elements:
    def printRDDElement(record: (String, Seq[String]), f: String => Unit) =
      for (e <- record._2) { f(e) }
separateWorkingDir – Use separate working directories for each task.
bufferSize – Buffer size for the stdin writer for the piped process.
encod
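
A minimal sketch of the call described in the documentation above, assuming a Unix-like machine where `tr` and `cat` are on the PATH and Spark runs in local mode. The object name, the sample data and the helper names are made up for illustration and are not from the chat.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object PipeSketch {
  // Pipes every partition through `tr`; one `tr` process is forked per partition
  // (two processes here, because the RDD has two partitions).
  def upperCase(sc: SparkContext): Array[String] =
    sc.parallelize(Seq("spark", "pipe", "rdd", "chat"), numSlices = 2)
      .pipe(Seq("tr", "a-z", "A-Z"))
      .collect()

  // Writes the values of each (key, values) record to the process one line at a
  // time, in the spirit of the printRDDElement example from the documentation.
  def streamGroups(sc: SparkContext): Array[String] = {
    val groups = sc.parallelize(Seq(("a", Seq("1", "2")), ("b", Seq("3"))), numSlices = 1)
    groups.pipe(
      Seq("cat"),
      printRDDElement =
        (record: (String, Seq[String]), f: String => Unit) => record._2.foreach(f)
    ).collect()
  }

  def main(args: Array[String]): Unit = {
    // Local master and app name are assumptions made for this sketch.
    val sc = new SparkContext(
      new SparkConf().setAppName("pipe-sketch").setMaster("local[2]"))
    try {
      upperCase(sc).foreach(println)
      streamGroups(sc).foreach(println)
    } finally sc.stop()
  }
}
```

Each line the external process prints becomes one element of the resulting RDD, so `upperCase` returns as many elements as it was given.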

12:21 #13

KrivdaTheTriewe in Moscow Spark

rdd.pipe

12:21 #14

KrivdaTheTriewe in Moscow Spark

Parallelism is controlled by the number of partitions.
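
One way to see this, as a rough sketch: pipe the data through `wc -l`, which prints a single line per process, so the result holds one count per forked process. This assumes `wc` is available on the workers; the object and method names are hypothetical.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object PartitionParallelism {
  // Returns the line count reported by each forked `wc -l` process.
  // Since pipe forks one process per partition, the array length equals
  // the number of partitions.
  def linesPerProcess(sc: SparkContext, numPartitions: Int): Array[Int] =
    sc.parallelize(1 to 100, numPartitions)
      .map(_.toString)
      .pipe("wc -l")
      .collect()
      .map(_.trim.toInt) // some `wc` builds pad the number with spaces

  def main(args: Array[String]): Unit = {
    // Local master and app name are assumptions made for this sketch.
    val sc = new SparkContext(
      new SparkConf().setAppName("partition-parallelism").setMaster("local[4]"))
    try {
      println(linesPerProcess(sc, 4).mkString(", ")) // four counts
      println(linesPerProcess(sc, 2).mkString(", ")) // two counts
    } finally sc.stop()
  }
}
```

Changing the partition count, for example with `repartition` or `coalesce` before `pipe`, changes how many external processes are forked.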

12:22 #15

Паша Финкельштейн... in Moscow Spark

In reply to KrivdaTheTriewe (the RDD.pipe documentation quoted above):

Spawning subprocesses is very expensive.

12:22 #16

Паша Финкельштейн... in Moscow Spark

On the other hand, a single wget can make many requests ;)

12:22 #17

KrivdaTheTriewe in Moscow Spark

Well, set it up so that only one wget is created per command.
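
A possible shape for this idea, as a sketch only: `wget -i -` reads its URL list from stdin and `-O -` writes every response body to stdout, so a single wget per partition can serve all URLs of that partition. The object name, the helper and the `numProcesses` parameter are hypothetical. Note that the bodies arrive as one undivided stream of lines, and a non-zero wget exit status would fail the Spark task.

```scala
import org.apache.spark.rdd.RDD

object WgetPerPartition {
  // -q    : no progress output
  // -O -  : write all downloaded bodies to stdout
  // -i -  : read the list of URLs from stdin, one per line
  val wgetCommand: Seq[String] = Seq("wget", "-q", "-O", "-", "-i", "-")

  // Forks one wget per partition. `numProcesses` (an illustrative parameter)
  // sets the partition count and therefore how many wget processes run.
  def fetchAll(urls: RDD[String], numProcesses: Int): RDD[String] =
    urls.repartition(numProcesses).pipe(wgetCommand)
}
```

With this layout the cost of forking is paid once per partition rather than once per URL, which is the trade-off discussed in the messages above.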

12:23 #18