Add Platform test to show Hadoop and Tez partitioning difference #59
Open
piyushnarang wants to merge 1 commit into cwensel:wip-3.2 from
Conversation
Owner

Leaving this open in the hope I have time to look into it, even though it's likely no longer a concern.
Noticed this on one of our test jobs that we were using to compare the performance of MR and Tez.

I've built a unit test to show a subset of the graph where Cascading on Hadoop combines more nodes, thus lowering the quantity of data streamed between nodes / steps.

The job starts off with two vertices, V0 and V1, reading around 3,025,369,753 tuples (roughly 10 TB). They're then merged and grouped in vertex V2. This is then passed on to vertex V3, which performs some aggregations (Every pipes) and reduces the data to around 1 TB.

In the case of Hadoop, V0 and V1 are done on the job's mappers, while V2 and V3 are combined and done on the reducers. We then end up writing out only this 1 TB or so of data, which is picked up by the downstream steps. On Tez, the aggregations land in a separate vertex, so the full pre-aggregation data is streamed between vertices.

Wondering if we should have a planner rule to collapse these aggregations into the step doing the GroupBy?
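For reference, the shape of the pipe assembly described above can be sketched roughly as follows. This is an illustrative assumption, not the actual test job: the vertex names, field names, and the Count/Sum aggregations are all hypothetical stand-ins for the real merge → GroupBy → Every chain.

```java
import cascading.operation.aggregator.Count;
import cascading.operation.aggregator.Sum;
import cascading.pipe.Every;
import cascading.pipe.GroupBy;
import cascading.pipe.Merge;
import cascading.pipe.Pipe;
import cascading.tuple.Fields;

public class PartitioningSketch
  {
  public static Pipe assembly()
    {
    // V0 and V1: two source branches reading the large inputs
    Pipe v0 = new Pipe( "v0" );
    Pipe v1 = new Pipe( "v1" );

    // V2: merge both branches, then group on a (hypothetical) key field
    Pipe merged = new Merge( "merged", v0, v1 );
    Pipe grouped = new GroupBy( merged, new Fields( "key" ) );

    // V3: Every aggregations that shrink the grouped data; on Hadoop the
    // planner runs these in the same reducer step as the GroupBy, while on
    // Tez they can be placed in a separate downstream vertex
    Pipe counted = new Every( grouped, new Count( new Fields( "count" ) ) );
    Pipe summed = new Every( counted, new Fields( "value" ),
      new Sum( new Fields( "sum" ) ), Fields.ALL );

    return summed;
    }
  }
```

With an assembly of this shape, the question above amounts to whether the planner should always pull the trailing Every pipes into the node that performs the GroupBy, as the Hadoop planner does here.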