Add Platform test to show Hadoop and Tez partitioning difference #59
Open
piyushnarang wants to merge 1 commit into cwensel:wip-3.2 from
Conversation
Owner

Leaving this open in the hope I have time to look into it, even though it's likely no longer a concern.
Noticed this on one of our test jobs that we were using to compare the performance of MR and Tez.

I've built a unit test to show a subset of the graph where Cascading on Hadoop combines more nodes, thus lowering the quantity of data streamed between nodes / steps.

The job starts off with two vertices, V0 and V1, reading around 3,025,369,753 tuples (roughly 10 TB). They're then merged and grouped in vertex V2. This is then passed on to vertex V3, which performs some aggregations (Every pipes) and reduces the data to around 1 TB.

In the case of Hadoop, V0 and V1 are done on the job's mappers, while V2 and V3 are combined and done on the reducers. We then end up writing out only this 1 TB or so of data, which is picked up by the downstream steps. On Tez, the aggregations land in a separate vertex, so the full pre-aggregation data is streamed between vertices.

Wondering if we should have a planner rule to collapse these aggregations into the step doing the GroupBy?
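For reference, the shape of the pipe assembly described above can be sketched roughly as follows. This is an illustrative assumption, not the actual test job: the vertex names, field names, and the Count/Sum aggregations are all hypothetical stand-ins for the real merge → GroupBy → Every chain.

```java
import cascading.operation.aggregator.Count;
import cascading.operation.aggregator.Sum;
import cascading.pipe.Every;
import cascading.pipe.GroupBy;
import cascading.pipe.Merge;
import cascading.pipe.Pipe;
import cascading.tuple.Fields;

public class PartitioningSketch
  {
  public static Pipe assembly()
    {
    // V0 and V1: two source branches reading the large inputs
    Pipe v0 = new Pipe( "v0" );
    Pipe v1 = new Pipe( "v1" );

    // V2: merge both branches, then group on a (hypothetical) key field
    Pipe merged = new Merge( "merged", v0, v1 );
    Pipe grouped = new GroupBy( merged, new Fields( "key" ) );

    // V3: Every aggregations that shrink the grouped data; on Hadoop the
    // planner runs these in the same reducer step as the GroupBy, while on
    // Tez they can be placed in a separate downstream vertex
    Pipe counted = new Every( grouped, new Count( new Fields( "count" ) ) );
    Pipe summed = new Every( counted, new Fields( "value" ),
      new Sum( new Fields( "sum" ) ), Fields.ALL );

    return summed;
    }
  }
```

With an assembly of this shape, the question above amounts to whether the planner should always pull the trailing Every pipes into the node that performs the GroupBy, as the Hadoop planner does here.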