wesley-weiming commented Dec 6, 2023
- We submitted a performance report.
- The test tooling can automatically generate a project-impact-graph.yaml for a given number of projects and dependencies, so you can run the tests yourself.
- In the performance test, we found that using "minimatch" for glob matching degrades performance badly (for example, with 2,000 projects and 10,000 dependencies, calculating 100 paths takes minutes). So we need to change the algorithm's path matching from glob matching to plain string matching. This means that fields such as includedGlobs and excludedGlobs in the file schema can no longer use glob syntax.
| A: | ||
| includedGlobs: | ||
| - projects/folder_A/** | ||
| - projects/folder_A/ |
Could you elaborate on the syntax used here?
projects:
  A:
    includedGlobs:
      - projects/folder_A/
    excludedGlobs:
      - projects/folder_A/README.md
    dependentProjects:
      - G
This describes a project named 'A'. 'includedGlobs' specifies the files that belong to project 'A'. 'excludedGlobs' covers a subset of the 'includedGlobs' paths and specifies which paths in project 'A' should be filtered out. 'dependentProjects' lists the projects that directly depend on 'A'.
Because we replaced glob matching with startsWith, the folder path projects/folder_A/ is used here to represent project A.
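To make the startsWith semantics concrete, here is a minimal sketch of how a path could be tested against one project entry. The `ProjectEntry` interface and the `belongsToProject` helper are hypothetical names used only for illustration; they are not taken from the repository.

```typescript
// Hypothetical shape of one project entry; field names follow the YAML above.
interface ProjectEntry {
  includedGlobs: string[]; // treated as plain path prefixes, not globs
  excludedGlobs?: string[];
  dependentProjects?: string[];
}

// Returns true when filePath belongs to the project under startsWith semantics:
// it must start with an included prefix and must not start with an excluded one.
function belongsToProject(filePath: string, project: ProjectEntry): boolean {
  const included = project.includedGlobs.some((prefix) => filePath.startsWith(prefix));
  if (!included) {
    return false;
  }
  const excluded = (project.excludedGlobs ?? []).some((prefix) => filePath.startsWith(prefix));
  return !excluded;
}

const projectA: ProjectEntry = {
  includedGlobs: ['projects/folder_A/'],
  excludedGlobs: ['projects/folder_A/README.md'],
  dependentProjects: ['G'],
};

const srcMatch = belongsToProject('projects/folder_A/src/index.ts', projectA); // true
const readmeMatch = belongsToProject('projects/folder_A/README.md', projectA); // false, excluded
const otherMatch = belongsToProject('projects/folder_AB/src/index.ts', projectA); // false
```

The trailing slash in `projects/folder_A/` matters in this sketch: without it, a sibling folder such as `projects/folder_AB/` would also pass the prefix test.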
It feels like these are no longer included/excludedGlobs. They're included/excludedPrefix now.
You are right, it is no longer appropriate to keep using 'glob' in the names.
| @@ -0,0 +1,148 @@ | |||
| import path from 'path'; | |||
Could you add a README to teach us how to run the performance test?
My guess is that you run this file with node. Is that right?
Even better, you could include the performance results along with the running environment info.
Of course, let me add a new commit
| @@ -2,11 +2,11 @@ globalExcludedGlobs: | |||
| - OWNERS | |||
Do these file names still work? The implementation has been changed to match with startsWith.
OWNERS
build.sh
bootstrap.sh
These are common configuration files in the root directory. OWNERS means repoRootDir/OWNERS, which still works with startsWith.
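A small sketch of why root-level file names keep working as prefixes, assuming every path is relative to the repository root with no leading `./`. The `isGloballyExcluded` helper is a hypothetical name for illustration only.

```typescript
// Entries copied from the comment above; all paths are assumed to be
// relative to the repository root.
const globalExcluded: string[] = ['OWNERS', 'build.sh', 'bootstrap.sh'];

// Hypothetical helper: a path is excluded when it starts with any entry.
function isGloballyExcluded(repoRelativePath: string): boolean {
  return globalExcluded.some((prefix) => repoRelativePath.startsWith(prefix));
}

const rootOwners = isGloballyExcluded('OWNERS'); // true
const nestedOwners = isGloballyExcluded('projects/folder_A/OWNERS'); // false
const longerName = isGloballyExcluded('OWNERS.bak'); // true
```

One consequence of plain prefix matching, visible in the last line: an entry such as `OWNERS` also matches any root-level path that merely begins with that string.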
| @@ -4,7 +4,6 @@ | |||
| import fs from 'fs'; | |||
| import yaml from 'yaml'; | |||
| import _ from 'lodash'; | |||
Is lodash used in any inner loops? Some time ago a perf investigation showed that Lodash's algorithms are often extremely inefficient because of the many layers of abstraction in its code base.
It is not used in loops. lodash is only used twice in a complete calculation (its cloneDeep API clones the graph structure).
| | 3000 | 100000 | 1 | 1 | 9.46s | | ||
| | 3000 | 100000 | 10 | 10 | 10.533s | | ||
| | 3000 | 100000 | 100 | 100 | 11.029s | | ||
| | 3000 | 100000 | 1000 | 1000 | 11.984s | |
In https://github.com/tiktok/project-impact-graph/pull/3/files I've added some instrumentation to count the number of times each part of the loop is executed.
Here's one of the test cases:
[
  {
    nodeCount: 2000,
    edgeCount: 10000,
    pathCountA: 1000,
    pathCountB: 1000,
    hasImpactIntersection: true,
    executeTime: '1.513s'
  }
]
{
  _integrateExcludedGlobs: 1,
  _integrateExcludedGlobs2: 2000,
  _validatePaths: 2,
  _validatePaths2: 4008000,
  lookUpProjectNamesByPathList: 2,
  lookUpProjectNamesByPathList2: 4000000,
  lookUpProjectNamesByPathList3: 4000000,
  lookUpProjectNamesByPathList4: 5780,
  getProjectImpactByProjectNames: 2,
  getProjectImpactByProjectNames2: 4000,
  getProjectImpactByProjectNames3: 24000,
  hasImpactIntersection: 1
}
The nodeCount is the number of projects, and pathCountA and pathCountB are the sizes of the "before" and "after" lists of paths from the diff. The bottleneck of this algorithm seems to be the 4,000,000 iterations, which is O(numberOfPaths * numberOfProjects): 2,000 paths in total across the two lists, times 2,000 projects.
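The 4,000,000 figure can be reproduced with a minimal sketch of a nested-loop lookup, assuming every path is tested against every project prefix with no early exit. The function name and the synthetic folder layout below are hypothetical and stand in for the instrumented code in the linked PR.

```typescript
// Hypothetical stand-in for the lookup loop; not the repository's actual code.
// Returns how many (path, project) prefix tests ran and how many matched.
function naiveLookup(
  paths: string[],
  projectPrefixes: string[]
): { comparisons: number; matches: number } {
  let comparisons = 0;
  let matches = 0;
  for (const filePath of paths) {
    for (const prefix of projectPrefixes) {
      comparisons++; // one startsWith test per (path, project) pair
      if (filePath.startsWith(prefix)) {
        matches++;
      }
    }
  }
  return { comparisons, matches };
}

// Synthetic data sized like the test case above: 2,000 projects and two
// lists of 1,000 paths each.
const projectPrefixes: string[] = [];
for (let i = 0; i < 2000; i++) {
  projectPrefixes.push(`projects/p${i}/`);
}
const pathsA: string[] = [];
const pathsB: string[] = [];
for (let i = 0; i < 1000; i++) {
  pathsA.push(`projects/p${i}/src/index.ts`);
  pathsB.push(`projects/p${i + 1000}/src/index.ts`);
}

const resultA = naiveLookup(pathsA, projectPrefixes);
const resultB = naiveLookup(pathsB, projectPrefixes);
const totalComparisons = resultA.comparisons + resultB.comparisons;
console.log(totalComparisons); // 4000000
```

Only 2,000 of those tests find a match in this synthetic layout, so almost all of the work is spent on prefixes that cannot apply to the path.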
But notice that our project paths have a well-behaved structure, for example:
INCLUDE apps/my-app/**
EXCLUDE apps/my-app/README.md
INCLUDE apps/my-app2/**
EXCLUDE apps/my-app2/README.md
INCLUDE libraries/my-lib3/**
EXCLUDE libraries/my-lib3/README.md
INCLUDE libraries/my-lib4/**
EXCLUDE libraries/my-lib4/README.md
INCLUDE libraries/my-lib4/bad-nested-project/**
EXCLUDE libraries/my-lib4/bad-nested-project/README.md
Even if we permit projects to be nested under other project folders, the prefixes of these globs still form a tree. For an example input path libraries/my-lib3/src/index.ts, imagine an O(n*log(n)) algorithm that cheaply walks down libraries -> my-lib3 and then needs to test only 2 globs: ** and README.md.
This idea is similar to rush-lib/src/logic/LookupByPath.ts.
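The tree walk described above could look roughly like the following sketch. It is not the rush-lib LookupByPath implementation; `TrieNode`, `addProject`, `findProject`, and the project names are hypothetical and chosen only to mirror the example folders.

```typescript
// One node per path segment; projectName is set where a project's folder ends.
interface TrieNode {
  children: Map<string, TrieNode>;
  projectName?: string;
}

function createNode(): TrieNode {
  return { children: new Map<string, TrieNode>() };
}

// Registers a project folder such as 'libraries/my-lib3'.
function addProject(root: TrieNode, folder: string, projectName: string): void {
  let node = root;
  for (const segment of folder.split('/').filter((s) => s.length > 0)) {
    let child = node.children.get(segment);
    if (!child) {
      child = createNode();
      node.children.set(segment, child);
    }
    node = child;
  }
  node.projectName = projectName;
}

// Walks the path one segment at a time and returns the deepest project found,
// so a nested project wins over the project that contains it.
function findProject(root: TrieNode, filePath: string): string | undefined {
  let node = root;
  let found: string | undefined = root.projectName;
  for (const segment of filePath.split('/')) {
    const child = node.children.get(segment);
    if (!child) {
      break;
    }
    node = child;
    if (node.projectName !== undefined) {
      found = node.projectName;
    }
  }
  return found;
}

const root = createNode();
addProject(root, 'apps/my-app', 'my-app');
addProject(root, 'apps/my-app2', 'my-app2');
addProject(root, 'libraries/my-lib3', 'my-lib3');
addProject(root, 'libraries/my-lib4', 'my-lib4');
addProject(root, 'libraries/my-lib4/bad-nested-project', 'bad-nested-project');

const lib3 = findProject(root, 'libraries/my-lib3/src/index.ts'); // 'my-lib3'
const nested = findProject(root, 'libraries/my-lib4/bad-nested-project/src/a.ts'); // 'bad-nested-project'
const app2 = findProject(root, 'apps/my-app2/package.json'); // 'my-app2'
const none = findProject(root, 'docs/guide.md'); // undefined
```

In this sketch the cost of one lookup depends on the number of segments in the path rather than on the number of projects, and any exclusion entries (such as README.md) would only need to be checked for the single project that was found.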