[stdlib] Add allDistinct APIs for iterables, sequences and arrays #6187
Dmitry Nekrasov (DmitryNekrasov) wants to merge 4 commits into
31228ac to ebfd3a6
/dry-run
THIS IS A DRY RUN Quality gate is triggered at https://buildserver.labs.intellij.net/build/969309422 — use this link to get full insight. Quality gate was triggered with the following revisions:
Quality gate failed. See https://buildserver.labs.intellij.net/build/969309422 to get full insight. |
The failing cases are
It's engine-dependent: I checked
Reproducer (https://pl.kotl.in/P6_xrfPbh):
val a = Float.NaN
val b = Float.fromBits(0xFFFC0000.toInt()) // a different NaN bit pattern
println(a.equals(b)) // true everywhere
println(a.hashCode() == b.hashCode()) // JVM: true; JS: false on Node/Chrome, true on Safari/Firefox
Two options:
I'd take option 1 now; option 2 is a follow-up that would let us drop the guard. |
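The two options are not visible in this thread, so the following is only a guess at what a guard could look like, not the PR's code. It is a minimal sketch, assuming a hypothetical helper named `allDistinctNaNSafe`, that collapses every NaN bit pattern to the canonical `Float.NaN` before the value reaches the hash set, so the result no longer depends on how an engine hashes NaN payloads.

```kotlin
// Hypothetical helper for illustration; the name and approach are assumptions.
fun FloatArray.allDistinctNaNSafe(): Boolean {
    if (size < 2) return true
    val seen = HashSet<Float>()
    for (x in this) {
        // Map every NaN bit pattern to one canonical value before hashing.
        val key = if (x.isNaN()) Float.NaN else x
        if (!seen.add(key)) return false
    }
    return true
}
```

With this guard, the two NaN values from the reproducer count as duplicates on every platform, while `0.0f` and `-0.0f` still count as distinct because boxed `Float.equals` tells them apart.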
ebfd3a6 to ad57343
/dry-run
THIS IS A DRY RUN Quality gate is triggered at https://buildserver.labs.intellij.net/build/969660301 — use this link to get full insight. Quality gate was triggered with the following revisions:
Quality gate finished successfully. |
@ExperimentalStdlibApi
public fun BooleanArray.allDistinct(): Boolean {
    if (size < 2) return true
    val seen = HashSet<Boolean>()
For booleans, there are only three cases:
- the size is < 2
- the size is 2 and values are different
- in all other cases, there are duplicates
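The three cases above map directly onto a `when` expression. This is a minimal sketch under a hypothetical name (`allDistinctSketch`), not the code that landed in the PR:

```kotlin
// Hypothetical name; illustrates the three-case reasoning without a HashSet.
fun BooleanArray.allDistinctSketch(): Boolean = when {
    size < 2 -> true                 // zero or one element: trivially distinct
    size == 2 -> this[0] != this[1]  // exactly two: distinct only if they differ
    else -> false                    // three or more Booleans must repeat a value
}
```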
Thank you! Totally agree.
Done.
@SinceKotlin("2.4")
@ExperimentalStdlibApi
public fun ByteArray.allDistinct(): Boolean {
    if (size < 2) return true
And if the size is greater than 512, then there are certainly some duplicates :)
Done. The threshold is 256 — the number of distinct Byte values; ShortArray, UShortArray, and CharArray got the same check at 65536.
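For the 16-bit types, the pigeonhole check might look roughly like the sketch below. The function name is hypothetical and the body is an illustration of the size shortcut followed by an ordinary hash-set scan, not the PR's implementation:

```kotlin
// Hypothetical name; shows the size shortcut for a 16-bit element type.
fun ShortArray.allDistinctSketch(): Boolean {
    if (size < 2) return true
    // Short has 65536 possible values, so a longer array must contain a repeat.
    if (size > 65536) return false
    val seen = HashSet<Short>()
    for (s in this) {
        if (!seen.add(s)) return false
    }
    return true
}
```

An array of exactly 65536 elements can still be all-distinct, so the comparison has to be strict.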
@ExperimentalStdlibApi
public fun ByteArray.allDistinct(): Boolean {
    if (size < 2) return true
    val seen = HashSet<Byte>()
There's not much we can do for other integer types, but all byte values could be captured using 4 longs, so it might be worth implementing teeny-tiny bitset with a single add operation.
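A tiny 256-bit set along those lines could be sketched as follows. The name is hypothetical, and the four words are kept in a `LongArray` purely for brevity; the PR may well hold them differently:

```kotlin
// Hypothetical name; a 256-bit set held in four 64-bit words.
fun ByteArray.allDistinctSketch(): Boolean {
    if (size < 2) return true
    if (size > 256) return false          // only 256 distinct Byte values exist
    val words = LongArray(4)              // 4 * 64 = 256 bits, one per Byte value
    for (b in this) {
        val v = b.toInt() and 0xFF        // map -128..127 onto 0..255
        val word = v ushr 6               // high two bits select the word
        val mask = 1L shl (v and 63)      // low six bits select the bit
        if ((words[word] and mask) != 0L) return false
        words[word] = words[word] or mask
    }
    return true
}
```

Values such as 0 and 64 share the same low six bits but land in different words, so they are correctly treated as distinct.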
Done.
seen.add(selector(first))
do {
    if (!seen.add(selector(iterator.next()))) return false
selector will be inlined twice (kudos to @qwwdfsad for the hint), needlessly emitting additional bytecode at the allDistinctBy's call-sites.
So it makes sense to rewrite it into something like:
val iterator = iterator()
if (!iterator.hasNext()) return true
var element = iterator.next()
// A single element is trivially distinct; skip the HashSet allocation.
if (!iterator.hasNext()) return true
val seen = HashSet<K>()
while (true) {
    // The only call site of selector, so it is inlined once.
    if (!seen.add(selector(element))) return false
    if (!iterator.hasNext()) break
    element = iterator.next()
}
return true
It would be nice to reduce the size of the loop's preamble, but not sure if there's a reasonable way to achieve it.
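Wrapped into a complete function, the suggested shape could look like the sketch below. The name `allDistinctBySketch` is hypothetical and the annotations from the PR are omitted; the point is only that `selector` appears at one place in the body, so an `inline` function copies the lambda into each call site once.

```kotlin
// Hypothetical name; selector is invoked from a single place in the body.
inline fun <T, K> Iterable<T>.allDistinctBySketch(selector: (T) -> K): Boolean {
    val iterator = iterator()
    if (!iterator.hasNext()) return true
    var element = iterator.next()
    if (!iterator.hasNext()) return true   // one element: nothing to compare
    val seen = HashSet<K>()
    while (true) {
        if (!seen.add(selector(element))) return false
        if (!iterator.hasNext()) break
        element = iterator.next()
    }
    return true
}
```

For example, `listOf("a", "bb", "cc").allDistinctBySketch { it.length }` is false because two strings share the length 2.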
Done.
Review follow-up for #6187:
- BooleanArray: only three outcomes exist (size < 2; a distinct pair; otherwise a guaranteed duplicate), so compute the answer directly with no HashSet.
- ByteArray/UByteArray: more than 256 elements can't be all-distinct (pigeonhole), return false right away; values that fit are tracked in a 256-bit set of four Longs instead of a HashSet, avoiding boxing and a hash-table allocation.
- ShortArray/UShortArray/CharArray: same pigeonhole shortcut at 65536. No bitset here: zeroing 8 KiB per call would penalize the common small-array case.

Wider element types (Int and above) can't benefit: their value domain exceeds the maximum array size, so the size check would never fire.

Tests cover the domain boundaries (full domain distinct, same size with a duplicate, domain size + 1), all BooleanArray shapes up to size 2 exhaustively, and byte values that collide in the low six bits of the bitset words.

KT-30270
e59ec6c to 529aeb5
Adds experimental `allDistinct()` and `allDistinctBy { }` ("are all
elements different from each other?") for `Iterable`, `Sequence`, and the
object, primitive, and unsigned array families — the dual of `allEqual`.
`@ExperimentalStdlibApi`, `@SinceKotlin("2.4")`.
Distinctness uses `equals`/`hashCode`, so for floating-point elements
`NaN` equals `NaN` and `-0.0` is not equal to `0.0`, consistent with
`Double.equals` and the existing `allEqual`/`isSorted`.
^KT-30270 Fixed
On JS, Double and Float hashCode don't canonicalize NaN, so a HashSet keeps NaN values with different bit patterns apart.
Only the byte arrays get a bitset: zeroing the 8 KiB needed for the 16-bit types on every call would penalize the common small-array case.
The selector is inlined, so each call site used to receive two copies of the lambda's bytecode.
529aeb5 to 5ab237c
/dry-run
THIS IS A DRY RUN Quality gate is triggered at https://buildserver.labs.intellij.net/build/970073740 — use this link to get full insight. Quality gate was triggered with the following revisions:
Quality gate failed. See https://buildserver.labs.intellij.net/build/970073740 to get full insight. |
Adds experimental `allDistinct()` and `allDistinctBy { }` ("are all elements different from each other?") for `Iterable`, `Sequence`, and the object, primitive, and unsigned array families, the dual of `allEqual`. `@ExperimentalStdlibApi`, `@SinceKotlin("2.4")`.

Distinctness goes through a hash set, so elements compare by `equals`/`hashCode`: `NaN` equals `NaN`, and `-0.0` is not equal to `0.0`, the same as `distinct()`/`toSet()` and the sibling `allEqual`.

^KT-30270 Fixed
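The floating-point semantics described here can be illustrated with a plain hash-set scan. This is a reference sketch under a hypothetical name, not the PR's implementation; it relies only on the boxed `equals`/`hashCode` behavior stated above:

```kotlin
// Hypothetical reference helper: distinctness decided by equals/hashCode.
fun <T> Iterable<T>.allDistinctReference(): Boolean {
    val seen = HashSet<T>()
    for (e in this) {
        if (!seen.add(e)) return false   // add returns false for an equal element
    }
    return true
}
```

Under these rules `listOf(Double.NaN, Double.NaN)` is not all-distinct, while `listOf(0.0, -0.0)` is, even though `0.0 == -0.0` holds for statically typed `Double` operands.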