Description / Background
Dependabot's upgrade to the latest AWS SDK fails because of a failure in Hadoop.
Hadoop can't create an S3Client because it instantiates the AWS ApacheHttpClient directly, and in the latest SDK version that class no longer exists in the same package.
Steps to reproduce
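- Upgrade to the latest AWS SDK
- Run tests that access S3 through Hadoop, e.g. JavaCompactionRunnerLocalStackIT
- See the NoClassDefFoundError shown in the stack trace below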
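To confirm the cause before running the full test suite, a small classpath check can help. This is a minimal sketch: the helper class name ClasspathCheck is hypothetical, and the only name taken from this issue is the class reported in the stack trace.

```java
/** Reports whether a class can be loaded from the current classpath. */
final class ClasspathCheck {
    private ClasspathCheck() {
    }

    static boolean isPresent(String className) {
        try {
            // initialize=false: only locate the class, do not run its static initialisers
            Class.forName(className, false, ClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // The class Hadoop tries to load, taken from the stack trace in this issue
        String name = "software.amazon.awssdk.http.apache.ApacheHttpClient";
        System.out.println(name + " present: " + isPresent(name));
    }
}
```

Run with the same classpath as the failing test, this should report the class as absent after the SDK upgrade if the diagnosis above is right.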
Expected behaviour
The system should still be able to interact with S3 via Hadoop.
Technical Notes / Implementation Details
One option may be to give Hadoop an alternative factory for the S3 client. We can look into the documentation and the source code for the classes listed in the stack trace.
From the source code, Hadoop appears to read the property "fs.s3a.s3.client.factory.impl" to decide which client factory to create. The value needs to name a class that has a public constructor taking no arguments and implements the interface org.apache.hadoop.fs.s3a.S3ClientFactory. We should be able to implement that interface and set the property.
The Javadoc on the implementation should explain why it is necessary.
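The loading mechanism described above can be modelled without Hadoop on the classpath. The sketch below is a simplified illustration, not Hadoop's actual code: ClientFactory stands in for org.apache.hadoop.fs.s3a.S3ClientFactory, ReplacementClientFactory and FactoryLoadingSketch are hypothetical names, and a plain Map stands in for the Hadoop configuration. Only the property name comes from the notes above.

```java
import java.util.Map;

/** Simplified model of loading a factory class named by a configuration property. */
final class FactoryLoadingSketch {
    /** Property name taken from the notes in this issue. */
    static final String FACTORY_PROPERTY = "fs.s3a.s3.client.factory.impl";

    private FactoryLoadingSketch() {
    }

    /** Hypothetical stand-in for org.apache.hadoop.fs.s3a.S3ClientFactory. */
    public interface ClientFactory {
        String describe();
    }

    /**
     * Hypothetical replacement factory. In the real implementation, this Javadoc
     * would explain that the default factory fails with the latest AWS SDK.
     */
    public static class ReplacementClientFactory implements ClientFactory {
        /** Public no-argument constructor, required for reflective creation. */
        public ReplacementClientFactory() {
        }

        @Override
        public String describe() {
            return "replacement";
        }
    }

    /** Creates the factory named by the property, via its public no-argument constructor. */
    static ClientFactory load(Map<String, String> conf) {
        String className = conf.get(FACTORY_PROPERTY);
        if (className == null) {
            throw new IllegalArgumentException("Property not set: " + FACTORY_PROPERTY);
        }
        try {
            // asSubclass throws ClassCastException if the class does not implement the interface
            return Class.forName(className)
                    .asSubclass(ClientFactory.class)
                    .getConstructor()
                    .newInstance();
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Could not create factory " + className, e);
        }
    }

    public static void main(String[] args) {
        Map<String, String> conf =
                Map.of(FACTORY_PROPERTY, ReplacementClientFactory.class.getName());
        System.out.println(load(conf).describe());
    }
}
```

In the real configuration providers, the equivalent step would presumably be a call to Hadoop's Configuration.set with the property name and the fully qualified name of the new factory class.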
We'll need to set this in LocalStackHadoopConfigurationProvider, WiremockHadoopConfigurationProvider, and in HadoopConfigurationProvider in the Parquet module and the Trino module.
Screenshots/Logs
Stack trace from a test:
java.lang.NoClassDefFoundError: software/amazon/awssdk/http/apache/ApacheHttpClient
at org.apache.hadoop.fs.s3a.impl.AWSClientConfig.createHttpClientBuilder(AWSClientConfig.java:147)
at org.apache.hadoop.fs.s3a.DefaultS3ClientFactory.createS3Client(DefaultS3ClientFactory.java:129)
at org.apache.hadoop.fs.s3a.impl.ClientManagerImpl.lambda$createS3Client$0(ClientManagerImpl.java:118)
at org.apache.hadoop.fs.statistics.impl.IOStatisticsBinding.invokeTrackingDuration(IOStatisticsBinding.java:547)
at org.apache.hadoop.fs.statistics.impl.IOStatisticsBinding.lambda$trackDurationOfOperation$5(IOStatisticsBinding.java:528)
at org.apache.hadoop.util.functional.LazyAtomicReference.eval(LazyAtomicReference.java:94)
at org.apache.hadoop.util.functional.LazyAutoCloseableReference.eval(LazyAutoCloseableReference.java:54)
at org.apache.hadoop.fs.s3a.impl.ClientManagerImpl.getOrCreateS3Client(ClientManagerImpl.java:148)
at org.apache.hadoop.fs.s3a.impl.S3AStoreImpl.getOrCreateS3Client(S3AStoreImpl.java:232)
at org.apache.hadoop.fs.s3a.S3AFileSystem.initialize(S3AFileSystem.java:796)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:3615)
at org.apache.hadoop.fs.FileSystem.access$300(FileSystem.java:172)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:3716)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:3667)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:557)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:366)
at org.apache.parquet.hadoop.ParquetReader$Builder.build(ParquetReader.java:415)
at sleeper.compaction.job.execution.JavaCompactionRunner.createInputIterators(JavaCompactionRunner.java:140)
at sleeper.compaction.job.execution.JavaCompactionRunner.compact(JavaCompactionRunner.java:78)
at sleeper.compaction.core.task.CompactionTask.compact(CompactionTask.java:245)
at sleeper.compaction.core.task.CompactionTask.processCompactionMessage(CompactionTask.java:195)
at sleeper.compaction.core.task.CompactionTask.handleMessages(CompactionTask.java:157)
at sleeper.compaction.core.task.CompactionTask.run(CompactionTask.java:133)
at sleeper.compaction.core.task.CompactionTaskTestHelper.runTask(CompactionTaskTestHelper.java:96)
at sleeper.compaction.core.task.CompactionTaskTestHelper.runTask(CompactionTaskTestHelper.java:86)
at sleeper.compaction.job.execution.testutils.CompactionRunnerTestBase.runTask(CompactionRunnerTestBase.java:105)
at sleeper.compaction.job.execution.testutils.CompactionRunnerTestBase.runTask(CompactionRunnerTestBase.java:98)
at sleeper.compaction.job.execution.JavaCompactionRunnerLocalStackIT.shouldRunCompactionJob(JavaCompactionRunnerLocalStackIT.java:96)
at java.base/java.lang.reflect.Method.invoke(Method.java:569)
at java.base/java.util.ArrayList.forEach(ArrayList.java:1511)
at java.base/java.util.ArrayList.forEach(ArrayList.java:1511)
Caused by: java.lang.ClassNotFoundException: software.amazon.awssdk.http.apache.ApacheHttpClient
at java.base/jdk.internal.loader.BuiltinClassLoader.loadClass(BuiltinClassLoader.java:641)
at java.base/jdk.internal.loader.ClassLoaders$AppClassLoader.loadClass(ClassLoaders.java:188)
at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:525)
... 31 more