This pattern shows you how to optimize the ingestion step of the extract, transform, and load (ETL) process for big data and Apache Spark workloads on AWS Glue by optimizing file size before you process your data. Use this pattern to prevent or resolve the small files problem, which arises when a large number of small files, taken together, slows down data processing. For example, hundreds of files that are only a few hundred kilobytes each can significantly slow down your AWS Glue jobs. This is because AWS Glue must perform internal list functions on Amazon Simple Storage Service (Amazon S3), and YARN (Yet Another Resource Negotiator) must store a large amount of metadata.
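As a rough illustration of optimizing file size before processing, the sketch below packs many small files into batches close to a target size. It also builds the S3 connection options that ask AWS Glue to group input files on read (`groupFiles` and `groupSize`). The helper names, the sample bucket path, and the 1 MiB target are hypothetical and chosen only for the example. The text above does not say which grouping method the pattern uses, so treat this as one possible approach under those assumptions.

```python
from typing import Dict, List, Tuple


def plan_file_groups(files: List[Tuple[str, int]], target_bytes: int) -> List[List[str]]:
    """Greedily pack (key, size_in_bytes) pairs into groups of at most
    roughly target_bytes each, preserving the input order.

    Hypothetical helper for illustration; not part of any AWS library.
    """
    groups: List[List[str]] = []
    current: List[str] = []
    current_size = 0
    for key, size in files:
        # Close the current group if adding this file would exceed the target.
        if current and current_size + size > target_bytes:
            groups.append(current)
            current, current_size = [], 0
        current.append(key)
        current_size += size
    if current:
        groups.append(current)
    return groups


def glue_grouping_options(paths: List[str], target_bytes: int) -> Dict[str, object]:
    """Build S3 connection options that ask AWS Glue to group small input files.

    "groupFiles" and "groupSize" are AWS Glue S3 connection options;
    groupSize is given in bytes, passed as a string. The paths are sample values.
    """
    return {
        "paths": list(paths),
        "groupFiles": "inPartition",
        "groupSize": str(target_bytes),
    }


if __name__ == "__main__":
    # Twelve hypothetical 200 KB objects, grouped toward a 1 MiB target.
    sample = [(f"raw/part-{i:04d}.json", 200 * 1024) for i in range(12)]
    for group in plan_file_groups(sample, 1024 * 1024):
        print(len(group), "files:", group[0], "..", group[-1])
    print(glue_grouping_options(["s3://example-bucket/raw/"], 1024 * 1024))
```

In an AWS Glue job, a dictionary like the one returned by `glue_grouping_options` would be passed as `connection_options` when reading from Amazon S3, so that each Spark task reads a group of files rather than a single small file. A realistic target is usually much larger than the 1 MiB used here.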