#databricksdaily search results

How do you handle data skew with repartition()? If a single key is causing skew, I add a random salt (like floor(rand()*N)) to spread that key across multiple partitions. This balances the workload, reduces long-tail straggler tasks, and speeds up shuffles. #DatabricksDaily #Databricks
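The salting idea above can be shown with a plain-Python toy model of Spark's hash shuffle (not the PySpark API): a row goes to partition `hash(key) % num_partitions`, so one hot key pins every row to a single partition until a salt is appended. `NUM_PARTITIONS` and the salt-bucket count `N` are assumed values for illustration.

```python
import random

NUM_PARTITIONS = 8
N = 8  # number of salt buckets; an assumed value

def partition_for(key, num_partitions=NUM_PARTITIONS):
    # Toy model of Spark's hash partitioning for a shuffle key.
    return hash(key) % num_partitions

rows = [("hot_key", i) for i in range(1000)]  # every row shares one key

# Unsalted: all 1000 rows land in the same partition.
unsalted = {partition_for(key) for key, _ in rows}

# Salted: pair the key with a random salt 0..N-1 (the floor(rand()*N)
# trick from the post) so the hot key fans out across partitions.
random.seed(0)
salted = {partition_for((key, random.randrange(N))) for key, _ in rows}

print(len(unsalted), len(salted))  # 1 partition before, several after
```

One caveat worth remembering: if the salted column feeds a join, the other side has to be exploded with all N salt values so keys still match after salting.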


3/3 Too few partitions = slow, chunky tasks. Too many = pointless overhead. Balanced ones = beautiful pipeline runs. #DatabricksDaily #Databricks #DatabricksInterviewPrep #DatabricksPerformance


When is repartition(1) acceptable?
- Exporting small CSV/JSON to downstream systems
- Test data generation
- Creating a single audit/control file
#Databricks #DatabricksDaily #DatabricksBasics

What happens when you call repartition(1) before writing a table? Is it recommended? Calling repartition(1) forces Spark to shuffle all data across the cluster and combine it into a single partition. This means the final output will be written as a single file. It is like…
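Why a single partition means a single file can be sketched with a plain-Python toy model (not the Spark API): Spark-style writers emit one part file per partition, so collapsing to one partition leaves one task writing everything sequentially.

```python
def repartition(rows, n):
    # Full shuffle: round-robin every row into n partitions.
    parts = [[] for _ in range(n)]
    for i, row in enumerate(rows):
        parts[i % n].append(row)
    return parts

def write(partitions):
    # Writers emit one part file per non-empty partition.
    return [f"part-{i:05d}" for i, part in enumerate(partitions) if part]

rows = list(range(10_000))
print(write(repartition(rows, 1)))       # ['part-00000'] -> one file, one writer
print(len(write(repartition(rows, 8))))  # 8 files written in parallel
```

This is also why `coalesce(1)` is often preferred for small exports: it avoids the full shuffle that `repartition(1)` triggers, though both still funnel the final write through a single task.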



2/3 If the job has heavy joins/shuffles, I bump partitions up. If the dataset is tiny, I scale them down (no point having 800 partitions for 2GB). And honestly, AQE is a lifesaver: it fixes small/oversized partitions at runtime. #DatabricksDaily #Databricks
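The AQE behavior mentioned here is controlled by a handful of Spark SQL settings (Spark 3.x; recent Databricks runtimes enable them by default). A config sketch, assuming an existing SparkSession named `spark`:

```python
# Assumes an active SparkSession named `spark` (e.g. on Databricks).
spark.conf.set("spark.sql.adaptive.enabled", "true")
# Coalesce many small shuffle partitions into fewer, right-sized ones.
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")
# Split oversized (skewed) partitions during sort-merge joins.
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")
# Target size AQE aims for when coalescing/splitting (64MB by default in OSS Spark).
spark.conf.set("spark.sql.adaptive.advisoryPartitionSizeInBytes", "128m")
```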

