Google’s potential HubSpot deal likely to spark fresh antitrust scrutiny

Google’s potential HubSpot acquisition is expected to face resistance from both U.S. and European antitrust regulators. Although Google hasn’t officially bid to acquire the online marketing software company, valued at $35 billion, experts are already debating the competitive impact of such a move. While some argue it wouldn’t significantly limit competition, many believe that such a … Read more

Spark: Persistence Storage Levels / Blogs / Perficient

Persistence in Spark is an optimization technique that saves the results of RDD evaluation. Spark provides a convenient way to work with a dataset by keeping it available across multiple operations. When you persist a dataset, Spark stores the data in memory, on disk, or a combination of the two, so that it can be … Read more
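A minimal sketch of persistence in practice, assuming a local SparkSession; the object and variable names are illustrative. `MEMORY_AND_DISK` keeps partitions in memory and spills to disk when memory runs short:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

object PersistExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("PersistExample").master("local[*]").getOrCreate()
    val sc = spark.sparkContext

    val rdd = sc.parallelize(1 to 1000).map(_ * 2)

    // Mark the RDD for caching; partitions spill to disk if memory is short.
    rdd.persist(StorageLevel.MEMORY_AND_DISK)

    // The first action materializes and caches the RDD; the second reuses it
    // instead of recomputing the map from scratch.
    val total = rdd.sum()
    val count = rdd.count()
    println(s"sum=$total count=$count")

    rdd.unpersist() // release cached partitions when no longer needed
    spark.stop()
  }
}
```

Other levels (`MEMORY_ONLY`, `DISK_ONLY`, serialized and replicated variants) trade recomputation cost against memory pressure.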

Spark: Parser Modes / Blogs / Perficient

Apache Spark is a powerful open-source distributed computing system widely used for big data processing and analytics. When working with structured data, one common challenge is dealing with parsing errors—malformed or corrupted records that can hinder data processing. Spark provides flexibility in handling these issues through parser modes, allowing users to choose the behavior that … Read more
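A short sketch of the three parser modes when reading CSV, assuming a local SparkSession; the tiny in-memory "file" and its malformed row are invented for illustration. `PERMISSIVE` (the default) keeps bad rows with nulls, `DROPMALFORMED` discards them, and `FAILFAST` throws on the first bad record:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types._

object ParserModeExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("ParserModes").master("local[*]").getOrCreate()
    import spark.implicits._

    // Two rows, one of which has a non-numeric age (malformed for the schema).
    val lines = Seq("name,age", "alice,30", "bob,notanumber").toDS()

    val schema = StructType(Seq(
      StructField("name", StringType),
      StructField("age", IntegerType)))

    // DROPMALFORMED silently discards rows that do not match the schema.
    val df = spark.read
      .option("header", "true")
      .option("mode", "DROPMALFORMED")
      .schema(schema)
      .csv(lines)

    val kept = df.count() // only the well-formed row survives
    println(s"rows kept: $kept")
    spark.stop()
  }
}
```

Switching the `mode` option to `PERMISSIVE` or `FAILFAST` changes only the reader's behavior, not the schema.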

Spark: Dataframe joins / Blogs / Perficient

In Apache Spark, DataFrame joins are operations that allow you to combine two DataFrames based on a common column or set of columns. Join operations are fundamental for data analysis and manipulation, particularly when dealing with distributed and large-scale datasets. Spark provides a rich set of APIs for performing various types of DataFrame joins.  Import … Read more
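A minimal sketch of two common join types, assuming a local SparkSession; the `employees` and `departments` DataFrames and their columns are invented for illustration:

```scala
import org.apache.spark.sql.SparkSession

object JoinExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("Joins").master("local[*]").getOrCreate()
    import spark.implicits._

    val employees = Seq((1, "alice", 10), (2, "bob", 20), (3, "carol", 30))
      .toDF("id", "name", "dept_id")
    val departments = Seq((10, "sales"), (20, "engineering"))
      .toDF("dept_id", "dept_name")

    // Inner join on the shared dept_id column; carol (dept 30) has no match
    // and is dropped.
    val inner = employees.join(departments, Seq("dept_id"), "inner")

    // Left outer join keeps every employee, with null dept_name where unmatched.
    val left = employees.join(departments, Seq("dept_id"), "left_outer")

    val innerCount = inner.count()
    val leftCount = left.count()
    println(s"inner=$innerCount left=$leftCount")
    spark.stop()
  }
}
```

The same `join` API accepts `right_outer`, `full_outer`, `left_semi`, and `left_anti` as the join-type string.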

Spark Scala: Approaches toward creating Dataframe

In Spark with Scala, creating DataFrames is fundamental for data manipulation and analysis. There are several approaches to creating DataFrames, each with its own advantages. You can create DataFrames from various data sources such as CSV or JSON, or from existing RDDs (Resilient Distributed Datasets). In this blog, we will look at some approaches to creating DataFrames … Read more
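A sketch of two of the approaches mentioned above, assuming a local SparkSession; the sample data is invented. The commented-out third approach shows the file-based reader with a placeholder path:

```scala
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types._

object CreateDataFrameExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("CreateDF").master("local[*]").getOrCreate()
    import spark.implicits._

    // 1. From a local collection with toDF; the schema is inferred
    //    from the tuple element types.
    val df1 = Seq(("alice", 30), ("bob", 25)).toDF("name", "age")

    // 2. From an RDD of Rows with an explicit schema.
    val schema = StructType(Seq(
      StructField("name", StringType),
      StructField("age", IntegerType)))
    val rowRdd = spark.sparkContext.parallelize(Seq(Row("carol", 41)))
    val df2 = spark.createDataFrame(rowRdd, schema)

    // 3. From an external source such as CSV (path is a placeholder):
    // val df3 = spark.read.option("header", "true").csv("path/to/file.csv")

    val combined = df1.union(df2).count()
    println(s"total rows: $combined")
    spark.stop()
  }
}
```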

Spark Partition: An Overview / Blogs / Perficient

In Apache Spark, efficient data management is essential for maximizing performance in distributed computing. Partitioning, repartitioning, and coalescing govern how data is organized and distributed across the cluster. Partitioning divides a dataset into smaller chunks, enabling parallel processing and optimizing operations. Repartitioning redistributes data across partitions, adjusting the balance for more … Read more
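A minimal sketch of the distinction between `repartition` and `coalesce`, assuming a local SparkSession; the partition counts are arbitrary example values:

```scala
import org.apache.spark.sql.SparkSession

object PartitionExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("Partitions").master("local[4]").getOrCreate()
    val sc = spark.sparkContext

    // Start with an RDD explicitly split into 8 partitions.
    val rdd = sc.parallelize(1 to 100, numSlices = 8)
    val before = rdd.getNumPartitions

    // repartition can increase or decrease the partition count,
    // but always triggers a full shuffle.
    val more = rdd.repartition(16)

    // coalesce only decreases the count and avoids a shuffle where possible,
    // making it the cheaper choice when shrinking (e.g. before writing output).
    val fewer = more.coalesce(4)
    val after = fewer.getNumPartitions

    println(s"before=$before after=$after")
    spark.stop()
  }
}
```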

Spark RDD Operations

A comprehensive understanding of Spark’s transformations and actions is crucial for writing efficient Spark code. This blog provides a glimpse of these fundamental aspects of Spark. Before we dive into transformations and actions, let us take a quick look at RDDs and DataFrames. Resilient Distributed Dataset (RDD): Usually, Spark tasks operate on RDDs, which is … Read more
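The transformation/action split can be sketched in a few lines, assuming a local SparkSession; the numbers are arbitrary example data. Transformations (`filter`, `map`) are lazy and only build a lineage graph; actions (`collect`, `reduce`) trigger execution:

```scala
import org.apache.spark.sql.SparkSession

object RddOpsExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("RddOps").master("local[*]").getOrCreate()
    val sc = spark.sparkContext

    val nums = sc.parallelize(1 to 10)

    // Transformations are lazy: nothing executes yet.
    val evensDoubled = nums.filter(_ % 2 == 0).map(_ * 2)

    // Actions trigger the computation and return results to the driver.
    val collected = evensDoubled.collect() // Array(4, 8, 12, 16, 20)
    val total = evensDoubled.reduce(_ + _) // 60

    println(collected.mkString(", "))
    println(total)
    spark.stop()
  }
}
```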
