MarkLogic Connector for Apache Spark Now Available

Today, we are excited to announce the availability of the MarkLogic Connector for Apache Spark. Apache Spark has gained significant user adoption and is an important tool for complex data processing and analytics, especially when it involves machine learning and AI. By combining Spark with MarkLogic’s data persistence and governance capabilities, organizations can build a modern integration hub that is more consistent, powerful, and well-governed than Spark alone can provide.

To get started, users can download the MarkLogic Connector for Apache Spark here.

What is Apache Spark?

Apache Spark is an in-memory, distributed data processing engine for analytical applications, including machine learning, SQL, streaming, and graph processing. As a unified analytics tool, it is widely used by developers to build scalable data pipelines that span diverse data sources, including relational databases and NoSQL systems. Spark supports a variety of programming languages (such as Scala, Java, and Python), making it a tool of choice for data engineering and data science tasks.

Using Apache Spark with MarkLogic

While Apache Spark is widely used for analytical processing at scale, it does not include its own distributed data persistence layer. This is where MarkLogic Data Hub shines as a unified operational and analytical platform for integrating and managing heterogeneous data from multiple systems.

The combination of Apache Spark and MarkLogic enables organizations to modernize their data analytics infrastructure for faster time-to-insights while reducing cost and risk. Using the MarkLogic Connector for Apache Spark, developers can run Spark jobs for advanced analytics and machine learning directly on data in MarkLogic. This removes the ETL overhead that would otherwise be required when moving and wrangling data between separate operational and analytics systems. Instead, organizations can achieve a simpler architecture and speed up delivery of analytical applications that rely on durable data assets managed in a MarkLogic data hub.

Below are a few use cases for Spark with MarkLogic:

  • Scalable Data Ingestion:  The MarkLogic Connector for Apache Spark makes it easy to implement Spark jobs that load data as-is while tracking provenance, lineage, and other metadata. With readily available connectors to diverse data sources, Spark facilitates both batch and streaming ingestion. It also provides rich data transformation capabilities (such as joins, filters, and unions), so developers can cleanse and consolidate data from multiple source systems before loading it into MarkLogic. Once data is loaded, MarkLogic provides the capabilities needed to integrate, curate, and enrich source data into durable data assets for multiple use cases.
  • Advanced Analytics:  Spark provides a rich ecosystem of machine learning and predictive analytics libraries, such as MLlib. Using the MarkLogic Connector for Spark, developers can run advanced analytics and machine learning directly on the data in MarkLogic. They can also leverage MarkLogic’s multi-model querying capabilities to securely share fit-for-purpose data with Spark libraries (such as streaming, SQL, and machine learning) for analytical processing. Another benefit is that MarkLogic’s distributed design can easily scale compute capacity so Spark jobs can process vast amounts of data, which matters because processing capacity needs can fluctuate heavily in machine learning workloads.

The MarkLogic Connector for Spark is compatible with Spark’s DataSource API, providing a seamless developer experience. The connector returns data in MarkLogic as a Spark DataFrame that can be processed with Spark SQL and other Spark APIs. Developers can apply their existing skills, using Spark’s native libraries (such as SQL and machine learning) in a variety of programming languages (Java, Scala, Python) to build sophisticated analytics on top of MarkLogic.

Getting Started

The combination of MarkLogic and Spark provides major benefits for building intelligent analytical applications. The MarkLogic Connector for Spark helps organizations maximize the value of MarkLogic as the trusted source of durable data assets and Spark as the high-performance analytical framework.

To get started, follow along with the hands-on, step-by-step tutorial. To learn more about how you can configure the MarkLogic Connector for Apache Spark, please check out the documentation here. Apache Spark documentation is available here.
