This show goes behind the scenes of the tools, techniques, and difficulties of the data engineering discipline. Databases, workflows, automation, and data manipulation are just some of the topics you will find here.
Find Out About The Technology Behind The Latest PFAD In Analytical Database Development
Summary
Building a database engine requires a substantial investment of engineering effort and time. Over the decades of research and development that have gone into these software systems, a number of common components have emerged that are shared across implementations. When Paul Dix decided to re-write the InfluxDB engine, he found the Apache Arrow ecosystem ready and waiting with useful building blocks to accelerate the process. In this episode he explains how he used the combination of Apache Arrow, Flight, DataFusion, and Parquet to lay the foundation of the newest version of his time-series database.
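For listeners who want a feel for how those components slot together, here is a minimal sketch using the Python bindings for Arrow, Parquet, and DataFusion. It assumes a recent release of the datafusion package; the file name, schema, and query are purely illustrative and are not drawn from InfluxDB's codebase (which is written in Rust).

```python
# Minimal sketch of the FDAP building blocks: Arrow in memory, Parquet on disk,
# DataFusion for SQL execution. All names here are illustrative.
from datafusion import SessionContext  # DataFusion query engine (Python bindings)
import pyarrow as pa                   # Arrow in-memory columnar format
import pyarrow.parquet as pq           # Parquet on-disk columnar format

# Write a small Arrow table to Parquet to stand in for persisted time-series data
table = pa.table({
    "time": pa.array([1, 2, 3], type=pa.int64()),
    "temperature": [21.5, 22.0, 21.8],
})
pq.write_table(table, "measurements.parquet")

# DataFusion plans and executes SQL directly over the Parquet file,
# producing Arrow record batches as results
ctx = SessionContext()
ctx.register_parquet("measurements", "measurements.parquet")
batches = ctx.sql("SELECT avg(temperature) AS avg_temp FROM measurements").collect()
print(batches[0].to_pydict())
```

In the architecture discussed in the episode, Arrow Flight would then serve as the transport for shipping those record batches to clients over the network.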
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
Dagster offers a new approach to building and running data platforms and data pipelines. It is an open-source, cloud-native orchestrator for the whole development lifecycle, with integrated lineage and observability, a declarative programming model, and best-in-class testability. Your team can get up and running in minutes thanks to Dagster Cloud, an enterprise-class hosted solution that offers serverless and hybrid deployments, enhanced security, and on-demand ephemeral test deployments. Go to dataengineeringpodcast.com/dagster (https://www.dataengineeringpodcast.com/dagster) today to get started. Your first 30 days are free!
Data lakes are notoriously complex. For data engineers who battle to build and scale high quality data workflows on the data lake, Starburst powers petabyte-scale SQL analytics fast, at a fraction of the cost of traditional methods, so that you can meet all your data needs ranging from AI to data applications to complete analytics. Trusted by teams of all sizes, including Comcast and Doordash, Starburst is a data lake analytics platform that delivers the adaptability and flexibility a lakehouse ecosystem promises. And Starburst does all of this on an open architecture with first-class support for Apache Iceberg, Delta Lake and Hudi, so you always maintain ownership of your data. Want to see Starburst in action? Go to dataengineeringpodcast.com/starburst (https://www.dataengineeringpodcast.com/starburst) and get $500 in credits to try Starburst Galaxy today, the easiest and fastest way to get started using Trino.
Join us at the top event for the global data community, Data Council Austin. From March 26-28th 2024, we'll play host to hundreds of attendees, 100 top speakers and dozens of startups that are advancing data science, engineering and AI. Data Council attendees are amazing founders, data scientists, lead engineers, CTOs, heads of data, investors and community organizers who are all working together to build the future of data and sharing their insights and learnings through deeply technical talks. As a listener to the Data Engineering Podcast you can get a special discount off regular-priced and late-bird tickets by using the promo code dataengpod20. Don't miss out on our only event this year! Visit dataengineeringpodcast.com/data-council (https://www.dataengineeringpodcast.com/data-council) and use code dataengpod20 to register today!
Your host is Tobias Macey and today I'm interviewing Paul Dix about his investment in the Apache Arrow ecosystem and how it led him to create the latest PFAD in database design
Interview
Introduction
How did you get involved in the area of data management?
Can you start by describing the FDAP stack and how the components combine to provide a foundational architecture for database engines?
This was the core of your recent re-write of the InfluxDB engine. What were the design goals and constraints that led you to this architecture?
Each of the architectural components are well engineered for their particular scope. What is the engineering work that is involved in building a cohesive platform from those components?
One of the major benefits of using open source components is the network effect of ecosystem integrations. That can also be a risk when the community vision for the project doesn't align with your own goals. How have you worked to mitigate that risk in your specific platform?
Can you describe the operational/architectural aspects of building a full data engine on top of the FDAP stack?
What are the elements of the overall product/user experience that you had to build to create a cohesive platform?
What are some of the other tools/technologies that can benefit from some or all of the pieces of the FDAP stack?
What are the pieces of the Arrow ecosystem that are still immature or need further investment from the community?
What are the most interesting, innovative, or unexpected ways that you have seen parts or all of the FDAP stack used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on/with the FDAP stack?
When is the FDAP stack the wrong choice?
What do you have planned for the future of the InfluxDB IOx engine and the FDAP stack?
Contact Info
LinkedIn (https://www.linkedin.com/in/pauldix/)
pauldix (https://github.com/pauldix) on GitHub
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ (https://www.pythonpodcast.com) covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast (https://www.themachinelearningpodcast.com) helps you go from idea to production with machine learning.
Visit the site (https://www.dataengineeringpodcast.com) to subscribe to the show, sign up for the mailing list, and read the show notes.
If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com (mailto:hosts@dataengineeringpodcast.com) with your story.
Links
FDAP Stack Blog Post (https://www.influxdata.com/blog/flight-datafusion-arrow-parquet-fdap-architecture-influxdb/)
Apache Arrow (https://arrow.apache.org/)
DataFusion (https://arrow.apache.org/datafusion/)
Arrow Flight (https://arrow.apache.org/docs/format/Flight.html)
Apache Parquet (https://parquet.apache.org/)
InfluxDB (https://www.influxdata.com/products/influxdb/)
Influx Data (https://www.influxdata.com/)
Podcast Episode (https://www.dataengineeringpodcast.com/influxdb-timeseries-data-platform-episode-199)
Rust Language (https://www.rust-lang.org/)
DuckDB (https://duckdb.org/)
ClickHouse (https://clickhouse.com/)
Voltron Data (https://voltrondata.com/)
Podcast Episode (https://www.dataengineeringpodcast.com/voltron-data-apache-arrow-episode-346/)
Velox (https://github.com/facebookincubator/velox)
Iceberg (https://iceberg.apache.org/)
Podcast Episode (https://www.dataengineeringpodcast.com/iceberg-with-ryan-blue-episode-52/)
Trino (https://trino.io/)
ODBC == Open DataBase Connectivity (https://en.wikipedia.org/wiki/Open_Database_Connectivity)
GeoParquet (https://github.com/opengeospatial/geoparquet)
ORC == Optimized Row Columnar (https://orc.apache.org/)
Avro (https://avro.apache.org/)
Protocol Buffers (https://protobuf.dev/)
gRPC (https://grpc.io/)
The intro and outro music is from The Hug (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/Love_death_and_a_drunken_monkey/04_-_The_Hug) by The Freak Fandango Orchestra (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/) / CC BY-SA (http://creativecommons.org/licenses/by-sa/3.0/)
2/25/2024 • 56 minutes
Using Trino And Iceberg As The Foundation Of Your Data Lakehouse
Summary
A data lakehouse is intended to combine the benefits of data lakes (cost-effective, scalable storage and compute) and data warehouses (user-friendly SQL interface). Multiple open source projects and vendors have been working together to make this vision a reality. In this episode Dain Sundstrom, CTO of Starburst, explains how the combination of the Trino query engine and the Iceberg table format offers the ease of use and execution speed of data warehouses with the infinite storage and scalability of data lakes.
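As a rough illustration of the pattern Dain describes, the sketch below uses the trino Python client to create and query an Iceberg table through a Trino catalog. The host, catalog, schema, and table names are placeholders rather than anything from Starburst's documentation, and a real deployment would need an Iceberg catalog and object storage configured behind the connector.

```python
# Hedged sketch of the Trino + Iceberg lakehouse pattern. Connection details,
# catalog, and table names are made up for illustration only.
import trino

conn = trino.dbapi.connect(
    host="trino.example.com",  # hypothetical coordinator address
    port=8080,
    user="analyst",
    catalog="iceberg",         # an Iceberg connector configured in Trino
    schema="analytics",
)
cur = conn.cursor()

# Iceberg supplies ACID table semantics on object storage; Trino executes the SQL
cur.execute("""
    CREATE TABLE IF NOT EXISTS page_views (
        user_id BIGINT,
        url VARCHAR,
        viewed_at TIMESTAMP(6) WITH TIME ZONE
    ) WITH (format = 'PARQUET')
""")
cur.execute(
    "SELECT url, count(*) AS views FROM page_views GROUP BY url ORDER BY views DESC LIMIT 10"
)
for row in cur.fetchall():
    print(row)
```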
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
Dagster offers a new approach to building and running data platforms and data pipelines. It is an open-source, cloud-native orchestrator for the whole development lifecycle, with integrated lineage and observability, a declarative programming model, and best-in-class testability. Your team can get up and running in minutes thanks to Dagster Cloud, an enterprise-class hosted solution that offers serverless and hybrid deployments, enhanced security, and on-demand ephemeral test deployments. Go to dataengineeringpodcast.com/dagster (https://www.dataengineeringpodcast.com/dagster) today to get started. Your first 30 days are free!
Data lakes are notoriously complex. For data engineers who battle to build and scale high quality data workflows on the data lake, Starburst powers petabyte-scale SQL analytics fast, at a fraction of the cost of traditional methods, so that you can meet all your data needs ranging from AI to data applications to complete analytics. Trusted by teams of all sizes, including Comcast and Doordash, Starburst is a data lake analytics platform that delivers the adaptability and flexibility a lakehouse ecosystem promises. And Starburst does all of this on an open architecture with first-class support for Apache Iceberg, Delta Lake and Hudi, so you always maintain ownership of your data. Want to see Starburst in action? Go to dataengineeringpodcast.com/starburst (https://www.dataengineeringpodcast.com/starburst) and get $500 in credits to try Starburst Galaxy today, the easiest and fastest way to get started using Trino.
Join us at the event for the global data community, Data Council Austin. From March 26th-28th 2024, they'll play host to hundreds of attendees, 100 top speakers, and dozens of startups that are advancing data science, engineering and AI. Data Council attendees are amazing founders, data scientists, lead engineers, CTOs, heads of data, investors and community organizers who are all working together to build the future of data. As a listener to the Data Engineering Podcast you can get a special discount of 20% off your ticket by using the promo code dataengpod20. Don't miss out on their only event this year! Visit dataengineeringpodcast.com/data-council (https://www.dataengineeringpodcast.com/data-council) today.
Your host is Tobias Macey and today I'm interviewing Dain Sundstrom about building a data lakehouse with Trino and Iceberg
Interview
Introduction
How did you get involved in the area of data management?
To start, can you share your definition of what constitutes a "Data Lakehouse"?
What are the technical/architectural/UX challenges that have hindered the progression of lakehouses?
What are the notable advancements in recent months/years that make them a more viable platform choice?
There are multiple tools and vendors that have adopted the "data lakehouse" terminology. What are the benefits offered by the combination of Trino and Iceberg?
What are the key points of comparison for that combination in relation to other possible selections?
What are the pain points that are still prevalent in lakehouse architectures as compared to warehouse or vertically integrated systems?
What progress is being made (within or across the ecosystem) to address those sharp edges?
For someone who is interested in building a data lakehouse with Trino and Iceberg, how does that influence their selection of other platform elements?
What are the differences in terms of pipeline design/access and usage patterns when using a Trino/Iceberg lakehouse as compared to other popular warehouse/lakehouse structures?
What are the most interesting, innovative, or unexpected ways that you have seen Trino lakehouses used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on the data lakehouse ecosystem?
When is a lakehouse the wrong choice?
What do you have planned for the future of Trino/Starburst?
Contact Info
LinkedIn (https://www.linkedin.com/in/dainsundstrom/)
dain (https://github.com/dain) on GitHub
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ (https://www.pythonpodcast.com) covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast (https://www.themachinelearningpodcast.com) helps you go from idea to production with machine learning.
Visit the site (https://www.dataengineeringpodcast.com) to subscribe to the show, sign up for the mailing list, and read the show notes.
If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com (mailto:hosts@dataengineeringpodcast.com) with your story.
Links
Trino (https://trino.io/)
Starburst (https://www.starburst.io/)
Presto (https://prestodb.io/)
JBoss (https://en.wikipedia.org/wiki/JBoss_Enterprise_Application_Platform)
Java EE (https://www.oracle.com/java/technologies/java-ee-glance.html)
HDFS (https://hadoop.apache.org/docs/r1.2.1/hdfs_design.html)
S3 (https://aws.amazon.com/s3/)
GCS == Google Cloud Storage (https://cloud.google.com/storage?hl=en)
Hive (https://hive.apache.org/)
Hive ACID (https://cwiki.apache.org/confluence/display/hive/hive+transactions)
Apache Ranger (https://ranger.apache.org/)
OPA == Open Policy Agent (https://www.openpolicyagent.org/)
Oso (https://www.osohq.com/)
AWS Lakeformation (https://aws.amazon.com/lake-formation/)
Tabular (https://tabular.io/)
Iceberg (https://iceberg.apache.org/)
Podcast Episode (https://www.dataengineeringpodcast.com/iceberg-with-ryan-blue-episode-52/)
Delta Lake (https://delta.io/)
Podcast Episode (https://www.dataengineeringpodcast.com/delta-lake-data-lake-episode-85/)
Debezium (https://debezium.io/)
Podcast Episode (https://www.dataengineeringpodcast.com/debezium-change-data-capture-episode-114)
Materialized View (https://en.wikipedia.org/wiki/Materialized_view)
Clickhouse (https://clickhouse.com/)
Druid (https://druid.apache.org/)
Hudi (https://hudi.apache.org/)
Podcast Episode (https://www.dataengineeringpodcast.com/hudi-streaming-data-lake-episode-209)
The intro and outro music is from The Hug (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/Love_death_and_a_drunken_monkey/04_-_The_Hug) by The Freak Fandango Orchestra (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/) / CC BY-SA (http://creativecommons.org/licenses/by-sa/3.0/)
2/18/2024 • 58 minutes, 46 seconds
Data Sharing Across Business And Platform Boundaries
Summary
Sharing data is a simple concept, but complicated to implement well. There are numerous business rules and regulatory concerns that need to be applied. There are also numerous technical considerations to be made, particularly if the producer and consumer of the data aren't using the same platforms. In this episode Andrew Jefferson explains the complexities of building a robust system for data sharing, the techno-social considerations, and how the Bobsled platform that he is building aims to simplify the process.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
Data lakes are notoriously complex. For data engineers who battle to build and scale high quality data workflows on the data lake, Starburst powers petabyte-scale SQL analytics fast, at a fraction of the cost of traditional methods, so that you can meet all your data needs ranging from AI to data applications to complete analytics. Trusted by teams of all sizes, including Comcast and Doordash, Starburst is a data lake analytics platform that delivers the adaptability and flexibility a lakehouse ecosystem promises. And Starburst does all of this on an open architecture with first-class support for Apache Iceberg, Delta Lake and Hudi, so you always maintain ownership of your data. Want to see Starburst in action? Go to dataengineeringpodcast.com/starburst (https://www.dataengineeringpodcast.com/starburst) and get $500 in credits to try Starburst Galaxy today, the easiest and fastest way to get started using Trino.
Dagster offers a new approach to building and running data platforms and data pipelines. It is an open-source, cloud-native orchestrator for the whole development lifecycle, with integrated lineage and observability, a declarative programming model, and best-in-class testability. Your team can get up and running in minutes thanks to Dagster Cloud, an enterprise-class hosted solution that offers serverless and hybrid deployments, enhanced security, and on-demand ephemeral test deployments. Go to dataengineeringpodcast.com/dagster (https://www.dataengineeringpodcast.com/dagster) today to get started. Your first 30 days are free!
Your host is Tobias Macey and today I'm interviewing Andy Jefferson about how to solve the problem of data sharing
Interview
Introduction
How did you get involved in the area of data management?
Can you start by giving some context and scope of what we mean by "data sharing" for the purposes of this conversation?
What is the current state of the ecosystem for data sharing protocols/practices/platforms?
What are some of the main challenges/shortcomings that teams/organizations experience with these options?
What are the technical capabilities that need to be present for an effective data sharing solution?
How does that change as a function of the type of data? (e.g. tabular, image, etc.)
What are the requirements around governance and auditability of data access that need to be addressed when sharing data?
What are the typical boundaries along which data access requires special consideration for how the sharing is managed?
Many data platform vendors have their own interfaces for data sharing. What are the shortcomings of those options, and what are the opportunities for abstracting the sharing capability from the underlying platform?
What are the most interesting, innovative, or unexpected ways that you have seen data sharing/Bobsled used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on data sharing?
When is Bobsled the wrong choice?
What do you have planned for the future of data sharing?
Contact Info
LinkedIn (https://www.linkedin.com/in/andyjefferson/?originalSubdomain=de)
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ (https://www.pythonpodcast.com) covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast (https://www.themachinelearningpodcast.com) helps you go from idea to production with machine learning.
Visit the site (https://www.dataengineeringpodcast.com) to subscribe to the show, sign up for the mailing list, and read the show notes.
If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com (mailto:hosts@dataengineeringpodcast.com) with your story.
Links
Bobsled (https://www.bobsled.co/)
OLAP == OnLine Analytical Processing (https://en.wikipedia.org/wiki/Online_analytical_processing)
Cassandra (https://cassandra.apache.org/_/index.html)
Podcast Episode (https://www.dataengineeringpodcast.com/cassandra-global-scale-database-episode-220)
Neo4J (https://neo4j.com/)
FTP == File Transfer Protocol (https://en.wikipedia.org/wiki/File_Transfer_Protocol)
S3 Access Points (https://aws.amazon.com/s3/features/access-points/)
Snowflake Sharing (https://docs.snowflake.com/en/guides-overview-sharing)
BigQuery Sharing (https://cloud.google.com/bigquery/docs/authorized-datasets)
Databricks Delta Sharing (https://www.databricks.com/product/delta-sharing)
DuckDB (https://duckdb.org/)
Podcast Episode (https://www.dataengineeringpodcast.com/duckdb-in-process-olap-database-episode-270/)
The intro and outro music is from The Hug (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/Love_death_and_a_drunken_monkey/04_-_The_Hug) by The Freak Fandango Orchestra (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/) / CC BY-SA (http://creativecommons.org/licenses/by-sa/3.0/)
2/11/2024 • 59 minutes, 55 seconds
Tackling Real Time Streaming Data With SQL Using RisingWave
Summary
Stream processing systems have long been built with a code-first design, adding SQL as a layer on top of the existing framework. RisingWave is a database engine that was created specifically for stream processing, with S3 as the storage layer. In this episode Yingjun Wu explains how it is architected to power analytical workflows on continuous data flows, and the challenges of making it responsive and scalable.
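RisingWave exposes a Postgres-compatible interface, so a regular Postgres driver is enough to try the incremental materialized view model described in the episode. The sketch below is a simplified, assumption-laden example: the connection settings mirror RisingWave's local quickstart defaults, a plain table stands in for what would usually be a Kafka or CDC source, and the exact DDL accepted can vary by version.

```python
# Simplified sketch (not taken from the RisingWave docs): connection defaults
# are assumed from the local quickstart, and a table stands in for a stream source.
import psycopg2

conn = psycopg2.connect(host="localhost", port=4566, user="root", dbname="dev")
conn.autocommit = True
cur = conn.cursor()

# Base table standing in for a streaming source (Kafka, CDC, etc.)
cur.execute("CREATE TABLE sensor_events (sensor_id INT, reading DOUBLE PRECISION)")

# The materialized view is maintained incrementally as new rows arrive
cur.execute("""
    CREATE MATERIALIZED VIEW avg_reading AS
    SELECT sensor_id, avg(reading) AS avg_reading
    FROM sensor_events
    GROUP BY sensor_id
""")

cur.execute("INSERT INTO sensor_events VALUES (1, 20.5), (1, 21.5), (2, 18.0)")

# The view is updated by the streaming engine, so results may lag inserts briefly
cur.execute("SELECT * FROM avg_reading ORDER BY sensor_id")
print(cur.fetchall())
```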
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
Data lakes are notoriously complex. For data engineers who battle to build and scale high quality data workflows on the data lake, Starburst powers petabyte-scale SQL analytics fast, at a fraction of the cost of traditional methods, so that you can meet all your data needs ranging from AI to data applications to complete analytics. Trusted by teams of all sizes, including Comcast and Doordash, Starburst is a data lake analytics platform that delivers the adaptability and flexibility a lakehouse ecosystem promises. And Starburst does all of this on an open architecture with first-class support for Apache Iceberg, Delta Lake and Hudi, so you always maintain ownership of your data. Want to see Starburst in action? Go to dataengineeringpodcast.com/starburst (https://www.dataengineeringpodcast.com/starburst) and get $500 in credits to try Starburst Galaxy today, the easiest and fastest way to get started using Trino.
Dagster offers a new approach to building and running data platforms and data pipelines. It is an open-source, cloud-native orchestrator for the whole development lifecycle, with integrated lineage and observability, a declarative programming model, and best-in-class testability. Your team can get up and running in minutes thanks to Dagster Cloud, an enterprise-class hosted solution that offers serverless and hybrid deployments, enhanced security, and on-demand ephemeral test deployments. Go to dataengineeringpodcast.com/dagster (https://www.dataengineeringpodcast.com/dagster) today to get started. Your first 30 days are free!
Your host is Tobias Macey and today I'm interviewing Yingjun Wu about the RisingWave database and the intricacies of building a stream processing engine on S3
Interview
Introduction
How did you get involved in the area of data management?
Can you describe what RisingWave is and the story behind it?
There are numerous stream processing engines, near-real-time database engines, streaming SQL systems, etc. What is the specific niche that RisingWave addresses?
What are some of the platforms/architectures that teams are replacing with RisingWave?
What are some of the unique capabilities/use cases that RisingWave provides over other offerings in the current ecosystem?
Can you describe how RisingWave is architected and implemented?
How have the design and goals/scope changed since you first started working on it?
What are the core design philosophies that you rely on to prioritize the ongoing development of the project?
What are the most complex engineering challenges that you have had to address in the creation of RisingWave?
Can you describe a typical workflow for teams that are building on top of RisingWave?
What are the user/developer experience elements that you have prioritized most highly?
What are the situations where RisingWave can/should be a system of record vs. a point-in-time view of data in transit, with a data warehouse/lakehouse as the longitudinal storage and query engine?
What are the most interesting, innovative, or unexpected ways that you have seen RisingWave used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on RisingWave?
When is RisingWave the wrong choice?
What do you have planned for the future of RisingWave?
Contact Info
yingjunwu (https://github.com/yingjunwu) on GitHub
Personal Website (https://yingjunwu.github.io/)
LinkedIn (https://www.linkedin.com/in/yingjun-wu-4b584536/)
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ (https://www.pythonpodcast.com) covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast (https://www.themachinelearningpodcast.com) helps you go from idea to production with machine learning.
Visit the site (https://www.dataengineeringpodcast.com) to subscribe to the show, sign up for the mailing list, and read the show notes.
If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com (mailto:hosts@dataengineeringpodcast.com) with your story.
Links
RisingWave (https://risingwave.com/)
AWS Redshift (https://aws.amazon.com/redshift/)
Flink (https://flink.apache.org/)
Podcast Episode (https://www.dataengineeringpodcast.com/apache-flink-with-fabian-hueske-episode-57)
Clickhouse (https://clickhouse.com/)
Podcast Episode (https://www.dataengineeringpodcast.com/clickhouse-data-warehouse-episode-88/)
Druid (https://druid.apache.org/)
Materialize (https://materialize.com/)
Spark (https://spark.apache.org/)
Trino (https://trino.io/)
Snowflake (https://www.snowflake.com/en/)
Kafka (https://kafka.apache.org/)
Iceberg (https://iceberg.apache.org/)
Podcast Episode (https://www.dataengineeringpodcast.com/iceberg-with-ryan-blue-episode-52/)
Hudi (https://hudi.apache.org/)
Podcast Episode (https://www.dataengineeringpodcast.com/hudi-streaming-data-lake-episode-209)
Postgres (https://www.postgresql.org/)
Debezium (https://debezium.io/)
Podcast Episode (https://www.dataengineeringpodcast.com/debezium-change-data-capture-episode-114)
The intro and outro music is from The Hug (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/Love_death_and_a_drunken_monkey/04_-_The_Hug) by The Freak Fandango Orchestra (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/) / CC BY-SA (http://creativecommons.org/licenses/by-sa/3.0/)
2/4/2024 • 56 minutes, 55 seconds
Build A Data Lake For Your Security Logs With Scanner
Summary
Monitoring and auditing IT systems for security events requires the ability to quickly analyze massive volumes of unstructured log data. The majority of products that are available either require too much effort to structure the logs, or aren't fast enough for interactive use cases. Cliff Crosland co-founded Scanner to provide fast querying of high-scale log data for security auditing. In this episode he shares the story of how it got started, how it works, and how you can get started with it.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
Data lakes are notoriously complex. For data engineers who battle to build and scale high quality data workflows on the data lake, Starburst powers petabyte-scale SQL analytics fast, at a fraction of the cost of traditional methods, so that you can meet all your data needs ranging from AI to data applications to complete analytics. Trusted by teams of all sizes, including Comcast and Doordash, Starburst is a data lake analytics platform that delivers the adaptability and flexibility a lakehouse ecosystem promises. And Starburst does all of this on an open architecture with first-class support for Apache Iceberg, Delta Lake and Hudi, so you always maintain ownership of your data. Want to see Starburst in action? Go to dataengineeringpodcast.com/starburst (https://www.dataengineeringpodcast.com/starburst) and get $500 in credits to try Starburst Galaxy today, the easiest and fastest way to get started using Trino.
Your host is Tobias Macey and today I'm interviewing Cliff Crosland about Scanner, a security data lake platform for analyzing security logs and identifying issues quickly and cost-effectively
Interview
Introduction
How did you get involved in the area of data management?
Can you describe what Scanner is and the story behind it?
What were the shortcomings of other tools that are available in the ecosystem?
What is Scanner explicitly not trying to solve for in the security space? (e.g. SIEM)
A query engine is useless without data to analyze. What are the data acquisition paths/sources that Scanner is designed to work with? (e.g. CloudTrail logs, app logs, etc.)
What are some of the other sources of signal for security monitoring that would be valuable to incorporate or integrate with through Scanner?
Log data is notoriously messy, with no strictly defined format. How do you handle introspection and querying across loosely structured records that might span multiple sources and inconsistent labelling strategies?
Can you describe the architecture of the Scanner platform?
What were the motivating constraints that led you to your current implementation?
How have the design and goals of the product changed since you first started working on it?
Given the security oriented customer base that you are targeting, how do you address trust/network boundaries for compliance with regulatory/organizational policies?
What are the personas of the end-users for Scanner?
How has that influenced the way that you think about the query formats, APIs, user experience, etc. for the product?
For teams who are working with Scanner can you describe how it fits into their workflow?
What are the most interesting, innovative, or unexpected ways that you have seen Scanner used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on Scanner?
When is Scanner the wrong choice?
What do you have planned for the future of Scanner?
Contact Info
LinkedIn (https://www.linkedin.com/in/cliftoncrosland/)
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ (https://www.pythonpodcast.com) covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast (https://www.themachinelearningpodcast.com) helps you go from idea to production with machine learning.
Visit the site (https://www.dataengineeringpodcast.com) to subscribe to the show, sign up for the mailing list, and read the show notes.
If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com (mailto:hosts@dataengineeringpodcast.com) with your story.
Links
Scanner (https://scanner.dev/)
cURL (https://curl.se/)
Rust (https://www.rust-lang.org/)
Splunk (https://www.splunk.com/)
S3 (https://aws.amazon.com/s3/)
AWS Athena (https://aws.amazon.com/athena/)
Loki (https://grafana.com/oss/loki/)
Snowflake (https://www.snowflake.com/en/)
Podcast Episode (https://www.dataengineeringpodcast.com/snowflakedb-cloud-data-warehouse-episode-110/)
Presto (https://prestodb.io/)
Trino (https://trino.io/)
AWS CloudTrail (https://aws.amazon.com/cloudtrail/)
GitHub Audit Logs (https://docs.github.com/en/organizations/keeping-your-organization-secure/managing-security-settings-for-your-organization/reviewing-the-audit-log-for-your-organization)
Okta (https://www.okta.com/)
Cribl (https://cribl.io/)
Vector.dev (https://vector.dev/)
Tines (https://www.tines.com/)
Torq (https://torq.io/)
Jira (https://www.atlassian.com/software/jira)
Linear (https://linear.app/)
ECS Fargate (https://aws.amazon.com/fargate/)
SQS (https://aws.amazon.com/sqs/)
Monoid (https://en.wikipedia.org/wiki/Monoid)
Group Theory (https://en.wikipedia.org/wiki/Group_theory)
Avro (https://avro.apache.org/)
Parquet (https://parquet.apache.org/)
OCSF (https://github.com/ocsf/)
VPC Flow Logs (https://docs.aws.amazon.com/vpc/latest/userguide/flow-logs.html)
The intro and outro music is from The Hug (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/Love_death_and_a_drunken_monkey/04_-_The_Hug) by The Freak Fandango Orchestra (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/) / CC BY-SA (http://creativecommons.org/licenses/by-sa/3.0/)
1/29/2024 • 1 hour, 2 minutes, 38 seconds
Modern Customer Data Platform Principles
Summary
Databases and analytics architectures have gone through several generational shifts. A substantial amount of the data that is being managed in these systems is related to customers and their interactions with an organization. In this episode Tasso Argyros, CEO of ActionIQ, gives a summary of the major epochs in database technologies and how he is applying the capabilities of cloud data warehouses to the challenge of building more comprehensive experiences for end-users through a modern customer data platform (CDP).
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
Data lakes are notoriously complex. For data engineers who battle to build and scale high quality data workflows on the data lake, Starburst powers petabyte-scale SQL analytics fast, at a fraction of the cost of traditional methods, so that you can meet all your data needs ranging from AI to data applications to complete analytics. Trusted by teams of all sizes, including Comcast and Doordash, Starburst is a data lake analytics platform that delivers the adaptability and flexibility a lakehouse ecosystem promises. And Starburst does all of this on an open architecture with first-class support for Apache Iceberg, Delta Lake and Hudi, so you always maintain ownership of your data. Want to see Starburst in action? Go to dataengineeringpodcast.com/starburst (https://www.dataengineeringpodcast.com/starburst) and get $500 in credits to try Starburst Galaxy today, the easiest and fastest way to get started using Trino.
Data projects are notoriously complex. With multiple stakeholders to manage across varying backgrounds and toolchains even simple reports can become unwieldy to maintain. Miro is your single pane of glass where everyone can discover, track, and collaborate on your organization's data. I especially like the ability to combine your technical diagrams with data documentation and dependency mapping, allowing your data engineers and data consumers to communicate seamlessly about your projects. Find simplicity in your most complex projects with Miro. Your first three Miro boards are free when you sign up today at dataengineeringpodcast.com/miro (https://www.dataengineeringpodcast.com/miro). That’s three free boards at dataengineeringpodcast.com/miro (https://www.dataengineeringpodcast.com/miro).
Your host is Tobias Macey and today I'm interviewing Tasso Argyros about the role of a customer data platform in the context of the modern data stack
Interview
Introduction
How did you get involved in the area of data management?
Can you describe what the role of the CDP is in the context of a businesses data ecosystem?
What are the core technical challenges associated with building and maintaining a CDP?
What are the organizational/business factors that contribute to the complexity of these systems?
The early days of CDPs came with the promise of "Customer 360". Can you unpack that concept and how it has changed over the past ~5 years?
Recent years have seen the adoption of reverse ETL, cloud data warehouses, and sophisticated product analytics suites. How has that changed the architectural approach to CDPs?
How have the architectural shifts changed the ways that organizations interact with their customer data?
How have the responsibilities shifted across different roles?
What are the governance policy and enforcement challenges that are added with the expansion of access and responsibility?
What are the most interesting, innovative, or unexpected ways that you have seen CDPs built/used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on CDPs?
When is a CDP the wrong choice?
What do you have planned for the future of ActionIQ?
Contact Info
LinkedIn (https://www.linkedin.com/in/tasso/)
@Tasso (https://twitter.com/tasso) on Twitter
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ (https://www.pythonpodcast.com) covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast (https://www.themachinelearningpodcast.com) helps you go from idea to production with machine learning.
Visit the site (https://www.dataengineeringpodcast.com) to subscribe to the show, sign up for the mailing list, and read the show notes.
If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com (mailto:hosts@dataengineeringpodcast.com) with your story.
To help other people find the show please leave a review on Apple Podcasts (https://podcasts.apple.com/us/podcast/data-engineering-podcast/id1193040557) and tell your friends and co-workers
Links
Action IQ (https://www.actioniq.com)
Aster Data (https://en.wikipedia.org/wiki/Aster_Data_Systems)
Teradata (https://www.teradata.com/)
Filemaker (https://en.wikipedia.org/wiki/FileMaker)
Hadoop (https://hadoop.apache.org/)
NoSQL (https://en.wikipedia.org/wiki/NoSQL)
Hive (https://hive.apache.org/)
Informix (https://en.wikipedia.org/wiki/Informix)
Parquet (https://parquet.apache.org/)
Snowflake (https://www.snowflake.com/en/)
Podcast Episode (https://www.dataengineeringpodcast.com/snowflakedb-cloud-data-warehouse-episode-110/)
Spark (https://spark.apache.org/)
Redshift (https://aws.amazon.com/redshift/)
Unity Catalog (https://www.databricks.com/product/unity-catalog)
Customer Data Platform (https://en.wikipedia.org/wiki/Customer_data_platform)
CDP Market Guide (https://info.actioniq.com/hubfs/CDP%20Market%20Guide/CDP_Market_Guide_2024.pdf?utm_campaign=FY24Q4_2024%20CDP%20Market%20Guide&utm_source=AIQ&utm_medium=podcast)
Kaizen (https://en.wikipedia.org/wiki/Kaizen)
The intro and outro music is from The Hug (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/Love_death_and_a_drunken_monkey/04_-_The_Hug) by The Freak Fandango Orchestra (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/) / CC BY-SA (http://creativecommons.org/licenses/by-sa/3.0/)
1/22/2024 • 1 hour, 1 minute, 33 seconds
Pushing The Limits Of Scalability And User Experience For Data Processing With Jignesh Patel
Summary
Data processing technologies have dramatically improved in their sophistication and raw throughput. Unfortunately, the volumes of data that are being generated continue to double, requiring further advancements in the platform capabilities to keep up. As the sophistication increases, so does the complexity, leading to challenges for user experience. Jignesh Patel has been researching these areas for several years in his work as a professor at Carnegie Mellon University. In this episode he illuminates the landscape of problems that we are faced with and how his research is aimed at helping to solve these problems.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
Data lakes are notoriously complex. For data engineers who battle to build and scale high quality data workflows on the data lake, Starburst powers petabyte-scale SQL analytics fast, at a fraction of the cost of traditional methods, so that you can meet all your data needs ranging from AI to data applications to complete analytics. Trusted by teams of all sizes, including Comcast and Doordash, Starburst is a data lake analytics platform that delivers the adaptability and flexibility a lakehouse ecosystem promises. And Starburst does all of this on an open architecture with first-class support for Apache Iceberg, Delta Lake and Hudi, so you always maintain ownership of your data. Want to see Starburst in action? Go to dataengineeringpodcast.com/starburst (https://www.dataengineeringpodcast.com/starburst) and get $500 in credits to try Starburst Galaxy today, the easiest and fastest way to get started using Trino.
Your host is Tobias Macey and today I'm interviewing Jignesh Patel about the research that he is conducting on technical scalability and user experience improvements around data management
Interview
Introduction
How did you get involved in the area of data management?
Can you start by summarizing your current areas of research and the motivations behind them?
What are the open questions today in technical scalability of data engines?
What are the experimental methods that you are using to gain understanding in the opportunities and practical limits of those systems?
As you strive to push the limits of technical capacity in data systems, how does that impact the usability of the resulting systems?
When performing research and building prototypes of the projects, what is your process for incorporating user experience into the implementation of the product?
What are the main sources of tension between technical scalability and user experience/ease of comprehension?
What are some of the positive synergies that you have been able to realize between your teaching, research, and corporate activities?
In what ways do they produce conflict, whether personally or technically?
What are the most interesting, innovative, or unexpected ways that you have seen your research used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on research of the scalability limits of data systems?
What is your heuristic for when a given research project needs to be terminated or productionized?
What do you have planned for the future of your academic research?
Contact Info
Website (https://jigneshpatel.org/)
LinkedIn (https://www.linkedin.com/in/jigneshmpatel/)
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ (https://www.pythonpodcast.com) covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast (https://www.themachinelearningpodcast.com) helps you go from idea to production with machine learning.
Visit the site (https://www.dataengineeringpodcast.com) to subscribe to the show, sign up for the mailing list, and read the show notes.
If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com (mailto:hosts@dataengineeringpodcast.com) with your story.
To help other people find the show please leave a review on Apple Podcasts (https://podcasts.apple.com/us/podcast/data-engineering-podcast/id1193040557) and tell your friends and co-workers
Links
Carnegie Mellon University (https://www.cmu.edu/)
Parallel Databases (https://en.wikipedia.org/wiki/Parallel_database)
Genomics (https://en.wikipedia.org/wiki/Genomics)
Proteomics (https://en.wikipedia.org/wiki/Proteomics)
Moore's Law (https://en.wikipedia.org/wiki/Moore%27s_law)
Dennard Scaling (https://en.wikipedia.org/wiki/Dennard_scaling)
Generative AI (https://en.wikipedia.org/wiki/Generative_artificial_intelligence)
Quantum Computing (https://en.wikipedia.org/wiki/Quantum_computing)
Voltron Data (https://voltrondata.com/)
Podcast Episode (https://www.dataengineeringpodcast.com/voltron-data-apache-arrow-episode-346/)
Von Neumann Architecture (https://en.wikipedia.org/wiki/Von_Neumann_architecture)
Two's Complement (https://en.wikipedia.org/wiki/Two%27s_complement)
Ottertune (https://ottertune.com/)
Podcast Episode (https://www.dataengineeringpodcast.com/ottertune-database-performance-optimization-episode-197/)
dbt (https://www.getdbt.com/)
Informatica (https://www.informatica.com/)
Mozart Data (https://mozartdata.com/)
Podcast Episode (https://www.dataengineeringpodcast.com/mozart-data-modern-data-stack-episode-242/)
DataChat (https://datachat.ai/)
Von Neumann Bottleneck (https://www.techopedia.com/definition/14630/von-neumann-bottleneck)
The intro and outro music is from The Hug (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/Love_death_and_a_drunken_monkey/04_-_The_Hug) by The Freak Fandango Orchestra (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/) / CC BY-SA (http://creativecommons.org/licenses/by-sa/3.0/)
1/7/2024 • 50 minutes, 26 seconds
Designing Data Platforms For Fintech Companies
Summary
Working with financial data requires a high degree of rigor due to the numerous regulations and the risks involved in security breaches. In this episode Andrey Korchak, CTO of fintech startup Monite, discusses the complexities of designing and implementing a data platform in that sector.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
Data lakes are notoriously complex. For data engineers who battle to build and scale high quality data workflows on the data lake, Starburst powers petabyte-scale SQL analytics fast, at a fraction of the cost of traditional methods, so that you can meet all your data needs ranging from AI to data applications to complete analytics. Trusted by teams of all sizes, including Comcast and Doordash, Starburst is a data lake analytics platform that delivers the adaptability and flexibility a lakehouse ecosystem promises. And Starburst does all of this on an open architecture with first-class support for Apache Iceberg, Delta Lake and Hudi, so you always maintain ownership of your data. Want to see Starburst in action? Go to dataengineeringpodcast.com/starburst (https://www.dataengineeringpodcast.com/starburst) and get $500 in credits to try Starburst Galaxy today, the easiest and fastest way to get started using Trino.
Introducing RudderStack Profiles. RudderStack Profiles takes the SaaS guesswork and SQL grunt work out of building complete customer profiles so you can quickly ship actionable, enriched data to every downstream team. You specify the customer traits, then Profiles runs the joins and computations for you to create complete customer profiles. Get all of the details and try the new product today at dataengineeringpodcast.com/rudderstack (https://www.dataengineeringpodcast.com/rudderstack)
You shouldn't have to throw away the database to build with fast-changing data. You should be able to keep the familiarity of SQL and the proven architecture of cloud warehouses, but swap the decades-old batch computation model for an efficient incremental engine to get complex queries that are always up-to-date. With Materialize, you can! It’s the only true SQL streaming database built from the ground up to meet the needs of modern data products. Whether it’s real-time dashboarding and analytics, personalization and segmentation or automation and alerting, Materialize gives you the ability to work with fresh, correct, and scalable results — all in a familiar SQL interface. Go to dataengineeringpodcast.com/materialize (https://www.dataengineeringpodcast.com/materialize) today to get 2 weeks free!
Your host is Tobias Macey and today I'm interviewing Andrey Korchak about how to manage data in a fintech environment
Interview
Introduction
How did you get involved in the area of data management?
Can you start by summarizing the data challenges that are particular to the fintech ecosystem?
What are the primary sources and types of data that fintech organizations are working with?
What are the business-level capabilities that are dependent on this data?
How do the regulatory and business requirements influence the technology landscape in fintech organizations?
What does a typical build vs. buy decision process look like?
Fraud prediction in e.g. banks is one of the most well-established applications of machine learning in industry. What are some of the other ways that ML plays a part in fintech?
How does that influence the architectural design/capabilities for data platforms in those organizations?
Data governance is a notoriously challenging problem. What are some of the strategies that fintech companies are able to apply to this problem given their regulatory burdens?
What are the most interesting, innovative, or unexpected approaches to data management that you have seen in the fintech sector?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on data in fintech?
What do you have planned for the future of your data capabilities at Monite?
Contact Info
LinkedIn (https://www.linkedin.com/in/a-korchak/)
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ (https://www.pythonpodcast.com) covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast (https://www.themachinelearningpodcast.com) helps you go from idea to production with machine learning.
Visit the site (https://www.dataengineeringpodcast.com) to subscribe to the show, sign up for the mailing list, and read the show notes.
If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com (mailto:hosts@dataengineeringpodcast.com) with your story.
To help other people find the show please leave a review on Apple Podcasts (https://podcasts.apple.com/us/podcast/data-engineering-podcast/id1193040557) and tell your friends and co-workers
Links
Monite (https://monite.com/)
ISO 27001 (https://www.iso.org/standard/27001)
Tesseract (https://github.com/tesseract-ocr/tesseract)
GitOps (https://about.gitlab.com/topics/gitops/)
SWIFT Protocol (https://en.wikipedia.org/wiki/SWIFT)
The intro and outro music is from The Hug (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/Love_death_and_a_drunken_monkey/04_-_The_Hug) by The Freak Fandango Orchestra (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/) / CC BY-SA (http://creativecommons.org/licenses/by-sa/3.0/)
1/1/2024 • 47 minutes, 56 seconds
Troubleshooting Kafka In Production
Summary
Kafka has become a ubiquitous technology, offering a simple method for coordinating events and data across different systems. Operating it at scale, however, is notoriously challenging. Elad Eldor has experienced these challenges first-hand, leading him to write the book "Kafka: Troubleshooting in Production". In this episode he highlights the sources of complexity that contribute to Kafka's operational difficulties, and some of the main ways to identify and mitigate potential sources of trouble.
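To ground the data-loss discussion from the interview, here is a hedged example of producer-side settings that commonly factor into durability. The broker address and topic are placeholders, and this is only half of the picture: topic-level replication.factor and broker-side min.insync.replicas matter just as much and are not shown here.

```python
# Hedged illustration of durability-oriented producer settings with confluent-kafka.
# Broker and topic names are placeholders; broker/topic configuration is not shown.
from confluent_kafka import Producer

producer = Producer({
    "bootstrap.servers": "kafka-broker:9092",  # hypothetical broker address
    "acks": "all",                # wait for all in-sync replicas to acknowledge
    "enable.idempotence": True,   # avoid duplicates when retries occur
    "retries": 5,
})

def on_delivery(err, msg):
    # Surfacing delivery errors is essential for spotting silent data loss
    if err is not None:
        print(f"Delivery failed for key {msg.key()}: {err}")

producer.produce(
    "events",                     # placeholder topic name
    key=b"user-123",
    value=b'{"action": "login"}',
    callback=on_delivery,
)
producer.flush()  # block until outstanding messages are delivered or fail
```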
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
Introducing RudderStack Profiles. RudderStack Profiles takes the SaaS guesswork and SQL grunt work out of building complete customer profiles so you can quickly ship actionable, enriched data to every downstream team. You specify the customer traits, then Profiles runs the joins and computations for you to create complete customer profiles. Get all of the details and try the new product today at dataengineeringpodcast.com/rudderstack (https://www.dataengineeringpodcast.com/rudderstack)
You shouldn't have to throw away the database to build with fast-changing data. You should be able to keep the familiarity of SQL and the proven architecture of cloud warehouses, but swap the decades-old batch computation model for an efficient incremental engine to get complex queries that are always up-to-date. With Materialize, you can! It’s the only true SQL streaming database built from the ground up to meet the needs of modern data products. Whether it’s real-time dashboarding and analytics, personalization and segmentation or automation and alerting, Materialize gives you the ability to work with fresh, correct, and scalable results — all in a familiar SQL interface. Go to dataengineeringpodcast.com/materialize (https://www.dataengineeringpodcast.com/materialize) today to get 2 weeks free!
Data lakes are notoriously complex. For data engineers who battle to build and scale high quality data workflows on the data lake, Starburst powers petabyte-scale SQL analytics fast, at a fraction of the cost of traditional methods, so that you can meet all your data needs ranging from AI to data applications to complete analytics. Trusted by teams of all sizes, including Comcast and Doordash, Starburst is a data lake analytics platform that delivers the adaptability and flexibility a lakehouse ecosystem promises. And Starburst does all of this on an open architecture with first-class support for Apache Iceberg, Delta Lake and Hudi, so you always maintain ownership of your data. Want to see Starburst in action? Go to dataengineeringpodcast.com/starburst (https://www.dataengineeringpodcast.com/starburst) and get $500 in credits to try Starburst Galaxy today, the easiest and fastest way to get started using Trino.
Your host is Tobias Macey and today I'm interviewing Elad Eldor about operating Kafka in production and how to keep your clusters stable and performant
Interview
Introduction
How did you get involved in the area of data management?
Can you describe your experiences with Kafka?
What are the operational challenges that you have had to overcome while working with Kafka?
What motivated to write a book about how to manage Kafka in production?
There are many options now for persistent data queues. What are the factors to consider when determining whether Kafka is the right choice?
In the case where Kafka is the appropriate tool, there are many ways to run it now. What are the considerations that teams need to work through when determining whether/where/how to operate a cluster?
When provisioning a Kafka cluster, what are the requirements that need to be considered when determining the sizing?
What are the axes along which size/scale need to be determined?
The core promise of Kafka is that it is a durable store for continuous data. What are the mechanisms that are available for preventing data loss?
Under what circumstances can data be lost?
What are the different failure conditions that cluster operators need to be aware of?
What are the monitoring strategies that are most helpful for identifying (proactively or reactively) those errors?
In the event of these different cluster errors, what are the strategies for mitigating and recovering from those failures?
When a cluster's usage expands beyond the original designed capacity, what are the options/procedures for expanding that capacity?
When a cluster is underutilized, how can it be scaled down to reduce cost?
What are the most interesting, innovative, or unexpected ways that you have seen Kafka used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working with Kafka?
When is Kafka the wrong choice?
What are the changes that you would like to see in Kafka to make it easier to operate?
Contact Info
LinkedIn (https://www.linkedin.com/in/elad-eldor/?originalSubdomain=il)
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ (https://www.pythonpodcast.com) covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast (https://www.themachinelearningpodcast.com) helps you go from idea to production with machine learning.
Visit the site (https://www.dataengineeringpodcast.com) to subscribe to the show, sign up for the mailing list, and read the show notes.
If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com (mailto:hosts@dataengineeringpodcast.com) with your story.
To help other people find the show please leave a review on Apple Podcasts (https://podcasts.apple.com/us/podcast/data-engineering-podcast/id1193040557) and tell your friends and co-workers
Links
Kafka: Troubleshooting in Production (https://amzn.to/3NFzPgL) book (affiliate link)
IronSource (https://www.is.com/)
Druid (https://druid.apache.org/)
Trino (https://trino.io/)
Kafka (https://kafka.apache.org/)
Spark (https://spark.apache.org/)
SRE == Site Reliability Engineer (https://en.wikipedia.org/wiki/Site_reliability_engineering)
Presto (https://prestodb.io/)
Systems Performance (https://amzn.to/3tkQAag) by Brendan Gregg (affiliate link)
HortonWorks (https://en.wikipedia.org/wiki/Hortonworks)
RAID == Redundant Array of Inexpensive Disks (https://en.wikipedia.org/wiki/RAID)
JBOD == Just a Bunch Of Disks (https://en.wikipedia.org/wiki/Non-RAID_drive_architectures#JBOD)
AWS MSK (https://aws.amazon.com/msk/)
Confluent (https://www.confluent.io/)
Aiven (https://aiven.io/)
JStat (https://docs.oracle.com/javase/8/docs/technotes/tools/windows/jstat.html)
Kafka Tiered Storage (https://cwiki.apache.org/confluence/display/KAFKA/KIP-405%3A+Kafka+Tiered+Storage)
Brendan Gregg iostat utilization explanation (https://www.brendangregg.com/blog/2021-05-09/poor-disk-performance.html)
The intro and outro music is from The Hug (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/Love_death_and_a_drunken_monkey/04_-_The_Hug) by The Freak Fandango Orchestra (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/) / CC BY-SA (http://creativecommons.org/licenses/by-sa/3.0/)
12/24/2023 • 1 hour, 14 minutes, 43 seconds
Adding An Easy Mode For The Modern Data Stack With 5X
Summary
The "modern data stack" promised a scalable, composable data platform that gave everyone the flexibility to use the best tools for every job. The reality was that it left data teams spending all of their engineering effort on integrating systems that weren't designed with compatible user experiences. The team at 5X understands that pain and the barriers to productivity it creates, and they set out to solve it by pre-integrating the best tools from each layer of the stack. In this episode founder Tarush Aggarwal explains how the realities of the modern data stack are impacting data teams and the work that 5X is doing to accelerate time to value.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
Introducing RudderStack Profiles. RudderStack Profiles takes the SaaS guesswork and SQL grunt work out of building complete customer profiles so you can quickly ship actionable, enriched data to every downstream team. You specify the customer traits, then Profiles runs the joins and computations for you to create complete customer profiles. Get all of the details and try the new product today at dataengineeringpodcast.com/rudderstack (https://www.dataengineeringpodcast.com/rudderstack)
You shouldn't have to throw away the database to build with fast-changing data. You should be able to keep the familiarity of SQL and the proven architecture of cloud warehouses, but swap the decades-old batch computation model for an efficient incremental engine to get complex queries that are always up-to-date. With Materialize, you can! It’s the only true SQL streaming database built from the ground up to meet the needs of modern data products. Whether it’s real-time dashboarding and analytics, personalization and segmentation or automation and alerting, Materialize gives you the ability to work with fresh, correct, and scalable results — all in a familiar SQL interface. Go to dataengineeringpodcast.com/materialize (https://www.dataengineeringpodcast.com/materialize) today to get 2 weeks free!
Data lakes are notoriously complex. For data engineers who battle to build and scale high quality data workflows on the data lake, Starburst powers petabyte-scale SQL analytics fast, at a fraction of the cost of traditional methods, so that you can meet all your data needs ranging from AI to data applications to complete analytics. Trusted by teams of all sizes, including Comcast and Doordash, Starburst is a data lake analytics platform that delivers the adaptability and flexibility a lakehouse ecosystem promises. And Starburst does all of this on an open architecture with first-class support for Apache Iceberg, Delta Lake and Hudi, so you always maintain ownership of your data. Want to see Starburst in action? Go to dataengineeringpodcast.com/starburst (https://www.dataengineeringpodcast.com/starburst) and get $500 in credits to try Starburst Galaxy today, the easiest and fastest way to get started using Trino.
Your host is Tobias Macey and today I'm welcoming back Tarush Aggarwal to talk about what he and his team at 5X are building to improve the user experience of the modern data stack.
Interview
Introduction
How did you get involved in the area of data management?
Can you describe what 5x is and the story behind it?
We last spoke in March of 2022. What are the notable changes in the 5X business and product?
What are the notable shifts in the data ecosystem that have influenced your adoption and product direction?
What trends are you most focused on tracking as you plan the continued evolution of your offerings?
What are the points of friction that teams run into when trying to build their data platform?
Can you describe the design of the system that you have built?
What are the strategies that you rely on to support adaptability and speed of onboarding for new integrations?
What are some of the types of edge cases that you have to deal with while integrating and operating the platform implementations that you design for your customers?
What is your process for selection of vendors to support?
How would you characterize your relationships with the vendors that you rely on?
For customers who have pre-existing investment in a portion of the data stack, what is your process for engaging with them to understand how best to support their goals?
What are the most interesting, innovative, or unexpected ways that you have seen 5XData used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on 5XData?
When is 5X the wrong choice?
What do you have planned for the future of 5X?
Contact Info
LinkedIn (https://www.linkedin.com/in/tarushaggarwal/)
@tarush (https://twitter.com/tarush) on Twitter
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ (https://www.pythonpodcast.com) covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast (https://www.themachinelearningpodcast.com) helps you go from idea to production with machine learning.
Visit the site (https://www.dataengineeringpodcast.com) to subscribe to the show, sign up for the mailing list, and read the show notes.
If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com (mailto:hosts@dataengineeringpodcast.com) with your story.
To help other people find the show please leave a review on Apple Podcasts (https://podcasts.apple.com/us/podcast/data-engineering-podcast/id1193040557) and tell your friends and co-workers
Links
5X (https://5x.co)
Informatica (https://www.informatica.com/)
Snowflake (https://www.snowflake.com/en/)
Podcast Episode (https://www.dataengineeringpodcast.com/snowflakedb-cloud-data-warehouse-episode-110/)
Looker (https://cloud.google.com/looker/)
Podcast Episode (https://www.dataengineeringpodcast.com/looker-with-daniel-mintz-episode-55/)
DuckDB (https://duckdb.org/)
Podcast Episode (https://www.dataengineeringpodcast.com/duckdb-in-process-olap-database-episode-270/)
Redshift (https://aws.amazon.com/redshift/)
Reverse ETL (https://medium.com/memory-leak/reverse-etl-a-primer-4e6694dcc7fb)
Fivetran (https://www.fivetran.com/)
Podcast Episode (https://www.dataengineeringpodcast.com/fivetran-data-replication-episode-93/)
Rudderstack (https://www.rudderstack.com/)
Podcast Episode (https://www.dataengineeringpodcast.com/rudderstack-open-source-customer-data-platform-episode-263/)
Peak.ai (https://peak.ai/)
The intro and outro music is from The Hug (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/Love_death_and_a_drunken_monkey/04_-_The_Hug) by The Freak Fandango Orchestra (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/) / CC BY-SA (http://creativecommons.org/licenses/by-sa/3.0/)
12/18/2023 • 56 minutes, 12 seconds
Run Your Own Anomaly Detection For Your Critical Business Metrics With Anomstack
Summary
If your business metrics looked weird tomorrow, would you know about it first? Anomaly detection is focused on identifying those outliers for you, so that you are the first to know when a business critical dashboard isn't right. Unfortunately, it can often be complex or expensive to incorporate anomaly detection into your data platform. Andrew Maguire got tired of solving that problem for each of the different roles he has ended up in, so he created the open source Anomstack project. In this episode he shares what it is, how it works, and how you can start using it today to get notified when the critical metrics in your business aren't quite right.
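As a hedged illustration of the score-and-alert loop that a tool like this automates (this sketch is not drawn from the Anomstack codebase), the following Python example uses PyOD's isolation forest on a made-up univariate business metric; the values, detector settings, and alerting action are all placeholders.

import numpy as np
from pyod.models.iforest import IForest

# Hypothetical daily order counts; the last value is a suspicious drop.
metric = np.array([120, 118, 125, 130, 122, 127, 119, 124, 131, 45], dtype=float)
X = metric.reshape(-1, 1)  # PyOD expects a 2-D feature matrix

detector = IForest(contamination=0.1, random_state=42)
detector.fit(X)

scores = detector.decision_function(X)   # higher score == more anomalous
labels = detector.predict(X)             # 1 == flagged as an outlier

for day, (value, score, label) in enumerate(zip(metric, scores, labels)):
    if label == 1:
        # In a real system this is where an alert (email, Slack, etc.) would fire.
        print(f"day {day}: value={value} looks anomalous (score={score:.3f})")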
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
You shouldn't have to throw away the database to build with fast-changing data. You should be able to keep the familiarity of SQL and the proven architecture of cloud warehouses, but swap the decades-old batch computation model for an efficient incremental engine to get complex queries that are always up-to-date. With Materialize, you can! It’s the only true SQL streaming database built from the ground up to meet the needs of modern data products. Whether it’s real-time dashboarding and analytics, personalization and segmentation or automation and alerting, Materialize gives you the ability to work with fresh, correct, and scalable results — all in a familiar SQL interface. Go to dataengineeringpodcast.com/materialize (https://www.dataengineeringpodcast.com/materialize) today to get 2 weeks free!
Introducing RudderStack Profiles. RudderStack Profiles takes the SaaS guesswork and SQL grunt work out of building complete customer profiles so you can quickly ship actionable, enriched data to every downstream team. You specify the customer traits, then Profiles runs the joins and computations for you to create complete customer profiles. Get all of the details and try the new product today at dataengineeringpodcast.com/rudderstack (https://www.dataengineeringpodcast.com/rudderstack)
Data projects are notoriously complex. With multiple stakeholders to manage across varying backgrounds and toolchains even simple reports can become unwieldy to maintain. Miro is your single pane of glass where everyone can discover, track, and collaborate on your organization's data. I especially like the ability to combine your technical diagrams with data documentation and dependency mapping, allowing your data engineers and data consumers to communicate seamlessly about your projects. Find simplicity in your most complex projects with Miro. Your first three Miro boards are free when you sign up today at dataengineeringpodcast.com/miro (https://www.dataengineeringpodcast.com/miro). That’s three free boards at dataengineeringpodcast.com/miro (https://www.dataengineeringpodcast.com/miro).
Data lakes are notoriously complex. For data engineers who battle to build and scale high quality data workflows on the data lake, Starburst powers petabyte-scale SQL analytics fast, at a fraction of the cost of traditional methods, so that you can meet all your data needs ranging from AI to data applications to complete analytics. Trusted by teams of all sizes, including Comcast and Doordash, Starburst is a data lake analytics platform that delivers the adaptability and flexibility a lakehouse ecosystem promises. And Starburst does all of this on an open architecture with first-class support for Apache Iceberg, Delta Lake and Hudi, so you always maintain ownership of your data. Want to see Starburst in action? Go to dataengineeringpodcast.com/starburst (https://www.dataengineeringpodcast.com/starburst) and get $500 in credits to try Starburst Galaxy today, the easiest and fastest way to get started using Trino.
Your host is Tobias Macey and today I'm interviewing Andrew Maguire about his work on the Anomstack project and how you can use it to run your own anomaly detection for your metrics
Interview
Introduction
How did you get involved in the area of data management?
Can you describe what Anomstack is and the story behind it?
What are your goals for this project?
What other tools/products might teams be evaluating while they consider Anomstack?
In the context of Anomstack, what constitutes a "metric"?
What are some examples of useful metrics that a data team might want to monitor?
You put in a lot of work to make Anomstack as easy as possible to get started with. How did this focus on ease of adoption influence the way that you approached the overall design of the project?
What are the core capabilities and constraints that you selected to provide the focus and architecture of the project?
Can you describe how Anomstack is implemented?
How have the design and goals of the project changed since you first started working on it?
What are the steps to getting Anomstack running and integrated as part of the operational fabric of a data platform?
What are the sharp edges that are still present in the system?
What are the interfaces that are available for teams to customize or enhance the capabilities of Anomstack?
What are the most interesting, innovative, or unexpected ways that you have seen Anomstack used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on Anomstack?
When is Anomstack the wrong choice?
What do you have planned for the future of Anomstack?
Contact Info
LinkedIn (https://www.linkedin.com/in/andrewm4894/)
Twitter (https://twitter.com/@andrewm4894)
GitHub (http://github.com/andrewm4894)
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ (https://www.pythonpodcast.com) covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast (https://www.themachinelearningpodcast.com) helps you go from idea to production with machine learning.
Visit the site (https://www.dataengineeringpodcast.com) to subscribe to the show, sign up for the mailing list, and read the show notes.
If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com (mailto:hosts@dataengineeringpodcast.com) with your story.
To help other people find the show please leave a review on Apple Podcasts (https://podcasts.apple.com/us/podcast/data-engineering-podcast/id1193040557) and tell your friends and co-workers
Links
Anomstack Github repo (http://github.com/andrewm4894/anomstack)
Airflow Anomaly Detection Provider Github repo (https://github.com/andrewm4894/airflow-provider-anomaly-detection)
Netdata (https://www.netdata.cloud/)
Metric Tree (https://www.datacouncil.ai/talks/designing-and-building-metric-trees)
Semantic Layer (https://en.wikipedia.org/wiki/Semantic_layer)
Prometheus (https://prometheus.io/)
Anodot (https://www.anodot.com/)
Chaos Genius (https://www.chaosgenius.io/)
Metaplane (https://www.metaplane.dev/)
Anomalo (https://www.anomalo.com/)
PyOD (https://pyod.readthedocs.io/)
Airflow (https://airflow.apache.org/)
DuckDB (https://duckdb.org/)
Anomstack Gallery (https://github.com/andrewm4894/anomstack/tree/main/gallery)
Dagster (https://dagster.io/)
InfluxDB (https://www.influxdata.com/)
TimeGPT (https://docs.nixtla.io/docs/timegpt_quickstart)
Prophet (https://facebook.github.io/prophet/)
GreyKite (https://linkedin.github.io/greykite/)
OpenLineage (https://openlineage.io/)
The intro and outro music is from The Hug (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/Love_death_and_a_drunken_monkey/04_-_The_Hug) by The Freak Fandango Orchestra (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/) / CC BY-SA (http://creativecommons.org/licenses/by-sa/3.0/)
12/11/2023 • 49 minutes, 51 seconds
Designing Data Transfer Systems That Scale
Summary
The first step of data pipelines is to move the data to a place where you can process and prepare it for its eventual purpose. Data transfer systems are a critical component of data enablement, and building them to support large volumes of information is a complex endeavor. Andrei Tserakhau has dedicated his career to this problem, and in this episode he shares the lessons that he has learned and the work he is doing on his most recent data transfer system at DoubleCloud.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
Introducing RudderStack Profiles. RudderStack Profiles takes the SaaS guesswork and SQL grunt work out of building complete customer profiles so you can quickly ship actionable, enriched data to every downstream team. You specify the customer traits, then Profiles runs the joins and computations for you to create complete customer profiles. Get all of the details and try the new product today at dataengineeringpodcast.com/rudderstack (https://www.dataengineeringpodcast.com/rudderstack)
You shouldn't have to throw away the database to build with fast-changing data. You should be able to keep the familiarity of SQL and the proven architecture of cloud warehouses, but swap the decades-old batch computation model for an efficient incremental engine to get complex queries that are always up-to-date. With Materialize, you can! It’s the only true SQL streaming database built from the ground up to meet the needs of modern data products. Whether it’s real-time dashboarding and analytics, personalization and segmentation or automation and alerting, Materialize gives you the ability to work with fresh, correct, and scalable results — all in a familiar SQL interface. Go to dataengineeringpodcast.com/materialize (https://www.dataengineeringpodcast.com/materialize) today to get 2 weeks free!
This episode is brought to you by Datafold – a testing automation platform for data engineers that finds data quality issues for every part of your data workflow, from migration to deployment. Datafold has recently launched a 3-in-1 product experience to support accelerated data migrations. With Datafold, you can seamlessly plan, translate, and validate data across systems, massively accelerating your migration project. Datafold leverages cross-database diffing to compare tables across environments in seconds, column-level lineage for smarter migration planning, and a SQL translator to make moving your SQL scripts easier. Learn more about Datafold by visiting dataengineeringpodcast.com/datafold (https://www.dataengineeringpodcast.com/datafold) today!
Data lakes are notoriously complex. For data engineers who battle to build and scale high quality data workflows on the data lake, Starburst powers petabyte-scale SQL analytics fast, at a fraction of the cost of traditional methods, so that you can meet all your data needs ranging from AI to data applications to complete analytics. Trusted by teams of all sizes, including Comcast and Doordash, Starburst is a data lake analytics platform that delivers the adaptability and flexibility a lakehouse ecosystem promises. And Starburst does all of this on an open architecture with first-class support for Apache Iceberg, Delta Lake and Hudi, so you always maintain ownership of your data. Want to see Starburst in action? Go to dataengineeringpodcast.com/starburst (https://www.dataengineeringpodcast.com/starburst) and get $500 in credits to try Starburst Galaxy today, the easiest and fastest way to get started using Trino.
Your host is Tobias Macey and today I'm interviewing Andrei Tserakhau about operationalizing high bandwidth and low-latency change-data capture
Interview
Introduction
How did you get involved in the area of data management?
Your most recent project involves operationalizing a generalized data transfer service. What was the original problem that you were trying to solve?
What were the shortcomings of other options in the ecosystem that led you to building a new system?
What was the design of your initial solution to the problem?
What are the sharp edges that you had to deal with to operate and use that initial implementation?
What were the limitations of the system as you started to scale it?
Can you describe the current architecture of your data transfer platform?
What are the capabilities and constraints that you are optimizing for?
As you move beyond the initial use case that started you down this path, what are the complexities involved in generalizing to add new functionality or integrate with additional platforms?
What are the most interesting, innovative, or unexpected ways that you have seen your data transfer service used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on the data transfer system?
When is DoubleCloud Data Transfer the wrong choice?
What do you have planned for the future of DoubleCloud Data Transfer?
Contact Info
LinkedIn (https://www.linkedin.com/in/andrei-tserakhau/)
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ (https://www.pythonpodcast.com) covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast (https://www.themachinelearningpodcast.com) helps you go from idea to production with machine learning.
Visit the site (https://www.dataengineeringpodcast.com) to subscribe to the show, sign up for the mailing list, and read the show notes.
If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com (mailto:hosts@dataengineeringpodcast.com) with your story.
To help other people find the show please leave a review on Apple Podcasts (https://podcasts.apple.com/us/podcast/data-engineering-podcast/id1193040557) and tell your friends and co-workers
Links
DoubleCloud (https://double.cloud/)
Kafka (https://kafka.apache.org/)
MapReduce (https://en.wikipedia.org/wiki/MapReduce)
Change Data Capture (https://en.wikipedia.org/wiki/Change_data_capture)
Clickhouse (https://clickhouse.com/)
Podcast Episode (https://www.dataengineeringpodcast.com/clickhouse-data-warehouse-episode-88/)
Iceberg (https://iceberg.apache.org/)
Podcast Episode (https://www.dataengineeringpodcast.com/iceberg-with-ryan-blue-episode-52/)
Delta Lake (https://delta.io/)
Podcast Episode (https://www.dataengineeringpodcast.com/delta-lake-data-lake-episode-85/)
dbt (https://www.getdbt.com/)
OpenMetadata (https://open-metadata.org/)
Podcast Episode (https://www.dataengineeringpodcast.com/openmetadata-universal-metadata-layer-episode-237/)
The intro and outro music is from The Hug (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/Love_death_and_a_drunken_monkey/04_-_The_Hug) by The Freak Fandango Orchestra (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/) / CC BY-SA (http://creativecommons.org/licenses/by-sa/3.0/)
Speaker - Andrei Tserakhau, DoubleCloud Tech Lead. He has over 10 years of IT engineering experience and has spent the last 4 years working on distributed systems with a focus on data delivery.
12/4/2023 • 1 hour, 3 minutes, 57 seconds
Addressing The Challenges Of Component Integration In Data Platform Architectures
Summary
Building a data platform that is enjoyable and accessible for all of its end users is a substantial challenge. One of the core complexities that needs to be addressed is the fractal set of integrations that need to be managed across the individual components. In this episode Tobias Macey shares his thoughts on the challenges that he is facing as he prepares to build the next set of architectural layers for his data platform to enable a larger audience to start accessing the data being managed by his team.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
Introducing RudderStack Profiles. RudderStack Profiles takes the SaaS guesswork and SQL grunt work out of building complete customer profiles so you can quickly ship actionable, enriched data to every downstream team. You specify the customer traits, then Profiles runs the joins and computations for you to create complete customer profiles. Get all of the details and try the new product today at dataengineeringpodcast.com/rudderstack (https://www.dataengineeringpodcast.com/rudderstack)
You shouldn't have to throw away the database to build with fast-changing data. You should be able to keep the familiarity of SQL and the proven architecture of cloud warehouses, but swap the decades-old batch computation model for an efficient incremental engine to get complex queries that are always up-to-date. With Materialize, you can! It’s the only true SQL streaming database built from the ground up to meet the needs of modern data products. Whether it’s real-time dashboarding and analytics, personalization and segmentation or automation and alerting, Materialize gives you the ability to work with fresh, correct, and scalable results — all in a familiar SQL interface. Go to dataengineeringpodcast.com/materialize (https://www.dataengineeringpodcast.com/materialize) today to get 2 weeks free!
Developing event-driven pipelines is going to be a lot easier - Meet Functions! Memphis functions enable developers and data engineers to build an organizational toolbox of functions to process, transform, and enrich ingested events “on the fly” in a serverless manner using AWS Lambda syntax, without boilerplate, orchestration, error handling, and infrastructure in almost any language, including Go, Python, JS, .NET, Java, SQL, and more. Go to dataengineeringpodcast.com/memphis (https://www.dataengineeringpodcast.com/memphis) today to get started!
Data lakes are notoriously complex. For data engineers who battle to build and scale high quality data workflows on the data lake, Starburst powers petabyte-scale SQL analytics fast, at a fraction of the cost of traditional methods, so that you can meet all your data needs ranging from AI to data applications to complete analytics. Trusted by teams of all sizes, including Comcast and Doordash, Starburst is a data lake analytics platform that delivers the adaptability and flexibility a lakehouse ecosystem promises. And Starburst does all of this on an open architecture with first-class support for Apache Iceberg, Delta Lake and Hudi, so you always maintain ownership of your data. Want to see Starburst in action? Go to dataengineeringpodcast.com/starburst (https://www.dataengineeringpodcast.com/starburst) and get $500 in credits to try Starburst Galaxy today, the easiest and fastest way to get started using Trino.
Your host is Tobias Macey and today I'll be sharing an update on my own journey of building a data platform, with a particular focus on the challenges of tool integration and maintaining a single source of truth
Interview
Introduction
How did you get involved in the area of data management?
data sharing
weight of history
existing integrations with dbt
switching cost for e.g. SQLMesh
de facto standard of Airflow
Single source of truth
permissions management across application layers
Database engine
Storage layer in a lakehouse
Presentation/access layer (BI)
Data flows
dbt -> table level lineage
orchestration engine -> pipeline flows
task based vs. asset based
Metadata platform as the logical place for horizontal view
Contact Info
LinkedIn (https://linkedin.com/in/tmacey)
Website (https://www.dataengineeringpodcast.com)
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ (https://www.pythonpodcast.com) covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast (https://www.themachinelearningpodcast.com) helps you go from idea to production with machine learning.
Visit the site (https://www.dataengineeringpodcast.com) to subscribe to the show, sign up for the mailing list, and read the show notes.
If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com (mailto:hosts@dataengineeringpodcast.com) with your story.
To help other people find the show please leave a review on Apple Podcasts (https://podcasts.apple.com/us/podcast/data-engineering-podcast/id1193040557) and tell your friends and co-workers
Links
Monologue Episode On Data Platform Design (https://www.dataengineeringpodcast.com/data-platform-design-episode-268)
Monologue Episode On Leaky Abstractions (https://www.dataengineeringpodcast.com/abstractions-and-technical-debt-episode-374)
Airbyte (https://airbyte.com/)
Podcast Episode (https://www.dataengineeringpodcast.com/airbyte-open-source-data-integration-episode-173/)
Trino (https://trino.io/)
Dagster (https://dagster.io/)
dbt (https://www.getdbt.com/)
Snowflake (https://www.snowflake.com/en/)
BigQuery (https://cloud.google.com/bigquery)
OpenMetadata (https://open-metadata.org/)
OpenLineage (https://openlineage.io/)
Data Platform Shadow IT Episode (https://www.dataengineeringpodcast.com/shadow-it-data-analytics-episode-121)
Preset (https://preset.io/)
LightDash (https://www.lightdash.com/)
Podcast Episode (https://www.dataengineeringpodcast.com/lightdash-exploratory-business-intelligence-episode-232/)
SQLMesh (https://sqlmesh.readthedocs.io/)
Podcast Episode (https://www.dataengineeringpodcast.com/sqlmesh-open-source-dataops-episode-380)
Airflow (https://airflow.apache.org/)
Spark (https://spark.apache.org/)
Flink (https://flink.apache.org/)
Tabular (https://tabular.io/)
Iceberg (https://iceberg.apache.org/)
Open Policy Agent (https://www.openpolicyagent.org/)
The intro and outro music is from The Hug (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/Love_death_and_a_drunken_monkey/04_-_The_Hug) by The Freak Fandango Orchestra (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/) / CC BY-SA (http://creativecommons.org/licenses/by-sa/3.0/)
11/27/2023 • 29 minutes, 42 seconds
Unlocking Your dbt Projects With Practical Advice For Practitioners
Summary
The dbt project has become overwhelmingly popular across analytics and data engineering teams. While it is easy to adopt, there are many potential pitfalls. Dustin Dorsey and Cameron Cyr co-authored a practical guide to building your dbt project. In this episode they share their hard-won wisdom about how to build and scale your dbt projects.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
Data projects are notoriously complex. With multiple stakeholders to manage across varying backgrounds and toolchains even simple reports can become unwieldy to maintain. Miro is your single pane of glass where everyone can discover, track, and collaborate on your organization's data. I especially like the ability to combine your technical diagrams with data documentation and dependency mapping, allowing your data engineers and data consumers to communicate seamlessly about your projects. Find simplicity in your most complex projects with Miro. Your first three Miro boards are free when you sign up today at dataengineeringpodcast.com/miro (https://www.dataengineeringpodcast.com/miro). That’s three free boards at dataengineeringpodcast.com/miro (https://www.dataengineeringpodcast.com/miro).
Introducing RudderStack Profiles. RudderStack Profiles takes the SaaS guesswork and SQL grunt work out of building complete customer profiles so you can quickly ship actionable, enriched data to every downstream team. You specify the customer traits, then Profiles runs the joins and computations for you to create complete customer profiles. Get all of the details and try the new product today at dataengineeringpodcast.com/rudderstack (https://www.dataengineeringpodcast.com/rudderstack)
You shouldn't have to throw away the database to build with fast-changing data. You should be able to keep the familiarity of SQL and the proven architecture of cloud warehouses, but swap the decades-old batch computation model for an efficient incremental engine to get complex queries that are always up-to-date. With Materialize, you can! It’s the only true SQL streaming database built from the ground up to meet the needs of modern data products. Whether it’s real-time dashboarding and analytics, personalization and segmentation or automation and alerting, Materialize gives you the ability to work with fresh, correct, and scalable results — all in a familiar SQL interface. Go to dataengineeringpodcast.com/materialize (https://www.dataengineeringpodcast.com/materialize) today to get 2 weeks free!
Data lakes are notoriously complex. For data engineers who battle to build and scale high quality data workflows on the data lake, Starburst powers petabyte-scale SQL analytics fast, at a fraction of the cost of traditional methods, so that you can meet all your data needs ranging from AI to data applications to complete analytics. Trusted by teams of all sizes, including Comcast and Doordash, Starburst is a data lake analytics platform that delivers the adaptability and flexibility a lakehouse ecosystem promises. And Starburst does all of this on an open architecture with first-class support for Apache Iceberg, Delta Lake and Hudi, so you always maintain ownership of your data. Want to see Starburst in action? Go to dataengineeringpodcast.com/starburst (https://www.dataengineeringpodcast.com/starburst) and get $500 in credits to try Starburst Galaxy today, the easiest and fastest way to get started using Trino.
Your host is Tobias Macey and today I'm interviewing Dustin Dorsey and Cameron Cyr about how to design your dbt projects
Interview
Introduction
How did you get involved in the area of data management?
What was your path to adoption of dbt?
What did you use prior to its existence?
When/why/how did you start using it?
What are some of the common challenges that teams experience when getting started with dbt?
How does prior experience in analytics and/or software engineering impact those outcomes?
You recently wrote a book to give a crash course in best practices for dbt. What motivated you to invest that time and effort?
What new lessons did you learn about dbt in the process of writing the book?
The introduction of dbt is largely responsible for catalyzing the growth of "analytics engineering". As practitioners in the space, what do you see as the net result of that trend?
What are the lessons that we all need to invest in independent of the tool?
For someone starting a new dbt project today, can you talk through the decisions that will be most critical for ensuring future success?
As dbt projects scale, what are the elements of technical debt that are most likely to slow down engineers?
What are the capabilities in the dbt framework that can be used to mitigate the effects of that debt?
What tools or processes outside of dbt can help alleviate the incidental complexity of a large dbt project?
What are the most interesting, innovative, or unexpected ways that you have seen dbt used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working with dbt? (as engineers and/or as authors)
What is on your personal wish-list for the future of dbt (or its competition?)?
Contact Info
Dustin
LinkedIn (https://www.linkedin.com/in/dustindorsey/)
Cameron
LinkedIn (https://www.linkedin.com/in/cameron-cyr/)
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ (https://www.pythonpodcast.com) covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast (https://www.themachinelearningpodcast.com) helps you go from idea to production with machine learning.
Visit the site (https://www.dataengineeringpodcast.com) to subscribe to the show, sign up for the mailing list, and read the show notes.
If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com (mailto:hosts@dataengineeringpodcast.com) with your story.
To help other people find the show please leave a review on Apple Podcasts (https://podcasts.apple.com/us/podcast/data-engineering-podcast/id1193040557) and tell your friends and co-workers
Links
Biobot Analytics (https://biobot.io/)
Breezeway (https://www.breezeway.io/)
dbt (https://www.getdbt.com/)
Podcast Episode (https://www.dataengineeringpodcast.com/dbt-data-analytics-episode-81/)
Synapse Analytics (https://azure.microsoft.com/en-us/products/synapse-analytics/)
Snowflake (https://www.snowflake.com/en/)
Podcast Episode (https://www.dataengineeringpodcast.com/snowflakedb-cloud-data-warehouse-episode-110/)
Fivetran (https://www.fivetran.com/)
Podcast Episode (https://www.dataengineeringpodcast.com/fivetran-data-replication-episode-93/)
Analytics Power Hour (https://analyticshour.io/)
DDL == Data Definition Language (https://en.wikipedia.org/wiki/Data_definition_language)
DML == Data Manipulation Language (https://en.wikipedia.org/wiki/Data_manipulation_language)
dbt codegen (https://github.com/dbt-labs/dbt-codegen)
Unlocking dbt (https://amzn.to/49BhACq) book (affiliate link)
dbt Mesh (https://www.getdbt.com/product/dbt-mesh)
dbt Semantic Layer (https://www.getdbt.com/product/semantic-layer)
GitHub Actions (https://github.com/features/actions)
Metaplane (https://www.metaplane.dev/)
Podcast Episode (https://www.dataengineeringpodcast.com/metaplane-data-observability-platform-episode-253/)
DataTune Conference (https://www.datatuneconf.com/)
The intro and outro music is from The Hug (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/Love_death_and_a_drunken_monkey/04_-_The_Hug) by The Freak Fandango Orchestra (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/) / CC BY-SA (http://creativecommons.org/licenses/by-sa/3.0/)
11/20/2023 • 1 hour, 16 minutes, 4 seconds
Enhancing The Abilities Of Software Engineers With Generative AI At Tabnine
Summary
Software development involves an interesting balance of creativity and repetition of patterns. Generative AI has accelerated the ability of developer tools to provide useful suggestions that speed up the work of engineers. Tabnine is one of the main platforms offering an AI powered assistant for software engineers. In this episode Eran Yahav shares the journey that he has taken in building this product, the ways that it enhances the ability of humans to get their work done, and the cases where the humans have to adapt to the tool.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
Introducing RudderStack Profiles. RudderStack Profiles takes the SaaS guesswork and SQL grunt work out of building complete customer profiles so you can quickly ship actionable, enriched data to every downstream team. You specify the customer traits, then Profiles runs the joins and computations for you to create complete customer profiles. Get all of the details and try the new product today at dataengineeringpodcast.com/rudderstack (https://www.dataengineeringpodcast.com/rudderstack)
This episode is brought to you by Datafold – a testing automation platform for data engineers that finds data quality issues before the code and data are deployed to production. Datafold leverages data-diffing to compare production and development environments and column-level lineage to show you the exact impact of every code change on data, metrics, and BI tools, keeping your team productive and stakeholders happy. Datafold integrates with dbt, the modern data stack, and seamlessly plugs in your data CI for team-wide and automated testing. If you are migrating to a modern data stack, Datafold can also help you automate data and code validation to speed up the migration. Learn more about Datafold by visiting dataengineeringpodcast.com/datafold (https://www.dataengineeringpodcast.com/datafold)
Data lakes are notoriously complex. For data engineers who battle to build and scale high quality data workflows on the data lake, Starburst powers petabyte-scale SQL analytics fast, at a fraction of the cost of traditional methods, so that you can meet all your data needs ranging from AI to data applications to complete analytics. Trusted by teams of all sizes, including Comcast and Doordash, Starburst is a data lake analytics platform that delivers the adaptability and flexibility a lakehouse ecosystem promises. And Starburst does all of this on an open architecture with first-class support for Apache Iceberg, Delta Lake and Hudi, so you always maintain ownership of your data. Want to see Starburst in action? Go to dataengineeringpodcast.com/starburst (https://www.dataengineeringpodcast.com/starburst) and get $500 in credits to try Starburst Galaxy today, the easiest and fastest way to get started using Trino.
You shouldn't have to throw away the database to build with fast-changing data. You should be able to keep the familiarity of SQL and the proven architecture of cloud warehouses, but swap the decades-old batch computation model for an efficient incremental engine to get complex queries that are always up-to-date. With Materialize, you can! It’s the only true SQL streaming database built from the ground up to meet the needs of modern data products. Whether it’s real-time dashboarding and analytics, personalization and segmentation or automation and alerting, Materialize gives you the ability to work with fresh, correct, and scalable results — all in a familiar SQL interface. Go to dataengineeringpodcast.com/materialize (https://www.dataengineeringpodcast.com/materialize) today to get 2 weeks free!
Your host is Tobias Macey and today I'm interviewing Eran Yahav about building an AI powered developer assistant at Tabnine
Interview
Introduction
How did you get involved in machine learning?
Can you describe what Tabnine is and the story behind it?
What are the individual and organizational motivations for using AI to generate code?
What are the real-world limitations of generative AI for creating software? (e.g. size/complexity of the outputs, naming conventions, etc.)
What are the elements of skepticism/oversight that developers need to exercise while using a system like Tabnine?
What are some of the primary ways that developers interact with Tabnine during their development workflow?
Are there any particular styles of software for which an AI is more appropriate/capable? (e.g. webapps vs. data pipelines vs. exploratory analysis, etc.)
For natural languages there is a strong bias toward English in the current generation of LLMs. How does that translate into computer languages? (e.g. Python, Java, C++, etc.)
Can you describe the structure and implementation of Tabnine?
Do you rely primarily on a single core model, or do you have multiple models with subspecialization?
How have the design and goals of the product changed since you first started working on it?
What are the biggest challenges in building a custom LLM for code?
What are the opportunities for specialization of the model architecture given the highly structured nature of the problem domain?
For users of Tabnine, how do you assess/monitor the accuracy of recommendations?
What are the feedback and reinforcement mechanisms for the model(s)?
What are the most interesting, innovative, or unexpected ways that you have seen Tabnine's LLM powered coding assistant used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on AI assisted development at Tabnine?
When is an AI developer assistant the wrong choice?
What do you have planned for the future of Tabnine?
Contact Info
LinkedIn (https://www.linkedin.com/in/eranyahav/?originalSubdomain=il)
Website (https://csaws.cs.technion.ac.il/~yahave/)
Parting Question
From your perspective, what is the biggest barrier to adoption of machine learning today?
Closing Announcements
Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ (https://www.pythonpodcast.com) covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast (https://www.themachinelearningpodcast.com) helps you go from idea to production with machine learning.
Visit the site (https://www.dataengineeringpodcast.com) to subscribe to the show, sign up for the mailing list, and read the show notes.
If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com (mailto:hosts@dataengineeringpodcast.com) with your story.
To help other people find the show please leave a review on Apple Podcasts (https://podcasts.apple.com/us/podcast/data-engineering-podcast/id1193040557) and tell your friends and co-workers
Links
TabNine (https://www.tabnine.com/)
Technion University (https://www.technion.ac.il/en/home-2/)
Program Synthesis (https://en.wikipedia.org/wiki/Program_synthesis)
Context Stuffing (http://gptprompts.wikidot.com/context-stuffing)
Elixir (https://elixir-lang.org/)
Dependency Injection (https://en.wikipedia.org/wiki/Dependency_injection)
COBOL (https://en.wikipedia.org/wiki/COBOL)
Verilog (https://en.wikipedia.org/wiki/Verilog)
MidJourney (https://www.midjourney.com/home)
The intro and outro music is from Hitman's Lovesong feat. Paola Graziano (https://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/Tales_Of_A_Dead_Fish/Hitmans_Lovesong/) by The Freak Fandango Orchestra (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/)/CC BY-SA 3.0 (https://creativecommons.org/licenses/by-sa/3.0/)
11/13/2023 • 1 hour, 7 minutes, 52 seconds
Shining Some Light In The Black Box Of PostgreSQL Performance
Summary
Databases are the core of most applications, but they are often treated as inscrutable black boxes. When an application is slow, there is a good probability that the database needs some attention. In this episode Lukas Fittl shares some hard-won wisdom about the causes and solutions of many performance bottlenecks and the work that he is doing to shine some light on PostgreSQL to make it easier to understand how to keep it running smoothly.
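As a concrete (and purely illustrative) starting point for the kind of diagnosis discussed in this episode, the sketch below uses psycopg2 to pull the most expensive statements from pg_stat_statements, one of the standard signals for spotting bottlenecks. It assumes the pg_stat_statements extension is installed and enabled, the connection string is a placeholder for your own environment, and the column names match pg_stat_statements 1.8 and newer (older releases use total_time/mean_time).

import psycopg2

# Connection details are placeholders; adjust for your environment.
conn = psycopg2.connect("dbname=app user=postgres host=localhost")

# Top statements by cumulative execution time.
query = """
    SELECT query, calls, total_exec_time, mean_exec_time, rows
    FROM pg_stat_statements
    ORDER BY total_exec_time DESC
    LIMIT 10;
"""

with conn, conn.cursor() as cur:
    cur.execute(query)
    for stmt, calls, total_ms, mean_ms, rows in cur.fetchall():
        print(f"{mean_ms:8.2f} ms avg | {calls:8d} calls | {stmt[:80]}")

conn.close()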
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
Introducing RudderStack Profiles. RudderStack Profiles takes the SaaS guesswork and SQL grunt work out of building complete customer profiles so you can quickly ship actionable, enriched data to every downstream team. You specify the customer traits, then Profiles runs the joins and computations for you to create complete customer profiles. Get all of the details and try the new product today at dataengineeringpodcast.com/rudderstack (https://www.dataengineeringpodcast.com/rudderstack)
You shouldn't have to throw away the database to build with fast-changing data. You should be able to keep the familiarity of SQL and the proven architecture of cloud warehouses, but swap the decades-old batch computation model for an efficient incremental engine to get complex queries that are always up-to-date. With Materialize, you can! It’s the only true SQL streaming database built from the ground up to meet the needs of modern data products. Whether it’s real-time dashboarding and analytics, personalization and segmentation or automation and alerting, Materialize gives you the ability to work with fresh, correct, and scalable results — all in a familiar SQL interface. Go to dataengineeringpodcast.com/materialize (https://www.dataengineeringpodcast.com/materialize) today to get 2 weeks free!
Data lakes are notoriously complex. For data engineers who battle to build and scale high quality data workflows on the data lake, Starburst powers petabyte-scale SQL analytics fast, at a fraction of the cost of traditional methods, so that you can meet all your data needs ranging from AI to data applications to complete analytics. Trusted by teams of all sizes, including Comcast and Doordash, Starburst is a data lake analytics platform that delivers the adaptability and flexibility a lakehouse ecosystem promises. And Starburst does all of this on an open architecture with first-class support for Apache Iceberg, Delta Lake and Hudi, so you always maintain ownership of your data. Want to see Starburst in action? Go to dataengineeringpodcast.com/starburst (https://www.dataengineeringpodcast.com/starburst) and get $500 in credits to try Starburst Galaxy today, the easiest and fastest way to get started using Trino.
This episode is brought to you by Datafold – a testing automation platform for data engineers that finds data quality issues before the code and data are deployed to production. Datafold leverages data-diffing to compare production and development environments and column-level lineage to show you the exact impact of every code change on data, metrics, and BI tools, keeping your team productive and stakeholders happy. Datafold integrates with dbt, the modern data stack, and seamlessly plugs in your data CI for team-wide and automated testing. If you are migrating to a modern data stack, Datafold can also help you automate data and code validation to speed up the migration. Learn more about Datafold by visiting dataengineeringpodcast.com/datafold (https://www.dataengineeringpodcast.com/datafold)
Your host is Tobias Macey and today I'm interviewing Lukas Fittl about optimizing your database performance and tips for tuning Postgres
Interview
Introduction
How did you get involved in the area of data management?
What are the different ways that database performance problems impact the business?
What are the most common contributors to performance issues?
What are the useful signals that indicate performance challenges in the database?
For a given symptom, what are the steps that you recommend for determining the proximate cause?
What are the potential negative impacts to be aware of when tuning the configuration of your database?
How does the database engine influence the methods used to identify and resolve performance challenges?
Most of the database engines that are in common use today have been around for decades. How have the lessons learned from running these systems over the years influenced the ways to think about designing new engines or evolving the ones we have today?
What are the most interesting, innovative, or unexpected ways that you have seen to address database performance?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on databases?
What are your goals for the future of database engines?
Contact Info
LinkedIn (https://www.linkedin.com/in/lfittl/)
@LukasFittl (https://twitter.com/LukasFittl) on Twitter
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ (https://www.pythonpodcast.com) covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast (https://www.themachinelearningpodcast.com) helps you go from idea to production with machine learning.
Visit the site (https://www.dataengineeringpodcast.com) to subscribe to the show, sign up for the mailing list, and read the show notes.
If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com (mailto:hosts@dataengineeringpodcast.com) with your story.
To help other people find the show please leave a review on Apple Podcasts (https://podcasts.apple.com/us/podcast/data-engineering-podcast/id1193040557) and tell your friends and co-workers
Links
PGAnalyze (https://pganalyze.com/)
Citus Data (https://www.citusdata.com/)
Podcast Episode (https://www.dataengineeringpodcast.com/citus-data-with-ozgun-erdogan-and-craig-kerstiens-episode-13/)
ORM == Object Relational Mapper (https://en.wikipedia.org/wiki/Object%E2%80%93relational_mapping)
N+1 Query (https://docs.sentry.io/product/issues/issue-details/performance-issues/n-one-queries/)
Autovacuum (https://www.postgresql.org/docs/current/routine-vacuuming.html#AUTOVACUUM)
Write-ahead Log (https://en.wikipedia.org/wiki/Write-ahead_logging)
pg_stat_io (https://pgpedia.info/p/pg_stat_io.html)
random_page_cost (https://postgresqlco.nf/doc/en/param/random_page_cost/)
pgvector (https://github.com/pgvector/pgvector)
Vector Database (https://en.wikipedia.org/wiki/Vector_database)
Ottertune (https://ottertune.com/)
Podcast Episode (https://www.dataengineeringpodcast.com/ottertune-database-performance-optimization-episode-197/)
Citus Extension (https://github.com/citusdata/citus)
Hydra (https://github.com/hydradatabase/hydra)
Clickhouse (https://clickhouse.tech/)
Podcast Episode (https://www.dataengineeringpodcast.com/clickhouse-data-warehouse-episode-88/)
MyISAM (https://en.wikipedia.org/wiki/MyISAM)
MyRocks (http://myrocks.io/)
InnoDB (https://en.wikipedia.org/wiki/InnoDB)
Great Expectations (https://greatexpectations.io/)
Podcast Episode (https://www.dataengineeringpodcast.com/great-expectations-data-contracts-episode-352)
OpenTelemetry (https://opentelemetry.io/)
The intro and outro music is from The Hug (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/Love_death_and_a_drunken_monkey/04_-_The_Hug) by The Freak Fandango Orchestra (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/) / CC BY-SA (http://creativecommons.org/licenses/by-sa/3.0/)
11/6/2023 • 54 minutes, 51 seconds
Surveying The Market Of Database Products
Summary
Databases are the core of most applications, whether transactional or analytical. In recent years the selection of database products has exploded, making the critical decision of which engine(s) to use even more difficult. In this episode Tanya Bragin shares her experiences as a product manager for two major vendors and the lessons that she has learned about how teams should approach the process of tool selection.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
Introducing RudderStack Profiles. RudderStack Profiles takes the SaaS guesswork and SQL grunt work out of building complete customer profiles so you can quickly ship actionable, enriched data to every downstream team. You specify the customer traits, then Profiles runs the joins and computations for you to create complete customer profiles. Get all of the details and try the new product today at dataengineeringpodcast.com/rudderstack (https://www.dataengineeringpodcast.com/rudderstack)
You shouldn't have to throw away the database to build with fast-changing data. You should be able to keep the familiarity of SQL and the proven architecture of cloud warehouses, but swap the decades-old batch computation model for an efficient incremental engine to get complex queries that are always up-to-date. With Materialize, you can! It’s the only true SQL streaming database built from the ground up to meet the needs of modern data products. Whether it’s real-time dashboarding and analytics, personalization and segmentation or automation and alerting, Materialize gives you the ability to work with fresh, correct, and scalable results — all in a familiar SQL interface. Go to dataengineeringpodcast.com/materialize (https://www.dataengineeringpodcast.com/materialize) today to get 2 weeks free!
This episode is brought to you by Datafold – a testing automation platform for data engineers that finds data quality issues before the code and data are deployed to production. Datafold leverages data-diffing to compare production and development environments and column-level lineage to show you the exact impact of every code change on data, metrics, and BI tools, keeping your team productive and stakeholders happy. Datafold integrates with dbt, the modern data stack, and seamlessly plugs in your data CI for team-wide and automated testing. If you are migrating to a modern data stack, Datafold can also help you automate data and code validation to speed up the migration. Learn more about Datafold by visiting dataengineeringpodcast.com/datafold (https://www.dataengineeringpodcast.com/datafold)
Data projects are notoriously complex. With multiple stakeholders to manage across varying backgrounds and toolchains even simple reports can become unwieldy to maintain. Miro is your single pane of glass where everyone can discover, track, and collaborate on your organization's data. I especially like the ability to combine your technical diagrams with data documentation and dependency mapping, allowing your data engineers and data consumers to communicate seamlessly about your projects. Find simplicity in your most complex projects with Miro. Your first three Miro boards are free when you sign up today at dataengineeringpodcast.com/miro (https://www.dataengineeringpodcast.com/miro). That’s three free boards at dataengineeringpodcast.com/miro (https://www.dataengineeringpodcast.com/miro).
Your host is Tobias Macey and today I'm interviewing Tanya Bragin about her views on the database products market
Interview
Introduction
How did you get involved in the area of data management?
What are the aspects of the database market that keep you interested as a VP of product?
How have your experiences at Elastic informed your current work at Clickhouse?
What are the main product categories for databases today?
What are the industry trends that have the most impact on the development and growth of different product categories?
Which categories do you see growing the fastest?
When a team is selecting a database technology for a given task, what are the types of questions that they should be asking?
Transactional engines like Postgres, SQL Server, Oracle, etc. were long used as analytical databases as well. What is driving the broad adoption of columnar stores as a separate environment from transactional systems?
What are the inefficiencies/complexities that this introduces?
How can the database engine used for analytical systems work more closely with the transactional systems?
When building analytical systems there are numerous moving parts with intricate dependencies. What is the role of the database in simplifying observability of these applications?
What are the most interesting, innovative, or unexpected ways that you have seen Clickhouse used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on database products?
What are your predictions for the future of the database market?
Contact Info
LinkedIn (https://www.linkedin.com/in/tbragin/)
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ (https://www.pythonpodcast.com) covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast (https://www.themachinelearningpodcast.com) helps you go from idea to production with machine learning.
Visit the site (https://www.dataengineeringpodcast.com) to subscribe to the show, sign up for the mailing list, and read the show notes.
If you've learned something or tried out a project from the show, then tell us about it! Email hosts@dataengineeringpodcast.com (mailto:hosts@dataengineeringpodcast.com) with your story.
To help other people find the show please leave a review on Apple Podcasts (https://podcasts.apple.com/us/podcast/data-engineering-podcast/id1193040557) and tell your friends and co-workers
Links
Clickhouse (https://clickhouse.com/)
Podcast Episode (https://www.dataengineeringpodcast.com/clickhouse-data-warehouse-episode-88/)
Elastic (https://www.elastic.co/)
OLAP (https://en.wikipedia.org/wiki/Online_analytical_processing)
OLTP (https://en.wikipedia.org/wiki/Online_transaction_processing)
Graph Database (https://en.wikipedia.org/wiki/Graph_database)
Vector Database (https://en.wikipedia.org/wiki/Vector_database)
Trino (https://trino.io/)
Presto (https://prestodb.io/)
Foreign data wrapper (https://wiki.postgresql.org/wiki/Foreign_data_wrappers)
dbt (https://www.getdbt.com/)
Podcast Episode (https://www.dataengineeringpodcast.com/dbt-data-analytics-episode-81/)
OpenTelemetry (https://opentelemetry.io/)
Iceberg (https://iceberg.apache.org/)
Podcast Episode (https://www.dataengineeringpodcast.com/tabular-iceberg-lakehouse-tables-episode-363)
Parquet (https://parquet.apache.org/)
The intro and outro music is from The Hug (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/Love_death_and_a_drunken_monkey/04_-_The_Hug) by The Freak Fandango Orchestra (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/) / CC BY-SA (http://creativecommons.org/licenses/by-sa/3.0/)
10/30/2023 • 46 minutes, 24 seconds
Defining A Strategy For Your Data Products
Summary
The primary applications of data have moved beyond analytics. With a broader audience comes the need to present data in a more approachable format. This has led to the broad adoption of data products as the delivery mechanism for information. In this episode Ranjith Raghunath shares his thoughts on how to build a strategy for the development, delivery, and evolution of data products.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
Introducing RudderStack Profiles. RudderStack Profiles takes the SaaS guesswork and SQL grunt work out of building complete customer profiles so you can quickly ship actionable, enriched data to every downstream team. You specify the customer traits, then Profiles runs the joins and computations for you to create complete customer profiles. Get all of the details and try the new product today at dataengineeringpodcast.com/rudderstack (https://www.dataengineeringpodcast.com/rudderstack)
You shouldn't have to throw away the database to build with fast-changing data. You should be able to keep the familiarity of SQL and the proven architecture of cloud warehouses, but swap the decades-old batch computation model for an efficient incremental engine to get complex queries that are always up-to-date. With Materialize, you can! It’s the only true SQL streaming database built from the ground up to meet the needs of modern data products. Whether it’s real-time dashboarding and analytics, personalization and segmentation or automation and alerting, Materialize gives you the ability to work with fresh, correct, and scalable results — all in a familiar SQL interface. Go to dataengineeringpodcast.com/materialize (https://www.dataengineeringpodcast.com/materialize) today to get 2 weeks free!
As more people start using AI for projects, two things are clear: It’s a rapidly advancing field, but it’s tough to navigate. How can you get the best results for your use case? Instead of being subjected to a bunch of buzzword bingo, hear directly from pioneers in the developer and data science space on how they use graph tech to build AI-powered apps. Attend the dev and ML talks at NODES 2023, a free online conference on October 26 featuring some of the brightest minds in tech. Check out the agenda and register today at Neo4j.com/NODES (https://Neo4j.com/NODES).
This episode is brought to you by Datafold – a testing automation platform for data engineers that finds data quality issues before the code and data are deployed to production. Datafold leverages data-diffing to compare production and development environments and column-level lineage to show you the exact impact of every code change on data, metrics, and BI tools, keeping your team productive and stakeholders happy. Datafold integrates with dbt, the modern data stack, and seamlessly plugs in your data CI for team-wide and automated testing. If you are migrating to a modern data stack, Datafold can also help you automate data and code validation to speed up the migration. Learn more about Datafold by visiting dataengineeringpodcast.com/datafold (https://www.dataengineeringpodcast.com/datafold)
Your host is Tobias Macey and today I'm interviewing Ranjith Raghunath about tactical elements of a data product strategy
Interview
Introduction
How did you get involved in the area of data management?
Can you describe what is encompassed by the idea of a data product strategy?
Which roles in an organization need to be involved in the planning and implementation of that strategy?
Order of operations:
Strategy -> platform design -> implementation/adoption
Platform implementation -> product strategy -> interface development
Managing the grain of data in products
Team organization to support product development and deployment
Customer communications: what questions to ask? Requirements gathering and helping stakeholders understand "the art of the possible"
What are the most interesting, innovative, or unexpected ways that you have seen organizations approach data product strategies?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on defining and implementing data product strategies?
When is a data product strategy overkill?
What are some additional resources that you recommend for listeners to direct their thinking and learning about data product strategy?
Contact Info
LinkedIn (https://www.linkedin.com/in/ranjith-raghunath/)
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ (https://www.pythonpodcast.com) covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast (https://www.themachinelearningpodcast.com) helps you go from idea to production with machine learning.
Visit the site (https://www.dataengineeringpodcast.com) to subscribe to the show, sign up for the mailing list, and read the show notes.
If you've learned something or tried out a project from the show, then tell us about it! Email hosts@dataengineeringpodcast.com (mailto:hosts@dataengineeringpodcast.com) with your story.
To help other people find the show please leave a review on Apple Podcasts (https://podcasts.apple.com/us/podcast/data-engineering-podcast/id1193040557) and tell your friends and co-workers
Links
CXData Labs (https://www.cxdatalabs.com/)
Dimensional Modeling (https://en.wikipedia.org/wiki/Dimensional_modeling)
The intro and outro music is from The Hug (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/Love_death_and_a_drunken_monkey/04_-_The_Hug) by The Freak Fandango Orchestra (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/) / CC BY-SA (http://creativecommons.org/licenses/by-sa/3.0/)
10/23/2023 • 1 hour, 3 minutes, 50 seconds
Reducing The Barrier To Entry For Building Stream Processing Applications With Decodable
Summary
Building streaming applications has gotten substantially easier over the past several years. Despite this, it is still operationally challenging to deploy and maintain your own stream processing infrastructure. Decodable was built with a mission of eliminating all of the painful aspects of developing and deploying stream processing systems for engineering teams. In this episode Eric Sammer discusses why more companies are including real-time capabilities in their products and the ways that Decodable makes it faster and easier.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
Introducing RudderStack Profiles. RudderStack Profiles takes the SaaS guesswork and SQL grunt work out of building complete customer profiles so you can quickly ship actionable, enriched data to every downstream team. You specify the customer traits, then Profiles runs the joins and computations for you to create complete customer profiles. Get all of the details and try the new product today at dataengineeringpodcast.com/rudderstack (https://www.dataengineeringpodcast.com/rudderstack)
This episode is brought to you by Datafold – a testing automation platform for data engineers that finds data quality issues before the code and data are deployed to production. Datafold leverages data-diffing to compare production and development environments and column-level lineage to show you the exact impact of every code change on data, metrics, and BI tools, keeping your team productive and stakeholders happy. Datafold integrates with dbt, the modern data stack, and seamlessly plugs in your data CI for team-wide and automated testing. If you are migrating to a modern data stack, Datafold can also help you automate data and code validation to speed up the migration. Learn more about Datafold by visiting dataengineeringpodcast.com/datafold (https://www.dataengineeringpodcast.com/datafold)
You shouldn't have to throw away the database to build with fast-changing data. You should be able to keep the familiarity of SQL and the proven architecture of cloud warehouses, but swap the decades-old batch computation model for an efficient incremental engine to get complex queries that are always up-to-date. With Materialize, you can! It’s the only true SQL streaming database built from the ground up to meet the needs of modern data products. Whether it’s real-time dashboarding and analytics, personalization and segmentation or automation and alerting, Materialize gives you the ability to work with fresh, correct, and scalable results — all in a familiar SQL interface. Go to dataengineeringpodcast.com/materialize (https://www.dataengineeringpodcast.com/materialize) today to get 2 weeks free!
As more people start using AI for projects, two things are clear: It’s a rapidly advancing field, but it’s tough to navigate. How can you get the best results for your use case? Instead of being subjected to a bunch of buzzword bingo, hear directly from pioneers in the developer and data science space on how they use graph tech to build AI-powered apps. Attend the dev and ML talks at NODES 2023, a free online conference on October 26 featuring some of the brightest minds in tech. Check out the agenda and register today at Neo4j.com/NODES (https://Neo4j.com/NODES).
Your host is Tobias Macey and today I'm interviewing Eric Sammer about starting your stream processing journey with Decodable
Interview
Introduction
How did you get involved in the area of data management?
Can you describe what Decodable is and the story behind it?
What are the notable changes to the Decodable platform since we last spoke? (October 2021)
What are the industry shifts that have influenced the product direction?
What are the problems that customers are trying to solve when they come to Decodable?
When you launched your focus was on SQL transformations of streaming data. What was the process for adding full Java support in addition to SQL?
What are the developer experience challenges that are particular to working with streaming data?
How have you worked to address that in the Decodable platform and interfaces?
As you evolve the technical and product direction, what is your heuristic for balancing the unification of interfaces and system integration against the ability to swap different components or interfaces as new technologies are introduced?
What are the most interesting, innovative, or unexpected ways that you have seen Decodable used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on Decodable?
When is Decodable the wrong choice?
What do you have planned for the future of Decodable?
Contact Info
esammer (https://github.com/esammer) on GitHub
LinkedIn (https://www.linkedin.com/in/esammer/)
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ (https://www.pythonpodcast.com) covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast (https://www.themachinelearningpodcast.com) helps you go from idea to production with machine learning.
Visit the site (https://www.dataengineeringpodcast.com) to subscribe to the show, sign up for the mailing list, and read the show notes.
If you've learned something or tried out a project from the show, then tell us about it! Email hosts@dataengineeringpodcast.com (mailto:hosts@dataengineeringpodcast.com) with your story.
To help other people find the show please leave a review on Apple Podcasts (https://podcasts.apple.com/us/podcast/data-engineering-podcast/id1193040557) and tell your friends and co-workers
Links
Decodable (https://www.decodable.co/)
Podcast Episode (https://www.dataengineeringpodcast.com/decodable-streaming-data-pipelines-sql-episode-233/)
Flink (https://flink.apache.org/)
Podcast Episode (https://www.dataengineeringpodcast.com/apache-flink-with-fabian-hueske-episode-57/)
Debezium (https://debezium.io/)
Podcast Episode (https://www.dataengineeringpodcast.com/debezium-change-data-capture-episode-114/)
Kafka (https://kafka.apache.org/)
Redpanda (https://redpanda.com/)
Podcast Episode (https://www.dataengineeringpodcast.com/vectorized-red-panda-streaming-data-episode-152/)
Kinesis (https://aws.amazon.com/kinesis/)
PostgreSQL (https://www.postgresql.org/)
Podcast Episode (https://www.dataengineeringpodcast.com/postgresql-with-jonathan-katz-episode-42/)
Snowflake (https://www.snowflake.com/en/)
Podcast Episode (https://www.dataengineeringpodcast.com/snowflakedb-cloud-data-warehouse-episode-110/)
Databricks (https://www.databricks.com/)
Startree (https://startree.ai/)
Pinot (https://pinot.apache.org/)
Podcast Episode (https://www.dataengineeringpodcast.com/pinot-embedded-analytics-episode-273/)
Rockset (https://rockset.com/)
Podcast Episode (https://www.dataengineeringpodcast.com/rockset-serverless-analytics-episode-101/)
Druid (https://druid.apache.org/)
InfluxDB (https://www.influxdata.com/)
Samza (https://samza.apache.org/)
Storm (https://storm.apache.org/)
Pulsar (https://pulsar.apache.org/)
Podcast Episode (https://www.dataengineeringpodcast.com/pulsar-fast-and-scalable-messaging-with-rajan-dhabalia-and-matteo-merli-episode-17)
ksqlDB (https://ksqldb.io/)
Podcast Episode (https://www.dataengineeringpodcast.com/ksqldb-kafka-stream-processing-episode-122/)
dbt (https://www.getdbt.com/)
GitHub Actions (https://github.com/features/actions)
Airbyte (https://airbyte.com/)
Singer (https://www.singer.io/)
Splunk (https://www.splunk.com/)
Outbox Pattern (https://debezium.io/blog/2019/02/19/reliable-microservices-data-exchange-with-the-outbox-pattern/)
The intro and outro music is from The Hug (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/Love_death_and_a_drunken_monkey/04_-_The_Hug) by The Freak Fandango Orchestra (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/) / CC BY-SA (http://creativecommons.org/licenses/by-sa/3.0/)
10/15/2023 • 1 hour, 8 minutes, 28 seconds
Using Data To Illuminate The Intentionally Opaque Insurance Industry
Summary
The insurance industry is notoriously opaque and hard to navigate. Max Cho found that fact frustrating enough that he decided to build a business of making policy selection more navigable. In this episode he shares his journey of data collection and analysis and the challenges of automating an intentionally manual industry.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
Introducing RudderStack Profiles. RudderStack Profiles takes the SaaS guesswork and SQL grunt work out of building complete customer profiles so you can quickly ship actionable, enriched data to every downstream team. You specify the customer traits, then Profiles runs the joins and computations for you to create complete customer profiles. Get all of the details and try the new product today at dataengineeringpodcast.com/rudderstack (https://www.dataengineeringpodcast.com/rudderstack)
This episode is brought to you by Datafold – a testing automation platform for data engineers that finds data quality issues before the code and data are deployed to production. Datafold leverages data-diffing to compare production and development environments and column-level lineage to show you the exact impact of every code change on data, metrics, and BI tools, keeping your team productive and stakeholders happy. Datafold integrates with dbt, the modern data stack, and seamlessly plugs in your data CI for team-wide and automated testing. If you are migrating to a modern data stack, Datafold can also help you automate data and code validation to speed up the migration. Learn more about Datafold by visiting dataengineeringpodcast.com/datafold (https://www.dataengineeringpodcast.com/datafold)
As more people start using AI for projects, two things are clear: It’s a rapidly advancing field, but it’s tough to navigate. How can you get the best results for your use case? Instead of being subjected to a bunch of buzzword bingo, hear directly from pioneers in the developer and data science space on how they use graph tech to build AI-powered apps. Attend the dev and ML talks at NODES 2023, a free online conference on October 26 featuring some of the brightest minds in tech. Check out the agenda and register today at Neo4j.com/NODES (https://Neo4j.com/NODES).
You shouldn't have to throw away the database to build with fast-changing data. You should be able to keep the familiarity of SQL and the proven architecture of cloud warehouses, but swap the decades-old batch computation model for an efficient incremental engine to get complex queries that are always up-to-date. With Materialize, you can! It’s the only true SQL streaming database built from the ground up to meet the needs of modern data products. Whether it’s real-time dashboarding and analytics, personalization and segmentation or automation and alerting, Materialize gives you the ability to work with fresh, correct, and scalable results — all in a familiar SQL interface. Go to dataengineeringpodcast.com/materialize (https://www.dataengineeringpodcast.com/materialize) today to get 2 weeks free!
Your host is Tobias Macey and today I'm interviewing Max Cho about the wild world of insurance companies and the challenges of collecting quality data for this opaque industry
Interview
Introduction
How did you get involved in the area of data management?
Can you describe what CoverageCat is and the story behind it?
What are the different sources of data that you work with?
What are the most challenging aspects of collecting that data?
Can you describe the formats and characteristics (3 Vs) of that data?
What are some of the ways that the operational models of insurance companies have contributed to the industry's opacity from a data perspective?
Can you describe how you have architected your data platform?
How have the design and goals changed since you first started working on it?
What are you optimizing for in your selection and implementation process?
What are the sharp edges/weak points that you worry about in your existing data flows?
How do you guard against those flaws in your day-to-day operations?
What are the most interesting, innovative, or unexpected ways that you have seen your data sets used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on insurance industry data?
When is a purely statistical view of insurance the wrong approach?
What do you have planned for the future of CoverageCat's data stack?
Contact Info
LinkedIn (https://www.linkedin.com/in/maxrcho/)
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ (https://www.pythonpodcast.com) covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast (https://www.themachinelearningpodcast.com) helps you go from idea to production with machine learning.
Visit the site (https://www.dataengineeringpodcast.com) to subscribe to the show, sign up for the mailing list, and read the show notes.
If you've learned something or tried out a project from the show, then tell us about it! Email hosts@dataengineeringpodcast.com (mailto:hosts@dataengineeringpodcast.com) with your story.
To help other people find the show please leave a review on Apple Podcasts (https://podcasts.apple.com/us/podcast/data-engineering-podcast/id1193040557) and tell your friends and co-workers
Links
CoverageCat (https://www.coveragecat.com/)
Actuarial Model (https://en.wikipedia.org/wiki/Actuarial_science)
The intro and outro music is from The Hug (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/Love_death_and_a_drunken_monkey/04_-_The_Hug) by The Freak Fandango Orchestra (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/) / CC BY-SA (http://creativecommons.org/licenses/by-sa/3.0/)
10/9/2023 • 51 minutes, 58 seconds
Building ETL Pipelines With Generative AI
Summary
Artificial intelligence applications require substantial volumes of high-quality data, which is typically provided through ETL pipelines. Now that AI has reached the level of sophistication seen in the current generation of generative models, it is also being used to build new ETL workflows. In this episode Jay Mishra shares his experiences and insights from building ETL pipelines with the help of generative AI.
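As an illustration of the general pattern discussed in this episode, one common approach is to have the model propose a declarative mapping and then apply that mapping with ordinary, reviewable code, keeping the LLM out of the runtime path. The sketch below is an assumption-laden example, not Astera's implementation: call_llm() is a hypothetical stand-in for whichever model API you use, and the column-mapping format is invented for illustration.

    import json
    import pandas as pd

    def call_llm(prompt: str) -> str:
        # Hypothetical helper: replace with your model provider's client.
        # It should return a JSON object mapping source columns to target columns.
        raise NotImplementedError

    def propose_mapping(source_columns, target_columns):
        prompt = (
            "Map these source columns to the target schema and reply with JSON only.\n"
            f"Source: {source_columns}\nTarget: {target_columns}"
        )
        mapping = json.loads(call_llm(prompt))
        # Validate before trusting the model's output.
        assert set(mapping.keys()) <= set(source_columns)
        assert set(mapping.values()) <= set(target_columns)
        return mapping

    def apply_mapping(df: pd.DataFrame, mapping: dict) -> pd.DataFrame:
        # The transformation itself is deterministic and easy to test.
        return df.rename(columns=mapping)[list(mapping.values())]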
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
Introducing RudderStack Profiles. RudderStack Profiles takes the SaaS guesswork and SQL grunt work out of building complete customer profiles so you can quickly ship actionable, enriched data to every downstream team. You specify the customer traits, then Profiles runs the joins and computations for you to create complete customer profiles. Get all of the details and try the new product today at dataengineeringpodcast.com/rudderstack (https://www.dataengineeringpodcast.com/rudderstack)
This episode is brought to you by Datafold – a testing automation platform for data engineers that finds data quality issues before the code and data are deployed to production. Datafold leverages data-diffing to compare production and development environments and column-level lineage to show you the exact impact of every code change on data, metrics, and BI tools, keeping your team productive and stakeholders happy. Datafold integrates with dbt, the modern data stack, and seamlessly plugs in your data CI for team-wide and automated testing. If you are migrating to a modern data stack, Datafold can also help you automate data and code validation to speed up the migration. Learn more about Datafold by visiting dataengineeringpodcast.com/datafold (https://www.dataengineeringpodcast.com/datafold)
You shouldn't have to throw away the database to build with fast-changing data. You should be able to keep the familiarity of SQL and the proven architecture of cloud warehouses, but swap the decades-old batch computation model for an efficient incremental engine to get complex queries that are always up-to-date. With Materialize, you can! It’s the only true SQL streaming database built from the ground up to meet the needs of modern data products. Whether it’s real-time dashboarding and analytics, personalization and segmentation or automation and alerting, Materialize gives you the ability to work with fresh, correct, and scalable results — all in a familiar SQL interface. Go to dataengineeringpodcast.com/materialize (https://www.dataengineeringpodcast.com/materialize) today to get 2 weeks free!
As more people start using AI for projects, two things are clear: It’s a rapidly advancing field, but it’s tough to navigate. How can you get the best results for your use case? Instead of being subjected to a bunch of buzzword bingo, hear directly from pioneers in the developer and data science space on how they use graph tech to build AI-powered apps. Attend the dev and ML talks at NODES 2023, a free online conference on October 26 featuring some of the brightest minds in tech. Check out the agenda and register at Neo4j.com/NODES (https://neo4j.com/nodes).
Your host is Tobias Macey and today I'm interviewing Jay Mishra about the applications for generative AI in the ETL process
Interview
Introduction
How did you get involved in the area of data management?
What are the different aspects/types of ETL that you are seeing generative AI applied to?
What kind of impact are you seeing in terms of time spent/quality of output/etc.?
What kinds of projects are most likely to benefit from the application of generative AI?
Can you describe what a typical workflow of using AI to build ETL workflows looks like?
What are some of the types of errors that you are likely to experience from the AI?
Once the pipeline is defined, what does the ongoing maintenance look like?
Is the AI required to operate within the pipeline in perpetuity?
For individuals/teams/organizations who are experimenting with AI in their data engineering workflows, what are the concerns/questions that they are trying to address?
What are the most interesting, innovative, or unexpected ways that you have seen generative AI used in ETL workflows?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on ETL and generative AI?
When is AI the wrong choice for ETL applications?
What are your predictions for future applications of AI in ETL and other data engineering practices?
Contact Info
LinkedIn (https://www.linkedin.com/in/jaymishra/)
@MishraJay (https://twitter.com/MishraJay) on Twitter
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ (https://www.pythonpodcast.com) covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast (https://www.themachinelearningpodcast.com) helps you go from idea to production with machine learning.
Visit the site (https://www.dataengineeringpodcast.com) to subscribe to the show, sign up for the mailing list, and read the show notes.
If you've learned something or tried out a project from the show, then tell us about it! Email hosts@dataengineeringpodcast.com (mailto:hosts@dataengineeringpodcast.com) with your story.
To help other people find the show please leave a review on Apple Podcasts (https://podcasts.apple.com/us/podcast/data-engineering-podcast/id1193040557) and tell your friends and co-workers
Links
Astera (https://www.astera.com/)
Data Vault (https://en.wikipedia.org/wiki/Data_vault_modeling)
Star Schema (https://en.wikipedia.org/wiki/Star_schema)
OpenAI (https://openai.com/)
GPT == Generative Pre-trained Transformer (https://en.wikipedia.org/wiki/Generative_pre-trained_transformer)
Entity Resolution (https://en.wikipedia.org/wiki/Record_linkage)
LLaMA (https://en.wikipedia.org/wiki/LLaMA)
The intro and outro music is from The Hug (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/Love_death_and_a_drunken_monkey/04_-_The_Hug) by The Freak Fandango Orchestra (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/) / CC BY-SA (http://creativecommons.org/licenses/by-sa/3.0/)
10/1/2023 • 51 minutes, 36 seconds
Powering Vector Search With Real Time And Incremental Vector Indexes
Summary
The rapid growth of machine learning, especially large language models, has led to a commensurate growth in the need to store and compare vectors. In this episode Louis Brandy discusses the applications for vector search capabilities both in and outside of AI, as well as the challenges of maintaining real-time indexes of vector data.
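For readers unfamiliar with the underlying operation, vector search reduces to finding the stored vectors closest to a query vector under some distance metric. A minimal brute-force sketch using cosine similarity is shown below (assuming numpy is available); production systems such as the ones discussed in the episode replace the linear scan with an incrementally maintained index.

    import numpy as np

    def top_k_cosine(query: np.ndarray, vectors: np.ndarray, k: int = 5):
        # Normalize so that the dot product equals cosine similarity.
        q = query / np.linalg.norm(query)
        v = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
        scores = v @ q                      # one similarity score per stored vector
        top = np.argsort(-scores)[:k]       # indices of the k most similar vectors
        return top, scores[top]

    # Example: 10,000 stored 384-dimensional embeddings and one query vector.
    corpus = np.random.rand(10_000, 384)
    query = np.random.rand(384)
    indices, similarities = top_k_cosine(query, corpus, k=3)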
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
Introducing RudderStack Profiles. RudderStack Profiles takes the SaaS guesswork and SQL grunt work out of building complete customer profiles so you can quickly ship actionable, enriched data to every downstream team. You specify the customer traits, then Profiles runs the joins and computations for you to create complete customer profiles. Get all of the details and try the new product today at dataengineeringpodcast.com/rudderstack (https://www.dataengineeringpodcast.com/rudderstack)
This episode is brought to you by Datafold – a testing automation platform for data engineers that finds data quality issues before the code and data are deployed to production. Datafold leverages data-diffing to compare production and development environments and column-level lineage to show you the exact impact of every code change on data, metrics, and BI tools, keeping your team productive and stakeholders happy. Datafold integrates with dbt, the modern data stack, and seamlessly plugs in your data CI for team-wide and automated testing. If you are migrating to a modern data stack, Datafold can also help you automate data and code validation to speed up the migration. Learn more about Datafold by visiting dataengineeringpodcast.com/datafold (https://www.dataengineeringpodcast.com/datafold)
You shouldn't have to throw away the database to build with fast-changing data. You should be able to keep the familiarity of SQL and the proven architecture of cloud warehouses, but swap the decades-old batch computation model for an efficient incremental engine to get complex queries that are always up-to-date. With Materialize, you can! It’s the only true SQL streaming database built from the ground up to meet the needs of modern data products. Whether it’s real-time dashboarding and analytics, personalization and segmentation or automation and alerting, Materialize gives you the ability to work with fresh, correct, and scalable results — all in a familiar SQL interface. Go to dataengineeringpodcast.com/materialize (https://www.dataengineeringpodcast.com/materialize) today to get 2 weeks free!
If you’re a data person, you probably have to jump between different tools to run queries, build visualizations, write Python, and send around a lot of spreadsheets and CSV files. Hex brings everything together. Its powerful notebook UI lets you analyze data in SQL, Python, or no-code, in any combination, and work together with live multiplayer and version control. And now, Hex’s magical AI tools can generate queries and code, create visualizations, and even kickstart a whole analysis for you – all from natural language prompts. It’s like having an analytics co-pilot built right into where you’re already doing your work. Then, when you’re ready to share, you can use Hex’s drag-and-drop app builder to configure beautiful reports or dashboards that anyone can use. Join the hundreds of data teams like Notion, AllTrails, Loom, Mixpanel and Algolia using Hex every day to make their work more impactful. Sign up today at dataengineeringpodcast.com/hex (https://www.dataengineeringpodcast.com/hex) to get a 30-day free trial of the Hex Team plan!
Your host is Tobias Macey and today I'm interviewing Louis Brandy about building vector indexes in real-time for analytics and AI applications
Interview
Introduction
How did you get involved in the area of data management?
Can you describe what vector search is and how it differs from other search technologies?
What are the technical challenges related to providing vector search?
What are the applications for vector search that merit the added complexity?
Vector databases have been gaining a lot of attention recently with the proliferation of LLM applications. Is a dedicated database technology required to support vector indexes/vector search queries?
What are the use cases for native vector data types that are separate from AI?
With the increasing usage of vectors for data and AI/ML applications, who do you typically see as the owner of that problem space? (e.g. data engineers, ML engineers, data scientists, etc.)
For teams who are investing in vector search, what are the architectural considerations that they need to be aware of?
How does it impact the data pipeline strategies/topologies used?
What are the complexities that need to be addressed when updating vector data in a real-time/streaming fashion?
How does that influence the client strategies that are querying that data?
What are the most interesting, innovative, or unexpected ways that you have seen vector search used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on vector search applications?
When is vector search the wrong choice?
What do you see as future potential applications for vector indexes/vector search?
Contact Info
LinkedIn (https://www.linkedin.com/in/lbrandy/)
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don't forget to check out our other shows. The Machine Learning Podcast (https://www.themachinelearningpodcast.com) helps you go from idea to production with machine learning. Podcast.__init__ (https://www.pythonpodcast.com) covers the Python language, its community, and the innovative ways it is being used.
Visit the site (https://www.dataengineeringpodcast.com) to subscribe to the show, sign up for the mailing list, and read the show notes.
If you've learned something or tried out a project from the show, then tell us about it! Email hosts@dataengineeringpodcast.com (mailto:hosts@dataengineeringpodcast.com) with your story.
To help other people find the show please leave a review on Apple Podcasts (https://podcasts.apple.com/us/podcast/data-engineering-podcast/id1193040557) and tell your friends and co-workers
Links
Rockset (https://rockset.com/)
Podcast Episode (https://www.dataengineeringpodcast.com/rockset-serverless-analytics-episode-101/)
Vector Index (https://www.datastax.com/guides/what-is-a-vector-index)
Vector Search (https://www.datastax.com/guides/what-is-vector-search)
Rockset Implementation Explanation (https://rockset.com/videos/vector-search-architecture/)
Vector Space (https://en.wikipedia.org/wiki/Vector_space)
Euclidean Distance (https://en.wikipedia.org/wiki/Euclidean_distance)
OLAP == Online Analytical Processing (https://en.wikipedia.org/wiki/Online_analytical_processing)
OLTP == Online Transaction Processing (https://en.wikipedia.org/wiki/Online_transaction_processing)
The intro and outro music is from The Hug (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/Love_death_and_a_drunken_monkey/04_-_The_Hug) by The Freak Fandango Orchestra (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/) / CC BY-SA (http://creativecommons.org/licenses/by-sa/3.0/)
9/25/2023 • 59 minutes, 16 seconds
Building Linked Data Products With JSON-LD
Summary
A significant amount of time in data engineering is dedicated to building connections and semantic meaning around pieces of information. Linked data technologies provide a means of tightly coupling metadata with raw information. In this episode Brian Platz explains how JSON-LD can be used as a shared representation of linked data for building semantic data products.
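For context, a JSON-LD document is ordinary JSON plus an "@context" that maps short property names to globally unique IRIs, which is what makes records linkable across systems. The following is a minimal sketch in Python using only the standard library; the schema.org terms and example URLs are purely illustrative.

    import json

    product = {
        "@context": {
            "name": "https://schema.org/name",
            "manufacturer": "https://schema.org/manufacturer",
        },
        "@id": "https://example.com/products/42",
        "@type": "https://schema.org/Product",
        "name": "Example Widget",
        "manufacturer": {"@id": "https://example.com/orgs/acme"},
    }

    # The "@id" values are the links: any system that sees the same IRI
    # knows it refers to the same entity, regardless of where the JSON lives.
    print(json.dumps(product, indent=2))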
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
This episode is brought to you by Datafold – a testing automation platform for data engineers that finds data quality issues before the code and data are deployed to production. Datafold leverages data-diffing to compare production and development environments and column-level lineage to show you the exact impact of every code change on data, metrics, and BI tools, keeping your team productive and stakeholders happy. Datafold integrates with dbt, the modern data stack, and seamlessly plugs in your data CI for team-wide and automated testing. If you are migrating to a modern data stack, Datafold can also help you automate data and code validation to speed up the migration. Learn more about Datafold by visiting dataengineeringpodcast.com/datafold (https://www.dataengineeringpodcast.com/datafold)
Introducing RudderStack Profiles. RudderStack Profiles takes the SaaS guesswork and SQL grunt work out of building complete customer profiles so you can quickly ship actionable, enriched data to every downstream team. You specify the customer traits, then Profiles runs the joins and computations for you to create complete customer profiles. Get all of the details and try the new product today at dataengineeringpodcast.com/rudderstack (https://www.dataengineeringpodcast.com/rudderstack)
You shouldn't have to throw away the database to build with fast-changing data. You should be able to keep the familiarity of SQL and the proven architecture of cloud warehouses, but swap the decades-old batch computation model for an efficient incremental engine to get complex queries that are always up-to-date. With Materialize, you can! It’s the only true SQL streaming database built from the ground up to meet the needs of modern data products. Whether it’s real-time dashboarding and analytics, personalization and segmentation or automation and alerting, Materialize gives you the ability to work with fresh, correct, and scalable results — all in a familiar SQL interface. Go to dataengineeringpodcast.com/materialize (https://www.dataengineeringpodcast.com/materialize) today to get 2 weeks free!
If you’re a data person, you probably have to jump between different tools to run queries, build visualizations, write Python, and send around a lot of spreadsheets and CSV files. Hex brings everything together. Its powerful notebook UI lets you analyze data in SQL, Python, or no-code, in any combination, and work together with live multiplayer and version control. And now, Hex’s magical AI tools can generate queries and code, create visualizations, and even kickstart a whole analysis for you – all from natural language prompts. It’s like having an analytics co-pilot built right into where you’re already doing your work. Then, when you’re ready to share, you can use Hex’s drag-and-drop app builder to configure beautiful reports or dashboards that anyone can use. Join the hundreds of data teams like Notion, AllTrails, Loom, Mixpanel and Algolia using Hex every day to make their work more impactful. Sign up today at dataengineeringpodcast.com/hex (https://www.dataengineeringpodcast.com/hex) to get a 30-day free trial of the Hex Team plan!
Your host is Tobias Macey and today I'm interviewing Brian Platz about using JSON-LD for building linked-data products
Interview
Introduction
How did you get involved in the area of data management?
Can you describe what the term "linked data product" means and some examples of when you might build one?
What is the overlap between knowledge graphs and "linked data products"?
What is JSON-LD?
What are the domains in which it is typically used?
How does it assist in developing linked data products?
What are the characteristics that distinguish a knowledge graph from a linked data product?
What are the layers/stages of applications and data that can/should incorporate JSON-LD as the representation for records and events?
What is the level of native support/compatibility that you see for JSON-LD in data systems?
What are the modeling exercises that are necessary to ensure useful and appropriate linkages of different records within and between products and organizations?
Can you describe the workflow for building autonomous linkages across data assets that are modelled as JSON-LD?
What are the most interesting, innovative, or unexpected ways that you have seen JSON-LD used for data workflows?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on linked data products?
When is JSON-LD the wrong choice?
What are the future directions that you would like to see for JSON-LD and linked data in the data ecosystem?
Contact Info
LinkedIn (https://www.linkedin.com/in/brianplatz/)
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ (https://www.pythonpodcast.com) covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast (https://www.themachinelearningpodcast.com) helps you go from idea to production with machine learning.
Visit the site (https://www.dataengineeringpodcast.com) to subscribe to the show, sign up for the mailing list, and read the show notes.
If you've learned something or tried out a project from the show, then tell us about it! Email hosts@dataengineeringpodcast.com (mailto:hosts@dataengineeringpodcast.com) with your story.
To help other people find the show please leave a review on Apple Podcasts (https://podcasts.apple.com/us/podcast/data-engineering-podcast/id1193040557) and tell your friends and co-workers
Links
Fluree (https://flur.ee/)
JSON-LD (https://json-ld.org/)
Knowledge Graph (https://en.wikipedia.org/wiki/Knowledge_graph)
Adjacency List (https://en.wikipedia.org/wiki/Adjacency_list)
RDF == Resource Description Framework (https://www.w3.org/RDF/)
Semantic Web (https://en.wikipedia.org/wiki/Semantic_Web)
Open Graph (https://ogp.me/)
Schema.org (https://schema.org/)
RDF Triple (https://en.wikipedia.org/wiki/Semantic_triple)
IDMP == Identification of Medicinal Products (https://www.fda.gov/industry/fda-data-standards-advisory-board/identification-medicinal-products-idmp)
FIBO == Financial Industry Business Ontology (https://spec.edmcouncil.org/fibo/)
OWL Standard (https://www.w3.org/OWL/)
NP-Hard (https://en.wikipedia.org/wiki/NP-hardness)
Forward-Chaining Rules (https://en.wikipedia.org/wiki/Forward_chaining)
SHACL == Shapes Constraint Language (https://www.w3.org/TR/shacl/)
Zero Knowledge Cryptography (https://en.wikipedia.org/wiki/Zero-knowledge_proof)
Turtle Serialization (https://www.w3.org/TR/turtle/)
The intro and outro music is from The Hug (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/Love_death_and_a_drunken_monkey/04_-_The_Hug) by The Freak Fandango Orchestra (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/) / CC BY-SA (http://creativecommons.org/licenses/by-sa/3.0/)
9/17/2023 • 1 hour, 1 minute, 30 seconds
An Overview Of The State Of Data Orchestration In An Increasingly Complex Data Ecosystem
Summary
Data systems are inherently complex and often require integration of multiple technologies. Orchestrators are centralized utilities that control the execution and sequencing of interdependent operations. This offers a single location for visibility and error handling so that data platform engineers can keep the overall complexity manageable. In this episode Nick Schrock, creator of Dagster, shares his perspective on the state of data orchestration technology and its applications, to help inform how you implement it in your environment.
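As a concrete reference point for the asset-oriented model discussed in the episode, a minimal Dagster example might look roughly like the following. This is a sketch assuming a recent Dagster release; the asset bodies are placeholders rather than a recommended implementation.

    from dagster import asset, materialize

    @asset
    def raw_orders():
        # Placeholder extraction step; in practice this would pull from an API or database.
        return [{"id": 1, "amount": 10.0}, {"id": 2, "amount": 25.5}]

    @asset
    def order_totals(raw_orders):
        # Downstream asset: Dagster infers the dependency from the argument name.
        return sum(row["amount"] for row in raw_orders)

    if __name__ == "__main__":
        # Materialize both assets in dependency order within this process.
        result = materialize([raw_orders, order_totals])
        assert result.success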
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
Introducing RudderStack Profiles. RudderStack Profiles takes the SaaS guesswork and SQL grunt work out of building complete customer profiles so you can quickly ship actionable, enriched data to every downstream team. You specify the customer traits, then Profiles runs the joins and computations for you to create complete customer profiles. Get all of the details and try the new product today at dataengineeringpodcast.com/rudderstack (https://www.dataengineeringpodcast.com/rudderstack)
This episode is brought to you by Datafold – a testing automation platform for data engineers that finds data quality issues before the code and data are deployed to production. Datafold leverages data-diffing to compare production and development environments and column-level lineage to show you the exact impact of every code change on data, metrics, and BI tools, keeping your team productive and stakeholders happy. Datafold integrates with dbt, the modern data stack, and seamlessly plugs in your data CI for team-wide and automated testing. If you are migrating to a modern data stack, Datafold can also help you automate data and code validation to speed up the migration. Learn more about Datafold by visiting dataengineeringpodcast.com/datafold (https://www.dataengineeringpodcast.com/datafold)
You shouldn't have to throw away the database to build with fast-changing data. You should be able to keep the familiarity of SQL and the proven architecture of cloud warehouses, but swap the decades-old batch computation model for an efficient incremental engine to get complex queries that are always up-to-date. With Materialize, you can! It’s the only true SQL streaming database built from the ground up to meet the needs of modern data products. Whether it’s real-time dashboarding and analytics, personalization and segmentation or automation and alerting, Materialize gives you the ability to work with fresh, correct, and scalable results — all in a familiar SQL interface. Go to dataengineeringpodcast.com/materialize (https://www.dataengineeringpodcast.com/materialize) today to get 2 weeks free!
Your host is Tobias Macey and today I'm welcoming back Nick Schrock to talk about the state of the ecosystem for data orchestration
Interview
Introduction
How did you get involved in the area of data management?
Can you start by defining what data orchestration is and how it differs from other types of orchestration systems? (e.g. container orchestration, generalized workflow orchestration, etc.)
What are the misconceptions about the applications of/need for/cost to implement data orchestration?
How do those challenges of customer education change across roles/personas?
Because of the multi-faceted nature of data in an organization, how does that influence the capabilities and interfaces that are needed in an orchestration engine?
You have been working on Dagster for five years now. How have the requirements/adoption/application for orchestrators changed in that time?
One of the challenges for any orchestration engine is to balance the need for robust and extensible core capabilities with a rich suite of integrations to the broader data ecosystem. What are the factors that you have seen make the most influence in driving adoption of a given engine?
What are the most interesting, innovative, or unexpected ways that you have seen data orchestration implemented and/or used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on data orchestration?
When is a data orchestrator the wrong choice?
What do you have planned for the future of orchestration with Dagster?
Contact Info
@schrockn (https://twitter.com/schrockn) on Twitter
LinkedIn (https://www.linkedin.com/in/schrockn)
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ (https://www.pythonpodcast.com) covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast (https://www.themachinelearningpodcast.com) helps you go from idea to production with machine learning.
Visit the site (https://www.dataengineeringpodcast.com) to subscribe to the show, sign up for the mailing list, and read the show notes.
If you've learned something or tried out a project from the show, then tell us about it! Email hosts@dataengineeringpodcast.com (mailto:hosts@dataengineeringpodcast.com) with your story.
To help other people find the show please leave a review on Apple Podcasts (https://podcasts.apple.com/us/podcast/data-engineering-podcast/id1193040557) and tell your friends and co-workers
Links
Dagster (https://dagster.io/)
GraphQL (https://graphql.org/)
K8s == Kubernetes (https://kubernetes.io/)
Airbyte (https://airbyte.com/)
Podcast Episode (https://www.dataengineeringpodcast.com/airbyte-open-source-data-integration-episode-173/)
Hightouch (https://hightouch.com/)
Podcast Episode (https://www.dataengineeringpodcast.com/hightouch-customer-data-warehouse-episode-168/)
Airflow (https://airflow.apache.org/)
Prefect (https://www.prefect.io)
Flyte (https://flyte.org/)
Podcast Episode (https://www.dataengineeringpodcast.com/flyte-data-orchestration-machine-learning-episode-291/)
dbt (https://www.getdbt.com/)
Podcast Episode (https://www.dataengineeringpodcast.com/dbt-data-analytics-episode-81/)
DAG == Directed Acyclic Graph (https://en.wikipedia.org/wiki/Directed_acyclic_graph)
Temporal (https://temporal.io/)
Software Defined Assets (https://docs.dagster.io/concepts/assets/software-defined-assets)
DataForm (https://dataform.co/)
Gradient Flow State Of Orchestration Report 2022 (https://gradientflow.com/2022-workflow-orchestration-survey/)
MLOps Is 98% Data Engineering (https://mlops.community/mlops-is-mostly-data-engineering/)
DataHub (https://datahubproject.io/)
Podcast Episode (https://www.dataengineeringpodcast.com/datahub-metadata-management-episode-147/)
OpenMetadata (https://open-metadata.org/)
Podcast Episode (https://www.dataengineeringpodcast.com/openmetadata-universal-metadata-layer-episode-237/)
Atlan (https://atlan.com/)
Podcast Episode (https://www.dataengineeringpodcast.com/atlan-data-team-collaboration-episode-179/)
The intro and outro music is from The Hug (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/Love_death_and_a_drunken_monkey/04_-_The_Hug) by The Freak Fandango Orchestra (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/) / CC BY-SA (http://creativecommons.org/licenses/by-sa/3.0/)
9/10/2023 • 1 hour, 1 minute, 25 seconds
Eliminate The Overhead In Your Data Integration With The Open Source dlt Library
Summary
Cloud data warehouses and the introduction of the ELT paradigm have led to the creation of multiple options for flexible data integration, with a roughly equal distribution of commercial and open source options. The challenge is that most of those options are complex to operate and exist in their own silos. The dlt project was created to eliminate that overhead and bring data integration under your full control as a library component of your overall data system. In this episode Adrian Brudaru explains how it works, the benefits that it provides over other data integration solutions, and how you can start building pipelines today.
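To make the "library, not platform" framing concrete, a minimal dlt pipeline can be sketched roughly as follows. This assumes dlt is installed with its DuckDB support; the resource, pipeline, and dataset names are illustrative only, and the official documentation linked below is the authoritative reference.

    import dlt

    @dlt.resource(write_disposition="append")
    def pokemon():
        # Any iterable of dicts works as a source; dlt infers and evolves the schema.
        yield {"id": 1, "name": "bulbasaur"}
        yield {"id": 2, "name": "ivysaur"}

    pipeline = dlt.pipeline(
        pipeline_name="example_pipeline",
        destination="duckdb",
        dataset_name="example_data",
    )

    # Extract, normalize, and load in one call; state is kept for incremental runs.
    info = pipeline.run(pokemon())
    print(info)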
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
Introducing RudderStack Profiles. RudderStack Profiles takes the SaaS guesswork and SQL grunt work out of building complete customer profiles so you can quickly ship actionable, enriched data to every downstream team. You specify the customer traits, then Profiles runs the joins and computations for you to create complete customer profiles. Get all of the details and try the new product today at dataengineeringpodcast.com/rudderstack (https://www.dataengineeringpodcast.com/rudderstack)
You shouldn't have to throw away the database to build with fast-changing data. You should be able to keep the familiarity of SQL and the proven architecture of cloud warehouses, but swap the decades-old batch computation model for an efficient incremental engine to get complex queries that are always up-to-date. With Materialize, you can! It’s the only true SQL streaming database built from the ground up to meet the needs of modern data products. Whether it’s real-time dashboarding and analytics, personalization and segmentation or automation and alerting, Materialize gives you the ability to work with fresh, correct, and scalable results — all in a familiar SQL interface. Go to dataengineeringpodcast.com/materialize (https://www.dataengineeringpodcast.com/materialize) today to get 2 weeks free!
This episode is brought to you by Datafold – a testing automation platform for data engineers that finds data quality issues before the code and data are deployed to production. Datafold leverages data-diffing to compare production and development environments and column-level lineage to show you the exact impact of every code change on data, metrics, and BI tools, keeping your team productive and stakeholders happy. Datafold integrates with dbt, the modern data stack, and seamlessly plugs in your data CI for team-wide and automated testing. If you are migrating to a modern data stack, Datafold can also help you automate data and code validation to speed up the migration. Learn more about Datafold by visiting dataengineeringpodcast.com/datafold (https://www.dataengineeringpodcast.com/datafold)
Your host is Tobias Macey and today I'm interviewing Adrian Brudaru about dlt, an open source python library for data loading
Interview
Introduction
How did you get involved in the area of data management?
Can you describe what dlt is and the story behind it?
What is the problem you want to solve with dlt?
Who is the target audience?
The obvious comparison is with systems like Singer/Meltano/Airbyte in the open source space, or Fivetran/Matillion/etc. in the commercial space. What are the complexities or limitations of those tools that leave an opening for dlt?
Can you describe how dlt is implemented?
What are the benefits of building it in Python?
How have the design and goals of the project changed since you first started working on it?
How does that language choice influence the performance and scaling characteristics?
What problems do users solve with dlt?
What are the interfaces available for extending/customizing/integrating with dlt?
Can you talk through the process of adding a new source/destination?
What is the workflow for someone building a pipeline with dlt?
How does the experience scale when supporting multiple connections?
Given the limited scope of extract and load, and the composable design of dlt, it seems like a purpose-built companion to dbt (down to the naming). What are the benefits of using those tools in combination?
What are the most interesting, innovative, or unexpected ways that you have seen dlt used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on dlt?
When is dlt the wrong choice?
What do you have planned for the future of dlt?
Contact Info
LinkedIn (https://www.linkedin.com/in/data-team/?originalSubdomain=de)
Join our community to discuss further (https://join.slack.com/t/dlthub-community/shared_invite/zt-1slox199h-HAE7EQoXmstkP_bTqal65g)
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ (https://www.pythonpodcast.com) covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast (https://www.themachinelearningpodcast.com) helps you go from idea to production with machine learning.
Visit the site (https://www.dataengineeringpodcast.com) to subscribe to the show, sign up for the mailing list, and read the show notes.
If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com (mailto:hosts@dataengineeringpodcast.com) with your story.
To help other people find the show please leave a review on Apple Podcasts (https://podcasts.apple.com/us/podcast/data-engineering-podcast/id1193040557) and tell your friends and co-workers
Links
dlt (https://dlthub.com/)
Harness Success Story (https://dlthub.com/success-stories/harness/)
Our guiding product principles (https://dlthub.com/product/)
Ecosystem support (https://dlthub.com/docs/dlt-ecosystem)
From basic to complex, dlt has many capabilities (https://dlthub.com/docs/getting-started/build-a-data-pipeline)
Singer (https://www.singer.io/)
Airbyte (https://airbyte.com/)
Podcast Episode (https://www.dataengineeringpodcast.com/airbyte-open-source-data-integration-episode-173/)
Meltano (https://meltano.com/)
Podcast Episode (https://www.dataengineeringpodcast.com/meltano-data-integration-episode-141/)
Matillion (https://www.matillion.com/)
Podcast Episode (https://www.dataengineeringpodcast.com/matillion-cloud-data-integration-episode-286/)
Fivetran (https://www.fivetran.com/)
Podcast Episode (https://www.dataengineeringpodcast.com/fivetran-data-replication-episode-93/)
DuckDB (https://duckdb.org/)
Podcast Episode (https://www.dataengineeringpodcast.com/duckdb-in-process-olap-database-episode-270/)
OpenAPI (https://www.openapis.org/)
Data Mesh (https://martinfowler.com/articles/data-monolith-to-mesh.html)
Podcast Episode (https://www.dataengineeringpodcast.com/data-mesh-revisited-episode-250/)
SQLMesh (https://sqlmesh.com/)
Podcast Episode (https://www.dataengineeringpodcast.com/sqlmesh-open-source-dataops-episode-380)
Airflow (https://airflow.apache.org/)
Dagster (https://dagster.io/)
Podcast Episode (https://www.dataengineeringpodcast.com/dagster-data-platform-big-complexity-episode-239/)
Prefect (https://www.prefect.io/)
Podcast Episode (https://www.dataengineeringpodcast.com/prefect-workflow-engine-episode-86/)
Alto (https://github.com/z3z1ma/alto)
The intro and outro music is from The Hug (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/Love_death_and_a_drunken_monkey/04_-_The_Hug) by The Freak Fandango Orchestra (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/) / CC BY-SA (http://creativecommons.org/licenses/by-sa/3.0/)
9/4/2023 • 42 minutes, 12 seconds
Building An Internal Database As A Service Platform At Cloudflare
Summary
Data persistence is one of the most challenging aspects of computer systems. In the era of the cloud most developers rely on hosted services to manage their databases, but what if you are a cloud service? In this episode Vignesh Ravichandran explains how his team at Cloudflare provides PostgreSQL as a service to their developers for low-latency and high-uptime services at global scale. This is an interesting and insightful look at pragmatic engineering for reliability and scale.
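As a rough illustration of the plumbing such a platform involves (this is not Cloudflare's tooling, and the connection details are placeholders), a health check might query Postgres's pg_stat_replication view on a primary to report how far each standby is lagging:

```python
import psycopg2  # assumes `pip install psycopg2-binary`

# Hypothetical connection to a primary; credentials are placeholders.
conn = psycopg2.connect(
    host="primary.pg.internal", dbname="postgres",
    user="monitor", password="secret",
)
with conn, conn.cursor() as cur:
    cur.execute(
        """
        SELECT application_name, state, sync_state,
               pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS replay_lag_bytes
        FROM pg_stat_replication
        """
    )
    for name, state, sync_state, lag_bytes in cur.fetchall():
        print(f"{name}: state={state} sync={sync_state} lag={lag_bytes} bytes")
conn.close()
```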
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
Introducing RudderStack Profiles. RudderStack Profiles takes the SaaS guesswork and SQL grunt work out of building complete customer profiles so you can quickly ship actionable, enriched data to every downstream team. You specify the customer traits, then Profiles runs the joins and computations for you to create complete customer profiles. Get all of the details and try the new product today at dataengineeringpodcast.com/rudderstack (https://www.dataengineeringpodcast.com/rudderstack)
This episode is brought to you by Datafold – a testing automation platform for data engineers that finds data quality issues before the code and data are deployed to production. Datafold leverages data-diffing to compare production and development environments and column-level lineage to show you the exact impact of every code change on data, metrics, and BI tools, keeping your team productive and stakeholders happy. Datafold integrates with dbt, the modern data stack, and seamlessly plugs in your data CI for team-wide and automated testing. If you are migrating to a modern data stack, Datafold can also help you automate data and code validation to speed up the migration. Learn more about Datafold by visiting dataengineeringpodcast.com/datafold (https://www.dataengineeringpodcast.com/datafold)
You shouldn't have to throw away the database to build with fast-changing data. You should be able to keep the familiarity of SQL and the proven architecture of cloud warehouses, but swap the decades-old batch computation model for an efficient incremental engine to get complex queries that are always up-to-date. With Materialize, you can! It’s the only true SQL streaming database built from the ground up to meet the needs of modern data products. Whether it’s real-time dashboarding and analytics, personalization and segmentation or automation and alerting, Materialize gives you the ability to work with fresh, correct, and scalable results — all in a familiar SQL interface. Go to dataengineeringpodcast.com/materialize (https://www.dataengineeringpodcast.com/materialize) today to get 2 weeks free!
Your host is Tobias Macey and today I'm interviewing Vignesh Ravichandran about building an internal database as a service platform at Cloudflare
Interview
Introduction
How did you get involved in the area of data management?
Can you start by describing the different database workloads that you have at Cloudflare?
What are the different methods that you have used for managing database instances?
What are the requirements and constraints that you had to account for in designing your current system?
Why Postgres?
What optimizations have you made for Postgres?
What simplifications do you gain from not supporting multiple engines?
What are the limitations in Postgres that make multi-tenancy challenging?
What is your scale of operation? (data volume, request rate)
What are the most interesting, innovative, or unexpected ways that you have seen your DBaaS used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on your internal database platform?
When is an internal database as a service the wrong choice?
What do you have planned for the future of Postgres hosting at Cloudflare?
Contact Info
LinkedIn (https://www.linkedin.com/in/vigneshravichandran28/)
Website (https://viggy28.dev/)
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ (https://www.pythonpodcast.com) covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast (https://www.themachinelearningpodcast.com) helps you go from idea to production with machine learning.
Visit the site (https://www.dataengineeringpodcast.com) to subscribe to the show, sign up for the mailing list, and read the show notes.
If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com (mailto:hosts@dataengineeringpodcast.com) with your story.
To help other people find the show please leave a review on Apple Podcasts (https://podcasts.apple.com/us/podcast/data-engineering-podcast/id1193040557) and tell your friends and co-workers
Links
Cloudflare (https://www.cloudflare.com/)
PostgreSQL (https://www.postgresql.org/)
Podcast Episode (https://www.dataengineeringpodcast.com/postgresql-with-jonathan-katz-episode-42/)
IP Address Data Type in Postgres (https://www.postgresql.org/docs/current/datatype-net-types.html)
CockroachDB (https://www.cockroachlabs.com/)
Podcast Episode (https://www.dataengineeringpodcast.com/cockroachdb-with-peter-mattis-episode-35/)
Citus (https://www.citusdata.com/)
Podcast Episode (https://www.dataengineeringpodcast.com/citus-data-with-ozgun-erdogan-and-craig-kerstiens-episode-13/)
Yugabyte (https://www.yugabyte.com/)
Podcast Episode (https://www.dataengineeringpodcast.com/yugabytedb-planet-scale-sql-episode-115/)
Stolon (https://github.com/sorintlab/stolon)
pg_rewind (https://www.postgresql.org/docs/current/app-pgrewind.html)
PGBouncer (https://www.pgbouncer.org/)
HAProxy Presentation (https://www.youtube.com/watch?v=HIOo4j-Tiq4)
Etcd (https://etcd.io/)
Patroni (https://patroni.readthedocs.io/en/latest/)
pg_upgrade (https://www.postgresql.org/docs/current/pgupgrade.html)
Edge Computing (https://en.wikipedia.org/wiki/Edge_computing)
The intro and outro music is from The Hug (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/Love_death_and_a_drunken_monkey/04_-_The_Hug) by The Freak Fandango Orchestra (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/) / CC BY-SA (http://creativecommons.org/licenses/by-sa/3.0/)
8/28/2023 • 1 hour, 1 minute, 9 seconds
Harnessing Generative AI For Creating Educational Content With Illumidesk
Summary
Generative AI has unlocked a massive opportunity for content creation. There is also an unfulfilled need for experts to be able to share their knowledge and build communities. Illumidesk was built to take advantage of this intersection. In this episode Greg Werner explains how they are using generative AI as an assistive tool for creating educational material, as well as building a data-driven experience for learners.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
Introducing RudderStack Profiles. RudderStack Profiles takes the SaaS guesswork and SQL grunt work out of building complete customer profiles so you can quickly ship actionable, enriched data to every downstream team. You specify the customer traits, then Profiles runs the joins and computations for you to create complete customer profiles. Get all of the details and try the new product today at dataengineeringpodcast.com/rudderstack (https://www.dataengineeringpodcast.com/rudderstack)
This episode is brought to you by Datafold – a testing automation platform for data engineers that finds data quality issues before the code and data are deployed to production. Datafold leverages data-diffing to compare production and development environments and column-level lineage to show you the exact impact of every code change on data, metrics, and BI tools, keeping your team productive and stakeholders happy. Datafold integrates with dbt, the modern data stack, and seamlessly plugs in your data CI for team-wide and automated testing. If you are migrating to a modern data stack, Datafold can also help you automate data and code validation to speed up the migration. Learn more about Datafold by visiting dataengineeringpodcast.com/datafold (https://www.dataengineeringpodcast.com/datafold)
You shouldn't have to throw away the database to build with fast-changing data. You should be able to keep the familiarity of SQL and the proven architecture of cloud warehouses, but swap the decades-old batch computation model for an efficient incremental engine to get complex queries that are always up-to-date. With Materialize, you can! It’s the only true SQL streaming database built from the ground up to meet the needs of modern data products. Whether it’s real-time dashboarding and analytics, personalization and segmentation or automation and alerting, Materialize gives you the ability to work with fresh, correct, and scalable results — all in a familiar SQL interface. Go to dataengineeringpodcast.com/materialize (https://www.dataengineeringpodcast.com/materialize) today to get 2 weeks free!
Your host is Tobias Macey and today I'm interviewing Greg Werner about building IllumiDesk, a data-driven and AI-powered online learning platform
Interview
Introduction
How did you get involved in the area of data management?
Can you describe what Illumidesk is and the story behind it?
What are the challenges that educators and content creators face in developing and maintaining digital course materials for their target audiences?
How are you leaning on data integrations and AI to reduce the initial time investment required to deliver courseware?
What are the opportunities for collecting and collating learner interactions with the course materials to provide feedback to the instructors?
What are some of the ways that you are incorporating pedagogical strategies into the measurement and evaluation methods that you use for reports?
What are the different categories of insights that you need to provide across the different stakeholders/personas who are interacting with the platform and learning content?
Can you describe how you have architected the Illumidesk platform?
How have the design and goals shifted since you first began working on it?
What are the strategies that you have used to allow for evolution and adaptation of the system in order to keep pace with the ecosystem of generative AI capabilities?
What are the failure modes of the content generation that you need to account for?
What are the most interesting, innovative, or unexpected ways that you have seen Illumidesk used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on Illumidesk?
When is Illumidesk the wrong choice?
What do you have planned for the future of Illumidesk?
Contact Info
LinkedIn (https://www.linkedin.com/in/wernergreg/)
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ (https://www.pythonpodcast.com) covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast (https://www.themachinelearningpodcast.com) helps you go from idea to production with machine learning.
Visit the site (https://www.dataengineeringpodcast.com) to subscribe to the show, sign up for the mailing list, and read the show notes.
If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com (mailto:hosts@dataengineeringpodcast.com) with your story.
To help other people find the show please leave a review on Apple Podcasts (https://podcasts.apple.com/us/podcast/data-engineering-podcast/id1193040557) and tell your friends and co-workers
Links
Illumidesk (https://www.illumidesk.com/)
Generative AI (https://en.wikipedia.org/wiki/Generative_artificial_intelligence)
Vector Database (https://www.pinecone.io/learn/vector-database/)
LTI == Learning Tools Interoperability (https://en.wikipedia.org/wiki/Learning_Tools_Interoperability)
SCORM (https://scorm.com/scorm-explained/)
XAPI (https://xapi.com/overview/)
Prompt Engineering (https://en.wikipedia.org/wiki/Prompt_engineering)
GPT-4 (https://en.wikipedia.org/wiki/GPT-4)
LLama (https://en.wikipedia.org/wiki/LLaMA)
Anthropic (https://www.anthropic.com/)
FastAPI (https://fastapi.tiangolo.com/)
LangChain (https://www.langchain.com/)
Celery (https://docs.celeryq.dev/en/stable/)
The intro and outro music is from The Hug (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/Love_death_and_a_drunken_monkey/04_-_The_Hug) by The Freak Fandango Orchestra (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/) / CC BY-SA (http://creativecommons.org/licenses/by-sa/3.0/)
8/20/2023 • 54 minutes, 52 seconds
Unpacking The Seven Principles Of Modern Data Pipelines
Summary
Data pipelines are the core of every data product, ML model, and business intelligence dashboard. If you're not careful you will end up spending all of your time on maintenance and fire-fighting. The folks at Rivery distilled the seven principles of modern data pipelines that will help you stay out of trouble and be productive with your data. In this episode Ariel Pohoryles explains what they are and how they work together to increase your chances of success.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
Introducing RudderStack Profiles. RudderStack Profiles takes the SaaS guesswork and SQL grunt work out of building complete customer profiles so you can quickly ship actionable, enriched data to every downstream team. You specify the customer traits, then Profiles runs the joins and computations for you to create complete customer profiles. Get all of the details and try the new product today at dataengineeringpodcast.com/rudderstack (https://www.dataengineeringpodcast.com/rudderstack)
This episode is brought to you by Datafold – a testing automation platform for data engineers that finds data quality issues before the code and data are deployed to production. Datafold leverages data-diffing to compare production and development environments and column-level lineage to show you the exact impact of every code change on data, metrics, and BI tools, keeping your team productive and stakeholders happy. Datafold integrates with dbt, the modern data stack, and seamlessly plugs in your data CI for team-wide and automated testing. If you are migrating to a modern data stack, Datafold can also help you automate data and code validation to speed up the migration. Learn more about Datafold by visiting dataengineeringpodcast.com/datafold (https://www.dataengineeringpodcast.com/datafold)
Your host is Tobias Macey and today I'm interviewing Ariel Pohoryles about the seven principles of modern data pipelines
Interview
Introduction
How did you get involved in the area of data management?
Can you start by defining what you mean by a "modern" data pipeline?
At Rivery you published a white paper identifying seven principles of modern data pipelines:
Zero infrastructure management
ELT-first mindset
Speaks SQL and Python
Dynamic multi-storage layers
Reverse ETL & operational analytics
Full transparency
Faster time to value
What are the applications of data that you focused on while identifying these principles?
How does the application of these principles influence the ability of organizations and their data teams to encourage and keep pace with the use of data in the business?
What are the technical components of a pipeline infrastructure that are necessary to support a "modern" workflow?
How do the technologies involved impact the organizational involvement with how data is applied throughout the business?
When using managed services, what are the ways that the pricing model acts to encourage/discourage experimentation/exploration with data?
What are the most interesting, innovative, or unexpected ways that you have seen these seven principles implemented/applied?
What are the most interesting, unexpected, or challenging lessons that you have learned while working with customers to adapt to these principles?
What are the cases where some/all of these principles are undesirable/impractical to implement?
What are the opportunities for further advancement/sophistication in the ways that teams work with and gain value from data?
Contact Info
LinkedIn (https://www.linkedin.com/in/ariel-pohoryles-88695622/)
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ (https://www.pythonpodcast.com) covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast (https://www.themachinelearningpodcast.com) helps you go from idea to production with machine learning.
Visit the site (https://www.dataengineeringpodcast.com) to subscribe to the show, sign up for the mailing list, and read the show notes.
If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com (mailto:hosts@dataengineeringpodcast.com) with your story.
To help other people find the show please leave a review on Apple Podcasts (https://podcasts.apple.com/us/podcast/data-engineering-podcast/id1193040557) and tell your friends and co-workers
Links
Rivery (https://rivery.io/)
7 Principles Of The Modern Data Pipeline (https://rivery.io/downloads/7-principles-modern-data-pipeline-lp/)
ELT (https://en.wikipedia.org/wiki/Extract,_load,_transform)
Reverse ETL (https://rivery.io/blog/what-is-reverse-etl-guide-for-data-teams/)
Martech Landscape (https://chiefmartec.com/2023/05/2023-marketing-technology-landscape-supergraphic-11038-solutions-searchable-on-martechmap-com/)
Data Lakehouse (https://www.forbes.com/sites/bernardmarr/2022/01/18/what-is-a-data-lakehouse-a-super-simple-explanation-for-anyone/?sh=54d5c4916088)
Databricks (https://www.databricks.com/)
Snowflake (https://www.snowflake.com/en/)
The intro and outro music is from The Hug (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/Love_death_and_a_drunken_monkey/04_-_The_Hug) by The Freak Fandango Orchestra (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/) / CC BY-SA (http://creativecommons.org/licenses/by-sa/3.0/)
8/14/2023 • 47 minutes, 2 seconds
Quantifying The Return On Investment For Your Data Team
Summary
As businesses increasingly invest in technology and talent focused on data engineering and analytics, they want to know whether they are benefiting. So how do you calculate the return on investment for data? In this episode Barr Moses and Anna Filippova explore that question and provide useful exercises to start answering that in your company.
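As a back-of-the-envelope illustration of the calculation itself (every number below is invented), the basic shape is ROI = (return - investment) / investment:

```python
# Entirely hypothetical figures, just to make the formula concrete.
infrastructure_cost = 250_000   # warehouse, orchestration, SaaS tooling per year
team_cost = 900_000             # fully loaded payroll for the data team
investment = infrastructure_cost + team_cost

revenue_enabled = 1_200_000     # uplift attributed to data products
cost_savings = 400_000          # analyst hours saved through automation
returns = revenue_enabled + cost_savings

roi = (returns - investment) / investment
print(f"Data team ROI: {roi:.1%}")  # -> 39.1%
```

The hard part, as the conversation explores, is attributing the "return" figures credibly rather than doing the arithmetic.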
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
Introducing RudderStack Profiles. RudderStack Profiles takes the SaaS guesswork and SQL grunt work out of building complete customer profiles so you can quickly ship actionable, enriched data to every downstream team. You specify the customer traits, then Profiles runs the joins and computations for you to create complete customer profiles. Get all of the details and try the new product today at dataengineeringpodcast.com/rudderstack (https://www.dataengineeringpodcast.com/rudderstack)
Your host is Tobias Macey and today I'm interviewing Barr Moses and Anna Filippova about how and whether to measure the ROI of your data team
Interview
Introduction
How did you get involved in the area of data management?
What are the typical motivations for measuring and tracking the ROI for a data team?
Who is responsible for collecting that information?
How is that information used and by whom?
What are some of the downsides/risks of tracking this metric? (law of unintended consequences)
What are the inputs to the number that constitutes the "investment"? (infrastructure, payroll of employees on the team, time spent working with other teams?)
What are the aspects of data work and its impact on the business that complicate a calculation of the "return" that is generated?
How should teams think about measuring data team ROI?
What are some concrete ROI metrics data teams can use?
What level of detail is useful? What dimensions should be used for segmenting the calculations?
How can visibility into this ROI metric be best used to inform the priorities and project scopes of the team?
With so many tools in the modern data stack today, what is the role of technology in helping drive or measure this impact?
How do your respective solutions, Monte Carlo and dbt, help teams measure and scale data value?
With generative AI on the upswing of the hype cycle, what are the impacts that you see it having on data teams?
What are the unrealistic expectations that it will produce?
How can it speed up time to delivery?
What are the most interesting, innovative, or unexpected ways that you have seen data team ROI calculated and/or used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on measuring the ROI of data teams?
When is measuring ROI the wrong choice?
Contact Info
Barr
LinkedIn (https://www.linkedin.com/in/barrmoses/)
Anna
LinkedIn (https://www.linkedin.com/in/annafilippova)
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ (https://www.pythonpodcast.com) covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast (https://www.themachinelearningpodcast.com) helps you go from idea to production with machine learning.
Visit the site (https://www.dataengineeringpodcast.com) to subscribe to the show, sign up for the mailing list, and read the show notes.
If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com (mailto:hosts@dataengineeringpodcast.com) with your story.
To help other people find the show please leave a review on Apple Podcasts (https://podcasts.apple.com/us/podcast/data-engineering-podcast/id1193040557) and tell your friends and co-workers
Links
Monte Carlo (https://www.montecarlodata.com/)
Podcast Episode (https://www.dataengineeringpodcast.com/monte-carlo-observability-data-quality-episode-155)
dbt (https://www.getdbt.com/)
Podcast Episode (https://www.dataengineeringpodcast.com/dbt-data-analytics-episode-81)
JetBlue Snowflake Con Presentation (https://www.snowflake.com/webinar/thought-leadership/jet-blue-and-monte-carlos/)
Generative AI (https://generativeai.net/)
Large Language Models (https://en.wikipedia.org/wiki/Large_language_model)
The intro and outro music is from The Hug (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/Love_death_and_a_drunken_monkey/04_-_The_Hug) by The Freak Fandango Orchestra (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/) / CC BY-SA (http://creativecommons.org/licenses/by-sa/3.0/)
8/6/2023 • 1 hour, 1 minute, 52 seconds
Strategies For A Successful Data Platform Migration
Summary
All software systems are in a constant state of evolution. This makes it impossible to select a truly future-proof technology stack for your data platform, making an eventual migration inevitable. In this episode Gleb Mezhanskiy and Rob Goretsky share their experiences leading various data platform migrations, and the hard-won lessons that they learned so that you don't have to.
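One of the automation opportunities reflected in the links below is mechanical SQL dialect translation. As a sketch (the query is invented), SQLGlot can transpile a Redshift query into Snowflake syntax as a starting point for a migration:

```python
import sqlglot

redshift_sql = """
SELECT user_id,
       DATEADD(day, 7, created_at) AS renewal_date,
       LISTAGG(plan, ',') AS plans
FROM subscriptions
GROUP BY 1, 2
"""

# transpile returns one string per statement; take the first.
print(sqlglot.transpile(redshift_sql, read="redshift", write="snowflake")[0])
```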
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
Introducing RudderStack Profiles. RudderStack Profiles takes the SaaS guesswork and SQL grunt work out of building complete customer profiles so you can quickly ship actionable, enriched data to every downstream team. You specify the customer traits, then Profiles runs the joins and computations for you to create complete customer profiles. Get all of the details and try the new product today at dataengineeringpodcast.com/rudderstack (https://www.dataengineeringpodcast.com/rudderstack)
Modern data teams are using Hex to 10x their data impact. Hex combines a notebook style UI with an interactive report builder. This allows data teams to both dive deep to find insights and then share their work in an easy-to-read format to the whole org. In Hex you can use SQL, Python, R, and no-code visualization together to explore, transform, and model data. Hex also has AI built directly into the workflow to help you generate, edit, explain and document your code. The best data teams in the world such as the ones at Notion, AngelList, and Anthropic use Hex for ad hoc investigations, creating machine learning models, and building operational dashboards for the rest of their company. Hex makes it easy for data analysts and data scientists to collaborate together and produce work that has an impact. Make your data team unstoppable with Hex. Sign up today at dataengineeringpodcast.com/hex (https://www.dataengineeringpodcast.com/hex) to get a 30-day free trial for your team!
Your host is Tobias Macey and today I'm interviewing Gleb Mezhanskiy and Rob Goretsky about when and how to think about migrating your data stack
Interview
Introduction
How did you get involved in the area of data management?
A migration can be anything from a minor task to a major undertaking. Can you start by describing what constitutes a migration for the purposes of this conversation?
Is it possible to completely avoid having to invest in a migration?
What are the signals that point to the need for a migration?
What are some of the sources of cost that need to be accounted for when considering a migration? (both in terms of doing one, and the costs of not doing one)
What are some signals that a migration is not the right solution for a perceived problem?
Once the decision has been made that a migration is necessary, what are the questions that the team should be asking to determine the technologies to move to and the sequencing of execution?
What are the preceding tasks that should be completed before starting the migration to ensure there is no breakage downstream of the changing component(s)?
What are some of the ways that a migration effort might fail?
What are the major pitfalls that teams need to be aware of as they work through a data platform migration?
What are the opportunities for automation during the migration process?
What are the most interesting, innovative, or unexpected ways that you have seen teams approach a platform migration?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on data platform migrations?
What are some ways that the technologies and patterns that we use can be evolved to reduce the cost/impact/need for migrations?
Contact Info
Gleb
LinkedIn (https://www.linkedin.com/in/glebmezh/)
@glebmm (https://twitter.com/glebmm) on Twitter
Rob
LinkedIn (https://www.linkedin.com/in/robertgoretsky/)
RobGoretsky (https://github.com/RobGoretsky) on GitHub
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ (https://www.pythonpodcast.com) covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast (https://www.themachinelearningpodcast.com) helps you go from idea to production with machine learning.
Visit the site (https://www.dataengineeringpodcast.com) to subscribe to the show, sign up for the mailing list, and read the show notes.
If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com (mailto:hosts@dataengineeringpodcast.com) with your story.
To help other people find the show please leave a review on Apple Podcasts (https://podcasts.apple.com/us/podcast/data-engineering-podcast/id1193040557) and tell your friends and co-workers
Links
Datafold (https://www.datafold.com/)
Podcast Episode (https://www.dataengineeringpodcast.com/datafold-proactive-data-quality-episode-205/)
Informatica (https://www.informatica.com/)
Airflow (https://airflow.apache.org/)
Snowflake (https://www.snowflake.com/en/)
Podcast Episode (https://www.dataengineeringpodcast.com/snowflakedb-cloud-data-warehouse-episode-110/)
Redshift (https://aws.amazon.com/redshift/)
Eventbrite (https://www.eventbrite.com/)
Teradata (https://www.teradata.com/)
BigQuery (https://cloud.google.com/bigquery)
Trino (https://trino.io/)
EMR == Elastic Map-Reduce (https://aws.amazon.com/emr/)
Shadow IT (https://en.wikipedia.org/wiki/Shadow_IT)
Podcast Episode (https://www.dataengineeringpodcast.com/shadow-it-data-analytics-episode-121)
Mode Analytics (https://mode.com/)
Looker (https://cloud.google.com/looker/)
Sunk Cost Fallacy (https://en.wikipedia.org/wiki/Sunk_cost)
data-diff (https://github.com/datafold/data-diff)
Podcast Episode (https://www.dataengineeringpodcast.com/data-diff-open-source-data-integration-validation-episode-303/)
SQLGlot (https://github.com/tobymao/sqlglot)
Dagster (https://dagster.io/)
dbt (https://www.getdbt.com/)
The intro and outro music is from The Hug (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/Love_death_and_a_drunken_monkey/04_-_The_Hug) by The Freak Fandango Orchestra (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/) / CC BY-SA (http://creativecommons.org/licenses/by-sa/3.0/)
7/31/2023 • 1 hour, 9 minutes, 52 seconds
Build Real Time Applications With Operational Simplicity Using Dozer
Summary
Real-time data processing has steadily been gaining adoption due to advances in the accessibility of the technologies involved. Despite that, it is still a complex set of capabilities. To bring streaming data within reach of application engineers, Matteo Pelati helped to create Dozer. In this episode he explains how investing in high performance and operationally simplified streaming with a familiar API can yield significant benefits for software and data teams together.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
Introducing RudderStack Profiles. RudderStack Profiles takes the SaaS guesswork and SQL grunt work out of building complete customer profiles so you can quickly ship actionable, enriched data to every downstream team. You specify the customer traits, then Profiles runs the joins and computations for you to create complete customer profiles. Get all of the details and try the new product today at dataengineeringpodcast.com/rudderstack (https://www.dataengineeringpodcast.com/rudderstack)
Modern data teams are using Hex to 10x their data impact. Hex combines a notebook style UI with an interactive report builder. This allows data teams to both dive deep to find insights and then share their work in an easy-to-read format to the whole org. In Hex you can use SQL, Python, R, and no-code visualization together to explore, transform, and model data. Hex also has AI built directly into the workflow to help you generate, edit, explain and document your code. The best data teams in the world such as the ones at Notion, AngelList, and Anthropic use Hex for ad hoc investigations, creating machine learning models, and building operational dashboards for the rest of their company. Hex makes it easy for data analysts and data scientists to collaborate together and produce work that has an impact. Make your data team unstoppable with Hex. Sign up today at dataengineeringpodcast.com/hex (https://www.dataengineeringpodcast.com/hex) to get a 30-day free trial for your team!
Your host is Tobias Macey and today I'm interviewing Matteo Pelati about Dozer, an open source engine that includes data ingestion, transformation, and API generation for real-time sources
Interview
Introduction
How did you get involved in the area of data management?
Can you describe what Dozer is and the story behind it?
What was your decision process for building Dozer as open source?
As you note in the documentation, Dozer has overlap with a number of technologies that are aimed at different use cases. What was missing from each of them, and from the intersection of their Venn diagram, that prompted you to build Dozer?
In addition to working in an interesting technological cross-section, you are also targeting a disparate group of personas. Who are you building Dozer for and what were the motivations for that vision?
What are the different use cases that you are focused on supporting?
What are the features of Dozer that enable engineers to address those uses, and what makes it preferable to existing alternative approaches?
Can you describe how Dozer is implemented?
How have the design and goals of the platform changed since you first started working on it?
What are the architectural "-ilities" that you are trying to optimize for?
What is involved in getting Dozer deployed and integrated into an existing application/data infrastructure?
How can teams who are using Dozer extend/integrate with Dozer?
What does the development/deployment workflow look like for teams who are building on top of Dozer?
What is your governance model for Dozer and balancing the open source project against your business goals?
What are the most interesting, innovative, or unexpected ways that you have seen Dozer used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on Dozer?
When is Dozer the wrong choice?
What do you have planned for the future of Dozer?
Contact Info
LinkedIn (https://www.linkedin.com/in/matteopelati/?originalSubdomain=sg)
@pelatimtt (https://twitter.com/pelatimtt) on Twitter
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ (https://www.pythonpodcast.com) covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast (https://www.themachinelearningpodcast.com) helps you go from idea to production with machine learning.
Visit the site (https://www.dataengineeringpodcast.com) to subscribe to the show, sign up for the mailing list, and read the show notes.
If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com (mailto:hosts@dataengineeringpodcast.com) with your story.
To help other people find the show please leave a review on Apple Podcasts (https://podcasts.apple.com/us/podcast/data-engineering-podcast/id1193040557) and tell your friends and co-workers
Links
Dozer (https://getdozer.io/)
Data Robot (https://www.datarobot.com/)
Netflix Bulldozer (https://netflixtechblog.com/bulldozer-batch-data-moving-from-data-warehouse-to-online-key-value-stores-41bac13863f8)
CubeJS (http://cube.dev/)
Podcast Episode (https://www.dataengineeringpodcast.com/cubejs-open-source-headless-data-analytics-episode-248/)
JVM == Java Virtual Machine (https://en.wikipedia.org/wiki/Java_virtual_machine)
Flink (https://flink.apache.org/)
Podcast Episode (https://www.dataengineeringpodcast.com/apache-flink-with-fabian-hueske-episode-57/)
Airbyte (https://airbyte.com/)
Podcast Episode (https://www.dataengineeringpodcast.com/airbyte-open-source-data-integration-episode-173/)
Fivetran (https://www.fivetran.com/)
Podcast Episode (https://www.dataengineeringpodcast.com/fivetran-data-replication-episode-93/)
Delta Lake (https://delta.io/)
Podcast Episode (https://www.dataengineeringpodcast.com/delta-lake-data-lake-episode-85/)
LMDB (http://www.lmdb.tech/doc/)
Vector Database (https://thenewstack.io/what-is-a-real-vector-database/)
LLM == Large Language Model (https://en.wikipedia.org/wiki/Large_language_model)
Rockset (https://rockset.com/)
Podcast Episode (https://www.dataengineeringpodcast.com/rockset-serverless-analytics-episode-101/)
Tinybird (https://www.tinybird.co/)
Podcast Episode (https://www.dataengineeringpodcast.com/tinybird-analytical-api-platform-episode-185)
Rust Language (https://www.rust-lang.org/)
Materialize (https://materialize.com/)
Podcast Episode (https://www.dataengineeringpodcast.com/materialize-streaming-analytics-episode-112/)
RisingWave (https://www.risingwave.com/)
DuckDB (https://duckdb.org/)
Podcast Episode (https://www.dataengineeringpodcast.com/duckdb-in-process-olap-database-episode-270/)
DataFusion (https://docs.rs/datafusion/latest/datafusion/)
Polars (https://www.pola.rs/)
The intro and outro music is from The Hug (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/Love_death_and_a_drunken_monkey/04_-_The_Hug) by The Freak Fandango Orchestra (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/) / CC BY-SA (http://creativecommons.org/licenses/by-sa/3.0/)
7/24/2023 • 40 minutes, 42 seconds
Datapreneurs - How Todays Business Leaders Are Using Data To Define The Future
Summary
Data has been one of the most substantial drivers of business and economic value for the past few decades. Bob Muglia has had a front-row seat to many of the major shifts driven by technology over his career. In his recent book "Datapreneurs" he reflects on the people and businesses that he has known and worked with and how they relied on data to deliver valuable services and drive meaningful change.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
Introducing RudderStack Profiles. RudderStack Profiles takes the SaaS guesswork and SQL grunt work out of building complete customer profiles so you can quickly ship actionable, enriched data to every downstream team. You specify the customer traits, then Profiles runs the joins and computations for you to create complete customer profiles. Get all of the details and try the new product today at dataengineeringpodcast.com/rudderstack (https://www.dataengineeringpodcast.com/rudderstack)
Your host is Tobias Macey and today I'm interviewing Bob Muglia about his recent book about the idea of "Datapreneurs" and the role of data in the modern economy
Interview
Introduction
How did you get involved in the area of data management?
Can you describe what your concept of a "Datapreneur" is?
How is this distinct from the common idea of an entrepreneur?
What do you see as the key inflection points in data technologies and their impacts on business capabilities over the past ~30 years?
In your role as the CEO of Snowflake you had a front-row seat for the rise of the "modern data stack". What do you see as the main positive and negative impacts of that paradigm?
What are the key issues that are yet to be solved in that ecosystem?
For technologists who are thinking about launching new ventures, what are the key pieces of advice that you would like to share?
What do you see as the short/medium/long-term impact of AI on the technical, business, and societal arenas?
What are the most interesting, innovative, or unexpected ways that you have seen business leaders use data to drive their vision?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on the Datapreneurs book?
What are your key predictions for the future impact of data on the technical/economic/business landscapes?
Contact Info
LinkedIn (https://www.linkedin.com/in/bob-muglia-714ba592/)
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ (https://www.pythonpodcast.com) covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast (https://www.themachinelearningpodcast.com) helps you go from idea to production with machine learning.
Visit the site (https://www.dataengineeringpodcast.com) to subscribe to the show, sign up for the mailing list, and read the show notes.
If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com (mailto:hosts@dataengineeringpodcast.com) with your story.
To help other people find the show please leave a review on Apple Podcasts (https://podcasts.apple.com/us/podcast/data-engineering-podcast/id1193040557) and tell your friends and co-workers
Links
Datapreneurs Book (https://www.thedatapreneurs.com/)
SQL Server (https://en.wikipedia.org/wiki/Microsoft_SQL_Server)
Snowflake (https://www.snowflake.com/en/)
Z80 Processor (https://en.wikipedia.org/wiki/Zilog_Z80)
Navigational Database (https://en.wikipedia.org/wiki/Navigational_database)
System R (https://en.wikipedia.org/wiki/IBM_System_R)
Redshift (https://aws.amazon.com/redshift/)
Microsoft Fabric (https://www.microsoft.com/en-us/microsoft-fabric)
Databricks (https://www.databricks.com/)
Looker (https://cloud.google.com/looker/)
Fivetran (https://www.fivetran.com/)
Podcast Episode (https://www.dataengineeringpodcast.com/fivetran-data-replication-episode-93/)
Databricks Unity Catalog (https://www.databricks.com/product/unity-catalog)
RelationalAI (https://relational.ai/)
6th Normal Form (https://en.wikipedia.org/wiki/Sixth_normal_form)
Pinecone Vector DB (https://www.pinecone.io/)
Podcast Episode (https://www.dataengineeringpodcast.com/pinecone-vector-database-similarity-search-episode-189/)
Perplexity AI (https://www.perplexity.ai/)
The intro and outro music is from The Hug (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/Love_death_and_a_drunken_monkey/04_-_The_Hug) by The Freak Fandango Orchestra (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/) / CC BY-SA (http://creativecommons.org/licenses/by-sa/3.0/)
7/17/2023 • 54 minutes, 45 seconds
Reduce Friction In Your Business Analytics Through Entity Centric Data Modeling
Summary
For business analytics the way that you model the data in your warehouse has a lasting impact on what types of questions can be answered quickly and easily. The major strategies in use today were created decades ago when the software and hardware for warehouse databases were far more constrained. In this episode Maxime Beauchemin of Airflow and Superset fame shares his vision for the entity-centric data model and how you can incorporate it into your own warehouse design.
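As a loose illustration of the idea (not a prescribed implementation from the episode), an entity-centric model collapses raw events into one wide row per entity per time grain, with metrics pre-aggregated alongside the entity; the pandas sketch below uses invented data:

```python
import pandas as pd

events = pd.DataFrame({
    "customer_id": [1, 1, 2, 2, 2],
    "event_date": pd.to_datetime(["2023-07-01"] * 5),
    "event_type": ["page_view", "purchase", "page_view", "page_view", "purchase"],
    "revenue": [0.0, 42.0, 0.0, 0.0, 17.5],
})

# One row per customer per day, with the metrics carried as columns.
customer_day = (
    events
    .groupby(["customer_id", "event_date"])
    .agg(
        page_views=("event_type", lambda s: (s == "page_view").sum()),
        purchases=("event_type", lambda s: (s == "purchase").sum()),
        revenue=("revenue", "sum"),
    )
    .reset_index()
)
print(customer_day)
```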
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
Introducing RudderStack Profiles. RudderStack Profiles takes the SaaS guesswork and SQL grunt work out of building complete customer profiles so you can quickly ship actionable, enriched data to every downstream team. You specify the customer traits, then Profiles runs the joins and computations for you to create complete customer profiles. Get all of the details and try the new product today at dataengineeringpodcast.com/rudderstack (https://www.dataengineeringpodcast.com/rudderstack)
Your host is Tobias Macey and today I'm interviewing Max Beauchemin about the concept of entity-centric data modeling for analytical use cases
Interview
Introduction
How did you get involved in the area of data management?
Can you describe what entity-centric modeling (ECM) is and the story behind it?
How does it compare to dimensional modeling strategies?
What are some of the other competing methods?
How does it compare to the activity schema approach?
What impact does this have on ML teams? (e.g. feature engineering)
What role does the tooling of a team have in the ways that they end up thinking about modeling? (e.g. dbt vs. informatica vs. ETL scripts, etc.)
What is the impact on the underlying compute engine on the modeling strategies used?
What are some examples of data sources or problem domains for which this approach is well suited?
What are some cases where entity centric modeling techniques might be counterproductive?
What are the ways that the benefits of ECM manifest in use cases that are down-stream from the warehouse?
What are some concrete tactical steps that teams should be thinking about to implement a workable domain model using entity-centric principles?
How does this work across business domains within a given organization (especially at "enterprise" scale)?
What are the most interesting, innovative, or unexpected ways that you have seen ECM used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on ECM?
When is ECM the wrong choice?
What are your predictions for the future direction/adoption of ECM or other modeling techniques?
Contact Info
mistercrunch (https://github.com/mistercrunch) on GitHub
LinkedIn (https://www.linkedin.com/in/maximebeauchemin/)
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ (https://www.pythonpodcast.com) covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast (https://www.themachinelearningpodcast.com) helps you go from idea to production with machine learning.
Visit the site (https://www.dataengineeringpodcast.com) to subscribe to the show, sign up for the mailing list, and read the show notes.
If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com (mailto:hosts@dataengineeringpodcast.com) with your story.
To help other people find the show please leave a review on Apple Podcasts (https://podcasts.apple.com/us/podcast/data-engineering-podcast/id1193040557) and tell your friends and co-workers
Links
Entity Centric Modeling Blog Post (https://preset.io/blog/introducing-entity-centric-data-modeling-for-analytics/?utm_source=pocket_saves)
Max's Previous Appearances
Defining Data Engineering with Maxime Beauchemin (https://www.dataengineeringpodcast.com/episode-3-defining-data-engineering-with-maxime-beauchemin)
Self Service Data Exploration And Dashboarding With Superset (https://www.dataengineeringpodcast.com/superset-data-exploration-episode-182)
Exploring The Evolving Role Of Data Engineers (https://www.dataengineeringpodcast.com/redefining-data-engineering-episode-249)
Alumni Of AirBnB's Early Years Reflect On What They Learned About Building Data Driven Organizations (https://www.dataengineeringpodcast.com/airbnb-alumni-data-driven-organization-episode-319)
Apache Airflow (https://airflow.apache.org/)
Apache Superset (https://superset.apache.org/)
Preset (https://preset.io/)
Ubisoft (https://www.ubisoft.com/en-us/)
Ralph Kimball (https://en.wikipedia.org/wiki/Ralph_Kimball)
The Rise Of The Data Engineer (https://www.freecodecamp.org/news/the-rise-of-the-data-engineer-91be18f1e603/)
The Downfall Of The Data Engineer (https://maximebeauchemin.medium.com/the-downfall-of-the-data-engineer-5bfb701e5d6b)
The Rise Of The Data Scientist (https://flowingdata.com/2009/06/04/rise-of-the-data-scientist/)
Dimensional Data Modeling (https://www.thoughtspot.com/data-trends/data-modeling/dimensional-data-modeling)
Star Schema (https://en.wikipedia.org/wiki/Star_schema)
Database Normalization (https://en.wikipedia.org/wiki/Database_normalization)
Feature Engineering (https://en.wikipedia.org/wiki/Feature_engineering)
DRY == Don't Repeat Yourself (https://en.wikipedia.org/wiki/Don%27t_repeat_yourself)
Activity Schema (https://www.activityschema.com/)
Podcast Episode (https://www.dataengineeringpodcast.com/narrator-exploratory-analytics-episode-234/)
Corporate Information Factory (https://amzn.to/3NK4dpB) (affiliate link)
The intro and outro music is from The Hug (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/Love_death_and_a_drunken_monkey/04_-_The_Hug) by The Freak Fandango Orchestra (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/) / CC BY-SA (http://creativecommons.org/licenses/by-sa/3.0/)
7/9/2023 • 1 hour, 12 minutes, 54 seconds
How Data Engineering Teams Power Machine Learning With Feature Platforms
Summary
Feature engineering is a crucial aspect of the machine learning workflow. To make that possible, there are a number of technical and procedural capabilities that must be in place first. In this episode Razi Raziuddin shares how data engineering teams can support the machine learning workflow through the development and support of systems that empower data scientists and ML engineers to build and maintain their own features.
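For a concrete flavor of the feature-store interfaces that come up in the conversation, here is a rough sketch of a feature definition using Feast (linked below). It assumes a recent Feast release, and the entity, fields, and file path are invented:

```python
from datetime import timedelta

from feast import Entity, FeatureView, Field, FileSource
from feast.types import Float32, Int64

# Hypothetical entity and offline source.
driver = Entity(name="driver", join_keys=["driver_id"])

driver_stats_source = FileSource(
    path="data/driver_stats.parquet",
    timestamp_field="event_timestamp",
)

# A feature view groups features, their types, and freshness for serving.
driver_hourly_stats = FeatureView(
    name="driver_hourly_stats",
    entities=[driver],
    ttl=timedelta(days=1),
    schema=[
        Field(name="conv_rate", dtype=Float32),
        Field(name="trips_today", dtype=Int64),
    ],
    source=driver_stats_source,
)
```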
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
Introducing RudderStack Profiles. RudderStack Profiles takes the SaaS guesswork and SQL grunt work out of building complete customer profiles so you can quickly ship actionable, enriched data to every downstream team. You specify the customer traits, then Profiles runs the joins and computations for you to create complete customer profiles. Get all of the details and try the new product today at dataengineeringpodcast.com/rudderstack (https://www.dataengineeringpodcast.com/rudderstack)
Your host is Tobias Macey and today I'm interviewing Razi Raziuddin about how data engineers can empower data scientists to develop and deploy better ML models through feature engineering
Interview
Introduction
How did you get involved in the area of data management?
What is feature engineering, and why (and to whom) does it matter?
A topic that commonly comes up in relation to feature engineering is the importance of a feature store. What are the tradeoffs for that to be a separate infrastructure/architecture component?
What is the overall lifecycle of a feature, from definition to deployment and maintenance?
How is this distinct from other forms of data pipeline development and delivery?
Who are the participants in that workflow?
What are the sharp edges/roadblocks that typically manifest in that lifecycle?
What are the interfaces that are needed for data scientists/ML engineers to be able to self-serve their feature management?
What is the role of the data engineer in supporting those interfaces?
What are the communication/collaboration channels that are necessary to make the overall process a success?
From an implementation/architecture perspective, what are the patterns that you have seen teams build around for feature development/serving?
What are the most interesting, innovative, or unexpected ways that you have seen feature platforms used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on feature engineering?
What are the resources that you find most helpful in understanding and designing feature platforms?
Contact Info
LinkedIn (https://www.linkedin.com/in/razi-raziuddin-7836301/)
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ (https://www.pythonpodcast.com) covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast (https://www.themachinelearningpodcast.com) helps you go from idea to production with machine learning.
Visit the site (https://www.dataengineeringpodcast.com) to subscribe to the show, sign up for the mailing list, and read the show notes.
If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com (mailto:hosts@dataengineeringpodcast.com) with your story.
To help other people find the show please leave a review on Apple Podcasts (https://podcasts.apple.com/us/podcast/data-engineering-podcast/id1193040557) and tell your friends and co-workers
Links
FeatureByte (https://featurebyte.com/)
DataRobot (https://www.datarobot.com/)
Feature Store (https://www.featurestore.org/)
Feast Feature Store (https://feast.dev/)
Feathr (https://github.com/feathr-ai/feathr)
Kaggle (https://www.kaggle.com/)
Yann LeCun (https://en.wikipedia.org/wiki/Yann_LeCun)
The intro and outro music is from The Hug (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/Love_death_and_a_drunken_monkey/04_-_The_Hug) by The Freak Fandango Orchestra (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/) / CC BY-SA (http://creativecommons.org/licenses/by-sa/3.0/)
7/3/2023 • 1 hour, 3 minutes, 29 seconds
Seamless SQL And Python Transformations For Data Engineers And Analysts With SQLMesh
Summary
Data transformation is a key activity for all of the organizational roles that interact with data. Because of its importance and outsized impact on what is possible for downstream data consumers, it is critical that everyone is able to collaborate seamlessly. SQLMesh was designed as a unifying tool that is simple to work with but powerful enough for large-scale transformations and complex projects. In this episode Toby Mao explains how it works, the importance of automatic column-level lineage tracking, and how you can start using it today.
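SQLMesh builds its column-level lineage on top of the SQLGlot parser (linked below). As a rough illustration of the underlying idea, and not of SQLMesh's actual API, this sketch parses a model query with SQLGlot and maps each output column back to the source columns it reads; the example query is made up and sqlglot is assumed to be installed.

import sqlglot
from sqlglot import exp

sql = """
SELECT
  o.customer_id,
  SUM(o.amount) AS total_spend
FROM orders AS o
GROUP BY o.customer_id
"""

parsed = sqlglot.parse_one(sql)

# Each projection in the SELECT list is one output column of the model.
for projection in parsed.selects:
    output_name = projection.alias_or_name
    # Walk the projection's subtree to find the physical columns it references.
    inputs = sorted({col.sql() for col in projection.find_all(exp.Column)})
    print(f"{output_name} <- {inputs}")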
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
RudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their extensive library of integrations enable you to automatically send data to hundreds of downstream tools. Sign up free at dataengineeringpodcast.com/rudderstack (https://www.dataengineeringpodcast.com/rudderstack)
Your host is Tobias Macey and today I'm interviewing Toby Mao about SQLMesh, an open source DataOps framework designed to scale data transformations with ease of collaboration and validation built in
Interview
Introduction
How did you get involved in the area of data management?
Can you describe what SQLMesh is and the story behind it?
DataOps is a term that has been co-opted and overloaded. What are the concepts that you are trying to convey with that term in the context of SQLMesh?
What are the rough edges in existing toolchains/workflows that you are trying to address with SQLMesh?
How do those rough edges impact the productivity and effectiveness of the teams using those toolchains?
Can you describe how SQLMesh is implemented?
How have the design and goals evolved since you first started working on it?
What are the lessons that you have learned from dbt which have informed the design and functionality of SQLMesh?
For teams who have already invested in dbt, what is the migration path from or integration with dbt?
You have some built-in integration with/awareness of orchestrators (currently Airflow). What are the benefits of making the transformation tool aware of the orchestrator?
What do you see as the potential benefits of integration with e.g. data-diff?
What are the second-order benefits of using a tool such as SQLMesh that addresses the more mechanical aspects of managing transformation workflows and the associated dependency chains?
What are the most interesting, innovative, or unexpected ways that you have seen SQLMesh used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on SQLMesh?
When is SQLMesh the wrong choice?
What do you have planned for the future of SQLMesh?
Contact Info
tobymao (https://github.com/tobymao) on GitHub
@captaintobs (https://twitter.com/captaintobs) on Twitter
Website (http://tobymao.com/)
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ (https://www.pythonpodcast.com) covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast (https://www.themachinelearningpodcast.com) helps you go from idea to production with machine learning.
Visit the site (https://www.dataengineeringpodcast.com) to subscribe to the show, sign up for the mailing list, and read the show notes.
If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com (mailto:hosts@dataengineeringpodcast.com) with your story.
To help other people find the show please leave a review on Apple Podcasts (https://podcasts.apple.com/us/podcast/data-engineering-podcast/id1193040557) and tell your friends and co-workers
Links
SQLMesh (https://github.com/TobikoData/sqlmesh)
Tobiko Data (https://tobikodata.com/)
SAS (https://www.sas.com/en_us/home.html)
AirBnB Minerva (https://medium.com/airbnb-engineering/how-airbnb-achieved-metric-consistency-at-scale-f23cc53dea70)
SQLGlot (https://github.com/tobymao/sqlglot)
Cron (https://man.freebsd.org/cgi/man.cgi?query=cron&sektion=8&n=1)
AST == Abstract Syntax Tree (https://en.wikipedia.org/wiki/Abstract_syntax_tree)
Pandas (https://pandas.pydata.org/)
Terraform (https://www.terraform.io/)
dbt (https://www.getdbt.com/)
Podcast Episode (https://www.dataengineeringpodcast.com/dbt-data-analytics-episode-81/)
SQLFluff (https://github.com/sqlfluff/sqlfluff)
Podcast.__init__ Episode (https://www.pythonpodcast.com/sqlfluff-sql-linter-episode-318/)
The intro and outro music is from The Hug (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/Love_death_and_a_drunken_monkey/04_-_The_Hug) by The Freak Fandango Orchestra (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/) / CC BY-SA (http://creativecommons.org/licenses/by-sa/3.0/)
6/25/2023 • 50 minutes, 19 seconds
How Column-Aware Development Tooling Yields Better Data Models
Summary
Architectural decisions are all based on certain constraints and a desire to optimize for different outcomes. In data systems one of the core architectural exercises is data modeling, which can have significant impacts on what is and is not possible for downstream use cases. Incorporating column-level lineage into the data modeling process encourages a more robust and well-informed design. In this episode Satish Jayanthi explores the benefits of incorporating column-aware tooling in the data modeling process.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
RudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their extensive library of integrations enable you to automatically send data to hundreds of downstream tools. Sign up free at dataengineeringpodcast.com/rudderstack (https://www.dataengineeringpodcast.com/rudderstack)
Your host is Tobias Macey and today I'm interviewing Satish Jayanthi about the practice and promise of building a column-aware data architecture through intentional modeling
Interview
Introduction
How did you get involved in the area of data management?
How has the move to the cloud for data warehousing/data platforms influenced the practice of data modeling?
There are ongoing conversations about the continued merits of dimensional modeling techniques in modern warehouses. What are the modeling practices that you have found to be most useful in large and complex data environments?
Can you describe what you mean by the term column-aware in the context of data modeling/data architecture?
What are the capabilities that need to be built into a tool for it to be effectively column-aware?
What are some of the ways that tools like dbt miss the mark in managing large/complex transformation workloads?
Column-awareness is obviously critical in the context of the warehouse. What are some of the ways that that information can be fed into other contexts? (e.g. ML, reverse ETL, etc.)
What is the importance of embedding column-level lineage awareness into the transformation tool vs. layering it on top with dedicated lineage/metadata tooling?
What are the most interesting, innovative, or unexpected ways that you have seen column-aware data modeling used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on building column-aware tooling?
When is column-aware modeling the wrong choice?
What are some additional resources that you recommend for individuals/teams who want to learn more about data modeling/column aware principles?
Contact Info
LinkedIn (https://www.linkedin.com/in/satish-jayanthi-32703613/)
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ (https://www.pythonpodcast.com) covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast (https://www.themachinelearningpodcast.com) helps you go from idea to production with machine learning.
Visit the site (https://www.dataengineeringpodcast.com) to subscribe to the show, sign up for the mailing list, and read the show notes.
If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com (mailto:hosts@dataengineeringpodcast.com) with your story.
To help other people find the show please leave a review on Apple Podcasts (https://podcasts.apple.com/us/podcast/data-engineering-podcast/id1193040557) and tell your friends and co-workers
Links
Coalesce (https://coalesce.io/)
Podcast Episode (https://www.dataengineeringpodcast.com/coalesce-enterprise-analytics-transformations-episode-278/)
Star Schema (https://en.wikipedia.org/wiki/Star_schema)
Conformed Dimensions (https://www.linkedin.com/advice/0/how-do-you-use-conformed-dimensions-ensure)
Data Vault (https://en.wikipedia.org/wiki/Data_vault_modeling)
The intro and outro music is from The Hug (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/Love_death_and_a_drunken_monkey/04_-_The_Hug) by The Freak Fandango Orchestra (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/) / CC BY-SA (http://creativecommons.org/licenses/by-sa/3.0/)
6/18/2023 • 46 minutes, 19 seconds
Build Better Tests For Your dbt Projects With Datafold And data-diff
Summary
Data engineering is all about building workflows, pipelines, systems, and interfaces to provide stable and reliable data. Your data can be stable and still be wrong, and in that case it isn't reliable. Confidence in your data is achieved through constant validation and testing. Datafold has invested a lot of time into integrating with the workflow of dbt projects to add early verification that the changes you are making are correct. In this episode Gleb Mezhanskiy shares some valuable advice and insights into how you can build reliable and well-tested data assets with dbt and data-diff.
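The core idea behind data-diff, as discussed in the episode, is to compare tables by checksumming segments of the key space and only drilling into segments whose checksums disagree, instead of moving every row across the network. This toy sketch shows that recursion over two in-memory dictionaries standing in for tables; a real run pushes the hashing into SQL on each warehouse, and the table contents here are made up.

import hashlib

def segment_checksum(table: dict, keys: list) -> str:
    # Hash the concatenated key/value pairs for one segment of the key space.
    payload = "|".join(f"{k}:{table.get(k)}" for k in keys)
    return hashlib.md5(payload.encode()).hexdigest()

def diff(table_a: dict, table_b: dict, keys: list, segment_size: int = 2) -> list:
    """Return keys whose rows differ, recursing only into mismatched segments."""
    if len(keys) <= segment_size:
        return [k for k in keys if table_a.get(k) != table_b.get(k)]
    mid = len(keys) // 2
    differing = []
    for segment in (keys[:mid], keys[mid:]):
        if segment_checksum(table_a, segment) != segment_checksum(table_b, segment):
            differing.extend(diff(table_a, table_b, segment, segment_size))
    return differing

prod = {1: "alice", 2: "bob", 3: "carol", 4: "dave"}
staging = {1: "alice", 2: "bob", 3: "carol", 4: "david"}
print(diff(prod, staging, sorted(prod)))   # -> [4]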
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
RudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their extensive library of integrations enable you to automatically send data to hundreds of downstream tools. Sign up free at dataengineeringpodcast.com/rudderstack (https://www.dataengineeringpodcast.com/rudderstack)
Your host is Tobias Macey and today I'm interviewing Gleb Mezhanskiy about how to test your dbt projects with Datafold
Interview
Introduction
How did you get involved in the area of data management?
Can you describe what Datafold is and what's new since we last spoke? (July 2021 and July 2022 about data-diff)
What are the roadblocks to data testing/validation that you see teams run into most often?
How does the tooling used contribute to/help address those roadblocks?
What are some of the error conditions/failure modes that data-diff can help identify in a dbt project?
What are some examples of tests that need to be implemented by the engineer?
In your experience working with data teams, what typically constitutes the "staging area" for a dbt project? (e.g. separate warehouse, namespaced tables, snowflake data copies, lakefs, etc.)
Given a dbt project that is well tested and has data-diff as part of the validation suite, what are the challenges that teams face in managing the feedback cycle of running those tests?
In application development there is the idea of the "testing pyramid", consisting of unit tests, integration tests, system tests, etc. What are the parallels to that in data projects?
What are the limitations of the data ecosystem that make testing a bigger challenge than it might otherwise be?
Beyond test execution, what are the other aspects of data health that need to be included in the development and deployment workflow of dbt projects? (e.g. freshness, time to delivery, etc.)
What are the most interesting, innovative, or unexpected ways that you have seen Datafold and/or data-diff used for testing dbt projects?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on dbt testing internally or with your customers?
When is Datafold/data-diff the wrong choice for dbt projects?
What do you have planned for the future of Datafold?
Contact Info
LinkedIn (https://www.linkedin.com/in/glebmezh/)
Closing Announcements
Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ (https://www.pythonpodcast.com) covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast (https://www.themachinelearningpodcast.com) helps you go from idea to production with machine learning.
Visit the site (https://www.dataengineeringpodcast.com) to subscribe to the show, sign up for the mailing list, and read the show notes.
If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com (mailto:hosts@dataengineeringpodcast.com) with your story.
To help other people find the show please leave a review on Apple Podcasts (https://podcasts.apple.com/us/podcast/data-engineering-podcast/id1193040557) and tell your friends and co-workers
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
Datafold (https://www.datafold.com/)
Podcast Episode (https://www.dataengineeringpodcast.com/datafold-proactive-data-quality-episode-205/)
data-diff (https://github.com/datafold/data-diff)
Podcast Episode (https://www.dataengineeringpodcast.com/data-diff-open-source-data-integration-validation-episode-303/)
dbt (https://www.getdbt.com/)
Dagster (https://dagster.io/)
dbt-cloud slim CI (https://docs.getdbt.com/blog/intelligent-slim-ci)
GitHub Actions (https://github.com/features/actions)
Jenkins (https://www.jenkins.io/)
Circle CI (https://circleci.com/)
Dolt (https://github.com/dolthub/dolt)
Malloy (https://github.com/malloydata/malloy)
LakeFS (https://lakefs.io/)
Planetscale (https://planetscale.com/)
Snowflake Zero Copy Cloning (https://www.youtube.com/watch?v=uGCpwoQOQzQ)
The intro and outro music is from The Hug (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/Love_death_and_a_drunken_monkey/04_-_The_Hug) by The Freak Fandango Orchestra (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/) / CC BY-SA (http://creativecommons.org/licenses/by-sa/3.0/) Special Guest: Gleb Mezhanskiy.
6/11/2023 • 48 minutes, 21 seconds
Reduce The Overhead In Your Pipelines With Agile Data Engine's DataOps Service
Summary
A significant portion of the time spent by data engineering teams is on managing the workflows and operations of their pipelines. DataOps has arisen as a set of practices, parallel to those of DevOps teams, as a means of reducing wasted effort. Agile Data Engine is a platform designed to handle the infrastructure side of the DataOps equation, as well as providing the insights that you need to manage the human side of the workflow. In this episode Tevje Olin explains how the platform is implemented, the features that it provides to reduce the amount of effort required to keep your pipelines running, and how you can start using it in your own team.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
RudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their extensive library of integrations enable you to automatically send data to hundreds of downstream tools. Sign up free at dataengineeringpodcast.com/rudderstack (https://www.dataengineeringpodcast.com/rudderstack)
Your host is Tobias Macey and today I'm interviewing Tevje Olin about Agile Data Engine, a platform that combines data modeling, transformations, continuous delivery and workload orchestration to help you manage your data products and the whole lifecycle of your warehouse
Interview
Introduction
How did you get involved in the area of data management?
Can you describe what Agile Data Engine is and the story behind it?
What are some of the tools and architectures that an organization might be able to replace with Agile Data Engine?
How does the unified experience of Agile Data Engine change the way that teams think about the lifecycle of their data?
What are some of the types of experiments that are enabled by reduced operational overhead?
What does CI/CD look like for a data warehouse?
How is it different from CI/CD for software applications?
Can you describe how Agile Data Engine is architected?
How have the design and goals of the system changed since you first started working on it?
What are the components that you needed to develop in-house to enable your platform goals?
What are the changes in the broader data ecosystem that have had the most influence on your product goals and customer adoption?
Can you describe the workflow for a team that is using Agile Data Engine to power their business analytics?
What are some of the insights that you generate to help your customers understand how to improve their processes or identify new opportunities?
Your "about" page mentions the unique approaches that you take for warehouse automation. How do your practices differ from the rest of the industry?
How have changes in the adoption/implementation of ML and AI impacted the ways that your customers exercise your platform?
What are the most interesting, innovative, or unexpected ways that you have seen the Agile Data Engine platform used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on Agile Data Engine?
When is Agile Data Engine the wrong choice?
What do you have planned for the future of Agile Data Engine?
Guest Contact Info
LinkedIn (https://www.linkedin.com/in/tevjeolin/?originalSubdomain=fi)
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
About Agile Data Engine
Agile Data Engine unlocks the potential of your data to drive business value - in a rapidly changing world.
Agile Data Engine is a DataOps Management platform for designing, deploying, operating and managing data products, and managing the whole lifecycle of a data warehouse. It combines data modeling, transformations, continuous delivery and workload orchestration into the same platform.
Links
Agile Data Engine (https://www.agiledataengine.com/agile-data-engine-x-data-engineering-podcast)
Bill Inmon (https://en.wikipedia.org/wiki/Bill_Inmon)
Ralph Kimball (https://en.wikipedia.org/wiki/Ralph_Kimball)
Snowflake (https://www.snowflake.com/en/)
Redshift (https://aws.amazon.com/redshift/)
BigQuery (https://cloud.google.com/bigquery)
Azure Synapse (https://azure.microsoft.com/en-us/products/synapse-analytics/)
Airflow (https://airflow.apache.org/)
The intro and outro music is from The Hug (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/Love_death_and_a_drunken_monkey/04_-_The_Hug) by The Freak Fandango Orchestra (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/) / CC BY-SA (http://creativecommons.org/licenses/by-sa/3.0/)
6/4/2023 • 54 minutes, 5 seconds
A Roadmap To Bootstrapping The Data Team At Your Startup
Summary
Building a data team is hard in any circumstance, but at a startup it can be even more challenging. The requirements are fluid, you probably don't have a lot of existing data talent to manage the hiring and onboarding, and there is a need to move fast. Ghalib Suleiman has been on both sides of this equation and joins the show to share his hard-won wisdom about how to start and grow a data team in the early days of company growth.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
RudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their extensive library of integrations enable you to automatically send data to hundreds of downstream tools. Sign up free at dataengineeringpodcast.com/rudderstack (https://www.dataengineeringpodcast.com/rudderstack)
Your host is Tobias Macey and today I'm interviewing Ghalib Suleiman about challenges and strategies for building data teams in a startup
Interview
Introduction
How did you get involved in the area of data management?
Can you start by sharing your conception of the responsibilities of a data team?
What are some of the common fallacies that organizations fall prey to in their first efforts at building data capabilities?
Have you found it more practical to hire outside talent to build out the first data systems, or grow that talent internally?
What are some of the resources you have found most helpful in training/educating the early creators and consumers of data assets?
When there is no internal data talent to assist with hiring, what are some of the problems that manifest in the hiring process?
What are the concepts that the new hire needs to know?
How much does the hiring manager/interviewer need to know about those concepts to evaluate skill?
What are the most critical skills for a first hire to have to start generating valuable output?
As a solo data person, what are the uphill battles that they need to be prepared for in the organization?
What are the rabbit holes that they should beware of?
What are some of the tactical
What are the most interesting, innovative, or unexpected ways that you have seen initial data hires tackle startup challenges?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on starting and growing data teams?
When is it more practical to outsource the data work?
Contact Info
LinkedIn (https://www.linkedin.com/in/ghalibs/)
@ghalib (https://twitter.com/ghalib) on Twitter
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ (https://www.pythonpodcast.com) covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast (https://www.themachinelearningpodcast.com) helps you go from idea to production with machine learning.
Visit the site (https://www.dataengineeringpodcast.com) to subscribe to the show, sign up for the mailing list, and read the show notes.
If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com (mailto:hosts@dataengineeringpodcast.com) with your story.
To help other people find the show please leave a review on Apple Podcasts (https://podcasts.apple.com/us/podcast/data-engineering-podcast/id1193040557) and tell your friends and co-workers
Links
Polytomic (https://www.polytomic.com/)
The intro and outro music is from The Hug (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/Love_death_and_a_drunken_monkey/04_-_The_Hug) by The Freak Fandango Orchestra (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/) / CC BY-SA (http://creativecommons.org/licenses/by-sa/3.0/)
5/29/2023 • 42 minutes, 31 seconds
Keep Your Data Lake Fresh With Real Time Streams Using Estuary
Summary
Batch vs. streaming is a long-running debate in the world of data integration and transformation. Proponents of the streaming paradigm argue that stream processing engines can easily handle batched workloads, but the reverse isn't true. The batch world has been the default for years because of the complexities of running a reliable streaming system at scale. In order to remove that barrier, the team at Estuary have built the Gazette and Flow systems from the ground up to resolve the pain points of other streaming engines, while providing an intuitive interface for data and application engineers to build their streaming workflows. In this episode David Yaffe and Johnny Graettinger share the story behind the business and technology and how you can start using it today to build a real-time data lake without all of the headache.
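One way to picture the work that Flow does continuously is as a reduction of an ordered change-data-capture stream into the current state of a collection. The sketch below is only an illustration of that reduction step with made-up event shapes; it is not Estuary's implementation, which adds exactly-once processing, schema enforcement, and durable journals via Gazette.

def apply_cdc(events: list) -> dict:
    """Fold an ordered stream of insert/update/delete events into key -> latest row."""
    state = {}
    for event in events:
        op, key, row = event["op"], event["key"], event.get("row")
        if op in ("insert", "update"):
            state[key] = row          # last write wins for a given key
        elif op == "delete":
            state.pop(key, None)
    return state

stream = [
    {"op": "insert", "key": 1, "row": {"email": "a@example.com", "plan": "free"}},
    {"op": "update", "key": 1, "row": {"email": "a@example.com", "plan": "pro"}},
    {"op": "insert", "key": 2, "row": {"email": "b@example.com", "plan": "free"}},
    {"op": "delete", "key": 2, "row": None},
]
print(apply_cdc(stream))   # -> {1: {'email': 'a@example.com', 'plan': 'pro'}}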
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
RudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their extensive library of integrations enable you to automatically send data to hundreds of downstream tools. Sign up free at dataengineeringpodcast.com/rudderstack (https://www.dataengineeringpodcast.com/rudderstack)
Your host is Tobias Macey and today I'm interviewing David Yaffe and Johnny Graettinger about using streaming data to build a real-time data lake and how Estuary gives you a single path to integrating and transforming your various sources
Interview
Introduction
How did you get involved in the area of data management?
Can you describe what Estuary is and the story behind it?
Stream processing technologies have been around for about a decade. How would you characterize the current state of the ecosystem?
What was missing in the ecosystem of streaming engines that motivated you to create a new one from scratch?
With the growth in tools that are focused on batch-oriented data integration and transformation, what are the reasons that an organization should still invest in streaming?
What is the comparative level of difficulty and support for these disparate paradigms?
What is the impact of continuous data flows on dags/orchestration of transforms?
What role do modern table formats have on the viability of real-time data lakes?
Can you describe the architecture of your Flow platform?
What are the core capabilities that you are optimizing for in its design?
What is involved in getting Flow/Estuary deployed and integrated with an organization's data systems?
What does the workflow look like for a team using Estuary?
How does it impact the overall system architecture for a data platform as compared to other prevalent paradigms?
How do you manage the translation of poll vs. push availability and best practices for API and other non-CDC sources?
What are the most interesting, innovative, or unexpected ways that you have seen Estuary used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on Estuary?
When is Estuary the wrong choice?
What do you have planned for the future of Estuary?
Contact Info
Dave Y (mailto:dave@estuary.dev)
Johnny G (mailto:johnny@estuary.dev)
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ (https://www.pythonpodcast.com) covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast (https://www.themachinelearningpodcast.com) helps you go from idea to production with machine learning.
Visit the site (https://www.dataengineeringpodcast.com) to subscribe to the show, sign up for the mailing list, and read the show notes.
If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com (mailto:hosts@dataengineeringpodcast.com) with your story.
To help other people find the show please leave a review on Apple Podcasts (https://podcasts.apple.com/us/podcast/data-engineering-podcast/id1193040557) and tell your friends and co-workers
Links
Estuary (https://estuary.dev)
Try Flow Free (https://dashboard.estuary.dev/register)
Gazette (https://gazette.dev)
Samza (https://samza.apache.org/)
Flink (https://flink.apache.org/)
Podcast Episode (https://www.dataengineeringpodcast.com/apache-flink-with-fabian-hueske-episode-57/)
Storm (https://storm.apache.org/)
Kafka Topic Partitioning (https://www.openlogic.com/blog/kafka-partitions)
Trino (https://trino.io/)
Avro (https://avro.apache.org/)
Parquet (https://parquet.apache.org/)
Fivetran (https://www.fivetran.com/)
Podcast Episode (https://www.dataengineeringpodcast.com/fivetran-data-replication-episode-93/)
Airbyte (https://www.dataengineeringpodcast.com/airbyte-open-source-data-integration-episode-173/)
Snowflake (https://www.snowflake.com/en/)
BigQuery (https://cloud.google.com/bigquery)
Vector Database (https://learn.microsoft.com/en-us/semantic-kernel/concepts-ai/vectordb)
CDC == Change Data Capture (https://en.wikipedia.org/wiki/Change_data_capture)
Debezium (https://debezium.io/)
Podcast Episode (https://www.dataengineeringpodcast.com/debezium-change-data-capture-episode-114/)
MapReduce (https://en.wikipedia.org/wiki/MapReduce)
Netflix DBLog (https://netflixtechblog.com/dblog-a-generic-change-data-capture-framework-69351fb9099b)
JSON-Schema (http://json-schema.org/)
The intro and outro music is from The Hug (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/Love_death_and_a_drunken_monkey/04_-_The_Hug) by The Freak Fandango Orchestra (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/) / CC BY-SA (http://creativecommons.org/licenses/by-sa/3.0/)
5/21/2023 • 55 minutes, 50 seconds
What Happens When The Abstractions Leak On Your Data
Summary
All of the advancements in our technology are based on the principle of abstraction. These abstractions are valuable until they break down, which is an inevitable occurrence. In this episode the host Tobias Macey shares his reflections on recent experiences where the abstractions leaked and some observations on how to deal with that situation in a data platform architecture.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
RudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their extensive library of integrations enable you to automatically send data to hundreds of downstream tools. Sign up free at dataengineeringpodcast.com/rudderstack (https://www.dataengineeringpodcast.com/rudderstack)
Your host is Tobias Macey and today I'm sharing some thoughts and observations about abstractions and impedance mismatches from my experience building a data lakehouse with an ELT workflow
Interview
Introduction
impact of community tech debt
hive metastore
new work being done but not widely adopted
tensions between automation and correctness
data type mapping (see the sketch after this outline)
integer types
complex types
naming things (keys/column names from APIs to databases)
disaggregated databases - pros and cons
flexibility and cost control
not as much tooling invested vs. Snowflake/BigQuery/Redshift
data modeling
dimensional modeling vs. answering today's questions
What are the most interesting, unexpected, or challenging lessons that you have learned while working on your data platform?
When is ELT the wrong choice?
What do you have planned for the future of your data platform?
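The data type mapping item in the outline above is where the abstractions tend to leak first when loading semi-structured sources into a lakehouse. Below is an illustrative, tool-agnostic sketch of the kind of mapping an ELT pipeline ends up encoding, with the usual trouble spots noted in comments; the type names are generic rather than tied to any specific engine.

# Illustrative mapping from JSON-ish source types to warehouse types.
# The comments mark the places where the abstraction leaks in practice.
SOURCE_TO_WAREHOUSE = {
    "boolean": "BOOLEAN",
    "string": "VARCHAR",
    # JSON has a single number type; choosing BIGINT silently truncates floats,
    # while choosing DOUBLE silently loses precision on very large integers.
    "number": "DOUBLE",
    "integer": "BIGINT",
    # Complex types either become semi-structured columns (VARIANT, SUPER, JSON)
    # or get flattened into child tables, and each engine spells this differently.
    "object": "JSON",
    "array": "JSON",
}

def map_schema(source_schema: dict) -> dict:
    """Translate a {column: source_type} schema, defaulting unknown types to VARCHAR."""
    return {col: SOURCE_TO_WAREHOUSE.get(t, "VARCHAR") for col, t in source_schema.items()}

print(map_schema({"id": "integer", "payload": "object", "score": "number"}))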
Contact Info
LinkedIn (https://www.linkedin.com/in/tmacey/)
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ (https://www.pythonpodcast.com) covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast (https://www.themachinelearningpodcast.com) helps you go from idea to production with machine learning.
Visit the site (https://www.dataengineeringpodcast.com) to subscribe to the show, sign up for the mailing list, and read the show notes.
If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com (mailto:hosts@dataengineeringpodcast.com) with your story.
To help other people find the show please leave a review on Apple Podcasts (https://podcasts.apple.com/us/podcast/data-engineering-podcast/id1193040557) and tell your friends and co-workers
Links
dbt (https://www.getdbt.com/)
Airbyte (https://airbyte.com/)
Podcast Episode (https://www.dataengineeringpodcast.com/airbyte-open-source-data-integration-episode-173/)
Dagster (https://dagster.io/)
Podcast Episode (https://www.dataengineeringpodcast.com/dagster-software-defined-assets-data-orchestration-episode-309/)
Trino (https://trino.io/)
Podcast Episode (https://www.dataengineeringpodcast.com/presto-distributed-sql-episode-149/)
ELT (https://en.wikipedia.org/wiki/Extract,_load,_transform)
Data Lakehouse (https://www.forbes.com/sites/bernardmarr/2022/01/18/what-is-a-data-lakehouse-a-super-simple-explanation-for-anyone/?sh=5c0e333f6088)
Snowflake (https://www.snowflake.com/en/)
BigQuery (https://cloud.google.com/bigquery)
Redshift (https://aws.amazon.com/redshift/)
Technical Debt (https://en.wikipedia.org/wiki/Technical_debt)
Hive Metastore (https://cwiki.apache.org/confluence/display/Hive/AdminManual+Metastore+Administration)
AWS Glue (https://aws.amazon.com/glue/)
The intro and outro music is from The Hug (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/Love_death_and_a_drunken_monkey/04_-_The_Hug) by The Freak Fandango Orchestra (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/) / CC BY-SA (http://creativecommons.org/licenses/by-sa/3.0/)
5/15/2023 • 26 minutes, 41 seconds
Use Consistent And Up To Date Customer Profiles To Power Your Business With Segment Unify
Summary
Every business has customers, and a critical element of success is understanding who they are and how they are using the company's products or services. The challenge is that most companies have a multitude of systems that contain fragments of the customer's interactions, and stitching that together is complex and time-consuming. Segment created the Unify product to reduce the burden of building a comprehensive view of customers and synchronizing it to all of the systems that need it. In this episode Kevin Niparko and Hanhan Wang share the details of how it is implemented and how you can use it to build and maintain rich customer profiles.
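Identity resolution, which the conversation digs into, is at heart a graph problem: identifiers observed together in an event become edges, and each connected component of identifiers becomes one profile. This minimal union-find sketch illustrates that general approach with made-up events; it is not Segment Unify's algorithm or API.

from collections import defaultdict

class UnionFind:
    def __init__(self):
        self.parent = {}

    def find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a, b):
        self.parent[self.find(a)] = self.find(b)

events = [
    {"anonymous_id": "anon-1", "email": "a@example.com"},
    {"email": "a@example.com", "user_id": "u-42"},
    {"anonymous_id": "anon-9"},
]

uf = UnionFind()
for event in events:
    ids = [f"{k}:{v}" for k, v in event.items()]
    for identifier in ids:
        uf.find(identifier)            # register even lone identifiers
    for other in ids[1:]:
        uf.union(ids[0], other)        # identifiers seen together are linked

profiles = defaultdict(set)
for identifier in list(uf.parent):
    profiles[uf.find(identifier)].add(identifier)
print(list(profiles.values()))         # two profiles: the stitched user and the anonymous visitor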
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
RudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their extensive library of integrations enable you to automatically send data to hundreds of downstream tools. Sign up free at dataengineeringpodcast.com/rudderstack (https://www.dataengineeringpodcast.com/rudderstack)
Your host is Tobias Macey and today I'm interviewing Kevin Niparko and Hanhan Wang about Segment's new Unify product for building and syncing comprehensive customer profiles across your data systems
Interview
Introduction
How did you get involved in the area of data management?
Can you describe what Segment Unify is and the story behind it?
What are the net-new capabilities that it brings to the Segment product suite?
What are some of the categories of attributes that need to be managed in a prototypical customer profile?
What are the different use cases that are enabled/simplified by the availability of a comprehensive customer profile?
What is the potential impact of more detailed customer profiles on LTV?
How do you manage permissions/auditability of updating or amending profile data?
Can you describe how the Unify product is implemented?
What are the technical challenges that you had to address while developing/launching this product?
What is the workflow for a team who is adopting the Unify product?
What are the other Segment products that need to be in use to take advantage of Unify?
What are some of the most complex edge cases to address in identity resolution?
How does reverse ETL factor into the enrichment process for profile data?
What are some of the issues that you have to account for in synchronizing profiles across platforms/products?
How do you mitigate the impact of "regression to the mean" for systems that don't support all of the attributes that you want to maintain in a profile record?
What are some of the data modeling considerations that you have had to account for to support e.g. historical changes (e.g. slowly changing dimensions)?
What are the most interesting, innovative, or unexpected ways that you have seen Segment Unify used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on Segment Unify?
When is Segment Unify the wrong choice?
What do you have planned for the future of Segment Unify?
Contact Info
Kevin
LinkedIn (https://www.linkedin.com/in/kevin-niparko-5ab86b54/)
Blog (https://n2parko.com/)
Hanhan
LinkedIn (https://www.linkedin.com/in/hansquared/)
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ (https://www.pythonpodcast.com) covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast (https://www.themachinelearningpodcast.com) helps you go from idea to production with machine learning.
Visit the site (https://www.dataengineeringpodcast.com) to subscribe to the show, sign up for the mailing list, and read the show notes.
If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com (mailto:hosts@dataengineeringpodcast.com) with your story.
To help other people find the show please leave a review on Apple Podcasts (https://podcasts.apple.com/us/podcast/data-engineering-podcast/id1193040557) and tell your friends and co-workers
Links
Segment Unify (https://segment.com/product/unify/)
Segment (https://segment.com/)
Podcast Episode (https://www.dataengineeringpodcast.com/segment-customer-analytics-episode-72/)
Customer Data Platform (CDP) (https://blog.hubspot.com/service/customer-data-platform-guide)
Golden Profile (https://www.uniserv.com/en/business-cases/customer-data-management/golden-record-golden-profile/)
Reverse ETL (https://medium.com/memory-leak/reverse-etl-a-primer-4e6694dcc7fb)
MarTech Landscape (https://chiefmartec.com/2023/05/2023-marketing-technology-landscape-supergraphic-11038-solutions-searchable-on-martechmap-com/)
The intro and outro music is from The Hug (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/Love_death_and_a_drunken_monkey/04_-_The_Hug) by The Freak Fandango Orchestra (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/) / CC BY-SA (http://creativecommons.org/licenses/by-sa/3.0/)
5/7/2023 • 54 minutes, 34 seconds
Realtime Data Applications Made Easier With Meroxa
Summary
Real-time capabilities have quickly become an expectation for consumers. The complexity of providing those capabilities is still high, however, making it more difficult for small teams to compete. Meroxa was created to enable teams of all sizes to deliver real-time data applications. In this episode DeVaris Brown discusses the types of applications that are possible when teams don't have to manage the complex infrastructure necessary to support continuous data flows.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
RudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their extensive library of integrations enable you to automatically send data to hundreds of downstream tools. Sign up free at dataengineeringpodcast.com/rudderstack (https://www.dataengineeringpodcast.com/rudderstack)
Your host is Tobias Macey and today I'm interviewing DeVaris Brown about the impact of real-time data on business opportunities and risk profiles
Interview
Introduction
How did you get involved in the area of data management?
Can you describe what Meroxa is and the story behind it?
How have the focus and goals of the platform and company evolved over the past 2 years?
Who are the target customers for Meroxa?
What problems are they trying to solve when they come to your platform?
Applications powered by real-time data were the exclusive domain of large and/or sophisticated tech companies for several years due to the inherent complexities involved. What are the shifts that have made them more accessible to a wider variety of teams?
What are some of the remaining blockers for teams who want to start using real-time data?
With the democratization of real-time data, what are the new categories of products and applications that are being unlocked?
How are organizations thinking about the potential value that those types of apps/services can provide?
With data flowing constantly, there are new challenges around oversight and accuracy. How does real-time data change the risk profile for applications that are consuming it?
What are some of the technical controls that are available for organizations that are risk-averse?
What skills do developers need to be able to effectively design, develop, and deploy real-time data applications?
How does this differ when talking about internal vs. consumer/end-user facing applications?
What are the most interesting, innovative, or unexpected ways that you have seen Meroxa used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on Meroxa?
When is Meroxa the wrong choice?
What do you have planned for the future of Meroxa?
Contact Info
LinkedIn (https://www.linkedin.com/in/devarispbrown/)
@devarispbrown (https://twitter.com/devarispbrown) on Twitter
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ (https://www.pythonpodcast.com) covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast (https://www.themachinelearningpodcast.com) helps you go from idea to production with machine learning.
Visit the site (https://www.dataengineeringpodcast.com) to subscribe to the show, sign up for the mailing list, and read the show notes.
If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com (mailto:hosts@dataengineeringpodcast.com) with your story.
To help other people find the show please leave a review on Apple Podcasts (https://podcasts.apple.com/us/podcast/data-engineering-podcast/id1193040557) and tell your friends and co-workers
Links
Meroxa (https://meroxa.com/)
Podcast Episode (https://www.dataengineeringpodcast.com/meroxa-data-integration-episode-153/)
Kafka (https://kafka.apache.org/)
Kafka Connect (https://docs.confluent.io/platform/current/connect/index.html)
Conduit (https://github.com/ConduitIO/conduit) - golang Kafka connect replacement
Pulsar (https://pulsar.apache.org/)
Redpanda (https://redpanda.com/)
Flink (https://flink.apache.org/)
Beam (https://beam.apache.org/)
Clickhouse (https://clickhouse.tech/)
Druid (https://druid.apache.org/)
Pinot (https://pinot.apache.org/)
The intro and outro music is from The Hug (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/Love_death_and_a_drunken_monkey/04_-_The_Hug) by The Freak Fandango Orchestra (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/) / CC BY-SA (http://creativecommons.org/licenses/by-sa/3.0/)
4/24/2023 • 45 minutes, 26 seconds
Building Self Serve Business Intelligence With AI And Semantic Modeling At Zenlytic
Summary
Business intelligence has been chasing the promise of self-serve data for decades. As the capabilities of these systems have improved and become more accessible, the target of what self-serve means keeps shifting. With the availability of AI powered by large language models combined with the evolution of semantic layers, the team at Zenlytic have taken aim at this problem again. In this episode Paul Blankley and Ryan Janssen explore the power of natural language driven data exploration combined with semantic modeling that enables an intuitive way for everyone in the business to access the data that they need to succeed in their work.
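The pairing explored in this conversation is an LLM constrained by a semantic layer: the model can only select from governed metrics and dimensions rather than writing arbitrary SQL. The sketch below shows what that prompt assembly and validation might look like; the SEMANTIC_LAYER contents and the ask_llm helper are hypothetical stand-ins, not Zenlytic's API.

import json

SEMANTIC_LAYER = {
    "metrics": {
        "total_revenue": "SUM(order_lines.revenue)",
        "repeat_purchase_rate": "COUNT(DISTINCT repeat_customer_id) / COUNT(DISTINCT customer_id)",
    },
    "dimensions": ["order_date", "channel", "product_category"],
}

def build_prompt(question: str) -> str:
    """Expose only the governed vocabulary and ask for a structured query plan, not raw SQL."""
    return (
        "Answer using only these metrics and dimensions:\n"
        f"{json.dumps(SEMANTIC_LAYER, indent=2)}\n"
        "Return JSON with keys 'metric', 'dimensions', and 'filters'.\n"
        f"Question: {question}"
    )

def ask_llm(prompt: str) -> dict:
    """Stand-in for a real LLM call so the sketch runs end to end."""
    return {"metric": "repeat_purchase_rate", "dimensions": ["channel"], "filters": []}

def answer(question: str) -> dict:
    plan = ask_llm(build_prompt(question))
    # Guardrail: reject anything outside the governed semantic layer.
    if plan["metric"] not in SEMANTIC_LAYER["metrics"]:
        raise ValueError("model selected an ungoverned metric")
    return plan

print(answer("How did repeat purchase rate trend by channel last quarter?"))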
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
RudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their extensive library of integrations enable you to automatically send data to hundreds of downstream tools. Sign up free at dataengineeringpodcast.com/rudderstack (https://www.dataengineeringpodcast.com/rudderstack)
Your host is Tobias Macey and today I'm interviewing Paul Blankley and Ryan Janssen about Zenlytic, a no-code business intelligence tool focused on emerging commerce brands
Interview
Introduction
How did you get involved in the area of data management?
Can you describe what Zenlytic is and the story behind it?
Business intelligence is a crowded market. What was your process for defining the problem you are focused on solving and the method to achieve that outcome?
Self-serve data exploration has been attempted in myriad ways over successive generations of BI and data platforms. What are the barriers that have been the most challenging to overcome in that effort?
What are the elements that are coming together now that give you confidence in being able to deliver on that?
Can you describe how Zenlytic is implemented?
What are the evolutions in the understanding and implementation of semantic layers that provide a sufficient substrate for operating on?
How have the recent breakthroughs in large language models (LLMs) improved your ability to build features in Zenlytic?
What is your process for adding domain semantics to the operational aspect of your LLM?
For someone using Zenlytic, what is the process for getting it set up and integrated with their data?
Once it is operational, can you describe some typical workflows for using Zenlytic in a business context?
Who are the target users?
What are the collaboration options available?
What are the most complex engineering/data challenges that you have had to address in building Zenlytic?
What are the most interesting, innovative, or unexpected ways that you have seen Zenlytic used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on Zenlytic?
When is Zenlytic the wrong choice?
What do you have planned for the future of Zenlytic?
Contact Info
Paul Blankley (LinkedIn) (https://www.linkedin.com/in/paulblankley/)
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ (https://www.pythonpodcast.com) covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast (https://www.themachinelearningpodcast.com) helps you go from idea to production with machine learning.
Visit the site (https://www.dataengineeringpodcast.com) to subscribe to the show, sign up for the mailing list, and read the show notes.
If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com (mailto:hosts@dataengineeringpodcast.com) with your story.
To help other people find the show please leave a review on Apple Podcasts (https://podcasts.apple.com/us/podcast/data-engineering-podcast/id1193040557) and tell your friends and co-workers
Links
Zenlytic (https://zenlytic.com/)
OLAP Cube (https://analyticsengineers.club/whats-an-olap-cube/)
Large Language Model (https://en.wikipedia.org/wiki/Large_language_model)
Starburst (https://www.starburst.io/)
Prompt Engineering (https://en.wikipedia.org/wiki/Prompt_engineering)
ChatGPT (https://openai.com/blog/chatgpt)
The intro and outro music is from The Hug (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/Love_death_and_a_drunken_monkey/04_-_The_Hug) by The Freak Fandango Orchestra (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/) / CC BY-SA (http://creativecommons.org/licenses/by-sa/3.0/)
4/16/2023 • 49 minutes, 19 seconds
An Exploration Of The Composable Customer Data Platform
Summary
The customer data platform is a category of services that was developed early in the evolution of the current era of cloud services for data processing. When it was difficult to wire together the event collection, data modeling, reporting, and activation, it made sense to buy monolithic products that handled every stage of the customer data lifecycle. Now that the data warehouse has taken center stage, a new approach of composable customer data platforms is emerging. In this episode Darren Haken is joined by Tejas Manohar to discuss how Autotrader UK is addressing their customer data needs by building on top of their existing data stack.
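In the composable approach discussed here, an audience is just another model in the warehouse and activation is a sync job that reads it. This conceptual sketch uses hypothetical table and function names (analytics.dim_customers, run_query, send_to_destination) to show the shape of that workflow; it is not Hightouch's API.

# Conceptual sketch of a composable-CDP activation step: the audience lives
# as SQL against the warehouse, and "activation" is a batched sync of its rows.
AUDIENCE_SQL = """
SELECT customer_id, email, lifetime_value
FROM analytics.dim_customers
WHERE lifetime_value > 500
  AND last_seen_at >= CURRENT_DATE - INTERVAL '30' DAY
"""

def sync_audience(run_query, send_to_destination, batch_size: int = 500) -> int:
    """Pull the audience from the warehouse and push it to a downstream tool in batches."""
    rows = run_query(AUDIENCE_SQL)          # e.g. a warehouse client's query method
    for start in range(0, len(rows), batch_size):
        send_to_destination(rows[start:start + batch_size])
    return len(rows)

# Example wiring with in-memory stand-ins for the warehouse and the destination:
fake_rows = [{"customer_id": i, "email": f"user{i}@example.com", "lifetime_value": 900} for i in range(3)]
print(sync_audience(lambda sql: fake_rows, lambda batch: None))   # -> 3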
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
RudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their extensive library of integrations enable you to automatically send data to hundreds of downstream tools. Sign up free at dataengineeringpodcast.com/rudderstack (https://www.dataengineeringpodcast.com/rudderstack)
Your host is Tobias Macey and today I'm interviewing Darren Haken and Tejas Manohar about building a composable CDP and how you can start adopting it incrementally
Interview
Introduction
How did you get involved in the area of data management?
Can you describe what you mean by a "composable CDP"?
What are some of the key ways that it differs from the ways that we think of a CDP today?
What are the problems that you were focused on addressing at Autotrader that are solved by a CDP?
One of the promises of the first generation CDP was an opinionated way to model your data so that non-technical teams could own this responsibility. What do you see as the risks/tradeoffs of moving CDP functionality into the same data stack as the rest of the organization?
What about companies that don't have the capacity to run a full data infrastructure?
Beyond the core technology of the data warehouse, what are the other evolutions/innovations that allow for a CDP experience to be built on top of the core data stack?
added burden on core data teams to generate event-driven data models
When iterating toward a CDP on top of the core investment of the infrastructure to feed and manage a data warehouse, what are the typical first steps?
What are some of the components in the ecosystem that help to speed up the time to adoption? (e.g. pre-built dbt packages for common transformations, etc.)
What are the most interesting, innovative, or unexpected ways that you have seen CDPs implemented?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on CDP related functionality?
When is a CDP (composable or monolithic) the wrong choice?
What do you have planned for the future of the CDP stack?
Contact Info
Darren
LinkedIn (https://www.linkedin.com/in/darrenhaken/?originalSubdomain=uk)
@DarrenHaken (https://twitter.com/darrenhaken) on Twitter
Tejas
LinkedIn (https://www.linkedin.com/in/tejasmanohar)
@tejasmanohar (https://twitter.com/tejasmanohar) on Twitter
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ (https://www.pythonpodcast.com) covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast (https://www.themachinelearningpodcast.com) helps you go from idea to production with machine learning.
Visit the site (https://www.dataengineeringpodcast.com) to subscribe to the show, sign up for the mailing list, and read the show notes.
If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com (mailto:hosts@dataengineeringpodcast.com) with your story.
To help other people find the show please leave a review on Apple Podcasts (https://podcasts.apple.com/us/podcast/data-engineering-podcast/id1193040557) and tell your friends and co-workers
Links
Autotrader (https://www.autotrader.co.uk/)
Hightouch (https://hightouch.com/)
Customer Studio (https://hightouch.com/platform/customer-studio)
CDP == Customer Data Platform (https://blog.hubspot.com/service/customer-data-platform-guide)
Segment (https://segment.com/)
Podcast Episode (https://www.dataengineeringpodcast.com/segment-customer-analytics-episode-72/)
mParticle (https://www.mparticle.com/)
Salesforce (https://www.salesforce.com/)
Amplitude (https://amplitude.com/)
Snowplow (https://snowplow.io/)
Podcast Episode (https://www.dataengineeringpodcast.com/snowplow-with-alexander-dean-episode-48/)
Reverse ETL (https://medium.com/memory-leak/reverse-etl-a-primer-4e6694dcc7fb)
dbt (https://www.getdbt.com/)
Podcast Episode (https://www.dataengineeringpodcast.com/dbt-data-analytics-episode-81/)
Snowflake (https://www.snowflake.com/en/)
Podcast Episode (https://www.dataengineeringpodcast.com/snowflakedb-cloud-data-warehouse-episode-110/)
BigQuery (https://cloud.google.com/bigquery)
Databricks (https://www.databricks.com/)
ELT (https://en.wikipedia.org/wiki/Extract,_load,_transform)
Fivetran (https://www.fivetran.com/)
Podcast Episode (https://www.dataengineeringpodcast.com/fivetran-data-replication-episode-93/)
DataHub (https://datahubproject.io/)
Podcast Episode (https://www.dataengineeringpodcast.com/acryl-data-datahub-metadata-graph-episode-230/)
Amundsen (https://www.amundsen.io/)
Podcast Episode (https://www.dataengineeringpodcast.com/amundsen-data-discovery-episode-92/)
The intro and outro music is from The Hug (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/Love_death_and_a_drunken_monkey/04_-_The_Hug) by The Freak Fandango Orchestra (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/) / CC BY-SA (http://creativecommons.org/licenses/by-sa/3.0/)
4/10/2023 • 1 hour, 11 minutes, 42 seconds
Mapping The Data Infrastructure Landscape As A Venture Capitalist
Summary
The data ecosystem has been building momentum for several years now. As a venture capital investor, Matt Turck has been trying to keep track of the main trends and has compiled his findings into the MAD (ML, AI, and Data) landscape reports each year. In this episode he shares his experiences building those reports and the perspective he has gained from the exercise.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
Businesses that adapt well to change grow 3 times faster than the industry average. As your business adapts, so should your data. RudderStack Transformations lets you customize your event data in real-time with your own JavaScript or Python code. Join The RudderStack Transformation Challenge today for a chance to win a $1,000 cash prize just by submitting a Transformation to the open-source RudderStack Transformation library. Visit dataengineeringpodcast.com/rudderstack (https://www.dataengineeringpodcast.com/rudderstack) today to learn more
Your host is Tobias Macey and today I'm interviewing Matt Turck about his annual report on the Machine Learning, AI, & Data landscape and the insights around data infrastructure that he has gained in the process
Interview
Introduction
How did you get involved in the area of data management?
Can you describe what the MAD landscape report is and the story behind it?
At a high level, what is your goal in the compilation and maintenance of your landscape document?
What are your guidelines for what to include in the landscape?
As the data landscape matures, how have you seen that influence the types of projects/companies that are founded?
What are the product categories that were only viable when capital was plentiful and easy to obtain?
What are the product categories that you think will be swallowed by adjacent concerns, and which are likely to consolidate to remain competitive?
The rapid growth and proliferation of data tools helped establish the "Modern Data Stack" as a de-facto architectural paradigm. As we move into this phase of contraction, what are your predictions for how the "Modern Data Stack" will evolve?
Is there a different architectural paradigm that you see as growing to take its place?
How have your presentation and the types of information that you collate in the MAD landscape evolved since you first started it?
What are the most interesting, innovative, or unexpected product and positioning approaches that you have seen while tracking data infrastructure as a VC and maintainer of the MAD landscape?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on the MAD landscape over the years?
What do you have planned for future iterations of the MAD landscape?
Contact Info
Website (https://mattturck.com/)
@mattturck (https://twitter.com/mattturck) on Twitter
MAD Landscape Comments Email (mailto:mad2023@firstmarkcap.com)
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ (https://www.pythonpodcast.com) covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast (https://www.themachinelearningpodcast.com) helps you go from idea to production with machine learning.
Visit the site (https://www.dataengineeringpodcast.com) to subscribe to the show, sign up for the mailing list, and read the show notes.
If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com (mailto:hosts@dataengineeringpodcast.com) with your story.
To help other people find the show please leave a review on Apple Podcasts (https://podcasts.apple.com/us/podcast/data-engineering-podcast/id1193040557) and tell your friends and co-workers
Links
MAD Landscape (https://mad.firstmarkcap.com)
First Mark Capital (https://firstmark.com/)
Bayesian Learning (https://en.wikipedia.org/wiki/Bayesian_inference)
AI Winter (https://en.wikipedia.org/wiki/AI_winter)
Databricks (https://www.databricks.com/)
Cloud Native Landscape (https://landscape.cncf.io/)
LUMA Scape (https://lumapartners.com/lumascapes/)
Hadoop Ecosystem (https://www.analyticsvidhya.com/blog/2020/10/introduction-hadoop-ecosystem/)
Modern Data Stack (https://www.fivetran.com/blog/what-is-the-modern-data-stack)
Reverse ETL (https://medium.com/memory-leak/reverse-etl-a-primer-4e6694dcc7fb)
Generative AI (https://generativeai.net/)
dbt (https://www.getdbt.com/)
Transform (https://transform.co/)
Podcast Episode (https://www.dataengineeringpodcast.com/transform-co-metrics-layer-episode-206/)
Snowflake IPO (https://www.cnn.com/2020/09/16/investing/snowflake-ipo/index.html)
Dataiku (https://www.dataiku.com/)
Iceberg (https://iceberg.apache.org/)
Podcast Episode (https://www.dataengineeringpodcast.com/tabular-iceberg-lakehouse-tables-episode-363)
Hudi (https://hudi.apache.org/)
Podcast Episode (https://www.dataengineeringpodcast.com/hudi-streaming-data-lake-episode-209/)
DuckDB (https://duckdb.org/)
Podcast Episode (https://www.dataengineeringpodcast.com/duckdb-in-process-olap-database-episode-270/)
Trino (https://trino.io/)
Y42 (https://www.y42.com/)
Podcast Episode (https://www.dataengineeringpodcast.com/y42-full-stack-data-platform-episode-295)
Mozart Data (https://www.mozartdata.com/)
Podcast Episode (https://www.dataengineeringpodcast.com/mozart-data-modern-data-stack-episode-242/)
Keboola (https://www.keboola.com/)
MPP Database (https://www.techtarget.com/searchdatamanagement/definition/MPP-database-massively-parallel-processing-database)
The intro and outro music is from The Hug (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/Love_death_and_a_drunken_monkey/04_-_The_Hug) by The Freak Fandango Orchestra (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/) / CC BY-SA (http://creativecommons.org/licenses/by-sa/3.0/)
4/3/2023 • 1 hour, 1 minute, 57 seconds
Unlocking The Potential Of Streaming Data Applications Without The Operational Headache At Grainite
Summary
The promise of streaming data is that it allows you to react to new information as it happens, rather than introducing latency by batching records together. The peril is that building a robust and scalable streaming architecture is always more complicated and error-prone than you think it's going to be. After experiencing this unfortunate reality for themselves, Abhishek Chauhan and Ashish Kumar founded Grainite so that you don't have to suffer the same pain. In this episode they explain why streaming architectures are so challenging, how they have designed Grainite to be robust and scalable, and how you can start using it today to build your streaming data applications without all of the operational headache.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
Businesses that adapt well to change grow 3 times faster than the industry average. As your business adapts, so should your data. RudderStack Transformations lets you customize your event data in real-time with your own JavaScript or Python code. Join The RudderStack Transformation Challenge today for a chance to win a $1,000 cash prize just by submitting a Transformation to the open-source RudderStack Transformation library. Visit dataengineeringpodcast.com/rudderstack (https://www.dataengineeringpodcast.com/rudderstack) today to learn more
Hey there podcast listener, are you tired of dealing with the headache that is the 'Modern Data Stack'? We feel your pain. It's supposed to make building smarter, faster, and more flexible data infrastructures a breeze. It ends up being anything but that. Setting it up, integrating it, maintaining it—it’s all kind of a nightmare. And let's not even get started on all the extra tools you have to buy to get it to do its thing. But don't worry, there is a better way. TimeXtender takes a holistic approach to data integration that focuses on agility rather than fragmentation. By bringing all the layers of the data stack together, TimeXtender helps you build data solutions up to 10 times faster and saves you 70-80% on costs. If you're fed up with the 'Modern Data Stack', give TimeXtender a try. Head over to dataengineeringpodcast.com/timextender (https://www.dataengineeringpodcast.com/timextender) where you can do two things: watch us build a data estate in 15 minutes and start for free today.
Join us at the event for the global data community, Data Council Austin. From March 28-30th 2023, they'll play host to hundreds of attendees, 100 top speakers, and dozens of startups that are advancing data science, engineering and AI. Data Council attendees are amazing founders, data scientists, lead engineers, CTOs, heads of data, investors and community organizers who are all working together to build the future of data. As a listener to the Data Engineering Podcast you can get a special discount of 20% off your ticket by using the promo code dataengpod20. Don't miss out on their only event this year! Visit dataengineeringpodcast.com/data-council (https://www.dataengineeringpodcast.com/data-council) today
Your host is Tobias Macey and today I'm interviewing Ashish Kumar and Abhishek Chauhan about Grainite, a platform designed to give you a single place to build streaming data applications
Interview
Introduction
How did you get involved in the area of data management?
Can you describe what Grainite is and the story behind it?
What are the personas that you are focused on addressing with Grainite?
What are some of the most complex aspects of building streaming data applications in the absence of something like Grainite?
How does Grainite work to reduce that complexity?
What are some of the commonalities that you see in the teams/organizations that find their way to Grainite?
What are some of the higher-order projects that teams are able to build when they are using Grainite as a starting point vs. where they would be spending effort on a fully managed streaming architecture?
Can you describe how Grainite is architected?
How have the design and goals of the platform changed/evolved since you first started working on it?
What does your internal build vs. buy process look like for identifying where to spend your engineering resources?
What is the process for getting Grainite set up and integrated into an organization's technical environment?
What is your process for determining which elements of the platform to expose as end-user features and customization options vs. keeping internal to the operational aspects of the product?
Once Grainite is running, can you describe the day 0 workflow of building an application or data flow?
What are the day 2 - N capabilities that Grainite offers for ongoing maintenance/operation/evolution of those applications?
What are the most interesting, innovative, or unexpected ways that you have seen Grainite used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on Grainite?
When is Grainite the wrong choice?
What do you have planned for the future of Grainite?
Contact Info
Ashish
LinkedIn (https://www.linkedin.com/in/ashishkumarprofile/)
Abhishek
LinkedIn (https://www.linkedin.com/in/abhishekchauhan/)
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ (https://www.pythonpodcast.com) covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast (https://www.themachinelearningpodcast.com) helps you go from idea to production with machine learning.
Visit the site (https://www.dataengineeringpodcast.com) to subscribe to the show, sign up for the mailing list, and read the show notes.
If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com (mailto:hosts@dataengineeringpodcast.com) with your story.
To help other people find the show please leave a review on Apple Podcasts (https://podcasts.apple.com/us/podcast/data-engineering-podcast/id1193040557) and tell your friends and co-workers
Links
Grainite (https://www.grainite.com/)
Blog about the challenges of streaming architectures (https://www.grainite.com/blog/there-was-an-old-lady-who-swallowed-a-fly)
Getting Started Docs (https://gitbook.grainite.com/developers/getting-started)
BigTable (https://research.google/pubs/pub27898/)
Spanner (https://research.google/pubs/pub39966/)
Firestore (https://cloud.google.com/firestore)
OpenCensus (https://opencensus.io/)
Citrix (https://www.citrix.com/)
NetScaler (https://www.citrix.com/blogs/2022/10/03/netscaler-is-back/)
J2EE (https://www.oracle.com/java/technologies/appmodel.html)
RocksDB (https://rocksdb.org/)
Pulsar (https://pulsar.apache.org/)
SQL Server (https://en.wikipedia.org/wiki/Microsoft_SQL_Server)
MySQL (https://www.mysql.com/)
RAFT Protocol (https://raft.github.io/)
The intro and outro music is from The Hug (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/Love_death_and_a_drunken_monkey/04_-_The_Hug) by The Freak Fandango Orchestra (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/) / CC BY-SA (http://creativecommons.org/licenses/by-sa/3.0/)
3/25/2023 • 1 hour, 13 minutes, 33 seconds
Aligning Data Security With Business Productivity To Deploy Analytics Safely And At Speed
Summary
As with all aspects of technology, security is a critical element of data applications, and the different controls can be at cross purposes with productivity. In this episode Yoav Cohen from Satori shares his experiences as a practitioner in the space of data security and how to align with the needs of engineers and business users. He also explains why data security is distinct from application security and some methods for reducing the challenge of working across different data systems.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
Join us at the event for the global data community, Data Council Austin. From March 28-30th 2023, they'll play host to hundreds of attendees, 100 top speakers, and dozens of startups that are advancing data science, engineering and AI. Data Council attendees are amazing founders, data scientists, lead engineers, CTOs, heads of data, investors and community organizers who are all working together to build the future of data. As a listener to the Data Engineering Podcast you can get a special discount of 20% off your ticket by using the promo code dataengpod20. Don't miss out on their only event this year! Visit dataengineeringpodcast.com/data-council (https://www.dataengineeringpodcast.com/data-council) today
RudderStack makes it easy for data teams to build a customer data platform on their own warehouse. Use their state of the art pipelines to collect all of your data, build a complete view of your customer and sync it to every downstream tool. Sign up for free at dataengineeringpodcast.com/rudder (https://www.dataengineeringpodcast.com/rudder)
Hey there podcast listener, are you tired of dealing with the headache that is the 'Modern Data Stack'? We feel your pain. It's supposed to make building smarter, faster, and more flexible data infrastructures a breeze. It ends up being anything but that. Setting it up, integrating it, maintaining it—it’s all kind of a nightmare. And let's not even get started on all the extra tools you have to buy to get it to do its thing. But don't worry, there is a better way. TimeXtender takes a holistic approach to data integration that focuses on agility rather than fragmentation. By bringing all the layers of the data stack together, TimeXtender helps you build data solutions up to 10 times faster and saves you 70-80% on costs. If you're fed up with the 'Modern Data Stack', give TimeXtender a try. Head over to dataengineeringpodcast.com/timextender (https://www.dataengineeringpodcast.com/timextender) where you can do two things: watch us build a data estate in 15 minutes and start for free today.
Your host is Tobias Macey and today I'm interviewing Yoav Cohen about the challenges that data teams face in securing their data platforms and how that impacts the productivity and adoption of data in the organization
Interview
Introduction
How did you get involved in the area of data management?
Data security is a very broad term. Can you start by enumerating some of the different concerns that are involved?
How has the scope and complexity of implementing security controls on data systems changed in recent years?
In your experience, what is a typical number of data locations that an organization is trying to manage access/permissions within?
What are some of the main challenges that data/compliance teams face in establishing and maintaining security controls?
How much of the problem is technical vs. procedural/organizational?
As a vendor in the space, how do you think about the broad categories/boundary lines for the different elements of data security? (e.g. masking vs. RBAC, etc.)
What are the different layers that are best suited to managing each of those categories? (e.g. masking and encryption in storage layer, RBAC in warehouse, etc.)
What are some of the ways that data security and organizational productivity are at odds with each other?
What are some of the shortcuts that you see teams and individuals taking to address the productivity hit from security controls?
What are some of the methods that you have found to be most effective at mitigating or even improving productivity impacts through security controls?
How does up-front design of the security layers improve the final outcome vs. trying to bolt on security after the platform is already in use?
How can education about the motivations for different security practices improve compliance and user experience?
What are the most interesting, innovative, or unexpected ways that you have seen data teams align data security and productivity?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on data security technology?
What are the areas of data security that still need improvements?
Contact Info
Yoav Cohen (https://www.linkedin.com/in/yoav-cohen-7a4ba23/)
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ (https://www.pythonpodcast.com) covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast (https://www.themachinelearningpodcast.com) helps you go from idea to production with machine learning.
Visit the site (https://www.dataengineeringpodcast.com) to subscribe to the show, sign up for the mailing list, and read the show notes.
If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com (mailto:hosts@dataengineeringpodcast.com) with your story.
To help other people find the show please leave a review on Apple Podcasts (https://podcasts.apple.com/us/podcast/data-engineering-podcast/id1193040557) and tell your friends and co-workers
Links
Satori (https://satoricyber.com)
Podcast Episode (https://www.dataengineeringpodcast.com/satori-cloud-data-governance-episode-165)
Data Masking (https://en.wikipedia.org/wiki/Data_masking)
RBAC == Role Based Access Control (https://en.wikipedia.org/wiki/Role-based_access_control)
ABAC == Attribute Based Access Control (https://en.wikipedia.org/wiki/Attribute-based_access_control)
Gartner Data Security Platform Report (https://www.gartner.com/en/documents/4006252)
The intro and outro music is from The Hug (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/Love_death_and_a_drunken_monkey/04_-_The_Hug) by The Freak Fandango Orchestra (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/) / CC BY-SA (http://creativecommons.org/licenses/by-sa/3.0/)
3/19/2023 • 51 minutes, 38 seconds
Use Your Data Warehouse To Power Your Product Analytics With NetSpring
Summary
With the rise of the web and digital business came the need to understand how customers are interacting with the products and services that are being sold. Product analytics has grown into its own category and brought with it several services with generational differences in how they approach the problem. NetSpring is a warehouse-native product analytics service that allows you to gain powerful insights into your customers and their needs by combining your event streams with the rest of your business data. In this episode Priyendra Deshwal explains how NetSpring is designed to empower your product and data teams to build and explore insights around your products in a streamlined and maintainable workflow.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
Join us at the event for the global data community, Data Council Austin. From March 28-30th 2023, they'll play host to hundreds of attendees, 100 top speakers, and dozens of startups that are advancing data science, engineering and AI. Data Council attendees are amazing founders, data scientists, lead engineers, CTOs, heads of data, investors and community organizers who are all working together to build the future of data. As a listener to the Data Engineering Podcast you can get a special discount of 20% off your ticket by using the promo code dataengpod20. Don't miss out on their only event this year! Visit dataengineeringpodcast.com/data-council (https://www.dataengineeringpodcast.com/data-council) today!
RudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their extensive library of integrations enable you to automatically send data to hundreds of downstream tools. Sign up free at dataengineeringpodcast.com/rudder (https://www.dataengineeringpodcast.com/rudder)
Your host is Tobias Macey and today I'm interviewing Priyendra Deshwal about how NetSpring is using the data warehouse to deliver a more flexible and detailed view of your product analytics
Interview
Introduction
How did you get involved in the area of data management?
Can you describe what NetSpring is and the story behind it?
What are the activities that constitute "product analytics" and what are the roles/teams involved in those activities?
When teams first come to you, what are the common challenges that they are facing and what are the solutions that they have attempted to employ?
Can you describe some of the challenges involved in bringing product analytics into enterprise or highly regulated environments/industries?
How does a warehouse-native approach simplify that effort?
There are many different players (both commercial and open source) in the product analytics space. Can you share your view on the role that NetSpring plays in that ecosystem?
How is the NetSpring platform implemented to best take advantage of modern warehouse technologies and the associated data stacks?
What are the prerequisites for an organization's infrastructure/data maturity to be able to benefit from NetSpring?
How have the goals and implementation of the NetSpring platform evolved from when you first started working on it?
Can you describe the steps involved in integrating NetSpring with an organization's existing warehouse?
What are the signals that NetSpring uses to understand the customer journeys of different organizations?
How do you manage the variance of the data models in the warehouse while providing a consistent experience for your users?
Given that you are a product organization, how are you using NetSpring to power NetSpring?
What are the most interesting, innovative, or unexpected ways that you have seen NetSpring used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on NetSpring?
When is NetSpring the wrong choice?
What do you have planned for the future of NetSpring?
Contact Info
LinkedIn (https://www.linkedin.com/in/priyendra-deshwal/)
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ (https://www.pythonpodcast.com) covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast (https://www.themachinelearningpodcast.com) helps you go from idea to production with machine learning.
Visit the site (https://www.dataengineeringpodcast.com) to subscribe to the show, sign up for the mailing list, and read the show notes.
If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com (mailto:hosts@dataengineeringpodcast.com) with your story.
To help other people find the show please leave a review on Apple Podcasts (https://podcasts.apple.com/us/podcast/data-engineering-podcast/id1193040557) and tell your friends and co-workers
Links
NetSpring (https://www.netspring.io/)
ThoughtSpot (https://www.thoughtspot.com/)
Product Analytics (https://theproductmanager.com/topics/product-analytics-guide/)
Amplitude (https://amplitude.com/)
Mixpanel (https://mixpanel.com/)
Customer Data Platform (https://blog.hubspot.com/service/customer-data-platform-guide)
GDPR (https://en.wikipedia.org/wiki/General_Data_Protection_Regulation)
CCPA (https://en.wikipedia.org/wiki/California_Consumer_Privacy_Act)
Segment (https://segment.com/)
Podcast Episode (https://www.dataengineeringpodcast.com/segment-customer-analytics-episode-72/)
Rudderstack (https://www.rudderstack.com/)
Podcast Episode (https://www.dataengineeringpodcast.com/rudderstack-open-source-customer-data-platform-episode-263/)
The intro and outro music is from The Hug (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/Love_death_and_a_drunken_monkey/04_-_The_Hug) by The Freak Fandango Orchestra (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/) / CC BY-SA (http://creativecommons.org/licenses/by-sa/3.0/)
3/10/2023 • 49 minutes, 21 seconds
Exploring The Nuances Of Building An Intentional Data Culture
Summary
The ecosystem for data professionals has matured to the point that there are a large and growing number of distinct roles. With the scope and importance of data steadily increasing, it is critical for organizations to ensure that everyone is aligned and operating in a positive environment. To help facilitate the nascent conversation about what constitutes an effective and productive data culture, the team at Data Council have dedicated an entire conference track to the subject. In this episode Pete Soderling and Maggie Hays join the show to explore this topic and their experience preparing for the upcoming conference.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
Hey there podcast listener, are you tired of dealing with the headache that is the 'Modern Data Stack'? We feel your pain. It's supposed to make building smarter, faster, and more flexible data infrastructures a breeze. It ends up being anything but that. Setting it up, integrating it, maintaining it—it’s all kind of a nightmare. And let's not even get started on all the extra tools you have to buy to get it to do its thing. But don't worry, there is a better way. TimeXtender takes a holistic approach to data integration that focuses on agility rather than fragmentation. By bringing all the layers of the data stack together, TimeXtender helps you build data solutions up to 10 times faster and saves you 70-80% on costs. If you're fed up with the 'Modern Data Stack', give TimeXtender a try. Head over to dataengineeringpodcast.com/timextender (https://www.dataengineeringpodcast.com/timextender) where you can do two things: watch us build a data estate in 15 minutes and start for free today.
Your host is Tobias Macey and today I'm interviewing Pete Soderling and Maggie Hays about the growing importance of establishing and investing in an organization's data culture and their experience forming an entire conference track around this topic
Interview
Introduction
How did you get involved in the area of data management?
Can you describe what your working definition of "Data Culture" is?
In what ways is a data culture distinct from an organization's corporate culture? How are they interdependent?
What are the elements that are most impactful in forming the data culture of an organization?
What are some of the motivations that teams/companies might have in fighting against the creation and support of an explicit data culture?
Are there any strategies that you have found helpful in counteracting those tendencies?
In terms of the conference, what are the factors that you consider when deciding how to group the different presentations into tracks or themes?
What are the experiences that you have had personally and in community interactions that led you to elevate data culture to be its own track?
What are the broad challenges that practitioners are facing as they develop their own understanding of what constitutes a healthy and productive data culture?
What are some of the risks that you considered when forming this track and evaluating proposals?
What are your criteria for determining whether this track is successful?
What are the most interesting, innovative, or unexpected aspects of data culture that you have encountered through developing this track?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on selecting presentations for this year's event?
What do you have planned for the future of this topic at Data Council events?
Contact Info
Pete
@petesoder (https://twitter.com/petesoder) on Twitter
LinkedIn (https://www.linkedin.com/in/petesoder)
Maggie
LinkedIn (https://www.linkedin.com/in/maggie-hays)
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ (https://www.pythonpodcast.com) covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast (https://www.themachinelearningpodcast.com) helps you go from idea to production with machine learning.
Visit the site (https://www.dataengineeringpodcast.com) to subscribe to the show, sign up for the mailing list, and read the show notes.
If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com (mailto:hosts@dataengineeringpodcast.com) with your story.
To help other people find the show please leave a review on Apple Podcasts (https://podcasts.apple.com/us/podcast/data-engineering-podcast/id1193040557) and tell your friends and co-workers
Links
Data Council (https://datacouncil.ai/austin)
Podcast Episode (https://www.dataengineeringpodcast.com/data-council-data-professional-community-episode-96)
Data Community Fund (https://www.datacommunity.fund)
DataHub (https://datahubproject.io/)
Podcast Episode (https://www.dataengineeringpodcast.com/acryl-data-datahub-metadata-graph-episode-230/)
Database Design For Mere Mortals (https://amzn.to/3ZFV6dU) by Michael J. Hernandez (affiliate link)
SOAP (https://en.wikipedia.org/wiki/SOAP)
REST (https://en.wikipedia.org/wiki/Representational_state_transfer)
Econometrics (https://en.wikipedia.org/wiki/Econometrics)
DBA == Database Administrator (https://www.careerexplorer.com/careers/database-administrator/)
Conway's Law (https://en.wikipedia.org/wiki/Conway%27s_law)
dbt (https://www.getdbt.com/)
Podcast Episode (https://www.dataengineeringpodcast.com/dbt-data-analytics-episode-81/)
The intro and outro music is from The Hug (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/Love_death_and_a_drunken_monkey/04_-_The_Hug) by The Freak Fandango Orchestra (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/) / CC BY-SA (http://creativecommons.org/licenses/by-sa/3.0/)
3/6/2023 • 45 minutes, 44 seconds
Building A Data Mesh Platform At PayPal
Summary
There has been a lot of discussion about the practical application of data mesh and how to implement it in an organization. Jean-Georges Perrin was tasked with designing a new data platform implementation at PayPal and wound up building a data mesh. In this episode he shares that journey and the combination of technical and organizational challenges that he encountered in the process.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
Are you tired of dealing with the headache that is the 'Modern Data Stack'? We feel your pain. It's supposed to make building smarter, faster, and more flexible data infrastructures a breeze. It ends up being anything but that. Setting it up, integrating it, maintaining it—it’s all kind of a nightmare. And let's not even get started on all the extra tools you have to buy to get it to do its thing. But don't worry, there is a better way. TimeXtender takes a holistic approach to data integration that focuses on agility rather than fragmentation. By bringing all the layers of the data stack together, TimeXtender helps you build data solutions up to 10 times faster and saves you 70-80% on costs. If you're fed up with the 'Modern Data Stack', give TimeXtender a try. Head over to dataengineeringpodcast.com/timextender (https://www.dataengineeringpodcast.com/timextender) where you can do two things: watch us build a data estate in 15 minutes and start for free today.
Your host is Tobias Macey and today I'm interviewing Jean-Georges Perrin about his work at PayPal to implement a data mesh and the role of data contracts in making it work
Interview
Introduction
How did you get involved in the area of data management?
Can you start by describing the goals and scope of your work at PayPal to implement a data mesh?
What are the core problems that you were addressing with this project?
Is a data mesh ever "done"?
What was your experience engaging at the organizational level to identify the granularity and ownership of the data products that were needed in the initial iteration?
What was the impact of leading multiple teams on the design of how to implement communication/contracts throughout the mesh?
What are the technical systems that you are relying on to power the different data domains?
What is your philosophy on enforcing uniformity in technical systems vs. relying on interface definitions as the unit of consistency?
What are the biggest challenges (technical and procedural) that you have encountered during your implementation?
How are you managing visibility/auditability across the different data domains? (e.g. observability, data quality, etc.)
What are the most interesting, innovative, or unexpected ways that you have seen PayPal's data mesh used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on data mesh?
When is a data mesh the wrong choice?
What do you have planned for the future of your data mesh at PayPal?
Contact Info
LinkedIn (https://www.linkedin.com/in/jgperrin/)
Blog (https://jgp.ai/)
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ (https://www.pythonpodcast.com) covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast (https://www.themachinelearningpodcast.com) helps you go from idea to production with machine learning.
Visit the site (https://www.dataengineeringpodcast.com) to subscribe to the show, sign up for the mailing list, and read the show notes.
If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com (mailto:hosts@dataengineeringpodcast.com) with your story.
To help other people find the show please leave a review on Apple Podcasts (https://podcasts.apple.com/us/podcast/data-engineering-podcast/id1193040557) and tell your friends and co-workers
Links
Data Mesh (https://www.thoughtworks.com/en-us/what-we-do/data-and-ai/data-mesh)
O'Reilly Book (https://amzn.to/3Z5nC8T) (affiliate link)
The next generation of Data Platforms is the Data Mesh (https://medium.com/paypal-tech/the-next-generation-of-data-platforms-is-the-data-mesh-b7df4b825522)
PayPal (https://about.pypl.com/about-us/default.aspx)
Conway's Law (https://en.wikipedia.org/wiki/Conway%27s_law)
Data Mesh For All Ages - US (https://amzn.to/3YzVRop), Data Mesh For All Ages - UK (https://amzn.to/3YzVRop)
Data Mesh Radio (https://daappod.com/data-mesh-radio/)
Data Mesh Community (https://datameshlearning.com/)
Data Mesh In Action (http://jgp.ai/dmia)
Great Expectations (https://greatexpectations.io/)
Podcast Episode (https://www.dataengineeringpodcast.com/great-expectations-technical-debt-data-pipeline-episode-117/)
The intro and outro music is from The Hug (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/Love_death_and_a_drunken_monkey/04_-_The_Hug) by The Freak Fandango Orchestra (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/) / CC BY-SA (http://creativecommons.org/licenses/by-sa/3.0/)
2/27/2023 • 46 minutes, 54 seconds
The View Below The Waterline Of Apache Iceberg And How It Fits In Your Data Lakehouse
Summary
Cloud data warehouses have unlocked a massive amount of innovation and investment in data applications, but they are still inherently limiting. Because they take complete ownership of your data, they constrain what data you can store and how it can be used. Projects like Apache Iceberg offer a viable alternative in the form of data lakehouses that combine the scalability and flexibility of data lakes with the ease of use and performance of data warehouses. Ryan Blue helped create the Iceberg project, and in this episode he rejoins the show to discuss how it has evolved and what he is doing in his new business Tabular to make it even easier to implement and maintain.
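As a rough illustration of what working with Iceberg tables can look like from Python, the sketch below uses the pyiceberg library to load a table from a catalog and scan a filtered subset of it. The catalog name, namespace, table, and column names are hypothetical placeholders, and the catalog connection details are assumed to be configured elsewhere.

```python
from pyiceberg.catalog import load_catalog

# Load a catalog whose connection details are configured separately
# (e.g. in ~/.pyiceberg.yaml); "default" is a placeholder name.
catalog = load_catalog("default")

# "analytics.web_events" is a hypothetical namespace.table identifier.
table = catalog.load_table("analytics.web_events")

# Iceberg tracks table state as immutable snapshots, so this scan reads a
# consistent view even while writers are committing new data files.
scan = table.scan(
    row_filter="event_date >= '2023-01-01'",
    selected_fields=("user_id", "event_type", "event_date"),
)

# Materialize the matching rows as an Apache Arrow table for local analysis.
arrow_table = scan.to_arrow()
print(arrow_table.num_rows)
```

The same table can be read or written from Spark, Trino, Flink, and other engines, which is the interoperability across implementations discussed in the questions below.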
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
Hey there podcast listener, are you tired of dealing with the headache that is the 'Modern Data Stack'? We feel your pain. It's supposed to make building smarter, faster, and more flexible data infrastructures a breeze. It ends up being anything but that. Setting it up, integrating it, maintaining it—it’s all kind of a nightmare. And let's not even get started on all the extra tools you have to buy to get it to do its thing. But don't worry, there is a better way. TimeXtender takes a holistic approach to data integration that focuses on agility rather than fragmentation. By bringing all the layers of the data stack together, TimeXtender helps you build data solutions up to 10 times faster and saves you 70-80% on costs. If you're fed up with the 'Modern Data Stack', give TimeXtender a try. Head over to timextender.com/dataengineering where you can do two things: watch us build a data estate in 15 minutes and start for free today.
Your host is Tobias Macey and today I'm interviewing Ryan Blue about the evolution and applications of the Iceberg table format and how he is making it more accessible at Tabular
Interview
Introduction
How did you get involved in the area of data management?
Can you describe what Iceberg is and its position in the data lake/lakehouse ecosystem?
Since it is fundamentally a specification, how do you manage compatibility and consistency across implementations?
What are the notable changes in the Iceberg project and its role in the ecosystem since our last conversation in October of 2018?
Around the time that Iceberg was first created at Netflix a number of alternative table formats were also being developed. What are the characteristics of Iceberg that lead teams to adopt it for their lakehouse projects?
Given the constant evolution of the various table formats it can be difficult to determine an up-to-date comparison of their features, particularly earlier in their development. What are the aspects of this problem space that make it so challenging to establish unbiased and comprehensive comparisons?
For someone who wants to manage their data in Iceberg tables, what does the implementation look like?
How does that change based on the type of query/processing engine being used?
Once a table has been created, what are the capabilities of Iceberg that help to support ongoing use and maintenance?
What are the most interesting, innovative, or unexpected ways that you have seen Iceberg used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on Iceberg/Tabular?
When is Iceberg/Tabular the wrong choice?
What do you have planned for the future of Iceberg/Tabular?
Contact Info
LinkedIn (https://www.linkedin.com/in/rdblue/)
rdblue (https://github.com/rdblue) on GitHub
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ (https://www.pythonpodcast.com) covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast (https://www.themachinelearningpodcast.com) helps you go from idea to production with machine learning.
Visit the site (https://www.dataengineeringpodcast.com) to subscribe to the show, sign up for the mailing list, and read the show notes.
If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com (mailto:hosts@dataengineeringpodcast.com) with your story.
To help other people find the show please leave a review on Apple Podcasts (https://podcasts.apple.com/us/podcast/data-engineering-podcast/id1193040557) and tell your friends and co-workers
Links
Iceberg (https://iceberg.apache.org/)
Podcast Episode (https://www.dataengineeringpodcast.com/iceberg-with-ryan-blue-episode-52/)
Hadoop (https://hadoop.apache.org/)
Data Lakehouse (https://www.forbes.com/sites/bernardmarr/2022/01/18/what-is-a-data-lakehouse-a-super-simple-explanation-for-anyone/)
ACID == Atomic, Consistent, Isolated, Durable (https://en.wikipedia.org/wiki/ACID)
Apache Hive (https://hive.apache.org/)
Apache Impala (https://impala.apache.org/)
Bodo (https://www.bodo.ai/)
Podcast Episode (https://www.dataengineeringpodcast.com/bodo-parallel-data-processing-python-episode-223/)
StarRocks (https://www.starrocks.io/)
Dremio (https://www.dremio.com/)
Podcast Episode (https://www.dataengineeringpodcast.com/dremio-open-data-lakehouse-episode-333/)
DDL == Data Definition Language (https://en.wikipedia.org/wiki/Data_definition_language)
Trino (https://trino.io/)
PrestoDB (https://prestodb.io/)
Apache Hudi (https://hudi.apache.org/)
Podcast Episode (https://www.dataengineeringpodcast.com/hudi-streaming-data-lake-episode-209/)
dbt (https://www.getdbt.com/)
Apache Flink (https://flink.apache.org/)
TileDB (https://tiledb.com/)
Podcast Episode (https://www.dataengineeringpodcast.com/tiledb-universal-data-engine-episode-146/)
CDC == Change Data Capture (https://en.wikipedia.org/wiki/Change_data_capture)
Substrait (https://substrait.io/)
The intro and outro music is from The Hug (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/Love_death_and_a_drunken_monkey/04_-_The_Hug) by The Freak Fandango Orchestra (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/) / CC BY-SA (http://creativecommons.org/licenses/by-sa/3.0/)
2/19/2023 • 55 minutes, 6 seconds
Let The Whole Team Participate In Data With The Quilt Versioned Data Hub
Summary
Data is a team sport, but it's often difficult for everyone on the team to participate. For a long time the mantra of data tools has been "by developers, for developers", which automatically excludes a large portion of the business members who play a crucial role in the success of any data project. Quilt Data was created as an answer to make it easier for everyone to contribute to the data being used by an organization and collaborate on its application. In this episode Aneesh Karve shares the journey that Quilt has taken to provide an approachable interface for working with versioned data in S3 that empowers everyone to collaborate.
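For a sense of the collaboration workflow discussed in the episode, here is a minimal sketch using the quilt3 Python client to build and publish a versioned data package in S3. The package name, bucket, file paths, and metadata are hypothetical placeholders.

```python
import quilt3

# Assemble a package from local files; the logical key on the left maps to
# the local path on the right.
pkg = quilt3.Package()
pkg.set("reports/q1.csv", "local_data/q1.csv")
pkg.set_meta({"owner": "analytics-team", "source": "warehouse export"})

# Each push creates an immutable revision in the S3 registry, so
# collaborators can always refer back to an exact version of the data.
pkg.push(
    "analytics/quarterly-reports",
    registry="s3://example-quilt-bucket",
    message="Initial Q1 export",
)

# Anyone on the team can later browse that same package and its history.
restored = quilt3.Package.browse(
    "analytics/quarterly-reports",
    registry="s3://example-quilt-bucket",
)
```

Non-technical collaborators can work with the same packages through the Quilt web catalog rather than the Python API, which is the point of the "everyone can participate" framing above.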
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
Truly leveraging and benefiting from streaming data is hard - the data stack is costly, difficult to use and still has limitations. Materialize breaks down those barriers with a true cloud-native streaming database - not simply a database that connects to streaming systems. With a PostgreSQL-compatible interface, you can now work with real-time data using ANSI SQL including the ability to perform multi-way complex joins, which support stream-to-stream, stream-to-table, table-to-table, and more, all in standard SQL. Go to dataengineeringpodcast.com/materialize (https://www.dataengineeringpodcast.com/materialize) today and sign up for early access to get started. If you like what you see and want to help make it better, they're hiring (https://materialize.com/careers/) across all functions!
Your host is Tobias Macey and today I'm interviewing Aneesh Karve about how Quilt Data helps you bring order to your chaotic data in S3 with transactional versioning and data discovery built in
Interview
Introduction
How did you get involved in the area of data management?
Can you describe what Quilt is and the story behind it?
How have the goals and features of the Quilt platform changed since I spoke with Kevin in June of 2018?
What are the main problems that users are trying to solve when they find Quilt?
What are some of the alternative approaches/products that they are coming from?
How does Quilt compare with options such as LakeFS, Unstruk, Pachyderm, etc.?
Can you describe how Quilt is implemented?
What are the types of tools and systems that Quilt gets integrated with?
How do you manage the tension between supporting the lowest common denominator, while providing options for more advanced capabilities?
What is a typical workflow for a team that is using Quilt to manage their data?
What are the most interesting, innovative, or unexpected ways that you have seen Quilt used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on Quilt?
When is Quilt the wrong choice?
What do you have planned for the future of Quilt?
Contact Info
LinkedIn (https://www.linkedin.com/in/aneeshkarve/)
@akarve (https://twitter.com/akarve) on Twitter
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ (https://www.pythonpodcast.com) covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast (https://www.themachinelearningpodcast.com) helps you go from idea to production with machine learning.
Visit the site (https://www.dataengineeringpodcast.com) to subscribe to the show, sign up for the mailing list, and read the show notes.
If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com (mailto:hosts@dataengineeringpodcast.com) with your story.
To help other people find the show please leave a review on Apple Podcasts (https://podcasts.apple.com/us/podcast/data-engineering-podcast/id1193040557) and tell your friends and co-workers
Links
Quilt Data (https://quiltdata.com/)
Podcast Episode (https://www.dataengineeringpodcast.com/quilt-data-with-kevin-moore-episode-37/)
UW Madison (https://www.wisc.edu/)
Docker Swarm (https://docs.docker.com/engine/swarm/)
Kaggle (https://www.kaggle.com/)
open.quiltdata.com (https://open.quiltdata.com/)
FinOS Perspective (https://perspective.finos.org/)
LakeFS (https://lakefs.io/)
Podcast Episode (https://www.dataengineeringpodcast.com/lakefs-data-lake-versioning-episode-157/)
Pachyderm (https://www.pachyderm.com/)
Podcast Episode (https://www.dataengineeringpodcast.com/pachyderm-data-lineage-episode-82)
Unstruk (https://www.unstruk.com/)
Podcast Episode (https://www.dataengineeringpodcast.com/unstruk-unstructured-data-warehouse-episode-196/)
Parquet (https://parquet.apache.org/)
Avro (https://avro.apache.org/)
ORC (https://orc.apache.org/)
Cloudformation (https://aws.amazon.com/cloudformation/)
Troposphere (https://github.com/cloudtools/troposphere)
CDK == Cloud Development Kit (https://aws.amazon.com/cdk/)
Shadow IT (https://en.wikipedia.org/wiki/Shadow_IT)
Podcast Episode (https://www.dataengineeringpodcast.com/shadow-it-data-analytics-episode-121)
Delta Lake (https://delta.io/)
Podcast Episode (https://www.dataengineeringpodcast.com/delta-lake-data-lake-episode-85/)
Apache Iceberg (https://iceberg.apache.org/)
Podcast Episode (https://www.dataengineeringpodcast.com/iceberg-with-ryan-blue-episode-52/)
Datasette (https://datasette.io/)
Frictionless (https://frictionlessdata.io/)
DVC (https://dvc.org/)
Podcast.__init__ Episode (https://www.pythonpodcast.com/data-version-control-episode-206/)
The intro and outro music is from The Hug (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/Love_death_and_a_drunken_monkey/04_-_The_Hug) by The Freak Fandango Orchestra (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/) / CC BY-SA (http://creativecommons.org/licenses/by-sa/3.0/)
2/11/2023 • 52 minutes, 2 seconds
Reflecting On The Past 6 Years Of Data Engineering
Summary
This podcast started almost exactly six years ago, and the technology landscape was much different than it is now. In that time there have been a number of generational shifts in how data engineering is done. In this episode I reflect on some of the major themes and take a brief look forward at some of the upcoming changes.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
Your host is Tobias Macey and today I'm reflecting on the major trends in data engineering over the past 6 years
Interview
Introduction
6 years of running the Data Engineering Podcast
Around the first time that data engineering was discussed as a role
Followed on from hype about "data science"
Hadoop era
Streaming
Lambda and Kappa architectures
Not really referenced anymore
"Big Data" era of capture everything has shifted to focusing on data that presents value
Regulatory environment increases risk, better tools introduce more capability to understand what data is useful
Data catalogs
Amundsen and Alation
Orchestration engine
Oozie, etc. -> Airflow and Luigi -> Dagster, Prefect, Flyte (from Lyft), etc.
Orchestration is now a part of most vertical tools
Cloud data warehouses
Data lakes
DataOps and MLOps
Data quality to data observability
Metadata for everything
Data catalog -> data discovery -> active metadata
Business intelligence
Read-only reports to metric/semantic layers
Embedded analytics and data APIs
Rise of ELT
dbt
Corresponding introduction of reverse ETL
What are the most interesting, unexpected, or challenging lessons that you have learned while working on running the podcast?
What do you have planned for the future of the podcast?
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ (https://www.pythonpodcast.com) covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast (https://www.themachinelearningpodcast.com) helps you go from idea to production with machine learning.
Visit the site (https://www.dataengineeringpodcast.com) to subscribe to the show, sign up for the mailing list, and read the show notes.
If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com (mailto:hosts@dataengineeringpodcast.com) with your story.
To help other people find the show please leave a review on Apple Podcasts (https://podcasts.apple.com/us/podcast/data-engineering-podcast/id1193040557) and tell your friends and co-workers
The intro and outro music is from The Hug (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/Love_death_and_a_drunken_monkey/04_-_The_Hug) by The Freak Fandango Orchestra (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/) / CC BY-SA (http://creativecommons.org/licenses/by-sa/3.0/)
2/6/2023 • 32 minutes, 21 seconds
Let Your Business Intelligence Platform Build The Models Automatically With Omni Analytics
Summary
Business intelligence has gone through many generational shifts, but each generation has largely maintained the same workflow. Data analysts create reports that the business uses to understand and direct its operations, but the process is very labor- and time-intensive. The team at Omni have taken a new approach by automatically building models based on the queries that are executed. In this episode Chris Merrick shares how they manage integration and automation around the modeling layer and how it improves the organizational experience of business intelligence.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
Truly leveraging and benefiting from streaming data is hard - the data stack is costly, difficult to use and still has limitations. Materialize breaks down those barriers with a true cloud-native streaming database - not simply a database that connects to streaming systems. With a PostgreSQL-compatible interface, you can now work with real-time data using ANSI SQL including the ability to perform multi-way complex joins, which support stream-to-stream, stream-to-table, table-to-table, and more, all in standard SQL. Go to dataengineeringpodcast.com/materialize (https://www.dataengineeringpodcast.com/materialize) today and sign up for early access to get started. If you like what you see and want to help make it better, they're hiring (https://materialize.com/careers/) across all functions!
Your host is Tobias Macey and today I'm interviewing Chris Merrick about the Omni Analytics platform and how they are adding automatic data modeling to your business intelligence
Interview
Introduction
How did you get involved in the area of data management?
Can you describe what Omni Analytics is and the story behind it?
What are the core goals that you are trying to achieve with building Omni?
Business intelligence has gone through many evolutions. What are the unique capabilities that Omni Analytics offers over other players in the market?
What are the technical and organizational anti-patterns that typically grow up around BI systems?
What are the elements that contribute to BI being such a difficult product to use effectively in an organization?
Can you describe how you have implemented the Omni platform?
How have the design/scope/goals of the product changed since you first started working on it?
What does the workflow for a team using Omni look like?
What are some of the developments in the broader ecosystem that have made your work possible?
What are some of the positive and negative inspirations that you have drawn from the experience that you and your team-mates have gained in previous businesses?
What are the most interesting, innovative, or unexpected ways that you have seen Omni used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on Omni?
When is Omni the wrong choice?
What do you have planned for the future of Omni?
Contact Info
LinkedIn (https://www.linkedin.com/in/merrickchristopher/)
@cmerrick (https://twitter.com/cmerrick) on Twitter
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ (https://www.pythonpodcast.com) covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast (https://www.themachinelearningpodcast.com) helps you go from idea to production with machine learning.
Visit the site (https://www.dataengineeringpodcast.com) to subscribe to the show, sign up for the mailing list, and read the show notes.
If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com (mailto:hosts@dataengineeringpodcast.com) with your story.
To help other people find the show please leave a review on Apple Podcasts (https://podcasts.apple.com/us/podcast/data-engineering-podcast/id1193040557) and tell your friends and co-workers
Links
Omni Analytics (https://www.exploreomni.com/)
Stitch (https://www.stitchdata.com/)
RJ Metrics (https://en.wikipedia.org/wiki/RJMetrics)
Looker (https://www.looker.com/)
Podcast Episode (https://www.dataengineeringpodcast.com/looker-with-daniel-mintz-episode-55/)
Singer (https://www.singer.io/)
dbt (https://www.getdbt.com/)
Podcast Episode (https://www.dataengineeringpodcast.com/dbt-data-analytics-episode-81/)
Teradata (https://www.teradata.com/)
Fivetran (https://www.fivetran.com/)
Apache Arrow (https://arrow.apache.org/)
Podcast Episode (https://www.dataengineeringpodcast.com/voltron-data-apache-arrow-episode-346/)
DuckDB (https://duckdb.org/)
Podcast Episode (https://www.dataengineeringpodcast.com/duckdb-in-process-olap-database-episode-270/)
BigQuery (https://cloud.google.com/bigquery)
Snowflake (https://www.snowflake.com/en/)
Podcast Episode (https://www.dataengineeringpodcast.com/snowflakedb-cloud-data-warehouse-episode-110/)
The intro and outro music is from The Hug (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/Love_death_and_a_drunken_monkey/04_-_The_Hug) by The Freak Fandango Orchestra (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/) / CC BY-SA (http://creativecommons.org/licenses/by-sa/3.0/)
1/30/2023 • 50 minutes, 43 seconds
Safely Test Your Applications And Analytics With Production Quality Data Using Tonic AI
Summary
The most interesting and challenging bugs always happen in production, but recreating them is a constant struggle due to differences in the data that you are working with. Building your own scripts to replicate data from production is time-consuming and error-prone. Tonic is a platform designed to solve the problem of having reliable, production-like data available for developing and testing your software, analytics, and machine learning projects. In this episode Adam Kamor explores the factors that make this such a complex problem to solve, the approach that he and his team have taken to turn it into a reliable product, and how you can start using it to replace your own collection of scripts.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
Truly leveraging and benefiting from streaming data is hard - the data stack is costly, difficult to use and still has limitations. Materialize breaks down those barriers with a true cloud-native streaming database - not simply a database that connects to streaming systems. With a PostgreSQL-compatible interface, you can now work with real-time data using ANSI SQL including the ability to perform multi-way complex joins, which support stream-to-stream, stream-to-table, table-to-table, and more, all in standard SQL. Go to dataengineeringpodcast.com/materialize (https://www.dataengineeringpodcast.com/materialize) today and sign up for early access to get started. If you like what you see and want to help make it better, they're hiring (https://materialize.com/careers/) across all functions!
Data and analytics leaders, 2023 is your year to sharpen your leadership skills, refine your strategies and lead with purpose. Join your peers at Gartner Data & Analytics Summit, March 20 – 22 in Orlando, FL for 3 days of expert guidance, peer networking and collaboration. Listeners can save $375 off standard rates with code GARTNERDA. Go to dataengineeringpodcast.com/gartnerda (https://www.dataengineeringpodcast.com/gartnerda) today to find out more.
Your host is Tobias Macey and today I'm interviewing Adam Kamor about Tonic, a service for generating data sets that are safe for development, analytics, and machine learning
Interview
Introduction
How did you get involved in the area of data management?
Can you describe what Tonic is and the story behind it?
What are the core problems that you are trying to solve?
What are some of the ways that fake or obfuscated data is used in development and analytics workflows?
challenges of reliably subsetting data
impact of ORMs and bad habits developers get into with database modeling
Can you describe how Tonic is implemented?
What are the units of composition that you are building to allow for evolution and expansion of your product?
How have the design and goals of the platform evolved since you started working on it?
Can you describe some of the different workflows that customers build on top of your various tools
What are the most interesting, innovative, or unexpected ways that you have seen Tonic used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on Tonic?
When is Tonic the wrong choice?
What do you have planned for the future of Tonic?
Contact Info
LinkedIn (https://www.linkedin.com/in/adam-kamor-85720b48/)
@AdamKamor (https://twitter.com/adamkamor) on Twitter
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ (https://www.pythonpodcast.com) covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast (https://www.themachinelearningpodcast.com) helps you go from idea to production with machine learning.
Visit the site (https://www.dataengineeringpodcast.com) to subscribe to the show, sign up for the mailing list, and read the show notes.
If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com (mailto:hosts@dataengineeringpodcast.com) with your story.
To help other people find the show please leave a review on Apple Podcasts (https://podcasts.apple.com/us/podcast/data-engineering-podcast/id1193040557) and tell your friends and co-workers
Links
Tonic (https://hubs.la/Q01yX4qN0)
Djinn (https://hubs.la/Q01yX4FL0)
Django (https://www.djangoproject.com/)
Ruby on Rails (https://rubyonrails.org/)
C# (https://learn.microsoft.com/en-us/dotnet/csharp/tour-of-csharp/)
Entity Framework (https://learn.microsoft.com/en-us/dotnet/csharp/tour-of-csharp/)
PostgreSQL (https://www.postgresql.org/)
MySQL (https://www.mysql.com/)
Oracle DB (https://www.oracle.com/database/)
MongoDB (https://www.mongodb.com/)
Parquet (https://parquet.apache.org/)
Databricks (https://www.databricks.com/)
Mockaroo (https://www.mockaroo.com/)
The intro and outro music is from The Hug (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/Love_death_and_a_drunken_monkey/04_-_The_Hug) by The Freak Fandango Orchestra (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/) / CC BY-SA (http://creativecommons.org/licenses/by-sa/3.0/)
1/22/2023 • 45 minutes, 40 seconds
Building Applications With Data As Code On The DataOS
Summary
The modern data stack has made it more economical to use enterprise-grade technologies to power analytics at organizations of every scale. Unfortunately, it has also introduced new overhead to manage the full experience as a single workflow. At the Modern Data Company they created the DataOS platform as a means of driving your full analytics lifecycle through code, while providing automatic knowledge graphs and data discovery. In this episode Srujan Akula explains how the system is implemented and how you can start using it today with your existing data systems.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
Truly leveraging and benefiting from streaming data is hard - the data stack is costly, difficult to use and still has limitations. Materialize breaks down those barriers with a true cloud-native streaming database - not simply a database that connects to streaming systems. With a PostgreSQL-compatible interface, you can now work with real-time data using ANSI SQL including the ability to perform multi-way complex joins, which support stream-to-stream, stream-to-table, table-to-table, and more, all in standard SQL. Go to dataengineeringpodcast.com/materialize (https://www.dataengineeringpodcast.com/materialize) today and sign up for early access to get started. If you like what you see and want to help make it better, they're hiring (https://materialize.com/careers/) across all functions!
Struggling with broken pipelines? Stale dashboards? Missing data? If this resonates with you, you’re not alone. Data engineers struggling with unreliable data need look no further than Monte Carlo, the leading end-to-end Data Observability Platform! Trusted by the data teams at Fox, JetBlue, and PagerDuty, Monte Carlo solves the costly problem of broken data pipelines. Monte Carlo monitors and alerts for data issues across your data warehouses, data lakes, dbt models, Airflow jobs, and business intelligence tools, reducing time to detection and resolution from weeks to just minutes. Monte Carlo also gives you a holistic picture of data health with automatic, end-to-end lineage from ingestion to the BI layer directly out of the box. Start trusting your data with Monte Carlo today! Visit dataengineeringpodcast.com/montecarlo (http://www.dataengineeringpodcast.com/montecarlo) to learn more.
Data and analytics leaders, 2023 is your year to sharpen your leadership skills, refine your strategies and lead with purpose. Join your peers at Gartner Data & Analytics Summit, March 20 – 22 in Orlando, FL for 3 days of expert guidance, peer networking and collaboration. Listeners can save $375 off standard rates with code GARTNERDA. Go to dataengineeringpodcast.com/gartnerda (https://www.dataengineeringpodcast.com/gartnerda) today to find out more.
Your host is Tobias Macey and today I'm interviewing Srujan Akula about DataOS, a pre-integrated and managed data platform built by The Modern Data Company
Interview
Introduction
How did you get involved in the area of data management?
Can you describe what your mission at The Modern Data Company is and the story behind it?
Your flagship (only?) product is a platform that you're calling DataOS. What is the scope and goal of that platform?
Who is the target audience?
On your site you refer to the idea of "data as software". What are the principles and ways of thinking that are encompassed by that concept?
What are the platform capabilities that are required to make it possible?
There are 11 "Key Features" listed on your site for the DataOS. What was your process for identifying the "must have" vs "nice to have" features for launching the platform?
Can you describe the technical architecture that powers your DataOS product?
What are the core principles that you are optimizing for in the design of your platform?
How have the design and goals of the system changed or evolved since you started working on DataOS?
Can you describe the workflow for the different practitioners and stakeholders working on an installation of DataOS?
What are the interfaces and escape hatches that are available for integrating with and extending the operation of the DataOS?
What are the features or capabilities that you are expressly choosing not to implement? (e.g. ML pipelines, data sharing, etc.)
What are the design elements that you are focused on to make DataOS approachable and understandable by different members of an organization?
What are the most interesting, innovative, or unexpected ways that you have seen DataOS used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on DataOS?
When is DataOS the wrong choice?
What do you have planned for the future of DataOS?
Contact Info
LinkedIn (https://www.linkedin.com/in/srujanakula/)
@srujanakula (https://twitter.com/srujanakula) on Twitter
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ (https://www.pythonpodcast.com) covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast (https://www.themachinelearningpodcast.com) helps you go from idea to production with machine learning.
Visit the site (https://www.dataengineeringpodcast.com) to subscribe to the show, sign up for the mailing list, and read the show notes.
If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com (mailto:hosts@dataengineeringpodcast.com) with your story.
To help other people find the show please leave a review on Apple Podcasts (https://podcasts.apple.com/us/podcast/data-engineering-podcast/id1193040557) and tell your friends and co-workers
Links
Modern Data Company (https://themoderndatacompany.com/)
Alation (https://www.alation.com/)
Airbyte (https://airbyte.com/)
Podcast Episode (https://www.dataengineeringpodcast.com/airbyte-open-source-data-integration-episode-173/)
Fivetran (https://www.fivetran.com/)
Podcast Episode (https://www.dataengineeringpodcast.com/fivetran-data-replication-episode-93/)
Airflow (https://airflow.apache.org/)
Dremio (https://www.dremio.com/)
Podcast Episode (https://www.dataengineeringpodcast.com/dremio-with-tomer-shiran-episode-58/)
PrestoDB (https://prestodb.io/)
GraphQL (https://graphql.org/)
Cypher (https://neo4j.com/developer/cypher/) graph query language
Gremlin (https://en.wikipedia.org/wiki/Gremlin_(query_language)) graph query language
The intro and outro music is from The Hug (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/Love_death_and_a_drunken_monkey/04_-_The_Hug) by The Freak Fandango Orchestra (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/) / CC BY-SA (http://creativecommons.org/licenses/by-sa/3.0/)
1/16/2023 • 48 minutes, 36 seconds
Automate Your Pipeline Creation For Streaming Data Transformations With SQLake
Summary
Managing end-to-end data flows becomes complex and unwieldy as the scale of data and its variety of applications in an organization grows. Part of this complexity is due to the transformation and orchestration of data living in disparate systems. The team at Upsolver is taking aim at this problem with the latest iteration of their platform in the form of SQLake. In this episode Ori Rafael explains how they are automating the creation and scheduling of orchestration flows and their related transformations in a unified SQL interface.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
Data and analytics leaders, 2023 is your year to sharpen your leadership skills, refine your strategies and lead with purpose. Join your peers at Gartner Data & Analytics Summit, March 20 – 22 in Orlando, FL for 3 days of expert guidance, peer networking and collaboration. Listeners can save $375 off standard rates with code GARTNERDA. Go to dataengineeringpodcast.com/gartnerda (https://www.dataengineeringpodcast.com/gartnerda) today to find out more.
Truly leveraging and benefiting from streaming data is hard - the data stack is costly, difficult to use and still has limitations. Materialize breaks down those barriers with a true cloud-native streaming database - not simply a database that connects to streaming systems. With a PostgreSQL-compatible interface, you can now work with real-time data using ANSI SQL including the ability to perform multi-way complex joins, which support stream-to-stream, stream-to-table, table-to-table, and more, all in standard SQL. Go to dataengineeringpodcast.com/materialize (https://www.dataengineeringpodcast.com/materialize) today and sign up for early access to get started. If you like what you see and want to help make it better, they're hiring (https://materialize.com/careers/) across all functions!
Struggling with broken pipelines? Stale dashboards? Missing data? If this resonates with you, you’re not alone. Data engineers struggling with unreliable data need look no further than Monte Carlo, the leading end-to-end Data Observability Platform! Trusted by the data teams at Fox, JetBlue, and PagerDuty, Monte Carlo solves the costly problem of broken data pipelines. Monte Carlo monitors and alerts for data issues across your data warehouses, data lakes, dbt models, Airflow jobs, and business intelligence tools, reducing time to detection and resolution from weeks to just minutes. Monte Carlo also gives you a holistic picture of data health with automatic, end-to-end lineage from ingestion to the BI layer directly out of the box. Start trusting your data with Monte Carlo today! Visit dataengineeringpodcast.com/montecarlo (http://www.dataengineeringpodcast.com/montecarlo) to learn more.
Your host is Tobias Macey and today I'm interviewing Ori Rafael about the SQLake feature for the Upsolver platform that automatically generates pipelines from your queries
Interview
Introduction
How did you get involved in the area of data management?
Can you describe what the SQLake product is and the story behind it?
What is the core problem that you are trying to solve?
What are some of the anti-patterns that you have seen teams adopt when designing and implementing DAGs in a tool such as Airflow?
What are the benefits of merging the logic for transformation and orchestration into the same interface and dialect (SQL)?
Can you describe the technical implementation of the SQLake feature?
What does the workflow look like for designing and deploying pipelines in SQLake?
What are the opportunities for using utilities such as dbt for managing logical complexity as the number of pipelines scales?
SQL has traditionally been challenging to compose. How did that factor into your design process for how to structure the dialect extensions for job scheduling?
What are some of the complexities that you have had to address in your orchestration system to be able to manage timeliness of operations as volume and complexity of the data scales?
What are some of the edge cases that you have had to provide escape hatches for?
What are the most interesting, innovative, or unexpected ways that you have seen SQLake used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on SQLake?
When is SQLake the wrong choice?
What do you have planned for the future of SQLake?
Contact Info
LinkedIn (https://www.linkedin.com/in/ori-rafael-91723344/)
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ (https://www.pythonpodcast.com) covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast (https://www.themachinelearningpodcast.com) helps you go from idea to production with machine learning.
Visit the site (https://www.dataengineeringpodcast.com) to subscribe to the show, sign up for the mailing list, and read the show notes.
If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com (mailto:hosts@dataengineeringpodcast.com) with your story.
To help other people find the show please leave a review on Apple Podcasts (https://podcasts.apple.com/us/podcast/data-engineering-podcast/id1193040557) and tell your friends and co-workers
Links
Upsolver (https://www.upsolver.com/)
Podcast Episode (https://www.dataengineeringpodcast.com/upsolver-streaming-data-integration-episode-240/)
SQLake (https://docs.upsolver.com/sqlake/)
Airflow (https://airflow.apache.org/)
Dagster (https://dagster.io/)
Podcast Episode (https://www.dataengineeringpodcast.com/dagster-software-defined-assets-data-orchestration-episode-309/)
Prefect (https://www.prefect.io/)
Podcast Episode (https://www.dataengineeringpodcast.com/prefect-workflow-engine-episode-86/)
Flyte (https://flyte.org/)
Podcast Episode (https://www.dataengineeringpodcast.com/flyte-data-orchestration-machine-learning-episode-291/)
GitHub Actions (https://github.com/features/actions)
dbt (https://www.getdbt.com/)
Podcast Episode (https://www.dataengineeringpodcast.com/dbt-data-analytics-episode-81/)
PartiQL (https://partiql.org/)
The intro and outro music is from The Hug (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/Love_death_and_a_drunken_monkey/04_-_The_Hug) by The Freak Fandango Orchestra (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/) / CC BY-SA (http://creativecommons.org/licenses/by-sa/3.0/)
1/8/2023 • 44 minutes, 5 seconds
Increase Your Odds Of Success For Analytics And AI Through More Effective Knowledge Management With AlignAI
Summary
Making effective use of data requires proper context around the information that is being used. As the size and complexity of your organization increases, the difficulty of ensuring that everyone has the necessary knowledge about how to get their work done scales exponentially. Wikis and intranets are a common way to attempt to solve this problem, but they are frequently ineffective. Rehgan Avon co-founded AlignAI to help address this challenge through a more purposeful platform designed to collect and distribute the knowledge of how and why data is used in a business. In this episode she shares the strategic and tactical elements of how to make more effective use of the technical and organizational resources that are available to you for getting work done with data.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you're ready to build your next pipeline, or want to test out the projects you hear about on the show, you'll need somewhere to deploy it, so check out our friends at Linode. With their new managed database service you can launch a production ready MySQL, Postgres, or MongoDB cluster in minutes, with automated backups, 40 Gbps connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode (https://www.dataengineeringpodcast.com/linode) today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don't forget to thank them for their continued support of this show!
Atlan is the metadata hub for your data ecosystem. Instead of locking your metadata into a new silo, unleash its transformative potential with Atlan's active metadata capabilities. Push information about data freshness and quality to your business intelligence, automatically scale up and down your warehouse based on usage patterns, and let the bots answer those questions in Slack so that the humans can focus on delivering real value. Go to dataengineeringpodcast.com/atlan (https://www.dataengineeringpodcast.com/atlan) today to learn more about how Atlan’s active metadata platform is helping pioneering data teams like Postman, Plaid, WeWork & Unilever achieve extraordinary things with metadata and escape the chaos.
Struggling with broken pipelines? Stale dashboards? Missing data? If this resonates with you, you’re not alone. Data engineers struggling with unreliable data need look no further than Monte Carlo, the leading end-to-end Data Observability Platform! Trusted by the data teams at Fox, JetBlue, and PagerDuty, Monte Carlo solves the costly problem of broken data pipelines. Monte Carlo monitors and alerts for data issues across your data warehouses, data lakes, dbt models, Airflow jobs, and business intelligence tools, reducing time to detection and resolution from weeks to just minutes. Monte Carlo also gives you a holistic picture of data health with automatic, end-to-end lineage from ingestion to the BI layer directly out of the box. Start trusting your data with Monte Carlo today! Visit dataengineeringpodcast.com/montecarlo (http://www.dataengineeringpodcast.com/montecarlo) to learn more.
Your host is Tobias Macey and today I'm interviewing Rehgan Avon about her work at AlignAI to help organizations standardize their technical and procedural approaches to working with data
Interview
Introduction
How did you get involved in the area of data management?
Can you describe what AlignAI is and the story behind it?
What are the core problems that you are focused on addressing?
What are the tactical ways that you are working to solve those problems?
What are some of the common and avoidable ways that analytics/AI projects go wrong?
What are some of the ways that organizational scale and complexity impacts their ability to execute on data and AI projects?
What are the ways that incomplete/unevenly distributed knowledge manifests in project design and execution?
Can you describe the design and implementation of the AlignAI platform?
How have the goals and implementation of the product changed since you first started working on it?
What is the workflow at the individual and organizational level for businesses that are using AlignAI?
One of the perennial challenges with knowledge sharing in an organization is managing incentives to engage with the available material. What are some of the ways that you are working to integrate the creation and distribution of institutional knowledge into employees' day-to-day work?
What are the most interesting, innovative, or unexpected ways that you have seen AlignAI used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on AlignAI?
When is AlignAI the wrong choice?
What do you have planned for the future of AlignAI?
Contact Info
LinkedIn (https://www.linkedin.com/in/rehganavon/)
@RehganAvon (https://twitter.com/RehganAvon) on Twitter
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ (https://www.pythonpodcast.com) covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast (https://www.themachinelearningpodcast.com) helps you go from idea to production with machine learning.
Visit the site (https://www.dataengineeringpodcast.com) to subscribe to the show, sign up for the mailing list, and read the show notes.
If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com (mailto:hosts@dataengineeringpodcast.com) with your story.
To help other people find the show please leave a review on Apple Podcasts (https://podcasts.apple.com/us/podcast/data-engineering-podcast/id1193040557) and tell your friends and co-workers
Links
AlignAI (https://www.getalignai.com/)
Sharepoint (https://en.wikipedia.org/wiki/SharePoint)
Confluence (https://en.wikipedia.org/wiki/Confluence_(software))
GitHub (https://github.com/)
Canva (https://www.canva.com/)
Instructional Design (https://en.wikipedia.org/wiki/Instructional_design)
Notion (https://www.notion.so/)
Coda (https://coda.io/)
Waterfall Design (https://en.wikipedia.org/wiki/Waterfall_model)
dbt (https://www.getdbt.com/)
Podcast Episode (https://www.dataengineeringpodcast.com/dbt-data-analytics-episode-81/)
Alteryx (https://www.alteryx.com/)
The intro and outro music is from The Hug (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/Love_death_and_a_drunken_monkey/04_-_The_Hug) by The Freak Fandango Orchestra (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/) / CC BY-SA (http://creativecommons.org/licenses/by-sa/3.0/)
12/29/2022 • 59 minutes, 21 seconds
Using Product Driven Development To Improve The Productivity And Effectiveness Of Your Data Teams
Summary
With all of the messaging about treating data as a product, it is becoming difficult to know what that even means. Vishal Singh is the head of products at Starburst, which means that he has to spend all of his time thinking and talking about the details of product thinking and its application to data. In this episode he shares his thoughts on the strategic and tactical elements of moving your work as a data professional from being task-oriented to being product-oriented and the long-term improvements in your productivity that it provides.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you're ready to build your next pipeline, or want to test out the projects you hear about on the show, you'll need somewhere to deploy it, so check out our friends at Linode. With their new managed database service you can launch a production ready MySQL, Postgres, or MongoDB cluster in minutes, with automated backups, 40 Gbps connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode (https://www.dataengineeringpodcast.com/linode) today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don't forget to thank them for their continued support of this show!
Modern data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days or even weeks. By the time errors have made their way into production, it’s often too late and damage is done. Datafold built automated regression testing to help data and analytics engineers deal with data quality in their pull requests. Datafold shows how a change in SQL code affects your data, both on a statistical level and down to individual rows and values before it gets merged to production. No more shipping and praying, you can now know exactly what will change in your database! Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI workflows. Visit dataengineeringpodcast.com/datafold (https://www.dataengineeringpodcast.com/datafold) today to book a demo with Datafold.
RudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their extensive library of integrations enable you to automatically send data to hundreds of downstream tools. Sign up free at dataengineeringpodcast.com/rudder (https://www.dataengineeringpodcast.com/rudder)
Build Data Pipelines. Not DAGs. That’s the spirit behind Upsolver SQLake, a new self-service data pipeline platform that lets you build batch and streaming pipelines without falling into the black hole of DAG-based orchestration. All you do is write a query in SQL to declare your transformation, and SQLake will turn it into a continuous pipeline that scales to petabytes and delivers up to the minute fresh data. SQLake supports a broad set of transformations, including high-cardinality joins, aggregations, upserts and window operations. Output data can be streamed into a data lake for query engines like Presto, Trino or Spark SQL, a data warehouse like Snowflake or Redshift, or any other destination you choose. Pricing for SQLake is simple. You pay $99 per terabyte ingested into your data lake using SQLake, and run unlimited transformation pipelines for free. That way data engineers and data users can process to their heart’s content without worrying about their cloud bill. For data engineering podcast listeners, we’re offering a 30 day trial with unlimited data, so go to dataengineeringpodcast.com/upsolver (https://www.dataengineeringpodcast.com/upsolver) today and see for yourself how to avoid DAG hell.
Your host is Tobias Macey and today I'm interviewing Vishal Singh about his experience building data products at Starburst
Interview
Introduction
How did you get involved in the area of data management?
Can you describe what your definition of a "data product" is?
What are some of the different contexts in which the idea of a data product is applicable?
How do the parameters of a data product change across those different contexts/consumers?
What are some of the ways that you see the conversation around the purpose and practice of building data products getting overloaded by conflicting objectives?
What do you see as common challenges in data teams around how to approach product thinking in their day-to-day work?
What are some of the tactical ways that product-oriented work on data problems differs from what has become common practice in data teams?
What are some of the features that you are building at Starburst that contribute to the efforts of data teams to build full-featured product experiences for their data?
What are the most interesting, innovative, or unexpected ways that you have seen Starburst used in the context of data products?
What are the most interesting, unexpected, or challenging lessons that you have learned while working at Starburst?
When is a data product the wrong choice?
What do you have planned for the future of support for data product development at Starburst?
Contact Info
LinkedIn (https://www.linkedin.com/in/singhsvishal/)
@vishal_singh (https://twitter.com/vishal_singh) on Twitter
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ (https://www.pythonpodcast.com) covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast (https://www.themachinelearningpodcast.com) helps you go from idea to production with machine learning.
Visit the site (https://www.dataengineeringpodcast.com) to subscribe to the show, sign up for the mailing list, and read the show notes.
If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com (mailto:hosts@dataengineeringpodcast.com) with your story.
To help other people find the show please leave a review on Apple Podcasts (https://podcasts.apple.com/us/podcast/data-engineering-podcast/id1193040557) and tell your friends and co-workers
Links
Starburst (https://www.starburst.io/)
Podcast Episode (https://www.dataengineeringpodcast.com/starburst-lakehouse-modern-data-architecture-episode-304/)
Geophysics (https://en.wikipedia.org/wiki/Geophysics)
Product-Led Growth (https://www.productled.org/foundations/what-is-product-led-growth)
Trino (https://trino.io/)
DataNova (https://www.starburst.io/datanova/)
Starburst Galaxy (https://www.starburst.io/platform/starburst-galaxy/)
Tableau (https://www.tableau.com/)
PowerBI (https://powerbi.microsoft.com/en-us/)
Podcast Episode (https://www.dataengineeringpodcast.com/power-bi-business-intelligence-episode-154/)
Metabase (https://www.metabase.com/)
Podcast Episode (https://www.dataengineeringpodcast.com/metabase-with-sameer-al-sakran-episode-29/)
Great Expectations (https://greatexpectations.io/)
Podcast Episode (https://www.dataengineeringpodcast.com/great-expectations-technical-debt-data-pipeline-episode-117/)
The intro and outro music is from The Hug (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/Love_death_and_a_drunken_monkey/04_-_The_Hug) by The Freak Fandango Orchestra (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/) / CC BY-SA (http://creativecommons.org/licenses/by-sa/3.0/)
12/29/2022 • 58 minutes, 45 seconds
An Exploration Of Tobias' Experience In Building A Data Lakehouse From Scratch
Summary
Five years of hosting the Data Engineering Podcast has provided Tobias Macey with a wealth of insight into the work of building and operating data systems at a variety of scales and for myriad purposes. In order to condense that acquired knowledge into a format that is useful to everyone, Scott Hirleman turns the tables in this episode and asks Tobias about the tactical and strategic aspects of his experiences applying those lessons to the work of building a data platform from scratch.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you're ready to build your next pipeline, or want to test out the projects you hear about on the show, you'll need somewhere to deploy it, so check out our friends at Linode. With their new managed database service you can launch a production ready MySQL, Postgres, or MongoDB cluster in minutes, with automated backups, 40 Gbps connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode (https://www.dataengineeringpodcast.com/linode) today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don't forget to thank them for their continued support of this show!
Atlan is the metadata hub for your data ecosystem. Instead of locking your metadata into a new silo, unleash its transformative potential with Atlan's active metadata capabilities. Push information about data freshness and quality to your business intelligence, automatically scale up and down your warehouse based on usage patterns, and let the bots answer those questions in Slack so that the humans can focus on delivering real value. Go to dataengineeringpodcast.com/atlan (https://www.dataengineeringpodcast.com/atlan) today to learn more about how Atlan’s active metadata platform is helping pioneering data teams like Postman, Plaid, WeWork & Unilever achieve extraordinary things with metadata and escape the chaos.
Struggling with broken pipelines? Stale dashboards? Missing data? If this resonates with you, you’re not alone. Data engineers struggling with unreliable data need look no further than Monte Carlo, the leading end-to-end Data Observability Platform! Trusted by the data teams at Fox, JetBlue, and PagerDuty, Monte Carlo solves the costly problem of broken data pipelines. Monte Carlo monitors and alerts for data issues across your data warehouses, data lakes, dbt models, Airflow jobs, and business intelligence tools, reducing time to detection and resolution from weeks to just minutes. Monte Carlo also gives you a holistic picture of data health with automatic, end-to-end lineage from ingestion to the BI layer directly out of the box. Start trusting your data with Monte Carlo today! Visit dataengineeringpodcast.com/montecarlo (http://www.dataengineeringpodcast.com/montecarlo) to learn more.
Your host is Tobias Macey and today I'm being interviewed by Scott Hirleman about my work on the podcasts and my experience building a data platform
Interview
Introduction
How did you get involved in the area of data management?
Data platform building journey
Why are you building, who are the users/use cases
How to focus on doing what matters over cool tools
How to build a good UX
Anything surprising or did you discover anything you didn't expect at the start
How to build so it's modular and can be improved in the future
General build vs buy and vendor selection process
Obviously have a good BS detector - how can others build theirs
So many tools, where do you start - capability need, vendor suite offering, etc.
Anything surprising in doing much of this at once
How do you think about TCO in build versus buy
Any advice
Guest call out
Be brave, believe you are good enough to be on the show
Look at past episodes and don't pitch the same as what's been on recently
And vendors, be smart, work with your customers to come up with a good pitch for them as guests...
Tobias' advice and learnings from building out a data platform:
Advice: when considering a tool, start from what you are actually trying to do. Yes, everyone has tools they want to use because they are cool (or some resume-driven development). Once you have a potential tool, ask whether the capability you want to use is an unloved feature or a main part of the product. If it's a side feature, will the vendor give it the care and attention it needs?
Advice: lean heavily on open source. You can fix things yourself and better direct the community's work than just filing a ticket and hoping with a vendor.
Learning: there is likely going to be some painful pieces missing, especially around metadata, as you build out your platform.
Advice: build in a modular way and think of what is my escape hatch? Yes, you have to lock yourself in a bit but build with the possibility of a vendor or a tool going away - whether that is your choice (e.g. too expensive) or it literally disappears (anyone remember FoundationDB?).
Learning: be prepared for tools to connect with each other but the connection to not be as robust as you want. Again, be prepared to have metadata challenges especially.
Advice: build your foundation to be strong. This will limit pain as things evolve and change. You can't build a large building on a bad foundation - or at least it's a BAD idea...
Advice: spend the time to work with your data consumers to figure out what questions they want to answer. Then abstract that to build to general challenges instead of point solutions.
Learning: it's easy to put data in S3 but it can be painfully difficult to query it. There's a missing piece as to how to store it for easy querying, not just the metadata issues.
Advice: it's okay to pay a vendor to lessen pain. But becoming wholly reliant on them can put you in a bad spot.
Advice: look to create paved path / easy path approaches. If someone wants to follow the preset path, it's easy for them. If they want to go their own way, more power to them, but not the data platform team's problem if it isn't working well.
Learning: there will be places you didn't expect to bend - again, that metadata layer for Tobias - to get things done sooner. It's okay not to have the end platform built at launch; move forward and get something going.
Advice: "one of the perennial problems in technlogy is the bias towards speed and action without necessarily understanding the destination." Really consider the path and if you are creating a scalable and maintainable solution instead of pushing for speed to deliver something.
Advice: consider building a buffer layer between upstream sources so if there are changes, it doesn't automatically break things downstream.
Tobias' data platform components: data lakehouse paradigm, Airbyte for data integration (chosen over Meltano), Trino/Starburst Galaxy for distributed querying, AWS S3 for the storage layer, AWS Glue for very basic metadata cataloguing, Dagster as the crucial orchestration layer, dbt
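As a rough illustration of how these components fit together, here is a minimal, hypothetical sketch of a Dagster asset that runs a Trino query over Glue-catalogued Parquet data in S3 and materializes a daily summary table back into the lake. The host, catalog, schema, and table names are placeholders rather than details from the episode, and the connection settings assume a Trino or Starburst Galaxy endpoint fronting the Glue catalog.
```python
# Hypothetical sketch only: hosts, catalogs, schemas, and table names are placeholders.
import trino
from dagster import Definitions, asset


def trino_connection():
    # Assumes a Trino (or Starburst Galaxy) endpoint backed by an AWS Glue
    # catalog of Parquet files in S3; adjust host/catalog/schema to your setup.
    return trino.dbapi.connect(
        host="trino.example.internal",
        port=443,
        user="dagster",
        http_scheme="https",
        catalog="glue",   # hypothetical catalog name
        schema="raw",     # hypothetical schema holding Airbyte-landed tables
    )


@asset
def daily_order_counts() -> None:
    """Aggregate raw orders into a daily summary table stored in the lake."""
    cursor = trino_connection().cursor()
    # CREATE TABLE AS SELECT keeps the result in S3 alongside the raw data,
    # so downstream consumers can query it through the same Trino endpoint.
    cursor.execute(
        """
        CREATE TABLE IF NOT EXISTS analytics.daily_order_counts AS
        SELECT order_date, count(*) AS order_count
        FROM raw.orders
        GROUP BY order_date
        """
    )
    cursor.fetchall()  # drain the result so the statement completes


# Dagster picks this up as the deployable set of assets for the pipeline.
defs = Definitions(assets=[daily_order_counts])
```
The orchestration layer (Dagster) owns scheduling and dependencies, while the query engine (Trino) does the heavy lifting against the lake; swapping either piece out is what the "escape hatch" advice above is about.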
Contact Info
LinkedIn (https://www.linkedin.com/in/scotthirleman/)
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ (https://www.pythonpodcast.com) covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast (https://www.themachinelearningpodcast.com) helps you go from idea to production with machine learning.
Visit the site (https://www.dataengineeringpodcast.com) to subscribe to the show, sign up for the mailing list, and read the show notes.
If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com (mailto:hosts@dataengineeringpodcast.com) with your story.
To help other people find the show please leave a review on Apple Podcasts (https://podcasts.apple.com/us/podcast/data-engineering-podcast/id1193040557) and tell your friends and co-workers
Links
Data Mesh Community (https://datameshlearning.com/community/)
Podcast (https://www.linkedin.com/company/80887002/admin/)
OSI Model (https://en.wikipedia.org/wiki/OSI_model)
Schemata (https://schemata.app/)
Podcast Episode (https://www.dataengineeringpodcast.com/schemata-schema-compatibility-utility-episode-324/)
Atlan (https://atlan.com/)
Podcast Episode (https://www.dataengineeringpodcast.com/atlan-data-team-collaboration-episode-179/)
OpenMetadata (https://open-metadata.org/)
Podcast Episode (https://www.dataengineeringpodcast.com/openmetadata-universal-metadata-layer-episode-237/)
Chris Riccomini (https://daappod.com/data-mesh-radio/devops-for-data-mesh-chris-riccomini/)
The intro and outro music is from The Hug (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/Love_death_and_a_drunken_monkey/04_-_The_Hug) by The Freak Fandango Orchestra (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/) / CC BY-SA (http://creativecommons.org/licenses/by-sa/3.0/)
12/26/2022 • 1 hour, 11 minutes, 59 seconds
Simple And Scalable Encryption Of Data In Use For Analytics And Machine Learning With Opaque Systems
Summary
Encryption and security are critical elements in data analytics and machine learning applications. We have well-developed protocols and practices around data that is at rest and in motion, but security around data in use is still severely lacking. Recognizing this shortcoming and the capabilities that could be unlocked by a robust solution, Rishabh Poddar helped to create Opaque Systems as an outgrowth of his PhD studies. In this episode he shares the work that he and his team have done to simplify integration of secure enclaves and trusted computing environments into analytical workflows and how you can start using it without re-engineering your existing systems.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you're ready to build your next pipeline, or want to test out the projects you hear about on the show, you'll need somewhere to deploy it, so check out our friends at Linode. With their new managed database service you can launch a production ready MySQL, Postgres, or MongoDB cluster in minutes, with automated backups, 40 Gbps connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode (https://www.dataengineeringpodcast.com/linode) today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don't forget to thank them for their continued support of this show!
Modern data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days or even weeks. By the time errors have made their way into production, it’s often too late and damage is done. Datafold built automated regression testing to help data and analytics engineers deal with data quality in their pull requests. Datafold shows how a change in SQL code affects your data, both on a statistical level and down to individual rows and values before it gets merged to production. No more shipping and praying, you can now know exactly what will change in your database! Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI workflows. Visit dataengineeringpodcast.com/datafold (https://www.dataengineeringpodcast.com/datafold) today to book a demo with Datafold.
RudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their extensive library of integrations enable you to automatically send data to hundreds of downstream tools. Sign up free at dataengineeringpodcast.com/rudder (https://www.dataengineeringpodcast.com/rudder)
Build Data Pipelines. Not DAGs. That’s the spirit behind Upsolver SQLake, a new self-service data pipeline platform that lets you build batch and streaming pipelines without falling into the black hole of DAG-based orchestration. All you do is write a query in SQL to declare your transformation, and SQLake will turn it into a continuous pipeline that scales to petabytes and delivers up to the minute fresh data. SQLake supports a broad set of transformations, including high-cardinality joins, aggregations, upserts and window operations. Output data can be streamed into a data lake for query engines like Presto, Trino or Spark SQL, a data warehouse like Snowflake or Redshift, or any other destination you choose. Pricing for SQLake is simple. You pay $99 per terabyte ingested into your data lake using SQLake, and run unlimited transformation pipelines for free. That way data engineers and data users can process to their heart’s content without worrying about their cloud bill. For data engineering podcast listeners, we’re offering a 30 day trial with unlimited data, so go to dataengineeringpodcast.com/upsolver (https://www.dataengineeringpodcast.com/upsolver) today and see for yourself how to avoid DAG hell.
Your host is Tobias Macey and today I'm interviewing Rishabh Poddar about his work at Opaque Systems to enable secure analysis and machine learning on encrypted data
Interview
Introduction
How did you get involved in the area of data management?
Can you describe what you are building at Opaque Systems and the story behind it?
What are the core problems related to security/privacy in data analytics and ML that organizations are struggling with?
What do you see as the balance of internal vs. cross-organization applications for the solutions you are creating?
comparison with homomorphic encryption
validation and ongoing testing of security/privacy guarantees
performance impact of encryption overhead and how to mitigate it
UX aspects of not being able to view the underlying data
risks of information leakage from schema/meta information
Can you describe how the Opaque Systems platform is implemented?
How have the design and scope of the product changed since you started working on it?
Can you describe a typical workflow for a team or teams building an analytical process or ML project with your platform?
What are some of the constraints in terms of data format/volume/variety that are introduced by working with it in the Opaque platform?
How are you approaching the balance of maintaining the MC2 project against the product needs of the Opaque platform?
What are the most interesting, innovative, or unexpected ways that you have seen the Opaque platform used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on Opaque Systems/MC2?
When is Opaque the wrong choice?
What do you have planned for the future of the Opaque platform?
Contact Info
LinkedIn (https://www.linkedin.com/in/rishabh-poddar/)
Website (https://rishabhpoddar.com/)
@Podcastinator (https://twitter.com/podcastinator) on Twitter
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ (https://www.pythonpodcast.com) covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast (https://www.themachinelearningpodcast.com) helps you go from idea to production with machine learning.
Visit the site (https://www.dataengineeringpodcast.com) to subscribe to the show, sign up for the mailing list, and read the show notes.
If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com (mailto:hosts@dataengineeringpodcast.com) with your story.
To help other people find the show please leave a review on Apple Podcasts (https://podcasts.apple.com/us/podcast/data-engineering-podcast/id1193040557) and tell your friends and co-workers
Links
Opaque Systems (https://opaque.co/)
UC Berkeley RISE Lab (https://rise.cs.berkeley.edu/)
TLS (https://en.wikipedia.org/wiki/Transport_Layer_Security)
MC² (https://mc2-project.github.io/)
Homomorphic Encryption (https://en.wikipedia.org/wiki/Homomorphic_encryption)
Secure Multi-Party Computation (https://en.wikipedia.org/wiki/Secure_multi-party_computation)
Secure Enclaves (https://opaque.co/blog/what-are-secure-enclaves/)
Differential Privacy (https://en.wikipedia.org/wiki/Differential_privacy)
Data Obfuscation (https://en.wikipedia.org/wiki/Data_masking)
AES == Advanced Encryption Standard (https://en.wikipedia.org/wiki/Advanced_Encryption_Standard)
Intel SGX (Software Guard Extensions) (https://www.intel.com/content/www/us/en/developer/tools/software-guard-extensions/overview.html)
Intel TDX (Trust Domain Extensions) (https://www.intel.com/content/www/us/en/developer/articles/technical/intel-trust-domain-extensions.html)
TPC-H Benchmark (https://www.tpc.org/tpch/)
Spark (https://spark.apache.org/)
Trino (https://trino.io/)
PyTorch (https://pytorch.org/)
Tensorflow (https://www.tensorflow.org/)
The intro and outro music is from The Hug (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/Love_death_and_a_drunken_monkey/04_-_The_Hug) by The Freak Fandango Orchestra (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/) / CC BY-SA (http://creativecommons.org/licenses/by-sa/3.0/)
12/26/2022 • 1 hour, 8 minutes, 25 seconds
Revisit The Fundamental Principles Of Working With Data To Avoid Getting Caught In The Hype Cycle
Summary
The data ecosystem has seen a constant flurry of activity for the past several years, and it shows no signs of slowing down. With all of the products, techniques, and buzzwords being discussed, it can be easy to be overwhelmed by the hype. In this episode Juan Sequeda and Tim Gasper from data.world share their views on the core principles that you can use to ground your work and avoid getting caught in the hype cycles.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you're ready to build your next pipeline, or want to test out the projects you hear about on the show, you'll need somewhere to deploy it, so check out our friends at Linode. With their new managed database service you can launch a production ready MySQL, Postgres, or MongoDB cluster in minutes, with automated backups, 40 Gbps connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode (https://www.dataengineeringpodcast.com/linode) today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don't forget to thank them for their continued support of this show!
Modern data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days or even weeks. By the time errors have made their way into production, it’s often too late and damage is done. Datafold built automated regression testing to help data and analytics engineers deal with data quality in their pull requests. Datafold shows how a change in SQL code affects your data, both on a statistical level and down to individual rows and values before it gets merged to production. No more shipping and praying, you can now know exactly what will change in your database! Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI workflows. Visit dataengineeringpodcast.com/datafold (https://www.dataengineeringpodcast.com/datafold) today to book a demo with Datafold.
RudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their extensive library of integrations enable you to automatically send data to hundreds of downstream tools. Sign up free at dataengineeringpodcast.com/rudder (https://www.dataengineeringpodcast.com/rudder)
Build Data Pipelines. Not DAGs. That’s the spirit behind Upsolver SQLake, a new self-service data pipeline platform that lets you build batch and streaming pipelines without falling into the black hole of DAG-based orchestration. All you do is write a query in SQL to declare your transformation, and SQLake will turn it into a continuous pipeline that scales to petabytes and delivers up-to-the-minute fresh data. SQLake supports a broad set of transformations, including high-cardinality joins, aggregations, upserts and window operations. Output data can be streamed into a data lake for query engines like Presto, Trino or Spark SQL, a data warehouse like Snowflake or Redshift, or any other destination you choose. Pricing for SQLake is simple. You pay $99 per terabyte ingested into your data lake using SQLake, and run unlimited transformation pipelines for free. That way data engineers and data users can process to their heart’s content without worrying about their cloud bill. For data engineering podcast listeners, we’re offering a 30 day trial with unlimited data, so go to dataengineeringpodcast.com/upsolver (https://www.dataengineeringpodcast.com/upsolver) today and see for yourself how to avoid DAG hell.
Your host is Tobias Macey and today I'm interviewing Juan Sequeda and Tim Gasper about their views on the role of the data mesh paradigm for driving re-assessment of the foundational principles of data systems
Interview
Introduction
How did you get involved in the area of data management?
What are the areas of the data ecosystem that you see the most turmoil and confusion?
The past couple of years have brought a lot of attention to the idea of the "modern data stack". How has that influenced the ways that your and your customers' teams think about what skills they need to be effective?
The other topic that is introducing a lot of confusion and uncertainty is the "data mesh". How has that changed the ways that teams think about who is involved in the technical and design conversations around data in an organization?
Now that we, as an industry, have reached a new generational inflection about how data is generated, processed, and used, what are some of the foundational principles that have proven their worth?
What are some of the new lessons that are showing the greatest promise?
data modeling
data platform/infrastructure
data collaboration
data governance/security/privacy
How does your work at data.world support these foundational practices?
What are some of the ways that you work with your teams and customers to help them stay informed on industry practices?
What is your process for understanding the balance between hype and reality as you encounter new ideas/technologies?
What are some of the notable changes that have happened in the data.world product and market since I last had Bryon on the show in 2017?
What are the most interesting, innovative, or unexpected ways that you have seen data.world used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on data.world?
When is data.world the wrong choice?
What do you have planned for the future of data.world?
Contact Info
Juan
LinkedIn (https://www.linkedin.com/in/juansequeda/)
@juansequeda (https://twitter.com/juansequeda) on Twitter
Website (https://www.juansequeda.com/)
Tim
LinkedIn (https://www.linkedin.com/in/timgasper/)
@TimGasper (https://twitter.com/TimGasper) on Twitter
Website (https://www.timgasper.com/)
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast (https://www.themachinelearningpodcast.com) helps you go from idea to production with machine learning.
Visit the site (https://www.dataengineeringpodcast.com) to subscribe to the show, sign up for the mailing list, and read the show notes.
If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com (mailto:hosts@dataengineeringpodcast.com) with your story.
To help other people find the show please leave a review on Apple Podcasts (https://podcasts.apple.com/us/podcast/data-engineering-podcast/id1193040557) and tell your friends and co-workers
Links
data.world (https://data.world/)
Podcast Episode (https://www.dataengineeringpodcast.com/data-dot-world-with-bryon-jacob-episode-9/)
Gartner Hype Cycle (https://www.gartner.com/en/information-technology/glossary/hype-cycle)
Data Mesh (https://www.thoughtworks.com/en-us/what-we-do/data-and-ai/data-mesh)
Modern Data Stack (https://tanay.substack.com/p/understanding-the-modern-data-stack)
DataOps (https://en.wikipedia.org/wiki/DataOps)
Data Observability (https://www.montecarlodata.com/blog-what-is-data-observability/)
Data & AI Landscape (https://mattturck.com/data2021/)
DataDog (https://www.datadoghq.com/)
RDF == Resource Description Framework (https://en.wikipedia.org/wiki/Resource_Description_Framework)
SPARQL (https://en.wikipedia.org/wiki/SPARQL)
Moshe Vardi (https://en.wikipedia.org/wiki/Moshe_Vardi)
Star Schema (https://en.wikipedia.org/wiki/Star_schema)
Data Vault (https://en.wikipedia.org/wiki/Data_vault_modeling)
Podcast Episode (https://www.dataengineeringpodcast.com/data-vault-data-modeling-episode-119/)
BPMN == Business Process Modeling Notation (https://en.wikipedia.org/wiki/Business_Process_Model_and_Notation)
The intro and outro music is from The Hug (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/Love_death_and_a_drunken_monkey/04_-_The_Hug) by The Freak Fandango Orchestra (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/) / CC BY-SA (http://creativecommons.org/licenses/by-sa/3.0/)
12/19/2022 • 1 hour, 5 minutes, 29 seconds
Making Sense Of The Technical And Organizational Considerations Of Data Contracts
Summary
One of the reasons that data work is so challenging is because no single person or team owns the entire process. This introduces friction in the process of collecting, processing, and using data. In order to reduce the potential for broken pipelines, some teams have started to adopt the idea of data contracts. In this episode Abe Gong brings his experiences with the Great Expectations project and community to discuss the technical and organizational considerations involved in implementing these constraints in your data workflows.
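To make the idea more concrete, here is a rough, hedged sketch of a data contract expressed as a Pydantic model (Pydantic comes up in the episode links as one enforcement option). The dataset, field names, and thresholds are hypothetical, and a real contract would be negotiated between the producing and consuming teams, including whether a violation blocks the pipeline or only raises a notification.

```python
# A minimal, illustrative data contract as a Pydantic model.
# The schema and fields are hypothetical, not from the episode.
from datetime import datetime
from pydantic import BaseModel, ValidationError, condecimal


class OrderRecord(BaseModel):
    order_id: int
    customer_id: int
    amount: condecimal(ge=0)  # amounts must be non-negative per the contract
    created_at: datetime


def contract_violations(rows: list[dict]) -> list[str]:
    """Validate a batch of raw rows and return human-readable violations."""
    violations = []
    for i, row in enumerate(rows):
        try:
            OrderRecord(**row)
        except ValidationError as err:
            violations.append(f"row {i}: {err.errors()}")
    return violations


if __name__ == "__main__":
    sample = [
        {"order_id": 1, "customer_id": 42, "amount": "19.99", "created_at": "2022-12-01T00:00:00"},
        {"order_id": 2, "customer_id": 43, "amount": "-5.00", "created_at": "2022-12-02T00:00:00"},
    ]
    # Whether these violations block the pipeline or only notify someone is an
    # organizational decision, which is one of the themes of the interview.
    for violation in contract_violations(sample):
        print(violation)
```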
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you're ready to build your next pipeline, or want to test out the projects you hear about on the show, you'll need somewhere to deploy it, so check out our friends at Linode. With their new managed database service you can launch a production ready MySQL, Postgres, or MongoDB cluster in minutes, with automated backups, 40 Gbps connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode (https://www.dataengineeringpodcast.com/linode) today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don't forget to thank them for their continued support of this show!
Atlan is the metadata hub for your data ecosystem. Instead of locking your metadata into a new silo, unleash its transformative potential with Atlan's active metadata capabilities. Push information about data freshness and quality to your business intelligence, automatically scale up and down your warehouse based on usage patterns, and let the bots answer those questions in Slack so that the humans can focus on delivering real value. Go to dataengineeringpodcast.com/atlan (https://www.dataengineeringpodcast.com/atlan) today to learn more about how Atlan’s active metadata platform is helping pioneering data teams like Postman, Plaid, WeWork & Unilever achieve extraordinary things with metadata and escape the chaos.
Struggling with broken pipelines? Stale dashboards? Missing data? If this resonates with you, you’re not alone. Data engineers struggling with unreliable data need look no further than Monte Carlo, the leading end-to-end Data Observability Platform! Trusted by the data teams at Fox, JetBlue, and PagerDuty, Monte Carlo solves the costly problem of broken data pipelines. Monte Carlo monitors and alerts for data issues across your data warehouses, data lakes, dbt models, Airflow jobs, and business intelligence tools, reducing time to detection and resolution from weeks to just minutes. Monte Carlo also gives you a holistic picture of data health with automatic, end-to-end lineage from ingestion to the BI layer directly out of the box. Start trusting your data with Monte Carlo today! Visit dataengineeringpodcast.com/montecarlo (http://www.dataengineeringpodcast.com/montecarlo) to learn more.
Your host is Tobias Macey and today I'm interviewing Abe Gong about the technical and organizational implementation of data contracts
Interview
Introduction
How did you get involved in the area of data management?
Can you describe what your conception of a data contract is?
What are some of the ways that you have seen them implemented?
How has your work on Great Expectations influenced your thinking on the strategic and tactical aspects of adopting/implementing data contracts in a given team/organization?
What does the negotiation process look like for identifying what needs to be included in a contract?
What are the interfaces/integration points where data contracts are most useful/necessary?
What are the discussions that need to happen when deciding when/whether a contract "violation" is a blocking action vs. issuing a notification?
At what level of detail/granularity are contracts most helpful?
At the technical level, what does the implementation/integration/deployment of a contract look like?
What are the most interesting, innovative, or unexpected ways that you have seen data contracts used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on data contracts/great expectations?
When are data contracts the wrong choice?
What do you have planned for the future of data contracts in great expectations?
Contact Info
LinkedIn (https://www.linkedin.com/in/abe-gong-8a77034/)
@AbeGong (https://twitter.com/AbeGong) on Twitter
Website (https://www.abegong.com/)
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast (https://www.themachinelearningpodcast.com) helps you go from idea to production with machine learning.
Visit the site (https://www.dataengineeringpodcast.com) to subscribe to the show, sign up for the mailing list, and read the show notes.
If you've learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com (mailto:hosts@dataengineeringpodcast.com) with your story.
To help other people find the show please leave a review on Apple Podcasts (https://podcasts.apple.com/us/podcast/data-engineering-podcast/id1193040557) and tell your friends and co-workers
Links
Great Expectations (https://greatexpectations.io/)
Podcast Episode (https://www.dataengineeringpodcast.com/great-expectations-technical-debt-data-pipeline-episode-117/)
Progressive Typing (https://en.wikipedia.org/wiki/Gradual_typing)
Pioneers, Settlers, Town Planners (https://blog.gardeviance.org/2015/03/on-pioneers-settlers-town-planners-and.html)
Pydantic (https://pydantic-docs.helpmanual.io/)
Podcast.__init__ Episode (https://www.pythonpodcast.com/pydantic-data-validation-episode-263/)
Typescript (https://www.typescriptlang.org/)
Duck Typing (https://en.wikipedia.org/wiki/Duck_typing)
Flyte (https://flyte.org/)
Podcast Episode (https://www.dataengineeringpodcast.com/flyte-data-orchestration-machine-learning-episode-291/)
Dagster (https://dagster.io/)
Podcast Episode (https://www.dataengineeringpodcast.com/dagster-software-defined-assets-data-orchestration-episode-309)
Trino (https://trino.io/)
The intro and outro music is from The Hug (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/Love_death_and_a_drunken_monkey/04_-_The_Hug) by The Freak Fandango Orchestra (http://freemusicarchive.org/music/The_Freak_Fandango_Orchestra/) / CC BY-SA (http://creativecommons.org/licenses/by-sa/3.0/)
12/19/2022 • 47 minutes
Convert Your Unstructured Data To Embedding Vectors For More Efficient Machine Learning With Towhee
Preamble
This is a cross-over episode from our new show The Machine Learning Podcast, the show about going from idea to production with machine learning.
Summary
Data is one of the core ingredients for machine learning, but the format in which it is understandable to humans is not a useful representation for models. Embedding vectors are a way to structure data in a way that is native to how models interpret and manipulate information. In this episode Frank Liu shares how the Towhee library simplifies the work of translating your unstructured data assets (e.g. images, audio, video, etc.) into embeddings that you can use efficiently for machine learning, and how it fits into your workflow for model development.
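As a rough illustration of what "translating an asset into an embedding" looks like in code, the sketch below uses a pretrained torchvision model to turn an image into a 512-dimensional vector. This is a generic example rather than Towhee's own pipeline API, and the file paths are hypothetical.

```python
# Generic sketch: image file -> embedding vector with a pretrained ResNet.
# This is not Towhee's API; it only illustrates the concept from the episode.
import torch
import torchvision.models as models
import torchvision.transforms as T
from PIL import Image

# Drop the final classification layer so the network emits a 512-d feature
# vector instead of class scores.
backbone = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
backbone.fc = torch.nn.Identity()
backbone.eval()

preprocess = T.Compose([
    T.Resize(256),
    T.CenterCrop(224),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])


def embed(path: str) -> torch.Tensor:
    """Return an L2-normalized embedding for the image at `path`."""
    image = Image.open(path).convert("RGB")
    with torch.no_grad():
        vector = backbone(preprocess(image).unsqueeze(0)).squeeze(0)
    return vector / vector.norm()


# Similarity between two assets is then just a dot product, which is what a
# vector database such as Milvus indexes at scale (hypothetical file names).
# similarity = embed("cat_a.jpg") @ embed("cat_b.jpg")
```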
Announcements
Hello and welcome to the Machine Learning Podcast, the podcast about machine learning and how to bring it from idea to delivery.
Building good ML models is hard, but testing them properly is even harder. At Deepchecks, they built an open-source testing framework that follows best practices, ensuring that your models behave as expected. Get started quickly using their built-in library of checks for testing and validating your model’s behavior and performance, and extend it to meet your specific needs as your model evolves. Accelerate your machine learning projects by building trust in your models and automating the testing that you used to do manually. Go to themachinelearningpodcast.com/deepchecks today to get started!
Your host is Tobias Macey and today I’m interviewing Frank Liu about how to use vector embeddings in your ML projects and how Towhee can reduce the effort involved
Interview
Introduction
How did you get involved in machine learning?
Can you describe what Towhee is and the story behind it?
What is the problem that Towhee is aimed at solving?
What are the elements of generating vector embeddings that pose the greatest challenge or require the most effort?
Once you have an embedding, what are some of the ways that it might be used in a machine learning project?
Are there any design considerations that need to be addressed in the form that an embedding takes and how it impacts the resultant model that relies on it? (whether for training or inference)
Can you describe how the Towhee framework is implemented?
What are some of the interesting engineering challenges that needed to be addressed?
How have the design/goals/scope of the project shifted since it began?
What is the workflow for someone using Towhee in the context of an ML project?
What are some of the types of optimizations that you have incorporated into Towhee?
What are some of the scaling considerations that users need to be aware of as they increase the volume or complexity of data that they are processing?
What are some of the ways that using Towhee impacts the way a data scientist or ML engineer approaches the design and development of their model code?
What are the interfaces available for integrating with and extending Towhee?
What are the most interesting, innovative, or unexpected ways that you have seen Towhee used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on Towhee?
When is Towhee the wrong choice?
What do you have planned for the future of Towhee?
Contact Info
LinkedIn
fzliu on GitHub
Website
@frankzliu on Twitter
Parting Question
From your perspective, what is the biggest barrier to adoption of machine learning today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other shows. The Data Engineering Podcast covers the latest on modern data management. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@themachinelearningpodcast.com with your story.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Links
Towhee
Zilliz
Milvus
Data Engineering Podcast Episode
Computer Vision
Tensor
Autoencoder
Latent Space
Diffusion Model
HSL == Hue, Saturation, Lightness
Weights and Biases
The intro and outro music is from Hitman’s Lovesong feat. Paola Graziano by The Freak Fandango Orchestra/CC BY-SA 3.0
12/12/2022 • 53 minutes, 45 seconds
Run Your Applications Worldwide Without Worrying About The Database With Planetscale
Summary
One of the most critical aspects of any software project is managing its data. Managing the operational concerns for your database can be complex and expensive, especially if you need to scale to large volumes of data, high traffic, or geographically distributed usage. Planetscale is a serverless option for your MySQL workloads that lets you focus on your applications without having to worry about managing the database or fighting with differences between development and production. In this episode Nick van Wiggeren explains how the Planetscale platform is implemented, their strategies for balancing maintenance and improvements of the underlying Vitess project with their business goals, and how you can start using it today to free up the time you spend on database administration.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their new managed database service you can launch a production ready MySQL, Postgres, or MongoDB cluster in minutes, with automated backups, 40 Gbps connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don’t forget to thank them for their continued support of this show!
Modern data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days or even weeks. By the time errors have made their way into production, it’s often too late and damage is done. Datafold built automated regression testing to help data and analytics engineers deal with data quality in their pull requests. Datafold shows how a change in SQL code affects your data, both on a statistical level and down to individual rows and values before it gets merged to production. No more shipping and praying, you can now know exactly what will change in your database! Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI workflows. Visit dataengineeringpodcast.com/datafold today to book a demo with Datafold.
RudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their extensive library of integrations enable you to automatically send data to hundreds of downstream tools. Sign up free at dataengineeringpodcast.com/rudder
Build Data Pipelines. Not DAGs. That’s the spirit behind Upsolver SQLake, a new self-service data pipeline platform that lets you build batch and streaming pipelines without falling into the black hole of DAG-based orchestration. All you do is write a query in SQL to declare your transformation, and SQLake will turn it into a continuous pipeline that scales to petabytes and delivers up-to-the-minute fresh data. SQLake supports a broad set of transformations, including high-cardinality joins, aggregations, upserts and window operations. Output data can be streamed into a data lake for query engines like Presto, Trino or Spark SQL, a data warehouse like Snowflake or Redshift, or any other destination you choose. Pricing for SQLake is simple. You pay $99 per terabyte ingested into your data lake using SQLake, and run unlimited transformation pipelines for free. That way data engineers and data users can process to their heart’s content without worrying about their cloud bill. For data engineering podcast listeners, we’re offering a 30 day trial with unlimited data, so go to dataengineeringpodcast.com/upsolver today and see for yourself how to avoid DAG hell.
Your host is Tobias Macey and today I’m interviewing Nick van Wiggeren about Planetscale, a serverless and globally distributed MySQL database as a service
Interview
Introduction
How did you get involved in the area of data management?
Can you describe what Planetscale is and the story behind it?
What are the core problems that you are solving with the Planetscale platform?
How might an engineering team address those challenges in the absence of Planetscale/Vitess?
Can you describe how Planetscale is implemented?
What are some of the add-ons that you have had to build on top of Vitess to make Planetscale work?
What are the impacts that a serverless database has on the way teams approach their application/platform design and development?
metrics exposed to help users optimize their usage
What is your policy/philosophy for determining what capabilities to include in Vitess and what belongs in the Planetscale platform?
What are the most interesting, innovative, or unexpected ways that you have seen Planetscale/Vitess used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on Planetscale?
When is Planetscale the wrong choice?
What do you have planned for the future of Planetscale?
Contact Info
@nickvanwig on Twitter
LinkedIn
nickvanw on GitHub
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers
Links
Planetscale
Vitess
CNCF == Cloud Native Computing Foundation
Hadoop
OLTP == Online Transactional Processing
Galera
Yugabyte DB
Podcast Episode
CitusDB
MariaDB SkySQL
Podcast Episode
CockroachDB
Podcast Episode
NewSQL
AWS PrivateLink
Planetscale Connect
Segment
Podcast Episode
BigQuery
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
12/12/2022 • 49 minutes, 40 seconds
Business Intelligence In The Palm Of Your Hand With Zing Data
Summary
Business intelligence is the foremost application of data in organizations of all sizes. The typical conception of how it is accessed is through a web or desktop application running on a powerful laptop. Zing Data is building a mobile native platform for business intelligence. This opens the door for busy employees to access and analyze their company information away from their desk, but it has the more powerful effect of bringing first-class support to companies operating in mobile-first economies. In this episode Sabin Thomas shares his experiences building the platform and the interesting ways that it is being used.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their new managed database service you can launch a production ready MySQL, Postgres, or MongoDB cluster in minutes, with automated backups, 40 Gbps connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don’t forget to thank them for their continued support of this show!
Atlan is the metadata hub for your data ecosystem. Instead of locking your metadata into a new silo, unleash its transformative potential with Atlan’s active metadata capabilities. Push information about data freshness and quality to your business intelligence, automatically scale up and down your warehouse based on usage patterns, and let the bots answer those questions in Slack so that the humans can focus on delivering real value. Go to dataengineeringpodcast.com/atlan today to learn more about how Atlan’s active metadata platform is helping pioneering data teams like Postman, Plaid, WeWork & Unilever achieve extraordinary things with metadata and escape the chaos.
Data engineers don’t enjoy writing, maintaining, and modifying ETL pipelines all day, every day. Especially once they realize 90% of all major data sources like Google Analytics, Salesforce, Adwords, Facebook, Spreadsheets, etc., are already available as plug-and-play connectors with reliable, intuitive SaaS solutions. Hevo Data is a highly reliable and intuitive data pipeline platform used by data engineers from 40+ countries to set up and run low-latency ELT pipelines with zero maintenance. Boasting more than 150 out-of-the-box connectors that can be set up in minutes, Hevo also allows you to monitor and control your pipelines. You get: real-time data flow visibility, fail-safe mechanisms, and alerts if anything breaks; preload transformations and auto-schema mapping precisely control how data lands in your destination; models and workflows to transform data for analytics; and reverse-ETL capability to move the transformed data back to your business software to inspire timely action. All of this, plus its transparent pricing and 24*7 live support, makes it consistently voted by users as the Leader in the Data Pipeline category on review platforms like G2. Go to dataengineeringpodcast.com/hevodata and sign up for a free 14-day trial that also comes with 24×7 support.
Struggling with broken pipelines? Stale dashboards? Missing data? If this resonates with you, you’re not alone. Data engineers struggling with unreliable data need look no further than Monte Carlo, the leading end-to-end Data Observability Platform! Trusted by the data teams at Fox, JetBlue, and PagerDuty, Monte Carlo solves the costly problem of broken data pipelines. Monte Carlo monitors and alerts for data issues across your data warehouses, data lakes, dbt models, Airflow jobs, and business intelligence tools, reducing time to detection and resolution from weeks to just minutes. Monte Carlo also gives you a holistic picture of data health with automatic, end-to-end lineage from ingestion to the BI layer directly out of the box. Start trusting your data with Monte Carlo today! Visit dataengineeringpodcast.com/montecarlo to learn more.
Your host is Tobias Macey and today I’m interviewing Sabin Thomas about Zing Data, a mobile-friendly business intelligence platform
Interview
Introduction
How did you get involved in the area of data management?
Can you describe what Zing Data is and the story behind it?
Why is mobile access to a business intelligence system important?
What does it mean for a business intelligence system to be mobile friendly? (e.g. just looking at charts vs. creating reports, etc.)
What are the interaction patterns that don’t translate well to mobile from web or desktop BI systems?
What are the new interaction patterns that are enabled by the mobile experience?
What are the capabilities that a native app can provide which would be clunky or impossible as a web app on a mobile device?
Who are the personas that benefit from a product like Zing Data?
Can you describe how the platform (backend and app) are implemented?
How have the design and goals of the system changed/evolved since you started working on it?
Can you describe a typical workflow for a team that uses Zing?
Is it typically the sole/primary BI system, or is it more of an augmentation?
What are the most interesting, innovative, or unexpected ways that you have seen Zing used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on Zing?
When is Zing the wrong choice?
What do you have planned for the future of Zing Data?
Contact Info
LinkedIn
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers
Links
Zing Data
Rakuten
Flutter
Cordova
React Native
T-SQL
ANSI SQL
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
12/5/2022 • 46 minutes, 46 seconds
Adopting Real-Time Data At Organizations Of Every Size
Summary
The term "real-time data" brings with it a combination of excitement, uncertainty, and skepticism. The promise of insights that are always accurate and up to date is appealing to organizations, but the technical realities to make it possible have been complex and expensive. In this episode Arjun Narayan explains how the technical barriers to adopting real-time data in your analytics and applications have become surmountable by organizations of all sizes.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their new managed database service you can launch a production ready MySQL, Postgres, or MongoDB cluster in minutes, with automated backups, 40 Gbps connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don’t forget to thank them for their continued support of this show!
Modern data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days or even weeks. By the time errors have made their way into production, it’s often too late and damage is done. Datafold built automated regression testing to help data and analytics engineers deal with data quality in their pull requests. Datafold shows how a change in SQL code affects your data, both on a statistical level and down to individual rows and values before it gets merged to production. No more shipping and praying, you can now know exactly what will change in your database! Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI workflows. Visit dataengineeringpodcast.com/datafold today to book a demo with Datafold.
RudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their extensive library of integrations enable you to automatically send data to hundreds of downstream tools. Sign up free at dataengineeringpodcast.com/rudder
Build Data Pipelines. Not DAGs. That’s the spirit behind Upsolver SQLake, a new self-service data pipeline platform that lets you build batch and streaming pipelines without falling into the black hole of DAG-based orchestration. All you do is write a query in SQL to declare your transformation, and SQLake will turn it into a continuous pipeline that scales to petabytes and delivers up-to-the-minute fresh data. SQLake supports a broad set of transformations, including high-cardinality joins, aggregations, upserts and window operations. Output data can be streamed into a data lake for query engines like Presto, Trino or Spark SQL, a data warehouse like Snowflake or Redshift, or any other destination you choose. Pricing for SQLake is simple. You pay $99 per terabyte ingested into your data lake using SQLake, and run unlimited transformation pipelines for free. That way data engineers and data users can process to their heart’s content without worrying about their cloud bill. For data engineering podcast listeners, we’re offering a 30 day trial with unlimited data, so go to dataengineeringpodcast.com/upsolver today and see for yourself how to avoid DAG hell.
Your host is Tobias Macey and today I’m interviewing Arjun Narayan about the benefits of real-time data for teams of all sizes
Interview
Introduction
How did you get involved in the area of data management?
Can you describe what your conception of real-time data is and the benefits that it can provide?
types of organizations/teams who are adopting real-time
consumers of real-time data
locations in data/application stacks where real-time needs to be integrated
challenges (technical/infrastructure/talent) involved in adopting/supporting streaming/real-time
lessons learned working with early customers that influenced design/implementation of Materialize to simplify adoption of real-time
types of queries that are run on materialize vs. warehouse
how real-time changes the way stakeholders think about the data
sourcing real-time data
What are the most interesting, innovative, or unexpected ways that you have seen real-time data used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on Materialize to support real-time data applications?
When is real-time the wrong choice?
What do you have planned for the future of Materialize and real-time data?
Contact Info
@narayanarjun on Twitter
Email
LinkedIn
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers
Links
Materialize
Podcast Episode
Cockroach Labs
Podcast Episode
SQL
Kafka
Debezium
Podcast Episode
Change Data Capture
Reverse ETL
Pulsar
Podcast Episode
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
12/5/2022 • 50 minutes, 24 seconds
Supporting And Expanding The Arrow Ecosystem For Fast And Efficient Data Processing At Voltron Data
Summary
The data ecosystem has been growing rapidly, with new communities joining and bringing their preferred programming languages to the mix. This has led to inefficiencies in how data is stored, accessed, and shared across process and system boundaries. The Arrow project is designed to eliminate wasted effort in translating between languages, and Voltron Data was created to help grow and support its technology and community. In this episode Wes McKinney shares the ways that Arrow and its related projects are improving the efficiency of data systems and driving their next stage of evolution.
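For a small, hedged sense of what that shared representation buys you, the sketch below moves a pandas DataFrame into an Arrow table and out to Parquet with pyarrow. The column names and file path are made up for the example, and the same table could equally be handed to Flight, DataFusion, or another Arrow-aware runtime without re-encoding the data row by row.

```python
# Illustrative only: pandas -> Arrow -> Parquet and back with pyarrow.
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

df = pd.DataFrame({
    "user_id": [1, 2, 3],
    "latency_ms": [12.5, 7.3, 21.0],
})

# Converting to an Arrow table is close to zero-copy for many column types,
# because both libraries agree on the columnar memory layout.
table = pa.Table.from_pandas(df)

# The same buffers can be written to Parquet (or shipped over Arrow Flight)
# without a per-row translation step.
pq.write_table(table, "latency.parquet")
roundtrip = pq.read_table("latency.parquet").to_pandas()
print(roundtrip.equals(df))
```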
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their new managed database service you can launch a production ready MySQL, Postgres, or MongoDB cluster in minutes, with automated backups, 40 Gbps connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don’t forget to thank them for their continued support of this show!
Atlan is the metadata hub for your data ecosystem. Instead of locking your metadata into a new silo, unleash its transformative potential with Atlan’s active metadata capabilities. Push information about data freshness and quality to your business intelligence, automatically scale up and down your warehouse based on usage patterns, and let the bots answer those questions in Slack so that the humans can focus on delivering real value. Go to dataengineeringpodcast.com/atlan today to learn more about how Atlan’s active metadata platform is helping pioneering data teams like Postman, Plaid, WeWork & Unilever achieve extraordinary things with metadata and escape the chaos.
Struggling with broken pipelines? Stale dashboards? Missing data? If this resonates with you, you’re not alone. Data engineers struggling with unreliable data need look no further than Monte Carlo, the leading end-to-end Data Observability Platform! Trusted by the data teams at Fox, JetBlue, and PagerDuty, Monte Carlo solves the costly problem of broken data pipelines. Monte Carlo monitors and alerts for data issues across your data warehouses, data lakes, dbt models, Airflow jobs, and business intelligence tools, reducing time to detection and resolution from weeks to just minutes. Monte Carlo also gives you a holistic picture of data health with automatic, end-to-end lineage from ingestion to the BI layer directly out of the box. Start trusting your data with Monte Carlo today! Visit dataengineeringpodcast.com/montecarlo to learn more.
Data engineers don’t enjoy writing, maintaining, and modifying ETL pipelines all day, every day. Especially once they realize 90% of all major data sources like Google Analytics, Salesforce, Adwords, Facebook, Spreadsheets, etc., are already available as plug-and-play connectors with reliable, intuitive SaaS solutions. Hevo Data is a highly reliable and intuitive data pipeline platform used by data engineers from 40+ countries to set up and run low-latency ELT pipelines with zero maintenance. Boasting more than 150 out-of-the-box connectors that can be set up in minutes, Hevo also allows you to monitor and control your pipelines. You get: real-time data flow visibility, fail-safe mechanisms, and alerts if anything breaks; preload transformations and auto-schema mapping precisely control how data lands in your destination; models and workflows to transform data for analytics; and reverse-ETL capability to move the transformed data back to your business software to inspire timely action. All of this, plus its transparent pricing and 24*7 live support, makes it consistently voted by users as the Leader in the Data Pipeline category on review platforms like G2. Go to dataengineeringpodcast.com/hevodata and sign up for a free 14-day trial that also comes with 24×7 support.
Your host is Tobias Macey and today I’m interviewing Wes McKinney about his work at Voltron Data and on the Arrow ecosystem
Interview
Introduction
How did you get involved in the area of data management?
Can you describe what you are building at Voltron Data and the story behind it?
What is the vision for the broader data ecosystem that you are trying to realize through your investment in Arrow and related projects?
How does your work at Voltron Data contribute to the realization of that vision?
What is the impact on engineer productivity and compute efficiency that gets introduced by the impedance mismatches between language and framework representations of data?
The scope and capabilities of the Arrow project have grown substantially since it was first introduced. Can you give an overview of the current features and extensions to the project?
What are some of the ways that Arrow and its related projects can be integrated with or replace the different elements of a data platform?
Can you describe how Arrow is implemented?
What are the most complex/challenging aspects of the engineering needed to support interoperable data interchange between language runtimes?
How are you balancing the desire to move quickly and improve the Arrow protocol and implementations, with the need to wait for other players in the ecosystem (e.g. database engines, compute frameworks, etc.) to add support?
With the growing application of data formats such as graphs and vectors, what do you see as the role of Arrow and its ideas in those use cases?
For workflows that rely on integrating structured and unstructured data, what are the options for interaction with non-tabular data? (e.g. images, documents, etc.)
With your support-focused business model, how are you approaching marketing and customer education to make it viable and scalable?
What are the most interesting, innovative, or unexpected ways that you have seen Arrow used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on Arrow and its ecosystem?
When is Arrow the wrong choice?
What do you have planned for the future of Arrow?
Contact Info
Website
wesm on GitHub
@wesmckinn on Twitter
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers
Links
Voltron Data
Pandas
Podcast Episode
Apache Arrow
Partial Differential Equation
FPGA == Field-Programmable Gate Array
GPU == Graphics Processing Unit
Ursa Labs
Voltron (cartoon)
Feature Engineering
PySpark
Substrait
Arrow Flight
Acero
Arrow Datafusion
Velox
Ibis
SIMD == Single Instruction, Multiple Data
Lance
DuckDB
Podcast Episode
Data Threads Conference
Nano-Arrow
Arrow ADBC Protocol
Apache Iceberg
Podcast Episode
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
11/28/2022 • 50 minutes, 25 seconds
Analyze Massive Data At Interactive Speeds With The Power Of Bitmaps Using FeatureBase
Summary
The most expensive part of working with massive data sets is the work of retrieving and processing the files that contain the raw information. FeatureBase (formerly Pilosa) avoids that overhead by converting the data into bitmaps. In this episode Matt Jaffee explains how to model your data as bitmaps and the benefits that this representation provides for fast aggregate computation. He also discusses the improvements that have been incorporated into FeatureBase to simplify integration with the rest of your data stack, and the SQL interface that was added to make working with the product easier.
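As a toy illustration of the representation (not FeatureBase's actual storage engine, which relies on compressed roaring bitmaps), the sketch below encodes each attribute value as a bitmap over record positions and answers an aggregate question with bitwise operations instead of scanning raw rows.

```python
# Toy bitmap index: one bitmap per (attribute, value) pair, where bit i is
# set when record i has that value. Python ints stand in for roaring bitmaps.
from collections import defaultdict

records = [
    {"country": "US", "plan": "pro"},
    {"country": "DE", "plan": "free"},
    {"country": "US", "plan": "free"},
    {"country": "US", "plan": "pro"},
]

bitmaps = defaultdict(int)
for i, rec in enumerate(records):
    for attr, value in rec.items():
        bitmaps[(attr, value)] |= 1 << i

# "How many US customers are on the pro plan?" becomes an AND plus a popcount.
matches = bitmaps[("country", "US")] & bitmaps[("plan", "pro")]
print(bin(matches).count("1"))  # -> 2
```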
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their new managed database service you can launch a production ready MySQL, Postgres, or MongoDB cluster in minutes, with automated backups, 40 Gbps connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don’t forget to thank them for their continued support of this show!
Modern data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days or even weeks. By the time errors have made their way into production, it’s often too late and damage is done. Datafold built automated regression testing to help data and analytics engineers deal with data quality in their pull requests. Datafold shows how a change in SQL code affects your data, both on a statistical level and down to individual rows and values before it gets merged to production. No more shipping and praying, you can now know exactly what will change in your database! Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI workflows. Visit dataengineeringpodcast.com/datafold today to book a demo with Datafold.
RudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their extensive library of integrations enable you to automatically send data to hundreds of downstream tools. Sign up free at dataengineeringpodcast.com/rudder
Build Data Pipelines. Not DAGs. That’s the spirit behind Upsolver SQLake, a new self-service data pipeline platform that lets you build batch and streaming pipelines without falling into the black hole of DAG-based orchestration. All you do is write a query in SQL to declare your transformation, and SQLake will turn it into a continuous pipeline that scales to petabytes and delivers up-to-the-minute fresh data. SQLake supports a broad set of transformations, including high-cardinality joins, aggregations, upserts and window operations. Output data can be streamed into a data lake for query engines like Presto, Trino or Spark SQL, a data warehouse like Snowflake or Redshift, or any other destination you choose. Pricing for SQLake is simple. You pay $99 per terabyte ingested into your data lake using SQLake, and run unlimited transformation pipelines for free. That way data engineers and data users can process to their heart’s content without worrying about their cloud bill. For data engineering podcast listeners, we’re offering a 30 day trial with unlimited data, so go to dataengineeringpodcast.com/upsolver today and see for yourself how to avoid DAG hell.
Your host is Tobias Macey and today I’m interviewing Matt Jaffee about FeatureBase (formerly known as Pilosa and Molecula), a real-time analytical database engine built on bitmaps
Interview
Introduction
How did you get involved in the area of data management?
Can you describe what FeatureBase is?
What are the use cases that it is designed and optimized for?
What are some applications or analyses that are uniquely suited to FeatureBase’s capabilities?
What are the notable changes/evolutions that it has gone through in recent years?
What are the forces in the broader data ecosystem that have had the greatest impact on your project/product focus?
What are the data modeling concepts that platform and data engineers need to consider when working with FeatureBase?
With bitmaps as the core data structure, what is involved in translating existing data into bitmaps?
How does schema evolution translate to the data representation used in FeatureBase?
How does the data model influence considerations around security policies and governance?
What are the most interesting, innovative, or unexpected ways that you have seen FeatureBase used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on FeatureBase?
When is FeatureBase the wrong choice?
What do you have planned for the future of FeatureBase?
Contact Info
LinkedIn
jaffee on GitHub
@mattjaffee on Twitter
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers
Links
FeatureBase
Pilosa Episode
Molecula Episode
Bitmap
Roaring Bitmaps
Pinecone
Podcast Episode
Milvus
Podcast Episode
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
11/28/2022 • 59 minutes, 24 seconds
Tame The Entropy In Your Data Stack And Prevent Failures With Sifflet
Summary
The problems that are easiest to fix are the ones that you prevent from happening in the first place. Sifflet is a platform that brings your entire data stack into focus to improve the reliability of your data assets and empower collaboration across your teams. In this episode CEO and founder Salma Bakouk shares her views on the causes and impacts of "data entropy" and how you can tame it before it leads to failures.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their new managed database service you can launch a production ready MySQL, Postgres, or MongoDB cluster in minutes, with automated backups, 40 Gbps connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don’t forget to thank them for their continued support of this show!
Modern data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days or even weeks. By the time errors have made their way into production, it’s often too late and damage is done. Datafold built automated regression testing to help data and analytics engineers deal with data quality in their pull requests. Datafold shows how a change in SQL code affects your data, both on a statistical level and down to individual rows and values before it gets merged to production. No more shipping and praying, you can now know exactly what will change in your database! Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI workflows. Visit dataengineeringpodcast.com/datafold today to book a demo with Datafold.
RudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their extensive library of integrations enable you to automatically send data to hundreds of downstream tools. Sign up free at dataengineeringpodcast.com/rudder
Data teams are increasingly under pressure to deliver. According to a recent survey by Ascend.io, 95% in fact reported being at or over capacity. With 72% of data experts reporting demands on their team going up faster than they can hire, it’s no surprise they are increasingly turning to automation. In fact, while only 3.5% report having current investments in automation, 85% of data teams plan on investing in automation in the next 12 months. 85%!!! That’s where our friends at Ascend.io come in. The Ascend Data Automation Cloud provides a unified platform for data ingestion, transformation, orchestration, and observability. Ascend users love its declarative pipelines, powerful SDK, elegant UI, and extensible plug-in architecture, as well as its support for Python, SQL, Scala, and Java. Ascend automates workloads on Snowflake, Databricks, BigQuery, and open source Spark, and can be deployed in AWS, Azure, or GCP. Go to dataengineeringpodcast.com/ascend and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $5,000 when you become a customer.
Your host is Tobias Macey and today I’m interviewing Salma Bakouk about achieving data reliability and reducing entropy within your data stack with sifflet
Interview
Introduction
How did you get involved in the area of data management?
Can you describe what Sifflet is and the story behind it?
What is the motivating goal for the company and product?
What are the categories of errors that you consider to be preventable?
How does the visibility provided by Sifflet contribute to those prevention efforts?
What are the UI/UX patterns that you rely on to allow for meaningful exploration and analysis of dependency chains/impact assessments in the lineage graph?
Can you describe how you’ve implemented Sifflet?
How have the scope and focus of the product evolved from when you first launched?
What is the workflow for someone getting Sifflet integrated into their data stack?
What are some of the data modeling considerations that need to be considered when pushing metadata to Sifflet?
What are the most interesting, innovative, or unexpected ways that you have seen Sifflet used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on Sifflet?
When is Sifflet the wrong choice?
What do you have planned for the future of Sifflet?
Contact Info
LinkedIn
@SalmaBakouk on Twitter
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show, then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers
Links
Sifflet
Data Observability
DataDog
NewRelic
Splunk
Modern Data Stack
GoCardless
Airbyte
Fivetran
ORM == Object Relational Mapping
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
11/21/2022 • 46 minutes, 46 seconds
A Look At The Data Systems Behind The Gameplay For League Of Legends
Summary
The majority of blog posts and presentations about data engineering and analytics assume that the consumers of those efforts are internal business users accessing an environment controlled by the business. In this episode Ian Schweer shares his experiences at Riot Games supporting player-focused features such as machine learning models and recommender systems that are deployed as part of the game binary. He explains the constraints that he and his team are faced with and the various challenges that they have overcome to build useful data products on top of a legacy platform where they don’t control the end-to-end systems.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their new managed database service you can launch a production ready MySQL, Postgres, or MongoDB cluster in minutes, with automated backups, 40 Gbps connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don’t forget to thank them for their continued support of this show!
Atlan is the metadata hub for your data ecosystem. Instead of locking your metadata into a new silo, unleash its transformative potential with Atlan’s active metadata capabilities. Push information about data freshness and quality to your business intelligence, automatically scale up and down your warehouse based on usage patterns, and let the bots answer those questions in Slack so that the humans can focus on delivering real value. Go to dataengineeringpodcast.com/atlan today to learn more about how Atlan’s active metadata platform is helping pioneering data teams like Postman, Plaid, WeWork & Unilever achieve extraordinary things with metadata and escape the chaos.
The biggest challenge with modern data systems is understanding what data you have, where it is located, and who is using it. Select Star’s data discovery platform solves that out of the box, with an automated catalog that includes lineage from where the data originated, all the way to which dashboards rely on it and who is viewing them every day. Just connect it to your database/data warehouse/data lakehouse/whatever you’re using and let them do the rest. Go to dataengineeringpodcast.com/selectstar today to double the length of your free trial and get a swag package when you convert to a paid plan.
Data engineers don’t enjoy writing, maintaining, and modifying ETL pipelines all day, every day. Especially once they realize 90% of all major data sources like Google Analytics, Salesforce, Adwords, Facebook, Spreadsheets, etc., are already available as plug-and-play connectors with reliable, intuitive SaaS solutions. Hevo Data is a highly reliable and intuitive data pipeline platform used by data engineers from 40+ countries to set up and run low-latency ELT pipelines with zero maintenance. Boasting more than 150 out-of-the-box connectors that can be set up in minutes, Hevo also allows you to monitor and control your pipelines. You get: real-time data flow visibility, fail-safe mechanisms, and alerts if anything breaks; preload transformations and auto-schema mapping precisely control how data lands in your destination; models and workflows to transform data for analytics; and reverse-ETL capability to move the transformed data back to your business software to inspire timely action. All of this, plus its transparent pricing and 24*7 live support, makes it consistently voted by users as the Leader in the Data Pipeline category on review platforms like G2. Go to dataengineeringpodcast.com/hevodata and sign up for a free 14-day trial that also comes with 24×7 support.
Your host is Tobias Macey and today I’m interviewing Ian Schweer about building the data systems that power League of Legends
Interview
Introduction
How did you get involved in the area of data management?
Can you describe what League of Legends is and the role that data plays in the experience?
What are the characteristics of the data that you are working with? (e.g. volume/variety/velocity, structured vs. unstructured, real-time vs. batch, etc.)
What are the biggest data-related challenges that you face (technically or organizationally)?
Multiplayer games are very sensitive to latency. How does that influence your approach to instrumentation/data collection in the end-user experience?
Can you describe the current architecture of your data platform?
What are the notable evolutions that it has gone through over the life of the game/product?
What are the capabilities that you are optimizing for in your platform architecture?
Given the longevity of the League of Legends product, what are the practices and design elements that you rely on to help onboard new team members?
What are the seams that you intentionally build in to allow for evolution of components and use cases?
What are the most interesting, innovative, or unexpected ways that you have seen data and its derivatives used by Riot Games or your players?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on the data stack for League of Legends?
What are the most interesting or informative mistakes that you have made (personally or as a team)?
What do you have planned for the future of the data stack at Riot Games?
Contact Info
LinkedIn
Github
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show, then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers
Links
Riot Games
League of Legends
Team Fight Tactics
Wild Rift
DoorDash
Podcast Interview
Decision Science
Kafka
Alation
Airflow
Spark
Monte Carlo
Podcast Episode
libtorch
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
11/21/2022 • 1 hour, 1 minute, 29 seconds
Taking A Look Under The Hood At CreditKarma's Data Platform
Summary
CreditKarma builds data products that help consumers take advantage of their credit and financial capabilities. To make that possible they need a reliable data platform that empowers all of the organization’s stakeholders. In this episode Vishnu Venkataraman shares the journey that he and his team have taken to build and evolve their systems and improve the product offerings that they are able to support.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their new managed database service you can launch a production ready MySQL, Postgres, or MongoDB cluster in minutes, with automated backups, 40 Gbps connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don’t forget to thank them for their continued support of this show!
Modern data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days or even weeks. By the time errors have made their way into production, it’s often too late and damage is done. Datafold built automated regression testing to help data and analytics engineers deal with data quality in their pull requests. Datafold shows how a change in SQL code affects your data, both on a statistical level and down to individual rows and values before it gets merged to production. No more shipping and praying, you can now know exactly what will change in your database! Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI workflows. Visit dataengineeringpodcast.com/datafold today to book a demo with Datafold.
RudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their extensive library of integrations enable you to automatically send data to hundreds of downstream tools. Sign up free at dataengineeringpodcast.com/rudder
Data teams are increasingly under pressure to deliver. According to a recent survey by Ascend.io, 95% in fact reported being at or over capacity. With 72% of data experts reporting demands on their team going up faster than they can hire, it’s no surprise they are increasingly turning to automation. In fact, while only 3.5% report having current investments in automation, 85% of data teams plan on investing in automation in the next 12 months. 85%!!! That’s where our friends at Ascend.io come in. The Ascend Data Automation Cloud provides a unified platform for data ingestion, transformation, orchestration, and observability. Ascend users love its declarative pipelines, powerful SDK, elegant UI, and extensible plug-in architecture, as well as its support for Python, SQL, Scala, and Java. Ascend automates workloads on Snowflake, Databricks, BigQuery, and open source Spark, and can be deployed in AWS, Azure, or GCP. Go to dataengineeringpodcast.com/ascend and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $5,000 when you become a customer.
Your host is Tobias Macey and today I’m interviewing Vishnu Venkataraman about building the data platform at CreditKarma and the forces that shaped the design
Interview
Introduction
How did you get involved in the area of data management?
Can you describe what CreditKarma is and the role of data in the business?
What is the current team topology that you are using to support data needs in the organization?
How has that evolved from when you first started with the company?
What are some of the characteristics of the data that you work with? (e.g. volume/variety/velocity, source of the data, format of the data)
What are the aspects of data management and architecture that have posed the greatest challenge?
What are the data applications that are providing the greatest ROI and/or seeing the most usage?
How have you approached the design and growth of your data platform?
CreditKarma was one of the first FinTech companies to migrate to the cloud, specifically GCP. Why migrate? What were some of your early challenges taking the company to the cloud?
What are the main components of your data platform?
What are the most notable evolutions that it has gone through?
Given your strong focus on applications of data science and ML, how has that influenced the architectural foundations of your data capabilities?
What is your process for evaluating build vs. buy decisions?
What are your triggers for deciding when to re-evaluate components of your platform?
Given your work with financial institutions, how do you address testing and validation of your derived data? How does your team solve for data reliability and quality more broadly?
What are the most interesting, innovative, or unexpected aspects of your growth as a data-led organization?
What are the most interesting, unexpected, or challenging lessons that you have learned while building up your data platform and teams?
What are the most informative mistakes that you have made?
What do you have planned for the future of your data platform?
Contact Info
LinkedIn
@vishnuvram on Twitter
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show, then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers
Links
CreditKarma
Games 24×7
Vertica
BigQuery
Google Cloud Dataflow
Anodot
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
11/14/2022 • 52 minutes, 2 seconds
Build Data Products Without A Data Team Using AgileData
Summary
Building data products is an undertaking that has historically required substantial investments of time and talent. With the rise in cloud platforms and self-serve data technologies the barrier of entry is dropping. Shane Gibson co-founded AgileData to make analytics accessible to companies of all sizes. In this episode he explains the design of the platform and how it builds on agile development principles to help you focus on delivering value.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their new managed database service you can launch a production ready MySQL, Postgres, or MongoDB cluster in minutes, with automated backups, 40 Gbps connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don’t forget to thank them for their continued support of this show!
Atlan is the metadata hub for your data ecosystem. Instead of locking your metadata into a new silo, unleash its transformative potential with Atlan’s active metadata capabilities. Push information about data freshness and quality to your business intelligence, automatically scale up and down your warehouse based on usage patterns, and let the bots answer those questions in Slack so that the humans can focus on delivering real value. Go to dataengineeringpodcast.com/atlan today to learn more about how Atlan’s active metadata platform is helping pioneering data teams like Postman, Plaid, WeWork & Unilever achieve extraordinary things with metadata and escape the chaos.
Prefect is the modern Dataflow Automation platform for the modern data stack, empowering data practitioners to build, run and monitor robust pipelines at scale. Guided by the principle that the orchestrator shouldn’t get in your way, Prefect is the only tool of its kind to offer the flexibility to write code as workflows. Prefect specializes in glueing together the disparate pieces of a pipeline, and integrating with modern distributed compute libraries to bring power where you need it, when you need it. Trusted by thousands of organizations and supported by over 20,000 community members, Prefect powers over 100MM business critical tasks a month. For more information on Prefect, visit dataengineeringpodcast.com/prefect.
Data engineers don’t enjoy writing, maintaining, and modifying ETL pipelines all day, every day. Especially once they realize 90% of all major data sources like Google Analytics, Salesforce, Adwords, Facebook, Spreadsheets, etc., are already available as plug-and-play connectors with reliable, intuitive SaaS solutions. Hevo Data is a highly reliable and intuitive data pipeline platform used by data engineers from 40+ countries to set up and run low-latency ELT pipelines with zero maintenance. Boasting more than 150 out-of-the-box connectors that can be set up in minutes, Hevo also allows you to monitor and control your pipelines. You get: real-time data flow visibility, fail-safe mechanisms, and alerts if anything breaks; preload transformations and auto-schema mapping precisely control how data lands in your destination; models and workflows to transform data for analytics; and reverse-ETL capability to move the transformed data back to your business software to inspire timely action. All of this, plus its transparent pricing and 24*7 live support, makes it consistently voted by users as the Leader in the Data Pipeline category on review platforms like G2. Go to dataengineeringpodcast.com/hevodata and sign up for a free 14-day trial that also comes with 24×7 support.
Your host is Tobias Macey and today I’m interviewing Shane Gibson about AgileData, a platform that lets you build data products without all of the overhead of managing a data team
Interview
Introduction
How did you get involved in the area of data management?
Can you describe what AgileData is and the story behind it?
Who is the target audience for this product?
For organizations that have an existing data team, how does the platform augment/simplify their work?
Can you describe how the AgileData platform is implemented?
What are some of the notable evolutions that it has gone through since you first started working on it?
Given your strong focus on Agile methods in your work, how has that influenced your priorities in developing the platform?
What are the most interesting, innovative, or unexpected ways that you have seen AgileData used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on AgileData?
When is AgileData the wrong choice?
What do you have planned for the future of AgileData?
Contact Info
LinkedIn
@shagility on Twitter
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show, then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers
Links
AgileData
Agile Practices For Data Interview
Microsoft Azure
Snowflake
BigQuery
DuckDB
Podcast Episode
Google BI Engine
OLAP
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
11/14/2022 • 1 hour, 12 minutes, 29 seconds
Clean Up Your Data Using Scalable Entity Resolution And Data Mastering With Zingg
Summary
Despite the best efforts of data engineers, data is as messy as the real world. Entity resolution and fuzzy matching are powerful utilities for cleaning up data from disconnected sources, but they have typically required custom development and training machine learning models. Sonal Goyal created and open-sourced Zingg as a generalized tool for data mastering and entity resolution to reduce the effort involved in adopting those practices. In this episode she shares the story behind the project, the details of how it is implemented, and how you can use it for your own data projects.
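As a rough illustration of the technique (not Zingg's actual implementation or API), the following Python sketch pairs records from two hypothetical sources by fuzzy name similarity using only the standard library; the record fields and the cutoff are invented for the example.

```python
# A minimal, self-contained sketch (not Zingg itself) of fuzzy matching:
# score name similarity between two record sources and keep pairs above a cutoff.
from difflib import SequenceMatcher

crm_records = [
    {"id": "crm-1", "name": "Acme Holdings Ltd"},
    {"id": "crm-2", "name": "Globex Corporation"},
]
billing_records = [
    {"id": "bill-9", "name": "ACME Holdings Limited"},
    {"id": "bill-7", "name": "Initech LLC"},
]

def similarity(a: str, b: str) -> float:
    # Compare lowercased character sequences; real tools use richer features.
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

THRESHOLD = 0.8  # illustrative cutoff; production systems tune or learn this

matches = [
    (crm["id"], bill["id"], round(similarity(crm["name"], bill["name"]), 2))
    for crm in crm_records
    for bill in billing_records
    if similarity(crm["name"], bill["name"]) >= THRESHOLD
]
print(matches)  # the two Acme records pair up; the unrelated companies do not
```

Generalized tools like Zingg take this kind of pairwise scoring further with trained matching models and blocking strategies so that it scales beyond toy datasets.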
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their new managed database service you can launch a production ready MySQL, Postgres, or MongoDB cluster in minutes, with automated backups, 40 Gbps connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don’t forget to thank them for their continued support of this show!
RudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their extensive library of integrations enable you to automatically send data to hundreds of downstream tools. Sign up free at dataengineeringpodcast.com/rudder
Modern data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days or even weeks. By the time errors have made their way into production, it’s often too late and damage is done. Datafold built automated regression testing to help data and analytics engineers deal with data quality in their pull requests. Datafold shows how a change in SQL code affects your data, both on a statistical level and down to individual rows and values before it gets merged to production. No more shipping and praying, you can now know exactly what will change in your database! Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI workflows. Visit dataengineeringpodcast.com/datafold today to book a demo with Datafold.
Data teams are increasingly under pressure to deliver. According to a recent survey by Ascend.io, 95% in fact reported being at or over capacity. With 72% of data experts reporting demands on their team going up faster than they can hire, it’s no surprise they are increasingly turning to automation. In fact, while only 3.5% report having current investments in automation, 85% of data teams plan on investing in automation in the next 12 months. 85%!!! That’s where our friends at Ascend.io come in. The Ascend Data Automation Cloud provides a unified platform for data ingestion, transformation, orchestration, and observability. Ascend users love its declarative pipelines, powerful SDK, elegant UI, and extensible plug-in architecture, as well as its support for Python, SQL, Scala, and Java. Ascend automates workloads on Snowflake, Databricks, BigQuery, and open source Spark, and can be deployed in AWS, Azure, or GCP. Go to dataengineeringpodcast.com/ascend and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $5,000 when you become a customer.
Your host is Tobias Macey and today I’m interviewing Sonal Goyal about Zingg, an open source entity resolution framework for data engineers
Interview
Introduction
How did you get involved in the area of data management?
Can you describe what Zingg is and the story behind it?
Who is the target audience for Zingg?
How has that informed your efforts in the development and release of the project?
What are the use cases where entity resolution is helpful or necessary in a data engineering context?
What are the range of options that are available for teams to implement entity/identity resolution in their data?
What was your motivation for creating an open source solution for this use case?
Why do you think there has not been a compelling open source and generalized solution previously?
Can you describe how Zingg is implemented?
How have the design and goals shifted since you started working on the project?
What does the installation and integration process look like for Zingg?
Once you have Zingg configured, what is the workflow for a data engineer or analyst?
What are the extension/customization options for someone using Zingg in their environment?
What are the most interesting, innovative, or unexpected ways that you have seen Zingg used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on Zingg?
When is Zingg the wrong choice?
What do you have planned for the future of Zingg?
Contact Info
LinkedIn
@sonalgoyal on Twitter
sonalgoyal on GitHub
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show, then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers
Links
Zingg
Entity Resolution
MDM == Master Data Management
Podcast Episode
Snowflake
Podcast Episode
Snowpark
Spark
Milvus
Podcast Episode
Pinecone
Podcast Episode
DuckDB
Podcast Episode
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
11/7/2022 • 46 minutes, 46 seconds
Build Better Data Products By Creating Data, Not Consuming It
Summary
A lot of the work that goes into data engineering is trying to make sense of the "data exhaust" from other applications and services. There is an undeniable amount of value and utility in that information, but it also introduces significant cost and time requirements. In this episode Nick King discusses how you can be intentional about data creation in your applications and services to reduce the friction and errors involved in building data products and ML applications. He also describes the considerations involved in bringing behavioral data into your systems, and the ways that he and the rest of the Snowplow team are working to make that an easy addition to your platforms.
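As a hedged sketch of what creating data on purpose (rather than mining the exhaust) can look like in an application, the snippet below emits a versioned, validated event at the moment the behavior happens. It is not Snowplow's SDK; the event fields and schema URI are illustrative assumptions.

```python
# Illustrative only: an explicitly designed behavioral event, validated and
# versioned at creation time instead of being reverse-engineered from logs.
from dataclasses import dataclass, asdict
from datetime import datetime, timezone
import json
import uuid

@dataclass(frozen=True)
class AddToCartEvent:
    schema: str      # explicit schema reference so downstream consumers can evolve safely
    user_id: str
    product_id: str
    quantity: int
    occurred_at: str

def emit(event: AddToCartEvent) -> None:
    if event.quantity <= 0:
        raise ValueError("quantity must be positive")  # validate before it ever leaves the app
    # In a real pipeline this would go to a collector or message bus, not stdout.
    print(json.dumps(asdict(event)))

emit(AddToCartEvent(
    schema="iglu:com.example/add_to_cart/jsonschema/1-0-0",  # Snowplow-style URI, purely illustrative
    user_id=str(uuid.uuid4()),
    product_id="sku-123",
    quantity=2,
    occurred_at=datetime.now(timezone.utc).isoformat(),
))
```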
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their new managed database service you can launch a production ready MySQL, Postgres, or MongoDB cluster in minutes, with automated backups, 40 Gbps connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don’t forget to thank them for their continued support of this show!
Atlan is the metadata hub for your data ecosystem. Instead of locking your metadata into a new silo, unleash its transformative potential with Atlan’s active metadata capabilities. Push information about data freshness and quality to your business intelligence, automatically scale up and down your warehouse based on usage patterns, and let the bots answer those questions in Slack so that the humans can focus on delivering real value. Go to dataengineeringpodcast.com/atlan today to learn more about how Atlan’s active metadata platform is helping pioneering data teams like Postman, Plaid, WeWork & Unilever achieve extraordinary things with metadata and escape the chaos.
Prefect is the modern Dataflow Automation platform for the modern data stack, empowering data practitioners to build, run and monitor robust pipelines at scale. Guided by the principle that the orchestrator shouldn’t get in your way, Prefect is the only tool of its kind to offer the flexibility to write code as workflows. Prefect specializes in glueing together the disparate pieces of a pipeline, and integrating with modern distributed compute libraries to bring power where you need it, when you need it. Trusted by thousands of organizations and supported by over 20,000 community members, Prefect powers over 100MM business critical tasks a month. For more information on Prefect, visit dataengineeringpodcast.com/prefect.
Data engineers don’t enjoy writing, maintaining, and modifying ETL pipelines all day, every day. Especially once they realize 90% of all major data sources like Google Analytics, Salesforce, Adwords, Facebook, Spreadsheets, etc., are already available as plug-and-play connectors with reliable, intuitive SaaS solutions. Hevo Data is a highly reliable and intuitive data pipeline platform used by data engineers from 40+ countries to set up and run low-latency ELT pipelines with zero maintenance. Boasting more than 150 out-of-the-box connectors that can be set up in minutes, Hevo also allows you to monitor and control your pipelines. You get: real-time data flow visibility, fail-safe mechanisms, and alerts if anything breaks; preload transformations and auto-schema mapping precisely control how data lands in your destination; models and workflows to transform data for analytics; and reverse-ETL capability to move the transformed data back to your business software to inspire timely action. All of this, plus its transparent pricing and 24*7 live support, makes it consistently voted by users as the Leader in the Data Pipeline category on review platforms like G2. Go to dataengineeringpodcast.com/hevodata and sign up for a free 14-day trial that also comes with 24×7 support.
Your host is Tobias Macey and today I’m interviewing Nick King about the utility of behavioral data for your data products and the technical and strategic considerations to collect and integrate it
Interview
Introduction
How did you get involved in the area of data management?
Can you share your definition of "behavioral data" and how it is differentiated from other sources/types of data?
What are some of the unique characteristics of that information?
What technical systems are required to generate and collect those interactions?
What are the organizational patterns that are required to support effective workflows for building data generation capabilities?
What are some of the strategies that have been most effective for bringing together data and application teams to identify and implement what behaviors to track?
What are some of the ethical and privacy considerations that need to be addressed when working with end-user behavioral data?
The data sources associated with business operations services and custom applications already represent some measure of user interaction and behaviors. How can teams use the information available from those systems to inform and augment the types of events/information that should be captured/generated in a system like Snowplow?
Can you describe the workflow for a team using Snowplow to generate data for a given analytical/ML project?
What are some of the tactical aspects of deciding what interfaces to use for generating interaction events?
What are some of the event modeling strategies to keep in mind to simplify the analysis and integration of the generated data?
What are some of the notable changes in implementation and focus for Snowplow over the past ~4 years?
How has the emergence of the "modern data stack" influenced the product direction?
What are the most interesting, innovative, or unexpected ways that you have seen Snowplow used for data generation/behavioral data collection?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on Snowplow?
When is Snowplow the wrong choice?
What do you have planned for the future of Snowplow?
Contact Info
LinkedIn
@nking on Twitter
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show, then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers
Links
Snowplow
Podcast Episode
Private SaaS Episode
AS/400
DB2
BigQuery
Azure SQL
Data Robot
Google Spanner
MRE == Meals Ready to Eat
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
11/7/2022 • 1 hour, 5 minutes, 19 seconds
Expanding The Reach of Business Intelligence Through Ubiquitous Embedded Analytics With Sisense
Summary
Business intelligence has grown beyond its initial manifestation as dashboards and reports. In its current incarnation it addresses a ubiquitous need for analytics and for answering questions with data. In this episode Amir Orad discusses the Sisense platform and how it facilitates the embedding of analytics and data insights in every aspect of organizational and end-user experiences.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their new managed database service you can launch a production ready MySQL, Postgres, or MongoDB cluster in minutes, with automated backups, 40 Gbps connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don’t forget to thank them for their continued support of this show!
Atlan is the metadata hub for your data ecosystem. Instead of locking your metadata into a new silo, unleash its transformative potential with Atlan’s active metadata capabilities. Push information about data freshness and quality to your business intelligence, automatically scale up and down your warehouse based on usage patterns, and let the bots answer those questions in Slack so that the humans can focus on delivering real value. Go to dataengineeringpodcast.com/atlan today to learn more about how Atlan’s active metadata platform is helping pioneering data teams like Postman, Plaid, WeWork & Unilever achieve extraordinary things with metadata and escape the chaos.
Prefect is the modern Dataflow Automation platform for the modern data stack, empowering data practitioners to build, run and monitor robust pipelines at scale. Guided by the principle that the orchestrator shouldn’t get in your way, Prefect is the only tool of its kind to offer the flexibility to write code as workflows. Prefect specializes in glueing together the disparate pieces of a pipeline, and integrating with modern distributed compute libraries to bring power where you need it, when you need it. Trusted by thousands of organizations and supported by over 20,000 community members, Prefect powers over 100MM business critical tasks a month. For more information on Prefect, visit dataengineeringpodcast.com/prefect.
Data engineers don’t enjoy writing, maintaining, and modifying ETL pipelines all day, every day. Especially once they realize 90% of all major data sources like Google Analytics, Salesforce, Adwords, Facebook, Spreadsheets, etc., are already available as plug-and-play connectors with reliable, intuitive SaaS solutions. Hevo Data is a highly reliable and intuitive data pipeline platform used by data engineers from 40+ countries to set up and run low-latency ELT pipelines with zero maintenance. Boasting more than 150 out-of-the-box connectors that can be set up in minutes, Hevo also allows you to monitor and control your pipelines. You get: real-time data flow visibility, fail-safe mechanisms, and alerts if anything breaks; preload transformations and auto-schema mapping precisely control how data lands in your destination; models and workflows to transform data for analytics; and reverse-ETL capability to move the transformed data back to your business software to inspire timely action. All of this, plus its transparent pricing and 24*7 live support, makes it consistently voted by users as the Leader in the Data Pipeline category on review platforms like G2. Go to dataengineeringpodcast.com/hevodata and sign up for a free 14-day trial that also comes with 24×7 support.
Your host is Tobias Macey and today I’m interviewing Amir Orad about Sisense, a platform focused on providing intelligent analytics everywhere
Interview
Introduction
How did you get involved in the area of data management?
Can you describe what Sisense is and the story behind it?
What are the use cases and customers that you are focused on supporting?
What is your view on the role of business intelligence in a data driven organization?
How has the market shifted in recent years and what are the motivating factors for those changes?
Many conversations around data and analytics are focused on self-service access. What are the capabilities that are required to make that a reality?
What are the core challenges that teams face on their path to designing and implementing a solution that is comprehensible by their stakeholders?
What is the role of automation vs. low-/no-code?
What are the unique capabilities that Sisense offers compared to other BI or embedded analytics services?
Can you describe how the Sisense platform is implemented?
How have the design and goals changed since you started working on it?
What is the workflow for someone working with Sisense?
What are the options for integrating Sisense with an organization’s data platform?
What are the most interesting, innovative, or unexpected ways that you have seen Sisense used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on Sisense?
When is Sisense the wrong choice?
What do you have planned for the future of Sisense?
Contact Info
LinkedIn
@AmirOrad on Twitter
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show, then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers
Links
Sisense
Looker
Podcast Episode
PowerBI
Podcast Episode
Business Intelligence
Snowflake
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
10/31/2022 • 54 minutes
Analytics Engineering Without The Friction Of Complex Pipeline Development With Optimus and dbt
Summary
One of the most impactful technologies for data analytics in recent years has been dbt. It’s hard to have a conversation about data engineering or analysis without mentioning it. Despite its widespread adoption there are still rough edges in its workflow that cause friction for data analysts. To help simplify the adoption and management of dbt projects Nandam Karthik helped create Optimus. In this episode he shares his experiences working with organizations to adopt analytics engineering patterns and the ways that Optimus and dbt were combined to let data analysts deliver insights without the roadblocks of complex pipeline management.
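To make the friction point concrete, here is a minimal sketch of the kind of glue a tool in this space can provide: wrapping the standard dbt CLI steps so an analyst runs a single command instead of remembering the sequence. This is an assumption for illustration, not Optimus's actual interface, and the project directory is hypothetical.

```python
# Illustrative wrapper around the standard dbt CLI (dbt deps, dbt run, dbt test):
# run each step in order and stop at the first failure so broken models never
# reach the testing stage silently. Requires dbt to be installed on the PATH.
import subprocess
import sys

DBT_PROJECT_DIR = "analytics/"  # hypothetical location of the dbt project

def run_step(args: list[str]) -> None:
    print(f"running: dbt {' '.join(args)}")
    result = subprocess.run(["dbt", *args, "--project-dir", DBT_PROJECT_DIR])
    if result.returncode != 0:
        sys.exit(result.returncode)  # surface the failing step's exit code

for step in (["deps"], ["run"], ["test"]):
    run_step(step)
```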
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their new managed database service you can launch a production ready MySQL, Postgres, or MongoDB cluster in minutes, with automated backups, 40 Gbps connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don’t forget to thank them for their continued support of this show!
Modern data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days or even weeks. By the time errors have made their way into production, it’s often too late and damage is done. Datafold built automated regression testing to help data and analytics engineers deal with data quality in their pull requests. Datafold shows how a change in SQL code affects your data, both on a statistical level and down to individual rows and values before it gets merged to production. No more shipping and praying, you can now know exactly what will change in your database! Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI workflows. Visit dataengineeringpodcast.com/datafold today to book a demo with Datafold.
RudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their extensive library of integrations enable you to automatically send data to hundreds of downstream tools. Sign up free at dataengineeringpodcast.com/rudder
Data teams are increasingly under pressure to deliver. According to a recent survey by Ascend.io, 95% in fact reported being at or over capacity. With 72% of data experts reporting demands on their team going up faster than they can hire, it’s no surprise they are increasingly turning to automation. In fact, while only 3.5% report having current investments in automation, 85% of data teams plan on investing in automation in the next 12 months. 85%!!! That’s where our friends at Ascend.io come in. The Ascend Data Automation Cloud provides a unified platform for data ingestion, transformation, orchestration, and observability. Ascend users love its declarative pipelines, powerful SDK, elegant UI, and extensible plug-in architecture, as well as its support for Python, SQL, Scala, and Java. Ascend automates workloads on Snowflake, Databricks, BigQuery, and open source Spark, and can be deployed in AWS, Azure, or GCP. Go to dataengineeringpodcast.com/ascend and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $5,000 when you become a customer.
Your host is Tobias Macey and today I’m interviewing Nandam Karthik about his experiences building analytics projects with dbt and Optimus for his clients at Sigmoid.
Interview
Introduction
How did you get involved in the area of data management?
Can you describe what Sigmoid is and the types of projects that you are involved in?
What are some of the core challenges that your clients are facing when they start working with you?
An ELT workflow with dbt as the transformation utility has become a popular pattern for building analytics systems. Can you share some examples of projects that you have built with this approach?
What are some of the ways that this pattern becomes bespoke as you start exploring a project more deeply?
What are the sharp edges/white spaces that you encountered across those projects?
Can you describe what Optimus is?
How does Optimus improve the user experience of teams working in dbt?
What are some of the tactical/organizational practices that you have found most helpful when building with dbt and Optimus?
What are the most interesting, innovative, or unexpected ways that you have seen Optimus/dbt used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on dbt/Optimus projects?
When is Optimus/dbt the wrong choice?
What are your predictions for how "best practices" for analytics projects will change/evolve in the near/medium term?
Contact Info
LinkedIn
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show, then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers
Links
Sigmoid
Optimus
dbt
Podcast Episode
Airflow
AWS Glue
BigQuery
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
10/30/2022 • 40 minutes, 9 seconds
How To Bring Agile Practices To Your Data Projects
Summary
Agile methodologies have been adopted by a majority of teams for building software applications. Applying those same practices to data can prove challenging due to the number of systems that need to be included to implement a complete feature. In this episode Shane Gibson shares practical advice and insights from his years of experience as a consultant and engineer working in data about how to adopt agile principles in your data work so that you can move faster and provide more value to the business, while building systems that are maintainable and adaptable.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their new managed database service you can launch a production ready MySQL, Postgres, or MongoDB cluster in minutes, with automated backups, 40 Gbps connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don’t forget to thank them for their continued support of this show!
Atlan is the metadata hub for your data ecosystem. Instead of locking your metadata into a new silo, unleash its transformative potential with Atlan’s active metadata capabilities. Push information about data freshness and quality to your business intelligence, automatically scale up and down your warehouse based on usage patterns, and let the bots answer those questions in Slack so that the humans can focus on delivering real value. Go to dataengineeringpodcast.com/atlan today to learn more about how Atlan’s active metadata platform is helping pioneering data teams like Postman, Plaid, WeWork & Unilever achieve extraordinary things with metadata and escape the chaos.
Prefect is the modern Dataflow Automation platform for the modern data stack, empowering data practitioners to build, run and monitor robust pipelines at scale. Guided by the principle that the orchestrator shouldn’t get in your way, Prefect is the only tool of its kind to offer the flexibility to write code as workflows. Prefect specializes in glueing together the disparate pieces of a pipeline, and integrating with modern distributed compute libraries to bring power where you need it, when you need it. Trusted by thousands of organizations and supported by over 20,000 community members, Prefect powers over 100MM business critical tasks a month. For more information on Prefect, visit dataengineeringpodcast.com/prefect.
Data engineers don’t enjoy writing, maintaining, and modifying ETL pipelines all day, every day. Especially once they realize 90% of all major data sources like Google Analytics, Salesforce, Adwords, Facebook, Spreadsheets, etc., are already available as plug-and-play connectors with reliable, intuitive SaaS solutions. Hevo Data is a highly reliable and intuitive data pipeline platform used by data engineers from 40+ countries to set up and run low-latency ELT pipelines with zero maintenance. Boasting more than 150 out-of-the-box connectors that can be set up in minutes, Hevo also allows you to monitor and control your pipelines. You get: real-time data flow visibility, fail-safe mechanisms, and alerts if anything breaks; preload transformations and auto-schema mapping precisely control how data lands in your destination; models and workflows to transform data for analytics; and reverse-ETL capability to move the transformed data back to your business software to inspire timely action. All of this, plus its transparent pricing and 24*7 live support, makes it consistently voted by users as the Leader in the Data Pipeline category on review platforms like G2. Go to dataengineeringpodcast.com/hevodata and sign up for a free 14-day trial that also comes with 24×7 support.
Your host is Tobias Macey and today I’m interviewing Shane Gibson about how to bring Agile practices to your data management workflows
Interview
Introduction
How did you get involved in the area of data management?
Can you describe what AgileData is and the story behind it?
What are the main industries and/or use cases that you are focused on supporting?
The data ecosystem has been trying on different paradigms from software development for some time now (e.g. DataOps, version control, etc.). What are the aspects of Agile that do and don’t map well to data engineering/analysis?
One of the perennial challenges of data analysis is how to approach data modeling. How do you balance the need to provide value with the long-term impacts of incomplete or underinformed modeling decisions made in haste at the beginning of a project?
How do you design in affordances for refactoring of the data models without breaking downstream assets?
Another aspect of implementing data products/platforms is how to manage permissions and governance. What are the incremental ways that those principles can be incorporated early and evolved along with the overall analytical products?
What are some of the organizational design strategies that you find most helpful when establishing or training a team who is working on data products?
In order to have a useful target to work toward it’s necessary to understand what the data consumers are hoping to achieve. What are some of the challenges of doing requirements gathering for data products? (e.g. not knowing what information is available, consumers not understanding what’s hard vs. easy, etc.)
How do you work with the "customers" to help them understand what a reasonable scope is and translate that to the actual project stages for the engineers?
What are some of the perennial questions or points of confusion that you have had to address with your clients on how to design and implement analytical assets?
What are the most interesting, innovative, or unexpected ways that you have seen agile principles used for data?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on AgileData?
When is agile the wrong choice for a data project?
What do you have planned for the future of AgileData?
Contact Info
LinkedIn
@shagility on Twitter
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show, then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers
Links
AgileData
OptimalBI
How To Make Toast
Data Mesh
Information Product Canvas
DataKitchen
Podcast Episode
Great Expectations
Podcast Episode
Soda Data
Podcast Episode
Google DataStore
Unfix.work
Activity Schema
Podcast Episode
Data Vault
Podcast Episode
Star Schema
Lean Methodology
Scrum
Kanban
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
10/23/2022 • 1 hour, 12 minutes, 17 seconds
Going From Transactional To Analytical And Self-managed To Cloud On One Database With MariaDB
Summary
The database market has seen unprecedented activity in recent years, with new options addressing a variety of needs being introduced on a nearly constant basis. Despite that, there are a handful of databases that continue to be adopted due to their proven reliability and robust features. MariaDB is one of those default options that has continued to grow and innovate while offering a familiar and stable experience. In this episode Field CTO Manjot Singh shares his experiences as an early user of MySQL and MariaDB and explains how the suite of products built on top of the open source foundation addresses the growing need for advanced storage and analytical capabilities.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their new managed database service you can launch a production ready MySQL, Postgres, or MongoDB cluster in minutes, with automated backups, 40 Gbps connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don’t forget to thank them for their continued support of this show!
You wake up to a Slack message from your CEO, who’s upset because the company’s revenue dashboard is broken. You’re told to fix it before this morning’s board meeting, which is just minutes away. Enter Metaplane, the industry’s only self-serve data observability tool. In just a few clicks, you identify the issue’s root cause, conduct an impact analysis—and save the day. Data leaders at Imperfect Foods, Drift, and Vendr love Metaplane because it helps them catch, investigate, and fix data quality issues before their stakeholders ever notice they exist. Setup takes 30 minutes. You can literally get up and running with Metaplane by the end of this podcast. Sign up for a free-forever plan at dataengineeringpodcast.com/metaplane, or try out their most advanced features with a 14-day free trial. Mention the podcast to get a free "In Data We Trust World Tour" t-shirt.
RudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their state-of-the-art reverse ETL pipelines enable you to send enriched data to any cloud tool. Sign up free… or just get the free t-shirt for being a listener of the Data Engineering Podcast at dataengineeringpodcast.com/rudder.
Data teams are increasingly under pressure to deliver. According to a recent survey by Ascend.io, 95% in fact reported being at or over capacity. With 72% of data experts reporting demands on their team going up faster than they can hire, it’s no surprise they are increasingly turning to automation. In fact, while only 3.5% report having current investments in automation, 85% of data teams plan on investing in automation in the next 12 months. 85%!!! That’s where our friends at Ascend.io come in. The Ascend Data Automation Cloud provides a unified platform for data ingestion, transformation, orchestration, and observability. Ascend users love its declarative pipelines, powerful SDK, elegant UI, and extensible plug-in architecture, as well as its support for Python, SQL, Scala, and Java. Ascend automates workloads on Snowflake, Databricks, BigQuery, and open source Spark, and can be deployed in AWS, Azure, or GCP. Go to dataengineeringpodcast.com/ascend and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $5,000 when you become a customer.
Your host is Tobias Macey and today I’m interviewing Manjot Singh about MariaDB, one of the leading open source database engines
Interview
Introduction
How did you get involved in the area of data management?
Can you describe what MariaDB is and the story behind it?
MariaDB started as a fork of the MySQL engine, what are the notable differences that have evolved between the two projects?
How have the MariaDB team worked to maintain compatibility for users who want to switch from MySQL?
What are the unique capabilities that MariaDB offers?
Beyond the core open source project you have built a suite of commercial extensions. What are the use cases/capabilities that you are targeting with those products?
How do you balance the time and effort invested in the open source engine against the commercial projects to ensure that the overall effort is sustainable?
What are your guidelines for what features and capabilities are released in the community edition and which are more suited to the commercial products?
For your managed cloud service, what are the differentiating factors for that versus the database services provided by the major cloud platforms?
What do you see as the future of the database market and how we interact and integrate with them?
What are the most interesting, innovative, or unexpected ways that you have seen MariaDB used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on MariaDB?
When is MariaDB the wrong choice?
What do you have planned for the future of MariaDB?
Contact Info
LinkedIn
@ManjotSingh on Twitter
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers
Links
MariaDB
HTML Goodies
MySQL
PHP
MySQL/MariaDB Pluggable Storage
InnoDB
MyISAM
Aria Storage
SQL/PSM
MyRocks
MariaDB XPand
BSL == Business Source License
Paxos
MariaDB MongoDB Compatibility
Vertica
MariaDB Spider Storage Engine
IHME == Institute for Health Metrics and Evaluation
Rundeck
MaxScale
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
10/23/2022 • 52 minutes, 4 seconds
Speeding Up The Time To Insight For Supply Chains And Logistics With The Pathway Database That Thinks
Summary
Logistics and supply chains have come under increased stress and scrutiny in recent years. In order to stay ahead of customer demands, businesses need to be able to react quickly and intelligently to changes, which requires fast and accurate insights into their operations. Pathway is a streaming database engine that embeds artificial intelligence into the storage, with functionality designed to support the spatiotemporal data that is crucial for shipping and logistics. In this episode Adrian Kosowski explains how the Pathway product got started, how its design simplifies the creation of data products that support supply chain operations, and how developers can help to build an ecosystem of applications that allow businesses to accelerate their time to insight.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their new managed database service you can launch a production ready MySQL, Postgres, or MongoDB cluster in minutes, with automated backups, 40 Gbps connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don’t forget to thank them for their continued support of this show!
Atlan is the metadata hub for your data ecosystem. Instead of locking your metadata into a new silo, unleash its transformative potential with Atlan’s active metadata capabilities. Push information about data freshness and quality to your business intelligence, automatically scale up and down your warehouse based on usage patterns, and let the bots answer those questions in Slack so that the humans can focus on delivering real value. Go to dataengineeringpodcast.com/atlan today to learn more about how Atlan’s active metadata platform is helping pioneering data teams like Postman, Plaid, WeWork & Unilever achieve extraordinary things with metadata and escape the chaos.
Prefect is the modern Dataflow Automation platform for the modern data stack, empowering data practitioners to build, run and monitor robust pipelines at scale. Guided by the principle that the orchestrator shouldn’t get in your way, Prefect is the only tool of its kind to offer the flexibility to write code as workflows. Prefect specializes in glueing together the disparate pieces of a pipeline, and integrating with modern distributed compute libraries to bring power where you need it, when you need it. Trusted by thousands of organizations and supported by over 20,000 community members, Prefect powers over 100MM business critical tasks a month. For more information on Prefect, visit dataengineeringpodcast.com/prefect.
Data engineers don’t enjoy writing, maintaining, and modifying ETL pipelines all day, every day. Especially once they realize 90% of all major data sources like Google Analytics, Salesforce, Adwords, Facebook, Spreadsheets, etc., are already available as plug-and-play connectors with reliable, intuitive SaaS solutions. Hevo Data is a highly reliable and intuitive data pipeline platform used by data engineers from 40+ countries to set up and run low-latency ELT pipelines with zero maintenance. Boasting more than 150 out-of-the-box connectors that can be set up in minutes, Hevo also allows you to monitor and control your pipelines. You get: real-time data flow visibility, fail-safe mechanisms, and alerts if anything breaks; preload transformations and auto-schema mapping precisely control how data lands in your destination; models and workflows to transform data for analytics; and reverse-ETL capability to move the transformed data back to your business software to inspire timely action. All of this, plus its transparent pricing and 24*7 live support, makes it consistently voted by users as the Leader in the Data Pipeline category on review platforms like G2. Go to dataengineeringpodcast.com/hevodata and sign up for a free 14-day trial that also comes with 24×7 support.
Your host is Tobias Macey and today I’m interviewing Adrian Kosowski about Pathway, an AI-powered database and streaming framework. Pathway is used for analyzing and optimizing supply chains and logistics in real time.
Interview
Introduction
How did you get involved in the area of data management?
Can you describe what Pathway is and the story behind it?
What are the primary challenges that you are working to solve?
Who are the target users of the Pathway product and how does it fit into their work?
Your tagline is that Pathway is "the database that thinks". What are some of the ways that existing database and stream-processing architectures introduce friction on the path to analysis?
How does Pathway incorporate computational capabilities into its engine to address those challenges?
What are the types of data that Pathway is designed to work with?
Can you describe how the Pathway engine is implemented?
What are some of the ways that the design and goals of the product have shifted since you started working on it?
What are some of the ways that Pathway can be integrated into an analytical system?
What is involved in adapting its capabilities to different industries?
What are the most interesting, innovative, or unexpected ways that you have seen Pathway used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on Pathway?
When is Pathway the wrong choice?
What do you have planned for the future of Pathway?
Contact Info
Adrian Kosowski
LinkedIn
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers
Links
Pathway
Pathway for developers
SPOJ.com – competitive programming community
Spatiotemporal Data
Pointers in programming
Clustering
The Halting Problem
Pytorch
Podcast.__init__ Episode
Tensorflow
Markov Chains
NetworkX
Finite State Machine
DTW == Dynamic Time Warping
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
10/16/2022 • 1 hour, 2 minutes, 36 seconds
An Exploration Of The Open Data Lakehouse And Dremio's Contribution To The Ecosystem
Summary
The "data lakehouse" architecture balances the scalability and flexibility of data lakes with the ease of use and transaction support of data warehouses. Dremio is one of the companies leading the development of products and services that support the open lakehouse. In this episode Jason Hughes explains what it means for a lakehouse to be "open" and describes the different components that the Dremio team build and contribute to.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their new managed database service you can launch a production ready MySQL, Postgres, or MongoDB cluster in minutes, with automated backups, 40 Gbps connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don’t forget to thank them for their continued support of this show!
You wake up to a Slack message from your CEO, who’s upset because the company’s revenue dashboard is broken. You’re told to fix it before this morning’s board meeting, which is just minutes away. Enter Metaplane, the industry’s only self-serve data observability tool. In just a few clicks, you identify the issue’s root cause, conduct an impact analysis—and save the day. Data leaders at Imperfect Foods, Drift, and Vendr love Metaplane because it helps them catch, investigate, and fix data quality issues before their stakeholders ever notice they exist. Setup takes 30 minutes. You can literally get up and running with Metaplane by the end of this podcast. Sign up for a free-forever plan at dataengineeringpodcast.com/metaplane, or try out their most advanced features with a 14-day free trial. Mention the podcast to get a free "In Data We Trust World Tour" t-shirt.
RudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their state-of-the-art reverse ETL pipelines enable you to send enriched data to any cloud tool. Sign up free… or just get the free t-shirt for being a listener of the Data Engineering Podcast at dataengineeringpodcast.com/rudder.
Data teams are increasingly under pressure to deliver. According to a recent survey by Ascend.io, 95% in fact reported being at or over capacity. With 72% of data experts reporting demands on their team going up faster than they can hire, it’s no surprise they are increasingly turning to automation. In fact, while only 3.5% report having current investments in automation, 85% of data teams plan on investing in automation in the next 12 months. 85%!!! That’s where our friends at Ascend.io come in. The Ascend Data Automation Cloud provides a unified platform for data ingestion, transformation, orchestration, and observability. Ascend users love its declarative pipelines, powerful SDK, elegant UI, and extensible plug-in architecture, as well as its support for Python, SQL, Scala, and Java. Ascend automates workloads on Snowflake, Databricks, BigQuery, and open source Spark, and can be deployed in AWS, Azure, or GCP. Go to dataengineeringpodcast.com/ascend and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $5,000 when you become a customer.
Your host is Tobias Macey and today I’m interviewing Jason Hughes about the work that Dremio is doing to support the open lakehouse
Interview
Introduction
How did you get involved in the area of data management?
Can you describe what Dremio is and the story behind it?
What are some of the notable changes in the Dremio product and related ecosystem over the past ~4 years?
How has the advent of the lakehouse paradigm influenced the product direction?
What are the main benefits that a lakehouse design offers to a data platform?
What are some of the architectural patterns that are only possible with a lakehouse?
What is the distinction you make between a lakehouse and an open lakehouse?
What are some of the unique features that Dremio offers for lakehouse implementations?
What are some of the investments that Dremio has made to the broader open source/open lakehouse ecosystem?
How are those projects/investments being used in the commercial offering?
What is the purchase/usage model that customers expect for lakehouse implementations?
How have those expectations shifted since the first iterations of Dremio?
Dremio has its ancestry in the Drill project. How has that history influenced the capabilities (e.g. integrations, scalability, deployment models, etc.) and evolution of Dremio compared to systems like Trino/Presto and Spark SQL?
What are the most interesting, innovative, or unexpected ways that you have seen Dremio used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on Dremio?
When is Dremio the wrong choice?
What do you have planned for the future of Dremio?
Contact Info
Email
LinkedIn
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers
Links
Dremio
Podcast Episode
Dremio Sonar
Dremio Arctic
DML == Data Manipulation Language
Spark
Data Lake
Trino
Presto
Dremio Data Reflections
Tableau
Delta Lake
Podcast Episode
Apache Impala
Apache Arrow
DuckDB
Podcast Episode
Google BigLake
Project Nessie
Apache Iceberg
Podcast Episode
Hive Metastore
AWS Glue Catalog
Dremel
Apache Drill
Arrow Gandiva
dbt
Airbyte
Podcast Episode
Singer
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
10/16/2022 • 50 minutes, 44 seconds
Making The Open Data Lakehouse Affordable Without The Overhead At Iomete
Summary
The core of any data platform is the centralized storage and processing layer. For many that is a data warehouse, but in order to support a diverse and constantly changing set of uses and technologies the data lakehouse is a paradigm that offers a useful balance of scale and cost, with performance and ease of use. In order to make the data lakehouse available to a wider audience the team at Iomete built an all-in-one service that handles management and integration of the various technologies so that you can worry about answering important business questions. In this episode Vusal Dadalov explains how the platform is implemented, the motivation for a truly open architecture, and how they have invested in integrating with the broader ecosystem to make it easy for you to get started.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their new managed database service you can launch a production ready MySQL, Postgres, or MongoDB cluster in minutes, with automated backups, 40 Gbps connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don’t forget to thank them for their continued support of this show!
Atlan is the metadata hub for your data ecosystem. Instead of locking your metadata into a new silo, unleash its transformative potential with Atlan’s active metadata capabilities. Push information about data freshness and quality to your business intelligence, automatically scale up and down your warehouse based on usage patterns, and let the bots answer those questions in Slack so that the humans can focus on delivering real value. Go to dataengineeringpodcast.com/atlan today to learn more about how Atlan’s active metadata platform is helping pioneering data teams like Postman, Plaid, WeWork & Unilever achieve extraordinary things with metadata and escape the chaos.
Prefect is the modern Dataflow Automation platform for the modern data stack, empowering data practitioners to build, run and monitor robust pipelines at scale. Guided by the principle that the orchestrator shouldn’t get in your way, Prefect is the only tool of its kind to offer the flexibility to write code as workflows. Prefect specializes in glueing together the disparate pieces of a pipeline, and integrating with modern distributed compute libraries to bring power where you need it, when you need it. Trusted by thousands of organizations and supported by over 20,000 community members, Prefect powers over 100MM business critical tasks a month. For more information on Prefect, visit dataengineeringpodcast.com/prefect.
Data engineers don’t enjoy writing, maintaining, and modifying ETL pipelines all day, every day. Especially once they realize 90% of all major data sources like Google Analytics, Salesforce, Adwords, Facebook, Spreadsheets, etc., are already available as plug-and-play connectors with reliable, intuitive SaaS solutions. Hevo Data is a highly reliable and intuitive data pipeline platform used by data engineers from 40+ countries to set up and run low-latency ELT pipelines with zero maintenance. Boasting more than 150 out-of-the-box connectors that can be set up in minutes, Hevo also allows you to monitor and control your pipelines. You get: real-time data flow visibility, fail-safe mechanisms, and alerts if anything breaks; preload transformations and auto-schema mapping precisely control how data lands in your destination; models and workflows to transform data for analytics; and reverse-ETL capability to move the transformed data back to your business software to inspire timely action. All of this, plus its transparent pricing and 24*7 live support, makes it consistently voted by users as the Leader in the Data Pipeline category on review platforms like G2. Go to dataengineeringpodcast.com/hevodata and sign up for a free 14-day trial that also comes with 24×7 support.
Your host is Tobias Macey and today I’m interviewing Vusal Dadalov about Iomete, an open and affordable lakehouse platform
Interview
Introduction
How did you get involved in the area of data management?
Can you describe what Iomete is and the story behind it?
The selection of the storage/query layer is the most impactful decision in the implementation of a data platform. What do you see as the most significant factors that are leading people to Iomete/lakehouse structures rather than a more traditional db/warehouse?
The principle of the Lakehouse architecture has been gaining popularity recently. What are some of the complexities/missing pieces that make its implementation a challenge?
What are the hidden difficulties/incompatibilities that come up for teams who are investing in data lake/lakehouse technologies?
What are some of the shortcomings of lakehouse architectures?
What are the fundamental capabilities that are necessary to run a fully functional lakehouse?
Can you describe how the Iomete platform is implemented?
What was your process for deciding which elements to adopt off the shelf vs. building from scratch?
What do you see as the strengths of Spark as the query/execution engine as compared to e.g. Presto/Trino or Dremio?
What are the integrations and ecosystem investments that you have had to prioritize to simplify adoption of Iomete?
What have been the most challenging aspects of building a competitive business in such an active product category?
What are the most interesting, innovative, or unexpected ways that you have seen Iomete used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on Iomete?
When is Iomete the wrong choice?
What do you have planned for the future of Iomete?
Contact Info
LinkedIn
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers
Links
Iomete
Fivetran
Podcast Episode
Airbyte
Podcast Episode
Snowflake
Podcast Episode
Databricks
Collibra
Podcast Episode
Talend
Parquet
Trino
Spark
Presto
Snowpark
Iceberg
Podcast Episode
Iomete dbt adapter
Singer
Meltano
Podcast Episode
AWS Interface Gateway
Apache Hudi
Podcast Episode
Delta Lake
Podcast Episode
Amundsen
Podcast Episode
AWS EMR
AWS Athena
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
10/10/2022 • 55 minutes, 24 seconds
Investing In Understanding The Customer Journey At American Express
Summary
For any business that wants to stay in operation, the most important thing they can do is understand their customers. American Express has invested substantial time and effort in their Customer 360 product to achieve that understanding. In this episode Purvi Shah, the VP of Enterprise Big Data Platforms at American Express, explains how they have invested in the cloud to power this visibility and the complex suite of integrations they have built and maintained across legacy and modern systems to make it possible.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their new managed database service you can launch a production ready MySQL, Postgres, or MongoDB cluster in minutes, with automated backups, 40 Gbps connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don’t forget to thank them for their continued support of this show!
You wake up to a Slack message from your CEO, who’s upset because the company’s revenue dashboard is broken. You’re told to fix it before this morning’s board meeting, which is just minutes away. Enter Metaplane, the industry’s only self-serve data observability tool. In just a few clicks, you identify the issue’s root cause, conduct an impact analysis—and save the day. Data leaders at Imperfect Foods, Drift, and Vendr love Metaplane because it helps them catch, investigate, and fix data quality issues before their stakeholders ever notice they exist. Setup takes 30 minutes. You can literally get up and running with Metaplane by the end of this podcast. Sign up for a free-forever plan at dataengineeringpodcast.com/metaplane, or try out their most advanced features with a 14-day free trial. Mention the podcast to get a free "In Data We Trust World Tour" t-shirt.
RudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their state-of-the-art reverse ETL pipelines enable you to send enriched data to any cloud tool. Sign up free… or just get the free t-shirt for being a listener of the Data Engineering Podcast at dataengineeringpodcast.com/rudder.
Data teams are increasingly under pressure to deliver. According to a recent survey by Ascend.io, 95% in fact reported being at or over capacity. With 72% of data experts reporting demands on their team going up faster than they can hire, it’s no surprise they are increasingly turning to automation. In fact, while only 3.5% report having current investments in automation, 85% of data teams plan on investing in automation in the next 12 months. 85%!!! That’s where our friends at Ascend.io come in. The Ascend Data Automation Cloud provides a unified platform for data ingestion, transformation, orchestration, and observability. Ascend users love its declarative pipelines, powerful SDK, elegant UI, and extensible plug-in architecture, as well as its support for Python, SQL, Scala, and Java. Ascend automates workloads on Snowflake, Databricks, BigQuery, and open source Spark, and can be deployed in AWS, Azure, or GCP. Go to dataengineeringpodcast.com/ascend and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $5,000 when you become a customer.
Your host is Tobias Macey and today I’m interviewing Purvi Shah about building the Customer 360 data product for American Express and migrating their enterprise data platform to the cloud
Interview
Introduction
How did you get involved in the area of data management?
Can you describe what the Customer 360 project is and the story behind it?
What are the types of questions and insights that the C360 project is designed to answer?
Can you describe the types of information and data sources that you are relying on to feed this project?
What are the different axes of scale that you have had to address in the design and architecture of the C360 project? (e.g. geographical, volume/variety/velocity of data, scale of end-user access and data manipulation, etc.)
What are some of the challenges that you have had to address in order to build and maintain the map between organizational and technical requirements/semantics in the platform?
What were some of the early wins that you targeted, and how did the lessons from those successes drive the product design going forward?
Can you describe the platform architecture for your data systems that are powering the C360 product?
How have the design/goals/requirements of the system changed since you first started working on it?
How have you approached the integration and migration of legacy data systems and assets into this new platform?
What are some of the ongoing maintenance challenges that the legacy platforms introduce?
Can you describe how you have approached the question of data quality/observability and the validation/verification of the generated assets?
What are the aspects of governance and access control that you need to deal with being part of a financial institution?
Now that the C360 product has been in use for a few years, what are the strategic and tactical aspects of the ongoing evolution and maintenance of the product which you have had to address?
What are the most interesting, innovative, or unexpected ways that you have seen the C360 product used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on C360 for American Express?
When is a C360 project the wrong choice?
What do you have planned for the future of C360 and enterprise data platforms at American Express?
Contact Info
LinkedIn
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers
Links
Data Stewards
Hadoop
SBA Paycheck Protection
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
10/10/2022 • 40 minutes, 43 seconds
Gain Visibility And Insight Into Your Supply Chains Through Operational Analytics Powered By Roambee
Summary
The global economy is dependent on complex and dynamic networks of supply chains powered by sophisticated logistics. This requires a significant amount of data to track shipments and operational characteristics of materials and goods. Roambee is a platform that collects, integrates, and analyzes all of that information to provide companies with the critical insights that businesses need to stay running, especially in a time of such constant change. In this episode Roambee CEO Sanjay Sharma shares the types of questions that companies are asking about their logistics, the technical work that they do to provide ways to answer those questions, and how they approach the challenge of data quality in its many forms.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their new managed database service you can launch a production ready MySQL, Postgres, or MongoDB cluster in minutes, with automated backups, 40 Gbps connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don’t forget to thank them for their continued support of this show!
Atlan is the metadata hub for your data ecosystem. Instead of locking your metadata into a new silo, unleash its transformative potential with Atlan’s active metadata capabilities. Push information about data freshness and quality to your business intelligence, automatically scale up and down your warehouse based on usage patterns, and let the bots answer those questions in Slack so that the humans can focus on delivering real value. Go to dataengineeringpodcast.com/atlan today to learn more about how Atlan’s active metadata platform is helping pioneering data teams like Postman, Plaid, WeWork & Unilever achieve extraordinary things with metadata and escape the chaos.
Prefect is the modern Dataflow Automation platform for the modern data stack, empowering data practitioners to build, run and monitor robust pipelines at scale. Guided by the principle that the orchestrator shouldn’t get in your way, Prefect is the only tool of its kind to offer the flexibility to write code as workflows. Prefect specializes in glueing together the disparate pieces of a pipeline, and integrating with modern distributed compute libraries to bring power where you need it, when you need it. Trusted by thousands of organizations and supported by over 20,000 community members, Prefect powers over 100MM business critical tasks a month. For more information on Prefect, visit dataengineeringpodcast.com/prefect.
Data engineers don’t enjoy writing, maintaining, and modifying ETL pipelines all day, every day. Especially once they realize 90% of all major data sources like Google Analytics, Salesforce, Adwords, Facebook, Spreadsheets, etc., are already available as plug-and-play connectors with reliable, intuitive SaaS solutions. Hevo Data is a highly reliable and intuitive data pipeline platform used by data engineers from 40+ countries to set up and run low-latency ELT pipelines with zero maintenance. Boasting more than 150 out-of-the-box connectors that can be set up in minutes, Hevo also allows you to monitor and control your pipelines. You get: real-time data flow visibility, fail-safe mechanisms, and alerts if anything breaks; preload transformations and auto-schema mapping precisely control how data lands in your destination; models and workflows to transform data for analytics; and reverse-ETL capability to move the transformed data back to your business software to inspire timely action. All of this, plus its transparent pricing and 24*7 live support, makes it consistently voted by users as the Leader in the Data Pipeline category on review platforms like G2. Go to dataengineeringpodcast.com/hevodata and sign up for a free 14-day trial that also comes with 24×7 support.
Your host is Tobias Macey and today I’m interviewing Sanjay Sharma about how Roambee is using data to bring visibility into shipping and supply chains.
Interview
Introduction
How did you get involved in the area of data management?
Can you describe what Roambee is and the story behind it?
Who are the personas that are looking to Roambee for insights?
What are some of the questions that they are asking about the state of their assets?
Can you describe the types of information sources and the format of the data that you are working with?
What are the types of SLAs that you are focused on delivering to your customers? (e.g. latency from recorded event to analytics, accuracy, etc.)
Can you describe how the Roambee platform is implemented?
How have the evolving landscape of sensor and data technologies influenced the evolution of your service?
Given your support for customer-created integrations and user-generated inputs on shipment updates, how do you manage data quality and consistency?
How do you approach customer onboarding, and what is your approach to reducing the time to value?
What are the most interesting, innovative, or unexpected ways that you have seen the Roambee platform used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on Roambee?
When is Roambee the wrong choice?
What do you have planned for the future of Roambee?
Contact Info
LinkedIn
Closing Announcements
Thank you for listening! Don’t forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
Roambee
RFID == Radio Frequency Identification
EDI == Electronic Data Interchange
Digital Twin
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
10/3/2022 • 1 hour, 3 seconds
Make Data Lineage A Ubiquitous Part Of Your Work By Simplifying Its Implementation With Alvin
Summary
Data lineage is something that has grown from a convenient feature to a critical need as data systems have grown in scale, complexity, and centrality to business. Alvin is a platform that aims to provide a low-effort solution for data lineage capabilities focused on simplifying the work of data engineers. In this episode co-founder Martin Sahlen explains the impact that easy access to lineage information can have on the work of data engineers and analysts, and how he and his team have designed their platform to offer that information to engineers and stakeholders in the places where they interact with data.
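Much of the lineage extraction discussed in the episode comes down to parsing the SQL that runs against the warehouse and recording which tables feed which outputs. As a toy illustration of that idea (not Alvin's actual implementation; the show links mention sqlparse and ANTLR as the parsing tools that come up), a minimal Python sketch with sqlparse might look like:

import sqlparse
from sqlparse.sql import Identifier, IdentifierList
from sqlparse.tokens import Keyword

def extract_source_tables(sql):
    """Collect table names that appear after FROM/JOIN in simple statements."""
    tables = set()
    for statement in sqlparse.parse(sql):
        grab_next = False
        for token in statement.tokens:
            if token.is_whitespace:
                continue
            if grab_next:
                # The token after FROM/JOIN is either a single identifier or a comma-separated list
                if isinstance(token, IdentifierList):
                    for ident in token.get_identifiers():
                        tables.add(ident.get_real_name())
                elif isinstance(token, Identifier):
                    tables.add(token.get_real_name())
                grab_next = False
            if token.ttype is Keyword and token.value.upper() in ("FROM", "JOIN", "LEFT JOIN", "INNER JOIN"):
                grab_next = True
    return tables

print(extract_source_tables(
    "CREATE TABLE reporting.daily_revenue AS "
    "SELECT o.order_date, SUM(o.amount) FROM raw.orders o "
    "JOIN raw.customers c ON o.customer_id = c.id GROUP BY 1"
))

A toy like this only handles straightforward queries; the episode digs into why covering subqueries, CTEs, dialect quirks, and assets created outside the officially supported paths is where the real difficulty lies.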
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their new managed database service you can launch a production ready MySQL, Postgres, or MongoDB cluster in minutes, with automated backups, 40 Gbps connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don’t forget to thank them for their continued support of this show!
You wake up to a Slack message from your CEO, who’s upset because the company’s revenue dashboard is broken. You’re told to fix it before this morning’s board meeting, which is just minutes away. Enter Metaplane, the industry’s only self-serve data observability tool. In just a few clicks, you identify the issue’s root cause, conduct an impact analysis—and save the day. Data leaders at Imperfect Foods, Drift, and Vendr love Metaplane because it helps them catch, investigate, and fix data quality issues before their stakeholders ever notice they exist. Setup takes 30 minutes. You can literally get up and running with Metaplane by the end of this podcast. Sign up for a free-forever plan at dataengineeringpodcast.com/metaplane, or try out their most advanced features with a 14-day free trial. Mention the podcast to get a free "In Data We Trust World Tour" t-shirt.
RudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their state-of-the-art reverse ETL pipelines enable you to send enriched data to any cloud tool. Sign up free… or just get the free t-shirt for being a listener of the Data Engineering Podcast at dataengineeringpodcast.com/rudder.
Data teams are increasingly under pressure to deliver. According to a recent survey by Ascend.io, 95% in fact reported being at or over capacity. With 72% of data experts reporting demands on their team going up faster than they can hire, it’s no surprise they are increasingly turning to automation. In fact, while only 3.5% report having current investments in automation, 85% of data teams plan on investing in automation in the next 12 months. 85%!!! That’s where our friends at Ascend.io come in. The Ascend Data Automation Cloud provides a unified platform for data ingestion, transformation, orchestration, and observability. Ascend users love its declarative pipelines, powerful SDK, elegant UI, and extensible plug-in architecture, as well as its support for Python, SQL, Scala, and Java. Ascend automates workloads on Snowflake, Databricks, BigQuery, and open source Spark, and can be deployed in AWS, Azure, or GCP. Go to dataengineeringpodcast.com/ascend and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $5,000 when you become a customer.
Your host is Tobias Macey and today I’m interviewing Martin Sahlen about his work on data lineage at Alvin and how it factors into the day-to-day work of data engineers
Interview
Introduction
How did you get involved in the area of data management?
Can you describe what Alvin is and the story behind it?
What is the core problem that you are trying to solve at Alvin?
Data lineage has quickly become an overloaded term. What are the elements of lineage that you are focused on addressing?
What are some of the other sources/pieces of information that you integrate into the lineage graph?
How does data lineage show up in the work of data engineers?
In what ways does your focus on data engineers inform the way that you model the lineage information?
As with every data asset/product, the lineage graph is only as useful as the data that it stores. What are some of the ways that you focus on establishing and ensuring a complete view of lineage?
How do you account for assets (e.g. tables, dashboards, exports, etc.) that are created outside of the "officially supported" methods? (e.g. someone manually runs a SQL create statement, etc.)
Can you describe how you have implemented the Alvin platform?
How have the design and goals shifted from when you first started exploring the problem?
What are the types of data systems/assets that you are focused on supporting? (e.g. data warehouses vs. lakes, structured vs. unstructured, which BI tools, etc.)
How does Alvin fit into the workflow of data engineers and their downstream customers/collaborators?
What are some of the design choices (both visual and functional) that you focused on to avoid friction in the data engineer’s workflow?
What are some of the open questions/areas for investigation/improvement in the space of data lineage?
What are the factors that contribute to the difficulty of a truly holistic and complete view of lineage across an organization?
What are the most interesting, innovative, or unexpected ways that you have seen Alvin used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on Alvin?
When is Alvin the wrong choice?
What do you have planned for the future of Alvin?
Contact Info
LinkedIn
@martinsahlen on Twitter
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers
Links
Alvin
Unacast
sqlparse Python library
Cython
Podcast.__init__ Episode
Antlr
Kotlin programming language
PostgreSQL
Podcast Episode
OpenSearch
ElasticSearch
Redis
Kubernetes
Airflow
BigQuery
Spark
Looker
Mode
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
10/3/2022 • 56 minutes, 16 seconds
Power Your Real-Time Analytics Without The Headache Using Fivetran's Change Data Capture Integrations
Summary
Data integration from source systems to their downstream destinations is the foundational step for any data product. The increasing expectation that information be instantly accessible drives the need for reliable change data capture. The team at Fivetran has recently introduced that functionality to power real-time data products. In this episode Mark Van de Wiel explains how they integrated CDC functionality into their existing product and discusses the nuances of different approaches to change data capture from various sources.
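At its core, log-based CDC turns every change in the source database into an ordered stream of insert/update/delete events that get replayed against the destination. The episode covers how capturing that stream differs across source systems; purely as an illustrative sketch of the replay side (not Fivetran's implementation, and the event shape here is an assumption), the apply logic looks roughly like this in Python:

# Each CDC event carries the operation, the primary key, and the new row image.
# This shape is illustrative; real connectors emit richer metadata (log positions, timestamps, etc.).
events = [
    {"op": "insert", "key": 1, "row": {"id": 1, "status": "new", "amount": 50}},
    {"op": "update", "key": 1, "row": {"id": 1, "status": "paid", "amount": 50}},
    {"op": "insert", "key": 2, "row": {"id": 2, "status": "new", "amount": 75}},
    {"op": "delete", "key": 1, "row": None},
]

def apply_cdc(target, stream):
    """Replay an ordered CDC stream onto a keyed target table (a dict standing in for the warehouse)."""
    for event in stream:
        if event["op"] in ("insert", "update"):
            target[event["key"]] = event["row"]   # upsert the latest row image
        elif event["op"] == "delete":
            target.pop(event["key"], None)        # remove the row if it exists
    return target

print(apply_cdc({}, events))   # -> {2: {'id': 2, 'status': 'new', 'amount': 75}}

Because events are ordered and keyed, the replay is idempotent for upserts, which is part of why warehouses that support efficient merges make CDC ingestion much easier, a theme that comes up in the interview questions below.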
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their new managed database service you can launch a production ready MySQL, Postgres, or MongoDB cluster in minutes, with automated backups, 40 Gbps connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don’t forget to thank them for their continued support of this show!
You wake up to a Slack message from your CEO, who’s upset because the company’s revenue dashboard is broken. You’re told to fix it before this morning’s board meeting, which is just minutes away. Enter Metaplane, the industry’s only self-serve data observability tool. In just a few clicks, you identify the issue’s root cause, conduct an impact analysis—and save the day. Data leaders at Imperfect Foods, Drift, and Vendr love Metaplane because it helps them catch, investigate, and fix data quality issues before their stakeholders ever notice they exist. Setup takes 30 minutes. You can literally get up and running with Metaplane by the end of this podcast. Sign up for a free-forever plan at dataengineeringpodcast.com/metaplane, or try out their most advanced features with a 14-day free trial. Mention the podcast to get a free "In Data We Trust World Tour" t-shirt.
RudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their state-of-the-art reverse ETL pipelines enable you to send enriched data to any cloud tool. Sign up free… or just get the free t-shirt for being a listener of the Data Engineering Podcast at dataengineeringpodcast.com/rudder.
Data teams are increasingly under pressure to deliver. According to a recent survey by Ascend.io, 95% in fact reported being at or over capacity. With 72% of data experts reporting demands on their team going up faster than they can hire, it’s no surprise they are increasingly turning to automation. In fact, while only 3.5% report having current investments in automation, 85% of data teams plan on investing in automation in the next 12 months. 85%!!! That’s where our friends at Ascend.io come in. The Ascend Data Automation Cloud provides a unified platform for data ingestion, transformation, orchestration, and observability. Ascend users love its declarative pipelines, powerful SDK, elegant UI, and extensible plug-in architecture, as well as its support for Python, SQL, Scala, and Java. Ascend automates workloads on Snowflake, Databricks, BigQuery, and open source Spark, and can be deployed in AWS, Azure, or GCP. Go to dataengineeringpodcast.com/ascend and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $5,000 when you become a customer.
Your host is Tobias Macey and today I’m interviewing Mark Van de Wiel about Fivetran’s implementation of change data capture and the state of streaming data integration in the modern data stack
Interview
Introduction
How did you get involved in the area of data management?
What are some of the notable changes/advancements at Fivetran in the last 3 years?
How has the scale and scope of usage for real-time data changed in that time?
What are some of the differences in usage for real-time CDC data vs. event streams that have been the driving force for a large amount of real-time data?
What are some of the architectural shifts that are necessary in an organization’s data platform to take advantage of CDC data streams?
What are some of the shifts in e.g. cloud data warehouses that have happened/are happening to allow for ingestion and timely processing of these data feeds?
What are some of the different ways that CDC is implemented in different source systems?
What are some of the ways that CDC principles might start to bleed into e.g. APIs/SaaS systems to allow for more unified processing patterns across data sources?
What are some of the architectural/design changes that you have had to make to provide CDC for your customers at Fivetran?
What are the most interesting, innovative, or unexpected ways that you have seen CDC used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on CDC at Fivetran?
When is CDC the wrong choice?
What do you have planned for the future of CDC at Fivetran?
Contact Info
LinkedIn
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers
Links
Fivetran
Podcast Episode
HVR Software
Change Data Capture
Debezium
Podcast Episode
LogMiner
Materialize
Podcast Episode
Kafka
Kinesis
dbt
Podcast Episode
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
9/26/2022 • 49 minutes, 36 seconds
Build A Common Understanding Of Your Data Reliability Rules With Soda Core and Soda Checks Language
Summary
Regardless of how data is being used, it is critical that the information is trusted. The practice of data reliability engineering has gained momentum recently to address that need. To help support the efforts of data teams, the folks at Soda Data created the Soda Checks Language and the corresponding Soda Core utility that acts on this new DSL. In this episode Tom Baeyens explains their reasons for creating a new syntax for expressing and validating checks for data assets and processes, as well as how to incorporate it into your own projects.
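SodaCL is a YAML-based syntax for declaring checks, and Soda Core is the open source Python utility that evaluates them against a data source. As a rough sketch of how the two fit together (the data source name, configuration file, table, and thresholds below are all placeholders, not values from the episode), the programmatic entry point looks something like this:

from soda.scan import Scan  # provided by the soda-core package

scan = Scan()
scan.set_data_source_name("analytics_warehouse")        # placeholder data source
scan.add_configuration_yaml_file("configuration.yml")   # connection details live here

# SodaCL checks declared inline; these thresholds are illustrative
scan.add_sodacl_yaml_str("""
checks for dim_customer:
  - row_count > 0
  - missing_count(email) = 0
  - duplicate_count(customer_id) = 0
""")

exit_code = scan.execute()
print(scan.get_logs_text())
print("checks failed" if exit_code != 0 else "all checks passed")

The same checks file can be run from the soda CLI or embedded in an orchestrator task, which is the workflow integration the interview explores.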
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their new managed database service you can launch a production ready MySQL, Postgres, or MongoDB cluster in minutes, with automated backups, 40 Gbps connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don’t forget to thank them for their continued support of this show!
Atlan is the metadata hub for your data ecosystem. Instead of locking your metadata into a new silo, unleash its transformative potential with Atlan’s active metadata capabilities. Push information about data freshness and quality to your business intelligence, automatically scale up and down your warehouse based on usage patterns, and let the bots answer those questions in Slack so that the humans can focus on delivering real value. Go to dataengineeringpodcast.com/atlan today to learn more about how Atlan’s active metadata platform is helping pioneering data teams like Postman, Plaid, WeWork & Unilever achieve extraordinary things with metadata and escape the chaos.
Prefect is the modern Dataflow Automation platform for the modern data stack, empowering data practitioners to build, run and monitor robust pipelines at scale. Guided by the principle that the orchestrator shouldn’t get in your way, Prefect is the only tool of its kind to offer the flexibility to write code as workflows. Prefect specializes in glueing together the disparate pieces of a pipeline, and integrating with modern distributed compute libraries to bring power where you need it, when you need it. Trusted by thousands of organizations and supported by over 20,000 community members, Prefect powers over 100MM business critical tasks a month. For more information on Prefect, visit dataengineeringpodcast.com/prefect.
Data engineers don’t enjoy writing, maintaining, and modifying ETL pipelines all day, every day. Especially once they realize 90% of all major data sources like Google Analytics, Salesforce, Adwords, Facebook, Spreadsheets, etc., are already available as plug-and-play connectors with reliable, intuitive SaaS solutions. Hevo Data is a highly reliable and intuitive data pipeline platform used by data engineers from 40+ countries to set up and run low-latency ELT pipelines with zero maintenance. Boasting more than 150 out-of-the-box connectors that can be set up in minutes, Hevo also allows you to monitor and control your pipelines. You get: real-time data flow visibility, fail-safe mechanisms, and alerts if anything breaks; preload transformations and auto-schema mapping precisely control how data lands in your destination; models and workflows to transform data for analytics; and reverse-ETL capability to move the transformed data back to your business software to inspire timely action. All of this, plus its transparent pricing and 24*7 live support, makes it consistently voted by users as the Leader in the Data Pipeline category on review platforms like G2. Go to dataengineeringpodcast.com/hevodata and sign up for a free 14-day trial that also comes with 24×7 support.
Your host is Tobias Macey and today I’m interviewing Tom Baeyens about Soda Data’s new DSL for data reliability
Interview
Introduction
How did you get involved in the area of data management?
Can you describe what SodaCL is and the story behind it?
What is the scope of functionality that SodaCL is intended to address?
What are the ways that reliability is measured for data assets? (what is the equivalent to site uptime?)
What are the core abstractions that you identified for simplifying the declaration of data validations?
How did you approach the design of the SodaCL syntax to balance flexibility for various use cases, with structure and opinionated application?
Why YAML?
Can you describe how the Soda Core utility is implemented?
How have the design and scope of the SodaCL dialect and the Soda Core framework evolved since you started working on them?
What are the available integration/extension points for teams who are using Soda Core?
Can you describe how SodaCL integrates into the workflow of data and analytics engineers?
What is your process for evolving the SodaCL dialect in a maintainable and sustainable manner?
What are the most interesting, innovative, or unexpected ways that you have seen SodaCL used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on SodaCL?
When is SodaCL the wrong choice?
What do you have planned for the future of SodaCL?
Contact Info
LinkedIn
@tombaeyens on Twitter
tombaeyens on GitHub
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers
Links
Soda Data
Podcast Episode
Soda Checks Language
Great Expectations
Podcast Episode
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
9/26/2022 • 41 minutes, 1 second
Building A Shared Understanding Of Data Assets In A Business Through A Single Pane Of Glass With Workstream
Summary
There is a constant tension in business data between growing silos and breaking them down. Even when a tool is designed to integrate information as a guard against data isolation, it can easily become a silo of its own, where you have to make a point of using it to seek out information. In order to help distribute critical context about data assets and their status into the locations where work is being done, Nicholas Freund co-founded Workstream. In this episode he discusses the challenge of maintaining shared visibility and understanding of data work across the various stakeholders and his efforts to make it a seamless experience.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their new managed database service you can launch a production ready MySQL, Postgres, or MongoDB cluster in minutes, with automated backups, 40 Gbps connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don’t forget to thank them for their continued support of this show!
Atlan is the metadata hub for your data ecosystem. Instead of locking your metadata into a new silo, unleash its transformative potential with Atlan’s active metadata capabilities. Push information about data freshness and quality to your business intelligence, automatically scale up and down your warehouse based on usage patterns, and let the bots answer those questions in Slack so that the humans can focus on delivering real value. Go to dataengineeringpodcast.com/atlan today to learn more about how Atlan’s active metadata platform is helping pioneering data teams like Postman, Plaid, WeWork & Unilever achieve extraordinary things with metadata and escape the chaos.
Prefect is the modern Dataflow Automation platform for the modern data stack, empowering data practitioners to build, run and monitor robust pipelines at scale. Guided by the principle that the orchestrator shouldn’t get in your way, Prefect is the only tool of its kind to offer the flexibility to write code as workflows. Prefect specializes in glueing together the disparate pieces of a pipeline, and integrating with modern distributed compute libraries to bring power where you need it, when you need it. Trusted by thousands of organizations and supported by over 20,000 community members, Prefect powers over 100MM business critical tasks a month. For more information on Prefect, visit dataengineeringpodcast.com/prefect.
Data engineers don’t enjoy writing, maintaining, and modifying ETL pipelines all day, every day. Especially once they realize 90% of all major data sources like Google Analytics, Salesforce, Adwords, Facebook, Spreadsheets, etc., are already available as plug-and-play connectors with reliable, intuitive SaaS solutions. Hevo Data is a highly reliable and intuitive data pipeline platform used by data engineers from 40+ countries to set up and run low-latency ELT pipelines with zero maintenance. Boasting more than 150 out-of-the-box connectors that can be set up in minutes, Hevo also allows you to monitor and control your pipelines. You get: real-time data flow visibility, fail-safe mechanisms, and alerts if anything breaks; preload transformations and auto-schema mapping precisely control how data lands in your destination; models and workflows to transform data for analytics; and reverse-ETL capability to move the transformed data back to your business software to inspire timely action. All of this, plus its transparent pricing and 24*7 live support, makes it consistently voted by users as the Leader in the Data Pipeline category on review platforms like G2. Go to dataengineeringpodcast.com/hevodata and sign up for a free 14-day trial that also comes with 24×7 support.
Your host is Tobias Macey and today I’m interviewing Nicholas Freund about Workstream, a platform aimed at providing a single pane of glass for analytics in your organization
Interview
Introduction
How did you get involved in the area of data management?
Can you describe what Workstream is and the story behind it?
What is the core problem that you are trying to solve at Workstream?
How does that problem manifest for the different stakeholders in an organization?
What are the contributing factors that lead to fragmentation of visibility for data workflows at different stages?
What are the sources of information that you use to build a cohesive view of an organization’s data assets?
What are the lifecycle stages of a data asset that are most often overlooked or un-maintained?
What are the risks and challenges associated with retirement of a data asset?
Can you describe how Workstream is implemented?
How have the design and goals of the system changed since you first started it?
What does the day-to-day interaction with workstream look like for different roles in a company?
What are the long-range impacts on team behaviors/productivity/capacity that you hope to catalyze?
What are the most interesting, innovative, or unexpected ways that you have seen Workstream used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on Workstream?
When is Workstream the wrong choice?
What do you have planned for the future of Workstream?
Contact Info
LinkedIn
@nickfreund on Twitter
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers
Links
Workstream
Data Catalog
Entropy
CDP == Customer Data Platform
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
9/19/2022 • 54 minutes, 51 seconds
Operational Analytics To Increase Efficiency For Multi-Location Businesses With OpsAnalitica
Summary
In order to improve efficiency in any business you must first know what is contributing to wasted effort or missed opportunities. When your business operates across multiple locations it becomes even more challenging and important to gain insights into how work is being done. In this episode Tommy Yionoulis shares his experiences working in the service and hospitality industries and how that led him to found OpsAnalitica, a platform for collecting and analyzing metrics on multi-location businesses and their operational practices. He discusses the challenges of making data collection purposeful and efficient without distracting employees from their primary duties and how business owners can use the provided analytics to support their staff in their duties.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their new managed database service you can launch a production ready MySQL, Postgres, or MongoDB cluster in minutes, with automated backups, 40 Gbps connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don’t forget to thank them for their continued support of this show!
RudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their state-of-the-art reverse ETL pipelines enable you to send enriched data to any cloud tool. Sign up free… or just get the free t-shirt for being a listener of the Data Engineering Podcast at dataengineeringpodcast.com/rudder.
Data teams are increasingly under pressure to deliver. According to a recent survey by Ascend.io, 95% in fact reported being at or over capacity. With 72% of data experts reporting demands on their team going up faster than they can hire, it’s no surprise they are increasingly turning to automation. In fact, while only 3.5% report having current investments in automation, 85% of data teams plan on investing in automation in the next 12 months. 85%!!! That’s where our friends at Ascend.io come in. The Ascend Data Automation Cloud provides a unified platform for data ingestion, transformation, orchestration, and observability. Ascend users love its declarative pipelines, powerful SDK, elegant UI, and extensible plug-in architecture, as well as its support for Python, SQL, Scala, and Java. Ascend automates workloads on Snowflake, Databricks, BigQuery, and open source Spark, and can be deployed in AWS, Azure, or GCP. Go to dataengineeringpodcast.com/ascend and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $5,000 when you become a customer.
You wake up to a Slack message from your CEO, who’s upset because the company’s revenue dashboard is broken. You’re told to fix it before this morning’s board meeting, which is just minutes away. Enter Metaplane, the industry’s only self-serve data observability tool. In just a few clicks, you identify the issue’s root cause, conduct an impact analysis—and save the day. Data leaders at Imperfect Foods, Drift, and Vendr love Metaplane because it helps them catch, investigate, and fix data quality issues before their stakeholders ever notice they exist. Setup takes 30 minutes. You can literally get up and running with Metaplane by the end of this podcast. Sign up for a free-forever plan at dataengineeringpodcast.com/metaplane, or try out their most advanced features with a 14-day free trial. Mention the podcast to get a free "In Data We Trust World Tour" t-shirt.
Your host is Tobias Macey and today I’m interviewing Tommy Yionoulis about using data to improve efficiencies in multi-location service businesses with OpsAnalitica
Interview
Introduction
How did you get involved in the area of data management?
Can you describe what OpsAnalitica is and the story behind it?
What are some examples of the types of questions that business owners and site managers need to answer in order to run their operations?
What are the sources of information that are needed to be able to answer these questions?
In the absence of a platform like OpsAnalitica, how are business operations getting the answers to these questions?
What are some of the sources of inefficiency that they are contending with?
How do those inefficiencies compound as you scale the number of locations?
Can you describe how the OpsAnalitica system is implemented?
How have the design and goals of the platform evolved since you started working on it?
Can you describe the workflow for a business using OpsAnalitica?
What are some of the biggest integration challenges that you have to address?
What are some of the design elements that you have invested in to reduce errors and complexity for employees tracking relevant metrics?
What are the most interesting, innovative, or unexpected ways that you have seen OpsAnalitica used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on OpsAnalitica?
When is OpsAnalitica the wrong choice?
What do you have planned for the future of OpsAnalitica?
Contact Info
LinkedIn
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers
Links
OpsAnalitica
Quiznos
FormRouter
Cooper Atkins(?)
SensorThings API
The Founder movie
Toast
Looker
Podcast Episode
Power BI
Podcast Episode
Pareto Principle
Decisions workflow platform
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
9/19/2022 • 1 hour, 32 minutes, 3 seconds
Building Data Pipelines That Run From Source To Analysis And Activation With Hevo Data
Summary
Any business that wants to understand their operations and customers through data requires some form of pipeline. Building reliable data pipelines is a complex and costly undertaking with many layered requirements. In order to reduce the amount of time and effort required to build pipelines that power critical insights Manish Jethani co-founded Hevo Data. In this episode he shares his journey from building a consumer product to launching a data pipeline service and how his frustrations as a product owner have informed his work at Hevo Data.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their new managed database service you can launch a production ready MySQL, Postgres, or MongoDB cluster in minutes, with automated backups, 40 Gbps connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don’t forget to thank them for their continued support of this show!
Data stacks are becoming more and more complex. This brings infinite possibilities for data pipelines to break and a host of other issues, severely deteriorating the quality of the data and causing teams to lose trust. Sifflet solves this problem by acting as an overseeing layer to the data stack – observing data and ensuring it’s reliable from ingestion all the way to consumption. Whether the data is in transit or at rest, Sifflet can detect data quality anomalies, assess business impact, identify the root cause, and alert data teams’ on their preferred channels. All thanks to 50+ quality checks, extensive column-level lineage, and 20+ connectors across the Data Stack. In addition, data discovery is made easy through Sifflet’s information-rich data catalog with a powerful search engine and real-time health statuses. Listeners of the podcast will get $2000 to use as platform credits when signing up to use Sifflet. Sifflet also offers a 2-week free trial. Find out more at dataengineeringpodcast.com/sifflet today!
RudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their state-of-the-art reverse ETL pipelines enable you to send enriched data to any cloud tool. Sign up free… or just get the free t-shirt for being a listener of the Data Engineering Podcast at dataengineeringpodcast.com/rudder.
Data teams are increasingly under pressure to deliver. According to a recent survey by Ascend.io, 95% in fact reported being at or over capacity. With 72% of data experts reporting demands on their team going up faster than they can hire, it’s no surprise they are increasingly turning to automation. In fact, while only 3.5% report having current investments in automation, 85% of data teams plan on investing in automation in the next 12 months. 85%!!! That’s where our friends at Ascend.io come in. The Ascend Data Automation Cloud provides a unified platform for data ingestion, transformation, orchestration, and observability. Ascend users love its declarative pipelines, powerful SDK, elegant UI, and extensible plug-in architecture, as well as its support for Python, SQL, Scala, and Java. Ascend automates workloads on Snowflake, Databricks, BigQuery, and open source Spark, and can be deployed in AWS, Azure, or GCP. Go to dataengineeringpodcast.com/ascend and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $5,000 when you become a customer.
Your host is Tobias Macey and today I’m interviewing Manish Jethani about Hevo Data’s experiences navigating the modern data stack and the role of ELT in data workflows
Interview
Introduction
How did you get involved in the area of data management?
Can you describe what Hevo Data is and the story behind it?
What is the core problem that you are trying to solve with the Hevo platform?
What are the target personas of who will bring Hevo into a company and who will be using/interacting with it for their day-to-day?
What are some of the lessons that you learned building a product that relied on data to function which you have carried into your work at Hevo, providing the utilities that enable other businesses and products?
There are numerous commercial and open source options for collecting, transforming, and integrating data. What are the differentiating features of Hevo?
What are your views on the benefits of a vertically integrated platform for data flows in the world of the disaggregated "modern data stack"?
Can you describe how the Hevo platform is implemented?
What are some of the optimizations that you have invested in to support the aggregate load from your customers?
The predominant pattern in recent years for collecting and processing data is ELT. In your work at Hevo, what are some of the nuance and exceptions to that "best practice" that you have encountered?
How have you factored those learnings back into the product?
mechanics of schema mapping
edge cases that require human intervention
how to surface those in a timely fashion
What is the process for onboarding onto the Hevo platform?
Once an organization has adopted Hevo, can you describe the workflow of building/maintaining/evolving data pipelines?
What are the most interesting, innovative, or unexpected ways that you have seen Hevo used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on Hevo?
When is Hevo the wrong choice?
What do you have planned for the future of Hevo?
Contact Info
LinkedIn
@ManishJethani on Twitter
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers
Links
Hevo Data
Kafka
MongoDB
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
9/12/2022 • 57 minutes, 15 seconds
Build Confidence In Your Data Platform With Schema Compatibility Reports That Span Systems And Domains Using Schemata
Summary
Data engineering systems are complex and interconnected with myriad and often opaque chains of dependencies. As they scale, the problems of visibility and dependency management can increase at an exponential rate. In order to turn this into a tractable problem, one approach is to define and enforce contracts between producers and consumers of data. Ananth Packkildurai created Schemata as a way to make the creation of schema contracts a lightweight process, allowing the dependency chains to be constructed and evolved iteratively and integrating validation of changes into standard delivery systems. In this episode he shares the design of the project and how it fits into your development practices.
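Schemata models these contracts around annotated schema definitions (Protocol Buffers appear in the links below), but as a generic illustration of the producer/consumer contract idea rather than Schemata’s actual API, the sketch below uses the jsonschema library to validate an event against an agreed schema before it is published. The event fields and helper function are hypothetical.

```python
# Illustrative only -- this is NOT Schemata's API. It sketches the general idea
# of a schema contract: a producer validates events against an agreed schema
# before emitting them, so consumers can rely on the shape of the data.
from jsonschema import ValidationError, validate

# The "contract" both sides agree on; the field names are hypothetical.
ORDER_CREATED_CONTRACT = {
    "type": "object",
    "required": ["order_id", "customer_id", "amount"],
    "properties": {
        "order_id": {"type": "string"},
        "customer_id": {"type": "string"},
        "amount": {"type": "number", "minimum": 0},
    },
    "additionalProperties": False,
}

def publish_order_created(event: dict) -> None:
    try:
        validate(instance=event, schema=ORDER_CREATED_CONTRACT)
    except ValidationError as err:
        # In a delivery pipeline this failure would block the change or the
        # message, surfacing the broken contract before consumers see it.
        raise RuntimeError(f"order_created violates its contract: {err.message}")
    print("event accepted:", event["order_id"])  # hand off to the messaging system here

publish_order_created({"order_id": "o-1", "customer_id": "c-9", "amount": 42.0})
```

The point of tooling like Schemata is to make declaring and versioning that agreed schema lightweight enough that it can evolve iteratively alongside the code that produces and consumes it.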
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their new managed database service you can launch a production ready MySQL, Postgres, or MongoDB cluster in minutes, with automated backups, 40 Gbps connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don’t forget to thank them for their continued support of this show!
Atlan is the metadata hub for your data ecosystem. Instead of locking your metadata into a new silo, unleash its transformative potential with Atlan’s active metadata capabilities. Push information about data freshness and quality to your business intelligence, automatically scale up and down your warehouse based on usage patterns, and let the bots answer those questions in Slack so that the humans can focus on delivering real value. Go to dataengineeringpodcast.com/atlan today to learn more about how Atlan’s active metadata platform is helping pioneering data teams like Postman, Plaid, WeWork & Unilever achieve extraordinary things with metadata and escape the chaos.
Prefect is the modern Dataflow Automation platform for the modern data stack, empowering data practitioners to build, run and monitor robust pipelines at scale. Guided by the principle that the orchestrator shouldn’t get in your way, Prefect is the only tool of its kind to offer the flexibility to write code as workflows. Prefect specializes in glueing together the disparate pieces of a pipeline, and integrating with modern distributed compute libraries to bring power where you need it, when you need it. Trusted by thousands of organizations and supported by over 20,000 community members, Prefect powers over 100MM business critical tasks a month. For more information on Prefect, visit dataengineeringpodcast.com/prefect.
Your host is Tobias Macey and today I’m interviewing Ananth Packkildurai about Schemata, a modelling framework for decentralised domain-driven ownership of data.
Interview
Introduction
How did you get involved in the area of data management?
Can you describe what Schemata is and the story behind it?
How does the garbage in/garbage out problem manifest in data warehouse/data lake environments?
What are the different places in a data system that schema definitions need to be established?
What are the different ways that schema management gets complicated across those various points of interaction?
Can you walk me through the end-to-end flow of how Schemata integrates with engineering practices across an organization’s data lifecycle?
How does the use of Schemata help with capturing and propagating context that would otherwise be lost or siloed?
How is the Schemata utility implemented?
What are some of the design and scope questions that you had to work through while developing Schemata?
What is the broad vision that you have for Schemata and its impact on data practices?
How are you balancing the need for flexibility/adaptability with the desire for ease of adoption and quick wins?
The core of the utility is the generation of structured messages. How are those messages propagated, stored, and analyzed?
What are the pieces of Schemata and its usage that are still undefined?
What are the most interesting, innovative, or unexpected ways that you have seen Schemata used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on Schemata?
When is Schemata the wrong choice?
What do you have planned for the future of Schemata?
Contact Info
ananthdurai on GitHub
@ananthdurai on Twitter
LinkedIn
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers
Links
Schemata
Data Engineering Weekly
Zendesk
Ralph Kimball
Data Warehouse Toolkit
Iteratively
Podcast Episode
Protocol Buffers (protobuf)
Application Tracing
OpenTelemetry
Django
Spring Framework
Dependency Injection
JSON Schema
dbt
Podcast Episode
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
9/12/2022 • 59 minutes, 39 seconds
Introduce Climate Analytics Into Your Data Platform Without The Heavy Lifting Using Sust Global
Summary
The global climate impacts everyone, and the rate of change introduces many questions that businesses need to consider. Getting answers to those questions is challenging, because the climate is a multidimensional and constantly evolving system. Sust Global was created to provide curated data sets for organizations to be able to analyze climate information in the context of their business needs. In this episode Gopal Erinjippurath discusses the data engineering challenges of building and serving those data sets, and how they are distilling complex climate information into consumable facts so you don’t have to be an expert to understand it.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their new managed database service you can launch a production ready MySQL, Postgres, or MongoDB cluster in minutes, with automated backups, 40 Gbps connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don’t forget to thank them for their continued support of this show!
Data stacks are becoming more and more complex. This brings infinite possibilities for data pipelines to break and a host of other issues, severely deteriorating the quality of the data and causing teams to lose trust. Sifflet solves this problem by acting as an overseeing layer to the data stack – observing data and ensuring it’s reliable from ingestion all the way to consumption. Whether the data is in transit or at rest, Sifflet can detect data quality anomalies, assess business impact, identify the root cause, and alert data teams’ on their preferred channels. All thanks to 50+ quality checks, extensive column-level lineage, and 20+ connectors across the Data Stack. In addition, data discovery is made easy through Sifflet’s information-rich data catalog with a powerful search engine and real-time health statuses. Listeners of the podcast will get $2000 to use as platform credits when signing up to use Sifflet. Sifflet also offers a 2-week free trial. Find out more at dataengineeringpodcast.com/sifflet today!
The biggest challenge with modern data systems is understanding what data you have, where it is located, and who is using it. Select Star’s data discovery platform solves that out of the box, with an automated catalog that includes lineage from where the data originated, all the way to which dashboards rely on it and who is viewing them every day. Just connect it to your database/data warehouse/data lakehouse/whatever you’re using and let them do the rest. Go to dataengineeringpodcast.com/selectstar today to double the length of your free trial and get a swag package when you convert to a paid plan.
Data teams are increasingly under pressure to deliver. According to a recent survey by Ascend.io, 95% in fact reported being at or over capacity. With 72% of data experts reporting demands on their team going up faster than they can hire, it’s no surprise they are increasingly turning to automation. In fact, while only 3.5% report having current investments in automation, 85% of data teams plan on investing in automation in the next 12 months. 85%!!! That’s where our friends at Ascend.io come in. The Ascend Data Automation Cloud provides a unified platform for data ingestion, transformation, orchestration, and observability. Ascend users love its declarative pipelines, powerful SDK, elegant UI, and extensible plug-in architecture, as well as its support for Python, SQL, Scala, and Java. Ascend automates workloads on Snowflake, Databricks, BigQuery, and open source Spark, and can be deployed in AWS, Azure, or GCP. Go to dataengineeringpodcast.com/ascend and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $5,000 when you become a customer.
Your host is Tobias Macey and today I’m interviewing Gopal Erinjippurath about his work at Sust Global building data sets from geospatial and satellite information to power climate analytics
Interview
Introduction
How did you get involved in the area of data management?
Can you describe what Sust Global is and the story behind it?
What audience(s) are you focused on?
Climate change is obviously a huge topic in the zeitgeist and has been growing in importance. What are the data sources that you are working with to derive climate information?
What role do you view Sust Global having in addressing climate change?
How are organizations using your climate information assets to inform their analytics and business operations?
What are the types of questions that they are asking about the role of climate (present and future) for their business activities?
How can they use the climate information that you provide to understand their impact on the planet?
What are some of the educational efforts that you need to undertake to ensure that your end-users understand the context and appropriate semantics of the data that you are providing? (e.g. concepts around climate science, statistically meaningful interpretations of aggregations, etc.)
Can you describe how you have architected the Sust Global platform?
What are some examples of the types of data workflows and transformations that are necessary to maintain your customer-facing services?
How have you approached the question of modeling for the data that you provide to end-users to make it straightforward to integrate and analyze the information?
What is your process for determining relevant granularities of data and normalizing scales? (e.g. time and distance)
What is involved in integrating with the Sust Global platform and how does it fit into the workflow of data engineers/analysts/data scientists at your customer organizations?
Any analytical task is an exercise in story-telling. What are some of the techniques that you and your customers have found useful to make climate data relatable and understandable?
What are some of the challenges involved in mapping between micro and macro level insights and translating them effectively for the consumer?
How do the increasing sensor capabilities and scale of coverage manifest in your data?
How do you account for increasing coverage when analyzing across longer historical time scales?
How do you balance the need to build a sustainable business with the importance of access to the information that you are working with?
What are the most interesting, innovative, or unexpected ways that you have seen Sust Global used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on Sust Global?
When is Sust the wrong choice?
What do you have planned for the future of Sust Global?
Contact Info
LinkedIn
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers
Links
Sust Global
Planet Labs
Carbon Capture
IPCC
Data Lodge(?)
6th Assessment Report
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
9/5/2022 • 54 minutes, 18 seconds
A Reflection On Data Observability As It Reaches Broader Adoption
Summary
Data observability is a product category that has seen massive growth and adoption in recent years. Monte Carlo is in the vanguard of companies who have been enabling data teams to observe and understand their complex data systems. In this episode founders Barr Moses and Lior Gavish rejoin the show to reflect on the evolution and adoption of data observability technologies and the capabilities that are being introduced as the broader ecosystem adopts the practices.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their new managed database service you can launch a production ready MySQL, Postgres, or MongoDB cluster in minutes, with automated backups, 40 Gbps connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don’t forget to thank them for their continued support of this show!
Atlan is the metadata hub for your data ecosystem. Instead of locking your metadata into a new silo, unleash its transformative potential with Atlan’s active metadata capabilities. Push information about data freshness and quality to your business intelligence, automatically scale up and down your warehouse based on usage patterns, and let the bots answer those questions in Slack so that the humans can focus on delivering real value. Go to dataengineeringpodcast.com/atlan today to learn more about how Atlan’s active metadata platform is helping pioneering data teams like Postman, Plaid, WeWork & Unilever achieve extraordinary things with metadata and escape the chaos.
RudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their state-of-the-art reverse ETL pipelines enable you to send enriched data to any cloud tool. Sign up free… or just get the free t-shirt for being a listener of the Data Engineering Podcast at dataengineeringpodcast.com/rudder.
The only thing worse than having bad data is not knowing that you have it. With Bigeye’s data observability platform, if there is an issue with your data or data pipelines you’ll know right away and can get it fixed before the business is impacted. Bigeye lets data teams measure, improve, and communicate the quality of your data to company stakeholders. With complete API access, a user-friendly interface, and automated yet flexible alerting, you’ve got everything you need to establish and maintain trust in your data. Go to dataengineeringpodcast.com/bigeye today to sign up and start trusting your analyses.
Your host is Tobias Macey and today I’m interviewing Barr Moses and Lior Gavish about the state of the market for data observability and their own work at Monte Carlo
Interview
Introduction
How did you get involved in the area of data management?
Can you give the elevator pitch for Monte Carlo?
What are the notable changes in the Monte Carlo product and business since our last conversation in October 2020?
You were one of the early entrants in the market of data quality/data observability products. In your work to gain visibility and traction you invested substantially in content creation (blog posts, presentations, round table conversations, etc.). How would you summarize the focus of your initial efforts?
Why do you think data observability has really taken off? A few years ago, the category barely existed – what’s changed?
There’s a larger debate within the data engineering community regarding whether it makes sense to go deep or go broad when it comes to monitoring your data. In other words, do you start with a few important data sets, or do you attempt to cover the entire ecosystem? What is your take?
For engineers and teams who are just now investigating and investing in observability/quality automation for their data, what are their motivations?
How has the conversation around the value/motivating factors matured or changed over the past couple of years?
In what way have the requirements and capabilities of data observability platforms shifted?
What are the forces in the ecosystem that have driven those changes?
How has the scope and vision for your work at Monte Carlo evolved as the understanding and impact of data quality have become more widespread?
When teams invest in data quality/observability what are some of the ways that the insights gained influence their other priorities and design choices? (e.g. platform design, pipeline design, data usage, etc.)
When it comes to selecting what parts of the data stack to invest in, how do data leaders prioritize? For instance, when does it make sense to build or buy a data catalog? A data observability platform?
The adoption of any tool that adds constraints is a delicate balance. What have you found to be the predominant patterns for teams who are incorporating Monte Carlo? (e.g. maintaining delivery velocity and adding safety/trust)
A corollary to the goal of data engineers for higher reliability and visibility is the need by the business/team leadership to identify "return on investment". How do you and your customers think about the useful metrics and measurement goals to justify the time spent on "non-functional" requirements?
What are the most interesting, innovative, or unexpected ways that you have seen Monte Carlo used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on Monte Carlo?
When is Monte Carlo the wrong choice?
What do you have planned for the future of Monte Carlo?
Contact Info
Barr
LinkedIn
@BM_DataDowntime on Twitter
Lior
LinkedIn
@lgavish on Twitter
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers
Links
Monte Carlo
Podcast Episode
App Dynamics
Datadog
New Relic
Data Quality Fundamentals book
State Of Data Quality Survey
dbt
Podcast Episode
Airflow
Dagster
Podcast Episode
Episode: Incident Management For Data Teams
Databricks Delta
Patch.tech Snowflake APIs
Hightouch
Podcast Episode
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
9/5/2022 • 58 minutes, 39 seconds
An Exploration Of What Data Automation Can Provide To Data Engineers And Ascend's Journey To Make It A Reality
Summary
The dream of every engineer is to automate all of their tasks. For data engineers, this is a monumental undertaking. Orchestration engines are one step in that direction, but they are not a complete solution. In this episode Sean Knapp shares his views on what constitutes proper automation and the work that he and his team at Ascend are doing to help make it a reality.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their new managed database service you can launch a production ready MySQL, Postgres, or MongoDB cluster in minutes, with automated backups, 40 Gbps connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don’t forget to thank them for their continued support of this show!
Atlan is the metadata hub for your data ecosystem. Instead of locking your metadata into a new silo, unleash its transformative potential with Atlan’s active metadata capabilities. Push information about data freshness and quality to your business intelligence, automatically scale up and down your warehouse based on usage patterns, and let the bots answer those questions in Slack so that the humans can focus on delivering real value. Go to dataengineeringpodcast.com/atlan today to learn more about how Atlan’s active metadata platform is helping pioneering data teams like Postman, Plaid, WeWork & Unilever achieve extraordinary things with metadata and escape the chaos.
RudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their state-of-the-art reverse ETL pipelines enable you to send enriched data to any cloud tool. Sign up free… or just get the free t-shirt for being a listener of the Data Engineering Podcast at dataengineeringpodcast.com/rudder.
The only thing worse than having bad data is not knowing that you have it. With Bigeye’s data observability platform, if there is an issue with your data or data pipelines you’ll know right away and can get it fixed before the business is impacted. Bigeye lets data teams measure, improve, and communicate the quality of your data to company stakeholders. With complete API access, a user-friendly interface, and automated yet flexible alerting, you’ve got everything you need to establish and maintain trust in your data. Go to dataengineeringpodcast.com/bigeye today to sign up and start trusting your analyses.
Your host is Tobias Macey and today I’m interviewing Sean Knapp about the role of data automation in building maintainable systems
Interview
Introduction
How did you get involved in the area of data management?
Can you describe what you mean by the term "data automation" and the assumptions that it includes?
One of the perennial challenges of automation is that there are always steps that are resistant to being performed without human involvement. What are some of the tasks that you have found to be common problems in that sense?
What are the different concerns that need to be included in a stack that supports fully automated data workflows?
There was recently an interesting article suggesting that the "left-to-right" approach to data workflows is backwards. In your experience, what would be required to allow for triggering data processes based on the needs of the data consumers? (e.g. "make sure that this BI dashboard is up to date every 6 hours")
What are the tasks that are most complex to build automation for?
What are some companies or tools/platforms that you consider to be exemplars of "data automation done right"?
What are the common themes/patterns that they build from?
How have you approached the need for data automation in the implementation of the Ascend product?
How have the requirements for data automation changed as data plays a more prominent role in a growing number of businesses?
What are the foundational elements that are unchanging?
What are the most interesting, innovative, or unexpected ways that you have seen data automation implemented?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on data automation at Ascend?
What are some of the ways that data automation can go wrong?
What are you keeping an eye on across the data ecosystem?
Contact Info
@seanknapp on Twitter
LinkedIn
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers
Links
Ascend
Podcast Episode
Google Sawzall
CI/CD
Airflow
Kubernetes
Ascend FlexCode
MongoDB
SHA == Secure Hash Algorithm
dbt
Podcast Episode
Materialized View
Great Expectations
Podcast Episode
Monte Carlo
Podcast Episode
OpenLineage
Podcast Episode
Open Metadata
Podcast Episode
Egeria
OOM == Out Of Memory
Five Whys
Data Mesh
Data Fabric
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
8/29/2022 • 1 hour, 3 minutes, 32 seconds
Alumni Of AirBnB's Early Years Reflect On What They Learned About Building Data Driven Organizations
Summary
AirBnB pioneered a number of the organizational practices that have become the goal of modern data teams. Out of that culture a number of successful businesses were created to provide the tools and methods to a broader audience. In this episode several alumni of AirBnB’s formative years who have gone on to found their own companies join the show to reflect on their shared successes, missed opportunities, and lessons learned.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their new managed database service you can launch a production ready MySQL, Postgres, or MongoDB cluster in minutes, with automated backups, 40 Gbps connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don’t forget to thank them for their continued support of this show!
Data stacks are becoming more and more complex. This brings infinite possibilities for data pipelines to break and a host of other issues, severely deteriorating the quality of the data and causing teams to lose trust. Sifflet solves this problem by acting as an overseeing layer to the data stack – observing data and ensuring it’s reliable from ingestion all the way to consumption. Whether the data is in transit or at rest, Sifflet can detect data quality anomalies, assess business impact, identify the root cause, and alert data teams’ on their preferred channels. All thanks to 50+ quality checks, extensive column-level lineage, and 20+ connectors across the Data Stack. In addition, data discovery is made easy through Sifflet’s information-rich data catalog with a powerful search engine and real-time health statuses. Listeners of the podcast will get $2000 to use as platform credits when signing up to use Sifflet. Sifflet also offers a 2-week free trial. Find out more at dataengineeringpodcast.com/sifflet today!
The biggest challenge with modern data systems is understanding what data you have, where it is located, and who is using it. Select Star’s data discovery platform solves that out of the box, with an automated catalog that includes lineage from where the data originated, all the way to which dashboards rely on it and who is viewing them every day. Just connect it to your database/data warehouse/data lakehouse/whatever you’re using and let them do the rest. Go to dataengineeringpodcast.com/selectstar today to double the length of your free trial and get a swag package when you convert to a paid plan.
Data teams are increasingly under pressure to deliver. According to a recent survey by Ascend.io, 95% in fact reported being at or over capacity. With 72% of data experts reporting demands on their team going up faster than they can hire, it’s no surprise they are increasingly turning to automation. In fact, while only 3.5% report having current investments in automation, 85% of data teams plan on investing in automation in the next 12 months. 85%!!! That’s where our friends at Ascend.io come in. The Ascend Data Automation Cloud provides a unified platform for data ingestion, transformation, orchestration, and observability. Ascend users love its declarative pipelines, powerful SDK, elegant UI, and extensible plug-in architecture, as well as its support for Python, SQL, Scala, and Java. Ascend automates workloads on Snowflake, Databricks, BigQuery, and open source Spark, and can be deployed in AWS, Azure, or GCP. Go to dataengineeringpodcast.com/ascend and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $5,000 when you become a customer.
Your host is Tobias Macey and today I’m interviewing Lindsay Pettingill, Chetan Sharma, Swaroop Jagadish, Maxime Beauchemin, and Nick Handel about the lessons that they learned in their time at AirBnB and how they are carrying that forward to their respective companies
Interview
Introduction
How did you get involved in the area of data management?
You all worked at AirBnB in similar time frames and then went on to found data-focused companies that are finding success in their respective categories. Do you consider it an outgrowth of the specific company culture/work involved or a curiosity of the moment in time for the data industry that led you each in that direction?
What are the elements of AirBnB’s data culture that you feel were done right?
What do you see as the critical decisions/inflection points in the company’s growth that led you down that path?
Every journey has its detours and dead-ends. What are the mistakes that were made (individual and collective) that were most instructive for you?
What about that experience and other experiences led you each to go your respective directions with data startups?
Was your motivation to start a company addressing the work that you did at AirBnB due to the desire to build on existing success, or the need to fix a nagging frustration?
What are the critical lessons for data teams that you are focused on teaching to engineers inside and outside your company?
What are your predictions for the next 5 years of data?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on translating your experiences at AirBnB into successful products?
Contact Info
Lindsay
LinkedIn
@lpettingill on Twitter
Chetan
LinkedIn
@chesharma87 on Twitter
Maxime
mistercrunch on GitHub
LinkedIn
@mistercrunch on Twitter
Swaroop
swaroopjagadish on GitHub
LinkedIn
@arudis on Twitter
Nick
LinkedIn
@NicholasHandel on Twitter
nhandel on GitHub
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers
Links
Iggy
Eppo
Podcast Episode
Acryl
Podcast Episode
DataHub
Preset
Superset
Podcast Episode
Airflow
Transform
Podcast Episode
Deutsche Bank
Ubisoft
BlackRock
Kafka
Pinot
Stata
R
Knowledge-Repo
Podcast.__init__ Episode
AirBnB Almond Flour Cookie Recipe
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
8/28/2022 • 1 hour, 10 minutes, 14 seconds
Understanding The Role Of The Chief Data Officer
Summary
The position of Chief Data Officer (CDO) is relatively new in the business world and has not been universally adopted. As a result, not everyone understands what the responsibilities of the role are, when you need one, and how to hire for it. In this episode Tracy Daniels, CDO of Truist, shares her journey into the position, her responsibilities, and her relationship to the data professionals in her organization.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their new managed database service you can launch a production ready MySQL, Postgres, or MongoDB cluster in minutes, with automated backups, 40 Gbps connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don’t forget to thank them for their continued support of this show!
Atlan is the metadata hub for your data ecosystem. Instead of locking your metadata into a new silo, unleash its transformative potential with Atlan’s active metadata capabilities. Push information about data freshness and quality to your business intelligence, automatically scale up and down your warehouse based on usage patterns, and let the bots answer those questions in Slack so that the humans can focus on delivering real value. Go to dataengineeringpodcast.com/atlan today to learn more about how Atlan’s active metadata platform is helping pioneering data teams like Postman, Plaid, WeWork & Unilever achieve extraordinary things with metadata and escape the chaos.
RudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their state-of-the-art reverse ETL pipelines enable you to send enriched data to any cloud tool. Sign up free… or just get the free t-shirt for being a listener of the Data Engineering Podcast at dataengineeringpodcast.com/rudder.
The only thing worse than having bad data is not knowing that you have it. With Bigeye’s data observability platform, if there is an issue with your data or data pipelines you’ll know right away and can get it fixed before the business is impacted. Bigeye lets data teams measure, improve, and communicate the quality of your data to company stakeholders. With complete API access, a user-friendly interface, and automated yet flexible alerting, you’ve got everything you need to establish and maintain trust in your data. Go to dataengineeringpodcast.com/bigeye today to sign up and start trusting your analyses.
Your host is Tobias Macey and today I’m interviewing Tracy Daniels about the role and responsibilities of the Chief Data Officer and how it is evolving along with the ecosystem
Interview
Introduction
How did you get involved in the area of data management?
Can you describe what your path to CDO of Truist has been?
As a CDO, what are your responsibilities and scope of influence?
Not every organization has an explicit position for the CDO. What are the factors that determine when that should be a distinct role?
What is the relationship and potential overlap with a CTO?
As the CDO of Truist, what are some of the projects/activities that are vying for your time and attention?
Can you share the composition of your teams and how you think about organizational structure and integration for data professionals in your company?
What are the industry and business trends that are having the greatest impact on your work as a CDO?
How has your role evolved over the past few years?
What are some of the organizational politics/pressures that you have had to navigate to achieve your objectives?
What are some of the ways that priorities at the C-level can be at cross purposes to those of the CDO?
What are some of the skills and experiences that you have found most useful in your work as CDO?
What are the most interesting, innovative, or unexpected ways that you have seen the CDO position/responsibilities addressed in other organizations?
What are the most interesting, unexpected, or challenging lessons that you have learned while working as a CDO?
When is a distinct CDO position the wrong choice for an organization?
What advice do you have for anyone who is interested in charting a career path to the CDO seat?
Contact Info
LinkedIn
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
Truist
Chief Data Officer
Chief Analytics Officer
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
8/22/2022 • 47 minutes, 10 seconds
An Exploration Of The Expectations, Ecosystem, and Realities Of Real-Time Data Applications
Summary
Data has permeated every aspect of our lives and the products that we interact with. As a result, end users and customers have come to expect interactions and updates with services and analytics to be fast and up to date. In this episode Shruti Bhat gives her view on the state of the ecosystem for real-time data and the work that she and her team at Rockset are doing to make it easier for engineers to build those experiences.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their new managed database service you can launch a production ready MySQL, Postgres, or MongoDB cluster in minutes, with automated backups, 40 Gbps connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don’t forget to thank them for their continued support of this show!
Data stacks are becoming more and more complex. This brings infinite possibilities for data pipelines to break and a host of other issues, severely deteriorating the quality of the data and causing teams to lose trust. Sifflet solves this problem by acting as an overseeing layer to the data stack – observing data and ensuring it’s reliable from ingestion all the way to consumption. Whether the data is in transit or at rest, Sifflet can detect data quality anomalies, assess business impact, identify the root cause, and alert data teams on their preferred channels. All thanks to 50+ quality checks, extensive column-level lineage, and 20+ connectors across the data stack. In addition, data discovery is made easy through Sifflet’s information-rich data catalog with a powerful search engine and real-time health statuses. Listeners of the podcast will get $2000 to use as platform credits when signing up to use Sifflet. Sifflet also offers a 2-week free trial. Find out more at dataengineeringpodcast.com/sifflet today!
The biggest challenge with modern data systems is understanding what data you have, where it is located, and who is using it. Select Star’s data discovery platform solves that out of the box, with an automated catalog that includes lineage from where the data originated, all the way to which dashboards rely on it and who is viewing them every day. Just connect it to your database/data warehouse/data lakehouse/whatever you’re using and let them do the rest. Go to dataengineeringpodcast.com/selectstar today to double the length of your free trial and get a swag package when you convert to a paid plan.
Data teams are increasingly under pressure to deliver. According to a recent survey by Ascend.io, 95% of data teams reported being at or over capacity. With 72% of data experts reporting demands on their team going up faster than they can hire, it’s no surprise they are increasingly turning to automation. In fact, while only 3.5% report having current investments in automation, 85% of data teams plan on investing in automation in the next 12 months. 85%!!! That’s where our friends at Ascend.io come in. The Ascend Data Automation Cloud provides a unified platform for data ingestion, transformation, orchestration, and observability. Ascend users love its declarative pipelines, powerful SDK, elegant UI, and extensible plug-in architecture, as well as its support for Python, SQL, Scala, and Java. Ascend automates workloads on Snowflake, Databricks, BigQuery, and open source Spark, and can be deployed in AWS, Azure, or GCP. Go to dataengineeringpodcast.com/ascend and sign up for a free trial. If you’re a Data Engineering Podcast listener, you get credits worth $5,000 when you become a customer.
Your host is Tobias Macey and today I’m interviewing Shruti Bhat about the growth of real-time data applications and the systems required to support them
Interview
Introduction
How did you get involved in the area of data management?
Can you describe what is driving the adoption of real-time analytics?
architectural patterns for real-time analytics
sources of latency in the path from data creation to end-user
end-user/customer expectations for time to insight
differing expectations between internal and external consumers
scales of data that are reasonable for real-time vs. batch
What are the most interesting, innovative, or unexpected ways that you have seen real-time architectures implemented?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on Rockset?
When is Rockset the wrong choice?
What do you have planned for the future of Rockset?
Contact Info
LinkedIn
@shrutibhat on Twitter
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
Rockset
Podcast Episode
Embedded Analytics
Confluent
Kafka
AWS Kinesis
Lambda Architecture
Data Observability
Data Mesh
DynamoDB Streams
MongoDB Change Streams
Bigeye
Monte Carlo Data
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
8/22/2022 • 1 hour, 6 minutes, 19 seconds
Bringing Automation To Data Labeling For Machine Learning With Watchful
Summary
Data engineers have typically left the process of data labeling to data scientists or other roles because of its nature as a manual and process heavy undertaking, focusing instead on building automation and repeatable systems. Watchful is a platform to make labeling a repeatable and scalable process that relies on codifying domain expertise. In this episode founder Shayan Mohanty explains how he and his team are bringing software best practices and automation to the world of machine learning data preparation and how it allows data engineers to be involved in the process.
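To make the idea of codifying domain expertise concrete, here is a minimal, hypothetical sketch of programmatic labeling in plain Python. It is not Watchful's API; the label values, function names, and heuristics are illustrative assumptions, and real systems layer statistical modeling on top of the raw votes.

```python
import re

# Illustrative label values for a sentiment-style task (not Watchful's API).
POSITIVE, NEGATIVE, ABSTAIN = 1, 0, -1

def lf_mentions_refund(text: str) -> int:
    # Domain hunch: refund requests usually indicate a negative experience.
    return NEGATIVE if re.search(r"\brefund\b", text, re.I) else ABSTAIN

def lf_says_thanks(text: str) -> int:
    # Domain hunch: explicit thanks usually indicates a positive experience.
    return POSITIVE if re.search(r"\bthank(s| you)\b", text, re.I) else ABSTAIN

def weak_label(text: str, labeling_functions) -> int:
    """Aggregate labeling-function votes with a simple majority; abstain on ties."""
    votes = [vote for lf in labeling_functions if (vote := lf(text)) != ABSTAIN]
    positives, negatives = votes.count(POSITIVE), votes.count(NEGATIVE)
    if positives == negatives:
        return ABSTAIN
    return POSITIVE if positives > negatives else NEGATIVE

records = ["Thanks, the replacement arrived quickly!", "Still waiting on my refund."]
print([weak_label(r, [lf_mentions_refund, lf_says_thanks]) for r in records])  # [1, 0]
```

The appeal of this style is that the heuristics live in version-controlled code rather than in a spreadsheet of hand-applied labels, which is what makes the process repeatable for data engineers.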
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their new managed database service you can launch a production ready MySQL, Postgres, or MongoDB cluster in minutes, with automated backups, 40 Gbps connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don’t forget to thank them for their continued support of this show!
Data stacks are becoming more and more complex. This brings infinite possibilities for data pipelines to break and a host of other issues, severely deteriorating the quality of the data and causing teams to lose trust. Sifflet solves this problem by acting as an overseeing layer to the data stack – observing data and ensuring it’s reliable from ingestion all the way to consumption. Whether the data is in transit or at rest, Sifflet can detect data quality anomalies, assess business impact, identify the root cause, and alert data teams on their preferred channels. All thanks to 50+ quality checks, extensive column-level lineage, and 20+ connectors across the data stack. In addition, data discovery is made easy through Sifflet’s information-rich data catalog with a powerful search engine and real-time health statuses. Listeners of the podcast will get $2000 to use as platform credits when signing up to use Sifflet. Sifflet also offers a 2-week free trial. Find out more at dataengineeringpodcast.com/sifflet today!
The biggest challenge with modern data systems is understanding what data you have, where it is located, and who is using it. Select Star’s data discovery platform solves that out of the box, with an automated catalog that includes lineage from where the data originated, all the way to which dashboards rely on it and who is viewing them every day. Just connect it to your database/data warehouse/data lakehouse/whatever you’re using and let them do the rest. Go to dataengineeringpodcast.com/selectstar today to double the length of your free trial and get a swag package when you convert to a paid plan.
Data teams are increasingly under pressure to deliver. According to a recent survey by Ascend.io, 95% of data teams reported being at or over capacity. With 72% of data experts reporting demands on their team going up faster than they can hire, it’s no surprise they are increasingly turning to automation. In fact, while only 3.5% report having current investments in automation, 85% of data teams plan on investing in automation in the next 12 months. 85%!!! That’s where our friends at Ascend.io come in. The Ascend Data Automation Cloud provides a unified platform for data ingestion, transformation, orchestration, and observability. Ascend users love its declarative pipelines, powerful SDK, elegant UI, and extensible plug-in architecture, as well as its support for Python, SQL, Scala, and Java. Ascend automates workloads on Snowflake, Databricks, BigQuery, and open source Spark, and can be deployed in AWS, Azure, or GCP. Go to dataengineeringpodcast.com/ascend and sign up for a free trial. If you’re a Data Engineering Podcast listener, you get credits worth $5,000 when you become a customer.
Your host is Tobias Macey and today I’m interviewing Shayan Mohanty about Watchful, a data-centric platform for labeling your machine learning inputs
Interview
Introduction
How did you get involved in the area of data management?
Can you describe what Watchful is and the story behind it?
What are your core goals at Watchful?
What problem are you solving and who are the people most impacted by that problem?
What is the role of the data engineer in the process of getting data labeled for machine learning projects?
Data labeling is a large and competitive market. How do you characterize the different approaches offered by the various platforms and services?
What are the main points of friction involved in getting data labeled?
How do the types of data and its applications factor into how those challenges manifest?
What does Watchful provide that allows it to address those obstacles?
Can you describe how Watchful is implemented?
What are some of the initial ideas/assumptions that you have had to re-evaluate?
What are some of the ways that you have had to adjust the design of your user experience flows since you first started?
What is the workflow for teams who are adopting Watchful?
What are the types of collaboration that need to happen in the data labeling process?
What are some of the elements of shared vocabulary that different stakeholders in the process need to establish to be successful?
What are the most interesting, innovative, or unexpected ways that you have seen Watchful used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on Watchful?
When is Watchful the wrong choice?
What do you have planned for the future of Watchful?
Contact Info
LinkedIn
@shayanjm on Twitter
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers
Links
Watchful
Entity Resolution
Supervised Machine Learning
BERT
CLIP
LabelBox
Label Studio
Snorkel AI
Machine Learning Podcast Episode
RegEx == Regular Expression
REPL == Read Evaluate Print Loop
IDE == Integrated Development Environment
Turing Completeness
Clojure
Rust
Named Entity Recognition
The Halting Problem
NP Hard
Lidar
Shayan: Arguments Against Hand Labeling
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
8/14/2022 • 1 hour, 20 minutes, 29 seconds
Collecting And Retaining Contextual Metadata For Powerful And Effective Data Discovery
Summary
Data is useless if it isn’t being used, and you can’t use it if you don’t know where it is. Data catalogs were the first solution to this problem, but they are only helpful if you know what you are looking for. In this episode Shinji Kim discusses the challenges of data discovery and how to collect and preserve additional context about each piece of information so that you can find what you need when you don’t even know what you’re looking for yet.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their new managed database service you can launch a production ready MySQL, Postgres, or MongoDB cluster in minutes, with automated backups, 40 Gbps connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don’t forget to thank them for their continued support of this show!
Data stacks are becoming more and more complex. This brings infinite possibilities for data pipelines to break and a host of other issues, severely deteriorating the quality of the data and causing teams to lose trust. Sifflet solves this problem by acting as an overseeing layer to the data stack – observing data and ensuring it’s reliable from ingestion all the way to consumption. Whether the data is in transit or at rest, Sifflet can detect data quality anomalies, assess business impact, identify the root cause, and alert data teams on their preferred channels. All thanks to 50+ quality checks, extensive column-level lineage, and 20+ connectors across the data stack. In addition, data discovery is made easy through Sifflet’s information-rich data catalog with a powerful search engine and real-time health statuses. Listeners of the podcast will get $2000 to use as platform credits when signing up to use Sifflet. Sifflet also offers a 2-week free trial. Find out more at dataengineeringpodcast.com/sifflet today!
The biggest challenge with modern data systems is understanding what data you have, where it is located, and who is using it. Select Star’s data discovery platform solves that out of the box, with an automated catalog that includes lineage from where the data originated, all the way to which dashboards rely on it and who is viewing them every day. Just connect it to your database/data warehouse/data lakehouse/whatever you’re using and let them do the rest. Go to dataengineeringpodcast.com/selectstar today to double the length of your free trial and get a swag package when you convert to a paid plan.
Data teams are increasingly under pressure to deliver. According to a recent survey by Ascend.io, 95% of data teams reported being at or over capacity. With 72% of data experts reporting demands on their team going up faster than they can hire, it’s no surprise they are increasingly turning to automation. In fact, while only 3.5% report having current investments in automation, 85% of data teams plan on investing in automation in the next 12 months. 85%!!! That’s where our friends at Ascend.io come in. The Ascend Data Automation Cloud provides a unified platform for data ingestion, transformation, orchestration, and observability. Ascend users love its declarative pipelines, powerful SDK, elegant UI, and extensible plug-in architecture, as well as its support for Python, SQL, Scala, and Java. Ascend automates workloads on Snowflake, Databricks, BigQuery, and open source Spark, and can be deployed in AWS, Azure, or GCP. Go to dataengineeringpodcast.com/ascend and sign up for a free trial. If you’re a Data Engineering Podcast listener, you get credits worth $5,000 when you become a customer.
Your host is Tobias Macey and today I’m interviewing Shinji Kim about data discovery and what is required to build and maintain useful context for your information assets
Interview
Introduction
How did you get involved in the area of data management?
Can you share your definition of "data discovery" and the technical/social/process components that are required to make it viable?
What are the differences between "data discovery" and the capabilities of a "data catalog" and how do they overlap?
discovery of assets outside the bounds of the warehouse
capturing and codifying tribal knowledge
creating a useful structure/framework for capturing data context and operationalizing it
What are the most interesting, innovative, or unexpected ways that you have seen data discovery implemented?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on data discovery at SelectStar?
When might a data discovery effort be more work than is required?
What do you have planned for the future of SelectStar?
Contact Info
LinkedIn
@shinjikim on Twitter
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers
Links
Select Star
Podcast Episode
Fivetran
Podcast Episode
Airbyte
Podcast Episode
Tableau
PowerBI
Podcast Episode
Looker
Podcast Episode
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
8/14/2022 • 53 minutes, 24 seconds
Useful Lessons And Repeatable Patterns Learned From Data Mesh Implementations At AgileLab
Summary
Data mesh is a frequent topic of conversation in the data community, with many debates about how and when to employ this architectural pattern. The team at AgileLab have first-hand experience helping large enterprise organizations evaluate and implement their own data mesh strategies. In this episode Paolo Platter shares the lessons they have learned in that process, the Data Mesh Boost platform that they have built to reduce some of the boilerplate required to make it successful, and some of the considerations to make when deciding if a data mesh is the right choice for you.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their new managed database service you can launch a production ready MySQL, Postgres, or MongoDB cluster in minutes, with automated backups, 40 Gbps connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don’t forget to thank them for their continued support of this show!
Atlan is the metadata hub for your data ecosystem. Instead of locking your metadata into a new silo, unleash its transformative potential with Atlan’s active metadata capabilities. Push information about data freshness and quality to your business intelligence, automatically scale up and down your warehouse based on usage patterns, and let the bots answer those questions in Slack so that the humans can focus on delivering real value. Go to dataengineeringpodcast.com/atlan today to learn more about how Atlan’s active metadata platform is helping pioneering data teams like Postman, Plaid, WeWork & Unilever achieve extraordinary things with metadata and escape the chaos.
Prefect is the modern Dataflow Automation platform for the modern data stack, empowering data practitioners to build, run and monitor robust pipelines at scale. Guided by the principle that the orchestrator shouldn’t get in your way, Prefect is the only tool of its kind to offer the flexibility to write code as workflows. Prefect specializes in gluing together the disparate pieces of a pipeline, and integrating with modern distributed compute libraries to bring power where you need it, when you need it. Trusted by thousands of organizations and supported by over 20,000 community members, Prefect powers over 100MM business-critical tasks a month. For more information on Prefect, visit dataengineeringpodcast.com/prefect.
The only thing worse than having bad data is not knowing that you have it. With Bigeye’s data observability platform, if there is an issue with your data or data pipelines you’ll know right away and can get it fixed before the business is impacted. Bigeye lets data teams measure, improve, and communicate the quality of your data to company stakeholders. With complete API access, a user-friendly interface, and automated yet flexible alerting, you’ve got everything you need to establish and maintain trust in your data. Go to dataengineeringpodcast.com/bigeye today to sign up and start trusting your analyses.
Your host is Tobias Macey and today I’m interviewing Paolo Platter about Agile Lab’s lessons learned through helping large enterprises establish their own data mesh
Interview
Introduction
How did you get involved in the area of data management?
Can you share your experiences working with data mesh implementations?
What were the stated goals of project engagements that led to data mesh implementations?
What are some examples of projects where you explored data mesh as an option and decided that it was a poor fit?
What are some of the technical and process investments that are necessary to support a mesh strategy?
When implementing a data mesh what are some of the common concerns/requirements for building and supporting data products?
What is the general shape that a product will take in a mesh environment?
What are the features that are necessary for a product to be an effective component in the mesh?
What are some of the aspects of a data product that are unique to a given implementation?
You built a platform for implementing data meshes. Can you describe the technical elements of that system?
What were the primary goals that you were addressing when you decided to invest in building Data Mesh Boost?
How does Data Mesh Boost help in the implementation of a data mesh?
Code review is a common practice in the construction and maintenance of software systems. How does that activity map to data systems/products?
What are some of the challenges that you have encountered around CI/CD for data products?
What are the persistent pain points involved in supporting pre-production validation of changes to data products?
Beyond the initial work of building and deploying a data product there is the ongoing lifecycle management. How do you approach refactoring old data products to match updated practices/templates?
What are some of the indicators that tell you when an organization is at a level of sophistication that can support a data mesh approach?
What are the most interesting, innovative, or unexpected ways that you have seen Data Mesh Boost used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on Data Mesh Boost?
When is Data Mesh (Boost) the wrong choice?
What do you have planned for the future of Data Mesh Boost?
Contact Info
LinkedIn
@axlpado on Twitter
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers
Links
AgileLab
Spark
Cloudera
Zhamak Dehghani
Data Mesh
Data Fabric
Data Virtualization
q-lang
Data Mesh Boost
Data Mesh Marketplace
SourceGraph
OpenMetadata
Egeria
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
8/6/2022 • 48 minutes, 30 seconds
Optimize Your Machine Learning Development And Serving With The Open Source Vector Database Milvus
Summary
The optimal format for storage and retrieval of data is dependent on how it is going to be used. For analytical systems there are decades of investment in data warehouses and various modeling techniques. For machine learning applications, relational models require additional processing to be directly useful, which is why there has been a growth in the use of vector databases. These platforms store direct representations of the vector embeddings that machine learning models rely on for computing relevant predictions so that there is no additional processing required to go from input data to inference output. In this episode Frank Liu explains how the open source Milvus vector database is implemented to speed up machine learning development cycles, how to think about proper storage and scaling of these vectors, and how data engineering and machine learning teams can collaborate on the creation and maintenance of these data sets.
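As a rough illustration of the workflow discussed in the episode, the sketch below creates a collection, inserts random embeddings, and runs a similarity search with the pymilvus client. It assumes a locally running Milvus instance on the default port and a 2.x client; the collection and field names are placeholders, and exact index and search parameters vary by Milvus version.

```python
import numpy as np
from pymilvus import connections, FieldSchema, CollectionSchema, DataType, Collection

connections.connect(host="localhost", port="19530")  # assumes a local Milvus deployment

DIM = 128
schema = CollectionSchema([
    FieldSchema(name="id", dtype=DataType.INT64, is_primary=True, auto_id=True),
    FieldSchema(name="embedding", dtype=DataType.FLOAT_VECTOR, dim=DIM),
])
collection = Collection("demo_embeddings", schema)  # placeholder collection name

# Insert a batch of (fake) embeddings; in practice these come from an ML model.
vectors = np.random.random((1000, DIM)).astype("float32")
collection.insert([vectors.tolist()])

# Build an approximate-nearest-neighbor index, then load the collection for search.
collection.create_index(
    "embedding",
    {"index_type": "IVF_FLAT", "metric_type": "L2", "params": {"nlist": 128}},
)
collection.load()

# Find the five stored vectors closest to a query embedding.
query = np.random.random((1, DIM)).astype("float32").tolist()
results = collection.search(
    data=query,
    anns_field="embedding",
    param={"metric_type": "L2", "params": {"nprobe": 10}},
    limit=5,
)
print([hit.id for hit in results[0]])
```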
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their new managed database service you can launch a production ready MySQL, Postgres, or MongoDB cluster in minutes, with automated backups, 40 Gbps connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don’t forget to thank them for their continued support of this show!
Data stacks are becoming more and more complex. This brings infinite possibilities for data pipelines to break and a host of other issues, severely deteriorating the quality of the data and causing teams to lose trust. Sifflet solves this problem by acting as an overseeing layer to the data stack – observing data and ensuring it’s reliable from ingestion all the way to consumption. Whether the data is in transit or at rest, Sifflet can detect data quality anomalies, assess business impact, identify the root cause, and alert data teams on their preferred channels. All thanks to 50+ quality checks, extensive column-level lineage, and 20+ connectors across the data stack. In addition, data discovery is made easy through Sifflet’s information-rich data catalog with a powerful search engine and real-time health statuses. Listeners of the podcast will get $2000 to use as platform credits when signing up to use Sifflet. Sifflet also offers a 2-week free trial. Find out more at dataengineeringpodcast.com/sifflet today!
RudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their state-of-the-art reverse ETL pipelines enable you to send enriched data to any cloud tool. Sign up free… or just get the free t-shirt for being a listener of the Data Engineering Podcast at dataengineeringpodcast.com/rudder.
Data teams are increasingly under pressure to deliver. According to a recent survey by Ascend.io, 95% of data teams reported being at or over capacity. With 72% of data experts reporting demands on their team going up faster than they can hire, it’s no surprise they are increasingly turning to automation. In fact, while only 3.5% report having current investments in automation, 85% of data teams plan on investing in automation in the next 12 months. 85%!!! That’s where our friends at Ascend.io come in. The Ascend Data Automation Cloud provides a unified platform for data ingestion, transformation, orchestration, and observability. Ascend users love its declarative pipelines, powerful SDK, elegant UI, and extensible plug-in architecture, as well as its support for Python, SQL, Scala, and Java. Ascend automates workloads on Snowflake, Databricks, BigQuery, and open source Spark, and can be deployed in AWS, Azure, or GCP. Go to dataengineeringpodcast.com/ascend and sign up for a free trial. If you’re a Data Engineering Podcast listener, you get credits worth $5,000 when you become a customer.
Your host is Tobias Macey and today I’m interviewing Frank Liu about the open source vector database Milvus and how it simplifies the work of supporting ML teams
Interview
Introduction
How did you get involved in the area of data management?
Can you describe what Milvus is and the story behind it?
What are the goals of the project?
Who is the target audience for this database?
What are the use cases for a vector database and similarity search of vector embeddings?
What are some of the unique capabilities that this category of database engine introduces?
Can you describe how Milvus is architected?
What are the primary system requirements that have influenced the design choices?
How have the goals and implementation evolved since you started working on it?
What are some of the interesting details that you have had to address in the storage layer to allow for fast and efficient retrieval of vector embeddings?
What are the limitations that you have had to impose on size or dimensionality of vectors to allow for a consistent user experience in a running system?
The reference material states that similarity between two vectors implies similarity in the source data. What are some of the characteristics of vector embeddings that might make them immune or susceptible to confusion of similarity across different source data types that share some implicit relationship due to specifics of their vectorized representation? (e.g. an image vs. an audio file, etc.)
What are the available deployment models/targets and how does that influence potential use cases?
What is the workflow for someone who is building an application on top of Milvus?
What are some of the data management considerations that are introduced by vector databases? (e.g. manage versions of vectors, metadata management, etc.)
What are the most interesting, innovative, or unexpected ways that you have seen Milvus used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on Milvus?
When is Milvus the wrong choice?
What do you have planned for the future of Milvus?
Contact Info
LinkedIn
fzliu on GitHub
Website
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers
Links
Milvus
Zilliz
Linux Foundation/AI & Data
MySQL
PostgreSQL
CockroachDB
Pilosa
Podcast Episode
Pinecone Vector DB
Podcast Episode
Vector Embedding
Reverse Image Search
Vector Arithmetic
Vector Distance
SIGMOD
Tensor
Rotation Matrix
L2 Distance
Cosine Distance
OpenAI CLIP
Knowhere
Kafka
Pulsar
Podcast Episode
CAP Theorem
Milvus Helm Chart
Zilliz Cloud
MinIO
Towhee
Attu
Feder
FPGA == Field Programmable Gate Array
TPU == Tensor Processing Unit
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
8/6/2022 • 58 minutes, 51 seconds
Interactive Exploratory Data Analysis On Petabyte Scale Data Sets With Arkouda
Summary
Exploratory data analysis works best when the feedback loop is fast and iterative. This is easy to achieve when you are working on small datasets, but as they scale up beyond what can fit on a single machine those short iterations quickly become long and tedious. The Arkouda project is a Python interface built on top of the Chapel compiler to bring back those interactive speeds for exploratory analysis on horizontally scalable compute that parallelizes operations on large volumes of data. In this episode David Bader explains how the framework operates, the algorithms that are built into it to support complex analyses, and how you can start using it today.
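For a sense of the interactive workflow described here, the sketch below assumes an arkouda_server is already running (for example on an HPC allocation) and that the arkouda Python client is installed; the host, port, and exact method names are assumptions that should be checked against the version you deploy.

```python
import arkouda as ak

# Connect the Python session to a running arkouda_server (host/port are assumptions).
ak.connect("localhost", 5555)

# Arrays are created and stored server-side; only lightweight handles live in Python.
a = ak.randint(0, 100, 10**8)   # one hundred million elements, held by the parallel backend
b = ak.randint(0, 100, 10**8)

# NumPy-style expressions are shipped to the server and executed in parallel.
mask = (a + b) > 150
print(mask.sum())                # reduction computed server-side, scalar returned to Python

# Pandas-style grouping and aggregation, also executed by the Chapel backend.
g = ak.GroupBy(a % 10)
keys, counts = g.count()
print(keys, counts)

ak.disconnect()
```

The point of the design is that each of these lines returns in interactive time even at terabyte scale, because the Python session only orchestrates work that the Chapel-based server performs in parallel.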
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their new managed database service you can launch a production ready MySQL, Postgres, or MongoDB cluster in minutes, with automated backups, 40 Gbps connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don’t forget to thank them for their continued support of this show!
Data stacks are becoming more and more complex. This brings infinite possibilities for data pipelines to break and a host of other issues, severely deteriorating the quality of the data and causing teams to lose trust. Sifflet solves this problem by acting as an overseeing layer to the data stack – observing data and ensuring it’s reliable from ingestion all the way to consumption. Whether the data is in transit or at rest, Sifflet can detect data quality anomalies, assess business impact, identify the root cause, and alert data teams on their preferred channels. All thanks to 50+ quality checks, extensive column-level lineage, and 20+ connectors across the data stack. In addition, data discovery is made easy through Sifflet’s information-rich data catalog with a powerful search engine and real-time health statuses. Listeners of the podcast will get $2000 to use as platform credits when signing up to use Sifflet. Sifflet also offers a 2-week free trial. Find out more at dataengineeringpodcast.com/sifflet today!
RudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their state-of-the-art reverse ETL pipelines enable you to send enriched data to any cloud tool. Sign up free… or just get the free t-shirt for being a listener of the Data Engineering Podcast at dataengineeringpodcast.com/rudder.
Data teams are increasingly under pressure to deliver. According to a recent survey by Ascend.io, 95% of data teams reported being at or over capacity. With 72% of data experts reporting demands on their team going up faster than they can hire, it’s no surprise they are increasingly turning to automation. In fact, while only 3.5% report having current investments in automation, 85% of data teams plan on investing in automation in the next 12 months. 85%!!! That’s where our friends at Ascend.io come in. The Ascend Data Automation Cloud provides a unified platform for data ingestion, transformation, orchestration, and observability. Ascend users love its declarative pipelines, powerful SDK, elegant UI, and extensible plug-in architecture, as well as its support for Python, SQL, Scala, and Java. Ascend automates workloads on Snowflake, Databricks, BigQuery, and open source Spark, and can be deployed in AWS, Azure, or GCP. Go to dataengineeringpodcast.com/ascend and sign up for a free trial. If you’re a Data Engineering Podcast listener, you get credits worth $5,000 when you become a customer.
Your host is Tobias Macey and today I’m interviewing David Bader about Arkouda, a horizontally scalable parallel compute library for exploratory data analysis in Python
Interview
Introduction
How did you get involved in the area of data management?
Can you describe what Arkouda is and the story behind it?
What are the main goals of the project?
How does it address those goals?
Who is the primary audience for Arkouda?
What are some of the main points of friction that engineers and scientists encounter while conducting exploratory data analysis (EDA)?
What kinds of behaviors are they engaging in during these exploration cycles?
When data scientists run up against the limitations of their tools and environments how does that impact the work of data engineers/data platform owners?
There have been a number of libraries/frameworks/utilities/etc. built to improve the experience and outcomes for EDA. What was missing that made Arkouda necessary/useful?
Can you describe how Arkouda is implemented?
What are some of the novel algorithms that you have had to design to support Arkouda’s objectives?
How have the design/goals/scope of the project changed since you started working on it?
How has the evolution of hardware capabilities impacted the set of processing algorithms that are viable for addressing considerations of scale?
What are the relative factors of scale along space/time axes that you are optimizing for?
What are some opportunities that are still unrealized for algorithmic optimizations to expand horizons for large-scale data manipulation?
For teams/individuals who are working with Arkouda can you describe the implementation process and what the end-user workflow looks like?
What are the most interesting, innovative, or unexpected ways that you have seen Arkouda used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on Arkouda?
When is Arkouda the wrong choice?
What do you have planned for the future of Arkouda?
Contact Info
Website
LinkedIn
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
Arkouda
NJIT == New Jersey Institute of Technology
NumPy
Pandas
Podcast.__init__ Episode
NetworkX
Chapel
Massive Graph Analytics Book
Ray
Podcast.__init__ Episode
Dask
Podcast Episode
Bodo
Podcast Episode
Stinger Graph Analytics
Bears-R-Us
0MQ
Triangle Centrality
Degree Centrality
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
7/31/2022 • 40 minutes, 37 seconds
What "Data Lineage Done Right" Looks Like And How They're Doing It At Manta
Summary
Data lineage is the roadmap for your data platform, providing visibility into all of the dependencies for any report, machine learning model, or data warehouse table that you are working with. Because of its centrality to your data systems it is valuable for debugging, governance, understanding context, and myriad other purposes. This means that it is important to have an accurate and complete lineage graph so that you don’t have to perform your own detective work when time is in short supply. In this episode Ernie Ostic shares the approach that he and his team at Manta are taking to build a complete view of data lineage across the various data systems in your organization and the useful applications of that information in the work of every data stakeholder.
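Manta itself is a commercial platform, so the sketch below is not its API; it is a toy model of why a complete lineage graph is useful. Treat lineage as a directed graph of assets and walk it in either direction: downstream for impact analysis, upstream for root-cause work. The asset names are purely illustrative.

```python
from collections import defaultdict, deque

# Toy lineage edges: upstream_asset -> downstream_asset (illustrative names only).
edges = [
    ("raw.orders", "staging.orders"),
    ("staging.orders", "warehouse.fct_orders"),
    ("warehouse.fct_orders", "dashboard.revenue"),
    ("raw.customers", "warehouse.dim_customers"),
    ("warehouse.dim_customers", "dashboard.revenue"),
]

downstream = defaultdict(set)
upstream = defaultdict(set)
for src, dst in edges:
    downstream[src].add(dst)
    upstream[dst].add(src)

def walk(graph, start):
    """Breadth-first traversal over the lineage graph from a starting asset."""
    seen, queue = set(), deque([start])
    while queue:
        node = queue.popleft()
        for nxt in graph[node]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen

# Impact analysis: everything that is affected if raw.orders is late or wrong.
print(walk(downstream, "raw.orders"))
# Root-cause analysis: everything a broken dashboard could have inherited an error from.
print(walk(upstream, "dashboard.revenue"))
```

The value of a product like Manta comes from filling this graph automatically and completely across every system in the platform, since a traversal over an incomplete graph silently misses dependencies.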
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their new managed database service you can launch a production ready MySQL, Postgres, or MongoDB cluster in minutes, with automated backups, 40 Gbps connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don’t forget to thank them for their continued support of this show!
Atlan is the metadata hub for your data ecosystem. Instead of locking your metadata into a new silo, unleash its transformative potential with Atlan’s active metadata capabilities. Push information about data freshness and quality to your business intelligence, automatically scale up and down your warehouse based on usage patterns, and let the bots answer those questions in Slack so that the humans can focus on delivering real value. Go to dataengineeringpodcast.com/atlan today to learn more about how Atlan’s active metadata platform is helping pioneering data teams like Postman, Plaid, WeWork & Unilever achieve extraordinary things with metadata and escape the chaos.
The only thing worse than having bad data is not knowing that you have it. With Bigeye’s data observability platform, if there is an issue with your data or data pipelines you’ll know right away and can get it fixed before the business is impacted. Bigeye lets data teams measure, improve, and communicate the quality of your data to company stakeholders. With complete API access, a user-friendly interface, and automated yet flexible alerting, you’ve got everything you need to establish and maintain trust in your data. Go to dataengineeringpodcast.com/bigeye today to sign up and start trusting your analyses.
Prefect is the modern Dataflow Automation platform for the modern data stack, empowering data practitioners to build, run and monitor robust pipelines at scale. Guided by the principle that the orchestrator shouldn’t get in your way, Prefect is the only tool of its kind to offer the flexibility to write code as workflows. Prefect specializes in gluing together the disparate pieces of a pipeline, and integrating with modern distributed compute libraries to bring power where you need it, when you need it. Trusted by thousands of organizations and supported by over 20,000 community members, Prefect powers over 100MM business-critical tasks a month. For more information on Prefect, visit dataengineeringpodcast.com/prefect.
Your host is Tobias Macey and today I’m interviewing Ernie Ostic about Manta, an automated data lineage service for managing visibility and quality of your data workflows
Interview
Introduction
How did you get involved in the area of data management?
Can you describe what Manta is and the story behind it?
What are the core problems that Manta aims to solve?
Data lineage and metadata systems are a hot topic right now. What is your summary of the state of the market?
What are the capabilities that would lead a team or organization to choose Manta in place of the other options?
What are some examples of "data lineage done wrong"? (what does that look like?)
What are the risks associated with investing in an incomplete solution for data lineage?
What are the core attributes that need to be tracked consistently to enable a comprehensive view of lineage?
How do the practices for collecting lineage and metadata differ between structured, semi-structured, and unstructured data assets and their movement?
Can you describe how Manta is implemented?
How have the design and goals of the product changed or evolved?
What is involved in integrating Manta with an organization’s data systems?
What are the biggest sources of friction/errors in collecting and cleaning lineage information?
One of the interesting capabilities that you advertise is versioning and time travel for lineage information. Why is that a necessary and useful feature?
Once an organization’s lineage information is available in Manta, how does it factor into the daily workflow of different roles/stakeholders?
There are a variety of use cases for metadata in a data platform beyond lineage. What are the benefits that you see from focusing on that as a core competency?
Beyond validating quality, identifying errors, etc., it seems that automated discovery of lineage could produce insights into the presence of data assets that shouldn’t exist. What are some examples of similar discoveries that you are aware of?
What are the most interesting, innovative, or unexpected ways that you have seen Manta used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on Manta?
When is Manta the wrong choice?
What do you have planned for the future of Manta?
Contact Info
LinkedIn
@dsrealtime01 on Twitter
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers
Links
Manta
Egeria
OpenLineage
Podcast Episode
Apache Atlas
Neo4J
Easytrieve
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
7/31/2022 • 1 hour, 5 minutes, 18 seconds
Re-Bundling The Data Stack With Data Orchestration And Software Defined Assets Using Dagster
Summary
The current stage of evolution in the data management ecosystem has resulted in domain and use case specific orchestration capabilities being incorporated into various tools. This complicates the work involved in making end-to-end workflows visible and integrated. Dagster has invested in bringing insights about external tools’ dependency graphs into one place through its "software defined assets" functionality. In this episode Nick Schrock discusses the importance of orchestration and a central location for managing data systems, the road to Dagster’s 1.0 release, and the new features coming with Dagster Cloud’s general availability.
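For readers who have not seen software-defined assets, here is a minimal sketch using the @asset decorator from Dagster 1.x. The asset names and logic are placeholders, and a production deployment would run these through the Dagster daemon or Dagster Cloud rather than an in-process materialize() call.

```python
from dagster import asset, materialize

@asset
def raw_orders() -> list[dict]:
    # Placeholder: in a real pipeline this might pull from an API or a warehouse table.
    return [{"id": 1, "amount": 42.0}, {"id": 2, "amount": 17.5}]

@asset
def order_totals(raw_orders: list[dict]) -> float:
    # Dagster infers the dependency on raw_orders from the parameter name,
    # so the asset graph doubles as a lineage graph for the data platform.
    return sum(row["amount"] for row in raw_orders)

if __name__ == "__main__":
    # materialize() executes the selected assets (and their upstreams) in-process.
    result = materialize([raw_orders, order_totals])
    assert result.success
```

Declaring the assets themselves, rather than the tasks that produce them, is what lets Dagster pull dependency graphs from external tools into one place, which is the "re-bundling" theme of this conversation.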
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their new managed database service you can launch a production ready MySQL, Postgres, or MongoDB cluster in minutes, with automated backups, 40 Gbps connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don’t forget to thank them for their continued support of this show!
RudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their state-of-the-art reverse ETL pipelines enable you to send enriched data to any cloud tool. Sign up free… or just get the free t-shirt for being a listener of the Data Engineering Podcast at dataengineeringpodcast.com/rudder.
Data teams are increasingly under pressure to deliver. According to a recent survey by Ascend.io, 95% of data teams reported being at or over capacity. With 72% of data experts reporting demands on their team going up faster than they can hire, it’s no surprise they are increasingly turning to automation. In fact, while only 3.5% report having current investments in automation, 85% of data teams plan on investing in automation in the next 12 months. 85%!!! That’s where our friends at Ascend.io come in. The Ascend Data Automation Cloud provides a unified platform for data ingestion, transformation, orchestration, and observability. Ascend users love its declarative pipelines, powerful SDK, elegant UI, and extensible plug-in architecture, as well as its support for Python, SQL, Scala, and Java. Ascend automates workloads on Snowflake, Databricks, BigQuery, and open source Spark, and can be deployed in AWS, Azure, or GCP. Go to dataengineeringpodcast.com/ascend and sign up for a free trial. If you’re a Data Engineering Podcast listener, you get credits worth $5,000 when you become a customer.
Your host is Tobias Macey and today I’m interviewing Nick Schrock about software defined assets and improving the developer experience for data orchestration with Dagster
Interview
Introduction
How did you get involved in the area of data management?
What are the notable updates in Dagster since the last time we spoke? (November, 2021)
One of the core concepts that you introduced and then stabilized in recent releases is the "software defined asset" (SDA). How have your users reacted to this capability?
What are the notable outcomes in development and product practices that you have seen as a result?
What are the changes to the interfaces and internals of Dagster that were necessary to support SDA?
How did the API design shift from the initial implementation once the community started providing feedback?
You’re releasing the stable 1.0 version of Dagster as part of something called "Dagster Day" on August 9th. What do you have planned for that event and what does the release mean for users who have been refraining from using the framework until now?
Along with your 1.0 commitment to a stable interface in the framework you are also opening your cloud platform for general availability. What are the major lessons that you and your team learned in the beta period?
What new capabilities are coming with the GA release?
A core thesis in your work on Dagster is that developer tooling for data professionals has been lacking. What are your thoughts on the overall progress that has been made as an industry?
What are the sharp edges that still need to be addressed?
A core facet of product-focused software development over the past decade+ is CI/CD and the use of pre-production environments for testing changes, which is still a challenging aspect of data-focused engineering. How are you thinking about those capabilities for orchestration workflows in the Dagster context?
What are the missing pieces in the broader ecosystem that make this a challenge even with support from tools and frameworks?
How has the situation improved in the recent past and looking toward the near future?
What role does the SDA approach have in pushing on these capabilities?
What are the most interesting, innovative, or unexpected ways that you have seen Dagster used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on bringing Dagster to 1.0 and cloud to GA?
When is Dagster/Dagster Cloud the wrong choice?
What do you have planned for the future of Dagster and Elementl?
Contact Info
@schrockn on Twitter
schrockn on GitHub
LinkedIn
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers
Links
Dagster Day
Dagster
1st Podcast Episode
2nd Podcast Episode
Elementl
GraphQL
Unbundling Airflow
Feast
Spark SQL
Dagster Cloud Branch Deployments
Dagster custom I/O manager
LakeFS
Iceberg
Project Nessie
Prefect
Prefect Orion
Astronomer
Temporal
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
7/24/2022 • 58 minutes, 14 seconds
Writing The Book That Offers A Single Reference For The Fundamentals Of Data Engineering
Summary
Data engineering is a difficult job, requiring a large number of skills that often don’t overlap. Any effort to understand how to start a career in the role has required stitching together information from a multitude of resources that might not all agree with each other. In order to provide a single reference for anyone tasked with data engineering responsibilities Joe Reis and Matt Housley took it upon themselves to write the book "Fundamentals of Data Engineering". In this episode they share their experiences researching and distilling the lessons that will be useful to data engineers now and into the future, without being tied to any specific technologies that may fade from fashion.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their new managed database service you can launch a production ready MySQL, Postgres, or MongoDB cluster in minutes, with automated backups, 40 Gbps connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don’t forget to thank them for their continued support of this show!
Atlan is the metadata hub for your data ecosystem. Instead of locking your metadata into a new silo, unleash its transformative potential with Atlan’s active metadata capabilities. Push information about data freshness and quality to your business intelligence, automatically scale up and down your warehouse based on usage patterns, and let the bots answer those questions in Slack so that the humans can focus on delivering real value. Go to dataengineeringpodcast.com/atlan today to learn more about how Atlan’s active metadata platform is helping pioneering data teams like Postman, Plaid, WeWork & Unilever achieve extraordinary things with metadata and escape the chaos.
Prefect is the modern Dataflow Automation platform for the modern data stack, empowering data practitioners to build, run and monitor robust pipelines at scale. Guided by the principle that the orchestrator shouldn’t get in your way, Prefect is the only tool of its kind to offer the flexibility to write code as workflows. Prefect specializes in gluing together the disparate pieces of a pipeline, and integrating with modern distributed compute libraries to bring power where you need it, when you need it. Trusted by thousands of organizations and supported by over 20,000 community members, Prefect powers over 100MM business critical tasks a month. For more information on Prefect, visit dataengineeringpodcast.com/prefect today.
Your host is Tobias Macey and today I’m interviewing Joe Reis and Matt Housley about their new book on the Fundamentals of Data Engineering
Interview
Introduction
How did you get involved in the area of data management?
Can you explain what possessed you to write such an ambitious book?
What are your goals with this book?
What was your process for determining what subject areas to include in the book?
How did you determine what level of granularity/detail to use for each subject area?
Closely linked to what subjects are necessary to be effective as a data engineer is the concept of what that title encompasses. How have the definitions shifted over the past few decades?
In your experiences working in industry and researching for the book, what is the prevailing view on what data engineers do?
In the book you focus on what you term the "data lifecycle engineer". What are the skills and background that are needed to be successful in that role?
Any discussion of technological concepts and how to build systems tends to drift toward specific tools. How did you balance the need to be agnostic to specific technologies while providing relevant and relatable examples?
What are the aspects of the book that you anticipate needing to revisit over the next 2 – 5 years?
Which elements do you think will remain evergreen?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on writing "Fundamentals of Data Engineering"?
What are your predictions for the future of data engineering?
Contact Info
Joe
LinkedIn
Website
Matt
LinkedIn
@doctorhousley on Twitter
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
Fundamentals of Data Engineering (affiliate link)
Ternary Data
Designing Data Intensive Applications
James Webb Space Telescope
Google Colossus Storage System
DMBoK == Data Management Body of Knowledge
DAMA
Bill Inmon
Apache Druid
RTFM == Read The Fine Manual
DuckDB
Podcast Episode
VisiCalc
Ternary Data Newsletter
Meroxa
Podcast Episode
Ruby on Rails
Lambda Architecture
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
7/24/2022 • 1 hour, 1 minute, 2 seconds
Making The Total Cost Of Ownership For External Data Manageable With Crux
Summary
There are extensive and valuable data sets that are available outside the bounds of your organization. Whether that data is public, paid, or scraped it requires investment and upkeep to acquire and integrate it with your systems. Crux was built to reduce the total cost of acquisition and ownership for integrating external data, offering a fully managed service for delivering those data assets in the manner that best suits your infrastructure. In this episode Crux CTO Mark Etherington discusses the different costs involved in managing external data, how to think about the total return on investment for your data, and how the Crux platform is architected to reduce the toil involved in managing third party data.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their new managed database service you can launch a production ready MySQL, Postgres, or MongoDB cluster in minutes, with automated backups, 40 Gbps connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don’t forget to thank them for their continued support of this show!
Atlan is the metadata hub for your data ecosystem. Instead of locking your metadata into a new silo, unleash its transformative potential with Atlan’s active metadata capabilities. Push information about data freshness and quality to your business intelligence, automatically scale up and down your warehouse based on usage patterns, and let the bots answer those questions in Slack so that the humans can focus on delivering real value. Go to dataengineeringpodcast.com/atlan today to learn more about how Atlan’s active metadata platform is helping pioneering data teams like Postman, Plaid, WeWork & Unilever achieve extraordinary things with metadata and escape the chaos.
Modern data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days or even weeks. By the time errors have made their way into production, it’s often too late and damage is done. Datafold built automated regression testing to help data and analytics engineers deal with data quality in their pull requests. Datafold shows how a change in SQL code affects your data, both on a statistical level and down to individual rows and values before it gets merged to production. No more shipping and praying, you can now know exactly what will change in your database! Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI workflows. Visit dataengineeringpodcast.com/datafold today to book a demo with Datafold.
Tired of deploying bad data? Need to automate data pipelines with less red tape? Shipyard is the premier data orchestration platform built to help your data team quickly launch, monitor, and share workflows in a matter of minutes. Build powerful workflows that connect your entire data stack end-to-end with a mix of your code and their open-source, low-code templates. Once launched, Shipyard makes data observability easy with logging, alerting, and retries that will catch errors before your business team does. So whether you’re ingesting data from an API, transforming it with dbt, updating BI tools, or sending data alerts, Shipyard centralizes these operations and handles the heavy lifting so your data team can finally focus on what they’re good at — solving problems with data. Go to dataengineeringpodcast.com/shipyard to get started automating with their free developer plan today!
Your host is Tobias Macey and today I’m interviewing Mark Etherington about Crux, a platform that helps organizations scale their most critical data delivery, operations, and transformation needs
Interview
Introduction
How did you get involved in the area of data management?
Can you describe what Crux is and the story behind it?
What are the categories of information that organizations use external data sources for?
What are the challenges and long-term costs related to integrating external data sources that are most often overlooked or underestimated?
What are some of the primary risks involved in working with external data sources?
How do you work with customers to help them understand the long-term costs associated with integrating various sources?
How does that play into the broader conversation about assessing the value of a given data-set?
Can you describe how you have architected the Crux platform?
How have the design and goals of the platform changed or evolved since you started working on it?
What are the design choices that have had the most significant impact on your ability to reduce operational complexity and maintenance overhead for the data you are working with?
For teams who are relying on Crux to manage external data, what is involved in setting up the initial integration with your system?
What are the steps to on-board new data sources?
How do you manage data quality/data observability across your different data providers?
What kinds of signals do you propagate to your customers to feed into their operational platforms?
What are the most interesting, innovative, or unexpected ways that you have seen Crux used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on Crux?
When is Crux the wrong choice?
What do you have planned for the future of Crux?
Contact Info
Email
LinkedIn
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers
Links
Crux
Thomson Reuters
Goldman Sachs
JP Morgan
Avro
ESG == Environmental, Social, and Governance Data
Selenium
Google Cloud Platform
Cadence
Airflow
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
7/17/2022 • 1 hour, 7 minutes, 12 seconds
Joe Reis Flips The Script And Interviews Tobias Macey About The Data Engineering Podcast
Summary
Data engineering is a large and growing subject, with new technologies, specializations, and "best practices" emerging at an accelerating pace. This podcast does its best to explore this fractal ecosystem, and has been at it for the past 5+ years. In this episode Joe Reis, founder of Ternary Data and co-author of "Fundamentals of Data Engineering", turns the tables and interviews the host, Tobias Macey, about his journey into podcasting, how he runs the show behind the scenes, and the other things that occupy his time.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their new managed database service you can launch a production ready MySQL, Postgres, or MongoDB cluster in minutes, with automated backups, 40 Gbps connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don’t forget to thank them for their continued support of this show!
RudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their state-of-the-art reverse ETL pipelines enable you to send enriched data to any cloud tool. Sign up free… or just get the free t-shirt for being a listener of the Data Engineering Podcast at dataengineeringpodcast.com/rudder.
Data teams are increasingly under pressure to deliver. According to a recent survey by Ascend.io, 95% in fact reported being at or over capacity. With 72% of data experts reporting demands on their team going up faster than they can hire, it’s no surprise they are increasingly turning to automation. In fact, while only 3.5% report having current investments in automation, 85% of data teams plan on investing in automation in the next 12 months. 85%!!! That’s where our friends at Ascend.io come in. The Ascend Data Automation Cloud provides a unified platform for data ingestion, transformation, orchestration, and observability. Ascend users love its declarative pipelines, powerful SDK, elegant UI, and extensible plug-in architecture, as well as its support for Python, SQL, Scala, and Java. Ascend automates workloads on Snowflake, Databricks, BigQuery, and open source Spark, and can be deployed in AWS, Azure, or GCP. Go to dataengineeringpodcast.com/ascend and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $5,000 when you become a customer.
Your host is Tobias Macey and today we’re flipping the script. Joe Reis of Ternary Data will be interviewing me about my time as the host of this show and my perspectives on the data ecosystem
Interview
Introduction
How did you get involved in the area of data management?
Now I’ll hand it off to Joe…
Joe’s Notes
You do a lot of podcasts. Why? Podcast.__init__ started in 2015, and your first episode of the Data Engineering Podcast was published January 14, 2017. Walk us through the start of these podcasts.
Why not a data science podcast? Why data engineering?
You’ve published 306 episodes of the Data Engineering Podcast, plus 370 for Podcast.__init__, and now you’ve got a new ML podcast. How have you kept the motivation over the years?
What’s the process for the show (finding guests, choosing topics, recording, publishing)? It’s a lot of work. Walk us through it.
You’ve done a ton of shows and have a lot of context with what’s going on in the field of both data engineering and Python. What have been some of the major evolutions of topics you’ve covered?
What’s been the most counterintuitive show or interesting thing you’ve learned while producing the show?
How do you keep current with the data engineering landscape?
You’ve got a unique perspective on data engineering, having interviewed countless top people in the field. What are the big trends you see in data engineering over the next 3 years?
What do you do besides podcasting? Is this your only gig, or do you do other work?
What’s next?
Contact Info
LinkedIn
Website
Closing Announcements
Thank you for listening! Don’t forget to check out our other shows. The Data Engineering Podcast covers the latest on modern data management. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@themachinelearningpodcast.com with your story.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
Podcast.__init__
The Machine Learning Podcast
Ternary Data
Fundamentals of Data Engineering book (affiliate link)
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
7/17/2022 • 56 minutes, 39 seconds
Charting the Path of Riskified's Data Platform Journey
Summary
Building a data platform is a journey, not a destination. Beyond the work of assembling a set of technologies and building integrations across them, there is also the work of growing and organizing a team that can support and benefit from that platform. In this episode Inbar Yogev and Lior Winner share the journey that they and their teams at Riskified have been on for their data platform. They also discuss how they have established a guild system for training and supporting data professionals in the organization.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their new managed database service you can launch a production ready MySQL, Postgres, or MongoDB cluster in minutes, with automated backups, 40 Gbps connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don’t forget to thank them for their continued support of this show!
Atlan is the metadata hub for your data ecosystem. Instead of locking your metadata into a new silo, unleash its transformative potential with Atlan’s active metadata capabilities. Push information about data freshness and quality to your business intelligence, automatically scale up and down your warehouse based on usage patterns, and let the bots answer those questions in Slack so that the humans can focus on delivering real value. Go to dataengineeringpodcast.com/atlan today to learn more about how Atlan’s active metadata platform is helping pioneering data teams like Postman, Plaid, WeWork & Unilever achieve extraordinary things with metadata and escape the chaos.
Modern data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days or even weeks. By the time errors have made their way into production, it’s often too late and damage is done. Datafold built automated regression testing to help data and analytics engineers deal with data quality in their pull requests. Datafold shows how a change in SQL code affects your data, both on a statistical level and down to individual rows and values before it gets merged to production. No more shipping and praying, you can now know exactly what will change in your database! Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI workflows. Visit dataengineeringpodcast.com/datafold today to book a demo with Datafold.
Tired of deploying bad data? Need to automate data pipelines with less red tape? Shipyard is the premier data orchestration platform built to help your data team quickly launch, monitor, and share workflows in a matter of minutes. Build powerful workflows that connect your entire data stack end-to-end with a mix of your code and their open-source, low-code templates. Once launched, Shipyard makes data observability easy with logging, alerting, and retries that will catch errors before your business team does. So whether you’re ingesting data from an API, transforming it with dbt, updating BI tools, or sending data alerts, Shipyard centralizes these operations and handles the heavy lifting so your data team can finally focus on what they’re good at — solving problems with data. Go to dataengineeringpodcast.com/shipyard to get started automating with their free developer plan today!
Your host is Tobias Macey and today I’m interviewing Inbar Yogev and Lior Winner about the data platform that the team at Riskified are building to power their fraud management service
Interview
Introduction
How did you get involved in the area of data management?
What does Riskified do?
Can you describe the role of data at Riskified?
What are some of the core types and sources of information that you are dealing with?
Who/what are the primary consumers of the data that you are responsible for?
What are the team structures that you have tested for your data professionals?
What is the composition of your data roles? (e.g. ML engineers, data engineers, data scientists, data product managers, etc.)
What are the organizational constraints that have the biggest impact on the design and usage of your data systems?
Can you describe the current architecture of your data platform?
What are some of the most notable evolutions/redesigns that you have gone through?
What is your process for establishing and evaluating selection criteria for any new technologies that you adopt?
How do you facilitate knowledge sharing between data professionals?
What have you found to be the most challenging technological and organizational complexities that you have had to address on the path to your current state?
What are the methods that you use for staying up to date with the data ecosystem? (opportunity to discuss Haya Data conference)
In your role as organizers of the Haya Data conference, what are some of the insights that you have gained into the present state and future trajectory of the data community?
What are the most interesting, innovative, or unexpected ways that you have seen the Riskified data platform used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on the data platform for Riskified?
What do you have planned for the future of your data platform?
Contact Info
Inbar
LinkedIn
Lior
LinkedIn
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers
Links
Riskified
ADABAS
Aerospike
Podcast Episode
Neo4J
Kafka
Delta Lake
Podcast Episode
Databricks
Snowflake
Podcast Episode
Tableau
Looker
Podcast Episode
Redshift
Event Sourcing
Avro
hayaData Conference
Data Mesh
Data Catalog
Data Governance
MLOps
Podcast Episode
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
7/10/2022 • 39 minutes, 57 seconds
Maintain Your Data Engineers' Sanity By Embracing Automation
Summary
Building and maintaining reliable data assets is the prime directive for data engineers. While it is easy to say, it is endlessly complex to implement, requiring data professionals to be experts in a wide range of disparate topics while designing and implementing complex topologies of information workflows. In order to make this a tractable problem it is essential that engineers embrace automation at every opportunity. In this episode Chris Riccomini shares his experiences building and scaling data operations at WePay and LinkedIn, as well as the lessons he has learned working with other teams as they automated their own systems.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their new managed database service you can launch a production ready MySQL, Postgres, or MongoDB cluster in minutes, with automated backups, 40 Gbps connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don’t forget to thank them for their continued support of this show!
RudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their state-of-the-art reverse ETL pipelines enable you to send enriched data to any cloud tool. Sign up free… or just get the free t-shirt for being a listener of the Data Engineering Podcast at dataengineeringpodcast.com/rudder.
Data teams are increasingly under pressure to deliver. According to a recent survey by Ascend.io, 95% in fact reported being at or over capacity. With 72% of data experts reporting demands on their team going up faster than they can hire, it’s no surprise they are increasingly turning to automation. In fact, while only 3.5% report having current investments in automation, 85% of data teams plan on investing in automation in the next 12 months. 85%!!! That’s where our friends at Ascend.io come in. The Ascend Data Automation Cloud provides a unified platform for data ingestion, transformation, orchestration, and observability. Ascend users love its declarative pipelines, powerful SDK, elegant UI, and extensible plug-in architecture, as well as its support for Python, SQL, Scala, and Java. Ascend automates workloads on Snowflake, Databricks, BigQuery, and open source Spark, and can be deployed in AWS, Azure, or GCP. Go to dataengineeringpodcast.com/ascend and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $5,000 when you become a customer.
Your host is Tobias Macey and today I’m interviewing Chris Riccomini about building awareness of data usage into CI/CD pipelines for application development
Interview
Introduction
How did you get involved in the area of data management?
What are the pieces of data platforms and processing that have been most difficult to scale in an organizational sense?
What are the opportunities for automation to alleviate some of the toil that data and analytics engineers get caught up in?
The application delivery ecosystem has been going through ongoing transformation in the form of CI/CD, infrastructure as code, etc. What are the parallels in the data ecosystem that are still nascent?
What are the principles that still need to be translated for data practitioners? Which are subject to impedance mismatch and may never make sense to translate?
As someone with a software engineering background and extensive experience working in data, what are the missing links to make those teams/objectives work together more seamlessly?
How can tooling and automation help in that endeavor?
A key factor in the adoption of automation for application delivery is automated tests. What are some of the strategies you find useful for identifying scope and targets for testing/monitoring of data products?
As data usage and capabilities grow and evolve in an organization, what are the junction points that are in greatest need of well-defined data contracts?
How can automation aid in enforcing and alerting on those contracts in a continuous fashion?
What are the most interesting, innovative, or unexpected ways that you have seen automation of data operations used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on automation for data systems?
When is automation the wrong choice?
What does the future of data engineering look like?
Contact Info
Website
@criccomini on Twitter
criccomini on GitHub
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers
Links
WePay
Enterprise Service Bus
The Missing README
Hadoop
Confluent Schema Registry
Podcast Episode
Avro
CDC == Change Data Capture
Debezium
Podcast Episode
Data Mesh
What the heck is a data mesh? blog post
SRE == Site Reliability Engineer
Terraform
Chef configuration management tool
Puppet configuration management tool
Ansible configuration management tool
BigQuery
Airflow
Pulumi
Podcast.__init__ Episode
Monte Carlo
Podcast Episode
Bigeye
Podcast Episode
Anomalo
Podcast Episode
Great Expectations
Podcast Episode
Schemata
Data Engineering Weekly newsletter
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
7/10/2022 • 1 hour, 5 minutes, 8 seconds
Be Confident In Your Data Integration By Quickly Validating Matching Records With data-diff
Summary
The perennial challenge of data engineers is ensuring that information is integrated reliably. While it is straightforward to know whether a synchronization process succeeded, it is not always clear whether every record was copied correctly. In order to quickly identify if and how two data systems are out of sync Gleb Mezhanskiy and Simon Eskildsen partnered to create the open source data-diff utility. In this episode they explain how the utility is implemented to run quickly and how you can start using it in your own data workflows to ensure that your data warehouse isn’t missing any records from your source systems.
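As a rough illustration of the general idea (not data-diff’s actual implementation), the sketch below checksums segments of rows and only compares rows individually where the checksums disagree; the table representation, segment size, and sample data are simplified assumptions.

```python
# Illustrative sketch of checksum-based table comparison. A production tool
# like data-diff segments by primary-key range and runs the checksums inside
# each database; here both "tables" are just sorted lists of (key, value).
import hashlib

def segment_checksum(rows):
    # Hash the keys and values of a segment into a single digest.
    digest = hashlib.md5()
    for key, value in rows:
        digest.update(f"{key}:{value}".encode())
    return digest.hexdigest()

def diff_keys(source, target, segment_size=1000):
    """Return keys whose rows differ between two sorted (key, value) lists.

    Only segments whose checksums disagree are compared row by row, which is
    what keeps this style of comparison cheap when the tables mostly match.
    """
    mismatched = set()
    for start in range(0, max(len(source), len(target)), segment_size):
        src_seg = source[start:start + segment_size]
        tgt_seg = target[start:start + segment_size]
        if segment_checksum(src_seg) != segment_checksum(tgt_seg):
            mismatched.update(k for k, _ in set(src_seg) ^ set(tgt_seg))
    return sorted(mismatched)

# Example: one row differs between the two "tables".
a = [(1, "alice"), (2, "bob"), (3, "carol")]
b = [(1, "alice"), (2, "bobby"), (3, "carol")]
print(diff_keys(a, b, segment_size=2))  # [2]
```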
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their new managed database service you can launch a production ready MySQL, Postgres, or MongoDB cluster in minutes, with automated backups, 40 Gbps connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don’t forget to thank them for their continued support of this show!
Random data doesn’t do it — and production data is not safe (or legal) for developers to use. What if you could mimic your entire production database to create a realistic dataset with zero sensitive data? Tonic.ai does exactly that. With Tonic, you can generate fake data that looks, acts, and behaves like production because it’s made from production. Using universal data connectors and a flexible API, Tonic integrates seamlessly into your existing pipelines and allows you to shape and size your data to the scale, realism, and degree of privacy that you need. The platform offers advanced subsetting, secure de-identification, and ML-driven data synthesis to create targeted test data for all of your pre-production environments. Your newly mimicked datasets are safe to share with developers, QA, data scientists—heck, even distributed teams around the world. Shorten development cycles, eliminate the need for cumbersome data pipeline work, and mathematically guarantee the privacy of your data, with Tonic.ai. Data Engineering Podcast listeners can sign up for a free 2-week sandbox account, go to dataengineeringpodcast.com/tonic today to give it a try!
Data teams are increasingly under pressure to deliver. According to a recent survey by Ascend.io, 95% in fact reported being at or over capacity. With 72% of data experts reporting demands on their team going up faster than they can hire, it’s no surprise they are increasingly turning to automation. In fact, while only 3.5% report having current investments in automation, 85% of data teams plan on investing in automation in the next 12 months. 85%!!! That’s where our friends at Ascend.io come in. The Ascend Data Automation Cloud provides a unified platform for data ingestion, transformation, orchestration, and observability. Ascend users love its declarative pipelines, powerful SDK, elegant UI, and extensible plug-in architecture, as well as its support for Python, SQL, Scala, and Java. Ascend automates workloads on Snowflake, Databricks, BigQuery, and open source Spark, and can be deployed in AWS, Azure, or GCP. Go to dataengineeringpodcast.com/ascend and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $5,000 when you become a customer.
RudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their state-of-the-art reverse ETL pipelines enable you to send enriched data to any cloud tool. Sign up free… or just get the free t-shirt for being a listener of the Data Engineering Podcast at dataengineeringpodcast.com/rudder.
Your host is Tobias Macey and today I’m interviewing Gleb Mezhanskiy and Simon Eskildsen about their work to open source the data diff utility that they have been building at Datafold
Interview
Introduction
How did you get involved in the area of data management?
Can you describe what the data diff tool is and the story behind it?
What was your motivation for going through the process of releasing your data diff functionality as an open source utility?
What are some of the ways that data-diff composes with other data quality tools? (e.g. Great Expectations, Soda SQL, etc.)
Can you describe how data-diff is implemented?
Given the target of having a performant and scalable utility how did you approach the question of language selection?
What are some of the ways that you have seen data-diff incorporated in the workflow of data teams?
What were the steps that you needed to do to get the project cleaned up and separated from your internal implementation for release as open source?
What are the most interesting, innovative, or unexpected ways that you have seen data-diff used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on data-diff?
When is data-diff the wrong choice?
What do you have planned for the future of data-diff?
Contact Info
Gleb
LinkedIn
@glebmm on Twitter
Simon
Website
@Sirupsen on Twitter
sirupsen on GitHub
LinkedIn
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers
Links
Datafold
Podcast Episode
data-diff
Autodesk
Airbyte
Podcast Episode
Debezium
Podcast Episode
Napkin Math newsletter
Airflow
Dagster
Podcast Episode
Great Expectations
Podcast Episode
dbt
Podcast Episode
Trino
Preql
Podcast.__init__ Episode
Erez Shinan
Fivetran
Podcast Episode
md5
CRC32
Merkle Tree
Locally Optimistic
Presto
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Special Guest: Gleb Mezhanskiy.
7/3/2022 • 1 hour, 10 minutes, 57 seconds
The View From The Lakehouse Of Architectural Patterns For Your Data Platform
Summary
The ecosystem for data tools has been going through rapid and constant evolution over the past several years. These technological shifts have brought about corresponding changes in data and platform architectures for managing data and analytical workflows. In this episode Colleen Tartow shares her insights into the motivating factors and benefits of the most prominent patterns that are in the popular narrative; data mesh and the modern data stack. She also discusses her views on the role of the data lakehouse as a building block for these architectures and the ongoing influence that it will have as the technology matures.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their new managed database service you can launch a production ready MySQL, Postgres, or MongoDB cluster in minutes, with automated backups, 40 Gbps connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don’t forget to thank them for their continued support of this show!
Atlan is the metadata hub for your data ecosystem. Instead of locking your metadata into a new silo, unleash its transformative potential with Atlan’s active metadata capabilities. Push information about data freshness and quality to your business intelligence, automatically scale up and down your warehouse based on usage patterns, and let the bots answer those questions in Slack so that the humans can focus on delivering real value. Go to dataengineeringpodcast.com/atlan today to learn more about how Atlan’s active metadata platform is helping pioneering data teams like Postman, Plaid, WeWork & Unilever achieve extraordinary things with metadata and escape the chaos.
Modern data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days or even weeks. By the time errors have made their way into production, it’s often too late and damage is done. Datafold built automated regression testing to help data and analytics engineers deal with data quality in their pull requests. Datafold shows how a change in SQL code affects your data, both on a statistical level and down to individual rows and values before it gets merged to production. No more shipping and praying, you can now know exactly what will change in your database! Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI workflows. Visit dataengineeringpodcast.com/datafold today to book a demo with Datafold.
Tired of deploying bad data? Need to automate data pipelines with less red tape? Shipyard is the premier data orchestration platform built to help your data team quickly launch, monitor, and share workflows in a matter of minutes. Build powerful workflows that connect your entire data stack end-to-end with a mix of your code and their open-source, low-code templates. Once launched, Shipyard makes data observability easy with logging, alerting, and retries that will catch errors before your business team does. So whether you’re ingesting data from an API, transforming it with dbt, updating BI tools, or sending data alerts, Shipyard centralizes these operations and handles the heavy lifting so your data team can finally focus on what they’re good at — solving problems with data. Go to dataengineeringpodcast.com/shipyard to get started automating with their free developer plan today!
Your host is Tobias Macey and today I’m interviewing Colleen Tartow about her views on the forces shaping the current generation of data architectures
Interview
Introduction
How did you get involved in the area of data management?
In your opinion as an astrophysicist, how well does the metaphor of a starburst map to your current work at the company of the same name?
Can you describe what you see as the dominant factors that influence a team’s approach to data architecture and design?
Two of the most repeated (often mis-attributed) terms in the data ecosystem for the past couple of years are the "modern data stack" and the "data mesh". As someone who is working at a company that can be construed to provide solutions for either/both of those patterns, what are your thoughts on their lasting strength and long-term viability?
What do you see as the strengths of the emerging lakehouse architecture in the context of the "modern data stack"?
What are the factors that have prevented it from being a default choice compared to cloud data warehouses? (e.g. BigQuery, Redshift, Snowflake, Firebolt, etc.)
What are the recent developments that are contributing to its current growth?
What are the weak points/sharp edges that still need to be addressed? (both internal to the platforms and in the external ecosystem/integrations)
What are some of the implementation challenges that teams often experience when trying to adopt a lakehouse strategy as the core building block of their data systems?
What are some of the exercises that they should be performing to help determine their technical and organizational capacity to support that strategy over the long term?
One of the core requirements for a data mesh implementation is to have a common system that allows for product teams to easily build their solutions on top of. How do lakehouse/data virtualization systems allow for that?
What are some of the lessons that need to be shared with engineers to help them make effective use of these technologies when building their own data products?
What are some of the supporting services that are helpful in these undertakings?
What do you see as the forces that will have the most influence on the trajectory of data architectures over the next 2 – 5 years?
What are the most interesting, innovative, or unexpected ways that you have seen lakehouse architectures used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on the Starburst product?
When is a lakehouse the wrong choice?
What do you have planned for the future of Starburst’s technology platform?
Contact Info
LinkedIn
@ctartow on Twitter
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers
Links
Starburst
Trino
Teradata
Cognos
Data Lakehouse
Data Virtualization
Iceberg
Podcast Episode
Hudi
Podcast Episode
Delta
Podcast Episode
Snowflake
Podcast Episode
AWS Lake Formation
Clickhouse
Podcast Episode
Druid
Pinot
Podcast Episode
Starburst Galaxy
Varada
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
7/3/2022 • 58 minutes, 43 seconds
Strategies And Tactics For A Successful Master Data Management Implementation
Summary
The most complicated part of data engineering is the effort involved in making the raw data fit into the narrative of the business. Master Data Management (MDM) is the process of building consensus around what the information actually means in the context of the business and then shaping the data to match those semantics. In this episode Malcolm Hawker shares his years of experience working in this domain to explore the combination of technical and social skills that are necessary to make an MDM project successful both at the outset and over the long term.
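As one concrete (and heavily simplified) example of the technical side of that work, the sketch below fuzzy-matches candidate duplicate records ahead of a merge step; the fields, sample records, and similarity threshold are illustrative assumptions, not recommendations from the episode.

```python
# Toy sketch of one ingredient of many MDM implementations: fuzzy matching of
# candidate duplicate customer records before a survivorship/merge step.
# The fields, records, and 0.85 threshold are illustrative assumptions.
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    # Ratio in [0, 1]; real MDM tools often combine several measures
    # (e.g. Levenshtein distance, Soundex) with per-field weights.
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def candidate_matches(records, threshold=0.85):
    matches = []
    for i, left in enumerate(records):
        for right in records[i + 1:]:
            score = similarity(left["name"], right["name"])
            if score >= threshold:
                matches.append((left["id"], right["id"], round(score, 2)))
    return matches

records = [
    {"id": "A1", "name": "Acme Corp"},
    {"id": "A2", "name": "ACME Corp."},
    {"id": "B1", "name": "Globex Industries"},
]
print(candidate_matches(records))  # [('A1', 'A2', 0.95)]
```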
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their new managed database service you can launch a production ready MySQL, Postgres, or MongoDB cluster in minutes, with automated backups, 40 Gbps connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don’t forget to thank them for their continued support of this show!
Random data doesn’t do it — and production data is not safe (or legal) for developers to use. What if you could mimic your entire production database to create a realistic dataset with zero sensitive data? Tonic.ai does exactly that. With Tonic, you can generate fake data that looks, acts, and behaves like production because it’s made from production. Using universal data connectors and a flexible API, Tonic integrates seamlessly into your existing pipelines and allows you to shape and size your data to the scale, realism, and degree of privacy that you need. The platform offers advanced subsetting, secure de-identification, and ML-driven data synthesis to create targeted test data for all of your pre-production environments. Your newly mimicked datasets are safe to share with developers, QA, data scientists—heck, even distributed teams around the world. Shorten development cycles, eliminate the need for cumbersome data pipeline work, and mathematically guarantee the privacy of your data, with Tonic.ai. Data Engineering Podcast listeners can sign up for a free 2-week sandbox account, go to dataengineeringpodcast.com/tonic today to give it a try!
RudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their state-of-the-art reverse ETL pipelines enable you to send enriched data to any cloud tool. Sign up free… or just get the free t-shirt for being a listener of the Data Engineering Podcast at dataengineeringpodcast.com/rudder.
Data teams are increasingly under pressure to deliver. According to a recent survey by Ascend.io, 95% in fact reported being at or over capacity. With 72% of data experts reporting demands on their team going up faster than they can hire, it’s no surprise they are increasingly turning to automation. In fact, while only 3.5% report having current investments in automation, 85% of data teams plan on investing in automation in the next 12 months. 85%!!! That’s where our friends at Ascend.io come in. The Ascend Data Automation Cloud provides a unified platform for data ingestion, transformation, orchestration, and observability. Ascend users love its declarative pipelines, powerful SDK, elegant UI, and extensible plug-in architecture, as well as its support for Python, SQL, Scala, and Java. Ascend automates workloads on Snowflake, Databricks, BigQuery, and open source Spark, and can be deployed in AWS, Azure, or GCP. Go to dataengineeringpodcast.com/ascend and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $5,000 when you become a customer.
Your host is Tobias Macey and today I’m interviewing Malcolm Hawker about master data management strategies for the enterprise
Interview
Introduction
How did you get involved in the area of data management?
Can you start by giving your definition of what MDM is and the scope of activities/functions that it includes?
How have evolutions in the data landscape shifted the conversation around MDM?
Can you describe what Profisee is and the story behind it?
What was your path to joining Profisee and what is your role in the business?
Who are the target customers for Profisee?
What are the challenges that they typically experience that leads them to MDM as a solution for their problems?
How does the narrative around data observability/data quality from tools such as Great Expectations, Monte Carlo, etc. differ from the data quality benefits of a MDM strategy?
How do recent conversations around semantic/metrics layers compare to the way that MDM approaches the problem of domain modeling?
What are the steps to defining an MDM strategy for an organization or business unit?
Once there is a strategy, what are the tactical elements of the implementation?
What is the role of the toolchain in that implementation? (e.g. Spark, dbt, Airflow, etc.)
Can you describe how Profisee is implemented?
How does the customer base inform the architectural approach that Profisee has taken?
Can you describe the adoption process for an organization that is using Profisee for their MDM?
Once an organization has defined and adopted an MDM strategy, what are the ongoing maintenance tasks related to the domain models?
What are the most interesting, innovative, or unexpected ways that you have seen MDM used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working in MDM?
When is Profisee the wrong choice?
What do you have planned for the future of Profisee?
Contact Info
LinkedIn
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers
Links
Profisee
MDM == Master Data Management
CRM == Customer Relationship Management
ERP == Enterprise Resource Planning
Levenshtein Distance Algorithm
Soundex
CDP == Customer Data Platform
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
6/27/2022 • 1 hour, 9 minutes, 8 seconds
Bring Geospatial Analytics Across Disparate Datasets Into Your Toolkit With The Unfolded Platform
Summary
The proliferation of sensors and GPS devices has dramatically increased the number of applications for spatial data, and the need for scalable geospatial analytics. In order to reduce the friction involved in aggregating disparate data sets that share geographic similarities the Unfolded team built a platform that supports working across raster, vector, and tabular data in a single system. In this episode Isaac Brodsky explains how the Unfolded platform is architected, their experience joining the team at Foursquare, and how you can start using it for analyzing your spatial data today.
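For a sense of what joining tabular records to vector geometries involves, here is a toy point-in-polygon aggregation using shapely; the coordinates, regions, and readings are made up, and this sketch is unrelated to how the Unfolded platform is actually implemented.

```python
# Toy sketch of joining tabular point records to vector polygons, a common
# geospatial-analytics step. Illustrative only; requires the shapely package.
from shapely.geometry import Point, Polygon

# Hypothetical vector layer: two square "regions".
regions = {
    "west": Polygon([(0, 0), (0, 10), (10, 10), (10, 0)]),
    "east": Polygon([(10, 0), (10, 10), (20, 10), (20, 0)]),
}

# Hypothetical tabular layer: sensor readings with coordinates.
readings = [
    {"sensor": "s1", "x": 2.0, "y": 3.0, "value": 1.5},
    {"sensor": "s2", "x": 12.0, "y": 8.0, "value": 4.0},
    {"sensor": "s3", "x": 15.0, "y": 1.0, "value": 2.5},
]

# Aggregate readings per region via point-in-polygon tests.
totals = {name: 0.0 for name in regions}
for row in readings:
    point = Point(row["x"], row["y"])
    for name, shape in regions.items():
        if shape.contains(point):
            totals[name] += row["value"]

print(totals)  # {'west': 1.5, 'east': 6.5}
```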
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their new managed database service you can launch a production ready MySQL, Postgres, or MongoDB cluster in minutes, with automated backups, 40 Gbps connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don’t forget to thank them for their continued support of this show!
Atlan is the metadata hub for your data ecosystem. Instead of locking your metadata into a new silo, unleash its transformative potential with Atlan’s active metadata capabilities. Push information about data freshness and quality to your business intelligence, automatically scale up and down your warehouse based on usage patterns, and let the bots answer those questions in Slack so that the humans can focus on delivering real value. Go to dataengineeringpodcast.com/atlan today to learn more about how Atlan’s active metadata platform is helping pioneering data teams like Postman, Plaid, WeWork & Unilever achieve extraordinary things with metadata and escape the chaos.
Modern data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days or even weeks. By the time errors have made their way into production, it’s often too late and damage is done. Datafold built automated regression testing to help data and analytics engineers deal with data quality in their pull requests. Datafold shows how a change in SQL code affects your data, both on a statistical level and down to individual rows and values before it gets merged to production. No more shipping and praying, you can now know exactly what will change in your database! Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI workflows. Visit dataengineeringpodcast.com/datafold today to book a demo with Datafold.
Unstruk is the DataOps platform for your unstructured data. The options for ingesting, organizing, and curating unstructured files are complex, expensive, and bespoke. Unstruk Data is changing that equation with their platform approach to manage your unstructured assets. Built to handle all of your real-world data, from videos and images, to 3d point clouds and geospatial records, to industry specific file formats, Unstruk streamlines your workflow by converting human hours into machine minutes, and automatically alerting you to insights found in your dark data. Unstruk handles data versioning, lineage tracking, duplicate detection, consistency validation, as well as enrichment through sources including machine learning models, 3rd party data, and web APIs. Go to dataengineeringpodcast.com/unstruk today to transform your messy collection of unstructured data files into actionable assets that power your business.
Your host is Tobias Macey and today I’m interviewing Isaac Brodsky about Foursquare’s Unfolded platform for working with spatial data
Interview
Introduction
How did you get involved in the area of data management?
Can you describe what the Unfolded platform is and the story behind it?
What are some of the core challenges of working with spatial data?
What are some of the sources that organizations rely on for collecting or generating those data sets?
What are the capabilities that the Unfolded platform offers for spatial analytics?
What use cases are you primarily focused on supporting?
What (if any) are the datasets or analyses that you are consciously not investing in supporting?
Can you describe how the Unfolded platform is implemented?
How have the design and goals shifted or evolved since you started working on Unfolded?
What are the new constraints or opportunities that are available after the merger with Foursquare?
Can you describe a typical workflow for someone using Unfolded to manage their spatial information and build an analysis on top of it?
What are some of the data modeling considerations that are necessary when populating a custom data set with Unfolded?
What are some of the techniques that you needed to build to allow for loading large data sets into a user’s browser while maintaining sufficient performance?
What are the most interesting, innovative, or unexpected ways that you have seen Unfolded used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on Unfolded?
When is Unfolded the wrong choice?
What do you have planned for the future of Unfolded?
Contact Info
LinkedIn
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers
Links
Unfolded Platform
H3 Hexagonal Map Tiles Library
Carto
Mapbox
Open Street Map
Raster Files
Hex Tiles
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
6/27/2022 • 1 hour, 7 minutes, 1 second
Combining The Simplicity Of Spreadsheets With The Power Of Modern Data Infrastructure At Canvas
Summary
Data analysis is a valuable exercise that is often out of reach of non-technical users as a result of the complexity of data systems. In order to lower the barrier to entry, Ryan Buick created the Canvas application with a spreadsheet-oriented workflow that is understandable to a wide audience. In this episode Ryan explains how he and his team have designed their platform to bring everyone onto a level playing field and the benefits that it provides to the organization.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their new managed database service you can launch a production ready MySQL, Postgres, or MongoDB cluster in minutes, with automated backups, 40 Gbps connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don’t forget to thank them for their continued support of this show!
Atlan is the metadata hub for your data ecosystem. Instead of locking your metadata into a new silo, unleash its transformative potential with Atlan’s active metadata capabilities. Push information about data freshness and quality to your business intelligence, automatically scale up and down your warehouse based on usage patterns, and let the bots answer those questions in Slack so that the humans can focus on delivering real value. Go to dataengineeringpodcast.com/atlan today to learn more about how Atlan’s active metadata platform is helping pioneering data teams like Postman, Plaid, WeWork & Unilever achieve extraordinary things with metadata and escape the chaos.
Modern data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days or even weeks. By the time errors have made their way into production, it’s often too late and damage is done. Datafold built automated regression testing to help data and analytics engineers deal with data quality in their pull requests. Datafold shows how a change in SQL code affects your data, both on a statistical level and down to individual rows and values before it gets merged to production. No more shipping and praying, you can now know exactly what will change in your database! Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI workflows. Visit dataengineeringpodcast.com/datafold today to book a demo with Datafold.
Unstruk is the DataOps platform for your unstructured data. The options for ingesting, organizing, and curating unstructured files are complex, expensive, and bespoke. Unstruk Data is changing that equation with their platform approach to manage your unstructured assets. Built to handle all of your real-world data, from videos and images, to 3d point clouds and geospatial records, to industry specific file formats, Unstruk streamlines your workflow by converting human hours into machine minutes, and automatically alerting you to insights found in your dark data. Unstruk handles data versioning, lineage tracking, duplicate detection, consistency validation, as well as enrichment through sources including machine learning models, 3rd party data, and web APIs. Go to dataengineeringpodcast.com/unstruk today to transform your messy collection of unstructured data files into actionable assets that power your business.
Your host is Tobias Macey and today I’m interviewing Ryan Buick about Canvas, a spreadsheet interface for your data that lets everyone on your team explore data without having to learn SQL
Interview
Introduction
How did you get involved in the area of data management?
Can you describe what Canvas is and the story behind it?
The "modern data stack" has enabled organizations to analyze unparalleled volumes of data. What are the shortcomings in the operating model that keeps business users dependent on engineers to answer their questions?
Why is the spreadsheet such a popular and persistent metaphor for working with data?
What are the biggest issues that existing spreadsheet software run up against as they scale both technically and organizationally?
What are the new metaphors/design elements that you needed to develop to extend the existing capabilities and use cases of spreadsheets while keeping them familiar?
Can you describe how the Canvas platform is implemented?
How have the design and goals of the product changed/evolved since you started working on it?
What is the workflow for a business user that is using Canvas to iterate on a series of questions?
What are the collaborative features that you have built into Canvas and who are they for? (e.g. other business users, data engineers <-> business users, etc.)
What are the situations where the spreadsheet abstraction starts to break down?
What are the extension points/escape hatches that you have built into the product for when that happens?
What are the most interesting, innovative, or unexpected ways that you have seen Canvas used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on Canvas?
When is Canvas the wrong choice?
What do you have planned for the future of Canvas?
Contact Info
LinkedIn
@ryanjbuick on Twitter
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers
Links
Canvas
Flexport
Podcast Episode about their data mesh implementation
Excel
Lightdash
Podcast Episode
dbt
Podcast Episode
Figma
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
6/19/2022 • 42 minutes, 58 seconds
Level Up Your Data Platform With Active Metadata
Summary
Metadata is the lifeblood of your data platform, providing information about what is happening in your systems. A variety of platforms have been developed to capture and analyze that information to great effect, but they are inherently limited in their utility due to their nature as storage systems. In order to level up their value, a new trend of active metadata is being implemented, enabling use cases like keeping BI reports up to date, auto-scaling your warehouses, and automating data governance. In this episode Prukalpa Sankar joins the show to talk about the work she and her team at Atlan are doing to push this capability into the mainstream.
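The core of the "active" pattern is that metadata events trigger actions in the tools people already use, rather than sitting passively in a catalog. The toy sketch below is purely illustrative and is not how Atlan is implemented: it checks a table's freshness timestamp and pushes a warning to a Slack incoming webhook. The webhook URL, table metadata, and SLA threshold are all placeholders.

```python
from datetime import datetime, timedelta, timezone

import requests  # pip install requests

SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/XXX/YYY/ZZZ"  # placeholder
FRESHNESS_SLA = timedelta(hours=6)  # arbitrary threshold for the example

# In a real system this record would come from your metadata platform's API or event stream.
table_metadata = {
    "name": "analytics.orders_daily",
    "last_loaded_at": datetime(2022, 6, 19, 2, 0, tzinfo=timezone.utc),
    "owner": "@data-eng-oncall",
}

def alert_if_stale(meta: dict) -> None:
    """Push a freshness warning into Slack when a table misses its SLA."""
    age = datetime.now(timezone.utc) - meta["last_loaded_at"]
    if age > FRESHNESS_SLA:
        message = (
            f":warning: {meta['name']} is {age.total_seconds() / 3600:.1f}h stale "
            f"(SLA {FRESHNESS_SLA}). Owner: {meta['owner']}"
        )
        requests.post(SLACK_WEBHOOK_URL, json={"text": message}, timeout=10)

alert_if_stale(table_metadata)
```

The same event-driven shape generalizes to the other use cases mentioned above, such as resizing a warehouse when usage metadata crosses a threshold.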
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their new managed database service you can launch a production ready MySQL, Postgres, or MongoDB cluster in minutes, with automated backups, 40 Gbps connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don’t forget to thank them for their continued support of this show!
RudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their state-of-the-art reverse ETL pipelines enable you to send enriched data to any cloud tool. Sign up free… or just get the free t-shirt for being a listener of the Data Engineering Podcast at dataengineeringpodcast.com/rudder.
Data teams are increasingly under pressure to deliver. According to a recent survey by Ascend.io, 95% in fact reported being at or over capacity. With 72% of data experts reporting demands on their team going up faster than they can hire, it’s no surprise they are increasingly turning to automation. In fact, while only 3.5% report having current investments in automation, 85% of data teams plan on investing in automation in the next 12 months. 85%!!! That’s where our friends at Ascend.io come in. The Ascend Data Automation Cloud provides a unified platform for data ingestion, transformation, orchestration, and observability. Ascend users love its declarative pipelines, powerful SDK, elegant UI, and extensible plug-in architecture, as well as its support for Python, SQL, Scala, and Java. Ascend automates workloads on Snowflake, Databricks, BigQuery, and open source Spark, and can be deployed in AWS, Azure, or GCP. Go to dataengineeringpodcast.com/ascend and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $5,000 when you become a customer.
Today’s episode is Sponsored by Prophecy.io – the low-code data engineering platform for the cloud. Prophecy provides an easy-to-use visual interface to design & deploy data pipelines on Apache Spark & Apache Airflow. Now all the data users can use software engineering best practices – git, tests and continuous deployment with a simple to use visual designer. How does it work? – You visually design the pipelines, and Prophecy generates clean Spark code with tests on git; then you visually schedule these pipelines on Airflow. You can observe your pipelines with built in metadata search and column level lineage. Finally, if you have existing workflows in AbInitio, Informatica or other ETL formats that you want to move to the cloud, you can import them automatically into Prophecy making them run productively on Spark. Create your free account today at dataengineeringpodcast.com/prophecy.
Your host is Tobias Macey and today I’m interviewing Prukalpa Sankar about how data platforms can benefit from the idea of "active metadata" and the work that she and her team at Atlan are doing to make it a reality
Interview
Introduction
How did you get involved in the area of data management?
Can you describe what "active metadata" is and how it differs from the current approaches to metadata systems?
What are some of the use cases that "active metadata" can enable for data producers and consumers?
What are the points of friction that those users encounter in the current formulation of metadata systems?
Central metadata systems/data catalogs came about as a solution to the challenge of integrating every data tool with every other data tool, giving a single place to integrate. What are the lessons that are being learned from the "modern data stack" that can be applied to centralized metadata?
Can you describe the approach that you are taking at Atlan to enable the adoption of "active metadata"?
What are the architectural capabilities that you had to build to power the outbound traffic flows?
How are you addressing the N x M integration problem for pushing metadata into the necessary contexts at Atlan?
What are the interfaces that are necessary for receiving systems to be able to make use of the metadata that is being delivered?
How does the type/category of metadata impact the type of integration that is necessary?
What are some of the automation possibilities that metadata activation offers for data teams?
What are the cases where you still need a human in the loop?
What are the most interesting, innovative, or unexpected ways that you have seen active metadata capabilities used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on activating metadata for your users?
When is an active approach to metadata the wrong choice?
What do you have planned for the future of Atlan and active metadata?
Contact Info
LinkedIn
@prukalpa on Twitter
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on Apple Podcasts and tell your friends and co-workers
Links
Atlan
What is Active Metadata?
Segment
Podcast Episode
Zapier
ArgoCD
Kubernetes
Wix
AWS Lambda
Modern Data Culture Blog Post
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
6/19/2022 • 52 minutes, 35 seconds
Discover And De-Clutter Your Unstructured Data With Aparavi
Summary
Unstructured data takes many forms in an organization. From a data engineering perspective, that often means things like JSON files, audio or video recordings, images, etc. Another category of unstructured data that every business deals with is PDFs, Word documents, workstation backups, and countless other types of information. Aparavi was created to tame the sprawl of information across machines, datacenters, and clouds so that you can reduce the amount of duplicate data and save time and money on managing your data assets. In this episode Rod Christensen shares the story behind Aparavi and how you can use it to cut costs and gain value for the long tail of your unstructured data.
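One of the basic building blocks behind duplicate detection, and the reason SHA-512 appears in the links for this episode, is content hashing: files with identical bytes produce identical digests no matter where they live or what they are named. The following is a small, self-contained Python sketch of that idea rather than a description of Aparavi's implementation; the directory path is a placeholder.

```python
import hashlib
from collections import defaultdict
from pathlib import Path

def sha512_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream the file through SHA-512 so large files never have to fit in memory."""
    digest = hashlib.sha512()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def find_duplicates(root: str):
    """Group every file under root by content hash and return only the duplicate sets."""
    by_digest = defaultdict(list)
    for path in Path(root).rglob("*"):
        if path.is_file():
            by_digest[sha512_of(path)].append(path)
    return {digest: paths for digest, paths in by_digest.items() if len(paths) > 1}

# Placeholder path; point this at a real directory to see duplicate groups.
for digest, paths in find_duplicates("/data/archive").items():
    print(digest[:16], [str(p) for p in paths])
```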
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their new managed database service you can launch a production ready MySQL, Postgres, or MongoDB cluster in minutes, with automated backups, 40 Gbps connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don’t forget to thank them for their continued support of this show!
This episode is brought to you by Acryl Data, the company behind DataHub, the leading developer-friendly data catalog for the modern data stack. Open Source DataHub is running in production at several companies like Peloton, Optum, Udemy, Zynga and others. Acryl Data provides DataHub as an easy to consume SaaS product which has been adopted by several companies. Signup for the SaaS product at dataengineeringpodcast.com/acryl
RudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their state-of-the-art reverse ETL pipelines enable you to send enriched data to any cloud tool. Sign up free… or just get the free t-shirt for being a listener of the Data Engineering Podcast at dataengineeringpodcast.com/rudder.
Data teams are increasingly under pressure to deliver. According to a recent survey by Ascend.io, 95% in fact reported being at or over capacity. With 72% of data experts reporting demands on their team going up faster than they can hire, it’s no surprise they are increasingly turning to automation. In fact, while only 3.5% report having current investments in automation, 85% of data teams plan on investing in automation in the next 12 months. 85%!!! That’s where our friends at Ascend.io come in. The Ascend Data Automation Cloud provides a unified platform for data ingestion, transformation, orchestration, and observability. Ascend users love its declarative pipelines, powerful SDK, elegant UI, and extensible plug-in architecture, as well as its support for Python, SQL, Scala, and Java. Ascend automates workloads on Snowflake, Databricks, BigQuery, and open source Spark, and can be deployed in AWS, Azure, or GCP. Go to dataengineeringpodcast.com/ascend and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $5,000 when you become a customer.
Your host is Tobias Macey and today I’m interviewing Rod Christensen about Aparavi, a platform designed to find and unlock the value of data, no matter where it lives
Interview
Introduction
How did you get involved in the area of data management?
Can you describe what Aparavi is and the story behind it?
Who are the target customers for Aparavi and how does that inform your product roadmap and messaging?
What are some of the insights that you are able to provide about an organization’s data?
Once you have generated those insights, what are some of the actions that they typically catalyze?
What are the types of storage and data systems that you integrate with?
Can you describe how the Aparavi platform is implemented?
How do the trends in cloud storage and data systems influence the ways that you evolve the system?
Can you describe a typical workflow for an organization using Aparavi?
What are the mechanisms that you use for categorizing data assets?
What are the interfaces that you provide for data owners and operators to provide heuristics to customize classification/cataloging of data?
How can teams integrate with Aparavi to expose its insights to other tools for uses such as automation or data catalogs?
What are the most interesting, innovative, or unexpected ways that you have seen Aparavi used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on Aparavi?
When is Aparavi the wrong choice?
What do you have planned for the future of Aparavi?
Contact Info
LinkedIn
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Links
Aparavi
SHA-512
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
6/13/2022 • 49 minutes, 12 seconds
Hire And Scale Your Data Team With Intention
Summary
Building a well-rounded and effective data team is an iterative process, and the first hire can set the stage for future success or failure. Trupti Natu has been the first data hire multiple times and gone through the process of building teams across the different stages of growth. In this episode she shares her thoughts and insights on how to be intentional about establishing your own data team.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their new managed database service you can launch a production ready MySQL, Postgres, or MongoDB cluster in minutes, with automated backups, 40 Gbps connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don’t forget to thank them for their continued support of this show!
Atlan is the metadata hub for your data ecosystem. Instead of locking your metadata into a new silo, unleash its transformative potential with Atlan’s active metadata capabilities. Push information about data freshness and quality to your business intelligence, automatically scale up and down your warehouse based on usage patterns, and let the bots answer those questions in Slack so that the humans can focus on delivering real value. Go to dataengineeringpodcast.com/atlan today to learn more about how Atlan’s active metadata platform is helping pioneering data teams like Postman, Plaid, WeWork & Unilever achieve extraordinary things with metadata and escape the chaos.
Modern data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days or even weeks. By the time errors have made their way into production, it’s often too late and damage is done. Datafold built automated regression testing to help data and analytics engineers deal with data quality in their pull requests. Datafold shows how a change in SQL code affects your data, both on a statistical level and down to individual rows and values before it gets merged to production. No more shipping and praying, you can now know exactly what will change in your database! Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI workflows. Visit dataengineeringpodcast.com/datafold today to book a demo with Datafold.
Unstruk is the DataOps platform for your unstructured data. The options for ingesting, organizing, and curating unstructured files are complex, expensive, and bespoke. Unstruk Data is changing that equation with their platform approach to manage your unstructured assets. Built to handle all of your real-world data, from videos and images, to 3d point clouds and geospatial records, to industry specific file formats, Unstruk streamlines your workflow by converting human hours into machine minutes, and automatically alerting you to insights found in your dark data. Unstruk handles data versioning, lineage tracking, duplicate detection, consistency validation, as well as enrichment through sources including machine learning models, 3rd party data, and web APIs. Go to dataengineeringpodcast.com/unstruk today to transform your messy collection of unstructured data files into actionable assets that power your business.
Your host is Tobias Macey and today I’m interviewing Trupti Natu about strategies for building your team, from the first data hire to post-acquisition
Interview
Introduction
How did you get involved in the area of FinTech & Data Science (management)?
How would you describe your overall career trajectory in data?
Can you describe what your experience has been as a data professional at different stages of company growth?
What are the traits that you look for in a first or second data hire at an organization?
What are useful metrics for success to help gauge the effectiveness of hires at this early stage of data capabilities?
What are the broad goals and projects that early data hires should be focused on?
What are the indicators that you look for to determine when to scale the team?
As you are building a team of data professionals, what are the organizational topologies that you have found most effective? (e.g. centralized vs. embedded data pros, etc.)
What are the recruiting and screening/interviewing techniques that you have found most helpful given the relative scarcity of experienced data practitioners?
What are the organizational and technical structures that are helpful to establish early in the organization’s data journey to reduce the onboarding time for new hires?
Your background has primarily been in FinTech. How does the business domain influence the types of background and domain expertise that you look for?
You recently went through an acquisition at the startup you were with. Can you describe the data-related projects that were required during the merger?
What are the impedance mismatches that you have had to resolve in your data systems, moving from a fast-moving startup into a larger, more established organization?
Being a FinTech company, what are some of the categories of regulatory considerations that you had to deal with during the integration process?
What are the most interesting, unexpected, or challenging lessons that you have learned along your career journey?
What are some of the pieces of advice that you wished you knew at the beginning of your career, and that you would like to share with others in that situation?
Contact Info
LinkedIn
@truptinatu on Twitter
Trupti is hiring for multiple product data science roles. Feel free to DM her on Twitter or LinkedIn to find out more
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used. The Machine Learning Podcast helps you go from idea to production with machine learning.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Links
SumoLogic
FinTech
PII == Personally Identifiable Information
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
6/13/2022 • 1 hour, 53 seconds
Simplify Data Security For Sensitive Information With The Skyflow Data Privacy Vault
Summary
The best way to make sure that you don’t leak sensitive data is to never have it in the first place. The team at Skyflow decided that the second-best way is to build a storage system dedicated to securely managing your sensitive information and making it easy to integrate with your applications and data systems. In this episode Sean Falconer explains the idea of a data privacy vault and how this new architectural element can drastically reduce the potential for making a mistake with how you manage regulated or personally identifiable information.
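The essential move in a data privacy vault is tokenization: the sensitive value lives only inside the vault, and every other system stores an opaque token that can be exchanged for the real value (or a masked form of it) under access control. Here is a deliberately simplified, in-memory Python sketch of that pattern; it is not Skyflow's API and omits the encryption, persistence, and authorization that a real vault requires.

```python
import secrets

class ToyPrivacyVault:
    """In-memory stand-in for a privacy vault: stores secrets, hands out tokens."""

    def __init__(self):
        self._store: dict[str, str] = {}

    def tokenize(self, value: str) -> str:
        token = "tok_" + secrets.token_urlsafe(16)
        self._store[token] = value
        return token

    def detokenize(self, token: str) -> str:
        return self._store[token]

    def masked(self, token: str) -> str:
        value = self._store[token]
        return "*" * (len(value) - 4) + value[-4:]

vault = ToyPrivacyVault()
token = vault.tokenize("4111 1111 1111 1111")

# Application databases and analytics pipelines only ever see the token...
print(token)
# ...while privileged workflows exchange it for the masked or full value.
print(vault.masked(token))     # only the last four characters are visible
print(vault.detokenize(token))
```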
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their new managed database service you can launch a production ready MySQL, Postgres, or MongoDB cluster in minutes, with automated backups, 40 Gbps connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don’t forget to thank them for their continued support of this show!
Atlan is the metadata hub for your data ecosystem. Instead of locking all of that information into a new silo, unleash its transformative potential with Atlan’s active metadata capabilities. Push information about data freshness and quality to your business intelligence, automatically scale up and down your warehouse based on usage patterns, and let the bots answer those questions in Slack so that the humans can focus on delivering real value. Go to dataengineeringpodcast.com/atlan today to learn more about how you can take advantage of active metadata and escape the chaos.
Modern data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days or even weeks. By the time errors have made their way into production, it’s often too late and damage is done. Datafold built automated regression testing to help data and analytics engineers deal with data quality in their pull requests. Datafold shows how a change in SQL code affects your data, both on a statistical level and down to individual rows and values before it gets merged to production. No more shipping and praying, you can now know exactly what will change in your database! Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI workflows. Visit dataengineeringpodcast.com/datafold today to book a demo with Datafold.
Data teams are increasingly under pressure to deliver. According to a recent survey by Ascend.io, 95% in fact reported being at or over capacity. With 72% of data experts reporting demands on their team going up faster than they can hire, it’s no surprise they are increasingly turning to automation. In fact, while only 3.5% report having current investments in automation, 85% of data teams plan on investing in automation in the next 12 months. 85%!!! That’s where our friends at Ascend.io come in. The Ascend Data Automation Cloud provides a unified platform for data ingestion, transformation, orchestration, and observability. Ascend users love its declarative pipelines, powerful SDK, elegant UI, and extensible plug-in architecture, as well as its support for Python, SQL, Scala, and Java. Ascend automates workloads on Snowflake, Databricks, BigQuery, and open source Spark, and can be deployed in AWS, Azure, or GCP. Go to dataengineeringpodcast.com/ascend and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $5,000 when you become a customer.
Your host is Tobias Macey and today I’m interviewing Sean Falconer about the idea of a data privacy vault and how the Skyflow team are working to make it turn-key
Interview
Introduction
How did you get involved in the area of data management?
Can you describe what Skyflow is and the story behind it?
What is a "data privacy vault" and how does it differ from strategies such as privacy engineering or existing data governance patterns?
What are the primary use cases and capabilities that you are focused on solving for with Skyflow?
Who is the target customer for Skyflow (e.g. how does it enter an organization)?
How is the Skyflow platform architected?
How have the design and goals of the system changed or evolved over time?
Can you describe the process of integrating with Skyflow at the application level?
For organizations that are building analytical capabilities on top of the data managed in their applications, what are the interactions with Skyflow at each of the stages in the data lifecycle?
One of the perennial problems with distributed systems is the challenge of joining data across machine boundaries. How do you mitigate that problem?
On your website there are different "vaults" advertised in the form of healthcare, fintech, and PII. What are the different requirements across each of those problem domains?
What are the commonalities?
As a relatively new company in an emerging product category, what are some of the customer education challenges that you are facing?
What are the most interesting, innovative, or unexpected ways that you have seen Skyflow used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on Skyflow?
When is Skyflow the wrong choice?
What do you have planned for the future of Skyflow?
Contact Info
LinkedIn
@seanfalconer on Twitter
Website
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Links
Skyflow
Privacy Engineering
Data Governance
Homomorphic Encryption
Polymorphic Encryption
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
6/6/2022 • 54 minutes, 4 seconds
Bringing The Modern Data Stack To Everyone With Y42
Summary
Cloud services have made highly scalable and performant data platforms economical and manageable for data teams. However, they are still challenging to work with and manage for anyone who isn’t in a technical role. Hung Dang understood the need to make data more accessible to the entire organization and created Y42 as a better user experience on top of the "modern data stack". In this episode he shares how he designed the platform to support the full spectrum of technical expertise in an organization and the interesting engineering challenges involved.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their new managed database service you can launch a production ready MySQL, Postgres, or MongoDB cluster in minutes, with automated backups, 40 Gbps connections from your application hosts, and high throughput SSDs. Go to dataengineeringpodcast.com/linode today and get a $100 credit to launch a database, create a Kubernetes cluster, or take advantage of all of their other services. And don’t forget to thank them for their continued support of this show!
This episode is brought to you by Acryl Data, the company behind DataHub, the leading developer-friendly data catalog for the modern data stack. Open Source DataHub is running in production at several companies like Peloton, Optum, Udemy, Zynga and others. Acryl Data provides DataHub as an easy to consume SaaS product which has been adopted by several companies. Signup for the SaaS product at dataengineeringpodcast.com/acryl
RudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their state-of-the-art reverse ETL pipelines enable you to send enriched data to any cloud tool. Sign up free… or just get the free t-shirt for being a listener of the Data Engineering Podcast at dataengineeringpodcast.com/rudder.
The most important piece of any data project is the data itself, which is why it is critical that your data source is high quality. PostHog is your all-in-one product analytics suite including product analysis, user funnels, feature flags, experimentation, and it’s open source so you can host it yourself or let them do it for you! You have full control over your data and their plugin system lets you integrate with all of your other data tools, including data warehouses and SaaS platforms. Give it a try today with their generous free tier at dataengineeringpodcast.com/posthog
Your host is Tobias Macey and today I’m interviewing Hung Dang about Y42, the full-stack data platform that anyone can run
Interview
Introduction
How did you get involved in the area of data management?
Can you describe what Y42 is and the story behind it?
How would you characterize your positioning in the data ecosystem?
What are the problems that you are trying to solve?
Who are the personas that you optimize for and how does that manifest in your product design and feature priorities?
How is the Y42 platform implemented?
What are the core engineering problems that you have had to address in order to tie together the various underlying services that you integrate?
How have the design and goals of the product changed or evolved since you started working on it?
What are the sharp edges and failure conditions that you have had to automate around in order to support non-technical users?
What is the process for integrating Y42 with an organization’s data systems?
What is the story for onboarding from existing systems and importing workflows (e.g. Airflow dags and dbt models)?
With your recent shift to using Git as the store of platform state, how do you approach the problem of reconciling branched changes with side effects from changes (e.g. creating tables or mutating table structures in the warehouse)?
Can you describe a typical workflow for building or modifying a business dashboard or activating data in the warehouse?
What are the interfaces and abstractions that you have built into the platform to support collaboration across roles and levels of experience? (technical or organizational)
With your focus on end-to-end support for data analysis, what are the extension points or escape hatches for use cases that you can’t support out of the box?
What are the most interesting, innovative, or unexpected ways that you have seen Y42 used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on Y42?
When is Y42 the wrong choice?
What do you have planned for the future of Y42?
Contact Info
LinkedIn
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Links
Y42
CDTM (Center for Digital Technology and Management)
Meltano
Podcast Episode
Airflow
Singer
dbt
Podcast Episode
Great Expectations
Podcast Episode
Airbyte
Podcast Episode
Grouparoo
Podcast Episode
Terraform
OpenTelemetry
Podcast.__init__ Episode
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
6/6/2022 • 59 minutes, 1 second
A Multipurpose Database For Transactions And Analytics To Simplify Your Data Architecture With Singlestore
Summary
A large fraction of data engineering work involves moving data from one storage location to another in order to support different access and query patterns. SingleStore aims to cut down on the number of database engines that you need to run so that you can reduce the amount of copying that is required. By supporting fast, in-memory row-based queries and columnar on-disk representation, it lets your transactional and analytical workloads run in the same database. In this episode, SVP of engineering Shireesh Thota describes the impact on your overall system architecture that SingleStore can have and the benefits of using a cloud-native database engine for your next application.
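Because SingleStore speaks the MySQL wire protocol, the same connection can serve both the point lookups of a transactional workload and the scans and aggregations of an analytical one. The sketch below uses the pymysql client to illustrate that idea; the table, queries, and host are invented for the example, and the rowstore DDL keyword is an assumption based on SingleStore's documented rowstore/columnstore table types, so check the current documentation before relying on it.

```python
import pymysql  # SingleStore is MySQL wire-protocol compatible, so a stock client works

conn = pymysql.connect(
    host="svc-your-workspace.singlestore.com",  # placeholder host
    user="admin",
    password="...",
    database="app",
)

with conn.cursor() as cur:
    # Assumed DDL: rowstore for low-latency transactional access; columnstore
    # (the default table type in recent versions) would serve wide analytical scans.
    cur.execute("""
        CREATE ROWSTORE TABLE IF NOT EXISTS orders_live (
            order_id BIGINT PRIMARY KEY,
            customer_id BIGINT,
            amount DECIMAL(10, 2),
            created_at DATETIME
        )
    """)

    # Transactional-style point lookup...
    cur.execute("SELECT amount FROM orders_live WHERE order_id = %s", (42,))
    print(cur.fetchone())

    # ...and an analytical aggregation against the same engine, with no copy step.
    cur.execute("""
        SELECT customer_id, SUM(amount) AS lifetime_value
        FROM orders_live
        GROUP BY customer_id
        ORDER BY lifetime_value DESC
        LIMIT 10
    """)
    for customer_id, lifetime_value in cur.fetchall():
        print(customer_id, lifetime_value)

conn.close()
```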
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
Atlan is a collaborative workspace for data-driven teams, like Github for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $3000 on an annual subscription
So now your modern data stack is set up. How is everyone going to find the data they need, and understand it? Select Star is a data discovery platform that automatically analyzes & documents your data. For every table in Select Star, you can find out where the data originated, which dashboards are built on top of it, who’s using it in the company, and how they’re using it, all the way down to the SQL queries. Best of all, it’s simple to set up, and easy for both engineering and operations teams to use. With Select Star’s data catalog, a single source of truth for your data is built in minutes, even across thousands of datasets. Try it out for free and double the length of your free trial today at dataengineeringpodcast.com/selectstar. You’ll also get a swag package when you continue on a paid plan.
Data teams are increasingly under pressure to deliver. According to a recent survey by Ascend.io, 95% in fact reported being at or over capacity. With 72% of data experts reporting demands on their team going up faster than they can hire, it’s no surprise they are increasingly turning to automation. In fact, while only 3.5% report having current investments in automation, 85% of data teams plan on investing in automation in the next 12 months. 85%!!! That’s where our friends at Ascend.io come in. The Ascend Data Automation Cloud provides a unified platform for data ingestion, transformation, orchestration, and observability. Ascend users love its declarative pipelines, powerful SDK, elegant UI, and extensible plug-in architecture, as well as its support for Python, SQL, Scala, and Java. Ascend automates workloads on Snowflake, Databricks, BigQuery, and open source Spark, and can be deployed in AWS, Azure, or GCP. Go to dataengineeringpodcast.com/ascend and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $5,000 when you become a customer.
Your host is Tobias Macey and today I’m interviewing Shireesh Thota about SingleStore (formerly MemSQL), the industry’s first modern relational database for multi-cloud, hybrid, and on-premises workloads
Interview
Introduction
How did you get involved in the area of data management?
Can you describe what SingleStore is and the story behind it?
The database market has gotten very crowded, with different areas of specialization and nuance being the differentiating factors. What are the core sets of workloads that SingleStore is aimed at addressing?
What are some of the capabilities that it offers to reduce the need to incorporate multiple data stores for application and analytical architectures?
What are some of the most valuable lessons that you learned in your time at Microsoft that are applicable to SingleStore’s product focus and direction?
Nikita Shamgunov joined the show in October of 2018 to talk about what was then MemSQL. What are the notable changes in the engine and business that have occurred in the intervening time?
What are the macroscopic trends in data management and application development that are having the most impact on product direction?
For engineering teams that are already invested in, or considering adoption of, the "modern data stack" paradigm, where does SingleStore fit in that architecture?
What are the services or tools that might be replaced by an installation of SingleStore?
What are the efficiencies or new capabilities that an engineering team might expect by adopting SingleStore?
What are some of the features that are underappreciated/overlooked which you would like to call attention to?
What are the most interesting, innovative, or unexpected ways that you have seen SingleStore used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on SingleStore?
When is SingleStore the wrong choice?
What do you have planned for the future of SingleStore?
Contact Info
LinkedIn
@ShireeshThota on Twitter
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Links
MemSQL Interview With Nikita Shamgunov
Singlestore
MS SQL Server
Azure Cosmos DB
CitusDB
Podcast Episode
Debezium
Podcast Episode
PostgreSQL
Podcast Episode
MySQL
HTAP == Hybrid Transactional-Analytical Processing
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
5/30/2022 • 41 minutes, 21 seconds
Data Cloud Cost Optimization With Bluesky Data
Summary
The latest generation of data warehouse platforms has brought unprecedented operational simplicity and effectively infinite scale. Along with those benefits, they have also introduced a new consumption model that can lead to incredibly expensive bills at the end of the month. In order to ensure that you can explore and analyze your data without spending money on inefficient queries, Mingsheng Hong and Zheng Shao created Bluesky Data. In this episode they explain how their platform optimizes your Snowflake warehouses to reduce cost, as well as identifying improvements that you can make in your queries to reduce their contribution to your bill.
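A useful first step, with or without a dedicated optimization product, is simply to rank recent queries by how much warehouse time they consumed. The sketch below does that against Snowflake's ACCOUNT_USAGE.QUERY_HISTORY view using the official Python connector; the connection parameters are placeholders, elapsed time is only a rough proxy for credit spend, and this is an illustration of the general approach rather than how Bluesky itself works.

```python
import snowflake.connector  # pip install snowflake-connector-python

conn = snowflake.connector.connect(
    account="your_account",   # placeholders: fill in your own credentials
    user="your_user",
    password="...",
    warehouse="ANALYTICS_WH",
)

# Rank last week's queries by elapsed time as a rough proxy for warehouse spend.
TOP_QUERIES = """
    SELECT warehouse_name,
           total_elapsed_time / 1000 AS elapsed_seconds,
           query_text
    FROM snowflake.account_usage.query_history
    WHERE start_time >= DATEADD('day', -7, CURRENT_TIMESTAMP())
      AND warehouse_name IS NOT NULL
    ORDER BY total_elapsed_time DESC
    LIMIT 20
"""

cur = conn.cursor()
try:
    cur.execute(TOP_QUERIES)
    for warehouse, seconds, query_text in cur:
        print(f"{warehouse}: {seconds:,.1f}s  {query_text[:80]}")
finally:
    cur.close()
    conn.close()
```

The queries that float to the top of a report like this are the natural candidates for rewriting, rescheduling, or moving to a smaller warehouse.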
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
This episode is brought to you by Acryl Data, the company behind DataHub, the leading developer-friendly data catalog for the modern data stack. Open Source DataHub is running in production at several companies like Peloton, Optum, Udemy, Zynga and others. Acryl Data provides DataHub as an easy to consume SaaS product which has been adopted by several companies. Signup for the SaaS product at dataengineeringpodcast.com/acryl
RudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their state-of-the-art reverse ETL pipelines enable you to send enriched data to any cloud tool. Sign up free… or just get the free t-shirt for being a listener of the Data Engineering Podcast at dataengineeringpodcast.com/rudder.
The most important piece of any data project is the data itself, which is why it is critical that your data source is high quality. PostHog is your all-in-one product analytics suite including product analysis, user funnels, feature flags, experimentation, and it’s open source so you can host it yourself or let them do it for you! You have full control over your data and their plugin system lets you integrate with all of your other data tools, including data warehouses and SaaS platforms. Give it a try today with their generous free tier at dataengineeringpodcast.com/posthog
Your host is Tobias Macey and today I’m interviewing Mingsheng Hong and Zheng Shao about Bluesky Data where they are combining domain expertise and machine learning to optimize your cloud warehouse usage and reduce operational costs
Interview
Introduction
How did you get involved in the area of data management?
Can you describe what Bluesky is and the story behind it?
What are the platforms/technologies that you are focused on in your current early stage?
What are some of the other targets that you are considering once you validate your initial hypothesis?
Cloud cost optimization is an active area for application infrastructures as well. What are the corollaries and differences between compute and storage optimization strategies and what you are doing at Bluesky?
How have your experiences at hyperscale companies using various combinations of cloud and on-premise data platforms informed your approach to the cost management problem faced by adopters of cloud data systems?
What are the most significant drivers of cost in cloud data systems?
What are the factors (e.g. pricing models, organizational usage, inefficiencies) that lead to such inflated costs?
What are the signals that you collect for identifying targets for optimization and tuning?
Can you describe how the Bluesky mission control platform is architected?
What are the current areas of uncertainty or active research that you are focused on?
What is the workflow for a team or organization that is adding Bluesky to their system?
How does the usage of Bluesky change as teams move from the initial optimization and dramatic cost reduction into a steady state?
What are the most interesting, innovative, or unexpected ways that you have seen teams approaching cost management in the absence of Bluesky?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on Bluesky?
When is Bluesky the wrong choice?
What do you have planned for the future of Bluesky?
Contact Info
Mingsheng
LinkedIn
@mingshenghong on Twitter
Zheng
LinkedIn
@zshao9 on Twitter
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show, then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Links
Bluesky Data
Get A Free Health Check For Your Snowflake From Bluesky
RocksDB
Snowflake
Podcast Episode
Trino
Podcast Episode
Firebolt
Podcast Episode
Bigquery
Hive
Vertica
Michael Stonebraker
Teradata
C-Store Paper
Ottertune
Podcast Episode
dbt
Podcast Episode
infracost
Subtract: The Untapped Science of Less by Leidy Klotz
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
5/30/2022 • 1 hour, 3 minutes, 24 seconds
Unlocking The Value Of Data Across The Organization Through User Friendly Data Tools With Prophecy
Summary
The interfaces and design cues that a tool offers can have a massive impact on who is able to use it and the tasks that they are able to perform. With an eye to making data workflows more accessible to everyone in an organization Raj Bains and his team at Prophecy designed a powerful and extensible low-code platform that lets technical and non-technical users scale data flows without forcing everyone into the same layers of abstraction. In this episode he explores the tension between code-first and no-code utilities and how he is working to balance the strengths without falling prey to their shortcomings.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
Atlan is a collaborative workspace for data-driven teams, like Github for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $3000 on an annual subscription
So now your modern data stack is set up. How is everyone going to find the data they need, and understand it? Select Star is a data discovery platform that automatically analyzes & documents your data. For every table in Select Star, you can find out where the data originated, which dashboards are built on top of it, who’s using it in the company, and how they’re using it, all the way down to the SQL queries. Best of all, it’s simple to set up, and easy for both engineering and operations teams to use. With Select Star’s data catalog, a single source of truth for your data is built in minutes, even across thousands of datasets. Try it out for free and double the length of your free trial today at dataengineeringpodcast.com/selectstar. You’ll also get a swag package when you continue on a paid plan.
Your host is Tobias Macey and today I’m interviewing Raj Bains about how improving the user experience for data tools can make your work as a data engineer better and easier
Interview
Introduction
How did you get involved in the area of data management?
What are the broad categories of data tool designs that are available currently and how does that impact what is possible with them?
What are the points of friction that are introduced by the tools?
Can you share some of the types of workarounds or wasted effort that are made necessary by those design elements?
What are the core design principles that you have built into Prophecy to address these shortcomings?
How do those user experience changes improve the quality and speed of work for data engineers?
How has the Prophecy platform changed since we last spoke almost a year ago?
What are the tradeoffs of low code systems for productivity vs. flexibility and creativity?
What are the most interesting, innovative, or unexpected approaches to developer experience that you have seen for data tools?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on user experience optimization for data tooling at Prophecy?
When is it more important to optimize for computational efficiency over developer productivity?
What do you have planned for the future of Prophecy?
Contact Info
LinkedIn
@_raj_bains on Twitter
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show, then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Links
Prophecy
Podcast Episode
CUDA
Clustrix
Hortonworks
Apache Hive
Compilerworks
Podcast Episode
Airflow
Databricks
Fivetran
Podcast Episode
Airbyte
Podcast Episode
Streamsets
Change Data Capture
Apache Pig
Spark
Scala
Ab Initio
Type 2 Slowly Changing Dimensions
AWS Deequ
Matillion
Podcast Episode
Prophecy SaaS
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
5/23/2022 • 1 hour, 10 minutes, 56 seconds
Cloud Native Data Orchestration For Machine Learning And Data Engineering With Flyte
Summary
Machine learning has become a meaningful target for data applications, bringing with it an increase in the complexity of orchestrating the entire data flow. Flyte is a project that was started at Lyft to address their internal needs for machine learning, and it integrates closely with Kubernetes as the execution manager. In this episode Ketan Umare and Haytham Abuelfutuh share the story of the Flyte project and how their work at Union is focused on supporting and scaling the code and community that has made Flyte successful.
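For listeners who have not seen Flyte code, a minimal sketch of its core primitives using the flytekit SDK looks roughly like this; the task bodies are illustrative placeholders rather than anything from the episode.

    from typing import List

    from flytekit import task, workflow

    @task
    def extract() -> List[int]:
        # Stand-in for pulling a batch of records from a source system.
        return [1, 2, 3, 4]

    @task
    def average(values: List[int]) -> float:
        # Flyte tasks are strongly typed, so inputs and outputs are annotated.
        return sum(values) / len(values)

    @workflow
    def pipeline() -> float:
        # Workflows compose tasks into a DAG that Flyte schedules on Kubernetes.
        return average(values=extract())

    if __name__ == "__main__":
        # Workflows can also run locally for development and testing.
        print(pipeline())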
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
This episode is brought to you by Acryl Data, the company behind DataHub, the leading developer-friendly data catalog for the modern data stack. Open Source DataHub is running in production at several companies like Peloton, Optum, Udemy, Zynga and others. Acryl Data provides DataHub as an easy-to-consume SaaS product, which has been adopted by several companies. Sign up for the SaaS product at dataengineeringpodcast.com/acryl
RudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their state-of-the-art reverse ETL pipelines enable you to send enriched data to any cloud tool. Sign up free… or just get the free t-shirt for being a listener of the Data Engineering Podcast at dataengineeringpodcast.com/rudder.
Data lake architectures provide the best combination of massive scalability and cost reduction, but they aren’t always the most performant option. That’s why Kyligence has built on top of the leading open source OLAP engine for data lakes, Apache Kylin. With their AI augmented engine they detect patterns from your critical queries, automatically build data marts with optimized table structures, and provide a unified SQL interface across your lake, cubes, and indexes. Their cost-based query router will give you interactive speeds across petabyte scale data sets for BI dashboards and ad-hoc data exploration. Stop struggling to speed up your data lake. Get started with Kyligence today at dataengineeringpodcast.com/kyligence
Your host is Tobias Macey and today I’m interviewing Ketan Umare and Haytham Abuelfutuh about Flyte, the open source and kubernetes-native orchestration engine for your data systems
Interview
Introduction
How did you get involved in the area of data management?
Can you describe what Flyte is and the story behind it?
What was missing in the ecosystem of available tools that made it necessary/worthwhile to create Flyte?
Workflow orchestrators have been around for several years and have gone through a number of generational shifts. How would you characterize Flyte’s position in the ecosystem?
What do you see as the closest alternatives?
What are the core differentiators that might lead someone to choose Flyte over e.g. Airflow/Prefect/Dagster?
What are the core primitives that Flyte exposes for building up complex workflows?
Machine learning use cases have been a core focus since the project’s inception. What are some of the ways that that manifests in the design and feature set?
Can you describe the architecture of Flyte?
How have the design and goals of the platform changed/evolved since you first started working on it?
What are the changes in the data ecosystem that have had the most substantial impact on the Flyte project? (e.g. roadmap, integrations, pushing people toward adoption, etc.)
What is the process for setting up a Flyte deployment?
What are the user personas that you prioritize in the design and feature development for Flyte?
What is the workflow for someone building a new pipeline in Flyte?
What are the patterns that you and the community have established to encourage discovery and reuse of granular task definitions?
Beyond code reuse, how can teams scale usage of Flyte at the company/organization level?
What are the affordances that you have created to facilitate local development and testing of workflows while ensuring a smooth transition to production?
What are the patterns that are available for CI/CD of workflows using Flyte?
How have you approached the design of data contracts/type definitions to provide a consistent/portable API for defining inter-task dependencies across languages?
What are the available interfaces for extending Flyte and building integrations with other components across the data ecosystem?
Data orchestration engines are a natural point for generating and taking advantage of rich metadata. How do you manage creation and propagation of metadata within and across the framework boundaries?
Last year you founded Union to offer a managed version of Flyte. What are the features that you are offering beyond what is available in the open source?
What are the opportunities that you see for the Flyte ecosystem with a corporate entity to invest in expanding adoption?
What are the most interesting, innovative, or unexpected ways that you have seen Flyte used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on Flyte?
When is Flyte the wrong choice?
What do you have planned for the future of Flyte?
Contact Info
Ketan Umare
Haytham Abuelfutuh
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show, then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Links
Flyte
Slack Channel
Union.ai
Kubeflow
Airflow
AWS Step Functions
Protocol Buffers
XGBoost
MLFlow
Dagster
Podcast Episode
Prefect
Podcast Episode
Arrow
Parquet
Metaflow
Pytorch
Podcast.__init__ Episode
dbt
FastAPI
Podcast.__init__ Interview
Python Type Annotations
Modin
Podcast.__init__ Interview
Monad
Datahub
Podcast Episode
OpenMetadata
Podcast Episode
Hudi
Podcast Episode
Iceberg
Podcast Episode
Great Expectations
Podcast Episode
Pandera
Union ML
Weights and Biases
Whylogs
Podcast Episode
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
5/23/2022 • 1 hour, 7 minutes, 7 seconds
Insights And Advice On Building A Data Lake Platform From Someone Who Learned The Hard Way
Summary
Designing a data platform is a complex and iterative undertaking which requires accounting for many conflicting needs. Designing a platform that relies on a data lake as its central architectural tenet adds additional layers of difficulty. Srivatsan Sridharan has had the opportunity to design, build, and run data lake platforms for both Yelp and Robinhood, with many valuable lessons learned from each experience. In this episode he shares his insights and advice on how to approach such an undertaking in your own organization.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
This episode is brought to you by Acryl Data, the company behind DataHub, the leading developer-friendly data catalog for the modern data stack. Open Source DataHub is running in production at several companies like Peloton, Optum, Udemy, Zynga and others. Acryl Data provides DataHub as an easy-to-consume SaaS product, which has been adopted by several companies. Sign up for the SaaS product at dataengineeringpodcast.com/acryl
RudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their state-of-the-art reverse ETL pipelines enable you to send enriched data to any cloud tool. Sign up free… or just get the free t-shirt for being a listener of the Data Engineering Podcast at dataengineeringpodcast.com/rudder.
Struggling with broken pipelines? Stale dashboards? Missing data? If this resonates with you, you’re not alone. Data engineers struggling with unreliable data need look no further than Monte Carlo, the leading end-to-end Data Observability Platform! Trusted by the data teams at Fox, JetBlue, and PagerDuty, Monte Carlo solves the costly problem of broken data pipelines. Monte Carlo monitors and alerts for data issues across your data warehouses, data lakes, dbt models, Airflow jobs, and business intelligence tools, reducing time to detection and resolution from weeks to just minutes. Monte Carlo also gives you a holistic picture of data health with automatic, end-to-end lineage from ingestion to the BI layer directly out of the box. Start trusting your data with Monte Carlo today! Visit dataengineeringpodcast.com/montecarlo to learn more.
Your host is Tobias Macey and today I’m interviewing Srivatsan Sridharan about the technological, staffing, and design considerations for building a data platform
Interview
Introduction
How did you get involved in the area of data management?
Can you describe what your experience has been with designing and implementing data platforms?
What are the elements that you have found to be common requirements across organizations and data characteristics?
What are the architectural elements that require the most detailed consideration based on organizational needs and data requirements?
How has the ecosystem for building maintainable and usable data lakes matured over the past few years?
What are the elements that are still cumbersome or intractable?
The streaming ecosystem has also gone through substantial changes over the past few years. What is your synopsis of the meaningful differences between today’s options and where we were ~6 years ago?
How did your experiences at Yelp inform your current architectural approach at Robinhood?
Can you describe your current platform architecture?
What are the primary capabilities that you are optimizing for?
What is your evaluation process for determining what components to use in your platform?
How do you approach the build vs. buy problem and quantify the tradeoffs?
What are the most interesting, innovative, or unexpected ways that you have seen your data systems used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on designing and implementing data platforms across your career?
When is a data lake architecture the wrong choice?
What do you have planned for the future of the data platform at Robinhood?
Contact Info
LinkedIn
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show, then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Links
Robinhood
Yelp
Kafka
Spark
Flink
Podcast Episode
Pulsar
Podcast Episode
Parquet
Change Data Capture
Delta Lake
Podcast Episode
Hudi
Podcast Episode
Redshift
BigQuery
Informatica
Data Mesh
Podcast Episode
PrestoDB
Trino
Airbyte
Podcast Episode
Meltano
Podcast Episode
Fivetran
Podcast Episode
Stitch
Pinot
Podcast Episode
Clickhouse
Podcast Episode
Druid
Iceberg
Podcast Episode
Looker
Podcast Episode
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
5/16/2022 • 58 minutes, 10 seconds
Designing And Deploying IoT Analytics For Industrial Applications At Vopak
Summary
Industrial applications are one of the primary adopters of Internet of Things (IoT) technologies, with business-critical operations being informed by data collected across a fleet of sensors. Vopak is a business that manages storage and distribution of a variety of liquids that are critical to the modern world, and they have recently launched a new platform to gain more utility from their industrial sensors. In this episode Mário Pereira shares the system design that he and his team have developed for managing the collection and analysis of sensor data, and how they have split the data processing and business logic responsibilities between physical terminals and edge locations on one side, and centralized storage and compute on the other.
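One technique that appears in the episode links for thinning high-frequency sensor readings at the edge is swinging door compression. The following is a simplified sketch, assuming timestamped numeric readings with strictly increasing timestamps and a fixed error tolerance; it is illustrative rather than a description of Vopak's implementation.

    def swinging_door(points, deviation):
        """Simplified swinging-door compression.

        points: list of (timestamp, value) pairs with strictly increasing timestamps.
        deviation: maximum allowed reconstruction error for dropped points.
        Returns the subset of points worth archiving.
        """
        if len(points) < 2:
            return list(points)

        archived = [points[0]]
        anchor_t, anchor_v = points[0]
        min_upper = float("inf")
        max_lower = float("-inf")
        prev = points[0]

        for t, v in points[1:]:
            dt = t - anchor_t
            # Narrow the "doors" from the last archived point to the tolerance
            # band around the current reading.
            min_upper = min(min_upper, (v + deviation - anchor_v) / dt)
            max_lower = max(max_lower, (v - deviation - anchor_v) / dt)
            if max_lower > min_upper:
                # Doors have closed: archive the previous point and restart from it.
                archived.append(prev)
                anchor_t, anchor_v = prev
                dt = t - anchor_t
                min_upper = (v + deviation - anchor_v) / dt
                max_lower = (v - deviation - anchor_v) / dt
            prev = (t, v)

        archived.append(prev)
        return archived

With the deviation set near a sensor's noise floor, long stretches of slowly changing readings collapse to their endpoints while genuine excursions are preserved, which is what makes this family of techniques attractive before shipping data off the edge.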
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
Atlan is a collaborative workspace for data-driven teams, like Github for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $3000 on an annual subscription
So now your modern data stack is set up. How is everyone going to find the data they need, and understand it? Select Star is a data discovery platform that automatically analyzes & documents your data. For every table in Select Star, you can find out where the data originated, which dashboards are built on top of it, who’s using it in the company, and how they’re using it, all the way down to the SQL queries. Best of all, it’s simple to set up, and easy for both engineering and operations teams to use. With Select Star’s data catalog, a single source of truth for your data is built in minutes, even across thousands of datasets. Try it out for free and double the length of your free trial today at dataengineeringpodcast.com/selectstar. You’ll also get a swag package when you continue on a paid plan.
Your host is Tobias Macey and today I’m interviewing Mário Pereira about building a data management system for globally distributed IoT sensors at Vopak
Interview
Introduction
How did you get involved in the area of data management?
Can you describe what Vopak is and what kinds of information you rely on to power the business?
What kinds of sensors and edge devices are you using?
What kinds of consistency or variance do you have between sensors across your locations?
How much computing power and storage space do you place at the edge?
What level of pre-processing/filtering is being done at the edge and how do you decide what information needs to be centralized?
What are some examples of decision-making that happens at the edge?
Can you describe the platform architecture that you have built for collecting and processing sensor data?
What was your process for selecting and evaluating the various components?
How much tolerance do you have for missed messages/dropped data?
How long are your data retention periods and what are the factors that influence that policy?
What are some statistics related to the volume, variety, and velocity of your data?
What are the end-to-end latency requirements for different segments of your data?
What kinds of analysis are you performing on the collected data?
What are some of the potential ramifications of failures in your system? (e.g. spills, explosions, spoilage, contamination, revenue loss, etc.)
What are some of the scaling issues that you have experienced as you brought your system online?
How have you been managing the decision making prior to implementing these technology solutions?
What are the new capabilities and business processes that are enabled by this new platform?
What are the most interesting, innovative, or unexpected ways that you have seen your data capabilities applied?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on building an IoT collection and aggregation platform at global scale?
What do you have planned for the future of your IoT system?
Contact Info
LinkedIn
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show, then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Links
Vopak
Swinging Door Compression Algorithm
IoT Greengrass
OPCUA IoT protocol
MongoDB
AWS Kinesis
AWS Batch
AWS IoT Sitewise Edge
Boston Dynamics
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
5/16/2022 • 47 minutes, 54 seconds
Exploring The Insights And Impact Of Dan Delorey's Distinguished Career In Data
Summary
Dan Delorey helped to build the core technologies of Google’s cloud data services for many years before embarking on his latest adventure as the VP of Data at SoFi. From being an early engineer on the Dremel project, to helping launch and manage BigQuery, on to helping enterprises adopt Google’s data products, he learned all of the critical details of how to run services used by data platform teams. Now he is the consumer of many of the tools that his work inspired. In this episode he takes a trip down memory lane to weave an interesting and informative narrative about the broader themes throughout his work and their echoes in the modern data ecosystem.
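The interview digs into "querying data in place" (federated queries). From the consumer side, a minimal sketch with the google-cloud-bigquery client might look like the following; the project, dataset, and external table names are hypothetical, and the table is assumed to be defined over files sitting in Cloud Storage so the query scans the data where it lives.

    from google.cloud import bigquery

    # Uses application default credentials; the project name is a placeholder.
    client = bigquery.Client(project="my-project")

    query = """
        SELECT event_type, COUNT(*) AS events
        FROM `my-project.analytics.raw_events_external`
        GROUP BY event_type
        ORDER BY events DESC
    """

    # The external table is scanned in place rather than being loaded first.
    for row in client.query(query).result():
        print(row["event_type"], row["events"])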
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
Atlan is a collaborative workspace for data-driven teams, like Github for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $3000 on an annual subscription
So now your modern data stack is set up. How is everyone going to find the data they need, and understand it? Select Star is a data discovery platform that automatically analyzes & documents your data. For every table in Select Star, you can find out where the data originated, which dashboards are built on top of it, who’s using it in the company, and how they’re using it, all the way down to the SQL queries. Best of all, it’s simple to set up, and easy for both engineering and operations teams to use. With Select Star’s data catalog, a single source of truth for your data is built in minutes, even across thousands of datasets. Try it out for free and double the length of your free trial today at dataengineeringpodcast.com/selectstar. You’ll also get a swag package when you continue on a paid plan.
Your host is Tobias Macey and today I’m interviewing Dan Delorey about his journey through the data ecosystem as the current head of data at SoFi, prior engineering leader with the BigQuery team, and early engineer on Dremel
Interview
Introduction
How did you get involved in the area of data management?
Can you start by sharing what your current relationship to the data ecosystem is and the cliffs-notes version of how you ended up there?
Dremel was a ground-breaking technology at the time. What do you see as its lasting impression on the landscape of data both in and outside of Google?
You were instrumental in crafting the vision behind "querying data in place" (what they called federated data) at Dremel and BigQuery. What do you mean by this? How has this approach evolved? What are some challenges with this approach?
How well did the Drill project capture the core principles of Dremel as outlined in the eponymous white paper?
Following your work on Drill you were involved with the development and growth of BigQuery and the broader suite of Google Cloud’s data platform. What do you see as the influence that those tools had on the evolution of the broader data ecosystem?
How have your experiences at Google influenced your approach to platform and organizational design at SoFi?
What’s in SoFi’s data stack? How do you decide what technologies to buy vs. build in-house?
How does your team solve for data quality and governance?
What are the dominating factors that you consider when deciding on project/product priorities for your team?
When you’re not building industry-defining data tooling or leading data strategy, you spend time thinking about the ethics of data. Can you elaborate a bit about your research and interest there?
You also have some ideas about data marketplaces, which is a hot topic these days with companies like Snowflake and Databricks breaking into this economy. What’s your take on the evolution of this space?
What are the most interesting, innovative, or unexpected data systems that you have encountered?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on building and supporting data systems?
What are the areas that you are paying the most attention to?
What interesting predictions do you have for the future of data systems and their applications?
Contact Info
LinkedIn
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show, then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Links
SoFi
Bigquery
Dremel
Brigham Young University
Empirical Software Engineering
Map/Reduce
Hadoop
Sawzall
VLDB Test Of Time Award Paper
GFS
Colossus
Partitioned Hash Join
Google BigTable
HBase
AWS Athena
Snowflake
Podcast Episode
Data Vault
Star Schema
Privacy Vault
Homomorphic Encryption
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
5/9/2022 • 1 hour, 51 seconds
Scaling Analysis of Connected Data And Modeling Complex Relationships With The TigerGraph Graph Database
Summary
Many of the events, ideas, and objects that we try to represent through data have a high degree of connectivity in the real world. These connections are best represented and analyzed as graphs to provide efficient and accurate analysis of their relationships. TigerGraph is a leading database that offers a highly scalable and performant native graph engine for powering graph analytics and machine learning. In this episode Jon Herke shares how TigerGraph customers are taking advantage of those capabilities to achieve meaningful discoveries in their fields, the utilities that it provides for modeling and managing your connected data, and some of his own experiences working with the platform before joining the company.
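For a sense of what working with TigerGraph from Python can look like, here is a rough sketch assuming the pyTigerGraph client and a GSQL query named recommend_products that has already been written and installed on the graph; the host, graph name, credentials, and query are all hypothetical.

    import pyTigerGraph as tg

    # Placeholder connection details for a running TigerGraph instance.
    conn = tg.TigerGraphConnection(
        host="https://my-instance.i.tgcloud.io",
        graphname="Ecommerce",
        username="tigergraph",
        password="password",
    )

    # Run a previously installed GSQL query; results come back as JSON-like
    # Python structures that can feed analytics or ML features downstream.
    results = conn.runInstalledQuery(
        "recommend_products",
        params={"customer_id": "C1001", "top_k": 5},
    )
    print(results)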
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
This episode is brought to you by Acryl Data, the company behind DataHub, the leading developer-friendly data catalog for the modern data stack. Open Source DataHub is running in production at several companies like Peloton, Optum, Udemy, Zynga and others. Acryl Data provides DataHub as an easy-to-consume SaaS product, which has been adopted by several companies. Sign up for the SaaS product at dataengineeringpodcast.com/acryl
RudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their state-of-the-art reverse ETL pipelines enable you to send enriched data to any cloud tool. Sign up free… or just get the free t-shirt for being a listener of the Data Engineering Podcast at dataengineeringpodcast.com/rudder.
Struggling with broken pipelines? Stale dashboards? Missing data? If this resonates with you, you’re not alone. Data engineers struggling with unreliable data need look no further than Monte Carlo, the leading end-to-end Data Observability Platform! Trusted by the data teams at Fox, JetBlue, and PagerDuty, Monte Carlo solves the costly problem of broken data pipelines. Monte Carlo monitors and alerts for data issues across your data warehouses, data lakes, dbt models, Airflow jobs, and business intelligence tools, reducing time to detection and resolution from weeks to just minutes. Monte Carlo also gives you a holistic picture of data health with automatic, end-to-end lineage from ingestion to the BI layer directly out of the box. Start trusting your data with Monte Carlo today! Visit dataengineeringpodcast.com/montecarlo to learn more.
Your host is Tobias Macey and today I’m interviewing Jon Herke about TigerGraph, a distributed native graph database
Interview
Introduction
How did you get involved in the area of data management?
Can you describe what TigerGraph is and the story behind it?
What are some of the core use cases that you are focused on supporting?
How has TigerGraph changed over the past 4 years since I spoke with Todd Blaschka at the Open Data Science Conference?
How has the ecosystem of graph databases changed in usage and design in recent years?
What are some of the persistent areas of confusion or misinformation that you encounter when explaining graph databases and TigerGraph to potential users?
The tagline on your website says that TigerGraph is "The Only Scalable Graph Database for the Enterprise". Can you unpack that claim and explain what is necessary for a graph database to be suitable for enterprise use?
What are some of the typical application and system architectures that you typically see for end-users of TigerGraph? (e.g. polyglot persistence, etc.)
What are the cases where TigerGraph should be the system of record as opposed to an optimization option for addressing highly connected data?
What are the data modeling considerations that end-users should be thinking of when planning their storage structures in TigerGraph?
What are the most interesting, innovative, or unexpected ways that you have seen TigerGraph used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on TigerGraph?
When is TigerGraph the wrong choice?
What do you have planned for the future of TigerGraph?
Contact Info
LinkedIn
@jonherke on Twitter
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show, then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Links
TigerGraph
GraphQL
Kafka
GQL (Graph Query Language)
LDBC (Linked Data Benchmark Council)
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
5/9/2022 • 39 minutes, 55 seconds
Leading The Charge For The ELT Data Integration Pattern For Cloud Data Warehouses At Matillion
Summary
The predominant pattern for data integration in the cloud has become extract, load, and then transform, or ELT. Matillion was an early innovator of that approach, and in this episode CTO Ed Thompson explains how they have evolved the platform to keep pace with the rapidly changing ecosystem. He describes how the platform is architected, the challenges related to selling cloud technologies into enterprise organizations, and how you can adopt Matillion for your own workflows to reduce the maintenance burden of data integration.
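To make the pattern concrete, here is an illustrative sketch of ELT in general (not of how Matillion is implemented): land the raw records in the warehouse first, then let the warehouse engine do the transformation. It assumes a DB-API compatible warehouse connection, hypothetical table names, and a driver that uses %s placeholders.

    def run_elt(warehouse_conn, raw_rows):
        """Load raw data first, then transform it inside the warehouse."""
        cur = warehouse_conn.cursor()

        # Extract + Load: write source records into a staging table untouched.
        cur.executemany(
            "INSERT INTO staging_orders (order_id, amount, created_at) VALUES (%s, %s, %s)",
            raw_rows,
        )

        # Transform: push the heavy lifting down to the warehouse engine.
        cur.execute(
            """
            CREATE OR REPLACE TABLE daily_revenue AS
            SELECT DATE(created_at) AS order_date, SUM(amount) AS revenue
            FROM staging_orders
            GROUP BY 1
            """
        )
        warehouse_conn.commit()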
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
Atlan is a collaborative workspace for data-driven teams, like Github for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $3000 on an annual subscription
Modern data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days or even weeks. By the time errors have made their way into production, it’s often too late and damage is done. Datafold built automated regression testing to help data and analytics engineers deal with data quality in their pull requests. Datafold shows how a change in SQL code affects your data, both on a statistical level and down to individual rows and values before it gets merged to production. No more shipping and praying, you can now know exactly what will change in your database! Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI workflows. Visit dataengineeringpodcast.com/datafold today to book a demo with Datafold.
Struggling with broken pipelines? Stale dashboards? Missing data? If this resonates with you, you’re not alone. Data engineers struggling with unreliable data need look no further than Monte Carlo, the leading end-to-end Data Observability Platform! Trusted by the data teams at Fox, JetBlue, and PagerDuty, Monte Carlo solves the costly problem of broken data pipelines. Monte Carlo monitors and alerts for data issues across your data warehouses, data lakes, dbt models, Airflow jobs, and business intelligence tools, reducing time to detection and resolution from weeks to just minutes. Monte Carlo also gives you a holistic picture of data health with automatic, end-to-end lineage from ingestion to the BI layer directly out of the box. Start trusting your data with Monte Carlo today! Visit dataengineeringpodcast.com/montecarlo to learn more.
Your host is Tobias Macey and today I’m interviewing Ed Thompson about Matillion, a cloud-native data integration platform for accelerating your time to analytics
Interview
Introduction
How did you get involved in the area of data management?
Can you describe what Matillion is and the story behind it?
What are the use cases and user personas that you are focused on supporting?
How does that influence the focus and pace of your feature development and priorities?
How is Matillion architected?
How have the design and goals of the system changed since you started working on it?
The ecosystems of both cloud technologies and data processing have been rapidly growing and evolving, with new patterns and paradigms being introduced. What are the elements of your product focus and messaging that you have had to update and what are the core principles that have stayed the same?
What have been the most challenging integrations to build and support?
What is a typical workflow for integrating Matillion into an organization and building a set of pipelines?
What are some of the patterns that have been useful for managing incidental complexity as usage scales?
What are the most interesting, innovative, or unexpected ways that you have seen Matillion used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on Matillion?
When is Matillion the wrong choice?
What do you have planned for the future of Matillion?
Contact Info
LinkedIn
Matillion Contact
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show, then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Links
Matillion
Twitter
IBM DB2
Cognos
Talend
Redshift
AWS Marketplace
AWS Re:Invent
Azure
GCP == Google Cloud Platform
Informatica
SSIS == SQL Server Integration Services
PCRE == Perl Compatible Regular Expressions
Teradata
Tomcat
Collibra
Alation
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
5/2/2022 • 53 minutes, 19 seconds
Evolving And Scaling The Data Platform at Yotpo
Summary
Building a data platform is an iterative and evolutionary process that requires collaboration with internal stakeholders to ensure that their needs are being met. Yotpo has been on a journey to evolve and scale their data platform to continue serving the needs of their organization as it increases the scale and sophistication of data usage. In this episode Doron Porat and Liran Yogev explain how they arrived at their current architecture, the capabilities that they are optimizing for, and the complex process of identifying and evaluating new components to integrate into their systems. This is an excellent exploration of the decisions and tradeoffs that need to be made while building such a complex system.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
This episode is brought to you by Acryl Data, the company behind DataHub, the leading developer-friendly data catalog for the modern data stack. Open Source DataHub is running in production at several companies like Peloton, Optum, Udemy, Zynga and others. Acryl Data provides DataHub as an easy-to-consume SaaS product, which has been adopted by several companies. Sign up for the SaaS product at dataengineeringpodcast.com/acryl
RudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their state-of-the-art reverse ETL pipelines enable you to send enriched data to any cloud tool. Sign up free… or just get the free t-shirt for being a listener of the Data Engineering Podcast at dataengineeringpodcast.com/rudder.
The most important piece of any data project is the data itself, which is why it is critical that your data source is high quality. PostHog is your all-in-one product analytics suite including product analysis, user funnels, feature flags, experimentation, and it’s open source so you can host it yourself or let them do it for you! You have full control over your data and their plugin system lets you integrate with all of your other data tools, including data warehouses and SaaS platforms. Give it a try today with their generous free tier at dataengineeringpodcast.com/posthog
Your host is Tobias Macey and today I’m interviewing Doron Porat and Liran Yogev about their experiences designing and implementing a self-serve data platform at Yotpo
Interview
Introduction
How did you get involved in the area of data management?
Can you describe what Yotpo is and the role that data plays in the organization?
What are the core data types and sources that you are working with?
What kinds of data assets are being produced and how do those get consumed and re-integrated into the business?
What are the user personas that you are supporting and what are the interfaces that they are comfortable interacting with?
What is the size of your team and how is it structured?
You recently posted about the current architecture of your data platform. What was the starting point on your platform journey?
What did the early stages of feature and platform evolution look like?
What was the catalyst for making a concerted effort to integrate your systems into a cohesive platform?
What was the scope and directive of the project for building a platform?
What are the metrics and capabilities that you are optimizing for in the structure of your data platform?
What are the organizational or regulatory constraints that you needed to account for?
What are some of the early decisions that affected your available choices in later stages of the project?
What does the current state of your architecture look like?
How long did it take to get to where you are today?
What were the factors that you considered in the various build vs. buy decisions?
How did you manage cost modeling to understand the true savings on either side of that decision?
If you were to start from scratch on a new data platform today what might you do differently?
What are the decisions that proved helpful in the later stages of your platform development?
What are the most interesting, innovative, or unexpected ways that you have seen your platform used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on designing and implementing your platform?
What do you have planned for the future of your platform infrastructure?
Contact Info
Doron
LinkedIn
Liran
LinkedIn
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show, then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Links
Yotpo
Data Platform Architecture Blog Post
Greenplum
Databricks
Metorikku
Apache Hive
CDC == Change Data Capture
Debezium
Podcast Episode
Apache Hudi
Podcast Episode
Upsolver
Podcast Episode
Spark
PrestoDB
Snowflake
Podcast Episode
Druid
Rockset
Podcast Episode
dbt
Podcast Episode
Acryl
Podcast Episode
Atlan
Podcast Episode
OpenLineage
Podcast Episode
Okera
Shopify Data Warehouse Episode
Redshift
Delta Lake
Podcast Episode
Iceberg
Podcast Episode
Outbox Pattern
Backstage
Roadie
Nomad
Kubernetes
Deequ
Great Expectations
Podcast Episode
LakeFS
Podcast Episode
2021 Recap Episode
Monte Carlo
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
5/2/2022 • 1 hour, 4 minutes, 10 seconds
Operational Analytics At Speed With Minimal Busy Work Using Incorta
Summary
A huge amount of effort goes into modeling and shaping data to make it available for analytical purposes. This is often due to the need to simplify the final queries so that they are performant for visualization or limited exploration. In order to cut down the level of effort involved in making data usable, Matthew Halliday and his co-founders created Incorta as an end-to-end, in-memory analytical engine that removes barriers to insights on your data. In this episode he explains how the system works, the use cases that it empowers, and how you can start using it for your own analytics today.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
Atlan is a collaborative workspace for data-driven teams, like Github for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $3000 on an annual subscription
Modern data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days or even weeks. By the time errors have made their way into production, it’s often too late and damage is done. Datafold built automated regression testing to help data and analytics engineers deal with data quality in their pull requests. Datafold shows how a change in SQL code affects your data, both on a statistical level and down to individual rows and values before it gets merged to production. No more shipping and praying, you can now know exactly what will change in your database! Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI workflows. Visit dataengineeringpodcast.com/datafold today to book a demo with Datafold.
Struggling with broken pipelines? Stale dashboards? Missing data? If this resonates with you, you’re not alone. Data engineers struggling with unreliable data need look no further than Monte Carlo, the leading end-to-end Data Observability Platform! Trusted by the data teams at Fox, JetBlue, and PagerDuty, Monte Carlo solves the costly problem of broken data pipelines. Monte Carlo monitors and alerts for data issues across your data warehouses, data lakes, dbt models, Airflow jobs, and business intelligence tools, reducing time to detection and resolution from weeks to just minutes. Monte Carlo also gives you a holistic picture of data health with automatic, end-to-end lineage from ingestion to the BI layer directly out of the box. Start trusting your data with Monte Carlo today! Visit dataengineeringpodcast.com/montecarlo to learn more.
Your host is Tobias Macey and today I’m interviewing Matthew Halliday about Incorta, an in-memory, unified data and analytics platform as a service
Interview
Introduction
How did you get involved in the area of data management?
Can you describe what Incorta is and the story behind it?
What are the use cases and customers that you are focused on?
How does that focus inform the design and priorities of functionality in the product?
What are the technologies and workflows that Incorta might replace?
What are the systems and services that it is intended to integrate with and extend?
Can you describe how Incorta is implemented?
What are the core technological decisions that were necessary to make the product successful?
How have the design and goals of the system changed and evolved since you started working on it?
Can you describe the workflow for building an end-to-end analysis using Incorta?
What are some of the new capabilities or use cases that Incorta enables which are impractical or intractable with other combinations of tools in the ecosystem?
How do the features of Incorta influence the approach that teams take for data modeling?
What are the points of collaboration and overlap between organizational roles while using Incorta?
What are the most interesting, innovative, or unexpected ways that you have seen Incorta used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on Incorta?
When is Incorta the wrong choice?
What do you have planned for the future of Incorta?
Contact Info
LinkedIn
@layereddelay on Twitter
Website
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Links
Incorta
3rd Normal Form
Parquet
Podcast Episode
Delta Lake
Podcast Episode
Iceberg
Podcast Episode
PrestoDB
PySpark
Dataiku
Angular
React
Apache ECharts
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
4/24/2022 • 1 hour, 11 minutes, 16 seconds
Gain Visibility Into Your Entire Machine Learning System Using Data Logging With WhyLogs
Summary
There are very few tools which are equally useful for data engineers, data scientists, and machine learning engineers. WhyLogs is a powerful library for flexibly instrumenting all of your data systems to understand the entire lifecycle of your data from source to productionized model. In this episode Andy Dang explains why the project was created, how you can apply it to your existing data systems, and how it provides the detailed context you need to gain insight into all of your data processes.
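To make the idea of data logging more concrete, here is a minimal sketch of profiling a batch of records, assuming the whylogs v1 Python API and pandas; the column names and values are purely illustrative.

```python
import pandas as pd
import whylogs as why

# Illustrative batch of records from a hypothetical pipeline stage
df = pd.DataFrame({
    "user_id": [101, 102, 103],
    "purchase_amount": [19.99, 5.00, 42.50],
})

# Profile the batch; the result captures per-column statistics
# (counts, types, distributions) rather than the raw rows themselves.
results = why.log(df)
profile_view = results.view()

# Inspect the summary as a DataFrame for ad-hoc review
print(profile_view.to_pandas())
```

Profiles like this can be generated at each stage of a pipeline and compared over time to spot drift or schema changes, which is the kind of workflow the episode explores.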
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
This episode is brought to you by Acryl Data, the company behind DataHub, the leading developer-friendly data catalog for the modern data stack. Open Source DataHub is running in production at several companies like Peloton, Optum, Udemy, Zynga and others. Acryl Data provides DataHub as an easy-to-consume SaaS product which has been adopted by several companies. Sign up for the SaaS product at dataengineeringpodcast.com/acryl
RudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their state-of-the-art reverse ETL pipelines enable you to send enriched data to any cloud tool. Sign up free… or just get the free t-shirt for being a listener of the Data Engineering Podcast at dataengineeringpodcast.com/rudder.
The most important piece of any data project is the data itself, which is why it is critical that your data source is high quality. PostHog is your all-in-one product analytics suite including product analysis, user funnels, feature flags, experimentation, and it’s open source so you can host it yourself or let them do it for you! You have full control over your data and their plugin system lets you integrate with all of your other data tools, including data warehouses and SaaS platforms. Give it a try today with their generous free tier at dataengineeringpodcast.com/posthog
Your host is Tobias Macey and today I’m interviewing Andy Dang about powering observability of AI systems with the whylogs data logging library
Interview
Introduction
How did you get involved in the area of data management?
Can you describe what Whylabs is and the story behind it?
How is "data logging" differentiated from logging for the purpose of debugging and observability of software logic?
What are the use cases that you are aiming to support with Whylogs?
How does it compare to libraries and services like Great Expectations/Monte Carlo/Soda Data/Datafold etc.
Can you describe how Whylogs is implemented?
How have the design and goals of the project changed or evolved since you started working on it?
How do you maintain feature parity between the Python and Java integrations?
How do you structure the log events and metadata to provide detail and context for data applications?
How does that structure support aggregation and interpretation/analysis of the log information?
What is the process for integrating Whylogs into an existing project?
Once you have the code instrumented with log events, what is the workflow for using Whylogs to debug and maintain a data application?
What have you found to be useful heuristics for identifying what to log?
What are some of the strategies that teams can use to maintain a balance of signal vs. noise in the events that they are logging?
How is the Whylogs governance set up and how are you approaching sustainability of the open source project?
What are the additional utilities and services that you anticipate layering on top of/integrating with Whylogs?
What are the most interesting, innovative, or unexpected ways that you have seen Whylogs used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on Whylabs?
When is Whylogs/Whylabs the wrong choice?
What do you have planned for the future of Whylabs?
Contact Info
LinkedIn
@andy_dng on Twitter
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Links
Whylogs
Whylabs
Spark
Airflow
Pandas
Podcast Episode
Data Sketches
Grafana
Great Expectations
Podcast Episode
Monte Carlo
Podcast Episode
Soda Data
Podcast Episode
Datafold
Podcast Episode
Delta Lake
Podcast Episode
HyperLogLog
MLFlow
Flyte
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
4/24/2022 • 59 minutes, 3 seconds
Connecting To The Next Frontier Of Computing With Quantum Networks
Summary
The next paradigm shift in computing is coming in the form of quantum technologies. Quantum processors have gained significant attention for their speed and computational power. The next frontier is in quantum networking for highly secure communications and the ability to distribute computation across quantum processing units without costly translation between quantum and classical systems. In this episode Prineha Narang, co-founder and CTO of Aliro, explains how these systems work, the capabilities that they can offer, and how you can start preparing for a post-quantum future for your data systems.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
Atlan is a collaborative workspace for data-driven teams, like Github for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $3000 on an annual subscription
Modern data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days or even weeks. By the time errors have made their way into production, it’s often too late and damage is done. Datafold built automated regression testing to help data and analytics engineers deal with data quality in their pull requests. Datafold shows how a change in SQL code affects your data, both on a statistical level and down to individual rows and values before it gets merged to production. No more shipping and praying, you can now know exactly what will change in your database! Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI workflows. Visit dataengineeringpodcast.com/datafold today to book a demo with Datafold.
Your host is Tobias Macey and today I’m interviewing Dr. Prineha Narang about her work at Aliro building quantum networking technologies and how it impacts the capabilities of data systems
Interview
Introduction
How did you get involved in the area of data management?
Can you describe what Aliro is and the story behind it?
What are the use cases that you are focused on?
What is the impact of quantum networks on distributed systems design? (what limitations does it remove?)
What are the failure modes of quantum networks?
How do they differ from classical networks?
How can network technologies bridge between classical and quantum connections and where do those transitions happen?
What are the latency/bandwidth capacities of quantum networks?
How does it influence the network protocols used during those communications?
How much error correction is necessary during the quantum communication stages of network transfers?
How does quantum computing technology change the landscape for AI technologies?
How does that impact the work of data engineers who are building the systems that power the data feeds for those models?
What are the most interesting, innovative, or unexpected ways that you have seen quantum technologies used for data systems?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on Aliro and your academic research?
When are quantum technologies the wrong choice?
What do you have planned for the future of Aliro and your research efforts?
Contact Info
LinkedIn
Website
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
Aliro Quantum
Harvard University
CalTech
Quantum Computing
Quantum Repeater
ARPANet
Trapped Ion Quantum Computer
Photonic Computing
SDN == Software Defined Networking
QPU == Quantum Processing Unit
IEEE
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
4/18/2022 • 40 minutes, 23 seconds
What Does It Really Mean To Do MLOps And What Is The Data Engineer's Role?
Summary
Putting machine learning models into production and keeping them there requires investing in well-managed systems that cover the full lifecycle of data cleaning, training, deployment, and monitoring, along with a repeatable and evolvable set of processes to keep it all functional. The term MLOps has been coined to encapsulate all of these principles and the broader data community is working to establish a set of best practices and useful guidelines for streamlining adoption. In this episode Demetrios Brinkmann and David Aponte share their perspectives on this rapidly changing space and what they have learned from their work building the MLOps community through blog posts, podcasts, and discussion forums.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
This episode is brought to you by Acryl Data, the company behind DataHub, the leading developer-friendly data catalog for the modern data stack. Open Source DataHub is running in production at several companies like Peloton, Optum, Udemy, Zynga and others. Acryl Data provides DataHub as an easy-to-consume SaaS product which has been adopted by several companies. Sign up for the SaaS product at dataengineeringpodcast.com/acryl
RudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their state-of-the-art reverse ETL pipelines enable you to send enriched data to any cloud tool. Sign up free… or just get the free t-shirt for being a listener of the Data Engineering Podcast at dataengineeringpodcast.com/rudder.
Your host is Tobias Macey and today I’m interviewing Demetrios Brinkmann and David Aponte about what you need to know about MLOps as a data engineer
Interview
Introduction
How did you get involved in the area of data management?
Can you describe what MLOps is?
How does it relate to DataOps? DevOps? (is it just another buzzword?)
What is your interest and involvement in the space of MLOps?
What are the open and active questions in the MLOps community?
Who is responsible for MLOps in an organization?
What is the role of the data engineer in that process?
What are the core capabilities that are necessary to support an "MLOps" workflow?
How do the current platform technologies support the adoption of MLOps workflows?
What are the areas that are currently underdeveloped/underserved?
Can you describe the technical and organizational design/architecture decisions that need to be made when endeavoring to adopt MLOps practices?
What are some of the common requirements for supporting ML workflows?
What are some of the ways that requirements become bespoke to a given organization or project?
What are the opportunities for standardization or consolidation in the tooling for MLOps?
What are the pieces that are always going to require custom engineering?
What are the most interesting, innovative, or unexpected approaches to MLOps workflows/platforms that you have seen?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on supporting the MLOps community?
What are your predictions for the future of MLOps?
What are you keeping a close eye on?
Contact Info
Demetrios
LinkedIn
@Dpbrinkm on Twitter
Medium
David
LinkedIn
@aponteanalytics on Twitter
aponte411 on GitHub
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
MLOps Community
Everybody Lies: Big Data, New Data, and What the Internet Can Tell Us About Who We Really Are by Seth Stephens-Davidowitz (affiliate link)
MLOps
DataOps
DevOps
The Sequence Newsletter
Neptune.ai
Algorithmia
Kubeflow
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
4/16/2022 • 1 hour, 15 minutes, 53 seconds
DataOps As A Service For Your Data Integration Workflows With Rivery
Summary
Data engineering is a practice that is multi-faceted and requires integration with a large number of systems. This often means working across multiple tools to get the job done, which can introduce a significant cost to productivity due to the number of context switches. Rivery is a platform designed to reduce this incidental complexity and provide a single system for working across the different stages of the data lifecycle. In this episode CEO and founder Itamar Ben Hemo explains how his experiences in the industry led to his vision for the Rivery platform as a single place to build end-to-end analytical workflows, including how it is architected and how you can start using it today for your own work.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
Atlan is a collaborative workspace for data-driven teams, like Github for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $3000 on an annual subscription
Modern data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days or even weeks. By the time errors have made their way into production, it’s often too late and damage is done. Datafold built automated regression testing to help data and analytics engineers deal with data quality in their pull requests. Datafold shows how a change in SQL code affects your data, both on a statistical level and down to individual rows and values before it gets merged to production. No more shipping and praying, you can now know exactly what will change in your database! Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI workflows. Visit dataengineeringpodcast.com/datafold today to book a demo with Datafold.
Are you looking for a structured and battle-tested approach for learning data engineering? Would you like to know how you can build proper data infrastructures that are built to last? Would you like to have a seasoned industry expert guide you and answer all your questions? Join Pipeline Academy, the world’s first data engineering bootcamp. Learn in small groups with like-minded professionals for 9 weeks part-time to level up in your career. The course covers the most relevant and essential data and software engineering topics that enable you to start your journey as a professional data engineer or analytics engineer. Plus we have AMAs with world-class guest speakers every week! The next cohort starts in April 2022. Visit dataengineeringpodcast.com/academy and apply now!
Your host is Tobias Macey and today I’m interviewing Itamar Ben Hemo about Rivery, a SaaS platform designed to provide an end-to-end solution for Ingestion, Transformation, Orchestration, and Data Operations
Interview
Introduction
How did you get involved in the area of data management?
Can you describe what Rivery is and the story behind it?
What are the primary goals of Rivery as a platform and company?
What are the target personas for the Rivery platform?
What are the points of interaction/workflows for each of those personas?
What are some of the positive and negative sources of inspiration that you looked to while deciding on the scope of the platform?
The majority of recently formed companies are focused on narrow and composable concerns of data management. What do you see as the shortcomings of that approach?
What are some of the tradeoffs between integrating independent tools vs buying into an ecosystem?
How is the Rivery platform designed and implemented?
How have the design and goals of the platform changed or evolved since you began working on it?
What were your criteria for the MVP that would allow you to test your hypothesis?
How has the evolution of the ecosystem influenced your product strategy?
One of the interesting features that you offer is the catalog of "kits" to quickly set up common workflows. How do you manage regression/integration testing for those kits as the Rivery platform evolves?
What are the most interesting, innovative, or unexpected ways that you have seen Rivery used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on Rivery?
When is Rivery the wrong choice?
What do you have planned for the future of Rivery?
Contact Info
LinkedIn
@ItamarBenHemo on Twitter
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Links
Rivery
Matillion
BigQuery
Snowflake
Podcast Episode
dbt
Podcast Episode
Fivetran
Podcast Episode
Snowpark
Postman
Debezium
Podcast Episode
Snowflake Partner Connect
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
4/11/2022 • 58 minutes, 4 seconds
Synthetic Data As A Service For Simplifying Privacy Engineering With Gretel
Summary
Any time that you are storing data about people, there are a number of privacy and security considerations that come with it. Privacy engineering is a growing field in data management that focuses on how to protect attributes of personal data so that the containing datasets can be shared safely. In this episode Gretel co-founder and CTO John Myers explains how they are building tools for data engineers and analysts to incorporate privacy engineering techniques into their workflows and validate the safety of their data against re-identification attacks.
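As a rough illustration of one basic privacy engineering transformation (this is not Gretel’s API; the column names, salt handling, and data are hypothetical), the sketch below pseudonymizes a direct identifier before a dataset is shared.

```python
import hashlib
import pandas as pd

df = pd.DataFrame({
    "email": ["ada@example.com", "grace@example.com"],
    "zip_code": ["02139", "94103"],
    "purchase_amount": [42.50, 19.99],
})

# Pseudonymize the direct identifier with a salted hash; in practice the
# salt would come from a secrets manager rather than living in the code.
SALT = "replace-with-a-secret-salt"
df["email"] = df["email"].map(
    lambda v: hashlib.sha256((SALT + v).encode("utf-8")).hexdigest()
)

# Quasi-identifiers such as zip_code can still enable re-identification,
# which is where synthetic data generation and validation tooling help.
print(df.head())
```

Simple hashing like this leaves quasi-identifiers intact, which is exactly the kind of re-identification risk that synthetic data and formal validation are meant to address.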
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
This episode is brought to you by Acryl Data, the company behind DataHub, the leading developer-friendly data catalog for the modern data stack. Open Source DataHub is running in production at several companies like Peloton, Optum, Udemy, Zynga and others. Acryl Data provides DataHub as an easy-to-consume SaaS product which has been adopted by several companies. Sign up for the SaaS product at dataengineeringpodcast.com/acryl
Are you looking for a structured and battle-tested approach for learning data engineering? Would you like to know how you can build proper data infrastructures that are built to last? Would you like to have a seasoned industry expert guide you and answer all your questions? Join Pipeline Academy, the world’s first data engineering bootcamp. Learn in small groups with like-minded professionals for 9 weeks part-time to level up in your career. The course covers the most relevant and essential data and software engineering topics that enable you to start your journey as a professional data engineer or analytics engineer. Plus we have AMAs with world-class guest speakers every week! The next cohort starts in April 2022. Visit dataengineeringpodcast.com/academy and apply now!
RudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their state-of-the-art reverse ETL pipelines enable you to send enriched data to any cloud tool. Sign up free… or just get the free t-shirt for being a listener of the Data Engineering Podcast at dataengineeringpodcast.com/rudder.
Your host is Tobias Macey and today I’m interviewing John Myers about privacy engineering and use cases for synthetic data
Interview
Introduction
How did you get involved in the area of data management?
Can you describe what Gretel is and the story behind it?
How do you define "privacy engineering"?
In an organization or data team, who is typically responsible for privacy engineering?
How would you characterize the current state of the art and adoption for privacy engineering?
Who are the target users of Gretel and how does that inform the features and design of the product?
What are the stages of the data lifecycle where Gretel is used?
Can you describe a typical workflow for integrating Gretel into data pipelines for business analytics or ML model training?
How is the Gretel platform implemented?
How have the design and goals of the system changed or evolved since you started working on it?
What are some of the nuances of synthetic data generation or masking that data engineers/data analysts need to be aware of as they start using Gretel?
What are the most interesting, innovative, or unexpected ways that you have seen Gretel used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on Gretel?
When is Gretel the wrong choice?
What do you have planned for the future of Gretel?
Contact Info
LinkedIn
@jtm_tech on Twitter
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Links
Gretel
Privacy Engineering
Weights and Biases
Red Team/Blue Team
Generative Adversarial Network
Capture The Flag in application security
CVE == Common Vulnerabilities and Exposures
Machine Learning Cold Start Problem
Faker
Mockaroo
Kaggle
Sentry
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
4/10/2022 • 48 minutes, 32 seconds
Accelerate Development Of Enterprise Analytics With The Coalesce Visual Workflow Builder
Summary
The flexibility of software-oriented data workflows is useful for fulfilling complex requirements, but for simple and repetitive use cases it adds significant complexity. Coalesce is a platform designed to reduce repetitive work for common workflows by adopting a visual pipeline builder to support your data warehouse transformations. In this episode Satish Jayanthi explains how he is building a framework to allow enterprises to move quickly while maintaining guardrails for data workflows. This allows everyone in the business to participate in data analysis in a sustainable manner.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
Atlan is a collaborative workspace for data-driven teams, like Github for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $3000 on an annual subscription
Modern data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days or even weeks. By the time errors have made their way into production, it’s often too late and damage is done. Datafold built automated regression testing to help data and analytics engineers deal with data quality in their pull requests. Datafold shows how a change in SQL code affects your data, both on a statistical level and down to individual rows and values before it gets merged to production. No more shipping and praying, you can now know exactly what will change in your database! Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI workflows. Visit dataengineeringpodcast.com/datafold today to book a demo with Datafold.
Are you looking for a structured and battle-tested approach for learning data engineering? Would you like to know how you can build proper data infrastructures that are built to last? Would you like to have a seasoned industry expert guide you and answer all your questions? Join Pipeline Academy, the world’s first data engineering bootcamp. Learn in small groups with like-minded professionals for 9 weeks part-time to level up in your career. The course covers the most relevant and essential data and software engineering topics that enable you to start your journey as a professional data engineer or analytics engineer. Plus we have AMAs with world-class guest speakers every week! The next cohort starts in April 2022. Visit dataengineeringpodcast.com/academy and apply now!
Your host is Tobias Macey and today I’m interviewing Satish Jayanthi about how organizations can use data architectural patterns to stay competitive in today’s data-rich environment
Interview
Introduction
How did you get involved in the area of data management?
Can you describe what you are building at Coalesce and the story behind it?
What are the core problems that you are focused on solving with Coalesce?
The platform appears to be fairly opinionated in the workflow. What are the design principles and philosophies that you have embedded into the user experience?
Can you describe how Coalesce is implemented?
What are the pitfalls in data architecture patterns that you commonly see organizations fall prey to?
How do the pre-built transformation templates in Coalesce help to guide users in a more maintainable direction?
The platform is currently tied to Snowflake as the underlying engine. How much effort will it take to expand your integrations and the scope of Coalesce?
What are the most interesting, innovative, or unexpected ways that you have seen Coalesce used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on Coalesce?
When is Coalesce the wrong choice?
What do you have planned for the future of Coalesce?
Contact Info
LinkedIn
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Links
Coalesce
Data Warehouse Toolkit
Wherescape
dbt
Podcast Episode
Type 2 Dimensions
Firebase
Kubernetes
Star Schema
Data Vault
Podcast Episode
Data Mesh
Podcast Episode
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
4/3/2022 • 42 minutes, 45 seconds
Repeatable Patterns For Designing Data Platforms And When To Customize Them
Summary
Building a data platform for your organization is a challenging undertaking. Building multiple data platforms for other organizations as a service without burning out is another thing entirely. In this episode Brandon Beidel from Red Ventures shares his experiences as a data product manager in charge of helping his customers build scalable analytics systems that fit their needs. He explains the common patterns that have been useful across multiple use cases, as well as when and how to build customized solutions.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
This episode is brought to you by Acryl Data, the company behind DataHub, the leading developer-friendly data catalog for the modern data stack. Open Source DataHub is running in production at several companies like Peloton, Optum, Udemy, Zynga and others. Acryl Data provides DataHub as an easy-to-consume SaaS product which has been adopted by several companies. Sign up for the SaaS product at dataengineeringpodcast.com/acryl
Hey Data Engineering Podcast listeners, want to learn how the Joybird data team reduced their time spent building new integrations and managing data pipelines by 93%? Join our live webinar on April 20th. Joybird director of analytics, Brett Trani, will walk through how retooling their data stack with RudderStack, Snowflake, and Iterable made this possible. Visit www.rudderstack.com/joybird?utm_source=rss&utm_medium=rss to register today.
The most important piece of any data project is the data itself, which is why it is critical that your data source is high quality. PostHog is your all-in-one product analytics suite including product analysis, user funnels, feature flags, experimentation, and it’s open source so you can host it yourself or let them do it for you! You have full control over your data and their plugin system lets you integrate with all of your other data tools, including data warehouses and SaaS platforms. Give it a try today with their generous free tier at dataengineeringpodcast.com/posthog
Your host is Tobias Macey and today I’m interviewing Brandon Beidel about his data platform journey at Red Ventures
Interview
Introduction
How did you get involved in the area of data management?
Can you describe what Red Ventures is and your role there?
Given the relative newness of data product management, where do you draw inspiration and direction for how to approach your work?
What are the primary categories of data product that your data consumers are building/relying on?
What are the types of data sources that you are working with to power those downstream use cases?
Can you describe the size and composition/organization of your data team(s)?
How do you approach the build vs. buy decision while designing and evolving your data platform?
What are the tools/platforms/architectural and usage patterns that you and your team have developed for your platform?
What are the primary goals and constraints that have contributed to your decisions?
How have the goals and design of the platform changed or evolved since you started working with the team?
You recently went through the process of establishing and reporting on SLAs for your data products. Can you describe the approach you took and the useful lessons that were learned?
What are the technical and organizational components of the data work at Red Ventures that have proven most difficult?
What excites you most about the future of data engineering?
What are the most interesting, innovative, or unexpected ways that you have seen teams building more reliable data systems?
What aspects of data tooling or processes are still missing for most data teams?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on data products at Red Ventures?
What do you have planned for the future of your data platform?
Contact Info
LinkedIn
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Links
Red Ventures
Monte Carlo
Opportunity Cost
dbt
Podcast Episode
Apache Ranger
Privacera
Podcast Episode
Segment
Fivetran
Podcast Episode
Databricks
Bigquery
Redshift
Hightouch
Podcast Episode
Airflow
Astronomer
Podcast Episode
Airbyte
Podcast Episode
Clickhouse
Podcast Episode
Presto
Podcast Episode
Trino
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
4/3/2022 • 47 minutes, 2 seconds
Eliminate The Bottlenecks In Your Key/Value Storage With SpeeDB
Summary
At the foundational layer, many databases and data processing engines rely on key/value storage for managing the layout of information on the disk. RocksDB is one of the most popular choices for this component and has been incorporated into popular systems such as ksqlDB. As these systems are scaled to larger volumes of data and higher throughputs, the RocksDB engine can become a performance bottleneck. In this episode Adi Gelvan shares the work that he and his team at SpeeDB have put into building a drop-in replacement for RocksDB that eliminates that bottleneck. He explains how they redesigned the core algorithms and storage management features to deliver ten times faster throughput, how the lower latencies reduce the burden on platform engineers, and how they are working toward an open source offering so that you can try it yourself with no friction.
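For context on what a drop-in replacement implies, the sketch below shows the embedded key/value interface that RocksDB exposes and that a compatible engine would need to preserve. It assumes the third-party python-rocksdb bindings, the keys and values are hypothetical, and this is not SpeeDB-specific code.

```python
import rocksdb  # assumes the third-party python-rocksdb bindings

# Open (or create) an embedded database directory on local disk
opts = rocksdb.Options(create_if_missing=True)
db = rocksdb.DB("example.db", opts)

# Keys and values are raw bytes; higher-level systems layer their own
# encoding (rows, indexes, state stores) on top of this interface.
db.put(b"user:1001", b'{"name": "Ada", "plan": "pro"}')
print(db.get(b"user:1001"))
```

Because systems like ksqlDB embed this interface in-process rather than calling out over a network, swapping the engine underneath can change performance characteristics without changing application code.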
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
Atlan is a collaborative workspace for data-driven teams, like Github for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $3000 on an annual subscription
Modern data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days or even weeks. By the time errors have made their way into production, it’s often too late and damage is done. Datafold built automated regression testing to help data and analytics engineers deal with data quality in their pull requests. Datafold shows how a change in SQL code affects your data, both on a statistical level and down to individual rows and values before it gets merged to production. No more shipping and praying, you can now know exactly what will change in your database! Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI workflows. Visit dataengineeringpodcast.com/datafold today to book a demo with Datafold.
TimescaleDB, from your friends at Timescale, is the leading open-source relational database with support for time-series data. Time-series data is time stamped so you can measure how a system is changing. Time-series data is relentless and requires a database like TimescaleDB with speed and petabyte-scale. Understand the past, monitor the present, and predict the future. That’s Timescale. Visit them today at dataengineeringpodcast.com/timescale
Your host is Tobias Macey and today I’m interviewing Adi Gelvan about his work on SpeeDB, the "next generation data engine"
Interview
Introduction
How did you get involved in the area of data management?
Can you describe what SpeeDB is and the story behind it?
What is your target market and customer?
What are some of the shortcomings of RocksDB that these organizations are running into and how do they manifest?
What are the characteristics of RocksDB that have led so many database engines to embed it or build on top of it?
Which of the systems that rely on RocksDB do you most commonly see running into its limitations?
How does the work you have done at SpeeDB compare to the efforts of the Terark project?
Can you describe how you approached the work of identifying areas for improvement in RocksDB?
What are some of the optimizations that you introduced?
What are some tradeoffs that you deemed acceptable in the process of optimizing for speed and scale?
What is the integration process for adopting SpeeDB?
In the event that an organization has a system with data resident in RocksDB, what is the migration process?
What are the most interesting, innovative, or unexpected ways that you have seen SpeeDB used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on SpeeDB?
When is SpeeDB the wrong choice?
What do you have planned for the future of SpeeDB?
Contact Info
LinkedIn
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Links
SpeeDB
RocksDB
TerarkDB
EMC
Infinidat
LSM == Log-Structured Merge Tree
B+ Tree
LevelDB
LMDB
Bloom Filter
Badger
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
3/27/2022 • 46 minutes, 52 seconds
Building A Data Governance Bridge Between Cloud And Datacenters For The Enterprise At Privacera
Summary
Data governance is a practice that requires a high degree of flexibility and collaboration at the organizational and technical levels. The growing prominence of cloud and hybrid environments in data management adds additional stress to an already complex endeavor. Privacera is an enterprise-grade solution for cloud and hybrid data governance built on top of the robust and battle-tested Apache Ranger project. In this episode Balaji Ganesan shares how his experiences building and maintaining Ranger in previous roles helped him understand the needs of organizations and engineers as they define and evolve their data governance policies and practices.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
This episode is brought to you by Acryl Data, the company behind DataHub, the leading developer-friendly data catalog for the modern data stack. Open Source DataHub is running in production at several companies like Peloton, Optum, Udemy, Zynga and others. Acryl Data provides DataHub as an easy-to-consume SaaS product which has been adopted by several companies. Sign up for the SaaS product at dataengineeringpodcast.com/acryl
RudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their state-of-the-art reverse ETL pipelines enable you to send enriched data to any cloud tool. Sign up free… or just get the free t-shirt for being a listener of the Data Engineering Podcast at dataengineeringpodcast.com/rudder.
The most important piece of any data project is the data itself, which is why it is critical that your data source is high quality. PostHog is your all-in-one product analytics suite including product analysis, user funnels, feature flags, experimentation, and it’s open source so you can host it yourself or let them do it for you! You have full control over your data and their plugin system lets you integrate with all of your other data tools, including data warehouses and SaaS platforms. Give it a try today with their generous free tier at dataengineeringpodcast.com/posthog
Your host is Tobias Macey and today I’m interviewing Balaji Ganesan about his work at Privacera and his view on the state of data governance, access control, and security in the cloud
Interview
Introduction
How did you get involved in the area of data management?
Can you describe what Privacera is and the story behind it?
What is your working definition of "data governance" and how does that influence your product focus and priorities?
What are some of the lessons that you learned from your work on Apache Ranger that helped with your efforts at Privacera?
How would you characterize your position in the market for data governance/data security tools?
What are the unique constraints and challenges that come into play when managing data in cloud platforms?
Can you explain how the Privacera platform is architected?
How have the design and goals of the system changed or evolved since you started working on it?
What is the workflow for an operator integrating Privacera into a data platform?
How do you provide feedback to users about the level of coverage for discovered data assets?
How does Privacera fit into the workflow of the different personas working with data?
What are some of the security and privacy controls that Privacera introduces?
How do you mitigate the potential for anyone to bypass Privacera’s controls by interacting directly with the underlying systems?
What are the most interesting, innovative, or unexpected ways that you have seen Privacera used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on Privacera?
When is Privacera the wrong choice?
What do you have planned for the future of Privacera?
Contact Info
LinkedIn
@Balaji_Blog on Twitter
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Links
Privacera
Hadoop
Hortonworks
Apache Ranger
Oracle
Teradata
Presto/Trino
Starburst
Podcast Episode
Ahana
Podcast Episode
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
3/27/2022 • 1 hour, 2 minutes, 35 seconds
Exploring Incident Management Strategies For Data Teams
Summary
Data assets and the pipelines that create them have become critical production infrastructure for companies. This adds a requirement for reliability and uptime management similar to application infrastructure. In this episode Francisco Alberini and Mei Tao share their insights on what incident management looks like for data platforms and the teams that support them.
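As one concrete example of the monitoring signals discussed later in the episode, a minimal data freshness check might look like the following sketch; the table, column, threshold, and alerting path are all hypothetical, and sqlite3 stands in for whatever warehouse driver you actually use.

```python
import datetime as dt
import sqlite3  # stand-in for your warehouse driver

def is_table_fresh(conn, table: str, ts_column: str, max_lag: dt.timedelta) -> bool:
    """Return True if the newest row in `table` is within the allowed lag."""
    cur = conn.execute(f"SELECT MAX({ts_column}) FROM {table}")
    latest = cur.fetchone()[0]
    if latest is None:
        return False  # an empty table is itself an incident signal
    latest_ts = dt.datetime.fromisoformat(latest)
    return dt.datetime.utcnow() - latest_ts <= max_lag

# Hypothetical usage, assuming an "orders" table with a "loaded_at" timestamp column:
# conn = sqlite3.connect("warehouse.db")
# if not is_table_fresh(conn, "orders", "loaded_at", dt.timedelta(hours=1)):
#     print("ALERT: orders has not loaded in the last hour")  # route to PagerDuty/OpsGenie in practice
```

Freshness is only one signal; volume, schema, and distribution checks follow the same pattern of comparing an observed value against an expected range and routing the result to whoever owns the asset.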
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
Atlan is a collaborative workspace for data-driven teams, like Github for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $3000 on an annual subscription
RudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their state-of-the-art reverse ETL pipelines enable you to send enriched data to any cloud tool. Sign up free… or just get the free t-shirt for being a listener of the Data Engineering Podcast at dataengineeringpodcast.com/rudder.
Are you looking for a structured and battle-tested approach for learning data engineering? Would you like to know how you can build proper data infrastructures that are built to last? Would you like to have a seasoned industry expert guide you and answer all your questions? Join Pipeline Academy, the world’s first data engineering bootcamp. Learn in small groups with like-minded professionals for 9 weeks part-time to level up in your career. The course covers the most relevant and essential data and software engineering topics that enable you to start your journey as a professional data engineer or analytics engineer. Plus we have AMAs with world-class guest speakers every week! The next cohort starts in April 2022. Visit dataengineeringpodcast.com/academy and apply now!
Your host is Tobias Macey and today I’m interviewing Francisco Alberini and Mei Tao about patterns and practices for incident management in data teams
Interview
Introduction
How did you get involved in the area of data management?
Can you start by describing some of the ways that an "incident" can manifest in a data system?
At a high level, what are the steps and participants required to bring an incident to resolution?
The principle of incident management is familiar to application/site reliability teams. What is the current state of the art/adoption for these practices among data teams?
What are the signals that teams should be monitoring to identify and alert on potential incidents?
Alerting is a subjective and nuanced practice, regardless of the context. What are some useful practices that you have seen and enacted to reduce alert fatigue and provide useful context in the alerts that do get sent?
Another aspect of this problem is the proper routing of alerts to ensure that the right person sees and acts on it. How have you seen teams deal with the challenge of delivering alerts to the right people?
When there is an active incident, what are the steps that you commonly see data teams take to understand the cause and scope of the issue?
How can teams augment their systems to make incidents faster to resolve?
What are the most interesting, innovative, or unexpected ways that you have seen teams approach incident response?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on incident management strategies?
What are the aspects of incident management for data teams that are still missing?
Contact Info
Mei
@tao_mei on Twitter
Email
Francisco
@falberini on Twitter
Email
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Links
Monte Carlo
Learn more about RCA best practices
Segment
Podcast Episode
Segment Protocols
Redshift
Airflow
dbt
Podcast Episode
The Goal by Eliyahu Goldratt
Data Mesh
Podcast Episode
Follow-Up Podcast Episode
PagerDuty
OpsGenie
Grafana
Prometheus
Sentry
Podcast.__init__ Episode
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
3/20/2022 • 57 minutes, 25 seconds
Accelerate Your Embedded Analytics With Apache Pinot
Summary
Data and analytics are permeating every system, including customer-facing applications. The introduction of embedded analytics to an end-user product creates a significant shift in requirements for your data layer. The Pinot OLAP datastore was created for this purpose, optimizing for low latency queries on rapidly updating datasets with highly concurrent queries. In this episode Kishore Gopalakrishna and Xiang Fu explain how it is able to achieve those characteristics, their work at StarTree to make it more easily available, and how you can start using it for your own high throughput data workloads today.
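For a sense of what a low-latency query against Pinot looks like from application code, here is a hedged sketch using the community pinotdb DB-API client for Python; the broker address, table, and column names are hypothetical placeholders and not taken from the episode.

```python
# Hedged sketch, assuming a Pinot broker on localhost:8099 and a hypothetical
# "pageviews" table; adjust host, port, and schema for a real deployment.
from pinotdb import connect

conn = connect(host="localhost", port=8099, path="/query/sql", scheme="http")
cur = conn.cursor()

# An aggregation over recent events, the kind of high-concurrency,
# low-latency query Pinot is designed to serve.
cur.execute(
    "SELECT country, COUNT(*) AS views FROM pageviews "
    "WHERE ts > ago('PT1H') GROUP BY country ORDER BY views DESC LIMIT 10"
)
for row in cur:
    print(row)
```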
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
So now your modern data stack is set up. How is everyone going to find the data they need, and understand it? Select Star is a data discovery platform that automatically analyzes & documents your data. For every table in Select Star, you can find out where the data originated, which dashboards are built on top of it, who’s using it in the company, and how they’re using it, all the way down to the SQL queries. Best of all, it’s simple to set up, and easy for both engineering and operations teams to use. With Select Star’s data catalog, a single source of truth for your data is built in minutes, even across thousands of datasets. Try it out for free and double the length of your free trial today at dataengineeringpodcast.com/selectstar. You’ll also get a swag package when you continue on a paid plan.
This episode is brought to you by Acryl Data, the company behind DataHub, the leading developer-friendly data catalog for the modern data stack. Open Source DataHub is running in production at several companies like Peloton, Optum, Udemy, Zynga and others. Acryl Data provides DataHub as an easy to consume SaaS product which has been adopted by several companies. Signup for the SaaS product today at dataengineeringpodcast.com/acryl
Your host is Tobias Macey and today I’m interviewing Kishore Gopalakrishna and Xiang Fu about Apache Pinot and its applications for powering user-facing analytics
Interview
Introduction
How did you get involved in the area of data management?
Can you describe what Pinot is and the story behind it?
What are the primary use cases that Pinot is designed to support?
There are numerous OLAP engines available with varying tradeoffs and optimal use cases. What are the cases where Pinot is the preferred choice?
How does it compare to systems such as Clickhouse (for OLAP) or CubeJS/GoodData (for embedded analytics)?
How do the operational needs of a database engine change as you move from serving internal stakeholders to external end-users?
Can you describe how Pinot is architected?
What were the key design elements that were necessary to support low-latency queries with high concurrency?
Can you describe a typical end-to-end architecture where Pinot will be used for embedded analytics?
What are some of the tools/technologies/platforms/design patterns that Pinot might replace or obviate?
What are some of the useful lessons related to data modeling that users of Pinot should consider?
What are some edge cases that they might encounter due to details of how the storage layer is architected? (e.g. data tiering, tail latencies, etc.)
What are some heuristics that you have developed for understanding how to manage data lifecycles in a user-facing analytics application?
What are some of the ways that users might need to customize Pinot for their specific use cases and what options do they have for extending it?
What are the most interesting, innovative, or unexpected ways that you have seen Pinot used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on Pinot?
When is Pinot the wrong choice?
What do you have planned for the future of Pinot?
Contact Info
Kishore
LinkedIn
@KishoreBytes on Twitter
Xiang
LinkedIn
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Links
Apache Pinot
StarTree
Espresso
Apache Helix
Apache Gobblin
Apache S4
Kafka
Lucene
StarTree Index
Presto
Trino
Pulsar
Podcast Episode
Spark
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
3/20/2022 • 1 hour, 12 minutes, 56 seconds
Taking A Multidimensional Approach To Data Observability At Acceldata
Summary
Data observability is a term that has been co-opted by numerous vendors with varying ideas of what it should mean. At Acceldata, they view it as a holistic approach to understanding the computational and logical elements that power your analytical capabilities. In this episode Tristan Spaulding, head of product at Acceldata, explains the multi-dimensional nature of gaining visibility into your running data platform and how they have architected their platform to assist in that endeavor.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
Atlan is a collaborative workspace for data-driven teams, like Github for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $3000 on an annual subscription
RudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their state-of-the-art reverse ETL pipelines enable you to send enriched data to any cloud tool. Sign up free… or just get the free t-shirt for being a listener of the Data Engineering Podcast at dataengineeringpodcast.com/rudder.
TimescaleDB, from your friends at Timescale, is the leading open-source relational database with support for time-series data. Time-series data is time stamped so you can measure how a system is changing. Time-series data is relentless and requires a database like TimescaleDB with speed and petabyte-scale. Understand the past, monitor the present, and predict the future. That’s Timescale. Visit them today at dataengineeringpodcast.com/timescale
Your host is Tobias Macey and today I’m interviewing Tristan Spaulding about Acceldata, a platform offering multidimensional data observability for modern data infrastructure
Interview
Introduction
How did you get involved in the area of data?
Can you describe what Acceldata is and the story behind it?
What does it mean for a data observability platform to be "multidimensional"?
How do the architectural characteristics of the "modern data stack" influence the requirements and implementation of data observability strategies?
The data observability ecosystem has seen a lot of activity over the past ~2-3 years. What are the unique capabilities/use cases that Acceldata supports?
Who are your target users and how does that focus influence the way that you have approached feature and design priorities?
What are some of the ways that you are using the Acceldata platform to run Acceldata?
Can you describe how the Acceldata platform is implemented?
How have the design and goals of the system changed or evolved since you started working on it?
How are you managing the definition, collection, and correlation of events across stages of the data lifecycle?
What are some of the ways that performance data can feed back into the debugging and maintenance of an organization’s data ecosystem?
What are the challenges that data platform owners face when trying to interpret the metrics and events that are available in a system like Acceldata?
What are the most interesting, innovative, or unexpected ways that you have seen Acceldata used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on Acceldata?
When is Acceldata the wrong choice?
What do you have planned for the future of Acceldata?
Contact Info
LinkedIn
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Links
Acceldata
Semantic Web
Hortonworks
dbt
Podcast Episode
Firebolt
Podcast Episode
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
3/14/2022 • 1 hour, 3 minutes, 17 seconds
Accelerating Adoption Of The Modern Data Stack At 5X Data
Summary
The modern data stack is a constantly moving target, which makes it difficult to adopt without prior experience. In order to accelerate the time to deliver useful insights at organizations of all sizes that are looking to take advantage of these new and evolving architectures, Tarush Aggarwal founded 5X Data. In this episode he explains how he works with these companies to deploy the technology stack and pairs them with an experienced engineer who assists with the implementation and training to let them realize the benefits of this architecture. He also shares his thoughts on the current state of the ecosystem for modern data vendors and trends to watch as we move into the future.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
Today’s episode is Sponsored by Prophecy.io – the low-code data engineering platform for the cloud. Prophecy provides an easy-to-use visual interface to design & deploy data pipelines on Apache Spark & Apache Airflow. Now all the data users can use software engineering best practices – git, tests and continuous deployment with a simple to use visual designer. How does it work? – You visually design the pipelines, and Prophecy generates clean Spark code with tests on git; then you visually schedule these pipelines on Airflow. You can observe your pipelines with built in metadata search and column level lineage. Finally, if you have existing workflows in AbInitio, Informatica or other ETL formats that you want to move to the cloud, you can import them automatically into Prophecy making them run productively on Spark. Create your free account today at dataengineeringpodcast.com/prophecy.
So now your modern data stack is set up. How is everyone going to find the data they need, and understand it? Select Star is a data discovery platform that automatically analyzes & documents your data. For every table in Select Star, you can find out where the data originated, which dashboards are built on top of it, who’s using it in the company, and how they’re using it, all the way down to the SQL queries. Best of all, it’s simple to set up, and easy for both engineering and operations teams to use. With Select Star’s data catalog, a single source of truth for your data is built in minutes, even across thousands of datasets. Try it out for free and double the length of your free trial today at dataengineeringpodcast.com/selectstar. You’ll also get a swag package when you continue on a paid plan.
Your host is Tobias Macey and today I’m interviewing Tarush Aggarwal about how he and his team are helping organizations streamline adoption of the modern data stack
Interview
Introduction
How did you get involved in the area of data management?
Can you describe what you are doing at 5x and the story behind it?
How has your focus and operating model shifted since we spoke a year ago?
What are the biggest shifts in the market for data management that you have seen in that time?
What are the main challenges that your customers are facing when they start working with you?
What are the components that you are relying on to build repeatable data platforms for your customers?
What are the sharp edges that you have had to smooth out to scale your implementation of those systems?
What do you see as the white spaces that still exist in the offerings available for the "modern data stack"?
With the rapid introduction of so many new products in the data ecosystem, what are the categories that you see as being a long-term necessity?
What are the areas that you predict will merge and consolidate over the next 3 – 5 years?
What are the most interesting, innovative, or unexpected types of problems that you and your collaborators have had the opportunity to work on?
What are the most interesting, unexpected, or challenging lessons that you have learned while building the 5x organization?
When is 5x the wrong choice?
What do you have planned for the future of 5x?
Contact Info
LinkedIn
@tarush on Twitter
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Links
5X Data
Podcast Interview
Snowflake
Podcast Interview
dbt
Podcast Interview
Fivetran
Podcast Interview
Looker
Podcast Interview
Matt Turck State of Data
Mixpanel
Amplitude
Heap
Podcast Episode
Bigquery
Narrator
Podcast Episode
Marquez
Podcast Episode
Atlan
Podcast Episode
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
3/14/2022 • 53 minutes, 51 seconds
Move Your Database To The Data And Speed Up Your Analytics With DuckDB
Summary
When you think about selecting a database engine for your project you typically consider options focused on serving multiple concurrent users. Sometimes what you really need is an embedded database that is blazing fast for single user workloads. DuckDB is an in-process database engine optimized for OLAP applications to speed up your analytical queries that meets you where you are, whether that’s Python, R, Java, or even the web. In this episode, Hannes Mühleisen, co-creator and CEO of DuckDB Labs, shares the motivations for creating the project, the myriad ways that it can be used to speed up your data projects, and the detailed engineering efforts that go into making it adaptable to any environment. This is a fascinating and humorous exploration of a truly useful piece of technology.
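As a rough illustration of that in-process model, here is a minimal sketch of using DuckDB from Python; the database file, Parquet file, and column names are hypothetical placeholders rather than anything taken from the episode.

```python
# Minimal sketch, assuming a local Parquet file named events.parquet with a
# "category" column; all names here are hypothetical.
import duckdb
import pandas as pd

# Connect to a local database file; call duckdb.connect() with no argument
# for a purely in-memory database.
con = duckdb.connect("analytics.duckdb")

# Query a Parquet file in place, with no separate loading step.
rows = con.execute(
    "SELECT category, count(*) AS n FROM 'events.parquet' GROUP BY category ORDER BY n DESC"
).fetchall()
print(rows)

# DuckDB can also run SQL directly against an in-memory Pandas DataFrame.
df = pd.DataFrame({"x": [1, 2, 3]})
print(con.execute("SELECT sum(x) FROM df").fetchone())
```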
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
Atlan is a collaborative workspace for data-driven teams, like Github for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $3000 on an annual subscription
RudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their state-of-the-art reverse ETL pipelines enable you to send enriched data to any cloud tool. Sign up free… or just get the free t-shirt for being a listener of the Data Engineering Podcast at dataengineeringpodcast.com/rudder.
Your host is Tobias Macey and today I’m interviewing Hannes Mühleisen about DuckDB, an in-process embedded database engine for columnar analytics
Interview
Introduction
How did you get involved in the area of data management?
Can you describe what DuckDB is and the story behind it?
Where did the name come from?
What are some of the use cases that DuckDB is designed to support?
The interface for DuckDB is similar (at least in spirit) to SQLite. What are the deciding factors for when to use one vs. the other?
How might they be used in concert to take advantage of their relative strengths?
What are some of the ways that DuckDB can be used to better effect than options provided by different language ecosystems?
Can you describe how DuckDB is implemented?
How has the design and goals of the project changed or evolved since you began working on it?
What are some of the optimizations that you have had to make in order to support performant access to data that exceeds available memory?
Can you describe a typical workflow of incorporating DuckDB into an analytical project?
What are some of the libraries/tools/systems that DuckDB might replace in the scope of a project or team?
What are some of the overlooked/misunderstood/under-utilized features of DuckDB that you would like to highlight?
What is the governance model and plan for the long-term sustainability of the project?
What are the most interesting, innovative, or unexpected ways that you have seen DuckDB used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on DuckDB?
When is DuckDB the wrong choice?
What do you have planned for the future of DuckDB?
Contact Info
Hannes Mühleisen
@hfmuehleisen on Twitter
Website
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
DuckDB
CWI
SQLite
OLAP == Online Analytical Processing
Duck Typing
ZODB
Teradata
HTAP == Hybrid Transactional/Analytical Processing
Pandas
Podcast.__init__ Episode
Apache Arrow
Julia Language
Voltron Data
Parquet
Thrift
Protobuf
Vectorized Query Processor
LLVM
DuckDB Labs
DuckDB Foundation
MIT Open Courseware (OCW)
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
3/5/2022 • 1 hour, 17 minutes, 2 seconds
Developer Friendly Application Persistence That Is Fast And Scalable With HarperDB
Summary
Databases are an important component of application architectures, but they are often difficult to work with. HarperDB was created with the core goal of being a developer friendly database engine. In the process they ended up creating a scalable distributed engine that works across edge and datacenter environments to support a variety of novel use cases. In this episode co-founder and CEO Stephen Goldberg shares the history of the project, how it is architected to achieve their goals, and how you can start using it today.
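To make the developer-friendly angle concrete, here is a hedged sketch of calling HarperDB's HTTP operations API with the Python requests library; the endpoint, credentials, and schema/table names are assumptions for illustration and should be checked against the product documentation.

```python
# Hedged sketch, assuming a local HarperDB instance on port 9925 with basic
# auth; URL, credentials, and schema/table names are hypothetical.
import requests

HARPERDB_URL = "http://localhost:9925"
AUTH = ("admin", "password")

def harperdb_op(payload):
    """POST a single JSON operation to the HarperDB operations endpoint."""
    resp = requests.post(HARPERDB_URL, json=payload, auth=AUTH, timeout=10)
    resp.raise_for_status()
    return resp.json()

# Insert a record; attributes can be added on the fly thanks to the dynamic schema.
harperdb_op({
    "operation": "insert",
    "schema": "dev",
    "table": "dog",
    "records": [{"id": 1, "name": "Harper", "age": 5}],
})

# Read the data back with SQL through the same endpoint.
print(harperdb_op({"operation": "sql", "sql": "SELECT * FROM dev.dog WHERE age > 3"}))
```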
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
Today’s episode is Sponsored by Prophecy.io – the low-code data engineering platform for the cloud. Prophecy provides an easy-to-use visual interface to design & deploy data pipelines on Apache Spark & Apache Airflow. Now all the data users can use software engineering best practices – git, tests and continuous deployment with a simple to use visual designer. How does it work? – You visually design the pipelines, and Prophecy generates clean Spark code with tests on git; then you visually schedule these pipelines on Airflow. You can observe your pipelines with built in metadata search and column level lineage. Finally, if you have existing workflows in AbInitio, Informatica or other ETL formats that you want to move to the cloud, you can import them automatically into Prophecy making them run productively on Spark. Create your free account today at dataengineeringpodcast.com/prophecy.
So now your modern data stack is set up. How is everyone going to find the data they need, and understand it? Select Star is a data discovery platform that automatically analyzes & documents your data. For every table in Select Star, you can find out where the data originated, which dashboards are built on top of it, who’s using it in the company, and how they’re using it, all the way down to the SQL queries. Best of all, it’s simple to set up, and easy for both engineering and operations teams to use. With Select Star’s data catalog, a single source of truth for your data is built in minutes, even across thousands of datasets. Try it out for free and double the length of your free trial today at dataengineeringpodcast.com/selectstar. You’ll also get a swag package when you continue on a paid plan.
Are you looking for a structured and battle-tested approach for learning data engineering? Would you like to know how you can build proper data infrastructures that are built to last? Would you like to have a seasoned industry expert guide you and answer all your questions? Join Pipeline Academy, the world's first data engineering bootcamp. Learn in small groups with like-minded professionals for 9 weeks part-time to level up in your career. The course covers the most relevant and essential data and software engineering topics that enable you to start your journey as a professional data engineer or analytics engineer. Plus we have AMAs with world-class guest speakers every week! The next cohort starts in April 2022. Visit dataengineeringpodcast.com/academy and apply now!
Your host is Tobias Macey and today I’m interviewing Stephen Goldberg about HarperDB, a developer-friendly distributed database engine designed to scale across edge and cloud environments
Interview
Introduction
How did you get involved in the area of data management?
Can you describe what HarperDB is and the story behind it?
There has been an explosion of database engines over the past 5 – 10 years, with each entrant offering specific capabilities. What are the use cases that HarperDB is focused on addressing?
What are the issues that you experienced with existing database engines that led to the creation of HarperDB?
In what ways does HarperDB address those issues?
What are some of the ways that the focus on developers has influenced the interfaces and features of HarperDB?
What is your view on the role of the database in the near to medium future?
Can you describe how HarperDB is implemented?
How have the design and goals changed from when you first started working on it?
One of the common difficulties in document oriented databases is being able to conduct performant joins. What are the considerations that users need to be aware of as they are designing their data models?
What are some examples of deployment topologies that HarperDB can support given the pub/sub replication model?
What are some of the data modeling/database design strategies that users of HarperDB should know in order to take full advantage of its capabilities?
With the dynamic schema capabilities allowing developers to add attributes and mutate the table structure at any point, what are the options for schema enforcement? (e.g. one record adds an integer attribute and another record tries to write a string to that attribute)
What are the most interesting, innovative, or unexpected ways that you have seen HarperDB used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on HarperDB?
When is HarperDB the wrong choice?
What do you have planned for the future of HarperDB?
Contact Info
LinkedIn
@sgoldberg on Twitter
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Links
HarperDB
@harperdbio on Twitter
Mulesoft
Zapier
LMDB
SocketIO
SocketCluster
MongoDB
CouchDB
PostgreSQL
VoltDB
Heroku
SAP/Hana
NodeJS
DynamoDB
CockroachDB
Podcast Episode
Fastify
HTAP == Hybrid Transactional Analytical Processing
Splunk
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
3/5/2022 • 49 minutes, 33 seconds
Reflections On Designing A Data Platform From Scratch
Summary
Building a data platform is a complex journey that requires a significant amount of planning to do well. It requires knowledge of the available technologies, the requirements of the operating environment, and the expectations of the stakeholders. In this episode Tobias Macey, the host of the show, reflects on his plans for building a data platform and how the lessons he has learned from running the podcast are influencing his choices.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
Atlan is a collaborative workspace for data-driven teams, like Github for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $3000 on an annual subscription
TimescaleDB, from your friends at Timescale, is the leading open-source relational database with support for time-series data. Time-series data is time stamped so you can measure how a system is changing. Time-series data is relentless and requires a database like TimescaleDB with speed and petabyte-scale. Understand the past, monitor the present, and predict the future. That’s Timescale. Visit them today at dataengineeringpodcast.com/timescale
RudderStack helps you build a customer data platform on your warehouse or data lake. Instead of trapping data in a black box, they enable you to easily collect customer data from the entire stack and build an identity graph on your warehouse, giving you full visibility and control. Their SDKs make event streaming from any app or website easy, and their state-of-the-art reverse ETL pipelines enable you to send enriched data to any cloud tool. Sign up free… or just get the free t-shirt for being a listener of the Data Engineering Podcast at dataengineeringpodcast.com/rudder.
I’m your host, Tobias Macey, and today I’m sharing the approach that I’m taking while designing a data platform
Interview
Introduction
How did you get involved in the area of data management?
What are the components that need to be considered when designing a solution?
Data integration (extract and load)
What are your data sources?
Batch or streaming (acceptable latencies)
Data storage (lake or warehouse)
How is the data going to be used?
What other tools/systems will need to integrate with it?
The warehouse (Bigquery, Snowflake, Redshift) has become the focal point of the "modern data stack"
Data orchestration
Who will be managing the workflow logic?
Metadata repository
Types of metadata (catalog, lineage, access, queries, etc.)
Semantic layer/reporting
Data applications
Implementation phases
Build a single end-to-end workflow of a data application using a single category of data across sources
Validate the ability for an analyst/data scientist to self-serve a notebook-powered analysis
Iterate
Risks/unknowns
Data modeling requirements
Specific implementation details as integrations across components are built
When to use a vendor and risk lock-in vs. spend engineering time
Contact Info
Email
LinkedIn
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
Presto
Podcast Episode
Trino
Podcast Episode
Dagster
Podcast Episode
Prefect
Podcast Episode
Dremio
Podcast Episode
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
2/28/2022 • 40 minutes, 21 seconds
Manage Your Unstructured Data Assets Across Cloud And Hybrid Environments With Komprise
Summary
There is a wealth of options for managing structured and textual data, but unstructured binary data assets are not as well supported across the ecosystem. As organizations start to adopt cloud technologies they need a way to manage the distribution, discovery, and collaboration of data across their operating environments. To help solve this complicated challenge, Krishna Subramanian and her co-founders at Komprise built a system that allows you to use and secure your data wherever it lives and track copies across environments without requiring manual intervention. In this episode she explains the difficulties that everyone faces as they scale beyond a single operating environment, and how the Komprise platform reduces the burden of managing large and heterogeneous collections of unstructured files.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
Today’s episode is Sponsored by Prophecy.io – the low-code data engineering platform for the cloud. Prophecy provides an easy-to-use visual interface to design & deploy data pipelines on Apache Spark & Apache Airflow. Now all the data users can use software engineering best practices – git, tests and continuous deployment with a simple to use visual designer. How does it work? – You visually design the pipelines, and Prophecy generates clean Spark code with tests on git; then you visually schedule these pipelines on Airflow. You can observe your pipelines with built in metadata search and column level lineage. Finally, if you have existing workflows in AbInitio, Informatica or other ETL formats that you want to move to the cloud, you can import them automatically into Prophecy making them run productively on Spark. Create your free account today at dataengineeringpodcast.com/prophecy.
So now your modern data stack is set up. How is everyone going to find the data they need, and understand it? Select Star is a data discovery platform that automatically analyzes & documents your data. For every table in Select Star, you can find out where the data originated, which dashboards are built on top of it, who’s using it in the company, and how they’re using it, all the way down to the SQL queries. Best of all, it’s simple to set up, and easy for both engineering and operations teams to use. With Select Star’s data catalog, a single source of truth for your data is built in minutes, even across thousands of datasets. Try it out for free and double the length of your free trial today at dataengineeringpodcast.com/selectstar. You’ll also get a swag package when you continue on a paid plan.
Your host is Tobias Macey and today I’m interviewing Krishna Subramanian about her work at Komprise to generate value from unstructured file and object data across storage formats and locations
Interview
Introduction
How did you get involved in the area of data management?
Can you describe what Komprise is and the story behind it?
Who are the target customers of the Komprise platform?
What are the core use cases that you are focused on supporting?
How would you characterize the common approaches to managing file storage solutions for hybrid cloud environments?
What are some of the shortcomings of the enterprise storage providers’ methods for managing storage tiers when trying to use that data for analytical workloads?
Given the growth in popularity and capabilities of cloud solutions, how have you approached the strategic positioning of your product to capitalize on the market?
Can you describe how the Komprise platform is architected?
What are some of the most complex considerations that you have had to engineer for when dealing with enterprise data distribution in hybrid cloud environments?
What are the data replication and consistency guarantees that you are able to offer while spanning across on-premise and cloud systems/block and object storage? (e.g. eventual consistency vs. read-after-write, low latency replication on data changes vs. scheduled syncing, etc.)
How do you determine and validate the heuristics that you use for understanding how/when to distribute files across storage systems?
How does the specific workload that you are powering influence the specific operations/capabilities that your customers take advantage of?
What are the most interesting, innovative, or unexpected ways that you have seen Komprise used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on Komprise?
When is Komprise the wrong choice?
What do you have planned for the future of Komprise?
Contact Info
LinkedIn
@cloudKrishna on Twitter
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Links
Komprise
Unstruk
Podcast Episode
SMB
NFS
S3
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
2/28/2022 • 54 minutes, 46 seconds
Build Your Python Data Processing Your Way And Run It Anywhere With Fugue
Summary
Python has grown to be one of the top languages used for all aspects of data, from collection and cleaning, to analysis and machine learning. Along with that growth has come an explosion of tools and engines that help power these workflows, which introduces a great deal of complexity when scaling from single machines and exploratory development to massively parallel distributed computation. In answer to that challenge, the Fugue project offers an interface to automatically translate across Pandas, Spark, and Dask execution environments without having to modify your logic. In this episode core contributor Kevin Kho explains how the slight differences in the underlying engines can lead to big problems, how Fugue works to hide those differences from the developer, and how you can start using it in your own work today.
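As a small sketch of that idea, the example below runs ordinary Pandas logic through Fugue's transform() and notes how the same call can be pointed at Dask or Spark; the data and column names are made up for illustration.

```python
# Minimal sketch: the business logic is plain Pandas and knows nothing about
# the execution engine; column names are hypothetical.
import pandas as pd
from fugue import transform

def add_doubled(df: pd.DataFrame) -> pd.DataFrame:
    df["doubled"] = df["value"] * 2
    return df

data = pd.DataFrame({"value": [1, 2, 3]})

# Run locally on Pandas (the default engine).
result = transform(data, add_doubled, schema="*,doubled:long")
print(result)

# The same call can target a distributed engine without changing add_doubled,
# e.g. transform(data, add_doubled, schema="*,doubled:long", engine="dask")
# or by passing an active SparkSession as the engine.
```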
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
Atlan is a collaborative workspace for data-driven teams, like Github for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $3000 on an annual subscription
The only thing worse than having bad data is not knowing that you have it. With Bigeye’s data observability platform, if there is an issue with your data or data pipelines you’ll know right away and can get it fixed before the business is impacted. Bigeye lets data teams measure, improve, and communicate the quality of your data to company stakeholders. With complete API access, a user-friendly interface, and automated yet flexible alerting, you’ve got everything you need to establish and maintain trust in your data. Go to dataengineeringpodcast.com/bigeye today to sign up and start trusting your analyses.
Every data project starts with collecting the information that will provide answers to your questions or inputs to your models. The web is the largest trove of information on the planet and Oxylabs helps you unlock its potential. With the Oxylabs scraper APIs you can extract data from even javascript heavy websites. Combined with their residential proxies you can be sure that you’ll have reliable and high quality data whenever you need it. Go to dataengineeringpodcast.com/oxylabs today and use code DEP25 to get your special discount on residential proxies.
Your host is Tobias Macey and today I’m interviewing Kevin Kho about Fugue, a library that offers a unified interface for distributed computing that lets users execute Python, pandas, and SQL code on Spark and Dask without rewrites
Interview
Introduction
How did you get involved in the area of data management?
Can you describe what Fugue is and the story behind it?
What are the core goals of the Fugue project?
Who are the target users for Fugue and how does that influence the feature priorities and API design?
How does Fugue compare to projects such as Modin, etc. for abstracting over the execution engine?
What are some of the sharp edges that contribute to the engineering effort required to migrate from a single machine to Spark or Dask?
What are some of the determining factors that will influence the decision of whether to use Pandas, Spark, or Dask?
Can you describe how Fugue is implemented?
How have the design and goals of the project changed or evolved since you started working on it?
How do you ensure the consistency of logic across execution engines?
Can you describe the workflow of integrating Fugue into an existing or greenfield project?
How have you approached the work of automating logic optimization across execution contexts?
What are some of the risks or error conditions that you have to guard against?
How do you manage validation of those optimizations, particularly as the different engines release new versions or capabilities?
What are the most interesting, innovative, or unexpected ways that you have seen Fugue used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on Fugue?
When is Fugue the wrong choice?
What do you have planned for the future of Fugue?
Contact Info
LinkedIn
Email
Fugue Twitter
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Links
Fugue
Fugue Tutorials
Prefect
Podcast Episode
Bodo
Podcast Episode
Pandas
DuckDB
Koalas
Dask
Podcast Episode
Spark
Modin
Podcast.__init__ Episode
Fugue SQL
Flink
PyCaret
ANTLR
OmniSci
Ibis
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
2/21/2022 • 1 hour, 1 minute, 7 seconds
Understanding The Immune System With Data At ImmunAI
Summary
The life sciences as an industry has seen incredible growth in scale and sophistication, along with the advances in data technology that make it possible to analyze massive amounts of genomic information. In this episode Guy Yachdav, director of software engineering for ImmunAI, shares the complexities that are inherent to managing data workflows for bioinformatics. He also explains how he has architected the systems that ingest, process, and distribute the data that he is responsible for and the requirements that are introduced when collaborating with researchers, domain experts, and machine learning developers.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
RudderStack’s smart customer data pipeline is warehouse-first. It builds your customer data warehouse and your identity graph on your data warehouse, with support for Snowflake, Google BigQuery, Amazon Redshift, and more. Their SDKs and plugins make event streaming easy, and their integrations with cloud applications like Salesforce and ZenDesk help you go beyond event streaming. With RudderStack you can use all of your customer data to answer more difficult questions and then send those insights to your whole customer data stack. Sign up free at dataengineeringpodcast.com/rudder today.
Today’s episode is Sponsored by Prophecy.io – the low-code data engineering platform for the cloud. Prophecy provides an easy-to-use visual interface to design & deploy data pipelines on Apache Spark & Apache Airflow. Now all the data users can use software engineering best practices – git, tests and continuous deployment with a simple to use visual designer. How does it work? – You visually design the pipelines, and Prophecy generates clean Spark code with tests on git; then you visually schedule these pipelines on Airflow. You can observe your pipelines with built in metadata search and column level lineage. Finally, if you have existing workflows in AbInitio, Informatica or other ETL formats that you want to move to the cloud, you can import them automatically into Prophecy making them run productively on Spark. Create your free account today at dataengineeringpodcast.com/prophecy.
Your host is Tobias Macey and today I’m interviewing Guy Yachdav, Director of Software Engineering at Immunai, about his work at Immunai to wrangle biological data for advancing research into the human immune system.
Interview
Introduction (see Guy’s bio below)
How did you get involved in the area of data management?
Can you describe what Immunai is and the story behind it?
What are some of the categories of information that you are working with?
What kinds of insights are you trying to power/questions that you are trying to answer with that data?
Who are the stakeholders that you are working with and how does that influence your approach to the integration/transformation/presentation of the data?
What are some of the challenges unique to the biological data domain that you have had to address?
What are some of the limitations in the off-the-shelf tools when applied to biological data?
How have you approached the selection of tools/techniques/technologies to make your work maintainable for your engineers and accessible for your end users?
Can you describe the platform architecture that you are using to support your stakeholders?
What are some of the constraints or requirements (e.g. regulatory, security, etc.) that you need to account for in the design?
What are some of the ways that you make your data accessible to AI/ML engineers?
What are the most interesting, innovative, or unexpected ways that you have seen Immunai used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working at Immunai?
What do you have planned for the future of the Immunai data platform?
Contact Info
LinkedIn
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Links
ImmunAI
Apache Arrow
Columbia Genome Center
Dagster
Podcast Episode
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
2/21/2022 • 43 minutes, 7 seconds
Bring Your Code To Your Streaming And Static Data Without Effort With The Deephaven Real Time Query Engine
Summary
Streaming data sources are becoming more widely available as the tools to handle their storage and distribution mature. However, it is still a challenge to analyze this data as it arrives, while supporting integration with static data in a unified syntax. Deephaven is a project that was designed from the ground up to offer an intuitive way for you to bring your code to your data, whether it is streaming or static, without having to know which is which. In this episode Pete Goddard, founder and CEO of Deephaven, shares his journey with the technology that powers the platform and how he and his team are pouring their energy into the community edition so that you can use it freely in your own work.
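For a flavor of the unified syntax for streaming and static tables, here is a heavily hedged sketch using the Deephaven Community Edition Python API; module paths and query-language details can vary by release, and the CSV path and column names are hypothetical.

```python
# Hedged sketch meant to run inside a Deephaven server session; the API and
# formula syntax may differ across releases, and the file path is hypothetical.
from deephaven import time_table, read_csv

# A streaming table that appends one row per second.
ticking = time_table("PT1S").update([
    "Sym = i % 2 == 0 ? `AAPL` : `MSFT`",  # backticks are Deephaven string literals
    "Price = random()",
])

# A static reference table loaded from disk.
reference = read_csv("/data/symbols.csv")

# The same join works whether the inputs are streaming or static, and the
# result keeps updating as new rows arrive.
enriched = ticking.natural_join(reference, on=["Sym"])
```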
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
Atlan is a collaborative workspace for data-driven teams, like Github for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $3000 on an annual subscription
StreamSets DataOps Platform is the world’s first single platform for building smart data pipelines across hybrid and multi-cloud architectures. Build, run, monitor and manage data pipelines confidently with an end-to-end data integration platform that’s built for constant change. Amp up your productivity with an easy-to-navigate interface and 100s of pre-built connectors. And, get pipelines and new hires up and running quickly with powerful, reusable components that work across batch and streaming. Once you’re up and running, your smart data pipelines are resilient to data drift. Those ongoing and unexpected changes in schema, semantics, and infrastructure. Finally, one single pane of glass for operating and monitoring all your data pipelines. The full transparency and control you desire for your data operations. Get started building pipelines in minutes for free at dataengineeringpodcast.com/streamsets. The first 10 listeners of the podcast that subscribe to StreamSets’ Professional Tier, receive 2 months free after their first month.
Your host is Tobias Macey and today I’m interviewing Pete Goddard about his work at Deephaven, a query engine optimized for manipulating and merging streaming and static data
Interview
Introduction
How did you get involved in the area of data management?
Can you describe what Deephaven is and the story behind it?
What is the role of Deephaven in the context of an organization’s data platform?
What are the upstream and downstream systems and teams that it is likely to be integrated with?
Who are the target users of Deephaven and how does that influence the feature priorities and design of the platform?
comparison of use cases/experience with Materialize
What are the different components that comprise the suite of functionality in Deephaven?
How have you architected the system?
What are some of the ways that the goals/design of the platform have changed or evolved since you started working on it?
What are some of the impedance mismatches that you have had to address between supporting different language environments and data access patterns? (e.g. batch/streaming/ML and Python/Java/R)
Can you describe some common workflows that a data engineer might build with Deephaven?
What are the avenues for collaboration across data roles and stakeholders?
licensing choice/governance model
What are the most interesting, innovative, or unexpected ways that you have seen Deephaven used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on Deephaven?
When is Deephaven the wrong choice?
What do you have planned for the future of Deephaven?
Contact Info
@pete_paco on Twitter
@deephaven on Twitter
LinkedIn
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Links
Deephaven
GitHub
Materialize
Podcast Episode
Arrow Flight
kSQLDB
Podcast Episode
Redpanda
Podcast Episode
Pandas
Podcast Episode
NumPy
Numba
Barrage
Debezium
Podcast Episode
JPy
Sabermetrics
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
2/14/2022 • 1 hour, 2 minutes, 5 seconds
Build Your Own End To End Customer Data Platform With Rudderstack
Summary
Collecting, integrating, and activating data are all challenging activities. When that data pertains to your customers, it can become even more complex. To simplify the work of managing the full flow of your customer data and keep you in full control, the team at Rudderstack created their eponymous open source platform that allows you to work with first- and third-party data, as well as build and manage reverse ETL workflows. In this episode CEO and founder Soumyadeb Mitra explains how Rudderstack compares to the various other tools and platforms that share some overlap, how to set it up for your own data needs, and how it is architected to scale to meet demand.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
Today’s episode is Sponsored by Prophecy.io – the low-code data engineering platform for the cloud. Prophecy provides an easy-to-use visual interface to design & deploy data pipelines on Apache Spark & Apache Airflow. Now all the data users can use software engineering best practices – git, tests and continuous deployment with a simple to use visual designer. How does it work? – You visually design the pipelines, and Prophecy generates clean Spark code with tests on git; then you visually schedule these pipelines on Airflow. You can observe your pipelines with built in metadata search and column level lineage. Finally, if you have existing workflows in AbInitio, Informatica or other ETL formats that you want to move to the cloud, you can import them automatically into Prophecy making them run productively on Spark. Create your free account today at dataengineeringpodcast.com/prophecy.
The only thing worse than having bad data is not knowing that you have it. With Bigeye’s data observability platform, if there is an issue with your data or data pipelines, you’ll know right away and can get it fixed before the business is impacted. Bigeye lets data teams measure, improve, and communicate the quality of your data to company stakeholders. With complete API access, a user-friendly interface, and automated yet flexible alerting, you’ve got everything you need to establish and maintain trust in your data. Go to dataengineeringpodcast.com/bigeye today to sign up and start trusting your analyses.
Your host is Tobias Macey and today I’m interviewing Soumyadeb Mitra about his experience as the founder of Rudderstack and its role in your data platform
Interview
Introduction
How did you get involved in the area of data management?
Can you describe what Rudderstack is and the story behind it?
What are the main use cases that Rudderstack is designed to support?
Who are the target users of Rudderstack?
How does the availability of the managed cloud service change the user profiles that you can target?
How do these user profiles influence your focus and prioritization of features and user experience?
How would you characterize the position of Rudderstack in the current data ecosystem?
What other tools/systems might you replace with Rudderstack?
How do you think about the application of Rudderstack compared to tools for data integration (e.g. Singer, Stitch, Fivetran) and reverse ETL (e.g. Grouparoo, Hightouch, Census)?
Can you describe how the Rudderstack platform is designed and implemented?
How have the goals/design/use cases of Rudderstack changed or evolved since you first started working on it?
What are the different extension points available for engineers to extend and customize Rudderstack?
Working with customer data is a core capability in Rudderstack. How do you manage the identity resolution of users as they transition back and forth between anonymous and identified?
What are some of the data privacy primitives that you include to assist with data security/regulatory concerns?
What is the process of getting started with Rudderstack as a software or data platform engineer?
What are some of the operational challenges related to running your own deployment of Rudderstack?
What are some of the overlooked/underemphasized capabilities of Rudderstack?
How have you approached the governance model/boundaries between OSS and commercial for Rudderstack?
What are the most interesting, innovative, or unexpected ways that you have seen Rudderstack used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on Rudderstack?
When is Rudderstack the wrong choice?
What do you have planned for the future of Rudderstack?
Contact Info
LinkedIn
@soumyadeb_mitra on Twitter
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Links
Rudderstack
Hadoop
Spark
Segment
Podcast Episode
Grouparoo
Podcast Episode
Fivetran
Podcast Episode
Stitch
Singer
Podcast Episode
Census
Podcast Episode
Hightouch
Podcast Episode
LiveRamp
Airbyte
Podcast Episode
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
2/14/2022 • 47 minutes, 34 seconds
Scale Your Spatial Analysis By Building It In SQL With Syntax Extensions
Summary
Along with the globalization of our societies comes the need to analyze the geospatial and geotemporal data required to manage the growth in commerce, communications, and other activities. In order to make geospatial analytics more maintainable and scalable, there has been an increase in the number of database engines that provide extensions to their SQL syntax to support manipulation of spatial data. In this episode Matthew Forrest shares his experience of working in the domain of geospatial analytics and the application of SQL dialects to his analysis.
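As a small illustration of the kind of spatial predicate that these SQL extensions expose (for example ST_Contains in PostGIS), here is a minimal Python sketch using the shapely library; the zone, the delivery points, and the table and column names in the comment are invented for the example.

    from shapely.geometry import Point, Polygon

    # Hypothetical service zone and delivery points. In spatial SQL this would look
    # something like: SELECT d.id FROM deliveries d JOIN zones z ON ST_Contains(z.geom, d.geom)
    zone = Polygon([(0, 0), (4, 0), (4, 4), (0, 4)])
    deliveries = {"a": Point(1, 1), "b": Point(5, 2), "c": Point(3, 3)}

    # zone.contains(point) is the same containment predicate that ST_Contains evaluates in the database.
    inside = {name: zone.contains(point) for name, point in deliveries.items()}
    print(inside)  # {'a': True, 'b': False, 'c': True}

The point of pushing this logic into SQL is that the join and the predicate run next to the data instead of in a separate analysis environment.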
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
Atlan is a collaborative workspace for data-driven teams, like Github for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $3000 on an annual subscription
StreamSets DataOps Platform is the world’s first single platform for building smart data pipelines across hybrid and multi-cloud architectures. Build, run, monitor and manage data pipelines confidently with an end-to-end data integration platform that’s built for constant change. Amp up your productivity with an easy-to-navigate interface and 100s of pre-built connectors. And, get pipelines and new hires up and running quickly with powerful, reusable components that work across batch and streaming. Once you’re up and running, your smart data pipelines are resilient to data drift: those ongoing and unexpected changes in schema, semantics, and infrastructure. Finally, one single pane of glass for operating and monitoring all your data pipelines: the full transparency and control you desire for your data operations. Get started building pipelines in minutes for free at dataengineeringpodcast.com/streamsets. The first 10 listeners of the podcast that subscribe to StreamSets’ Professional Tier receive 2 months free after their first month.
Your host is Tobias Macey and today I’m interviewing Matthew Forrest about doing spatial analysis in SQL
Interview
Introduction
How did you get involved in the area of data management?
Can you describe what spatial SQL is and some of the use cases that it is relevant for?
compatibility with/comparison to syntax from PostGIS
What is involved in implementing spatial logic in database engines?
mapping geospatial concepts into declarative syntax
foundational data types
data modeling
workflow for analyzing spatial data sets outside of database engines
translating from e.g. geopandas to SQL
level of support in database engines for spatial data types
What are the most interesting, innovative, or unexpected ways that you have seen spatial SQL used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working with spatial SQL?
When is SQL the wrong choice for spatial analysis?
What do you have planned for the future of spatial analytics support in SQL for the Carto platform?
Contact Info
LinkedIn
Website
@mbforr on Twitter
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Links
Carto
Spatial SQL Blog Post
Spatial Analysis
PostGIS
QGIS
KML
Shapefile
GeoJSON
Paul Ramsey’s Blog
Norwegian SOSI
GDAL
Google Cloud Dataflow
GeoBEAM
Carto Data Observatory
WGS84 Projection
EPSG Code
PySAL
GeoMesa
Uber H3 Spatial Indexing
PGRouting
Spatialite
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
2/7/2022 • 59 minutes, 54 seconds
Scalable Strategies For Protecting Data Privacy In Your Shared Data Sets
Summary
There are many dimensions to the work of protecting the privacy of users in our data. When you need to share a data set with other teams, departments, or businesses, it is of utmost importance that you eliminate or obfuscate personal information. In this episode Will Thompson explores the many ways that sensitive data can be leaked, re-identified, or otherwise put at risk, as well as the different strategies that can be employed to mitigate those attack vectors. He also explains how he and his team at Privacy Dynamics are working to make those strategies more accessible to organizations so that you can focus on all of the other tasks required of you.
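As a simplified illustration of two of the anonymization strategies discussed in the episode, pseudonymizing direct identifiers and generalizing quasi-identifiers, here is a minimal Python sketch; the fields, the salt handling, and the bucketing rules are invented for the example and are not how Privacy Dynamics implements it.

    import hashlib

    SALT = "replace-with-a-secret-salt"  # assumption: the salt is managed outside the shared dataset

    def pseudonymize(value: str) -> str:
        # One-way hash of a direct identifier; consistent across records so joins still work.
        return hashlib.sha256((SALT + value).encode()).hexdigest()[:16]

    def generalize(record: dict) -> dict:
        # Coarsen quasi-identifiers that could be combined to re-identify someone.
        decade = (record["age"] // 10) * 10
        return {
            "user": pseudonymize(record["email"]),
            "age_band": f"{decade}-{decade + 9}",
            "zip_prefix": record["zip"][:3] + "**",
            "plan": record["plan"],  # non-identifying attribute kept as-is
        }

    raw = {"email": "jane@example.com", "age": 34, "zip": "02139", "plan": "pro"}
    print(generalize(raw))

The hard part, as the episode discusses, is judging how much generalization preserves statistical utility while still blocking re-identification when other data sets are joined in.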
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
Today’s episode is Sponsored by Prophecy.io – the low-code data engineering platform for the cloud. Prophecy provides an easy-to-use visual interface to design & deploy data pipelines on Apache Spark & Apache Airflow. Now all the data users can use software engineering best practices – git, tests and continuous deployment with a simple to use visual designer. How does it work? – You visually design the pipelines, and Prophecy generates clean Spark code with tests on git; then you visually schedule these pipelines on Airflow. You can observe your pipelines with built in metadata search and column level lineage. Finally, if you have existing workflows in AbInitio, Informatica or other ETL formats that you want to move to the cloud, you can import them automatically into Prophecy making them run productively on Spark. Create your free account today at dataengineeringpodcast.com/prophecy.
The only thing worse than having bad data is not knowing that you have it. With Bigeye’s data observability platform, if there is an issue with your data or data pipelines, you’ll know right away and can get it fixed before the business is impacted. Bigeye lets data teams measure, improve, and communicate the quality of your data to company stakeholders. With complete API access, a user-friendly interface, and automated yet flexible alerting, you’ve got everything you need to establish and maintain trust in your data. Go to dataengineeringpodcast.com/bigeye today to sign up and start trusting your analyses.
Your host is Tobias Macey and today I’m interviewing Will Thompson about managing data privacy concerns for data sets used in analytics and machine learning
Interview
Introduction
How did you get involved in the area of data management?
Data privacy is a multi-faceted problem domain. Can you start by enumerating the different categories of privacy concern that are involved in analytical use cases?
Can you describe what Privacy Dynamics is and the story behind it?
Which categor(y|ies) are you focused on addressing?
What are some of the best practices in the definition, protection, and enforcement of data privacy policies?
Is there a data security/privacy equivalent to the OWASP top 10?
What are some of the techniques that are available for anonymizing data while maintaining statistical utility/significance?
What are some of the engineering/systems capabilities that are required for data (platform) engineers to incorporate these practices in their platforms?
What are the tradeoffs of encryption vs. obfuscation when anonymizing data?
What are some of the types of PII that are non-obvious?
What are the risks associated with data re-identification, and what are some of the vectors that might be exploited to achieve that?
How can privacy risks mitigation be maintained as new data sources are introduced that might contribute to these re-identification vectors?
Can you describe how Privacy Dynamics is implemented?
What are the most challenging engineering problems that you are dealing with?
How do you approach validation of a data set’s privacy?
What have you found to be useful heuristics for identifying private data?
What are the risks of false positives vs. false negatives?
Can you describe what is involved in integrating the Privacy Dynamics system into an existing data platform/warehouse?
What would be required to integrate with systems such as Presto, Clickhouse, Druid, etc.?
What are the most interesting, innovative, or unexpected ways that you have seen Privacy Dynamics used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on Privacy Dynamics?
When is Privacy Dynamics the wrong choice?
What do you have planned for the future of Privacy Dynamics?
Contact Info
LinkedIn
@willseth on Twitter
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Links
Privacy Dynamics
Pandas
Podcast Episode – Pandas For Data Engineering
Homomorphic Encryption
Differential Privacy
Immuta
Podcast Episode
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
2/6/2022 • 1 hour, 6 seconds
A Reflection On Learning A Lot More Than 97 Things Every Data Engineer Should Know
Summary
The Data Engineering Podcast has been going for five years now and has included conversations and interviews with a huge number of guests, covering a broad range of topics. In addition to that, the host curated the essays contained in the book "97 Things Every Data Engineer Should Know", using the knowledge and context gained from running the show to inform the selection process. In this episode he shares some reflections on producing the podcast, compiling the book, and relevant trends in the ecosystem of data engineering. He also provides some advice for those who are early in their career of data engineering and looking to advance in their roles.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
Atlan is a collaborative workspace for data-driven teams, like Github for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $3000 on an annual subscription
StreamSets DataOps Platform is the world’s first single platform for building smart data pipelines across hybrid and multi-cloud architectures. Build, run, monitor and manage data pipelines confidently with an end-to-end data integration platform that’s built for constant change. Amp up your productivity with an easy-to-navigate interface and 100s of pre-built connectors. And, get pipelines and new hires up and running quickly with powerful, reusable components that work across batch and streaming. Once you’re up and running, your smart data pipelines are resilient to data drift: those ongoing and unexpected changes in schema, semantics, and infrastructure. Finally, one single pane of glass for operating and monitoring all your data pipelines: the full transparency and control you desire for your data operations. Get started building pipelines in minutes for free at dataengineeringpodcast.com/streamsets. The first 10 listeners of the podcast that subscribe to StreamSets’ Professional Tier receive 2 months free after their first month.
Your host is Tobias Macey and today I’m doing something a bit different. I’m going to talk about some of the lessons that I have learned while running the podcast, compiling the book "97 Things Every Data Engineer Should Know", and some of the themes that I’ve observed throughout.
Interview
Introduction
How did you get involved in the area of data management?
Overview of the 97 things book
How the project came about
Goals of the book
What are the paths into data engineering?
What are some of the macroscopic themes in the industry?
What are some of the microscopic details that are useful/necessary to succeed as a data engineer?
What are some of the career/team/organizational details that are helpful for data engineers?
What are the most interesting, innovative, or unexpected outcomes/feedback that I have seen from running the podcast and working on the book?
What are the most interesting, unexpected, or challenging lessons that I have learned while working on the Data Engineering Podcast and 97 things book?
What do I have planned for the future of the podcast?
Contact Info
LinkedIn
Email
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
97 Things Every Data Engineer Should Know
Buy on Amazon (affiliate link)
Read on O’Reilly Learning
O’Reilly Learning 30 Day Free Trial
Podcast.__init__
Pipeline Academy data engineering bootcamp
Podcast Episode
Hadoop
Object Relational Mapper (ORM)
Singer
Podcast Episode
Airbyte
Podcast Episode
Data Mesh
Podcast Episode
Data Contracts Episode
Designing Data Intensive Applications
Data Council
2022 Conference
Data Engineering Weekly Newsletter
Data Mesh Learning
MLOps Community
Analytics Engineering Newsletter
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
1/31/2022 • 41 minutes, 35 seconds
Effective Pandas Patterns For Data Engineering
Summary
Pandas is a powerful tool for cleaning, transforming, manipulating, or enriching data, among many other potential uses. As a result it has become a standard tool for data engineers for a wide range of applications. Matt Harrison is a Python expert with a long history of working with data who now spends his time on consulting and training. He recently wrote a book on effective patterns for Pandas code, and in this episode he shares advice on how to write efficient data processing routines that will scale with your data volumes, while being understandable and maintainable.
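As a rough illustration of the style of Pandas code discussed in the episode (see the method chaining link below), here is a minimal sketch; the table, column names, and values are made up for the example and it is not taken from Matt Harrison's book.

    import pandas as pd

    # Hypothetical order data standing in for a real source table.
    orders = pd.DataFrame({
        "order_id": [1, 2, 3, 4],
        "region": ["east", "west", "east", "west"],
        "amount": ["10.5", "20.0", "7.25", "12.0"],  # numeric data arriving as strings
        "ordered_at": ["2022-01-03", "2022-01-04", "2022-01-05", "2022-01-05"],
    })

    # One chained expression: cast to efficient dtypes, filter, then aggregate,
    # so the steps read top to bottom and no intermediate frame is mutated.
    summary = (
        orders
        .assign(
            amount=lambda df: pd.to_numeric(df["amount"]),
            ordered_at=lambda df: pd.to_datetime(df["ordered_at"]),
            region=lambda df: df["region"].astype("category"),
        )
        .query("amount > 8")
        .groupby("region", observed=True)["amount"]
        .sum()
        .reset_index()
    )

    print(summary)

Chaining keeps each transformation step visible and avoids mutating intermediate frames, which makes the logic easier to test, review, and hand off to data scientists.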
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
Today’s episode is Sponsored by Prophecy.io – the low-code data engineering platform for the cloud. Prophecy provides an easy-to-use visual interface to design & deploy data pipelines on Apache Spark & Apache Airflow. Now all the data users can use software engineering best practices – git, tests and continuous deployment with a simple to use visual designer. How does it work? – You visually design the pipelines, and Prophecy generates clean Spark code with tests on git; then you visually schedule these pipelines on Airflow. You can observe your pipelines with built in metadata search and column level lineage. Finally, if you have existing workflows in AbInitio, Informatica or other ETL formats that you want to move to the cloud, you can import them automatically into Prophecy making them run productively on Spark. Create your free account today at dataengineeringpodcast.com/prophecy.
The only thing worse than having bad data is not knowing that you have it. With Bigeye’s data observability platform, if there is an issue with your data or data pipelines, you’ll know right away and can get it fixed before the business is impacted. Bigeye lets data teams measure, improve, and communicate the quality of your data to company stakeholders. With complete API access, a user-friendly interface, and automated yet flexible alerting, you’ve got everything you need to establish and maintain trust in your data. Go to dataengineeringpodcast.com/bigeye today to sign up and start trusting your analyses.
Your host is Tobias Macey and today I’m interviewing Matt Harrison about useful tips for using Pandas for data engineering projects
Interview
Introduction
How did you get involved in the area of data management?
What are the main tasks that you have seen Pandas used for in a data engineering context?
What are some of the common mistakes that can lead to poor performance when scaling to large data sets?
What are some of the utility features that you have found most helpful for data processing?
One of the interesting add-ons to Pandas is its integration with Arrow. What are some of the considerations for how and when to use the Arrow capabilities vs. out-of-the-box Pandas?
Pandas is a tool that spans data processing and data science. What are some of the ways that data engineers should think about writing their code to make it accessible to data scientists for supporting collaboration across data workflows?
Pandas is often used for transformation logic. What are some of the ways that engineers should approach the design of their code to make it understandable and maintainable?
How can data engineers support testing their transformations?
There are a number of projects that aim to scale Pandas logic across cores and clusters. What are some of the considerations for when to use one of these tools, and how to select the proper framework? (e.g. Dask, Modin, Ray, etc.)
What are some anti-patterns that engineers should guard against when using Pandas for data processing?
What are the most interesting, innovative, or unexpected ways that you have seen Pandas used for data processing?
When is Pandas the wrong choice for data processing?
What are some of the projects related to Pandas that you are keeping an eye on?
Contact Info
@__mharrison__ on Twitter
metasnake
Effective Pandas Bundle (affiliate link with 20% discount code applied)
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Links
Metasnake
Snowflake Schema
OLAP
Panel Data
NumPy
Dask
Podcast Episode
Parquet
Arrow
Feather
Zen of Python
Joel Grus’ I Don’t Like Notebooks presentation
Pandas Method Chaining
Effective Pandas Book (affiliate link with 20% discount code applied)
Podcast.__init__ Episode
pytest
Podcast.__init__ Episode
Great Expectations
Podcast Episode
Hypothesis
Podcast.__init__ Episode
Papermill
Podcast Episode
Jupytext
Koalas
Modin
Podcast.__init__ Episode
Spark
Ray
Podcast.__init__ Episode
Spark Pandas API
Vaex
Rapids
Terality
H2O
H2O DataTable
Fugue
Ibis
Multi-process Pandas
PandaPy
Polars
Google Colab
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
1/31/2022 • 1 hour, 21 seconds
The Importance Of Data Contracts As The Interface For Data Integration With Abhi Sivasailam
Summary
Data platforms are exemplified by a complex web of connections that are subject to constantly evolving requirements. In order to make this a tractable problem, it is necessary to define boundaries for communication between concerns, which brings with it the need to establish interface contracts for communicating across those boundaries. The recent move toward the data mesh as a formalized architecture that builds on this design provides the language that data teams need to make this a more organized effort. In this episode Abhi Sivasailam shares his experience designing and implementing a data mesh solution with his team at Flexport, and the importance of defining and enforcing data contracts that are implemented at those domain boundaries.
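As a rough sketch of what a data contract can look like at a domain boundary, here is a minimal Python example that validates an event payload against a declared schema before it crosses that boundary; the event shape and field names are hypothetical and this is not Flexport's implementation.

    # Hypothetical contract for an "order_created" event published by the orders domain.
    ORDER_CREATED_CONTRACT = {
        "order_id": str,
        "customer_id": str,
        "amount_cents": int,
        "currency": str,
    }

    def validate_event(event: dict, contract: dict) -> list[str]:
        # Return a list of violations; an empty list means the payload honors the contract.
        errors = []
        for field, expected_type in contract.items():
            if field not in event:
                errors.append(f"missing field: {field}")
            elif not isinstance(event[field], expected_type):
                errors.append(
                    f"{field}: expected {expected_type.__name__}, got {type(event[field]).__name__}"
                )
        return errors

    event = {"order_id": "o-123", "customer_id": "c-9", "amount_cents": "1999", "currency": "USD"}
    print(validate_event(event, ORDER_CREATED_CONTRACT))
    # ['amount_cents: expected int, got str']

Running a check like this in the producing domain's pipeline is one way to surface contract violations before downstream consumers ever see them.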
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
Atlan is a collaborative workspace for data-driven teams, like Github for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $3000 on an annual subscription
StreamSets DataOps Platform is the world’s first single platform for building smart data pipelines across hybrid and multi-cloud architectures. Build, run, monitor and manage data pipelines confidently with an end-to-end data integration platform that’s built for constant change. Amp up your productivity with an easy-to-navigate interface and 100s of pre-built connectors. And, get pipelines and new hires up and running quickly with powerful, reusable components that work across batch and streaming. Once you’re up and running, your smart data pipelines are resilient to data drift: those ongoing and unexpected changes in schema, semantics, and infrastructure. Finally, one single pane of glass for operating and monitoring all your data pipelines: the full transparency and control you desire for your data operations. Get started building pipelines in minutes for free at dataengineeringpodcast.com/streamsets. The first 10 listeners of the podcast that subscribe to StreamSets’ Professional Tier receive 2 months free after their first month.
Your host is Tobias Macey and today I’m interviewing Abhi Sivasailam about the different social and technical interfaces available for defining and enforcing data contracts
Interview
Introduction
How did you get involved in the area of data management?
Can you start by explaining what your working definition of a "data contract" is?
What are the goals and purpose of these contracts?
What are the locations and methods of defining a data contract?
What kind of information needs to be encoded in a contract definition?
How do you manage enforcement of contracts?
manifestations of contracts in data mesh implementation
ergonomics (technical and social) of data contracts and how to prevent them from prohibiting productivity
What are the most interesting, innovative, or unexpected approaches to data contracts that you have seen?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on data contract implementation?
When are data contracts the wrong choice?
Contact Info
LinkedIn
@_abhisivasailam on Twitter
Website
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Links
Flexport
Debezium
Podcast Episode
Data Mesh At Flexport Presentation
Data Mesh
Podcast Episode
Column Names As Contracts podcast episode with Emily Riederer
dbtplyr
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
1/23/2022 • 56 minutes
Building And Managing Data Teams And Data Platforms In Large Organizations With Ashish Mrig
Summary
Data engineering is a relatively young and rapidly expanding field, with practitioners having a wide array of experiences as they navigate their careers. Ashish Mrig currently leads the data analytics platform for Wayfair, as well as running a local data engineering meetup. In this episode he shares his career journey, the challenges related to management of data professionals, and the platform design that he and his team have built to power analytics at a large company. He also provides some excellent insights into the factors that play into the build vs. buy decision at different organizational sizes.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
Today’s episode is Sponsored by Prophecy.io – the low-code data engineering platform for the cloud. Prophecy provides an easy-to-use visual interface to design & deploy data pipelines on Apache Spark & Apache Airflow. Now all the data users can use software engineering best practices – git, tests and continuous deployment with a simple to use visual designer. How does it work? – You visually design the pipelines, and Prophecy generates clean Spark code with tests on git; then you visually schedule these pipelines on Airflow. You can observe your pipelines with built in metadata search and column level lineage. Finally, if you have existing workflows in AbInitio, Informatica or other ETL formats that you want to move to the cloud, you can import them automatically into Prophecy making them run productively on Spark. Create your free account today at dataengineeringpodcast.com/prophecy.
The only thing worse than having bad data is not knowing that you have it. With Bigeye’s data observability platform, if there is an issue with your data or data pipelines, you’ll know right away and can get it fixed before the business is impacted. Bigeye lets data teams measure, improve, and communicate the quality of your data to company stakeholders. With complete API access, a user-friendly interface, and automated yet flexible alerting, you’ve got everything you need to establish and maintain trust in your data. Go to dataengineeringpodcast.com/bigeye today to sign up and start trusting your analyses.
Your host is Tobias Macey and today I’m interviewing Ashish Mrig about his path as a data engineer
Interview
Introduction
How did you get involved in the area of data management?
You currently lead a data engineering team at a relatively large company. What are the topics that account for the majority of your time and energy?
What are some of the most valuable lessons that you’ve learned about managing and motivating teams of data professionals?
What has been your most consistent challenge across the different generations of the data ecosystem?
How is your current data platform architected?
Given the current state of the technology and services landscape, how would you approach the design and implementation of a greenfield rebuild of your platform?
What are some of the pitfalls that you have seen data teams encounter most frequently?
You are running a data engineering meetup for your local community in the Boston area. What have been some of the recurring themes that are discussed in those events?
Contact Info
Medium Blog
LinkedIn
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Links
Wayfair
Tivo
InfluxDB
Podcast Interview
BigQuery
AtScale
Podcast Episode
Data Engineering Boston
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
1/23/2022 • 52 minutes, 44 seconds
Automated Data Quality Management Through Machine Learning With Anomalo
Summary
Data quality control is a requirement for being able to trust the various reports and machine learning models that rely on the information that you curate. Rule-based systems are useful for validating known requirements, but with the scale and complexity of data in modern organizations, it is impractical, and often impossible, to manually create rules for all potential errors. The team at Anomalo is building a machine learning powered platform for identifying and alerting on anomalous and invalid changes in your data so that you aren’t flying blind. In this episode founders Elliot Shmukler and Jeremy Stanley explain how they have architected the system to work with your data warehouse and let you know about the critical issues hiding in your data without overwhelming you with alerts.
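As a deliberately simplified illustration of statistical anomaly detection on table metrics (Anomalo's actual models are far more sophisticated, drawing on techniques such as gradient boosted decision trees, as the links below suggest), here is a minimal Python sketch that flags an unusual daily row count; the numbers are invented.

    import statistics

    # Hypothetical daily row counts for a warehouse table; the last value is today's load.
    daily_row_counts = [10_120, 9_980, 10_240, 10_050, 10_310, 9_900, 10_150, 4_200]

    history, latest = daily_row_counts[:-1], daily_row_counts[-1]
    mean = statistics.mean(history)
    stdev = statistics.stdev(history)

    # Flag the latest load if it falls more than 3 standard deviations from recent history.
    z_score = (latest - mean) / stdev
    if abs(z_score) > 3:
        print(f"anomaly: today's row count {latest} (z={z_score:.1f}) vs recent mean {mean:.0f}")

The interesting engineering problems start where this sketch stops: handling seasonality, learning which deviations matter, and avoiding alert fatigue, all themes of the conversation.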
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
Atlan is a collaborative workspace for data-driven teams, like Github for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $3000 on an annual subscription
The only thing worse than having bad data is not knowing that you have it. With Bigeye’s data observability platform, if there is an issue with your data or data pipelines, you’ll know right away and can get it fixed before the business is impacted. Bigeye lets data teams measure, improve, and communicate the quality of your data to company stakeholders. With complete API access, a user-friendly interface, and automated yet flexible alerting, you’ve got everything you need to establish and maintain trust in your data. Go to dataengineeringpodcast.com/bigeye today to sign up and start trusting your analyses.
Your host is Tobias Macey and today I’m interviewing Elliot Shmukler and Jeremy Stanley about Anomalo, a data quality platform aiming to automate issue detection with zero setup
Interview
Introduction
How did you get involved in the area of data management?
Can you describe what Anomalo is and the story behind it?
Managing data quality is ostensibly about building trust in your data. What are the promises that data teams are able to make about the information in their control when they are using Anomalo?
What are some of the claims that cannot be made unequivocally when relying on data quality monitoring systems?
types of data quality issues identified
utility of automated vs programmatic tests
Can you describe how the Anomalo system is designed and implemented?
How have the design and goals of the platform changed or evolved since you started working on it?
What is your approach for validating changes to the business logic in your platform given the unpredictable nature of the system under test?
model training/customization process
statistical model
seasonality/windowing
CI/CD
With any monitoring system the most challenging thing to do is avoid generating alerts that aren’t actionable or helpful. What is your strategy for helping your customers avoid alert fatigue?
What are the most interesting, innovative, or unexpected ways that you have seen Anomalo used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on Anomalo?
When is Anomalo the wrong choice?
What do you have planned for the future of Anomalo?
Contact Info
Elliot
LinkedIn
@eshmu on Twitter
Jeremy
LinkedIn
@jeremystan on Twitter
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Links
Anomalo
Great Expectations
Podcast Episode
Shapley Values
Gradient Boosted Decision Tree
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
1/15/2022 • 1 hour, 2 minutes, 30 seconds
An Introduction To Data And Analytics Engineering For Non-Programmers
Summary
Applications of data have grown well beyond the venerable business intelligence dashboards that organizations have relied on for decades. Now data is being used to power consumer-facing services, influence organizational behaviors, and build sophisticated machine learning systems. Given this increased level of importance, it has become necessary for everyone in the business to treat data as a product, much as treating software as a product drove businesses in the early 2000s. In this episode Brian McMillan shares his work on the book "Building Data Products" and how he is working to educate business users and data professionals about the combination of technical, economic, and business considerations that need to be blended for these projects to succeed.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
Today’s episode is Sponsored by Prophecy.io – the low-code data engineering platform for the cloud. Prophecy provides an easy-to-use visual interface to design & deploy data pipelines on Apache Spark & Apache Airflow. Now all the data users can use software engineering best practices – git, tests and continuous deployment with a simple to use visual designer. How does it work? – You visually design the pipelines, and Prophecy generates clean Spark code with tests on git; then you visually schedule these pipelines on Airflow. You can observe your pipelines with built in metadata search and column level lineage. Finally, if you have existing workflows in AbInitio, Informatica or other ETL formats that you want to move to the cloud, you can import them automatically into Prophecy making them run productively on Spark. Create your free account today at dataengineeringpodcast.com/prophecy.
StreamSets DataOps Platform is the world’s first single platform for building smart data pipelines across hybrid and multi-cloud architectures. Build, run, monitor and manage data pipelines confidently with an end-to-end data integration platform that’s built for constant change. Amp up your productivity with an easy-to-navigate interface and 100s of pre-built connectors. And, get pipelines and new hires up and running quickly with powerful, reusable components that work across batch and streaming. Once you’re up and running, your smart data pipelines are resilient to data drift: those ongoing and unexpected changes in schema, semantics, and infrastructure. Finally, one single pane of glass for operating and monitoring all your data pipelines: the full transparency and control you desire for your data operations. Get started building pipelines in minutes for free at dataengineeringpodcast.com/streamsets. The first 10 listeners of the podcast that subscribe to StreamSets’ Professional Tier receive 2 months free after their first month.
Your host is Tobias Macey and today I’m interviewing Brian McMillan about building data products and his book to introduce the work of data analysts and engineers to non-programmers
Interview
Introduction
How did you get involved in the area of data management?
Can you describe what motivated you to write a book about the work of building data products?
Who is your target audience?
What are the main goals that you are trying to achieve through the book?
What was your approach for determining the structure and contents of the book?
What are the core principles of data engineering that have remained from the original wave of ETL tools and rigid data warehouses?
What are some of the new foundational elements of data products that need to be codified for the next generation of organizations and data professionals?
There is a lot of activity and conversation happening in and around data which can make it difficult to understand which parts are signal and which are noise. What, if anything, do you see as being truly new and/or innovative?
Are there any core lessons or principles that you consider to be at risk of getting drowned out in the current frenzy of activity?
How do the practices for building products with small teams differ from those employed by larger groups?
What do you see as the threshold beyond which a team can no longer be considered "small"?
What are the roles/skills/titles that you view as necessary for building data products in the current phase of maturity for the ecosystem?
What do you see as the biggest risks to engineering and data teams?
What are the most interesting, innovative, or unexpected ways that you have seen the principles in the book used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on the book?
Contact Info
Email
twitter
LinkedIn
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Links
Building Data Products: Introduction to Data and Analytics Engineering for non-programmers
Theory of Constraints
Throughput Economics
"Swaptronics" – The act of swapping out electronic components until you find a combination that works.
Informatica
SSIS – Microsoft SQL Server Integration Services
3X – Kent Beck
Wardley Maps
Vega Lite
Datasette
Why Use Make – Mike Bostock
Building Production Applications Using Go & SQLite
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
1/15/2022 • 50 minutes, 13 seconds
Open Source Reverse ETL For Everyone With Grouparoo
Summary
Reverse ETL is a product category that evolved from the landscape of customer data platforms, with a number of companies offering their own implementation of it. While struggling with the work of automating data integration workflows with marketing, sales, and support tools, Brian Leonard accidentally discovered this need himself and turned it into the open source framework Grouparoo. In this episode he explains why he decided to turn these efforts into an open core business, how the platform is implemented, and the benefits of having an open source contender in the landscape of operational analytics products.
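As a toy illustration of the core reverse ETL loop, diffing warehouse records against a destination and emitting only the changes, here is a minimal Python sketch; the record shapes are invented and this is not how Grouparoo is implemented.

    # Hypothetical rows queried from the warehouse and the last state known to the CRM.
    warehouse_rows = {
        "u1": {"email": "a@example.com", "plan": "pro", "mrr": 49},
        "u2": {"email": "b@example.com", "plan": "free", "mrr": 0},
    }
    crm_state = {
        "u1": {"email": "a@example.com", "plan": "free", "mrr": 0},
    }

    # Core reverse ETL step: diff warehouse truth against the destination and
    # keep only the records that need to be created or updated.
    upserts = {
        user_id: row
        for user_id, row in warehouse_rows.items()
        if crm_state.get(user_id) != row
    }

    for user_id, payload in upserts.items():
        # In a real pipeline this would call the destination's API (for example a CRM contact upsert).
        print(f"upsert {user_id}: {payload}")

Most of the hard work in a production system sits around this loop: scheduling, rate limits, schema mapping, and retries against each destination's API.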
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
StreamSets DataOps Platform is the world’s first single platform for building smart data pipelines across hybrid and multi-cloud architectures. Build, run, monitor and manage data pipelines confidently with an end-to-end data integration platform that’s built for constant change. Amp up your productivity with an easy-to-navigate interface and 100s of pre-built connectors. And, get pipelines and new hires up and running quickly with powerful, reusable components that work across batch and streaming. Once you’re up and running, your smart data pipelines are resilient to data drift: those ongoing and unexpected changes in schema, semantics, and infrastructure. Finally, one single pane of glass for operating and monitoring all your data pipelines: the full transparency and control you desire for your data operations. Get started building pipelines in minutes for free at dataengineeringpodcast.com/streamsets. The first 10 listeners of the podcast that subscribe to StreamSets’ Professional Tier receive 2 months free after their first month.
Atlan is a collaborative workspace for data-driven teams, like Github for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $3000 on an annual subscription
Your host is Tobias Macey and today I’m interviewing Brian Leonard about Grouparoo, an open source framework for managing your reverse ETL pipelines
Interview
Introduction
How did you get involved in the area of data management?
Can you describe what Grouparoo is and the story behind it?
What are the core requirements for building a reverse ETL system?
What are the additional capabilities that users of the system ask for as they get more advanced in their usage?
Who is your target user for Grouparoo and how does that influence your priorities on feature development and UX design?
What are the benefits of building an open source core for a reverse ETL platform as compared to the other commercial options?
Can you describe the architecture and implementation of the Grouparoo project?
What are the additional systems that you have built to support the hosted offering?
How have the design and goals of the project changed since you first started working on it?
What is the workflow for getting Grouparoo deployed and set up with an initial pipeline?
How does Grouparoo handle model and schema evolution and potential mismatch in the data warehouse and destination systems?
What is the process for building a new integration and getting it included in the official list of plugins?
What is your strategy/philosophy around which features are included in the open source vs. hosted/enterprise offerings?
What are the most interesting, innovative, or unexpected ways that you have seen Grouparoo used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on Grouparoo?
When is Grouparoo the wrong choice?
What do you have planned for the future of Grouparoo?
Contact Info
LinkedIn
@bleonard on Twitter
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Links
Grouparoo
GitHub
Task Rabbit
Snowflake
Podcast Episode
Looker
Podcast Episode
Customer Data Platform
Podcast Episode
dbt
Open Source Data Stack Conference
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
1/8/2022 • 44 minutes, 56 seconds
Data Observability Out Of The Box With Metaplane
Summary
Data observability is a set of technical and organizational capabilities related to understanding how your data is being processed and used so that you can proactively identify and fix errors in your workflows. In this episode Metaplane founder Kevin Hu shares his working definition of the term and explains the work that he and his team are doing to cut down on the time to adoption for this new set of practices. He discusses the factors that influenced his decision to start with the data warehouse, the potential shortcomings of that approach, and where he plans to go from there. This is a great exploration of what it means to treat your data platform as a living system and apply state of the art engineering to it.
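One of the simplest examples of the kind of check a data observability tool automates is a volume anomaly test: compare today's row count against recent history and flag large deviations. The sketch below uses an arbitrary z-score threshold and window purely for illustration; it is not Metaplane's actual detection logic.

```python
# Flag a table whose daily row count deviates sharply from its recent baseline.
from statistics import mean, stdev

def is_row_count_anomalous(history: list[int], today: int, z_threshold: float = 3.0) -> bool:
    if len(history) < 7:
        return False  # not enough history to establish a baseline
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        return today != mu
    return abs(today - mu) / sigma > z_threshold

# Example: a sudden drop in rows loaded gets flagged for investigation.
daily_counts = [10_120, 9_980, 10_340, 10_050, 10_210, 9_890, 10_160]
print(is_row_count_anomalous(daily_counts, today=4_200))  # True
```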
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
Today’s episode is Sponsored by Prophecy.io – the low-code data engineering platform for the cloud. Prophecy provides an easy-to-use visual interface to design & deploy data pipelines on Apache Spark & Apache Airflow. Now all the data users can use software engineering best practices – git, tests and continuous deployment with a simple to use visual designer. How does it work? – You visually design the pipelines, and Prophecy generates clean Spark code with tests on git; then you visually schedule these pipelines on Airflow. You can observe your pipelines with built in metadata search and column level lineage. Finally, if you have existing workflows in AbInitio, Informatica or other ETL formats that you want to move to the cloud, you can import them automatically into Prophecy making them run productively on Spark. Create your free account today at dataengineeringpodcast.com/prophecy.
Are you bored with writing scripts to move data into SaaS tools like Salesforce, Marketo, or Facebook Ads? Hightouch is the easiest way to sync data into the platforms that your business teams rely on. The data you’re looking for is already in your data warehouse and BI tools. Connect your warehouse to Hightouch, paste a SQL query, and use their visual mapper to specify how data should appear in your SaaS systems. No more scripts, just SQL. Supercharge your business teams with customer data using Hightouch for Reverse ETL today. Get started for free at dataengineeringpodcast.com/hightouch.
Your host is Tobias Macey and today I’m interviewing Kevin Hu about Metaplane, a platform aiming to provide observability for modern data stacks, from warehouses to BI dashboards and everything in between.
Interview
Introduction
How did you get involved in the area of data management?
Can you describe what Metaplane is and the story behind it?
Data observability is an area that has seen a huge amount of activity over the past couple of years. What is your working definition of that term?
What are the areas of differentiation that you see across vendors in the space?
Can you describe how the Metaplane platform is architected?
How have the design and goals of Metaplane changed or evolved since you started working on it?
establishing seasonality in data metrics
blind spots from operating at the level of the data warehouse
What are the most interesting, innovative, or unexpected ways that you have seen Metaplane used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on Metaplane?
When is Metaplane the wrong choice?
What do you have planned for the future of Metaplane?
Contact Info
LinkedIn
@kevinzhenghu on Twitter
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Links
Metaplane
Datadog
Control Theory
James Clerk Maxwell
Centrifugal Governor
Huygens
Amazon ECS
Stop Hiring Devops Experts (And Start Growing Them)
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
1/8/2022 • 50 minutes, 47 seconds
Creating Shared Context For Your Data Warehouse With A Controlled Vocabulary
Summary
Communication and shared context are the hardest part of any data system. In recent years the focus has been on data catalogs as the means for documenting data assets, but those introduce a secondary system of record in order to find the necessary information. In this episode Emily Riederer shares her work to create a controlled vocabulary for managing the semantic elements of the data managed by her team and encoding it in the schema definitions in her data warehouse. She also explains how she created the dbtplyr package to simplify the work of creating and enforcing your own controlled vocabularies.
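To make the idea of a controlled vocabulary concrete, the sketch below checks column names against an approved set of prefixes that encode a column's type and meaning. The vocabulary shown is an illustrative example in the spirit of the approach discussed here; dbtplyr itself implements this as dbt macros rather than Python code.

```python
# Illustrative check of a column-naming contract: every column name must start
# with a stem from a controlled vocabulary.
CONTROLLED_STEMS = {
    "ID": "unique identifier",
    "IND": "binary indicator (0/1)",
    "N": "count",
    "AMT": "summable amount",
    "DT": "date",
    "CAT": "categorical value",
}

def violations(column_names: list[str]) -> list[str]:
    """Return the columns whose names do not start with an approved stem."""
    return [
        col for col in column_names
        if col.split("_", 1)[0].upper() not in CONTROLLED_STEMS
    ]

print(violations(["ID_ACCOUNT", "IND_ACTIVE", "DT_OPENED", "customer_status"]))
# ['customer_status']
```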
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
Modern Data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days. Datafold helps Data teams gain visibility and confidence in the quality of their analytical data through data profiling, column-level lineage and intelligent anomaly detection. Datafold also helps automate regression testing of ETL code with its Data Diff feature that instantly shows how a change in ETL or BI code affects the produced data, both on a statistical level and down to individual rows and values. Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI workflows. Go to dataengineeringpodcast.com/datafold today to start a 30-day trial of Datafold.
Atlan is a collaborative workspace for data-driven teams, like Github for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $3000 on an annual subscription
Your host is Tobias Macey and today I’m interviewing Emily Riederer about defining and enforcing column contracts and controlled vocabularies for your data warehouse
Interview
Introduction
How did you get involved in the area of data management?
Can you start by discussing some of the anti-patterns that you have encountered in data warehouse naming conventions and how it relates to the modeling approach? (e.g. star/snowflake schema, data vault, etc.)
What are some of the types of contracts that can, and should, be defined and enforced in data workflows?
What are the boundaries where we should think about establishing those contracts?
What is the utility of column and table names for defining and enforcing contracts in analytical work?
What is the process for establishing contractual elements in a naming schema?
Who should be involved in that design process?
Who are the participants in the communication paths for column naming contracts?
What are some examples of context and details that can’t be captured in column names?
What are some options for managing that additional information and linking it to the naming contracts?
Can you describe the work that you have done with dbtplyr to make name contracts a supported construct in dbt projects?
How does dbtplyr help in the creation and enforcement of contracts in the development of dbt workflows?
How are you using dbtplyr in your own work?
How do you handle the work of building transformations to make data comply with contracts?
What are the supplemental systems/techniques/documentation to work with name contracts and how they are leveraged by downstream consumers?
What are the most interesting, innovative, or unexpected ways that you have seen naming contracts and/or dbtplyr used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on dbtplyr?
When is dbtplyr the wrong choice?
What do you have planned for the future of dbtplyr?
Contact Info
Twitter
Website
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Links
dbtplyr
Great Expectations
Podcast Episode
Controlled Vocabularies Presentation
dplyr
Data Vault
Podcast Episode
OpenMetadata
Podcast Episode
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
1/2/2022 • 1 hour, 34 seconds
A Reflection On The Data Ecosystem For The Year 2021
Summary
This has been an active year for the data ecosystem, with a number of new product categories and substantial growth in existing areas. In an attempt to capture the zeitgeist, Maura Church, David Wallace, Benn Stancil, and Gleb Mezhanskiy join the show to reflect on the past year and share their thoughts on the year to come.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
Struggling with broken pipelines? Stale dashboards? Missing data? If this resonates with you, you’re not alone. Data engineers struggling with unreliable data need look no further than Monte Carlo, the world’s first end-to-end, fully automated Data Observability Platform! In the same way that application performance monitoring ensures reliable software and keeps application downtime at bay, Monte Carlo solves the costly problem of broken data pipelines. Monte Carlo monitors and alerts for data issues across your data warehouses, data lakes, ETL, and business intelligence, reducing time to detection and resolution from weeks or days to just minutes. Start trusting your data with Monte Carlo today! Visit dataengineeringpodcast.com/montecarlo to learn more. The first 10 people to request a personalized product tour will receive an exclusive Monte Carlo Swag box.
Are you bored with writing scripts to move data into SaaS tools like Salesforce, Marketo, or Facebook Ads? Hightouch is the easiest way to sync data into the platforms that your business teams rely on. The data you’re looking for is already in your data warehouse and BI tools. Connect your warehouse to Hightouch, paste a SQL query, and use their visual mapper to specify how data should appear in your SaaS systems. No more scripts, just SQL. Supercharge your business teams with customer data using Hightouch for Reverse ETL today. Get started for free at dataengineeringpodcast.com/hightouch.
Your host is Tobias Macey and today I’m interviewing Maura Church, David Wallace, Benn Stancil, and Gleb Mezhanskiy about the key themes of 2021 in the data ecosystem and what to expect for next year
Interview
Introduction
How did you get involved in the area of data management?
What were the main themes that you saw data practitioners and vendors focused on this year?
What is the major bottleneck for Data teams in 2021? Will it be the same in 2022?
One of the ways to reason about progress in any domain is to look at what was the primary bottleneck to further progress (data adoption for decision making) at different points in time. In the data domain we have seen a number of bottlenecks. For example, scaling data platforms was answered first by Hadoop and on-prem columnar stores, and then by cloud data warehouses such as Snowflake and BigQuery. After that the problem was data integration and transformation, which was addressed by data integration vendors and frameworks such as Fivetran and Airbyte, modern orchestration frameworks such as Dagster and dbt, and "reverse-ETL" tools such as Hightouch. What is the main challenge now?
Will SQL be challenged as a primary interface to analytical data?
In 2020 we saw a few launches of post-SQL languages such as Malloy and Preql, as well as metric layer query languages from Transform and Supergrain.
To what extent does speed matter?
Over the past couple of months, we’ve seen the resurgence of “benchmark wars” between major data warehousing platforms. To what extent do speed benchmarks inform decisions for modern data teams? How important is query speed in a modern data workflow? What needs to be true about your current DWH solution and potential alternatives to make a move?
How has the way data teams work been changing?
In 2020 remote seemed like a temporary emergency state. In 2021, it went mainstream. How has that affected the day-to-day of data teams, how they collaborate internally and with stakeholders?
What’s it like to be a data vendor in 2021?
Vertically integrated vs. modular data stack?
There are multiple forces in play. Will the stack continue to be fragmented? Will we see major consolidation? If so, in which parts of the stack?
Contact Info
Maura
LinkedIn
Website
@outoftheverse on Twitter
David
LinkedIn
@davidjwallace on Twitter
dwallace0723 on GitHub
Benn
LinkedIn
@bennstancil on Twitter
Gleb
LinkedIn
@glebmm on Twitter
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Links
Patreon
Dutchie
Mode Analytics
Datafold
Podcast Episode
Locally Optimistic
RJ Metrics
Stitch
Mozart Data
Podcast Episode
Dagster
Podcast Episode
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
1/2/2022 • 1 hour, 3 minutes, 29 seconds
Exploring The Evolving Role Of Data Engineers
Summary
Data Engineering is still a relatively new field that is going through a continued evolution as new technologies are introduced and new requirements are understood. In this episode Maxime Beauchemin returns to revisit what it means to be a data engineer and how the role has changed over the past 5 years.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
Struggling with broken pipelines? Stale dashboards? Missing data? If this resonates with you, you’re not alone. Data engineers struggling with unreliable data need look no further than Monte Carlo, the world’s first end-to-end, fully automated Data Observability Platform! In the same way that application performance monitoring ensures reliable software and keeps application downtime at bay, Monte Carlo solves the costly problem of broken data pipelines. Monte Carlo monitors and alerts for data issues across your data warehouses, data lakes, ETL, and business intelligence, reducing time to detection and resolution from weeks or days to just minutes. Start trusting your data with Monte Carlo today! Visit dataengineeringpodcast.com/montecarlo to learn more. The first 10 people to request a personalized product tour will receive an exclusive Monte Carlo Swag box.
Are you bored with writing scripts to move data into SaaS tools like Salesforce, Marketo, or Facebook Ads? Hightouch is the easiest way to sync data into the platforms that your business teams rely on. The data you’re looking for is already in your data warehouse and BI tools. Connect your warehouse to Hightouch, paste a SQL query, and use their visual mapper to specify how data should appear in your SaaS systems. No more scripts, just SQL. Supercharge your business teams with customer data using Hightouch for Reverse ETL today. Get started for free at dataengineeringpodcast.com/hightouch.
Your host is Tobias Macey and today I’m interviewing Maxime Beauchemin about the impacts that the evolution of the modern data stack has had on the role and responsibilities of data engineers
Interview
Introduction
How did you get involved in the area of data management?
What is your current working definition of a data engineer?
How has that definition changed since your article on the "rise of the data engineer" and episode 3 of this show about "defining data engineering"?
How has the growing availability of data infrastructure services shifted foundational skills and knowledge that are necessary to be effective?
How should a new/aspiring data engineer focus their time and energy to become effective?
One of the core themes in this current spate of technologies is "democratization of data". In your post on the downfall of the data engineer you called out the pressure on data engineers to maintain control with so many contributors with varying levels of skill and understanding. How well is the "modern data stack" balancing these concerns?
An interesting impact of the growing usage of data is the constrained availability of data engineers. How do you see the effects of the job market on driving evolution of tooling and services?
With the explosion of tools and services for working with data, a new problem has evolved of which ones to use for a given organization. What do you see as an effective and efficient process for enumerating and evaluating the available components for building a stack?
There is also a lot of conversation around the "modern data stack", as well as the need for companies to build a "data platform". What (if any) difference do you see in the implications of those phrases and the skills required to compile a stack vs build a platform?
How do you view the long term viability of templated SQL as a core workflow for transformations?
What is the impact of more accessible and widespread machine learning/deep learning on data engineers/data infrastructure?
How evenly distributed across industries and geographies are the advances in data infrastructure and engineering practices?
What are some of the opportunities that are being missed or squandered during this dramatic shift in the data engineering landscape?
What are the most interesting, innovative, or unexpected ways that you have seen the data ecosystem evolve?
What are the most interesting, unexpected, or challenging lessons that you have learned while contributing to and participating in the data ecosystem?
In episode 3 of this show (almost five years ago) we closed with some predictions for the following years of data engineering, many of which have been proven out. What is your retrospective on those claims, and what are your new predictions for the upcoming years?
Contact Info
LinkedIn
@mistercrunch on Twitter
mistercrunch on GitHub
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
How the Modern Data Stack is Reshaping Data Engineering
The Rise of the Data Engineer
The Downfall of the Data Engineer
Defining Data Engineering – Data Engineering Podcast
Airflow
Superset
Podcast Episode
Preset
Fivetran
Podcast Episode
Meltano
Podcast Episode
Airbyte
Podcast Episode
Ralph Kimball
Bill Inmon
Feature Store
Prophecy.io
Podcast Episode
Ab Initio
Dremio
Podcast Episode
Data Mesh
Podcast Episode
Firebolt
Podcast Episode
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
12/27/2021 • 57 minutes, 41 seconds
Revisiting The Technical And Social Benefits Of The Data Mesh
Summary
The data mesh is a thesis that was presented to address the technical and organizational challenges that businesses face in managing their analytical workflows at scale. Zhamak Dehghani introduced the concepts behind this architectural pattern in 2019, and since then it has been gaining popularity, with many companies adopting some version of it in their systems. In this episode Zhamak re-joins the show to discuss the real world benefits that have been seen, the lessons that she has learned while working with her clients and the community, and her vision for the future of the data mesh.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
Atlan is a collaborative workspace for data-driven teams, like Github for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $3000 on an annual subscription
Modern Data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days. Datafold helps Data teams gain visibility and confidence in the quality of their analytical data through data profiling, column-level lineage and intelligent anomaly detection. Datafold also helps automate regression testing of ETL code with its Data Diff feature that instantly shows how a change in ETL or BI code affects the produced data, both on a statistical level and down to individual rows and values. Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI workflows. Go to dataengineeringpodcast.com/datafold today to start a 30-day trial of Datafold.
Your host is Tobias Macey and today I’m welcoming back Zhamak Dehghani to talk about her work on the data mesh book and the lessons learned over the past 2 years
Interview
Introduction
How did you get involved in the area of data management?
Can you start by giving a brief recap of the principles of the data mesh and the story behind it?
How has your view of the principles of the data mesh changed since our conversation in July of 2019?
What are some of the ways that your work on the data mesh book influenced your thinking on the practical elements of implementing a data mesh?
What do you view as the as-yet-unknown elements of the technical and social design constructs that are needed for a sustainable data mesh implementation?
In the opening of your book you state that "Data Mesh is a new approach in sourcing, managing, and accessing data for analytical use cases at scale". As with everything, scale is subjective, but what are some of the heuristics that you rely on for determining when a data mesh is an appropriate solution?
What are some of the ways that data mesh concepts manifest at the boundaries of organizations?
While the idea of federated access to data product quanta reduces the amount of coordination necessary at the organizational level, it raises the spectre of more complex logic required for consumers of multiple quanta. How can data mesh implementations mitigate the impact of this problem?
What are some of the technical components that you have found to be best suited to the implementation of data elements within a mesh?
What are the technological components that are still missing for a mesh-native data platform?
How should an organization that wishes to implement a mesh style architecture think about the roles and skills that they will need on staff?
How can vendors factor into the solution?
What is the role of application developers in a data mesh ecosystem and how do they need to change their thinking around the interfaces that they provide in their products?
What are the most interesting, innovative, or unexpected ways that you have seen data mesh principles used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on data mesh implementations?
When is a data mesh the wrong approach?
What do you think the future of the data mesh will look like?
Contact Info
LinkedIn
@zhamakd on Twitter
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
Data Engineering Podcast Data Mesh Interview
Data Mesh Book
Thoughtworks
Expert Systems
OpenLineage
Podcast Episode
Data Mesh Learning
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
12/27/2021 • 1 hour, 10 minutes, 53 seconds
Fast And Flexible Headless Data Analytics With Cube.JS
Summary
One of the perennial challenges of data analytics is having a consistent set of definitions, along with a flexible and performant API endpoint for querying them. In this episode Artyom Keydunov and Pavel Tiunov share their work on Cube.js and the various ways that it is being used in the open source community.
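To give a feel for what "an API endpoint for your metric definitions" means in practice, the sketch below queries a Cube deployment over HTTP. The cube name and its members are hypothetical; the /cubejs-api/v1/load route and JSON query shape follow Cube's REST API as documented around the time of this episode, so verify them against the current documentation for your deployment.

```python
# Hedged sketch of querying a Cube deployment's REST API from a downstream app.
import json
import requests

CUBE_URL = "http://localhost:4000/cubejs-api/v1/load"  # assumed local dev server
CUBE_TOKEN = "<jwt-issued-for-your-deployment>"

query = {
    "measures": ["Orders.count"],
    "dimensions": ["Orders.status"],
    "timeDimensions": [
        {"dimension": "Orders.createdAt", "granularity": "month"}
    ],
}

resp = requests.get(
    CUBE_URL,
    params={"query": json.dumps(query)},
    headers={"Authorization": CUBE_TOKEN},
    timeout=30,
)
resp.raise_for_status()
for row in resp.json().get("data", []):
    print(row)
```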
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
Atlan is a collaborative workspace for data-driven teams, like Github for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $3000 on an annual subscription
Modern Data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days. Datafold helps Data teams gain visibility and confidence in the quality of their analytical data through data profiling, column-level lineage and intelligent anomaly detection. Datafold also helps automate regression testing of ETL code with its Data Diff feature that instantly shows how a change in ETL or BI code affects the produced data, both on a statistical level and down to individual rows and values. Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI workflows. Go to dataengineeringpodcast.com/datafold today to start a 30-day trial of Datafold.
Your host is Tobias Macey and today I’m interviewing Artyom Keydunov and Pavel Tiunov about Cube.js, a framework for building analytics APIs to power your applications and BI dashboards
Interview
Introduction
How did you get involved in the area of data management?
Can you describe what Cube is and the story behind it?
What are the main use cases and platform architectures that you are focused on?
Who are the target personas that will be using and managing Cube.js?
The name comes from the concept of an OLAP cube. Can you discuss the applications of OLAP cubes and their role in the current state of the data ecosystem?
How does the idea of an OLAP cube compare to the recent focus on a dedicated metrics layer?
What are the pieces of a data platform that might be replaced by Cube.js?
Can you describe the design and architecture of the Cube platform?
How has the focus and target use case for the Cube platform evolved since you first started working on it?
One of the perpetually hard problems in computer science is cache management. How have you approached that challenge in the pre-aggregation layer of the Cube framework?
What is your overarching design philosophy for the API of the Cube system?
Can you talk through the workflow of someone building a cube and querying it from a downstream system?
What do the iteration cycles look like as you go from initial proof of concept to a more sophisticated usage of Cube.js?
What are some of the data modeling steps that are needed in the source systems?
The perennial problem of embedding SQL into another host language or DSL is how to deal with validation and developer tooling. What are the utilities that you and the community have built to reduce friction while writing the definitions of a cube?
What are the methods available for maintaining visibility across all of the cubes defined within and across installations of Cube.js?
What are the opportunities for composing multiple cubes together to form a higher level aggregation?
What are the most interesting, innovative, or unexpected ways that you have seen Cube.js used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on Cube?
When is Cube the wrong choice?
What do you have planned for the future of Cube?
Contact Info
Artyom
keydunov on GitHub
@keydunov on Twitter
LinkedIn
Pavel
LinkedIn
@paveltiunov87 on Twitter
paveltiunov on GitHub
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Links
Cube.js
Statsbot
chart.js
Highcharts
D3
OLAP Cube
dbt
Superset
Podcast Episode
Streamlit
Podcast.__init__ Episode
Parquet
Hasura
kSQLDB
Podcast Episode
Materialize
Podcast Episode
Meltano
Podcast Episode
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
12/21/2021 • 54 minutes, 43 seconds
Building A System Of Record For Your Organization's Data Ecosystem At Metaphor
Summary
Building a well managed data ecosystem for your organization requires a holistic view of all of the producers, consumers, and processors of information. The team at Metaphor are building a fully connected metadata layer to provide both technical and social intelligence about your data. In this episode Pardhu Gunnam and Mars Lan explain how they have designed the architecture and user experience to allow everyone to collaborate on the data lifecycle and provide opportunities for automation and extensible workflows.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
Struggling with broken pipelines? Stale dashboards? Missing data? If this resonates with you, you’re not alone. Data engineers struggling with unreliable data need look no further than Monte Carlo, the world’s first end-to-end, fully automated Data Observability Platform! In the same way that application performance monitoring ensures reliable software and keeps application downtime at bay, Monte Carlo solves the costly problem of broken data pipelines. Monte Carlo monitors and alerts for data issues across your data warehouses, data lakes, ETL, and business intelligence, reducing time to detection and resolution from weeks or days to just minutes. Start trusting your data with Monte Carlo today! Visit dataengineeringpodcast.com/montecarlo to learn more. The first 10 people to request a personalized product tour will receive an exclusive Monte Carlo Swag box.
Are you bored with writing scripts to move data into SaaS tools like Salesforce, Marketo, or Facebook Ads? Hightouch is the easiest way to sync data into the platforms that your business teams rely on. The data you’re looking for is already in your data warehouse and BI tools. Connect your warehouse to Hightouch, paste a SQL query, and use their visual mapper to specify how data should appear in your SaaS systems. No more scripts, just SQL. Supercharge your business teams with customer data using Hightouch for Reverse ETL today. Get started for free at dataengineeringpodcast.com/hightouch.
Your host is Tobias Macey and today I’m interviewing Pardhu Gunnam and Mars Lan about Metaphor Data, a platform aiming to be the system of record for your data ecosystem
Interview
Introduction
How did you get involved in the area of data management?
Can you describe what Metaphor is and the story behind it?
On your site it states that you are aiming to be the "system of record" for your data platform. Can you unpack that statement and its implications?
What are the shortcomings in the "data catalog" approach to metadata collection and presentation?
Who are the target end users of Metaphor and what are the pain points for each persona that you are prioritizing?
How has that focus informed your priorities for user experience design and feature development?
Can you describe how the Metaphor platform is architected?
What are the lessons that you learned from your work at DataHub that have informed your work on Metaphor?
There has been a huge amount of focus on the "modern data stack" with an assumption that there is a cloud data warehouse as the central component that all data flows through. How does Metaphor’s design allow for usage in platforms that aren’t dominated by a cloud data warehouse?
What are some examples of information that you can extract through integrations with an organization’s communication platforms?
Can you talk through a few example workflows where that information is used to inform the actions taken by a team member?
What is your philosophy around data modeling or schema standardization for metadata records?
What are some of the challenges that teams face in stitching together a meaningful set of relations across metadata records in Metaphor?
What are some of the features or potential use cases for Metaphor that are overlooked or misunderstood as you work with your customers?
What are the most interesting, innovative, or unexpected ways that you have seen Metaphor used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on Metaphor?
When is Metaphor the wrong choice?
What do you have planned for the future of Metaphor?
Contact Info
Pardhu
LinkedIn
@PardhuGunnam on Twitter
Mars
LinkedIn
mars-lan on GitHub
@mars_lan on Twitter
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Links
Metaphor
The Modern Metadata Platform
Why can’t I find the right data?
DataHub
Transform
Podcast Episode
Supergrain
MetriQL
Podcast Episode
dbt
Podcast Interview
OpenMetadata
Podcast Interview
Pegasus Data Language
Modern Data Experience
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
12/20/2021 • 1 hour, 5 minutes, 33 seconds
Building Auditable Spark Pipelines At Capital One
Summary
Spark is a powerful and battle-tested framework for building highly scalable data pipelines. Because of its proven ability to handle large volumes of data, Capital One has invested in it for their business needs. In this episode Gokul Prabagaren shares how he uses it to calculate customer rewards points, including the auditing requirements and how he designed his pipeline to maintain all of the necessary information through a pattern of data enrichment.
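The contrast between the two patterns discussed in this episode can be sketched in a few lines of PySpark. The column names and the reward rate below are hypothetical placeholders rather than Capital One's actual schema: filtering drops ineligible rows (and the audit trail with them), while enrichment keeps every row and records the decision alongside it.

```python
# Filtering vs. enrichment for segmenting transactions, with illustrative columns.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("enrichment-pattern").getOrCreate()
transactions = spark.createDataFrame(
    [("t1", 120.0, "retail"), ("t2", 45.0, "excluded_category"), ("t3", 80.0, "dining")],
    ["txn_id", "amount", "category"],
)

# Filtering pattern: ineligible rows disappear from downstream stages.
eligible_only = transactions.filter(F.col("category") != "excluded_category")

# Enrichment pattern: every row flows through, carrying the eligibility decision,
# so later stages (and auditors) can see why a transaction earned no rewards.
enriched = transactions.withColumn(
    "is_eligible", F.col("category") != "excluded_category"
).withColumn(
    "reward_points",
    F.when(F.col("is_eligible"), F.col("amount") * 0.02).otherwise(F.lit(0.0)),
)
enriched.show()
```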
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
Atlan is a collaborative workspace for data-driven teams, like Github for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $3000 on an annual subscription
Modern Data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days. Datafold helps Data teams gain visibility and confidence in the quality of their analytical data through data profiling, column-level lineage and intelligent anomaly detection. Datafold also helps automate regression testing of ETL code with its Data Diff feature that instantly shows how a change in ETL or BI code affects the produced data, both on a statistical level and down to individual rows and values. Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI workflows. Go to dataengineeringpodcast.com/datafold today to start a 30-day trial of Datafold.
Your host is Tobias Macey and today I’m interviewing Gokul Prabagaren about how he is using Spark for real-world workflows at Capital One
Interview
Introduction
How did you get involved in the area of data management?
Can you start by giving an overview of the types of data and workflows that you are responsible for at Capital One?
In terms of the three "V"s (Volume, Variety, Velocity), what is the magnitude of the data that you are working with?
What are some of the business and regulatory requirements that have to be factored into the solutions that you design?
Who are the consumers of the data assets that you are producing?
Can you describe the technical elements of the platform that you use for managing your data pipelines?
What are the various ways that you are using Spark at Capital One?
You wrote a post and presented at the Databricks conference about your experience moving from a data filtering to a data enrichment pattern for segmenting transactions. Can you give some context as to the use case and what your design process was for the initial implementation?
What were the shortcomings to that approach/business requirements which led you to refactoring the approach to one that maintained all of the data through the different processing stages?
What are some of the impacts on data volumes and processing latencies working with enriched data frames persisted between task steps?
What are some of the other optimizations or improvements that you have made to that pipeline since you wrote the post?
What are some of the limitations of Spark that you have experienced during your work at Capital One?
How have you worked around them?
What are the most interesting, innovative, or unexpected ways that you have seen Spark used at Capital One?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on data engineering at Capital One?
What are some of the upcoming projects that you are focused on/excited for?
How has your experience with the filtering vs. enrichment approach influenced your thinking on other projects that you work on?
Contact Info
@gocool_p on Twitter
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Links
Apache Spark
Blog Post
Databricks Presentation
Delta Lake
Podcast Episode
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
12/13/2021 • 42 minutes, 9 seconds
Deliver Personal Experiences In Your Applications With The Unomi Open Source Customer Data Platform
Summary
The core of providing your users with excellent service is understanding them so that you can offer a personalized experience. Unfortunately many sites and applications take that to the extreme and collect too much information. In order to make it easier for developers to build customer profiles in a way that respects user privacy, Serge Huber helped to create the Apache Unomi framework as an open source customer data platform. In this episode he explains how it can be used to build rich and useful profiles of your users, the system architecture that powers it, and some of the ways that it is being integrated into an organization’s broader data ecosystem.
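For orientation, the sketch below shows the general shape of feeding a behavioral event into a customer data platform so it can be folded into a visitor profile. The URL, payload shape, and profile identifier are hypothetical and do not reflect Unomi's actual REST API; they only illustrate the kind of integration point an application developer works with.

```python
# Generic sketch of sending an event to a CDP's HTTP collection endpoint.
import requests

def track_event(endpoint: str, profile_id: str, event_type: str, properties: dict) -> None:
    payload = {
        "profileId": profile_id,   # pseudonymous identifier, not raw PII
        "eventType": event_type,
        "properties": properties,
    }
    requests.post(endpoint, json=payload, timeout=5).raise_for_status()

# Example: record a page view that a personalization rule could later act on.
# track_event("https://cdp.example.com/events", "anon-42", "pageView", {"path": "/pricing"})
```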
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
Struggling with broken pipelines? Stale dashboards? Missing data? If this resonates with you, you’re not alone. Data engineers struggling with unreliable data need look no further than Monte Carlo, the world’s first end-to-end, fully automated Data Observability Platform! In the same way that application performance monitoring ensures reliable software and keeps application downtime at bay, Monte Carlo solves the costly problem of broken data pipelines. Monte Carlo monitors and alerts for data issues across your data warehouses, data lakes, ETL, and business intelligence, reducing time to detection and resolution from weeks or days to just minutes. Start trusting your data with Monte Carlo today! Visit dataengineeringpodcast.com/montecarlo to learn more. The first 10 people to request a personalized product tour will receive an exclusive Monte Carlo Swag box.
Are you bored with writing scripts to move data into SaaS tools like Salesforce, Marketo, or Facebook Ads? Hightouch is the easiest way to sync data into the platforms that your business teams rely on. The data you’re looking for is already in your data warehouse and BI tools. Connect your warehouse to Hightouch, paste a SQL query, and use their visual mapper to specify how data should appear in your SaaS systems. No more scripts, just SQL. Supercharge your business teams with customer data using Hightouch for Reverse ETL today. Get started for free at dataengineeringpodcast.com/hightouch.
Your host is Tobias Macey and today I’m interviewing Serge Huber about Apache Unomi, an open source customer data platform designed to manage customer, lead, and visitor data and help personalize customer experiences
Interview
Introduction
How did you get involved in the area of data management?
Can you describe what Unomi is and the story behind it?
What are the goals and target use cases of Unomi?
What are the aspects of collecting and aggregating profile information that present challenges to developers?
How does the design of Unomi reduce that burden?
How does the focus of Unomi compare to systems such as Segment/Rudderstack or Optimizely for collecting user interactions and applying personalization?
How does Unomi fit in the architecture of an application or data infrastructure?
Can you describe how Unomi itself is architected?
How have the goals and design of the project changed or evolved since it started?
What are some of the most complex or challenging engineering projects that you have worked through?
Can you describe the workflow of using Unomi to manage a set of customer profiles?
What are some examples of user experience customization that you can build with Unomi?
What are some alternative architectures that you have seen to produce similar capabilities?
One of the interesting features of Unomi is the end-user profile management. What are some of the system and developer challenges that are introduced by that capability? (e.g. constraints on data manipulation, security, privacy concerns, etc.)
How did Unomi manage privacy concerns and the GDPR?
How does Unomi help with the new third party data restrictions?
Why is access to raw data so important?
Could cloud providers offer Unomi as a service?
How have you used Unomi in your own work?
What are the most interesting, innovative, or unexpected ways that you have seen Unomi used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on Unomi?
When is Unomi the wrong choice?
What do you have planned for the future of Unomi?
Contact Info
LinkedIn
@sergehuber on Twitter
@bhillou on Twitter
sergehuber on GitHub
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
Apache Unomi
Jahia
OASIS Open Foundation
Segment
Podcast Episode
Rudderstack
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
12/12/2021 • 57 minutes, 33 seconds
Data Driven Hiring For Data Professionals With Alooba
Summary
Hiring data professionals is challenging for a multitude of reasons, and as with every interview process there is a potential for bias to creep in. Tim Freestone founded Alooba to provide a more stable reference point for evaluating candidates to ensure that you can make more informed comparisons based on their actual knowledge. In this episode he explains how Alooba got started, how it is being used in the interview process for data-oriented roles, and how it can also provide visibility into your organization’s overall data literacy. The whole process of hiring is an important organizational skill to cultivate, and this is an interesting exploration of the specific challenges involved in finding data professionals.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
Struggling with broken pipelines? Stale dashboards? Missing data? If this resonates with you, you’re not alone. Data engineers struggling with unreliable data need look no further than Monte Carlo, the world’s first end-to-end, fully automated Data Observability Platform! In the same way that application performance monitoring ensures reliable software and keeps application downtime at bay, Monte Carlo solves the costly problem of broken data pipelines. Monte Carlo monitors and alerts for data issues across your data warehouses, data lakes, ETL, and business intelligence, reducing time to detection and resolution from weeks or days to just minutes. Start trusting your data with Monte Carlo today! Visit dataengineeringpodcast.com/montecarlo to learn more. The first 10 people to request a personalized product tour will receive an exclusive Monte Carlo Swag box.
Are you bored with writing scripts to move data into SaaS tools like Salesforce, Marketo, or Facebook Ads? Hightouch is the easiest way to sync data into the platforms that your business teams rely on. The data you’re looking for is already in your data warehouse and BI tools. Connect your warehouse to Hightouch, paste a SQL query, and use their visual mapper to specify how data should appear in your SaaS systems. No more scripts, just SQL. Supercharge your business teams with customer data using Hightouch for Reverse ETL today. Get started for free at dataengineeringpodcast.com/hightouch.
Your host is Tobias Macey and today I’m interviewing Tim Freestone about Alooba, an assessment platform for evaluating data and analytics candidates to improve hiring outcomes for data roles.
Interview
Introduction
How did you get involved in the area of data management?
Can you describe what Alooba is and the story behind it?
What are the main goals that you are trying to achieve with Alooba?
What are the main challenges that employers and candidates face when navigating their respective roles in the hiring process?
What are some of the difficulties that are specific to data oriented roles?
What are some of the complexities involved in designing a user experience that is positive and productive for both candidates and companies?
What are some strategies that you have developed for establishing a fair and consistent baseline of skills to ensure consistent comparison across candidates?
One of the problems that comes from test-based skills assessment is the implicit bias toward candidates who test well. How do you work to mitigate that in the candidate evaluation process?
Can you describe how the Alooba platform itself is implemented?
How have the goals and design of the system changed or evolved since you first started it?
What are some of the ways that you use Alooba internally?
How do you stay up to date with the evolving skill requirements as roles change and new roles are created?
Beyond evaluation of candidates for hiring, what are some of the other features that you have added to Alooba to support organizations in their effort to gain value from their data?
What are the most interesting, innovative, or unexpected ways that you have seen Alooba used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on Alooba?
When is Alooba the wrong choice?
What do you have planned for the future of Alooba?
Contact Info
LinkedIn
@timmyfreestone on Twitter
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
Alooba
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
12/4/2021 • 50 minutes, 2 seconds
Experimentation and A/B Testing For Modern Data Teams With Eppo
Summary
A/B testing and experimentation are the most reliable way to determine whether a change to your product will have the desired effect on your business. Unfortunately, being able to design, deploy, and validate experiments is a complex process that requires a mix of technical capacity and organizational involvement which is hard to come by. Chetan Sharma founded Eppo to provide a system that organizations of every scale can use to reduce the burden of managing experiments so that you can focus on improving your business. In this episode he digs into the technical, statistical, and design requirements for running effective experiments and how he has architected the Eppo platform to make the process more accessible to business and data professionals.
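For a concrete picture of one building block behind experimentation platforms like the one discussed here, below is a minimal sketch of the deterministic, hash-based bucketing commonly used to assign users to experiment variants so that assignments stay stable across sessions. The experiment name and variant split are hypothetical illustrations, not Eppo's actual implementation.

```python
# Hypothetical sketch of deterministic experiment bucketing; not Eppo's code.
import hashlib


def assign_variant(user_id: str, experiment: str, variants=("control", "treatment")) -> str:
    """Deterministically map a user to a variant for a given experiment."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % len(variants)
    return variants[bucket]


# The same user always lands in the same bucket for a given experiment,
# which keeps exposure logging and analysis consistent.
print(assign_variant("user-123", "new-onboarding-flow"))
```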
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
Atlan is a collaborative workspace for data-driven teams, like Github for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $3000 on an annual subscription
Modern Data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days. Datafold helps Data teams gain visibility and confidence in the quality of their analytical data through data profiling, column-level lineage and intelligent anomaly detection. Datafold also helps automate regression testing of ETL code with its Data Diff feature that instantly shows how a change in ETL or BI code affects the produced data, both on a statistical level and down to individual rows and values. Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI workflows. Go to dataengineeringpodcast.com/datafold today to start a 30-day trial of Datafold.
Your host is Tobias Macey and today I’m interviewing Chetan Sharma about Eppo, a platform for building A/B experiments that are easier to manage
Interview
Introduction
How did you get involved in the area of data management?
Can you describe what Eppo is and the story behind it?
What are some examples of the kinds of experiments that teams and organizations might want to conduct?
What are the points of friction that teams encounter when trying to run experiments?
What are the steps involved in designing, deploying, and analyzing the outcomes of an A/B experiment?
What are some of the statistical errors that are common when conducting an experiment?
What are the design and UX principles that you have focused on in Eppo to improve the workflow of building and analyzing experiments?
Can you describe the system design of the Eppo platform?
What are the services or capabilities external to Eppo that are required for it to be effective?
What are the integration points for adding Eppo to an organization’s existing platform?
Beyond the technical capabilities for running experiments there are a number of design requirements involved. Can you talk through some of the decisions that need to be made when deciding what to change and how to measure its impact?
Another difficult element of managing experiments is understanding how they all interact with each other when running a large number of simultaneous tests. How does Eppo help with tracking the various experiments and the cohorts that are bucketed into each?
What are some of the ideas or assumptions that you had about the technical and design aspects of running experiments that have been challenged or changed while building Eppo?
What are the most interesting, innovative, or unexpected ways that you have seen Eppo used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on Eppo?
When is Eppo the wrong choice?
What do you have planned for the future of Eppo?
Contact Info
LinkedIn
@chesharma87 on Twitter
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
Eppo
Knowledge Repo
Apache Hive
Frequentist Statistics
Rudderstack
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
12/4/2021 • 58 minutes
Creating A Unified Experience For The Modern Data Stack At Mozart Data
Summary
The modern data stack has been gaining a lot of attention recently with a rapidly growing set of managed services for different stages of the data lifecycle. With all of the available options it is possible to run a scalable, production-grade data platform with a small team, but there are still sharp edges and integration challenges to work through. Peter Fishman and Dan Silberman experienced these difficulties firsthand and created Mozart Data to provide a single, easy-to-use option for getting started with the modern data stack. In this episode they explain how they designed a user experience that makes working with data more accessible to organizations without a data team, while still allowing more advanced users to build out more complex workflows. They also share their thoughts on the modern data ecosystem and how it improves the availability of analytics for companies of all sizes.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
Atlan is a collaborative workspace for data-driven teams, like Github for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $3000 on an annual subscription
Modern Data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days. Datafold helps Data teams gain visibility and confidence in the quality of their analytical data through data profiling, column-level lineage and intelligent anomaly detection. Datafold also helps automate regression testing of ETL code with its Data Diff feature that instantly shows how a change in ETL or BI code affects the produced data, both on a statistical level and down to individual rows and values. Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI workflows. Go to dataengineeringpodcast.com/datafold today to start a 30-day trial of Datafold.
Your host is Tobias Macey and today I’m interviewing Peter Fishman and Dan Silberman about Mozart Data and how they are building a unified experience for the modern data stack
Interview
Introduction
How did you get involved in the area of data management?
Can you describe what Mozart Data is and the story behind it?
The promise of the "modern data stack" is that it’s all delivered as a service to make it easier to set up. What are the missing pieces that make something like Mozart necessary?
What are the main workflows or industries that you are focusing on?
Who are the main personas that you are building Mozart for?
How has that combination of user persona and industry focus informed your decisions around feature priorities and user experience?
Can you describe how you have architected the Mozart platform?
How have you approached the build vs. buy decision internally?
What are some of the most interesting or challenging engineering projects that you have had to work on while building Mozart?
What are the stages of the data lifecycle that you work the hardest to automate, and which do you focus on exposing to customers?
What are the edge cases in what customers might try to do in the bounds of Mozart, or areas where you have explicitly decided not to include in your features?
What are the options for extensibility, or custom engineering when customers encounter those situations?
What do you see as the next phase in the evolution of the data stack?
What are the most interesting, innovative, or unexpected ways that you have seen Mozart used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on Mozart?
When is Mozart the wrong choice?
What do you have planned for the future of Mozart?
Contact Info
Peter
LinkedIn
@peterfishman on Twitter
Dan
LinkedIn
silberman on GitHub
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Links
Mozart Data
Modern Data Stack
Mode Analytics
Fivetran
Podcast Episode
Snowflake
Podcast Episode
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
11/27/2021 • 58 minutes, 31 seconds
Doing DataOps For External Data Sources As A Service at Demyst
Summary
The data that you have access to affects the questions that you can answer. By using external data sources you can drastically increase the range of analysis that is available to your organization. The challenge comes in all of the operational aspects of finding, accessing, organizing, and serving that data. In this episode Mark Hookey discusses how he and his team at Demyst do all of the DataOps for external data sources so that you don’t have to, including the systems necessary to organize and catalog the various collections that they host, the various serving layers to provide query interfaces that match your platform, and the utility of having a single place to access a multitude of information. If you are having trouble answering questions for your business with the data that you generate and collect internally, then it is definitely worthwhile to explore the information available from external sources.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
Struggling with broken pipelines? Stale dashboards? Missing data? If this resonates with you, you’re not alone. Data engineers struggling with unreliable data need look no further than Monte Carlo, the world’s first end-to-end, fully automated Data Observability Platform! In the same way that application performance monitoring ensures reliable software and keeps application downtime at bay, Monte Carlo solves the costly problem of broken data pipelines. Monte Carlo monitors and alerts for data issues across your data warehouses, data lakes, ETL, and business intelligence, reducing time to detection and resolution from weeks or days to just minutes. Start trusting your data with Monte Carlo today! Visit dataengineeringpodcast.com/montecarlo to learn more. The first 10 people to request a personalized product tour will receive an exclusive Monte Carlo Swag box.
Are you bored with writing scripts to move data into SaaS tools like Salesforce, Marketo, or Facebook Ads? Hightouch is the easiest way to sync data into the platforms that your business teams rely on. The data you’re looking for is already in your data warehouse and BI tools. Connect your warehouse to Hightouch, paste a SQL query, and use their visual mapper to specify how data should appear in your SaaS systems. No more scripts, just SQL. Supercharge your business teams with customer data using Hightouch for Reverse ETL today. Get started for free at dataengineeringpodcast.com/hightouch.
Your host is Tobias Macey and today I’m interviewing Mark Hookey about Demyst Data, a platform for operationalizing external data
Interview
Introduction
How did you get involved in the area of data management?
Can you describe what Demyst is and the story behind it?
What are the services and systems that you provide for organizations to incorporate external sources in their data workflows?
Who are your target customers?
What are some examples of data sets that an organization might want to use in their analytics?
How are these different from SaaS data that an organization might integrate with tools such as Stitch and Fivetran?
What are some of the challenges that are introduced by working with these external data sets?
If an organization isn’t using Demyst what are some of the technical and organizational systems that they will need to build and manage?
Can you describe how the Demyst platform is architected?
What have been the most complex or difficult engineering challenges that you have dealt with while building Demyst?
Given the wide variance in the systems that your customers are running, what are some strategies that you have used to provide flexible APIs for accessing the underlying information?
What is the process for you to identify and onboard a new data source in your platform?
What are some of the additional analytical systems that you have to run to manage your business (e.g. usage metering and analytics, etc.)?
What are the most interesting, innovative, or unexpected ways that you have seen Demyst used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on Demyst?
When is Demyst the wrong choice?
What do you have planned for the future of Demyst?
Contact Info
LinkedIn
Email
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Links
Demyst Data
LexisNexis
AWS Athena
DataRobot
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
11/27/2021 • 59 minutes, 16 seconds
Laying The Foundation Of Your Data Platform For The Era Of Big Complexity With Dagster
Summary
The technology for scaling storage and processing of data has gone through massive evolution over the past decade, leaving us with the ability to work with massive datasets at the cost of massive complexity. Nick Schrock created the Dagster framework to help tame that complexity and scale the organizational capacity for working with data. In this episode he shares the journey that he and his team at Elementl have taken to understand the state of the ecosystem and how they can provide a foundational layer for a holistic data platform.
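As context for the discussion of software-defined assets later in the episode, here is a minimal sketch using Dagster's asset API. The asset names and logic are hypothetical, and the import paths follow recent Dagster releases, which may differ from the version that existed when this episode aired.

```python
# Hypothetical sketch of software-defined assets in Dagster; asset names and
# logic are illustrative only.
from dagster import Definitions, asset


@asset
def raw_orders():
    # In practice this would pull from a source system.
    return [{"order_id": 1, "amount": 42.0}, {"order_id": 2, "amount": 17.5}]


@asset
def order_totals(raw_orders):
    # Downstream asset: Dagster infers the dependency from the parameter name.
    return sum(order["amount"] for order in raw_orders)


# Registering the assets gives the framework a graph it can materialize,
# schedule, and track with rich metadata.
defs = Definitions(assets=[raw_orders, order_totals])
```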
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform and blazing fast NVMe storage there’s nothing slowing you down. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
Struggling with broken pipelines? Stale dashboards? Missing data? If this resonates with you, you’re not alone. Data engineers struggling with unreliable data need look no further than Monte Carlo, the world’s first end-to-end, fully automated Data Observability Platform! In the same way that application performance monitoring ensures reliable software and keeps application downtime at bay, Monte Carlo solves the costly problem of broken data pipelines. Monte Carlo monitors and alerts for data issues across your data warehouses, data lakes, ETL, and business intelligence, reducing time to detection and resolution from weeks or days to just minutes. Start trusting your data with Monte Carlo today! Visit dataengineeringpodcast.com/montecarlo to learn more. The first 10 people to request a personalized product tour will receive an exclusive Monte Carlo Swag box.
Are you bored with writing scripts to move data into SaaS tools like Salesforce, Marketo, or Facebook Ads? Hightouch is the easiest way to sync data into the platforms that your business teams rely on. The data you’re looking for is already in your data warehouse and BI tools. Connect your warehouse to Hightouch, paste a SQL query, and use their visual mapper to specify how data should appear in your SaaS systems. No more scripts, just SQL. Supercharge your business teams with customer data using Hightouch for Reverse ETL today. Get started for free at dataengineeringpodcast.com/hightouch.
Your host is Tobias Macey and today I’m interviewing Nick Schrock about the evolution of Dagster and its path forward
Interview
Introduction
How did you get involved in the area of data management?
Can you describe what Dagster is and the story behind it?
How has the project and community changed/evolved since we last spoke 2 years ago?
How has the experience of the past 2 years clarified the challenges and opportunities that exist in the data ecosystem?
What do you see as the foundational vs transient complexities that are germane to the industry?
One of the emerging ideas in Dagster is the "software defined data asset" as the central entity in the framework. How has that shifted the way that engineers approach pipeline design and composition?
How did that conceptual shift inform the accompanying refactor of the core principles in the framework? (jobs, ops, graphs)
One of the powerful elements of the Dagster framework is the investment in rich metadata as a foundational principle. What are the opportunities for integrating and extending that context throughout the rest of an organization's data platform?
What do you see as the potential for efforts such as OpenLineage and OpenMetadata to allow for other components in the data platform to create and propagate that context more freely?
What are some of the project architecture/repository structure/pipeline composition patterns that have begun to form in the community and your own internal work with Dagster?
What are some of the anti-patterns that you have seen users fall into when working with Dagster?
Along with your recent refactoring of the core API you have also started to roll out the Dagster Cloud offering. What was your process for determining the path to commercialization for the Dagster project and community?
How are you managing governance and long-term viability of the open source elements of Dagster?
What are your design principles for deciding the boundaries between OSS and commercial features?
What do you see as the role of Dagster in the creation of a data platform architecture?
What are the opportunities that it creates for data platform engineers?
What is your perspective on the tradeoffs of pipelines as software vs. pipelines as "code" vs. low/no-code pipelines?
What (if any) option do you see for language agnostic/multi-language pipeline definitions in Dagster?
What do you see as the biggest threats to the future success of Dagster/Elementl?
You were a relative outsider to the data ecosystem when you first started Dagster/Elementl. What have been the most interesting and surprising experiences as you have invested your time and energy in contributing to the community?
What are the most interesting, innovative, or unexpected ways that you have seen Dagster used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on Dagster?
When is Dagster the wrong choice?
What do you have planned for the future of Dagster?
Contact Info
LinkedIn
@schrockn on Twitter
schrockn on GitHub
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Links
Elementl
Series A Announcement
Video on software-defined assets
Dagster
Podcast Episode
GraphQL
dbt
Podcast Episode
Open Source Data Stack Conference
Meltano
Podcast Episode
Amundsen
Podcast Episode
DataHub
Podcast Episode
Hashicorp
Vercel
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
11/20/2021 • 1 hour, 5 minutes, 25 seconds
Exploring Processing Patterns For Streaming Data Integration In Your Data Lake
Summary
One of the perennial challenges posed by data lakes is how to keep them up to date as new data is collected. With the improvements in streaming engines it is now possible to perform all of your data integration in near real time, but it can be challenging to understand the proper processing patterns to make that performant. In this episode Ori Rafael shares his experiences at Upsolver building scalable stream processing for integrating and analyzing data, and the tradeoffs involved when coming from a batch-oriented mindset.
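As a small illustration of the batch-versus-streaming tradeoff discussed in this episode, the sketch below contrasts a one-shot batch aggregate with the same aggregate maintained incrementally by Spark Structured Streaming, one of the engines linked below. It assumes PySpark is installed and uses the built-in rate source purely for demonstration.

```python
# Minimal sketch, assuming PySpark is installed; the rate source is a test
# generator used here only to show the streaming programming model.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("batch-vs-stream").getOrCreate()

# Batch: the whole result is recomputed every time the job runs.
batch_df = spark.range(0, 1000).withColumn("bucket", F.col("id") % 10)
batch_df.groupBy("bucket").count().show()

# Streaming: the same aggregate is kept up to date as new rows arrive.
stream_df = (
    spark.readStream.format("rate")   # built-in test source that emits rows continuously
    .option("rowsPerSecond", 10)
    .load()
    .withColumn("bucket", F.col("value") % 10)
)
query = (
    stream_df.groupBy("bucket").count()
    .writeStream
    .outputMode("complete")           # aggregations require complete/update mode
    .format("console")
    .start()
)
query.awaitTermination(10)            # run briefly for illustration, then stop
query.stop()
```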
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
Atlan is a collaborative workspace for data-driven teams, like Github for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $3000 on an annual subscription
Modern Data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days. Datafold helps Data teams gain visibility and confidence in the quality of their analytical data through data profiling, column-level lineage and intelligent anomaly detection. Datafold also helps automate regression testing of ETL code with its Data Diff feature that instantly shows how a change in ETL or BI code affects the produced data, both on a statistical level and down to individual rows and values. Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI workflows. Go to dataengineeringpodcast.com/datafold today to start a 30-day trial of Datafold.
Your host is Tobias Macey and today I’m interviewing Ori Rafael about strategies for building stream and batch processing patterns for data lake analytics
Interview
Introduction
How did you get involved in the area of data management?
Can you start by giving an overview of the state of the market for data lakes today?
What are the prevailing architectural and technological patterns that are being used to manage these systems?
Batch and streaming systems have been used in various combinations since the early days of Hadoop. The Lambda architecture has largely been abandoned, so what is the answer for today’s data lakes?
What are the challenges presented by streaming approaches to data transformations?
The batch model for processing is intuitive despite its latency problems. What are the benefits that it provides?
The core concept for data orchestration is the DAG. How does that manifest in a streaming context?
In batch processing idempotent/immutable datasets are created by re-running the entire pipeline when logic changes need to be made. Given that there is no definitive start or end of a stream, what are the options for amending logical errors in transformations?
What are some of the data processing/integration patterns that are impossible in a batch system?
What are some useful strategies for migrating from a purely batch, or hybrid batch and streaming architecture, to a purely streaming system?
What are some of the changes in technological or organizational patterns that are often overlooked or misunderstood in this shift?
What are some of the most surprising things that you have learned about streaming systems in your time at Upsolver?
What are the most interesting, innovative, or unexpected ways that you have seen streaming architectures used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on streaming data integration?
When are streaming architectures the wrong approach?
What do you have planned for the future of Upsolver to make streaming data easier to work with?
Contact Info
LinkedIn
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Links
Upsolver
Hive Metastore
Hudi
Podcast Episode
Iceberg
Podcast Episode
Hadoop
Lambda Architecture
Kappa Architecture
Apache Beam
Event Sourcing
Flink
Podcast Episode
Spark Structured Streaming
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
11/20/2021 • 52 minutes, 53 seconds
Data Quality Starts At The Source
Summary
The most important gauge of success for a data platform is the level of trust in the accuracy of the information that it provides. In order to build and maintain that trust it is necessary to invest in defining, monitoring, and enforcing data quality metrics. In this episode Michael Harper advocates for proactive data quality that starts at the source, rather than reacting to problems and working backward to find where they were introduced.
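To make the idea of source-level checks concrete, here is a minimal sketch of validating records at extraction time, before they propagate downstream. The column names and rules are hypothetical; tools such as Great Expectations (linked below) provide a much richer version of the same pattern.

```python
# Hypothetical sketch of quality checks applied at the source, before loading.
import pandas as pd


def check_orders(df: pd.DataFrame) -> list:
    """Return a list of human-readable quality violations found at the source."""
    problems = []
    if df["order_id"].isna().any():
        problems.append("order_id contains nulls")
    if df["order_id"].duplicated().any():
        problems.append("order_id contains duplicates")
    if (df["amount"] < 0).any():
        problems.append("amount contains negative values")
    return problems


orders = pd.DataFrame({"order_id": [1, 2, 2], "amount": [10.0, -5.0, 7.5]})
issues = check_orders(orders)
if issues:
    # Fail fast at the source instead of letting bad records reach consumers.
    raise ValueError(f"Refusing to load: {issues}")
```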
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
Atlan is a collaborative workspace for data-driven teams, like Github for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $3000 on an annual subscription
Modern Data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days. Datafold helps Data teams gain visibility and confidence in the quality of their analytical data through data profiling, column-level lineage and intelligent anomaly detection. Datafold also helps automate regression testing of ETL code with its Data Diff feature that instantly shows how a change in ETL or BI code affects the produced data, both on a statistical level and down to individual rows and values. Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI workflows. Go to dataengineeringpodcast.com/datafold today to start a 30-day trial of Datafold.
Your host is Tobias Macey and today I’m interviewing Michael Harper about definitions of data quality and where to define and enforce it in the data platform
Interview
Introduction
How did you get involved in the area of data management?
What is your definition for the term "data quality" and what are the implied goals that it embodies?
What are some ways that different stakeholders and participants in the data lifecycle might disagree about the definitions and manifestations of data quality?
The market for "data quality tools" has been growing and gaining attention recently. How would you categorize the different approaches taken by open source and commercial options in the ecosystem?
What are the tradeoffs that you see in each approach? (e.g. data warehouse as a chokepoint vs quality checks on extract)
What are the difficulties that engineers and stakeholders encounter when identifying and defining information that is necessary to identify issues in their workflows?
Can you describe some examples of adding data quality checks to the beginning stages of a data workflow and the kinds of issues that can be identified?
What are some ways that quality and observability metrics can be aggregated across multiple pipeline stages to identify more complex issues?
In application observability the metrics across multiple processes are often associated with a given service. What is the equivalent concept in data platform observability?
In your work at Databand what are some of the ways that your ideas and assumptions around data quality have been challenged or changed?
What are the most interesting, innovative, or unexpected ways that you have seen Databand used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working at Databand?
When is Databand the wrong choice?
What do you have planned for the future of Databand?
Contact Info
LinkedIn
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
Databand
Clean Architecture (affiliate link)
Great Expectations
Deequ
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
11/14/2021 • 58 minutes, 54 seconds
Eliminate Friction In Your Data Platform Through Unified Metadata Using OpenMetadata
Summary
A significant source of friction and wasted effort in building and integrating data management systems is the fragmentation of metadata across various tools. After experiencing the impacts of fragmented metadata and previous attempts at building a solution, Suresh Srinivas and Sriharsha Chintalapani created the OpenMetadata project. In this episode they share the lessons that they have learned through their previous attempts and the positive impact that a unified metadata layer had during their time at Uber. They also explain how the OpenMetadata project aims to be a common standard for defining and storing metadata for every use case in data platforms, and the ways that they are architecting the reference implementation to simplify its adoption. This is an ambitious and exciting project, so listen and try it out today.
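As a rough illustration of what a unified metadata record might contain, the sketch below models a table entity as a plain Python structure. This is an illustrative shape only, not OpenMetadata's actual schema, which the project defines with JSON Schema (linked below).

```python
# Hypothetical shape of a centralized metadata record; not OpenMetadata's schema.
from dataclasses import dataclass, field


@dataclass
class ColumnMetadata:
    name: str
    data_type: str
    description: str = ""


@dataclass
class TableMetadata:
    fully_qualified_name: str          # e.g. service.database.schema.table
    owner: str
    tags: list = field(default_factory=list)
    columns: list = field(default_factory=list)


orders_table = TableMetadata(
    fully_qualified_name="warehouse.analytics.public.orders",
    owner="data-platform-team",
    tags=["tier:gold"],
    columns=[ColumnMetadata("order_id", "BIGINT", "Primary key")],
)
```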
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
Struggling with broken pipelines? Stale dashboards? Missing data? If this resonates with you, you’re not alone. Data engineers struggling with unreliable data need look no further than Monte Carlo, the world’s first end-to-end, fully automated Data Observability Platform! In the same way that application performance monitoring ensures reliable software and keeps application downtime at bay, Monte Carlo solves the costly problem of broken data pipelines. Monte Carlo monitors and alerts for data issues across your data warehouses, data lakes, ETL, and business intelligence, reducing time to detection and resolution from weeks or days to just minutes. Start trusting your data with Monte Carlo today! Visit dataengineeringpodcast.com/impact today to save your spot at IMPACT: The Data Observability Summit a half-day virtual event featuring the first U.S. Chief Data Scientist, founder of the Data Mesh, Creator of Apache Airflow, and more data pioneers spearheading some of the biggest movements in data. The first 50 to RSVP with this link will be entered to win an Oculus Quest 2 — Advanced All-In-One Virtual Reality Headset. RSVP today – you don’t want to miss it!
Are you bored with writing scripts to move data into SaaS tools like Salesforce, Marketo, or Facebook Ads? Hightouch is the easiest way to sync data into the platforms that your business teams rely on. The data you’re looking for is already in your data warehouse and BI tools. Connect your warehouse to Hightouch, paste a SQL query, and use their visual mapper to specify how data should appear in your SaaS systems. No more scripts, just SQL. Supercharge your business teams with customer data using Hightouch for Reverse ETL today. Get started for free at dataengineeringpodcast.com/hightouch.
Your host is Tobias Macey and today I’m interviewing Sriharsha Chintalapani and Suresh Srinivas about OpenMetadata, an open standard for metadata and a reference implementation for a central metadata store
Interview
Introduction
How did you get involved in the area of data management?
Can you describe what the OpenMetadata project is and the story behind it?
What are the goals of the project?
What are the common challenges faced by engineers and data practitioners in organizing the metadata for their systems?
What are the capabilities that a centralized and holistic view of a platform’s metadata can enable?
How would you characterize the current state and progress on the open source initiative around OpenMetadata?
How does OpenMetadata compare to the OpenLineage project and other similar systems?
What opportunities do you see for collaborating with or learning from their efforts?
What are the schema elements that you have identified as critical to a holistic view of an organization’s metadata?
For an organization with an existing data platform, what is the role that OpenMetadata plays, and what are the points of integration across the different components?
Can you describe the implementation of the OpenMetadata architecture?
What are the user experience and operational characteristics that you are trying to optimize for as you iterate on the project?
What are the challenges that you face in balancing the generality and specificity of the core schemas for metadata objects?
There are a large and growing number of businesses that create systems on top of an organization's metadata in the form of catalogs, observability, governance, data quality, etc. What do you see as the role of the OpenMetadata project across that ecosystem of products?
How has your perspective on the domain of metadata management and the associated challenges changed or evolved as you have been working on this project?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on OpenMetadata?
When is OpenMetadata the wrong choice?
What do you have planned for the future of OpenMetadata?
Contact Info
Suresh
LinkedIn
@suresh_m_s on Twitter
sureshms on GitHub
Sriharsha
LinkedIn
harshach on GitHub
@d3fmacro on Twitter
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Links
OpenMetadata
Apache Storm
Apache Kafka
Hortonworks
Apache Atlas
OpenMetadata Sandbox
OpenLineage
Podcast Episode
Egeria
JSON Schema
Amundsen
Podcast Episode
DataHub
Podcast Episode
JanusGraph
Titan Graph Database
HBase
Jetty
DropWizard
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
11/10/2021 • 1 hour, 6 minutes, 54 seconds
Business Intelligence Beyond The Dashboard With ClicData
Summary
Business intelligence is often equated with a collection of dashboards that show various charts and graphs representing data for an organization. What is overlooked in that characterization is the level of complexity and effort that are required to collect and present that information, and the opportunities for providing those insights in other contexts. In this episode Telmo Silva explains how he co-founded ClicData to bring full featured business intelligence and reporting to every organization without having to build and maintain that capability on their own. This is a great conversation about the technical and organizational operations involved in building a comprehensive business intelligence system and the current state of the market.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
Modern Data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days. Datafold helps Data teams gain visibility and confidence in the quality of their analytical data through data profiling, column-level lineage and intelligent anomaly detection. Datafold also helps automate regression testing of ETL code with its Data Diff feature that instantly shows how a change in ETL or BI code affects the produced data, both on a statistical level and down to individual rows and values. Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI workflows. Go to dataengineeringpodcast.com/datafold today to start a 30-day trial of Datafold.
Atlan is a collaborative workspace for data-driven teams, like Github for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $3000 on an annual subscription
Your host is Tobias Macey and today I’m interviewing Telmo Silva about ClicData, a full featured business intelligence and reporting platform
Interview
Introduction
How did you get involved in the area of data management?
Can you describe what ClicData is and the story behind it?
How would you characterize the current state of the market for business intelligence?
What are the systems/capabilities that are required to run a full-featured BI system?
What are the challenges that businesses face in developing in-house capacity for business intelligence?
Can you describe how the ClicData platform is architected?
How has it changed or evolved since you first began working on it?
How are you approaching schema design and evolution in the storage layer?
How do you handle questions of data security/privacy/regulations given that you are storing the information on behalf of the business?
In your work with clients what are some of the challenges that businesses are facing when attempting to answer questions and gain insights from their data in a repeatable fashion?
What are some strategies that you have found useful for structuring schemas or dashboards to make iterative exploration of data effective?
What are the most interesting, innovative, or unexpected ways that you have seen ClicData used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on ClicData?
When is ClicData the wrong choice?
What do you have planned for the future of ClicData?
Contact Info
LinkedIn
@telmo_clicdata on Twitter
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Links
ClicData
Tableau
Superset
Podcast Episode
Pentaho
D3.js
Informatica
Talend
TIBCO Spotfire
Looker
Podcast Episode
Bullet Chart
PostgreSQL
Podcast Episode
Azure
Crystal Reports
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
11/6/2021 • 1 hour, 2 minutes
Exploring The Evolution And Adoption of Customer Data Platforms and Reverse ETL
Summary
The precursor to widespread adoption of cloud data warehouses was the creation of customer data platforms. Acting as a centralized repository of information about how your customers interact with your organization, they drove a wave of analytics about how to improve products based on actual usage data. A natural outgrowth of that capability is the more recent growth of reverse ETL systems that use those analytics to feed back into the operational systems used to engage with the customer. In this episode Tejas Manohar and Rachel Bradley-Haas share the story of their own careers and experiences coinciding with these trends. They also discuss the current state of the market for these technological patterns and how to take advantage of them in your own work.
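To ground the reverse ETL pattern described above, here is a minimal sketch that reads an audience computed in a warehouse and pushes it into an operational SaaS tool. The warehouse table, API endpoint, and auth token are hypothetical placeholders, not Hightouch's API; SQLite stands in for the warehouse connection.

```python
# Hypothetical sketch of reverse ETL: warehouse rows synced into a SaaS tool.
import sqlite3  # stand-in for a warehouse connection

import requests

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE high_value_customers (email TEXT, lifetime_value REAL)")
conn.execute("INSERT INTO high_value_customers VALUES ('a@example.com', 1250.0)")

rows = conn.execute("SELECT email, lifetime_value FROM high_value_customers").fetchall()
for email, ltv in rows:
    # Push each record into the downstream tool (hypothetical endpoint and token).
    requests.post(
        "https://api.example-crm.com/v1/contacts",
        headers={"Authorization": "Bearer <token>"},
        json={"email": email, "properties": {"lifetime_value": ltv}},
        timeout=10,
    )
```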
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
Struggling with broken pipelines? Stale dashboards? Missing data? If this resonates with you, you’re not alone. Data engineers struggling with unreliable data need look no further than Monte Carlo, the world’s first end-to-end, fully automated Data Observability Platform! In the same way that application performance monitoring ensures reliable software and keeps application downtime at bay, Monte Carlo solves the costly problem of broken data pipelines. Monte Carlo monitors and alerts for data issues across your data warehouses, data lakes, ETL, and business intelligence, reducing time to detection and resolution from weeks or days to just minutes. Go to dataengineeringpodcast.com/montecarlo and start trusting your data with Monte Carlo today!
Are you bored with writing scripts to move data into SaaS tools like Salesforce, Marketo, or Facebook Ads? Hightouch is the easiest way to sync data into the platforms that your business teams rely on. The data you’re looking for is already in your data warehouse and BI tools. Connect your warehouse to Hightouch, paste a SQL query, and use their visual mapper to specify how data should appear in your SaaS systems. No more scripts, just SQL. Supercharge your business teams with customer data using Hightouch for Reverse ETL today. Get started for free at dataengineeringpodcast.com/hightouch.
Your host is Tobias Macey and today I’m interviewing Rachel Bradley-Haas and Tejas Manohar about the combination of operational analytics and the customer data platform
Interview
Introduction
How did you get involved in the area of data management?
Can we start by discussing what it means to have a "customer data platform"?
What are the challenges that organizations face in establishing a unified view of their customer interactions?
How do the presence of multiple product lines impact the ability to understand the relationship with the customer?
We have been building data warehouses and business intelligence systems for decades. How does the idea of a CDP differ from the approaches of those previous generations?
A recent outgrowth of the focus on creating a CDP is the introduction of "operational analytics", which was initially termed "reverse ETL". What are your opinions on the semantics and importance of these names?
What is the relationship between a CDP and operational analytics? (can you have one without the other?)
How have the capabilities of operational analytics systems changed or evolved in the past couple of years?
What new use cases or capabilities have been unlocked as a result of these changes?
What are the opportunities over the medium to long term for operational analytics and customer data platforms?
What are the most interesting, innovative, or unexpected ways that you have seen operational analytics and CDPs used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on operational analytics?
When is a CDP the wrong choice?
What other industry trends are you keeping an eye on? What do you anticipate will be the next breakout product category?
Contact Info
Rachel
LinkedIn
Tejas
LinkedIn
@tejasmanohar on Twitter
tejasmanohar on GitHub
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Links
Big-Time Data
Hightouch
Podcast Episode
Segment
Podcast Episode
Customer Data Platform
Treasure Data
Rudderstack
Airflow
DBT Cloud
Fivetran
Podcast Episode
Stitch
PLG == Product Led Growth
ABM == Account Based Marketing
Materialize
Podcast Episode
Transform
Podcast Episode
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
11/5/2021 • 1 hour, 2 minutes, 6 seconds
Removing The Barrier To Exploratory Analytics with Activity Schema and Narrator
Summary
The perennial question of data warehousing is how to model the information that you are storing. This has given rise to methods as varied as star and snowflake schemas, data vault modeling, and wide tables. The challenge with many of those approaches is that they are optimized for answering known questions but brittle and cumbersome when exploring unknowns. In this episode Ahmed Elsamadisi shares his journey to find a more flexible and universal data model in the form of the "activity schema" that is powering the Narrator platform, and how it has allowed his customers to perform self-service exploration of their business domains without being blocked by schema evolution in the data warehouse. This is a fascinating exploration of what can be done when you challenge your assumptions about what is possible.
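To make the activity schema idea concrete, the sketch below appends customer interactions to a single time-series table and answers a question with a temporal comparison over that one table. The column names are an illustrative approximation, not Narrator's exact specification.

```python
# Hypothetical sketch of an activity-schema style table: one row per
# (customer, activity, timestamp) rather than many purpose-built tables.
import pandas as pd

activity_stream = pd.DataFrame(
    [
        {"customer": "a@example.com", "activity": "viewed_page", "ts": "2021-10-01T09:00:00", "feature_1": "/pricing"},
        {"customer": "a@example.com", "activity": "started_trial", "ts": "2021-10-01T09:05:00", "feature_1": None},
        {"customer": "a@example.com", "activity": "completed_order", "ts": "2021-10-03T14:30:00", "feature_1": "199.00"},
    ]
)
activity_stream["ts"] = pd.to_datetime(activity_stream["ts"])

# Questions become temporal comparisons over one table,
# e.g. "time from first trial to first order" per customer.
first_trial = activity_stream[activity_stream["activity"] == "started_trial"].groupby("customer")["ts"].min()
first_order = activity_stream[activity_stream["activity"] == "completed_order"].groupby("customer")["ts"].min()
print((first_order - first_trial).rename("trial_to_order_time"))
```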
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
Atlan is a collaborative workspace for data-driven teams, like Github for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $3000 on an annual subscription
Modern Data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days. Datafold helps Data teams gain visibility and confidence in the quality of their analytical data through data profiling, column-level lineage and intelligent anomaly detection. Datafold also helps automate regression testing of ETL code with its Data Diff feature that instantly shows how a change in ETL or BI code affects the produced data, both on a statistical level and down to individual rows and values. Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI workflows. Go to dataengineeringpodcast.com/datafold today to start a 30-day trial of Datafold.
Your host is Tobias Macey and today I’m interviewing Ahmed Elsamadisi about Narrator, a platform to enable anyone to go from question to data-driven decision in minutes
Interview
Introduction
How did you get involved in the area of data management?
Can you describe what Narrator is and the story behind it?
What are the challenges that you have seen organizations encounter when attempting to make analytics a self-serve capability?
What are the use cases that you are focused on?
How does Narrator fit within the data workflows of an organization?
How is the Narrator platform implemented?
How has the design and focus of the technology evolved since you first started working on Narrator?
The core element of the analyses that you are building is the "activity schema". Can you describe the design process that led you to that format?
What are the challenges that are posed by more widely used modeling techniques such as star/snowflake or data vault?
How does the activity schema address those challenges?
What are the performance characteristics of deriving models from an activity schema/timeseries table?
For someone who wants to use Narrator, what is involved in transforming their data to map into the activity schema?
Can you talk through the domain modeling that needs to happen when determining what entities and actions to capture?
What are the most interesting, innovative, or unexpected ways that you have seen Narrator used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on Narrator?
When is Narrator the wrong choice?
What do you have planned for the future of Narrator?
Contact Info
LinkedIn
@ae4ai on Twitter
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
Narrator
DARPA Challenge
Fivetran
Luigi
Chartio
Airflow
Domain Driven Design
Data Vault
Snowflake Schema
Event Sourcing
Census
Podcast Episode
Hightouch
Podcast Episode
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
10/29/2021 • 1 hour, 8 minutes, 48 seconds
Streaming Data Pipelines Made SQL With Decodable
Summary
Streaming data systems have been growing more capable and flexible over the past few years. Despite this, it is still challenging to build reliable pipelines for stream processing. In this episode Eric Sammer discusses the shortcomings of the current set of streaming engines and how they force engineers to work at an extremely low level of abstraction. He also explains why he started Decodable to address that limitation and the work that he and his team have done to let data engineers build streaming pipelines entirely in SQL.
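Decodable's own interface is not shown here, but the sketch below gives a rough sense of the SQL-only abstraction discussed in the episode, using the open source Apache Flink Table API from Python (Flink appears in the links for this episode). The datagen and print connectors are Flink built-ins chosen so the example can run locally; the schema and table names are hypothetical.

from pyflink.table import EnvironmentSettings, TableEnvironment

t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())

# A source of synthetic click events (datagen is a built-in Flink connector).
t_env.execute_sql("""
    CREATE TABLE clicks (
        user_id INT,
        url STRING
    ) WITH (
        'connector' = 'datagen',
        'rows-per-second' = '5',
        'fields.user_id.min' = '1',
        'fields.user_id.max' = '10'
    )
""")

# A sink that simply prints its changelog to stdout.
t_env.execute_sql("""
    CREATE TABLE click_counts (
        user_id INT,
        clicks BIGINT
    ) WITH ('connector' = 'print')
""")

# The whole pipeline is declared in SQL, with no low-level stream operators.
# This statement runs continuously until interrupted.
t_env.execute_sql("""
    INSERT INTO click_counts
    SELECT user_id, COUNT(*) AS clicks
    FROM clicks
    GROUP BY user_id
""").wait()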
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
Struggling with broken pipelines? Stale dashboards? Missing data? If this resonates with you, you’re not alone. Data engineers struggling with unreliable data need look no further than Monte Carlo, the world’s first end-to-end, fully automated Data Observability Platform! In the same way that application performance monitoring ensures reliable software and keeps application downtime at bay, Monte Carlo solves the costly problem of broken data pipelines. Monte Carlo monitors and alerts for data issues across your data warehouses, data lakes, ETL, and business intelligence, reducing time to detection and resolution from weeks or days to just minutes. Start trusting your data with Monte Carlo today! Visit dataengineeringpodcast.com/impact today to save your spot at IMPACT: The Data Observability Summit a half-day virtual event featuring the first U.S. Chief Data Scientist, founder of the Data Mesh, Creator of Apache Airflow, and more data pioneers spearheading some of the biggest movements in data. The first 50 to RSVP with this link will be entered to win an Oculus Quest 2 — Advanced All-In-One Virtual Reality Headset. RSVP today – you don’t want to miss it!
Are you bored with writing scripts to move data into SaaS tools like Salesforce, Marketo, or Facebook Ads? Hightouch is the easiest way to sync data into the platforms that your business teams rely on. The data you’re looking for is already in your data warehouse and BI tools. Connect your warehouse to Hightouch, paste a SQL query, and use their visual mapper to specify how data should appear in your SaaS systems. No more scripts, just SQL. Supercharge your business teams with customer data using Hightouch for Reverse ETL today. Get started for free at dataengineeringpodcast.com/hightouch.
Your host is Tobias Macey and today I’m interviewing Eric Sammer about Decodable, a platform for simplifying the work of building real-time data pipelines
Interview
Introduction
How did you get involved in the area of data management?
Can you describe what Decodable is and the story behind it?
Who are the target users, and how has that focus informed your prioritization of features at launch?
What are the complexities that data engineers encounter when building pipelines on streaming systems?
What are the distributed systems concepts and design optimizations that are often skipped over or misunderstood by engineers who are using them? (e.g. backpressure, exactly once semantics, isolation levels, etc.)
How do those mismatches in understanding and expectation impact the correctness and reliability of the workflows that they are building?
Can you describe how you have architected the Decodable platform?
What have been the most complex or time consuming engineering challenges that you have dealt with so far?
What are the points of integration that you expose for engineers to wire in their existing infrastructure and data systems?
What has been your process for designing the interfaces and abstractions that you are exposing to end users?
What are some of the leaks in those abstractions that have either started to show or are anticipated?
What have you learned about the state of data engineering and the costs and benefits of real-time data while working on Decodable?
What are the most interesting, innovative, or unexpected ways that you have seen Decodable used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on Decodable?
When is Decodable the wrong choice?
What do you have planned for the future of Decodable?
Contact Info
esammer on GitHub
@esammer on Twitter
LinkedIn
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
Decodable
Cloudera
Kafka
Flink
Podcast Episode
Spark
Snowflake
Podcast Episode
BigQuery
RedShift
kSQLDB
Podcast Episode
dbt
Podcast Episode
Millwheel Paper
Dremel Paper
Timely Dataflow
Materialize
Podcast Episode
Software Defined Networking
Data Mesh
Podcast Episode
OpenLineage
Podcast Episode
DataHub
Podcast Episode
Amundsen
Podcast Episode
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
10/29/2021 • 1 hour, 9 minutes, 32 seconds
Data Exploration For Business Users Powered By Analytics Engineering With Lightdash
Summary
The market for business intelligence has been going through an evolutionary shift in recent years. One of the driving forces for that change has been the rise of analytics engineering powered by dbt. Lightdash has fully embraced that shift by building an entire open source business intelligence framework that is powered by dbt models. In this episode Oliver Laslett describes why dashboards aren’t sufficient for business analytics, how Lightdash builds on the modeling work that you are already doing in your data warehouse with dbt, and how the team is focusing on bridging the divide between data teams and business teams and the differing requirements that each has for data workflows.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
Are you bored with writing scripts to move data into SaaS tools like Salesforce, Marketo, or Facebook Ads? Hightouch is the easiest way to sync data into the platforms that your business teams rely on. The data you’re looking for is already in your data warehouse and BI tools. Connect your warehouse to Hightouch, paste a SQL query, and use their visual mapper to specify how data should appear in your SaaS systems. No more scripts, just SQL. Supercharge your business teams with customer data using Hightouch for Reverse ETL today. Get started for free at dataengineeringpodcast.com/hightouch.
Modern Data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days. Datafold helps Data teams gain visibility and confidence in the quality of their analytical data through data profiling, column-level lineage and intelligent anomaly detection. Datafold also helps automate regression testing of ETL code with its Data Diff feature that instantly shows how a change in ETL or BI code affects the produced data, both on a statistical level and down to individual rows and values. Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI workflows. Go to dataengineeringpodcast.com/datafold today to start a 30-day trial of Datafold.
Your host is Tobias Macey and today I’m interviewing Oliver Laslett about Lightdash, an open source business intelligence system powered by your dbt models
Interview
Introduction
How did you get involved in the area of data management?
Can you describe what Lightdash is and the story behind it?
What are the main goals of the project?
Who are the target users, and how has that profile informed your feature priorities?
Business intelligence is a market that has gone through several generational shifts, with products targeting numerous personas and purposes. What are the capabilities that make Lightdash stand out from the other options?
Can you describe how Lightdash is architected?
How have the design and goals of the system changed or evolved since you first began working on it?
What have been the most challenging engineering problems that you have dealt with?
How does the approach that you are taking with Lightdash compare to systems such as Transform and Metriql that aim to provide a dedicated metrics layer?
Can you describe the workflow for someone building an analysis in Lightdash?
What are the points of collaboration around Lightdash for different roles in the organization?
What are the methods that you use to expose information about the state of the underlying dbt models to the end users?
How do they use that information in their exploration and decision making?
What was your motivation for releasing Lightdash as open source?
How are you handling the governance and long-term viability of the project?
What are the most interesting, innovative, or unexpected ways that you have seen Lightdash used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on Lightdash?
When is Lightdash the wrong choice?
What do you have planned for the future of Lightdash?
Contact Info
LinkedIn
owlas on GitHub
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
Lightdash
Looker
Podcast Episode
PowerBI
Podcast Episode
Redash
Podcast Episode
Metabase
Podcast Episode
dbt
Podcast Episode
Superset
Podcast Episode
Streamlit
Podcast Episode
Kubernetes
JDBC
SQLAlchemy
SQLPad
Singer
Podcast Episode
Airbyte
Podcast Episode
Meltano
Podcast Episode
Transform
Podcast Episode
Metriql
Podcast Episode
Cube.js
OpenLineage
Podcast Episode
dbt Packages
Rudderstack
PostHog
Podcast Interview
Firebolt
Podcast Interview
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
10/23/2021 • 1 hour, 6 minutes, 2 seconds
Completing The Feedback Loop Of Data Through Operational Analytics With Census
Summary
The focus of the past few years has been to consolidate all of the organization’s data into a cloud data warehouse. As a result there have been a number of trends in data that take advantage of the warehouse as a single focal point. Among those trends is the advent of operational analytics, which completes the cycle of data from collection, through analysis, to driving further action. In this episode Boris Jabes, CEO of Census, explains how the work of synchronizing cleaned and consolidated data about your customers back into the systems that you use to interact with those customers allows for a powerful feedback loop that has been missing in data systems until now. He also discusses how Census makes that synchronization easy to manage, how it fits with the growth of data quality tooling, and how you can start using it today.
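The sketch below is not Census itself, just a bare-bones illustration of the operational analytics (reverse ETL) pattern it manages: read modeled rows out of the warehouse and upsert them into the SaaS tools where business teams work. The table name, API endpoint, and field mapping are all hypothetical placeholders.

import sqlite3   # stand-in for a real warehouse connection
import requests

conn = sqlite3.connect("warehouse.db")
rows = conn.execute(
    "SELECT email, lifetime_value, churn_risk FROM customer_facts"
).fetchall()

for email, lifetime_value, churn_risk in rows:
    # Upsert each modeled record into the operational tool, keyed on email.
    requests.post(
        "https://api.example-crm.com/contacts/upsert",   # hypothetical endpoint
        json={
            "email": email,
            "lifetime_value": lifetime_value,
            "churn_risk": churn_risk,
        },
        timeout=10,
    )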
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
Struggling with broken pipelines? Stale dashboards? Missing data? If this resonates with you, you’re not alone. Data engineers struggling with unreliable data need look no further than Monte Carlo, the world’s first end-to-end, fully automated Data Observability Platform! In the same way that application performance monitoring ensures reliable software and keeps application downtime at bay, Monte Carlo solves the costly problem of broken data pipelines. Monte Carlo monitors and alerts for data issues across your data warehouses, data lakes, ETL, and business intelligence, reducing time to detection and resolution from weeks or days to just minutes. Start trusting your data with Monte Carlo today! Visit dataengineeringpodcast.com/impact today to save your spot at IMPACT: The Data Observability Summit a half-day virtual event featuring the first U.S. Chief Data Scientist, founder of the Data Mesh, Creator of Apache Airflow, and more data pioneers spearheading some of the biggest movements in data. The first 50 to RSVP with this link will be entered to win an Oculus Quest 2 — Advanced All-In-One Virtual Reality Headset. RSVP today – you don’t want to miss it!
Atlan is a collaborative workspace for data-driven teams, like Github for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $3000 on an annual subscription
Your host is Tobias Macey and today I’m interviewing Boris Jabes about Census and the growing category of operational analytics
Interview
Introduction
How did you get involved in the area of data management?
Can you describe what Census is and the story behind it?
The terms "reverse ETL" and "operational analytics" have started being used for similar, and often interchangeable, purposes. What are your thoughts on the semantic and concrete differences between these phrases?
What are the motivating factors for adding operational analytics or "data activation" to an organization’s data platform?
This is a nascent but quickly growing market with a number of products and projects operating in the space. How would you characterize the current state of the segment and Census’ position in it?
Can you describe how the Census platform is implemented?
What are some of the early design choices that have had to be refactored or augmented as you have evolved the product and worked with customers?
What are some of the assumptions that you had about the needs and uses for the platform which have been challenged or changed as you dug deeper into the problem?
Can you describe the workflow for a customer adopting Census?
What are some of the data modeling practices that make it easier to "activate" the organization’s data?
Another recent trend in the data industry is the growth of data quality and data lineage tools. What is involved in using the measured quality or lineage information as a signal in the operational systems, or to prevent a synchronization?
How can users test and validate their workflows in Census?
What are the options for propagating Census’ runtime information back into lineage and data quality tracking?
Census supports incremental syncs from the warehouse. What are the opportunities for bringing streaming architectures to the space of operational analytics?
What are the challenges/complexities in the current set of technologies that act as a barrier?
What are the most interesting, innovative, or unexpected ways that you have seen Census used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on Census?
When is Census the wrong choice?
What do you have planned for the future of Census?
Contact Info
LinkedIn
Website
@borisjabes on Twitter
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
Census
Operational Analytics
Fivetran
Podcast Episode
dbt
Podcast Episode
Snowflake
Podcast Episode
Loom
Materialize
Podcast Episode
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
10/21/2021 • 1 hour, 9 minutes, 6 seconds
Bringing The Power Of The DataHub Real-Time Metadata Graph To Everyone At Acryl Data
Summary
The binding element of all data work is the metadata graph that is generated by all of the workflows that produce the assets used by teams across the organization. The DataHub project was created as a way to bring order to the scale of LinkedIn’s data needs. It was also designed to work for small-scale systems that are just starting to develop in complexity. In order to support the project and make it even easier to use for organizations of every size, Shirshanka Das and Swaroop Jagadish founded Acryl Data. In this episode they discuss the recent work that has been done by the community, how their work is building on top of that foundation, and how you can get started with DataHub for your own work to manage data discovery today. They also share their ambitions for the near future of adding data observability and data quality management features.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
Are you bored with writing scripts to move data into SaaS tools like Salesforce, Marketo, or Facebook Ads? Hightouch is the easiest way to sync data into the platforms that your business teams rely on. The data you’re looking for is already in your data warehouse and BI tools. Connect your warehouse to Hightouch, paste a SQL query, and use their visual mapper to specify how data should appear in your SaaS systems. No more scripts, just SQL. Supercharge your business teams with customer data using Hightouch for Reverse ETL today. Get started for free at dataengineeringpodcast.com/hightouch.
Modern Data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days. Datafold helps Data teams gain visibility and confidence in the quality of their analytical data through data profiling, column-level lineage and intelligent anomaly detection. Datafold also helps automate regression testing of ETL code with its Data Diff feature that instantly shows how a change in ETL or BI code affects the produced data, both on a statistical level and down to individual rows and values. Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI workflows. Go to dataengineeringpodcast.com/datafold today to start a 30-day trial of Datafold. Once you sign up and create an alert in Datafold for your company data, they will send you a cool water flask.
Your host is Tobias Macey and today I’m interviewing Shirshanka Das and Swaroop Jagadish about Acryl Data, the company driving the open source metadata project DataHub for powering data discovery, data observability and federated data governance.
Interview
Introduction
How did you get involved in the area of data management?
Can you describe what Acryl Data is and the story behind it?
How has your experience of building and running DataHub at LinkedIn informed your product direction for Acryl?
What are some lessons that your co-founder Swaroop has contributed from his experience at AirBnB?
The data catalog/discovery/quality market has become very active over the past year. What is your perspective on the market, and what are the gaps that are not yet being addressed?
How does the focus of Acryl compare to what the team at Metaphor are building?
How has the DataHub project changed in the past year with more companies outside of LinkedIn using and contributing to it?
What are your plans for Data Observability?
Can you describe the system architecture that you have built at Acryl?
What are the convenience features that you are building to augment the capabilities and integration process for DataHub?
What are some typical workflows that data teams build out when working with Acryl?
What are some examples of automated actions that can be triggered from metadata changes?
What are the available events that can be used to trigger actions?
What are some of the challenges that teams are facing when integrating metadata management and analysis into their data workflows?
What are your thoughts on the potential for the Open Lineage and Open metadata projects?
How is the governance of DataHub being managed?
What are the most interesting, innovative, or unexpected ways that you have seen Acryl/DataHub used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on Acryl/DataHub?
When is Acryl the wrong choice?
What do you have planned for the future of Acryl?
Contact Info
Shirshanka
LinkedIn
@shirshanka on Twitter
shirshanka on GitHub
Swaroop
LinkedIn
@arudis on Twitter
swaroopjagadish on GitHub
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
Acryl Data
DataHub
Hudi
Podcast Episode
Iceberg
Podcast Episode
Delta Lake
Podcast Episode
Apache Gobblin
Airflow
Superset
Podcast Episode
Collibra
Podcast Episode
Alation
Strata Conference Presentation
Acryl/DataHub Ingestion Framework
Joe Hellerstein
Trifacta
DataHub Roadmap
Data Mesh
OpenLineage
Podcast Episode
OpenMetadata
Egeria Open Metadata
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
10/16/2021 • 1 hour, 8 minutes, 18 seconds
How And Why To Become Data Driven As A Business
Summary
Organizations of all sizes are striving to become data driven, starting in earnest with the rise of big data a decade ago. With the never-ending growth in data sources and methods for aggregating and analyzing them, the use of data to direct the business has become a requirement. Randy Bean has been helping enterprise organizations define and execute their data strategies since before the age of big data. In this episode he discusses his experiences and how he approached the work of distilling them for his book "Fail Fast, Learn Faster". This is an entertaining and enlightening exploration of the business side of data with an industry veteran.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
Struggling with broken pipelines? Stale dashboards? Missing data? If this resonates with you, you’re not alone. Data engineers struggling with unreliable data need look no further than Monte Carlo, the world’s first end-to-end, fully automated Data Observability Platform! In the same way that application performance monitoring ensures reliable software and keeps application downtime at bay, Monte Carlo solves the costly problem of broken data pipelines. Monte Carlo monitors and alerts for data issues across your data warehouses, data lakes, ETL, and business intelligence, reducing time to detection and resolution from weeks or days to just minutes. Start trusting your data with Monte Carlo today! Visit dataengineeringpodcast.com/impact today to save your spot at IMPACT: The Data Observability Summit a half-day virtual event featuring the first U.S. Chief Data Scientist, founder of the Data Mesh, Creator of Apache Airflow, and more data pioneers spearheading some of the biggest movements in data. The first 50 to RSVP with this link will be entered to win an Oculus Quest 2 — Advanced All-In-One Virtual Reality Headset. RSVP today – you don’t want to miss it!
Atlan is a collaborative workspace for data-driven teams, like Github for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $3000 on an annual subscription
Your host is Tobias Macey and today I’m interviewing Randy Bean about his recent book focusing on the use of big data and AI for informing data driven business leadership
Interview
Introduction
How did you get involved in the area of data management?
Can you start by discussing the focus of the book and what motivated you to write it?
Who is the intended audience, and how did that inform the tone and content?
Businesses and their officers have been aiming to be "data driven" for years. In your experience, what are the concrete goals that are implied by that term?
What are the barriers that organizations encounter in the pursuit of those goals?
How have the success rates (real and imagined) shifted in recent years as the level of sophistication of the tools and industry for data management has increased?
What is the state of data initiatives in leading corporations today?
What are the biggest opportunities and risks that organizations focus on related to their use of data?
At what level(s) of the organization do lessons around data ethics need to be embedded?
You have been working with large companies for many years to help them with their adoption of "big data". How has your work on this book shifted or clarified your perspectives on the subject?
What are the main lessons or ideas that you hope readers will take away from the book?
What are the most interesting, innovative, or unexpected ways that you have seen big data applied to business?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on this book?
What are your predictions for the next decade of big data and AI?
Contact Info
@RandyBeanNVP on Twitter
LinkedIn
Email
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
Fail Fast, Learn Faster: Lessons in Data-Driven Leadership in an Age of Disruption, Big Data, and AI (affiliate link)
Book Website
Harvard Business Review
MIT Sloan Review
New Vantage Partners
COBOL
Moneyball
Weapons of Math Destruction
The Seven Roles of the Chief Data Officer
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
10/14/2021 • 1 hour, 1 minute, 59 seconds
Make Your Business Metrics Reusable With Open Source Headless BI Using Metriql
Summary
The key to making data valuable to business users is the ability to calculate meaningful metrics and explore them along useful dimensions. Business intelligence tools have provided this capability for years, but they don’t offer a means of exposing those metrics to other systems. Metriql is an open source project that provides a headless BI system where you can define your metrics and share them with all of your other processes. In this episode Burak Kabakcı shares the story behind the project, how you can use it to create your metrics definitions, and the benefits of treating the semantic layer as a dedicated component of your platform.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
Atlan is a collaborative workspace for data-driven teams, like Github for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $3000 on an annual subscription
Modern Data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days. Datafold helps Data teams gain visibility and confidence in the quality of their analytical data through data profiling, column-level lineage and intelligent anomaly detection. Datafold also helps automate regression testing of ETL code with its Data Diff feature that instantly shows how a change in ETL or BI code affects the produced data, both on a statistical level and down to individual rows and values. Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI workflows. Go to dataengineeringpodcast.com/datafold today to start a 30-day trial of Datafold. Once you sign up and create an alert in Datafold for your company data, they will send you a cool water flask.
Your host is Tobias Macey and today I’m interviewing Burak Emre Kabakcı about Metriql, a headless BI and metrics layer for your data stack
Interview
Introduction
How did you get involved in the area of data management?
Can you describe what Metriql is and the story behind it?
What are the characteristics and benefits of a "headless BI" system?
What was your motivation to create and open-source Metriql as an independent project outside of your business?
How are you approaching governance and sustainability of the project?
How does Metriql compare to projects such as AirBnB’s Minerva or Transform’s platform?
How does the industry/vertical of a business impact their ability to benefit from a metrics layer/headless BI?
What are the limitations to the logical complexity that can be applied to the calculation of a given metric/set of metrics?
Can you describe how Metriql is implemented?
How have the design and goals of the project changed or evolved since you began working on it?
What are the most complex/difficult engineering elements of building a metrics layer?
Can you describe the workflow of defining metrics?
What have been your guiding principles in defining the user experience for working with metriql?
What are the opportunities for including business users in the definition of metrics? (e.g. pushing down/generating definitions from a BI layer)
What are the biggest challenges and limitations of creating metrics definitions purely in SQL?
What are the options for exposing metrics back to the warehouse and other operational systems such as reverse ETL vendors?
What are the missing elements in the data ecosystem for taking full advantage of a headless BI/metrics layer?
What are the most interesting, innovative, or unexpected ways that you have seen Metriql used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on Metriql?
When is Metriql the wrong choice?
What do you have planned for the future of Metriql?
Contact Info
LinkedIn
Website
buremba on GitHub
@bu7emba on Twitter
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
Metriql
Rakam
Hazelcast
Headless BI
Google Data Studio
Superset
Podcast Episode
Podcast.__init__ Episode
Trino
Podcast Episode
Supergrain
The Missing Piece Of The Modern Data Stack article by Benn Stancil
Metabase
Podcast Episode
dbt
Podcast Episode
dbt-metabase
re_data
OpenMetadata
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
10/8/2021 • 43 minutes, 37 seconds
Adding Support For Distributed Transactions To The Redpanda Streaming Engine
Summary
Transactions are a necessary feature for ensuring that a set of actions are all performed as a single unit of work. In streaming systems this is necessary to ensure that a set of messages or transformations are all executed together across different queues. In this episode Denis Rystsov explains how he added support for transactions to the Redpanda streaming engine. He discusses the use cases for transactions, the different strategies, semantics, and guarantees that they might need to support, and how his implementation ended up improving the performance of bulk write operations. This is an interesting deep dive into the internals of a high performance streaming engine and the details that are involved in building distributed systems.
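Because Redpanda exposes the Kafka API, the transactions discussed in the episode are driven from standard Kafka clients. Here is a minimal sketch using the confluent-kafka Python client; the broker address, transactional id, and topic names are placeholders.

from confluent_kafka import Producer, KafkaException

producer = Producer({
    "bootstrap.servers": "localhost:9092",   # a Redpanda broker
    "transactional.id": "orders-writer-1",   # identifies this transactional producer
})

producer.init_transactions()
producer.begin_transaction()
try:
    # Both writes become visible to read_committed consumers atomically, or not at all.
    producer.produce("orders", key=b"order-42", value=b'{"total": 99.5}')
    producer.produce("order-audit", key=b"order-42", value=b'{"event": "created"}')
    producer.commit_transaction()
except KafkaException:
    producer.abort_transaction()
    raise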
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
Struggling with broken pipelines? Stale dashboards? Missing data? If this resonates with you, you’re not alone. Data engineers struggling with unreliable data need look no further than Monte Carlo, the world’s first end-to-end, fully automated Data Observability Platform! In the same way that application performance monitoring ensures reliable software and keeps application downtime at bay, Monte Carlo solves the costly problem of broken data pipelines. Monte Carlo monitors and alerts for data issues across your data warehouses, data lakes, ETL, and business intelligence, reducing time to detection and resolution from weeks or days to just minutes. Start trusting your data with Monte Carlo today! Visit dataengineeringpodcast.com/impact today to save your spot at IMPACT: The Data Observability Summit a half-day virtual event featuring the first U.S. Chief Data Scientist, founder of the Data Mesh, Creator of Apache Airflow, and more data pioneers spearheading some of the biggest movements in data. The first 50 to RSVP with this link will be entered to win an Oculus Quest 2 — Advanced All-In-One Virtual Reality Headset. RSVP today – you don’t want to miss it!
Atlan is a collaborative workspace for data-driven teams, like Github for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $3000 on an annual subscription
Your host is Tobias Macey and today I’m interviewing Denis Rystsov about implementing transactions in the RedPanda streaming engine
Interview
Introduction
How did you get involved in the area of data management?
Can you quickly recap what RedPanda is and the goals of the project?
What are the use cases for transactions in a pub/sub messaging system?
What are the elements of streaming systems that make atomic transactions a complex problem?
What was the motivation for starting down the path of adding transactions to the RedPanda engine?
How did the constraint of supporting the Kafka API influence your implementation strategy for transaction semantics?
Can you talk through the details of how you ended up implementing transactions in RedPanda?
What are some of the roadblocks and complexities that you encountered while working through the implementation?
How did you approach the validation and verification of the transactions?
What other features or capabilities are you planning to work on next?
What are the most interesting, innovative, or unexpected ways that you have seen transactions in RedPanda used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on transactions for RedPanda?
When are transactions the wrong choice?
What do you have planned for the future of transaction support in RedPanda?
Contact Info
@rystsov on twitter
LinkedIn
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
Vectorized
RedPanda
Podcast Episode
RedPanda Transactions Post
Yandex
Cassandra
MongoDB
Riak
Cosmos DB
Jepsen
Podcast Episode
Testing Shared Memories paper
Journal of Systems Research
Kafka
Pulsar
Seastar Framework
CockroachDB
Podcast Episode
TiDB
Calvin Paper
Polyjuice Paper
Parallel Commit
Chaos Testing
Matchmaker Paxos Algorithm
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
10/6/2021 • 45 minutes, 58 seconds
Building Real-Time Data Platforms For Large Volumes Of Information With Aerospike
Summary
Aerospike is a database engine that is designed to provide millisecond response times for queries across terabytes or petabytes. In this episode Chief Strategy Officer Lenley Hensarling explains how the ability to process these large volumes of information in real time allows businesses to unlock entirely new capabilities. He also discusses the technical implementation that allows for such extreme performance and how the data model contributes to the scalability of the system. If you need to deal with massive data, at high velocities, in milliseconds, then Aerospike is definitely worth learning about.
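For a sense of the key/value data model discussed in the episode, here is a small sketch using the official Aerospike Python client; the host, namespace, and set names are placeholders for a local test cluster.

import aerospike

client = aerospike.client({"hosts": [("127.0.0.1", 3000)]}).connect()

# A record is addressed by (namespace, set, user key) and holds named bins.
key = ("test", "users", "user-123")
client.put(key, {"name": "Ada", "plan": "pro", "visits": 1})

# Reads and in-place updates against the record.
(_, _, record) = client.get(key)
client.increment(key, "visits", 1)

client.close()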
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
Atlan is a collaborative workspace for data-driven teams, like Github for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $3000 on an annual subscription
Modern data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days or even weeks. By the time errors have made their way into production, it’s often too late and damage is done. Datafold’s proactive approach to data quality helps data teams gain visibility and confidence in the quality of their analytical data through data profiling, column-level lineage and intelligent anomaly detection. Datafold also helps automate regression testing of ETL code with its Data Diff feature that instantly shows how a change in ETL or BI code affects the produced data, both on a statistical level and down to individual rows and values. Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI workflows. Visit dataengineeringpodcast.com/datafold today to book a demo with Datafold.
Your host is Tobias Macey and today I’m interviewing Lenley Hensarling about Aerospike and building real-time data platforms
Interview
Introduction
How did you get involved in the area of data management?
Can you describe what Aerospike is and the story behind it?
What are the use cases that it is uniquely well suited for?
What are the use cases that you and the Aerospike team are focusing on, and how does that influence your priorities for feature development and user experience?
What are the driving factors for building a real-time data platform?
How is Aerospike being incorporated in application and data architectures?
Can you describe how the Aerospike engine is architected?
How have the design and architecture changed or evolved since it was first created?
How have market forces influenced the product priorities and focus?
What are the challenges that end users face when determining how to model their data given a key/value storage interface?
What are the abstraction layers that you and/or your users build to manage relational or hierarchical data architectures?
What are the operational characteristics of the Aerospike system? (e.g. deployment, scaling, CP vs AP, upgrades, clustering, etc.)
What are the most interesting, innovative, or unexpected ways that you have seen Aerospike used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on Aerospike?
When is Aerospike the wrong choice?
What do you have planned for the future of Aerospike?
Contact Info
LinkedIn
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
Aerospike
GitHub
EnterpriseDB
"Nobody Expects The Spanish Inquisition"
ARM CPU Architectures
AWS Graviton Processors
The Datacenter Is The Computer (Affiliate link)
Jepsen Tests
Podcast Episode
Cloud Native Computing Foundation
Prometheus
Grafana
OpenTelemetry
Podcast.__init__ Episode
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
10/2/2021 • 1 hour, 7 minutes, 38 seconds
Delivering Your Personal Data Cloud With Prifina
Summary
The promise of online services is that they will make your life easier in exchange for collecting data about you. The reality is that they use more information than you realize for purposes that are not what you intended. There have been many attempts to harness all of the data that you generate for gaining useful insights about yourself, but they are generally difficult to set up and manage or require software development experience. The team at Prifina have built a platform that allows users to create their own personal data cloud and install applications built by developers that power useful experiences while keeping you in full control. In this episode Markus Lampinen shares the goals and vision of the company, the technical aspects of making it a reality, and the future vision for how services can be designed to respect users’ privacy while still providing compelling experiences.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
Atlan is a collaborative workspace for data-driven teams, like Github for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $3000 on an annual subscription
Struggling with broken pipelines? Stale dashboards? Missing data? If this resonates with you, you’re not alone. Data engineers struggling with unreliable data need look no further than Monte Carlo, the world’s first end-to-end, fully automated Data Observability Platform! In the same way that application performance monitoring ensures reliable software and keeps application downtime at bay, Monte Carlo solves the costly problem of broken data pipelines. Monte Carlo monitors and alerts for data issues across your data warehouses, data lakes, ETL, and business intelligence, reducing time to detection and resolution from weeks or days to just minutes. Start trusting your data with Monte Carlo today! Visit dataengineeringpodcast.com/impact today to save your spot at IMPACT: The Data Observability Summit a half-day virtual event featuring the first U.S. Chief Data Scientist, founder of the Data Mesh, Creator of Apache Airflow, and more data pioneers spearheading some of the biggest movements in data. The first 50 to RSVP with this link will be entered to win an Oculus Quest 2 — Advanced All-In-One Virtual Reality Headset. RSVP today – you don’t want to miss it!
Your host is Tobias Macey and today I’m interviewing Markus Lampinen about Prifina, a platform for building applications powered by personal data that is under the user’s control
Interview
Introduction
How did you get involved in the area of data management?
Can you describe what Prifina is and the story behind it?
What are the primary goals of Prifina?
There has been a lot of interest in the "quantified self" and different projects (many of which are open source) that aim to aggregate all of a user’s data into a single system for analysis and integration. What was lacking in the ecosystem that makes Prifina necessary/valuable?
What are some of the personalized applications for this data that have been most compelling or that users are most interested in?
What are the sources of complexity that you are facing when managing access/privacy of user’s data?
Can you describe the architecture of the platform that you are building?
What are the technological/social/economic underpinnings that are necessary to make a platform like Prifina possible?
What are the assumptions that you had when you first became involved in the project which have been challenged or invalidated as you worked through the implementation and began engaging with users and developers?
How do you approach schema definition/management for developers to have a stable implementation target?
How has that schema evolved as you introduced new data sources?
What are the barriers that you and your users have to deal with when obtaining copies of their data for use with Prifina?
What are the potential threats that you anticipate for users gaining and maintaining control of their own data?
What are the untapped opportunities?
What are the topics where you have had to invest the most in user education?
What are the most interesting, innovative, or unexpected ways that you have seen Prifina used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on Prifina?
When is Prifina the wrong choice?
What do you have planned for the future of Prifina?
Contact Info
LinkedIn
@mmlampinen on Twitter
mmlampinen on GitHub
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Links
Prifina
Google Takeout
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
9/30/2021 • 1 hour, 12 minutes, 11 seconds
Digging Into Data Reliability Engineering
Summary
The accuracy and availability of data has become critically important to the day-to-day operation of businesses. Similar to the practice of site reliability engineering as a means of ensuring consistent uptime of web services, there has been a new trend of building data reliability engineering practices in companies that rely heavily on their data. In this episode Egor Gryaznov explains how this practice manifests from a technical and organizational perspective and how you can start adopting it in your own teams.
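As a toy illustration of what data reliability engineering formalizes, the sketch below runs a freshness and volume assertion against a single table, the kind of check that platforms in this space automate and alert on. The connection, table, and thresholds are hypothetical, and this is not Bigeye's product.

import datetime
import sqlite3   # stand-in for a real warehouse connection

conn = sqlite3.connect("warehouse.db")
row_count, max_loaded_at = conn.execute(
    "SELECT COUNT(*), MAX(loaded_at) FROM orders"
).fetchone()

age = datetime.datetime.utcnow() - datetime.datetime.fromisoformat(max_loaded_at)

# Enforce simple "data SLOs", mirroring how uptime SLOs are enforced for web services.
if age > datetime.timedelta(hours=6):
    raise RuntimeError(f"orders is stale: last load was {age} ago")
if row_count < 1000:
    raise RuntimeError(f"orders volume anomaly: only {row_count} rows loaded")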
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
Atlan is a collaborative workspace for data-driven teams, like Github for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $3000 on an annual subscription
Schema changes, missing data, and volume anomalies caused by your data sources can happen without any advance notice if you lack visibility into your data-in-motion. That leaves DataOps reactive to data quality issues and can make your consumers lose confidence in your data. By connecting to your pipeline orchestrator like Apache Airflow and centralizing your end-to-end metadata, Databand.ai lets you identify data quality issues and their root causes from a single dashboard. With Databand.ai, you’ll know whether the data moving from your sources to your warehouse will be available, accurate, and usable when it arrives. Go to dataengineeringpodcast.com/databand to sign up for a free 30-day trial of Databand.ai and take control of your data quality today.
Your host is Tobias Macey and today I’m interviewing Egor Gryaznov, co-founder and CTO of Bigeye, about the ideas and practices of data reliability engineering and how to integrate it into your systems
Interview
Introduction
How did you get involved in the area of data management?
What does the term "Data Reliability Engineering" mean?
What is encompassed under the umbrella of Data Reliability Engineering?
How does it compare to the concepts from site reliability engineering?
Is DRE just a repackaged version of DataOps?
Why is Data Reliability Engineering particularly important now?
Who is responsible for the practice of DRE in an organization?
What are some areas of innovation that teams are focusing on to support a DRE practice?
What are the tools that teams are using to improve the reliability of their data operations?
What are the organizational systems that need to be in place to support a DRE practice?
What are some potential roadblocks that teams might have to address when planning and implementing a DRE strategy?
What are the most interesting, innovative, or unexpected approaches/solutions to DRE that you have seen?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on Data Reliability Engineering?
Is Data Reliability Engineering ever the wrong choice?
What do you have planned for the future of Bigeye, especially in terms of Data Reliability Engineering?
Contact Info
Find us at bigeye.com or reach out to us at hello@bigeye.com
You can find Egor on LinkedIn or email him at egor@bigeye.com
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
Bigeye
Podcast Episode
Vertica
Looker
Podcast Episode
Site Reliability Engineering
Stemma
Podcast Episode
Collibra
Podcast Episode
OpenLineage
Podcast Episode
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
9/26/2021 • 58 minutes, 7 seconds
Massively Parallel Data Processing In Python Without The Effort Using Bodo
Summary
Python has become the de facto language for working with data. That has brought with it a number of challenges having to do with the speed and scalability of working with large volumes of information. There have been many projects and strategies for overcoming these challenges, each with its own set of tradeoffs. In this episode Ehsan Totoni explains how he built the Bodo project to bring the speed and processing power of HPC techniques to the Python data ecosystem without requiring any re-work.
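For a sense of the SPMD/MPI style of parallelism that Bodo generates automatically from ordinary pandas code, here is a hand-written sketch using mpi4py. This is not Bodo's API, only the underlying pattern it automates; the data is synthetic and the script assumes an MPI runtime is available (run with something like `mpiexec -n 4 python script.py`).

```python
# Hand-written SPMD sketch with mpi4py, shown only to illustrate the MPI-style
# parallelism that compilers such as Bodo generate from plain pandas code.
import pandas as pd
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

# Each process builds (or would load) only its slice of the data.
rows = range(rank, 1_000_000, size)
df = pd.DataFrame({"value": list(rows)})

# Local partial aggregate, then a global reduction across all processes.
local_sum = df["value"].sum()
total = comm.allreduce(local_sum, op=MPI.SUM)

if rank == 0:
    print(f"global sum computed across {size} processes: {total}")
```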
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
Atlan is a collaborative workspace for data-driven teams, like Github for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $3000 on an annual subscription
Struggling with broken pipelines? Stale dashboards? Missing data? If this resonates with you, you’re not alone. Data engineers struggling with unreliable data need look no further than Monte Carlo, the world’s first end-to-end, fully automated Data Observability Platform! In the same way that application performance monitoring ensures reliable software and keeps application downtime at bay, Monte Carlo solves the costly problem of broken data pipelines. Monte Carlo monitors and alerts for data issues across your data warehouses, data lakes, ETL, and business intelligence, reducing time to detection and resolution from weeks or days to just minutes. Start trusting your data with Monte Carlo today! Visit dataengineeringpodcast.com/impact today to save your spot at IMPACT: The Data Observability Summit, a half-day virtual event featuring the first U.S. Chief Data Scientist, founder of the Data Mesh, Creator of Apache Airflow, and more data pioneers spearheading some of the biggest movements in data. The first 50 to RSVP with this link will be entered to win an Oculus Quest 2 — Advanced All-In-One Virtual Reality Headset. RSVP today – you don’t want to miss it!
Your host is Tobias Macey and today I’m interviewing Ehsan Totoni about Bodo, a system for automatically optimizing and parallelizing python code for massively parallel data processing and analytics
Interview
Introduction
How did you get involved in the area of data management?
Can you describe what Bodo is and the story behind it?
What are the techniques/technologies that teams might use to optimize or scale out their data processing workflows?
Why have you focused your efforts on the Python language and toolchain?
Do you see any potential for expanding into other language communities?
What are the shortcomings of projects such as Dask and Ray for scaling out Python data projects?
Many people are familiar with the principle of HPC architectures, but can you share an overview of the current state of the art for HPC?
What are the tradeoffs of HPC vs scale-out distributed systems?
Can you describe the technical implementation of the Bodo platform?
What are the aspects of the Python language and package ecosystem that have complicated the work of building an optimizing compiler?
How do you handle compiled extensions? (e.g. C/C++/Fortran)
What are some of the assumptions/expectations that you had when first approaching this project that have been challenged as you progressed through its implementation?
How do you handle data distribution for scale out computation?
What are some software architecture/programming patterns that act as bottlenecks/optimization cliffs for parallelization?
What are some of the educational challenges that you have run into while working with potential and current customers?
What are the most interesting, innovative, or unexpected ways that you have seen Bodo used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on Bodo?
When is Bodo the wrong choice?
What do you have planned for the future of Bodo?
Contact Info
LinkedIn
@EhsanTn on Twitter
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
Bodo
High Performance Computing (HPC)
University of Illinois, Urbana-Champaign
Julia Language
Pandas
Podcast.__init__ Episode
NumPy
Dask
Podcast Episode
Ray
Podcast.__init__ Episode
Numba
LLVM
SPMD
MPI
Elastic Fabric Adapter
Iceberg Table Format
Podcast Episode
IPython Parallel
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
9/25/2021 • 1 hour, 4 minutes, 16 seconds
Declarative Machine Learning Without The Operational Overhead Using Continual
Summary
Building, scaling, and maintaining the operational components of a machine learning workflow are all hard problems. Add the work of creating the model itself, and it’s not surprising that a majority of companies that could greatly benefit from machine learning have yet to either put it into production or see the value. Tristan Zajonc recognized the complexity that acts as a barrier to adoption and created the Continual platform in response. In this episode he shares his perspective on the benefits of declarative machine learning workflows as a means of accelerating adoption in businesses that don’t have the time, money, or ambition to build everything from scratch. He also discusses the technical underpinnings of what he is building and how using the data warehouse as a shared resource drastically shortens the time required to see value. This is a fascinating episode and Tristan’s work at Continual is likely to be the catalyst for a new stage in the machine learning community.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
Schema changes, missing data, and volume anomalies caused by your data sources can happen without any advance notice if you lack visibility into your data-in-motion. That leaves DataOps reactive to data quality issues and can make your consumers lose confidence in your data. By connecting to your pipeline orchestrator like Apache Airflow and centralizing your end-to-end metadata, Databand.ai lets you identify data quality issues and their root causes from a single dashboard. With Databand.ai, you’ll know whether the data moving from your sources to your warehouse will be available, accurate, and usable when it arrives. Go to dataengineeringpodcast.com/databand to sign up for a free 30-day trial of Databand.ai and take control of your data quality today.
Atlan is a collaborative workspace for data-driven teams, like Github for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $3000 on an annual subscription
Your host is Tobias Macey and today I’m interviewing Tristan Zajonc about Continual, a platform for automating the creation and application of operational AI on top of your data warehouse
Interview
Introduction
How did you get involved in the area of data management?
Can you describe what Continual is and the story behind it?
What is your definition for "operational AI" and how does it differ from other applications of ML/AI?
What are some example use cases for AI in an operational capacity?
What are the barriers to adoption for organizations that want to take advantage of predictive analytics?
Who are the target users of Continual?
Can you describe how the Continual platform is implemented?
How has the design and infrastructure changed or evolved since you first began working on it?
What is the workflow for someone building a model and putting it into production?
Once a model has been deployed, what are the mechanisms that you expose for interacting with it?
How does this differ from in-database ML capabilities such as what is offered by Vertica and BigQuery?
How much understanding of ML/AI principles is necessary for someone to create a model with Continual?
What is your estimation of the impact that Continual can have on the overall productivity of a data team/data scientist?
What are the most interesting, innovative, or unexpected ways that you have seen Continual used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on Continual?
When is Continual the wrong choice?
What do you have planned for the future of Continual?
Contact Info
LinkedIn
@tristanzajonc on Twitter
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other show, Podcast.__init__, to learn about the Python language, its community, and the innovative ways it is being used.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Links
Continual
World Bank
SAS
SPSS
Stata
Feature Store
DataRobot
Transfer Learning
dbt
Podcast Episode
Ludwig
Overton (Apple)
Hightouch
Census
Galaxy Schema
In-Database ML Podcast Episode
scikit-learn
Snorkel
Podcast Episode
Materialize
Podcast Episode
Flink SQL
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
9/19/2021 • 1 hour, 11 minutes, 51 seconds
An Exploration Of The Data Engineering Requirements For Bioinformatics
Summary
Biology has been gaining a lot of attention in recent years, even before the pandemic. As an outgrowth of that popularity, a new field has grown up that pairs statistics and computational analysis with scientific research, namely bioinformatics. This brings with it a unique set of challenges for data collection, data management, and analytical capabilities. In this episode Jillian Rowe shares her experience of working in the field and supporting teams of scientists and analysts with the data infrastructure that they need to get their work done. This is a fascinating exploration of the collaboration between data professionals and scientists.
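As a small taste of the domain-specific formats that come up in this conversation, the snippet below reads FASTQ records with Biopython and prints a per-read average quality score. The file path is hypothetical.

```python
# A small illustration of one bioinformatics file format mentioned in the
# episode: iterating over FASTQ reads with Biopython. The path is hypothetical.
from Bio import SeqIO

for record in SeqIO.parse("sample_reads.fastq", "fastq"):
    quality = record.letter_annotations["phred_quality"]
    print(record.id, len(record.seq), sum(quality) / len(quality))
```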
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
Atlan is a collaborative workspace for data-driven teams, like Github for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $3000 on an annual subscription
Struggling with broken pipelines? Stale dashboards? Missing data? If this resonates with you, you’re not alone. Data engineers struggling with unreliable data need look no further than Monte Carlo, the world’s first end-to-end, fully automated Data Observability Platform! In the same way that application performance monitoring ensures reliable software and keeps application downtime at bay, Monte Carlo solves the costly problem of broken data pipelines. Monte Carlo monitors and alerts for data issues across your data warehouses, data lakes, ETL, and business intelligence, reducing time to detection and resolution from weeks or days to just minutes. Start trusting your data with Monte Carlo today! Visit dataengineeringpodcast.com/impact today to save your spot at IMPACT: The Data Observability Summit, a half-day virtual event featuring the first U.S. Chief Data Scientist, founder of the Data Mesh, Creator of Apache Airflow, and more data pioneers spearheading some of the biggest movements in data. The first 50 to RSVP with this link will be entered to win an Oculus Quest 2 — Advanced All-In-One Virtual Reality Headset. RSVP today – you don’t want to miss it!
Your host is Tobias Macey and today I’m interviewing Jillian Rowe about data engineering practices for bioinformatics projects
Interview
Introduction
How did you get involved in the area of data management?
How did you get into the field of bioinformatics?
Can you describe what is unique about data needs in bioinformatics?
What are some of the problems that you have found yourself regularly solving for your clients?
When building data engineering stacks for bioinformatics, what are the attributes that you are optimizing for? (e.g. speed, UX, scale, correctness, etc.)
Can you describe a typical set of technologies that you implement when working on a new project?
What kinds of systems do you need to integrate with?
What are the data formats that are widely used for bioinformatics?
What are some details that a data engineer would need to know to work effectively with those formats while preparing data for analysis?
What amount of domain expertise is necessary for a data engineer to work in life sciences?
What are the most interesting, innovative, or unexpected solutions that you have seen for manipulating bioinformatics data?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on bioinformatics projects?
What are some of the industry/academic trends or upcoming technologies that you are tracking for bioinformatics?
Contact Info
LinkedIn
jerowe on GitHub
Website
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
Bioinformatics
How Perl Saved The Human Genome Project
Neo4J
AWS Parallel Cluster
Datashader
R Shiny
Plotly Dash
Apache Parquet
Dask
Podcast Episode
HDF5
Spark
Superset
Data Engineering Podcast Episode
Podcast.__init__ Episode
FastQ file format
BAM (Binary Alignment Map) File
Variant Call Format (VCF)
HIPAA
DVC
Podcast Episode
LakeFS
BioThings API
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
9/19/2021 • 55 minutes, 9 seconds
Setting The Stage For The Next Chapter Of The Cassandra Database
Summary
The Cassandra database is one of the first open source options for globally scalable storage systems. Since its introduction in 2008 it has been powering systems at every scale. The community recently released a new major version that marks a milestone in its maturity and stability as a project and database. In this episode Ben Bromhead, CTO of Instaclustr, shares the challenges that the community has worked through, the work that went into the release, and how the stability and testing improvements are setting the stage for the future of the project.
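For anyone who wants a concrete picture of what working with Cassandra looks like from Python, here is a minimal sketch using the DataStax driver against a local node. The keyspace, table, and replication settings are illustrative only, not recommendations for production.

```python
# A minimal sketch using the DataStax Python driver against a local Cassandra
# node; keyspace, table, and replication settings are illustrative only.
from datetime import datetime, timezone
from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])
session = cluster.connect()

session.execute("""
    CREATE KEYSPACE IF NOT EXISTS metrics
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}
""")
session.execute("""
    CREATE TABLE IF NOT EXISTS metrics.sensor_readings (
        sensor_id text, reading_time timestamp, value double,
        PRIMARY KEY (sensor_id, reading_time)
    ) WITH CLUSTERING ORDER BY (reading_time DESC)
""")
session.execute(
    "INSERT INTO metrics.sensor_readings (sensor_id, reading_time, value) VALUES (%s, %s, %s)",
    ("sensor-1", datetime.now(timezone.utc), 21.5),
)
```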
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
Schema changes, missing data, and volume anomalies caused by your data sources can happen without any advance notice if you lack visibility into your data-in-motion. That leaves DataOps reactive to data quality issues and can make your consumers lose confidence in your data. By connecting to your pipeline orchestrator like Apache Airflow and centralizing your end-to-end metadata, Databand.ai lets you identify data quality issues and their root causes from a single dashboard. With Databand.ai, you’ll know whether the data moving from your sources to your warehouse will be available, accurate, and usable when it arrives. Go to dataengineeringpodcast.com/databand to sign up for a free 30-day trial of Databand.ai and take control of your data quality today.
Atlan is a collaborative workspace for data-driven teams, like Github for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $3000 on an annual subscription
Your host is Tobias Macey and today I’m interviewing Ben Bromhead about the recent release of Cassandra version 4 and how it fits in the current landscape of data tools
Interview
Introduction
How did you get involved in the area of data management?
For anyone who isn’t familiar with Cassandra, can you briefly describe what it is and some of the story behind it?
How did you get involved in the Cassandra project and how would you characterize your role?
What are the main use cases and industries where someone is likely to use Cassandra?
What is notable about the version 4 release?
What were some of the factors that contributed to the long delay between versions 3 and 4? (2015 – 2021)
What are your thoughts on the ongoing utility/benefits of projects such as ScyllaDB, particularly in light of the most recent release?
Cassandra is primarily used as a system of record. What are some of the tools and system architectures that users turn to when building analytical workloads for data stored in Cassandra?
The architecture of Cassandra has lent itself well to the cloud native ecosystem that has been growing in recent years. What do you see as the opportunities for Cassandra over the near to medium term as the cloud continues to grow in prominence?
What are some of the challenges that you and the Cassandra community have faced with the flurry of new data storage and processing systems that have popped up over the past few years?
What are the most interesting, innovative, or unexpected ways that you have seen Cassandra used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on Cassandra?
When is Cassandra the wrong choice?
What is in store for the future of Cassandra?
Contact Info
LinkedIn
@benbromhead on Twitter
benbromhead on GitHub
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
Cassandra
Instaclustr
HBase
DynamoDB Whitepaper
Property Based Testing
QuickTheories
Riak
FoundationDB
Podcast Episode
ScyllaDB
Podcast Episode
YugabyteDB
Podcast Episode
Azure Cosmos DB
Amazon Keyspaces
Netty
Kafka
CQRS == Command Query Responsibility Segregation
Elasticsearch
Redis
Memcached
Debezium
Podcast Episode
CDC == Change Data Capture
Podcast Episodes
Bigtable White Paper
CockroachDB
Podcast Episode
Vitess
CAP Theorem
Paxos
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
9/12/2021 • 59 minutes, 28 seconds
A View From The Round Table Of Gartner's Cool Vendors
Summary
Gartner analysts are tasked with identifying promising companies each year that are making an impact in their respective categories. For businesses that are working in the data management and analytics space they recognized the efforts of Timbr.ai, Soda Data, Nexla, and Tada. In this episode the founders and leaders of each of these organizations share their perspective on the current state of the market, and the challenges facing businesses and data professionals today.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
Atlan is a collaborative workspace for data-driven teams, like Github for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $3000 on an annual subscription
Have you ever had to develop ad-hoc solutions for security, privacy, and compliance requirements? Are you spending too much of your engineering resources on creating database views, configuring database permissions, and manually granting and revoking access to sensitive data? Satori has built the first DataSecOps Platform that streamlines data access and security. Satori’s DataSecOps automates data access controls, permissions, and masking for all major data platforms such as Snowflake, Redshift and SQL Server and even delegates data access management to business users, helping you move your organization from default data access to need-to-know access. Go to dataengineeringpodcast.com/satori today and get a $5K credit for your next Satori subscription.
Your host is Tobias Macey and today I’m interviewing Saket Saurabh, Maarten Masschelein, Akshay Deshpande, and Dan Weitzner about the challenges facing data practitioners today and the solutions that are being brought to market for addressing them, as well as the work they are doing that got them recognized as "cool vendors" by Gartner.
Interview
Introduction
How did you get involved in the area of data management?
Can you each describe what you view as the biggest challenge facing data professionals?
Who are you building your solutions for, and what are the most common data management problems you are all solving?
What are different components of Data Management and why is it so complex?
What will simplify this process, if any?
The report covers a lot of new data management terminology – data governance, data observability, data fabric, data mesh, DataOps, MLOps, AIOps – what does this all mean and why is it important for data engineers?
How has the data management space changed in recent times? Describe the current data management landscape and any key developments.
From your perspective, what are the biggest challenges in the data management space today? What modern data management features are lacking in existing databases?
Gartner imagines a future where data and analytics leaders need to be prepared to rely on data management solutions that make heterogeneous, distributed data appear consolidated, easy to access and business friendly. How does this tally with your vision of the future of data management and what needs to happen to make this a reality?
What are the most interesting, innovative, or unexpected ways that you have seen your respective products used (in isolation or combined)?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on your respective platforms?
What are the upcoming trends and challenges that you are keeping a close eye on?
Contact Info
Saket
LinkedIn
@saketsaurabh on Twitter
Maarten
LinkedIn
@masscheleinm on Twitter
Dan
LinkedIn
Akshay
Website
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
Nexla
Soda
Tada
Timbr
Collibra
Podcast Episode
Gartner Cool Vendors
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
9/9/2021 • 1 hour, 4 minutes, 16 seconds
Designing And Building Data Platforms As A Product
Summary
The term "data platform" gets thrown around a lot, but have you stopped to think about what it actually means for you and your organization? In this episode Lior Gavish, Lior Solomon, and Atul Gupte share their view of what it means to have a data platform, discuss their experiences building them at various companies, and provide advice on how to treat them like a software product. This is a valuable conversation about how to approach the work of selecting the tools that you use to power your data systems and considerations for how they can be woven together for a unified experience across your various stakeholders.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
Are you bored with writing scripts to move data into SaaS tools like Salesforce, Marketo, or Facebook Ads? Hightouch is the easiest way to sync data into the platforms that your business teams rely on. The data you’re looking for is already in your data warehouse and BI tools. Connect your warehouse to Hightouch, paste a SQL query, and use their visual mapper to specify how data should appear in your SaaS systems. No more scripts, just SQL. Supercharge your business teams with customer data using Hightouch for Reverse ETL today. Get started for free at dataengineeringpodcast.com/hightouch.
Atlan is a collaborative workspace for data-driven teams, like Github for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $3000 on an annual subscription
Your host is Tobias Macey and today I’m interviewing Lior Gavish, Lior Solomon, and Atul Gupte about the technical, social, and architectural aspects of building your data platform as a product for your internal customers
Interview
Introduction
How did you get involved in the area of data management? – all
Can we start by establishing a definition of "data platform" for the purpose of this conversation?
Who are the stakeholders in a data platform?
Where does the responsibility lie for creating and maintaining ("owning") the platform?
What are some of the technical and organizational constraints that are likely to factor into the design and execution of the platform?
What are the minimum set of requirements necessary to qualify as a platform? (as opposed to a collection of discrete components)
What are the additional capabilities that should be in place to simplify the use and maintenance of the platform?
How are data platforms managed? Are they managed by technical teams, product managers, etc.? What is the profile for a data product manager? – Atul G.
How do you set SLIs / SLOs with your data platform team when you don’t have clear metrics you’re tracking? – Lior S.
There has been a lot of conversation recently about different interpretations of the "modern data stack". For a team who is just starting to build out their platform, how much credence should they be giving to those debates?
What are the first steps that you recommend for those practitioners?
If an organization already has infrastructure in place for data/analytics, how might they think about building or buying their way toward a well integrated platform?
Once a platform is established, what are some challenges that teams should anticipate in scaling the platform?
Which axes of scale have you found to be most difficult to manage? (scale of infrastructure capacity, scale of organizational/technical complexity, scale of usage, etc.)
Do we think the "data platform" is a skill set? How do we split up the role of the platform? Is there one for real-time? Is there one for ETLs?
How do you handle the quality and reliability of the data powering your solution?
What are helpful techniques that you have used for collecting, prioritizing, and managing feature requests?
How do you justify the budget and resources for your data platform?
How do you measure the success of a data platform?
What is the relationship between a data platform and data products?
Are there any other companies you admire when it comes to building robust, scalable data architecture?
What are the most interesting, innovative, or unexpected ways that you have seen data platforms used?
What are the most interesting, unexpected, or challenging lessons that you have learned while building and operating a data platform?
When is a data platform the wrong choice? (as opposed to buying an integrated solution, etc.)
What are the industry trends that you are monitoring/excited for in the space of data platforms?
Contact Info
Lior Gavish
LinkedIn
@lgavish on Twitter
Lior Solomon
LinkedIn
@liorsolomon on Twitter
Atul Gupte
LinkedIn
@atulgupte on Twitter
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
Monte Carlo
Vimeo
Facebook
Uber
Zynga
Great Expectations
Podcast Episode
Airflow
Podcast.__init__ Episode
Fivetran
Podcast Episode
dbt
Podcast Episode
Snowflake
Podcast Episode
Looker
Podcast Episode
Modern Data Stack Podcast Episode
Stitch
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
9/4/2021 • 1 hour
Presto Powered Cloud Data Lakes At Speed Made Easy With Ahana
Summary
The Presto project has become the de facto option for building scalable open source analytics in SQL for the data lake. In recent months the community has focused their efforts on making it the fastest possible option for running your analytics in the cloud. In this episode Dipti Borkar discusses the work that she and her team are doing at Ahana to simplify the process of running your own PrestoDB environment in the cloud. She explains how they are optimizing the runtime to reduce latency and increase query throughput, the ways that they are contributing back to the open source community, and the exciting improvements that are in the works to make Presto an even more powerful option for all of your analytics.
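As a rough illustration of how an analyst might submit queries to a Presto coordinator from Python, the sketch below uses the presto-python-client package. The hostname, catalog, and table are assumptions standing in for whatever your Presto or Ahana deployment exposes.

```python
# A small sketch of querying a Presto coordinator from Python with the
# presto-python-client package; host, catalog, schema, and table are
# hypothetical placeholders for your own deployment.
import prestodb

conn = prestodb.dbapi.connect(
    host="presto-coordinator.example.com",
    port=8080,
    user="analyst",
    catalog="hive",
    schema="default",
)
cursor = conn.cursor()
cursor.execute("SELECT event_type, COUNT(*) FROM events GROUP BY event_type")
for event_type, count in cursor.fetchall():
    print(event_type, count)
```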
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
Schema changes, missing data, and volume anomalies caused by your data sources can happen without any advance notice if you lack visibility into your data-in-motion. That leaves DataOps reactive to data quality issues and can make your consumers lose confidence in your data. By connecting to your pipeline orchestrator like Apache Airflow and centralizing your end-to-end metadata, Databand.ai lets you identify data quality issues and their root causes from a single dashboard. With Databand.ai, you’ll know whether the data moving from your sources to your warehouse will be available, accurate, and usable when it arrives. Go to dataengineeringpodcast.com/databand to sign up for a free 30-day trial of Databand.ai and take control of your data quality today.
Atlan is a collaborative workspace for data-driven teams, like Github for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $3000 on an annual subscription
Your host is Tobias Macey and today I’m interviewing Dipti Borkar, co-founder of Ahana, about Presto and Ahana, a managed SaaS service for Presto
Interview
Introduction
How did you get involved in the area of data management?
Can you describe what Ahana is and the story behind it?
There has been a lot of recent activity in the Presto community. Can you give an overview of the options that are available for someone wanting to use its SQL engine for querying their data?
What is Ahana’s role in the community/ecosystem?
(happy to skip this question if it’s too contentious) What are some of the notable differences that have emerged over the past couple of years between the Trino (formerly PrestoSQL) and PrestoDB projects?
Another area that has been seeing a lot of activity is data lakes and projects to make them more manageable and feature complete (e.g. Hudi, Delta Lake, Iceberg, Nessie, LakeFS, etc.). How has that influenced your product focus and capabilities?
How does this activity change the calculus for organizations who are deciding on a lake or warehouse for their data architecture?
Can you describe how the Ahana Cloud platform is architected?
What are the additional systems that you have built to manage deployment, scaling, and multi-tenancy?
Beyond the storage and processing, what are the other notable tools and projects that have become part of the overall stack for supporting open analytics?
What are some areas of ongoing activity that you are keeping an eye on as you build out the Ahana offerings?
What are the most interesting, innovative, or unexpected ways that you have seen Ahana/Presto used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on Ahana?
When is Ahana the wrong choice?
What do you have planned for the future of Ahana?
Contact Info
LinkedIn
@dborkar on Twitter
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
Ahana
Alluxio
Podcast Episode
Couchbase
Kinetica
Tensorflow
PyTorch
Podcast.__init__ Episode
AWS Athena
AWS Glue
Hive Metastore
Clickhouse
Dremio
Podcast Episode
Apache Drill
Teradata
Snowflake
Podcast Episode
BigQuery
RaptorX
Aria Optimizations for Presto
Apache Ranger
Presto Plugin
Trino
Podcast Episode
Starburst
Podcast Episode
Hive
Iceberg
Podcast Episode
Hudi
Podcast Episode
Delta Lake
Podcast Episode
Superset
Podcast.__init__ Episode
Data Engineering Podcast Episode
Nessie
LakeFS
Amundsen
Podcast Episode
DataHub
Podcast Episode
OtterTune
Podcast Episode
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
9/2/2021 • 1 hour, 30 seconds
Do Away With Data Integration Through A Dataware Architecture With Cinchy
Summary
The reason that so much time and energy is spent on data integration is the way our applications are designed. Because each application owns the data that it generates, we have to go through the trouble of extracting that information before it can be used elsewhere. The team at Cinchy are working to bring about a new paradigm of software architecture that puts the data at the center. In this episode Dan DeMers, Cinchy’s CEO, explains how their concept of a "Dataware" platform eliminates the need for costly and error-prone integration processes and the benefits that it can provide for transactional and analytical application design. This is a fascinating and unconventional approach to working with data, so definitely give this a listen to expand your thinking about how to build your systems.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
Are you bored with writing scripts to move data into SaaS tools like Salesforce, Marketo, or Facebook Ads? Hightouch is the easiest way to sync data into the platforms that your business teams rely on. The data you’re looking for is already in your data warehouse and BI tools. Connect your warehouse to Hightouch, paste a SQL query, and use their visual mapper to specify how data should appear in your SaaS systems. No more scripts, just SQL. Supercharge your business teams with customer data using Hightouch for Reverse ETL today. Get started for free at dataengineeringpodcast.com/hightouch.
Have you ever had to develop ad-hoc solutions for security, privacy, and compliance requirements? Are you spending too much of your engineering resources on creating database views, configuring database permissions, and manually granting and revoking access to sensitive data? Satori has built the first DataSecOps Platform that streamlines data access and security. Satori’s DataSecOps automates data access controls, permissions, and masking for all major data platforms such as Snowflake, Redshift and SQL Server and even delegates data access management to business users, helping you move your organization from default data access to need-to-know access. Go to dataengineeringpodcast.com/satori today and get a $5K credit for your next Satori subscription.
Your host is Tobias Macey and today I’m interviewing Dan DeMers about Cinchy, a dataware platform aiming to simplify the work of data integration by eliminating ETL/ELT
Interview
Introduction
How did you get involved in the area of data management?
Can you describe what Cinchy is and the story behind it?
In your experience working in data and building complex enterprise-grade systems, what are the shortcomings and negative externalities of an ETL/ELT approach to data integration?
How does a Dataware platform differ from a data lake or a data warehouse? What is it used for?
What is Zero-Copy Integration? How does that work?
Can you describe how customers start their Cinchy journey?
What are the main use case patterns that you’re seeing with Dataware?
Your platform offers unlimited users, including business users. What are some of the challenges that you face in building a user experience that doesn’t become overwhelming as an organization scales the number of data sources and processing flows?
What are the most interesting, innovative, or unexpected ways that you have seen Cinchy used?
When is Cinchy the wrong choice for a customer?
Can you describe the technical architecture of the Cinchy platform?
How do you establish connections/relationships among data from disparate sources?
How do you manage schema evolution in source systems?
What are some of the edge cases that users need to consider as they are designing and building those connections?
What are some of the features or capabilities of Cinchy that you think are overlooked or under-utilized?
How has your understanding of the problem space changed since you started working on Cinchy?
How has the architecture and design of the system evolved to reflect that updated understanding?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on Cinchy?
What do you have planned for the future of Cinchy?
Contact Info
LinkedIn
@dandemers on Twitter
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other show, Podcast.__init__, to learn about the Python language, its community, and the innovative ways it is being used.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Links
Cinchy
Gordon Everest
Data Collaboration Alliance
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
8/28/2021 • 51 minutes, 26 seconds
Decoupling Data Operations From Data Infrastructure Using Nexla
Summary
The technological and social ecosystem of data engineering and data management has been reaching a stage of maturity recently. As part of this stage in our collective journey the focus has been shifting toward operation and automation of the infrastructure and workflows that power our analytical workloads. It is an encouraging sign for the industry, but it is still a complex and challenging undertaking. In order to make this world of DataOps more accessible and manageable the team at Nexla has built a platform that decouples the logical unit of data from the underlying mechanisms so that you can focus on the problems that really matter to your business. In this episode Saket Saurabh (CEO) and Avinash Shahdadpuri (CTO) share the story behind the Nexla platform, discuss the technical underpinnings, and describe how their concept of a Nexset simplifies the work of building data products for sharing within and between organizations.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
Schema changes, missing data, and volume anomalies caused by your data sources can happen without any advance notice if you lack visibility into your data-in-motion. That leaves DataOps reactive to data quality issues and can make your consumers lose confidence in your data. By connecting to your pipeline orchestrator like Apache Airflow and centralizing your end-to-end metadata, Databand.ai lets you identify data quality issues and their root causes from a single dashboard. With Databand.ai, you’ll know whether the data moving from your sources to your warehouse will be available, accurate, and usable when it arrives. Go to dataengineeringpodcast.com/databand to sign up for a free 30-day trial of Databand.ai and take control of your data quality today.
We’ve all been asked to help with an ad-hoc request for data by the sales and marketing team. Then it becomes a critical report that they need updated every week or every day. Then what do you do? Send a CSV via email? Write some Python scripts to automate it? But what about incremental sync, API quotas, error handling, and all of the other details that eat up your time? Today, there is a better way. With Census, just write SQL or plug in your dbt models and start syncing your cloud warehouse to SaaS applications like Salesforce, Marketo, Hubspot, and many more. Go to dataengineeringpodcast.com/census today to get a free 14-day trial.
Your host is Tobias Macey and today I’m interviewing Saket Saurabh and Avinash Shahdadpuri about Nexla, a platform for powering data operations and sharing within and across businesses
Interview
Introduction
How did you get involved in the area of data management?
Can you describe what Nexla is and the story behind it?
What are the major problems that Nexla is aiming to solve?
What are the components of a data platform that Nexla might replace?
What are the use cases and benefits of being able to publish data sets for use outside and across organizations?
What are the different elements involved in implementing DataOps?
How is the Nexla platform implemented?
What have been the most complex engineering challenges?
How has the architecture changed or evolved since you first began working on it?
What are some of the assumptions that you had at the start which have been challenged or invalidated?
What are some of the heuristics that you have found most useful in generating logical units of data in an automated fashion?
Once a Nexset has been created, what are some of the ways that they can be used or further processed?
What are the attributes of a Nexset? (e.g. access control policies, lineage, etc.)
How do you handle storage and sharing of a Nexset?
What are some of your grand hopes and ambitions for the Nexla platform and the potential for data exchanges?
What are the most interesting, innovative, or unexpected ways that you have seen Nexla used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on Nexla?
When is Nexla the wrong choice?
What do you have planned for the future of Nexla?
Contact Info
Saket
LinkedIn
@saketsaurabh on Twitter
Avinash
LinkedIn
@avinashpuri on Twitter
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
Nexla
Nexsets
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
8/25/2021 • 57 minutes, 48 seconds
Let Your Analysts Build A Data Lakehouse With Cuelake
Summary
Data lakes have been gaining popularity alongside an increase in their sophistication and usability. Despite improvements in performance and data architecture they still require significant knowledge and experience to deploy and manage. In this episode Vikrant Dubey discusses his work on the Cuelake project which allows data analysts to build a lakehouse with SQL queries. By building on top of Zeppelin, Spark, and Iceberg he and his team at Cuebook have built an autoscaled cloud native system that abstracts the underlying complexity.
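For context on the Spark-plus-Iceberg layer that Cuelake builds on, here is a hedged PySpark sketch that creates and populates an Iceberg table with plain Spark SQL. This is not Cuelake itself; the catalog configuration, warehouse path, and table names are assumptions, and the Iceberg Spark runtime jar is assumed to already be on the classpath.

```python
# A sketch of SQL-driven ELT on a lakehouse using plain PySpark and Iceberg,
# the storage layer Cuelake builds on. Catalog name, warehouse path, and table
# names are assumptions; the Iceberg Spark runtime must be on the classpath.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .config("spark.sql.catalog.lake", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.lake.type", "hadoop")
    .config("spark.sql.catalog.lake.warehouse", "s3a://my-bucket/warehouse")
    .getOrCreate()
)

spark.sql("CREATE NAMESPACE IF NOT EXISTS lake.analytics")
spark.sql("""
    CREATE TABLE IF NOT EXISTS lake.analytics.daily_orders (
        order_date date, order_count bigint
    ) USING iceberg
""")
spark.sql("INSERT INTO lake.analytics.daily_orders VALUES (DATE '2021-08-01', 42)")
spark.sql("SELECT * FROM lake.analytics.daily_orders").show()
```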
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
Are you bored with writing scripts to move data into SaaS tools like Salesforce, Marketo, or Facebook Ads? Hightouch is the easiest way to sync data into the platforms that your business teams rely on. The data you’re looking for is already in your data warehouse and BI tools. Connect your warehouse to Hightouch, paste a SQL query, and use their visual mapper to specify how data should appear in your SaaS systems. No more scripts, just SQL. Supercharge your business teams with customer data using Hightouch for Reverse ETL today. Get started for free at dataengineeringpodcast.com/hightouch.
Have you ever had to develop ad-hoc solutions for security, privacy, and compliance requirements? Are you spending too much of your engineering resources on creating database views, configuring database permissions, and manually granting and revoking access to sensitive data? Satori has built the first DataSecOps Platform that streamlines data access and security. Satori’s DataSecOps automates data access controls, permissions, and masking for all major data platforms such as Snowflake, Redshift and SQL Server and even delegates data access management to business users, helping you move your organization from default data access to need-to-know access. Go to dataengineeringpodcast.com/satori today and get a $5K credit for your next Satori subscription.
Your host is Tobias Macey and today I’m interviewing Vikrant Dubey about Cuebook and their Cuelake project for building ELT pipelines for your data lakehouse entirely in SQL
Interview
Introduction
How did you get involved in the area of data management?
Can you describe what Cuelake is and the story behind it?
There are a number of platforms and projects for running SQL workloads and transformations on a data lake. What was lacking in those systems that you are addressing with Cuelake?
Who are the target users of Cuelake and how has that influenced the features and design of the system?
Can you describe how Cuelake is implemented?
What was your selection process for the various components?
What are some of the sharp edges that you have had to work around when integrating these components?
What is involved in getting Cuelake deployed?
How are you using Cuelake in your work at Cuebook?
Given your focus on machine learning for anomaly detection of business metrics, what are the challenges that you faced in using a data warehouse for those workloads?
What are the advantages that a data lake/lakehouse architecture maintains over a warehouse?
What are the shortcomings of the lake/lakehouse approach that are solved by using a warehouse?
What are the most interesting, innovative, or unexpected ways that you have seen Cuelake used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on Cuelake?
When is Cuelake the wrong choice?
What do you have planned for the future of Cuelake?
Contact Info
LinkedIn
vikrantcue on GitHub
@vkrntd on Twitter
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Links
Cuelake
Apache Druid
Dremio
Databricks
Zeppelin
Spark
Apache Iceberg
Apache Hudi
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
8/21/2021 • 27 minutes, 37 seconds
Migrate And Modify Your Data Platform Confidently With Compilerworks
Summary
A major concern that comes up when selecting a vendor or technology for storing and managing your data is vendor lock-in. What happens if the vendor fails? What if the technology can’t do what I need it to? Compilerworks set out to reduce the pain and complexity of migrating between platforms, and in the process added an advanced lineage tracking capability. In this episode Shevek, CTO of Compilerworks, takes us on an interesting journey through the many technical and social complexities that are involved in evolving your data platform and the system that they have built to make it a manageable task.
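Compilerworks builds full compilers that reduce each SQL dialect to an algebraic form, which is far more rigorous than anything shown here; purely to illustrate the underlying idea of recovering table-level lineage from SQL code, the sketch below parses a hypothetical query with the open-source sqlglot library and lists the tables it touches.

```python
import sqlglot
from sqlglot import exp

# Hypothetical transformation statement pulled from a pipeline.
sql = """
INSERT INTO analytics.daily_revenue
SELECT o.order_date, SUM(o.amount) AS revenue
FROM raw.orders AS o
JOIN raw.customers AS c ON o.customer_id = c.id
GROUP BY o.order_date
"""

tree = sqlglot.parse_one(sql)

# Collect every table referenced by the statement, schema-qualified when present.
tables = sorted({
    ".".join(part for part in (t.db, t.name) if part)
    for t in tree.find_all(exp.Table)
})
print(tables)
# The INSERT target (analytics.daily_revenue) is the downstream node;
# raw.orders and raw.customers are its upstream sources.
```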
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
Schema changes, missing data, and volume anomalies caused by your data sources can happen without any advance notice if you lack visibility into your data-in-motion. That leaves DataOps reactive to data quality issues and can make your consumers lose confidence in your data. By connecting to your pipeline orchestrator like Apache Airflow and centralizing your end-to-end metadata, Databand.ai lets you identify data quality issues and their root causes from a single dashboard. With Databand.ai, you’ll know whether the data moving from your sources to your warehouse will be available, accurate, and usable when it arrives. Go to dataengineeringpodcast.com/databand to sign up for a free 30-day trial of Databand.ai and take control of your data quality today.
We’ve all been asked to help with an ad-hoc request for data by the sales and marketing team. Then it becomes a critical report that they need updated every week or every day. Then what do you do? Send a CSV via email? Write some Python scripts to automate it? But what about incremental sync, API quotas, error handling, and all of the other details that eat up your time? Today, there is a better way. With Census, just write SQL or plug in your dbt models and start syncing your cloud warehouse to SaaS applications like Salesforce, Marketo, Hubspot, and many more. Go to dataengineeringpodcast.com/census today to get a free 14-day trial.
Your host is Tobias Macey and today I’m interviewing Shevek about Compilerworks and his work on writing compilers to automate data lineage tracking from your SQL code
Interview
Introduction
How did you get involved in the area of data management?
Can you describe what Compilerworks is and the story behind it?
What is a compiler?
How are you applying compilers to the challenges of data processing systems?
What are some use cases that Compilerworks is uniquely well suited to?
There are a number of other methods and systems available for tracking and/or computing data lineage. What are the benefits of the approach that you are taking with Compilerworks?
Can you describe the design and implementation of the Compilerworks platform?
How has the system changed or evolved since you first began working on it?
What programming languages and SQL dialects do you currently support?
Which have been the most challenging to work with?
How do you handle verification/validation of the algebraic representation of SQL code given the variability of implementations and the flexibility of the specification?
Can you talk through the process of getting Compilerworks integrated into a customer’s infrastructure?
What is a typical workflow for someone using Compilerworks to manage their data lineage?
How does Compilerworks simplify the process of migrating between data warehouses/processing platforms?
What are the most interesting, innovative, or unexpected ways that you have seen Compilerworks used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on Compilerworks?
When is Compilerworks the wrong choice?
What do you have planned for the future of Compilerworks?
Contact Info
@shevek on GitHub
Website
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
Compilerworks
Compiler
ANSI SQL
Spark SQL
Google Flume Paper
SAS
Informatica
Trie Data Structure
Satisfiability Solver
Lisp
Scheme
Snooker
Qemu Java API
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
8/18/2021 • 1 hour, 6 minutes, 9 seconds
Prepare Your Unstructured Data For Machine Learning And Computer Vision Without The Toil Using Activeloop
Summary
The vast majority of data tools and platforms that you hear about are designed for working with structured, text-based data. What do you do when you need to manage unstructured information, or build a computer vision model? Activeloop was created for exactly that purpose. In this episode Davit Buniatyan, founder and CEO of Activeloop, explains why he is spending his time and energy on building a platform to simplify the work of getting your unstructured data ready for machine learning. He discusses the inefficiencies that teams run into from having to reprocess data multiple times, his work on the open source Hub library to solve this problem for everyone, and his thoughts on the vast potential that exists for using computer vision to solve hard and meaningful problems.
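As a rough, library-agnostic illustration of why storage layout matters for these workloads (this is deliberately not the Hub API), the sketch below writes decoded samples once into a fixed-shape, memory-mapped array so that training can later read random batches without re-downloading or re-decoding source files; the shapes and path are hypothetical.

```python
import numpy as np

# Hypothetical dataset dimensions and path; illustrative only.
n, h, w, c = 1_000, 224, 224, 3
store = np.lib.format.open_memmap(
    "images.npy", mode="w+", dtype=np.uint8, shape=(n, h, w, c)
)

# One-time ingestion: decode each source image once and write it into the store,
# e.g. store[i] = decoded_image_i (decoding code omitted here).

# Training-time access: random reads are cheap because every sample sits at a
# fixed offset in a memory-mapped file, so epochs can stream batches without
# re-processing the raw files.
batch = store[np.random.randint(0, n, size=32)]
print(batch.shape)  # (32, 224, 224, 3)
```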
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
Are you bored with writing scripts to move data into SaaS tools like Salesforce, Marketo, or Facebook Ads? Hightouch is the easiest way to sync data into the platforms that your business teams rely on. The data you’re looking for is already in your data warehouse and BI tools. Connect your warehouse to Hightouch, paste a SQL query, and use their visual mapper to specify how data should appear in your SaaS systems. No more scripts, just SQL. Supercharge your business teams with customer data using Hightouch for Reverse ETL today. Get started for free at dataengineeringpodcast.com/hightouch.
Have you ever had to develop ad-hoc solutions for security, privacy, and compliance requirements? Are you spending too much of your engineering resources on creating database views, configuring database permissions, and manually granting and revoking access to sensitive data? Satori has built the first DataSecOps Platform that streamlines data access and security. Satori’s DataSecOps automates data access controls, permissions, and masking for all major data platforms such as Snowflake, Redshift and SQL Server and even delegates data access management to business users, helping you move your organization from default data access to need-to-know access. Go to dataengineeringpodcast.com/satori today and get a $5K credit for your next Satori subscription.
Your host is Tobias Macey and today I’m interviewing Davit Buniatyan about Activeloop, a platform for hosting and delivering datasets optimized for machine learning
Interview
Introduction
How did you get involved in the area of data management?
Can you describe what Activeloop is and the story behind it?
How does the form and function of data storage introduce friction in the development and deployment of machine learning projects?
How does the work that you are doing at Activeloop compare to vector databases such as Pinecone?
You have a focus on image oriented data and computer vision projects. How does the specific applications of ML/DL influence the format and interactions with the data?
Can you describe how the Activeloop platform is architected?
How have the design and goals of the system changed or evolved since you began working on it?
What are the feature and performance tradeoffs between self-managed storage locations (e.g. S3, GCS) and the Activeloop platform?
What is the process for sourcing, processing, and storing data to be used by Hub/Activeloop?
Many data assets are useful for both ML/DL and analytical purposes. What are the considerations for managing the lifecycle of data between Activeloop/Hub and a data lake/warehouse?
What do you see as the opportunity and effort to generalize Hub and Activeloop to support arbitrary ML frameworks/languages?
What are the most interesting, innovative, or unexpected ways that you have seen Activeloop and Hub used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on Activeloop?
When is Hub/Activeloop the wrong choice?
What do you have planned for the future of Activeloop?
Contact Info
LinkedIn
@DBuniatyan on Twitter
davidbuniat on GitHub
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Links
Activeloop
Slack Community
Princeton University
ImageNet
Tensorflow
PyTorch
Podcast Episode
Activeloop Hub
Delta Lake
Podcast Episode
Tensor
Wasabi
Ray/Anyscale
Podcast Episode
Humans In The Loop podcast
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
8/15/2021 • 48 minutes, 39 seconds
Build Trust In Your Data By Understanding Where It Comes From And How It Is Used With Stemma
Summary
All of the fancy data platform tools and shiny dashboards that you use are pointless if the consumers of your analysis don’t have trust in the answers. Stemma helps you establish and maintain that trust by giving visibility into who is using what data, annotating the reports with useful context, and understanding who is responsible for keeping it up to date. In this episode Mark Grover explains what he is building at Stemma, how it expands on the success of the Amundsen project, and why trust is the most important asset for data teams.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
RudderStack’s smart customer data pipeline is warehouse-first. It builds your customer data warehouse and your identity graph on your data warehouse, with support for Snowflake, Google BigQuery, Amazon Redshift, and more. Their SDKs and plugins make event streaming easy, and their integrations with cloud applications like Salesforce and ZenDesk help you go beyond event streaming. With RudderStack you can use all of your customer data to answer more difficult questions and then send those insights to your whole customer data stack. Sign up free at dataengineeringpodcast.com/rudder today.
We’ve all been asked to help with an ad-hoc request for data by the sales and marketing team. Then it becomes a critical report that they need updated every week or every day. Then what do you do? Send a CSV via email? Write some Python scripts to automate it? But what about incremental sync, API quotas, error handling, and all of the other details that eat up your time? Today, there is a better way. With Census, just write SQL or plug in your dbt models and start syncing your cloud warehouse to SaaS applications like Salesforce, Marketo, Hubspot, and many more. Go to dataengineeringpodcast.com/census today to get a free 14-day trial.
Your host is Tobias Macey and today I’m interviewing Mark Grover about his work at Stemma to bring the Amundsen project to a wider audience and increase trust in their data.
Interview
Introduction
Can you describe what Stemma is and the story behind it?
Can you give me more context on how and why Stemma fits into the current data engineering world? Among the popular tools of today for data warehousing and other products that stitch data together, what is Stemma’s place? Where does it fit into the workflow?
How has the explosion in options for data cataloging and discovery influenced your thinking on the necessary feature set for that class of tools? How do you compare to your competitors?
With how long we have been using data and building systems to analyze it, why do you think that trust in the results is still such a momentous problem?
Tell me more about Stemma and how it compares to Amundsen?
Can you tell me more about the impact of Stemma/Amundsen on companies that use it?
What are the opportunities for innovating on top of Stemma to help organizations streamline communication between data producers and consumers?
Beyond the technological capabilities of a data platform, the bigger question is usually the social/organizational patterns around data. How have the "best practices" around the people side of data changed in the recent past?
What are the points of friction that you continue to see?
A majority of conversations around data catalogs and discovery are focused on analytical usage. How can these platforms be used in ML and AI workloads?
How has the data engineering world changed since you left Lyft/since we last spoke? How do you see it evolving in the future?
Imagine 5 years down the line and let’s say Stemma is a household name. How have data analysts’ lives improved? Data engineers? Data scientists?
What are the most interesting, innovative, or unexpected ways that you have seen Stemma used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on Stemma?
When is Stemma the wrong choice?
What do you have planned for the future of Stemma?
Contact Info
LinkedIn
Email
@mark_grover on Twitter
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Links
Stemma
Amundsen
Podcast Episode
CSAT == Customer Satisfaction
Data Mesh
Podcast Episode
Feast open source feature store
Supergrain
Transform
Podcast Episode
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
8/10/2021 • 52 minutes, 36 seconds
Data Discovery From Dashboards To Databases With Castor
Summary
Every organization needs to be able to use data to answer questions about their business. The trouble is that the data is usually spread across a wide and shifting array of systems, from databases to dashboards. The other challenge is that even if you do find the information you are seeking, there might not be enough context available to determine how to use it or what it means. Castor is building a data discovery platform aimed at solving this problem, allowing you to search for and document details about everything from a database column to a business intelligence dashboard. In this episode CTO Amaury Dumoulin shares his perspective on the complexity of letting everyone in the company find answers to their questions and how Castor is designed to help.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
You listen to this show to learn about all of the latest tools, patterns, and practices that power data engineering projects across every domain. Now there’s a book that captures the foundational lessons and principles that underlie everything that you hear about here. I’m happy to announce I collected wisdom from the community to help you in your journey as a data engineer and worked with O’Reilly to publish it as 97 Things Every Data Engineer Should Know. Go to dataengineeringpodcast.com/97things today to get your copy!
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
Are you bored with writing scripts to move data into SaaS tools like Salesforce, Marketo, or Facebook Ads? Hightouch is the easiest way to sync data into the platforms that your business teams rely on. The data you’re looking for is already in your data warehouse and BI tools. Connect your warehouse to Hightouch, paste a SQL query, and use their visual mapper to specify how data should appear in your SaaS systems. No more scripts, just SQL. Supercharge your business teams with customer data using Hightouch for Reverse ETL today. Get started for free at dataengineeringpodcast.com/hightouch.
Have you ever had to develop ad-hoc solutions for security, privacy, and compliance requirements? Are you spending too much of your engineering resources on creating database views, configuring database permissions, and manually granting and revoking access to sensitive data? Satori has built the first DataSecOps Platform that streamlines data access and security. Satori’s DataSecOps automates data access controls, permissions, and masking for all major data platforms such as Snowflake, Redshift and SQL Server and even delegates data access management to business users, helping you move your organization from default data access to need-to-know access. Go to dataengineeringpodcast.com/satori today and get a $5K credit for your next Satori subscription.
Your host is Tobias Macey and today I’m interviewing Amaury Dumoulin about Castor, a managed platform for easy data cataloging and discovery
Interview
Introduction
How did you get involved in the area of data management?
Can you describe what Castor is and the story behind it?
The market for data catalogues is nascent but growing fast. What are the broad categories for the different products and projects in the space?
What do you see as the core features that are required to be competitive?
In what ways has that changed in the past 1 – 2 years?
What are the opportunities for innovation and differentiation in the data catalog/discovery ecosystem?
How do you characterize your current position in the market?
Who are the target users of Castor?
Can you describe the technical architecture and implementation of the Castor platform?
How have the goals and design changed since you first began working on it?
Can you talk through the workflow of getting Castor set up in an organization and onboarding the users?
What are the design elements and platform features that allow for serving the various roles and stakeholders in an organization?
What are the organizational benefits that you have seen from users adopting Castor or other data discovery/catalog systems?
What are the most interesting, innovative, or unexpected ways that you have seen Castor used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on Castor?
When is Castor the wrong choice?
What do you have planned for the future of Castor?
Contact Info
Amaury Dumoulin
Castor website
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
Castor
Atlan
Podcast Episode
dbt
Podcast Episode
Monte Carlo
Podcast Episode
Collibra
Podcast Episode
Amundsen
Podcast Episode
Airflow
Podcast Episode
Metabase
Podcast Episode
Airbyte
Podcast Episode
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
8/7/2021 • 52 minutes, 46 seconds
Charting A Path For Streaming Data To Fill Your Data Lake With Hudi
Summary
Data lake architectures have largely been biased toward batch processing workflows due to the volume of data that they are designed for. With more real-time requirements and the increasing use of streaming data, there has been a struggle to merge fast, incremental updates with large, historical analysis. Vinoth Chandar helped to create the Hudi project while at Uber to address this challenge. By adding support for small, incremental inserts into large table structures, and building support for arbitrary update and delete operations, the Hudi project brings the best of both worlds together. In this episode Vinoth shares the history of the project, how its architecture allows for building more frequently updated analytical queries, and the work being done to add a more polished experience to the data lake paradigm.
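To make the incremental-update idea concrete, here is a hedged sketch of an upsert through Hudi's Spark datasource; the table path, record key, and column names are hypothetical, the option keys follow the Hudi documentation, and the Hudi Spark bundle is assumed to be on the classpath.

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("hudi-upsert-sketch")
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .getOrCreate()
)

# A small batch of changed records arriving from an upstream source.
updates = spark.createDataFrame(
    [("order-1", "2021-08-01", 42.0, 1627862400)],
    ["order_id", "order_date", "amount", "ts"],
)

hudi_options = {
    "hoodie.table.name": "orders",
    "hoodie.datasource.write.recordkey.field": "order_id",
    "hoodie.datasource.write.partitionpath.field": "order_date",
    "hoodie.datasource.write.precombine.field": "ts",
    # "upsert" merges these records into existing file groups instead of
    # rewriting the whole table; "delete" would remove matching keys.
    "hoodie.datasource.write.operation": "upsert",
}

(
    updates.write.format("hudi")
    .options(**hudi_options)
    .mode("append")
    .save("/tmp/lake/orders")
)
```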
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
You listen to this show to learn about all of the latest tools, patterns, and practices that power data engineering projects across every domain. Now there’s a book that captures the foundational lessons and principles that underlie everything that you hear about here. I’m happy to announce I collected wisdom from the community to help you in your journey as a data engineer and worked with O’Reilly to publish it as 97 Things Every Data Engineer Should Know. Go to dataengineeringpodcast.com/97things today to get your copy!
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
RudderStack’s smart customer data pipeline is warehouse-first. It builds your customer data warehouse and your identity graph on your data warehouse, with support for Snowflake, Google BigQuery, Amazon Redshift, and more. Their SDKs and plugins make event streaming easy, and their integrations with cloud applications like Salesforce and ZenDesk help you go beyond event streaming. With RudderStack you can use all of your customer data to answer more difficult questions and then send those insights to your whole customer data stack. Sign up free at dataengineeringpodcast.com/rudder today.
We’ve all been asked to help with an ad-hoc request for data by the sales and marketing team. Then it becomes a critical report that they need updated every week or every day. Then what do you do? Send a CSV via email? Write some Python scripts to automate it? But what about incremental sync, API quotas, error handling, and all of the other details that eat up your time? Today, there is a better way. With Census, just write SQL or plug in your dbt models and start syncing your cloud warehouse to SaaS applications like Salesforce, Marketo, Hubspot, and many more. Go to dataengineeringpodcast.com/census today to get a free 14-day trial.
Your host is Tobias Macey and today I’m interviewing Vinoth Chandar about Apache Hudi, a data lake management layer for supporting fast and incremental updates to your tables.
Interview
Introduction
How did you get involved in the area of data management?
Can you describe what Hudi is and the story behind it?
What are the use cases that it is focused on supporting?
There have been a number of alternative table formats introduced for data lakes recently. How does Hudi compare to projects like Iceberg, Delta Lake, Hive, etc.?
Can you describe how Hudi is architected?
How have the goals and design of Hudi changed or evolved since you first began working on it?
If you were to start the whole project over today, what would you do differently?
Can you talk through the lifecycle of a data record as it is ingested, compacted, and queried in a Hudi deployment?
One of the capabilities that is interesting to explore is support for arbitrary record deletion. Can you talk through why this is a challenging operation in data lake architectures?
How does Hudi make that a tractable problem?
What are the data platform components that are needed to support an installation of Hudi?
What is involved in migrating an existing data lake to use Hudi?
How would someone approach supporting heterogeneous table formats in their lake?
As someone who has invested a lot of time in technologies for supporting data lakes, what are your thoughts on the tradeoffs of data lake vs data warehouse and the current trajectory of the ecosystem?
What are the most interesting, innovative, or unexpected ways that you have seen Hudi used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on Hudi?
When is Hudi the wrong choice?
What do you have planned for the future of Hudi?
Contact Info
Linkedin
Twitter
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Links
Hudi Docs
Hudi Design & Architecture
Incremental Processing
CDC == Change Data Capture
Podcast Episodes
Oracle GoldenGate
Voldemort
Kafka
Hadoop
Spark
HBase
Parquet
Iceberg Table Format
Data Engineering Episode
Hive ACID
Apache Kudu
Podcast Episode
Vertica
Delta Lake
Podcast Episode
Optimistic Concurrency Control
MVCC == Multi-Version Concurrency Control
Presto
Flink
Podcast Episode
Trino
Podcast Episode
Gobblin
LakeFS
Podcast Episode
Nessie
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
8/3/2021 • 1 hour, 9 minutes, 36 seconds
Adding Context And Comprehension To Your Analytics Through Data Discovery With SelectStar
Summary
Companies of all sizes and industries are trying to use the data that they and their customers generate to survive and thrive in the modern economy. As a result, they are relying on a constantly growing number of data sources being accessed by an increasingly varied set of users. In order to help data consumers find and understand the data that is available, and help the data producers understand how to prioritize their work, SelectStar has built a data discovery platform that brings everyone together. In this episode Shinji Kim shares her experience as a data professional struggling to collaborate with her colleagues and how that led her to founding a company to address that problem. She also discusses the combination of technical and social challenges that need to be solved for everyone to gain context and comprehension around their most valuable asset.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
You listen to this show to learn about all of the latest tools, patterns, and practices that power data engineering projects across every domain. Now there’s a book that captures the foundational lessons and principles that underlie everything that you hear about here. I’m happy to announce I collected wisdom from the community to help you in your journey as a data engineer and worked with O’Reilly to publish it as 97 Things Every Data Engineer Should Know. Go to dataengineeringpodcast.com/97things today to get your copy!
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
Are you bored with writing scripts to move data into SaaS tools like Salesforce, Marketo, or Facebook Ads? Hightouch is the easiest way to sync data into the platforms that your business teams rely on. The data you’re looking for is already in your data warehouse and BI tools. Connect your warehouse to Hightouch, paste a SQL query, and use their visual mapper to specify how data should appear in your SaaS systems. No more scripts, just SQL. Supercharge your business teams with customer data using Hightouch for Reverse ETL today. Get started for free at dataengineeringpodcast.com/hightouch.
Atlan is a collaborative workspace for data-driven teams, like Github for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $3000 on an annual subscription
Your host is Tobias Macey and today I’m interviewing Shinji Kim about SelectStar, an intelligent data discovery platform that helps you understand your data
Interview
Introduction
How did you get involved in the area of data management?
Can you describe what SelectStar is and the story behind it?
What are the core challenges that organizations are facing around data cataloging and discovery?
There has been a surge in tools and services for metadata collection, data catalogs, and data collaboration. How would you characterize the current state of the ecosystem?
What is SelectStar’s role in the space?
Who are your target customers and how does that shape your prioritization of features and the user experience design?
Can you describe how SelectStar is architected?
How have the goals and design of the platform shifted or evolved since you first began working on it?
I understand that you have built integrations with a number of BI and dashboarding tools such as Looker, Tableau, Superset, etc. What are the use cases that those integrations enable?
What are the challenges or complexities involved in building and maintaining those integrations?
What are the other categories of integration that you have had to implement to make SelectStar a viable solution?
Can you describe the workflow of a team that is using SelectStar to collaborate on data engineering and analytics?
What have been the most complex or difficult problems to solve for?
What are the most interesting, innovative, or unexpected ways that you have seen SelectStar used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on SelectStar?
When is SelectStar the wrong choice?
What do you have planned for the future of SelectStar?
Contact Info
LinkedIn
@shinjikim on Twitter
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Links
SelectStar
University of Waterloo
Kafka
Storm
Concord Systems
Akamai
Snowflake
Podcast Episode
BigQuery
Looker
Podcast Episode
Tableau
dbt
Podcast Episode
OpenLineage
Podcast Episode
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
7/31/2021 • 51 minutes, 23 seconds
Building a Multi-Tenant Managed Platform For Streaming Data With Pulsar at Datastax
Summary
Everyone expects data to be transmitted, processed, and updated instantly as more and more products integrate streaming data. The technology to make that possible has been around for a number of years, but the barriers to adoption have still been high due to the level of technical understanding and operational capacity that have been required to run at scale. Datastax has recently introduced a new managed offering for Pulsar workloads in the form of Astra Streaming that lowers those barriers and makes streaming workloads accessible to a wider audience. In this episode Prabhat Jha and Jonathan Ellis share the work that they have been doing to integrate streaming data into their managed Cassandra service. They explain how Pulsar is being used by their customers, the work that they have done to scale the administrative workload for multi-tenant environments, and the challenges of operating such a data-intensive service at large scale. This is a fascinating conversation with a lot of useful lessons for anyone who wants to understand the operational aspects of Pulsar and the benefits that it can provide to data workloads.
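For a sense of the developer-facing surface involved, here is a minimal produce-and-consume sketch using the standard Python pulsar-client library against a self-hosted broker; a managed service such as Astra Streaming would additionally require its service URL and token authentication, which are omitted here, and the topic and subscription names are hypothetical.

```python
import pulsar

# Connect to a broker (for a managed service you would use its service URL
# and an authentication token instead of a local address).
client = pulsar.Client("pulsar://localhost:6650")

producer = client.create_producer("persistent://public/default/orders")
producer.send("order-123".encode("utf-8"))

consumer = client.subscribe(
    "persistent://public/default/orders",
    subscription_name="order-processor",
)
msg = consumer.receive()
print(msg.data())          # b'order-123'
consumer.acknowledge(msg)  # mark the message as processed

client.close()
```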
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
You listen to this show to learn about all of the latest tools, patterns, and practices that power data engineering projects across every domain. Now there’s a book that captures the foundational lessons and principles that underlie everything that you hear about here. I’m happy to announce I collected wisdom from the community to help you in your journey as a data engineer and worked with O’Reilly to publish it as 97 Things Every Data Engineer Should Know. Go to dataengineeringpodcast.com/97things today to get your copy!
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
RudderStack’s smart customer data pipeline is warehouse-first. It builds your customer data warehouse and your identity graph on your data warehouse, with support for Snowflake, Google BigQuery, Amazon Redshift, and more. Their SDKs and plugins make event streaming easy, and their integrations with cloud applications like Salesforce and ZenDesk help you go beyond event streaming. With RudderStack you can use all of your customer data to answer more difficult questions and then send those insights to your whole customer data stack. Sign up free at dataengineeringpodcast.com/rudder today.
We’ve all been asked to help with an ad-hoc request for data by the sales and marketing team. Then it becomes a critical report that they need updated every week or every day. Then what do you do? Send a CSV via email? Write some Python scripts to automate it? But what about incremental sync, API quotas, error handling, and all of the other details that eat up your time? Today, there is a better way. With Census, just write SQL or plug in your dbt models and start syncing your cloud warehouse to SaaS applications like Salesforce, Marketo, Hubspot, and many more. Go to dataengineeringpodcast.com/census today to get a free 14-day trial.
Your host is Tobias Macey and today I’m interviewing Prabhat Jha and Jonathan Ellis about Astra Streaming, a cloud-native streaming platform built on Apache Pulsar
Interview
Introduction
How did you get involved in the area of data management?
Can you describe what the Astra platform is and the story behind it?
How does streaming fit into your overall product vision and the needs of your customers?
What was your selection process/criteria for adopting a streaming engine to complement your existing technology investment?
What are the core use cases that you are aiming to support with Astra Streaming?
Can you describe the architecture and automation of your hosted platform for Pulsar?
What are the integration points that you have built to make it work well with Cassandra?
What are some of the additional tools that you have added to your distribution of Pulsar to simplify operation and use?
What are some of the sharp edges that you have had to sand down as you have scaled up your usage of Pulsar?
What is the process for someone to adopt and integrate with your Astra Streaming service?
How do you handle migrating existing projects, particularly if they are using Kafka currently?
One of the capabilities that you highlight on the product page for Astra Streaming is the ability to execute machine learning workflows on data in flight. What are some of the supporting systems that are necessary to power that workflow?
What are the capabilities that are built into Pulsar that simplify the operational aspects of streaming ML?
What are the ways that you are engaging with and supporting the Pulsar community?
What are the near to medium term elements of the Pulsar roadmap that you are working toward and excited to incorporate into Astra?
What are the most interesting, innovative, or unexpected ways that you have seen Astra used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on Astra?
When is Astra the wrong choice?
What do you have planned for the future of Astra?
Contact Info
Prabhat
LinkedIn
@prabhatja on Twitter
prabhatja on GitHub
Jonathan
LinkedIn
@spyced on Twitter
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
Pulsar
Podcast Episode
Streamnative Episode
Datastax Astra Streaming
Datastax Astra DB
Luna Streaming Distribution
Datastax
Cassandra
Kesque (formerly Kafkaesque)
Kafka
RabbitMQ
Prometheus
Grafana
Pulsar Heartbeat
Pulsar Summit
Pulsar Summit Presentation on Kafka Connectors
Replicated
Chaos Engineering
Fallout chaos engineering tools
Jepsen
Podcast Episode
Jack VanLightly
BookKeeper TLA+ Model
Change Data Capture
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
7/28/2021 • 1 hour, 12 seconds
Bringing The Metrics Layer To The Masses With Transform
Summary
Collecting and cleaning data is only useful if someone can make sense of it afterward. The latest evolution in the data ecosystem is the introduction of a dedicated metrics layer to help address the challenge of adding context and semantics to raw information. In this episode Nick Handel shares the story behind Transform, a new platform that provides a managed metrics layer for your data platform. He explains the challenges that occur when metrics are maintained across a variety of systems, the benefits of unifying them in a common access layer, and the potential that it unlocks for everyone in the business to confidently answer questions with data.
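To ground what a dedicated metrics layer provides, here is a deliberately simplified sketch (not Transform's actual framework) in which a metric is declared once as data and compiled to SQL, so every consumer asking for revenue gets the same logic; all table and column names are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class Metric:
    name: str
    table: str
    expression: str                 # aggregation applied to the source table
    dimensions: list = field(default_factory=list)

def to_sql(metric: Metric, group_by: list) -> str:
    """Compile a metric request into SQL, rejecting unsupported dimensions."""
    unknown = set(group_by) - set(metric.dimensions)
    if unknown:
        raise ValueError(f"unsupported dimensions: {sorted(unknown)}")
    dims = ", ".join(group_by)
    select = f"{dims}, " if dims else ""
    sql = f"SELECT {select}{metric.expression} AS {metric.name} FROM {metric.table}"
    if dims:
        sql += f" GROUP BY {dims}"
    return sql

# Defined once, centrally, instead of re-implemented in every dashboard and notebook.
revenue = Metric(
    name="revenue",
    table="analytics.orders",
    expression="SUM(order_total)",
    dimensions=["order_date", "country"],
)

print(to_sql(revenue, group_by=["order_date"]))
# SELECT order_date, SUM(order_total) AS revenue FROM analytics.orders GROUP BY order_date
```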
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
You listen to this show to learn about all of the latest tools, patterns, and practices that power data engineering projects across every domain. Now there’s a book that captures the foundational lessons and principles that underlie everything that you hear about here. I’m happy to announce I collected wisdom from the community to help you in your journey as a data engineer and worked with O’Reilly to publish it as 97 Things Every Data Engineer Should Know. Go to dataengineeringpodcast.com/97things today to get your copy!
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
Are you bored with writing scripts to move data into SaaS tools like Salesforce, Marketo, or Facebook Ads? Hightouch is the easiest way to sync data into the platforms that your business teams rely on. The data you’re looking for is already in your data warehouse and BI tools. Connect your warehouse to Hightouch, paste a SQL query, and use their visual mapper to specify how data should appear in your SaaS systems. No more scripts, just SQL. Supercharge your business teams with customer data using Hightouch for Reverse ETL today. Get started for free at dataengineeringpodcast.com/hightouch.
Atlan is a collaborative workspace for data-driven teams, like Github for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $3000 on an annual subscription
Your host is Tobias Macey and today I’m interviewing Nick Handel about Transform, a platform providing a dedicated metrics layer for your data stack
Interview
Introduction
How did you get involved in the area of data management?
Can you describe what Transform is and the story behind it?
How do you define the concept of a "metric" in the context of the data platform?
What are the general strategies in the industry for creating, managing, and consuming metrics?
How has that been changing in the past couple of years?
What is driving that shift?
What are the main goals that you have for the Transform platform?
Who are the target users? How does that focus influence your approach to the design of the platform?
How is the Transform platform architected?
What are the core capabilities that are required for a metrics service?
What are the integration points for a metrics service?
Can you talk through the workflow of defining and consuming metrics with Transform?
What are the challenges that teams face in establishing consensus or a shared understanding around a given metric definition?
What are the lifecycle stages that need to be factored into the long-term maintenance of a metric definition?
What are some of the capabilities or projects that are made possible by having a metrics layer in the data platform?
What are the capabilities in downstream tools that are currently missing or underdeveloped to support the metrics store as a core layer of the platform?
What are the most interesting, innovative, or unexpected ways that you have seen Transform used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on Transform?
When is Transform the wrong choice?
What do you have planned for the future of Transform?
Contact Info
LinkedIn
@nick_handel on Twitter
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
Transform
Transform’s Metrics Framework
Transform’s Metrics Catalog
Transform’s Metrics API
Nick’s experiences using Airbnb’s Metrics Store
Get Transform
BlackRock
AirBnB
Airflow
Superset
Podcast Episode
AirBnB Knowledge Repo
AirBnB Minerva Metric Store
OLAP Cube
Semantic Layer
Master Data Management
Podcast Episode
Data Normalization
OpenLineage
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
7/23/2021 • 1 hour, 1 minute, 17 seconds
Strategies For Proactive Data Quality Management
Summary
Data quality is a concern that has been gaining attention alongside the rising importance of analytics for business success. Many solutions rely on hand-coded rules for catching known bugs, or statistical analysis of records to detect anomalies retroactively. While those are useful tools, it is far better to prevent data errors before they become an outsized issue. In this episode Gleb Mezhanskiy shares some strategies for adding quality checks at every stage of your development and deployment workflow to identify and fix problematic changes to your data before they get to production.
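As one concrete, deliberately simplified example of catching a problem before it ships rather than after, the sketch below profiles a proposed build of a table against the current production table and fails the change if row counts drop or null rates rise past a threshold; the tables, columns, and thresholds are hypothetical, and this is not Datafold's implementation.

```python
import sqlite3

def profile(conn, table):
    """Collect a few cheap health metrics for a table."""
    total, nulls = conn.execute(
        f"SELECT COUNT(*), SUM(CASE WHEN amount IS NULL THEN 1 ELSE 0 END) FROM {table}"
    ).fetchone()
    return {"rows": total, "null_rate": (nulls or 0) / max(total, 1)}

def regression_issues(prod, staging, max_row_drop=0.05, max_null_rate=0.01):
    """Compare a staging build against production and list any regressions."""
    issues = []
    if staging["rows"] < prod["rows"] * (1 - max_row_drop):
        issues.append(f"row count dropped by more than {max_row_drop:.0%}")
    if staging["null_rate"] > max_null_rate:
        issues.append(f"null rate exceeds {max_null_rate:.0%}")
    return issues

# Tiny in-memory stand-in for a production table and a proposed rebuild of it.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders_prod (id INTEGER, amount REAL);
    CREATE TABLE orders_staging (id INTEGER, amount REAL);
    INSERT INTO orders_prod VALUES (1, 10.0), (2, 20.0), (3, 30.0), (4, 40.0);
    INSERT INTO orders_staging VALUES (1, 10.0), (2, NULL);
""")

issues = regression_issues(profile(conn, "orders_prod"), profile(conn, "orders_staging"))
if issues:
    # In CI this would block the merge instead of letting bad data reach consumers.
    raise SystemExit("data quality regression: " + "; ".join(issues))
```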
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
You listen to this show to learn about all of the latest tools, patterns, and practices that power data engineering projects across every domain. Now there’s a book that captures the foundational lessons and principles that underlie everything that you hear about here. I’m happy to announce I collected wisdom from the community to help you in your journey as a data engineer and worked with O’Reilly to publish it as 97 Things Every Data Engineer Should Know. Go to dataengineeringpodcast.com/97things today to get your copy!
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
RudderStack’s smart customer data pipeline is warehouse-first. It builds your customer data warehouse and your identity graph on your data warehouse, with support for Snowflake, Google BigQuery, Amazon Redshift, and more. Their SDKs and plugins make event streaming easy, and their integrations with cloud applications like Salesforce and ZenDesk help you go beyond event streaming. With RudderStack you can use all of your customer data to answer more difficult questions and then send those insights to your whole customer data stack. Sign up free at dataengineeringpodcast.com/rudder today.
We’ve all been asked to help with an ad-hoc request for data by the sales and marketing team. Then it becomes a critical report that they need updated every week or every day. Then what do you do? Send a CSV via email? Write some Python scripts to automate it? But what about incremental sync, API quotas, error handling, and all of the other details that eat up your time? Today, there is a better way. With Census, just write SQL or plug in your dbt models and start syncing your cloud warehouse to SaaS applications like Salesforce, Marketo, Hubspot, and many more. Go to dataengineeringpodcast.com/census today to get a free 14-day trial.
Your host is Tobias Macey and today I’m interviewing Gleb Mezhanskiy about strategies for proactive data quality management and his work at Datafold to help provide tools for implementing them
Interview
Introduction
How did you get involved in the area of data management?
Can you describe what you are building at Datafold and the story behind it?
What are the biggest factors that you see contributing to data quality issues?
How are teams identifying and addressing those failures?
How does the data platform architecture impact the potential for introducing quality problems?
What are some of the potential risks or consequences of introducing errors in data processing?
How can organizations shift to being proactive in their data quality management?
How much of a role does tooling play in addressing the introduction and remediation of data quality problems?
Can you describe how Datafold is designed and architected to allow for proactive management of data quality?
What are some of the original goals and assumptions about how to empower teams to improve data quality that have been challenged or changed as you have worked through building Datafold?
What is the workflow for an individual or team who is using Datafold as part of their data pipeline and platform development?
What are the organizational patterns that you have found to be most conducive to proactive data quality management?
Who is responsible for identifying and addressing quality issues?
What are the most interesting, innovative, or unexpected ways that you have seen Datafold used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on Datafold?
When is Datafold the wrong choice?
What do you have planned for the future of Datafold?
Contact Info
LinkedIn
@glebmm on Twitter
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Links
Datafold
Autodesk
Airflow
Podcast.__init__ Episode
Spark
Looker
Podcast Episode
Amundsen
Podcast Episode
dbt
Podcast Episode
Dagster
Podcast Episode
Podcast.__init__ Episode
Change Data Capture
Podcast Episodes
Delta Lake
Podcast Episode
Trino
Podcast Episode
Presto
Parquet
Podcast Episode
Data Quality Meetup
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Special Guest: Gleb Mezhanskiy.
7/20/2021 • 1 hour, 1 minute, 6 seconds
Low Code And High Quality Data Engineering For The Whole Organization With Prophecy
Summary
There is a wealth of tools and systems available for processing data, but the user experience of integrating them and building workflows is still lacking. This is particularly important in large and complex organizations where domain knowledge and context are paramount and there may not be access to engineers for codifying that expertise. Raj Bains founded Prophecy to address this need by creating a UI-first platform for building and executing data engineering workflows that orchestrates Airflow and Spark. Rather than locking your business logic into a proprietary storage layer and only exposing it through a drag-and-drop editor, Prophecy synchronizes all of your jobs with source control, allowing an easy bi-directional interaction between code-first and no-code experiences. In this episode he shares his motivations for creating Prophecy, how he is leveraging the magic of compilers to translate between UI- and code-oriented representations of logic, and the organizational benefits of having a cohesive experience designed to bring business users and domain experts into the same platform as data engineers and analysts.
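As a loose illustration of the visual-to-code direction described above (a sketch under stated assumptions, not Prophecy's implementation), the snippet below renders a tiny declarative pipeline description, of the kind a drag-and-drop editor might emit, into plain PySpark code that could be committed to source control and edited by hand; the spec format and generated code are entirely hypothetical.

```python
# Hypothetical pipeline spec, e.g. produced by a visual editor.
pipeline = {
    "name": "clean_orders",
    "source": {"format": "parquet", "path": "s3://lake/raw/orders"},
    "steps": [
        {"op": "filter", "condition": "amount > 0"},
        {"op": "select", "columns": ["order_id", "order_date", "amount"]},
    ],
    "sink": {"format": "parquet", "path": "s3://lake/clean/orders"},
}

def render_pyspark(spec: dict) -> str:
    """Render the declarative spec into plain PySpark code a human can edit."""
    lines = [
        "from pyspark.sql import SparkSession",
        "",
        f"spark = SparkSession.builder.appName({spec['name']!r}).getOrCreate()",
        f"df = spark.read.format({spec['source']['format']!r}).load({spec['source']['path']!r})",
    ]
    for step in spec["steps"]:
        if step["op"] == "filter":
            lines.append(f"df = df.filter({step['condition']!r})")
        elif step["op"] == "select":
            cols = ", ".join(repr(c) for c in step["columns"])
            lines.append(f"df = df.select({cols})")
    lines.append(
        f"df.write.format({spec['sink']['format']!r}).mode('overwrite').save({spec['sink']['path']!r})"
    )
    return "\n".join(lines)

print(render_pyspark(pipeline))
```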
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
You listen to this show to learn about all of the latest tools, patterns, and practices that power data engineering projects across every domain. Now there’s a book that captures the foundational lessons and principles that underlie everything that you hear about here. I’m happy to announce I collected wisdom from the community to help you in your journey as a data engineer and worked with O’Reilly to publish it as 97 Things Every Data Engineer Should Know. Go to dataengineeringpodcast.com/97things today to get your copy!
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
Are you bored with writing scripts to move data into SaaS tools like Salesforce, Marketo, or Facebook Ads? Hightouch is the easiest way to sync data into the platforms that your business teams rely on. The data you’re looking for is already in your data warehouse and BI tools. Connect your warehouse to Hightouch, paste a SQL query, and use their visual mapper to specify how data should appear in your SaaS systems. No more scripts, just SQL. Supercharge your business teams with customer data using Hightouch for Reverse ETL today. Get started for free at dataengineeringpodcast.com/hightouch.
Atlan is a collaborative workspace for data-driven teams, like Github for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $3000 on an annual subscription
Your host is Tobias Macey and today I’m interviewing Raj Bains about Prophecy, a low-code data engineering platform built on Spark and Airflow
Interview
Introduction
How did you get involved in the area of data management?
Can you describe what you are building at Prophecy and the story behind it?
There are a huge number of tools and recommended architectures for every variety of data need. Why is data engineering still such a complicated and challenging undertaking?
What features and capabilities does Prophecy provide to help address those issues?
What are the roles and use cases that you are focusing on serving with Prophecy?
What are the elements of the data platform that Prophecy can replace?
Can you describe how Prophecy is implemented?
What was your selection criteria for the foundational elements of the platform?
What would be involved in adopting other execution and orchestration engines?
Can you describe the workflow of building a pipeline with Prophecy?
What are the design and structural features that you have built to manage workflows as they scale in terms of technical and organizational complexity?
What are the options for data engineers/data professionals to build and share reusable components across the organization?
What are the most interesting, innovative, or unexpected ways that you have seen Prophecy used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on Prophecy?
When is Prophecy the wrong choice?
What do you have planned for the future of Prophecy?
Contact Info
LinkedIn
@_raj_bains on Twitter
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Links
Prophecy
CUDA
Apache Hive
Hortonworks
NoSQL
NewSQL
Paxos
Apache Impala
AbInitio
Teradata
Snowflake
Podcast Episode
Presto
Podcast Episode
LinkedIn
Spark
Databricks
Cron
Airflow
Astronomer
Alteryx
Streamsets
Azure Data Factory
Apache Flink
Podcast Episode
Prefect
Podcast Episode
Dagster
Podcast Episode
Podcast.__init__ Episode
Kubernetes Operator
Scala
Kafka
Abstract Syntax Tree
Language Server Protocol
Amazon Deequ
dbt
Tecton
Podcast Episode
Informatica
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
7/16/2021 • 1 hour, 12 minutes, 35 seconds
Exploring The Design And Benefits Of The Modern Data Stack
Summary
We have been building platforms and workflows to store, process, and analyze data since the earliest days of computing. Over that time there have been countless architectures, patterns, and "best practices" to make that task manageable. With the growing popularity of cloud services a new pattern has emerged and been dubbed the "Modern Data Stack". In this episode members of the GoDataDriven team, Guillermo Sanchez, Bram Ochsendorf, and Juan Perafan, explain the combinations of services that comprise this architecture, share their experiences working with clients to employ the stack, and describe the benefits of bringing engineers and business users together with data.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
You listen to this show to learn about all of the latest tools, patterns, and practices that power data engineering projects across every domain. Now there’s a book that captures the foundational lessons and principles that underlie everything that you hear about here. I’m happy to announce I collected wisdom from the community to help you in your journey as a data engineer and worked with O’Reilly to publish it as 97 Things Every Data Engineer Should Know. Go to dataengineeringpodcast.com/97things today to get your copy!
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
RudderStack’s smart customer data pipeline is warehouse-first. It builds your customer data warehouse and your identity graph on your data warehouse, with support for Snowflake, Google BigQuery, Amazon Redshift, and more. Their SDKs and plugins make event streaming easy, and their integrations with cloud applications like Salesforce and ZenDesk help you go beyond event streaming. With RudderStack you can use all of your customer data to answer more difficult questions and then send those insights to your whole customer data stack. Sign up free at dataengineeringpodcast.com/rudder today.
We’ve all been asked to help with an ad-hoc request for data by the sales and marketing team. Then it becomes a critical report that they need updated every week or every day. Then what do you do? Send a CSV via email? Write some Python scripts to automate it? But what about incremental sync, API quotas, error handling, and all of the other details that eat up your time? Today, there is a better way. With Census, just write SQL or plug in your dbt models and start syncing your cloud warehouse to SaaS applications like Salesforce, Marketo, Hubspot, and many more. Go to dataengineeringpodcast.com/census today to get a free 14-day trial.
Your host is Tobias Macey and today I’m interviewing Guillermo Sanchez, Bram Ochsendorf, and Juan Perafan about their experiences with managed services in the modern data stack in their work as consultants at GoDataDriven
Interview
Introduction
How did you get involved in the area of data management?
Can you start by giving your definition of the modern data stack?
What are the key characteristics of a tool or platform that make it a candidate for the "modern" stack?
How does the modern data stack shift the responsibilities and capabilities of data professionals and consumers?
What are some difficulties that you face when working with customers to migrate to these new architectures?
What are some of the limitations of the components or paradigms of the modern stack?
What are some strategies that you have devised for addressing those limitations?
What are some edge cases that you have run up against with specific vendors that you have had to work around?
What are the "gotchas" that you don’t run up against until you’ve deployed a service and started using it at scale and over time?
How does data governance get applied across the various services and systems of the modern stack?
One of the core promises of cloud-based and managed services for data is the ability for data analysts and consumers to self-serve. What kinds of training have you found to be necessary/useful for those end-users?
What is the role of data engineers in the context of the "modern" stack?
What are the most interesting, innovative, or unexpected manifestations of the modern data stack that you have seen?
What are the most interesting, unexpected, or challenging lessons that you have learned while working with customers to implement a modern data stack?
When is the modern data stack the wrong choice?
What new architectures or tools are you keeping an eye on for future client work?
Contact Info
Guillermo
LinkedIn
guillesd on GitHub
Bram
LinkedIn
bramochsendorf on GitHub
Juan
LinkedIn
jmperafan on GitHub
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Links
GoDataDriven
Deloitte
RPA == Robotic Process Automation
Analytics Engineer
James Webb Space Telescope
Fivetran
Podcast Episode
dbt
Podcast Episode
Data Governance
Podcast Episodes
Azure Cloud Platform
Stitch Data
Airflow
Prefect
Argo Project
Looker
Azure Purview
Soda Data
Podcast Episode
Datafold
Materialize
Podcast Episode
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
7/13/2021 • 49 minutes, 1 second
Democratize Data Cleaning Across Your Organization With Trifacta
Summary
Every data project, whether it’s analytics, machine learning, or AI, starts with the work of data cleaning. This is a critical step and benefits from being accessible to the domain experts. Trifacta is a platform for managing your data engineering workflow to make curating, cleaning, and preparing your information more approachable for everyone in the business. In this episode CEO Adam Wilson shares the story behind the business, discusses the myriad ways that data wrangling is performed across the business, and how the platform is architected to adapt to the ever-changing landscape of data management tools. This is a great conversation about how deliberate user experience and platform design can make a drastic difference in the amount of value that a business can provide to their customers.
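For a sense of the kinds of transformations a data wrangling tool automates behind its visual interface, here is a rough pandas sketch of typical cleaning steps; it only illustrates the general pattern and is not Trifacta's API or generated output.
```python
# Typical wrangling steps -- normalize text, coerce types, and de-duplicate -- expressed in
# pandas purely for illustration. The sample data is invented.
import pandas as pd

raw = pd.DataFrame({
    "customer": ["  Acme Corp", "acme corp ", "Beta Labs", None],
    "signup_date": ["2021-01-05", "2021-01-05", "2021-02-10", "2021-03-01"],
    "revenue": ["1,200", "1200", "450", "300"],
})

clean = (
    raw
    .dropna(subset=["customer"])  # drop rows missing a key field
    .assign(
        customer=lambda d: d["customer"].str.strip().str.title(),           # normalize case/whitespace
        signup_date=lambda d: pd.to_datetime(d["signup_date"]),             # coerce strings to dates
        revenue=lambda d: d["revenue"].str.replace(",", "").astype(float),  # strip separators, cast
    )
    .drop_duplicates(subset=["customer", "signup_date"])  # collapse the duplicated Acme rows
)
print(clean)
```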
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
You listen to this show to learn about all of the latest tools, patterns, and practices that power data engineering projects across every domain. Now there’s a book that captures the foundational lessons and principles that underlie everything that you hear about here. I’m happy to announce I collected wisdom from the community to help you in your journey as a data engineer and worked with O’Reilly to publish it as 97 Things Every Data Engineer Should Know. Go to dataengineeringpodcast.com/97things today to get your copy!
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
Are you bored with writing scripts to move data into SaaS tools like Salesforce, Marketo, or Facebook Ads? Hightouch is the easiest way to sync data into the platforms that your business teams rely on. The data you’re looking for is already in your data warehouse and BI tools. Connect your warehouse to Hightouch, paste a SQL query, and use their visual mapper to specify how data should appear in your SaaS systems. No more scripts, just SQL. Supercharge your business teams with customer data using Hightouch for Reverse ETL today. Get started for free at dataengineeringpodcast.com/hightouch.
Atlan is a collaborative workspace for data-driven teams, like Github for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $3000 on an annual subscription
Your host is Tobias Macey and today I’m interviewing Adam Wilson about Trifacta, a platform for modern data workers to assess quality, transform, and automate data pipelines
Interview
Introduction
How did you get involved in the area of data management?
Can you describe what Trifacta is and the story behind it?
Across your site and material you focus on using the term "data wrangling". What is your personal definition of that term, and in what ways do you differentiate from ETL/ELT?
How does the deliberate use of that terminology influence the way that you think about the design and features of the Trifacta platform?
What is Trifacta’s role in the overall data platform/data lifecycle for an organization?
What are some examples of tools that Trifacta might replace?
What tools or systems does Trifacta integrate with?
Who are the target end-users of the Trifacta platform and how do those personas direct the design and functionality?
Can you describe how Trifacta is architected?
How have the goals and design of the system changed or evolved since you first began working on it?
Can you talk through the workflow and lifecycle of data as it traverses your platform, and the user interactions that drive it?
How can data engineers share and encourage proper patterns for working with data assets with end-users across the organization?
What are the limits of scale for volume and complexity of data assets that users are able to manage through Trifacta’s visual tools?
What are some strategies that you and your customers have found useful for pre-processing the information that enters your platform to increase the accessibility for end-users to self-serve?
What are the most interesting, innovative, or unexpected ways that you have seen Trifacta used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on Trifacta?
When is Trifacta the wrong choice?
What do you have planned for the future of Trifacta?
Contact Info
LinkedIn
@a_adam_wilson on Twitter
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Links
Trifacta
Informatica
UC Berkeley
Stanford University
Citadel
Podcast Episode
Stanford Data Wrangler
DBT
Podcast Episode
Pig
Databricks
Sqoop
Flume
SPSS
Tableau
SDLC == Software Development Life Cycle
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
7/9/2021 • 1 hour, 7 minutes, 13 seconds
Stick All Of Your Systems And Data Together With SaaSGlue As Your Workflow Manager
Summary
At the core of every data pipeline is a workflow manager (or several). Deploying, managing, and scaling that orchestration can consume a large fraction of a data team’s energy, so it is important to pick something that provides the power and flexibility that you need. SaaSGlue is a managed service that lets you connect all of your systems across clouds and physical infrastructure, spanning all of your programming languages. In this episode Bart and Rich Wood explain how SaaSGlue is architected to allow for a high degree of flexibility in usage and deployment, their experience building a business with family, and how you can get started using it today. This is a fascinating platform with an endless set of use cases and a great team of people behind it.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
RudderStack’s smart customer data pipeline is warehouse-first. It builds your customer data warehouse and your identity graph on your data warehouse, with support for Snowflake, Google BigQuery, Amazon Redshift, and more. Their SDKs and plugins make event streaming easy, and their integrations with cloud applications like Salesforce and ZenDesk help you go beyond event streaming. With RudderStack you can use all of your customer data to answer more difficult questions and then send those insights to your whole customer data stack. Sign up free at dataengineeringpodcast.com/rudder today.
We’ve all been asked to help with an ad-hoc request for data by the sales and marketing team. Then it becomes a critical report that they need updated every week or every day. Then what do you do? Send a CSV via email? Write some Python scripts to automate it? But what about incremental sync, API quotas, error handling, and all of the other details that eat up your time? Today, there is a better way. With Census, just write SQL or plug in your dbt models and start syncing your cloud warehouse to SaaS applications like Salesforce, Marketo, Hubspot, and many more. Go to dataengineeringpodcast.com/census today to get a free 14-day trial.
Your host is Tobias Macey and today I’m interviewing Rich and Bart Wood about SaaSGlue, a SaaS-based integration, orchestration, and automation platform that lets you fill the gaps in your existing automation infrastructure
Interview
Introduction
How did you get involved in the area of data management?
Can you describe what SaaSGlue is and the story behind it?
I understand that you are building this company with your 3 brothers. What have been the pros and cons of working with your family on this project?
What are the main use cases that you are focused on enabling?
Who are your target users and how has that influenced the features and design of the platform?
Orchestration, automation, and workflow management are all areas that have a range of active products and projects. How do you characterize SaaSGlue’s position in the overall ecosystem?
What are some of the ways that you see it integrated into a data platform?
What are the core elements and concepts of the SaaSGlue platform?
How is the SaaSGlue platform architected?
How have the goals and design of the platform changed or evolved since you first began working on it?
What are some of the assumptions that you had at the beginning of the project which have been challenged or changed as you worked through building it?
Can you talk through the workflow of someone building a task graph with SaaSGlue?
How do you handle dependency management for custom code in the payloads for agent tasks?
How does SaaSGlue manage metadata propagation throughout the execution graph?
How do you handle the myriad failure modes that you are likely to encounter? (e.g. agent failure, network partitions, individual task failures, etc.)
What are some of the tools/platforms/architectural paradigms that you looked to for inspiration while designing and building SaaSGlue?
What are the most interesting, innovative, or unexpected ways that you have seen SaaSGlue used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on SaaSGlue?
When is SaaSGlue the wrong choice?
What do you have planned for the future of SaaSGlue?
Contact Info
Rich
LinkedIn
Bart
LinkedIn
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
SaaSGlue
Jenkins
Cron
Airflow
Ansible
Terraform
DSL == Domain Specific Language
Clojure
Gradle
Polymorphism
Dagster
Podcast Episode
Podcast.__init__ Episode
Martin Kleppman
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
7/5/2021 • 55 minutes, 31 seconds
Leveling Up Open Source Data Integration With Meltano Hub And The Singer SDK
Summary
Data integration in the form of extract and load is the critical first step of every data project. There are a large number of commercial and open source projects that offer that capability but it is still far from being a solved problem. One of the most promising community efforts is that of the Singer ecosystem, but it has been plagued by inconsistent quality and design of plugins. In this episode the members of the Meltano project share the work they are doing to improve the discovery, quality, and capabilities of Singer taps and targets. They explain their work on the Meltano Hub and the Singer SDK and their long term goals for the Singer community.
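For context on what the SDK automates, here is a minimal sketch of the raw messages a Singer tap writes to stdout according to the published Singer specification; the SDK generates these for you from declarative stream classes, so nothing here is the SDK's own API.
```python
# A hand-rolled Singer tap at the protocol level: one JSON message per line on stdout.
# The stream name, schema, and bookmark shown here are invented for the example.
import json
import sys


def emit(message: dict) -> None:
    """Singer taps communicate with targets by writing JSON messages to stdout."""
    sys.stdout.write(json.dumps(message) + "\n")


# SCHEMA tells the target what the stream looks like so it can prepare a destination table.
emit({
    "type": "SCHEMA",
    "stream": "users",
    "key_properties": ["id"],
    "schema": {
        "type": "object",
        "properties": {"id": {"type": "integer"}, "email": {"type": "string"}},
    },
})

# RECORD messages carry the actual rows.
for row in [{"id": 1, "email": "a@example.com"}, {"id": 2, "email": "b@example.com"}]:
    emit({"type": "RECORD", "stream": "users", "record": row})

# STATE persists a bookmark so the next run can resume incrementally instead of re-syncing.
emit({"type": "STATE", "value": {"bookmarks": {"users": {"last_id": 2}}}})
```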
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
Are you bored with writing scripts to move data into SaaS tools like Salesforce, Marketo, or Facebook Ads? Hightouch is the easiest way to sync data into the platforms that your business teams rely on. The data you’re looking for is already in your data warehouse and BI tools. Connect your warehouse to Hightouch, paste a SQL query, and use their visual mapper to specify how data should appear in your SaaS systems. No more scripts, just SQL. Supercharge your business teams with customer data using Hightouch for Reverse ETL today. Get started for free at dataengineeringpodcast.com/hightouch.
Atlan is a collaborative workspace for data-driven teams, like Github for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $3000 on an annual subscription
Your host is Tobias Macey and today I’m interviewing Douwe Maan, Taylor Murphy, and AJ Steers about their work to level up the Singer ecosystem through projects like Meltano Hub and the Singer SDK
Interview
Introduction
How did you get involved in the area of data management?
Can you start by describing what the Singer ecosystem is?
What are the current weak points/challenges in the ecosystem?
What is the current role of the Meltano project/community within the ecosystem?
What are the projects and activities related to Singer that you are focused on?
What are the main goals of the Meltano Hub?
What criteria are you using to determine which projects to include in the hub?
Why is the number of targets so small?
What additional functionality do you have planned for the hub?
What functionality does the SDK provide?
How does the presence of the SDK make it easier to write taps/targets?
What do you believe the long-term impacts of the SDK on the overall availability and quality of plugins will be?
Now that you have spun out your own business and raised funding, how does that influence the priorities and focus of your work?
How do you hope to productize what you have built at Meltano?
What are the most interesting, innovative, or unexpected ways that you have seen Meltano and Singer plugins used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working with the Singer community and the Meltano project?
When is Singer/Meltano the wrong choice?
What do you have planned for the future of Meltano, Meltano Hub, and the Singer SDK?
Contact Info
Douwe
Website
Taylor
LinkedIn
@tayloramurphy on Twitter
Blog
AJ
LinkedIn
@aaronsteers on Twitter
aaronsteers on GitLab
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Links
Singer
Meltano
Podcast Episode
Meltano Hub
Singer SDK
Concert Genetics
GitLab
Snowflake
dbt
Podcast Episode
Microsoft SQL Server
Airflow
Podcast Episode
Dagster
Podcast Episode
Podcast.__init__ Episode
Prefect
Podcast Episode
AWS Athena
Reverse ETL
REST (REpresentational State Transfer)
GraphQL
Meltano Interpretation of Singer Specification
Vision for the Future of Meltano blog post
Coalesce Conference
Running Your Data Team Like A Product Team
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
7/3/2021 • 1 hour, 5 minutes, 24 seconds
A Candid Exploration Of Timeseries Data Analysis With InfluxDB
Summary
While the overall concept of timeseries data is uniform, its usage and applications are far from it. One of the most demanding applications of timeseries data is for application and server monitoring due to the problem of high cardinality. In his quest to build a generalized platform for managing timeseries, Paul Dix keeps getting pulled back into the monitoring arena. In this episode he shares the history of the InfluxDB project, the business that he has helped to build around it, and the architectural aspects of the engine that allow for its flexibility in managing various forms of timeseries data. This is a fascinating exploration of the technical and organizational evolution of the Influx Data platform, with some promising glimpses of where they are headed in the near future.
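As a quick illustration of why high cardinality is such a challenge for monitoring workloads, here is a sketch of InfluxDB's line protocol, where every unique combination of tag values defines a separate series; the tag names and counts are invented for the example.
```python
# InfluxDB line protocol: measurement,tag_set field_set timestamp (nanoseconds).
# Tags are indexed, so series cardinality grows multiplicatively with distinct tag values.
hosts = ["web-01", "web-02"]
regions = ["us-east", "eu-west"]
containers = [f"c{i}" for i in range(1000)]

point = "cpu,host=web-01,region=us-east,container=c42 usage_user=0.64 1624978800000000000"

series_count = len(hosts) * len(regions) * len(containers)
print(f"example point: {point}")
print(f"distinct series for this measurement: {series_count}")  # 2 * 2 * 1000 = 4000
```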
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
RudderStack’s smart customer data pipeline is warehouse-first. It builds your customer data warehouse and your identity graph on your data warehouse, with support for Snowflake, Google BigQuery, Amazon Redshift, and more. Their SDKs and plugins make event streaming easy, and their integrations with cloud applications like Salesforce and ZenDesk help you go beyond event streaming. With RudderStack you can use all of your customer data to answer more difficult questions and then send those insights to your whole customer data stack. Sign up free at dataengineeringpodcast.com/rudder today.
We’ve all been asked to help with an ad-hoc request for data by the sales and marketing team. Then it becomes a critical report that they need updated every week or every day. Then what do you do? Send a CSV via email? Write some Python scripts to automate it? But what about incremental sync, API quotas, error handling, and all of the other details that eat up your time? Today, there is a better way. With Census, just write SQL or plug in your dbt models and start syncing your cloud warehouse to SaaS applications like Salesforce, Marketo, Hubspot, and many more. Go to dataengineeringpodcast.com/census today to get a free 14-day trial.
Your host is Tobias Macey and today I’m interviewing Paul Dix about Influx Data and the different facets of the market for timeseries databases
Interview
Introduction
How did you get involved in the area of data management?
Can you describe what you are building at Influx Data and the story behind it?
Timeseries data is a fairly broad category with many variations in terms of storage volume, frequency, processing requirements, etc. This has led to an explosion of database engines and related tools to address these different needs. How do you think about your position and role in the ecosystem?
Who are your target customers and how does that focus inform your product and feature priorities?
What are the use cases that Influx is best suited for?
Can you give an overview of the different projects, tools, and services that comprise your platform?
How is InfluxDB architected?
How have the design and implementation of the DB engine changed or evolved since you first began working on it?
What are you optimizing for on the consistency vs. availability spectrum of CAP?
What is your approach to clustering/data distribution beyond a single node?
For the interface to your database engine you developed a custom query language. What was your process for deciding what syntax to use and how to structure the programmatic interface?
How do you handle the lifecycle of data in an Influx deployment? (e.g. aging out old data, periodic compaction/rollups, etc.)
With your strong focus on monitoring use cases, how do you handle the challenge of high cardinality in the data being stored?
What are some of the data modeling considerations that users should be aware of as they are designing a deployment of Influx?
What is the role of open source in your product strategy?
What are the most interesting, innovative, or unexpected ways that you have seen the Influx platform used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on Influx?
When is Influx DB and/or the associated tools the wrong choice?
What do you have planned for the future of Influx Data?
Contact Info
LinkedIn
pauldix on GitHub
@pauldix on Twitter
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Links
Influx Data
Influx DB
Search and Information Retrieval
Datadog
Podcast Episode
New Relic
StackDriver
Scala
Cassandra
Redis
KDB
Latent Semantic Indexing
TICK Stack
ELK Stack
Prometheus
TSM storage engine
TSI Storage Engine
Golang
Rust Language
RAFT Protocol
Telegraf
Kafka
InfluxQL
Flux Language
DataFusion
Apache Arrow
Apache Parquet
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
6/29/2021 • 1 hour, 6 minutes, 2 seconds
Lessons Learned From The Pipeline Data Engineering Academy
Summary
Data Engineering is a broad and constantly evolving topic, which makes it difficult to teach in a concise and effective manner. Despite that, Daniel Molnar and Peter Fabian started the Pipeline Academy to do exactly that. In this episode they reflect on the lessons that they learned while teaching the first cohort of their bootcamp how to be effective data engineers. By focusing on the fundamentals and making everyone write code, they were able to build confidence and impart the importance of context to their students.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
Are you bored with writing scripts to move data into SaaS tools like Salesforce, Marketo, or Facebook Ads? Hightouch is the easiest way to sync data into the platforms that your business teams rely on. The data you’re looking for is already in your data warehouse and BI tools. Connect your warehouse to Hightouch, paste a SQL query, and use their visual mapper to specify how data should appear in your SaaS systems. No more scripts, just SQL. Supercharge your business teams with customer data using Hightouch for Reverse ETL today. Get started for free at dataengineeringpodcast.com/hightouch.
Atlan is a collaborative workspace for data-driven teams, like Github for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $3000 on an annual subscription
Your host is Tobias Macey and today I’m interviewing Daniel Molnar and Peter Fabian about the lessons that they learned from their first cohort at the Pipeline data engineering academy
Interview
Introduction
How did you get involved in the area of data management?
Can you start by sharing the curriculum and learning goals for the students?
How did you set a common baseline for all of the students to build from throughout the program?
What was your process for determining the structure of the tasks and the tooling used?
What were some of the topics/tools that the students had the most difficulty with?
What topics/tools were the easiest to grasp?
What are some difficulties that you encountered while trying to teach different concepts?
How did you deal with the tension of teaching the fundamentals while tying them to toolchains that hiring managers are looking for?
What are the successes that you had with this cohort and what changes are you making to your approach/curriculum to build on them?
What are some of the failures that you encountered and what lessons have you taken from them?
How did the pandemic impact your overall plan and execution of the initial cohort?
What were the skills that you focused on for interview preparation?
What level of ongoing support/engagement do you have with students once they complete the curriculum?
What are the most interesting, innovative, or unexpected solutions that you saw from your students?
What are the most interesting, unexpected, or challenging lessons that you have learned while working with your first cohort?
When is a bootcamp the wrong approach for skill development?
What do you have planned for the future of the Pipeline Academy?
Contact Info
Daniel
LinkedIn
Website
@soobrosa on Twitter
Peter
LinkedIn
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
Pipeline Academy
Blog
Scikit
Pandas
Urchin
Kafka
Three "C"s – Context, Confidence, and Code
Prefect
Podcast Episode
Great Expectations
Podcast Episode
Podcast.__init__ Episode
Docker
Kubernetes
Become a Data Engineer On A Shoestring
James Mickens
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
6/26/2021 • 1 hour, 11 minutes, 3 seconds
Make Database Performance Optimization A Playful Experience With OtterTune
Summary
The database is the core of any system because it holds the data that drives your entire experience. We spend countless hours designing the data model, updating engine versions, and tuning performance. But how confident are you that you have configured it to be as performant as possible, given the dozens of parameters and how they interact with each other? Andy Pavlo researches autonomous database systems, and out of that research he created OtterTune to find the optimal set of parameters to use for your specific workload. In this episode he explains how the system works, the challenge of scaling it to work across different database engines, and his hopes for the future of database systems.
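As a rough sketch of the underlying idea, here is what a machine-learning-driven tuning loop can look like when a Gaussian Process surrogate model guides the next configuration to try; the knob names and numbers are invented, and this is a conceptual illustration, not OtterTune's implementation.
```python
# Observe throughput for a handful of configurations, fit a Gaussian Process, and pick an
# optimistic next candidate (a simple upper-confidence-bound style acquisition).
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor

# (shared_buffers_gb, work_mem_mb) -> observed throughput in transactions/sec (made-up data)
observed_configs = np.array([[1, 4], [2, 16], [4, 64], [8, 32]], dtype=float)
observed_throughput = np.array([950.0, 1400.0, 1850.0, 1700.0])

gp = GaussianProcessRegressor(normalize_y=True).fit(observed_configs, observed_throughput)

# Score a grid of untried configurations, favoring both high predicted throughput and
# high uncertainty so the tuner keeps exploring.
candidates = np.array(
    [[b, w] for b in (1, 2, 4, 8, 16) for w in (4, 16, 64, 256)], dtype=float
)
mean, std = gp.predict(candidates, return_std=True)
best = candidates[np.argmax(mean + std)]
print(f"next configuration to try: shared_buffers={best[0]:.0f}GB, work_mem={best[1]:.0f}MB")
```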
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
RudderStack’s smart customer data pipeline is warehouse-first. It builds your customer data warehouse and your identity graph on your data warehouse, with support for Snowflake, Google BigQuery, Amazon Redshift, and more. Their SDKs and plugins make event streaming easy, and their integrations with cloud applications like Salesforce and ZenDesk help you go beyond event streaming. With RudderStack you can use all of your customer data to answer more difficult questions and then send those insights to your whole customer data stack. Sign up free at dataengineeringpodcast.com/rudder today.
We’ve all been asked to help with an ad-hoc request for data by the sales and marketing team. Then it becomes a critical report that they need updated every week or every day. Then what do you do? Send a CSV via email? Write some Python scripts to automate it? But what about incremental sync, API quotas, error handling, and all of the other details that eat up your time? Today, there is a better way. With Census, just write SQL or plug in your dbt models and start syncing your cloud warehouse to SaaS applications like Salesforce, Marketo, Hubspot, and many more. Go to dataengineeringpodcast.com/census today to get a free 14-day trial.
Your host is Tobias Macey and today I’m interviewing Andy Pavlo about OtterTune, a system to continuously monitor and improve database performance via machine learning
Interview
Introduction
How did you get involved in the area of data management?
Can you describe what OtterTune is and the story behind it?
How does it relate to your work with NoisePage?
What are the challenges that database administrators, operators, and users run into when working with, configuring, and tuning transactional systems?
What are some of the contributing factors to the sprawling complexity of the configurable parameters for these databases?
Can you describe how OtterTune is implemented?
What are some of the aggregate benefits that OtterTune can gain by running as a centralized service and learning from all of the systems that it connects to?
What are some of the assumptions that you made when starting the commercialization of this technology that have been challenged or invalidated as you began working with initial customers?
How have the design and goals of the system changed or evolved since you first began working on it?
What is involved in adding support for a new database engine?
How applicable are the OtterTune capabilities to analytical database engines?
How do you handle tuning for variable or evolving workloads?
What are some of the most interesting or esoteric configuration options that you have come across while working on OtterTune?
What are some that made you facepalm?
What are the most interesting, innovative, or unexpected ways that you have seen OtterTune used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on OtterTune?
When is OtterTune the wrong choice?
What do you have planned for the future of OtterTune?
Contact Info
CMU Page
apavlo on GitHub
@andy_pavlo on Twitter
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
OtterTune
CMU (Carnegie Mellon University)
Brown University
Michael Stonebraker
H-Store
Learned Indexes
NoisePage
Oracle DB
PostgreSQL
Podcast Episode
MySQL
RDS
Gaussian Process Model
Reinforcement Learning
AWS Aurora
MVCC (Multi-Version Concurrency Control)
Puppet
VectorWise
GreenPlum
Snowflake
Podcast Episode
PGTune
MySQL Tuner
SIGMOD
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
6/23/2021 • 58 minutes, 28 seconds
Bring Order To The Chaos Of Your Unstructured Data Assets With Unstruk
Summary
Working with unstructured data has typically been a motivation for a data lake. The challenge is imposing enough order on the platform to make it useful. Kirk Marple has spent years working with data systems and the media industry, which inspired him to build a platform for automatically organizing your unstructured assets to make them more valuable. In this episode he shares the goals of the Unstruk Data Warehouse, how it is architected to extract asset metadata and build a searchable knowledge graph from the information, and the myriad ways that the system can be used. If you are wondering how to deal with all of the information that doesn’t fit in your databases or data warehouses, then this episode is for you.
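To illustrate the general pattern of enriching unstructured assets with metadata and linking them into a searchable graph, here is a small sketch using networkx; the helper, file, and entity names are hypothetical and are not part of Unstruk's API.
```python
# Register assets and the entities detected in them as nodes, with edges connecting the two,
# so "show me everything related to X" becomes a graph neighborhood query.
from typing import List

import networkx as nx

graph = nx.Graph()


def ingest(path: str, media_type: str, entities: List[str]) -> None:
    """Add an asset node plus edges to the entities detected in it."""
    graph.add_node(path, kind="asset", media_type=media_type)
    for entity in entities:
        graph.add_node(entity, kind="entity")
        graph.add_edge(path, entity)


# In a real pipeline the entity lists would come from OCR, vision, or speech models.
ingest("site_photos/tower_041.jpg", "image", ["Tower 41", "rust damage"])
ingest("inspection_notes/2021-06-01.pdf", "document", ["Tower 41", "maintenance crew"])

print(list(graph.neighbors("Tower 41")))  # both assets that mention Tower 41
```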
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
Are you bored with writing scripts to move data into SaaS tools like Salesforce, Marketo, or Facebook Ads? Hightouch is the easiest way to sync data into the platforms that your business teams rely on. The data you’re looking for is already in your data warehouse and BI tools. Connect your warehouse to Hightouch, paste a SQL query, and use their visual mapper to specify how data should appear in your SaaS systems. No more scripts, just SQL. Supercharge your business teams with customer data using Hightouch for Reverse ETL today. Get started for free at dataengineeringpodcast.com/hightouch.
Atlan is a collaborative workspace for data-driven teams, like Github for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $3000 on an annual subscription
Your host is Tobias Macey and today I’m interviewing Kirk Marple about Unstruk Data, a company that is building a data warehouse for unstructured data that offers automated data preparation via metadata enrichment, integrated compute, and graph-based search
Interview
Introduction
How did you get involved in the area of data management?
Can you describe what Unstruk Data is and the story behind it?
What would you classify as "unstructured data"?
What are some examples of industries that rely on large or varied sets of unstructured data?
What are the challenges for analytics that are posed by the different categories of unstructured data?
What is the current state of the industry for working with unstructured data?
What are the unique capabilities that Unstruk provides and how does it integrate with the rest of the ecosystem?
Where does it sit in the overall landscape of data tools?
Can you describe how the Unstruk data warehouse is implemented?
What are the assumptions that you had at the start of this project that have been challenged as you started working through the technical implementation and customer trials?
How has the design and architecture evolved or changed since you began working on it?
How do you handle versioning of data, given the potential for individual files to be quite large?
What are some of the considerations that users should have in mind when modeling their data in the warehouse?
Can you talk through the workflow of ingesting and analyzing data with Unstruk?
How do you manage data enrichment/integration with structured data sources?
What are the most interesting, innovative, or unexpected ways that you have seen the technology of Unstruk used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on and with the Unstruk platform?
When is Unstruk the wrong choice?
What do you have planned for the future of Unstruk?
Contact Info
LinkedIn
@KirkMarple on Twitter
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
Unstruk Data
TIFF
ROSBag
HDF5
Media/Digital Asset Management
Data Mesh
SAN
NAS
Knowledge Graph
Entity Extraction
OCR (Optical Character Recognition)
Cloud Native
Cosmos DB
Azure Functions
Azure EventHub
Azure Cognitive Search
GraphQL
KNative
Schema.org
Pinecone Vector Database
Podcast Episode
Dublin Core Metadata Initiative
Knowledge Management
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
6/18/2021 • 40 minutes, 47 seconds
Accelerating ML Training And Delivery With In-Database Machine Learning
Summary
When you build a machine learning model, the first step is always to load your data. Typically this means downloading files from object storage, or querying a database. To speed up the process, why not build the model inside the database so that you don’t have to move the information? In this episode Paige Roberts explains the benefits of pushing the machine learning processing into the database layer and the approach that Vertica has taken for their implementation. If you are looking for a way to speed up your experimentation, or an easy way to apply AutoML then this conversation is for you.
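As a rough sketch of the pattern, here is what training and scoring inside the database can look like from Python; the connection uses the vertica-python driver, but the ML function names inside the SQL are hypothetical placeholders rather than Vertica's exact syntax, so treat this as the shape of the workflow rather than copy-paste code.
```python
# Push model training and scoring down to the database so the rows never leave it.
# Connection details are placeholders; the SQL function names are hypothetical.
import vertica_python  # assumed driver; any DB-API compatible connection works the same way

conn_info = {
    "host": "localhost",
    "port": 5433,
    "user": "dbadmin",
    "password": "...",
    "database": "analytics",
}

with vertica_python.connect(**conn_info) as conn:
    cur = conn.cursor()

    # Train where the data lives instead of exporting it to a notebook.
    cur.execute("""
        SELECT train_kmeans_model(        -- hypothetical in-database training function
            'customer_segments',          -- model name stored in the catalog
            'public.customer_features',   -- training relation
            'recency, frequency, spend',  -- feature columns
            4                             -- number of clusters
        )
    """)

    # Score new rows with the stored model as part of an ordinary query.
    cur.execute("""
        SELECT customer_id,
               apply_kmeans_model('customer_segments', recency, frequency, spend) AS segment
        FROM public.customer_features
    """)
    for customer_id, segment in cur.fetchall():
        print(customer_id, segment)
```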
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
RudderStack’s smart customer data pipeline is warehouse-first. It builds your customer data warehouse and your identity graph on your data warehouse, with support for Snowflake, Google BigQuery, Amazon Redshift, and more. Their SDKs and plugins make event streaming easy, and their integrations with cloud applications like Salesforce and ZenDesk help you go beyond event streaming. With RudderStack you can use all of your customer data to answer more difficult questions and then send those insights to your whole customer data stack. Sign up free at dataengineeringpodcast.com/rudder today.
We’ve all been asked to help with an ad-hoc request for data by the sales and marketing team. Then it becomes a critical report that they need updated every week or every day. Then what do you do? Send a CSV via email? Write some Python scripts to automate it? But what about incremental sync, API quotas, error handling, and all of the other details that eat up your time? Today, there is a better way. With Census, just write SQL or plug in your dbt models and start syncing your cloud warehouse to SaaS applications like Salesforce, Marketo, Hubspot, and many more. Go to dataengineeringpodcast.com/census today to get a free 14-day trial.
Your host is Tobias Macey and today I’m interviewing Paige Roberts about machine learning workflows inside the database
Interview
Introduction
How did you get involved in the area of data management?
Can you start by giving an overview of the current state of the market for databases that support in-process machine learning?
What are the motivating factors for running a machine learning workflow inside the database?
What styles of ML are feasible to do inside the database? (e.g. bayesian inference, deep learning, etc.)
What are the performance implications of running a model training pipeline within the database runtime? (both in terms of training performance boosts, and database performance impacts)
Can you describe the architecture of how the machine learning process is managed by the database engine?
How do you manage interacting with Python/R/Jupyter/etc. when working within the database?
What is the impact on data pipeline and MLOps architectures when using the database to manage the machine learning workflow?
What are the most interesting, innovative, or unexpected ways that you have seen in-database ML used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on machine learning inside the database?
When is in-database ML the wrong choice?
What are the recent trends/changes in machine learning for the database that you are excited for?
Contact Info
LinkedIn
Blog
@RobertsPaige on Twitter
@PaigeEwing on Twitter
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Links
Vertica
SyncSort
Hortonworks
Infoworld – 8 databases supporting in-database machine learning
Power BI
Podcast Episode
Grafana
Tableau
K-Means Clustering
MPP == Massively Parallel Processing
AutoML
Random Forest
PMML == Predictive Model Markup Language
SVM == Support Vector Machine
Naive Bayes
XGBoost
Pytorch
Tensorflow
Neural Magic
Tensorflow Frozen Graph
Parquet
ORC
Avro
CNCF == Cloud Native Computing Foundation
Hotel California
VerticaPy
Pandas
Podcast.__init__ Episode
Jupyter Notebook
UDX
Unifying Analytics Presentation
Hadoop
Yarn
Holden Karau
Spark
Vertica Academy
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
6/15/2021 • 1 hour, 5 minutes, 32 seconds
Taking A Tour Of The Google Cloud Platform For Data And Analytics
Summary
Google pioneered an impressive number of the architectural underpinnings of the broader big data ecosystem. Now they offer the technologies that they run internally to external users of their cloud platform. In this episode Lak Lakshmanan enumerates the variety of services that are available for building your various data processing and analytical systems. He shares some of the common patterns for building pipelines to power business intelligence dashboards, machine learning applications, and data warehouses. If you’ve ever been overwhelmed or confused by the array of services available in the Google Cloud Platform then this episode is for you.
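For a flavor of the building blocks the conversation covers, here is a minimal sketch of running an analytical query against BigQuery with the google-cloud-bigquery client; the project, dataset, and table names are placeholders, and credentials are assumed to come from the environment.
```python
# A serverless analytical query against BigQuery; the warehouse does the heavy lifting and
# results stream back as rows. Names below are placeholders, not a real project.
from google.cloud import bigquery

client = bigquery.Client(project="my-analytics-project")

query = """
    SELECT user_id, COUNT(*) AS events
    FROM `my-analytics-project.app_events.events_2021`
    GROUP BY user_id
    ORDER BY events DESC
    LIMIT 10
"""

for row in client.query(query).result():
    print(row.user_id, row.events)
```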
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
Are you bored with writing scripts to move data into SaaS tools like Salesforce, Marketo, or Facebook Ads? Hightouch is the easiest way to sync data into the platforms that your business teams rely on. The data you’re looking for is already in your data warehouse and BI tools. Connect your warehouse to Hightouch, paste a SQL query, and use their visual mapper to specify how data should appear in your SaaS systems. No more scripts, just SQL. Supercharge your business teams with customer data using Hightouch for Reverse ETL today. Get started for free at dataengineeringpodcast.com/hightouch.
Atlan is a collaborative workspace for data-driven teams, like Github for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $3000 on an annual subscription
Your host is Tobias Macey and today I’m interviewing Lak Lakshmanan about the suite of services for data and analytics in Google Cloud Platform.
Interview
Introduction
How did you get involved in the area of data management?
Can you start by giving an overview of the tools and products that are offered as part of Google Cloud for data and analytics?
How do the various systems relate to each other for building a full workflow?
How do you balance the need for clean integration between services with the need to make them useful in isolation when used as a single component of a data platform?
What have you found to be the primary motivators for customers who are adopting GCP for some or all of their data workloads?
What are some of the challenges that new users of GCP encounter when working with the data and analytics products that it offers?
What are the systems that you have found to be easiest to work with?
Which are the most challenging to work with, whether due to the kinds of problems that they are solving for, or due to their user experience design?
How has your work with customers fed back into the products that you are building on top of?
What are some examples of architectural or software patterns that are unique to the GCP product suite?
What are the most interesting, innovative, or unexpected ways that you have seen Google Cloud’s data and analytics services used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working at Google and helping customers succeed in their data and analytics efforts?
What are some of the new capabilities, new services, or industry trends that you are most excited for?
Contact Info
LinkedIn
@lak_gcp on Twitter
Website
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
Google Cloud
Data and Analytics Services
Forrester Wave
Dremel
BigQuery
MapReduce
Cloud Spanner
Spanner Paper
Hadoop
Tensorflow
Google Cloud SQL
Apache Spark
Dataproc
Dataflow
Apache Beam
Databricks
Mixpanel
Avalanche data warehouse
Kubernetes
GKE (Google Kubernetes Engine)
Google Cloud Run
Android
Youtube
Google Translate
Teradata
Power BI
Podcast Episode
AI Platform Notebooks
GitHub Data Repository
Stack Overflow Questions Data Repository
PyPI Download Statistics
Recommendations AI
Pub/Sub
Bigtable
Datastream
Change Data Capture
Podcast Episode About Debezium for CDC
Podcast Episode About CDC with Datacoral
Document AI
Google Meet
Data Governance
Podcast Episodes
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
6/12/2021 • 53 minutes, 16 seconds
Make Sure Your Records Are Reliable With The BookKeeper Distributed Storage Layer
Summary
The way to build maintainable software and systems is through composition of individual pieces. By making those pieces high quality and flexible they can be used in surprising ways that the original creators couldn’t have imagined. One such component that has gone above and beyond its originally envisioned use case is BookKeeper, a distributed storage system that is optimized for durability and speed. In this episode Matteo Merli shares the story behind the creation of BookKeeper, the various ways that it is being used today, and the architectural aspects that make it such a strong building block for projects such as Pulsar. He also shares some of the other interesting systems that have been built on top of it and an amusing war story of running it at scale in its early years.
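To give a feel for the append-only, ledger-style storage model at the heart of this conversation, here is a toy Python sketch of a write-ahead log that durably appends records and replays them on recovery. It is only a conceptual illustration under simplifying assumptions; real BookKeeper ledgers add replication, quorum acknowledgement, and fencing, and this is not BookKeeper's API.

```python
import json
import os

class ToyLog:
    """A toy append-only log: every record is durably written before being acknowledged."""

    def __init__(self, path: str):
        self.path = path

    def append(self, record: dict) -> None:
        # Append one JSON record per line and force it to disk before returning.
        with open(self.path, "a", encoding="utf-8") as f:
            f.write(json.dumps(record) + "\n")
            f.flush()
            os.fsync(f.fileno())

    def replay(self):
        # Recovery is simply re-reading the log from the beginning, in order.
        if not os.path.exists(self.path):
            return
        with open(self.path, encoding="utf-8") as f:
            for line in f:
                yield json.loads(line)

log = ToyLog("toy.log")
log.append({"op": "set", "key": "a", "value": 1})
print(list(log.replay()))
```

The key property the sketch preserves is that a record reaches durable storage before the append call returns, which is what makes replay after a crash trustworthy.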
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
RudderStack’s smart customer data pipeline is warehouse-first. It builds your customer data warehouse and your identity graph on your data warehouse, with support for Snowflake, Google BigQuery, Amazon Redshift, and more. Their SDKs and plugins make event streaming easy, and their integrations with cloud applications like Salesforce and ZenDesk help you go beyond event streaming. With RudderStack you can use all of your customer data to answer more difficult questions and then send those insights to your whole customer data stack. Sign up free at dataengineeringpodcast.com/rudder today.
We’ve all been asked to help with an ad-hoc request for data by the sales and marketing team. Then it becomes a critical report that they need updated every week or every day. Then what do you do? Send a CSV via email? Write some Python scripts to automate it? But what about incremental sync, API quotas, error handling, and all of the other details that eat up your time? Today, there is a better way. With Census, just write SQL or plug in your dbt models and start syncing your cloud warehouse to SaaS applications like Salesforce, Marketo, Hubspot, and many more. Go to dataengineeringpodcast.com/census today to get a free 14-day trial.
Your host is Tobias Macey and today I’m interviewing Matteo Merli about Apache BookKeeper, a scalable, fault-tolerant, and low-latency storage service optimized for real-time workloads
Interview
Introduction
How did you get involved in the area of data management?
Can you describe what BookKeeper is and the story behind it?
What are the most notable features/capabilities of BookKeeper?
What are some of the ways that BookKeeper is being used?
How has your work on Pulsar influenced the features and product direction of BookKeeper?
Can you describe the architecture of a BookKeeper cluster?
How have the design and goals of BookKeeper changed or evolved over time?
What is the impact of record-oriented storage on data distribution/allocation within the cluster when working with variable record sizes?
What are some of the operational considerations that users should be aware of?
What are some of the most interesting/compelling features from your perspective?
What are some of the most often overlooked or misunderstood capabilities of BookKeeper?
What are the most interesting, innovative, or unexpected ways that you have seen BookKeeper used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on BookKeeper?
When is BookKeeper the wrong choice?
What do you have planned for the future of BookKeeper?
Contact Info
LinkedIn
@merlimat on Twitter
merlimat on GitHub
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Links
Apache BookKeeper
Apache Pulsar
Podcast Episode
StreamNative
Podcast Episode
Hadoop NameNode
Apache Zookeeper
Podcast Episode
ActiveMQ
Write Ahead Log (WAL)
BookKeeper Architecture
RocksDB
LSM == Log-Structured Merge-Tree
RAID Controller
Pravega
Podcast Episode
BookKeeper etcd Metadata Storage
LevelDB
Ceph
Podcast Episode
Direct IO
Page Cache
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
6/9/2021 • 42 minutes, 1 second
Build Your Analytics With A Collaborative And Expressive SQL IDE Using Querybook
Summary
SQL is the most widely used language for working with data, and yet the tools available for writing and collaborating on it are still clunky and inefficient. Frustrated with the lack of a modern IDE and collaborative workflow for managing the SQL queries and analysis of their big data environments, the team at Pinterest created Querybook. In this episode Justin Mejorada-Pier and Charlie Gu share the story of how the initial prototype for a data catalog ended up as one of their most widely used interfaces to their analytical data. They also discuss the unique combination of features that it offers, how it is implemented, and the path to releasing it as open source. Querybook is an impressive and unique piece of technology that is well worth exploring, so listen and try it out today.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
Firebolt is the fastest cloud data warehouse. Visit dataengineeringpodcast.com/firebolt to get started. The first 25 visitors will receive a Firebolt t-shirt.
Atlan is a collaborative workspace for data-driven teams, like Github for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $3000 on an annual subscription
Your host is Tobias Macey and today I’m interviewing Justin Mejorada-Pier and Charlie Gu about Querybook, an open source IDE for your big data projects
Interview
Introduction
How did you get involved in the area of data management?
Can you describe what Querybook is and the story behind it?
What are the main use cases or workflows that Querybook is designed for?
What are the shortcomings of dashboarding/BI tools that make something like Querybook necessary?
The tag line calls out the fact that Querybook is an IDE for "big data". What are the manifestations of that focus in the feature set and user experience?
Who are the target users of Querybook and how does that inform the feature priorities and user experience?
Can you describe how Querybook is architected?
How have the goals and design changed or evolved since you first began working on it?
What were some of the assumptions or design choices that you had to unwind in the process of open sourcing it?
What is the workflow for someone building a DataDoc with Querybook?
What is the experience of working as a collaborator on an analysis?
How do you handle lifecycle management of query results?
What are your thoughts on the potential for extending Querybook beyond SQL-oriented analysis and integrating something like Jupyter kernels?
What are the most interesting, innovative, or unexpected ways that you have seen Querybook used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on Querybook?
When is Querybook the wrong choice?
What do you have planned for the future of Querybook?
Contact Info
Justin
LinkedIn
Website
Charlie
czgu on GitHub
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Links
Querybook
Announcing Querybook as Open Source
Pinterest
University of Waterloo
Superset
Podcast Episode
Podcast.__init__ Episode
Sequel Pro
Presto
Trino
Podcast Episode
Flask
uWSGI
Podcast.__init__ Episode
Celery
Redis
SocketIO
Elasticsearch
Podcast Episode
Amundsen
Podcast Episode
Apache Atlas
DataHub
Podcast Episode
Okta
LDAP (Lightweight Directory Access Protocol)
Grand Rounds
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
6/3/2021 • 52 minutes, 35 seconds
Making Data Pipelines Self-Serve For Everyone With Shipyard
Summary
Every part of the business relies on data, yet only a small team has the context and expertise to build and maintain workflows and data pipelines to transform, clean, and integrate it. In order for the true value of your data to be realized without burning out your engineers, you need a way for everyone to get access to the information they care about. To help make that a more tractable problem, Blake Burch co-founded Shipyard. In this episode he explains the utility of a low code solution that lets non engineers create their own self-serve pipelines, how the Shipyard platform is designed to make that possible, and how it allows engineers to create reusable tasks to satisfy the specific needs of the business. This is an interesting conversation about how to make data more accessible and more useful by improving the user experience of the tools that we create.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
RudderStack’s smart customer data pipeline is warehouse-first. It builds your customer data warehouse and your identity graph on your data warehouse, with support for Snowflake, Google BigQuery, Amazon Redshift, and more. Their SDKs and plugins make event streaming easy, and their integrations with cloud applications like Salesforce and ZenDesk help you go beyond event streaming. With RudderStack you can use all of your customer data to answer more difficult questions and then send those insights to your whole customer data stack. Sign up free at dataengineeringpodcast.com/rudder today.
When it comes to serving data for AI and ML projects, do you feel like you have to rebuild the plane while you’re flying it across the ocean? Molecula is an enterprise feature store that operationalizes advanced analytics and AI in a format designed for massive machine-scale projects without having to manage endless one-off information requests. With Molecula, data engineers manage one single feature store that serves the entire organization with millisecond query performance whether in the cloud or at your data center. And since it is implemented as an overlay, Molecula doesn’t disrupt legacy systems. High-growth startups use Molecula’s feature store because of its unprecedented speed, cost savings, and simplified access to all enterprise data. From feature extraction to model training to production, the Molecula feature store provides continuously updated feature access, reuse, and sharing without the need to pre-process data. If you need to deliver unprecedented speed, cost savings, and simplified access to large scale, real-time data, visit dataengineeringpodcast.com/molecula and request a demo. Mention that you’re a Data Engineering Podcast listener, and they’ll send you a free t-shirt.
Your host is Tobias Macey and today I’m interviewing Blake Burch about Shipyard, and his mission to create the easiest way for data teams to launch, monitor, and share resilient pipelines with less engineering
Interview
Introduction
How did you get involved in the area of data management?
Can you describe what you are building at Shipyard and the story behind it?
What are the main goals that you have for Shipyard?
How does it compare to other data orchestration frameworks in the market?
Who are the target users of Shipyard and how does that influence the features and design of the product?
What are your thoughts on the role of data orchestration in the business?
How is the Shipyard platform implemented?
What was your process for identifying the core requirements of the platform?
How have the design and goals of the system evolved since you first began working on it?
Can you describe the workflow of building a data workflow with Shipyard?
How do you manage the dependency chain across tasks in the execution graph? (e.g. task-based, data assets, etc.)
How do you handle testing and data quality management in your workflows?
What is the interface for creating custom task definitions?
How do you address dependencies and sandboxing for custom code?
What is your approach to developing templates?
What are the operational challenges that you have had to address to manage scaling and multi-tenancy in your platform?
What are the most interesting, innovative, or unexpected ways that you have seen Shipyard used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on Shipyard?
When is Shipyard the wrong choice?
What do you have planned for the future of Shipyard?
Contact Info
LinkedIn
@BlakeBurch_ on Twitter
Website
blakeburch on GitHub
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Links
Shipyard
Zapier
Airtable
BigQuery
Snowflake
Podcast Episode
Docker
ECS == Elastic Container Service
Great Expectations
Podcast Episode
Monte Carlo
Podcast Episode
Soda Data
Podcast Episode
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
6/2/2021 • 51 minutes, 22 seconds
Paving The Road For Fast Analytics On Distributed Clouds With The Yellowbrick Data Warehouse
Summary
The data warehouse has become the focal point of the modern data platform. With increased usage of data across businesses, and a diversity of locations and environments where data needs to be managed, the warehouse engine needs to be fast and easy to manage. Yellowbrick is a data warehouse platform that was built from the ground up for speed, and can work across clouds and all the way to the edge. In this episode CTO Mark Cusack explains how the engine is architected, the benefits that speed and predictable pricing have for the organization, and how you can simplify your platform by putting the warehouse close to the data, instead of the other way around.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
Firebolt is the fastest cloud data warehouse. Visit dataengineeringpodcast.com/firebolt to get started. The first 25 visitors will receive a Firebolt t-shirt.
Atlan is a collaborative workspace for data-driven teams, like Github for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $3000 on an annual subscription
Your host is Tobias Macey and today I’m interviewing Mark Cusack about Yellowbrick, a data warehouse designed for distributed clouds
Interview
Introduction
How did you get involved in the area of data management?
Can you start by describing what Yellowbrick is and some of the story behind it?
What does the term "distributed cloud" signify and what challenges are associated with it?
How would you characterize Yellowbrick’s position in the database/DWH market?
How is Yellowbrick architected?
How have the goals and design of the platform changed or evolved over time?
How does Yellowbrick maintain visibility across the different data locations that it is responsible for?
What capabilities does it offer for being able to join across the disparate "clouds"?
What are some data modeling strategies that users should consider when designing their deployment of Yellowbrick?
What are some of the capabilities of Yellowbrick that you find most useful or technically interesting?
For someone who is adopting Yellowbrick, what is the process for getting it integrated into their data systems?
What are the most underutilized, overlooked, or misunderstood features of Yellowbrick?
What are the most interesting, innovative, or unexpected ways that you have seen Yellowbrick used?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on and with Yellowbrick?
When is Yellowbrick the wrong choice?
What do you have planned for the future of the product?
Contact Info
LinkedIn
@markcusack on Twitter
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
Yellowbrick
Teradata
Rainstor
Distributed Cloud
Hybrid Cloud
SwimOS
Podcast Episode
Kafka
Pulsar
Podcast Episode
Snowflake
Podcast Episode
AWS Redshift
MPP == Massively Parallel Processing
Presto
Trino
Podcast Episode
L3 Cache
NVMe
Reactive Programming
Coroutine
Star Schema
Denodo
Lexis Nexis
Vertica
Netezza
Greenplum
PostgreSQL
Podcast Episode
Clickhouse
Podcast Episode
Erasure Coding
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
5/28/2021 • 52 minutes, 40 seconds
Easily Build Advanced Similarity Search With The Pinecone Vector Database
Summary
Machine learning models use vectors as the natural mechanism for representing their internal state. The problem is that in order for the models to integrate with external systems their internal state has to be translated into a lower dimension. To eliminate this impedance mismatch Edo Liberty founded Pinecone to build a database that works natively with vectors. In this episode he explains how this technology will allow teams to accelerate the speed of innovation, how vectors make it possible to build more advanced search functionality, and how Pinecone is architected. This is an interesting conversation about how reconsidering the architecture of your systems can unlock impressive new capabilities.
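To make the idea of vector similarity search concrete, here is a minimal sketch, assuming NumPy is available, that ranks a handful of made-up item embeddings by cosine similarity against a query vector. The item names and vectors are invented for illustration, and this is plain brute-force search rather than Pinecone's API or index structures.

```python
import numpy as np

# Hypothetical item embeddings; in practice these come from an ML model.
items = {
    "doc_a": np.array([0.1, 0.9, 0.3]),
    "doc_b": np.array([0.8, 0.1, 0.5]),
    "doc_c": np.array([0.2, 0.7, 0.4]),
}

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two vectors: 1.0 means identical direction."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def top_k(query: np.ndarray, k: int = 2):
    """Brute-force nearest-neighbor search over every stored vector."""
    scored = [(name, cosine_similarity(query, vec)) for name, vec in items.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]

print(top_k(np.array([0.15, 0.8, 0.35])))
```

A dedicated vector database replaces the brute-force scan with approximate nearest neighbor indexes so that queries stay fast as the collection grows.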
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
RudderStack’s smart customer data pipeline is warehouse-first. It builds your customer data warehouse and your identity graph on your data warehouse, with support for Snowflake, Google BigQuery, Amazon Redshift, and more. Their SDKs and plugins make event streaming easy, and their integrations with cloud applications like Salesforce and ZenDesk help you go beyond event streaming. With RudderStack you can use all of your customer data to answer more difficult questions and then send those insights to your whole customer data stack. Sign up free at dataengineeringpodcast.com/rudder today.
When it comes to serving data for AI and ML projects, do you feel like you have to rebuild the plane while you’re flying it across the ocean? Molecula is an enterprise feature store that operationalizes advanced analytics and AI in a format designed for massive machine-scale projects without having to manage endless one-off information requests. With Molecula, data engineers manage one single feature store that serves the entire organization with millisecond query performance whether in the cloud or at your data center. And since it is implemented as an overlay, Molecula doesn’t disrupt legacy systems. High-growth startups use Molecula’s feature store because of its unprecedented speed, cost savings, and simplified access to all enterprise data. From feature extraction to model training to production, the Molecula feature store provides continuously updated feature access, reuse, and sharing without the need to pre-process data. If you need to deliver unprecedented speed, cost savings, and simplified access to large scale, real-time data, visit dataengineeringpodcast.com/molecula and request a demo. Mention that you’re a Data Engineering Podcast listener, and they’ll send you a free t-shirt.
Your host is Tobias Macey and today I’m interviewing Edo Liberty about Pinecone, a vector database for powering machine learning and similarity search
Interview
Introduction
How did you get involved in the area of data management?
Can you start by describing what Pinecone is and the story behind it?
What are some of the contexts where someone would want to perform a similarity search?
What are the considerations that someone should be aware of when deciding between Pinecone and Solr/Lucene for a search oriented use case?
What are some of the other use cases that Pinecone enables?
In the absence of Pinecone, what kinds of systems and solutions are people building to address those use cases?
Where does Pinecone sit in the lifecycle of data and how does it integrate with the broader data management ecosystem?
What are some of the systems, tools, or frameworks that Pinecone might replace?
How is Pinecone implemented?
How has the architecture evolved since you first began working on it?
What are the most complex or difficult aspects of building Pinecone?
Who is your target user and how does that inform the user experience design and product development priorities?
For someone who wants to start using Pinecone, what is involved in populating it with data and building an analysis or service with it?
What are some of the data modeling considerations when building a set of vectors in Pinecone?
What are some of the most interesting, unexpected, or innovative ways that you have seen Pinecone used?
What are the most interesting, unexpected, or challenging lessons that you have learned while building and growing the Pinecone technology and business?
When is Pinecone the wrong choice?
What do you have planned for the future of Pinecone?
Contact Info
Website
LinkedIn
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Links
Pinecone
Theoretical Physics
High Dimensional Geometry
AWS Sagemaker
Visual Cortex
Temporal Lobe
Inverted Index
Elasticsearch
Podcast Episode
Solr
Lucene
NMSLib
Johnson-Lindenstrauss Lemma
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
5/25/2021 • 46 minutes, 47 seconds
A Holistic Approach To Data Governance Through Self Reflection At Collibra
Summary
Data governance is a phrase that means many different things to many different people. This is because it is actually a concept that encompasses the entire lifecycle of data, across all of the people in an organization who interact with it. Stijn Christiaens co-founded Collibra with the goal of addressing the wide variety of technological aspects that are necessary to realize such an important and expansive process. In this episode he shares his thoughts on the balance between human and technological processes that are necessary for a well-managed data governance strategy, how Collibra is designed to aid in that endeavor, and his experiences using the platform that his company is building to help power its own operations. This is an excellent conversation that spans the engineering and philosophical complexities of an important and ever-present aspect of working with data.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
RudderStack’s smart customer data pipeline is warehouse-first. It builds your customer data warehouse and your identity graph on your data warehouse, with support for Snowflake, Google BigQuery, Amazon Redshift, and more. Their SDKs and plugins make event streaming easy, and their integrations with cloud applications like Salesforce and ZenDesk help you go beyond event streaming. With RudderStack you can use all of your customer data to answer more difficult questions and then send those insights to your whole customer data stack. Sign up free at dataengineeringpodcast.com/rudder today.
Your host is Tobias Macey and today I’m interviewing Stijn Christiaens about data governance in the enterprise and how Collibra applies the lessons learned from their customers to their own business
Interview
Introduction
How did you get involved in the area of data management?
Can you start by describing what you are building at Collibra and the story behind the company?
What does "data governance" mean to you, and how does that definition inform your work at Collibra?
How would you characterize the current landscape of "data governance" offerings and Collibra’s position within it?
What are the elements of governance that are often ignored in small/medium businesses but which are essential for the enterprise? (e.g. data stewards, business glossaries, etc.)
One of the most important tasks as a data professional is to establish and maintain trust in the information you are curating. What are the biggest obstacles to overcome in that mission?
What are some of the data problems that you will only find at large or complex organizations?
How does Collibra help to tame that complexity?
Who are the end users of Collibra within an organization?
Can you talk through the workflow and various interactions that your customers have as it relates to the overall flow of data through an organization?
Can you describe how the Collibra platform is implemented?
How has the scope and design of the system evolved since you first began working on it?
You are currently leading a team that uses Collibra to manage the operations of the business. What are some of the most notable surprises that you have learned from being your own customer?
What are some of the weak points that you have been able to identify and resolve?
How have you been able to use those lessons to help your customers?
What are the activities that are resistant to automation?
How do you design the system to allow for a smooth handoff between mechanistic and humanistic processes?
What are some of the most interesting, innovative, or unexpected ways that you have seen Collibra used?
What are the most interesting, unexpected, or challenging lessons that you have learned while building and growing Collibra, and running the internal data office?
When is Collibra the wrong choice?
What do you have planned for the future of the platform?
Contact Info
LinkedIn
@stichris on Twitter
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
Collibra
Collibra Data Office
Electrical Engineering
Resistor Color Codes
STAR Lab (semantics, technology, and research)
Microsoft Azure
Data Governance
GDPR
Chief Data Officer
Dunbar’s Number
Business Glossary
Data Steward
ERP == Enterprise Resource Planning
CRM == Customer Relationship Management
Data Ownership
Data Mesh
Podcast Episode
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
5/21/2021 • 55 minutes, 52 seconds
Unlocking The Power of Data Lineage In Your Platform with OpenLineage
Summary
Data lineage is the common thread that ties together all of your data pipelines, workflows, and systems. In order to get a holistic understanding of your data quality, where errors are occurring, or how a report was constructed, you need to track the lineage of the data from beginning to end. The complicating factor is that every framework, platform, and product has its own concepts of how to store, represent, and expose that information. In order to eliminate the wasted effort of building custom integrations every time you want to combine lineage information across systems, Julien Le Dem introduced the OpenLineage specification. In this episode he explains his motivations for starting the effort, the far-reaching benefits that it can provide to the industry, and how you can start integrating it into your data platform today. This is an excellent conversation about how competing companies can still find mutual benefit in co-operating on open standards.
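As a rough illustration of what a standardized lineage event can look like, here is a sketch of a run-completion payload in the spirit of the OpenLineage specification, built as a Python dictionary. The job, namespace, and dataset names are hypothetical, and the exact required fields should be checked against the current version of the spec.

```python
import json
import uuid
from datetime import datetime, timezone

# A simplified run-completion event in the spirit of the OpenLineage spec.
# Names below are made up; consult the specification for the authoritative schema.
event = {
    "eventType": "COMPLETE",
    "eventTime": datetime.now(timezone.utc).isoformat(),
    "producer": "https://example.com/my-scheduler",  # hypothetical producer URI
    "run": {"runId": str(uuid.uuid4())},
    "job": {"namespace": "analytics", "name": "daily_orders_rollup"},
    "inputs": [{"namespace": "warehouse", "name": "raw.orders"}],
    "outputs": [{"namespace": "warehouse", "name": "reporting.daily_orders"}],
}

print(json.dumps(event, indent=2))
```

A lineage collector such as Marquez would typically receive events like this over HTTP and stitch the runs, jobs, and datasets into a queryable graph.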
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
RudderStack’s smart customer data pipeline is warehouse-first. It builds your customer data warehouse and your identity graph on your data warehouse, with support for Snowflake, Google BigQuery, Amazon Redshift, and more. Their SDKs and plugins make event streaming easy, and their integrations with cloud applications like Salesforce and ZenDesk help you go beyond event streaming. With RudderStack you can use all of your customer data to answer more difficult questions and then send those insights to your whole customer data stack. Sign up free at dataengineeringpodcast.com/rudder today.
When it comes to serving data for AI and ML projects, do you feel like you have to rebuild the plane while you’re flying it across the ocean? Molecula is an enterprise feature store that operationalizes advanced analytics and AI in a format designed for massive machine-scale projects without having to manage endless one-off information requests. With Molecula, data engineers manage one single feature store that serves the entire organization with millisecond query performance whether in the cloud or at your data center. And since it is implemented as an overlay, Molecula doesn’t disrupt legacy systems. High-growth startups use Molecula’s feature store because of its unprecedented speed, cost savings, and simplified access to all enterprise data. From feature extraction to model training to production, the Molecula feature store provides continuously updated feature access, reuse, and sharing without the need to pre-process data. If you need to deliver unprecedented speed, cost savings, and simplified access to large scale, real-time data, visit dataengineeringpodcast.com/molecula and request a demo. Mention that you’re a Data Engineering Podcast listener, and they’ll send you a free t-shirt.
Your host is Tobias Macey and today I’m interviewing Julien Le Dem about Open Lineage, a new standard for structuring metadata to enable interoperability across the ecosystem of data management tools.
Interview
Introduction
How did you get involved in the area of data management?
Can you start by giving an overview of what the Open Lineage project is and the story behind it?
What is the current state of the ecosystem for generating and sharing metadata between systems?
What are your goals for the OpenLineage effort?
What are the biggest conceptual or consistency challenges that you are facing in defining a metadata model that is broad and flexible enough to be widely used while still being prescriptive enough to be useful?
What is the current state of the project? (e.g. code available, maturity of the specification, etc.)
What are some of the ideas or assumptions that you had at the beginning of this project that have had to be revisited as you iterate on the definition and implementation?
What are some of the projects/organizations/etc. that have committed to supporting or adopting OpenLineage?
What problem domain(s) are best suited to adopting OpenLineage?
What are some of the problems or use cases that you are explicitly not including in scope for OpenLineage?
For someone who already has a lineage and/or metadata catalog, what is involved in evolving that system to work well with OpenLineage?
What are some of the downstream/long-term impacts that you anticipate or hope that this standardization effort will generate?
What are some of the most interesting, unexpected, or challenging lessons that you have learned while working on the OpenLineage effort?
What do you have planned for the future of the project?
Contact Info
LinkedIn
@J_ on Twitter
julienledem on GitHub
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Links
OpenLineage
Marquez
Podcast Episode
Hadoop
Pig
Apache Parquet
Podcast Episode
Doug Cutting
Avro
Apache Arrow
Service Oriented Architecture
Data Lineage
Apache Atlas
DataHub
Podcast Episode
Amundsen
Podcast Episode
Egeria
Pandas
Podcast.__init__ Episode
Apache Spark
EXIF
JSON Schema
OpenTelemetry
Podcast.__init__ Episode
OpenTracing
Superset
Podcast.__init__ Episode
Data Engineering Podcast Episode
Iceberg
Podcast Episode
Great Expectations
Podcast Episode
dbt
Podcast Episode
Data Mesh
Podcast Episode
The map is not the territory
Kafka
Apache Flink
Apache Storm
Kafka Streams
Stone Soup
Apache Beam
Linux Foundation AI & Data
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
5/18/2021 • 57 minutes, 38 seconds
Building Your Data Warehouse On Top Of PostgreSQL
Summary
There is a lot of attention on the database market and cloud data warehouses. While they provide a measure of convenience, they also require you to sacrifice a certain amount of control over your data. If you want to build a warehouse that gives you both control and flexibility then you might consider building on top of the venerable PostgreSQL project. In this episode Thomas Richter and Joshua Drake share their advice on how to build a production ready data warehouse with Postgres.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
Firebolt is the fastest cloud data warehouse. Visit dataengineeringpodcast.com/firebolt to get started. The first 25 visitors will receive a Firebolt t-shirt.
Atlan is a collaborative workspace for data-driven teams, like Github for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $3000 on an annual subscription
Your host is Tobias Macey and today I’m interviewing Thomas Richter and Joshua Drake about using Postgres as your data warehouse
Interview
Introduction
How did you get involved in the area of data management?
Can you start by establishing a working definition of what constitutes a data warehouse for the purpose of this discussion?
What are the limitations for out-of-the-box Postgres when trying to use it for these workloads?
There are a large and growing number of options for data warehouse style workloads. How would you categorize the different systems and what is PostgreSQL’s position in that ecosystem?
What do you see as the motivating factors for a team or organization to select from among those categories?
Why would someone want to use Postgres as their data warehouse platform rather than using a purpose-built engine?
What is the cost/performance equation for Postgres as compared to other data warehouse solutions?
For someone who wants to turn Postgres into a data warehouse engine, what are their options?
What are the relative tradeoffs of the different open source and commercial offerings? (e.g. Citus, cstore_fdw, zedstore, Swarm64, Greenplum, etc.)
One of the biggest areas of growth right now is in the "cloud data warehouse" market where storage and compute are decoupled. What are the options for making that possible with Postgres? (e.g. using foreign data wrappers for interacting with data lake storage (S3, HDFS, Alluxio, etc.))
What areas of work are happening in the Postgres community for upcoming releases to make it more easily suited to data warehouse/analytical workloads?
What are some of the most interesting, innovative, or unexpected ways that you have seen Postgres used in analytical contexts?
What are the most interesting, unexpected, or challenging lessons that you have learned from your own experiences of building analytical systems with Postgres?
When is Postgres the wrong choice for a data warehouse?
What are you most excited for/what are you keeping an eye on in upcoming releases of Postgres and its ecosystem?
Contact Info
Thomas
LinkedIn
JD
LinkedIn
@linuxhiker on Twitter
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
PostgreSQL
Podcast Episode
Swarm64
Podcast Episode
Command Prompt Inc.
IBM
Cognos
OLAP Cube
MariaDB
MySQL
Powell’s Books
DBase
Practical PostgreSQL
Netezza
Presto
Trino
Apache Drill
Parquet
Parquet Foreign Data Wrapper
Snowflake
Podcast Episode
Amazon RDS
Amazon Aurora
Hyperscale
Citus
TimescaleDB
Podcast Episode
Followup Podcast Episode
Greenplum
zedstore
Redshift
Microsoft SQL Server
Postgres Tablespaces
Debezium
Podcast Episode
EDI == Enterprise Data Integration
Change Data Capture
Podcast Episode
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
5/14/2021 • 1 hour, 15 minutes, 6 seconds
Making Analytical APIs Fast With Tinybird
Summary
Building an API for real-time data is a challenging project. Making it robust, scalable, and fast is a full-time job. The team at Tinybird wants to make it easy to turn a continuous stream of data into a production-ready API or data product. In this episode CEO Jorge Sancha explains how they have architected their system to handle high data throughput and fast response times, and why they have invested heavily in Clickhouse as the core of their platform. This is a great conversation about the challenges of building a maintainable business from a technical and product perspective.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
RudderStack’s smart customer data pipeline is warehouse-first. It builds your customer data warehouse and your identity graph on your data warehouse, with support for Snowflake, Google BigQuery, Amazon Redshift, and more. Their SDKs and plugins make event streaming easy, and their integrations with cloud applications like Salesforce and ZenDesk help you go beyond event streaming. With RudderStack you can use all of your customer data to answer more difficult questions and then send those insights to your whole customer data stack. Sign up free at dataengineeringpodcast.com/rudder today.
Ascend.io — recognized as a 2021 Gartner Cool Vendor in Enterprise AI Operationalization and Engineering—empowers data teams to build, scale, and operate declarative data pipelines with 95% less code and zero maintenance. Connect to any data source using Ascend’s new flex code data connectors, rapidly iterate on transformations and send data to any destination in a fraction of the time it traditionally takes—just ask companies like Harry’s, HNI, and Mayvenn. Sound exciting? Come join the team! We’re hiring data engineers, so head on over to dataengineeringpodcast.com/ascend and check out our careers page to learn more.
Your host is Tobias Macey and today I’m interviewing Jorge Sancha about Tinybird, a platform to easily build analytical APIs for real-time data
Interview
Introduction
How did you get involved in the area of data management?
Can you start by describing what you are building at Tinybird and the story behind it?
What are some of the types of use cases that your customers are focused on?
What are the areas of complexity that come up when building analytical APIs that are often overlooked when first designing a system to operate on and expose real-time data?
What are the supporting systems that are necessary and useful for operating this kind of system which contribute to the overall time and engineering cost beyond the baseline functionality?
How is the Tinybird platform architected?
How have the goals and implementation of Tinybird changed or evolved since you first began building it?
What was your criteria for selecting the core building block of your platform, and how did that lead to your choice to build on top of Clickhouse?
What are some of the sharp edges that you have run into while operating Clickhouse?
What are some of the custom tools or systems that you have built to help deal with them?
What are some of the performance challenges that an API built with Tinybird might run into?
What are the considerations that users should be aware of to avoid introducing performance issues?
How do you handle multi-tenancy in your platform? (e.g. separate clusters, in-database quotas, etc.)
For users of Tinybird, can you talk through the workflow of getting it integrated into their platform and designing an API from their data?
What are some of the most interesting, innovative, or unexpected ways that you have seen Tinybird used?
What are the most interesting, unexpected, or challenging lessons that you have learned while building and growing Tinybird?
When is Tinybird the wrong choice?
What do you have planned for the future of the product and business?
Contact Info
@jorgesancha on Twitter
LinkedIn
jorgesancha on GitHub
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
Tinybird
Carto
PostgreSQL
Podcast Episode
PostGIS
Clickhouse
Podcast Episode
Kafka
Tornado
Podcast.__init__ Episode
Redis
Formula 1
Web Application Firewall
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
5/11/2021 • 54 minutes, 23 seconds
Making Spark Cloud Native At Data Mechanics
Summary
Spark is one of the most well-known frameworks for data processing, whether for batch or streaming, ETL or ML, and at any scale. Because of its popularity it has been deployed on every kind of platform you can think of. In this episode Jean-Yves Stephan shares the work that he is doing at Data Mechanics to make it sing on Kubernetes. He explains how operating in a cloud-native context simplifies some aspects of running the system while complicating others, how it simplifies the development and experimentation cycle, and how you can get a head start using their pre-built Spark container. This is a great conversation for understanding how new ways of operating systems can have broader impacts on how they are being used.
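For readers who have not tried Spark's native Kubernetes support, here is a minimal sketch of configuring a PySpark session so that executors are scheduled as pods on a Kubernetes cluster. The API server URL, namespace, and container image are placeholders, and this shows generic Spark-on-Kubernetes configuration rather than the Data Mechanics platform itself.

```python
from pyspark.sql import SparkSession

# Placeholder values: swap in your cluster's API server and a Spark container image.
K8S_MASTER = "k8s://https://kubernetes.example.com:6443"
SPARK_IMAGE = "example-registry/spark-py:3.1.1"

spark = (
    SparkSession.builder
    .appName("spark-on-k8s-sketch")
    .master(K8S_MASTER)
    .config("spark.kubernetes.container.image", SPARK_IMAGE)
    .config("spark.kubernetes.namespace", "data-jobs")
    .config("spark.executor.instances", "4")
    .getOrCreate()
)

# Trivial job to confirm executors come up and work gets distributed.
print(spark.range(1_000_000).selectExpr("sum(id)").collect())
spark.stop()
```

Building the session this way runs the driver in client mode; packaging the driver itself into a pod is typically done by submitting the job with spark-submit in cluster mode instead.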
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
Firebolt is the fastest cloud data warehouse. Visit dataengineeringpodcast.com/firebolt to get started. The first 25 visitors will receive a Firebolt t-shirt.
Atlan is a collaborative workspace for data-driven teams, like Github for engineering or Figma for design teams. By acting as a virtual hub for data assets ranging from tables and dashboards to SQL snippets & code, Atlan enables teams to create a single source of truth for all their data assets, and collaborate across the modern data stack through deep integrations with tools like Snowflake, Slack, Looker and more. Go to dataengineeringpodcast.com/atlan today and sign up for a free trial. If you’re a data engineering podcast listener, you get credits worth $3000 on an annual subscription
Your host is Tobias Macey and today I’m interviewing Jean-Yves Stephan about Data Mechanics, a cloud-native Spark platform for data engineers
Interview
Introduction
How did you get involved in the area of data management?
Can you start by giving an overview of what you are building at Data Mechanics and the story behind it?
What are the operational characteristics of Spark that make it difficult to run in a cloud-optimized environment?
How do you handle retries, state redistribution, etc. when instances get preempted in the middle of a job execution?
What are some of the tactics that you have found useful when designing jobs to make them more resilient to interruptions?
What are the customizations that you have had to make to Spark itself?
What are some of the supporting tools that you have built to allow for running Spark in a Kubernetes environment? (a minimal PySpark-on-Kubernetes sketch follows this question list)
How is the Data Mechanics platform implemented?
How have the goals and design of the platform changed or evolved since you first began working on it?
How does running Spark in a container/Kubernetes environment change the ways that you and your customers think about how and where to use it?
How does it impact the development workflow for data engineers and data scientists?
What are some of the most interesting, unexpected, or challenging lessons that you have learned while building the Data Mechanics product?
When is Spark/Data Mechanics the wrong choice?
What do you have planned for the future of the platform?
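For readers who want to experiment with the ideas above, the following is a minimal sketch of pointing a PySpark session at a Kubernetes cluster. The API server URL, container image, namespace, and service account are hypothetical placeholders, and this is not Data Mechanics' actual configuration; it only illustrates the kind of settings involved.

```python
from pyspark.sql import SparkSession

# Hypothetical values: the API server URL, image name, namespace, and service
# account below are placeholders, not Data Mechanics' actual configuration.
spark = (
    SparkSession.builder
    .appName("cloud-native-spark-sketch")
    # Point the driver at a Kubernetes API server instead of YARN or standalone.
    .master("k8s://https://my-cluster.example.com:443")
    # Containerized executors: the image bundles Spark, Python, and dependencies.
    .config("spark.kubernetes.container.image", "example.registry.io/spark-py:3.1.1")
    .config("spark.kubernetes.namespace", "data-jobs")
    .config("spark.kubernetes.authenticate.driver.serviceAccountName", "spark")
    # Dynamic allocation with shuffle tracking helps recover when spot nodes vanish.
    .config("spark.dynamicAllocation.enabled", "true")
    .config("spark.dynamicAllocation.shuffleTracking.enabled", "true")
    .config("spark.executor.instances", "4")
    .getOrCreate()
)

# A trivial job to confirm that executors come up as pods.
print(spark.range(1_000_000).selectExpr("sum(id)").collect())
spark.stop()
```

In practice such jobs are more commonly packaged and launched with spark-submit in cluster mode or via the Spark Operator for Kubernetes (linked below), but the configuration surface is the same.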
Contact Info
LinkedIn
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
Data Mechanics
Databricks
Stanford
Andrew Ng
Mining Massive Datasets
Spark
Kubernetes
Spot Instances
Infiniband
Data Mechanics Spark Container Image
Delight – Spark monitoring utility
Terraform
Blue/Green Deployment
Spark Operator for Kubernetes
JupyterHub
Jupyter Enterprise Gateway
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
5/7/2021 • 40 minutes, 15 seconds
The Grand Vision And Present Reality of DataOps
Summary
The data industry is changing rapidly, and one of the most active areas of growth is automation of data workflows. Taking cues from the DevOps movement of the past decade, data professionals are orienting around the concept of DataOps. More than just a collection of tools, there are a number of organizational and conceptual changes that a proper DataOps approach depends on. In this episode Kevin Stumpf, CTO of Tecton, Maxime Beauchemin, CEO of Preset, and Lior Gavish, CTO of Monte Carlo, discuss the grand vision and present realities of DataOps. They explain how to think about your data systems in a holistic and maintainable fashion, the security challenges that threaten to derail your efforts, and the power of using metadata as the foundation of everything that you do. If you are wondering how to get control of your data platforms and bring all of your stakeholders onto the same page then this conversation is for you.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
Modern Data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days. Datafold helps Data teams gain visibility and confidence in the quality of their analytical data through data profiling, column-level lineage and intelligent anomaly detection. Datafold also helps automate regression testing of ETL code with its Data Diff feature that instantly shows how a change in ETL or BI code affects the produced data, both on a statistical level and down to individual rows and values. Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI workflows. Go to dataengineeringpodcast.com/datafold today to start a 30-day trial of Datafold. Once you sign up and create an alert in Datafold for your company data, they will send you a cool water flask.
RudderStack’s smart customer data pipeline is warehouse-first. It builds your customer data warehouse and your identity graph on your data warehouse, with support for Snowflake, Google BigQuery, Amazon Redshift, and more. Their SDKs and plugins make event streaming easy, and their integrations with cloud applications like Salesforce and ZenDesk help you go beyond event streaming. With RudderStack you can use all of your customer data to answer more difficult questions and then send those insights to your whole customer data stack. Sign up free at dataengineeringpodcast.com/rudder today.
Your host is Tobias Macey and today I’m interviewing Max Beauchemin, Lior Gavish, and Kevin Stumpf about the real world challenges of embracing DataOps practices and systems, and how to keep things secure as you scale
Interview
Introduction
How did you get involved in the area of data management?
Before we get started, can you each give your definition of what "DataOps" means to you?
How does this differ from "business as usual" in the data industry?
What are some of the things that DataOps isn’t (despite what marketers might say)?
What are the biggest difficulties that you have faced in going from concept to production with a workflow or system intended to power self-serve access to other members of the organization?
What are the weak points in the current state of the industry, whether technological or social, that contribute to your greatest sense of unease from a security perspective?
As founders of companies that aim to facilitate adoption of various aspects of DataOps, how are you applying the products that you are building to your own internal systems?
How does security factor into the design of robust DataOps systems? What are some of the biggest challenges related to security when it comes to putting these systems into production?
What are the biggest differences between DevOps and DataOps, particularly when it concerns designing distributed systems?
What areas of the DataOps landscape do you think are ripe for innovation?
Nowadays, it seems like new DataOps companies are cropping up every day to try and solve some of these problems. Why do you think DataOps is becoming such an important component of the modern data stack?
There’s been a lot of conversation recently around the "rise of the data engineer" versus other roles in the data ecosystem (i.e. data scientist or data analyst). Why do you think that is?
What are some of the most valuable lessons that you have learned from working with your customers about how to apply DataOps principles?
What are some of the most interesting, unexpected, or challenging lessons that you have learned while building your respective platforms and businesses?
What are the industry trends that you are each keeping an eye on to inform your future product direction?
Contact Info
Kevin
LinkedIn
kevinstumpf on GitHub
@kevinstumpf on Twitter
Maxime
LinkedIn
@mistercrunch on Twitter
mistercrunch on GitHub
Lior
LinkedIn
@lgavish on Twitter
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Links
Tecton
Monte Carlo
Superset
Preset
Barracuda Networks
Feature Store
DataOps
DevOps
Data Catalog
Amundsen
OpenLineage
The Downfall of the Data Engineer
Hashicorp Vault
Reverse ELT
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
5/4/2021 • 57 minutes, 8 seconds
Self Service Data Exploration And Dashboarding With Superset
Summary
The reason for collecting, cleaning, and organizing data is to make it usable by the organization. One of the most common and widely used methods of access is through a business intelligence dashboard. Superset is an open source option that has been gaining popularity due to its flexibility and extensible feature set. In this episode Maxime Beauchemin discusses how data engineers can use Superset to provide self service access to data and deliver analytics. He digs into how it integrates with your data stack, how you can extend it to fit your use case, and why open source systems are a good choice for your business intelligence. If you haven’t already tried out Superset then this conversation is well worth your time. Give it a listen and then take it for a test drive today.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
Modern Data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days. Datafold helps Data teams gain visibility and confidence in the quality of their analytical data through data profiling, column-level lineage and intelligent anomaly detection. Datafold also helps automate regression testing of ETL code with its Data Diff feature that instantly shows how a change in ETL or BI code affects the produced data, both on a statistical level and down to individual rows and values. Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI workflows. Go to dataengineeringpodcast.com/datafold today to start a 30-day trial of Datafold. Once you sign up and create an alert in Datafold for your company data, they will send you a cool water flask.
RudderStack’s smart customer data pipeline is warehouse-first. It builds your customer data warehouse and your identity graph on your data warehouse, with support for Snowflake, Google BigQuery, Amazon Redshift, and more. Their SDKs and plugins make event streaming easy, and their integrations with cloud applications like Salesforce and ZenDesk help you go beyond event streaming. With RudderStack you can use all of your customer data to answer more difficult questions and then send those insights to your whole customer data stack. Sign up free at dataengineeringpodcast.com/rudder today.
Your host is Tobias Macey and today I’m interviewing Max Beauchemin about Superset, an open source platform for data exploration, dashboards, and business intelligence
Interview
Introduction
How did you get involved in the area of data management?
Can you start by describing what Superset is?
Superset is becoming part of the reference architecture for a modern data stack. What are the factors that have contributed to its popularity over other tools such as Redash, Metabase, Looker, etc.?
Where do dashboarding and exploration tools like Superset fit in the responsibilities and workflow of a data engineer?
What are some of the challenges that Superset faces in being performant when working with large data sources?
Which data sources have you found to be the most challenging to work with?
What are some anti-patterns that users of Superset might run into when building out a dashboard?
What are some of the ways that users can surface data quality indicators (e.g. freshness, lineage, check results, etc.) in a Superset dashboard?
Another trend in analytics and dashboard tools is providing actionable insights. How can Superset support those use cases where a business user or analyst wants to perform an action based on the data that they are being shown?
How can Superset factor into a data governance strategy for the business?
What are some of the most interesting, innovative, or unexpected ways that you have seen Superset used?
dogfooding
What are the most interesting, unexpected, or challenging lessons that you have learned from working on Superset and founding Preset?
When is Superset the wrong choice?
What do you have planned for the future of Superset and Preset?
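Superset connects to its data sources through SQLAlchemy URIs on top of the Python DBAPI (both linked below). As a small illustration, the sketch below checks such a URI from plain Python before registering it in Superset; the URI itself is a placeholder, not a recommendation.

```python
from sqlalchemy import create_engine, text

# Superset registers data sources by SQLAlchemy URI; the same URI can be
# sanity-checked from plain Python before adding it in the UI.
# The URI below is a hypothetical placeholder.
uri = "postgresql+psycopg2://analytics_ro:secret@warehouse.example.com:5432/analytics"

engine = create_engine(uri, pool_pre_ping=True)
with engine.connect() as conn:
    # A lightweight query that any SQLAlchemy-supported backend should answer.
    print(conn.execute(text("SELECT 1")).scalar())
```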
Contact Info
LinkedIn
@mistercrunch on Twitter
mistercrunch on GitHub
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Links
Superset
Podcast.__init__ Episode
Preset
ASP (Active Server Pages)
VBScript
Data Warehouse Institute
Ralph Kimball
Bill Inmon
Ubisoft
Hadoop
Tableau
Looker
Podcast Episode
The Future of Business Intelligence Is Open Source
Supercharging Apache Superset
Redash
Podcast.__init__ Episode
Metabase
Podcast Episode
The Rise Of The Data Engineer
AirBnB Data University
Python DBAPI
SQLAlchemy
Druid
SQL Common Table Expressions
SQL Window Functions
Data Warehouse Semantic Layer
Amundsen
Podcast Episode
Open Lineage
Datakin
Marquez
Podcast Episode
Apache Arrow
Podcast.__init__ Episode with Wes McKinney
Apache Parquet
DataHub
Podcast Episode
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
4/27/2021 • 47 minutes, 24 seconds
Moving Machine Learning Into The Data Pipeline at Cherre
Summary
Most of the time when you think about a data pipeline or ETL job what comes to mind is a purely mechanistic progression of functions that move data from point A to point B. Sometimes, however, one of those transformations is actually a full-fledged machine learning project in its own right. In this episode Tal Galfsky explains how he and the team at Cherre tackled the problem of messy data for Addresses by building a natural language processing and entity resolution system that is served as an API to the rest of their pipelines. He discusses the myriad ways that addresses are incomplete, poorly formed, and just plain wrong, why it was a big enough pain point to invest in building an industrial strength solution for it, and how it actually works under the hood. After listening to this you’ll look at your data pipelines in a new light and start to wonder how you can bring more advanced strategies into the cleaning and transformation process.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
Modern Data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days. Datafold helps Data teams gain visibility and confidence in the quality of their analytical data through data profiling, column-level lineage and intelligent anomaly detection. Datafold also helps automate regression testing of ETL code with its Data Diff feature that instantly shows how a change in ETL or BI code affects the produced data, both on a statistical level and down to individual rows and values. Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI workflows. Go to dataengineeringpodcast.com/datafold today to start a 30-day trial of Datafold. Once you sign up and create an alert in Datafold for your company data, they will send you a cool water flask.
RudderStack’s smart customer data pipeline is warehouse-first. It builds your customer data warehouse and your identity graph on your data warehouse, with support for Snowflake, Google BigQuery, Amazon Redshift, and more. Their SDKs and plugins make event streaming easy, and their integrations with cloud applications like Salesforce and ZenDesk help you go beyond event streaming. With RudderStack you can use all of your customer data to answer more difficult questions and then send those insights to your whole customer data stack. Sign up free at dataengineeringpodcast.com/rudder today.
Your host is Tobias Macey and today I’m interviewing Tal Galfsky about how Cherre is bringing order to the messy problem of physical addresses and entity resolution in their data pipelines.
Interview
Introduction
How did you get involved in the area of data management?
Started as a physicist and evolved into data science
Can you start by giving a brief recap of what Cherre is and the types of data that you deal with?
Cherre is a company that connects data
We’re not a data vendor, in that we don’t primarily sell data
We help companies connect and make sense of their data
The real estate market is historically closed, gut-led, and behind on tech
What are the biggest challenges that you deal with in your role when working with real estate data?
Lack of a standard domain model in real estate.
Ontology: what is a property? Each data source thinks about properties in a very different way, yielding similar but completely different data.
Quality: even if the datasets are talking about the same thing, there are different levels of accuracy and freshness.
Hierarchy: when is one source better than another?
What are the teams and systems that rely on address information?
Any company that needs to clean or organize (make sense of) their data needs to identify people, companies, and properties.
Our clients use address resolution in multiple ways, via the UI or via an API. Our service is both external and internal, so what I build has to be good enough for the demanding needs of our data science team, robust enough for our engineers, and simple enough that non-expert clients can use it.
Can you give an example of the problems involved in entity resolution?
Known entity example.
Empire State Building.
To resolve addresses in a way that makes sense for the client you need to capture the real world entities. Lots, buildings, units.
Identify the type of the object (lot, building, unit)
Tag the object with all the relevant addresses
Relations to other objects (lot, building, unit)
What are some examples of the kinds of edge cases or messiness that you encounter in addresses?
First class: string problems.
Second class: component problems.
Third class: geocoding.
I understand that you have developed a service for normalizing addresses and performing entity resolution to provide canonical references for downstream analyses. Can you give an overview of what is involved?
What is the need for the service? The main requirement here is connecting an address to a lot, building, and unit, with latitude and longitude coordinates.
How were you satisfying this requirement previously?
Before we built our model and dedicated service, we had a basic pipeline-only prototype that handled NYC addresses.
What were the motivations for designing and implementing this as a service?
Need to expand nationwide and to deal with client queries in real time.
What are some of the other data sources that you rely on to be able to perform this normalization and resolution?
Lot data, building data, unit data, footprint and address point datasets.
What challenges do you face in managing these other sources of information?
Accuracy, hierarchy, standardization, unified solution, persistent IDs and primary keys
Digging into the specifics of your solution, can you talk through the full lifecycle of a request to resolve an address and the various manipulations that are performed on it?
String cleaning, parsing and tokenization, standardization, matching (a minimal sketch of this flow appears after these interview notes)
What are some of the other pieces of information in your system that you would like to see addressed in a similar fashion?
Our named entity solution with connection to knowledge graph and owner unmasking.
What are some of the most interesting, unexpected, or challenging lessons that you learned while building this address resolution system?
Scaling the NYC geocoding example: the NYC model was exploding a subset of the ways an address can be malformed. Flexibility. Dependencies. Client exposure.
Now that you have this system running in production, if you were to start over today what would you do differently?
A lot, but at this point the module boundaries and client interface are defined in such a way that we are able to make changes or completely replace any given part of it without breaking anything client-facing.
What are some of the other projects that you are excited to work on going forward?
Named entity resolution and Knowledge Graph
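As a rough illustration of the clean, parse/tokenize, standardize, and match flow described in the notes above, here is a minimal sketch. The abbreviation table, reference records, and exact-match lookup are illustrative assumptions, not Cherre's implementation, which relies on NLP models and much richer reference data.

```python
import re
from typing import Optional

# Illustrative abbreviation table; a real system would use a much larger,
# locale-aware dictionary (this is not Cherre's actual rule set).
ABBREVIATIONS = {"st": "street", "ave": "avenue", "blvd": "boulevard", "fl": "floor"}

# A tiny canonical reference standing in for lot/building/unit records.
KNOWN_ADDRESSES = {
    "350 5th avenue": {"type": "building", "lat": 40.7484, "lon": -73.9857},
}

def clean(raw: str) -> str:
    # String cleaning: lowercase, strip punctuation, collapse whitespace.
    return re.sub(r"\s+", " ", re.sub(r"[^\w\s]", " ", raw.lower())).strip()

def standardize(tokens):
    # Standardization: expand common abbreviations token by token.
    return [ABBREVIATIONS.get(t, t) for t in tokens]

def resolve(raw: str) -> Optional[dict]:
    tokens = clean(raw).split()            # parse and tokenize
    candidate = " ".join(standardize(tokens))
    # Matching: exact lookup here; real systems add fuzzy and geospatial matching.
    return KNOWN_ADDRESSES.get(candidate)

print(resolve("350 5th Ave."))  # -> the Empire State Building record
```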
Contact Info
LinkedIn
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
BigQuery is a huge asset, in particular UDFs, but they don’t support API calls or Python scripts
Closing Announcements
Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Links
Cherre
Podcast Episode
Photonics
Knowledge Graph
Entity Resolution
BigQuery
NLP == Natural Language Processing
dbt
Podcast Episode
Airflow
Podcast.__init__ Episode
Datadog
Podcast Episode
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
4/20/2021 • 48 minutes, 4 seconds
Exploring The Expanding Landscape Of Data Professions with Josh Benamram of Databand
Summary
"Business as usual" is changing, with more companies investing in data as a first class concern. As a result, the data team is growing and introducing more specialized roles. In this episode Josh Benamram, CEO and co-founder of Databand, describes the motivations for these emerging roles, how these positions affect the team dynamics, and the types of visibility that they need into the data platform to do their jobs effectively. He also talks about how his experience working with these teams informs his work at Databand. If you are wondering how to apply your talents and interests to working with data then this episode is a must listen.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
Modern Data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days. Datafold helps Data teams gain visibility and confidence in the quality of their analytical data through data profiling, column-level lineage and intelligent anomaly detection. Datafold also helps automate regression testing of ETL code with its Data Diff feature that instantly shows how a change in ETL or BI code affects the produced data, both on a statistical level and down to individual rows and values. Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI workflows. Go to dataengineeringpodcast.com/datafold today to start a 30-day trial of Datafold. Once you sign up and create an alert in Datafold for your company data, they will send you a cool water flask.
RudderStack’s smart customer data pipeline is warehouse-first. It builds your customer data warehouse and your identity graph on your data warehouse, with support for Snowflake, Google BigQuery, Amazon Redshift, and more. Their SDKs and plugins make event streaming easy, and their integrations with cloud applications like Salesforce and ZenDesk help you go beyond event streaming. With RudderStack you can use all of your customer data to answer more difficult questions and then send those insights to your whole customer data stack. Sign up free at dataengineeringpodcast.com/rudder today.
Your host is Tobias Macey and today I’m interviewing Josh Benamram about the continued evolution of roles and responsibilities in data teams and their varied requirements for visibility into the data stack
Interview
Introduction
How did you get involved in the area of data management?
Can you start by discussing the set of roles that you see in a majority of data teams?
What new roles do you see emerging, and what are the motivating factors?
Which of the more established positions are fracturing or merging to create these new responsibilities?
What are the contexts in which you are seeing these role definitions used? (e.g. small teams, large orgs, etc.)
How do the increased granularity/specialization of responsibilities across data teams change the ways that data and platform architects need to think about technology investment?
What are the organizational impacts of these new types of data work?
How do these shifts in role definition change the ways that the individuals in the position interact with the data platform?
What are the types of questions that practitioners in different roles are asking of the data that they are working with? (e.g. what is the lineage of this asset vs. what is the distribution of values in this column, etc.)
How can metrics and observability data about pipelines and data systems help to support these various roles?
What are the different ways of measuring data quality for the needs of these roles?
How is the work you are doing at Databand informed by these changing needs?
One of the big challenges caused by data systems is the varying modes of access and interaction across the different stakeholders and activities. How can data platform teams and vendors help to surface useful metrics and information across these various interfaces without forcing users into a new or unfamiliar workflow?
What are some of the long-term impacts that you foresee in the data ecosystem and ways of interacting with data as a result of the current trend toward more specialized tasks?
As a vendor working to provide useful context to these practitioners what are some of the most interesting, unexpected, or challenging lessons that you have learned?
What do you have planned for the future of Databand?
Contact Info
Email
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Links
Databand
Website
Platform
Open Core
More data engineering stories & best practices
Atlassian
Chartio
Data Mesh Article
Podcast Episode
Grafana
Metabase
Superset
Podcast.__init__ Episode
Snowflake
Podcast Episode
Spark
Airflow
Podcast.__init__ Episode
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
4/13/2021 • 1 hour, 8 minutes, 36 seconds
Put Your Whole Data Team On The Same Page With Atlan
Summary
One of the biggest obstacles to success in delivering data products is cross-team collaboration. Part of the problem is the difference in the information that each role requires to do their job and where they expect to find it. This introduces a barrier to communication that is difficult to overcome, particularly in teams that have not reached a significant level of maturity in their data journey. In this episode Prukalpa Sankar shares her experiences across multiple attempts at building a system that brings everyone onto the same page, ultimately bringing her to found Atlan. She explains how the design of the platform is informed by the needs of managing data projects for large and small teams across her previous roles, how it integrates with your existing systems, and how it can work to bring everyone onto the same page.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
Modern Data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days. Datafold helps Data teams gain visibility and confidence in the quality of their analytical data through data profiling, column-level lineage and intelligent anomaly detection. Datafold also helps automate regression testing of ETL code with its Data Diff feature that instantly shows how a change in ETL or BI code affects the produced data, both on a statistical level and down to individual rows and values. Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI workflows. Go to dataengineeringpodcast.com/datafold today to start a 30-day trial of Datafold. Once you sign up and create an alert in Datafold for your company data, they will send you a cool water flask.
RudderStack’s smart customer data pipeline is warehouse-first. It builds your customer data warehouse and your identity graph on your data warehouse, with support for Snowflake, Google BigQuery, Amazon Redshift, and more. Their SDKs and plugins make event streaming easy, and their integrations with cloud applications like Salesforce and ZenDesk help you go beyond event streaming. With RudderStack you can use all of your customer data to answer more difficult questions and then send those insights to your whole customer data stack. Sign up free at dataengineeringpodcast.com/rudder today.
Your host is Tobias Macey and today I’m interviewing Prukalpa Sankar about Atlan, a modern data workspace that makes collaboration among data stakeholders easier, increasing efficiency and agility in data projects
Interview
Introduction
How did you get involved in the area of data management?
Can you start by giving an overview of what you are building at Atlan and some of the story behind it?
Who are the target users of Atlan?
What portions of the data workflow is Atlan responsible for?
What components of the data stack might Atlan replace?
How would you characterize Atlan’s position in the current data ecosystem?
What makes Atlan stand out from other systems for data cataloguing, metadata management, or data governance?
What types of data assets (e.g. structured vs unstructured, textual vs binary, etc.) is Atlan designed to understand?
Can you talk through how Atlan is implemented?
How have the goals and design of the platform changed or evolved since you first began working on it?
What are some of the early assumptions that you have had to revisit or reconsider?
What is involved in getting Atlan deployed and integrated into an existing data platform?
Beyond the technical aspects, what are the business processes that teams need to implement to be successful when incorporating Atlan into their systems?
Once Atlan is set up, what is a typical workflow for an individual and their team to collaborate on a set of data assets, or building out a new processing pipeline?
What are some useful steps for introducing all of the stakeholders to the system and workflow?
What are the available extension points for managing data in systems that aren’t supported by Atlan out of the box?
What are some of the most interesting, innovative, or unexpected ways that you have seen Atlan used?
What are the most interesting, unexpected, or challenging lessons that you have learned while building Atlan?
When is Atlan the wrong choice?
What do you have planned for the future of the product?
Contact Info
LinkedIn
@prukalpa on Twitter
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Links
Atlan
India’s National Data Platform
World Economic Forum
UN
Gates Foundation
GitHub
Figma
Snowflake
Redshift
Databricks
DBT
Sisense
Looker
Apache Atlas
Immuta
DataHub
Datakin
Apache Ranger
Great Expectations
Trino
Airflow
Dagster
Privacera
Databand
Cloudformation
Grafana
Deequ
We Failed to Set Up a Data Catalog 3x. Here’s Why.
Analyzing the Analyzers (book)
OpenAPI
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
4/6/2021 • 57 minutes, 36 seconds
Data Quality Management For The Whole Team With Soda Data
Summary
Data quality is on the top of everyone’s mind recently, but getting it right is as challenging as ever. One of the contributing factors is the number of people who are involved in the process and the potential impact on the business if something goes wrong. In this episode Maarten Masschelein and Tom Baeyens share the work they are doing at Soda to bring everyone on board to make your data clean and reliable. They explain how they started down the path of building a solution for managing data quality, their philosophy of how to empower data engineers with well engineered open source tools that integrate with the rest of the platform, and how to bring all of the stakeholders onto the same page to make your data great. There are many aspects of data quality management and it’s always a treat to learn from people who are dedicating their time and energy to solving it for everyone.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
Modern Data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days. Datafold helps Data teams gain visibility and confidence in the quality of their analytical data through data profiling, column-level lineage and intelligent anomaly detection. Datafold also helps automate regression testing of ETL code with its Data Diff feature that instantly shows how a change in ETL or BI code affects the produced data, both on a statistical level and down to individual rows and values. Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI workflows. Go to dataengineeringpodcast.com/datafold today to start a 30-day trial of Datafold. Once you sign up and create an alert in Datafold for your company data, they will send you a cool water flask.
RudderStack’s smart customer data pipeline is warehouse-first. It builds your customer data warehouse and your identity graph on your data warehouse, with support for Snowflake, Google BigQuery, Amazon Redshift, and more. Their SDKs and plugins make event streaming easy, and their integrations with cloud applications like Salesforce and ZenDesk help you go beyond event streaming. With RudderStack you can use all of your customer data to answer more difficult questions and then send those insights to your whole customer data stack. Sign up free at dataengineeringpodcast.com/rudder today.
Your host is Tobias Macey and today I’m interviewing Maarten Masschelein and Tom Baeyens about the work they are doing at Soda to power data quality management
Interview
Introduction
How did you get involved in the area of data management?
Can you start by giving an overview of what you are building at Soda?
What problem are you trying to solve?
And how are you solving that problem?
What motivated you to start a business focused on data monitoring and data quality?
The data monitoring and broader data quality space is a segment of the industry that is seeing a huge increase in attention recently. Can you share your perspective on the current state of the ecosystem and how your approach compares to other tools and products?
Who have you created Soda for (e.g. platform engineers, data engineers, data product owners, etc.) and what is a typical workflow for each of them?
How do you go about integrating Soda into your data infrastructure?
How has the Soda platform been architected?
Why is this architecture important?
How have the goals and design of the system changed or evolved as you worked with early customers and iterated toward your current state?
What are some of the challenges associated with the ongoing monitoring and testing of data?
What are some of the tools or techniques for data testing used in conjunction with Soda? (a generic sketch of such checks follows this question list)
What are some of the most interesting, innovative, or unexpected ways that you have seen Soda being used?
What are the most interesting, unexpected, or challenging lessons that you have learned while building the technology and business for Soda?
When is Soda the wrong choice?
What do you have planned for the future?
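To make the kinds of checks discussed above concrete, here is a generic sketch of volume, validity, and freshness tests written as plain DB-API Python. It is not Soda SQL's scan-file format or API; the database file, table, columns, and thresholds are hypothetical.

```python
import sqlite3
from datetime import datetime, timedelta

# Generic data-quality checks of the kind discussed in the episode.
# Plain DB-API Python, not Soda SQL's scan-file format. Assumes a hypothetical
# local database with an "orders" table that has at least one recent row.
conn = sqlite3.connect("warehouse.db")

def scalar(query: str):
    return conn.execute(query).fetchone()[0]

checks = {
    # Volume: the table should not be empty.
    "row_count > 0": scalar("SELECT COUNT(*) FROM orders") > 0,
    # Validity: no more than 1% of rows may be missing a customer id.
    "null_rate(customer_id) <= 0.01": scalar(
        "SELECT AVG(CASE WHEN customer_id IS NULL THEN 1.0 ELSE 0 END) FROM orders"
    ) <= 0.01,
    # Freshness: the newest record should be less than a day old.
    "max(created_at) within 24h": datetime.fromisoformat(
        scalar("SELECT MAX(created_at) FROM orders")
    ) > datetime.utcnow() - timedelta(hours=24),
}

failed = [name for name, ok in checks.items() if not ok]
if failed:
    raise SystemExit(f"data quality checks failed: {failed}")
```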
Contact Info
Maarten
LinkedIn
@masscheleinm on Twitter
Tom
LinkedIn
@tombaeyens on Twitter
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Links
Soda Data
Soda SQL
RedHat
Collibra
Spark
Getting Things Done by David Allen (affiliate link)
Slack
OpsGenie
DBT
Airflow
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
3/30/2021 • 58 minutes
Real World Change Data Capture At Datacoral
Summary
The world of business is becoming increasingly dependent on information that is accurate up to the minute. For analytical systems, the only way to provide this reliably is by implementing change data capture (CDC). Unfortunately, this is a non-trivial undertaking, particularly for teams that don’t have extensive experience working with streaming data and complex distributed systems. In this episode Raghu Murthy, founder and CEO of Datacoral, does a deep dive on how he and his team manage change data capture pipelines in production.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
Modern Data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days. Datafold helps Data teams gain visibility and confidence in the quality of their analytical data through data profiling, column-level lineage and intelligent anomaly detection. Datafold also helps automate regression testing of ETL code with its Data Diff feature that instantly shows how a change in ETL or BI code affects the produced data, both on a statistical level and down to individual rows and values. Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI workflows. Go to dataengineeringpodcast.com/datafold today to start a 30-day trial of Datafold. Once you sign up and create an alert in Datafold for your company data, they will send you a cool water flask.
RudderStack’s smart customer data pipeline is warehouse-first. It builds your customer data warehouse and your identity graph on your data warehouse, with support for Snowflake, Google BigQuery, Amazon Redshift, and more. Their SDKs and plugins make event streaming easy, and their integrations with cloud applications like Salesforce and ZenDesk help you go beyond event streaming. With RudderStack you can use all of your customer data to answer more difficult questions and then send those insights to your whole customer data stack. Sign up free at dataengineeringpodcast.com/rudder today.
Your host is Tobias Macey and today I’m interviewing Raghu Murthy about his recent work of making change data capture more accessible and maintainable
Interview
Introduction
How did you get involved in the area of data management?
Can you start by giving an overview of what CDC is and when it is useful?
What are the alternatives to CDC?
What are the cases where a more batch-oriented approach would be preferable?
What are the factors that you need to consider when deciding whether to implement a CDC system for a given data integration?
What are the barriers to entry?
What are some of the common mistakes or misconceptions about CDC that you have encountered in your own work and while working with customers?
How does CDC fit into a broader data platform, particularly where there are likely to be other data integration pipelines in operation? (e.g. Fivetran/Airbyte/Meltano/custom scripts)
What are the moving pieces in a CDC workflow that need to be considered as you are designing the system?
What are some examples of the configuration changes necessary in source systems to provide the needed log data? (a PostgreSQL sketch follows this question list)
How would you characterize the current landscape of tools available off the shelf for building a CDC pipeline?
What are your predictions about the potential for a unified abstraction layer for log-based CDC across databases?
What are some of the potential performance/uptime impacts on source databases, both during the initial historical sync and once you hit a steady state?
How can you mitigate the impacts of the CDC pipeline on the source databases?
What are some of the implementation details that application developers and DBAs need to be aware of for data modeling in the source systems to allow for proper replication via CDC?
Are there any performance challenges that need to be addressed in the consumers or destination systems (e.g. parallelism)?
Can you describe the technical implementation and architecture that you use for implementing CDC?
How has the design evolved as you have grown the scale and sophistication of your system?
In the destination system, what data modeling decisions need to be made to ensure that the replicated information is usable for analytics?
What additional attributes need to be added to track things like row modifications, deletions, schema changes, etc.?
How do you approach treatment of data copies in the DWH? (e.g. ELT – keep all source tables and use DBT for converting relevant tables into star/snowflake/data vault/wide tables)
What are your thoughts on the viability of a data lake as the destination system? (e.g. S3/Parquet or Trino/Drill/etc.)
CDC is a topic that is generally reserved for conversations about databases, but what are some of the other systems where we could think about implementing CDC (e.g. APIs and third party data sources)?
How can we integrate CDC into metadata/lineage tooling?
How do you handle observability of CDC flows?
What is involved in debugging a replication flow?
How can we build data quality checks into CDC workflows?
What are some of the most interesting, innovative, or unexpected ways that you have seen CDC used?
What are the most interesting, unexpected, or challenging lessons that you have learned from digging deep into CDC implementation?
When is CDC the wrong choice?
What are some of the industry or technology trends around CDC that you are most excited by?
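As a concrete, if simplified, illustration of the source-side configuration and log data discussed above, the sketch below uses PostgreSQL's built-in logical decoding functions from Python. The connection string, table, and slot name are hypothetical, and this is not Datacoral's implementation; production pipelines typically use a dedicated output plugin or a tool like Debezium (linked below) rather than test_decoding.

```python
import psycopg2

# Source-side prerequisites (set in postgresql.conf, then restart):
#   wal_level = logical
#   max_replication_slots = 4   # leave at least one free slot for the CDC pipeline
#   max_wal_senders = 4
# The DSN, slot name, and table referenced below are hypothetical.
conn = psycopg2.connect("dbname=appdb user=cdc_reader")
conn.autocommit = True
cur = conn.cursor()

# Create a logical replication slot using the built-in test_decoding plugin.
cur.execute(
    "SELECT * FROM pg_create_logical_replication_slot(%s, %s)",
    ("demo_slot", "test_decoding"),
)

# ... inserts/updates/deletes happen in the source database ...

# Read (and consume) the changes recorded since the slot was created.
cur.execute(
    "SELECT lsn, xid, data FROM pg_logical_slot_get_changes(%s, NULL, NULL)",
    ("demo_slot",),
)
for lsn, xid, data in cur.fetchall():
    print(lsn, xid, data)  # e.g. "table public.orders: INSERT: id[integer]:1 ..."

# Clean up the slot so WAL is not retained indefinitely.
cur.execute("SELECT pg_drop_replication_slot(%s)", ("demo_slot",))
```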
Contact Info
LinkedIn
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Links
DataCoral
Podcast Episode
DataCoral Blog
3 Steps To Build A Modern Data Stack
Change Data Capture: Overview
Hive
Hadoop
DBT
Podcast Episode
FiveTran
Podcast Episode
Change Data Capture
Metadata First Blog Post
Debezium
Podcast Episode
UUID == Universally Unique Identifier
Airflow
Oracle Goldengate
Parquet
Trino
AWS Lambda
Data Mesh
Podcast Episode
Enterprise Message Bus
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
3/23/2021 • 49 minutes, 58 seconds
Managing The DoorDash Data Platform
Summary
The team at DoorDash has a complex set of optimization challenges to deal with using data that they collect from a multi-sided marketplace. In order to handle the volume and variety of information that they use to run and improve the business the data team has to build a platform that analysts and data scientists can use in a self-service manner. In this episode the head of data platform for DoorDash, Sudhir Tonse, discusses the technologies that they are using, the approach that they take to adding new systems, and how they think about priorities for what to support for the whole company vs what to leave as a specialized concern for a single team. This is a valuable look at how to manage a large and growing data platform that supports a variety of teams with varied and evolving needs.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
Modern Data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days. Datafold helps Data teams gain visibility and confidence in the quality of their analytical data through data profiling, column-level lineage and intelligent anomaly detection. Datafold also helps automate regression testing of ETL code with its Data Diff feature that instantly shows how a change in ETL or BI code affects the produced data, both on a statistical level and down to individual rows and values. Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI workflows. Go to dataengineeringpodcast.com/datafold today to start a 30-day trial of Datafold. Once you sign up and create an alert in Datafold for your company data, they will send you a cool water flask.
RudderStack’s smart customer data pipeline is warehouse-first. It builds your customer data warehouse and your identity graph on your data warehouse, with support for Snowflake, Google BigQuery, Amazon Redshift, and more. Their SDKs and plugins make event streaming easy, and their integrations with cloud applications like Salesforce and ZenDesk help you go beyond event streaming. With RudderStack you can use all of your customer data to answer more difficult questions and then send those insights to your whole customer data stack. Sign up free at dataengineeringpodcast.com/rudder today.
Your host is Tobias Macey and today I’m interviewing Sudhir Tonse about how the team at DoorDash designed their data platform
Interview
Introduction
How did you get involved in the area of data management?
Can you start by giving a quick overview of what you do at DoorDash?
What are some of the ways that data is used to power the business?
How has the pandemic affected the scale and volatility of the data that you are working with?
Can you describe the type(s) of data that you are working with?
What are the primary sources of data that you collect?
What secondary or third party sources of information do you rely on?
Can you give an overview of the collection process for that data?
In selecting the technologies for the various components in your data stack, what are the primary factors that you consider when evaluating the build vs. buy decision?
In your recent post about how you are scaling the capabilities and capacity of your data platform you mentioned the concept of maintaining a "paved path" of supported technologies to simplify integration across teams. What are the technologies that you use and rely on for the "paved path"?
How are you managing quality and consistency of your data across its lifecycle?
What are some of the specific data quality solutions that you have integrated into the platform and "paved path"?
What are some of the technologies that were used early on at DoorDash that failed to keep up as the business scaled?
How do you manage the migration path for adopting new technologies or techniques?
In the same post you mentioned the tendency to allow for building point solutions before deciding whether to generalize a given use case into a generalized platform capability. Can you give some examples of cases where a point solution remains a one-off versus when it needs to be expanded into a widely used component?
How do you identify and track cost factors in the data platform?
What do you do with that information?
What is your approach for identifying and measuring useful OKRs (Objectives and Key Results)?
How do you quantify potentially subjective metrics such as reliability and quality?
How have you designed the organizational structure for your data teams?
What are the responsibilities and organizational interfaces for data engineers within the company?
How have the organizational structures/patterns shifted or changed at different levels of scale/maturity for the business?
What are some of the most interesting, useful, unexpected, or challenging lessons that you have learned during your time as a data professional at DoorDash?
What are some of the upcoming projects or changes that you anticipate in the near to medium future?
Contact Info
LinkedIn
@stonse on Twitter
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Links
How DoorDash is Scaling its Data Platform to Delight Customers and Meet our Growing Demand
DoorDash
Uber
Netscape
Netflix
Change Data Capture
Debezium
Podcast Episode
SnowflakeDB
Podcast Episode
Airflow
Podcast.__init__ Episode
Kafka
Flink
Podcast Episode
Pinot
GDPR
CCPA
Data Governance
AWS
LightGBM
XGBoost
Big Data Landscape
Kinesis
Kafka Connect
Cassandra
PostgreSQL
Podcast Episode
Amundsen
Podcast Episode
SQS
Feature Toggles
BigEye
Podcast Episode
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
3/16/2021 • 46 minutes, 4 seconds
Leave Your Data Where It Is And Automate Feature Extraction With Molecula
Summary
A majority of the time spent in data engineering is copying data between systems to make the information available for different purposes. This introduces challenges such as keeping information synchronized, managing schema evolution, and building transformations to match the expectations of the destination systems. H.O. Maycotte was faced with these same challenges, but at a massive scale, leading him to question whether there is a better way. After tasking some of his top engineers to consider the problem in a new light they created the Pilosa engine. In this episode H.O. explains how he built the Molecula platform around Pilosa to eliminate the need to copy data between systems in order to make it accessible for analytical and machine learning purposes. He also discusses the challenges that he faces in helping potential users and customers understand the shift in thinking that this creates, and how the system is architected to make it possible. This is a fascinating conversation about what the future looks like when you revisit your assumptions about how systems are designed.
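To make the bitmap index at the core of Pilosa (discussed later in the interview) a little more concrete, here is a minimal Python sketch of the general technique, using plain sets of record IDs in place of compressed bitmaps; the class and field names are hypothetical and this is not Pilosa's storage format or API.

    # Minimal illustration of a bitmap-style index, using Python sets of record IDs
    # in place of compressed bitmaps. Not Pilosa's implementation; names are hypothetical.
    from collections import defaultdict

    class BitmapIndex:
        def __init__(self):
            # (field, value) -> set of record ids that carry that value
            self.rows = defaultdict(set)

        def add(self, record_id, field, value):
            self.rows[(field, value)].add(record_id)

        def matching(self, *criteria):
            # Intersect the id sets for each (field, value) pair: a logical AND.
            sets = [self.rows[c] for c in criteria]
            return set.intersection(*sets) if sets else set()

    index = BitmapIndex()
    index.add(1, "country", "US"); index.add(1, "plan", "pro")
    index.add(2, "country", "US"); index.add(2, "plan", "free")
    index.add(3, "country", "DE"); index.add(3, "plan", "pro")

    # Records that are both US-based and on the pro plan -> {1}
    print(index.matching(("country", "US"), ("plan", "pro")))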
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
Modern Data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days. Datafold helps Data teams gain visibility and confidence in the quality of their analytical data through data profiling, column-level lineage and intelligent anomaly detection. Datafold also helps automate regression testing of ETL code with its Data Diff feature that instantly shows how a change in ETL or BI code affects the produced data, both on a statistical level and down to individual rows and values. Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI workflows. Go to dataengineeringpodcast.com/datafold today to start a 30-day trial of Datafold. Once you sign up and create an alert in Datafold for your company data, they will send you a cool water flask.
RudderStack’s smart customer data pipeline is warehouse-first. It builds your customer data warehouse and your identity graph on your data warehouse, with support for Snowflake, Google BigQuery, Amazon Redshift, and more. Their SDKs and plugins make event streaming easy, and their integrations with cloud applications like Salesforce and ZenDesk help you go beyond event streaming. With RudderStack you can use all of your customer data to answer more difficult questions and then send those insights to your whole customer data stack. Sign up free at dataengineeringpodcast.com/rudder today.
Your host is Tobias Macey and today I’m interviewing H.O. Maycotte about Molecula, a cloud based feature store based on the open source Pilosa project
Interview
Introduction
How did you get involved in the area of data management?
Can you start by giving an overview of what you are building at Molecula and the story behind it?
What are the additional capabilities that Molecula offers on top of the open source Pilosa project?
What are the problems/use cases that Molecula solves for?
What are some of the technologies or architectural patterns that Molecula might replace in a company’s data platform?
One of the use cases that is mentioned on the Molecula site is as a feature store for ML and AI. This is a category that has been seeing a lot of growth recently. Can you provide some context for how Molecula fits in that market and how it compares to options such as Tecton, Iguazio, Feast, etc.?
What are the benefits of using a bitmap index for identifying and computing features?
Can you describe how the Molecula platform is architected?
How has the design and goal of Molecula changed or evolved since you first began working on it?
For someone who is using Molecula, can you describe the process of integrating it with their existing data sources?
Can you describe the internal data model of Pilosa/Molecula?
How should users think about data modeling and architecture as they are loading information into the platform?
Once a user has data in Pilosa, what are the available mechanisms for performing analyses or feature engineering?
What are some of the most underutilized or misunderstood capabilities of Molecula?
What are some of the most interesting, unexpected, or innovative ways that you have seen the Molecula platform used?
What are the most interesting, unexpected, or challenging lessons that you have learned from building and scaling Molecula?
When is Molecula the wrong choice?
What do you have planned for the future of the platform and business?
Contact Info
LinkedIn
@maycotte on Twitter
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Links
Molecula
Pilosa
Podcast Episode
The Social Dilemma
Feature Store
Cassandra
Elasticsearch
Podcast Episode
Druid
MongoDB
SwimOS
Podcast Episode
Kafka
Kafka Schema Registry
Podcast Episode
Homomorphic Encryption
Lucene
Solr
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
3/9/2021 • 51 minutes, 39 seconds
Bridging The Gap Between Machine Learning And Operations At Iguazio
Summary
The process of building and deploying machine learning projects requires a staggering number of systems and stakeholders to work in concert. In this episode Yaron Haviv, co-founder of Iguazio, discusses the complexities inherent to the process, as well as how he has worked to democratize the technologies necessary to make machine learning operations maintainable.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
Modern Data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days. Datafold helps Data teams gain visibility and confidence in the quality of their analytical data through data profiling, column-level lineage and intelligent anomaly detection. Datafold also helps automate regression testing of ETL code with its Data Diff feature that instantly shows how a change in ETL or BI code affects the produced data, both on a statistical level and down to individual rows and values. Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI workflows. Go to dataengineeringpodcast.com/datafold today to start a 30-day trial of Datafold. Once you sign up and create an alert in Datafold for your company data, they will send you a cool water flask.
RudderStack’s smart customer data pipeline is warehouse-first. It builds your customer data warehouse and your identity graph on your data warehouse, with support for Snowflake, Google BigQuery, Amazon Redshift, and more. Their SDKs and plugins make event streaming easy, and their integrations with cloud applications like Salesforce and ZenDesk help you go beyond event streaming. With RudderStack you can use all of your customer data to answer more difficult questions and then send those insights to your whole customer data stack. Sign up free at dataengineeringpodcast.com/rudder today.
Your host is Tobias Macey and today I’m interviewing Yaron Haviv about Iguazio, a platform for end to end automation of machine learning applications using MLOps principles.
Interview
Introduction
How did you get involved in the area of data science & analytics?
Can you start by giving an overview of what Iguazio is and the story of how it got started?
How would you characterize your target or typical customer?
What are the biggest challenges that you see around building production grade workflows for machine learning?
How does Iguazio help to address those complexities?
For customers who have already invested in the technical and organizational capacity for data science and data engineering, how does Iguazio integrate with their environments?
What are the responsibilities of a data engineer throughout the different stages of the lifecycle for a machine learning application?
Can you describe how the Iguazio platform is architected?
How has the design of the platform evolved since you first began working on it?
How have the industry best practices around bringing machine learning to production changed?
How do you approach testing/validation of machine learning applications and releasing them to production environments? (e.g. CI/CD)
Once a model is in production, what are the types and sources of information that you collect to monitor its performance?
What are the factors that contribute to model drift?
What are the remaining gaps in the tooling or processes available for managing the lifecycle of machine learning projects?
What are the most interesting, innovative, or unexpected ways that you have seen the Iguazio platform used?
What are the most interesting, unexpected, or challenging lessons that you have learned while building and scaling the Iguazio platform and business?
When is Iguazio the wrong choice?
What do you have planned for the future of the platform?
Contact Info
LinkedIn
@yaronhaviv on Twitter
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
Iguazio
MLOps
Oracle Exadata
SAP HANA
Mellanox
NVIDIA
Multi-Model Database
Nuclio
MLRun
Jupyter Notebook
Pandas
Scala
Feature Imputing
Feature Store
Parquet
Spark
Apache Flink
Podcast Episode
Apache Beam
NLP (Natural Language Processing)
Deep Learning
BERT
Airflow
Podcast.__init__ Episode
Dagster
Data Engineering Podcast Episode
Podcast.__init__ Episode
Kubeflow
Argo
AWS Step Functions
Presto/Trino
Podcast Episode
Dask
Podcast Episode
Hadoop
Sagemaker
Tecton
Podcast Episode
Seldon
DataRobot
RapidMiner
H2O.ai
Grafana
Storey
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
3/2/2021 • 1 hour, 6 minutes, 27 seconds
Self Service Open Source Data Integration With Airbyte
Summary
Data integration is a critical piece of every data pipeline, yet it is still far from being a solved problem. There are a number of managed platforms available, but the list of options for an open source system that supports a large variety of sources and destinations is still embarrassingly short. The team at Airbyte is adding a new entry to that list with the goal of making robust and easy to use data integration more accessible to teams who want or need to maintain full control of their data. In this episode co-founders John Lafleur and Michel Tricot share the story of how and why they created Airbyte, discuss the project’s design and architecture, and explain their vision of what an open source data integration platform should offer. If you are struggling to maintain your extract and load pipelines, or spending time on integrating with a new system when you would prefer to be working on other projects, then this is definitely a conversation worth listening to.
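Several of the questions below touch on the Singer specification and the connector protocol that Airbyte defined in its place. As a rough, hedged illustration of what such a protocol looks like, this Python sketch emits Singer-style SCHEMA, RECORD, and STATE messages as JSON lines on stdout; it is a simplified approximation, not the complete Singer or Airbyte specification, and the stream and field names are made up.

    # Simplified sketch of a Singer-style tap: one JSON message per line on stdout.
    # The message shapes approximate the Singer spec; a real tap adds more fields.
    import json
    import sys

    def emit(message):
        sys.stdout.write(json.dumps(message) + "\n")

    def run_tap(rows):
        emit({
            "type": "SCHEMA",
            "stream": "users",
            "schema": {"type": "object", "properties": {"id": {"type": "integer"},
                                                        "email": {"type": "string"}}},
            "key_properties": ["id"],
        })
        for row in rows:
            emit({"type": "RECORD", "stream": "users", "record": row})
        # A STATE message lets the next run resume incrementally from a bookmark.
        emit({"type": "STATE", "value": {"users": {"max_id": max(r["id"] for r in rows)}}})

    if __name__ == "__main__":
        run_tap([{"id": 1, "email": "a@example.com"}, {"id": 2, "email": "b@example.com"}])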
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
Modern Data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days. Datafold helps Data teams gain visibility and confidence in the quality of their analytical data through data profiling, column-level lineage and intelligent anomaly detection. Datafold also helps automate regression testing of ETL code with its Data Diff feature that instantly shows how a change in ETL or BI code affects the produced data, both on a statistical level and down to individual rows and values. Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI workflows. Go to dataengineeringpodcast.com/datafold today to start a 30-day trial of Datafold. Once you sign up and create an alert in Datafold for your company data, they will send you a cool water flask.
RudderStack’s smart customer data pipeline is warehouse-first. It builds your customer data warehouse and your identity graph on your data warehouse, with support for Snowflake, Google BigQuery, Amazon Redshift, and more. Their SDKs and plugins make event streaming easy, and their integrations with cloud applications like Salesforce and ZenDesk help you go beyond event streaming. With RudderStack you can use all of your customer data to answer more difficult questions and then send those insights to your whole customer data stack. Sign up free at dataengineeringpodcast.com/rudder today.
Your host is Tobias Macey and today I’m interviewing Michel Tricot and John Lafleur about Airbyte, an open source framework for building data integration pipelines.
Interview
Introduction
How did you get involved in the area of data management?
Can you start by explaining what Airbyte is and the story behind it?
Businesses and data engineers have a variety of options for how to manage their data integration. How would you characterize the overall landscape and how does Airbyte distinguish itself in that space?
How would you characterize your target users?
How have those personas instructed the priorities and design of Airbyte?
What do you see as the benefits and tradeoffs of a UI oriented data integration platform as compared to a code first approach?
What are the complex/challenging elements of data integration that make it such a slippery problem?
Motivation for creating open source ELT as a business
Can you describe how the Airbyte platform is implemented?
What was your motivation for choosing Java as the primary language?
Incidental complexity of forcing all connectors to be packaged as containers
Shortcomings of the Singer specification/motivation for creating a backwards incompatible interface
Perceived potential for community adoption of the Airbyte specification
Tradeoffs of using JSON as an interchange format vs. e.g. protobuf/gRPC/Avro/etc.
Information lost when converting records to JSON types/how to preserve that information (e.g. field constraints, valid enums, etc.)
Interfaces/extension points for integrating with other tools, e.g. Dagster
Abstraction layers for simplifying implementation of new connectors
Tradeoffs of storing all connectors in a monorepo with the Airbyte core
Impact of community adoption/contributions
What is involved in setting up an Airbyte installation?
What are the available axes for scaling an Airbyte deployment?
Challenges of setting up and maintaining a CI environment for Airbyte
How are you managing governance and long term sustainability of the project?
What are some of the most interesting, unexpected, or innovative ways that you have seen Airbyte used?
What are the most interesting, unexpected, or challenging lessons that you have learned while building Airbyte?
When is Airbyte the wrong choice?
What do you have planned for the future of the project?
Contact Info
Michel
LinkedIn
@MichelTricot on Twitter
michel-tricot on GitHub
John
LinkedIn
@JeanLafleur on Twitter
johnlafleur on GitHub
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
Airbyte
Liveramp
Fivetran
Podcast Episode
Stitch Data
Matillion
DataCoral
Podcast Episode
Singer
Meltano
Podcast Episode
Airflow
Podcast.__init__ Episode
Kotlin
Docker
Monorepo
Airbyte Specification
Great Expectations
Podcast Episode
Dagster
Data Engineering Podcast Episode
Podcast.__init__ Episode
Prefect
Podcast Episode
DBT
Podcast Episode
Kubernetes
Snowflake
Podcast Episode
Redshift
Presto
Spark
Parquet
Podcast Episode
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
2/23/2021 • 52 minutes, 15 seconds
Building The Foundations For Data Driven Businesses at 5xData
Summary
Every business aims to be data driven, but not all of them succeed in that effort. In order to truly derive insights from the data that an organization collects, certain foundational capabilities need to be in place. In order to help more businesses build those foundations, Tarush Aggarwal created 5xData, offering collaborative workshops to assist in setting up the technical and organizational systems that are necessary to succeed. In this episode he shares his thoughts on the core elements that are necessary for every business to be data driven, how he is helping companies incorporate those capabilities into their structure, and the ongoing support that he is providing through a network of mastermind groups. This is a great conversation about the initial steps that every group should be thinking of as they start down the road to making data-informed decisions.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
Modern Data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days. Datafold helps Data teams gain visibility and confidence in the quality of their analytical data through data profiling, column-level lineage and intelligent anomaly detection. Datafold also helps automate regression testing of ETL code with its Data Diff feature that instantly shows how a change in ETL or BI code affects the produced data, both on a statistical level and down to individual rows and values. Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI workflows. Go to dataengineeringpodcast.com/datafold today to start a 30-day trial of Datafold. Once you sign up and create an alert in Datafold for your company data, they will send you a cool water flask.
RudderStack’s smart customer data pipeline is warehouse-first. It builds your customer data warehouse and your identity graph on your data warehouse, with support for Snowflake, Google BigQuery, Amazon Redshift, and more. Their SDKs and plugins make event streaming easy, and their integrations with cloud applications like Salesforce and ZenDesk help you go beyond event streaming. With RudderStack you can use all of your customer data to answer more difficult questions and then send those insights to your whole customer data stack. Sign up free at dataengineeringpodcast.com/rudder today.
Your host is Tobias Macey and today I’m interviewing Tarush Aggarwal about his mission at 5xData to teach companies how to build solid foundations for their data capabilities
Interview
Introduction
How did you get involved in the area of data management?
Can you start by giving an overview of what you are building at 5xData and the story behind it?
Impact of industry on the challenges of becoming data driven
Profile of the companies that you are trying to work with
Common mistakes when designing a data platform
Misconceptions that the business has around how to invest in data
Challenges in attracting/interviewing/hiring data talent
What are the core components that you have standardized on for building the foundational layers of the data platform?
Providing context and training to business users in order to allow them to self-serve the answers to their questions
Tooling/interfaces needed to allow them to ask and investigate questions
Most high impact areas for data engineers to focus on in the initial stages of implementing the data platform
How to identify and prioritize areas of effort
Useful structure of the data team at different stages of maturity
What are the most interesting, unexpected, or challenging lessons that you have learned while building out the business and team of 5xData?
What do you have planned for the future of the business?
What are the industry trends or specific technologies that you are keeping a close watch on?
Contact Info
LinkedIn
@tarush on Twitter
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Links
5xData
Looker
Podcast Episode
Snowflake
Podcast Episode
Fivetran
Podcast Episode
DBT
Podcast Episode
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
2/16/2021 • 52 minutes, 15 seconds
How Shopify Is Building Their Production Data Warehouse Using DBT
Summary
With all of the tools and services available for building a data platform it can be difficult to separate the signal from the noise. One of the best ways to get a true understanding of how a technology works in practice is to hear from people who are running it in production. In this episode Zeeshan Qureshi and Michelle Ark share their experiences using DBT to manage the data warehouse for Shopify. They explain how they structured the project to allow multiple teams to collaborate in a scalable manner, the additional tooling that they added to address the edge cases that they have run into, and the optimizations that they baked into their continuous integration process to provide fast feedback and reduce costs. This is a great conversation about the lessons learned from real world use of a specific technology and how well it lives up to its promises.
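A common way to get fast CI feedback and reduce costs with dbt is to rebuild only the models affected by a change. The sketch below shows one way to approximate that from Python by shelling out to dbt's state-based selection; the flags assume a reasonably recent dbt version and a saved production manifest, and the whole thing is an illustrative assumption rather than Shopify's actual tooling.

    # Hypothetical CI helper: run only dbt models modified since the last production
    # deploy by comparing against a saved manifest. Flag names assume a recent dbt
    # version that supports state-based selection; verify them against your install.
    import subprocess
    import sys

    PROD_ARTIFACTS = "state/prod"  # directory holding the production manifest.json

    def run_changed_models():
        cmd = [
            "dbt", "run",
            "--select", "state:modified+",   # changed models plus downstream dependents
            "--state", PROD_ARTIFACTS,
        ]
        result = subprocess.run(cmd)
        sys.exit(result.returncode)

    if __name__ == "__main__":
        run_changed_models()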
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
Modern Data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days. Datafold helps Data teams gain visibility and confidence in the quality of their analytical data through data profiling, column-level lineage and intelligent anomaly detection. Datafold also helps automate regression testing of ETL code with its Data Diff feature that instantly shows how a change in ETL or BI code affects the produced data, both on a statistical level and down to individual rows and values. Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI workflows. Go to dataengineeringpodcast.com/datafold today to start a 30-day trial of Datafold. Once you sign up and create an alert in Datafold for your company data, they will send you a cool water flask.
Today’s episode of Data Engineering Podcast is sponsored by Datadog, the monitoring and analytics platform for cloud-scale infrastructure and applications. Datadog’s machine-learning based alerts, customizable dashboards, and 400+ vendor-backed integrations make it easy to unify disparate data sources and pivot between correlated metrics and events for faster troubleshooting. By combining metrics, traces, and logs in one place, you can easily improve your application performance. Try Datadog free by starting your 14-day trial and receive a free t-shirt once you install the agent. Go to dataengineeringpodcast.com/datadog today to see how you can unify your monitoring.
Your host is Tobias Macey and today I’m interviewing Zeeshan Qureshi and Michelle Ark about how Shopify is building their production data warehouse platform with DBT
Interview
Introduction
How did you get involved in the area of data management?
Can you start by giving an overview of what the Shopify platform is?
What kinds of data sources are you working with?
Can you share some examples of the types of analysis, decisions, and products that you are building with the data that you manage?
How have you structured your data teams to be able to deliver those projects?
What are the systems that you have in place, technological or otherwise, to allow you to support the needs of the various data professionals and business users?
What was the tipping point that led you to reconsider your system design and start down the road of architecting a data warehouse?
What were your criteria when selecting a platform for your data warehouse?
What decision did that criteria lead you to make?
Once you decided to orient a large portion of your reporting around a data warehouse, what were the biggest unknowns that you were faced with while deciding how to structure the workflows and access policies?
What were your criteria for determining what toolchain to use for managing the data warehouse?
You ultimately decided to standardize on DBT. What were the other options that you explored and what were the requirements that you had for determining the candidates?
What was your process for onboarding users into the DBT toolchain and determining how to structure the project layout?
What are some of the shortcomings or edge cases that you ran into?
Rather than rely on the vanilla DBT workflow you created a wrapper project to add additional functionality. What were some of the features that you needed to add to suit your particular needs?
What has been your experience with extending and integrating with DBT to customize it for your environment?
Can you talk through how you manage testing of your DBT pipelines and the tables that it is responsible for?
How much of the testing are you able to do with out-of-the-box functionality from DBT?
What are the additional capabilities that you have bolted on to provide a more robust and scalable means of verifying your pipeline changes?
Can you share how you manage the CI/CD process for changes in your data warehouse?
What kinds of monitoring or metrics collection do you perform on the execution of your DBT pipelines?
How do you integrate the management of your data warehouse and DBT workflows with your broader data platform?
Now that you have been using DBT in production for a while, what are the challenges that you have encountered when using it at scale?
Are there any patterns that you and your team have found useful that are worth digging into for other teams who are considering DBT or are actively using it?
What are the opportunities and available mechanisms that you have found for introducing abstraction layers to reduce the maintenance burden for your data warehouse?
What is the data modeling approach that you are using? (e.g. Data Vault, Star/Snowflake Schema, wide tables, etc.)
As you continue to work with DBT and rely on the data warehouse for production use cases, what are some of the additional features/improvements that you have planned?
What are some of the unexpected/innovative/surprising use cases that you and your team have found for the Seamster tool or the data models that it generates?
What are the cases where you think that DBT or data warehousing is the wrong answer and teams should be looking to other solutions?
What are the most interesting, unexpected, or challenging lessons that you learned while working through the process of migrating a portion of your data workloads into the data warehouse and managing them with DBT?
Contact Info
Zeeshan
@zeeshanq on Twitter
Website
Michelle
@michellearky on Twitter
LinkedIn
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Links
How to Build a Production Grade Workflow with SQL Modelling
Shopify
JRuby
PySpark
Druid
Amplitude
Mode
Snowflake Schema
Data Vault
Podcast Episode
BigQuery
Amazon Redshift
CI/CD
Great Expectations
Podcast Episode
Master Data Management
Podcast Episode
Flink SQL
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
2/9/2021 • 46 minutes, 30 seconds
System Observability For The Cloud Native Era With Chronosphere
Summary
Collecting and processing metrics for monitoring use cases is an interesting data problem. It is eminently possible to generate millions or billions of data points per second; the information needs to be propagated to a central location, processed, and analyzed in timeframes on the order of milliseconds or single-digit seconds; and the consumers of the data need to be able to query the information quickly and flexibly. As the systems that we build continue to grow in scale and complexity, the need for reliable and manageable monitoring platforms increases proportionately. In this episode Rob Skillington, CTO of Chronosphere, shares his experiences building metrics systems that provide observability to companies that are operating at extreme scale. He describes how the M3DB storage engine is designed to manage the pressures of a critical system component, the inherent complexities of working with telemetry data, and the motivating factors that are contributing to the growing need for flexibility in querying the collected metrics. This is a fascinating conversation about an area of data management that is often taken for granted.
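One recurring theme in metrics systems (and in the questions below) is label cardinality. The toy Python store here keys each series by its full label set and maintains an inverted index from label pairs to series IDs, which shows why every new label value multiplies the number of series to index and query; it is purely illustrative and does not reflect M3DB's actual data structures.

    # Toy metrics store: each unique label combination is a separate series,
    # with an inverted index from (label, value) to series ids for lookups.
    # Purely illustrative; not M3DB's implementation.
    from collections import defaultdict

    class MetricsStore:
        def __init__(self):
            self.series = {}               # series id -> list of (timestamp, value)
            self.index = defaultdict(set)  # (label, value) -> set of series ids

        def write(self, name, labels, timestamp, value):
            series_id = (name, tuple(sorted(labels.items())))
            if series_id not in self.series:
                self.series[series_id] = []
                for pair in labels.items():
                    self.index[pair].add(series_id)
            self.series[series_id].append((timestamp, value))

        def query(self, **labels):
            # Intersect the postings lists for every requested label pair.
            candidates = [self.index[pair] for pair in labels.items()]
            return set.intersection(*candidates) if candidates else set()

    store = MetricsStore()
    store.write("http_requests", {"service": "api", "status": "500"}, 1, 3.0)
    store.write("http_requests", {"service": "api", "status": "200"}, 1, 97.0)
    print(store.query(service="api", status="500"))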
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $100 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
Modern Data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days. Datafold helps Data teams gain visibility and confidence in the quality of their analytical data through data profiling, column-level lineage and intelligent anomaly detection. Datafold also helps automate regression testing of ETL code with its Data Diff feature that instantly shows how a change in ETL or BI code affects the produced data, both on a statistical level and down to individual rows and values. Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI workflows. Go to dataengineeringpodcast.com/datafold today to start a 30-day trial of Datafold. Once you sign up and create an alert in Datafold for your company data, they will send you a cool water flask.
Today’s episode of Data Engineering Podcast is sponsored by Datadog, the monitoring and analytics platform for cloud-scale infrastructure and applications. Datadog’s machine-learning based alerts, customizable dashboards, and 400+ vendor-backed integrations make it easy to unify disparate data sources and pivot between correlated metrics and events for faster troubleshooting. By combining metrics, traces, and logs in one place, you can easily improve your application performance. Try Datadog free by starting your 14-day trial and receive a free t-shirt once you install the agent. Go to dataengineeringpodcast.com/datadog today to see how you can unify your monitoring.
Your host is Tobias Macey and today I’m interviewing Rob Skillington about Chronosphere, a scalable, reliable and customizable monitoring-as-a-service purpose built for cloud-native applications.
Interview
Introduction
How did you get involved in the area of data management?
Can you start by describing what you are building at Chronosphere and your motivation for turning it into a business?
What are the biggest challenges inherent to monitoring use cases?
How does the advent of cloud native environments complicate things further?
While you were at Uber you helped to create the M3 storage engine. There is a wide array of time series databases available, including many purpose built for metrics use cases. What were the missing pieces that made it necessary to create a new system?
How do you handle schema design/data modeling for metrics storage?
How do the usage patterns of metrics systems contribute to the complexity of building a storage layer to support them?
What are the optimizations that need to be made for the read and write paths in M3?
How do you handle high cardinality of metrics and ad-hoc queries to understand system behaviors?
What are the scaling factors for M3?
Can you describe how you have architected the Chronosphere platform?
What are the convenience features built on top of M3 that you are creating at Chronosphere?
How do you handle deployment and scaling of your infrastructure given the scale of the businesses that you are working with?
Beyond just server infrastructure and application behavior, what are some of the other sources of metrics that you and your users are sending into Chronosphere?
How do those alternative metrics sources complicate the work of generating useful insights from the data?
In addition to the read and write loads, metrics systems also need to be able to identify patterns, thresholds, and anomalies in the data to alert on it with minimal latency. How do you handle that in the Chronosphere platform?
What are some of the most interesting, innovative, or unexpected ways that you have seen Chronosphere/M3 used?
What are some of the most interesting, unexpected, or challenging lessons that you have learned while building Chronosphere?
When is Chronosphere the wrong choice?
What do you have planned for the future of the platform and business?
Contact Info
LinkedIn
@roskilli on Twitter
robskillington on GitHub
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Links
Chronosphere
Lidar
Cloud Native
M3DB
OpenTracing
Metrics/Telemetry
Graphite
Podcast.__init__ Episode
InfluxDB
Clickhouse
Podcast Episode
Prometheus
Inverted Index
Druid
Cardinality
Apache Flink
Podcast Episode
HDFS
Avro
Podcast Episode
Grafana
Tecton
Podcast Episode
Datadog
Podcast Episode
Kubernetes
Sourcegraph
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
2/2/2021 • 1 hour, 4 minutes, 50 seconds
Making It Easier To Stick B2B Data Integration Pipelines Together With Hotglue
Summary
Businesses often need to be able to ingest data from their customers in order to power the services that they provide. Each new source that they need to integrate with means another custom set of ETL tasks to maintain. In order to reduce the friction involved in supporting new data transformations David Molot and Hassan Syyid built the Hotglue platform. In this episode they describe the data integration challenges facing many B2B companies, how their work on the Hotglue platform simplifies their efforts, and how they have designed the platform to make these ETL workloads embeddable and self service for end users.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $60 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
Modern Data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days. Datafold helps Data teams gain visibility and confidence in the quality of their analytical data through data profiling, column-level lineage and intelligent anomaly detection. Datafold also helps automate regression testing of ETL code with its Data Diff feature that instantly shows how a change in ETL or BI code affects the produced data, both on a statistical level and down to individual rows and values. Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI workflows. Go to dataengineeringpodcast.com/datafold today to start a 30-day trial of Datafold. Once you sign up and create an alert in Datafold for your company data, they will send you a cool water flask.
This episode of Data Engineering Podcast is sponsored by Datadog, a unified monitoring and analytics platform built for developers, IT operations teams, and businesses in the cloud age. Datadog provides customizable dashboards, log management, and machine-learning-based alerts in one fully-integrated platform so you can seamlessly navigate, pinpoint, and resolve performance issues in context. Monitor all your databases, cloud services, containers, and serverless functions in one place with Datadog’s 400+ vendor-backed integrations. If an outage occurs, Datadog provides seamless navigation between your logs, infrastructure metrics, and application traces in just a few clicks to minimize downtime. Try it yourself today by starting a free 14-day trial and receive a Datadog t-shirt after installing the agent. Go to dataengineeringpodcast.com/datadog today to see how you can enhance visibility into your stack with Datadog.
Your host is Tobias Macey and today I’m interviewing David Molot and Hassan Syyid about Hotglue, an embeddable data integration tool for B2B developers built on the Python ecosystem.
Interview
Introduction
How did you get involved in the area of data management?
Can you start by describing what you are building at Hotglue?
What was your motivation for starting a business to address this particular problem?
Who is the target user of Hotglue and what are their biggest data problems?
What are the types and sources of data that they are likely to be working with?
How are they currently handling solutions for those problems?
How does the introduction of Hotglue simplify or improve their work?
What is involved in getting Hotglue integrated into a given customer’s environment?
How is Hotglue itself implemented?
How has the design or goals of the platform evolved since you first began building it?
What were some of the initial assumptions that you had at the outset and how well have they held up as you progressed?
Once a customer has set up Hotglue what is their workflow for building and executing an ETL workflow?
What are their options for working with sources that aren’t supported out of the box?
What are the biggest design and implementation challenges that you are facing given the need for your product to be embedded in customer platforms and exposed to their end users?
What are some of the most interesting, innovative, or unexpected ways that you have seen Hotglue used?
What are the most interesting, unexpected, or challenging lessons that you have learned while building Hotglue?
When is Hotglue the wrong choice?
What do you have planned for the future of the product?
Contact Info
David
@davidmolot on Twitter
LinkedIn
Hassan
hsyyid on GitHub
LinkedIn
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Links
Hotglue
Python
The Python Podcast.__init__
B2B == Business to Business
Meltano
Podcast Episode
Airbyte
Singer
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
1/26/2021 • 34 minutes, 5 seconds
Using Your Data Warehouse As The Source Of Truth For Customer Data With Hightouch
Summary
The data warehouse has become the central component of the modern data stack. Building on this pattern, the team at Hightouch have created a platform that synchronizes information about your customers out to third party systems for use by marketing and sales teams. In this episode Tejas Manohar explains the benefits of sourcing customer data from one location for all of your organization to use, the technical challenges of synchronizing the data to external systems with varying APIs, and the workflow for enabling self-service access to your customer data by your marketing teams. This is an interesting conversation about the importance of the data warehouse and how it can be used beyond just internal analytics.
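The core loop of a warehouse-to-SaaS sync like this is easy to picture in miniature. In the hedged Python sketch below, SQLite stands in for the warehouse, and the CRM endpoint, table, and field names are made-up placeholders; it illustrates the general pattern rather than Hightouch's implementation.

    # Illustrative reverse-ETL loop: pull rows from the warehouse, push to a CRM API.
    # SQLite stands in for the warehouse; the CRM endpoint below is hypothetical.
    import sqlite3
    import requests

    CRM_UPSERT_URL = "https://crm.example.com/api/contacts/upsert"  # placeholder

    def sync_customers(db_path):
        conn = sqlite3.connect(db_path)
        rows = conn.execute(
            "SELECT email, lifetime_value, plan FROM customer_profiles"
        ).fetchall()
        for email, ltv, plan in rows:
            payload = {"email": email, "lifetime_value": ltv, "plan": plan}
            resp = requests.post(CRM_UPSERT_URL, json=payload, timeout=10)
            resp.raise_for_status()
        conn.close()

    if __name__ == "__main__":
        sync_customers("warehouse.db")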
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $60 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
Modern Data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days. Datafold helps Data teams gain visibility and confidence in the quality of their analytical data through data profiling, column-level lineage and intelligent anomaly detection. Datafold also helps automate regression testing of ETL code with its Data Diff feature that instantly shows how a change in ETL or BI code affects the produced data, both on a statistical level and down to individual rows and values. Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI workflows. Go to dataengineeringpodcast.com/datafold today to start a 30-day trial of Datafold. Once you sign up and create an alert in Datafold for your company data, they will send you a cool water flask.
This episode of Data Engineering Podcast is sponsored by Datadog, a unified monitoring and analytics platform built for developers, IT operations teams, and businesses in the cloud age. Datadog provides customizable dashboards, log management, and machine-learning-based alerts in one fully-integrated platform so you can seamlessly navigate, pinpoint, and resolve performance issues in context. Monitor all your databases, cloud services, containers, and serverless functions in one place with Datadog’s 400+ vendor-backed integrations. If an outage occurs, Datadog provides seamless navigation between your logs, infrastructure metrics, and application traces in just a few clicks to minimize downtime. Try it yourself today by starting a free 14-day trial and receive a Datadog t-shirt after installing the agent. Go to dataengineeringpodcast.com/datadog today to see how you can enhance visibility into your stack with Datadog.
Your host is Tobias Macey and today I’m interviewing Tejas Manohar about Hightouch, a data platform that helps you sync your customer data from your data warehouse to your CRM, marketing, and support tools
Interview
Introduction
How did you get involved in the area of data management?
Can you start by giving an overview of what you are building at Hightouch and your motivation for creating it?
What are the main points of friction for teams who are trying to make use of customer data?
Where is Hightouch positioned in the ecosystem of customer data tools such as Segment, Mixpanel, Amplitude, etc.?
Who are the target users of Hightouch?
How has that influenced the design of the platform?
What are the baseline attributes necessary for Hightouch to populate downstream systems?
What are the data modeling considerations that users need to be aware of when sending data to other platforms?
Can you describe how Hightouch is architected?
How has the design of the platform evolved since you first began working on it?
What goals or assumptions did you have when you first began building Hightouch that have been modified or invalidated once you began working with customers?
Can you talk through the workflow of using Hightouch to propagate data to other platforms?
How do you keep data up to date between the warehouse and downstream systems?
What are the upstream systems that users need to have in place to make Hightouch a viable and effective tool?
What are the benefits of using the data warehouse as the source of truth for downstream services?
What are the trends in data warehousing that you are keeping a close eye on?
What are you most excited for?
Are there any that you find worrisome?
What are some of the most interesting, unexpected, or innovative ways that you have seen Hightouch used?
What are the most interesting, unexpected, or challenging lessons that you have learned while building Hightouch?
When is Hightouch the wrong choice?
What do you have planned for the future of the platform?
Contact Info
LinkedIn
@tejasmanohar on Twitter
tejasmanohar on GitHub
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
Hightouch
Segment
Podcast Episode
DBT
Podcast Episode
Looker
Podcast Episode
Change Data Capture
Podcast Episode
Database Trigger
Materialize
Podcast Episode
Flink
Podcast Episode
Zapier
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
1/19/2021 • 59 minutes, 33 seconds
Enabling Version Controlled Data Collaboration With TerminusDB
Summary
As data professionals we have a number of tools available for storing, processing, and analyzing data. We also have tools for collaborating on software and analysis, but collaborating on data is still an underserved capability. Gavin Mendel-Gleason encountered this problem firsthand while working on the Seshat databank, leading him to create TerminusDB and TerminusHub. In this episode he explains how the TerminusDB system is architected to provide a versioned graph storage engine that allows for branching and merging of data sets, and how that opens up new possibilities for individuals and teams to work together on building new data repositories. This is a fascinating conversation on the technical challenges involved, the opportunities that such a system provides, and the complexities inherent to building a successful business on open source.
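To give a feel for what branching and merging a dataset can look like, here is a deliberately simplified Python sketch that treats a dataset as a dictionary of documents, forks it, and merges non-conflicting edits back while flagging conflicts; it is a conceptual toy only and does not reflect TerminusDB's storage engine or query language.

    # Conceptual branch-and-merge over a dataset of documents keyed by id.
    # A toy model of the workflow, not TerminusDB's implementation.
    import copy

    def branch(dataset):
        # A branch starts as an independent copy of the current state.
        return copy.deepcopy(dataset)

    def merge(base, ours, theirs):
        merged, conflicts = copy.deepcopy(ours), []
        for key in set(base) | set(theirs):
            base_val, our_val, their_val = base.get(key), ours.get(key), theirs.get(key)
            if their_val == base_val:          # other branch left it untouched: keep ours
                continue
            if our_val == base_val:            # only the other branch changed it: take theirs
                if their_val is None:
                    merged.pop(key, None)      # deletion on the other branch
                else:
                    merged[key] = their_val
            elif our_val != their_val:         # both branches changed it differently: conflict
                conflicts.append(key)
        return merged, conflicts

    main = {"city:dublin": {"population": 544000}}
    feature = branch(main)
    feature["city:dublin"] = {"population": 554000}
    merged, conflicts = merge(base=main, ours=main, theirs=feature)
    print(merged, conflicts)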
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $60 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
Do you want to get better at Python? Now is an excellent time to take an online course. Whether you’re just learning Python or you’re looking for deep dives on topics like APIs, memory management, async and await, and more, our friends at Talk Python Training have a top-notch course for you. If you’re just getting started, be sure to check out the Python for Absolute Beginners course. It’s like the first year of computer science that you never took compressed into 10 fun hours of Python coding and problem solving. Go to dataengineeringpodcast.com/talkpython today and get 10% off the course that will help you find your next level. That’s dataengineeringpodcast.com/talkpython, and don’t forget to thank them for supporting the show.
You invest so much in your data infrastructure – you simply can’t afford to settle for unreliable data. Fortunately, there’s hope: in the same way that New Relic, DataDog, and other Application Performance Management solutions ensure reliable software and keep application downtime at bay, Monte Carlo solves the costly problem of broken data pipelines. Monte Carlo’s end-to-end Data Observability Platform monitors and alerts for data issues across your data warehouses, data lakes, ETL, and business intelligence. The platform uses machine learning to infer and learn your data, proactively identify data issues, assess its impact through lineage, and notify those who need to know before it impacts the business. By empowering data teams with end-to-end data reliability, Monte Carlo helps organizations save time, increase revenue, and restore trust in their data. Visit dataengineeringpodcast.com/montecarlo today to request a demo and see how Monte Carlo delivers data observability across your data infrastructure. The first 25 will receive a free, limited edition Monte Carlo hat!
Your host is Tobias Macey and today I’m interviewing Gavin Mendel-Gleason about TerminusDB, an open source model driven graph database for knowledge graph representation
Interview
Introduction
How did you get involved in the area of data management?
Can you start by describing what TerminusDB is and what motivated you to build it?
What are the use cases that TerminusDB and TerminusHub are designed for?
There are a number of different reasons and methods for versioning data, such as the work being done with Datomic, LakeFS, DVC, etc. Where does TerminusDB fit in relation to those and other data versioning systems that are available today?
Can you describe how TerminusDB is implemented?
How has the design changed or evolved since you first began working on it?
What was the decision process and design considerations that led you to choose Prolog as the implementation language?
One of the challenges that have faced other knowledge engines built around RDF is that of scale and performance. How are you addressing those difficulties in TerminusDB?
What are the scaling factors and limitations for TerminusDB? (e.g. volumes of data, clustering, etc.)
How does the use of RDF triples and JSON-LD impact the audience for TerminusDB?
How much overhead is incurred by maintaining a long history of changes for a database?
How do you handle garbage collection/compaction of versions?
How does the availability of branching and merging strategies change the approach that data teams take when working on a project?
What are the edge cases in merging and conflict resolution, and what tools does TerminusDB/TerminusHub provide for working through those situations?
What are some useful strategies that teams should be aware of for working effectively with collaborative datasets in TerminusDB?
Another interesting element of the TerminusDB platform is the query language. What did you use as inspiration for designing it and how much of a learning curve is involved?
What are some of the most interesting, innovative, or unexpected ways that you have seen TerminusDB used?
What are the most interesting, unexpected, or challenging lessons that you have learned while building TerminusDB and TerminusHub?
When is TerminusDB the wrong choice?
What do you have planned for the future of the project?
Contact Info
@GavinMGleason on Twitter
LinkedIn
GavinMendelGleason on GitHub
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
TerminusDB
TerminusHub
Chem Informatics
Type Theory
Graph Database
Trinity College Dublin
Sesshat Databank analytics over civilizations in history
PostgreSQL
DGraph
Grakn
Neo4J
Datomic
LakeFS
DVC
Dolt
Persistent Succinct Data Structure
Currying
Prolog
WOQL TerminusDB query language
RDF
JSON-LD
Semantic Web
Property Graph
Hypergraph
Super Node
Bloom Filters
Data Curation
Podcast Episode
CRDT == Conflict-Free Replicated Data Types
Podcast Episode
SPARQL
Datalog
AST == Abstract Syntax Tree
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
1/11/2021 • 57 minutes, 48 seconds
Bringing Feature Stores and MLOps to the Enterprise at Tecton
Summary
As more organizations are gaining experience with data management and incorporating analytics into their decision making, their next move is to adopt machine learning. In order to make those efforts sustainable, the core capability they need is for data scientists and analysts to be able to build and deploy features in a self service manner. As a result, the feature store is becoming a required piece of the data platform. To fill that need, Kevin Stumpf and the team at Tecton are building an enterprise feature store as a service. In this episode he explains how his experience building the Michelangelo platform at Uber has informed the design and architecture of Tecton, how it integrates with your existing data systems, and the elements that are required for a well engineered feature store.
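For listeners new to the concept, the sketch below illustrates the general feature store idea discussed in this episode: compute a feature from raw events with a windowed transformation, then key the latest values by entity id so they can be fetched at inference time. All names and data here are hypothetical, and the snippet is not Tecton's actual API.

# Hypothetical sketch of the feature store idea, not Tecton's API.
from collections import defaultdict
from datetime import datetime, timedelta

events = [
    {"user_id": "u1", "amount": 25.0, "ts": datetime(2021, 1, 3)},
    {"user_id": "u1", "amount": 40.0, "ts": datetime(2021, 1, 4)},
    {"user_id": "u2", "amount": 10.0, "ts": datetime(2021, 1, 4)},
]

def user_spend_7d(events, as_of):
    # Feature transformation: total spend per user over the trailing 7 days.
    window_start = as_of - timedelta(days=7)
    totals = defaultdict(float)
    for e in events:
        if window_start <= e["ts"] <= as_of:
            totals[e["user_id"]] += e["amount"]
    return dict(totals)

# "Materialize" the feature values into an online store (here just a dict) for serving.
online_store = user_spend_7d(events, as_of=datetime(2021, 1, 5))
print(online_store.get("u1", 0.0))  # 65.0, looked up by the model at inference time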
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $60 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
Do you want to get better at Python? Now is an excellent time to take an online course. Whether you’re just learning Python or you’re looking for deep dives on topics like APIs, memory management, async and await, and more, our friends at Talk Python Training have a top-notch course for you. If you’re just getting started, be sure to check out the Python for Absolute Beginners course. It’s like the first year of computer science that you never took, compressed into 10 fun hours of Python coding and problem solving. Go to dataengineeringpodcast.com/talkpython today and get 10% off the course that will help you find your next level. That’s dataengineeringpodcast.com/talkpython, and don’t forget to thank them for supporting the show.
You invest so much in your data infrastructure – you simply can’t afford to settle for unreliable data. Fortunately, there’s hope: in the same way that New Relic, DataDog, and other Application Performance Management solutions ensure reliable software and keep application downtime at bay, Monte Carlo solves the costly problem of broken data pipelines. Monte Carlo’s end-to-end Data Observability Platform monitors and alerts for data issues across your data warehouses, data lakes, ETL, and business intelligence. The platform uses machine learning to infer and learn your data, proactively identify data issues, assess its impact through lineage, and notify those who need to know before it impacts the business. By empowering data teams with end-to-end data reliability, Monte Carlo helps organizations save time, increase revenue, and restore trust in their data. Visit dataengineeringpodcast.com/montecarlo today to request a demo and see how Monte Carlo delivers data observability across your data infrastructure. The first 25 will receive a free, limited edition Monte Carlo hat!
Your host is Tobias Macey and today I’m interviewing Kevin Stumpf about Tecton and the role that the feature store plays in a modern MLOps platform
Interview
Introduction
How did you get involved in the area of data management?
Can you start by describing what you are building at Tecton and your motivation for starting the business?
For anyone who isn’t familiar with the concept, what is an example of a feature?
How do you define what a feature store is?
What role does a feature store play in the overall lifecycle of a machine learning project?
How would you characterize the current landscape of feature stores?
What are the other components that are necessary for a complete ML operations platform?
At what points in the lifecycle of data does the feature store get integrated?
What types of data can feature stores manage? (e.g. text vs. image/binary vs. spatial, etc.)
How is the Tecton platform implemented?
How has the design evolved since you first began building it?
How did your work on Uber’s Michelangelo inform your work on Tecton?
What is the workflow and lifecycle of developing, testing, and deploying a feature to a feature store?
What aspects of a feature do you monitor to determine whether it has drifted?
How do you define drift in the context of a feature?
How does that differ from drift in an ML model?
How does Tecton handle versioning of features and associating those different versions with the models that are using them?
What are some of the most interesting, innovative, or unexpected projects that you have seen built with Tecton?
When is Tecton the wrong choice?
What do you have planned for the future of the product?
Contact Info
LinkedIn
kevinstumpf on GitHub
@kevinstumpf on Twitter
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Links
Tecton
Uber Michelangelo
MLOps
Feature Store
Blog: What Is A Feature Store
StreamSQL
Podcast Episode
AWS Feature Store
Logical Clocks
EMR
Kotlin
DynamoDB
scikit-learn
Tensorflow
MLFlow
Algorithmia
SageMaker
Feast open source feature store
Jaeger
OpenTelemetry
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
1/5/2021 • 47 minutes, 40 seconds
Off The Shelf Data Governance With Satori
Summary
One of the core responsibilities of data engineers is to manage the security of the information that they process. The team at Satori has a background in cybersecurity and they are using the lessons that they learned in that field to address the challenge of access control and auditing for data governance. In this episode co-founder and CTO Yoav Cohen explains how the Satori platform provides a proxy layer for your data, the challenges of managing security across disparate storage systems, and their approach to building a dynamic data catalog based on the records that your organization is actually using. This is an interesting conversation about the intersection of data and security and the lessons that can be learned in each direction.
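As a rough illustration of what a data access proxy layer can do, the sketch below masks columns that have been classified as sensitive before rows are returned to a client. The column classification, masking function, and sample data are all assumptions for the example; this is not how Satori is implemented.

# Conceptual sketch only, not Satori's implementation: a proxy between clients
# and the data store can mask values in sensitive columns before returning rows.
import hashlib

SENSITIVE_COLUMNS = {"email", "ssn"}  # stand-in for an automated classification

def mask(value):
    # Replace a sensitive value with a stable, irreversible token.
    return hashlib.sha256(str(value).encode()).hexdigest()[:12]

def filter_row(row):
    return {col: (mask(val) if col in SENSITIVE_COLUMNS else val)
            for col, val in row.items()}

rows = [{"user_id": 1, "email": "jane@example.com", "plan": "pro"}]
print([filter_row(r) for r in rows])
# The email value comes back as a 12-character token; other columns pass through unchanged.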
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $60 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
Your host is Tobias Macey and today I’m interviewing Yoav Cohen about Satori, a data access service to monitor, classify and control access to sensitive data
Interview
Introduction
How did you get involved in the area of data management?
Can you start by describing what you have built at Satori?
What is the story behind the product and company?
How does Satori compare to other tools and products for managing access control and governance for data assets?
What are the biggest challenges that organizations face in establishing and enforcing policies for their data?
What are the main goals for the Satori product and what use cases does it enable?
Can you describe how the Satori platform is architected?
How has the design of the platform evolved since you first began working on it?
How have your experiences working in cyber security informed your approach to data governance?
How does the design of the Satori platform simplify technical aspects of data governance?
What aspects of governance do you delegate to other systems or platforms?
What elements of data infrastructure does Satori integrate with?
For someone who is adopting Satori, what is involved in getting it deployed and set up with their existing data platforms?
What do you see as being the most complex or underserved aspects of data governance?
How much of that complexity is inherent to the problem vs. being a result of how the industry has evolved?
What are some of the most interesting, innovative, or unexpected ways that you have seen the Satori platform used?
What are the most interesting, unexpected, or challenging lessons that you have learned while building Satori?
When is Satori the wrong choice?
What do you have planned for the future of the platform?
Contact Info
LinkedIn
@yoavcohen on Twitter
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Links
Satori
Data Governance
Data Masking
TLS == Transport Layer Security
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
12/28/2020 • 34 minutes, 24 seconds
Low Friction Data Governance With Immuta
Summary
Data governance is a term that encompasses a wide range of responsibilities, both technical and process oriented. One of the more complex aspects is that of access control to the data assets that an organization is responsible for managing. The team at Immuta has built a platform that aims to tackle that problem in a flexible and maintainable fashion so that data teams can easily integrate authorization, data masking, and privacy enhancing technologies into their data infrastructure. In this episode Steve Touw and Stephen Bailey share what they have built at Immuta, how it is implemented, and how it streamlines the workflow for everyone involved in working with sensitive data. If you are starting down the path of implementing a data governance strategy then this episode will provide a great overview of what is involved.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
What are the pieces of advice that you wish you had received early in your career of data engineering? If you hand a book to a new data engineer, what wisdom would you add to it? I’m working with O’Reilly on a project to collect the 97 things that every data engineer should know, and I need your help. Go to dataengineeringpodcast.com/97things to add your voice and share your hard-earned expertise.
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $60 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
Feature flagging is a simple concept that enables you to ship faster, test in production, and do easy rollbacks without redeploying code. Teams using feature flags release new software with less risk, and release more often. ConfigCat is a feature flag service that lets you easily add flags to your Python code, and 9 other platforms. By adopting ConfigCat you and your manager can track and toggle your feature flags from their visual dashboard without redeploying any code or configuration, including granular targeting rules. You can roll out new features to a subset of your users for beta testing or canary deployments. With their simple API, clear documentation, and pricing that is independent of your team size you can get your first feature flags added in minutes without breaking the bank. Go to dataengineeringpodcast.com/configcat today to get 35% off any paid plan with code DATAENGINEERING or try out their free forever plan.
You invest so much in your data infrastructure – you simply can’t afford to settle for unreliable data. Fortunately, there’s hope: in the same way that New Relic, DataDog, and other Application Performance Management solutions ensure reliable software and keep application downtime at bay, Monte Carlo solves the costly problem of broken data pipelines. Monte Carlo’s end-to-end Data Observability Platform monitors and alerts for data issues across your data warehouses, data lakes, ETL, and business intelligence. The platform uses machine learning to infer and learn your data, proactively identify data issues, assess its impact through lineage, and notify those who need to know before it impacts the business. By empowering data teams with end-to-end data reliability, Monte Carlo helps organizations save time, increase revenue, and restore trust in their data. Visit dataengineeringpodcast.com/montecarlo today to request a demo and see how Monte Carlo delivers data observability across your data infrastructure. The first 25 will receive a free, limited edition Monte Carlo hat!
Your host is Tobias Macey and today I’m interviewing Steve Touw and Stephen Bailey about Immuta and how they work to automate data governance
Interview
Introduction
How did you get involved in the area of data management?
Can you start by describing what you have built at Immuta and your motivation for starting the company?
What is data governance?
How much of data governance can be solved with technology and how much is a matter of process and communication?
What does the current landscape of data governance solutions look like?
What are the motivating factors that would lead someone to choose Immuta as a component of their data governance strategy?
How does Immuta integrate with the broader ecosystem of data tools and platforms?
What other workflows or activities are necessary outside of Immuta to ensure a comprehensive governance/compliance strategy?
What are some of the common blind spots when it comes to data governance?
How is the Immuta platform architected?
How have the design and goals of the system evolved since you first started building it?
What is involved in adopting Immuta for an existing data platform?
Once an organization has integrated Immuta, what are the workflows for the different stakeholders of the data?
What are the biggest challenges in automated discovery/identification of sensitive data?
How does the evolution of what qualifies as sensitive complicate those efforts?
How do you approach the challenge of providing a unified interface for access control and auditing across different systems (e.g. BigQuery, Snowflake, RedShift, etc.)?
What are the complexities that creep into data masking?
What are some alternatives for obfuscating and managing access to sensitive information?
How do you handle managing access control/masking/tagging for derived data sets?
What are some of the most interesting, unexpected, or challenging lessons that you have learned while building Immuta?
When is Immuta the wrong choice?
What do you have planned for the future of the platform and business?
Contact Info
Steve
LinkedIn
@steve_touw on Twitter
Stephen
LinkedIn
Website
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Links
Immuta
Data Governance
Data Catalog
Snowflake DB
Podcast Episode
Looker
Podcast Episode
Collibra
ABAC == Attribute Based Access Control
RBAC == Role Based Access Control
Paul Ohm: Broken Promises of Privacy
PET == Privacy Enhancing Technologies
K Anonymization
Differential Privacy
LDAP == Lightweight Directory Access Protocol
Active Directory
COVID Alliance
HIPAA
GDPR
CCPA
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
12/21/2020 • 53 minutes, 33 seconds
Building A Self Service Data Platform For Alternative Data Analytics At YipitData
Summary
As a data engineer you’re familiar with the process of collecting data from databases, customer data platforms, APIs, etc. At YipitData they rely on a variety of alternative data sources to inform investment decisions by hedge funds and businesses. In this episode Andrew Gross, Bobby Muldoon, and Anup Segu describe the self service data platform that they have built to allow data analysts to own the end-to-end delivery of data projects and how that has allowed them to scale their output. They share the journey that they went through to build a scalable and maintainable system for web scraping, how to make it reliable and resilient to errors, and the lessons that they learned in the process. This was a great conversation about real world experiences in building a successful data-oriented business.
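A common pattern that makes scraping pipelines more resilient, in the spirit of what is discussed in this episode, is to retry fetches with exponential backoff and archive the raw responses so that parsing can be replayed without re-crawling. The sketch below shows that pattern with the Python standard library; it is a generic illustration, not YipitData's Readypipe implementation.

# Illustrative resiliency pattern for scraping jobs, not Readypipe itself:
# retry with exponential backoff, then persist the raw payload for replayable parsing.
import json
import time
import urllib.request

def fetch_with_retries(url, attempts=4, base_delay=1.0):
    for attempt in range(attempts):
        try:
            with urllib.request.urlopen(url, timeout=10) as resp:
                return resp.read()
        except Exception:
            if attempt == attempts - 1:
                raise
            time.sleep(base_delay * (2 ** attempt))  # exponential backoff between retries

def archive_raw(url, body, path="raw_pages.jsonl"):
    # Store the raw response so downstream parsing can be re-run without re-fetching.
    with open(path, "a") as f:
        f.write(json.dumps({"url": url, "body": body.decode("utf-8", "replace")}) + "\n")

body = fetch_with_retries("https://example.com")
archive_raw("https://example.com", body)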
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
What are the pieces of advice that you wish you had received early in your career of data engineering? If you hand a book to a new data engineer, what wisdom would you add to it? I’m working with O’Reilly on a project to collect the 97 things that every data engineer should know, and I need your help. Go to dataengineeringpodcast.com/97things to add your voice and share your hard-earned expertise.
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $60 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
Are you bogged down by having to manually manage data access controls, repeatedly move and copy data, and create audit reports to prove compliance? How much time could you save if those tasks were automated across your cloud platforms? Immuta is an automated data governance solution that enables safe and easy data analytics in the cloud. Our comprehensive data-level security, auditing and de-identification features eliminate the need for time-consuming manual processes and our focus on data and compliance team collaboration empowers you to deliver quick and valuable data analytics on the most sensitive data to unlock the full potential of your cloud data platforms. Learn how we streamline and accelerate manual processes to help you derive real results from your data at dataengineeringpodcast.com/immuta.
Today’s episode of the Data Engineering Podcast is sponsored by Datadog, a SaaS-based monitoring and analytics platform for cloud-scale infrastructure, applications, logs, and more. Datadog uses machine-learning based algorithms to detect errors and anomalies across your entire stack—which reduces the time it takes to detect and address outages and helps promote collaboration between Data Engineering, Operations, and the rest of the company. Go to dataengineeringpodcast.com/datadog today to start your free 14 day trial. If you start a trial and install Datadog’s agent, Datadog will send you a free T-shirt.
Your host is Tobias Macey and today I’m interviewing Andrew Gross, Bobby Muldoon, and Anup Segu about how they are building pipelines at Yipit Data
Interview
Introduction
How did you get involved in the area of data management?
Can you start by giving an overview of what YipitData does?
What kinds of data sources and data assets are you working with?
What is the composition of your data teams and how are they structured?
Given the use of your data products in the financial sector how do you handle monitoring and alerting around data quality?
For web scraping in particular, given how fragile it can be, what have you done to make it a reliable and repeatable part of the data pipeline?
Can you describe how your data platform is implemented?
How has the design of your platform and its goals evolved or changed?
What is your guiding principle for providing an approachable interface to analysts?
How much knowledge do your analysts require about the guarantees offered, and edge cases to be aware of in the underlying data and its processing?
What are some examples of specific tools that you have built to empower your analysts to own the full lifecycle of the data that they are working with?
Can you characterize or quantify the benefits that you have seen from training the analysts to work with the engineering tool chain?
What have been some of the most interesting, unexpected, or surprising outcomes of how you are approaching the different responsibilities and levels of ownership in your data organization?
What are some of the most interesting, unexpected, or challenging lessons that you have learned from building out the platform, tooling, and organizational structure for creating data products at Yipit?
What advice or recommendations do you have for other leaders of data teams about how to think about the organizational and technical aspects of managing the lifecycle of data projects?
Contact Info
Andrew
LinkedIn
@awgross on Twitter
Bobby
LinkedIn
@TheDooner64
Anup
LinkedIn
anup-segu on GitHub
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
Yipit Data
Redshift
MySQL
Airflow
Databricks
Groupon
Living Social
Web Scraping
Podcast.__init__ Episode
Readypipe
Graphite
Podcast.__init__ Episode
AWS Kinesis Firehose
Parquet
Papermill
Podcast Episode About Notebooks At Netflix
Fivetran
Podcast Episode
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
12/15/2020 • 1 hour, 4 minutes, 47 seconds
Proven Patterns For Building Successful Data Teams
Summary
Building data products is complicated by the fact that there are so many different stakeholders with competing goals and priorities. It is also challenging because of the number of roles and capabilities that are necessary to go from idea to delivery. Different organizations have tried a multitude of organizational strategies to improve the effectiveness of these data teams, with varying levels of success. In this episode Jesse Anderson shares the lessons that he has learned while working with dozens of businesses across industries to determine the team structures and communication styles that have generated the best results. If you are struggling to deliver value from big data, or just starting down the path of building the organizational capacity to turn raw information into valuable products, then this is a conversation that you don’t want to miss.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
What are the pieces of advice that you wish you had received early in your career of data engineering? If you hand a book to a new data engineer, what wisdom would you add to it? I’m working with O’Reilly on a project to collect the 97 things that every data engineer should know, and I need your help. Go to dataengineeringpodcast.com/97things to add your voice and share your hard-earned expertise.
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $60 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
Are you bogged down by having to manually manage data access controls, repeatedly move and copy data, and create audit reports to prove compliance? How much time could you save if those tasks were automated across your cloud platforms? Immuta is an automated data governance solution that enables safe and easy data analytics in the cloud. Our comprehensive data-level security, auditing and de-identification features eliminate the need for time-consuming manual processes and our focus on data and compliance team collaboration empowers you to deliver quick and valuable data analytics on the most sensitive data to unlock the full potential of your cloud data platforms. Learn how we streamline and accelerate manual processes to help you derive real results from your data at dataengineeringpodcast.com/immuta.
Today’s episode of the Data Engineering Podcast is sponsored by Datadog, a SaaS-based monitoring and analytics platform for cloud-scale infrastructure, applications, logs, and more. Datadog uses machine-learning based algorithms to detect errors and anomalies across your entire stack—which reduces the time it takes to detect and address outages and helps promote collaboration between Data Engineering, Operations, and the rest of the company. Go to dataengineeringpodcast.com/datadog today to start your free 14 day trial. If you start a trial and install Datadog’s agent, Datadog will send you a free T-shirt.
Your host is Tobias Macey and today I’m interviewing Jesse Anderson about best practices for organizing and managing data teams
Interview
Introduction
How did you get involved in the area of data management?
Can you start by giving an overview of how you view the mission and responsibilities of a data team?
What are the critical elements of a successful data team?
Beyond the core pillars of data science, data engineering, and operations, what other specialized roles do you find helpful for larger or more sophisticated teams?
For organizations that have "small data", how does that change the necessary composition of roles for successful data projects?
What are the signs and symptoms that point to the need for a dedicated team that focuses on data?
With data scientists and data engineers in particular being in such high demand, what are strategies that you have found effective for attracting new talent?
In the case where you have engineers on staff, how do you identify internal talent that can be trained into these specialized roles?
Another challenge that organizations face in dealing with data is how the team is organized. What are your thoughts on effective strategies for how to structure the communication and reporting structures of data teams? (e.g. centralized, embedded, etc.)
How do you recommend evaluating potential candidates for each of the necessary roles?
What are your thoughts on when to hire an outside consultant, vs building internal capacity?
For managers who are responsible for data teams, how much understanding of data and analytics do they need to be effective?
How do you define success or measure performance of a team focused on working with data?
What are some of the anti-patterns that you have seen in managers who oversee data professionals?
What are some of the most interesting, unexpected, or challenging lessons that you have learned in the process of helping organizations and individuals achieve success in data and analytics?
What advice or additional resources do you have for anyone who is interested in learning more about how to build and grow a successful data team?
Contact Info
Website
@jessetanderson on Twitter
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Links
Data Teams Book
DBA == Database Administrator
ML Engineer
DataOps
Three Vs
The Ultimate Guide To Switching Careers To Big Data
S-1 Report
Jesse Anderson’s Youtube Channel
Video about interviewing for data teams
Uber Data Infrastructure Progression Blog Post
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
12/7/2020 • 1 hour, 12 minutes, 30 seconds
Streaming Data Integration Without The Code at Equalum
Summary
The first stage of every good pipeline is to perform data integration. With the increasing pace of change and the demand for up-to-date analytics, the need to integrate that data in near real time is growing. With the improvements and increased variety of options for streaming data engines and improved tools for change data capture, it is possible for data teams to make that goal a reality. However, despite all of the tools and managed distributions of those streaming engines, it is still a challenge to build a robust and reliable pipeline for streaming data integration, especially if you need to expose those capabilities to non-engineers. In this episode Ido Friedman, CTO of Equalum, explains how they have built a no-code platform to make integration of streaming data and change data capture feeds easier to manage. He discusses the challenges that are inherent in the current state of CDC technologies, how they have architected their system to integrate well with existing data platforms, and how to build an appropriate level of abstraction for such a complex problem domain. If you are struggling with streaming data integration and change data capture then this interview is definitely worth a listen.
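For context on what a change data capture feed carries, the sketch below shows the general shape of a change event (loosely modeled on the Debezium-style envelope linked in the show notes) and a naive way to apply it to a keyed replica of the source table. The field names and data are illustrative assumptions, not Equalum's internal format.

# Illustrative only: a generic change-data-capture event and a naive apply step.
change_event = {
    "op": "u",                                        # c = insert, u = update, d = delete
    "before": {"id": 42, "status": "pending"},
    "after":  {"id": 42, "status": "shipped"},
    "source": {"table": "orders", "lsn": 901235},
}

target = {42: {"id": 42, "status": "pending"}}        # keyed replica of the source table

def apply_change(event, table):
    key = (event["after"] or event["before"])["id"]
    if event["op"] == "d":
        table.pop(key, None)                          # deletes remove the row
    else:
        table[key] = event["after"]                   # inserts and updates upsert the new image

apply_change(change_event, target)
print(target[42])  # {'id': 42, 'status': 'shipped'}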
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
What are the pieces of advice that you wish you had received early in your career of data engineering? If you hand a book to a new data engineer, what wisdom would you add to it? I’m working with O’Reilly on a project to collect the 97 things that every data engineer should know, and I need your help. Go to dataengineeringpodcast.com/97things to add your voice and share your hard-earned expertise.
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $60 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
Modern Data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days. Datafold helps Data teams gain visibility and confidence in the quality of their analytical data through data profiling, column-level lineage and intelligent anomaly detection. Datafold also helps automate regression testing of ETL code with its Data Diff feature that instantly shows how a change in ETL or BI code affects the produced data, both on a statistical level and down to individual rows and values. Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI workflows. Go to dataengineeringpodcast.com/datafold today to start a 30-day trial of Datafold. Once you sign up and create an alert in Datafold for your company data, they will send you a cool water flask.
Are you bogged down by having to manually manage data access controls, repeatedly move and copy data, and create audit reports to prove compliance? How much time could you save if those tasks were automated across your cloud platforms? Immuta is an automated data governance solution that enables safe and easy data analytics in the cloud. Our comprehensive data-level security, auditing and de-identification features eliminate the need for time-consuming manual processes and our focus on data and compliance team collaboration empowers you to deliver quick and valuable data analytics on the most sensitive data to unlock the full potential of your cloud data platforms. Learn how we streamline and accelerate manual processes to help you derive real results from your data at dataengineeringpodcast.com/immuta.
Your host is Tobias Macey and today I’m interviewing Ido Friedman about Equalum, a no-code platform for streaming data integration
Interview
Introduction
How did you get involved in the area of data management?
Can you start by giving an overview of what you are building at Equalum and how it got started?
There are a number of projects and platforms on the market that target data integration. Can you give some context of how Equalum fits in that market and the differentiating factors that engineers should consider?
What components of the data ecosystem might Equalum replace, and which are you designed to integrate with?
Can you walk through the workflow for someone who is using Equalum for a simple data integration use case?
What options are available for doing in-flight transformations of data or creating customized routing rules?
How do you handle versioning and staged rollouts of changes to pipelines?
How is the Equalum platform implemented?
How has the design and architecture of Equalum evolved since it was first created?
What have you found to be the most complex or challenging aspects of building the platform?
Change data capture is a growing area of interest, with a significant level of difficulty in implementing well. How do you handle support for the variety of different sources that customers are working with?
What are the edge cases that you typically run into when working with changes in databases?
How do you approach the user experience of the platform given its focus as a low code/no code system?
What options exist for sophisticated users to create custom operations?
How much of the underlying concerns do you surface to end users, and how much are you able to hide?
What is the process for a customer to integrate Equalum into their existing infrastructure and data systems?
What are some of the most interesting, unexpected, or innovative ways that you have seen Equalum used?
What are the most interesting, unexpected, or challenging lessons that you have learned while building and growing the Equalum platform?
When is Equalum the wrong choice?
What do you have planned for the future of Equalum?
Contact Info
LinkedIn
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Links
Equalum
Change Data Capture
Debezium Podcast Episode
SQL Server
DBA == Database Administrator
Fivetran
Podcast Episode
Singer
Pentaho
EMR
Snowflake
Podcast Episode
S3
Kafka
Spark
Prometheus
Grafana
Logminer
OBLP == Oracle Binary Log Parser
Ansible
Terraform
Jupyter Notebooks
Papermill
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
11/30/2020 • 44 minutes, 50 seconds
Keeping A Bigeye On The Data Quality Market
Summary
One of the oldest aphorisms about data is "garbage in, garbage out", which is why the current boom in data quality solutions is no surprise. With the growth in projects, platforms, and services that aim to help you establish and maintain control of the health and reliability of your data pipelines it can be overwhelming to stay up to date with how they all compare. In this episode Egor Gryaznov, CTO of Bigeye, joins the show to explore the landscape of data quality companies, the general strategies that they are using, and what problems they solve. He also shares how his own product is designed and the challenges that are involved in building a system to help data engineers manage the complexity of a data platform. If you are wondering how to get better control of your own pipelines and the traps to avoid then this episode is definitely worth a listen.
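To ground the discussion, the sketch below shows the kind of checks that data quality and observability tools automate: computing a freshness metric and a null-rate metric with SQL and comparing them against thresholds. The table, thresholds, and check names are invented for the example; this is not Bigeye's API or rule engine.

# A minimal sketch of automated data quality checks: freshness and null rate.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, email TEXT, loaded_at TEXT)")
conn.execute("INSERT INTO orders VALUES (1, 'a@x.com', '2021-01-05'), (2, NULL, '2021-01-05')")

freshness = conn.execute("SELECT MAX(loaded_at) FROM orders").fetchone()[0]
null_rate = conn.execute(
    "SELECT AVG(CASE WHEN email IS NULL THEN 1.0 ELSE 0.0 END) FROM orders"
).fetchone()[0]

checks = {
    "freshness_ok": freshness >= "2021-01-05",    # data landed today
    "email_null_rate_ok": null_rate <= 0.10,      # at most 10% of emails may be missing
}
print(checks)  # {'freshness_ok': True, 'email_null_rate_ok': False}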
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
What are the pieces of advice that you wish you had received early in your career of data engineering? If you hand a book to a new data engineer, what wisdom would you add to it? I’m working with O’Reilly on a project to collect the 97 things that every data engineer should know, and I need your help. Go to dataengineeringpodcast.com/97things to add your voice and share your hard-earned expertise.
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $60 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
Modern Data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days. Datafold helps Data teams gain visibility and confidence in the quality of their analytical data through data profiling, column-level lineage and intelligent anomaly detection. Datafold also helps automate regression testing of ETL code with its Data Diff feature that instantly shows how a change in ETL or BI code affects the produced data, both on a statistical level and down to individual rows and values. Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI workflows. Go to dataengineeringpodcast.com/datafold today to start a 30-day trial of Datafold. Once you sign up and create an alert in Datafold for your company data, they will send you a cool water flask.
Are you bogged down by having to manually manage data access controls, repeatedly move and copy data, and create audit reports to prove compliance? How much time could you save if those tasks were automated across your cloud platforms? Immuta is an automated data governance solution that enables safe and easy data analytics in the cloud. Our comprehensive data-level security, auditing and de-identification features eliminate the need for time-consuming manual processes and our focus on data and compliance team collaboration empowers you to deliver quick and valuable data analytics on the most sensitive data to unlock the full potential of your cloud data platforms. Learn how we streamline and accelerate manual processes to help you derive real results from your data at dataengineeringpodcast.com/immuta.
Your host is Tobias Macey and today I’m interviewing Egor Gryaznov about the state of the industry for data quality management and what he is building at Bigeye.
Interview
Introduction
How did you get involved in the area of data management?
Can you start by sharing your views on what attributes you consider when defining data quality?
You use the term "data semantics" – can you elaborate on what that means?
What are the driving factors that contribute to the presence or lack of data quality in an organization or data platform?
Why do you think now is the right time to focus on data quality as an industry?
What are you building at Bigeye and how did it get started?
How does Bigeye help teams understand and manage their data quality?
What is the difference between existing data quality approaches and data observability?
What do you see as the tradeoffs for the approach that you are taking at Bigeye?
What are the most common data quality issues that you’ve seen and what are some more interesting ones that you wouldn’t expect?
Where do you see Bigeye fitting into the data management landscape? What are alternatives to Bigeye?
What are some of the most interesting, innovative, or unexpected ways that you have seen Bigeye being used?
What are some of the most interesting homegrown approaches that you have seen?
What have you found to be the most interesting, unexpected, or challenging lessons that you have learned while building the Bigeye platform and business?
What are the biggest trends you’re following in data quality management?
When is Bigeye the wrong choice?
What do you see in store for the future of Bigeye?
Contact Info
You can email Egor about anything data
LinkedIn
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Links
Bigeye
Uber
A/B Testing
Hadoop
MapReduce
Apache Impala
One King’s Lane
Vertica
Mode
Tableau
Jupyter Notebooks
Redshift
Snowflake
PyTorch
Podcast.__init__ Episode
Tensorflow
DataOps
DevOps
Data Catalog
DBT
Podcast Episode
SRE Handbook
Article About How Uber Applied SRE Principles to Data
SLA == Service Level Agreement
SLO == Service Level Objective
Dagster
Podcast Episode
Podcast.__init__ Episode
Delta Lake
Great Expectations
Podcast Episode
Podcast.__init__ Episode
Amundsen
Podcast Episode
Alation
Collibra
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
11/23/2020 • 49 minutes, 25 seconds
Self Service Data Management From Ingest To Insights With Isima
Summary
The core mission of data engineers is to provide the business with a way to ask and answer questions of their data. This often takes the form of business intelligence dashboards, machine learning models, or APIs on top of a cleaned and curated data set. Despite the rapid progression of impressive tools and products built to fulfill this mission, it is still an uphill battle to tie everything together into a cohesive and reliable platform. At Isima they decided to reimagine the entire ecosystem from the ground up and built a single unified platform to allow end-to-end self service workflows from data ingestion through to analysis. In this episode CEO and co-founder of Isima Darshan Rawal explains how the biOS platform is architected to enable ease of use, the challenges that were involved in building an entirely new system from scratch, and how it can integrate with the rest of your data platform to allow for incremental adoption. This was an interesting and contrarian take on the current state of the data management industry and is worth a listen to gain some additional perspective.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
What are the pieces of advice that you wish you had received early in your career of data engineering? If you hand a book to a new data engineer, what wisdom would you add to it? I’m working with O’Reilly on a project to collect the 97 things that every data engineer should know, and I need your help. Go to dataengineeringpodcast.com/97things to add your voice and share your hard-earned expertise.
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $60 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
Modern Data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days. Datafold helps Data teams gain visibility and confidence in the quality of their analytical data through data profiling, column-level lineage and intelligent anomaly detection. Datafold also helps automate regression testing of ETL code with its Data Diff feature that instantly shows how a change in ETL or BI code affects the produced data, both on a statistical level and down to individual rows and values. Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI workflows. Follow go.datafold.com/dataengineeringpodcast to start a 30-day trial of Datafold. Once you sign up and create an alert in Datafold for your company data, they will send you a cool water flask.
Are you bogged down by having to manually manage data access controls, repeatedly move and copy data, and create audit reports to prove compliance? How much time could you save if those tasks were automated across your cloud platforms? Immuta is an automated data governance solution that enables safe and easy data analytics in the cloud. Our comprehensive data-level security, auditing and de-identification features eliminate the need for time-consuming manual processes and our focus on data and compliance team collaboration empowers you to deliver quick and valuable data analytics on the most sensitive data to unlock the full potential of your cloud data platforms. Learn how we streamline and accelerate manual processes to help you derive real results from your data at dataengineeringpodcast.com/immuta.
Your host is Tobias Macey and today I’m interviewing Darshan Rawal about Îsíma, a unified platform for building data applications
Interview
Introduction
How did you get involved in the area of data management?
Can you start by giving an overview of what you are building at Îsíma?
What was your motivation for creating a new platform for data applications?
What is the story behind the name?
What are the tradeoffs of a fully integrated platform vs a modular approach?
What components of the data ecosystem does Isima replace, and which does it integrate with?
What are the use cases that Isima enables which were previously impractical?
Can you describe how Isima is architected?
How has the design of the platform changed or evolved since you first began working on it?
What were your initial ideas or assumptions that have been changed or invalidated as you worked through the problem you’re addressing?
With a focus on the enterprise, how did you approach the user experience design to allow for organizational complexity?
One of the biggest areas of difficulty that many data systems face is security and scalable access control. How do you tackle that problem in your platform?
How did you address the issue of geographical distribution of data and users?
Can you talk through the overall lifecycle of data as it traverses the bi(OS) platform from ingestion through to presentation?
What is the workflow for someone using bi(OS)?
What are some of the most interesting, innovative, or unexpected ways that you have seen bi(OS) used?
What have you found to be the most interesting, unexpected, or challenging aspects of building the bi(OS) platform?
When is it the wrong choice?
What do you have planned for the future of Isima and bi(OS)?
Contact Info
LinkedIn
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Links
Îsíma
Datastax
Verizon
AT&T
Click Fraud
ESB == Enterprise Service Bus
ETL == Extract, Transform, Load
EDW == Enterprise Data Warehouse
BI == Business Intelligence
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
11/17/2020 • 44 minutes, 2 seconds
Building A Cost Effective Data Catalog With Tree Schema
Summary
A data catalog is a critical piece of infrastructure for any organization that wants to build analytics products, whether internal or external. While there are a number of platforms available for building that catalog, many of them are either difficult to deploy and integrate, or expensive to use at scale. In this episode Grant Seward explains how he built Tree Schema to be an easy to use and cost effective option for organizations to build their data catalogs. He also shares the internal architecture, how he approached the design to make it accessible and easy to use, and how it autodiscovers the schemas and metadata for your source systems.
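As a rough sketch of how schema autodiscovery can work against a relational source, the example below introspects tables and columns with SQLAlchemy's inspector and emits simple catalog entries. The throwaway SQLite database and field names are assumptions for illustration; this is not Tree Schema's actual crawler.

# Rough sketch of catalog autodiscovery for a relational source, not Tree Schema's crawler.
import sqlite3
from sqlalchemy import create_engine, inspect

# Create a throwaway SQLite database to stand in for a real source system.
db = sqlite3.connect("demo_source.db")
db.execute("CREATE TABLE IF NOT EXISTS users (id INTEGER PRIMARY KEY, email TEXT)")
db.commit()
db.close()

engine = create_engine("sqlite:///demo_source.db")
inspector = inspect(engine)

catalog = []
for table in inspector.get_table_names():
    for column in inspector.get_columns(table):
        catalog.append({
            "table": table,
            "column": column["name"],
            "type": str(column["type"]),
        })

print(catalog)
# [{'table': 'users', 'column': 'id', 'type': 'INTEGER'}, {'table': 'users', 'column': 'email', 'type': 'TEXT'}]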
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
What are the pieces of advice that you wish you had received early in your career of data engineering? If you hand a book to a new data engineer, what wisdom would you add to it? I’m working with O’Reilly on a project to collect the 97 things that every data engineer should know, and I need your help. Go to dataengineeringpodcast.com/97things to add your voice and share your hard-earned expertise.
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $60 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
Modern Data teams are dealing with a lot of complexity in their data pipelines and analytical code. Monitoring data quality, tracing incidents, and testing changes can be daunting and often takes hours to days. Datafold helps Data teams gain visibility and confidence in the quality of their analytical data through data profiling, column-level lineage and intelligent anomaly detection. Datafold also helps automate regression testing of ETL code with its Data Diff feature that instantly shows how a change in ETL or BI code affects the produced data, both on a statistical level and down to individual rows and values. Datafold integrates with all major data warehouses as well as frameworks such as Airflow & dbt and seamlessly plugs into CI workflows. Follow go.datafold.com/dataengineeringpodcast to start a 30-day trial of Datafold. Once you sign up and create an alert in Datafold for your company data, they will send you a cool water flask.
Are you bogged down by having to manually manage data access controls, repeatedly move and copy data, and create audit reports to prove compliance? How much time could you save if those tasks were automated across your cloud platforms? Immuta is an automated data governance solution that enables safe and easy data analytics in the cloud. Our comprehensive data-level security, auditing and de-identification features eliminate the need for time-consuming manual processes and our focus on data and compliance team collaboration empowers you to deliver quick and valuable data analytics on the most sensitive data to unlock the full potential of your cloud data platforms. Learn how we streamline and accelerate manual processes to help you derive real results from your data at dataengineeringpodcast.com/immuta.
Your host is Tobias Macey and today I’m interviewing Grant Seward about Tree Schema, a human friendly data catalog
Interview
Introduction
How did you get involved in the area of data management?
Can you start by giving an overview of what you have built at Tree Schema?
What was your motivation for creating it?
At what stage of maturity should a team or organization consider a data catalog to be a necessary component in their data platform?
There are a large and growing number of projects and products designed to provide a data catalog, with each of them addressing the problem in a slightly different way. What are the necessary elements for a data catalog?
How does Tree Schema compare to the available options? (e.g. Amundsen, Company Wiki, Metacat, Metamapper, etc.)
How is the Tree Schema system implemented?
How has the design or direction of Tree Schema evolved since you first began working on it?
How did you approach the schema definitions for defining entities?
What was your guiding heuristic for determining how to design the interface and data models?
How do you handle integrating with data sources?
In addition to storing schema information you allow users to store information about the transformations being performed. How is that represented?
How can users populate information about their transformations in an automated fashion?
How do you approach evolution and versioning of schema information?
What are the scaling limitations of Tree Schema, whether in terms of the technical or cognitive complexity that it can handle?
What are some of the most interesting, innovative, or unexpected ways that you have seen Tree Schema being used?
What have you found to be the most interesting, unexpected, or challenging lessons learned in the process of building and promoting Tree Schema?
When is Tree Schema the wrong choice?
What do you have planned for the future of the product?
Contact Info
Email
LinkedIn
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Links
Tree Schema
Tree Schema – Data Lineage as Code
Capital One
Walmart Labs
Data Catalog
Data Discovery
Amundsen
Metacat
Marquez
Metamapper
Infoworks
Collibra
Faust
Podcast.__init__ Episode
Django
PostgreSQL
Redis
Celery
Amazon ECS (Elastic Container Service)
Django Storages
Dagster
Airflow
DataHub
Avro
Singer
Apache Atlas
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
11/10/2020 • 51 minutes, 52 seconds
Add Version Control To Your Data Lake With LakeFS
Summary
Data lakes are gaining popularity due to their flexibility and reduced cost of storage. Along with the benefits there are some additional complexities to consider, including how to safely integrate new data sources or test out changes to existing pipelines. In order to address these challenges the team at Treeverse created LakeFS to introduce version control capabilities to your storage layer. In this episode Einat Orr and Oz Katz explain how they implemented branching and merging capabilities for object storage, best practices for how to use versioning primitives to introduce changes to your data lake, how LakeFS is architected, and how you can start using it for your own data platform.
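To make the branching model concrete, here is a minimal sketch of how an unmodified S3 client can work against lakeFS's S3-compatible gateway, where objects are addressed as repository/branch/key. The endpoint, credentials, repository, and branch names are assumptions for illustration only.

```python
# Minimal sketch; assumes a local lakeFS server with its S3 gateway enabled and
# an existing "analytics" repository with an "experiment" branch. All names and
# credentials are hypothetical.
import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="http://localhost:8000",       # lakeFS S3-compatible gateway
    aws_access_key_id="LAKEFS_ACCESS_KEY",
    aws_secret_access_key="LAKEFS_SECRET_KEY",
)

# The "bucket" is the repository and the first path segment is the branch,
# so writes to the experiment branch stay isolated until it is merged.
s3.put_object(
    Bucket="analytics",
    Key="experiment/events/2020-11-03/part-0000.parquet",
    Body=b"...parquet bytes...",
)

# Reading the same prefix on "main" still shows only the production data.
resp = s3.list_objects_v2(Bucket="analytics", Prefix="main/events/2020-11-03/")
print([obj["Key"] for obj in resp.get("Contents", [])])
```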
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $60 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
Are you bogged down by having to manually manage data access controls, repeatedly move and copy data, and create audit reports to prove compliance? How much time could you save if those tasks were automated across your cloud platforms? Immuta is an automated data governance solution that enables safe and easy data analytics in the cloud. Our comprehensive data-level security, auditing and de-identification features eliminate the need for time-consuming manual processes and our focus on data and compliance team collaboration empowers you to deliver quick and valuable data analytics on the most sensitive data to unlock the full potential of your cloud data platforms. Learn how we streamline and accelerate manual processes to help you derive real results from your data at dataengineeringpodcast.com/immuta.
Today’s episode of the Data Engineering Podcast is sponsored by Datadog, a SaaS-based monitoring and analytics platform for cloud-scale infrastructure, applications, logs, and more. Datadog uses machine-learning based algorithms to detect errors and anomalies across your entire stack—which reduces the time it takes to detect and address outages and helps promote collaboration between Data Engineering, Operations, and the rest of the company. Go to dataengineeringpodcast.com/datadog today to start your free 14 day trial. If you start a trial and install Datadog’s agent, Datadog will send you a free T-shirt.
Your host is Tobias Macey and today I’m interviewing Einat Orr and Oz Katz about their work at Treeverse on the LakeFS system for versioning your data lakes the same way you version your code.
Interview
Introduction
How did you get involved in the area of data management?
Can you start by giving an overview of what LakeFS is and why you built it?
There are a number of tools and platforms that support data virtualization and data versioning. How does LakeFS compare to the available options? (e.g. Alluxio, Denodo, Pachyderm, DVC, etc.)
What are the primary use cases that LakeFS enables?
For someone who wants to use LakeFS what is involved in getting it set up?
How is LakeFS implemented?
How has the design of the system changed or evolved since you began working on it?
What assumptions did you have going into it which have since been invalidated or modified?
How does the workflow for an engineer or analyst change from working directly against S3 to running against the LakeFS interface?
How do you handle merge conflicts and resolution?
What are some of the potential edge cases or foot guns that they should be aware of when there are multiple people using the same repository?
How do you approach management of the data lifecycle or garbage collection to avoid ballooning the cost of storage for a dataset that is tracking a high number of branches with diverging commits?
Given that S3 and GCS are eventually consistent storage layers, how do you handle snapshots/transactionality of the data you are working with?
What are the axes for scaling an installation of LakeFS?
What are the limitations in terms of size or geographic distribution of the datasets?
What are some of the most interesting, unexpected, or innovative ways that you have seen LakeFS being used?
What are the most interesting, unexpected, or challenging lessons that you have learned while building LakeFS?
When is LakeFS the wrong choice?
What do you have planned for the future of the project?
Contact Info
Einat Orr
Oz Katz
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Links
Treeverse
LakeFS
GitHub
Documentation
lakeFS Slack Channel
SimilarWeb
Kaggle
DagsHub
Alluxio
Pachyderm
DVC
ML Ops (Machine Learning Operations)
DoltHub
Delta Lake
Podcast Episode
Hudi
Iceberg Table Format
Podcast Episode
Kubernetes
PostgreSQL
Podcast Episode
Git
Spark
Presto
CockroachDB
YugabyteDB
Citus
Hive Metastore
Iceberg Table Format
Immunai
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
11/3/2020 • 50 minutes, 15 seconds
Cloud Native Data Security As Code With Cyral
Summary
One of the most challenging aspects of building a data platform has nothing to do with pipelines and transformations. If you are putting your workflows into production, then you need to consider how you are going to implement data security, including access controls and auditing. Different databases and storage systems all have their own method of restricting access, and they are not all compatible with each other. In order to simplify the process of securing your data in the cloud, Manav Mital created Cyral to provide a way of enforcing security as code. In this episode he explains how the system is architected, how it can help you enforce compliance, and what is involved in getting it integrated with your existing systems. This was a good conversation about an aspect of data management that is too often left as an afterthought.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
What are the pieces of advice that you wish you had received early in your career of data engineering? If you hand a book to a new data engineer, what wisdom would you add to it? I’m working with O’Reilly on a project to collect the 97 things that every data engineer should know, and I need your help. Go to dataengineeringpodcast.com/97things to add your voice and share your hard-earned expertise.
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $60 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
Are you bogged down by having to manually manage data access controls, repeatedly move and copy data, and create audit reports to prove compliance? How much time could you save if those tasks were automated across your cloud platforms? Immuta is an automated data governance solution that enables safe and easy data analytics in the cloud. Our comprehensive data-level security, auditing and de-identification features eliminate the need for time-consuming manual processes and our focus on data and compliance team collaboration empowers you to deliver quick and valuable data analytics on the most sensitive data to unlock the full potential of your cloud data platforms. Learn how we streamline and accelerate manual processes to help you derive real results from your data at dataengineeringpodcast.com/immuta.
Today’s episode of the Data Engineering Podcast is sponsored by Datadog, a SaaS-based monitoring and analytics platform for cloud-scale infrastructure, applications, logs, and more. Datadog uses machine-learning based algorithms to detect errors and anomalies across your entire stack—which reduces the time it takes to detect and address outages and helps promote collaboration between Data Engineering, Operations, and the rest of the company. Go to dataengineeringpodcast.com/datadog today to start your free 14 day trial. If you start a trial and install Datadog’s agent, Datadog will send you a free T-shirt.
You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data platforms. For more opportunities to stay up to date, gain new skills, and learn from your peers there are a growing number of virtual events that you can attend from the comfort and safety of your home. Go to dataengineeringpodcast.com/conferences to check out the upcoming events being offered by our partners and get registered today!
Your host is Tobias Macey and today I’m interviewing Manav Mital about the challenges involved in securing your data and the work that he is doing at Cyral to help address those problems.
Interview
Introduction
How did you get involved in the area of data management?
What is Cyral and what motivated you to build a business focused on addressing data security in the cloud?
Can you start by giving an overview of some of the common security issues that occur when working with data?
What new security challenges are introduced by building data platforms in public cloud environments?
What are the organizational roles that are typically responsible for managing security and access control to data sources and repositories?
What are the tensions, technical or organizational, that lead to a problematic or incomplete security posture?
What are the differences in security requirements and implementation complexity between software applications and data systems?
What are the data systems that Cyral integrates with?
How did you determine what platforms to prioritize?
How does Cyral integrate into the toolchains used to deploy, maintain, and upgrade an organization’s data infrastructure?
How does the Cyral platform address security and access control of data across an organization’s infrastructure?
How are schema changes handled when using Cyral to enforce access control to PII or other attributes?
How does Cyral help with reducing sprawl of data across unmonitored systems?
What are some of the most interesting, unexpected, or challenging lessons that you learned while building Cyral?
When is Cyral the wrong choice?
What do you have planned for the future of the Cyral platform?
Contact Info
LinkedIn
@manavrm on Twitter
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
Cyral
Snowflake
Podcast Episode
BigQuery
Object Storage
MongoDB
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
10/26/2020 • 48 minutes, 32 seconds
Better Data Quality Through Observability With Monte Carlo
Summary
In order for analytics and machine learning projects to be useful, they require a high degree of data quality. To ensure that your pipelines are healthy you need a way to make them observable. In this episode Barr Moses and Lior Gavish, co-founders of Monte Carlo, share the leading causes of what they refer to as data downtime and how it manifests. They also discuss methods for gaining visibility into the flow of data through your infrastructure, how to diagnose and prevent potential problems, and what they are building at Monte Carlo to help you maintain your data’s uptime.
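As a generic illustration of one of the "data downtime" signals discussed in the episode, a freshness check asks whether a table has received new data recently. This is not Monte Carlo's product or API; the table, column, and threshold below are hypothetical, and the sqlite3 connection merely stands in for any DB-API source.

```python
# Generic freshness-check sketch (an assumption-laden illustration, not Monte Carlo's API).
import sqlite3
from datetime import datetime, timedelta, timezone

def is_stale(cursor, table: str, ts_column: str, max_age: timedelta) -> bool:
    """Return True if the newest row in `table` is older than `max_age`."""
    cursor.execute(f"SELECT MAX({ts_column}) FROM {table}")
    (latest,) = cursor.fetchone()
    if latest is None:
        return True  # an empty table also counts as downtime
    latest_ts = datetime.fromisoformat(latest)
    return datetime.now(timezone.utc) - latest_ts > max_age

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, loaded_at TEXT)")
conn.execute("INSERT INTO orders VALUES (1, ?)",
             (datetime.now(timezone.utc).isoformat(),))
print(is_stale(conn.cursor(), "orders", "loaded_at", timedelta(hours=6)))  # False: data is fresh
```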
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
What are the pieces of advice that you wish you had received early in your career of data engineering? If you hand a book to a new data engineer, what wisdom would you add to it? I’m working with O’Reilly on a project to collect the 97 things that every data engineer should know, and I need your help. Go to dataengineeringpodcast.com/97things to add your voice and share your hard-earned expertise.
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $60 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
Are you bogged down by having to manually manage data access controls, repeatedly move and copy data, and create audit reports to prove compliance? How much time could you save if those tasks were automated across your cloud platforms? Immuta is an automated data governance solution that enables safe and easy data analytics in the cloud. Our comprehensive data-level security, auditing and de-identification features eliminate the need for time-consuming manual processes and our focus on data and compliance team collaboration empowers you to deliver quick and valuable data analytics on the most sensitive data to unlock the full potential of your cloud data platforms. Learn how we streamline and accelerate manual processes to help you derive real results from your data at dataengineeringpodcast.com/immuta.
Today’s episode of the Data Engineering Podcast is sponsored by Datadog, a SaaS-based monitoring and analytics platform for cloud-scale infrastructure, applications, logs, and more. Datadog uses machine-learning based algorithms to detect errors and anomalies across your entire stack—which reduces the time it takes to detect and address outages and helps promote collaboration between Data Engineering, Operations, and the rest of the company. Go to dataengineeringpodcast.com/datadog today to start your free 14 day trial. If you start a trial and install Datadog’s agent, Datadog will send you a free T-shirt.
You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data platforms. For more opportunities to stay up to date, gain new skills, and learn from your peers there are a growing number of virtual events that you can attend from the comfort and safety of your home. Go to dataengineeringpodcast.com/conferences to check out the upcoming events being offered by our partners and get registered today!
Your host is Tobias Macey and today I’m interviewing Barr Moses and Lior Gavish about observability for your data pipelines and how they are addressing it at Monte Carlo.
Interview
Introduction
How did you get involved in the area of data management?
How did you come up with the idea to found Monte Carlo?
What is "data downtime"?
Can you start by giving your definition of observability in the context of data workflows?
What are some of the contributing factors that lead to poor data quality at the different stages of the lifecycle?
Monitoring and observability of infrastructure and software applications is a well understood problem. In what ways does observability of data applications differ from "traditional" software systems?
What are some of the metrics or signals that we should be looking at to identify problems in our data applications?
Why is this the year that so many companies are working to address the issue of data quality and observability?
How are you addressing the challenge of bringing observability to data platforms at Monte Carlo?
What are the areas of integration that you are targeting and how did you identify where to prioritize your efforts?
For someone who is using Monte Carlo, how does the platform help them to identify and resolve issues in their data?
What stage of the data lifecycle have you found to be the biggest contributor to downtime and quality issues?
What are the most challenging systems, platforms, or tool chains to gain visibility into?
What are some of the most interesting, innovative, or unexpected ways that you have seen teams address their observability needs?
What are the most interesting, unexpected, or challenging lessons that you have learned while building the business and technology of Monte Carlo?
What are the alternatives to Monte Carlo?
What do you have planned for the future of the platform?
Contact Info
Visit www.montecarlodata.com to learn more about our data reliability platform;
Or reach out directly to barr@montecarlodata.com — happy to chat about all things data!
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
Monte Carlo
Monte Carlo Platform
Observability
Gainsight
Barracuda Networks
DevOps
New Relic
Datadog
Netflix RAD Outlier Detection
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
10/19/2020 • 55 minutes, 52 seconds
Rapid Delivery Of Business Intelligence Using Power BI
Summary
Business intelligence efforts are only as useful as the outcomes that they inform. Power BI aims to reduce the time and effort required to go from information to action by providing an interface that encourages rapid iteration. In this episode Rob Collie shares his enthusiasm for the Power BI platform and how it stands out from other options. He explains how he helped to build the platform during his time at Microsoft, and how he continues to support users through his work at Power Pivot Pro. Rob shares some useful insights gained through his consulting work, and why he considers Power BI to be the best option on the market today for business analytics.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
What are the pieces of advice that you wish you had received early in your career of data engineering? If you hand a book to a new data engineer, what wisdom would you add to it? I’m working with O’Reilly on a project to collect the 97 things that every data engineer should know, and I need your help. Go to dataengineeringpodcast.com/97things to add your voice and share your hard-earned expertise.
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $60 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
Are you bogged down by having to manually manage data access controls, repeatedly move and copy data, and create audit reports to prove compliance? How much time could you save if those tasks were automated across your cloud platforms? Immuta is an automated data governance solution that enables safe and easy data analytics in the cloud. Our comprehensive data-level security, auditing and de-identification features eliminate the need for time-consuming manual processes and our focus on data and compliance team collaboration empowers you to deliver quick and valuable data analytics on the most sensitive data to unlock the full potential of your cloud data platforms. Learn how we streamline and accelerate manual processes to help you derive real results from your data at dataengineeringpodcast.com/immuta.
Equalum’s end to end data ingestion platform is relied upon by enterprises across industries to seamlessly stream data to operational, real-time analytics and machine learning environments. Equalum combines streaming Change Data Capture, replication, complex transformations, batch processing and full data management using a no-code UI. Equalum also leverages open source data frameworks by orchestrating Apache Spark, Kafka and others under the hood. Tool consolidation and linear scalability without the legacy platform price tag. Go to dataengineeringpodcast.com/equalum today to start a free 2 week test run of their platform, and don’t forget to tell them that we sent you.
You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data platforms. For more opportunities to stay up to date, gain new skills, and learn from your peers there are a growing number of virtual events that you can attend from the comfort and safety of your home. Go to dataengineeringpodcast.com/conferences to check out the upcoming events being offered by our partners and get registered today!
Your host is Tobias Macey and today I’m interviewing Rob Collie about Microsoft’s Power BI platform and his work at Power Pivot Pro to help users employ it effectively.
Interview
Introduction
How did you get involved in the area of data management?
Can you start by giving an overview of what Power BI is?
The business intelligence market is fairly crowded. What are the features of Power BI that make it stand out?
Who are the target users of Power BI?
How does the design of the platform reflect those priorities?
Can you talk through the workflow for someone to build a report or dashboard in Power BI?
What is the broader ecosystem of data tools and platforms that Power BI sits within?
What are the available integration and extension points for Power BI?
In addition to your work at Microsoft building Power BI you now run a consulting company dedicated to helping people adopt that platform. What are some of the common challenges that users face in employing Power BI effectively?
In your experience working with clients, what are some of the core principles of data processing and visualization that apply across industries?
What are some of the modeling or presentation methods that are specific to a given industry?
One of the perennial challenges of business intelligence is to make reports discoverable. What facilities does Power BI have to aid in surfacing useful information to end users?
What capabilities does Power BI have for exposing elements of data quality?
What are some of the most challenging aspects of building and maintaining a business intelligence effort in an organization?
What are some of the most interesting, unexpected, or innovative uses of Power BI that you have seen, or projects that you have worked on?
What are some of the most interesting, unexpected, or challenging lessons that you have learned in your work building Power BI and building a business to support its users?
When is Power BI the wrong choice?
What trends in business intelligence are you most excited by?
Contact Info
LinkedIn
@robocolli3 on Twitter
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
P3
Power BI
Microsoft Excel
Fantasy Football
Excel Functions
Lisp
Business Intelligence
VLOOKUP
Looker
Podcast Episode
SQL Server Reporting Services
SQL Server Analysis Services
Tableau
Master Data Management
ERP == Enterprise Resource Planning
M Language
Power Query
DAX
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
10/12/2020 • 1 hour, 2 minutes, 54 seconds
Self Service Real Time Data Integration Without The Headaches With Meroxa
Summary
Analytical workloads require a well engineered and well maintained data integration process to ensure that your information is reliable and up to date. Building a real-time pipeline for your data lakes and data warehouses is a non-trivial effort, requiring a substantial investment of time and energy. Meroxa is a new platform that aims to automate the heavy lifting of change data capture, monitoring, and data loading. In this episode founders DeVaris Brown and Ali Hamidi explain how their tenure at Heroku informed their approach to making data integration self service, how the platform is architected, and how they have designed their system to adapt to the continued evolution of the data ecosystem.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
What are the pieces of advice that you wish you had received early in your career of data engineering? If you hand a book to a new data engineer, what wisdom would you add to it? I’m working with O’Reilly on a project to collect the 97 things that every data engineer should know, and I need your help. Go to dataengineeringpodcast.com/97things to add your voice and share your hard-earned expertise.
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $60 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
Are you bogged down by having to manually manage data access controls, repeatedly move and copy data, and create audit reports to prove compliance? How much time could you save if those tasks were automated across your cloud platforms? Immuta is an automated data governance solution that enables safe and easy data analytics in the cloud. Our comprehensive data-level security, auditing and de-identification features eliminate the need for time-consuming manual processes and our focus on data and compliance team collaboration empowers you to deliver quick and valuable data analytics on the most sensitive data to unlock the full potential of your cloud data platforms. Learn how we streamline and accelerate manual processes to help you derive real results from your data at dataengineeringpodcast.com/immuta.
Today’s episode of the Data Engineering Podcast is sponsored by Datadog, a SaaS-based monitoring and analytics platform for cloud-scale infrastructure, applications, logs, and more. Datadog uses machine-learning based algorithms to detect errors and anomalies across your entire stack—which reduces the time it takes to detect and address outages and helps promote collaboration between Data Engineering, Operations, and the rest of the company. Go to dataengineeringpodcast.com/datadog today to start your free 14 day trial. If you start a trial and install Datadog’s agent, Datadog will send you a free T-shirt.
You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data platforms. For more opportunities to stay up to date, gain new skills, and learn from your peers there are a growing number of virtual events that you can attend from the comfort and safety of your home. Go to dataengineeringpodcast.com/conferences to check out the upcoming events being offered by our partners and get registered today!
Your host is Tobias Macey and today I’m interviewing DeVaris Brown and Ali Hamidi about Meroxa, a new platform as a service for data integration
Interview
Introduction
How did you get involved in the area of data management?
Can you start by describing what you are building at Meroxa and what motivated you to turn it into a business?
What are the lessons that you learned from your time at Heroku which you are applying to your work on Meroxa?
Who are your target users and what are your guiding principles for designing the platform interface?
What are the common difficulties that engineers face in building and maintaining data infrastructure?
There are a variety of platforms that offer solutions for managing data integration, or powering end-to-end analytics, or building machine learning pipelines. What are the shortcomings of those existing options that might lead someone to choose Meroxa?
How is the Meroxa platform architected?
What are some of the initial assumptions that you had which have been challenged as you proceed with implementation?
What new capabilities does Meroxa bring to someone who uses it for integrating their application data?
What are the growth options for organizations that get started with Meroxa?
What are the core principles that you are focused on to allow for evolving your platform over the long run as the surrounding ecosystem continues to mature?
When is Meroxa the wrong choice?
What do you have planned for the future?
Contact Info
DeVaris Brown
Ali Hamidi
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Links
Meroxa
Heroku
Heroku Kafka
Ascend
StreamSets
Nexus
Kafka Connect
Airflow
Podcast.__init__ Episode
Spark
Data Engineering Episode
Change Data Capture
Segment
Podcast Episode
Rudderstack
MParticle
Debezium
Podcast Episode
DBT
Podcast Episode
Materialize
Podcast Episode
Stitch Data
Fivetran
Podcast Episode
Elasticsearch
Podcast Episode
gRPC
GraphQL
REST == REpresentational State Transfer
Dagster/Elementl
Data Engineering Podcast Episode
Podcast.__init__ Episode
Prefect
Podcast Episode
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
10/5/2020 • 1 hour, 55 seconds
Speed Up And Simplify Your Streaming Data Workloads With Red Panda
Summary
Kafka has become a de facto standard interface for building decoupled systems and working with streaming data. Despite its widespread popularity, there are numerous accounts of the difficulty that operators face in keeping it reliable and performant, or trying to scale an installation. To make the benefits of the Kafka ecosystem more accessible and reduce the operational burden, Alexander Gallego and his team at Vectorized created the Red Panda engine. In this episode he explains how they engineered a drop-in replacement for Kafka, replicating the numerous APIs, that can scale more easily and deliver consistently low latencies with a much lower hardware footprint. He also shares some of the areas of innovation that they have found to help foster the next wave of streaming applications while working within the constraints of the existing Kafka interfaces. This was a fascinating conversation with an energetic and enthusiastic engineer and founder about the challenges and opportunities in the realm of streaming data.
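A quick sketch of what that "drop-in replacement" claim means in practice: an unmodified Kafka client pointed at a Redpanda broker works the same way it would against Kafka. The broker address and topic name below are hypothetical, and the kafka-python package is just one of several clients that could be used.

```python
# Minimal producer/consumer sketch; assumes a single-node Redpanda listener on the
# usual Kafka port and the kafka-python package. Broker and topic are hypothetical.
from kafka import KafkaProducer, KafkaConsumer

BROKER = "localhost:9092"

producer = KafkaProducer(bootstrap_servers=BROKER)
producer.send("page-views", key=b"user-42", value=b'{"path": "/pricing"}')
producer.flush()

consumer = KafkaConsumer(
    "page-views",
    bootstrap_servers=BROKER,
    auto_offset_reset="earliest",
    consumer_timeout_ms=5000,   # stop iterating once no new messages arrive
)
for message in consumer:
    print(message.key, message.value)
```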
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
What are the pieces of advice that you wish you had received early in your career of data engineering? If you hand a book to a new data engineer, what wisdom would you add to it? I’m working with O’Reilly on a project to collect the 97 things that every data engineer should know, and I need your help. Go to dataengineeringpodcast.com/97things to add your voice and share your hard-earned expertise.
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $60 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
Are you bogged down by having to manually manage data access controls, repeatedly move and copy data, and create audit reports to prove compliance? How much time could you save if those tasks were automated across your cloud platforms? Immuta is an automated data governance solution that enables safe and easy data analytics in the cloud. Our comprehensive data-level security, auditing and de-identification features eliminate the need for time-consuming manual processes and our focus on data and compliance team collaboration empowers you to deliver quick and valuable data analytics on the most sensitive data to unlock the full potential of your cloud data platforms. Learn how we streamline and accelerate manual processes to help you derive real results from your data at dataengineeringpodcast.com/immuta.
If you’re looking for a way to optimize your data engineering pipeline – with instant query performance – look no further than Qubz. Qubz is next-generation OLAP technology built for the scale of Big Data from UST Global, a renowned digital services provider. Qubz lets users and enterprises analyze data on the cloud and on-premise, with blazing speed, while eliminating the complex engineering required to operationalize analytics at scale. With an emphasis on visual data engineering, connectors for all major BI tools and data sources, Qubz allows users to query OLAP cubes with sub-second response times on hundreds of billions of rows. To learn more, and sign up for a free demo, visit dataengineeringpodcast.com/qubz.
You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data platforms. For more opportunities to stay up to date, gain new skills, and learn from your peers there are a growing number of virtual events that you can attend from the comfort and safety of your home. Go to dataengineeringpodcast.com/conferences to check out the upcoming events being offered by our partners and get registered today!
Your host is Tobias Macey and today I’m interviewing Alexander Gallego about his work at Vectorized building Red Panda as a performance optimized, drop-in replacement for Kafka
Interview
Introduction
How did you get involved in the area of data management?
Can you start by describing what Red Panda is and what motivated you to create it?
What are the limitations of Kafka that make something like Red Panda necessary?
What are the current strengths of the Kafka ecosystem that make it a reasonable implementation target for Red Panda?
How is Red Panda architected?
How has the design or direction changed or evolved since you first began working on it?
What are the challenges that you face in automatically optimizing the runtime to take advantage of the hardware that it is deployed on?
How do cloud environments contribute to that complexity?
How are you handling the compatibility layer for the Kafka API?
What is your approach for managing versioning and ensuring that you maintain bug compatibility?
Beyond performance, what other areas of innovation or improvement in the capabilities and experience do you see while adhering to the Kafka protocol?
What are the opportunities for innovation in the streaming space that aren’t being explored yet?
What are some of the most interesting, innovative, or unexpected ways that you have seen Redpanda being used?
What are the most interesting, unexpected, or challenging lessons that you have learned while building Red Panda and Vectorized?
When is Red Panda the wrong choice?
What do you have planned for the future of the product and business?
What is your Hack The Planet diversity scholarship?
Contact Info
@emaxerrno on Twitter
LinkedIn
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Links
Vectorized
Free Download Trial
@vectorizedio company Twitter account
Community Slack
Concord (alternative to Flink)
Apache Flink
Podcast Episode
FAANG == Facebook, Apple, Amazon, Netflix, and Google
Backblaze
Raft
NATS
Pulsar
Podcast Episode
StreamNative Podcast Episode
Open Messaging Specification
ScyllaDB
CockroachDB
MemSQL
WASM == Web Assembly
Debezium
Podcast Episode
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
9/29/2020 • 59 minutes, 40 seconds
Cutting Through The Noise And Focusing On The Fundamentals Of Data Engineering With The Data Janitor
Summary
Data engineering is a constantly growing and evolving discipline. There are always new tools, systems, and design patterns to learn, which leads to a great deal of confusion for newcomers. Daniel Molnar has dedicated his time to helping data professionals get back to basics through presentations at conferences and meetups, and with his most recent endeavor of building the Pipeline Data Engineering Academy. In this episode he shares advice on how to cut through the noise, which principles are foundational to building a successful career as a data engineer, and his approach to educating the next generation of data practitioners. This was a useful conversation for anyone working with data who has found themselves spending too much time chasing the latest trends and wishes to develop a more focused approach to their work.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
What are the pieces of advice that you wish you had received early in your career of data engineering? If you hand a book to a new data engineer, what wisdom would you add to it? I’m working with O’Reilly on a project to collect the 97 things that every data engineer should know, and I need your help. Go to dataengineeringpodcast.com/97things to add your voice and share your hard-earned expertise.
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $60 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
Are you bogged down by having to manually manage data access controls, repeatedly move and copy data, and create audit reports to prove compliance? How much time could you save if those tasks were automated across your cloud platforms? Immuta is an automated data governance solution that enables safe and easy data analytics in the cloud. Our comprehensive data-level security, auditing and de-identification features eliminate the need for time-consuming manual processes and our focus on data and compliance team collaboration empowers you to deliver quick and valuable data analytics on the most sensitive data to unlock the full potential of your cloud data platforms. Learn how we streamline and accelerate manual processes to help you derive real results from your data at dataengineeringpodcast.com/immuta.
You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data platforms. For more opportunities to stay up to date, gain new skills, and learn from your peers there are a growing number of virtual events that you can attend from the comfort and safety of your home. Go to dataengineeringpodcast.com/conferences to check out the upcoming events being offered by our partners and get registered today!
Your host is Tobias Macey and today I’m interviewing Daniel Molnar about being a data janitor and how to cut through the hype to understand what to learn for the long run
Interview
Introduction
How did you get involved in the area of data management?
Can you start by describing your thoughts on the current state of the data management industry?
What is your strategy for being effective in the face of so much complexity and conflicting needs for data?
What are some of the common difficulties that you see data engineers contend with, whether technical or social/organizational?
What are the core fundamentals that you think are necessary for data engineers to be effective?
What are the gaps in knowledge or experience that you have seen data engineers contend with?
You recently started down the path of building a bootcamp for training data engineers. What was your motivation for embarking on that journey?
How would you characterize your particular approach?
What are some of the reasons that your applicants have for wanting to become versed in data engineering?
What is the baseline of capabilities that you expect of your target audience?
What level of proficiency do you aim for when someone has completed your training program?
Who do you think would not be a good fit for your academy?
As a hiring manager, what are the core capabilities that you look for in a data engineering candidate?
What are some of the methods that you use to assess competence?
What are the overall trends in the data management space that you are worried by?
Which ones are you happy about?
What are your plans and overall goals for the pipeline academy?
Contact Info
LinkedIn
@soobrosa on Twitter
Website
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Links
Pipeline Data Engineering Academy
Data Janitor 101
The Data Janitor Returns
Berlin, Germany
Hungary
Urchin (Google Analytics precursor)
AWS Redshift
Nassim Nicholas Taleb
Black Swans (affiliate link)
KISS == Keep It Simple Stupid
Dan McKinley
Ralph Kimball (data warehousing design)
Falsehoods Programmers Believe
Apache Kafka
AWS Kinesis
ETL/ELT
CI/CD
Telemetry
Depeche Mode
Designing Data Intensive Applications (affiliate link)
Stop Hiring DevOps Engineers and Start Growing Them
T Shaped Engineer
Pipeline Data Engineering Academy Curriculum
MPP == Massively Parallel Processing
Apache Flink
Podcast Episode
Flask web framework
YAGNI == You Ain’t Gonna Need It
Pair Programming
Clojure
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
9/22/2020 • 47 minutes, 40 seconds
Distributed In Memory Processing And Streaming With Hazelcast
Summary
In memory computing provides significant performance benefits, but brings along challenges for managing failures and scaling up. Hazelcast is a platform for managing stateful in-memory storage and computation across a distributed cluster of commodity hardware. On top of this foundation, the Hazelcast team has also built a streaming platform for reliable high throughput data transmission. In this episode Dale Kim shares how Hazelcast is implemented, the use cases that it enables, and how it complements on-disk data management systems.
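To ground the in-memory data grid idea, the sketch below uses the hazelcast-python-client package (4.x-style configuration) to put and get entries in a distributed map. The cluster address, map name, and stored values are assumptions for illustration; a real deployment would have multiple members holding partitions of the map.

```python
# Minimal distributed-map sketch; assumes a Hazelcast member running locally and
# the hazelcast-python-client package. All names and values are hypothetical.
import hazelcast

client = hazelcast.HazelcastClient(cluster_members=["127.0.0.1:5701"])

# Distributed maps are partitioned across the cluster and kept in memory,
# which is the source of the low-latency reads and writes discussed here.
sessions = client.get_map("user-sessions").blocking()
sessions.put("session-123", {"user_id": 42, "cart_items": 3})
print(sessions.get("session-123"))

client.shutdown()
```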
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
What are the pieces of advice that you wish you had received early in your career of data engineering? If you hand a book to a new data engineer, what wisdom would you add to it? I’m working with O’Reilly on a project to collect the 97 things that every data engineer should know, and I need your help. Go to dataengineeringpodcast.com/97things to add your voice and share your hard-earned expertise.
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $60 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
Tree Schema is a data catalog that is making metadata management accessible to everyone. With Tree Schema you can create your data catalog and have it fully populated in under five minutes when using one of the many automated adapters that can connect directly to your data stores. Tree Schema includes essential cataloging features such as first class support for both tabular and unstructured data, data lineage, rich text documentation, asset tagging and more. Built from the ground up with a focus on the intersection of people and data, your entire team will find it easier to foster collaboration around your data. With the most transparent pricing in the industry – $99/mo for your entire company – and a money-back guarantee for excellent service, you’ll love Tree Schema as much as you love your data. Go to dataengineeringpodcast.com/treeschema today to get your first month free, and mention this podcast to get 50% off your first three months after the trial.
You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data platforms. For more opportunities to stay up to date, gain new skills, and learn from your peers there are a growing number of virtual events that you can attend from the comfort and safety of your home. Go to dataengineeringpodcast.com/conferences to check out the upcoming events being offered by our partners and get registered today!
Your host is Tobias Macey and today I’m interviewing Dale Kim about Hazelcast, a distributed in-memory computing platform for data intensive applications
Interview
Introduction
How did you get involved in the area of data management?
Can you start by describing what Hazelcast is and its origins?
What are the benefits and tradeoffs of in-memory computation for data-intensive workloads?
What are some of the common use cases for the Hazelcast in memory grid?
How is Hazelcast implemented?
How has the architecture evolved since it was first created?
How is the Jet streaming framework architected?
What was the motivation for building it?
How do the capabilities of Jet compare to systems such as Flink or Spark Streaming?
How has the introduction of hardware capabilities such as NVMe drives influenced the market for in-memory systems?
How is the governance of the open source grid and Jet projects handled?
What is the guiding heuristic for which capabilities or features to include in the open source projects vs. the commercial offerings?
What is involved in building an application or workflow on top of Hazelcast?
What are the common patterns for engineers who are building on top of Hazelcast?
What is involved in deploying and maintaining an installation of the Hazelcast grid or Jet streaming?
What are the scaling factors for Hazelcast?
What are the edge cases that users should be aware of?
What are some of the most interesting, innovative, or unexpected ways that you have seen Hazelcast used?
When is Hazelcast Grid or Jet the wrong choice?
What is in store for the future of Hazelcast?
Contact Info
LinkedIn
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Links
HazelCast
Istanbul
Apache Spark
OrientDB
CAP Theorem
NVMe
Memristors
Intel Optane Persistent Memory
Hazelcast Jet
Kappa Architecture
IBM Cloud Paks
Digital Integration Hub (Gartner)
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
9/15/2020 • 44 minutes, 7 seconds
Simplify Your Data Architecture With The Presto Distributed SQL Engine
Summary
Databases are limited in scope to the information that they directly contain. For analytical use cases you often want to combine data across multiple sources and storage locations. This frequently requires cumbersome and time-consuming data integration. To address this problem Martin Traverso and his colleagues at Facebook built the Presto distributed query engine. In this episode he explains how it is designed to allow for querying and combining data where it resides, the use cases that such an architecture unlocks, and the innovative ways that it is being employed at companies across the world. If you need to work with data in your cloud data lake, your on-premise database, or a collection of flat files, then give this episode a listen and then try out Presto today.
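As a hedged sketch of the "query the data where it lives" idea, a single Presto statement can join a Hive-backed data lake table with a table in an operational PostgreSQL database. The coordinator host, catalogs, schemas, and table names below are hypothetical; the presto-python-client package is used to submit the query.

```python
# Federated-query sketch; assumes a reachable Presto coordinator with "hive" and
# "postgresql" catalogs configured, plus the presto-python-client package.
# Hosts, catalogs, and table names are hypothetical.
import prestodb

conn = prestodb.dbapi.connect(
    host="presto-coordinator.example.com",
    port=8080,
    user="analyst",
    catalog="hive",
    schema="default",
)
cur = conn.cursor()
cur.execute("""
    SELECT u.country, COUNT(*) AS orders
    FROM hive.web.orders AS o
    JOIN postgresql.public.users AS u ON o.user_id = u.id
    GROUP BY u.country
    ORDER BY orders DESC
""")
for row in cur.fetchall():
    print(row)
```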
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
What are the pieces of advice that you wish you had received early in your career of data engineering? If you hand a book to a new data engineer, what wisdom would you add to it? I’m working with O’Reilly on a project to collect the 97 things that every data engineer should know, and I need your help. Go to dataengineeringpodcast.com/97things to add your voice and share your hard-earned expertise.
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $60 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data platforms. For more opportunities to stay up to date, gain new skills, and learn from your peers there are a growing number of virtual events that you can attend from the comfort and safety of your home. Go to dataengineeringpodcast.com/conferences to check out the upcoming events being offered by our partners and get registered today!
Your host is Tobias Macey and today I’m interviewing Martin Traverso about PrestoSQL, a distributed SQL engine that queries data in place
Interview
Introduction
How did you get involved in the area of data management?
Can you start by giving an overview of what Presto is and its origin story?
What was the motivation for releasing Presto as open source?
For someone who is responsible for architecting their organization’s data platform, what are some of the signals that Presto will be a good fit for them?
What are the primary ways that Presto is being used?
I interviewed your colleague at Starburst, Kamil, two years ago. How has Presto changed or evolved in that time, both technically and in terms of community and ecosystem growth?
What are some of the deployment and scaling considerations that operators of Presto should be aware of?
What are the best practices that have been established for working with data through Presto in terms of centralizing in a data lake vs. federating across disparate storage locations?
What are the tradeoffs of using Presto on top of a data lake vs a vertically integrated warehouse solution?
When designing the layout of a data lake that will be interacted with via Presto, what are some of the data modeling considerations that can improve the odds of success?
What are some of the most interesting, unexpected, or innovative ways that you have seen Presto used?
What are the most interesting, unexpected, or challenging lessons that you have learned while building, growing, and supporting the Presto project?
When is Presto the wrong choice?
What is in store for the future of the Presto project and community?
Contact Info
LinkedIn
@mtraverso on Twitter
martint on GitHub
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Links
Presto
Starburst Data
Podcast Episode
Hadoop
Hive
Glue Metastore
BigQuery
Kinesis
Apache Pinot
Elasticsearch
ORC
Parquet
AWS Redshift
Avro
Podcast Episode
LZ4
Zstandard
KafkaSQL
Flink
Podcast Episode
PyTorch
Podcast.__init__ Episode
Tensorflow
Spark
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
9/7/2020 • 53 minutes, 59 seconds
Building A Better Data Warehouse For The Cloud At Firebolt
Summary
Data warehouse technology has been around for decades and has gone through several generational shifts in that time. The current trends in data warehousing are oriented around cloud native architectures that take advantage of dynamic scaling and the separation of compute and storage. Firebolt is taking that a step further with a core focus on speed and interactivity. In this episode CEO and founder Eldad Farkash explains how the Firebolt platform is architected for high throughput, their simple and transparent pricing model to encourage widespread use, and the use cases that it unlocks through interactive query speeds.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
What are the pieces of advice that you wish you had received early in your career of data engineering? If you hand a book to a new data engineer, what wisdom would you add to it? I’m working with O’Reilly on a project to collect the 97 things that every data engineer should know, and I need your help. Go to dataengineeringpodcast.com/97things to add your voice and share your hard-earned expertise.
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $60 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
Today’s episode of the Data Engineering Podcast is sponsored by Datadog, a SaaS-based monitoring and analytics platform for cloud-scale infrastructure, applications, logs, and more. Datadog uses machine-learning based algorithms to detect errors and anomalies across your entire stack—which reduces the time it takes to detect and address outages and helps promote collaboration between Data Engineering, Operations, and the rest of the company.
Go to dataengineeringpodcast.com/datadog today to start your free 14 day trial. If you start a trial and install Datadog’s agent, Datadog will send you a free T-shirt.
You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data platforms. For more opportunities to stay up to date, gain new skills, and learn from your peers there are a growing number of virtual events that you can attend from the comfort and safety of your home. Go to dataengineeringpodcast.com/conferences to check out the upcoming events being offered by our partners and get registered today!
Your host is Tobias Macey and today I’m interviewing Eldad Farkash about Firebolt, a cloud data warehouse optimized for speed and elasticity on structured and semi-structured data
Interview
Introduction
How did you get involved in the area of data management?
Can you start by describing what Firebolt is and your motivation for building it?
How does Firebolt compare to other data warehouse technologies, and what unique features does it provide?
The lines between a data warehouse and a data lake have been blurring in recent years. Where on that continuum does Firebolt lie?
What are the unique use cases that Firebolt allows for?
How do the performance characteristics of Firebolt change the ways that an engineer should think about data modeling?
What technologies might someone replace with Firebolt?
How is Firebolt architected and how has the design evolved since you first began working on it?
What are some of the most challenging aspects of building a data warehouse platform that is optimized for speed?
How do you handle support for nested and semi-structured data?
In what ways have you found it necessary/useful to extend SQL?
Due to the immutability of object storage, for data lakes the update or delete process involves reprocessing a potentially large amount of data. How do you approach that in Firebolt with your F3 format?
What have you found to be the most interesting, unexpected, or challenging lessons while building and scaling the Firebolt platform and business?
When is Firebolt the wrong choice?
What do you have planned for the future of Firebolt?
Contact Info
LinkedIn
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Links
Firebolt
Sisense
SnowflakeDB
Podcast Episode
Redshift
Spark
Podcast Episode
Parquet
Podcast Episode
Hadoop
HDFS
S3
AWS Athena
BigQuery
Data Vault
Podcast Episode
Star Schema
Dimensional Modeling
Slowly Changing Dimensions
JDBC
TPC Benchmarks
DBT
Podcast Episode
Tableau
Looker
Podcast Episode
PrestoSQL
Podcast Episode
PostgreSQL
Podcast Episode
FoundationDB
Podcast Episode
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
9/1/2020 • 1 hour, 5 minutes, 50 seconds
Metadata Management And Integration At LinkedIn With DataHub
Summary
In order to scale the use of data across an organization there are a number of challenges related to discovery, governance, and integration that need to be solved. The key to those solutions is a robust and flexible metadata management system. LinkedIn has gone through several iterations on the most maintainable and scalable approach to metadata, leading them to their current work on DataHub. In this episode Mars Lan and Pardhu Gunnam explain how they designed the platform, how it integrates into their data platforms, and how it is being used to power data discovery and analytics at LinkedIn.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
What are the pieces of advice that you wish you had received early in your career of data engineering? If you hand a book to a new data engineer, what wisdom would you add to it? I’m working with O’Reilly on a project to collect the 97 things that every data engineer should know, and I need your help. Go to dataengineeringpodcast.com/97things to add your voice and share your hard-earned expertise.
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $60 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
If you’ve been exploring scalable, cost-effective and secure ways to collect and route data across your organization, RudderStack is the only solution that helps you turn your own warehouse into a state of the art customer data platform. Their mission is to empower data engineers to fully own their customer data infrastructure and easily push value to other parts of the organization, like marketing and product management. With their open-source foundation, fixed pricing, and unlimited volume, they are enterprise ready, but accessible to everyone. Go to dataengineeringpodcast.com/rudder to request a demo and get one free month of access to the hosted platform along with a free t-shirt.
You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data platforms. For more opportunities to stay up to date, gain new skills, and learn from your peers there are a growing number of virtual events that you can attend from the comfort and safety of your home. Go to dataengineeringpodcast.com/conferences to check out the upcoming events being offered by our partners and get registered today!
Your host is Tobias Macey and today I’m interviewing Pardhu Gunnam and Mars Lan about DataHub, LinkedIn’s metadata management and data catalog platform
Interview
Introduction
How did you get involved in the area of data management?
Can you start by giving an overview of what DataHub is and some of its back story?
What were you using at LinkedIn for metadata management prior to the introduction of DataHub?
What was lacking in the previous solutions that motivated you to create a new platform?
There are a large number of other systems available for building data catalogs and tracking metadata, both open source and proprietary. What are the features of DataHub that would lead someone to use it in place of the other options?
Who is the target audience for DataHub?
How do the needs of those end users influence or constrain your approach to the design and interfaces provided by DataHub?
Can you describe how DataHub is architected?
How has it evolved since you first began working on it?
What was your motivation for releasing DataHub as an open source project?
What have been the benefits of that decision?
What are the challenges that you face in maintaining changes between the public repository and your internally deployed instance?
What is the workflow for populating metadata into DataHub?
What are the challenges that you see in managing the format of metadata and establishing consistent models for the information being stored?
How do you handle discovery of data assets for users of DataHub?
What are the integration and extension points of the platform?
What is involved in deploying and maintaining an instance of the DataHub platform?
What are some of the most interesting or unexpected ways that you have seen DataHub used inside or outside of LinkedIn?
What are some of the most interesting, unexpected, or challenging lessons that you learned while building and working with DataHub?
When is DataHub the wrong choice?
What do you have planned for the future of the project?
Contact Info
Mars
LinkedIn
mars-lan on GitHub
Pardhu
LinkedIn
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Links
DataHub
Map/Reduce
Apache Flume
LinkedIn Blog Post introducing DataHub
WhereHows
Hive Metastore
Kafka
CDC == Change Data Capture
Podcast Episode
PDL LinkedIn language
GraphQL
Elasticsearch
Neo4J
Apache Pinot
Apache Gobblin
Apache Samza
Open Sourcing DataHub Blog Post
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
8/25/2020 • 51 minutes, 4 seconds
Exploring The TileDB Universal Data Engine
Summary
Most databases are designed to work with textual data, with some special-purpose engines that support domain-specific formats. TileDB is a data engine that was built to support every type of data by using multi-dimensional arrays as the foundational primitive. In this episode the creator and founder of TileDB shares how he first started working on the underlying technology and the benefits of using a single engine for efficiently storing and querying any form of data. He also discusses the shifts in database architectures from vertically integrated monoliths to separately deployed layers, and the approach he is taking with TileDB Cloud to embed authorization into the storage engine while providing a flexible interface for compute. This was a great conversation about a different approach to database architecture and how that enables a more flexible way to store and interact with data to power better data sharing and new opportunities for blending specialized domains.
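For a feel of what the array-first model looks like in practice, here is a rough sketch based on the tiledb Python package's dense-array quickstart; the array name and values are illustrative and not drawn from the episode:

    import numpy as np
    import tiledb

    # Define a small 2-dimensional dense array with one float attribute.
    dom = tiledb.Domain(
        tiledb.Dim(name="rows", domain=(0, 3), tile=2, dtype=np.int32),
        tiledb.Dim(name="cols", domain=(0, 3), tile=2, dtype=np.int32),
    )
    schema = tiledb.ArraySchema(
        domain=dom,
        sparse=False,
        attrs=[tiledb.Attr(name="a", dtype=np.float64)],
    )
    tiledb.DenseArray.create("quickstart_dense", schema)

    # Write values, then slice a sub-region back out.
    with tiledb.DenseArray("quickstart_dense", mode="w") as A:
        A[:] = np.arange(16, dtype=np.float64).reshape(4, 4)
    with tiledb.DenseArray("quickstart_dense", mode="r") as A:
        print(A[0:2, 0:2]["a"])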
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
What are the pieces of advice that you wish you had received early in your career of data engineering? If you hand a book to a new data engineer, what wisdom would you add to it? I’m working with O’Reilly on a project to collect the 97 things that every data engineer should know, and I need your help. Go to dataengineeringpodcast.com/97things to add your voice and share your hard-earned expertise.
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $60 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
Today’s episode of the Data Engineering Podcast is sponsored by Datadog, a SaaS-based monitoring and analytics platform for cloud-scale infrastructure, applications, logs, and more. Datadog uses machine-learning based algorithms to detect errors and anomalies across your entire stack—which reduces the time it takes to detect and address outages and helps promote collaboration between Data Engineering, Operations, and the rest of the company.
Go to dataengineeringpodcast.com/datadog today to start your free 14 day trial. If you start a trial and install Datadog’s agent, Datadog will send you a free T-shirt.
You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data platforms. For more opportunities to stay up to date, gain new skills, and learn from your peers there are a growing number of virtual events that you can attend from the comfort and safety of your home. Go to dataengineeringpodcast.com/conferences to check out the upcoming events being offered by our partners and get registered today!
Your host is Tobias Macey and today I’m interviewing Stavros Papadopoulos about TileDB, the universal storage engine
Interview
Introduction
How did you get involved in the area of data management?
Can you start by describing what TileDB is and the problem that you are trying to solve with it?
What was your motivation for building it?
What are the main use cases or problem domains that you are trying to solve for?
What are the shortcomings of existing approaches to database design that prevent them from being useful for these applications?
What are the benefits of using matrices for data processing and domain modeling?
What are the challenges that you have faced in storing and processing sparse matrices efficiently?
How does the usage of matrices as the foundational primitive affect the way that users should think about data modeling?
What are the benefits of unbundling the storage engine from the processing layer?
Can you describe how TileDB embedded is architected?
How has the design evolved since you first began working on it?
What is your approach to integrating with the broader ecosystem of data storage and processing utilities?
What does the workflow look like for someone using TileDB?
What is required to deploy TileDB in a production context?
How is the built in data versioning implemented?
What is the user experience for interacting with different versions of datasets?
How do you manage the lifecycle of versioned data to allow garbage collection?
How are you managing the governance and ongoing sustainability of the open source project, and the commercial offerings that you are building on top of it?
What are the most interesting, unexpected, or innovative ways that you have seen TileDB used?
What have you found to be the most interesting, unexpected, or challenging aspects of building TileDB?
What features or capabilities are you consciously deciding not to implement?
When is TileDB the wrong choice?
What do you have planned for the future of TileDB?
Contact Info
LinkedIn
stavrospapadopoulos on GitHub
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
TileDB
GitHub
Data Frames
TileDB Cloud
MIT
Intel
Sparse Linear Algebra
Sparse Matrices
HDF5
Dask
Spark
MariaDB
PrestoDB
GDAL
PDAL
Turing Complete
Clustered Index
Parquet File Format
Podcast Episode
Serializability
Delta Lake
Podcast Episode
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
8/17/2020 • 1 hour, 5 minutes, 44 seconds
Closing The Loop On Event Data Collection With Iteratively
Summary
Event-based data is a rich source of information for analytics, unless none of the event structures are consistent. The team at Iteratively is building a platform to manage the end-to-end flow of collaboration around what events are needed, how to structure the attributes, and how they are captured. In this episode founders Patrick Thompson and Ondrej Hrebicek discuss the problems that they have experienced as a result of inconsistent event schemas, how the Iteratively platform integrates the definition, development, and delivery of event data, and the benefits of elevating the visibility of event data for improving the effectiveness of the resulting analytics. If you are struggling with inconsistent implementations of event data collection, a lack of clarity on what attributes are needed, or uncertainty about how the data is being used, then this is definitely a conversation worth following.
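Iteratively's own SDKs are not shown here, but the underlying idea of enforcing an event schema at the point of collection can be sketched with plain JSON Schema (one of the technologies referenced in the links below); the event name and fields are hypothetical:

    from jsonschema import ValidationError, validate

    # Hypothetical schema for a "Product Viewed" analytics event.
    PRODUCT_VIEWED = {
        "type": "object",
        "required": ["user_id", "product_id", "timestamp"],
        "properties": {
            "user_id": {"type": "string"},
            "product_id": {"type": "string"},
            "price_usd": {"type": "number", "minimum": 0},
            "timestamp": {"type": "string", "format": "date-time"},
        },
        "additionalProperties": False,
    }

    def track(event: dict) -> None:
        # Reject malformed events before they ever reach the analytics pipeline.
        try:
            validate(instance=event, schema=PRODUCT_VIEWED)
        except ValidationError as err:
            raise ValueError(f"Invalid Product Viewed event: {err.message}") from err
        # Delivery to the collector would happen here; it is out of scope for this sketch.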
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
What are the pieces of advice that you wish you had received early in your career of data engineering? If you hand a book to a new data engineer, what wisdom would you add to it? I’m working with O’Reilly on a project to collect the 97 things that every data engineer should know, and I need your help. Go to dataengineeringpodcast.com/97things to add your voice and share your hard-earned expertise.
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $60 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
If you’ve been exploring scalable, cost-effective and secure ways to collect and route data across your organization, RudderStack is the only solution that helps you turn your own warehouse into a state of the art customer data platform. Their mission is to empower data engineers to fully own their customer data infrastructure and easily push value to other parts of the organization, like marketing and product management. With their open-source foundation, fixed pricing, and unlimited volume, they are enterprise ready, but accessible to everyone. Go to dataengineeringpodcast.com/rudder to request a demo and get one free month of access to the hosted platform along with a free t-shirt.
You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data platforms. For more opportunities to stay up to date, gain new skills, and learn from your peers there are a growing number of virtual events that you can attend from the comfort and safety of your home. Go to dataengineeringpodcast.com/conferences to check out the upcoming events being offered by our partners and get registered today!
Your host is Tobias Macey and today I’m interviewing Patrick Thompson and Ondrej Hrebicek about Iteratively, a platform for enforcing consistent schemas for your event data
Interview
Introduction
How did you get involved in the area of data management?
Can you start by describing what you are building at Iteratively and your motivation for creating it?
What are some of the ways that you have seen inconsistent message structures cause problems?
What are some of the common anti-patterns that you have seen for managing the structure of event messages?
What are the benefits that Iteratively provides for the different roles in an organization?
Can you describe the workflow for a team using Iteratively?
How is the Iteratively platform architected?
How has the design changed or evolved since you first began working on it?
What are the difficulties that you have faced in building integrations for the Iteratively workflow?
How is schema evolution handled throughout the lifecycle of an event?
What are the challenges that engineers face in building effective integration tests for their event schemas?
What has been your biggest challenge in messaging for your platform and educating potential users of its benefits?
What are some of the most interesting or unexpected ways that you have seen Iteratively used?
What are some of the most interesting, unexpected, or challenging lessons that you have learned while building Iteratively?
When is Iteratively the wrong choice?
What do you have planned for the future of Iteratively?
Contact Info
Patrick
LinkedIn
@Patrickt010 on Twitter
Website
Ondrej
LinkedIn
@ondrej421 on Twitter
ondrej on GitHub
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Links
Iteratively
Syncplicity
Locally Optimistic
DBT
Podcast Episode
Snowplow Analytics
Podcast Episode
JSON Schema
Master Data Management
Podcast Episode
SDLC == Software Development Life Cycle
Amplitude
Mixpanel
Mode Analytics
CRUD == Create, Read, Update, Delete
Segment
Podcast Episode
Schemaver (JSON Schema Versioning Strategy)
Great Expectations
Podcast.__init__ Interview
Data Engineering Podcast Interview
Confluence
Notion
Confluent Schema Registry
Podcast Episode
Snowplow Iglu Schema Registry
Pulsar Schema Registry
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
8/10/2020 • 59 minutes, 17 seconds
A Practical Introduction To Graph Data Applications
Summary
Finding connections between data and the entities that they represent is a complex problem. Graph data models and the applications built on top of them are perfect for representing relationships and finding emergent structures in your information. In this episode Denise Gosnell and Matthias Broecheler discuss their recent book, the Practitioner’s Guide To Graph Data, including the fundamental principles that you need to know about graph structures, the current state of graph support in database engines, tooling, and query languages, as well as useful tips on potential pitfalls when putting them into production. This was an informative and enlightening conversation with two experts on graph data applications that will help you start on the right track in your own projects.
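As a small taste of the traversal style discussed in the episode, here is a sketch of a friends-of-friends query using the Gremlin query language via the gremlinpython client; the server address, labels, and property names are illustrative:

    from gremlin_python.driver.driver_remote_connection import DriverRemoteConnection
    from gremlin_python.process.anonymous_traversal import traversal

    # Connect to any TinkerPop-compatible graph server (address is illustrative).
    conn = DriverRemoteConnection("ws://localhost:8182/gremlin", "g")
    g = traversal().withRemote(conn)

    # People known by the people Alice knows: a two-hop traversal.
    suggestions = (
        g.V().has("person", "name", "alice")
        .out("knows")
        .out("knows")
        .dedup()
        .values("name")
        .toList()
    )
    print(suggestions)
    conn.close()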
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
What are the pieces of advice that you wish you had received early in your career of data engineering? If you hand a book to a new data engineer, what wisdom would you add to it? I’m working with O’Reilly on a project to collect the 97 things that every data engineer should know, and I need your help. Go to dataengineeringpodcast.com/97things to add your voice and share your hard-earned expertise.
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $60 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
Today’s episode of the Data Engineering Podcast is sponsored by Datadog, a SaaS-based monitoring and analytics platform for cloud-scale infrastructure, applications, logs, and more. Datadog uses machine-learning based algorithms to detect errors and anomalies across your entire stack—which reduces the time it takes to detect and address outages and helps promote collaboration between Data Engineering, Operations, and the rest of the company.
Go to dataengineeringpodcast.com/datadog today to start your free 14 day trial. If you start a trial and install Datadog’s agent, Datadog will send you a free T-shirt.
You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data platforms. For more opportunities to stay up to date, gain new skills, and learn from your peers there are a growing number of virtual events that you can attend from the comfort and safety of your home. Go to dataengineeringpodcast.com/conferences to check out the upcoming events being offered by our partners and get registered today!
Your host is Tobias Macey and today I’m interviewing Denise Gosnell and Matthias Broecheler about the recently published practitioner’s guide to graph data
Interview
Introduction
How did you get involved in the area of data management?
Can you start by explaining what your goals are for the Practitioner’s Guide To Graph Data?
What was your motivation for writing a book to address this topic?
What do you see as the driving force behind the growing popularity of graph technologies in recent years?
What are some of the common use cases/applications of graph data and graph traversal algorithms?
What are the core elements of graph thinking that data teams need to be aware of to be effective in identifying those cases in their existing systems?
What are the fundamental principles of graph technologies that data engineers should be familiar with?
What are the core modeling principles that they need to know for designing schemas in a graph database?
Beyond databases, what are some of the other components of the data stack that can or should handle graphs natively?
Do you typically use a graph database as the primary or complementary data store?
What are some of the common challenges that you see when bringing graph applications to production?
What have you found to be some of the common points of confusion or error prone aspects of implementing and maintaining graph oriented applications?
When it comes to the specific technologies of different graph databases, what are some of the edge cases/variances in the interfaces or modeling capabilities that they present?
How does the variation in query languages impact the overall adoption of these technologies?
What are your thoughts on the recent standardization of GQL as an ANSI specification?
What are some of the scaling challenges that exist for graph data engines?
What are the ongoing developments/improvements/trends in graph technology that you are most excited about?
What are some of the shortcomings in existing technology/ecosystem for graph applications that you would like to see addressed?
What are some of the cases where a graph is the wrong abstraction for a data project?
What are some of the other resources that you recommend for anyone who wants to learn more about the various aspects of graph data?
Contact Info
Denise
LinkedIn
@DeniseKGosnell on Twitter
Matthias
LinkedIn
@MBroecheler on Twitter
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Links
The Practitioner’s Guide To Graph Data
Datastax
Titan graph database
Goethe
Graph Database
NoSQL
Relational Database
Elasticsearch
Podcast Episode
Associative Array Data Structure
RDF Triple
Datastax Multi-model Graph Database
Semantic Web
Gremlin Graph Query Language
Super Node
Neuromorphic Computing
Datastax Desktop
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
8/4/2020 • 1 hour, 43 seconds
Build More Reliable Distributed Systems By Breaking Them With Jepsen
Summary
A majority of the scalable data processing platforms that we rely on are built as distributed systems. This brings with it a vast number of subtle ways that errors can creep in. Kyle Kingsbury created the Jepsen framework for testing the guarantees of distributed data processing systems and identifying when and why they break. In this episode he shares his approach to testing complex systems, the common challenges that are faced by engineers who build them, and why it is important to understand their limitations. This was a great look at some of the underlying principles that power your mission critical workloads.
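Jepsen itself is written in Clojure and its real checkers (such as Knossos and Elle) are far more rigorous, but the core idea of recording a history of operations and then checking it against a model can be illustrated with a deliberately naive Python sketch:

    # A tiny, purely illustrative history of operations against a single register,
    # recorded in the order the operations completed.
    history = [
        ("client-0", "write", 1),
        ("client-1", "read", 1),
        ("client-0", "write", 2),
        ("client-1", "read", 1),  # a stale read that a consistency checker should flag
    ]

    def check_register(history):
        """Naively require every read to return the most recently completed write."""
        last_written = None
        violations = []
        for client, op, value in history:
            if op == "write":
                last_written = value
            elif op == "read" and value != last_written:
                violations.append(f"{client} read {value} but the last write was {last_written}")
        return violations or ["no violations found (in this simplified, sequential model)"]

    for line in check_register(history):
        print(line)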
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
What are the pieces of advice that you wish you had received early in your career of data engineering? If you hand a book to a new data engineer, what wisdom would you add to it? I’m working with O’Reilly on a project to collect the 97 things that every data engineer should know, and I need your help. Go to dataengineeringpodcast.com/97things to add your voice and share your hard-earned expertise.
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $60 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
If you’ve been exploring scalable, cost-effective and secure ways to collect and route data across your organization, RudderStack is the only solution that helps you turn your own warehouse into a state of the art customer data platform. Their mission is to empower data engineers to fully own their customer data infrastructure and easily push value to other parts of the organization, like marketing and product management. With their open-source foundation, fixed pricing, and unlimited volume, they are enterprise ready, but accessible to everyone. Go to dataengineeringpodcast.com/rudder to request a demo and get one free month of access to the hosted platform along with a free t-shirt.
You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data platforms. For more opportunities to stay up to date, gain new skills, and learn from your peers there are a growing number of virtual events that you can attend from the comfort and safety of your home. Go to dataengineeringpodcast.com/conferences to check out the upcoming events being offered by our partners and get registered today!
Your host is Tobias Macey and today I’m interviewing Kyle Kingsbury about his work on the Jepsen testing framework and the failure modes of distributed systems
Interview
Introduction
How did you get involved in the area of data management?
Can you start by describing what the Jepsen project is?
What was your inspiration for starting the project?
What other methods are available for evaluating and stress testing distributed systems?
What are some of the common misconceptions or misunderstanding of distributed systems guarantees and how they impact real world usage of things like databases?
How do you approach the design of a test suite for a new distributed system?
What is your heuristic for determining the completeness of your test suite?
What are some of the common challenges of setting up a representative deployment for testing?
Can you walk through the workflow of setting up, running, and evaluating the output of a Jepsen test?
How is Jepsen implemented?
How has the design evolved since you first began working on it?
What are the pros and cons of using Clojure for building Jepsen?
If you were to start over today on the Jepsen framework what would you do differently?
What are some of the most common failure modes that you have identified in the platforms that you have tested?
What have you found to be the most difficult to resolve distributed systems bugs?
What are some of the interesting developments in distributed systems design that you are keeping an eye on?
How do you perceive the impact that Jepsen has had on modern distributed systems products?
What have you found to be the most interesting, unexpected, or challenging lessons learned while building Jepsen and evaluating mission critical systems?
What do you have planned for the future of the Jepsen framework?
Contact Info
aphyr on GitHub
Website
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Links
Jepsen
Riak
Distributed Systems
TLA+
Coq
Isabelle
Cassandra DTest
FoundationDB
Podcast Episode
CRDT == Conflict-free Replicated Data-type
Podcast Episode
Riemann
Clojure
JVM == Java Virtual Machine
Kotlin
Haskell
Scala
Groovy
TiDB
YugabyteDB
Podcast Episode
CockroachDB
Podcast Episode
Raft consensus algorithm
Paxos
Leslie Lamport
Calvin
FaunaDB
Podcast Episode
Heidi Howard
CALM Conjecture
Causal Consistency
Hillel Wayne
Christopher Meiklejohn
Distsys Class
Distributed Systems For Fun And Profit by Mikito Takada
Christopher Meiklejohn Reading List
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
7/28/2020 • 49 minutes, 38 seconds
Making Wind Energy More Efficient With Data At Turbit Systems
Summary
Wind energy is an important component of an ecologically friendly power system, but there are a number of variables that can affect the overall efficiency of the turbines. Michael Tegtmeier founded Turbit Systems to help operators of wind farms identify and correct problems that contribute to suboptimal power outputs. In this episode he shares the story of how he got started working with wind energy, the system that he has built to collect data from the individual turbines, and how he is using machine learning to provide valuable insights to produce higher energy outputs. This was a great conversation about using data to improve the way the world works.
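The episode does not walk through code, but the general shape of the analysis, comparing observed output against an expected power curve and flagging large shortfalls, can be sketched roughly as follows (all numbers are made up, and a real system would model far more variables such as air density, pitch, and yaw):

    import numpy as np

    # Hypothetical history of healthy operation (wind speed in m/s, power in kW).
    hist_ws = np.array([3, 4, 5, 6, 7, 8, 9, 10, 11, 12], dtype=float)
    hist_kw = np.array([90, 180, 340, 560, 860, 1230, 1620, 1950, 2180, 2300], dtype=float)

    # Fit a crude reference power curve; a production system would train a model per turbine.
    curve = np.polyfit(hist_ws, hist_kw, deg=3)

    def check_reading(wind_speed_ms: float, power_kw: float, tolerance: float = 0.15) -> str:
        expected = float(np.polyval(curve, wind_speed_ms))
        shortfall = (expected - power_kw) / expected
        if shortfall > tolerance:
            return f"alert: {power_kw:.0f} kW at {wind_speed_ms} m/s is {shortfall:.0%} below expected"
        return "ok"

    print(check_reading(9.5, 1250.0))  # well below the fitted curve, so likely an alert
    print(check_reading(6.0, 545.0))   # within tolerance of the fitted curve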
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
What are the pieces of advice that you wish you had received early in your career of data engineering? If you hand a book to a new data engineer, what wisdom would you add to it? I’m working with O’Reilly on a project to collect the 97 things that every data engineer should know, and I need your help. Go to dataengineeringpodcast.com/97things to add your voice and share your hard-earned expertise.
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $60 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
Today’s episode of the Data Engineering Podcast is sponsored by Datadog, a SaaS-based monitoring and analytics platform for cloud-scale infrastructure, applications, logs, and more. Datadog uses machine-learning based algorithms to detect errors and anomalies across your entire stack—which reduces the time it takes to detect and address outages and helps promote collaboration between Data Engineering, Operations, and the rest of the company.
Go to dataengineeringpodcast.com/datadog today to start your free 14 day trial. If you start a trial and install Datadog’s agent, Datadog will send you a free T-shirt.
You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data platforms. For more opportunities to stay up to date, gain new skills, and learn from your peers there are a growing number of virtual events that you can attend from the comfort and safety of your home. Go to dataengineeringpodcast.com/conferences to check out the upcoming events being offered by our partners and get registered today!
Your host is Tobias Macey and today I’m interviewing Michael Tegtmeier about Turbit, a machine learning powered platform for performance monitoring of wind farms
Interview
Introduction
How did you get involved in the area of data management?
Can you start by describing what you are building at Turbit and your motivation for creating the business?
What are the most problematic factors that contribute to low performance in power generation with wind turbines?
What is the current state of the art for accessing and analyzing data for wind farms?
What information are you able to gather from the SCADA systems in the turbine?
How uniform is the availability and formatting of data from different manufacturers?
How are you handling data collection for the individual turbines?
How much information are you processing at the point of collection vs. sending to a centralized data store?
Can you describe the system architecture of Turbit and the lifecycle of turbine data as it propagates from collection to analysis?
How do you incorporate domain knowledge into the identification of useful data and how it is used in the resultant models?
What are some of the most challenging aspects of building an analytics product for the wind energy sector?
What have you found to be the most interesting, unexpected, or challenging aspects of building and growing Turbit?
What do you have planned for the future of the technology and business?
Contact Info
LinkedIn
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Links
Turbit Systems
LIDAR
Pulse Shaping
Wind Turbine
SCADA
Genetic Algorithm
Bremen Germany
Pitch
Yaw
Nacelle
Anemometer
Neural Network
Swarm64
Podcast Episode
Tensorflow
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
7/21/2020 • 40 minutes, 48 seconds
Open Source Production Grade Data Integration With Meltano
Summary
The first stage of every data pipeline is extracting the information from source systems. There are a number of platforms for managing data integration, but there is a notable lack of a robust and easy to use open source option. The Meltano project is aiming to provide a solution to that situation. In this episode, project lead Douwe Maan shares the history of how Meltano got started, the motivation for the recent shift in focus, and how it is implemented. The Singer ecosystem has laid the groundwork for a great option to empower teams of all sizes to unlock the value of their data, and Meltano is building the remaining structure to make it a fully featured contender for proprietary systems.
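To give a sense of the Singer ecosystem that Meltano builds on, here is a toy Python "tap" that emits the SCHEMA, RECORD, and STATE messages defined by the Singer specification; production taps managed by Meltano are considerably more involved:

    import json
    import sys
    from datetime import datetime, timezone

    # Singer taps write JSON messages to stdout; a loader (target) reads them from stdin.
    def emit(message: dict) -> None:
        sys.stdout.write(json.dumps(message) + "\n")

    emit({
        "type": "SCHEMA",
        "stream": "users",
        "key_properties": ["id"],
        "schema": {
            "type": "object",
            "properties": {"id": {"type": "integer"}, "email": {"type": "string"}},
        },
    })

    for record in [{"id": 1, "email": "a@example.com"}, {"id": 2, "email": "b@example.com"}]:
        emit({"type": "RECORD", "stream": "users", "record": record})

    # Bookmark where extraction left off so the next run can be incremental.
    emit({"type": "STATE", "value": {"users": datetime.now(timezone.utc).isoformat()}})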
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
What are the pieces of advice that you wish you had received early in your career of data engineering? If you hand a book to a new data engineer, what wisdom would you add to it? I’m working with O’Reilly on a project to collect the 97 things that every data engineer should know, and I need your help. Go to dataengineeringpodcast.com/97things to add your voice and share your hard-earned expertise.
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $60 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
Today’s episode of the Data Engineering Podcast is sponsored by Datadog, a SaaS-based monitoring and analytics platform for cloud-scale infrastructure, applications, logs, and more. Datadog uses machine-learning based algorithms to detect errors and anomalies across your entire stack—which reduces the time it takes to detect and address outages and helps promote collaboration between Data Engineering, Operations, and the rest of the company.
Go to dataengineeringpodcast.com/datadog today to start your free 14 day trial. If you start a trial and install Datadog’s agent, Datadog will send you a free T-shirt.
You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data platforms. For more opportunities to stay up to date, gain new skills, and learn from your peers there are a growing number of virtual events that you can attend from the comfort and safety of your home. Go to dataengineeringpodcast.com/conferences to check out the upcoming events being offered by our partners and get registered today!
Your host is Tobias Macey and today I’m interviewing Douwe Maan about Meltano, an open source platform for building, running & orchestrating ELT pipelines.
Interview
Introduction
How did you get involved in the area of data management?
Can you start by describing what Meltano is and the story behind it?
Who is the target audience?
How does the focus on small or early stage organizations constrain the architectural decisions that go into Meltano?
What have you found to be the complexities in trying to encapsulate the entirety of the data lifecycle in a single tool or platform?
What are the most painful transitions in that lifecycle and how does that pain manifest?
How and why has the focus of the project shifted from its original vision?
With your current focus on the data integration/data transfer stage of the lifecycle, what are you seeing as the biggest barriers to entry with the current ecosystem?
What are the main elements of your strategy to address these barriers?
How is the Meltano platform in its current incarnation implemented?
How much of the original architecture have you been able to retain, and how have you evolved it to align with your new direction?
What have you found to be the challenges that your users face when going from the easy on-ramp of local execution to then trying to scale and customize their pipelines for production use?
What are the most critical features that you are focusing on building now to make Meltano competitive with managed platforms?
What are the most interesting, unexpected, or challenging lessons that you have learned while working on and with Meltano?
When is Meltano the wrong choice?
What is your broad vision for the future of Meltano?
What are the most immediate needs for contribution that will help you realize that vision?
Contact Info
Website
DouweM on GitLab
DouweM on GitHub
@DouweM on Twitter
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Links
Meltano
GitLab
Mexico City
Netherlands
Locally Optimistic
Singer
Stitch Data
DBT
ELT
Informatica
Version Control
Code Review
CI/CD
Jupyter Notebook
LookML
Meltano Modeling Syntax
Redash
Metabase
Apache Superset
Apache Airflow
Luigi
Prefect
Dagster
Transferwise
Pipelinewise
12 Factor Application
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
7/13/2020 • 1 hour, 5 minutes, 19 seconds
DataOps For Streaming Systems With Lenses.io
Summary
There are an increasing number of use cases for real time data, and the systems to power them are becoming more mature. Once you have a streaming platform up and running, you need a way to keep an eye on it, including observability, discovery, and governance of your data. That’s what the Lenses.io DataOps platform is built for. In this episode CTO Andrew Stevenson discusses the challenges that arise from building decoupled systems, the benefits of using SQL as the common interface for your data, and the metrics that need to be tracked to keep the overall system healthy. Observability and governance of streaming data require a different approach than batch-oriented workflows, and this episode does an excellent job of outlining the complexities involved and how to address them.
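One of the foundational health metrics for any streaming deployment, whether surfaced through a platform like Lenses or computed by hand, is consumer lag. Here is a rough sketch of measuring it with the kafka-python client; the broker address, topic, and group names are illustrative:

    from kafka import KafkaConsumer, TopicPartition

    # Measure how far a consumer group is behind the head of each partition.
    consumer = KafkaConsumer(
        bootstrap_servers="localhost:9092",
        group_id="orders-service",
        enable_auto_commit=False,
    )
    topic = "orders"
    partitions = [TopicPartition(topic, p) for p in consumer.partitions_for_topic(topic)]
    end_offsets = consumer.end_offsets(partitions)

    for tp in partitions:
        committed = consumer.committed(tp) or 0
        lag = end_offsets[tp] - committed
        print(f"{topic}[{tp.partition}] lag={lag}")
    consumer.close()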
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
What are the pieces of advice that you wish you had received early in your career of data engineering? If you hand a book to a new data engineer, what wisdom would you add to it? I’m working with O’Reilly on a project to collect the 97 things that every data engineer should know, and I need your help. Go to dataengineeringpodcast.com/97things to add your voice and share your hard-earned expertise.
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $60 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
Today’s episode of the Data Engineering Podcast is sponsored by Datadog, a SaaS-based monitoring and analytics platform for cloud-scale infrastructure, applications, logs, and more. Datadog uses machine-learning based algorithms to detect errors and anomalies across your entire stack—which reduces the time it takes to detect and address outages and helps promote collaboration between Data Engineering, Operations, and the rest of the company. Go to dataengineeringpodcast.com/datadog today to start your free 14 day trial. If you start a trial and install Datadog’s agent, Datadog will send you a free T-shirt.
You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data platforms. For more opportunities to stay up to date, gain new skills, and learn from your peers there are a growing number of virtual events that you can attend from the comfort and safety of your home. Go to dataengineeringpodcast.com/conferences to check out the upcoming events being offered by our partners and get registered today!
Your host is Tobias Macey and today I’m interviewing Andrew Stevenson about Lenses.io, a platform to provide real-time data operations for engineers
Interview
Introduction
How did you get involved in the area of data management?
Can you start by describing what Lenses is and the story behind it?
What is your working definition for what constitutes DataOps?
How does the Lenses platform support the cross-cutting concerns that arise when trying to bridge the different roles in an organization to deliver value with data?
What are the typical barriers to collaboration, and how does Lenses help with that?
Many different systems provide a SQL interface to streaming data on various substrates. What was your reason for building your own SQL engine and what is unique about it?
What are the main challenges that you see engineers facing when working with streaming systems?
What have you found to be the most notable evolutions in the community and ecosystem around Kafka and streaming platforms?
One of the interesting features in the recent release is support for topologies to map out the relations between different producers and consumers across a stream. Why is that a difficult problem and how have you approached it?
On the point of monitoring, what are the foundational challenges that engineers run into when trying to gain visibility into streams of data?
What are some useful strategies for collecting and analyzing traces of data flows?
As with many things in the space of data, local development and pre-production testing and validation are complicated due to the potential scale and variability of a production system. What advice do you have for engineers who are trying to establish a sustainable workflow for streaming applications?
How do you facilitate the CI/CD process for enabling a culture of testing and establishing confidence in the correct functionality of your systems?
How is the Lenses platform implemented and how has its design evolved since you first began working on it?
What are some of the specifics of Kafka that you have had to reconsider or redesign as you began adding support for additional streaming engines (e.g. Redis and Pulsar)?
What are some of the most interesting, unexpected, or innovative ways that you have seen the Lenses platform used?
What are some of the most interesting, unexpected, or challenging lessons that you have learned while working on and with Lenses?
When is Lenses the wrong choice?
What do you have planned for the future of the platform?
Contact Info
LinkedIn
@StevensonA_D on Twitter
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com) with your story.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Links
Lenses.io
Babylon Health
DevOps
DataOps
GitOps
Apache Calcite
kSQL
Kafka Connect Query Language
Apache Flink
Podcast Episode
Apache Spark
Podcast Episode
Apache Pulsar
Podcast Episode
StreamNative Episode
Playtika
Riskfuel(?)
JMX Metrics
Amazon MSK (Managed Streaming for Kafka)
Prometheus
Canary Deployment
Kafka on Pulsar
Data Catalog
Data Mesh
Podcast Episode
Dagster
Airflow
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
7/6/2020 • 45 minutes, 36 seconds
Data Collection And Management To Power Sound Recognition At Audio Analytic
Summary
We have machines that can listen to and process human speech in a variety of languages, but dealing with unstructured sounds in our environment is a much greater challenge. The team at Audio Analytic are working to impart a sense of hearing to our myriad devices with their sound recognition technology. In this episode Dr. Chris Mitchell and Dr. Thomas le Cornu describe the challenges that they are faced with in the collection and labelling of high quality data to make this possible, including the lack of a publicly available collection of audio samples to work from, the need for custom metadata throughout the processing pipeline, and the need for customized data processing tools for working with sound data. This was a great conversation about the complexities of working in a niche domain of data analysis and how to build a pipeline of high quality data from collection to analysis.
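The episode highlights how much of the effort goes into labelling and carrying custom metadata alongside each audio sample. Below is a rough sketch of what per-sample metadata might look like in a labelling pipeline; the field names and taxonomy labels are invented for illustration and are not Audio Analytic's actual schema.

```python
# Sketch of attaching custom metadata to collected audio samples before they
# enter a processing pipeline. Field names and labels are hypothetical.
import json
from dataclasses import dataclass, asdict
from datetime import datetime, timezone

@dataclass
class AudioSample:
    sample_id: str
    file_path: str
    duration_seconds: float
    sample_rate_hz: int
    microphone: str      # device the recording was captured on
    environment: str     # e.g. "kitchen", "street", "anechoic chamber"
    labels: list         # taxonomy labels, e.g. ["glass_break"]
    collected_at: str

sample = AudioSample(
    sample_id="smp-000123",
    file_path="raw/2020/06/smp-000123.wav",
    duration_seconds=4.2,
    sample_rate_hz=16000,
    microphone="mems-array-v2",
    environment="kitchen",
    labels=["glass_break", "background_speech"],
    collected_at=datetime.now(timezone.utc).isoformat(),
)

# Persist the metadata next to the audio file so it travels with the sample.
with open("smp-000123.json", "w") as f:
    json.dump(asdict(sample), f, indent=2)
```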
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
What are the pieces of advice that you wish you had received early in your career of data engineering? If you hand a book to a new data engineer, what wisdom would you add to it? I’m working with O’Reilly on a project to collect the 97 things that every data engineer should know, and I need your help. Go to dataengineeringpodcast.com/97things to add your voice and share your hard-earned expertise.
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $60 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data platforms. For more opportunities to stay up to date, gain new skills, and learn from your peers there are a growing number of virtual events that you can attend from the comfort and safety of your home. Go to dataengineeringpodcast.com/conferences to check out the upcoming events being offered by our partners and get registered today!
Your host is Tobias Macey and today I’m interviewing Dr. Chris Mitchell and Dr. Thomas le Cornu about Audio Analytic, a company that is building sound recognition technology that is giving machines a sense of hearing beyond speech and music
Interview
Introduction
How did you get involved in the area of data management?
Can you start by describing what you are building at Audio Analytic?
What was your motivation for building an AI platform for sound recognition?
What are some of the ways that your platform is being used?
What are the unique challenges that you have faced in working with arbitrary sound data?
How do you handle the collection and labelling of the source data that you rely on for building your models?
Beyond just collection and storage, what is your process for defining a taxonomy of the audio data that you are working with?
How has the taxonomy had to evolve, and what assumptions have had to change, as you progressed in building the data set and the resulting models?
challenges of building an embeddable AI model
update cycle
difficulty of identifying relevant audio and dealing with literal noise in the input data
rights and ownership challenges in collection of source data
What was your design process for constructing a pipeline for the audio data that you need to process?
Can you describe how your overall data management system is architected?
How has that architecture evolved since you first began building and using it?
A majority of data tools are oriented around, and optimized for, collection and processing of textual data. How much off-the-shelf technology have you been able to use for working with audio?
What are some of the assumptions that you made at the start which have been shown to be inaccurate or in need of reconsidering?
How do you address variability in the duration of source samples in the processing pipeline?
How much of an issue do you face as a result of the variable quality of microphones in the embedded devices where the model is being run?
What are the limitations of the model in dealing with complex and layered audio environments?
How has the testing and evaluation of your model fed back into your strategies for collecting source data?
What are some of the weirdest or most unusual sounds that you have worked with?
What have been the most interesting, unexpected, or challenging lessons that you have learned in the process of building the technology and business of Audio Analytic?
What do you have planned for the future of the company?
Contact Info
Chris
LinkedIn
Thomas
LinkedIn
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Links
Audio Analytic
Twitter
Anechoic Chamber
EXIF Data
ID3 Tags
Polyphonic Sound Detection Score
GitHub Repository
ICASSP
CES
M0+ ARM Processor
Context Systems Blog Post
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
6/30/2020 • 57 minutes, 28 seconds
Bringing Business Analytics To End Users With GoodData
Summary
The majority of analytics platforms are focused on internal use by business stakeholders within an organization. As the availability of data increases and overall literacy in how to interpret it and take action improves, there is a growing need to bring business intelligence use cases to a broader audience. GoodData is a platform focused on simplifying the work of bringing data to employees and end users. In this episode Sheila Jung and Philip Farr discuss how the GoodData platform is being used, how it is architected to provide scalable and performant analytics, and how it integrates into customers’ data platforms. This was an interesting conversation about a different approach to business intelligence and the importance of expanded access to data.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
What are the pieces of advice that you wish you had received early in your career of data engineering? If you hand a book to a new data engineer, what wisdom would you add to it? I’m working with O’Reilly on a project to collect the 97 things that every data engineer should know, and I need your help. Go to dataengineeringpodcast.com/97things to add your voice and share your hard-earned expertise.
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $60 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
GoodData is revolutionizing the way in which companies provide analytics to their customers and partners. Start now with GoodData Free, which makes our self-service analytics platform available to you at no cost. Register today at dataengineeringpodcast.com/gooddata
Your host is Tobias Macey and today I’m interviewing Sheila Jung and Philip Farr about how GoodData is building a platform that lets you share your analytics outside the boundaries of your organization
Interview
Introduction
How did you get involved in the area of data management?
Can you start by describing what you are building at GoodData and some of its origin story?
The business intelligence market has been around for decades now and there are dozens of options with different areas of focus. What are the factors that might motivate me to choose GoodData over the other contenders in the space?
What are the use cases and industries that you focus on supporting with GoodData?
How has the market of business intelligence tools evolved in recent years?
What are the contributing trends in technology and business use cases that are driving that change?
What are some of the ways that your customers are embedding analytics into their own products?
What are the differences in processing and serving capabilities between an internally used business intelligence tool, and one that is used for embedding into externally used systems?
What unique challenges are posed by the embedded analytics use case?
How do you approach topics such as security, access control, and latency in a multitenant analytics platform?
What guidelines have you found to be most useful when addressing the concerns of accuracy and interpretability of the data being presented?
How is the GoodData platform architected?
What are the complexities that you have had to design around in order to provide performant access to your customers’ data sources in an interactive use case?
What are the off-the-shelf components that you have been able to integrate into the platform, and what are the driving factors for solutions that have been built specifically for the GoodData use case?
What is the process for your users to integrate GoodData into their existing data platform?
What is the workflow for someone building a data product in GoodData?
How does GoodData manage the lifecycle of the data that your customers are presenting to their end users?
How does GoodData integrate into the customer development lifecycle?
What are some of the most interesting, unexpected, or challenging lessons that you have learned while working on and with GoodData?
Can you give an overview of the MAQL (Multi-Dimension Analytical Query Language) dialect that you use in GoodData and contrast it with SQL?
What are the benefits and additional functionality that MAQL provides?
When is GoodData the wrong choice?
What is on the roadmap for the future of GoodData?
Contact Info
Sheila
LinkedIn
Philip
LinkedIn
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Links
GoodData
Teradata
ReactJS
SnowflakeDB
Podcast Episode
Redshift
BigQuery
SOC2
HIPAA
GDPR == General Data Protection Regulation
IoT == Internet of Things
SAML
Ruby
Multi-Dimension Analytical Query Language
Kubernetes
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
6/23/2020 • 52 minutes, 24 seconds
Accelerate Your Machine Learning With The StreamSQL Feature Store
Summary
Machine learning is a process driven by iteration and experimentation, which requires fast and easy access to relevant features of the data being processed. In order to reduce friction in the process of developing and delivering models, there has been a recent trend toward building a dedicated feature store. In this episode Simba Khadder discusses his work at StreamSQL building a feature store to make creation, discovery, and monitoring of features fast and easy to manage. He describes the architecture of the system, the benefits of streaming data for machine learning, and how a feature store provides a useful interface between data engineers and machine learning engineers to reduce communication overhead.
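As a rough illustration of the feature store workflow described here, the sketch below registers a feature definition once and uses it for both the training and serving paths. The FeatureStore class is a generic stand-in invented for this example, not the actual StreamSQL client library.

```python
# Generic feature store sketch: a single feature definition feeds both the
# training snapshot and the low-latency serving lookup. The FeatureStore
# class is a hypothetical stand-in, not the StreamSQL API.
from collections import defaultdict

class FeatureStore:
    def __init__(self):
        self.definitions = {}
        self.online = defaultdict(dict)   # entity_id -> {feature: value}

    def register(self, name, transform):
        """Register a named feature computed incrementally from raw events."""
        self.definitions[name] = transform

    def ingest(self, entity_id, event):
        """Update online feature values as events stream in (serving path)."""
        for name, transform in self.definitions.items():
            previous = self.online[entity_id].get(name)
            self.online[entity_id][name] = transform(event, previous)

    def training_rows(self):
        """Snapshot current values as training examples (training path)."""
        return [{"entity_id": k, **v} for k, v in self.online.items()]

    def serve(self, entity_id):
        """Low-latency lookup of the latest feature values for one entity."""
        return self.online[entity_id]

store = FeatureStore()
store.register("purchase_count", lambda event, prev: (prev or 0) + 1)
store.ingest("user-1", {"type": "purchase", "amount": 42})
store.ingest("user-1", {"type": "purchase", "amount": 7})

print(store.serve("user-1"))      # {'purchase_count': 2}
print(store.training_rows())      # [{'entity_id': 'user-1', 'purchase_count': 2}]
```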
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
What are the pieces of advice that you wish you had received early in your career of data engineering? If you hand a book to a new data engineer, what wisdom would you add to it? I’m working with O’Reilly on a project to collect the 97 things that every data engineer should know, and I need your help. Go to dataengineeringpodcast.com/97things to add your voice and share your hard-earned expertise.
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $60 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
Your host is Tobias Macey and today I’m interviewing Simba Khadder about his views on the importance of ML feature stores, and his experience implementing one at StreamSQL
Interview
Introduction
How did you get involved in the areas of machine learning and data management?
What is StreamSQL and what motivated you to start the business?
Can you describe what a machine learning feature is?
What is the difference between generating features for training a model and generating features for serving?
How is feature management typically handled today?
What is a feature store and how is it different from the status quo?
What is the overall lifecycle of identifying useful features, defining and generating them, using them for training, and then serving them in production?
How does the usage of a feature store impact the workflow of ML engineers/data scientists and data engineers?
What are the general requirements of a feature store?
What additional capabilities or tangential services are necessary for providing a pleasant UX for a feature store?
How is discovery and documentation of features handled?
What is the current landscape of feature stores and how does StreamSQL compare?
How is the StreamSQL feature store implemented?
How is the supporting infrastructure architected and how has it evolved since you first began working on it?
Why is streaming data such a focal point of feature stores?
How do you generate features for training?
How do you approach monitoring of features and what does remediation look like for a feature that is no longer valid?
How do you handle versioning and deploying features?
What’s the process for integrating data sources into StreamSQL for processing into features?
How are the features materialized?
What are the most challenging or complex aspects of working on or with a feature store?
When is StreamSQL the wrong choice for a feature store?
What are the most interesting, challenging, or unexpected lessons that you have learned in the process of building StreamSQL?
What do you have planned for the future of the product?
Contact Info
LinkedIn
@simba_khadder on Twitter
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Links
StreamSQL
Feature Stores for ML
Distributed Systems
Google Cloud Datastore
Triton
Uber Michelangelo
AirBnB Zipline
Lyft Dryft
Apache Flink
Podcast Episode
Apache Kafka
Spark Streaming
Apache Cassandra
Redis
Apache Pulsar
Podcast Episode
StreamNative Episode
TDD == Test Driven Development
Lyft presentation – Bootstrapping Flink
Go-Jek Feast
Hopsworks
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
6/15/2020 • 46 minutes, 12 seconds
Data Management Trends From An Investor Perspective
Summary
The landscape of data management and processing is rapidly changing and evolving. There are certain foundational elements that have remained steady, but as the industry matures new trends emerge and gain prominence. In this episode Astasia Myers of Redpoint Ventures shares her perspective as an investor on which categories she is paying particular attention to for the near to medium term. She discusses the work being done to address challenges in the areas of data quality, observability, discovery, and streaming. This is a useful conversation to gain a macro perspective on where businesses are looking to improve their capabilities to work with data.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
What are the pieces of advice that you wish you had received early in your career of data engineering? If you hand a book to a new data engineer, what wisdom would you add to it? I’m working with O’Reilly on a project to collect the 97 things that every data engineer should know, and I need your help. Go to dataengineeringpodcast.com/97things to add your voice and share your hard-earned expertise.
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar to get you up and running in no time. With simple pricing, fast networking, S3 compatible object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $60 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
You listen to this show because you love working with data and want to keep your skills up to date. Machine learning is finding its way into every aspect of the data landscape. Springboard has partnered with us to help you take the next step in your career by offering a scholarship to their Machine Learning Engineering career track program. In this online, project-based course every student is paired with a Machine Learning expert who provides unlimited 1:1 mentorship support throughout the program via video conferences. You’ll build up your portfolio of machine learning projects and gain hands-on experience in writing machine learning algorithms, deploying models into production, and managing the lifecycle of a deep learning prototype. Springboard offers a job guarantee, meaning that you don’t have to pay for the program until you get a job in the space. The Data Engineering Podcast is exclusively offering listeners 20 scholarships of $500 to eligible applicants. It only takes 10 minutes and there’s no obligation. Go to dataengineeringpodcast.com/springboard and apply today! Make sure to use the code AISPRINGBOARD when you enroll.
Your host is Tobias Macey and today I’m interviewing Astasia Myers about the trends in the data industry that she sees as an investor at Redpoint Ventures
Interview
Introduction
How did you get involved in the area of data management?
Can you start by giving an overview of Redpoint Ventures and your role there?
From an investor perspective, what is most appealing about the category of data-oriented businesses?
What are the main sources of information that you rely on to keep up to date with what is happening in the data industry?
What is your personal heuristic for determining the relevance of any given piece of information to decide whether it is worthy of further investigation?
As someone who works closely with a variety of companies across different industry verticals and different areas of focus, what are some of the common trends that you have identified in the data ecosystem?
In your article covering the trends you are keeping an eye on for 2020 you call out four in particular: data quality, data catalogs, observability of what influences critical business indicators, and streaming data. Taking those in turn:
What are the driving factors that influence data quality, and what elements of that problem space are being addressed by the companies you are watching?
What are the unsolved areas that you see as being viable for newcomers?
What are the challenges faced by businesses in establishing and maintaining data catalogs?
What approaches are being taken by the companies who are trying to solve this problem?
What shortcomings do you see in the available products?
For gaining visibility into the forces that impact the key performance indicators (KPI) of businesses, what is lacking in the current approaches?
What additional information needs to be tracked to provide the needed context for making informed decisions about what actions to take to improve KPIs?
What challenges do businesses in this observability space face to provide useful access and analysis to this collected data?
Streaming is an area that has been growing rapidly over the past few years, with many open source and commercial options. What are the major business opportunities that you see to make streaming more accessible and effective?
What are the main factors that you see as driving this growth in the need for access to streaming data?
With your focus on these trends, how does that influence your investment decisions and where you spend your time?
What are the unaddressed markets or product categories that you see which would be lucrative for new businesses?
In most areas of technology now there is a mix of open source and commercial solutions to any given problem, with varying levels of maturity and polish between them. What are your views on the balance of this relationship in the data ecosystem?
For data in particular, there is a strong potential for vendor lock-in which can cause potential customers to avoid adoption of commercial solutions. What has been your experience in that regard with the companies that you work with?
Contact Info
@AstasiaMyers on Twitter
@astasia on Medium
LinkedIn
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Links
Redpoint Ventures
4 Data Trends To Watch in 2020
Seagate
Western Digital
Pure Storage
Cisco
Cohesity
Looker
Podcast Episode
DGraph
Podcast Episode
Dremio
Podcast Episode
SnowflakeDB
Podcast Episode
ThoughtSpot
Tibco
Elastic
Splunk
Informatica
Data Council
DataCoral
Mattermost
Bitwarden
Snowplow
Podcast Interview
Interview About Snowplow Infrastructure
CHAOSSEARCH
Podcast Episode
Kafka Streams
Pulsar
Podcast Interview
Followup Podcast Interview
Soda
Toro
Great Expectations
Alation
Collibra
Amundsen
DataHub
Netflix Metacat
Marquez
Podcast Episode
LDAP == Lightweight Directory Access Protocol
Anodot
Databricks
Flink
Podcast Episode
Zookeeper
Podcast Episode
Pravega
Podcast Episode
Airtable
Alteryx
CockroachDB
Podcast Episode
Superset
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
6/8/2020 • 54 minutes, 58 seconds
Building A Data Lake For The Database Administrator At Upsolver
Summary
Data lakes offer a great deal of flexibility and the potential for reduced cost for your analytics, but they also introduce a great deal of complexity. What used to be entirely managed by the database engine is now a composition of multiple systems that need to be properly configured to work in concert. In order to bring the DBA into the new era of data management the team at Upsolver added a SQL interface to their data lake platform. In this episode Upsolver CEO Ori Rafael and CTO Yoni Iny describe how they have grown their platform deliberately to allow for layering SQL on top of a robust foundation for creating and operating a data lake, how to bring more people on board to work with the data being collected, and the unique benefits that a data lake provides. This was an interesting look at the impact that the interface to your data can have on who is empowered to work with it.
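To give a flavor of what "SQL over a data lake" looks like in practice, the snippet below declares a table over raw files in object storage and then materializes an aggregation from it. The DDL is a generic illustration of the pattern rather than Upsolver's exact dialect, and the submit() helper is a placeholder for whatever client the platform exposes.

```python
# Illustration of the "SQL layer over a data lake" pattern: declare a table
# over raw files in object storage, then materialize an aggregation on top.
# The DDL is generic (not Upsolver's dialect) and submit() is a placeholder.

RAW_TABLE = """
CREATE TABLE raw_events (
    event_time TIMESTAMP,
    user_id    STRING,
    event_type STRING,
    amount     DOUBLE
)
LOCATION 's3://my-bucket/events/'    -- hypothetical bucket
STORED AS PARQUET;
"""

DAILY_ROLLUP = """
CREATE MATERIALIZED VIEW daily_revenue AS
SELECT DATE(event_time) AS day,
       SUM(amount)      AS revenue
FROM raw_events
WHERE event_type = 'purchase'
GROUP BY DATE(event_time);
"""

def submit(sql: str) -> None:
    """Placeholder for the platform's query submission API."""
    print("submitting:\n", sql)

for statement in (RAW_TABLE, DAILY_ROLLUP):
    submit(statement)
```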
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
What are the pieces of advice that you wish you had received early in your career of data engineering? If you hand a book to a new data engineer, what wisdom would you add to it? I’m working with O’Reilly on a project to collect the 97 things that every data engineer should know, and I need your help. Go to dataengineeringpodcast.com/97things to add your voice and share your hard-earned expertise.
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With their managed Kubernetes platform it’s now even easier to deploy and scale your workflows, or try out the latest Helm charts from tools like Pulsar and Pachyderm. With simple pricing, fast networking, object storage, and worldwide data centers, you’ve got everything you need to run a bulletproof data platform. Go to dataengineeringpodcast.com/linode today and get a $60 credit to try out a Kubernetes cluster of your own. And don’t forget to thank them for their continued support of this show!
You listen to this show because you love working with data and want to keep your skills up to date. Machine learning is finding its way into every aspect of the data landscape. Springboard has partnered with us to help you take the next step in your career by offering a scholarship to their Machine Learning Engineering career track program. In this online, project-based course every student is paired with a Machine Learning expert who provides unlimited 1:1 mentorship support throughout the program via video conferences. You’ll build up your portfolio of machine learning projects and gain hands-on experience in writing machine learning algorithms, deploying models into production, and managing the lifecycle of a deep learning prototype. Springboard offers a job guarantee, meaning that you don’t have to pay for the program until you get a job in the space. The Data Engineering Podcast is exclusively offering listeners 20 scholarships of $500 to eligible applicants. It only takes 10 minutes and there’s no obligation. Go to dataengineeringpodcast.com/springboard and apply today! Make sure to use the code AISPRINGBOARD when you enroll.
Your host is Tobias Macey and today I’m interviewing Ori Rafael and Yoni Iny about building a data lake for the DBA at Upsolver
Interview
Introduction
How did you get involved in the area of data management?
Can you start by sharing your definition of what a data lake is and what it is comprised of?
We talked last in November of 2018. How has the landscape of data lake technologies and adoption changed in that time?
How has Upsolver changed or evolved since we last spoke?
How has the evolution of the underlying technologies impacted your implementation and overall product strategy?
What are some of the common challenges that accompany a data lake implementation?
How do those challenges influence the adoption or viability of a data lake?
How does the introduction of a universal SQL layer change the staffing requirements for building and maintaining a data lake?
What are the advantages of a data lake over a data warehouse if everything is being managed via SQL anyway?
What are some of the underlying realities of the data systems that power the lake which will eventually need to be understood by the operators of the platform?
How is the SQL layer in Upsolver implemented?
What are the most challenging or complex aspects of managing the underlying technologies to provide automated partitioning, indexing, etc.?
What are the main concepts that you need to educate your customers on?
What are some of the pitfalls that users should be aware of?
What features of your platform are often overlooked or underutilized which you think should be more widely adopted?
What have you found to be the most interesting, unexpected, or challenging lessons learned while building the technical and business elements of Upsolver?
What do you have planned for the future?
Contact Info
Ori
LinkedIn
Yoni
yoniiny on GitHub
LinkedIn
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
Upsolver
Podcast Episode
DBA == Database Administrator
IDF == Israel Defense Forces
Data Lake
Eventual Consistency
Apache Spark
Redshift Spectrum
Azure Synapse Analytics
SnowflakeDB
Podcast Episode
BigQuery
Presto
Podcast Episode
Apache Kafka
Cartesian Product
kSQLDB
Podcast Episode
Eventador
Podcast Episode
Materialize
Podcast Episode
Common Table Expressions
Lambda Architecture
Kappa Architecture
Apache Flink
Podcast Episode
Reinforcement Learning
Cloudformation
GDPR
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
6/2/2020 • 56 minutes, 17 seconds
Mapping The Customer Journey For B2B Companies At Dreamdata
Summary
Gaining a complete view of the customer journey is especially difficult in B2B companies. This is due to the number of different individuals involved and the myriad ways that they interface with the business. Dreamdata integrates data from the multitude of platforms that are used by these organizations so that they can get a comprehensive view of their customer lifecycle. In this episode Ole Dallerup explains how Dreamdata was started, how their platform is architected, and the challenges inherent to data management in the B2B space. This conversation is a useful look into how data engineering and analytics can have a direct impact on the success of the business.
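A recurring theme in this conversation is stitching together one company's touchpoints from CRM, support, and marketing systems. The sketch below shows a deliberately naive version of that entity resolution step, keyed on normalized email domain; the field names and matching rule are simplified assumptions, not Dreamdata's implementation.

```python
# Naive entity resolution across B2B data sources: group touchpoints from
# different systems under one account by normalized email domain. This is a
# simplified illustration, not Dreamdata's actual matching logic.
from collections import defaultdict

crm_contacts = [
    {"source": "salesforce", "email": "jane@acme.com", "event": "demo_booked"},
    {"source": "hubspot", "email": "Joe@Acme.com", "event": "whitepaper_download"},
]
support_tickets = [
    {"source": "zendesk", "email": "ops@acme.com", "event": "ticket_opened"},
]

FREE_MAIL = {"gmail.com", "outlook.com", "yahoo.com"}

def account_key(email: str):
    """Use the email domain as the account identifier, skipping free mail."""
    domain = email.split("@")[-1].lower()
    return None if domain in FREE_MAIL else domain

journeys = defaultdict(list)
for touchpoint in crm_contacts + support_tickets:
    key = account_key(touchpoint["email"])
    if key:
        journeys[key].append((touchpoint["source"], touchpoint["event"]))

print(dict(journeys))
# {'acme.com': [('salesforce', 'demo_booked'),
#               ('hubspot', 'whitepaper_download'),
#               ('zendesk', 'ticket_opened')]}
```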
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
What are the pieces of advice that you wish you had received early in your career of data engineering? If you hand a book to a new data engineer, what wisdom would you add to it? I’m working with O’Reilly on a project to collect the 97 things that every data engineer should know, and I need your help. Go to dataengineeringpodcast.com/97things to add your voice and share your hard-earned expertise.
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, a 40Gbit public network, fast object storage, and a brand new managed Kubernetes platform, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. And for your machine learning workloads, they’ve got dedicated CPU and GPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
You listen to this show because you love working with data and want to keep your skills up to date. Machine learning is finding its way into every aspect of the data landscape. Springboard has partnered with us to help you take the next step in your career by offering a scholarship to their Machine Learning Engineering career track program. In this online, project-based course every student is paired with a Machine Learning expert who provides unlimited 1:1 mentorship support throughout the program via video conferences. You’ll build up your portfolio of machine learning projects and gain hands-on experience in writing machine learning algorithms, deploying models into production, and managing the lifecycle of a deep learning prototype. Springboard offers a job guarantee, meaning that you don’t have to pay for the program until you get a job in the space. The Data Engineering Podcast is exclusively offering listeners 20 scholarships of $500 to eligible applicants. It only takes 10 minutes and there’s no obligation. Go to dataengineeringpodcast.com/springboard and apply today! Make sure to use the code AISPRINGBOARD when you enroll.
Your host is Tobias Macey and today I’m interviewing Ole Dallerup about Dreamdata, a platform for simplifying data integration for B2B companies
Interview
Introduction
How did you get involved in the area of data management?
Can you start by describing what you are building at Dreamdata?
What was your inspiration for starting a company and what keeps you motivated?
How do the data requirements differ between B2C and B2B companies?
What are the challenges that B2B companies face in gaining visibility across the lifecycle of their customers?
How does that lack of visibility impact the viability or growth potential of the business?
What are the factors that contribute to silos in visibility of customer activity within a business?
What are the data sources that you are dealing with to generate meaningful analytics for your customers?
What are some of the challenges that businesses face in either generating or collecting useful information about their customer interactions?
How is the technical platform of Dreamdata implemented and how has it evolved since you first began working on it?
What are some of the ways that you approach entity resolution across the different channels and data sources?
How do you reconcile the information collected from different sources that might use disparate data formats and representations?
What is the onboarding process for your customers to identify and integrate with all of their systems?
How do you approach the definition of the schema model for the database that your customers implement for storing their footprint?
Do you allow for customization by the customer?
Do you rely on a tool such as DBT for populating the table definitions and transformations from the source data?
How do you approach representation of the analysis and actionable insights to your customers so that they are able to accurately interpret the results?
How have your own experiences at Dreamdata influenced the areas that you invest in for the product?
What are some of the most interesting or surprising insights that you have been able to gain as a result of the unified view that you are building?
What are some of the most challenging, interesting, or unexpected lessons that you have learned from building and growing the technical and business elements of Dreamdata?
When might a user be better served by building their own pipelines or analysis for tracking their customer interactions?
What do you have planned for the future of Dreamdata?
What are some of the industry trends that you are keeping an eye on and what potential impacts to your business do you anticipate?
Contact Info
LinkedIn
@oledallerup on Twitter
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Links
Dreamdata
Poker Tracker
TrustPilot
Zendesk
Salesforce
Hubspot
Google BigQuery
SnowflakeDB
Podcast Episode
AWS Redshift
Singer
Stitch Data
Dataform
Podcast Episode
DBT
Podcast Episode
Segment
Podcast Episode
Cloud Dataflow
Apache Beam
UTM Parameters
Clearbit
Capterra
G2 Crowd
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
5/25/2020 • 46 minutes, 59 seconds
Power Up Your PostgreSQL Analytics With Swarm64
Summary
The PostgreSQL database is massively popular due to its flexibility and extensive ecosystem of extensions, but it is still not the first choice for high performance analytics. Swarm64 aims to change that by adding support for advanced hardware capabilities like FPGAs and optimized usage of modern SSDs. In this episode CEO and co-founder Thomas Richter discusses his motivation for creating an extension to optimize Postgres hardware usage, the benefits of running your analytics on the same platform as your application, and how it works under the hood. If you are trying to get more performance out of your database then this episode is for you!
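For context on the kind of workload discussed here, the snippet below runs an analytical aggregate against PostgreSQL with psycopg2 and inspects the query plan to see how much parallelism the planner chose. It assumes a reachable Postgres instance with a sales table; the Swarm64-specific setup (installing the extension, FPGA provisioning) is deliberately omitted because those details are not covered in these notes.

```python
# Run an analytical aggregate against PostgreSQL and inspect the plan to see
# how many parallel workers were launched. Assumes a reachable instance and a
# "sales" table; Swarm64-specific configuration is intentionally omitted.
import psycopg2

QUERY = """
    SELECT region, SUM(amount) AS revenue
    FROM sales
    WHERE sold_at >= NOW() - INTERVAL '30 days'
    GROUP BY region
"""

conn = psycopg2.connect("dbname=analytics user=analyst host=localhost")  # placeholder DSN
try:
    with conn.cursor() as cur:
        cur.execute("EXPLAIN (ANALYZE, FORMAT TEXT) " + QUERY)
        for (line,) in cur.fetchall():
            # Plan lines mentioning "Workers Launched" show achieved parallelism.
            print(line)
finally:
    conn.close()
```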
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, a 40Gbit public network, fast object storage, and a brand new managed Kubernetes platform, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. And for your machine learning workloads, they’ve got dedicated CPU and GPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
You monitor your website to make sure that you’re the first to know when something goes wrong, but what about your data? Tidy Data is the DataOps monitoring platform that you’ve been missing. With real time alerts for problems in your databases, ETL pipelines, or data warehouse, and integrations with Slack, Pagerduty, and custom webhooks you can fix the errors before they become a problem. Go to dataengineeringpodcast.com/tidydata today and get started for free with no credit card required.
Your host is Tobias Macey and today I’m interviewing Thomas Richter about Swarm64, a PostgreSQL extension to improve parallelism and add support for FPGAs
Interview
Introduction
How did you get involved in the area of data management?
Can you start by explaining what Swarm64 is?
How did the business get started and what keeps you motivated?
What are some of the common bottlenecks that users of postgres run into?
What are the use cases and workloads that gain the most benefit from increased parallelism in the database engine?
By increasing the processing throughput of the database, how does that impact disk I/O and what are some options for avoiding bottlenecks in the persistence layer?
Can you describe how Swarm64 is implemented?
How has the product evolved since you first began working on it?
How has the evolution of postgres impacted your product direction?
What are some of the notable challenges that you have dealt with as a result of upstream changes in postgres?
How has the hardware landscape evolved and how does that affect your prioritization of features and improvements?
What are some of the other extensions in the postgres ecosystem that are most commonly used alongside Swarm64?
Which extensions conflict with yours and how does that impact potential adoption?
In addition to your work to optimize performance of the postgres engine, you also provide support for using an FPGA as a co-processor. What are the benefits that an FPGA provides over and above a CPU or GPU architecture?
What are the available options for provisioning hardware in a datacenter or the cloud that has access to an FPGA?
Most people are familiar with the relevant attributes for selecting a CPU or GPU. What are the specifications that they should be looking at when selecting an FPGA?
For users who are adopting Swarm64, how does it impact the way they should be thinking of their data models?
What is involved in migrating an existing database to use Swarm64?
What are some of the most interesting, unexpected, or challenging lessons that you have learned while building and growing the product and business of Swarm64?
When is Swarm64 the wrong choice?
What do you have planned for the future of Swarm64?
Contact Info
LinkedIn
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Links
Swarm64
Lufthansa Cargo
IBM Cognos Analytics
OLAP Cube
PostgreSQL
Geospatial Data
TimescaleDB
Podcast Episode
FPGA == Field Programmable Gate Array
Greenplum
Foreign Data Tables
PostgreSQL Table Storage API
EnterpriseDB
Xilinx
OVH Cloud
Nimbix
Azure
Tableau
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
5/18/2020 • 52 minutes, 43 seconds
StreamNative Brings Streaming Data To The Cloud Native Landscape With Pulsar
Summary
There have been several generations of platforms for managing streaming data, each with their own strengths and weaknesses, and different areas of focus. Pulsar is one of the recent entrants which has quickly gained adoption and an impressive set of capabilities. In this episode Sijie Guo discusses his motivations for spending so much of his time and energy on contributing to the project and growing the community. His most recent endeavor at StreamNative is focused on combining the capabilities of Pulsar with the cloud native movement to make it easier to build and scale real time messaging systems with built in event processing capabilities. This was a great conversation about the strengths of the Pulsar project, how it has evolved in recent years, and some of the innovative ways that it is being used. Pulsar is a well engineered and robust platform for building the core of any system that relies on durable access to easily scalable streams of data.
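As a concrete reference point for the pub/sub model discussed in this episode, here is a minimal produce/consume round trip with the Python pulsar-client library. It assumes a broker running at pulsar://localhost:6650 and the default tenant and namespace.

```python
# Minimal Pulsar produce/consume round trip using the pulsar-client library.
# Assumes a broker at pulsar://localhost:6650 with the default tenant/namespace.
import pulsar

client = pulsar.Client("pulsar://localhost:6650")

# Create the subscription before producing so the message is retained for it.
consumer = client.subscribe(
    "persistent://public/default/demo-topic",
    subscription_name="demo-subscription",
)

producer = client.create_producer("persistent://public/default/demo-topic")
producer.send(b"hello pulsar")

msg = consumer.receive(timeout_millis=10000)
print(msg.data())          # b'hello pulsar'
consumer.acknowledge(msg)  # acknowledgement is tracked durably by the broker

client.close()
```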
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, a 40Gbit public network, fast object storage, and a brand new managed Kubernetes platform, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. And for your machine learning workloads, they’ve got dedicated CPU and GPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
You monitor your website to make sure that you’re the first to know when something goes wrong, but what about your data? Tidy Data is the DataOps monitoring platform that you’ve been missing. With real time alerts for problems in your databases, ETL pipelines, or data warehouse, and integrations with Slack, Pagerduty, and custom webhooks you can fix the errors before they become a problem. Go to dataengineeringpodcast.com/tidydata today and get started for free with no credit card required.
Your host is Tobias Macey and today I’m interviewing Sijie Guo about the current state of the Pulsar framework for stream processing and his experiences building a managed offering for it at StreamNative
Interview
Introduction
How did you get involved in the area of data management?
Can you start by giving an overview of what Pulsar is?
How did you get involved with the project?
What is Pulsar’s role in the lifecycle of data and where does it fit in the overall ecosystem of data tools?
How has the Pulsar project evolved or changed over the past 2 years?
How has the overall state of the ecosystem influenced the direction that Pulsar has taken?
One of the critical elements in the success of a piece of technology is the ecosystem that grows around it. How has the community responded to Pulsar, and what are some of the barriers to adoption?
How are you and other project leaders addressing those barriers?
You were a co-founder at Streamlio, which was built on top of Pulsar, and now you have founded StreamNative to offer Pulsar as a service. What did you learn from your time at Streamlio that has been most helpful in your current endeavor?
How would you characterize your relationship with the project and community in each role?
What motivates you to dedicate so much of your time and energy to Pulsar in particular, and the streaming data ecosystem in general?
Why is streaming data such an important capability?
How have projects such as Kafka and Pulsar impacted the broader software and data landscape?
What are some of the most interesting, innovative, or unexpected ways that you have seen Pulsar used?
When is Pulsar the wrong choice?
What do you have planned for the future of StreamNative?
Contact Info
LinkedIn
@sijieg on Twitter
sijie on GitHub
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Links
Apache Pulsar
Podcast Episode
StreamNative
Streamlio
Hadoop
HBase
Hive
Tencent
Yahoo
BookKeeper
Publish/Subscribe
Kafka
Zookeeper
Podcast Episode
Kafka Connect
Pulsar Functions
Pulsar IO
Kafka On Pulsar
Webinar Video
Pulsar Protocol Handler
OVH Cloud
Open Messaging
ActiveMQ
Kubernetes
Helm
Pulsar Helm Charts
Grafana
BestPay(?)
Lambda Architecture
Event Sourcing
WebAssembly
Apache Flink
Podcast Episode
Pulsar Summit
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
5/11/2020 • 55 minutes, 19 seconds
Enterprise Data Operations And Orchestration At Infoworks
Summary
Data management is hard at any scale, but working in the context of an enterprise organization adds even greater complexity. Infoworks is a platform built to provide a unified set of tooling for managing the full lifecycle of data in large businesses. By reducing the barrier to entry with a graphical interface for defining data transformations and analysis, it makes it easier to bring the domain experts into the process. In this interview co-founder and CTO of Infoworks Amar Arsikere explains the unique challenges faced by enterprise organizations, how the platform is architected to provide the needed flexibility and scale, and how a unified platform for data improves the outcomes of the organizations using it.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, a 40Gbit public network, fast object storage, and a brand new managed Kubernetes platform, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. And for your machine learning workloads, they’ve got dedicated CPU and GPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
Free yourself from maintaining brittle data pipelines that require excessive coding and don’t operationally scale. With the Ascend Unified Data Engineering Platform, you and your team can easily build autonomous data pipelines that dynamically adapt to changes in data, code, and environment — enabling 10x faster build velocity and automated maintenance. On Ascend, data engineers can ingest, build, integrate, run, and govern advanced data pipelines with 95% less code. Go to dataengineeringpodcast.com/ascend to start building with a free 30-day trial. You’ll partner with a dedicated data engineer at Ascend to help you get started and accelerate your journey from prototype to production.
Your host is Tobias Macey and today I’m interviewing Amar Arsikere about the Infoworks platform for enterprise data operations and orchestration
Interview
Introduction
How did you get involved in the area of data management?
Can you start by describing what you have built at Infoworks and the story of how it got started?
What are the fundamental challenges that often plague organizations dealing with "big data"?
How do those challenges change or compound in the context of an enterprise organization?
What are some of the unique needs that enterprise organizations have of their data?
What are the design or technical limitations of existing big data technologies that contribute to the overall difficulty of using or integrating them effectively?
What are some of the tools or platforms that InfoWorks replaces in the overall data lifecycle?
How do you identify and prioritize the integrations that you build?
How is Infoworks itself architected and how has it evolved since you first built it?
Discoverability and reuse of data is one of the biggest challenges facing organizations of all sizes. How do you address that in your platform?
What are the roles that use InfoWorks in their day-to-day?
What does the workflow look like for each of those roles?
Can you talk through the overall lifecycle of a unit of data in InfoWorks and the different subsystems that it interacts with at each stage?
What are some of the design challenges that you face in building a UI oriented workflow while providing the necessary level of control for these systems?
How do you handle versioning of pipelines and validation of new iterations prior to production release?
What are the cases where the no-code, graphical paradigm for data orchestration breaks down?
What are some of the most challenging, interesting, or unexpected lessons that you have learned since starting Infoworks?
Contact Info
LinkedIn
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Links
InfoWorks
Google BigTable
Apache Spark
Apache Hadoop
Zynga
Data Partitioning
Informatica
Pentaho
Talend
Apache NiFi
GoldenGate
BigQuery
Change Data Capture
Podcast Episode About Debezium
Slowly Changing Dimensions
Snowflake DB
Podcast Episode
Tableau
Data Catalog
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
5/4/2020 • 45 minutes, 53 seconds
Taming Complexity In Your Data Driven Organization With DataOps
Summary
Data is a critical element to every role in an organization, which is also what makes managing it so challenging. With so many different opinions about which pieces of information are most important, how it needs to be accessed, and what to do with it, many data projects are doomed to failure. In this episode Chris Bergh explains how taking an agile approach to delivering value can drive down the complexity that grows out of the varied needs of the business. Building a DataOps workflow that incorporates fast delivery of well defined projects, continuous testing, and open lines of communication is a proven path to success.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, a 40Gbit public network, fast object storage, and a brand new managed Kubernetes platform, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. And for your machine learning workloads, they’ve got dedicated CPU and GPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
If DataOps sounds like the perfect antidote to your pipeline woes, DataKitchen is here to help. DataKitchen’s DataOps Platform automates and coordinates all the people, tools, and environments in your entire data analytics organization – everything from orchestration, testing and monitoring to development and deployment. In no time, you’ll reclaim control of your data pipelines so you can start delivering business value instantly, without errors. Go to dataengineeringpodcast.com/datakitchen today to learn more and thank them for supporting the show!
Your host is Tobias Macey and today I’m welcoming back Chris Bergh to talk about ways that DataOps principles can help to reduce organizational complexity
Interview
Introduction
How did you get involved in the area of data management?
How are typical data and analytic teams organized? What are their roles and structure?
Can you start by giving an outline of the ways that complexity can manifest in a data organization?
What are some of the contributing factors that generate this complexity?
How does the size or scale of an organization and their data needs impact the segmentation of responsibilities and roles?
How does this organizational complexity play out within a single team? For example between data engineers, data scientists, and production/operations?
How do you approach the definition of useful interfaces between different roles or groups within an organization?
What are your thoughts on the relationship between the multivariate complexities of data and analytics workflows and the software trend toward microservices as a means of addressing the challenges of organizational communication patterns in the software lifecycle?
How does this organizational complexity play out between multiple teams?
For example, between a centralized data team and line-of-business self-service teams?
Isn’t organizational complexity just ‘the way it is’? Is there any hope of getting out of meetings and inter-team conflict?
What are some of the technical elements that are most impactful in reducing the time to delivery for different roles?
What are some strategies that you have found to be useful for maintaining a connection to the business need throughout the different stages of the data lifecycle?
What are some of the signs or symptoms of problematic complexity that individuals and organizations should keep an eye out for?
What role can automated testing play in improving this process?
How do the current set of tools contribute to the fragmentation of data workflows?
Which set of technologies are most valuable in reducing complexity and fragmentation?
What advice do you have for data engineers to help with addressing complexity in the data organization and the problems that it contributes to?
Contact Info
LinkedIn
@ChrisBergh on Twitter
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Links
DataKitchen
DataOps
NASA Ames Research Center
Excel
Tableau
Looker
Podcast Episode
Alteryx
Trifacta
Paxata
AutoML
Informatica
SAS
Conway’s Law
Random Forest
K-Means Clustering
GraphQL
Microservices
Intuit Superglue
Amundsen
Podcast Episode
Master Data Management
Podcast Episode
Hadoop
Great Expectations
Podcast Episode
Observability
Continuous Integration
Continuous Delivery
W. Edwards Deming
The Joel Test
Joel Spolsky
DataOps Blog
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
4/28/2020 • 1 hour, 1 minute, 48 seconds
Building Real Time Applications On Streaming Data With Eventador
Summary
Modern applications frequently require access to real-time data, but building and maintaining the systems that make that possible is a complex and time consuming endeavor. Eventador is a managed platform designed to let you focus on using the data that you collect, without worrying about how to make it reliable. In this episode Eventador Founder and CEO Kenny Gorman describes how the platform is architected, the challenges inherent to managing reliable streams of data, the simplicity offered by a SQL interface, and the interesting projects that his customers have built on top of it. This was an interesting inside look at building a business on top of open source stream processing frameworks and how to reduce the burden on end users.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, a 40Gbit public network, fast object storage, and a brand new managed Kubernetes platform, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. And for your machine learning workloads, they’ve got dedicated CPU and GPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
Your host is Tobias Macey and today I’m interviewing Kenny Gorman about the Eventador streaming SQL platform
Interview
Introduction
How did you get involved in the area of data management?
Can you start by describing what the Eventador platform is and the story behind it?
How has your experience at ObjectRocket influenced your approach to streaming SQL?
How do the capabilities and developer experience of Eventador compare to other streaming SQL engines such as ksqlDB, Pulsar SQL, or Materialize?
What are the main use cases that you are seeing people use for streaming SQL?
How does it fit into an application architecture?
What are some of the design changes in the different layers that are necessary to take advantage of the real time capabilities?
Can you describe how the Eventador platform is architected?
How has the system design evolved since you first began working on it?
How has the overall landscape of streaming systems changed since you first began working on Eventador?
If you were to start over today what would you do differently?
What are some of the most interesting and challenging operational aspects of running your platform?
What are some of the ways that you have modified or augmented the SQL dialect that you support?
What is the tipping point for when SQL is insufficient for a given task and a user might want to leverage Flink?
What is the workflow for developing and deploying different SQL jobs?
How do you handle versioning of the queries and integration with the software development lifecycle?
What are some data modeling considerations that users should be aware of?
What are some of the sharp edges or design pitfalls that users should be aware of?
What are some of the most interesting, innovative, or unexpected ways that you have seen your customers use your platform?
What are some of the most interesting, unexpected, or challenging lessons that you have learned in the process of building and scaling Eventador?
What do you have planned for the future of the platform?
Contact Info
LinkedIn
Blog
@kennygorman on Twitter
kgorman on GitHub
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Links
Eventador
Oracle DB
Paypal
EBay
Semaphore
MongoDB
ObjectRocket
RackSpace
RethinkDB
Apache Kafka
Pulsar
PostgreSQL Write-Ahead Log (WAL)
ksqlDB
Podcast Episode
Pulsar SQL
Materialize
Podcast Episode
PipelineDB
Podcast Episode
Apache Flink
Podcast Episode
Timely Dataflow
FinTech == Financial Technology
Anomaly Detection
Network Security
Materialized View
Kubernetes
Confluent Schema Registry
Podcast Episode
ANSI SQL
Apache Calcite
PostgreSQL
User Defined Functions
Change Data Capture
Podcast Episode
AWS Kinesis
Uber AthenaX
Netflix Keystone
Ververica
Rockset
Podcast Episode
Backpressure
Keen.io
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
4/20/2020 • 50 minutes, 30 seconds
Making Data Collection In Your Code Easy With Rookout
Summary
The software applications that we build for our businesses are a rich source of data, but accessing and extracting that data is often a slow and error-prone process. Rookout has built a platform to separate the data collection process from the lifecycle of your code. In this episode, CTO Liran Haimovitch discusses the benefits of shortening the iteration cycle and bringing non-engineers into the process of identifying useful data. This was a great conversation about the importance of democratizing the work of data collection.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, a 40Gbit public network, fast object storage, and a brand new managed Kubernetes platform, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. And for your machine learning workloads, they’ve got dedicated CPU and GPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
Your host is Tobias Macey and today I’m interviewing Liran Haimovitch, CTO of Rookout, about the business value of operations metrics and other dark data in your organization
Interview
Introduction
How did you get involved in the area of data management?
Can you start by describing the types of data that we typically collect for the systems operations context?
What are some of the business questions that can be answered from these data sources?
What are some of the considerations that developers and operations engineers need to be aware of when they are defining the collection points for system metrics and log messages?
What are some effective strategies that you have found for including business stakeholders in the process of defining these collection points?
One of the difficulties in building useful analyses from any source of data is maintaining the appropriate context. What are some of the necessary metadata that should be maintained along with operational metrics?
What are some of the shortcomings in the systems we design and use for operational data stores in terms of making the collected data useful for other purposes?
How does the existing tooling need to be changed or augmented to simplify the collaboration between engineers and stakeholders for defining and collecting the needed information?
The types of systems that we use for collecting and analyzing operations metrics are often designed and optimized for different access patterns and data formats than those used for analytical and exploratory purposes. What are your thoughts on how to incorporate the collected metrics with behavioral data?
What are some of the other sources of dark data that we should keep an eye out for in our organizations?
Contact Info
LinkedIn
@Liran_Last on Twitter
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Links
Rookout
Cybersecurity
DevOps
DataDog
Graphite
Elasticsearch
Logz.io
Kafka
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
4/14/2020 • 26 minutes
Building A Knowledge Graph Of Commercial Real Estate At Cherre
Summary
Knowledge graphs are a data resource that can answer questions beyond the scope of traditional data analytics. By organizing and storing data to emphasize the relationship between entities, we can discover the complex connections between multiple sources of information. In this episode John Maiden talks about how Cherre builds knowledge graphs that provide powerful insights for their customers and the engineering challenges of building a scalable graph. If you’re wondering how to extract additional business value from existing data, this episode will provide a way to expand your data resources.
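To make the idea of a knowledge graph concrete, here is a minimal sketch (not Cherre’s actual implementation) of modeling real estate entities and their relationships as a property graph with the NetworkX library mentioned in the episode links. All entity names and relationship labels are invented for illustration.

import networkx as nx

# Nodes carry a type attribute; edges carry a relationship label.
g = nx.MultiDiGraph()
g.add_node("person:jane-doe", type="person")
g.add_node("llc:123-main-holdings", type="company")
g.add_node("parcel:nyc-0042", type="property")

g.add_edge("person:jane-doe", "llc:123-main-holdings", label="officer_of")
g.add_edge("llc:123-main-holdings", "parcel:nyc-0042", label="owns")

# Walk outward from a person to find properties they are connected to,
# even when the link is indirect (person -> company -> property).
for company in g.successors("person:jane-doe"):
    for node in g.successors(company):
        if g.nodes[node]["type"] == "property":
            print(f"jane-doe is connected to {node} via {company}")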
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, a 40Gbit public network, fast object storage, and a brand new managed Kubernetes platform, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. And for your machine learning workloads, they’ve got dedicated CPU and GPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on great conferences. We have partnered with organizations such as ODSC, and Data Council. Upcoming events include ODSC East which has gone virtual starting April 16th. Go to dataengineeringpodcast.com/conferences to learn more about these and other events, and take advantage of our partner discounts to save money when you register today.
Your host is Tobias Macey and today I’m interviewing John Maiden about how Cherre is building and using a knowledge graph of commercial real estate information
Interview
Introduction
How did you get involved in the area of data management?
Can you start by describing what Cherre is and the role that data plays in the business?
What are the benefits of a knowledge graph for making real estate investment decisions?
What are the main ways that you and your customers are using the knowledge graph?
What are some of the challenges that you face in providing a usable interface for end-users to query the graph?
What technology are you using for storing and processing the graph?
What challenges do you face in scaling the complexity and analysis of the graph?
What are the main sources of data for the knowledge graph?
What are some of the ways that messiness manifests in the data that you are using to populate the graph?
How are you managing cleaning of the data and how do you identify and process records that can’t be coerced into the desired structure?
How do you handle missing attributes or extra attributes in a given record?
How did you approach the process of determining an effective taxonomy for records in the graph?
What is involved in performing entity extraction on your data?
What are some of the most interesting or unexpected questions that you have been able to ask and answer with the graph?
What are some of the most interesting/unexpected/challenging lessons that you have learned in the process of working with this data?
What are some of the near and medium term improvements that you have planned for your knowledge graph?
What advice do you have for anyone who is interested in building a knowledge graph of their own?
Contact Info
LinkedIn
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Links
Cherre
Commercial Real Estate
Knowledge Graph
RDF Triple
DGraph
Podcast Interview
Neo4J
TigerGraph
Google BigQuery
Apache Spark
Spark In Action Episode
Entity Extraction/Named Entity Recognition
NetworkX
Spark Graph Frames
Graph Embeddings
Airflow
Podcast.__init__ Interview
DBT
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
4/7/2020 • 45 minutes, 20 seconds
The Life Of A Non-Profit Data Professional
Summary
Building and maintaining a system that integrates and analyzes all of the data for your organization is a complex endeavor. Operating on a shoestring budget makes it even more challenging. In this episode Tyler Colby shares his experiences working as a data professional in the non-profit sector. From managing Salesforce data models to wrangling a multitude of data sources and compliance requirements, he describes the biggest challenges that he is facing.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, a 40Gbit public network, fast object storage, and a brand new managed Kubernetes platform, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. And for your machine learning workloads, they’ve got dedicated CPU and GPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on great conferences. We have partnered with organizations such as ODSC, and Data Council. Upcoming events include the Observe 20/20 virtual conference and ODSC East which has also gone virtual. Go to dataengineeringpodcast.com/conferences to learn more about these and other events, and take advantage of our partner discounts to save money when you register today.
Your host is Tobias Macey and today I’m interviewing Tyler Colby about his experiences working as a data professional in the non-profit arena, most recently at the Natural Resources Defense Council
Interview
Introduction
How did you get involved in the area of data management?
Can you start by describing your responsibilities as the director of data infrastructure at the NRDC?
What specific challenges are you facing at the NRDC?
Can you describe some of the types of data that you are working with at the NRDC?
What types of systems are you relying on for the source of your data?
What kinds of systems have you put in place to manage the data needs of the NRDC?
What are your biggest influences in the build vs. buy decisions that you make?
What heuristics or guidelines do you rely on for aligning your work with the business value that it will produce and the broader mission of the organization?
Have you found there to be any extra scrutiny of your work as a member of a non-profit in terms of regulations or compliance questions?
Your career has involved a significant focus on the Salesforce platform. For anyone not familiar with it, what benefits does it provide in managing information flows and analysis capabilities?
What are some of the most challenging or complex aspects of working with Salesforce?
In light of the current global crisis posed by COVID-19, you have established a new non-profit entity to organize the efforts of various technical professionals. Can you describe the nature of that mission?
What are some of the unique data challenges that you anticipate or have already encountered?
How do the data challenges of this new organization compare to your past experiences?
What have you found to be most useful or beneficial in the current landscape of data management systems and practices in your career with non-profit organizations?
What are the areas that need to be addressed or improved for workers in the non-profit sector?
Contact Info
LinkedIn
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Links
NRDC
AWS Redshift
Time Warner Cable
Salesforce
Cloud For Good
Tableau
Civis Analytics
EveryAction
BlackBaud
ActionKit
MobileCommons
XKCD 1667
GDPR == General Data Protection Regulation
CCPA == California Consumer Privacy Act
Salesforce Apex
Salesforce.org
Salesforce Non-Profit Success Pack
Validity
OpenRefine
JitterBit
Skyvia
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
3/30/2020 • 44 minutes, 36 seconds
Behind The Scenes Of The Linode Object Storage Service
Summary
There are a number of platforms available for object storage, including self-managed open source projects. But what goes on behind the scenes of the companies that run these systems at scale so you don’t have to? In this episode Will Smith shares the journey that he and his team at Linode recently completed to bring a fast and reliable S3-compatible object storage service to production for your benefit. He discusses the challenges of running object storage for public usage, some of the interesting ways that it was stress tested internally, and the lessons that he learned along the way.
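As a rough illustration of what an S3-compatible API means in practice, any standard S3 client can talk to such a service simply by overriding the endpoint; it is not tied to one provider’s tooling. The sketch below uses boto3, and the endpoint URL, bucket name, and credentials are placeholders rather than anything specific from the episode.

import boto3

# Point a standard S3 client at the alternate endpoint; everything else
# (buckets, keys, uploads, downloads) works the same as it would on AWS S3.
s3 = boto3.client(
    "s3",
    endpoint_url="https://objects.example-provider.com",  # placeholder endpoint
    aws_access_key_id="YOUR_ACCESS_KEY",
    aws_secret_access_key="YOUR_SECRET_KEY",
)

s3.put_object(Bucket="example-bucket", Key="reports/2020-03.csv",
              Body=b"date,value\n2020-03-01,42\n")

obj = s3.get_object(Bucket="example-bucket", Key="reports/2020-03.csv")
print(obj["Body"].read().decode())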
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, a 40Gbit public network, fast object storage, and a brand new managed Kubernetes platform, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. And for your machine learning workloads, they’ve got dedicated CPU and GPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Corinium Global Intelligence, ODSC, and Data Council. Go to dataengineeringpodcast.com/conferences to learn more about these and other events, and take advantage of our partner discounts to save money when you register today.
Your host is Tobias Macey and today I’m interviewing Will Smith about his work on building object storage for the Linode cloud platform
Interview
Introduction
How did you get involved in the area of data management?
Can you start by giving an overview of the current state of your object storage product?
What was the motivating factor for building and managing your own object storage system rather than building an integration with another offering such as Wasabi or Backblaze?
What is the scale and scope of usage that you had to design for?
Can you describe how your platform is implemented?
What was your criteria for deciding whether to use an available platform such as Ceph or MinIO vs building your own from scratch?
How have your initial assumptions about the operability and maintainability of your installation been challenged or updated since it has been released to the public?
What have been the biggest challenges that you have faced in designing and deploying a system that can meet the scale and reliability requirements of Linode?
What are the most important capabilities for the underlying hardware that you are running on?
What supporting systems and tools are you using to manage the availability and durability of your object storage?
How did you approach the rollout of Linode’s object storage to gain the confidence that you needed to feel comfortable with full scale usage?
What are some of the benefits that you have gained internally at Linode from having an object storage system available to your product teams?
What are your thoughts on the state of the S3 API as a de facto standard for object storage?
What is your main focus now that object storage is being rolled out to more data centers?
Contact Info
Dorthu on GitHub
dorthu22 on Twitter
LinkedIn
Website
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
Linode Object Storage
Xen Hypervisor
KVM (Linux Kernel Virtual Machine)
Linode API V4
Ceph Distributed Filesystem
Podcast Episode
Wasabi
Backblaze
MinIO
CERN Ceph Scaling Paper
RADOS Gateway
OpenResty
Lua
Prometheus
Linode Managed Kubernetes
Ceph Swift Protocol
Ceph Bug Tracker
Linode Dashboard Application Source Code
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
3/23/2020 • 35 minutes, 53 seconds
Building A New Foundation For CouchDB
Summary
CouchDB is a distributed document database built for scale and ease of operation. With a built-in synchronization protocol and an HTTP interface, it has become popular as a backend for web and mobile applications. Created 15 years ago, it has accrued some technical debt, which is being addressed with a refactored architecture based on FoundationDB. In this episode Adam Kocoloski shares the history of the project, how it works under the hood, and how the new design will improve the project for our new era of computation. This was an interesting conversation about the challenges of maintaining a large and mission critical project and the work being done to evolve it.
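For a sense of what the HTTP interface looks like, here is a minimal sketch using Python’s requests library against a local CouchDB instance on the default port 5984. The database name, document contents, and admin credentials are assumptions made for the example.

import requests

base = "http://localhost:5984"
auth = ("admin", "secret")  # assumed local admin credentials

# Create a database (PUT /{db}), then write and read back a JSON document.
requests.put(f"{base}/listeners", auth=auth)

resp = requests.post(f"{base}/listeners", auth=auth,
                     json={"name": "Ada", "subscribed": True})
doc_id = resp.json()["id"]

doc = requests.get(f"{base}/listeners/{doc_id}", auth=auth).json()
print(doc["name"], doc["_rev"])  # every write gets a _rev used by replication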
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, a 40Gbit public network, fast object storage, and a brand new managed Kubernetes platform, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. And for your machine learning workloads, they’ve got dedicated CPU and GPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
Are you spending too much time maintaining your data pipeline? Snowplow empowers your business with a real-time event data pipeline running in your own cloud account without the hassle of maintenance. Snowplow takes care of everything from installing your pipeline in a couple of hours to upgrading and autoscaling so you can focus on your exciting data projects. Your team will get the most complete, accurate and ready-to-use behavioral web and mobile data, delivered into your data warehouse, data lake and real-time streams. Go to dataengineeringpodcast.com/snowplow today to find out why more than 600,000 websites run Snowplow. Set up a demo and mention you’re a listener for a special offer!
Setting up and managing a data warehouse for your business analytics is a huge task. Integrating real-time data makes it even more challenging, but the insights you obtain can make or break your business growth. You deserve a data warehouse engine that outperforms the demands of your customers and simplifies your operations at a fraction of the time and cost that you might expect. You deserve ClickHouse, the open-source analytical database that deploys and scales wherever and whenever you want it to and turns data into actionable insights. And Altinity, the leading software and service provider for ClickHouse, is on a mission to help data engineers and DevOps managers tame their operational analytics. Go to dataengineeringpodcast.com/altinity for a free consultation to find out how they can help you today.
You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Corinium Global Intelligence, ODSC, and Data Council. Upcoming events include the Software Architecture Conference in NYC, Strata Data in San Jose, and PyCon US in Pittsburgh. Go to dataengineeringpodcast.com/conferences to learn more about these and other events, and take advantage of our partner discounts to save money when you register today.
Your host is Tobias Macey and today I’m interviewing Adam Kocoloski about CouchDB and the work being done to migrate the storage layer to FoundationDB
Interview
Introduction
How did you get involved in the area of data management?
Can you start by describing what CouchDB is?
How did you get involved in the CouchDB project and what is your current role in the community?
What are the use cases that it is well suited for?
Can you share some of the history of CouchDB and its role in the NoSQL movement?
How is CouchDB currently architected and how has it evolved since it was first introduced?
What have been the benefits and challenges of Erlang as the runtime for CouchDB?
How is the current storage engine implemented and what are its shortcomings?
What problems are you trying to solve by replatforming on a new storage layer?
What were the selection criteria for the new storage engine and how did you structure the decision making process?
What was the motivation for choosing FoundationDB as opposed to other options such as RocksDB, LevelDB, etc.?
How is the adoption of FoundationDB going to impact the overall architecture and implementation of CouchDB?
How will the use of FoundationDB impact the way that the current capabilities are implemented, such as data replication?
What will the migration path be for people running an existing installation?
What are some of the biggest challenges that you are facing in rearchitecting the codebase?
What new capabilities will the FoundationDB storage layer enable?
What are some of the most interesting/unexpected/innovative ways that you have seen CouchDB used?
What new capabilities or use cases do you anticipate once this migration is complete?
What are some of the most interesting/unexpected/challenging lessons that you have learned while working with the CouchDB project and community?
What is in store for the future of CouchDB?
Contact Info
LinkedIn
@kocolosk on Twitter
kocolosk on GitHub
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
Apache CouchDB
FoundationDB
Podcast Episode
IBM
Cloudant
Experimental Particle Physics
FPGA == Field Programmable Gate Array
Apache Software Foundation
CRDT == Conflict-free Replicated Data Type
Podcast Episode
Erlang
Riak
RabbitMQ
Heisenbug
Kubernetes
Property Based Testing
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
3/17/2020 • 55 minutes, 25 seconds
Scaling Data Governance For Global Businesses With A Data Hub Architecture
Summary
Data governance is a complex endeavor, but scaling it to meet the needs of a complex or globally distributed organization requires a well considered and coherent strategy. In this episode Tim Ward describes an architecture that he has used successfully with multiple organizations to scale compliance. By treating it as a graph problem, where each hub in the network has localized control with inheritance of higher-level controls, it reduces overhead and provides greater flexibility. Tim provides useful examples for understanding how to adopt this approach in your own organization, including some technology recommendations for making it maintainable and scalable. If you are struggling to scale data quality controls and governance requirements then this interview will provide some useful ideas to incorporate into your roadmap.
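To illustrate the inheritance idea in the abstract, here is a toy Python sketch (not a description of any particular product) in which governance rules defined on a parent hub flow down to child hubs, and each hub can add or override rules locally. The hub names and rules are invented.

class Hub:
    def __init__(self, name, parent=None, local_rules=None):
        self.name = name
        self.parent = parent
        self.local_rules = local_rules or {}

    def effective_rules(self):
        # Start from the parent's rules and let local settings win.
        inherited = self.parent.effective_rules() if self.parent else {}
        return {**inherited, **self.local_rules}

global_hub = Hub("global", local_rules={"pii_masking": "hash", "retention_days": 365})
eu_hub = Hub("eu", parent=global_hub,
             local_rules={"retention_days": 30, "data_residency": "eu-only"})
germany_hub = Hub("germany", parent=eu_hub, local_rules={"pii_masking": "drop"})

print(germany_hub.effective_rules())
# {'pii_masking': 'drop', 'retention_days': 30, 'data_residency': 'eu-only'}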
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, a 40Gbit public network, fast object storage, and a brand new managed Kubernetes platform, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. And for your machine learning workloads, they’ve got dedicated CPU and GPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Corinium Global Intelligence, ODSC, and Data Council. Upcoming events include the Software Architecture Conference in NYC, Strata Data in San Jose, and PyCon US in Pittsburgh. Go to dataengineeringpodcast.com/conferences to learn more about these and other events, and take advantage of our partner discounts to save money when you register today.
Your host is Tobias Macey and today I’m interviewing Tim Ward about using an architectural pattern called data hub that allows for scaling data management across global businesses
Interview
Introduction
How did you get involved in the area of data management?
Can you start by giving an overview of the goals of a data hub architecture?
What are the elements of a data hub architecture and how do they contribute to the overall goals?
What are some of the patterns or reference architectures that you drew on to develop this approach?
What are some signs that an organization should implement a data hub architecture?
What is the migration path for an organization who has an existing data platform but needs to scale their governance and localize storage and access?
What are the features or attributes of an individual hub that allow for them to be interconnected?
What is the interface presented between hubs to allow for accessing information across these localized repositories?
What is the process for adding a new hub and making it discoverable across the organization?
How is discoverability of data managed within and between hubs?
If someone wishes to access information between hubs or across several of them, how do you prevent data proliferation?
If data is copied between hubs, how are record updates accounted for to ensure that they are replicated to the hubs that hold a copy of that entity?
How are access controls and data masking managed to ensure that various compliance regimes are honored?
In addition to compliance issues, another challenge of distributed data repositories is the question of latency. How do you mitigate the performance impacts of querying across multiple hubs?
Given that different hubs can have differing rules for quality, cleanliness, or structure of a given record how do you handle transformations of data as it traverses different hubs?
How do you address issues of data loss or corruption within those transformations?
How is the topology of a hub infrastructure arranged and how does that impact questions of data loss through multiple zone transformations, latency, etc.?
How do you manage tracking and reporting of data lineage within and across hubs?
For an organization that is interested in implementing their own instance of a data hub architecture, what are the necessary components of an individual hub?
What are some of the considerations and useful technologies that would assist in creating and connecting hubs?
Should the hubs be implemented in a homogeneous fashion, or is there room for heterogeneity in their infrastructure as long as they expose the appropriate interface?
When is a data hub architecture the wrong approach?
Contact Info
LinkedIn
@jerrong on Twitter
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
CluedIn
Podcast Episode
Eventual Connectivity Episode
Futurama
Kubernetes
Zookeeper
Podcast Episode
Data Governance
Data Lineage
Data Sovereignty
Graph Database
Helm Chart
Application Container
Docker Compose
LinkedIn DataHub
Udemy
PluralSight
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
3/9/2020 • 54 minutes, 8 seconds
Easier Stream Processing On Kafka With ksqlDB
Summary
Building applications on top of unbounded event streams is a complex endeavor, requiring careful integration of multiple disparate systems that were engineered in isolation. The ksqlDB project was created to address this state of affairs by building a unified layer on top of the Kafka ecosystem for stream processing. Developers can work with the SQL constructs that they are familiar with while automatically getting the durability and reliability that Kafka offers. In this episode Michael Drogalis, product manager for ksqlDB at Confluent, explains how the system is implemented, how you can use it for building your own stream processing applications, and how it fits into the lifecycle of your data infrastructure. If you have been struggling with building services on low level streaming interfaces then give this episode a listen and try it out for yourself.
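As a flavor of what this looks like in practice, the sketch below declares a stream over an existing Kafka topic and derives a continuously maintained aggregate from it, submitting the statements to ksqlDB’s REST API with Python’s requests library. The server address, topic, and column names are assumptions for the example rather than anything specific from the episode.

import requests

KSQLDB_URL = "http://localhost:8088/ksql"  # assumed local ksqlDB server

statements = """
  CREATE STREAM pageviews (user_id VARCHAR, url VARCHAR)
    WITH (KAFKA_TOPIC='pageviews', VALUE_FORMAT='JSON');

  CREATE TABLE views_per_user AS
    SELECT user_id, COUNT(*) AS total
    FROM pageviews
    GROUP BY user_id
    EMIT CHANGES;
"""

# Each statement becomes a persistent query whose state Kafka keeps durable.
resp = requests.post(KSQLDB_URL, json={"ksql": statements, "streamsProperties": {}})
print(resp.json())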
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, a 40Gbit public network, fast object storage, and a brand new managed Kubernetes platform, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. And for your machine learning workloads, they’ve got dedicated CPU and GPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
Are you spending too much time maintaining your data pipeline? Snowplow empowers your business with a real-time event data pipeline running in your own cloud account without the hassle of maintenance. Snowplow takes care of everything from installing your pipeline in a couple of hours to upgrading and autoscaling so you can focus on your exciting data projects. Your team will get the most complete, accurate and ready-to-use behavioral web and mobile data, delivered into your data warehouse, data lake and real-time streams. Go to dataengineeringpodcast.com/snowplow today to find out why more than 600,000 websites run Snowplow. Set up a demo and mention you’re a listener for a special offer!
You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Corinium Global Intelligence, ODSC, and Data Council. Upcoming events include the Software Architecture Conference in NYC, Strata Data in San Jose, and PyCon US in Pittsburgh. Go to dataengineeringpodcast.com/conferences to learn more about these and other events, and take advantage of our partner discounts to save money when you register today.
Your host is Tobias Macey and today I’m interviewing Michael Drogalis about ksqlDB, the open source streaming database layer for Kafka
Interview
Introduction
How did you get involved in the area of data management?
Can you start by describing what ksqlDB is?
What are some of the use cases that it is designed for?
How do the capabilities and design of ksqlDB compare to other solutions for querying streaming data with SQL such as Pulsar SQL, PipelineDB, or Materialize?
What was the motivation for building a unified project for providing a database interface on the data stored in Kafka?
How is ksqlDB architected?
If you were to rebuild the entire platform and its components from scratch today, what would you do differently?
What is the workflow for an analyst or engineer to design and build an application on top of ksqlDB?
What dialect of SQL is supported?
What kinds of extensions or built in functions have been added to aid in the creation of streaming queries?
How are table schemas defined and enforced?
How do you handle schema migrations on active streams?
Typically a database is considered a long term storage location for data, whereas Kafka is a streaming layer with a bounded amount of durable storage. What is a typical lifecycle of information in ksqlDB?
Can you talk through an example architecture that might incorporate ksqlDB including the source systems, applications that might interact with the data in transit, and any destination systems for long term persistence?
What are some of the less obvious features of ksqlDB or capabilities that you think should be more widely publicized?
What are some of the edge cases or potential pitfalls that users should be aware of as they are designing their streaming applications?
What is involved in deploying and maintaining an installation of ksqlDB?
What are some of the operational characteristics of the system that should be considered while planning an installation such as scaling factors, high availability, or potential bottlenecks in the architecture?
When is ksqlDB the wrong choice?
What are some of the most interesting/unexpected/innovative projects that you have seen built with ksqlDB?
What are some of the most interesting/unexpected/challenging lessons that you have learned while working on ksqlDB?
What is in store for the future of the project?
Contact Info
@michaeldrogalis on Twitter
michaeldrogalis on GitHub
LinkedIn
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
ksqlDB
Confluent
Erlang
Onyx
Apache Storm
Stream Processing
Kafka
ksql
Kafka Streams
Pulsar
Podcast Episode
Pulsar SQL
PipelineDB
Podcast Episode
Materialize
Podcast Episode
Kafka Connect
RocksDB
Java Jar
CLI == Command Line Interface
PrestoDB
Podcast Episode
ANSI SQL
Pravega
Podcast Episode
Eventual Consistency
Confluent Cloud
MySQL
PostgreSQL
GraphQL
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
3/2/2020 • 43 minutes, 36 seconds
Shining A Light on Shadow IT In Data And Analytics
Summary
Misaligned priorities across business units can lead to tensions that drive members of the organization to build data and analytics projects without the guidance or support of engineering or IT staff. The availability of cloud platforms and managed services makes this a viable option, but can lead to downstream challenges. In this episode Sean Knapp and Charlie Crocker share their experiences of working in and with companies that have dealt with shadow IT projects and the importance of enabling and empowering the use and exploration of data and analytics. If you have ever been frustrated by seemingly draconian policies or struggled to align everyone on your supported platform, then this episode will help you gain some perspective and set you on a path to productive collaboration.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, a 40Gbit public network, fast object storage, and a brand new managed Kubernetes platform, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. And for your machine learning workloads, they’ve got dedicated CPU and GPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
Are you spending too much time maintaining your data pipeline? Snowplow empowers your business with a real-time event data pipeline running in your own cloud account without the hassle of maintenance. Snowplow takes care of everything from installing your pipeline in a couple of hours to upgrading and autoscaling so you can focus on your exciting data projects. Your team will get the most complete, accurate and ready-to-use behavioral web and mobile data, delivered into your data warehouse, data lake and real-time streams. Go to dataengineeringpodcast.com/snowplow today to find out why more than 600,000 websites run Snowplow. Set up a demo and mention you’re a listener for a special offer!
You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Corinium Global Intelligence, ODSC, and Data Council. Upcoming events include the Software Architecture Conference in NYC, Strata Data in San Jose, and PyCon US in Pittsburgh. Go to dataengineeringpodcast.com/conferences to learn more about these and other events, and take advantage of our partner discounts to save money when you register today.
Your host is Tobias Macey and today I’m interviewing Sean Knapp and Charlie Crocker about shadow IT in data and analytics
Interview
Introduction
How did you get involved in the area of data management?
Can you start by sharing your definition of shadow IT?
What are some of the reasons that members of an organization might start building their own solutions outside of what is supported by the engineering teams?
What are some of the roles in an organization that you have seen involved in these shadow IT projects?
What kinds of tools or platforms are well suited for being provisioned and managed without involvement from the platform team?
What are some of the pitfalls that these solutions present as a result of their initial ease of use?
What are the benefits to the organization of individuals or teams building and managing their own solutions?
What are some of the risks associated with these implementations of data collection, storage, management, or analysis that have no oversight from the teams typically tasked with managing those systems?
What are some of the ways that compliance or data quality issues can arise from these projects?
Once a project has been started outside of the approved channels it can quickly take on a life of its own. What are some of the ways you have identified the presence of "unauthorized" data projects?
Once you have identified the existence of such a project how can you revise their implementation to integrate them with the "approved" platform that the organization supports?
What are some strategies for removing the friction in the collection, access, or availability of data in an organization that can eliminate the need for shadow IT implementations?
What are some of the inherent complexities in data management which you would like to see resolved in order to reduce the tensions that lead to these bespoke solutions?
Contact Info
Sean
LinkedIn
@seanknapp on Twitter
Charlie
LinkedIn
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
Shadow IT
Ascend
Podcast Episode
ZoneHaven
Google Sawzall
M&A == Mergers and Acquisitions
DevOps
Waterfall Development
Data Governance
Data Lineage
Pioneers, Settlers, and Town Planners
PowerBI
Tableau
Excel
Amundsen
Podcast Episode
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
2/25/2020 • 46 minutes, 8 seconds
Data Infrastructure Automation For Private SaaS At Snowplow
Summary
One of the biggest challenges in building reliable platforms for processing event pipelines is managing the underlying infrastructure. At Snowplow Analytics the complexity is compounded by the need to manage multiple instances of their platform across customer environments. In this episode Josh Beemster, the technical operations lead at Snowplow, explains how they manage automation, deployment, monitoring, scaling, and maintenance of their streaming analytics pipeline for event data. He also shares the challenges they face in supporting multiple cloud environments and the need to integrate with existing customer systems. If you are daunted by the needs of your data infrastructure then it’s worth listening to how Josh and his team are approaching the problem.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, a 40Gbit public network, fast object storage, and a brand new managed Kubernetes platform, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. And for your machine learning workloads, they’ve got dedicated CPU and GPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Corinium Global Intelligence, ODSC, and Data Council. Upcoming events include the Software Architecture Conference in NYC, Strata Data in San Jose, and PyCon US in Pittsburgh. Go to dataengineeringpodcast.com/conferences to learn more about these and other events, and take advantage of our partner discounts to save money when you register today.
Your host is Tobias Macey and today I’m interviewing Josh Beemster about how Snowplow manages deployment and maintenance of their managed service in their customers’ cloud accounts.
Interview
Introduction
How did you get involved in the area of data management?
Can you start by giving an overview of the components in your system architecture and the nature of your managed service?
What are some of the challenges that are inherent to the private SaaS nature of your managed service?
What elements of your system require the most attention and maintenance to keep them running properly?
Which components in the pipeline are most subject to variability in traffic or resource pressure and what do you do to ensure proper capacity?
How do you manage deployment of the full Snowplow pipeline for your customers?
How has your strategy for deployment evolved since you first began offering the managed service?
How has the architecture of the pipeline evolved to simplify operations?
How much customization do you allow for in the event that the customer has their own system that they want to use in place of one of your supported components?
What are some of the common difficulties that you encounter when working with customers who need customized components, topologies, or event flows?
How does that reflect in the tooling that you use to manage their deployments?
What types of metrics do you track and what do you use for monitoring and alerting to ensure that your customers' pipelines are running smoothly?
What are some of the most interesting/unexpected/challenging lessons that you have learned in the process of working with and on Snowplow?
What are some lessons that you can generalize for management of data infrastructure more broadly?
If you could start over with all of Snowplow and the infrastructure automation for it today, what would you do differently?
What do you have planned for the future of the Snowplow product and infrastructure management?
Contact Info
LinkedIn
jbeemster on GitHub
@jbeemster1 on Twitter
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Links
Snowplow Analytics
Podcast Episode
Terraform
Consul
Nomad
Meltdown Vulnerability
Spectre Vulnerability
AWS Kinesis
Elasticsearch
SnowflakeDB
Indicative
S3
Segment
AWS Cloudwatch
Stackdriver
Apache Kafka
Apache Pulsar
Google Cloud PubSub
AWS SQS
AWS SNS
AWS Redshift
Ansible
AWS Cloudformation
Kubernetes
AWS EMR
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
2/18/2020 • 49 minutes, 1 second
Data Modeling That Evolves With Your Business Using Data Vault
Summary
Designing the structure for your data warehouse is a complex and challenging process. As businesses deal with a growing number of sources and types of information that they need to integrate, they need a data modeling strategy that provides them with flexibility and speed. Data Vault is an approach that allows for evolving a data model in place without requiring destructive transformations and massive up front design to answer valuable questions. In this episode Kent Graziano shares his journey with data vault, explains how it allows for an agile approach to data warehousing, and explains the core principles of how to use it. If you’re struggling with unwieldy dimensional models, slow moving projects, or challenges integrating new data sources then listen in on this conversation and then give data vault a try for yourself.
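For readers who want a concrete picture of the hub-and-satellite structure discussed in this episode, here is a minimal sketch in Python using SQLite; the table and column names are purely illustrative assumptions, and a real Data Vault model is derived from the business keys of your source systems.

```python
# A minimal sketch of Data Vault style DDL; the names below are hypothetical.
import sqlite3  # any SQL engine works; sqlite keeps the example self-contained

DDL = """
CREATE TABLE hub_customer (          -- hub: one row per business key
    customer_hk   TEXT PRIMARY KEY,  -- hash of the business key
    customer_id   TEXT NOT NULL,     -- the business key itself
    load_date     TEXT NOT NULL,
    record_source TEXT NOT NULL
);
CREATE TABLE sat_customer_details (  -- satellite: descriptive attributes over time
    customer_hk   TEXT NOT NULL REFERENCES hub_customer (customer_hk),
    load_date     TEXT NOT NULL,
    name          TEXT,
    email         TEXT,
    record_source TEXT NOT NULL,
    PRIMARY KEY (customer_hk, load_date)
);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(DDL)
```

New sources or attributes are accommodated by adding hubs, links, or satellites rather than restructuring existing tables, which is the agility the episode focuses on.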
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, a 40Gbit public network, fast object storage, and a brand new managed Kubernetes platform, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. And for your machine learning workloads, they’ve got dedicated CPU and GPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
Setting up and managing a data warehouse for your business analytics is a huge task. Integrating real-time data makes it even more challenging, but the insights you obtain can make or break your business growth. You deserve a data warehouse engine that outperforms the demands of your customers and simplifies your operations at a fraction of the time and cost that you might expect. You deserve Clickhouse, the open source analytical database that deploys and scales wherever and whenever you want it to and turns data into actionable insights. And Altinity, the leading software and service provider for Clickhouse, is on a mission to help data engineers and DevOps managers tame their operational analytics. Go to dataengineeringpodcast.com/altinity for a free consultation to find out how they can help you today.
You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Corinium Global Intelligence, ODSC, and Data Council. Upcoming events include the Software Architecture Conference in NYC, Strata Data in San Jose, and PyCon US in Pittsburgh. Go to dataengineeringpodcast.com/conferences to learn more about these and other events, and take advantage of our partner discounts to save money when you register today.
Your host is Tobias Macey and today I’m interviewing Kent Graziano about data vault modeling and the role that it plays in the current data landscape
Interview
Introduction
How did you get involved in the area of data management?
Can you start by giving an overview of what data vault modeling is and how it differs from other approaches such as third normal form or the star/snowflake schema?
What is the history of this approach and what limitations of alternate styles of modeling is it attempting to overcome?
How did you first encounter this approach to data modeling and what is your motivation for dedicating so much time and energy to promoting it?
What are some of the primary challenges associated with data modeling that contribute to the long lead times for data requests or outright project failure?
What are some of the foundational skills and knowledge that are necessary for effective modeling of data warehouses?
How has the era of data lakes, unstructured/semi-structured data, and non-relational storage engines impacted the state of the art in data modeling?
Is there any utility in data vault modeling in a data lake context (S3, Hadoop, etc.)?
What are the steps for establishing and evolving a data vault model in an organization?
How does that approach scale from one to many data sources and their varying lifecycles of schema changes and data loading?
What are some of the changes in query structure that consumers of the model will need to plan for?
Are there any performance or complexity impacts imposed by the data vault approach?
Can you talk through the overall lifecycle of data in a data vault modeled warehouse?
How does that compare to approaches such as audit/history tables in transaction databases or slowly changing dimensions in a star or snowflake model?
What are some cases where a data vault approach doesn’t fit the needs of an organization or application?
For listeners who want to learn more, what are some references or exercises that you recommend?
Contact Info
Website
LinkedIn
@KentGraziano on Twitter
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
SnowflakeDB
Data Vault Modeling
Data Warrior Blog
OLTP == On-Line Transaction Processing
Data Warehouse
Bill Inmon
Claudia Imhoff
Oracle DB
Third Normal Form
Star Schema
Snowflake Schema
Relational Theory
Sixth Normal Form
Denormalization
Pivot Table
Dan Linstedt
TDAN.com
Ralph Kimball
Agile Manifesto
Schema On Read
Data Lake
Hadoop
NoSQL
Data Vault Conference
Teradata
ODS (Operational Data Store) Model
Supercharge Your Data Warehouse (affiliate link)
Building A Scalable Data Warehouse With Data Vault 2.0 (affiliate link)
Data Model Resource Book (affiliate link)
Data Warehouse Toolkit (affiliate link)
Building The Data Warehouse (affiliate link)
Dan Linstedt Blog
Performance G2
Scale Free European Classes
Certus Australian Classes
Wherescape
Erwin
VaultSpeed
Data Vault Builder
Varigence BimlFlex
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
2/9/2020 • 1 hour, 6 minutes, 21 seconds
The Benefits And Challenges Of Building A Data Trust
Summary
Every business collects data in some fashion, but sometimes the true value of the collected information only comes when it is combined with other data sources. Data trusts are a legal framework for allowing businesses to collaboratively pool their data. This allows the members of the trust to increase the value of their individual repositories and gain new insights which would otherwise require substantial effort in duplicating the data owned by their peers. In this episode Tom Plagge and Greg Mundy explain how the BrightHive platform serves to establish and maintain data trusts, the technical and organizational challenges they face, and the outcomes that they have witnessed. If you are curious about data sharing strategies or data collaboratives, then listen now to learn more!
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Corinium Global Intelligence, ODSC, and Data Council. Upcoming events include the Software Architecture Conference in NYC, Strata Data in San Jose, and PyCon US in Pittsburgh. Go to dataengineeringpodcast.com/conferences to learn more about these and other events, and take advantage of our partner discounts to save money when you register today.
Your host is Tobias Macey and today I’m interviewing Tom Plagge and Gregory Mundy about BrightHive, a platform for building data trusts
Interview
Introduction
How did you get involved in the area of data management?
Can you start by describing what a data trust is?
Why might an organization want to build one?
What is BrightHive and what is its origin story?
Beyond having a storage location with access controls, what are the components of a data trust that are necessary for them to be viable?
What are some of the challenges that are common in establishing an agreement among organizations who are participating in a data trust?
What are the responsibilities of each of the participants in a data trust?
For an individual or organization who wants to participate in an existing trust, what is involved in gaining access?
How does BrightHive support the process of building a data trust?
How is ownership of derivative data sets/data products and associated intellectual property handled in the context of a trust?
How is the technical architecture of BrightHive implemented and how has it evolved since it first started?
What are some of the ways that you approach the challenge of data privacy in these sharing agreements?
What are some legal and technical guards that you implement to encourage ethical uses of the data contained in a trust?
What is the motivation for releasing the technical elements of BrightHive as open source?
What are some of the most interesting, innovative, or inspirational ways that you have seen BrightHive used?
Being a shared platform for empowering other organizations to collaborate, I imagine there is a strong focus on long-term sustainability. How are you approaching that problem and what is the business model for BrightHive?
What have you found to be the most interesting/unexpected/challenging aspects of building and growing the technical and business infrastructure of BrightHive?
What do you have planned for the future of BrightHive?
Contact Info
Tom
LinkedIn
tplagge on GitHub
Gregory
LinkedIn
gregmundy on GitHub
@graygoree on Twitter
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Links
BrightHive
Data Science For Social Good
Workforce Data Initiative
NASA
NOAA
Data Trust
Data Collaborative
Public Benefit Corporation
Terraform
Airflow
Podcast.__init__ Episode
Dagster
Podcast Episode
Secure Multi-Party Computation
Public Key Encryption
AWS Macie
Blockchain
Smart Contracts
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
2/3/2020 • 56 minutes, 52 seconds
Pay Down Technical Debt In Your Data Pipeline With Great Expectations
Summary
Data pipelines are complicated and business critical pieces of technical infrastructure. Unfortunately they are also complex and difficult to test, leading to a significant amount of technical debt which contributes to slower iteration cycles. In this episode James Campbell describes how he helped create the Great Expectations framework to help you gain control and confidence in your data delivery workflows, the challenges of validating and monitoring the quality and accuracy of your data, and how you can use it in your own environments to improve your ability to move fast.
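As a rough illustration of the kind of pipeline checks discussed in this episode, here is a minimal sketch using the pandas-flavored Great Expectations API; the file name and column names are assumptions for the example, and the API surface has evolved across releases.

```python
# A minimal sketch of declaring expectations against a pandas batch.
# "orders.csv", "id", and "amount" are hypothetical; swap in your own data.
import great_expectations as ge

df = ge.read_csv("orders.csv")

# Each expectation call checks the current batch and returns a result object.
df.expect_column_values_to_not_be_null("id")
df.expect_column_values_to_be_between("amount", min_value=0, max_value=10_000)

# Validate the batch against all expectations declared on it so far.
results = df.validate()
print(results)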
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Corinium Global Intelligence, ODSC, and Data Council. Upcoming events include the Software Architecture Conference in NYC, Strata Data in San Jose, and PyCon US in Pittsburgh. Go to dataengineeringpodcast.com/conferences to learn more about these and other events, and take advantage of our partner discounts to save money when you register today.
Your host is Tobias Macey and today I’m interviewing James Campbell about Great Expectations, the open source test framework for your data pipelines which helps you continually monitor and validate the integrity and quality of your data
Interview
Introduction
How did you get involved in the area of data management?
Can you start by explaining what Great Expectations is and the origin of the project?
What has changed in the implementation and focus of Great Expectations since we last spoke on Podcast.__init__ 2 years ago?
Prior to your introduction of Great Expectations what was the state of the industry with regards to testing, monitoring, or validation of the health and quality of data and the platforms operating on them?
What are some of the types of checks and assertions that can be made about a pipeline using Great Expectations?
What are some of the non-obvious use cases for Great Expectations?
What aspects of a data pipeline or the context that it operates in are unable to be tested or validated in a programmatic fashion?
Can you describe how Great Expectations is implemented?
For anyone interested in using Great Expectations, what is the workflow for incorporating it into their environments?
What are some of the test cases that are often overlooked which data engineers and pipeline operators should be considering?
Can you talk through some of the ways that Great Expectations can be extended?
What are some notable extensions or integrations of Great Expectations?
Beyond the testing and validation of data as it is being processed you have also included features that support documentation and collaboration of the data lifecycles. What are some of the ways that those features can benefit a team working with Great Expectations?
What are some of the most interesting/innovative/unexpected ways that you have seen Great Expectations used?
What are the limitations of Great Expectations?
What are some cases where Great Expectations would be the wrong choice?
What do you have planned for the future of Great Expectations?
Contact Info
LinkedIn
@jpcampbell42 on Twitter
jcampbell on GitHub
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Links
Great Expectations
GitHub
Twitter
Podcast.__init__ Interview on Great Expectations
Superconductive Health
Abe Gong
Pandas
Podcast.__init__ Interview
SQLAlchemy
PostgreSQL
Podcast Episode
RedShift
BigQuery
Spark
Cloudera
DataBricks
Great Expectations Data Docs
Great Expectations Data Profiling
Apache NiFi
Amazon Deequ
Tensorflow Data Validation
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
1/27/2020 • 46 minutes, 31 seconds
Replatforming Production Dataflows
Summary
Building a reliable data platform is a neverending task. Even if you have a process that works for you and your business there can be unexpected events that require a change in your platform architecture. In this episode the head of data for Mayvenn shares their experience migrating an existing set of streaming workflows onto the Ascend platform after their previous vendor was acquired and changed their offering. This is an interesting discussion about the ongoing maintenance and decision making required to keep your business data up to date and accurate.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Corinium Global Intelligence, ODSC, and Data Council. Upcoming events include the Software Architecture Conference in NYC, Strata Data in San Jose, and PyCon US in Pittsburgh. Go to dataengineeringpodcast.com/conferences to learn more about these and other events, and take advantage of our partner discounts to save money when you register today.
Your host is Tobias Macey and today I’m interviewing Sheel Choksi and Sean Knapp about Mayvenn’s experience migrating their dataflows onto the Ascend platform
Interview
Introduction
How did you get involved in the area of data management?
Can you start off by describing what Mayvenn is and give a sense of how you are using data?
What are the sources of data that you are working with?
What are the biggest challenges you are facing in collecting, processing, and analyzing your data?
Before adopting Ascend, what did your overall platform for data management look like?
What were the pain points that you were facing which led you to seek a new solution?
What were the selection criteria that you set forth for addressing your needs at the time?
What were the aspects of Ascend which were most appealing?
What are some of the edge cases that you have dealt with in the Ascend platform?
Now that you have been using Ascend for a while, what components of your previous architecture have you been able to retire?
Can you talk through the migration process of incorporating Ascend into your platform and any validation that you used to ensure that your data operations remained accurate and consistent?
How has the migration to Ascend impacted your overall capacity for processing data or integrating new sources into your analytics?
What are your future plans for how to use data across your organization?
Contact Info
Sheel
LinkedIn
sheelc on GitHub
Sean
LinkedIn
@seanknapp on Twitter
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Links
Mayvenn
Ascend
Podcast Episode
Google Sawzall
Clickstream
Apache Kafka
Alooma
Podcast Episode
Amazon Redshift
ELT == Extract, Load, Transform
DBT
Podcast Episode
Amazon Data Pipeline
Upsolver
Pentaho
Stitch Data
Fivetran
Podcast Episode
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
1/20/2020 • 39 minutes
Planet Scale SQL For The New Generation Of Applications With YugabyteDB
Summary
The modern era of software development is characterized by ubiquitous access to elastic infrastructure for computation and easy automation of deployment. This has led to a class of applications that can quickly scale to serve users worldwide. This requires a new class of data storage that can accommodate that demand without having to rearchitect your system at each level of growth. YugabyteDB is an open source database designed to support planet scale workloads with high data density and full ACID compliance. In this episode Karthik Ranganathan explains how Yugabyte is architected, their motivations for being fully open source, and how they simplify the process of scaling your application from greenfield to global.
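Because YugabyteDB exposes a PostgreSQL-compatible YSQL interface, existing drivers can talk to it directly. Here is a minimal sketch using psycopg2; the host, port (5433 is the YSQL default), credentials, and table are assumptions for illustration.

```python
# Connect to YugabyteDB over the PostgreSQL wire protocol (YSQL).
import psycopg2

conn = psycopg2.connect(host="127.0.0.1", port=5433,
                        dbname="yugabyte", user="yugabyte")
cur = conn.cursor()

# Ordinary PostgreSQL DDL/DML; the table name and shape are hypothetical.
cur.execute("""
    CREATE TABLE IF NOT EXISTS events (
        id   BIGSERIAL PRIMARY KEY,
        body JSONB NOT NULL
    )
""")
cur.execute("INSERT INTO events (body) VALUES (%s)", ('{"type": "signup"}',))
conn.commit()
```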
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Corinium Global Intelligence, ODSC, and Data Council. Upcoming events include the Software Architecture Conference in NYC, Strata Data in San Jose, and PyCon US in Pittsburgh. Go to dataengineeringpodcast.com/conferences to learn more about these and other events, and take advantage of our partner discounts to save money when you register today.
Your host is Tobias Macey and today I’m interviewing Karthik Ranganathan about YugabyteDB, the open source, high-performance distributed SQL database for global, internet-scale apps.
Interview
Introduction
How did you get involved in the area of data management?
Can you start by describing what YugabyteDB is and its origin story?
A growing trend in database engines (e.g. FaunaDB, CockroachDB) has been an out of the box focus on global distribution. Why is that important and how does it work in Yugabyte?
What are the caveats?
What are the most notable features of YugabyteDB that would lead someone to choose it over any of the myriad other options?
What are the use cases that it is uniquely suited to?
What are some of the systems or architecture patterns that can be replaced with Yugabyte?
How does the design of Yugabyte or the different ways it is being used influence the way that users should think about modeling their data?
Yugabyte is an impressive piece of engineering. Can you talk through the major design elements and how it is implemented?
Easy scaling and failover is a feature that many database engines would like to be able to claim. What are the difficult elements that prevent them from implementing that capability as a standard practice?
What do you have to sacrifice in order to support the level of scale and fault tolerance that you provide?
Speaking of scaling, there are many ways to define that term, from vertical scaling of storage or compute, to horizontal scaling of compute, to scaling of reads and writes. What are the primary scaling factors that you focus on in Yugabyte?
How do you approach testing and validation of the code given the complexity of the system that you are building?
In terms of the query API you have support for a Postgres compatible SQL dialect as well as a Cassandra based syntax. What are the benefits of targeting compatibility with those platforms?
What are the challenges and benefits of maintaining compatibility with those other platforms?
Can you describe how the storage layer is implemented and the division between the different query formats?
What are the operational characteristics of YugabyteDB?
What are the complexities or edge cases that users should be aware of when planning a deployment?
One of the challenges of working with large volumes of data is creating and maintaining backups. How does Yugabyte handle that problem?
Most open source infrastructure projects that are backed by a business withhold various "enterprise" features such as backups and change data capture as a means of driving revenue. Can you talk through your motivation for releasing those capabilities as open source?
What is the business model that you are using for YugabyteDB and how does it differ from the tribal knowledge of how open source companies generally work?
What are some of the most interesting, innovative, or unexpected ways that you have seen yugabyte used?
When is Yugabyte the wrong choice?
What do you have planned for the future of the technical and business aspects of Yugabyte?
Contact Info
@karthikr on Twitter
LinkedIn
rkarthik007 on GitHub
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Links
YugabyteDB
GitHub
Nutanix
Facebook Engineering
Apache Cassandra
Apache HBase
Delphi
FaunaDB
Podcast Episode
CockroachDB
Podcast Episode
HA == High Availability
Oracle
Microsoft SQL Server
PostgreSQL
Podcast Episode
MongoDB
Amazon Aurora
PGCrypto
PostGIS
pl/pgsql
Foreign Data Wrappers
PipelineDB
Podcast Episode
Citus
Podcast Episode
Jepsen Testing
Yugabyte Jepsen Test Results
OLTP == Online Transaction Processing
OLAP == Online Analytical Processing
DocDB
Google Spanner
Google BigTable
Spot Instances
Kubernetes
Cloudformation
Terraform
Prometheus
Debezium
Podcast Episode
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
1/13/2020 • 1 hour, 1 minute, 16 seconds
Change Data Capture For All Of Your Databases With Debezium
Summary
Databases are useful for inspecting the current state of your application, but inspecting the history of that data can get messy without a way to track changes as they happen. Debezium is an open source platform for reliable change data capture that you can use to build supplemental systems for everything from maintaining audit trails to real-time updates of your data warehouse. In this episode Gunnar Morling and Randall Hauch explain why it got started, how it works, and some of the myriad ways that you can use it. If you have ever struggled with implementing your own change data capture pipeline, or understanding when it would be useful then this episode is for you.
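To make the deployment model a little more concrete, here is a rough sketch of registering a Debezium MySQL connector with a Kafka Connect worker over its REST API; the hostnames, credentials, and topic names are placeholders, and the exact configuration keys vary between Debezium versions.

```python
# Register a Debezium MySQL connector with a Kafka Connect worker.
# All connection details below are illustrative placeholders.
import json
import requests

connector = {
    "name": "inventory-connector",
    "config": {
        "connector.class": "io.debezium.connector.mysql.MySqlConnector",
        "database.hostname": "mysql",
        "database.port": "3306",
        "database.user": "debezium",
        "database.password": "secret",
        "database.server.id": "184054",
        "database.server.name": "dbserver1",
        "database.history.kafka.bootstrap.servers": "kafka:9092",
        "database.history.kafka.topic": "schema-changes.inventory",
    },
}

resp = requests.post("http://connect:8083/connectors",
                     headers={"Content-Type": "application/json"},
                     data=json.dumps(connector))
resp.raise_for_status()  # change events then flow into Kafka topics, one per table
```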
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Corinium Global Intelligence, ODSC, and Data Council. Upcoming events include the Software Architecture Conference in NYC, Strata Data in San Jose, and PyCon US in Pittsburgh. Go to dataengineeringpodcast.com/conferences to learn more about these and other events, and take advantage of our partner discounts to save money when you register today.
Your host is Tobias Macey and today I’m interviewing Randall Hauch and Gunnar Morling about Debezium, an open source distributed platform for change data capture
Interview
Introduction
How did you get involved in the area of data management?
Can you start by describing what Change Data Capture is and some of the ways that it can be used?
What is Debezium and what problems does it solve?
What was your motivation for creating it?
What are some of the use cases that it enables?
What are some of the other options on the market for handling change data capture?
Can you describe the systems architecture of Debezium and how it has evolved since it was first created?
How has the tight coupling with Kafka impacted the direction and capabilities of Debezium?
What, if any, other substrates does Debezium support (e.g. Pulsar, Bookkeeper, Pravega)?
What are the data sources that are supported by Debezium?
Given that you have branched into non-relational stores, how have you approached organization of the code to allow for handling the specifics of those engines while retaining a common core set of functionality?
What is involved in deploying, integrating, and maintaining an installation of Debezium?
What are the scaling factors?
What are some of the edge cases that users and operators should be aware of?
Debezium handles the ingestion and distribution of database changesets. What are the downstream challenges or complications that application designers or systems architects have to deal with to make use of that information?
What are some of the design tensions that exist in the Debezium community between acting as a simple pipe vs. adding functionality for interpreting/aggregating/formatting the information contained in the changesets?
What are some of the common downstream systems that consume the outputs of Debezium?
What challenges or complexities are involved in building clients that can consume the changesets from the different engines that you support?
What are some of the most interesting, unexpected, or innovative ways that you have seen Debezium used?
What have you found to be the most challenging, complex, or complicated aspects of building, maintaining, and growing Debezium?
What is in store for the future of Debezium?
Contact Info
Randall
LinkedIn
@rhauch on Twitter
rhauch on GitHub
Gunnar
gunnarmorling on GitHub
@gunnarmorling on Twitter
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Links
Debezium
Confluent
Kafka Connect
RedHat
Bean Validation
Change Data Capture
DBMS == DataBase Management System
Apache Kafka
Apache Flink
Podcast Episode
Yugabyte DB
PostgreSQL
Podcast Episode
MySQL
Microsoft SQL Server
Apache Pulsar
Podcast Episode
Pravega
Podcast Episode
NATS
Amazon Kinesis
Pulsar IO
WePay
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
1/6/2020 • 53 minutes, 1 second
Building The DataDog Platform For Processing Timeseries Data At Massive Scale
Summary
DataDog is one of the most successful companies in the space of metrics and monitoring for servers and cloud infrastructure. In order to support their customers, they need to capture, process, and analyze massive amounts of timeseries data with a high degree of uptime and reliability. Vadim Semenov works on their data engineering team and joins the podcast in this episode to discuss the challenges that he works through, the systems that DataDog has built to power their business, and how their teams are organized to allow for rapid growth and massive scale. Getting an inside look at the companies behind the services we use is always useful, and this conversation was no exception.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Corinium Global Intelligence, ODSC, and Data Council. Upcoming events include the Software Architecture Conference in NYC, Strata Data in San Jose, and PyCon US in Pittsburgh. Go to dataengineeringpodcast.com/conferences to learn more about these and other events, and take advantage of our partner discounts to save money when you register today.
Your host is Tobias Macey and today I’m interviewing Vadim Semenov about how data engineers work at DataDog
Interview
Introduction
How did you get involved in the area of data management?
For anyone who isn’t familiar with DataDog, can you start by describing the types and volumes of data that you’re dealing with?
What are the main components of your platform for managing that information?
How are the data teams at DataDog organized and what are your primary responsibilities in the organization?
What are some of the complexities and challenges that you face in your work as a result of the volume of data that you are processing?
What are some of the strategies which have proven to be most useful in overcoming those challenges?
Who are the main consumers of your work and how do you build in feedback cycles to ensure that their needs are being met?
Given that the majority of the data being ingested by DataDog is timeseries, what are your lifecycle and retention policies for that information?
Most of the data that you are working with is customer generated from your deployed agents and API integrations. How do you manage cleanliness and schema enforcement for the events as they are being delivered?
What are some of the projects that you have planned for the upcoming months and years?
What are some of the technologies, patterns, or practices that you are hoping to adopt?
Contact Info
LinkedIn
@databuryat on Twitter
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Links
DataDog
Hadoop
Hive
Yarn
Chef
SRE == Site Reliability Engineer
Application Performance Management (APM)
Apache Kafka
RocksDB
Cassandra
Apache Parquet data serialization format
SLA == Service Level Agreement
WatchDog
Apache Spark
Podcast Episode
Apache Pig
Databricks
JVM == Java Virtual Machine
Kubernetes
SSIS (SQL Server Integration Services)
Pentaho
JasperSoft
Apache Airflow
Podcast.__init__ Episode
Apache NiFi
Podcast Episode
Luigi
Dagster
Podcast Episode
Prefect
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
12/30/2019 • 45 minutes, 54 seconds
Building The Materialize Engine For Interactive Streaming Analytics In SQL
Summary
Transactional databases used in applications are optimized for fast reads and writes with relatively simple queries on a small number of records. Data warehouses are optimized for batched writes and complex analytical queries. Between those use cases there are varying levels of support for fast reads on quickly changing data. To address that need more completely the team at Materialize has created an engine that allows for building queryable views of your data as it is continually updated from the stream of changes being generated by your applications. In this episode Frank McSherry, chief scientist of Materialize, explains why it was created, what use cases it enables, and how it works to provide fast queries on continually updated data.
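Since Materialize speaks the PostgreSQL wire protocol, a standard driver is enough to try the idea of incrementally maintained views. The sketch below assumes a previously defined pageviews source and default local connection settings; the SQL for defining sources differs between versions, so treat this as illustrative only.

```python
# A minimal sketch of querying Materialize over its PostgreSQL-compatible protocol.
import psycopg2

conn = psycopg2.connect(host="127.0.0.1", port=6875,
                        dbname="materialize", user="materialize")
conn.autocommit = True
cur = conn.cursor()

# Maintain a continuously updated aggregate over a hypothetical `pageviews`
# source fed by change data capture.
cur.execute("""
    CREATE MATERIALIZED VIEW pageview_counts AS
    SELECT page, count(*) AS views
    FROM pageviews
    GROUP BY page
""")

# Reads against the view return the current result without re-scanning history.
cur.execute("SELECT page, views FROM pageview_counts ORDER BY views DESC LIMIT 10")
for page, views in cur.fetchall():
    print(page, views)
```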
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Corinium Global Intelligence, ODSC, and Data Council. Upcoming events include the Software Architecture Conference in NYC, Strata Data in San Jose, and PyCon US in Pittsburgh. Go to dataengineeringpodcast.com/conferences to learn more about these and other events, and take advantage of our partner discounts to save money when you register today.
Your host is Tobias Macey and today I’m interviewing Frank McSherry about Materialize, an engine for maintaining materialized views on incrementally updated data from change data captures
Interview
Introduction
How did you get involved in the area of data management?
Can you start by describing what Materialize is and the problems that you are aiming to solve with it?
What was your motivation for creating it?
What use cases does Materialize enable?
What are some of the existing tools or systems that you have seen employed to address those needs which can be replaced by Materialize?
How does it fit into the broader ecosystem of data tools and platforms?
What are some of the use cases that Materialize is uniquely able to support?
How is Materialize architected and how has the design evolved since you first began working on it?
Materialize is based on your timely-dataflow project, which itself is based on the work you did on Naiad. What was your reasoning for using Rust as the implementation target and what benefits has it provided?
What are some of the components or primitives that were missing in the Rust ecosystem as compared to what is available in Java or C/C++, which have been the dominant languages for distributed data systems?
In the list of features, you highlight full support for ANSI SQL 92. What were some of the edge cases that you faced in complying with that standard given the distributed execution context for Materialize?
A majority of SQL oriented platforms define custom extensions or built-in functions that are specific to their problem domain. What are some of the existing or planned additions for Materialize?
Can you talk through the lifecycle of data as it flows from the source database and through the Materialize engine?
What are the considerations and constraints on maintaining the full history of the source data within Materialize?
For someone who wants to use Materialize, what is involved in getting it set up and integrated with their data sources?
What is the workflow for defining and maintaining a set of views?
What are some of the complexities that users might face in ensuring the ongoing functionality of those views?
For someone who is unfamiliar with the semantics of streaming SQL, what are some of the conceptual shifts that they should be aware of?
The Materialize product is currently pre-release. What are the remaining steps before launching it?
What do you have planned for the future of the product and company?
Contact Info
frankmcsherry on GitHub
@frankmcsherry on Twitter
Blog
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Links
Materialize
Timely Dataflow
Dryad: Distributed Data-Parallel Programs from Sequential Building Blocks
Naiad: A Timely Dataflow System
Differential Privacy
PageRank
Data Council Presentation on Materialize
Change Data Capture
Debezium
Apache Spark
Podcast Episode
Flink
Podcast Episode
Go language
Rust
Haskell
Rust Borrow Checker
GDB (GNU Debugger)
Avro
Apache Calcite
ANSI SQL 92
Correlated Subqueries
OOM (Out Of Memory) Killer
Log-Structured Merge Tree
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
12/23/2019 • 48 minutes, 7 seconds
Solving Data Lineage Tracking And Data Discovery At WeWork
Summary
Building clean datasets with reliable and reproducible ingestion pipelines is completely useless if it’s not possible to find them and understand their provenance. The solution to discoverability and tracking of data lineage is to incorporate a metadata repository into your data platform. The metadata repository serves as a data catalog and a means of reporting on the health and status of your datasets when it is properly integrated into the rest of your tools. At WeWork they needed a system that would provide visibility into their Airflow pipelines and the outputs produced. In this episode Julien Le Dem and Willy Lulciuc explain how they built Marquez to serve that need, how it is architected, and how it compares to other options that you might be considering. Even if you already have a metadata repository this is worth a listen to learn more about the value that visibility of your data can bring to your organization.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
You work hard to make sure that your data is clean, reliable, and reproducible throughout the ingestion pipeline, but what happens when it gets to the data warehouse? Dataform picks up where your ETL jobs leave off, turning raw data into reliable analytics. Their web-based transformation tool with built-in collaboration features lets your analysts own the full lifecycle of data in your warehouse. Featuring built-in version control integration, real-time error checking for their SQL code, data quality tests, scheduling, and a data catalog with annotation capabilities, it’s everything you need to keep your data warehouse in order. Sign up for a free trial today at dataengineeringpodcast.com/dataform and email team@dataform.co with the subject "Data Engineering Podcast" to get a hands-on demo from one of their data experts.
You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Corinium Global Intelligence, ODSC, and Data Council. Upcoming events include the Software Architecture Conference, the Strata Data conference, and PyCon US. Go to dataengineeringpodcast.com/conferences to learn more about these and other events, and take advantage of our partner discounts to save money when you register today.
Your host is Tobias Macey and today I’m interviewing Willy Lulciuc and Julien Le Dem about Marquez, an open source platform to collect, aggregate, and visualize a data ecosystem’s metadata
Interview
Introduction
How did you get involved in the area of data management?
Can you start by describing what Marquez is?
What was missing in existing metadata management platforms that necessitated the creation of Marquez?
How do the capabilities of Marquez compare with tools and services that bill themselves as data catalogs?
How does it compare to the Amundsen platform that Lyft recently released?
What are some of the tools or platforms that are currently integrated with Marquez and what additional integrations would you like to see?
What are some of the capabilities that are unique to Marquez and how are you using them at WeWork?
What are the primary resource types that you support in Marquez?
What are some of the lowest common denominator attributes that are necessary and useful to track in a metadata repository?
Can you explain how Marquez is architected and how the design has evolved since you first began working on it?
Many metadata management systems are simply a service layer on top of a separate data storage engine. What are the benefits of using PostgreSQL as the system of record for Marquez?
What are some of the complexities that arise from relying on a relational engine as opposed to a document store or graph database?
How is the metadata itself stored and managed in Marquez?
How much up-front data modeling is necessary and what types of schema representations are supported?
Can you talk through the overall workflow of someone using Marquez in their environment?
What is involved in registering and updating datasets?
How do you define and track the health of a given dataset?
What are some of the interesting questions that can be answered from the information stored in Marquez?
What were your assumptions going into this project and how have they been challenged or updated as you began using it for production use cases?
For someone who is interested in using Marquez, what is involved in deploying and maintaining an installation of it?
What have you found to be the most challenging or unanticipated aspects of building and maintaining a metadata repository and data discovery platform?
When is Marquez the wrong choice for a metadata repository?
What do you have planned for the future of Marquez?
Contact Info
Julien Le Dem
@J_ on Twitter
Email
julienledem on GitHub
Willy
LinkedIn
@wslulciuc on Twitter
wslulciuc on GitHub
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Links
Marquez
DataEngConf Presentation
WeWork
Canary
Yahoo
Dremio
Hadoop
Pig
Parquet
Podcast Episode
Airflow
Apache Atlas
Amundsen
Podcast Episode
Uber DataBook
LinkedIn DataHub
Iceberg Table Format
Podcast Episode
Delta Lake
Podcast Episode
Great Expectations data pipeline unit testing framework
Podcast.__init__ Episode
Redshift
SnowflakeDB
Podcast Episode
Apache Kafka Schema Registry
Podcast Episode
Open Tracing
Jaeger
Zipkin
DropWizard Java framework
Marquez UI
Cayley Graph Database
Kubernetes
Marquez Helm Chart
Marquez Docker Container
Dagster
Podcast Episode
Luigi
DBT
Podcast Episode
Thrift
Protocol Buffers
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
12/16/2019 • 1 hour, 1 minute, 52 seconds
SnowflakeDB: The Data Warehouse Built For The Cloud
Summary
Data warehouses have gone through many transformations, from standard relational databases on powerful hardware, to column oriented storage engines, to the current generation of cloud-native analytical engines. SnowflakeDB has been leading the charge to take advantage of cloud services that simplify the separation of compute and storage. In this episode Kent Graziano, chief technical evangelist for SnowflakeDB, explains how it is differentiated from other managed platforms and traditional data warehouse engines, the features that allow you to scale your usage dynamically, and how it allows for a shift in your workflow from ETL to ELT. If you are evaluating your options for building or migrating a data platform, then this is definitely worth a listen.
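For a concrete sense of the ETL-to-ELT shift discussed in the interview, here is a minimal sketch using the snowflake-connector-python client: raw JSON lands in a single VARIANT column and the transformation happens in SQL inside Snowflake afterwards. The account, credentials, warehouse, and table names are placeholders invented for illustration, not anything referenced in the episode.

```python
# Hypothetical ELT sketch: land semi-structured data first, transform in SQL later.
import snowflake.connector

conn = snowflake.connector.connect(
    account="my_account",      # placeholder
    user="my_user",            # placeholder
    password="my_password",    # placeholder
    warehouse="ANALYTICS_WH",  # compute is sized and scaled independently of storage
)

cur = conn.cursor()
# Land the raw JSON as-is; no up-front schema beyond a single VARIANT column.
cur.execute("CREATE TABLE IF NOT EXISTS raw_events (payload VARIANT)")
# Transform after loading: Snowflake can query into the JSON directly.
cur.execute("""
    SELECT payload:event_type::string AS event_type, COUNT(*) AS events
    FROM raw_events
    GROUP BY 1
    ORDER BY events DESC
""")
for event_type, events in cur.fetchall():
    print(event_type, events)
conn.close()
```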
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media and the Python Software Foundation. Upcoming events include the Software Architecture Conference in NYC and PyCon US in Pittsburgh. Go to dataengineeringpodcast.com/conferences to learn more about these and other events, and take advantage of our partner discounts to save money when you register today.
Your host is Tobias Macey and today I’m interviewing Kent Graziano about SnowflakeDB, the cloud-native data warehouse
Interview
Introduction
How did you get involved in the area of data management?
Can you start by explaining what SnowflakeDB is for anyone who isn’t familiar with it?
How does it compare to the other available platforms for data warehousing?
How does it differ from traditional data warehouses?
How does the performance and flexibility affect the data modeling requirements?
Snowflake is one of the data stores that is enabling the shift from an ETL to an ELT workflow. What are the features that allow for that approach and what are some of the challenges that it introduces?
Can you describe how the platform is architected and some of the ways that it has evolved as it has grown in popularity?
What are some of the current limitations that you are struggling with?
For someone getting started with Snowflake what is involved with loading data into the platform?
What is their workflow for allocating and scaling compute capacity and running analyses?
One of the interesting features enabled by your architecture is data sharing. What are some of the most interesting or unexpected uses of that capability that you have seen?
What are some other features or use cases for Snowflake that are not as well known or publicized which you think users should know about?
When is SnowflakeDB the wrong choice?
What are some of the plans for the future of SnowflakeDB?
Contact Info
LinkedIn
Website
@KentGraziano on Twitter
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
SnowflakeDB
Free Trial
Stack Overflow
Data Warehouse
Oracle DB
MPP == Massively Parallel Processing
Shared Nothing Architecture
Multi-Cluster Shared Data Architecture
Google BigQuery
AWS Redshift
AWS Redshift Spectrum
Presto
Podcast Episode
SnowflakeDB Semi-Structured Data Types
Hive
ACID == Atomicity, Consistency, Isolation, Durability
3rd Normal Form
Data Vault Modeling
Dimensional Modeling
JSON
AVRO
Parquet
SnowflakeDB Virtual Warehouses
CRM == Customer Relationship Management
Master Data Management
Podcast Episode
FoundationDB
Podcast Episode
Apache Spark
Podcast Episode
SSIS == SQL Server Integration Services
Talend
Informatica
Fivetran
Podcast Episode
Matillion
Apache Kafka
Snowpipe
Snowflake Data Exchange
OLTP == Online Transaction Processing
GeoJSON
Snowflake Documentation
SnowAlert
Splunk
Data Catalog
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
12/9/2019 • 58 minutes, 56 seconds
Organizing And Empowering Data Engineers At Citadel
Summary
The financial industry has long been driven by data, requiring a mature and robust capacity for discovering and integrating valuable sources of information. Citadel is no exception, and in this episode Michael Watson and Robert Krzyzanowski share their experiences managing and leading the data engineering teams that power the business. They shared helpful insights into some of the challenges associated with working in a regulated industry, organizing teams to deliver value rapidly and reliably, and how they approach career development for data engineers. This was a great conversation for an inside look at how to build and maintain a data driven culture.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, Corinium Global Intelligence, Alluxio, and Data Council. Go to dataengineeringpodcast.com/conferences to learn more about these and other events, and take advantage of our partner discounts to save money when you register today.
Your host is Tobias Macey and today I’m interviewing Michael Watson and Robert Krzyzanowski about the technical and organizational challenges that they and their teams are working on at Citadel
Interview
Introduction
How did you get involved in the area of data management?
Can you start by describing the size and structure of the data engineering teams at Citadel?
How have the scope and nature of responsibilities for data engineers evolved over the past few years at Citadel as more and better tools and platforms have been made available in the space and machine learning techniques have grown more sophisticated?
Can you describe the types of data that you are working with at Citadel?
What is the process for identifying, evaluating, and ingesting new sources of data?
What are some of the common core aspects of your data infrastructure?
What are some of the ways that it differs across teams or projects?
How involved are data engineers in the overall product design and delivery lifecycle?
For someone who joins your team as a data engineer, what are some of the options available to them for a career path?
What are some of the challenges that you are currently facing in managing the data lifecycle for projects at Citadel?
What are some tools or practices that you are excited to try out?
Contact Info
Michael
LinkedIn
@detroitcoder on Twitter
detroitcoder on GitHub
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Links
Citadel
Python
Hedge Fund
Quantitative Trading
Citadel Securities
Apache Airflow
Jupyter Hub
Alembic database migrations for SQLAlchemy
Terraform
DQM == Data Quality Management
Great Expectations
Podcast.__init__ Episode
Nomad
RStudio
Active Directory
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
12/3/2019 • 45 minutes, 50 seconds
Building A Real Time Event Data Warehouse For Sentry
Summary
The team at Sentry has built a platform for anyone in the world to send software errors and events. As they scaled the volume of customers and data they began running into the limitations of their initial architecture. To address the needs of their business and continue to improve their capabilities they settled on Clickhouse as the new storage and query layer to power their business. In this episode James Cunningham and Ted Kaemming describe the process of rearchitecting a production system, what they learned in the process, and some useful tips for anyone else evaluating Clickhouse.
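As a rough illustration of the kind of event storage the interview covers, the sketch below creates a ClickHouse MergeTree table and runs a per-project aggregation through the clickhouse-driver package. The schema, host, and column names are invented for illustration and are not Sentry's actual Snuba schema.

```python
# Toy example of column-oriented event storage in ClickHouse; not Snuba's schema.
from clickhouse_driver import Client

client = Client("localhost")  # placeholder host

client.execute("""
    CREATE TABLE IF NOT EXISTS errors (
        event_id String,
        project_id UInt64,
        timestamp DateTime,
        message String
    ) ENGINE = MergeTree()
    PARTITION BY toYYYYMMDD(timestamp)
    ORDER BY (project_id, timestamp)
""")

# Column-oriented storage keeps aggregations over a single project cheap.
rows = client.execute(
    "SELECT toStartOfHour(timestamp) AS hour, count() AS errors "
    "FROM errors WHERE project_id = %(project)s GROUP BY hour ORDER BY hour",
    {"project": 42},
)
for hour, errors in rows:
    print(hour, errors)
```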
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, Corinium Global Intelligence, Alluxio, and Data Council. Go to dataengineeringpodcast.com/conferences to learn more about these and other events, and take advantage of our partner discounts to save money when you register today.
Your host is Tobias Macey and today I’m interviewing Ted Kaemming and James Cunningham about Snuba, the new open source search service at Sentry implemented on top of Clickhouse
Interview
Introduction
How did you get involved in the area of data management?
Can you start by describing the internal and user-facing issues that you were facing at Sentry with the existing search capabilities?
What did the previous system look like?
What was your design criteria for building a new platform?
What was your initial list of possible system components and what was your evaluation process that resulted in your selection of Clickhouse?
Can you describe the system architecture of Snuba and some of the ways that it differs from your initial ideas of how it would work?
What have been some of the sharp edges of Clickhouse that you have had to engineer around?
How have you found the operational aspects of Clickhouse?
How did you manage the introduction of this new piece of infrastructure to a business that was already handling massive amounts of real-time data?
What are some of the downstream benefits of using Clickhouse for managing event data at Sentry?
For someone who is interested in using Snuba for their own purposes, how flexible is it for different domain contexts?
What are some of the other data challenges that you are currently facing at Sentry?
What is your next highest priority for evolving or rebuilding to address technical or business challenges?
Contact Info
James
@JTCunning on Twitter
JTCunning on GitHub
Ted
tkaemming on GitHub
Website
@tkaemming on Twitter
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Links
Sentry
Podcast.__init__ Episode
Snuba
Blog Post
Clickhouse
Podcast Episode
Disqus
Urban Airship
HBase
Google Bigtable
PostgreSQL
Redis
HyperLogLog
Riak
Celery
RabbitMQ
Apache Spark
Presto
Cassandra
Apache Kudu
Apache Pinot
Apache Druid
Flask
Apache Kafka
Cassandra Tombstone
Sentry Blog
XML
Change Data Capture
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
11/26/2019 • 1 hour, 1 minute, 15 seconds
Escaping Analysis Paralysis For Your Data Platform With Data Virtualization
Summary
With the constant evolution of technology for data management it can seem impossible to make an informed decision about whether to build a data warehouse, or a data lake, or just leave your data wherever it currently rests. What’s worse is that any time you have to migrate to a new architecture, all of your analytical code has to change too. Thankfully it’s possible to add an abstraction layer to eliminate the churn in your client code, allowing you to evolve your data platform without disrupting your downstream data users. In this episode AtScale co-founder and CTO Matthew Baird describes how the data virtualization and data engineering automation capabilities that are built into the platform free up your engineers to focus on your business needs without having to waste cycles on premature optimization. This was a great conversation about the power of abstractions and appreciating the value of increasing the efficiency of your data team.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
This week’s episode is also sponsored by Datacoral, an AWS-native, serverless data infrastructure that installs in your VPC. Datacoral helps data engineers build and manage the flow of data pipelines without having to manage any infrastructure, meaning you can spend your time on data transformations and business needs, rather than pipeline maintenance. Raghu Murthy, founder and CEO of Datacoral, built data infrastructures at Yahoo! and Facebook, scaling from terabytes to petabytes of analytic data. He started Datacoral with the goal to make SQL the universal data programming language. Visit dataengineeringpodcast.com/datacoral today to find out more.
Having all of your logs and event data in one place makes your life easier when something breaks, unless that something is your Elasticsearch cluster because it’s storing too much data. CHAOSSEARCH frees you from having to worry about data retention, unexpected failures, and expanding operating costs. They give you a fully managed service to search and analyze all of your logs in S3, entirely under your control, all for half the cost of running your own Elasticsearch cluster or using a hosted platform. Try it out for yourself at dataengineeringpodcast.com/chaossearch and don’t forget to thank them for supporting the show!
You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, Corinium Global Intelligence, Alluxio, and Data Council. Upcoming events include the combined events of the Data Architecture Summit and Graphorum, the Data Orchestration Summit, and Data Council in NYC. Go to dataengineeringpodcast.com/conferences to learn more about these and other events, and take advantage of our partner discounts to save money when you register today.
Your host is Tobias Macey and today I’m interviewing Matt Baird about AtScale, a platform that provides data virtualization and data engineering automation on top of your existing data infrastructure
Interview
Introduction
How did you get involved in the area of data management?
Can you start by describing the AtScale platform and how it fits in the ecosystem of data tools?
What was your motivation for building the platform and what were some of the early challenges that you faced in achieving your current level of success?
How is the AtScale platform architected and what have been some of the main areas of evolution and change since you first began building it?
How has the surrounding data ecosystem changed since AtScale was founded?
How are current industry trends influencing your product focus?
Can you talk through the workflow for someone implementing AtScale?
What are some of the main use cases that benefit from data virtualization capabilities?
How does it influence the relevancy of data warehouses or data lakes?
What are some of the types of tools or patterns that AtScale replaces in a data platform?
What are some of the most interesting or unexpected ways that you have seen AtScale used?
What have been some of the most challenging aspects of building and growing the platform?
When is AtScale the wrong choice?
What do you have planned for the future of the platform and business?
Contact Info
LinkedIn
@zetty on Twitter
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Links
AtScale
PeopleSoft
Oracle
Hadoop
PrestoDB
Impala
Apache Kylin
Apache Druid
Go Language
Scala
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
11/18/2019 • 55 minutes, 42 seconds
Designing For Data Protection
Summary
The practice of data management is one that requires technical acumen, but there are also many policy and regulatory issues that inform and influence the design of our systems. With the introduction of legal frameworks such as the EU GDPR and California’s CCPA, it is necessary to consider how to implement data protection and data privacy principles in the technical and policy controls that govern our data platforms. In this episode Karen Heaton and Mark Sherwood-Edwards share their experience and expertise in helping organizations achieve compliance. Even if you aren’t subject to specific rules regarding data protection it is definitely worth listening to get an overview of what you should be thinking about while building and running data pipelines.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
This week’s episode is also sponsored by Datacoral, an AWS-native, serverless data infrastructure that installs in your VPC. Datacoral helps data engineers build and manage the flow of data pipelines without having to manage any infrastructure, meaning you can spend your time on data transformations and business needs, rather than pipeline maintenance. Raghu Murthy, founder and CEO of Datacoral, built data infrastructures at Yahoo! and Facebook, scaling from terabytes to petabytes of analytic data. He started Datacoral with the goal to make SQL the universal data programming language. Visit dataengineeringpodcast.com/datacoral today to find out more.
Having all of your logs and event data in one place makes your life easier when something breaks, unless that something is your Elasticsearch cluster because it’s storing too much data. CHAOSSEARCH frees you from having to worry about data retention, unexpected failures, and expanding operating costs. They give you a fully managed service to search and analyze all of your logs in S3, entirely under your control, all for half the cost of running your own Elasticsearch cluster or using a hosted platform. Try it out for yourself at dataengineeringpodcast.com/chaossearch and don’t forget to thank them for supporting the show!
You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, Corinium Global Intelligence, Alluxio, and Data Council. Upcoming events include the combined events of the Data Architecture Summit and Graphorum, the Data Orchestration Summit, and Data Council in NYC. Go to dataengineeringpodcast.com/conferences to learn more about these and other events, and take advantage of our partner discounts to save money when you register today.
Your host is Tobias Macey and today I’m interviewing Karen Heaton and Mark Sherwood-Edwards about the idea of data protection, why you might need it, and how to include the principles in your data pipelines.
Interview
Introduction
How did you get involved in the area of data management?
Can you start by explaining what is encompassed by the idea of data protection?
What regulations control the enforcement of data protection requirements, and how can we determine whether we are subject to their rules?
What are some of the conflicts and constraints that act against our efforts to implement data protection?
How much of data protection is handled through technical implementation as compared to organizational policies and reporting requirements?
Can you give some examples of the types of information that are subject to data protection?
One of the challenges in data management generally is tracking the presence and usage of any given information. What are some strategies that you have found effective for auditing the usage of protected information?
A corollary to tracking and auditing of protected data in the GDPR is the need to allow for deletion of an individual’s information. How can we ensure effective deletion of these records when dealing with multiple storage systems?
What are some of the system components that are most helpful in implementing and maintaining technical and policy controls for data protection?
How do data protection regulations impact or restrict the technology choices that are viable for the data preparation layer?
Who in the organization is responsible for the proper compliance to GDPR and other data protection regimes?
Downstream from the storage and management platforms that we build as data engineers are data scientists and analysts who might request access to protected information. How do the regulations impact the types of analytics that they can use?
Contact Info
Karen
Email
Website
Mark
Email
Website
GDPR Now Podcast
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Links
Data Protection
GDPR
This Is DPO
Intellectual Property
European Convention Of Human Rights
CCPA == California Consumer Privacy Act
PII == Personally Identifiable Information
Privacy By Design
US Privacy Shield
Principle of Least Privilege
International Association of Privacy Professionals
Privacy Technology Vendor Report
Data Provenance
Chief Data Officer
UK ICO (Information Commissioner’s Office)
AI Audit Framework
Data Council
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
11/11/2019 • 51 minutes, 23 seconds
Automating Your Production Dataflows On Spark
Summary
As data engineers, the health of our pipelines is our highest priority. Unfortunately, there are countless ways that our dataflows can break or degrade that have nothing to do with the business logic or data transformations that we write and maintain. Sean Knapp founded Ascend to address the operational challenges of running a production-grade and scalable Spark infrastructure, allowing data engineers to focus on the problems that power their business. In this episode he explains the technical implementation of the Ascend platform, the challenges that he has faced in the process, and how you can use it to simplify your dataflow automation. This is a great conversation to get an understanding of all of the incidental engineering that is necessary to make your data reliable.
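The episode focuses on automating the operations around Spark dataflows rather than the transform logic itself, but for context here is a generic PySpark job of the sort such a platform would schedule and re-run on your behalf. The paths and column names are invented for illustration, and this is not Ascend's own API.

```python
# Generic PySpark batch transform; the kind of dataflow step a managed
# Spark platform would orchestrate. Paths and columns are placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("orders_rollup").getOrCreate()

orders = spark.read.parquet("s3://example-bucket/raw/orders/")  # placeholder path
daily = (
    orders
    .withColumn("order_date", F.to_date("created_at"))
    .groupBy("order_date")
    .agg(F.sum("amount").alias("revenue"), F.count("*").alias("orders"))
)
daily.write.mode("overwrite").parquet("s3://example-bucket/marts/daily_orders/")
```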
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
This week’s episode is also sponsored by Datacoral, an AWS-native, serverless data infrastructure that installs in your VPC. Datacoral helps data engineers build and manage the flow of data pipelines without having to manage any infrastructure, meaning you can spend your time on data transformations and business needs, rather than pipeline maintenance. Raghu Murthy, founder and CEO of Datacoral, built data infrastructures at Yahoo! and Facebook, scaling from terabytes to petabytes of analytic data. He started Datacoral with the goal to make SQL the universal data programming language. Visit dataengineeringpodcast.com/datacoral today to find out more.
Having all of your logs and event data in one place makes your life easier when something breaks, unless that something is your Elasticsearch cluster because it’s storing too much data. CHAOSSEARCH frees you from having to worry about data retention, unexpected failures, and expanding operating costs. They give you a fully managed service to search and analyze all of your logs in S3, entirely under your control, all for half the cost of running your own Elasticsearch cluster or using a hosted platform. Try it out for yourself at dataengineeringpodcast.com/chaossearch and don’t forget to thank them for supporting the show!
You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, Corinium Global Intelligence, Alluxio, and Data Council. Upcoming events include the combined events of the Data Architecture Summit and Graphorum, the Data Orchestration Summit, and Data Council in NYC. Go to dataengineeringpodcast.com/conferences to learn more about these and other events, and take advantage of our partner discounts to save money when you register today.
Your host is Tobias Macey and today I’m interviewing Sean Knapp about Ascend, which he is billing as an autonomous dataflow service
Interview
Introduction
How did you get involved in the area of data management?
Can you start by explaining what the Ascend platform is?
What was your inspiration for creating it and what keeps you motivated?
What was your criteria for determining the best execution substrate for the Ascend platform?
Can you describe any limitations that are imposed by your selection of Spark as the processing engine?
If you were to rewrite Spark from scratch today to fit your particular requirements, what would you change about it?
Can you describe the technical implementation of Ascend?
How has the system design evolved since you first began working on it?
What are some of the assumptions that you had at the beginning of your work on Ascend that have been challenged or updated as a result of working with the technology and your customers?
How does the programming interface for Ascend differ from that of a vanilla Spark deployment?
What are the main benefits that a data engineer would get from using Ascend in place of running their own Spark deployment?
How do you enforce the lack of side effects in the transforms that comprise the dataflow?
Can you describe the pipeline orchestration system that you have built into Ascend and the benefits that it provides to data engineers?
What are some of the most challenging aspects of building and launching Ascend that you have dealt with?
What are some of the most interesting or unexpected lessons learned or edge cases that you have encountered?
What are some of the capabilities that you are most proud of and which have gained the greatest adoption?
What are some of the sharp edges that remain in the platform?
When is Ascend the wrong choice?
What do you have planned for the future of Ascend?
Contact Info
LinkedIn
@seanknapp on Twitter
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Links
Ascend
Kubernetes
BigQuery
Apache Spark
Apache Beam
Go Language
SHA Hashes
PySpark
Delta Lake
Podcast Episode
DAG == Directed Acyclic Graph
PrestoDB
MinIO
Podcast Episode
Parquet
Snappy Compression
Tensorflow
Kafka
Druid
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
11/4/2019 • 48 minutes, 50 seconds
Build Maintainable And Testable Data Applications With Dagster
Summary
Despite the fact that businesses have relied on useful and accurate data to succeed for decades now, the state of the art for obtaining and maintaining that information still leaves much to be desired. In an effort to create a better abstraction for building data applications Nick Schrock created Dagster. In this episode he explains his motivation for creating a product for data management, how the programming model simplifies the work of building testable and maintainable pipelines, and his vision for the future of data programming. If you are building dataflows then Dagster is definitely worth exploring.
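As a flavor of the programming model discussed in the interview, here is a minimal sketch using the solids-and-pipelines API that Dagster exposed around the time of this episode; the project's API has continued to evolve, so treat this as illustrative rather than a current reference.

```python
# Minimal Dagster example from the pre-1.0 era: solids composed into a pipeline.
from dagster import execute_pipeline, pipeline, solid


@solid
def get_name(context):
    # A solid is a unit of computation with typed inputs/outputs and a context.
    return "dagster"


@solid
def hello(context, name: str):
    context.log.info(f"Hello, {name}!")


@pipeline
def hello_pipeline():
    # The pipeline body declares the dependency graph between solids.
    hello(get_name())


if __name__ == "__main__":
    execute_pipeline(hello_pipeline)
```

The strong contracts between solids mentioned later in the interview come from these declared inputs and outputs, which is also what makes individual steps testable in isolation.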
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
This week’s episode is also sponsored by Datacoral, an AWS-native, serverless data infrastructure that installs in your VPC. Datacoral helps data engineers build and manage the flow of data pipelines without having to manage any infrastructure, meaning you can spend your time on data transformations and business needs, rather than pipeline maintenance. Raghu Murthy, founder and CEO of Datacoral, built data infrastructures at Yahoo! and Facebook, scaling from terabytes to petabytes of analytic data. He started Datacoral with the goal to make SQL the universal data programming language. Visit dataengineeringpodcast.com/datacoral today to find out more.
You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, Corinium Global Intelligence, Alluxio, and Data Council. Upcoming events include the combined events of the Data Architecture Summit and Graphorum, the Data Orchestration Summit, and Data Council in NYC. Go to dataengineeringpodcast.com/conferences to learn more about these and other events, and take advantage of our partner discounts to save money when you register today.
Your host is Tobias Macey and today I’m interviewing Nick Schrock about Dagster, an open source system for building modern data applications
Interview
Introduction
How did you get involved in the area of data management?
Can you start by explaining what Dagster is and the origin story for the project?
In the tagline for Dagster you describe it as "a system for building modern data applications". There are a lot of contending terms that one might use in this context, such as ETL, data pipelines, etc. Can you describe your thinking as to what the term "data application" means, and the types of use cases that Dagster is well suited for?
Can you talk through how Dagster is architected and some of the ways that it has evolved since you first began working on it?
What do you see as the current industry trends that are leading us away from full stack frameworks such as Airflow and Oozie for ETL and into an abstracted programming environment that is composable with different execution contexts?
What are some of the initial assumptions that you had which have been challenged or updated in the process of working with users of Dagster?
For someone who wants to extend Dagster, or integrate it with other components of their data infrastructure, such as a metadata engine, what interfaces do you provide for extensibility?
For someone who wants to get started with Dagster can you describe a typical workflow for writing a data pipeline?
Once they have something working, what is involved in deploying it?
One of the things that stands out about Dagster is the strong contracts that it enforces between computation nodes, or "solids". Why do you feel that those contracts are necessary, and what benefits do they provide during the full lifecycle of a data application?
Another difficult aspect of data applications is testing, both before and after deploying it to a production environment. How does Dagster help in that regard?
It is also challenging to keep track of the entirety of a DAG for a given workflow. How does Dagit keep track of the task dependencies, and what are the limitations of that tool?
Can you give an overview of where you see Dagster fitting in the overall ecosystem of data tools?
What are some of the features or capabilities of Dagster which are often overlooked that you would like to highlight for the listeners?
Your recent release of Dagster includes a built-in scheduler, as well as a built-in deployment capability. Why did you feel that those were necessary capabilities to incorporate, rather than continuing to leave that as end-user considerations?
You have built a new company around Dagster in the form of Elementl. How are you approaching sustainability and governance of Dagster, and what is your path to sustainability for the business?
What should listeners be keeping an eye out for in the near to medium future from Elementl and Dagster?
What is on your roadmap that you consider necessary before creating a 1.0 release?
Contact Info
@schrockn on Twitter
schrockn on GitHub
LinkedIn
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
Dagster
Elementl
ETL
GraphQL
React
Matei Zaharia
DataOps Episode
Kafka
Fivetran
Podcast Episode
Spark
Supervised Learning
DevOps
Luigi
Airflow
Dask
Podcast Episode
Kubernetes
Ray
Maxime Beauchemin
Podcast Interview
Dagster Testing Guide
Great Expectations
Podcast.__init__ Interview
Papermill
Notebooks At Netflix Episode
DBT
Podcast Episode
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
10/28/2019 • 1 hour, 7 minutes, 49 seconds
Data Orchestration For Hybrid Cloud Analytics
Summary
The scale and complexity of the systems that we build to satisfy business requirements is increasing as the available tools become more sophisticated. In order to bridge the gap between legacy infrastructure and evolving use cases it is necessary to create a unifying set of components. In this episode Dipti Borkar explains how the emerging category of data orchestration tools fills this need, some of the existing projects that fit in this space, and some of the ways that they can work together to simplify projects such as cloud migration and hybrid cloud environments. It is always useful to get a broad view of new trends in the industry and this was a helpful perspective on the need to provide mechanisms to decouple physical storage from computing capacity.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
This week’s episode is also sponsored by Datacoral, an AWS-native, serverless data infrastructure that installs in your VPC. Datacoral helps data engineers build and manage the flow of data pipelines without having to manage any infrastructure, meaning you can spend your time on data transformations and business needs, rather than pipeline maintenance. Raghu Murthy, founder and CEO of Datacoral, built data infrastructures at Yahoo! and Facebook, scaling from terabytes to petabytes of analytic data. He started Datacoral with the goal to make SQL the universal data programming language. Visit dataengineeringpodcast.com/datacoral today to find out more.
You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, Corinium Global Intelligence, Alluxio, and Data Council. Upcoming events include the combined events of the Data Architecture Summit and Graphorum, the Data Orchestration Summit, and Data Council in NYC. Go to dataengineeringpodcast.com/conferences to learn more about these and other events, and take advantage of our partner discounts to save money when you register today.
Your host is Tobias Macey and today I’m interviewing Dipti Borkar about data orchestration and how it helps in migrating data workloads to the cloud
Interview
Introduction
How did you get involved in the area of data management?
Can you start by describing what you mean by the term "Data Orchestration"?
How does it compare to the concept of "Data Virtualization"?
What are some of the tools and platforms that fit under that umbrella?
What are some of the motivations for organizations to use the cloud for their data oriented workloads?
What are they giving up by using cloud resources in place of on-premises compute?
For businesses that have invested heavily in their own datacenters, what are some ways that they can begin to replicate some of the benefits of cloud environments?
What are some of the common patterns for cloud migration projects and what challenges do they present?
Do you have advice on useful metrics to track for determining project completion or success criteria?
How do businesses approach employee education for designing and implementing effective systems for achieving their migration goals?
Can you talk through some of the ways that different data orchestration tools can be composed together for a cloud migration effort?
What are some of the common pain points that organizations encounter when working on hybrid implementations?
What are some of the missing pieces in the data orchestration landscape?
Are there any efforts that you are aware of that are aiming to fill those gaps?
Where is the data orchestration market heading, and what are some industry trends that are driving it?
What projects are you most interested in or excited by?
For someone who wants to learn more about data orchestration and the benefits the technologies can provide, what are some resources that you would recommend?
Contact Info
LinkedIn
@dborkar on Twitter
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Links
Alluxio
Podcast Episode
UC San Diego
Couchbase
Presto
Podcast Episode
Spark SQL
Data Orchestration
Data Virtualization
PyTorch
Podcast.__init__ Episode
Rook storage orchestration
PySpark
MinIO
Podcast Episode
Kubernetes
Openstack
Hadoop
HDFS
Parquet Files
Podcast Episode
ORC Files
Hive Metastore
Iceberg Table Format
Podcast Episode
Data Orchestration Summit
Star Schema
Snowflake Schema
Data Warehouse
Data Lake
Teradata
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
10/22/2019 • 42 minutes, 51 seconds
Keeping Your Data Warehouse In Order With DataForm
Summary
Managing a data warehouse can be challenging, especially when trying to maintain a common set of patterns. Dataform is a platform that helps you apply engineering principles to your data transformations and table definitions, including unit testing SQL scripts, defining repeatable pipelines, and adding metadata to your warehouse to improve your team’s communication. In this episode CTO and co-founder of Dataform Lewis Hemens joins the show to explain his motivation for creating the platform and company, how it works under the covers, and how you can start using it today to get your data warehouse under control.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
This week’s episode is also sponsored by Datacoral. They provide an AWS-native, serverless data infrastructure that installs in your VPC. Datacoral helps data engineers build and manage the flow of data pipelines without having to manage any infrastructure. Datacoral’s customers report that their data engineers are able to spend 80% of their work time on data transformations, rather than pipeline maintenance. Raghu Murthy, founder and CEO of Datacoral, built data infrastructures at Yahoo! and Facebook, scaling from mere terabytes to petabytes of analytic data. He started Datacoral with the goal to make SQL the universal data programming language. Visit Datacoral.com today to find out more.
Are you working on data, analytics, or AI using platforms such as Presto, Spark, or Tensorflow? Check out the Data Orchestration Summit on November 7 at the Computer History Museum in Mountain View. This one-day conference is focused on the key data engineering challenges and solutions around building analytics and AI platforms. Attendees will hear from companies including Walmart, Netflix, Google, and DBS Bank on how they leveraged technologies such as Alluxio, Presto, Spark, and Tensorflow, and you will also hear from creators of open source projects including Alluxio, Presto, Airflow, Iceberg, and more! Use discount code PODCAST for 25% off your ticket, and the first five people to register get free tickets! Register now, as early bird tickets are ending this week! Attendees will take away learnings, swag, a free voucher to visit the museum, and a chance to win the latest iPad Pro!
You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, Corinium Global Intelligence, Alluxio, and Data Council. Upcoming events include the combined events of the Data Architecture Summit and Graphorum, the Data Orchestration Summit, and Data Council in NYC. Go to dataengineeringpodcast.com/conferences to learn more about these and other events, and take advantage of our partner discounts to save money when you register today.
Your host is Tobias Macey and today I’m interviewing Lewis Hemens about DataForm, a platform that helps analysts manage all data processes in your cloud data warehouse
Interview
Introduction
How did you get involved in the area of data management?
Can you start by explaining what DataForm is and the origin story for the platform and company?
What are the main benefits of using a tool like DataForm and who are the primary users?
Can you talk through the workflow for someone using DataForm and highlight the main features that it provides?
What are some of the challenges and mistakes that are common among engineers and analysts with regard to versioning and evolving schemas and the accompanying data?
How does CI/CD and change management manifest in the context of data warehouse management?
How is the Dataform SDK itself implemented and how has it evolved since you first began working on it?
Can you differentiate the capabilities between the open source CLI and the hosted web platform, and when you might need to use one over the other?
What was your selection process for an embedded runtime and how did you decide on JavaScript?
Can you talk through some of the use cases that having an embedded runtime enables?
What are the limitations of SQL when working in a collaborative environment?
Which database engines do you support and how do you reduce the maintenance burden for supporting different dialects and capabilities?
What is involved in adding support for a new backend?
When is DataForm the wrong choice?
What do you have planned for the future of DataForm?
Contact Info
LinkedIn
@lewishemens on Twitter
lewish on GitHub
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Links
DataForm
YCombinator
DBT == Data Build Tool
Podcast Episode
Fishtown Analytics
Typescript
Continuous Integration
Continuous Delivery
BigQuery
Snowflake DB
UDF == User Defined Function
Redshift
PostgreSQL
Podcast Episode
AWS Athena
Presto
Podcast Episode
Apache Beam
Apache Kafka
Segment
Podcast Episode
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
10/15/2019 • 47 minutes, 4 seconds
Fast Analytics On Semi-Structured And Structured Data In The Cloud
Summary
The process of exposing your data through a SQL interface has many possible pathways, each with their own complications and tradeoffs. One of the recent options is Rockset, a serverless platform for fast SQL analytics on semi-structured and structured data. In this episode CEO Venkat Venkataramani and SVP of Product Shruti Bhat explain the origins of Rockset, how it is architected to allow for fast and flexible SQL analytics on your data, and how their serverless platform can save you the time and effort of implementing portions of your own infrastructure.
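To make the idea of SQL over semi-structured documents concrete, here is a hedged sketch of submitting a query to Rockset's REST query API with the requests library. The endpoint URL, authorization header format, payload shape, and collection name are assumptions made for illustration; check the Rockset documentation for the exact contract.

```python
# Hedged sketch of querying Rockset over HTTP; endpoint, auth format, payload
# shape, and the commons.events collection are assumptions for illustration.
import requests

ROCKSET_API_KEY = "my-api-key"  # placeholder
QUERY_URL = "https://api.rs2.usw2.rockset.com/v1/orgs/self/queries"  # assumed endpoint

sql = """
    SELECT kind, COUNT(*) AS events
    FROM commons.events            -- hypothetical collection of raw JSON documents
    WHERE payload.country = 'US'   -- nested fields are queryable without a fixed schema
    GROUP BY kind
"""

response = requests.post(
    QUERY_URL,
    headers={"Authorization": f"ApiKey {ROCKSET_API_KEY}"},
    json={"sql": {"query": sql}},
)
response.raise_for_status()
for row in response.json().get("results", []):
    print(row)
```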
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
This week’s episode is also sponsored by Datacoral. They provide an AWS-native, serverless data infrastructure that installs in your VPC. Datacoral helps data engineers build and manage the flow of data pipelines without having to manage any infrastructure. Datacoral’s customers report that their data engineers are able to spend 80% of their work time on data transformations, rather than pipeline maintenance. Raghu Murthy, founder and CEO of Datacoral, built data infrastructures at Yahoo! and Facebook, scaling from mere terabytes to petabytes of analytic data. He started Datacoral with the goal to make SQL the universal data programming language. Visit Datacoral.com today to find out more.
You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, Corinium Global Intelligence, Alluxio, and Data Council. Upcoming events include the combined events of the Data Architecture Summit and Graphorum, the Data Orchestration Summit, and Data Council in NYC. Go to dataengineeringpodcast.com/conferences to learn more about these and other events, and take advantage of our partner discounts to save money when you register today.
Your host is Tobias Macey and today I’m interviewing Shruti Bhat and Venkat Venkataramani about Rockset, a serverless platform for enabling fast SQL queries across all of your data
Interview
Introduction
How did you get involved in the area of data management?
Can you start by describing what Rockset is and your motivation for creating it?
What are some of the use cases that it enables which would otherwise be impractical or intractable?
How does Rockset fit into the infrastructure and workflow of data teams and what portions of a typical stack does it replace?
Can you describe how the Rockset platform is architected and how it has evolved as you onboard more customers?
Can you describe the flow of a piece of data as it traverses the full lifecycle in Rockset?
How is your storage backend implemented to allow for speed and flexibility in the query layer?
How does it manage distribution, balancing, and durability of the data?
What are your strategies for handling node and region failure in the cloud?
You have a whitepaper describing your architecture as being oriented around microservices on Kubernetes in order to be cloud agnostic. How do you handle the case where customers have data sources that span multiple cloud providers or regions and the latency that can result?
How is the query engine structured to allow for optimizing so many different query types (e.g. search, graph, timeseries, etc.)?
With Rockset handling a large portion of the underlying infrastructure work that a data engineer might be involved with, what are some ways that you have seen them use the time that they have gained and how has that benefitted the organizations that they work for?
What are some of the most interesting/unexpected/innovative ways that you have seen Rockset used?
When is Rockset the wrong choice for a given project?
What have you found to be the most challenging and the most exciting aspects of building the Rockset platform and company?
What do you have planned for the future of Rockset?
Contact Info
Venkat
LinkedIn
@iamveeve on Twitter
veeve on GitHub
Shruti
LinkedIn
@shrutibhat on Twitter
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Links
Rockset
Blog
Oracle
VMWare
Facebook
Rube Goldberg Machine
SnowflakeDB
Protocol Buffers
Spark
Podcast Episode
Presto
Podcast Episode
Apache Kafka
RocksDB
InnoDB
Lucene
Log Structured Merge Tree (LSM)
Kubernetes
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
10/8/2019 • 54 minutes, 38 seconds
Ship Faster With An Opinionated Data Pipeline Framework
Summary
Building an end-to-end data pipeline for your machine learning projects is a complex task, made more difficult by the variety of ways that you can structure it. Kedro is a framework that provides an opinionated workflow that lets you focus on the parts that matter, so that you don’t waste time on gluing the steps together. In this episode Tom Goldenberg explains how it works, how it is being used at Quantum Black for customer projects, and how it can help you structure your own pipelines. Definitely worth a listen to gain more understanding of the benefits that a standardized process can provide.
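As a rough illustration of the opinionated structure discussed in the episode, here is a minimal Python sketch of Kedro's node and pipeline pattern. The dataset names and functions are hypothetical, and the API shown reflects Kedro as of roughly the time of this recording, so treat it as an approximation rather than a definitive example.

# Minimal sketch of the Kedro node/pipeline pattern (hypothetical dataset names).
from kedro.pipeline import Pipeline, node
from kedro.io import DataCatalog, MemoryDataSet
from kedro.runner import SequentialRunner

def clean(raw_orders):
    # Drop records that are missing an amount
    return [r for r in raw_orders if r.get("amount") is not None]

def total(clean_orders):
    # Aggregate the cleaned records into a single total
    return sum(r["amount"] for r in clean_orders)

pipeline = Pipeline([
    node(clean, inputs="raw_orders", outputs="clean_orders"),
    node(total, inputs="clean_orders", outputs="order_totals"),
])

catalog = DataCatalog({
    "raw_orders": MemoryDataSet(data=[{"amount": 10}, {"amount": None}, {"amount": 5}]),
})

# Free outputs that are not registered in the catalog are returned by the runner
print(SequentialRunner().run(pipeline, catalog))  # {'order_totals': 15}

The point is that each step declares its inputs and outputs by name, and the framework wires the steps together and manages the intermediate datasets for you.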
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, Corinium Global Intelligence, and Data Council. Upcoming events include the combined events of the Data Architecture Summit and Graphorum, Data Council in Barcelona, and the Data Orchestration Summit. Go to dataengineeringpodcast.com/conferences to learn more about these and other events, and take advantage of our partner discounts to save money when you register today.
Your host is Tobias Macey and today I’m interviewing Tom Goldenberg about Kedro, an open source development workflow tool that helps structure reproducible, scalable, deployable, robust and versioned data pipelines.
Interview
Introduction
How did you get involved in the area of data management?
Can you start by explaining what Kedro is and its origin story?
Who are the primary users of Kedro, and how does it fit into and impact the workflow of data engineers and data scientists?
Can you talk through a typical lifecycle for a project that is built using Kedro?
What are the overall features of Kedro and how do they compound to encourage best practices for data projects?
How does the culture and background of QuantumBlack influence the design and capabilities of Kedro?
What was the motivation for releasing it publicly as an open source framework?
What are some examples of ways that Kedro is being used within QuantumBlack and how has that experience informed the design and direction of the project?
Can you describe how Kedro itself is implemented and how it has evolved since you first started working on it?
There has been a recent trend away from end-to-end ETL frameworks and toward a decoupled model that focuses on a programming target with pluggable execution. What are the industry pressures that are driving that shift and what are your thoughts on how that will manifest in the long term?
How do the capabilities and focus of Kedro compare to similar projects such as Prefect and Dagster?
It has not yet reached a stable release. What are the aspects of Kedro that are still in flux and where are the changes most concentrated?
What is still missing for a stable 1.x release?
What are some of the most interesting/innovative/unexpected ways that you have seen Kedro used?
When is Kedro the wrong choice?
What do you have in store for the future of Kedro?
Contact Info
LinkedIn
@tomgoldenberg on Twitter
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Links
Kedro
GitHub
Quantum Black Labs
GitHub
Agolo
McKinsey
Airflow
Docker
Kubernetes
DataBricks
Formula 1
Kedro Viz
Dask
Podcast Interview
Py.Test
Azure Data Factory
Prefect
Podcast Interview
Dagster
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
10/1/2019 • 35 minutes, 8 seconds
Open Source Object Storage For All Of Your Data
Summary
Object storage is quickly becoming the unifying layer for data intensive applications and analytics. Modern, cloud oriented data warehouses and data lakes both rely on the durability and ease of use that it provides. S3 from Amazon has quickly become the de-facto API for interacting with this service, so the team at MinIO have built a production grade, easy to manage storage engine that replicates that interface. In this episode Anand Babu Periasamy shares the origin story for the MinIO platform, the myriad use cases that it supports, and the challenges that they have faced in replicating the functionality of S3. He also explains the technical implementation, innovative design, and broad vision for the project.
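Because MinIO exposes an S3-compatible API, existing S3 tooling can usually point at it with nothing more than an endpoint change. The snippet below is a hedged sketch using boto3 against a hypothetical local MinIO server; the endpoint, credentials, and bucket name are placeholders for illustration, not recommendations.

# Hedged sketch: reusing standard S3 tooling (boto3) against a MinIO endpoint.
import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="http://localhost:9000",   # assumed local MinIO server
    aws_access_key_id="minioadmin",          # default development credentials; change in production
    aws_secret_access_key="minioadmin",
)

s3.create_bucket(Bucket="analytics")
s3.put_object(Bucket="analytics", Key="events/2019-09-23.json", Body=b'{"event": "page_view"}')

# List what was written, exactly as you would against AWS S3
for obj in s3.list_objects_v2(Bucket="analytics").get("Contents", []):
    print(obj["Key"], obj["Size"])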
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, Corinium Global Intelligence, and Data Council. Upcoming events include the O’Reilly AI conference, the Strata Data conference, the combined events of the Data Architecture Summit and Graphorum, and Data Council in Barcelona. Go to dataengineeringpodcast.com/conferences to learn more about these and other events, and take advantage of our partner discounts to save money when you register today.
Your host is Tobias Macey and today I’m interviewing Anand Babu Periasamy about MinIO, the neutral, open source, enterprise grade object storage system.
Interview
Introduction
How did you get involved in the area of data management?
Can you explain what MinIO is and its origin story?
What are some of the main use cases that MinIO enables?
How does MinIO compare to other object storage options and what benefits does it provide over other open source platforms?
Your marketing focuses on the utility of MinIO for ML and AI workloads. What benefits does object storage provide as compared to distributed file systems? (e.g. HDFS, GlusterFS, Ceph)
What are some of the challenges that you face in terms of maintaining compatibility with the S3 interface?
What are the constraints and opportunities that are provided by adhering to that API?
Can you describe how MinIO is implemented and the overall system design?
How has that design evolved since you first began working on it?
What assumptions did you have at the outset and how have they been challenged or updated?
What are the axes for scaling that MinIO provides and how does it handle clustering?
Where does it fall on the axes of availability and consistency in the CAP theorem?
One of the useful features that you provide is efficient erasure coding, as well as protection against data corruption. How much overhead do those capabilities incur, in terms of computational efficiency and, in a clustered scenario, storage volume?
For someone who is interested in running MinIO, what is involved in deploying and maintaining an installation of it?
What are the cases where it makes sense to use MinIO in place of a cloud-native object store such as S3 or Google Cloud Storage?
How do you approach project governance and sustainability?
What are some of the most interesting/innovative/unexpected ways that you have seen MinIO used?
What do you have planned for the future of MinIO?
Contact Info
LinkedIn
@abperiasamy on Twitter
abperiasamy on GitHub
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Links
MinIO
GlusterFS
Object Storage
RedHat
Bionics
AWS S3
Ceph
Swift Stack
POSIX
HDFS
Google BigQuery
AzureML
AWS SageMaker
AWS Athena
S3 Select
Azure Blob Store
BackBlaze
Round Robin DNS
Service Mesh
Istio
Envoy
SmartStack
Free Software
RocksDB
TanTan Blog Post
Presto
SparkML
MCAdmin Trace
DTrace
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
9/23/2019 • 1 hour, 8 minutes, 19 seconds
Navigating Boundless Data Streams With The Swim Kernel
Summary
The conventional approach to analytics involves collecting large amounts of data that can be cleaned, followed by a separate step for analysis and interpretation. Unfortunately this strategy is not viable for handling real-time, real-world use cases such as traffic management or supply chain logistics. In this episode Simon Crosby, CTO of Swim Inc., explains how the SwimOS kernel and the enterprise data fabric built on top of it enable brand new use cases for instant insights. This was an eye opening conversation about how stateful computation of data streams from edge devices can reduce cost and complexity as compared to batch oriented workflows.
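To make the contrast with batch pipelines concrete, here is a toy Python sketch of the stateful, per-entity streaming pattern described in the conversation: each entity keeps its own running state and updates it as events arrive, rather than waiting for a batch job. It is illustrative only; SwimOS itself is a JVM-based web agent runtime, and none of the names below come from its actual API.

# Illustrative only: a toy, in-memory version of per-entity stateful stream processing.
from collections import defaultdict

class IntersectionAgent:
    """Keeps a running view of one traffic intersection from its sensor events."""
    def __init__(self):
        self.vehicle_count = 0
        self.last_signal = None

    def on_event(self, event):
        if event["type"] == "vehicle_detected":
            self.vehicle_count += 1
        elif event["type"] == "signal_change":
            self.last_signal = event["state"]
        # The current insight is available immediately, with no separate analysis step
        return {"vehicles": self.vehicle_count, "signal": self.last_signal}

agents = defaultdict(IntersectionAgent)

stream = [
    {"intersection": "5th_and_main", "type": "vehicle_detected"},
    {"intersection": "5th_and_main", "type": "signal_change", "state": "green"},
    {"intersection": "2nd_and_oak", "type": "vehicle_detected"},
]

for event in stream:
    print(event["intersection"], agents[event["intersection"]].on_event(event))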
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
Listen, I’m sure you work for a ‘data driven’ company – who doesn’t these days? Does your company use Amazon Redshift? Have you ever groaned over slow queries or are just afraid that Amazon Redshift is gonna fall over at some point? Well, you’ve got to talk to the folks over at intermix.io. They have built the “missing” Amazon Redshift console – it’s an amazing analytics product for data engineers to find and re-write slow queries and gives actionable recommendations to optimize data pipelines. WeWork, Postmates, and Medium are just a few of their customers. Go to dataengineeringpodcast.com/intermix today and use promo code DEP at sign up to get a $50 discount!
You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, Corinium Global Intelligence, and Data Council. Upcoming events include the O’Reilly AI conference, the Strata Data conference, the combined events of the Data Architecture Summit and Graphorum, and Data Council in Barcelona. Go to dataengineeringpodcast.com/conferences to learn more about these and other events, and take advantage of our partner discounts to save money when you register today.
Your host is Tobias Macey and today I’m interviewing Simon Crosby about Swim.ai, a data fabric for the distributed enterprise
Interview
Introduction
How did you get involved in the area of data management?
Can you start by explaining what Swim.ai is and how the project and business got started?
Can you explain the differentiating factors between the SwimOS and Data Fabric platforms that you offer?
What are some of the use cases that are enabled by the Swim platform that would otherwise be impractical or intractable?
How does Swim help alleviate the challenges of working with sensor oriented applications or edge computing platforms?
Can you describe a typical design for an application or system being built on top of the Swim platform?
What does the developer workflow look like?
What kind of tooling do you have for diagnosing and debugging errors in an application built on top of Swim?
Can you describe the internal design for the SwimOS and how it has evolved since you first began working on it?
For such widely distributed applications, efficient discovery and communication is essential. How does Swim handle that functionality?
What mechanisms are in place to account for network failures?
Since the application nodes are explicitly stateful, how do you handle scaling as compared to a stateless web application?
Since there is no explicit data layer, how is data redundancy handled by Swim applications?
What are some of the most interesting/unexpected/innovative ways that you have seen the Swim technology used?
What have you found to be the most challenging aspects of building the Swim platform?
What are some of the assumptions that you had going into the creation of SwimOS and how have they been challenged or updated?
What do you have planned for the future of the technical and business aspects of Swim.ai?
Contact Info
LinkedIn
Wikipedia
@simoncrosby on Twitter
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Links
Swim.ai
Hadoop
Streaming Data
Apache Flink
Podcast Episode
Apache Kafka
Wallaroo
Podcast Episode
Digital Twin
Swim Concepts Documentation
RFID == Radio Frequency IDentification
PCB == Printed Circuit Board
Graal VM
Azure IoT Edge Framework
Azure DLS (Data Lake Storage)
Power BI
WARP Protocol
LightBend
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
9/18/2019 • 57 minutes, 55 seconds
Building A Reliable And Performant Router For Observability Data
Summary
The first stage in every data project is collecting information and routing it to a storage system for later analysis. For operational data this typically means collecting log messages and system metrics. Often a different tool is used for each class of data, increasing the overall complexity and number of moving parts. The engineers at Timber.io decided to build a new tool in the form of Vector that allows for processing both of these data types in a single framework that is reliable and performant. In this episode Ben Johnson and Luke Steensen explain how the project got started, how it compares to other tools in this space, and how you can get involved in making it even better.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, Corinium Global Intelligence, and Data Council. Upcoming events include the O’Reilly AI conference, the Strata Data conference, the combined events of the Data Architecture Summit and Graphorum, and Data Council in Barcelona. Go to dataengineeringpodcast.com/conferences to learn more about these and other events, and take advantage of our partner discounts to save money when you register today.
Your host is Tobias Macey and today I’m interviewing Ben Johnson and Luke Steensen about Vector, a high-performance, open-source observability data router
Interview
Introduction
How did you get involved in the area of data management?
Can you start by explaining what the Vector project is and your reason for creating it?
What are some of the comparable tools that are available and what were they lacking that prompted you to start a new project?
What strategy are you using for project governance and sustainability?
What are the main use cases that Vector enables?
Can you explain how Vector is implemented and how the system design has evolved since you began working on it?
How did your experience building the business and products for Timber influence and inform your work on Vector?
When you were planning the implementation, what were your criteria for the runtime implementation and why did you decide to use Rust?
What led you to choose Lua as the embedded scripting environment?
What data format does Vector use internally?
Is there any support for defining and enforcing schemas?
In the event of a malformed message is there any capacity for a dead letter queue?
What are some strategies for formatting source data to improve the effectiveness of the information that is gathered and the ability of Vector to parse it into useful data?
When designing an event flow in Vector what are the available mechanisms for testing the overall delivery and any transformations?
What options are available to operators to support visibility into the running system?
In terms of deployment topologies, what capabilities does Vector have to support high availability and/or data redundancy?
What are some of the other considerations that operators and administrators of Vector should be considering?
You have a fairly well defined roadmap for the different point versions of Vector. How did you determine what the priority ordering was and how quickly are you progressing on your roadmap?
What is the available interface for adding and extending the capabilities of Vector? (source/transform/sink)
What are some of the most interesting/innovative/unexpected ways that you have seen Vector used?
What are some of the challenges that you have faced in building/publicizing Vector?
For someone who is interested in using Vector, how would you characterize the overall maturity of the project currently?
What is missing that you would consider necessary for production readiness?
When is Vector the wrong choice?
Contact Info
Ben
@binarylogic on Twitter
binarylogic on GitHub
Luke
LinkedIn
@lukesteensen on Twitter
lukesteensen on GitHub
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Links
Vector
GitHub
Timber.io
Observability
SeatGeek
Apache Kafka
StatsD
FluentD
Splunk
Filebeat
Logstash
Fluent Bit
Rust
Tokio Rust library
TOML
Lua
Nginx
HAProxy
Web Assembly (WASM)
Protocol Buffers
Jepsen
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
9/10/2019 • 55 minutes, 19 seconds
Building A Community For Data Professionals at Data Council
Summary
Data professionals are working in a domain that is rapidly evolving. In order to stay current we need access to deeply technical presentations that aren’t burdened by extraneous marketing. To fulfill that need Pete Soderling and his team have been running the Data Council series of conferences and meetups around the world. In this episode Pete discusses his motivation for starting these events, how they serve to bring the data community together, and the observations that he has made about the direction that we are moving. He also shares his experiences as an investor in developer oriented startups and his views on the importance of empowering engineers to launch their own companies.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
Listen, I’m sure you work for a ‘data driven’ company – who doesn’t these days? Does your company use Amazon Redshift? Have you ever groaned over slow queries or are just afraid that Amazon Redshift is gonna fall over at some point? Well, you’ve got to talk to the folks over at intermix.io. They have built the “missing” Amazon Redshift console – it’s an amazing analytics product for data engineers to find and re-write slow queries and gives actionable recommendations to optimize data pipelines. WeWork, Postmates, and Medium are just a few of their customers. Go to dataengineeringpodcast.com/intermix today and use promo code DEP at sign up to get a $50 discount!
You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, Corinium Global Intelligence, and Data Council. Upcoming events include the O’Reilly AI conference, the Strata Data conference, the combined events of the Data Architecture Summit and Graphorum, and Data Council in Barcelona. Go to dataengineeringpodcast.com/conferences to learn more about these and other events, and take advantage of our partner discounts to save money when you register today.
Your host is Tobias Macey and today I’m interviewing Pete Soderling about his work to build and grow a community for data professionals with the Data Council conferences and meetups, as well as his experiences as an investor in data oriented companies
Interview
Introduction
How did you get involved in the area of data management?
What was your original reason for focusing your efforts on fostering a community of data engineers?
What was the state of recognition in the industry for that role at the time that you began your efforts?
The current manifestation of your community efforts is in the form of the Data Council conferences and meetups. Previously they were known as Data Eng Conf and before that was Hakka Labs. Can you discuss the evolution of your efforts to grow this community?
How has the community itself changed and grown over the past few years?
Communities form around a huge variety of focal points. What are some of the complexities or challenges in building one based on something as nebulous as data?
Where do you draw inspiration and direction for how to manage such a large and distributed community?
What are some of the most interesting/challenging/unexpected aspects of community management that you have encountered?
What are some ways that you have been surprised or delighted in your interactions with the data community?
How do you approach sustainability of the Data Council community and the organization itself?
The tagline that you have focused on for Data Council events is that they are no fluff, juxtaposing them against larger business oriented events. What are your guidelines for fulfilling that promise and why do you think that is an important distinction?
In addition to your community building you are also an investor. How did you get involved in that side of your business and how does it fit into your overall mission?
You also have a stated mission to help engineers build their own companies. In your opinion, how does an engineer led business differ from one that may be founded or run by a business oriented individual and why do you think that we need more of them?
What are the ways that you typically work to empower engineering founders or encourage them to create their own businesses?
What are some of the challenges that engineering founders face and what are some common difficulties or misunderstandings related to business?
What are your opinions on venture-backed vs. "lifestyle" or bootstrapped businesses?
What are the characteristics of a data business that you look at when evaluating a potential investment?
What are some of the current industry trends that you are most excited by?
What are some that you find concerning?
What are your goals and plans for the future of Data Council?
Contact Info
@petesoder on Twitter
LinkedIn
@petesoder on Medium
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Links
Data Council
Database Design For Mere Mortals
Bloomberg
Garmin
500 Startups
Geeks On A Plane
Data Council NYC 2019 Track Summary
Pete’s Angel List Syndicate
DataOps
Data Kitchen Episode
DataOps Vs DevOps Episode
Great Expectations
Podcast.__init__ Interview
Elementl
Dagster
Data Council Presentation
Data Council Call For Proposals
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
9/2/2019 • 52 minutes, 46 seconds
Building Tools And Platforms For Data Analytics
Summary
Data engineers are responsible for building tools and platforms to power the workflows of other members of the business. Each group of users has their own set of requirements for the way that they access and interact with those platforms depending on the insights they are trying to gather. Benn Stancil is the chief analyst at Mode Analytics and in this episode he explains the set of considerations and requirements that data analysts need in their tools. He also explains useful patterns for collaboration between data engineers and data analysts, and what they can learn from each other.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, Corinium Global Intelligence, and Data Council. Upcoming events include the O’Reilly AI conference, the Strata Data conference, the combined events of the Data Architecture Summit and Graphorum, and Data Council in Barcelona. Go to dataengineeringpodcast.com/conferences to learn more about these and other events, and take advantage of our partner discounts to save money when you register today.
Your host is Tobias Macey and today I’m interviewing Benn Stancil, chief analyst at Mode Analytics, about what data engineers need to know when building tools for analysts
Interview
Introduction
How did you get involved in the area of data management?
Can you start by describing some of the main features that you are looking for in the tools that you use?
What are some of the common shortcomings that you have found in out-of-the-box tools that organizations use to build their data stack?
What should data engineers be considering as they design and implement the foundational data platforms that higher order systems are built on, which are ultimately used by analysts and data scientists?
In terms of mindset, what are the ways that data engineers and analysts can align and where are the points of conflict?
In terms of team and organizational structure, what have you found to be useful patterns for reducing friction in the product lifecycle for data tools (internal or external)?
What are some anti-patterns that data engineers can guard against as they are designing their pipelines?
In your experience as an analyst, what have been the characteristics of the most seamless projects that you have been involved with?
How much understanding of analytics is necessary for data engineers to be successful in their projects and careers?
Conversely, how much understanding of data management should analysts have?
What are the industry trends that you are most excited by as an analyst?
Contact Info
LinkedIn
@bennstancil on Twitter
Website
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Closing Announcements
Thank you for listening! Don’t forget to check out our other show, Podcast.__init__ to learn about the Python language, its community, and the innovative ways it is being used.
Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes.
If you’ve learned something or tried out a project from the show then tell us about it! Email hosts@dataengineeringpodcast.com with your story.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Links
Mode Analytics
Data Council Presentation
Yammer
StitchFix Blog Post
SnowflakeDB
Re:Dash
Superset
Marquez
Amundsen
Podcast Episode
Elementl
Dagster
Data Council Presentation
DBT
Podcast Episode
Great Expectations
Podcast.__init__ Episode
Delta Lake
Podcast Episode
Stitch
Fivetran
Podcast Episode
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
8/26/2019 • 48 minutes, 6 seconds
A High Performance Platform For The Full Big Data Lifecycle
Summary
Managing big data projects at scale is a perennial problem, with a wide variety of solutions that have evolved over the past 20 years. One of the early entrants that predates Hadoop and has since been open sourced is the HPCC (High Performance Computing Cluster) system. Designed as a fully integrated platform to meet the needs of enterprise grade analytics, it provides a solution for the full lifecycle of data at massive scale. In this episode Flavio Villanustre, VP of infrastructure and products at HPCC Systems, shares the history of the platform, how it is architected for scale and speed, and the unique solutions that it provides for enterprise grade data analytics. He also discusses the motivations for open sourcing the platform, the detailed workflow that it enables, and how you can try it for your own projects. This was an interesting view of how a well engineered product can survive massive evolutionary shifts in the industry while remaining relevant and useful.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
To connect with the startups that are shaping the future and take advantage of the opportunities that they provide, check out Angel List where you can invest in innovative business, find a job, or post a position of your own. Sign up today at dataengineeringpodcast.com/angel and help support this show.
You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, Corinium Global Intelligence, and Data Council. Upcoming events include the O’Reilly AI conference, the Strata Data conference, the combined events of the Data Architecture Summit and Graphorum, and Data Council in Barcelona. Go to dataengineeringpodcast.com/conferences to learn more about these and other events, and take advantage of our partner discounts to save money when you register today.
Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Your host is Tobias Macey and today I’m interviewing Flavio Villanustre about the HPCC Systems project and his work at LexisNexis Risk Solutions
Interview
Introduction
How did you get involved in the area of data management?
Can you start by describing what the HPCC system is and the problems that you were facing at LexisNexis Risk Solutions which led to its creation?
What was the overall state of the data landscape at the time and what was the motivation for releasing it as open source?
Can you describe the high level architecture of the HPCC Systems platform and some of the ways that the design has changed over the years that it has been maintained?
Given how long the project has been in use, can you talk about some of the ways that it has had to evolve to accommodate changing trends in usage and technologies for big data and advanced analytics?
For someone who is using HPCC Systems, can you talk through a common workflow and the ways that the data traverses the various components?
How does HPCC Systems manage persistence and scalability?
What are the integration points available for extending and enhancing the HPCC Systems platform?
What is involved in deploying and managing a production installation of HPCC Systems?
The ECL language is an intriguing element of the overall system. What are some of the features that it provides which simplify processing and management of data?
How does the Thor engine manage data transformation and manipulation?
What are some of the unique features of Thor and how does it compare to other approaches for ETL and data integration?
For extraction and analysis of data can you talk through the capabilities of the Roxie engine?
How are you using the HPCC Systems platform in your work at LexisNexis?
Despite being older than the Hadoop platform it doesn’t seem that HPCC Systems has seen the same level of growth and popularity. Can you share your perspective on the community for HPCC Systems and how it compares to that of Hadoop over the past decade?
How is the HPCC Systems project governed, and what is your approach to sustainability?
What are some of the additional capabilities that are only available in the enterprise distribution?
When is the HPCC Systems platform the wrong choice, and what are some systems that you might use instead?
What have been some of the most interesting/unexpected/novel ways that you have seen HPCC Systems used?
What are some of the challenges that you have faced and lessons that you have learned while building and maintaining the HPCC Systems platform and community?
What do you have planned for the future of HPCC Systems?
Contact Info
LinkedIn
@fvillanustre on Twitter
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
HPCC Systems
LexisNexis Risk Solutions
Risk Management
Hadoop
MapReduce
Sybase
Oracle DB
AbInitio
Data Lake
SQL
ECL
DataFlow
TensorFlow
ECL IDE
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
8/19/2019 • 1 hour, 13 minutes, 45 seconds
Digging Into Data Replication At Fivetran
Summary
The extract and load pattern of data replication is the most commonly needed process in data engineering workflows. Because of the myriad sources and destinations that are available, it is also among the most difficult tasks that we encounter. Fivetran is a platform that does the hard work for you and replicates information from your source systems into whichever data warehouse you use. In this episode CEO and co-founder George Fraser explains how it is built, how it got started, and the challenges that creep in at the edges when dealing with so many disparate systems that need to be made to work together. This is a great conversation to listen to for a better understanding of the challenges inherent in synchronizing your data.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, and Corinium Global Intelligence. Upcoming events include the O’Reilly AI Conference, the Strata Data Conference, and the combined events of the Data Architecture Summit and Graphorum. Go to dataengineeringpodcast.com/conferences to learn more and take advantage of our partner discounts when you register.
Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Your host is Tobias Macey and today I’m interviewing George Fraser about FiveTran, a hosted platform for replicating your data from source to destination
Interview
Introduction
How did you get involved in the area of data management?
Can you start by describing the problem that Fivetran solves and the story of how it got started?
Integration of multiple data sources (e.g. entity resolution)
How is Fivetran architected and how has the overall system design changed since you first began working on it?
Monitoring and alerting
Automated schema normalization. How does it work for customized data sources?
Managing schema drift while avoiding data loss
Change data capture
What have you found to be the most complex or challenging data sources to work with reliably?
Workflow for users getting started with Fivetran
When is Fivetran the wrong choice for collecting and analyzing your data?
What have you found to be the most challenging aspects of working in the space of data integrations?
What have been the most interesting/unexpected/useful lessons that you have learned while building and growing Fivetran?
What do you have planned for the future of Fivetran?
Contact Info
LinkedIn
@frasergeorgew on Twitter
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
Fivetran
Ralph Kimball
DBT (Data Build Tool)
Podcast Interview
Looker
Podcast Interview
Cron
Kubernetes
Postgres
Podcast Episode
Oracle DB
Salesforce
Netsuite
Marketo
Jira
Asana
Cloudwatch
Stackdriver
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
8/12/2019 • 44 minutes, 40 seconds
Solving Data Discovery At Lyft
Summary
Data is only valuable if you use it for something, and the first step is knowing that it is available. As organizations grow and data sources proliferate it becomes difficult to keep track of everything, particularly for analysts and data scientists who are not involved with the collection and management of that information. Lyft has built the Amundsen platform to address the problem of data discovery and in this episode Tao Feng and Mark Grover explain how it works, why they built it, and how it has impacted the workflow of data professionals in their organization. If you are struggling to realize the value of your information because you don’t know what you have or where it is then give this a listen and then try out Amundsen for yourself.
Announcements
Welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
Finding the data that you need is tricky, and Amundsen will help you solve that problem. And as your data grows in volume and complexity, there are foundational principles that you can follow to keep data workflows streamlined. Mode – the advanced analytics platform that Lyft trusts – has compiled 3 reasons to rethink data discovery. Read them at dataengineeringpodcast.com/mode-lyft.
You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, the Open Data Science Conference, and Corinium Intelligence. Upcoming events include the O’Reilly AI Conference, the Strata Data Conference, and the combined events of the Data Architecture Summit and Graphorum. Go to dataengineeringpodcast.com/conferences to learn more and take advantage of our partner discounts when you register.
Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Your host is Tobias Macey and today I’m interviewing Mark Grover and Tao Feng about Amundsen, the data discovery platform and metadata engine that powers self service data access at Lyft
Interview
Introduction
How did you get involved in the area of data management?
Can you start by explaining what Amundsen is and the problems that it was designed to address?
What was lacking in the existing projects at the time that led you to building a new platform from the ground up?
How does Amundsen fit in the larger ecosystem of data tools?
How does it compare to what WeWork is building with Marquez?
Can you describe the overall architecture of Amundsen and how it has evolved since you began working on it?
What were the main assumptions that you had going into this project and how have they been challenged or updated in the process of building and using it?
What has been the impact of Amundsen on the workflows of data teams at Lyft?
Can you talk through an example workflow for someone using Amundsen?
Once a dataset has been located, how does Amundsen simplify the process of accessing that data for analysis or further processing?
How does the information in Amundsen get populated and what is the process for keeping it up to date?
What was your motivation for releasing it as open source and how much effort was involved in cleaning up the code for the public?
What are some of the capabilities that you have intentionally decided not to implement yet?
For someone who wants to run their own instance of Amundsen what is involved in getting it deployed and integrated?
What have you found to be the most challenging aspects of building, using and maintaining Amundsen?
What do you have planned for the future of Amundsen?
Contact Info
Tao
LinkedIn
feng-tao on GitHub
Mark
LinkedIn
Website
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
Amundsen
Data Council Presentation
Strata Presentation
Blog Post
Lyft
Airflow
Podcast.__init__ Episode
LinkedIn
Slack
Marquez
S3
Hive
Presto
Podcast Episode
Spark
PostgreSQL
Google BigQuery
Neo4J
Apache Atlas
Tableau
Superset
Alation
Cloudera Navigator
DynamoDB
MongoDB
Druid
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
8/5/2019 • 51 minutes, 48 seconds
Simplifying Data Integration Through Eventual Connectivity
Summary
The ETL pattern that has become commonplace for integrating data from multiple sources has proven useful, but complex to maintain. For a small number of sources it is a tractable problem, but as the overall complexity of the data ecosystem continues to expand it may be time to identify new ways to tame the deluge of information. In this episode Tim Ward, CEO of CluedIn, explains the idea of eventual connectivity as a new paradigm for data integration. Rather than manually defining all of the mappings ahead of time, we can rely on the power of graph databases and some strategic metadata to allow connections to occur as the data becomes available. If you are struggling to maintain a tangle of data pipelines then you might find some new ideas for reducing your workload.
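As a rough sketch of the idea, the snippet below models records from different systems as nodes in a graph and links them through whatever identifiers they happen to share, so related records cluster together regardless of the order in which they arrive. The record contents, sources, and identifier names are invented for illustration, and this is a generic graph sketch rather than CluedIn's actual implementation.

# Hedged sketch of "eventual connectivity": link records by shared identifiers as they arrive.
import networkx as nx

graph = nx.Graph()

def ingest(source, record_id, identifiers):
    """Add a record node and connect it to the identifier nodes it carries."""
    record = f"{source}:{record_id}"
    graph.add_node(record, kind="record", source=source)
    for id_type, value in identifiers.items():
        key = f"{id_type}:{value}"
        graph.add_node(key, kind="identifier")
        graph.add_edge(record, key)

# Records can arrive in any order, from any system, with no mappings defined up front
ingest("crm", "42", {"email": "jane@example.com"})
ingest("billing", "A-7", {"email": "jane@example.com", "account": "0042"})
ingest("support", "T-99", {"account": "0042"})

# Records connected through any chain of shared identifiers end up in one component
for component in nx.connected_components(graph):
    records = sorted(n for n in component if graph.nodes[n]["kind"] == "record")
    print(records)  # ['billing:A-7', 'crm:42', 'support:T-99']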
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
To connect with the startups that are shaping the future and take advantage of the opportunities that they provide, check out Angel List where you can invest in innovative business, find a job, or post a position of your own. Sign up today at dataengineeringpodcast.com/angel and help support this show.
You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, and the Open Data Science Conference. Upcoming events include the O’Reilly AI Conference, the Strata Data Conference, and the combined events of the Data Architecture Summit and Graphorum. Go to dataengineeringpodcast.com/conferences to learn more and take advantage of our partner discounts when you register.
Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Your host is Tobias Macey and today I’m interviewing Tim Ward about his thoughts on eventual connectivity as a new pattern to replace traditional ETL
Interview
Introduction
How did you get involved in the area of data management?
Can you start by discussing the challenges and shortcomings that you perceive in the existing practices of ETL?
What is eventual connectivity and how does it address the problems with ETL in the current data landscape?
In your white paper you mention the benefits of graph technology and how it solves the problem of data integration. Can you talk through an example use case?
How do different implementations of graph databases impact their viability for this use case?
Can you talk through the overall system architecture and data flow for an example implementation of eventual connectivity?
How much up-front modeling is necessary to make this a viable approach to data integration?
How do the volume and format of the source data impact the technology and architecture decisions that you would make?
What are the limitations or edge cases that you have found when using this pattern?
In modern ETL architectures there has been a lot of time and work put into workflow management systems for orchestrating data flows. Is there still a place for those tools when using the eventual connectivity pattern?
What resources do you recommend for someone who wants to learn more about this approach and start using it in their organization?
Contact Info
Email
LinkedIn
@jerrong on Twitter
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
Eventual Connectivity White Paper
CluedIn
Podcast Episode
Copenhagen
Ewok
Multivariate Testing
CRM
ERP
ETL
ELT
DAG
Graph Database
Apache NiFi
Podcast Episode
Apache Airflow
Podcast.__init__ Episode
BigQuery
RedShift
CosmosDB
SAP HANA
IOT == Internet of Things
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
7/29/2019 • 53 minutes, 47 seconds
Straining Your Data Lake Through A Data Mesh
Summary
The current trend in data management is to centralize the responsibilities of storing and curating the organization’s information to a data engineering team. This organizational pattern is reinforced by the architectural pattern of data lakes as a solution for managing storage and access. In this episode Zhamak Dehghani shares an alternative approach in the form of a data mesh. Rather than connecting all of your data flows to one destination, empower your individual business units to create data products that can be consumed by other teams. This was an interesting exploration of a different way to think about the relationship between how your data is produced, how it is used, and how to build a technical platform that supports the organizational needs of your business.
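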
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
And to grow your professional network and find opportunities with the startups that are changing the world, Angel List is the place to go. Go to dataengineeringpodcast.com/angel to sign up today.
You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, and the Open Data Science Conference. Upcoming events include the O’Reilly AI Conference, the Strata Data Conference, and the combined events of the Data Architecture Summit and Graphorum. Go to dataengineeringpodcast.com/conferences to learn more and take advantage of our partner discounts when you register.
Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Your host is Tobias Macey and today I’m interviewing Zhamak Dehghani about building a distributed data mesh for a domain oriented approach to data management
Interview
Introduction
How did you get involved in the area of data management?
Can you start by providing your definition of a "data lake" and discussing some of the problems and challenges that they pose?
What are some of the organizational and industry trends that tend to lead to this solution?
You have written a detailed post outlining the concept of a "data mesh" as an alternative to data lakes. Can you give a summary of what you mean by that phrase?
In a domain oriented data model, what are some useful methods for determining appropriate boundaries for the various data products?
What are some of the challenges that arise in this data mesh approach and how do they compare to those of a data lake?
One of the primary complications of any data platform, whether distributed or monolithic, is that of discoverability. How do you approach that in a data mesh scenario?
A corollary to the issue of discovery is that of access and governance. What are some strategies to making that scalable and maintainable across different data products within an organization?
Who is responsible for implementing and enforcing compliance regimes?
One of the intended benefits of data lakes is the idea that data integration becomes easier by having everything in one place. What has been your experience in that regard?
How do you approach the challenge of data integration in a domain oriented approach, particularly as it applies to aspects such as data freshness, semantic consistency, and schema evolution?
Has latency of data retrieval proven to be an issue in your work?
When it comes to the actual implementation of a data mesh, can you describe the technical and organizational approach that you recommend?
How do team structures and dynamics shift in this scenario?
What are the necessary skills for each team?
Who is responsible for the overall lifecycle of the data in each domain, including modeling considerations and application design for how the source data is generated and captured?
Is there a general scale of organization or problem domain where this approach would generate too much overhead and maintenance burden?
For an organization that has an existing monolithic architecture, how do you suggest they approach decomposing their data into separately managed domains?
Are there any other architectural considerations that data professionals should be considering that aren’t yet widespread?
Contact Info
LinkedIn
@zhamakd on Twitter
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
How to Move Beyond a Monolithic Data Lake to a Distributed Data Mesh
Thoughtworks
Technology Radar
Data Lake
Data Warehouse
James Dixon
Azure Data Lake
"Big Ball Of Mud" Anti-Pattern
ETL
ELT
Hadoop
Spark
Kafka
Event Sourcing
Airflow
Podcast.__init__ Episode
Data Engineering Episode
Data Catalog
Master Data Management
Podcast Episode
Polyseme
REST
CNCF (Cloud Native Computing Foundation)
Cloud Events Standard
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
7/22/2019 • 1 hour, 4 minutes, 27 seconds
Data Labeling That You Can Feel Good About With CloudFactory
Summary
Successful machine learning and artificial intelligence projects require large volumes of data that are properly labelled. The challenge is that most data is not clean and well annotated, requiring a scalable data labeling process. Ideally this process can be done using the tools and systems that already power your analytics, rather than sending data into a black box. In this episode Mark Sears, CEO of CloudFactory, explains how he and his team built a platform that provides valuable service to businesses and meaningful work to developing nations. He shares the lessons learned in the early years of growing the business, the strategies that have allowed them to scale and train their workforce, and the benefits of working within their customers’ existing platforms. He also shares some valuable insights into the current state of the art for machine learning in the real world.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
Integrating data across the enterprise has been around for decades – so have the techniques to do it. But, a new way of integrating data and improving streams has evolved. By integrating each silo independently – data is able to integrate without any direct relation. At CluedIn they call it “eventual connectivity”. If you want to learn more on how to deliver fast access to your data across the enterprise leveraging this new method, and the technologies that make it possible, get a demo or presentation of the CluedIn Data Hub by visiting dataengineeringpodcast.com/cluedin. And don’t forget to thank them for supporting the show!
You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, and the Open Data Science Conference. Coming up this fall are the combined events of Graphorum and the Data Architecture Summit. The agendas have been announced and super early bird registration for up to $300 off is available until July 26th, with early bird pricing for up to $200 off through August 30th. Use the code BNLLC to get an additional 10% off any pass when you register. Go to dataengineeringpodcast.com/conferences to learn more and take advantage of our partner discounts when you register.
Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Your host is Tobias Macey and today I’m interviewing Mark Sears about CloudFactory, masters of the art and science of labeling data for Machine Learning and more
Interview
Introduction
How did you get involved in the area of data management?
Can you start by explaining what CloudFactory is and the story behind it?
What are some of the common requirements for feature extraction and data labelling that your customers contact you for?
What integration points do you provide to your customers and what is your strategy for ensuring broad compatibility with their existing tools and workflows?
Can you describe the workflow for a sample request from a customer, how that fans out to your cloud workers, and the interface or platform that they are working with to deliver the labelled data?
What protocols do you have in place to ensure data quality and identify potential sources of bias?
What role do humans play in the lifecycle for AI and ML projects?
I understand that you provide skills development and community building for your cloud workers. Can you talk through your relationship with those employees and how that relates to your business goals?
How do you manage and plan for elasticity in customer needs given the workforce requirements that you are dealing with?
Can you share some stories of cloud workers who have benefited from their experience working with your company?
What are some of the assumptions that you made early in the founding of your business which have been challenged or updated in the process of building and scaling CloudFactory?
What have been some of the most interesting/unexpected ways that you have seen customers using your platform?
What lessons have you learned in the process of building and growing CloudFactory that were most interesting/unexpected/useful?
What are your thoughts on the future of work as AI and other digital technologies continue to disrupt existing industries and jobs?
How does that tie into your plans for CloudFactory in the medium to long term?
Contact Info
@marktsears on Twitter
LinkedIn
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
CloudFactory
Reading, UK
Nepal
Kenya
Ruby on Rails
Kathmandu
Natural Language Processing (NLP)
Computer Vision
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
7/15/2019 • 57 minutes, 50 seconds
Scale Your Analytics On The Clickhouse Data Warehouse
Summary
The market for data warehouse platforms is large and varied, with options for every use case. ClickHouse is an open source, column-oriented database engine built for interactive analytics with linear scalability. In this episode Robert Hodges and Alexander Zaitsev explain how it is architected to provide these features, the various unique capabilities that it provides, and how to run it in production. It was interesting to learn about some of the custom data types and performance optimizations that are included.
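For a feel of the interactive analytics workflow discussed in the episode, here is a minimal sketch that assumes a ClickHouse server running on localhost and the clickhouse-driver Python package; the table and data are invented for illustration.

```python
# A small sketch of the kind of interactive aggregation ClickHouse is built
# for, assuming a server on localhost and the clickhouse-driver package.
# The table and column names are invented.
from datetime import date
from clickhouse_driver import Client

client = Client("localhost")

# MergeTree is the workhorse ClickHouse engine; the ORDER BY clause defines
# the sparse primary index used to skip data at query time.
client.execute("""
    CREATE TABLE IF NOT EXISTS page_views (
        event_date  Date,
        site_id     UInt32,
        user_id     UInt64,
        duration_ms UInt32
    ) ENGINE = MergeTree()
    ORDER BY (site_id, event_date)
""")

client.execute(
    "INSERT INTO page_views (event_date, site_id, user_id, duration_ms) VALUES",
    [(date(2019, 7, 1), 1, 42, 1300), (date(2019, 7, 1), 1, 43, 870)],
)

# Column-oriented storage means this aggregation only reads the columns it
# touches, which is what keeps scans over large tables interactive.
rows = client.execute("""
    SELECT site_id, count() AS views, avg(duration_ms) AS avg_duration
    FROM page_views
    GROUP BY site_id
""")
print(rows)
```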
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
Integrating data across the enterprise has been around for decades – so have the techniques to do it. But, a new way of integrating data and improving streams has evolved. By integrating each silo independently – data is able to integrate without any direct relation. At CluedIn they call it “eventual connectivity”. If you want to learn more on how to deliver fast access to your data across the enterprise leveraging this new method, and the technologies that make it possible, get a demo or presentation of the CluedIn Data Hub by visiting dataengineeringpodcast.com/cluedin. And don’t forget to thank them for supporting the show!
You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, and the Open Data Science Conference. Coming up this fall are the combined events of Graphorum and the Data Architecture Summit. The agendas have been announced and super early bird registration for up to $300 off is available until July 26th, with early bird pricing for up to $200 off through August 30th. Use the code BNLLC to get an additional 10% off any pass when you register. Go to dataengineeringpodcast.com/conferences to learn more and take advantage of our partner discounts when you register.
Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Your host is Tobias Macey and today I’m interviewing Robert Hodges and Alexander Zaitsev about Clickhouse, an open source, column-oriented database for fast and scalable OLAP queries
Interview
Introduction
How did you get involved in the area of data management?
Can you start by explaining what Clickhouse is and how you each got involved with it?
What are the primary use cases that Clickhouse is targeting?
Where does it fit in the database market and how does it compare to other column stores, both open source and commercial?
Can you describe how Clickhouse is architected?
Can you talk through the lifecycle of a given record or set of records from when they first get inserted into Clickhouse, through the engine and storage layer, and then the lookup process at query time?
I noticed that Clickhouse has a feature for implementing data safeguards (deletion protection, etc.). Can you talk through how that factors into different use cases for Clickhouse?
Aside from directly inserting a record via the client APIs can you talk through the options for loading data into Clickhouse?
For the MySQL/Postgres replication functionality how do you maintain schema evolution from the source DB to Clickhouse?
What are some of the advanced capabilities, such as SQL extensions, supported data types, etc. that are unique to Clickhouse?
For someone getting started with Clickhouse can you describe how they should be thinking about data modeling?
Recent entrants to the data warehouse market are encouraging users to insert raw, unprocessed records and then do their transformations with the database engine, as opposed to using a data lake as the staging ground for transformations prior to loading into the warehouse. Where does Clickhouse fall along that spectrum?
How is scaling in Clickhouse implemented and what are the edge cases that users should be aware of?
How is data replication and consistency managed?
What is involved in deploying and maintaining an installation of Clickhouse?
I noticed that Altinity is providing a Kubernetes operator for Clickhouse. What are the opportunities and tradeoffs presented by that platform for Clickhouse?
What are some of the most interesting/unexpected/innovative ways that you have seen Clickhouse used?
What are some of the most challenging aspects of working on Clickhouse itself, and or implementing systems on top of it?
What are the shortcomings of Clickhouse and how do you address them at Altinity?
When is Clickhouse the wrong choice?
Contact Info
Robert
LinkedIn
hodgesrm on GitHub
Alexander
alex-zaitsev on GitHub
LinkedIn
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
Clickhouse
Altinity
OLAP
M204
Sybase
MySQL
Vertica
Yandex
Yandex Metrica
Google Analytics
SQL
Greenplum
InfoBright
InfiniDB
MariaDB
Spark
SIMD (Single Instruction, Multiple Data)
Mergesort
ETL
Change Data Capture
MapReduce
KDB
OLTP
Cassandra
InfluxDB
Prometheus
SnowflakeDB
Hive
Hadoop
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
7/8/2019 • 1 hour, 11 minutes, 18 seconds
Stress Testing Kafka And Cassandra For Real-Time Anomaly Detection
Summary
Anomaly detection is a capability that is useful in a variety of problem domains, including finance, internet of things, and systems monitoring. Scaling the volume of events that can be processed in real-time can be challenging, so Paul Brebner from Instaclustr set out to see how far he could push Kafka and Cassandra for this use case. In this interview he explains the system design that he tested, his findings for how these tools were able to work together, and how they behaved at different orders of scale. It was an interesting conversation about how he stress tested the Instaclustr managed service for benchmarking an application that has real-world utility.
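The episode describes a specific detection pipeline built on Kafka and Cassandra; purely to illustrate the shape of the problem, here is a toy rolling-window detector in plain Python that flags values far from the recent mean. The window size, threshold, and data are all invented.

```python
# A toy rolling-window anomaly detector, to illustrate the shape of the
# problem discussed in the episode. The real system described here reads
# events from Kafka and historical values from Cassandra; this sketch just
# keeps a fixed-size window in memory and flags values that sit more than
# three standard deviations from the window mean. All numbers are invented.
from collections import deque
from statistics import mean, stdev

WINDOW_SIZE = 50
THRESHOLD = 3.0  # flag values more than 3 sigma from the rolling mean

window = deque(maxlen=WINDOW_SIZE)

def is_anomalous(value):
    """Return True when the value deviates strongly from the recent window."""
    if len(window) < WINDOW_SIZE:
        window.append(value)
        return False
    mu, sigma = mean(window), stdev(window)
    window.append(value)
    if sigma == 0:
        return False
    return abs(value - mu) > THRESHOLD * sigma

# Synthetic stream: steady values with one injected spike.
stream = [100 + (i % 5) for i in range(200)]
stream[120] = 500

for i, value in enumerate(stream):
    if is_anomalous(value):
        print(f"anomaly at offset {i}: {value}")
```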
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
Integrating data across the enterprise has been around for decades – so have the techniques to do it. But, a new way of integrating data and improving streams has evolved. By integrating each silo independently – data is able to integrate without any direct relation. At CluedIn they call it “eventual connectivity”. If you want to learn more on how to deliver fast access to your data across the enterprise leveraging this new method, and the technologies that make it possible, get a demo or presentation of the CluedIn Data Hub by visiting dataengineeringpodcast.com/cluedin. And don’t forget to thank them for supporting the show!
You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, and the Open Data Science Conference. Coming up this fall are the combined events of Graphorum and the Data Architecture Summit. The agendas have been announced and super early bird registration for up to $300 off is available until July 26th, with early bird pricing for up to $200 off through August 30th. Use the code BNLLC to get an additional 10% off any pass when you register. Go to dataengineeringpodcast.com/conferences to learn more and take advantage of our partner discounts when you register.
Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Your host is Tobias Macey and today I’m interviewing Paul Brebner about his experience designing and building a scalable, real-time anomaly detection system using Kafka and Cassandra
Interview
Introduction
How did you get involved in the area of data management?
Can you start by describing the problem that you were trying to solve and the requirements that you were aiming for?
What are some example cases where anomaly detection is useful or necessary?
Once you had established the requirements in terms of functionality and data volume, what was your approach for determining the target architecture?
What was your selection criteria for the various components of your system design?
What tools and technologies did you consider in your initial assessment and which did you ultimately converge on?
If you were to start over today would you do any of it differently?
Can you talk through the algorithm that you used for detecting anomalous activity?
What is the size/duration of the window within which you can effectively characterize trends and how do you collapse it down to a tractable search space?
What were you using as a data source, and if it was synthetic how did you handle introducing anomalies in a realistic fashion?
What were the main scalability bottlenecks that you encountered as you began ramping up the volume of data and the number of instances?
How did those bottlenecks differ as you moved through different levels of scale?
What were your assumptions going into this project and how accurate were they as you began testing and scaling the system that you built?
What were some of the most interesting or unexpected lessons that you learned in the process of building this anomaly detection system?
How have those lessons fed back to your work at Instaclustr?
Contact Info
LinkedIn
@paulbrebner_ on Twitter
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
Instaclustr
Kafka
Cassandra
Canberra, Australia
Spark
Anomaly Detection
Kubernetes
Prometheus
OpenTracing
Jaeger
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
7/2/2019 • 38 minutes, 2 seconds
The Workflow Engine For Data Engineers And Data Scientists
Summary
Building a data platform that works equally well for data engineering and data science is a task that requires familiarity with the needs of both roles. Data engineering platforms have a strong focus on stateful execution and tasks that are strictly ordered based on dependency graphs. Data science platforms provide an environment that is conducive to rapid experimentation and iteration, with data flowing directly between stages. Jeremiah Lowin has gained experience in both styles of working, leading him to be frustrated with all of the available tools. In this episode he explains his motivation for creating a new workflow engine that marries the needs of data engineers and data scientists, how it helps to smooth the handoffs between teams working on data projects, and how the design lets you focus on what you care about while it handles the failure cases for you. It is exciting to see a new generation of workflow engine that is learning from the benefits and failures of previous tools for processing your data pipelines.
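As a taste of the programming model, here is a minimal flow using the task/Flow API that Prefect exposed around the time of this episode; the tasks themselves are invented placeholders.

```python
# A minimal Prefect flow, using the task/Flow API Prefect exposed around the
# time of this episode. The tasks are placeholders; the point is that data
# passes directly between tasks while Prefect handles dependency ordering,
# retries, and failure states (the "negative engineering") for you.
from datetime import timedelta

from prefect import Flow, task

@task(max_retries=3, retry_delay=timedelta(seconds=10))
def extract():
    return [1, 2, 3, 4]

@task
def transform(numbers):
    return [n * 10 for n in numbers]

@task
def load(numbers):
    print(f"loaded {len(numbers)} records: {numbers}")

with Flow("etl-example") as flow:
    data = extract()
    cleaned = transform(data)
    load(cleaned)

if __name__ == "__main__":
    flow.run()
```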
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, and the Open Data Science Conference. Coming up this fall are the combined events of Graphorum and the Data Architecture Summit. The agendas have been announced and super early bird registration for up to $300 off is available until July 26th, with early bird pricing for up to $200 off through August 30th. Use the code BNLLC to get an additional 10% off any pass when you register. Go to dataengineeringpodcast.com/conferences to learn more and take advantage of our partner discounts when you register.
Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Your host is Tobias Macey and today I’m interviewing Jeremiah Lowin about Prefect, a workflow platform for data engineering
Interview
Introduction
How did you get involved in the area of data management?
Can you start by explaining what Prefect is and your motivation for creating it?
What are the axes along which a workflow engine can differentiate itself, and which of those have you focused on for Prefect?
In some of your blog posts and your PyData presentation you discuss the concept of negative vs. positive engineering. Can you briefly outline what you mean by that and the ways that Prefect handles the negative cases for you?
How is Prefect itself implemented and what tools or systems have you relied on most heavily for inspiration?
How do you manage passing data between stages in a pipeline when they are running across distributed nodes?
What was your decision making process when deciding to use Dask as your supported execution engine?
For tasks that require specific resources or dependencies how do you approach the idea of task affinity?
Does Prefect support managing tasks that bridge network boundaries?
What are some of the features or capabilities of Prefect that are misunderstood or overlooked by users which you think should be exercised more often?
What are the limitations of the open source core as compared to the cloud offering that you are building?
What were your assumptions going into this project and how have they been challenged or updated as you dug deeper into the problem domain and received feedback from users?
What are some of the most interesting/innovative/unexpected ways that you have seen Prefect used?
When is Prefect the wrong choice?
In your experience working on Airflow and Prefect, what are some of the common challenges and anti-patterns that arise in data engineering projects?
What are some best practices and industry trends that you are most excited by?
What do you have planned for the future of the Prefect project and company?
Contact Info
LinkedIn
@jlowin on Twitter
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
Prefect
Airflow
Dask
Podcast Episode
Prefect Blog
PyData Presentation
Tensorflow
Workflow Engine
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
6/25/2019 • 1 hour, 8 minutes, 26 seconds
Maintaining Your Data Lake At Scale With Spark
Summary
Building and maintaining a data lake is a choose your own adventure of tools, services, and evolving best practices. The flexibility and freedom that data lakes provide allows for generating significant value, but it can also lead to anti-patterns and inconsistent quality in your analytics. Delta Lake is an open source, opinionated framework built on top of Spark for interacting with and maintaining data lake platforms that incorporates the lessons learned at DataBricks from countless customer use cases. In this episode Michael Armbrust, the lead architect of Delta Lake, explains how the project is designed, how you can use it for building a maintainable data lake, and some useful patterns for progressively refining the data in your lake. This conversation was useful for getting a better idea of the challenges that exist in large scale data analytics, and the current state of the tradeoffs between data lakes and data warehouses in the cloud.
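To make the workflow concrete, here is a short PySpark sketch (assuming a SparkSession already configured with the Delta Lake package) that writes a Delta table, appends to it, and reads back an earlier version; the path and records are invented for illustration.

```python
# A short PySpark sketch of the Delta Lake workflow described here: write a
# table, append to it transactionally, and read an earlier version back.
# It assumes a SparkSession that already has the Delta Lake package on its
# classpath; the path and records are invented.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("delta-example").getOrCreate()
path = "/tmp/events_delta"

# Initial load. Delta stores the data as Parquet plus a transaction log,
# which is what provides ACID guarantees on top of the data lake.
events = spark.createDataFrame(
    [(1, "signup"), (2, "login")], ["user_id", "event"]
)
events.write.format("delta").mode("overwrite").save(path)

# Appends are atomic: readers see either the old snapshot or the new one.
more = spark.createDataFrame([(3, "purchase")], ["user_id", "event"])
more.write.format("delta").mode("append").save(path)

# Time travel: read the table as it existed at an earlier version.
first_version = spark.read.format("delta").option("versionAsOf", 0).load(path)
first_version.show()
```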
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
And to keep track of how your team is progressing on building new pipelines and tuning their workflows, you need a project management system designed by engineers, for engineers. Clubhouse lets you craft a workflow that fits your style, including per-team tasks, cross-project epics, a large suite of pre-built integrations, and a simple API for crafting your own. With such an intuitive tool it’s easy to make sure that everyone in the business is on the same page. Data Engineering Podcast listeners get 2 months free on any plan by going to dataengineeringpodcast.com/clubhouse today and signing up for a free trial. Support the show and get your data projects in order!
You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, and the Open Data Science Conference. Coming up this fall are the combined events of Graphorum and the Data Architecture Summit. The agendas have been announced and super early bird registration for up to $300 off is available until July 26th, with early bird pricing for up to $200 off through August 30th. Use the code BNLLC to get an additional 10% off any pass when you register. Go to dataengineeringpodcast.com/conferences to learn more and take advantage of our partner discounts when you register.
Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Your host is Tobias Macey and today I’m interviewing Michael Armbrust about Delta Lake, an open source storage layer that brings ACID transactions to Apache Spark and big data workloads.
Interview
Introduction
How did you get involved in the area of data management?
Can you start by explaining what Delta Lake is and the motivation for creating it?
What are some of the common antipatterns in data lake implementations and how does Delta Lake address them?
What are the benefits of a data lake over a data warehouse?
How has that equation changed in recent years with the availability of modern cloud data warehouses?
How is Delta lake implemented and how has the design evolved since you first began working on it?
What assumptions did you have going into the project and how have they been challenged as it has gained users?
One of the compelling features is the option for enforcing data quality constraints. Can you talk through how those are defined and tested?
In your experience, how do you manage schema evolution when working with large volumes of data? (e.g. rewriting all of the old files, or just eliding the missing columns/populating default values, etc.)
Can you talk through how Delta Lake manages transactionality and data ownership? (e.g. what if you have other services interacting with the data store)
Are there limits in terms of the volume of data that can be managed within a single transaction?
How does unifying the interface for Spark to interact with batch and streaming data sets simplify the workflow for an end user?
The Lambda architecture was popular in the early days of Hadoop but seems to have fallen out of favor. How does this unified interface resolve the shortcomings and complexities of that approach?
What have been the most difficult/complex/challenging aspects of building Delta Lake?
How is the data versioning in Delta Lake implemented?
By keeping a copy of all iterations of a data set there is the opportunity for a great deal of additional cost. What are some options for mitigating that impact, either in Delta Lake itself or as a separate mechanism or process?
What are the reasons for standardizing on Parquet as the storage format?
What are some of the cases where that has led to greater complications?
In addition to the transactionality and data validation that Delta Lake provides, can you also explain how indexing is implemented and highlight the challenges of keeping them up to date?
When is Delta Lake the wrong choice?
What problems did you consciously decide not to address?
What is in store for the future of Delta Lake?
Contact Info
LinkedIn
@michaelarmbrust on Twitter
marmbrus on GitHub
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
Delta Lake
DataBricks
Spark SQL
Microsoft SQL Server
Databricks Delta
Spark Summit
Apache Spark
Enterprise Data Curation Episode
Data Lake
Data Warehouse
SnowflakeDB
BigQuery
Parquet
Data Serialization Episode
Hive Metastore
Great Expectations
Podcast.__init__ Interview
Optimistic Concurrency/Optimistic Locking
Presto
Starburst Labs
Podcast Interview
Apache NiFi
Podcast Interview
Tensorflow
Tableau
Change Data Capture
Apache Pulsar
Podcast Interview
Pravega
Podcast Interview
Multi-Version Concurrency Control
MLFlow
Avro
ORC
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
6/17/2019 • 50 minutes, 50 seconds
Managing The Machine Learning Lifecycle
Summary
Building a machine learning model can be difficult, but that is only half of the battle. Having a perfect model is only useful if you are able to get it into production. In this episode Stepan Pushkarev, founder of Hydrosphere, explains why deploying and maintaining machine learning projects in production is different from regular software projects and the challenges that they bring. He also describes the Hydrosphere platform, and how the different components work together to manage the full machine learning lifecycle of model deployment and retraining. This was a useful conversation to get a better understanding of the unique difficulties that exist for machine learning projects.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
And to keep track of how your team is progressing on building new pipelines and tuning their workflows, you need a project management system designed by engineers, for engineers. Clubhouse lets you craft a workflow that fits your style, including per-team tasks, cross-project epics, a large suite of pre-built integrations, and a simple API for crafting your own. With such an intuitive tool it’s easy to make sure that everyone in the business is on the same page. Data Engineering Podcast listeners get 2 months free on any plan by going to dataengineeringpodcast.com/clubhouse today and signing up for a free trial. Support the show and get your data projects in order!
You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, and the Open Data Science Conference. Coming up this fall are the combined events of Graphorum and the Data Architecture Summit. The agendas have been announced and super early bird registration for up to $300 off is available until July 26th, with early bird pricing for up to $200 off through August 30th. Use the code BNLLC to get an additional 10% off any pass when you register. Go to dataengineeringpodcast.com/conferences to learn more and take advantage of our partner discounts when you register.
Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Your host is Tobias Macey and today I’m interviewing Stepan Pushkarev about Hydrosphere, the first open source platform for Data Science and Machine Learning Management automation
Interview
Introduction
How did you get involved in the area of data management?
Can you start by explaining what Hydrosphere is and share its origin story?
In your experience, what are the most challenging or complicated aspects of managing machine learning models in a production context?
How does it differ from deployment and maintenance of a regular software application?
Can you describe how Hydrosphere is architected and how the different components of the stack fit together?
For someone who is using Hydrosphere in their production workflow, what would that look like?
What is the difference in interaction with Hydrosphere for different roles within a data team?
What are some of the types of metrics that you monitor to determine when and how to retrain deployed models?
Which metrics do you track for testing and verifying the health of the data?
What are the factors that contribute to model degradation in production and how do you incorporate contextual feedback into the training cycle to counteract them?
How has the landscape and sophistication for real world usability of machine learning changed since you first began working on Hydrosphere?
How has that influenced the design and direction of Hydrosphere, both as a project and a business?
How has the design of Hydrosphere evolved since you first began working on it?
What assumptions did you have when you began working on Hydrosphere and how have they been challenged or modified through growing the platform?
What have been some of the most challenging or complex aspects of building and maintaining Hydrosphere?
What do you have in store for the future of Hydrosphere?
Contact Info
LinkedIn
spushkarev on GitHub
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
Hydrosphere
GitHub
Data Engineering Podcast at ODSC
KD Nuggets
Big Data Science: Expectation vs. Reality
The Open Data Science Conference
Scala
InfluxDB
RocksDB
Docker
Kubernetes
Akka
Python Pickle
Protocol Buffers
Kubeflow
MLFlow
TensorFlow Extended
Kubeflow Pipelines
Argo
Airflow
Podcast.__init__ Interview
Envoy
Istio
DVC
Podcast.__init__ Interview
Generative Adversarial Networks
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
6/10/2019 • 1 hour, 2 minutes, 39 seconds
Evolving An ETL Pipeline For Better Productivity
Summary
Building an ETL pipeline can be a significant undertaking, and sometimes it needs to be rebuilt when a better option becomes available. In this episode Aaron Gibralter, director of engineering at Greenhouse, joins Raghu Murthy, founder and CEO of DataCoral, to discuss the journey that he and his team took from an in-house ETL pipeline built out of open source components onto a paid service. He explains how their original implementation was built, why they decided to migrate to a paid service, and how they made that transition. He also discusses how the abstractions provided by DataCoral allow his data scientists to remain productive without requiring dedicated data engineers. If you are either considering how to build a data pipeline or debating whether to migrate your existing ETL to a service, this is definitely worth listening to for some perspective.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
And to keep track of how your team is progressing on building new pipelines and tuning their workflows, you need a project management system designed by engineers, for engineers. Clubhouse lets you craft a workflow that fits your style, including per-team tasks, cross-project epics, a large suite of pre-built integrations, and a simple API for crafting your own. With such an intuitive tool it’s easy to make sure that everyone in the business is on the same page. Data Engineering Podcast listeners get 2 months free on any plan by going to dataengineeringpodcast.com/clubhouse today and signing up for a free trial. Support the show and get your data projects in order!
You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, and the Open Data Science Conference. Coming up this fall are the combined events of Graphorum and the Data Architecture Summit. The agendas have been announced and super early bird registration for up to $300 off is available until July 26th, with early bird pricing for up to $200 off through August 30th. Use the code BNLLC to get an additional 10% off any pass when you register. Go to dataengineeringpodcast.com/conferences to learn more and take advantage of our partner discounts when you register.
Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Your host is Tobias Macey and today I’m interviewing Aaron Gibralter and Raghu Murthy about the experience of Greenhouse migrating their data pipeline to DataCoral
Interview
Introduction
How did you get involved in the area of data management?
Aaron, can you start by describing what Greenhouse is and some of the ways that you use data?
Can you describe your overall data infrastructure and the state of your data pipeline before migrating to DataCoral?
What are your primary sources of data and what are the targets that you are loading them into?
What were your biggest pain points and what motivated you to re-evaluate your approach to ETL?
What were your criteria for your replacement technology and how did you gather and evaluate your options?
Once you made the decision to use DataCoral can you talk through the transition and cut-over process?
What were some of the unexpected edge cases or shortcomings that you experienced when moving to DataCoral?
What were the big wins?
What was your evaluation framework for determining whether your re-engineering was successful?
Now that you are using DataCoral how would you characterize the experiences of yourself and your team?
If you have freed up time for your engineers, how are you allocating that spare capacity?
What do you hope to see from DataCoral in the future?
What advice do you have for anyone else who is either evaluating a re-architecture of their existing data platform or planning out a greenfield project?
Contact Info
Aaron
agribralter on GitHub
LinkedIn
Raghu
LinkedIn
Medium
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
Greenhouse
We’re hiring Data Scientists and Software Engineers!
Datacoral
Airflow
Podcast.__init__ Interview
Data Engineering Interview about running Airflow in production
Periscope Data
Mode Analytics
Data Warehouse
ETL
Salesforce
Zendesk
Jira
DataDog
Asana
GDPR
Metabase
Podcast Interview
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
6/4/2019 • 1 hour, 2 minutes, 21 seconds
Data Lineage For Your Pipelines
Summary
Some problems in data are well defined and benefit from a ready-made set of tools. For everything else, there’s Pachyderm, the platform for data science that is built to scale. In this episode Joe Doliner, CEO and co-founder, explains how Pachyderm started as an attempt to make data provenance easier to track, how the platform is architected and used today, and examples of how the underlying principles manifest in the workflows of data engineers and data scientists as they collaborate on data projects. In addition to all of that he also shares his thoughts on their recent round of fund-raising and where the future will take them. If you are looking for a set of tools for building your data science workflows then Pachyderm is a solid choice, featuring data versioning, first class tracking of data lineage, and language agnostic data pipelines.
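For a sense of how pipelines are declared, here is a hedged sketch of a Pachyderm pipeline specification built as JSON from Python; the repo, image, and command are invented, and the resulting file would be submitted to the cluster with pachctl.

```python
# A sketch of a Pachyderm pipeline specification, emitted as JSON from
# Python. Each pipeline reads a commit from an input repo, runs a container
# against it, and writes results to its own output repo, which is how the
# versioning and provenance described here are tracked. The repo, image,
# and command are invented; check the Pachyderm docs for the full spec.
import json

pipeline_spec = {
    "pipeline": {"name": "word-count"},
    "input": {
        "pfs": {
            "repo": "raw-text",  # input repo whose commits trigger runs
            "glob": "/*",        # each top-level file is processed separately
        }
    },
    "transform": {
        "image": "python:3.7",
        "cmd": ["python3", "/count.py"],
    },
}

with open("pipeline.json", "w") as f:
    json.dump(pipeline_spec, f, indent=2)

# The spec would then be submitted to the cluster with:
#   pachctl create pipeline -f pipeline.json
print(json.dumps(pipeline_spec, indent=2))
```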
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
Alluxio is an open source, distributed data orchestration layer that makes it easier to scale your compute and your storage independently. By transparently pulling data from underlying silos, Alluxio unlocks the value of your data and allows for modern computation-intensive workloads to become truly elastic and flexible for the cloud. With Alluxio, companies like Barclays, JD.com, Tencent, and Two Sigma can manage data efficiently, accelerate business analytics, and ease the adoption of any cloud. Go to dataengineeringpodcast.com/alluxio today to learn more and thank them for their support.
Understanding how your customers are using your product is critical for businesses of any size. To make it easier for startups to focus on delivering useful features Segment offers a flexible and reliable data infrastructure for your customer analytics and custom events. You only need to maintain one integration to instrument your code and get a future-proof way to send data to over 250 services with the flip of a switch. Not only does it free up your engineers’ time, it lets your business users decide what data they want where. Go to dataengineeringpodcast.com/segmentio today to sign up for their startup plan and get $25,000 in Segment credits and $1 million in free software from marketing and analytics companies like AWS, Google, and Intercom. On top of that you’ll get access to Analytics Academy for the educational resources you need to become an expert in data analytics for measuring product-market fit.
You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, and the Open Data Science Conference. Go to dataengineeringpodcast.com/conferences to learn more and take advantage of our partner discounts when you register.
Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Your host is Tobias Macey and today I’m interviewing Joe Doliner about Pachyderm, a platform that lets you deploy and manage multi-stage, language-agnostic data pipelines while maintaining complete reproducibility and provenance
Interview
Introduction
How did you get involved in the area of data management?
Can you start by explaining what Pachyderm is and how it got started?
What is new in the last two years since I talked to Dan Whitenack in episode 1?
How have the changes and additional features in Kubernetes impacted your work on Pachyderm?
A recent development in the Kubernetes space is the Kubeflow project. How do its capabilities compare with or complement what you are doing in Pachyderm?
Can you walk through the overall workflow for someone building an analysis pipeline in Pachyderm?
How does that break down across different roles and responsibilities (e.g. data scientist vs data engineer)?
There are a lot of concepts and moving parts in Pachyderm, from getting a Kubernetes cluster set up, to understanding the file system and processing pipeline, to understanding best practices. What are some of the common challenges or points of confusion that new users encounter?
Data provenance is critical for understanding the end results of an analysis or ML model. Can you explain how the tracking in Pachyderm is implemented?
What is the interface for exposing and exploring that provenance data?
What are some of the advanced capabilities of Pachyderm that you would like to call out?
With your recent round of fundraising I’m assuming there is new pressure to grow and scale your product and business. How are you approaching that and what are some of the challenges you are facing?
What have been some of the most challenging/useful/unexpected lessons that you have learned in the process of building, maintaining, and growing the Pachyderm project and company?
What do you have planned for the future of Pachyderm?
Contact Info
@jdoliner on Twitter
LinkedIn
jdoliner on GitHub
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
Pachyderm
RethinkDB
AirBnB
Data Provenance
Kubeflow
Stateful Sets
etcd
Airflow
Kafka
GitHub
GitLab
Docker
Kubernetes
CI == Continuous Integration
CD == Continuous Delivery
Ceph
Podcast Interview
Object Storage
MiniKube
FUSE == File System In User Space
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
5/27/2019 • 49 minutes, 1 second
Build Your Data Analytics Like An Engineer With DBT
Summary
In recent years the traditional approach to building data warehouses has shifted from transforming records before loading to transforming them afterwards. As a result, the tooling for those transformations needs to be reimagined. The data build tool (dbt) is designed to bring battle-tested engineering practices to your analytics pipelines. By providing an opinionated set of best practices it simplifies collaboration and boosts confidence in your data teams. In this episode Drew Banin, creator of dbt, explains how it got started, how it is designed, and how you can start using it today to create reliable and well-tested reports in your favorite data warehouse.
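To make the transform-after-load idea concrete, here is a minimal sketch of the pattern that dbt automates: a transformation expressed purely as a SELECT statement that is materialized in the warehouse, followed by an assertion-style test. This is illustrative Python against an in-memory SQLite database with made-up table names, not dbt itself; dbt expresses models as SQL files with Jinja templating and handles dependencies, materialization, and testing for you.

```python
import sqlite3

# Stand-in for a warehouse connection; dbt would target Redshift, BigQuery,
# Snowflake, etc. The table names here are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE raw_orders (id INTEGER, customer_id INTEGER, amount REAL);
    INSERT INTO raw_orders VALUES (1, 10, 25.0), (2, 10, 40.0), (3, 11, 15.0);
""")

# "Model": a transformation written as a SELECT, materialized as a table.
conn.executescript("""
    CREATE TABLE customer_totals AS
    SELECT customer_id, SUM(amount) AS total_spend, COUNT(*) AS order_count
    FROM raw_orders
    GROUP BY customer_id;
""")

# "Test": an assertion query that should return zero rows if the data is sound,
# analogous in spirit to dbt's not_null/unique schema tests.
bad_rows = conn.execute(
    "SELECT customer_id FROM customer_totals "
    "WHERE total_spend IS NULL OR total_spend < 0"
).fetchall()
assert not bad_rows, f"customer_totals failed validation: {bad_rows}"
print(conn.execute("SELECT * FROM customer_totals").fetchall())
```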
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
Understanding how your customers are using your product is critical for businesses of any size. To make it easier for startups to focus on delivering useful features Segment offers a flexible and reliable data infrastructure for your customer analytics and custom events. You only need to maintain one integration to instrument your code and get a future-proof way to send data to over 250 services with the flip of a switch. Not only does it free up your engineers’ time, it lets your business users decide what data they want where. Go to dataengineeringpodcast.com/segmentio today to sign up for their startup plan and get $25,000 in Segment credits and $1 million in free software from marketing and analytics companies like AWS, Google, and Intercom. On top of that you’ll get access to Analytics Academy for the educational resources you need to become an expert in data analytics for measuring product-market fit.
You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, and the Open Data Science Conference. Go to dataengineeringpodcast.com/conferences to learn more and take advantage of our partner discounts when you register.
Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Your host is Tobias Macey and today I’m interviewing Drew Banin about DBT, the Data Build Tool, a toolkit for building analytics the way that developers build applications
Interview
Introduction
How did you get involved in the area of data management?
Can you start by explaining what DBT is and your motivation for creating it?
Where does it fit in the overall landscape of data tools and the lifecycle of data in an analytics pipeline?
Can you talk through the workflow for someone using DBT?
One of the useful features of DBT for stability of analytics is the ability to write and execute tests. Can you explain how those are implemented?
The packaging capabilities are beneficial for enabling collaboration. Can you talk through how the packaging system is implemented?
Are these packages driven by Fishtown Analytics or the dbt community?
What are the limitations of modeling everything as a SELECT statement?
Making SQL code reusable is notoriously difficult. How does the Jinja templating of DBT address this issue and what are the shortcomings?
What are your thoughts on higher level approaches to SQL that compile down to the specific statements?
Can you explain how DBT is implemented and how the design has evolved since you first began working on it?
What are some of the features of DBT that are often overlooked which you find particularly useful?
What are some of the most interesting/unexpected/innovative ways that you have seen DBT used?
What are the additional features that the commercial version of DBT provides?
What are some of the most useful or challenging lessons that you have learned in the process of building and maintaining DBT?
When is it the wrong choice?
What do you have planned for the future of DBT?
Contact Info
Email
@drebanin on Twitter
drebanin on GitHub
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
DBT
Fishtown Analytics
8Tracks Internet Radio
Redshift
Magento
Stitch Data
Fivetran
Airflow
Business Intelligence
Jinja template language
BigQuery
Snowflake
Version Control
Git
Continuous Integration
Test Driven Development
Snowplow Analytics
Podcast Episode
dbt-utils
We Can Do Better Than SQL blog post from EdgeDB
EdgeDB
Looker LookML
Podcast Interview
Presto DB
Podcast Interview
Spark SQL
Hive
Azure SQL Data Warehouse
Data Warehouse
Data Lake
Data Council Conference
Slowly Changing Dimensions
dbt Archival
Mode Analytics
Periscope BI
dbt docs
dbt repository
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
5/20/2019 • 56 minutes, 46 seconds
Using FoundationDB As The Bedrock For Your Distributed Systems
Summary
The database market continues to expand, offering systems that are suited to virtually every use case. But what happens if you need something customized to your application? FoundationDB is a distributed key-value store that provides the primitives that you need to build a custom database platform. In this episode Ryan Worl explains how it is architected, how to use it for your applications, and provides examples of system design patterns that can be built on top of it. If you need a foundation for your distributed systems, then FoundationDB is definitely worth a closer look.
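As a small taste of the key-value primitives discussed in the episode, the sketch below uses the foundationdb Python bindings, assuming a locally running cluster reachable through the default cluster file; the API version, key layout, and values are illustrative assumptions rather than anything prescribed by the project. Functions wrapped with @fdb.transactional are retried until their transaction commits, which is how the ACID guarantees surface to application code.

```python
import fdb

# The bindings require pinning an API version before opening a database;
# 620 corresponds to the 6.2 client series and is an assumption about what is installed.
fdb.api_version(620)
db = fdb.open()  # uses the default cluster file

@fdb.transactional
def set_email(tr, user_id, email):
    # Keys and values are plain byte strings at this level; higher-level
    # "layers" (record, document, etc.) build richer models on top of them.
    tr[b"user/" + user_id + b"/email"] = email

@fdb.transactional
def get_email(tr, user_id):
    return tr[b"user/" + user_id + b"/email"]

set_email(db, b"42", b"alice@example.com")
print(get_email(db, b"42"))  # the stored value, if the write committed
```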
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
Alluxio is an open source, distributed data orchestration layer that makes it easier to scale your compute and your storage independently. By transparently pulling data from underlying silos, Alluxio unlocks the value of your data and allows for modern computation-intensive workloads to become truly elastic and flexible for the cloud. With Alluxio, companies like Barclays, JD.com, Tencent, and Two Sigma can manage data efficiently, accelerate business analytics, and ease the adoption of any cloud. Go to dataengineeringpodcast.com/alluxio today to learn more and thank them for their support.
Understanding how your customers are using your product is critical for businesses of any size. To make it easier for startups to focus on delivering useful features Segment offers a flexible and reliable data infrastructure for your customer analytics and custom events. You only need to maintain one integration to instrument your code and get a future-proof way to send data to over 250 services with the flip of a switch. Not only does it free up your engineers’ time, it lets your business users decide what data they want where. Go to dataengineeringpodcast.com/segmentio today to sign up for their startup plan and get $25,000 in Segment credits and $1 million in free software from marketing and analytics companies like AWS, Google, and Intercom. On top of that you’ll get access to Analytics Academy for the educational resources you need to become an expert in data analytics for measuring product-market fit.
You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, and the Open Data Science Conference. Go to dataengineeringpodcast.com/conferences to learn more and take advantage of our partner discounts when you register.
Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Your host is Tobias Macey and today I’m interviewing Ryan Worl about FoundationDB, a distributed key/value store that gives you the power of ACID transactions in a NoSQL database
Interview
Introduction
How did you get involved in the area of data management?
Can you explain what FoundationDB is and how you got involved with the project?
What are some of the unique use cases that FoundationDB enables?
Can you describe how FoundationDB is architected?
How is the ACID compliance implemented at the cluster level?
What are some of the mechanisms built into FoundationDB that contribute to its fault tolerance?
How are conflicts managed?
FoundationDB has an interesting feature in the form of Layers that provide different semantics on the underlying storage. Can you describe how that is implemented and some of the interesting layers that are available?
Is it possible to apply different layers, such as relational and document, to the same underlying objects in storage?
One of the aspects of FoundationDB that is called out in the documentation and which I have heard about elsewhere is the performance that it provides. Can you describe some of the implementation mechanics of FoundationDB that allow it to provide such high throughput?
For someone who wants to run FoundationDB can you describe a typical deployment topology?
What are the scaling factors for the underlying storage and for the Layers that are operating on the cluster?
Once you have a cluster deployed, what are some of the edge cases that users should watch out for?
How are version upgrades managed in a cluster?
What are some of the ways that FoundationDB impacts the way that an application developer or data engineer would architect their software as compared to working with something like Postgres or MongoDB?
What are some of the more interesting/unusual/unexpected ways that you have seen FoundationDB used?
When is FoundationDB the wrong choice?
What is in store for the future of FoundationDB?
Contact Info
LinkedIn
@ryanworl on Twitter
Website
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
FoundationDB
Jepsen
Andy Pavlo
Archive.org – The Internet Archive
FoundationDB Summit
Flow Language
C++
Actor Model
Erlang
Zookeeper
Podcast Episode
PAXOS consensus algorithm
Multi-Version Concurrency Control (MVCC) AKA Optimistic Locking
ACID
CAP Theorem
Redis
Record Layer
CloudKit
Document Layer
Segment
Podcast Episode
NVMe
SnowflakeDB
FlatBuffers
Protocol Buffers
Ryan Worl FoundationDB Summit Presentation
Google F1
Google Spanner
WaveFront
etcd
B+ Tree
Michael Stonebraker
Three Vs
Confluent
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
5/7/2019 • 1 hour, 6 minutes, 2 seconds
Running Your Database On Kubernetes With KubeDB
Summary
Kubernetes is a driving force in the renaissance around deploying and running applications. However, managing the database layer is still a separate concern. The KubeDB project was created as a way of providing a simple mechanism for running your storage system in the same platform as your application. In this episode Tamal Saha explains how the KubeDB project got started, why you might want to run your database with Kubernetes, and how to get started. He also covers some of the challenges of managing stateful services in Kubernetes and how the fast pace of the community has contributed to the evolution of KubeDB. If you are at any stage of a Kubernetes implementation, or just thinking about it, this is definitely worth a listen to get some perspective on how to leverage it for your entire application stack.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
Alluxio is an open source, distributed data orchestration layer that makes it easier to scale your compute and your storage independently. By transparently pulling data from underlying silos, Alluxio unlocks the value of your data and allows for modern computation-intensive workloads to become truly elastic and flexible for the cloud. With Alluxio, companies like Barclays, JD.com, Tencent, and Two Sigma can manage data efficiently, accelerate business analytics, and ease the adoption of any cloud. Go to dataengineeringpodcast.com/alluxio today to learn more and thank them for their support.
Understanding how your customers are using your product is critical for businesses of any size. To make it easier for startups to focus on delivering useful features Segment offers a flexible and reliable data infrastructure for your customer analytics and custom events. You only need to maintain one integration to instrument your code and get a future-proof way to send data to over 250 services with the flip of a switch. Not only does it free up your engineers’ time, it lets your business users decide what data they want where. Go to dataengineeringpodcast.com/segmentio today to sign up for their startup plan and get $25,000 in Segment credits and $1 million in free software from marketing and analytics companies like AWS, Google, and Intercom. On top of that you’ll get access to Analytics Academy for the educational resources you need to become an expert in data analytics for measuring product-market fit.
You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, and the Open Data Science Conference. Go to dataengineeringpodcast.com/conferences to learn more and take advantage of our partner discounts when you register.
Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Your host is Tobias Macey and today I’m interviewing Tamal Saha about KubeDB, a project focused on making running production-grade databases easy on Kubernetes
Interview
Introduction
How did you get involved in the area of data management?
Can you start by explaining what KubeDB is and how the project got started?
What are the main challenges associated with running a stateful system on top of Kubernetes?
Why would someone want to run their database on a container platform rather than on a dedicated instance or with a hosted service?
Can you describe how KubeDB is implemented and how that has evolved since you first started working on it?
Can you talk through how KubeDB simplifies the process of deploying and maintaining databases?
What is involved in adding support for a new database?
How do the requirements change for systems that are natively clustered?
How does KubeDB help with maintenance processes around upgrading existing databases to newer versions?
How does the work that you are doing on KubeDB compare to what is available in StorageOS?
Are there any other projects that are targeting similar goals?
What have you found to be the most interesting/challenging/unexpected aspects of building KubeDB?
What do you have planned for the future of the project?
Contact Info
LinkedIn
@tsaha on Twitter
Email
tamalsaha on GitHub
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
KubeDB
AppsCode
Kubernetes
Kubernetes CRD (Custom Resource Definition)
Kubernetes Operator
Kubernetes Stateful Sets
PostgreSQL
Podcast Interview
Hashicorp Vault
Redis
Elasticsearch
Podcast Interview
MySQL
Memcached
MongoDB
Docker
Rook Storage Orchestration for Kubernetes
Ceph
Podcast Interview
EBS
StorageOS
GlusterFS
OpenEBS
CloudFoundry
AppsCode Service Broker
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
4/29/2019 • 50 minutes, 54 seconds
Unpacking Fauna: A Global Scale Cloud Native Database
Summary
One of the biggest challenges for any business trying to grow and reach customers globally is how to scale their data storage. FaunaDB is a cloud native database built by the engineers behind Twitter’s infrastructure and designed to serve the needs of modern systems. Evan Weaver is the co-founder and CEO of Fauna and in this episode he explains the unique capabilities of Fauna, compares the consensus and transaction algorithm to that used in other NewSQL systems, and describes the ways that it allows for new application design patterns. One of the unique aspects of Fauna that is worth drawing attention to is the first-class support for temporality, which simplifies querying of historical states of the data. It is definitely worth a good look for anyone building a platform that needs a simple-to-manage data layer that will scale with your business.
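To illustrate the temporality feature mentioned above, here is a rough sketch using the faunadb Python driver; the secret, collection name, document data, and timestamp are placeholders, and the "orders" collection is assumed to already exist. The At() wrapper evaluates a query against the state of the data at a given point in time.

```python
from faunadb import query as q
from faunadb.client import FaunaClient

# The secret below is a placeholder for a real Fauna key.
client = FaunaClient(secret="YOUR_FAUNA_SECRET")

# Create a document in a pre-existing "orders" collection (names are illustrative).
doc = client.query(
    q.create(q.collection("orders"), {"data": {"status": "pending", "total": 42}})
)

# Read the current state of the document.
current = client.query(q.get(doc["ref"]))

# Fauna's first-class temporality: wrap the read in At() to see the document as it
# existed at a point in time. The timestamp is illustrative and must fall within
# the document's history, otherwise the read will fail.
as_of = client.query(
    q.at(q.time("2019-04-22T00:00:00Z"), q.get(doc["ref"]))
)
print(current["data"], as_of["data"])
```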
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
Alluxio is an open source, distributed data orchestration layer that makes it easier to scale your compute and your storage independently. By transparently pulling data from underlying silos, Alluxio unlocks the value of your data and allows for modern computation-intensive workloads to become truly elastic and flexible for the cloud. With Alluxio, companies like Barclays, JD.com, Tencent, and Two Sigma can manage data efficiently, accelerate business analytics, and ease the adoption of any cloud. Go to dataengineeringpodcast.com/alluxio today to learn more and thank them for their support.
Understanding how your customers are using your product is critical for businesses of any size. To make it easier for startups to focus on delivering useful features Segment offers a flexible and reliable data infrastructure for your customer analytics and custom events. You only need to maintain one integration to instrument your code and get a future-proof way to send data to over 250 services with the flip of a switch. Not only does it free up your engineers’ time, it lets your business users decide what data they want where. Go to dataengineeringpodcast.com/segmentio today to sign up for their startup plan and get $25,000 in Segment credits and $1 million in free software from marketing and analytics companies like AWS, Google, and Intercom. On top of that you’ll get access to Analytics Academy for the educational resources you need to become an expert in data analytics for measuring product-market fit.
You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, and the Open Data Science Conference. Go to dataengineeringpodcast.com/conferences to learn more and take advantage of our partner discounts when you register.
Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Your host is Tobias Macey and today I’m interviewing Evan Weaver about FaunaDB, a modern operational data platform built for your cloud
Interview
Introduction
How did you get involved in the area of data management?
Can you start by explaining what FaunaDB is and how it got started?
What are some of the main use cases that FaunaDB is targeting?
How does it compare to some of the other global scale databases that have been built in recent years such as CockroachDB?
Can you describe the architecture of FaunaDB and how it has evolved?
The consensus and replication protocol in Fauna is intriguing. Can you talk through how it works?
What are some of the edge cases that users should be aware of?
How are conflicts managed in Fauna?
What is the underlying storage layer?
How is the query layer designed to allow for different query patterns and model representations?
How does data modeling in Fauna compare to that of relational or document databases?
Can you describe the query format?
What are some of the common difficulties or points of confusion around interacting with data in Fauna?
What are some application design patterns that are enabled by using Fauna as the storage layer?
Given the ability to replicate globally, how do you mitigate latency when interacting with the database?
What are some of the most interesting or unexpected ways that you have seen Fauna used?
When is it the wrong choice?
What have been some of the most interesting/unexpected/challenging aspects of building the Fauna database and company?
What do you have in store for the future of Fauna?
Contact Info
@evan on Twitter
LinkedIn
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
Fauna
Ruby on Rails
CNET
GitHub
Twitter
NoSQL
Cassandra
InnoDB
Redis
Memcached
Timeseries
Spanner Paper
DynamoDB Paper
Percolator
ACID
Calvin Protocol
Daniel Abadi
LINQ
LSM Tree (Log-structured Merge-tree)
Scala
Change Data Capture
GraphQL
Podcast.init Interview About Graphene
Fauna Query Language (FQL)
CQL == Cassandra Query Language
Object-Relational Databases
LDAP == Lightweight Directory Access Protocol
Auth0
OLAP == Online Analytical Processing
Jepsen distributed systems safety research
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
4/22/2019 • 53 minutes, 50 seconds
Index Your Big Data With Pilosa For Faster Analytics
Summary
Database indexes are critical to ensure fast lookups of your data, but they are inherently tied to the database engine. Pilosa is rewriting that equation by providing a flexible, scalable, performant engine for building an index of your data to enable high-speed aggregate analysis. In this episode Seebs explains how Pilosa fits in the broader data landscape, how it is architected, and how you can start using it for your own analysis. This was an interesting exploration of a different way to look at what a database can be.
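To show the intuition behind a bitmap index without reaching for the real client, here is a toy Python sketch: each (field, value) pair maps to the set of record ids that have it, and aggregate questions become fast set intersections. Pilosa applies the same idea at scale with compressed roaring bitmaps distributed across a cluster and queried through PQL; the records and fields below are made up for illustration.

```python
from collections import defaultdict

# One "bitmap" (here just a Python set of record ids) per (field, value) pair.
index = defaultdict(set)
records = [
    (0, {"country": "US", "plan": "pro"}),
    (1, {"country": "DE", "plan": "free"}),
    (2, {"country": "US", "plan": "free"}),
    (3, {"country": "US", "plan": "pro"}),
]

# Build the index: mark each record id in the bitmap for every attribute it has.
for record_id, attrs in records:
    for field, value in attrs.items():
        index[(field, value)].add(record_id)

# Answer an aggregate question by intersecting bitmaps:
# how many US customers are on the pro plan?
us_pro = index[("country", "US")] & index[("plan", "pro")]
print(len(us_pro), sorted(us_pro))  # 2 [0, 3]
```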
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
Alluxio is an open source, distributed data orchestration layer that makes it easier to scale your compute and your storage independently. By transparently pulling data from underlying silos, Alluxio unlocks the value of your data and allows for modern computation-intensive workloads to become truly elastic and flexible for the cloud. With Alluxio, companies like Barclays, JD.com, Tencent, and Two Sigma can manage data efficiently, accelerate business analytics, and ease the adoption of any cloud. Go to dataengineeringpodcast.com/alluxio today to learn more and thank them for their support.
Understanding how your customers are using your product is critical for businesses of any size. To make it easier for startups to focus on delivering useful features Segment offers a flexible and reliable data infrastructure for your customer analytics and custom events. You only need to maintain one integration to instrument your code and get a future-proof way to send data to over 250 services with the flip of a switch. Not only does it free up your engineers’ time, it lets your business users decide what data they want where. Go to dataengineeringpodcast.com/segmentio today to sign up for their startup plan and get $25,000 in Segment credits and $1 million in free software from marketing and analytics companies like AWS, Google, and Intercom. On top of that you’ll get access to Analytics Academy for the educational resources you need to become an expert in data analytics for measuring product-market fit.
You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, and the Open Data Science Conference. Go to dataengineeringpodcast.com/conferences to learn more and take advantage of our partner discounts when you register.
Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Your host is Tobias Macey and today I’m interviewing Seebs about Pilosa, an open source, distributed bitmap index
Interview
Introduction
How did you get involved in the area of data management?
Can you start by describing what Pilosa is and how the project got started?
Where does Pilosa fit into the overall data ecosystem and how does it integrate into an existing stack?
What types of use cases is Pilosa uniquely well suited for?
The Pilosa data model is fairly unique. Can you talk through how it is represented and implemented?
What are some approaches to modeling data that might be coming from a relational database or some structured flat files?
How do you handle highly dimensional data?
What are some of the decisions that need to be made early in the modeling process which could have ramifications later on in the lifecycle of the project?
What are the scaling factors of Pilosa?
What are some of the most interesting/challenging/unexpected lessons that you have learned in the process of building Pilosa?
What is in store for the future of Pilosa?
Contact Info
Pilosa
Website
Email
@slothware on Twitter
Seebs
seebs on GitHub
Website
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
PQL (Pilosa Query Language)
Roaring Bitmap
Whitepaper
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
4/15/2019 • 43 minutes, 41 seconds
Serverless Data Pipelines On DataCoral
Summary
How much time do you spend maintaining your data pipeline? How much end user value does that provide? Raghu Murthy founded DataCoral as a way to abstract the low level details of ETL so that you can focus on the actual problem that you are trying to solve. In this episode he explains his motivation for building the DataCoral platform, how it is leveraging serverless computing, the challenges of delivering software as a service to customer environments, and the architecture that he has designed to make batch data management easier to work with. This was a fascinating conversation with someone who has spent his entire career working on simplifying complex data problems.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
Managing and auditing access to your servers and databases is a problem that grows in difficulty alongside the growth of your teams. If you are tired of wasting your time cobbling together scripts and workarounds to give your developers, data scientists, and managers the permissions that they need then it’s time to talk to our friends at strongDM. They have built an easy to use platform that lets you leverage your company’s single sign on for your data platform. Go to dataengineeringpodcast.com/strongdm today to find out how you can simplify your systems.
Alluxio is an open source, distributed data orchestration layer that makes it easier to scale your compute and your storage independently. By transparently pulling data from underlying silos, Alluxio unlocks the value of your data and allows for modern computation-intensive workloads to become truly elastic and flexible for the cloud. With Alluxio, companies like Barclays, JD.com, Tencent, and Two Sigma can manage data efficiently, accelerate business analytics, and ease the adoption of any cloud. Go to dataengineeringpodcast.com/alluxio today to learn more and thank them for their support.
You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, and the Open Data Science Conference. Go to dataengineeringpodcast.com/conferences to learn more and take advantage of our partner discounts when you register.
Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Your host is Tobias Macey and today I’m interviewing Raghu Murthy about DataCoral, a platform that offers a fully managed and secure stack in your own cloud that delivers data to where you need it
Interview
Introduction
How did you get involved in the area of data management?
Can you start by explaining what DataCoral is and your motivation for founding it?
How does the data-centric approach of DataCoral differ from the way that other platforms think about processing information?
Can you describe how the DataCoral platform is designed and implemented, and how it has evolved since you first began working on it?
How does the concept of a data slice play into the overall architecture of your platform?
How do you manage transformations of data schemas and formats as they traverse different slices in your platform?
Your site mentions that you have the ability to automatically adjust to changes in external APIs. Can you discuss how that manifests?
What has been your experience, both positive and negative, in building on top of serverless components?
Can you discuss the customer experience of onboarding onto Datacoral and how it differs between existing data platforms and greenfield projects?
What are some of the slices that have proven to be the most challenging to implement?
Are there any that you are currently building that you are most excited for?
How much effort do you anticipate if and/or when you begin to support other cloud providers?
When is Datacoral the wrong choice?
What do you have planned for the future of Datacoral, both from a technical and business perspective?
Contact Info
LinkedIn
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
Datacoral
Yahoo!
Apache Hive
Relational Algebra
Social Capital
EIR == Entrepreneur In Residence
Spark
Kafka
AWS Lambda
DAG == Directed Acyclic Graph
AWS Redshift
AWS Athena
AWS Glue
Noisy Neighbor Problem
CI/CD
SnowflakeDB
DataBricks Delta
AWS Sagemaker
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
4/8/2019 • 53 minutes, 41 seconds
Why Analytics Projects Fail And What To Do About It
Summary
Analytics projects fail all the time, resulting in lost opportunities and wasted resources. There are a number of factors that contribute to that failure and not all of them are under our control. However, many of them are, and as data engineers we can help to keep our projects on the path to success. Eugene Khazin is the CEO of PrimeTSR where he is tasked with rescuing floundering analytics efforts and ensuring that they provide value to the business. In this episode he reflects on the ways that data projects can be structured to provide a higher probability of success and utility, how data engineers can get involved throughout the project lifecycle, and how to salvage a failed project so that some value can be gained from the effort.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
Managing and auditing access to your servers and databases is a problem that grows in difficulty alongside the growth of your teams. If you are tired of wasting your time cobbling together scripts and workarounds to give your developers, data scientists, and managers the permissions that they need then it’s time to talk to our friends at strongDM. They have built an easy to use platform that lets you leverage your company’s single sign on for your data platform. Go to dataengineeringpodcast.com/strongdm today to find out how you can simplify your systems.
Alluxio is an open source, distributed data orchestration layer that makes it easier to scale your compute and your storage independently. By transparently pulling data from underlying silos, Alluxio unlocks the value of your data and allows for modern computation-intensive workloads to become truly elastic and flexible for the cloud. With Alluxio, companies like Barclays, JD.com, Tencent, and Two Sigma can manage data efficiently, accelerate business analytics, and ease the adoption of any cloud. Go to dataengineeringpodcast.com/alluxio today to learn more and thank them for their support.
You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, and the Open Data Science Conference. Go to dataengineeringpodcast.com/conferences to learn more and take advantage of our partner discounts when you register.
Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Your host is Tobias Macey and today I’m interviewing Eugene Khazin about the leading causes for failure in analytics projects
Interview
Introduction
How did you get involved in the area of data management?
The term "analytics" has grown to mean many different things to different people, so can you start by sharing your definition of what is in scope for an "analytics project" for the purposes of this discussion?
What are the criteria that you and your customers use to determine the success or failure of a project?
I was recently speaking with someone who quoted a Gartner report stating an estimated failure rate of ~80% for analytics projects. Has your experience reflected this reality, and what have you found to be the leading causes of failure in your experience at PrimeTSR?
As data engineers, what strategies can we pursue to increase the success rate of the projects that we work on?
What are the contributing factors that are beyond our control, which we can help identify and surface early in the lifecycle of a project?
In the event of a failed project, what are the lessons that we can learn and fold into our future work?
How can we salvage a project and derive some value from the efforts that we have put into it?
What are some useful signals to identify when a project is on the road to failure, and steps that can be taken to rescue it?
What advice do you have for data engineers to help them be more active and effective in the lifecycle of an analytics project?
Contact Info
Email
LinkedIn
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
Prime TSR
Descriptive, Predictive, and Prescriptive Analytics
Azure Data Factory
Azure Data Warehouse
Mulesoft
SSIS (SQL Server Integration Services)
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
4/1/2019 • 36 minutes, 30 seconds
Building An Enterprise Data Fabric At CluedIn
Summary
Data integration is one of the most challenging aspects of any data platform, especially as the variety of data sources and formats grows. Enterprise organizations feel this acutely due to the silos that occur naturally across business units. The CluedIn team experienced this issue first-hand in their previous roles, leading them to found a business focused on building a managed data fabric for the enterprise. In this episode Tim Ward, CEO of CluedIn, joins me to explain how their platform is architected, how they manage the work of integrating with third-party platforms, automating entity extraction and master data management, and providing multiple views of the same data for different use cases. I highly recommend listening closely to his explanation of how they manage consistency of the data that they process across different storage backends.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
Managing and auditing access to your servers and databases is a problem that grows in difficulty alongside the growth of your teams. If you are tired of wasting your time cobbling together scripts and workarounds to give your developers, data scientists, and managers the permissions that they need then it’s time to talk to our friends at strongDM. They have built an easy to use platform that lets you leverage your company’s single sign on for your data platform. Go to dataengineeringpodcast.com/strongdm today to find out how you can simplify your systems.
Alluxio is an open source, distributed data orchestration layer that makes it easier to scale your compute and your storage independently. By transparently pulling data from underlying silos, Alluxio unlocks the value of your data and allows for modern computation-intensive workloads to become truly elastic and flexible for the cloud. With Alluxio, companies like Barclays, JD.com, Tencent, and Two Sigma can manage data efficiently, accelerate business analytics, and ease the adoption of any cloud. Go to dataengineeringpodcast.com/alluxio today to learn more and thank them for their support.
You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, and the Open Data Science Conference. Go to dataengineeringpodcast.com/conferences to learn more and take advantage of our partner discounts when you register.
Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Your host is Tobias Macey and today I’m interviewing Tim Ward about CluedIn, an integration platform for implementing your company’s data fabric
Interview
Introduction
How did you get involved in the area of data management?
Before we get started, can you share your definition of what a data fabric is?
Can you explain what CluedIn is and share the story of how it started?
Can you describe your ideal customer?
What are some of the primary ways that organizations are using CluedIn?
Can you give an overview of the system architecture that you have built and how it has evolved since you first began building it?
For a new customer of CluedIn, what is involved in the onboarding process?
What are some of the most challenging aspects of data integration?
What is your approach to managing the process of cleaning the data that you are ingesting?
How much domain knowledge from a business or industry perspective do you incorporate during onboarding and ongoing execution?
How do you preserve and expose data lineage/provenance to your customers?
How do you manage changes or breakage in the interfaces that you use for source or destination systems?
What are some of the signals that you monitor to ensure the continued healthy operation of your platform?
What are some of the most notable customer success stories that you have experienced?
Are there any notable failures that you have experienced, and if so, what were the lessons learned?
What are some cases where CluedIn is not the right choice?
What do you have planned for the future of CluedIn?
Contact Info
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
CluedIn
Copenhagen, Denmark
A/B Testing
Data Fabric
Dataiku
RapidMiner
Azure Machine Learning Studio
CRM (Customer Relationship Management)
Graph Database
Data Lake
GraphQL
DGraph
Podcast Episode
RabbitMQ
GDPR (General Data Protection Regulation)
Master Data Management
Podcast Interview
OAuth
Docker
Kubernetes
Helm
DevOps
DataOps
DevOps vs DataOps Podcast Interview
Kafka
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
3/25/2019 • 57 minutes, 49 seconds
A DataOps vs DevOps Cookoff In The Data Kitchen
Summary
Delivering a data analytics project on time and with accurate information is critical to the success of any business. DataOps is a set of practices to increase the probability of success by creating value early and often, and using feedback loops to keep your project on course. In this episode Chris Bergh, head chef of Data Kitchen, explains how DataOps differs from DevOps, how the industry has begun adopting DataOps, and how to adopt an agile approach to building your data platform.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
Managing and auditing access to your servers and databases is a problem that grows in difficulty alongside the growth of your teams. If you are tired of wasting your time cobbling together scripts and workarounds to give your developers, data scientists, and managers the permissions that they need then it’s time to talk to our friends at strongDM. They have built an easy to use platform that lets you leverage your company’s single sign on for your data platform. Go to dataengineeringpodcast.com/strongdm today to find out how you can simplify your systems.
"There aren’t enough data conferences out there that focus on the community, so that’s why these folks built a better one": Data Council is the premier community powered data platforms & engineering event for software engineers, data engineers, machine learning experts, deep learning researchers & artificial intelligence buffs who want to discover tools & insights to build new products. This year they will host over 50 speakers and 500 attendees (yeah that’s one of the best "Attendee:Speaker" ratios out there) in San Francisco on April 17-18th and are offering a $200 discount to listeners of the Data Engineering Podcast. Use code: DEP-200 at checkout
You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with organizations such as O’Reilly Media, Dataversity, and the Open Data Science Conference. Go to dataengineeringpodcast.com/conferences to learn more and take advantage of our partner discounts when you register.
Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Your host is Tobias Macey and today I’m interviewing Chris Bergh about the current state of DataOps and why it’s more than just DevOps for data
Interview
Introduction
How did you get involved in the area of data management?
We talked last year about what DataOps is, but can you give a quick overview of how the industry has changed or updated the definition since then?
It is easy to draw parallels between DataOps and DevOps, can you provide some clarity as to how they are different?
How has the conversation around DataOps influenced the design decisions of platforms and system components that are targeting the "big data" and data analytics ecosystem?
One of the commonalities is the desire to use collaboration as a means of reducing silos in a business. In the data management space, those silos are often in the form of distinct storage systems, whether application databases, corporate file shares, CRM systems, etc. What are some techniques that are rooted in the principles of DataOps that can help unify those data systems?
Another shared principle is in the desire to create feedback cycles. How do those feedback loops manifest in the lifecycle of an analytics project?
Testing is critical to ensure the continued health and success of a data project. What are some of the current utilities that are available to data engineers for building and executing tests to cover the data lifecycle, from collection through to analysis and delivery?
What are some of the components of a data analytics lifecycle that are resistant to agile or iterative development?
With the continued rise in the use of machine learning in production, how does that change the requirements for delivery and maintenance of an analytics platform?
What are some of the trends that you are most excited for in the analytics and data platform space?
Contact Info
Data Kitchen
Email
Chris
LinkedIn
@ChrisBergh on Twitter
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
Download the "DataOps Cookbook"
Data Kitchen
Peace Corps
MIT
NASA
Myers-Briggs Personality Test
HBR (Harvard Business Review)
MBA (Master of Business Administration)
W. Edwards Deming
DevOps
Lean Manufacturing
Tableau
Excel
Airflow
Podcast.init Interview
Looker
Podcast Interview
R Language
Alteryx
Data Lake
Data Literacy
Data Governance
Datadog
Kubernetes
Kubeflow
Metis Machine
Gartner Hype Cycle
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
3/18/2019 • 54 minutes, 31 seconds
Customer Analytics At Scale With Segment
Summary
Customer analytics is a problem domain that has given rise to its own industry. In order to gain a full understanding of what your users are doing and how best to serve them you may need to send data to multiple services, each with its own tracking code or APIs. To simplify this process and allow your non-engineering employees to gain access to the information they need to do their jobs, Segment provides a single interface for capturing data and routing it to all of the places that you need it. In this interview Segment CTO and co-founder Calvin French-Owen explains how the company got started, how it manages to multiplex data streams from multiple sources to multiple destinations, and how it can simplify your work of gaining visibility into how your customers are engaging with your business.
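For a sense of what that single instrumentation interface looks like in practice, here is a minimal sketch using Segment's analytics-python library; the write key, user id, and event properties are placeholders. Once these calls are in place, routing the events to additional destinations is a configuration change in Segment rather than another integration to write.

```python
import analytics  # Segment's analytics-python library

# The write key identifies your Segment source; the value here is a placeholder.
analytics.write_key = "YOUR_WRITE_KEY"

# identify() ties traits to a user; track() records a custom event. Segment fans
# these calls out to whichever downstream destinations you have enabled.
analytics.identify("user-123", {"email": "alice@example.com", "plan": "pro"})
analytics.track(
    "user-123",
    "Report Generated",
    {"report_type": "quarterly", "rows": 1842},
)

# Events are batched on a background thread; flush before the process exits.
analytics.flush()
```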
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. And for your machine learning workloads, they just announced dedicated CPU instances. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
Managing and auditing access to your servers and databases is a problem that grows in difficulty alongside the growth of your teams. If you are tired of wasting your time cobbling together scripts and workarounds to give your developers, data scientists, and managers the permissions that they need then it’s time to talk to our friends at strongDM. They have built an easy to use platform that lets you leverage your company’s single sign on for your data platform. Go to dataengineeringpodcast.com/strongdm today to find out how you can simplify your systems.
Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
To help other people find the show please leave a review on iTunes and tell your friends and co-workers
You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data management. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss out on this year’s conference season. We have partnered with O’Reilly Media for the Strata conference in San Francisco on March 25th and the Artificial Intelligence conference in NYC on April 15th. Here in Boston, starting on May 17th, you still have time to grab a ticket to the Enterprise Data World, and from April 30th to May 3rd is the Open Data Science Conference. Go to dataengineeringpodcast.com/conferences to learn more and take advantage of our partner discounts when you register.
Your host is Tobias Macey and today I’m interviewing Calvin French-Owen about the data platform that Segment has built to handle multiplexing continuous streams of data from multiple sources to multiple destinations
Interview
Introduction
How did you get involved in the area of data management?
Can you start by explaining what Segment is and how the business got started?
What are some of the primary ways that your customers are using the Segment platform?
How have the capabilities and use cases of the Segment platform changed since it was first launched?
Layered on top of the data integration platform you have added the concepts of Protocols and Personas. Can you explain how each of those products fit into the overall structure of Segment and the driving force behind their design and use?
What are some of the best practices for structuring custom events in a way that they can be easily integrated with downstream platforms?
How do you manage changes or errors in the events generated by the various sources that you support?
How is the Segment platform architected and how has that architecture evolved over the past few years?
What are some of the unique challenges that you face as a result of being a many-to-many event routing platform?
In addition to the various services that you integrate with for data delivery, you also support populating data warehouses. What is involved in establishing and maintaining the schema and transformations for a customer?
What have been some of the most interesting, unexpected, and/or challenging lessons that you have learned while building and growing the technical and business aspects of Segment?
What are some of the features and improvements, both technical and business, that you have planned for the future?
Contact Info
LinkedIn
@calvinfo on Twitter
Website
calvinfo on GitHub
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
Segment
AWS
ClassMetric
Y Combinator
Amplitude web and mobile analytics
Mixpanel
Kiss Metrics
Hacker News
Segment Connections
User Analytics
SalesForce
Redshift
BigQuery
Kinesis
Google Cloud PubSub
Segment Protocols data governance product
Segment Personas
Heap Analytics
Podcast Episode
Hotel Tonight
Golang
Kafka
GDPR
RocksDB
Dead Letter Queue
Segment Centrifuge
Webhook
Google Analytics
Intercom
Stripe
GRPC
DynamoDB
FoundationDB
Parquet
Podcast Episode
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
3/4/2019 • 47 minutes, 46 seconds
Deep Learning For Data Engineers
Summary
Deep learning is the latest class of technology that is gaining widespread interest. As data engineers we are responsible for building and managing the platforms that power these models. To help us understand what is involved, we are joined this week by Thomas Henson. In this episode he shares his experiences experimenting with deep learning, what data engineers need to know about the infrastructure and data requirements to power the models that your team is building, and how it can be used to supercharge our ETL pipelines.
Announcements
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out our friends at Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute. And don’t forget to thank them for their continued support of this show!
Managing and auditing access to your servers and databases is a problem that grows in difficulty alongside the growth of your teams. If you are tired of wasting your time cobbling together scripts and workarounds to give your developers, data scientists, and managers the permissions that they need then it’s time to talk to our friends at strongDM. They have built an easy to use platform that lets you leverage your company’s single sign on for your data platform. Go to dataengineeringpodcast.com/strongdm today to find out how you can simplify your systems.
Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
To help other people find the show please leave a review on iTunes, or Google Play Music, tell your friends and co-workers, and share it on social media.
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
You listen to this show to learn and stay up to date with what’s happening in databases, streaming platforms, big data, and everything else you need to know about modern data platforms. For even more opportunities to meet, listen, and learn from your peers you don’t want to miss the Strata conference in San Francisco on March 25th and the Artificial Intelligence conference in NYC on April 15th, both run by our friends at O’Reilly Media. Go to dataengineeringpodcast.com/stratacon and dataengineeringpodcast.com/aicon to register today and get 20% off
Your host is Tobias Macey and today I’m interviewing Thomas Henson about what data engineers need to know about deep learning, including how to use it for their own projects
Interview
Introduction
How did you get involved in the area of data management?
Can you start by giving an overview of what deep learning is for anyone who isn’t familiar with it?
What has been your personal experience with deep learning and what set you down that path?
What is involved in building a data pipeline and production infrastructure for a deep learning product?
How does that differ from other types of analytics projects such as data warehousing or traditional ML?
For anyone who is in the early stages of a deep learning project, what are some of the edge cases or gotchas that they should be aware of?
What are your opinions on the level of involvement/understanding that data engineers should have with the analytical products that are being built with the information we collect and curate?
What are some ways that we can use deep learning as part of the data management process?
How does that shift the infrastructure requirements for our platforms?
Cloud providers have been releasing numerous products to provide deep learning and/or GPUs as a managed platform. What are your thoughts on that layer of the build vs buy decision?
What is your litmus test for whether to use deep learning vs explicit ML algorithms or a basic decision tree?
Deep learning algorithms are often a black box in terms of how decisions are made; however, regulations such as GDPR are introducing requirements to explain how a given decision gets made. How does that factor into determining what approach to take for a given project?
For anyone who wants to learn more about deep learning, what are some resources that you recommend?
Contact Info
Website
Pluralsight
@henson_tm on Twitter
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
Pluralsight
Dell EMC
Hadoop
DBA (Database Administrator)
Elasticsearch
Podcast Episode
Spark
Podcast Episode
MapReduce
Deep Learning
Machine Learning
Neural Networks
Feature Engineering
SVD (Singular Value Decomposition)
Andrew Ng
Machine Learning Course
Unstructured Data Solutions Team of Dell EMC
Tensorflow
PyTorch
GPU (Graphics Processing Unit)
Nvidia RAPIDS
Project Hydrogen
Submarine
ETL (Extract, Transform, Load)
Supervised Learning
Unsupervised Learning
Apache Kudu
Podcast Episode
CNN (Convolutional Neural Network)
Sentiment Analysis
DataRobot
GDPR
Weapons Of Math Destruction by Cathy O’Neil
Backpropagation
Deep Learning Bootcamps
Thomas Henson Tensorflow Course on Pluralsight
TFLearn
Google ML Bootcamp
Caffe deep learning framework
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
2/25/2019 • 42 minutes, 46 seconds
Speed Up Your Analytics With The Alluxio Distributed Storage System
Summary
Distributed storage systems are the foundational layer of any big data stack. There are a variety of implementations which support different specialized use cases and come with associated tradeoffs. Alluxio is a distributed virtual filesystem which integrates with multiple persistent storage systems to provide a scalable, in-memory storage layer for scaling computational workloads independent of the size of your data. In this episode Bin Fan explains how he got involved with the project, how it is implemented, and the use cases that it is particularly well suited for. If your storage and compute layers are too tightly coupled and you want to scale them independently then Alluxio is the tool for the job.
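To make the virtual filesystem idea concrete, below is a rough sketch that assumes an Alluxio deployment with an S3 bucket mounted into its namespace and the FUSE interface exposed at a local path; the mount point, bucket, and file names are all hypothetical, and the setup commands in the comments are indicative of the Alluxio shell rather than copied from any particular version.

```python
# Hypothetical sketch: reading data through an Alluxio FUSE mount so that
# compute jobs see one namespace regardless of where the bytes actually live.
#
# Assumed (not verbatim) setup on the Alluxio side, via its shell:
#   alluxio fs mount /s3-data s3://example-bucket/datasets
#   (with the FUSE daemon exposing the Alluxio namespace at /mnt/alluxio)
from pathlib import Path

ALLUXIO_MOUNT = Path("/mnt/alluxio")  # hypothetical local FUSE mount point
dataset = ALLUXIO_MOUNT / "s3-data" / "events-2019-01.csv"  # placeholder file

# The application reads ordinary files; Alluxio serves hot data from memory
# and fetches cold data from the mounted under-store (S3 here) on demand.
with dataset.open() as f:
    header = f.readline().strip().split(",")
    row_count = sum(1 for _ in f)

print(f"columns={header} rows={row_count}")
```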
Introduction
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute.
Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
To help other people find the show please leave a review on iTunes, or Google Play Music, tell your friends and co-workers, and share it on social media.
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Your host is Tobias Macey and today I’m interviewing Bin Fan about Alluxio, a distributed virtual filesystem for unified access to disparate data sources
Interview
Introduction
How did you get involved in the area of data management?
Can you start by explaining what Alluxio is and the history of the project?
What are some of the use cases that Alluxio enables?
How is Alluxio implemented and how has its architecture evolved over time?
What are some of the techniques that you use to mitigate the impact of latency, particularly when interfacing with storage systems across cloud providers and private data centers?
When dealing with large volumes of data over time it is often necessary to age out older records to cheaper storage. What capabilities does Alluxio provide for that lifecycle management?
What are some of the most complex or challenging aspects of providing a unified abstraction across disparate storage platforms?
What are the tradeoffs that are made to provide a single API across systems with varying capabilities?
Testing and verification of distributed systems is a complex undertaking. Can you describe the approach that you use to ensure proper functionality of Alluxio as part of the development and release process?
In order to allow for this large scale testing with any regularity it must be straightforward to deploy and configure Alluxio. What are some of the mechanisms that you have built into the platform to simplify the operational aspects?
Can you describe a typical system topology that incorporates Alluxio?
For someone planning a deployment of Alluxio, what should they be considering in terms of system requirements and deployment topologies?
What are some edge cases or operational complexities that they should be aware of?
What are some cases where Alluxio is the wrong choice?
What are some projects or products that provide a similar capability to Alluxio?
What do you have planned for the future of the Alluxio project and company?
Contact Info
LinkedIn
@binfan on Twitter
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
Alluxio
Project
Company
Carnegie Mellon University
Memcached
Key/Value Storage
UC Berkeley AMPLab
Apache Spark
Podcast Episode
Presto
Podcast Episode
Tensorflow
HDFS
LRU Cache
Hive Metastore
Iceberg Table Format
Podcast Episode
Java
Dependency Hell
Java Class Loader
Apache Zookeeper
Podcast Interview
Raft Consensus Algorithm
Consistent Hashing
Alluxio Testing At Scale Blog Post
S3Guard
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
2/19/2019 • 59 minutes, 44 seconds
Machine Learning In The Enterprise
Summary
Machine learning is a class of technologies that promise to revolutionize business. Unfortunately, it can be difficult to identify and execute on ways that it can be used in large companies. Kevin Dewalt founded Prolego to help Fortune 500 companies build, launch, and maintain their first machine learning projects so that they can remain competitive in our landscape of constant change. In this episode he discusses why machine learning projects require a new set of capabilities, how to build a team from internal and external candidates, and how an example project progressed through each phase of maturity. This was a great conversation for anyone who wants to understand the benefits and tradeoffs of machine learning for their own projects and how to put it into practice.
Introduction
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute.
Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
To help other people find the show please leave a review on iTunes, or Google Play Music, tell your friends and co-workers, and share it on social media.
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Your host is Tobias Macey and today I’m interviewing Kevin Dewalt about his experiences at Prolego, building machine learning projects for Fortune 500 companies
Interview
Introduction
How did you get involved in the area of data management?
For the benefit of software engineers and team leaders who are new to machine learning, can you briefly describe what machine learning is and why it is relevant to them?
What is your primary mission at Prolego and how did you identify, execute on, and establish a presence in your particular market?
How much of your sales process is spent on educating your clients about what AI or ML are and the benefits that these technologies can provide?
What have you found to be the technical skills and capacity necessary for being successful in building and deploying a machine learning project?
When engaging with a client, what have you found to be the most common areas of technical capacity or knowledge that are needed?
Everyone talks about a talent shortage in machine learning. Can you suggest a recruiting or skills development process for companies which need to build out their data engineering practice?
What challenges will teams typically encounter when creating an efficient working relationship between data scientists and data engineers?
Can you briefly describe a successful project of developing a first ML model and putting it into production?
What is the breakdown of how much time was spent on different activities such as data wrangling, model development, and data engineering pipeline development?
When releasing to production, can you share the types of metrics that you track to ensure the health and proper functioning of the models?
What does a deployable artifact for a machine learning/deep learning application look like?
What basic technology stack is necessary for putting the first ML models into production?
How does the build vs. buy debate break down in this space and what products do you typically recommend to your clients?
What are the major risks associated with deploying ML models and how can a team mitigate them?
Suppose a software engineer wants to break into ML. What data engineering skills would you suggest they learn? How should they position themselves for the right opportunity?
Contact Info
Email: Kevin Dewalt kevin@prolego.io and Russ Rands russ@prolego.io
Connect on LinkedIn: Kevin Dewalt and Russ Rands
Twitter: @kevindewalt
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
Prolego
Download our book: Become an AI Company in 90 Days
Google Rules Of ML
AI Winter
Machine Learning
Supervised Learning
O’Reilly Strata Conference
GE Rebranding Commercials
Jez Humble: Stop Hiring Devops Experts (And Start Growing Them)
SQL
ORM
Django
RoR
Tensorflow
PyTorch
Keras
Data Engineering Podcast Episode About Data Teams
DevOps For Data Teams – DevOps Days Boston Presentation by Tobias
Jupyter Notebook
Data Engineering Podcast: Notebooks at Netflix
Pandas
Podcast Interview
Joel Grus
JupyterCon Presentation
Data Science From Scratch
Expensify
Airflow
James Meickle Interview
Git
Jenkins
Continuous Integration
Practical Deep Learning For Coders Course by Jeremy Howard
Data Carpentry
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
2/11/2019 • 48 minutes, 18 seconds
Cleaning And Curating Open Data For Archaeology
Summary
Archaeologists collect and create a variety of data as part of their research and exploration. Open Context is a platform for cleaning, curating, and sharing this data. In this episode Eric Kansa describes how they process, clean, and normalize the data that they host, the challenges that they face with scaling ETL processes which require domain specific knowledge, and how the information contained in connections that they expose is being used for interesting projects.
Introduction
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute.
Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
To help other people find the show please leave a review on iTunes, or Google Play Music, tell your friends and co-workers, and share it on social media.
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Your host is Tobias Macey and today I’m interviewing Eric Kansa about Open Context, a platform for publishing, managing, and sharing research data
Interview
Introduction
How did you get involved in the area of data management?
I did some database and GIS work for my dissertation in archaeology, back in the late 1990s. I got frustrated at the lack of comparative data, and I got frustrated at all the work I put into creating data that nobody would likely use. So I decided to focus my energies on research data management.
Can you start by describing what Open Context is and how it started?
Open Context is an open access data publishing service for archaeology. It started because we need better ways of disseminating structured data and digital media than are possible with conventional articles, books, and reports.
What are your protocols for determining which data sets you will work with?
Datasets need to come from research projects that meet the normal standards of professional conduct (laws, ethics, professional norms) articulated by archaeology’s professional societies.
What are some of the challenges unique to research data?
What are some of the unique requirements for processing, publishing, and archiving research data?
You have to work on a shoe-string budget, essentially providing "public goods". Archaeologists typically don’t have much discretionary money available, and publishing and archiving data are not yet very common practices.
Another issue is that it will take a long time to publish enough data to power many "meta-analyses" that draw upon many datasets. Lots of archaeological data describes very particular places and times. Because datasets can be so particularistic, finding data relevant to your interests can be hard. So we face a monumental task in supplying enough data to satisfy many, many particularistic interests.
How much education is necessary around your content licensing for researchers who are interested in publishing their data with you?
We require use of Creative Commons licenses, and greatly encourage the CC-BY license or CC-Zero (public domain) to try to keep things simple and easy to understand.
Can you describe the system architecture that you use for Open Context?
Open Context is a Django Python application with a Postgres database and an Apache Solr index. It runs on Google Cloud services on Debian Linux.
What is the process for cleaning and formatting the data that you host?
How much domain expertise is necessary to ensure proper conversion of the source data?
That’s one of the bottlenecks. We have to do an ETL (extract, transform, load) on each dataset researchers submit for publication. Each dataset may need lots of cleaning and back-and-forth conversations with data creators.
Can you discuss the challenges that you face in maintaining a consistent ontology?
What pieces of metadata do you track for a given data set?
Can you speak to the average size of data sets that you manage and any approach that you use to optimize for cost of storage and processing capacity?
Can you walk through the lifecycle of a given data set?
Data archiving is a complicated and difficult endeavor due to issues pertaining to changing data formats and storage media, as well as repeatability of computing environments to generate and/or process them. Can you discuss the technical and procedural approaches that you take to address those challenges?
Once the data is stored you expose it for public use via a set of APIs which support linked data. Can you discuss any complexities that arise from needing to identify and expose interrelations between the data sets?
What are some of the most interesting uses you have seen of the data that is hosted on Open Context?
What have been some of the most interesting/useful/challenging lessons that you have learned while working on Open Context?
What are your goals for the future of Open Context?
Contact Info
@ekansa on Twitter
LinkedIn
ResearchGate
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
Open Context
Bronze Age
GIS (Geographic Information System)
Filemaker
Access Database
Excel
Creative Commons
Open Context On Github
Django
PostgreSQL
Apache Solr
GeoJSON
JSON-LD
RDF
OCHRE
SKOS (Simple Knowledge Organization System)
Django Reversion
California Digital Library
Zenodo
CERN
Digital Index of North American Archaeology (DINAA)
Ansible
Docker
OpenRefine
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
2/4/2019 • 1 hour, 55 seconds
Managing Database Access Control For Teams With strongDM
Summary
Controlling access to a database is a solved problem… right? It can be straightforward for small teams and a small number of storage engines, but once either or both of those start to scale, things quickly become complex and difficult to manage. After years of running across the same issues in numerous companies and even more projects, Justin McCarthy built strongDM to solve database access management for everyone. In this episode he explains how the strongDM proxy works to grant and audit access to storage systems and the benefits that it provides to engineers and team leads.
Introduction
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute.
Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
To help other people find the show please leave a review on iTunes, or Google Play Music, tell your friends and co-workers, and share it on social media.
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Your host is Tobias Macey and today I’m interviewing Justin McCarthy about StrongDM, a hosted service that simplifies access controls for your data
Interview
Introduction
How did you get involved in the area of data management?
Can you start by explaining the problem that StrongDM is solving and how the company got started?
What are some of the most common challenges around managing access and authentication for data storage systems?
What are some of the most interesting workarounds that you have seen?
Which areas of authentication, authorization, and auditing are most commonly overlooked or misunderstood?
Can you describe the architecture of your system?
What strategies have you used to enable interfacing with such a wide variety of storage systems?
What additional capabilities do you provide beyond what is natively available in the underlying systems?
What are some of the most difficult aspects of managing varying levels of permission for different roles across the diversity of platforms that you support, given that they each have different capabilities natively?
For a customer who is onboarding, what is involved in setting up your platform to integrate with their systems?
What are some of the assumptions that you made about your problem domain and market when you first started which have been disproven?
How do organizations in different industries react to your product and how do their policies around granting access to data differ?
What are some of the most interesting/unexpected/challenging lessons that you have learned in the process of building and growing StrongDM?
Contact Info
LinkedIn
@justinm on Twitter
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
StrongDM
Authentication Vs. Authorization
Hashicorp Vault
Configuration Management
Chef
Puppet
SaltStack
Ansible
Okta
SSO (Single Sign On)
SOC 2
Two Factor Authentication
SSH (Secure SHell)
RDP
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
1/29/2019 • 42 minutes, 17 seconds
Building Enterprise Big Data Systems At LEGO
Summary
Building internal expertise around big data in a large organization is a major competitive advantage. However, it can be a difficult process due to compliance needs and the need to scale globally on day one. In this episode Jesper Søgaard and Keld Antonsen share the story of starting and growing the big data group at LEGO. They discuss the challenges of being at global scale from the start, hiring and training talented engineers, prototyping and deploying new systems in the cloud, and what they have learned in the process. This is a useful conversation for engineers, managers, and leadership who are interested in building enterprise big data systems.
Preamble
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute.
Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
To help other people find the show please leave a review on iTunes, or Google Play Music, tell your friends and co-workers, and share it on social media.
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Your host is Tobias Macey and today I’m interviewing Keld Antonsen and Jesper Soegaard about the data infrastructure and analytics that powers LEGO
Interview
Introduction
How did you get involved in the area of data management?
My understanding is that the big data group at LEGO is a fairly recent development. Can you share the story of how it got started?
What kinds of data practices were in place prior to starting a dedicated group for managing the organization’s data?
What was the transition process like, migrating data silos into a uniformly managed platform?
What are the biggest data challenges that you face at LEGO?
What are some of the most critical sources and types of data that you are managing?
What are the main components of the data infrastructure that you have built to support the organization’s analytical needs?
What are some of the technologies that you have found to be most useful?
Which have been the most problematic?
What does the team structure look like for the data services at LEGO?
Does that reflect in the types/numbers of systems that you support?
What types of testing, monitoring, and metrics do you use to ensure the health of the systems you support?
What have been some of the most interesting, challenging, or useful lessons that you have learned while building and maintaining the data platforms at LEGO?
How have the data systems at Lego evolved over recent years as new technologies and techniques have been developed?
How does the global nature of the LEGO business influence the design strategies and technology choices for your platform?
What are you most excited for in the coming year?
Contact Info
Jesper
LinkedIn
Keld
LinkedIn
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
LEGO Group
ERP (Enterprise Resource Planning)
Predictive Analytics
Prescriptive Analytics
Hadoop
Center Of Excellence
Continuous Integration
Spark
Podcast Episode
Apache NiFi
Podcast Episode
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
1/21/2019 • 48 minutes, 3 seconds
TimescaleDB: The Timeseries Database Built For SQL And Scale - Episode 65
Summary
The past year has been an active one for the timeseries market. New products have been launched, more businesses have moved to streaming analytics, and the team at Timescale has been keeping busy. In this episode TimescaleDB CEO Ajay Kulkarni and CTO Michael Freedman stop by to talk about their 1.0 release, how the use cases for timeseries data have proliferated, and how they are continuing to simplify the task of processing your time-oriented events.
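For readers who have not seen Timescale’s SQL surface, the following sketch (Python with psycopg2, assuming a PostgreSQL server with the timescaledb extension installed) shows the general pattern of promoting a table to a hypertable and aggregating with time_bucket; the table and column names are made up for illustration.

```python
# Minimal sketch of the TimescaleDB usage pattern, assuming a PostgreSQL
# server with the timescaledb extension available. Names are illustrative.
import psycopg2

conn = psycopg2.connect("dbname=metrics user=postgres")  # placeholder DSN
cur = conn.cursor()

# A plain Postgres table...
cur.execute("""
    CREATE TABLE IF NOT EXISTS conditions (
        time        TIMESTAMPTZ NOT NULL,
        device_id   TEXT,
        temperature DOUBLE PRECISION
    );
""")
# ...promoted to a hypertable so Timescale partitions it by time under the hood.
cur.execute("SELECT create_hypertable('conditions', 'time', if_not_exists => TRUE);")

cur.execute(
    "INSERT INTO conditions VALUES (now(), %s, %s);",
    ("sensor-1", 21.7),
)

# Standard SQL plus Timescale's time_bucket() for downsampling.
cur.execute("""
    SELECT time_bucket('15 minutes', time) AS bucket,
           device_id,
           avg(temperature)
    FROM conditions
    GROUP BY bucket, device_id
    ORDER BY bucket;
""")
print(cur.fetchall())

conn.commit()
cur.close()
conn.close()
```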
Introduction
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute.
Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
To help other people find the show please leave a review on iTunes, or Google Play Music, tell your friends and co-workers, and share it on social media.
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Your host is Tobias Macey and today I’m welcoming Ajay Kulkarni and Mike Freedman back to talk about how TimescaleDB has grown and changed over the past year
Interview
Introduction
How did you get involved in the area of data management?
Can you refresh our memory about what TimescaleDB is?
How has the market for timeseries databases changed since we last spoke?
What has changed in the focus and features of the TimescaleDB project and company?
Toward the end of 2018 you launched the 1.0 release of Timescale. What were your criteria for establishing that milestone?
What were the most challenging aspects of reaching that goal?
In terms of timeseries workloads, what are some of the factors that differ across varying use cases?
How do those differences impact the ways in which Timescale is used by the end user, and built by your team?
What are some of the initial assumptions that you made while first launching Timescale that have held true, and which have been disproven?
How have the improvements and new features in the recent releases of PostgreSQL impacted the Timescale product?
Have you been able to leverage some of the native improvements to simplify your implementation?
Are there any use cases for Timescale that would have been previously impractical in vanilla Postgres that would now be reasonable without the help of Timescale?
What is in store for the future of the Timescale product and organization?
Contact Info
Ajay
@acoustik on Twitter
LinkedIn
Mike
LinkedIn
Website
@michaelfreedman on Twitter
Timescale
Website
Documentation
Careers
timescaledb on GitHub
@timescaledb on Twitter
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
TimescaleDB
Original Appearance on the Data Engineering Podcast
1.0 Release Blog Post
PostgreSQL
Podcast Interview
RDS
DB-Engines
MongoDB
IOT (Internet Of Things)
AWS Timestream
Kafka
Pulsar
Podcast Episode
Spark
Podcast Episode
Flink
Podcast Episode
Hadoop
DevOps
PipelineDB
Podcast Interview
Grafana
Tableau
Prometheus
OLTP (Online Transaction Processing)
Oracle DB
Data Lake
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
1/14/2019 • 41 minutes, 25 seconds
Performing Fast Data Analytics Using Apache Kudu - Episode 64
Summary
The Hadoop platform is purpose-built for processing large, slow-moving data in long-running batch jobs. As the ecosystem around it has grown, so has the need for fast data analytics on fast moving data. To fill this need the Kudu project was created with a column oriented table format that was tuned for high volumes of writes and rapid query execution across those tables. For a perfect pairing, they made it easy to connect to the Impala SQL engine. In this episode Brock Noland and Jordan Birdsell from PhData explain how Kudu is architected, how it compares to other storage systems in the Hadoop orbit, and how to start integrating it into your analytics pipeline.
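As a rough sketch of what that pairing looks like from the SQL side, the snippet below uses the impyla client to create a Kudu-backed table through Impala and query it; the host, schema, and partitioning are placeholders, and DDL details can vary across Impala and Kudu versions.

```python
# Hypothetical sketch: defining and querying a Kudu-backed table via Impala,
# using the impyla client (pip install impyla). Hosts and schema are placeholders.
from impala.dbapi import connect

conn = connect(host="impala-coordinator.example.com", port=21050)  # placeholder
cur = conn.cursor()

# Kudu tables are columnar, mutable, and declared with a primary key;
# hash partitioning spreads writes across tablet servers.
cur.execute("""
    CREATE TABLE IF NOT EXISTS device_metrics (
        device_id STRING,
        ts BIGINT,
        metric STRING,
        value DOUBLE,
        PRIMARY KEY (device_id, ts, metric)
    )
    PARTITION BY HASH (device_id) PARTITIONS 8
    STORED AS KUDU
""")

# Upserts land in Kudu and are visible to the next scan, which is the
# property that enables analytics on fast-moving data.
cur.execute("UPSERT INTO device_metrics VALUES ('dev-1', 1546300800, 'temp', 21.7)")

cur.execute("""
    SELECT device_id, metric, avg(value)
    FROM device_metrics
    GROUP BY device_id, metric
""")
print(cur.fetchall())
```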
Preamble
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute.
Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
To help other people find the show please leave a review on iTunes, or Google Play Music, tell your friends and co-workers, and share it on social media.
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Your host is Tobias Macey and today I’m interviewing Brock Noland and Jordan Birdsell about Apache Kudu and how it is able to provide fast analytics on fast data in the Hadoop ecosystem
Interview
Introduction
How did you get involved in the area of data management?
Can you start by explaining what Kudu is and the motivation for building it?
How does it fit into the Hadoop ecosystem?
How does it compare to the work being done on the Iceberg table format?
What are some of the common application and system design patterns that Kudu supports?
How is Kudu architected and how has it evolved over the life of the project?
There are many projects in and around the Hadoop ecosystem that rely on Zookeeper as a building block for consensus. What was the reasoning for using Raft in Kudu?
How does the storage layer in Kudu differ from what would be found in systems like Hive or HBase?
What are the implementation details in the Kudu storage interface that have had the greatest impact on its overall speed and performance?
A number of the projects built for large scale data processing were not initially built with a focus on operational simplicity. What are the features of Kudu that simplify deployment and management of production infrastructure?
What was the motivation for using C++ as the language target for Kudu?
If you were to start the project over today what would you do differently?
What are some situations where you would advise against using Kudu?
What have you found to be the most interesting/unexpected/challenging lessons learned in the process of building and maintaining Kudu?
What are you most excited about for the future of Kudu?
Contact Info
Brock
LinkedIn
@brocknoland on Twitter
Jordan
LinkedIn
@jordanbirdsell
jbirdsell on GitHub
PhData
Website
phdata on GitHub
@phdatainc on Twitter
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
Kudu
PhData
Getting Started with Apache Kudu
Thomson Reuters
Hadoop
Oracle Exadata
Slowly Changing Dimensions
HDFS
S3
Azure Blob Storage
State Farm
Stanley Black & Decker
ETL (Extract, Transform, Load)
Parquet
Podcast Episode
ORC
HBase
Spark
Podcast Episode
Impala
Netflix Iceberg
Podcast Episode
Hive ACID
IOT (Internet Of Things)
Streamsets
NiFi
Podcast Episode
Kafka Connect
Moore’s Law
3D XPoint
Raft Consensus Algorithm
STONITH (Shoot The Other Node In The Head)
Yarn
Cython
Podcast.__init__ Episode
Pandas
Podcast.__init__ Episode
Cloudera Manager
Apache Sentry
Collibra
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
1/7/2019 • 50 minutes, 46 seconds
Simplifying Continuous Data Processing Using Stream Native Storage In Pravega with Tom Kaitchuck - Episode 63
Summary
As more companies and organizations are working to gain a real-time view of their business, they are increasingly turning to stream processing technologies to fulfill that need. However, the storage requirements for continuous, unbounded streams of data are markedly different from those of batch-oriented workloads. To address this shortcoming the team at Dell EMC has created the open source Pravega project. In this episode Tom Kaitchuck explains how Pravega simplifies storage and processing of data streams, how it integrates with processing engines such as Flink, and the unique capabilities that it provides in the area of exactly-once processing and transactions. And if you listen at approximately the half-way mark, you can hear the host’s mind being blown by the possibilities of treating everything, including schema information, as a stream.
Preamble
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute.
Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
To help other people find the show please leave a review on iTunes, or Google Play Music, tell your friends and co-workers, and share it on social media.
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Your host is Tobias Macey and today I’m interviewing Tom Kaitchuck about Pravega, an open source data storage platform optimized for persistent streams
Interview
Introduction
How did you get involved in the area of data management?
Can you start by explaining what Pravega is and the story behind it?
What are the use cases for Pravega and how does it fit into the data ecosystem?
How does it compare with systems such as Kafka and Pulsar for ingesting and persisting unbounded data?
How do you represent a stream on-disk?
What are the benefits of using this format for persisted streams?
One of the compelling aspects of Pravega is the automatic sharding and resource allocation for variations in data patterns. Can you describe how that operates and the benefits that it provides?
I am also intrigued by the automatic tiering of the persisted storage. How does that work and what options exist for managing the lifecycle of the data in the cluster?
For someone who wants to build an application on top of Pravega, what interfaces does it provide and what architectural patterns does it lend itself toward?
What are some of the unique system design patterns that are made possible by Pravega?
How is Pravega architected internally?
What is involved in integrating engines such as Spark, Flink, or Storm with Pravega?
A common challenge for streaming systems is exactly once semantics. How does Pravega approach that problem?
Does it have any special capabilities for simplifying processing of out-of-order events?
For someone planning a deployment of Pravega, what is involved in building and scaling a cluster?
What are some of the operational edge cases that users should be aware of?
What are some of the most interesting, useful, or challenging experiences that you have had while building Pravega?
What are some cases where you would recommend against using Pravega?
What is in store for the future of Pravega?
Contact Info
tkaitchuk on GitHub
LinkedIn
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
Pravega
Amazon SQS (Simple Queue Service)
Amazon Simple Workflow Service (SWF)
Azure
EMC
Zookeeper
Podcast Episode
Bookkeeper
Kafka
Pulsar
Podcast Episode
RocksDB
Flink
Podcast Episode
Spark
Podcast Episode
Heron
Lambda Architecture
Kappa Architecture
Erasure Code
Flink Forward Conference
CAP Theorem
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
12/31/2018 • 44 minutes, 42 seconds
Continuously Query Your Time-Series Data Using PipelineDB with Derek Nelson and Usman Masood - Episode 62
Summary
Processing high-velocity time-series data in real time is a complex challenge. The team at PipelineDB has built a continuous query engine that simplifies the task of computing aggregates across incoming streams of events. In this episode Derek Nelson and Usman Masood explain how it is architected, strategies for designing your data flows, how to scale it up and out, and edge cases to be aware of.
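To ground the idea of a continuous query, here is an illustrative sketch driven from Python with psycopg2. The DDL follows the older standalone PipelineDB syntax (CREATE STREAM and CREATE CONTINUOUS VIEW); the 1.0 extension packaging discussed in the episode changed parts of this surface, so treat it as a sketch rather than exact syntax.

```python
# Illustrative sketch of PipelineDB's continuous aggregation model, driven
# through psycopg2. DDL follows the older standalone syntax; the 1.0
# extension changed some of these statements, so consult the docs.
import psycopg2

conn = psycopg2.connect("dbname=pipeline user=postgres")  # placeholder DSN
conn.autocommit = True
cur = conn.cursor()

# A stream is a write-only relation: rows flow through it and are not stored.
cur.execute("CREATE STREAM page_views (url TEXT, latency_ms INT)")

# A continuous view incrementally maintains its aggregate as events arrive.
cur.execute("""
    CREATE CONTINUOUS VIEW view_counts AS
    SELECT url, count(*) AS hits, avg(latency_ms) AS avg_latency
    FROM page_views
    GROUP BY url
""")

# Writing to the stream is just an INSERT; the aggregates update immediately.
cur.execute("INSERT INTO page_views VALUES ('/home', 42), ('/home', 58), ('/docs', 31)")

cur.execute("SELECT * FROM view_counts ORDER BY hits DESC")
print(cur.fetchall())
```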
Preamble
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute.
Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Your host is Tobias Macey and today I’m interviewing Usman Masood and Derek Nelson about PipelineDB, an open source continuous query engine for PostgreSQL
Interview
Introduction
How did you get involved in the area of data management?
Can you start by explaining what PipelineDB is and the motivation for creating it?
What are the major use cases that it enables?
What are some example applications that are uniquely well suited to the capabilities of PipelineDB?
What are the major concepts and components that users of PipelineDB should be familiar with?
Given the fact that it is a plugin for PostgreSQL, what level of compatibility exists between PipelineDB and other plugins such as Timescale and Citus?
What are some of the common patterns for populating data streams?
What are the options for scaling PipelineDB systems, both vertically and horizontally?
How much elasticity does the system support in terms of changing volumes of inbound data?
What are some of the limitations or edge cases that users should be aware of?
Given that inbound data is not persisted to disk, how do you guard against data loss?
Is it possible to archive the data in a stream, unaltered, to a separate destination table or other storage location?
Can a separate table be used as an input stream?
Since the data being processed by the continuous queries is potentially unbounded, how do you approach checkpointing or windowing the data in the continuous views?
What are some of the features that you have found to be the most useful which users might initially overlook?
What would be involved in generating an alert or notification on an aggregate output that was in some way anomalous?
What are some of the most challenging aspects of building continuous aggregates on unbounded data?
What have you found to be some of the most interesting, complex, or challenging aspects of building and maintaining PipelineDB?
What are some of the most interesting or unexpected ways that you have seen PipelineDB used?
When is PipelineDB the wrong choice?
What do you have planned for the future of PipelineDB now that you have hit the 1.0 milestone?
Contact Info
Derek
derekjn on GitHub
LinkedIn
Usman
@usmanm on Twitter
Website
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
PipelineDB
Stride
PostgreSQL
Podcast Episode
AdRoll
Probabilistic Data Structures
TimescaleDB
Podcast Episode
Hive
Redshift
Kafka
Kinesis
ZeroMQ
Nanomsg
HyperLogLog
Bloom Filter
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
12/24/2018 • 1 hour, 3 minutes, 51 seconds
Advice On Scaling Your Data Pipeline Alongside Your Business with Christian Heinzmann - Episode 61
Summary
Every business needs a pipeline for its critical data, even if that is just pasting values into a spreadsheet. As the organization grows and gains more customers, the requirements for that pipeline will change. In this episode Christian Heinzmann, Head of Data Warehousing at Grubhub, discusses the various requirements for data pipelines and how the overall system architecture evolves as more data is being processed. He also covers the changes in how the output of the pipelines are used, how that impacts the expectations for accuracy and availability, and some useful advice on build vs. buy for the components of a data platform.
Preamble
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute.
Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Your host is Tobias Macey and today I’m interviewing Christian Heinzmann about how data pipelines evolve as your business grows
Interview
Introduction
How did you get involved in the area of data management?
Can you start by sharing your definition of a data pipeline?
At what point in the life of a project or organization should you start thinking about building a pipeline?
In the early stages when the scale of the data and business are still small, what are some of the design characteristics that you should be targeting for your pipeline?
What metrics/use cases should you be optimizing for at this point?
What are some of the indicators that you look for to signal that you are reaching the next order of magnitude in terms of scale?
How do the design requirements for a data pipeline change as you reach this stage?
What are some of the challenges and complexities that begin to present themselves as you build and run your pipeline at medium scale?
What are some of the changes that are necessary as you move to a large scale data pipeline?
At each level of scale it is important to minimize the impact of the ETL process on the source systems. What are some strategies that you have employed to avoid degrading the performance of the application systems?
In recent years there has been a shift to using data lakes as a staging ground before performing transformations. What are your thoughts on that approach?
When performing transformations there is a potential for discarding information or losing fidelity. How have you worked to reduce the impact of this effect?
Transformations of the source data can be brittle when the format or volume changes. How do you design the pipeline to be resilient to these types of changes?
What are your selection criteria when determining what workflow or ETL engines to use in your pipeline?
How has your preference of build vs buy changed at different scales of operation and as new/different projects become available?
What are some of the dead ends or edge cases that you have had to deal with in your current role at Grubhub?
What are some of the common mistakes or overlooked aspects of building a data pipeline that you have seen?
What are your plans for improving your current pipeline at Grubhub?
What are some references that you recommend for anyone who is designing a new data platform?
Contact Info
@sirchristian on Twitter
Blog
sirchristian on GitHub
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
Scaling ETL blog post
GrubHub
Data Warehouse
Redshift
Spark
Spark In Action Podcast Episode
Hive
Amazon EMR
Looker
Podcast Episode
Redash
Metabase
Podcast Episode
A Primer on Enterprise Data Curation
Pub/Sub (Publish-Subscribe Pattern)
Change Data Capture
Jenkins
Python
Azkaban
Luigi
Zendesk
Data Lineage
AirBnB Engineering Blog
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
12/17/2018 • 39 minutes, 22 seconds
Putting Apache Spark Into Action with Jean Georges Perrin - Episode 60
Summary
Apache Spark is a popular and widely used tool for a variety of data-oriented projects. With the large array of capabilities and the complexity of the underlying system, it can be difficult to understand how to get started using it. Jean Georges Perrin has been so impressed by the versatility of Spark that he is writing a book to help data engineers hit the ground running. In this episode he helps to make sense of what Spark is, how it works, and the various ways that you can use it. He also discusses what you need to know to get it deployed and keep it running in a production environment and how it fits into the overall data ecosystem.
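For anyone who has not yet touched Spark, this short PySpark sketch shows the flavor of the DataFrame API that the book builds on; the input path and column names are placeholders.

```python
# A minimal PySpark sketch of the DataFrame API. The file path and column
# names are placeholders; run with spark-submit or an interactive pyspark shell.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("orders-summary").getOrCreate()

# Load a CSV with a header row and let Spark infer the column types.
orders = (
    spark.read
    .option("header", True)
    .option("inferSchema", True)
    .csv("data/orders.csv")  # placeholder path
)

# Transformations are lazy; nothing executes until an action like show().
summary = (
    orders
    .filter(F.col("status") == "completed")
    .groupBy("country")
    .agg(F.count("*").alias("orders"), F.sum("amount").alias("revenue"))
    .orderBy(F.desc("revenue"))
)

summary.show()
spark.stop()
```

Moving the same script from a laptop to a production cluster typically requires no changes to the transformation logic, only to how the session and its resources are configured.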
Preamble
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute.
Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Your host is Tobias Macey and today I’m interviewing Jean Georges Perrin, author of the upcoming Manning book Spark In Action 2nd Edition, about the ways that Spark is used and how it fits into the data landscape
Interview
Introduction
How did you get involved in the area of data management?
Can you start by explaining what Spark is?
What are some of the main use cases for Spark?
What are some of the problems that Spark is uniquely suited to address?
Who uses Spark?
What are the tools offered to Spark users?
How does it compare to some of the other streaming frameworks such as Flink, Kafka, or Storm?
For someone building on top of Spark what are the main software design paradigms?
How does the design of an application change as you go from a local development environment to a production cluster?
Once your application is written, what is involved in deploying it to a production environment?
What are some of the most useful strategies that you have seen for improving the efficiency and performance of a processing pipeline?
What are some of the edge cases and architectural considerations that engineers should be considering as they begin to scale their deployments?
What are some of the common ways that Spark is deployed, in terms of the cluster topology and the supporting technologies?
What are the limitations of the Spark programming model?
What are the cases where Spark is the wrong choice?
What was your motivation for writing a book about Spark?
Who is the target audience?
What have been some of the most interesting or useful lessons that you have learned in the process of writing a book about Spark?
What advice do you have for anyone who is considering or currently using Spark?
Contact Info
@jgperrin on Twitter
Blog
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Book Discount
Use the code poddataeng18 to get 40% off of all of Manning’s products at manning.com
Links
Apache Spark
Spark In Action
Book code examples in GitHub
Informix
International Informix Users Group
MySQL
Microsoft SQL Server
ETL (Extract, Transform, Load)
Spark SQL and Spark In Action’s chapter 11
Spark ML and Spark In Action’s chapter 18
Spark Streaming (structured) and Spark In Action’s chapter 10
Spark GraphX
Hadoop
Jupyter
Podcast Interview
Zeppelin
Databricks
IBM Watson Studio
Kafka
Flink
Podcast Episode
AWS Kinesis
Yarn
HDFS
Hive
Scala
PySpark
DAG
Spark Catalyst
Spark Tungsten
Spark UDF
AWS EMR
Mesos
DC/OS
Kubernetes
Dataframes
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
12/10/2018 • 50 minutes, 31 seconds
Apache Zookeeper As A Building Block For Distributed Systems with Patrick Hunt - Episode 59
Summary
Distributed systems are complex to build and operate, and there are certain primitives that are common to a majority of them. Rather than re-implement the same capabilities every time, many projects build on top of Apache Zookeeper. In this episode Patrick Hunt explains how the Apache Zookeeper project was started, how it functions, and how it is used as a building block for other distributed systems. He also explains the operational considerations for running your own cluster, how it compares to more recent entrants such as Consul and EtcD, and what is in store for the future.
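As a concrete illustration of the primitives discussed in the episode, the sketch below uses the kazoo Python client to register an ephemeral znode and take a distributed lock. It is a minimal example that assumes a locally reachable Zookeeper ensemble; the host address, paths, and identifiers are placeholders.

```python
# A small illustration of Zookeeper coordination primitives using the kazoo client.
# Host address and znode paths are placeholders.
from kazoo.client import KazooClient

zk = KazooClient(hosts="127.0.0.1:2181")
zk.start()

# Ephemeral znodes disappear when the session ends, which is the basis
# for service registration and leader election patterns.
zk.ensure_path("/services/web")
zk.create("/services/web/instance-", b"10.0.0.5:8080", ephemeral=True, sequence=True)

# Higher-level recipes such as distributed locks are built on those primitives.
lock = zk.Lock("/locks/nightly-job", "worker-1")
with lock:
    print("only one worker runs this section at a time")

zk.stop()
```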
Preamble
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute.
Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Your host is Tobias Macey and today I’m interviewing Patrick Hunt about Apache Zookeeper and how it is used as a building block for distributed systems
Interview
Introduction
How did you get involved in the area of data management?
Can you start by explaining what Zookeeper is and how the project got started?
What are the main motivations for using a centralized coordination service for distributed systems?
What are the distributed systems primitives that are built into Zookeeper?
What are some of the higher-order capabilities that Zookeeper provides to users who are building distributed systems on top of Zookeeper?
What are some of the types of system level features that application developers will need which aren’t provided by Zookeeper?
Can you discuss how Zookeeper is architected and how that design has evolved over time?
What have you found to be some of the most complicated or difficult aspects of building and maintaining Zookeeper?
What are the scaling factors for Zookeeper?
What are the edge cases that users should be aware of?
Where does it fall on the axes of the CAP theorem?
What are the main failure modes for Zookeeper?
How much of the recovery logic is left up to the end user of the Zookeeper cluster?
Since there are a number of projects that rely on Zookeeper, many of which are likely to be run in the same environment (e.g. Kafka and Flink), what would be involved in sharing a single Zookeeper cluster among those multiple services?
In recent years we have seen projects such as EtcD, which is used by Kubernetes, and Consul. How does Zookeeper compare with those projects?
What are some of the cases where Zookeeper is the wrong choice?
How have the needs of distributed systems engineers changed since you first began working on Zookeeper?
If you were to start the project over today, what would you do differently?
Would you still use Java?
What are some of the most interesting or unexpected ways that you have seen Zookeeper used?
What do you have planned for the future of Zookeeper?
Contact Info
@phunt on Twitter
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
Zookeeper
Cloudera
Google Chubby
Sourceforge
HBase
High Availability
Fallacies of distributed computing
Falsehoods programmers believe about networking
Consul
EtcD
Apache Curator
Raft Consensus Algorithm
Zookeeper Atomic Broadcast
SSD Write Cliff
Apache Kafka
Apache Flink
Podcast Episode
HDFS
Kubernetes
Netty
Protocol Buffers
Avro
Rust
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
12/3/2018 • 54 minutes, 25 seconds
Set Up Your Own Data-as-a-Service Platform On Dremio with Tomer Shiran - Episode 58
Summary
When your data lives in multiple locations, belonging to at least as many applications, it is exceedingly difficult to ask complex questions of it. The default way to manage this situation is by crafting pipelines that will extract the data from source systems and load it into a data lake or data warehouse. In order to make this situation more manageable, and to allow everyone in the business to gain value from the data, the folks at Dremio built a self-service data platform. In this episode Tomer Shiran, CEO and co-founder of Dremio, explains how it fits into the modern data landscape, how it works under the hood, and how you can start using it today to make your life easier.
Preamble
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute.
Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Your host is Tobias Macey and today I’m interviewing Tomer Shiran about Dremio, the open source data as a service platform
Interview
Introduction
How did you get involved in the area of data management?
Can you start by explaining what Dremio is and how the project and business got started?
What was the motivation for keeping your primary product open source?
What is the governance model for the project?
How does Dremio fit in the current landscape of data tools?
What are some use cases that Dremio is uniquely equipped to support?
Do you think that Dremio obviates the need for a data warehouse or large scale data lake?
How is Dremio architected internally?
How has that architecture evolved from when it was first built?
There is a large array of components (e.g. governance, lineage, catalog) built into Dremio that are often found in dedicated products. What are some of the strategies that you have as a business and development team to manage and integrate the complexity of the product?
What are the benefits of integrating all of those capabilities into a single system?
What are the drawbacks?
One of the useful features of Dremio is the granular access controls. Can you discuss how those are implemented and controlled?
For someone who is interested in deploying Dremio to their environment what is involved in getting it installed?
What are the scaling factors?
What are some of the most exciting features that have been added in recent releases?
When is Dremio the wrong choice?
What have been some of the most challenging aspects of building, maintaining, and growing the technical and business platform of Dremio?
What do you have planned for the future of Dremio?
Contact Info
Tomer
@tshiran on Twitter
LinkedIn
Dremio
Website
@dremio on Twitter
dremio on GitHub
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
Dremio
MapR
Presto
Business Intelligence
Arrow
Tableau
Power BI
Jupyter
OLAP Cube
Apache Foundation
Hadoop
Nikon DSLR
Spark
ETL (Extract, Transform, Load)
Parquet
Avro
K8s
Helm
Yarn
Gandiva Initiative for Apache Arrow
LLVM
TLS
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
11/26/2018 • 39 minutes, 18 seconds
Stateful, Distributed Stream Processing on Flink with Fabian Hueske - Episode 57
Summary
Modern applications and data platforms aspire to process events and data in real time at scale and with low latency. Apache Flink is a true stream processing engine with an impressive set of capabilities for stateful computation at scale. In this episode Fabian Hueske, one of the original authors, explains how Flink is architected, how it is being used to power some of the world’s largest businesses, where it sits in the landscape of stream processing tools, and how you can start using it today.
Preamble
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline, or want to test out the projects you hear about on the show, you’ll need somewhere to deploy it, so check out Linode. With 200Gbit private networking, scalable shared block storage, and a 40Gbit public network, you’ve got everything you need to run a fast, reliable, and bullet-proof data platform. If you need global distribution, they’ve got that covered too with world-wide datacenters including new ones in Toronto and Mumbai. Go to dataengineeringpodcast.com/linode today to get a $20 credit and launch a new server in under a minute.
Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Your host is Tobias Macey and today I’m interviewing Fabian Hueske, co-author of the upcoming O’Reilly book Stream Processing With Apache Flink, about his work on Apache Flink, the stateful streaming engine
Interview
Introduction
How did you get involved in the area of data management?
Can you start by describing what Flink is and how the project got started?
What are some of the primary ways that Flink is used?
How does Flink compare to other streaming engines such as Spark, Kafka, Pulsar, and Storm?
What are some use cases that Flink is uniquely qualified to handle?
Where does Flink fit into the current data landscape?
How is Flink architected?
How has that architecture evolved?
Are there any aspects of the current design that you would do differently if you started over today?
How does scaling work in a Flink deployment?
What are the scaling limits?
What are some of the failure modes that users should be aware of?
How is the statefulness of a cluster managed?
What are the mechanisms for managing conflicts?
What are the limiting factors for the volume of state that can be practically handled in a cluster and for a given purpose?
Can state be shared across processes or tasks within a Flink cluster?
What are the comparative challenges of working with bounded vs unbounded streams of data?
How do you handle out of order events in Flink, especially as the delay for a given event increases?
For someone who is using Flink in their environment, what are the primary means of interacting with and developing on top of it?
What are some of the most challenging or complicated aspects of building and maintaining Flink?
What are some of the most interesting or unexpected ways that you have seen Flink used?
What are some of the improvements or new features that are planned for the future of Flink?
What are some features or use cases that you are explicitly not planning to support?
For people who participate in the training sessions that you offer through Data Artisans, what are some of the concepts that they are challenged by?
What do they find most interesting or exciting?
Contact Info
LinkedIn
@fhueske on Twitter
fhueske on GitHub
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
Flink
Data Artisans
IBM
DB2
Technische Universität Berlin
Hadoop
Relational Database
Google Cloud Dataflow
Spark
Cascading
Java
RocksDB
Flink Checkpoints
Flink Savepoints
Kafka
Pulsar
Storm
Scala
LINQ (Language INtegrated Query)
SQL
Backpressure
Watermarks
HDFS
S3
Avro
JSON
Hive Metastore
Dell EMC
Pravega
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
11/19/2018 • 48 minutes, 1 second
How Upsolver Is Building A Data Lake Platform In The Cloud with Yoni Iny - Episode 56
Summary
A data lake can be a highly valuable resource, as long as it is well built and well managed. Unfortunately, that can be a complex and time-consuming effort, requiring specialized knowledge and diverting resources from your primary business. In this episode Yoni Iny, CTO of Upsolver, discusses the various components that are necessary for a successful data lake project, how the Upsolver platform is architected, and how modern data lakes can benefit your organization.
Preamble
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute.
Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Your host is Tobias Macey and today I’m interviewing Yoni Iny about Upsolver, a data lake platform that lets developers integrate and analyze streaming data with ease
Interview
Introduction
How did you get involved in the area of data management?
Can you start by describing what Upsolver is and how it got started?
What are your goals for the platform?
There are a lot of opinions on both sides of the data lake argument. When is it the right choice for a data platform?
What are the shortcomings of a data lake architecture?
How is Upsolver architected?
How has that architecture changed over time?
How do you manage schema validation for incoming data?
What would you do differently if you were to start over today?
What are the biggest challenges at each of the major stages of the data lake?
What is the workflow for a user of Upsolver and how does it compare to a self-managed data lake?
When is Upsolver the wrong choice for an organization considering implementation of a data platform?
Is there a particular scale or level of data maturity for an organization at which they would be better served by moving management of their data lake in house?
What features or improvements do you have planned for the future of Upsolver?
Contact Info
Yoni
yoniiny on GitHub
LinkedIn
Upsolver
Website
@upsolver on Twitter
LinkedIn
Facebook
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
Upsolver
Data Lake
Israeli Army
Data Warehouse
Data Engineering Podcast Episode About Data Curation
Three Vs
Kafka
Spark
Presto
Drill
Spot Instances
Object Storage
Cassandra
Redis
Latency
Avro
Parquet
ORC
Data Engineering Podcast Episode About Data Serialization Formats
SSTables
Run Length Encoding
CSV (Comma Separated Values)
Protocol Buffers
Kinesis
ETL
DevOps
Prometheus
Cloudwatch
DataDog
InfluxDB
SQL
Pandas
Confluent
KSQL
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
11/11/2018 • 51 minutes, 50 seconds
Self Service Business Intelligence And Data Sharing Using Looker with Daniel Mintz - Episode 55
Summary
Business intelligence is a necessity for any organization that wants to be able to make informed decisions based on the data that they collect. Unfortunately, it is common for different portions of the business to build their reports with different assumptions, leading to conflicting views and poor choices. Looker is a modern tool for building and sharing reports that makes it easy to get everyone on the same page. In this episode Daniel Mintz explains how the product is architected, the features that make it easy for any business user to access and explore their reports, and how you can use it for your organization today.
Preamble
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute.
Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Your host is Tobias Macey and today I’m interviewing Daniel Mintz about Looker, a modern data platform that can serve the data needs of an entire company
Interview
Introduction
How did you get involved in the area of data management?
Can you start by describing what Looker is and the problem that it is aiming to solve?
How do you define business intelligence?
How is Looker unique from other approaches to business intelligence in the enterprise?
How does it compare to open source platforms for BI?
Can you describe the technical infrastructure that supports Looker?
Given that you are connecting to the customer’s data store, how do you ensure sufficient security?
For someone who is using Looker, what does their workflow look like?
How does that change for different user roles (e.g. data engineer vs sales management)
What are the scaling factors for Looker, both in terms of volume of data for reporting from, and for user concurrency?
What are the most challenging aspects of building a business intelligence tool and company in the modern data ecosystem?
What are the portions of the Looker architecture that you would do differently if you were to start over today?
What are some of the most interesting or unusual uses of Looker that you have seen?
What is in store for the future of Looker?
Contact Info
LinkedIn
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
Looker
Upworthy
MoveOn.org
LookML
SQL
Business Intelligence
Data Warehouse
Linux
Hadoop
BigQuery
Snowflake
Redshift
DB2
PostGres
ETL (Extract, Transform, Load)
ELT (Extract, Load, Transform)
Airflow
Luigi
NiFi
Data Curation Episode
Presto
Hive
Athena
DRY (Don’t Repeat Yourself)
Looker Action Hub
Salesforce
Marketo
Twilio
Netscape Navigator
Dynamic Pricing
Survival Analysis
DevOps
BigQuery ML
Snowflake Data Sharehouse
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
11/5/2018 • 58 minutes, 4 seconds
Using Notebooks As The Unifying Layer For Data Roles At Netflix with Matthew Seal - Episode 54
Summary
Jupyter notebooks have gained popularity among data scientists as an easy way to do exploratory analysis and build interactive reports. However, this can cause difficulties when trying to move the work of the data scientist into a more standard production environment, due to the translation efforts that are necessary. At Netflix they had the crazy idea that perhaps that last step isn’t necessary, and the production workflows can just run the notebooks directly. Matthew Seal is one of the primary engineers who has been tasked with building the tools and practices that allow the various data oriented roles to unify their work around notebooks. In this episode he explains the rationale for the effort, the challenges that it has posed, the development that has been done to make it work, and the benefits that it provides to the Netflix data platform teams.
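Papermill, the parameterized-execution library listed in the links, is central to the workflow described in this episode. The following is a minimal sketch of how a scheduled job might execute a notebook with injected parameters; the file paths and parameter names are illustrative and are not Netflix's actual jobs.

```python
# A minimal sketch of parameterized notebook execution with papermill.
# Paths and parameter names are illustrative placeholders.
import papermill as pm

pm.execute_notebook(
    "daily_report.ipynb",                     # source notebook with a "parameters" cell
    "output/daily_report_2018-10-29.ipynb",   # executed copy written out for inspection
    parameters={"run_date": "2018-10-29", "region": "us-east-1"},
)
```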
Preamble
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute.
Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Your host is Tobias Macey and today I’m interviewing Matthew Seal about the ways that Netflix is using Jupyter notebooks to bridge the gap between data roles
Interview
Introduction
How did you get involved in the area of data management?
Can you start by outlining the motivation for choosing Jupyter notebooks as the core interface for your data teams?
Where are you using notebooks and where are you not?
What is the technical infrastructure that you have built to support that design choice?
Which team was driving the effort?
Was it difficult to get buy in across teams?
How much shared code have you been able to consolidate or reuse across teams/roles?
Have you investigated the use of any of the other notebook platforms for similar workflows?
What are some of the notebook anti-patterns that you have encountered and what conventions or tooling have you established to discourage them?
What are some of the limitations of the notebook environment for the work that you are doing?
What have been some of the most challenging aspects of building production workflows on top of Jupyter notebooks?
What are some of the projects that are ongoing or planned for the future that you are most excited by?
Contact Info
Matthew Seal
Email
LinkedIn
@codeseal on Twitter
MSeal on GitHub
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
Netflix Notebook Blog Posts
Nteract Tooling
OpenGov
Project Jupyter
Zeppelin Notebooks
Papermill
Titus
Commuter
Scala
Python
R
Emacs
NBDime
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
10/29/2018 • 40 minutes, 54 seconds
Of Checklists, Ethics, and Data with Emily Miller and Peter Bull (Cross Post from Podcast.__init__) - Episode 53
Summary
As data science becomes more widespread and has a bigger impact on the lives of people, it is important that those projects and products are built with a conscious consideration of ethics. Keeping ethical principles in mind throughout the lifecycle of a data project helps to reduce the overall effort of preventing negative outcomes from the use of the final product. Emily Miller and Peter Bull of Driven Data have created Deon to improve the communication and conversation around ethics among and between data teams. It is a Python project that generates a checklist of common concerns for data oriented projects at the various stages of the lifecycle where they should be considered. In this episode they discuss their motivation for creating the project, the challenges and benefits of maintaining such a checklist, and how you can start using it today.
Preamble
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute.
Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
This is your host Tobias Macey and this week I am sharing an episode from my other show, Podcast.__init__, about a project from Driven Data called Deon. It is a simple tool that generates a checklist of ethical considerations for the various stages of the lifecycle for data oriented projects. This is an important topic for all of the teams involved in the management and creation of projects that leverage data. So give it a listen and if you like what you hear, be sure to check out the other episodes at pythonpodcast.com
Interview
Introductions
How did you get introduced to Python?
Can you start by describing what Deon is and your motivation for creating it?
Why a checklist, specifically? What’s the advantage of this over an oath, for example?
What is unique to data science in terms of the ethical concerns, as compared to traditional software engineering?
What is the typical workflow for a team that is using Deon in their projects?
Deon ships with a default checklist but allows for customization. What are some common addendums that you have seen?
Have you received pushback on any of the default items?
How does Deon simplify communication around ethics across team boundaries?
What are some of the most often overlooked items?
What are some of the most difficult ethical concerns to comply with for a typical data science project?
How has Deon helped you at Driven Data?
What are the customer facing impacts of embedding a discussion of ethics in the product development process?
Some of the items on the default checklist coincide with regulatory requirements. Are there any cases where regulation is in conflict with an ethical concern that you would like to see practiced?
What are your hopes for the future of the Deon project?
Keep In Touch
Emily
LinkedIn
ejm714 on GitHub
Peter
LinkedIn
@pjbull on Twitter
pjbull on GitHub
Driven Data
@drivendataorg on Twitter
drivendataorg on GitHub
Website
Picks
Tobias
Richard Bond Glass Art
Emily
Tandem Coffee in Portland, Maine
Peter
The Model Bakery in Saint Helena and Napa, California
Links
Deon
Driven Data
International Development
Brookings Institution
Stata
Econometrics
Metis Bootcamp
Pandas
Podcast Episode
C#
.NET
Podcast.__init__ Episode On Software Ethics
Jupyter Notebook
Podcast Episode
Word2Vec
cookiecutter data science
Logistic Regression
The intro and outro music is from Requiem for a Fish The Freak Fandango Orchestra / CC BY-SA
10/22/2018 • 45 minutes, 32 seconds
Improving The Performance Of Cloud-Native Big Data At Netflix Using The Iceberg Table Format with Ryan Blue - Episode 52
Summary
With the growth of the Hadoop ecosystem came a proliferation of implementations for the Hive table format. Unfortunately, with no formal specification, each project works slightly differently, which increases the difficulty of integration across systems. The Hive format is also built with the assumption of a local filesystem, which results in painful edge cases when leveraging cloud object storage for a data lake. In this episode Ryan Blue explains how his work on the Iceberg table format specification and reference implementation has allowed Netflix to improve the performance and simplify operations for their S3 data lake. This is a highly detailed and technical exploration of how a well-engineered metadata layer can improve the speed, accuracy, and utility of large scale, multi-tenant, cloud-native data platforms.
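To make the idea of a metadata layer more concrete, the toy sketch below models a table as a list of immutable snapshots that each enumerate their data files, so readers plan scans from one consistent view instead of listing directories on the object store. This is purely a conceptual illustration and is not Iceberg's actual specification or API.

```python
# A toy, conceptual sketch of snapshot-based table metadata.
# Illustrates the concept only; it is not the Iceberg format.
from dataclasses import dataclass, field
from typing import List


@dataclass
class Snapshot:
    snapshot_id: int
    data_files: List[str]  # fully qualified object store paths


@dataclass
class TableMetadata:
    schema: dict
    snapshots: List[Snapshot] = field(default_factory=list)

    def current(self) -> Snapshot:
        # Readers always plan a scan from one consistent snapshot,
        # so concurrent writers never leave them with a partial view.
        return self.snapshots[-1]


table = TableMetadata(schema={"id": "long", "ts": "timestamp"})
table.snapshots.append(Snapshot(1, ["s3://bucket/tbl/data/part-000.parquet"]))
print(table.current().data_files)
```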
Preamble
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute.
Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Your host is Tobias Macey and today I’m interviewing Ryan Blue about Iceberg, a Netflix project to implement a high performance table format for batch workloads
Interview
Introduction
How did you get involved in the area of data management?
Can you start by explaining what Iceberg is and the motivation for creating it?
Was the project built with open-source in mind or was it necessary to refactor it from an internal project for public use?
How has the use of Iceberg simplified your work at Netflix?
How is the reference implementation architected and how has it evolved since you first began work on it?
What is involved in deploying it to a user’s environment?
For someone who is interested in using Iceberg within their own environments, what is involved in integrating it with their existing query engine?
Is there a migration path for pre-existing tables into the Iceberg format?
How is schema evolution managed at the file level?
How do you handle files on disk that don’t contain all of the fields specified in a table definition?
One of the complicated problems in data modeling is managing table partitions. How does Iceberg help in that regard?
What are the unique challenges posed by using S3 as the basis for a data lake?
What are the benefits that outweigh the difficulties?
What have been some of the most challenging or contentious details of the specification to define?
What are some things that you have explicitly left out of the specification?
What are your long-term goals for the Iceberg specification?
Do you anticipate the reference implementation continuing to be used and maintained?
Contact Info
rdblue on GitHub
LinkedIn
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
Iceberg Reference Implementation
Iceberg Table Specification
Netflix
Hadoop
Cloudera
Avro
Parquet
Spark
S3
HDFS
Hive
ORC
S3mper
Git
Metacat
Presto
Pig
DDL (Data Definition Language)
Cost-Based Optimization
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
10/15/2018 • 53 minutes, 45 seconds
Combining Transactional And Analytical Workloads On MemSQL with Nikita Shamgunov - Episode 51
Summary
One of the most complex aspects of managing data for analytical workloads is moving it from a transactional database into the data warehouse. What if you didn’t have to do that at all? MemSQL is a distributed database built to support concurrent use by transactional, application oriented, and analytical, high volume, workloads on the same hardware. In this episode the CEO of MemSQL describes how the company and database got started, how it is architected for scale and speed, and how it is being used in production. This was a deep dive on how to build a successful company around a powerful platform, and how that platform simplifies operations for enterprise grade data management.
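Because MemSQL speaks the MySQL wire protocol, ordinary MySQL clients can run both kinds of workload against it. The sketch below is illustrative only, using the pymysql driver with placeholder connection details and table names.

```python
# A minimal sketch mixing a transactional insert with an analytical aggregate
# on the same database. Connection details and table names are placeholders.
import pymysql

conn = pymysql.connect(host="memsql.example.com", user="app", password="secret", database="shop")
try:
    with conn.cursor() as cur:
        # Transactional, application-style write.
        cur.execute("INSERT INTO orders (customer_id, amount) VALUES (%s, %s)", (342, 19.99))
        conn.commit()

        # Analytical, scan-heavy read against the same tables.
        cur.execute("SELECT customer_id, SUM(amount) FROM orders GROUP BY customer_id")
        for customer_id, total in cur.fetchall():
            print(customer_id, total)
finally:
    conn.close()
```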
Preamble
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute.
You work hard to make sure that your data is reliable and accurate, but can you say the same about the deployment of your machine learning models? The Skafos platform from Metis Machine was built to give your data scientists the end-to-end support that they need throughout the machine learning lifecycle. Skafos maximizes interoperability with your existing tools and platforms, and offers real-time insights and the ability to be up and running with cloud-based production scale infrastructure instantaneously. Request a demo at dataengineeringpodcast.com/metis-machine to learn more about how Metis Machine is operationalizing data science.
And the team at Metis Machine has shipped a proof-of-concept integration between the Skafos machine learning platform and the Tableau business intelligence tool, meaning that your BI team can now run the machine learning models custom built by your data science team. If you think that sounds awesome (and it is) then join the free webinar with Metis Machine on October 11th at 2 PM ET (11 AM PT). Metis Machine will walk through the architecture of the extension, demonstrate its capabilities in real time, and illustrate the use case for empowering your BI team to modify and run machine learning models directly from Tableau. Go to metismachine.com/webinars now to register.
Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Your host is Tobias Macey and today I’m interviewing Nikita Shamgunov about MemSQL, a newSQL database built for simultaneous transactional and analytic workloads
Interview
Introduction
How did you get involved in the area of data management?
Can you start by describing what MemSQL is and how the product and business first got started?
What are the typical use cases for customers running MemSQL?
What are the benefits of integrating the ingestion pipeline with the database engine?
What are some typical ways that the ingest capability is leveraged by customers?
How is MemSQL architected and how has the internal design evolved from when you first started working on it?
Where does it fall on the axes of the CAP theorem?
How much processing overhead is involved in the conversion from the column oriented data stored on disk to the row oriented data stored in memory?
Can you describe the lifecycle of a write transaction?
Can you discuss the techniques that are used in MemSQL to optimize for speed and overall system performance?
How do you mitigate the impact of network latency throughout the cluster during query planning and execution?
How much of the implementation of MemSQL is using custom built code vs. open source projects?
What are some of the common difficulties that your customers encounter when building on top of or migrating to MemSQL?
What have been some of the most challenging aspects of building and growing the technical and business implementation of MemSQL?
When is MemSQL the wrong choice for a data platform?
What do you have planned for the future of MemSQL?
Contact Info
@nikitashamgunov on Twitter
LinkedIn
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
MemSQL
NewSQL
Microsoft SQL Server
St. Petersburg University of Fine Mechanics And Optics
C
C++
In-Memory Database
RAM (Random Access Memory)
Flash Storage
Oracle DB
PostgreSQL
Podcast Episode
Kafka
Kinesis
Wealth Management
Data Warehouse
ODBC
S3
HDFS
Avro
Parquet
Data Serialization Podcast Episode
Broadcast Join
Shuffle Join
CAP Theorem
Apache Arrow
LZ4
S2 Geospatial Library
Sybase
SAP Hana
Kubernetes
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
10/9/2018 • 56 minutes, 54 seconds
Building A Knowledge Graph From Public Data At Enigma With Chris Groskopf - Episode 50
Summary
There are countless sources of data that are publicly available for use. Unfortunately, combining those sources and making them useful in aggregate is a time consuming and challenging process. The team at Enigma builds a knowledge graph from that public data for use in your own data projects. In this episode Chris Groskopf explains the platform they have built to consume large varieties and volumes of public data for constructing a graph for serving to their customers. He discusses the challenges they are facing in scaling the platform and engineering processes, as well as the workflow that they have established to enable testing of their ETL jobs. This is a great episode to listen to for ideas on how to organize a data engineering organization.
Preamble
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute.
You work hard to make sure that your data is reliable and accurate, but can you say the same about the deployment of your machine learning models? The Skafos platform from Metis Machine was built to give your data scientists the end-to-end support that they need throughout the machine learning lifecycle. Skafos maximizes interoperability with your existing tools and platforms, and offers real-time insights and the ability to be up and running with cloud-based production scale infrastructure instantaneously. Request a demo at dataengineeringpodcast.com/metis-machine to learn more about how Metis Machine is operationalizing data science.
Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Your host is Tobias Macey and today I’m interviewing Chris Groskopf about Enigma and how they are using public data sources to build a knowledge graph
Interview
Introduction
How did you get involved in the area of data management?
Can you give a brief overview of what Enigma has built and what the motivation was for starting the company?
How do you define the concept of a knowledge graph?
What are the processes involved in constructing a knowledge graph?
Can you describe the overall architecture of your data platform and the systems that you use for storing and serving your knowledge graph?
What are the most challenging or unexpected aspects of building the knowledge graph that you have encountered?
How do you manage the software lifecycle for your ETL code?
What kinds of unit, integration, or acceptance tests do you run to ensure that you don’t introduce regressions in your processing logic?
What are the current challenges that you are facing in building and scaling your data infrastructure?
How does the fact that your data sources are primarily public influence your pipeline design and what challenges does it pose?
What techniques are you using to manage accuracy and consistency in the data that you ingest?
Can you walk through the lifecycle of the data that you process from acquisition through to delivery to your customers?
What are the weak spots in your platform that you are planning to address in upcoming projects?
If you were to start from scratch today, what would you have done differently?
What are some of the most interesting or unexpected uses of your product that you have seen?
What is in store for the future of Enigma?
Contact Info
Email
Twitter
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
Enigma
Chicago Tribune
NPR
Quartz
CSVKit
Agate
Knowledge Graph
Taxonomy
Concourse
Airflow
Docker
S3
Data Lake
Parquet
Podcast Episode
Spark
AWS Neptune
AWS Batch
Money Laundering
Jupyter Notebook
Papermill
Jupytext
Cauldron: The Un-Notebook
Podcast.__init__ Episode
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
10/1/2018 • 52 minutes, 52 seconds
A Primer On Enterprise Data Curation with Todd Walter - Episode 49
Summary
As your data needs scale across an organization the need for a carefully considered approach to collection, storage, organization, and access becomes increasingly critical. In this episode Todd Walter shares his considerable experience in data curation to clarify the many aspects that are necessary for a successful platform for your business. Using the metaphor of a museum curator carefully managing the precious resources on display and in the vaults, he discusses the various layers of an enterprise data strategy. This includes modeling the lifecycle of your information as a pipeline from the raw, messy, loosely structured records in your data lake, through a series of transformations and ultimately to your data warehouse. He also explains which layers are useful for the different members of the business, and which pitfalls to look out for along the path to a mature and flexible data platform.
Preamble
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute.
You work hard to make sure that your data is reliable and accurate, but can you say the same about the deployment of your machine learning models? The Skafos platform from Metis Machine was built to give your data scientists the end-to-end support that they need throughout the machine learning lifecycle. Skafos maximizes interoperability with your existing tools and platforms, and offers real-time insights and the ability to be up and running with cloud-based production scale infrastructure instantaneously. Request a demo at dataengineeringpodcast.com/metis-machine to learn more about how Metis Machine is operationalizing data science.
Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Your host is Tobias Macey and today I’m interviewing Todd Walter about data curation and how to architect your data systems to support high quality, maintainable intelligence
Interview
Introduction
How did you get involved in the area of data management?
How do you define data curation?
What are some of the high level concerns that are encapsulated in that effort?
How does the size and maturity of a company affect the ways that they architect and interact with their data systems?
Can you walk through the stages of an ideal lifecycle for data within the context of an organization’s uses for it?
What are some of the common mistakes that are made when designing a data architecture and how do they lead to failure?
What has changed in terms of complexity and scope for data architecture and curation since you first started working in this space?
As “big data” became more widely discussed the common mantra was to store everything because you never know when you’ll need the data that might get thrown away. As the industry is reaching a greater degree of maturity and more regulations are implemented there has been a shift to being more considerate as to what information gets stored and for how long. What are your views on that evolution and what is your litmus test for determining which data to keep?
In terms of infrastructure, what are the components of a modern data architecture and how has that changed over the years?
What is your opinion on the relative merits of a data warehouse vs a data lake and are they mutually exclusive?
Once an architecture has been established, how do you allow for continued evolution to prevent stagnation and eventual failure?
ETL has long been the default approach for building and enforcing data architecture, but there have been significant shifts in recent years due to the emergence of streaming systems and ELT approaches in new data warehouses. What are your thoughts on the landscape for managing data flows and migration and when to use which approach?
What are some of the areas of data architecture and curation that are most often forgotten or ignored?
What resources do you recommend for anyone who is interested in learning more about the landscape of data architecture and curation?
Contact Info
LinkedIn
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
Teradata
Data Architecture
Data Curation
Data Warehouse
Chief Data Officer
ETL (Extract, Transform, Load)
Data Lake
Metadata
Data Lineage
Data Provenance
Strata Conference
ELT (Extract, Load, Transform)
Map-Reduce
Hive
Pig
Spark
Data Governance
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
9/24/2018 • 49 minutes, 35 seconds
Take Control Of Your Web Analytics Using Snowplow With Alexander Dean - Episode 48
Summary
Every business with a website needs some way to keep track of how much traffic they are getting, where it is coming from, and which actions are being taken. The default in most cases is Google Analytics, but this can be limiting when you wish to perform detailed analysis of the captured data. To address this problem, Alex Dean co-founded Snowplow Analytics to build an open source platform that gives you total control of your website traffic data. In this episode he explains how the project and company got started, how the platform is architected, and how you can start using it today to get a clearer view of how your customers are interacting with your web and mobile applications.
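For a rough sense of what first-party event collection looks like, the sketch below assumes the snowplow-tracker Python package's Emitter/Tracker interface; the collector host, application id, and event fields are placeholders, and exact signatures may differ between tracker versions.

```python
# A small sketch of sending first-party events to a self-hosted Snowplow collector.
# Collector host, app_id, and event fields are placeholders; signatures may vary by version.
from snowplow_tracker import Emitter, Tracker

emitter = Emitter("collector.example.com")      # your own collector endpoint
tracker = Tracker(emitter, app_id="web-shop")

# Page views and custom structured events land in storage that you control,
# rather than in a third-party analytics silo.
tracker.track_page_view("https://example.com/pricing", "Pricing")
tracker.track_struct_event("checkout", "add-to-cart", label="sku-1234", value=19.99)
```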
Preamble
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute.
You work hard to make sure that your data is reliable and accurate, but can you say the same about the deployment of your machine learning models? The Skafos platform from Metis Machine was built to give your data scientists the end-to-end support that they need throughout the machine learning lifecycle. Skafos maximizes interoperability with your existing tools and platforms, and offers real-time insights and the ability to be up and running with cloud-based production scale infrastructure instantaneously. Request a demo at dataengineeringpodcast.com/metis-machine to learn more about how Metis Machine is operationalizing data science.
Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
This is your host Tobias Macey and today I’m interviewing Alexander Dean about Snowplow Analytics
Interview
Introductions
How did you get involved in the area of data engineering and data management?
What is Snowplow Analytics and what problem were you trying to solve when you started the company?
What is unique about customer event data from an ingestion and processing perspective?
Challenges with properly matching up data between sources
Data collection is one of the more difficult aspects of an analytics pipeline because of the potential for inconsistency or incorrect information. How is the collection portion of the Snowplow stack designed and how do you validate the correctness of the data?
Cleanliness/accuracy
What kinds of metrics should be tracked in an ingestion pipeline and how do you monitor them to ensure that everything is operating properly?
Can you describe the overall architecture of the ingest pipeline that Snowplow provides?
How has that architecture evolved from when you first started?
What would you do differently if you were to start over today?
Ensuring appropriate use of enrichment sources
What have been some of the biggest challenges encountered while building and evolving Snowplow?
What are some of the most interesting uses of your platform that you are aware of?
Keep In Touch
Alex
@alexcrdean on Twitter
LinkedIn
Snowplow
@snowplowdata on Twitter
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
Snowplow
GitHub
Deloitte Consulting
OpenX
Hadoop
AWS
EMR (Elastic Map-Reduce)
Business Intelligence
Data Warehousing
Google Analytics
CRM (Customer Relationship Management)
S3
GDPR (General Data Protection Regulation)
Kinesis
Kafka
Google Cloud Pub-Sub
JSON-Schema
Iglu
IAB Bots And Spiders List
Heap Analytics
Podcast Interview
Redshift
SnowflakeDB
Snowplow Insights
Google Cloud Platform
Azure
GitLab
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
9/17/2018 • 47 minutes, 48 seconds
Keep Your Data And Query It Too Using Chaos Search with Thomas Hazel and Pete Cheslock - Episode 47
Summary
Elasticsearch is a powerful tool for storing and analyzing data, but when using it for logs and other time oriented information it can become problematic to keep all of your history. Chaos Search was started to make it easy for you to keep all of your data in S3 and still make it usable, so that you can have the best of both worlds. In this episode the CTO, Thomas Hazel, and VP of Product, Pete Cheslock, describe how they have built a platform to let you keep all of your history, save money, and reduce your operational overhead. They also explain some of the types of data that you can use with Chaos Search, how to load it into S3, and when you might want to choose it over Amazon Athena for your serverless data analysis.
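Since the platform exposes an Elasticsearch-compatible API over data indexed from S3, a standard client can, in principle, be pointed at it. The sketch below is illustrative only; the endpoint, index pattern, and query fields are placeholders rather than anything specific to Chaos Search.

```python
# An illustrative Elasticsearch-style query over long log retention.
# Endpoint, index pattern, and field names are placeholders.
from elasticsearch import Elasticsearch

es = Elasticsearch(["https://search.example.com:9200"])

response = es.search(
    index="app-logs-*",
    body={
        "query": {
            "bool": {
                "must": [{"match": {"level": "ERROR"}}],
                "filter": [{"range": {"@timestamp": {"gte": "now-90d"}}}],
            }
        }
    },
)
print(response["hits"]["total"])
```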
Preamble
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute.
You work hard to make sure that your data is reliable and accurate, but can you say the same about the deployment of your machine learning models? The Skafos platform from Metis Machine was built to give your data scientists the end-to-end support that they need throughout the machine learning lifecycle. Skafos maximizes interoperability with your existing tools and platforms, and offers real-time insights and the ability to be up and running with cloud-based production scale infrastructure instantaneously. Request a demo at dataengineeringpodcast.com/metis-machine to learn more about how Metis Machine is operationalizing data science.
Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Your host is Tobias Macey and today I’m interviewing Pete Cheslock and Thomas Hazel about Chaos Search and their effort to bring historical depth to your Elasticsearch data
Interview
Introduction
How did you get involved in the area of data management?
Can you start by explaining what you have built at Chaos Search and the problems that you are trying to solve with it?
What types of data are you focused on supporting?
What are the challenges inherent to scaling an elasticsearch infrastructure to large volumes of log or metric data?
Is there any need for an Elasticsearch cluster in addition to Chaos Search?
For someone who is using Chaos Search, what mechanisms/formats would they use for loading their data into S3?
What are the benefits of implementing the Elasticsearch API on top of your data in S3 as opposed to using systems such as Presto or Drill to interact with the same information via SQL?
Given that the S3 API has become a de facto standard for many other object storage platforms, what would be involved in running Chaos Search on data stored outside of AWS?
What mechanisms do you use to allow for such drastic space savings of indexed data in S3 versus in an Elasticsearch cluster?
What is the system architecture that you have built to allow for querying terabytes of data in S3?
What are the biggest contributors to query latency and what have you done to mitigate them?
What are the options for access control when running queries against the data stored in S3?
What are some of the most interesting or unexpected uses of Chaos Search and access to large amounts of historical log information that you have seen?
What are your plans for the future of Chaos Search?
Contact Info
Pete Cheslock
@petecheslock on Twitter
Website
Thomas Hazel
@thomashazel on Twitter
LinkedIn
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
Chaos Search
AWS S3
Cassandra
Elasticsearch
Podcast Interview
PostgreSQL
Distributed Systems
Information Theory
Lucene
Inverted Index
Kibana
Logstash
NVMe
AWS KMS
Kinesis
FluentD
Parquet
Athena
Presto
Drill
Backblaze
OpenStack Swift
Minio
EMR
DataDog
NewRelic
Elastic Beats
Metricbeat
Graphite
Snappy
Scala
Akka
Elastalert
Tensorflow
X-Pack
Data Lake
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
9/10/2018 • 48 minutes, 8 seconds
An Agile Approach To Master Data Management with Mark Marinelli - Episode 46
Summary
With the proliferation of data sources to give a more comprehensive view of the information critical to your business it is even more important to have a canonical view of the entities that you care about. Is customer number 342 in your ERP the same as Bob Smith on Twitter? Using master data management to build a data catalog helps you answer these questions reliably and simplify the process of building your business intelligence reports. In this episode the head of product at Tamr, Mark Marinelli, discusses the challenges of building a master data set, why you should have one, and some of the techniques that modern platforms and systems provide for maintaining it.
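To show the shape of the record-matching problem that master data management addresses, the toy sketch below compares two records from different systems with a naive string-similarity heuristic. Real platforms such as Tamr rely on trained models and human curation; this example is only a conceptual illustration with made-up records.

```python
# A toy illustration of entity resolution: deciding whether two records from
# different systems refer to the same real-world entity. Records are invented.
from difflib import SequenceMatcher

erp_record = {"id": 342, "name": "Robert Smith", "email": "bob.smith@example.com"}
twitter_record = {"handle": "@bobsmith", "name": "Bob Smith", "email": "bob.smith@example.com"}


def name_similarity(a: str, b: str) -> float:
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


score = name_similarity(erp_record["name"], twitter_record["name"])
same_email = erp_record["email"] == twitter_record["email"]

# Combine weak signals into a match decision that a data steward could review.
is_probable_match = same_email or score > 0.8
print(score, is_probable_match)
```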
Preamble
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute.
You work hard to make sure that your data is reliable and accurate, but can you say the same about the deployment of your machine learning models? The Skafos platform from Metis Machine was built to give your data scientists the end-to-end support that they need throughout the machine learning lifecycle. Skafos maximizes interoperability with your existing tools and platforms, and offers real-time insights and the ability to be up and running with cloud-based production scale infrastructure instantaneously. Request a demo at dataengineeringpodcast.com/metis-machine to learn more about how Metis Machine is operationalizing data science.
Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Your host is Tobias Macey and today I’m interviewing Mark Marinelli about data mastering for modern platforms
Interview
Introduction
How did you get involved in the area of data management?
Can you start by establishing a definition of data mastering that we can work from?
How does the master data set get used within the overall analytical and processing systems of an organization?
What is the traditional workflow for creating a master data set?
What has changed in the current landscape of businesses and technology platforms that makes that approach impractical?
What are the steps that an organization can take to evolve toward an agile approach to data mastering?
At what scale of company or project does it make sense to start building a master data set?
What are the limitations of using ML/AI to merge data sets?
What are the limitations of a golden master data set in practice?
Are there particular formats of data or types of entities that pose a greater challenge when creating a canonical format for them?
Are there specific problem domains that are more likely to benefit from a master data set?
Once a golden master has been established, how are changes to that information handled in practice? (e.g. versioning of the data)
What storage mechanisms are typically used for managing a master data set?
Are there particular security, auditing, or access concerns that engineers should be considering when managing their golden master that goes beyond the rest of their data infrastructure?
How do you manage latency issues when trying to reference the same entities from multiple disparate systems?
What have you found to be the most common stumbling blocks for a group that is implementing a master data platform?
What suggestions do you have to help prevent such a project from being derailed?
What resources do you recommend for someone looking to learn more about the theoretical and practical aspects of data mastering for their organization?
Contact Info
LinkedIn
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
Tamr
Multi-Dimensional Database
Master Data Management
ETL
EDW (Enterprise Data Warehouse)
Waterfall Development Method
Agile Development Method
DataOps
Feature Engineering
Tableau
Qlik
Data Catalog
PowerBI
RDBMS (Relational Database Management System)
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
9/3/2018 • 47 minutes, 16 seconds
Protecting Your Data In Use At Enveil with Ellison Anne Williams - Episode 45
Summary
There are myriad reasons why data should be protected, and just as many ways to enforce that protection in transit or at rest. Unfortunately, there is still a weak point where attackers can gain access to your unencrypted information. In this episode Ellison Anne Williams, CEO of Enveil, describes how her company uses homomorphic encryption to ensure that your analytical queries can be executed without ever having to decrypt your data.
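For listeners unfamiliar with the idea, the toy sketch below shows the additive property of the Paillier cryptosystem, one well-known partially homomorphic scheme: two ciphertexts can be combined without decryption, and the result decrypts to the sum of the plaintexts. The tiny hard-coded primes are purely illustrative and wildly insecure, and this is a generic textbook example rather than Enveil's technology. It assumes Python 3.9+ for math.lcm and the modular-inverse form of pow.

```python
import math
import random

# Toy Paillier cryptosystem (insecure key sizes, illustration only).
p, q = 1789, 2003                       # assumed small primes for the demo
n = p * q
n_sq = n * n
g = n + 1                               # standard simple choice of generator
lam = math.lcm(p - 1, q - 1)            # lambda = lcm(p - 1, q - 1)
mu = pow(lam, -1, n)                    # with g = n + 1, L(g^lam mod n^2) = lam

def encrypt(m: int) -> int:
    r = random.randrange(1, n)
    return (pow(g, m, n_sq) * pow(r, n, n_sq)) % n_sq

def decrypt(c: int) -> int:
    L = (pow(c, lam, n_sq) - 1) // n    # L(x) = (x - 1) / n
    return (L * mu) % n

a, b = 1234, 5678
c_sum = (encrypt(a) * encrypt(b)) % n_sq   # "add" the values while still encrypted
assert decrypt(c_sum) == a + b
print("decrypted sum:", decrypt(c_sum))
```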
Preamble
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute.
Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Your host is Tobias Macey and today I’m interviewing Ellison Anne Williams about Enveil, a pioneering data security company protecting Data in Use
Interview
Introduction
How did you get involved in the area of data security?
Can you start by explaining what your mission is with Enveil and how the company got started?
One of the core aspects of your platform is the principle of homomorphic encryption. Can you explain what that is and how you are using it?
What are some of the challenges associated with scaling homomorphic encryption?
What are some difficulties associated with working on encrypted data sets?
Can you describe the underlying architecture for your data platform?
How has that architecture evolved from when you first began building it?
What are some use cases that are unlocked by having a fully encrypted data platform?
For someone using the Enveil platform, what does their workflow look like?
A major reason for never decrypting data is to protect it from attackers and unauthorized access. What are some of the remaining attack vectors?
What are some aspects of the data being protected that still require additional consideration to prevent leaking information? (e.g. identifying individuals based on geographic data, or purchase patterns)
What do you have planned for the future of Enveil?
Contact Info
LinkedIn
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data security today?
Links
Enveil
NSA
GDPR
Intellectual Property
Zero Trust
Homomorphic Encryption
Ciphertext
Hadoop
PII (Personally Identifiable Information)
TLS (Transport Layer Security)
Spark
Elasticsearch
Side-channel attacks
Spectre and Meltdown
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
8/27/2018 • 24 minutes, 41 seconds
Graph Databases In Production At Scale Using DGraph with Manish Jain - Episode 44
Summary
The way that you store your data can have a huge impact on the ways that it can be practically used. For a substantial number of use cases, the optimal format for storing and querying that information is as a graph; however, databases architected around that use case have historically been difficult to use at scale or for serving fast, distributed queries. In this episode Manish Jain explains how DGraph is overcoming those limitations, how the project got started, and how you can start using it today. He also discusses the various cases where a graph storage layer is beneficial, and when you would be better off using something else. In addition he talks about the challenges of building a distributed, consistent database and the tradeoffs that were made to make DGraph a reality.
Preamble
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute.
If you have ever wished that you could use the same tools for versioning and distributing your data that you use for your software then you owe it to yourself to check out what the fine folks at Quilt Data have built. Quilt is an open source platform for building a sane workflow around your data that works for your whole team, including version history, metadata management, and flexible hosting. Stop by their booth at JupyterCon in New York City on August 22nd through the 24th to say Hi and tell them that the Data Engineering Podcast sent you! After that, keep an eye on the AWS marketplace for a pre-packaged version of Quilt for Teams to deploy into your own environment and stop fighting with your data.
Python has quickly become one of the most widely used languages by both data engineers and data scientists, letting everyone on your team understand each other more easily. However, it can be tough learning it when you’re just starting out. Luckily, there’s an easy way to get involved. Written by MIT lecturer Ana Bell and published by Manning Publications, Get Programming: Learn to code with Python is the perfect way to get started working with Python. Ana’s experience as a teacher of Python really shines through, as you get hands-on with the language without being drowned in confusing jargon or theory. Filled with practical examples and step-by-step lessons to take on, Get Programming is perfect for people who just want to get stuck in with Python. Get your copy of the book with a special 40% discount for Data Engineering Podcast listeners by going to dataengineeringpodcast.com/get-programming and use the discount code PodInit40!
Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Your host is Tobias Macey and today I’m interviewing Manish Jain about DGraph, a low latency, high throughput, native and distributed graph database.
Interview
Introduction
How did you get involved in the area of data management?
What is DGraph and what motivated you to build it?
Graph databases and graph algorithms have been part of the computing landscape for decades. What has changed in recent years to allow for the current proliferation of graph oriented storage systems?
The graph space is becoming crowded in recent years. How does DGraph compare to the current set of offerings?
What are some of the common uses of graph storage systems?
What are some potential uses that are often overlooked?
There are a few ways that graph structures and properties can be implemented, including the ability to store data in the vertices connecting nodes and the structures that can be contained within the nodes themselves. How is information represented in DGraph and what are the tradeoffs in the approach that you chose?
How does the query interface and data storage in DGraph differ from other options?
What are your opinions on the graph query languages that have been adopted by other storage systems, such as Gremlin, Cypher, and GSQL?
How is DGraph architected and how has that architecture evolved from when it first started?
How do you balance the speed and agility of schema on read with the additional application complexity that is required, as opposed to schema on write?
In your documentation you contend that DGraph is a viable replacement for RDBMS-oriented primary storage systems. What are the switching costs for someone looking to make that transition?
What are the limitations of DGraph in terms of scalability or usability?
Where does it fall along the axes of the CAP theorem?
For someone who is interested in building on top of DGraph and deploying it to production, what does their workflow and operational overhead look like?
What have been the most challenging aspects of building and growing the DGraph project and community?
What are some of the most interesting or unexpected uses of DGraph that you are aware of?
When is DGraph the wrong choice?
What are your plans for the future of DGraph?
Contact Info
@manishrjain on Twitter
manishrjain on GitHub
Blog
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
DGraph
Badger
Google Knowledge Graph
Graph Theory
Graph Database
SQL
Relational Database
NoSQL
OLTP (On-Line Transaction Processing)
Neo4J
PostgreSQL
MySQL
BigTable
Recommendation System
Fraud Detection
Customer 360
Usenet Express
IPFS
Gremlin
Cypher
GSQL
GraphQL
MetaWeb
RAFT
Spanner
HBase
Elasticsearch
Kubernetes
TLS (Transport Layer Security)
Jepsen Tests
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
8/20/2018 • 42 minutes, 39 seconds
Putting Airflow Into Production With James Meickle - Episode 43
Summary
The theory behind how a tool is supposed to work and the realities of putting it into practice are often at odds with each other. Learning the pitfalls and best practices from someone who has gained that knowledge the hard way can save you from wasted time and frustration. In this episode James Meickle discusses his recent experience building a new installation of Airflow. He points out the strengths, design flaws, and areas of improvement for the framework. He also describes the design patterns and workflows that his team has built to allow them to use Airflow as the basis of their data science platform.
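For context on what "writing tasks for Airflow" looks like in practice, here is a minimal DAG sketch. The DAG id, schedule, and task body are invented for illustration, and the import paths follow the older Airflow 1.x layout that was current around the time of this episode; module locations differ in later releases.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python_operator import PythonOperator  # 1.x-era import path

def extract_and_load(**context):
    # Placeholder task body; a real task would pull from a source system
    # and write the results into the warehouse.
    print("running for", context["ds"])

dag = DAG(
    dag_id="example_daily_pipeline",      # assumed name for the illustration
    start_date=datetime(2018, 8, 1),
    schedule_interval="@daily",
    catchup=False,
)

load_task = PythonOperator(
    task_id="extract_and_load",
    python_callable=extract_and_load,
    provide_context=True,                 # 1.x flag; context is passed implicitly in 2.x
    dag=dag,
)
```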
Preamble
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute.
Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Your host is Tobias Macey and today I’m interviewing James Meickle about his experiences building a new Airflow installation
Interview
Introduction
How did you get involved in the area of data management?
What was your initial project requirement?
What tooling did you consider in addition to Airflow?
What aspects of the Airflow platform led you to choose it as your implementation target?
Can you describe your current deployment architecture?
How many engineers are involved in writing tasks for your Airflow installation?
What resources were the most helpful while learning about Airflow design patterns?
How have you architected your DAGs for deployment and extensibility?
What kinds of tests and automation have you put in place to support the ongoing stability of your deployment?
What are some of the dead-ends or other pitfalls that you encountered during the course of this project?
What aspects of Airflow have you found to be lacking that you would like to see improved?
What did you wish someone had told you before you started work on your Airflow installation?
If you were to start over would you make the same choice?
If Airflow wasn’t available what would be your second choice?
What are your next steps for improvements and fixes?
Contact Info
@eronarn on Twitter
Website
eronarn on GitHub
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
Quantopian
Harvard Brain Science Initiative
DevOps Days Boston
Google Maps API
Cron
ETL (Extract, Transform, Load)
Azkaban
Luigi
AWS Glue
Airflow
Pachyderm
Podcast Interview
AirBnB
Python
YAML
Ansible
REST (Representational State Transfer)
SAML (Security Assertion Markup Language)
RBAC (Role-Based Access Control)
Maxime Beauchemin
Medium Blog
Celery
Dask
Podcast Interview
PostgreSQL
Podcast Interview
Redis
Cloudformation
Jupyter Notebook
Qubole
Astronomer
Podcast Interview
Gunicorn
Kubernetes
Airflow Improvement Proposals
Python Enhancement Proposals (PEP)
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
8/13/2018 • 48 minutes, 5 seconds
Taking A Tour Of PostgreSQL with Jonathan Katz - Episode 42
Summary
One of the longest running and most popular open source database projects is PostgreSQL. Because of its extensibility and a community focus on stability it has stayed relevant as the ecosystem of development environments and data requirements have changed and evolved over its lifetime. It is difficult to capture any single facet of this database in a single conversation, let alone the entire surface area, but in this episode Jonathan Katz does an admirable job of it. He explains how Postgres started and how it has grown over the years, highlights the fundamental features that make it such a popular choice for application developers, and the ongoing efforts to add the complex features needed by the demanding workloads of today’s data layer. To cap it off he reviews some of the exciting features that the community is working on building into future releases.
Preamble
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute.
Are you struggling to keep up with customer requests and letting errors slip into production? Want to try some of the innovative ideas in this podcast but don’t have time? DataKitchen’s DataOps software allows your team to quickly iterate and deploy pipelines of code, models, and data sets while improving quality. Unlike a patchwork of manual operations, DataKitchen makes your team shine by providing an end to end DataOps solution with minimal programming that uses the tools you love. Join the DataOps movement and sign up for the newsletter at datakitchen.io/de today. After that learn more about why you should be doing DataOps by listening to the Head Chef in the Data Kitchen at dataengineeringpodcast.com/datakitchen
Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Your host is Tobias Macey and today I’m interviewing Jonathan Katz about a high level view of PostgreSQL and the unique capabilities that it offers
Interview
Introduction
How did you get involved in the area of data management?
How did you get involved in the Postgres project?
For anyone who hasn’t used it, can you describe what PostgreSQL is?
Where did Postgres get started and how has it evolved over the intervening years?
What are some of the primary characteristics of Postgres that would lead someone to choose it for a given project?
What are some cases where Postgres is the wrong choice?
What are some of the common points of confusion for new users of PostgreSQL? (particularly if they have prior database experience)
The recent releases of Postgres have had some fairly substantial improvements and new features. How does the community manage to balance stability and reliability against the need to add new capabilities?
What are the aspects of Postgres that allow it to remain relevant in the current landscape of rapid evolution at the data layer?
Are there any plans to incorporate a distributed transaction layer into the core of the project along the lines of what has been done with Citus or CockroachDB?
What is in store for the future of Postgres?
Contact Info
@jkatz05 on Twitter
jkatz on GitHub
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
PostgreSQL
Crunchy Data
Venuebook
Paperless Post
LAMP Stack
MySQL
PHP
SQL
ORDBMS
Edgar Codd
A Relational Model of Data for Large Shared Data Banks
Relational Algebra
Oracle DB
UC Berkeley
Dr. Michael Stonebraker
Ingres
Informix
QUEL
ANSI C
CVS
BSD License
UUID
JSON
XML
HStore
PostGIS
BTree Index
GIN Index
GIST Index
KNN GIST
SPGIST
Full Text Search
BRIN Index
WAL (Write-Ahead Log)
SQLite
PGAdmin
Vim
Emacs
Linux
OLAP (Online Analytical Processing)
Postgres IRC
Postgres Slack
Postgres Conferences
UPSERT
Postgres Roadmap
CockroachDB
Podcast Interview
Citus Data
Podcast Interview
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
8/6/2018 • 56 minutes, 21 seconds
Mobile Data Collection And Analysis Using Ona And Canopy With Peter Lubell-Doughtie - Episode 41
Summary
With the attention being paid to the systems that power large volumes of high velocity data it is easy to forget about the value of data collection at human scales. Ona is a company that is building technologies to support mobile data collection, analysis of the aggregated information, and user-friendly presentations. In this episode CTO Peter Lubell-Doughtie describes the architecture of the platform, the types of environments and use cases where it is being employed, and the value of small data.
Preamble
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute.
Are you struggling to keep up with customer requests and letting errors slip into production? Want to try some of the innovative ideas in this podcast but don’t have time? DataKitchen’s DataOps software allows your team to quickly iterate and deploy pipelines of code, models, and data sets while improving quality. Unlike a patchwork of manual operations, DataKitchen makes your team shine by providing an end to end DataOps solution with minimal programming that uses the tools you love. Join the DataOps movement and sign up for the newsletter at datakitchen.io/de today. After that learn more about why you should be doing DataOps by listening to the Head Chef in the Data Kitchen at dataengineeringpodcast.com/datakitchen
Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Your host is Tobias Macey and today I’m interviewing Peter Lubell-Doughtie about using Ona for collecting data and processing it with Canopy
Interview
Introduction
How did you get involved in the area of data management?
What is Ona and how did the company get started?
What are some examples of the types of customers that you work with?
What types of data do you support in your collection platform?
What are some of the mechanisms that you use to ensure the accuracy of the data that is being collected by users?
Does your mobile collection platform allow for anyone to submit data without having to be associated with a given account or organization?
What are some of the integration challenges that are unique to the types of data that get collected by mobile field workers?
Can you describe the flow of the data from collection through to analysis?
To help improve the utility of the data being collected you have started building Canopy. What was the tipping point where it became worth the time and effort to start that project?
What are the architectural considerations that you factored in when designing it?
What have you found to be the most challenging or unexpected aspects of building an enterprise data warehouse for general users?
What are your plans for the future of Ona and Canopy?
Contact Info
Email
pld on Github
Website
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
OpenSRP
Ona
Canopy
Open Data Kit
Earth Institute at Columbia University
Sustainable Engineering Lab
WHO
Bill and Melinda Gates Foundation
XLSForms
PostGIS
Kafka
Druid
Superset
Postgres
Ansible
Docker
Terraform
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
7/30/2018 • 29 minutes, 14 seconds
Ceph: A Reliable And Scalable Distributed Filesystem with Sage Weil - Episode 40
Summary
When working with large volumes of data that you need to access in parallel across multiple instances you need a distributed filesystem that will scale with your workload. Even better is when that same system provides multiple paradigms for interacting with the underlying storage. Ceph is a highly available, highly scalable, and performant system that has support for object storage, block storage, and native filesystem access. In this episode Sage Weil, the creator and lead maintainer of the project, discusses how it got started, how it works, and how you can start using it on your infrastructure today. He also explains where it fits in the current landscape of distributed storage and the plans for future improvements.
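Since Ceph's RADOS Gateway speaks the S3 API, ordinary S3 client libraries can be pointed straight at a cluster. The sketch below uses boto3 with an assumed local gateway endpoint and placeholder credentials; the bucket and object names are likewise invented for the example.

```python
import boto3

# Point a standard S3 client at a Ceph RADOS Gateway endpoint (assumed URL and credentials).
s3 = boto3.client(
    "s3",
    endpoint_url="http://ceph-rgw.local:7480",
    aws_access_key_id="CEPH_ACCESS_KEY",
    aws_secret_access_key="CEPH_SECRET_KEY",
)

s3.create_bucket(Bucket="demo-bucket")
s3.put_object(Bucket="demo-bucket", Key="hello.txt", Body=b"stored in Ceph via the S3 API")

obj = s3.get_object(Bucket="demo-bucket", Key="hello.txt")
print(obj["Body"].read().decode())
```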
Preamble
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute.
Are you struggling to keep up with customer requests and letting errors slip into production? Want to try some of the innovative ideas in this podcast but don’t have time? DataKitchen’s DataOps software allows your team to quickly iterate and deploy pipelines of code, models, and data sets while improving quality. Unlike a patchwork of manual operations, DataKitchen makes your team shine by providing an end to end DataOps solution with minimal programming that uses the tools you love. Join the DataOps movement and sign up for the newsletter at datakitchen.io/de today. After that learn more about why you should be doing DataOps by listening to the Head Chef in the Data Kitchen at dataengineeringpodcast.com/datakitchen
Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
Join the community in the new Zulip chat workspace at dataengineeringpodcast.com/chat
Your host is Tobias Macey and today I’m interviewing Sage Weil about Ceph, an open source distributed file system that supports block storage, object storage, and a file system interface.
Interview
Introduction
How did you get involved in the area of data management?
Can you start with an overview of what Ceph is?
What was the motivation for starting the project?
What are some of the most common use cases for Ceph?
There are a large variety of distributed file systems. How would you characterize Ceph as it compares to other options (e.g. HDFS, GlusterFS, LionFS, SeaweedFS, etc.)?
Given that there is no single point of failure, what mechanisms do you use to mitigate the impact of network partitions?
What mechanisms are available to ensure data integrity across the cluster?
How is Ceph implemented and how has the design evolved over time?
What is required to deploy and manage a Ceph cluster?
What are the scaling factors for a cluster?
What are the limitations?
How does Ceph handle mixed write workloads with either a high volume of small files or a smaller volume of larger files?
In services such as S3 the data is segregated from block storage options like EBS or EFS. Since Ceph provides all of those interfaces in one project is it possible to use each of those interfaces to the same data objects in a Ceph cluster?
In what situations would you advise someone against using Ceph?
What are some of the most interested, unexpected, or challenging aspects of working with Ceph and the community?
What are some of the plans that you have for the future of Ceph?
Contact Info
Email
@liewegas on Twitter
liewegas on GitHub
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
Ceph
Red Hat
DreamHost
UC Santa Cruz
Los Alamos National Labs
Dream Objects
OpenStack
Proxmox
POSIX
GlusterFS
Hadoop
Ceph Architecture
Paxos
relatime
Prometheus
Zabbix
Kubernetes
NVMe
DNS-SD
Consul
EtcD
DNS SRV Record
Zeroconf
Bluestore
XFS
Erasure Coding
NFS
Seastar
Rook
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
7/16/2018 • 48 minutes, 30 seconds
Building Data Flows In Apache NiFi With Kevin Doran and Andy LoPresto - Episode 39
Summary
Data integration and routing is a constantly evolving problem and one that is fraught with edge cases and complicated requirements. The Apache NiFi project models this problem as a collection of data flows that are created through a self-service graphical interface. This framework provides a flexible platform for building a wide variety of integrations that can be managed and scaled easily to fit your particular needs. In this episode project members Kevin Doran and Andy LoPresto discuss the ways that NiFi can be used, how to start using it in your environment, and plans for future development. They also explain how it fits in the broad landscape of data tools, the interesting and challenging aspects of the project, and how to build new extensions.
Preamble
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute.
Are you struggling to keep up with customer requests and letting errors slip into production? Want to try some of the innovative ideas in this podcast but don’t have time? DataKitchen’s DataOps software allows your team to quickly iterate and deploy pipelines of code, models, and data sets while improving quality. Unlike a patchwork of manual operations, DataKitchen makes your team shine by providing an end to end DataOps solution with minimal programming that uses the tools you love. Join the DataOps movement and sign up for the newsletter at datakitchen.io/de today. After that learn more about why you should be doing DataOps by listening to the Head Chef in the Data Kitchen at dataengineeringpodcast.com/datakitchen
Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
Your host is Tobias Macey and today I’m interviewing Kevin Doran and Andy LoPresto about Apache NiFi
Interview
Introduction
How did you get involved in the area of data management?
Can you start by explaining what NiFi is?
What is the motivation for building a GUI as the primary interface for the tool when the current trend is to represent everything as code?
How did you get involved with the project?
Where does it sit in the broader landscape of data tools?
Does the data that is processed by NiFi flow through the servers that it is running on (à la Spark/Flink/Kafka), or does it orchestrate actions on other systems (à la Airflow/Oozie)?
How do you manage versioning and backup of data flows, as well as promoting them between environments?
One of the advertised features is tracking provenance for data flows that are managed by NiFi. How is that data collected and managed?
What types of reporting are available across this information?
What are some of the use cases or requirements that lend themselves well to being solved by NiFi?
When is NiFi the wrong choice?
What is involved in deploying and scaling a NiFi installation?
What are some of the system/network parameters that should be considered?
What are the scaling limitations?
What have you found to be some of the most interesting, unexpected, and/or challenging aspects of building and maintaining the NiFi project and community?
What do you have planned for the future of NiFi?
Contact Info
Kevin Doran
@kevdoran on Twitter
Email
Andy LoPresto
@yolopey on Twitter
Email
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
NiFi
HortonWorks DataFlow
HortonWorks
Apache Software Foundation
Apple
CSV
XML
JSON
Perl
Python
Internet Scale
Asset Management
Documentum
DataFlow
NSA (National Security Agency)
24 (TV Show)
Technology Transfer Program
Agile Software Development
Waterfall
Spark
Flink
Kafka
Oozie
Luigi
Airflow
FluentD
ETL (Extract, Transform, and Load)
ESB (Enterprise Service Bus)
MiNiFi
Java
C++
Provenance
Kubernetes
Apache Atlas
Data Governance
Kibana
K-Nearest Neighbors
DevOps
DSL (Domain Specific Language)
NiFi Registry
Artifact Repository
Nexus
NiFi CLI
Maven Archetype
IoT
Docker
Backpressure
NiFi Wiki
TLS (Transport Layer Security)
Mozilla TLS Observatory
NiFi Flow Design System
Data Lineage
GDPR (General Data Protection Regulation)
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
7/8/2018 • 1 hour, 4 minutes, 15 seconds
Leveraging Human Intelligence For Better AI At Alegion With Cheryl Martin - Episode 38
Summary
Data is often messy or incomplete, requiring human intervention to make sense of it before being usable as input to machine learning projects. This is problematic when the volume scales beyond a handful of records. In this episode Dr. Cheryl Martin, Chief Data Scientist for Alegion, discusses the importance of properly labeled information for machine learning and artificial intelligence projects, the systems that they have built to scale the process of incorporating human intelligence in the data preparation process, and the challenges inherent to such an endeavor.
Preamble
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute.
Are you struggling to keep up with customer requests and letting errors slip into production? Want to try some of the innovative ideas in this podcast but don’t have time? DataKitchen’s DataOps software allows your team to quickly iterate and deploy pipelines of code, models, and data sets while improving quality. Unlike a patchwork of manual operations, DataKitchen makes your team shine by providing an end to end DataOps solution with minimal programming that uses the tools you love. Join the DataOps movement and sign up for the newsletter at datakitchen.io/de today. After that learn more about why you should be doing DataOps by listening to the Head Chef in the Data Kitchen at dataengineeringpodcast.com/datakitchen
Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
Your host is Tobias Macey and today I’m interviewing Cheryl Martin, chief data scientist at Alegion, about data labelling at scale
Interview
Introduction
How did you get involved in the area of data management?
To start, can you explain the problem space that Alegion is targeting and how you operate?
When is it necessary to include human intelligence as part of the data lifecycle for ML/AI projects?
What are some of the biggest challenges associated with managing human input to data sets intended for machine usage?
For someone who is acting as human-intelligence provider as part of the workforce, what does their workflow look like?
What tools and processes do you have in place to ensure the accuracy of their inputs?
How do you prevent bad actors from contributing data that would compromise the trained model?
What are the limitations of crowd-sourced data labels?
When is it beneficial to incorporate domain experts in the process?
When doing data collection from various sources, how do you ensure that intellectual property rights are respected?
How do you determine the taxonomies to be used for structuring data sets that are collected, labeled or enriched for your customers?
What kinds of metadata do you track and how is that recorded/transmitted?
Do you think that human intelligence will be a necessary piece of ML/AI forever?
Contact Info
LinkedIn
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
Alegion
University of Texas at Austin
Cognitive Science
Labeled Data
Mechanical Turk
Computer Vision
Sentiment Analysis
Speech Recognition
Taxonomy
Feature Engineering
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
7/2/2018 • 46 minutes, 13 seconds
Package Management And Distribution For Your Data Using Quilt with Kevin Moore - Episode 37
Summary
Collaboration, distribution, and installation of software projects is largely a solved problem, but the same cannot be said of data. Every data team has a bespoke means of sharing data sets, versioning them, tracking related metadata and changes, and publishing them for use in the software systems that rely on them. The CEO and founder of Quilt Data, Kevin Moore, was sufficiently frustrated by this problem to create a platform that attempts to be the means by which data can be as collaborative and easy to work with as GitHub and your favorite programming language. In this episode he explains how the project came to be, how it works, and the many ways that you can start using it today.
Preamble
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute.
Are you struggling to keep up with customer requests and letting errors slip into production? Want to try some of the innovative ideas in this podcast but don’t have time? DataKitchen’s DataOps software allows your team to quickly iterate and deploy pipelines of code, models, and data sets while improving quality. Unlike a patchwork of manual operations, DataKitchen makes your team shine by providing an end to end DataOps solution with minimal programming that uses the tools you love. Join the DataOps movement and sign up for the newsletter at datakitchen.io/de today. After that learn more about why you should be doing DataOps by listening to the Head Chef in the Data Kitchen at dataengineeringpodcast.com/datakitchen
Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
Your host is Tobias Macey and today I’m interviewing Kevin Moore about Quilt Data, a platform and tooling for packaging, distributing, and versioning data
Interview
Introduction
How did you get involved in the area of data management?
What is the intended use case for Quilt and how did the project get started?
Can you step through a typical workflow of someone using Quilt?
How does that change as you go from a single user to a team of data engineers and data scientists?
Can you describe the elements of what a data package consists of?
What was your criteria for the file formats that you chose?
How is Quilt architected and what have been the most significant changes or evolutions since you first started?
How is the data registry implemented?
What are the limitations or edge cases that you have run into?
What optimizations have you made to accelerate synchronization of the data to and from the repository?
What are the limitations in terms of data volume, format, or usage?
What is your goal with the business that you have built around the project?
What are your plans for the future of Quilt?
Contact Info
Email
LinkedIn
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
Quilt Data
GitHub
Jobs
Reproducible Data Dependencies in Jupyter
Reproducible Machine Learning with Jupyter and Quilt
Allen Institute: Programmatic Data Access with Quilt
Quilt Example: MissingNo
Oracle
Pandas
Jupyter
Ycombinator
Data.World
Podcast Episode with CTO Bryon Jacob
Kaggle
Parquet
HDF5
Arrow
PySpark
Excel
Scala
Binder
Merkle Tree
Allen Institute for Cell Science
Flask
PostgreSQL
Docker
Airflow
Quilt Teams
Hive
Hive Metastore
PrestoDB
Podcast Episode
Netflix Iceberg
Kubernetes
Helm
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
6/25/2018 • 41 minutes, 43 seconds
User Analytics In Depth At Heap with Dan Robinson - Episode 36
Summary
Web and mobile analytics are an important part of any business, and difficult to get right. The most frustrating part is when you realize that you haven’t been tracking a key interaction and have to write custom logic to add that event, then wait to collect the data. Heap is a platform that automatically tracks every event so that you can retroactively decide which actions are important to your business and easily build reports with or without SQL. In this episode Dan Robinson, CTO of Heap, describes how they have architected their data infrastructure, how they build their tracking agents, and the data virtualization layer that enables users to define their own labels.
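To make the "capture everything, define events later" idea concrete, here is a small sketch of retroactively defining a virtual event over raw autocaptured interactions. The event shape and the signup definition are assumptions for illustration, not Heap's actual data model.

```python
# Raw autocaptured interactions (shape assumed for the example).
raw_events = [
    {"user": "u1", "type": "click", "selector": "button#signup", "path": "/"},
    {"user": "u2", "type": "click", "selector": "a.pricing",     "path": "/"},
    {"user": "u1", "type": "click", "selector": "button#signup", "path": "/promo"},
]

# A "virtual event" is just a predicate applied after the fact, so it also
# matches interactions that happened before anyone thought to define it.
def is_signup_click(event: dict) -> bool:
    return event["type"] == "click" and event["selector"] == "button#signup"

signups = [e for e in raw_events if is_signup_click(e)]
print(f"{len(signups)} signup clicks from {len({e['user'] for e in signups})} users")
```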
Preamble
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute.
For complete visibility into the health of your pipeline, including deployment tracking, and powerful alerting driven by machine-learning, DataDog has got you covered. With their monitoring, metrics, and log collection agent, including extensive integrations and distributed tracing, you’ll have everything you need to find and fix performance bottlenecks in no time. Go to dataengineeringpodcast.com/datadog today to start your free 14 day trial and get a sweet new T-Shirt.
Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
Your host is Tobias Macey and today I’m interviewing Dan Robinson about Heap and their approach to collecting, storing, and analyzing large volumes of data
Interview
Introduction
How did you get involved in the area of data management?
Can you start by giving a brief overview of Heap?
One of your differentiating features is the fact that you capture every interaction on web and mobile platforms for your customers. How do you prevent the user experience from suffering as a result of network congestion, while ensuring the reliable delivery of that data?
Can you walk through the lifecycle of a single event from source to destination and the infrastructure components that it traverses to get there?
Data collected in a user’s browser can often be messy due to various browser plugins, variations in runtime capabilities, etc. How do you ensure the integrity and accuracy of that information?
What are some of the difficulties that you have faced in establishing a representation of events that allows for uniform processing and storage?
What is your approach for merging and enriching event data with the information that you retrieve from your supported integrations?
What challenges does that pose in your processing architecture?
What are some of the problems that you have had to deal with to allow for processing and storing such large volumes of data?
How has that architecture changed or evolved over the life of the company?
What are some changes that you are anticipating in the near future?
Can you describe your approach for synchronizing customer data with their individual Redshift instances and the difficulties that entails?
What are some of the most interesting challenges that you have faced while building the technical and business aspects of Heap?
What changes have been necessary as a result of GDPR?
What are your plans for the future of Heap?
Contact Info
@danlovesproofs on twitter
dan@drob.us
@drob on github
heapanalytics.com / @heap on twitter
https://heapanalytics.com/blog/category/engineering
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
Heap
Palantir
User Analytics
Google Analytics
Piwik
Mixpanel
Hubspot
Jepsen
Chaos Engineering
Node.js
Kafka
Scala
Citus
React
MobX
Redshift
Heap SQL
BigQuery
Webhooks
Drip
Data Virtualization
DNS
PII
SOC2
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
6/17/2018 • 45 minutes, 27 seconds
CockroachDB In Depth with Peter Mattis - Episode 35
Summary
With the increased ease of gaining access to servers in data centers across the world has come the need for supporting globally distributed data storage. With the first wave of cloud-era databases the ability to replicate information geographically came at the expense of transactions and familiar query languages. To address these shortcomings the engineers at Cockroach Labs have built a globally distributed SQL database with full ACID semantics in CockroachDB. In this episode Peter Mattis, the co-founder and VP of Engineering at Cockroach Labs, describes the architecture that underlies the database, the challenges they have faced along the way, and the ways that you can use it in your own environments today.
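Because CockroachDB speaks the PostgreSQL wire protocol, a stock Postgres driver is enough to talk to it, which is what the ORM-compatibility question below gets at. This sketch uses psycopg2 against an assumed insecure local single-node cluster; the connection string and table are placeholders, and UPSERT is one of the CockroachDB-specific SQL extensions mentioned later in the interview.

```python
import psycopg2

# Assumed local insecure cluster listening on CockroachDB's default port.
conn = psycopg2.connect("postgresql://root@localhost:26257/defaultdb?sslmode=disable")
conn.autocommit = True

with conn.cursor() as cur:
    cur.execute("CREATE TABLE IF NOT EXISTS accounts (id INT PRIMARY KEY, balance INT)")
    cur.execute("UPSERT INTO accounts (id, balance) VALUES (1, 100), (2, 250)")
    cur.execute("SELECT id, balance FROM accounts ORDER BY id")
    for row in cur.fetchall():
        print(row)

conn.close()
```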
Preamble
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute.
For complete visibility into the health of your pipeline, including deployment tracking, and powerful alerting driven by machine-learning, DataDog has got you covered. With their monitoring, metrics, and log collection agent, including extensive integrations and distributed tracing, you’ll have everything you need to find and fix performance bottlenecks in no time. Go to dataengineeringpodcast.com/datadog today to start your free 14 day trial and get a sweet new T-Shirt.
Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
Your host is Tobias Macey and today I’m interviewing Peter Mattis about CockroachDB, the SQL database for global cloud services
Interview
Introduction
How did you get involved in the area of data management?
What was the motivation for creating CockroachDB and building a business around it?
Can you describe the architecture of CockroachDB and how it supports distributed ACID transactions?
What are some of the tradeoffs that are necessary to allow for georeplicated data with distributed transactions?
What are some of the problems that you have had to work around in the RAFT protocol to provide reliable operation of the clustering mechanism?
Go is an unconventional language for building a database. What are the pros and cons of that choice?
What are some of the common points of confusion that users of CockroachDB have when operating or interacting with it?
What are the edge cases and failure modes that users should be aware of?
I know that your SQL syntax is PostgreSQL-compatible, so is it possible to use existing ORMs unmodified with CockroachDB?
What are some examples of extensions that are specific to CockroachDB?
What are some of the most interesting uses of CockroachDB that you have seen?
When is CockroachDB the wrong choice?
What do you have planned for the future of CockroachDB?
Contact Info
Peter
LinkedIn
petermattis on GitHub
@petermattis on Twitter
Cockroach Labs
@CockroachDB on Twitter
Website
cockroachdb on GitHub
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
CockroachDB
Cockroach Labs
SQL
Google Bigtable
Spanner
NoSQL
RDBMS (Relational Database Management System)
“Big Iron” (colloquial term for mainframe computers)
RAFT Consensus Algorithm
Consensus
MVCC (Multiversion Concurrency Control)
Isolation
Etcd
GDPR
Golang
C++
Garbage Collection
Metaprogramming
Rust
Static Linking
Docker
Kubernetes
CAP Theorem
PostgreSQL
ORM (Object Relational Mapping)
Information Schema
PG Catalog
Interleaved Tables
Vertica
Spark
Change Data Capture
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
6/11/2018 • 43 minutes, 41 seconds
ArangoDB: Fast, Scalable, and Multi-Model Data Storage with Jan Steemann and Jan Stücke - Episode 34
Summary
Using a multi-model database in your applications can greatly reduce the amount of infrastructure and complexity required. ArangoDB is a storage engine that supports documents, key/value, and graph data formats, as well as being fast and scalable. In this episode Jan Steemann and Jan Stücke explain where Arango fits in the crowded database market, how it works under the hood, and how you can start working with it today.
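For a feel of what multi-model access looks like from application code, the sketch below uses the community python-arango driver to store documents and query them with AQL. The host, credentials, collection name, and query are all assumptions for the example, and the exact client constructor varies between driver versions.

```python
from arango import ArangoClient  # community python-arango driver

# Assumed local server and root credentials.
client = ArangoClient(hosts="http://localhost:8529")
db = client.db("_system", username="root", password="")

if not db.has_collection("users"):
    db.create_collection("users")
users = db.collection("users")

users.insert({"_key": "alice", "name": "Alice", "follows": ["bob"]})
users.insert({"_key": "bob", "name": "Bob", "follows": []})

# The same documents can be queried with AQL, ArangoDB's query language.
cursor = db.aql.execute("FOR u IN users FILTER LENGTH(u.follows) > 0 RETURN u.name")
print(list(cursor))
```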
Preamble
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute.
Go to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch.
Your host is Tobias Macey and today I’m interviewing Jan Stücke and Jan Steemann about ArangoDB, a multi-model distributed database for graph, document, and key/value storage.
Interview
Introduction
How did you get involved in the area of data management?
Can you give a high level description of what ArangoDB is and the motivation for creating it?
What is the story behind the name?
How is ArangoDB constructed?
How does the underlying engine store the data to allow for the different ways of viewing it?
What are some of the benefits of multi-model data storage?
When does it become problematic?
For users who are accustomed to a relational engine, how do they need to adjust their approach to data modeling when working with Arango?
How does it compare to OrientDB?
What are the options for scaling a running system?
What are the limitations in terms of network architecture or data volumes?
One of the unique aspects of ArangoDB is the Foxx framework for embedding microservices in the data layer. What benefits does that provide over a three tier architecture?
What mechanisms do you have in place to prevent data breaches from security vulnerabilities in the Foxx code?
What are some of the most interesting or surprising uses of this functionality that you have seen?
What are some of the most challenging technical and business aspects of building and promoting ArangoDB?
What do you have planned for the future of ArangoDB?
Contact Info
Jan Steemann
jsteemann on GitHub
@steemann on Twitter
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
ArangoDB
Köln
Multi-model Database
Graph Algorithms
Apache 2
C++
ArangoDB Foxx
Raft Protocol
Target Partners
RocksDB
AQL (ArangoDB Query Language)
OrientDB
PostgreSQL
OrientDB Studio
Google Spanner
3-Tier Architecture
Thomson-Reuters
Arango Search
Dell EMC
Google S2 Index
ArangoDB Geographic Functionality
JSON Schema
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
6/4/2018 • 40 minutes, 5 seconds
The Alooma Data Pipeline With CTO Yair Weinberger - Episode 33
Summary
Building an ETL pipeline is a common need across businesses and industries. It’s easy to get one started but difficult to manage as new requirements are added and greater scalability becomes necessary. Rather than duplicating the efforts of other engineers it might be best to use a hosted service to handle the plumbing so that you can focus on the parts that actually matter for your business. In this episode CTO and co-founder of Alooma, Yair Weinberger, explains how the platform addresses the common needs of data collection, manipulation, and storage while allowing for flexible processing. He describes the motivation for starting the company, how their infrastructure is architected, and the challenges of supporting multi-tenancy and a wide variety of integrations.
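Much of the per-event "manipulation" discussed in this episode boils down to a user-supplied transform function that receives an event dictionary and returns a cleaned-up version, or drops it. The sketch below is a generic illustration of that shape, not Alooma's exact Code Engine API; the field names and conventions are assumptions.

```python
from datetime import datetime, timezone

def transform(event: dict):
    """Clean a single event; return the event to keep it, or None to drop it."""
    # Drop internal test traffic (assumed convention for this example).
    if str(event.get("user_id", "")).startswith("test_"):
        return None

    # Normalize a numeric timestamp field to ISO-8601 UTC.
    if isinstance(event.get("timestamp"), (int, float)):
        event["timestamp"] = datetime.fromtimestamp(
            event["timestamp"], tz=timezone.utc
        ).isoformat()

    # Redact a field that should never reach the warehouse.
    event.pop("credit_card_number", None)
    return event

print(transform({"user_id": "u42", "timestamp": 1527465600, "credit_card_number": "x"}))
```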
Preamble
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute.
For complete visibility into the health of your pipeline, including deployment tracking, and powerful alerting driven by machine-learning, DataDog has got you covered. With their monitoring, metrics, and log collection agent, including extensive integrations and distributed tracing, you’ll have everything you need to find and fix performance bottlenecks in no time. Go to dataengineeringpodcast.com/datadog today to start your free 14 day trial and get a sweet new T-Shirt.
Go to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch.
Your host is Tobias Macey and today I’m interviewing Yair Weinberger about Alooma, a company providing data pipelines as a service
Interview
Introduction
How did you get involved in the area of data management?
What is Alooma and what is the origin story?
How is the Alooma platform architected?
I want to go into stream vs. batch here
What are the most challenging components to scale?
How do you manage the underlying infrastructure to support your SLA of 5 nines?
What are some of the complexities introduced by processing data from multiple customers with various compliance requirements?
How do you sandbox user’s processing code to avoid security exploits?
What are some of the potential pitfalls for automatic schema management in the target database?
Given the large number of integrations, how do you maintain the
What are some challenges when creating integrations? Isn’t it simply a matter of conforming to an external API?
For someone getting started with Alooma what does the workflow look like?
What are some of the most challenging aspects of building and maintaining Alooma?
What are your plans for the future of Alooma?
Contact Info
LinkedIn
@yairwein on Twitter
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
Alooma
Convert Media
Data Integration
ESB (Enterprise Service Bus)
Tibco
Mulesoft
ETL (Extract, Transform, Load)
Informatica
Microsoft SSIS
OLAP Cube
S3
Azure Cloud Storage
Snowflake DB
Redshift
BigQuery
Salesforce
Hubspot
Zendesk
Spark
The Log: What every software engineer should know about real-time data’s unifying abstraction by Jay Kreps
RDBMS (Relational Database Management System)
SaaS (Software as a Service)
Change Data Capture
Kafka
Storm
Google Cloud PubSub
Amazon Kinesis
Alooma Code Engine
Zookeeper
Idempotence
Kafka Streams
Kubernetes
SOC2
Jython
Docker
Python
Javascript
Ruby
Scala
PII (Personally Identifiable Information)
GDPR (General Data Protection Regulation)
Amazon EMR (Elastic Map Reduce)
Sequoia Capital
Lightspeed Investors
Redis
Aerospike
Cassandra
MongoDB
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
5/28/2018 • 47 minutes, 50 seconds
PrestoDB and Starburst Data with Kamil Bajda-Pawlikowski - Episode 32
Summary
Most businesses end up with data in a myriad of places with varying levels of structure. This makes it difficult to gain insights from across departments, projects, or people. Presto is a distributed SQL engine that allows you to tie all of your information together without having to first aggregate it all into a data warehouse. Kamil Bajda-Pawlikowski co-founded Starburst Data to provide support and tooling for Presto, as well as contributing advanced features back to the project. In this episode he describes how Presto is architected, how you can use it for your analytics, and the work that he is doing at Starburst Data.
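As a rough illustration of the federated querying described above, the following sketch uses the presto-python-client (prestodb) DB-API package to join data across two catalogs without first loading it into a warehouse. The coordinator address, credentials, and the catalog, schema, and table names are all assumptions made up for this example.

```python
# A minimal sketch of querying Presto from Python across two catalogs.
# Assumes a Presto coordinator on localhost:8080 and illustrative table names.
import prestodb

conn = prestodb.dbapi.connect(
    host="localhost",
    port=8080,
    user="analyst",
    catalog="hive",       # default catalog; queries can still reference others
    schema="default",
)
cur = conn.cursor()

# Join a Hive table against a PostgreSQL table in a single query (hypothetical names).
cur.execute("""
    SELECT u.region, count(*) AS orders
    FROM hive.web.orders AS o
    JOIN postgresql.public.users AS u ON o.user_id = u.id
    GROUP BY u.region
""")
for region, orders in cur.fetchall():
    print(region, orders)
```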
Preamble
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute.
Go to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch.
Your host is Tobias Macey and today I’m interviewing Kamil Bajda-Pawlikowski about Presto and his experiences with supporting it at Starburst Data
Interview
Introduction
How did you get involved in the area of data management?
Can you start by explaining what Presto is?
What are some of the common use cases and deployment patterns for Presto?
How does Presto compare to Drill or Impala?
What is it about Presto that led you to building a business around it?
What are some of the most challenging aspects of running and scaling Presto?
For someone who is using the Presto SQL interface, what are some of the considerations that they should keep in mind to avoid writing poorly performing queries?
How does Presto represent data for translating between its SQL dialect and the API of the data stores that it interfaces with?
What are some cases in which Presto is not the right solution?
What types of support have you found to be the most commonly requested?
What are some of the types of tooling or improvements that you have made to Presto in your distribution?
What are some of the notable changes that your team has contributed upstream to Presto?
Contact Info
Website
E-mail
Twitter – @starburstdata
Twitter – @prestodb
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
Starburst Data
Presto
Hadapt
Hadoop
Hive
Teradata
PrestoCare
Cost Based Optimizer
ANSI SQL
Spill To Disk
Tempto
Benchto
Geospatial Functions
Cassandra
Accumulo
Kafka
Redis
PostGreSQL
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
5/21/2018 • 42 minutes, 7 seconds
Brief Conversations From The Open Data Science Conference: Part 2 - Episode 31
Summary
The Open Data Science Conference brings together a variety of data professionals each year in Boston. This week’s episode consists of a pair of brief interviews conducted on-site at the conference. First up you’ll hear from Andy Eschbacher of Carto. He describes some of the complexities inherent to working with geospatial data, how they are handling it, and some of the interesting use cases that they enable for their customers. Next is Todd Blaschka, COO of TigerGraph. He explains how graph databases differ from relational engines, where graph algorithms are useful, and how TigerGraph is built to allow for fast and scalable operation.
Preamble
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute.
Go to dataengineeringpodcast.com to subscribe to the show, sign up for the mailing list, read the show notes, and get in touch.
Your host is Tobias Macey and last week I attended the Open Data Science Conference in Boston and recorded a few brief interviews on-site. In this second part you will hear from Andy Eschbacher of Carto about the challenges of managing geospatial data, as well as Todd Blaschka of TigerGraph about graph databases and how his company has managed to build a fast and scalable platform for graph storage and traversal.
Interview
Andy Eschbacher From Carto
What are the challenges associated with storing geospatial data?
What are some of the common misconceptions that people have about working with geospatial data?
Contact Info
andy-esch on GitHub
@MrEPhysics on Twitter
Website
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
Carto
Geospatial Analysis
GeoJSON
Todd Blaschka From TigerGraph
What are graph databases and how do they differ from relational engines?
What are some of the common difficulties that people have when dealing with graph algorithms?
How does data modeling for graph databases differ from relational stores?
Contact Info
LinkedIn
@toddblaschka on Twitter
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
TigerGraph
Graph Databases
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
5/14/2018 • 26 minutes, 5 seconds
Brief Conversations From The Open Data Science Conference: Part 1 - Episode 30
Summary
The Open Data Science Conference brings together a variety of data professionals each year in Boston. This week’s episode consists of a pair of brief interviews conducted on-site at the conference. First up you’ll hear from Alan Anders, the CTO of Applecart about their challenges with getting Spark to scale for constructing an entity graph from multiple data sources. Next I spoke with Stepan Pushkarev, the CEO, CTO, and Co-Founder of Hydrosphere.io about the challenges of running machine learning models in production and how his team tracks key metrics and samples production data to re-train and re-deploy those models for better accuracy and more robust operation.
Preamble
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute.
Go to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch.
Your host is Tobias Macey and this week I attended the Open Data Science Conference in Boston and recorded a few brief interviews on-site. First up you’ll hear from Alan Anders, the CTO of Applecart about their challenges with getting Spark to scale for constructing an entity graph from multiple data sources. Next I spoke with Stepan Pushkarev, the CEO, CTO, and Co-Founder of Hydrosphere.io about the challenges of running machine learning models in production and how his team tracks key metrics and samples production data to re-train and re-deploy those models for better accuracy and more robust operation.
Interview
Alan Anders from Applecart
What are the challenges of gathering and processing data from multiple data sources and representing them in a unified manner for merging into single entities?
What are the biggest technical hurdles at Applecart?
Contact Info
@alanjanders on Twitter
LinkedIn
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
Spark
DataBricks
DataBricks Delta
Applecart
Stepan Pushkarev from Hydrosphere.io
What is Hydrosphere.io?
What metrics do you track to determine when a machine learning model is not producing an appropriate output?
How do you determine which data points to sample for retraining the model?
How does the role of a machine learning engineer differ from data engineers and data scientists?
Contact Info
LinkedIn
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
Hydrosphere
Machine Learning Engineer
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
5/7/2018 • 32 minutes, 38 seconds
Metabase Self Service Business Intelligence with Sameer Al-Sakran - Episode 29
Summary
Business Intelligence software is often cumbersome and requires specialized knowledge of the tools and data to be able to ask and answer questions about the state of the organization. Metabase is a tool built with the goal of making the act of discovering information and asking questions of an organization’s data easy and self-service for non-technical users. In this episode the CEO of Metabase, Sameer Al-Sakran, discusses how and why the project got started, the ways that it can be used to build and share useful reports, some of the useful features planned for future releases, and how to get it set up to start using it in your environment.
Preamble
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute.
For complete visibility into the health of your pipeline, including deployment tracking, and powerful alerting driven by machine-learning, DataDog has got you covered. With their monitoring, metrics, and log collection agent, including extensive integrations and distributed tracing, you’ll have everything you need to find and fix performance bottlenecks in no time. Go to dataengineeringpodcast.com/datadog today to start your free 14 day trial and get a sweet new T-Shirt.
Go to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch.
Your host is Tobias Macey and today I’m interviewing Sameer Al-Sakran about Metabase, a free and open source tool for self service business intelligence
Interview
Introduction
How did you get involved in the area of data management?
The current goal for most companies is to be “data driven”. How would you define that concept?
How does Metabase assist in that endeavor?
What is the ratio of users that take advantage of the GUI query builder as opposed to writing raw SQL?
What level of complexity is possible with the query builder?
What have you found to be the typical use cases for Metabase in the context of an organization?
How do you manage scaling for large or complex queries?
What was the motivation for using Clojure as the language for implementing Metabase?
What is involved in adding support for a new data source?
What are the differentiating features of Metabase that would lead someone to choose it for their organization?
What have been the most challenging aspects of building and growing Metabase, both from a technical and business perspective?
What do you have planned for the future of Metabase?
Contact Info
Sameer
salsakran on GitHub
@sameeralsakran on Twitter
LinkedIn
Metabase
Website
@metabase on Twitter
metabase on GitHub
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
Expa
Metabase
Blackjet
Hadoop
Imeem
Maslow’s Hierarchy of Data Needs
2 Sided Marketplace
Honeycomb Interview
Excel
Tableau
Go-JEK
Clojure
React
Python
Scala
JVM
Redash
How To Lie With Data
Stripe
Braintree Payments
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
4/30/2018 • 44 minutes, 46 seconds
Octopai: Metadata Management for Better Business Intelligence with Amnon Drori - Episode 28
Summary
The information about how data is acquired and processed is often as important as the data itself. For this reason metadata management systems are built to track the journey of your business data to aid in analysis, presentation, and compliance. These systems are frequently cumbersome and difficult to maintain, so Octopai was founded to alleviate that burden. In this episode Amnon Drori, CEO and co-founder of Octopai, discusses the business problems he witnessed that led him to starting the company, how their systems are able to provide valuable tools and insights, and the direction that their product will be taking in the future.
Preamble
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 200Gbit network, all controlled by a brand new API you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute.
For complete visibility into the health of your pipeline, including deployment tracking, and powerful alerting driven by machine-learning, DataDog has got you covered. With their monitoring, metrics, and log collection agent, including extensive integrations and distributed tracing, you’ll have everything you need to find and fix performance bottlenecks in no time. Go to dataengineeringpodcast.com/datadog today to start your free 14 day trial and get a sweet new T-Shirt.
Go to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch.
Your host is Tobias Macey and today I’m interviewing Amnon Drori about OctopAI and the benefits of metadata management
Interview
Introduction
How did you get involved in the area of data management?
What is OctopAI and what was your motivation for founding it?
What are some of the types of information that you classify and collect as metadata?
Can you talk through the architecture of your platform?
What are some of the challenges that are typically faced by metadata management systems?
What is involved in deploying your metadata collection agents?
Once the metadata has been collected what are some of the ways in which it can be used?
What mechanisms do you use to ensure that customer data is segregated?
How do you identify and handle sensitive information during the collection step?
What are some of the most challenging aspects of your technical and business platforms that you have faced?
What are some of the plans that you have for OctopAI going forward?
Contact Info
Amnon
LinkedIn
@octopaiamnon on Twitter
OctopAI
@OctopaiBI on Twitter
Website
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
OctopAI
Metadata
Metadata Management
Data Integrity
CRM (Customer Relationship Management)
ERP (Enterprise Resource Planning)
Business Intelligence
ETL (Extract, Transform, Load)
Informatica
SAP
Data Governance
SSIS (SQL Server Integration Services)
Vertica
Airflow
Luigi
Oozie
GDPR (General Data Protection Regulation)
Root Cause Analysis
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
4/23/2018 • 39 minutes, 52 seconds
Data Engineering Weekly with Joe Crobak - Episode 27
Summary
The rate of change in the data engineering industry is alternately exciting and exhausting. Joe Crobak found his way into the work of data management by accident as so many of us do. After being engrossed with researching the details of distributed systems and big data management for his work he began sharing his findings with friends. This led to his creation of the Hadoop Weekly newsletter, which he recently rebranded as the Data Engineering Weekly newsletter. In this episode he discusses his experiences working as a data engineer in industry and at the USDS, his motivations and methods for creating a newsletter, and the insights that he has gleaned from it.
Preamble
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute.
Go to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch.
Your host is Tobias Macey and today I’m interviewing Joe Crobak about his work maintaining the Data Engineering Weekly newsletter, and the challenges of keeping up with the data engineering industry.
Interview
Introduction
How did you get involved in the area of data management?
What are some of the projects that you have been involved in that were most personally fulfilling?
As an engineer at the USDS working on the healthcare.gov and medicare systems, what were some of the approaches that you used to manage sensitive data?
Healthcare.gov has a storied history. How did the systems for processing and managing the data get architected to handle the load they were subjected to?
What was your motivation for starting a newsletter about the Hadoop space?
Can you speak to your reasoning for the recent rebranding of the newsletter?
How much of the content that you surface in your newsletter is found during your day-to-day work, versus explicitly searching for it?
After over 5 years of following the trends in data analytics and data infrastructure what are some of the most interesting or surprising developments?
What have you found to be the fundamental skills or areas of experience that have maintained relevance as new technologies in data engineering have emerged?
What is your workflow for finding and curating the content that goes into your newsletter?
What is your personal algorithm for filtering which articles, tools, or commentary gets added to the final newsletter?
How has your experience managing the newsletter influenced your areas of focus in your work and vice-versa?
What are your plans going forward?
Contact Info
Data Eng Weekly
Email
Twitter – @joecrobak
Twitter – @dataengweekly
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
USDS
National Labs
Cray
Amazon EMR (Elastic Map-Reduce)
Recommendation Engine
Netflix Prize
Hadoop
Cloudera
Puppet
healthcare.gov
Medicare
Quality Payment Program
HIPAA
NIST National Institute of Standards and Technology
PII (Personally Identifiable Information)
Threat Modeling
JBoss
Apache Web Server
MarkLogic
JMS (Java Message Service)
Load Balancer
COBOL
Hadoop Weekly
Data Engineering Weekly
Foursquare
NiFi
Kubernetes
Spark
Flink
Stream Processing
DataStax
RSS
The Flavors of Data Science and Engineering
CQRS
Change Data Capture
Jay Kreps
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
4/15/2018 • 43 minutes, 32 seconds
Defining DataOps with Chris Bergh - Episode 26
Summary
Managing an analytics project can be difficult due to the number of systems involved and the need to ensure that new information can be delivered quickly and reliably. That challenge can be met by adopting the practices and principles of lean manufacturing and agile software development, along with the cross-functional collaboration, feedback loops, and focus on automation championed by the DevOps movement. In this episode Christopher Bergh discusses ways that you can start adding reliability and speed to your workflow to deliver results with confidence and consistency.
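To make the idea of treating data checks like automated tests more concrete, here is a small illustrative sketch of a validation step that could gate a pipeline stage before data is promoted. The checks, field names, and thresholds are assumptions for the example, not DataKitchen’s implementation.

```python
# A minimal sketch of automated data tests in the DataOps spirit: validate a
# batch of records and fail the pipeline run if anything looks wrong.
from datetime import datetime

def check_batch(rows, required_fields, min_rows=1):
    """Return a list of failure messages for a batch of dict records."""
    failures = []
    if len(rows) < min_rows:
        failures.append(f"expected at least {min_rows} rows, got {len(rows)}")
    for i, row in enumerate(rows):
        for field in required_fields:
            if row.get(field) in (None, ""):
                failures.append(f"row {i}: missing required field '{field}'")
        try:
            datetime.fromisoformat(row["updated_at"])
        except (KeyError, ValueError):
            failures.append(f"row {i}: invalid or missing 'updated_at'")
    return failures

if __name__ == "__main__":
    batch = [
        {"order_id": "A-1", "amount": 12.5, "updated_at": "2018-04-08T12:00:00"},
        {"order_id": "", "amount": None, "updated_at": "not-a-date"},
    ]
    for problem in check_batch(batch, required_fields=["order_id", "amount"]):
        print("FAILED:", problem)
    # In a real workflow these failures would break the build or block promotion.
```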
Preamble
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute.
For complete visibility into the health of your pipeline, including deployment tracking, and powerful alerting driven by machine-learning, DataDog has got you covered. With their monitoring, metrics, and log collection agent, including extensive integrations and distributed tracing, you’ll have everything you need to find and fix performance bottlenecks in no time. Go to dataengineeringpodcast.com/datadog today to start your free 14 day trial and get a sweet new T-Shirt.
Go to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch.
Your host is Tobias Macey and today I’m interviewing Christopher Bergh about DataKitchen and the rise of DataOps
Interview
Introduction
How did you get involved in the area of data management?
How do you define DataOps?
How does it compare to the practices encouraged by the DevOps movement?
How does it relate to or influence the role of a data engineer?
How does a DataOps oriented workflow differ from other existing approaches for building data platforms?
One of the aspects of DataOps that you call out is the practice of providing multiple environments to provide a platform for testing the various aspects of the analytics workflow in a non-production context. What are some of the techniques that are available for managing data in appropriate volumes across those deployments?
The practice of testing logic as code is fairly well understood and has a large set of existing tools. What have you found to be some of the most effective methods for testing data as it flows through a system?
One of the practices of DevOps is to create feedback loops that can be used to ensure that business needs are being met. What are the metrics that you track in your platform to define the value that is being created and how the various steps in the workflow are proceeding toward that goal?
In order to keep feedback loops fast it is necessary for tests to run quickly. How do you balance the need for larger quantities of data to be used for verifying scalability/performance against optimizing for cost and speed in non-production environments?
How does the DataKitchen platform simplify the process of operationalizing a data analytics workflow?
As the need for rapid iteration and deployment of systems to capture, store, process, and analyze data becomes more prevalent how do you foresee that feeding back into the ways that the landscape of data tools are designed and developed?
Contact Info
LinkedIn
@ChrisBergh on Twitter
Email
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
DataOps Manifesto
DataKitchen
2017: The Year Of DataOps
Air Traffic Control
Chief Data Officer (CDO)
Gartner
W. Edwards Deming
DevOps
Total Quality Management (TQM)
Informatica
Talend
Agile Development
Cattle Not Pets
IDE (Integrated Development Environment)
Tableau
Delphix
Dremio
Pachyderm
Continuous Delivery by Jez Humble and Dave Farley
SLAs (Service Level Agreements)
XKCD Image Recognition Comic
Airflow
Luigi
DataKitchen Documentation
Continuous Integration
Continuous Delivery
Docker
Version Control
Git
Looker
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
4/8/2018 • 54 minutes, 30 seconds
ThreatStack: Data Driven Cloud Security with Pete Cheslock and Patrick Cable - Episode 25
Summary
Cloud computing and ubiquitous virtualization have changed the ways that our applications are built and deployed. This new environment requires a new way of tracking and addressing the security of our systems. ThreatStack is a platform that collects all of the data that your servers generate, monitors for unexpected anomalies in behavior that would indicate a breach, and notifies you in near-realtime. In this episode ThreatStack’s director of operations, Pete Cheslock, and senior infrastructure security engineer, Patrick Cable, discuss the data infrastructure that supports their platform, how they capture and process the data from client systems, and how that information can be used to keep your systems safe from attackers.
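To give a flavor of what "monitoring for unexpected anomalies in behavior" can look like in the simplest case, here is a toy z-score check on per-minute event counts. This is only an illustrative sketch of the concept, not ThreatStack’s detection logic, and the numbers are invented.

```python
# A minimal sketch of behavioral anomaly detection: flag a host or process when
# its current event rate deviates sharply from its recent baseline.
from statistics import mean, stdev

def is_anomalous(history, current, threshold=3.0):
    """Flag `current` if it is more than `threshold` standard deviations
    away from the mean of the recent per-minute event counts in `history`."""
    if len(history) < 2:
        return False  # not enough baseline to judge
    mu = mean(history)
    sigma = stdev(history) or 1e-9  # avoid division by zero on flat baselines
    return abs(current - mu) / sigma > threshold

if __name__ == "__main__":
    baseline = [12, 15, 11, 14, 13, 12, 16, 14]  # e.g. syscalls/min from one process
    print(is_anomalous(baseline, 15))   # False: within normal variation
    print(is_anomalous(baseline, 140))  # True: likely worth an alert
```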
Preamble
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute.
For complete visibility into the health of your pipeline, including deployment tracking, and powerful alerting driven by machine-learning, DataDog has got you covered. With their monitoring, metrics, and log collection agent, including extensive integrations and distributed tracing, you’ll have everything you need to find and fix performance bottlenecks in no time. Go to dataengineeringpodcast.com/datadog today to start your free 14 day trial and get a sweet new T-Shirt.
Go to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch.
Your host is Tobias Macey and today I’m interviewing Pete Cheslock and Pat Cable about the data infrastructure and security controls at ThreatStack
Interview
Introduction
How did you get involved in the area of data management?
Why don’t you start by explaining what ThreatStack does?
What was lacking in the existing options (services and self-hosted/open source) that ThreatStack solves for?
Can you describe the type(s) of data that you collect and how it is structured?
What is the high level data infrastructure that you use for ingesting, storing, and analyzing your customer data?
How do you ensure a consistent format of the information that you receive?
How do you ensure that the various pieces of your platform are deployed using the proper configurations and operating as intended?
How much configuration do you provide to the end user in terms of the captured data, such as sampling rate or additional context?
I understand that your original architecture used RabbitMQ as your ingest mechanism, which you then migrated to Kafka. What was your initial motivation for that change?
How much of a benefit has that been in terms of overall complexity and cost (both time and infrastructure)?
How do you ensure the security and provenance of the data that you collect as it traverses your infrastructure?
What are some of the most common vulnerabilities that you detect in your client’s infrastructure?
For someone who wants to start using ThreatStack, what does the setup process look like?
What have you found to be the most challenging aspects of building and managing the data processes in your environment?
What are some of the projects that you have planned to improve the capacity or capabilities of your infrastructure?
Contact Info
Pete Cheslock
@petecheslock on Twitter
Website
petecheslock on GitHub
Patrick Cable
@patcable on Twitter
Website
patcable on GitHub
ThreatStack
Website
@threatstack on Twitter
threatstack on GitHub
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
ThreatStack
SecDevOps
Sonian
EC2
Snort
Snorby
Suricata
Tripwire
Syscall (System Call)
AuditD
CloudTrail
Naxsi
Cloud Native
File Integrity Monitoring (FIM)
Amazon Web Services (AWS)
RabbitMQ
ZeroMQ
Kafka
Spark
Slack
PagerDuty
JSON
Microservices
Cassandra
ElasticSearch
Sensu
Service Discovery
Honeypot
Kubernetes
PostGreSQL
Druid
Flink
Launch Darkly
Chef
Consul
Terraform
CloudFormation
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
4/1/2018 • 51 minutes, 52 seconds
MarketStore: Managing Timeseries Financial Data with Hitoshi Harada and Christopher Ryan - Episode 24
Summary
The data that is used in financial markets is time oriented and multidimensional, which makes it difficult to manage in either relational or timeseries databases. To make this information more manageable the team at Alpaca built a new data store specifically for retrieving and analyzing data generated by trading markets. In this episode Hitoshi Harada, the CTO of Alpaca, and Christopher Ryan, their lead software engineer, explain their motivation for building MarketStore, how it operates, and how it has helped to simplify their development workflows.
Preamble
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute.
For complete visibility into the health of your pipeline, including deployment tracking, and powerful alerting driven by machine-learning, DataDog has got you covered. With their monitoring, metrics, and log collection agent, including extensive integrations and distributed tracing, you’ll have everything you need to find and fix performance bottlenecks in no time. Go to dataengineeringpodcast.com/datadog today to start your free 14 day trial and get a sweet new T-Shirt.
Go to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch.
Your host is Tobias Macey and today I’m interviewing Christopher Ryan and Hitoshi Harada about MarketStore, a storage server for large volumes of financial timeseries data
Interview
Introduction
How did you get involved in the area of data management?
What was your motivation for creating MarketStore?
What are the characteristics of financial time series data that make it challenging to manage?
What are some of the workflows that MarketStore is used for at Alpaca and how were they managed before it was available?
With MarketStore’s data coming from multiple third party services, how are you managing to keep the DB up-to-date and in sync with those services?
What is the worst case scenario if there is a total failure in the data store?
What guards have you built to prevent such a situation from occurring?
Since MarketStore is used for querying and analyzing data having to do with financial markets and there are potentially large quantities of money being staked on the results of that analysis, how do you ensure that the operations being performed in MarketStore are accurate and repeatable?
What were the most challenging aspects of building MarketStore and integrating it into the rest of your systems?
Motivation for open sourcing the code?
What is the next planned major feature for MarketStore, and what use-case is it aiming to support?
Contact Info
Christopher
Email
Hitoshi
Email
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
MarketStore
GitHub
Release Announcement
Alpaca
IBM
DB2
GreenPlum
Algorithmic Trading
Backtesting
OHLC (Open-High-Low-Close)
HDF5
Golang
C++
Timeseries Database List
InfluxDB
JSONRPC
Slait
CircleCI
GDAX
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
3/25/2018 • 33 minutes, 27 seconds
Stretching The Elastic Stack with Philipp Krenn - Episode 23
Summary
Search is a common requirement for applications of all varieties. Elasticsearch was built to make it easy to include search functionality in projects built in any language. From that foundation, the rest of the Elastic Stack has been built, expanding to many more use cases in the process. In this episode Philipp Krenn describes the various pieces of the stack, how they fit together, and how you can use them in your infrastructure to store, search, and analyze your data.
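As a quick illustration of the common pattern of Elasticsearch as the search engine behind an application, here is a small sketch using the official elasticsearch-py client. The index name, document fields, and localhost address are assumptions for the example, and exact argument names vary somewhat between client versions.

```python
# A minimal index-and-search sketch with the elasticsearch-py client,
# assuming a single node listening on localhost:9200.
from elasticsearch import Elasticsearch

es = Elasticsearch(["http://localhost:9200"])

# Index a document; Elasticsearch creates the index on first write.
es.index(index="articles", id="1", body={
    "title": "Stretching The Elastic Stack",
    "body": "Search is a common requirement for applications of all varieties.",
})

# Run a full-text query against the same index.
results = es.search(index="articles", body={
    "query": {"match": {"body": "search applications"}}
})
for hit in results["hits"]["hits"]:
    print(hit["_score"], hit["_source"]["title"])
```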
Preamble
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to build your next pipeline you’ll need somewhere to deploy it, so check out Linode. With private networking, shared block storage, node balancers, and a 40Gbit network, all controlled by a brand new API you’ve got everything you need to run a bullet-proof data platform. Go to dataengineeringpodcast.com/linode to get a $20 credit and launch a new server in under a minute.
For complete visibility into the health of your pipeline, including deployment tracking, and powerful alerting driven by machine-learning, DataDog has got you covered. With their monitoring, metrics, and log collection agent, including extensive integrations and distributed tracing, you’ll have everything you need to find and fix performance bottlenecks in no time. Go to dataengineeringpodcast.com/datadog today to start your free 14 day trial and get a sweet new T-Shirt.
Go to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch.
Your host is Tobias Macey and today I’m interviewing Philipp Krenn about the Elastic Stack and the ways that you can use it in your systems
Interview
Introduction
How did you get involved in the area of data management?
The Elasticsearch product has been around for a long time and is widely known, but can you give a brief overview of the other components that make up the Elastic Stack and how they work together?
Beyond the common pattern of using Elasticsearch as a search engine connected to a web application, what are some of the other use cases for the various pieces of the stack?
What are the common scaling bottlenecks that users should be aware of when they are dealing with large volumes of data?
What do you consider to be the biggest competition to the Elastic Stack as you expand the capabilities and target usage patterns?
What are the biggest challenges that you are tackling in the Elastic stack, technical or otherwise?
What are the biggest challenges facing Elastic as a company in the near to medium term?
Open source as a business model: https://www.elastic.co/blog/doubling-down-on-open
What is the vision for Elastic and the Elastic Stack going forward and what new features or functionality can we look forward to?
Contact Info
@xeraa on Twitter
xeraa on GitHub
Website
Email
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
Elastic
Vienna – Capital of Austria
What Is Developer Advocacy?
NoSQL
MongoDB
Elasticsearch
Cassandra
Neo4J
Hazelcast
Apache Lucene
Logstash
Kibana
Beats
X-Pack
ELK Stack
Metrics
APM (Application Performance Monitoring)
GeoJSON
Split Brain
Elasticsearch Ingest Nodes
PacketBeat
Elastic Cloud
Elasticon
Kibana Canvas
SwiftType
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
3/19/2018 • 51 minutes, 2 seconds
Database Refactoring Patterns with Pramod Sadalage - Episode 22
Summary
As software lifecycles move faster, the database needs to be able to keep up. Practices such as version controlled migration scripts and iterative schema evolution provide the necessary mechanisms to ensure that your data layer is as agile as your application. Pramod Sadalage saw the need for these capabilities during the early days of the introduction of modern development practices and co-authored a book to codify a large number of patterns to aid practitioners. In this episode he reflects on the current state of affairs and how things have changed over the past 12 years.
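To make the idea of version controlled migration scripts concrete, here is a self-contained sketch of a tiny migration runner: each schema change is an ordered script that is applied exactly once and recorded in the database itself. It uses SQLite from the standard library purely for illustration; real projects would typically reach for tools like Flyway, Liquibase, or DBMaintain mentioned in the links below.

```python
# A minimal sketch of versioned, repeatable schema migrations.
import sqlite3

MIGRATIONS = [
    ("001_create_customers", "CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)"),
    ("002_add_email_column", "ALTER TABLE customers ADD COLUMN email TEXT"),
]

def migrate(conn):
    # Track which migrations have already run in the database itself.
    conn.execute("CREATE TABLE IF NOT EXISTS schema_migrations (version TEXT PRIMARY KEY)")
    applied = {row[0] for row in conn.execute("SELECT version FROM schema_migrations")}
    for version, ddl in MIGRATIONS:
        if version in applied:
            continue  # already applied on a previous run
        conn.execute(ddl)
        conn.execute("INSERT INTO schema_migrations (version) VALUES (?)", (version,))
        conn.commit()

if __name__ == "__main__":
    with sqlite3.connect("app.db") as conn:
        migrate(conn)  # safe to run repeatedly; only new migrations are applied
```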
Preamble
Hello and welcome to the Data Engineering Podcast, the show about modern data infrastructure
When you’re ready to launch your next project you’ll need somewhere to deploy it. Check out Linode at dataengineeringpodcast.com/linode and get a $20 credit to try out their fast and reliable Linux virtual servers for running your data pipelines or trying out the tools you hear about on the show.
Go to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch.
You can help support the show by checking out the Patreon page which is linked from the site.
To help other people find the show you can leave a review on iTunes, or Google Play Music, and tell your friends and co-workers
Your host is Tobias Macey and today I’m interviewing Pramod Sadalage about refactoring databases and integrating database design into an iterative development workflow
Interview
Introduction
How did you get involved in the area of data management?
You first co-authored Refactoring Databases in 2006. What was the state of software and database system development at the time and why did you find it necessary to write a book on this subject?
What are the characteristics of a database that make them more difficult to manage in an iterative context?
How does the practice of refactoring in the context of a database compare to that of software?
How has the prevalence of data abstractions such as ORMs or ODMs impacted the practice of schema design and evolution?
Is there a difference in strategy when refactoring the data layer of a system when using a non-relational storage system?
How has the DevOps movement and the increased focus on automation affected the state of the art in database versioning and evolution?
What have you found to be the most problematic aspects of databases when trying to evolve the functionality of a system?
Looking back over the past 12 years, what has changed in the areas of database design and evolution?
How has the landscape of tooling for managing and applying database versioning changed since you first wrote Refactoring Databases?
What do you see as the biggest challenges facing us over the next few years?
Contact Info
Website
pramodsadalage on GitHub
@pramodsadalage on Twitter
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
Database Refactoring
Website
Book
Thoughtworks
Martin Fowler
Agile Software Development
XP (Extreme Programming)
Continuous Integration
The Book
Wikipedia
Test First Development
DDL (Data Definition Language)
DML (Data Modification Language)
DevOps
Flyway
Liquibase
DBMaintain
Hibernate
SQLAlchemy
ORM (Object Relational Mapper)
ODM (Object Document Mapper)
NoSQL
Document Database
MongoDB
OrientDB
CouchBase
CassandraDB
Neo4j
ArangoDB
Unit Testing
Integration Testing
OLAP (On-Line Analytical Processing)
OLTP (On-Line Transaction Processing)
Data Warehouse
Docker
QA (Quality Assurance)
HIPAA (Health Insurance Portability and Accountability Act)
PCI DSS (Payment Card Industry Data Security Standard)
Polyglot Persistence
Toplink Java ORM
Ruby on Rails
ActiveRecord Gem
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
3/12/2018 • 49 minutes, 5 seconds
The Future Data Economy with Roger Chen - Episode 21
Summary
Data is an increasingly sought after raw material for business in the modern economy. One of the factors driving this trend is the increase in applications for machine learning and AI which require large quantities of information to work from. As the demand for data becomes more widespread the market for providing it will begin to transform the ways that information is collected and shared among and between organizations. With his experience as a chair for the O’Reilly AI conference and an investor in data driven businesses Roger Chen is well versed in the challenges and solutions facing us. In this episode he shares his perspective on the ways that businesses can work together to create shared data resources that will allow them to reduce the redundancy of their foundational data and improve their overall effectiveness in collecting useful training sets for their particular products.
Preamble
Hello and welcome to the Data Engineering Podcast, the show about modern data infrastructure
When you’re ready to launch your next project you’ll need somewhere to deploy it. Check out Linode at dataengineeringpodcast.com/linode and get a $20 credit to try out their fast and reliable Linux virtual servers for running your data pipelines or trying out the tools you hear about on the show.
Go to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch.
You can help support the show by checking out the Patreon page which is linked from the site.
To help other people find the show you can leave a review on iTunes, or Google Play Music, and tell your friends and co-workers
A few announcements:
The O’Reilly AI Conference is also coming up. Happening April 29th to the 30th in New York it will give you a solid understanding of the latest breakthroughs and best practices in AI for business. Go to dataengineeringpodcast.com/aicon-new-york to register and save 20%
If you work with data or want to learn more about how the projects you have heard about on the show get used in the real world then join me at the Open Data Science Conference in Boston from May 1st through the 4th. It has become one of the largest events for data scientists, data engineers, and data driven businesses to get together and learn how to be more effective. To save 60% off your tickets go to dataengineeringpodcast.com/odsc-east-2018 and register.
Your host is Tobias Macey and today I’m interviewing Roger Chen about data liquidity and its impact on our future economies
Interview
Introduction
How did you get involved in the area of data management?
You wrote an essay discussing how the increasing usage of machine learning and artificial intelligence applications will result in a demand for data that necessitates what you refer to as ‘Data Liquidity’. Can you explain what you mean by that term?
What are some examples of the types of data that you envision as being foundational to multiple organizations and problem domains?
Can you provide some examples of the structures that could be created to facilitate data sharing across organizational boundaries?
Many companies view their data as a strategic asset and are therefore loath to provide access to other individuals or organizations. What encouragement can you provide that would convince them to externalize any of that information?
What kinds of storage and transmission infrastructure and tooling are necessary to allow for wider distribution of, and collaboration on, data assets?
What do you view as being the privacy implications from creating and sharing these larger pools of data inventory?
What do you view as some of the technical challenges associated with identifying and separating shared data from those that are specific to the business model of the organization?
With broader access to large data sets, how do you anticipate that impacting the types of businesses or products that are possible for smaller organizations?
Contact Info
@rgrchen on Twitter
LinkedIn
Angel List
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
Electrical Engineering
Berkeley
Silicon Nanophotonics
Data Liquidity In The Age Of Inference
Data Silos
Example of a Data Commons Cooperative
Google Maps Moat: An article describing how Google Maps has refined raw data to create a new product
Genomics
Phenomics
ImageNet
Open Data
Data Brokerage
Smart Contracts
IPFS
Dat Protocol
Homomorphic Encryption
FileCoin
Data Programming
Snorkel
Website
Podcast Interview
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
3/5/2018 • 42 minutes, 47 seconds
Honeycomb Data Infrastructure with Sam Stokes - Episode 20
Summary
One of the sources of data that often gets overlooked is the systems that we use to run our businesses. This data is not used to directly provide value to customers or understand the functioning of the business, but it is still a critical component of a successful system. Sam Stokes is an engineer at Honeycomb where he helps to build a platform that is able to capture all of the events and context that occur in our production environments and use them to answer all of your questions about what is happening in your system right now. In this episode he discusses the challenges inherent in capturing and analyzing event data, the tools that his team is using to make it possible, and how this type of knowledge can be used to improve your critical infrastructure.
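To make the "events and context" idea more tangible, here is a toy sketch of emitting one wide, structured event per unit of work, carrying the high-cardinality fields you might later want to slice by. This is only a conceptual illustration using the standard library, not Honeycomb’s client (libhoney), and the field names are made up.

```python
# A minimal sketch of wide, context-rich structured events for observability.
import json
import time
import uuid

def handle_request(user_id, endpoint):
    event = {
        "trace_id": str(uuid.uuid4()),
        "endpoint": endpoint,
        "user_id": user_id,        # high-cardinality fields are the whole point
        "build_sha": "abc1234",    # deploy context travels with every event
    }
    start = time.monotonic()
    try:
        time.sleep(0.01)           # stand-in for the real work
        event["status"] = 200
    except Exception as exc:
        event["status"] = 500
        event["error"] = repr(exc)
        raise
    finally:
        event["duration_ms"] = round((time.monotonic() - start) * 1000, 2)
        print(json.dumps(event))   # in practice, ship this to the event pipeline

handle_request(user_id="user-42", endpoint="/api/search")
```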
Preamble
Hello and welcome to the Data Engineering Podcast, the show about modern data infrastructure
When you’re ready to launch your next project you’ll need somewhere to deploy it. Check out Linode at dataengineeringpodcast.com/linode and get a $20 credit to try out their fast and reliable Linux virtual servers for running your data pipelines or trying out the tools you hear about on the show.
Go to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch.
You can help support the show by checking out the Patreon page which is linked from the site.
To help other people find the show you can leave a review on iTunes, or Google Play Music, and tell your friends and co-workers
A few announcements:
There is still time to register for the O’Reilly Strata Conference in San Jose, CA March 5th-8th. Use the link dataengineeringpodcast.com/strata-san-jose to register and save 20%
The O’Reilly AI Conference is also coming up. Happening April 29th to the 30th in New York it will give you a solid understanding of the latest breakthroughs and best practices in AI for business. Go to dataengineeringpodcast.com/aicon-new-york to register and save 20%
If you work with data or want to learn more about how the projects you have heard about on the show get used in the real world then join me at the Open Data Science Conference in Boston from May 1st through the 4th. It has become one of the largest events for data scientists, data engineers, and data driven businesses to get together and learn how to be more effective. To save 60% off your tickets go to dataengineeringpodcast.com/odsc-east-2018 and register.
Your host is Tobias Macey and today I’m interviewing Sam Stokes about his work at Honeycomb, a modern platform for observability of software systems
Interview
Introduction
How did you get involved in the area of data management?
What is Honeycomb and how did you get started at the company?
Can you start by giving an overview of your data infrastructure and the path that an event takes from ingest to graph?
What are the characteristics of the event data that you are dealing with and what challenges does it pose in terms of processing it at scale?
In addition to the complexities of ingesting and storing data with a high degree of cardinality, being able to quickly analyze it for customer reporting poses a number of difficulties. Can you explain how you have built your systems to facilitate highly interactive usage patterns?
A high degree of visibility into a running system is desirable for developers and systems administrators, but they are not always willing or able to invest the effort to fully instrument the code or servers that they want to track. What have you found to be the most difficult aspects of data collection, and do you have any tooling to simplify the implementation for users?
How does Honeycomb compare to other systems that are available off the shelf or as a service, and when is it not the right tool?
What have been some of the most challenging aspects of building, scaling, and marketing Honeycomb?
Contact Info
@samstokes on Twitter
Blog
samstokes on GitHub
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
Honeycomb
Retriever
Monitoring and Observability
Kafka
Column Oriented Storage
Elasticsearch
Elastic Stack
Django
Ruby on Rails
Heroku
Kubernetes
Launch Darkly
Splunk
Datadog
Cynefin Framework
Go-Lang
Terraform
AWS
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
2/26/2018 • 41 minutes, 33 seconds
Data Teams with Will McGinnis - Episode 19
Summary
The responsibilities of a data scientist and a data engineer often overlap and occasionally work at cross purposes. Despite these challenges it is possible for the two roles to work together effectively and produce valuable business outcomes. In this episode Will McGinnis shares the insights that he has gained from experience on how data teams can play to their strengths to the benefit of all.
Preamble
Hello and welcome to the Data Engineering Podcast, the show about modern data infrastructure
When you’re ready to launch your next project you’ll need somewhere to deploy it. Check out Linode at dataengineeringpodcast.com/linode and get a $20 credit to try out their fast and reliable Linux virtual servers for running your data pipelines or trying out the tools you hear about on the show.
Go to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch.
You can help support the show by checking out the Patreon page which is linked from the site.
To help other people find the show you can leave a review on iTunes, or Google Play Music, and tell your friends and co-workers
A few announcements:
There is still time to register for the O’Reilly Strata Conference in San Jose, CA March 5th-8th. Use the link dataengineeringpodcast.com/strata-san-jose to register and save 20%
The O’Reilly AI Conference is also coming up. Happening April 29th to the 30th in New York it will give you a solid understanding of the latest breakthroughs and best practices in AI for business. Go to dataengineeringpodcast.com/aicon-new-york to register and save 20%
If you work with data or want to learn more about how the projects you have heard about on the show get used in the real world then join me at the Open Data Science Conference in Boston from May 1st through the 4th. It has become one of the largest events for data scientists, data engineers, and data driven businesses to get together and learn how to be more effective. To save 60% off your tickets go to dataengineeringpodcast.com/odsc-east-2018 and register.
Your host is Tobias Macey and today I’m interviewing Will McGinnis about the relationship and boundaries between data engineers and data scientists
Interview
Introduction
How did you get involved in the area of data management?
The terms “Data Scientist” and “Data Engineer” are fluid and seem to have a different meaning for everyone who uses them. Can you share how you define those terms?
What parallels do you see between the relationships of data engineers and data scientists and those of developers and systems administrators?
Is there a particular size of organization or problem that serves as a tipping point for when you start to separate the two roles into the responsibilities of more than one person or team?
What are the benefits of splitting the responsibilities of data engineering and data science?
What are the disadvantages?
What are some strategies to ensure successful interaction between data engineers and data scientists?
How do you view these roles evolving as they become more prevalent across companies and industries?
Contact Info
Website
wdm0006 on GitHub
@willmcginniser on Twitter
LinkedIn
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
Blog Post: Tendencies of Data Engineers and Data Scientists
Predikto
Categorical Encoders
DevOps
SciKit-Learn
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
2/19/2018 • 28 minutes, 38 seconds
TimescaleDB: Fast And Scalable Timeseries with Ajay Kulkarni and Mike Freedman - Episode 18
Summary
As communications between machines become more commonplace the need to store the generated data in a time-oriented manner increases. The market for timeseries data stores has many contenders, but they are not all built to solve the same problems or to scale in the same manner. In this episode the founders of TimescaleDB, Ajay Kulkarni and Mike Freedman, discuss how Timescale was started, the problems that it solves, and how it works under the covers. They also explain how you can start using it in your infrastructure and their plans for the future.
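As a brief illustration of what working with Timescale looks like in practice, the sketch below creates a hypertable and inserts a reading from Python. It assumes a local PostGreSQL instance with the timescaledb extension available and uses illustrative connection details and table names; create_hypertable() and time_bucket() are Timescale’s own SQL functions.

```python
# A minimal sketch of writing time series data into TimescaleDB via psycopg2.
import psycopg2

conn = psycopg2.connect("dbname=metrics user=postgres host=localhost")
with conn, conn.cursor() as cur:
    cur.execute("CREATE EXTENSION IF NOT EXISTS timescaledb")
    cur.execute("""
        CREATE TABLE IF NOT EXISTS conditions (
            time        TIMESTAMPTZ NOT NULL,
            device_id   TEXT        NOT NULL,
            temperature DOUBLE PRECISION
        )
    """)
    # Turn the plain table into a time-partitioned hypertable.
    cur.execute("SELECT create_hypertable('conditions', 'time', if_not_exists => TRUE)")
    cur.execute(
        "INSERT INTO conditions (time, device_id, temperature) VALUES (now(), %s, %s)",
        ("sensor-1", 21.7),
    )
    # Ordinary SQL works for analysis, e.g. per-minute averages with time_bucket().
    cur.execute("""
        SELECT time_bucket('1 minute', time) AS minute, avg(temperature)
        FROM conditions GROUP BY minute ORDER BY minute
    """)
    print(cur.fetchall())
```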
Preamble
Hello and welcome to the Data Engineering Podcast, the show about modern data infrastructure
When you’re ready to launch your next project you’ll need somewhere to deploy it. Check out Linode at dataengineeringpodcast.com/linode and get a $20 credit to try out their fast and reliable Linux virtual servers for running your data pipelines or trying out the tools you hear about on the show.
Go to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch.
You can help support the show by checking out the Patreon page which is linked from the site.
To help other people find the show you can leave a review on iTunes, or Google Play Music, and tell your friends and co-workers
Your host is Tobias Macey and today I’m interviewing Ajay Kulkarni and Mike Freedman about Timescale DB, a scalable timeseries database built on top of PostGreSQL
Interview
Introduction
How did you get involved in the area of data management?
Can you start by explaining what Timescale is and how the project got started?
The landscape of time series databases is extensive and oftentimes difficult to navigate. How do you view your position in that market and what makes Timescale stand out from the other options?
In your blog post that explains the design decisions for how Timescale is implemented you call out the fact that the inserted data is largely append only which simplifies the index management. How does Timescale handle out of order timestamps, such as from infrequently connected sensors or mobile devices?
How is Timescale implemented and how has the internal architecture evolved since you first started working on it?
What impact has the 10.0 release of PostGreSQL had on the design of the project?
Is timescale compatible with systems such as Amazon RDS or Google Cloud SQL?
For someone who wants to start using Timescale what is involved in deploying and maintaining it?
What are the axes for scaling Timescale and what are the points where that scalability breaks down?
Are you aware of anyone who has deployed it on top of Citus for scaling horizontally across instances?
What has been the most challenging aspect of building and marketing Timescale?
When is Timescale the wrong tool to use for time series data?
One of the use cases that you call out on your website is for systems metrics and monitoring. How does Timescale fit into that ecosystem and can it be used along with tools such as Graphite or Prometheus?
What are some of the most interesting uses of Timescale that you have seen?
Which came first, Timescale the business or Timescale the database, and what is your strategy for ensuring that the open source project and the company around it both maintain their health?
What features or improvements do you have planned for future releases of Timescale?
Contact Info
Ajay
LinkedIn
@acoustik on Twitter
Timescale Blog
Mike
Website
LinkedIn
@michaelfreedman on Twitter
Timescale Blog
Timescale
Website
@timescaledb on Twitter
GitHub
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
Timescale
PostGreSQL
Citus
Timescale Design Blog Post
MIT
NYU
Stanford
SDN
Princeton
Machine Data
Timeseries Data
List of Timeseries Databases
NoSQL
Online Transaction Processing (OLTP)
Object Relational Mapper (ORM)
Grafana
Tableau
Kafka
When Boring Is Awesome
PostGreSQL
RDS
Google Cloud SQL
Azure DB
Docker
Continuous Aggregates
Streaming Replication
PGPool II
Kubernetes
Docker Swarm
Citus Data
Website
Data Engineering Podcast Interview
Database Indexing
B-Tree Index
GIN Index
GIST Index
STE Energy
Redis
Graphite
Prometheus
pgprometheus
OpenMetrics Standard Proposal
Timescale Parallel Copy
Hadoop
PostGIS
KDB+
DevOps
Internet of Things
MongoDB
Elastic
DataBricks
Apache Spark
Confluent
New Enterprise Associates
MapD
Benchmark Ventures
Hortonworks
Two Sigma Ventures
CockroachDB
Cloudflare
EMC
Timescale Blog: Why SQL is beating NoSQL, and what this means for the future of data
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
2/11/2018 • 1 hour, 2 minutes, 40 seconds
Pulsar: Fast And Scalable Messaging with Rajan Dhabalia and Matteo Merli - Episode 17
Summary
One of the critical components for modern data infrastructure is a scalable and reliable messaging system. Publish-subscribe systems have been popular for many years, and recently stream oriented systems such as Kafka have been rising in prominence. This week Rajan Dhabalia and Matteo Merli discuss the work they have done on Pulsar, which supports both options, in addition to being globally scalable and fast. They explain how Pulsar is architected, how to scale it, and how it fits into your existing infrastructure.
Preamble
Hello and welcome to the Data Engineering Podcast, the show about modern data infrastructure
When you’re ready to launch your next project you’ll need somewhere to deploy it. Check out Linode at dataengineeringpodcast.com/linode and get a $20 credit to try out their fast and reliable Linux virtual servers for running your data pipelines or trying out the tools you hear about on the show.
Go to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch.
You can help support the show by checking out the Patreon page which is linked from the site.
To help other people find the show you can leave a review on iTunes, or Google Play Music, and tell your friends and co-workers
A few announcements:
There is still time to register for the O’Reilly Strata Conference in San Jose, CA March 5th-8th. Use the link dataengineeringpodcast.com/strata-san-jose to register and save 20%
The O’Reilly AI Conference is also coming up. Happening April 29th to the 30th in New York it will give you a solid understanding of the latest breakthroughs and best practices in AI for business. Go to dataengineeringpodcast.com/aicon-new-york to register and save 20%
If you work with data or want to learn more about how the projects you have heard about on the show get used in the real world then join me at the Open Data Science Conference in Boston from May 1st through the 4th. It has become one of the largest events for data scientists, data engineers, and data driven businesses to get together and learn how to be more effective. To save 60% off your tickets go to dataengineeringpodcast.com/odsc-east-2018 and register.
Your host is Tobias Macey and today I’m interviewing Rajan Dhabalia and Matteo Merli about Pulsar, a distributed open source pub-sub messaging system
Interview
Introduction
How did you get involved in the area of data management?
Can you start by explaining what Pulsar is and what the original inspiration for the project was?
What have been some of the most challenging aspects of building and promoting Pulsar?
For someone who wants to run Pulsar, what are the infrastructure and network requirements that they should be considering and what is involved in deploying the various components?
What are the scaling factors for Pulsar and what aspects of deployment and administration should users pay special attention to?
What projects or services do you consider to be competitors to Pulsar and what makes it stand out in comparison?
The documentation mentions that there is an API layer that provides drop-in compatibility with Kafka. Does that extend to also supporting some of the plugins that have developed on top of Kafka?
One of the popular aspects of Kafka is the persistence of the message log, so I’m curious how Pulsar manages long-term storage and reprocessing of messages that have already been acknowledged? (See the reader sketch after this list.)
When is Pulsar the wrong tool to use?
What are some of the improvements or new features that you have planned for the future of Pulsar?
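As a rough sketch of the pub-sub and log-replay behavior asked about above, the following snippet uses the pulsar-client Python package against a locally running broker. The topic name and message contents are assumptions made for the example.

```python
# Hypothetical sketch: Pulsar retains the message log in BookKeeper, so a Reader
# can re-read messages that consumers have already acknowledged. Assumes a broker
# on localhost and the `pulsar-client` package; the topic name is made up.
import pulsar

client = pulsar.Client("pulsar://localhost:6650")
topic = "persistent://public/default/sensor-readings"

producer = client.create_producer(topic)
for i in range(3):
    producer.send(f"reading-{i}".encode("utf-8"))
producer.close()

# A Reader positioned at the earliest retained message replays the log from the
# beginning, independent of any consumer's acknowledgements.
reader = client.create_reader(topic, pulsar.MessageId.earliest)
for _ in range(3):
    msg = reader.read_next()
    print(msg.message_id(), msg.data().decode("utf-8"))

reader.close()
client.close()
```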
Contact Info
Matteo
merlimat on GitHub
@merlimat on Twitter
Rajan
@dhabaliaraj on Twitter
rhabalia on GitHub
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
Pulsar
Publish-Subscribe
Yahoo
Streamlio
ActiveMQ
Kafka
Bookkeeper
SLA (Service Level Agreement)
Write-Ahead Log
Ansible
Zookeeper
Pulsar Deployment Instructions
RabbitMQ
Confluent Schema Registry
Podcast Interview
Kafka Connect
Wallaroo
Podcast Interview
Kinesis
Athenz
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
2/4/2018 • 53 minutes, 46 seconds
Dat: Distributed Versioned Data Sharing with Danielle Robinson and Joe Hand - Episode 16
Summary
Sharing data across multiple computers, particularly when it is large and changing, is a difficult problem to solve. In order to provide a simpler way to distribute and version data sets among collaborators the Dat Project was created. In this episode Danielle Robinson and Joe Hand explain how the project got started, how it functions, and some of the many ways that it can be used. They also explain the plans that the team has for upcoming features and uses that you can watch out for in future releases.
Preamble
Hello and welcome to the Data Engineering Podcast, the show about modern data infrastructure
When you’re ready to launch your next project you’ll need somewhere to deploy it. Check out Linode at dataengineeringpodcast.com/linode and get a $20 credit to try out their fast and reliable Linux virtual servers for running your data pipelines or trying out the tools you hear about on the show.
Continuous delivery lets you get new features in front of your users as fast as possible without introducing bugs or breaking production and GoCD is the open source platform made by the people at Thoughtworks who wrote the book about it. Go to dataengineeringpodcast.com/gocd to download and launch it today. Enterprise add-ons and professional support are available for added peace of mind.
Go to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch.
You can help support the show by checking out the Patreon page which is linked from the site.
To help other people find the show you can leave a review on iTunes, or Google Play Music, and tell your friends and co-workers
A few announcements:
There is still time to register for the O’Reilly Strata Conference in San Jose, CA March 5th-8th. Use the link dataengineeringpodcast.com/strata-san-jose to register and save 20%
The O’Reilly AI Conference is also coming up. Happening April 29th to the 30th in New York it will give you a solid understanding of the latest breakthroughs and best practices in AI for business. Go to dataengineeringpodcast.com/aicon-new-york to register and save 20%
If you work with data or want to learn more about how the projects you have heard about on the show get used in the real world then join me at the Open Data Science Conference in Boston from May 1st through the 4th. It has become one of the largest events for data scientists, data engineers, and data driven businesses to get together and learn how to be more effective. To save 60% off your tickets go to dataengineeringpodcast.com/odsc-east-2018 and register.
Your host is Tobias Macey and today I’m interviewing Danielle Robinson and Joe Hand about Dat Project, a distributed data sharing protocol for building applications of the future
Interview
Introduction
How did you get involved in the area of data management?
What is the Dat project and how did it get started?
How have the grants to the Dat project influenced the focus and pace of development that was possible?
Now that you have established a non-profit organization around Dat, what are your plans to support future sustainability and growth of the project?
Can you explain how the Dat protocol is designed and how it has evolved since it was first started? (A minimal signing sketch follows this list.)
How does Dat manage conflict resolution and data versioning when replicating between multiple machines?
One of the primary use cases that is mentioned in the documentation and website for Dat is that of hosting and distributing open data sets, with a focus on researchers. How does Dat help with that effort and what improvements does it offer over other existing solutions?
One of the difficult aspects of building a peer-to-peer protocol is that of establishing a critical mass of users to add value to the network. How have you approached that effort and how much progress do you feel that you have made?
How does the peer-to-peer nature of the platform affect the architectural patterns for people wanting to build applications that are delivered via dat, vs the common three-tier architecture oriented around persistent databases?
What mechanisms are available for content discovery, given the fact that Dat URLs are private and unguessable by default?
For someone who wants to start using Dat today, what is involved in creating and/or consuming content that is available on the network?
What have been the most challenging aspects of building and promoting Dat?
What are some of the most interesting or inspiring uses of the Dat protocol that you are aware of?
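As a rough sketch of the signed, append-only register design asked about above, the following snippet uses PyNaCl's Ed25519 primitives. It illustrates the general idea only; it is not Dat's actual data structures or wire format, and the log entries are made up.

```python
# Hypothetical sketch of the key idea behind Dat's writable archives: the creator
# holds a private signing key, the public key doubles as the archive address, and
# every appended entry is signed so readers can verify where it came from.
from nacl.exceptions import BadSignatureError
from nacl.signing import SigningKey, VerifyKey

# Creating a new "dat": generate a key pair. The public key is shared as the link.
signing_key = SigningKey.generate()
public_key = signing_key.verify_key.encode()   # stands in for the dat:// address

log = []  # stand-in for the append-only metadata register

def append_entry(entry: bytes) -> None:
    """Only the holder of the private key can add entries."""
    log.append(signing_key.sign(entry))        # signature prepended to the message

def verify_log(reader_public_key: bytes) -> bool:
    """Anyone with the public key can check every entry's provenance."""
    verify_key = VerifyKey(reader_public_key)
    try:
        for signed in log:
            verify_key.verify(signed)          # raises BadSignatureError if forged
        return True
    except BadSignatureError:
        return False

append_entry(b"coffee.csv: 3 cups on 2018-02-01")
append_entry(b"coffee.csv: 4 cups on 2018-02-02")
print(verify_log(public_key))  # True for an untampered log
```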
Contact Info
Dat
datproject.org
Email
@dat_project on Twitter
Dat Chat
Danielle
Email
@daniellecrobins
Joe
Email
@joeahand on Twitter
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
Dat Project
Code For Science and Society
Neuroscience
Cell Biology
OpenCon
Mozilla Science
Open Education
Open Access
Open Data
Fortune 500
Data Warehouse
Knight Foundation
Alfred P. Sloan Foundation
Gordon and Betty Moore Foundation
Dat In The Lab
Dat in the Lab blog posts
California Digital Library
IPFS
Dat on Open Collective – COMING SOON!
ScienceFair
Stencila
eLIFE
Git
BitTorrent
Dat Whitepaper
Merkle Tree
Certificate Transparency
Dat Protocol Working Group
Dat Multiwriter Development – Hyperdb
Beaker Browser
WebRTC
IndexedDB
Rust
C
Keybase
PGP
Wire
Zenodo
Dryad Data Sharing
Dataverse
RSync
FTP
Globus
Fritter
Fritter Demo
Rotonde how to
Joe’s website on Dat
Dat Tutorial
Data Rescue – NYTimes Coverage
Data.gov
Libraries+ Network
UC Conservation Genomics Consortium
Fair Data principles
hypervision
hypervision in browser
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
Transcript:
Tobias Macey 00:13
Hello and welcome to the Data Engineering Podcast, the show about modern data management. When you’re ready to launch your next project, you’ll need somewhere to deploy it, so you should check out Linode at dataengineeringpodcast.com/linode and get a $20 credit to try out their fast and reliable Linux virtual servers for running your data pipelines or trying out the tools you hear about on the show. Go to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch. You can help support the show by checking out the Patreon page which is linked from the site. To help other people find the show you can leave a review on iTunes or Google Play Music, tell your friends and co-workers, and share it on social media. I’ve got a couple of announcements before we start the show. There’s still time to register for the O’Reilly Strata Conference in San Jose, California, happening from March 5th to the 8th. Use the link dataengineeringpodcast.com/strata-san-jose to register and save 20% off your tickets. The O’Reilly AI Conference is also coming up, happening April 29th to the 30th in New York. It will give you a solid understanding of the latest breakthroughs and best practices in AI for business. Go to dataengineeringpodcast.com/aicon-new-york to register and save 20% off the tickets. Also, if you work with data or want to learn more about how the projects you have heard about on the show get used in the real world, then join me at the Open Data Science Conference happening in Boston from May 1st through the 4th. It has become one of the largest events for data scientists, data engineers, and data driven businesses to get together and learn how to be more effective. To save 60% off your tickets go to dataengineeringpodcast.com/odsc-east-2018 and register. Your host is Tobias Macey, and today I’m interviewing Danielle Robinson and Joe Hand about the Dat project, the distributed data sharing protocol for building applications of the future. So Danielle, could you start by introducing yourself? Sure.
Danielle Robinson 02:10
My name is Danielle Robinson. And I’m the co-executive director of Code for Science and Society, which is the nonprofit that supports the Dat project. I’ve been working on Dat-related projects, first as partnerships director, for about a year now. And I’m here with my colleague, Joe Hand, take it away, Joe.
Joe Hand 02:32
Joe Hand, and I’m the other co-executive director and the director of operations at Code for Science and Society. And I’ve been a core contributor for about two years now.
Tobias Macey 02:42
And Danielle, starting with you again, can you talk about how you first got involved and interested in the area of data management? Sure.
Danielle Robinson 02:48
So I have a PhD in neuroscience. I finished that about a year and a half ago. And what I did during my PhD, my research was focused on cell biology. Really, without getting into the weeds too much on that, I spent a lot of time on microscopes collecting some kind of medium-sized imaging data. And during that process, I became pretty frustrated with the academic and publishing systems that seemed to be limiting the access of people to the results of taxpayer funded research. So publications are behind paywalls. And data is either not published along with the paper, or sometimes is published but not well archived and becomes inaccessible over time. Sort of compounding this, traditionally code has not really been thought of as an academic, a scholarly work. So that’s a whole other conversation. But even though these things are changing, data and code aren’t shared consistently, and are pretty inconsistently managed within labs, I think that’s fair to say. And what that does is it makes it really hard to reproduce or replicate other people’s research, which is important for the scientific process. So during my PhD, I got really active in the OpenCon and Mozilla Science communities, which I encourage your listeners to check out. These communities build interdisciplinary connections between the open source world and the open education, open access, and open data communities. And that’s really important to build things that people will actually use and make big cultural and policy changes that will make it easier to access research and share data. So I got involved partly because of the technical challenge. But also I’m interested in the people problems. So the changes to the incentive structure and the culture of research that are needed to make data management better on a day to day basis and make our research infrastructure stronger and more long lasting.
Tobias Macey 04:54
And Joe, how did you get involved in data management?
Joe Hand 04:57
Yeah, I’ve sort of gone back and forth between the more academic or research data management side and the more traditional software side. So I really got started in data management when I was at a data visualization agency. And we basically built, you know, pretty web based, interactive visualizations for a variety of clients. This was cool, because it sort of allowed me to see a large variety of data management techniques. So there was the small scale, spreadsheet and manually updating data in spreadsheets, and then sending that off to visualize, up to big Fortune 500 companies that had data warehouses and full internal APIs that we got access to. So it was really cool to see that variety of data collection and data usage between all those organizations. So that was also good, because it sort of helped me understand how to use data effectively. And that really means telling a story around it. So you know, in order to use data, you have to either use some math or some visual representation, and the best stories around data combine a bit of both of those. And then from there, I moved to a research institute. And we were tasked with building a data platform for an international NGO. And that group basically does census data collection in slums all over the world. And so as a research group, we were interested in using that data for research, but we also had to help them figure out how to collect that data. So before we came in with that project, they’d basically been doing 30 years of data collection on paper, and then sometimes manually entering that data into spreadsheets, and then trying to share that around through thumb drives or Dropbox or whatever tools they had access to. So this was cool, because it really gave me a great opportunity to see the other side of data management and analysis. So, you know, we worked with the corporate clients, which have lots of resources and compute resources and cloud servers. And this was sort of the other side, where there’s very few resources, most of the data analysis happens offline, and a lot of the data transfer happens offline. So it was really cool and interesting to see that a lot of the tools I’d been taking for granted couldn’t be applied in those areas. And then on the research side of things, I saw that, you know, as scientists and governments, they were just sort of haphazardly organizing data in the same way. So I was trying to collect and download census data from about 30 countries. And we had to email and fax people, and we got different CDs and paper documents and PDFs in other languages. So that really illustrated that there’s a lot of data managed out there in a way that I wasn’t totally familiar with. And it’s just very crazy how everybody manages their data in a different way. And that’s sort of along what I like to call the long tail of data management. So people that don’t use traditional databases, or manage it in their own unique ways. And for most people managing data in that way, you probably wouldn’t call it data, but it’s just sort of what they use to get their job done. And so once I started to look at alternatives for managing that research data, I found Dat basically, and was hooked and started to contribute. So that’s sort of how I found Dat.
Tobias Macey 08:16
So that leads us nicely into talking about what the project is. And as much of the origin story each of you might be aware of. And Joe, you already mentioned how you got involved in the project. But Danielle, if you could also share your involvement or how you got started with it as well,
Danielle Robinson 08:33
yeah, I can tell the origin story. So the Dat project is an open source community building a protocol for peer to peer data sharing. And as a protocol, it’s similar to HTTP in how the protocol is used today, but Dat adds extra security and automatic versioning, and allows users to connect to a decentralized network. In a decentralized network, you can store the data anywhere, either in a cloud or on a local computer, and it does work offline. And so Dat is built to make it easy for developers to build decentralized applications without worrying about moving data around. And the people who originally developed it, that’d be Mathias and Max and Karissa, were scratching their own itch for building software to share and archive public and research data. And this is how Joe got involved, like he was saying before. And so it originally started as an open source project, and then it got a grant from the Knight Foundation in 2013, as a prototype grant focusing on government data, and then that was followed up in 2014 by a grant from the Alfred P. Sloan Foundation, and that grant focused more on scientific research and allowed the project to put a little more effort into working with researchers. And since then, we’ve been working to solve research data management problems by developing software on top of the Dat protocol. And the most recent project is funded by the Gordon and Betty Moore Foundation. That project started in 2016. And it supports us, it’s called Dat in the Lab, and I can get you a link to it on our blog. It supports us to work with the California Digital Library and research groups in the University of California system to make it easier to move files around, version data sets, and support researchers through automating archiving. And so that’s a really cool project, because we get to work directly with researchers and do the kind of participatory design software stuff that we enjoy doing and create things that people will actually use. And we get to learn about really exciting research, very different from the research I did in my PhD; one of the labs we’re working with studies sea star wasting disease. So it’s really fascinating stuff. And we get to work right with them to make things that are going to fit into their workflows. So I started working with Dat in the summer, right before that grant was funded. So I guess maybe six months before that grant was funded. And so I came on as a consultant initially to help write grants and start talking about how to work directly with researchers and what to build for researchers that will really help them move their data around and version control it. So yeah, that’s how I became involved. And then in the fall, I transitioned to a partnerships position, and then the ED position in the last month.
Tobias Macey 11:27
And you mentioned that a lot of the sort of boost to the project has come in the form of grants from a few different foundations. So I’m wondering if you can talk a bit about how those different grants have influenced the focus and pace of the development that was possible for the project?
Joe Hand 11:42
Yeah, I mean, Dat really occupies a unique position in the open source world with that grant funding. So you know, for the first few years, it was closer to sort of a research project than a traditional product focused startup. Other open source projects like that might be done part time as a side project, or just sort of for fun. But the grant funding really allowed the original developers to sign on and work full time, really solving harder problems than they might be able to otherwise. So since we got those grants, we’ve been able to toe the line between more user facing product and some research software. And the grants really gave us the opportunity to toe that line, but also get in the field and connect with researchers and end users. So we can sort of innovate with technical solutions, but really ground those in reality with specific scientific use cases. So you know, this balance is really only possible because of that grant funding, which gives us more flexibility and might have a little longer timeline than VC money or just an open source side project. But now we’re really at a critical juncture, I’d say, where grant funding is not quite enough to cover what we want to do. But luckily the protocol is really getting into a more stable position, and we’re starting to look at those user facing products on top and starting to build those around the core protocol.
Tobias Macey 13:10
And the fact that you have received so many different rounds of grant funding sort of lends credence to the fact that you’re solving a critical problem that lots of people are coming up against. And I’m wondering if there are any other projects or companies or organizations that are trying to tackle similar or related problems that you sort of view as collaborators or competitors in the space? And where do you think that the Dat project is fairly uniquely positioned to solve the specific problems that it’s addressing?
Joe Hand 13:44
Yeah, I mean, I would say we have, you know, there are other similar use cases and tools. And you know, a lot of that is around sharing open data sets, and sort of the publishing of data, which Danielle might be able to talk more about. But on the sort of technical side, I guess the biggest competitor or similar thing might be IPFS, which is another sort of decentralized protocol for sharing and storing data in different ways. But we’re actually, you know, excited to work with these various companies. So you know, IPFS is more of a storage focused format. It basically allows content based storage on a distributed network. And Dat’s really more about sort of the transfer protocol and being very interoperable with all these other solutions. So yeah, you know, that’s what we’re more excited about, trying to understand how we can use Dat in collaboration with all these other groups. Yeah,
Danielle Robinson 14:41
I think I’ll just plus-one what Joe said. Through my time coming up in the OpenCon community and the Mozilla Science community, there are a lot of people trying to improve access to data broadly. And most of the people I know, everyone in the space, really takes a collaboration, not competition, sort of approach, because there are a lot of different ways to solve the problem, depending on what the end user wants. And there’s a lot of great projects working in the space. I would agree with Joe, I guess, that IPFS is the thing that people sometimes, you know, like I’ll be at an event and someone will say, what’s the difference between Dat and IPFS, and I answer pretty much how Joe just answered. But it’s important to note that we know those people, and we have good relationships with them. And we’ve actually just been emailing with them about some kind of collaboration over the next year. So there’s a lot of really great projects in the open data and improving access to data space. And I basically support them all. There’s so much work to be done that I think there’s room for all the people in the space.
Tobias Macey 15:58
And now that you have established a nonprofit organization around Dat, are there any particular plans that you have to support future sustainability and growth for the project?
Danielle Robinson 16:09
Yes, future sustainability and growth for the project is what we wake up and think about every day, sometimes in the middle of the night. That’s the most important thing. And incorporating the nonprofit was a big step that happened, I think, at the end of 2016. And so it’s critical as we move towards a self sustaining future. And importantly, it will also allow us to continue to support and incubate other open source projects in the space, which is something that I’m really excited about. For Dat, our goal is to support a core group of top contributors through grants, revenue sharing, and donations. And so over the next 12 months we’ll be pursuing grants and corporate donations, as well as rolling out an Open Collective page to help facilitate smaller donations, and continuing to develop products with an eye towards things that can generate revenue and support the Dat ecosystem. At the same time, we’re also focusing on sustainability within the project itself. And what I mean by that is, you know, governance, community management. And so we are right now working with the developer community to formalize the technical process on the protocol through a working group. And those are really great calls, lots of great people are involved in that. And we really want to make sure that protocol decisions are made transparently, and that we can involve a wider group of the community in the process. And we also want to make the path to participation, involvement, and community leadership clear for newcomers. So by supporting the developer community, we hope to encourage new and exciting implementations of the Dat protocol. Some of the stuff that happened in 2017, you know, from my perspective, working in the sciences, sort of came out of nowhere, and people were building, you know, amazing new social networks based on Dat. And it was really fun and exciting. And so just keeping the community healthy, and making sure that the technical process and how decisions get made is really clear and transparent, I think is going to facilitate even more of that. And just another comment about being a nonprofit: because Code for Science and Society is a nonprofit, we also act as a fiscal sponsor. And what that means is that like-minded projects who get grant funding but are not nonprofits, so they can’t accept the grant on their own, can accept their grant through us. And then we take a small percentage of that grant, and we use that to help those projects by linking them up with our community. I work with them on grant writing, and fundraising, and strategy, we’ll support their own community engagement efforts, and sometimes offer technical support. And we see this as really important to the ecosystem and a way to help smaller projects develop and succeed. So right now we do that with two projects. One of them is called Stencila, and I can send a link for that. And the other one is called ScienceFair. Stencila is an open source reproducible document software project funded by the Alfred P. Sloan Foundation. It’s looking to support researchers from data collection to document authoring. And ScienceFair is a peer to peer library built on Dat, which is designed to make it easy for scholars to curate collections of research on a certain topic, annotate them, and share them with their colleagues. And so that project was funded by a prototype grant from a publisher called eLife, and they’re looking for additional funding. So we’re working with both of them. 
And in the first quarter of this year, Joe and I are working to formalize the process of how we work with these other projects and what we can offer them and hopefully, we’ll be in the position take on additional projects later this year. But I really enjoy that work. And I think, as someone so I went through the Mozilla fellowship, which was like a 10 month long, crazy period where Mozilla invested a lot in me and making sure I was meeting people and learning how to write grants and learning how to give good talks and all kinds of awesome investment. And so for a person who goes through a program like that, or a person who has a side project, there’s kind of there’s a need for groups in the space, who can incubate those projects, and help them as they develop from from the incubator stage to the, you know, middle stage before they scale up. So I thinking there’s, so as a fiscal sponsor, we were hoping to be able to support projects in that space.
Tobias Macey 20:32
And digging into the Dat protocol itself, when I was looking through the documentation, it mentioned that the actual protocol itself is agnostic to the implementation. And I know that the current reference implementation is done in JavaScript. So I’m wondering if you could describe a bit about how the protocol itself is designed, how the reference implementation is done, how the overall protocol has evolved since it was first started, and what your approach is to versioning the protocol itself to ensure that people who are implementing it in other technologies or formats are able to ensure that they’re compliant with specific versions of the protocol as it evolves.
Joe Hand 21:19
Yeah, so Dat’s basically a combination of ideas from Git, BitTorrent, and just the web in general. And so there are a few key properties in Dat that basically any implementation has to recreate. And those are content integrity, decentralized mirroring of the data sets, network privacy, incremental versioning, and then random access to the data. So we have a white paper that sort of explains all these in depth, but I’ll sort of explain how they work maybe in a basic use case. So let’s say I want to send some data to Danielle, which I do all the time. And I have a spreadsheet where I keep track of my coffee intake. So I want to sync it to Danielle’s computer so she can make sure I’m not over-caffeinating myself. So sort of similar to how you get started with Git, I would put my spreadsheet in a folder and create a new dat. And whenever I create a new dat, it makes a new key pair. So one is the public key and one is the private key. And the public key is basically the dat link, so kind of like a URL. So you can use that in anything that speaks the Dat protocol, and you can just sort of open that up and look at all the files inside of that. And then the private key allows me to write files to that dat, and it’s used to sign any of the new changes. And so the private key allows Danielle to verify that the changes actually came from me and that somebody else wasn’t trying to fake my data, or somebody wasn’t trying to man-in-the-middle my data when I was transferring it to Danielle. So I add my spreadsheet to the dat. And then what Dat does is break that file into little chunks. It hashes all those chunks and creates a Merkle tree with that. And that Merkle tree basically has lots of cool properties and is one of the key sort of features of Dat. So the Merkle tree allows us to sparsely replicate data. So if we had a really big data set, and you only want one file, we can sort of use the Merkle tree to download one file and then still verify the integrity of that content with that incomplete data set. And the other part that allows us to do that is the register. So all the files are stored in one register, and all the metadata is stored in another register. And these registers are basically append-only ledgers. They’re also sort of known as secure registers. Google has a project called Certificate Transparency that has similar ideas. And these registers, basically, whenever a file changes, you might append that to the metadata register, and that register stores basic information about the structure of the file system, what version it is, and then any other metadata, like the creation time or the change time of that file. And so right now, you know, as you said, Tobias, we sort of are very flexible on how things are implemented. But right now we basically store the files as files. So that sort of allows people to see the files normally and interact with them normally. But the cool part about that is that the on-disk file storage can be really flexible. So as long as the implementation has random access, basically, then they can store it in any different way. So we have, for example, a storage model built for the server that stores all of the files as a single file. So that sort of allows you to have fewer file descriptors open and gets the file I/O all constrained to one file.
So once my file gets added, I can share my link privately with Danielle, and I can send that over chat or something or just paste it somewhere. And then she can clone my dat using our command line tool or the desktop tool or the Beaker browser. And when she clones my dat, our computers basically connect directly to each other. So we use a variety of mechanisms to try and do that connection. That’s been one of the challenges that I can talk about later, sort of how to connect peer to peer and the challenges around that. But then once we do connect, we’ll transfer the data either over TCP or UDP. So those are the default network protocols that we use right now. But yeah, that can be implemented basically on any other protocol. I think Mathias once said that if you could implement it over carrier pigeon, that would work fine, as long as you had a lot of pigeons. So we’re really open to how the data, as far as the protocol information, gets transferred. And we’re working on a dat over HTTP implementation too. So this wouldn’t be peer to peer, but it would allow basically traditional server fallback if no peers are online, or for services that don’t want to run peer to peer for whatever reason. Once Danielle clones my dat, she can open it just like a normal file and plug it into R or Python or whatever, and use her equation to measure my caffeine level. And then let’s say I drink another cup of coffee and update my spreadsheet, the changes will basically automatically be synced to her, as long as she’s still connected to me. And it will be synced throughout the network to anybody else that’s connected to me. So the metadata register stores that updated file information, and then the content register stores just the changed file blocks. So Danielle only has to sync the diff of that content change rather than the whole dataset again. So this is really useful for the big data sets, you know. And yeah, we’ve had to design basically each of these pieces to be as modular as possible, both within our JavaScript implementation, but also in the protocol in general. So right now, developers can swap out network protocols or data storage. So for example, if you want to use Dat in the browser, you can use WebRTC for the network and discovery and then use IndexedDB for data storage. So IndexedDB has random access, so you can just plug that in directly into Dat. And we have some modules for those, and that should be working. We did have a WebRTC implementation we were supporting for a while, but we found it a bit inconsistent for our use cases, which is, you know, more around large file sharing. But it still might be okay for chat and other more text based things. So, yeah, all of our implementations are in Node right now.
I think that was both for usability and developer friendliness, and also just being able to work in the browser and across platforms. So we can distribute a binary of that pretty easily now, and you can run it in the browser or build Dat tools on Electron. So it sort of allows a wide range of developer tools built on top of Dat. But we have a few community members now working on different implementations, and Rust and C I think are the two that are going right now. And as far as the protocol versioning, that was actually one of the big conversations we were having in the last working group meeting. And that’s to be decided, basically, but through the stages we’ve gone through, we’ve broken it quite a few times. And now we’re finally in a place where we want to make sure not to break it moving forward. So there’s sort of space in the protocol for information like version history, or the version of the protocol. So we’ll probably use that to signal the version and just figure out how the tools that are implementing it can fall back to the latest version. So before all the sort of file based stuff, Dat went through a few different stages. It started really as more of a versioned, decentralized database. And then Max and Mathias and Karissa sort of moved to the scientific use cases, where they removed more and more of the database architecture as it moved on and matured. So that transition was really driven by user feedback and watching researchers work. And we realized that so much of research data is still kept in files and basically moved manually between machines. So even if we were going to build a special database, a lot of researchers still wouldn’t be able to use that, because that sort of requires more infrastructure than they have time to support. So we really just kept working to build a general purpose solution that allows other people to build tools to solve those more specific problems. And the last point is that right now, all Dat transfer is basically one way, so only one person can update the source. This is really useful for a lot of our research use cases where they’re getting data from lab equipment, where there’s a specific source, and you just want to disseminate that information to various computers. But it really doesn’t work for collaboration. So that’s sort of the next thing that we’re working on. But we really want to make sure to solve this sort of one way problem before we move to the harder problem of collaborative data sets. And this last major iteration is sort of the hardest, and that’s what we’re working on right now. But it sort of allows multiple users to write to the same dat. And with that, we get into problems like conflict resolution and duplicate updates and other sort of harder distributed computing problems.
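To make the chunking and Merkle-tree verification described above concrete, here is a minimal Python sketch. It models the general idea of verifying a single downloaded chunk against a trusted root hash, which is what enables sparse replication; it is not Dat's actual hash layout or register format, and all of the data is made up.

```python
# Hypothetical sketch: split a file into chunks, hash them into a Merkle tree,
# and verify one chunk with a short proof against the root. Simplified model only.
import hashlib

CHUNK_SIZE = 4  # tiny chunks so the example is easy to follow

def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def build_levels(chunks):
    """Return every level of the tree, from leaf hashes up to the single root."""
    level = [h(c) for c in chunks]
    levels = [level]
    while len(level) > 1:
        if len(level) % 2 == 1:
            level = level + [level[-1]]          # duplicate the odd node out
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
        levels.append(level)
    return levels

def proof_for(levels, index):
    """Collect the sibling hashes needed to verify the chunk at `index`."""
    proof = []
    for level in levels[:-1]:
        if len(level) % 2 == 1:
            level = level + [level[-1]]
        sibling = index ^ 1
        proof.append((level[sibling], sibling < index))  # (hash, sibling is on the left)
        index //= 2
    return proof

def verify(chunk, proof, root):
    node = h(chunk)
    for sibling, sibling_is_left in proof:
        node = h(sibling + node) if sibling_is_left else h(node + sibling)
    return node == root

data = b"caffeine log for the whole month"
chunks = [data[i:i + CHUNK_SIZE] for i in range(0, len(data), CHUNK_SIZE)]
levels = build_levels(chunks)
root = levels[-1][0]

# A peer downloads only chunk 3 plus its proof, yet can still check integrity.
print(verify(chunks[3], proof_for(levels, 3), root))  # True
```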
Tobias Macey 30:24
And that partially answers one of the next questions I had, which was to ask about conflict resolution. But if there’s only one source that’s allowed to update the information, then that solves a lot of the problems that might arise by syncing all these data sets between multiple machines, because there aren’t going to be multiple parties changing the data concurrently. So you don’t have to worry about how to handle those use cases. And another question that I had from what you were talking about is the cryptography aspect of Dat. It sounds as though when you initialize the dat, it just automatically generates the public/private key pair. And so that private key is cryptographically linked with that particular data set. But is there any way to use, for instance, Keybase or PGP, to sign the source dat in addition to the generated key, to establish your identity for when you’re trying to share that information publicly, and not necessarily via some channel that already has established trust?
Joe Hand 31:27
Yeah, I mean, you could do that within the dat. We don’t really have any mechanism for doing that on top of Dat. So, you know, we’re sort of going to throw that into user land right now. But, yeah, I mean, that’s a good question. And we’ve had some people, I think, experimenting with different identity systems and how to solve that problem. And I think we’re pretty excited about the new Wire app, because that’s open source, and it uses end to end encryption and has some identity system, and we’re sort of trying to see if we can build that on top of Wire. So that’s one of the things that we’re sort of experimenting with.
Tobias Macey 32:09
And one of the primary use cases that is mentioned in the documentation and the website for Dat is being able to host and distribute open data sets, with a focus on researchers and academic use cases. So I’m wondering if you can talk some more about how Dat helps with that particular effort and what improvements it offers over some of the existing solutions that researchers were using prior.
Danielle Robinson 32:33
There are solutions for both hosting and distributing data. And in terms of hosting and distribution, there’s a lot of great work focused on data publication and making sure that data associated with publications is available online, thinking about Zenodo and Dryad or Dataverse. There are also other data hosting platforms such as CKAN or data.world. And we really love the work these people do, and we’ve collaborated with some of them or were involved with them; eLife, for example, and the open source Alliance for Open Scholarship has some people from Dryad who are involved in it. And so it’s nice to work with them, and we’d love to work with them to use Dat to upload and distribute data. But right now, if researchers need to share files between many machines and keep them updated and versioned, so for example, if there’s a large live updating data set, there really aren’t great solutions to address data versioning and sharing. So in terms of sharing and transferring, lots of researchers still manually copy files between machines and servers, or use tools like rsync or FTP, which is how I handled it during my PhD. Other software such as Globus or even Dropbox can require more IT infrastructure than a small research group may have. Researchers, you know, they are all operating on limited grant funding, and they also depend on the IT structure of their institution to get them access to certain things. So a researcher like me might spend all day collecting a terabyte of data on a microscope and then wait for hours or wait overnight to move it to another location. And the ideal situation from a data management perspective is that those raw data are automatically archived to the web server and sent to the researcher’s computer for processing. So you have an archived copy of the raw data that came off of the equipment. And in the process, files also need to be archived. So you need archives of the imaging files, in this case, at each step in processing. And then when a publication is ready, in order for the data processing pipeline to be fully reproducible, you’ll need the code and you’ll need the data at different stages. And even without access to the computer, the cluster where the analysis was done, a person should be able to repeat that. And I say ideally, because this isn’t really how it’s happening now.
Archiving data at different steps, some of the things that stop that from happening are just the cost of storage, and the availability of storage, and researcher habits. So I definitely, you know, know some researchers who kept data on hard drives in Tupperware to protect them in case the sprinklers ever went off, which isn’t really a long term solution, true facts. So Dat can automate these archiving steps at different checkpoints and make the backups easier for researchers. As a former researcher, I’m interested in anything that makes better data management automatic for researchers. And so we’re also interested in versioned compute environments to help labs avoid the drawer full of Jaz drives problem, which is sadly a quote from a senior scientist who was describing a bunch of data collected by her lab that she can no longer access. She has the drawer, she has the Jaz drives, she can’t get into them, that data is essentially lost. And so researchers are really motivated to make sure when things are archived, they’re archived in a form where they can actually be accessed. But I think, because researchers are so busy, it’s really hard to know when that is. So I think because we’re so focused on essentially filling in the gaps between the services that researchers use and that work well for them, and automating things, Dat’s in a really good position to solve some of these problems. And, you know, some of the researchers that we’re working with now, I’m thinking of one person who has a large data set and bioinformatics pipeline, and he’s at a UC lab, and he wants to get all the information to his collaborator in Washington State. And it’s taken months, and he has not been able to do it, he just can’t move that data across institutional lines. And that’s a much longer conversation as to why exactly that isn’t working. But we’re working with him to try to make it possible for him to move the data and create a versioned emulation of his compute environment so that his collaborator can just do what he was doing and not need to spend four months worrying about dependencies and stuff. So yeah, hopefully that answers the question.
Tobias Macey 37:39
And one of the other difficult aspects of building a peer to peer protocol is the fact that in order for there to be sufficient value in the protocol itself, there needs to be a network behind it of people to share that information with and to share the bandwidth requirements for being able to distribute that information. So I’m wondering how you have approached the effort of building up that network, and how much progress you feel you have made in that effort?
Joe Hand 38:08
Yeah, I’m not sure we really view Dat as a traditional peer to peer protocol using that model of relying on network effects to scale. So you know, as Danielle said, we’re just trying to get data from A to B. And so our critical mass is basically two users on a given data set. So obviously, we want to first build something that offers better tools for those two users over a traditional cloud or client server model. So if I’m transferring files to another researcher using Dropbox, you know, we have to transfer files via a third party and a third computer before it can get to the other computer. So rather than going direct between two computers, we have to go through a detour. And this has implications for speed, but also security, bandwidth usage, and even something like energy usage. So by cutting out that third computer, we feel like we’re already adding value to the network. We’re sort of hoping that researchers that are doing this HTTP transfer can sort of see the value of going directly, and using something that is versioned and can be live synced, over existing tools like rsync, or the commercial services that might store data in the cloud. And you know, we really don’t have anything against the centralized services, we sort of recognize that they’re very useful sometimes. But they also aren’t the answer to everything. And so depending on the use case, a decentralized system might make more sense than a centralized one. And so we sort of want to offer developers and users that option to make that choice, which we don’t really have right now. But in order to do that, we really have to start with peer to peer tools first. And then once we have that decentralized network, we can basically limit the network to one server peer and many clients, and then all of a sudden it’s centralized. So we sort of understand that it’s easy to go from decentralized to centralized, but it’s harder to go the other way around, so we have to start with a peer to peer network in order to solve all these different problems. And the other thing is that we sort of know file systems are not going away. We know that web browsers will continue to support static files. And we also know that people will basically want to move these things between computers, back them up, archive them, share them to different computers. So we know files are going to be transferred a lot in the future, and that’s something we can depend on. And they probably even want to do this in a secure way sometimes, and maybe in an offline environment or a local network. And so we’re basically trying to build from those basic principles, using peer to peer transfer as the bedrock of all that. And that’s sort of how we got to where we are now with the peer to peer network. But we’re not really worried that we need a certain number or critical mass of users to add value, because we just sort of feel like by building the right tools with these principles, we can start adding value, whether it’s a decentralized network or a centralized network.
Tobias Macey 40:59
And one of the other use cases that’s been built on top of Dat is being able to build websites and applications that can be viewed by web browsers and distributed peer to peer in that manner. So I’m wondering how much uptake and usage you’ve seen for that particular application of the protocol, and how much development effort is being focused on that particular use case?
Joe Hand 41:20
Yeah, so you know, if I open my Beaker browser right now, which is the main web implementation we have, that Paul Frazee and Tara Vancil are working on, you know, if I open my Beaker browser, I think I usually have 50 to 100, or sometimes 200, peers that I connect to right away. So that’s through some of the social networks, like Rotonde or Fritter, and then just some personal sites. And you know, we’ve sort of been working with the Beaker browser folks probably for two years now, sort of co-developing the protocol and seeing what they need support for in Beaker. But you know, it sort of comes back to that basic principle that we recognize that a lot of websites are static files. And if we can just support static files in the best way possible, then you can browse a lot of websites. And that even gives you the benefit for things that are more interactive, which we know have to be developed so they work offline too. So both Rotonde and Twitter can work offline. And then once you get back online, you can just sync the data sort of seamlessly. So that’s sort of the most exciting part about those.
Danielle Robinson 42:29
You mean, Fritter, not Twitter.
Fritter is the Twitter clone that Tara Vancil and Paul made. Beaker’s a lot of fun. And if you’ve never played around with it, I would encourage you to download it. I think it’s just beakerbrowser.com. And I’m not a developer by trade, but I have seriously enjoyed playing around on Beaker. And I think some of the more frivolous things like Fritter that have come out of it are a lot of fun, and really speak to the potential of peer to peer networks in today’s era as people are becoming increasingly frustrated with the centralized platforms.
Tobias Macey 43:13
And given the fact that the content that’s being distributed via Dat using the browser is primarily static in nature, I’m wondering how that affects the sort of architectural patterns that people are used to using with the common three tier architecture. And you’ve already mentioned a couple of social network applications that have been built on top of it, but I’m wondering if there are any others that are built on top of and delivered via Dat that you’re aware of that you could talk about, that speak to some of the ways that people are taking advantage of Dat in more of the consumer space?
Joe Hand 43:47
Yeah, I mean, I think, you know, one of the big shifts that have made this easier is having databases in the browser, so things like IndexedDB or other local storage databases, and then being able to sync those to other computers. So as long as you sort of know that I’m writing to my database, you know, I think people are trying to build games off this. So you could build a chess game where I write to my local database, and then you have some logic for determining if a move is valid or not, and then sync that to your competitor. You know, it’s a more constrained environment, but I think that also gives you the benefit of being able to constrain your development and not requiring these external services or external database calls or whatever. I know that I’ve tried a few times to develop projects, just fun little things, and it is a challenge, because you sort of have to think differently about how those things work, and you can’t necessarily rely on external services, you know, whether that’s something as simple as loading fonts from an external service, or CSS styles, or external JavaScript. You sort of want that all to be packaged within one dat if you want to ensure it’s all going to work. So you do have to think a little differently even on those simple things. But yeah, it does constrain the sort of bigger applications. And, you know, I think the other area where we could see development is more in Electron applications. So maybe not in Beaker, but Electron, using that sort of framework as a platform for other types of applications that might need those more flexible models. So ScienceFair, which is one of our hosted projects, is a really good example of how to use Dat in a way to distribute data, but still sort of have a full application. So basically, you can distribute all the data for the application over Dat and keep it updated through the live syncing. And users can basically download the PDFs that they need to read, or the journals or the figures they want to read, and just download whatever they want, sort of allowing developers to have that flexible model where you can distribute things peer to peer and have both the live syncing, but also just downloading whatever data that users need, and just providing that framework for that data management.
Tobias Macey 46:15
And one of the other challenges that's posed, particularly for this public distribution use case, is content discovery, because by default the URLs that are generated are private and unguessable, since they're essentially just hashes of the content. So I'm wondering if there are any particular mechanisms that you either have built, or have planned or started discussing, for being able to facilitate content discovery of the information that's being distributed on these different networks?
Joe Hand 46:50
Yeah, this is definitely an open question. I sort of fall back on my common answer, which is that it depends on the tool that we're using and the different communities, and there are going to be different approaches; some might be more decentralized, and some might be centralized. So, for example, with dataset discovery, there are a lot of good centralized services for dataset publishing, as Danielle mentioned, like Zenodo or Dataverse. These are places that already have discovery engines, I guess we'll say, and they publish datasets. So you could similarly publish the Dat URL along with those datasets so that people have an alternative way to download them. That's one way that we've been thinking about discovery: leveraging these existing solutions that are doing a really good job in their domain, and trying to work with them to start using Dat for their data management. Another, sort of hacky, solution, I guess I'll say, is using existing domains and DNS. Basically, you can publish a regular HTTP site on your domain and give it a specific well-known file, and that points to your Dat address. Then the Beaker browser can find that file and tell you that a peer to peer version of that site is available. So we're basically leveraging the existing DNS infrastructure to start to discover content just with existing URLs. And I think a lot of the discovery will be more community based. So, for example, in Fritter and Rotonde, people are starting to build crawlers or search bots to discover users or search, and so basically just looking at where there is a need, identifying different types of crawlers to build, and figuring out how to connect those communities in different ways. So we're really excited to see what ideas pop up in that area, and they'll probably come in a decentralized way, we hope.
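For illustration, here is a hedged TypeScript sketch of the DNS-based discovery Joe describes: fetch a small well-known file from the domain over HTTPS and read the Dat address from it. The exact path and file format shown here are assumptions, so check the Dat and Beaker documentation for the real convention.

```typescript
// Hedged sketch: resolve a domain to a Dat address via a well-known file
// served over HTTPS. The path and expected format are assumptions.
async function resolveDatAddress(domain: string): Promise<string | null> {
  const res = await fetch(`https://${domain}/.well-known/dat`);
  if (!res.ok) {
    return null; // no peer-to-peer version advertised for this domain
  }
  const firstLine = (await res.text()).split("\n")[0].trim();
  // Expect something like: dat://<64-character-hex-key>
  return /^dat:\/\/[0-9a-f]{64}$/.test(firstLine) ? firstLine : null;
}
```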
Tobias Macey 48:46
And for somebody who wants to start using Dat, what is involved in creating and/or consuming the content that's available on the network? And are there any particular resources available to get somebody up to speed on how it works and some of the different uses they could put it to?
Danielle Robinson 49:05
Sure, I can take that, and Joe, just chime in if you think of anything else. We built a tutorial for our work with the labs and for MozFest this year; that's at try-dat.com. The tutorial takes you through how to work with the command line tool and some basics about Beaker. And please tell us if you find a bug; there may be bugs remaining, but it was working pretty well when I used it last. It's in the browser, and it spins up a little virtual machine, so you can either share data with yourself or you can do it with a friend and share data with your friend. Beaker is also super easy for a user who wants to get started: you can visit pages over Dat just like you would a normal web page. For example, you can go to a website, and we'll give Tobias the link to that, and just change the http to dat, so it looks like dat:// in front of the site's address. Beaker also has this fun thing that lets you create a new site with a single click, and you can fork sites and edit them and make your own copies of things, which is fun if you're learning how to build simple websites. So you can go to beakerbrowser.com and learn about that. I think we've already talked about Rotonde and Fritter, and we'll add links for people who want to learn more about those. And then for data focused users, you can use Dat for sharing or transferring files, either with the desktop application or the command line interface. So if you're interested, we encourage you to play around; the community is really friendly and helpful to new people. Joe and I are always on the IRC channel or on Twitter, so if you have questions, feel free to ask. We love talking to new people, because that's how all the exciting stuff happens in this community.
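For readers who prefer the programmatic route over the CLI or Beaker, the sketch below shows roughly how a folder can be shared from Node.js with the dat-node package, following its callback-style README example. The loose typing declared here is an assumption, since dat-node ships without TypeScript types; verify the API against the package's own documentation.

```typescript
// Sketch of sharing a folder from Node.js with the dat-node package.
// The module shape is assumed; check the dat-node README for the real API.
declare const require: (id: string) => any; // running under Node.js
const Dat = require("dat-node");

Dat("./data-to-share", (err: Error | null, dat: any) => {
  if (err) throw err;
  dat.importFiles();  // add the folder's files to the Dat archive
  dat.joinNetwork();  // announce to the swarm so peers can replicate it
  console.log("Share this link: dat://" + dat.key.toString("hex"));
});
```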
Tobias Macey 50:58
And what have been some of the most challenging aspects of building the project and the community, and of promoting the use cases and capabilities of the project?
Danielle Robinson 51:10
I can speak a little bit to promoting it in academic research. In academic research, probably similar to many of the industries where your listeners work, software decisions are not always made for entirely rational reasons. There's tension between what your boss wants, what the IT department has approved, what meets institutional data security needs, and then the perceived time cost of developing a new workflow and getting used to a new protocol. So we try to work directly with researchers to make sure the things we build are easy and secure, but it is a lot of promotion and outreach to get scientists to try a new workflow. They're really busy, and the incentives are all, you know, get more grants, do more projects, publish more papers. So even if something will eventually make your life easier, it's hard to sink time in up front. One thing I've noticed, and this is probably common to all industries, is that I'll be talking to someone and they'll say, oh, you know, archiving the data from my research group is not a problem for me. And then they'll proceed to describe a super problematic data management workflow. It's not a problem for them anymore because they're used to it, so it doesn't hurt day to day. But doing things like waiting until the point of publication and then trying to go back and archive all the raw data: maybe some of it was collected by a postdoc who's now gone, some was collected by a summer student who used a non-standard naming scheme for all the files. There are just a million ways that stuff can go wrong. So for now, we're focusing on developing real world use cases and participating in community education around data management. We want to build stuff that's meaningful for researchers and others who work with data, and we think that working with people and doing the nonprofit thing, with grants, is going to be the way to get us there. Joe, do you want to talk a little bit about building?
Joe Hand 53:03
Yeah, sure. So, in terms of building it, I haven't done too much work on the core protocol, so I can't say much around the difficult design decisions there. I'm the main developer on the command line tool, and most of the challenging decisions there are about user interfaces, not necessarily technical problems. So as Danielle said, it's as much about people as it is about software and those decisions. But I think one of the most challenging things that we've run into a lot is basically network issues. In a peer to peer network, you have to figure out how to connect to peers directly in networks where they might not be supposed to do that. I think a lot of that comes from BitTorrent making different institutions restrict peer to peer networking in different ways, and so we're having to fight that battle against these existing restrictions, trying to find out how these networks are restricted and how we can continue to have success in connecting peers directly rather than through a third party server. And it's funny, or maybe not funny, but some of the strictest networks we've found are actually in academic institutions. For example, on one of the UC campuses, I think we found out that computers can never connect directly to other computers on that same network. So if we wanted to transfer data between two computers sitting right next to each other, we'd basically have to go through an external cloud server just to get it to the computer sitting right next to it, or, you know, use something like a hard drive or a thumb drive or whatever. All these different network configurations, I think, are one of the hardest parts, both in terms of implementation and in terms of testing, since we can't readily get into these UC campuses to see what the network setup is. So we're trying to create more tooling around networking, both testing networks in the wild and using virtual networks to test different types of network setups, and leverage those two things combined to try and get around all these network connection issues. So yeah, I would love to hand Mathias this question around the design decisions in terms of the core protocol, but I can't really say much about that, unfortunately.
Tobias Macey 55:29
And are there any particularly interesting or inspiring uses of Dat that you're aware of that you'd like to share?
Danielle Robinson 55:36
Sure, I can share a couple of things that we were involved in. Last January, in 2017, we were involved in the DataRescue and Libraries+ Network community, which was the movement to archive government funded research at trusted public institutions like libraries and archives. As a part of that, we got to work with some of the really awesome people at the California Digital Library. The California Digital Library is really cool because it is a digital library with a mandate to preserve, archive, and steward the data that's produced in the UC system, and it supports the entire UC system. And the people are great. So we worked with them to make the first ever backup of data.gov, and I think my colleague had 40 terabytes of metadata sitting in his living room for a while as we were working up to the transfer. That was a really cool project, and it has produced a useful thing. We got to work with some of the data.gov people to make that happen, and they were like, really, it has never been backed up? So it was a good time to do it. But believe it or not, it's actually pretty hard to find funding for that work, and we have more work we'd like to do in that space. Archiving copies of federally funded research at trusted institutions is a really critical step towards ensuring the long term preservation of the research that gets done in this country, so hopefully 2018 will see those projects funded or new collaborations in that space. Also, it's a fantastic community, because it's a lot of really interesting librarians and archivists who have great perspective on long term data preservation, and I love working with them. So hopefully we can do something else there. Then the other thing that I'm really excited about is the Dat in the Lab project, where we're working on the container question. And I don't mind going a little over time, so I don't know how much I should go into this, but we've learned a lot about really interesting research. We're working to develop a container based simulation of a research computing cluster that can run on any machine or in the cloud. By creating a container that includes the complete software environment of the cluster, researchers across the UC system can quickly get the analysis pipelines that they're working on usable in other locations. And this, believe it or not, is a big problem. I was sort of surprised when one researcher told me she had been working for four months to get a pipeline running at UC Merced that had been developed at UCLA. You could drive back and forth between Merced and UCLA a bunch of times in four months, but it's this little stuff that really slows research down. So I'm really excited about the potential there. We've written a couple of blog posts on that, so I can add the links to those blog posts in the follow up.
Joe Hand 58:36
And I'd say the most novel use that I'm excited about is called Hypervision, which is basically video streaming built on Dat. Mathias Buus, one of the lead developers on Dat, is prototyping something similar with Danish public TV, and they basically want to live stream their channels over the peer to peer network. So I'm excited about that, because I'd really love to get more public television and public radio distributing content peer to peer, so we can reduce their infrastructure costs and hopefully allow for more of that great content to come out.
Tobias Macey 59:09
Are there any other topics that we didn’t discuss yet? What do you think we should talk about before we close out the show?
Danielle Robinson 59:15
Um, I think I’m feeling pretty good. What about you, Joe?
Joe Hand 59:18
Yeah, I think that’s it for me. Okay.
Tobias Macey 59:20
So for anybody who wants to keep up to date with the work you’re doing or get in touch, we’ll have you each add your preferred contact, excuse me, your preferred contact information to the show notes. And as a final question, to give the listeners something else to think about, from your perspective, what is the biggest gap in the tooling or technology that’s available for data management today?
Joe Hand 59:42
I'd say transferring files, which feels really funny to say, but to me it's still a problem that's not really well solved. Just how do you get files from A to B in a consistent and easy to use manner, especially if you want a solution that doesn't really require a command line, is still secure, and hopefully doesn't go through a third party service, because hopefully that means it works offline. A lot of what I saw in the developing world is the need for data management that works offline, and I think that's one of the biggest gaps that we don't really address yet. There are a lot of great data management tools out there, but I think they're aimed more at data scientists or software focused users that might use managed databases or something like Hadoop. But there's really a ton of users out there that don't really have tools. Indeed, most of the world is still offline or has inconsistent internet, and putting everything through servers in the cloud isn't really feasible. But the alternatives now require careful, manual data management if you don't want to lose all your data. So we really hope to find a good balance between those two needs and those two use cases.
Danielle Robinson 01:00:48
Plus one to what Joe said: transferring files. It does feel funny to say that, but it is still a problem in a lot of industries, and especially where I come from in research science. And from my perspective, I guess the other issue is that the people problems are always as hard or harder than the technical problems. If people don't think that it's important to share data or archive data in an accessible and usable form, we could have the world's best, easiest to use tool, and it wouldn't impact the landscape or the accessibility of data. Similarly, if people are sharing data that's not usable, because it's missing experimental context, or it's in a proprietary format, or because it's shared under a restrictive license, it's also not going to impact the landscape or be useful to the scientific community or the public. So we want to build great tools, but I also want to work to change the incentive structure in research to ensure that good data management practices are rewarded and that data is shared in a usable form. That's really key. And I'll add a link in the show notes to the FAIR data principles, which say data should be findable, accessible, interoperable, and reusable; that's something your listeners might want to check out if they're not familiar with it. It's a framework developed in academia, but I'm not sure how much impact it's had outside of that sphere, so it would be interesting to share it with your listeners. And yeah, I'll put my contact info in the show notes, and I'd love to connect with anyone and answer any further questions about Dat and what we're going to try to do with Code for Science &amp; Society over the next year. So thanks a lot, Tobias, for inviting us.
Tobias Macey 01:02:30
Yeah, absolutely. Thank you both for taking the time out of your days to join me and talk about the work you’re doing. It’s definitely a very interesting project with a lot of useful potential. And so I’m excited to see where you go from now into the future. So thank you both for your time and I hope you enjoy the rest of your evening.
Unknown Speaker 01:02:48
Thank you. Thank you.
Transcribed by https://otter.ai
1/29/2018 • 1 hour, 2 minutes, 58 seconds
Snorkel: Extracting Value From Dark Data with Alex Ratner - Episode 15
Summary
The majority of the conversation around machine learning and big data pertains to well-structured and cleaned data sets. Unfortunately, that is just a small percentage of the information that is available, so the rest of the sources of knowledge in a company are housed in so-called “Dark Data” sets. In this episode Alex Ratner explains how the work that he and his fellow researchers are doing on Snorkel can be used to extract value by leveraging labeling functions written by domain experts to generate training sets for machine learning models. He also explains how this approach can be used to democratize machine learning by making it feasible for organizations with smaller data sets than those required by most tooling.
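As a rough illustration of the idea discussed in the episode (this is not Snorkel's actual Python API), a labeling function is just a small heuristic that votes on a label or abstains, and many such noisy votes are then combined into training labels. The sketch below uses a crude majority vote where Snorkel would learn a generative model over the labeling-function outputs.

```typescript
// Conceptual sketch only, not Snorkel's API. Labels: 1 = spam, 0 = not spam,
// -1 = abstain.
type Label = 1 | 0 | -1;
type LabelingFunction = (text: string) => Label;

const lfContainsLink: LabelingFunction = (text) =>
  /https?:\/\//.test(text) ? 1 : -1;

const lfVeryShortMessage: LabelingFunction = (text) =>
  text.trim().split(/\s+/).length < 3 ? 0 : -1;

// Combine the non-abstaining votes; a stand-in for Snorkel's label model.
function combine(lfs: LabelingFunction[], text: string): Label {
  const votes = lfs.map((lf) => lf(text)).filter((l) => l !== -1);
  if (votes.length === 0) return -1;
  const positives = votes.filter((l) => l === 1).length;
  return positives * 2 >= votes.length ? 1 : 0;
}

console.log(combine([lfContainsLink, lfVeryShortMessage], "check out http://spam.example"));
```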
Preamble
Hello and welcome to the Data Engineering Podcast, the show about modern data infrastructure
When you’re ready to launch your next project you’ll need somewhere to deploy it. Check out Linode at dataengineeringpodcast.com/linode and get a $20 credit to try out their fast and reliable Linux virtual servers for running your data pipelines or trying out the tools you hear about on the show.
Go to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch.
You can help support the show by checking out the Patreon page which is linked from the site.
To help other people find the show you can leave a review on iTunes, or Google Play Music, and tell your friends and co-workers
Your host is Tobias Macey and today I’m interviewing Alex Ratner about Snorkel and Dark Data
Interview
Introduction
How did you get involved in the area of data management?
Can you start by sharing your definition of dark data and how Snorkel helps to extract value from it?
What are some of the most challenging aspects of building labelling functions and what tools or techniques are available to verify their validity and effectiveness in producing accurate outcomes?
Can you provide some examples of how Snorkel can be used to build useful models in production contexts for companies or problem domains where data collection is difficult to do at large scale?
For someone who wants to use Snorkel, what are the steps involved in processing the source data and what tooling or systems are necessary to analyse the outputs for generating usable insights?
How is Snorkel architected and how has the design evolved over its lifetime?
What are some situations where Snorkel would be poorly suited for use?
What are some of the most interesting applications of Snorkel that you are aware of?
What are some of the other projects that you and your group are working on that interact with Snorkel?
What are some of the features or improvements that you have planned for future releases of Snorkel?
Contact Info
Website
ajratner on Github
@ajratner on Twitter
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
Stanford
DAWN
HazyResearch
Snorkel
Christopher Ré
Dark Data
DARPA
Memex
Training Data
FDA
ImageNet
National Library of Medicine
Empirical Studies of Conflict
Data Augmentation
PyTorch
Tensorflow
Generative Model
Discriminative Model
Weak Supervision
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
1/22/2018 • 37 minutes, 12 seconds
CRDTs and Distributed Consensus with Christopher Meiklejohn - Episode 14
Summary
As we scale our systems to handle larger volumes of data, geographically distributed users, and varied data sources the requirement to distribute the computational resources for managing that information becomes more pronounced. In order to ensure that all of the distributed nodes in our systems agree with each other we need to build mechanisms to properly handle replication of data and conflict resolution. In this episode Christopher Meiklejohn discusses the research he is doing with Conflict-Free Replicated Data Types (CRDTs) and how they fit in with existing methods for sharing and sharding data. He also shares resources for systems that leverage CRDTs, how you can incorporate them into your systems, and when they might not be the right solution. It is a fascinating and informative treatment of a topic that is becoming increasingly relevant in a data driven world.
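As a small illustration of the concept covered in the episode, here is a grow-only counter, one of the simplest CRDTs, sketched in TypeScript. Because each node only increments its own slot and merging takes the element-wise maximum, replicas converge to the same value regardless of the order or duplication of updates.

```typescript
// Minimal illustrative sketch of a grow-only counter (G-Counter) CRDT.
type GCounter = Map<string, number>;

function increment(counter: GCounter, nodeId: string, by = 1): void {
  counter.set(nodeId, (counter.get(nodeId) ?? 0) + by);
}

// Merge by taking the per-node maximum; commutative, associative, idempotent.
function merge(a: GCounter, b: GCounter): GCounter {
  const out: GCounter = new Map(a);
  for (const [node, count] of b) {
    out.set(node, Math.max(out.get(node) ?? 0, count));
  }
  return out;
}

function value(counter: GCounter): number {
  let total = 0;
  for (const count of counter.values()) total += count;
  return total;
}
```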
Preamble
Hello and welcome to the Data Engineering Podcast, the show about modern data infrastructure
When you’re ready to launch your next project you’ll need somewhere to deploy it. Check out Linode at dataengineeringpodcast.com/linode and get a $20 credit to try out their fast and reliable Linux virtual servers for running your data pipelines or trying out the tools you hear about on the show.
Go to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch.
You can help support the show by checking out the Patreon page which is linked from the site.
To help other people find the show you can leave a review on iTunes, or Google Play Music, and tell your friends and co-workers
Your host is Tobias Macey and today I’m interviewing Christopher Meiklejohn about establishing consensus in distributed systems
Interview
Introduction
How did you get involved in the area of data management?
You have dealt with CRDTs with your work in industry, as well as in your research. Can you start by explaining what a CRDT is, how you first began working with them, and some of their current manifestations?
Other than CRDTs, what are some of the methods for establishing consensus across nodes in a system and how does increased scale affect their relative effectiveness?
One of the projects that you have been involved in which relies on CRDTs is LASP. Can you describe what LASP is and what your role in the project has been?
Can you provide examples of some production systems or available tools that are leveraging CRDTs?
If someone wants to take advantage of CRDTs in their applications or data processing, what are the available off-the-shelf options, and what would be involved in implementing custom data types?
What areas of research are you most excited about right now?
Given that you are currently working on your PhD, do you have any thoughts on the projects or industries that you would like to be involved in once your degree is completed?
Contact Info
Website
cmeiklejohn on GitHub
Google Scholar Citations
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
Basho
Riak
Syncfree
LASP
CRDT
Mesosphere
CAP Theorem
Cassandra
DynamoDB
Bayou System (Xerox PARC)
Multivalue Register
Paxos
RAFT
Byzantine Fault Tolerance
Two Phase Commit
Spanner
ReactiveX
Tensorflow
Erlang
Docker
Kubernetes
Erleans
Orleans
Atom Editor
Automerge
Martin Kleppmann
Akka
Delta CRDTs
Antidote DB
Kops
Eventual Consistency
Causal Consistency
ACID Transactions
Joe Hellerstein
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
1/15/2018 • 45 minutes, 43 seconds
Citus Data: Distributed PostGreSQL for Big Data with Ozgun Erdogan and Craig Kerstiens - Episode 13
Summary
PostGreSQL has become one of the most popular and widely used databases, and for good reason. The level of extensibility that it supports has allowed it to be used in virtually every environment. At Citus Data they have built an extension to support running it in a distributed fashion across large volumes of data with parallelized queries for improved performance. In this episode Ozgun Erdogan, the CTO of Citus, and Craig Kerstiens, Citus Product Manager, discuss how the company got started, the work that they are doing to scale out PostGreSQL, and how you can start using it in your environment.
Preamble
Hello and welcome to the Data Engineering Podcast, the show about modern data infrastructure
When you’re ready to launch your next project you’ll need somewhere to deploy it. Check out Linode at dataengineeringpodcast.com/linode and get a $20 credit to try out their fast and reliable Linux virtual servers for running your data pipelines or trying out the tools you hear about on the show.
Continuous delivery lets you get new features in front of your users as fast as possible without introducing bugs or breaking production and GoCD is the open source platform made by the people at Thoughtworks who wrote the book about it. Go to dataengineeringpodcast.com/gocd to download and launch it today. Enterprise add-ons and professional support are available for added peace of mind.
Go to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch.
You can help support the show by checking out the Patreon page which is linked from the site.
To help other people find the show you can leave a review on iTunes, or Google Play Music, and tell your friends and co-workers
Your host is Tobias Macey and today I’m interviewing Ozgun Erdogan and Craig Kerstiens about Citus, worry free PostGreSQL
Interview
Introduction
How did you get involved in the area of data management?
Can you describe what Citus is and how the project got started?
Why did you start with Postgres vs. building something from the ground up?
What was the reasoning behind converting Citus from a fork of PostGres to being an extension and releasing an open source version?
How well does Citus work with other Postgres extensions, such as PostGIS, PipelineDB, or Timescale?
How does Citus compare to options such as PostGres-XL or the Postgres compatible Aurora service from Amazon?
How does Citus operate under the covers to enable clustering and replication across multiple hosts?
What are the failure modes of Citus and how does it handle loss of nodes in the cluster?
For someone who is interested in migrating to Citus, what is involved in getting it deployed and moving the data out of an existing system?
How do the different options for leveraging Citus compare to each other and how do you determine which features to release or withhold in the open source version?
Are there any use cases that Citus enables which would be impractical to attempt in native Postgres?
What have been some of the most challenging aspects of building the Citus extension?
What are the situations where you would advise against using Citus?
What are some of the most interesting or impressive uses of Citus that you have seen?
What are some of the features that you have planned for future releases of Citus?
Contact Info
Citus Data
citusdata.com
@citusdata on Twitter
citusdata on GitHub
Craig
Email
Website
@craigkerstiens on Twitter
Ozgun
Email
ozgune on GitHub
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
Citus Data
PostGreSQL
NoSQL
Timescale SQL blog post
PostGIS
PostGreSQL Graph Database
JSONB Data Type
PipelineDB
Timescale
PostGres-XL
Aurora PostGres
Amazon RDS
Streaming Replication
CitusMX
CTE (Common Table Expression)
HipMunk Citus Sharding Blog Post
Wal-e
Wal-g
Heap Analytics
HyperLogLog
C-Store
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
1/8/2018 • 46 minutes, 44 seconds
Wallaroo with Sean T. Allen - Episode 12
Summary
Data oriented applications that need to operate on large, fast-moving streams of information can be difficult to build and scale due to the need to manage their state. In this episode Sean T. Allen, VP of engineering for Wallaroo Labs, explains how Wallaroo was designed and built to reduce the cognitive overhead of building this style of project. He explains the motivation for building Wallaroo, how it is implemented, and how you can start using it today.
Preamble
Hello and welcome to the Data Engineering Podcast, the show about modern data infrastructure
When you’re ready to launch your next project you’ll need somewhere to deploy it. Check out Linode at dataengineeringpodcast.com/linode and get a $20 credit to try out their fast and reliable Linux virtual servers for running your data pipelines or trying out the tools you hear about on the show.
Continuous delivery lets you get new features in front of your users as fast as possible without introducing bugs or breaking production and GoCD is the open source platform made by the people at Thoughtworks who wrote the book about it. Go to dataengineeringpodcast.com/gocd to download and launch it today. Enterprise add-ons and professional support are available for added peace of mind.
Go to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch.
You can help support the show by checking out the Patreon page which is linked from the site.
To help other people find the show you can leave a review on iTunes, or Google Play Music, and tell your friends and co-workers
Your host is Tobias Macey and today I’m interviewing Sean T. Allen about Wallaroo, a framework for building and operating stateful data applications at scale
Interview
Introduction
How did you get involved in the area of data engineering?
What is Wallaroo and how did the project get started?
What is the Pony language, and what features does it have that make it well suited for the problem area that you are focusing on?
Why did you choose to focus first on Python as the language for interacting with Wallaroo and how is that integration implemented?
How is Wallaroo architected internally to allow for distributed state management?
Is the state persistent, or is it only maintained long enough to complete the desired computation?
If so, what format do you use for long term storage of the data?
What have been the most challenging aspects of building the Wallaroo platform?
Which axes of the CAP theorem have you optimized for?
For someone who wants to build an application on top of Wallaroo, what is involved in getting started?
Once you have a working application, what resources are necessary for deploying to production and what are the scaling factors?
What are the failure modes that users of Wallaroo need to account for in their application or infrastructure?
What are some situations or problem types for which Wallaroo would be the wrong choice?
What are some of the most interesting or unexpected uses of Wallaroo that you have seen?
What do you have planned for the future of Wallaroo?
Contact Info
IRC
Mailing List
Wallaroo Labs Twitter
Email
Personal Twitter
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
Wallaroo Labs
Storm Applied
Apache Storm
Risk Analysis
Pony Language
Erlang
Akka
Tail Latency
High Performance Computing
Python
Apache Software Foundation
Beyond Distributed Transactions: An Apostate’s View
Consistent Hashing
Jepsen
Lineage Driven Fault Injection
Chaos Engineering
QCon 2016 Talk
Codemesh in London: How did I get here?
CAP Theorem
CRDT
Sync Free Project
Basho
Wallaroo on GitHub
Docker
Puppet
Chef
Ansible
SaltStack
Kafka
TCP
Dask
Data Engineering Episode About Dask
Beowulf Cluster
Redis
Flink
Haskell
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
12/25/2017 • 59 minutes, 13 seconds
SiriDB: Scalable Open Source Timeseries Database with Jeroen van der Heijden - Episode 11
Summary
Time series databases have long been the cornerstone of a robust metrics system, but the existing options are often difficult to manage in production. In this episode Jeroen van der Heijden explains his motivation for writing a new database, SiriDB, the challenges that he faced in doing so, and how it works under the hood.
Preamble
Hello and welcome to the Data Engineering Podcast, the show about modern data infrastructure
When you’re ready to launch your next project you’ll need somewhere to deploy it. Check out Linode at dataengineeringpodcast.com/linode and get a $20 credit to try out their fast and reliable Linux virtual servers for running your data pipelines or trying out the tools you hear about on the show.
Continuous delivery lets you get new features in front of your users as fast as possible without introducing bugs or breaking production and GoCD is the open source platform made by the people at Thoughtworks who wrote the book about it. Go to dataengineeringpodcast.com/gocd to download and launch it today. Enterprise add-ons and professional support are available for added peace of mind.
Go to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch.
You can help support the show by checking out the Patreon page which is linked from the site.
To help other people find the show you can leave a review on iTunes, or Google Play Music, and tell your friends and co-workers
Your host is Tobias Macey and today I’m interviewing Jeroen van der Heijden about SiriDB, a next generation time series database
Interview
Introduction
How did you get involved in the area of data engineering?
What is SiriDB and how did the project get started?
What was the inspiration for the name?
What was the landscape of time series databases at the time that you first began work on Siri?
How does Siri compare to other time series databases such as InfluxDB, Timescale, KairosDB, etc.?
What do you view as the competition for Siri?
How is the server architected and how has the design evolved over the time that you have been working on it?
Can you describe how the clustering mechanism functions?
Is it possible to create pools with more than two servers?
What are the failure modes for SiriDB and where does it fall on the spectrum for the CAP theorem?
In the documentation it mentions needing to specify the retention period for the shards when creating a database. What is the reasoning for that and what happens to the individual metrics as they age beyond that time horizon?
One of the common difficulties when using a time series database in an operations context is the need for high cardinality of the metrics. How are metrics identified in Siri and is there any support for tagging?
What have been the most challenging aspects of building Siri?
In what situations or environments would you advise against using Siri?
Contact Info
joente on Github
LinkedIn
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
SiriDB
Oversight
InfluxDB
LevelDB
OpenTSDB
Timescale DB
KairosDB
Write Ahead Log
Grafana
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
12/18/2017 • 33 minutes, 52 seconds
Confluent Schema Registry with Ewen Cheslack-Postava - Episode 10
Summary
To process your data you need to know what shape it has, which is why schemas are important. When you are processing that data in multiple systems it can be difficult to ensure that they all have an accurate representation of that schema, which is why Confluent has built a schema registry that plugs into Kafka. In this episode Ewen Cheslack-Postava explains what the schema registry is, how it can be used, and how they built it. He also discusses how it can be extended for other deployment targets and use cases, and additional features that are planned for future releases.
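As a rough illustration of the workflow discussed in the episode, the sketch below registers an Avro schema with a schema registry over its REST interface. The endpoint shape and content type follow the commonly documented /subjects/&lt;subject&gt;/versions convention, but verify the details against the Confluent documentation before relying on them; the schema itself is a made-up example.

```typescript
// Hedged sketch: register an Avro schema with a schema registry via REST.
async function registerSchema(baseUrl: string, subject: string): Promise<number> {
  const schema = {
    type: "record",
    name: "PageView",
    fields: [
      { name: "user_id", type: "string" },
      { name: "url", type: "string" },
      { name: "viewed_at", type: "long" },
    ],
  };
  const res = await fetch(`${baseUrl}/subjects/${subject}/versions`, {
    method: "POST",
    headers: { "Content-Type": "application/vnd.schemaregistry.v1+json" },
    body: JSON.stringify({ schema: JSON.stringify(schema) }),
  });
  if (!res.ok) {
    throw new Error(`schema registration failed: ${res.status}`);
  }
  const { id } = (await res.json()) as { id: number };
  return id; // registry-assigned id, referenced by producers and consumers
}
```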
Preamble
Hello and welcome to the Data Engineering Podcast, the show about modern data infrastructure
When you’re ready to launch your next project you’ll need somewhere to deploy it. Check out Linode at dataengineeringpodcast.com/linode and get a $20 credit to try out their fast and reliable Linux virtual servers for running your data pipelines or trying out the tools you hear about on the show.
Continuous delivery lets you get new features in front of your users as fast as possible without introducing bugs or breaking production and GoCD is the open source platform made by the people at Thoughtworks who wrote the book about it. Go to dataengineeringpodcast.com/gocd to download and launch it today. Enterprise add-ons and professional support are available for added peace of mind.
Go to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch.
You can help support the show by checking out the Patreon page which is linked from the site.
To help other people find the show you can leave a review on iTunes, or Google Play Music, and tell your friends and co-workers
Your host is Tobias Macey and today I’m interviewing Ewen Cheslack-Postava about the Confluent Schema Registry
Interview
Introduction
How did you get involved in the area of data engineering?
What is the schema registry and what was the motivating factor for building it?
If you are using Avro, what benefits does the schema registry provide over and above the capabilities of Avro’s built in schemas?
How did you settle on Avro as the format to support and what would be involved in expanding that support to other serialization options?
Conversely, what would be involved in using a storage backend other than Kafka?
What are some of the alternative technologies available for people who aren’t using Kafka in their infrastructure?
What are some of the biggest challenges that you faced while designing and building the schema registry?
What is the tipping point in terms of system scale or complexity when it makes sense to invest in a shared schema registry and what are the alternatives for smaller organizations?
What are some of the features or enhancements that you have in mind for future work?
Contact Info
ewencp on GitHub
Website
@ewencp on Twitter
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
Kafka
Confluent
Schema Registry
Second Life
Eve Online
Yes, Virginia, You Really Do Need a Schema Registry
JSON-Schema
Parquet
Avro
Thrift
Protocol Buffers
Zookeeper
Kafka Connect
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
12/10/2017 • 49 minutes, 21 seconds
data.world with Bryon Jacob - Episode 9
Summary
We have tools and platforms for collaborating on software projects and linking them together, wouldn’t it be nice to have the same capabilities for data? The team at data.world are working on building a platform to host and share data sets for public and private use that can be linked together to build a semantic web of information. The CTO, Bryon Jacob, discusses how the company got started, their mission, and how they have built and evolved their technical infrastructure.
Preamble
Hello and welcome to the Data Engineering Podcast, the show about modern data infrastructure
When you’re ready to launch your next project you’ll need somewhere to deploy it. Check out Linode at dataengineeringpodcast.com/linode and get a $20 credit to try out their fast and reliable Linux virtual servers for running your data pipelines or trying out the tools you hear about on the show.
Continuous delivery lets you get new features in front of your users as fast as possible without introducing bugs or breaking production and GoCD is the open source platform made by the people at Thoughtworks who wrote the book about it. Go to dataengineeringpodcast.com/gocd to download and launch it today. Enterprise add-ons and professional support are available for added peace of mind.
Go to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch.
You can help support the show by checking out the Patreon page which is linked from the site.
To help other people find the show you can leave a review on iTunes, or Google Play Music, and tell your friends and co-workers
This is your host Tobias Macey and today I’m interviewing Bryon Jacob about the technology and purpose that drive data.world
Interview
Introduction
How did you first get involved in the area of data management?
What is data.world and what is its mission and how does your status as a B Corporation tie into that?
The platform that you have built provides hosting for a large variety of data sizes and types. What does the technical infrastructure consist of and how has that architecture evolved from when you first launched?
What are some of the scaling problems that you have had to deal with as the amount and variety of data that you host has increased?
What are some of the technical challenges that you have been faced with that are unique to the task of hosting a heterogeneous assortment of data sets that intended for shared use?
How do you deal with issues of privacy or compliance associated with data sets that are submitted to the platform?
What are some of the improvements or new capabilities that you are planning to implement as part of the data.world platform?
What are the projects or companies that you consider to be your competitors?
What are some of the most interesting or unexpected uses of the data.world platform that you are aware of?
Contact Information
@bryonjacob on Twitter
bryonjacob on GitHub
LinkedIn
Parting Question
From your perspective, what is the biggest gap in the tooling or technology for data management today?
Links
data.world
HomeAway
Semantic Web
Knowledge Engineering
Ontology
Open Data
RDF
CSVW
SPARQL
DBPedia
Triplestore
Header Dictionary Triples
Apache Jena
Tabula
Tableau Connector
Excel Connector
Data For Democracy
Jonathan Morgan
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
12/3/2017 • 46 minutes, 24 seconds
Data Serialization Formats with Doug Cutting and Julien Le Dem - Episode 8
Summary
With the wealth of formats for sending and storing data it can be difficult to determine which one to use. In this episode Doug Cutting, creator of Avro, and Julien Le Dem, creator of Parquet, dig into the different classes of serialization formats, what their strengths are, and how to choose one for your workload. They also discuss the role of Arrow as a mechanism for in-memory data sharing and how hardware evolution will influence the state of the art for data formats.
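As a small companion example to the episode (the record schema here is invented for illustration), this sketch round-trips a record through Avro's compact binary encoding using the avsc npm package, with the schema kept separate from the encoded bytes.

```typescript
// Illustrative Avro round trip with the avsc package.
import avro from "avsc";

const pageViewType = avro.Type.forSchema({
  type: "record",
  name: "PageView",
  fields: [
    { name: "userId", type: "string" },
    { name: "url", type: "string" },
    { name: "viewedAt", type: "long" },
  ],
});

// Encode a record into Avro's binary form; the schema travels separately
// (for example in a file header or a schema registry).
const encoded = pageViewType.toBuffer({
  userId: "u-123",
  url: "https://example.com/a",
  viewedAt: Date.now(),
});

// Decode it back using the same schema.
const decoded = pageViewType.fromBuffer(encoded);
console.log(decoded.url);
```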
Preamble
Hello and welcome to the Data Engineering Podcast, the show about modern data infrastructure
When you’re ready to launch your next project you’ll need somewhere to deploy it. Check out Linode at dataengineeringpodcast.com/linode and get a $20 credit to try out their fast and reliable Linux virtual servers for running your data pipelines or trying out the tools you hear about on the show.
Continuous delivery lets you get new features in front of your users as fast as possible without introducing bugs or breaking production and GoCD is the open source platform made by the people at Thoughtworks who wrote the book about it. Go to dataengineeringpodcast.com/gocd to download and launch it today. Enterprise add-ons and professional support are available for added peace of mind.
Go to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch.
You can help support the show by checking out the Patreon page which is linked from the site.
To help other people find the show you can leave a review on iTunes, or Google Play Music, and tell your friends and co-workers
This is your host Tobias Macey and today I’m interviewing Julien Le Dem and Doug Cutting about data serialization formats and how to pick the right one for your systems.
Interview
Introduction
How did you first get involved in the area of data management?
What are the main serialization formats used for data storage and analysis?
What are the tradeoffs that are offered by the different formats?
How have the different storage and analysis tools influenced the types of storage formats that are available?
You’ve each developed a new on-disk data format, Avro and Parquet respectively. What were your motivations for investing that time and effort?
Why is it important for data engineers to carefully consider the format in which they transfer their data between systems?
What are the switching costs involved in moving from one format to another after you have started using it in a production system?
What are some of the new or upcoming formats that you are each excited about?
How do you anticipate the evolving hardware, patterns, and tools for processing data to influence the types of storage formats that maintain or grow their popularity?
Contact Information
Doug:
cutting on GitHub
Blog
@cutting on Twitter
Julien
Email
@J_ on Twitter
Blog
julienledem on GitHub
Links
Apache Avro
Apache Parquet
Apache Arrow
Hadoop
Apache Pig
Xerox Parc
Excite
Nutch
Vertica
Dremel White Paper
Twitter Blog on Release of Parquet
CSV
XML
Hive
Impala
Presto
Spark SQL
Brotli
ZStandard
Apache Drill
Trevni
Apache Calcite
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
11/22/2017 • 51 minutes, 43 seconds
Buzzfeed Data Infrastructure with Walter Menendez - Episode 7
Summary
Buzzfeed needs to be able to understand how its users are interacting with the myriad articles, videos, etc. that they are posting. This lets them produce new content that will continue to be well-received. To surface the insights that they need to grow their business they need a robust data infrastructure to reliably capture all of those interactions. Walter Menendez is a data engineer on their infrastructure team and in this episode he describes how they manage data ingestion from a wide array of sources and create an interface for their data scientists to produce valuable conclusions.
Preamble
Hello and welcome to the Data Engineering Podcast, the show about modern data management
When you’re ready to launch your next project you’ll need somewhere to deploy it. Check out Linode at dataengineeringpodcast.com/linode and get a $20 credit to try out their fast and reliable Linux virtual servers for running your data pipelines or trying out the tools you hear about on the show.
Continuous delivery lets you get new features in front of your users as fast as possible without introducing bugs or breaking production and GoCD is the open source platform made by the people at Thoughtworks who wrote the book about it. Go to dataengineeringpodcast.com/gocd to download and launch it today. Enterprise add-ons and professional support are available for added peace of mind.
Go to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch.
You can help support the show by checking out the Patreon page which is linked from the site.
To help other people find the show you can leave a review on iTunes, or Google Play Music, and tell your friends and co-workers
Your host is Tobias Macey and today I’m interviewing Walter Menendez about the data engineering platform at Buzzfeed
Interview
Introduction
How did you get involved in the area of data management?
How is the data engineering team at Buzzfeed structured and what kinds of projects are you responsible for?
What are some of the types of data inputs and outputs that you work with at Buzzfeed?
Is the core of your system using a real-time streaming approach or is it primarily batch-oriented and what are the business needs that drive that decision?
What does the architecture of your data platform look like and what are some of the most significant areas of technical debt?
Which platforms and languages are most widely leveraged in your team and what are some of the outliers?
What are some of the most significant challenges that you face, both technically and organizationally?
What are some of the dead ends that you have run into or failed projects that you have tried?
What has been the most successful project that you have completed and how do you measure that success?
Contact Info
@hackwalter on Twitter
walterm on GitHub
Links
Data Literacy
MIT Media Lab
Tumblr
Data Capital
Data Infrastructure
Google Analytics
Datadog
Python
Numpy
SciPy
NLTK
Go Language
NSQ
Tornado
PySpark
AWS EMR
Redshift
Tracking Pixel
Google Cloud
Don’t try to be google
Stop Hiring DevOps Engineers and Start Growing Them
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
11/14/2017 • 43 minutes, 40 seconds
Astronomer with Ry Walker - Episode 6
Summary
Building a data pipeline that is reliable and flexible is a difficult task, especially when you have a small team. Astronomer is a platform that lets you skip straight to processing your valuable business data. Ry Walker, the CEO of Astronomer, explains how the company got started, how the platform works, and their commitment to open source.
Preamble
Hello and welcome to the Data Engineering Podcast, the show about modern data infrastructure
When you’re ready to launch your next project you’ll need somewhere to deploy it. Check out Linode at dataengineeringpodcast.com/linode and get a $20 credit to try out their fast and reliable Linux virtual servers for running your data pipelines or trying out the tools you hear about on the show.
Go to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch.
You can help support the show by checking out the Patreon page which is linked from the site.
To help other people find the show you can leave a review on iTunes, or Google Play Music, and tell your friends and co-workers
This is your host Tobias Macey and today I’m interviewing Ry Walker, CEO of Astronomer, the platform for data engineering.
Interview
Introduction
How did you first get involved in the area of data management?
What is Astronomer and how did it get started?
Regulatory challenges of processing other people’s data
What does your data pipelining architecture look like?
What are the most challenging aspects of building a general purpose data management environment?
What are some of the most significant sources of technical debt in your platform?
Can you share some of the failures that you have encountered while architecting or building your platform and company and how you overcame them?
There are certain areas of the overall data engineering workflow that are well defined and have numerous tools to choose from. What are some of the unsolved problems in data management?
What are some of the most interesting or unexpected uses of your platform that you are aware of?
Contact Information
Email
@rywalker on Twitter
Links
Astronomer
Kiss Metrics
Segment
Marketing tools chart
Clickstream
HIPAA
FERPA
PCI
Mesos
Mesos DC/OS
Airflow
SSIS
Marathon
Prometheus
Grafana
Terraform
Kafka
Spark
ELK Stack
React
GraphQL
PostGreSQL
MongoDB
Ceph
Druid
Aries
Vault
Adapter Pattern
Docker
Kinesis
API Gateway
Kong
AWS Lambda
Flink
Redshift
NOAA
Informatica
SnapLogic
Meteor
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
8/6/2017 • 42 minutes, 50 seconds
Rebuilding Yelp's Data Pipeline with Justin Cunningham - Episode 5
Summary
Yelp needs to be able to consume and process all of the user interactions that happen in their platform in as close to real-time as possible. To achieve that goal they embarked on a journey to refactor their monolithic architecture to be more modular and modern, and then they open sourced it! In this episode Justin Cunningham joins me to discuss the decisions they made and the lessons they learned in the process, including what worked, what didn’t, and what he would do differently if he was starting over today.
Preamble
Hello and welcome to the Data Engineering Podcast, the show about modern data infrastructure
When you’re ready to launch your next project you’ll need somewhere to deploy it. Check out Linode at dataengineeringpodcast.com/linode and get a $20 credit to try out their fast and reliable Linux virtual servers for running your data pipelines or trying out the tools you hear about on the show.
Go to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch.
You can help support the show by checking out the Patreon page which is linked from the site.
To help other people find the show you can leave a review on iTunes, or Google Play Music, and tell your friends and co-workers
Your host is Tobias Macey and today I’m interviewing Justin Cunningham about Yelp’s data pipeline
Interview with Justin Cunningham
Introduction
How did you get involved in the area of data engineering?
Can you start by giving an overview of your pipeline and the type of workload that you are optimizing for?
What are some of the dead ends that you experienced while designing and implementing your pipeline?
As you were picking the components for your pipeline, how did you prioritize the build vs buy decisions and what are the pieces that you ended up building in-house?
What are some of the failure modes that you have experienced in the various parts of your pipeline and how have you engineered around them?
What are you using to automate deployment and maintenance of your various components and how do you monitor them for availability and accuracy?
While you were re-architecting your monolithic application into a service oriented architecture and defining the flows of data, how were you able to make the switch while verifying that you were not introducing unintended mutations into the data being produced?
Did you plan to open-source the work that you were doing from the start, or was that decision made after the project was completed? What were some of the challenges associated with making sure that it was properly structured to be amenable to making it public?
What advice would you give to anyone who is starting a brand new project and how would that advice differ for someone who is trying to retrofit a data management architecture onto an existing project?
Keep in touch
Yelp Engineering Blog
Email
Links
Kafka
Redshift
ETL
Business Intelligence
Change Data Capture
LinkedIn Data Bus
Apache Storm
Apache Flink
Confluent
Apache Avro
Game Days
Chaos Monkey
Simian Army
PaaSta
Apache Mesos
Marathon
SignalFX
Sensu
Thrift
Protocol Buffers
JSON Schema
Debezium
Kafka Connect
Apache Beam
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
6/18/2017 • 42 minutes, 27 seconds
ScyllaDB with Eyal Gutkind - Episode 4
Summary
If you like the features of Cassandra DB but wish it ran faster with fewer resources then ScyllaDB is the answer you have been looking for. In this episode Eyal Gutkind explains how Scylla was created and how it differentiates itself in the crowded database market.
Preamble
Hello and welcome to the Data Engineering Podcast, the show about modern data infrastructure
Go to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch.
You can help support the show by checking out the Patreon page which is linked from the site.
To help other people find the show you can leave a review on iTunes, or Google Play Music, and tell your friends and co-workers
Your host is Tobias Macey and today I’m interviewing Eyal Gutkind about ScyllaDB
Interview
Introduction
How did you get involved in the area of data management?
What is ScyllaDB and why would someone choose to use it?
How do you ensure sufficient reliability and accuracy of the database engine?
The large draw of Scylla is that it is a drop-in replacement for Cassandra with faster performance and no requirement to manage the JVM. What are some of the technical and architectural design choices that have enabled you to do that?
Deployment and tuning
What challenges are introduced as a result of needing to maintain API compatibility with a different product?
Do you have visibility or advance knowledge of what new interfaces are being added to the Apache Cassandra project, or are you forced to play a game of keep up?
Are there any issues with compatibility of Cassandra plugins running on Scylla?
For someone who wants to deploy and tune Scylla, what are the steps involved?
Is it possible to join a Scylla cluster to an existing Cassandra cluster for live data migration and zero downtime swap?
What prompted the decision to form a company around the database?
What are some other uses of Seastar?
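Because Scylla maintains wire-level CQL compatibility, an existing Cassandra client can usually be pointed at a Scylla node without code changes. The sketch below illustrates that drop-in property using the Python cassandra-driver; the contact point, keyspace, and table are hypothetical placeholders, not anything discussed in the episode.

```python
# A minimal sketch of the drop-in claim: the standard Cassandra driver
# speaks CQL to Scylla unchanged. Contact point, keyspace, and table names
# are hypothetical placeholders.
from cassandra.cluster import Cluster

cluster = Cluster(["10.0.0.5"], port=9042)  # a Scylla node instead of Cassandra
session = cluster.connect()

session.execute("""
    CREATE KEYSPACE IF NOT EXISTS demo
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}
""")
session.execute("""
    CREATE TABLE IF NOT EXISTS demo.events (id uuid PRIMARY KEY, payload text)
""")

row = session.execute("SELECT release_version FROM system.local").one()
print(row.release_version)
cluster.shutdown()
```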
Keep in touch
Eyal
LinkedIn
ScyllaDB
Website
@ScyllaDB on Twitter
GitHub
Mailing List
Slack
Links
Seastar Project
DataStax
XFS
TitanDB
OpenTSDB
KairosDB
CQL
Pedis
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
3/18/2017 • 35 minutes, 6 seconds
Defining Data Engineering with Maxime Beauchemin - Episode 3
Summary
What exactly is data engineering? How has it evolved in recent years and where is it going? How do you get started in the field? In this episode, Maxime Beauchemin joins me to discuss these questions and more.
Transcript provided by CastSource
Preamble
Hello and welcome to the Data Engineering Podcast, the show about modern data infrastructure
Go to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch.
You can help support the show by checking out the Patreon page which is linked from the site.
To help other people find the show you can leave a review on iTunes, or Google Play Music, and tell your friends and co-workers
Your host is Tobias Macey and today I’m interviewing Maxime Beauchemin
Questions
Introduction
How did you get involved in the field of data engineering?
How do you define data engineering and how has that changed in recent years?
Do you think that the DevOps movement over the past few years has had any impact on the discipline of data engineering? If so, what kinds of cross-over have you seen?
For someone who wants to get started in the field of data engineering what are some of the necessary skills?
What do you see as the biggest challenges facing data engineers currently?
At what scale does it become necessary to differentiate between someone who does data engineering vs data infrastructure and what are the differences in terms of skill set and problem domain?
How much analytical knowledge is necessary for a typical data engineer?
What are some of the most important considerations when establishing new data sources to ensure that the resulting information is of sufficient quality?
You have commented on the fact that data engineering borrows a number of elements from software engineering. Where does the concept of unit testing fit in data management, and what are some of the most effective patterns for implementing that practice? (A small illustrative sketch follows this list of questions.)
How has the work done by data engineers and managers of data infrastructure bled back into mainstream software and systems engineering in terms of tools and best practices?
How do you see the role of data engineers evolving in the next few years?
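As a concrete point of reference for the unit-testing question above, the sketch below shows one common pattern: asserting invariants on a batch of records before it is loaded downstream. This is an illustration of the general practice, not a pattern Maxime prescribes in the episode; the column names and thresholds are hypothetical.

```python
# One common pattern for "unit testing" data: assert invariants on a batch
# before loading it downstream. Columns and thresholds are hypothetical.
import pandas as pd

def check_orders_batch(df: pd.DataFrame) -> None:
    # Structural check: the columns downstream consumers depend on exist.
    required = {"order_id", "user_id", "amount", "created_at"}
    missing = required - set(df.columns)
    assert not missing, f"missing columns: {missing}"

    # Integrity checks: primary key is unique, amounts are non-negative.
    assert df["order_id"].is_unique, "duplicate order_id values"
    assert (df["amount"] >= 0).all(), "negative order amounts"

    # Freshness check: the batch contains recent data.
    newest = pd.to_datetime(df["created_at"]).max()
    assert newest >= pd.Timestamp.now() - pd.Timedelta(days=2), "stale batch"

if __name__ == "__main__":
    sample = pd.DataFrame({
        "order_id": [1, 2],
        "user_id": [10, 11],
        "amount": [19.99, 5.00],
        "created_at": [pd.Timestamp.now(), pd.Timestamp.now()],
    })
    check_orders_batch(sample)
    print("batch passed all checks")
```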
Keep In Touch
@mistercrunch on Twitter
mistercrunch on GitHub
Medium
Links
Datadog
Airflow
The Rise of the Data Engineer
Druid.io
Luigi
Apache Beam
Samza
Hive
Data Modeling
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
3/5/2017 • 45 minutes, 20 seconds
Dask with Matthew Rocklin - Episode 2
Summary
There is a vast constellation of tools and platforms for processing and analyzing your data. In this episode Matthew Rocklin talks about how Dask fills the gap between a task-oriented workflow tool and an in-memory processing framework, and how it brings the power of Python to bear on the problem of big data.
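As a rough illustration of that gap-filling role, the sketch below shows the two faces of Dask: a pandas-like dataframe that scales across many files, and an explicit task graph built with dask.delayed. The file patterns and column names are hypothetical, not examples from the episode.

```python
# A minimal sketch of the two sides of Dask described above. The file
# patterns and column names are hypothetical.
import dask
import dask.dataframe as dd

# Pandas-style API over many files; nothing runs until .compute().
df = dd.read_csv("logs/2017-*.csv")
daily_mean = df.groupby("day")["latency_ms"].mean()

# Task-graph style: wire ordinary Python functions into a lazy graph.
@dask.delayed
def load(path):
    with open(path) as f:
        return f.read()

@dask.delayed
def count_lines(text):
    return text.count("\n")

total = dask.delayed(sum)([count_lines(load(p)) for p in ["a.log", "b.log"]])

# Both graphs execute in parallel on the local scheduler when computed.
print(daily_mean.compute())
print(total.compute())
```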
Preamble
Hello and welcome to the Data Engineering Podcast, the show about modern data infrastructure
Go to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch.
You can help support the show by checking out the Patreon page which is linked from the site.
To help other people find the show you can leave a review on iTunes, or Google Play Music, and tell your friends and co-workers
Your host is Tobias Macey and today I’m interviewing Matthew Rocklin about Dask and the Blaze ecosystem.
Interview with Matthew Rocklin
Introduction
How did you get involved in the area of data engineering?
Dask began its life as part of the Blaze project. Can you start by describing what Dask is and how it originated?
There are a vast number of tools in the field of data analytics. What are some of the specific use cases that Dask was built for that couldn't be solved by the existing options?
One of the compelling features of Dask is the fact that it is a Python library that allows for distributed computation at a scale that has largely been the exclusive domain of tools in the Hadoop ecosystem. Why do you think that the JVM has been the reigning platform in the data analytics space for so long?
Do you consider Dask, along with the larger Blaze ecosystem, to be a competitor to the Hadoop ecosystem, either now or in the future?
Are you seeing many Hadoop or Spark solutions being migrated to Dask? If so, what are the common reasons?
There is a strong focus for using Dask as a tool for interactive exploration of data. How does it compare to something like Apache Drill?
For anyone looking to integrate Dask into an existing code base that is already using NumPy or Pandas, what does that process look like?
How do the task graph capabilities compare to something like Airflow or Luigi?
Looking through the documentation for the graph specification in Dask, it appears that there is the potential to introduce cycles or other bugs into a large or complex task chain. Is there any built-in tooling to check for that before submitting the graph for execution?
What are some of the most interesting or unexpected projects that you have seen Dask used for?
What do you perceive as being the most relevant aspects of Dask for data engineering/data infrastructure practitioners, as compared to the end users of the systems that they support?
What are some of the most significant problems that you have been faced with, and which still need to be overcome in the Dask project?
I know that the work on Dask is largely performed under the umbrella of PyData and sponsored by Continuum Analytics. What are your thoughts on the financial landscape for open source data analytics and distributed computation frameworks as compared to the broader world of open source projects?
Keep in touch
@mrocklin on Twitter
mrocklin on GitHub
Links
http://matthewrocklin.com/blog/work/2016/09/22/cluster-deployments
https://opendatascience.com/blog/dask-for-institutions/
Continuum Analytics
2sigma
XArray
Tornado
Website
Podcast Interview
Airflow
Luigi
Mesos
Kubernetes
Spark
Dryad
Yarn
Read The Docs
XData
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
1/22/2017 • 46 minutes
Pachyderm with Daniel Whitenack - Episode 1
Summary
Do you wish that you could track the changes in your data the same way that you track the changes in your code? Pachyderm is a platform for building a data lake with a versioned file system. It also lets you use whatever languages you want to run your analysis with its container based task graph. This week Daniel Whitenack shares the story of how the project got started, how it works under the covers, and how you can get started using it today!
Preamble
Hello and welcome to the Data Engineering Podcast, the show about modern data infrastructure
Go to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch.
You can help support the show by checking out the Patreon page which is linked from the site.
To help other people find the show you can leave a review on iTunes, or Google Play Music, and tell your friends and co-workers
Your host is Tobias Macey and today I’m interviewing Daniel Whitenack about Pachyderm, a modern container based system for building and analyzing a versioned data lake.
Interview with Daniel Whitenack
Introduction
How did you get started in the data engineering space?
What is Pachyderm and what problem were you trying to solve when the project was started?
Where does the name come from?
What are some of the competing projects in the space and what features does Pachyderm offer that would convince someone to choose it over the other options?
Because the analysis code and the data that it acts on are versioned together, it is possible to track the provenance of the end result. Why is this such an important capability in the context of data engineering and analytics?
What does Pachyderm use for the distribution and scaling mechanism of the file system?
Given that you can version your data and track all of the modifications made to it in a manner that allows for traversal of those changesets, how much additional storage is necessary over and above the original capacity needed for the raw data?
For a typical use of Pachyderm would someone keep all of the revisions in perpetuity or are the changesets primarily just useful in the context of an analysis workflow?
Given that the state of the data is calculated by applying the diffs in sequence, what impact does that have on processing speed and what are some of the ways of mitigating that? (A toy illustration of diff replay follows this list of questions.)
Another compelling feature of Pachyderm is the fact that it natively supports the use of any language for interacting with your data. Why is this such an important capability and why is it more difficult with alternative solutions?
How did you implement this feature so that it would be maintainable and easy to implement for end users?
Given that the intent of using containers is for encapsulating the analysis code from experimentation through to production, it seems that there is the potential for the implementations to run into problems as they scale. What are some things that users should be aware of to help mitigate this?
The data pipeline and dependency graph tooling is a useful addition to the combination of file system and processing interface. Does that preclude any requirement for external tools such as Luigi or Airflow?
I see that the docs mention using the map reduce pattern for analyzing the data in Pachyderm. Does it support other approaches such as streaming or tools like Apache Drill?
What are some of the most interesting deployments and uses of Pachyderm that you have seen?
What are some of the areas that you are looking for help from the community and are there any particular issues that the listeners can check out to get started with the project?
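The diff-replay question above is easier to picture with a toy model. The sketch below is not Pachyderm's implementation; it is a conceptual illustration of why rebuilding the latest state by applying diffs in sequence gets slower as the commit chain grows, which is the cost that snapshotting or caching strategies mitigate. File paths and contents are hypothetical.

```python
# A toy model (not Pachyderm's implementation) of reconstructing versioned
# state by replaying diffs in sequence. Paths and contents are hypothetical.
from copy import deepcopy

def apply_diff(state: dict, diff: dict) -> dict:
    """Apply one commit's diff: None means delete, otherwise upsert."""
    new_state = deepcopy(state)
    for path, content in diff.items():
        if content is None:
            new_state.pop(path, None)
        else:
            new_state[path] = content
    return new_state

def materialize(diffs: list) -> dict:
    """Rebuild the repo by replaying every diff from the first commit."""
    state = {}
    for diff in diffs:  # cost grows with the number of commits
        state = apply_diff(state, diff)
    return state

commits = [
    {"data/a.csv": "v1"},
    {"data/b.csv": "v1"},
    {"data/a.csv": "v2", "data/b.csv": None},
]
print(materialize(commits))  # {'data/a.csv': 'v2'}
```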
Keep in touch
Daniel
Twitter – @dwhitena
Pachyderm
Website
Free Weekend Project
GopherNotes
Links
AirBnB
RethinkDB
Flocker
Infinite Project
Git LFS
Luigi
Airflow
Kafka
Kubernetes
Rkt
SciKit Learn
Docker
Minikube
General Fusion
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA
1/14/2017 • 44 minutes, 42 seconds
Introducing The Show
Preamble
Hello and welcome to the Data Engineering Podcast, the show about modern data infrastructure
Go to dataengineeringpodcast.com to subscribe to the show, sign up for the newsletter, read the show notes, and get in touch.
You can help support the show by checking out the Patreon page which is linked from the site.
To help other people find the show you can leave a review on iTunes, or Google Play Music, share it on social media, and tell your friends and co-workers.
I’m your host, Tobias Macey, and today I’m speaking with Maxime Beauchemin about what it means to be a data engineer.
Interview
Who am I
Systems administrator and software engineer, now DevOps, focus on automation
Host of Podcast.__init__
How did I get involved in data management
Why am I starting a podcast about Data Engineering
Interesting area with a lot of activity
Not currently any shows focused on data engineering
What kinds of topics do I want to cover
Data stores
Pipelines
Tooling
Automation
Monitoring
Testing
Best practices
Common challenges
Defining the role/job hunting
Relationship with data engineers/data analysts
Get in touch and subscribe
Website
Newsletter
Twitter
Email
The intro and outro music is from The Hug by The Freak Fandango Orchestra / CC BY-SA