This is my first try at anything open source, so any feedback is welcome :)

  • katoOP
    6 months ago

    ETL stands for extract, transform, and load. It is a widely used architecture for data pipelines: you load data from different sources (like an S3 or GCS bucket), apply some transformation logic to either aggregate the data or reshape it in some other way (changing the schema, for example), and then output the result as a different data product.
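
    To make the three stages concrete, here is a minimal sketch in plain Rust using only the standard library. It is not how this library works internally; the `Event` struct, the `user,amount` input format, and the function names are all made up for illustration, and the "source" is just an in-memory string instead of a bucket.

```rust
use std::collections::BTreeMap;

/// One parsed input record (hypothetical schema, for illustration only).
#[derive(Debug, PartialEq)]
struct Event {
    user: String,
    amount: f64,
}

/// Extract: parse raw "user,amount" lines into typed records.
/// A real pipeline would read from S3, GCS, etc.; malformed rows are skipped.
fn extract(raw: &str) -> Vec<Event> {
    raw.lines()
        .filter_map(|line| {
            let (user, amount) = line.split_once(',')?;
            Some(Event {
                user: user.trim().to_string(),
                amount: amount.trim().parse().ok()?,
            })
        })
        .collect()
}

/// Transform: aggregate the total amount per user.
fn transform(events: &[Event]) -> BTreeMap<String, f64> {
    let mut totals = BTreeMap::new();
    for e in events {
        *totals.entry(e.user.clone()).or_insert(0.0) += e.amount;
    }
    totals
}

/// Load: emit the result as a new data product (here, CSV text).
fn load(totals: &BTreeMap<String, f64>) -> String {
    totals
        .iter()
        .map(|(user, total)| format!("{},{:.2}", user, total))
        .collect::<Vec<_>>()
        .join("\n")
}

fn main() {
    let raw = "alice,10.5\nbob,3\nalice,4.5\nnot a valid row";
    let output = load(&transform(&extract(raw)));
    println!("{}", output);
}
```

    The point is only the shape: three small steps chained together, where the output of one is the input of the next.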

    These pipelines are then usually run on a schedule or triggered so that they periodically output data for different time periods. That lets you deal with large data sets by breaking them down into more manageable pieces, for example for a downstream data science team or a team of data analysts.
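
    As a rough illustration of the "one time period per run" idea, the sketch below groups rows by day and then processes a single day at a time. Again this is standard-library Rust only; the `(date, value)` row shape, the dates, and the function names are assumptions for the example, and a real scheduler would decide which day to pass in.

```rust
use std::collections::BTreeMap;

/// Group (date, value) rows by date so each run can handle one period.
fn partition_by_day<'a>(rows: &[(&'a str, i64)]) -> BTreeMap<&'a str, Vec<i64>> {
    let mut parts: BTreeMap<&str, Vec<i64>> = BTreeMap::new();
    for &(day, value) in rows {
        parts.entry(day).or_default().push(value);
    }
    parts
}

/// One "run" of the pipeline: aggregate only the rows for a single day.
/// Returns None if there is no data for that day.
fn run_for_day(parts: &BTreeMap<&str, Vec<i64>>, day: &str) -> Option<i64> {
    parts.get(day).map(|values| values.iter().sum::<i64>())
}

fn main() {
    // Hypothetical input: rows tagged with the day they belong to.
    let rows = [("2024-01-01", 5), ("2024-01-02", 7), ("2024-01-01", 3)];
    let parts = partition_by_day(&rows);

    // A scheduler would trigger one run per day; here we loop over them.
    for day in parts.keys() {
        println!("{} -> {:?}", day, run_for_day(&parts, day));
    }
}
```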

    What this library aims to do is combine the querying capabilities of DataFusion, which is a query parser and query engine, with the Delta Lake protocol to provide a pretty capable framework for building these pipelines in a short amount of time. I’ve used both DataFusion and delta-rs for some time and I really love these projects, as they enable me to use Rust in my day job as a data engineer, which is usually a Python-dominated field.

    However, they are quite complex, as they cover a wide variety of use cases. This library tries to reduce the complexity of using them by constraining them to the use case of building simple data pipelines.