Hi, what is the best way to manage the databases I use across different projects? I work with a lot of CSVs that I have to prepare every time I use them (I just copy-paste the code from other projects), but I would like to build a module that I can import and that holds all the processing for each database. For example, for this database I usually define columns = [(configuration of, my columns)], names = [names], dates = [list of date columns], dtypes = {column: type},

then database_1 = pd.read_fwf(**kwargs), database_2 = pd.read_fwf(**kwargs), database_3 = pd.read_fwf(**kwargs)…

Then database = pd.concat([database_1…])

But I would like to have a module I could import that holds all my databases and their ETL configuration, so I could just do something like ‘database = my_module.database’ to load the database without repeating that whole process every time.

Thanks for any help.
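
A minimal sketch of the kind of module described above, assuming each database is a set of fixed-width files that share the same `pd.read_fwf` keyword arguments. The names `register`, `load`, and `_REGISTRY` are hypothetical and only illustrate the pattern; they are not part of pandas.

```python
from functools import lru_cache
from pathlib import Path

import pandas as pd

# Hypothetical registry: database name -> (source files, read_fwf keyword arguments).
_REGISTRY: dict[str, tuple[list[Path], dict]] = {}


def register(name, paths, **read_kwargs):
    """Declare a database once: its files plus the kwargs passed to pd.read_fwf."""
    _REGISTRY[name] = ([Path(p) for p in paths], read_kwargs)
    _load.cache_clear()  # drop frames cached under an older definition


@lru_cache(maxsize=None)
def _load(name):
    """Read every file of the database and concatenate them (cached per process)."""
    paths, read_kwargs = _REGISTRY[name]
    frames = [pd.read_fwf(path, **read_kwargs) for path in paths]
    return pd.concat(frames, ignore_index=True)


def load(name):
    """Return a copy, so callers cannot modify the cached DataFrame by accident."""
    return _load(name).copy()
```

With this saved as `my_module.py` (a hypothetical file name), a project would call `my_module.register("database", paths, colspecs=..., names=..., parse_dates=..., dtype=...)` in one place and then `database = my_module.load("database")` wherever the data is needed.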

  • driving_crooner@lemmy.eco.br (OP)
    3 upvotes · 6 months ago

    Some of the data comes as CSV; the rest comes from database files, the SQL server, Excel, or web APIs. For some of them I even need to combine multiple sources with different formats.

    I guess I could have a database with everything tidier, easier to use, more secure, and with a lower failure rate. I’m still going to prepare the databases (I’m thinking of DataFrame objects in a pickle, but I want to experiment with Parquet) so they don’t have to be processed every time, but I wanted something where I could just write the name of the database and get the updated version.
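
    One way to get "the updated version" without reprocessing every time is to cache the prepared object and rebuild it only when a source file has changed. The sketch below is an assumption-laden illustration: `load_cached` and its `build` callback are hypothetical names, and it uses the standard-library `pickle` so it runs without extra dependencies. For DataFrames, `DataFrame.to_parquet` and `pd.read_parquet` could take the place of the pickle calls, provided a Parquet engine such as pyarrow is installed.

```python
import pickle
from pathlib import Path


def load_cached(cache_path, source_paths, build):
    """Return the cached object unless a source file is newer than the cache.

    `build` is a caller-supplied function that takes the list of source paths
    and returns the prepared object (for example a DataFrame).
    """
    cache_path = Path(cache_path)
    sources = [Path(p) for p in source_paths]
    newest_source = max((p.stat().st_mtime for p in sources), default=0.0)

    # Cache hit: the cache file exists and is at least as new as every source.
    if cache_path.exists() and cache_path.stat().st_mtime >= newest_source:
        with cache_path.open("rb") as fh:
            return pickle.load(fh)

    # Cache miss or stale cache: rebuild from the sources and store the result.
    result = build(sources)
    cache_path.parent.mkdir(parents=True, exist_ok=True)
    with cache_path.open("wb") as fh:
        pickle.dump(result, fh)
    return result
```

    Comparing modification times is a simple freshness check; it will not notice a source that was replaced by a file with an older timestamp, and pickle files should only be loaded from locations you trust.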

    • originalfrozenbanana@lemm.ee
      3 upvotes · 6 months ago

      This sounds kind of like a data warehouse. Depending on the size of the data and the number of connections, I’d say this is a much bigger problem than a script, a database, or a module. Look into dbt (data build tool) and Airflow.

        • odium
          2 upvotes · 6 months ago

          I would say consider having a script that combines all these sources into a single data mart for your monthly reports. It could also be useful for the ad hoc studies, but idk how many of the same fields you’re using for those studies.

    • odium
      1 upvote · 6 months ago

      What are you trying to output in the end (dashboard? Report? Table?), how often are these inputs coming in, and how often do you run your process?

      • driving_crooner@lemmy.eco.brOP
        2 upvotes · 6 months ago

        There are some reports that need to be run monthly. They have to be edited each month to add the directories with the new databases, and that causes problems, some of which I’m trying to solve with this. There are also a lot of ad hoc statistical studies I need to do that use the same databases.
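
        The monthly edit could be avoided if the report discovers the new directories by itself. A rough sketch, assuming (this layout is not stated in the thread) that each month's files sit in a folder named like `2024-03` under one root directory; `monthly_files` and `latest_month` are hypothetical helper names.

```python
import re
from pathlib import Path

# Assumed folder naming convention: YYYY-MM, e.g. "2024-03".
MONTH_DIR = re.compile(r"^\d{4}-\d{2}$")


def monthly_files(root, pattern="*.csv"):
    """Map each month folder under `root` to its sorted list of matching files."""
    found = {}
    for folder in sorted(Path(root).iterdir()):
        if folder.is_dir() and MONTH_DIR.match(folder.name):
            found[folder.name] = sorted(folder.glob(pattern))
    return found


def latest_month(root, pattern="*.csv"):
    """Return (month, files) for the most recent month folder."""
    files = monthly_files(root, pattern)
    if not files:
        raise FileNotFoundError(f"no month folders found under {root}")
    month = max(files)  # YYYY-MM names sort chronologically as plain strings
    return month, files[month]
```

        The report script would then call `latest_month(root)` instead of carrying a hard-coded list of directories that has to be edited every month.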

        • 4am@lemm.ee
          2 upvotes · 6 months ago

          It does sound to me like ingesting all these different formats into a normalized database (aka data warehousing) and then building your tools to report from that centralized warehouse is the way to go. Your warehouse could also track ingestion dates, original format converted from, etc. and then your tools only need to know that one source of truth.

          Is there any reason not to build this as a two-step process of 1) ingestion to a central database and 2) reporting from said database?
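
          The two-step process could look roughly like the sketch below, which uses SQLite from the Python standard library as a stand-in for the central database. Everything named here (`ingest_csv`, `count_by`, the `ingestion_log` table and its columns) is hypothetical, and a real warehouse would need typed columns, deduplication, and loaders for the other source formats.

```python
import csv
import sqlite3
from datetime import datetime, timezone
from pathlib import Path


def ingest_csv(conn, path, table):
    """Step 1: load one CSV into `table` and log where and when it came from.

    Table and column names are interpolated into SQL, so they must come from
    trusted input. All values are stored as text in this sketch.
    """
    path = Path(path)
    with path.open(newline="") as fh:
        reader = csv.reader(fh)
        header = next(reader)
        rows = list(reader)

    cols = ", ".join(f'"{c}"' for c in header)
    marks = ", ".join("?" for _ in header)
    with conn:  # one transaction: the data and its log entry commit together
        conn.execute(f'CREATE TABLE IF NOT EXISTS "{table}" ({cols})')
        conn.executemany(f'INSERT INTO "{table}" ({cols}) VALUES ({marks})', rows)
        conn.execute(
            "CREATE TABLE IF NOT EXISTS ingestion_log ("
            "table_name TEXT, source_file TEXT, source_format TEXT, "
            "row_count INTEGER, ingested_at TEXT)"
        )
        conn.execute(
            "INSERT INTO ingestion_log VALUES (?, ?, ?, ?, ?)",
            (table, str(path), "csv", len(rows),
             datetime.now(timezone.utc).isoformat()),
        )
    return len(rows)


def count_by(conn, table, column):
    """Step 2: a report that reads only from the central database."""
    sql = (f'SELECT "{column}", COUNT(*) FROM "{table}" '
           f'GROUP BY "{column}" ORDER BY "{column}"')
    return conn.execute(sql).fetchall()
```

          Reports and ad hoc studies then only need a connection to the one database, while `ingestion_log` records the ingestion date and original format of every load.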