Hi, I want to know the best way to manage the databases I use across different projects. I use a lot of CSVs that I need to prepare every time I work with them (I just copy-paste the code from other projects), but I would like to make a module I can import that holds all the processing for these databases. For example, for one database I usually set up columns = [(configuration of, my columns)], names = [names], dates = [list of columns dates], dtypes = {column: type},

then database_1 = pd.read_fwf(**kwargs), database_2 = pd.read_fwf(**kwargs), database_3 = pd.read_fwf(**kwargs)…

Then database = pd.concat([database_1…])

But I would like to have a module that I could import, with all my databases and their ETL configuration in it, so I could just do something like database = my_module.database to load a database without going through all that process every time.

Thanks for any help.

  • Eager Eagle@lemmy.world · 5 months ago

    But I would like to have a module that I could import and have all my databases and configuration of ETL[…]

    OK, then write a module. I’m not sure what’s being asked; the best way is whatever works well for you.

  • originalfrozenbanana@lemm.ee · 5 months ago

    Here “database” seems to mean a pandas dataframe. It sounds like you need to create a database using Postgres or SQLite or something similar, and recreate that database from a backup or database dump whenever you need it. You could also host that database in the cloud or on your own network if you need remote access.

    For instance, see the pandas docs for DataFrame.to_sql: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_sql.html
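
    For example, a minimal sketch of that round trip with SQLite (the file and table names are made up, and `database` stands for the combined dataframe from the ETL you already have):

    import sqlite3
    import pandas as pd

    conn = sqlite3.connect("my_data.db")

    # one-time load: write the prepared dataframe into a SQLite table
    database.to_sql("combined", conn, if_exists="replace", index=False)

    # later, from any project: read it back without redoing the ETL
    df = pd.read_sql("SELECT * FROM combined", conn)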

    • driving_crooner@lemmy.eco.br (OP) · 5 months ago

      Thanks, I was able to solve it by creating a file with:

      def get_database(name):
          if name == 'database':
              # all the processing to create the database
              ...
              return database

      Then df = get_database('database') runs all the processing and returns the dataframe.

      • originalfrozenbanana@lemm.ee · 5 months ago

        I am a little curious about the conditional; I suspect it is a bit of over-engineering.

        The problem you seem to be trying to solve is “I need to access the same data in multiple ways, places, or projects.” That’s what a database is really great for. However, if you just need to combine the same csv files you have on disk over and over, why not combine them and dump the output to a csv? Next time you need it, just load the combined csv. FWIW this is loosely what SQLite is doing.
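
        Something like this, roughly (the paths here are placeholders):

        import glob
        import pandas as pd

        # combine once...
        frames = [pd.read_csv(p) for p in glob.glob("data/*.csv")]
        pd.concat(frames, ignore_index=True).to_csv("combined.csv", index=False)

        # ...then every later run just loads the combined file
        df = pd.read_csv("combined.csv")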

        If you are defining a method or function that performs these ETL operations over and over, and the underlying data is not changing, I think updating your local files to be the desired content and format is actually what you want.

        If instead you’re trying to explore modules, imports, abstraction, writing DRY code, or other software development fundamentals: great! Play around; it’s a great way to learn. I might also recommend picking up some books. Usually your local library has books on Python development, object-oriented programming, and data engineering basics that you might find fascinating (as I have).

        • ericjmorey · 5 months ago

          My local library gives me access to O’Reilly Online, so free textbook access for just about any topic.

        • driving_crooner@lemmy.eco.br (OP) · 5 months ago

          Some of the data comes in CSVs, some as database files, and some from the SQL server, Excel, or web APIs. For some of them I even need to combine multiple sources with different formats.

          I guess I could have a database with everything, which would be tidier, easier to use, more secure, and less failure-prone. I’m still going to prepare the databases (I’m thinking of dataframe objects in a pickle, but I want to experiment with Parquet) so they don’t have to be processed every time, but I wanted something where I could just write the name of a database and get the updated version.
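
          Roughly what I have in mind is something like this (just a sketch; build_database stands for the ETL I already have for each source):

          from pathlib import Path
          import pandas as pd

          def get_database(name: str, refresh: bool = False) -> pd.DataFrame:
              cache = Path("prepared") / f"{name}.parquet"
              if cache.exists() and not refresh:
                  return pd.read_parquet(cache)  # reuse the prepared version
              df = build_database(name)          # the existing ETL for this source
              cache.parent.mkdir(exist_ok=True)
              df.to_parquet(cache)
              return df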

          • originalfrozenbanana@lemm.ee · 5 months ago

            This sounds kind of like a data warehouse. Depending on the size of the data and the number of connections, I’d say a script, a database, or a module could each work, but this is a much bigger problem. Look into dbt (data build tool) and Airflow.
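
            To give a flavor of the Airflow side, a toy monthly DAG might look like this (a sketch assuming a recent Airflow 2.x; build_report is a placeholder for your own function):

            from datetime import datetime

            from airflow import DAG
            from airflow.operators.python import PythonOperator

            def build_report():
                ...  # your ETL / reporting code goes here

            # Airflow triggers this DAG on a monthly schedule
            with DAG(
                dag_id="monthly_reports",
                start_date=datetime(2024, 1, 1),
                schedule="@monthly",
                catchup=False,
            ):
                PythonOperator(task_id="build_report", python_callable=build_report)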

              • odium · 5 months ago

                I would say consider having a script that combines all these sources into a single data mart for your monthly reports. It could also be useful for the ad hoc studies, but I don’t know how many of the same fields you’re using across those.

          • odium · 5 months ago

            What are you trying to output in the end (a dashboard? a report? a table?), how often are these inputs coming in, and how often do you run your process?

            • driving_crooner@lemmy.eco.br (OP) · 5 months ago

              There are some reports that need to run monthly; they have to be edited each month to add the directories with the new databases, and that causes problems, some of which I’m trying to solve with this. There are also a lot of ad hoc statistical studies I need to do that use the same databases.

              • 4am@lemm.ee · 5 months ago

                It does sound to me like ingesting all these different formats into a normalized database (aka data warehousing) and then building your tools to report from that centralized warehouse is the way to go. Your warehouse could also track ingestion dates, original format converted from, etc. and then your tools only need to know that one source of truth.

                Is there any reason not to build this as a two-step process of 1) ingestion to a central database and 2) reporting from said database?
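
                In code the split can be as small as two functions (a sketch using SQLite; the table and column names are made up):

                import sqlite3
                import pandas as pd

                def ingest(df: pd.DataFrame, source: str, conn: sqlite3.Connection):
                    # step 1: normalize everything into one table, with provenance
                    df["source_format"] = source
                    df["ingested_at"] = pd.Timestamp.now().isoformat()
                    df.to_sql("warehouse", conn, if_exists="append", index=False)

                def report(conn: sqlite3.Connection) -> pd.DataFrame:
                    # step 2: every report reads from the single source of truth
                    return pd.read_sql("SELECT * FROM warehouse", conn)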

  • milkisklim@lemm.ee · 5 months ago

    I’m not the biggest expert, but wouldn’t this be the whole point of Polars’ lazy evaluation?
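
    For what it’s worth, the lazy version of the OP’s read-and-concat pattern might look like this (a sketch; the glob pattern and column name are placeholders):

    import polars as pl

    # nothing is read yet; Polars only builds a query plan here
    lazy = pl.scan_csv("data/part_*.csv")

    # the scan, filter, and concatenation all run at .collect()
    df = lazy.filter(pl.col("amount") > 0).collect()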

  • gedhrel@lemmy.world · 5 months ago

    There’s not much here to go on. Are you asking how to write a module that you can import?

    Are these the same set of DB files every time? Are the columns and other configurations the same? Are you writing new python code every month?

    Are you using some ETL process to spit out a bunch of files that you’d like to have imported and available easily? Are the formats the same but the filenames differ?

    I think it’s the first thing you’re after. There are a bunch of tutorials knocking around about this, eg, https://www.digitalocean.com/community/tutorials/how-to-write-modules-in-python-3

    You might also be asking: if I write a module, how do I make it available for all my new python projects to use? You could just copy your whatever-my-module-is-called.py file around to your new projects (this might be simplest) but if you’re also expecting to be updating it and would like all of your projects to use the updated code, there are alternatives. One is to add the directory containing it to your PYTHONPATH. Another is to install it (in edit mode) in your python environment.

    [I get the impression you’re a data person rather than a programmer - perhaps you have a colleague who’s more of the latter you can tap up for this? It doesn’t have to be difficult, but there’s typically a little bit of ceremony involved in setting up a shared module however you choose to do it.]

    • gedhrel@lemmy.world · 5 months ago

      If it is the first thing, just put the db setup code you’re using in one file and call it “database.py”:

      database.py

      # the code you commonly use, ending with
      database = ...
      

      From a second file in the same directory, write: main_program.py

      from database import database
      # The first "database" here is the module name.
      # The second "database" is a variable you set inside that module.
      # You can also write this as follows:
      # import database
      # ... and use `database.database` to refer to the same thing
      # but that involves "stuttering" throughout your code.
      
      # use `database` as you would before - it refers to the "database" object that was found in the "database.py" module
      

      then run it with python main_program.py

      The main thing to realise here is that there are two names involved. One’s the module, the other is the variable (or function name) you set inside that module that you want to get access to.
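
      One way to see the two names clearly is to rename the variable at import time:

      from database import database as db
      # `database` here is only the module name; `db` is the variable it defines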

    • driving_crooner@lemmy.eco.br (OP) · 5 months ago

      Are you asking how to write a module that you can import?

      Yes, kinda.

      Are these the same set of DB files every time? Are the columns and other configurations the same? Are you writing new python code every month?

      They get updated by the accounting team each month. Some of them are CSVs, others come from an Access database file, and others from the SQL server.

      Some of the code needs to run each month with the updated databases, but there are also a lot of ad hoc statistical studies my boss asks for that use the same databases.

      Are you using some ETL process to spit out a bunch of files that you’d like to have imported and available easily? Are the formats the same but the filenames differ?

      I guess yes. And no, the accountants keep the same filenames but change the directory lmao.

      I think it’s the first thing you’re after. There are a bunch of tutorials knocking around about this, eg,

      Thanks, I’m checking it out.

      how do I make it available for all my new python projects to use?

      import sys
      sys.path.append('my\\modules\\directory')
      import my_module

      I get the impression you’re a data person rather than a programmer - perhaps you have a colleague who’s more of the latter you can tap up for this?

      You’re right, I’m an actuary. I wanted to do computer science instead of actuarial science, but I thought it would be better to get an actuarial degree and then do a master’s in CS (still in planning, maybe 2026). I’m the only person at the company who uses Python, and people here think I’m a genius because I’ve automated some boring things in Excel.

      • gedhrel@lemmy.world · 5 months ago

        If things are changing a bit each month, then in your module rather than a plain variable assignment

        database = ...
        

        you might want a function that you can pass in parameters to represent the things that can change:

        def database(dir, ...):
            ...
            return ...
        

        Then you can call it like this:

        from database import database
        db = database("/some/path")
        

        … hope that makes some sense.