• @nieceandtows
    19
    10 months ago

    Flip side of the coin: I had a sysadmin who wouldn’t increase the tmp size from 1 GB because ‘I don’t need more than that recommended size’. I deploy tons of ETL jobs, and they download gigabytes of files for processing to this globally known temp storage. I got it changed on one server after much back and forth, but on the other one I just overrode the temp location in my config files for every script.

    • stevecrox
      11
      10 months ago

      This is why Java rocks with ETL, the language is built to access files via input/output streams.

      It means you don’t need to download a local copy of a file; you can drop it into a data lake (S3, HDFS, etc…) and pass around a URI reference.
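
      To make that concrete, here is a minimal sketch of the idea, not a production job. The class and method names (`StreamingEtl`, `open`, `countLinesContaining`) are made up for illustration, and it only uses schemes the JDK handles itself (`file:`, `http:`, `https:`). For S3 or HDFS you would get the `InputStream` from that store's own client library instead, which is not shown here.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.net.URI;
import java.nio.charset.StandardCharsets;

/** Hypothetical helper: processes a text resource line by line without writing a local copy. */
final class StreamingEtl {

    /** Opens a stream for a URI whose scheme the JDK supports (file:, http:, https:). */
    static InputStream open(URI uri) {
        try {
            return uri.toURL().openStream();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /**
     * Counts the lines that contain the given text. Only the reader's buffer and
     * the current line are held in memory, so input size does not matter.
     */
    static long countLinesContaining(InputStream in, String needle) {
        try (BufferedReader reader =
                 new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8))) {
            long matches = 0;
            String line;
            while ((line = reader.readLine()) != null) {
                if (line.contains(needle)) {
                    matches++;
                }
            }
            return matches;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        if (args.length != 2) {
            System.err.println("usage: StreamingEtl <uri> <text>");
            return;
        }
        // The job only ever receives the URI reference, never a downloaded file.
        System.out.println(countLinesContaining(open(URI.create(args[0])), args[1]));
    }
}
```

      The processing method takes a plain `InputStream`, so it does not care where the bytes come from, and nothing has to land in /tmp first.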

      Considering the size of Large Language Models, I really am surprised at how poorly streaming is handled within Python.

      • @nieceandtows
        8
        10 months ago

        Yeah, Python does lack in such things. Half a decade ago, I set up an ML model for Tableau using Python, and things were fine until one day it just wouldn’t finish anymore. Turns out the model had gotten bigger, and Python filled up the RAM and the swap trying to load the whole model into memory.

        • stevecrox
          4
          10 months ago

          During the pandemic I had some unoccupied Python graduates I wanted to teach data engineering to.

          Initially I had them implement REST wrappers around Apache OpenNLP and spaCy and then compare the results on random data sets (Project Gutenberg, SharePoint, etc…).

          I ended up stealing a grad data scientist because we couldn’t find a difference (while there was a difference in confidence, the actual matches were identical).

          spaCy required 1 vCPU and 12 GiB of RAM to produce the same result as OpenNLP, which was running on 0.5 vCPU and 4.5 GiB of RAM.

          Two grads were assigned a Spring Boot/Camel/OpenNLP stack and two a spaCy/Flask application. It took both groups 4 weeks to get a working result.

          The team slowly acquired lockdown staff, so I introduced MinIO/RabbitMQ/NiFi/Hadoop/Express/React and then different file types (not just raw UTF-8, but doc, pdf, etc…) for NLP pipelines. They built a fairly complex NLP processing system with a data exploration UI.

          I figured I had a group to help me figure out the best Python approach in the space, but Python’s limitations just led to stuff like needing a Kubernetes volume to host data.

          Conversely none of the data scientists we acquired were willing to code in anything but Python.

          I tried arguing at my company at the time that there was a huge unsolved bit of the market there (e.g. MLOps).

          Alas, unless you can show profit on the first customer, no business will invest. Which is why I am trying to start a business.