8 May 2026 · Francesco Mucio
Building a Data Job Board for Berlin: An Interview at the Mirror
A self-interview about newsletters, scrapers, and building something for a community you love.
I wanted to write about how I built databerlin.net but struggled to organise my thoughts without going too technical. So I had Claude interview me as a technical journalist — and let the conversation do the work. This is what came out.
Let’s start at the beginning. What actually drove you to build the website?
We started with the meetup community, and at a certain point we said, why don’t we have a newsletter? The newsletter was mainly about curated job listings — no agencies, no recruiters, just direct companies — plus interesting articles I found during the week and upcoming events in the Berlin data scene. As a dad, I can’t attend all the meetups myself, but I really enjoy the different data communities here. So I thought: at least I can help people discover them.
After a while, we asked our readers what they wanted more of. Most said: more jobs. The problem was that Substack limits how long a post can be, so we needed a different solution. That’s when the idea for the website came: let’s ingest job postings directly from companies and make it easy to find them without relying on LinkedIn.
When you decided to build the ingestion pipeline with dlt, why that tool specifically?
I wanted to use a framework for data ingestion to remove a moving part. By standardizing the loading step, I could focus on the one variable that actually changes: how to extract data from each company’s website or ATS. I went with dlt because it’s a great tool for exactly that, and because I know it well: I’ve contributed to the codebase. When Claude Code generates dlt code, I can debug it. That gave me confidence.
How does the extraction actually work? Are companies submitting jobs directly, or are you scraping?
Nobody is submitting anything. These companies have no idea we exist, and I can’t walk up to them and say “hey, publish your jobs on our website too.” So we go and get the data ourselves.
Initially, I hoped for clean REST APIs. Sometimes we get that — a nicely formatted JSON response, very easy. But often we have to scrape. And sometimes we find the XML sitemap of a website and leverage that. What surprised me most wasn’t that any one ATS was particularly complex — it’s how fragmented the market is. We currently have 26 pipelines in total: around 19 standard ATS integrations where the vendor gives you a decent API or predictable structure, 5 semi-custom ones where you have to do a bit more work, one full HTML scraper, and one pipeline that’s just hand-curated YAML entries for companies that don’t fit anywhere else.
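To make the sitemap route concrete, here is a minimal sketch using only the Python standard library. The sample XML, the "/jobs/" path marker, and the function name are assumptions for illustration, not the actual databerlin.net code; real career sites use different URL patterns.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def job_urls_from_sitemap(xml_text: str, path_marker: str = "/jobs/") -> list[str]:
    """Return the <loc> URLs in a sitemap whose path contains path_marker."""
    root = ET.fromstring(xml_text)
    urls = []
    for loc in root.findall("sm:url/sm:loc", SITEMAP_NS):
        url = (loc.text or "").strip()
        if path_marker in url:
            urls.append(url)
    return urls


# Hypothetical sitemap content, standing in for a fetched sitemap.xml.
SAMPLE_SITEMAP = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/about</loc></url>
  <url><loc>https://example.com/jobs/data-engineer</loc></url>
  <url><loc>https://example.com/jobs/analytics-engineer</loc></url>
</urlset>"""

job_urls = job_urls_from_sitemap(SAMPLE_SITEMAP)
```

Each URL returned this way would still need to be fetched and parsed to get the job details; the sitemap only tells you which pages exist.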
The three biggest ATS platforms cover about sixty percent of the jobs we ingest. But there’s a long tail, and each new one requires figuring out a new pattern.
Can you walk us through a particularly tricky example?
Workday. It was painful. There are TLS and SSL certificate issues, and on top of that, different companies run different versions — some on the old platform, some on a newer one. There were moments where even with the best models, Claude was just going in circles telling me it wasn’t possible. I had to dig into the HTML source myself, find the elements containing the actual job data, and then bring that back to Claude Code.
When I hit those walls, I use multiple LLMs. I might use Gemini to understand what ATS a company is using, experiment a bit, and then bring the working approach into Claude Code. Usually it takes only two or three iterations once I understand the problem.
And I only invest that effort when it’s worth it — when multiple companies use the same ATS, or when a company is particularly relevant to Berlin’s data ecosystem.
Where does all this data land?
MotherDuck. I started with a simple setup, and I’m staying with MotherDuck because I also want to keep the history of jobs over time — that could be interesting data to analyze later. The schema is fixed and intentional: title, description, location, company, link, and a few other fields. When I’m building a new pipeline, Claude knows exactly what shape the output needs to be.
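A fixed target schema like this could be sketched as a small dataclass plus one mapping function per ATS. Only the five field names come from the description above; the raw payload keys ("name", "content", "office", "url") and the function name are hypothetical, since each ATS names things differently.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class JobPosting:
    # The five fields named in the interview; the "few other fields" are omitted.
    title: str
    description: str
    location: str
    company: str
    link: str


def normalize(raw: dict, company: str) -> dict:
    """Map one raw ATS record (hypothetical key names) onto the fixed schema."""
    posting = JobPosting(
        title=raw["name"].strip(),
        description=raw.get("content", "").strip(),
        location=raw.get("office", "").strip(),
        company=company,
        link=raw["url"],
    )
    return asdict(posting)


row = normalize(
    {
        "name": " Data Engineer ",
        "content": "Build pipelines.",
        "office": "Berlin",
        "url": "https://example.com/jobs/1",
    },
    company="Example GmbH",
)
```

Because every extractor ends in the same dictionary shape, the loading step never has to know which ATS a record came from.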
dlt handles the loading. The extraction — scraping, parsing, requesting — that’s the custom logic per ATS, using things like Beautiful Soup or raw HTTP requests.
With 200+ companies, how are you only showing the relevant jobs?
We pull from all of them, but we apply two filters. The first is location: the job has to be based in Berlin, open to all of Germany, or remote within Europe. The second is role: we use keyword matching to identify data-related positions — data engineer, analytics engineer, ML engineer, and so on.
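The two filters might look roughly like this. It is a sketch under assumptions: the keyword list is limited to the three roles named above, and the accepted location strings are made up, because every ATS formats locations differently.

```python
import re

# Role keywords named in the interview; the real list is presumably longer.
ROLE_KEYWORDS = ["data engineer", "analytics engineer", "ml engineer"]
ROLE_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(k) for k in ROLE_KEYWORDS) + r")\b",
    re.IGNORECASE,
)

# Hypothetical location strings meaning "all of Germany" or "remote in Europe".
COUNTRY_WIDE_OR_REMOTE = {"germany", "remote (germany)", "remote (europe)"}


def location_ok(location: str) -> bool:
    """Accept Berlin, Germany-wide, or remote-in-Europe locations."""
    loc = location.strip().lower()
    return "berlin" in loc or loc in COUNTRY_WIDE_OR_REMOTE


def role_ok(title: str) -> bool:
    """Accept titles that contain one of the data-related keywords."""
    return ROLE_PATTERN.search(title) is not None


def is_relevant(job: dict) -> bool:
    return location_ok(job["location"]) and role_ok(job["title"])


jobs = [
    {"title": "Senior Data Engineer", "location": "Berlin, Germany"},
    {"title": "Data Engineer", "location": "Munich, Germany"},
    {"title": "Account Executive", "location": "Berlin"},
    {"title": "ML Engineer", "location": "Remote (Europe)"},
]
relevant_titles = [job["title"] for job in jobs if is_relevant(job)]
```

In this sample, only the Berlin data engineer and the remote ML engineer pass both filters.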
String matching has its limits, of course. You search for “AI” and you might get AI salespeople. You miss an “engineering manager” whose team is actually a data team, because that’s only visible in the description, not the title. We accept those blind spots for now. Parsing every job description with an LLM to classify it properly would be too expensive at this stage.
What we’re experimenting with now is a two-dimensional filtering system — one axis for domain (data, AI, ML, analytics, specific tooling), one axis for role type (engineer, analyst, scientist, researcher, leader). That should give users much more useful search and browsing.
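One way to picture the two axes is to tag each title independently for domain and for role type. The axis labels below follow the lists above, but the terms mapped to each label and the function names are assumptions; this is an illustration, not the system being built.

```python
import re

# Axis labels follow the interview; the terms per label are illustrative guesses.
DOMAIN_TERMS = {
    "data": ["data"],
    "ai": ["ai", "artificial intelligence"],
    "ml": ["ml", "machine learning"],
    "analytics": ["analytics"],
}
ROLE_TERMS = {
    "engineer": ["engineer"],
    "analyst": ["analyst"],
    "scientist": ["scientist"],
    "researcher": ["researcher"],
    "leader": ["lead", "head", "manager", "director"],
}


def _labels(title: str, terms: dict[str, list[str]]) -> list[str]:
    """Return the sorted labels whose terms appear as whole words in the title."""
    found = set()
    for label, words in terms.items():
        if any(re.search(r"\b" + re.escape(w) + r"\b", title, re.IGNORECASE) for w in words):
            found.add(label)
    return sorted(found)


def tag_title(title: str) -> dict[str, list[str]]:
    """Tag a job title on both axes: domain and role type."""
    return {"domains": _labels(title, DOMAIN_TERMS), "roles": _labels(title, ROLE_TERMS)}


tags_mle = tag_title("Machine Learning Engineer")
tags_head = tag_title("Head of Data Engineering")
tags_sales = tag_title("AI Account Executive")
```

Requiring a hit on both axes is one possible way to drop titles like "AI Account Executive", which match a domain but no role type in this sketch.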
Tell us about the tech stack behind the website itself.
The pipeline runs on a GitHub Actions workflow. It pulls jobs from all the companies, loads them into MotherDuck, runs transformation scripts to shape the data for the frontend, exports a JSON file, and commits it to the repository. That commit triggers Astro to regenerate the static site, which then deploys to Vercel.
Everything runs on free tiers. GitHub Actions, MotherDuck, Vercel. The whole thing costs nothing to operate.
For SEO, we generate a dedicated page for every company and every job listing. Each company page includes a description, their website, their tech blog if they have one, and their current open roles. That company metadata is bootstrapped using an LLM when we first onboard a company — it searches the web and populates a config file. Then Astro turns it all into fast, crawlable static pages.
How do you find out when something breaks?
Honestly, I don’t have active monitoring. I don’t expect companies to change their ATS very often. But before every deployment, I run all the pipelines and check for errors. If something has broken, I see it there, and I instruct Claude Code to fix it.
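A pre-deployment check of this kind could be as simple as running every pipeline and collecting whatever fails. In this sketch the pipelines are stand-in functions that return a job count, and treating zero jobs as a failure is an added assumption, not something described above.

```python
from typing import Callable


def run_all(pipelines: dict[str, Callable[[], int]]) -> dict[str, str]:
    """Run each pipeline and return a mapping of pipeline name to failure reason."""
    failures = {}
    for name, run in pipelines.items():
        try:
            job_count = run()
        except Exception as exc:  # keep going so one broken ATS doesn't hide others
            failures[name] = f"{type(exc).__name__}: {exc}"
            continue
        if job_count == 0:
            # Assumption: an empty result often means the page structure changed.
            failures[name] = "returned no jobs"
    return failures


# Stand-in pipelines for illustration only.
def healthy_pipeline() -> int:
    return 12


def empty_pipeline() -> int:
    return 0


def broken_pipeline() -> int:
    raise ValueError("unexpected HTML structure")


failures = run_all(
    {"healthy": healthy_pipeline, "empty": empty_pipeline, "broken": broken_pipeline}
)
```

The resulting dictionary is the list of things to look at, or hand to a coding assistant, before deploying.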
What features are you thinking about next?
We try to collect feedback through different channels — a feedback button on the site, LinkedIn, Slack, and of course in person at the meetup. That’s how the seniority filter and the search box came to life. Someone asked, we shipped it.
I have a few ideas in mind. Job alerts are the obvious one, but honestly, people can already do that on LinkedIn, so I’m not sure it’s worth the effort. Comments on companies and jobs could be interesting, but that opens a can of worms I’m really not ready to deal with.
One thing I find genuinely interesting: we’ve all seen the AI guru posts about how each ATS parses CVs differently, down to fonts, columns, and keywords. Since we actually know which ATS is behind each job listing, we could use that. The idea would be to give candidates a prompt they can take to their LLM of choice: feed in your CV and the job posting, and get a version formatted with that specific ATS in mind. We wouldn’t process CVs on our side, which would cost compute and money, and I don’t want to charge people who are looking for jobs. But a well-crafted prompt? That costs us nothing and could be genuinely useful.
And the business model?
We don’t charge job seekers anything, and we won’t. People looking for jobs shouldn’t have to pay. Long term, the model is sponsorships from companies — showing them how many data professionals clicked on their listings, how much visibility they’re getting in the Berlin community. We’re building the traffic numbers first. Right now, what we offer is visibility for companies that are already hosting our meetups. That’s already something.
Data Berlin’s job board is live at databerlin.net. The meetup community is free to join and open to anyone working in or interested in data in Berlin.