Have you ever wondered what's truly on a website? Not just the homepage, but every blog post, every product page, every hidden corner. I recently needed to perform a full content audit on a large site to find all embedded YouTube videos and create a text archive for an AI training project. Doing this manually would have taken weeks.
So, I did what any data engineer would do: I built a tool for the job.
I created a multi-threaded Python script that acts like a team of digital archaeologists, rapidly and recursively exploring an entire website to unearth its contents.
The Tool: A Multi-Threaded Web Spider
At its core, the script is a web spider built with two of Python's most powerful libraries for web tasks: Requests for fetching web pages and BeautifulSoup for parsing the HTML. But to make it fast, I used Python's `ThreadPoolExecutor` to allow it to process multiple pages at once.
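In miniature, that fetch-and-parse pairing looks like this (a sketch of the idea, not the actual script; `fetch_soup` is a name I made up for illustration):

```python
import requests
from bs4 import BeautifulSoup

def fetch_soup(url):
    """Download a page and hand back a parsed BeautifulSoup tree."""
    resp = requests.get(url, timeout=10)
    resp.raise_for_status()
    return BeautifulSoup(resp.text, "html.parser")

# Parsing works the same on any HTML string:
soup = BeautifulSoup("<h1>Hello</h1><a href='/about'>About</a>", "html.parser")
print(soup.h1.text)    # → Hello
print(soup.a["href"])  # → /about
```

Requests handles the network round-trip; BeautifulSoup turns the raw HTML into a tree you can query by tag, attribute, or text.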
Here’s how the "dig" works:
- Start at the Entrance: The script begins at a single starting URL.
- Map the Room: On that page, it finds all internal links leading to other pages on the same site.
- Excavate the Content: While there, it performs two key actions:
- It carefully extracts all the visible text content and saves it to a `.txt` file.
- It scans the entire HTML source code for any mention of a YouTube video, pulling out the unique video ID.
- Explore New Tunnels: It adds all the newly discovered, unvisited links to its "to-do" list and dispatches its team of concurrent workers to explore them.
This process repeats until every single reachable page on the website has been visited, mapped, and excavated.
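The steps above can be sketched as a breadth-first loop: each round, a pool of worker threads fetches the current frontier of pages in parallel, and the links they discover become the next frontier. This is a minimal version under my own naming, not the production script (which also writes out text and video IDs as it goes):

```python
from concurrent.futures import ThreadPoolExecutor
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup

def internal_links(html, page_url, domain):
    """Resolve every <a href> on a page and keep only same-domain links."""
    soup = BeautifulSoup(html, "html.parser")
    found = set()
    for a in soup.find_all("a", href=True):
        url = urljoin(page_url, a["href"]).split("#")[0]  # drop fragments
        if urlparse(url).netloc == domain:
            found.add(url)
    return found

def fetch(url):
    """Return a page's HTML, or None if the request fails."""
    try:
        resp = requests.get(url, timeout=10)
        return resp.text if resp.ok else None
    except requests.RequestException:
        return None

def crawl(start_url, max_workers=8):
    """Visit every reachable page on the site, fetching each batch in parallel."""
    domain = urlparse(start_url).netloc
    visited, frontier = set(), {start_url}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        while frontier:
            batch = list(frontier)
            visited |= frontier
            pages = pool.map(fetch, batch)  # concurrent "dig" workers
            frontier = set()
            for url, html in zip(batch, pages):
                if html:
                    frontier |= internal_links(html, url, domain) - visited
    return visited
```

Keeping a `visited` set is what guarantees the loop terminates: a page is only queued once, so even a heavily cross-linked site gets fetched exactly one time per URL.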
The Treasure: What We Found
After letting the script run, I was left with two invaluable sets of data in an output folder:
- A `page_text_content` directory containing a complete text archive of the entire site, ready to be used as a corpus for training a language model.
- A single file, `all_youtube_links.txt`, containing a clean list of every unique YouTube video ID embedded anywhere on the site—a complete video inventory.
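The two extraction steps behind those outputs are small on their own. Here is roughly how they can be done (my own sketch; the regex covers the common watch, embed, and youtu.be URL forms, which may not be exhaustive):

```python
import re
from bs4 import BeautifulSoup

# An 11-character video ID following a watch, embed, or youtu.be URL.
YT_ID = re.compile(
    r"(?:youtube\.com/(?:watch\?v=|embed/)|youtu\.be/)([A-Za-z0-9_-]{11})"
)

def youtube_ids(html):
    """Unique video IDs found anywhere in the raw HTML source."""
    return set(YT_ID.findall(html))

def visible_text(html):
    """Human-readable page text with scripts and styles stripped out."""
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup(["script", "style", "noscript"]):
        tag.decompose()  # remove non-visible content entirely
    return " ".join(soup.get_text(separator=" ").split())
```

Scanning the raw HTML (rather than the parsed text) is what catches videos embedded via iframes, since an iframe contributes no visible text at all.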
I also added further transformations and API integrations to extract even more data. For instance, I needed transcripts for the YouTube videos, so I used yt_dlp to pull the available captions from YouTube, saving the text and formatting it cleanly for AI ingestion. That gave me the site's HTML content, YouTube transcripts, PDFs, and MP3s in one place, and everything ultimately landed in NotebookLM as a master set of customized content.

The files were optimized and compressed in batches. I even wrote a script to strip silence from the audio files, speed them up to 2.5x, and merge them together to stay within NotebookLM's file limits. A huge amount of data ended up compressed into roughly 40 files, ready to use however I needed.
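For the transcript step, yt_dlp can be told to skip the media entirely and download only the captions. A minimal sketch of that configuration (the function names and output layout here are my own, and I default to English captions as an assumption):

```python
def transcript_opts(out_dir="transcripts"):
    """yt_dlp options that fetch captions only, with no media download."""
    return {
        "skip_download": True,        # captions only, no video/audio
        "writesubtitles": True,       # uploader-provided subtitles
        "writeautomaticsub": True,    # fall back to auto-generated captions
        "subtitleslangs": ["en"],     # assumption: English transcripts
        "outtmpl": f"{out_dir}/%(id)s.%(ext)s",
    }

def download_transcript(video_id, out_dir="transcripts"):
    """Fetch the available transcript for one video ID."""
    import yt_dlp  # third-party: pip install yt-dlp
    with yt_dlp.YoutubeDL(transcript_opts(out_dir)) as ydl:
        ydl.download([f"https://www.youtube.com/watch?v={video_id}"])
```

Feeding this the IDs collected in `all_youtube_links.txt` turns the video inventory into a text corpus alongside the scraped pages.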
This project is a perfect demonstration of how a relatively simple Python script can automate a massive data-gathering task. It's a foundational tool for anyone in content strategy, SEO auditing, or data science who needs to turn an entire website into a structured dataset. I've made the full code available on GitHub for you to use and adapt.