Show HN: Large Scale Article Extract of Newspapers 1730s-1960s

Hello HN, over the past 7 months I've spent nearly 3,000 hours building SNEWPAPERS, the first historical newspaper archive with full-text extractions, nearly perfect OCR, a vast categorization taxonomy, and of course semantic and agentic search capabilities.

Problem:
I wanted to search through newspaper archives, but every service I tried only lets you search by keyword and date, and gives you back raw images of the papers: too many of them, and with no context. A sea of noise.

Solution:
I taught machines how to read the newspapers, and so far I've extracted the content from more than 600k pages (about 5TB) of the Chronicling America collection.

Problems I had to deal with included a seemingly endless variety of layouts, font sizes, scan qualities, resolutions, and aspect ratios, plus navigating around the images on the page. I also had to figure out how to get OCR to be nearly perfect so people wouldn't hate reading the extracts.

I stitched together a multi-model pipeline (layout tech, OCR tech, LLM, vllm) with heuristics to go from layout -> segmentation -> classification. I put it all in OpenSearch / Postgres and made it semantically searchable, then put an agentic search tool on top that knows how to use the API really well and helps you write queries to find what you're looking for.

Happy to discuss AWS architecture and scaling as well, that was tough!

If you have five minutes and you just want to jump in and have your own personalized experience, what I would suggest is:

Before searching for anything, go to the Sleuth page
Ask it about anything from 1736 to 1963, then maybe 1 or 2 follow-up questions
Then go to the search page so you can see the queries it wrote for you (bottom left "saved queries") and uncover more info on whatever it is you're interested in

If you think it's cool and you want to learn more, there are about 10 minutes of video guides on the various capabilities under "Guide" in the nav bar.

Some other people have also taken a crack at this, notably:

https://dell-research-harvard.github.io/resources/americanst... (very good attempt)
https://ift.tt/drRvDS5 (focused on images)

https://snewpapers.com/

May 2, 2026 at 01:42AM
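
For anyone wondering what a layout -> segmentation -> classification flow can look like in code, here is a minimal Python sketch. It is not the SNEWPAPERS pipeline: the `Block` and `Article` types, the fixed `column_width`, and the keyword `RULES` are all hypothetical stand-ins for what real layout, OCR, and LLM models would produce. It only shows the shape of the idea: order the detected blocks into reading order, start a new article at each headline, then assign a category.

```python
from dataclasses import dataclass, field


@dataclass
class Block:
    """One detected layout region plus its OCR text (hypothetical shape)."""
    x: int      # left edge in pixels
    y: int      # top edge in pixels
    kind: str   # "headline" or "body", as a layout model might label it
    text: str


@dataclass
class Article:
    headline: str
    body: list[str] = field(default_factory=list)
    category: str = "uncategorized"


def segment(blocks: list[Block], column_width: int = 500) -> list[Article]:
    """Read column by column, top to bottom; each headline opens a new article."""
    ordered = sorted(blocks, key=lambda b: (b.x // column_width, b.y))
    articles: list[Article] = []
    for block in ordered:
        if block.kind == "headline":
            articles.append(Article(headline=block.text))
        elif articles:
            # Body text attaches to the most recent headline in reading order.
            articles[-1].body.append(block.text)
    return articles


# Toy keyword rules standing in for a model-based classifier (assumption).
RULES = {"weather": ("storm", "rain"), "markets": ("cotton", "prices")}


def classify(article: Article) -> Article:
    """Assign the first category whose keywords appear in the article text."""
    text = (article.headline + " " + " ".join(article.body)).lower()
    for category, keywords in RULES.items():
        if any(keyword in text for keyword in keywords):
            article.category = category
            break
    return article


def extract(blocks: list[Block]) -> list[Article]:
    return [classify(article) for article in segment(blocks)]


# Blocks arrive in arbitrary detection order, as they might from a layout model.
blocks = [
    Block(520, 40, "headline", "COTTON PRICES RISE"),
    Block(20, 40, "headline", "STORM HITS COAST"),
    Block(20, 120, "body", "Heavy rain fell overnight."),
    Block(520, 120, "body", "Dealers report brisk trade."),
]
articles = extract(blocks)
for article in articles:
    print(article.category, "|", article.headline)
```

Real pages break the fixed-column assumption quickly (headlines spanning columns, ads, articles continued on another page), which is where the heuristics mentioned above come in.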