Automated Web-Based Book Content Extraction Framework

Problem

The project involved processing multiple subject-based book web pages and converting them into structured HTML files.

Each book consisted of multiple pages that needed to be navigated and saved as HTML files after performing specific content modifications.
A significant amount of repetitive code existed across the HTML files, which required standardization.
Images had to be downloaded from multiple book pages, embedded YouTube links had to be extracted, and both had to be inserted into their respective HTML files.
All generated HTML files needed to be organized into module-wise folders for each subject.
Images had to be stored in a separate folder, distinct from the HTML files.

Performing these tasks manually was highly time-consuming and posed a high risk of human error.

Solution

To address these challenges, an automated solution was designed and implemented using Selenium with Java. The solution was divided into modular scripts, each responsible for a specific task:

Page Navigation and HTML Content Extraction
- Selenium WebDriver was used to iterate through multi-level book pages.
- Selenium's JavascriptExecutor interface was used to retrieve page content dynamically and modify the HTML as required.
HTML Transformation and Persistence
- Custom DOM manipulation scripts were executed via JavaScript to normalize repetitive HTML blocks.
- The modified HTML files were saved into module-wise folders using FileWriter and buffered I/O streams.
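
The persistence step above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the class name HtmlPageWriter, the save method, and the outputRoot/subject/module/page layout are all hypothetical. It also swaps in Files.newBufferedWriter with an explicit UTF-8 charset in place of a raw FileWriter, so the output encoding does not depend on the platform default.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical helper: writes one page's modified HTML into its module folder. */
final class HtmlPageWriter {

    private HtmlPageWriter() {}

    /**
     * Saves the HTML to outputRoot/subject/module/pageName.html
     * (an assumed layout) and returns the path that was written.
     */
    static Path save(Path outputRoot, String subject, String module,
                     String pageName, String html) throws IOException {
        // Create the subject and module folders if they do not exist yet.
        Path moduleDir = Files.createDirectories(outputRoot.resolve(subject).resolve(module));
        Path target = moduleDir.resolve(pageName + ".html");

        // Buffered, UTF-8 encoded write; try-with-resources closes the writer.
        try (BufferedWriter writer = Files.newBufferedWriter(target, StandardCharsets.UTF_8)) {
            writer.write(html);
        }
        return target;
    }
}
```

A caller would pass the HTML string obtained from the browser session, for example HtmlPageWriter.save(root, "physics", "module-01", "page-001", html), where the subject, module, and page names are placeholders.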
Media Asset Extraction
- Image resources were extracted by parsing img[src] attributes from the DOM.
- Images were downloaded using Java’s URLConnection and saved to module-wise folders.
- Embedded YouTube video URLs were extracted by navigating into iframe elements and saved module-wise.
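
As a rough sketch of the image download step, assuming the img[src] values have already been resolved to absolute URLs: the class ImageDownloader, its method names, the fallback file name, and the timeout values below are illustrative assumptions rather than details from the project.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.net.URLConnection;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

/** Hypothetical helper: downloads one image into a dedicated images folder. */
final class ImageDownloader {

    private ImageDownloader() {}

    /** Returns the last path segment of the URL, ignoring any query string. */
    static String fileNameOf(URL imageUrl) {
        String path = imageUrl.getPath();
        String name = path.substring(path.lastIndexOf('/') + 1);
        // Fall back to a generic name when the URL ends with a slash (assumed policy).
        return name.isEmpty() ? "image" : name;
    }

    /** Streams the resource behind imageUrl into imagesDir and returns the saved path. */
    static Path download(URL imageUrl, Path imagesDir) throws IOException {
        Files.createDirectories(imagesDir);
        Path target = imagesDir.resolve(fileNameOf(imageUrl));

        URLConnection connection = imageUrl.openConnection();
        connection.setConnectTimeout(10_000); // assumed value, in milliseconds
        connection.setReadTimeout(30_000);    // assumed value, in milliseconds

        // Copy the response body to disk, replacing any earlier download.
        try (InputStream in = connection.getInputStream()) {
            Files.copy(in, target, StandardCopyOption.REPLACE_EXISTING);
        }
        return target;
    }
}
```

Because the target name is derived only from the URL path, two images with the same file name in different remote folders would overwrite each other; a real run would likely need a per-module images folder or a naming scheme to avoid that.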
Post-processing and Structural Normalization
- The Jsoup library was used to modify common HTML attributes across all files and to insert the extracted images and video URLs into the HTML files of every subject, module by module.
File Handling and Organization
- Subject-wise folder creation
- Module-based HTML file relocation
- Media asset segregation into centralized and subject-specific directories
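
The folder creation and relocation steps above might look something like the following Java NIO sketch. The class OutputOrganizer, the idea of a staging directory holding freshly saved pages, and the root/subject/module layout are assumptions made for illustration only.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: moves saved HTML files into subject/module folders. */
final class OutputOrganizer {

    private OutputOrganizer() {}

    /**
     * Moves every *.html file found directly in stagingDir to
     * root/subject/module (an assumed layout) and returns how many were moved.
     */
    static int relocateHtml(Path stagingDir, Path root,
                            String subject, String module) throws IOException {
        // Creates the subject and module folders in one call if needed.
        Path moduleDir = Files.createDirectories(root.resolve(subject).resolve(module));

        // Collect matches first so files are not moved while the listing is open.
        List<Path> htmlFiles = new ArrayList<>();
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(stagingDir, "*.html")) {
            for (Path file : stream) {
                htmlFiles.add(file);
            }
        }

        for (Path file : htmlFiles) {
            Files.move(file, moduleDir.resolve(file.getFileName()),
                    StandardCopyOption.REPLACE_EXISTING);
        }
        return htmlFiles.size();
    }
}
```

Non-HTML files are left in the staging directory, which keeps images and other media out of the HTML folders until a separate step places them.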

Results

The automated solution delivered significant improvements:

Reduced processing time from nearly a year of manual effort to a few weeks
Eliminated repetitive manual coding tasks
Minimized the risk of human error
Ensured consistent folder structure and standardized HTML formatting across all subjects

Business Impact (Before and After)

Metrics	Before Automation	After Automation
Content Extraction Time	Manual processing of each book page, several hours to days per subject	Automated extraction; completed in a few days per subject
Media (Images & YouTube URLs) Handling	Manual download and embedding, high chance of broken links	Automatic extraction, download, and embedding with accurate paths and folder mapping
HTML Standardization	Highly inconsistent, repetitive code manually edited across files	Uniform, fully standardized HTML structure via automated DOM and Jsoup transformations
Manual Errors	Frequent errors in HTML formatting, missing media, and incorrect folder structure	Human error nearly eliminated; consistent and reliable output
Folder Structuring	Manual creation of subject/module folders, prone to mistakes	Automated module-wise and subject-wise folder generation using Java NIO
Team Productivity	High manual workload, time spent on repetitive tasks	Significant time savings, team focuses on high-value tasks
Scalability	Difficult to scale, effort increased linearly with more subjects/modules	Highly scalable, new subjects/modules processed with minimal changes
Reusability of Code/Framework	No reusable components, work repeated for each set of pages	Modular automation scripts reused across all subjects and future projects

Development Timeline

The end-to-end automation framework was rapidly designed and developed within just a few days to meet the evolving requirements of the development team. The work included requirement analysis, modular script development, iterative testing, and validation across multiple subjects.

What would have taken nearly a year of manual effort was transformed into a fully automated workflow capable of completing the same volume of work in a few weeks.

Through this automated process, the system successfully generated:

2,208 structured folders
Over 114,000 HTML files across numerous subjects

Conclusion

The automation framework delivered for this project goes far beyond a simple script: it is a strategic, reusable product foundation for large-scale content transformation. It showcases Sarvika’s expertise in building robust, future-ready automation projects that can process thousands of pages with accuracy, consistency, and speed.