Scraping the Web to Decode Laptop Pricing: My Bachelor's Thesis Journey

For my bachelor's thesis, I dove deep into a question that anyone shopping for a laptop has probably wondered: what actually determines laptop prices? Instead of just accepting that "better specs cost more," I decided to scrape the web and build statistical models to understand exactly how much each component and feature affects pricing.

The Challenge: Making Sense of Laptop Pricing

Shopping for laptops can be overwhelming. You're bombarded with technical specifications – processor cores, RAM capacity, storage types, graphics cards – but it's often unclear how much each feature should actually cost. I wanted to build a data-driven approach to understand these relationships.

My research question was straightforward: Can we use web scraping and statistical modeling to predict laptop prices based on their technical specifications, and which features are the strongest price drivers?

Building the Data Pipeline

I chose Alternate.de, a major German electronics retailer, as my data source. Using Python, I built a web scraping system that could systematically collect laptop specifications and prices at scale.

The technical setup included:

  • Python requests for HTTP calls with proper delays (3 seconds between requests to be respectful)

  • BeautifulSoup with LXML for parsing HTML and extracting specifications

  • ThreadPool for efficient multithreaded data collection

  • Regular expressions for cleaning and standardizing the messy specification data

The scraping process worked in two phases: first, I collected product links from 15 pages of laptop listings, then I visited each individual product page to extract detailed technical specifications from the "techData" sections.
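
A minimal sketch of that two-phase approach might look like the following. The listing URL, CSS selectors, and request header are illustrative placeholders, not the exact values from my scraper:

    import time
    import requests
    from bs4 import BeautifulSoup

    BASE_URL = "https://www.alternate.de/Notebooks"  # illustrative listing URL
    HEADERS = {"User-Agent": "bachelor-thesis-research-scraper"}

    def collect_product_links(pages=15):
        """Phase 1: walk the listing pages and collect product URLs."""
        links = []
        for page in range(1, pages + 1):
            resp = requests.get(BASE_URL, params={"page": page}, headers=HEADERS)
            soup = BeautifulSoup(resp.text, "lxml")
            # Placeholder selector; the real one depends on the page markup.
            links += [a["href"] for a in soup.select("a.product-link")]
            time.sleep(3)  # be respectful: 3 seconds between requests
        return links

    def scrape_tech_data(url):
        """Phase 2: visit one product page and extract the techData specifications."""
        resp = requests.get(url, headers=HEADERS)
        soup = BeautifulSoup(resp.text, "lxml")
        specs = {}
        # Placeholder selector for the key/value rows of the techData section.
        for row in soup.select("div.techData tr"):
            cells = row.find_all("td")
            if len(cells) == 2:
                specs[cells[0].get_text(strip=True)] = cells[1].get_text(strip=True)
        time.sleep(3)
        return specs

In the actual pipeline, a ThreadPool mapped the phase-two function over the collected links so the detail pages could be fetched concurrently.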

Data Quality: The Unglamorous but Critical Part

I started with over 1,700 laptops and quickly learned that real-world data is messy. After extensive cleaning, I ended up with 1,043 high-quality observations. Along the way, I had to:

  • Remove laptops with missing critical specifications

  • Exclude complex cases like dual graphics card setups

  • Filter out eMMC storage devices (different market segment)

  • Standardize manufacturer naming conventions

  • Create derived variables like pixels per inch (PPI) for display quality

The data cleaning process taught me that 80% of data science is often about getting your data into a usable state – a lesson that extends far beyond academic research.
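
Assuming the scraped specifications land in a pandas DataFrame, the filtering and standardization steps can be expressed roughly like this; the column names and the exact replacement rules are illustrative, not taken from the thesis code:

    import pandas as pd

    # Raw scraped specifications; column names are assumptions for illustration.
    df = pd.read_csv("laptops_raw.csv")

    # Remove laptops with missing critical specifications.
    df = df.dropna(subset=["price", "cpu_cores", "cpu_ghz", "ram_gb", "storage_gb"])

    # Exclude dual graphics card setups and eMMC storage devices.
    df = df[~df["gpu"].str.contains(r"\+", na=False)]
    df = df[df["storage_type"].str.lower() != "emmc"]

    # Standardize manufacturer naming conventions.
    df["manufacturer"] = (
        df["manufacturer"].str.strip().str.upper().replace({"HEWLETT-PACKARD": "HP"})
    )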

The Models: From Simple to Sophisticated

I built three increasingly sophisticated statistical models using Python's statsmodels and scikit-learn:

Model 1: A baseline linear regression with core predictors (processor, memory, storage, graphics)

Model 2: An enhanced model incorporating display specifications and additional features

Model 3: A logarithmic transformation model to better handle the wide price range

The results were impressive – my models explained between 86% and 93% of the price variation in laptops. Using 5-fold cross-validation, I consistently achieved R-squared values above 0.85, meaning the models were genuinely predictive, not just fitting to noise.
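
To give a feel for that validation step, here is a small sketch using scikit-learn. The feature columns are illustrative stand-ins and df is the cleaned DataFrame from the sketch above; the thesis models used a richer set of predictors:

    import numpy as np
    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import cross_val_score

    features = ["cpu_cores", "cpu_ghz", "ram_gb", "storage_gb", "pixel_shaders", "ppi"]
    X = df[features]
    y = np.log(df["price"])  # Model 3 works on log-transformed prices

    model = LinearRegression()
    scores = cross_val_score(model, X, y, cv=5, scoring="r2")
    print(f"5-fold CV R^2: {scores.mean():.3f} +/- {scores.std():.3f}")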

Key Discoveries: What Really Drives Laptop Prices

The analysis revealed several fascinating insights:

Graphics Performance Matters Most: The number of pixel shaders in graphics cards emerged as one of the strongest price predictors. This makes sense – gaming and professional graphics work command premium prices.

Processor Specifications Are Complex: Both the number of cores and clock frequency matter, but their interaction effects are important. More cores don't always mean higher prices if the base frequency is lower.

Storage Type Premium: The difference between SSD and HDD storage showed a clear price premium that goes beyond just capacity differences.

Display Quality Pays: Higher resolution displays and better pixel density (PPI) significantly affect pricing, especially in premium segments.

Memory Sweet Spots: RAM capacity showed diminishing returns – the jump from 8GB to 16GB had a bigger price impact than from 16GB to 32GB.

Technical Challenges and Solutions

Building this system taught me several important lessons:

Handling Missing Data: I had to classify missing values as either Missing Completely at Random (MCAR) or Missing Not at Random (MNAR) and handle each appropriately.
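
As a hypothetical illustration of the difference, a value that is missing completely at random can often simply be dropped, while a value that is missing for a reason usually needs to be encoded explicitly:

    # MCAR example: a few randomly missing weight entries -> drop those rows.
    df = df.dropna(subset=["weight_kg"])

    # MNAR example: an empty dedicated-GPU field often means integrated graphics
    # only, so the absence itself carries information and is encoded explicitly.
    df["gpu_model"] = df["gpu_model"].fillna("integrated")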

Feature Engineering: Raw specifications needed significant processing. Creating variables like PPI required combining screen size and resolution data in meaningful ways.
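
For example, PPI is the diagonal resolution in pixels divided by the diagonal screen size in inches; with assumed column names, the derivation is a one-liner:

    import numpy as np

    # PPI = sqrt(width_px^2 + height_px^2) / diagonal_inches (column names assumed).
    df["ppi"] = np.sqrt(df["res_width"] ** 2 + df["res_height"] ** 2) / df["screen_inches"]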

Model Validation: I used formal statistical tests like the Breusch-Pagan test for heteroscedasticity and Variance Inflation Factor (VIF) analysis for multicollinearity.
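
Both checks are available in statsmodels. A sketch of how they might be run, reusing the illustrative predictors from the cross-validation example above:

    import numpy as np
    import statsmodels.api as sm
    from statsmodels.stats.diagnostic import het_breuschpagan
    from statsmodels.stats.outliers_influence import variance_inflation_factor

    features = ["cpu_cores", "cpu_ghz", "ram_gb", "storage_gb", "pixel_shaders", "ppi"]
    X = sm.add_constant(df[features])
    y = np.log(df["price"])
    ols_fit = sm.OLS(y, X).fit()

    # Breusch-Pagan: tests whether the residual variance depends on the predictors.
    lm_stat, lm_pvalue, f_stat, f_pvalue = het_breuschpagan(ols_fit.resid, X)
    print(f"Breusch-Pagan p-value: {lm_pvalue:.4f}")

    # VIF per predictor (the intercept's value can be ignored); values far above
    # ~10 point to problematic multicollinearity.
    for i, col in enumerate(X.columns):
        print(col, round(variance_inflation_factor(X.values, i), 1))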

Robustness Testing: Cross-validation results remained consistent across different data splits, confirming the models weren't overfitting.

Beyond the Numbers: Real-World Applications

This research has practical applications beyond academic interest:

Price Comparison Tools: The models could power automated systems to identify overpriced or undervalued laptops in the market.

Market Analysis: Manufacturers and retailers could use similar approaches to understand competitive positioning.

Consumer Education: The coefficients from the models show consumers exactly how much they should expect to pay for specific features.

Automated Valuation: Used laptop marketplaces could implement similar systems for pricing recommendations.

Lessons Learned and Future Directions

This project taught me that successful data science requires equal parts technical skill and domain knowledge. Understanding laptop specifications was as important as knowing statistical modeling techniques.

If I were to extend this research, I'd love to:

  • Incorporate temporal analysis to understand how prices change over product lifecycles

  • Add brand premium analysis with more sophisticated modeling

  • Expand to multiple retailers for broader market coverage

  • Include user reviews and ratings as predictive features

The Bigger Picture

What started as curiosity about laptop pricing became a comprehensive exploration of web scraping, data cleaning, statistical modeling, and machine learning validation. The project demonstrated that with the right tools and methodology, we can extract meaningful insights from the vast amount of product data available online.

For anyone considering similar research: the combination of web scraping and statistical modeling is powerful, but success depends heavily on careful data quality management and robust validation procedures. The technical implementation is often straightforward – the real challenge is ensuring your results are reliable and meaningful.

The laptop market may seem chaotic, but underneath the marketing and positioning, there are clear, quantifiable relationships between specifications and prices. Sometimes the best way to understand a complex system is to scrape it, model it, and let the data tell the story.
