
Closed
Posted
Paid on delivery
I want to scrape all legal text from EUR LEX (url here: [login to view URL]). The count is about a million documents. For every document please:
• Download the official PDF in its original layout.
• Pull the full plain text in BOTH English (EN) and German (DE).
• Generate a companion JSON file per document containing:
  – document_id (as it appears in the URL),
  – year,
  – raw_text_en,
  – raw_text_de,
  – pdf_file_name set exactly to “<document_id>.pdf”.

Folder structure can be simple: one directory that holds each PDF and its matching JSON with identical IDs. Accuracy matters because the data feeds directly into a research pipeline, so please handle character encoding and long parliamentary tables correctly. A lightweight Python solution using requests/BeautifulSoup or Scrapy is perfect; headless Selenium is fine if needed for dynamic pages. All code, a brief README, and any [login to view URL] must be included so I can reproduce the run locally.

Once the script finishes, send the ZIP containing:
1. Source code.
2. The PDFs.
3. All JSON files.

No ongoing schedule is required, just this single extraction.
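As an illustration of the companion-file requirement, here is a minimal sketch of how one JSON record could be written next to its PDF. The function name `write_companion_json` and the sample ID are hypothetical; only the five field names and the “<document_id>.pdf” rule come from the brief, and the sketch assumes the document ID is safe to use as a file name.

```python
import json
from pathlib import Path


def write_companion_json(out_dir, document_id, year, text_en, text_de):
    """Write <document_id>.json into out_dir and return its path."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    record = {
        "document_id": document_id,
        "year": year,
        "raw_text_en": text_en,
        "raw_text_de": text_de,
        # The brief asks for exactly "<document_id>.pdf".
        "pdf_file_name": f"{document_id}.pdf",
    }
    json_path = out_dir / f"{document_id}.json"
    # ensure_ascii=False keeps umlauts and ß readable; the file itself is UTF-8.
    with open(json_path, "w", encoding="utf-8") as fh:
        json.dump(record, fh, ensure_ascii=False, indent=2)
    return json_path
```

Writing with an explicit `encoding="utf-8"` avoids depending on the platform default, which is one common source of corrupted German text.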
Project ID: 40147285
16 proposals
Remote project
Active 17 days ago
16 freelancers are bidding an average of ₹23,965 INR for this job

Hi, I am an IIT grad. I will make it a reality for you. I can complete this project using Python, Selenium, and BeautifulSoup for web scraping, as well as PyPDF2 and pandas for PDF processing and data manipulation. I will use a headless browser (e. Kindly click on the chat button so we can discuss and get started. I will share my prior projects and my resume too. I have been freelancing since 2019 and have worked at top MNCs in both the USA and India. Let's connect.
₹12,500 INR in 7 days
5.4

I can deliver this large-scale scraping project with 100% accuracy. Handling 1 million documents requires a robust architecture, and I have the expertise to build a Multi-threaded Python pipeline that handles complex parliamentary tables and UTF-8 encoding without data loss.
₹12,500 INR in 7 days
4.3
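The multi-threaded pipeline mentioned in this bid is not specified further. One possible shape, sketched with the standard library only, is shown here; `run_pipeline` and `process_one` are hypothetical names, and `process_one` stands in for whatever download-and-extract step a real scraper would perform.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def run_pipeline(document_ids, process_one, max_workers=8):
    """Apply process_one to every ID in parallel; return (done, failed)."""
    done, failed = [], {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(process_one, doc_id): doc_id
                   for doc_id in document_ids}
        for future in as_completed(futures):
            doc_id = futures[future]
            try:
                future.result()
                done.append(doc_id)
            except Exception as exc:
                # Record the failure and keep going so one bad
                # document does not stop the whole run.
                failed[doc_id] = repr(exc)
    return sorted(done), failed
```

For a run of around a million IDs it would probably be safer to call this on batches of IDs rather than submitting everything at once, since every pending future is held in memory.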

Dear [Client Name], I have reviewed your project requirements on scraping legal texts from EUR LEX. With extensive experience in web scraping and data extraction, I can deliver a scalable solution using Python, BeautifulSoup, and Selenium if required. Ensuring accurate extraction of PDFs, English, and German texts, I will also provide JSON files per document as specified. Your research pipeline will benefit from my attention to detail in handling character encoding and tables. Upon completion, I will promptly send the ZIP file with all deliverables. Let's discuss how I can streamline this process for you. Looking forward to your response. Best regards, Aurallian
₹20,650 INR in 30 days
3.0

Hi, I can build a stable, long-running automation script that is designed for reliability and long-term use. The focus will be on creating a clean, maintainable solution rather than a one-off script. While I may not handle large-scale manual data extraction on my local machine, I will deliver a fully automated script that you can run independently to sync all required data accurately. I can complete this quickly and at a very competitive price. Let’s discuss your requirements and choose the best approach for long-term stability.
₹14,000 INR in 1 day
2.5

Hello, I’ve carefully reviewed your project requirements and clearly understand the tasks involved. I have 13 years of experience and strong expertise in the exact skills this project requires. I have successfully delivered similar projects before and can share relevant samples if needed. I will complete this within your expected timeline while maintaining quality and clear communication. I look forward to working with you and contributing sincerely to your project’s success.
₹25,000 INR in 7 days
2.6

Do you want a fully reproducible scraper for all EUR-Lex legal texts? Hi, I can build a Python scraper to handle your EUR-Lex extraction requirements. The script will:
• Scrape all documents (~1 million) from EUR-Lex.
• Download each document as a PDF with the original layout.
• Extract the full plain text in English (EN) and German (DE).
₹25,000 INR in 7 days
0.0

Hello, This project requires scale, precision, and reproducibility, and I can deliver all three. I will implement a robust Python scraper (Scrapy + Requests; Selenium only if needed) to process ~1M EUR-Lex documents. For each document, the pipeline will:
• Download the official PDF in original layout
• Extract complete EN & DE plain text with proper Unicode handling
• Generate a matching JSON with document_id, year, raw_text_en, raw_text_de, and pdf_file_name

The system will include retry logic, rate-limit handling, encoding safety, and validation checks to ensure accuracy for downstream research use. All code will be clean, documented, and reproducible.

About me: I’m Yuvraj Chugh, a Security Engineer with experience building high-volume scraping and data pipelines, FastAPI microservices, and ML systems on government and enterprise datasets, where data integrity is critical.

Delivery: ZIP with source code, PDFs, JSONs, README, and requirements.txt. I can start with a pilot batch if required. Best regards, Yuvraj Chugh
₹30,000 INR in 5 days
0.0
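The retry logic and rate-limit handling named in this bid are not described in detail. A common approach is exponential backoff, sketched here under that assumption; `retry_with_backoff` and its parameters are hypothetical, and `fetch` stands in for any zero-argument callable that performs one request.

```python
import time


def retry_with_backoff(fetch, attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call fetch() until it succeeds, doubling the wait after each failure."""
    for attempt in range(attempts):
        try:
            return fetch()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            # Wait base_delay, then 2x, then 4x, ... before trying again.
            sleep(base_delay * 2 ** attempt)
```

The `sleep` parameter is injectable so the function can be exercised without real waiting. A production version would likely catch only network-related exceptions rather than every `Exception`.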

I’m confident I’m the ideal person for your project to scrape all legal text from EUR LEX, ensuring a clean and professional approach to downloading PDFs and pulling full texts in both English and German. Your need for a seamless, accurate extraction with automated handling of character encoding and complex tables is clear, and I’ll focus on creating a user-friendly Python script using requests and BeautifulSoup, as you prefer. While I am new to Freelancer, I have extensive experience and have completed other projects off-site. I would love to chat more about your project! Regards, Henning Munnik
₹18,750 INR in 30 days
0.0

Hi, I can deliver this in a structured, reproducible way without cutting corners. I will build a Python-based extraction pipeline using Requests/BeautifulSoup or Scrapy, with Selenium only where strictly required. Each document will be processed into its original PDF, extracted EN and DE plain text, and a companion JSON matching your exact schema and naming rules. Given the volume involved, I propose starting with a small acceptance sample to validate text accuracy, encoding, and folder structure before scaling. The scraper will include retry handling, logging of skipped documents, and a clean README with environment setup so the full run can be reproduced locally. My focus is accuracy, transparency, and controlled scaling rather than a fragile one-shot scrape.
₹30,000 INR in 7 days
0.0
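The acceptance-sample idea in this bid could be backed by a small check over the output folder. The following is a sketch only: `validate_output_dir` is a hypothetical helper, and it assumes the flat layout from the brief, with each JSON sitting beside its PDF under the same ID.

```python
import json
from pathlib import Path

REQUIRED_KEYS = {"document_id", "year", "raw_text_en",
                 "raw_text_de", "pdf_file_name"}


def validate_output_dir(out_dir):
    """Return a list of human-readable problems found in out_dir."""
    problems = []
    for json_path in sorted(Path(out_dir).glob("*.json")):
        record = json.loads(json_path.read_text(encoding="utf-8"))
        missing = REQUIRED_KEYS - record.keys()
        if missing:
            problems.append(f"{json_path.name}: missing keys {sorted(missing)}")
            continue
        doc_id = record["document_id"]
        # JSON name, document_id and pdf_file_name must all agree.
        if json_path.stem != doc_id or record["pdf_file_name"] != f"{doc_id}.pdf":
            problems.append(f"{json_path.name}: file names do not match document_id")
        if not (json_path.parent / record["pdf_file_name"]).is_file():
            problems.append(f"{json_path.name}: PDF not found")
        for key in ("raw_text_en", "raw_text_de"):
            if not str(record[key] or "").strip():
                problems.append(f"{json_path.name}: {key} is empty")
    return problems
```

An empty list means the sample passed these structural checks. It says nothing about whether the extracted text is faithful to the PDF, which would still need spot checks by a reader.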

Nashik, India
Member since Jan 14, 2026