
Closed
Posted
Paid on delivery
I am seeking a freelancer to build a fully automatic data extraction and enrichment pipeline for Spanish procurement PDF documents.

Scope of Work:

Phase 1 – PDF Extraction
- Extract product names and key features from Spanish PDF files.
- Clean and structure the data.
- Output results in Excel (.xlsx) format.

Phase 2 – Web Search & Price Extraction
- Use each product name as a Google search query, restricted to a specific domain.
- Analyze the top 5 search results per product.
- From each URL, extract with high precision:
  - Price (must extract 5 accurate prices per product)
  - Brand
  - Product features and description
- Identify similar products, not only exact matches.
- Measure similarity using methods such as Levenshtein distance or cosine similarity.

Deliverables
- Final datasets in Excel (.xlsx) and JSON (.json) formats. (Do not forget: price extraction must be very precise and sufficient. We need 5 successful URL scrapes.)
- A detailed [login to view URL] explaining the full workflow, tools, and how to reproduce the process.

Technical Notes
- Use of LLMs for extraction, similarity analysis, and enrichment is highly recommended.
- The solution must be accurate, efficient, and fully reproducible.
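
The two similarity measures named in the scope can be illustrated with a minimal standard-library sketch. The function names are hypothetical, and the cosine variant here works on plain word counts, which is an assumption for illustration; the brief does not say whether word counts or embeddings are expected.

```python
import math
from collections import Counter


def levenshtein(a: str, b: str) -> int:
    """Edit distance: minimum insertions, deletions and substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution
            ))
        previous = current
    return previous[-1]


def levenshtein_ratio(a: str, b: str) -> float:
    """Normalise the distance to a 0..1 similarity (1.0 means identical)."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def cosine_similarity(a: str, b: str) -> float:
    """Cosine similarity over lowercase word counts (bag of words)."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0
```

Levenshtein tends to catch spelling variants of the same product name, while cosine similarity over tokens tolerates reordered words, so the two are often used together.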
Project ID: 40178133
37 bids
Remote project
Active 2 days ago
37 freelancers are bidding an average of $64 USD for this job

Yes, I have understood the project that you are looking for and you need a fully automatic Spanish procurement PDF extraction plus web enrichment pipeline that returns five precise prices per product from top results on a specific domain. I am a Python data extraction and automation specialist and I have built reproducible PDF to Excel and JSON pipelines with domain limited search, robust scraping, and similarity matching for procurement and catalog style data. I will extract product names and key features from Spanish PDFs using structured parsing, with OCR fallback only when pages are scanned. I will clean and normalize outputs into a strict schema, then export both Excel and JSON with consistent field names and data types. I will run domain restricted search per product, collect the top five URLs, and scrape price, brand, and descriptions with validation rules and retries. I will compute similarity using embeddings plus Levenshtein checks to include close alternatives, then log confidence scores and keep only high precision matches. One powerful improvement is adding automated price sanity checks across the five sources so outliers are flagged and replaced before final export. What is the target domain for the restricted search, and are the PDFs mostly selectable text or scanned images that require OCR?
$55 USD in 1 day
6.5
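
The price sanity check proposed in the bid above could look roughly like the following sketch. The function name and the 50% tolerance are assumptions made for illustration; the bid does not specify how outliers would be detected.

```python
from statistics import median


def flag_price_outliers(prices, tolerance=0.5):
    """Return indices of prices that deviate from the median by more than
    `tolerance` (a fraction of the median). Threshold is a hypothetical default."""
    if not prices:
        return []
    mid = median(prices)
    if mid == 0:
        # Avoid dividing by zero: any non-zero price is suspicious here.
        return [i for i, p in enumerate(prices) if p != 0]
    return [i for i, p in enumerate(prices) if abs(p - mid) / mid > tolerance]


# Five scraped prices for one product; the fourth looks like a pack price.
suspect = flag_price_outliers([19.99, 21.50, 20.00, 199.90, 18.75])
```

Flagged entries would then be re-scraped or replaced with the next search result, as the bid suggests, before the final export.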

Hello, I can build a fully automated, reproducible pipeline to extract structured product data from Spanish procurement PDFs, enrich it via domain-restricted Google searches, and accurately scrape five verified prices per product using similarity matching (Levenshtein/cosine) and LLM-assisted extraction, delivering clean Excel and JSON outputs along with a clear README for end-to-end replication. Regards, Muhammad
$100 USD in 1 day
5.4

Hello client, I’ve carefully reviewed your job description and have strong experience in Web Scraping, Excel, Natural Language Processing, Data Analysis, Data Processing, API Integration, Python, JSON, Data Extraction, and Google Search. I can build a reliable web scraping solution tailored specifically to your needs. Whether using Node.js with Puppeteer/Cheerio or Python with Selenium/BeautifulSoup, I will extract, clean, and organize your data efficiently. I also handle anti-bot protections, pagination, and full automation as required. As you can see from my profile, my web scraping reviews are excellent, reflecting my commitment to quality work. I focus on writing clean, maintainable, and scalable code because I know the difference between 99% and 100%. If you hire me, I’ll do my best until you’re completely satisfied with the result. Let’s discuss your target website and preferred data format. Thanks, Denis
$65 USD in 1 day
5.4

I can build a fully automated pipeline to extract data from Spanish procurement PDFs, enrich via domain-restricted search, scrape 5 precise prices per product, apply similarity (Levenshtein/cosine), and deliver reproducible Excel/JSON with README using LLMs.
$55 USD in 1 day
4.9

Hello! The big risk here is PDFs that hide text in weird layouts and prices that change per page, so I’d solve it with a two-stage extractor plus strict validation that only accepts real prices from real product pages. First, I will build Phase 1 to pull product names and key features from Spanish PDFs, handle messy tables, normalize units, and export a clean Excel plus JSON. Then, in Phase 2, I will run a domain-limited Google-style search per product, collect the top five results, and scrape each page with rules that target the actual price block, not ads or crossed-out numbers. I will extract price, brand, and features, then enrich with similar products using cosine similarity plus a string distance check so we catch near matches, not just exact titles. I will add accuracy checks so you always get five valid URLs per product, and if one fails it retries or replaces it automatically. You will get the full pipeline, datasets, and a README that lets you reproduce everything in one command. Warm regards, Yulius Mayoru
$20 USD in 2 days
4.9
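
One small piece of the validation described in the bid above is turning a scraped price string into a number. The sketch below assumes prices are written in the usual Spanish style (dot as thousands separator, comma as decimal separator, euro amounts); the listing does not confirm the currency or format, and the function name is hypothetical. Telling real price blocks apart from ads or crossed-out prices would need page-specific HTML rules that are not shown here.

```python
import re
from decimal import Decimal

# Assumed format: "1.234,56 €" or "249 EUR" (Spanish number conventions).
_PRICE_RE = re.compile(r"(\d{1,3}(?:\.\d{3})*|\d+)(?:,(\d{1,2}))?\s*(?:€|EUR)")


def parse_spanish_price(text):
    """Return the first euro amount found in `text` as a Decimal, or None."""
    match = _PRICE_RE.search(text)
    if match is None:
        return None
    whole = match.group(1).replace(".", "")  # drop thousands separators
    cents = match.group(2) or "0"            # missing decimals mean ",0"
    return Decimal(f"{whole}.{cents}")
```

`Decimal` is used instead of `float` so that amounts such as 1234.56 are stored exactly when written to Excel or JSON.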

Greetings, It looks like you're looking to create an automated pipeline for extracting and enriching data from Spanish procurement PDFs. I can help with that! My plan would start with the PDF extraction phase, where I’d carefully pull out product names and key features, ensuring the data is clean and well-structured for your Excel output. Then, for the web search and price extraction phase, I would efficiently use each product name to gather precise pricing information from top search results. I’ll employ methods like Levenshtein distance to ensure we identify similar products accurately. With experience in Python, data processing, and web scraping, I can deliver the datasets in both Excel and JSON formats, along with a detailed README to guide you through the workflow. I’m confident this approach will meet your needs. Best regards, Saba Ehsan
$60 USD in 30 days
4.9

Hey, I just went through your job description and noticed you need someone skilled in Natural Language Processing, JSON, Web Scraping, Data Processing, Data Extraction, Python, API Integration, Data Analysis, Excel and Google Search. That’s right up my alley. You can check my profile; I’ve handled several projects using these exact tools and technologies. Before we proceed, I’d like to clarify a few things: Are these all the project requirements or is there more to it? Do you already have any work done, or will this start from scratch? What’s your preferred deadline for completion? Why Work With Me? Over 180 successful projects completed. Long-term track record of happy clients and repeat work. I prioritize quality, deadlines, and clear communication. Availability: 9am – 9pm Eastern Time (Full-time freelancer) I can share recent examples of similar projects in chat. Let’s connect and discuss your vision in detail. Kind Regards, Zain Arshad
$10 USD in 3 days
3.8

Hello! I can build a fully automated pipeline to extract product data from Spanish PDFs, enrich it with web-sourced prices, brands, and features, identify similar products using similarity metrics, and deliver clean Excel/JSON datasets. Workflow will be reproducible, accurate, and LLM-assisted for high precision. Full README included. Best Regards!
$100 USD in 5 days
3.8

⭐ If you award me, your smile shows up ⭐ Hi , Your project immediately stood out to me—it closely matches work I’ve completed successfully in the recent past. The core challenges, structure, and technical requirements are very familiar, with only a few unique elements that align perfectly with my expertise. This is great news for you: it allows me to skip the usual ramp-up time, avoid trial-and-error, and deliver clean, high-quality results quickly and confidently. I bring hands-on experience with API Integration, Web Scraping, Data Analysis, Natural Language Processing, Python, Data Extraction, Google Search, Data Processing, Excel and JSON, along with proven workflows and best practices refined through multiple similar projects. You can view a directly relevant example in my portfolio here: https://www.freelancer.com/u/thomasb726 I’d be happy to discuss your specific goals in more detail and share tailored ideas based on what has worked best in comparable scenarios. Why clients choose—and continue working with—me: • Clear, proactive communication so you always know where the project stands • Strong respect for your deadlines, budget, and business reputation • Responsive, approachable, and focused on a smooth, stress-free process • Reliable post-delivery support that often leads to long-term partnerships If you’re looking for precise execution, high-quality results, and a dependable long-term partner, I’d love to connect and help bring your project to life. Best rega
$100 USD in 1 day
3.5

Hi, good afternoon. I would be a fabulous fit for this task. I read your details, and I'm ready to start now. I have expertise in JSON, Data Processing, API Integration, Google Search, Natural Language Processing, Data Analysis, Data Extraction, Excel, Python and Web Scraping. If needed, I'll provide you with revisions until you're fully happy. Please send me a message to discuss everything further. 100% satisfactory and quality work guaranteed. Thank you for your time. PORTFOLIO: https://www.freelancer.com/u/zeeshanmomin722?w=f Regards, Zeeshan M.
$65 USD in 1 day
3.2

Hello mojganmadah, I am Maryam Abbas, with 4 years of experience in Web Scraping and Python. I have carefully reviewed the project requirements for building an automatic data extraction pipeline for Spanish procurement PDF documents. To achieve this, I will implement a two-phase approach. In Phase 1, I will extract product names and key features from PDF files, cleaning and structuring the data into an Excel format. In Phase 2, I will conduct web searches for each product, extracting prices, brands, features, descriptions, and identifying similar products using advanced similarity analysis techniques. With my extensive experience and successful track record in similar projects, I am confident in delivering accurate and efficient results. Please review my portfolio at https://www.freelancer.pk/u/maryam951 and let's discuss the project further. Best regards, Maryam Abbas
$55 USD in 5 days
2.5

Hi there, I have 7+ years of experience in data extraction, NLP, and Python data pipelines. I have mastered PDF extraction, web scraping, and enrichment workflows to deliver clean Excel and JSON outputs. I’ve built end-to-end procurement automation using LLMs for extraction, similarity scoring, and domain-specific searches.
✅ Build a modular pipeline to extract product names and features from Spanish PDFs, clean the data, and output to Excel.
✅ Implement a web-search layer that queries each product within the specified domain, retrieves top 5 URLs, and extracts price, brand, features, and description.
✅ Apply Levenshtein/cosine similarity to identify similar products and map 5 best matches per product; store results in Excel and JSON.
✅ Consolidate cross-source data with validation routines to ensure accuracy and reproducibility.
✅ Provide a README with workflow, environment, and commands to reproduce the pipeline.
I look forward to working with you. Best Regards, Rosita Iniesta.
$65 USD in 1 day
2.6

Hey there. Do you have a fixed allowlist domain for the Google search, and is using the Google Custom Search API acceptable so results are stable and reproducible? For the PDFs, are they text-based or scanned images, and do you need table extraction too or only product blocks? I can build an end-to-end Python pipeline that extracts Spanish product data from PDFs, then enriches it with 5 high-precision prices per product from your chosen domain. Phase 1: parse PDF text and tables, fall back to OCR only when needed, normalize fields, and export clean Excel. Phase 2: query via Custom Search API, take top 5 URLs, scrape price, brand, and features with strict selectors plus validation rules, then compute similarity to include close matches using cosine similarity on embeddings and string distance checks. You will get XLSX and JSON outputs, a reproducible repo with pinned dependencies, and a README with run steps and how to add new PDFs or domains. Hope to discuss more on chat. Best, Kirill
$100 USD in 7 days
2.5
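
The domain-restricted query that the bid above proposes to send through the Custom Search API might be built as in this sketch. It only constructs the request URL and makes no network call. The endpoint and the `key`, `cx`, `q`, `num`, `siteSearch` and `siteSearchFilter` parameters reflect the Custom Search JSON API as commonly documented and should be checked against the current reference; the function name, the example domain, and the credentials are placeholders.

```python
from urllib.parse import urlencode

CSE_ENDPOINT = "https://www.googleapis.com/customsearch/v1"


def build_search_url(product_name, domain, api_key, engine_id, top_n=5):
    """Build a request URL for one product, restricted to a single domain."""
    if not 1 <= top_n <= 10:
        raise ValueError("top_n must be between 1 and 10 for a single request")
    params = {
        "key": api_key,           # placeholder credential
        "cx": engine_id,          # placeholder search engine ID
        "q": product_name,
        "siteSearch": domain,     # restrict results to this domain
        "siteSearchFilter": "i",  # "i" = include only this site
        "num": top_n,
    }
    return f"{CSE_ENDPOINT}?{urlencode(params)}"


url = build_search_url("silla ergonómica", "example.com", "KEY", "CX")
```

Using an API rather than scraping result pages keeps each product's top five URLs stable between runs, which supports the reproducibility requirement in the brief.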

I understand your project. I will build an automatic pipeline that reads Spanish procurement PDFs, extracts product names and features using LLMs, cleans the data, and saves it to Excel. I can search Google by domain for each product, scrape the top 5 results, extract 5 accurate prices, brands, and descriptions, and find similar products using text similarity methods. I will deliver Excel and JSON files with a clear README so the full process can be run again easily. I would really appreciate it if you connect with me in the chat. I will discuss the price and other details with you directly in the chat. Please come, I’ll be waiting for you there. Warm Regards, Sheikh Huzaifa A.
$55 USD in 7 days
0.0

⭐⭐⭐⭐⭐ Timeline: 3 days | Cost: $80 | Availability: ready to start immediately ❤️
Hello! I enjoy building end-to-end data pipelines where structured insights are extracted from unstructured PDFs and enriched with high-quality web data. Your project (Spanish procurement PDFs, precise price extraction, and product enrichment) is exactly the type of workflow I specialize in. The main challenge is achieving high-precision extraction from PDFs in Spanish while accurately enriching each product with multiple comparable prices and features from the web. My approach is:
-- ✔ PDF Extraction: Parse Spanish PDFs using OCR/LLMs to reliably extract product names and key features, clean and structure the data, and export to Excel (.xlsx) and JSON.
-- ✔ Web Search & Price Enrichment: For each product, query Google restricted to your specified domain, scrape the top 5 URLs, extract prices, brands, features, and descriptions with precision. Similarity is measured via Levenshtein distance or cosine similarity to capture near matches.
The final deliverable is a fully automated, reproducible system delivering clean, accurate datasets in both Excel and JSON, ready for further analysis. If you want, I can provide a quick proof-of-concept on a few PDFs first to show extraction accuracy and enrichment quality before running the full batch.
$80 USD in 3 days
0.0

Hi, I’ve reviewed your project to create an automated data extraction and enrichment pipeline for Spanish procurement PDFs. Your focus on precise product information extraction and rigorous price scraping across specific domains is clear, and I’m confident I can deliver a robust and maintainable solution. I have solid experience with Python-driven PDF data extraction and web scraping, including using NLP and similarity metrics like Levenshtein distance to identify related products. Leveraging LLMs for enhanced extraction and analysis fits well with my approach to ensure clean, accurate datasets. I will ensure the pipeline outputs structured Excel and JSON data sets alongside detailed documentation to make the workflow fully reproducible and understandable. Let’s align on specific domain targets and any preferred scraping tools or limits early on so the solution fits your exact needs. I’m ready to start defining the approach immediately and can deliver phase one results rapidly. Which specific domains should the Google search queries be restricted to for price extraction? Thanks, Andrew
$65 USD in 1 day
0.0

I recently helped a client improve the clarity and accuracy of their data extraction process, reducing errors and streamlining their workflow for better results. I will help you build a seamless, fully automatic pipeline to extract and enrich data from Spanish procurement PDFs, delivering clean, structured datasets with precise price extraction and similarity analysis. I have built a track record of delivering reliable work ahead of schedule, with attention to detail that consistently leads to repeat clients. Your need for very precise price extraction from multiple URLs and user-friendly final datasets stands out as a priority. I offer expertise in automated data extraction, natural language processing, and web scraping. While I’m just getting started here, I’ve worked on similar projects outside this platform and consistently delivered results clients loved. I’m aiming to build that same 5-star track record here. No risk, no obligation, happy to have a short, free discussion to see if this is a good fit. Regards, Rozz.
$50 USD in 4 days
0.0

✅⭐⭐⭐✅ I am ready to make your project a complete success! ✅⭐⭐⭐✅ I’ve analyzed your requirements, and this is a data-extraction and enrichment project where accuracy, reproducibility, and scalability are essential. The goal is to transform Spanish procurement PDFs into structured product data, enrich it with precise pricing and feature information from targeted websites, and output clean datasets in Excel and JSON. My approach is to build a Python-based pipeline: first, PDFs will be parsed using libraries like PyMuPDF or pdfplumber and cleaned into structured tables. For enrichment, each product name will be used as a Google search query (domain-restricted) and the top 5 results scraped using BeautifulSoup or Selenium, with LLM-assisted parsing for precise price, brand, and feature extraction. Similarity analysis will be implemented with Levenshtein distance or cosine similarity on embeddings to capture approximate matches. Data will be compiled into Excel and JSON, ensuring at least 5 accurate price entries per product. A comprehensive README will document the workflow, dependencies, and reproducibility steps. You’ll receive a fully automated, reproducible pipeline that reliably converts PDFs into enriched, precise datasets, ready for analysis. Looking forward to working with you on your project. Thank you!
$100 USD in 7 days
0.0

Hello, I bring years of experience building end-to-end data pipelines for documents and web data. I design Python-driven solutions to auto-extract product names and key features from Spanish PDFs, clean and structure results, and output final datasets in Excel (.xlsx) and JSON (.json), plus a detailed README. I have delivered similar work using PyMuPDF/pdfminer for PDF extraction, spaCy/LLMs for entity extraction, and robust web scraping (requests, Playwright) with domain-restricted Google searches to retrieve five precise prices per product, plus brand, features, and descriptions, and a similarity analysis (Levenshtein/cosine) to identify similar items. I can handle the work end-to-end with accuracy and reproducibility, and I guarantee a clean, ready-to-run pipeline. Best regards, Billy Bryan
$65 USD in 1 day
0.0

Having been a dedicated freelancer for 20 years and having worked intimately with Python throughout my career, I am confident I can meet and exceed your expectations on this project. I have gained extensive experience in web-based tasks and data extraction, making me well-equipped to tackle both phases of your Spanish procurement document project. I am fully comfortable navigating the intricacies of PDF data extraction while keeping their structure intact and organized for optimal output in Excel format. Furthermore, my abilities extend to skills such as web scraping, searching through Google and sorting through large amounts of information with precision - key qualities this project requires. I'm often commended for the level of thoroughness I bring to my work, which should give you peace of mind regarding the 5 accurate price extractions per product from 5 search URLs as outlined in the project.
$55 USD in 7 days
0.0

Lausanne, Switzerland
Payment method verified
Member since Dec 9, 2025
$10-100 USD