
Closed
Posted
Paid on delivery
I need help preparing a dataset from CVEfixes for a file-level vulnerability classification task.

My target programming languages are:
- PHP
- Java
- JavaScript
- Python

My target CWE scope is:
- CWE-20
- CWE-22
- CWE-79
- CWE-89
- CWE-352

Important note: this will be a file-level dataset, and the labels in CVEfixes should be treated as derived / proxy file-level labels, not perfect manual ground truth. I want the preparation pipeline to be strict, realistic, and academically defensible. The dataset has a real class imbalance problem, and I want it handled carefully.

What I want:
I want the dataset to be organized by language first. That means I need 4 separate language-specific datasets:
- Java
- PHP
- JavaScript
- Python
Then, for each language-specific dataset, I want a split into:
- train
- validation
- test
So each language should end up with its own 3 split files.

Important constraints:
- The split must be strict and leakage-aware
- Duplicates and near-duplicates should be handled carefully before or during splitting
- If possible, avoid putting highly similar samples across train/validation/test
- Augmentation must be applied only to the training split
- Validation and test must remain original, real, and unaugmented
- I do not want any synthetic or LLM-generated samples in validation or test
- The final evaluation setting must stay realistic and fair

Augmentation requirement:
To address class imbalance, you may use an LLM to generate augmented variants in a controlled way, but only under these conditions:
- augmentation must be applied after the split
- only the train split may be augmented
- augmented samples must not leak into validation or test
- validation and test should remain fully original and unchanged
- the process should improve training balance without making evaluation unrealistic

What I need help with:
I want help designing and preparing the full data preparation notebook correctly, including:
- filtering CVEfixes to the required languages
- filtering to the required CWE scope
- building a clean file-level dataset
- handling duplicate or pathological repeated samples carefully
- splitting each language dataset into train/validation/test in a strict way
- applying augmentation only on the training split
- improving class balance without contaminating validation or test

Final goal:
The final result should be:
- Java: train / validation / test
- PHP: train / validation / test
- JavaScript: train / validation / test
- Python: train / validation / test
with augmentation used only on train, and with validation/test kept fully real and untouched.
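The first steps of the request (language filter, CWE filter, exact-duplicate removal, one dataset per language) could be sketched roughly as below. This is only an illustration under stated assumptions: the record keys `language`, `cwe`, `code`, `label`, and `repo` are hypothetical and do not reflect the actual CVEfixes schema, and whitespace-insensitive hashing is just one possible definition of an exact duplicate.

```python
import hashlib
import re

TARGET_LANGUAGES = {"PHP", "Java", "JavaScript", "Python"}
TARGET_CWES = {"CWE-20", "CWE-22", "CWE-79", "CWE-89", "CWE-352"}


def normalize_code(code: str) -> str:
    """Collapse whitespace runs so trivially reformatted copies hash identically.

    Used only for hashing; the stored code is left untouched.
    """
    return re.sub(r"\s+", " ", code).strip()


def content_hash(code: str) -> str:
    """SHA-256 of the whitespace-normalized file content."""
    return hashlib.sha256(normalize_code(code).encode("utf-8")).hexdigest()


def filter_and_dedup(records):
    """Keep in-scope records, drop exact duplicates, and group by language.

    Each record is a dict with hypothetical keys:
    'language', 'cwe', 'code', 'label', 'repo'.
    The first occurrence of each (language, content hash) pair is kept.
    """
    by_language = {lang: [] for lang in TARGET_LANGUAGES}
    seen = set()
    for rec in records:
        if rec["language"] not in TARGET_LANGUAGES:
            continue
        if rec["cwe"] not in TARGET_CWES:
            continue
        key = (rec["language"], content_hash(rec["code"]))
        if key in seen:
            continue  # exact (whitespace-insensitive) duplicate
        seen.add(key)
        by_language[rec["language"]].append(dict(rec, content_hash=key[1]))
    return by_language
```

Keeping the hash on each record makes it available later for cross-split checks. A real pipeline would also need a policy for identical files that carry conflicting labels, which this sketch does not address.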
Project ID: 40322407
57 bids
Remote project
Active 27 days ago
57 freelancers are bidding an average of $497 USD for this job

⭐⭐⭐⭐⭐ Prepare File-Level Dataset for Vulnerability Classification Task
❇️ Hi My Friend, I hope you are doing well. I've reviewed your project needs and see you're looking for help preparing a dataset from CVEfixes for a file-level vulnerability classification task. You don’t need to look any further; Zohaib is here to assist you! My team has successfully completed over 50 similar projects for dataset preparation. I will create a strict and realistic data preparation pipeline that meets your requirements.
➡️ Why Me? I can easily prepare your dataset as I have 5 years of experience in data preparation and classification tasks. My skills include data filtering, organizing datasets, and handling class imbalance. Additionally, I have a strong grip on Python, Java, PHP, and JavaScript, ensuring a comprehensive approach to your project.
➡️ Let's have a quick chat to discuss your project in detail, and I can show you samples of my previous work. Looking forward to discussing this with you!
➡️ Skills & Experience:
✅ Data Preparation
✅ Dataset Organization
✅ Class Imbalance Handling
✅ Python Programming
✅ Java Programming
✅ PHP Programming
✅ JavaScript Programming
✅ Data Filtering
✅ File-Level Dataset Creation
✅ Augmentation Techniques
✅ Validation and Testing
✅ Project Management
Waiting for your response!
Best Regards, Zohaib
$350 USD in 2 days
8.1

Hello, I understand you need a thorough dataset preparation for file-level vulnerability classification using the CVEfixes dataset across PHP, Java, JavaScript, and Python. I'll provide a strict and defensible pipeline that filters by your specified CWE scope, organizes data by language, and carefully handles duplicates and near-duplicates to prevent data leakage. The splits will be strictly leakage-aware, ensuring no overlap between train, validation, and test sets and keeping validation and test sets original and unaltered. For imbalance, I will apply cautious augmentation only to the train sets after splitting, without affecting evaluation integrity. The approach will focus on creating clean, realistic datasets adhering to your academic standards.
A few questions:
- What is your preferred criterion for defining near-duplicates, so that filtering is precise and the splits remain strict and leakage-free?
- How strictly do you want to avoid samples with partial similarity in different splits?
- Do you have specific augmentation strategies or controls in mind beyond LLM-based augmentation?
- What evaluation metrics or downstream tasks will be prioritized, so the splits can be optimized accordingly?
- How large is your current CVEfixes dataset, so I can estimate processing and runtime needs?
Best regards,
$750 USD in 13 days
7.6

Hi I have strong experience preparing academically defensible ML datasets for vulnerability classification using CVE/CWE-aligned sources, leakage-aware splitting, deduplication, and controlled augmentation pipelines. The main technical challenge here is turning CVEfixes into a realistic file-level dataset without introducing split leakage, duplicate contamination, or inflated results from proxy labels and synthetic balancing. I can build a full notebook pipeline that filters CVEfixes by your four target languages and five target CWEs, constructs strict proxy file-level labels, removes exact and near-duplicate samples carefully, and produces separate train/validation/test splits for Java, PHP, JavaScript, and Python. I will structure the split logic to minimize repository, commit, and similarity leakage as much as the source permits, while keeping validation and test fully original and untouched. For class imbalance, I can apply controlled augmentation only after splitting and only on the training split, with clear provenance so augmented data never contaminates validation or test. My approach emphasizes reproducibility, defensible assumptions, and detailed documentation of trade-offs around proxy labeling, duplicate handling, and augmentation policy. The final output will be a clean preparation notebook plus per-language split files ready for downstream training and fair evaluation. Thanks, Hercules
$500 USD in 7 days
7.0
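The repository-level leakage control mentioned in the bid above can be illustrated with a deterministic group-aware split. The sketch below is an assumption about how such a split might be done, not the bidder's method: it hashes a hypothetical `repo` key so that every file from one repository lands in exactly one split. The fractions and the `seed` string are illustrative parameters.

```python
import hashlib


def assign_split(group_id: str, val_frac: float = 0.1,
                 test_frac: float = 0.1, seed: str = "v1") -> str:
    """Map a group (for example a repository) to one split, deterministically."""
    digest = hashlib.sha256(f"{seed}:{group_id}".encode("utf-8")).hexdigest()
    # First 32 bits of the digest scaled to a value in [0, 1).
    bucket = int(digest[:8], 16) / 0x100000000
    if bucket < test_frac:
        return "test"
    if bucket < test_frac + val_frac:
        return "validation"
    return "train"


def split_by_group(records, group_key: str = "repo"):
    """Split records so that no group appears in more than one split.

    'repo' is a hypothetical key; a near-duplicate cluster id or a
    combined repo+cluster id could be used instead.
    """
    splits = {"train": [], "validation": [], "test": []}
    for rec in records:
        splits[assign_split(rec[group_key])].append(rec)
    return splits
```

Hash-based assignment keeps groups intact but does not stratify by class or CWE, so with a strong class imbalance the per-split label distribution should be inspected and the seed or strategy adjusted if a split ends up with too few positives.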

Hi there, I'm excited about your project on preparing the CVEfixes dataset for file-level vulnerability classification. With extensive experience in data science, programming in PHP, Java, JavaScript, and Python, and a solid background in handling similar datasets, I am well-equipped to create a precise and effective data preparation notebook. I understand the importance of adhering to strict data handling principles, especially considering the need for a realistic evaluation setting and careful class imbalance management. I will ensure that each language dataset is organized correctly with strict train/validation/test splits, maintaining leakage awareness, while applying augmentation only to the training split as you specified. My approach will focus on building clean, file-level datasets while handling duplicates meticulously. Let's schedule a time to discuss your project in more detail. I'm looking forward to your message right away! How do you plan to validate the effectiveness of the dataset once it's prepared?
$610 USD in 6 days
6.8

I'm Iosif Peterfi, 15+ years delivering practical systems and security improvements across web, automation, and cloud. This is my speciality: designing defensible data preparation pipelines for high-stakes classification tasks, with strict leakage controls, reproducibility, and balanced sampling across languages. You need four language-specific, file-level CVEfixes datasets (Java, PHP, JavaScript, Python), each with train, validation, and test splits, leakage-aware deduplication, and augmentation only on training samples to address class imbalance. I will filter CVEfixes to the four languages and the CWE scope you listed, build a clean file-level dataset, remove duplicates and pathological repeats, create strict, leakage-aware splits, apply augmentation only on the training split, and provide a reproducible notebook plus clear documentation. The result will be ready-to-run, auditable data splits with clear definitions of scope, quality controls, and risk mitigation, delivering tangible improvements in modeling fairness and realism. Last quarter I helped a fintech team prepare a language-specific vulnerability dataset, reduced duplicates and leakage, and achieved a balanced class distribution, boosting validation F1 by 8%. This work reduced evaluation risk and supported fair comparisons across languages. Let's chat - I can walk you through my approach in 15 minutes.
$1,200 USD in 5 days
6.8

Hello, I understand you need a rigorous file-level vulnerability dataset from CVEfixes for PHP, Java, JavaScript, and Python, focusing on CWE-20, 22, 79, 89, 352. I will filter samples by language and CWE, handle duplicates and near-duplicates carefully, and organize each language into train, validation, and test splits. The splitting will be strict and leakage-aware, ensuring highly similar samples don’t appear across splits. For class imbalance, I will apply controlled augmentation using LLMs only on the training split after splitting. Validation and test will remain fully original, ensuring realistic, academically defensible evaluation. The final deliverable will include a clean, split, and augmented dataset for each language, along with a detailed Jupyter/Python notebook documenting the pipeline, filtering logic, and augmentation procedures. Thanks, Asif
$750 USD in 11 days
6.6
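The train-only augmentation described in the bid above could be organized as in the following sketch, which computes how many variants each under-represented class needs and tags every generated sample with provenance. It is a hypothetical outline: `make_variant` stands in for whatever controlled rewriting step is used (for example a wrapper around an LLM call), and `target_ratio`, `label`, `code`, and `is_augmented` are illustrative names.

```python
import math
from collections import Counter


def augmentation_plan(train_records, target_ratio: float = 0.5,
                      label_key: str = "label"):
    """Number of augmented samples each class needs to reach
    target_ratio * (size of the largest class) within the train split."""
    counts = Counter(rec[label_key] for rec in train_records)
    if not counts:
        return {}
    goal = math.ceil(target_ratio * max(counts.values()))
    return {label: max(0, goal - n) for label, n in counts.items()}


def augment_train(train_records, make_variant, target_ratio: float = 0.5,
                  label_key: str = "label"):
    """Return a new train list with augmented variants appended.

    make_variant is a caller-supplied function that takes source code and
    returns a rewritten version. Originals are marked is_augmented=False,
    generated samples is_augmented=True, so they can be audited later.
    """
    plan = augmentation_plan(train_records, target_ratio, label_key)
    out = [dict(rec, is_augmented=False) for rec in train_records]
    for label, needed in plan.items():
        sources = [r for r in train_records if r[label_key] == label]
        for i in range(needed):
            src = sources[i % len(sources)]  # cycle through the real samples
            out.append(dict(src, code=make_variant(src["code"]),
                            is_augmented=True))
    return out
```

Because the function only ever receives the train split, validation and test cannot be touched by construction; the `is_augmented` flag then lets a later check confirm that no generated sample ended up elsewhere.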

As an experienced full-stack Web & App Developer with proven proficiency in JavaScript, PHP, and Python, I'm well-equipped to handle your dataset preparation project. I have a strong background in managing datasets efficiently and in building clean file-level datasets, both of which are vital for the success of your project. My professional experience has also exposed me to the complexities of working with imbalanced datasets and of using specific augmentation techniques to improve training balance without compromising evaluation results. This is extremely important considering your need for strict split management, leakage avoidance, and keeping validation/test sets completely real and original. My approach to projects is pragmatic: I aim to deliver exceptional quality and meet specific client needs while prioritizing problem-solving for real-world tasks. I believe my range of expertise matches the demands of your project strongly, making me an ideal fit for addressing your dataset preparation requirements at every step, from filtering CVEfixes to handling duplicates and carefully performing the splits. Let's collaborate for a successful project!
$250 USD in 3 days
6.7

Hello, With my 8+ years of experience as a Full Stack Developer, and since the scope of your project overlaps with my expertise, I believe I'm the ideal freelancer for your dataset preparation needs. Although my work has primarily revolved around building intelligent AI-powered platforms, data analysis, and model integrations, I've been engaged with multiple ventures that had rigorous academic demands like yours. I’ve worked extensively with Java, PHP, JavaScript, and Python and have developed in-depth knowledge of these languages. Moreover, the nature of your project and its requirement for strict adherence to realistic validation motivated me further to express my interest: it is crucial to develop a reliable dataset that produces accurate results. Throughout my career, I've handled class imbalance, duplicate/near-duplicate samples, dataset splitting, and augmentation, all within strict leakage-aware protocols. This puts me at an advantage when it comes to managing these issues precisely for you. Choosing me for this task ensures meticulous handling of your data in accordance with academic defensibility, while making use of smart automation solutions I’ve built over the years! Let’s build something ground-breaking together!
$750 USD in 7 days
6.7

Hello, Can we discuss your CVEfixes dataset prep project? I have built a leakage-safe pipeline that converts commit-level data into clean file-level samples with strict splits using PyTorch. I’ll filter by language and CWE, dedupe smartly, split without leakage, and balance only via train augmentation. Should splits be repo-based or time-based? How will you detect near-duplicates across files? What balance ratio do you want per CWE? One thing: commit overlap can silently leak signals if not grouped early. Best regards, Devendra S.
$1,000 USD in 14 days
6.4

Hello, With over 7 years of experience in Data Processing, Data Science, Statistical Analysis, and Python, I have the expertise required to handle your project effectively. I have carefully reviewed the requirements for preparing a dataset for file-level vulnerability classification using the CVEfixes dataset. To accomplish this project, I will start by filtering the CVEfixes dataset based on the required programming languages (PHP, Java, JavaScript, Python) and the specified CWE scope (CWE-20, CWE-22, CWE-79, CWE-89, CWE-352). I will then organize the dataset into language-specific datasets and split each dataset into training, validation, and test sets, ensuring strictness, realism, and academic defensibility throughout the process. Furthermore, I will address the class imbalance issue by applying augmentation techniques specifically to the training split while keeping the validation and test sets original and untouched. By carefully handling duplicates, maintaining strict splits, and ensuring that augmented samples do not leak into the evaluation sets, I will deliver a comprehensive data preparation notebook that meets your requirements. I would like to discuss this project further with you. Please connect with me via chat to explore the details and clarify any additional aspects. You can visit my profile at: https://www.freelancer.com/u/HiraMahmood4072 Thank you.
$275 USD in 2 days
6.3

I appreciate the thoroughness of your project description and I assure you that my technical and analytical skills are well-suited to tackle it. With my substantial expertise in Data Science, Statistical Analysis, Machine Learning, and Neural Networks, I am committed to meeting all your dataset preparation requirements and delivering a result that is thorough, academically defensible, and in line with industry best practices. Given your project specifications on language-specific datasets and the strict splits into training/validation/test sets, and taking into account the challenging class imbalance issue, I will diligently organize the CVEfixes dataset by language: PHP, Java, JavaScript, and Python. Drawing upon my experience with Exploratory Data Analysis (EDA), I'll filter for samples relevant to CWE-20, CWE-22, CWE-79, CWE-89, and CWE-352. Let's collaborate to build robust data preparation workflows aligned with fair evaluation settings. Through methods like dimensionality reduction and predictive modeling, I will design clear EDA visualizations that reveal patterns and anomalies efficiently. Rest assured that we will respect your request for a realistic dataset and maintain academic rigor throughout all stages.
$500 USD in 7 days
6.0

Hi! I have reviewed the details of your project and I can do this! We have handled similar projects successfully, and I am confident we can deliver high-quality results for you. We prefer clear communication and regular updates so that the project progresses smoothly and meets your expectations. Let's have a detailed discussion, as it will help me give you a complete plan, including a timeline and estimated budget. I will share my portfolio in the chat to show relevant examples of our past work. Looking forward to your response. mughiraa
$500 USD in 7 days
5.5

Hello, I hope you're doing well! I'm a Top-Rated Full-Stack Developer with 12+ years of experience delivering scalable, high-quality digital solutions. I’ve completed 100+ projects across Node.js, Python (Django/Flask), PHP (Laravel, CodeIgniter, Yii2), React/Vue/Angular, Shopify & E-commerce, and Mobile Apps (React Native, Flutter). My clients value my reliability, communication, and commitment to deadlines.
Why I’m a Great Fit:
> I carefully review project requirements to ensure full clarity.
> I deliver clean, maintainable, and scalable code.
> I break complex tasks into structured milestones with regular updates.
> I work seamlessly across time zones using Jira, Asana, and Basecamp.
What I Can Deliver:
> Full-Stack Web & Backend Development (APIs, Platforms)
> E-commerce & Marketplace Solutions (Shopify, WooCommerce, Magento)
> Mobile Apps (React Native, Flutter)
> AI/ML Integrations, Chatbots & Analytics
> DevOps (AWS, Docker, CI/CD), Testing & QA
> UI/UX Support & Technical Consulting
Best regards,
$500 USD in 7 days
6.0

Hello sir, I went through your job description and am glad to share that I have extensive experience in dataset preparation for file-level vulnerability classification on the CVEfixes dataset. I'm a seasoned programmer and engineer with quality experience in Flutter, React, Node.js, Spring Boot, frontend and backend development, Python, MATLAB, RStudio, C, C++, C#, OpenCV, OpenGL, Tesseract OCR, Google Vision, statistical programming/R programming, data analysis, computing for data analysis, time series and econometrics, machine learning, AI, deep learning, MATLAB and Mathematica, 3D modeling, CAD/CAM, AutoCAD, 2D, architectural engineering, SolidWorks, Unity 3D, PCB, electronics, Arduino, automation, embedded and firmware, IoT, and electrical/mechanical engineering. I am a TOP Rated Freelancer, and you can check my reviews here as well: https://www.freelancer.com/u/mzdesmag. Looking forward to potentially working together on this project. Thanks and Best regards, Adekunle.
$250 USD in 2 days
5.4

Hello, Hope you're doing great! I am a Professional PHP Developer who builds secure, high-performance, and business-focused web applications. I work with custom PHP as well as modern frameworks, ensuring every project is scalable, optimized, and easy to maintain for long-term growth.
What I Do
- Custom web applications and business automation systems
- REST API development and third-party API integration
- Secure authentication systems, admin panels, and dashboards
- Fast, responsive, and mobile-friendly websites
- Website migration, bug fixing, code refactoring, and performance optimization
Why Clients Prefer My Work
- Clean project structure with scalable architecture
- Secure coding standards with optimized performance
- Clear communication and professional approach
- On-time delivery with regular progress updates
- Focus on long-term reliability and maintainability
Ready to Start
Share your project requirements or reference website, and I will carefully analyze it and provide:
- Best technical strategy and development plan
- Clear timeline with milestone breakdown
- Transparent budget estimate
Looking forward to building a powerful and successful solution for you!
$250 USD in 7 days
5.3

Hello! This is James from Hollywood, and I’m excited about the opportunity to assist with preparing the dataset for file-level vulnerability classification on the CVEfixes dataset. I’ve carefully read your project description and believe I have the relevant skills and expertise, backed by over 15 years of experience in data processing, statistical analysis, and programming. To ensure I fully understand your needs, could you please clarify the following questions? 1. Are there specific file formats or structures you prefer for the dataset? 2. What criteria do you envision for classifying the vulnerabilities at the file level? My approach would involve analyzing the CVEfixes dataset, identifying key attributes, and implementing a systematic classification strategy. This could include data cleaning, normalization, and augmentation to ensure the dataset aligns with your project goals. With my background in building data pipelines and experience in AI and automation, I’m confident I can deliver a well-structured, comprehensive dataset that meets your specifications. I’ve successfully completed similar projects, such as developing a vulnerability assessment tool for a cybersecurity firm and creating data augmentation frameworks for a SaaS platform. Let’s connect to discuss your project further! I'm looking forward to your response.
$500 USD in 3 days
5.2

Hi, I understand that you need a rigorously prepared, file-level dataset from CVEfixes for vulnerability classification across PHP, Java, JavaScript, and Python. The goal is to create language-specific datasets split into train, validation, and test sets while handling class imbalance carefully, preventing data leakage, and keeping validation and test sets fully real and unaugmented. My approach would involve filtering CVEfixes to your target languages and CWEs, deduplicating and removing near-duplicates, and splitting each dataset in a strict, leakage-aware manner. I will then apply augmentation only to the training splits to improve class balance, ensuring the validation and test sets remain untouched. I can provide a well-documented Jupyter notebook implementing the full pipeline, including reproducible filtering, splitting, and augmentation steps. Pre-delivery, I will validate the splits for leakage, check class distributions, ensure augmentation integrity, and verify reproducibility of the dataset preparation process. Best, Justin
$500 USD in 7 days
5.3
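The pre-delivery checks named in the bid above (leakage between splits, augmentation integrity) could take a form like the sketch below. It is an illustrative assumption, not a description of the bidder's tooling: the keys `repo`, `content_hash`, and `is_augmented` are hypothetical, and a real check would likely also compare near-duplicate cluster ids.

```python
def check_no_leakage(splits, group_key: str = "repo",
                     hash_key: str = "content_hash") -> bool:
    """Raise AssertionError if any two splits share a group or a content hash,
    or if validation/test contain samples flagged as augmented.

    splits maps a split name ('train', 'validation', 'test') to a list of
    record dicts. Returns True when every check passes.
    """
    names = list(splits)
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            for key in (group_key, hash_key):
                shared = ({r[key] for r in splits[a]}
                          & {r[key] for r in splits[b]})
                assert not shared, (
                    f"{key} overlap between {a} and {b}: {sorted(shared)[:5]}"
                )
    for name in ("validation", "test"):
        bad = [r for r in splits.get(name, []) if r.get("is_augmented")]
        assert not bad, f"{name} contains {len(bad)} augmented samples"
    return True
```

Running a check of this kind once per language, right before the split files are written, gives an auditable record that the constraints in the project description were actually enforced.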

I can handle this dataset preparation for file-level vulnerability classification using the CVEfixes dataset. Your goal to create strict, leakage-aware, and manageable language-specific datasets for PHP, Java, JavaScript, and Python is clear. I'll implement a pipeline to carefully treat labels as proxy ground truths, handle class imbalance with suitable techniques, and rigorously process duplicates and near-duplicates to avoid leakage. I'll use Python primarily for data processing and statistical analysis to build a clean, academically defensible pipeline. The output will be four language-specific datasets, each split into train, validation, and test sets with no leakage, ready for your classification tasks. My approach focuses on clean, reproducible code that can scale and be adapted to similar datasets or future updates. This will provide a solid foundation for your vulnerability classification. Would you like me to prepare a detailed outline of the dataset preparation steps and code structure before we proceed?
$375 USD in 7 days
5.4

As an experienced backend developer with a strong skillset in Data Processing and PHP, I believe I'm the right fit for your dataset preparation project. I fully comprehend the significance of academic defensibility in data preparation, particularly when dealing with derived labels like those in CVEfixes. My extensive background in creating and maintaining web applications using Laravel framework proves my capability to design and implement clean database structures for your datasets. Handling class imbalance is always a challenge but it's a challenge that I'm well-prepared to tackle. I can leverage my expertise to ensure strict splitting of the datasets, avoiding leakage while simultaneously handling duplicates and near-duplicates with utmost care. With your augmentation requirement, I'll focus only on the training split as instructed, ensuring no leakage occurs into validation or test sets. Your final evaluation's realism and fairness will be upheld throughout. Ultimately, I understand that you desire an organically augmented dataset that doesn't compromise the integrity of validation and testing phases. My commitment to delivering high-quality work, keen eye for details, and ability to find practical solutions to complex problems makes me confident that I can successfully complete this project with high standards meeting your specific needs. Let's discuss how we can efficiently cross each hurdle together!
$250 USD in 5 days
5.0

Hi hatoon1, Just last week I completed a similar task successfully, so I can get started on this without any ramp-up time.
How should file-level labels be defined: positives as the pre-fix versions of all files touched by the CVE-linked fixing commit (using CVEfixes CWEs), and negatives as post-fix versions and/or unrelated files from the same repo/time window; or a different scheme?
Is cloning upstream repos and pinning SHAs acceptable for file retrieval (dropping missing or license-issue files), and should forks/mirrors be collapsed to a single origin for dedup?
Suggestion 1: Do repository- and near-duplicate-aware splitting: tokenize → normalize → w-shingles → MinHash/LSH clusters; split by repo+cluster groups, stratified by CWE and file size, with a temporal holdout for test.
Suggestion 2: Use CWE-specific, train-only augmentation that preserves semantics (AST-guided edits) and vulnerability traits (pattern/static checks), with per-repo/cluster caps and post-aug dedup to prevent leakage/drift.
Action Plan:
Phase 1: Ingest CVEfixes; filter by language (PHP, Java, JavaScript, Python) and CWEs (20, 22, 79, 89, 352); clone repos, map fix commits.
Phase 2: Build file-level samples (pre-fix/fixed), attach metadata; finalize labeling.
Phase 3: QC: license screen, size bounds, path anomalies; remove exact dups.
Phase 4: Near-dup clustering (normalize, MinHash/LSH); file lineages
Best Regards, Sid
$750 USD in 9 days
5.3
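The tokenize, shingle, and MinHash steps suggested in the bid above can be sketched with the standard library alone. This is a simplified illustration under assumptions: the tokenizer is a crude regex rather than a language-aware lexer, `w`, `num_perm`, and `threshold` are arbitrary example values, and the all-pairs comparison stands in for the LSH banding that would be needed at dataset scale.

```python
import hashlib
import re


def shingles(code: str, w: int = 5):
    """Set of w-token shingles from a crude, language-agnostic tokenization."""
    tokens = re.findall(r"[A-Za-z_]\w*|\d+|\S", code)
    if len(tokens) < w:
        return {" ".join(tokens)}
    return {" ".join(tokens[i:i + w]) for i in range(len(tokens) - w + 1)}


def minhash_signature(items, num_perm: int = 64):
    """MinHash signature: the minimum 64-bit hash under each seeded function."""
    sig = []
    for seed in range(num_perm):
        prefix = f"{seed}:".encode("utf-8")
        sig.append(min(
            int.from_bytes(
                hashlib.blake2b(prefix + s.encode("utf-8"),
                                digest_size=8).digest(), "big")
            for s in items
        ))
    return sig


def estimated_jaccard(sig_a, sig_b) -> float:
    """Fraction of matching signature positions (estimates Jaccard similarity)."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def cluster_near_duplicates(codes, threshold: float = 0.8):
    """Return one cluster id per input, merging pairs above the threshold.

    Uses union-find over all pairs, which is quadratic in the number of files.
    """
    sigs = [minhash_signature(shingles(c)) for c in codes]
    parent = list(range(len(codes)))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for i in range(len(codes)):
        for j in range(i + 1, len(codes)):
            if estimated_jaccard(sigs[i], sigs[j]) >= threshold:
                parent[find(j)] = find(i)
    return [find(i) for i in range(len(codes))]
```

The resulting cluster ids can be combined with the repository name to form the group key used for splitting, so that a near-duplicate pair never straddles train and test.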

Makkah, Saudi Arabia
Member since Mar 24, 2026