A Chinese Open Model Beat Claude on Semgrep's Own Frontier Coding-Agent Test

01 You're now the agent's supervisor, not the typist, and the dashboard is leaving your desk

Cursor released a mobile app this week that lets developers steer coding agents from a phone, away from the editor. Days later, OpenAI posted a video teasing hardware for Codex, its coding tool, with a launch date of July 15th. Two companies, two form factors, one shared assumption: the code is being written somewhere you no longer have to sit.

The mobile app is the clearer signal. Cursor built its name on an editor where agents autocomplete and rewrite code inside the file. The phone app does something different. It gives developers remote oversight of agents already running, according to TechCrunch. You check progress, redirect, approve. The keyboard is optional because the typing is no longer the job.

OpenAI's teaser points the same direction with far less detail. The video shows a square device with several buttons and the caption "Your favorite Codex shortcuts are getting an upgrade," per The Verge. The Verge notes this is not the separate AI device OpenAI is reported to be developing on a different track. What the buttons do, what the device is for, how it connects to Codex: none of that is public. Known facts are the shape, the buttons, and the date.

Strip the speculation and a structural shift remains. Agents have reached enough autonomy that their output needs a human watching, but the human no longer needs the IDE in front of them. The product surface is moving from a code-completion interface to a supervision panel, and that panel is becoming portable. Cursor put it on a phone. OpenAI is putting something on a desk, or in a hand, with physical buttons.

That reframes the developer's role inside the loop. The work being designed for is monitoring, not authorship: starting a task, watching an agent execute, intervening when it drifts. A phone app and a button box are tools for a supervisor, not a writer. Both arrive before either company has shown the autonomy that would justify walking away from the editor entirely.

The next data point arrives July 15th, when OpenAI reveals what the device actually does. Until then, the only confirmed change is where developers are expected to be while their code gets written: not necessarily at the desk.

Developer role shifting from writing code to approving agent output
Oversight tools moving off the desktop to phones and dedicated hardware
July 15th OpenAI launch will show whether Codex hardware is real workflow or marketing

Sources

Cursor now has a mobile app for guiding your coding agent on the go (techcrunch.com)
OpenAI is teasing new hardware… for Codex (theverge.com)

02 The Semgrep team reused its frontier-agent test on open models. A Chinese one beat Claude.

Semgrep had a narrow question, not a leaderboard to settle. The security firm wanted to know how much of an AI agent's vulnerability-finding ability comes from the model itself versus the harness wrapped around it — the scaffolding that feeds a model the repository, decides what it sees, and parses what it returns. So the team took its IDOR benchmark, the same dataset and the same prompt it had used to evaluate frontier coding agents, and pointed it at a batch of open-weight models given nothing but that prompt.

IDOR stands for Insecure Direct Object Reference: an access-control flaw that amounts to one user reaching data belonging to another. Detecting it cleanly is the kind of task where Semgrep expected closed frontier models to lead.
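For readers who have not met the flaw in code, here is a minimal sketch of what an IDOR looks like and how it is usually closed. Everything in it is hypothetical: the `INVOICES` store, the user names, and both function names are invented for illustration and are not drawn from Semgrep's benchmark.

```python
# Hypothetical in-memory data store; records and names are invented for illustration.
INVOICES = {
    101: {"owner": "alice", "total": 250},
    102: {"owner": "bob", "total": 990},
}


def get_invoice_insecure(current_user: str, invoice_id: int) -> dict:
    """IDOR: trusts the client-supplied ID and never checks who owns the record."""
    return INVOICES[invoice_id]


def get_invoice_secure(current_user: str, invoice_id: int) -> dict:
    """Fix: confirm the record belongs to the caller before returning it."""
    invoice = INVOICES[invoice_id]
    if invoice["owner"] != current_user:
        raise PermissionError("invoice does not belong to the requesting user")
    return invoice
```

In the insecure version, "alice" can read invoice 102 simply by asking for it. The only difference in the secure version is the ownership check, and a missing check of that kind is hard to spot without understanding what the code is supposed to enforce.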

GLM-5.2, the open-weight model from Zhipu AI, scored 39% F1 on IDOR detection at roughly $0.17 per vulnerability found. Claude Code scored 32%. Among models handed only a prompt, the strongest open-weight option also beat Claude Opus 4.8, which Semgrep says surprised the team. They had not set out to crown an open-weight champion.
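To make the two headline numbers concrete, the sketch below shows how an F1 score and a cost-per-finding figure are typically derived. The F1 formula is the standard one. The cost calculation assumes "per vulnerability found" means total spend divided by true positives, which the source does not spell out. The counts in the usage note are invented and are not Semgrep's data.

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """Standard F1: harmonic mean of precision and recall."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def cost_per_finding(total_cost_usd: float, tp: int) -> float:
    """Assumption: cost per vulnerability = total spend / true positives."""
    if tp == 0:
        raise ValueError("no true positives; cost per finding is undefined")
    return total_cost_usd / tp
```

With invented counts of 30 true positives, 20 false positives, and 40 missed vulnerabilities, `f1_score(30, 20, 40)` gives 0.5, and a hypothetical $6.00 run would cost $0.20 per finding. F1 penalizes both false alarms and misses, so a model cannot score well by flagging everything.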

The result carries limits Semgrep states plainly. GLM-5.2 still trailed the company's own multimodal pipeline, which scored 53–61% F1. That pipeline runs inside a purpose-built harness doing much of the heavy lifting, which is the variable Semgrep was trying to isolate in the first place. The model number is the model alone; the harness adds the rest.

The Verge framed it more narrowly still. Zhipu, also known as Z.ai, released GLM-5.2 as open-weight, and some researchers claim it matches Anthropic's Mythos in certain bug-finding and cybersecurity scenarios. On more general tasks, GLM lags behind models from Anthropic and OpenAI. The cybersecurity result is a specific competence, not across-the-board parity.

For security researchers, the practical takeaway is the deployment model. A bug-finding capability that beats Claude on this benchmark now ships as open weights, runs locally, and costs cents per finding.

Self-hosted, open-weight bug-finding now beats Claude Code at ~$0.17 per vuln
Security teams can run IDOR detection on-prem without frontier API access
Parity is narrow: GLM matches on cybersecurity, trails Anthropic and OpenAI elsewhere

Sources

GLM 5.2 beats Claude in our benchmarks (semgrep.dev)
China's Z.ai claims it can match Mythos on cybersecurity (theverge.com)

03 The health details you type into ChatGPT are legal to sell. Lawmakers want to change that.

Two Democratic lawmakers are preparing a bill to bar the sale of Americans' health and location data to brokers, and its reach extends somewhere users rarely consider: the messages they type into AI chatbots. According to The Verge, the coming proposal would cover information disclosed to ChatGPT or Claude. Current law leaves an opening. Nothing stops a company from packaging what someone reveals in a chat and selling it downstream.

Data brokers buy and resell personal records with little disclosure to the people described in them. Health and location entries rank among the most valuable, because they expose conditions, habits, and movements. A single chat log can hold all three at once.

The threat is not theoretical. In the Palisades wildfire arson case, prosecutors entered the defendant's ChatGPT logs into evidence, The Verge reported. The logs sat beside iPhone location data, security camera footage, and witness testimony. That fire became one of the deadliest in Los Angeles history. A chatbot transcript landed in the same evidentiary file as the phone records.

So one record points two ways. People treat a chatbot like a private confidant and type symptoms, locations, and worse. The same text can be sold to a broker, or pulled into court by the state. The bill would close the commercial route. It does nothing about the subpoena.

The lawmakers frame this as an industry-wide problem, not a complaint about a single company. The proposal targets the practice of moving sensitive data to brokers rather than any one firm's product. Whether AI transcripts qualify as "health" data under the final text will decide how much of a chat falls inside the ban.

Chatbot disclosures sit outside HIPAA and broker rules today
Bill would block resale of users' health and location data
Final definition decides whether AI chats count as health data

Sources

Lawmakers want to ban AI companies from selling your health data (theverge.com)
Prosecutors used ChatGPT logs as evidence in the Palisades fire trial (theverge.com)

South Korea's chipmakers pledge $550B to expand memory production
Samsung and SK Hynix, the world's two largest memory makers, committed over $550 billion to build additional fabs and memory research facilities. The spending responds to a severe DRAM and HBM shortage straining AI hardware supply, locally dubbed "RAMageddon." (techcrunch.com)

Anthropic gives California government Claude at half price
Anthropic signed a deal with Governor Newsom letting California state agencies buy Claude access at a 50% discount. The agreement deepens Anthropic's ties to the state while it remains at odds with the federal government. (techcrunch.com)

Meta contractors posed as teens to test rival chatbots
Hundreds of contractors on a Meta project impersonated minors to probe how Gemini and ChatGPT respond to prompts about suicide, sex, and drugs, WIRED found. The testing aimed to map competitors' safety responses on high-risk topics. (wired.com)

HP expands OpenAI partnership across its operations
HP scaled its OpenAI Frontier partnership to deploy AI in customer support, internal software development, and enterprise operations. The deal extends OpenAI's reach into a major PC manufacturer's product and back-office systems. (openai.com)

Arena turns its free AI leaderboard into a $100M business
Arena, the crowd-voted model-ranking site widely cited across the industry, now runs a $100 million commercial operation. The startup launched its paid service in September 2025. (techcrunch.com)

Palantir builds a secure AI engine for US agencies on NVIDIA Nemotron
Palantir released an inference engine running NVIDIA Nemotron open models inside closed government environments. The system targets US agencies that need to run AI on classified data without external connectivity. (blogs.nvidia.com)

Anthropic's Claude reaches general availability on Azure GB300 hardware
Microsoft made Claude models in its Foundry platform generally available, hosted on Azure and running on NVIDIA GB300 Blackwell Ultra GPUs. Azure enterprises can now build agents on Claude without leaving Microsoft's stack. (blogs.nvidia.com)

Flexion trains a humanoid robot for office tasks
Flexion Robotics, founded by former Nvidia engineers, demonstrated a humanoid robot handling routine office work. The company uses a training method aimed at teaching robots practical manual tasks. (wired.com)

Proception settles Tesla trade-secret suit and raises $11M
Robot-hand startup Proception resolved Tesla's trade-secret lawsuit and announced an $11 million raise. The company collects training data to build dexterous robotic hands, one of robotics' hardest problems. (techcrunch.com)

TIDAL stops paying for AI-generated music
TIDAL cut monetization for AI-generated tracks and will use automated tools to remove uploads that impersonate real artists or groups. The policy targets royalty payouts on synthetic music. (techcrunch.com)

Google opens Gemini's personalized image generation to free US users
Google extended Gemini's personalized image generation to eligible free US accounts. The feature creates images using a user's interests and data pulled from connected Google apps. (techcrunch.com)

Firefly runs NVIDIA Jetson in lunar orbit
Firefly Aerospace's Blue Ghost Mission 2, targeted for late 2026, will run its Ocula imaging service on NVIDIA Jetson in lunar orbit. The system processes images on orbit and transmits only selected results to Earth. (blogs.nvidia.com)