summaryrefslogtreecommitdiff
path: root/net/ipv4/udp.c
AgeCommit message (Collapse)Author
2018-09-29udp4: fix IP_CMSG_CHECKSUM for connected socketsPaolo Abeni
[ Upstream commit 2b5a921740a55c00223a797d075b9c77c42cb171 ] commit 2abb7cdc0dc8 ("udp: Add support for doing checksum unnecessary conversion") left out the early demux path for connected sockets. As a result IP_CMSG_CHECKSUM gives wrong values for such socket when GRO is not enabled/available. This change addresses the issue by moving the csum conversion to a common helper and using such helper in both the default and the early demux rx path. Fixes: 2abb7cdc0dc8 ("udp: Add support for doing checksum unnecessary conversion") Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-05-19ipv4: fix memory leaks in udp_sendmsg, ping_v4_sendmsgAndrey Ignatov
[ Upstream commit 1b97013bfb11d66f041de691de6f0fec748ce016 ] Fix more memory leaks in ip_cmsg_send() callers. Part of them were fixed earlier in 919483096bfe. * udp_sendmsg one was there since the beginning when linux sources were first added to git; * ping_v4_sendmsg one was copy/pasted in c319b4d76b9e. Whenever return happens in udp_sendmsg() or ping_v4_sendmsg() IP options have to be freed if they were allocated previously. Add label so that future callers (if any) can use it instead of kfree() before return that is easy to forget. Fixes: c319b4d76b9e (net: ipv4: add IPPROTO_ICMP socket kind) Signed-off-by: Andrey Ignatov <rdna@fb.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2018-03-11udplite: fix partial checksum initializationAlexey Kodanev
[ Upstream commit 15f35d49c93f4fa9875235e7bf3e3783d2dd7a1b ] Since UDP-Lite is always using checksum, the following path is triggered when calculating pseudo header for it: udp4_csum_init() or udp6_csum_init() skb_checksum_init_zero_check() __skb_checksum_validate_complete() The problem can appear if skb->len is less than CHECKSUM_BREAK. In this particular case __skb_checksum_validate_complete() also invokes __skb_checksum_complete(skb). If UDP-Lite is using partial checksum that covers only part of a packet, the function will return bad checksum and the packet will be dropped. It can be fixed if we skip skb_checksum_init_zero_check() and only set the required pseudo header checksum for UDP-Lite with partial checksum before udp4_csum_init()/udp6_csum_init() functions return. Fixes: ed70fcfcee95 ("net: Call skb_checksum_init in IPv4") Fixes: e4f45b7f40bd ("net: Call skb_checksum_init in IPv6") Signed-off-by: Alexey Kodanev <alexey.kodanev@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-11-18soreuseport: fix initialization raceCraig Gallek
[ Upstream commit 1b5f962e71bfad6284574655c406597535c3ea7a ] Syzkaller stumbled upon a way to trigger WARNING: CPU: 1 PID: 13881 at net/core/sock_reuseport.c:41 reuseport_alloc+0x306/0x3b0 net/core/sock_reuseport.c:39 There are two initialization paths for the sock_reuseport structure in a socket: Through the udp/tcp bind paths of SO_REUSEPORT sockets or through SO_ATTACH_REUSEPORT_[CE]BPF before bind. The existing implementation assumedthat the socket lock protected both of these paths when it actually only protects the SO_ATTACH_REUSEPORT path. Syzkaller triggered this double allocation by running these paths concurrently. This patch moves the check for double allocation into the reuseport_alloc function which is protected by a global spin lock. Fixes: e32ea7e74727 ("soreuseport: fast reuseport UDP socket selection") Fixes: c125e80b8868 ("soreuseport: fast reuseport TCP socket selection") Signed-off-by: Craig Gallek <kraig@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2017-08-12udp: consistently apply ufo or fragmentationWillem de Bruijn
[ Upstream commit 85f1bd9a7b5a79d5baa8bf44af19658f7bf77bfa ] When iteratively building a UDP datagram with MSG_MORE and that datagram exceeds MTU, consistently choose UFO or fragmentation. Once skb_is_gso, always apply ufo. Conversely, once a datagram is split across multiple skbs, do not consider ufo. Sendpage already maintains the first invariant, only add the second. IPv6 does not have a sendpage implementation to modify. A gso skb must have a partial checksum, do not follow sk_no_check_tx in udp_send_skb. Found by syzkaller. Fixes: e89e9cf539a2 ("[IPv4/IPv6]: UFO Scatter-gather approach") Reported-by: Andrey Konovalov <andreyknvl@google.com> Signed-off-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2016-11-24udplite: call proper backlog handlersEric Dumazet
In commits 93821778def10 ("udp: Fix rcv socket locking") and f7ad74fef3af ("net/ipv6/udp: UDP encapsulation: break backlog_rcv into __udpv6_queue_rcv_skb") UDP backlog handlers were renamed, but UDPlite was forgotten. This leads to crashes if UDPlite header is pulled twice, which happens starting from commit e6afc8ace6dd ("udp: remove headers from UDP packets before queueing") Bug found by syzkaller team, thanks a lot guys ! Note that backlog use in UDP/UDPlite is scheduled to be removed starting from linux-4.10, so this patch is only needed up to linux-4.9 Fixes: 93821778def1 ("udp: Fix rcv socket locking") Fixes: f7ad74fef3af ("net/ipv6/udp: UDP encapsulation: break backlog_rcv into __udpv6_queue_rcv_skb") Fixes: e6afc8ace6dd ("udp: remove headers from UDP packets before queueing") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Andrey Konovalov <andreyknvl@google.com> Cc: Benjamin LaHaise <bcrl@kvack.org> Cc: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-11-15udp: restore UDPlite many-cast deliveryPablo Neira
Honor udptable parameter that is passed to __udp*_lib_mcast_deliver(), otherwise udplite broadcast/multicast use the wrong table and it breaks. Fixes: 2dc41cff7545 ("udp: Use hash2 for long hash1 chains in __udp*_lib_mcast_deliver.") Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-26udp: fix IP_CHECKSUM handlingEric Dumazet
First bug was added in commit ad6f939ab193 ("ip: Add offset parameter to ip_cmsg_recv") : Tom missed that ipv4 udp messages could be received on AF_INET6 socket. ip_cmsg_recv(msg, skb) should have been replaced by ip_cmsg_recv_offset(msg, skb, sizeof(struct udphdr)); Then commit e6afc8ace6dd ("udp: remove headers from UDP packets before queueing") forgot to adjust the offsets now UDP headers are pulled before skb are put in receive queue. Fixes: ad6f939ab193 ("ip: Add offset parameter to ip_cmsg_recv") Fixes: e6afc8ace6dd ("udp: remove headers from UDP packets before queueing") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Sam Kumar <samanthakumar@google.com> Cc: Willem de Bruijn <willemb@google.com> Tested-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-10-20udp: must lock the socket in udp_disconnect()Eric Dumazet
Baozeng Ding reported KASAN traces showing uses after free in udp_lib_get_port() and other related UDP functions. A CONFIG_DEBUG_PAGEALLOC=y kernel would eventually crash. I could write a reproducer with two threads doing : static int sock_fd; static void *thr1(void *arg) { for (;;) { connect(sock_fd, (const struct sockaddr *)arg, sizeof(struct sockaddr_in)); } } static void *thr2(void *arg) { struct sockaddr_in unspec; for (;;) { memset(&unspec, 0, sizeof(unspec)); connect(sock_fd, (const struct sockaddr *)&unspec, sizeof(unspec)); } } Problem is that udp_disconnect() could run without holding socket lock, and this was causing list corruptions. Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Baozeng Ding <sploving1@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-09-10net: ipv4: Remove l3mdev_get_saddrDavid Ahern
No longer needed Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-08-30Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller
All three conflicts were cases of simple overlapping changes. Signed-off-by: David S. Miller <davem@davemloft.net>
2016-08-23udp: get rid of sk_prot_clear_portaddr_nulls()Eric Dumazet
Since we no longer use SLAB_DESTROY_BY_RCU for UDP, we do not need sk_prot_clear_portaddr_nulls() helper. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-08-23net: diag: support SOCK_DESTROY for UDP socketsDavid Ahern
This implements SOCK_DESTROY for UDP sockets similar to what was done for TCP with commit c1e64e298b8ca ("net: diag: Support destroying TCP sockets.") A process with a UDP socket targeted for destroy is awakened and recvmsg fails with ECONNABORTED. Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-08-23udp: get rid of SLAB_DESTROY_BY_RCU allocationsEric Dumazet
After commit ca065d0cf80f ("udp: no longer use SLAB_DESTROY_BY_RCU") we do not need this special allocation mode anymore, even if it is harmless. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-08-23udp: fix poll() issue with zero sized packetsEric Dumazet
Laura tracked poll() [and friends] regression caused by commit e6afc8ace6dd ("udp: remove headers from UDP packets before queueing") udp_poll() needs to know if there is a valid packet in receive queue, even if its payload length is 0. Change first_packet_length() to return an signed int, and use -1 as the indication of an empty queue. Fixes: e6afc8ace6dd ("udp: remove headers from UDP packets before queueing") Reported-by: Laura Abbott <labbott@redhat.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Tested-by: Laura Abbott <labbott@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-08-19udp: include addrconf.hEric Dumazet
Include ipv4_rcv_saddr_equal() definition to avoid this sparse error : net/ipv4/udp.c:362:5: warning: symbol 'ipv4_rcv_saddr_equal' was not declared. Should it be static? Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-07-25udp: use sk_filter_trim_cap for udp{,6}_queue_rcv_skbDaniel Borkmann
After a612769774a3 ("udp: prevent bugcheck if filter truncates packet too much"), there followed various other fixes for similar cases such as f4979fcea7fd ("rose: limit sk_filter trim to payload"). Latter introduced a new helper sk_filter_trim_cap(), where we can pass the trim limit directly to the socket filter handling. Make use of it here as well with sizeof(struct udphdr) as lower cap limit and drop the extra skb->len test in UDP's input path. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Cc: Willem de Bruijn <willemb@google.com> Acked-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-07-11udp: prevent bugcheck if filter truncates packet too muchMichal Kubeček
If socket filter truncates an udp packet below the length of UDP header in udpv6_queue_rcv_skb() or udp_queue_rcv_skb(), it will trigger a BUG_ON in skb_pull_rcsum(). This BUG_ON (and therefore a system crash if kernel is configured that way) can be easily enforced by an unprivileged user which was reported as CVE-2016-6162. For a reproducer, see http://seclists.org/oss-sec/2016/q3/8 Fixes: e6afc8ace6dd ("udp: remove headers from UDP packets before queueing") Reported-by: Marco Grassi <marco.gra@gmail.com> Signed-off-by: Michal Kubecek <mkubecek@suse.cz> Acked-by: Eric Dumazet <edumazet@google.com> Acked-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-14udp reuseport: fix packet of same flow hashed to different socketSu, Xuemin
There is a corner case in which udp packets belonging to a same flow are hashed to different socket when hslot->count changes from 10 to 11: 1) When hslot->count <= 10, __udp_lib_lookup() searches udp_table->hash, and always passes 'daddr' to udp_ehashfn(). 2) When hslot->count > 10, __udp_lib_lookup() searches udp_table->hash2, but may pass 'INADDR_ANY' to udp_ehashfn() if the sockets are bound to INADDR_ANY instead of some specific addr. That means when hslot->count changes from 10 to 11, the hash calculated by udp_ehashfn() is also changed, and the udp packets belonging to a same flow will be hashed to different socket. This is easily reproduced: 1) Create 10 udp sockets and bind all of them to 0.0.0.0:40000. 2) From the same host send udp packets to 127.0.0.1:40000, record the socket index which receives the packets. 3) Create 1 more udp socket and bind it to 0.0.0.0:44096. The number 44096 is 40000 + UDP_HASH_SIZE(4096), this makes the new socket put into the same hslot as the aformentioned 10 sockets, and makes the hslot->count change from 10 to 11. 4) From the same host send udp packets to 127.0.0.1:40000, and the socket index which receives the packets will be different from the one received in step 2. This should not happen as the socket bound to 0.0.0.0:44096 should not change the behavior of the sockets bound to 0.0.0.0:40000. It's the same case for IPv6, and this patch also fixes that. Signed-off-by: Su, Xuemin <suxm@chinanetcenter.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-14ipv4: fix checksum annotation in udp4_csum_initHannes Frederic Sowa
Reported-by: Cong Wang <xiyou.wangcong@gmail.com> Cc: Cong Wang <xiyou.wangcong@gmail.com> Cc: Tom Herbert <tom@herbertland.com> Fixes: 4068579e1e098fa ("net: Implmement RFC 6936 (zero RX csums for UDP/IPv6") Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-06-02Possible problem with e6afc8ac ("udp: remove headers from UDP packets before ↵Eric Dumazet
queueing") Paul Moore tracked a regression caused by a recent commit, which mistakenly assumed that sk_filter() could be avoided if socket had no current BPF filter. The intent was to avoid udp_lib_checksum_complete() overhead. But sk_filter() also checks skb_pfmemalloc() and security_sock_rcv_skb(), so better call it. Fixes: e6afc8ace6dd ("udp: remove headers from UDP packets before queueing") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Paul Moore <paul@paul-moore.com> Tested-by: Paul Moore <paul@paul-moore.com> Tested-by: Stephen Smalley <sds@tycho.nsa.gov> Cc: samanthakumar <samanthakumar@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-20udp: prevent skbs lingering in tunnel socket queuesHannes Frederic Sowa
In case we find a socket with encapsulation enabled we should call the encap_recv function even if just a udp header without payload is available. The callbacks are responsible for correctly verifying and dropping the packets. Also, in case the header validation fails for geneve and vxlan we shouldn't put the skb back into the socket queue, no one will pick them up there. Instead we can simply discard them in the respective encap_recv functions. Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-13udp: Resolve NULL pointer dereference over flow-based vxlan deviceAlexander Duyck
While testing an OpenStack configuration using VXLANs I saw the following call trace: RIP: 0010:[<ffffffff815fad49>] udp4_lib_lookup_skb+0x49/0x80 RSP: 0018:ffff88103867bc50 EFLAGS: 00010286 RAX: ffff88103269bf00 RBX: ffff88103269bf00 RCX: 00000000ffffffff RDX: 0000000000004300 RSI: 0000000000000000 RDI: ffff880f2932e780 RBP: ffff88103867bc60 R08: 0000000000000000 R09: 000000009001a8c0 R10: 0000000000004400 R11: ffffffff81333a58 R12: ffff880f2932e794 R13: 0000000000000014 R14: 0000000000000014 R15: ffffe8efbfd89ca0 FS: 0000000000000000(0000) GS:ffff88103fd80000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000000488 CR3: 0000000001c06000 CR4: 00000000001426e0 Stack: ffffffff81576515 ffffffff815733c0 ffff88103867bc98 ffffffff815fcc17 ffff88103269bf00 ffffe8efbfd89ca0 0000000000000014 0000000000000080 ffffe8efbfd89ca0 ffff88103867bcc8 ffffffff815fcf8b ffff880f2932e794 Call Trace: [<ffffffff81576515>] ? skb_checksum+0x35/0x50 [<ffffffff815733c0>] ? skb_push+0x40/0x40 [<ffffffff815fcc17>] udp_gro_receive+0x57/0x130 [<ffffffff815fcf8b>] udp4_gro_receive+0x10b/0x2c0 [<ffffffff81605863>] inet_gro_receive+0x1d3/0x270 [<ffffffff81589e59>] dev_gro_receive+0x269/0x3b0 [<ffffffff8158a1b8>] napi_gro_receive+0x38/0x120 [<ffffffffa0871297>] gro_cell_poll+0x57/0x80 [vxlan] [<ffffffff815899d0>] net_rx_action+0x160/0x380 [<ffffffff816965c7>] __do_softirq+0xd7/0x2c5 [<ffffffff8107d969>] run_ksoftirqd+0x29/0x50 [<ffffffff8109a50f>] smpboot_thread_fn+0x10f/0x160 [<ffffffff8109a400>] ? sort_range+0x30/0x30 [<ffffffff81096da8>] kthread+0xd8/0xf0 [<ffffffff81693c82>] ret_from_fork+0x22/0x40 [<ffffffff81096cd0>] ? kthread_park+0x60/0x60 The following trace is seen when receiving a DHCP request over a flow-based VXLAN tunnel. I believe this is caused by the metadata dst having a NULL dev value and as a result dev_net(dev) is causing a NULL pointer dereference. To resolve this I am replacing the check for skb_dst(skb)->dev with just skb->dev. This makes sense as the callers of this function are usually in the receive path and as such skb->dev should always be populated. In addition other functions in the area where these are called are already using dev_net(skb->dev) to determine the namespace the UDP packet belongs in. Fixes: 63058308cd55 ("udp: Add udp6_lib_lookup_skb and udp4_lib_lookup_skb") Signed-off-by: Alexander Duyck <aduyck@mirantis.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-05-02udp: prepare for non BH masking at backlog processingEric Dumazet
UDP uses the generic socket backlog code, and this will soon be changed to not disable BH when protocol is called back. We need to use appropriate SNMP accessors. Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-27net: udp: rename UDP_INC_STATS_BH()Eric Dumazet
Rename UDP_INC_STATS_BH() to __UDP_INC_STATS(), and UDP6_INC_STATS_BH() to __UDP6_INC_STATS() Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-27net: rename ICMP_INC_STATS_BH()Eric Dumazet
Rename ICMP_INC_STATS_BH() to __ICMP_INC_STATS() Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-27net: snmp: kill various STATS_USER() helpersEric Dumazet
In the old days (before linux-3.0), SNMP counters were duplicated, one for user context, and one for BH context. After commit 8f0ea0fe3a03 ("snmp: reduce percpu needs by 50%") we have a single copy, and what really matters is preemption being enabled or disabled, since we use this_cpu_inc() or __this_cpu_inc() respectively. We therefore kill SNMP_INC_STATS_USER(), SNMP_ADD_STATS_USER(), NET_INC_STATS_USER(), NET_ADD_STATS_USER(), SCTP_INC_STATS_USER(), SNMP_INC_STATS64_USER(), SNMP_ADD_STATS64_USER(), TCP_ADD_STATS_USER(), UDP_INC_STATS_USER(), UDP6_INC_STATS_USER(), and XFRM_INC_STATS_USER() Following patches will rename __BH helpers to make clear their usage is not tied to BH being disabled. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-23Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller
Conflicts were two cases of simple overlapping changes, nothing serious. In the UDP case, we need to add a hlist_add_tail_rcu() to linux/rculist.h, because we've moved UDP socket handling away from using nulls lists. Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-18udp: fix if statement in SIOCINQ ioctlDan Carpenter
We deleted a line of code and accidentally made the "return put_user()" part of the if statement when it's supposed to be unconditional. Fixes: 9f9a45beaa96 ('udp: do not expect udp headers on ioctl SIOCINQ') Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Acked-by: Eric Dumazet <edumazet@google.com> Acked-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-14soreuseport: fix ordering for mixed v4/v6 socketsCraig Gallek
With the SO_REUSEPORT socket option, it is possible to create sockets in the AF_INET and AF_INET6 domains which are bound to the same IPv4 address. This is only possible with SO_REUSEPORT and when not using IPV6_V6ONLY on the AF_INET6 sockets. Prior to the commits referenced below, an incoming IPv4 packet would always be routed to a socket of type AF_INET when this mixed-mode was used. After those changes, the same packet would be routed to the most recently bound socket (if this happened to be an AF_INET6 socket, it would have an IPv4 mapped IPv6 address). The change in behavior occurred because the recent SO_REUSEPORT optimizations short-circuit the socket scoring logic as soon as they find a match. They did not take into account the scoring logic that favors AF_INET sockets over AF_INET6 sockets in the event of a tie. To fix this problem, this patch changes the insertion order of AF_INET and AF_INET6 addresses in the TCP and UDP socket lists when the sockets have SO_REUSEPORT set. AF_INET sockets will be inserted at the head of the list and AF_INET6 sockets with SO_REUSEPORT set will always be inserted at the tail of the list. This will force AF_INET sockets to always be considered first. Fixes: e32ea7e74727 ("soreuseport: fast reuseport UDP socket selection") Fixes: 125e80b88687 ("soreuseport: fast reuseport TCP socket selection") Reported-by: Maciej Żenczykowski <maze@google.com> Signed-off-by: Craig Gallek <kraig@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-13udp: do not expect udp headers in recv cmsg IP_CMSG_CHECKSUMWillem de Bruijn
On udp sockets, recv cmsg IP_CMSG_CHECKSUM returns a checksum over the packet payload. Since commit e6afc8ace6dd pulled the headers, taking skb->data as the start of transport header is incorrect. Use the transport header pointer. Also, when peeking at an offset from the start of the packet, only return a checksum from the start of the peeked data. Note that the cmsg does not subtract a tail checkum when reading truncated data. Fixes: e6afc8ace6dd ("udp: remove headers from UDP packets before queueing") Signed-off-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-13udp: do not expect udp headers on ioctl SIOCINQWillem de Bruijn
On udp sockets, ioctl SIOCINQ returns the payload size of the first packet. Since commit e6afc8ace6dd pulled the headers, the result is incorrect when subtracting header length. Remove that operation. Fixes: e6afc8ace6dd ("udp: remove headers from UDP packets before queueing") Signed-off-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-07udp: Add udp6_lib_lookup_skb and udp4_lib_lookup_skbTom Herbert
Add externally visible functions to lookup a UDP socket by skb. This will be used for GRO in UDP sockets. These functions also check if skb->dst is set, and if it is not skb->dev is used to get dev_net. This allows calling lookup functions before dst has been set on the skbuff. Signed-off-by: Tom Herbert <tom@herbertland.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-05udp: enable MSG_PEEK at non-zero offsetsamanthakumar
Enable peeking at UDP datagrams at the offset specified with socket option SOL_SOCKET/SO_PEEK_OFF. Peek at any datagram in the queue, up to the end of the given datagram. Implement the SO_PEEK_OFF semantics introduced in commit ef64a54f6e55 ("sock: Introduce the SO_PEEK_OFF sock option"). Increase the offset on peek, decrease it on regular reads. When peeking, always checksum the packet immediately, to avoid recomputation on subsequent peeks and final read. The socket lock is not held for the duration of udp_recvmsg, so peek and read operations can run concurrently. Only the last store to sk_peek_off is preserved. Signed-off-by: Sam Kumar <samanthakumar@google.com> Signed-off-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-05udp: remove headers from UDP packets before queueingsamanthakumar
Remove UDP transport headers before queueing packets for reception. This change simplifies a follow-up patch to add MSG_PEEK support. Signed-off-by: Sam Kumar <samanthakumar@google.com> Signed-off-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-04udp: no longer use SLAB_DESTROY_BY_RCUEric Dumazet
Tom Herbert would like not touching UDP socket refcnt for encapsulated traffic. For this to happen, we need to use normal RCU rules, with a grace period before freeing a socket. UDP sockets are not short lived in the high usage case, so the added cost of call_rcu() should not be a concern. This actually removes a lot of complexity in UDP stack. Multicast receives no longer need to hold a bucket spinlock. Note that ip early demux still needs to take a reference on the socket. Same remark for functions used by xt_socket and xt_PROXY netfilter modules, but this might be changed later. Performance for a single UDP socket receiving flood traffic from many RX queues/cpus. Simple udp_rx using simple recvfrom() loop : 438 kpps instead of 374 kpps : 17 % increase of the peak rate. v2: Addressed Willem de Bruijn feedback in multicast handling - keep early demux break in __udp4_lib_demux_lookup() Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Tom Herbert <tom@herbertland.com> Cc: Willem de Bruijn <willemb@google.com> Tested-by: Tom Herbert <tom@herbertland.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-04sock: enable timestamping using control messagesSoheil Hassas Yeganeh
Currently, SOL_TIMESTAMPING can only be enabled using setsockopt. This is very costly when users want to sample writes to gather tx timestamps. Add support for enabling SO_TIMESTAMPING via control messages by using tsflags added in `struct sockcm_cookie` (added in the previous patches in this series) to set the tx_flags of the last skb created in a sendmsg. With this patch, the timestamp recording bits in tx_flags of the skbuff is overridden if SO_TIMESTAMPING is passed in a cmsg. Please note that this is only effective for overriding the recording timestamps flags. Users should enable timestamp reporting (e.g., SOF_TIMESTAMPING_SOFTWARE | SOF_TIMESTAMPING_OPT_ID) using socket options and then should ask for SOF_TIMESTAMPING_TX_* using control messages per sendmsg to sample timestamps for each write. Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com> Acked-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-04-04ipv4: process socket-level control messages in IPv4Soheil Hassas Yeganeh
Process socket-level control messages by invoking __sock_cmsg_send in ip_cmsg_send for control messages on the SOL_SOCKET layer. This makes sure whenever ip_cmsg_send is called in udp, icmp, and raw, we also process socket-level control messages. Note that this commit interprets new control messages that were ignored before. As such, this commit does not change the behavior of IPv4 control messages. Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com> Acked-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-03-22ipv4: fix broadcast packets receptionPaolo Abeni
Currently, ingress ipv4 broadcast datagrams are dropped since, in udp_v4_early_demux(), ip_check_mc_rcu() is invoked even on bcast packets. This patch addresses the issue, invoking ip_check_mc_rcu() only for mcast packets. Fixes: 6e5403093261 ("ipv4/udp: Verify multicast group is ours in upd_v4_early_demux()") Signed-off-by: Paolo Abeni <pabeni@redhat.com> Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-23Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller
Conflicts: drivers/net/phy/bcm7xxx.c drivers/net/phy/marvell.c drivers/net/vxlan.c All three conflicts were cases of simple overlapping changes. Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-13ipv4: fix memory leaks in ip_cmsg_send() callersEric Dumazet
Dmitry reported memory leaks of IP options allocated in ip_cmsg_send() when/if this function returns an error. Callers are responsible for the freeing. Many thanks to Dmitry for the report and diagnostic. Reported-by: Dmitry Vyukov <dvyukov@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-12net: udp: always set up for CHECKSUM_PARTIAL offloadEdward Cree
If the dst device doesn't support it, it'll get fixed up later anyway by validate_xmit_skb(). Also, this allows us to take advantage of LCO to avoid summing the payload multiple times. Signed-off-by: Edward Cree <ecree@solarflare.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-12net: local checksum offload for encapsulationEdward Cree
The arithmetic properties of the ones-complement checksum mean that a correctly checksummed inner packet, including its checksum, has a ones complement sum depending only on whatever value was used to initialise the checksum field before checksumming (in the case of TCP and UDP, this is the ones complement sum of the pseudo header, complemented). Consequently, if we are going to offload the inner checksum with CHECKSUM_PARTIAL, we can compute the outer checksum based only on the packed data not covered by the inner checksum, and the initial value of the inner checksum field. Signed-off-by: Edward Cree <ecree@solarflare.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-02-11soreuseport: fast reuseport TCP socket selectionCraig Gallek
This change extends the fast SO_REUSEPORT socket lookup implemented for UDP to TCP. Listener sockets with SO_REUSEPORT and the same receive address are additionally added to an array for faster random access. This means that only a single socket from the group must be found in the listener list before any socket in the group can be used to receive a packet. Previously, every socket in the group needed to be considered before handing off the incoming packet. This feature also exposes the ability to use a BPF program when selecting a socket from a reuseport group. Signed-off-by: Craig Gallek <kraig@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-01-19udp: fix potential infinite loop in SO_REUSEPORT logicEric Dumazet
Using a combination of connected and un-connected sockets, Dmitry was able to trigger soft lockups with his fuzzer. The problem is that sockets in the SO_REUSEPORT array might have different scores. Right after sk2=socket(), setsockopt(sk2,...,SO_REUSEPORT, on) and bind(sk2, ...), but _before_ the connect(sk2) is done, sk2 is added into the soreuseport array, with a score which is smaller than the score of first socket sk1 found in hash table (I am speaking of the regular UDP hash table), if sk1 had the connect() done, giving a +8 to its score. hash bucket [X] -> sk1 -> sk2 -> NULL sk1 score = 14 (because it did a connect()) sk2 score = 6 SO_REUSEPORT fast selection is an optimization. If it turns out the score of the selected socket does not match score of first socket, just fallback to old SO_REUSEPORT logic instead of trying to be too smart. Normal SO_REUSEPORT users do not mix different kind of sockets, as this mechanism is used for load balance traffic. Fixes: e32ea7e74727 ("soreuseport: fast reuseport UDP socket selection") Reported-by: Dmitry Vyukov <dvyukov@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Craig Gallek <kraigatgoog@gmail.com> Acked-by: Craig Gallek <kraig@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-01-06Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller
2016-01-06soreuseport: pass skb to secondary UDP socket lookupCraig Gallek
This socket-lookup path did not pass along the skb in question in my original BPF-based socket selection patch. The skb in the udpN_lib_lookup2 path can be used for BPF-based socket selection just like it is in the 'traditional' udpN_lib_lookup path. udpN_lib_lookup2 kicks in when there are greater than 10 sockets in the same hlist slot. Coincidentally, I chose 10 sockets per reuseport group in my functional test, so the lookup2 path was not excersised. This adds an additional set of tests with 20 sockets. Fixes: 538950a1b752 ("soreuseport: setsockopt SO_ATTACH_REUSEPORT_[CE]BPF") Fixes: 3ca8e4029969 ("soreuseport: BPF selection functional test") Suggested-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: Craig Gallek <kraig@google.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-01-04net: Propagate lookup failure in l3mdev_get_saddr to callerDavid Ahern
Commands run in a vrf context are not failing as expected on a route lookup: root@kenny:~# ip ro ls table vrf-red unreachable default root@kenny:~# ping -I vrf-red -c1 -w1 10.100.1.254 ping: Warning: source address might be selected on device other than vrf-red. PING 10.100.1.254 (10.100.1.254) from 0.0.0.0 vrf-red: 56(84) bytes of data. --- 10.100.1.254 ping statistics --- 2 packets transmitted, 0 received, 100% packet loss, time 999ms Since the vrf table does not have a route for 10.100.1.254 the ping should have failed. The saddr lookup causes a full VRF table lookup. Propogating a lookup failure to the user allows the command to fail as expected: root@kenny:~# ping -I vrf-red -c1 -w1 10.100.1.254 connect: No route to host Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-01-04soreuseport: setsockopt SO_ATTACH_REUSEPORT_[CE]BPFCraig Gallek
Expose socket options for setting a classic or extended BPF program for use when selecting sockets in an SO_REUSEPORT group. These options can be used on the first socket to belong to a group before bind or on any socket in the group after bind. This change includes refactoring of the existing sk_filter code to allow reuse of the existing BPF filter validation checks. Signed-off-by: Craig Gallek <kraig@google.com> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2016-01-04soreuseport: fast reuseport UDP socket selectionCraig Gallek
Include a struct sock_reuseport instance when a UDP socket binds to a specific address for the first time with the reuseport flag set. When selecting a socket for an incoming UDP packet, use the information available in sock_reuseport if present. This required adding an additional field to the UDP source address equality function to differentiate between exact and wildcard matches. The original use case allowed wildcard matches when checking for existing port uses during bind. The new use case of adding a socket to a reuseport group requires exact address matching. Performance test (using a machine with 2 CPU sockets and a total of 48 cores): Create reuseport groups of varying size. Use one socket from this group per user thread (pinning each thread to a different core) calling recvmmsg in a tight loop. Record number of messages received per second while saturating a 10G link. 10 sockets: 18% increase (~2.8M -> 3.3M pkts/s) 20 sockets: 14% increase (~2.9M -> 3.3M pkts/s) 40 sockets: 13% increase (~3.0M -> 3.4M pkts/s) This work is based off a similar implementation written by Ying Cai <ycai@google.com> for implementing policy-based reuseport selection. Signed-off-by: Craig Gallek <kraig@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>